-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ImprovementHandlingOnConnectionReset #32936
ImprovementHandlingOnConnectionReset #32936
Conversation
// which will translate into ClosedChannelException which does not mean server is in unhealthy status. | ||
// But it makes sense to make the server as unhealthy as it is safer to validate the server health again for future requests | ||
for (Uri addressUri : this.addressUris) { | ||
addressUri.setUnhealthy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe instead of Unhealthy
, RevalidationNeeded
is a better one to use for status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update the name in a different PR - as it is internal implementation details, also the naming discussion may take some time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not adding new state into the overall state transition right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, I am more thinking just name changing: change from Unhealthy
-> RevalidationNeeded
. But keep the differentiation can be useful in the future, will present two ways and discuss with team in a different PR.
...a/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConnectionStateListener.java
Show resolved
Hide resolved
API change check API changes are not detected in this pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
sdk/cosmos/azure-cosmos/CHANGELOG.md
Outdated
@@ -7,6 +7,7 @@ | |||
#### Breaking Changes | |||
|
|||
#### Bugs Fixed | |||
* Added improvement in `RntbdConnectionStateListener` to better handling scenarios when connection is closed unexpectedly - See [PR 32936](https://github.com/Azure/azure-sdk-for-java/pull/32936) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about: All connection closures will result in unhealthy status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, what about Added improvement in handling for idle connection being closed unexpectedly
- trying to avoid mention the replica status as it is implementation details and thinking about changing the name
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
Tested the following tests locally and succeeded: clientTelemetryWithStageJunoEndpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @xinlian12
/check-enforcer override |
Related issue: #32861. Part of the upgrade resiliency feature work: #28266
Issue:
Replica health status is connected when there is 0 connections. Two possibilities:
Due to the idleEndpoint config, the connection has been closed
Currently, for an idle connection, SDK only mark replica as unhealthy if the connection has been closed gracefully (FIN signal), if the connection has been reset then SDK ignores it.
Using TcpKill to kill one of the established connection when the connection is in idle status (no requests in progress), the above diagnostics has been observed
The exception captured in connectionStateListener is:
Proposal:
ConnectionStateListener will mark the replica as unhealthy on
IOException
instead of limiting to onlyClosedChannelException
Test:
![image](https://user-images.githubusercontent.com/64233642/212188388-53f50a56-d901-42b6-adc8-5d7bcfab3ef3.png)
Using TcpKill to kill one of the established connection when the connection is in idle status(no request in progress). Validated that the replica status has changed into unhealthy.
Motivation for originally limiting to only FIN
Ignoring RST was decided when we would still remove replica form metadata cache when retrieving the close signal. At the time the concern was that random RST (could come form any intermediary TCP hop) would cause unnecessary removal of replica form cache even when backend might not have been the source of the RST. With AsycnNonBlockingCache we don't remove the replica form cache until we have new data - so, the concern above is void.