ImprovementHandlingOnConnectionReset #32936

xinlian12 · 2023-01-12T20:33:57Z

Related issue: #32861. Part of the upgrade resiliency feature work: #28266

Issue:
Replica health status is reported as connected even when there are 0 connections. Two possibilities:

Due to the idleEndpoint config, the connection has been closed.
Currently, for an idle connection, the SDK only marks the replica as unhealthy if the connection was closed gracefully (FIN signal); if the connection was reset (RST), the SDK ignores it.

Diagnostics:
Local repro:
Using TcpKill to kill one of the established connections while the connection is idle (no requests in progress), the above diagnostics were observed.
The exception captured in connectionStateListener is:

2023-01-12 19:25:35,718       [cosmos-rntbd-epoll-2-2] INFO  com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdConnectionStateListener - Will not raise the connection state change event for error
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

Proposal:
ConnectionStateListener will mark the replica as unhealthy on any IOException, instead of only on ClosedChannelException.
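
A minimal sketch of the difference, for illustration only. The class and method names below are hypothetical and this is not the actual RntbdConnectionStateListener code. It also assumes `java.net.SocketException` as a stand-in for Netty's `NativeIoException` (both are `IOException` subclasses) so the example runs without Netty on the classpath.

```java
import java.io.IOException;
import java.net.SocketException;
import java.nio.channels.ClosedChannelException;

// Hypothetical sketch; names are illustrative, not taken from the SDK.
public class ConnectionStateSketch {

    // Behavior before the proposal: only a graceful close,
    // surfaced as ClosedChannelException, marks the replica unhealthy.
    public static boolean marksUnhealthyBefore(Throwable cause) {
        return cause instanceof ClosedChannelException;
    }

    // Proposed behavior: any IOException marks the replica unhealthy.
    // ClosedChannelException is itself an IOException, so it still qualifies.
    public static boolean marksUnhealthyAfter(Throwable cause) {
        return cause instanceof IOException;
    }

    public static void main(String[] args) {
        Throwable gracefulClose = new ClosedChannelException();
        // Assumed stand-in for "Connection reset by peer" raised by Netty.
        Throwable reset = new SocketException("Connection reset by peer");
        Throwable unrelated = new IllegalStateException("not an I/O failure");

        print("graceful close", gracefulClose);
        print("connection reset", reset);
        print("unrelated error", unrelated);
    }

    private static void print(String label, Throwable cause) {
        System.out.println(label
            + ": before=" + marksUnhealthyBefore(cause)
            + ", after=" + marksUnhealthyAfter(cause));
    }
}
```

In this sketch the reset case is the only one whose outcome changes: it is ignored by the old check and caught by the new one, while non-I/O errors are ignored by both.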

Test:
Used TcpKill to kill one of the established connections while the connection was idle (no requests in progress). Validated that the replica status changed to unhealthy.

Motivation for originally limiting to only FIN
Ignoring RST was decided back when we would still remove the replica from the metadata cache on receiving the close signal. At the time, the concern was that a random RST (which could come from any intermediary TCP hop) would cause unnecessary removal of the replica from the cache even when the backend might not have been the source of the RST. With AsyncNonBlockingCache we don't remove the replica from the cache until we have new data, so the concern above no longer applies.

xinlian12 · 2023-01-12T20:39:30Z

...a/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConnectionStateListener.java

+            // which will translate into ClosedChannelException which does not mean server is in unhealthy status.
+            // But it makes sense to make the server as unhealthy as it is safer to validate the server health again for future requests
+            for (Uri addressUri : this.addressUris) {
+                addressUri.setUnhealthy();


maybe instead of Unhealthy, RevalidationNeeded is a better one to use for status

will update the name in a different PR - as it is internal implementation details, also the naming discussion may take some time

It's not adding new state into the overall state transition right?

not really, I am thinking more of just a name change: Unhealthy -> RevalidationNeeded. But keeping the differentiation can be useful in the future, so I will present both ways and discuss with the team in a different PR.

azure-sdk · 2023-01-12T20:43:54Z

API change check

API changes are not detected in this pull request.

FabianMeiswinkel

Thanks !

xinlian12 · 2023-01-12T21:53:16Z

/azp run java - cosmos - tests

azure-pipelines · 2023-01-12T21:53:29Z

Azure Pipelines successfully started running 1 pipeline(s).

kirankumarkolli · 2023-01-12T23:10:29Z

sdk/cosmos/azure-cosmos/CHANGELOG.md

@@ -7,6 +7,7 @@
 #### Breaking Changes

 #### Bugs Fixed
+* Added improvement in `RntbdConnectionStateListener` to better handling scenarios when connection is closed unexpectedly - See [PR 32936](https://github.com/Azure/azure-sdk-for-java/pull/32936)


How about: All connection closures will result in unhealthy status?

hmm, what about "Added improvement in handling for idle connection being closed unexpectedly"? Trying to avoid mentioning the replica status, as it is an implementation detail and we are thinking about changing the name.

xinlian12 · 2023-01-12T23:35:48Z

/azp run java - cosmos - tests

azure-pipelines · 2023-01-12T23:36:02Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2023-01-12T23:36:50Z

/azp run java - cosmos - tests

azure-pipelines · 2023-01-12T23:36:59Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2023-01-13T02:50:05Z

Tested the following tests locally and succeeded:

clientTelemetryWithStageJunoEndpoint
createRecoversFrom410GoneFromServiceOnPartitionSplitDuringIdleTime

kushagraThapar

LGTM, thanks @xinlian12

xinlian12 · 2023-01-13T06:43:45Z

/check-enforcer override

improvement on Connection reset by peer

e4e1eba

xinlian12 requested review from kushagraThapar, FabianMeiswinkel, kirankumarkolli, milismsft, aayush3011, simorenoh, jeet1995 and Pilchie as code owners January 12, 2023 20:33

ghost added the Cosmos label Jan 12, 2023

FabianMeiswinkel reviewed Jan 12, 2023

FabianMeiswinkel approved these changes Jan 12, 2023

annie-mac added 2 commits January 12, 2023 13:23

fix tests

a8b6f8f

update changelog

3a25f38

kirankumarkolli reviewed Jan 12, 2023

kirankumarkolli approved these changes Jan 12, 2023

update changelog

e13f2d9

kushagraThapar approved these changes Jan 13, 2023

xinlian12 merged commit cafe43a into Azure:main Jan 13, 2023
