Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ImprovementHandlingOnConnectionReset #32936

Conversation

xinlian12
Copy link
Member

@xinlian12 xinlian12 commented Jan 12, 2023

Related issue: #32861. Part of the upgrade resiliency feature work: #28266

Issue:
Replica health status is connected when there is 0 connections. Two possibilities:

Due to the idleEndpoint config, the connection has been closed
Currently, for an idle connection, SDK only mark replica as unhealthy if the connection has been closed gracefully (FIN signal), if the connection has been reset then SDK ignores it.

  • Diagnostics:
    image
  • Local repro:
    Using TcpKill to kill one of the established connection when the connection is in idle status (no requests in progress), the above diagnostics has been observed
    The exception captured in connectionStateListener is:
2023-01-12` 19:25:35,718       [cosmos-rntbd-epoll-2-2] INFO  com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdConnectionStateListener - Will not raise the connection state change event for error
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

Proposal:
ConnectionStateListener will mark the replica as unhealthy on IOException instead of limiting to only ClosedChannelException

Test:
Using TcpKill to kill one of the established connection when the connection is in idle status(no request in progress). Validated that the replica status has changed into unhealthy.
image

Motivation for originally limiting to only FIN
Ignoring RST was decided when we would still remove replica form metadata cache when retrieving the close signal. At the time the concern was that random RST (could come form any intermediary TCP hop) would cause unnecessary removal of replica form cache even when backend might not have been the source of the RST. With AsycnNonBlockingCache we don't remove the replica form cache until we have new data - so, the concern above is void.

// which will translate into ClosedChannelException which does not mean server is in unhealthy status.
// But it makes sense to make the server as unhealthy as it is safer to validate the server health again for future requests
for (Uri addressUri : this.addressUris) {
addressUri.setUnhealthy();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of Unhealthy, RevalidationNeeded is a better one to use for status

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

@xinlian12 xinlian12 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update the name in a different PR - as it is internal implementation details, also the naming discussion may take some time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not adding new state into the overall state transition right?

Copy link
Member Author

@xinlian12 xinlian12 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, I am more thinking just name changing: change from Unhealthy -> RevalidationNeeded. But keep the differentiation can be useful in the future, will present two ways and discuss with team in a different PR.

@azure-sdk
Copy link
Collaborator

API change check

API changes are not detected in this pull request.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

@xinlian12
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -7,6 +7,7 @@
#### Breaking Changes

#### Bugs Fixed
* Added improvement in `RntbdConnectionStateListener` to better handling scenarios when connection is closed unexpectedly - See [PR 32936](https://github.com/Azure/azure-sdk-for-java/pull/32936)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: All connection closures will result in unhealthy status?

Copy link
Member Author

@xinlian12 xinlian12 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, what about Added improvement in handling for idle connection being closed unexpectedly - trying to avoid mention the replica status as it is implementation details and thinking about changing the name

@xinlian12
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Member Author

Tested the following tests locally and succeeded:

clientTelemetryWithStageJunoEndpoint
createRecoversFrom410GoneFromServiceOnPartitionSplitDuringIdleTime

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @xinlian12

@xinlian12
Copy link
Member Author

/check-enforcer override

@xinlian12 xinlian12 merged commit cafe43a into Azure:main Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants