Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Channel health check improvement for cancelled requests #36225

Merged

Conversation

jeet1995
Copy link
Member

@jeet1995 jeet1995 commented Aug 2, 2023

PR features

Add channel health check gate when the channel sees request cancellations

In service-side node-freeze scenarios, connections / channels may not be severed thus causing the SDK to not detect unhealthiness on such channels, and with end-to-end timeout also configured for the request, such requests do not hit transit timeout either which could have otherwise helped deem the channel as unhealthy.

In this PR, the SDK will track request cancellations and read delays at the channel level and if these properties exceed certain values, then the channel is deemed unhealthy. If such cancellations on a channel are a recurring symptom, the replica associated with this channel is treated as Unhealthy and is removed from the list of replicas to send a request to. This can help reduce the volume of timeouts seen due to end-to-end timeout configured.

Replica validation optimization

Previously replica validation would be triggered for replicas in the Unknown health status despite the replica not being used by a container in the warm-up flow. This would create unnecessary connection attempts which could cause spikes in resource usage. Such replicas are excluded.

Local testing done

  • Verified that channel is being deemed unhealthy after 3 consecutive request cancellations on it and a read delay of more than 10s.
  • Verified that channel being set as unhealthy is resulting in the replica's health status to be Unhealthy.
  • Verified that replica being set as Unhealthy results in a background address refresh with the forceAddress flag as true.

@github-actions github-actions bot added the Cosmos label Aug 2, 2023
@azure-sdk
Copy link
Collaborator

API change check

API changes are not detected in this pull request.

@jeet1995
Copy link
Member Author

jeet1995 commented Aug 4, 2023

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jeet1995 for the optimization

@jeet1995
Copy link
Member Author

jeet1995 commented Aug 4, 2023

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

jeet1995 commented Aug 4, 2023

/check-enforcer override

@jeet1995 jeet1995 merged commit e618ba1 into Azure:main Aug 4, 2023
69 of 80 checks passed
jeet1995 added a commit that referenced this pull request Sep 13, 2023
* Attempt to reproduce Walmart issue.

* Added a cancellation gate to RntbdClientChannelHealthChecker.

* Refactorings.

* Refactorings.

* Added tests.

* Refactorings.

* Fixed replica validation to open only 1 connection to replica.

* Attempt at fixing tests.

* Attempt at fixing tests.

* Attempt at fixing tests.

* Updated CHANGELOG.md.

* Refactorings.

* Refactorings.

* Reacting to review comments.

---------

Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>

(cherry picked from commit e618ba1)
jeet1995 added a commit that referenced this pull request Sep 13, 2023
* Attempt to reproduce Walmart issue.

* Added a cancellation gate to RntbdClientChannelHealthChecker.

* Refactorings.

* Refactorings.

* Added tests.

* Refactorings.

* Fixed replica validation to open only 1 connection to replica.

* Attempt at fixing tests.

* Attempt at fixing tests.

* Attempt at fixing tests.

* Updated CHANGELOG.md.

* Refactorings.

* Refactorings.

* Reacting to review comments.

---------

Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>

(cherry picked from commit e618ba1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants