Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move hedging on top of ClientRetryPolicy #36508

Conversation

FabianMeiswinkel
Copy link
Member

@FabianMeiswinkel FabianMeiswinkel commented Aug 24, 2023

Description

This PR addresses a few issues identified with E2ETimeout and Availability strategy. The current approach to inject the hedging across regions in ReplicatedResourceClient has a few issues:

  • For an operation like readItem the ClientRetryPolicy controlling cross-regional retries sits on top of the ReplicatedResourceClient - which would lead to possibly starting the hedging multiple times (initial attempt and every time the client retry policy retries cross region)
  • The hedging in ReplicatedResourceClient uses Mono.firstWithValue and starts a Composite Mono with Monos for each region - this means if requests fail - the operator would wait for all of them to succeed/fail. In situations where errors are expected - like a legit 404/0 a 404-Conflict etc. this would mean waiting for all hedging attempts to finish before bubbling up the NotFoundExceptions - this would result in unnecessarily high latency for 404/0, 409 or 412.
  • ReplicatedResourceClient would add any Request/Response to store results in ComsosDiagnostics - which means we would end-up with one CosmosDiagnostics instances containing all request/response pairs for all hedging attempts. The initial idea was instead to add multiple CosmosDIagnostics instances to the CosmosDiagnosticsContext - one for each hedging attempt. This would simplify debugging - and also diagnostics and metrics enhancement are designed around this.

To address these challenges this PR is moving the hedging to the RxDocumentClientImpl - so, on top of ClientRetryPolicy. This PR only addresses the changes for point operations - not yet queries

The unit tests added will use a Multi-Region, Multi-Master account to test the availability strategy on a real account.

Changelog should be updated with the query PR

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

jeet1995 and others added 16 commits August 9, 2023 09:01
* Updated read hang grace and cancellation count threshold

* Updated CHANGELOG.md

* Updated CHANGELOG.md

* Updated CHANGELOG.md
* Updated release versions to 4.48.1 for azure-cosmos

* Updated release notes
…abianm/AvailabilityStrategyAllRegionExceptionFix
…6479)

* Refactoring the AvailabilityStrategyNoSuchElementFix

* Reacting to code review comments
…to users/fabianm/AvailabilityStrategyAllRegionExceptionFix
Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great changes, thanks a lot :)

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Member

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel FabianMeiswinkel merged commit 7aa44d6 into Azure:main Sep 15, 2023
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants