ReplicaValidation #29767
API change check: API changes are not detected in this pull request.
LGTM Annie - great work - well structured implementation and PR description. Really appreciated!
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Tested the following tests locally and passed:
/check-enforcer override
* replicaValidationBeforeUse
Co-authored-by: annie-mac <annie-mac@annie-macs-MBP.home>
Co-authored-by: annie-mac <annie-mac@annie-macs-MacBook-Pro.local>
Task:
#28271
#25351
#27245
# Description
A customer has reported increased request latency during upgrade scenarios.
One reason is that, during an upgrade, a replica that is still being upgraded may still be returned to the SDK. As of today, a request has a 25% chance of hitting a replica that is not ready yet, which causes a ConnectionTimeoutException and contributes to the increased latency.
Solution:
Set COSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED to true.
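A minimal sketch of opting in from application code, assuming the SDK reads the flag as a JVM system property and that it is set before the Cosmos client is built. Only the property name comes from this PR; the class and helper names below are hypothetical, and the `false` fallback is the helper's own default, not a statement about the SDK.

```java
// Hypothetical helper class; only the property name is taken from the PR description.
final class ReplicaValidationFlag {
    static final String PROPERTY = "COSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED";

    // Opt in programmatically. Assumption: this runs before the Cosmos client is created.
    static void enable() {
        System.setProperty(PROPERTY, "true");
    }

    // Reads the property back; falls back to false when it is unset (helper default only).
    static boolean isEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "false"));
    }

    public static void main(String[] args) {
        enable();
        System.out.println(PROPERTY + " = " + isEnabled());
    }
}
```

Under the same assumption, the flag could instead be passed on the command line as `-DCOSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED=true`.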
Replica status machine:
If replica validation is disabled, how the status transition happens:
If replica validation is enabled, how the status transition happens:
CosmosDiagnostics change:
- replicaStatusList: captures the client-side replica status at the time the SDK selects a replica to send the request to.
- rv in clientCfgs
Other Q&A:
1. The behavior of regional failover
COSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED turns on replica validation for all regions that requests are sent to. Unlike openConnectionsAndInitCaches, it does not proactively create connections to all partitions; it only tries to open connections to replicas that were marked as unhealthy, as requests come in.
2. How unhealthyPending transitions to connected for higher consistency levels
It can be triggered by openConnectionRequest or RxDocumentServiceRequest.
As mentioned above, the selection of the replica will not be blocked by the validation process.
3. 408/503 handling
We are starting with internal 410 and forceRefresh.
4. What if the connection is closed due to the idleConnectionTimeout configuration?
The recommendation for customers is to set idleConnectionTimeout to 0 when using this feature.
This PR does not treat that case differently.
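The non-blocking behavior described in Q&A items 1 and 2 can be illustrated with a small sketch. This is not the SDK's implementation: the `Status` enum, `Replica`, `Selection`, and `select` are hypothetical names, and preferring connected replicas in the ordering is an assumption made for illustration. The sketch returns every replica as a candidate immediately, and separately lists the replicas marked unhealthy so a caller could validate them in the background.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: all type and method names here are hypothetical.
final class ReplicaSelectionSketch {
    // Status names follow the wording of the PR description.
    enum Status { CONNECTED, UNHEALTHY_PENDING, UNHEALTHY }

    static final class Replica {
        final String address;
        final Status status;
        Replica(String address, Status status) {
            this.address = address;
            this.status = status;
        }
    }

    static final class Selection {
        final List<Replica> ordered;     // candidates for the current request
        final List<Replica> toValidate;  // replicas to re-check in the background
        Selection(List<Replica> ordered, List<Replica> toValidate) {
            this.ordered = ordered;
            this.toValidate = toValidate;
        }
    }

    // Selection never waits on validation: it only reorders and collects work.
    static Selection select(List<Replica> replicas) {
        List<Replica> ordered = new ArrayList<>(replicas);
        // Stable sort: connected replicas first (assumption), others keep their relative order.
        ordered.sort(Comparator.comparing((Replica r) -> r.status != Status.CONNECTED));
        List<Replica> toValidate = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.status == Status.UNHEALTHY) {
                toValidate.add(r);
            }
        }
        return new Selection(ordered, toValidate);
    }

    public static void main(String[] args) {
        Selection s = select(List.of(
            new Replica("replica-0", Status.UNHEALTHY),
            new Replica("replica-1", Status.CONNECTED),
            new Replica("replica-2", Status.UNHEALTHY_PENDING),
            new Replica("replica-3", Status.CONNECTED)));
        for (Replica r : s.ordered) {
            System.out.println(r.address + " " + r.status);
        }
        System.out.println("to validate in background: " + s.toValidate.size());
    }
}
```

Keeping the validation list separate from the ordered candidates is what lets the request proceed without waiting for any connection attempt to finish.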
Benchmark Tests:
RU: 100,000, Document Count: 100,000, Region: South Central US
Charts (images not preserved): Throughput, P95 latency, P99 latency, P999 latency.
Benchmark Tests:
RU: 6000, Document: 1000, Region: South Central US
Charts (images not preserved): Throughput, P95 latency, P99 latency, P999 latency.
Test22 upgrade test results:
CTL RunId: 789635e2-5219-4b9d-860f-e6f4bd3e4259
Pre-created documents: 10000, concurrency 10. Orange line: ReplicaValidation + #28270. Blue line: Master
Charts (images not preserved): P95 latency, P99 latency, Max latency.
Following PR: