ReplicaValidation #29767

Merged
merged 29 commits into from
Sep 1, 2022

@xinlian12 xinlian12 commented Jul 1, 2022

Task:
#28271
#25351
#27245

# Description
Customers have reported increased request latency during upgrade scenarios.
One reason is that, during an upgrade, a replica that is still being upgraded may still be returned to the SDK. Today, a request has a 25% chance of hitting a replica that is not ready yet, which causes a ConnectionTimeoutException and contributes to the increased latency.

Solution:

  • The SDK tracks endpoint health based on client-side metrics and de-prioritizes replicas that were marked as unhealthy.
  • The SDK validates the health of a replica by attempting to open connections to the backend. When the SDK refreshes the addresses for a partition from the gateway, it validates ONLY the replicas that were in unhealthy status, by issuing open connection requests.
  • It is best effort, which means:
      1. If the validation cannot finish within 1 min by opening connections, the de-prioritization stops for certain statuses.
      2. The selection of replicas is not blocked by the validation process. For example, if a request needs to be sent to N replicas and only N-1 replicas are in good status, the SDK still goes ahead and selects the Nth replica, which needs validation.
  • It is opt-in only for now, by setting the system property COSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED to true.
  • New fields for replica status were added to the Cosmos diagnostics.
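
The opt-in flag and the "de-prioritize, never exclude" behavior described above can be illustrated with a small, self-contained sketch. Only the property name comes from this PR; the class, the `Health` names, and the preference order (for example, ranking unknown replicas between connected and unhealthy ones) are assumptions made for illustration and are not the SDK's internal types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; these are not the SDK's internal types.
final class ReplicaSelectionSketch {
    // System property named in the PR description.
    static final String VALIDATION_PROPERTY = "COSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED";

    // Assumed status names, declared from most to least preferred.
    enum Health { CONNECTED, UNKNOWN, UNHEALTHY_PENDING, UNHEALTHY }

    static final class Replica {
        final String address;
        final Health health;

        Replica(String address, Health health) {
            this.address = address;
            this.health = health;
        }
    }

    // The feature is opt-in: it is off unless the property is set to "true".
    static boolean isValidationEnabled() {
        return Boolean.parseBoolean(System.getProperty(VALIDATION_PROPERTY, "false"));
    }

    // Returns up to `needed` replicas, healthiest first when the flag is on.
    // Unhealthy replicas are moved to the back but never dropped, so a request
    // that needs N replicas still gets N even if only N-1 are healthy.
    static List<Replica> select(List<Replica> replicas, int needed) {
        List<Replica> ordered = new ArrayList<>(replicas);
        if (isValidationEnabled()) {
            // List.sort is stable, so replicas with equal health keep their order.
            ordered.sort(Comparator.comparing((Replica r) -> r.health));
        }
        return new ArrayList<>(ordered.subList(0, Math.min(needed, ordered.size())));
    }

    public static void main(String[] args) {
        System.setProperty(VALIDATION_PROPERTY, "true");
        List<Replica> replicas = Arrays.asList(
            new Replica("replica-0", Health.UNHEALTHY),
            new Replica("replica-1", Health.CONNECTED),
            new Replica("replica-2", Health.CONNECTED));
        for (Replica r : select(replicas, 3)) {
            System.out.println(r.address + " " + r.health);
        }
    }
}
```

In a real application the property would more likely be passed on the command line, for example `-DCOSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED=true`, before the client is created.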

Replica status machine:
image

How the status transitions when replica validation is disabled:
image

How the status transitions when replica validation is enabled:
image
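
The state machine diagrams are images that are not reproduced in this text, so the following is only a rough sketch of how such a status tracker might look. The status names echo the ones mentioned in this PR (unknown, connected, unhealthyPending, unhealthy), but the class, the method names, and in particular the transitions are assumptions: the sketch guesses that an unhealthy replica is treated as unhealthy-pending once the one-minute best-effort window has passed.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a per-replica health tracker; the transitions are assumed.
final class ReplicaHealthSketch {
    enum Status { UNKNOWN, CONNECTED, UNHEALTHY_PENDING, UNHEALTHY }

    // The PR describes validation as best effort with a one-minute limit.
    static final Duration UNHEALTHY_WINDOW = Duration.ofMinutes(1);

    private Status status = Status.UNKNOWN;
    private Instant markedUnhealthyAt;

    // A connection to the replica was opened successfully.
    void onConnectionOpened() {
        status = Status.CONNECTED;
        markedUnhealthyAt = null;
    }

    // A client-side signal (for example a connection timeout) marked the replica unhealthy.
    void onConnectionFailure(Instant now) {
        status = Status.UNHEALTHY;
        markedUnhealthyAt = now;
    }

    // Status as seen by replica selection at time `now`.
    // Assumption: after the window elapses, the replica stops being fully
    // de-prioritized and is reported as UNHEALTHY_PENDING instead.
    Status effectiveStatus(Instant now) {
        if (status == Status.UNHEALTHY
                && Duration.between(markedUnhealthyAt, now).compareTo(UNHEALTHY_WINDOW) >= 0) {
            return Status.UNHEALTHY_PENDING;
        }
        return status;
    }

    public static void main(String[] args) {
        ReplicaHealthSketch replica = new ReplicaHealthSketch();
        Instant start = Instant.now();
        replica.onConnectionFailure(start);
        System.out.println(replica.effectiveStatus(start.plusSeconds(30)));
        System.out.println(replica.effectiveStatus(start.plusSeconds(90)));
    }
}
```

Passing the clock value in explicitly keeps the time-based transition easy to exercise without waiting a real minute.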

CosmosDiagnostic Change:

  • replicaStatusList: captures the client-side replica status at the moment the SDK selects the replicas to send the request to.
    - rv in clientCfgs
    image
    image

Other Q&A:
1. Behavior on regional failover
COSMOS.REPLICA_ADDRESS_VALIDATION_ENABLED turns on replica validation for every region that requests are sent to. Unlike openConnectionsAndInitCaches, it does not proactively create connections to all partitions; it only tries to open connections to replicas that were marked as unhealthy, as requests come in.
2. How unhealthyPending transitions to connected for higher consistency levels
The transition can be triggered by either an openConnectionRequest or an RxDocumentServiceRequest.
As mentioned above, the selection of replicas is not blocked by the validation process.
3. 408/503 handling
We are starting from internal 410 & forceRefresh.
4. What if the connection is closed due to the idleConnectionTimeout configuration?
The recommendation for customers is to set idleConnectionTimeout to 0 when using this feature.
This PR makes no distinction for this case.
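
Two points from the answers above, that only replicas marked unhealthy are validated and that validation never blocks replica selection, can be sketched as follows. This is a minimal illustration under stated assumptions: the class, the `Predicate` standing in for "open a connection to this address", and the status names are all hypothetical and do not reflect the SDK's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch of on-demand, non-blocking validation.
final class BackgroundValidationSketch {
    enum Status { UNKNOWN, CONNECTED, UNHEALTHY }

    private final Map<String, Status> statusByAddress = new ConcurrentHashMap<>();
    // Stand-in for opening a connection; returns true when the attempt succeeds.
    private final Predicate<String> openConnection;

    BackgroundValidationSketch(Predicate<String> openConnection) {
        this.openConnection = openConnection;
    }

    void markUnhealthy(String address) {
        statusByAddress.put(address, Status.UNHEALTHY);
    }

    Status statusOf(String address) {
        return statusByAddress.getOrDefault(address, Status.UNKNOWN);
    }

    // Called when refreshed addresses arrive for a partition. Validation runs
    // in the background; the futures are returned only so that callers who
    // want to (such as tests) can wait. The request path does not need to.
    List<CompletableFuture<Void>> onAddressesRefreshed(List<String> addresses) {
        List<CompletableFuture<Void>> validations = new ArrayList<>();
        for (String address : addresses) {
            if (statusOf(address) != Status.UNHEALTHY) {
                continue; // only replicas marked unhealthy are validated
            }
            validations.add(CompletableFuture.runAsync(() -> {
                if (openConnection.test(address)) {
                    statusByAddress.put(address, Status.CONNECTED);
                }
            }));
        }
        return validations;
    }

    public static void main(String[] args) {
        BackgroundValidationSketch sketch = new BackgroundValidationSketch(address -> true);
        sketch.markUnhealthy("replica-1");
        List<CompletableFuture<Void>> pending =
            sketch.onAddressesRefreshed(Arrays.asList("replica-0", "replica-1"));
        pending.forEach(CompletableFuture::join);
        System.out.println(sketch.statusOf("replica-1"));
    }
}
```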

Benchmark Tests:
RU: 100,000, Document Count: 100,000, Region: South Central US
Throughput:
image
P95 Latency:
image
P99 Latency:
image
P999 Latency:
image

Benchmark Tests:
RU: 6000, Document: 1000, Region: South Central US
Throughput:
image

P95 Latency:
image

P99 Latency:
image

P999 Latency:
image

Test22 upgrade test results:
CTL RunId: 789635e2-5219-4b9d-860f-e6f4bd3e4259
Pre-created documents: 10000, concurrency 10. Orange line: ReplicaValidation + #28270. Blue line: Master
P95 Latency:
image

P99 Latency:
image

Max Latency:
image

Follow-up PRs:

  1. Add open connection validation for replicas in unknown status; the regions will be limited to the openAsync regions
  2. Issue [Cosmos] ConnectionStateListener to invalidate the replica which is reset (not all replicas) #28270

@azure-sdk
Collaborator

API change check

API changes are not detected in this pull request.

@xinlian12 xinlian12 changed the title replicaValidationBeforeUse -- NO REVIEW YET ReplicaValidationBeforeUse Jul 7, 2022
@xinlian12 xinlian12 changed the title ReplicaValidationBeforeUse ReplicaValidation Jul 7, 2022
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM Annie - great work - well structured implementation and PR description. Really appreciated!

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member Author

Tested the following tests locally and passed:
openConnectionsAndInitCachesForDirectMode
orderbyContinuationOnUndefinedAndNull

@xinlian12
Member Author

/check-enforcer override

@xinlian12 xinlian12 merged commit 4b634d8 into Azure:main Sep 1, 2022
vcolin7 pushed a commit to vcolin7/azure-sdk-for-java that referenced this pull request Sep 9, 2022
* replicaValidationBeforeUse

Co-authored-by: annie-mac <annie-mac@annie-macs-MBP.home>
Co-authored-by: annie-mac <annie-mac@annie-macs-MacBook-Pro.local>