API for threshold based retries #35166

mbhaskar · 2023-05-26T00:52:33Z

Description

Because an end-to-end timeout policy can reduce availability through its shorter timeouts, we need a strategy to improve availability when an end-to-end timeout is specified.
This PR introduces threshold-based retry execution of requests to improve availability in that case.

API:

This needs:

A list of regions to retry on
The time at which the retry is triggered

List of regions to be excluded for the request/retries, for example "East US" or "East US, West US". These regions are excluded from the preferred regions list when executing multi-region retries.

The idea is to let the user supply a list of regions on which a request should not be retried. Ideally, this list is a subset of the preferred regions.

CosmosItemRequestOptions#setExcludeRegions

CosmosItemRequestOptions options = new CosmosItemRequestOptions();
options.setExcludeRegions(List.of("East US", "West US"));

Exclude regions can be used in two ways.

By setting exclude regions, the request is routed only to the first effective region (see below), so you can use this to steer a request to a particular region.
Example scenario:
Preferred regions: regionA, regionB, regionC

Send a request to regionA with an endToEndTimeout of 500 ms.
If the request fails, send a new request with regionA in the exclude list.
The new request goes only to regionB and skips regionA.

This helps in scenarios where the user wants to try the request on a specific region.
Note that this works only when EndToEndTimeoutPolicy is set.
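The scenario above can be sketched without the SDK. This is a minimal simulation, not the SDK's routing code: the class `ExcludeRegionRetrySketch` and both of its methods are hypothetical names, and region matching is assumed to be an exact string comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper that mimics "route to the first effective region".
final class ExcludeRegionRetrySketch {

    /** Returns the first preferred region that is not in the exclude list. */
    static Optional<String> firstEffectiveRegion(List<String> preferred, List<String> excluded) {
        return preferred.stream()
            .filter(region -> !excluded.contains(region))
            .findFirst();
    }

    /**
     * Tries one region at a time. Each region whose attempt fails is added to
     * the exclude list, so the next attempt lands on the next preferred region.
     * Returns the region that served the request, or empty if all failed.
     */
    static Optional<String> retryWithExclusions(List<String> preferred, Predicate<String> attempt) {
        List<String> excluded = new ArrayList<>();
        Optional<String> target = firstEffectiveRegion(preferred, excluded);
        while (target.isPresent()) {
            if (attempt.test(target.get())) {
                return target;
            }
            excluded.add(target.get());
            target = firstEffectiveRegion(preferred, excluded);
        }
        return Optional.empty();
    }
}
```

With preferred regions regionA, regionB, regionC and an attempt that fails only on regionA, `retryWithExclusions` returns regionB, matching the scenario described above.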

You can set an availability strategy on CosmosEndToEndOperationLatencyPolicy to get better availability when the original request takes very long to execute.

This PR introduces a strategy called ThresholdBasedAvailabilityStrategy.

flowchart TD
    A[Request] -->B(Base Request with timeout policy)
    B --> Threshold{timespent < speculativeThreshold}
    Threshold -->| Yes | C{success <= EndToEndTimeout}
    Threshold --> | No | RemoteProcessing[Start requests to other regions in effective region list at T + Tstep*step-1]
    C -->|Yes| D[Return result]
    C -->|No response| E[Cancel and timeout]
    C --> |error response| F[Retry Flow]
    F --> G{timespent < EndToEndTimeout}
    G --> |No| E
    G --> |Yes| F
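The flow in the chart above can be approximated with plain `CompletableFuture`. This is only a sketch under stated assumptions, not the implementation in this PR: `ThresholdHedgingSketch`, `execute`, and `sendToRegion` are hypothetical names, the first entry of `regions` is treated as the base request, error responses are simply ignored instead of entering the retry flow, and in-flight requests are not cancelled.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch of threshold-based hedging across regions.
final class ThresholdHedgingSketch {

    /**
     * Sends the request to regions.get(0) immediately. Each further region is
     * tried once threshold + (i - 1) * thresholdStep has elapsed, but only if
     * no response has arrived yet. The first successful response wins.
     */
    static <T> CompletableFuture<T> execute(
            List<String> regions,
            Duration threshold,
            Duration thresholdStep,
            Duration endToEndTimeout,
            Function<String, CompletableFuture<T>> sendToRegion) {

        CompletableFuture<T> result = new CompletableFuture<>();
        for (int i = 0; i < regions.size(); i++) {
            String region = regions.get(i);
            Runnable attempt = () -> {
                // Skip the hedge if an earlier request already answered.
                if (!result.isDone()) {
                    sendToRegion.apply(region).thenAccept(result::complete);
                }
            };
            if (i == 0) {
                attempt.run();
            } else {
                long delayMs = threshold.toMillis() + (i - 1) * thresholdStep.toMillis();
                Executor delayed = CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS);
                CompletableFuture.runAsync(attempt, delayed);
            }
        }
        // No response within the end-to-end timeout: fail with TimeoutException.
        return result.orTimeout(endToEndTimeout.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

If the base region answers before the threshold, no other region is contacted. If it stays silent, the next region is tried and whichever response arrives first completes the returned future.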

ThresholdBasedAvailabilityStrategy contains the following parameters:

threshold = Threshold duration in ms
thresholdStep = Threshold step in ms

Threshold is the duration in ms after which, if the original request has not responded, we issue a request to the next region in the list of effective regions. ThresholdStep is the additional delay before each subsequent region is tried.

Effective regions are computed as shown below, and retries happen only on available effective regions.

effectiveRegions = (preferredRegions - excludeRegions)

AvailabilityStrategy thresholdStrategy = 
   new ThresholdBasedAvailabilityStrategy( /*threshold:*/ Duration.ofMillis(300),
                 /*thresholdStep:*/ Duration.ofMillis(100));

Example:
Preferred regions: "East US 1", "East US 2", "Central US", "West US 2"
Exclude list: "East US 2"
Number of regions = 2

Effective regions = "East US 1", "Central US"
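The example above can be reproduced with a few lines of Java. This is an illustrative sketch only: `EffectiveRegions.compute` is a hypothetical helper, not an SDK method, and it assumes regions are matched by exact name and that the number of regions acts as a simple cap on the filtered list.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: effectiveRegions = (preferredRegions - excludeRegions),
// kept in preferred order and capped at numberOfRegions.
final class EffectiveRegions {

    static List<String> compute(List<String> preferred, Set<String> excluded, int numberOfRegions) {
        return preferred.stream()
            .filter(region -> !excluded.contains(region)) // drop excluded regions
            .limit(numberOfRegions)                       // keep at most N regions
            .collect(Collectors.toList());
    }
}
```

Calling `compute` with the four preferred regions, the exclude list "East US 2", and a limit of 2 yields "East US 1" and "Central US", as in the example.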

How to use this?

CosmosEndToEndOperationLatencyPolicyConfigBuilder builder
   = new CosmosEndToEndOperationLatencyPolicyConfigBuilder(/*isEnabled:*/ true,
               /*endToEndTimeout:*/ Duration.ofMillis(2000));
builder.setAvailabilityStrategy(thresholdStrategy);
CosmosEndToEndOperationLatencyPolicyConfig policyConfig = builder.build();

Enabling it on the client

CosmosAsyncClient cosmosAsyncClient = new CosmosClientBuilder()
    .endpoint(END_POINT)
    .key(KEY)
    .endToEndOperationLatencyPolicyConfig(policyConfig)
    .buildAsyncClient();

This can be enabled or disabled per operation

CosmosItemRequestOptions requestOptions = new CosmosItemRequestOptions();
requestOptions.setCosmosEndToEndOperationLatencyPolicyConfig(policyConfig);
cosmosAsyncClient.getDatabase(DATABASE)
            .getContainer(CONTAINER)
            .readItem("id1", new PartitionKey("id1"), requestOptions, Person.class).block();

Per operation option always overrides the client option.

How does this work?

The initial execution on the requested region starts at t0.
If no response has been received by t0 + threshold milliseconds, another request is started on the first region in effectiveRegionList.
If no response has been received by t0 + threshold + thresholdStep milliseconds, another request is started on the next region in effectiveRegionList.

This continues until the number of regions the user set in the options has been reached.
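The timing described above can be written down as a small schedule calculation. This is a sketch under assumptions, not SDK code: `HedgeSchedule.startOffsets` is a hypothetical helper, and dropping requests that would start at or after the end-to-end timeout is an assumption made here for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: offsets from t0 at which additional regional requests start.
final class HedgeSchedule {

    static List<Duration> startOffsets(
            Duration threshold,
            Duration thresholdStep,
            int additionalRegions,
            Duration endToEndTimeout) {

        List<Duration> offsets = new ArrayList<>();
        for (int k = 0; k < additionalRegions; k++) {
            // k-th additional request starts at threshold + k * thresholdStep.
            Duration offset = threshold.plus(thresholdStep.multipliedBy(k));
            if (offset.compareTo(endToEndTimeout) >= 0) {
                break; // assumption: no point starting a request after the timeout
            }
            offsets.add(offset);
        }
        return offsets;
    }
}
```

With the values used earlier in this description (threshold 300 ms, step 100 ms, end-to-end timeout 2000 ms) and three additional regions, the requests start 300, 400, and 500 ms after t0.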

Future:

There is a possibility of extending this to configure individual timeouts and availability strategies for different kinds of operations, such as point and non-point operations.

Possible API on the client

CosmosE2ELatencyPolicyConfigs newGranularConfigs = new CosmosE2ELatencyPolicyConfigsBuilder()
    .pointTimeoutConfig(CosmosEndToEndOperationLatencyPolicyConfig)
    .queryTimeoutConfig(CosmosEndToEndOperationLatencyPolicyConfig)
    .build();

cosmosClientBuilder.setEndToEndLatencyPolicyConfigs(newGranularConfigs);

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.


azure-sdk · 2023-06-09T19:20:00Z

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos


xinlian12 · 2023-06-15T00:46:51Z

...s/azure-cosmos-tests/src/test/java/com/azure/cosmos/EndToEndTimeOutWithAvailabilityTest.java

+        // Now try the same request with West US 2 excluded
+        options.setExcludeRegions(ImmutableList.of("West US 2"));
+        cosmosItemResponseMono =
+            createdContainer.readItem(itemToRead.getId(), new PartitionKey(itemToRead.getMypk()), options, EndToEndTimeOutValidationTests.TestObject.class);


should we verify the contactedRegion?


kushagraThapar

LGTM, thanks @mbhaskar


Dismissing the review as the feedback comments have been incorporated and resolved.

mbhaskar · 2023-06-22T00:01:58Z

/azp run java - cosmos - tests

azure-pipelines · 2023-06-22T00:02:25Z

Azure Pipelines successfully started running 1 pipeline(s).


FabianMeiswinkel

Thanks - LGTM

mbhaskar · 2023-06-22T21:57:33Z

/azp run java - cosmos - tests

azure-pipelines · 2023-06-22T21:57:46Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2023-06-23T06:30:59Z

/azp run java - cosmos - tests

azure-pipelines · 2023-06-23T06:31:13Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2023-06-23T14:31:02Z

/azp run java - cosmos - tests

azure-pipelines · 2023-06-23T14:31:14Z

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 · 2023-06-23T20:11:04Z

/check-enforcer override

Initial draft of exposing threshold based retries with option to provide regions to skip from the retry execution

37cd2ea

github-actions bot added the Cosmos label May 26, 2023

refactor

cc75339

FabianMeiswinkel reviewed May 26, 2023

...s/azure-cosmos/src/main/java/com/azure/cosmos/availabilitystrategy/AvailabilityStrategy.java

FabianMeiswinkel reviewed May 26, 2023

...s/azure-cosmos/src/main/java/com/azure/cosmos/availabilitystrategy/AvailabilityStrategy.java

FabianMeiswinkel reviewed May 26, 2023

...s/azure-cosmos/src/main/java/com/azure/cosmos/availabilitystrategy/AvailabilityStrategy.java

FabianMeiswinkel reviewed May 26, 2023

...c/main/java/com/azure/cosmos/implementation/directconnectivity/ReplicatedResourceClient.java

mbhaskar added 6 commits June 6, 2023 08:37

Adding feed timeout

ef3ee3f

Refactoring

additional implementation

e256f00

Refactoring

8ce9826

Refactoring

ef1da26

Cleanup and refactoring

4190b06

mbhaskar marked this pull request as ready for review June 8, 2023 15:46

mbhaskar requested review from kushagraThapar, kirankumarkolli, xinlian12, milismsft, aayush3011, simorenoh, jeet1995 and Pilchie as code owners June 8, 2023 15:46

mbhaskar changed the title ~~Draft of API for threshold based retries~~ API for threshold based retries Jun 8, 2023

mbhaskar added 3 commits June 8, 2023 17:27

Adding tests and cleanup

f2eb71f

fixing documentation

11b5c89

Javadoc fix

e175125

kushagraThapar reviewed Jun 14, 2023

xinlian12 reviewed Jun 15, 2023

...s/azure-cosmos-tests/src/test/java/com/azure/cosmos/EndToEndTimeOutWithAvailabilityTest.java

xinlian12 reviewed Jun 15, 2023

jeet1995 reviewed Jun 20, 2023

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/ThresholdBasedAvailabilityStrategy.java

jeet1995 reviewed Jun 20, 2023

.../azure-cosmos/src/main/java/com/azure/cosmos/CosmosEndToEndOperationLatencyPolicyConfig.java

jeet1995 reviewed Jun 20, 2023

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/ThresholdBasedAvailabilityStrategy.java

mbhaskar added 2 commits June 21, 2023 15:00

Implementing pr comments

0dd6a67

Fixing an issue

84975f4

kushagraThapar approved these changes Jun 22, 2023

.../azure-cosmos/src/main/java/com/azure/cosmos/CosmosEndToEndOperationLatencyPolicyConfig.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/ThresholdBasedAvailabilityStrategy.java