Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speculative processing #34686

Merged
merged 21 commits into from
May 12, 2023
Merged

Conversation

mbhaskar
Copy link
Member

@mbhaskar mbhaskar commented Apr 27, 2023

Description

This PR adds support for speculative processing on top of end-to-end timeout to improve availability within timeout

Introduces Threshold based speculative execution

Only for Reads / Queries
If we don't get a successful response (either delayed response or error response) within the threshold value, we go to all preferred regions in parallel.
- We can try one region at a time, with some linear timeline.
- Whichever responds first, we will return, and cancel all the other ongoing requests.
Benefits:
1. Better chances of getting a response from either of those regions.
2. This provides Maximum availability at RUs / compute tradeoff (increased cost)

flowchart TD
    A[Request] -->B(Base Request with timeout policy)
    B --> Threshold{timespent < speculativeThreshold}
    Threshold -->| Yes | C{success <= EndToEndTimeout}
    Threshold --> | No | RemoteProcessing[Start Parallel remote requests to other regions]
    C -->|Yes| D[Return result]
    C -->|No response| E[Cancel and timeout]
    C --> |error response| F[Retry Flow]
    F --> G{timespent < EndToEndTimeout}
    G --> |No| E
    G --> |Yes| F
Loading

If speculative retries are enabled, based on the probability threshold Speculative retries may increase RU costs

When we have sparse data, the algo adds more variance around mean and this may increase the exploration even though the lower mean value is selected until the perf guarantee is increased. (The algo has proofs that this keeps cost under consideration and does only necessary exploration)

Current algo is tuned for 10sample requests. It can be tuned based on time/ number of requests and can be adjusted accordingly

Initial test resuls

Tests:

 

Read document 100 times

30s of delay faults on the local region

  P99 in ms P95 in ms P75 in ms  
No Speculation 2078.12109375 2028.05859375 2016.05859375  
Threshold based Speculation 1134.046875 616.015625 298.0  

How to enable:

Following environment variables enable speculation when end to end timeout policy is specified
Integer value specifying the speculation type

0 - No speculation
1 - Threshold based speculation

COSMOS_SPECULATION_TYPE

The threshold int representing milli seconds
COSMOS_SPECULATION_THRESHOLD

The threshold step int representing milli seconds. This sets when speculation on subsequent regions is triggered
COSMOS_SPECULATION_THRESHOLD_STEP

example: Three regions r1, r2,r3 in preferred regions
Speculation threshold: 500
Specialization step: 100

At time t1 request to r1 is made
If there is no response in 500ms, request to r2 is made
If there is no response at 500+100ms, request to r3 is made

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@mbhaskar mbhaskar changed the title Mbhaskar/speculative processing Speculative processing Apr 27, 2023
…e-processing

# Conflicts:
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/EndToEndTimeOutValidationTests.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RxDocumentClientImplTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/AsyncCacheNonBlockingTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/BarrierRequestHelperTest.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosClientBuilder.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/AsyncDocumentClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RMResources.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RequestOptions.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/AsyncCacheNonBlocking.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/ReplicatedResourceClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosItemRequestOptions.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosQueryRequestOptions.java
@mbhaskar
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @mbhaskar

@@ -33,7 +33,7 @@
*/
public final class CosmosDiagnostics {
private static final Logger LOGGER = LoggerFactory.getLogger(CosmosDiagnostics.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not private anymore? Are we re-using this anywhere else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was using in tests to validate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But could revert.

}

public static int speculationThreshold() {
return getJVMConfigAsInt(SPECULATION_THRESHOLD, 500);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should define these default values in separate constant variables above like other methods.


@Override
public void onResponseReceived(URI region, Duration latency) {
// unused
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is unused, may be throw error if it gets called, something like not implemented exception. Can be done in a follow up PR.

@@ -28,7 +28,7 @@
requires java.logging;
requires HdrHistogram;

// public API surface area
// public API surface area
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can revert this change,.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will revert in follow up PR

@@ -300,6 +311,18 @@ public static int getSessionTokenMismatchMaximumBackoffTimeInMs() {
DEFAULT_SESSION_TOKEN_MISMATCH_MAXIMUM_BACKOFF_TIME_IN_MILLISECONDS);
}

public static int getSpeculationType() {
return getJVMConfigAsInt(SPECULATION_TYPE, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this by default be 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No speculation by default

Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks

@mbhaskar
Copy link
Member Author

/check-enforcer override

1 similar comment
@kushagraThapar
Copy link
Member

/check-enforcer override

@mbhaskar mbhaskar enabled auto-merge (squash) May 12, 2023 21:51
@xinlian12
Copy link
Member

/check-enforcer override

2 similar comments
@FabianMeiswinkel
Copy link
Member

/check-enforcer override

@FabianMeiswinkel
Copy link
Member

/check-enforcer override

@mbhaskar
Copy link
Member Author

/check-enforcer evaluate

@kushagraThapar
Copy link
Member

/check-enforcer reset

@kushagraThapar
Copy link
Member

/check-enforcer override

@benbp benbp merged commit 46dd7ac into Azure:main May 12, 2023
81 of 83 checks passed
jairmyree pushed a commit to jairmyree/azure-sdk-for-java that referenced this pull request May 15, 2023
* end-end timeout implementation

* cleanup

* cleanup

* Adding missing comments

* Cleanup and fixing test

* Adding more tests

* - Moving asyncCache to background runnable so that it wont get cancelled when the subscription gets cancelled.
- Cleanup and refactoring

* Refactoring and clean up

* Refactoring and clean up

* Adding timeouts for writes.
Adding more tests

* Adding speculative execution support for read operations

* Adding config options to speculation

* Fixing issue with query where request is not cloning property

* Shading apache math3

* Adding Portions license
Fixing lint issues

* refactoring

* Test fix

* Cleanup and refactoring
Test cleanup
Adding only ThresholdBasedSpeculation

* changing default value

* Removing unused test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants