-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retriable writes #34227
Retriable writes #34227
Conversation
Initial DRAFT for public API review
…to users/fabianm/DiagnosticsProcessor
…to users/fabianm/DiagnosticsProcessor
…to users/fabianm/DiagnosticsProcessor
…to users/fabianm/DiagnosticsProcessor
…to users/fabianm/DiagnosticsProcessor
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosDiagnostics.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosDiagnostics.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
Show resolved
Hide resolved
.../main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
/azp run java - cosmos - test |
No pipelines are associated with this pull request. |
/azp run java - cosmos - spark |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @FabianMeiswinkel , thanks for the amazing work.
I have added few minor comments, nothing blocking, more optimization related though.
.../azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientSideRequestStatistics.java
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/InternalObjectNode.java
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/WriteRetryPolicy.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/WriteRetryPolicy.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Utils.java
Outdated
Show resolved
Hide resolved
...e-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResponse.java
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosItemResponse.java
Show resolved
Hide resolved
/azp run java - cosmos - tests |
/azp run java - cosmos - spark |
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - spark |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
Azure Pipelines successfully started running 1 pipeline(s). |
Only failures caused by known flakiness in changefeed read split tests - going to override and merge |
/check-enforcer override |
Description
This PR will add the option to enable automatic-retires for write operations when they are not guaranteed to be idempotent. By default the Cosmos DB SDK will only issue retries for write operation when the failure condition happened before the request was actually written to the network - or when the error code form the service guarantees that the service never processed the request. This behavior is sued to ensure that retries are not automatically done, when retry is not guaranteed to be idempotent. Imagine an attempt to create a document, the request is written to the network but the request times out after 5 seconds. At this point it is possible that the request was actually received and processed by the service - in which case a retry would not result in a 202-Creat anymore but a 409-Conflict instead. The lack of automatic retries in the SDK for write operation has caused quite a bit of customer confusion and dissatisfaction - many customers would prefer the SDK to at least issue the retries - even when their applications would need to be able to handle some of the idempotency challenges. This PR is adding an opt-in feature that customers can use to enable automatic retries for write operations n the SDK - and the design section iterates over the design considerations and situations caused by idempotency challenges that applications would need to be able to handle.
Public API surface area changes
Idempotency aspects
When opting into automatic retries for write operations even when idempotency cannot be guaranteed a new system property "_trackingId" will be used to help reduce the scenarios where applications need to resolve idempotency issues.
For certain write operations (see below) teh SDK will inject the "_trackingId" system property into the document - when due to retries certain failure conditions (409-conflict, 412-precondition failure) occur, the SDK will issue a read against the current version of the document to validate whether the 409/412 can be resolved (knowing for example that the 409-Conflict on retry was due to the fact that the original request succeeded, if the read document has the same _trackingId as injected).
This means that there is potentially a somewhat higher latency as well as RU-consumption when enabling the automatic retries for write operations - but assuming that otherwise the application would need to build this logic as well overall this should not be a concern.
CREATE
For CREATE operations initial successful attempt to insert a document can lead to a 409-Conflict failure on the retry. To minimize the impact to applications the following flow will be used. With the flow below any 409 raised to the application would have been raised to the application without automatic retries as well - so, no functional special casing would be needed.
Inherits default write retry policy from client - honors useTrackingId config
REPLACE
For replace a successfully processed initial attempt could result in receiving a 412-Precondition failed error-code on the retry - a similar mechanism like for create can be used to avoid raising these false-negative 412s to the application.
Inherits default write retry policy from client - honors useTrackingId config
UPSERT
Upserts are usually used when the sematic of the operation is a PUT - like I don't know whether the document exist or not - but if it already exists, I definitely want to update it to my version of the document. There is no special-casing needed in the SDK (or the application) to accomodate the most common scenarios - if the initial attempt to upsert times out but was actually processed the retry would result in updating the document again. The only caveat is that when applications depend on the status code being returned (201 indicating new document, 200 previously existing document) the automatic retries could result in confusion - a 200 after the retry was processed could mean the document was only created by the initial timed-out upsert request (which actually created the document). The only way to be able to deterministically tell whether a document has been created or updated would be to use Create and Replace on 409 (or replace and create on 404)- so, applications using the 201 vs. 200 status code of upserts to make any business decisions based on the understanding whether the doc is new or not, should not enable automatic retries in the SDK and handle retries manually or switch to a Create + Replace on 409 model.
Inherits default write retry policy from client - ignores useTrackingId config
DELETE
Automatic retries for DELETE can result in higher chances of getting 404 - Not found in the application (when the initial request attempt was actually processed but timed-out and the retry then gets the 404 - Not found. This means applications which are not able to handle 404-NotFound gracefully already for deletes should not enable automatic writes (at least for the write operations) - if applications already handle 404-Not found gracefully automatic retries can be enabled without any issues.
Inherits default write retry policy from client - ignores useTrackingId config
PATCH
For patch whether writes can safely be retried or not depends on the patch instructions itself - replace, set, copy for example would be idempotent while add, move or remove would not be idempotent unless combined with patch precondition checks. So for path we will allow opting-in into automatic retries only on the request options level - assuming that the dev has validated the patch instructions and explicitly wants retries to happen automatically - for all other cases for path automatic retries will keep being disabled - even if on client-level the default is to enable automatic retries for write operations.
Always disabled by default - ignores useTrackingId config
STORED PROCEDURE INVOCATION
No automatic retries supported.
BULK
No automatic retries supported.
PARTITION KEY DELETE / DELETE ALL ITEMS BY PK
No automatic retries supported.
Samples of Public API usage
Enabling retries for an individual create operation using the
_trackingId
system property to resolve retry-caused 409-ConflictsEnabling retries for an individual create operation without using
_trackingId
system property (can result in higher rate of 409-Conflict)Enabling retries for an individual replace operation using the
_trackingId
system property to resolve retry-caused 412-ConflictsEnabling retries for an individual replace operation without using
_trackingId
system property (can result in higher rate of 412-Conflict)Enabling retries for an individual upsert operation
Enabling retries for an individual delete operation
Enabling retries for an individual patch operation
Changing default to enable retries (with
_trackingId
system property usage) as the default behavior unless opted-out via request optionsChanging default to enable just retries (no
_trackingId
system property usage) as the default behavior unless opted-out via request optionsAll SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines