Retriable writes #34227

FabianMeiswinkel · 2023-03-28T13:33:47Z

Description

This PR adds an opt-in option to enable automatic retries for write operations even when those retries are not guaranteed to be idempotent.

By default, the Cosmos DB SDK retries a write operation only when the failure occurred before the request was written to the network, or when the error code from the service guarantees that the service never processed the request. This behavior ensures that retries are not issued automatically when a retry is not guaranteed to be idempotent. Imagine an attempt to create a document: the request is written to the network but times out after 5 seconds. At this point it is possible that the service actually received and processed the request, in which case a retry would no longer result in a 201-Created but in a 409-Conflict instead.

The lack of automatic retries for write operations in the SDK has caused quite a bit of customer confusion and dissatisfaction. Many customers would prefer the SDK to at least issue the retries, even if their applications then need to handle some of the idempotency challenges. This PR adds an opt-in feature that customers can use to enable automatic retries for write operations in the SDK. The design section below walks through the design considerations and the situations caused by idempotency challenges that applications would need to handle.

Public API surface area changes

Idempotency aspects

When an application opts into automatic retries for write operations even though idempotency cannot be guaranteed, a new system property "_trackingId" is used to reduce the number of scenarios in which the application has to resolve idempotency issues itself.
For certain write operations (see below) the SDK injects the "_trackingId" system property into the document. When a retry then fails with certain conditions (409-Conflict, 412-Precondition Failed), the SDK issues a read against the current version of the document to check whether the 409/412 can be resolved. For example, if the document that was read carries the same _trackingId that the SDK injected, the 409-Conflict on the retry was caused by the original request having succeeded.
This means that enabling automatic retries for write operations can lead to somewhat higher latency and RU consumption. Because the application would otherwise have to implement the same logic itself, this should not be a concern overall.

CREATE

For CREATE operations, a successful initial attempt to insert a document can lead to a 409-Conflict failure on the retry. To minimize the impact on applications, the following flow is used. With this flow, any 409 raised to the application would also have been raised without automatic retries, so no functional special-casing is needed.

Inherits default write retry policy from client - honors useTrackingId config

flowchart TD;
  A[Start] --> B[Receive Response for retry];
  B --> C{Is 409?};
  C ----> |No| Z[Done];
  C --> |Yes| D[Read document];
  D --> E{_trackingId == injected trackingId?};
  E ----> |Yes| Y[Done 201];
  E ----> |No| X[Throw 409];

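The flow above can be sketched in plain Java. This is only an illustration under stated assumptions, not the SDK's implementation: `FakeContainer` is a hypothetical in-memory stand-in for the service, and `resolveRetryStatus` is a made-up helper name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Illustrative sketch of resolving a retry-caused 409 with a tracking id. */
final class CreateRetrySketch {
    /** Hypothetical in-memory container: item id -> trackingId stored with the document. */
    static final class FakeContainer {
        private final Map<String, String> trackingIdById = new HashMap<>();

        /** Returns 201 when the item is new and 409 when the id already exists. */
        int create(String id, String trackingId) {
            if (trackingIdById.containsKey(id)) {
                return 409;
            }
            trackingIdById.put(id, trackingId);
            return 201;
        }

        /** Returns the tracking id stored with the document, or null if it does not exist. */
        String readTrackingId(String id) {
            return trackingIdById.get(id);
        }
    }

    /** Applies the flow to the status code received for a retry. */
    static int resolveRetryStatus(FakeContainer container, String id,
                                  String injectedTrackingId, int retryStatus) {
        if (retryStatus != 409) {
            return retryStatus; // not a conflict, nothing to resolve
        }
        // The extra read is where the additional latency and RU cost comes from.
        String storedTrackingId = container.readTrackingId(id);
        return injectedTrackingId.equals(storedTrackingId) ? 201 : 409;
    }

    public static void main(String[] args) {
        FakeContainer container = new FakeContainer();
        String trackingId = UUID.randomUUID().toString();

        // The initial attempt is processed, but assume the client only saw a timeout.
        container.create("doc1", trackingId);
        // The retry now collides with the document written by the initial attempt.
        int retryStatus = container.create("doc1", trackingId);
        System.out.println(resolveRetryStatus(container, "doc1", trackingId, retryStatus)); // prints 201

        // A document written by a different request keeps surfacing as a real conflict.
        container.create("doc2", "some-other-tracking-id");
        int conflictStatus = container.create("doc2", trackingId);
        System.out.println(resolveRetryStatus(container, "doc2", trackingId, conflictStatus)); // prints 409
    }
}
```

The second case shows why a 409 that reaches the application is a genuine conflict: the stored tracking id does not match the one injected for this request.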
REPLACE

For REPLACE, a successfully processed initial attempt can cause the retry to fail with a 412-Precondition Failed error code. A mechanism similar to the one for CREATE can be used to avoid raising these false-negative 412s to the application.

Inherits default write retry policy from client - honors useTrackingId config

flowchart TD;
  A[Start] --> B[Receive Response for retry];
  B --> C{Is 412?};
  C ----> |No| Z[Done];
  C --> |Yes| D[Read document];
  D --> E{_trackingId == injected trackingId?};
  E ----> |Yes| Y[Done 200];
  E ----> |No| X[Throw 412];

UPSERT

Upserts are usually used when the semantics of the operation are those of a PUT: "I don't know whether the document exists or not, but if it already exists, I definitely want to update it to my version of the document." No special-casing is needed in the SDK (or the application) to accommodate the most common scenarios. If the initial upsert attempt times out but was actually processed, the retry simply updates the document again.

The only caveat applies to applications that depend on the returned status code (201 indicating a new document, 200 a previously existing one). With automatic retries, a 200 returned for the retry can mean that the document was in fact created by the initial, timed-out upsert request. The only way to tell deterministically whether a document has been created or updated is to use Create and Replace on 409 (or Replace and Create on 404). Applications that use the 201 vs. 200 status code of upserts to make business decisions based on whether the document is new should therefore not enable automatic retries in the SDK. They should handle retries manually or switch to a Create + Replace on 409 model.

Inherits default write retry policy from client - ignores useTrackingId config
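The "Create + Replace on 409" model mentioned above can be sketched as follows. This is a minimal illustration against a hypothetical in-memory `FakeContainer`, not SDK code; `createOrReplace` and `Outcome` are names introduced only for this example, and concurrent deletes are not modeled.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: tell "created" from "replaced" without relying on upsert status codes. */
final class CreateThenReplaceSketch {
    enum Outcome { CREATED, REPLACED }

    /** Hypothetical in-memory container: item id -> document body. */
    static final class FakeContainer {
        private final Map<String, String> docs = new HashMap<>();

        /** Returns 201 when the item is new and 409 when the id already exists. */
        int create(String id, String body) {
            if (docs.containsKey(id)) {
                return 409;
            }
            docs.put(id, body);
            return 201;
        }

        /** Returns 200 when the item existed and was replaced, 404 otherwise. */
        int replace(String id, String body) {
            if (!docs.containsKey(id)) {
                return 404;
            }
            docs.put(id, body);
            return 200;
        }
    }

    /** Tries Create first and falls back to Replace only on 409. */
    static Outcome createOrReplace(FakeContainer container, String id, String body) {
        int createStatus = container.create(id, body);
        if (createStatus == 201) {
            return Outcome.CREATED;
        }
        if (createStatus == 409) {
            int replaceStatus = container.replace(id, body);
            if (replaceStatus == 200) {
                return Outcome.REPLACED;
            }
            throw new IllegalStateException("Unexpected replace status " + replaceStatus);
        }
        throw new IllegalStateException("Unexpected create status " + createStatus);
    }

    public static void main(String[] args) {
        FakeContainer container = new FakeContainer();
        System.out.println(createOrReplace(container, "doc1", "v1")); // prints CREATED
        System.out.println(createOrReplace(container, "doc1", "v2")); // prints REPLACED
    }
}
```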

DELETE

Automatic retries for DELETE increase the chance that the application sees a 404-Not Found. This happens when the initial attempt was actually processed but timed out, and the retry then gets the 404-Not Found. Applications that cannot already handle 404-Not Found gracefully for deletes should therefore not enable automatic retries (at least not for delete operations). If an application already handles 404-Not Found gracefully, automatic retries can be enabled without any issues.

Inherits default write retry policy from client - ignores useTrackingId config
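"Handling 404-Not Found gracefully" for deletes can be as simple as treating both outcomes as "the document is gone". The sketch below assumes a hypothetical in-memory `FakeContainer` and a made-up helper `isDeleteSatisfied`; it is not part of the SDK.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch: a delete that tolerates a retry-caused 404. */
final class DeleteRetrySketch {
    /** Hypothetical in-memory container holding item ids. */
    static final class FakeContainer {
        private final Set<String> ids = new HashSet<>();

        void add(String id) {
            ids.add(id);
        }

        /** Returns 204 when the item existed and was removed, 404 otherwise. */
        int delete(String id) {
            return ids.remove(id) ? 204 : 404;
        }
    }

    /** Both 204 and 404 mean that the document no longer exists. */
    static boolean isDeleteSatisfied(int statusCode) {
        return statusCode == 204 || statusCode == 404;
    }

    public static void main(String[] args) {
        FakeContainer container = new FakeContainer();
        container.add("doc1");

        int initialStatus = container.delete("doc1"); // processed, but assume the client saw a timeout
        int retryStatus = container.delete("doc1");   // the retry finds nothing left to delete

        System.out.println(initialStatus + " " + retryStatus);      // prints 204 404
        System.out.println(isDeleteSatisfied(retryStatus));         // prints true
    }
}
```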

PATCH

For PATCH, whether writes can safely be retried depends on the patch instructions themselves. For example, replace, set and copy would be idempotent, while add, move or remove would not be idempotent unless combined with patch precondition checks. For patch we therefore allow opting into automatic retries only at the request options level, on the assumption that the developer has validated the patch instructions and explicitly wants retries to happen automatically. In all other cases automatic retries stay disabled for patch, even if the client-level default enables automatic retries for write operations.

Always disabled by default - ignores useTrackingId config
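A developer could validate patch instructions before opting in with a check along these lines. This is a rough sketch: the operation names are plain strings taken from the paragraph above, `isSafeToRetry` is a hypothetical helper, and patch precondition checks (which could make the other operations safe) are not modeled.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch: decide whether a set of patch operations is safe to retry. */
final class PatchRetrySafetySketch {
    // Operation types the description above calls idempotent.
    private static final Set<String> IDEMPOTENT_OPS = Set.of("replace", "set", "copy");

    /** Returns true only if every operation type is in the idempotent set. */
    static boolean isSafeToRetry(List<String> operationTypes) {
        for (String operationType : operationTypes) {
            if (!IDEMPOTENT_OPS.contains(operationType.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSafeToRetry(List.of("set", "replace"))); // prints true
        System.out.println(isSafeToRetry(List.of("set", "add")));     // prints false
    }
}
```

Only when such a check passes would the application opt in for that specific patch request.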

STORED PROCEDURE INVOCATION

No automatic retries supported.

BULK

No automatic retries supported.

PARTITION KEY DELETE / DELETE ALL ITEMS BY PK

No automatic retries supported.

Samples of Public API usage

Enabling retries for an individual create operation using the `_trackingId` system property to resolve retry-caused 409-Conflicts

String pkValue = "myPKValue"; // whatever the logical partition key value is      
boolean ENABLE_RETRIES = true;
boolean USE_TRACKING_ID = true;
CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(ENABLE_RETRIES, USE_TRACKING_ID);

asyncContainer.createItem(item, new PartitionKey(pkValue), optionsWithRetry).block();

Enabling retries for an individual create operation without using `_trackingId` system property (can result in higher rate of 409-Conflict)

String pkValue = "myPKValue"; // whatever the logical partition key value is      
boolean ENABLE_RETRIES = true;
boolean NO_TRACKING_ID = false;
CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(ENABLE_RETRIES, NO_TRACKING_ID);

asyncContainer.createItem(item, new PartitionKey(pkValue), optionsWithRetry).block();

Enabling retries for an individual replace operation using the `_trackingId` system property to resolve retry-caused 412-Precondition failures

String pkValue = "myPKValue"; // whatever the logical partition key value is      
boolean ENABLE_RETRIES = true;
boolean USE_TRACKING_ID = true;
CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(ENABLE_RETRIES, USE_TRACKING_ID);

asyncContainer.replaceItem(item, id, new PartitionKey(pkValue), optionsWithRetry).block();

Enabling retries for an individual replace operation without using `_trackingId` system property (can result in higher rate of 412-Precondition failures)

String pkValue = "myPKValue"; // whatever the logical partition key value is      
boolean ENABLE_RETRIES = true;
boolean NO_TRACKING_ID = false;
CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(ENABLE_RETRIES, NO_TRACKING_ID);

asyncContainer.replaceItem(item, id, new PartitionKey(pkValue), optionsWithRetry).block();

Enabling retries for an individual upsert operation

CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(true, false);

asyncContainer.upsertItem(..., optionsWithRetry).block();

Enabling retries for an individual delete operation

CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(true, false);

asyncContainer.deleteItem(..., optionsWithRetry).block();

Enabling retries for an individual patch operation

CosmosPatchItemRequestOptions optionsWithRetry = new CosmosPatchItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(true, false);

asyncContainer.patchItem(..., optionsWithRetry).block();

Changing default to enable retries (with `_trackingId` system property usage) as the default behavior unless opted-out via request options

System.setProperty("COSMOS.WRITE_RETRY_POLICY", "WITH_TRACKING_ID");

Changing default to enable just retries (no `_trackingId` system property usage) as the default behavior unless opted-out via request options

System.setProperty("COSMOS.WRITE_RETRY_POLICY", "WITH_RETRIES");
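Putting the per-operation rules from the design section together, the effective policy for a request could be resolved roughly as below. This is one reading of the description, not the SDK's code: the `Policy` and `Operation` enums, the `NO_RETRIES` value and the `resolve` method are hypothetical names introduced for this sketch, and a `null` override stands for "not set on the request options".

```java
/** Illustrative sketch: resolve the effective write retry policy for one request. */
final class EffectiveWritePolicySketch {
    enum Policy { NO_RETRIES, WITH_RETRIES, WITH_TRACKING_ID }

    enum Operation {
        CREATE, REPLACE, UPSERT, DELETE, PATCH, STORED_PROCEDURE, BULK, PARTITION_KEY_DELETE
    }

    /** Operations that ignore the tracking id setting still retry, just without the tracking id. */
    private static Policy withoutTrackingId(Policy policy) {
        return policy == Policy.WITH_TRACKING_ID ? Policy.WITH_RETRIES : policy;
    }

    /**
     * @param clientDefault   default policy of the client (for example set via system property)
     * @param requestOverride policy set on the request options, or null if not set
     */
    static Policy resolve(Operation operation, Policy clientDefault, Policy requestOverride) {
        Policy inherited = requestOverride != null ? requestOverride : clientDefault;
        switch (operation) {
            case CREATE:
            case REPLACE:
                return inherited; // inherits the client default, honors the tracking id setting
            case UPSERT:
            case DELETE:
                return withoutTrackingId(inherited); // inherits, ignores the tracking id setting
            case PATCH:
                // Disabled unless explicitly enabled on the request options.
                return requestOverride == null
                    ? Policy.NO_RETRIES
                    : withoutTrackingId(requestOverride);
            default:
                return Policy.NO_RETRIES; // stored procedures, bulk, delete by partition key
        }
    }

    public static void main(String[] args) {
        Policy clientDefault = Policy.WITH_TRACKING_ID;
        for (Operation operation : Operation.values()) {
            System.out.println(operation + " -> " + resolve(operation, clientDefault, null));
        }
    }
}
```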

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosDiagnostics.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

.../main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestManager.java

xinlian12

LGTM, thanks

FabianMeiswinkel · 2023-04-05T01:54:29Z

/azp run java - cosmos - test

azure-pipelines · 2023-04-05T01:54:35Z

No pipelines are associated with this pull request.

FabianMeiswinkel · 2023-04-05T01:54:42Z

/azp run java - cosmos - spark

azure-pipelines · 2023-04-05T01:54:52Z

Azure Pipelines successfully started running 1 pipeline(s).

FabianMeiswinkel · 2023-04-05T02:07:33Z

/azp run java - cosmos - tests

azure-pipelines · 2023-04-05T02:07:47Z

Azure Pipelines successfully started running 1 pipeline(s).

kushagraThapar

LGTM @FabianMeiswinkel , thanks for the amazing work.
I have added few minor comments, nothing blocking, more optimization related though.

.../azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientSideRequestStatistics.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/InternalObjectNode.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/WriteRetryPolicy.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Utils.java

...e-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResponse.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosItemResponse.java

FabianMeiswinkel · 2023-04-05T07:27:11Z

/azp run java - cosmos - tests

FabianMeiswinkel · 2023-04-05T07:27:24Z

/azp run java - cosmos - spark

azure-pipelines · 2023-04-05T07:27:25Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines · 2023-04-05T07:27:34Z

Azure Pipelines successfully started running 1 pipeline(s).

FabianMeiswinkel · 2023-04-05T07:43:59Z

/azp run java - cosmos - spark

FabianMeiswinkel · 2023-04-05T07:44:09Z

/azp run java - cosmos - tests

azure-pipelines · 2023-04-05T07:44:09Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines · 2023-04-05T07:44:24Z

Azure Pipelines successfully started running 1 pipeline(s).

FabianMeiswinkel · 2023-04-05T20:36:34Z

Only failures caused by known flakiness in changefeed read split tests - going to override and merge

FabianMeiswinkel · 2023-04-05T20:37:27Z

/check-enforcer override

FabianMeiswinkel added 30 commits January 30, 2023 05:44

Initial DRAFT for public API review

5d126e7

Initial DRAFT for public API review

Update DiagnosticsProvider.java

d53acc3

Adding comments

3519b0a

Iterate on comments

65ecd09

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-java in…

501ddf0

…to users/fabianm/DiagnosticsProcessor

Merging main into PR

b847734

Implementing CosmosDiagnosticsContext.toString (including caching)

d21ca51

Iterating on implementation

8228edf

Fixing SpotBug issues

45824b0

Fixing SpotBug violations

786d0df

Fixing java doc violation

eaa916b

Fixing unit test failures

2064c94

Fixing test failures

07774ea

Refactoring to allow specifying thresholds on config and requestOptions

9b4867b

Update CosmosDiagnosticsThresholds.java

228fdae

Update CosmosItemRequestOptions.java

be5c48d

Update DiagnosticsProvider.java

e7a34b9

Allowing configuration of status code handling

1e180a7

Update CosmosDiagnosticsThresholds.java

d19302b

Update CosmosClientTelemetryConfig.java

6284527

Allowing to configure log levels

da445fc

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-java in…

6a98a8b

…to users/fabianm/DiagnosticsProcessor

Temp

bb87153

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-java in…

ea0f975

…to users/fabianm/DiagnosticsProcessor

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-java in…

96cff8d

…to users/fabianm/DiagnosticsProcessor

Update ImplementationBridgeHelpers.java

6e50318

Removing TracerProvider and starting to switch to DiagnosticsProvider

bc61d50

Iterating on DiagnosticsProvider

e54230f

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-java in…

89e35c8

…to users/fabianm/DiagnosticsProcessor

Addressing SpotBug failures

03f8eca

Reacting to code review feedback

ec31828

xinlian12 reviewed Apr 4, 2023

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosDiagnostics.java (outdated)

xinlian12 reviewed Apr 4, 2023

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosDiagnostics.java (outdated)

xinlian12 reviewed Apr 4, 2023

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

xinlian12 reviewed Apr 4, 2023

.../main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestManager.java (outdated)

xinlian12 approved these changes Apr 4, 2023

FabianMeiswinkel added 2 commits April 5, 2023 00:36

Addressing code review feedback

fa205d7

Reacting to PR feedback

650a8f9

FabianMeiswinkel enabled auto-merge (squash) April 5, 2023 02:14

kushagraThapar approved these changes Apr 5, 2023

Reacted to code review feedback

95830a2

FabianMeiswinkel changed the title ~~Initial DRAFT - Retriable writes~~ Retriable writes Apr 5, 2023

Update CHANGELOG.md

8a4a7da

FabianMeiswinkel merged commit 7832d85 into Azure:main Apr 5, 2023
