[BUG] Cosmos DB Client gets stuck in timeout retry loop in Gateway mode. #31260

dgpoulet · 2022-10-04T15:16:14Z

Description
Discovered while testing failover behaviour in a multi-region, single-write-region topology. Create a Cosmos DB client in gateway mode and issue a simple read query while a network testing tool simulates a connection timeout by dropping all packets to the primary region endpoint. The client gets stuck in a retry loop, continuously trying to connect to the primary region. It neither retries against the secondary region nor ends the operation with a failure exception.

To Reproduce
1. Create a Cosmos DB account with two regions and a single write region.
2. Create a Java client app that connects to the account, populating the "preferred regions" property with the primary and secondary regions and setting the connection mode to gateway. Have the app connect to Cosmos DB and issue a simple query for a known piece of data.
3. Before executing the app, use a tool like "clumsy" to drop all network packets to the primary endpoint IP address.
4. Run the app.
5. The app retries against the primary endpoint over and over. The logs show 503 exceptions as well as messages indicating that the primary region is being marked as unavailable, but the app keeps retrying in a loop and never completes.
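For anyone trying to reproduce this, the client setup described above might look roughly like the following sketch. The issue does not include the original code, so everything specific here is an assumption: the environment variable names, the region names, the database and container names, and the query are all placeholders to be replaced with values from your own account.

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.fasterxml.jackson.databind.JsonNode;

import java.util.Arrays;

public class GatewayFailoverRepro {
    public static void main(String[] args) {
        // Placeholder settings (assumptions, not taken from the issue).
        String endpoint = System.getenv("COSMOS_ENDPOINT");
        String key = System.getenv("COSMOS_KEY");

        CosmosClient client = new CosmosClientBuilder()
            .endpoint(endpoint)
            .key(key)
            // Primary region first, secondary region second (example names).
            .preferredRegions(Arrays.asList("West Europe", "North Europe"))
            // Gateway (HTTPS) connection mode, as in the report.
            .gatewayMode()
            .buildClient();

        try {
            CosmosContainer container = client
                .getDatabase("testdb")      // hypothetical database name
                .getContainer("items");     // hypothetical container name

            // Simple query for a known piece of data (hypothetical id).
            container
                .queryItems("SELECT * FROM c WHERE c.id = 'known-id'",
                            new CosmosQueryRequestOptions(),
                            JsonNode.class)
                .forEach(item -> System.out.println(item));
        } finally {
            client.close();
        }
    }
}
```

With packets to the primary endpoint being dropped, the query call in this sketch is where the reported retry loop would be expected to show up on 4.37.0.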

Expected behavior
According to the documentation, I would expect Cosmos DB operations to retry against the unavailable region for a certain amount of time and then retry against the secondary region. Failing that, I would expect the operation to exit and surface an exception to the application.
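The expected behavior can be illustrated with a small, SDK-independent sketch. This is not how the Cosmos DB SDK implements failover; it only models the two outcomes described above: a bounded number of attempts per region, moving to the next preferred region when they are exhausted, and an exception once every region has failed. The class name, method name, and exception type are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class RegionFailoverSketch {

    /** Thrown when every preferred region has been tried without success. */
    public static class AllRegionsFailedException extends RuntimeException {
        public AllRegionsFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the operation against each preferred region in order, with a
     * bounded number of attempts per region. Returns the first successful
     * result, or throws once all regions are exhausted.
     */
    public static <T> T readWithFailover(List<String> preferredRegions,
                                         int maxAttemptsPerRegion,
                                         Function<String, T> operation) {
        if (preferredRegions.isEmpty() || maxAttemptsPerRegion < 1) {
            throw new IllegalArgumentException(
                "need at least one region and one attempt per region");
        }
        RuntimeException lastFailure = null;
        for (String region : preferredRegions) {
            for (int attempt = 1; attempt <= maxAttemptsPerRegion; attempt++) {
                try {
                    return operation.apply(region);
                } catch (RuntimeException e) {
                    // Remember the failure and keep going: either another
                    // attempt in this region or the next region in the list.
                    lastFailure = e;
                }
            }
        }
        // Never loop forever: surface the failure to the caller.
        throw new AllRegionsFailedException(
            "All " + preferredRegions.size() + " preferred regions failed",
            lastFailure);
    }

    public static void main(String[] args) {
        // Simulated operation: the primary always fails, the secondary works.
        String result = readWithFailover(
            Arrays.asList("primary", "secondary"),
            2,
            region -> {
                if (region.equals("primary")) {
                    throw new RuntimeException("503 from " + region);
                }
                return "ok from " + region;
            });
        System.out.println(result);
    }
}
```

The key property is that both loops are bounded, so the call always terminates with either a result or an exception, which is the behavior the report says was missing.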

Setup:

OS: Windows 10
IDE: Eclipse
Library/Libraries: Cosmos DB Java SDK v 4.37.0
Java version: 11

ghost · 2022-10-04T16:08:04Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar, @TheovanKraay

xinlian12 · 2022-10-09T16:05:05Z

The fix has been released in 4.37.1:
https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/CHANGELOG.md#4371-2022-10-07

ghost added the needs-triage label Oct 4, 2022

TheovanKraay added the Cosmos label Oct 4, 2022

ghost removed the needs-triage label Oct 4, 2022

TheovanKraay added the cosmos:v4-item, cosmos-java-ecosystem-zn, and Client labels Oct 4, 2022

TheovanKraay assigned jeet1995 Oct 4, 2022

This was referenced Oct 7, 2022:

Fix to enable reads to failover to preferred/secondary regions jeet1995/azure-sdk-for-java#1 (Closed)

Read failover fix to preferred locations/regions. #31314 (Merged)

xinlian12 closed this as completed Oct 9, 2022

github-actions bot locked and limited conversation to collaborators Apr 11, 2023
