[BUG] Cosmos DB Client gets stuck in timeout retry loop in Gateway mode. #31260
Labels
Client (this issue points to a problem in the data-plane of the library)
cosmos:v4-item (indicates this feature will be shipped as part of the V4 release train)
Cosmos
Description
Discovered while testing failover behaviour in a multi-region account with a single write region. Create a Cosmos DB client in gateway mode and issue a simple read query while a network testing tool drops all packets to the primary region endpoint, which reproduces a connection timeout to that region. The client then gets stuck in a retry loop: it keeps trying to connect to the primary region and neither retries against the secondary region nor ends the operation with a failure exception.
To Reproduce
Create a Cosmos DB account with two regions, with a single write region.
Create a Java client app to connect to the account, populating the "preferred regions" property with the primary and secondary regions, and set the connection mode to gateway. Have the app connect to Cosmos DB and issue a simple query for a known piece of data.
Before executing the app, use a tool like "clumsy" to drop all network packets to the primary endpoint IP address.
Run the app.
The app retries against the primary endpoint over and over. The logs show 503 exceptions as well as messages indicating that the primary region is being marked as unavailable, but the app keeps retrying in a loop and never completes.
Expected behavior
According to the documentation, I would expect the Cosmos DB operations to retry against the unavailable region for a certain amount of time and then retry against the secondary region. Failing that, I would expect the operations to exit and surface an exception to the application.
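To make the expectation concrete, here is a minimal sketch in plain Java of the behavior I would expect: a bounded number of attempts per region, then failover to the next preferred region, then an exception once every region is exhausted. This is not the SDK's retry policy and uses no SDK types. The names `RegionFailoverSketch`, `maxAttemptsPerRegion`, and `AllRegionsFailedException` are hypothetical, and the per-region limit is an attempt count here only for simplicity (the real limit may well be time-based).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch only: NOT the Cosmos DB SDK's retry policy.
// All names in this class are hypothetical.
class RegionFailoverSketch {

    /** Thrown once every preferred region has used up its retry budget. */
    public static class AllRegionsFailedException extends RuntimeException {
        public AllRegionsFailedException(String message, Throwable lastCause) {
            super(message, lastCause);
        }
    }

    /**
     * Runs the operation against each preferred region in order.
     * Each region gets at most maxAttemptsPerRegion attempts before
     * the next region is tried, so the loop always terminates.
     */
    public static <T> T execute(List<String> preferredRegions,
                                int maxAttemptsPerRegion,
                                Function<String, T> operation) {
        List<String> failures = new ArrayList<>();
        RuntimeException lastError = null;

        for (String region : preferredRegions) {
            for (int attempt = 1; attempt <= maxAttemptsPerRegion; attempt++) {
                try {
                    return operation.apply(region);
                } catch (RuntimeException e) {
                    // Record the failure and keep going within the budget.
                    lastError = e;
                    failures.add(region + " attempt " + attempt + ": " + e.getMessage());
                }
            }
            // Budget for this region is exhausted: fall through to the next one.
        }

        // No region succeeded: surface the failure instead of looping forever.
        throw new AllRegionsFailedException("All regions failed: " + failures, lastError);
    }

    public static void main(String[] args) {
        // Simulate the scenario from this issue: the primary region always times out.
        String result = execute(List.of("primary", "secondary"), 3, region -> {
            if (region.equals("primary")) {
                throw new IllegalStateException("simulated connection timeout");
            }
            return "read served by " + region;
        });
        System.out.println(result); // prints "read served by secondary"
    }
}
```

In the simulated run, the primary region is tried three times and the read is then served by the secondary region. If the secondary also failed, the caller would get `AllRegionsFailedException` rather than an endless loop, which is the fallback outcome described above.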
Setup: