[BUG] Cosmos DB Client gets stuck in timeout retry loop in Gateway mode. #31260

Closed
dgpoulet opened this issue Oct 4, 2022 · 3 comments
Labels
Client This issue points to a problem in the data-plane of the library. cosmos:v4-item Indicates this feature will be shipped as part of V4 release train Cosmos

Comments


dgpoulet commented Oct 4, 2022

Description
Discovered while testing failover behaviour in a multi-region topology with a single write region. Create a Cosmos DB client in gateway mode and issue a simple read query while a network testing tool drops all packets to the primary region endpoint, simulating a connection timeout. The client gets stuck in a retry loop, repeatedly trying to connect to the primary region. It neither retries against the secondary region nor ends the operation with a failure exception.

To Reproduce
1. Create a Cosmos DB account with two regions and a single write region.
2. Create a Java client app that connects to the account in gateway mode, with the "preferred regions" property set to the primary and secondary regions. Have the app connect to Cosmos DB and issue a simple query for a known piece of data.
3. Before running the app, use a tool like "clumsy" to drop all network packets to the primary endpoint's IP address.
4. Run the app.
5. The app retries against the primary endpoint over and over. The logs show 503 exceptions and messages indicating that the primary region is being marked as unavailable, but the app keeps retrying in a loop and never completes.
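When reproducing the hang, it can help to run the query under a deadline so the test harness reports "did not complete" instead of blocking forever. The following is a minimal sketch using only the JDK. `HangDetector` and `callWithDeadline` are hypothetical names, not part of the Cosmos DB SDK, and the sleeping lambda in `main` is only a stand-in for the real query call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical repro helper: runs a blocking call on a worker thread and
// gives up after a deadline, so a stuck call becomes a TimeoutException.
class HangDetector {

    static <T> T callWithDeadline(Callable<T> call, long timeout, TimeUnit unit) throws Exception {
        // Daemon thread so a stuck call cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "repro-call");
            thread.setDaemon(true);
            return thread;
        });
        Future<T> future = executor.submit(call);
        try {
            // Throws TimeoutException if the call has not finished in time.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // Stand-in for the stuck query: sleeps far longer than the deadline.
            callWithDeadline(() -> {
                Thread.sleep(60_000);
                return "never";
            }, 2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("Operation did not complete within the deadline");
        }
    }
}
```

In an actual repro, the lambda would wrap the query against the container. The deadline only makes the hang visible to the harness; it does not change the client's retry behaviour.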

Expected behavior
According to the documentation, I would expect Cosmos DB operations to retry against the unavailable region for a certain amount of time and then retry against the secondary region. Failing that, I would expect the operation to exit and surface an exception to the application.
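The expected behaviour described above can be modelled in a few lines. This is only an illustrative sketch of bounded retries per region followed by failover, written against the JDK. It is not the SDK's actual retry policy, and `RegionFailoverSketch`, `executeWithFailover`, `AllRegionsFailedException`, and the attempt limit are all hypothetical.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical model of the expected behaviour: try each preferred region a
// bounded number of times, move on to the next region, and surface an
// exception once every region has been exhausted.
class RegionFailoverSketch {

    /** Thrown when no region produced a result. */
    static class AllRegionsFailedException extends RuntimeException {
        AllRegionsFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static <T> T executeWithFailover(List<String> preferredRegions,
                                     int maxAttemptsPerRegion,
                                     Function<String, T> operation) {
        RuntimeException lastFailure = null;
        for (String region : preferredRegions) {
            for (int attempt = 1; attempt <= maxAttemptsPerRegion; attempt++) {
                try {
                    return operation.apply(region);
                } catch (RuntimeException e) {
                    lastFailure = e; // remember the failure, then retry or fail over
                }
            }
        }
        // Every region was tried: end the operation instead of looping forever.
        throw new AllRegionsFailedException("All regions failed: " + preferredRegions, lastFailure);
    }

    public static void main(String[] args) {
        // Simulated outage: the primary always fails, the secondary answers.
        String result = executeWithFailover(List.of("primary", "secondary"), 3, region -> {
            if (region.equals("primary")) {
                throw new RuntimeException("503 from " + region);
            }
            return "served by " + region;
        });
        System.out.println(result);
    }
}
```

The point of the model is that both loops are bounded, so the call always terminates: either a later region serves the request or the caller receives an exception.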

Setup:

  • OS: Windows 10
  • IDE: Eclipse
  • Library/Libraries: Cosmos DB Java SDK v4.37.0
  • Java version: 11
@ghost ghost added the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Oct 4, 2022
@ghost ghost removed the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Oct 4, 2022

ghost commented Oct 4, 2022

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar, @TheovanKraay

@TheovanKraay TheovanKraay added cosmos:v4-item Indicates this feature will be shipped as part of V4 release train cosmos-java-ecosystem-zn Client This issue points to a problem in the data-plane of the library. labels Oct 4, 2022

Veleniss commented Oct 5, 2022

(Repeats the original issue description, reproduction steps, expected behavior, and setup verbatim.)

@xinlian12
Member

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023