
[QUERY] CosmosDB Client query latency spike: P99 0.3-3seconds #19750

Closed
jeffpeiyt opened this issue Mar 10, 2021 · 18 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • Cosmos
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: This issue needs attention from the Azure service team or SDK team.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
Comments


jeffpeiyt commented Mar 10, 2021

Query/Question
CosmosDB client query latency P99 is 0.3-3 seconds, while P90 is quite low (e.g. <20 ms).

To be more specific, the P99 average did not go that high, around 40-60 ms; however, every 2-8 minutes there is consistently a sporadic 300 ms-3 second spike.

Why is this not a Bug or a feature Request?
Not sure whether this is a bug; it could be related to how the client works.

Setup (please complete the following information if applicable):

  • OS: CentOS 7
  • IDE: N/A (running in production)
  • Version of the library used: 4.11.0 (Java)

Information Checklist

Our code is very similar to the following, and via logging we found that the .iterator().hasNext() line took 0.5-3 seconds.

Our query is simple: SELECT * FROM c WHERE c.host = 'somesite.com'

We use configs such as idleEndpointTimeoutInHour and maxConnectionPerEndpoint with their default values, in direct mode. The service and Cosmos DB are in the same region, using a private endpoint.

Server-side latency shows a maximum of 10 ms, so it is likely a client-side issue.

We wonder whether iterator.hasNext() internally has some locking that can occasionally make it very slow.

https://github.com/Azure-Samples/azure-cosmos-java-sql-api-samples/blob/main/src/main/java/com/azure/cosmos/examples/documentcrud/sync/DocumentCRUDQuickstart.java#L169

        // Print the first result, creating the iterator only once:
        // each call to iterator() on the paged iterable starts a new iteration,
        // so the original sample's two iterator() calls did double work.
        Iterator<Family> resultIterator = filteredFamilies.iterator();
        if (resultIterator.hasNext()) {
            Family family = resultIterator.next();
            // SLF4J uses {} placeholders, not %s format specifiers
            logger.info("First query result: Family with (/id, partition key) = ({},{})",
                    family.getId(), family.getLastName());
        }
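Since the slow line is hasNext(), one way to localize the delay is to time each call directly. Below is a minimal, self-contained plain-Java sketch; the class name, helper, and 100 ms threshold are illustrative assumptions, not SDK features. Pointing the same wrapper at a CosmosPagedIterable's iterator would surface the sporadic 0.5-3 s spikes described above:

```java
import java.util.Iterator;
import java.util.List;

public class HasNextTimer {

    // Warn when a single hasNext() call exceeds this threshold (illustrative value).
    static final long THRESHOLD_NANOS = 100_000_000L; // 100 ms

    // Iterates to the end, timing every hasNext() call; returns the item count.
    static <T> int drainAndTime(Iterator<T> it) {
        int count = 0;
        while (true) {
            long start = System.nanoTime();
            boolean hasNext = it.hasNext(); // for paged iterables, a network fetch can happen here
            long elapsedNanos = System.nanoTime() - start;
            if (elapsedNanos > THRESHOLD_NANOS) {
                System.err.printf("slow hasNext(): %.1f ms%n", elapsedNanos / 1e6);
            }
            if (!hasNext) {
                return count;
            }
            it.next();
            count++;
        }
    }

    public static void main(String[] args) {
        // An in-memory iterator is fast; wrapping the Cosmos query's iterator
        // instead would log each page fetch that crosses the threshold.
        System.out.println("items: " + drainAndTime(List.of("a", "b", "c").iterator()));
    }
}
```

Logging the elapsed time per call, rather than per query, distinguishes a slow first page (query startup) from a slow later page (network or pagination stall).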
  • Query Added
  • Setup information Added
msftbot bot added the needs-triage, customer-reported, and question labels Mar 10, 2021

msftbot bot commented Mar 10, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar, @anfeldma-ms

@kushagraThapar

@mbhaskar - can you please investigate this issue ?


jeffpeiyt commented Mar 10, 2021

To be more specific, the P99 average did not go that high, around 40-60 ms; however, every 2-8 minutes there is consistently a sporadic 300 ms-3 second spike. @mbhaskar, we look forward to your guidance.


jeffpeiyt commented Mar 11, 2021

Via tcpdump, we observe that the latency is network latency: a "sporadic rise in the time between the TCP ACK sent by the host on which our application is running and the next batch of data received from Cosmos DB."

We sometimes saw all of our service instances in the same VMSS show a 3-second latency spike, even though the incoming request count/QPS to our service remained the same and quite low (fewer than 20).

joshfree added the Client and Cosmos labels Mar 11, 2021
msftbot bot removed the needs-triage label Mar 11, 2021
joshfree added this to the [2021] April milestone Mar 11, 2021

jeffpeiyt commented Mar 15, 2021

@joshfree Hi Josh, I saw you added this issue to the "2021 April" milestone. Does that mean you are aware of the problem and plan to resolve it by the end of April? Thank you for your prompt attention.


NikhitaKataria commented Mar 15, 2021

@joshfree Adding a data point here: when we increase the load, we see a drop in latency. I am not certain whether we are hitting the classic case caused by Nagle's algorithm.

Also briefly mentioned here: https://azure.microsoft.com/en-us/blog/performance-updates-and-tuning-best-practices-for-using-azure-database-for-postgresql/

@NikhitaKataria

@joshfree Wondering if we can get some insights into the issue.

@kushagraThapar

@mbhaskar - ping.


mbhaskar commented Apr 1, 2021

@jeffpeiyt Do you have query metrics/Cosmos diagnostics for the queries where you see the latency issues? Please also share diagnostics for instances where the queries take less time, for comparison.

@jeffpeiyt

@mbhaskar The "service side latency" in the Azure portal account metrics looks consistently low (1-1.5 ms). However, on the client side, every 2-5 minutes we see long P99 latency for most queries. It seems to be a network-related issue rather than a Cosmos server-side one.


jeffpeiyt commented Apr 5, 2021

@mbhaskar, as @NikhitaKataria mentioned, we suspect it is related to Nagle's algorithm batching up TCP packets: when we increase the load, we see a drop in latency. Is there a setting in the Cosmos client that enables TCP_NODELAY?

https://azure.microsoft.com/en-us/blog/performance-updates-and-tuning-best-practices-for-using-azure-database-for-postgresql/
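For context on the question being asked: at the plain-socket level, disabling Nagle's algorithm means enabling the TCP_NODELAY option before traffic flows. Whether the Cosmos client's direct-mode transport exposes this is exactly the open question here; the sketch below shows only the standard-library option itself (class and helper names are illustrative), not an SDK setting:

```java
import java.io.IOException;
import java.net.Socket;

public class NoDelayExample {

    // Returns a socket with Nagle's algorithm disabled, so small writes are
    // sent immediately instead of being held back to coalesce with later data.
    static Socket newNoDelaySocket() throws IOException {
        Socket socket = new Socket();
        socket.setTcpNoDelay(true); // TCP_NODELAY can be set before connecting
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (Socket s = newNoDelaySocket()) {
            System.out.println("TCP_NODELAY = " + s.getTcpNoDelay());
        }
    }
}
```

The symptom described above (small, infrequent requests delayed until more data or a delayed ACK arrives, but improving under load) is the classic interaction of Nagle's algorithm with delayed ACKs, which is why this option is the natural suspect.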

@NikhitaKataria

@mbhaskar Checking on this again. Is it possible to set TCP_NODELAY on Cosmos requests?

@jeffpeiyt

@mbhaskar Following up: are there any settings related to TCP_NODELAY?

@mbhaskar

@jeffpeiyt As requested earlier, if you can collect and share the diagnostics for the requests that are taking longer, it will help us understand what's going on, and I will be able to comment better. This is the method you can use to get the diagnostics for queries:
https://docs.microsoft.com/en-us/java/api/com.azure.cosmos.models.feedresponse.getcosmosdiagnostics?view=azure-java-stable#com_azure_cosmos_models_FeedResponse_getCosmosDiagnostics__
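A sketch of how those diagnostics can be pulled per page of query results with the linked FeedResponse.getCosmosDiagnostics API. This assumes the azure-cosmos library and the Family type from the quickstart sample linked earlier; the helper name and threshold-based filtering are illustrative, so verify the calls against the SDK version in use:

```java
import java.time.Duration;

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.azure.cosmos.models.FeedResponse;

public class QueryDiagnosticsLogger {

    // Logs Cosmos diagnostics for any query page whose client-observed
    // duration exceeds the given threshold (illustrative helper, not an SDK method).
    static void logSlowPages(CosmosContainer container, String query, Duration threshold) {
        container.queryItems(query, new CosmosQueryRequestOptions(), Family.class)
                .iterableByPage()
                .forEach((FeedResponse<Family> page) -> {
                    if (page.getCosmosDiagnostics().getDuration().compareTo(threshold) > 0) {
                        // The diagnostics string includes per-request transport timings,
                        // which is what distinguishes network stalls from server latency.
                        System.err.println(page.getCosmosDiagnostics());
                    }
                });
    }
}
```

Logging only pages above a threshold keeps the output manageable, since full diagnostics strings are verbose and the spikes here occur only every few minutes.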


jeffpeiyt commented Apr 27, 2021

@mbhaskar Makes sense! We are enabling the diagnostic logs this week and will get back to you by the middle of next week, if not earlier.

kushagraThapar added the needs-author-feedback label May 26, 2021
msftbot bot added the no-recent-activity label Jun 2, 2021

msftbot bot commented Jun 2, 2021

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@jeffpeiyt

We have enabled the diagnostic logs and moved to an internal channel for further communication.

msftbot bot added the needs-team-attention label and removed the needs-author-feedback and no-recent-activity labels Jun 2, 2021

lilyjma commented Mar 31, 2022

Thanks @jeffpeiyt -- I'll close this issue since communication has moved to an internal channel.

lilyjma closed this as completed Mar 31, 2022
github-actions bot locked and limited conversation to collaborators Apr 12, 2023

6 participants