New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[QUERY] CosmosDB Client query latency spike: P99 0.3-3seconds #19750
Comments
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar, @anfeldma-ms |
@mbhaskar - can you please investigate this issue ? |
To be more specific, the P99 average did not go that high, around 40-60ms, however every 2-8 minutes, there is a sporadic 300ms-3seconds spike consistently. @mbhaskar look forward to your guidance |
Via TCP dump, we observe the latency is the network latency after tcpdump: "sporadic rise in the time b/w TCP ACK sent by the host on which our application is running and next batch of data received from Cosmos DB" We sometimes saw all of our service instances in the same VMSS to show 3 seconds latency spike, even though the incoming request count/QPS to our service remains the same (and pretty low (less than 20)). |
@joshfree Hi Josh, I saw you added this issue to "2021 April" milestone. Do you suggest you are aware of the problem and plan to resolve it by the end of April? Thank you for your prompt attention. |
@joshfree Adding a data point here: When we increase the load, we see a drop in latency. I am not certain if we are hitting the classic case seen due to Nagle's algorithm. Also briefly mentioned here: https://azure.microsoft.com/en-us/blog/performance-updates-and-tuning-best-practices-for-using-azure-database-for-postgresql/ |
@joshfree Wondering if we can get some insights into the issue. |
@mbhaskar - ping. |
@jeffpeiyt Do you have query metrics/cosmos diagnostics for queries where you see the latency issues? Also share the diagnostics for other instances where it takes less time too. |
@mbhaskar the "service side latency" in azure portal account metrics looks consistently low (1-1.5ms). however the client-side P99 every 2-5minutes we have a long latency for most queries. It seems to be some network-related issue rather than the cosmos server side. |
@mbhaskar , as @NikhitaKataria mentioned, we suspected it is related to the Nagle's algorithm and batching up TCP packets. When we increase the load, we see a drop in latency. Is there a setting in Cosmos client that enables "TCP_NODELAY"? |
@mbhaskar Checking on this again. Is it possible to set TCP_NODELAY on cosmos requests ? |
@mbhaskar follow up. Any settings related to TCP_NODELAY ? |
@jeffpeiyt As requested earlier if you can collect the diagnostics for the requests that are taking longer share them, it would be helpful to understand whats going and I will be able to comment better. This is the method that you can use to get the diagnostics for queries |
@mbhaskar makes sense! We are enabling the diagnostic logs this week and will get back to you by mid next week, if not earlier. |
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you! |
We have enabled the diagnostic logs and moved to an internal channel for further communication. |
thanks @jeffpeiyt -- I'll closed this issue since communication has moved to an internal channel. |
Query/Question
CosmosDB Client query latency P99 0.3-3seconds. However the P90 is pretty low (e.g. <20ms)
To be more specific, the average of P99 did not go that high, around 40-60ms; However every 2-8 minutes, there is a sporadic 300ms-3seconds spike consistently.
Why is this not a Bug or a feature Request?
Not sure if this is a bug. Could be related to how the client works.
Setup (please complete the following information if applicable):
Information Checklist
Our code is very similar to the following, and via logging we found
.iterator().hasNext()
that line, took 0.5-3 seconds.Our query is simple
SELECT * FROM c WHERE c.host=somesite.com
We use those configs such as
idleEndpointTimeoutInHour
,maxConnectionPerEndpoint
etc with the default value, and with direct mode. Service and cosmos DB are in the same region, using private endpointServer side latency shows maximum latency is 10ms. So it should be some client-side issue.
Wonder whether the iterator.hasNext() internally has some locking that can make it very slow occasionally?
https://github.com/Azure-Samples/azure-cosmos-java-sql-api-samples/blob/main/src/main/java/com/azure/cosmos/examples/documentcrud/sync/DocumentCRUDQuickstart.java#L169
The text was updated successfully, but these errors were encountered: