Thanos query fanout has high latency when a strict-store is unavailable #7276
Comments
Do you have this option enabled: https://github.com/thanos-io/thanos/blob/main/cmd/thanos/query.go#L212? It should solve your issue.
Yes, I tried with store.response-timeout=15s and then 5s, but it still shows similarly elevated latency. I do have query.timeout=30s. I notice in the code that store.response-timeout is implemented as a timer, a kind of manual timeout while waiting for the cl.Recv call (1). So if the cl.Recv call itself takes longer than that amount (e.g. 40s), the overall call takes that long as well, and the gRPC context for cl.Recv doesn't have a timeout specified, AFAICT.
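To illustrate what I mean, here is a minimal, self-contained sketch (not the actual Thanos code; recvFn, slowRecv and the helpers are made-up names) of the difference between observing a "manual" response-timeout timer around a blocking Recv and putting the deadline on the context itself:

```go
// Simplified illustration (not the actual Thanos code) of the difference
// between a "manual" response timeout observed around a blocking Recv call
// and a context deadline that actually cancels the in-flight call.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// recvFn stands in for a gRPC stream's Recv call; a real Recv only unblocks
// early if its context is cancelled.
type recvFn func(ctx context.Context) error

// slowRecv simulates a store that is down behind an LB: it hangs for a long
// time unless the context carries a deadline.
func slowRecv(ctx context.Context) error {
	select {
	case <-time.After(3 * time.Second): // stands in for a ~40s hang
		return errors.New("upstream finally failed")
	case <-ctx.Done():
		return ctx.Err()
	}
}

// withManualTimer only notices the timeout after Recv returns; if Recv itself
// blocks longer than the timeout, the caller blocks just as long, because
// nothing cancels the underlying call.
func withManualTimer(ctx context.Context, recv recvFn, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	err := recv(ctx) // may block far past the timeout
	if time.Now().After(deadline) {
		return fmt.Errorf("response timeout exceeded, recv result: %v", err)
	}
	return err
}

// withContextDeadline attaches the timeout to the context, so the blocked
// Recv is released as soon as the deadline fires.
func withContextDeadline(ctx context.Context, recv recvFn, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return recv(ctx)
}

func main() {
	start := time.Now()
	fmt.Println(withManualTimer(context.Background(), slowRecv, 500*time.Millisecond), time.Since(start).Round(time.Millisecond))

	start = time.Now()
	fmt.Println(withContextDeadline(context.Background(), slowRecv, 500*time.Millisecond), time.Since(start).Round(time.Millisecond))
}
```

The first call takes the full simulated hang even though the "response timeout" is 500ms; the second returns after roughly 500ms. That matches what we observe: with the store down, queries ride out the full hang regardless of store.response-timeout.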
I've dug through the code and I think I found what the problem is. When a store is listed as a strict store, it is always part of the active store set, so it will always receive incoming queries as long as the labels published by that store endpoint match. The problem is that when that store is completely down, like in my example above, the Dial's […]. This seems fine for individual queries, since queries time out "as expected". However, when a sizable number of queries go through this path (and time out), they fill up the "queue" from […].

Proposed fix: since the healthiness of a store is already checked every 5s (see ref 2), the query flow knows at query time which store is unhealthy (in this case, timing out). The original proposal (see ref 3) was to not ignore the strict store (and return a partial response) instead of completely forgetting about the store (and returning an illusory success response). We can still accommodate that idea but fail fast on an unhealthy strict store: check whether the strict store is unhealthy, don't even attempt to send the gRPC request to it, and return an error. If that SGTY, I can prep a PR to introduce this fail-fast behavior; a rough sketch of the idea is below the references.

Ref:
(1): thanos/pkg/store/proxy_heap.go, line 700 at 968899f
(2): "Check endpoint set every 5s", line 519 at 194e1fa
(3): Original proposal of […]
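Here is a rough, hypothetical sketch of that fail-fast behavior (endpointRef, fanOut and the health fields are illustrative names, not the real Thanos types): strict endpoints that the periodic health check has already marked unhealthy get an immediate error instead of a doomed gRPC call.

```go
// Hypothetical sketch of the fail-fast idea; endpointRef, fanOut, and the
// health fields are illustrative names, not the real Thanos types.
package main

import (
	"context"
	"fmt"
	"time"
)

// endpointRef is a stand-in for an entry in the querier's endpoint set.
type endpointRef struct {
	addr      string
	strict    bool      // registered via --store-strict / --endpoint-strict
	healthy   bool      // maintained by the periodic (~5s) health check
	lastCheck time.Time // when the health state was last refreshed
}

// fanOut sends the query to every matching endpoint, but fails fast on strict
// endpoints that the background health check has already marked as down,
// instead of dialing them and waiting for the query timeout.
func fanOut(ctx context.Context, endpoints []*endpointRef, query func(context.Context, *endpointRef) error) []error {
	var errs []error
	for _, e := range endpoints {
		if e.strict && !e.healthy {
			// Fail fast: return an error immediately rather than holding a
			// slot in the response queue until the gRPC call times out.
			errs = append(errs, fmt.Errorf("strict endpoint %s unhealthy (last checked %s ago), skipping",
				e.addr, time.Since(e.lastCheck).Truncate(time.Second)))
			continue
		}
		if err := query(ctx, e); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	endpoints := []*endpointRef{
		{addr: "thanos-query-sidecar-r1:10901", strict: true, healthy: true, lastCheck: time.Now()},
		{addr: "thanos-query-sidecar-r2:10901", strict: true, healthy: false, lastCheck: time.Now().Add(-4 * time.Second)},
	}
	errs := fanOut(context.Background(), endpoints, func(ctx context.Context, e *endpointRef) error {
		return nil // pretend healthy endpoints answer quickly
	})
	for _, err := range errs {
		fmt.Println(err)
	}
}
```

The real change would presumably hook into wherever the proxy builds the store set for a request; the point is simply to consult the health state that is already refreshed every 5s and return an error for unhealthy strict endpoints instead of queueing a request that can only time out.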
Thanos, Prometheus and Golang version used:
v0.29
Object Storage Provider:
Azure
What happened:
Our topology is roughly:
thanos query frontend -> thanos query global -> thanos query sidecars, one per region (r1, r2, r3), plus thanos query store
All of the thanos query sidecars and the store are listed as store-strict in the thanos-query-global CLI arguments, roughly as in the example below.
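For context, the query-global invocation looks roughly like this (hostnames, ports, and exact timeout values here are illustrative, not our exact config):

```sh
thanos query \
  --query.timeout=30s \
  --store.response-timeout=15s \
  --store-strict=thanos-query-sidecar-r1.internal:10901 \
  --store-strict=thanos-query-sidecar-r2.internal:10901 \
  --store-strict=thanos-query-sidecar-r3.internal:10901 \
  --store-strict=thanos-query-store.internal:10901
```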
When we performed region maintenance in r2 and scaled thanos-query-sidecar-r2 down to 0 pods, we noticed an immediate elevation of latency across the whole query path.
Looking at pod:10902/stores on thanos-query-global, we do see thanos-query-sidecar-r2 showing up as Down.
thanos-query-global is deployed in region r1, and we don't have a service mesh set up for the pods, so we use cluster-internal LB IPs to fan out to the thanos-query-sidecar-* services.
What you expected to happen:
There are no pods behind the thanos-query-sidecar-r2 service, so we would expect the querier to fail fast and not affect query latency, other than returning a partial result.
How to reproduce it (as minimally and precisely as possible):
Set up a querier, say query-sidecar-r2, in a different region (r2) than the query-global querier (region r1), and wire it up as store-strict (or endpoint-strict).
Once the stack is up and running, scale query-sidecar-r2 down to 0.
Send a simple query and observe the increased latency (illustrative commands below).
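Something like the following is enough to see it (the namespace, deployment, and service names are placeholders for whatever your setup uses):

```sh
# Take the regional querier down entirely.
kubectl -n monitoring scale deployment query-sidecar-r2 --replicas=0

# Send a trivial instant query through the global querier's HTTP port and time it.
time curl -s 'http://query-global:10902/api/v1/query?query=up' > /dev/null
```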
Full logs to relevant components:
Anything else we need to know: