Poor resource utilisation of TSDs #2080

Open
switchtrue opened this issue Feb 16, 2021 · 2 comments
switchtrue commented Feb 16, 2021

We are running OpenTSDB against Google Bigtable and are having an issue where multiple parallel requests appear to lock up the TSD to the point where it is unable to accept new requests despite available resources.

We are running the TSDs as an auto-scaling instance group on Google Cloud although I don't think this is relevant to the issue.

Specifically, what we are seeing is:

  • Low CPU (~10%) and memory usage on the TSDs
  • 40% load on Bigtable with reasonable (well within expected bounds) read/writes
  • Low SSD utilisation on Bigtable
  • Highly inconsistent latencies on the same TSDB queries with the same responses (5 to 70 seconds)
  • Failing health checks (timeouts on /api/stats) despite all of the above
  • Increasing the CPU count per TSD alleviates the issue, despite the low CPU utilisation

It almost feels like we are hitting a threshold on the number of concurrent requests - if we have X long-running (5-10s) queries in flight, we are simply unable to accept new requests despite having the capacity to do so. I am unable to find anything in the documentation that explains how the OpenTSDB webserver works, or any tuning parameters around this.

We compiled TSDB ourselves and are simply running the TSDs with tsdb tsd, directing all traffic directly to port 4242 via a GCP load balancer, i.e. we are not running nginx or anything in front of the TSDs.
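
For reference, the launch is roughly the following (a minimal sketch - the config path and the --staticroot/--cachedir values are illustrative, not our exact setup):

    # Sketch of how a TSD is launched; paths are illustrative.
    tsdb tsd \
      --config=/etc/opentsdb/opentsdb.conf \
      --port=4242 \
      --staticroot=/usr/share/opentsdb/static \
      --cachedir=/tmp/opentsdb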

We have tsd.network.async_io set to true and have tried increasing tsd.network.worker_threads to 4 * CPU cores to get the TSDs to do more work, but this seems to have had no effect.
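
The relevant part of opentsdb.conf looks roughly like this (a sketch - the worker count shown assumes an 8-core instance):

    # Sketch of the relevant opentsdb.conf settings (values illustrative)
    tsd.network.async_io = true
    # 4 x CPU cores, e.g. 32 on an 8-core instance
    tsd.network.worker_threads = 32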

Our other thought is that we could be stuck behind stop-the-world garbage collection. We currently have insufficient monitoring to prove this but are working to add it.
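
The plan is to turn on GC logging on the TSD JVMs, roughly as below (a sketch; the flags shown are the JDK 8 style, and it assumes the stock tsdb wrapper script picks up a JVMARGS environment variable - otherwise the flags go wherever the JVM is launched):

    # Sketch: enable GC logging for the TSD JVM (JDK 8 flags; JDK 9+ uses -Xlog:gc*).
    # Assumes the stock `tsdb` wrapper honours JVMARGS from the environment.
    export JVMARGS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/opentsdb/gc.log"
    tsdb tsd --config=/etc/opentsdb/opentsdb.conf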

Does anyone have any idea as to why we might be seeing these issues? Are our assumptions possible or are we looking in the wrong place? Is there something additional we can look to tune?

Thanks, Mike

manolama added the bug label Mar 19, 2021
manolama (Member) commented

I need to contact GCP and see if I can get my Bigtable instance back. The last time I tried it out I noticed some odd behavior around the Bigtable gRPC client, where it looked like it was taking a long time (as you said, 5 seconds to over a minute) with just a single test query. I think the "prod" TSDB code was still working OK with the driver that was set at the time I tried it, but I haven't checked since.

Some other things to try:

  • If you stand up multiple TSD instances behind a load balancer, does the situation improve? It could, because there is likely some thread starvation happening in a single TSD (which would explain the low CPU utilization).
  • Are you querying through the same TSD that's writing data? If so, do the writes drop during that period?
  • The failing stats checks would be due to being unable to reach Bigtable as that call checks the UID assignment row.
  • Could you capture a thread dump when the queries are stalled, please? I'm guessing they're all waiting for data, or one or two may be looping on something (see the capture sketch below this list).
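
Something along these lines should do for the dump (a sketch - jps/jstack ship with the JDK, and the pid is whatever the TSD's JVM is running as):

    # Sketch: capture a thread dump from a running TSD
    jps -l                           # find the TSD's JVM pid
    jstack -l <pid> > threaddump.txt
    # Alternatively: kill -3 <pid>   (the dump goes to the TSD's stdout/stderr)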

switchtrue (Author) commented

  • Increasing the number of TSDs doesn't seem to have much impact on the situation. Similarly, neither does changing the number of workers via tsd.network.worker_threads although I have not pushed the latter aggressively.
  • We have a separate cluster of TSDs for writes - the write cluster seems unaffected. The read cluster does have tsd.mode set to rw, although it's not clear whether changing this would lead to any performance benefit (see the config sketch below, after the thread dump).
  • It's difficult to catch it in the act but I have included a thread dump (below) from a period where it's not performing at its best.

threaddump.txt
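
For completeness, the mode setting on the read cluster, and the change we are considering, roughly (a sketch; the ro value is assumed from the OpenTSDB configuration docs):

    # Sketch of the read cluster's opentsdb.conf mode setting
    tsd.mode = rw
    # Candidate change for a query-only cluster:
    # tsd.mode = ro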
