Fix MimirGoThreadsTooHigh runbook #11845

Merged 4 commits on Jun 27, 2025
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -127,7 +127,7 @@
* [ENHANCEMENT] Alerts: Add alerts for invalid cluster validation labels. #11255 #11282 #11413
* [ENHANCEMENT] Dashboards: Improve "Kafka 100th percentile end-to-end latency when ingesters are running (outliers)" panel, computing the baseline latency on `max(10, 10%)` of ingesters instead of a fixed 10 replicas. #11581
* [ENHANCEMENT] Dashboards: Add "per-query memory consumption" and "fallback to Prometheus' query engine" panels to the Queries dashboard. #11626
-* [ENHANCEMENT] Alerts: add `MimirGoThreadsTooHigh` alert. #11836
+* [ENHANCEMENT] Alerts: Add `MimirGoThreadsTooHigh` alert. #11836 #11845
* [ENHANCEMENT] Dashboards: Add autoscaling row for ruler query-frontends to `Mimir / Remote ruler reads` dashboard. #11838
* [CHANGE] Alerts: Update query for `MimirBucketIndexNotUpdated`. Use `max_over_time` to prevent alert firing when pods rotate. #11311, #11426
* [CHANGE] Alerts: Make alerting threshold for `DistributorGcUsesTooMuchCpu` configurable. #11508.
4 changes: 2 additions & 2 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1826,12 +1826,12 @@ How it **works**:
- In Go, concurrency is handled via goroutines, which are lightweight threads managed by the Go runtime.
- Goroutines are multiplexed onto a small number of actual OS threads by the Go scheduler.
- Go threads are limited to 10K. When this limit is reached, the application panics with error like `runtime: program exceeds 10000-thread limit`.
-- If a goroutine makes a syscall that blocks (e.g. network I/O, disk I/O, ...), the Go runtime will try to schedule other goroutines on other OS threads, starting new threads on-demand.
+- If a goroutine makes a call to a blocking syscall, for example, disk I/O, the Go runtime tries to schedule other goroutines on other OS threads. This process starts new threads on-demand. Note that network syscalls use asynchronous I/O under the hood, and therefore, don't create this problem.
- Idle go threads are never terminated ([issue](https://github.com/golang/go/issues/14592)), so once an application has a spike in the number of go threads, the process needs to be restarted to get back to a low number of threads.

How to **investigate**:

-- Check the process stack trace to find common patterns in where the goroutines were blocked (typically a syscall):
+- Check the process stack trace to find common patterns in where the goroutines are blocked on syscall:
- If the application panicked with error like `runtime: program exceeds 10000-thread limit`, check the panic stack trace
- If the application has not panicked yet, issue `kill -QUIT <pid>` to dump the current stack trace of the process

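To make the thread accounting described in the runbook more concrete, here is a minimal Go sketch that reads the number of OS threads the runtime has created and compares it with the configured thread limit. It relies only on the standard library (`runtime/pprof` and `runtime/debug`). The `nearLimit` helper and the `0.8` ratio are hypothetical and chosen for illustration; they are not taken from the `MimirGoThreadsTooHigh` alert definition.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"runtime/pprof"
)

// threadCount returns the number of OS threads the Go runtime has created,
// as reported by the built-in "threadcreate" profile.
func threadCount() int {
	return pprof.Lookup("threadcreate").Count()
}

// maxThreads reports the current thread limit. debug.SetMaxThreads returns
// the previous setting, so the limit is read by setting a value and then
// immediately restoring the previous one.
func maxThreads() int {
	prev := debug.SetMaxThreads(10000)
	debug.SetMaxThreads(prev)
	return prev
}

// nearLimit is a hypothetical helper: it reports whether count has reached
// the given fraction (ratio) of limit.
func nearLimit(count, limit int, ratio float64) bool {
	return float64(count) >= ratio*float64(limit)
}

func main() {
	count, limit := threadCount(), maxThreads()
	// The 0.8 ratio is an illustrative assumption, not the alert's threshold.
	fmt.Printf("threads=%d limit=%d nearLimit=%t\n",
		count, limit, nearLimit(count, limit, 0.8))
}
```

Because the runbook notes that idle threads are never terminated, a count read this way is expected to stay high after a spike until the process restarts.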