Don't continue running Monitor collectors if we're already timed out#4706
Pull request overview
This PR aims to stop running additional cluster monitor collectors once the monitoring context has timed out/canceled, and adds unit coverage for the skip behavior.
Changes:
- Add a parallelism field to Monitor to control collector goroutine concurrency (defaulting to MONITOR_GOROUTINES_PER_CLUSTER).
- Short-circuit timeCall when ctx.Err() != nil, emitting a new monitor.cluster.collector.skipped gauge and returning a wrapped error.
- Extend cluster_test.go to cover the "timeout/cancel during collector causes subsequent collectors to be skipped" scenario.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/monitor/cluster/cluster.go | Adds collector-skipping on context cancellation, introduces configurable parallelism, and emits a new “skipped collector” metric. |
| pkg/monitor/cluster/cluster_test.go | Adds a unit test case asserting collectors are skipped after cancellation and adjusts context usage and monitor parallelism for determinism. |
Comments suppressed due to low confidence (1)
pkg/monitor/cluster/cluster.go:306
- This change skips work inside timeCall when ctx is done, but Monitor() still calls wg.Go for every collector. With SetLimit, this can still end up spawning new goroutines after the context has been canceled (they just return quickly). If the intent is to stop spawning new goroutines once timed out, add a ctx.Err() check in the collectors loop before calling wg.Go and handle remaining collectors without starting goroutines (optionally still emitting the skipped metric / returning a skip error for each).
// Run up to mon.parallelism (default: MONITOR_GOROUTINES_PER_CLUSTER)
// goroutines for collecting metrics
wg := new(errgroup.Group)
wg.SetLimit(mon.parallelism)
// Create a channel capable of buffering one error from every collector
errChan := make(chan error, len(mon.collectors))
for _, f := range mon.collectors {
	wg.Go(func() error {
		innerErr := mon.timeCall(ctx, f)
		if innerErr != nil {
			// NOTE: The channel only has room to accommodate one error per
			// collector, so if a collector needs to return multiple errors
			// they should be joined into a single one (see errors.Join)
			// before being added.
			errChan <- innerErr
		}
		return nil
	})
}
…mit metrics when we have timed out
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
pkg/monitor/cluster/cluster.go:306
- This change skips work inside collectors when ctx is canceled, but Monitor() still schedules a wg.Go goroutine for every remaining collector (they just return immediately). If the intent is to “stop spawning new goroutines” once timed out, consider checking ctx.Err() in the for-loop before calling wg.Go and skipping/emitting the remaining collectors without launching additional goroutines.
// Run up to mon.parallelism (default: MONITOR_GOROUTINES_PER_CLUSTER)
// goroutines for collecting metrics
wg := new(errgroup.Group)
wg.SetLimit(mon.parallelism)
// Create a channel capable of buffering one error from every collector
errChan := make(chan error, len(mon.collectors))
for _, f := range mon.collectors {
	wg.Go(func() error {
		innerErr := mon.timeCall(ctx, f)
		if innerErr != nil {
			// NOTE: The channel only has room to accommodate one error per
			// collector, so if a collector needs to return multiple errors
			// they should be joined into a single one (see errors.Join)
			// before being added.
			errChan <- innerErr
		}
		return nil
	})
}
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
The e2e failure is a flake; going to merge.
What this PR does / why we need it:
If we've timed out, stop spawning new goroutines.
Test plan for issue:
Unit tests
Is there any documentation that needs to be updated for this PR?
N/A
How do you know this will function as expected in production?
Testing