Skip to content

Don't continue running Monitor collectors if we're already timed out#4706

Merged
hawkowl merged 5 commits intomasterfrom
hawkowl/monitor-skip-if-expired
Mar 30, 2026
Merged

Don't continue running Monitor collectors if we're already timed out#4706
hawkowl merged 5 commits intomasterfrom
hawkowl/monitor-skip-if-expired

Conversation

@hawkowl
Copy link
Copy Markdown
Collaborator

@hawkowl hawkowl commented Mar 24, 2026

What this PR does / why we need it:

If we've timed out, stop spawning new goroutines.

Test plan for issue:

Unit tests

Is there any documentation that needs to be updated for this PR?

N/A

How do you know this will function as expected in production?

Testing

@hawkowl hawkowl added bug Something isn't working go Pull requests that update Go code skippy pull requests raised by member of Team Skippy labels Mar 24, 2026
Copilot AI review requested due to automatic review settings March 24, 2026 04:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to stop running additional cluster monitor collectors once the monitoring context has timed out/canceled, and adds unit coverage for the skip behavior.

Changes:

  • Add a parallelism field to Monitor to control collector goroutine concurrency (defaulting to MONITOR_GOROUTINES_PER_CLUSTER).
  • Short-circuit timeCall when ctx.Err() != nil, emitting a new monitor.cluster.collector.skipped gauge and returning a wrapped error.
  • Extend cluster_test.go to cover the “timeout/cancel during collector causes subsequent collectors to be skipped” scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/monitor/cluster/cluster.go Adds collector-skipping on context cancellation, introduces configurable parallelism, and emits a new “skipped collector” metric.
pkg/monitor/cluster/cluster_test.go Adds a unit test case asserting collectors are skipped after cancellation and adjusts context usage and monitor parallelism for determinism.
Comments suppressed due to low confidence (1)

pkg/monitor/cluster/cluster.go:306

  • This change skips work inside timeCall when ctx is done, but Monitor() still calls wg.Go for every collector. With SetLimit, this can still end up spawning new goroutines after the context has been canceled (they just return quickly). If the intent is to stop spawning new goroutines once timed out, add a ctx.Err() check in the collectors loop before calling wg.Go and handle remaining collectors without starting goroutines (optionally still emitting the skipped metric / returning a skip error for each).
	// Run up to mon.parallelism (default: MONITOR_GOROUTINES_PER_CLUSTER)
	// goroutines for collecting metrics
	wg := new(errgroup.Group)
	wg.SetLimit(mon.parallelism)

	// Create a channel capable of buffering one error from every collector
	errChan := make(chan error, len(mon.collectors))

	for _, f := range mon.collectors {
		wg.Go(func() error {
			innerErr := mon.timeCall(ctx, f)
			if innerErr != nil {
				// NOTE: The channel only has room to accommodate one error per
				// collector, so if a collector needs to return multiple errors
				// they should be joined into a single one (see errors.Join)
				// before being added.
				errChan <- innerErr
			}
			return nil
		})
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hawkowl hawkowl force-pushed the hawkowl/monitor-skip-if-expired branch from e75e698 to e1b03b6 Compare March 24, 2026 05:19
@tsatam tsatam added the next-release To be included in the next RP release rollout label Mar 25, 2026
Copilot AI review requested due to automatic review settings March 27, 2026 00:01
@hawkowl hawkowl force-pushed the hawkowl/monitor-skip-if-expired branch from e1b03b6 to d2201bf Compare March 27, 2026 00:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/monitor/cluster/cluster.go:306

  • This change skips work inside collectors when ctx is canceled, but Monitor() still schedules a wg.Go goroutine for every remaining collector (they just return immediately). If the intent is to “stop spawning new goroutines” once timed out, consider checking ctx.Err() in the for-loop before calling wg.Go and skipping/emitting the remaining collectors without launching additional goroutines.
	// Run up to mon.parallelism (default: MONITOR_GOROUTINES_PER_CLUSTER)
	// goroutines for collecting metrics
	wg := new(errgroup.Group)
	wg.SetLimit(mon.parallelism)

	// Create a channel capable of buffering one error from every collector
	errChan := make(chan error, len(mon.collectors))

	for _, f := range mon.collectors {
		wg.Go(func() error {
			innerErr := mon.timeCall(ctx, f)
			if innerErr != nil {
				// NOTE: The channel only has room to accommodate one error per
				// collector, so if a collector needs to return multiple errors
				// they should be joined into a single one (see errors.Join)
				// before being added.
				errChan <- innerErr
			}
			return nil
		})
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings March 27, 2026 00:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Collaborator

@tuxerrante tuxerrante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hawkowl
Copy link
Copy Markdown
Collaborator Author

hawkowl commented Mar 30, 2026

the e2e fail is a flake, going to merge

@hawkowl hawkowl merged commit 466b77a into master Mar 30, 2026
28 of 31 checks passed
@hawkowl hawkowl deleted the hawkowl/monitor-skip-if-expired branch March 30, 2026 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working go Pull requests that update Go code next-release To be included in the next RP release rollout skippy pull requests raised by member of Team Skippy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants