Don't continue running Monitor collectors if we're already timed out by hawkowl · Pull Request #4706 · Azure/ARO-RP

hawkowl · 2026-03-24T04:53:57Z

What this PR does / why we need it:

If we've timed out, stop spawning new goroutines.

Test plan for issue:

Unit tests

Is there any documentation that needs to be updated for this PR?

N/A

How do you know this will function as expected in production?

Testing

Copilot

Pull request overview

This PR aims to stop running additional cluster monitor collectors once the monitoring context has timed out/canceled, and adds unit coverage for the skip behavior.

Changes:

Add a parallelism field to Monitor to control collector goroutine concurrency (defaulting to MONITOR_GOROUTINES_PER_CLUSTER).
Short-circuit timeCall when ctx.Err() != nil, emitting a new monitor.cluster.collector.skipped gauge and returning a wrapped error.
Extend cluster_test.go to cover the “timeout/cancel during collector causes subsequent collectors to be skipped” scenario.
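The short-circuit described in the second change can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the signature of timeCall, the collector type, and skipCounter (standing in for the emitter of the monitor.cluster.collector.skipped gauge) are assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// collector is a hypothetical stand-in for a Monitor collector function.
type collector func(ctx context.Context) error

// skipCounter is a hypothetical stand-in for the metrics emitter. The PR
// emits a monitor.cluster.collector.skipped gauge instead. It is not safe
// for concurrent use; this example calls it sequentially.
type skipCounter struct{ skipped int }

// timeCall runs f unless ctx is already done. In that case it records the
// skip and returns an error that wraps ctx.Err().
func timeCall(ctx context.Context, m *skipCounter, name string, f collector) error {
	if err := ctx.Err(); err != nil {
		m.skipped++
		return fmt.Errorf("skipping collector %s: %w", name, err)
	}
	return f(ctx)
}

// demo calls one collector on a live context and one on a cancelled context.
// It reports how many collectors ran, how many were skipped, and whether the
// skip error still matches context.Canceled.
func demo() (ran, skipped int, wrapsCanceled bool) {
	m := &skipCounter{}
	f := func(context.Context) error { ran++; return nil }

	ctx, cancel := context.WithCancel(context.Background())
	_ = timeCall(ctx, m, "first", f) // context is live, so f runs
	cancel()
	err := timeCall(ctx, m, "second", f) // context is done, so f is skipped

	return ran, m.skipped, errors.Is(err, context.Canceled)
}

func main() {
	ran, skipped, wraps := demo()
	fmt.Println(ran, skipped, wraps)
}
```

Wrapping with %w keeps the cause visible to callers, so errors.Is(err, context.Canceled) or errors.Is(err, context.DeadlineExceeded) still matches on the returned error.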

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
pkg/monitor/cluster/cluster.go	Adds collector-skipping on context cancellation, introduces configurable parallelism, and emits a new “skipped collector” metric.
pkg/monitor/cluster/cluster_test.go	Adds a unit test case asserting collectors are skipped after cancellation and adjusts context usage and monitor parallelism for determinism.

Comments suppressed due to low confidence (1)

pkg/monitor/cluster/cluster.go:306

This change skips work inside timeCall when ctx is done, but Monitor() still calls wg.Go for every collector. With SetLimit, this can still end up spawning new goroutines after the context has been canceled (they just return quickly). If the intent is to stop spawning new goroutines once timed out, add a ctx.Err() check in the collectors loop before calling wg.Go and handle remaining collectors without starting goroutines (optionally still emitting the skipped metric / returning a skip error for each).

	// Run up to mon.parallelism (default: MONITOR_GOROUTINES_PER_CLUSTER)
	// goroutines for collecting metrics
	wg := new(errgroup.Group)
	wg.SetLimit(mon.parallelism)

	// Create a channel capable of buffering one error from every collector
	errChan := make(chan error, len(mon.collectors))

	for _, f := range mon.collectors {
		wg.Go(func() error {
			innerErr := mon.timeCall(ctx, f)
			if innerErr != nil {
				// NOTE: The channel only has room to accommodate one error per
				// collector, so if a collector needs to return multiple errors
				// they should be joined into a single one (see errors.Join)
				// before being added.
				errChan <- innerErr
			}
			return nil
		})
	}
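One way to act on the suggestion above is to check ctx.Err() in the loop itself, after a concurrency slot has been acquired and before a goroutine is started. The sketch below is not the PR's code. It swaps errgroup for sync.WaitGroup plus a buffered channel used as a semaphore so that it depends only on the standard library, and runCollectors, collector, and demo are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// collector is a hypothetical stand-in for a Monitor collector function.
type collector func(ctx context.Context) error

// runCollectors runs at most limit collectors at a time. Once ctx is done it
// stops starting goroutines and records a skip error for each remaining
// collector instead. It returns the number of goroutines started and the
// errors gathered.
func runCollectors(ctx context.Context, limit int, collectors []collector) (int, []error) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, limit)
	// One slot per collector: each collector sends at most one error.
	errChan := make(chan error, len(collectors))
	started := 0

	for i, f := range collectors {
		// Wait for a free slot, much as errgroup's Go blocks under SetLimit.
		sem <- struct{}{}
		// Check after acquiring the slot: the context may have expired
		// while this loop was blocked waiting for one.
		if err := ctx.Err(); err != nil {
			<-sem
			errChan <- fmt.Errorf("skipping collector %d: %w", i, err)
			continue
		}
		started++
		wg.Add(1)
		go func(f collector) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := f(ctx); err != nil {
				errChan <- err
			}
		}(f)
	}
	wg.Wait()
	close(errChan)

	var errs []error
	for err := range errChan {
		errs = append(errs, err)
	}
	return started, errs
}

// demo cancels the context from inside the first collector, simulating a
// timeout mid-run, with a limit of one so the ordering is deterministic.
func demo() (int, int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	collectors := []collector{
		func(context.Context) error { cancel(); return nil },
		func(context.Context) error { return nil },
		func(context.Context) error { return nil },
	}
	started, errs := runCollectors(ctx, 1, collectors)
	return started, len(errs)
}

func main() {
	started, skipped := demo()
	fmt.Println("started:", started, "skipped:", skipped)
}
```

With a limit of one, the loop cannot acquire a slot for the second collector until the first has returned, so the remaining two collectors are reported as skipped without a goroutine being started for either.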

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/monitor/cluster/cluster.go:306

This change skips work inside collectors when ctx is canceled, but Monitor() still schedules a wg.Go goroutine for every remaining collector (they just return immediately). If the intent is to “stop spawning new goroutines” once timed out, consider checking ctx.Err() in the for-loop before calling wg.Go and skipping/emitting the remaining collectors without launching additional goroutines.

	// Run up to mon.parallelism (default: MONITOR_GOROUTINES_PER_CLUSTER)
	// goroutines for collecting metrics
	wg := new(errgroup.Group)
	wg.SetLimit(mon.parallelism)

	// Create a channel capable of buffering one error from every collector
	errChan := make(chan error, len(mon.collectors))

	for _, f := range mon.collectors {
		wg.Go(func() error {
			innerErr := mon.timeCall(ctx, f)
			if innerErr != nil {
				// NOTE: The channel only has room to accommodate one error per
				// collector, so if a collector needs to return multiple errors
				// they should be joined into a single one (see errors.Join)
				// before being added.
				errChan <- innerErr
			}
			return nil
		})
	}
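The NOTE in the snippet above says a collector with several failures should join them into one error before sending it on errChan. A small sketch of that idea, using errors.Join from the standard library (Go 1.20 and later); collect, errNodes, and errPods are hypothetical names invented for the example:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical failures from two independent checks inside one collector.
var (
	errNodes = errors.New("node check failed")
	errPods  = errors.New("pod check failed")
)

// collect is a hypothetical collector whose two checks both fail. It returns
// a single error that carries both failures.
func collect() error {
	var errs []error
	errs = append(errs, errNodes)
	errs = append(errs, errPods)
	return errors.Join(errs...) // returns nil when every element is nil
}

// demo sends the joined error on a channel with room for exactly one error,
// then reports how many errors were queued and whether both causes survive.
func demo() (queued int, hasNodes, hasPods bool) {
	errChan := make(chan error, 1)
	if err := collect(); err != nil {
		errChan <- err // a single send, so it never blocks
	}
	close(errChan)
	queued = len(errChan)
	err := <-errChan
	return queued, errors.Is(err, errNodes), errors.Is(err, errPods)
}

func main() {
	queued, hasNodes, hasPods := demo()
	fmt.Println(queued, hasNodes, hasPods)
}
```

Because the joined error still matches each underlying cause with errors.Is, nothing is lost by sending one value per collector.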

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

tuxerrante

LGTM

hawkowl · 2026-03-30T00:19:13Z

the e2e fail is a flake, going to merge

hawkowl added the bug, go, and skippy labels Mar 24, 2026

hawkowl requested review from bennerv, cadenmarchese, hlipsig, jharrington22, mociarain, mrWinston, rogbas, tiguelu and yjst2012 as code owners March 24, 2026 04:53

Copilot AI review requested due to automatic review settings March 24, 2026 04:53

hawkowl requested review from alcasim, kimorris27, sankur-codes, tsatam, tuxerrante, ventifus and wanghaoran1988 as code owners March 24, 2026 04:53

Copilot started reviewing on behalf of hawkowl March 24, 2026 04:54

Copilot AI reviewed Mar 24, 2026

hawkowl force-pushed the hawkowl/monitor-skip-if-expired branch from e75e698 to e1b03b6 Compare March 24, 2026 05:19

tuxerrante previously requested changes Mar 25, 2026

tsatam added the next-release label Mar 25, 2026

aasserzo approved these changes Mar 25, 2026

tuxerrante mentioned this pull request Mar 26, 2026

Fix goroutine leak in monitor workOne due to unbuffered channel #4714

Merged

5 tasks

do not continue spawning goroutines if we've already timed out, and emit metrics when we have timed out

d2201bf

Copilot AI review requested due to automatic review settings March 27, 2026 00:01

hawkowl force-pushed the hawkowl/monitor-skip-if-expired branch from e1b03b6 to d2201bf Compare March 27, 2026 00:01

hawkowl requested a review from kevinobriendotca as a code owner March 27, 2026 00:01

Copilot started reviewing on behalf of hawkowl March 27, 2026 00:02

Copilot AI reviewed Mar 27, 2026

hawkowl added 3 commits March 27, 2026 11:40

update monitor to use the proper fake metrics for testing

0d95d97

add a test for skipping when the context is cancelled to begin with

5074ed8

fix nit

42963cd

Copilot AI review requested due to automatic review settings March 27, 2026 00:53

Copilot started reviewing on behalf of hawkowl March 27, 2026 00:54

Copilot AI reviewed Mar 27, 2026

fix flake

72654e0

tuxerrante approved these changes Mar 27, 2026

hawkowl merged commit 466b77a into master Mar 30, 2026
28 of 31 checks passed

hawkowl deleted the hawkowl/monitor-skip-if-expired branch March 30, 2026 00:19

hawkowl merged 5 commits into master from hawkowl/monitor-skip-if-expired