Cleanup autoscaler metric subscriptions on shutdown by Andyz26 · Pull Request #798 · Netflix/mantis

Andyz26 · 2025-10-14T00:30:56Z

[Problem]
Multiple instances of the stage scaler operator are observed after a scaler rule switch.

[Fix]

- Track the scheduler and stage subscriptions created inside WorkerMetricHandler so that repeated activations do not leak StageMetricDataOperator instances.
- Expose a shutdown hook on the handler and invoke it from JobMasterService.shutdown to cancel the master scheduling stream and per-stage schedulers alongside the autoscaler.
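The pattern described in the fix can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TrackedSubscription`, `MetricHandlerSketch`, `start`, and `liveCount` are hypothetical names, and a plain cancellation flag stands in for the real subscription type. The idea is that the handler keeps references to everything it starts, cancels the previous set before re-activating, and exposes a `shutdown` method the owning service can call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a cancellable stream handle (for example, an Rx subscription). */
final class TrackedSubscription {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void cancel() {
        cancelled.set(true);
    }

    boolean isCancelled() {
        return cancelled.get();
    }
}

/** Illustrative handler only; this is not the real WorkerMetricHandler. */
final class MetricHandlerSketch {
    private final Object lock = new Object();
    // Handle for the master scheduling stream (null when not running).
    private TrackedSubscription masterSubscription;
    // Handles for the per-stage schedulers that are currently tracked.
    private final List<TrackedSubscription> stageSubscriptions = new ArrayList<>();
    // Every handle ever created, kept only so the sketch can report leaks.
    private final List<TrackedSubscription> everCreated = new ArrayList<>();

    /** Starts the streams; anything started by an earlier activation is cancelled first. */
    void start(int stageCount) {
        synchronized (lock) {
            cancelAllLocked();
            masterSubscription = newSubscriptionLocked();
            for (int i = 0; i < stageCount; i++) {
                stageSubscriptions.add(newSubscriptionLocked());
            }
        }
    }

    /** Shutdown hook: cancels the master stream and all per-stage schedulers. */
    void shutdown() {
        synchronized (lock) {
            cancelAllLocked();
        }
    }

    /** Number of created handles that were never cancelled (live or leaked). */
    int liveCount() {
        synchronized (lock) {
            int live = 0;
            for (TrackedSubscription s : everCreated) {
                if (!s.isCancelled()) {
                    live++;
                }
            }
            return live;
        }
    }

    // Caller must hold the lock.
    private TrackedSubscription newSubscriptionLocked() {
        TrackedSubscription s = new TrackedSubscription();
        everCreated.add(s);
        return s;
    }

    // Caller must hold the lock.
    private void cancelAllLocked() {
        if (masterSubscription != null) {
            masterSubscription.cancel();
            masterSubscription = null;
        }
        for (TrackedSubscription s : stageSubscriptions) {
            s.cancel();
        }
        stageSubscriptions.clear();
    }
}
```

In this sketch, calling `start(2)` twice leaves only the second activation's handles live, and `shutdown()` leaves none. Without the `cancelAllLocked()` call in `start`, the first activation's handles would stay live, which is the kind of leak the problem statement describes.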

github-actions · 2025-10-14T00:38:38Z

Test Results

661 tests ±0: 650 ✅ passed ±0, 11 💤 skipped ±0, 0 ❌ failed ±0
152 suites ±0, 152 files ±0
Duration: 9m 10s ⏱️ (+5s)

Results for commit fd1d2ed. ± Comparison against base commit ca9b829.

♻️ This comment has been updated with latest results.

* Add variety of cleanups, fix warnings, improve code/performance (#771)
* More fixes
* Review feedback, add more
* Update nebula.netflixoss use sonatype central portal (#774)
* Use com.netflix.nebula.netflixoss 11.6.0 to move publishing to Sonatype Central Portal from Sonatype Legacy OSSRH
* Github action: checkout v4
* Introduce batching into worker discovery during scaling (#773)
* Fix worker state filtering and scheduling update gaps during scaling. This reduces scaling update storms from N individual updates to 1-3 batched updates.
  - Filter JobSchedulingInfo to only include Started workers, preventing downstream connection failures
  - Add smart refresh batching with pending worker detection to avoid premature flag resets
  - Implement WorkerState.isPendingState() helper for consistent state checking
  - Add comprehensive tests covering scaling scenarios and flag reset edge cases
  - Include detailed context and analysis documentation of connection mechanisms and scaling optimizations
* try stablize flaky ut
* add analysis context doc
* remove refresh discovery trigger on scaleup request
* Fix Worker Request flow to properly use batching (#775)
* Introduce batching into worker discovery during scaling (#773)
* Fix worker state filtering and scheduling update gaps during scaling. This reduces scaling update storms from N individual updates to 1-3 batched updates.
  - Filter JobSchedulingInfo to only include Started workers, preventing downstream connection failures
  - Add smart refresh batching with pending worker detection to avoid premature flag resets
  - Implement WorkerState.isPendingState() helper for consistent state checking
  - Add comprehensive tests covering scaling scenarios and flag reset edge cases
  - Include detailed context and analysis documentation of connection mechanisms and scaling optimizations
* try stablize flaky ut
* add analysis context doc
* remove refresh discovery trigger on scaleup request
* Fix Worker Request flow to properly use batching (#775)
* Support default tag config as fallback on artifact loading failure (#778)
* increase max stage concurrency (#779)
* Fix a typo in the Group By docs (#783)
* Fix a typo in the Group By docs
* Fix broken link to heartbeat documentation
* Handle out of sync restarted TE (#784)
* Handle out of sync restarted TE
* use terminte event on heartbeat
* clean up + tests
* Revert "Fix Worker Request flow to properly use batching (#775)" (#785)
  This reverts commit 3b0c92f.
* Move common code to utils and cleanup (#789)
  Co-authored-by: ggao <ggao@netflix.com>
* Add job id to log and add running worker failure metrics (#790)
  Co-authored-by: ggao <ggao@netflix.com>
* add job clusters update metrics (#791)
* Update worker failure metric (#792)
  Co-authored-by: ggao <ggao@netflix.com>
* Refactor RCActor props overload (#795)
  Co-authored-by: ggao <ggao@netflix.com>
* Add log to check #TE archived was not in disabled state (#793)
  Co-authored-by: ggao <ggao@netflix.com>
* Update CODEOWNERS (#796)
* Cleanup autoscaler metric subscriptions on shutdown (#798)
* fix leaked auto scaler instance (#801)
* Fix test race condition (#803)
  When disabling a job cluster, the response would sometimes return before the associated jobs were killed. The Delete action would then fail because the job was still active.
  I was able to reliably reproduce by adding a 200ms sleep in the JobActor.onJobKill. To fix, we just check if the response is returning that error. If so, we retry. Otherwise, we perform the standard checks. [CI Example](https://github.com/Netflix/mantis/pull/797/checks?check_run_id=52633941202)
* Fixed up test
* Debugging
* Validating breakage is from rate limiting
* Updating rate limit

---------

Co-authored-by: Michael Braun <n3ca88@gmail.com>
Co-authored-by: OdysseusLives <achipman@netflix.com>
Co-authored-by: Andy Zhang <87735571+Andyz26@users.noreply.github.com>
Co-authored-by: Daniel Trager <43889268+dtrager02@users.noreply.github.com>
Co-authored-by: eliot-stripe <58606410+eliot-stripe@users.noreply.github.com>
Co-authored-by: Gigi Gao <ggjbetty@gmail.com>
Co-authored-by: ggao <ggao@netflix.com>
Co-authored-by: timmartin-stripe <131782471+timmartin-stripe@users.noreply.github.com>

Cleanup autoscaler metric subscriptions on shutdown

fd1d2ed

Andyz26 requested review from calvin681, dtrager02, fdc-ntflx, hellolittlej and james-lubin as code owners October 14, 2025 00:30

Andyz26 had a problem deploying to Integrate Pull Request October 14, 2025 00:31 — with GitHub Actions Failure

hellolittlej approved these changes Oct 14, 2025

Andyz26 merged commit 3ba5b5f into master Oct 14, 2025
5 of 7 checks passed

Andyz26 deleted the andy/fixScalerMetricSubShutdown branch October 14, 2025 16:52

andresgalindo-stripe pushed a commit to andresgalindo-stripe/mantis that referenced this pull request Oct 30, 2025

Cleanup autoscaler metric subscriptions on shutdown (Netflix#798)

f36b11c

Andyz26 merged 1 commit into master from andy/fixScalerMetricSubShutdown
