
Fix leaked auto scaler instance from runtime error during shutdown #801

Merged
Andyz26 merged 2 commits into master from
andyz/autoScalerShutdownBlockByMetricsclientFix
Oct 24, 2025

Conversation

@Andyz26
Collaborator

Andyz26 commented Oct 23, 2025

Problem:

  • JobMasterService.shutdown() could fail to clean up metricObserver if jobAutoScaler.shutdown() threw an exception
  • JobAutoScaler.shutdown() could leave subscriptions active if subject.onCompleted() failed
  • MetricsClientImpl.closeOut() could throw a ConcurrentModificationException because it iterated over HashMap values outside the synchronized block
  • ScalerControllerActor could leave the system in an inconsistent state if shutdown failures occurred

Solution:

  • Fixed the race condition in MetricsClientImpl by moving the HashMap iteration inside the synchronized block (see the sketch below)
  • Added a JVM shutdown with a 2-second delay when a critical autoscaler shutdown fails, ensuring a clean restart in containerized environments
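For illustration, a minimal sketch of the iteration fix, assuming a lock-guarded HashMap of subscriptions; the class, field, and element names below are placeholders, not the actual MetricsClientImpl source:

```java
import java.util.HashMap;
import java.util.Map;

class MetricsClientSketch {
    private final Object lock = new Object();
    // Guarded by `lock`; a plain HashMap, so every iteration must also hold the lock.
    private final Map<String, AutoCloseable> subscriptions = new HashMap<>();

    void closeOut() {
        synchronized (lock) {
            // Iterating inside the synchronized block prevents another thread
            // from mutating the map mid-iteration, which previously surfaced
            // as a ConcurrentModificationException.
            for (AutoCloseable s : subscriptions.values()) {
                try {
                    s.close();
                } catch (Exception e) {
                    // Best effort: keep closing the remaining subscriptions.
                }
            }
            subscriptions.clear();
        }
    }
}
```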

@github-actions

github-actions bot commented Oct 23, 2025

Test Results

661 tests ±0 (650 passed ✅ ±0, 11 skipped 💤 ±0, 0 failed ❌ ±0)
152 suites ±0, 152 files ±0
Duration: 8m 56s ⏱️ (-10s vs. base)

Results for commit 407623f. ± Comparison against base commit 3ba5b5f.

♻️ This comment has been updated with latest results.

@Andyz26 Andyz26 had a problem deploying to Integrate Pull Request October 23, 2025 23:15 — with GitHub Actions Failure
log.error("failed to shutdown job auto scaler service in rule: {}, reset and request refresh",
log.error("[FATAL] failed to shutdown job auto scaler service in rule: {}, shutting down JVM process",
currentRule, result.failed().get());
// Give some time for logs to flush and cleanup
Collaborator


I thought the log is still captured in the stdout stream when the container stops?

  currentRule, result.failed().get());
+ // Give some time for logs to flush and cleanup
+ getContext().system().scheduler().scheduleOnce(
+     scala.concurrent.duration.Duration.create(2, java.util.concurrent.TimeUnit.SECONDS),
Collaborator


nit: Do we need to move this to a constant?

Collaborator Author


I don't think we need to reuse this value anywhere else.
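For context, the complete delayed-shutdown call presumably looks something like the sketch below. Only the scheduleOnce lines appear in the diff above, so the actor wrapper, the exit status, and the dispatcher argument are assumptions rather than the actual ScalerControllerActor code:

```java
import akka.actor.AbstractActor;

// Sketch of the delayed JVM shutdown pattern from the diff above (Akka classic
// actor). Class name, exit status, and dispatcher choice are assumptions.
class ShutdownOnFatalSketch extends AbstractActor {

    private void scheduleJvmShutdown() {
        // Give in-flight log writes ~2 seconds to flush before exiting, then
        // rely on the container supervisor to restart the process cleanly.
        getContext().system().scheduler().scheduleOnce(
                scala.concurrent.duration.Duration.create(2, java.util.concurrent.TimeUnit.SECONDS),
                () -> System.exit(1),
                getContext().dispatcher());
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder().build();
    }
}
```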

@Andyz26 Andyz26 merged commit 48e024a into master Oct 24, 2025
4 of 5 checks passed
@Andyz26 Andyz26 deleted the andyz/autoScalerShutdownBlockByMetricsclientFix branch October 24, 2025 17:21
andresgalindo-stripe pushed a commit to andresgalindo-stripe/mantis that referenced this pull request Oct 30, 2025
andresgalindo-stripe added a commit that referenced this pull request Nov 6, 2025
* Add variety of cleanups, fix warnings, improve code/performance (#771)

* More fixes

* Review feedback, add more

* Update nebula.netflixoss use sonatype central portal (#774)

* Use com.netflix.nebula.netflixoss 11.6.0 to move publishing to Sonatype Central Portal from Sonatype Legacy OSSRH

* Github action: checkout v4

* Introduce batching into worker discovery during scaling (#773)

* Fix worker state filtering and scheduling update gaps during scaling. This reduces scaling update storms from N individual updates to 1-3 batched updates.
  - Filter JobSchedulingInfo to only include Started workers, preventing downstream connection failures
  - Add smart refresh batching with pending worker detection to avoid premature flag resets
  - Implement WorkerState.isPendingState() helper for consistent state checking
  - Add comprehensive tests covering scaling scenarios and flag reset edge cases
  - Include detailed context and analysis documentation of connection mechanisms and scaling optimizations

* try stabilize flaky UT

* add analysis context doc

* remove refresh discovery trigger on scaleup request

* Fix Worker Request flow to properly use batching (#775)

* Support default tag config as fallback on artifact loading failure (#778)

* increase max stage concurrency (#779)

* Fix a typo in the Group By docs (#783)

* Fix a typo in the Group By docs

* Fix broken link to heartbeat documentation

* Handle out of sync restarted TE (#784)

* Handle out of sync restarted TE

* use terminate event on heartbeat

* clean up + tests

* Revert "Fix Worker Request flow to properly use batching (#775)" (#785)

This reverts commit 3b0c92f.

* Move common code to utils and cleanup (#789)

Co-authored-by: ggao <ggao@netflix.com>

* Add job id to log and add running worker failure metrics (#790)

Co-authored-by: ggao <ggao@netflix.com>

* add job clusters update metrics (#791)

* Update worker failure metric (#792)

Co-authored-by: ggao <ggao@netflix.com>

* Refactor RCActor props overload (#795)

Co-authored-by: ggao <ggao@netflix.com>

* Add log to check #TE archived was not in disabled state (#793)

Co-authored-by: ggao <ggao@netflix.com>

* Update CODEOWNERS (#796)

* Cleanup autoscaler metric subscriptions on shutdown (#798)

* fix leaked auto scaler instance (#801)

* Fix test race condition (#803)

When disabling a job cluster, the response would sometimes return before
the associated jobs were killed.  The Delete action would then fail
because the job was still active. I was able to reliably reproduce it by
adding a 200ms sleep in JobActor.onJobKill.

To fix, we just check if the response is returning that error. If so, we
retry.  Otherwise, we perform the standard checks.

[CI Example](https://github.com/Netflix/mantis/pull/797/checks?check_run_id=52633941202)

* Fixed up test

* Debugging

* Validating breakage is from rate limiting

* Updating rate limit

---------

Co-authored-by: Michael Braun <n3ca88@gmail.com>
Co-authored-by: OdysseusLives <achipman@netflix.com>
Co-authored-by: Andy Zhang <87735571+Andyz26@users.noreply.github.com>
Co-authored-by: Daniel Trager <43889268+dtrager02@users.noreply.github.com>
Co-authored-by: eliot-stripe <58606410+eliot-stripe@users.noreply.github.com>
Co-authored-by: Gigi Gao <ggjbetty@gmail.com>
Co-authored-by: ggao <ggao@netflix.com>
Co-authored-by: timmartin-stripe <131782471+timmartin-stripe@users.noreply.github.com>