
Fix test race condition #803

Merged: Andyz26 merged 1 commit into Netflix:master from timmartin-stripe:timmartin/fix-race-condition-in-test on Oct 28, 2025

Conversation

@timmartin-stripe (Contributor):

Context

When disabling a job cluster, the response would sometimes return before the associated jobs were killed. The Delete action would then fail because the job was still active. I was able to reproduce this reliably by adding a 200ms sleep in JobActor.onJobKill.

To fix, the test now checks whether the response carries that error; if so, it retries. Otherwise, it performs the standard checks, as sketched below.
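As a rough illustration of that retry pattern, here is a minimal, self-contained sketch using the Awaitility dependency this PR adds to the test classpath. The DeleteResponse type, deleteJobCluster helper, and error text are hypothetical stand-ins, not the actual Mantis test code:

```java
import static org.awaitility.Awaitility.await;

import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

public class DeleteRetryExample {

    // Hypothetical stand-in for the control plane's delete response.
    static final class DeleteResponse {
        final boolean success;
        final String message;
        DeleteResponse(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    // Simulates the race: Delete fails while the killed job is still
    // active, then succeeds once the job has finished tearing down.
    static DeleteResponse deleteJobCluster(AtomicInteger attempts) {
        return attempts.incrementAndGet() < 3
            ? new DeleteResponse(false, "job cluster has active jobs")
            : new DeleteResponse(true, "deleted");
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        await()
            .atMost(Duration.ofSeconds(5))
            .pollInterval(Duration.ofMillis(100))
            .until(() -> {
                DeleteResponse resp = deleteJobCluster(attempts);
                // Only the "still active" error is retryable; any other
                // outcome falls through to the standard success check.
                if (!resp.success && resp.message.contains("active jobs")) {
                    return false; // jobs not yet killed; poll again
                }
                return resp.success;
            });
        System.out.println("Delete succeeded after " + attempts.get() + " attempts");
    }
}
```

In this simulation the first two Delete attempts fail with the retryable error, mirroring the window the 200ms sleep exposed; polling replaces a fixed sleep that would race the async kill.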

[CI Fail Example](https://github.com/Netflix/mantis/pull/797/checks?check_run_id=52633941202)

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

testImplementation testFixtures(project(":mantis-common"))
testImplementation testFixtures(project(":mantis-control-plane:mantis-control-plane-core"))
testImplementation libraries.commonsIo
testImplementation 'org.awaitility:awaitility:4.2.0'
@timmartin-stripe (Contributor, Author) commented on the new test dependency:
This felt like a reasonable dependency given the async nature of many of the tests.

@Andyz26 (Collaborator) left a comment:

very nice!

@github-actions:

Test Results

152 files ±0   152 suites ±0   9m 29s ⏱️ −2s
661 tests ±0   649 ✅ +1   11 💤 ±0   1 ❌ −1
662 runs ±0   650 ✅ +1   11 💤 ±0   1 ❌ −1

For more details on these failures, see this check.

Results for commit 26df2c7. Comparison against base commit 48e024a.

@Andyz26 Andyz26 merged commit 4be9cc3 into Netflix:master Oct 28, 2025
3 of 5 checks passed
andresgalindo-stripe pushed a commit to andresgalindo-stripe/mantis that referenced this pull request on Oct 30, 2025.
andresgalindo-stripe added a commit that referenced this pull request on Nov 6, 2025:
* Add variety of cleanups, fix warnings, improve code/performance (#771)

* More fixes

* Review feedback, add more

* Update nebula.netflixoss use sonatype central portal (#774)

* Use com.netflix.nebula.netflixoss 11.6.0 to move publishing to Sonatype Central Portal from Sonatype Legacy OSSRH

* Github action: checkout v4

* Introduce batching into worker discovery during scaling (#773)

* Fix worker state filtering and scheduling update gaps during scaling. This reduces scaling update storms from N individual updates to 1-3 batched updates.
  - Filter JobSchedulingInfo to only include Started workers, preventing downstream connection failures
  - Add smart refresh batching with pending-worker detection to avoid premature flag resets
  - Implement WorkerState.isPendingState() helper for consistent state checking (see the sketch after this list)
  - Add comprehensive tests covering scaling scenarios and flag-reset edge cases
  - Include detailed context and analysis documentation of connection mechanisms and scaling optimizations
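As an illustrative aside on the batching commit quoted above, here is a minimal sketch of the Started-worker filtering and isPendingState() check it describes. The state names, Worker type, and logic are simplified stand-ins, not the actual Mantis implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WorkerFilterSketch {

    // Simplified stand-in for the Mantis worker lifecycle states.
    enum WorkerState {
        ACCEPTED, LAUNCHED, STARTING, STARTED, FAILED, COMPLETED;

        // Mirrors the commit's isPendingState() idea: the worker has been
        // requested but is not yet ready to serve traffic.
        boolean isPendingState() {
            return this == ACCEPTED || this == LAUNCHED || this == STARTING;
        }
    }

    static final class Worker {
        final int index;
        final WorkerState state;
        Worker(int index, WorkerState state) { this.index = index; this.state = state; }
        @Override public String toString() { return "worker-" + index + "/" + state; }
    }

    public static void main(String[] args) {
        List<Worker> workers = Arrays.asList(
            new Worker(0, WorkerState.STARTED),
            new Worker(1, WorkerState.STARTING),
            new Worker(2, WorkerState.STARTED));

        // Publish only Started workers to scheduling-info consumers, so
        // downstream clients never try to connect to pending workers.
        List<Worker> publishable = workers.stream()
            .filter(w -> w.state == WorkerState.STARTED)
            .collect(Collectors.toList());

        // Batch the discovery refresh: hold off while any worker is still
        // pending, collapsing per-worker updates into a few batched ones.
        boolean holdRefresh = workers.stream().anyMatch(w -> w.state.isPendingState());

        System.out.println("publishable=" + publishable + ", holdRefresh=" + holdRefresh);
    }
}
```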

* try to stabilize flaky UT

* add analysis context doc

* remove refresh discovery trigger on scaleup request

* Fix Worker Request flow to properly use batching (#775)

* Support default tag config as fallback on artifact loading failure (#778)

* increase max stage concurrency (#779)

* Fix a typo in the Group By docs (#783)

* Fix a typo in the Group By docs

* Fix broken link to heartbeat documentation

* Handle out of sync restarted TE (#784)

* Handle out of sync restarted TE

* use terminate event on heartbeat

* clean up + tests

* Revert "Fix Worker Request flow to properly use batching (#775)" (#785)

This reverts commit 3b0c92f.

* Move common code to utils and cleanup (#789)

Co-authored-by: ggao <ggao@netflix.com>

* Add job id to log and add running worker failure metrics (#790)

Co-authored-by: ggao <ggao@netflix.com>

* add job clusters update metrics (#791)

* Update worker failure metric (#792)

Co-authored-by: ggao <ggao@netflix.com>

* Refactor RCActor props overload (#795)

Co-authored-by: ggao <ggao@netflix.com>

* Add log to check #TE archived was not in disabled state (#793)

Co-authored-by: ggao <ggao@netflix.com>

* Update CODEOWNERS (#796)

* Cleanup autoscaler metric subscriptions on shutdown (#798)

* fix leaked auto scaler instance (#801)

* Fix test race condition (#803)

* Fixed up test

* Debugging

* Validating breakage is from rate limiting

* Updating rate limit

---------

Co-authored-by: Michael Braun <n3ca88@gmail.com>
Co-authored-by: OdysseusLives <achipman@netflix.com>
Co-authored-by: Andy Zhang <87735571+Andyz26@users.noreply.github.com>
Co-authored-by: Daniel Trager <43889268+dtrager02@users.noreply.github.com>
Co-authored-by: eliot-stripe <58606410+eliot-stripe@users.noreply.github.com>
Co-authored-by: Gigi Gao <ggjbetty@gmail.com>
Co-authored-by: ggao <ggao@netflix.com>
Co-authored-by: timmartin-stripe <131782471+timmartin-stripe@users.noreply.github.com>