Merged

Andyz26 approved these changes on Sep 25, 2025.

The branch was force-pushed from 3f10acb to 1f116d0.
andresgalindo-stripe pushed a commit to andresgalindo-stripe/mantis that referenced this pull request on Oct 30, 2025 (Co-authored-by: ggao <ggao@netflix.com>).
andresgalindo-stripe added a commit that referenced this pull request on Nov 6, 2025:
* Add a variety of cleanups, fix warnings, and improve code/performance (#771), with follow-up fixes from review feedback.
* Update nebula.netflixoss to use the Sonatype Central Portal (#774): use com.netflix.nebula.netflixoss 11.6.0 to move publishing from Sonatype Legacy OSSRH to the Sonatype Central Portal, and bump the GitHub Actions checkout step to v4.
* Introduce batching into worker discovery during scaling (#773): fix worker state filtering and scheduling update gaps during scaling, reducing scaling update storms from N individual updates to 1-3 batched updates.
  - Filter JobSchedulingInfo to include only Started workers, preventing downstream connection failures.
  - Add smart refresh batching with pending-worker detection to avoid premature flag resets.
  - Implement a WorkerState.isPendingState() helper for consistent state checking.
  - Add comprehensive tests covering scaling scenarios and flag-reset edge cases.
  - Include detailed context and analysis documentation of the connection mechanisms and scaling optimizations.
  - Stabilize a flaky unit test and remove the refresh-discovery trigger on scale-up requests.
* Fix the Worker Request flow to properly use batching (#775).
* Support a default tag config as a fallback on artifact loading failure (#778).
* Increase max stage concurrency (#779).
* Fix a typo in the Group By docs and a broken link to the heartbeat documentation (#783).
* Handle an out-of-sync restarted TE (#784): use the terminate event on heartbeat, plus cleanup and tests.
* Revert "Fix Worker Request flow to properly use batching (#775)" (#785); this reverts commit 3b0c92f.
* Move common code to utils and clean up (#789).
* Add the job id to logs and add running-worker failure metrics (#790).
* Add job cluster update metrics (#791).
* Update the worker failure metric (#792).
* Refactor the RCActor props overload (#795).
* Add a log to check that an archived TE was not in the disabled state (#793).
* Update CODEOWNERS (#796).
* Clean up autoscaler metric subscriptions on shutdown (#798).
* Fix a leaked autoscaler instance (#801).
* Fix a test race condition (#803): when disabling a job cluster, the response would sometimes return before the associated jobs were killed, so the subsequent Delete action would fail because a job was still active. This was reliably reproducible by adding a 200 ms sleep in JobActor.onJobKill. The fix checks whether the response returns that error and, if so, retries; otherwise the standard checks run. [CI example](https://github.com/Netflix/mantis/pull/797/checks?check_run_id=52633941202). Follow-up commits fixed up the test, added debugging, validated that the breakage came from rate limiting, and updated the rate limit.

Co-authored-by: Michael Braun <n3ca88@gmail.com>
Co-authored-by: OdysseusLives <achipman@netflix.com>
Co-authored-by: Andy Zhang <87735571+Andyz26@users.noreply.github.com>
Co-authored-by: Daniel Trager <43889268+dtrager02@users.noreply.github.com>
Co-authored-by: eliot-stripe <58606410+eliot-stripe@users.noreply.github.com>
Co-authored-by: Gigi Gao <ggjbetty@gmail.com>
Co-authored-by: ggao <ggao@netflix.com>
Co-authored-by: timmartin-stripe <131782471+timmartin-stripe@users.noreply.github.com>
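The retry approach described in the #803 entry above can be sketched generically: reissue the call while the response still reports the transient "job still active" error, up to a bounded number of attempts. This is a minimal illustration, not the actual Mantis test code; the helper name, the response strings, and the 50 ms backoff are all hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RetryOnRace {
    // Hypothetical helper: repeat 'call' while 'isTransient' matches the
    // result, up to 'attempts' total calls, with a short backoff between tries.
    static <T> T retryIf(Supplier<T> call, Predicate<T> isTransient, int attempts) {
        T result = call.get();
        while (isTransient.test(result) && --attempts > 0) {
            try {
                Thread.sleep(50); // brief backoff before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            result = call.get();
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated responses: the first call races with the job kill and
        // reports the transient error; the second call succeeds.
        Iterator<String> responses = List.of("JOB_STILL_ACTIVE", "DELETED").iterator();
        String outcome = retryIf(responses::next, "JOB_STILL_ACTIVE"::equals, 3);
        System.out.println(outcome); // prints DELETED
    }
}
```

If the error persists after the final attempt, the last result is returned unchanged so the caller's standard checks still see the real failure.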
Context

We don't need to overload the props because the only three usages are all in tests, so those callers can simply pass null. Duplicating the props makes it hard to add new parameters to the ResourceClusterActor, since each addition has to be duplicated across multiple constructors. The extra overload was introduced in e57a596.
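The refactor described above can be sketched as collapsing redundant constructor overloads into a single explicit one, with test-only callers passing null directly. This is a hypothetical illustration of the pattern, not the real ResourceClusterActor: the class name, fields, and `describe` method are all invented for the example.

```java
// Hypothetical sketch: one constructor with every parameter explicit,
// instead of a second overload that exists only to default a dependency.
public class ResourceClusterActorSketch {
    private final String clusterId;
    private final Object rateLimiter; // stand-in for the real dependency

    public ResourceClusterActorSketch(String clusterId, Object rateLimiter) {
        this.clusterId = clusterId;
        this.rateLimiter = rateLimiter;
    }

    public String describe() {
        return clusterId + (rateLimiter == null ? " (no limiter)" : " (limited)");
    }

    public static void main(String[] args) {
        // A test caller passes null explicitly rather than relying on a
        // dedicated overload that hard-codes the default.
        ResourceClusterActorSketch actor = new ResourceClusterActorSketch("cl-1", null);
        System.out.println(actor.describe()); // prints cl-1 (no limiter)
    }
}
```

With one constructor, adding a new parameter touches exactly one signature instead of every duplicated overload.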
Checklist

- [x] `./gradlew build` compiles code correctly
- [x] `./gradlew test` passes all tests