@christ-tt christ-tt commented Dec 25, 2025

Problem

  • In _process_join, we check whether enough nodes have joined (> min_nodes) and, if so, call bootstrap().
  • In _wait_for_bootstrap, we run the same check and call bootstrap() again.
  • As a result, we may see concurrent bootstraps.
  • Similarly, because global_rebalance was handled on every node leave, we may see concurrent re-boots.
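
To make the duplicate trigger concrete, here is a minimal sketch of the pattern described above. Only the names _process_join, _wait_for_bootstrap, bootstrap and min_nodes come from this description; the class, its fields, the call counter and the exact comparison are assumptions for illustration, not the project's code. The two paths are called one after the other here, whereas in the real scheduler they can run concurrently.

```python
class RacyScheduler:
    """Hypothetical stand-in showing two independent bootstrap triggers."""

    def __init__(self, min_nodes):
        self.min_nodes = min_nodes
        self.nodes = []
        self.bootstrap_calls = 0  # illustration only: counts bootstrap attempts

    def _enough_nodes(self):
        # Assumption: the real threshold comparison may differ.
        return len(self.nodes) >= self.min_nodes

    def bootstrap(self):
        self.bootstrap_calls += 1

    def _process_join(self, node):
        self.nodes.append(node)
        if self._enough_nodes():
            self.bootstrap()  # first trigger

    def _wait_for_bootstrap(self):
        if self._enough_nodes():
            self.bootstrap()  # second trigger, unaware of the first


scheduler = RacyScheduler(min_nodes=2)
scheduler._process_join("node-a")
scheduler._process_join("node-b")
scheduler._wait_for_bootstrap()
print(scheduler.bootstrap_calls)
```

Neither path knows whether the other has already started a bootstrap, so the same cluster state produces more than one attempt.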

In this PR, we

  • simplify how we check whether we are bootstrapped;
  • fix concurrent bootstrap by removing the bootstrap attempt from the wait event;
  • fix concurrent re-boot by moving global_rebalance to _process_leave;
  • make our logs more informative.
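
The changes listed above could be sketched roughly as follows. This is a simplified illustration, not the merged code: the class layout, the threading.Event used as the single "bootstrapped" flag, the lock and the call counters are all assumptions. Only _process_join, _process_leave, _wait_for_bootstrap, bootstrap, global_rebalance and min_nodes are names taken from this description.

```python
import threading


class Scheduler:
    """Hypothetical sketch: one bootstrap trigger, rebalance handled on leave."""

    def __init__(self, min_nodes):
        self.min_nodes = min_nodes
        self.nodes = set()
        self.bootstrapped = threading.Event()  # single source of truth
        self._lock = threading.Lock()          # assumption: serializes membership changes
        self.bootstrap_calls = 0               # illustration only
        self.rebalance_calls = 0               # illustration only

    def bootstrap(self):
        self.bootstrap_calls += 1
        self.bootstrapped.set()

    def global_rebalance(self):
        self.rebalance_calls += 1

    def _process_join(self, node):
        with self._lock:
            self.nodes.add(node)
            # The only place a bootstrap is attempted.
            if not self.bootstrapped.is_set() and len(self.nodes) >= self.min_nodes:
                self.bootstrap()

    def _process_leave(self, node):
        with self._lock:
            self.nodes.discard(node)
            # Rebalance runs here, once per processed leave.
            if self.bootstrapped.is_set():
                self.global_rebalance()

    def _wait_for_bootstrap(self, timeout=None):
        # Waits only; it never starts a bootstrap itself.
        return self.bootstrapped.wait(timeout)


scheduler = Scheduler(min_nodes=2)
for name in ("node-a", "node-b", "node-c"):
    scheduler._process_join(name)
scheduler._process_leave("node-a")
print(scheduler.bootstrap_calls, scheduler.rebalance_calls)
```

Keeping the wait path read-only means a bootstrap can only start from the join path, and routing the rebalance through _process_leave puts each leave and its rebalance in one serialized step.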

@christ-tt christ-tt requested review from a team, TianyiZhao1437 and gufengc December 25, 2025 06:28
@christ-tt christ-tt enabled auto-merge (squash) December 25, 2025 07:09
@christ-tt (Collaborator, Author) commented

Pipeline readiness check will be introduced in this PR

@christ-tt christ-tt merged commit 6b9a93d into GradientHQ:main Dec 25, 2025
5 checks passed