@JasonOE (Collaborator) commented Nov 7, 2025

Improve Scheduler Global Rebalancing Mechanism and Node Rebalancing Tests

Summary

This PR improves the scheduler's global rebalancing mechanism, optimizes the warmup and truncation process, and adds unit tests for node rebalancing. The main changes include:

  1. Warmup before global rebalance: Adds a warmup step before global rebalancing to detect truncation points via dynamic programming routing and optimize shard allocation
  2. Unified bootstrap and rebalance interface: Refactors the bootstrap method to support both initialization and global rebalancing modes
  3. Code refactoring and optimization: Refactors based on code review feedback, simplifies code structure, and removes redundant methods
  4. Enhanced test coverage: Adds unit tests for node rebalancing scenarios

Technical Details

Warmup and Truncation Mechanism

The warmup phase identifies turning points through layer-level dynamic programming:

  • tail turning point: The route switches away from a node at a layer the node still hosts, indicating that the node's range from this layer to its end can be truncated
  • head turning point: The route first uses a node at a layer greater than the node's start, indicating that the node's range from start up to this layer can be truncated
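
As a rough illustration of the two definitions above, here is a minimal sketch. It is not the PR's find_turning_points implementation, whose signature is not shown here. The function name turning_points_sketch, the half-open (start, end) layer ranges, and the per-layer route list are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical data model (not taken from the PR):
#   allocations[node] = (start, end): half-open range of layers the node hosts
#   route[layer]      = node chosen by the router to serve that layer


def turning_points_sketch(
    route: List[str], allocations: Dict[str, Tuple[int, int]]
) -> List[Tuple[str, int, str]]:
    """Return (node, layer, kind) tuples, where kind is 'head' or 'tail'."""
    points: List[Tuple[str, int, str]] = []
    for node, (start, end) in allocations.items():
        used = [layer for layer in range(start, end) if route[layer] == node]
        if not used:
            continue  # the route never touches this node
        first = used[0]
        # Head: the route first uses the node at a layer greater than start,
        # so the range [start, first) is a truncation candidate.
        if first > start:
            points.append((node, first, "head"))
        # Tail: the route leaves the node at a layer the node still hosts,
        # so the range [layer, end) is a truncation candidate.
        for layer in range(first + 1, end):
            if route[layer - 1] == node and route[layer] != node:
                points.append((node, layer, "tail"))
                break
    return points


allocations = {"A": (0, 3), "B": (1, 4)}
route = ["A", "A", "B", "B"]
points = turning_points_sketch(route, allocations)
print(points)
```

In this toy case, node A hosts layer 2 but the route has already switched to B, so A gets a tail turning point at layer 2. Node B hosts layer 1 but is first used at layer 2, so it gets a head turning point there. Truncating both would leave A with layers [0, 2) and B with layers [2, 4).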

Global Rebalancing Flow

  1. Clear existing allocations (if clear_existing=True)
  2. Perform global allocation
  3. Optional warmup and truncation (if skip_warmup=False and request_warm_up_for_reshard > 0)
  4. Verify full pipeline
  5. Update bootstrap state
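
The five steps above can be sketched as a single method. This is a toy model under stated assumptions, not the code in src/scheduling/scheduler.py: the ToyScheduler class, its even-split allocator, and the placeholder warmup are hypothetical. Only the parameter names clear_existing, skip_warmup, and request_warm_up_for_reshard come from the description.

```python
from typing import Dict, List, Tuple


class ToyScheduler:
    """Hypothetical stand-in used only to illustrate the flow's ordering."""

    def __init__(self, num_layers: int, nodes: List[str],
                 request_warm_up_for_reshard: int = 0) -> None:
        self.num_layers = num_layers
        self.nodes = list(nodes)
        self.request_warm_up_for_reshard = request_warm_up_for_reshard
        self.allocations: Dict[str, Tuple[int, int]] = {}
        self.bootstrapped = False
        self.warmup_runs = 0

    def _allocate(self) -> None:
        # Assumed allocator: even contiguous split of layers across nodes.
        base, extra = divmod(self.num_layers, len(self.nodes))
        start = 0
        for i, node in enumerate(self.nodes):
            size = base + (1 if i < extra else 0)
            self.allocations[node] = (start, start + size)
            start += size

    def _warmup_and_truncate(self) -> None:
        # Placeholder: the real warmup would route requests and truncate shards.
        self.warmup_runs += 1

    def _has_full_pipeline(self) -> bool:
        covered = 0
        for start, end in sorted(self.allocations.values()):
            if start > covered:
                return False  # gap in layer coverage
            covered = max(covered, end)
        return covered >= self.num_layers

    def bootstrap(self, clear_existing: bool = False,
                  skip_warmup: bool = False) -> bool:
        if clear_existing:                      # 1. clear existing allocations
            self.allocations.clear()
        if not self.nodes:
            return False
        self._allocate()                        # 2. global allocation
        if not skip_warmup and self.request_warm_up_for_reshard > 0:
            self._warmup_and_truncate()         # 3. optional warmup/truncation
        if not self._has_full_pipeline():       # 4. verify full pipeline
            return False
        self.bootstrapped = True                # 5. update bootstrap state
        return True


scheduler = ToyScheduler(10, ["a", "b", "c"], request_warm_up_for_reshard=1)
ok = scheduler.bootstrap(clear_existing=True)
print(ok, scheduler.allocations)
```

Calling bootstrap() with the defaults corresponds to the initialization mode, while clear_existing=True models a global rebalance. Passing skip_warmup=True, or leaving request_warm_up_for_reshard at 0, skips step 3.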

Impact

Modified Files

  • src/scheduling/scheduler.py: Main changes, refactored bootstrap and rebalancing logic
  • src/scheduling/layer_allocation.py: Added reallocate method
  • src/scheduling/request_routing.py: Changed find_turning_points to static method
  • tests/scheduler_tests/test_scheduler.py: Added node rebalancing tests

Backward Compatibility

  • ✅ No breaking changes: All existing APIs remain compatible
  • bootstrap() method default behavior unchanged (clear_existing=False, skip_warmup=False)
  • ✅ Existing tests should continue to pass

Testing

New Tests

  • test_scheduler_three_nodes_sequential_join_leave_rejoin: Validates rebalancing behavior in multi-node scenarios

Test Coverage

  • ✅ Sequential node joins
  • ✅ Node leave and rejoin
  • ✅ Pipeline completeness verification
  • ✅ Layer assignment correctness verification
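
For readers without the test file at hand, the shape of such a test might look like the sketch below. It does not reproduce test_scheduler_three_nodes_sequential_join_leave_rejoin. The helpers split_layers and covers_all_layers, the node names, and the 12-layer model are assumptions chosen for illustration, and the real test exercises the scheduler rather than a toy allocator.

```python
from typing import Dict, List, Tuple


def split_layers(num_layers: int, nodes: List[str]) -> Dict[str, Tuple[int, int]]:
    """Toy allocator (assumption): even contiguous split in join order."""
    base, extra = divmod(num_layers, len(nodes))
    allocations: Dict[str, Tuple[int, int]] = {}
    start = 0
    for i, node in enumerate(nodes):
        size = base + (1 if i < extra else 0)
        allocations[node] = (start, start + size)
        start += size
    return allocations


def covers_all_layers(allocations: Dict[str, Tuple[int, int]],
                      num_layers: int) -> bool:
    """True if the ranges tile [0, num_layers) with no gap and no overlap."""
    expected_start = 0
    for start, end in sorted(allocations.values()):
        if start != expected_start:
            return False
        expected_start = end
    return expected_start == num_layers


def test_three_nodes_join_leave_rejoin_sketch() -> None:
    num_layers = 12
    nodes: List[str] = []

    # Sequential joins: the pipeline must be complete after each join.
    for node in ["n0", "n1", "n2"]:
        nodes.append(node)
        assert covers_all_layers(split_layers(num_layers, nodes), num_layers)

    # Node leave: remaining nodes must absorb the departed node's layers.
    nodes.remove("n1")
    allocations = split_layers(num_layers, nodes)
    assert set(allocations) == {"n0", "n2"}
    assert covers_all_layers(allocations, num_layers)

    # Node rejoin: layer assignment is rebalanced across all three again.
    nodes.append("n1")
    allocations = split_layers(num_layers, nodes)
    assert allocations == {"n0": (0, 4), "n2": (4, 8), "n1": (8, 12)}


test_three_nodes_join_leave_rejoin_sketch()
```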

Benefits

  1. Better resource utilization: The warmup mechanism helps identify optimal shard truncation points, reducing unnecessary layer allocations
  2. Code maintainability: Unified interface reduces code duplication and makes the codebase easier to maintain
  3. More robust rebalancing: Improved rebalancing mechanism better handles dynamic node changes
  4. Better test coverage: New test cases ensure the correctness of rebalancing logic

@JasonOE requested a review from a team November 7, 2025 15:32
@JasonOE changed the title from "(feat): warmup before global rebalance" to "feat(scheduler): warmup before global rebalance" Nov 7, 2025
@JasonOE marked this pull request as draft November 8, 2025 07:32
@christ-tt (Collaborator) commented

Overall LGTM. Please add unit tests to verify your updates, especially for warmup and rebalancing.

@JasonOE marked this pull request as ready for review November 12, 2025 13:35
@christ-tt (Collaborator) left a comment

LGTM, thanks. Please add a PR description.

@JasonOE enabled auto-merge (squash) November 13, 2025 07:20
@JasonOE merged commit fdc82a9 into main Nov 13, 2025
7 checks passed
@JasonOE deleted the scheduling branch November 13, 2025 07:46

4 participants