@JasonOE JasonOE commented Oct 31, 2025

Summary

This PR adds layer allocation information logging to the worker heartbeat mechanism. Workers now receive and log their assigned layer ranges (start_layer, end_layer) and model name during each heartbeat.

Changes

Scheduler Side (rpc_connection_handler.py)

  • Modified node_update() RPC method to return layer allocation information
  • Returns {node_id, model_name, start_layer, end_layer} in the response
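
As a rough illustration of the response shape described above, here is a minimal sketch. It is not the code in rpc_connection_handler.py: the `Allocation` dataclass and `build_node_update_response` helper are hypothetical names, and returning `None` fields for a node with no allocation yet is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Allocation:
    """Hypothetical record of the layers a scheduler assigned to one node."""
    model_name: str
    start_layer: int  # first assigned layer (inclusive)
    end_layer: int    # one past the last assigned layer (exclusive)


def build_node_update_response(
    node_id: str, allocation: Optional[Allocation]
) -> Dict[str, Any]:
    """Build a dict with the four keys listed in the PR description."""
    if allocation is None:
        # Assumption: an unallocated node still gets all four keys, set to None.
        return {
            "node_id": node_id,
            "model_name": None,
            "start_layer": None,
            "end_layer": None,
        }
    return {
        "node_id": node_id,
        "model_name": allocation.model_name,
        "start_layer": allocation.start_layer,
        "end_layer": allocation.end_layer,
    }
```

Keeping the key set fixed in both cases means the worker can read the response without checking which keys are present.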

Worker Side (p2p/server.py)

  • Updated heartbeat announcer thread to capture and log layer allocation from scheduler response
  • Added info-level logging: Heartbeat: Node {peer_id}... Model: {model_name}, Layers: [{start}, {end})
  • Helps with debugging and monitoring node layer assignments
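
The worker-side logging could look roughly like the sketch below. It is not the code in p2p/server.py: `format_heartbeat_line` and `log_heartbeat` are hypothetical helpers, shortening the peer ID to 8 characters before the literal `...` is an assumption, and so is the fallback message for a missing allocation. The `[start, end)` notation is read here as a half-open range.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger("heartbeat_sketch")  # hypothetical logger name


def format_heartbeat_line(peer_id: str, response: Optional[Mapping[str, Any]]) -> str:
    """Render the heartbeat log line in the format given in the PR description."""
    short_id = peer_id[:8]  # assumption: the ID is truncated before "..."
    if not response or response.get("start_layer") is None:
        # Assumption: the scheduler may reply before layers are assigned.
        return f"Heartbeat: Node {short_id}... no layer allocation yet"
    model = response.get("model_name")
    start = response["start_layer"]
    end = response["end_layer"]
    # "[start, end)" marks start as inclusive and end as exclusive.
    return f"Heartbeat: Node {short_id}... Model: {model}, Layers: [{start}, {end})"


def log_heartbeat(peer_id: str, response: Optional[Mapping[str, Any]]) -> None:
    """Emit the heartbeat line at info level, as the PR describes."""
    logger.info(format_heartbeat_line(peer_id, response))
```

Separating the formatting from the logging call keeps the message easy to check without capturing log output.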

Benefits

  • Better observability: Each worker's assigned layers now appear in its own logs
  • Debugging aid: Easier to troubleshoot layer allocation issues
  • Real-time monitoring: Layer assignments are visible in heartbeat logs every 10 seconds

Testing

  • Code compiles without errors
  • Commit message follows conventional commits format
  • Manual testing of heartbeat logging (pending)

Related Issues

@JasonOE JasonOE requested a review from gufengc October 31, 2025 12:20
- Return layer allocation in node_update RPC response
- Log layer information (start_layer, end_layer, model_name) in worker heartbeat
- Helps with debugging and monitoring node layer assignments
@JasonOE JasonOE changed the title from "Heart add layers return" to "feat(p2p): add layer allocation logging in heartbeat" Oct 31, 2025
@JasonOE JasonOE requested a review from sl-gn October 31, 2025 14:41
@JasonOE JasonOE requested review from sl-gn and removed request for sl-gn November 3, 2025 07:56
@JasonOE JasonOE merged commit ddd3747 into main Nov 3, 2025
3 checks passed
@JasonOE JasonOE deleted the mini-pr branch November 3, 2025 08:00