
Add topology-aware scheduling (tree + block)#80

Merged
powderluv merged 3 commits into main from users/powderluv/topology-scheduling
Apr 14, 2026

Conversation

@powderluv
Collaborator

Summary

  • Implements topology/tree and topology/block scheduling modes for locality-aware multi-node job placement
  • Closes #76 ("[Feature]: topology/tree plz", topology/tree for fat-tree fabrics) and #77 ("[Feature]: topology/block needed for mi450", topology/block for rack co-location)
  • When configured with a switch hierarchy, the backfill scheduler preferentially selects nodes from the same switch (or closest switches) for multi-node jobs

Changes

  • New spur-core/src/topology.rs: Switch, TopologyTree with distance computation, switch grouping, and greedy locality-aware node selection (sketched after this list)
  • Config: [topology] section with plugin ("tree"/"block"/"none"), [[topology.switches]] definitions, and block_size
  • Data model: switch_name on Node, topology on JobSpec
  • Scheduler: backfill reorders candidates by topology locality when --topology=tree or --topology=block
  • CLI: --topology flag for sbatch
  • Proto: topology field 59 on JobSpec, switch_name field 40 on NodeInfo
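
The selection sketched below is a minimal illustration of the greedy locality-aware pass described above, assuming a flattened parent map and a node-to-leaf-switch map; the actual TopologyTree API in spur-core/src/topology.rs may differ.

use std::collections::HashMap;

/// Leaf-switch topology: just enough structure to group candidates and
/// rank switches by distance.
struct TopologyTree {
    /// switch name -> parent switch name (roots have no entry)
    parent: HashMap<String, String>,
    /// node name -> leaf switch name
    node_switch: HashMap<String, String>,
}

impl TopologyTree {
    /// Ancestor chain of a switch, starting at the switch itself.
    fn ancestors(&self, switch: &str) -> Vec<String> {
        let mut chain = vec![switch.to_string()];
        let mut cur = switch;
        while let Some(p) = self.parent.get(cur) {
            chain.push(p.clone());
            cur = p.as_str();
        }
        chain
    }

    /// Hop count between two leaf switches via their lowest common ancestor.
    fn distance(&self, a: &str, b: &str) -> usize {
        let (ca, cb) = (self.ancestors(a), self.ancestors(b));
        for (i, anc) in ca.iter().enumerate() {
            if let Some(j) = cb.iter().position(|x| x == anc) {
                return i + j;
            }
        }
        usize::MAX // disjoint fabrics sort last
    }

    /// Pick `want` nodes, preferring a single leaf switch, then the closest
    /// neighbouring switches.
    fn select_nodes<'a>(&self, candidates: &[&'a str], want: usize) -> Vec<&'a str> {
        if want == 0 || candidates.is_empty() {
            return Vec::new();
        }
        // Group candidates by their leaf switch.
        let mut by_switch: HashMap<&str, Vec<&'a str>> = HashMap::new();
        for &n in candidates {
            let sw = self.node_switch.get(n).map(String::as_str).unwrap_or("");
            by_switch.entry(sw).or_default().push(n);
        }
        // 1) Any single switch that can satisfy the whole request wins.
        if let Some((_, nodes)) = by_switch.iter().find(|(_, v)| v.len() >= want) {
            return nodes[..want].to_vec();
        }
        // 2) Otherwise seed from the largest group and greedily add the
        //    closest remaining switches until the request is filled.
        let mut groups: Vec<(&str, Vec<&'a str>)> = by_switch.into_iter().collect();
        groups.sort_by_key(|(_, v)| std::cmp::Reverse(v.len()));
        let (seed, mut picked) = groups.remove(0);
        groups.sort_by_key(|(sw, _)| self.distance(seed, *sw));
        for (_, nodes) in groups {
            for n in nodes {
                if picked.len() == want {
                    return picked;
                }
                picked.push(n);
            }
        }
        picked.truncate(want);
        picked
    }
}

Block mode can reuse the same grouping step, treating fixed-size blocks of block_size nodes in place of leaf switches.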

Configuration example

[topology]
plugin = "tree"

[[topology.switches]]
name = "rack01"
nodes = "gpu[001-018]"

[[topology.switches]]
name = "rack02"
nodes = "gpu[019-036]"

[[topology.switches]]
name = "fabric0"
switches = "rack01,rack02"
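
For orientation, here is a sketch of config structs this section could deserialize into with serde; the names (TopologyConfig, SwitchDef) and optional fields are illustrative and may not match the actual definitions in config.rs.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct TopologyConfig {
    /// "tree", "block", or "none"
    plugin: String,
    /// Only used by the block plugin.
    block_size: Option<u32>,
    #[serde(default)]
    switches: Vec<SwitchDef>,
}

#[derive(Debug, Deserialize)]
struct SwitchDef {
    name: String,
    /// Hostlist expression for leaf switches, e.g. "gpu[001-018]".
    nodes: Option<String>,
    /// Comma-separated child switches for aggregation switches.
    switches: Option<String>,
}

#[derive(Debug, Deserialize)]
struct SpurConfig {
    topology: Option<TopologyConfig>,
}

fn main() {
    let cfg: SpurConfig = toml::from_str(
        r#"
        [topology]
        plugin = "tree"

        [[topology.switches]]
        name = "rack01"
        nodes = "gpu[001-018]"

        [[topology.switches]]
        name = "fabric0"
        switches = "rack01,rack02"
        "#,
    )
    .expect("valid [topology] section");
    println!("{:?}", cfg.topology);
}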

Test plan

  • 10 topology unit tests (distance, grouping, selection, tree/block build)
  • 4 scheduler integration tests (block same-switch, tree same-switch, no-topology default, spanning)
  • Full test suite: 792 tests, 0 failures, 0 regressions

🤖 Generated with Claude Code

powderluv and others added 3 commits April 10, 2026 11:28
Issue #69 (all pods get rank 0):
- peer_nodes contains addr:port strings but target_node is a hostname,
  so starts_with matching always failed, defaulting all ranks to 0
- Fix: derive node_rank from task_offset / tasks_per_node, which the
  dispatcher increments correctly per node
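
A minimal sketch of the corrected derivation, assuming task_offset counts the tasks dispatched before this node (names are placeholders, not the exact fields):

fn node_rank(task_offset: u32, tasks_per_node: u32) -> u32 {
    // e.g. 2 tasks per node: offsets 0,1 -> rank 0; offsets 2,3 -> rank 1
    task_offset / tasks_per_node.max(1)
}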

Issue #70 (nodes always show idle):
- register_node() unconditionally set state=Idle, but the K8s node
  watcher re-registers nodes on every Apply event, resetting
  Allocated/Mixed state back to Idle
- Fix: if node already exists, update connection info and resources
  but preserve current state and allocations
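
A sketch of the preserved-state re-registration described above; NodeEntry, NodeState, and the stored fields are assumptions for illustration rather than the actual data model:

use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState { Idle, Allocated, Mixed, Down }

struct NodeEntry {
    addr: String,
    cpus: u32,
    state: NodeState,
}

struct Registry {
    nodes: HashMap<String, NodeEntry>,
}

impl Registry {
    fn register_node(&mut self, name: &str, addr: String, cpus: u32) {
        match self.nodes.entry(name.to_string()) {
            // Re-registration (e.g. a K8s watcher Apply event): refresh
            // connection info and resources, keep state and allocations.
            Entry::Occupied(mut e) => {
                let existing = e.get_mut();
                existing.addr = addr;
                existing.cpus = cpus;
            }
            // First registration: the node starts out Idle.
            Entry::Vacant(e) => {
                e.insert(NodeEntry { addr, cpus, state: NodeState::Idle });
            }
        }
    }
}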

Co-Authored-By: Claude <noreply@anthropic.com>
Implements three features needed by the spur-cloud GPUaaS platform:

1. exec_in_job in K8s agent: Uses kube Api<Pod>::exec() to run
   commands inside job pods. Enables web terminal access via the
   spur-cloud platform.

2. stream_job_output in K8s agent: Uses Api<Pod>::log_stream() to
   tail pod logs. Enables real-time log viewing in the web UI. (Both
   kube calls are sketched after this list.)

3. Leader election for spurctld: Adds --enable-leader-election flag
   that uses K8s Lease API for HA deployments. Standby replicas
   block until the leader fails to renew, then take over. No-op
   when flag is absent (bare-metal deploys unaffected).
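
The two kube calls from items 1 and 2 roughly look like the sketch below; the helper names, error handling, and pod lookup are placeholders, and only the Api<Pod>::exec / log_stream calls are real kube-rs API (details vary slightly across kube versions):

use futures::{AsyncBufReadExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{AttachParams, LogParams};
use kube::{Api, Client};
use tokio::io::AsyncReadExt;

/// Run a command inside a job's pod and capture its stdout.
async fn exec_in_job(
    client: Client,
    ns: &str,
    pod: &str,
    cmd: Vec<String>,
) -> anyhow::Result<String> {
    let pods: Api<Pod> = Api::namespaced(client, ns);
    let mut attached = pods
        .exec(pod, cmd, &AttachParams::default().stdout(true).stderr(true))
        .await?;
    let mut out = String::new();
    if let Some(mut stdout) = attached.stdout() {
        stdout.read_to_string(&mut out).await?;
    }
    attached.join().await?;
    Ok(out)
}

/// Follow a pod's log stream line by line (backs the web UI log view).
async fn stream_job_output(client: Client, ns: &str, pod: &str) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::namespaced(client, ns);
    let params = LogParams { follow: true, ..LogParams::default() };
    let mut lines = pods.log_stream(pod, &params).await?.lines();
    while let Some(line) = lines.try_next().await? {
        println!("{line}");
    }
    Ok(())
}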

Changes:
- Implement exec_in_job using kube ws exec in spur-k8s agent
- Implement stream_job_output using kube log_stream in spur-k8s agent
- Add leader_election.rs module to spurctld (172 LoC)
- Add --enable-leader-election and --election-namespace CLI flags
- Add ws feature to kube dependency for exec support
- Add kube + k8s-openapi deps to spurctld for Lease API (acquisition loop sketched below)
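
A minimal sketch of the Lease-based acquisition loop mentioned above, assuming the coordination.k8s.io/v1 Lease API via kube server-side apply; flag wiring, the identity string, and the periodic renewal loop in leader_election.rs are not shown and may differ:

use chrono::Utc;
use k8s_openapi::api::coordination::v1::Lease;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::MicroTime;
use kube::api::{Patch, PatchParams};
use kube::{Api, Client};
use std::time::Duration;

const LEASE_DURATION_SECS: i64 = 15;

/// Block until this replica holds the named Lease, then return.
async fn acquire_leadership(
    client: Client,
    namespace: &str,
    lease_name: &str,
    identity: &str,
) -> anyhow::Result<()> {
    let leases: Api<Lease> = Api::namespaced(client, namespace);
    loop {
        let current = leases.get_opt(lease_name).await?;
        let spec = current.as_ref().and_then(|l| l.spec.as_ref());
        let holder = spec.and_then(|s| s.holder_identity.clone());
        // The lease is free if it has never been written, or if the current
        // holder has stopped renewing it within its advertised duration.
        let expired = match spec.and_then(|s| s.renew_time.as_ref()) {
            Some(t) => {
                let dur = spec
                    .and_then(|s| s.lease_duration_seconds)
                    .unwrap_or(LEASE_DURATION_SECS as i32) as i64;
                (Utc::now() - t.0).num_seconds() > dur
            }
            None => true,
        };
        if expired || holder.as_deref() == Some(identity) {
            // Server-side apply creates the Lease if it does not exist and
            // takes over holderIdentity otherwise.
            let body = serde_json::json!({
                "apiVersion": "coordination.k8s.io/v1",
                "kind": "Lease",
                "metadata": { "name": lease_name },
                "spec": {
                    "holderIdentity": identity,
                    "leaseDurationSeconds": LEASE_DURATION_SECS,
                    "renewTime": MicroTime(Utc::now()),
                }
            });
            let params = PatchParams::apply("spurctld-leader-election").force();
            leases.patch(lease_name, &params, &Patch::Apply(&body)).await?;
            return Ok(());
        }
        // Standby replica: wait for the leader to miss its renewal window.
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}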

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement topology/tree and topology/block scheduling modes for
locality-aware multi-node job placement. Closes #76, closes #77.

When configured with a switch hierarchy, the backfill scheduler
groups candidate nodes by their leaf switch and preferentially
selects nodes from the same switch (or closest switches) for
multi-node jobs. This reduces network hops and improves
communication performance for distributed training workloads.

Changes:
- New `topology.rs` module with Switch, TopologyTree, distance
  computation, switch grouping, and locality-aware node selection
- TopologyConfig in config.rs: `[topology]` section with plugin
  ("tree"/"block"), switch definitions, and block_size
- `switch_name` field on Node, `topology` field on JobSpec
- Backfill scheduler reorders candidates by topology locality
  when job.spec.topology is "tree" or "block"
- `--topology` CLI flag for sbatch
- Proto updates: topology field on JobSpec, switch_name on NodeInfo
- 4 new scheduling tests + 10 topology unit tests (792 total, 0 failures)

Example configuration:
  [topology]
  plugin = "tree"
  [[topology.switches]]
  name = "rack01"
  nodes = "gpu[001-018]"
  [[topology.switches]]
  name = "rack02"
  nodes = "gpu[019-036]"
  [[topology.switches]]
  name = "fabric0"
  switches = "rack01,rack02"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@powderluv merged commit 9d387ae into main on Apr 14, 2026
5 checks passed
