
Add topology-aware scheduling (tree + block)#80

Merged
powderluv merged 3 commits into main from users/powderluv/topology-scheduling
Apr 14, 2026

Conversation

@powderluv
Collaborator

Summary

  • Implements topology/tree and topology/block scheduling modes for locality-aware multi-node job placement
  • Closes #76 ("[Feature]: topology/tree plz", topology/tree for fat-tree fabrics) and #77 ("[Feature]: topology/block needed for mi450", topology/block for rack co-location)
  • When configured with a switch hierarchy, the backfill scheduler preferentially selects nodes from the same switch (or closest switches) for multi-node jobs

Changes

  • New spur-core/src/topology.rs: Switch, TopologyTree with distance computation, switch grouping, and greedy locality-aware node selection (sketched after this list)
  • Config: [topology] section with plugin ("tree"/"block"/"none"), [[topology.switches]] definitions, and block_size
  • Data model: switch_name on Node, topology on JobSpec
  • Scheduler: backfill reorders candidates by topology locality when --topology=tree or --topology=block
  • CLI: --topology flag for sbatch
  • Proto: topology field 59 on JobSpec, switch_name field 40 on NodeInfo
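
The selection sketched below is a minimal illustration of the greedy locality-aware pass described above, assuming a flattened parent map and a node-to-leaf-switch map; the actual TopologyTree API in spur-core/src/topology.rs may differ.

use std::collections::HashMap;

/// Leaf-switch topology: just enough structure to group candidates and
/// rank switches by distance.
struct TopologyTree {
    /// switch name -> parent switch name (roots have no entry)
    parent: HashMap<String, String>,
    /// node name -> leaf switch name
    node_switch: HashMap<String, String>,
}

impl TopologyTree {
    /// Ancestor chain of a switch, starting at the switch itself.
    fn ancestors(&self, switch: &str) -> Vec<String> {
        let mut chain = vec![switch.to_string()];
        let mut cur = switch;
        while let Some(p) = self.parent.get(cur) {
            chain.push(p.clone());
            cur = p.as_str();
        }
        chain
    }

    /// Hop count between two leaf switches via their lowest common ancestor.
    fn distance(&self, a: &str, b: &str) -> usize {
        let (ca, cb) = (self.ancestors(a), self.ancestors(b));
        for (i, anc) in ca.iter().enumerate() {
            if let Some(j) = cb.iter().position(|x| x == anc) {
                return i + j;
            }
        }
        usize::MAX // disjoint fabrics sort last
    }

    /// Pick `want` nodes, preferring a single leaf switch, then the closest
    /// neighbouring switches.
    fn select_nodes<'a>(&self, candidates: &[&'a str], want: usize) -> Vec<&'a str> {
        if want == 0 || candidates.is_empty() {
            return Vec::new();
        }
        // Group candidates by their leaf switch.
        let mut by_switch: HashMap<&str, Vec<&'a str>> = HashMap::new();
        for &n in candidates {
            let sw = self.node_switch.get(n).map(String::as_str).unwrap_or("");
            by_switch.entry(sw).or_default().push(n);
        }
        // 1) Any single switch that can satisfy the whole request wins.
        if let Some((_, nodes)) = by_switch.iter().find(|(_, v)| v.len() >= want) {
            return nodes[..want].to_vec();
        }
        // 2) Otherwise seed from the largest group and greedily add the
        //    closest remaining switches until the request is filled.
        let mut groups: Vec<(&str, Vec<&'a str>)> = by_switch.into_iter().collect();
        groups.sort_by_key(|(_, v)| std::cmp::Reverse(v.len()));
        let (seed, mut picked) = groups.remove(0);
        groups.sort_by_key(|(sw, _)| self.distance(seed, *sw));
        for (_, nodes) in groups {
            for n in nodes {
                if picked.len() == want {
                    return picked;
                }
                picked.push(n);
            }
        }
        picked.truncate(want);
        picked
    }
}

Block mode can reuse the same grouping step, treating fixed-size blocks of block_size nodes in place of leaf switches.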

Configuration example

[topology]
plugin = "tree"

[[topology.switches]]
name = "rack01"
nodes = "gpu[001-018]"

[[topology.switches]]
name = "rack02"
nodes = "gpu[019-036]"

[[topology.switches]]
name = "fabric0"
switches = "rack01,rack02"
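
For orientation, here is a sketch of config structs this section could deserialize into with serde; the names (TopologyConfig, SwitchDef) and optional fields are illustrative and may not match the actual definitions in config.rs.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct TopologyConfig {
    /// "tree", "block", or "none"
    plugin: String,
    /// Only used by the block plugin.
    block_size: Option<u32>,
    #[serde(default)]
    switches: Vec<SwitchDef>,
}

#[derive(Debug, Deserialize)]
struct SwitchDef {
    name: String,
    /// Hostlist expression for leaf switches, e.g. "gpu[001-018]".
    nodes: Option<String>,
    /// Comma-separated child switches for aggregation switches.
    switches: Option<String>,
}

#[derive(Debug, Deserialize)]
struct SpurConfig {
    topology: Option<TopologyConfig>,
}

fn main() {
    let cfg: SpurConfig = toml::from_str(
        r#"
        [topology]
        plugin = "tree"

        [[topology.switches]]
        name = "rack01"
        nodes = "gpu[001-018]"

        [[topology.switches]]
        name = "fabric0"
        switches = "rack01,rack02"
        "#,
    )
    .expect("valid [topology] section");
    println!("{:?}", cfg.topology);
}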

Test plan

  • 10 topology unit tests (distance, grouping, selection, tree/block build)
  • 4 scheduler integration tests (block same-switch, tree same-switch, no-topology default, spanning)
  • Full test suite: 792 tests, 0 failures, 0 regressions

🤖 Generated with Claude Code

powderluv and others added 3 commits April 10, 2026 11:28
Issue #69 (all pods get rank 0):
- peer_nodes contains addr:port strings but target_node is a hostname,
  so starts_with matching always failed, defaulting all ranks to 0
- Fix: derive node_rank from task_offset / tasks_per_node, which the
  dispatcher increments correctly per node
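
A minimal sketch of the corrected derivation, assuming task_offset counts the tasks dispatched before this node (names are placeholders, not the exact fields):

fn node_rank(task_offset: u32, tasks_per_node: u32) -> u32 {
    // e.g. 2 tasks per node: offsets 0,1 -> rank 0; offsets 2,3 -> rank 1
    task_offset / tasks_per_node.max(1)
}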

Issue #70 (nodes always show idle):
- register_node() unconditionally set state=Idle, but the K8s node
  watcher re-registers nodes on every Apply event, resetting
  Allocated/Mixed state back to Idle
- Fix: if node already exists, update connection info and resources
  but preserve current state and allocations
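
A sketch of the preserved-state re-registration described above; NodeEntry, NodeState, and the stored fields are assumptions for illustration rather than the actual data model:

use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState { Idle, Allocated, Mixed, Down }

struct NodeEntry {
    addr: String,
    cpus: u32,
    state: NodeState,
}

struct Registry {
    nodes: HashMap<String, NodeEntry>,
}

impl Registry {
    fn register_node(&mut self, name: &str, addr: String, cpus: u32) {
        match self.nodes.entry(name.to_string()) {
            // Re-registration (e.g. a K8s watcher Apply event): refresh
            // connection info and resources, keep state and allocations.
            Entry::Occupied(mut e) => {
                let existing = e.get_mut();
                existing.addr = addr;
                existing.cpus = cpus;
            }
            // First registration: the node starts out Idle.
            Entry::Vacant(e) => {
                e.insert(NodeEntry { addr, cpus, state: NodeState::Idle });
            }
        }
    }
}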

Co-Authored-By: Claude <noreply@anthropic.com>
Implements three features needed by the spur-cloud GPUaaS platform:

1. exec_in_job in K8s agent: Uses kube Api<Pod>::exec() to run
   commands inside job pods. Enables web terminal access via the
   spur-cloud platform.

2. stream_job_output in K8s agent: Uses Api<Pod>::log_stream() to
   tail pod logs. Enables real-time log viewing in the web UI. (Both
   kube calls are sketched after this list.)

3. Leader election for spurctld: Adds --enable-leader-election flag
   that uses K8s Lease API for HA deployments. Standby replicas
   block until the leader fails to renew, then take over. No-op
   when flag is absent (bare-metal deploys unaffected).
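
The two kube calls from items 1 and 2 roughly look like the sketch below; the helper names, error handling, and pod lookup are placeholders, and only the Api<Pod>::exec / log_stream calls are real kube-rs API (details vary slightly across kube versions):

use futures::{AsyncBufReadExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{AttachParams, LogParams};
use kube::{Api, Client};
use tokio::io::AsyncReadExt;

/// Run a command inside a job's pod and capture its stdout.
async fn exec_in_job(
    client: Client,
    ns: &str,
    pod: &str,
    cmd: Vec<String>,
) -> anyhow::Result<String> {
    let pods: Api<Pod> = Api::namespaced(client, ns);
    let mut attached = pods
        .exec(pod, cmd, &AttachParams::default().stdout(true).stderr(true))
        .await?;
    let mut out = String::new();
    if let Some(mut stdout) = attached.stdout() {
        stdout.read_to_string(&mut out).await?;
    }
    attached.join().await?;
    Ok(out)
}

/// Follow a pod's log stream line by line (backs the web UI log view).
async fn stream_job_output(client: Client, ns: &str, pod: &str) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::namespaced(client, ns);
    let params = LogParams { follow: true, ..LogParams::default() };
    let mut lines = pods.log_stream(pod, &params).await?.lines();
    while let Some(line) = lines.try_next().await? {
        println!("{line}");
    }
    Ok(())
}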

Changes:
- Implement exec_in_job using kube ws exec in spur-k8s agent
- Implement stream_job_output using kube log_stream in spur-k8s agent
- Add leader_election.rs module to spurctld (172 LoC)
- Add --enable-leader-election and --election-namespace CLI flags
- Add ws feature to kube dependency for exec support
- Add kube + k8s-openapi deps to spurctld for Lease API (acquisition loop sketched below)
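
A minimal sketch of the Lease-based acquisition loop mentioned above, assuming the coordination.k8s.io/v1 Lease API via kube server-side apply; flag wiring, the identity string, and the periodic renewal loop in leader_election.rs are not shown and may differ:

use chrono::Utc;
use k8s_openapi::api::coordination::v1::Lease;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::MicroTime;
use kube::api::{Patch, PatchParams};
use kube::{Api, Client};
use std::time::Duration;

const LEASE_DURATION_SECS: i64 = 15;

/// Block until this replica holds the named Lease, then return.
async fn acquire_leadership(
    client: Client,
    namespace: &str,
    lease_name: &str,
    identity: &str,
) -> anyhow::Result<()> {
    let leases: Api<Lease> = Api::namespaced(client, namespace);
    loop {
        let current = leases.get_opt(lease_name).await?;
        let spec = current.as_ref().and_then(|l| l.spec.as_ref());
        let holder = spec.and_then(|s| s.holder_identity.clone());
        // The lease is free if it has never been written, or if the current
        // holder has stopped renewing it within its advertised duration.
        let expired = match spec.and_then(|s| s.renew_time.as_ref()) {
            Some(t) => {
                let dur = spec
                    .and_then(|s| s.lease_duration_seconds)
                    .unwrap_or(LEASE_DURATION_SECS as i32) as i64;
                (Utc::now() - t.0).num_seconds() > dur
            }
            None => true,
        };
        if expired || holder.as_deref() == Some(identity) {
            // Server-side apply creates the Lease if it does not exist and
            // takes over holderIdentity otherwise.
            let body = serde_json::json!({
                "apiVersion": "coordination.k8s.io/v1",
                "kind": "Lease",
                "metadata": { "name": lease_name },
                "spec": {
                    "holderIdentity": identity,
                    "leaseDurationSeconds": LEASE_DURATION_SECS,
                    "renewTime": MicroTime(Utc::now()),
                }
            });
            let params = PatchParams::apply("spurctld-leader-election").force();
            leases.patch(lease_name, &params, &Patch::Apply(&body)).await?;
            return Ok(());
        }
        // Standby replica: wait for the leader to miss its renewal window.
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}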

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement topology/tree and topology/block scheduling modes for
locality-aware multi-node job placement. Closes #76, closes #77.

When configured with a switch hierarchy, the backfill scheduler
groups candidate nodes by their leaf switch and preferentially
selects nodes from the same switch (or closest switches) for
multi-node jobs. This reduces network hops and improves
communication performance for distributed training workloads.

Changes:
- New `topology.rs` module with Switch, TopologyTree, distance
  computation, switch grouping, and locality-aware node selection
- TopologyConfig in config.rs: `[topology]` section with plugin
  ("tree"/"block"), switch definitions, and block_size
- `switch_name` field on Node, `topology` field on JobSpec
- Backfill scheduler reorders candidates by topology locality
  when job.spec.topology is "tree" or "block"
- `--topology` CLI flag for sbatch
- Proto updates: topology field on JobSpec, switch_name on NodeInfo
- 4 new scheduling tests + 10 topology unit tests (792 total, 0 failures)

Example configuration:
  [topology]
  plugin = "tree"
  [[topology.switches]]
  name = "rack01"
  nodes = "gpu[001-018]"
  [[topology.switches]]
  name = "rack02"
  nodes = "gpu[019-036]"
  [[topology.switches]]
  name = "fabric0"
  switches = "rack01,rack02"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@powderluv merged commit 9d387ae into main on Apr 14, 2026
5 checks passed
