
Multi-node Slurm feature tests + nodelist/exclude/time-limit fixes#2

Merged
powderluv merged 9 commits into main from
users/powderluv/multi-node-test-coverage
Mar 19, 2026
Conversation

@powderluv
Collaborator

Summary

  • Adds 35 new integration tests to cluster_test.sh (31 → 66 total) covering the full Slurm multi-node feature surface
  • Fixes two scheduler gaps the new tests exposed

New test groups

| Group | Tests | What's verified |
| --- | --- | --- |
| Failed job detection | 3 | exit non-zero → state=F, output captured |
| Custom I/O paths | 5 | `-o`/`-e` flags, `%j` substitution |
| Env var passthrough | 2 | `--export=VAR` reaches job |
| Node selection | 5 | `--nodelist` pins; `--exclude` avoids |
| Concurrent scheduling | 5 | both nodes used simultaneously |
| Distributed env var correctness | 5 | RANK 0/1, WORLD_SIZE, MASTER_ADDR on both nodes |
| Job hold/release | 4 | `-H` → PD; `scontrol release` → runs |
| Job dependencies | 4 | `afterok` keeps B pending until A completes |
| Time limit enforcement | 3 | `--time=0:00:10` job killed within 25s |

Fixes

backfill.rs — `find_suitable_nodes` now respects `--nodelist` and `--exclude`. Both are comma-separated node names: jobs with a nodelist can only land on listed nodes, and excluded nodes are never considered. Five unit tests added.
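The filtering can be sketched as a per-node predicate run inside the scheduling loop. This is a minimal sketch: the function name and `Option<&str>` signature are assumptions; only `find_suitable_nodes` and the comma-separated semantics come from the PR.

```rust
use std::collections::HashSet;

/// Hypothetical sketch of the nodelist/exclude check added to
/// `find_suitable_nodes` (field and type names are assumptions).
fn node_allowed(node: &str, nodelist: Option<&str>, exclude: Option<&str>) -> bool {
    // --nodelist: if present, the job may only land on listed nodes.
    if let Some(list) = nodelist {
        let pinned: HashSet<&str> = list.split(',').map(str::trim).collect();
        if !pinned.contains(node) {
            return false;
        }
    }
    // --exclude: excluded nodes are never considered.
    if let Some(list) = exclude {
        if list.split(',').map(str::trim).any(|n| n == node) {
            return false;
        }
    }
    true
}
```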

scheduler_loop.rs — added an `enforce_time_limits` watchdog (10s interval). When `now > start_time + time_limit`, it marks the job as Timeout and fires `CancelJob` RPCs to all agents holding the job.

Test plan

  • CI cluster job passes all 66 tests on mi300 + mi300-2

🤖 Generated with Claude Code

powderluv and others added 2 commits March 18, 2026 17:59
Adds comprehensive integration tests for Slurm features that matter for
multi-node GPU workloads, and fixes the two gaps the tests exposed.

## New integration tests (cluster_test.sh, +35 tests → 66 total)

- **Failed job detection**: exit non-zero → state=F, output captured
- **Custom I/O paths**: -o/-e flags, %j job-ID substitution in paths
- **Env var passthrough**: --export=VAR1,VAR2 reaches the job
- **Node selection**: --nodelist pins to named node; --exclude avoids it
- **Concurrent scheduling**: two jobs run simultaneously on both nodes
- **Distributed env var correctness**: RANK 0/1, WORLD_SIZE=2, MASTER_ADDR
  and MASTER_PORT verified on both nodes (checks mi300-2 output via SSH)
- **Job hold/release**: -H keeps job in PD; scontrol release unblocks it
- **Job dependencies**: afterok:A keeps B pending until A completes
- **Time limit enforcement**: --time=0:00:10 job killed within 25s

## Gap fixes

### backfill.rs — honour --nodelist / --exclude
`find_suitable_nodes` only checked partition and resource capacity;
`job.spec.nodelist` and `job.spec.exclude` (comma-separated node names)
were parsed by sbatch and stored but never consulted during scheduling.
Added pre-loop parsing of both constraints and per-node filter checks.
Five new unit tests cover: single-node pin, multi-node pin, no-match →
unschedulable, partial-exclude, and full-exclude → unschedulable.

### scheduler_loop.rs — wall-clock time limit enforcement
`time_limit` was used only by the backfill for slot duration estimation;
running jobs were never killed when they exceeded it.  Added
`enforce_time_limits`, a 10-second watchdog loop that:
  1. Scans running jobs for `now > start_time + time_limit`
  2. Calls `complete_job(..., Timeout)` to free resources immediately
  3. Fires `CancelJob` RPCs to all agents holding processes for the job
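The scan in step 1 hinges on a simple timeout predicate; a minimal sketch (the standalone function shape is an assumption — only the `now > start_time + time_limit` rule comes from the commit):

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical predicate behind the `enforce_time_limits` watchdog:
/// a running job has timed out once now > start_time + time_limit.
/// The watchdog would evaluate this for every Running job each tick.
fn is_timed_out(start_time: SystemTime, time_limit: Duration, now: SystemTime) -> bool {
    match now.duration_since(start_time) {
        Ok(elapsed) => elapsed > time_limit,
        // start_time in the future (clock skew): don't kill the job.
        Err(_) => false,
    }
}
```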

Changes:
- crates/spur-sched/src/backfill.rs: nodelist/exclude filtering + 5 unit tests
- crates/spurctld/src/scheduler_loop.rs: enforce_time_limits watchdog
- deploy/bare-metal/cluster_test.sh: 35 new tests across 9 feature groups

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds a new CI test that exercises tensor-parallel LLM inference across
both MI300X nodes. Each node runs an independent 8-GPU TP group (intra-
node RCCL), matching the serving pattern used by vLLM/Megatron-LM.

## New files

deploy/bare-metal/inference_test.py
  Pure-PyTorch TP inference benchmark modelling one LLaMA-3-8B-style
  decoder layer (hidden=4096, ffn=14336, 32 heads, SwiGLU FFN) sharded
  across all available GPUs with column/row-parallel linear layers and
  RCCL all-reduce.  Reports throughput (tok/s), peak GPU memory, and
  prints INFERENCE_OK on completion.  Port is derived from SPUR_JOB_ID
  to avoid conflicts between concurrent runs.

deploy/bare-metal/inference_job.sh
  Spur job wrapper (activates venv, runs inference_test.py).

## Changes

deploy/bare-metal/cluster_test.sh
  New test group 17 (5 tests): 2-node inference job submitted,
  both nodes complete within 3m, both print INFERENCE_OK,
  throughput reported on both, output finite (no NaN/Inf).

.github/workflows/ci.yml
  "Deploy fresh binaries" step now also rsyncs inference_test.py,
  inference_job.sh, distributed_test.py, distributed_job.sh to both
  nodes so test scripts always match the repo.

Verified: 63 tok/s on mi300 (8x MI300X, TP=8).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@powderluv powderluv force-pushed the users/powderluv/multi-node-test-coverage branch from 77b5be6 to c111de9 Compare March 19, 2026 01:00
powderluv and others added 7 commits March 18, 2026 18:47
Two bugs caused CI on PR #2 to fail starting at test 29:

1. `wait_job` and `wait_final_state` returned exit code 1 on timeout.
   With `set -euo pipefail`, bare `wait_job <id> 30` calls abort the
   entire script when the timeout fires. Remove the `return 1` in both
   helpers — timeout is already communicated via stdout echo.

2. `echo "RAN_ON=$(hostname)"` in the nodelist job script always prints
   "guest" because both MI300X nodes have the same VM hostname. Use
   `${SPUR_TARGET_NODE:-$(hostname)}` instead; Spur injects
   SPUR_TARGET_NODE with the registered node name (mi300 / mi300-2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1. executor.rs: create work_dir before script write so jobs dispatched
   to a node that doesn't have the submitter's CWD don't fail with
   ENOENT. The directory is created best-effort (ok() swallows errors).

2. scheduler_loop.rs: on dispatch failure, call complete_job(..., Failed)
   so jobs don't stay permanently in Running state (ghost jobs that
   block the scheduler from re-using node slots).

3. cluster_test.sh:
   - Add `|| true` to grep pipelines at TEST 42 MASTER_ADDR check so
     set -o pipefail doesn't abort the script when grep matches nothing.
   - Pin tests 8/9/10 (exitfail, custom-io, %j subst, env passthrough)
     to `-w mi300` since they check output files on the local filesystem
     only (no remote_out fallback).
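The best-effort directory creation in item 1 can be sketched as follows (the helper name is hypothetical; the `ok()`-swallows-errors semantics come from the commit):

```rust
use std::fs;
use std::path::Path;

/// Sketch of the executor change: create the job's work_dir before
/// writing the batch script. Creation is best-effort — a failure is
/// swallowed so dispatch can still proceed (matching the commit's
/// use of `.ok()` on the result).
fn ensure_work_dir(work_dir: &Path) {
    let _ = fs::create_dir_all(work_dir);
}
```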

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When awk finds the target job and calls exit, any process feeding into
awk via a pipe gets SIGPIPE. With only a few jobs in the queue this
was harmless (tail had no remaining lines to write). By test 20 the
queue grows to ~7 entries and tail gets SIGPIPE on the unread lines,
exits 141, and set -o pipefail kills the script.

Fix: replace `tail -n +2 | awk` with a single awk invocation that
skips the header via `NR == 1 { next }`. Also add `|| true` after the
awk pipeline so that the SIGPIPE from squeue (when awk exits early)
doesn't propagate as a pipefail exit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. executor.rs: fall back to /tmp when work_dir cannot be created on
   the agent node (e.g. /home/runner/work/ doesn't exist on worker
   nodes). Previously create_dir_all failure was silently swallowed
   then the script write failed with ENOENT, causing dispatch_to_agent
   to return an error and the scheduler to mark the job as Failed.
   Now jobs run from /tmp with absolute output paths unaffected.

2. config.rs: add parse_time_seconds() with full second granularity.
   parse_time_minutes("0:00:10") rounded up to 1 minute, making the
   time-limit enforcement test (which submits -t 0:00:10) wait 60s
   for the kill instead of 10s. The sbatch CLI now calls
   parse_time_seconds() directly so "0:00:10" → 10 seconds correctly.

3. cluster_test.sh:
   - Time limit poll: extend from 25s to 30s to give the 10s enforcer
     interval margin (enforcer fires every 10s, limit is 10s → kill
     within 20s + margin).
   - Inference timeout: extend from 3m to 10m; RCCL/PyTorch init on
     8 GPUs can take several minutes before the benchmark runs.
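A minimal sketch of a seconds-granularity parser like the one item 2 describes (the real `parse_time_seconds` signature is an assumption; Slurm's day-prefixed `D-HH:MM:SS` forms and bare-minutes handling are omitted here for brevity):

```rust
/// Hypothetical sketch: parse H:MM:SS / M:SS strings into whole
/// seconds with no rounding up, unlike the old parse_time_minutes,
/// which turned "0:00:10" into a full minute.
fn parse_time_seconds(s: &str) -> Option<u64> {
    let parts: Vec<&str> = s.split(':').collect();
    if parts.len() > 3 {
        return None;
    }
    let mut total: u64 = 0;
    for p in &parts {
        // Each colon-separated field shifts the total by a factor of 60.
        total = total * 60 + p.parse::<u64>().ok()?;
    }
    Some(total)
}
```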

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three targeted fixes for the remaining CI failures (TEST 51, 53-54):

1. agent_server.rs: Release the running-jobs mutex BEFORE calling
   report_completion. Previously the lock was held during the gRPC
   call, so a transient network failure would permanently lose the
   completion (job removed from map, never reported). Also adds
   3-attempt retry with 1s backoff.

2. scheduler_loop.rs: Remove complete_job(Failed) from the
   dispatch-failure handler. Marking a job as Failed when dispatch
   fails breaks afterok dependencies — the dependent job sees
   DependencyResult::Failed and is never scheduled.

3. ci.yml: Clear WAL state directory between CI runs. Ghost Running
   jobs from previous runs were being replayed on controller restart,
   polluting the cluster with stale state.
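The lock-ordering fix in item 1 can be sketched with simplified types (the map payload and transport closure are stand-ins, and a 10 ms backoff replaces the real 1 s one so the sketch runs quickly; only the remove-then-release-then-report ordering and the 3-attempt retry come from the commit):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Take the job out of the running map while holding the lock only
/// briefly, then report completion with the lock RELEASED, retrying
/// up to 3 times so a transient network failure doesn't lose it.
fn report_with_retry(
    running: &Arc<Mutex<HashMap<u64, String>>>,
    job_id: u64,
    send: impl Fn(&str) -> Result<(), String>,
) -> Result<(), String> {
    // Lock guard is a temporary: dropped at the end of this statement.
    let job = running.lock().unwrap().remove(&job_id);
    let job = job.ok_or_else(|| "unknown job".to_string())?;
    let mut last_err = String::new();
    for attempt in 0..3 {
        match send(&job) {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                if attempt < 2 {
                    thread::sleep(Duration::from_millis(10)); // 1s in the real fix
                }
            }
        }
    }
    Err(last_err)
}
```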

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spurd uses the GetNode RPC as a lightweight heartbeat (every 30s),
but the controller's get_node handler never called update_heartbeat.
After 90 seconds the health checker marked both nodes DOWN, causing
the scheduler to stop dispatching jobs.

This explains why tests 1-47 pass (within 90s of cluster start) but
tests 48+ fail (nodes are DOWN by then).
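A minimal sketch of the fix (the registry structure and method names are simplified stand-ins; the 30s poll and 90s DOWN threshold come from the commit):

```rust
use std::collections::HashMap;
use std::time::Instant;

struct NodeRegistry {
    last_seen: HashMap<String, Instant>,
}

impl NodeRegistry {
    /// The fix: since spurd polls GetNode every 30s as its heartbeat,
    /// the handler must refresh the node's last-seen timestamp, or the
    /// health checker's 90s threshold marks the node DOWN.
    fn touch_on_get_node(&mut self, name: &str) {
        self.last_seen.insert(name.to_string(), Instant::now());
    }

    /// Health-checker view: a node is DOWN if never seen or stale.
    fn is_down(&self, name: &str, threshold_secs: u64) -> bool {
        match self.last_seen.get(name) {
            Some(t) => t.elapsed().as_secs() > threshold_secs,
            None => true,
        }
    }
}
```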

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@powderluv powderluv merged commit 1cefc91 into main Mar 19, 2026
4 checks passed
@powderluv powderluv deleted the users/powderluv/multi-node-test-coverage branch March 19, 2026 05:56