Adds comprehensive integration tests for Slurm features that matter for multi-node GPU workloads, and fixes the two gaps the tests exposed.

## New integration tests (cluster_test.sh, +35 tests → 66 total)

- **Failed job detection**: exit non-zero → state=F, output captured
- **Custom I/O paths**: -o/-e flags, %j job-ID substitution in paths
- **Env var passthrough**: --export=VAR1,VAR2 reaches the job
- **Node selection**: --nodelist pins to named node; --exclude avoids it
- **Concurrent scheduling**: two jobs run simultaneously on both nodes
- **Distributed env var correctness**: RANK 0/1, WORLD_SIZE=2, MASTER_ADDR and MASTER_PORT verified on both nodes (checks mi300-2 output via SSH)
- **Job hold/release**: -H keeps job in PD; scontrol release unblocks it
- **Job dependencies**: afterok:A keeps B pending until A completes
- **Time limit enforcement**: --time=0:00:10 job killed within 25s

## Gap fixes

### backfill.rs — honour --nodelist / --exclude

`find_suitable_nodes` only checked partition and resource capacity; `job.spec.nodelist` and `job.spec.exclude` (comma-separated node names) were parsed by sbatch and stored but never consulted during scheduling. Added pre-loop parsing of both constraints and per-node filter checks. Five new unit tests cover: single-node pin, multi-node pin, no-match → unschedulable, partial-exclude, and full-exclude → unschedulable.

### scheduler_loop.rs — wall-clock time limit enforcement

`time_limit` was used only by the backfill for slot duration estimation; running jobs were never killed when they exceeded it. Added `enforce_time_limits`, a 10-second watchdog loop that:

1. Scans running jobs for `now > start_time + time_limit`
2. Calls `complete_job(..., Timeout)` to free resources immediately
3. Fires `CancelJob` RPCs to all agents holding processes for the job

Changes:
- crates/spur-sched/src/backfill.rs: nodelist/exclude filtering + 5 unit tests
- crates/spurctld/src/scheduler_loop.rs: enforce_time_limits watchdog
- deploy/bare-metal/cluster_test.sh: 35 new tests across 9 feature groups

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
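The check in watchdog step 1 is plain wall-clock arithmetic. A minimal shell restatement of the predicate, with an illustrative function name and values (the real implementation lives in Rust in scheduler_loop.rs):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A running job is over its wall-clock limit once now > start_time + time_limit
# (all in epoch seconds). Values below are illustrative.
is_over_limit() {
  local start_time=$1 time_limit=$2 now=$3
  (( now > start_time + time_limit ))
}

start=1000; limit=10
if is_over_limit "$start" "$limit" 1005; then echo killed; else echo running; fi   # running
if is_over_limit "$start" "$limit" 1021; then echo killed; else echo running; fi   # killed
```

Because the watchdog only fires every 10 seconds, a job can overrun its limit by up to one full interval before being cancelled, which is why the test polls well past the nominal limit.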
Adds a new CI test that exercises tensor-parallel LLM inference across both MI300X nodes. Each node runs an independent 8-GPU TP group (intra-node RCCL), matching the serving pattern used by vLLM/Megatron-LM.

## New files

deploy/bare-metal/inference_test.py
Pure-PyTorch TP inference benchmark modelling one LLaMA-3-8B-style decoder layer (hidden=4096, ffn=14336, 32 heads, SwiGLU FFN) sharded across all available GPUs with column/row-parallel linear layers and RCCL all-reduce. Reports throughput (tok/s), peak GPU memory, and prints INFERENCE_OK on completion. Port is derived from SPUR_JOB_ID to avoid conflicts between concurrent runs.

deploy/bare-metal/inference_job.sh
Spur job wrapper (activates venv, runs inference_test.py).

## Changes

deploy/bare-metal/cluster_test.sh
New test group 17 (5 tests): 2-node inference job submitted, both nodes complete within 3m, both print INFERENCE_OK, throughput reported on both, output finite (no NaN/Inf).

.github/workflows/ci.yml
"Deploy fresh binaries" step now also rsyncs inference_test.py, inference_job.sh, distributed_test.py, distributed_job.sh to both nodes so test scripts always match the repo.

Verified: 63 tok/s on mi300 (8x MI300X, TP=8).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
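Deriving the rendezvous port from SPUR_JOB_ID can be sketched like this; the base port and modulus are illustrative assumptions, not necessarily the values inference_test.py uses:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch: offset a base port by the job ID so two jobs that
# run concurrently on the same node pick different rendezvous ports.
# 29500 is the conventional torch.distributed default; the modulus keeps
# the result inside a bounded range.
SPUR_JOB_ID=${SPUR_JOB_ID:-0}
MASTER_PORT=$(( 29500 + SPUR_JOB_ID % 1000 ))
echo "MASTER_PORT=$MASTER_PORT"
```

Two jobs with consecutive IDs land on adjacent ports, so concurrent scheduling (test group above) never produces a bind conflict.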
Two bugs caused CI on PR #2 to fail starting at test 29:

1. `wait_job` and `wait_final_state` returned exit code 1 on timeout. With `set -euo pipefail`, bare `wait_job <id> 30` calls abort the entire script when the timeout fires. Remove the `return 1` in both helpers — timeout is already communicated via stdout echo.
2. `echo "RAN_ON=$(hostname)"` in the nodelist job script always prints "guest" because both MI300X nodes have the same VM hostname. Use `${SPUR_TARGET_NODE:-$(hostname)}` instead; Spur injects SPUR_TARGET_NODE with the registered node name (mi300 / mi300-2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
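A runnable sketch of the corrected helper. `get_state` is a stub standing in for the real queue query so the example executes; the actual helper polls the scheduler. The key point is that on timeout the function echoes a sentinel instead of `return 1`, so `set -e` never fires in the caller:

```shell
#!/usr/bin/env bash
set -euo pipefail

get_state() { echo "${FAKE_STATE:-R}"; }   # stub so the sketch runs standalone

wait_job() {
  local id=$1 timeout=$2 elapsed=0 state
  while (( elapsed < timeout )); do
    state=$(get_state "$id")
    case $state in CD|F|TO) echo "$state"; return ;; esac
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  echo "TIMEOUT"   # reported via stdout; exit status stays 0, so set -e is safe
}

FAKE_STATE=CD wait_job 7 5    # prints CD immediately
FAKE_STATE=R  wait_job 7 2    # polls for 2s, then prints TIMEOUT
```

Callers then branch on the echoed string rather than on the exit status, which keeps the helper compatible with strict-mode scripts.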
1. executor.rs: create work_dir before script write so jobs dispatched
to a node that doesn't have the submitter's CWD don't fail with
ENOENT. The directory is created best-effort (ok() swallows errors).
2. scheduler_loop.rs: on dispatch failure, call complete_job(..., Failed)
so jobs don't stay permanently in Running state (ghost jobs that
block the scheduler from re-using node slots).
3. cluster_test.sh:
- Add `|| true` to grep pipelines at TEST 42 MASTER_ADDR check so
set -o pipefail doesn't abort the script when grep matches nothing.
- Pin tests 8/9/10 (exitfail, custom-io, %j subst, env passthrough)
to `-w mi300` since they check output files on the local filesystem
only (no remote_out fallback).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
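The `|| true` guard on the grep pipelines behaves like this under `set -euo pipefail` (sample strings are made up):

```shell
#!/usr/bin/env bash
set -euo pipefail

out="RANK=0 WORLD_SIZE=2"   # sample job output missing the MASTER_ADDR line
# grep exits 1 when nothing matches; without `|| true`, set -e would kill
# the whole test script even though "no match" is a result we want to
# inspect and report on.
addr=$(echo "$out" | grep -o 'MASTER_ADDR=[^ ]*' || true)
if [ -z "$addr" ]; then echo "no MASTER_ADDR captured"; fi
```

The capture becomes an empty string on no-match, and the test can assert on that instead of dying mid-run.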
When awk finds the target job and calls exit, any process feeding into
awk via a pipe gets SIGPIPE. With only a few jobs in the queue this
was harmless (tail had no remaining lines to write). By test 20 the
queue grows to ~7 entries and tail gets SIGPIPE on the unread lines,
exits 141, and set -o pipefail kills the script.
Fix: replace `tail -n +2 | awk` with a single awk invocation that
skips the header via `NR == 1 { next }`. Also add `|| true` after the
awk pipeline so that the SIGPIPE from squeue (when awk exits early)
doesn't propagate as a pipefail exit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
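The single-awk form can be sketched like this (simulated squeue output; the real column layout may differ):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Simulated squeue output: header row plus a few queued jobs.
squeue_out=$'JOBID ST\n18 CD\n19 R\n20 R\n21 PD'

# One awk process skips the header with `NR == 1 { next }` and exits on
# the first match, so there is no intermediate `tail` left to take a
# SIGPIPE; the trailing `|| true` absorbs the producer's 141 exit when
# awk quits before the pipe is fully drained.
state=$(printf '%s\n' "$squeue_out" | awk 'NR == 1 { next } $1 == 20 { print $2; exit }' || true)
echo "state=$state"   # state=R
```

With a short queue the old `tail -n +2 | awk` form happened to work, which is why the bug only surfaced once the queue grew past a handful of entries.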
1. executor.rs: fall back to /tmp when work_dir cannot be created on
the agent node (e.g. /home/runner/work/ doesn't exist on worker
nodes). Previously create_dir_all failure was silently swallowed
then the script write failed with ENOENT, causing dispatch_to_agent
to return an error and the scheduler to mark the job as Failed.
Now jobs run from /tmp with absolute output paths unaffected.
2. config.rs: add parse_time_seconds() with full second granularity.
parse_time_minutes("0:00:10") rounded up to 1 minute, making the
time-limit enforcement test (which submits -t 0:00:10) wait 60s
for the kill instead of 10s. The sbatch CLI now calls
parse_time_seconds() directly so "0:00:10" → 10 seconds correctly.
3. cluster_test.sh:
- Time limit poll: extend from 25s to 30s to give the 10s enforcer
interval margin (enforcer fires every 10s, limit is 10s → kill
within 20s + margin).
- Inference timeout: extend from 3m to 10m; RCCL/PyTorch init on
8 GPUs can take several minutes before the benchmark runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
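The parsing rule from fix 2 translates directly. A shell re-derivation, covering only the common colon forms (the Rust function handles whatever formats sbatch accepts):

```shell
#!/usr/bin/env bash
set -euo pipefail

# H:M:S, M:S, or a bare number (minutes, per Slurm convention) → seconds.
# The 10# prefix forces base-10 so zero-padded fields like "08" don't
# trip bash's octal parsing.
parse_time_seconds() {
  local IFS=':' parts
  read -ra parts <<< "$1"
  case ${#parts[@]} in
    3) echo $(( 10#${parts[0]} * 3600 + 10#${parts[1]} * 60 + 10#${parts[2]} )) ;;
    2) echo $(( 10#${parts[0]} * 60 + 10#${parts[1]} )) ;;
    1) echo $(( 10#${parts[0]} * 60 )) ;;
  esac
}

parse_time_seconds "0:00:10"   # 10 (the old minute-granularity parser rounded this to 60)
parse_time_seconds "1:30"      # 90
```

Parsing at second granularity is what lets the 10s enforcer interval actually kill a `-t 0:00:10` job within roughly two intervals instead of a full minute.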
Three targeted fixes for the remaining CI failures (TEST 51, 53-54):

1. agent_server.rs: Release the running-jobs mutex BEFORE calling report_completion. Previously the lock was held during the gRPC call, so a transient network failure would permanently lose the completion (job removed from map, never reported). Also adds 3-attempt retry with 1s backoff.
2. scheduler_loop.rs: Remove complete_job(Failed) from the dispatch-failure handler. Marking a job as Failed when dispatch fails breaks afterok dependencies — the dependent job sees DependencyResult::Failed and is never scheduled.
3. ci.yml: Clear WAL state directory between CI runs. Ghost Running jobs from previous runs were being replayed on controller restart, polluting the cluster with stale state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
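The retry policy in fix 1 is plain three-attempts-with-backoff. A generic shell sketch; `attempt_report` is a hypothetical stand-in for the gRPC call, here wired to fail twice to model a transient network error:

```shell
#!/usr/bin/env bash

# Stand-in for report_completion over gRPC: fails twice, succeeds on the
# third call (models a transient network blip).
calls=0
attempt_report() { calls=$(( calls + 1 )); (( calls >= 3 )); }

report_with_retry() {
  local tries=3 i
  for (( i = 1; i <= tries; i++ )); do
    if attempt_report; then return 0; fi
    if (( i < tries )); then sleep 1; fi   # 1s backoff between attempts
  done
  return 1   # give up after three attempts
}

report_with_retry && echo "reported after $calls attempts"
```

The ordering matters as much as the retry: the mutex must be released before any of these attempts run, otherwise a slow or failing RPC blocks every other job on the agent.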
spurd uses the GetNode RPC as a lightweight heartbeat (every 30s), but the controller's get_node handler never called update_heartbeat. After 90 seconds the health checker marked both nodes DOWN, causing the scheduler to stop dispatching jobs. This explains why tests 1-47 pass (within 90s of cluster start) but tests 48+ fail (nodes are DOWN by then).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
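The failure arithmetic: with 30s heartbeats and a 90s staleness threshold, three missed refreshes mark a node DOWN. A minimal sketch of the health-check predicate, with the threshold assumed from the timings in the commit message:

```shell
#!/usr/bin/env bash
set -euo pipefail

# A node is DOWN once now - last_heartbeat exceeds 90s, i.e. three missed
# 30s GetNode heartbeats. Because get_node never refreshed last_heartbeat,
# every node crossed this threshold ~90s after cluster start.
node_is_down() { (( $2 - $1 > 90 )); }

if node_is_down 1000 1060; then echo DOWN; else echo UP; fi   # UP   (60s stale)
if node_is_down 1000 1095; then echo DOWN; else echo UP; fi   # DOWN (95s stale)
```

This is also why the test-number cutoff was so clean: every run reached the same wall-clock point before the nodes went stale.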
Summary
cluster_test.sh: 31 → 66 tests total, covering the full Slurm multi-node feature surface.

New test groups

Fixes

backfill.rs — `find_suitable_nodes` now respects `--nodelist` and `--exclude`. Both are comma-separated node names; jobs with a nodelist can only land on listed nodes; excluded nodes are never considered. Five unit tests added.

scheduler_loop.rs — Added `enforce_time_limits` watchdog (10s interval). When `now > start_time + time_limit`, marks the job as Timeout and fires `CancelJob` RPCs to all agents holding the job.

Test plan
🤖 Generated with Claude Code