
Multi-node Slurm feature tests + nodelist/exclude/time-limit fixes#2

Merged
powderluv merged 9 commits into main from
users/powderluv/multi-node-test-coverage
Mar 19, 2026
Conversation

@powderluv
Collaborator

Summary

  • Adds 35 new integration tests to cluster_test.sh (31 → 66 total) covering the full Slurm multi-node feature surface
  • Fixes two scheduler gaps the new tests exposed

New test groups

| Group | Tests | What's verified |
| --- | --- | --- |
| Failed job detection | 3 | exit non-zero → state=F, output captured |
| Custom I/O paths | 5 | `-o`/`-e` flags, `%j` substitution |
| Env var passthrough | 2 | `--export=VAR` reaches job |
| Node selection | 5 | `--nodelist` pins; `--exclude` avoids |
| Concurrent scheduling | 5 | both nodes used simultaneously |
| Distributed env var correctness | 5 | RANK 0/1, WORLD_SIZE, MASTER_ADDR on both nodes |
| Job hold/release | 4 | `-H` → PD; `scontrol release` → runs |
| Job dependencies | 4 | `afterok` keeps B pending until A completes |
| Time limit enforcement | 3 | `--time=0:00:10` job killed within 25s |

Fixes

backfill.rs — `find_suitable_nodes` now respects `--nodelist` and `--exclude`. Both are comma-separated node names: jobs with a nodelist can only land on listed nodes, and excluded nodes are never considered. Five unit tests added.
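The filtering can be sketched as a per-node predicate run inside the scheduling loop. This is a minimal sketch: the function name and `Option<&str>` signature are assumptions; only `find_suitable_nodes` and the comma-separated semantics come from the PR.

```rust
use std::collections::HashSet;

/// Hypothetical sketch of the nodelist/exclude check added to
/// `find_suitable_nodes` (field and type names are assumptions).
fn node_allowed(node: &str, nodelist: Option<&str>, exclude: Option<&str>) -> bool {
    // --nodelist: if present, the job may only land on listed nodes.
    if let Some(list) = nodelist {
        let pinned: HashSet<&str> = list.split(',').map(str::trim).collect();
        if !pinned.contains(node) {
            return false;
        }
    }
    // --exclude: excluded nodes are never considered.
    if let Some(list) = exclude {
        if list.split(',').map(str::trim).any(|n| n == node) {
            return false;
        }
    }
    true
}
```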

scheduler_loop.rs — added an `enforce_time_limits` watchdog (10s interval). When `now > start_time + time_limit`, it marks the job as Timeout and fires `CancelJob` RPCs to all agents holding the job.

Test plan

  • CI cluster job passes all 66 tests on mi300 + mi300-2

🤖 Generated with Claude Code

powderluv and others added 2 commits March 18, 2026 17:59
Adds comprehensive integration tests for Slurm features that matter for
multi-node GPU workloads, and fixes the two gaps the tests exposed.

## New integration tests (cluster_test.sh, +35 tests → 66 total)

- **Failed job detection**: exit non-zero → state=F, output captured
- **Custom I/O paths**: -o/-e flags, %j job-ID substitution in paths
- **Env var passthrough**: --export=VAR1,VAR2 reaches the job
- **Node selection**: --nodelist pins to named node; --exclude avoids it
- **Concurrent scheduling**: two jobs run simultaneously on both nodes
- **Distributed env var correctness**: RANK 0/1, WORLD_SIZE=2, MASTER_ADDR
  and MASTER_PORT verified on both nodes (checks mi300-2 output via SSH)
- **Job hold/release**: -H keeps job in PD; scontrol release unblocks it
- **Job dependencies**: afterok:A keeps B pending until A completes
- **Time limit enforcement**: --time=0:00:10 job killed within 25s

## Gap fixes

### backfill.rs — honour --nodelist / --exclude
`find_suitable_nodes` only checked partition and resource capacity;
`job.spec.nodelist` and `job.spec.exclude` (comma-separated node names)
were parsed by sbatch and stored but never consulted during scheduling.
Added pre-loop parsing of both constraints and per-node filter checks.
Five new unit tests cover: single-node pin, multi-node pin, no-match →
unschedulable, partial-exclude, and full-exclude → unschedulable.

### scheduler_loop.rs — wall-clock time limit enforcement
`time_limit` was used only by the backfill for slot duration estimation;
running jobs were never killed when they exceeded it.  Added
`enforce_time_limits`, a 10-second watchdog loop that:
  1. Scans running jobs for `now > start_time + time_limit`
  2. Calls `complete_job(..., Timeout)` to free resources immediately
  3. Fires `CancelJob` RPCs to all agents holding processes for the job
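The scan in step 1 hinges on a simple timeout predicate; a minimal sketch (the standalone function shape is an assumption — only the `now > start_time + time_limit` rule comes from the commit):

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical predicate behind the `enforce_time_limits` watchdog:
/// a running job has timed out once now > start_time + time_limit.
/// The watchdog would evaluate this for every Running job each tick.
fn is_timed_out(start_time: SystemTime, time_limit: Duration, now: SystemTime) -> bool {
    match now.duration_since(start_time) {
        Ok(elapsed) => elapsed > time_limit,
        // start_time in the future (clock skew): don't kill the job.
        Err(_) => false,
    }
}
```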

Changes:
- crates/spur-sched/src/backfill.rs: nodelist/exclude filtering + 5 unit tests
- crates/spurctld/src/scheduler_loop.rs: enforce_time_limits watchdog
- deploy/bare-metal/cluster_test.sh: 35 new tests across 9 feature groups

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds a new CI test that exercises tensor-parallel LLM inference across
both MI300X nodes. Each node runs an independent 8-GPU TP group (intra-
node RCCL), matching the serving pattern used by vLLM/Megatron-LM.

## New files

deploy/bare-metal/inference_test.py
  Pure-PyTorch TP inference benchmark modelling one LLaMA-3-8B-style
  decoder layer (hidden=4096, ffn=14336, 32 heads, SwiGLU FFN) sharded
  across all available GPUs with column/row-parallel linear layers and
  RCCL all-reduce.  Reports throughput (tok/s), peak GPU memory, and
  prints INFERENCE_OK on completion.  Port is derived from SPUR_JOB_ID
  to avoid conflicts between concurrent runs.

deploy/bare-metal/inference_job.sh
  Spur job wrapper (activates venv, runs inference_test.py).

## Changes

deploy/bare-metal/cluster_test.sh
  New test group 17 (5 tests): 2-node inference job submitted,
  both nodes complete within 3m, both print INFERENCE_OK,
  throughput reported on both, output finite (no NaN/Inf).

.github/workflows/ci.yml
  "Deploy fresh binaries" step now also rsyncs inference_test.py,
  inference_job.sh, distributed_test.py, distributed_job.sh to both
  nodes so test scripts always match the repo.

Verified: 63 tok/s on mi300 (8x MI300X, TP=8).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@powderluv powderluv force-pushed the users/powderluv/multi-node-test-coverage branch from 77b5be6 to c111de9 Compare March 19, 2026 01:00
powderluv and others added 7 commits March 18, 2026 18:47
Two bugs caused CI on PR #2 to fail starting at test 29:

1. `wait_job` and `wait_final_state` returned exit code 1 on timeout.
   With `set -euo pipefail`, bare `wait_job <id> 30` calls abort the
   entire script when the timeout fires. Remove the `return 1` in both
   helpers — timeout is already communicated via stdout echo.

2. `echo "RAN_ON=$(hostname)"` in the nodelist job script always prints
   "guest" because both MI300X nodes have the same VM hostname. Use
   `${SPUR_TARGET_NODE:-$(hostname)}` instead; Spur injects
   SPUR_TARGET_NODE with the registered node name (mi300 / mi300-2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1. executor.rs: create work_dir before script write so jobs dispatched
   to a node that doesn't have the submitter's CWD don't fail with
   ENOENT. The directory is created best-effort (ok() swallows errors).

2. scheduler_loop.rs: on dispatch failure, call complete_job(..., Failed)
   so jobs don't stay permanently in Running state (ghost jobs that
   block the scheduler from re-using node slots).

3. cluster_test.sh:
   - Add `|| true` to grep pipelines at TEST 42 MASTER_ADDR check so
     set -o pipefail doesn't abort the script when grep matches nothing.
   - Pin tests 8/9/10 (exitfail, custom-io, %j subst, env passthrough)
     to `-w mi300` since they check output files on the local filesystem
     only (no remote_out fallback).
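The best-effort directory creation in item 1 can be sketched as follows (the helper name is hypothetical; the `ok()`-swallows-errors semantics come from the commit):

```rust
use std::fs;
use std::path::Path;

/// Sketch of the executor change: create the job's work_dir before
/// writing the batch script. Creation is best-effort — a failure is
/// swallowed so dispatch can still proceed (matching the commit's
/// use of `.ok()` on the result).
fn ensure_work_dir(work_dir: &Path) {
    let _ = fs::create_dir_all(work_dir);
}
```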

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When awk finds the target job and calls exit, any process feeding into
awk via a pipe gets SIGPIPE. With only a few jobs in the queue this
was harmless (tail had no remaining lines to write). By test 20 the
queue grows to ~7 entries and tail gets SIGPIPE on the unread lines,
exits 141, and set -o pipefail kills the script.

Fix: replace `tail -n +2 | awk` with a single awk invocation that
skips the header via `NR == 1 { next }`. Also add `|| true` after the
awk pipeline so that the SIGPIPE from squeue (when awk exits early)
doesn't propagate as a pipefail exit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. executor.rs: fall back to /tmp when work_dir cannot be created on
   the agent node (e.g. /home/runner/work/ doesn't exist on worker
   nodes). Previously create_dir_all failure was silently swallowed
   then the script write failed with ENOENT, causing dispatch_to_agent
   to return an error and the scheduler to mark the job as Failed.
   Now jobs run from /tmp with absolute output paths unaffected.

2. config.rs: add parse_time_seconds() with full second granularity.
   parse_time_minutes("0:00:10") rounded up to 1 minute, making the
   time-limit enforcement test (which submits -t 0:00:10) wait 60s
   for the kill instead of 10s. The sbatch CLI now calls
   parse_time_seconds() directly so "0:00:10" → 10 seconds correctly.

3. cluster_test.sh:
   - Time limit poll: extend from 25s to 30s to give the 10s enforcer
     interval margin (enforcer fires every 10s, limit is 10s → kill
     within 20s + margin).
   - Inference timeout: extend from 3m to 10m; RCCL/PyTorch init on
     8 GPUs can take several minutes before the benchmark runs.
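A minimal sketch of a seconds-granularity parser like the one item 2 describes (the real `parse_time_seconds` signature is an assumption; Slurm's day-prefixed `D-HH:MM:SS` forms and bare-minutes handling are omitted here for brevity):

```rust
/// Hypothetical sketch: parse H:MM:SS / M:SS strings into whole
/// seconds with no rounding up, unlike the old parse_time_minutes,
/// which turned "0:00:10" into a full minute.
fn parse_time_seconds(s: &str) -> Option<u64> {
    let parts: Vec<&str> = s.split(':').collect();
    if parts.len() > 3 {
        return None;
    }
    let mut total: u64 = 0;
    for p in &parts {
        // Each colon-separated field shifts the total by a factor of 60.
        total = total * 60 + p.parse::<u64>().ok()?;
    }
    Some(total)
}
```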

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three targeted fixes for the remaining CI failures (TEST 51, 53-54):

1. agent_server.rs: Release the running-jobs mutex BEFORE calling
   report_completion. Previously the lock was held during the gRPC
   call, so a transient network failure would permanently lose the
   completion (job removed from map, never reported). Also adds
   3-attempt retry with 1s backoff.

2. scheduler_loop.rs: Remove complete_job(Failed) from the
   dispatch-failure handler. Marking a job as Failed when dispatch
   fails breaks afterok dependencies — the dependent job sees
   DependencyResult::Failed and is never scheduled.

3. ci.yml: Clear WAL state directory between CI runs. Ghost Running
   jobs from previous runs were being replayed on controller restart,
   polluting the cluster with stale state.
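The lock-ordering fix in item 1 can be sketched with simplified types (the map payload and transport closure are stand-ins, and a 10 ms backoff replaces the real 1 s one so the sketch runs quickly; only the remove-then-release-then-report ordering and the 3-attempt retry come from the commit):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Take the job out of the running map while holding the lock only
/// briefly, then report completion with the lock RELEASED, retrying
/// up to 3 times so a transient network failure doesn't lose it.
fn report_with_retry(
    running: &Arc<Mutex<HashMap<u64, String>>>,
    job_id: u64,
    send: impl Fn(&str) -> Result<(), String>,
) -> Result<(), String> {
    // Lock guard is a temporary: dropped at the end of this statement.
    let job = running.lock().unwrap().remove(&job_id);
    let job = job.ok_or_else(|| "unknown job".to_string())?;
    let mut last_err = String::new();
    for attempt in 0..3 {
        match send(&job) {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                if attempt < 2 {
                    thread::sleep(Duration::from_millis(10)); // 1s in the real fix
                }
            }
        }
    }
    Err(last_err)
}
```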

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spurd uses the GetNode RPC as a lightweight heartbeat (every 30s),
but the controller's get_node handler never called update_heartbeat.
After 90 seconds the health checker marked both nodes DOWN, causing
the scheduler to stop dispatching jobs.

This explains why tests 1-47 pass (within 90s of cluster start) but
tests 48+ fail (nodes are DOWN by then).
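A minimal sketch of the fix (the registry structure and method names are simplified stand-ins; the 30s poll and 90s DOWN threshold come from the commit):

```rust
use std::collections::HashMap;
use std::time::Instant;

struct NodeRegistry {
    last_seen: HashMap<String, Instant>,
}

impl NodeRegistry {
    /// The fix: since spurd polls GetNode every 30s as its heartbeat,
    /// the handler must refresh the node's last-seen timestamp, or the
    /// health checker's 90s threshold marks the node DOWN.
    fn touch_on_get_node(&mut self, name: &str) {
        self.last_seen.insert(name.to_string(), Instant::now());
    }

    /// Health-checker view: a node is DOWN if never seen or stale.
    fn is_down(&self, name: &str, threshold_secs: u64) -> bool {
        match self.last_seen.get(name) {
            Some(t) => t.elapsed().as_secs() > threshold_secs,
            None => true,
        }
    }
}
```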

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@powderluv powderluv merged commit 1cefc91 into main Mar 19, 2026
4 checks passed
@powderluv powderluv deleted the users/powderluv/multi-node-test-coverage branch March 19, 2026 05:56