[Klaud Cold] runners(mi300x): pin salloc to known-good nodes #1462
Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full
(kept down by a watchdog re-applying State=DOWN every 10s)
Without a nodelist filter, salloc sometimes lands a job on a node
that's about to be drained or that has a half-extracted enroot dir,
causing 'pyxis: failed to create container filesystem (No space left
on device)' / 'srun: Node failure' / 'manifest unknown'-style errors
visible in PRs #1426 and #1403.
Add an explicit --nodelist of the 6 healthy nodes (mirroring how
runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
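A minimal sketch of how the pin could be expressed in the launcher. The variable name MI300X_NODELIST and the helper mi300x_nodelist_arg are hypothetical, and the remaining salloc options in runners/launch_mi300x-amds.sh are not shown here, so they are elided. The hostlist value is the one quoted in this PR's summary.

```shell
# Hypothetical variable name; the hostlist expression is taken from this PR's summary.
# Single quotes matter: the brackets are glob characters to the shell.
MI300X_NODELIST='chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com'

# Hypothetical helper: prints the extra salloc argument.
mi300x_nodelist_arg() {
    printf '%s\n' "--nodelist=${MI300X_NODELIST}"
}

# Usage sketch (other salloc options elided, not executed here):
#   salloc "$(mi300x_nodelist_arg)" ...
```

Keeping the command substitution quoted at the call site prevents the shell from treating the bracket expression as a pathname pattern before salloc sees it.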
LGTM — small ops fix pinning salloc to known-good mi300x nodes, mirroring the existing pattern in launch_b300-nv.sh.
Overview
This PR adds a --nodelist constraint to the salloc invocation in runners/launch_mi300x-amds.sh, restricting allocations to six known-good mi300x nodes (034-036, 054, 057-058) and excluding three currently-unusable nodes (033, 037 down; 049 drained for disk-full). The pattern mirrors an existing usage in runners/launch_b300-nv.sh.
Security risks
None. This is a runner orchestration script for CI benchmark jobs; the change is a Slurm scheduling constraint with no security-sensitive surface (no auth, crypto, permissions, or user-controllable input affected).
Level of scrutiny
Low. This is a one-line operational hotfix to an internal runner script. The change is mechanical, the rationale is documented inline as a comment, and the impact is bounded: at worst, salloc will fail to allocate if all listed nodes are busy, which is the same failure surface as today when a doomed node is picked.
Other factors
- No bugs were reported by the bug hunting system.
- The hardcoded nodelist is a known tradeoff for a hotfix — future node health changes will require another PR — but that maintenance pattern is already in use elsewhere in this repo.
- No prior reviewer comments to address.
Root-caused via the failed sweeps on #1467, #1468, and #1469 (all three are [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted, before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027), which confirms this is neither an OOM nor a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically bad nodes). For mi325x there is currently only one node to exclude, so use --exclude rather than --nodelist; that way we don't have to maintain an allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
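A minimal sketch of the deny-list variant described in this commit message. The variable name MI325X_EXCLUDE and the helper mi325x_exclude_arg are hypothetical; the node is written with the short name used above, since the fully qualified hostname is not given, and whether Slurm knows the node by that exact name is an assumption.

```shell
# Hypothetical variable name; short node name as written in the commit message.
# Assumption: this matches the node name Slurm uses on the mi325x cluster.
MI325X_EXCLUDE='chi-mi325x-pod1-121'

# Hypothetical helper: prints the extra salloc argument.
mi325x_exclude_arg() {
    printf '%s\n' "--exclude=${MI325X_EXCLUDE}"
}

# Usage sketch (other salloc options elided, not executed here):
#   salloc "$(mi325x_exclude_arg)" ...
```

With a deny-list, a newly added healthy node becomes eligible without touching the script, which is the maintenance argument made above for preferring --exclude when only one node is bad.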
Summary
- Add --nodelist=chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com to the mi300x salloc, mirroring the pattern already used in runners/launch_b300-nv.sh:336.
- Three of the nine mi300x compute nodes are currently unusable:
  - chi-mi300x-033, chi-mi300x-037: down (Not responding)
  - chi-mi300x-049: drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)
- Without a nodelist filter, salloc sometimes lands a job on a node that's about to be drained or that has a half-extracted enroot dir, causing pyxis: failed to create container filesystem (No space left on device) or srun: Node failure. See #1426 (Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x) and #1403 (Update gptoss-fp4-mi300x-vllm vLLM ROCm image to v0.21.0) for current examples.
Test plan
- No Node failure / pyxis ... No space left on device errors after this lands.
🤖 Generated with Claude Code
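The test-plan check above could be scripted roughly as follows. This is only a sketch: the helper name has_node_failure_signature is hypothetical, and it assumes the job log is available as plain text on stdin. It looks for the two error strings named in the test plan.

```shell
# Hypothetical helper: exits 0 if the log text on stdin contains either error
# signature named in the test plan, and 1 if neither is present.
has_node_failure_signature() {
    grep -q -e 'No space left on device' -e 'Node failure'
}

# Usage sketch (job.log is a placeholder path):
#   if has_node_failure_signature < job.log; then
#       echo "node-level failure signature found"
#   fi
```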