[Klaud Cold] runners(mi300x): pin salloc to known-good nodes #1462

Merged
functionstackx merged 1 commit into main from add-mi300x-nodelist-filter on May 18, 2026

Conversation

@functionstackx
Collaborator

Summary

  • Adds an explicit --nodelist=chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com to the mi300x salloc, mirroring the pattern already used in runners/launch_b300-nv.sh:336.
  • Three of the nine mi300x nodes are currently unusable:
    • chi-mi300x-033, chi-mi300x-037: down (Not responding)
    • chi-mi300x-049 — drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)
  • Symptom this fixes: PRs land on a doomed node and fail at pyxis extraction (No space left on device) or with srun: Node failure. See #1426 (Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x) and #1403 (Update gptoss-fp4-mi300x-vllm vLLM ROCm image to v0.21.0) for current examples.
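
The bracketed hostlist in the first bullet is compact enough to misread, so here is a minimal sketch of how it expands. On a Slurm cluster, `scontrol show hostnames` does this expansion; the helper `expand_hostlist` below is a hypothetical stand-in that handles only a single bracket group, which is all this nodelist needs.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a single-bracket Slurm-style hostlist, e.g. 'n[1-3,7].example'.

    Hypothetical helper for illustration; it does not cover every hostlist
    form Slurm accepts (nested or multiple bracket groups are not handled).
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat as a single hostname
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a bare number such as "054" is a one-element range
        width = len(lo)  # preserve zero padding, e.g. 034
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts


# The nodelist from this PR.
NODELIST = "chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com"
healthy = expand_hostlist(NODELIST)

for host in healthy:
    print(host)
```

The expansion yields six hostnames (034, 035, 036, 054, 057, 058) and leaves out the three unusable nodes 033, 037, and 049.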

Test plan

  • Subsequent mi300x sweeps land only on the 6 healthy nodes.
  • No Node failure / pyxis ... No space left on device errors after this lands.
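
One way to check the second item is to scan sweep logs for the two error signatures named above. This is a rough sketch, not part of the runner scripts: `find_failure_signatures` is a hypothetical helper, and how the logs are collected is left as an assumption.

```python
# Signatures taken from the errors quoted in this PR.
FAILURE_SIGNATURES = ("No space left on device", "Node failure")


def find_failure_signatures(log_text: str) -> list[str]:
    """Return the known failure signatures that occur anywhere in log_text."""
    return [sig for sig in FAILURE_SIGNATURES if sig in log_text]


# Sample built from the error lines quoted in the commit message.
bad_log = (
    "pyxis: failed to create container filesystem (No space left on device)\n"
    "srun: Node failure\n"
)
# Hypothetical log line from a healthy run.
clean_log = "benchmark finished\n"

print(find_failure_signatures(bad_log))
print(find_failure_signatures(clean_log))
```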

🤖 Generated with Claude Code

Three of the nine mi300x compute nodes are currently unusable:
  - chi-mi300x-033, chi-mi300x-037: down (Not responding)
  - chi-mi300x-049: drained for persistent /nvme_home disk-full
    (kept down by a watchdog re-applying State=DOWN every 10s)

Without a nodelist filter, salloc sometimes lands a job on a node
that's about to be drained or that has a half-extracted enroot dir,
causing 'pyxis: failed to create container filesystem (No space left
on device)' / 'srun: Node failure' / 'manifest unknown'-style errors
visible in PRs #1426 and #1403.

Add an explicit --nodelist of the 6 healthy nodes (mirroring how
runners/launch_b300-nv.sh:336 pins to the known-good B300 set).

functionstackx requested a review from a team on May 18, 2026 01:24
functionstackx merged commit 4da367c into main on May 18, 2026
4 checks passed
functionstackx deleted the add-mi300x-nodelist-filter branch on May 18, 2026 01:24
Contributor

@claude claude Bot left a comment

LGTM — small ops fix pinning salloc to known-good mi300x nodes, mirroring the existing pattern in launch_b300-nv.sh.

Extended reasoning...

Overview

This PR adds a --nodelist constraint to the salloc invocation in runners/launch_mi300x-amds.sh, restricting allocations to six known-good mi300x nodes (034-036, 054, 057-058) and excluding three currently-unusable nodes (033, 037 down; 049 drained for disk-full). The pattern mirrors an existing usage in runners/launch_b300-nv.sh.

Security risks

None. This is a runner orchestration script for CI benchmark jobs; the change is a Slurm scheduling constraint with no security-sensitive surface (no auth, crypto, permissions, or user-controllable input affected).

Level of scrutiny

Low. This is a one-line operational hotfix to an internal runner script. The change is mechanical, the rationale is documented inline as a comment, and the impact is bounded: at worst, salloc will fail to allocate if all listed nodes are busy, which is the same failure surface as today when a doomed node is picked.

Other factors

  • No bugs were reported by the bug hunting system.
  • The hardcoded nodelist is a known tradeoff for a hotfix — future node health changes will require another PR — but that maintenance pattern is already in use elsewhere in this repo.
  • No prior reviewer comments to address.

functionstackx added a commit that referenced this pull request May 18, 2026
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three
[Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every
failure landed on chi-mi325x-pod1-121 with

  enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted

before the .sqsh import even completes; subsequent pyxis mount then
fails with "No such file or directory". The same image works cleanly
on every other up node (017/018/019/020/027) — confirmed not OOM and
not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away
from chronically-bad nodes); for mi325x there's currently only the one
node to exclude, so use --exclude rather than --nodelist so we don't
have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog
(per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix
the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
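
The maintenance tradeoff between the two approaches can be sketched with plain set logic. This models only the allow-list versus deny-list intent, not Slurm's scheduler. The short node IDs are the ones named in the commit message above, and node "130" is a hypothetical later addition.

```python
def eligible_with_allow_list(pool: set[str], allow: set[str]) -> set[str]:
    """Allow-list intent (as with --nodelist): only listed nodes qualify."""
    return pool & allow


def eligible_with_deny_list(pool: set[str], deny: set[str]) -> set[str]:
    """Deny-list intent (as with --exclude): everything except listed nodes."""
    return pool - deny


# mi325x nodes mentioned in the commit message, by short ID.
pool = {"017", "018", "019", "020", "027", "121"}
deny = {"121"}           # the one chronically bad node
allow = pool - deny      # equivalent allow-list today

# Hypothetical: a new healthy node "130" joins the pool later.
grown_pool = pool | {"130"}

print(sorted(eligible_with_allow_list(grown_pool, allow)))  # "130" is missed
print(sorted(eligible_with_deny_list(grown_pool, deny)))    # "130" is included
```

With a single bad node, the deny-list picks up new nodes without another PR, while the allow-list has to be edited each time the pool changes.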