[Klaud Cold] runners(mi300x): pin salloc to known-good nodes by functionstackx · Pull Request #1462 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T01:24:15Z

Summary

Adds an explicit --nodelist=chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com to the mi300x salloc, mirroring the pattern already used in runners/launch_b300-nv.sh:336.
Three of the nine mi300x nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037 — down (Not responding)
- chi-mi300x-049 — drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)
Symptom this fixes: PR sweeps land on a doomed node and fail at pyxis extraction (No space left on device) or with srun: Node failure. For current examples, see #1426 (Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x) and #1403 (Update gptoss-fp4-mi300x-vllm vLLM ROCm image to v0.21.0).
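For readers without the diff open, the pin described above could look roughly like the sketch below. The variable and function names are hypothetical and not taken from runners/launch_mi300x-amds.sh; only the hostlist expression and the --nodelist flag come from this PR. The sketch prints the command rather than running it, since the runner's other salloc options are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the real change lives in runners/launch_mi300x-amds.sh.

# Known-good mi300x nodes as a Slurm hostlist expression (from the PR summary).
MI300X_GOOD_NODES='chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com'

# Print the salloc flag that pins an allocation to the known-good nodes.
mi300x_nodelist_arg() {
  printf -- '--nodelist=%s\n' "$MI300X_GOOD_NODES"
}

# Dry run: show the command instead of executing it, because the other
# salloc options used by the runner are not part of this PR description.
echo "salloc $(mi300x_nodelist_arg) <existing salloc options>"
```

Keeping the hostlist in a single variable means a future node-health change is a one-line edit.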

Test plan

Subsequent mi300x sweeps land only on the 6 healthy nodes.
No srun: Node failure or pyxis ... No space left on device errors after this lands.
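One way to check the first test-plan item is to compare the node a sweep landed on against the six healthy hosts. This is a minimal sketch under stated assumptions: the array and function names are hypothetical, and the host names are expanded by hand from [034-036,054,057-058] in the summary.

```shell
#!/usr/bin/env bash
# Hypothetical check; names are illustrative, not taken from the repo.

# The six healthy nodes, expanded by hand from [034-036,054,057-058].
HEALTHY_NODES=(
  chi-mi300x-034 chi-mi300x-035 chi-mi300x-036
  chi-mi300x-054 chi-mi300x-057 chi-mi300x-058
)

# Return 0 if the given hostname (short or fully qualified) is in the healthy set.
is_healthy_node() {
  local host="${1%%.*}"   # strip any domain suffix
  local node
  for node in "${HEALTHY_NODES[@]}"; do
    [[ "$node" == "$host" ]] && return 0
  done
  return 1
}
```

A sweep could call it with the output of hostname from inside the allocation, for example is_healthy_node "$(hostname)" || echo "landed on an unexpected node".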

🤖 Generated with Claude Code

Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)

Without a nodelist filter, salloc sometimes lands a job on a node that's about to be drained or that has a half-extracted enroot dir, causing 'pyxis: failed to create container filesystem (No space left on device)' / 'srun: Node failure' / 'manifest unknown'-style errors visible in PRs #1426 and #1403.

Add an explicit --nodelist of the 6 healthy nodes (mirroring how runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
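When triaging a failed sweep, the error signatures quoted in this commit message can be matched mechanically. The helper below is a hypothetical sketch, not part of the repo: it reads log text on stdin and succeeds if any of the three quoted strings appears. A match only suggests a node-level problem; it does not prove one.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper; the signature strings come from the commit message.

# Return 0 if the log text on stdin contains one of the quoted failure signatures.
is_node_failure_log() {
  grep -q -F \
    -e 'No space left on device' \
    -e 'srun: Node failure' \
    -e 'manifest unknown'
}
```

For example, is_node_failure_log < sweep.log && echo "likely a bad node, not the recipe", where sweep.log is a placeholder file name.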

claude

LGTM — small ops fix pinning salloc to known-good mi300x nodes, mirroring the existing pattern in launch_b300-nv.sh.

Extended reasoning...

Overview

This PR adds a --nodelist constraint to the salloc invocation in runners/launch_mi300x-amds.sh, restricting allocations to six known-good mi300x nodes (034-036, 054, 057-058) and excluding three currently-unusable nodes (033, 037 down; 049 drained for disk-full). The pattern mirrors an existing usage in runners/launch_b300-nv.sh.

Security risks

None. This is a runner orchestration script for CI benchmark jobs; the change is a Slurm scheduling constraint with no security-sensitive surface (no auth, crypto, permissions, or user-controllable input affected).

Level of scrutiny

Low. This is a one-line operational hotfix to an internal runner script. The change is mechanical, the rationale is documented inline as a comment, and the impact is bounded: at worst, salloc will fail to allocate if all listed nodes are busy, which is the same failure surface as today when a doomed node is picked.

Other factors

No bugs were reported by the bug hunting system.
The hardcoded nodelist is a known tradeoff for a hotfix — future node health changes will require another PR — but that maintenance pattern is already in use elsewhere in this repo.
No prior reviewer comments to address.

Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027), so this is confirmed not OOM and not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes). For mi325x there's currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
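The deny-list approach described above could be sketched as follows. The names are hypothetical and the node name is used exactly as written in the commit message; the cluster may need the fully qualified NodeName instead, which is an assumption to verify. The function joins the bad nodes with commas for salloc's --exclude flag and prints nothing when the list is empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper; variable and function names are illustrative.

# Nodes to keep out of mi325x allocations (currently just one, per the commit message).
MI325X_BAD_NODES=(chi-mi325x-pod1-121)

# Print a salloc --exclude flag for the bad nodes, or nothing if the list is empty.
mi325x_exclude_arg() {
  if (( ${#MI325X_BAD_NODES[@]} == 0 )); then
    return 0
  fi
  local IFS=,   # join array elements with commas
  printf -- '--exclude=%s\n' "${MI325X_BAD_NODES[*]}"
}
```

Once ops fix the node, removing it from the array is enough, and the flag disappears with no other change.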

functionstackx requested a review from a team May 18, 2026 01:24

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx merged commit 4da367c into main May 18, 2026
4 checks passed

functionstackx deleted the add-mi300x-nodelist-filter branch May 18, 2026 01:24

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

claude Bot reviewed May 18, 2026

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(mi325x): exclude broken enroot node chi-mi325x-pod1-121 #1477

Merged

2 tasks

functionstackx merged 1 commit into main from add-mi300x-nodelist-filter
