Conversation
Signed-off-by: seungrokj <seungrok.jung@amd.com>
There was a problem hiding this comment.
LGTM — straightforward addition of --exclusive to salloc across all three AMD runner scripts, consistent with the existing NV cluster scripts.
Extended reasoning...
Overview
This PR modifies three AMD Slurm launch scripts (launch_mi300x-amds.sh, launch_mi325x-amd.sh, launch_mi355x-amds.sh) to add --exclusive to their salloc calls. The change is identical in nature across all three files: one flag inserted into an existing command line.
Security Risks
No security-sensitive code is touched. These are CI runner scripts for benchmark execution on dedicated hardware. No auth, crypto, permissions, or external service calls are modified.
Level of Scrutiny
This is a low-risk, mechanical change. The --exclusive flag is already used in the NV cluster scripts, and the PR provides clear benchmark evidence showing significant performance improvement (TPOT ~22ms vs ~38ms with --exclusive on MI355X). The change is consistent with established patterns in the codebase.
Other Factors
The flagged bugs (missing -u $USER in squeue and no empty JOB_ID guard in the mi355x single-node path) are pre-existing issues not introduced by this PR. They are noted as inline comments for the team to address separately. The core --exclusive addition is correct and self-contained.
MI325X now has its own PR (#931). Update perf-changelog to list specific non-TP8 MI355X configs instead of wildcards. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-opening of #929 which was reverted. Add --exclusive flag to MI355X salloc (single-node) and sbatch (multi-node) to prevent node sharing during benchmarks, matching NVIDIA runner behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [AMD] Fix slurm command for AMD devices Re-opening of #929 which was reverted. Add --exclusive flag to MI355X salloc (single-node) and sbatch (multi-node) to prevent node sharing during benchmarks, matching NVIDIA runner behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update perf-changelog.yaml --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @cquil11 @functionstackx
Motivation is AMD slurm clusters are missing --exclusive option and they showed regressed performance. NV slurm clusters are already using --exclusive option.
without --exclusive

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23433485252
with --exclusive

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23439928217
Regards,
Seungrok