[Klaud Cold] Update gptoss-fp4-mi325x-vllm vLLM ROCm image to v0.21.0 #1467
Update vLLM ROCm image from v0.17.0 (70d old) to v0.21.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If you re-run failed jobs, you are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009953602 |
LGTM — straightforward vLLM ROCm image version bump for a single recipe, with corresponding changelog entry.
Extended reasoning...
Overview
This PR bumps the gptoss-fp4-mi325x-vllm recipe's image from vllm/vllm-openai-rocm:v0.17.0 to v0.21.0 (a single-line change in .github/configs/amd-master.yaml) and adds a matching entry to perf-changelog.yaml. Only one recipe is touched.
Security risks
None. These are benchmark config files (image tag + changelog text), not production code or auth/permissions changes.
Level of scrutiny
Low. This is a mechanical image version bump on a perf benchmark recipe, with no logic changes. The target version v0.21.0 is already in use elsewhere in the same file (kimik2.5-*-vllm, minimaxm2.5-*-vllm, gptoss-fp4-mi355x-vllm), so it's a well-validated image tag in this repo.
Other factors
The bug hunting system found no issues. The full-sweep-enabled label means the sweep itself will validate the new image works for this recipe before merge — if the image is broken, CI will catch it. Changelog format matches the existing pattern in the file.
Diagnosis
All 11 failed jobs landed on the same broken node. The 8 successful jobs ran on healthy nodes (018, 019, 020, 027) and completed normally. The successful tp=8 benchmark shows 223.71 GiB available KV cache memory with
Failed run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26009955168
No code fix needed
This failure is caused by node
Note: the |
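The diagnosis above comes down to grouping job outcomes by node and looking for nodes on which nothing succeeded. A minimal Python sketch of that check follows. It assumes job results are available as `(node, succeeded)` pairs; the record format, the function name `always_failing_nodes`, and the sample data are hypothetical and are not taken from the actual run logs.

```python
from collections import defaultdict


def always_failing_nodes(jobs):
    """Return the sorted nodes where at least one job ran and none succeeded."""
    outcomes = defaultdict(list)
    for node, succeeded in jobs:
        outcomes[node].append(succeeded)
    # A node is suspect only if every job scheduled on it failed.
    return sorted(node for node, results in outcomes.items() if not any(results))


# Hypothetical sample data, for illustration only (not the real sweep results).
sample_jobs = [
    ("node-bad", False),
    ("node-bad", False),
    ("node-ok-1", True),
    ("node-ok-2", True),
    ("node-ok-2", False),  # a single flake on an otherwise healthy node
]

print(always_failing_nodes(sample_jobs))
```

A node with mixed results, such as `node-ok-2` here, is not flagged, which keeps ordinary flakes separate from a node that fails every job.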
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with "enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted" before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027), so this is confirmed to be neither OOM nor a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes). For mi325x there is currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
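The trade-off between a deny-list (`--exclude`) and an allow-list (`--nodelist`) can be sketched as a small helper that builds the `salloc` argument list. This is only an illustration, not the repository's actual launch script: the helper name `salloc_command` and the partition name `mi325x` are assumptions, while `--partition` and `--exclude` are standard Slurm `salloc` options.

```python
import shlex


def salloc_command(partition, excluded_nodes, extra_args=()):
    """Build a salloc argv that steers the allocation away from known-bad nodes."""
    cmd = ["salloc", f"--partition={partition}"]
    if excluded_nodes:
        # Deny-list: any node not named here stays eligible, so nodes that
        # join the pool later need no change to this list.
        cmd.append("--exclude=" + ",".join(excluded_nodes))
    cmd.extend(extra_args)
    return cmd


# "mi325x" is a hypothetical partition name; the node name comes from the
# commit message above.
cmd = salloc_command("mi325x", ["chi-mi325x-pod1-121"])
print(shlex.join(cmd))
```

With an allow-list, every healthy node would have to be listed and kept in sync as nodes come and go; the deny-list only needs updating when another node goes bad.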
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009955168 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26010692647 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26013860625 |
Summary
Update vLLM ROCm image from v0.17.0 (70d old) to v0.21.0
Recipes touched: `gptoss-fp4-mi325x-vllm`
Test plan
🤖 Generated with Claude Code