Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517#1441
Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517#1441Klaud-Cold wants to merge 7 commits into
Conversation
…mi35x-20260517 Ref #1154 Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
There was a problem hiding this comment.
LGTM — straightforward SGLang ROCm image tag bump for glm5.1-fp4-mi355x-sglang with a corresponding changelog entry.
Extended reasoning...
Overview
This PR is a single-line image tag bump for the glm5.1-fp4-mi355x-sglang config in .github/configs/amd-master.yaml (v0.5.10rc0-rocm720-mi35x-20260415 → v0.5.12-rocm720-mi35x-20260517) plus a matching new entry appended to perf-changelog.yaml. No code or runtime logic changes.
Security risks
None. The change is data-only YAML, just selecting a different upstream container image tag. No auth, crypto, secrets, or permission surfaces are touched.
Level of scrutiny
Low. This is the standard recurring pattern for benchmark image bumps already established across many sibling configs in the same file (e.g., the qwen3.5 and glm5 entries use the same image-tag shape). It's a mechanical, easy-to-revert change.
Other factors
The bug hunting system flagged one nit — the new perf-changelog entry's pr-link still contains the literal XXX placeholder instead of the PR number. That's a documentation traceability issue only and is captured as an inline comment; it doesn't affect the runtime correctness of the bump itself, so I'm comfortable approving rather than blocking.
| - glm5.1-fp4-mi355x-sglang | ||
| description: | ||
| - "Update SGLang ROCm image from v0.5.10rc0-rocm720-mi35x-20260415 to v0.5.12-rocm720-mi35x-20260517" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX |
There was a problem hiding this comment.
🟡 The new perf-changelog.yaml entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the XXX placeholder was never substituted with the real PR number. This link will 404 and breaks changelog traceability; replace XXX with 1441 before merging.
Extended reasoning...
What the bug is
Line 2631 of perf-changelog.yaml (the newly added entry in this PR) contains:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXThe XXX is a literal placeholder left over from the changelog-entry template and was not substituted with the actual PR number.
How it manifests
Anyone clicking the link from the perf-changelog (or any tooling that follows pr-link URLs to attribute changes to a PR) will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which 404s — there is no PR numbered XXX. The changelog loses its traceability for this entry: a reader can no longer jump from the SGLang ROCm image bump (v0.5.10rc0 → v0.5.12) back to the PR that introduced it.
Why surrounding code didn't catch it
The file is data-only YAML; there is no schema validator that asserts pr-link resolves to an existing PR, and no pre-commit/CI hook that rejects literal XXX placeholders. Every nearby entry (PRs 1394, 1416, 1423, 1429) correctly uses a numeric PR id in the same URL shape, so the convention is unambiguous — this one was just a missed find-and-replace.
Step-by-step proof
- Open this PR's metadata — the PR number is
#1441. - Open the diff for
perf-changelog.yaml(the only changelog change in this PR). The new entry ends withpr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. XXXis not a number, so the URL points to no real GitHub resource. Fetchinghttps://github.com/SemiAnalysisAI/InferenceX/pull/XXXwill return GitHub's standard 404 page.- Compare to the immediately preceding entry in the same diff context (the vLLM v0.15.1→v0.20.2 bump for
gptoss-fp4-b200-vllm), which correctly carriespr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1394— confirming the convention is a numeric PR id, not a placeholder.
Fix
Replace pull/XXX with pull/1441 on line 2631:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1441Impact
Documentation-only — no runtime behavior is affected, and no automation in the repo appears to parse pr-link strictly. But the link is the changelog's only pointer back to the source change, so leaving it broken degrades the changelog's auditability. Trivially fixable before merge, hence severity nit.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25998804987 |
|
e2e test result: SLURM infrastructure failure Run 26000058779: All 4 single-node jobs failed — SLURM |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25998804987 |
Keep main's perf-changelog.yaml and append glm5.1-fp4-mi355x-sglang entry at tail. amd-master.yaml auto-merged cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the -20260517 nightly suffix so the recipe uses the lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag rather than a date-pinned nightly build.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005299621 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005440099 |
Docker Hub does not publish a clean lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag — only the dated nightly variant. The earlier switch to the un-suffixed tag was a mistake (caused 'manifest not found' on every job). Restoring the dated nightly tag that does exist.
# Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005776675 |
) Root-caused via the failed sweeps on #1431, #1432, #1440, #1441, #1443 — every failure landed on either: mia1-p01-g09 pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs) mia1-p01-g11 permission denied while trying to connect to docker.sock (cluster-cleanup `docker stop` step fails; cascading into pyxis-init failure) Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the launcher wasn't excluding them. This mirrors the existing pattern in runners/launch_mi300x-amds.sh (#1462 — pin to known-good nodes) and runners/launch_mi325x-amds.sh (#1477 — exclude chi-mi325x-pod1-121). Once this lands the 5 affected mi355x PRs can be rebased to pick it up and the failed jobs will land on healthy nodes only. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009684420 |
Summary
glm5.1-fp4-mi355x-sglangimage fromlmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415tolmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517Ref #1154
Generated with Claude Code