Update qwen3.5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 #1444
…-rocm720-mi35x-20260517 Ref #1154 Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
+    - qwen3.5-fp8-mi355x-sglang-mtp
+  description:
+    - "Update SGLang ROCm image from v0.5.10rc0-rocm720-mi35x-20260414 to v0.5.12-rocm720-mi35x-20260517"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
🔴 The new perf-changelog.yaml entry uses https://github.com/SemiAnalysisAI/InferenceX/pull/XXX as its pr-link — a placeholder that was never substituted with the real PR number. This should be https://github.com/SemiAnalysisAI/InferenceX/pull/1444 to match every other entry in the file and to make the link actually resolve.
Extended reasoning...
What the bug is
At perf-changelog.yaml:2632, the newly added entry for the qwen3.5-fp8-mi355x-sglang and qwen3.5-fp8-mi355x-sglang-mtp image bump ends with:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
XXX is a placeholder, almost certainly the literal text from a changelog template, that was never replaced with the actual PR number before this PR was opened.
Why this is wrong
The PR metadata for this change is #1444, so the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/1444. Every other recent entry in the same file uses a real numeric PR id — the four entries immediately preceding this one point at #1423, #1429, #1416, and #1394 (lines 2606, 2612, 2619, 2625). The convention is unambiguous and this entry breaks it.
Impact
- The link is a hard 404: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX does not (and cannot) resolve to a valid GitHub PR, so anyone clicking it from the changelog lands on an error page.
- Any tooling that consumes perf-changelog.yaml and parses pr-link (e.g. release-notes generators, dashboards, or scripts that cross-reference entries with GitHub PR titles/labels) will either skip this row, log a warning, or break entirely when it cannot resolve XXX to an integer.
- The changelog loses its provenance for this image bump: once the PR is merged you can no longer trace the entry back to its discussion/review.
Step-by-step proof
1. Open the diff for perf-changelog.yaml in this PR. The added block (lines 2627–2632) is:
   - config-keys:
       - qwen3.5-fp8-mi355x-sglang
       - qwen3.5-fp8-mi355x-sglang-mtp
     description:
       - "Update SGLang ROCm image from v0.5.10rc0-rocm720-mi35x-20260414 to v0.5.12-rocm720-mi35x-20260517"
     pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
2. Look at the four immediately preceding entries (lines 2606, 2612, 2619, 2625): each ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<integer> (1423, 1429, 1416, 1394). No other entry uses XXX or any non-numeric token.
3. The PR creating this entry is #1444 (see PR metadata in this review).
4. Therefore XXX on line 2632 is a leftover template placeholder.
Fix
Replace XXX with 1444:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1444
This is a trivial one-token change; no other lines need updating.
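A check like the one below could catch leftover placeholders before review. It is only a minimal sketch: it assumes each entry carries a single-line `pr-link:` field shaped like the ones quoted above, the function name `find_bad_pr_links` is hypothetical, and the sample entries (including the `example-config-*` keys) are illustrative rather than copied from the real file.

```python
import re

# Assumed shape of a pr-link line, based on the entries quoted in this review.
PR_LINK_RE = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")
# A valid link ends in a purely numeric PR id.
VALID_URL_RE = re.compile(
    r"^https://github\.com/SemiAnalysisAI/InferenceX/pull/\d+$"
)


def find_bad_pr_links(changelog_text):
    """Return (line_number, url) pairs whose pr-link lacks a numeric PR id."""
    bad = []
    for lineno, line in enumerate(changelog_text.splitlines(), start=1):
        match = PR_LINK_RE.match(line)
        if match and not VALID_URL_RE.match(match.group(1)):
            bad.append((lineno, match.group(1)))
    return bad


# Illustrative input only; the config keys here are hypothetical.
sample = """\
- config-keys:
    - example-config-a
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1423
- config-keys:
    - example-config-b
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
"""

for lineno, url in find_bad_pr_links(sample):
    print(f"line {lineno}: non-numeric pr-link {url}")
```

Scanning line by line with a regular expression avoids a YAML dependency and keeps the reported line numbers exact, at the cost of ignoring links written in any other YAML style.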
 qwen3.5-fp8-mi355x-sglang:
-  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414
+  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
🟡 Nit: the qwen3.5-fp8-mi355x-sglang-agentic sibling (line 271) still pins lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414, but this PR bumps the parent qwen3.5-fp8-mi355x-sglang to v0.5.12-rocm720-mi35x-20260517. The comment block immediately above the -agentic entry states 'Metadata is identical to origin/main's qwen3.5-fp8-mi355x-sglang' — after merge that claim becomes inaccurate. Consider either bumping the -agentic image to match, or adding a 'Reasons below' note in the same style as kimik2.5-fp4-mi355x-vllm-agentic / minimaxm2.5-...-vllm-agentic to document the intentional image divergence.
Extended reasoning...
What the comment claims. Lines 263–267 of .github/configs/amd-master.yaml carry this block, introduced by PR #1393 when the agentic-coding sibling was first split off:
# Diverged from qwen3.5-fp8-mi355x-sglang (agentic-coding sibling). Metadata is
# identical to origin/main's qwen3.5-fp8-mi355x-sglang; the split exists because this
# PR adds an agentic-coding scenarios block that differs from main
# (either main had none or had a different conc/offload sweep).
# The original qwen3.5-fp8-mi355x-sglang entry stays byte-identical to origin/main.
The first sentence is a present-tense factual assertion ("Metadata is identical to origin/main's qwen3.5-fp8-mi355x-sglang") about a relationship between two configs that both live in this file. The last sentence reinforces this by saying the parent stays byte-identical to main.
Why this PR breaks the assertion. This PR bumps the parent qwen3.5-fp8-mi355x-sglang (line 222) from v0.5.10rc0-rocm720-mi35x-20260414 → v0.5.12-rocm720-mi35x-20260517. It does not touch the -agentic sibling at line 271, which remains pinned to the older v0.5.10rc0 image. So once this PR lands on main, the claim "Metadata is identical to origin/main's qwen3.5-fp8-mi355x-sglang" is no longer true: the image: field of the -agentic sibling diverges from the parent it claims to mirror.
Step-by-step proof:
1. Pre-PR state on main: parent image = v0.5.10rc0-...-20260414; -agentic image = v0.5.10rc0-...-20260414. Comment is accurate (both fields identical).
2. This PR's diff modifies only the parent's image: to v0.5.12-...-20260517 (line 222), and the corresponding -mtp sibling (line 244). The -agentic entry at line 271 is untouched.
3. Post-merge state on main: parent image = v0.5.12-...-20260517; -agentic image = v0.5.10rc0-...-20260414. The two fields differ, so the comment's assertion that they are "identical" is now stale.
Convention in the same file. When -agentic siblings intentionally diverge from their fixed-seq-len parent on the image field, this file's convention is to call it out explicitly. See the kimik2.5-fp4-mi355x-vllm-agentic block (lines ~430-460 in the modified file), which carries a 'Reasons below' header with an explicit image: 'old' -> 'new' line, and the three minimaxm2.5-...-vllm-agentic blocks which do the same. The qwen3.5-fp8-mi355x-sglang-agentic block has no such documentation because, until this PR, it didn't need any: the images were aligned. After this PR the divergence is undocumented.
Addressing the refutation that this is a historical snapshot. One reading of the comment is that "this PR" refers to #1393 (the PR that created the divergence), making the whole block a historical narrative about how the split came to be rather than an invariant. That reading is plausible for the "this PR adds..." clause, but the framing sentences ("Metadata is identical..." and "The original ... stays byte-identical to origin/main") use present/continuous tense and read as ongoing claims about file state, not a past event. Future readers of the comment after merge will be misled. The fix is cheap either way: bump the image (one-line change) or replace the now-stale assertion with a Reasons below block matching the kimik2.5 / minimaxm2.5 pattern.
Impact and how to fix. This is purely a documentation/maintenance consistency issue: both images are independently valid and the system runs correctly, so this is a nit, not a blocking bug. To resolve, either: (a) also bump line 271 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 (and add the key to the perf-changelog entry), or (b) replace lines 263–267 with a 'Reasons below' block listing image: 'v0.5.10rc0-rocm720-mi35x-20260414' -> 'v0.5.12-rocm720-mi35x-20260517' (intentionally pinned because <reason>).
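The parent/sibling comparison in the proof above can be sketched as a small script. This is a hedged illustration, not tooling from the repository: it assumes each config is a top-level key with an indented `image:` field, as in the diff quoted in this review, and the function names `images_by_config` and `diverged_agentic_images` are hypothetical.

```python
import re

# Assumed layout: a top-level "name:" line followed by an indented "image:" field.
TOP_KEY_RE = re.compile(r"^([A-Za-z0-9][\w.\-]*):\s*$")
IMAGE_RE = re.compile(r"^\s+image:\s*(\S+)\s*$")


def images_by_config(config_text):
    """Map each top-level config name to the first image listed under it."""
    images = {}
    current = None
    for line in config_text.splitlines():
        key = TOP_KEY_RE.match(line)
        if key:
            current = key.group(1)
            continue
        image = IMAGE_RE.match(line)
        if image and current is not None:
            images.setdefault(current, image.group(1))
    return images


def diverged_agentic_images(config_text, suffix="-agentic"):
    """Return (parent, parent_image, sibling, sibling_image) for mismatched pairs."""
    images = images_by_config(config_text)
    diverged = []
    for name, image in images.items():
        if name.endswith(suffix):
            parent = name[: -len(suffix)]
            if parent in images and images[parent] != image:
                diverged.append((parent, images[parent], name, image))
    return diverged


# Post-merge state described in step 3 of the proof.
sample = """\
qwen3.5-fp8-mi355x-sglang:
  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
qwen3.5-fp8-mi355x-sglang-agentic:
  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414
"""

for parent, parent_image, sibling, sibling_image in diverged_agentic_images(sample):
    print(f"{sibling} pins {sibling_image}, but {parent} uses {parent_image}")
```

Because some -agentic siblings diverge on purpose (the kimik2.5 and minimaxm2.5 blocks mentioned above), a real check would likely need an allowlist for documented divergences rather than treating every mismatch as an error.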
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25998847451
e2e test result: SLURM infrastructure failure Run 26000388078: All single-node and eval jobs failed — SLURM |
Drops the -20260517 nightly suffix so the recipe uses the lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag rather than a date-pinned nightly build.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005296676
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005437689
Docker Hub does not publish a clean lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag — only the dated nightly variant. The earlier switch to the un-suffixed tag was a mistake (caused 'manifest not found' on every job). Restoring the dated nightly tag that does exist.
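The difference between the dated nightly tag and the un-suffixed tag can be detected from the string alone. The sketch below assumes the nightly convention seen in this PR, a tag ending in `-YYYYMMDD`; the helper names are hypothetical, it does not query Docker Hub, and so it cannot tell whether a tag actually exists.

```python
import re

# Assumed nightly convention from this PR: the tag ends in "-YYYYMMDD".
DATED_TAG_RE = re.compile(r"^(?P<base>.+)-(?P<date>\d{8})$")


def split_image_ref(image_ref):
    """Split "repo:tag" into (repo, tag); tag is None when no tag is given.

    Simplification: registry hosts with a port (host:5000/repo) are not handled.
    """
    repo, sep, tag = image_ref.rpartition(":")
    if not sep:
        return image_ref, None
    return repo, tag


def dated_suffix(image_ref):
    """Return the YYYYMMDD suffix of the tag, or None if the tag has none."""
    _, tag = split_image_ref(image_ref)
    if tag is None:
        return None
    match = DATED_TAG_RE.match(tag)
    return match.group("date") if match else None


for ref in (
    "lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517",
    "lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x",
):
    print(ref, "->", dated_suffix(ref))
```

A string check like this only flags which form a recipe uses; confirming that the tag is published still requires pulling the image or inspecting its manifest.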
# Conflicts: # perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005772723
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009681780
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26074651620
Summary
- qwen3.5-fp8-mi355x-sglang image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
- qwen3.5-fp8-mi355x-sglang-mtp image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
Ref #1154
Generated with Claude Code