Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 #1443
Klaud-Cold wants to merge 5 commits into
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM. Straightforward SGLang ROCm image bump; please replace the XXX placeholder in the perf-changelog pr-link with 1443 before merge.
Extended reasoning...
Overview
This PR bumps the SGLang ROCm image for two configs (qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp) in .github/configs/amd-master.yaml from v0.5.10rc0-rocm720-mi35x-20260415 to v0.5.12-rocm720-mi35x-20260517, and adds a corresponding entry to perf-changelog.yaml. Two files, four lines of actual change plus a 7-line changelog block.
Security risks
None. This is a benchmark/CI config data file — no executable code, no auth, no secrets. The new image tag points to a vendor-published SGLang ROCm image used by the perf sweep runners; the same image registry (lmsysorg/sglang-rocm) is already in use across many configs in this file.
Level of scrutiny
Low. Image version bumps for benchmark configs are mechanical and follow an established pattern visible throughout this PR's surrounding context (qwen3.5-bf16-mi325x-sglang, glm5-fp8-mi355x-sglang, etc. all live in the same file and get bumped similarly via dedicated PRs — #1394, #1416, #1429 are the immediate predecessors in the changelog).
Other factors
The single issue surfaced by the bug hunter is a cosmetic placeholder (pull/XXX instead of pull/1443) in the changelog pr-link — already attached as an inline comment. It doesn't affect runtime behavior but should be fixed before merge so the changelog back-reference resolves. No CODEOWNER-sensitive paths, no logic changes, and the PR is labeled full-sweep-enabled so CI will exercise the new image.
- qwen3.5-bf16-mi355x-sglang-mtp
description:
- "Update SGLang ROCm image from v0.5.10rc0-rocm720-mi35x-20260415 to v0.5.12-rocm720-mi35x-20260517"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
🟡 The new perf-changelog entry added by this PR contains a placeholder pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX at line 2632 instead of the real PR number (1443). Every other recent entry in the file uses the actual PR number, so this should be replaced with /pull/1443 before merge to avoid leaving a broken changelog link.
Extended reasoning...
What the bug is: The new perf-changelog.yaml entry added by this PR (#1443) at lines 2626-2632 documents the SGLang ROCm image bump for qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp. The pr-link field on line 2632 is set to a literal placeholder string:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
The specific code path that triggers it: This is a static YAML data file (the perf changelog), so the bug manifests not at runtime but the moment the PR is merged: the URL .../pull/XXX does not resolve to a real PR, and any tooling or human reader that follows the link from the changelog entry hits a 404 (or a different, unrelated PR if one ever exists with that path).
Why existing code doesn't prevent it: There is no schema validation or CI check that enforces pr-link references a numeric PR ID — it's a plain string field. The author replaced every other placeholder with concrete values (model names, image tags, dates) but left this one with the boilerplate XXX.
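The missing check described above could be sketched roughly as follows. This is a minimal illustration, not an existing CI step in the repository: the function name `find_bad_pr_links` and the line-by-line scan are hypothetical, and the URL pattern is assumed from the pr-link values quoted in this review.

```python
import re

# Assumed URL shape, inferred from the pr-link values quoted in this review.
PR_LINK_RE = re.compile(
    r"^https://github\.com/SemiAnalysisAI/InferenceX/pull/(\d+)$"
)


def find_bad_pr_links(changelog_text):
    """Return (line_number, value) pairs for pr-link fields without a numeric PR ID.

    Hypothetical helper: scans the raw text line by line rather than parsing
    YAML, so it has no third-party dependencies.
    """
    bad = []
    for lineno, line in enumerate(changelog_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("pr-link:"):
            continue
        value = stripped[len("pr-link:"):].strip()
        if PR_LINK_RE.match(value) is None:
            bad.append((lineno, value))
    return bad


# Two-line sample: one valid link and one placeholder.
sample = """\
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1394
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
"""
print(find_bad_pr_links(sample))
```

Run against the full changelog, a check like this would flag the placeholder entry while accepting the numeric links in the preceding entries.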
Step-by-step proof:
- Open perf-changelog.yaml at line 2632 in this PR's diff.
- Observe the value: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
- Compare to the immediately preceding entries in the same file:
  - line 2606: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1423
  - line 2612: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1429
  - line 2619: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1416
  - line 2625: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1394
- The PR metadata (and the description, which references issue [Auto] Docker Image Updates Available - 2026-04-25 #1154) confirms this PR's number is 1443.
- After merge, navigating to https://github.com/SemiAnalysisAI/InferenceX/pull/XXX returns 404, so the changelog entry has no working back-reference.
Impact: Cosmetic / documentation only — no runtime behavior is affected, no benchmark is impacted. But the perf-changelog exists specifically so future readers can trace any image/version bump back to the PR that introduced it; a placeholder link defeats that purpose for this entry.
How to fix: Replace XXX with 1443 on line 2632 before merge:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1443
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25998833055
e2e test result: SLURM infrastructure failure Run 26000387400: All single-node and eval jobs failed — SLURM |
…2-rocm720-mi35x-20260517 Ref #1154 Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
Force-pushed from 7695122 to 5f42a2e.
Drops the -20260517 nightly suffix so the recipe uses the lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag rather than a date-pinned nightly build.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005298161
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005438628
Docker Hub does not publish a clean lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag — only the dated nightly variant. The earlier switch to the un-suffixed tag was a mistake (caused 'manifest not found' on every job). Restoring the dated nightly tag that does exist.
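The difference between the dated nightly tag and the un-suffixed tag can be made explicit with a small parser. This is only a sketch: the tag layout (version, ROCm build, architecture, optional eight-digit date) is an assumption inferred from the tags named in this PR, `split_tag` is a hypothetical helper, and the code does not query Docker Hub, so it cannot tell whether a tag actually exists.

```python
import re

# Assumed tag layout, inferred from the tags named in this PR:
#   <version>-<rocm build>-<arch>[-<YYYYMMDD>]
TAG_RE = re.compile(
    r"^(?P<version>v\d+\.\d+\.\d+(?:rc\d+)?)"
    r"-(?P<rocm>rocm\d+)"
    r"-(?P<arch>[a-z0-9]+)"
    r"(?:-(?P<date>\d{8}))?$"
)


def split_tag(image):
    """Split 'repo:tag' into its parts; 'date' is None when the tag has no date suffix."""
    repo, _, tag = image.partition(":")
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag layout: {tag!r}")
    return {"repo": repo, **match.groupdict()}


dated = split_tag("lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517")
undated = split_tag("lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x")
print(dated["date"], undated["date"])
```

A parser like this only classifies the tag string. Confirming that the registry publishes a given tag still requires pulling the image or inspecting its manifest.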
# Conflicts: # perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26005774349
Summary
- Update qwen3.5-bf16-mi355x-sglang image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
- Update qwen3.5-bf16-mi355x-sglang-mtp image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517

Ref #1154
Generated with Claude Code