Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520 #1528
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
fa55687 to 7034272
dsv4-fp4-gb300-dynamo-sglang:
- image: lmsysorg/sglang:nightly-dev-cu13-20260519-dbac4647
+ image: lmsysorg/sglang:nightly-dev-cu13-20260520-425dffbd
🟡 Missing perf-changelog.yaml entry for this image bump. The immediately-preceding PR #1492 (20260518 → 20260519 bump of this same dsv4-fp4-gb300-dynamo-sglang config-key) added an explicit entry under that key, and other recent image-bump PRs (#1411, #1444, #1475) followed the same convention. Consider adding a parallel entry to keep the changelog consistent (also worth noting the SGLANG_OPT_FP8_WO_A_GEMM=0 removal, which is a functional change worth recording).
Extended reasoning...
What's missing
This PR bumps the SGLang image for the dsv4-fp4-gb300-dynamo-sglang config-key (in .github/configs/nvidia-master.yaml:8762) from nightly-dev-cu13-20260519-dbac4647 to nightly-dev-cu13-20260520-425dffbd and, alongside that, removes the SGLANG_OPT_FP8_WO_A_GEMM=0 workaround from six disagg-gb300-*.yaml recipes (PR description: "fixed in 0520 nightly via sgl-project/sglang#25805"). It does not add an entry to perf-changelog.yaml.
Why this is a convention break
The immediately-preceding PR for this same config-key — #1492 (commit 80c944e, 20260518 → 20260519) — added an explicit entry to perf-changelog.yaml at lines 3020–3024:
Update SGLang image from nightly-dev-cu13-20260518-c67b2870 to nightly-dev-cu13-20260519-dbac4647
The same pattern shows up across other recent image-bump PRs:
- #1411, "[AMD/ROCm] qwen3.5-fp8-mi355x-atom, Bump image to rocm/atom:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom20260511" (qwen3.5-fp8-mi355x-atom image bump): touched perf-changelog.yaml
- #1475, "[Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" (qwen3.5-fp4-b300-sglang/mtp image bump): touched perf-changelog.yaml (+7 lines)
- #1444, "Update qwen3.5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517" / #1443, "Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517" / #1440, "Update glm5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517": same pattern
The current PR (fa55687) modifies 7 files (.github/configs/nvidia-master.yaml + six disagg-gb300-*.yaml recipes) but does not touch perf-changelog.yaml at all.
Step-by-step proof
1. git show 80c944e --stat for PR #1492 (Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519) shows perf-changelog.yaml | 14 ++++++++, i.e. the 20260518→20260519 bump added a changelog entry.
2. perf-changelog.yaml lines 3012–3024 still contain that entry under dsv4-fp4-gb300-dynamo-sglang.
3. git show fa55687 --stat for the current PR lists 7 modified files: .github/configs/nvidia-master.yaml plus the six disagg-gb300-*.yaml recipes. perf-changelog.yaml is not in the list.
4. The PR performs the exact same kind of change as #1492 (sequential nightly bump of the same key), plus an extra functional change (removing SGLANG_OPT_FP8_WO_A_GEMM=0 from prefill+decode environments in 6 recipes), which is arguably even more worth recording.
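The check described in the proof above could be automated. The following is a minimal sketch, not part of this repository: the function name missing_changelog is hypothetical, and it assumes perf-changelog.yaml lives at the repository root and that the list of changed paths is obtained separately (for example from git show --stat or git diff --name-only).

```python
from typing import Iterable

# Paths as named in this review; the changelog location at the repo root
# is an assumption.
CONFIG_FILE = ".github/configs/nvidia-master.yaml"
CHANGELOG_FILE = "perf-changelog.yaml"


def missing_changelog(changed_files: Iterable[str]) -> bool:
    """Return True when the master config changed but the changelog did not."""
    changed = set(changed_files)
    return CONFIG_FILE in changed and CHANGELOG_FILE not in changed
```

Applied to the file list of the current PR (the master config plus six recipes, no changelog), this would return True. It deliberately ignores what changed inside the config, so a non-image edit would also be flagged.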
Impact
This is a documentation/observability concern, not a runtime bug — the recipes themselves will run fine. The missed entry only affects the historical perf-tracking trail for this config-key. Given that the previous bump (one day earlier, same author) did add the entry, this looks more like an oversight than an intentional skip.
Suggested fix
Add a perf-changelog.yaml entry under dsv4-fp4-gb300-dynamo-sglang along the lines of:
Update SGLang image from nightly-dev-cu13-20260519-dbac4647 to nightly-dev-cu13-20260520-425dffbd; remove SGLANG_OPT_FP8_WO_A_GEMM=0 workaround (fixed upstream in sgl-project/sglang#25805).
  SGLANG_LOG_FORWARD_ITERS: "1"
  SGLANG_LOG_MS: "1"
  SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
- SGLANG_OPT_FP8_WO_A_GEMM: "0"
  # is single-node only and corrupts results in 2-node decode setups.
🟡 The PR removes the SGLANG_OPT_FP8_WO_A_GEMM: "0" line in each decode_environment block but leaves the trailing comment # is single-node only and corrupts results in 2-node decode setups. behind. The orphan now reads as a subject-less sentence fragment beginning with "is" — the line above it is unrelated (SGLANG_REQUEST_STATE_WAIT_TIMEOUT in 5 yamls, SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW in the tp4-tp4 yaml). Nit — the comment should be dropped along with the env var. Affects all 6 modified YAMLs (10p1d, 12p1d, 1p1d-dep, 1p1d-tp4-tp4, 4p1d, 8p1d).
Extended reasoning...
What the bug is
The PR removes SGLANG_OPT_FP8_WO_A_GEMM: "0" from both the prefill_environment and decode_environment blocks because the underlying issue was fixed upstream (sgl-project/sglang#25805). However, in the decode_environment block of every modified YAML the trailing comment immediately below the removed line — # is single-node only and corrupts results in 2-node decode setups. — is left behind.
Step-by-step proof (using disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml)
Pre-PR snippet (decode_environment, ending lines):
SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
SGLANG_OPT_FP8_WO_A_GEMM: "0"
# is single-node only and corrupts results in 2-node decode setups.

Post-PR snippet (lines 117-119 of the new file):
SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
# is single-node only and corrupts results in 2-node decode setups.
sglang_config:

The comment now dangles directly under SGLANG_REQUEST_STATE_WAIT_TIMEOUT, which is a request-state timeout that has nothing to do with single-node/2-node decode correctness. A reader encountering this hunk has no way to know what "is single-node only" refers to: the subject of the sentence has been deleted.
Why this is a nit, not a functional bug
YAML treats # as a line comment, so parsing still succeeds and runtime behavior is unaffected. The damage is purely to future readability: the next person bumping the image (or chasing a decode-correctness issue) will see an unattributed warning about 2-node decode corruption and have no way to act on it. It also occurs in the same hunks the PR is already touching, so the cleanup cost is essentially zero.
Addressing the refutation
One verifier dismissed this as duplicate-of-already-refuted and "long-standing leftover commentary style." That framing is not quite right:
- The comment is not pre-existing in its current orphaned form: before this PR it was the trailing description of a real SGLANG_OPT_FP8_WO_A_GEMM: "0" line directly above it. The PR is what creates the orphan by removing the subject.
- A second verifier noted that the comment originated as the second line of a 2-line comment block about SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 (still visible intact in sibling files like disagg-low-latency-1p1d-tp4-tp4-mtp.yaml). Either way, in these modified files it is now dangling.
- The PR is the natural moment to clean it up: the change is mechanically aligned with the diff hunks (just extend the - to include the comment line), no broader refactoring needed.
How to fix
In each of the 6 modified files, also remove the trailing # is single-node only and corrupts results in 2-node decode setups. line in decode_environment (lines ~120 in the 5 dep yamls, ~111 in disagg-gb300-1p1d-tp4-tp4-2-c1.yaml). No other change needed.
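If editing six files by hand is tedious, the cleanup could be scripted. This is a minimal sketch under stated assumptions: the helper name drop_orphan_comment is hypothetical, and it matches the comment text exactly as quoted in this review, ignoring indentation.

```python
# Comment text exactly as quoted in the review above.
ORPHAN = "# is single-node only and corrupts results in 2-node decode setups."


def drop_orphan_comment(yaml_text: str, orphan: str = ORPHAN) -> str:
    """Remove every line whose stripped content equals the orphan comment."""
    kept = [
        line
        for line in yaml_text.splitlines(keepends=True)
        if line.strip() != orphan
    ]
    return "".join(kept)
```

This should only be run on the six modified recipes. In sibling files where the same text is still the second line of an intact two-line comment, removing it would create a different fragment.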
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26161681168
/reuse-sweep-run |
Summary
nightly-dev-cu13-20260519-dbac4647 to nightly-dev-cu13-20260520-425dffbd