Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492
  model:
    path: "deepseek-v4-pro"
-   container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
+   container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"
🟡 Nit: the file-level preamble (lines 14-18) of all 6 modified recipe yamls says container is "restored to the alias mapped in launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell ...)", but the same diff changes container: on line 36 to the fully-qualified pinned tag lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870, which shares nothing with the deepseek-v4-grace-blackwell alias name the preamble cites. A reader following the comment will grep srtslurm.yaml for deepseek-v4-grace-blackwell and be confused when it doesn't match line 36. Consider updating the parenthetical in the 6 recipe files (10p1d, 12p1d, 1p1d-dep, 1p1d-tp4, 4p1d, 8p1d) to the new image tag, or generalizing it so future image bumps don't drift again.
Extended reasoning...
What the comment says vs. what's there now
Each of the 6 modified recipe yamls carries an identical preamble paragraph at lines 14-18 explaining how this file was forked from PR #1213's upstream-style recipe:
Other adjustments back to the InferenceX cluster shape: container & model.path restored to the aliases mapped in launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell and deepseek-v4-pro); ...
After this PR, line 36 of each file reads:
container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"
deepseek-v4-pro is still correct for model.path, but the container value no longer contains the alias name the preamble cites.
Step-by-step proof a reader gets confused
- Reader opens disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml, sees container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870" on line 36.
- Wonders "where is this image alias resolved?" The preamble at line 16 says "the aliases mapped in launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell ...)".
- Reader greps runners/launch_gb300-cw.sh (or its srtslurm.yaml) for deepseek-v4-grace-blackwell and finds a dsv4-grace-blackwell key, but its value isn't nightly-dev-cu13-20260518-c67b2870.
- Reader is left thinking either (a) the comment is stale, or (b) they're looking at the wrong mapping. In reality the resolution works via the dynamic "${IMAGE}": ${SQUASH_FILE} entry in containers, but the preamble doesn't mention that path.
Addressing the refutation
The strongest counter-argument is: pre-PR container: was lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev, which already didn't exactly match the comment's lmsysorg/sglang:deepseek-v4-grace-blackwell, so this is pre-existing drift, not a new bug. That's partly true — the comment was never a literal mirror. But pre-PR both strings shared the deepseek-v4-grace-blackwell substring, so a reader following the comment would still land on the right alias. Post-PR there is no substring overlap, which turns mild stylistic drift into a real wild-goose-chase. The PR diff is the line that crossed the threshold.
Second counter: image bumps are routine and many prior bumps haven't updated preambles. That argues for nit severity, which is what I'm filing it as, not for skipping the fix entirely. The diff already modifies line 36 (the container value) in all 6 files; updating the nearby preamble in the same PR is low-cost and stops the comment from continuing to rot. A one-line generalization ("the SGLang container image and model alias managed in launch_gb300.sh's srtslurm.yaml") would future-proof it.
Impact
Documentation-only. No runtime effect — the ${IMAGE} dynamic key in runners/launch_gb300-cw.sh:186-193 resolves the new tag correctly. The cost is reader confusion when onboarding to these recipes or debugging container resolution.
Fix
Either (a) update the parenthetical to the new image tag in the 6 modified files, or (b) replace the alias name with a generic phrase like "the SGLang image alias managed via launch_gb300.sh's srtslurm.yaml mapping" so the comment doesn't need touching on every bump.
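To keep this kind of drift from recurring, a small check could compare image names mentioned in a recipe's comments against its container: value. The following is a minimal sketch, not part of the PR: the function name stale_image_mentions and both regexes are hypothetical, and it assumes images are written as lmsysorg/<repo>:<tag> and that container: is a double-quoted scalar, as in the lines quoted above.

```python
import re

# Assumption: the recipe stores the image as a double-quoted `container:` value.
CONTAINER_RE = re.compile(r'^\s*container:\s*"([^"]+)"', re.MULTILINE)
# Assumption: images mentioned in comments look like lmsysorg/<repo>:<tag>.
COMMENT_IMAGE_RE = re.compile(r"lmsysorg/[\w.-]+:[\w.-]+")


def stale_image_mentions(recipe_text: str) -> list[str]:
    """Return images named in '#' comments that differ from the container: value."""
    match = CONTAINER_RE.search(recipe_text)
    if match is None:
        return []
    container = match.group(1)
    stale = []
    for line in recipe_text.splitlines():
        stripped = line.lstrip()
        if not stripped.startswith("#"):
            continue  # only inspect comment lines
        for image in COMMENT_IMAGE_RE.findall(stripped):
            if image != container:
                stale.append(image)
    return stale


if __name__ == "__main__":
    sample = (
        "# container & model.path restored to the aliases mapped in\n"
        "# srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell and deepseek-v4-pro)\n"
        "model:\n"
        '  path: "deepseek-v4-pro"\n'
        '  container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"\n'
    )
    print(stale_image_mentions(sample))
```

With option (b), the preamble names no concrete image, so a check like this has nothing to flag on later bumps.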
@@ -136,7 +118,6 @@ backend:
      SGLANG_LOG_FORWARD_ITERS: "1"
      SGLANG_LOG_MS: "1"
🟡 The PR removed the first line of a two-line comment block (# SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2) in all 6 modified YAML recipe files but left behind the continuation line # is single-node only and corrupts results in 2-node decode setups.. The dangling line is now a grammatically incomplete sentence fragment with no subject. Either drop the orphan line or restore the first line to preserve the original explanation.
Extended reasoning...
What is the bug?
In this PR, the line # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 is removed from all six modified YAML recipes, but the continuation of that comment on the next line — # is single-node only and corrupts results in 2-node decode setups. — is left in place. Without its first line, the remaining comment is a sentence fragment that starts with the verb is and has no subject, so a future reader cannot tell what is single-node only or what corrupts results.
Where it appears
The orphan line is visible in the post-PR files at:
- disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml (around line 121, end of decode_environment)
- disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml (same position)
- disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml (same position)
- disagg-gb300-1p1d-tp4-tp4-2-c1.yaml (near line 113, end of decode_environment)
- disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml (same position as the first three)
- disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml (same position)
Step-by-step proof
Look at the diff for disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml around lines 136–138 of the original file:
SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
- # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
# is single-node only and corrupts results in 2-node decode setups.
The hunk deletes line 1 of the two-line comment but keeps line 2. Confirm by reading the post-PR file content embedded in the PR: each of the six recipes now ends its decode_environment block with the lone line # is single-node only and corrupts results in 2-node decode setups. — with no antecedent for is.
Impact
Purely cosmetic. YAML treats #-prefixed lines as comments, so this has zero effect on srtctl/SGLang runtime behavior. However, the recipe files are the human-facing source of truth for cluster configuration, and the broken comment looks like an accidental partial edit rather than an intentional change, which degrades readability across six places.
How to fix
Either remove the orphan continuation line entirely (matches the apparent intent of dropping the CAR_V2 commentary now that more env vars have been cleaned up), or restore the first line so the explanation reads as a complete sentence. The cleanest fix is to delete the orphan line in all six files.
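The "delete the orphan line" option can be sketched as a small text filter. This is an illustrative sketch only: drop_orphan_comment is a hypothetical helper, and it assumes the orphan text matches the quoted comment exactly. It leaves the line alone when the preceding line still carries the SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 explanation, so an intact two-line comment is preserved.

```python
ORPHAN = "# is single-node only and corrupts results in 2-node decode setups."
FIRST_LINE_MARKER = "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2"


def drop_orphan_comment(recipe_text: str) -> str:
    """Remove the dangling continuation line unless its first line is still present."""
    kept: list[str] = []
    for line in recipe_text.splitlines(keepends=True):
        has_first_line = bool(kept) and FIRST_LINE_MARKER in kept[-1]
        if line.strip() == ORPHAN and not has_first_line:
            continue  # orphaned fragment: skip it
        kept.append(line)
    return "".join(kept)


if __name__ == "__main__":
    broken = (
        '    SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"\n'
        "    # is single-node only and corrupts results in 2-node decode setups.\n"
    )
    print(drop_orphan_comment(broken), end="")
```

Running it over each of the six recipe files and writing the result back would apply the fix uniformly; since the line is a YAML comment, the change has no effect on the parsed configuration.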
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26017958676
9f8af61 to d9ec555
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26044699951
d9ec555 to 6cbf211
Fix benchmark client crash due to transformers not recognizing deepseek_v4 model type when loading tokenizer.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26099478534
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26101398081
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26116158481
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26117178678
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26143763116
Summary
- Update the SGLang image from lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev to lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870 for all non-MTP disagg configs
- Switch moe-a2a-backend from deepep to megamoe for wideep configs
- Rename environment variables (SGLANG_ENABLE_THINKING → SGLANG_DEFAULT_THINKING, SGLANG_REASONING_EFFORT → SGLANG_DSV4_REASONING_EFFORT)
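Because the environment variable renames touch several recipes, a quick scan for leftover old names may help when reviewing. The sketch below is an assumption-laden illustration, not tooling from this repository: find_old_env_vars is a hypothetical helper, the rename table is copied from the summary above, and the sample values are made up.

```python
import re

# Rename table taken from the PR summary: old name -> new name.
RENAMED = {
    "SGLANG_ENABLE_THINKING": "SGLANG_DEFAULT_THINKING",
    "SGLANG_REASONING_EFFORT": "SGLANG_DSV4_REASONING_EFFORT",
}


def find_old_env_vars(recipe_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, old name, new name) for each leftover old variable."""
    hits = []
    for lineno, line in enumerate(recipe_text.splitlines(), start=1):
        for old, new in RENAMED.items():
            # \b keeps the match to the whole identifier, not a longer name.
            if re.search(rf"\b{re.escape(old)}\b", line):
                hits.append((lineno, old, new))
    return hits


if __name__ == "__main__":
    # Hypothetical recipe fragment; the values are illustrative only.
    sample = 'SGLANG_ENABLE_THINKING: "1"\nSGLANG_DSV4_REASONING_EFFORT: "max"\n'
    for lineno, old, new in find_old_env_vars(sample):
        print(f"line {lineno}: {old} should be {new}")
```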