
Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519 #1492

Merged
functionstackx merged 6 commits into main from dpskv4-gb300-nonmtp-nightly-20260518
May 20, 2026

@yhyang201
Collaborator

Summary

  • Update SGLang container image from lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev to lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870 for all non-MTP disagg configs
  • Switch moe-a2a-backend from deepep to megamoe for wideep configs
  • Remove obsolete/redundant environment variables that no longer exist in sglang main or whose defaults already match the set values
  • Replace deprecated env vars (SGLANG_ENABLE_THINKING → SGLANG_DEFAULT_THINKING, SGLANG_REASONING_EFFORT → SGLANG_DSV4_REASONING_EFFORT)
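
The env var replacement in the last bullet can be sketched as a small mapping. This is a minimal illustration, not code from the PR: it assumes the first name in each pair is the deprecated one, that values carry over unchanged, and `migrate_env` is a hypothetical helper name.

```python
# Old name -> new name, as listed in the PR summary.
# Assumption: the first name of each pair is the deprecated one.
RENAMED_ENV_VARS = {
    "SGLANG_ENABLE_THINKING": "SGLANG_DEFAULT_THINKING",
    "SGLANG_REASONING_EFFORT": "SGLANG_DSV4_REASONING_EFFORT",
}


def migrate_env(env):
    """Return a copy of env with deprecated names replaced by their successors.

    Assumption: the value carries over unchanged. If both the old and the
    new name are present, the new name wins and the old one is dropped.
    """
    migrated = dict(env)
    for old, new in RENAMED_ENV_VARS.items():
        if old in migrated:
            value = migrated.pop(old)
            migrated.setdefault(new, value)
    return migrated


if __name__ == "__main__":
    print(migrate_env({"SGLANG_ENABLE_THINKING": "1", "SGLANG_LOG_MS": "1"}))
```

Variables that are not in the mapping, such as SGLANG_LOG_MS, pass through untouched.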

 model:
   path: "deepseek-v4-pro"
-  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
+  container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"

🟡 Nit: the file-level preamble (lines 14-18) of all 6 modified recipe yamls says container is "restored to the alias mapped in launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell ...)", but the same diff changes container: on line 36 to the fully-qualified pinned tag lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870, which no longer contains the deepseek-v4-grace-blackwell substring the preamble names. A reader following the comment will grep srtslurm.yaml for deepseek-v4-grace-blackwell and be confused when it doesn't match line 36. Consider updating the parenthetical in the 6 recipe files (10p1d, 12p1d, 1p1d-dep, 1p1d-tp4, 4p1d, 8p1d) to the new image tag, or generalizing it so future image bumps don't drift again.

Extended reasoning...

What the comment says vs. what's there now

Each of the 6 modified recipe yamls carries an identical preamble paragraph at lines 14-18 explaining how this file was forked from PR #1213's upstream-style recipe:

Other adjustments back to the InferenceX cluster shape: container & model.path restored to the aliases mapped in launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell and deepseek-v4-pro); ...

After this PR, line 36 of each file reads:

container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"

deepseek-v4-pro is still correct for model.path, but the container side has no shared substring with the alias the preamble names.

Step-by-step proof a reader gets confused

  1. Reader opens disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml, sees container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870" on line 36.
  2. Wonders "where is this image alias resolved?" — preamble at line 16 says "the aliases mapped in launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell...)".
  3. Reader greps runners/launch_gb300-cw.sh (or its srtslurm.yaml) for deepseek-v4-grace-blackwell — finds a dsv4-grace-blackwell key, but its value isn't nightly-dev-cu13-20260518-c67b2870.
  4. Reader is left thinking either (a) the comment is stale, or (b) they're looking at the wrong mapping. In reality the resolution works via the dynamic "${IMAGE}": ${SQUASH_FILE} entry in containers, but the preamble doesn't mention that path.
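
The resolution path mentioned in step 4 can be pictured with a small sketch. It is an illustration under stated assumptions, not the logic in runners/launch_gb300-cw.sh: `build_container_map` and `resolve` are hypothetical names, the squash file paths are placeholders, and the containers mapping is assumed to be a plain name-to-file lookup with one extra entry keyed by the full image tag.

```python
def build_container_map(static_aliases, image, squash_file):
    """Model a containers mapping that has fixed aliases plus one dynamic
    "${IMAGE}": ${SQUASH_FILE} style entry (assumed shape, not the real file)."""
    containers = dict(static_aliases)
    containers[image] = squash_file  # dynamic entry keyed by the full image tag
    return containers


def resolve(containers, container):
    """Look up the recipe's container: value in the mapping."""
    try:
        return containers[container]
    except KeyError:
        raise KeyError(f"no containers entry for {container!r}") from None


if __name__ == "__main__":
    image = "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"
    containers = build_container_map(
        {"dsv4-grace-blackwell": "/hypothetical/old-image.sqsh"},  # placeholder path
        image,
        "/hypothetical/new-image.sqsh",  # placeholder path
    )
    print(resolve(containers, image))
```

Under this model the pinned tag resolves through the dynamic entry, which is why grepping for the alias named in the preamble finds a key that does not explain line 36.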

Addressing the refutation

The strongest counter-argument is: pre-PR container: was lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev, which already didn't exactly match the comment's lmsysorg/sglang:deepseek-v4-grace-blackwell, so this is pre-existing drift, not a new bug. That's partly true — the comment was never a literal mirror. But pre-PR both strings shared the deepseek-v4-grace-blackwell substring, so a reader following the comment would still land on the right alias. Post-PR there is no substring overlap, which turns mild stylistic drift into a real wild-goose-chase. The PR diff is the line that crossed the threshold.

Second counter: image bumps are routine and many prior bumps haven't updated preambles. That argues for severity nit, which is what I'm filing it as, not for skipping the fix entirely. The diff already modifies line 36 (the container value) in all 6 files; updating the adjacent preamble in the same hunk is low-cost and stops the comment from continuing to rot. A one-line generalization ("the SGLang container image and model alias managed in launch_gb300.sh's srtslurm.yaml") would future-proof it.

Impact

Documentation-only. No runtime effect — the ${IMAGE} dynamic key in runners/launch_gb300-cw.sh:186-193 resolves the new tag correctly. The cost is reader confusion when onboarding to these recipes or debugging container resolution.

Fix

Either (a) update the parenthetical to the new image tag in the 6 modified files, or (b) replace the alias name with a generic phrase like "the SGLang image alias managed via launch_gb300.sh's srtslurm.yaml mapping" so the comment doesn't need touching on every bump.
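
If option (a) is chosen, a small check could catch this kind of drift on later image bumps. The following is a rough sketch, not part of the PR: `container_value` and `stale_image_mentions` are hypothetical helpers, the "repo/name:tag" regex is a simplification, and the recipe is scanned as plain text rather than parsed as YAML.

```python
import re

# Rough matcher for "org/name:tag" style image references (a simplification).
IMAGE_RE = re.compile(r"[\w.\-]+(?:/[\w.\-]+)+:[\w.\-]+")
CONTAINER_RE = re.compile(r'\s*container:\s*"?([^"\s]+)"?')


def container_value(text):
    """Return the first container: value found in the recipe text, if any."""
    for line in text.splitlines():
        match = CONTAINER_RE.match(line)
        if match:
            return match.group(1)
    return None


def stale_image_mentions(text):
    """List image-like strings in '#' comments that differ from container:."""
    current = container_value(text)
    stale = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            stale.extend(ref for ref in IMAGE_RE.findall(stripped) if ref != current)
    return stale


if __name__ == "__main__":
    sample = "\n".join([
        "# container & model.path restored to the aliases mapped in",
        "# launch_gb300.sh's srtslurm.yaml (lmsysorg/sglang:deepseek-v4-grace-blackwell",
        "# and deepseek-v4-pro)",
        "model:",
        '  path: "deepseek-v4-pro"',
        '  container: "lmsysorg/sglang:nightly-dev-cu13-20260518-c67b2870"',
    ])
    print(stale_image_mentions(sample))
```

Option (b) makes such a check unnecessary, since a generic phrase contains no image reference to go stale.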

@@ -136,7 +118,6 @@ backend:
SGLANG_LOG_FORWARD_ITERS: "1"
SGLANG_LOG_MS: "1"

🟡 The PR removed the first line of a two-line comment block ("# SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2") in all 6 modified YAML recipe files but left behind the continuation line "# is single-node only and corrupts results in 2-node decode setups." The dangling line is now a sentence fragment with no subject. Either drop the orphan line or restore the first line to preserve the original explanation.

Extended reasoning...

What is the bug?

In this PR, the line # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 is removed from all six modified YAML recipes, but the continuation of that comment on the next line — # is single-node only and corrupts results in 2-node decode setups. — is left in place. Without its first line, the remaining comment is a sentence fragment that starts with the verb is and has no subject, so a future reader cannot tell what is single-node only or what corrupts results.

Where it appears

The orphan line is visible in the post-PR files at:

  • disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml (around line 121, end of decode_environment)
  • disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml (same position)
  • disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml (same position)
  • disagg-gb300-1p1d-tp4-tp4-2-c1.yaml (near line 113, end of decode_environment)
  • disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml (same position as the first three)
  • disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml (same position)

Step-by-step proof

Look at the diff for disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml around lines 136–138 of the original file:

     SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
-    # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
     # is single-node only and corrupts results in 2-node decode setups.

The hunk deletes line 1 of the two-line comment but keeps line 2. Confirm by reading the post-PR file content embedded in the PR: each of the six recipes now ends its decode_environment block with the lone line # is single-node only and corrupts results in 2-node decode setups. — with no antecedent for is.

Impact

Purely cosmetic. YAML treats #-prefixed lines as comments, so this has zero effect on srtctl/SGLang runtime behavior. However, the recipe files are the human-facing source of truth for cluster configuration, and the broken comment looks like an accidental partial edit rather than an intentional change, which degrades readability across six places.

How to fix

Either remove the orphan continuation line entirely (matches the apparent intent of dropping the CAR_V2 commentary now that more env vars have been cleaned up), or restore the first line so the explanation reads as a complete sentence. The cleanest fix is to delete the orphan line in all six files.
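
A partial comment deletion like this one can be spotted mechanically in a unified diff. The sketch below is a heuristic under stated assumptions, not tooling from this repository: `orphaned_comment_lines` is a hypothetical helper, and it flags any kept '#' line that directly follows a deleted '#' line, so it can also flag an unrelated comment that merely sits next to a removed one.

```python
def orphaned_comment_lines(diff_text):
    """Flag kept '#' comment lines that directly follow a deleted '#' comment line.

    In a unified diff, deleted lines start with '-' and context lines with ' '.
    """
    lines = diff_text.splitlines()
    orphans = []
    for prev, cur in zip(lines, lines[1:]):
        prev_deleted_comment = prev.startswith("-") and prev[1:].lstrip().startswith("#")
        cur_kept_comment = cur.startswith(" ") and cur.lstrip().startswith("#")
        if prev_deleted_comment and cur_kept_comment:
            orphans.append(cur.strip())
    return orphans


if __name__ == "__main__":
    hunk = "\n".join([
        '     SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"',
        "-    # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2",
        "     # is single-node only and corrupts results in 2-node decode setups.",
    ])
    for line in orphaned_comment_lines(hunk):
        print("possible orphan:", line)
```

Run against the hunk quoted in the proof above, it reports the leftover continuation line.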


@yhyang201 force-pushed the dpskv4-gb300-nonmtp-nightly-20260518 branch from 9f8af61 to d9ec555 on May 18, 2026 15:55

@yhyang201 force-pushed the dpskv4-gb300-nonmtp-nightly-20260518 branch from d9ec555 to 6cbf211 on May 19, 2026 13:12
@yhyang201 changed the title from "Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260518" to "Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519" on May 19, 2026
Fix benchmark client crash due to transformers not recognizing
deepseek_v4 model type when loading tokenizer.

@functionstackx
Collaborator

/reuse-sweep-run

@functionstackx merged commit 80c944e into main on May 20, 2026
4 of 5 checks passed
@functionstackx deleted the dpskv4-gb300-nonmtp-nightly-20260518 branch on May 20, 2026 05:40