Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520 #1528

Merged · Oseltamivir merged 1 commit into main from update-sglang-nightly-0520 · May 20, 2026
@yhyang201 (Collaborator) commented May 20, 2026

Summary

  • Bump non-MTP disagg SGLang image from nightly-dev-cu13-20260519-dbac4647 to nightly-dev-cu13-20260520-425dffbd

@github-actions (Contributor) commented

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@yhyang201 force-pushed the update-sglang-nightly-0520 branch from fa55687 to 7034272 on May 20, 2026 12:10
@yhyang201 changed the title from "Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520" to "[DONOTMERGE] Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520" on May 20, 2026

  dsv4-fp4-gb300-dynamo-sglang:
-   image: lmsysorg/sglang:nightly-dev-cu13-20260519-dbac4647
+   image: lmsysorg/sglang:nightly-dev-cu13-20260520-425dffbd
🟡 Missing perf-changelog.yaml entry for this image bump. The immediately-preceding PR #1492 (20260518 → 20260519 bump of this same dsv4-fp4-gb300-dynamo-sglang config-key) added an explicit entry under that key, and other recent image-bump PRs (#1411, #1444, #1475) followed the same convention. Consider adding a parallel entry to keep the changelog consistent (also worth noting the SGLANG_OPT_FP8_WO_A_GEMM=0 removal, which is a functional change worth recording).

Extended reasoning...

What's missing

This PR bumps the SGLang image for the dsv4-fp4-gb300-dynamo-sglang config-key (in .github/configs/nvidia-master.yaml:8762) from nightly-dev-cu13-20260519-dbac4647 to nightly-dev-cu13-20260520-425dffbd and, alongside that, removes the SGLANG_OPT_FP8_WO_A_GEMM=0 workaround from six disagg-gb300-*.yaml recipes (PR description: "fixed in 0520 nightly via sgl-project/sglang#25805"). It does not add an entry to perf-changelog.yaml.

Why this is a convention break

The immediately-preceding PR for this same config-key — #1492 (commit 80c944e, 20260518 → 20260519) — added an explicit entry to perf-changelog.yaml at lines 3020–3024:

Update SGLang image from nightly-dev-cu13-20260518-c67b2870 to nightly-dev-cu13-20260519-dbac4647

The same pattern shows up across other recent image-bump PRs (#1411, #1444, #1475).

The current PR (fa55687) modifies 7 files (.github/configs/nvidia-master.yaml + six disagg-gb300-*.yaml recipes) but does not touch perf-changelog.yaml at all.

Step-by-step proof

  1. git show 80c944e --stat for PR #1492 (Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260519) shows perf-changelog.yaml | 14 ++++++++, i.e. the 20260518→20260519 bump added a changelog entry.
  2. perf-changelog.yaml lines 3012–3024 still contain that entry under dsv4-fp4-gb300-dynamo-sglang.
  3. git show fa55687 --stat for the current PR lists 7 modified files: .github/configs/nvidia-master.yaml plus the six disagg-gb300-*.yaml recipes. perf-changelog.yaml is not in the list.
  4. The PR performs the exact same kind of change as PR #1492 (a sequential nightly bump of the same key), plus an extra functional change (removing SGLANG_OPT_FP8_WO_A_GEMM=0 from the prefill and decode environments in 6 recipes), which is arguably even more worth recording.
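
The check in steps 1 to 3 can be approximated mechanically. The following is a minimal sketch, not part of this repository's CI: the function name `missing_changelog` and its defaults are hypothetical, and it assumes the changed-file list has already been collected (for example from `git diff --name-only`).

```python
def missing_changelog(
    changed_files,
    config_path=".github/configs/nvidia-master.yaml",
    changelog_path="perf-changelog.yaml",
):
    """Return True if the master config changed but the changelog did not.

    Both default paths are taken from the review comment; the function
    itself is a hypothetical helper, not an existing repository script.
    """
    changed = set(changed_files)
    return config_path in changed and changelog_path not in changed


# Illustrative file lists only; the recipe's directory is not given in the
# review, so the bare file name is used here.
current_pr = [
    ".github/configs/nvidia-master.yaml",
    "disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml",
]
print(missing_changelog(current_pr))
print(missing_changelog(current_pr + ["perf-changelog.yaml"]))
```

A file list that touches the master config without the changelog is flagged; adding perf-changelog.yaml to the list clears the flag.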

Impact

This is a documentation/observability concern, not a runtime bug — the recipes themselves will run fine. The missed entry only affects the historical perf-tracking trail for this config-key. Given that the previous bump (one day earlier, same author) did add the entry, this looks more like an oversight than an intentional skip.

Suggested fix

Add a perf-changelog.yaml entry under dsv4-fp4-gb300-dynamo-sglang along the lines of:

Update SGLang image from nightly-dev-cu13-20260519-dbac4647 to nightly-dev-cu13-20260520-425dffbd; remove SGLANG_OPT_FP8_WO_A_GEMM=0 workaround (fixed upstream in sgl-project/sglang#25805).
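
If these descriptions are generated rather than hand-written, a small formatter keeps the wording uniform across bumps. This is a sketch under the assumption that the description is a free-text string like the one quoted from #1492; `bump_description` is a hypothetical name, and the surrounding perf-changelog.yaml schema is not shown in this thread, so it is not reproduced here.

```python
def bump_description(old_tag: str, new_tag: str, extra: str = "") -> str:
    """Build an image-bump description in the style quoted from PR #1492.

    `extra` is an optional note for functional changes that ride along
    with the bump (hypothetical parameter, for illustration only).
    """
    text = f"Update SGLang image from {old_tag} to {new_tag}"
    return f"{text}; {extra}" if extra else text


print(
    bump_description(
        "nightly-dev-cu13-20260519-dbac4647",
        "nightly-dev-cu13-20260520-425dffbd",
        "remove SGLANG_OPT_FP8_WO_A_GEMM=0 workaround",
    )
)
```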

Comment on lines 118 to 121
  SGLANG_LOG_FORWARD_ITERS: "1"
  SGLANG_LOG_MS: "1"
  SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
- SGLANG_OPT_FP8_WO_A_GEMM: "0"
  # is single-node only and corrupts results in 2-node decode setups.
🟡 The PR removes the SGLANG_OPT_FP8_WO_A_GEMM: "0" line in each decode_environment block but leaves the trailing comment # is single-node only and corrupts results in 2-node decode setups. behind. The orphan now reads as a subject-less sentence fragment beginning with "is" — the line above it is unrelated (SGLANG_REQUEST_STATE_WAIT_TIMEOUT in 5 yamls, SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW in the tp4-tp4 yaml). Nit — the comment should be dropped along with the env var. Affects all 6 modified YAMLs (10p1d, 12p1d, 1p1d-dep, 1p1d-tp4-tp4, 4p1d, 8p1d).

Extended reasoning...

What the bug is

The PR removes SGLANG_OPT_FP8_WO_A_GEMM: "0" from both the prefill_environment and decode_environment blocks because the underlying issue was fixed upstream (sgl-project/sglang#25805). However, in the decode_environment block of every modified YAML the trailing comment immediately below the removed line — # is single-node only and corrupts results in 2-node decode setups. — is left behind.

Step-by-step proof (using disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml)

Pre-PR snippet (decode_environment, ending lines):

    SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
    SGLANG_OPT_FP8_WO_A_GEMM: "0"
    # is single-node only and corrupts results in 2-node decode setups.

Post-PR snippet (lines 117-119 of the new file):

    SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
    # is single-node only and corrupts results in 2-node decode setups.

  sglang_config:

The comment now dangles directly under SGLANG_REQUEST_STATE_WAIT_TIMEOUT, which is a request-state timeout that has nothing to do with single-node/2-node decode correctness. A reader encountering this hunk has no way to know what "is single-node only" refers to — the subject of the sentence has been deleted.

Why this is a nit, not a functional bug

YAML treats # as a line comment, so parsing still succeeds and runtime behavior is unaffected. The damage is purely to future readability: the next person bumping the image (or chasing a decode-correctness issue) will see an unattributed warning about 2-node decode corruption and have no way to act on it. It also occurs in the same hunks the PR is already touching, so the cleanup cost is essentially zero.

Addressing the refutation

One verifier dismissed this as duplicate-of-already-refuted and "long-standing leftover commentary style." That framing is not quite right:

  1. The comment is not pre-existing in its current orphaned form — before this PR it was the trailing description of a real SGLANG_OPT_FP8_WO_A_GEMM: "0" line directly above it. The PR is what creates the orphan by removing the subject.
  2. A second verifier noted that the comment originated as the second line of a 2-line comment block about SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 (still visible intact in sibling files like disagg-low-latency-1p1d-tp4-tp4-mtp.yaml). Either way, in these modified files it is now dangling.
  3. The PR is the natural moment to clean it up: the change is mechanically aligned with the diff hunks (just extend the - to include the comment line), no broader refactoring needed.

How to fix

In each of the 6 modified files, also remove the trailing # is single-node only and corrupts results in 2-node decode setups. line in decode_environment (lines ~120 in the 5 dep yamls, ~111 in disagg-gb300-1p1d-tp4-tp4-2-c1.yaml). No other change needed.
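
For illustration, the cleanup could be scripted as a plain text filter. This is a minimal sketch, assuming the orphan comment appears verbatim as quoted above; `strip_workaround` is a hypothetical helper, and it should only be run on the six modified files, since sibling files reportedly still carry the same line as part of an intact two-line comment.

```python
ORPHAN = "# is single-node only and corrupts results in 2-node decode setups."
ENV_KEY = "SGLANG_OPT_FP8_WO_A_GEMM:"


def strip_workaround(yaml_text: str) -> str:
    """Drop the workaround env var and the comment line left dangling below it.

    Works on raw text rather than a parsed YAML tree so that all other
    comments and formatting are preserved unchanged.
    """
    kept = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(ENV_KEY) or stripped == ORPHAN:
            continue  # skip the env var line and the orphan comment
        kept.append(line)
    trailing_newline = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(kept) + trailing_newline


before = (
    '    SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"\n'
    '    SGLANG_OPT_FP8_WO_A_GEMM: "0"\n'
    "    # is single-node only and corrupts results in 2-node decode setups.\n"
    "\n"
    "  sglang_config:\n"
)
print(strip_workaround(before))
```

Filtering the text directly, instead of loading and re-dumping the YAML, avoids reordering keys or discarding the other comments in each recipe.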

@yhyang201 changed the title from "[DONOTMERGE] Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520" to "Update dpskv4 GB300 non-MTP disagg SGLang image to nightly-20260520" on May 20, 2026
@Oseltamivir (Collaborator) commented

/reuse-sweep-run

@Oseltamivir Oseltamivir left a comment

lgtm

@Oseltamivir merged commit 59980fe into main on May 20, 2026
71 of 72 checks passed
@Oseltamivir deleted the update-sglang-nightly-0520 branch on May 20, 2026 18:56