Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519 by yhyang201 · Pull Request #1478 · SemiAnalysisAI/InferenceX

yhyang201 · 2026-05-18T03:09:52Z

Summary

Update SGLang container image from nightly-dev-cu13-20260510-2473659e to v0.5.12 for all MTP configs (mid-curve and low-latency)
Remove obsolete/redundant environment variables that no longer exist in sglang release/v0.5.12 or whose defaults already match the set values

github-actions · 2026-05-18T03:15:33Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26011543688
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26011543688

claude

Additional findings (outside current diff — PR may have been updated during review):

🟡 perf-changelog.yaml:2661 — The new perf-changelog.yaml entry sets pr-link to .../pull/1477, but this is PR #1478 (PR #1477 is an unrelated change). Likely a copy/paste typo — please update the link to https://github.com/SemiAnalysisAI/InferenceX/pull/1478 so the changelog correctly attributes this change.
Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml (lines 2655–2661 in the post-PR file) sets:
```
- config-keys:
    - dpskv4-gb300-sglang
  description:
    - "Update SGLang image to v0.5.12 and remove obsolete env vars for MTP configs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477
```
However, this PR is #1478, not #1477. PR #1477 is a separate, unrelated change. The pr-link should point to #1478 so that the changelog entry correctly identifies the PR that introduced the dpskv4-gb300-sglang image bump.

How it manifests / why existing code doesn't prevent it

perf-changelog.yaml is descriptive metadata used for provenance/audit — there is no schema validator that cross-references pr-link against the PR number in which the entry is added, so the typo passes CI silently. The convention is established by every prior entry though: each pr-link matches the PR that merged it (e.g., the immediately preceding entry on line ~2649 references /pull/1458, which matches commit 3542221 [Klaud Cold] Update qwen3.5-fp8-h200-sglang SGLang image to v0.5.12-cu130 (#1458)).

Step-by-step proof
1. PR metadata for this review unambiguously identifies the PR as <pr number='1478'>.
2. The diff for perf-changelog.yaml adds one new entry at the bottom whose only externally-resolvable identifier is pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477.
3. 1477 ≠ 1478, so a reader who clicks the link from the changelog will land on a different PR that does not describe this image bump.
4. Compare to the previous three entries in the same file:
  - line ~2643 → /pull/1427
  - line ~2649 → /pull/1459
  - line ~2655 → /pull/1458 (corresponds to recent commit 3542221)
    In every case, the pr-link matches the PR that introduced the entry. This entry breaks that convention.
Impact

Low — no runtime/CI impact, since process_changelog.py (and consumers of the YAML) don't validate pr-link. The cost is purely traceability: anyone investigating why dpskv4-gb300-sglang moved to v0.5.12 will follow the link to the wrong PR.

Fix

Change line 2661 from:
```
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477
```
to:
```
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1478
```
🔴 perf-changelog.yaml:2657-2661 — The new perf-changelog entry uses config-key dpskv4-gb300-sglang, but no such key exists in .github/configs/nvidia-master.yaml — the actual MTP config is dsv4-fp4-gb300-dynamo-sglang-mtp (line 8803), and every other dsv4 changelog entry uses the dsv4- prefix. Since utils/process_changelog.py:get_config_keys_from_master does an exact-match lookup for non-wildcard keys and raises ValueError('Config key {key} not found in master configs.'), this entry will hard-fail the changelog validation in the run-sweep workflow on push-to-main and won't drive any benchmark sweep. The key should be dsv4-fp4-gb300-dynamo-sglang-mtp (or wildcarded as dsv4-fp4-gb300-dynamo-sglang*).

Extended reasoning...

What the bug is. The perf-changelog entry added at perf-changelog.yaml:2657-2661 uses config-keys: [dpskv4-gb300-sglang]. The string dpskv4 (with a p) appears nowhere else in the repository — not in .github/configs/nvidia-master.yaml and not in any other changelog entry. The actual MTP config for the recipe files touched by this PR is dsv4-fp4-gb300-dynamo-sglang-mtp at line 8803 of .github/configs/nvidia-master.yaml (with the non-MTP sibling dsv4-fp4-gb300-dynamo-sglang at line 8703). Every prior dsv4 changelog entry (e.g. lines 2018, 2272-2273, 2280-2281) uses the dsv4-fp4-gb300-dynamo-sglang / -mtp keys consistently.\n\nWhy it breaks CI. utils/process_changelog.py:get_config_keys_from_master (lines 85-99) iterates the config-keys list. For a key without * it does an exact-match elif key not in master_config: raise ValueError(f"Config key {key} not found in master configs."). There is no fallback or fuzzy match. The .github/workflows/run-sweep.yml workflow runs process_changelog.py against perf-changelog.yaml on changes/push to main, so this entry will hard-fail validation rather than silently noop.\n\nImpact. Two compounding effects: (1) the changelog-driven sweep tooling errors out on this entry, blocking the workflow run that the entry was meant to trigger; (2) even if the failure were tolerated, the entry would not drive benchmarks for the MTP configs it is meant to cover, so the image bump from nightly-dev-cu13-20260510-2473659e → v0.5.12 (the substantive change in this PR) would go un-benchmarked.\n\nStep-by-step proof.\n1. process_changelog.py loads perf-changelog.yaml and calls get_config_keys_from_master with config_keys = ['dpskv4-gb300-sglang'] and master_config = the dict loaded from nivdia-master.yaml.\n2. The loop reaches key = 'dpskv4-gb300-sglang'. It contains no *, so the wildcard branch is skipped.\n3. 'dpskv4-gb300-sglang' not in master_config evaluates to True (master_config contains dsv4-fp4-gb300-dynamo-sglang and dsv4-fp4-gb300-dynamo-sglang-mtp, neither of which equals the queried key).\n4. raise ValueError("Config key dpskv4-gb300-sglang not found in master configs.") fires, aborting the workflow step.\n\nFix. Change the config-keys value to the real config name. Since this PR only touches MTP recipes, the precise key is dsv4-fp4-gb300-dynamo-sglang-mtp. If both MTP and non-MTP are intended to share this entry, use the wildcard dsv4-fp4-gb300-dynamo-sglang* (the same form used elsewhere in this file for matching both variants).

claude · 2026-05-18T03:16:06Z

 model:
  path: "deepseek-v4-pro"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"
+  container: "lmsysorg/sglang:v0.5.12"


🔴 The PR bumps model.container to lmsysorg/sglang:v0.5.12 in all six MTP recipe YAMLs, but the matching entry in .github/configs/nvidia-master.yaml (dsv4-fp4-gb300-dynamo-sglang-mtp at line 8803) still has image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. AGENTS.md is explicit that multi-node srt-slurm image bumps must edit recipe yaml AND nvidia-master.yaml together, and that model.container must equal image: (the launcher uses the latter as the container-alias key) — otherwise either the bump doesn't take effect or results get mislabeled. Please bump image: on line 8804 to lmsysorg/sglang:v0.5.12 as well.

Extended reasoning...

What the bug is

This PR bumps model.container from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:v0.5.12 in all six MTP recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. The corresponding entry in .github/configs/nvidia-master.yaml, however, was not updated. Line 8804 still reads:

dsv4-fp4-gb300-dynamo-sglang-mtp: image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034

That entry owns all six recipes touched by this PR: lines 8826, 8842, 8857, 8872, 8887, and 8902 contain CONFIG_FILE= references pointing at exactly the six modified recipe files (disagg-low-latency-1p1d-tp4-tp4-mtp.yaml, disagg-low-latency-1p6d-dep4-tp4-mtp.yaml, disagg-mid-curve-1p1d-dep4-dep8-mtp.yaml, disagg-mid-curve-1p1d-dep4-dep16-mtp.yaml, disagg-mid-curve-2p1d-dep4-dep8-mtp.yaml, disagg-mid-curve-4p1d-dep4-dep8-mtp.yaml).

Why this matters (explicit AGENTS.md rule)

AGENTS.md line 113 lays out the invariant directly:

Multi-node srt-slurm changes must edit the recipe yaml AND nvidia-master.yaml together. srtctl reads only the recipe (model.container, resources, prefill/decode workers); the sweep generator (utils/matrix_logic/generate_sweep_configs.py) reads nvidia-master.yaml for frontend labels — its prefill/decode numbers never reach srtctl. Recipe-only edits mislabel results, master-only edits don't take effect. For image bumps, model.container must equal image:, since the launcher uses the latter as the container-alias key.

This PR is a recipe-only edit for an image bump, which is precisely the case AGENTS.md warns against.

Step-by-step proof

Apply this PR. The six recipe YAMLs now declare container: lmsysorg/sglang:v0.5.12.

The CI sweep generator (generate_sweep_configs.py) is invoked with .github/configs/nvidia-master.yaml (per the docs and run-sweep.yml workflow).

It reads the dsv4-fp4-gb300-dynamo-sglang-mtp entry, which still has image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. That string is used as the container-alias key / label.

srtctl separately reads each recipe and launches lmsysorg/sglang:v0.5.12.

Result: the workload actually runs on v0.5.12, but is labeled/recorded under nightly-dev-cu13-20260509-9ee83034 — i.e. mislabeled benchmark results. Conversely, if the alias key is required to match for the launcher path, the bump simply does not take effect for those configs.

Either failure mode is the exact outcome AGENTS.md describes.

How to fix

Update .github/configs/nvidia-master.yaml line 8804 to:

dsv4-fp4-gb300-dynamo-sglang-mtp: image: lmsysorg/sglang:v0.5.12

so that image: matches the model.container set in every recipe this entry references. (Note: the master image was ...20260509-9ee83034 while the recipes were on ...20260510-2473659e — these had already drifted, so the fix is to align both on v0.5.12.)

github-actions · 2026-05-18T03:41:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26011782461
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26011782461

functionstackx · 2026-05-18T03:51:08Z

@claude rebase & resolve conflicts

Klaud-Cold · 2026-05-18T03:51:36Z

Claude finished @functionstackx's task in 1m 50s —— View job

Rebase & Resolve Conflicts

Fetch latest origin/main
Rebase dpskv4-gb300-sglang-v0.5.12 onto origin/main
Resolve merge conflict in perf-changelog.yaml
Force-push rebased branch

Rebased onto origin/main (4 new commits: #1446, #1415, #1428, #1477). There was one conflict in perf-changelog.yaml where both main and this branch appended entries at the end — resolved by keeping both (main's qwen3.5-bf16-b200-sglang entry, then this PR's dsv4-fp4-gb300-dynamo-sglang-mtp entry).

Branch is now 1 commit ahead, 0 behind main: 03c49caf.
| Branch

functionstackx · 2026-05-18T03:54:31Z

@claude add the image change to nvidia-master.yaml too container: "lmsysorg/sglang:v0.5.12

Klaud-Cold · 2026-05-18T03:54:57Z

Claude finished @functionstackx's task in 45s —— View job

Update `nvidia-master.yaml` Image

Read current nvidia-master.yaml to find the dsv4-fp4-gb300-dynamo-sglang-mtp entry
Update image: from nightly-dev-cu13-20260509-9ee83034 to lmsysorg/sglang:v0.5.12
Verified perf-changelog.yaml already has correct config-key and PR link (fixed during earlier rebase)
Commit and push

Updated .github/configs/nvidia-master.yaml line 8805 so image: matches the model.container in the recipe YAMLs (lmsysorg/sglang:v0.5.12). Pushed as 39575483.

View job | Branch

github-actions · 2026-05-18T03:55:19Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26012188045
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26012188045

github-actions · 2026-05-18T03:56:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26012578968
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26012578968

github-actions · 2026-05-18T04:44:01Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26012641192
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26012641192

github-actions · 2026-05-18T15:23:04Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26042905161
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26042905161

github-actions · 2026-05-18T15:26:59Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26043111886
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26043111886

github-actions · 2026-05-18T15:52:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26043927158
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26043927158

github-actions · 2026-05-18T16:03:58Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26044904959
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26044904959

…v vars

github-actions · 2026-05-19T14:49:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26100097029
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26100097029

yhyang201 requested a review from a team May 18, 2026 03:09

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

yhyang201 added a commit that referenced this pull request May 18, 2026

Fix perf-changelog PR link to #1478

bd7fa0b

yhyang201 added the full-sweep-enabled label May 18, 2026

claude Bot reviewed May 18, 2026

View reviewed changes

yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 0f3b945 to cc55442 Compare May 18, 2026 03:40

yhyang201 requested review from 1am9trash, billishyahao, chunfangamd, jgangani, kedarpotdar-nv, seungrokj and yctseng0211 as code owners May 18, 2026 03:40

Klaud-Cold force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from cc55442 to 03c49ca Compare May 18, 2026 03:53

yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch 2 times, most recently from fb1e9d4 to 4debe31 Compare May 18, 2026 15:22

yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 4debe31 to 67291ab Compare May 18, 2026 15:26

yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 67291ab to 0c7c630 Compare May 18, 2026 15:41

yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 0c7c630 to 8adc687 Compare May 18, 2026 15:59

yhyang201 changed the title ~~Update dpskv4 GB300 MTP SGLang image to v0.5.12~~ Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519 May 19, 2026

yhyang201 added 2 commits May 19, 2026 21:21

Update dpskv4 GB300 MTP SGLang image to nightly-20260518 and clean en…

c34271c

…v vars

Update dsv4 GB300 MTP SGLang image to nightly-20260519

77727b2

yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 8adc687 to 77727b2 Compare May 19, 2026 13:24

Conversation

yhyang201 commented May 18, 2026

Summary

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot May 18, 2026

Choose a reason for hiding this comment

What the bug is

Why this matters (explicit AGENTS.md rule)

Step-by-step proof

How to fix

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

functionstackx commented May 18, 2026

Uh oh!

Klaud-Cold commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rebase & Resolve Conflicts

Uh oh!

functionstackx commented May 18, 2026

Uh oh!

Klaud-Cold commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Update nvidia-master.yaml Image

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Klaud-Cold commented May 18, 2026 •

edited

Loading

Klaud-Cold commented May 18, 2026 •

edited

Loading

Update `nvidia-master.yaml` Image