Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519#1478
Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519#1478yhyang201 wants to merge 2 commits into
Conversation
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26011543688 |
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
perf-changelog.yaml:2661— The newperf-changelog.yamlentry setspr-linkto.../pull/1477, but this is PR #1478 (PR #1477 is an unrelated change). Likely a copy/paste typo — please update the link tohttps://github.com/SemiAnalysisAI/InferenceX/pull/1478so the changelog correctly attributes this change.Extended reasoning...
What the bug is
The new entry appended to
perf-changelog.yaml(lines 2655–2661 in the post-PR file) sets:- config-keys: - dpskv4-gb300-sglang description: - "Update SGLang image to v0.5.12 and remove obsolete env vars for MTP configs" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477
However, this PR is #1478, not #1477. PR #1477 is a separate, unrelated change. The
pr-linkshould point to #1478 so that the changelog entry correctly identifies the PR that introduced the dpskv4-gb300-sglang image bump.How it manifests / why existing code doesn't prevent it
perf-changelog.yamlis descriptive metadata used for provenance/audit — there is no schema validator that cross-referencespr-linkagainst the PR number in which the entry is added, so the typo passes CI silently. The convention is established by every prior entry though: eachpr-linkmatches the PR that merged it (e.g., the immediately preceding entry on line ~2649 references/pull/1458, which matches commit3542221 [Klaud Cold] Update qwen3.5-fp8-h200-sglang SGLang image to v0.5.12-cu130 (#1458)).Step-by-step proof
- PR metadata for this review unambiguously identifies the PR as
<pr number='1478'>. - The diff for
perf-changelog.yamladds one new entry at the bottom whose only externally-resolvable identifier ispr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477. - 1477 ≠ 1478, so a reader who clicks the link from the changelog will land on a different PR that does not describe this image bump.
- Compare to the previous three entries in the same file:
- line ~2643 →
/pull/1427 - line ~2649 →
/pull/1459 - line ~2655 →
/pull/1458(corresponds to recent commit 3542221)
In every case, thepr-linkmatches the PR that introduced the entry. This entry breaks that convention.
- line ~2643 →
Impact
Low — no runtime/CI impact, since
process_changelog.py(and consumers of the YAML) don't validatepr-link. The cost is purely traceability: anyone investigating why dpskv4-gb300-sglang moved to v0.5.12 will follow the link to the wrong PR.Fix
Change line 2661 from:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477to:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1478 - PR metadata for this review unambiguously identifies the PR as
-
🔴
perf-changelog.yaml:2657-2661— The new perf-changelog entry uses config-keydpskv4-gb300-sglang, but no such key exists in.github/configs/nvidia-master.yaml— the actual MTP config isdsv4-fp4-gb300-dynamo-sglang-mtp(line 8803), and every other dsv4 changelog entry uses thedsv4-prefix. Sinceutils/process_changelog.py:get_config_keys_from_masterdoes an exact-match lookup for non-wildcard keys and raisesValueError('Config key {key} not found in master configs.'), this entry will hard-fail the changelog validation in the run-sweep workflow on push-to-main and won't drive any benchmark sweep. The key should bedsv4-fp4-gb300-dynamo-sglang-mtp(or wildcarded asdsv4-fp4-gb300-dynamo-sglang*).Extended reasoning...
What the bug is. The perf-changelog entry added at
perf-changelog.yaml:2657-2661usesconfig-keys: [dpskv4-gb300-sglang]. The stringdpskv4(with ap) appears nowhere else in the repository — not in.github/configs/nvidia-master.yamland not in any other changelog entry. The actual MTP config for the recipe files touched by this PR isdsv4-fp4-gb300-dynamo-sglang-mtpat line 8803 of.github/configs/nvidia-master.yaml(with the non-MTP siblingdsv4-fp4-gb300-dynamo-sglangat line 8703). Every prior dsv4 changelog entry (e.g. lines 2018, 2272-2273, 2280-2281) uses thedsv4-fp4-gb300-dynamo-sglang/-mtpkeys consistently.\n\nWhy it breaks CI.utils/process_changelog.py:get_config_keys_from_master(lines 85-99) iterates theconfig-keyslist. For a key without*it does an exact-matchelif key not in master_config: raise ValueError(f"Config key {key} not found in master configs."). There is no fallback or fuzzy match. The.github/workflows/run-sweep.ymlworkflow runsprocess_changelog.pyagainstperf-changelog.yamlon changes/push to main, so this entry will hard-fail validation rather than silently noop.\n\nImpact. Two compounding effects: (1) the changelog-driven sweep tooling errors out on this entry, blocking the workflow run that the entry was meant to trigger; (2) even if the failure were tolerated, the entry would not drive benchmarks for the MTP configs it is meant to cover, so the image bump fromnightly-dev-cu13-20260510-2473659e→v0.5.12(the substantive change in this PR) would go un-benchmarked.\n\nStep-by-step proof.\n1.process_changelog.pyloadsperf-changelog.yamland callsget_config_keys_from_masterwithconfig_keys = ['dpskv4-gb300-sglang']andmaster_config= the dict loaded fromnivdia-master.yaml.\n2. The loop reacheskey = 'dpskv4-gb300-sglang'. It contains no*, so the wildcard branch is skipped.\n3.'dpskv4-gb300-sglang' not in master_configevaluates toTrue(master_config containsdsv4-fp4-gb300-dynamo-sglanganddsv4-fp4-gb300-dynamo-sglang-mtp, neither of which equals the queried key).\n4.raise ValueError("Config key dpskv4-gb300-sglang not found in master configs.")fires, aborting the workflow step.\n\nFix. Change theconfig-keysvalue to the real config name. Since this PR only touches MTP recipes, the precise key isdsv4-fp4-gb300-dynamo-sglang-mtp. If both MTP and non-MTP are intended to share this entry, use the wildcarddsv4-fp4-gb300-dynamo-sglang*(the same form used elsewhere in this file for matching both variants).
| model: | ||
| path: "deepseek-v4-pro" | ||
| container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e" | ||
| container: "lmsysorg/sglang:v0.5.12" |
There was a problem hiding this comment.
🔴 The PR bumps model.container to lmsysorg/sglang:v0.5.12 in all six MTP recipe YAMLs, but the matching entry in .github/configs/nvidia-master.yaml (dsv4-fp4-gb300-dynamo-sglang-mtp at line 8803) still has image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. AGENTS.md is explicit that multi-node srt-slurm image bumps must edit recipe yaml AND nvidia-master.yaml together, and that model.container must equal image: (the launcher uses the latter as the container-alias key) — otherwise either the bump doesn't take effect or results get mislabeled. Please bump image: on line 8804 to lmsysorg/sglang:v0.5.12 as well.
Extended reasoning...
What the bug is
This PR bumps model.container from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:v0.5.12 in all six MTP recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. The corresponding entry in .github/configs/nvidia-master.yaml, however, was not updated. Line 8804 still reads:
dsv4-fp4-gb300-dynamo-sglang-mtp:
image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034That entry owns all six recipes touched by this PR: lines 8826, 8842, 8857, 8872, 8887, and 8902 contain CONFIG_FILE= references pointing at exactly the six modified recipe files (disagg-low-latency-1p1d-tp4-tp4-mtp.yaml, disagg-low-latency-1p6d-dep4-tp4-mtp.yaml, disagg-mid-curve-1p1d-dep4-dep8-mtp.yaml, disagg-mid-curve-1p1d-dep4-dep16-mtp.yaml, disagg-mid-curve-2p1d-dep4-dep8-mtp.yaml, disagg-mid-curve-4p1d-dep4-dep8-mtp.yaml).
Why this matters (explicit AGENTS.md rule)
AGENTS.md line 113 lays out the invariant directly:
Multi-node srt-slurm changes must edit the recipe yaml AND
nvidia-master.yamltogether.srtctlreads only the recipe (model.container, resources, prefill/decode workers); the sweep generator (utils/matrix_logic/generate_sweep_configs.py) readsnvidia-master.yamlfor frontend labels — its prefill/decode numbers never reachsrtctl. Recipe-only edits mislabel results, master-only edits don't take effect. For image bumps,model.containermust equalimage:, since the launcher uses the latter as the container-alias key.
This PR is a recipe-only edit for an image bump, which is precisely the case AGENTS.md warns against.
Step-by-step proof
- Apply this PR. The six recipe YAMLs now declare
container: lmsysorg/sglang:v0.5.12. - The CI sweep generator (
generate_sweep_configs.py) is invoked with.github/configs/nvidia-master.yaml(per the docs and run-sweep.yml workflow). - It reads the
dsv4-fp4-gb300-dynamo-sglang-mtpentry, which still hasimage: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. That string is used as the container-alias key / label. srtctlseparately reads each recipe and launcheslmsysorg/sglang:v0.5.12.- Result: the workload actually runs on
v0.5.12, but is labeled/recorded undernightly-dev-cu13-20260509-9ee83034— i.e. mislabeled benchmark results. Conversely, if the alias key is required to match for the launcher path, the bump simply does not take effect for those configs.
Either failure mode is the exact outcome AGENTS.md describes.
How to fix
Update .github/configs/nvidia-master.yaml line 8804 to:
dsv4-fp4-gb300-dynamo-sglang-mtp:
image: lmsysorg/sglang:v0.5.12so that image: matches the model.container set in every recipe this entry references. (Note: the master image was ...20260509-9ee83034 while the recipes were on ...20260510-2473659e — these had already drifted, so the fix is to align both on v0.5.12.)
0f3b945 to
cc55442
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26011782461 |
|
@claude rebase & resolve conflicts |
|
Claude finished @functionstackx's task in 1m 50s —— View job Rebase & Resolve Conflicts
Rebased onto Branch is now 1 commit ahead, 0 behind main: |
cc55442 to
03c49ca
Compare
|
@claude add the image change to nvidia-master.yaml too container: "lmsysorg/sglang:v0.5.12 |
|
Claude finished @functionstackx's task in 45s —— View job Update
|
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26012188045 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26012578968 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26012641192 |
fb1e9d4 to
4debe31
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26042905161 |
4debe31 to
67291ab
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26043111886 |
67291ab to
0c7c630
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26043927158 |
0c7c630 to
8adc687
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26044904959 |
8adc687 to
77727b2
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26100097029 |
Summary
nightly-dev-cu13-20260510-2473659etov0.5.12for all MTP configs (mid-curve and low-latency)