
dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark #1176

Closed
yhyang201 wants to merge 20 commits into main from sglang-dsv4-MTP

Conversation

@yhyang201
Collaborator

Summary

  • Add ultra-high-concurrency (conc=2048) DP-attention recipe for dsv4 8k1k on B300: TP=8, deepep mega_moe backend, chunked-prefill 65536, mem-fraction-static 0.87, max-running-requests 2560
  • Make benchmark_lib.sh run_benchmark_serving() accept optional --request-rate parameter (defaults to inf, no impact on existing callers); conc=2048 recipe uses --request-rate 16
  • Add dsv4-fp4-b300-sglang-mtp benchmark config and script (EAGLE/MTP speculative decoding)
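The optional --request-rate plumbing described above can be sketched as follows (a minimal illustration only; the function name matches the bullet, but the body does not reproduce the actual benchmark_lib.sh parsing):

```shell
# Sketch of an optional --request-rate flag with an "inf" default, so callers
# that never pass the flag see unchanged behavior.
run_benchmark_serving() {
    local request_rate="inf"
    while [ $# -gt 0 ]; do
        case "$1" in
            --request-rate) request_rate="$2"; shift 2 ;;
            *) shift ;;
        esac
    done
    echo "benchmark_serving.py --request-rate ${request_rate}"
}

run_benchmark_serving                     # prints: ... --request-rate inf
run_benchmark_serving --request-rate 16   # prints: ... --request-rate 16
```

Defaulting inside the function (rather than at every call site) is what keeps the existing callers untouched.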

Test plan

  • CI runs dsv4-fp4-b300-sglang 8k1k conc=2048 successfully with the new recipe
  • Existing dsv4-fp4-b300-sglang configs (conc=1, 32, 512) are unaffected
  • All other benchmarks using run_benchmark_serving() still default to --request-rate inf

🤖 Generated with Claude Code

hnyls2002 and others added 15 commits April 25, 2026 20:50
The tokenizer for DSv4-Pro has no chat_template set, so
--use-chat-template causes benchmark_serving.py to crash with
ValueError. Remove it to align with dsv4_fp4_b300_sglang.sh.

Also add a floor of 8 to --max-running-requests to match the
base script and avoid too-low values at low concurrency.
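The floor-of-8 behavior can be spot-checked with bash arithmetic; the ternary expression below matches the one the script passes to --max-running-requests, and the CONC values are illustrative:

```shell
# --max-running-requests = 1.5x CONC, but never below 8.
CONC=2
echo $(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))   # 3 would be too low; prints 8
CONC=512
echo $(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))   # prints 768
```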
Rebase perf-changelog.yaml on latest main (preserving #1173 and #1174
entries) and append the MTP config entry for PR #1166.
Add an ultra-high-concurrency DP-attention recipe (TP=8, deepep
mega_moe, chunked-prefill 65536, request-rate 16) for the 8k1k
workload at conc=2048.

To support finite request-rate, make benchmark_lib.sh's
run_benchmark_serving() accept an optional --request-rate parameter
(defaults to inf so all existing callers are unaffected).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first so that we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

Remove 1k1k and other 8k1k search-space entries so CI only runs the
new conc=2048 recipe. Original configs noted in comments for restore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24957382969
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 97a7e7d
Approval: not required (trusted collaborator).

Contributor

@claude Bot left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 .github/configs/nvidia-master.yaml:1886-1899 — The new dsv4-fp4-b300-sglang-mtp config documents three CONC bands (A=TP8 1-8, B=TP4 16-128, C=TP4/EP4 dp-attn 64-512) in both the inline YAML comments and the matching perf-changelog.yaml entry, but the actual search-space: contains only band A — a single { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp } entry for both ISL=1024 and ISL=8192. As a side effect, the DP_ATTENTION=true branch in the new benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh (deepep + mega_moe + chunked-prefill 32768) is unreachable, since no entry sets dp-attn: true. Either add the missing band B/C entries or revise the comments/changelog/script to drop the documented-but-absent recipes.

    Extended reasoning...

    What the bug is

    The new MTP config block at .github/configs/nvidia-master.yaml lines 1886-1899 carries an explicit comment block describing a three-band sweep:

    # Three CONC bands sweep with EAGLE/MTP (3/1/4) on top:
    #   A: TP=8 ep=1            -- conc 1-8    (latency-bound, full TP)
    #   B: TP=4 ep=1            -- conc 16-128 (TP-only, mid batch)
    #   C: TP=4 ep=4 dp-attn    -- conc 64-512 (DP-attn + EP, large batch)
    # Overlap: B/C at conc 64,128 (TP-only vs DP-attn EP head-to-head).

    But the actual search-space for both seq-lens is just one entry:

    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }

    Bands B and C are entirely absent. The perf-changelog.yaml entry added in this same PR also asserts 'Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128' — so the documented intent and the configured behavior diverge in three places at once.

    Code path / proof

    1. CI sweeps the search-space Cartesian product per seq-len. With one entry per seq-len and conc-start: 1, conc-end: 8, the actual sweep is band A only — roughly 8 sweep points across 1k1k and 8k1k, not the documented ~30+.
    2. benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh reads DP_ATTENTION from the per-entry dp-attn key. The script branches:
      if [ "${DP_ATTENTION}" = "true" ]; then
          # Large-batch EP path: deepep + mega_moe.
          ...
          PARALLEL_ARGS=( --dp-size "$TP" --enable-dp-attention --moe-a2a-backend deepep ... )
          CHUNKED_PREFILL_SIZE=32768
      else
          # Small-batch TP-only path: flashinfer_mxfp4.
          ...
      fi
      No search-space entry has dp-attn: true, so DP_ATTENTION will never be "true" under this YAML. The entire deepep + mega_moe branch (including the SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1, SGLANG_OPT_FIX_HASH_MEGA_MOE=1, etc. env exports and CHUNKED_PREFILL_SIZE=32768) is dead code as merged.
    3. For comparison, the non-MTP sibling dsv4-fp4-b300-sglang at lines 1856-1869 demonstrates the expected multi-band pattern with three explicit entries plus the new tp:8, ep:8, dp-attn:true, conc:2048 entry, so the schema clearly supports what the comments describe.
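The band-A-only coverage from step 1 can be illustrated with a toy expansion of the search space (a sketch only: the real CI launcher is not part of this PR, and the power-of-two concurrency stepping is an assumption):

```python
# Toy expansion of a search-space entry into sweep points. Field names mirror
# the YAML above; power-of-two concurrency steps are an assumed convention.
def expand(entries, seq_lens):
    points = []
    for isl, osl in seq_lens:
        for e in entries:
            conc = e["conc-start"]
            while conc <= e["conc-end"]:
                points.append((isl, osl, e["tp"], e["ep"], conc))
                conc *= 2
    return points

band_a = [{"tp": 8, "ep": 1, "conc-start": 1, "conc-end": 8}]
points = expand(band_a, [(1024, 1024), (8192, 1024)])
print(len(points))  # 8: conc 1, 2, 4, 8 for each of 1k1k and 8k1k
```

With bands B and C added as in the suggested fix, the same expansion would yield the ~30+ points the comments describe.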

    Why existing code doesn't catch this

    There is no validator that cross-checks YAML comments or perf-changelog descriptions against the actual search-space shape. Both look syntactically valid, so the omission slips through.

    Impact

    • CI runs only band A (~8 sweep points) instead of the documented ~30+, materially under-covering the recipe being introduced.
    • The DP-attention recipe (deepep + mega_moe + chunked-prefill 32768), which has a non-trivial set of SGLANG_OPT_* env tunings tied to it, never executes — so any regression in that path won't be caught by these benchmarks.
    • The published changelog entry will be misleading to anyone reading historical perf data.

    How to fix

    Either:

    1. Add the missing entries so the YAML matches the comments and changelog, e.g.:
      search-space:
      - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
      - { tp: 4, ep: 1, conc-start: 16, conc-end: 128, spec-decoding: mtp }
      - { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 512, spec-decoding: mtp }
      (per-seq-len, mirroring the structure used by dsv4-fp4-b300-sglang), or
    2. Remove the three-band comment block, drop the corresponding bullet from perf-changelog.yaml, and delete the unreachable DP_ATTENTION=true branch from dsv4_fp4_b300_sglang_mtp.sh so the repo state is internally consistent.

Comment on lines +119 to +122
--max-running-requests "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))" \
--mem-fraction-static 0.90 \
--swa-full-tokens-ratio 0.1 \
"${SPEC_FLAGS[@]}" \
Contributor


🟡 The new MTP script hardcodes --swa-full-tokens-ratio 0.1 at line 121, while the parent dsv4_fp4_b300_sglang.sh uses an ISL-conditional that picks 0.5 for ISL=1024 with the explicit comment that '0.5 was tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default'. Since the MTP YAML exercises both 1024/1024 and 8192/1024, the 1k1k MTP run silently drops the empirical tuning — please either mirror the conditional or add a comment explaining why MTP intentionally diverges.

Extended reasoning...

The divergence

The parent benchmarks/single_node/dsv4_fp4_b300_sglang.sh (lines ~98-104 after this PR) sets SWA_FULL_TOKENS_RATIO based on ISL:

# 1k inputs need more SWA cache headroom on B300 than 8k inputs do; 0.5 was
# tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default.
if [[ "$ISL" == "1024" ]]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi

The new dsv4_fp4_b300_sglang_mtp.sh at line 121 instead hardcodes:

--swa-full-tokens-ratio 0.1 \

with no ISL branching.

Why this matters

The new MTP YAML config at .github/configs/nvidia-master.yaml (lines ~1885-1893) exercises both isl: 1024 and isl: 8192 sequence-length configs. So the 1k1k MTP run will use the cookbook default 0.1 instead of the empirically tuned 0.5 that the parent script's author specifically called out as needed for B300 cache headroom on 1k inputs.

Step-by-step proof

  1. CI launches dsv4-fp4-b300-sglang-mtp with the band-A entry { tp: 8, ep: 1, conc-start: 1, conc-end: 8 } against isl: 1024, osl: 1024.
  2. The launcher invokes dsv4_fp4_b300_sglang_mtp.sh with ISL=1024.
  3. At line 121, the script unconditionally passes --swa-full-tokens-ratio 0.1 to sglang serve.
  4. The parent script, given the same ISL=1024, would have passed 0.5 per its empirical tuning comment.
  5. Result: the 1k1k MTP sweep runs with the cookbook default the parent script's author explicitly flagged as suboptimal for B300 1k inputs.

Addressing the refutation

The refutation argues the MTP script is a deliberately distinct recipe, that EAGLE adds memory overhead favoring smaller SWA reservation, and that low concurrencies (band A only, per bug_002) make SWA pressure minimal. These are reasonable hypotheses, but they are hypotheses — the parent script's comment is an empirical claim the author already calibrated, and the MTP recipe inherits the rest of the parent's tuning surface (same model, same B300 hardware, same ISL=1024). If the divergence is intentional (e.g., EAGLE memory overhead changes the optimal SWA tradeoff), a one-line comment would document that and prevent future readers from assuming the omission was an oversight. The fact that the recipe author left no such comment, while the parent author did leave one specifically calling out the 1k vs 8k distinction, is exactly the signal that this deserves to be flagged.

Suggested fix

Either mirror the parent's ISL-conditional:

if [[ "$ISL" == "1024" ]]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi

and substitute --swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO" at line 121, or add a brief comment at line 121 explaining why MTP intentionally uses the cookbook default (e.g., 'EAGLE draft-model + verification overhead favors smaller SWA reservation, so we use the cookbook default instead of the parent's 1k1k empirical 0.5').

Comment thread: perf-changelog.yaml
Comment on lines +1879 to +1889
- config-keys:
- dsv4-fp4-b300-sglang-mtp
description:
- "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark with EAGLE/MTP speculative decoding"
- "Image: lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd211e300dbb76924d56c5cbe6cc3ee5ee2fe314859cb8774f5bc070f3 (pinned for deep_gemm transform_weights_for_mega_moe support; same digest as PR #1158)"
- "Model: deepseek-ai/DeepSeek-V4-Pro"
- "EAGLE/MTP flags hardcoded in script: num-steps=3, eagle-topk=1, num-draft-tokens=4"
- "Recipe (MoE backend, chunked-prefill) selected in script by dp-attn: TP-only + flashinfer_mxfp4 (small batch) vs DP-attn + deepep mega_moe (large batch)"
- "Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128"
- "Configs: 1k1k and 8k1k, no validation.py / launcher / yaml-field changes (knob-free)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The newly added dsv4-fp4-b300-sglang-mtp changelog entry sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166, but this is PR #1176; PR #1166 is an unrelated open PR. Likely a digit transposition — please change the link to /pull/1176 so the changelog correctly attributes the entry once merged.

Extended reasoning...

What the bug is

perf-changelog.yaml lines 1879-1889 add a new entry for the dsv4-fp4-b300-sglang-mtp config (the new MTP benchmark introduced by this PR). Its pr-link is set to:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166

But this PR is #1176 ("dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark"). PR #1166 is a different, unrelated PR. The link should be /pull/1176.

Why this is wrong

The clear convention in perf-changelog.yaml is that pr-link points to the PR that introduces the entry — every prior entry in the file follows this pattern (e.g., the entry above this one points to PR #1174, which was the merged PR introducing it; the entry being added here is the one this PR adds and so should point to #1176).

A grep of perf-changelog.yaml shows pull/1166 only appears at this new entry on line 1889, and pull/1176 does not appear anywhere in the file — so the typo is not duplicated and there is no other entry that should already be carrying #1176.
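The grep check can be reproduced on a tiny fixture (the real perf-changelog.yaml is not included here, so a two-line stand-in file is used):

```shell
# Stand-in for perf-changelog.yaml: one correct prior entry, one typo'd entry.
printf '%s\n' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1174' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166' \
  > /tmp/perf-changelog-sample.yaml

grep -n 'pull/1166' /tmp/perf-changelog-sample.yaml          # one hit: the new entry
grep -c 'pull/1176' /tmp/perf-changelog-sample.yaml || true  # prints 0: no prior use
```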

Step-by-step proof

  1. Open the PR diff for perf-changelog.yaml. The entire 11-line block at lines 1879-1889 is being newly added (no - lines, only +), and the new config-keys is dsv4-fp4-b300-sglang-mtp.
  2. Look up the config: dsv4-fp4-b300-sglang-mtp is also added by this same PR in .github/configs/nvidia-master.yaml (lines 1875+) and by the new script benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh. The description in the changelog ("Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512)", EAGLE/MTP flags 3/1/4, image digest sha256:26e116bd…f5bc070f3) matches exactly what this PR adds in those files. So the entry is unambiguously the one being introduced by this PR.
  3. The PR number is 1176 (per the PR metadata <pr number="1176">). PR #1166 is a different PR (per the verifier check: an open PR titled 'sglang dsv4 MTP', not yet merged). They share the digits "116", consistent with a transposition typo.
  4. Convention check: every other recent entry in this file uses pr-link matching the PR that adds it (e.g., the entry directly above this one at line 1877 is /pull/1174, matching the prior commit fc93e84; the commit message of fc93e84 itself even says "append the MTP config entry for PR #1166" — confirming 1166 was a placeholder/wrong number rather than an intentional cross-reference).

Impact

Documentation/metadata only — no runtime behavior is affected. However, once merged, the changelog will permanently misattribute the introduction of dsv4-fp4-b300-sglang-mtp to PR #1166 (an unrelated PR by a different author), so anyone clicking through to find the introducing PR will land on the wrong page. Easy to fix before merge.

How to fix

Change line 1889 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1176

yhyang201 and others added 2 commits April 26, 2026 21:12