dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark #1176
Conversation
The tokenizer for DSv4-Pro has no chat_template set, so --use-chat-template causes benchmark_serving.py to crash with ValueError. Remove it to align with dsv4_fp4_b300_sglang.sh. Also add a floor of 8 to --max-running-requests to match the base script and avoid too-low values at low concurrency.
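The floor described above can be sketched with bash integer arithmetic (a standalone illustration; the `CONC` values are examples, not taken from the script):

```shell
#!/usr/bin/env bash
# max-running-requests = max(CONC * 3 / 2, 8): the arithmetic ternary keeps a
# floor of 8 so low-concurrency sweeps never pass a too-small limit to the server.
CONC=2
echo "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))"   # prints 8 (floored)
CONC=2048
echo "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))"   # prints 3072
```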
Add an ultra-high-concurrency DP-attention recipe (TP=8, deepep mega_moe, chunked-prefill 65536, request-rate 16) for the 8k1k workload at conc=2048. To support finite request-rate, make benchmark_lib.sh's run_benchmark_serving() accept an optional --request-rate parameter (defaults to inf so all existing callers are unaffected). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
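One way the optional parameter could be wired up is sketched below (hypothetical, not the repo's actual `benchmark_lib.sh`; the point is that the `inf` default leaves existing callers' behavior unchanged):

```shell
#!/usr/bin/env bash
# Sketch: parse an optional --request-rate flag, defaulting to "inf" so the
# benchmark stays unthrottled unless a caller opts in (e.g. the conc=2048 recipe).
run_benchmark_serving() {
  local request_rate="inf"
  local extra_args=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --request-rate) request_rate="$2"; shift 2 ;;
      *) extra_args+=("$1"); shift ;;
    esac
  done
  # Real code would invoke benchmark_serving.py here; we echo for illustration.
  echo "request-rate=${request_rate} args=${extra_args[*]}"
}

run_benchmark_serving --model demo                    # request-rate=inf args=--model demo
run_benchmark_serving --request-rate 16 --model demo  # request-rate=16 args=--model demo
```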
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
1 similar comment
Remove 1k1k and other 8k1k search-space entries so CI only runs the new conc=2048 recipe. Original configs noted in comments for restore. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24957382969
Additional findings (outside current diff — PR may have been updated during review):
🔴
`.github/configs/nvidia-master.yaml:1886-1899` — The new `dsv4-fp4-b300-sglang-mtp` config (lines 1886-1899 of `.github/configs/nvidia-master.yaml`) documents three CONC bands (A=TP8 1-8, B=TP4 16-128, C=TP4/EP4 dp-attn 64-512) in both the inline YAML comments and the matching `perf-changelog.yaml` entry, but the actual `search-space:` only contains band A — a single `{ tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }` entry for both ISL=1024 and ISL=8192. As a side effect, the `DP_ATTENTION=true` branch in the new `benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh` (deepep + mega_moe + chunked-prefill 32768) is unreachable, since no entry sets `dp-attn: true`. Either add the missing band B/C entries or revise the comments/changelog/script to drop the documented-but-absent recipes.

Extended reasoning...
What the bug is
The new MTP config block at `.github/configs/nvidia-master.yaml` lines 1886-1899 carries an explicit comment block describing a three-band sweep:

```yaml
# Three CONC bands sweep with EAGLE/MTP (3/1/4) on top:
#   A: TP=8 ep=1         -- conc 1-8    (latency-bound, full TP)
#   B: TP=4 ep=1         -- conc 16-128 (TP-only, mid batch)
#   C: TP=4 ep=4 dp-attn -- conc 64-512 (DP-attn + EP, large batch)
# Overlap: B/C at conc 64,128 (TP-only vs DP-attn EP head-to-head).
```
But the actual `search-space` for both seq-lens is just one entry:

```yaml
- isl: 1024
  osl: 1024
  search-space:
    - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
- isl: 8192
  osl: 1024
  search-space:
    - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
```
Bands B and C are entirely absent. The `perf-changelog.yaml` entry added in this same PR also asserts 'Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128' — so the documented intent and the configured behavior diverge in three places at once.

Code path / proof
- CI sweeps the `search-space` Cartesian product per seq-len. With one entry per seq-len and `conc-start: 1, conc-end: 8`, the actual sweep is band A only — roughly 8 sweep points across 1k1k and 8k1k, not the documented ~30+.
- `benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh` reads `DP_ATTENTION` from the per-entry `dp-attn` key. The script branches:

```bash
if [ "${DP_ATTENTION}" = "true" ]; then
  # Large-batch EP path: deepep + mega_moe.
  ...
  PARALLEL_ARGS=( --dp-size "$TP" --enable-dp-attention --moe-a2a-backend deepep ... )
  CHUNKED_PREFILL_SIZE=32768
else
  # Small-batch TP-only path: flashinfer_mxfp4.
  ...
fi
```

  No `search-space` entry has `dp-attn: true`, so `DP_ATTENTION` will never be "true" under this YAML. The entire deepep + mega_moe branch (including the `SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1`, `SGLANG_OPT_FIX_HASH_MEGA_MOE=1`, etc. env exports and `CHUNKED_PREFILL_SIZE=32768`) is dead code as merged.
- For comparison, the non-MTP sibling `dsv4-fp4-b300-sglang` at lines 1856-1869 demonstrates the expected multi-band pattern with three explicit entries plus the new `tp:8, ep:8, dp-attn:true, conc:2048` entry, so the schema clearly supports what the comments describe.
Why existing code doesn't catch this
There is no validator that cross-checks YAML comments or perf-changelog descriptions against the actual `search-space` shape. Both look syntactically valid, so the omission slips through.

Impact
- CI runs only band A (~8 sweep points) instead of the documented ~30+, materially under-covering the recipe being introduced.
- The DP-attention recipe (deepep + mega_moe + chunked-prefill 32768), which has a non-trivial set of `SGLANG_OPT_*` env tunings tied to it, never executes — so any regression in that path won't be caught by these benchmarks.
- The published changelog entry will be misleading to anyone reading historical perf data.
How to fix
Either:
- Add the missing entries so the YAML matches the comments and changelog (per-seq-len, mirroring the structure used by `dsv4-fp4-b300-sglang`), e.g.:

```yaml
search-space:
  - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
  - { tp: 4, ep: 1, conc-start: 16, conc-end: 128, spec-decoding: mtp }
  - { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 512, spec-decoding: mtp }
```

- Or remove the three-band comment block, drop the corresponding bullet from `perf-changelog.yaml`, and delete the unreachable `DP_ATTENTION=true` branch from `dsv4_fp4_b300_sglang_mtp.sh` so the repo state is internally consistent.
```bash
--max-running-requests "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))" \
--mem-fraction-static 0.90 \
--swa-full-tokens-ratio 0.1 \
"${SPEC_FLAGS[@]}" \
```
🟡 The new MTP script hardcodes --swa-full-tokens-ratio 0.1 at line 121, while the parent dsv4_fp4_b300_sglang.sh uses an ISL-conditional that picks 0.5 for ISL=1024 with the explicit comment that '0.5 was tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default'. Since the MTP YAML exercises both 1024/1024 and 8192/1024, the 1k1k MTP run silently drops the empirical tuning — please either mirror the conditional or add a comment explaining why MTP intentionally diverges.
Extended reasoning...
The divergence
The parent benchmarks/single_node/dsv4_fp4_b300_sglang.sh (lines ~98-104 after this PR) sets SWA_FULL_TOKENS_RATIO based on ISL:
```bash
# 1k inputs need more SWA cache headroom on B300 than 8k inputs do; 0.5 was
# tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default.
if [[ "$ISL" == "1024" ]]; then
  SWA_FULL_TOKENS_RATIO=0.5
else
  SWA_FULL_TOKENS_RATIO=0.1
fi
```

The new dsv4_fp4_b300_sglang_mtp.sh at line 121 instead hardcodes:

```bash
--swa-full-tokens-ratio 0.1 \
```

with no ISL branching.
Why this matters
The new MTP YAML config at .github/configs/nvidia-master.yaml (lines ~1885-1893) exercises both isl: 1024 and isl: 8192 sequence-length configs. So the 1k1k MTP run will use the cookbook default 0.1 instead of the empirically tuned 0.5 that the parent script's author specifically called out as needed for B300 cache headroom on 1k inputs.
Step-by-step proof
- CI launches `dsv4-fp4-b300-sglang-mtp` with the band-A entry `{ tp: 8, ep: 1, conc-start: 1, conc-end: 8 }` against `isl: 1024, osl: 1024`.
- The launcher invokes `dsv4_fp4_b300_sglang_mtp.sh` with `ISL=1024`.
- At line 121, the script unconditionally passes `--swa-full-tokens-ratio 0.1` to `sglang serve`.
- The parent script, given the same `ISL=1024`, would have passed `0.5` per its empirical tuning comment.
- Result: the 1k1k MTP sweep runs with the cookbook default the parent script's author explicitly flagged as suboptimal for B300 1k inputs.
Addressing the refutation
The refutation argues the MTP script is a deliberately distinct recipe, that EAGLE adds memory overhead favoring smaller SWA reservation, and that low concurrencies (band A only, per bug_002) make SWA pressure minimal. These are reasonable hypotheses, but they are hypotheses — the parent script's comment is an empirical claim the author already calibrated, and the MTP recipe inherits the rest of the parent's tuning surface (same model, same B300 hardware, same ISL=1024). If the divergence is intentional (e.g., EAGLE memory overhead changes the optimal SWA tradeoff), a one-line comment would document that and prevent future readers from assuming the omission was an oversight. The fact that the recipe author left no such comment, while the parent author did leave one specifically calling out the 1k vs 8k distinction, is exactly the signal that this deserves to be flagged.
Suggested fix
Either mirror the parent's ISL-conditional:
```bash
if [[ "$ISL" == "1024" ]]; then
  SWA_FULL_TOKENS_RATIO=0.5
else
  SWA_FULL_TOKENS_RATIO=0.1
fi
```

and substitute `--swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO"` at line 121, or add a brief comment at line 121 explaining why MTP intentionally uses the cookbook default (e.g., 'EAGLE draft-model + verification overhead favors smaller SWA reservation, so we use the cookbook default instead of the parent's 1k1k empirical 0.5').
```yaml
- config-keys:
    - dsv4-fp4-b300-sglang-mtp
  description:
    - "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark with EAGLE/MTP speculative decoding"
    - "Image: lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd211e300dbb76924d56c5cbe6cc3ee5ee2fe314859cb8774f5bc070f3 (pinned for deep_gemm transform_weights_for_mega_moe support; same digest as PR #1158)"
    - "Model: deepseek-ai/DeepSeek-V4-Pro"
    - "EAGLE/MTP flags hardcoded in script: num-steps=3, eagle-topk=1, num-draft-tokens=4"
    - "Recipe (MoE backend, chunked-prefill) selected in script by dp-attn: TP-only + flashinfer_mxfp4 (small batch) vs DP-attn + deepep mega_moe (large batch)"
    - "Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128"
    - "Configs: 1k1k and 8k1k, no validation.py / launcher / yaml-field changes (knob-free)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
```
🟡 The newly added dsv4-fp4-b300-sglang-mtp changelog entry sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166, but this is PR #1176; PR #1166 is an unrelated open PR. Likely a digit transposition — please change the link to /pull/1176 so the changelog correctly attributes the entry once merged.
Extended reasoning...
What the bug is
perf-changelog.yaml lines 1879-1889 add a new entry for the dsv4-fp4-b300-sglang-mtp config (the new MTP benchmark introduced by this PR). Its pr-link is set to:
```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
```

But this PR is #1176 ("dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark"). PR #1166 is a different, unrelated PR. The link should be `/pull/1176`.
Why this is wrong
The clear convention in perf-changelog.yaml is that pr-link points to the PR that introduces the entry — every prior entry in the file follows this pattern (e.g., the entry above this one points to PR #1174, which was the merged PR introducing it; the entry being added here is the one this PR adds and so should point to #1176).
A grep of perf-changelog.yaml shows pull/1166 only appears at this new entry on line 1889, and pull/1176 does not appear anywhere in the file — so the typo is not duplicated and there is no other entry that should already be carrying #1176.
Step-by-step proof
- Open the PR diff for `perf-changelog.yaml`. The entire 11-line block at lines 1879-1889 is being newly added (no `-` lines, only `+`), and the new `config-keys` is `dsv4-fp4-b300-sglang-mtp`.
- Look up the config: `dsv4-fp4-b300-sglang-mtp` is also added by this same PR in `.github/configs/nvidia-master.yaml` (lines 1875+) and by the new script `benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh`. The description in the changelog ("Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512)", EAGLE/MTP flags 3/1/4, image digest `sha256:26e116bd…f5bc070f3`) matches exactly what this PR adds in those files. So the entry is unambiguously the one being introduced by this PR.
- The PR number is 1176 (per the PR metadata `<pr number="1176">`). PR #1166 ('sglang dsv4 MTP') is a different PR (per the verifier check: an open PR, not yet merged). They share the digits "116", consistent with a transposition typo.
- Convention check: every other recent entry in this file uses `pr-link` matching the PR that adds it (e.g., the entry directly above this one at line 1877 is `/pull/1174`, matching the prior commit `fc93e84`; the commit message of `fc93e84` itself even says "append the MTP config entry for PR #1166" — confirming 1166 was a placeholder/wrong number rather than an intentional cross-reference).
Impact
Documentation/metadata only — no runtime behavior is affected. However, once merged, the changelog will permanently misattribute the introduction of dsv4-fp4-b300-sglang-mtp to PR #1166 (an unrelated PR by a different author), so anyone clicking through to find the introducing PR will land on the wrong page. Easy to fix before merge.
How to fix
Change line 1889 from:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
```

to:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1176
```

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…(4,1,5)" This reverts commit 14369b1.
Summary
- `benchmark_lib.sh`: `run_benchmark_serving()` accepts an optional `--request-rate` parameter (defaults to `inf`, no impact on existing callers); the conc=2048 recipe uses `--request-rate 16`

Test plan

- Existing callers of `run_benchmark_serving()` still default to `--request-rate inf`

🤖 Generated with Claude Code