
dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark #1176

Closed
yhyang201 wants to merge 20 commits into main from sglang-dsv4-MTP

Conversation

@yhyang201
Collaborator

Summary

  • Add ultra-high-concurrency (conc=2048) DP-attention recipe for dsv4 8k1k on B300: TP=8, deepep mega_moe backend, chunked-prefill 65536, mem-fraction-static 0.87, max-running-requests 2560
  • Make benchmark_lib.sh run_benchmark_serving() accept optional --request-rate parameter (defaults to inf, no impact on existing callers); conc=2048 recipe uses --request-rate 16
  • Add dsv4-fp4-b300-sglang-mtp benchmark config and script (EAGLE/MTP speculative decoding)
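The optional --request-rate plumbing described above can be sketched as follows (a minimal illustration only; the function name matches the bullet, but the body does not reproduce the actual benchmark_lib.sh parsing):

```shell
# Sketch of an optional --request-rate flag with an "inf" default, so callers
# that never pass the flag see unchanged behavior.
run_benchmark_serving() {
    local request_rate="inf"
    while [ $# -gt 0 ]; do
        case "$1" in
            --request-rate) request_rate="$2"; shift 2 ;;
            *) shift ;;
        esac
    done
    echo "benchmark_serving.py --request-rate ${request_rate}"
}

run_benchmark_serving                     # prints: ... --request-rate inf
run_benchmark_serving --request-rate 16   # prints: ... --request-rate 16
```

Defaulting inside the function (rather than at every call site) is what keeps the existing callers untouched.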

Test plan

  • CI runs dsv4-fp4-b300-sglang 8k1k conc=2048 successfully with the new recipe
  • Existing dsv4-fp4-b300-sglang configs (conc=1, 32, 512) are unaffected
  • All other benchmarks using run_benchmark_serving() still default to --request-rate inf

🤖 Generated with Claude Code

hnyls2002 and others added 15 commits April 25, 2026 20:50
The tokenizer for DSv4-Pro has no chat_template set, so
--use-chat-template causes benchmark_serving.py to crash with
ValueError. Remove it to align with dsv4_fp4_b300_sglang.sh.

Also add a floor of 8 to --max-running-requests to match the
base script and avoid too-low values at low concurrency.
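The floor-of-8 behavior can be spot-checked with bash arithmetic; the ternary expression below matches the one the script passes to --max-running-requests, and the CONC values are illustrative:

```shell
# --max-running-requests = 1.5x CONC, but never below 8.
CONC=2
echo $(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))   # 3 would be too low; prints 8
CONC=512
echo $(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))   # prints 768
```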
Rebase perf-changelog.yaml on latest main (preserving #1173 and #1174
entries) and append the MTP config entry for PR #1166.
Add an ultra-high-concurrency DP-attention recipe (TP=8, deepep
mega_moe, chunked-prefill 65536, request-rate 16) for the 8k1k
workload at conc=2048.

To support finite request-rate, make benchmark_lib.sh's
run_benchmark_serving() accept an optional --request-rate parameter
(defaults to inf so all existing callers are unaffected).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first so that we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

Remove 1k1k and other 8k1k search-space entries so CI only runs the
new conc=2048 recipe. Original configs noted in comments for restore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24957382969
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 97a7e7d
Approval: not required (trusted collaborator).

Contributor

@claude Bot left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 .github/configs/nvidia-master.yaml:1886-1899 — The new dsv4-fp4-b300-sglang-mtp config documents three CONC bands (A=TP8 1-8, B=TP4 16-128, C=TP4/EP4 dp-attn 64-512) in both the inline YAML comments and the matching perf-changelog.yaml entry, but the actual search-space: contains only band A — a single { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp } entry for both ISL=1024 and ISL=8192. As a side effect, the DP_ATTENTION=true branch in the new benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh (deepep + mega_moe + chunked-prefill 32768) is unreachable, since no entry sets dp-attn: true. Either add the missing band B/C entries or revise the comments/changelog/script to drop the documented-but-absent recipes.

    Extended reasoning...

    What the bug is

    The new MTP config block at .github/configs/nvidia-master.yaml lines 1886-1899 carries an explicit comment block describing a three-band sweep:

    # Three CONC bands sweep with EAGLE/MTP (3/1/4) on top:
    #   A: TP=8 ep=1            -- conc 1-8    (latency-bound, full TP)
    #   B: TP=4 ep=1            -- conc 16-128 (TP-only, mid batch)
    #   C: TP=4 ep=4 dp-attn    -- conc 64-512 (DP-attn + EP, large batch)
    # Overlap: B/C at conc 64,128 (TP-only vs DP-attn EP head-to-head).

    But the actual search-space for both seq-lens is just one entry:

    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }

    Bands B and C are entirely absent. The perf-changelog.yaml entry added in this same PR also asserts 'Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128' — so the documented intent and the configured behavior diverge in three places at once.

    Code path / proof

    1. CI sweeps the search-space Cartesian product per seq-len. With one entry per seq-len and conc-start: 1, conc-end: 8, the actual sweep is band A only — roughly 8 sweep points across 1k1k and 8k1k, not the documented ~30+.
    2. benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh reads DP_ATTENTION from the per-entry dp-attn key. The script branches:
      if [ "${DP_ATTENTION}" = "true" ]; then
          # Large-batch EP path: deepep + mega_moe.
          ...
          PARALLEL_ARGS=( --dp-size "$TP" --enable-dp-attention --moe-a2a-backend deepep ... )
          CHUNKED_PREFILL_SIZE=32768
      else
          # Small-batch TP-only path: flashinfer_mxfp4.
          ...
      fi
      No search-space entry has dp-attn: true, so DP_ATTENTION will never be "true" under this YAML. The entire deepep + mega_moe branch (including the SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1, SGLANG_OPT_FIX_HASH_MEGA_MOE=1, etc. env exports and CHUNKED_PREFILL_SIZE=32768) is dead code as merged.
    3. For comparison, the non-MTP sibling dsv4-fp4-b300-sglang at lines 1856-1869 demonstrates the expected multi-band pattern with three explicit entries plus the new tp:8, ep:8, dp-attn:true, conc:2048 entry, so the schema clearly supports what the comments describe.
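The band-A-only coverage from step 1 can be illustrated with a toy expansion of the search space (a sketch only: the real CI launcher is not part of this PR, and the power-of-two concurrency stepping is an assumption):

```python
# Toy expansion of a search-space entry into sweep points. Field names mirror
# the YAML above; power-of-two concurrency steps are an assumed convention.
def expand(entries, seq_lens):
    points = []
    for isl, osl in seq_lens:
        for e in entries:
            conc = e["conc-start"]
            while conc <= e["conc-end"]:
                points.append((isl, osl, e["tp"], e["ep"], conc))
                conc *= 2
    return points

band_a = [{"tp": 8, "ep": 1, "conc-start": 1, "conc-end": 8}]
points = expand(band_a, [(1024, 1024), (8192, 1024)])
print(len(points))  # 8: conc 1, 2, 4, 8 for each of 1k1k and 8k1k
```

With bands B and C added as in the suggested fix, the same expansion would yield the ~30+ points the comments describe.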

    Why existing code doesn't catch this

    There is no validator that cross-checks YAML comments or perf-changelog descriptions against the actual search-space shape. Both look syntactically valid, so the omission slips through.

    Impact

    • CI runs only band A (~8 sweep points) instead of the documented ~30+, materially under-covering the recipe being introduced.
    • The DP-attention recipe (deepep + mega_moe + chunked-prefill 32768), which has a non-trivial set of SGLANG_OPT_* env tunings tied to it, never executes — so any regression in that path won't be caught by these benchmarks.
    • The published changelog entry will be misleading to anyone reading historical perf data.

    How to fix

    Either:

    1. Add the missing entries so the YAML matches the comments and changelog, e.g.:
      search-space:
      - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
      - { tp: 4, ep: 1, conc-start: 16, conc-end: 128, spec-decoding: mtp }
      - { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 512, spec-decoding: mtp }
      (per-seq-len, mirroring the structure used by dsv4-fp4-b300-sglang), or
    2. Remove the three-band comment block, drop the corresponding bullet from perf-changelog.yaml, and delete the unreachable DP_ATTENTION=true branch from dsv4_fp4_b300_sglang_mtp.sh so the repo state is internally consistent.

Comment on lines +119 to +122
--max-running-requests "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))" \
--mem-fraction-static 0.90 \
--swa-full-tokens-ratio 0.1 \
"${SPEC_FLAGS[@]}" \
Contributor


🟡 The new MTP script hardcodes --swa-full-tokens-ratio 0.1 at line 121, while the parent dsv4_fp4_b300_sglang.sh uses an ISL-conditional that picks 0.5 for ISL=1024 with the explicit comment that '0.5 was tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default'. Since the MTP YAML exercises both 1024/1024 and 8192/1024, the 1k1k MTP run silently drops the empirical tuning — please either mirror the conditional or add a comment explaining why MTP intentionally diverges.

Extended reasoning...

The divergence

The parent benchmarks/single_node/dsv4_fp4_b300_sglang.sh (lines ~98-104 after this PR) sets SWA_FULL_TOKENS_RATIO based on ISL:

# 1k inputs need more SWA cache headroom on B300 than 8k inputs do; 0.5 was
# tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default.
if [[ "$ISL" == "1024" ]]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi

The new dsv4_fp4_b300_sglang_mtp.sh at line 121 instead hardcodes:

--swa-full-tokens-ratio 0.1 \

with no ISL branching.

Why this matters

The new MTP YAML config at .github/configs/nvidia-master.yaml (lines ~1885-1893) exercises both isl: 1024 and isl: 8192 sequence-length configs. So the 1k1k MTP run will use the cookbook default 0.1 instead of the empirically tuned 0.5 that the parent script's author specifically called out as needed for B300 cache headroom on 1k inputs.

Step-by-step proof

  1. CI launches dsv4-fp4-b300-sglang-mtp with the band-A entry { tp: 8, ep: 1, conc-start: 1, conc-end: 8 } against isl: 1024, osl: 1024.
  2. The launcher invokes dsv4_fp4_b300_sglang_mtp.sh with ISL=1024.
  3. At line 121, the script unconditionally passes --swa-full-tokens-ratio 0.1 to sglang serve.
  4. The parent script, given the same ISL=1024, would have passed 0.5 per its empirical tuning comment.
  5. Result: the 1k1k MTP sweep runs with the cookbook default the parent script's author explicitly flagged as suboptimal for B300 1k inputs.

Addressing the refutation

The refutation argues the MTP script is a deliberately distinct recipe, that EAGLE adds memory overhead favoring smaller SWA reservation, and that low concurrencies (band A only, per bug_002) make SWA pressure minimal. These are reasonable hypotheses, but they are hypotheses — the parent script's comment is an empirical claim the author already calibrated, and the MTP recipe inherits the rest of the parent's tuning surface (same model, same B300 hardware, same ISL=1024). If the divergence is intentional (e.g., EAGLE memory overhead changes the optimal SWA tradeoff), a one-line comment would document that and prevent future readers from assuming the omission was an oversight. The fact that the recipe author left no such comment, while the parent author did leave one specifically calling out the 1k vs 8k distinction, is exactly the signal that this deserves to be flagged.

Suggested fix

Either mirror the parent's ISL-conditional:

if [[ "$ISL" == "1024" ]]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi

and substitute --swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO" at line 121, or add a brief comment at line 121 explaining why MTP intentionally uses the cookbook default (e.g., 'EAGLE draft-model + verification overhead favors smaller SWA reservation, so we use the cookbook default instead of the parent's 1k1k empirical 0.5').

Comment thread: perf-changelog.yaml
Comment on lines +1879 to +1889
- config-keys:
- dsv4-fp4-b300-sglang-mtp
description:
- "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark with EAGLE/MTP speculative decoding"
- "Image: lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd211e300dbb76924d56c5cbe6cc3ee5ee2fe314859cb8774f5bc070f3 (pinned for deep_gemm transform_weights_for_mega_moe support; same digest as PR #1158)"
- "Model: deepseek-ai/DeepSeek-V4-Pro"
- "EAGLE/MTP flags hardcoded in script: num-steps=3, eagle-topk=1, num-draft-tokens=4"
- "Recipe (MoE backend, chunked-prefill) selected in script by dp-attn: TP-only + flashinfer_mxfp4 (small batch) vs DP-attn + deepep mega_moe (large batch)"
- "Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128"
- "Configs: 1k1k and 8k1k, no validation.py / launcher / yaml-field changes (knob-free)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The newly added dsv4-fp4-b300-sglang-mtp changelog entry sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166, but this is PR #1176; PR #1166 is an unrelated open PR. Likely a digit transposition — please change the link to /pull/1176 so the changelog correctly attributes the entry once merged.

Extended reasoning...

What the bug is

perf-changelog.yaml lines 1879-1889 add a new entry for the dsv4-fp4-b300-sglang-mtp config (the new MTP benchmark introduced by this PR). Its pr-link is set to:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166

But this PR is #1176 ("dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark"). PR #1166 is a different, unrelated PR. The link should be /pull/1176.

Why this is wrong

The clear convention in perf-changelog.yaml is that pr-link points to the PR that introduces the entry — every prior entry in the file follows this pattern (e.g., the entry above this one points to PR #1174, which was the merged PR introducing it; the entry being added here is the one this PR adds and so should point to #1176).

A grep of perf-changelog.yaml shows pull/1166 only appears at this new entry on line 1889, and pull/1176 does not appear anywhere in the file — so the typo is not duplicated and there is no other entry that should already be carrying #1176.
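The grep check can be reproduced on a tiny fixture (the real perf-changelog.yaml is not included here, so a two-line stand-in file is used):

```shell
# Stand-in for perf-changelog.yaml: one correct prior entry, one typo'd entry.
printf '%s\n' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1174' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166' \
  > /tmp/perf-changelog-sample.yaml

grep -n 'pull/1166' /tmp/perf-changelog-sample.yaml          # one hit: the new entry
grep -c 'pull/1176' /tmp/perf-changelog-sample.yaml || true  # prints 0: no prior use
```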

Step-by-step proof

  1. Open the PR diff for perf-changelog.yaml. The entire 11-line block at lines 1879-1889 is being newly added (no - lines, only +), and the new config-keys is dsv4-fp4-b300-sglang-mtp.
  2. Look up the config: dsv4-fp4-b300-sglang-mtp is also added by this same PR in .github/configs/nvidia-master.yaml (lines 1875+) and by the new script benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh. The description in the changelog ("Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512)", EAGLE/MTP flags 3/1/4, image digest sha256:26e116bd…f5bc070f3) matches exactly what this PR adds in those files. So the entry is unambiguously the one being introduced by this PR.
  3. The PR number is 1176 (per the PR metadata <pr number="1176">). PR #1166 is a different PR (per the verifier check: an open PR titled 'sglang dsv4 MTP', not yet merged). They share the digits "116", consistent with a transposition typo.
  4. Convention check: every other recent entry in this file uses pr-link matching the PR that adds it (e.g., the entry directly above this one at line 1877 is /pull/1174, matching the prior commit fc93e84; the commit message of fc93e84 itself even says "append the MTP config entry for PR #1166" — confirming 1166 was a placeholder/wrong number rather than an intentional cross-reference).

Impact

Documentation/metadata only — no runtime behavior is affected. However, once merged, the changelog will permanently misattribute the introduction of dsv4-fp4-b300-sglang-mtp to PR #1166 (an unrelated PR by a different author), so anyone clicking through to find the introducing PR will land on the wrong page. Easy to fix before merge.

How to fix

Change line 1889 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1176

yhyang201 and others added 2 commits April 26, 2026 21:12