[co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak #1185

Merged
cquil11 merged 3 commits into main from
chore/dsv4-fp4-b300-sglang-recipe-split
Apr 26, 2026

Conversation

Collaborator

@cquil11 cquil11 commented Apr 26, 2026

co-authored with sglang community maintainers leads at radixark

Summary

Single-commit follow-up to the #1184 baseline. Re-applies the recipe-per-CONC split and DP-attention tuning that were running pre-revert, in one shot. Excludes #1173's --max-running-requests floor (that was discovery, not intent).

What's included

  • fix sgl b200/b300 dpsk-v4 script #1158 — recipe-per-CONC dispatch in the launch script (low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC); image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...; nvidia-master.yaml matrix expanded to TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k.
  • dsv4-b300-sglang: retry #1174 — SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1 in the DP-attention branch of the launch script.
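The recipe-per-CONC dispatch described above can be sketched as a small selector keyed on the DP_ATTENTION and CONC env knobs named in the PR. This is an illustrative reconstruction, not the script's actual code — the function name and the exact CONC thresholds are assumptions, chosen to match the TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 matrix:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the recipe-per-CONC dispatch. DP_ATTENTION and CONC
# are the env knobs from the PR; thresholds here are illustrative only.
select_recipe() {
    local dp_attention="$1" conc="$2"
    if [ "$dp_attention" = "true" ]; then
        # TP4/EP4 dp-attn conc=512 row
        echo "max-throughput"
    elif [ "$conc" -le 1 ]; then
        # TP8/EP1 conc=1 row
        echo "low-latency"
    else
        # TP4/EP1 conc=32 row
        echo "balanced"
    fi
}

select_recipe "${DP_ATTENTION:-false}" "${CONC:-1}"
```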

What's excluded

  • #1173's --max-running-requests floor (that change was discovery, not intended methodology).

Files

  • .github/configs/nvidia-master.yaml — dsv4-fp4-b300-sglang block updated to the recipe-per-CONC matrix; comment updated to reference the renamed dsv4_fp4_b300_sglang.sh script.
  • benchmarks/single_node/dsv4_fp4_b300_sglang.sh — recipe dispatch logic (DP_ATTENTION env knob), DeepEP/mega_moe env vars, SWA tweak.
  • perf-changelog.yaml — single new entry to trigger the sweep.

Test plan

  • Sweep run on B300 completes end-to-end: low-latency, balanced, and max-throughput rows all launch.
  • Result filenames carry the ep=/dpa= suffixes that distinguish the three recipes.
  • DP-attn path picks up SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1 (visible in the SGLANG_* env dump in server.log).
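The ep=/dpa= filename suffixes the test plan checks for are baked in by the benchmark template (per the review below, via ep${EP_SIZE} and dpa=${DP_ATTENTION}). A minimal sketch of that naming, with an assumed filename prefix and layout — the real template in benchmark-tmpl.yml may differ:

```shell
#!/usr/bin/env bash
# Illustrative only: shows how the ep=/dpa= suffixes distinguish recipes.
# Prefix and field order are assumptions, not the template's real format.
TP=4
EP_SIZE=4
DP_ATTENTION=true
RESULT_FILE="dsv4-fp4-b300-sglang_tp${TP}_ep${EP_SIZE}_dpa=${DP_ATTENTION}.json"
echo "$RESULT_FILE"
```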

🤖 Generated with Claude Code

Squashes the cumulative changes from #1158 and #1174 into a single
commit on top of the #1184 baseline. Excludes the iterative
--max-running-requests floor from #1173.

- Image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...
- Search space: TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn
  conc=512 for both 1k1k and 8k1k
- Script dispatches on DP_ATTENTION knob: TP-only (flashinfer_mxfp4)
  vs DP-attn (deepep + prefill-delayer + mega_moe env vars)
- DP-attn path enables SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

Collaborator

@Oseltamivir Oseltamivir left a comment


lgtm

Comment thread benchmarks/single_node/dsv4_fp4_b300_sglang.sh
Comment on lines +81 to +89
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-runner-backend flashinfer_mxfp4
--disable-flashinfer-autotune
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 16384
--enable-prefill-delayer
)
Contributor


🔴 The DP_ATTENTION=true branch (lines 81-89) builds PARALLEL_ARGS without --moe-a2a-backend deepep, so the max-throughput recipe at conc=512 runs with ep_size=1 (the default) instead of ep=4 despite nvidia-master.yaml declaring { tp: 4, ep: 4, dp-attn: true } and the perf-changelog advertising "max-throughput (TP=4, EP=4, DP-attn + DeepEP)". Result filenames will carry ep=4-dpa=true while sglang actually runs EP=1, mislabeling the artifact. To fix: add --moe-a2a-backend deepep (and drop --moe-runner-backend flashinfer_mxfp4, which is the non-EP backend) to match the sibling dsv4_fp4_b300_sglang_mtp.sh (lines 86-91) and dsv4_fp4_b200.sh (lines 63-69).

Extended reasoning...

What the bug is

The DP-attention branch in benchmarks/single_node/dsv4_fp4_b300_sglang.sh builds PARALLEL_ARGS like this:

if [ "${DP_ATTENTION}" = "true" ]; then
    ...
    PARALLEL_ARGS=(
        --dp-size "$TP"
        --enable-dp-attention
        --moe-runner-backend flashinfer_mxfp4
        --disable-flashinfer-autotune
        --deepep-config "$DEEPEP_CONFIG"
        --chunked-prefill-size 16384
        --enable-prefill-delayer
    )
    ...
fi

The --moe-a2a-backend deepep flag is missing. Per the YAML comment added by this same PR (.github/configs/nvidia-master.yaml:1854):

ep is implicit in sglang: --moe-a2a-backend deepep forces ep_size=tp_size, while low-latency leaves ep_size at the default of 1.

Without that flag, sglang leaves ep_size at the default of 1, so the run never actually exercises EP=4 + DeepEP.

How it manifests / step-by-step proof

  1. The YAML row is { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 }. The orchestrator sets TP=4, EP_SIZE=4, DP_ATTENTION=true, CONC=512 in env.
  2. The script enters the DP_ATTENTION=true branch and assembles PARALLEL_ARGS without --moe-a2a-backend deepep.
  3. sglang serve is invoked with --tp 4 --dp-size 4 --enable-dp-attention --moe-runner-backend flashinfer_mxfp4 --deepep-config ... but no --moe-a2a-backend deepep. Sglang therefore leaves ep_size=1 (default) and the --deepep-config is dead config because no DeepEP backend was selected.
  4. The benchmark template (benchmark-tmpl.yml:146) bakes EP_SIZE and DP_ATTENTION into the result filename via ep${EP_SIZE} and dpa=${DP_ATTENTION} suffixes — so the output carries ep=4-dpa=true even though the runtime is actually EP=1.

Why existing code does not prevent it

Nothing in this script reads EP_SIZE or asserts it against the actual sglang flags. The shell script just passes through TP from env and constructs --dp-size "$TP"; EP_SIZE only appears in the result filename, never in the launch command. The two sister scripts get this right — dsv4_fp4_b300_sglang_mtp.sh:89 and dsv4_fp4_b200.sh:66 both pass --moe-a2a-backend deepep and omit --moe-runner-backend flashinfer_mxfp4 in their DP-attention branches, because deepep handles MoE on its own.
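Since nothing in the script asserts EP_SIZE against the assembled flags, a guard like the following could catch the mismatch at launch time. This is a hypothetical addition (the function name and error message are invented here, not part of the PR), relying only on the review's stated fact that --moe-a2a-backend deepep is what forces ep_size=tp_size:

```shell
#!/usr/bin/env bash
# Hypothetical guard, not in the script today: fail fast if the YAML row
# declares EP>1 but the launch args never select the DeepEP a2a backend.
check_ep_consistency() {
    local ep_size="$1"; shift
    if [ "$ep_size" -gt 1 ]; then
        case " $* " in
            # ok: deepep forces ep_size = tp_size
            *" --moe-a2a-backend deepep "*) return 0 ;;
            *) echo "EP_SIZE=$ep_size but no DeepEP backend selected" >&2
               return 1 ;;
        esac
    fi
    return 0
}
```

Run against the buggy PARALLEL_ARGS above, this would abort the conc=512 row instead of producing a mislabeled ep=4 artifact.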

Impact

The recipe-per-CONC max-throughput row at conc=512 (the headline row of this PR) does not actually exercise the configuration the YAML and perf-changelog advertise. The artifact filenames are mislabeled (ep=4-dpa=true vs. actual EP=1), so any downstream pareto analysis comparing this row against dsv4-fp4-b200-sglang's real EP=8 + DeepEP run will draw incorrect conclusions about B300 vs B200. This defeats the entire purpose of the recipe-per-CONC split that this PR introduces.

Suggested fix

Mirror the sister scripts: add --moe-a2a-backend deepep and drop --moe-runner-backend flashinfer_mxfp4 in the DP_ATTENTION=true branch. The TP-only else branch correctly keeps flashinfer_mxfp4. Alternatively, if the intent really is no DeepEP for B300, update the YAML row to ep: 1 and the perf-changelog to drop the "+ DeepEP" claim — but the former is the obviously-correct fix given the YAML comment and sister scripts.
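Applying the suggested fix, the DP_ATTENTION=true branch would look roughly like this (TP and DEEPEP_CONFIG values below are placeholders for what the real script sets from env):

```shell
#!/usr/bin/env bash
# Sketch of the reviewer's suggested fix: add --moe-a2a-backend deepep and
# drop --moe-runner-backend flashinfer_mxfp4, mirroring the sister scripts.
TP=4                                  # placeholder; real script reads env
DEEPEP_CONFIG=/path/to/deepep.json    # placeholder path
DP_ATTENTION=true

if [ "${DP_ATTENTION}" = "true" ]; then
    PARALLEL_ARGS=(
        --dp-size "$TP"
        --enable-dp-attention
        --moe-a2a-backend deepep      # forces ep_size = tp_size
        --disable-flashinfer-autotune
        --deepep-config "$DEEPEP_CONFIG"
        --chunked-prefill-size 16384
        --enable-prefill-delayer
    )
fi
echo "${PARALLEL_ARGS[@]}"
```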

cquil11 added a commit that referenced this pull request Apr 26, 2026
…1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific
changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced /
  max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 changed the title dsv4-fp4-b300-sglang: recipe-per-CONC split + DP-attn SWA tweak [NVIDIA][SGLang][redo PR] DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak Apr 26, 2026
Comment thread benchmarks/single_node/dsv4_fp4_b300_sglang.sh Outdated
@cquil11 cquil11 changed the title [NVIDIA][SGLang][redo PR] DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak Apr 26, 2026
@cquil11 cquil11 merged commit 47cc6ac into main Apr 26, 2026
10 of 16 checks passed
@cquil11 cquil11 deleted the chore/dsv4-fp4-b300-sglang-recipe-split branch April 26, 2026 20:54
cquil11 added a commit that referenced this pull request Apr 26, 2026
…C split (#1187)

* dsv4-fp4-b200-sglang: recipe-per-CONC dispatch (re-apply b200 part of #1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific
changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced /
  max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1187

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b200-sglang: restore --disable-radix-cache flag

The flag was accidentally dropped during the recipe-per-CONC rewrite.
Restoring it to match the baseline methodology (prefix caching disabled)
and stay consistent with all other dsv4 sister scripts.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update changelog and YAML comments to match two-way DP_ATTENTION dispatch

The script has two branches (DP_ATTENTION true/false), not three CONC-keyed
recipes. Both balanced and max-throughput rows use the same DP-attention +
DeepEP flags — only --max-running-requests differs. Updated the
nvidia-master.yaml comment block and perf-changelog description to accurately
reflect this two-recipe dispatch.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
@functionstackx functionstackx changed the title [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak [co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak Apr 27, 2026

3 participants