[co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak #1185

Merged
cquil11 merged 3 commits into main from
chore/dsv4-fp4-b300-sglang-recipe-split
Apr 26, 2026

Conversation

Collaborator

@cquil11 cquil11 commented Apr 26, 2026

co-authored with sglang community maintainers leads at radixark

Summary

Single-commit follow-up to the #1184 baseline. Re-applies the recipe-per-CONC split and DP-attention tuning that were running pre-revert, in one shot. Excludes #1173's --max-running-requests floor (that was discovery, not intent).

What's included

  • fix sgl b200/b300 dpsk-v4 script #1158 — recipe-per-CONC dispatch in the launch script (low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC); image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...; nvidia-master.yaml matrix expanded to TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k.
  • dsv4-b300-sglang: retry #1174 — SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1 in the DP-attention branch of the launch script.
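The recipe-per-CONC dispatch described above can be sketched as a small selector keyed on the DP_ATTENTION and CONC env knobs named in the PR. This is an illustrative reconstruction, not the script's actual code — the function name and the exact CONC thresholds are assumptions, chosen to match the TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 matrix:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the recipe-per-CONC dispatch. DP_ATTENTION and CONC
# are the env knobs from the PR; thresholds here are illustrative only.
select_recipe() {
    local dp_attention="$1" conc="$2"
    if [ "$dp_attention" = "true" ]; then
        # TP4/EP4 dp-attn conc=512 row
        echo "max-throughput"
    elif [ "$conc" -le 1 ]; then
        # TP8/EP1 conc=1 row
        echo "low-latency"
    else
        # TP4/EP1 conc=32 row
        echo "balanced"
    fi
}

select_recipe "${DP_ATTENTION:-false}" "${CONC:-1}"
```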

What's excluded

  • #1173's --max-running-requests floor (that change was discovery, not intended methodology).

Files

  • .github/configs/nvidia-master.yaml — dsv4-fp4-b300-sglang block updated to the recipe-per-CONC matrix; comment updated to reference the renamed dsv4_fp4_b300_sglang.sh script.
  • benchmarks/single_node/dsv4_fp4_b300_sglang.sh — recipe dispatch logic (DP_ATTENTION env knob), DeepEP/mega_moe env vars, SWA tweak.
  • perf-changelog.yaml — single new entry to trigger the sweep.

Test plan

  • Sweep run on B300 completes end-to-end: low-latency, balanced, and max-throughput rows all launch.
  • Result filenames carry the ep=/dpa= suffixes that distinguish the three recipes.
  • DP-attn path picks up SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1 (visible in the SGLANG_* env dump in server.log).
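The ep=/dpa= filename suffixes the test plan checks for are baked in by the benchmark template (per the review below, via ep${EP_SIZE} and dpa=${DP_ATTENTION}). A minimal sketch of that naming, with an assumed filename prefix and layout — the real template in benchmark-tmpl.yml may differ:

```shell
#!/usr/bin/env bash
# Illustrative only: shows how the ep=/dpa= suffixes distinguish recipes.
# Prefix and field order are assumptions, not the template's real format.
TP=4
EP_SIZE=4
DP_ATTENTION=true
RESULT_FILE="dsv4-fp4-b300-sglang_tp${TP}_ep${EP_SIZE}_dpa=${DP_ATTENTION}.json"
echo "$RESULT_FILE"
```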

🤖 Generated with Claude Code

Squashes the cumulative changes from #1158 and #1174 into a single
commit on top of the #1184 baseline. Excludes the iterative
--max-running-requests floor from #1173.

- Image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...
- Search space: TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn
  conc=512 for both 1k1k and 8k1k
- Script dispatches on DP_ATTENTION knob: TP-only (flashinfer_mxfp4)
  vs DP-attn (deepep + prefill-delayer + mega_moe env vars)
- DP-attn path enables SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

Collaborator

@Oseltamivir Oseltamivir left a comment


lgtm

Comment thread benchmarks/single_node/dsv4_fp4_b300_sglang.sh
Comment on lines +81 to +89
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-runner-backend flashinfer_mxfp4
--disable-flashinfer-autotune
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 16384
--enable-prefill-delayer
)
Contributor


🔴 The DP_ATTENTION=true branch (lines 81-89) builds PARALLEL_ARGS without --moe-a2a-backend deepep, so the max-throughput recipe at conc=512 runs with ep_size=1 (the default) instead of ep=4 despite nvidia-master.yaml declaring { tp: 4, ep: 4, dp-attn: true } and the perf-changelog advertising "max-throughput (TP=4, EP=4, DP-attn + DeepEP)". Result filenames will carry ep=4-dpa=true while sglang actually runs EP=1, mislabeling the artifact. To fix: add --moe-a2a-backend deepep (and drop --moe-runner-backend flashinfer_mxfp4, which is the non-EP backend) to match the sibling dsv4_fp4_b300_sglang_mtp.sh (lines 86-91) and dsv4_fp4_b200.sh (lines 63-69).

Extended reasoning...

What the bug is

The DP-attention branch in benchmarks/single_node/dsv4_fp4_b300_sglang.sh builds PARALLEL_ARGS like this:

if [ "${DP_ATTENTION}" = "true" ]; then
    ...
    PARALLEL_ARGS=(
        --dp-size "$TP"
        --enable-dp-attention
        --moe-runner-backend flashinfer_mxfp4
        --disable-flashinfer-autotune
        --deepep-config "$DEEPEP_CONFIG"
        --chunked-prefill-size 16384
        --enable-prefill-delayer
    )
    ...
fi

The --moe-a2a-backend deepep flag is missing. Per the YAML comment added by this same PR (.github/configs/nvidia-master.yaml:1854):

ep is implicit in sglang: --moe-a2a-backend deepep forces ep_size=tp_size, while low-latency leaves ep_size at the default of 1.

Without that flag, sglang leaves ep_size at the default of 1, so the run never actually exercises EP=4 + DeepEP.

How it manifests / step-by-step proof

  1. The YAML row is { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 }. The orchestrator sets TP=4, EP_SIZE=4, DP_ATTENTION=true, CONC=512 in env.
  2. The script enters the DP_ATTENTION=true branch and assembles PARALLEL_ARGS without --moe-a2a-backend deepep.
  3. sglang serve is invoked with --tp 4 --dp-size 4 --enable-dp-attention --moe-runner-backend flashinfer_mxfp4 --deepep-config ... but no --moe-a2a-backend deepep. Sglang therefore leaves ep_size=1 (default) and the --deepep-config is dead config because no DeepEP backend was selected.
  4. The benchmark template (benchmark-tmpl.yml:146) bakes EP_SIZE and DP_ATTENTION into the result filename via ep${EP_SIZE} and dpa=${DP_ATTENTION} suffixes — so the output carries ep=4-dpa=true even though the runtime is actually EP=1.

Why existing code does not prevent it

Nothing in this script reads EP_SIZE or asserts it against the actual sglang flags. The shell script just passes through TP from env and constructs --dp-size "$TP"; EP_SIZE only appears in the result filename, never in the launch command. The two sister scripts get this right — dsv4_fp4_b300_sglang_mtp.sh:89 and dsv4_fp4_b200.sh:66 both pass --moe-a2a-backend deepep and omit --moe-runner-backend flashinfer_mxfp4 in their DP-attention branches, because deepep handles MoE on its own.
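Since nothing in the script asserts EP_SIZE against the assembled flags, a guard like the following could catch the mismatch at launch time. This is a hypothetical addition (the function name and error message are invented here, not part of the PR), relying only on the review's stated fact that --moe-a2a-backend deepep is what forces ep_size=tp_size:

```shell
#!/usr/bin/env bash
# Hypothetical guard, not in the script today: fail fast if the YAML row
# declares EP>1 but the launch args never select the DeepEP a2a backend.
check_ep_consistency() {
    local ep_size="$1"; shift
    if [ "$ep_size" -gt 1 ]; then
        case " $* " in
            # ok: deepep forces ep_size = tp_size
            *" --moe-a2a-backend deepep "*) return 0 ;;
            *) echo "EP_SIZE=$ep_size but no DeepEP backend selected" >&2
               return 1 ;;
        esac
    fi
    return 0
}
```

Run against the buggy PARALLEL_ARGS above, this would abort the conc=512 row instead of producing a mislabeled ep=4 artifact.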

Impact

The recipe-per-CONC max-throughput row at conc=512 (the headline row of this PR) does not actually exercise the configuration the YAML and perf-changelog advertise. The artifact filenames are mislabeled (ep=4-dpa=true vs. actual EP=1), so any downstream pareto analysis comparing this row against dsv4-fp4-b200-sglang's real EP=8 + DeepEP run will draw incorrect conclusions about B300 vs B200. This defeats the entire purpose of the recipe-per-CONC split that this PR introduces.

Suggested fix

Mirror the sister scripts: add --moe-a2a-backend deepep and drop --moe-runner-backend flashinfer_mxfp4 in the DP_ATTENTION=true branch. The TP-only else branch correctly keeps flashinfer_mxfp4. Alternatively, if the intent really is no DeepEP for B300, update the YAML row to ep: 1 and the perf-changelog to drop the "+ DeepEP" claim — but the former is the obviously-correct fix given the YAML comment and sister scripts.
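Applying the suggested fix, the DP_ATTENTION=true branch would look roughly like this (TP and DEEPEP_CONFIG values below are placeholders for what the real script sets from env):

```shell
#!/usr/bin/env bash
# Sketch of the reviewer's suggested fix: add --moe-a2a-backend deepep and
# drop --moe-runner-backend flashinfer_mxfp4, mirroring the sister scripts.
TP=4                                  # placeholder; real script reads env
DEEPEP_CONFIG=/path/to/deepep.json    # placeholder path
DP_ATTENTION=true

if [ "${DP_ATTENTION}" = "true" ]; then
    PARALLEL_ARGS=(
        --dp-size "$TP"
        --enable-dp-attention
        --moe-a2a-backend deepep      # forces ep_size = tp_size
        --disable-flashinfer-autotune
        --deepep-config "$DEEPEP_CONFIG"
        --chunked-prefill-size 16384
        --enable-prefill-delayer
    )
fi
echo "${PARALLEL_ARGS[@]}"
```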

cquil11 added a commit that referenced this pull request Apr 26, 2026
…1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific
changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced /
  max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 changed the title dsv4-fp4-b300-sglang: recipe-per-CONC split + DP-attn SWA tweak [NVIDIA][SGLang][redo PR] DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak Apr 26, 2026
Comment thread benchmarks/single_node/dsv4_fp4_b300_sglang.sh Outdated
@cquil11 cquil11 changed the title [NVIDIA][SGLang][redo PR] DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak Apr 26, 2026
@cquil11 cquil11 merged commit 47cc6ac into main Apr 26, 2026
10 of 16 checks passed
@cquil11 cquil11 deleted the chore/dsv4-fp4-b300-sglang-recipe-split branch April 26, 2026 20:54
cquil11 added a commit that referenced this pull request Apr 26, 2026
…C split (#1187)

* dsv4-fp4-b200-sglang: recipe-per-CONC dispatch (re-apply b200 part of #1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific
changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced /
  max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1187

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b200-sglang: restore --disable-radix-cache flag

The flag was accidentally dropped during the recipe-per-CONC rewrite.
Restoring it to match the baseline methodology (prefix caching disabled)
and stay consistent with all other dsv4 sister scripts.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update changelog and YAML comments to match two-way DP_ATTENTION dispatch

The script has two branches (DP_ATTENTION true/false), not three CONC-keyed
recipes. Both balanced and max-throughput rows use the same DP-attention +
DeepEP flags — only --max-running-requests differs. Updated the
nvidia-master.yaml comment block and perf-changelog description to accurately
reflect this two-recipe dispatch.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
@functionstackx functionstackx changed the title [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak [co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak Apr 27, 2026

3 participants