Conversation
Squashes the cumulative changes from #1158 and #1174 into a single commit on top of the #1184 baseline. Excludes the iterative `--max-running-requests` floor from #1173.

- Image pinned to `lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...`
- Search space: TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k
- Script dispatches on the `DP_ATTENTION` knob: TP-only (`flashinfer_mxfp4`) vs DP-attn (`deepep` + prefill-delayer + mega_moe env vars)
- DP-attn path enables `SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
PARALLEL_ARGS=(
  --dp-size "$TP"
  --enable-dp-attention
  --moe-runner-backend flashinfer_mxfp4
  --disable-flashinfer-autotune
  --deepep-config "$DEEPEP_CONFIG"
  --chunked-prefill-size 16384
  --enable-prefill-delayer
)
```
🔴 The `DP_ATTENTION=true` branch (lines 81-89) builds `PARALLEL_ARGS` without `--moe-a2a-backend deepep`, so the max-throughput recipe at conc=512 runs with `ep_size=1` (the default) instead of ep=4, despite `nvidia-master.yaml` declaring `{ tp: 4, ep: 4, dp-attn: true }` and the perf-changelog advertising "max-throughput (TP=4, EP=4, DP-attn + DeepEP)". Result filenames will carry `ep=4-dpa=true` while sglang actually runs EP=1, mislabeling the artifact. To fix: add `--moe-a2a-backend deepep` (and drop `--moe-runner-backend flashinfer_mxfp4`, which is the non-EP backend) to match the sibling `dsv4_fp4_b300_sglang_mtp.sh` (lines 86-91) and `dsv4_fp4_b200.sh` (lines 63-69).
Extended reasoning...
What the bug is
The DP-attention branch in `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` builds `PARALLEL_ARGS` like this:
```shell
if [ "${DP_ATTENTION}" = "true" ]; then
  ...
  PARALLEL_ARGS=(
    --dp-size "$TP"
    --enable-dp-attention
    --moe-runner-backend flashinfer_mxfp4
    --disable-flashinfer-autotune
    --deepep-config "$DEEPEP_CONFIG"
    --chunked-prefill-size 16384
    --enable-prefill-delayer
  )
  ...
fi
```

The `--moe-a2a-backend deepep` flag is missing. Per the YAML comment added by this same PR (`.github/configs/nvidia-master.yaml:1854`):
> ep is implicit in sglang: `--moe-a2a-backend deepep` forces `ep_size=tp_size`, while low-latency leaves `ep_size` at the default of 1.
Without that flag, sglang leaves `ep_size` at the default of 1, so the run never actually exercises EP=4 + DeepEP.
How it manifests / step-by-step proof
- The YAML row is `{ tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 }`. The orchestrator sets `TP=4`, `EP_SIZE=4`, `DP_ATTENTION=true`, `CONC=512` in env.
- The script enters the `DP_ATTENTION=true` branch and assembles `PARALLEL_ARGS` without `--moe-a2a-backend deepep`. `sglang serve` is invoked with `--tp 4 --dp-size 4 --enable-dp-attention --moe-runner-backend flashinfer_mxfp4 --deepep-config ...` but no `--moe-a2a-backend deepep`. Sglang therefore leaves `ep_size=1` (the default), and the `--deepep-config` is dead config because no DeepEP backend was selected.
- The benchmark template (`benchmark-tmpl.yml:146`) bakes `EP_SIZE` and `DP_ATTENTION` into the result filename via `ep${EP_SIZE}` and `dpa=${DP_ATTENTION}` suffixes, so the output carries `ep=4-dpa=true` even though the runtime is actually EP=1.
Why existing code does not prevent it
Nothing in this script reads `EP_SIZE` or asserts it against the actual sglang flags. The shell script just passes `TP` through from env and constructs `--dp-size "$TP"`; `EP_SIZE` only appears in the result filename, never in the launch command. The two sister scripts get this right: `dsv4_fp4_b300_sglang_mtp.sh:89` and `dsv4_fp4_b200.sh:66` both pass `--moe-a2a-backend deepep` and omit `--moe-runner-backend flashinfer_mxfp4` in their DP-attention branches, because deepep handles MoE on its own.
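A guard of the sort this paragraph says is missing might look like the following. This is a hypothetical addition, not code from the script; the `MISMATCH` variable and message wording are invented:

```shell
# Hypothetical guard: flag a mismatch when the YAML-declared EP_SIZE
# disagrees with the flags actually being passed to the server.
EP_SIZE=4
PARALLEL_ARGS=( --dp-size 4 --enable-dp-attention )   # note: no --moe-a2a-backend deepep

MISMATCH=0
if [ "${EP_SIZE}" -gt 1 ]; then
  case " ${PARALLEL_ARGS[*]} " in
    *" --moe-a2a-backend deepep "*) : ;;   # deepep present: ep_size will equal tp_size
    *) MISMATCH=1 ;;                       # declared EP>1 but server will run ep_size=1
  esac
fi

if [ "${MISMATCH}" -eq 1 ]; then
  echo "warning: EP_SIZE=${EP_SIZE} declared but no deepep backend selected" >&2
fi
```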
Impact
The recipe-per-CONC max-throughput row at conc=512 (the headline row of this PR) does not actually exercise the configuration the YAML and perf-changelog advertise. The artifact filenames are mislabeled (`ep=4-dpa=true` vs. actual EP=1), so any downstream Pareto analysis comparing this row against `dsv4-fp4-b200-sglang`'s real EP=8 + DeepEP run will draw incorrect conclusions about B300 vs B200. This defeats the entire purpose of the recipe-per-CONC split that this PR introduces.
Suggested fix
Mirror the sister scripts: add `--moe-a2a-backend deepep` and drop `--moe-runner-backend flashinfer_mxfp4` in the `DP_ATTENTION=true` branch. The TP-only else branch correctly keeps `flashinfer_mxfp4`. Alternatively, if the intent really is no DeepEP for B300, update the YAML row to `ep: 1` and the perf-changelog to drop the "+ DeepEP" claim; but the former is the obviously correct fix given the YAML comment and sister scripts.
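A minimal sketch of that suggested fix, mirroring the sister scripts: the `...` context is elided as in the quoted snippet, `DEEPEP_CONFIG` is a placeholder value, and the TP-only arm is abbreviated to the one flag the review discusses:

```shell
TP=4
DEEPEP_CONFIG=/path/to/deepep_config.json   # placeholder
DP_ATTENTION=true

if [ "${DP_ATTENTION}" = "true" ]; then
  PARALLEL_ARGS=(
    --dp-size "$TP"
    --enable-dp-attention
    --moe-a2a-backend deepep            # added: forces ep_size=tp_size
    --disable-flashinfer-autotune
    --deepep-config "$DEEPEP_CONFIG"
    --chunked-prefill-size 16384
    --enable-prefill-delayer
  )
else
  # TP-only path keeps the non-EP runner backend (abbreviated)
  PARALLEL_ARGS=( --moe-runner-backend flashinfer_mxfp4 )
fi

printf '%s\n' "${PARALLEL_ARGS[@]}"
```

With `DP_ATTENTION=true`, the assembled args now select the DeepEP a2a backend and no longer carry `flashinfer_mxfp4`, matching what the YAML row declares.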
…1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

- Image pinned to `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...`
- Adds `DP_ATTENTION` env knob and `SGLANG_OPT_*` perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by `DP_ATTENTION` + `CONC`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…C split (#1187)

* dsv4-fp4-b200-sglang: recipe-per-CONC dispatch (re-apply b200 part of #1158)

  Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

  - Image pinned to `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...`
  - Adds `DP_ATTENTION` env knob and `SGLANG_OPT_*` perf env vars
  - Search space gets conc-start=1 in low-latency rows (was 4)
  - Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by `DP_ATTENTION` + `CONC`

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1187

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b200-sglang: restore `--disable-radix-cache` flag

  The flag was accidentally dropped during the recipe-per-CONC rewrite. Restoring it to match the baseline methodology (prefix caching disabled) and stay consistent with all other dsv4 sister scripts.

  Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update changelog and YAML comments to match two-way DP_ATTENTION dispatch

  The script has two branches (DP_ATTENTION true/false), not three CONC-keyed recipes. Both balanced and max-throughput rows use the same DP-attention + DeepEP flags; only `--max-running-requests` differs. Updated the nvidia-master.yaml comment block and perf-changelog description to accurately reflect this two-recipe dispatch.

  Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-authored with sglang community maintainer leads at radixark.
Summary
Single-commit follow-up to the #1184 baseline. Re-applies the recipe-per-CONC split and DP-attention tuning that were running pre-revert, in one shot. Excludes #1173's `--max-running-requests` floor (that was discovery, not intent).

What's included
- Recipe-per-CONC dispatch in the launch script (selected by `DP_ATTENTION` + `CONC`); image pinned to `lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...`; `nvidia-master.yaml` matrix expanded to TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k.
- `SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1` enabled in the DP-attention branch of the launch script.

What's excluded
- #1173's `--max-running-requests` floor at 8 (kept the formula `CONC * 3 / 2` without the floor). This was discovery from a single low-CONC sweep, not part of the recipe intent.

Files
- `.github/configs/nvidia-master.yaml` — `dsv4-fp4-b300-sglang` block updated to the recipe-per-CONC matrix; comment updated to reference the renamed `dsv4_fp4_b300_sglang.sh` script.
- `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` — recipe dispatch logic (`DP_ATTENTION` env knob), DeepEP/mega_moe env vars, SWA tweak.
- `perf-changelog.yaml` — single new entry to trigger the sweep.

Test plan
- Result filenames carry the `ep=`/`dpa=` suffixes that distinguish the three recipes.
- `SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1` is set (visible in the `SGLANG_*` env dump in `server.log`).

🤖 Generated with Claude Code