Add B300 config: qwen3.5-bf16-sglang-mtp (#1082)
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1
🔴 The new qwen3.5_bf16_b300_mtp.sh script is missing the SGLANG_ENABLE_SPEC_V2=1 environment variable that every other SGLang EAGLE MTP benchmark script in the repo includes, causing SGLang to fall back to the slower V1 speculative decoding path. This degrades MTP throughput and may produce inflated acceptance rates, making benchmark results non-comparable to the FP8 B300 MTP config. Fix by prepending SGLANG_ENABLE_SPEC_V2=1 to the server launch line (line 55) alongside the existing PYTHONNOUSERSITE=1 prefix.
Extended reasoning
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh script adds EAGLE speculative decoding flags (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) to the SGLang server launch command at line 55, but omits the SGLANG_ENABLE_SPEC_V2=1 environment variable. Without this flag, SGLang selects its legacy V1 speculative decoding code path even when EAGLE arguments are supplied.
The specific code path that triggers it
Line 55 of the new script reads:
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The directly analogous script for the same hardware and same image (qwen3.5_fp8_b300_mtp.sh, PR #1035, line 34) reads:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The same pattern appears in dsr1_fp8_b200_mtp.sh (line 57), dsr1_fp8_b300_mtp.sh (line 61), and qwen3.5_fp8_h200_mtp.sh (line 38). All four scripts using SGLang with EAGLE on recent images include the flag; the new BF16 B300 script is the sole outlier.
Why existing code does not prevent it
The EAGLE speculative decoding CLI flags are passed correctly — SGLANG_ENABLE_SPEC_V2 is a separate runtime toggle that must be exported as an env var before the Python process starts. SGLang does not error or warn when the flag is absent; it silently downgrades to the V1 path, so there is no automatic signal that the script is misconfigured.
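To make the failure mode concrete, here is a minimal shell sketch of an env-gated toggle that defaults silently. The variable name comes from the review; the branching logic is a hypothetical stand-in for SGLang's internal check, not its actual code:

```shell
# Simulate the new script, which never sets the toggle.
unset SGLANG_ENABLE_SPEC_V2

# Hypothetical sketch of an env-gated engine selection: when the variable
# is unset it defaults to 0, and the legacy path is chosen with no warning.
path="V1"
if [ "${SGLANG_ENABLE_SPEC_V2:-0}" = "1" ]; then
    path="V2"
fi
echo "selected speculative decoding path: $path"
```

Run as-is this selects V1; only an explicit SGLANG_ENABLE_SPEC_V2=1 in the environment flips it to V2, which is why the omission produces no visible error at startup.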
What the impact would be
Benchmark runs will use SGLang's slower, less optimised V1 speculation path. PR #1017 was a dedicated follow-up fix titled 'Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP', demonstrating that the omission has a real, documented performance impact. Additionally, the V1 path can produce artificially high speculative acceptance rates, meaning the reported MTP numbers would not be comparable to the FP8 B300 MTP config that does use V2.
How to fix it
Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern in qwen3.5_fp8_b300_mtp.sh:
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

Alternatively, add export SGLANG_ENABLE_SPEC_V2=1 alongside the other export statements at lines 23-26.
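Putting the two options together, the relevant lines of the fixed script could look like the following sketch. The EAGLE flags are the ones named in this review; the remaining server arguments (model path, ports, and so on) are elided, as in the original excerpts:

```shell
# Option A: export the toggle with the script's other environment variables.
export SGLANG_ENABLE_SPEC_V2=1

# Option B: prefix the launch line directly, matching qwen3.5_fp8_b300_mtp.sh.
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
    # ...remaining server arguments unchanged
```

Either form is sufficient on its own; using both is harmless but redundant.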
Step-by-step proof
- The script exports NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true, and PYTHONUNBUFFERED=1 (lines 23-26), but not SGLANG_ENABLE_SPEC_V2.
- At line 55 the server is launched with PYTHONNOUSERSITE=1 python3 -m sglang.launch_server, with no SGLANG_ENABLE_SPEC_V2=1 prefix.
- SGLang checks this env var at startup to decide which speculation engine to use; since it is unset (defaults to 0/false), it activates the V1 path.
- The four comparable MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh) all set the flag, so their results are produced by the V2 path.
- Any throughput or acceptance-rate comparison between the new BF16 B300 MTP config and existing MTP configs will therefore compare V1 results against V2 results, an apples-to-oranges comparison that corrupts benchmark conclusions.
Mirrors the existing qwen3.5-bf16-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. The script also passes --use-chat-template to run_benchmark_serving, as required by AGENTS.md for all MTP configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 4f176c1 to f33c2d7.
Summary
- qwen3.5-bf16-b300-sglang-mtp config mirroring the existing qwen3.5-bf16-b300-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh launch script.
- EAGLE speculative decoding flags: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
- Passes --use-chat-template to run_benchmark_serving per the AGENTS.md requirement for all MTP scripts.
- Sets spec-decoding: mtp so the runner picks up the _mtp.sh variant.
- The perf-changelog.yaml diff is append-only (no modifications to any existing line).

Test plan
- bash -n benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh: bash syntax OK.
- git diff perf-changelog.yaml shows only additions.
- python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml emits 20 entries (2 ISL/OSL × 2 TP variants × 5 concurrencies) with spec-decoding=mtp.

🤖 Generated with Claude Code