Add B300 config: qwen3.5-fp4-sglang-mtp #1083
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. Much of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --mamba-scheduler-strategy no_buffer \
    --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
🔴 The new qwen3.5_fp4_b300_mtp.sh script is missing SGLANG_ENABLE_SPEC_V2=1 before the python3 -m sglang.launch_server invocation. Without this flag, EAGLE speculative decoding will fall back to the older spec v1 code path, producing inaccurate or suboptimal benchmark results — add SGLANG_ENABLE_SPEC_V2=1 as an inline env var prefix before PYTHONNOUSERSITE=1 python3 on line 62.
Extended reasoning...
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh launches the SGLang server at line 62 with:
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
It omits the SGLANG_ENABLE_SPEC_V2=1 env-var prefix that every other MTP/EAGLE script in the repository includes. Without this flag, SGLang selects the older speculative-decoding v1 code path even though the EAGLE algorithm requires the v2 path.
The specific code path that triggers it
Every other MTP benchmark script sets the flag inline before the python3 invocation:
- qwen3.5_fp8_b300_mtp.sh line 34: SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
- qwen3.5_fp8_h200_mtp.sh line 38: SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server ...
- dsr1_fp8_b200_mtp.sh line 57: SGLANG_ENABLE_SPEC_V2=1 ...
- dsr1_fp8_b300_mtp.sh line 61: SGLANG_ENABLE_SPEC_V2=1 ...
This PR's script is the only MTP launch script in the repo that omits it.
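The "only script that omits it" claim can be checked mechanically with `grep -L`. A minimal sketch: the file names below mirror the repo's scripts, but the contents are illustrative stubs written into a temp dir, not the real launch scripts.

```shell
#!/usr/bin/env bash
# Audit MTP launch scripts for the spec-v2 flag using stub files.
tmp=$(mktemp -d)
printf 'SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
  > "$tmp/qwen3.5_fp8_b300_mtp.sh"
printf 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
  > "$tmp/qwen3.5_fp4_b300_mtp.sh"

# grep -L lists files that do NOT contain the pattern, i.e. the offenders.
grep -L 'SGLANG_ENABLE_SPEC_V2=1' "$tmp"/*_mtp.sh
```

Against the actual repo the equivalent one-liner would be `grep -L 'SGLANG_ENABLE_SPEC_V2=1' benchmarks/single_node/*_mtp.sh`.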
Why existing code doesn't prevent it
There is no global export of SGLANG_ENABLE_SPEC_V2 in benchmark_lib.sh or the container entrypoint; each script is responsible for setting it inline. The bash syntax check (bash -n) listed in the test plan confirms only syntax validity, not correctness of env vars. The omission silently degrades behavior at runtime.
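Since nothing enforces the flag centrally, one option is a fail-fast guard that MTP scripts could call before launching. This is purely hypothetical: `require_spec_v2` does not exist in `benchmark_lib.sh` today.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of benchmark_lib.sh): refuse to launch an
# MTP benchmark unless the spec-v2 env var has been set inline.
require_spec_v2() {
  if [ "${SGLANG_ENABLE_SPEC_V2:-0}" != "1" ]; then
    echo "error: MTP scripts must set SGLANG_ENABLE_SPEC_V2=1" >&2
    return 1
  fi
}
```

A script would then call `require_spec_v2 || exit 1` just before the `python3 -m sglang.launch_server` line, turning the silent runtime degradation into an immediate launch failure.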
What the impact would be
SGLang v0.5.10.post1-cu130 requires SGLANG_ENABLE_SPEC_V2=1 for EAGLE speculative decoding to use the optimised v2 scheduler. Without it, the server runs the v1 speculative path, which yields lower acceptance rates and reduced throughput — meaning all benchmark numbers (tokens/s, TTFT, ITL) collected under this config will be unrepresentative of the intended MTP configuration. The perf-changelog entry for PR #1017 explicitly documents this requirement: "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" because EAGLE requires spec v2.
How to fix it
Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern of all other MTP scripts:
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL ...

Step-by-step proof
- The YAML config (.github/configs/nvidia-master.yaml) marks all search-space entries with spec-decoding: mtp, meaning the runner selects this _mtp.sh variant specifically to exercise EAGLE speculative decoding.
- The script passes --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 to the server, confirming EAGLE is intended.
- However, without SGLANG_ENABLE_SPEC_V2=1, SGLang's internal feature flag for the v2 speculative scheduler remains false.
- SGLang falls back to the v1 path: the EAGLE draft model still runs, but the v1 scheduler does not handle EAGLE's multi-token acceptance correctly, leading to degraded throughput and inaccurate acceptance-rate telemetry.
- Any benchmark result filed under this config will therefore underrepresent true MTP performance, the exact issue PR [NV] Update: sglang v2 Qwen3.5 h200 MTP #1017 was created to fix for the FP8 H200 MTP script.
Mirrors the existing qwen3.5-fp4-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. The script also passes --use-chat-template to run_benchmark_serving, as required by AGENTS.md for all MTP configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
702a778 to d02bedc
Summary
- New qwen3.5-fp4-b300-sglang-mtp config mirroring the existing qwen3.5-fp4-b300-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh launch script.
- EAGLE speculative decoding enabled via --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
- Passes --use-chat-template to run_benchmark_serving per the AGENTS.md requirement for all MTP scripts.
- Search-space entries marked spec-decoding: mtp so the runner picks up the _mtp.sh variant.
- The perf-changelog.yaml diff is append-only (no modifications to any existing line).

Test plan
- bash -n benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh — bash syntax OK.
- git diff perf-changelog.yaml shows only additions.
- python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml — emits 24 entries (2 ISL/OSL × 2 search-space rows × 6 concurrencies) with spec-decoding=mtp.

🤖 Generated with Claude Code
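The 24-entry count in the test plan is simple factor arithmetic; a quick sanity check, with the factor names taken from the test plan itself rather than from generate_sweep_configs.py internals:

```shell
#!/usr/bin/env bash
# 2 ISL/OSL pairs x 2 search-space rows x 6 concurrencies = expected sweep size
isl_osl=2
rows=2
concurrencies=6
total=$((isl_osl * rows * concurrencies))
echo "$total"   # prints 24
```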