Add B200 config: qwen3.5-fp4-sglang-mtp #1075
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
```
```shell
set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
```
🔴 The new qwen3.5_fp4_b200_mtp.sh script is missing SGLANG_ENABLE_SPEC_V2=1 in its server launch command, which is present in most comparable MTP scripts (qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh). Without this env var, SGLang may fall back to the older Spec V1 speculative decoding code path, producing non-representative MTP benchmark numbers. Add SGLANG_ENABLE_SPEC_V2=1 before the PYTHONNOUSERSITE=1 python3 -m sglang.launch_server invocation to match the established pattern.
Extended reasoning...
What the bug is and how it manifests
The server launch command in benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh (lines 56–75) starts with PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ... but is missing the SGLANG_ENABLE_SPEC_V2=1 environment variable prefix. SGLang's speculative decoding has two implementations: the newer, optimized Spec V2 path and the older Spec V1 path. Without the env var, SGLang defaults to Spec V1, producing suboptimal MTP throughput numbers.
The specific code path that triggers it
Lines 56–75 of the new script launch the SGLang server with EAGLE flags (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) but omit SGLANG_ENABLE_SPEC_V2=1. Compare with qwen3.5_fp8_h200_mtp.sh line 38 and qwen3.5_fp8_b300_mtp.sh line 34, both of which prepend SGLANG_ENABLE_SPEC_V2=1 to identical launch patterns.
Why existing code doesn't prevent it
The script was created by copying the non-MTP base script qwen3.5_fp4_b200.sh and adding EAGLE flags — but the non-MTP base has no need for SGLANG_ENABLE_SPEC_V2=1, so the env var was never present to copy over. There is no linting or template enforcement to catch this class of omission.
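Such an omission could be caught mechanically. As a sketch only — this guard does not exist in the repo, and the function name is hypothetical — a grep-based check over the MTP launch scripts might look like:

```shell
# Hypothetical CI guard (not in the repo): flag any *_mtp.sh launch script
# that enables EAGLE speculation but never sets SGLANG_ENABLE_SPEC_V2=1.
check_mtp_spec_v2() {
  # $1: directory containing the launch scripts, e.g. benchmarks/single_node
  dir=$1
  status=0
  for script in "$dir"/*_mtp.sh; do
    [ -e "$script" ] || continue  # no matches: the glob stays literal
    if grep -q 'speculative-algorithm EAGLE' "$script" &&
       ! grep -q 'SGLANG_ENABLE_SPEC_V2=1' "$script"; then
      echo "missing SGLANG_ENABLE_SPEC_V2=1: $script"
      status=1
    fi
  done
  return $status
}
```

Run against `benchmarks/single_node/`, this would have flagged the new script (and `qwen3.5_fp8_b200_mtp.sh`) while passing the four scripts that already set the variable.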
Addressing the refutation
One verifier noted that qwen3.5_fp8_b200_mtp.sh also lacks SGLANG_ENABLE_SPEC_V2=1, suggesting this may be an intentional pattern for same-hardware (B200) scripts or that newer nightly images default to Spec V2. This is a fair observation. However: (1) there is no documented evidence that the nightly-dev-20260402 image enables Spec V2 by default; (2) PR #1017 was explicitly created to retroactively add this var to qwen3.5_fp8_h200_mtp.sh after it was missed, establishing that its omission produces non-representative results and must be corrected; (3) four of the six single-node MTP SGLang scripts include it. The safer and more consistent choice is to explicitly set the env var.
Step-by-step proof
- Runner picks up the `qwen3.5-fp4-b200-sglang-mtp` config (`spec-decoding: mtp` → selects `qwen3.5_fp4_b200_mtp.sh`).
- Script executes: `PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ... --speculative-algorithm EAGLE ...`
- Because `SGLANG_ENABLE_SPEC_V2` is unset, SGLang initializes the legacy Spec V1 speculative decoding engine.
- Benchmark runs against a server using Spec V1 instead of the optimized Spec V2, yielding lower (non-representative) MTP throughput numbers compared to the H200 and B300 MTP configs.
How to fix it
Prefix the launch command with SGLANG_ENABLE_SPEC_V2=1, matching the pattern in qwen3.5_fp8_b300_mtp.sh line 34:
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
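As an aside on why the prefix form works: in POSIX shell, a `VAR=value` prefix exports the variable only into the environment of the single command it precedes, so it gates the server process without leaking into the rest of the script. A minimal standalone illustration (not taken from the repo):

```shell
# A VAR=value prefix applies only to the one command it precedes;
# the invoking shell's environment is untouched afterwards.
unset SGLANG_ENABLE_SPEC_V2  # ensure a clean starting state for the demo
inside=$(SGLANG_ENABLE_SPEC_V2=1 sh -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"')
after=${SGLANG_ENABLE_SPEC_V2:-unset}
echo "inside child: $inside"   # the child process sees the variable: 1
echo "after: $after"           # the script itself does not: unset
```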
```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```
🔴 The new qwen3.5_fp4_b200_mtp.sh omits --use-chat-template from its run_benchmark_serving invocation (lines 86-96), which is required for all MTP/EAGLE speculative decoding scripts. Without this flag, random token sequences are used instead of chat-formatted prompts, causing EAGLE acceptance rates and throughput numbers to be artificially inflated relative to real-world workloads. Add --use-chat-template to the run_benchmark_serving call, consistent with every other MTP script in the repo.
Extended reasoning...
What the bug is and how it manifests
benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh enables EAGLE speculative decoding via --speculative-algorithm EAGLE on the server, but the corresponding run_benchmark_serving call (lines 86–96) does not pass --use-chat-template. This means the benchmark client sends raw random token sequences rather than chat-formatted prompts to the server.
The specific code path that triggers it
The `run_benchmark_serving` call in the new script is:

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```

The flag `--use-chat-template` is absent. Every other MTP benchmark script in the repo includes it: `qwen3.5_fp8_b200_mtp.sh` (line 91), `qwen3.5_fp8_h200_mtp.sh` (line 82), `qwen3.5_fp8_b300_mtp.sh` (line 77), `dsr1_fp8_b200_mtp.sh` (line 113), `dsr1_fp8_b300_mtp.sh` (line 77).
Why existing code does not prevent it
The non-MTP baseline qwen3.5_fp4_b200.sh also lacks --use-chat-template, which is acceptable for a non-speculative script because acceptance rates are not relevant there. The author appears to have copied the non-MTP script and added the EAGLE server flags without applying the MTP-specific benchmark client correction documented in PR #647.
What the impact would be
EAGLE speculative decoding acceptance rates are highly sensitive to prompt distribution. The draft model is trained on chat-formatted sequences; feeding it raw random tokens produces token distributions that are much easier to speculate on, inflating acceptance rates and derived throughput figures. The benchmark results for qwen3.5-fp4-b200-sglang-mtp will not reflect real-world chat workload performance, potentially overstating the benefit of MTP for this config.
Step-by-step proof
- The new config `qwen3.5-fp4-b200-sglang-mtp` launches the SGLang server with `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- The benchmark client is invoked via `run_benchmark_serving` without `--use-chat-template`.
- Without this flag, `benchmark_serving.py` generates synthetic random token prompts rather than applying the model's chat template.
- Random token inputs are far from the EAGLE draft model's training distribution, causing the draft model to produce tokens that match the target model's output at an artificially high rate.
- The measured acceptance rate and tokens/second throughput will be inflated compared to actual chat workloads.
- By contrast, running with `--use-chat-template` (as all other MTP scripts do) uses properly formatted chat prompts that reflect real usage patterns.
How to fix it
Add --use-chat-template to the run_benchmark_serving invocation in benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh, matching the pattern used by every other MTP script in the repo.
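A sketch of the corrected invocation — identical to the submitted call with the one flag added (its position among the arguments is chosen arbitrarily here, since `run_benchmark_serving` is a repo helper whose argument order is assumed not to matter):

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --use-chat-template \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```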
Mirrors the existing qwen3.5-fp4-b200-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed e21f5b8 to 44a190a
Summary
- New `qwen3.5-fp4-b200-sglang-mtp` config mirroring the existing `qwen3.5-fp4-b200-sglang` non-MTP recipe, plus a new `benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh` launch script.
- Server launch adds `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Config sets `spec-decoding: mtp` so the runner picks up the `_mtp.sh` variant.
- Adds a `perf-changelog.yaml` entry (PR link placeholder; update after merge per AGENTS.md).

Test plan
- `bash -n benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh` — bash syntax OK.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml` — emits 12 entries (2 ISL/OSL × 6 concurrencies) with `spec-decoding=mtp`.

🤖 Generated with Claude Code