Add B200 config: qwen3.5-bf16-sglang-mtp #1074
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```
🔴 The run_benchmark_serving call in qwen3.5_bf16_b200_mtp.sh is missing the --use-chat-template flag, which is present in every other MTP benchmark script in the codebase. Without this flag, the benchmark client sends raw prompts instead of chat-formatted ones, causing EAGLE speculative draft tokens to match more easily and artificially inflating the reported MTP acceptance rate and throughput numbers.
Extended reasoning...
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh omits --use-chat-template from its run_benchmark_serving call (lines 77–87). This flag controls whether the benchmark client wraps prompts in the model's chat template before sending them to the server. When absent, raw random prompts are sent, which have a very different token distribution than actual chat-formatted prompts.
The specific code path that triggers it
The run_benchmark_serving function (defined in benchmark_lib.sh) conditionally applies the chat template when --use-chat-template is passed. Without the flag, the benchmark sends bare text continuations — the exact format that EAGLE speculative decoding is most likely to predict accurately, because the draft model was trained on structured chat sequences but random prompt continuations happen to share short token n-grams.
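The silent-when-absent behavior described above can be sketched as a minimal shell function. This is a hypothetical illustration, not the actual `run_benchmark_serving` from `benchmark_lib.sh` (which is not shown in this PR); it only demonstrates how an optional flag changes prompt formatting without raising any error when omitted:

```shell
#!/bin/sh
# Hypothetical sketch: an optional flag gates prompt formatting.
# Omitting it produces no error -- prompts simply go out raw.
run_benchmark_serving() {
    use_chat_template=0
    for arg in "$@"; do
        if [ "$arg" = "--use-chat-template" ]; then
            use_chat_template=1
        fi
    done
    if [ "$use_chat_template" -eq 1 ]; then
        echo "prompt-format=chat"
    else
        echo "prompt-format=raw"
    fi
}

run_benchmark_serving --model demo                      # prompt-format=raw
run_benchmark_serving --model demo --use-chat-template  # prompt-format=chat
```

Because the flag is opt-in, nothing in the script fails when it is missing; the benchmark just measures a different workload.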
Why existing code doesn't prevent it
The flag is optional in benchmark_lib.sh, so the script runs without error. There is no validation requiring MTP scripts to include this flag. The author appears to have copied from the non-MTP base script (qwen3.5_bf16_b200.sh), which also lacks --use-chat-template, and forgot to add the MTP-required flag. This is confirmed by the verifiers: the BF16 non-MTP base is one of the few scripts in the repo missing this flag, while every other MTP variant (qwen3.5_fp8_b200_mtp.sh:91, qwen3.5_fp8_h200_mtp.sh:82, qwen3.5_fp8_b300_mtp.sh:77, dsr1_fp8_b200_mtp.sh:113, dsr1_fp8_b300_mtp.sh:117, dsr1_fp4_b200_trt_mtp.sh:135, etc.) includes it.
Impact
PR #647 explicitly documented this rationale when adding the flag: 'Without this arg, MTP acceptance rates are artificially high.' Without --use-chat-template, the reported MTP acceptance rate will be inflated, making the speculative decoding look more effective than it really is. This renders the benchmark results misleading — the reported throughput numbers will not reflect real-world performance where users send chat-formatted requests.
How to fix it
Add --use-chat-template \ to the run_benchmark_serving call in qwen3.5_bf16_b200_mtp.sh, following the pattern from every other MTP script in the repository.
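Concretely, the corrected call would look like the following. This is a sketch assembled from the diff quoted above plus the flag; the exact position of the flag among the arguments is a stylistic assumption:

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --use-chat-template \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```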
Step-by-step proof
- The benchmark client is called without `--use-chat-template` (lines 77–87 of the new file)
- `benchmark_lib.sh` only applies the chat template when this flag is explicitly passed
- The random dataset prompts are sent as raw text to the SGLang server
- The EAGLE draft model predicts tokens for a raw prompt continuation — a distribution that is simpler than structured chat continuations
- Draft token acceptance rate is measured as higher than it would be with chat-formatted prompts
- The throughput numbers reported for this MTP config are inflated relative to what would be observed in production use
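To make the distribution difference in the steps above concrete, here is an illustrative sketch of what chat formatting adds around a raw prompt. The ChatML-style markers shown are the format used by Qwen-family models; whether the benchmark client applies exactly this template is an assumption here:

```shell
#!/bin/sh
# Illustrative only: wrap a raw prompt in a ChatML-style chat template.
# The wrapped string surrounds the prompt with special tokens that the
# draft model must also predict, so acceptance behavior differs from
# bare-text continuations.
wrap_chat_template() {
    printf '<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' "$1"
}

wrap_chat_template "Summarize the benchmark results."
```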
Addressing the duplication concern
One verifier flagged this as a duplicate of bug_002. Whether or not bug_002 covers the same finding, the underlying bug is unambiguously real — the omission is confirmed by three independent verifiers and the codebase-wide pattern is definitive. The bug should be fixed regardless of which report is canonical.
Mirrors the existing qwen3.5-bf16-b200-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
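The speculative-decoding parameters named in the commit message map onto SGLang server flags. A sketch of how the MTP launch plausibly extends the non-MTP base follows; only the four `--speculative-*` flags come from this PR, while the `launch_server` invocation and variable names are assumptions:

```shell
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --port "$PORT" \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```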
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 8781143 to 214602e.

Summary

- Adds a `qwen3.5-bf16-b200-sglang-mtp` config mirroring the existing `qwen3.5-bf16-b200-sglang` non-MTP recipe, plus a new `benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh` launch script.
- Enables EAGLE speculative decoding via `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Sets `spec-decoding: mtp` so the runner picks up the `_mtp.sh` variant.
- Adds a `perf-changelog.yaml` entry to trigger the sweep (PR link is a placeholder and will be updated after merge per AGENTS.md).

Test plan
- `python3 -c "import yaml; yaml.safe_load(open('.github/configs/nvidia-master.yaml'))"` — YAML parses.
- `python3 -c "import yaml; yaml.safe_load(open('perf-changelog.yaml'))"` — YAML parses.
- `bash -n benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh` — bash syntax OK.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml` — emits 10 entries (2 ISL/OSL × 5 concurrencies) with spec-decoding=mtp.

🤖 Generated with Claude Code