Add B200 config: qwen3.5-bf16-sglang-mtp #1074
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```
🔴 The run_benchmark_serving call in qwen3.5_bf16_b200_mtp.sh is missing the --use-chat-template flag, which is present in every other MTP benchmark script in the codebase. Without this flag, the benchmark client sends raw prompts instead of chat-formatted ones, causing EAGLE speculative draft tokens to match more easily and artificially inflating the reported MTP acceptance rate and throughput numbers.
Extended reasoning...
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh omits --use-chat-template from its run_benchmark_serving call (lines 77–87). This flag controls whether the benchmark client wraps prompts in the model's chat template before sending them to the server. When absent, raw random prompts are sent, which have a very different token distribution than actual chat-formatted prompts.
The specific code path that triggers it
The run_benchmark_serving function (defined in benchmark_lib.sh) conditionally applies the chat template when --use-chat-template is passed. Without the flag, the benchmark sends bare text continuations — the exact format that EAGLE speculative decoding is most likely to predict accurately, because the draft model was trained on structured chat sequences but random prompt continuations happen to share short token n-grams.
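The silent-when-absent behavior described above can be sketched as a minimal shell function. This is a hypothetical illustration, not the actual `run_benchmark_serving` from `benchmark_lib.sh` (which is not shown in this PR); it only demonstrates how an optional flag changes prompt formatting without raising any error when omitted:

```shell
#!/bin/sh
# Hypothetical sketch: an optional flag gates prompt formatting.
# Omitting it produces no error -- prompts simply go out raw.
run_benchmark_serving() {
    use_chat_template=0
    for arg in "$@"; do
        if [ "$arg" = "--use-chat-template" ]; then
            use_chat_template=1
        fi
    done
    if [ "$use_chat_template" -eq 1 ]; then
        echo "prompt-format=chat"
    else
        echo "prompt-format=raw"
    fi
}

run_benchmark_serving --model demo                      # prompt-format=raw
run_benchmark_serving --model demo --use-chat-template  # prompt-format=chat
```

Because the flag is opt-in, nothing in the script fails when it is missing; the benchmark just measures a different workload.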
Why existing code doesn't prevent it
The flag is optional in benchmark_lib.sh, so the script runs without error. There is no validation requiring MTP scripts to include this flag. The author appears to have copied from the non-MTP base script (qwen3.5_bf16_b200.sh), which also lacks --use-chat-template, and forgot to add the MTP-required flag. This is confirmed by the verifiers: the BF16 non-MTP base is one of the few scripts in the repo missing this flag, while every other MTP variant (qwen3.5_fp8_b200_mtp.sh:91, qwen3.5_fp8_h200_mtp.sh:82, qwen3.5_fp8_b300_mtp.sh:77, dsr1_fp8_b200_mtp.sh:113, dsr1_fp8_b300_mtp.sh:117, dsr1_fp4_b200_trt_mtp.sh:135, etc.) includes it.
Impact
PR #647 explicitly documented this rationale when adding the flag: 'Without this arg, MTP acceptance rates are artificially high.' Without --use-chat-template, the reported MTP acceptance rate will be inflated, making the speculative decoding look more effective than it really is. This renders the benchmark results misleading — the reported throughput numbers will not reflect real-world performance where users send chat-formatted requests.
How to fix it
Add --use-chat-template \ to the run_benchmark_serving call in qwen3.5_bf16_b200_mtp.sh, following the pattern from every other MTP script in the repository.
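Concretely, the corrected call would look like the following. This is a sketch assembled from the diff quoted above plus the flag; the exact position of the flag among the arguments is a stylistic assumption:

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --use-chat-template \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```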
Step-by-step proof
- The benchmark client is called without `--use-chat-template` (lines 77–87 of the new file)
- `benchmark_lib.sh` only applies the chat template when this flag is explicitly passed
- The random dataset prompts are sent as raw text to the SGLang server
- The EAGLE draft model predicts tokens for a raw prompt continuation — a distribution that is simpler than structured chat continuations
- Draft token acceptance rate is measured as higher than it would be with chat-formatted prompts
- The throughput numbers reported for this MTP config are inflated relative to what would be observed in production use
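To make the distribution difference in the steps above concrete, here is an illustrative sketch of what chat formatting adds around a raw prompt. The ChatML-style markers shown are the format used by Qwen-family models; whether the benchmark client applies exactly this template is an assumption here:

```shell
#!/bin/sh
# Illustrative only: wrap a raw prompt in a ChatML-style chat template.
# The wrapped string surrounds the prompt with special tokens that the
# draft model must also predict, so acceptance behavior differs from
# bare-text continuations.
wrap_chat_template() {
    printf '<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' "$1"
}

wrap_chat_template "Summarize the benchmark results."
```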
Addressing the duplication concern
One verifier flagged this as a duplicate of bug_002. Whether or not bug_002 covers the same finding, the underlying bug is unambiguously real — the omission is confirmed by three independent verifiers and the codebase-wide pattern is definitive. The bug should be fixed regardless of which report is canonical.
Mirrors the existing qwen3.5-bf16-b200-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
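The speculative-decoding parameters named in the commit message map onto SGLang server flags. A sketch of how the MTP launch plausibly extends the non-MTP base follows; only the four `--speculative-*` flags come from this PR, while the `launch_server` invocation and variable names are assumptions:

```shell
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --port "$PORT" \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```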
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 8781143 to 214602e.

Summary

- Adds a `qwen3.5-bf16-b200-sglang-mtp` config mirroring the existing `qwen3.5-bf16-b200-sglang` non-MTP recipe, plus a new `benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh` launch script.
- Enables EAGLE speculative decoding via `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Sets `spec-decoding: mtp` so the runner picks up the `_mtp.sh` variant.
- Adds a `perf-changelog.yaml` entry to trigger the sweep (PR link is a placeholder and will be updated after merge per AGENTS.md).

Test plan
- `python3 -c "import yaml; yaml.safe_load(open('.github/configs/nvidia-master.yaml'))"` — YAML parses.
- `python3 -c "import yaml; yaml.safe_load(open('perf-changelog.yaml'))"` — YAML parses.
- `bash -n benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh` — bash syntax OK.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml` — emits 10 entries (2 ISL/OSL × 5 concurrencies) with spec-decoding=mtp.

🤖 Generated with Claude Code