
Add B200 config: qwen3.5-bf16-sglang-mtp #1074

Merged
functionstackx merged 2 commits into main from claude/add-qwen3.5-bf16-b200-mtp
Apr 18, 2026
Conversation

@functionstackx
Contributor

Summary

  • Adds qwen3.5-bf16-b200-sglang-mtp config mirroring the existing qwen3.5-bf16-b200-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh launch script.
  • Adds EAGLE speculative decoding flags on top of the non-MTP script: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
  • Search space carries spec-decoding: mtp so the runner picks up the _mtp.sh variant.
  • Adds a perf-changelog.yaml entry to trigger the sweep (PR link is a placeholder and will be updated after merge per AGENTS.md).
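For reference, the speculative-decoding flags listed above can be sketched as they would be appended to the non-MTP server command line. This is a minimal illustration only; everything except the four `--speculative-*` flags (which come from this PR's summary) is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: the four EAGLE flags added on top of the non-MTP launch command.
# The backslash-newlines inside the double quotes are line continuations,
# so SPEC_DECODE_ARGS collapses to a single flag string.
SPEC_DECODE_ARGS="--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4"

echo "$SPEC_DECODE_ARGS"
```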

Test plan

  • python3 -c "import yaml; yaml.safe_load(open('.github/configs/nvidia-master.yaml'))" — YAML parses.
  • python3 -c "import yaml; yaml.safe_load(open('perf-changelog.yaml'))" — YAML parses.
  • bash -n benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh — bash syntax OK.
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml — emits 10 entries (2 ISL/OSL × 5 concurrencies) with spec-decoding=mtp.
  • CI sweep passes on B200.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +77 to +87
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
Contributor

🔴 The run_benchmark_serving call in qwen3.5_bf16_b200_mtp.sh is missing the --use-chat-template flag, which is present in every other MTP benchmark script in the codebase. Without this flag, the benchmark client sends raw prompts instead of chat-formatted ones, causing EAGLE speculative draft tokens to match more easily and artificially inflating the reported MTP acceptance rate and throughput numbers.

Extended reasoning...

What the bug is and how it manifests

The new benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh omits --use-chat-template from its run_benchmark_serving call (lines 77–87). This flag controls whether the benchmark client wraps prompts in the model's chat template before sending them to the server. When absent, raw random prompts are sent, which have a very different token distribution than actual chat-formatted prompts.

The specific code path that triggers it

The run_benchmark_serving function (defined in benchmark_lib.sh) conditionally applies the chat template when --use-chat-template is passed. Without the flag, the benchmark sends bare text continuations — the exact format that EAGLE speculative decoding is most likely to predict accurately, because the draft model was trained on structured chat sequences but random prompt continuations happen to share short token n-grams.
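benchmark_lib.sh itself is not shown in this PR, so its exact implementation is unknown; the pattern described above, where an optional flag reaches the client only when the caller passes it explicitly, can be sketched as follows (the function body and names here are hypothetical stand-ins, not the real library code):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of how an optional flag like --use-chat-template is
# typically threaded through a wrapper: it is forwarded only when passed
# explicitly, so omitting it in a caller fails silently rather than erroring.
run_benchmark_serving() {
    local chat_flag=""
    local arg
    for arg in "$@"; do
        if [ "$arg" = "--use-chat-template" ]; then
            chat_flag="--use-chat-template"
        fi
    done
    # Stand-in for the real client invocation: just report what was forwarded.
    echo "client args:${chat_flag:+ $chat_flag}"
}

run_benchmark_serving --model demo                      # flag absent: not forwarded
run_benchmark_serving --model demo --use-chat-template  # flag present: forwarded
```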

Why existing code doesn't prevent it

The flag is optional in benchmark_lib.sh, so the script runs without error. There is no validation requiring MTP scripts to include this flag. The author appears to have copied from the non-MTP base script (qwen3.5_bf16_b200.sh), which also lacks --use-chat-template, and forgot to add the MTP-required flag. This is confirmed by the verifiers: the BF16 non-MTP base is one of the few scripts in the repo missing this flag, while every other MTP variant (qwen3.5_fp8_b200_mtp.sh:91, qwen3.5_fp8_h200_mtp.sh:82, qwen3.5_fp8_b300_mtp.sh:77, dsr1_fp8_b200_mtp.sh:113, dsr1_fp8_b300_mtp.sh:117, dsr1_fp4_b200_trt_mtp.sh:135, etc.) includes it.

Impact

PR #647 explicitly documented this rationale when adding the flag: 'Without this arg, MTP acceptance rates are artificially high.' Without --use-chat-template, the reported MTP acceptance rate will be inflated, making the speculative decoding look more effective than it really is. This renders the benchmark results misleading — the reported throughput numbers will not reflect real-world performance where users send chat-formatted requests.

How to fix it

Add --use-chat-template \ to the run_benchmark_serving call in qwen3.5_bf16_b200_mtp.sh, following the pattern from every other MTP script in the repository.
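Concretely, the corrected call is the quoted diff with one added line. In the sketch below, the stub `run_benchmark_serving` and the variable values are placeholders so the fragment stands alone; only the added `--use-chat-template \` line is the actual fix.

```shell
#!/usr/bin/env bash
# Stub so this fragment runs standalone; the real function is defined in
# benchmark_lib.sh and actually launches the benchmark client.
run_benchmark_serving() { printf '%s\n' "$@"; }

# Placeholder values; the real script derives these from its sweep parameters.
MODEL=qwen3.5 PORT=30000 ISL=1024 OSL=1024 RANDOM_RANGE_RATIO=1.0 CONC=8
RESULT_FILENAME=result.json

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --use-chat-template \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```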

Step-by-step proof

  1. The benchmark client is called without --use-chat-template (lines 77–87 of the new file)
  2. benchmark_lib.sh only applies the chat template when this flag is explicitly passed
  3. The random dataset prompts are sent as raw text to the SGLang server
  4. The EAGLE draft model predicts tokens for a raw prompt continuation — a distribution that is simpler than structured chat continuations
  5. Draft token acceptance rate is measured as higher than it would be with chat-formatted prompts
  6. The throughput numbers reported for this MTP config are inflated relative to what would be observed in production use

Addressing the duplication concern

One verifier flagged this as a duplicate of bug_002. Whether or not bug_002 covers the same finding, the underlying bug is unambiguously real — the omission is confirmed by three independent verifiers and the codebase-wide pattern is definitive. The bug should be fixed regardless of which report is canonical.

@functionstackx
Copy link
Copy Markdown
Contributor Author

@kedarpotdar-nv @Ankur-singh fyi

functionstackx and others added 2 commits April 17, 2026 22:29
Mirrors the existing qwen3.5-bf16-b200-sglang non-MTP recipe and adds
EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)
via the standard spec-decoding=mtp suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the
client-side prompts match what the model was trained to predict; without
it the spec-decoding quality regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx force-pushed the claude/add-qwen3.5-bf16-b200-mtp branch from 8781143 to 214602e on April 18, 2026 02:29
functionstackx merged commit 7e3f6ac into main Apr 18, 2026
17 checks passed
functionstackx deleted the claude/add-qwen3.5-bf16-b200-mtp branch April 18, 2026 02:31