
Add B300 config: qwen3.5-bf16-sglang-mtp#1082

Merged
functionstackx merged 1 commit into main from claude/add-qwen3.5-bf16-b300-mtp
Apr 18, 2026

Conversation

@functionstackx
Contributor

Summary

  • Adds qwen3.5-bf16-b300-sglang-mtp config mirroring the existing qwen3.5-bf16-b300-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh launch script.
  • Adds EAGLE speculative decoding flags on top of the non-MTP script: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
  • Passes --use-chat-template to run_benchmark_serving per the AGENTS.md requirement for all MTP scripts.
  • Search space rows carry spec-decoding: mtp so the runner picks up the _mtp.sh variant.
  • perf-changelog.yaml diff is append-only (no modifications to any existing line).
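The speculative-decoding additions above can be sketched as a flag bundle layered onto the non-MTP launch command. This is a hypothetical illustration, not the script's actual contents: the flag values come from the PR summary, while the server module path shown and everything elided with `...` are placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of the MTP additions; flag values are from the PR
# summary, the rest of the launch command is a placeholder.
SPEC_FLAGS="--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4"

# The real script launches sglang.launch_server with many more arguments;
# here we only print the command to show where the flags attach.
echo "python3 -m sglang.launch_server ... $SPEC_FLAGS"
```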

Test plan

  • YAML parses for both master config and perf-changelog.
  • bash -n benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh — bash syntax OK.
  • git diff perf-changelog.yaml shows only additions.
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml — emits 20 entries (2 ISL/OSL × 2 TP variants × 5 concurrencies) with spec-decoding=mtp.
  • CI sweep passes on B300.
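The 20-entry expectation in the sweep check is just the product of the sweep dimensions. A trivial sketch of that arithmetic (the dimension names mirror the test plan; nothing here touches the real generator):

```shell
#!/bin/sh
# 2 ISL/OSL pairs x 2 TP variants x 5 concurrencies = 20 sweep entries.
isl_osl_pairs=2
tp_variants=2
concurrencies=5
expected=$((isl_osl_pairs * tp_variants * concurrencies))
echo "expected sweep entries: $expected"
```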

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.


Comment on lines +23 to +26
export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1
Contributor


🔴 The new qwen3.5_bf16_b300_mtp.sh script is missing the SGLANG_ENABLE_SPEC_V2=1 environment variable that every other SGLang EAGLE MTP benchmark script in the repo includes, causing SGLang to fall back to the slower V1 speculative decoding path. This degrades MTP throughput and may produce inflated acceptance rates, making benchmark results non-comparable to the FP8 B300 MTP config. Fix by prepending SGLANG_ENABLE_SPEC_V2=1 to the server launch line (line 55) alongside the existing PYTHONNOUSERSITE=1 prefix.

Extended reasoning...

What the bug is and how it manifests

The new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh script adds EAGLE speculative decoding flags (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) to the SGLang server launch command at line 55, but omits the SGLANG_ENABLE_SPEC_V2=1 environment variable. Without this flag, SGLang selects its legacy V1 speculative decoding code path even when EAGLE arguments are supplied.

The specific code path that triggers it

Line 55 of the new script reads:

PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The directly analogous script for the same hardware and same image (qwen3.5_fp8_b300_mtp.sh, PR #1035, line 34) reads:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The same pattern appears in dsr1_fp8_b200_mtp.sh (line 57), dsr1_fp8_b300_mtp.sh (line 61), and qwen3.5_fp8_h200_mtp.sh (line 38). All four scripts using SGLang with EAGLE on recent images include the flag; the new BF16 B300 script is the sole outlier.

Why existing code does not prevent it

The EAGLE speculative decoding CLI flags are passed correctly — SGLANG_ENABLE_SPEC_V2 is a separate runtime toggle that must be exported as an env var before the Python process starts. SGLang does not error or warn when the flag is absent; it silently downgrades to the V1 path, so there is no automatic signal that the script is misconfigured.

What the impact would be

Benchmark runs will use SGLang's slower, less optimised V1 speculation path. PR #1017 was a dedicated follow-up fix titled 'Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP', demonstrating that the omission has a real, documented performance impact. Additionally, the V1 path can produce artificially high speculative acceptance rates, meaning the reported MTP numbers would not be comparable to the FP8 B300 MTP config that does use V2.

How to fix it

Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern in qwen3.5_fp8_b300_mtp.sh:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

Alternatively, add export SGLANG_ENABLE_SPEC_V2=1 alongside the other export statements at lines 23-26.

Step-by-step proof

  1. The script exports NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true, and PYTHONUNBUFFERED=1 (lines 23-26), but not SGLANG_ENABLE_SPEC_V2.
  2. At line 55 the server is launched with PYTHONNOUSERSITE=1 python3 -m sglang.launch_server — no SGLANG_ENABLE_SPEC_V2=1 prefix.
  3. SGLang checks this env var at startup to decide which speculation engine to use; since it is unset (defaults to 0/false), it activates the V1 path.
  4. The four comparable MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh) all set the flag, so their results are produced by the V2 path.
  5. Any throughput or acceptance-rate comparison between the new BF16 B300 MTP config and existing MTP configs will therefore compare V1 results against V2 results — an apples-to-oranges comparison that corrupts benchmark conclusions.
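The outlier check in step 4 can be reproduced mechanically with `grep -L`, which prints the files that do *not* contain a pattern. The sketch below fabricates two stand-in scripts in a temp directory so it is self-contained; the real repository paths and script contents are assumed from the review above, not verified here.

```shell
#!/bin/sh
# Hypothetical audit: list MTP launch scripts whose server line lacks the
# SGLANG_ENABLE_SPEC_V2=1 prefix. Uses fabricated stand-ins, not repo files.
dir=$(mktemp -d)
printf 'SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
  > "$dir/qwen3.5_fp8_b300_mtp.sh"
printf 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
  > "$dir/qwen3.5_bf16_b300_mtp.sh"

# grep -L prints files with no match, i.e. the scripts missing the flag.
missing=$(grep -L 'SGLANG_ENABLE_SPEC_V2' "$dir"/*_mtp.sh)
echo "missing flag: $missing"
rm -rf "$dir"
```

Run against the actual `benchmarks/single_node/` directory, the same one-liner would flag only the new BF16 script if the review's claim holds.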

Mirrors the existing qwen3.5-bf16-b300-sglang non-MTP recipe and adds
EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)
via the standard spec-decoding=mtp suffix. Script also passes
--use-chat-template to run_benchmark_serving, as required by AGENTS.md
for all MTP configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-qwen3.5-bf16-b300-mtp branch from 4f176c1 to f33c2d7 on April 18, 2026 at 03:17
@functionstackx functionstackx merged commit b76876a into main Apr 18, 2026
3 checks passed
@functionstackx functionstackx deleted the claude/add-qwen3.5-bf16-b300-mtp branch April 18, 2026 03:17