Add B300 config: qwen3.5-bf16-sglang-mtp (#1082)
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1
🔴 The new qwen3.5_bf16_b300_mtp.sh script is missing the SGLANG_ENABLE_SPEC_V2=1 environment variable that every other SGLang EAGLE MTP benchmark script in the repo includes, causing SGLang to fall back to the slower V1 speculative decoding path. This degrades MTP throughput and may produce inflated acceptance rates, making benchmark results non-comparable to the FP8 B300 MTP config. Fix by prepending SGLANG_ENABLE_SPEC_V2=1 to the server launch line (line 55) alongside the existing PYTHONNOUSERSITE=1 prefix.
Extended reasoning
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh script adds EAGLE speculative decoding flags (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) to the SGLang server launch command at line 55, but omits the SGLANG_ENABLE_SPEC_V2=1 environment variable. Without this flag, SGLang selects its legacy V1 speculative decoding code path even when EAGLE arguments are supplied.
The specific code path that triggers it
Line 55 of the new script reads:
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The directly analogous script for the same hardware and same image (qwen3.5_fp8_b300_mtp.sh, PR #1035, line 34) reads:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The same pattern appears in dsr1_fp8_b200_mtp.sh (line 57), dsr1_fp8_b300_mtp.sh (line 61), and qwen3.5_fp8_h200_mtp.sh (line 38). All four scripts using SGLang with EAGLE on recent images include the flag; the new BF16 B300 script is the sole outlier.
Why existing code does not prevent it
The EAGLE speculative decoding CLI flags are passed correctly — SGLANG_ENABLE_SPEC_V2 is a separate runtime toggle that must be exported as an env var before the Python process starts. SGLang does not error or warn when the flag is absent; it silently downgrades to the V1 path, so there is no automatic signal that the script is misconfigured.
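To make the failure mode concrete, here is a minimal shell sketch of an env-gated toggle that defaults silently. The variable name comes from the review; the branching logic is a hypothetical stand-in for SGLang's internal check, not its actual code:

```shell
# Simulate the new script, which never sets the toggle.
unset SGLANG_ENABLE_SPEC_V2

# Hypothetical sketch of an env-gated engine selection: when the variable
# is unset it defaults to 0, and the legacy path is chosen with no warning.
path="V1"
if [ "${SGLANG_ENABLE_SPEC_V2:-0}" = "1" ]; then
    path="V2"
fi
echo "selected speculative decoding path: $path"
```

Run as-is this selects V1; only an explicit SGLANG_ENABLE_SPEC_V2=1 in the environment flips it to V2, which is why the omission produces no visible error at startup.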
What the impact would be
Benchmark runs will use SGLang's slower, less optimised V1 speculation path. PR #1017 was a dedicated follow-up fix titled 'Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP', demonstrating that the omission has a real, documented performance impact. Additionally, the V1 path can produce artificially high speculative acceptance rates, meaning the reported MTP numbers would not be comparable to the FP8 B300 MTP config that does use V2.
How to fix it
Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern in qwen3.5_fp8_b300_mtp.sh:
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

Alternatively, add export SGLANG_ENABLE_SPEC_V2=1 alongside the other export statements at lines 23-26.
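Putting the two options together, the relevant lines of the fixed script could look like the following sketch. The EAGLE flags are the ones named in this review; the remaining server arguments (model path, ports, and so on) are elided, as in the original excerpts:

```shell
# Option A: export the toggle with the script's other environment variables.
export SGLANG_ENABLE_SPEC_V2=1

# Option B: prefix the launch line directly, matching qwen3.5_fp8_b300_mtp.sh.
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
    # ...remaining server arguments unchanged
```

Either form is sufficient on its own; using both is harmless but redundant.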
Step-by-step proof
- The script exports NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true, and PYTHONUNBUFFERED=1 (lines 23-26), but not SGLANG_ENABLE_SPEC_V2.
- At line 55 the server is launched with PYTHONNOUSERSITE=1 python3 -m sglang.launch_server, with no SGLANG_ENABLE_SPEC_V2=1 prefix.
- SGLang checks this env var at startup to decide which speculation engine to use; since it is unset (defaults to 0/false), it activates the V1 path.
- The four comparable MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh) all set the flag, so their results are produced by the V2 path.
- Any throughput or acceptance-rate comparison between the new BF16 B300 MTP config and existing MTP configs will therefore compare V1 results against V2 results, an apples-to-oranges comparison that corrupts benchmark conclusions.
Mirrors the existing qwen3.5-bf16-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. The script also passes --use-chat-template to run_benchmark_serving, as required by AGENTS.md for all MTP configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 4f176c1 to f33c2d7.
Summary
- qwen3.5-bf16-b300-sglang-mtp config mirroring the existing qwen3.5-bf16-b300-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh launch script.
- EAGLE speculative decoding flags: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
- Passes --use-chat-template to run_benchmark_serving per the AGENTS.md requirement for all MTP scripts.
- Sets spec-decoding: mtp so the runner picks up the _mtp.sh variant.
- The perf-changelog.yaml diff is append-only (no modifications to any existing line).

Test plan
- bash -n benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh: bash syntax OK.
- git diff perf-changelog.yaml shows only additions.
- python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml emits 20 entries (2 ISL/OSL × 2 TP variants × 5 concurrencies) with spec-decoding=mtp.

🤖 Generated with Claude Code