Add B300 config: qwen3.5-fp4-sglang-mtp #1083
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. Much of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --mamba-scheduler-strategy no_buffer \
    --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
🔴 The new qwen3.5_fp4_b300_mtp.sh script is missing SGLANG_ENABLE_SPEC_V2=1 before the python3 -m sglang.launch_server invocation. Without this flag, EAGLE speculative decoding will fall back to the older spec v1 code path, producing inaccurate or suboptimal benchmark results — add SGLANG_ENABLE_SPEC_V2=1 as an inline env var prefix before PYTHONNOUSERSITE=1 python3 on line 62.
Extended reasoning...
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh launches the SGLang server at line 62 with:
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
It omits the SGLANG_ENABLE_SPEC_V2=1 env-var prefix that every other MTP/EAGLE script in the repository includes. Without this flag, SGLang selects the older speculative-decoding v1 code path even though the EAGLE algorithm requires the v2 path.
The specific code path that triggers it
Every other MTP benchmark script sets the flag inline before the python3 invocation:
- qwen3.5_fp8_b300_mtp.sh line 34: SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
- qwen3.5_fp8_h200_mtp.sh line 38: SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server ...
- dsr1_fp8_b200_mtp.sh line 57: SGLANG_ENABLE_SPEC_V2=1 ...
- dsr1_fp8_b300_mtp.sh line 61: SGLANG_ENABLE_SPEC_V2=1 ...
This PR's script is the only MTP launch script in the repo that omits it.
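The "only script that omits it" claim can be checked mechanically with `grep -L`. A minimal sketch: the file names below mirror the repo's scripts, but the contents are illustrative stubs written into a temp dir, not the real launch scripts.

```shell
#!/usr/bin/env bash
# Audit MTP launch scripts for the spec-v2 flag using stub files.
tmp=$(mktemp -d)
printf 'SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
  > "$tmp/qwen3.5_fp8_b300_mtp.sh"
printf 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
  > "$tmp/qwen3.5_fp4_b300_mtp.sh"

# grep -L lists files that do NOT contain the pattern, i.e. the offenders.
grep -L 'SGLANG_ENABLE_SPEC_V2=1' "$tmp"/*_mtp.sh
```

Against the actual repo the equivalent one-liner would be `grep -L 'SGLANG_ENABLE_SPEC_V2=1' benchmarks/single_node/*_mtp.sh`.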
Why existing code doesn't prevent it
There is no global export of SGLANG_ENABLE_SPEC_V2 in benchmark_lib.sh or the container entrypoint; each script is responsible for setting it inline. The bash syntax check (bash -n) listed in the test plan confirms only syntax validity, not correctness of env vars. The omission silently degrades behavior at runtime.
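Since nothing enforces the flag centrally, one option is a fail-fast guard that MTP scripts could call before launching. This is purely hypothetical: `require_spec_v2` does not exist in `benchmark_lib.sh` today.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of benchmark_lib.sh): refuse to launch an
# MTP benchmark unless the spec-v2 env var has been set inline.
require_spec_v2() {
  if [ "${SGLANG_ENABLE_SPEC_V2:-0}" != "1" ]; then
    echo "error: MTP scripts must set SGLANG_ENABLE_SPEC_V2=1" >&2
    return 1
  fi
}
```

A script would then call `require_spec_v2 || exit 1` just before the `python3 -m sglang.launch_server` line, turning the silent runtime degradation into an immediate launch failure.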
What the impact would be
SGLang v0.5.10.post1-cu130 requires SGLANG_ENABLE_SPEC_V2=1 for EAGLE speculative decoding to use the optimised v2 scheduler. Without it, the server runs the v1 speculative path, which yields lower acceptance rates and reduced throughput — meaning all benchmark numbers (tokens/s, TTFT, ITL) collected under this config will be unrepresentative of the intended MTP configuration. The perf-changelog entry for PR #1017 explicitly documents this requirement: "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" because EAGLE requires spec v2.
How to fix it
Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern of all other MTP scripts:
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL ...

Step-by-step proof
- The YAML config (.github/configs/nvidia-master.yaml) marks all search-space entries with spec-decoding: mtp, meaning the runner selects this _mtp.sh variant specifically to exercise EAGLE speculative decoding.
- The script passes --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 to the server, confirming EAGLE is intended.
- However, without SGLANG_ENABLE_SPEC_V2=1, SGLang's internal feature flag for the v2 speculative scheduler remains false.
- SGLang falls back to the v1 path: the EAGLE draft model still runs, but the v1 scheduler does not handle EAGLE's multi-token acceptance correctly, leading to degraded throughput and inaccurate acceptance-rate telemetry.
- Any benchmark result filed under this config will therefore underrepresent true MTP performance, the exact issue PR [NV] Update: sglang v2 Qwen3.5 h200 MTP #1017 was created to fix for the FP8 H200 MTP script.
Mirrors the existing qwen3.5-fp4-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. The script also passes --use-chat-template to run_benchmark_serving, as required by AGENTS.md for all MTP configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
702a778 to d02bedc
Summary
- New qwen3.5-fp4-b300-sglang-mtp config mirroring the existing qwen3.5-fp4-b300-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh launch script.
- EAGLE speculative decoding enabled via --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
- Passes --use-chat-template to run_benchmark_serving per the AGENTS.md requirement for all MTP scripts.
- Search-space entries marked spec-decoding: mtp so the runner picks up the _mtp.sh variant.
- The perf-changelog.yaml diff is append-only (no modifications to any existing line).

Test plan
- bash -n benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh — bash syntax OK.
- git diff perf-changelog.yaml shows only additions.
- python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml — emits 24 entries (2 ISL/OSL × 2 search-space rows × 6 concurrencies) with spec-decoding=mtp.

🤖 Generated with Claude Code
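The 24-entry count in the test plan is simple factor arithmetic; a quick sanity check, with the factor names taken from the test plan itself rather than from generate_sweep_configs.py internals:

```shell
#!/usr/bin/env bash
# 2 ISL/OSL pairs x 2 search-space rows x 6 concurrencies = expected sweep size
isl_osl=2
rows=2
concurrencies=6
total=$((isl_osl * rows * concurrencies))
echo "$total"   # prints 24
```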