Add B200 config: qwen3.5-fp4-sglang-mtp #1075
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
```
```shell
set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
```
🔴 The new qwen3.5_fp4_b200_mtp.sh script is missing SGLANG_ENABLE_SPEC_V2=1 in its server launch command, which is present in most comparable MTP scripts (qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh). Without this env var, SGLang may fall back to the older Spec V1 speculative decoding code path, producing non-representative MTP benchmark numbers. Add SGLANG_ENABLE_SPEC_V2=1 before the PYTHONNOUSERSITE=1 python3 -m sglang.launch_server invocation to match the established pattern.
Extended reasoning...
What the bug is and how it manifests
The server launch command in benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh (lines 56–75) starts with PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ... but is missing the SGLANG_ENABLE_SPEC_V2=1 environment variable prefix. SGLang's speculative decoding has two implementations: the newer, optimized Spec V2 path and the older Spec V1 path. Without the env var, SGLang defaults to Spec V1, producing suboptimal MTP throughput numbers.
The specific code path that triggers it
Lines 56–75 of the new script launch the SGLang server with EAGLE flags (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) but omit SGLANG_ENABLE_SPEC_V2=1. Compare with qwen3.5_fp8_h200_mtp.sh line 38 and qwen3.5_fp8_b300_mtp.sh line 34, both of which prepend SGLANG_ENABLE_SPEC_V2=1 to identical launch patterns.
Why existing code doesn't prevent it
The script was created by copying the non-MTP base script qwen3.5_fp4_b200.sh and adding EAGLE flags — but the non-MTP base has no need for SGLANG_ENABLE_SPEC_V2=1, so the env var was never present to copy over. There is no linting or template enforcement to catch this class of omission.
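Such an omission could be caught mechanically. As a sketch only — this guard does not exist in the repo, and the function name is hypothetical — a grep-based check over the MTP launch scripts might look like:

```shell
# Hypothetical CI guard (not in the repo): flag any *_mtp.sh launch script
# that enables EAGLE speculation but never sets SGLANG_ENABLE_SPEC_V2=1.
check_mtp_spec_v2() {
  # $1: directory containing the launch scripts, e.g. benchmarks/single_node
  dir=$1
  status=0
  for script in "$dir"/*_mtp.sh; do
    [ -e "$script" ] || continue  # no matches: the glob stays literal
    if grep -q 'speculative-algorithm EAGLE' "$script" &&
       ! grep -q 'SGLANG_ENABLE_SPEC_V2=1' "$script"; then
      echo "missing SGLANG_ENABLE_SPEC_V2=1: $script"
      status=1
    fi
  done
  return $status
}
```

Run against `benchmarks/single_node/`, this would have flagged the new script (and `qwen3.5_fp8_b200_mtp.sh`) while passing the four scripts that already set the variable.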
Addressing the refutation
One verifier noted that qwen3.5_fp8_b200_mtp.sh also lacks SGLANG_ENABLE_SPEC_V2=1, suggesting this may be an intentional pattern for same-hardware (B200) scripts or that newer nightly images default to Spec V2. This is a fair observation. However: (1) there is no documented evidence that the nightly-dev-20260402 image enables Spec V2 by default; (2) PR #1017 was explicitly created to retroactively add this var to qwen3.5_fp8_h200_mtp.sh after it was missed, establishing that its omission produces non-representative results and must be corrected; (3) four of the six single-node MTP SGLang scripts include it. The safer and more consistent choice is to explicitly set the env var.
Step-by-step proof
- Runner picks up the `qwen3.5-fp4-b200-sglang-mtp` config (`spec-decoding: mtp` → selects `qwen3.5_fp4_b200_mtp.sh`).
- Script executes: `PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ... --speculative-algorithm EAGLE ...`
- Because `SGLANG_ENABLE_SPEC_V2` is unset, SGLang initializes the legacy Spec V1 speculative decoding engine.
- Benchmark runs against a server using Spec V1 instead of the optimized Spec V2, yielding lower (non-representative) MTP throughput numbers compared to the H200 and B300 MTP configs.
How to fix it
Prefix the launch command with SGLANG_ENABLE_SPEC_V2=1, matching the pattern in qwen3.5_fp8_b300_mtp.sh line 34:
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
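As an aside on why the prefix form works: in POSIX shell, a `VAR=value` prefix exports the variable only into the environment of the single command it precedes, so it gates the server process without leaking into the rest of the script. A minimal standalone illustration (not taken from the repo):

```shell
# A VAR=value prefix applies only to the one command it precedes;
# the invoking shell's environment is untouched afterwards.
unset SGLANG_ENABLE_SPEC_V2  # ensure a clean starting state for the demo
inside=$(SGLANG_ENABLE_SPEC_V2=1 sh -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"')
after=${SGLANG_ENABLE_SPEC_V2:-unset}
echo "inside child: $inside"   # the child process sees the variable: 1
echo "after: $after"           # the script itself does not: unset
```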
```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```
🔴 The new qwen3.5_fp4_b200_mtp.sh omits --use-chat-template from its run_benchmark_serving invocation (lines 86-96), which is required for all MTP/EAGLE speculative decoding scripts. Without this flag, random token sequences are used instead of chat-formatted prompts, causing EAGLE acceptance rates and throughput numbers to be artificially inflated relative to real-world workloads. Add --use-chat-template to the run_benchmark_serving call, consistent with every other MTP script in the repo.
Extended reasoning...
What the bug is and how it manifests
benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh enables EAGLE speculative decoding via --speculative-algorithm EAGLE on the server, but the corresponding run_benchmark_serving call (lines 86–96) does not pass --use-chat-template. This means the benchmark client sends raw random token sequences rather than chat-formatted prompts to the server.
The specific code path that triggers it
The `run_benchmark_serving` call in the new script is:

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```

The flag `--use-chat-template` is absent. Every other MTP benchmark script in the repo includes it: `qwen3.5_fp8_b200_mtp.sh` (line 91), `qwen3.5_fp8_h200_mtp.sh` (line 82), `qwen3.5_fp8_b300_mtp.sh` (line 77), `dsr1_fp8_b200_mtp.sh` (line 113), `dsr1_fp8_b300_mtp.sh` (line 77).
Why existing code does not prevent it
The non-MTP baseline qwen3.5_fp4_b200.sh also lacks --use-chat-template, which is acceptable for a non-speculative script because acceptance rates are not relevant there. The author appears to have copied the non-MTP script and added the EAGLE server flags without applying the MTP-specific benchmark client correction documented in PR #647.
What the impact would be
EAGLE speculative decoding acceptance rates are highly sensitive to prompt distribution. The draft model is trained on chat-formatted sequences; feeding it raw random tokens produces token distributions that are much easier to speculate on, inflating acceptance rates and derived throughput figures. The benchmark results for qwen3.5-fp4-b200-sglang-mtp will not reflect real-world chat workload performance, potentially overstating the benefit of MTP for this config.
Step-by-step proof
- The new config `qwen3.5-fp4-b200-sglang-mtp` launches the SGLang server with `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- The benchmark client is invoked via `run_benchmark_serving` without `--use-chat-template`.
- Without this flag, `benchmark_serving.py` generates synthetic random token prompts rather than applying the model's chat template.
- Random token inputs are far from the EAGLE draft model's training distribution, causing the draft model to produce tokens that match the target model's output at an artificially high rate.
- The measured acceptance rate and tokens/second throughput will be inflated compared to actual chat workloads.
- By contrast, running with `--use-chat-template` (as all other MTP scripts do) uses properly formatted chat prompts that reflect real usage patterns.
How to fix it
Add --use-chat-template to the run_benchmark_serving invocation in benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh, matching the pattern used by every other MTP script in the repo.
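A sketch of the corrected invocation — identical to the submitted call with the one flag added (its position among the arguments is chosen arbitrarily here, since `run_benchmark_serving` is a repo helper whose argument order is assumed not to matter):

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --use-chat-template \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```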
Mirrors the existing qwen3.5-fp4-b200-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed e21f5b8 to 44a190a
Summary
- New `qwen3.5-fp4-b200-sglang-mtp` config mirroring the existing `qwen3.5-fp4-b200-sglang` non-MTP recipe, plus a new `benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh` launch script.
- Server launch adds `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Config sets `spec-decoding: mtp` so the runner picks up the `_mtp.sh` variant.
- Adds a `perf-changelog.yaml` entry (PR link placeholder; update after merge per AGENTS.md).

Test plan
- `bash -n benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh` — bash syntax OK.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml` — emits 12 entries (2 ISL/OSL × 6 concurrencies) with `spec-decoding=mtp`.

🤖 Generated with Claude Code