Add MI355X config: qwen3.5-fp8-sglang-mtp#1076
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/
🔴 The run_benchmark_serving call in qwen3.5_fp8_mi355x_mtp.sh is missing --use-chat-template, which is present in every other MTP benchmark script in the repo. Without it, the benchmark sends raw text prompts instead of chat-formatted messages, causing EAGLE speculative decoding acceptance rates to be artificially inflated and throughput numbers to appear better than they would be under real production workloads.
Extended reasoning...
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh script omits --use-chat-template from its run_benchmark_serving call (lines 61–71). Every other MTP benchmark script in the repository includes this flag: qwen3.5_fp8_h200_mtp.sh (line 82), qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_b300_mtp.sh (line 77), dsr1_fp8_b200_mtp.sh (line 113), dsr1_fp8_b300_mtp.sh (line 117), dsr1_fp4_mi355x_atom_mtp.sh (line 76), and dsr1_fp8_mi355x_atom_mtp.sh (line 76).
The specific code path that triggers it
When run_benchmark_serving is invoked without --use-chat-template, the benchmark client sends raw text prompts directly to the server. With the flag, it wraps each prompt in the model's chat template (e.g., <|im_start|>user\n...<|im_end|>) before sending. The tokenized distributions differ substantially between these two modes.
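A minimal sketch of the difference between the two modes (illustrative only: the real client applies the model tokenizer's chat template, and the exact Qwen-style markup below is an assumption):

```shell
# Illustrative only: the real benchmark client renders the tokenizer's
# chat template; this Qwen-style markup is assumed for the sketch.
raw_prompt="Explain speculative decoding."

# Without --use-chat-template the raw text is sent to the server as-is.
no_template="$raw_prompt"

# With --use-chat-template the prompt is wrapped in chat markup first,
# so the token distribution matches what chat users actually send.
with_template="<|im_start|>user
${raw_prompt}<|im_end|>
<|im_start|>assistant
"
printf '%s' "$with_template"
```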
Why existing code doesn't prevent it
The non-MTP qwen3.5_fp8_mi355x.sh script also lacks --use-chat-template, but this omission is inconsequential for non-speculative workloads because there is no draft model whose acceptance rate depends on the prompt distribution. For MTP/EAGLE, the draft model's next-token predictions must closely match the target model's distribution over realistic chat-formatted inputs. Sending raw text skews this distribution toward patterns that are easier for the draft model to predict, inflating acceptance rates.
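To see why an inflated acceptance rate inflates reported throughput, a back-of-envelope model helps: with draft length k and a per-token acceptance probability p, a geometric chain-acceptance approximation gives (1 - p^(k+1)) / (1 - p) expected tokens per target-model forward pass. This is a standard approximation, not the exact EAGLE estimator, and the probabilities below are invented for illustration:

```shell
# Back-of-envelope: expected tokens emitted per target-model forward
# pass for draft length k and per-token acceptance probability p
# (geometric chain acceptance; probabilities are illustrative, not measured).
expected_tokens() {
  awk -v p="$1" -v k="$2" 'BEGIN { printf "%.2f\n", (1 - p^(k+1)) / (1 - p) }'
}

expected_tokens 0.6 4   # chat-formatted prompts  -> 2.31
expected_tokens 0.8 4   # inflated raw-text rate  -> 3.36
```

Even a modest bump in apparent acceptance translates into a large, unreproducible TPS gain.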
What the impact would be
The perf-changelog entry for PR #647 documents this exact issue: "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." The same mechanism applies to Qwen3.5 MTP. Benchmarks run with this script will report misleadingly optimistic speculative decoding speedups for the qwen3.5-fp8-mi355x-sglang-mtp config, producing TPS numbers that cannot be reproduced in production where users send chat-formatted messages.
How to fix it
Add --use-chat-template to the run_benchmark_serving call, matching the pattern used by all other MTP scripts:
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
  --use-chat-template

Step-by-step proof
1. PR "Add MI355X config: qwen3.5-fp8-sglang-mtp" #1076 adds benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh as a new file.
2. The file's run_benchmark_serving call (lines 61–71 in the diff) lists 10 arguments; --use-chat-template is not among them.
3. Compare with qwen3.5_fp8_b200_mtp.sh (line 91) or qwen3.5_fp8_h200_mtp.sh (line 82): both include --use-chat-template after --result-dir.
4. The perf-changelog.yaml entry for PR "Add benchmark script code style guidelines to PR review workflow" #647 (already merged) states the consequence: artificially high MTP acceptance rates when the flag is absent.
5. Therefore, any CI run on MI355X using this script will collect inflated TPS numbers that do not represent real-workload performance.
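This class of omission could also be caught mechanically. A sketch of such a guard (the function name, call site, and glob are assumptions, not existing repo tooling):

```shell
# Hypothetical CI guard: report any MTP benchmark script that omits
# --use-chat-template. Name and usage are illustrative, not repo tooling.
check_mtp_scripts() {
  rc=0
  for f in "$@"; do
    # '--' stops grep from treating the pattern as an option.
    if ! grep -q -- '--use-chat-template' "$f"; then
      echo "missing --use-chat-template: $f"
      rc=1
    fi
  done
  return $rc
}

# Example invocation over the repo's MTP launch scripts:
# check_mtp_scripts benchmarks/single_node/*_mtp.sh
```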
Mirrors the existing qwen3.5-fp8-mi355x-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 129bfba to 1b4f933
Summary
- Adds a qwen3.5-fp8-mi355x-sglang-mtp config mirroring the existing qwen3.5-fp8-mi355x-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh launch script.
- Enables EAGLE speculative decoding: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
- Sets spec-decoding: mtp so the MI355X runner picks up the _mtp.sh variant.
- Adds a perf-changelog.yaml entry (PR link placeholder; update after merge per AGENTS.md).

Test plan
- bash -n benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh — bash syntax OK.
- python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml — emits 17 entries with spec-decoding=mtp (same sweep shape as the non-MTP config).

🤖 Generated with Claude Code