Add MI355X config: qwen3.5-fp8-sglang-mtp#1076
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow If additional help is needed, PR authors can reach out to core maintainers over Slack.
run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/
🔴 The run_benchmark_serving call in qwen3.5_fp8_mi355x_mtp.sh is missing --use-chat-template, which is present in every other MTP benchmark script in the repo. Without it, the benchmark sends raw text prompts instead of chat-formatted messages, causing EAGLE speculative decoding acceptance rates to be artificially inflated and throughput numbers to appear better than they would be under real production workloads.
Extended reasoning...
What the bug is and how it manifests
The new benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh script omits --use-chat-template from its run_benchmark_serving call (lines 61–71). Every other MTP benchmark script in the repository includes this flag: qwen3.5_fp8_h200_mtp.sh (line 82), qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_b300_mtp.sh (line 77), dsr1_fp8_b200_mtp.sh (line 113), dsr1_fp8_b300_mtp.sh (line 117), dsr1_fp4_mi355x_atom_mtp.sh (line 76), and dsr1_fp8_mi355x_atom_mtp.sh (line 76).
The specific code path that triggers it
When run_benchmark_serving is invoked without --use-chat-template, the benchmark client sends raw text prompts directly to the server. With the flag, it wraps each prompt in the model's chat template (e.g., <|im_start|>user\n...<|im_end|>) before sending. The tokenized distributions differ substantially between these two modes.
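A minimal sketch of the difference between the two modes (illustrative only: the real client applies the model tokenizer's chat template, and the exact Qwen-style markup below is an assumption):

```shell
# Illustrative only: the real benchmark client renders the tokenizer's
# chat template; this Qwen-style markup is assumed for the sketch.
raw_prompt="Explain speculative decoding."

# Without --use-chat-template the raw text is sent to the server as-is.
no_template="$raw_prompt"

# With --use-chat-template the prompt is wrapped in chat markup first,
# so the token distribution matches what chat users actually send.
with_template="<|im_start|>user
${raw_prompt}<|im_end|>
<|im_start|>assistant
"
printf '%s' "$with_template"
```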
Why existing code doesn't prevent it
The non-MTP qwen3.5_fp8_mi355x.sh script also lacks --use-chat-template, but this omission is inconsequential for non-speculative workloads because there is no draft model whose acceptance rate depends on the prompt distribution. For MTP/EAGLE, the draft model's next-token predictions must closely match the target model's distribution over realistic chat-formatted inputs. Sending raw text skews this distribution toward patterns that are easier for the draft model to predict, inflating acceptance rates.
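To see why an inflated acceptance rate inflates reported throughput, a back-of-envelope model helps: with draft length k and a per-token acceptance probability p, a geometric chain-acceptance approximation gives (1 - p^(k+1)) / (1 - p) expected tokens per target-model forward pass. This is a standard approximation, not the exact EAGLE estimator, and the probabilities below are invented for illustration:

```shell
# Back-of-envelope: expected tokens emitted per target-model forward
# pass for draft length k and per-token acceptance probability p
# (geometric chain acceptance; probabilities are illustrative, not measured).
expected_tokens() {
  awk -v p="$1" -v k="$2" 'BEGIN { printf "%.2f\n", (1 - p^(k+1)) / (1 - p) }'
}

expected_tokens 0.6 4   # chat-formatted prompts  -> 2.31
expected_tokens 0.8 4   # inflated raw-text rate  -> 3.36
```

Even a modest bump in apparent acceptance translates into a large, unreproducible TPS gain.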
What the impact would be
The perf-changelog entry for PR #647 documents this exact issue: "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." The same mechanism applies to Qwen3.5 MTP. Benchmarks run with this script will report misleadingly optimistic speculative decoding speedups for the qwen3.5-fp8-mi355x-sglang-mtp config, producing TPS numbers that cannot be reproduced in production where users send chat-formatted messages.
How to fix it
Add --use-chat-template to the run_benchmark_serving call, matching the pattern used by all other MTP scripts:
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
  --use-chat-template

Step-by-step proof
1. PR "Add MI355X config: qwen3.5-fp8-sglang-mtp" #1076 adds benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh as a new file.
2. The file's run_benchmark_serving call (lines 61–71 in the diff) lists 10 arguments; --use-chat-template is not among them.
3. Compare with qwen3.5_fp8_b200_mtp.sh (line 91) or qwen3.5_fp8_h200_mtp.sh (line 82): both include --use-chat-template after --result-dir.
4. The perf-changelog.yaml entry for PR "Add benchmark script code style guidelines to PR review workflow" #647 (already merged) states the consequence: artificially high MTP acceptance rates when the flag is absent.
5. Therefore, any CI run on MI355X using this script will collect inflated TPS numbers that do not represent real-workload performance.
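This class of omission could also be caught mechanically. A sketch of such a guard (the function name, call site, and glob are assumptions, not existing repo tooling):

```shell
# Hypothetical CI guard: report any MTP benchmark script that omits
# --use-chat-template. Name and usage are illustrative, not repo tooling.
check_mtp_scripts() {
  rc=0
  for f in "$@"; do
    # '--' stops grep from treating the pattern as an option.
    if ! grep -q -- '--use-chat-template' "$f"; then
      echo "missing --use-chat-template: $f"
      rc=1
    fi
  done
  return $rc
}

# Example invocation over the repo's MTP launch scripts:
# check_mtp_scripts benchmarks/single_node/*_mtp.sh
```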
Mirrors the existing qwen3.5-fp8-mi355x-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 129bfba to 1b4f933
Summary
- Adds a qwen3.5-fp8-mi355x-sglang-mtp config mirroring the existing qwen3.5-fp8-mi355x-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh launch script.
- Enables EAGLE speculative decoding: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
- Sets spec-decoding: mtp so the MI355X runner picks up the _mtp.sh variant.
- Adds a perf-changelog.yaml entry (PR link placeholder; update after merge per AGENTS.md).

Test plan
- bash -n benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh — bash syntax OK.
- python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml — emits 17 entries with spec-decoding=mtp (same sweep shape as the non-MTP config).

🤖 Generated with Claude Code