
Add MI355X config: qwen3.5-fp8-sglang-mtp #1076

Merged
functionstackx merged 2 commits into main from claude/add-qwen3.5-fp8-mi355x-mtp on Apr 18, 2026

Conversation

@functionstackx
Contributor

Summary

  • Adds qwen3.5-fp8-mi355x-sglang-mtp config mirroring the existing qwen3.5-fp8-mi355x-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh launch script.
  • Adds EAGLE speculative decoding flags on top of the non-MTP script: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
  • Search space rows carry spec-decoding: mtp so the MI355X runner picks up the _mtp.sh variant.
  • Adds a perf-changelog.yaml entry (PR link placeholder; update after merge per AGENTS.md).
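
The four EAGLE flags listed above can be sketched as a reusable bash array appended to the serve command. This is an illustrative fragment only, not the PR's actual script: the real base launch line lives in the non-MTP recipe, and here the composed suffix is simply printed to show the flag set.

```shell
#!/usr/bin/env bash
# Sketch (assumed structure): the EAGLE speculative-decoding flags this PR
# layers onto the non-MTP launch command. In the real script these would be
# appended to the sglang server invocation.
SPEC_ARGS=(
  --speculative-algorithm EAGLE
  --speculative-num-steps 3
  --speculative-eagle-topk 1
  --speculative-num-draft-tokens 4
)
# Print the composed flag suffix instead of launching a server.
echo "${SPEC_ARGS[@]}"
```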

Test plan

  • YAML parses for both master config and perf-changelog.
  • bash -n benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh — bash syntax OK.
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml — emits 17 entries with spec-decoding=mtp (same sweep shape as the non-MTP config).
  • CI sweep passes on MI355X.
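
The first test-plan item ("YAML parses") can be reproduced with a few lines of Python. PyYAML is assumed to be installed; the demo below uses a stand-in temp file rather than the repo's amd-master.yaml or perf-changelog.yaml, whose contents are not shown in this PR.

```python
import os
import tempfile

import yaml  # PyYAML; assumed available


def yaml_parses(path: str) -> bool:
    """Return True if the file is valid YAML; raises yaml.YAMLError otherwise."""
    with open(path) as f:
        yaml.safe_load(f)
    return True


# Demo against a stand-in file shaped loosely like a changelog entry.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write("entries:\n  - pr: 1076\n    config: qwen3.5-fp8-mi355x-sglang-mtp\n")
    tmp = f.name
print(yaml_parses(tmp))
os.remove(tmp)
```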

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we merge yours into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment from @github-actions (duplicate of the comment above).

Comment on lines +61 to +71
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/
Contributor


🔴 The run_benchmark_serving call in qwen3.5_fp8_mi355x_mtp.sh is missing --use-chat-template, which is present in every other MTP benchmark script in the repo. Without it, the benchmark sends raw text prompts instead of chat-formatted messages, causing EAGLE speculative decoding acceptance rates to be artificially inflated and throughput numbers to appear better than real production workloads.

Extended reasoning...

What the bug is and how it manifests

The new benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh script omits --use-chat-template from its run_benchmark_serving call (lines 61–71). Every other MTP benchmark script in the repository includes this flag: qwen3.5_fp8_h200_mtp.sh (line 82), qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_b300_mtp.sh (line 77), dsr1_fp8_b200_mtp.sh (line 113), dsr1_fp8_b300_mtp.sh (line 117), dsr1_fp4_mi355x_atom_mtp.sh (line 76), and dsr1_fp8_mi355x_atom_mtp.sh (line 76).

The specific code path that triggers it

When run_benchmark_serving is invoked without --use-chat-template, the benchmark client sends raw text prompts directly to the server. With the flag, it wraps each prompt in the model's chat template (e.g., <|im_start|>user\n...<|im_end|>) before sending. The tokenized distributions differ substantially between these two modes.
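
For illustration only, ChatML-style wrapping looks roughly like the hand-rolled helper below. The real benchmark applies the model tokenizer's chat template (e.g. via a tokenizer's chat-template machinery), not a hard-coded string; `chatml_wrap` is a hypothetical name introduced here.

```python
def chatml_wrap(prompt: str) -> str:
    """Wrap a raw prompt in ChatML-style delimiters, approximating what
    --use-chat-template causes the client to send instead of raw text."""
    return f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"


raw = "Summarize the MI355X results."
print(chatml_wrap(raw))
```

The token distribution the server sees differs between the two modes, which is exactly the gap the review comment describes.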

Why existing code doesn't prevent it

The non-MTP qwen3.5_fp8_mi355x.sh script also lacks --use-chat-template, but this omission is inconsequential for non-speculative workloads because there is no draft model whose acceptance rate depends on the prompt distribution. For MTP/EAGLE, the draft model's next-token predictions must closely match the target model's distribution over realistic chat-formatted inputs. Sending raw text skews this distribution toward patterns that are easier for the draft model to predict, inflating acceptance rates.
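
A standard simplified model (not taken from this repo) shows why acceptance rate dominates the reported speedup: with k draft tokens and an i.i.d. per-token acceptance probability a, the expected tokens emitted per target-model verification step is (1 - a^(k+1)) / (1 - a). Inflating a therefore inflates throughput nonlinearly.

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens emitted per verification step under a simplified
    i.i.d. acceptance model: sum of a**i for i in 0..k."""
    assert 0.0 <= a < 1.0
    return (1 - a ** (k + 1)) / (1 - a)


# num-draft-tokens=4, as in this PR's EAGLE config:
for a in (0.5, 0.7, 0.9):
    print(f"a={a}: {expected_tokens_per_step(a, 4):.2f} tokens/step")
```

Going from a = 0.7 to a = 0.9 moves the expectation from roughly 2.8 to 4.1 tokens per step, so a prompt distribution that is artificially easy for the draft model directly inflates the headline TPS.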

What the impact would be

The perf-changelog entry for PR #647 documents this exact issue: "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." The same mechanism applies to Qwen3.5 MTP. Benchmarks run with this script will report misleadingly optimistic speculative decoding speedups for the qwen3.5-fp8-mi355x-sglang-mtp config, producing TPS numbers that cannot be reproduced in production where users send chat-formatted messages.

How to fix it

Add --use-chat-template to the run_benchmark_serving call, matching the pattern used by all other MTP scripts:

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

Step-by-step proof

  1. PR #1076 (Add MI355X config: qwen3.5-fp8-sglang-mtp) adds benchmarks/single_node/qwen3.5_fp8_mi355x_mtp.sh as a new file.
  2. The file's run_benchmark_serving call (lines 61–71 in the diff) lists 10 arguments; --use-chat-template is not among them.
  3. Compare with qwen3.5_fp8_b200_mtp.sh (line 91) or qwen3.5_fp8_h200_mtp.sh (line 82): both include --use-chat-template after --result-dir.
  4. The perf-changelog.yaml entry for PR #647 (Add benchmark script code style guidelines to PR review workflow, already merged) states the consequence: artificially high MTP acceptance rates when the flag is absent.
  5. Therefore, any CI run on MI355X using this script will collect inflated TPS numbers that do not represent real-workload performance.

functionstackx and others added 2 commits on April 18, 2026 at 01:59
Mirrors the existing qwen3.5-fp8-mi355x-sglang non-MTP recipe and adds
EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)
via the standard spec-decoding=mtp suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the
client-side prompts match what the model was trained to predict; without
it the spec-decoding quality regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx force-pushed the claude/add-qwen3.5-fp8-mi355x-mtp branch from 129bfba to 1b4f933 on April 18, 2026 05:59
@functionstackx merged commit c053d09 into main on Apr 18, 2026
3 checks passed
@functionstackx deleted the claude/add-qwen3.5-fp8-mi355x-mtp branch on April 18, 2026 06:00