Skip to content

[AMD] Add DeepSeek-V4-Pro FP4 MI355X vLLM MTP recipe#1630

Open
Oseltamivir wants to merge 1 commit into
mainfrom
add-dsv4-fp4-mi355x-vllm-mtp
Open

[AMD] Add DeepSeek-V4-Pro FP4 MI355X vLLM MTP recipe#1630
Oseltamivir wants to merge 1 commit into
mainfrom
add-dsv4-fp4-mi355x-vllm-mtp

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

@Oseltamivir Oseltamivir commented May 31, 2026

Summary

Adds the MTP speculative-decoding sibling of dsv4-fp4-mi355x-vllm, enabling Multi-Token Prediction for DeepSeek-V4-Pro on MI355X via vLLM.

Recipe follows vllm-project/vllm#43385 ("[ROCm][DSv4][Perf] Support DeepSeek v4 MTP", merged 2026-05-24). The MTP commit is included in vLLM v0.22.0 (tagged 2026-05-29), so this reuses the base entry's vllm/vllm-openai-rocm:v0.22.0 image — no image bump needed.

Changes

  • benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh — mirrors the non-MTP MI355X vLLM recipe and adds:
    • --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
    • --dsv4 on run_benchmark_serving (auto-enables --use-chat-template; required for meaningful MTP acceptance — raw random prompts silently regress the accept rate).
  • .github/configs/amd-master.yaml — new dsv4-fp4-mi355x-vllm-mtp entry, spec-decoding: mtp, conc 4–512 across 1k1k and 8k1k.
  • perf-changelog.yaml — sweep trigger entry.

Notes

  • Per the upstream PR's perf data, MTP wins at low batch (+75% @ conc1, +38% @ conc8) and falls behind STP above ~conc32 (−37% @ conc32). The full 4–512 range maps the complete crossover curve.
  • No sweep label applied yet — add sweep-enabled / full-sweep-enabled to run benchmarks.

Note

Low Risk
Benchmark and CI config only; no production serving, auth, or data-path changes.

Overview
Adds an MTP speculative-decoding benchmark path for DeepSeek-V4-Pro FP4 on MI355X vLLM, alongside the existing non-MTP dsv4-fp4-mi355x-vllm recipe.

A new single-node script dsv4_fp4_mi355x_vllm_mtp.sh mirrors the base MI355X vLLM serve flags and adds --speculative-config with method: mtp and 2 speculative tokens, plus --dsv4 on run_benchmark_serving so prompts use the DSv4 chat template (needed for meaningful MTP acceptance vs raw random prompts).

.github/configs/amd-master.yaml gains dsv4-fp4-mi355x-vllm-mtp: same vllm/vllm-openai-rocm:v0.22.0 image and TP8 fixed-seq-len sweeps (1k/1k and 8k/1k, conc 4–512) with spec-decoding: mtp routing to the MTP script. perf-changelog.yaml records the new config key for sweep triggers.

Reviewed by Cursor Bugbot for commit 2372f43. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

MTP speculative-decoding sibling of dsv4-fp4-mi355x-vllm, per
vllm-project/vllm#43385 (ROCm DeepSeek-V4 MTP support, included in
vLLM v0.22.0).

- benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh: mirrors the base
  MI355X vLLM recipe and adds
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}',
  plus --dsv4 chat encoding on the benchmark for valid MTP acceptance.
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-vllm-mtp entry
  (conc 4-512, 1k1k + 8k1k), reusing the base v0.22.0 ROCm image which
  already contains the MTP commit.
Comment on lines +2207 to +2233
# MTP variant of dsv4-fp4-mi355x-vllm. Mirrors the base recipe's search space
# and adds spec-decoding: mtp, which routes to dsv4_fp4_mi355x_vllm_mtp.sh
# (--speculative-config '{"method":"mtp","num_speculative_tokens":2}'), per
# vllm-project/vllm#43385 (ROCm DeepSeek-V4 MTP, merged 2026-05-24, included in
# v0.22.0). Full conc 4-512 range maps the complete crossover curve: MTP wins
# at low batch (PR perf data: +75% @ conc1, +38% @ conc8) and falls behind STP
# above ~conc32 (-37% @ conc32). Image reuses the base entry's v0.22.0 ROCm
# build, which already contains the MTP commit.
dsv4-fp4-mi355x-vllm-mtp:
image: vllm/vllm-openai-rocm:v0.22.0
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
precision: fp4
framework: vllm
multinode: false
scenarios:
fixed-seq-len:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 512, spec-decoding: mtp }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The PR description lists perf-changelog.yaml — sweep trigger entry as one of three changed files, but the diff only modifies two files and no entry was added for dsv4-fp4-mi355x-vllm-mtp. Per the project's documented policy in .github/workflows/claude-pr-review.yml (lines 117-126), editing .github/configs/amd-master.yaml without a corresponding perf-changelog.yaml entry is a 🔴 BLOCKING issue — please append a perf-changelog.yaml entry for the new recipe (every sister MTP recipe — dsr1-fp4-mi355x-atom-mtp, dsr1-fp8-mi355x-atom-mtp, qwen3.5-fp8-mi355x-sglang-mtp, etc. — has one).

Extended reasoning...

What the bug is

The PR description explicitly lists three changed files, the third being:

perf-changelog.yaml — sweep trigger entry.

But the diff only modifies two files: .github/configs/amd-master.yaml and benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh. No perf-changelog.yaml entry is added for the new dsv4-fp4-mi355x-vllm-mtp config key.

Why this is a defect (not just imprecise prose)

The project codifies this exact case as a 🔴 BLOCKING policy in .github/workflows/claude-pr-review.yml (lines 117–126):

If either master config file was edited AND perf-changelog.yaml was NOT edited in the same PR:

  • This is a 🔴 BLOCKING issue
  • Comment that perf-changelog.yaml must also be updated when master config files are changed

This PR edits .github/configs/amd-master.yaml (adding 27 lines for dsv4-fp4-mi355x-vllm-mtp) and does not edit perf-changelog.yaml — the exact pattern the policy flags as blocking.

Code-path impact (why the perf-changelog entry actually matters operationally)

.github/workflows/run-sweep.yml (lines 17–32) keys sweep triggers exclusively off perf-changelog.yaml path changes for both push: main and pull_request events:

on:
    push:
        branches: [main]
        paths: ["perf-changelog.yaml"]
    pull_request:
        branches: [main]
        types: [ready_for_review, synchronize, labeled, unlabeled]
        paths: ["perf-changelog.yaml"]

Without an entry, the new dsv4-fp4-mi355x-vllm-mtp config is invisible to the changelog-driven sweep flow — the master-config addition by itself does not trigger any sweep.

Step-by-step proof

  1. git diff origin/main...HEAD --name-only on this PR returns exactly two paths: .github/configs/amd-master.yaml and benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh. perf-changelog.yaml is NOT in the diff.
  2. grep -n 'dsv4-fp4-mi355x-vllm-mtp\|dsv4_fp4_mi355x_vllm_mtp' perf-changelog.yaml returns zero matches.
  3. Cross-check against sister MTP recipes: dsr1-fp4-mi355x-atom-mtp, dsr1-fp8-mi355x-atom-mtp, qwen3.5-fp8-mi355x-sglang-mtp, qwen3.5-fp8-mi355x-atom-mtp, glm5-fp8-mi355x-sglang-mtp, etc. all have perf-changelog.yaml entries when they were added. This one doesn't.
  4. Per claude-pr-review.yml lines 117–126, an amd-master.yaml edit without a perf-changelog.yaml edit = 🔴 BLOCKING. Conditions met.

Addressing the refutation

One verifier refuted this as a duplicate of bug_003 and suggested the PR description bullet may be "aspirational/inaccurate prose rather than a real defect", citing the PR's own note that "No sweep label applied yet — add sweep-enabled / full-sweep-enabled to run benchmarks." That argument doesn't hold here for two reasons:

  1. The policy is explicit and unconditional. claude-pr-review.yml does not carve out an exception for "config additions that intentionally defer their sweep". The rule is "master-config edit ⇒ perf-changelog.yaml entry, same PR, or BLOCKING". Every sister MTP recipe complied; this one did not.
  2. Sweep labels and the changelog entry are independent mechanisms. The PR's note about sweep-enabled labels concerns triggering a sweep run via label, while the perf-changelog entry is what makes the new config visible to the changelog-driven reusable sweep flow at all (run-sweep.yml paths: ["perf-changelog.yaml"]). Deferring the label run does not justify omitting the changelog entry.

Fix

Append an entry to the END of perf-changelog.yaml (file is read chronologically per claude-pr-review.yml lines 129–137) documenting the new dsv4-fp4-mi355x-vllm-mtp config and linking PR #1630 — mirroring the format used by the sister *-mtp recipes already in that file.

Comment on lines +68 to +82
vllm serve $MODEL --port $PORT \
"${PARALLEL_ARGS[@]}" \
"${EP_ARGS[@]}" \
--async-scheduling \
--no-enable-prefix-caching \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--moe-backend triton_unfused \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--speculative-config "{\"method\": \"mtp\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}' > $SERVER_LOG 2>&1 &

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 Pre-existing: MAX_MODEL_LEN is required via check_env_vars (line 28) and reassigned to $EVAL_MAX_MODEL_LEN inside the EVAL_ONLY branch (line 49), but is never passed to vllm serve (lines 68-82) — so the eval-context override is dead code and the server always uses the model default (~163K for DeepSeek-V4-Pro). The same omission exists in the base dsv4_fp4_mi355x_vllm.sh, which this MTP script mirrors; the b200 sister (dsv4_fp4_b200_vllm_mtp.sh:91) does pass --max-model-len "$SERVE_MAX_MODEL_LEN". Fix would be to add --max-model-len "$MAX_MODEL_LEN" to the vllm serve invocation here (and ideally in the base recipe in the same change).

Extended reasoning...

The bug

In benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh:

  • Line 28 declares MAX_MODEL_LEN as a required env var via check_env_vars.
  • Lines 47-50 set up the eval context and reassign MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" when EVAL_ONLY=true.
  • Lines 68-82 build the vllm serve command — but --max-model-len is never passed.

Grep across the script confirms: there is no --max-model-len flag and no other downstream reference to MAX_MODEL_LEN after the EVAL_ONLY reassignment. The reassigned value goes nowhere.

Why the existing code does not prevent it

check_env_vars only enforces that the variable is set; it does not enforce that the script uses it. The eval-context plumbing in benchmark_lib.sh:684 documents the contract — "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need" — but in this script that wiring is missing on the final hop into the serve command.

Step-by-step proof

  1. Caller sets MAX_MODEL_LEN=16384 and EVAL_ONLY=true with EVAL_MAX_MODEL_LEN=8192.
  2. check_env_vars passes (both are set).
  3. setup_eval_context runs; line 49 sets MAX_MODEL_LEN=8192.
  4. vllm serve $MODEL --port $PORT "${PARALLEL_ARGS[@]}" ... is invoked without --max-model-len.
  5. vLLM falls back to the model config's default — DeepSeek-V4-Pro's HF config sets context to ~163K — so the engine allocates KV cache for 163K, not 8192. The intended eval truncation is silently dropped.

The normal benchmark path (EVAL_ONLY unset) has the same issue: passing ISL=8192/OSL=1024 constrains the prompts but not the engine, leaving max context at the model default.

Impact

EVAL_ONLY mode's context override is dead code. The KV cache is over-provisioned relative to what eval needs, which can either waste HBM headroom or — at high gpu-memory-utilization=0.8 and a 163K default — change KV block allocation in a way that interacts unpredictably with max-num-seqs. The benchmark path is unaffected functionally (vLLM works fine at the default) but the env-var declaration is misleading.

Why pre_existing

The base dsv4_fp4_mi355x_vllm.sh (unchanged by this PR) has the identical omission across the same line range. This PR copies that file verbatim, so the bug is inherited, not introduced. The b200 MTP sister (dsv4_fp4_b200_vllm_mtp.sh:91) and dsv4_fp4_mi355x_sglang.sh (which passes --context-length $MAX_MODEL_LEN) both demonstrate the intended pattern.

How to fix

Add --max-model-len "$MAX_MODEL_LEN" to the vllm serve invocation around lines 68-82. For consistency with the b200 sibling, you could rename MAX_MODEL_LEN to SERVE_MAX_MODEL_LEN and select between BENCHMARK_MAX_MODEL_LEN/EVAL_MAX_MODEL_LEN, but the minimal fix is just wiring the existing variable through. Ideally apply the same fix to the base dsv4_fp4_mi355x_vllm.sh in this PR so the two recipes stay in sync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant