Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions .github/configs/amd-master.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2204,6 +2204,33 @@ dsv4-fp4-mi355x-vllm:
search-space:
- { tp: 8, conc-start: 4, conc-end: 512 }

# MTP variant of dsv4-fp4-mi355x-vllm. Mirrors the base recipe's search space
# and adds spec-decoding: mtp, which routes to dsv4_fp4_mi355x_vllm_mtp.sh
# (--speculative-config '{"method":"mtp","num_speculative_tokens":2}'), per
# vllm-project/vllm#43385 (ROCm DeepSeek-V4 MTP, merged 2026-05-24, included in
# v0.22.0). Full conc 4-512 range maps the complete crossover curve: MTP wins
# at low batch (PR perf data: +75% @ conc1, +38% @ conc8) and falls behind STP
# above ~conc32 (-37% @ conc32). Image reuses the base entry's v0.22.0 ROCm
# build, which already contains the MTP commit.
dsv4-fp4-mi355x-vllm-mtp:
image: vllm/vllm-openai-rocm:v0.22.0
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
precision: fp4
framework: vllm
multinode: false
scenarios:
fixed-seq-len:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 512, spec-decoding: mtp }

Comment on lines +2207 to +2233
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The PR description lists perf-changelog.yaml — sweep trigger entry as one of three changed files, but the diff only modifies two files and no entry was added for dsv4-fp4-mi355x-vllm-mtp. Per the project's documented policy in .github/workflows/claude-pr-review.yml (lines 117-126), editing .github/configs/amd-master.yaml without a corresponding perf-changelog.yaml entry is a 🔴 BLOCKING issue — please append a perf-changelog.yaml entry for the new recipe (every sister MTP recipe — dsr1-fp4-mi355x-atom-mtp, dsr1-fp8-mi355x-atom-mtp, qwen3.5-fp8-mi355x-sglang-mtp, etc. — has one).

Extended reasoning...

What the bug is

The PR description explicitly lists three changed files, the third being:

perf-changelog.yaml — sweep trigger entry.

But the diff only modifies two files: .github/configs/amd-master.yaml and benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh. No perf-changelog.yaml entry is added for the new dsv4-fp4-mi355x-vllm-mtp config key.

Why this is a defect (not just imprecise prose)

The project codifies this exact case as a 🔴 BLOCKING policy in .github/workflows/claude-pr-review.yml (lines 117–126):

If either master config file was edited AND perf-changelog.yaml was NOT edited in the same PR:

  • This is a 🔴 BLOCKING issue
  • Comment that perf-changelog.yaml must also be updated when master config files are changed

This PR edits .github/configs/amd-master.yaml (adding 27 lines for dsv4-fp4-mi355x-vllm-mtp) and does not edit perf-changelog.yaml — the exact pattern the policy flags as blocking.

Code-path impact (why the perf-changelog entry actually matters operationally)

.github/workflows/run-sweep.yml (lines 17–32) keys sweep triggers exclusively off perf-changelog.yaml path changes for both push: main and pull_request events:

on:
    push:
        branches: [main]
        paths: ["perf-changelog.yaml"]
    pull_request:
        branches: [main]
        types: [ready_for_review, synchronize, labeled, unlabeled]
        paths: ["perf-changelog.yaml"]

Without an entry, the new dsv4-fp4-mi355x-vllm-mtp config is invisible to the changelog-driven sweep flow — the master-config addition by itself does not trigger any sweep.

Step-by-step proof

  1. git diff origin/main...HEAD --name-only on this PR returns exactly two paths: .github/configs/amd-master.yaml and benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh. perf-changelog.yaml is NOT in the diff.
  2. grep -n 'dsv4-fp4-mi355x-vllm-mtp\|dsv4_fp4_mi355x_vllm_mtp' perf-changelog.yaml returns zero matches.
  3. Cross-check against sister MTP recipes: dsr1-fp4-mi355x-atom-mtp, dsr1-fp8-mi355x-atom-mtp, qwen3.5-fp8-mi355x-sglang-mtp, qwen3.5-fp8-mi355x-atom-mtp, glm5-fp8-mi355x-sglang-mtp, etc. all have perf-changelog.yaml entries when they were added. This one doesn't.
  4. Per claude-pr-review.yml lines 117–126, an amd-master.yaml edit without a perf-changelog.yaml edit = 🔴 BLOCKING. Conditions met.

Addressing the refutation

One verifier refuted this as a duplicate of bug_003 and suggested the PR description bullet may be "aspirational/inaccurate prose rather than a real defect", citing the PR's own note that "No sweep label applied yet — add sweep-enabled / full-sweep-enabled to run benchmarks." That argument doesn't hold here for two reasons:

  1. The policy is explicit and unconditional. claude-pr-review.yml does not carve out an exception for "config additions that intentionally defer their sweep". The rule is "master-config edit ⇒ perf-changelog.yaml entry, same PR, or BLOCKING". Every sister MTP recipe complied; this one did not.
  2. Sweep labels and the changelog entry are independent mechanisms. The PR's note about sweep-enabled labels concerns triggering a sweep run via label, while the perf-changelog entry is what makes the new config visible to the changelog-driven reusable sweep flow at all (run-sweep.yml paths: ["perf-changelog.yaml"]). Deferring the label run does not justify omitting the changelog entry.

Fix

Append an entry to the END of perf-changelog.yaml (file is read chronologically per claude-pr-review.yml lines 129–137) documenting the new dsv4-fp4-mi355x-vllm-mtp config and linking PR #1630 — mirroring the format used by the sister *-mtp recipes already in that file.

# Day-0 single-sequence marker for DeepSeek-V4 on ATOM (ROCm/ATOM#650).
# PR1 of the ATOM DSv4 series still uses torch sparse-attention fallbacks
# that OOM once warmup/prefill batches multiple requests; keep CONC=1 until
Expand Down
109 changes: 109 additions & 0 deletions benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
#!/usr/bin/env bash
set -eo pipefail

# DeepSeek-V4-Pro on MI355X via vLLM — MTP variant of dsv4_fp4_mi355x_vllm.sh.
# Adds MTP speculative decoding per vllm-project/vllm#43385 (ROCm DeepSeek-V4
# MTP support, merged 2026-05-24, present in v0.22.0 tagged 2026-05-29):
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
#
# Benchmark prompts are routed through DeepSeek-V4 chat encoding via --dsv4
# (which auto-enables --use-chat-template). EAGLE/MTP-style spec decoding is
# trained against chat-formatted inputs; benchmarking against raw random
# prompts silently regresses the acceptance rate.
#
# All other serving flags mirror the non-MTP MI355X recipe (TP=8,
# VLLM_ROCM_USE_AITER=1, triton_unfused MoE, FP8 KV cache, mp executor, async
# scheduling, mode=3 FULL_AND_PIECEWISE compilation). See
# dsv4_fp4_mi355x_vllm.sh for per-flag rationale.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
DP_ATTENTION \
CONC \
ISL \
OSL \
MAX_MODEL_LEN \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

export VLLM_ROCM_USE_AITER=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor

PARALLEL_ARGS=(--tensor-parallel-size "$TP" --data-parallel-size 1)
if [ "${DP_ATTENTION}" = "true" ]; then
PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP")
fi

EP_ARGS=()
if [ "${EP_SIZE:-1}" -gt 1 ]; then
EP_ARGS=(--enable-expert-parallel)
fi

# use 2 speculative tokens for all configs for now
NUM_SPEC_TOKENS=2

set -x
vllm serve $MODEL --port $PORT \
"${PARALLEL_ARGS[@]}" \
"${EP_ARGS[@]}" \
--async-scheduling \
--no-enable-prefix-caching \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--moe-backend triton_unfused \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--speculative-config "{\"method\": \"mtp\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}' > $SERVER_LOG 2>&1 &

Comment on lines +68 to +82
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 Pre-existing: MAX_MODEL_LEN is required via check_env_vars (line 28) and reassigned to $EVAL_MAX_MODEL_LEN inside the EVAL_ONLY branch (line 49), but is never passed to vllm serve (lines 68-82) — so the eval-context override is dead code and the server always uses the model default (~163K for DeepSeek-V4-Pro). The same omission exists in the base dsv4_fp4_mi355x_vllm.sh, which this MTP script mirrors; the b200 sister (dsv4_fp4_b200_vllm_mtp.sh:91) does pass --max-model-len "$SERVE_MAX_MODEL_LEN". Fix would be to add --max-model-len "$MAX_MODEL_LEN" to the vllm serve invocation here (and ideally in the base recipe in the same change).

Extended reasoning...

The bug

In benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh:

  • Line 28 declares MAX_MODEL_LEN as a required env var via check_env_vars.
  • Lines 47-50 set up the eval context and reassign MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" when EVAL_ONLY=true.
  • Lines 68-82 build the vllm serve command — but --max-model-len is never passed.

Grep across the script confirms: there is no --max-model-len flag and no other downstream reference to MAX_MODEL_LEN after the EVAL_ONLY reassignment. The reassigned value goes nowhere.

Why the existing code does not prevent it

check_env_vars only enforces that the variable is set; it does not enforce that the script uses it. The eval-context plumbing in benchmark_lib.sh:684 documents the contract — "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need" — but in this script that wiring is missing on the final hop into the serve command.

Step-by-step proof

  1. Caller sets MAX_MODEL_LEN=16384 and EVAL_ONLY=true with EVAL_MAX_MODEL_LEN=8192.
  2. check_env_vars passes (both are set).
  3. setup_eval_context runs; line 49 sets MAX_MODEL_LEN=8192.
  4. vllm serve $MODEL --port $PORT "${PARALLEL_ARGS[@]}" ... is invoked without --max-model-len.
  5. vLLM falls back to the model config's default — DeepSeek-V4-Pro's HF config sets context to ~163K — so the engine allocates KV cache for 163K, not 8192. The intended eval truncation is silently dropped.

The normal benchmark path (EVAL_ONLY unset) has the same issue: passing ISL=8192/OSL=1024 constrains the prompts but not the engine, leaving max context at the model default.

Impact

EVAL_ONLY mode's context override is dead code. The KV cache is over-provisioned relative to what eval needs, which can either waste HBM headroom or — at high gpu-memory-utilization=0.8 and a 163K default — change KV block allocation in a way that interacts unpredictably with max-num-seqs. The benchmark path is unaffected functionally (vLLM works fine at the default) but the env-var declaration is misleading.

Why pre_existing

The base dsv4_fp4_mi355x_vllm.sh (unchanged by this PR) has the identical omission across the same line range. This PR copies that file verbatim, so the bug is inherited, not introduced. The b200 MTP sister (dsv4_fp4_b200_vllm_mtp.sh:91) and dsv4_fp4_mi355x_sglang.sh (which passes --context-length $MAX_MODEL_LEN) both demonstrate the intended pattern.

How to fix

Add --max-model-len "$MAX_MODEL_LEN" to the vllm serve invocation around lines 68-82. For consistency with the b200 sibling, you could rename MAX_MODEL_LEN to SERVE_MAX_MODEL_LEN and select between BENCHMARK_MAX_MODEL_LEN/EVAL_MAX_MODEL_LEN, but the minimal fix is just wiring the existing variable through. Ideally apply the same fix to the base dsv4_fp4_mi355x_vllm.sh in this PR so the two recipes stay in sync.

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

# --dsv4 routes prompts through DeepSeek-V4 chat encoding (auto-enables
# --use-chat-template); required for meaningful MTP acceptance numbers.
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--trust-remote-code \
--dsv4

if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

stop_gpu_monitor
set +x
6 changes: 6 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3342,3 +3342,9 @@
description:
- "Update vLLM ROCm image from v0.21.0 to v0.22.0"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1616

- config-keys:
- dsv4-fp4-mi355x-vllm-mtp
description:
- "Add MTP speculative-decoding sibling for dsv4-fp4-mi355x-vllm (model: deepseek-ai/DeepSeek-V4-Pro) on vllm/vllm-openai-rocm:v0.22.0, per vllm-project/vllm#43385"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1630
Loading