[AMD] Add DeepSeek-V4-Pro FP4 MI355X vLLM MTP recipe #1630
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| #!/usr/bin/env bash | ||
| set -eo pipefail | ||
|
|
||
| # DeepSeek-V4-Pro on MI355X via vLLM — MTP variant of dsv4_fp4_mi355x_vllm.sh. | ||
| # Adds MTP speculative decoding per vllm-project/vllm#43385 (ROCm DeepSeek-V4 | ||
| # MTP support, merged 2026-05-24, present in v0.22.0 tagged 2026-05-29): | ||
| # --speculative-config '{"method":"mtp","num_speculative_tokens":2}'. | ||
| # | ||
| # Benchmark prompts are routed through DeepSeek-V4 chat encoding via --dsv4 | ||
| # (which auto-enables --use-chat-template). EAGLE/MTP-style spec decoding is | ||
| # trained against chat-formatted inputs; benchmarking against raw random | ||
| # prompts silently regresses the acceptance rate. | ||
| # | ||
| # All other serving flags mirror the non-MTP MI355X recipe (TP=8, | ||
| # VLLM_ROCM_USE_AITER=1, triton_unfused MoE, FP8 KV cache, mp executor, async | ||
| # scheduling, mode=3 FULL_AND_PIECEWISE compilation). See | ||
| # dsv4_fp4_mi355x_vllm.sh for per-flag rationale. | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| DP_ATTENTION \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| MAX_MODEL_LEN \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
|
|
||
| if [ -n "$ROCR_VISIBLE_DEVICES" ]; then | ||
| export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES" | ||
| fi | ||
|
|
||
| export VLLM_ROCM_USE_AITER=1 | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
|
|
||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" | ||
| fi | ||
|
|
||
| start_gpu_monitor | ||
|
|
||
| PARALLEL_ARGS=(--tensor-parallel-size "$TP" --data-parallel-size 1) | ||
| if [ "${DP_ATTENTION}" = "true" ]; then | ||
| PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP") | ||
| fi | ||
|
|
||
| EP_ARGS=() | ||
| if [ "${EP_SIZE:-1}" -gt 1 ]; then | ||
| EP_ARGS=(--enable-expert-parallel) | ||
| fi | ||
|
|
||
| # use 2 speculative tokens for all configs for now | ||
| NUM_SPEC_TOKENS=2 | ||
|
|
||
| set -x | ||
| vllm serve $MODEL --port $PORT \ | ||
| "${PARALLEL_ARGS[@]}" \ | ||
| "${EP_ARGS[@]}" \ | ||
| --async-scheduling \ | ||
| --no-enable-prefix-caching \ | ||
| --distributed-executor-backend mp \ | ||
| --gpu-memory-utilization 0.8 \ | ||
| --kv-cache-dtype fp8 \ | ||
| --trust-remote-code \ | ||
| --moe-backend triton_unfused \ | ||
| --tokenizer-mode deepseek_v4 \ | ||
| --reasoning-parser deepseek_v4 \ | ||
| --speculative-config "{\"method\": \"mtp\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \ | ||
| --compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}' > $SERVER_LOG 2>&1 & | ||
|
|
||
|
Comment on lines +68 to +82 (Contributor)
🟣 Pre-existing
The bug
In […]
Grep across the script confirms: there is no […]
Why the existing code does not prevent it
[…]
Step-by-step proof
The normal benchmark path (EVAL_ONLY unset) has the same issue: passing […]
Impact
EVAL_ONLY mode's context override is dead code. The KV cache is over-provisioned relative to what eval needs, which can either waste HBM headroom or, at high […]
Why pre_existing
The base […]
How to fix
Add […] |
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| # --dsv4 routes prompts through DeepSeek-V4 chat encoding (auto-enables | ||
| # --use-chat-template); required for meaningful MTP acceptance numbers. | ||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code \ | ||
| --dsv4 | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
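
The inline comment on lines +68 to +82 reports that the `EVAL_ONLY` override of `MAX_MODEL_LEN` is dead code, and the script header describes the `--speculative-config` JSON it passes. A minimal sketch of how both arguments could be assembled, assuming the intended remedy is to forward the variable through vLLM's `--max-model-len` flag. The helper names `max_len_args` and `spec_config_json` are hypothetical and do not exist in `benchmark_lib.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; neither exists in benchmark_lib.sh.

# Print "--max-model-len <n>" only when MAX_MODEL_LEN is non-empty, so that an
# EVAL_ONLY override of the variable would actually reach `vllm serve`.
max_len_args() {
  if [ -n "${MAX_MODEL_LEN:-}" ]; then
    printf '%s %s\n' --max-model-len "$MAX_MODEL_LEN"
  fi
}

# Build the MTP speculative-decoding config as compact JSON without
# hand-escaped quotes. With %d, printf returns a non-zero status when the
# argument is not an integer.
spec_config_json() {
  printf '{"method":"mtp","num_speculative_tokens":%d}' "$1"
}

# Demo: show the command line these helpers would contribute to.
MAX_MODEL_LEN=4096
echo "vllm serve <model> $(max_len_args) --speculative-config '$(spec_config_json 2)'"
```

In the recipe itself the same idea would more naturally be an array, in the style of the existing `PARALLEL_ARGS` and `EP_ARGS`, expanded inside the `vllm serve` invocation.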
🔴 The PR description lists `perf-changelog.yaml` (described as "sweep trigger entry") as one of three changed files, but the diff modifies only two files and adds no entry for `dsv4-fp4-mi355x-vllm-mtp`. Per the project's documented policy in `.github/workflows/claude-pr-review.yml` (lines 117-126), editing `.github/configs/amd-master.yaml` without a corresponding `perf-changelog.yaml` entry is a 🔴 BLOCKING issue. Please append a `perf-changelog.yaml` entry for the new recipe; every sister MTP recipe (`dsr1-fp4-mi355x-atom-mtp`, `dsr1-fp8-mi355x-atom-mtp`, `qwen3.5-fp8-mi355x-sglang-mtp`, etc.) has one.
What the bug is
The PR description explicitly lists three changed files, the third being `perf-changelog.yaml` (the "sweep trigger entry"). But the diff modifies only two files: `.github/configs/amd-master.yaml` and `benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh`. No `perf-changelog.yaml` entry is added for the new `dsv4-fp4-mi355x-vllm-mtp` config key.
Why this is a defect (not just imprecise prose)
The project codifies this exact case as a 🔴 BLOCKING policy in `.github/workflows/claude-pr-review.yml` (lines 117–126). This PR edits `.github/configs/amd-master.yaml` (adding 27 lines for `dsv4-fp4-mi355x-vllm-mtp`) and does not edit `perf-changelog.yaml`, which is the exact pattern the policy flags as blocking.
Code-path impact (why the perf-changelog entry actually matters operationally)
`.github/workflows/run-sweep.yml` (lines 17–32) keys sweep triggers exclusively off `perf-changelog.yaml` path changes for both `push: main` and `pull_request` events. Without an entry, the new `dsv4-fp4-mi355x-vllm-mtp` config is invisible to the changelog-driven sweep flow; the master-config addition by itself does not trigger any sweep.
Step-by-step proof
1. `git diff origin/main...HEAD --name-only` on this PR returns exactly two paths: `.github/configs/amd-master.yaml` and `benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh`. `perf-changelog.yaml` is NOT in the diff.
2. `grep -n 'dsv4-fp4-mi355x-vllm-mtp\|dsv4_fp4_mi355x_vllm_mtp' perf-changelog.yaml` returns zero matches.
3. `dsr1-fp4-mi355x-atom-mtp`, `dsr1-fp8-mi355x-atom-mtp`, `qwen3.5-fp8-mi355x-sglang-mtp`, `qwen3.5-fp8-mi355x-atom-mtp`, `glm5-fp8-mi355x-sglang-mtp`, etc. all had `perf-changelog.yaml` entries when they were added. This one doesn't.
4. Per `claude-pr-review.yml` lines 117–126, an `amd-master.yaml` edit without a `perf-changelog.yaml` edit = 🔴 BLOCKING. Conditions met.
Addressing the refutation
One verifier refuted this as a duplicate of bug_003 and suggested the PR description bullet may be "aspirational/inaccurate prose rather than a real defect", citing the PR's own note that "No sweep label applied yet — add `sweep-enabled`/`full-sweep-enabled` to run benchmarks." That argument doesn't hold here for two reasons:
1. `claude-pr-review.yml` does not carve out an exception for "config additions that intentionally defer their sweep". The rule is "master-config edit ⇒ perf-changelog.yaml entry, same PR, or BLOCKING". Every sister MTP recipe complied; this one did not.
2. The note about `sweep-enabled` labels concerns triggering a sweep run via label, while the perf-changelog entry is what makes the new config visible to the changelog-driven reusable sweep flow at all (`run-sweep.yml` `paths: ["perf-changelog.yaml"]`). Deferring the label run does not justify omitting the changelog entry.
Fix
Append an entry to the END of `perf-changelog.yaml` (the file is read chronologically per `claude-pr-review.yml` lines 129–137) documenting the new `dsv4-fp4-mi355x-vllm-mtp` config and linking PR #1630, mirroring the format used by the sister `*-mtp` recipes already in that file.
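
The step-by-step proof and the requested fix can both be checked mechanically. A minimal sketch, assuming the policy applies to `.github/configs/amd-master.yaml` as this comment describes it. The function names `check_changelog_policy` and `has_changelog_entry` are hypothetical and are not part of the repository's workflows, and the demo file contents are placeholder text rather than the real `perf-changelog.yaml` schema.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the repository.

# Reads changed paths (one per line) on stdin. Returns non-zero when the AMD
# master config was edited but perf-changelog.yaml was not.
check_changelog_policy() {
  changed="$(cat)"
  if printf '%s\n' "$changed" | grep -qxF '.github/configs/amd-master.yaml'; then
    if ! printf '%s\n' "$changed" | grep -qxF 'perf-changelog.yaml'; then
      echo "BLOCKING: amd-master.yaml edited without a perf-changelog.yaml edit" >&2
      return 1
    fi
  fi
  return 0
}

# Succeeds when the given file mentions the new recipe under either its config
# key or its script name (the two patterns used in the grep step of the proof).
has_changelog_entry() {
  grep -qF -e 'dsv4-fp4-mi355x-vllm-mtp' -e 'dsv4_fp4_mi355x_vllm_mtp' "$1"
}

# Demo with the two paths this PR touches. In a checkout, the input would come
# from: git diff origin/main...HEAD --name-only
if printf '%s\n' \
    .github/configs/amd-master.yaml \
    benchmarks/single_node/dsv4_fp4_mi355x_vllm_mtp.sh | check_changelog_policy; then
  echo "policy satisfied"
else
  echo "policy violated"
fi

# Demo of the entry check against a throwaway file with placeholder content.
tmp="$(mktemp)"
echo 'placeholder line mentioning dsv4-fp4-mi355x-vllm-mtp (PR #1630)' > "$tmp"
if has_changelog_entry "$tmp"; then echo "entry found"; else echo "entry missing"; fi
rm -f "$tmp"
```

Taking the path list on stdin keeps the policy check independent of git, so the same function works against a live diff or a saved file list.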