Add dsv4-fp4-b300-vllm-mtp config (DSv4 vLLM B300 + MTP)#1203
Oseltamivir wants to merge 2 commits into main
Conversation
…e decoding) The `vllm/vllm-openai:deepseekv4-cu130` image registers `DeepSeekV4MTPModel` (`vllm/model_executor/models/deepseek_v4_mtp.py`) and remaps `model_type=deepseek_v4` -> `deepseek_mtp` inside `SpeculativeConfig` (`vllm/config/speculative.py:248-253`). DSv4-Pro publishes `num_nextn_predict_layers=1`, so `num_speculative_tokens=1` is the natural default; vLLM reuses the head when `SPEC_TOKENS > n_predict` (must be divisible). Mirrors the coverage of `dsv4-fp4-b300-sglang-mtp`: two TP-only bands (TP8 conc 1-8, TP4 conc 4-32) for both 1k1k and 8k1k. DP-attn (TP=4 ep=4 / TP=8 ep=8) is wired in the script via `DP_ATTENTION=true` but excluded from the initial sweep until TP-only numbers land. Bench passes `--dsv4` (auto-enables `--use-chat-template`) to route prompts through `encoding_dsv4.py`; this is required for honest MTP acceptance numbers, since random-token prompts silently regress acceptance.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Summary

- `benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh`: DeepSeek-V4-Pro on B300 with vLLM + MTP speculative decoding.
- New `dsv4-fp4-b300-vllm-mtp` entry in `.github/configs/nvidia-master.yaml`, mirroring the coverage of `dsv4-fp4-b300-sglang-mtp` (TP8 conc 1-8, TP4 conc 4-32, for 1k1k and 8k1k).
- `perf-changelog.yaml` entry.
Why this works

The `vllm/vllm-openai:deepseekv4-cu130` image (already pinned by `dsv4-fp4-b300-vllm`) registers `DeepSeekV4MTPModel` in `vllm/model_executor/models/registry.py:580` and remaps `model_type=deepseek_v4` → `deepseek_mtp` inside `SpeculativeConfig.hf_config_override` (`vllm/config/speculative.py:248-253`). DSv4-Pro publishes `num_nextn_predict_layers=1`, so `num_speculative_tokens=1` is the natural default. vLLM reuses the MTP head when `num_speculative_tokens > n_predict` as long as it is divisible (`speculative.py:594-606`), so `SPEC_TOKENS` is exposed as an env var for sweeps.
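The env-var plumbing described above might look roughly like the following sketch. The variable names mirror the PR text, but the exact JSON keys (in particular the `method` value) are assumptions, not copied from the actual script:

```shell
# Hypothetical sketch: expose SPEC_TOKENS and assemble the JSON passed to
# vllm serve --speculative-config. Key names follow common vLLM MTP recipes
# but are assumptions here, not the verbatim script.
SPEC_TOKENS="${SPEC_TOKENS:-1}"   # DSv4-Pro publishes num_nextn_predict_layers=1
SPECULATIVE_CONFIG=$(printf '{"method": "mtp", "num_speculative_tokens": %d}' "$SPEC_TOKENS")
echo "$SPECULATIVE_CONFIG"
# later in the script (not executed here):
#   vllm serve "$MODEL" ... --speculative-config "$SPECULATIVE_CONFIG"
```

Keeping the config in a single env-driven string is what makes `SPEC_TOKENS=2,3` sweeps a one-variable change.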
Recipe split

Mirrors the SGLang MTP variant:

- `DP_ATTENTION=false` (TP-only, low/mid conc): standard MTP path.
- `DP_ATTENTION=true` (DP-attn, high conc): wired in the script (`tensor-parallel-size 1` + `data-parallel-size $TP` + `--enable-expert-parallel`) but not yet in the sweep; held back until TP-only numbers land. Combining DP-attn + MTP + `--no-enable-prefix-caching` triggered crashes on R1 ("[Bug]: R1 + MTP + DP + disabled prefix cache crashes", vllm-project/vllm#25202); DSv4 has separate MLA + MTP code paths, so it may be fine, but it is worth gating.
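The two modes above could be wired with a toggle along these lines. The variable names and argument grouping are assumptions for illustration, not the verbatim script:

```shell
# Hypothetical sketch of the DP_ATTENTION toggle described above.
DP_ATTENTION="${DP_ATTENTION:-false}"
TP="${TP:-8}"
if [ "$DP_ATTENTION" = "true" ]; then
  # DP-attn path: attention replicated across DP ranks, experts sharded via EP.
  PARALLEL_ARGS="--tensor-parallel-size 1 --data-parallel-size $TP --enable-expert-parallel"
else
  # TP-only path used by the initial sweep.
  PARALLEL_ARGS="--tensor-parallel-size $TP"
fi
echo "$PARALLEL_ARGS"
```

Gating on a single env var keeps the DP-attn configuration testable without adding it to the CI sweep yet.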
Chat template

Per AGENTS.md ("MTP scripts must benchmark against chat-formatted inputs"), the bench passes `--dsv4` to `run_benchmark_serving`, which auto-enables `--use-chat-template` and routes prompts through `encoding_dsv4.py` (PR #1153). The DSv4-Pro tokenizer ships without a jinja `chat_template`, so plain `--use-chat-template` would crash; `--dsv4` sidesteps that.
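A tiny pre-flight check could make the missing-template situation explicit before launching the bench. The stand-in JSON below is hypothetical (a placeholder for the real `tokenizer_config.json`), and the check itself is a sketch, not part of the actual script:

```shell
# Hypothetical guard: detect whether the tokenizer config carries a jinja
# chat_template. For DSv4-Pro it does not, which is why plain
# --use-chat-template would crash and --dsv4 is used instead.
TOKENIZER_CONFIG='{"model_max_length": 163840}'   # stand-in, not the real file
if printf '%s' "$TOKENIZER_CONFIG" | grep -q '"chat_template"'; then
  CHAT_TEMPLATE_PRESENT=1
else
  CHAT_TEMPLATE_PRESENT=0   # expected for DSv4-Pro
fi
echo "chat_template present: $CHAT_TEMPLATE_PRESENT"
```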
Diff vs the non-MTP base script

Minimal: only the `SPEC_TOKENS`/`SPECULATIVE_CONFIG` declaration, `--speculative-config` on `vllm serve`, and `--dsv4` on the bench client.
Test plan

- `vllm serve` will reject `model_type=deepseek_v4` at startup.
- Compare against `dsv4-fp4-b300-vllm`.
- Compare against `dsv4-fp4-b300-sglang-mtp` for the same TP/conc.
- Sweep `SPEC_TOKENS=2,3` once baseline lands.
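Before a `SPEC_TOKENS` sweep, a small guard mirroring vLLM's divisibility rule (head reuse only when `num_speculative_tokens` is a whole multiple of `n_predict`) could fail fast on invalid values. This is a sketch, not code from the actual script:

```shell
# Hypothetical pre-flight check for the SPEC_TOKENS sweep.
SPEC_TOKENS="${SPEC_TOKENS:-1}"
N_PREDICT=1   # DSv4-Pro's num_nextn_predict_layers
if [ $(( SPEC_TOKENS % N_PREDICT )) -ne 0 ]; then
  echo "SPEC_TOKENS=$SPEC_TOKENS must be divisible by n_predict=$N_PREDICT" >&2
  exit 1
fi
echo "SPEC_TOKENS=$SPEC_TOKENS ok"
```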
Update changelog PR link

After this PR is opened, update the `perf-changelog.yaml` `pr-link: …/pull/TBD` placeholder to the actual PR number.