
Add dsv4-fp4-b300-vllm-mtp config (DSv4 vLLM B300 + MTP)#1203

Open

Oseltamivir wants to merge 2 commits into main from dsv4-fp4-b300-vllm-mtp

Conversation

@Oseltamivir
Collaborator

Summary

  • New benchmark benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh — DeepSeek-V4-Pro on B300 with vLLM + MTP speculative decoding.
  • New matrix entry dsv4-fp4-b300-vllm-mtp in .github/configs/nvidia-master.yaml, mirroring the coverage of dsv4-fp4-b300-sglang-mtp (TP8 conc 1-8, TP4 conc 4-32 for 1k1k and 8k1k).
  • perf-changelog.yaml entry.

Why this works

The vllm/vllm-openai:deepseekv4-cu130 image (already pinned by dsv4-fp4-b300-vllm) registers DeepSeekV4MTPModel in vllm/model_executor/models/registry.py:580 and remaps model_type=deepseek_v4 -> deepseek_mtp inside SpeculativeConfig.hf_config_override (vllm/config/speculative.py:248-253). DSv4-Pro publishes num_nextn_predict_layers=1, so num_speculative_tokens=1 is the natural default. vLLM reuses the MTP head when num_speculative_tokens > n_predict as long as num_speculative_tokens is divisible by n_predict (speculative.py:594-606), so SPEC_TOKENS is exposed as an env var for sweeps.
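
A minimal sketch of that wiring, assuming the SPEC_TOKENS/SPECULATIVE_CONFIG names from the diff summary below, vLLM's JSON-string form of --speculative-config, and an illustrative model id; the exact keys the patched image expects are an assumption here, not copied from the script:

```bash
# Sketch only: variable names mirror the diff summary; the JSON keys assume
# vLLM's --speculative-config form and the image's deepseek_mtp remap.
SPEC_TOKENS="${SPEC_TOKENS:-1}"   # DSv4-Pro publishes num_nextn_predict_layers=1

SPECULATIVE_CONFIG="{\"method\": \"deepseek_mtp\", \"num_speculative_tokens\": ${SPEC_TOKENS}}"

# Model id is a placeholder for illustration.
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --tensor-parallel-size "${TP:-8}" \
  --speculative-config "${SPECULATIVE_CONFIG}"
```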

Recipe split

Mirrors the SGLang MTP variant:

  • DP_ATTENTION=false (TP-only, low/mid conc): standard MTP path.
  • DP_ATTENTION=true (DP-attn, high conc): wired in the script (--tensor-parallel-size 1 + --data-parallel-size $TP + --enable-expert-parallel; sketched below) but not yet in the sweep — held back until TP-only numbers land. Combining DP-attn + MTP + --no-enable-prefix-caching triggered crashes on R1 ([Bug]: R1 + MTP + DP + disabled prefix cache crashes vllm-project/vllm#25202); DSv4 has separate MLA + MTP code paths, so it may be fine, but it is worth gating.
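
A sketch of that branch; the flag spellings follow vLLM's CLI, but the variable names and model id are assumptions rather than excerpts from the actual script:

```bash
# Sketch of the DP_ATTENTION branch; variable names and the model id are
# assumptions, but the flag spellings follow vLLM's CLI.
if [ "${DP_ATTENTION:-false}" = "true" ]; then
  # DP-attn: one attention replica per GPU, experts parallelized across them.
  PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP" --enable-expert-parallel)
else
  # TP-only: the standard MTP path used for the initial sweep.
  PARALLEL_ARGS=(--tensor-parallel-size "$TP")
fi

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  "${PARALLEL_ARGS[@]}" \
  --speculative-config "${SPECULATIVE_CONFIG}"
```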

Chat template

Per AGENTS.md ("MTP scripts must benchmark against chat-formatted inputs"), the bench passes --dsv4 to run_benchmark_serving, which auto-enables --use-chat-template and routes prompts through encoding_dsv4.py (PR #1153). The DSv4-Pro tokenizer ships without a jinja chat_template, so plain --use-chat-template would crash; --dsv4 sidesteps that.
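
For reference, a hedged sketch of the bench call; --dsv4 is the real flag from PR #1153, while the invocation and the other arguments are illustrative placeholders:

```bash
# --dsv4 auto-enables --use-chat-template and routes prompts through
# encoding_dsv4.py; without it, the missing jinja chat_template would crash
# plain --use-chat-template. Everything other than --dsv4 is a placeholder.
python run_benchmark_serving.py \
  --dsv4 \
  --num-prompts 128 \
  --max-concurrency "${CONC:-8}"
```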

Diff vs the non-MTP base script

Minimal — only the SPEC_TOKENS/SPECULATIVE_CONFIG declarations, --speculative-config on vllm serve, and --dsv4 on the bench client.

Test plan

  • First run on the B300 runner — confirm the deepseekv4-cu130 image carries the DSv4 MTP patch (registry + speculative.py); if older, vllm serve will reject model_type=deepseek_v4 at startup.
  • TP=8 conc 1-8 (1k1k) — sanity check MTP acceptance > 0 and tok/s improves over dsv4-fp4-b300-vllm.
  • TP=4 conc 4-32 (1k1k and 8k1k).
  • Compare acceptance / tok/s vs dsv4-fp4-b300-sglang-mtp for the same TP/conc.
  • If TP-only is healthy, follow up with DP-attn entries (TP=4 ep=4 dp-attn, conc 64-512).
  • Sweep SPEC_TOKENS=2,3 once the baseline lands (sketched below).
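
A sketch of that sweep, assuming the script path from the Summary and one launch per value (legal against n_predict=1, since every integer passes the divisibility check in speculative.py:594-606):

```bash
# Sweep SPEC_TOKENS after the SPEC_TOKENS=1 baseline lands; each value
# needs a server restart because --speculative-config is fixed at startup.
for SPEC_TOKENS in 2 3; do
  export SPEC_TOKENS
  bash benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh
done
```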

Update changelog PR link

After this PR is opened, update perf-changelog.yaml pr-link: …/pull/TBD placeholder to the actual PR number.

…e decoding)

The vllm/vllm-openai:deepseekv4-cu130 image registers DeepSeekV4MTPModel
(vllm/model_executor/models/deepseek_v4_mtp.py) and remaps
model_type=deepseek_v4 -> deepseek_mtp inside SpeculativeConfig
(vllm/config/speculative.py:248-253). DSv4-Pro publishes
num_nextn_predict_layers=1 so num_speculative_tokens=1 is the natural
default; vLLM reuses the head when SPEC_TOKENS > n_predict (must be
divisible).

Mirrors the coverage of dsv4-fp4-b300-sglang-mtp: two TP-only bands
(TP8 conc 1-8, TP4 conc 4-32) for both 1k1k and 8k1k. DP-attn
(TP=4 ep=4 / TP=8 ep=8) is wired in the script via DP_ATTENTION=true
but excluded from the initial sweep until TP-only numbers land.

Bench passes --dsv4 (auto-enables --use-chat-template) to route
prompts through encoding_dsv4.py -- required for honest MTP acceptance
numbers; random-token prompts silently regress acceptance.
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

