
Add dsv4-fp4-b300-vllm-mtp config (DSv4 vLLM B300 + MTP)#1203

Open

Oseltamivir wants to merge 2 commits into main from dsv4-fp4-b300-vllm-mtp

Conversation

@Oseltamivir
Collaborator

Summary

  • New benchmark benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh — DeepSeek-V4-Pro on B300 with vLLM + MTP speculative decoding.
  • New matrix entry dsv4-fp4-b300-vllm-mtp in .github/configs/nvidia-master.yaml, mirroring the coverage of dsv4-fp4-b300-sglang-mtp (TP8 conc 1-8, TP4 conc 4-32 for 1k1k and 8k1k).
  • perf-changelog.yaml entry.

Why this works

The vllm/vllm-openai:deepseekv4-cu130 image (already pinned by dsv4-fp4-b300-vllm) registers DeepSeekV4MTPModel in vllm/model_executor/models/registry.py:580 and remaps model_type=deepseek_v4 -> deepseek_mtp inside SpeculativeConfig.hf_config_override (vllm/config/speculative.py:248-253). DSv4-Pro publishes num_nextn_predict_layers=1, so num_speculative_tokens=1 is the natural default. vLLM reuses the MTP head when num_speculative_tokens > n_predict as long as num_speculative_tokens is divisible by n_predict (speculative.py:594-606), so SPEC_TOKENS is exposed as an env var for sweeps.
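
A minimal sketch of that wiring, assuming the SPEC_TOKENS/SPECULATIVE_CONFIG names from the diff summary below, vLLM's JSON-string form of --speculative-config, and an illustrative model id; the exact keys the patched image expects are an assumption here, not copied from the script:

```bash
# Sketch only: variable names mirror the diff summary; the JSON keys assume
# vLLM's --speculative-config form and the image's deepseek_mtp remap.
SPEC_TOKENS="${SPEC_TOKENS:-1}"   # DSv4-Pro publishes num_nextn_predict_layers=1

SPECULATIVE_CONFIG="{\"method\": \"deepseek_mtp\", \"num_speculative_tokens\": ${SPEC_TOKENS}}"

# Model id is a placeholder for illustration.
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --tensor-parallel-size "${TP:-8}" \
  --speculative-config "${SPECULATIVE_CONFIG}"
```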

Recipe split

Mirrors the SGLang MTP variant:

  • DP_ATTENTION=false (TP-only, low/mid conc): standard MTP path.
  • DP_ATTENTION=true (DP-attn, high conc): wired in the script (--tensor-parallel-size 1 + --data-parallel-size $TP + --enable-expert-parallel; sketched below) but not yet in the sweep — held back until TP-only numbers land. Combining DP-attn + MTP + --no-enable-prefix-caching triggered crashes on R1 ([Bug]: R1 + MTP + DP + disabled prefix cache crashes vllm-project/vllm#25202); DSv4 has separate MLA + MTP code paths, so it may be fine, but it is worth gating.
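
A sketch of that branch; the flag spellings follow vLLM's CLI, but the variable names and model id are assumptions rather than excerpts from the actual script:

```bash
# Sketch of the DP_ATTENTION branch; variable names and the model id are
# assumptions, but the flag spellings follow vLLM's CLI.
if [ "${DP_ATTENTION:-false}" = "true" ]; then
  # DP-attn: one attention replica per GPU, experts parallelized across them.
  PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP" --enable-expert-parallel)
else
  # TP-only: the standard MTP path used for the initial sweep.
  PARALLEL_ARGS=(--tensor-parallel-size "$TP")
fi

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  "${PARALLEL_ARGS[@]}" \
  --speculative-config "${SPECULATIVE_CONFIG}"
```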

Chat template

Per AGENTS.md ("MTP scripts must benchmark against chat-formatted inputs"), the bench passes --dsv4 to run_benchmark_serving, which auto-enables --use-chat-template and routes prompts through encoding_dsv4.py (PR #1153). The DSv4-Pro tokenizer ships without a jinja chat_template, so plain --use-chat-template would crash; --dsv4 sidesteps that.
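
For reference, a hedged sketch of the bench call; --dsv4 is the real flag from PR #1153, while the invocation and the other arguments are illustrative placeholders:

```bash
# --dsv4 auto-enables --use-chat-template and routes prompts through
# encoding_dsv4.py; without it, the missing jinja chat_template would crash
# plain --use-chat-template. Everything other than --dsv4 is a placeholder.
python run_benchmark_serving.py \
  --dsv4 \
  --num-prompts 128 \
  --max-concurrency "${CONC:-8}"
```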

Diff vs the non-MTP base script

Minimal — only the SPEC_TOKENS/SPECULATIVE_CONFIG declarations, --speculative-config on vllm serve, and --dsv4 on the bench client.

Test plan

  • First run on the B300 runner — confirm the deepseekv4-cu130 image carries the DSv4 MTP patch (registry + speculative.py); if older, vllm serve will reject model_type=deepseek_v4 at startup.
  • TP=8 conc 1-8 (1k1k) — sanity check MTP acceptance > 0 and tok/s improves over dsv4-fp4-b300-vllm.
  • TP=4 conc 4-32 (1k1k and 8k1k).
  • Compare acceptance / tok/s vs dsv4-fp4-b300-sglang-mtp for the same TP/conc.
  • If TP-only is healthy, follow up with DP-attn entries (TP=4 ep=4 dp-attn, conc 64-512).
  • Sweep SPEC_TOKENS=2,3 once the baseline lands (sketched below).
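
A sketch of that sweep, assuming the script path from the Summary and one launch per value (legal against n_predict=1, since every integer passes the divisibility check in speculative.py:594-606):

```bash
# Sweep SPEC_TOKENS after the SPEC_TOKENS=1 baseline lands; each value
# needs a server restart because --speculative-config is fixed at startup.
for SPEC_TOKENS in 2 3; do
  export SPEC_TOKENS
  bash benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh
done
```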

Update changelog PR link

After this PR is opened, update perf-changelog.yaml pr-link: …/pull/TBD placeholder to the actual PR number.

…e decoding)

The vllm/vllm-openai:deepseekv4-cu130 image registers DeepSeekV4MTPModel
(vllm/model_executor/models/deepseek_v4_mtp.py) and remaps
model_type=deepseek_v4 -> deepseek_mtp inside SpeculativeConfig
(vllm/config/speculative.py:248-253). DSv4-Pro publishes
num_nextn_predict_layers=1 so num_speculative_tokens=1 is the natural
default; vLLM reuses the head when SPEC_TOKENS > n_predict (must be
divisible).

Mirrors the coverage of dsv4-fp4-b300-sglang-mtp: two TP-only bands
(TP8 conc 1-8, TP4 conc 4-32) for both 1k1k and 8k1k. DP-attn
(TP=4 ep=4 / TP=8 ep=8) is wired in the script via DP_ATTENTION=true
but excluded from the initial sweep until TP-only numbers land.

Bench passes --dsv4 (auto-enables --use-chat-template) to route
prompts through encoding_dsv4.py -- required for honest MTP acceptance
numbers; random-token prompts silently regress acceptance.
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

