Add B200 config: dsv4-fp4-vllm (DeepSeek-V4-Pro) 🐋 🐋 #1127
functionstackx wants to merge 6 commits into main
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Often, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
    $PARALLEL_ARGS \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --max-model-len $MAX_MODEL_LEN \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --attention_config.use_fp4_indexer_cache=True \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
```
🔴 The new dsv4_fp4_b200.sh launches vllm serve without --no-enable-prefix-caching, but every other vLLM script in benchmarks/single_node/ that uses the random dataset (--random-range-ratio) sets that flag — including minimaxm2.5_fp4_b200.sh (line 53), which the PR description says this script mirrors. Without it, dsv4 sweep results will be biased upward (random prompts share short common prefixes, so prefix caching inflates throughput) and won't be comparable to other models in the matrix. One-line fix: append --no-enable-prefix-caching to the vllm serve invocation in the new script.
Extended reasoning
What the bug is
benchmarks/single_node/dsv4_fp4_b200.sh (lines 47–59) launches vllm serve with the recipe flags from the DeepSeek-V4 blog but omits --no-enable-prefix-caching. The script then calls run_benchmark_serving with --random-range-ratio "$RANDOM_RANGE_RATIO" (line 71), i.e. the random benchmark dataset. Because the server defaults to prefix caching enabled, any short prompt prefix that happens to repeat across the random samples will hit the prefix cache and inflate the measured throughput.
Why this is the established convention in this repo
A grep across benchmarks/single_node/ shows that 24 vLLM scripts that drive the random dataset all explicitly disable prefix caching. The closest sibling is the script the PR description says this one mirrors:
benchmarks/single_node/minimaxm2.5_fp4_b200.sh:53: --stream-interval 20 --no-enable-prefix-caching \\
The same flag appears in kimik2.5_fp4_b200.sh, kimik2.5_int4_b200.sh, minimaxm2.5_fp8_{h100,h200,b200,b300,mi300x,mi325x,mi355x}.sh, the gptoss FP4 vLLM scripts, etc. The TRT scripts use the equivalent enable_block_reuse: false and the dsr1 SGLang scripts use --disable-radix-cache. The convention was added explicitly:
- PR Disable prefix cache for kimi vllm configs #926: "Disable prefix caching (--no-enable-prefix-caching) for all Kimi K2.5 benchmarks using random datasets"
- PR [NVIDIA] Disable prefix minimax #966: "Disable prefix caching (--no-enable-prefix-caching) for all MiniMax benchmarks using random datasets"
Both changelog entries call out that the random dataset shares short common prefixes, which is why the flag must be set.
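A hedged way to reproduce that grep locally (filenames and flag spellings assumed from the excerpts above; run from the repository root):

```bash
# List vLLM launch scripts that drive the random dataset but never pass
# --no-enable-prefix-caching. The '--' stops grep from treating the
# leading-dash patterns as options.
cd benchmarks/single_node
grep -l -- '--random-range-ratio' *.sh \
  | xargs -r grep -L -- '--no-enable-prefix-caching'
```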
How the dsv4 script regressed
The PR description states the script "mirrors the minimaxm2.5_fp4_b200.sh DP-attn routing" and the parallelism block, env-var checks, and run_benchmark_serving invocation are essentially copied verbatim — including the --random-range-ratio argument. But --no-enable-prefix-caching, which was on line 53 of the source script, was dropped from the new vllm serve invocation.
Step-by-step proof the bug manifests
1. Operator runs the dsv4-fp4-b200-vllm config (e.g. { tp:8, ep:8, dp-attn:true, conc:1024 } from nvidia-master.yaml).
2. The runner exports RANDOM_RANGE_RATIO and invokes benchmarks/single_node/dsv4_fp4_b200.sh.
3. The script starts vllm serve without --no-enable-prefix-caching (lines 47–59 of the diff). vLLM defaults to prefix caching enabled.
4. run_benchmark_serving … --random-range-ratio "$RANDOM_RANGE_RATIO" --num-prompts "$((CONC * 10))" (lines 64–75) generates random prompts that share short common prefixes (the standard random dataset behavior the changelog entries above describe).
5. vLLM hits the prefix cache for those shared prefixes, so prefill latency drops and reported tokens/sec is inflated (see the sanity check after this list).
6. The corresponding minimax/kimi/gptoss runs explicitly disable that cache and report uninflated numbers, so dsv4 numbers will be systematically higher and not comparable to peers in the same matrix sweep.
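One way to sanity-check this on a live server (hedged: exact metric names vary across vLLM versions, so this just filters the Prometheus output for anything prefix-cache related):

```bash
# With prefix caching left enabled, a random-dataset sweep against the
# server should show nonzero prefix-cache activity in /metrics.
curl -s "http://localhost:${PORT}/metrics" | grep -i 'prefix'
```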
Fix
Add --no-enable-prefix-caching to the vllm serve argument list in benchmarks/single_node/dsv4_fp4_b200.sh — for example on the same line as --reasoning-parser deepseek_v4, matching the placement in minimaxm2.5_fp4_b200.sh:53. One-line change, no other behavior affected.
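A minimal sketch of the patched tail of the invocation, following the review's suggested placement (the rest of the command is unchanged):

```bash
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --no-enable-prefix-caching > $SERVER_LOG 2>&1 &
```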
New DeepSeek-V4-Pro vLLM B200 benchmark, per the recipe published at https://vllm.ai/blog/deepseek-v4. Uses the vllm/vllm-openai:deepseekv4-cu130 image against deepseek-ai/DeepSeek-V4-Pro. The 8xB200 recipe runs as DP=8 + expert parallelism with TP=1 per replica, FP8 KV cache, block size 256, and an FP4 indexer cache.

The search space uses a single entry per seq-len (tp=8, ep=8, dp-attn=true) so DP_ATTENTION=true routes into the DP path in the launch script. Launch flags per the recipe: --trust-remote-code, --kv-cache-dtype fp8, --block-size 256, --enable-expert-parallel, --data-parallel-size=$TP (=8), --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}', --attention_config.use_fp4_indexer_cache=True, --tokenizer-mode deepseek_v4, --tool-call-parser deepseek_v4, --enable-auto-tool-choice, --reasoning-parser deepseek_v4.

Configs: 1k1k conc 4-1024, 8k1k conc 4-512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add --no-enable-prefix-caching to the launch to match the other vLLM B200 benchmark scripts (gptoss, minimaxm2.5), which disable prefix caching to avoid cross-request cache hits skewing steady-state throughput numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the recipe, no --tensor-parallel-size flag is passed — vLLM shards via expert parallelism + data parallelism only. Drop the PARALLEL_ARGS branching (and the now-unused EP_SIZE / DP_ATTENTION env-var checks) and pass --enable-expert-parallel --data-parallel-size $TP directly. TP from the search space is still used by the runner for GPU allocation (and as the DP size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
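A sketch of the simplified launch this commit describes (variable names as in the script; the model-specific flags from the excerpt above are elided here for brevity):

```bash
# No --tensor-parallel-size: the recipe shards via EP + DP only.
# TP from the search space is reused as the data-parallel size (8).
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    --enable-expert-parallel \
    --data-parallel-size "$TP" \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 > "$SERVER_LOG" 2>&1 &
```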
process_changelog.py's additions-only check flagged the previous commit because I'd stripped the two trailing spaces on an unrelated '- config-keys: ' line while adding the dsv4 entry. Restore the original whitespace so the diff is pure additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
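A hedged local check for the additions-only constraint (the authoritative rule lives in process_changelog.py; the changelog path is assumed):

```bash
# Flag any removed or modified lines in the changelog diff. The pattern
# matches '-' followed by a non-dash or end-of-line, which catches removed
# content lines (including removed blanks) but skips the '---' file header.
git diff main -- perf-changelog.yaml | grep -E '^-([^-]|$)' \
  && echo "diff contains removals/modifications" \
  || echo "pure additions"
```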
Give the vllm engine up to an hour to finish startup/compilation on B200 before the client considers it unready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
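A minimal readiness-wait sketch along those lines (polling interval assumed; the actual runner helper may differ). The vLLM OpenAI server exposes a /health endpoint once it is up:

```bash
# Poll /health for up to one hour before giving up, to cover long
# B200 startup/compilation times.
deadline=$(( $(date +%s) + 3600 ))
until curl -sf "http://localhost:${PORT}/health" > /dev/null; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "vllm server not ready within 3600s" >&2
    exit 1
  fi
  sleep 10
done
```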
Mirror of #1146 for B200. Each model historically used one inference engine, so the b200 launchers just resolved benchmarks/single_node/${model}_${precision}_b200.sh regardless of FRAMEWORK. With dsv4 we now want both an sglang script (already on main as dsv4_fp4_b200.sh from #1131) and a vllm script (added by this PR as dsv4_fp4_b200_vllm.sh) to coexist.

- launch_b200-{nb,dgxc-slurm,cw}.sh prefer an engine-tagged script (e.g. dsv4_fp4_b200_vllm.sh) and fall back to the legacy unsuffixed name (or the existing _trt suffix) when the tagged variant is absent. Existing dsr1/glm5/qwen3.5/kimik2.5/minimaxm2.5/gptoss/dsv4-sglang b200 scripts keep their current names.
- This wires up the dsv4-fp4-b200-vllm config so FRAMEWORK=vllm resolves to dsv4_fp4_b200_vllm.sh instead of the sglang script that shares the unsuffixed path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
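A sketch of the fallback resolution described above (variable names hypothetical; the real launchers may differ, e.g. in how they handle the _trt suffix):

```bash
# Prefer the engine-tagged script; fall back to the legacy unsuffixed name.
tagged="benchmarks/single_node/${MODEL}_${PRECISION}_b200_${FRAMEWORK}.sh"
legacy="benchmarks/single_node/${MODEL}_${PRECISION}_b200.sh"
if [ -f "$tagged" ]; then
  script="$tagged"
else
  script="$legacy"
fi
echo "resolved launch script: $script"
```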
Force-pushed from d19b758 to ad0f7bd.
Superseded by #1156.
Summary
- New config dsv4-fp4-b200-vllm for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4.
- Uses vllm/vllm-openai:deepseekv4-cu130 against deepseek-ai/DeepSeek-V4-Pro. 8xB200 recipe: DP=8 + expert parallelism (TP=1/replica), FP8 KV cache, block size 256, FP4 indexer cache.
- benchmarks/single_node/dsv4_fp4_b200.sh mirrors the minimaxm2.5_fp4_b200.sh DP-attn routing (DP_ATTENTION=true → --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel) and passes the recipe's flags verbatim.

Recipe flags (Pro, 8xB200)
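Restated from the launch script excerpt and the first commit message above:

```
--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--enable-expert-parallel
--data-parallel-size $TP        # = 8; TP from the search space doubles as the DP size
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
--attention_config.use_fp4_indexer_cache=True
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
```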
Search space
- 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..1024 }
- 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..512 }

Test plan
- python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys dsv4-fp4-b200-vllm --config-files .github/configs/nvidia-master.yaml generates the expected matrix (exp-name dsv4_1k1k / dsv4_8k1k, dp-attn=true, ep=8, tp=8, correct conc ladders, max-model-len 2304 / 9472).
- bash -n benchmarks/single_node/dsv4_fp4_b200.sh passes syntax check.
- Config YAML parses (yaml.safe_load).
- perf-changelog.yaml appends; still to verify that benchmark + eval produce results on a B200 runner.

🤖 Generated with Claude Code