
Add B200 config: dsv4-fp4-vllm (DeepSeek-V4-Pro) 🐋 🐋#1127

Closed
functionstackx wants to merge 6 commits into main from claude/add-dsv4-fp4-b200-vllm

Conversation

@functionstackx
Contributor

Summary

  • Add new B200 vLLM config dsv4-fp4-b200-vllm for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4.
  • Uses vllm/vllm-openai:deepseekv4-cu130 against deepseek-ai/DeepSeek-V4-Pro. 8xB200 recipe: DP=8 + expert parallelism (TP=1/replica), FP8 KV cache, block size 256, FP4 indexer cache.
  • New launch script benchmarks/single_node/dsv4_fp4_b200.sh mirrors the minimaxm2.5_fp4_b200.sh DP-attn routing (when DP_ATTENTION=true: --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel) and passes the recipe's flags verbatim.

Recipe flags (Pro, 8xB200)

--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--enable-expert-parallel
--data-parallel-size 8
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
--attention_config.use_fp4_indexer_cache=True
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
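
One sanity check worth running on the flags above (a sketch, not part of the recipe): the --compilation-config value must be valid JSON, since vLLM parses it into a config dict, so a malformed quote or brace fails at server startup. This uses python3's stdlib json module as the validator:

```shell
# Validate the --compilation-config payload before launching the server.
# The literal below is copied from the recipe flags; any edit to it should
# re-pass this check.
raw='{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
echo "$raw" | python3 -c 'import json,sys; cfg=json.load(sys.stdin); print(cfg["cudagraph_mode"])'
```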

Search space

  • 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..1024 }
  • 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..512 }
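
As a rough illustration, the two search-space entries above might map to a config stanza shaped like this (hypothetical sketch — the key names follow the bullets, not necessarily the actual nvidia-master.yaml schema):

```yaml
# Hypothetical shape of the dsv4 search-space entries; the real
# nvidia-master.yaml layout may differ.
dsv4-fp4-b200-vllm:
  1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..1024 }
  8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..512 }
```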

Test plan

  • python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys dsv4-fp4-b200-vllm --config-files .github/configs/nvidia-master.yaml generates the expected matrix (exp-name dsv4_1k1k / dsv4_8k1k, dp-attn=true, ep=8, tp=8, correct conc ladders, max-model-len 2304 / 9472).
  • bash -n benchmarks/single_node/dsv4_fp4_b200.sh passes syntax check.
  • YAML files parse cleanly (yaml.safe_load).
  • Triggered a B200 sweep via perf-changelog.yaml appends; verify that benchmark + eval produce results on a B200 runner.
  • If any flag is unsupported by the image, update script to match the image's supported names and re-run.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@cquil11 changed the title Add B200 config: dsv4-fp4-vllm (DeepSeek-V4-Pro) Add B200 config: dsv4-fp4-vllm (DeepSeek-V4-Pro) 🐋 🐋 Apr 24, 2026
Comment thread benchmarks/single_node/dsv4_fp4_b200.sh Outdated
Comment on lines +47 to +59
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
$PARALLEL_ARGS \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--max-model-len $MAX_MODEL_LEN \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
Contributor

🔴 The new dsv4_fp4_b200.sh launches vllm serve without --no-enable-prefix-caching, but every other vLLM script in benchmarks/single_node/ that uses the random dataset (--random-range-ratio) sets that flag — including minimaxm2.5_fp4_b200.sh (line 53), which the PR description says this script mirrors. Without it, dsv4 sweep results will be biased upward (random prompts share short common prefixes, so prefix caching inflates throughput) and won't be comparable to other models in the matrix. One-line fix: append --no-enable-prefix-caching to the vllm serve invocation in the new script.

Extended reasoning

What the bug is

benchmarks/single_node/dsv4_fp4_b200.sh (lines 47–59) launches vllm serve with the recipe flags from the DeepSeek-V4 blog but omits --no-enable-prefix-caching. The script then calls run_benchmark_serving with --random-range-ratio "$RANDOM_RANGE_RATIO" (line 71), i.e. the random benchmark dataset. Because the server defaults to prefix caching enabled, any short prompt prefix that happens to repeat across the random samples will hit the prefix cache and inflate the measured throughput.

Why this is the established convention in this repo

A grep across benchmarks/single_node/ shows that all 24 vLLM scripts that drive the random dataset explicitly disable prefix caching. The closest sibling is the script the PR description says this one mirrors:

benchmarks/single_node/minimaxm2.5_fp4_b200.sh:53:    --stream-interval 20 --no-enable-prefix-caching \\

The same flag appears in kimik2.5_fp4_b200.sh, kimik2.5_int4_b200.sh, minimaxm2.5_fp8_{h100,h200,b200,b300,mi300x,mi325x,mi355x}.sh, the gptoss FP4 vLLM scripts, etc. The TRT scripts use the equivalent enable_block_reuse: false and the dsr1 SGLang scripts use --disable-radix-cache. The convention was added explicitly in the changelog.

Both changelog entries call out that the random dataset shares short common prefixes, which is why the flag must be set.

How the dsv4 script regressed

The PR description states the script "mirrors the minimaxm2.5_fp4_b200.sh DP-attn routing" and the parallelism block, env-var checks, and run_benchmark_serving invocation are essentially copied verbatim — including the --random-range-ratio argument. But --no-enable-prefix-caching, which was on line 53 of the source script, was dropped from the new vllm serve invocation.

Step-by-step proof the bug manifests

  1. Operator runs the dsv4-fp4-b200-vllm config (e.g. { tp:8, ep:8, dp-attn:true, conc:1024 } from nvidia-master.yaml).
  2. The runner exports RANDOM_RANGE_RATIO and invokes benchmarks/single_node/dsv4_fp4_b200.sh.
  3. The script starts vllm serve without --no-enable-prefix-caching (line 47–59 of the diff). vLLM defaults to prefix caching enabled.
  4. run_benchmark_serving … --random-range-ratio "$RANDOM_RANGE_RATIO" --num-prompts "$((CONC * 10))" (lines 64–75) generates random prompts that share short common prefixes (the standard random dataset behavior the changelog entries above describe).
  5. vLLM hits the prefix cache for those shared prefixes, so prefill latency drops and reported tokens/sec is inflated.
  6. The corresponding minimax/kimi/gptoss runs explicitly disable that cache and report uninflated numbers, so dsv4 numbers will be systematically higher and not comparable to peers in the same matrix sweep.

Fix

Add --no-enable-prefix-caching to the vllm serve argument list in benchmarks/single_node/dsv4_fp4_b200.sh — for example on the same line as --reasoning-parser deepseek_v4, matching the placement in minimaxm2.5_fp4_b200.sh:53. One-line change, no other behavior affected.
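
As a sketch, the one-line fix would look roughly like this (line placement assumed from the description above, matching minimaxm2.5_fp4_b200.sh):

```diff
---reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
+--reasoning-parser deepseek_v4 --no-enable-prefix-caching > $SERVER_LOG 2>&1 &
```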

functionstackx and others added 6 commits April 25, 2026 01:27
New DeepSeek-V4-Pro vLLM B200 benchmark, per the recipe published at
https://vllm.ai/blog/deepseek-v4. Uses the vllm/vllm-openai:deepseekv4-cu130
image against deepseek-ai/DeepSeek-V4-Pro.

The 8xB200 recipe runs as DP=8 + expert parallelism with TP=1 per
replica, FP8 KV cache, block size 256, and an FP4 indexer cache. The
search space uses a single entry per seq-len (tp=8, ep=8, dp-attn=true)
so DP_ATTENTION=true routes into the DP-path in the launch script.

Launch flags per the recipe: --trust-remote-code, --kv-cache-dtype fp8,
--block-size 256, --enable-expert-parallel, --data-parallel-size=$TP
(=8), --compilation-config
'{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}',
--attention_config.use_fp4_indexer_cache=True, --tokenizer-mode
deepseek_v4, --tool-call-parser deepseek_v4,
--enable-auto-tool-choice, --reasoning-parser deepseek_v4.

Configs: 1k1k conc 4-1024, 8k1k conc 4-512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add --no-enable-prefix-caching to the launch to match the
other vLLM B200 benchmark scripts (gptoss, minimaxm2.5),
which disable prefix caching to avoid cross-request cache
hits skewing steady-state throughput numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the recipe, no --tensor-parallel-size flag is passed — vLLM
shards via expert parallelism + data parallelism only. Drop the
PARALLEL_ARGS branching (and the now-unused EP_SIZE / DP_ATTENTION
env-var checks) and pass --enable-expert-parallel --data-parallel-size $TP
directly. TP from the search space is still used by the runner for
GPU allocation (and as the DP size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
process_changelog.py's additions-only check flagged the previous
commit because I'd stripped the two trailing spaces on an
unrelated '- config-keys:  ' line while adding the dsv4 entry.
Restore the original whitespace so the diff is pure additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Give the vllm engine up to an hour to finish startup/compilation on
B200 before the client considers it unready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of #1146 for B200. Each model historically used one inference
engine, so the b200 launchers just resolved
benchmarks/single_node/${model}_${precision}_b200.sh regardless of
FRAMEWORK. With dsv4 we now want both an sglang script (already on
main as dsv4_fp4_b200.sh from #1131) and a vllm script (added by this
PR as dsv4_fp4_b200_vllm.sh) to coexist.

- launch_b200-{nb,dgxc-slurm,cw}.sh prefer an engine-tagged script
  (e.g. dsv4_fp4_b200_vllm.sh) and fall back to the legacy unsuffixed
  name (or the existing _trt suffix) when the tagged variant is
  absent. Existing dsr1/glm5/qwen3.5/kimik2.5/minimaxm2.5/gptoss/dsv4-sglang
  b200 scripts keep their current names.
- This wires up the dsv4-fp4-b200-vllm config so FRAMEWORK=vllm
  resolves to dsv4_fp4_b200_vllm.sh instead of the sglang script
  that shares the unsuffixed path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
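
The prefer-tagged-then-fall-back resolution described in that commit can be sketched as follows (a minimal illustration — the function and variable names here are hypothetical, not the launchers' actual logic):

```shell
# Sketch of engine-tagged launcher resolution: prefer the
# ${model}_${precision}_b200_${framework}.sh variant when it exists,
# otherwise fall back to the legacy unsuffixed name.
resolve_script() {
  model=$1; precision=$2; framework=$3
  tagged="benchmarks/single_node/${model}_${precision}_b200_${framework}.sh"
  legacy="benchmarks/single_node/${model}_${precision}_b200.sh"
  if [ -f "$tagged" ]; then
    echo "$tagged"    # engine-tagged script exists, e.g. dsv4_fp4_b200_vllm.sh
  else
    echo "$legacy"    # legacy unsuffixed path shared by the sglang script
  fi
}

resolve_script dsv4 fp4 vllm
```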
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp4-b200-vllm branch from d19b758 to ad0f7bd Compare April 25, 2026 05:30
@functionstackx
Contributor Author

Superseded by #1156
