Add B200 config: dsv4-fp4-vllm (DeepSeek-V4-Pro) 🐋 🐋 #1127
functionstackx wants to merge 6 commits into main
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Often, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
    $PARALLEL_ARGS \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --max-model-len $MAX_MODEL_LEN \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --attention_config.use_fp4_indexer_cache=True \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
```
🔴 The new dsv4_fp4_b200.sh launches vllm serve without --no-enable-prefix-caching, but every other vLLM script in benchmarks/single_node/ that uses the random dataset (--random-range-ratio) sets that flag — including minimaxm2.5_fp4_b200.sh (line 53), which the PR description says this script mirrors. Without it, dsv4 sweep results will be biased upward (random prompts share short common prefixes, so prefix caching inflates throughput) and won't be comparable to other models in the matrix. One-line fix: append --no-enable-prefix-caching to the vllm serve invocation in the new script.
Extended reasoning
What the bug is
benchmarks/single_node/dsv4_fp4_b200.sh (lines 47–59) launches vllm serve with the recipe flags from the DeepSeek-V4 blog but omits --no-enable-prefix-caching. The script then calls run_benchmark_serving with --random-range-ratio "$RANDOM_RANGE_RATIO" (line 71), i.e. the random benchmark dataset. Because the server defaults to prefix caching enabled, any short prompt prefix that happens to repeat across the random samples will hit the prefix cache and inflate the measured throughput.
Why this is the established convention in this repo
A grep across benchmarks/single_node/ shows that 24 vLLM scripts that drive the random dataset all explicitly disable prefix caching. The closest sibling is the script the PR description says this one mirrors:
benchmarks/single_node/minimaxm2.5_fp4_b200.sh:53: --stream-interval 20 --no-enable-prefix-caching \\
The same flag appears in kimik2.5_fp4_b200.sh, kimik2.5_int4_b200.sh, minimaxm2.5_fp8_{h100,h200,b200,b300,mi300x,mi325x,mi355x}.sh, the gptoss FP4 vLLM scripts, etc. The TRT scripts use the equivalent enable_block_reuse: false and the dsr1 SGLang scripts use --disable-radix-cache. The convention was added explicitly:
- PR Disable prefix cache for kimi vllm configs #926: "Disable prefix caching (--no-enable-prefix-caching) for all Kimi K2.5 benchmarks using random datasets"
- PR [NVIDIA] Disable prefix minimax #966: "Disable prefix caching (--no-enable-prefix-caching) for all MiniMax benchmarks using random datasets"
Both changelog entries call out that the random dataset shares short common prefixes, which is why the flag must be set.
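A hedged way to reproduce that grep locally (filenames and flag spellings assumed from the excerpts above; run from the repository root):

```bash
# List vLLM launch scripts that drive the random dataset but never pass
# --no-enable-prefix-caching. The '--' stops grep from treating the
# leading-dash patterns as options.
cd benchmarks/single_node
grep -l -- '--random-range-ratio' *.sh \
  | xargs -r grep -L -- '--no-enable-prefix-caching'
```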
How the dsv4 script regressed
The PR description states the script "mirrors the minimaxm2.5_fp4_b200.sh DP-attn routing" and the parallelism block, env-var checks, and run_benchmark_serving invocation are essentially copied verbatim — including the --random-range-ratio argument. But --no-enable-prefix-caching, which was on line 53 of the source script, was dropped from the new vllm serve invocation.
Step-by-step proof the bug manifests
1. Operator runs the dsv4-fp4-b200-vllm config (e.g. { tp:8, ep:8, dp-attn:true, conc:1024 } from nvidia-master.yaml).
2. The runner exports RANDOM_RANGE_RATIO and invokes benchmarks/single_node/dsv4_fp4_b200.sh.
3. The script starts vllm serve without --no-enable-prefix-caching (lines 47–59 of the diff). vLLM defaults to prefix caching enabled.
4. run_benchmark_serving … --random-range-ratio "$RANDOM_RANGE_RATIO" --num-prompts "$((CONC * 10))" (lines 64–75) generates random prompts that share short common prefixes (the standard random dataset behavior the changelog entries above describe).
5. vLLM hits the prefix cache for those shared prefixes, so prefill latency drops and reported tokens/sec is inflated (see the sanity check after this list).
6. The corresponding minimax/kimi/gptoss runs explicitly disable that cache and report uninflated numbers, so dsv4 numbers will be systematically higher and not comparable to peers in the same matrix sweep.
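One way to sanity-check this on a live server (hedged: exact metric names vary across vLLM versions, so this just filters the Prometheus output for anything prefix-cache related):

```bash
# With prefix caching left enabled, a random-dataset sweep against the
# server should show nonzero prefix-cache activity in /metrics.
curl -s "http://localhost:${PORT}/metrics" | grep -i 'prefix'
```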
Fix
Add --no-enable-prefix-caching to the vllm serve argument list in benchmarks/single_node/dsv4_fp4_b200.sh — for example on the same line as --reasoning-parser deepseek_v4, matching the placement in minimaxm2.5_fp4_b200.sh:53. One-line change, no other behavior affected.
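A minimal sketch of the patched tail of the invocation, following the review's suggested placement (the rest of the command is unchanged):

```bash
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --no-enable-prefix-caching > $SERVER_LOG 2>&1 &
```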
New DeepSeek-V4-Pro vLLM B200 benchmark, per the recipe published at https://vllm.ai/blog/deepseek-v4. Uses the vllm/vllm-openai:deepseekv4-cu130 image against deepseek-ai/DeepSeek-V4-Pro. The 8xB200 recipe runs as DP=8 + expert parallelism with TP=1 per replica, FP8 KV cache, block size 256, and an FP4 indexer cache.

The search space uses a single entry per seq-len (tp=8, ep=8, dp-attn=true) so DP_ATTENTION=true routes into the DP path in the launch script. Launch flags per the recipe: --trust-remote-code, --kv-cache-dtype fp8, --block-size 256, --enable-expert-parallel, --data-parallel-size=$TP (=8), --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}', --attention_config.use_fp4_indexer_cache=True, --tokenizer-mode deepseek_v4, --tool-call-parser deepseek_v4, --enable-auto-tool-choice, --reasoning-parser deepseek_v4.

Configs: 1k1k conc 4-1024, 8k1k conc 4-512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add --no-enable-prefix-caching to the launch to match the other vLLM B200 benchmark scripts (gptoss, minimaxm2.5), which disable prefix caching to avoid cross-request cache hits skewing steady-state throughput numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the recipe, no --tensor-parallel-size flag is passed — vLLM shards via expert parallelism + data parallelism only. Drop the PARALLEL_ARGS branching (and the now-unused EP_SIZE / DP_ATTENTION env-var checks) and pass --enable-expert-parallel --data-parallel-size $TP directly. TP from the search space is still used by the runner for GPU allocation (and as the DP size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
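A sketch of the simplified launch this commit describes (variable names as in the script; the model-specific flags from the excerpt above are elided here for brevity):

```bash
# No --tensor-parallel-size: the recipe shards via EP + DP only.
# TP from the search space is reused as the data-parallel size (8).
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    --enable-expert-parallel \
    --data-parallel-size "$TP" \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 > "$SERVER_LOG" 2>&1 &
```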
process_changelog.py's additions-only check flagged the previous commit because I'd stripped the two trailing spaces on an unrelated '- config-keys: ' line while adding the dsv4 entry. Restore the original whitespace so the diff is pure additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
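A hedged local check for the additions-only constraint (the authoritative rule lives in process_changelog.py; the changelog path is assumed):

```bash
# Flag any removed or modified lines in the changelog diff. The pattern
# matches '-' followed by a non-dash or end-of-line, which catches removed
# content lines (including removed blanks) but skips the '---' file header.
git diff main -- perf-changelog.yaml | grep -E '^-([^-]|$)' \
  && echo "diff contains removals/modifications" \
  || echo "pure additions"
```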
Give the vllm engine up to an hour to finish startup/compilation on B200 before the client considers it unready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
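A minimal readiness-wait sketch along those lines (polling interval assumed; the actual runner helper may differ). The vLLM OpenAI server exposes a /health endpoint once it is up:

```bash
# Poll /health for up to one hour before giving up, to cover long
# B200 startup/compilation times.
deadline=$(( $(date +%s) + 3600 ))
until curl -sf "http://localhost:${PORT}/health" > /dev/null; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "vllm server not ready within 3600s" >&2
    exit 1
  fi
  sleep 10
done
```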
Mirror of #1146 for B200. Each model historically used one inference engine, so the b200 launchers just resolved benchmarks/single_node/${model}_${precision}_b200.sh regardless of FRAMEWORK. With dsv4 we now want both an sglang script (already on main as dsv4_fp4_b200.sh from #1131) and a vllm script (added by this PR as dsv4_fp4_b200_vllm.sh) to coexist.

- launch_b200-{nb,dgxc-slurm,cw}.sh prefer an engine-tagged script (e.g. dsv4_fp4_b200_vllm.sh) and fall back to the legacy unsuffixed name (or the existing _trt suffix) when the tagged variant is absent. Existing dsr1/glm5/qwen3.5/kimik2.5/minimaxm2.5/gptoss/dsv4-sglang b200 scripts keep their current names.
- This wires up the dsv4-fp4-b200-vllm config so FRAMEWORK=vllm resolves to dsv4_fp4_b200_vllm.sh instead of the sglang script that shares the unsuffixed path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
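A sketch of the fallback resolution described above (variable names hypothetical; the real launchers may differ, e.g. in how they handle the _trt suffix):

```bash
# Prefer the engine-tagged script; fall back to the legacy unsuffixed name.
tagged="benchmarks/single_node/${MODEL}_${PRECISION}_b200_${FRAMEWORK}.sh"
legacy="benchmarks/single_node/${MODEL}_${PRECISION}_b200.sh"
if [ -f "$tagged" ]; then
  script="$tagged"
else
  script="$legacy"
fi
echo "resolved launch script: $script"
```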
Force-pushed from d19b758 to ad0f7bd.
Superseded by #1156.
Summary
- New config dsv4-fp4-b200-vllm for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4.
- Uses vllm/vllm-openai:deepseekv4-cu130 against deepseek-ai/DeepSeek-V4-Pro. 8xB200 recipe: DP=8 + expert parallelism (TP=1/replica), FP8 KV cache, block size 256, FP4 indexer cache.
- benchmarks/single_node/dsv4_fp4_b200.sh mirrors the minimaxm2.5_fp4_b200.sh DP-attn routing (DP_ATTENTION=true → --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel) and passes the recipe's flags verbatim.

Recipe flags (Pro, 8xB200)
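Restated from the launch script excerpt and the first commit message above:

```
--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--enable-expert-parallel
--data-parallel-size $TP        # = 8; TP from the search space doubles as the DP size
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
--attention_config.use_fp4_indexer_cache=True
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
```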
Search space
- 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..1024 }
- 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..512 }

Test plan
- python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys dsv4-fp4-b200-vllm --config-files .github/configs/nvidia-master.yaml generates the expected matrix (exp-name dsv4_1k1k / dsv4_8k1k, dp-attn=true, ep=8, tp=8, correct conc ladders, max-model-len 2304 / 9472).
- bash -n benchmarks/single_node/dsv4_fp4_b200.sh passes syntax check.
- Config YAML parses (yaml.safe_load).
- perf-changelog.yaml appends; still to verify that benchmark + eval produce results on a B200 runner.

🤖 Generated with Claude Code