Add H200 config: dsv4-fp8-vllm (DeepSeek-V4-Pro) #1130
Conversation
Port the DeepSeek-V4-Pro vLLM recipe to H200 per https://vllm.ai/blog/deepseek-v4. Uses the cu129 image and omits the FP4 indexer cache flag (H200 has no FP4 path). Max-model-len is pinned at 800k per the recipe. Prefix caching is disabled (matching the B200/B300 configs and the user's note), and VLLM_ENGINE_READY_TIMEOUT_S is bumped to 1200s to tolerate slow weight loading.

Launch: EP + DP=$TP (no --tensor-parallel-size), FP8 KV cache, block size 256, max-model-len 800000, prefix caching disabled, deepseek_v4 tokenizer/tool-call/reasoning parsers.

Configs: 1k1k conc 4-64, 8k1k conc 4-64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
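For reference, the launch described above likely reduces to something like the sketch below. This assumes standard vLLM serve flags and a sweep-provided `$TP` variable; the authoritative command lives in `benchmarks/single_node/dsv4_fp8_h200.sh`, and the tokenizer wiring is omitted here.

```bash
# Hedged sketch of the serve invocation implied by the description; the
# real script in benchmarks/single_node/dsv4_fp8_h200.sh is authoritative.
export VLLM_ENGINE_READY_TIMEOUT_S=1200  # raised to 3600 in a later commit

# EP + DP=$TP: expert parallelism plus data-parallel attention, with no
# --tensor-parallel-size passed. $TP is the sweep-provided parallelism (8).
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --host 0.0.0.0 \
  --data-parallel-size "$TP" \
  --enable-expert-parallel \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 800000 \
  --no-enable-prefix-caching \
  --tool-call-parser deepseek_v4 \
  --reasoning-parser deepseek_v4
```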
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM — straightforward H200 vLLM benchmark config addition for DeepSeek-V4-Pro, mirrors the established single-node pattern.
Extended reasoning...
Overview
This PR adds a new H200 vLLM benchmark configuration (dsv4-fp8-h200-vllm) for DeepSeek-V4-Pro: a new entry in .github/configs/nvidia-master.yaml, a new self-contained launch script benchmarks/single_node/dsv4_fp8_h200.sh, and a corresponding perf-changelog.yaml entry. Companion PRs (#1127 for B200, #1128 for B300) cover the same recipe on other hardware.
Security risks
None. This is benchmark/config plumbing — no auth, crypto, secrets, network exposure, or user-input handling. The shell script binds vLLM to 0.0.0.0 inside the runner container as is standard for every other single-node script in this directory.
Level of scrutiny
Low. This is a config-only addition: nvidia-master.yaml gets a new isolated key, perf-changelog gets a pure addition, and the launch script is brand new (so cannot regress existing benchmarks). The script structure (check_env_vars, start_gpu_monitor, wait_for_server_ready, run_benchmark_serving, run_eval) matches the established pattern used by sibling scripts like dsr1_fp8_h200.sh.
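For readers unfamiliar with that pattern, the sibling scripts share roughly this shape. This is a hypothetical skeleton: the helper names come from the review above, but their definitions and the `common.sh` location are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton of the single-node script pattern described above.
# The helpers (check_env_vars, start_gpu_monitor, wait_for_server_ready,
# run_benchmark_serving, run_eval) are assumed to come from shared code.
set -euo pipefail
source "$(dirname "$0")/common.sh"    # assumed location of shared helpers

check_env_vars MODEL TP PORT          # fail fast on missing sweep inputs
start_gpu_monitor                     # background GPU telemetry for the run

vllm serve "$MODEL" --port "$PORT" &  # launch flags elided; see the script
SERVER_PID=$!

wait_for_server_ready                 # honors VLLM_ENGINE_READY_TIMEOUT_S
run_benchmark_serving                 # the latency/throughput sweep
run_eval                              # correctness spot-check

kill "$SERVER_PID"
```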
Other factors
- The PR description includes verified test outputs (`generate_sweep_configs.py` expansion, `bash -n` syntax check, YAML parse) and explicitly flags the H200 sweep run as still pending — appropriate transparency.
- The `pr-link: ...pull/XXXX` placeholder in perf-changelog is consistent with many existing entries in the file.
- No bugs reported by the bug hunting system.
VLLM_ENGINE_READY_TIMEOUT_S 1200 -> 3600. Matches the B300 config; DeepSeek-V4-Pro weight loading was tripping the 20-min gate during sweeps. Also update the changelog entry text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
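For context on what the gate does, a readiness wait of this shape would consume the variable roughly as follows. This is purely illustrative: the endpoint, port, and polling cadence are assumptions, not the repo's actual implementation.

```bash
# Illustrative only: how a readiness gate can honor the env var. The real
# wait_for_server_ready may differ in endpoint, port, and polling cadence.
wait_for_server_ready() {
  local timeout="${VLLM_ENGINE_READY_TIMEOUT_S:-600}"  # default 600s gate
  local start
  start=$(date +%s)
  # vLLM's OpenAI-compatible server exposes /health once the engine is up.
  until curl -sf "http://localhost:8000/health" >/dev/null; do
    if (( $(date +%s) - start > timeout )); then
      echo "Server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 10
  done
}
```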
cquil11
left a comment
LGTM 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋 🐋
Summary
- Adds `dsv4-fp8-h200-vllm` for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4.
- Uses `vllm/vllm-openai:deepseekv4-cu129` (cu129 for H200, vs cu130 for B200/B300) against `deepseek-ai/DeepSeek-V4-Pro`.
- `--attention_config.use_fp4_indexer_cache` is omitted. Max-model-len pinned at 800k per the recipe.
- Launch script: `benchmarks/single_node/dsv4_fp8_h200.sh`.
- `VLLM_ENGINE_READY_TIMEOUT_S=1200` so the large-weight load doesn't trip the default 600s gate.

Companion PRs
- #1127 (B200)
- #1128 (B300)
Recipe flags
- EP enabled, DP=$TP (no `--tensor-parallel-size`)
- FP8 KV cache
- Block size 256
- Max-model-len 800000
- Prefix caching disabled
- `deepseek_v4` tokenizer / tool-call / reasoning parsers
Search space
- 1k1k: `{ tp: 8, ep: 8, dp-attn: true, conc: 4..64 }`
- 8k1k: `{ tp: 8, ep: 8, dp-attn: true, conc: 4..64 }`

Test plan
- `generate_sweep_configs.py test-config --config-keys dsv4-fp8-h200-vllm` expands to 10 entries (exp-name `dsv4_1k1k`/`dsv4_8k1k`, runner `h200`, tp=8, ep=8, dp-attn=true, conc 4-64).
- `bash -n benchmarks/single_node/dsv4_fp8_h200.sh` passes.
- `perf-changelog.yaml` diff vs main is pure additions.

🤖 Generated with Claude Code