
Add H200 config: dsv4-fp8-vllm (DeepSeek-V4-Pro) #1130

Merged
functionstackx merged 4 commits into main from claude/add-dsv4-fp8-h200-vllm on Apr 24, 2026

Conversation

@functionstackx
Contributor

Summary

  • Add a new H200 vLLM config, dsv4-fp8-h200-vllm, for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4.
  • Uses vllm/vllm-openai:deepseekv4-cu129 (cu129 for H200, vs cu130 for B200/B300) against deepseek-ai/DeepSeek-V4-Pro.
  • H200 has no FP4 path, so --attention_config.use_fp4_indexer_cache is omitted. Max-model-len pinned at 800k per the recipe.
  • New launch script benchmarks/single_node/dsv4_fp8_h200.sh.
  • Prefix caching disabled; VLLM_ENGINE_READY_TIMEOUT_S=1200 so the large-weight load doesn't trip the default 600s gate (sketched below).
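
A minimal sketch of how that override might sit in the launch script (the placement and surrounding lines are assumptions, not the actual script contents):

    # benchmarks/single_node/dsv4_fp8_h200.sh (hypothetical excerpt)
    # give the large FP8 weight load time to finish before the readiness gate fires
    export VLLM_ENGINE_READY_TIMEOUT_S=1200   # default gate is 600s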

Companion PRs

  • #1127 (same recipe on B200)
  • #1128 (same recipe on B300)

Recipe flags

--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--no-enable-prefix-caching
--enable-expert-parallel
--data-parallel-size $TP     # $TP = 8 from search space
--max-model-len 800000
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
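
Assembled into a serve invocation, the recipe might look roughly like this (a sketch, not the actual launch script; $TP is supplied by the sweep):

    TP=8   # from the search space
    vllm serve deepseek-ai/DeepSeek-V4-Pro \
        --trust-remote-code \
        --kv-cache-dtype fp8 \
        --block-size 256 \
        --no-enable-prefix-caching \
        --enable-expert-parallel \
        --data-parallel-size "$TP" \
        --max-model-len 800000 \
        --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
        --tokenizer-mode deepseek_v4 \
        --tool-call-parser deepseek_v4 \
        --enable-auto-tool-choice \
        --reasoning-parser deepseek_v4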

Search space

  • 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..64 }
  • 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..64 }
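
Given the 10-entry expansion noted in the test plan below, the concurrency range presumably steps in powers of two; a sketch of the cross product (the doubling steps are an assumption):

    for exp in dsv4_1k1k dsv4_8k1k; do
      for conc in 4 8 16 32 64; do   # assumed: 2 workloads x 5 concurrencies = 10 entries
        echo "exp-name=$exp runner=h200 tp=8 ep=8 dp-attn=true conc=$conc"
      done
    done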

Test plan

  • generate_sweep_configs.py test-config --config-keys dsv4-fp8-h200-vllm expands to 10 entries (exp-name dsv4_1k1k/dsv4_8k1k, runner h200, tp=8, ep=8, dp-attn=true, conc 4-64).
  • bash -n benchmarks/single_node/dsv4_fp8_h200.sh passes.
  • YAML files parse; perf-changelog.yaml diff vs main is pure additions.
  • Run the triggered sweep on an H200 runner — verify the server launches within the 20-minute timeout and benchmark + eval produce results.
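
For convenience, the local checks from the plan as shell commands (the exact invocations, including the YAML-parse one-liner, are assumptions rather than repo-verified):

    python generate_sweep_configs.py test-config --config-keys dsv4-fp8-h200-vllm
    bash -n benchmarks/single_node/dsv4_fp8_h200.sh
    python -c 'import yaml; yaml.safe_load(open("perf-changelog.yaml"))'   # assumed parse check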

🤖 Generated with Claude Code

Port the DeepSeek-V4-Pro vLLM recipe to H200 per
https://vllm.ai/blog/deepseek-v4. Uses the cu129 image and omits the
FP4 indexer cache flag (H200 has no FP4 path). Max-model-len is pinned
at 800k per the recipe. Prefix caching is disabled (matches the
B200/B300 configs and the user's note) and VLLM_ENGINE_READY_TIMEOUT_S
is bumped to 1200s to tolerate slow weight loading.

Launch: EP + DP=$TP (no --tensor-parallel-size), FP8 KV cache,
block size 256, max-model-len 800000, prefix caching disabled,
deepseek_v4 tokenizer/tool-call/reasoning parsers.

Configs: 1k1k conc 4-64, 8k1k conc 4-64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs fixes them; if you re-run failed jobs, you are still responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

Contributor

@claude (bot) left a comment


LGTM — straightforward H200 vLLM benchmark config addition for DeepSeek-V4-Pro, mirrors the established single-node pattern.

Extended reasoning

Overview

This PR adds a new H200 vLLM benchmark configuration (dsv4-fp8-h200-vllm) for DeepSeek-V4-Pro: a new entry in .github/configs/nvidia-master.yaml, a new self-contained launch script benchmarks/single_node/dsv4_fp8_h200.sh, and a corresponding perf-changelog.yaml entry. Companion PRs (#1127 for B200, #1128 for B300) cover the same recipe on other hardware.

Security risks

None. This is benchmark/config plumbing — no auth, crypto, secrets, network exposure, or user-input handling. The shell script binds vLLM to 0.0.0.0 inside the runner container, as is standard for every other single-node script in this directory.

Level of scrutiny

Low. This is a config-only addition: nvidia-master.yaml gets a new isolated key, perf-changelog gets a pure addition, and the launch script is brand new (so cannot regress existing benchmarks). The script structure (check_env_vars, start_gpu_monitor, wait_for_server_ready, run_benchmark_serving, run_eval) matches the established pattern used by sibling scripts like dsr1_fp8_h200.sh.
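
As a sketch of that established pattern (the helper bodies and exact ordering are assumed from the names alone, not read from the script):

    #!/usr/bin/env bash
    set -euo pipefail
    # helper functions are assumed to be sourced from a shared benchmarking library

    check_env_vars          # fail fast if required sweep variables are unset
    start_gpu_monitor       # background GPU-utilization logging

    # launch the vLLM server with the recipe flags, then block until it answers
    wait_for_server_ready

    run_benchmark_serving   # throughput/latency benchmark against the endpoint
    run_eval                # accuracy eval against the same endpoint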

Other factors

  • The PR description includes verified test outputs (generate_sweep_configs.py expansion, bash -n syntax check, YAML parse) and explicitly flags the H200 sweep run as still pending — appropriate transparency.
  • The pr-link: ...pull/XXXX placeholder in perf-changelog is consistent with many existing entries in the file.
  • No bugs reported by the bug hunting system.

Bump VLLM_ENGINE_READY_TIMEOUT_S from 1200 to 3600. This matches the B300
config; DeepSeek-V4-Pro weight loading was tripping the 20-minute
gate during sweeps. Also update the changelog entry text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@cquil11 left a comment


LGTM 🐋🐋🐋

@functionstackx merged commit 6f3c1c0 into main on Apr 24, 2026
24 of 31 checks passed
@functionstackx deleted the claude/add-dsv4-fp8-h200-vllm branch on April 24, 2026 at 15:11