
Add H200 config: dsv4-fp8-vllm (DeepSeek-V4-Pro) #1130

Merged
functionstackx merged 4 commits into main from claude/add-dsv4-fp8-h200-vllm on Apr 24, 2026

Conversation

@functionstackx
Contributor

Summary

  • Add a new H200 vLLM config, dsv4-fp8-h200-vllm, for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4.
  • Uses vllm/vllm-openai:deepseekv4-cu129 (cu129 for H200, vs cu130 for B200/B300) against deepseek-ai/DeepSeek-V4-Pro.
  • H200 has no FP4 path, so --attention_config.use_fp4_indexer_cache is omitted. Max-model-len pinned at 800k per the recipe.
  • New launch script benchmarks/single_node/dsv4_fp8_h200.sh.
  • Prefix caching disabled; VLLM_ENGINE_READY_TIMEOUT_S=1200 so the large-weight load doesn't trip the default 600s gate (sketched below).
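
A minimal sketch of how that override might sit in the launch script (the placement and surrounding lines are assumptions, not the actual script contents):

    # benchmarks/single_node/dsv4_fp8_h200.sh (hypothetical excerpt)
    # give the large FP8 weight load time to finish before the readiness gate fires
    export VLLM_ENGINE_READY_TIMEOUT_S=1200   # default gate is 600s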

Companion PRs

  • #1127 (same recipe on B200)
  • #1128 (same recipe on B300)

Recipe flags

--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--no-enable-prefix-caching
--enable-expert-parallel
--data-parallel-size $TP     # $TP = 8 from search space
--max-model-len 800000
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
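
Assembled into a serve invocation, the recipe might look roughly like this (a sketch, not the actual launch script; $TP is supplied by the sweep):

    TP=8   # from the search space
    vllm serve deepseek-ai/DeepSeek-V4-Pro \
        --trust-remote-code \
        --kv-cache-dtype fp8 \
        --block-size 256 \
        --no-enable-prefix-caching \
        --enable-expert-parallel \
        --data-parallel-size "$TP" \
        --max-model-len 800000 \
        --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
        --tokenizer-mode deepseek_v4 \
        --tool-call-parser deepseek_v4 \
        --enable-auto-tool-choice \
        --reasoning-parser deepseek_v4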

Search space

  • 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..64 }
  • 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..64 }
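
Given the 10-entry expansion noted in the test plan below, the concurrency range presumably steps in powers of two; a sketch of the cross product (the doubling steps are an assumption):

    for exp in dsv4_1k1k dsv4_8k1k; do
      for conc in 4 8 16 32 64; do   # assumed: 2 workloads x 5 concurrencies = 10 entries
        echo "exp-name=$exp runner=h200 tp=8 ep=8 dp-attn=true conc=$conc"
      done
    done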

Test plan

  • generate_sweep_configs.py test-config --config-keys dsv4-fp8-h200-vllm expands to 10 entries (exp-name dsv4_1k1k/dsv4_8k1k, runner h200, tp=8, ep=8, dp-attn=true, conc 4-64).
  • bash -n benchmarks/single_node/dsv4_fp8_h200.sh passes.
  • YAML files parse; perf-changelog.yaml diff vs main is pure additions.
  • Run the triggered sweep on an H200 runner — verify the server launches within the 20-minute timeout and benchmark + eval produce results.
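
For convenience, the local checks from the plan as shell commands (the exact invocations, including the YAML-parse one-liner, are assumptions rather than repo-verified):

    python generate_sweep_configs.py test-config --config-keys dsv4-fp8-h200-vllm
    bash -n benchmarks/single_node/dsv4_fp8_h200.sh
    python -c 'import yaml; yaml.safe_load(open("perf-changelog.yaml"))'   # assumed parse check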

🤖 Generated with Claude Code

Port the DeepSeek-V4-Pro vLLM recipe to H200 per
https://vllm.ai/blog/deepseek-v4. Uses the cu129 image and omits the
FP4 indexer cache flag (H200 has no FP4 path). Max-model-len is pinned
at 800k per the recipe. Prefix caching is disabled (matches the
B200/B300 configs and the user's note) and VLLM_ENGINE_READY_TIMEOUT_S
is bumped to 1200s to tolerate slow weight loading.

Launch: EP + DP=$TP (no --tensor-parallel-size), FP8 KV cache,
block size 256, max-model-len 800000, prefix caching disabled,
deepseek_v4 tokenizer/tool-call/reasoning parsers.

Configs: 1k1k conc 4-64, 8k1k conc 4-64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs fixes them; if you re-run failed jobs, you are still responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

Contributor

@claude (bot) left a comment


LGTM — straightforward H200 vLLM benchmark config addition for DeepSeek-V4-Pro, mirrors the established single-node pattern.

Extended reasoning

Overview

This PR adds a new H200 vLLM benchmark configuration (dsv4-fp8-h200-vllm) for DeepSeek-V4-Pro: a new entry in .github/configs/nvidia-master.yaml, a new self-contained launch script benchmarks/single_node/dsv4_fp8_h200.sh, and a corresponding perf-changelog.yaml entry. Companion PRs (#1127 for B200, #1128 for B300) cover the same recipe on other hardware.

Security risks

None. This is benchmark/config plumbing — no auth, crypto, secrets, network exposure, or user-input handling. The shell script binds vLLM to 0.0.0.0 inside the runner container, as is standard for every other single-node script in this directory.

Level of scrutiny

Low. This is a config-only addition: nvidia-master.yaml gets a new isolated key, perf-changelog gets a pure addition, and the launch script is brand new (so cannot regress existing benchmarks). The script structure (check_env_vars, start_gpu_monitor, wait_for_server_ready, run_benchmark_serving, run_eval) matches the established pattern used by sibling scripts like dsr1_fp8_h200.sh.
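
As a sketch of that established pattern (the helper bodies and exact ordering are assumed from the names alone, not read from the script):

    #!/usr/bin/env bash
    set -euo pipefail
    # helper functions are assumed to be sourced from a shared benchmarking library

    check_env_vars          # fail fast if required sweep variables are unset
    start_gpu_monitor       # background GPU-utilization logging

    # launch the vLLM server with the recipe flags, then block until it answers
    wait_for_server_ready

    run_benchmark_serving   # throughput/latency benchmark against the endpoint
    run_eval                # accuracy eval against the same endpoint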

Other factors

  • The PR description includes verified test outputs (generate_sweep_configs.py expansion, bash -n syntax check, YAML parse) and explicitly flags the H200 sweep run as still pending — appropriate transparency.
  • The pr-link: ...pull/XXXX placeholder in perf-changelog is consistent with many existing entries in the file.
  • No bugs reported by the bug hunting system.

Bump VLLM_ENGINE_READY_TIMEOUT_S from 1200 to 3600. This matches the B300
config; DeepSeek-V4-Pro weight loading was tripping the 20-minute
gate during sweeps. Also update the changelog entry text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@cquil11 left a comment


LGTM 🐋🐋🐋

@functionstackx merged commit 6f3c1c0 into main on Apr 24, 2026
24 of 31 checks passed
@functionstackx deleted the claude/add-dsv4-fp8-h200-vllm branch on April 24, 2026 at 15:11