Add H100 config: dsv4-fp8-dynamo-vllm (DeepSeek-V4-Pro multinode disagg) #1142
Oseltamivir wants to merge 17 commits into main
Conversation
Port the DSV4-Pro vLLM recipe from single-node H200 to H100 as multinode
disaggregated serving via Dynamo. The ~862 GB FP8 weights don't fit on one
8xH100-80GB node (640 GB), so each side must own >=2 nodes; with the
h100-multinode pool at 4 nodes, 1P+1D with two nodes per side (DP16/EP16,
32 H100s total) is the minimum viable shape and fills the pool exactly.
Engine flags match the single-node H200 recipe: deepseek_v4 tokenizer,
tool-call, and reasoning parsers; FP8 KV cache; block size 256; prefix
caching disabled; compilation mode 0 with FULL_DECODE_ONLY cudagraph.
max-model-len is capped at 16384 (H200's 800k does not fit KV across two
80GB decode nodes). Keeps H100-tuned knobs from the DSR1 vLLM recipe:
VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,low_latency} all2all
backends, NixlConnector P<->D KV transfer, VLLM_USE_DEEP_GEMM, dynamo 1.0.1.
srt-slurm recipes are bundled locally at benchmarks/multi_node/srt_slurm_recipes/
and overlaid onto the srt-slurm clone at runtime. This is temporary until
the recipes can be upstreamed to NVIDIA/srt-slurm.
Changes:
- recipes: benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/
{1k1k,8k1k}/disagg-h100-fp8-1p1d-dep16-dep16.yaml
- runner: launch_h100-dgxc-slurm.sh gains a dynamo-vllm framework branch
(dsv4-fp8 model path at /mnt/nfs/lustre/models/dsv4-fp8, vLLM container
squash mapping, srtslurm.yaml dynamo-vllm alias) and an unconditional
local-recipes overlay after the srt-slurm checkout
- master: .github/configs/nvidia-master.yaml adds dsv4-fp8-h100-dynamo-vllm
with 1k1k conc [4,8,16,32,64,128] and 8k1k conc [4,8,16,32,64]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep 24909864822 had all three multinode jobs fail in 6s with ExitCode=1:0 and no sweep_JOBID.log written, leaving no usable diagnostic in the CI artifact.
Two defensive changes:
1. mkdir -p outputs/$JOB_ID/logs before polling, so Slurm's
   #SBATCH --output=outputs/%j/logs/sweep_%j.log directive can open the target
   file even when the compute-node stepd lacks permission to create the parent
   dir on NFS.
2. On the "job failed before creating log file" path, tar outputs/$JOB_ID/
   (sbatch_script.sh, config.yaml, any partial log, and the scontrol dump) into
   multinode_server_logs.tar.gz so the CI artifact captures what was submitted
   and why Slurm exited early. Previously, exit 1 ran before the tar step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 1142's first real sweep hit "ModuleNotFoundError: No module named
'vllm.inputs.data'" on all three multinode jobs. Same error as PR 1129
on GB200.
Root cause: ai-dynamo 1.0.1 (installed by NVIDIA/srt-slurm@sa-submission-q2-2026
via `dynamo: { version: 1.0.1 }`) imports vllm.inputs.data.TokensPrompt,
a path removed in the DSV4 vLLM wheel. Dynamo workers crash during
import before any vLLM flag matters.
Fix, mirroring PR 1129:
- launch_h100-dgxc-slurm.sh: override srt-slurm clone URL/ref via
SRT_SLURM_REPO_URL and SRT_SLURM_REF env vars, set to
alec-flowers/srt-slurm@d60e3f1c (head of NVIDIA/srt-slurm#71) for
dynamo-vllm+dsv4. All other frameworks/models keep NVIDIA upstream.
- Recipes: replace `dynamo.version: 1.0.1` with `dynamo.hash:
6a159fedd8e4a1563aa647c31f622aedbf254b5b`. The fork's schema accepts
`hash:` for pinning a specific ai-dynamo/dynamo commit. That commit
has the matching vllm.inputs import path.
- Recipes: adopt DSV4-specific flags PR 1129 proved necessary for
startup: `enforce-eager: true` (prefill only), `enable-sleep-mode: true`,
`no-disable-hybrid-kv-cache-manager: true`, explicit
`kv-transfer-config` (NixlConnector kv_both), env vars
VLLM_SERVER_DEV_MODE=1 and TILELANG_CLEANUP_TEMP_FILES=1.
- Recipes: drop `data-parallel-hybrid-lb` and `async-scheduling` (DSR1
patterns that PR 1129 omitted on DSV4; keep minimal delta from DSV4
H200 single-node).
Kept H100-specific knobs: VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,
low_latency} all2all backends, VLLM_USE_DEEP_GEMM. Skipped GB200-only
flags (NCCL_MNNVL_ENABLE, NCCL_NVLS_ENABLE, VLLM_USE_NCCL_SYMM_MEM).
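In recipe terms, the delta looks roughly like this. The placement of the fields is an assumption; the keys and values are the ones listed above, and the kv-transfer-config JSON follows vLLM's NixlConnector form.

```yaml
# Sketch only — field placement is assumed; values are from this commit message.
dynamo:
  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b   # replaces version: 1.0.1
prefill:
  enforce-eager: true                              # prefill only
  enable-sleep-mode: true
  no-disable-hybrid-kv-cache-manager: true
  kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
  env:
    VLLM_SERVER_DEV_MODE: "1"
    TILELANG_CLEANUP_TEMP_FILES: "1"
# decode: same additions minus enforce-eager; dropped on both sides:
#   data-parallel-hybrid-lb, async-scheduling
```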
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dynamo vLLM worker argparse rejects --enable-auto-tool-choice and
--tool-call-parser — the sweep from e0359c6 got past the module-import error
but failed with "unrecognized arguments: --enable-auto-tool-choice
--tool-call-parser deepseek_v4" during prefill worker startup.
These flags (along with --tokenizer-mode and --reasoning-parser) are OpenAI
API-server concerns. In disagg, Dynamo is the frontend and does tokenization /
tool parsing itself; the vLLM workers are engine-only processes and expose only
engine args. The H200 single-node recipe uses `vllm serve` directly (full API
server), which is why those flags work there but fail here. Kimi K2.5 (the only
other working dynamo-vllm recipe) also omits all four flags — that's the
precedent.
Removed from both prefill and decode:
  tokenizer-mode: deepseek_v4
  tool-call-parser: deepseek_v4
  reasoning-parser: deepseek_v4
  enable-auto-tool-choice: true
Kept trust-remote-code: true (needed for DSV4's custom modeling code).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers got past module import and weight load (471s), then died
simultaneously with:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.9/main_nvshmem/src/host/mem/
mem_heap.cpp:exchange_heap_memory_handle:781: Fatal IPC Failure
IPC failure: Sending data over socket failed: No such file or directory
Root cause: `all2all-backend: deepep_{high_throughput,low_latency}`
routes expert-parallel comms through NVSHMEM. The cu129 DSV4 vLLM
wheel's NVSHMEM can't complete host-side IPC bootstrap after the
workers enter the executor init phase. DSR1 on the same H100 nodes
uses deepep successfully, but through a different container
(nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0) with an older NVSHMEM.
Fix — mirror PR 1129's GB200 approach:
1. Drop the `all2all-backend` override entirely. The DSV4 vLLM code
picks its own default for this model, which routes through NCCL
symmetric memory instead of NVSHMEM.
2. Add env vars:
VLLM_USE_NCCL_SYMM_MEM=1 (prefer NCCL symm mem path)
NCCL_CUMEM_ENABLE=1 (cuMem-based allocator companion)
Skipped NCCL_MNNVL_ENABLE and NCCL_NVLS_ENABLE (Blackwell-only; MNNVL
is GB200 NVSwitch fabric, NVLS is NVLink SHARP — neither exists on
H100). Keeps all H100-specific knobs (VLLM_USE_DEEP_GEMM,
VLLM_MOE_DP_CHUNK_SIZE=192, VLLM_SKIP_P2P_CHECK).
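As a sketch of the resulting env/engine block (placement assumed; the keys and values are the ones this message names, with comments paraphrasing the reasoning above):

```yaml
# all2all-backend: deepep_high_throughput   <- removed; let the DSV4 vLLM code pick its default
env:
  VLLM_USE_NCCL_SYMM_MEM: "1"     # prefer the NCCL symmetric-memory path
  NCCL_CUMEM_ENABLE: "1"          # companion NCCL allocator setting (per this message)
  VLLM_USE_DEEP_GEMM: "1"         # kept H100 knobs
  VLLM_MOE_DP_CHUNK_SIZE: "192"
  VLLM_SKIP_P2P_CHECK: "1"        # value assumed; named without a value above
# not set (Blackwell-only): NCCL_MNNVL_ENABLE, NCCL_NVLS_ENABLE
```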
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 24913192394 got past every prior failure (NVSHMEM/IPC, module import, argparse) but OOMed during compile_or_warm_up_model:
  torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB.
  GPU 0 has a total capacity of 79.19 GiB of which 93.00 MiB is free.
  PyTorch: 72.99 GiB | CUDA Graphs: 1.28 GiB
  File ".../vllm/model_executor/layers/sparse_attn_indexer.py", line 122
DSV4's "Lightning Indexer" sparse attention layer allocates transient torch.empty buffers that aren't accounted for in vLLM's KV cache profiling. With gpu-memory-utilization=0.95, vLLM reserves ~75 GiB of each H100's 79 GiB usable, leaving only ~4 GiB for non-PyTorch state (NCCL buffers, NVSHMEM scratch, the indexer's transient allocations). The indexer's 512 MiB allocation tips it over.
The H200 single-node DSV4 recipe uses 0.95 and works because each H200 has 141 GiB/GPU — the ~7 GiB headroom is enough there. PR 1129 uses 0.88 (prefill) / 0.9 (decode) on GB200's 192 GiB. DSR1 H100 disagg uses vLLM's default 0.9 and works because DSR1's MLA doesn't have the indexer overhead.
0.85 reserves ~12 GiB headroom on H100 80GB, well above the indexer's ~6 GiB working set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
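Spelled out against the recipe knob (the key name is the one used elsewhere in this PR; the arithmetic is from the message above):

```yaml
# 79.19 GiB usable per H100:
#   0.95 * 79.19 ≈ 75.2 GiB reserved by vLLM  -> ~4 GiB left (OOMs on the indexer)
#   0.85 * 79.19 ≈ 67.3 GiB reserved          -> ~12 GiB left
gpu-memory-utilization: 0.85
```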
Run 24914869373: server starts successfully (eval-only succeeds in 33m,
end-to-end gsm8k completions). The throughput jobs fail before sending
a single request:
ValueError: Cannot use chat template functions because
tokenizer.chat_template is not set
File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 346,
in sample_random_requests
chat_template_dummy = tokenizer.apply_chat_template(...)
DSV4-Pro's HF tokenizer ships without a chat_template attribute. The
server uses tokenizer-mode=deepseek_v4 (set automatically from the
model's tokenizer_config.json) to handle templating itself, but
sa-bench's prompt-construction path runs a *local* HF
apply_chat_template before sending — and that raises with no template
to apply.
Eval works because lm-eval-harness sends raw messages to
/v1/chat/completions; the server templates them via Dynamo's parser.
Set `use_chat_template: false` on both recipes' benchmark blocks
(matches PR 1129). sa-bench will send raw random text, which is what
the throughput benchmark wants anyway.
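A sketch of the benchmark-block change (exact block layout assumed; the key and value are the ones named above):

```yaml
benchmark:
  use_chat_template: false   # sa-bench sends raw random text; server-side templating is untouched
```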
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the search space with a TEP-style recipe alongside the existing
DEP, following the dsr1-fp8-h100-dynamo-sglang TEP/DEP split pattern.
The h100-multinode pool is exactly 4 nodes and DSV4-Pro weights need
>=2 nodes per side, so we cannot add more workers (1P+1D = 4 nodes is
the only fit). The TEP variant therefore differs from DEP by changing
each worker's *internal* parallelism, not the worker count:
DEP (existing): tp=1, dp=16, ep=16, dp-attn=true
16 independent attention paths, sharded experts.
Better at high concurrency / throughput.
TEP (new): tp=16, dp=1, ep=16, dp-attn=false
Single replica spread across all 16 GPUs, sharded
experts. All 16 GPUs cooperate on each forward pass.
Cross-node TP routes attn all-reduce + MoE all2all
over IB — expensive per token, but latency wins at
small batch sizes (conc 4-32).
Concurrency split per the user's hint ("DEP for high conc, TEP for
low conc"):
1k1k TEP: [4, 8, 16, 32] 1k1k DEP: [64, 128, 256]
8k1k TEP: [4, 8, 16] 8k1k DEP: [32, 64, 128]
Also extends the DEP high-conc tail by one point each side
(1k1k 128 -> 256, 8k1k 64 -> 128).
TEP recipe drops `data-parallel-hybrid-lb` (no DP) and lowers
`max-num-seqs` to 64 / `max-num-batched-tokens` to 512 since cudagraph
capture would otherwise reserve memory for batch shapes never reached
at conc<=32. Keeps the existing DSV4 startup workarounds
(VLLM_USE_NCCL_SYMM_MEM, gpu-memory-utilization=0.85, no all2all-backend
override, etc).
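Side by side, the per-worker parallelism and batch keys differ roughly as follows (key names follow the ones used in this PR's recipes; exact placement is assumed):

```yaml
# DEP (existing): 16 independent attention paths, sharded experts
tensor-parallel-size: 1
data-parallel-size: 16
enable-expert-parallel: true
max-num-seqs: 512

# TEP (this commit): one replica spread across all 16 GPUs
tensor-parallel-size: 16
# no data-parallel-size / data-parallel-hybrid-lb
enable-expert-parallel: true
max-num-seqs: 64
max-num-batched-tokens: 512
```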
Doubles the matrix from 2 to 4 entries (validated via
MultiNodeMatrixEntry).
Also adds `du -sh "$MODEL_PATH"` in the dynamo-vllm branch of
launch_h100-dgxc-slurm.sh so model size shows in CI output — useful
for catching partial downloads or wrong revisions before the 8-min
weight-load step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep 24921015519 surfaced that cross-node TP=16 doesn't work with the
Dynamo+vLLM stack:
pydantic_core._pydantic_core.ValidationError: 1 validation error for ParallelConfig
Value error, World size (16) is larger than the number of available
GPUs (1) in this node. If this is intentional and you are using:
- ray, set '--distributed-executor-backend ray'.
- multiprocessing, set '--nnodes' appropriately.
Dynamo spawns one vLLM process per GPU; each process only sees its
single local GPU and vLLM rejects world_size=16. Working around this
would need --distributed-executor-backend=ray which Dynamo doesn't
coordinate. None of the working DSV4 vLLM recipes (kimi GB200, DSR1
H100, PR 1129 GB200) use cross-node TP either — the execution model
assumes one process per GPU.
So drop TEP entirely; instead deliver two DEP recipes per ISL/OSL
that differ in batch tuning:
DEP-eager (low conc): max-num-seqs=64, max-num-batched-tokens=256,
enforce-eager=true on decode (no cudagraph). Smaller cudagraph
capture footprint, faster warmup, no decode kernel-launch
optimization (irrelevant at conc<=32 where network round-trips
dominate per-token latency).
DEP (high conc, existing): max-num-seqs=512, max-num-batched-tokens
=512, decode cudagraph enabled. Higher batching throughput at
conc>=64.
Conc splits unchanged from previous attempt:
1k1k eager [4,8,16,32] 1k1k dep [64,128,256]
8k1k eager [4,8,16] 8k1k dep [32,64,128]
Same 4 matrix entries, all with the same tp=1/dp=16/ep=16/dp-attn=true
metadata; differentiation is via the CONFIG_FILE pointer in
additional-settings (mirrors how the trtllm dsr1-h100 recipes encode
multiple variants of the same topology).
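In recipe terms the two variants differ only in batch-tuning keys, roughly as follows (placement assumed; values are from this message):

```yaml
# DEP-eager (low concurrency)
max-num-seqs: 64
max-num-batched-tokens: 256
enforce-eager: true            # decode worker only: skip cudagraph capture

# DEP (high concurrency, existing)
max-num-seqs: 512
max-num-batched-tokens: 512
# decode cudagraph (FULL_DECODE_ONLY) stays enabled
```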
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…914869373)
The eager low-conc DEP variant added in 1bdeb9e was untested, and the TEP variant before that didn't work at all on Dynamo+vLLM. Drop both and revert to the single-DEP search-space form that successfully served gsm8k eval-only in run 24914869373:
  1k1k DEP: conc [4, 8, 16, 32, 64, 128]
  8k1k DEP: conc [4, 8, 16, 32, 64]
Each entry uses tp=1, dp=16, ep=16, dp-attn=true (1P+1D filling the 4-node h100-multinode pool), max-num-seqs=512, decode cudagraph on, gpu-memory-utilization=0.85.
Removes:
- benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/1k1k/disagg-h100-fp8-1p1d-dep16-dep16-eager.yaml
- benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/8k1k/disagg-h100-fp8-1p1d-dep16-dep16-eager.yaml
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vllm
# Conflicts:
#   .github/configs/nvidia-master.yaml
#   perf-changelog.yaml
Run 24922713022 hit the default 1800s orchestrator deadline on all three matrix jobs (1k1k bench, 8k1k bench, 8k1k eval). Concurrent multinode matrix jobs starve the same Lustre OSTs — the first shard load took 423s, shard 8/64 was reached at 16 min, and the projected total weight load was ~107 min.
Match the GB200 dsv4 recipes, which already added these blocks for the same reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DSV4-Pro per-rank weights are 74.99 GiB at DP=16/EP=16 — H100 80GB leaves only ~4 GiB headroom, and sparse_attn_indexer's profile_run torch.empty(512 MiB) OOMs (run 24923521075). Cross-node TP=16 shards the model 16-way across 2 nodes (~5 GiB per rank).
srt-slurm's vllm.py:386-388 emits --headless on the secondary node when data-parallel-size is absent and the worker spans nodes; Dynamo's run_dynamo_headless calls vLLM's run_headless, which uses MultiprocExecutor + torch.distributed (no Ray) to form the cross-node PG. NCCL TP all-reduce flows over IB on every layer — slower per token than intra-node NVLink, but the only way to fit DSV4-Pro at 80 GB.
Other changes: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation; gpu-memory-utilization back to 0.95 (matches H200); enforce-eager on decode for the first attempt (cross-node cudagraphs are fragile).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
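A sketch of the per-worker engine block after this change. The schema and key placement are assumed; the keys and values come from this message and the earlier commits in this PR.

```yaml
# Illustrative — cross-node TP shape; srt-slurm emits --headless on the
# secondary node because data-parallel-size is absent.
tensor-parallel-size: 16
enable-expert-parallel: true
gpu-memory-utilization: 0.95                 # back to the H200 value
enforce-eager: true                          # decode side, first attempt only
env:
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  NCCL_CUMEM_ENABLE: "1"
```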
Force-pushed from 348910a to f798361.
Summary
- 1P+1D disaggregated serving filling the 4-node h100-multinode pool.
- Recipes are bundled at benchmarks/multi_node/srt_slurm_recipes/ and overlaid onto the upstream clone at runtime. Temporary pending an upstream PR to NVIDIA/srt-slurm.

Config parity with H200

Engine flags match benchmarks/single_node/dsv4_fp8_h200.sh:
- deepseek_v4 tokenizer, tool-call, and reasoning parsers
- --kv-cache-dtype fp8, --block-size 256
- --no-enable-prefix-caching, --no-enable-flashinfer-autotune
- --enable-expert-parallel, --gpu-memory-utilization 0.95, --max-num-seqs 512, --max-num-batched-tokens 512
- FULL_DECODE_ONLY cudagraph
- VLLM_ENGINE_READY_TIMEOUT_S=3600

Differs from H200:
- max-model-len: 16384 (H200's 800k does not fit KV across two 80GB decode nodes)
- VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,low_latency} all2all backends, VLLM_USE_DEEP_GEMM=1
- NixlConnector P<->D KV transfer, dynamo: { version: 1.0.1, install: true } with setup_script: vllm-container-deps.sh
- tensor-parallel-size: 1 + data-parallel-size: 16 per side (vs H200 --data-parallel-size 8)

🤖 Generated with Claude Code