Add H100 config: dsv4-fp8-dynamo-vllm (DeepSeek-V4-Pro multinode disagg) #1142
Oseltamivir wants to merge 17 commits into main
Conversation
Port the DSV4-Pro vLLM recipe from single-node H200 to H100 as multinode
disaggregated serving via Dynamo. The ~862 GB FP8 weights don't fit on one
8xH100-80GB node (640 GB), so each side must own >=2 nodes; with the
h100-multinode pool at 4 nodes, 1P+1D with two nodes per side (DP16/EP16,
32 H100s total) is the minimum viable shape and fills the pool exactly.
Engine flags match the single-node H200 recipe: deepseek_v4 tokenizer,
tool-call, and reasoning parsers; FP8 KV cache; block size 256; prefix
caching disabled; compilation mode 0 with FULL_DECODE_ONLY cudagraph.
max-model-len is capped at 16384 (H200's 800k does not fit KV across two
80GB decode nodes). Keeps H100-tuned knobs from the DSR1 vLLM recipe:
VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,low_latency} all2all
backends, NixlConnector P<->D KV transfer, VLLM_USE_DEEP_GEMM, dynamo 1.0.1.
srt-slurm recipes are bundled locally at benchmarks/multi_node/srt_slurm_recipes/
and overlaid onto the srt-slurm clone at runtime. This is temporary until
the recipes can be upstreamed to NVIDIA/srt-slurm.
Changes:
- recipes: benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/
{1k1k,8k1k}/disagg-h100-fp8-1p1d-dep16-dep16.yaml
- runner: launch_h100-dgxc-slurm.sh gains a dynamo-vllm framework branch
(dsv4-fp8 model path at /mnt/nfs/lustre/models/dsv4-fp8, vLLM container
squash mapping, srtslurm.yaml dynamo-vllm alias) and an unconditional
local-recipes overlay after the srt-slurm checkout
- master: .github/configs/nvidia-master.yaml adds dsv4-fp8-h100-dynamo-vllm
with 1k1k conc [4,8,16,32,64,128] and 8k1k conc [4,8,16,32,64]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep 24909864822 had all three multinode jobs fail in 6s with ExitCode=1:0 and no sweep_JOBID.log written, leaving no usable diagnostic in the CI artifact.
Two defensive changes:
1. mkdir -p outputs/$JOB_ID/logs before polling, so Slurm's
   #SBATCH --output=outputs/%j/logs/sweep_%j.log directive can open the target
   file even when the compute-node stepd lacks permission to create the parent
   dir on NFS.
2. On the "job failed before creating log file" path, tar outputs/$JOB_ID/
   (sbatch_script.sh, config.yaml, any partial log, and the scontrol dump) into
   multinode_server_logs.tar.gz so the CI artifact captures what was submitted
   and why Slurm exited early. Previously, exit 1 ran before the tar step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 1142's first real sweep hit "ModuleNotFoundError: No module named
'vllm.inputs.data'" on all three multinode jobs. Same error as PR 1129
on GB200.
Root cause: ai-dynamo 1.0.1 (installed by NVIDIA/srt-slurm@sa-submission-q2-2026
via `dynamo: { version: 1.0.1 }`) imports vllm.inputs.data.TokensPrompt,
a path removed in the DSV4 vLLM wheel. Dynamo workers crash during
import before any vLLM flag matters.
Fix, mirroring PR 1129:
- launch_h100-dgxc-slurm.sh: override srt-slurm clone URL/ref via
SRT_SLURM_REPO_URL and SRT_SLURM_REF env vars, set to
alec-flowers/srt-slurm@d60e3f1c (head of NVIDIA/srt-slurm#71) for
dynamo-vllm+dsv4. All other frameworks/models keep NVIDIA upstream.
- Recipes: replace `dynamo.version: 1.0.1` with `dynamo.hash:
6a159fedd8e4a1563aa647c31f622aedbf254b5b`. The fork's schema accepts
`hash:` for pinning a specific ai-dynamo/dynamo commit. That commit
has the matching vllm.inputs import path.
- Recipes: adopt DSV4-specific flags PR 1129 proved necessary for
startup: `enforce-eager: true` (prefill only), `enable-sleep-mode: true`,
`no-disable-hybrid-kv-cache-manager: true`, explicit
`kv-transfer-config` (NixlConnector kv_both), env vars
VLLM_SERVER_DEV_MODE=1 and TILELANG_CLEANUP_TEMP_FILES=1.
- Recipes: drop `data-parallel-hybrid-lb` and `async-scheduling` (DSR1
patterns that PR 1129 omitted on DSV4; keep minimal delta from DSV4
H200 single-node).
Kept H100-specific knobs: VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,
low_latency} all2all backends, VLLM_USE_DEEP_GEMM. Skipped GB200-only
flags (NCCL_MNNVL_ENABLE, NCCL_NVLS_ENABLE, VLLM_USE_NCCL_SYMM_MEM).
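In recipe terms, the delta looks roughly like this. The placement of the fields is an assumption; the keys and values are the ones listed above, and the kv-transfer-config JSON follows vLLM's NixlConnector form.

```yaml
# Sketch only — field placement is assumed; values are from this commit message.
dynamo:
  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b   # replaces version: 1.0.1
prefill:
  enforce-eager: true                              # prefill only
  enable-sleep-mode: true
  no-disable-hybrid-kv-cache-manager: true
  kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
  env:
    VLLM_SERVER_DEV_MODE: "1"
    TILELANG_CLEANUP_TEMP_FILES: "1"
# decode: same additions minus enforce-eager; dropped on both sides:
#   data-parallel-hybrid-lb, async-scheduling
```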
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dynamo vLLM worker argparse rejects --enable-auto-tool-choice and
--tool-call-parser — the sweep from e0359c6 got past the module-import error
but failed with "unrecognized arguments: --enable-auto-tool-choice
--tool-call-parser deepseek_v4" during prefill worker startup.
These flags (along with --tokenizer-mode and --reasoning-parser) are OpenAI
API-server concerns. In disagg, Dynamo is the frontend and does tokenization /
tool parsing itself; the vLLM workers are engine-only processes and expose only
engine args. The H200 single-node recipe uses `vllm serve` directly (full API
server), which is why those flags work there but fail here. Kimi K2.5 (the only
other working dynamo-vllm recipe) also omits all four flags — that's the
precedent.
Removed from both prefill and decode:
  tokenizer-mode: deepseek_v4
  tool-call-parser: deepseek_v4
  reasoning-parser: deepseek_v4
  enable-auto-tool-choice: true
Kept trust-remote-code: true (needed for DSV4's custom modeling code).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers got past module import and weight load (471s), then died
simultaneously with:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.9/main_nvshmem/src/host/mem/
mem_heap.cpp:exchange_heap_memory_handle:781: Fatal IPC Failure
IPC failure: Sending data over socket failed: No such file or directory
Root cause: `all2all-backend: deepep_{high_throughput,low_latency}`
routes expert-parallel comms through NVSHMEM. The cu129 DSV4 vLLM
wheel's NVSHMEM can't complete host-side IPC bootstrap after the
workers enter the executor init phase. DSR1 on the same H100 nodes
uses deepep successfully, but through a different container
(nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0) with an older NVSHMEM.
Fix — mirror PR 1129's GB200 approach:
1. Drop the `all2all-backend` override entirely. The DSV4 vLLM code
picks its own default for this model, which routes through NCCL
symmetric memory instead of NVSHMEM.
2. Add env vars:
VLLM_USE_NCCL_SYMM_MEM=1 (prefer NCCL symm mem path)
NCCL_CUMEM_ENABLE=1 (cuMem-based allocator companion)
Skipped NCCL_MNNVL_ENABLE and NCCL_NVLS_ENABLE (Blackwell-only; MNNVL
is GB200 NVSwitch fabric, NVLS is NVLink SHARP — neither exists on
H100). Keeps all H100-specific knobs (VLLM_USE_DEEP_GEMM,
VLLM_MOE_DP_CHUNK_SIZE=192, VLLM_SKIP_P2P_CHECK).
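As a sketch of the resulting env/engine block (placement assumed; the keys and values are the ones this message names, with comments paraphrasing the reasoning above):

```yaml
# all2all-backend: deepep_high_throughput   <- removed; let the DSV4 vLLM code pick its default
env:
  VLLM_USE_NCCL_SYMM_MEM: "1"     # prefer the NCCL symmetric-memory path
  NCCL_CUMEM_ENABLE: "1"          # companion NCCL allocator setting (per this message)
  VLLM_USE_DEEP_GEMM: "1"         # kept H100 knobs
  VLLM_MOE_DP_CHUNK_SIZE: "192"
  VLLM_SKIP_P2P_CHECK: "1"        # value assumed; named without a value above
# not set (Blackwell-only): NCCL_MNNVL_ENABLE, NCCL_NVLS_ENABLE
```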
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 24913192394 got past every prior failure (NVSHMEM/IPC, module import, argparse) but OOMed during compile_or_warm_up_model:
  torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB.
  GPU 0 has a total capacity of 79.19 GiB of which 93.00 MiB is free.
  PyTorch: 72.99 GiB | CUDA Graphs: 1.28 GiB
  File ".../vllm/model_executor/layers/sparse_attn_indexer.py", line 122
DSV4's "Lightning Indexer" sparse attention layer allocates transient torch.empty buffers that aren't accounted for in vLLM's KV cache profiling. With gpu-memory-utilization=0.95, vLLM reserves ~75 GiB of each H100's 79 GiB usable, leaving only ~4 GiB for non-PyTorch state (NCCL buffers, NVSHMEM scratch, the indexer's transient allocations). The indexer's 512 MiB allocation tips it over.
The H200 single-node DSV4 recipe uses 0.95 and works because each H200 has 141 GiB/GPU — the ~7 GiB headroom is enough there. PR 1129 uses 0.88 (prefill) / 0.9 (decode) on GB200's 192 GiB. DSR1 H100 disagg uses vLLM's default 0.9 and works because DSR1's MLA doesn't have the indexer overhead.
0.85 reserves ~12 GiB headroom on H100 80GB, well above the indexer's ~6 GiB working set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
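Spelled out against the recipe knob (the key name is the one used elsewhere in this PR; the arithmetic is from the message above):

```yaml
# 79.19 GiB usable per H100:
#   0.95 * 79.19 ≈ 75.2 GiB reserved by vLLM  -> ~4 GiB left (OOMs on the indexer)
#   0.85 * 79.19 ≈ 67.3 GiB reserved          -> ~12 GiB left
gpu-memory-utilization: 0.85
```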
Run 24914869373: server starts successfully (eval-only succeeds in 33m,
end-to-end gsm8k completions). The throughput jobs fail before sending
a single request:
ValueError: Cannot use chat template functions because
tokenizer.chat_template is not set
File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 346,
in sample_random_requests
chat_template_dummy = tokenizer.apply_chat_template(...)
DSV4-Pro's HF tokenizer ships without a chat_template attribute. The
server uses tokenizer-mode=deepseek_v4 (set automatically from the
model's tokenizer_config.json) to handle templating itself, but
sa-bench's prompt-construction path runs a *local* HF
apply_chat_template before sending — and that raises with no template
to apply.
Eval works because lm-eval-harness sends raw messages to
/v1/chat/completions; the server templates them via Dynamo's parser.
Set `use_chat_template: false` on both recipes' benchmark blocks
(matches PR 1129). sa-bench will send raw random text, which is what
the throughput benchmark wants anyway.
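A sketch of the benchmark-block change (exact block layout assumed; the key and value are the ones named above):

```yaml
benchmark:
  use_chat_template: false   # sa-bench sends raw random text; server-side templating is untouched
```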
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the search space with a TEP-style recipe alongside the existing
DEP, following the dsr1-fp8-h100-dynamo-sglang TEP/DEP split pattern.
The h100-multinode pool is exactly 4 nodes and DSV4-Pro weights need
>=2 nodes per side, so we cannot add more workers (1P+1D = 4 nodes is
the only fit). The TEP variant therefore differs from DEP by changing
each worker's *internal* parallelism, not the worker count:
DEP (existing): tp=1, dp=16, ep=16, dp-attn=true
16 independent attention paths, sharded experts.
Better at high concurrency / throughput.
TEP (new): tp=16, dp=1, ep=16, dp-attn=false
Single replica spread across all 16 GPUs, sharded
experts. All 16 GPUs cooperate on each forward pass.
Cross-node TP routes attn all-reduce + MoE all2all
over IB — expensive per token, but latency wins at
small batch sizes (conc 4-32).
Concurrency split per the user's hint ("DEP for high conc, TEP for
low conc"):
1k1k TEP: [4, 8, 16, 32] 1k1k DEP: [64, 128, 256]
8k1k TEP: [4, 8, 16] 8k1k DEP: [32, 64, 128]
Also extends the DEP high-conc tail by one point each side
(1k1k 128 -> 256, 8k1k 64 -> 128).
TEP recipe drops `data-parallel-hybrid-lb` (no DP) and lowers
`max-num-seqs` to 64 / `max-num-batched-tokens` to 512 since cudagraph
capture would otherwise reserve memory for batch shapes never reached
at conc<=32. Keeps the existing DSV4 startup workarounds
(VLLM_USE_NCCL_SYMM_MEM, gpu-memory-utilization=0.85, no all2all-backend
override, etc).
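Side by side, the per-worker parallelism and batch keys differ roughly as follows (key names follow the ones used in this PR's recipes; exact placement is assumed):

```yaml
# DEP (existing): 16 independent attention paths, sharded experts
tensor-parallel-size: 1
data-parallel-size: 16
enable-expert-parallel: true
max-num-seqs: 512

# TEP (this commit): one replica spread across all 16 GPUs
tensor-parallel-size: 16
# no data-parallel-size / data-parallel-hybrid-lb
enable-expert-parallel: true
max-num-seqs: 64
max-num-batched-tokens: 512
```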
Doubles the matrix from 2 to 4 entries (validated via
MultiNodeMatrixEntry).
Also adds `du -sh "$MODEL_PATH"` in the dynamo-vllm branch of
launch_h100-dgxc-slurm.sh so model size shows in CI output — useful
for catching partial downloads or wrong revisions before the 8-min
weight-load step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep 24921015519 surfaced that cross-node TP=16 doesn't work with the
Dynamo+vLLM stack:
pydantic_core._pydantic_core.ValidationError: 1 validation error for ParallelConfig
Value error, World size (16) is larger than the number of available
GPUs (1) in this node. If this is intentional and you are using:
- ray, set '--distributed-executor-backend ray'.
- multiprocessing, set '--nnodes' appropriately.
Dynamo spawns one vLLM process per GPU; each process only sees its
single local GPU and vLLM rejects world_size=16. Working around this
would need --distributed-executor-backend=ray which Dynamo doesn't
coordinate. None of the working DSV4 vLLM recipes (kimi GB200, DSR1
H100, PR 1129 GB200) use cross-node TP either — the execution model
assumes one process per GPU.
So drop TEP entirely; instead deliver two DEP recipes per ISL/OSL
that differ in batch tuning:
DEP-eager (low conc): max-num-seqs=64, max-num-batched-tokens=256,
enforce-eager=true on decode (no cudagraph). Smaller cudagraph
capture footprint, faster warmup, no decode kernel-launch
optimization (irrelevant at conc<=32 where network round-trips
dominate per-token latency).
DEP (high conc, existing): max-num-seqs=512, max-num-batched-tokens
=512, decode cudagraph enabled. Higher batching throughput at
conc>=64.
Conc splits unchanged from previous attempt:
1k1k eager [4,8,16,32] 1k1k dep [64,128,256]
8k1k eager [4,8,16] 8k1k dep [32,64,128]
Same 4 matrix entries, all with the same tp=1/dp=16/ep=16/dp-attn=true
metadata; differentiation is via the CONFIG_FILE pointer in
additional-settings (mirrors how the trtllm dsr1-h100 recipes encode
multiple variants of the same topology).
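In recipe terms the two variants differ only in batch-tuning keys, roughly as follows (placement assumed; values are from this message):

```yaml
# DEP-eager (low concurrency)
max-num-seqs: 64
max-num-batched-tokens: 256
enforce-eager: true            # decode worker only: skip cudagraph capture

# DEP (high concurrency, existing)
max-num-seqs: 512
max-num-batched-tokens: 512
# decode cudagraph (FULL_DECODE_ONLY) stays enabled
```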
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…914869373)
The eager low-conc DEP variant added in 1bdeb9e was untested, and the TEP variant before that didn't work at all on Dynamo+vLLM. Drop both and revert to the single-DEP search-space form that successfully served gsm8k eval-only in run 24914869373:
  1k1k DEP: conc [4, 8, 16, 32, 64, 128]
  8k1k DEP: conc [4, 8, 16, 32, 64]
Each entry uses tp=1, dp=16, ep=16, dp-attn=true (1P+1D filling the 4-node h100-multinode pool), max-num-seqs=512, decode cudagraph on, gpu-memory-utilization=0.85.
Removes:
- benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/1k1k/disagg-h100-fp8-1p1d-dep16-dep16-eager.yaml
- benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/8k1k/disagg-h100-fp8-1p1d-dep16-dep16-eager.yaml
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vllm
# Conflicts:
#   .github/configs/nvidia-master.yaml
#   perf-changelog.yaml
Run 24922713022 hit the default 1800s orchestrator deadline on all three matrix jobs (1k1k bench, 8k1k bench, 8k1k eval). Concurrent multinode matrix jobs starve the same Lustre OSTs — the first shard load took 423s, shard 8/64 was reached at 16 min, and the projected total weight load was ~107 min.
Match the GB200 dsv4 recipes, which already added these blocks for the same reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DSV4-Pro per-rank weights are 74.99 GiB at DP=16/EP=16 — H100 80GB leaves only ~4 GiB headroom, and sparse_attn_indexer's profile_run torch.empty(512 MiB) OOMs (run 24923521075). Cross-node TP=16 shards the model 16-way across 2 nodes (~5 GiB per rank).
srt-slurm's vllm.py:386-388 emits --headless on the secondary node when data-parallel-size is absent and the worker spans nodes; Dynamo's run_dynamo_headless calls vLLM's run_headless, which uses MultiprocExecutor + torch.distributed (no Ray) to form the cross-node PG. NCCL TP all-reduce flows over IB on every layer — slower per token than intra-node NVLink, but the only way to fit DSV4-Pro at 80 GB.
Other changes: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation; gpu-memory-utilization back to 0.95 (matches H200); enforce-eager on decode for the first attempt (cross-node cudagraphs are fragile).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
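A sketch of the per-worker engine block after this change. The schema and key placement are assumed; the keys and values come from this message and the earlier commits in this PR.

```yaml
# Illustrative — cross-node TP shape; srt-slurm emits --headless on the
# secondary node because data-parallel-size is absent.
tensor-parallel-size: 16
enable-expert-parallel: true
gpu-memory-utilization: 0.95                 # back to the H200 value
enforce-eager: true                          # decode side, first attempt only
env:
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  NCCL_CUMEM_ENABLE: "1"
```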
Force-pushed from 348910a to f798361.
Summary
- 1P+1D disaggregated serving filling the 4-node h100-multinode pool.
- Recipes are bundled at benchmarks/multi_node/srt_slurm_recipes/ and overlaid onto the upstream clone at runtime. Temporary pending an upstream PR to NVIDIA/srt-slurm.

Config parity with H200

Engine flags match benchmarks/single_node/dsv4_fp8_h200.sh:
- deepseek_v4 tokenizer, tool-call, and reasoning parsers
- --kv-cache-dtype fp8, --block-size 256
- --no-enable-prefix-caching, --no-enable-flashinfer-autotune
- --enable-expert-parallel, --gpu-memory-utilization 0.95, --max-num-seqs 512, --max-num-batched-tokens 512
- FULL_DECODE_ONLY cudagraph
- VLLM_ENGINE_READY_TIMEOUT_S=3600

Differs from H200:
- max-model-len: 16384 (H200's 800k does not fit KV across two 80GB decode nodes)
- VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,low_latency} all2all backends, VLLM_USE_DEEP_GEMM=1
- NixlConnector P<->D KV transfer, dynamo: { version: 1.0.1, install: true } with setup_script: vllm-container-deps.sh
- tensor-parallel-size: 1 + data-parallel-size: 16 per side (vs H200 --data-parallel-size 8)

🤖 Generated with Claude Code