Also modified docker run to include --init to improve stability by better handling zombie processes.
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
…ckwell sglang fork
Re-installing dynamo 0.8.1 over the lmsysorg/sglang:deepseek-v4-grace-blackwell
container's pre-baked sglang fails at import time:
File ".../dynamo/sglang/health_check.py", line 20
def _get_bos_token_id_from_engine(engine: Optional[sgl.Engine])
AttributeError: module 'sglang' has no attribute 'Engine'
The DSV4 sglang fork bundled in this image does not expose sgl.Engine.
Drop the dynamo: block so srtctl uses the dynamo build pre-installed in
the container — matches NVIDIA/srt-slurm PR #75 (the only upstream
DSV4 sglang disagg recipe), which also has no dynamo: block.
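For illustration, the recipe-level change amounts to removing the
whole dynamo: mapping; the sub-key shown below is a hypothetical
placeholder, not copied from the diff:
  # removed from each recipe:
  # dynamo:
  #   version: "0.8.1"   # hypothetical pin; re-installing it breaks sgl.Engine imports
  # with no dynamo: block, srtctl keeps the container's pre-installed dynamo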
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
…tch types broken
The run after the deepep-mode: low_latency change failed again. Logs
show two distinct DeepEP-path failures:
1. Prefill scheduler crash:
  File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
    topk_output = dispatch_output.topk_output
AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
output type in this image's sglang fork exposes topk_output, so
forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
is a fork-only file (does not exist in upstream sgl-project/sglang),
so the API mismatch can only be fixed by rebuilding the image.
2. Decode CUDA graph capture crash:
RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
cuda-graph-max-bs we configured.
Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.
Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)
DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.
Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.
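For reference, each worker's server args end up roughly like the
sketch below; the key names are the ones kept/removed above, while
the sglang_config: nesting is an assumption about the recipe layout:
  sglang_config:
    tensor-parallel-size: 8      # 16 on the -wide decode workers
    mem-fraction-static: 0.82    # KV-cache headroom on 96 GB GB200 GPUs
    # removed: enable-dp-attention, moe-a2a-backend: deepep, dp-size,
    # ep-size, deepep-mode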
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
…ility at TP=8/16
After the DeepEP removal, model load crashed at:
  File '.../sglang/srt/layers/quantization/fp8.py', line 282, in validate_block_quant_shapes
    raise ValueError(
ValueError: Weight output_partition_size = 192 is not divisible
by weight quantization block_n = 128.
DSV4-Pro's shared-experts gate_up_proj (intermediate size ~1536) is
FP8-quantized in 128-element blocks. With TP=8 the per-rank slice is
1536/8 = 192, which fails the divisibility check. PR #75 sidesteps
this by using TP=4 (1536/4 = 384), but that locks us into single-node
workers.
sglang's --moe-dense-tp-size flag is the documented workaround
(server_args.py: 'useful when, with large TP size, there are errors
caused by weights in MLP layers having dimension smaller than the
min dimension GEMM supports'). Setting moe-dense-tp-size: 1 runs the
shared / dense-MLP layers replicated across ranks (TP=1) while the
rest of the model — attention, routed experts — keeps TP=8/16. Memory
cost is small since shared experts are a fraction of total weights.
Applied to all 6 recipes; topology/node counts unchanged.
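Illustrative fragment of the change (same caveat as before: the
sglang_config: nesting is assumed, the keys are the ones named above):
  sglang_config:
    tensor-parallel-size: 8    # attention + routed experts stay sharded at TP=8/16
    moe-dense-tp-size: 1       # shared / dense-MLP layers run replicated (TP=1)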
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
The run after adding moe-dense-tp-size: 1 still hit:
ValueError: Weight output_partition_size = 192 is not divisible
by weight quantization block_n = 128.
Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0  # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.
Two valid paths:
(a) re-enable DP-attention without DeepEP — speculative, never tested
(b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
working DSV4 sglang disagg recipe upstream) verbatim.
Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.
Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
- 1k1k 1p1d-tep8: P TP=4 / D TP=4 (2 nodes total)
- 1k1k 1p1d-dep16: P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
- 1k1k 3p1d-dep16: P 3*TP=4 / D TP=4 (4 nodes)
- 8k1k 1p1d-tep8: P TP=4 / D TP=4 (2 nodes)
- 8k1k 3p1d-dep16: P 3*TP=4 / D TP=4 (4 nodes)
- 8k1k 7p1d-dep16: P 7*TP=4 / D TP=4 (8 nodes)
nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).
Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).
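Rough shape of each prefill/decode block after this change (layout
assumed; the key and env-var names are the ones in this message):
  sglang_config:
    tensor-parallel-size: 4    # 1536/4 = 384 divides cleanly by block_n = 128
  env:
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "2048"  # dormant until DeepEP returns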
NCCL_GRAPH_REGISTER tries to automatically enable user buffer registration with CUDA Graphs. Disabling it can reduce our vLLM and SGLang perf but will improve CI stability.
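Assuming the workflow disables it through an env block and NCCL's
usual 0/1 convention, the change would look roughly like:
  env:
    NCCL_GRAPH_REGISTER: "0"   # skip user-buffer registration with CUDA Graphs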
This conflicts with pre-installed packages inside the user directory
and caused last night's runs to fail.
See successful run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/18146263616