Also modified docker run to include --init to improve stability by better handling zombie processes.
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
…ckwell sglang fork
Re-installing dynamo 0.8.1 over the lmsysorg/sglang:deepseek-v4-grace-blackwell
container's pre-baked sglang fails at import time:
File ".../dynamo/sglang/health_check.py", line 20
def _get_bos_token_id_from_engine(engine: Optional[sgl.Engine])
AttributeError: module 'sglang' has no attribute 'Engine'
The DSV4 sglang fork bundled in this image does not expose sgl.Engine.
Drop the dynamo: block so srtctl uses the dynamo build pre-installed in
the container — matches NVIDIA/srt-slurm PR #75 (the only upstream
DSV4 sglang disagg recipe), which also has no dynamo: block.
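For illustration, the recipe-level change amounts to removing the
whole dynamo: mapping; the sub-key shown below is a hypothetical
placeholder, not copied from the diff:
  # removed from each recipe:
  # dynamo:
  #   version: "0.8.1"   # hypothetical pin; re-installing it breaks sgl.Engine imports
  # with no dynamo: block, srtctl keeps the container's pre-installed dynamo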
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
…tch types broken
The run after the deepep-mode: low_latency change failed again. Logs
show two distinct DeepEP-path failures:
1. Prefill scheduler crash:
  File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
    topk_output = dispatch_output.topk_output
AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
output type in this image's sglang fork exposes topk_output, so
forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
is a fork-only file (does not exist in upstream sgl-project/sglang),
so the API mismatch can only be fixed by rebuilding the image.
2. Decode CUDA graph capture crash:
RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
cuda-graph-max-bs we configured.
Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.
Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)
DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.
Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.
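For reference, each worker's server args end up roughly like the
sketch below; the key names are the ones kept/removed above, while
the sglang_config: nesting is an assumption about the recipe layout:
  sglang_config:
    tensor-parallel-size: 8      # 16 on the -wide decode workers
    mem-fraction-static: 0.82    # KV-cache headroom on 96 GB GB200 GPUs
    # removed: enable-dp-attention, moe-a2a-backend: deepep, dp-size,
    # ep-size, deepep-mode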
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
…ility at TP=8/16
After the DeepEP removal, model load crashed at:
  File '.../sglang/srt/layers/quantization/fp8.py', line 282, in validate_block_quant_shapes
    raise ValueError(
ValueError: Weight output_partition_size = 192 is not divisible
by weight quantization block_n = 128.
DSV4-Pro's shared-experts gate_up_proj (intermediate size ~1536) is
FP8-quantized in 128-element blocks. With TP=8 the per-rank slice is
1536/8 = 192, which fails the divisibility check. PR #75 sidesteps
this by using TP=4 (1536/4 = 384), but that locks us into single-node
workers.
sglang's --moe-dense-tp-size flag is the documented workaround
(server_args.py: 'useful when, with large TP size, there are errors
caused by weights in MLP layers having dimension smaller than the
min dimension GEMM supports'). Setting moe-dense-tp-size: 1 runs the
shared / dense-MLP layers replicated across ranks (TP=1) while the
rest of the model — attention, routed experts — keeps TP=8/16. Memory
cost is small since shared experts are a fraction of total weights.
Applied to all 6 recipes; topology/node counts unchanged.
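Illustrative fragment of the change (same caveat as before: the
sglang_config: nesting is assumed, the keys are the ones named above):
  sglang_config:
    tensor-parallel-size: 8    # attention + routed experts stay sharded at TP=8/16
    moe-dense-tp-size: 1       # shared / dense-MLP layers run replicated (TP=1)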
Oseltamivir added a commit that referenced this pull request on Apr 25, 2026
The run after adding moe-dense-tp-size: 1 still hit:
ValueError: Weight output_partition_size = 192 is not divisible
by weight quantization block_n = 128.
Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0  # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.
Two valid paths:
(a) re-enable DP-attention without DeepEP — speculative, never tested
(b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
working DSV4 sglang disagg recipe upstream) verbatim.
Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.
Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
- 1k1k 1p1d-tep8: P TP=4 / D TP=4 (2 nodes total)
- 1k1k 1p1d-dep16: P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
- 1k1k 3p1d-dep16: P 3*TP=4 / D TP=4 (4 nodes)
- 8k1k 1p1d-tep8: P TP=4 / D TP=4 (2 nodes)
- 8k1k 3p1d-dep16: P 3*TP=4 / D TP=4 (4 nodes)
- 8k1k 7p1d-dep16: P 7*TP=4 / D TP=4 (8 nodes)
nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).
Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).
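Rough shape of each prefill/decode block after this change (layout
assumed; the key and env-var names are the ones in this message):
  sglang_config:
    tensor-parallel-size: 4    # 1536/4 = 384 divides cleanly by block_n = 128
  env:
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "2048"  # dormant until DeepEP returns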
NCCL_GRAPH_REGISTER tries to automatically enable user buffer registration with CUDA Graphs. Disabling it can reduce our vLLM and SGLang perf but will improve CI stability.
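Assuming the workflow disables it through an env block and NCCL's
usual 0/1 convention, the change would look roughly like:
  env:
    NCCL_GRAPH_REGISTER: "0"   # skip user-buffer registration with CUDA Graphs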
This conflicts with pre-installed packages inside the user directory
and caused last night's runs to fail.
See successful run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/18146263616