[NVIDIA] Fix vllm & sglang b200 updated containers #4
Merged
jthomson04 pushed a commit to jthomson04/InferenceMAX that referenced this pull request Jan 21, 2026
Modify GB200 runs to use test partition
2 tasks
chunfangamd added a commit that referenced this pull request May 24, 2026
…DSA state-index path
amd-master.yaml
- Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402
-> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
(matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is
no longer the reference build for hybrid-attention disagg models on
MI355X.)
- Scenarios: collapse the four legacy "top/middle/bottom/small-scale"
search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false
entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512]
for both 1k1k and 8k1k. dp-attn=false avoids the
fused_moe_triton/layer.py:209 shared-slot assertion that
--enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5
(256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed
layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the
same CI matrix-expansion logic applies to both.
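
The remainder quoted above can be checked directly. This is a minimal sketch that reproduces only the expression from the commit message; the function name and the `divisor` parameter are hypothetical, and the actual assertion at fused_moe_triton/layer.py:209 may be phrased differently.

```python
def shared_slot_remainder(num_routed_experts: int, divisor: int) -> int:
    """Evaluate the expression quoted in the commit message.

    Hypothetical helper: it mirrors (num_routed - 1) % divisor and nothing
    else; it is not the upstream assertion.
    """
    return (num_routed_experts - 1) % divisor


# GLM-5 values from the commit message: 256 routed experts, divisor 8.
remainder = shared_slot_remainder(256, 8)
print(remainder)  # 7, so the divisibility condition does not hold
```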
patches/mori_conn.py
- Add patch #4: rank + length normalization in
MoriKVReceiver._send_swa_dsa_state, immediately before the
group_concurrent_contiguous call. For GLM-5 (single DSA component),
upstream hands dst_state_indices as a 2-D (1, N) array while
src_state_indices is 1-D length 1; the existing [:common_len]
slice operates only on the outer axis, leaving the rank mismatched.
np.diff then produces (1, N-1) vs (0,), which can't broadcast and
crashes with "operands could not be broadcast together with shapes
(1,12) (0,)". The fix ravels both indices to 1-D and re-truncates
to common length so np.diff outputs compatible 1-D arrays. A one-shot
flag limits the warning log to once per receiver class.
- Verified end-to-end:
glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047
glm5-fp8-mi355x-sglang-disagg gsm8k strict-match = 0.9712 +/- 0.0046
qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression) = 0.9780 +/- 0.004
Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives
inside _send_swa_dsa_state, never called for Mamba); patches #1-#3
behavior is unchanged.
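
The shape mismatch described for patch #4 can be reproduced with NumPy alone. This is a sketch under stated assumptions, not the patch itself: `contiguity_breaks` is a hypothetical stand-in for the np.diff comparison inside group_concurrent_contiguous, `normalize_state_indices` is a hypothetical name for the ravel-and-truncate step, and N = 13 is chosen only so that the diff has the (1,12) shape from the error message.

```python
import numpy as np


def contiguity_breaks(src_indices, dst_indices):
    """Stand-in for the np.diff comparison: positions where either run breaks."""
    src_step = np.diff(src_indices) != 1
    dst_step = np.diff(dst_indices) != 1
    # Combining the two masks requires their shapes to broadcast.
    return np.where(dst_step | src_step)[0] + 1


def normalize_state_indices(src_indices, dst_indices):
    """Flatten both index arrays to 1-D, then truncate to the common length."""
    src = np.asarray(src_indices).ravel()
    dst = np.asarray(dst_indices).ravel()
    common_len = min(len(src), len(dst))
    return src[:common_len], dst[:common_len]


src = np.array([5])                 # 1-D, length 1
dst = np.arange(13).reshape(1, 13)  # 2-D, shape (1, N) with N = 13

# Slicing by common length only trims the outer axis, so dst stays (1, 13).
common_len = min(len(src), len(dst))
try:
    contiguity_breaks(src[:common_len], dst[:common_len])
except ValueError as exc:
    print("before:", exc)

# After normalization both arrays are 1-D with the same length.
src_1d, dst_1d = normalize_state_indices(src, dst)
print("after:", contiguity_breaks(src_1d, dst_1d))
```

Raveling before truncating matters here: truncating first would leave the (1, N) array untouched, because its outer axis already has length 1.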
patches/README.md
- Document patch #4 alongside the existing three. Cross-link the full
bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md
and the gsm8k verification at
scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.
Oseltamivir added a commit that referenced this pull request May 26, 2026
4 tasks
4 tasks
No description provided.