
[NVIDIA] Fix vllm & sglang b200 updated containers #4

Merged
kimbochen merged 6 commits into main from fix-vllm-b200 on Sep 4, 2025

Conversation

@kedarpotdar-nv
Collaborator

No description provided.

@kimbochen kimbochen merged commit 75ec29c into main Sep 4, 2025
@kimbochen kimbochen deleted the fix-vllm-b200 branch September 4, 2025 00:42
jthomson04 pushed a commit to jthomson04/InferenceMAX that referenced this pull request Jan 21, 2026
Modify GB200 runs to use test partition
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
@cquil11 cquil11 changed the title Fix vllm & sglang b200 updated containers [NVIDIA] Fix vllm & sglang b200 updated containers Apr 8, 2026
chunfangamd added a commit that referenced this pull request May 24, 2026
…DSA state-index path

amd-master.yaml
  - Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402
        -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
    (matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is
    no longer the reference build for hybrid-attention disagg models on
    MI355X.)
  - Scenarios: collapse the four legacy "top/middle/bottom/small-scale"
    search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false
    entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512]
    for both 1k1k and 8k1k. dp-attn=false avoids the
    fused_moe_triton/layer.py:209 shared-slot assertion that
    --enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5
    (256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed
    layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the
    same CI matrix-expansion logic applies to both.
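
A minimal sketch of how the single collapsed entry could expand into CI jobs, assuming a plain cross-product of sequence shape and concurrency. The dictionary keys and the `expand_matrix` helper are hypothetical, not the repository's actual schema or expansion code. The last lines only reproduce the arithmetic quoted above; they do not model the real assertion in fused_moe_triton/layer.py.

```python
from itertools import product

# Values taken from the commit message.
CONC_LIST = [8, 16, 32, 64, 128, 256, 512]
SEQ_SHAPES = ["1k1k", "8k1k"]


def expand_matrix(seq_shapes, conc_list):
    """Hypothetical expansion: one job per (sequence shape, concurrency) pair.

    All key names below are illustrative assumptions.
    """
    return [
        {
            "seq": seq,
            "conc": conc,
            "prefill_workers": 1,   # 1P
            "decode_workers": 1,    # 1D
            "tp": 8,
            "ep": 1,
            "dp_attn": False,
        }
        for seq, conc in product(seq_shapes, conc_list)
    ]


jobs = expand_matrix(SEQ_SHAPES, CONC_LIST)
print(len(jobs))  # 2 shapes x 7 concurrencies = 14 jobs

# Arithmetic quoted in the commit message for GLM-5 with 256 routed experts.
routed_experts = 256
remainder = (routed_experts - 1) % 8
print(remainder)  # 7, i.e. not divisible by 8
```

Under these assumptions the collapsed layout yields 14 jobs per model, all with dp-attn disabled.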

patches/mori_conn.py
  - Add patch #4: rank + length normalization in
    MoriKVReceiver._send_swa_dsa_state, immediately before the
    group_concurrent_contiguous call. For GLM-5 (single DSA component),
    upstream hands dst_state_indices as a 2-D (1, N) array while
    src_state_indices is 1-D length 1; the existing [:common_len]
    slice operates only on the outer axis, leaving the rank mismatched.
    np.diff then produces (1, N-1) vs (0,), which can't broadcast and
    crashes with "operands could not be broadcast together with shapes
    (1,12) (0,)". The fix ravels both index arrays to 1-D and re-truncates
    them to the common length so np.diff outputs compatible 1-D arrays. A
    one-shot log gate limits the warning to once per receiver class.
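
The shape mismatch described above can be reproduced with NumPy alone. This is a sketch under stated assumptions: `normalize_state_indices` is a hypothetical helper name, the index values are made up, and the real patch lives inside MoriKVReceiver._send_swa_dsa_state, which is not reproduced here.

```python
import numpy as np


def normalize_state_indices(src, dst):
    """Hypothetical helper: flatten both index arrays to 1-D, then
    truncate them to their common length."""
    src = np.asarray(src).ravel()
    dst = np.asarray(dst).ravel()
    common_len = min(len(src), len(dst))
    return src[:common_len], dst[:common_len]


# Shapes from the commit message: src is 1-D of length 1, dst is 2-D (1, N).
src = np.array([5])
dst = np.arange(100, 113).reshape(1, 13)  # N = 13, illustrative values

# Slicing only the outer axis leaves dst two-dimensional.
outer_len = min(len(src), len(dst))        # 1
dst_sliced = dst[:outer_len]               # still shape (1, 13)
print(np.diff(dst_sliced).shape)           # (1, 12)
print(np.diff(src[:outer_len]).shape)      # (0,)

try:
    np.diff(dst_sliced) - np.diff(src[:outer_len])
    broadcast_failed = False
except ValueError:
    # Trailing dimensions 12 and 0 cannot be broadcast together.
    broadcast_failed = True

# After normalization both arrays are 1-D with the same length.
src_ok, dst_ok = normalize_state_indices(src, dst)
print(np.diff(src_ok).shape, np.diff(dst_ok).shape)  # (0,) (0,)
```

With both inputs flattened first, truncation applies to the element axis rather than the outer axis, so the two np.diff results always have matching 1-D shapes.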

  - Verified end-to-end:
      glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047
      glm5-fp8-mi355x-sglang-disagg gsm8k strict-match     = 0.9712 +/- 0.0046
      qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression)  = 0.9780 +/- 0.004
    Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives
    inside _send_swa_dsa_state, never called for Mamba); patches #1-#3
    behavior is unchanged.

patches/README.md
  - Document patch #4 alongside the existing three. Cross-link the full
    bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md
    and the gsm8k verification at
    scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.
Oseltamivir added a commit that referenced this pull request May 26, 2026

3 participants