
short term patch: GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash #1578

Merged
functionstackx merged 1 commit into chang/glm5-fp8-mi355x-sglang-disagg-pr-2 from klaud-cold/glm5-mori-conn-overlay
May 28, 2026


@functionstackx
Collaborator

@functionstackx functionstackx commented May 28, 2026

Summary

patch authored by @chunfangamd (credit to chunfang!!)

Stacked on top of #1572. Ports the validated MoRI conn.py overlay from Chun Fang's working branch (chun-chang/sglang-disagg-qwen3.5, commits 48e459bd + c4e397de) to fix PR #1572's CI startup crash on glm5-fp8-mi355x-sglang-disagg.

Why

PR #1572's latest CI run crashed at PD-disagg startup:

File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py", line 1424, in <genexpr>
    struct.pack("I", item_len)
struct.error: required argument is not an integer

Root cause: KVArgs.state_item_lens is List[List[int]] for any model with state_types: List[StateType] (Qwen3.5-MoE, GLM-5 NSA, etc.), but mori/conn.py:_register_kv_args iterates and packs each element with naked struct.pack("I", item_len), expecting the legacy List[int]. Two further sites in send_state and _send_swa_dsa_state have the same legacy-API assumption. Mooncake (mooncake/conn.py:912) and NIXL (nixl/conn.py:1381) already use the plural-API; MoRI never migrated.
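The failure mode and the flatten-style fix can be reproduced outside sglang. The sketch below is illustrative only: `flatten_item_lens` and `pack_item_lens` are hypothetical names, the sample lengths are made up, and dropping `None` entries is an assumption about what the overlay's own helper does, since its source is not shown here.

```python
import struct
from typing import List

def flatten_item_lens(item_lens) -> List[int]:
    """Flatten one level of nesting and drop None entries (assumed behavior)."""
    flat: List[int] = []
    for entry in item_lens:
        if entry is None:
            continue
        if isinstance(entry, (list, tuple)):
            flat.extend(int(x) for x in entry if x is not None)
        else:
            flat.append(int(entry))
    return flat

def pack_item_lens(item_lens) -> bytes:
    """Pack each length as a native unsigned int, for flat or nested input."""
    return b"".join(struct.pack("I", n) for n in flatten_item_lens(item_lens))

nested = [[4096, 128], [512]]  # plural-API shape: List[List[int]] (made-up values)
legacy = [4096, 128, 512]      # legacy shape: List[int]

# The legacy packing loop fails as soon as an element is itself a list.
try:
    b"".join(struct.pack("I", item_len) for item_len in nested)
    crashed = False
except struct.error:
    crashed = True

print(crashed)                                            # True
print(pack_item_lens(nested) == pack_item_lens(legacy))   # True
```

Flattening on the sender keeps the wire format identical to the legacy flat-int layout, so a receiver that still expects `List[int]` is unaffected.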

This is the exact "MoRI conn.py overlay" PR #1572's own description called out as out-of-scope follow-up.

What

  • benchmarks/multi_node/amd_utils/patches/mori_conn.py (new, 1665 lines) — drop-in replacement for sglang v0.5.12.post1's disaggregation/mori/conn.py with four conservative patches:
    1. Sender flatten — handle nested state_item_lens
    2. state_type plural-API fallback — matches Mooncake/NIXL
    3. Consumer normalize state_item_lens once at send_state entry
    4. SWA/DSA rank+length normalize before group_concurrent_contiguous — fixes GLM-5 DSA single-component np.diff broadcast crash
  • benchmarks/multi_node/amd_utils/patches/README.md — bug analysis, when-to-use table, opt-out knob docs
  • benchmarks/multi_node/amd_utils/job.slurm (+25) — auto-bind-mount the overlay when DOCKER_IMAGE_NAME contains v0.5.12.post1; opt-out via MORI_CONN_PATCH=skip; appends to ${EXTRA_DOCKER_MOUNTS:-} so other mounts still compose
  • .github/configs/amd-master.yaml (+1/-1) — image bump v0.5.12-...-20260517 → v0.5.12.post1-...-20260523 to unlock the auto-apply gate
  • perf-changelog.yaml (+2/-1) — document the image bump and overlay rationale
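The auto-apply gate lives in job.slurm, which is shell. The decision it makes can be modeled in Python as a rough sketch. Everything below is an assumption layered on the description above: `overlay_mount_args` is a hypothetical name, the container-side path is taken from the traceback, and the duplicate check (skip when the caller already mounts something at the target path) is a guess at how the dedup behaves.

```python
from typing import List, Mapping

# Repo-relative here for readability; a real docker -v typically needs an absolute host path.
OVERLAY_SRC = "benchmarks/multi_node/amd_utils/patches/mori_conn.py"
# Container-side target, taken from the path in the crash traceback.
OVERLAY_DST = "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py"
GATE_TAG = "v0.5.12.post1"

def overlay_mount_args(env: Mapping[str, str]) -> List[str]:
    """Return docker mount args: caller-supplied mounts plus the overlay if the gate fires."""
    mounts = env.get("EXTRA_DOCKER_MOUNTS", "").split()
    if GATE_TAG not in env.get("DOCKER_IMAGE_NAME", ""):
        return mounts                      # other images are left untouched
    if env.get("MORI_CONN_PATCH", "") == "skip":
        return mounts                      # explicit opt-out
    if any(OVERLAY_DST in arg for arg in mounts):
        return mounts                      # assumed dedup: target already mounted
    return mounts + ["-v", f"{OVERLAY_SRC}:{OVERLAY_DST}:ro"]

env = {"DOCKER_IMAGE_NAME": "lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523"}
print(overlay_mount_args(env))
print(overlay_mount_args({**env, "MORI_CONN_PATCH": "skip"}))  # []
```

Keying on the image-tag substring rather than the recipe name means any recipe moved onto a v0.5.12.post1 image picks up the overlay without further configuration.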

Validated on chun-chang at the same image tag

| Recipe | Metric | Result |
|---|---|---|
| glm5-fp8-mi355x-sglang-disagg | GSM8K strict-match | 0.9712 ± 0.0046 |
| glm5-fp8-mi355x-sglang-disagg | GSM8K flexible-extract | 0.9704 ± 0.0047 |

Matches/exceeds upstream sgl-project/sglang#22665's reported 0.970 baseline.

Why a stacked PR

Keeps #1572's review surface focused on the GLM-5 recipe (config, models.yaml, env tuning, GLM-5-specific setup_deps patches). This PR layers the upstream-bug workaround on top so it can be reviewed/merged independently and dropped cleanly when the upstream sglang fix lands.

Test plan

  • CI sweep on glm5-fp8-mi355x-sglang-disagg (1k1k + 8k1k smoke) clears the startup crash
  • Verify [job.slurm] auto-applied MoRI conn.py overlay log line appears on v0.5.12.post1 runs
  • Verify the overlay does not auto-apply on the older v0.5.9-mori-0402 image (no log line; other disagg recipes unaffected)
  • Spot-check GSM8K accuracy matches chun-chang's 0.971 baseline

Follow-up

Stop-gap until sglang migrates MoRI to the plural state_types API. Tracks sgl-project/sglang#21886 and #22665. Drop this overlay when a published lmsysorg/sglang-rocm:* image carries the upstream fix.

🤖 Generated with Claude Code


Note

Medium Risk
Large vendored Python overlay on the PD/KV transfer path and a new image tag for one CI recipe; scope is gated to v0.5.12.post1 images with opt-out, but wrong mount or patch drift could affect disagg correctness.

Overview
Unblocks GLM-5-FP8 MI355X SGLang PD-disagg CI by working around broken MoRI conn.py in lmsysorg/sglang-rocm:v0.5.12.post1 (hybrid models with plural state_types crash at startup with struct.error).

The PR adds an in-tree mori_conn.py overlay (four targeted fixes: nested state_item_lens flatten, state_types[0] fallback, send_state normalization, DSA index ravel) plus patches/README.md. job.slurm auto bind-mounts it when DOCKER_IMAGE_NAME contains v0.5.12.post1 (opt-out MORI_CONN_PATCH=skip), forwards ${EXTRA_DOCKER_MOUNTS:-} into docker run, and dedupes mounts. amd-master.yaml bumps glm5-fp8-mi355x-sglang-disagg to v0.5.12.post1-...-20260523; perf-changelog.yaml documents the image + overlay.

Reviewed by Cursor Bugbot for commit aba002c. Bugbot is set up for automated code reviews on this repo. Configure here.

…p crash

sglang v0.5.12.post1 ships an unmigrated MoRI PD-disagg backend
(legacy singular `state_type` + flat-int wire format) that crashes
hybrid-attention models at PD-disagg startup. PR #1572's CI run
(actions/runs/26544255929/job/78192785656) hit this exact failure:

  File ".../disaggregation/mori/conn.py", line 1424, in <genexpr>
      struct.pack("I", item_len)
  struct.error: required argument is not an integer

Root cause: KVArgs.state_item_lens is List[List[int]] for any model
with state_types: List[StateType] (Qwen3.5-MoE, GLM-5 NSA, etc.),
but MoRI's `_register_kv_args` iterates and packs each element with
naked `struct.pack("I", item_len)`, expecting flat List[int].
Two further sites in `send_state` and `_send_swa_dsa_state` have the
same legacy-API assumption.

This PR ports the validated overlay from Chun Fang's working branch
(chun-chang/sglang-disagg-qwen3.5, commits 48e459b + c4e397d) and
stacks it on top of PR #1572:

  - benchmarks/multi_node/amd_utils/patches/mori_conn.py     (new, 1665 lines)
    Drop-in replacement for sglang v0.5.12.post1's
    disaggregation/mori/conn.py with four conservative patches:
      1. Sender flatten — handle nested state_item_lens
      2. state_type plural-API fallback (matches Mooncake/NIXL)
      3. Consumer normalize state_item_lens at send_state entry
      4. SWA/DSA rank+length normalize before group_concurrent_contiguous
         (fixes GLM-5 DSA single-component np.diff broadcast crash)

  - benchmarks/multi_node/amd_utils/patches/README.md         (new)
    Bug analysis, when-to-use table, opt-out knob documentation.

  - benchmarks/multi_node/amd_utils/job.slurm                 (+25)
    Auto-bind-mount the overlay when DOCKER_IMAGE_NAME contains
    "v0.5.12.post1". Opt-out via MORI_CONN_PATCH=skip. Appends to
    ${EXTRA_DOCKER_MOUNTS:-} so callers can still inject other mounts.

  - .github/configs/amd-master.yaml                            (+1/-1)
    Image bump v0.5.12-...-20260517 → v0.5.12.post1-...-20260523 to
    unlock the auto-apply gate (matches chun-chang lineage).

  - perf-changelog.yaml                                        (+2/-1)
    Document the image bump and overlay rationale.

Validated on chun-chang at the same image tag:
  glm5-fp8-mi355x-sglang-disagg GSM8K strict-match     = 0.9712 ± 0.0046
  glm5-fp8-mi355x-sglang-disagg GSM8K flexible-extract = 0.9704 ± 0.0047

Stop-gap until sglang migrates MoRI to the plural state_types API
that Mooncake (mooncake/conn.py:912) and NIXL (nixl/conn.py:1381)
already use. Tracks sgl-project/sglang#21886 and #22665.

Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@claude
Contributor

claude Bot commented May 28, 2026

Claude finished @functionstackx's task in 4m 3s


Review: PR #1578 — GLM-5 disagg MoRI conn.py overlay

  • Gather context and read changed files
  • Validate master config and perf-changelog
  • Review mori_conn.py overlay (1665 lines, 4 patches)
  • Review job.slurm changes
  • Review patches README
  • Cross-reference with upstream sglang v0.5.12 source
  • Post final review

LGTM — no blocking issues found.

The four patches are conservative, well-scoped, and address real upstream bugs confirmed by the CI crash trace. I cross-referenced the overlay against sglang v0.5.12's mori/conn.py via the inferencemax-repos MCP server and confirmed:

  1. Sender flatten (Patch #1, _register_kv_args L1527) — upstream v0.5.12 has naked struct.pack("I", item_len) for item_len in self.kv_mgr.kv_args.state_item_lens which will crash when state_item_lens is List[List[int]]. The _flatten_skip_none fix correctly handles both nested and flat cases.

  2. state_type plural fallback (Patch #2, send_state L968-989) — upstream v0.5.12 reads only the legacy singular state_type attribute and raises immediately when it's "none". The overlay's state_types[0] fallback mirrors what Mooncake and NIXL already do.

  3. Consumer normalize (Patch #3, send_state L923-944) — flattens state_item_lens in-place before the downstream len(...) == len(state_mem_descs) assertion, avoiding a spurious failure for nested inputs.

  4. DSA index ravel (Patch #4, _send_swa_dsa_state L1151-1180) — fixes the np.diff broadcasting crash for single-component DSA models (GLM-5) by raveling 2-D (1, N) index arrays to 1-D before group_concurrent_contiguous.
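The plural fallback in item 2 can be illustrated with a small stand-alone sketch. It does not reproduce the overlay: `FakeKVArgs` and `resolve_state_type` are hypothetical names, and modeling state types as plain strings such as "dsa" or "swa" is an assumption made for readability.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FakeKVArgs:
    """Hypothetical stand-in for the real KVArgs; only the two fields discussed above."""
    state_type: str = "none"                              # legacy singular attribute
    state_types: List[str] = field(default_factory=list)  # plural API

def resolve_state_type(kv_args) -> str:
    """Prefer the legacy attribute; fall back to the first plural entry."""
    legacy = getattr(kv_args, "state_type", "none")
    if legacy and legacy != "none":
        return legacy
    plural = getattr(kv_args, "state_types", None) or []
    if plural:
        return plural[0]
    raise RuntimeError("no state type configured on kv_args")

print(resolve_state_type(FakeKVArgs(state_type="swa")))     # swa
print(resolve_state_type(FakeKVArgs(state_types=["dsa"])))  # dsa
```

Checking the legacy attribute first keeps single-state models on their existing path, while models that populate only `state_types` no longer hit the immediate raise.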

job.slurm integration is clean: auto-apply gate (v0.5.12.post1 glob), opt-out (MORI_CONN_PATCH=skip), dedup guard against EXTRA_DOCKER_MOUNTS, diagnostic log line, and proper :ro bind mount. The ${EXTRA_DOCKER_MOUNTS:-} expansion at L492 correctly word-splits Docker -v args.

Checklist:

  • Image lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 is a public Docker Hub image ✓
  • perf-changelog.yaml updated with image bump and overlay rationale ✓
  • Entry appended at end of file (L3175-3184 of 3184) ✓
  • amd-master.yaml change is image-only (no new config key or model prefix change) ✓


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


pass
self.__class__._logged_dsa_index_flatten = True
src_state_indices = src_state_indices[:new_common]
dst_state_indices = dst_state_indices[:new_common]

PATCH #4 ravel placed after truncation causes data loss

Medium Severity

The ravel normalization (PATCH #4) is placed after the pre-existing length-based truncation at lines 1135–1149, but len() on a 2-D numpy array returns the first axis size, not total elements. When src_state_indices is shape (1, N) and dst_state_indices is shape (N,), common_len becomes min(1, N) = 1, and dst_state_indices[:1] truncates it to a single element before the ravel can flatten both to 1-D. After ravel, src has N elements but dst has 1, so new_common = 1 and only one state index is transferred — silent data loss for the entire state transfer.
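The ordering problem described in this report can be checked with a few lines of NumPy. This is a minimal sketch under stated assumptions: both function names are hypothetical, the index values are made up, and ravel-before-truncate is shown as one possible ordering, not necessarily the fix the overlay adopts.

```python
import numpy as np

def truncate_then_ravel(src, dst):
    """Order described in the report: len()-based truncation first, ravel second."""
    common = min(len(src), len(dst))     # len() of a (1, N) array is 1
    src, dst = src[:common], dst[:common]
    src, dst = np.ravel(src), np.ravel(dst)
    new_common = min(len(src), len(dst))
    return src[:new_common], dst[:new_common]

def ravel_then_truncate(src, dst):
    """Alternative order: flatten to 1-D first so len() counts elements."""
    src, dst = np.ravel(src), np.ravel(dst)
    common = min(len(src), len(dst))
    return src[:common], dst[:common]

src = np.arange(8).reshape(1, 8)   # shape (1, N), as for a single-component DSA model
dst = np.arange(100, 108)          # shape (N,)

bad_src, bad_dst = truncate_then_ravel(src, dst)
good_src, good_dst = ravel_then_truncate(src, dst)
print(len(bad_dst), len(good_dst))  # 1 8
```

With the report's ordering only one index pair survives; flattening first preserves all eight.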


Comment on lines +78 to +82
| Image / version | Need `mori_conn.py` overlay? |
|---|---|
| `lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523` | yes (Qwen3.5-MoE-FP8, GLM-5, any hybrid model on this image) |
| `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` (used by `dsr1-fp4-*-disagg`) | not validated; same code path likely affected — try with the overlay if you hit the same `struct.error` |
| `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*` (used by `dsr1-fp8-*-disagg`, `glm5-*-disagg`) | predates [PR #22665](https://github.com/sgl-project/sglang/pull/22665); different code paths; **do not** apply this overlay |

🟡 The new patches/README.md has three documentation accuracy issues introduced by this PR (no runtime impact — the auto-apply gate in job.slurm keys on the image-tag substring v0.5.12.post1, not on recipe names):

  1. Row 3 of the "When to use which patch" table lists glm5-*-disagg as a user of the rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-* image with "do not apply this overlay". After this PR the only glm5-*-disagg recipe (glm5-fp8-mi355x-sglang-disagg) is on v0.5.12.post1 and does need the overlay. Drop glm5-*-disagg from the row-3 parenthetical so it only mentions dsr1-fp8-*-disagg.

  2. Row 2 lists lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-* as the image used by dsr1-fp4-*-disagg. The two actual recipes (dsr1-fp4-mi355x-sglang-disagg{,-mtp}) both use lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 — the v0.5.10.post1 tag string appears nowhere in amd-master.yaml. Update the image tag in the row.

  3. Five dangling path references (lines 12, 49, 57, 61, 85) point at scripts/sglang_disagg/ — but no scripts/ directory exists in the repo. The four cited docs (docs_glm5/01-bug-analysis.md, docs_glm5/02-fix-and-verification.md, docs/03-upstream-pr-proposal.md) all dead-end. Either land these docs in this PR or drop the references.

Extended reasoning...

This is a pure documentation-hygiene cleanup on the new patches/README.md added in this PR. There is no runtime impact: the auto-apply gate in job.slurm (lines 71-78) keys on "${DOCKER_IMAGE_NAME:-}" == *"v0.5.12.post1"*, not on recipe names or the README table, so the overlay still gets applied to exactly the right recipes regardless of what the table says.

Bug 1 — row 3 stale glm5-*-disagg parenthetical

README line 82 says:

| `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*` (used by `dsr1-fp8-*-disagg`, `glm5-*-disagg`) | predates PR #22665; different code paths; **do not** apply this overlay |

But .github/configs/amd-master.yaml line 492 (after this PR's image bump) shows:

glm5-fp8-mi355x-sglang-disagg:
  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523

The only glm5-*-disagg recipe in the AMD master config is on v0.5.12.post1 — never on the 0.5.9-mori image. Row 1 of the same table correctly says GLM-5 on v0.5.12.post1 needs the overlay. A reader scanning row 3 by recipe name (glm5-*-disagg) instead of by image gets the opposite (and wrong) instruction.

Step-by-step proof of impact (manual readers only — automation is fine):

  1. A future operator is debugging glm5-fp8-mi355x-sglang-disagg and wants to know if the MoRI overlay is needed.
  2. They grep the README for glm5-*-disagg and land on row 3.
  3. Row 3 says "do not apply this overlay".
  4. They opt out via MORI_CONN_PATCH=skip and reproduce the v0.5.12.post1 struct.error startup crash that this PR exists to fix.
  5. Time wasted figuring out the table is stale.

Bug 2 — row 2 wrong image tag for dsr1-fp4-*-disagg

README line 81 says:

| `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` (used by `dsr1-fp4-*-disagg`) | not validated; same code path likely affected — try with the overlay if you hit the same `struct.error` |

But amd-master.yaml line 1518 shows:

dsr1-fp4-mi355x-sglang-disagg:
  image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519

and line 1746 shows the same image for dsr1-fp4-mi355x-sglang-disagg-mtp. Grep confirms v0.5.10.post1 appears nowhere in amd-master.yaml. So row 2 cites an image that doesn't exist in the AMD config and gives no useful guidance for dsr1-fp4-*-disagg at all.

Bug 3 — five dangling path references

README references the following paths:

  • Line 12: scripts/sglang_disagg/ (claimed to host "local-test driver scripts")
  • Line 49: scripts/sglang_disagg/docs_glm5/01-bug-analysis.md
  • Line 57: scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md
  • Line 61: scripts/sglang_disagg/docs/03-upstream-pr-proposal.md
  • Line 85: scripts/sglang_disagg/docs/03-upstream-pr-proposal.md (second reference)

But ls scripts/ returns "No such file or directory" — there is no scripts/ directory anywhere in the repo. These paths likely live on the author's working branch (chun-chang/sglang-disagg-qwen3.5) but never got ported. The README invites readers to "see" those documents for the full bug analysis and upstream proposal — every link dead-ends.

How to fix

All three are local README edits:

  1. Row 3 parenthetical: drop ", glm5-*-disagg" so it reads (used by dsr1-fp8-*-disagg).
  2. Row 2: replace lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-* with the actual image, lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519.
  3. Either land the four scripts/sglang_disagg/ docs in this PR (alongside the README that references them) or strike the five references and inline a brief summary of the bug analysis.

Severity

All five verifiers (across three sub-bugs) tagged this as nit: docs-only, no functional regression, auto-apply gate behaves correctly. Filing as nit accordingly.

Comment thread perf-changelog.yaml
description:
- "Add GLM-5-FP8 MI355X SGLang disaggregated prefill-decode benchmark"
- "Image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 (matches glm5-fp8-mi355x-sglang)"
- "Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 (bumped from .v0.5.12-...-20260517 to unlock the PD-disagg MoRI overlay; matches chun-chang/sglang-disagg-qwen3.5)"

🟡 Cosmetic typo in the new perf-changelog.yaml entry at line 3179: bumped from .v0.5.12-...-20260517 has a stray leading period before the v. The actual previous image tag was v0.5.12-rocm720-mi35x-20260517 (no leading dot). One-character fix — drop the period.

Extended reasoning...

What the bug is

In perf-changelog.yaml line 3179 the new changelog entry reads:

- "Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 (bumped from .v0.5.12-...-20260517 to unlock the PD-disagg MoRI overlay; matches chun-chang/sglang-disagg-qwen3.5)"

The substring bumped from .v0.5.12-...-20260517 carries a stray leading period before the v. The actual previous image tag, as visible in this same PR's .github/configs/amd-master.yaml diff, was v0.5.12-rocm720-mi35x-20260517 (no leading dot).

Why this is a real but minor issue

Step-by-step proof:

  1. Look at .github/configs/amd-master.yaml in this PR — the glm5-fp8-mi355x-sglang-disagg: entry changes from image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523.
  2. The previous tag string is v0.5.12-rocm720-mi35x-20260517 — no leading period.
  3. The changelog entry abbreviates the previous tag as .v0.5.12-...-20260517 with a leading dot, which doesn't match.

The dot is almost certainly a stray keystroke (perhaps a leftover from formatting around the ellipsis ...) rather than a meaningful version segment. Both sibling entries elsewhere in the file consistently write the unbumped tag without a leading dot.

Impact

Zero runtime impact — this is a free-text human-readable changelog string parsed only by changelog-rendering tooling (or human readers). No code reads this field for version comparison. Severity is purely cosmetic.

Fix

Replace .v0.5.12-...-20260517 with v0.5.12-...-20260517 on line 3179 of perf-changelog.yaml (one-character deletion). All three independent verifiers reached the same conclusion ("nit") with no refutations.

@functionstackx functionstackx changed the title from "[Klaud Cold] GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash" to "[Klaud Cold][Debug] GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash" May 28, 2026
@functionstackx functionstackx changed the base branch from chang/glm5-fp8-mi355x-sglang-disagg-pr-2 to main May 28, 2026 02:23
@functionstackx functionstackx requested a review from a team May 28, 2026 02:23
@functionstackx functionstackx changed the base branch from main to chang/glm5-fp8-mi355x-sglang-disagg-pr-2 May 28, 2026 04:59
@functionstackx functionstackx changed the title from "[Klaud Cold][Debug] GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash" to "short term patch: GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash" May 28, 2026
@functionstackx functionstackx merged commit ca5efd9 into chang/glm5-fp8-mi355x-sglang-disagg-pr-2 May 28, 2026
65 of 72 checks passed
@functionstackx functionstackx deleted the klaud-cold/glm5-mori-conn-overlay branch May 28, 2026 05:00

functionstackx added a commit that referenced this pull request May 28, 2026
…p crash (#1578)
functionstackx added a commit that referenced this pull request May 28, 2026
…1572)

* Add GLM-5 FP8 MI355X SGLang disaggregated benchmark (PR-2).

Introduce glm5-fp8-mi355x-sglang-disagg CI config, model server flags,
launch script, setup_deps.sh image patches, and GLM-5 env tuning for
MoRI PD disaggregation on MI355X.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* Update benchmarks/multi_node/amd_utils/setup_deps.sh

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* fix: add FRAMEWORK to check_env_vars, fix NODELIST variable name

- Add missing FRAMEWORK to check_env_vars list to match sister
  sglang-disagg scripts (dsr1_fp8, dsr1_fp4)
- Rename NODE_LIST to NODELIST (quoted) to match the convention
  used by kimik2.5/minimaxm2.5 vllm-disagg sisters

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* [Klaud Cold] GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash (#1578)

---------

Co-authored-by: cliu1004@amd.com <cliu1004@amd.com@mia1-p01-g18.mia.tensorwave.lan>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>