short term patch: GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash #1578
…p crash

sglang v0.5.12.post1 ships an unmigrated MoRI PD-disagg backend (legacy singular `state_type` + flat-int wire format) that crashes hybrid-attention models at PD-disagg startup. PR #1572's CI run (actions/runs/26544255929/job/78192785656) hit this exact failure:

    File ".../disaggregation/mori/conn.py", line 1424, in <genexpr>
      struct.pack("I", item_len)
    struct.error: required argument is not an integer

Root cause: KVArgs.state_item_lens is List[List[int]] for any model with state_types: List[StateType] (Qwen3.5-MoE, GLM-5 NSA, etc.), but MoRI's `_register_kv_args` iterates and packs each element with naked `struct.pack("I", item_len)`, expecting flat List[int]. Two further sites in `send_state` and `_send_swa_dsa_state` have the same legacy-API assumption.

This PR ports the validated overlay from Chun Fang's working branch (chun-chang/sglang-disagg-qwen3.5, commits 48e459b + c4e397d) and stacks it on top of PR #1572:

- benchmarks/multi_node/amd_utils/patches/mori_conn.py (new, 1665 lines)
  Drop-in replacement for sglang v0.5.12.post1's disaggregation/mori/conn.py with four conservative patches:
  1. Sender flatten: handle nested state_item_lens
  2. state_type plural-API fallback (matches Mooncake/NIXL)
  3. Consumer normalize state_item_lens at send_state entry
  4. SWA/DSA rank+length normalize before group_concurrent_contiguous (fixes GLM-5 DSA single-component np.diff broadcast crash)
- benchmarks/multi_node/amd_utils/patches/README.md (new)
  Bug analysis, when-to-use table, opt-out knob documentation.
- benchmarks/multi_node/amd_utils/job.slurm (+25)
  Auto-bind-mount the overlay when DOCKER_IMAGE_NAME contains "v0.5.12.post1". Opt-out via MORI_CONN_PATCH=skip. Appends to ${EXTRA_DOCKER_MOUNTS:-} so callers can still inject other mounts.
- .github/configs/amd-master.yaml (+1/-1)
  Image bump v0.5.12-...-20260517 → v0.5.12.post1-...-20260523 to unlock the auto-apply gate (matches chun-chang lineage).
- perf-changelog.yaml (+2/-1)
  Document the image bump and overlay rationale.

Validated on chun-chang at the same image tag:

    glm5-fp8-mi355x-sglang-disagg GSM8K strict-match     = 0.9712 ± 0.0046
    glm5-fp8-mi355x-sglang-disagg GSM8K flexible-extract = 0.9704 ± 0.0047

Stop-gap until sglang migrates MoRI to the plural state_types API that Mooncake (mooncake/conn.py:912) and NIXL (nixl/conn.py:1381) already use. Tracks sgl-project/sglang#21886 and #22665.

Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
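The root cause described above can be reproduced in isolation. The following is a minimal sketch, not the overlay's actual code: `flatten_item_lens` and `pack_item_lens` are hypothetical helper names, and the sample lengths are made up. It shows that `struct.pack("I", ...)` rejects a nested list element and that flattening first lets every length be packed as a flat unsigned int.

```python
import struct
from typing import List


def flatten_item_lens(item_lens) -> List[int]:
    """Flatten List[int] or List[List[int]] into a flat List[int] (hypothetical helper)."""
    flat: List[int] = []
    for item in item_lens:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_item_lens(item))
        else:
            flat.append(int(item))
    return flat


def pack_item_lens(item_lens) -> bytes:
    """Pack every length as a native unsigned int after flattening."""
    return b"".join(struct.pack("I", n) for n in flatten_item_lens(item_lens))


# Made-up nested lengths, standing in for a model with several state types.
nested = [[4096, 128], [256]]

# Legacy behaviour: each element is handed to struct.pack directly.
try:
    b"".join(struct.pack("I", item_len) for item_len in nested)
    legacy_error = None
except struct.error as exc:
    legacy_error = str(exc)

packed = pack_item_lens(nested)
print("legacy path failed:", legacy_error is not None)
print("flattened payload bytes:", len(packed))
```

A flat `List[int]` input passes through `flatten_item_lens` unchanged, so a helper of this shape would stay compatible with the legacy callers.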
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @functionstackx's task in 4m 3s —— View job Review: PR #1578 — GLM-5 disagg MoRI conn.py overlay
LGTM — no blocking issues found. The four patches are conservative, well-scoped, and address real upstream bugs confirmed by the CI crash trace. I cross-referenced the overlay against sglang v0.5.12's
Checklist:
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit aba002c. Configure here.
    pass
    self.__class__._logged_dsa_index_flatten = True
    src_state_indices = src_state_indices[:new_common]
    dst_state_indices = dst_state_indices[:new_common]
PATCH #4 ravel placed after truncation causes data loss
Medium Severity
The ravel normalization (PATCH #4) is placed after the pre-existing length-based truncation at lines 1135–1149, but len() on a 2-D numpy array returns the first axis size, not total elements. When src_state_indices is shape (1, N) and dst_state_indices is shape (N,), common_len becomes min(1, N) = 1, and dst_state_indices[:1] truncates it to a single element before the ravel can flatten both to 1-D. After ravel, src has N elements but dst has 1, so new_common = 1 and only one state index is transferred — silent data loss for the entire state transfer.
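The ordering problem described in this finding can be checked with plain NumPy. This is a minimal sketch under stated assumptions: the two function names are hypothetical and only mimic the order of operations the finding describes (they are not the overlay's code), and the index values are made up.

```python
import numpy as np


def truncate_then_ravel(src, dst):
    """Order described in the finding: length-based truncation first, ravel second."""
    common_len = min(len(src), len(dst))  # len() of a 2-D array is its first axis size
    src, dst = src[:common_len], dst[:common_len]
    src, dst = np.ravel(src), np.ravel(dst)
    new_common = min(len(src), len(dst))
    return src[:new_common], dst[:new_common]


def ravel_then_truncate(src, dst):
    """Alternative order: flatten both to 1-D first, then truncate to the common length."""
    src, dst = np.ravel(src), np.ravel(dst)
    common_len = min(len(src), len(dst))
    return src[:common_len], dst[:common_len]


# Made-up indices: src has shape (1, N), dst has shape (N,), with N = 8.
src_state_indices = np.arange(8).reshape(1, 8)
dst_state_indices = np.arange(100, 108)

lossy_src, lossy_dst = truncate_then_ravel(src_state_indices, dst_state_indices)
full_src, full_dst = ravel_then_truncate(src_state_indices, dst_state_indices)

print("truncate-then-ravel keeps", len(lossy_src), "index pair(s)")
print("ravel-then-truncate keeps", len(full_src), "index pair(s)")
```

With these shapes the first ordering keeps a single index pair while the second keeps all eight, which is the silent loss the finding describes.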
| Image / version | Need `mori_conn.py` overlay? |
|---|---|
| `lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523` | yes (Qwen3.5-MoE-FP8, GLM-5, any hybrid model on this image) |
| `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` (used by `dsr1-fp4-*-disagg`) | not validated; same code path likely affected — try with the overlay if you hit the same `struct.error` |
| `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*` (used by `dsr1-fp8-*-disagg`, `glm5-*-disagg`) | predates [PR #22665](https://github.com/sgl-project/sglang/pull/22665); different code paths; **do not** apply this overlay |
🟡 The new `patches/README.md` has three documentation accuracy issues introduced by this PR (no runtime impact: the auto-apply gate in `job.slurm` keys on the image-tag substring `v0.5.12.post1`, not on recipe names):

- Row 3 of the "When to use which patch" table lists `glm5-*-disagg` as a user of the `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*` image with "do not apply this overlay". After this PR the only `glm5-*-disagg` recipe (`glm5-fp8-mi355x-sglang-disagg`) is on `v0.5.12.post1` and does need the overlay. Drop `glm5-*-disagg` from the row-3 parenthetical so it only mentions `dsr1-fp8-*-disagg`.
- Row 2 lists `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` as the image used by `dsr1-fp4-*-disagg`. The two actual recipes (`dsr1-fp4-mi355x-sglang-disagg{,-mtp}`) both use `lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519`; the `v0.5.10.post1` tag string appears nowhere in `amd-master.yaml`. Update the image tag in the row.
- Five dangling path references (lines 12, 49, 57, 61, 85) point at `scripts/sglang_disagg/`, but no `scripts/` directory exists in the repo. The four doc citations (`docs_glm5/01-bug-analysis.md`, `docs_glm5/02-fix-and-verification.md`, and `docs/03-upstream-pr-proposal.md`, which is cited twice) all dead-end. Either land these docs in this PR or drop the references.
Extended reasoning...
This is a pure documentation-hygiene cleanup on the new patches/README.md added in this PR. There is no runtime impact: the auto-apply gate in job.slurm (lines 71-78) keys on "${DOCKER_IMAGE_NAME:-}" == *"v0.5.12.post1"*, not on recipe names or the README table, so the overlay still gets applied to exactly the right recipes regardless of what the table says.
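The gate logic referred to here can be summarised in a few lines. This is a hedged Python sketch of the decision only, not the shell code in `job.slurm`: `should_apply_overlay` is a hypothetical name, and the sketch assumes the opt-out (`MORI_CONN_PATCH=skip`) and the substring match on `DOCKER_IMAGE_NAME` are the only two inputs, as the PR description states.

```python
from typing import Optional

# Substring the review says the gate keys on.
OVERLAY_TAG = "v0.5.12.post1"


def should_apply_overlay(image_name: Optional[str], mori_conn_patch: Optional[str]) -> bool:
    """Return True when the overlay would be bind-mounted (illustrative logic only)."""
    if mori_conn_patch == "skip":  # documented opt-out knob
        return False
    # An unset image name behaves like an empty string, mirroring ${DOCKER_IMAGE_NAME:-}.
    return OVERLAY_TAG in (image_name or "")


# Image tags quoted in this review; the recipe name never enters the decision.
images = [
    "lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523",
    "lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519",
    "rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*",
]
for image in images:
    print(image, "->", should_apply_overlay(image, None))
```

Because only the image string and the opt-out variable are consulted, the stale recipe names in the README table cannot change which runs get the overlay.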
Bug 1 — row 3 stale glm5-*-disagg parenthetical
README line 82 says:
| `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*` (used by `dsr1-fp8-*-disagg`, `glm5-*-disagg`) | predates PR #22665; different code paths; **do not** apply this overlay |
But `.github/configs/amd-master.yaml` line 492 (after this PR's image bump) shows:

    glm5-fp8-mi355x-sglang-disagg:
      image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523

The only `glm5-*-disagg` recipe in the AMD master config is on `v0.5.12.post1`, never on the 0.5.9-mori image. Row 1 of the same table correctly says GLM-5 on `v0.5.12.post1` needs the overlay. A reader scanning row 3 by recipe name (`glm5-*-disagg`) instead of by image gets the opposite (and wrong) instruction.

Step-by-step proof of impact (manual readers only; automation is fine):

- A future operator is debugging `glm5-fp8-mi355x-sglang-disagg` and wants to know if the MoRI overlay is needed.
- They `grep` the README for `glm5-*-disagg` and land on row 3.
- Row 3 says "do not apply this overlay".
- They opt out via `MORI_CONN_PATCH=skip` and reproduce the v0.5.12.post1 `struct.error` startup crash that this PR exists to fix.
- Time wasted figuring out the table is stale.
Bug 2 — row 2 wrong image tag for dsr1-fp4-*-disagg
README line 81 says:
| `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` (used by `dsr1-fp4-*-disagg`) | not validated; same code path likely affected — try with the overlay if you hit the same `struct.error` |
But `amd-master.yaml` line 1518 shows:

    dsr1-fp4-mi355x-sglang-disagg:
      image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519

and line 1746 shows the same image for `dsr1-fp4-mi355x-sglang-disagg-mtp`. Grep confirms `v0.5.10.post1` appears nowhere in `amd-master.yaml`. So row 2 cites an image that doesn't exist in the AMD config and gives no useful guidance for `dsr1-fp4-*-disagg` at all.
Bug 3 — five dangling path references
README references the following paths:

- Line 12: `scripts/sglang_disagg/` (claimed to host "local-test driver scripts")
- Line 49: `scripts/sglang_disagg/docs_glm5/01-bug-analysis.md`
- Line 57: `scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md`
- Line 61: `scripts/sglang_disagg/docs/03-upstream-pr-proposal.md`
- Line 85: `scripts/sglang_disagg/docs/03-upstream-pr-proposal.md` (second reference)

But `ls scripts/` returns "No such file or directory"; there is no `scripts/` directory anywhere in the repo. These paths likely live on the author's working branch (`chun-chang/sglang-disagg-qwen3.5`) but never got ported. The README invites readers to "see" those documents for the full bug analysis and upstream proposal, and every link dead-ends.
How to fix
All three are local README edits:

- Row 3 parenthetical: drop `, glm5-*-disagg` → ``(used by `dsr1-fp8-*-disagg`)``.
- Row 2: replace `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` with the actual image, `lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519`.
- Either land the four `scripts/sglang_disagg/` docs in this PR (alongside the README that references them) or strike the five references and inline a brief summary of the bug analysis.
Severity
All five verifiers (across three sub-bugs) tagged this as nit: docs-only, no functional regression, auto-apply gate behaves correctly. Filing as nit accordingly.
    description:
      - "Add GLM-5-FP8 MI355X SGLang disaggregated prefill-decode benchmark"
      - "Image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 (matches glm5-fp8-mi355x-sglang)"
      - "Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 (bumped from .v0.5.12-...-20260517 to unlock the PD-disagg MoRI overlay; matches chun-chang/sglang-disagg-qwen3.5)"
🟡 Cosmetic typo in the new perf-changelog.yaml entry at line 3179: bumped from .v0.5.12-...-20260517 has a stray leading period before the v. The actual previous image tag was v0.5.12-rocm720-mi35x-20260517 (no leading dot). One-character fix — drop the period.
Extended reasoning...
What the bug is
In `perf-changelog.yaml` line 3179 the new changelog entry reads:

    - "Image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 (bumped from .v0.5.12-...-20260517 to unlock the PD-disagg MoRI overlay; matches chun-chang/sglang-disagg-qwen3.5)"

The substring `bumped from .v0.5.12-...-20260517` carries a stray leading period before the `v`. The actual previous image tag, as visible in this same PR's `.github/configs/amd-master.yaml` diff, was `v0.5.12-rocm720-mi35x-20260517` (no leading dot).

Why this is a real but minor issue

Step-by-step proof:

- Look at `.github/configs/amd-master.yaml` in this PR: the `glm5-fp8-mi355x-sglang-disagg:` entry changes from `image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517` to `image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523`.
- The previous tag string is `v0.5.12-rocm720-mi35x-20260517`, with no leading period.
- The changelog entry abbreviates the previous tag as `.v0.5.12-...-20260517` with a leading dot, which doesn't match.
The dot is almost certainly a stray keystroke (perhaps a leftover from formatting around the ellipsis ...) rather than a meaningful version segment. Both sibling entries elsewhere in the file consistently write the unbumped tag without a leading dot.
Impact
Zero runtime impact — this is a free-text human-readable changelog string parsed only by changelog-rendering tooling (or human readers). No code reads this field for version comparison. Severity is purely cosmetic.
Fix
Replace .v0.5.12-...-20260517 with v0.5.12-...-20260517 on line 3179 of perf-changelog.yaml (one-character deletion). All three independent verifiers reached the same conclusion ("nit") with no refutations.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26550752378
Merged commit ca5efd9 into chang/glm5-fp8-mi355x-sglang-disagg-pr-2.
…1572)

* Add GLM-5 FP8 MI355X SGLang disaggregated benchmark (PR-2).

  Introduce glm5-fp8-mi355x-sglang-disagg CI config, model server flags, launch script, setup_deps.sh image patches, and GLM-5 env tuning for MoRI PD disaggregation on MI355X.

  Co-authored-by: Cursor <cursoragent@cursor.com>
  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* Update benchmarks/multi_node/amd_utils/setup_deps.sh

  Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* fix: add FRAMEWORK to check_env_vars, fix NODELIST variable name

  - Add missing FRAMEWORK to check_env_vars list to match sister sglang-disagg scripts (dsr1_fp8, dsr1_fp4)
  - Rename NODE_LIST to NODELIST (quoted) to match the convention used by kimik2.5/minimaxm2.5 vllm-disagg sisters

  Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* [Klaud Cold] GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash (#1578)

  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

---------

Co-authored-by: cliu1004@amd.com <cliu1004@amd.com@mia1-p01-g18.mia.tensorwave.lan>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>


Summary
patch authored by @chunfangamd (credit to chunfang!!)
Stacked on top of #1572. Ports the validated MoRI conn.py overlay from Chun Fang's working branch (`chun-chang/sglang-disagg-qwen3.5`, commits `48e459bd` + `c4e397de`) to fix PR #1572's CI startup crash on `glm5-fp8-mi355x-sglang-disagg`.

Why

PR #1572's latest CI run crashed at PD-disagg startup:

    File ".../disaggregation/mori/conn.py", line 1424, in <genexpr>
      struct.pack("I", item_len)
    struct.error: required argument is not an integer

Root cause: `KVArgs.state_item_lens` is `List[List[int]]` for any model with `state_types: List[StateType]` (Qwen3.5-MoE, GLM-5 NSA, etc.), but `mori/conn.py:_register_kv_args` iterates and packs each element with naked `struct.pack("I", item_len)`, expecting the legacy `List[int]`. Two further sites in `send_state` and `_send_swa_dsa_state` have the same legacy-API assumption. Mooncake (`mooncake/conn.py:912`) and NIXL (`nixl/conn.py:1381`) already use the plural API; MoRI never migrated.

This is the exact "MoRI `conn.py` overlay" PR #1572's own description called out as out-of-scope follow-up.

What

- `benchmarks/multi_node/amd_utils/patches/mori_conn.py` (new, 1665 lines): drop-in replacement for sglang v0.5.12.post1's `disaggregation/mori/conn.py` with four conservative patches:
  1. Sender flatten: handle nested `state_item_lens`
  2. `state_type` plural-API fallback: matches Mooncake/NIXL
  3. Consumer normalize: `state_item_lens` once at `send_state` entry
  4. SWA/DSA rank+length normalize before `group_concurrent_contiguous`: fixes GLM-5 DSA single-component `np.diff` broadcast crash
- `benchmarks/multi_node/amd_utils/patches/README.md`: bug analysis, when-to-use table, opt-out knob docs
- `benchmarks/multi_node/amd_utils/job.slurm` (+25): auto-bind-mount the overlay when `DOCKER_IMAGE_NAME` contains `v0.5.12.post1`; opt-out via `MORI_CONN_PATCH=skip`; appends to `${EXTRA_DOCKER_MOUNTS:-}` so other mounts still compose
- `.github/configs/amd-master.yaml` (+1/-1): image bump `v0.5.12-...-20260517` → `v0.5.12.post1-...-20260523` to unlock the auto-apply gate
- `perf-changelog.yaml` (+2/-1): document the image bump and overlay rationale

Validated on chun-chang at the same image tag

- `glm5-fp8-mi355x-sglang-disagg` GSM8K strict-match = 0.9712 ± 0.0046
- `glm5-fp8-mi355x-sglang-disagg` GSM8K flexible-extract = 0.9704 ± 0.0047

Matches/exceeds upstream sgl-project/sglang#22665's reported 0.970 baseline.
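The `state_type` plural-API fallback named among the four patches can be pictured as follows. This is a sketch under stated assumptions, not the overlay's code: `resolve_state_type` is a hypothetical helper, the attribute names are taken from the PR text, the precedence (legacy singular field first, then the first plural entry) is an assumption, and the state-type strings are placeholders.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_state_type(kv_args: Any) -> Optional[Any]:
    """Prefer the legacy singular field; otherwise fall back to the first plural entry."""
    legacy = getattr(kv_args, "state_type", None)
    if legacy is not None:
        return legacy
    plural = getattr(kv_args, "state_types", None) or []
    return plural[0] if plural else None


# Placeholder objects standing in for KVArgs; the string values are illustrative only.
legacy_args = SimpleNamespace(state_type="swa")
plural_args = SimpleNamespace(state_types=["dsa", "swa"])
empty_args = SimpleNamespace()

print(resolve_state_type(legacy_args))
print(resolve_state_type(plural_args))
print(resolve_state_type(empty_args))
```

Reading both attributes through `getattr` with a default keeps the helper usable whether the object exposes the legacy field, the plural field, or neither.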
Why a stacked PR
Keeps #1572's review surface focused on the GLM-5 recipe (config, models.yaml, env tuning, GLM-5-specific setup_deps patches). This PR layers the upstream-bug workaround on top so it can be reviewed/merged independently and dropped cleanly when the upstream sglang fix lands.
Test plan
- `glm5-fp8-mi355x-sglang-disagg` (1k1k + 8k1k smoke) clears the startup crash
- `[job.slurm] auto-applied MoRI conn.py overlay` log line appears on `v0.5.12.post1` runs
- `v0.5.9-mori-0402` image (no log line; other disagg recipes unaffected)

Follow-up

Stop-gap until sglang migrates MoRI to the plural `state_types` API. Tracks sgl-project/sglang#21886 and #22665. Drop this overlay when a published `lmsysorg/sglang-rocm:*` image carries the upstream fix.

🤖 Generated with Claude Code
Note
Medium Risk
Large vendored Python overlay on the PD/KV transfer path and a new image tag for one CI recipe; scope is gated to `v0.5.12.post1` images with opt-out, but wrong mount or patch drift could affect disagg correctness.

Overview

Unblocks GLM-5-FP8 MI355X SGLang PD-disagg CI by working around broken MoRI `conn.py` in `lmsysorg/sglang-rocm:v0.5.12.post1` (hybrid models with plural `state_types` crash at startup with `struct.error`).

The PR adds an in-tree `mori_conn.py` overlay (four targeted fixes: nested `state_item_lens` flatten, `state_types[0]` fallback, `send_state` normalization, DSA index ravel) plus `patches/README.md`. `job.slurm` auto bind-mounts it when `DOCKER_IMAGE_NAME` contains `v0.5.12.post1` (opt-out `MORI_CONN_PATCH=skip`), forwards `${EXTRA_DOCKER_MOUNTS:-}` into `docker run`, and dedupes mounts. `amd-master.yaml` bumps `glm5-fp8-mi355x-sglang-disagg` to `v0.5.12.post1-...-20260523`; `perf-changelog.yaml` documents the image + overlay.

Reviewed by Cursor Bugbot for commit aba002c. Bugbot is set up for automated code reviews on this repo.