Skip to content

[AMD] vLLM Kimi MXFP4 & MiniMax M2.5 FP8 disaggregated prefill-decode for MI355X#1569

Merged
functionstackx merged 85 commits into
mainfrom
amd/vllm_disagg_mvp_dev2
May 27, 2026
Merged

[AMD] vLLM Kimi MXFP4 & MiniMax M2.5 FP8 disaggregated prefill-decode for MI355X#1569
functionstackx merged 85 commits into
mainfrom
amd/vllm_disagg_mvp_dev2

Conversation

@ichbinblau
Copy link
Copy Markdown
Collaborator

@ichbinblau ichbinblau commented May 27, 2026

Summary

  • Add vLLM disaggregated prefill-decode (PD) benchmark recipes for MI355X with MoRI-IO KV transfer
  • Add Kimi-K2.5-MXFP4 and MiniMax M2.5 disagg inference configs (1P2D)
  • Consolidate amd_utils to support both sglang and vllm disagg engines
  • Multiple bug fixes: KV cache leak, Slurm job termination, docker privilege detection, MoRI-IO READ mode

Patches

  • patch_moriio_save_kv_timeout
  • patch_moriio_transfer_timeout
  • patch_moriio_load_kv_timeout
  • patch_scheduler_read_mode_fix
  • patch_prefill_idle_kv_reaper

from @simondanielsson link

FYI can be removed once https://github.com/vllm-project/vllm/pull/40344 merged and new nightly is available

Test plan

  • Verify CI passes on amd-master config with vllm-disagg entries
  • Validate multi-node sglang benchmarks still work (no regressions from amd_utils refactor)
  • Run vllm-disagg multi-node benchmark on MI355X cluster
  • Confirm Kimi-K2.5 and MiniMax M2.5 disagg recipes launch correctly

Co-Authored-By: @chunfangamd
Co-Authored-By: @simondanielsson
Co-Authored-By: @billishyahao
Co-Authored-By: @ChuanLi1101


Note

Medium Risk
Large CI/Slurm/Docker and runtime patches to vLLM MoRI-IO affect multi-node benchmark stability; SGLang path is mostly moved unchanged but shared scripts (env, job, bench) could regress existing disagg runs.

Overview
Adds vLLM disaggregated prefill–decode on MI355X alongside existing SGLang disagg: new amd-master recipes for Kimi-K2.5-MXFP4 and MiniMax-M2.5 (1P2D, MoRI-IO), plus multi-node entry scripts and perf-changelog entries.

amd_utils is refactored into a dual-engine path (ENGINE=sglang-disagg vs vllm-disagg): thin server.sh dispatcher, SGLang logic moved to server_sglang.sh, new server_vllm.sh, models_vllm.yaml, UCX/RDMA setup in env.sh, external vllm-router on rank 0 in job.slurm, HF cache model resolution, and per-node Docker privilege detection. submit.sh gains reuse of an existing Slurm allocation, node exclude list, and clearer CLI.

Benchmark plumbing: bench.sh / benchmark_lib.sh support vLLM (--tokenizer, served model name, infinite req rate, cooldown); result collection uses {framework}_isl_* paths. setup_deps.sh installs/patches vLLM MoRI-IO (timeouts, scheduler READ-mode, idle KV reaper) for stable high-concurrency runs.

launch_mi355x-amds.sh treats vllm-disagg like sglang-disagg for multi-node scripts and improves CI log capture on failure.

Reviewed by Cursor Bugbot for commit b1ae781. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread benchmarks/multi_node/amd_utils/job.slurm
Comment thread benchmarks/multi_node/amd_utils/job.slurm
Comment thread benchmarks/multi_node/amd_utils/job.slurm
Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh Outdated
Comment thread benchmarks/multi_node/amd_utils/env.sh
SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"
if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then
EXCLUDE_OPT=(--exclude "$SLURM_EXCLUDE_NODES")
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded default SLURM exclude nodes affects all clusters

Medium Severity

SLURM_EXCLUDE_NODES is defaulted to mia1-p01-g11,mia1-p01-g12,mia1-p01-g15 — cluster-specific hostnames that become a non-empty string for all users. Since the subsequent if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]] check is always true, every sbatch submission on every cluster unconditionally passes --exclude with these mia1-specific nodes. The default for an exclude list intended to be optional needs to be empty.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit d2a212e. Configure here.

if [[ "$rdma_ip" =~ ^192\.168\.([0-9]+)\.([0-9]+)$ ]]; then
local rdma_subnet="${BASH_REMATCH[1]}"
local rdma_host="${BASH_REMATCH[2]}"
local rdma_gw="192.168.${rdma_subnet}.$(( rdma_host | 1 ))"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The peer/gateway IP for a /31 point-to-point link at server_vllm.sh:74 is computed with bitwise OR (rdma_host | 1), but a /31 has two hosts that differ only in bit 0, so the peer is obtained by XOR (rdma_host ^ 1). For any node whose RDMA host octet is odd (1, 3, 5, ...), rdma_gw resolves to the node's own IP; the kernel rejects ip route replace ... via <self>, the error is swallowed by 2>/dev/null, and inter-node RoCEv2 traffic on that node silently falls through to the management interface — defeating the whole purpose of this route fix. Fix: change | to ^.

Extended reasoning...

What the bug is

In benchmarks/multi_node/amd_utils/server_vllm.sh, the setup_rdma_env helper (lines 67–82) adds a route to the wider 192.168.X.0/24 RDMA fabric via the TOR-switch peer on each node's /31 point-to-point link. The peer IP is computed at line 74:

local rdma_gw="192.168.${rdma_subnet}.$(( rdma_host | 1 ))"

A /31 subnet has exactly two host addresses that differ only in bit 0 (e.g. .0/.1, .2/.3, .4/.5). The peer of host X is therefore X ^ 1 (flip bit 0), not X | 1.

Step-by-step proof

host octet host | 1 host ^ 1 Correct peer
0 1 ✅ 1 ✅ 1
1 1 ❌ (self!) 0 ✅ 0
2 3 ✅ 3 ✅ 3
3 3 ❌ (self!) 2 ✅ 2
4 5 ✅ 5 ✅ 5
5 5 ❌ (self!) 4 ✅ 4

For any node whose RDMA host octet is odd, rdma_gw evaluates to the node's own IP.

What happens at runtime

The subsequent command on line 78 is:

ip route replace "192.168.${rdma_subnet}.0/24" via "$rdma_gw" dev "$rdma_iface" 2>/dev/null && \
    echo "[RDMA-ROUTE] Added ..." || \
    echo "[RDMA-ROUTE] Route add failed ..."

The Linux kernel rejects ip route ... via <local-address> on a non-onlink route (RTNETLINK answers: Invalid argument). Because stderr is redirected to /dev/null and the failure is only logged as a single "Route add failed" line on stdout, the error is effectively silent — the script proceeds and the prefill/decode workers start as if RDMA were configured.

Impact

Without the fabric route, the source-routing lookup for another node's 192.168.x.y RDMA IP falls through to the default route — the management interface, not the RDMA NIC. Inter-node KV transfer for vLLM disaggregated prefill/decode then traverses the management network instead of the lossless RoCEv2 fabric the workaround was meant to enable. Roughly half of nodes in a /31-addressed fabric (those with odd host octets) will be affected, depending on the cluster's Pensando ionic IP-assignment convention.

This function is called unconditionally from server_vllm.sh:91 on every vllm-disagg run, so any node with a 192.168.x.x RDMA IP exercises this path. The bug is latent in a new file, so there are no callers in main today, but the CI workflows added in this PR (kimik2.5_fp4_mi355x_vllm-disagg.sh, minimaxm2.5_fp8_mi355x_vllm-disagg.sh) will exercise it.

Why existing code does not prevent it

Nothing in the helper validates that rdma_gw != rdma_ip, and the 2>/dev/null masks the kernel's rejection. The fallback echo "[RDMA-ROUTE] Route add failed for ..." is the only visible signal in the per-node log — easy to miss in a multi-page slurm_job-*.out, and produces a "puzzling perf degradation" mode rather than a clean failure.

Fix

One-character change:

local rdma_gw="192.168.${rdma_subnet}.$(( rdma_host ^ 1 ))"

With XOR, the math is correct for both halves of every /31:
0^1=1, 1^1=0, 2^1=3, 3^1=2, 4^1=5, 5^1=4, ... — always the peer.

Comment on lines +195 to +196
NUM_NODES=$((xP + yD))
echo "NUM_NODES: $NUM_NODES (xP=$xP + yD=$yD)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 job.slurm:195 now computes NUM_NODES=$((xP + yD)) using worker counts, but submit.sh allocates PREFILL_NODES + DECODE_NODES via sbatch -N. For multi-node-per-worker decode configs the third node is trimmed away, breaking the existing dsr1-fp8-mi355x-sglang-disagg "Top of curve" scenario (1 decode worker @ DEP16 across 2 nodes) and the analogous fp4/mtp variants. Restore NUM_NODES=${NUM_NODES} (already exported by submit.sh) or recompute it from PREFILL_NODES_PER_WORKER*xP + DECODE_NODES_PER_WORKER*yD.

Extended reasoning...

Bug

benchmarks/multi_node/amd_utils/job.slurm:195 was changed from NUM_NODES=${NUM_NODES} (inherited from submit.sh) to NUM_NODES=$((xP + yD)). This is incorrect for any disagg config where a single worker spans more than one node: xP/yD count workers, while PREFILL_NODES/DECODE_NODES (and the original NUM_NODES exported by submit.sh) count nodes.

Trigger path

submit.sh builds the allocation and the worker exports as follows:

  • L93: NUM_NODES=$((PREFILL_NODES + DECODE_NODES)) and L118 exports it
  • L108-109: xP=$PREFILL_WORKERS, yD=$DECODE_WORKERS (comment: "may span multiple nodes")
  • sbatch is called with -N $NUM_NODES

Inside the slurm script, the new line 195 throws away the exported value and uses workers instead. SELECTED_NODES = head -n NUM_NODES then trims the allocated nodelist (lines 197-199), SLURM_NNODES is reset (line 207), and every downstream srun only fans out to the truncated subset.

Concrete failing scenario (unchanged by this PR, but consumes the refactored amd_utils)

.github/configs/amd-master.yaml still contains dsr1-fp8-mi355x-sglang-disagg "Top of curve" with prefill num-worker=1 + PREFILL_NODES=1 and decode num-worker=1 + DECODE_NODES=2. The runner script invokes submit.sh with:

PREFILL_NODES=1, PREFILL_WORKERS=1, DECODE_NODES=2, DECODE_WORKERS=1

Step-by-step proof:

  1. submit.sh: NUM_NODES = 1 + 2 = 3. sbatch allocates 3 nodes. xP=1, yD=1. DECODE_TP_SIZE = DECODE_NODES * DECODE_TP / DECODE_WORKERS = 2*8/1 = 16.
  2. job.slurm:195 (this PR): overrides NUM_NODES = xP + yD = 1 + 1 = 2.
  3. SELECTED_NODES = head -n 2 of the 3-node allocation. SLURM_NNODES = 2. The third (decode) node is silently dropped.
  4. server_sglang.sh:232-233: DECODE_NODES_PER_WORKER = (16+7)/8 = 2. The decode worker is multi-node (DEP16 across 2 nodes).
  5. server_sglang.sh role assignment: rank 0 = prefill head, rank 1 = decode head. Rank 2 (the decode follower needed for the multi-node decode worker) is never spawned because srun only runs on 2 ranks. The decode head hangs at --dist-init-addr waiting for its peer.

Impact

  • Pre-existing scenarios broken (consume the refactored amd_utils unchanged): dsr1-fp8-mi355x-sglang-disagg (Top of curve), dsr1-fp8-mi355x-sglang-disagg-mtp (Top of curve), and dsr1-fp4-mi355x-sglang-disagg(-mtp) 1P1D large-TP variants in amd-master.yaml. These are production sweeps that ship in the same yaml.
  • New configs unaffected by chance: kimik2.5-fp4-mi355x-vllm-disagg and minimaxm2.5-fp8-mi355x-vllm-disagg happen to use 1-node-per-worker layouts (xP+yD == PREFILL_NODES+DECODE_NODES), so they don't trip the bug. That's why this regression slipped through the PR's own test plan.
  • The failure mode is silent: sbatch succeeds and 3 nodes are allocated, but only 2 ever run. The decode head hangs on --dist-init-addr until SGLANG_DISAGGREGATION_WAITING_TIMEOUT (3600s) fires.

Fix

Drop the new computation and revert to using the value submit.sh already exports:

NUM_NODES=${NUM_NODES}

Or recompute correctly from per-worker node counts:

PREFILL_NODES_PER_WORKER=$(((PREFILL_TP_SIZE + GPUS_PER_NODE - 1) / GPUS_PER_NODE))
DECODE_NODES_PER_WORKER=$(((DECODE_TP_SIZE  + GPUS_PER_NODE - 1) / GPUS_PER_NODE))
NUM_NODES=$((PREFILL_NODES_PER_WORKER * xP + DECODE_NODES_PER_WORKER * yD))

Either approach restores the prior behavior for the multi-node-per-worker decode scenarios while keeping the new vllm-disagg recipes unaffected.

Comment on lines +158 to +163
# Set SLURM_EXCLUDE_NODES env var to a comma-separated list of hostnames.
EXCLUDE_OPT=()
SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"
if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then
EXCLUDE_OPT=(--exclude "$SLURM_EXCLUDE_NODES")
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 submit.sh frames itself as a cluster-portable template (line 4: "must be configured for your specific cluster before use") but hardcodes a mia1-only default for SLURM_EXCLUDE_NODES (mia1-p01-g11,mia1-p01-g12,mia1-p01-g15) at line 159. The default leaks into every cluster's sbatch invocation, and the ${VAR:-default} syntax means SLURM_EXCLUDE_NODES='' cannot opt out — only unset works. Fix is either ${VAR-default} (no colon, so empty string disables) or move the default into runners/launch_mi355x-amds.sh alongside the other mia1-specific knobs (IBDEVICES, MORI_RDMA_TC, MODEL_DIR).

Extended reasoning...

What the bug is

At benchmarks/multi_node/amd_utils/submit.sh:158-163:

EXCLUDE_OPT=()
SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"
if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then
    EXCLUDE_OPT=(--exclude "$SLURM_EXCLUDE_NODES")
fi

Two distinct issues:

  1. Cluster-specific default in a portable template. submit.sh's docstring (lines 4-5) declares it a "Cluster Configuration Template ... must be configured for your specific cluster before use." But the default exclude list contains literal mia1 hostnames. The other files in amd_utils/ (env.sh, job.slurm) explicitly support GPU*, smci355-ccs-aus*, and mia1* hostname patterns — so the rest of the directory is genuinely cluster-portable, and this submit.sh default is inconsistent with that.

  2. No way to disable via empty string. ${VAR:-default} (with the colon) substitutes the default both when VAR is unset and when VAR is empty. So a CI caller doing SLURM_EXCLUDE_NODES='' bash submit.sh ... cannot disable the option — they have to unset SLURM_EXCLUDE_NODES instead.

Why existing code doesn't prevent it

This is new code in this PR. The verifiers confirmed the default literal is hardcoded and that no other code path overrides it before the sbatch invocation. The other cluster-specific knobs (IBDEVICES, MORI_RDMA_TC, MODEL_DIR) are correctly scoped — they're set inside runners/launch_mi355x-amds.sh (line 30: IBDEVICES="rdma0,rdma1,...rdma7", line 31: MORI_RDMA_TC=104) — so the runner-based pattern is the existing convention this default breaks.

Step-by-step proof of the empty-string issue

$ SLURM_EXCLUDE_NODES='' bash -c 'X="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"; echo "X=$X"'
X=mia1-p01-g11,mia1-p01-g12,mia1-p01-g15

The empty string was re-substituted to the default. With the colon-less form:

$ SLURM_EXCLUDE_NODES='' bash -c 'X="${SLURM_EXCLUDE_NODES-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"; echo "X=$X"'
X=

Which would then fail the [[ -n "$SLURM_EXCLUDE_NODES" ]] check and skip --exclude entirely — the intended opt-out behavior.

Impact

Low today, real tomorrow. The verifier-refutation correctly notes that the current production pipeline is mia1-locked (runners/launch_mi355x-amds.sh hardcodes mia1-specific IBDEVICES, MORI_RDMA_TC=104, and MODEL_PATH=/it-share/data), so this default actually matches the current target cluster. The footgun fires the moment anyone tries the smci355 or GPU* runner paths that the rest of amd_utils/ already supports — sbatch on those clusters either rejects unknown node names with "Invalid node name specified" (strict slurmctld) or silently no-ops the --exclude (permissive).

Fix

Either of two two-character/one-line changes:

  • Drop the colon: SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}" — empty string now disables --exclude.
  • Move the default into runners/launch_mi355x-amds.sh next to the other mia1-specific knobs, so the runner owns the cluster-scoped value and submit.sh truly is portable.

Severity rationale

Filed at nit: the current production path is mia1 (verified by the runner's mia1-specific IBDEVICES/MODEL_PATH/MORI_RDMA_TC), so this is harmless today. Worth flagging on the review because the file is being touched in this PR, the docstring frames the script as portable, and the fix is two characters.

Comment on lines +22 to 32
DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',')
if [[ -n "$DETECTED" ]]; then
export IBDEVICES="$DETECTED"
else
echo "ERROR: Unable to detect cluster from hostname $NODENAME and IBDEVICES not set" >&2
exit 1
echo "WARNING: Unable to detect RDMA devices. Set IBDEVICES explicitly." >&2
fi
echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES from hostname $NODENAME"
echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES from hostname $(hostname -s)"
else
echo "[INFO] Using IBDEVICES=$IBDEVICES (set by runner or environment)"
fi
export IBDEVICES
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The [INFO] Auto-detected IBDEVICES=… line at env.sh:28 sits outside the inner success/failure branch, so on the failure path it prints [INFO] Auto-detected IBDEVICES= from hostname X immediately after the WARNING that just said detection failed — directly contradicting it. The from hostname phrasing is also stale (detection now uses ibv_devinfo, not hostname). Fix: move the echo inside the if [[ -n "$DETECTED" ]]; then branch. Optional follow-up: consider exit 1 on the empty-detect path (the previous version did) so downstream consumers (NCCL_IB_HCA, --disaggregation-ib-device, UCX_NET_DEVICES fallback) fail with an actionable error instead of propagating empty strings.

Extended reasoning...

What goes wrong

if [[ -z "$IBDEVICES" ]]; then
    DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',')
    if [[ -n "$DETECTED" ]]; then
        export IBDEVICES="$DETECTED"
    else
        echo "WARNING: Unable to detect RDMA devices. Set IBDEVICES explicitly." >&2
    fi
    echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES from hostname $(hostname -s)"   # ← line 28, OUTSIDE the if/else
fi

The [INFO] echo at line 28 is at the same indent level as the inner if [[ -n "$DETECTED" ]]; then … else … fi, not nested inside the success branch. When ibv_devinfo returns nothing, DETECTED is empty, the else branch emits the WARNING to stderr, and then the unconditional [INFO] line still fires on stdout — claiming auto-detection succeeded with an empty value.

Step-by-step proof (direct manual run, no runner pre-set):

  1. Operator runs bash benchmarks/multi_node/amd_utils/server.sh on a node where ibv_devinfo isn't installed or returns no HCAs. IBDEVICES is unset → outer if enters the detection branch.
  2. DETECTED=$(ibv_devinfo 2>/dev/null | …) → empty string.
  3. [[ -n "$DETECTED" ]] is false → else runs:
    WARNING: Unable to detect RDMA devices. Set IBDEVICES explicitly. → stderr.
  4. Control falls through to line 28 (no fi between the inner else and this echo):
    [INFO] Auto-detected IBDEVICES= from hostname mia1-p01-g03 → stdout.

Operator now sees a WARNING immediately followed by an INFO claiming the auto-detect succeeded with empty value. The two messages contradict each other, and from hostname … is doubly misleading — detection no longer uses hostname (it uses ibv_devinfo).

Downstream impact of the silent continuation

Once IBDEVICES is empty, the values propagate without further validation:

  • Line 45: NCCL_IB_HCA=${NCCL_IB_HCA:-$IBDEVICES}. Because IBDEVICES is exported as an empty string (not unset), the :- default doesn't fire — NCCL_IB_HCA ends up empty.
  • SGLang server.sh later passes --disaggregation-ib-device ${IBDEVICES} → empty.
  • vLLM env.sh's UCX fallback at lines 59–66: FIRST_IB=$(echo "$IBDEVICES" | cut -d',' -f1) → empty → the ${FIRST_IB}:1 fallback becomes ":1" (invalid UCX device spec).

The previous version of this script (sglang path on origin/main) had exit 1 for an unrecognized hostname, so the failure was loud and immediately actionable. The new path makes it a silent WARN and lets all of the above propagate.

Why existing CI hides this

runners/launch_mi355x-amds.sh explicitly sets IBDEVICES=rdma0,rdma1,…,rdma7 and MORI_RDMA_TC=104 before the SLURM job runs, so the outer [[ -z "$IBDEVICES" ]] check is false in CI and the buggy block is never entered. Anyone debugging RDMA setup, running directly, or hitting a runner that doesn't pre-set IBDEVICES lands on this path.

Fix

Move the INFO line inside the success branch (and optionally restore the fail-fast exit):

if [[ -z "$IBDEVICES" ]]; then
    DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',')
    if [[ -n "$DETECTED" ]]; then
        export IBDEVICES="$DETECTED"
        echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES via ibv_devinfo on $(hostname -s)"
    else
        echo "ERROR: Unable to detect RDMA devices via ibv_devinfo. Set IBDEVICES explicitly." >&2
        exit 1
    fi
fi

The log fix is trivial and purely cosmetic; the optional exit 1 is a small behavioral regression to restore but high-value for operators outside the production CI path.

fi

exec sudo docker run --rm \
\$MAYBE_EXEC \$DOCKER_CMD run \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker --rm flag dropped, containers may leak

Low Severity

The --rm flag was removed from the docker run command. While explicit cleanup exists in the KEEP_CONTAINERS block, if the post-run srun cleanup fails (e.g., due to SLURM timeout or node unreachability), containers will persist on worker nodes. The old code used --rm to guarantee automatic cleanup regardless of external failures.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8b801d4. Configure here.

@@ -0,0 +1,654 @@
#!/bin/bash
# =============================================================================
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need all deps here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the first three and last patches have not been upstreamed. Scheduler read mode patch could be removed when @simondanielsson 's patch merged.
patch_moriio_save_kv_timeout
patch_moriio_transfer_timeout
patch_moriio_load_kv_timeout
patch_scheduler_read_mode_fix
patch_prefill_idle_kv_reaper

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI can be removed once vllm-project/vllm#40344 merged and new nightly is available

Comment thread perf-changelog.yaml
Comment thread benchmarks/multi_node/amd_utils/server_vllm.sh
Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh
@functionstackx functionstackx changed the title [AMD] vLLM disaggregated prefill-decode for MI355X [AMD] vLLM Kimi MXFP4 & MiniMax M2.5 disaggregated prefill-decode for MI355X May 27, 2026
@functionstackx functionstackx changed the title [AMD] vLLM Kimi MXFP4 & MiniMax M2.5 disaggregated prefill-decode for MI355X [AMD] vLLM Kimi MXFP4 & MiniMax M2.5 FP8 disaggregated prefill-decode for MI355X May 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

chunfangamd and others added 12 commits May 27, 2026 12:14
Add multi-node vLLM PD disaggregation recipe using Nixl/RIXL KV transfer
and vllm-router, mirroring the existing SGLang disagg recipe structure.

- New benchmark config: dsr1-fp8-mi355x-vllm-disagg (1P2D, TP8)
- New utils: vllm_disagg_utils/ (job.slurm, server.sh, submit.sh, etc.)
- Runner: extend launch_mi355x-amds.sh for vllm-disagg framework
Extract hardcoded model configurations from server.sh bash maps and
job.slurm VALID_MODELS into a declarative models.yaml, mirroring the
SGLang disagg recipe pattern. Adding a new model now requires no script
changes.

Also:
- Consolidate UCX transport vars in job.slurm Docker env; remove
  duplicated setup_ucx_env() from server.sh
- Extract RDMA workarounds (ionic /31 route fix, Nixl UCX patch) into
  setup_rdma_env() helper
- Lower UCX_LOG_LEVEL from info to warn
- Add nicctl mount and QoS/DSCP auto-detection to env.sh
- Remove stale host libionic bind-mounts (driver now built into image)
Adapt server.sh to vLLM v0.17.1 breaking changes:
- Use simplified kv-transfer-config (side channel via env vars instead
  of kv_ip/kv_port, add kv_load_failure_policy)
- Remove deprecated --disable-log-requests (disabled by default in v0.17)
- Route NIXL side channel through RDMA IP for correct fabric path
- Fix RIXL ucx_error_handling_mode patch for updated _api.py layout
bench.sh: replace `vllm bench serve` (log-only output) with the shared
run_benchmark_serving helper from benchmark_lib.sh, matching the SGLang
disagg pattern. This produces the .json result files that the multinode
CI workflow expects (benchmark-multinode-tmpl.yml → process_result.py).

server.sh: make the Nixl ucx_error_handling_mode=none runtime patch
conditional on Pensando ionic RDMA devices (IBDEVICES=*ionic*). On the
mia1 cluster (ConnectX/mlx5, IBDEVICES=rdma*), UCX handles error mode
natively and the patch is skipped.

Model-path resolution and IBDEVICES/UCX/QoS auto-detection were verified
to already work on mia1 — no changes needed.

Tested locally (Job 2802, 1P+2D, ISL/OSL=1024):
  conc  8 →  507 tok/s   conc 32 → 1778 tok/s
  conc 16 → 1004 tok/s   conc 64 → 2480 tok/s
All four .json result files produced; 100% external prefix cache hit rate.
Move the vllm-router from a dedicated proxy node onto the first prefill
node, mirroring SGLang's co-location pattern. This reduces the node count
from xP + yD + 1 to xP + yD (e.g., 3 nodes instead of 4 for 1P+2D).

- server.sh: NODE_RANK=0 now runs both vllm serve (prefill, port 2584)
  and vllm-router (port 30000); barrier waits on all nodes
- submit.sh / job.slurm: NUM_NODES = PREFILL_NODES + DECODE_NODES
- bench.sh: ROUTER_PORT default updated to 30000

Local 1P+2D benchmark (ISL/OSL=1024, DeepSeek-R1 FP8, MI355X):
  - Throughput: +1.6% to +8.4% across concurrency 8-64
  - Mean TTFT: -22% to -63% (prefill is local to router)
  - TPOT/ITL: unchanged (within noise)
  - 25% fewer nodes, no performance regression
Replace the custom Docker image (vllm_disagg_pd:latest) with the public
vllm/vllm-openai-rocm:v0.17.1 base image. Missing components (UCX, RIXL,
etcd, libionic1, vllm-router) are now installed at container start via
setup_deps.sh, which is sourced by server.sh.

This eliminates the need to build, host, and maintain a custom image —
CI nodes can pull directly from Docker Hub.

Changes:
- Add setup_deps.sh: idempotent installer for UCX (ROCm fork), RIXL,
  etcd, libionic1 (Pensando ionic), and vllm-router (NODE_RANK=0 only).
  Build steps run in subshells to avoid CWD pollution.
- server.sh: source setup_deps.sh before any other logic
- job.slurm: add --entrypoint "" to override the base image's vllm CLI
  entrypoint, allowing bash -lc to work correctly
- env.sh: update comment (paths now set by setup_deps.sh, not image ENV)
- amd-master.yaml: image changed to vllm/vllm-openai-rocm:v0.17.1

Tested locally (Job 2807, 3 nodes, ISL/OSL=1024):
  Setup overhead: ~2.5 min per node (all components built from source)
  Benchmark completed successfully across concurrency 8/16/32/64
…ecode

Enable MoRI-based Expert Parallelism (--enable-expert-parallel
--all2all-backend mori) on decode workers for DeepSeek-R1-0528,
while keeping TP=8 to preserve KV cache transfer compatibility
with the prefill node via NixlConnector. This matches SGLang's
approach of TP=8 + EP within the TP group.

KV Transfer: RIXL/NixlConnector (unchanged)
MoE All-to-All: NCCL (default) -> MoRI-EP (--all2all-backend mori)

Changes:
- models.yaml: Add --enable-expert-parallel --all2all-backend mori
  to decode_flags; increase engine ready timeout to 1200s
- setup_deps.sh: Add MoRI install and vLLM v0.17.1 patches for
  MoRI-EP + FP8 compatibility (AITER assertion, defer_input_quant)
- server.sh: Support decode_env from models.yaml for decode-specific
  environment overrides
- dsr1_fp8_mi355x_vllm-disagg.sh: Pass NODELIST to submit.sh for
  Slurm node constraints
…roxy

Replace NixlConnector with MoRIIOConnector for KV cache transfer and
replace the Rust-based vllm-router with a MoRI-IO-aware Python proxy
that handles both HTTP routing and ZMQ-based RDMA endpoint discovery.

The key architectural change is that the proxy enriches each request's
kv_transfer_params with remote RDMA endpoint info (handshake_port,
notify_port, host, port) before dispatching, enabling concurrent
prefill+decode in WRITE mode — something vllm-router could not do
because it only understands HTTP, not the MoRI-IO registration protocol.

Changes:
- Add moriio_proxy.py: MoRI-IO-aware proxy with ZMQ service discovery,
  request enrichment, and /health endpoint (adapted from vLLM upstream
  moriio_toy_proxy_server.py)
- server.sh: switch --kv-transfer-config from NixlConnector to
  MoRIIOConnector with kv_connector_extra_config (proxy_ip,
  proxy_ping_port, http_port); launch proxy before prefill on NODE_RANK=0;
  set VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1 as workaround for v0.17.1
  completion-ID mismatch (upstream fix: vllm-project/vllm#34907)
- setup_deps.sh: replace vllm-router/Rust install with lightweight
  Python deps (quart, aiohttp, msgpack, pyzmq) for the proxy

Benchmark (Job 2853 vs 2818 NixlConnector baseline, ISL/OSL=1024):
  TTFT median:  -37% to -55% across C8–C64 (e.g. 384→241ms @C64)
  TTFT p99:     -63% at C64 (6622→2469ms)
  Throughput:   +8% at C64 (2634→2844 tok/s)
  TPOT:         unchanged (~22ms @C64)
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
… still ran the first barrier; 2. kill and kill run only when DRY_RUN=0

Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Enable READ-mode KV transfer (decode-initiated RDMA reads) with a
critical scheduler assertion fix, and add safety timeouts to prevent
indefinite hangs during RDMA transfers.

Changes:
- setup_deps.sh: Add patches — save_kv_layer/start_load_kv
  handshake timeouts (30s), RDMA transfer timeout (120s), deferred
  write task expiry (60s), write worker error handling, and scheduler
  assertion fix for READ-mode intermediate request states
- moriio_proxy.py: Add stream idle timeout (PROXY_STREAM_IDLE_TIMEOUT)
  to abort stalled decode streams, and proper response.release()
- submit.sh, job.slurm: Plumb PROXY_STREAM_IDLE_TIMEOUT and
  VLLM_MORIIO_CONNECTOR_READ_MODE env vars into Docker containers

Validated: 1k/1k full sweep (C8–C512), 100% success rate at all
concurrency levels, peak 8500 output tok/s at C512.
Port the vLLM disaggregated serving pipeline from the 4N cluster
(Pensando ionic NICs) to the 9N mia1 cluster (mlx5/rdma NICs).

Key changes:
- Fix C512 deadlock: apply ucx_error_handling_mode=none universally
  instead of only for ionic NICs. Under high concurrency, UCX's default
  UCP_ERR_HANDLING_MODE_PEER prevents RIXL RDMA READ retries from
  recovering after ibv_post_send queue exhaustion, causing prefill KV
  cache saturation and pipeline deadlock.
- Force-reinstall MoRI from b645fc8 to fix PCI topology assertion
  failure on nodes with Broadcom PEX890xx PCIe switches.
- Auto-detect Docker privilege (sudo vs non-sudo) for cross-cluster
  portability.
- Add SLURM_EXCLUDE_NODES support to skip nodes with broken Docker
  sockets.
- Increase VLLM_ENGINE_READY_TIMEOUT_S to 3600 to accommodate longer
  setup times (RIXL/MoRI source builds over NFS).
simondanielsson and others added 19 commits May 27, 2026 12:15
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…agg log dirs

Signed-off-by: Theresa Shan <theresa.shan@amd.com>
No external references to this folder exist in the codebase.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Restores import order and failure-rate check block.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Restores _resolve helper and tokenizer fix logic.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Fix IBDEVICES detection log: move info message inside success branch,
  exit 1 on failure instead of silently propagating empty strings
- Add missing SGLANG_USE_AITER=1
- Set SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0 to match upstream

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The refactored server_sglang.sh dropped the per-role COMBINE_DTYPE
mapping that the old server.sh had. SGLang reads SGLANG_MORI_COMBINE_DTYPE
internally, so map it from MORI_COMBINE_DTYPE_PREFILL (fp8_direct_cast)
on prefill and MORI_COMBINE_DTYPE_DECODE (fp8) on decode.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Move VLLM_USE_V1, VLLM_SERVER_DEV_MODE, VLLM_DISABLE_REQUEST_ID_RANDOMIZATION
to env.sh alongside other engine-specific config. Remove commented-out
etcd setup block that is no longer used.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The refactored DOCKER_ENV_COMMON array dropped -e IS_MULTINODE that
the old job.slurm had. Without it, eval metadata tagging inside the
container sees an empty value.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ang.sh

Add BENCH_MAX_CONC_VALUE extraction and the two DP+EP override blocks
that the refactor from server.sh dropped. These adjust max-running-requests,
dispatch tokens, and MOE input tokens when both DP and EP are enabled.
Also add trailing newline for POSIX compliance. server_sglang.sh now
matches upstream server.sh exactly.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit b1ae781. Configure here.

sudo docker ps -aq --filter \"name=^container_sbatch_\" | xargs -r sudo docker rm -f || true
sudo docker ps -aq | xargs -r sudo docker stop || true
\$DOCKER_CMD ps -aq --filter \"$CONT_FILTER\" | xargs -r \$DOCKER_CMD rm -f || true
\$DOCKER_CMD ps -aq | xargs -r \$DOCKER_CMD stop || true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-clean stops all Docker containers on every node

High Severity

The pre-clean step runs \$DOCKER_CMD ps -aq | xargs -r \$DOCKER_CMD stop on every allocated node, which stops all running Docker containers — not just benchmark-related ones. On shared Slurm nodes where other users or services have running containers, this will disrupt unrelated workloads.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b1ae781. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

Comment on lines +633 to +637
patch_moriio_save_kv_timeout
patch_moriio_transfer_timeout
patch_moriio_load_kv_timeout
patch_scheduler_read_mode_fix
patch_prefill_idle_kv_reaper
Copy link
Copy Markdown
Collaborator

@functionstackx functionstackx May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @billishyahao @ichbinblau @simondanielsson

These patches r fine for inital PR but please remove them in follow up PRs as u improve the performance once vllm-project/vllm#40344 hits nightly

Copy link
Copy Markdown
Collaborator

@functionstackx functionstackx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @billishyahao @ichbinblau @simondanielsson @chunfangamd thank you for the inital vLLM disagg enablement in OSS vLLM! 🥳

These patches r fine for inital PR but please remove them in follow up PRs vllm-project/vllm#40344 hits nightly

@functionstackx
Copy link
Copy Markdown
Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit bb00055 into main May 27, 2026
48 checks passed
@functionstackx functionstackx deleted the amd/vllm_disagg_mvp_dev2 branch May 27, 2026 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

5 participants