[AMD] vLLM Kimi MXFP4 & MiniMax M2.5 FP8 disaggregated prefill-decode for MI355X by ichbinblau · Pull Request #1569 · SemiAnalysisAI/InferenceX

ichbinblau · 2026-05-27T06:12:42Z

Summary

Add vLLM disaggregated prefill-decode (PD) benchmark recipes for MI355X with MoRI-IO KV transfer
Add Kimi-K2.5-MXFP4 and MiniMax M2.5 disagg inference configs (1P2D)
Consolidate amd_utils to support both sglang and vllm disagg engines
Multiple bug fixes: KV cache leak, Slurm job termination, docker privilege detection, MoRI-IO READ mode

Patches

patch_moriio_save_kv_timeout
patch_moriio_transfer_timeout
patch_moriio_load_kv_timeout
patch_scheduler_read_mode_fix
patch_prefill_idle_kv_reaper

FYI can be removed once https://github.com/vllm-project/vllm/pull/40344 merged and new nightly is available

Test plan

Verify CI passes on amd-master config with vllm-disagg entries
Validate multi-node sglang benchmarks still work (no regressions from amd_utils refactor)
Run vllm-disagg multi-node benchmark on MI355X cluster
Confirm Kimi-K2.5 and MiniMax M2.5 disagg recipes launch correctly

Co-Authored-By: @chunfangamd
Co-Authored-By: @simondanielsson
Co-Authored-By: @billishyahao
Co-Authored-By: @ChuanLi1101

Note

Medium Risk
Large CI/Slurm/Docker and runtime patches to vLLM MoRI-IO affect multi-node benchmark stability; SGLang path is mostly moved unchanged but shared scripts (env, job, bench) could regress existing disagg runs.

Overview
Adds vLLM disaggregated prefill–decode on MI355X alongside existing SGLang disagg: new amd-master recipes for Kimi-K2.5-MXFP4 and MiniMax-M2.5 (1P2D, MoRI-IO), plus multi-node entry scripts and perf-changelog entries.

amd_utils is refactored into a dual-engine path (ENGINE=sglang-disagg vs vllm-disagg): thin server.sh dispatcher, SGLang logic moved to server_sglang.sh, new server_vllm.sh, models_vllm.yaml, UCX/RDMA setup in env.sh, external vllm-router on rank 0 in job.slurm, HF cache model resolution, and per-node Docker privilege detection. submit.sh gains reuse of an existing Slurm allocation, node exclude list, and clearer CLI.

Benchmark plumbing: bench.sh / benchmark_lib.sh support vLLM (--tokenizer, served model name, infinite req rate, cooldown); result collection uses {framework}_isl_* paths. setup_deps.sh installs/patches vLLM MoRI-IO (timeouts, scheduler READ-mode, idle KV reaper) for stable high-concurrency runs.

launch_mi355x-amds.sh treats vllm-disagg like sglang-disagg for multi-node scripts and improves CI log capture on failure.

^{Reviewed by Cursor Bugbot for commit b1ae781. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor · 2026-05-27T06:23:41Z

+SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"
+if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then
+    EXCLUDE_OPT=(--exclude "$SLURM_EXCLUDE_NODES")
+fi


Hardcoded default SLURM exclude nodes affects all clusters

Medium Severity

SLURM_EXCLUDE_NODES is defaulted to mia1-p01-g11,mia1-p01-g12,mia1-p01-g15 — cluster-specific hostnames that become a non-empty string for all users. Since the subsequent if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]] check is always true, every sbatch submission on every cluster unconditionally passes --exclude with these mia1-specific nodes. The default for an exclude list intended to be optional needs to be empty.

^{Reviewed by Cursor Bugbot for commit d2a212e. Configure here.}

claude · 2026-05-27T06:26:00Z

+    if [[ "$rdma_ip" =~ ^192\.168\.([0-9]+)\.([0-9]+)$ ]]; then
+        local rdma_subnet="${BASH_REMATCH[1]}"
+        local rdma_host="${BASH_REMATCH[2]}"
+        local rdma_gw="192.168.${rdma_subnet}.$(( rdma_host | 1 ))"


🔴 The peer/gateway IP for a /31 point-to-point link at server_vllm.sh:74 is computed with bitwise OR (rdma_host | 1), but a /31 has two hosts that differ only in bit 0, so the peer is obtained by XOR (rdma_host ^ 1). For any node whose RDMA host octet is odd (1, 3, 5, ...), rdma_gw resolves to the node's own IP; the kernel rejects ip route replace ... via <self>, the error is swallowed by 2>/dev/null, and inter-node RoCEv2 traffic on that node silently falls through to the management interface — defeating the whole purpose of this route fix. Fix: change | to ^.

Extended reasoning...

What the bug is

In benchmarks/multi_node/amd_utils/server_vllm.sh, the setup_rdma_env helper (lines 67–82) adds a route to the wider 192.168.X.0/24 RDMA fabric via the TOR-switch peer on each node's /31 point-to-point link. The peer IP is computed at line 74:

local rdma_gw="192.168.${rdma_subnet}.$(( rdma_host | 1 ))"

A /31 subnet has exactly two host addresses that differ only in bit 0 (e.g. .0/.1, .2/.3, .4/.5). The peer of host X is therefore X ^ 1 (flip bit 0), not X | 1.

Step-by-step proof

host octet host | 1 host ^ 1 Correct peer

0 1 ✅ 1 ✅ 1

1 1 ❌ (self!) 0 ✅ 0

2 3 ✅ 3 ✅ 3

3 3 ❌ (self!) 2 ✅ 2

4 5 ✅ 5 ✅ 5

5 5 ❌ (self!) 4 ✅ 4

For any node whose RDMA host octet is odd, rdma_gw evaluates to the node's own IP.

What happens at runtime

The subsequent command on line 78 is:

ip route replace "192.168.${rdma_subnet}.0/24" via "$rdma_gw" dev "$rdma_iface" 2>/dev/null && \ echo "[RDMA-ROUTE] Added ..." || \ echo "[RDMA-ROUTE] Route add failed ..."

The Linux kernel rejects ip route ... via <local-address> on a non-onlink route (RTNETLINK answers: Invalid argument). Because stderr is redirected to /dev/null and the failure is only logged as a single "Route add failed" line on stdout, the error is effectively silent — the script proceeds and the prefill/decode workers start as if RDMA were configured.

Impact

Without the fabric route, the source-routing lookup for another node's 192.168.x.y RDMA IP falls through to the default route — the management interface, not the RDMA NIC. Inter-node KV transfer for vLLM disaggregated prefill/decode then traverses the management network instead of the lossless RoCEv2 fabric the workaround was meant to enable. Roughly half of nodes in a /31-addressed fabric (those with odd host octets) will be affected, depending on the cluster's Pensando ionic IP-assignment convention.

This function is called unconditionally from server_vllm.sh:91 on every vllm-disagg run, so any node with a 192.168.x.x RDMA IP exercises this path. The bug is latent in a new file, so there are no callers in main today, but the CI workflows added in this PR (kimik2.5_fp4_mi355x_vllm-disagg.sh, minimaxm2.5_fp8_mi355x_vllm-disagg.sh) will exercise it.

Why existing code does not prevent it

Nothing in the helper validates that rdma_gw != rdma_ip, and the 2>/dev/null masks the kernel's rejection. The fallback echo "[RDMA-ROUTE] Route add failed for ..." is the only visible signal in the per-node log — easy to miss in a multi-page slurm_job-*.out, and produces a "puzzling perf degradation" mode rather than a clean failure.

Fix

One-character change:

local rdma_gw="192.168.${rdma_subnet}.$(( rdma_host ^ 1 ))"

With XOR, the math is correct for both halves of every /31:
0^1=1, 1^1=0, 2^1=3, 3^1=2, 4^1=5, 5^1=4, ... — always the peer.

claude · 2026-05-27T06:26:01Z

+NUM_NODES=$((xP + yD))
+echo "NUM_NODES: $NUM_NODES (xP=$xP + yD=$yD)"


🔴 job.slurm:195 now computes NUM_NODES=$((xP + yD)) using worker counts, but submit.sh allocates PREFILL_NODES + DECODE_NODES via sbatch -N. For multi-node-per-worker decode configs the third node is trimmed away, breaking the existing dsr1-fp8-mi355x-sglang-disagg "Top of curve" scenario (1 decode worker @ DEP16 across 2 nodes) and the analogous fp4/mtp variants. Restore NUM_NODES=${NUM_NODES} (already exported by submit.sh) or recompute it from PREFILL_NODES_PER_WORKER*xP + DECODE_NODES_PER_WORKER*yD.

Extended reasoning...

Bug

benchmarks/multi_node/amd_utils/job.slurm:195 was changed from NUM_NODES=${NUM_NODES} (inherited from submit.sh) to NUM_NODES=$((xP + yD)). This is incorrect for any disagg config where a single worker spans more than one node: xP/yD count workers, while PREFILL_NODES/DECODE_NODES (and the original NUM_NODES exported by submit.sh) count nodes.

Trigger path

submit.sh builds the allocation and the worker exports as follows:

L93: NUM_NODES=$((PREFILL_NODES + DECODE_NODES)) and L118 exports it

L108-109: xP=$PREFILL_WORKERS, yD=$DECODE_WORKERS (comment: "may span multiple nodes")

sbatch is called with -N $NUM_NODES

Inside the slurm script, the new line 195 throws away the exported value and uses workers instead. SELECTED_NODES = head -n NUM_NODES then trims the allocated nodelist (lines 197-199), SLURM_NNODES is reset (line 207), and every downstream srun only fans out to the truncated subset.

Concrete failing scenario (unchanged by this PR, but consumes the refactored amd_utils)

.github/configs/amd-master.yaml still contains dsr1-fp8-mi355x-sglang-disagg "Top of curve" with prefill num-worker=1 + PREFILL_NODES=1 and decode num-worker=1 + DECODE_NODES=2. The runner script invokes submit.sh with:

PREFILL_NODES=1, PREFILL_WORKERS=1, DECODE_NODES=2, DECODE_WORKERS=1

Step-by-step proof:

submit.sh: NUM_NODES = 1 + 2 = 3. sbatch allocates 3 nodes. xP=1, yD=1. DECODE_TP_SIZE = DECODE_NODES * DECODE_TP / DECODE_WORKERS = 2*8/1 = 16.

job.slurm:195 (this PR): overrides NUM_NODES = xP + yD = 1 + 1 = 2.

SELECTED_NODES = head -n 2 of the 3-node allocation. SLURM_NNODES = 2. The third (decode) node is silently dropped.

server_sglang.sh:232-233: DECODE_NODES_PER_WORKER = (16+7)/8 = 2. The decode worker is multi-node (DEP16 across 2 nodes).

server_sglang.sh role assignment: rank 0 = prefill head, rank 1 = decode head. Rank 2 (the decode follower needed for the multi-node decode worker) is never spawned because srun only runs on 2 ranks. The decode head hangs at --dist-init-addr waiting for its peer.

Impact

Pre-existing scenarios broken (consume the refactored amd_utils unchanged): dsr1-fp8-mi355x-sglang-disagg (Top of curve), dsr1-fp8-mi355x-sglang-disagg-mtp (Top of curve), and dsr1-fp4-mi355x-sglang-disagg(-mtp) 1P1D large-TP variants in amd-master.yaml. These are production sweeps that ship in the same yaml.

New configs unaffected by chance: kimik2.5-fp4-mi355x-vllm-disagg and minimaxm2.5-fp8-mi355x-vllm-disagg happen to use 1-node-per-worker layouts (xP+yD == PREFILL_NODES+DECODE_NODES), so they don't trip the bug. That's why this regression slipped through the PR's own test plan.

The failure mode is silent: sbatch succeeds and 3 nodes are allocated, but only 2 ever run. The decode head hangs on --dist-init-addr until SGLANG_DISAGGREGATION_WAITING_TIMEOUT (3600s) fires.

Fix

Drop the new computation and revert to using the value submit.sh already exports:

NUM_NODES=${NUM_NODES}

Or recompute correctly from per-worker node counts:

PREFILL_NODES_PER_WORKER=$(((PREFILL_TP_SIZE + GPUS_PER_NODE - 1) / GPUS_PER_NODE)) DECODE_NODES_PER_WORKER=$(((DECODE_TP_SIZE + GPUS_PER_NODE - 1) / GPUS_PER_NODE)) NUM_NODES=$((PREFILL_NODES_PER_WORKER * xP + DECODE_NODES_PER_WORKER * yD))

Either approach restores the prior behavior for the multi-node-per-worker decode scenarios while keeping the new vllm-disagg recipes unaffected.

claude · 2026-05-27T06:26:01Z

+# Set SLURM_EXCLUDE_NODES env var to a comma-separated list of hostnames.
+EXCLUDE_OPT=()
+SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"
+if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then
+    EXCLUDE_OPT=(--exclude "$SLURM_EXCLUDE_NODES")
+fi


🟡 submit.sh frames itself as a cluster-portable template (line 4: "must be configured for your specific cluster before use") but hardcodes a mia1-only default for SLURM_EXCLUDE_NODES (mia1-p01-g11,mia1-p01-g12,mia1-p01-g15) at line 159. The default leaks into every cluster's sbatch invocation, and the ${VAR:-default} syntax means SLURM_EXCLUDE_NODES='' cannot opt out — only unset works. Fix is either ${VAR-default} (no colon, so empty string disables) or move the default into runners/launch_mi355x-amds.sh alongside the other mia1-specific knobs (IBDEVICES, MORI_RDMA_TC, MODEL_DIR).

Extended reasoning...

What the bug is

At benchmarks/multi_node/amd_utils/submit.sh:158-163:

EXCLUDE_OPT=() SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}" if [[ -n "${SLURM_EXCLUDE_NODES:-}" ]]; then EXCLUDE_OPT=(--exclude "$SLURM_EXCLUDE_NODES") fi

Two distinct issues:

Cluster-specific default in a portable template. submit.sh's docstring (lines 4-5) declares it a "Cluster Configuration Template ... must be configured for your specific cluster before use." But the default exclude list contains literal mia1 hostnames. The other files in amd_utils/ (env.sh, job.slurm) explicitly support GPU*, smci355-ccs-aus*, and mia1* hostname patterns — so the rest of the directory is genuinely cluster-portable, and this submit.sh default is inconsistent with that.

No way to disable via empty string. ${VAR:-default} (with the colon) substitutes the default both when VAR is unset and when VAR is empty. So a CI caller doing SLURM_EXCLUDE_NODES='' bash submit.sh ... cannot disable the option — they have to unset SLURM_EXCLUDE_NODES instead.

Why existing code doesn't prevent it

This is new code in this PR. The verifiers confirmed the default literal is hardcoded and that no other code path overrides it before the sbatch invocation. The other cluster-specific knobs (IBDEVICES, MORI_RDMA_TC, MODEL_DIR) are correctly scoped — they're set inside runners/launch_mi355x-amds.sh (line 30: IBDEVICES="rdma0,rdma1,...rdma7", line 31: MORI_RDMA_TC=104) — so the runner-based pattern is the existing convention this default breaks.

Step-by-step proof of the empty-string issue

$ SLURM_EXCLUDE_NODES='' bash -c 'X="${SLURM_EXCLUDE_NODES:-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"; echo "X=$X"' X=mia1-p01-g11,mia1-p01-g12,mia1-p01-g15

The empty string was re-substituted to the default. With the colon-less form:

$ SLURM_EXCLUDE_NODES='' bash -c 'X="${SLURM_EXCLUDE_NODES-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}"; echo "X=$X"' X=

Which would then fail the [[ -n "$SLURM_EXCLUDE_NODES" ]] check and skip --exclude entirely — the intended opt-out behavior.

Impact

Low today, real tomorrow. The verifier-refutation correctly notes that the current production pipeline is mia1-locked (runners/launch_mi355x-amds.sh hardcodes mia1-specific IBDEVICES, MORI_RDMA_TC=104, and MODEL_PATH=/it-share/data), so this default actually matches the current target cluster. The footgun fires the moment anyone tries the smci355 or GPU* runner paths that the rest of amd_utils/ already supports — sbatch on those clusters either rejects unknown node names with "Invalid node name specified" (strict slurmctld) or silently no-ops the --exclude (permissive).

Fix

Either of two two-character/one-line changes:

Drop the colon: SLURM_EXCLUDE_NODES="${SLURM_EXCLUDE_NODES-mia1-p01-g11,mia1-p01-g12,mia1-p01-g15}" — empty string now disables --exclude.

Move the default into runners/launch_mi355x-amds.sh next to the other mia1-specific knobs, so the runner owns the cluster-scoped value and submit.sh truly is portable.

Severity rationale

Filed at nit: the current production path is mia1 (verified by the runner's mia1-specific IBDEVICES/MODEL_PATH/MORI_RDMA_TC), so this is harmless today. Worth flagging on the review because the file is being touched in this PR, the docstring frames the script as portable, and the fix is two characters.

claude · 2026-05-27T06:26:01Z

+    DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',')
+    if [[ -n "$DETECTED" ]]; then
+        export IBDEVICES="$DETECTED"
    else
-        echo "ERROR: Unable to detect cluster from hostname $NODENAME and IBDEVICES not set" >&2
-        exit 1
+        echo "WARNING: Unable to detect RDMA devices. Set IBDEVICES explicitly." >&2
    fi
-    echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES from hostname $NODENAME"
+    echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES from hostname $(hostname -s)"
 else
    echo "[INFO] Using IBDEVICES=$IBDEVICES (set by runner or environment)"
 fi
 export IBDEVICES


🟡 The [INFO] Auto-detected IBDEVICES=… line at env.sh:28 sits outside the inner success/failure branch, so on the failure path it prints [INFO] Auto-detected IBDEVICES= from hostname X immediately after the WARNING that just said detection failed — directly contradicting it. The from hostname phrasing is also stale (detection now uses ibv_devinfo, not hostname). Fix: move the echo inside the if [[ -n "$DETECTED" ]]; then branch. Optional follow-up: consider exit 1 on the empty-detect path (the previous version did) so downstream consumers (NCCL_IB_HCA, --disaggregation-ib-device, UCX_NET_DEVICES fallback) fail with an actionable error instead of propagating empty strings.

Extended reasoning...

What goes wrong

if [[ -z "$IBDEVICES" ]]; then DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',') if [[ -n "$DETECTED" ]]; then export IBDEVICES="$DETECTED" else echo "WARNING: Unable to detect RDMA devices. Set IBDEVICES explicitly." >&2 fi echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES from hostname $(hostname -s)" # ← line 28, OUTSIDE the if/else fi

The [INFO] echo at line 28 is at the same indent level as the inner if [[ -n "$DETECTED" ]]; then … else … fi, not nested inside the success branch. When ibv_devinfo returns nothing, DETECTED is empty, the else branch emits the WARNING to stderr, and then the unconditional [INFO] line still fires on stdout — claiming auto-detection succeeded with an empty value.

Step-by-step proof (direct manual run, no runner pre-set):

Operator runs bash benchmarks/multi_node/amd_utils/server.sh on a node where ibv_devinfo isn't installed or returns no HCAs. IBDEVICES is unset → outer if enters the detection branch.

DETECTED=$(ibv_devinfo 2>/dev/null | …) → empty string.

[[ -n "$DETECTED" ]] is false → else runs:
WARNING: Unable to detect RDMA devices. Set IBDEVICES explicitly. → stderr.

Control falls through to line 28 (no fi between the inner else and this echo):
[INFO] Auto-detected IBDEVICES= from hostname mia1-p01-g03 → stdout.

Operator now sees a WARNING immediately followed by an INFO claiming the auto-detect succeeded with empty value. The two messages contradict each other, and from hostname … is doubly misleading — detection no longer uses hostname (it uses ibv_devinfo).

Downstream impact of the silent continuation

Once IBDEVICES is empty, the values propagate without further validation:

Line 45: NCCL_IB_HCA=${NCCL_IB_HCA:-$IBDEVICES}. Because IBDEVICES is exported as an empty string (not unset), the :- default doesn't fire — NCCL_IB_HCA ends up empty.

SGLang server.sh later passes --disaggregation-ib-device ${IBDEVICES} → empty.

vLLM env.sh's UCX fallback at lines 59–66: FIRST_IB=$(echo "$IBDEVICES" | cut -d',' -f1) → empty → the ${FIRST_IB}:1 fallback becomes ":1" (invalid UCX device spec).

The previous version of this script (sglang path on origin/main) had exit 1 for an unrecognized hostname, so the failure was loud and immediately actionable. The new path makes it a silent WARN and lets all of the above propagate.

Why existing CI hides this

runners/launch_mi355x-amds.sh explicitly sets IBDEVICES=rdma0,rdma1,…,rdma7 and MORI_RDMA_TC=104 before the SLURM job runs, so the outer [[ -z "$IBDEVICES" ]] check is false in CI and the buggy block is never entered. Anyone debugging RDMA setup, running directly, or hitting a runner that doesn't pre-set IBDEVICES lands on this path.

Fix

Move the INFO line inside the success branch (and optionally restore the fail-fast exit):

if [[ -z "$IBDEVICES" ]]; then DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',') if [[ -n "$DETECTED" ]]; then export IBDEVICES="$DETECTED" echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES via ibv_devinfo on $(hostname -s)" else echo "ERROR: Unable to detect RDMA devices via ibv_devinfo. Set IBDEVICES explicitly." >&2 exit 1 fi fi

The log fix is trivial and purely cosmetic; the optional exit 1 is a small behavioral regression to restore but high-value for operators outside the production CI path.

cursor · 2026-05-27T06:28:33Z

+fi

-exec sudo docker run --rm \
+\$MAYBE_EXEC \$DOCKER_CMD run \


Docker --rm flag dropped, containers may leak

Low Severity

The --rm flag was removed from the docker run command. While explicit cleanup exists in the KEEP_CONTAINERS block, if the post-run srun cleanup fails (e.g., due to SLURM timeout or node unreachability), containers will persist on worker nodes. The old code used --rm to guarantee automatic cleanup regardless of external failures.

^{Reviewed by Cursor Bugbot for commit 8b801d4. Configure here.}

billishyahao · 2026-05-27T06:34:48Z

@@ -0,0 +1,654 @@
+#!/bin/bash
+# =============================================================================


Do we still need all deps here?

Yes, the first three and last patches have not been upstreamed. Scheduler read mode patch could be removed when @simondanielsson 's patch merged.
patch_moriio_save_kv_timeout
patch_moriio_transfer_timeout
patch_moriio_load_kv_timeout
patch_scheduler_read_mode_fix
patch_prefill_idle_kv_reaper

FYI can be removed once vllm-project/vllm#40344 merged and new nightly is available

github-actions · 2026-05-27T16:13:03Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26498948991
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26498948991

Add multi-node vLLM PD disaggregation recipe using Nixl/RIXL KV transfer and vllm-router, mirroring the existing SGLang disagg recipe structure. - New benchmark config: dsr1-fp8-mi355x-vllm-disagg (1P2D, TP8) - New utils: vllm_disagg_utils/ (job.slurm, server.sh, submit.sh, etc.) - Runner: extend launch_mi355x-amds.sh for vllm-disagg framework

Extract hardcoded model configurations from server.sh bash maps and job.slurm VALID_MODELS into a declarative models.yaml, mirroring the SGLang disagg recipe pattern. Adding a new model now requires no script changes. Also: - Consolidate UCX transport vars in job.slurm Docker env; remove duplicated setup_ucx_env() from server.sh - Extract RDMA workarounds (ionic /31 route fix, Nixl UCX patch) into setup_rdma_env() helper - Lower UCX_LOG_LEVEL from info to warn - Add nicctl mount and QoS/DSCP auto-detection to env.sh - Remove stale host libionic bind-mounts (driver now built into image)

Adapt server.sh to vLLM v0.17.1 breaking changes: - Use simplified kv-transfer-config (side channel via env vars instead of kv_ip/kv_port, add kv_load_failure_policy) - Remove deprecated --disable-log-requests (disabled by default in v0.17) - Route NIXL side channel through RDMA IP for correct fabric path - Fix RIXL ucx_error_handling_mode patch for updated _api.py layout

bench.sh: replace `vllm bench serve` (log-only output) with the shared run_benchmark_serving helper from benchmark_lib.sh, matching the SGLang disagg pattern. This produces the .json result files that the multinode CI workflow expects (benchmark-multinode-tmpl.yml → process_result.py). server.sh: make the Nixl ucx_error_handling_mode=none runtime patch conditional on Pensando ionic RDMA devices (IBDEVICES=*ionic*). On the mia1 cluster (ConnectX/mlx5, IBDEVICES=rdma*), UCX handles error mode natively and the patch is skipped. Model-path resolution and IBDEVICES/UCX/QoS auto-detection were verified to already work on mia1 — no changes needed. Tested locally (Job 2802, 1P+2D, ISL/OSL=1024): conc 8 → 507 tok/s conc 32 → 1778 tok/s conc 16 → 1004 tok/s conc 64 → 2480 tok/s All four .json result files produced; 100% external prefix cache hit rate.

Move the vllm-router from a dedicated proxy node onto the first prefill node, mirroring SGLang's co-location pattern. This reduces the node count from xP + yD + 1 to xP + yD (e.g., 3 nodes instead of 4 for 1P+2D). - server.sh: NODE_RANK=0 now runs both vllm serve (prefill, port 2584) and vllm-router (port 30000); barrier waits on all nodes - submit.sh / job.slurm: NUM_NODES = PREFILL_NODES + DECODE_NODES - bench.sh: ROUTER_PORT default updated to 30000 Local 1P+2D benchmark (ISL/OSL=1024, DeepSeek-R1 FP8, MI355X): - Throughput: +1.6% to +8.4% across concurrency 8-64 - Mean TTFT: -22% to -63% (prefill is local to router) - TPOT/ITL: unchanged (within noise) - 25% fewer nodes, no performance regression

Replace the custom Docker image (vllm_disagg_pd:latest) with the public vllm/vllm-openai-rocm:v0.17.1 base image. Missing components (UCX, RIXL, etcd, libionic1, vllm-router) are now installed at container start via setup_deps.sh, which is sourced by server.sh. This eliminates the need to build, host, and maintain a custom image — CI nodes can pull directly from Docker Hub. Changes: - Add setup_deps.sh: idempotent installer for UCX (ROCm fork), RIXL, etcd, libionic1 (Pensando ionic), and vllm-router (NODE_RANK=0 only). Build steps run in subshells to avoid CWD pollution. - server.sh: source setup_deps.sh before any other logic - job.slurm: add --entrypoint "" to override the base image's vllm CLI entrypoint, allowing bash -lc to work correctly - env.sh: update comment (paths now set by setup_deps.sh, not image ENV) - amd-master.yaml: image changed to vllm/vllm-openai-rocm:v0.17.1 Tested locally (Job 2807, 3 nodes, ISL/OSL=1024): Setup overhead: ~2.5 min per node (all components built from source) Benchmark completed successfully across concurrency 8/16/32/64

…ecode Enable MoRI-based Expert Parallelism (--enable-expert-parallel --all2all-backend mori) on decode workers for DeepSeek-R1-0528, while keeping TP=8 to preserve KV cache transfer compatibility with the prefill node via NixlConnector. This matches SGLang's approach of TP=8 + EP within the TP group. KV Transfer: RIXL/NixlConnector (unchanged) MoE All-to-All: NCCL (default) -> MoRI-EP (--all2all-backend mori) Changes: - models.yaml: Add --enable-expert-parallel --all2all-backend mori to decode_flags; increase engine ready timeout to 1200s - setup_deps.sh: Add MoRI install and vLLM v0.17.1 patches for MoRI-EP + FP8 compatibility (AITER assertion, defer_input_quant) - server.sh: Support decode_env from models.yaml for decode-specific environment overrides - dsr1_fp8_mi355x_vllm-disagg.sh: Pass NODELIST to submit.sh for Slurm node constraints

@C64

…roxy Replace NixlConnector with MoRIIOConnector for KV cache transfer and replace the Rust-based vllm-router with a MoRI-IO-aware Python proxy that handles both HTTP routing and ZMQ-based RDMA endpoint discovery. The key architectural change is that the proxy enriches each request's kv_transfer_params with remote RDMA endpoint info (handshake_port, notify_port, host, port) before dispatching, enabling concurrent prefill+decode in WRITE mode — something vllm-router could not do because it only understands HTTP, not the MoRI-IO registration protocol. Changes: - Add moriio_proxy.py: MoRI-IO-aware proxy with ZMQ service discovery, request enrichment, and /health endpoint (adapted from vLLM upstream moriio_toy_proxy_server.py) - server.sh: switch --kv-transfer-config from NixlConnector to MoRIIOConnector with kv_connector_extra_config (proxy_ip, proxy_ping_port, http_port); launch proxy before prefill on NODE_RANK=0; set VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1 as workaround for v0.17.1 completion-ID mismatch (upstream fix: vllm-project/vllm#34907) - setup_deps.sh: replace vllm-router/Rust install with lightweight Python deps (quart, aiohttp, msgpack, pyzmq) for the proxy Benchmark (Job 2853 vs 2818 NixlConnector baseline, ISL/OSL=1024): TTFT median: -37% to -55% across C8–C64 (e.g. 384→241ms @C64) TTFT p99: -63% at C64 (6622→2469ms) Throughput: +8% at C64 (2634→2844 tok/s) TPOT: unchanged (~22ms @C64)

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

… still ran the first barrier; 2. kill and kill run only when DRY_RUN=0 Signed-off-by: Theresa Shan <theresa.shan@amd.com>

Enable READ-mode KV transfer (decode-initiated RDMA reads) with a critical scheduler assertion fix, and add safety timeouts to prevent indefinite hangs during RDMA transfers. Changes: - setup_deps.sh: Add patches — save_kv_layer/start_load_kv handshake timeouts (30s), RDMA transfer timeout (120s), deferred write task expiry (60s), write worker error handling, and scheduler assertion fix for READ-mode intermediate request states - moriio_proxy.py: Add stream idle timeout (PROXY_STREAM_IDLE_TIMEOUT) to abort stalled decode streams, and proper response.release() - submit.sh, job.slurm: Plumb PROXY_STREAM_IDLE_TIMEOUT and VLLM_MORIIO_CONNECTOR_READ_MODE env vars into Docker containers Validated: 1k/1k full sweep (C8–C512), 100% success rate at all concurrency levels, peak 8500 output tok/s at C512.

Port the vLLM disaggregated serving pipeline from the 4N cluster (Pensando ionic NICs) to the 9N mia1 cluster (mlx5/rdma NICs). Key changes: - Fix C512 deadlock: apply ucx_error_handling_mode=none universally instead of only for ionic NICs. Under high concurrency, UCX's default UCP_ERR_HANDLING_MODE_PEER prevents RIXL RDMA READ retries from recovering after ibv_post_send queue exhaustion, causing prefill KV cache saturation and pipeline deadlock. - Force-reinstall MoRI from b645fc8 to fix PCI topology assertion failure on nodes with Broadcom PEX890xx PCIe switches. - Auto-detect Docker privilege (sudo vs non-sudo) for cross-cluster portability. - Add SLURM_EXCLUDE_NODES support to skip nodes with broken Docker sockets. - Increase VLLM_ENGINE_READY_TIMEOUT_S to 3600 to accommodate longer setup times (RIXL/MoRI source builds over NFS).

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

…agg log dirs Signed-off-by: Theresa Shan <theresa.shan@amd.com>

No external references to this folder exist in the codebase. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Restores import order and failure-rate check block. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Restores _resolve helper and tokenizer fix logic. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

- Fix IBDEVICES detection log: move info message inside success branch, exit 1 on failure instead of silently propagating empty strings - Add missing SGLANG_USE_AITER=1 - Set SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0 to match upstream Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The refactored server_sglang.sh dropped the per-role COMBINE_DTYPE mapping that the old server.sh had. SGLang reads SGLANG_MORI_COMBINE_DTYPE internally, so map it from MORI_COMBINE_DTYPE_PREFILL (fp8_direct_cast) on prefill and MORI_COMBINE_DTYPE_DECODE (fp8) on decode. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Move VLLM_USE_V1, VLLM_SERVER_DEV_MODE, VLLM_DISABLE_REQUEST_ID_RANDOMIZATION to env.sh alongside other engine-specific config. Remove commented-out etcd setup block that is no longer used. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The refactored DOCKER_ENV_COMMON array dropped -e IS_MULTINODE that the old job.slurm had. Without it, eval metadata tagging inside the container sees an empty value. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…ang.sh Add BENCH_MAX_CONC_VALUE extraction and the two DP+EP override blocks that the refactor from server.sh dropped. These adjust max-running-requests, dispatch tokens, and MOE input tokens when both DP and EP are enabled. Also add trailing newline for POSIX compliance. server_sglang.sh now matches upstream server.sh exactly. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit b1ae781. Configure here.}

cursor · 2026-05-27T16:17:08Z

-sudo docker ps -aq --filter \"name=^container_sbatch_\" | xargs -r sudo docker rm -f || true
-sudo docker ps -aq | xargs -r sudo docker stop || true
+\$DOCKER_CMD ps -aq --filter \"$CONT_FILTER\" | xargs -r \$DOCKER_CMD rm -f || true
+\$DOCKER_CMD ps -aq | xargs -r \$DOCKER_CMD stop || true


Pre-clean stops all Docker containers on every node

High Severity

The pre-clean step runs \$DOCKER_CMD ps -aq | xargs -r \$DOCKER_CMD stop on every allocated node, which stops all running Docker containers — not just benchmark-related ones. On shared Slurm nodes where other users or services have running containers, this will disrupt unrelated workloads.

^{Reviewed by Cursor Bugbot for commit b1ae781. Configure here.}

github-actions · 2026-05-27T21:24:17Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26523726435
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26523726435

functionstackx · 2026-05-27T21:26:58Z

+patch_moriio_save_kv_timeout
+patch_moriio_transfer_timeout
+patch_moriio_load_kv_timeout
+patch_scheduler_read_mode_fix
+patch_prefill_idle_kv_reaper


hi @billishyahao @ichbinblau @simondanielsson

These patches r fine for inital PR but please remove them in follow up PRs as u improve the performance once vllm-project/vllm#40344 hits nightly

functionstackx

hi @billishyahao @ichbinblau @simondanielsson @chunfangamd thank you for the inital vLLM disagg enablement in OSS vLLM! 🥳

These patches r fine for inital PR but please remove them in follow up PRs vllm-project/vllm#40344 hits nightly

functionstackx · 2026-05-27T21:29:36Z

/reuse-sweep-run

ichbinblau requested a review from a team May 27, 2026 06:12

ichbinblau requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 27, 2026 06:12

github-project-automation Bot added this to InferenceMAX Board May 27, 2026

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread benchmarks/multi_node/amd_utils/job.slurm

Comment thread benchmarks/multi_node/amd_utils/job.slurm

Comment thread benchmarks/multi_node/amd_utils/job.slurm

cursor Bot reviewed May 27, 2026

View reviewed changes

claude Bot reviewed May 27, 2026

View reviewed changes

cursor Bot reviewed May 27, 2026

View reviewed changes

billishyahao reviewed May 27, 2026

View reviewed changes

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh

ichbinblau added the sweep-enabled label May 27, 2026

functionstackx changed the title ~~[AMD] vLLM disaggregated prefill-decode for MI355X~~ [AMD] vLLM Kimi MXFP4 & MiniMax M2.5 disaggregated prefill-decode for MI355X May 27, 2026

functionstackx changed the title ~~[AMD] vLLM Kimi MXFP4 & MiniMax M2.5 disaggregated prefill-decode for MI355X~~ [AMD] vLLM Kimi MXFP4 & MiniMax M2.5 FP8 disaggregated prefill-decode for MI355X May 27, 2026

functionstackx removed the sweep-enabled label May 27, 2026

chunfangamd and others added 12 commits May 27, 2026 12:14

[AMD] BUG fix: RANDOM_RANGE_RATIO never reaches bench.sh

c50b3c8

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

Bug fix: 1. With DRY_RUN=1, node 0 skipped starting proxy/prefill but…

fa7794d

… still ran the first barrier; 2. kill and kill run only when DRY_RUN=0 Signed-off-by: Theresa Shan <theresa.shan@amd.com>

simondanielsson and others added 19 commits May 27, 2026 12:15

fix: temporarily mount /coredumps

4c1520d

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

tmp: add bette r debugging capabilities

c2e0377

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: disable custom all-reduce for minimax

b172350

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: minimax segfault by avoiding M=8K fmoe kernel shape

9eaf548

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

revert: fix: temporarily mount /coredumps

2bde2b6

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

feat: add VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 as in single node example

e6d26d7

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: use FRAMEWORK arg in collect_latest_results.py to match vllm-dis…

102e59f

…agg log dirs Signed-off-by: Theresa Shan <theresa.shan@amd.com>

remove unused vllm_disagg_utils directory

c60e6af

No external references to this folder exist in the codebase. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

revert: restore backend_request_func.py to match main

106a4e4

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

revert: restore benchmark_serving.py to match main

8ccd28a

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

revert: fully restore benchmark_serving.py to match main

93da023

Restores import order and failure-rate check block. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

revert: fully restore backend_request_func.py to match main

f242ee5

Restores _resolve helper and tokenizer fix logic. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

add pr-link to vllm-disagg changelog entries

b133e5f

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

fix: pass IS_MULTINODE into Docker container

6286f44

The refactored DOCKER_ENV_COMMON array dropped -e IS_MULTINODE that the old job.slurm had. Without it, eval metadata tagging inside the container sees an empty value. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

fix: improve vllm-disagg changelog descriptions

37733fb

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

functionstackx force-pushed the amd/vllm_disagg_mvp_dev2 branch from e8b2230 to b1ae781 Compare May 27, 2026 16:15

functionstackx added the full-sweep-enabled label May 27, 2026

cursor Bot reviewed May 27, 2026

View reviewed changes

functionstackx reviewed May 27, 2026

View reviewed changes

functionstackx approved these changes May 27, 2026

View reviewed changes

functionstackx merged commit bb00055 into main May 27, 2026
48 checks passed

functionstackx deleted the amd/vllm_disagg_mvp_dev2 branch May 27, 2026 21:29

github-project-automation Bot moved this to Done in InferenceMAX Board May 27, 2026

Oseltamivir mentioned this pull request May 27, 2026

run-sweep: gate full-sweep PRs behind a sequential canary #1503

Merged

9 tasks

host octet	`host \| 1`	`host ^ 1`	Correct peer
0	1 ✅	1 ✅	1
1	1 ❌ (self!)	0 ✅	0
2	3 ✅	3 ✅	3
3	3 ❌ (self!)	2 ✅	2
4	5 ✅	5 ✅	5
5	5 ❌ (self!)	4 ✅	4

		NUM_NODES=$((xP + yD))
		echo "NUM_NODES: $NUM_NODES (xP=$xP + yD=$yD)"

		@@ -0,0 +1,654 @@
		#!/bin/bash
		# =============================================================================

Conversation

ichbinblau commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Patches

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot May 27, 2026

Choose a reason for hiding this comment

Hardcoded default SLURM exclude nodes affects all clusters

Uh oh!

claude Bot May 27, 2026

Choose a reason for hiding this comment

What the bug is

Step-by-step proof

What happens at runtime

Impact

Why existing code does not prevent it

Fix

Uh oh!

claude Bot May 27, 2026

Choose a reason for hiding this comment

Bug

Trigger path

Concrete failing scenario (unchanged by this PR, but consumes the refactored amd_utils)

Impact

Fix

Uh oh!

claude Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 27, 2026

Choose a reason for hiding this comment

Docker --rm flag dropped, containers may leak

Uh oh!

billishyahao May 27, 2026

Choose a reason for hiding this comment

Uh oh!

ichbinblau May 27, 2026

Choose a reason for hiding this comment

Uh oh!

simondanielsson May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 27, 2026

Choose a reason for hiding this comment

Pre-clean stops all Docker containers on every node

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

functionstackx May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

functionstackx left a comment

Choose a reason for hiding this comment

Uh oh!

functionstackx commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

ichbinblau commented May 27, 2026 •

edited

Loading

Docker `--rm` flag dropped, containers may leak

functionstackx May 27, 2026 •

edited

Loading