[DistInf] Enable multi-node MoRI EP disaggregated inference with CUDA graph decode support#150

Open
raviguptaamd wants to merge 5 commits into ROCm:develop from raviguptaamd:ravgupta/mori-2p2d-proven

Conversation

@raviguptaamd
Contributor

Summary

Enable multi-node MoRI EP disaggregated prefill/decode inference (2P/2D, 4P/4D) on OCI MI300X clusters with InfiniBand RDMA. Adds optional FULL_DECODE_ONLY CUDA graph mode for decode nodes.

Key changes:

  • Multi-node DP support with master-only --kv-transfer-config (child nodes join via --headless)
  • Runtime patching of vLLM PR #39276 (engine_id collision + MoRIIO robustness) via apply_moriio_2pd_patches.sh
  • RDMA/NCCL tuning, --ulimit memlock=-1:-1, host RDMA library auto-discovery
  • Host-local compilation caches (/tmp/vllm_cache/) for AITER JIT, Triton, COMGR, vLLM
  • Optional FULL_DECODE_ONLY CUDA graphs for decode nodes (prefill always eager)
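The master-only flag logic from the first bullet can be sketched as below (a minimal sketch; `IS_MASTER`, `KV_CONFIG`, and the connector JSON are illustrative stand-ins, not the script's actual identifiers):

```shell
#!/bin/bash
# Sketch: only the DP master node passes --kv-transfer-config; child
# nodes join the same DP group via --headless and never register with
# the proxy. IS_MASTER and KV_CONFIG are hypothetical names.
IS_MASTER=${IS_MASTER:-1}
KV_CONFIG='{"kv_connector":"MoRIIOConnector","kv_role":"kv_producer"}'

EXTRA_ARGS=()
if [ "$IS_MASTER" = "1" ]; then
    EXTRA_ARGS+=(--kv-transfer-config "$KV_CONFIG")
else
    EXTRA_ARGS+=(--headless)   # child node: no API server, no proxy registration
fi
echo "vllm serve ... ${EXTRA_ARGS[*]}"
```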

Files Changed

| File | Status | Purpose |
| --- | --- | --- |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Modified | Multi-node DP, master-only kv-config, CUDA graph support, RDMA tuning |
| scripts/vllm_dissag/run_xPyD_models.slurm | Modified | memlock ulimit, NCCL/MoRI env vars, host-local caches, RDMA mounts |
| scripts/vllm_dissag/benchmark_xPyD.sh | Modified | Warmup tuning (ISL=32, OSL=32), per-step timeout logic |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New | Idempotent runtime patch for vLLM PR #39276 (5/5 verification checks) |

Architecture

PROXY (co-located on Prefill Master node)
  HTTP load balancer + ZMQ router
       |                    |
  PREFILL (xP nodes)     DECODE (yD nodes)
  DP=8*xP, EP=8/GPU      DP=8*yD, EP=8/GPU
  Master: API, MoRIIO     Master: API, MoRIIO
          producer                 consumer
  Child:  headless,        Child:  headless,
          EP all-to-all            EP all-to-all
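For the 2P/2D case, the DP sizes in the diagram work out as follows (a quick check assuming 8 GPUs per MI300X node):

```shell
# Sketch: derive the data-parallel sizes shown in the diagram.
xP=2; yD=2; GPUS_PER_NODE=8
PREFILL_DP=$((GPUS_PER_NODE * xP))   # 16 DP ranks across the prefill nodes
DECODE_DP=$((GPUS_PER_NODE * yD))    # 16 DP ranks across the decode nodes
echo "prefill DP=$PREFILL_DP, decode DP=$DECODE_DP"
```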

Docker Image & Dependencies

  • Dockerfile: docker/vllm_disagg_inference.ubuntu.amd.Dockerfile (unchanged in this PR)
  • Base image: rocm/vllm-dev:base_torch2.10_triton3.6_rocm7.2_torch_build_20260216
  • MoRI: Same version as in base image (not updated)
  • vLLM: v0.18.1rc1.dev133+g7d6917bef (same commit as Dockerfile VLLM_COMMIT)
  • Runtime dependency: vLLM PR #39276 — patched at container startup, becomes no-op once merged upstream

How to run

1. Build the Docker image

cd MAD

# For MI300X (gfx942) with ConnectX-7 NICs:
docker build \
    -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile \
    --build-arg GFX_COMPILATION_ARCH=gfx942 \
    --build-arg NIC_COMPILATION_ARCH=cx7 \
    -t <your-registry>/mad-mori-ep:gfx942 \
    .

# For MI355X (gfx950) with Ionic AINICs:
docker build \
    -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile \
    --build-arg GFX_COMPILATION_ARCH=gfx950 \
    --build-arg NIC_COMPILATION_ARCH=ionic \
    -t <your-registry>/mad-mori-ep:gfx950 \
    .

Push to your registry so all nodes can pull it:

docker push <your-registry>/mad-mori-ep:gfx942
docker push <your-registry>/mad-mori-ep:gfx950

2. Run 2P/2D (default eager mode)

DOCKER_IMAGE_NAME="<your-registry>/mad-mori-ep:gfx942" \
MODEL_NAME="DeepSeek-V3" RUN_MORI=1 \
BENCHMARK_CON="8 16 32 64 128" BENCHMARK_COMBINATIONS="1024/1024" \
xP=2 yD=2 sbatch -N 4 -n 4 \
    --nodelist=<node1>,<node2>,<node3>,<node4> \
    scripts/vllm_dissag/run_xPyD_models.slurm

3. Run 2P/2D with CUDA graph decode

VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY \
DOCKER_IMAGE_NAME="<your-registry>/mad-mori-ep:gfx942" \
MODEL_NAME="DeepSeek-V3" RUN_MORI=1 \
BENCHMARK_CON="8 16 32 64 128" BENCHMARK_COMBINATIONS="1024/1024" \
xP=2 yD=2 sbatch -N 4 -n 4 \
    --nodelist=<node1>,<node2>,<node3>,<node4> \
    scripts/vllm_dissag/run_xPyD_models.slurm

4. Run 4P/4D

DOCKER_IMAGE_NAME="<your-registry>/mad-mori-ep:gfx942" \
MODEL_NAME="DeepSeek-V3" RUN_MORI=1 \
BENCHMARK_CON="8 16 32 64 128" BENCHMARK_COMBINATIONS="1024/1024" \
xP=4 yD=4 sbatch -N 8 -n 8 \
    --nodelist=<node1>,<node2>,...,<node8> \
    scripts/vllm_dissag/run_xPyD_models.slurm

Compilation caches

AITER JIT, Triton, COMGR, and vLLM caches are auto-mounted to host-local /tmp/vllm_cache/. First run on a node takes longer (JIT compilation); subsequent runs reuse cached artifacts. Cache paths are configurable via AITER_JIT_DIR, TRITON_CACHE_DIR, COMGR_CACHE_DIR, VLLM_CACHE_ROOT.
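The cache wiring can be sketched as follows (paths are the defaults described above; `CACHE_ROOT` is an illustrative helper variable, and the docker mount shown in the comment is only an example form):

```shell
# Sketch: host-local compilation caches, each overridable via its env var.
CACHE_ROOT=${CACHE_ROOT:-/tmp/vllm_cache}
export AITER_JIT_DIR=${AITER_JIT_DIR:-$CACHE_ROOT/aiter_jit}
export TRITON_CACHE_DIR=${TRITON_CACHE_DIR:-$CACHE_ROOT/triton}
export COMGR_CACHE_DIR=${COMGR_CACHE_DIR:-$CACHE_ROOT/comgr}
export VLLM_CACHE_ROOT=${VLLM_CACHE_ROOT:-$CACHE_ROOT/vllm}
mkdir -p "$AITER_JIT_DIR" "$TRITON_CACHE_DIR" "$COMGR_CACHE_DIR" "$VLLM_CACHE_ROOT"
# Mounted into the container, e.g.:
#   docker run -v "$CACHE_ROOT:$CACHE_ROOT" ...
```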

Result log locations

Logs are written to /shared_inference/$USER/model_blog_logs/{SLURM_JOB_ID}/:

  • prefill_NODE{N}.log / decode_NODE{N}.log — per-node vLLM server logs
  • proxy_NODE0.log — proxy server log
  • benchmark_{JOBID}_*_CONCURRENCY.log — benchmark results
  • pd_vllm_bench_NODE0.log — benchmark driver output

Known limitations

  • MoRIIO KV transfer operates only between master nodes' DP ranks (0–7). Child DP ranks (8–15) fall back to local compute. Fixing requires upstream vLLM changes (partially addressed by PR #39276).
  • With CUDA graphs at con=128, proxy-side failures can occur at very high throughput; use --request-rate to throttle incoming requests.
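To throttle a high-concurrency run, the benchmark invocation can be wrapped as sketched below (the rate value is illustrative; `--request-rate` defaults to `inf`, i.e. unthrottled):

```shell
# Sketch: cap request arrival rate for the con=128 CUDA graph case.
# REQUEST_RATE is in requests/second; 'inf' disables throttling.
REQUEST_RATE=${REQUEST_RATE:-32}
BENCH_CMD=(vllm bench serve
           --model deepseek-ai/DeepSeek-V3
           --max-concurrency 128
           --request-rate "$REQUEST_RATE")
echo "${BENCH_CMD[*]}"
```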

Commit messages

Enable multi-node xP/yD disaggregated prefill/decode inference with
MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: --kv-transfer-config only on master nodes; child nodes join
  via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276
  at container startup for engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)
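The RDMA provider auto-discovery can be sketched like this (library names from the bullet above; the search path and mount form are illustrative):

```shell
# Sketch: find host RDMA provider libraries and mount them read-only
# into the container so the in-container rdma-core can load them.
MOUNTS=()
for lib in libmlx5 libionic libbnxt_re; do
    for f in /usr/lib/x86_64-linux-gnu/${lib}*.so*; do
        [ -e "$f" ] && MOUNTS+=(-v "$f:$f:ro")
    done
done
echo "docker run ${MOUNTS[*]} ..."
```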

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located),
warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker
Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor
Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE
env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager.
This captures CUDA graphs for the autoregressive decode phase only,
reducing per-token dispatch overhead while preserving eager flexibility
for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.
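The gating described above can be sketched as below (a minimal sketch; `NODE_ROLE` is an illustrative stand-in, and the `--compilation-config` form assumes current vLLM CLI conventions):

```shell
# Sketch: decode nodes honor VLLM_CUDAGRAPH_MODE; everything else,
# including the unset-variable default, stays --enforce-eager.
NODE_ROLE=${NODE_ROLE:-decode}
GRAPH_ARGS=(--enforce-eager)
if [ "$NODE_ROLE" = "decode" ] && [ -n "${VLLM_CUDAGRAPH_MODE:-}" ]; then
    GRAPH_ARGS=(--compilation-config "{\"cudagraph_mode\": \"$VLLM_CUDAGRAPH_MODE\"}")
fi
echo "vllm serve ... ${GRAPH_ARGS[*]}"
```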

Made-with: Cursor
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap.
Setting a sensible default prevents empty-string failures on OCI
clusters where eth0 is the management NIC.

Made-with: Cursor
- Fix set -e abort in apply_moriio_2pd_patches.sh: move python3
  fallback out of for-loop word expansion to prevent script abort
  when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1):
  patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0]
  instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to
  Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222,
  15000) to stay aligned with in-container scripts
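The PIPESTATUS fix from the third bullet works as sketched below (`sleep 5` stands in for a benchmark step that stalls):

```shell
# Sketch: $? after a pipeline reports tee's status (0), hiding the
# timeout; PIPESTATUS[0] preserves timeout's exit code (124 on expiry).
timeout 1 sleep 5 | tee /dev/null
rc=${PIPESTATUS[0]}
echo "step exit code: $rc"   # 124 -> the step stalled and was killed
```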

Made-with: Cursor
Patch download and verification failures now exit non-zero so that
multi-node DP runs abort early instead of proceeding unpatched.
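The fail-fast behavior can be sketched as below (`fetch_patch` is a stand-in for the real download-and-verify step; the `PATCH_FILE` default and xP/yD defaults are illustrative):

```shell
# Sketch: abort multi-node DP runs when the patch cannot be applied;
# single-node 1P/1D runs merely warn and continue unpatched.
set -euo pipefail
fetch_patch() {
    # stand-in for: download the PR diff, check verification markers, apply
    [ -f "${PATCH_FILE:-/nonexistent.patch}" ]
}
patched=false
if fetch_patch; then
    patched=true
elif [ "${xP:-1}" -gt 1 ] || [ "${yD:-1}" -gt 1 ]; then
    echo "patch is mandatory for multi-node DP; aborting" >&2
    exit 1
else
    echo "patch optional for 1P/1D; continuing unpatched" >&2
fi
```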

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment


Pull request overview

Enables multi-node MoRI EP disaggregated inference (multi-node xP/yD topologies) on OCI MI300X clusters, including optional decode-side CUDA graph execution and runtime patching for upstream vLLM multi-node DP fixes.

Changes:

  • Extend vllm_disagg_mori_ep.sh to support multi-node DP (master-only KV transfer config, headless child nodes) with optional FULL_DECODE_ONLY CUDA graph decode mode and expanded RDMA/NCCL + cache tuning.
  • Update Slurm launcher to add memlock ulimit, host-local compilation caches, and broader RDMA library auto-mounting into containers.
  • Add an idempotent runtime patch script that downloads/applies vLLM PR #39276 and verifies expected markers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Multi-node MoRI EP topology support, master-only KV transfer config, CUDA-graph decode option, and RDMA/cache/timeout tuning. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Container runtime tuning (memlock, caches), RDMA library mounts, and env passthrough for multi-node settings. |
| scripts/vllm_dissag/benchmark_xPyD.sh | Benchmark warmup parameterization and per-step timeout/stall logging. |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New startup script to fetch/apply/verify the vLLM PR #39276 patch for multi-node DP robustness. |


Comment thread scripts/vllm_dissag/vllm_disagg_mori_ep.sh
Comment thread scripts/vllm_dissag/apply_moriio_2pd_patches.sh
Comment thread scripts/vllm_dissag/apply_moriio_2pd_patches.sh
@@ -1,7 +1,5 @@
#!/bin/bash
#SBATCH --job-name=vllm-pd # Specify a custom string for your slurm batch job
#SBATCH -N 2 # Request xP + yD nodes (proxy co-located on prefill master)
Contributor


is this change accidental?

Contributor

@lcskrishna lcskrishna left a comment


LGTM. If the change is accidental, please fix it.
