[DistInf] Enable multi-node MoRI EP disaggregated inference with CUDA graph decode support#150

Open
raviguptaamd wants to merge 5 commits into ROCm:develop from raviguptaamd:ravgupta/mori-2p2d-proven

Conversation

@raviguptaamd
Contributor

Summary

Enable multi-node MoRI EP disaggregated prefill/decode inference (2P/2D, 4P/4D) on OCI MI300X clusters with InfiniBand RDMA. Adds optional FULL_DECODE_ONLY CUDA graph mode for decode nodes.

Key changes:

  • Multi-node DP support with master-only --kv-transfer-config (child nodes join via --headless)
  • Runtime patching of vLLM PR #39276 (engine_id collision + MoRIIO robustness) via apply_moriio_2pd_patches.sh
  • RDMA/NCCL tuning, --ulimit memlock=-1:-1, host RDMA library auto-discovery
  • Host-local compilation caches (/tmp/vllm_cache/) for AITER JIT, Triton, COMGR, vLLM
  • Optional FULL_DECODE_ONLY CUDA graphs for decode nodes (prefill always eager)
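The master-only flag logic from the first bullet can be sketched as below (a minimal sketch; `IS_MASTER`, `KV_CONFIG`, and the connector JSON are illustrative stand-ins, not the script's actual identifiers):

```shell
#!/bin/bash
# Sketch: only the DP master node passes --kv-transfer-config; child
# nodes join the same DP group via --headless and never register with
# the proxy. IS_MASTER and KV_CONFIG are hypothetical names.
IS_MASTER=${IS_MASTER:-1}
KV_CONFIG='{"kv_connector":"MoRIIOConnector","kv_role":"kv_producer"}'

EXTRA_ARGS=()
if [ "$IS_MASTER" = "1" ]; then
    EXTRA_ARGS+=(--kv-transfer-config "$KV_CONFIG")
else
    EXTRA_ARGS+=(--headless)   # child node: no API server, no proxy registration
fi
echo "vllm serve ... ${EXTRA_ARGS[*]}"
```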

Files Changed

| File | Status | Purpose |
| --- | --- | --- |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Modified | Multi-node DP, master-only kv-config, CUDA graph support, RDMA tuning |
| scripts/vllm_dissag/run_xPyD_models.slurm | Modified | memlock ulimit, NCCL/MoRI env vars, host-local caches, RDMA mounts |
| scripts/vllm_dissag/benchmark_xPyD.sh | Modified | Warmup tuning (ISL=32, OSL=32), per-step timeout logic |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New | Idempotent runtime patch for vLLM PR #39276 (5/5 verification checks) |

Architecture

PROXY (co-located on Prefill Master node)
  HTTP load balancer + ZMQ router
       |                    |
  PREFILL (xP nodes)     DECODE (yD nodes)
  DP=8*xP, EP=8/GPU      DP=8*yD, EP=8/GPU
  Master: API, MoRIIO     Master: API, MoRIIO
          producer                 consumer
  Child:  headless,        Child:  headless,
          EP all-to-all            EP all-to-all
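For the 2P/2D case, the DP sizes in the diagram work out as follows (a quick check assuming 8 GPUs per MI300X node):

```shell
# Sketch: derive the data-parallel sizes shown in the diagram.
xP=2; yD=2; GPUS_PER_NODE=8
PREFILL_DP=$((GPUS_PER_NODE * xP))   # 16 DP ranks across the prefill nodes
DECODE_DP=$((GPUS_PER_NODE * yD))    # 16 DP ranks across the decode nodes
echo "prefill DP=$PREFILL_DP, decode DP=$DECODE_DP"
```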

Docker Image & Dependencies

  • Dockerfile: docker/vllm_disagg_inference.ubuntu.amd.Dockerfile (unchanged in this PR)
  • Base image: rocm/vllm-dev:base_torch2.10_triton3.6_rocm7.2_torch_build_20260216
  • MoRI: Same version as in base image (not updated)
  • vLLM: v0.18.1rc1.dev133+g7d6917bef (same commit as Dockerfile VLLM_COMMIT)
  • Runtime dependency: vLLM PR #39276 — patched at container startup, becomes no-op once merged upstream

How to run

1. Build the Docker image

cd MAD

# For MI300X (gfx942) with ConnectX-7 NICs:
docker build \
    -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile \
    --build-arg GFX_COMPILATION_ARCH=gfx942 \
    --build-arg NIC_COMPILATION_ARCH=cx7 \
    -t <your-registry>/mad-mori-ep:gfx942 \
    .

# For MI355X (gfx950) with Ionic AINICs:
docker build \
    -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile \
    --build-arg GFX_COMPILATION_ARCH=gfx950 \
    --build-arg NIC_COMPILATION_ARCH=ionic \
    -t <your-registry>/mad-mori-ep:gfx950 \
    .

Push to your registry so all nodes can pull it:

docker push <your-registry>/mad-mori-ep:gfx942
docker push <your-registry>/mad-mori-ep:gfx950

2. Run 2P/2D (default eager mode)

DOCKER_IMAGE_NAME="<your-registry>/mad-mori-ep:gfx942" \
MODEL_NAME="DeepSeek-V3" RUN_MORI=1 \
BENCHMARK_CON="8 16 32 64 128" BENCHMARK_COMBINATIONS="1024/1024" \
xP=2 yD=2 sbatch -N 4 -n 4 \
    --nodelist=<node1>,<node2>,<node3>,<node4> \
    scripts/vllm_dissag/run_xPyD_models.slurm

3. Run 2P/2D with CUDA graph decode

VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY \
DOCKER_IMAGE_NAME="<your-registry>/mad-mori-ep:gfx942" \
MODEL_NAME="DeepSeek-V3" RUN_MORI=1 \
BENCHMARK_CON="8 16 32 64 128" BENCHMARK_COMBINATIONS="1024/1024" \
xP=2 yD=2 sbatch -N 4 -n 4 \
    --nodelist=<node1>,<node2>,<node3>,<node4> \
    scripts/vllm_dissag/run_xPyD_models.slurm

4. Run 4P/4D

DOCKER_IMAGE_NAME="<your-registry>/mad-mori-ep:gfx942" \
MODEL_NAME="DeepSeek-V3" RUN_MORI=1 \
BENCHMARK_CON="8 16 32 64 128" BENCHMARK_COMBINATIONS="1024/1024" \
xP=4 yD=4 sbatch -N 8 -n 8 \
    --nodelist=<node1>,<node2>,...,<node8> \
    scripts/vllm_dissag/run_xPyD_models.slurm

Compilation caches

AITER JIT, Triton, COMGR, and vLLM caches are auto-mounted to host-local /tmp/vllm_cache/. First run on a node takes longer (JIT compilation); subsequent runs reuse cached artifacts. Cache paths are configurable via AITER_JIT_DIR, TRITON_CACHE_DIR, COMGR_CACHE_DIR, VLLM_CACHE_ROOT.
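The cache wiring can be sketched as follows (paths are the defaults described above; `CACHE_ROOT` is an illustrative helper variable, and the docker mount shown in the comment is only an example form):

```shell
# Sketch: host-local compilation caches, each overridable via its env var.
CACHE_ROOT=${CACHE_ROOT:-/tmp/vllm_cache}
export AITER_JIT_DIR=${AITER_JIT_DIR:-$CACHE_ROOT/aiter_jit}
export TRITON_CACHE_DIR=${TRITON_CACHE_DIR:-$CACHE_ROOT/triton}
export COMGR_CACHE_DIR=${COMGR_CACHE_DIR:-$CACHE_ROOT/comgr}
export VLLM_CACHE_ROOT=${VLLM_CACHE_ROOT:-$CACHE_ROOT/vllm}
mkdir -p "$AITER_JIT_DIR" "$TRITON_CACHE_DIR" "$COMGR_CACHE_DIR" "$VLLM_CACHE_ROOT"
# Mounted into the container, e.g.:
#   docker run -v "$CACHE_ROOT:$CACHE_ROOT" ...
```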

Result log locations

Logs are written to /shared_inference/$USER/model_blog_logs/{SLURM_JOB_ID}/:

  • prefill_NODE{N}.log / decode_NODE{N}.log — per-node vLLM server logs
  • proxy_NODE0.log — proxy server log
  • benchmark_{JOBID}_*_CONCURRENCY.log — benchmark results
  • pd_vllm_bench_NODE0.log — benchmark driver output

Known limitations

  • MoRIIO KV transfer operates only between master nodes' DP ranks (0–7). Child DP ranks (8–15) fall back to local compute. Fixing requires upstream vLLM changes (partially addressed by PR #39276).
  • With CUDA graphs at con=128, proxy-side failures can occur at very high throughput; use --request-rate to throttle incoming requests.
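To throttle a high-concurrency run, the benchmark invocation can be wrapped as sketched below (the rate value is illustrative; `--request-rate` defaults to `inf`, i.e. unthrottled):

```shell
# Sketch: cap request arrival rate for the con=128 CUDA graph case.
# REQUEST_RATE is in requests/second; 'inf' disables throttling.
REQUEST_RATE=${REQUEST_RATE:-32}
BENCH_CMD=(vllm bench serve
           --model deepseek-ai/DeepSeek-V3
           --max-concurrency 128
           --request-rate "$REQUEST_RATE")
echo "${BENCH_CMD[*]}"
```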

Commit messages

Enable multi-node xP/yD disaggregated prefill/decode inference with
MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: --kv-transfer-config only on master nodes; child nodes join
  via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276
  at container startup for engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)
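The RDMA provider auto-discovery can be sketched like this (library names from the bullet above; the search path and mount form are illustrative):

```shell
# Sketch: find host RDMA provider libraries and mount them read-only
# into the container so the in-container rdma-core can load them.
MOUNTS=()
for lib in libmlx5 libionic libbnxt_re; do
    for f in /usr/lib/x86_64-linux-gnu/${lib}*.so*; do
        [ -e "$f" ] && MOUNTS+=(-v "$f:$f:ro")
    done
done
echo "docker run ${MOUNTS[*]} ..."
```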

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located),
warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker
Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor
Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE
env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager.
This captures CUDA graphs for the autoregressive decode phase only,
reducing per-token dispatch overhead while preserving eager flexibility
for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.
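The gating described above can be sketched as below (a minimal sketch; `NODE_ROLE` is an illustrative stand-in, and the `--compilation-config` form assumes current vLLM CLI conventions):

```shell
# Sketch: decode nodes honor VLLM_CUDAGRAPH_MODE; everything else,
# including the unset-variable default, stays --enforce-eager.
NODE_ROLE=${NODE_ROLE:-decode}
GRAPH_ARGS=(--enforce-eager)
if [ "$NODE_ROLE" = "decode" ] && [ -n "${VLLM_CUDAGRAPH_MODE:-}" ]; then
    GRAPH_ARGS=(--compilation-config "{\"cudagraph_mode\": \"$VLLM_CUDAGRAPH_MODE\"}")
fi
echo "vllm serve ... ${GRAPH_ARGS[*]}"
```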

Made-with: Cursor
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap.
Setting a sensible default prevents empty-string failures on OCI
clusters where eth0 is the management NIC.

Made-with: Cursor
- Fix set -e abort in apply_moriio_2pd_patches.sh: move python3
  fallback out of for-loop word expansion to prevent script abort
  when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1):
  patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0]
  instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to
  Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222,
  15000) to stay aligned with in-container scripts
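The PIPESTATUS fix from the third bullet works as sketched below (`sleep 5` stands in for a benchmark step that stalls):

```shell
# Sketch: $? after a pipeline reports tee's status (0), hiding the
# timeout; PIPESTATUS[0] preserves timeout's exit code (124 on expiry).
timeout 1 sleep 5 | tee /dev/null
rc=${PIPESTATUS[0]}
echo "step exit code: $rc"   # 124 -> the step stalled and was killed
```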

Made-with: Cursor
Patch download and verification failures now exit non-zero so that
multi-node DP runs abort early instead of proceeding unpatched.
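The fail-fast behavior can be sketched as below (`fetch_patch` is a stand-in for the real download-and-verify step; the `PATCH_FILE` default and xP/yD defaults are illustrative):

```shell
# Sketch: abort multi-node DP runs when the patch cannot be applied;
# single-node 1P/1D runs merely warn and continue unpatched.
set -euo pipefail
fetch_patch() {
    # stand-in for: download the PR diff, check verification markers, apply
    [ -f "${PATCH_FILE:-/nonexistent.patch}" ]
}
patched=false
if fetch_patch; then
    patched=true
elif [ "${xP:-1}" -gt 1 ] || [ "${yD:-1}" -gt 1 ]; then
    echo "patch is mandatory for multi-node DP; aborting" >&2
    exit 1
else
    echo "patch optional for 1P/1D; continuing unpatched" >&2
fi
```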

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment


Pull request overview

Enables multi-node MoRI EP disaggregated inference (multi-node xP/yD topologies) on OCI MI300X clusters, including optional decode-side CUDA graph execution and runtime patching for upstream vLLM multi-node DP fixes.

Changes:

  • Extend vllm_disagg_mori_ep.sh to support multi-node DP (master-only KV transfer config, headless child nodes) with optional FULL_DECODE_ONLY CUDA graph decode mode and expanded RDMA/NCCL + cache tuning.
  • Update Slurm launcher to add memlock ulimit, host-local compilation caches, and broader RDMA library auto-mounting into containers.
  • Add an idempotent runtime patch script that downloads/applies vLLM PR #39276 and verifies expected markers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Multi-node MoRI EP topology support, master-only KV transfer config, CUDA-graph decode option, and RDMA/cache/timeout tuning. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Container runtime tuning (memlock, caches), RDMA library mounts, and env passthrough for multi-node settings. |
| scripts/vllm_dissag/benchmark_xPyD.sh | Benchmark warmup parameterization and per-step timeout/stall logging. |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New startup script to fetch/apply/verify the vLLM PR #39276 patch for multi-node DP robustness. |


Comment thread scripts/vllm_dissag/vllm_disagg_mori_ep.sh
Comment thread scripts/vllm_dissag/apply_moriio_2pd_patches.sh
Comment thread scripts/vllm_dissag/apply_moriio_2pd_patches.sh
@@ -1,7 +1,5 @@
#!/bin/bash
#SBATCH --job-name=vllm-pd # Specify a custom string for your slurm batch job
#SBATCH -N 2 # Request xP + yD nodes (proxy co-located on prefill master)
Contributor


is this change accidental?

Contributor

@lcskrishna lcskrishna left a comment


LGTM. If the change is accidental, please fix it.
