[DistInf] Enable multi-node MoRI EP disaggregated inference with CUDA graph decode support #150
Open
raviguptaamd wants to merge 5 commits into ROCm:develop
Conversation
Enable multi-node xP/yD disaggregated prefill/decode inference with MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: pass --kv-transfer-config only on master nodes; child nodes join via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276 at container startup for the engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30 min -> configurable, default 2 h)

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located), warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.
Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker
Depends on: vllm-project/vllm#39276 (applied at runtime)
Made-with: Cursor
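The master-only registration pattern described above can be sketched as follows. This is a minimal illustration, not the actual logic in vllm_disagg_mori_ep.sh; the function and variable names (`select_node_args`, `KV_TRANSFER_CONFIG`) are assumptions.

```shell
# Sketch: only the rank-0 node in a prefill/decode group passes
# --kv-transfer-config; other nodes join headless so headless workers
# never register spurious endpoints with the proxy.
# Function and variable names are illustrative.
select_node_args() {
  local node_rank="$1"
  if [ "$node_rank" -eq 0 ]; then
    # Group master: registers with the proxy via the KV transfer config
    echo "--kv-transfer-config \${KV_TRANSFER_CONFIG}"
  else
    # Child node: joins the data-parallel group without a frontend
    echo "--headless"
  fi
}

select_node_args 0   # prints the kv-transfer flag
select_node_args 1   # prints --headless
```

The same branch structure works for any xP/yD topology: each group has exactly one master, and every additional node in the group takes the headless path.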
Decode nodes can now optionally use CUDA graphs via the VLLM_CUDAGRAPH_MODE env var (e.g. FULL_DECODE_ONLY), while prefill nodes always run eager. This captures CUDA graphs for the autoregressive decode phase only, reducing per-token dispatch overhead while preserving eager flexibility for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.
Made-with: Cursor
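The role-dependent opt-in can be sketched like this. The function name and the emitted strings are illustrative; the real flag plumbing lives in vllm_disagg_mori_ep.sh.

```shell
# Sketch: decode nodes opt into a CUDA graph mode when the env var is
# set; prefill nodes (and unset env) keep --enforce-eager.
# Names and output strings are illustrative placeholders.
graph_mode_args() {
  local role="$1"   # "prefill" or "decode"
  if [ "$role" = "decode" ] && [ -n "${VLLM_CUDAGRAPH_MODE:-}" ]; then
    # e.g. FULL_DECODE_ONLY: capture graphs for decode steps only
    echo "cudagraph_mode=${VLLM_CUDAGRAPH_MODE}"
  else
    echo "--enforce-eager"
  fi
}

graph_mode_args prefill                               # prints --enforce-eager
VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY \
  graph_mode_args decode                              # prints cudagraph_mode=FULL_DECODE_ONLY
```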
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap. Setting a sensible default prevents empty-string failures on OCI clusters where eth0 is the management NIC.
Made-with: Cursor
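The empty-string guard can be expressed with shell default expansion. The function name and the fallback NIC below are illustrative placeholders, not the script's actual values; pick the cluster's real RDMA-capable interface.

```shell
# Guard against an empty interface name for MoRI's shmem bootstrap.
# resolve_iface and the "ens1" fallback are placeholders.
resolve_iface() {
  local requested="${1:-}"
  # ${var:-default} substitutes the fallback when unset or empty
  echo "${requested:-ens1}"
}

resolve_iface ""      # prints ens1 (empty input falls back)
resolve_iface eno2    # prints eno2 (explicit value wins)
```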
- Fix set -e abort in apply_moriio_2pd_patches.sh: move the python3 fallback out of the for-loop word expansion to prevent script abort when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1): patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0] instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to the Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222, 15000) to stay aligned with in-container scripts

Made-with: Cursor
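The PIPESTATUS fix matters because after `timeout ... | tee`, `$?` reports tee's exit status (usually 0) even when timeout killed the step. A minimal sketch, with an illustrative wrapper name:

```shell
# PIPESTATUS[0] holds the exit code of the first pipeline member,
# so a timeout (exit 124 from GNU coreutils timeout) survives the
# pipe through tee. run_step is an illustrative wrapper, not the
# benchmark script's actual function.
run_step() {
  local limit="$1"; shift
  timeout "$limit" "$@" 2>&1 | tee /dev/null
  return "${PIPESTATUS[0]}"
}

run_step 5 echo "step ok"   # exit 0; a stalled step would return 124
```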
Patch download and verification failures now exit non-zero so that multi-node DP runs abort early instead of proceeding unpatched. Made-with: Cursor
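The verification half of this fail-fast behavior amounts to checking for an expected marker string after patching and returning non-zero when it is missing, so a `set -e` caller aborts instead of running unpatched. A sketch with placeholder names:

```shell
# Sketch: abort the run if an expected post-patch marker is absent.
# verify_marker and the marker/file arguments are placeholders; the
# real script checks markers from vLLM PR #39276.
verify_marker() {
  local file="$1" marker="$2"
  if ! grep -q "$marker" "$file"; then
    echo "ERROR: expected marker '$marker' not found in $file" >&2
    return 1   # non-zero so set -e aborts the multi-node run
  fi
}
```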
Pull request overview
Enables multi-node MoRI EP disaggregated inference (multi-node xP/yD topologies) on OCI MI300X clusters, including optional decode-side CUDA graph execution and runtime patching for upstream vLLM multi-node DP fixes.
Changes:
- Extend vllm_disagg_mori_ep.sh to support multi-node DP (master-only KV transfer config, headless child nodes) with an optional FULL_DECODE_ONLY CUDA graph decode mode and expanded RDMA/NCCL + cache tuning.
- Update the Slurm launcher to add a memlock ulimit, host-local compilation caches, and broader RDMA library auto-mounting into containers.
- Add an idempotent runtime patch script that downloads/applies vLLM PR #39276 and verifies expected markers.
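The host-local compilation cache setup mentioned in the changes above can be sketched as follows. The env var names come from the PR description; the per-tool subdirectory layout under /tmp/vllm_cache is an assumption.

```shell
# Point JIT/compile caches at a host-local disk instead of NFS so
# concurrent nodes don't race on shared cache files.
# Subdirectory names are illustrative.
CACHE_ROOT="${CACHE_ROOT:-/tmp/vllm_cache}"
mkdir -p "$CACHE_ROOT"/{aiter,triton,comgr,vllm}
export AITER_JIT_DIR="$CACHE_ROOT/aiter"
export TRITON_CACHE_DIR="$CACHE_ROOT/triton"
export COMGR_CACHE_DIR="$CACHE_ROOT/comgr"
export VLLM_CACHE_ROOT="$CACHE_ROOT/vllm"
```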
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/vllm_dissag/vllm_disagg_mori_ep.sh | Multi-node MoRI EP topology support, master-only KV transfer config, CUDA-graph decode option, and RDMA/cache/timeout tuning. |
| scripts/vllm_dissag/run_xPyD_models.slurm | Container runtime tuning (memlock, caches), RDMA library mounts, and env passthrough for multi-node settings. |
| scripts/vllm_dissag/benchmark_xPyD.sh | Benchmark warmup parameterization and per-step timeout/stall logging. |
| scripts/vllm_dissag/apply_moriio_2pd_patches.sh | New startup script to fetch/apply/verify the vLLM PR #39276 patch for multi-node DP robustness. |
lcskrishna
reviewed
Apr 22, 2026
@@ -1,7 +1,5 @@
#!/bin/bash
#SBATCH --job-name=vllm-pd # Specify a custom string for your slurm batch job
#SBATCH -N 2 # Request xP + yD nodes (proxy co-located on prefill master)
Contributor
is this change accidental?
lcskrishna
approved these changes
Apr 22, 2026
Contributor
lcskrishna
left a comment
LGTM. If the change is accidental please fix it.
Summary
Enable multi-node MoRI EP disaggregated prefill/decode inference (2P/2D, 4P/4D) on OCI MI300X clusters with InfiniBand RDMA. Adds an optional FULL_DECODE_ONLY CUDA graph mode for decode nodes.

Key changes:
- Master-only --kv-transfer-config (child nodes join via --headless)
- Runtime patch script apply_moriio_2pd_patches.sh
- --ulimit memlock=-1:-1, host RDMA library auto-discovery
- Host-local compilation caches (/tmp/vllm_cache/) for AITER JIT, Triton, COMGR, vLLM
- Optional FULL_DECODE_ONLY CUDA graphs for decode nodes (prefill always eager)

Files Changed
- scripts/vllm_dissag/vllm_disagg_mori_ep.sh
- scripts/vllm_dissag/run_xPyD_models.slurm
- scripts/vllm_dissag/benchmark_xPyD.sh
- scripts/vllm_dissag/apply_moriio_2pd_patches.sh

Architecture
Docker Image & Dependencies
- docker/vllm_disagg_inference.ubuntu.amd.Dockerfile (unchanged in this PR)
- Base image: rocm/vllm-dev:base_torch2.10_triton3.6_rocm7.2_torch_build_20260216
- vLLM v0.18.1rc1.dev133+g7d6917bef (same commit as Dockerfile VLLM_COMMIT)

How to run
1. Build the Docker image
Push to your registry so all nodes can pull it:
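A hedged sketch of this step; the image tag and registry are placeholders, and only the Dockerfile path comes from this PR's description:

```shell
# Illustrative only: build from the PR's Dockerfile, then push so
# every node can pull the same image. <registry>/<tag> are placeholders.
docker build -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile \
  -t <registry>/vllm-mori:<tag> .
docker push <registry>/vllm-mori:<tag>
```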
2. Run 2P/2D (default eager mode)
3. Run 2P/2D with CUDA graph decode
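Per the commit message, this run differs from the default eager invocation only by the env var; the sbatch arguments are elided here as in the original:

```shell
VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm
```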
4. Run 4P/4D
Compilation caches
AITER JIT, Triton, COMGR, and vLLM caches are auto-mounted to host-local /tmp/vllm_cache/. The first run on a node takes longer (JIT compilation); subsequent runs reuse cached artifacts. Cache paths are configurable via AITER_JIT_DIR, TRITON_CACHE_DIR, COMGR_CACHE_DIR, VLLM_CACHE_ROOT.

Result log locations
Logs are written to /shared_inference/$USER/model_blog_logs/{SLURM_JOB_ID}/:
- prefill_NODE{N}.log / decode_NODE{N}.log — per-node vLLM server logs
- proxy_NODE0.log — proxy server log
- benchmark_{JOBID}_*_CONCURRENCY.log — benchmark results
- pd_vllm_bench_NODE0.log — benchmark driver output

Known limitations
Use --request-rate to throttle.