# LLM Distributed Inference/PD Disaggregation On AMD Instinct GPU

As LLMs grow in size, single-node inference optimization increasingly limits how far LLM serving can scale, and distributed inference across multiple nodes has become more important for efficient serving. Prefill and Decode (PD) disaggregation is a typical use case of distributed LLM inference on GPU nodes. LLM inference comprises two distinct phases: prefill and decode. The prefill phase is compute-intensive because it processes the entire input sequence, while the decode phase is memory-intensive because it reads and extends the Key-Value (KV) cache for each generated token. PD disaggregation runs these two phases independently on different GPUs or GPU nodes, which allows more efficient GPU resource allocation and independent performance tuning of each phase. This tutorial demonstrates how to set up 1P1D (one prefill instance and one decode instance) distributed inference on a single AMD Instinct MI300X node or on two MI300X nodes.
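
To make the compute-intensive versus memory-intensive distinction concrete, the following back-of-the-envelope sketch estimates arithmetic intensity (FLOPs per byte of weights read) for one forward pass. It assumes the common approximation of about 2 FLOPs per parameter per token and 2 bytes per parameter (FP16/BF16), and it ignores attention and KV cache traffic. The function name and the numbers are illustrative only.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumptions: ~2 FLOPs per parameter per token, and every weight is read
    once per pass, so the parameter count cancels out.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / bytes_per_param


prefill = arithmetic_intensity(tokens_per_pass=1024)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one new token per pass
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Under these assumptions, a 1024-token prefill does roughly a thousand times more arithmetic per byte of weights than a single-request decode step, which is why the two phases benefit from separate resources and tuning.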

# Prerequisites
This tutorial was developed and tested using the following setup.

## Operating system
Ubuntu 22.04/24.04: Ensure your system is running Ubuntu version 22.04/24.04.

## Hardware
AMD GPUs: This tutorial was tested on both a single AMD Instinct MI300X node and two AMD Instinct MI300X nodes. Each node has 8 MI300X GPUs, and an RDMA NIC is required for good performance. Choose the test case that matches the number of GPU nodes you have. Ensure you are using an AMD Instinct GPU or compatible hardware with ROCm support and that your system meets [the official requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html).

## Software
ROCm 6.3 or later: Install and verify ROCm by following [the ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html). After installation, confirm your setup with the `rocm-smi` command. AMD and the LLM open source community also provide pre-built ROCm Docker images, for example, the [ROCm SGLang image](https://hub.docker.com/r/lmsysorg/sglang/tags), the [ROCm Ubuntu 22.04 image](https://hub.docker.com/r/rocm/dev-ubuntu-22.04) and the [ROCm Ubuntu 24.04 image](https://hub.docker.com/r/rocm/dev-ubuntu-24.04). Developers can use these pre-built Docker images to reduce the effort of setting up a ROCm environment.
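
As a small convenience, the check mentioned above can be scripted. This is a minimal sketch that only tests whether `rocm-smi` is on `PATH` and, if so, prints its default summary; the helper name is hypothetical.

```python
import shutil
import subprocess


def rocm_smi_available() -> bool:
    """Return True if the rocm-smi executable is found on PATH."""
    return shutil.which("rocm-smi") is not None


if rocm_smi_available():
    # Print the default rocm-smi summary table.
    result = subprocess.run(["rocm-smi"], capture_output=True, text=True)
    print(result.stdout)
else:
    print("rocm-smi not found; check the ROCm installation and PATH.")
```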

### Hugging Face API access

* Obtain an API token from [Hugging Face](https://huggingface.co) for downloading models.
* Ensure the Hugging Face API token has the necessary permissions and approval to access:
  - For one node, you need access to the [Meta Llama-3.1-8B-Instruct checkpoints](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
  - For two nodes, you need access to the [Meta Llama-3.3-70B-Instruct checkpoints](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

## Set up PD Disaggregation environment 
In this tutorial, we work with the pre-built ROCm SGLang image, which already integrates SGLang with the AMD ROCm software stack. Developers can also use another ROCm image as the base image if needed.

### Step 1: Prepare the tutorial environment
Follow these steps to configure your tutorial environment:

### 1. Pull the Docker image

We use the `lmsysorg/sglang:v0.4.9-rocm630` Docker image as the base image because it was the latest version available when we ran the PD tests on MI300X nodes for this tutorial. The SGLang community continues to release new ROCm SGLang Docker images, and developers are strongly advised to try the latest one for better performance.

``` bash
docker pull lmsysorg/sglang:v0.4.9-rocm630
```

### 2. Launch the Docker container

For good network transfer performance, GPU nodes need an RDMA NIC to run PD disaggregation. When launching the Docker container, map the RDMA devices into the container, as shown in the command below.

``` bash
docker run -it --rm \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --device=/dev/infiniband \
  --device=/dev/infiniband/rdma_cm \
  --privileged \
  --cap-add=SYS_ADMIN \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v $(pwd):/workspace \
  -w /workspace/notebooks \
  lmsysorg/sglang:v0.4.9-rocm630
```

**Note**: This command mounts the current directory to the `/workspace` directory in the container. Ensure the notebook file is either copied to this directory before running the Docker command or uploaded into the Jupyter Notebook environment after it starts. Save the token or URL provided in the terminal output to access the notebook from your web browser. You can download this notebook from the [AI Developer Hub GitHub repository](https://github.com/ROCm/gpuaidev).

### 3. Install and launch Jupyter

Inside the Docker container, install Jupyter using the following command:

``` bash
pip install jupyter
```

Start the Jupyter server:

``` bash
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```

**Note**: Ensure port `8888` is not already in use on your system before running the above command. If it is, you can specify a different port by replacing `--port=8888` with another port number, for example, `--port=8890`.
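
One way to check the port before starting Jupyter is to try binding to it. The following is a minimal sketch (the helper name is hypothetical) that can be run with `python3` inside the container:

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use".
            return False
    return True


print(port_is_free(8888))
```

If this prints `False`, pick another port and pass it to `--port`.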

<div style="border-left:6px solid #d32f2f;background:#fdecea;color:#5f2120;padding:12px 16px;border-radius:6px;">
<strong>⚠️ NOTE:</strong> The rest of this notebook is designed to run as jupyter notebook. This notebook demonstrates Prefill/Decode (PD) disaggregation on AMD Instinct GPUs. It runs fully on a single node (intra‑node 1P1D) by default. Two‑node (inter‑node 1P1D) steps are optional and clearly marked.
</div>

**Run modes**
- Single node (default): No etcd required. Use “Intra‑Node 1P1D” section.
- Two nodes (optional): Requires etcd + RDMA + SSH setup. Use “Two‑Node (Inter‑Node) 1P1D” section.


### Provide your Hugging Face token

You need a Hugging Face API token with the appropriate permissions to access the Llama models, as described in the earlier sections of this notebook. First, install the Hugging Face Hub library.


In [None]:
!pip install --upgrade huggingface_hub


Run the following interactive block in your Jupyter notebook to set up the token:

In [None]:
from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
notebook_login()


Verify that your token was accepted correctly:

In [None]:
# Validate the token
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
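
A valid token does not by itself confirm that access to the gated Llama repositories has been approved. The sketch below is one way to check this: it tries to fetch a small file from each repository and records whether that succeeds. The helper name is hypothetical, and the commented usage assumes `hf_hub_download` from `huggingface_hub` and that each repository contains a `config.json` file.

```python
def check_model_access(fetch, repo_ids):
    """Return {repo_id: bool}, where True means fetch(repo_id) succeeded.

    `fetch` is any callable that raises an exception when the repository
    cannot be read, for example because gated access was not granted.
    """
    access = {}
    for repo_id in repo_ids:
        try:
            fetch(repo_id)
            access[repo_id] = True
        except Exception as err:
            print(f"{repo_id}: not accessible ({type(err).__name__})")
            access[repo_id] = False
    return access


# Usage in the notebook (needs network access and a valid login):
# from huggingface_hub import hf_hub_download
# fetch_config = lambda repo_id: hf_hub_download(repo_id, "config.json")
# check_model_access(fetch_config, ["meta-llama/Llama-3.1-8B-Instruct",
#                                   "meta-llama/Llama-3.3-70B-Instruct"])
```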

### Step 2: Install the necessary software components
For intra-node 1P1D, you only need to install the Mooncake transfer engine; etcd is not required. For inter-node 1P1D, you need to install both etcd and the Mooncake transfer engine.

#### 2.1. [optional] Install etcd
You can skip this step if you are running the single-node test. etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. In the SGLang PD disaggregation design, an etcd server must run on each GPU node to store cluster metadata, so etcd needs to be installed for the inter-node case.

In [None]:
%%bash
cd /sgl-workspace
apt update && apt install -y wget
wget https://github.com/etcd-io/etcd/releases/download/v3.6.0-rc.5/etcd-v3.6.0-rc.5-linux-amd64.tar.gz -O /tmp/etcd.tar.gz
tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/ --strip-components=1 && rm /tmp/etcd.tar.gz

#### 2.2. Install Mooncake 
Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. Its core components, such as the transfer engine, have been integrated into the SGLang PD disaggregation solution to transfer the KV cache between nodes.

In [None]:
%%bash
apt update && apt install -y zip unzip openssh-server
apt -y install gcc make libtool autoconf  librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool  libibverbs-dev rdma-core strace
cd /sgl-workspace
pip install mooncake-transfer-engine

### Step 3: [optional] Install RDMA-related libraries and configure network transfer
You can skip the network configuration steps for the single-node 1P1D test. However, the SGLang/Mooncake transfer engine still requires an enabled RDMA NIC even for an intra-node test. Run the `ibv_devices` command to check the RDMA device status. If no RDMA device is found, install the RDMA NIC driver as described in step 3.1.

#### 3.1. [optional] Install NIC RDMA driver
This tutorial uses two MI300X GPU nodes equipped with Thor2/BCM-57608 NIC devices as an example. If you use different NIC devices, download the corresponding driver packages from the device vendor and install them by following the official NIC user guide. The steps below apply only to Thor2/BCM-57608 devices.

In [None]:
%%bash
echo -e "\n\n============Installing required pkgs============\n\n"
apt -y install libelf-dev

cd bcm5760x_230.2.52.0a/drivers_linux/bnxt_rocelib
tar xvf libbnxt_re-230.2.52.0.tar.gz
cd libbnxt_re-230.2.52.0
echo -e "\n\n============Compiling RoCE Lib now============\n\n"
sh autogen.sh
./configure
make
find /usr/lib64/  /usr/lib -name "libbnxt_re-rdmav*.so"  -exec mv {} {}.inbox \;
make install all
sh -c "echo /usr/local/lib >> /etc/ld.so.conf"
ldconfig
cp -f bnxt_re.driver /etc/libibverbs.d/
find . -name "*.so" -exec md5sum {} \;
BUILT_MD5SUM=$(find . -name "libbnxt_re-rdmav*.so" -exec md5sum {} \; |  cut -d " " -f 1)
echo -e "\n\nmd5sum of the built libbnxt_re is $BUILT_MD5SUM"

In [None]:
%%bash
ibv_devices

After running the above steps, you can list the RDMA devices with the `ibv_devices` command and inspect device details with the `ibv_devinfo` command. On our test machine, `ibv_devices` produces the output below, which shows that all RDMA devices are enumerated successfully.

    device                 node GUID
    ------              ----------------
    rdma0               d604e6fffe0e9fb4
    rdma1               d604e6fffe0e9938
    xeth0               d604e6fffee921d0
    rdma2               d604e6fffe780000
    rdma3               d604e6fffe0e9d34
    rdma4               d604e6fffe780370
    rdma5               d604e6fffe0e8678
    rdma6               d604e6fffe0e8718
    rdma7               d604e6fffe7801f4
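
The device names in this listing are what the SGLang launch commands later pass to `--disaggregation-ib-device`. The sketch below shows one way to turn `ibv_devices` output into that comma-separated list. The `rdma` prefix filter is an assumption based on the naming on our test machine (it skips `xeth0`), and the helper name is hypothetical; adjust both to your own device names.

```python
def parse_ibv_devices(output: str, prefix: str = "rdma") -> list[str]:
    """Extract device names that start with `prefix` from ibv_devices output."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly two columns: device name and node GUID.
        if len(fields) == 2 and fields[0].startswith(prefix):
            names.append(fields[0])
    return names


sample = """\
    device                 node GUID
    ------              ----------------
    rdma0               d604e6fffe0e9fb4
    xeth0               d604e6fffee921d0
    rdma1               d604e6fffe0e9938
"""
print(",".join(parse_ibv_devices(sample)))
```

In the notebook, the real output can be obtained with `subprocess.run(["ibv_devices"], capture_output=True, text=True).stdout`.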

#### 3.2. Build and install ROCm-Aware UCX library

[Unified Communication X](https://github.com/openucx/ucx) (UCX) is an open source, cross-platform framework designed to provide a common set of communication interfaces for various network programming models and interfaces. UCX uses ROCm technologies to implement various network operation primitives. UCX is the standard communication library for InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnects.

In [None]:
%%bash
git clone https://github.com/openucx/ucx.git -b v1.18.1 
cd ucx 
./autogen.sh
./configure --with-rocm=/opt/rocm --enable-mt --prefix=/opt/ucx 
make -j "$(nproc)"
make install

In [None]:
import os
# Append the UCX directories; tolerate an unset LD_LIBRARY_PATH.
os.environ['PATH'] = os.environ['PATH'] + ':/opt/ucx/bin'
os.environ['LD_LIBRARY_PATH'] = ':'.join(filter(None, [os.environ.get('LD_LIBRARY_PATH'), '/opt/ucx/lib']))

In [None]:
%%bash
ucx_info -v

After installing UCX, use the `ucx_info` command to check whether your UCX library was built with ROCm support. On our MI300X machine, `ucx_info -v` outputs the information below.

    Library version: 1.20.0
    Library path: /opt/ucx/lib/libucs.so.0
    API headers version: 1.20.0
    Git branch 'v1.18.1', revision 6022e2a
    Configured with: --with-rocm=/opt/rocm --enable-mt --prefix=/opt/ucx
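
The ROCm check can also be automated by looking for `--with-rocm` in the `Configured with:` line. This is a minimal sketch with a hypothetical helper name; it only inspects text and assumes the output format shown above.

```python
def ucx_built_with_rocm(ucx_info_output: str) -> bool:
    """Return True if the 'Configured with:' line mentions --with-rocm."""
    for line in ucx_info_output.splitlines():
        if "Configured with:" in line:
            return "--with-rocm" in line
    return False


sample_ucx_info = """\
Library version: 1.20.0
Library path: /opt/ucx/lib/libucs.so.0
Configured with: --with-rocm=/opt/rocm --enable-mt --prefix=/opt/ucx
"""
print(ucx_built_with_rocm(sample_ucx_info))
```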


#### 3.3. Build and install ROCm-Aware OpenMPI library
[Open MPI](https://www.open-mpi.org/) is a Message Passing Interface implementation used as a communication protocol for parallel and distributed computers. ROCm-aware support means that the MPI library can send and receive data from AMD GPU device buffers directly. As of today, ROCm support is available through UCX. While other communication transports might work as well, UCX is the only transport formally supported in Open MPI head of development for ROCm devices. Therefore, UCX must be enabled when building Open MPI.

In [None]:
%%bash
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi 
./autogen.pl
./configure --prefix=/opt/ompi --with-rocm=/opt/rocm --with-ucx=/opt/ucx
make -j 8
make install 

In [None]:
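import os
# Append the Open MPI and UCX directories; tolerate an unset LD_LIBRARY_PATH.
os.environ['PATH'] = os.environ['PATH'] + ':/opt/ompi/bin:/opt/ucx/bin'
os.environ['LD_LIBRARY_PATH'] = ':'.join(filter(None, [os.environ.get('LD_LIBRARY_PATH'), '/opt/ompi/lib:/opt/ucx/lib']))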

In [None]:
%%bash
ompi_info | grep "extensions"

After installing Open MPI, use the `ompi_info` command to check whether your Open MPI library was built with ROCm support. On the tested MI300X machine, `ompi_info | grep "extensions"` outputs the information below.

    MPI extensions: affinity, cuda, ftmpi, rocm
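
If you prefer a scripted check, the following sketch extracts the extension names from that line so that `rocm` can be tested for explicitly. The helper name is hypothetical and the parsing assumes the output format shown above.

```python
def ompi_extensions(ompi_info_output: str) -> list[str]:
    """Return the names listed on the 'MPI extensions:' line, if present."""
    for line in ompi_info_output.splitlines():
        if "MPI extensions:" in line:
            listed = line.split(":", 1)[1]
            return [name.strip() for name in listed.split(",") if name.strip()]
    return []


sample_ompi_info = "  MPI extensions: affinity, cuda, ftmpi, rocm"
print("rocm" in ompi_extensions(sample_ompi_info))
```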

#### 3.4. [optional] SSH passwordless login configuration
When Open MPI applications run in a cluster, SSH is typically used to launch commands on remote nodes to set up the distributed inference. Passwordless SSH login, which requires neither a password nor a passphrase, must therefore be configured for all remote nodes. The steps below must be run on each node of the GPU cluster. This tutorial uses two GPU nodes as the example.

First, use `ssh-keygen` to generate a key pair on each node. The `id_rsa` file contains the private key and `id_rsa.pub` contains the public key.

Second, copy the content of the local public key to the `authorized_keys` file on the remote node, assuming the file path is `~/.ssh/authorized_keys`.

Third, restrict login to SSH keys on each node. Most servers allow both password authentication and SSH key authentication. To allow only key-based root login, uncomment the following lines in `/etc/ssh/sshd_config`: `PermitRootLogin prohibit-password` and `PubkeyAuthentication yes`.

The default SSH port is 22, which may already be in use by another SSH service in the cluster. It is therefore better to change the SSH port inside the current Docker container so that it is dedicated to the PD disaggregation application. To do so, add a `Port <self-defined port number>` line to the `sshd_config` file.
You can also add the same port to the SSH client configuration to override the default port for connections to the remote node, by adding the content below to the `~/.ssh/config` file.

    Host Remote_node_IP
        Port <self-defined port number>

After completing the above steps, run the commands below for the settings to take effect.

In [None]:
%%bash 
chmod 600 ~/.ssh/authorized_keys 
chmod 600 ~/.ssh/config 
service ssh restart
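
To confirm that passwordless login works, one option is to run a trivial remote command with the OpenSSH `BatchMode=yes` option, which makes `ssh` fail instead of prompting for a password. The sketch below builds such a command; the helper names are hypothetical, and the port `2222` is a placeholder for your self-defined SSH port.

```python
import subprocess


def build_ssh_check(host: str, port: int) -> list[str]:
    """Build an ssh command that fails rather than prompting for a password."""
    return [
        "ssh",
        "-p", str(port),
        "-o", "BatchMode=yes",      # never prompt for a password or passphrase
        "-o", "ConnectTimeout=5",   # give up quickly if the node is unreachable
        host,
        "true",                     # trivial remote command
    ]


def passwordless_login_works(host: str, port: int) -> bool:
    """Return True if the remote command ran without any prompt."""
    result = subprocess.run(build_ssh_check(host, port), capture_output=True)
    return result.returncode == 0


# Example values: the decode node address used later in this tutorial and a
# placeholder port.
print(" ".join(build_ssh_check("10.21.9.15", 2222)))
```

Call `passwordless_login_works(...)` from each node against the other node once both SSH services have been restarted.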

#### 3.5. [optional] Run RCCL-Test Benchmark to test RDMA settings 
[RCCL-Tests](https://github.com/ROCm/rccl-tests) is an open source tool provided by AMD that measures bandwidth and latency between GPUs and nodes by benchmarking the collective operations of the ROCm Collective Communications Library (RCCL), such as all_reduce and all_gather. In this tutorial, we use this benchmark to check whether the RDMA devices have been enabled for distributed LLM inference.

First, build RCCL-Tests benchmark. To compile RCCL tests with MPI support, you need to set MPI=1 and set MPI_HOME to the path where MPI is installed. If HIP is not installed in /opt/rocm, you may specify HIP_HOME. Similarly, if RCCL (librccl.so) is not installed in /opt/rocm/lib/, you may specify NCCL_HOME and CUSTOM_RCCL_LIB. 

In [None]:
%%bash
make MPI=1 MPI_HOME=/path/to/mpi HIP_HOME=/path/to/hip NCCL_HOME=/path/to/rccl

Second, specify the GPU nodes in a hostfile. The hostfile is a text file that lists the IP address of each host/node and the number of available GPU slots on it.
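
As an illustration, the sketch below renders a hostfile in the Open MPI `<host> slots=<n>` format for two nodes with 8 GPUs each. The IP addresses are the example node addresses used later in this tutorial, and the helper name and output file name are hypothetical.

```python
def render_hostfile(hosts: dict[str, int]) -> str:
    """Render one '<host> slots=<n>' line per node."""
    return "".join(f"{ip} slots={slots}\n" for ip, slots in hosts.items())


hostfile_text = render_hostfile({"10.21.9.10": 8, "10.21.9.15": 8})
print(hostfile_text, end="")

# To write it next to the notebook, for use with `mpirun --hostfile`:
# with open("mpi_hosts", "w") as f:
#     f.write(hostfile_text)
```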

Third, run the RCCL benchmark. If the configuration above is correct, the bandwidth measured over the RDMA devices will be far higher than over an ordinary Ethernet device.

In [None]:
%%bash
TORCH_NCCL_HIGH_PRIORITY=1 RCCL_MSCCL_ENABLE=0 mpirun -np <Total GPU number> --map-by ppr:<GPU number>:node --hostfile <mpi_hosts> --allow-run-as-root --mca pml ucx --mca btl ^openib  -x NCCL_SOCKET_IFNAME=<IP interfaces for communication> -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=<rdma device> -x NCCL_IB_GID_INDEX=3  /home/rccl-tests/build/all_reduce_perf -b 1k -e 2G -f 2 -g 1

## Run SGLang PD Disaggregation
SGLang supports prefill-decode (PD) disaggregation on AMD Instinct GPUs and uses Mooncake to transfer the KV cache. Architecturally, SGLang PD disaggregation comprises three components: a proxy server, a prefill server, and a decode server. When a request arrives, the proxy server selects a pair of prefill and decode servers according to its load-balancing scheme. The selected prefill and decode servers pair up through a handshake, establishing a local sender and receiver, respectively. The decode server pre-allocates the KV cache and then signals the prefill server to begin prefill inference and compute the KV cache. Once prefill is done, the KV cache is transferred to the decode server, which handles iterative token generation.
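
The request flow described above can be sketched as a toy, in-process simulation. It does not use any SGLang or Mooncake API: all class and method names are hypothetical, the KV cache is a plain Python list, and the round-robin pairing is an assumed stand-in for the proxy's load-balancing scheme.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class PrefillServer:
    name: str

    def prefill(self, prompt: list) -> list:
        # Stand-in for computing one KV entry per prompt token.
        return [f"kv({token})" for token in prompt]


@dataclass
class DecodeServer:
    name: str
    kv_slots: dict = field(default_factory=dict)

    def preallocate(self, request_id: str, prompt_len: int) -> None:
        # Reserve room for the KV cache before prefill starts.
        self.kv_slots[request_id] = [None] * prompt_len

    def receive_kv(self, request_id: str, kv_cache: list) -> None:
        # Stand-in for the KV cache transfer from the prefill server.
        self.kv_slots[request_id] = list(kv_cache)

    def generate(self, request_id: str, max_new_tokens: int) -> list:
        kv = self.kv_slots[request_id]
        output = []
        for step in range(max_new_tokens):
            token = f"tok{step}"
            kv.append(f"kv({token})")  # each decode step extends the KV cache
            output.append(token)
        return output


class Proxy:
    def __init__(self, prefills, decodes):
        # Assumed policy: cycle through fixed (prefill, decode) pairs.
        self._pairs = cycle(list(zip(prefills, decodes)))

    def handle(self, request_id: str, prompt: list, max_new_tokens: int) -> list:
        prefill, decode = next(self._pairs)
        decode.preallocate(request_id, len(prompt))
        kv_cache = prefill.prefill(prompt)
        decode.receive_kv(request_id, kv_cache)
        return decode.generate(request_id, max_new_tokens)


proxy = Proxy([PrefillServer("P0")], [DecodeServer("D0")])
print(proxy.handle("req-1", ["Hello", "world"], max_new_tokens=3))
```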

This tutorial tests SGLang PD disaggregation in two cases: intra-node 1P1D and inter-node 1P1D. The intra-node case needs at least two GPUs: one GPU runs the prefill server and the other runs the decode server. The inter-node case needs two MI300X nodes: one node runs the prefill server and the other runs the decode server. Because the proxy server does not need much GPU resource, we place it on the prefill node. In a larger cluster, the proxy server can run on a standalone node for better performance. The steps below assume that the IP address of the prefill node is 10.21.9.10 and that of the decode node is 10.21.9.15. Modify these parameters and settings according to your own cluster.

### Intra-Node 1P1D

#### Run Prefill server
In this step, we use the `sglang.launch_server` command to launch the prefill server. Detailed descriptions of the command's options can be found in the SGLang documentation or source code. Because options may change as SGLang evolves, refer to the documentation for the version you are using. The RDMA device names are those reported by `ibv_devices` in the previous steps.

In [None]:
%%bash
HIP_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
         --disaggregation-mode prefill --port 30000 \
         --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 2>&1 | tee /workspace/prefill.log >/dev/null &

#### Run Decode server
In this step, we use the `sglang.launch_server` command to launch the decode server. Detailed descriptions of the command's options can be found in the SGLang documentation or source code. Because options may change as SGLang evolves, refer to the documentation for the version you are using. The RDMA device names are those reported by `ibv_devices` in the previous steps.

In [None]:
%%bash
HIP_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
        --disaggregation-mode decode --port 30001 \
        --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 2>&1 | tee /workspace/decode.log >/dev/null &

#### Run Proxy server 
In this step, the proxy server is launched on the same node and configured with the prefill and decode server ports. The proxy server's own port is the one that the test client program connects to.

In [None]:
%%bash
python -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 40000 2>&1 | tee /workspace/proxy.log >/dev/null &
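
Because the servers are started in the background and model loading can take a while, it helps to wait until they respond before sending requests. The sketch below polls a URL until it returns HTTP 200. The helper name is hypothetical, and the commented usage assumes that the SGLang servers expose a `/health` endpoint on the ports used in the launch commands above; check this against the SGLang version you run.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout_s: float = 600.0,
                       interval_s: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening or not ready yet
        time.sleep(interval_s)
    return False


# In the notebook (prefill on port 30000, decode on port 30001):
# for port in (30000, 30001):
#     print(port, wait_until_healthy(f"http://127.0.0.1:{port}/health"))
```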

### Inter-Node 1P1D
#### Run etcd server on each node

Assuming that both the prefill and decode nodes have completed all the previous setup steps in this tutorial, run the commands below in the SGLang ROCm container on each node.

On the prefill node, start the etcd server with the command below. The etcd ports shown are for reference only; if they are already used by other processes, choose other ports.

In [None]:
%%bash 
etcd --name infra0 --data-dir /var/lib/etcd --initial-advertise-peer-urls http://10.21.9.10:2380 \
  --listen-peer-urls http://10.21.9.10:2380 \
  --listen-client-urls http://10.21.9.10:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.21.9.10:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://10.21.9.10:2380,infra1=http://10.21.9.15:2380 \
  --initial-cluster-state new \
  2>&1 | tee /workspace/etcd_infra0.log >/dev/null &

On the decode node, start the etcd server with the command below. The etcd ports shown are for reference only; if they are already used by other processes, choose other ports.

In [None]:
%%bash 
etcd --name infra1 --data-dir /var/lib/etcd --initial-advertise-peer-urls http://10.21.9.15:2380 \
  --listen-peer-urls http://10.21.9.15:2380 \
  --listen-client-urls http://10.21.9.15:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.21.9.15:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://10.21.9.10:2380,infra1=http://10.21.9.15:2380 \
  --initial-cluster-state new \
  2>&1 | tee /workspace/etcd_infra1.log >/dev/null &
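
The two etcd commands differ only in the member name and IP address, while `--initial-cluster` must be identical on both nodes. The sketch below generates the argument list for each member from a single mapping, which can help avoid copy-and-paste mistakes when the addresses or ports change. The helper name is hypothetical; the flags and values mirror the two commands above.

```python
def etcd_args(name: str, members: dict[str, str],
              peer_port: int = 2380, client_port: int = 2379) -> list[str]:
    """Build the etcd command line for member `name` of a static cluster."""
    ip = members[name]
    cluster = ",".join(
        f"{member}=http://{addr}:{peer_port}" for member, addr in members.items()
    )
    return [
        "etcd", "--name", name, "--data-dir", "/var/lib/etcd",
        "--initial-advertise-peer-urls", f"http://{ip}:{peer_port}",
        "--listen-peer-urls", f"http://{ip}:{peer_port}",
        "--listen-client-urls",
        f"http://{ip}:{client_port},http://127.0.0.1:{client_port}",
        "--advertise-client-urls", f"http://{ip}:{client_port}",
        "--initial-cluster-token", "etcd-cluster-1",
        "--initial-cluster", cluster,
        "--initial-cluster-state", "new",
    ]


members = {"infra0": "10.21.9.10", "infra1": "10.21.9.15"}
for member in members:
    print(" ".join(etcd_args(member, members)))
```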

#### Run Proxy server 
As mentioned before, the proxy server runs on the prefill node in this tutorial. You can also place it on a standalone node in the cluster for better performance.

In this step, the proxy server is configured with the IP addresses and ports of the prefill and decode nodes. The proxy server's own IP address and port are the ones that the test client program connects to.

In [None]:
%%bash
nohup python -m sglang.srt.disaggregation.mini_lb --prefill http://10.21.9.10:30000 \
                        --decode http://10.21.9.15:30000 --host 0.0.0.0 --port 40000 \
                        2>&1 | tee /workspace/proxy.log >/dev/null & 

#### Run Prefill server
In this step, we use the `sglang.launch_server` command to launch the prefill server. Detailed descriptions of the command's options can be found in the SGLang documentation or source code. Because options may change as SGLang evolves, refer to the documentation for the version you are using.


In [None]:
%%bash 
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct \
                        --disaggregation-mode prefill --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 \
                        --host 10.21.9.10 --port 30000  --trust-remote-code  \
                        --tp 8  --disable-radix-cache --disable-cuda-graph \
                        --max-running-requests 1024 --stream-output \
                        --dist-init-addr 10.21.9.10:5757 --nnodes 1 --node-rank 0 \
                        --mem-fraction-static 0.8 2>&1 | tee /workspace/prefill.log >/dev/null &

#### Run Decode server
In this step, we use the `sglang.launch_server` command to launch the decode server. Detailed descriptions of the command's options can be found in the SGLang documentation or source code. Because options may change as SGLang evolves, refer to the documentation for the version you are using.

In [None]:
%%bash 
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct \
                        --disaggregation-mode decode --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 \
                        --host 10.21.9.15 --port 30000 --trust-remote-code \
                        --tp 8 --disable-radix-cache --disable-cuda-graph \
                        --max-running-requests 1024 --stream-output \
                        --dist-init-addr 10.21.9.15:5757 --nnodes 1 --node-rank 0 \
                        --mem-fraction-static 0.8 2>&1 | tee /workspace/decode.log >/dev/null &

### Test PD Disaggregation 

In this step, we use `sglang.bench_serving` to benchmark the 1P1D setup in the same way as a normal SGLang benchmark. In this tutorial, we run it on the prefill node to keep the demo simple. To run it on another machine that can reach the cluster, set the host IP address and port of the proxy server in the command. Adjust the other test parameters as needed.

In [None]:
%%bash
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 40000 --dataset-name generated-shared-prefix \
           --gsp-system-prompt-len 0 \
           --gsp-question-len 1024 \
           --gsp-output-len 1024 \
           --gsp-num-groups 1 \
           --gsp-prompts-per-group 16 \
           --random-range-ratio 1 \
           --max-concurrency 16 \
           --pd-separated \
           2>&1 | tee test.log
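
Besides the benchmark, a single request through the proxy is a quick functional check. The sketch below builds a POST request for SGLang's native `/generate` endpoint. It assumes that the proxy on port 40000 forwards `/generate` and that the JSON response contains a `text` field; both should be checked against the SGLang version you run. The helper name is hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str,
                           max_new_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint behind the proxy."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://127.0.0.1:40000",
                                 "What is PD disaggregation?")
print(request.full_url)

# Send it once the proxy, prefill, and decode servers are all running:
# with urllib.request.urlopen(request, timeout=120) as response:
#     print(json.loads(response.read())["text"])
```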


### xPyD setup
If you have a larger GPU cluster, you can run xPyD (multiple prefill and decode instances) for better performance. The xPyD setup follows the same steps as above, with a few multi-node configuration changes:

- Change the prefill and decode configuration of the proxy server, for example `--prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"`.
- Change the multi-node distributed serving options, such as `dist-init-addr`, `nnodes` and `node-rank`, when launching the prefill and decode servers.
- Change the `tp`/`dp`/`ep-size` options of the SGLang serving program if needed.
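
As an illustration of the multi-node options, the sketch below generates the `--dist-init-addr`, `--nnodes` and `--node-rank` values for one prefill or decode instance that spans several nodes. It assumes that all nodes of an instance share the same `--dist-init-addr` (here the first node) and take consecutive ranks starting at 0. The helper name and the second IP address are hypothetical.

```python
def multinode_options(node_ips: list[str], init_port: int = 5757) -> dict[str, str]:
    """Map each node IP to its multi-node launch options.

    Assumption: the first IP acts as the rendezvous address for the instance.
    """
    init_addr = f"{node_ips[0]}:{init_port}"
    return {
        ip: f"--dist-init-addr {init_addr} --nnodes {len(node_ips)} --node-rank {rank}"
        for rank, ip in enumerate(node_ips)
    }


# Hypothetical prefill instance spanning two nodes.
for ip, options in multinode_options(["10.21.9.10", "10.21.9.11"]).items():
    print(ip, options)
```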

## Summary 
This tutorial showed how to set up and run SGLang PD disaggregation on AMD Instinct MI300X GPUs. Although it demonstrates 1P1D on a single MI300X node and on two MI300X nodes, you can extend the same steps to xPyD on your own GPU cluster. To learn more about PD disaggregation, see [Mooncake](https://kvcache-ai.github.io/Mooncake/), [LLM-d](https://llm-d.ai/docs/architecture/architecture) and [vLLM disagg_prefill](https://docs.vllm.ai/en/stable/features/disagg_prefill.html#development). LLM distributed inference, and PD disaggregation in particular, is still under active development. We hope that this tutorial will encourage you to tune, test, and contribute to LLM distributed inference on AMD GPUs, and help us shape the future of AI acceleration.