# Deploy Llama Stack on AMD Instinct™ GPUs

A step-by-step guide to deploying Llama Stack on AMD Instinct™ MI300X GPUs.

## Introduction

As a leader in AI open-source innovation, Meta’s Llama series has democratized access to large language models, empowering developers worldwide. The Llama Stack—Meta’s all-in-one deployment framework—extends this vision by enabling seamless transitions from research to production through built-in tools for optimization, API integration, and scalability. This unified platform is ideal for teams requiring robust support to deploy Meta’s models at scale across diverse applications.

Complementing this ecosystem, AMD reinforces its position as a leader in AI acceleration hardware by expanding the AI software frontier through the ROCm™ open-source software stack. By fostering collaboration and optimizing performance, ROCm™ equips developers with a robust foundation to build high-throughput AI solutions tailored for production environments.

This tutorial guides developers through deploying the Llama Stack on AMD ROCm™-powered GPUs, creating a production-ready infrastructure for large language model (LLM) inference. We’ll also demonstrate programmatic interactions via the Llama Stack CLI and Python SDK, ensuring seamless server integration. To streamline this journey, we’ll first preview the core components involved (the Llama Stack and its Remote vLLM Distribution, ROCm™, and the vLLM Docker images) before diving into the hands-on session.

### Llama Stack and Remote vLLM Distribution

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments. [1]

  ![Llama Stack](https://llama-stack.readthedocs.io/en/latest/_images/llama-stack.png)

The Llama Stack’s Inference API is interoperable with a wide range of LLM inference providers, including vLLM, TGI, Ollama, and OpenAI APIs, ensuring seamless integration and flexibility for deployment. It also provides client SDKs in four languages: Python, Swift, Node, and Kotlin.

For this tutorial, we’ve selected vLLM as the inference provider and the Llama Stack’s Python Client SDK to showcase scalable deployment workflows and illustrate hands-on, low-latency LLM integration into production-ready services.

### ROCm™ and vLLM Docker images

ROCm™ is an open-source software platform optimized to extract HPC and AI workload performance from AMD Instinct accelerators and AMD Radeon GPUs while maintaining compatibility with industry software frameworks. For more information, see [What is ROCm™?](https://rocm.docs.amd.com/en/latest/what-is-rocm.html) [2]

AMD collaborates with vLLM to deliver a streamlined, high-performance LLM inference engine and production-ready deployment solutions for enterprise-grade AI workloads.

**Available vLLM Containers** [3]

AMD provides two main vLLM container options:

- [rocm/vllm](https://hub.docker.com/r/rocm/vllm): Production-ready container

  - Pinned to a specific version, for example: rocm/vllm-dev:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6

  - Designed for stability

  - Optimized for deployment

- [rocm/vllm-dev](https://hub.docker.com/r/rocm/vllm-dev): Development container with the latest vLLM features

  - nightly, main and other specialized builds available:

    - nightly tags are built daily from the latest code, but may contain bugs

    - main tags are more stable builds, updated after testing

  - Includes development tools

  - Best for testing new features or custom modifications


## Deploying Llama Stack with ROCm™

We use the [Remote vLLM Distribution](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html#remote-vllm-distribution) with the rocm/vllm-dev Docker image on Instinct™ MI300X GPUs. In addition to supporting many LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, etc.), Llama Stack also lets users choose a safety provider (e.g., Meta’s Llama Guard, AWS Bedrock Guardrails, vLLM, etc.). In this tutorial, we use two Instinct™ MI300X GPUs: one serves the LLM inference API, and the other serves the Safety/Shield API.

### Prerequisites

This tutorial was run in the following environment:

- GPU: AMD Instinct™ MI300X
- OS: Ubuntu 22.04
- Docker image: rocm/vllm-dev:main (vLLM 0.7.4.dev388+g51641aaa7.rocm631)
- llama-stack: v0.2.1

#### Start the rocm/vllm-dev vLLM server container

# Set your Hugging Face token
import os
os.environ['HF_TOKEN'] = ''

In [None]:
# Alternatively, enter the token interactively
import os

# Prompt for your HF_TOKEN
hf_token = input("Enter your Hugging Face token: ")
os.environ['HF_TOKEN'] = hf_token

print("HF_TOKEN set!")
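Because `input()` leaves the typed value visible in the notebook output, a variant that hides the token may be preferable. The sketch below shows one way to do this with the standard library's `getpass`. The helper name `resolve_hf_token` and its parameters are illustrative only and are not part of any Hugging Face or Llama Stack API.

```python
import getpass
import os


def resolve_hf_token(environ=os.environ, prompt=getpass.getpass):
    """Return the Hugging Face token, prompting only if HF_TOKEN is unset.

    Hypothetical helper for this tutorial; `environ` and `prompt` are
    injectable so the logic can be exercised without a terminal.
    """
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the typed value instead of echoing it.
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("A Hugging Face token is required.")
    # Store it so that later %%bash cells inherit the variable.
    environ["HF_TOKEN"] = token
    return token
```

Calling `resolve_hf_token()` once before the `%%bash` cells would make `HF_TOKEN` available to them, in the same way as the cells above.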

In [2]:
%%bash
export INFERENCE_PORT=8080
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0
export VLLM_DIMG="rocm/vllm-dev:main"
docker run -d --rm \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $INFERENCE_PORT:$INFERENCE_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name rocm-vllm-provider \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT

1b590b7b6342bcfca3fa1e98480882183a384e98bf40160cc563d7d3055be0b7


**Note** that you’ll also need to set --enable-auto-tool-choice and --tool-call-parser to enable tool calling in vLLM.
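To illustrate where those two flags go, the sketch below assembles the server arguments in Python. The helper name `vllm_server_args` is hypothetical, and `llama3_json` is an assumed parser choice for Llama 3.x instruct models; consult the vLLM documentation for the parser that matches your model.

```python
def vllm_server_args(model, port, tool_call_parser=None):
    """Build the argument list for the vLLM OpenAI-compatible server.

    Hypothetical helper: it only assembles strings and starts nothing.
    """
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
    ]
    if tool_call_parser is not None:
        # The two flags are passed together to enable tool calling.
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_call_parser]
    return args


# "llama3_json" is an assumption; pick the parser that fits your model.
print(" ".join(vllm_server_args("meta-llama/Llama-3.2-3B-Instruct", 8080, "llama3_json")))
```

The resulting arguments would replace the last three lines of the `docker run` command above.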

If you use the Llama Stack Safety/Shield APIs, you also need to run a second vLLM instance with a corresponding safety model such as meta-llama/Llama-Guard-3-1B, using a script like this:

In [3]:
%%bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1
export VLLM_DIMG="rocm/vllm-dev:main"

docker run -d --rm \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $SAFETY_PORT:$SAFETY_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name rocm-vllm-guard \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $SAFETY_MODEL \
    --port $SAFETY_PORT

8f924b37c2291ed9cfed9e706dd657d2bf660d106e6195c9efae2e9cc3d6740d


The vLLM server needs time to start up and load the model before it can serve requests, and larger LLMs take longer to load, so adjust the wait below to suit your environment. Alternatively, start the two vLLM containers before running the remaining steps of this Jupyter notebook and use the curl tests to confirm that each server is ready.

In [4]:
!sleep 360
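A fixed sleep can be either too short or needlessly long. As an alternative, a small polling loop can wait until each server answers on `/v1/models`. The following is a minimal sketch using only the Python standard library; `models_endpoint_ready` and `wait_until_ready` are hypothetical helper names, and the timeout and interval values are assumptions to tune for your environment.

```python
import time
import urllib.request


def models_endpoint_ready(base_url, timeout_s=5.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_ready(base_url, max_wait_s=600.0, interval_s=10.0,
                     probe=models_endpoint_ready, sleep=time.sleep):
    """Poll `probe(base_url)` until it succeeds or `max_wait_s` elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For this tutorial, `wait_until_ready("http://localhost:8080")` and `wait_until_ready("http://localhost:8081")` would cover the inference and safety servers respectively.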

Let's test whether the two vLLM containers are ready. If a request fails, wait longer (for example, with another `sleep` command) and retry until curl returns the expected response.

In [5]:
!curl http://localhost:8080/v1/models

{"object":"list","data":[{"id":"meta-llama/Llama-3.2-3B-Instruct","object":"model","created":1744505063,"owned_by":"vllm","root":"meta-llama/Llama-3.2-3B-Instruct","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-ae9ae6b274cf42e49446d191d5102f70","object":"model_permission","created":1744505063,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

In [6]:
%%bash
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

{"id":"cmpl-28adf0e730f24efbb5c38debb754e840","object":"text_completion","created":1744505063,"model":"meta-llama/Llama-3.2-3B-Instruct","choices":[{"index":0,"text":" city like no other. From its","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
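The same request can be issued from Python. The sketch below mirrors the curl call above using only the standard library; the function names are hypothetical helpers, and the 60-second timeout is an assumption.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Extract the text of the first choice from a completions response body."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return first_completion_text(resp.read())
```

With the inference server running, `complete("http://localhost:8080", "meta-llama/Llama-3.2-3B-Instruct", "San Francisco is a")` sends the same payload as the curl command.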

In [7]:
!curl http://localhost:8081/v1/models

{"object":"list","data":[{"id":"meta-llama/Llama-Guard-3-1B","object":"model","created":1744505065,"owned_by":"vllm","root":"meta-llama/Llama-Guard-3-1B","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-c3d6fb3d8f3a40f4a343fcde368cec23","object":"model_permission","created":1744505065,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

In [8]:
%%bash
 curl http://localhost:8081/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-Guard-3-1B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

{"id":"cmpl-0f1308e60860420dbe4206f446c704ff","object":"text_completion","created":1744505065,"model":"meta-llama/Llama-Guard-3-1B","choices":[{"index":0,"text":" city that is full of history and","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}

### Install llama-stack

Install the `llama-stack` and `llama-stack-client` packages with pip.

In [9]:
%%bash
pip install llama-stack llama-stack-client

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: httpx in /home/amd/.local/lib/python3.10/site-packages (from llama-stack) (0.28.1)
Requirement already satisfied: jsonschema in /home/amd/.local/lib/python3.10/site-packages (from llama-stack) (4.23.0)
Requirement already satisfied: rich in /home/amd/.local/lib/python3.10/site-packages (from llama-stack) (13.9.4)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from llama-stack) (59.6.0)
Requirement already satisfied: termcolor in /home/amd/.local/lib/python3.10/site-packages (from llama-stack) (2.5.0)
Requirement already satisfied: tiktoken in /home/amd/.local/lib/python3.10/site-packages (from llama-stack) (0.9.0)

In [10]:
%%bash
pip list | grep llama_stack

llama_stack                       0.2.1
llama_stack_client                0.2.1
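The same check can be done from Python. The sketch below reads the installed versions with `importlib.metadata` and reports whether both packages are present at one version, as in the output above. The helper names are hypothetical, and treating a version mismatch as a problem is an assumption of this sketch rather than a documented requirement.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(packages=("llama_stack", "llama_stack_client")):
    """Return True when every package is installed and all share one version."""
    versions = [installed_version(name) for name in packages]
    return None not in versions and len(set(versions)) == 1


for name in ("llama_stack", "llama_stack_client"):
    # Prints None for a package that is not installed.
    print(name, installed_version(name))
```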


In [11]:
%%bash
git clone https://github.com/meta-llama/llama-stack.git
# Copy the template yaml of remote-vllm distro 
cp ./llama-stack/llama_stack/templates/remote-vllm/run.yaml .
cp ./llama-stack/llama_stack/templates/remote-vllm/run-with-safety.yaml .

Cloning into 'llama-stack'...
Updating files: 100% (924/924), done.


### Running llama-stack

Llama Stack publishes the `llamastack/distribution-remote-vllm` Docker image, which runs as the frontend while the vLLM servers run as the backend. To start the Llama Stack container, pass it the ports and model names of the two vLLM containers.

In [12]:
%%bash
export INFERENCE_PORT=8080
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export LLAMA_STACK_PORT=8321
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run -d --rm \
  --network=host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v ./run-with-safety.yaml:/root/my-run.yaml \
  --name llama-stack-distro \
  llamastack/distribution-remote-vllm \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env VLLM_URL=http://0.0.0.0:$INFERENCE_PORT/v1 \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env SAFETY_VLLM_URL=http://0.0.0.0:$SAFETY_PORT/v1



2619ffa6d1bbc7ed18da09c0444ccbfb73868e4910aec51bafeb5f89d13e08a0


Now, we have three containers.

In [13]:
# Wait for the Llama Stack container to start
!sleep 60

In [14]:
!docker ps

CONTAINER ID   IMAGE                                 COMMAND                  CREATED              STATUS              PORTS                                       NAMES
2619ffa6d1bb   llamastack/distribution-remote-vllm   "python -m llama_sta…"   About a minute ago   Up About a minute                                               llama-stack-distro
8f924b37c229   rocm/vllm-dev:main                    "python -m vllm.entr…"   7 minutes ago        Up 7 minutes        0.0.0.0:8081->8081/tcp, :::8081->8081/tcp   rocm-vllm-guard
1b590b7b6342   rocm/vllm-dev:main                    "python -m vllm.entr…"   7 minutes ago        Up 7 minutes        0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   rocm-vllm-provider
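To check this from a script rather than by eye, the running container names can be compared against the three expected ones. This is a minimal sketch that assumes the container names used in this tutorial; the helper names are hypothetical.

```python
import subprocess

# Container names assigned with --name in the docker run commands above.
EXPECTED = ("llama-stack-distro", "rocm-vllm-guard", "rocm-vllm-provider")


def missing_containers(names_output, expected=EXPECTED):
    """Return the expected container names absent from `docker ps` name output."""
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return [name for name in expected if name not in running]


def running_container_names():
    """Return one running container name per line (requires Docker)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

An empty list from `missing_containers(running_container_names())` means all three containers are up.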


### Use llama-stack client CLI

You can use the client CLI to access the Llama Stack service. For more details, see the [llama-stack-client CLI reference](https://llama-stack.readthedocs.io/en/latest/references/llama_stack_client_cli_reference.html).

In [15]:
!llama-stack-client models list

Available Models

┏━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━┳━━┓
┃  ┃ identifier                       ┃ provider_resource_id             ┃  ┃  ┃
┡━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━╇━━┩
│  │ all-MiniLM-L6-v2                 │ all-MiniLM-L6-v2                 │  │  │
├──┼──────────────────────────────────┼──────────────────────────────────┼──┼──┤
│  │ meta-llama/Llama-3.2-3B-Instruct │ meta-llama/Llama-3.2-3B-Instruct │  │  │
├──┼──────────────────────────────────┼──────────────────────────────────┼──┼──┤
│  │ meta-llama/Llama-Guard-3-1B      │ meta-llama/Llama-Guard-3-1B      │  │  │
└──┴──────────────────────────────────┴──────────────────────────────────┴──┴──┘

In [16]:
!llama-stack-client providers list

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API          ┃ Provider ID            ┃ Provider Type                  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ inference    │ vllm-inference         │ remote::vllm                   │
│ inference    │ vllm-safety            │ remote::vllm                   │
│ inference    │ sentence-transformers  │ inline::sentence-transformers  │
│ vector_io    │ faiss                  │ inline::faiss                  │
│ safety       │ llama-guard            │ inline::llama-guard            │
│ agents       │ meta-reference         │ inline::meta-reference         │
│ eval         │ meta-reference         │ inline::meta-reference         │
│ datasetio    │ huggingface            │ remote::huggingface            │
│ datasetio    │ localfs                │ inline::localfs                │
│ scoring      │ basic      

You can also request inference from the CLI:

In [17]:
!llama-stack-client inference chat-completion --message "tell me a joke"

ChatCompletionResponse(
    completion_message=CompletionMessage(
        content='{"name": "print", "parameters": {"f": "Why was the math book sad? Because it had too many problems."}}',
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=38.0, unit=None),
        Metric(metric='total_tokens', value=52.0,

### Use Python Client SDK

Llama Stack provides a Python client SDK for application development. The example below uses the SDK to run inference; it is adapted from the [Llama Stack getting-started guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html#test-basic-inference).

In [18]:
%%bash
cat > inference.py << 'EOF'
# inference.py
from llama_stack_client import LlamaStackClient

# Connect to the Llama Stack server started above
client = LlamaStackClient(base_url="http://localhost:8321")

# List available models
models = client.models.list()

# Select the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier

print("Model:", model_id)

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)
EOF

In [19]:
!python inference.py

Model: meta-llama/Llama-3.2-3B-Instruct
{"name": "haiku", "parameters": {"c": "code", "t": "typing", "s": "silent"}}


### Clean up

In [20]:
%%bash
rm -rf llama-stack
rm run.yaml
rm run-with-safety.yaml
rm inference.py
docker stop llama-stack-distro
docker stop rocm-vllm-guard
docker stop rocm-vllm-provider

llama-stack-distro
rocm-vllm-guard
rocm-vllm-provider


## References

[1] [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/index.html)

[2] [AMD ROCm™ documentation](https://rocm.docs.amd.com/en/latest/)

[3] [How to Build a vLLM Container for Inference and Benchmarking](https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html)