# Deploying a model with vLLM to Amazon SageMaker AI for Inference

This notebook guides you through the process of deploying a fine-tuned Vision Language Model (VLM) to Amazon SageMaker. The deployment process includes several key steps:

1. **Environment Setup**: Installing necessary dependencies and configuring the AWS environment
2. **Model Artifact Management**: Locating and selecting the fine-tuned model artifacts from S3
3. **Container Infrastructure**: Building and pushing a custom container to Amazon ECR
4. **SageMaker Deployment**: Creating and deploying a SageMaker endpoint

**Prerequisites**

Before starting, ensure you have:
- AWS credentials configured with appropriate permissions
- AWS CLI installed
- Access to the S3 bucket containing your model artifacts

**Important Notes**

- The deployment uses an ml.g5.2xlarge instance, which provides the GPU acceleration needed for efficient inference

## Environment Setup

First, we'll install `jq`, a lightweight command-line JSON processor. This will be used to parse AWS metadata and credentials later in our deployment process.

In [1]:
!sudo apt-get install -qq -y jq > /dev/null

debconf: delaying package configuration, since apt-utils is not installed


In [2]:
%pip install boto3 sagemaker pandas huggingface_hub --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
%load_ext autoreload
%autoreload 2

Import the required libraries:

In [4]:
import json
import os
import boto3
import sagemaker
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from sagemaker.model import Model



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Initialize AWS Services

In [5]:
bucket_name = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
role = sagemaker.get_execution_role()
session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
sm_client = boto3.client("sagemaker", region_name=region)

## Build and Push Docker Inference Container

Amazon SageMaker AI offers [three primary methods](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html) for deploying ML models to a SageMaker AI Inference Endpoint:
1. Using pre-built SageMaker containers for standard frameworks like PyTorch or TensorFlow
2. Modifying existing Docker containers with your own dependencies through a requirements.txt file
3. Creating fully custom containers that implement a web server listening for requests (`/invocations`), for maximum flexibility and control over dependencies and requirements

To run the fine-tuned model, we will build a custom container and push it to Amazon Elastic Container Registry (ECR). The container is built from our Dockerfile; once it is pushed to ECR, SageMaker can pull it when deploying our endpoint.
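To make the custom-container option concrete, here is a minimal sketch of the contract such a container has to satisfy: answer `GET /ping` for health checks and `POST /invocations` for inference. It uses only the Python standard library and echoes the request back instead of running a model. `InferenceHandler` and `start_server` are illustrative names and not part of our actual container, which wraps vLLM's API server.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Toy handler exposing the two routes SageMaker calls on a custom container."""

    def do_GET(self):
        # Health checks arrive as GET /ping and expect an HTTP 200 response.
        if self.path == "/ping":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        # Inference requests arrive as POST /invocations.
        if self.path != "/invocations":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder "model": echo the request instead of calling vLLM.
        self._reply(200, {"echo": payload})

    def _reply(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        # Silence per-request logging for this demo.
        pass


def start_server(port=0):
    """Start the server on a background thread; port=0 picks any free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Inside a real SageMaker container the server would listen on port 8080; the sketch defaults to an arbitrary free port only so that it can be tried locally without conflicts.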

### Docker Installation

To create our custom container for model serving, we first need Docker installed in our environment. This script handles the installation of Docker and its dependencies, including necessary security keys and repository configurations.

**Install docker-cli**

At the end of this install you should see:

```bash
Client: Docker Engine - Community
 Version:           20.10.24
 API version:       1.41
 Go version:        go1.19.7
 Git commit:        297e128
 Built:             Tue Apr  4 18:21:03 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
```

In [6]:
!bash docker-artifacts/01_docker_install.sh

Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease                         
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]      
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1381 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]      
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]     
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3092 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1540 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [55.7 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [4148 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [82.7 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-

### Build and push a custom image

SageMaker Inference supports simplified deployment of Qwen2 using Large Model Inference (LMI) container images, as described [here](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html); the available images are listed [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

However, video inference is not supported by the vLLM/OpenAI API server out of the box, so we design a custom image with a custom inference handler adapted from vLLM's [api_server.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) to include a video inference tag. We can build a custom image inside SageMaker Studio using `docker-cli`.

**Build our custom Docker image containing custom inference handler and push it to Amazon ECR (Elastic Container Registry).**

In [7]:
REPO_NAME = "vllm-sagemaker"
os.environ['REPO_NAME'] = REPO_NAME
# os.environ["S3_MODEL_URI"]=s3_model_uri

In [8]:
%%bash -s {region} {account_id}

REGION=$1

VERSION_TAG="latest"
CURRENT_ACCOUNT_NUMBER=$2

echo "bash 02_build_and_push.sh $REPO_NAME $VERSION_TAG $REGION $CURRENT_ACCOUNT_NUMBER"
cd docker-artifacts && bash 02_build_and_push.sh $REPO_NAME $VERSION_TAG $REGION $CURRENT_ACCOUNT_NUMBER

bash 02_build_and_push.sh vllm-sagemaker latest us-east-1 864981714022


+ reponame=vllm-sagemaker
+ versiontag=latest
+ regionname=us-east-1
+ account=864981714022
+ '[' vllm-sagemaker == '' ']'
+ '[' latest == '' ']'
+ '[' us-east-1 == '' ']'
+ '[' 864981714022 == '' ']'
+ echo 'Verifying AWS credentials and ECR permissions...'
+ aws ecr get-authorization-token


Verifying AWS credentials and ECR permissions...


+ '[' 0 -ne 0 ']'
+ fullname=864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest
+ echo 'Checking ECR repository...'
+ aws ecr describe-repositories --repository-names vllm-sagemaker


Checking ECR repository...


+ '[' 0 -ne 0 ']'
+ echo 'Logging into ECR...'


Logging into ECR...


+ aws ecr get-login-password --region us-east-1
+ docker login --username AWS --password-stdin 864981714022.dkr.ecr.us-east-1.amazonaws.com
https://docs.docker.com/engine/reference/commandline/login/#credential-stores



Login Succeeded


+ '[' 0 -ne 0 ']'
+ echo 'Building docker image...'
+ pwd
+ docker build -f Dockerfile --network sagemaker -t vllm-sagemaker .


Building docker image...
/home/sagemaker-user/deploy-vllm-sagemaker/docker-artifacts


DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
            environment-variable.



Sending build context to Docker daemon  17.92kB
Step 1/9 : FROM vllm/vllm-openai:v0.8.2
v0.8.2: Pulling from vllm/vllm-openai
fs layer0c7b: Pulling 
edd1dba56169: Pulling fs layer
e06eb1b5c4cc: Pulling fs layer
7f308a765276: Pulling fs layer
3af11d09e9cd: Pulling fs layer
42896cdfd7b6: Pulling fs layer
600519079558: Pulling fs layer
424cadf: Pulling fs layer
73b7968785dc: Pulling fs layer
80150f70fb1e: Pulling fs layer
3bd5db8307cf: Pulling fs layer
c6c838ebf7e8: Pulling fs layer
fa73b905e353: Pulling fs layer
376de2b58022: Pulling fs layer
e8c447725a94: Pulling fs layer
190d38f5aedd: Pulling fs layer
7969525fc7fe: Pulling fs layer
77ae37c6bbc7: Pulling fs layer
5b72426a5a03: Pulling fs layer
er01584141e4: Pulling fs lay
27cba7a0f74e: Pulling fs layer
4c956bffdb61: Pulling fs layer
9a8cb1d36869: Pulling fs layer
3bd5db8307cf: Waiting
c6c838ebf7e8: Waiting
fa73b905e353: Waiting
376de2b58022: Waiting
e8c447725a94: Waiting
190d38f5aedd: Waiting
7969525fc7fe: Waiting
7f308a765276: Waiting
77ae37c6bbc7: Waiting
3af11d09e9cd: Waiting
42

+ '[' 0 -ne 0 ']'
+ echo 'Tagging docker image...'
+ docker tag vllm-sagemaker 864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest


Tagging docker image...


+ '[' 0 -ne 0 ']'
+ echo 'Pushing image to ECR...'
+ docker push 864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest


Pushing image to ECR...
The push refers to repository [864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker]
ffbc0a41256e: Preparing
f01b0955ae1f: Preparing
50cf2ff8d3b5: Preparing
5122b9431e78: Preparing
25709edf: Preparing
1c05548c7171: Preparing
1741937fa70d: Preparing
5bf0223b1bae: Preparing
a92859893164: Preparing
436cebf56b3b: Preparing
c7634d9e938a: Preparing
3820b7906b88: Preparing
ebd840f7bc2f: Preparing
268ff7b755f5: Preparing
26ec4e41: Preparing
9c4b3feff1c5: Preparing
3485b6969240: Preparing
bbdba4451161: Preparing
a648: Preparing
b0dfaf1ca5c5: Preparing
700fe921ad1f: Preparing
520e0f301880: Preparing
: Preparing0
e942261d196e: Preparing
eeb5315df33c: Preparing
022bf74291b2: Preparing
eparing52594: Pr
5498e8c22f69: Preparing
3485b6969240: Waiting
1741937fa70d: Waiting
5bf0223b1bae: Waiting
bbdba4451161: Waiting
a92859893164: Waiting
436cebf56b3b: Waiting
c7634d9e938a: Waiting
Waiting6a648: 
3820b7906b88: Waiting
b0dfaf1ca5c5: Waiting
ebd840f7bc2f: Waiting
700fe921ad1f:

+ '[' 0 -ne 0 ']'
+ echo 'Saving image URI to file...'


Saving image URI to file...


+ echo 864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest
+ '[' 0 -ne 0 ']'
+ echo 'Successfully built and pushed image: 864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest'
+ echo 'Image URI saved to: dockerfile-image.txt'


Successfully built and pushed image: 864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest
Image URI saved to: dockerfile-image.txt


### Getting Container Image URI

Retrieve the full URI of our Docker image from ECR. This URI is essential for SageMaker deployment as it tells SageMaker exactly where to find our custom container image. The URI follows the format:
`{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{tag}`

In [9]:
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{REPO_NAME}:latest"
print(f"Base image to deploy a SageMaker endpoint: {image_uri}")

os.environ['CUSTOM_IMAGE'] = image_uri

Base image to deploy a SageMaker endpoint: 864981714022.dkr.ecr.us-east-1.amazonaws.com/vllm-sagemaker:latest
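As a sanity check on the URI format above, the following sketch builds an image URI from its parts and parses one back. The helper names `build_ecr_image_uri` and `parse_ecr_image_uri` are hypothetical, and the pattern is a simplification: it assumes the standard `amazonaws.com` domain and a tagged image, so it will reject digest references and partitions that use a different domain suffix.

```python
import re

# Simplified pattern for {account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}
ECR_URI_PATTERN = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[a-z0-9][a-z0-9._/-]*):(?P<tag>\w[\w.-]*)$"
)


def build_ecr_image_uri(account_id, region, repository, tag="latest"):
    """Assemble an ECR image URI from its components."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into its components, or raise ValueError."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri!r}")
    return match.groupdict()
```

Parsing the URI before deployment catches mistakes such as a missing tag or a swapped region early, rather than when SageMaker fails to pull the image.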


## Understanding the Model Serving Architecture

When we deploy our model to a SageMaker endpoint, here's how the components work together:

1. **Docker Container Structure**:
   - The container runs on the SageMaker instance (ml.g5.2xlarge)

2. **Request Flow**:
   - External requests → SageMaker endpoint → Container's port 8080
   - The `sed` commands we used modified the API paths to match SageMaker's expected structure:
     - `/ping` for health checks
     - `/invocations` for model inference
     - `/invocations/completions` for completion requests

3. **SageMaker Integration**:
   - Routes HTTPS requests to our container
   - Manages container lifecycle
   - Handles authentication and scaling
   - Monitors container health via the `/ping` endpoint

This setup allows us to serve our fine-tuned model with production-grade reliability and performance.
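The health-check part of this flow can be sketched from the client side as a polling loop against `/ping`. This is only an illustration of the idea, not how SageMaker implements its monitoring; `http_ping` and `wait_until_healthy` are hypothetical helpers, and the URL, attempt count, and interval are assumed defaults.

```python
import time
import urllib.request


def http_ping(url="http://localhost:8080/ping", timeout=5.0):
    """Return True if GET <url> answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200


def wait_until_healthy(probe=http_ping, max_attempts=60,
                       interval_seconds=10.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            if probe():
                return attempt
        except OSError:
            # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
            pass
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"Not healthy after {max_attempts} attempts")
```

Passing `probe` and `sleep` as parameters keeps the loop easy to test without a running container; calling `wait_until_healthy()` with no arguments polls the local container for up to ten minutes.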

**[Optional] We can run our container interactively in a terminal using the commands below. Make sure your JupyterLab space runs on a GPU instance, since model inference requires a GPU.**

In [10]:
def dict_to_env_file(dict_data, output_file='output.env'):
    """
    Convert dictionary to environment file format (VARIABLE=VALUE)
    
    Args:
        dict_data (dict): Dictionary containing environment variables
        output_file (str): Name of the output file (default: output.env)
    """
    try:
        with open(output_file, 'w') as f:
            for key, value in dict_data.items():
                # Convert value to string and escape special characters if needed
                value_str = str(value).replace('"', '\\"')
                # Write each variable in KEY=VALUE format
                f.write(f'{key}={value_str}\n')
        print(f"Successfully wrote environment variables to {output_file}")
    except Exception as e:
        print(f"Error writing to env file: {e}")

In [17]:
# Define environment variables for the model
environment = {
    "HF_TOKEN": "your_token_here",  # replace with your Hugging Face access token
    # "USE_HF_TRANSFER": "true",  # Enable faster downloads
    # "HF_HUB_ENABLE_HF_TRANSFER": "1",
    "SM_VLLM_MODEL": "Qwen/Qwen2.5-VL-3B-Instruct", # you can name your model whatever you want    
    "SM_VLLM_LIMIT_MM_PER_PROMPT": "image=2, video=0", # max number of images allowed in prompt. Increase for multi-page documents. Requires more memory.
    "SM_VLLM_MAX_NUM_SEQS":"8", # decrease if less GPU memory available
    "SM_VLLM_MAX_MODEL_LEN":"38608", # max context length, decrease if less GPU memory available
    # "SM_VLLM_MAX_MODEL_LEN":"10608", # max context length, decrease if less GPU memory available
    # "SM_VLLM_DTYPE": "bfloat16"
}

dict_to_env_file(environment)

Successfully wrote environment variables to output.env
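The `SM_VLLM_` prefix suggests that the container's entrypoint turns each of these variables into a vLLM command-line flag. The entrypoint is not shown in this notebook, so the mapping rule below (strip the prefix, lowercase, replace `_` with `-`) is an assumption, and `sm_env_to_vllm_args` is a hypothetical helper. The resulting flags such as `--model`, `--max-num-seqs`, and `--max-model-len` are existing vLLM server options.

```python
def sm_env_to_vllm_args(env, prefix="SM_VLLM_"):
    """Translate prefixed environment variables into CLI-style arguments.

    Assumed rule: SM_VLLM_MAX_MODEL_LEN=38608 becomes ["--max-model-len", "38608"].
    Variables without the prefix (for example HF_TOKEN) are ignored.
    """
    args = []
    for key in sorted(env):  # sorted for a deterministic argument order
        if not key.startswith(prefix):
            continue
        flag = "--" + key[len(prefix):].lower().replace("_", "-")
        args.extend([flag, str(env[key])])
    return args
```

Under this assumption, lowering `SM_VLLM_MAX_MODEL_LEN` or `SM_VLLM_MAX_NUM_SEQS` in the dictionary above has the same effect as passing a smaller `--max-model-len` or `--max-num-seqs` to the vLLM server directly.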


In [12]:
%%writefile run_container.sh
# Get temporary AWS credentials from the container credentials endpoint
export $(curl -s 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI | jq -r '"AWS_ACCESS_KEY_ID="+.AccessKeyId, "AWS_SECRET_ACCESS_KEY="+.SecretAccessKey, "AWS_SESSION_TOKEN="+.Token')

# Run the Docker container with these credentials passed through as environment variables
# Add --entrypoint /bin/bash if you want to inspect the container manually
set -x
docker run -d --gpus all --network sagemaker -it -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN \
--env-file output.env $REPO_NAME 

echo "Waiting for container to be ready..."
max_attempts=60  # Maximum number of attempts (10 minutes with 10-second intervals)
attempt=1

while [ $attempt -le $max_attempts ]; do
    echo "Attempt $attempt of $max_attempts: Checking if container is ready..."
    
    if curl -s -f http://localhost:8080/ping > /dev/null; then
        echo "Container is ready!"
        break
    fi
    
    if [ $attempt -eq $max_attempts ]; then
        echo "Container failed to become ready within the timeout period"
        # The container was started without --name, so look up the most recently created one
        docker logs "$(docker ps -lq)"
        exit 1
    fi
    
    attempt=$((attempt + 1))
    sleep 10
done

# Test chat completions endpoint
echo -e "\nTesting chat completions endpoint..."
curl -X POST http://localhost:8080/invocations \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ],
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "temperature": 0.7,
        "max_tokens": 100
    }'

# Test completions endpoint
echo -e "\nTesting completions endpoint..."
curl -X POST http://localhost:8080/invocations/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Hello, how are you?",
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "temperature": 0.7,
        "max_tokens": 100
    }'



Overwriting run_container.sh


In [13]:
!bash run_container.sh

+ docker run -d --gpus all --network sagemaker -it -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN --env-file output.env vllm-sagemaker
f263bd928ea47ae5654cf9cf18a30796d677d8ad9bad5c33b1b04a13d117a037
+ echo 'Waiting for container to be ready...'
Waiting for container to be ready...
+ max_attempts=60
+ attempt=1
+ '[' 1 -le 60 ']'
+ echo 'Attempt 1 of 60: Checking if container is ready...'
Attempt 1 of 60: Checking if container is ready...
+ curl -s -f http://localhost:8080/ping
+ '[' 1 -eq 60 ']'
+ attempt=2
+ sleep 10
+ '[' 2 -le 60 ']'
+ echo 'Attempt 2 of 60: Checking if container is ready...'
Attempt 2 of 60: Checking if container is ready...
+ curl -s -f http://localhost:8080/ping
+ '[' 2 -eq 60 ']'
+ attempt=3
+ sleep 10
+ '[' 3 -le 60 ']'
+ echo 'Attempt 3 of 60: Checking if container is ready...'
Attempt 3 of 60: Checking if container is ready...
+ curl -s -f http://localhost:8080/ping
+ '[' 3 -eq 60 ']'
+ attempt=4
+ sleep 10
+ '[' 4 -le 60 ']'
+ echo 'Atte

## Create a SageMaker Model and Deploy a SageMaker Endpoint

Finally, we'll create a SageMaker model and deploy it to an inference endpoint. This will give us an HTTPS endpoint that we can use for inference.

Note: We're using an ml.g5.2xlarge instance which provides GPU acceleration necessary for efficient inference with a small multimodal model.

For more throughput, lower latency, or a larger model, consider a larger instance type.

In [14]:
hf_model = "Qwen/Qwen2.5-VL-3B-Instruct"
sm_model_name = "qwen25vl3b"
sm_endpoint_name = "vllm-sagemaker-qwen"

print(f"Model name: {sm_model_name}")
print(f"Endpoint name: {sm_endpoint_name}")

Model name: qwen25vl3b
Endpoint name: vllm-sagemaker-qwen


Deploy our model to a SageMaker endpoint using an ml.g5.2xlarge instance. This GPU-enabled instance type provides the computational power needed for efficient inference with a Qwen2.5-VL model. The deployment:
- Creates a SageMaker model using our custom container
- Configures the endpoint with the specified resources
- Blocks until the deployment finishes (`wait=True`)
- Sets up an HTTPS endpoint for inference
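These steps correspond to three SageMaker API calls: `create_model`, `create_endpoint_config`, and `create_endpoint`. The sketch below builds the request payloads for those boto3 calls so the structure is visible; it is an illustration rather than what the SageMaker Python SDK executes verbatim. `build_deployment_requests` and `deploy_with_client` are hypothetical helpers, and the variant name `AllTraffic` and the reuse of the endpoint name for the endpoint config are assumptions.

```python
def build_deployment_requests(model_name, endpoint_name, image_uri, role_arn,
                              environment, instance_type="ml.g5.2xlarge",
                              instance_count=1):
    """Return keyword arguments for the three boto3 SageMaker calls."""
    return {
        # 1. Register the container image and its environment as a model.
        "create_model": {
            "ModelName": model_name,
            "ExecutionRoleArn": role_arn,
            "PrimaryContainer": {
                "Image": image_uri,
                "Environment": dict(environment),
            },
        },
        # 2. Describe the hardware the model should run on.
        "create_endpoint_config": {
            "EndpointConfigName": endpoint_name,  # assumption: same name as the endpoint
            "ProductionVariants": [{
                "VariantName": "AllTraffic",  # assumption: a single variant
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }],
        },
        # 3. Create the HTTPS endpoint from that configuration.
        "create_endpoint": {
            "EndpointName": endpoint_name,
            "EndpointConfigName": endpoint_name,
        },
    }


def deploy_with_client(sm_client, requests):
    """Issue the three calls in order against a boto3 SageMaker client."""
    sm_client.create_model(**requests["create_model"])
    sm_client.create_endpoint_config(**requests["create_endpoint_config"])
    sm_client.create_endpoint(**requests["create_endpoint"])
```

Unlike `model.deploy(..., wait=True)`, `create_endpoint` returns immediately, so code built on the raw calls would still need to poll the endpoint status until it is in service.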



In [18]:
# Main deployment logic
from utils.helpers import check_model_exists, check_endpoint_config_exists, check_endpoint_exists, delete_all_resources

endpoint_exists = check_endpoint_exists(endpoint_name=sm_endpoint_name, sm_client=sm_client)
model_exists = check_model_exists(sm_model_name, sm_client=sm_client)
config_exists = check_endpoint_config_exists(sm_endpoint_name, sm_client=sm_client)

if endpoint_exists or model_exists or config_exists:
    print(f"\nFound existing resources:")
    if endpoint_exists:
        print(f"- Endpoint: {sm_endpoint_name}")
    if model_exists:
        print(f"- Model: {sm_model_name}")
    if config_exists:
        print(f"- Endpoint config: {sm_endpoint_name}")
    
    delete_all_resources(sm_model_name, sm_endpoint_name, sm_client=sm_client)

# # Define environment variables for the model
# environment = {
#     # "USE_HF_TRANSFER": "true",  # Enable faster downloads
#     # "HF_HUB_ENABLE_HF_TRANSFER": "1",
#     "SM_VLLM_MODEL": hf_model, # you can name your model whatever you want
#     # "SM_VLLM_LIMIT_MM_PER_PROMPT": "image=2, video=0", # max number of images allowed in prompt. Increase for multi-page documents. Requires more memory.
#     # "SM_VLLM_MAX_NUM_SEQS":"8", # decrease if less GPU memory available
#     # "SM_VLLM_MAX_MODEL_LEN":"38608", # max context length, decrease if less GPU memory available
#     # "SM_VLLM_DTYPE": "bfloat16"
# }

# If we get here, either nothing existed or we've cleaned up
model = Model(
    image_uri=image_uri,
    role=role,
    sagemaker_session=session,
    name=sm_model_name,
    env=environment,
)

print(f"\nEndpoint is now being deployed.... This may take several minutes.")

# Deploy a new endpoint
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    # instance_type="local",
    endpoint_name=sm_endpoint_name,
    wait=True
)


Endpoint 'vllm-sagemaker-qwen' exists with status: Failed.

Found existing resources:
- Endpoint: vllm-sagemaker-qwen
- Model: qwen25vl3b
- Endpoint config: vllm-sagemaker-qwen
Endpoint 'vllm-sagemaker-qwen' exists with status: Failed.
Deleted existing endpoint: vllm-sagemaker-qwen
Deleted existing endpoint config: vllm-sagemaker-qwen
Deleted existing model: qwen25vl3b

Endpoint is now being deployed.... This may take several minutes.


-------------!
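The helpers imported from `utils.helpers` are not shown in this notebook. As a rough idea of what an existence check such as `check_endpoint_exists` might do, the sketch below calls `describe_endpoint` and treats a `ValidationException` error as "endpoint not found", which to our knowledge is how SageMaker reports a missing endpoint. `get_endpoint_status` is a hypothetical name, and the real helper may differ.

```python
def get_endpoint_status(sm_client, endpoint_name):
    """Return the endpoint's status string, or None if the endpoint does not exist."""
    try:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    except Exception as error:
        # botocore's ClientError carries the service error code in error.response.
        error_code = getattr(error, "response", {}).get("Error", {}).get("Code")
        if error_code == "ValidationException":  # assumption: signals a missing endpoint
            return None
        raise
    return response["EndpointStatus"]
```

A status of `Failed`, as in the output above, means the endpoint exists but must be deleted before an endpoint with the same name can be created again.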

## Next steps

After deploying the model as a SageMaker endpoint, we can call the model endpoint to run inference with the sample code in the next notebook [02_consume_model.ipynb](./02_consume_model.ipynb).