# Serve multiple LoRA adapters efficiently on SageMaker - vLLM

In this tutorial, we will learn how to serve many Low-Rank Adapters (LoRA) on top of the same base model efficiently on the same GPU. In order to do this, we'll deploy a [vLLM serving-based container](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html) to SageMaker Hosting. 

These are the steps we will take:

1. [Set up our environment](#setup)
2. [Build a new vLLM container image compatible with SageMaker, push it to Amazon ECR](#container)
3. [Download adapters from the HuggingFace Hub and upload them to S3](#download_adapter)
4. [Build LoRA modules manifest file](#manifest)
5. [Deploy the extended vLLM container to SageMaker](#deploy)
6. [Compare outputs of the base model and the adapter model](#compare)
7. [Benchmark our deployed endpoint under different traffic patterns - same adapter, and random access to many adapters](#benchmark)


## What is vLLM? 

vLLM is a fast and easy-to-use library for LLM inference and serving. It supports most state-of-the-art LLM serving optimizations, such as PagedAttention, FlashAttention, continuous batching and more.

Recently vLLM added support for efficient multi-LoRA serving, with one of the key features being support for different LoRA ranks in the same batch. This is important for users that tune each adapter's rank to its specific task and dataset to get the best overall performance (although the need for this is not definitive, see [here](https://arxiv.org/abs/2402.09353)).

You can read more about vLLM and its multi-LoRA serving feature [here](https://docs.vllm.ai/en/latest/models/lora.html).

<a id="setup"></a>
## Set up our environment

In [1]:
!pip install -U boto3 sagemaker huggingface_hub --quiet

In [2]:
import sagemaker
import boto3
sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region = sess.boto_region_name

print(f"sagemaker default S3 bucket: {sagemaker_session_bucket}")
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
sagemaker default S3 bucket: sagemaker-us-east-1-626723862963
sagemaker role arn: arn:aws:iam::626723862963:role/service-role/AmazonSageMaker-ExecutionRole-20231214T145077
sagemaker session region: us-east-1


<a id="container"></a>
## Build a new vLLM container image compatible with SageMaker, push it to Amazon ECR

This example includes a `Dockerfile` and `sagemaker_entrypoint.sh` in the `sagemaker_vllm` directory. Building this new container image makes vLLM compatible with SageMaker Hosting: it launches the server on port 8080 via the container's `ENTRYPOINT` instruction, and changes the relevant server routes from the original `/health` and `/v1/completions` to `/ping` and `/invocations`. [Here](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image) you can find the basic interfaces required to adapt any container for deployment on SageMaker Hosting.

We also make sure all relevant server configuration arguments (such as enabling the LoRA serving feature, LoRA module paths, etc.) are configurable via environment variables, so that our container entrypoint can be parametrized at runtime by SageMaker. vLLM does not support env var-based server config by default. You can find all possible configuration parameters within vLLM [Server args](https://github.com/vllm-project/vllm/blob/865732342b4e3b8a4ef38f28a2a5bdb87cf3f970/vllm/entrypoints/openai/cli_args.py#L25) and [AsyncEngine args](https://github.com/vllm-project/vllm/blob/865732342b4e3b8a4ef38f28a2a5bdb87cf3f970/vllm/engine/arg_utils.py#L12); if you want to expose other parameters, add them to the entrypoint script (`sagemaker_entrypoint.sh`), and make sure to pass them when launching the SageMaker Endpoint, as we will see in the next sections. A relevant one to expose for larger models would be [--tensor-parallel-size](https://github.com/vllm-project/vllm/blob/865732342b4e3b8a4ef38f28a2a5bdb87cf3f970/vllm/engine/arg_utils.py#L164C30-L164C52).

Let's analyze the Dockerfile and entrypoint script to understand how easy it is to adapt any serving framework (and vLLM in particular) to run on SageMaker Real-Time Hosting.


In [3]:
!pygmentize sagemaker_vllm/sagemaker_entrypoint.sh
!printf "\n\n\nEnd of entrypoint script ----------------"

#!/bin/bash

LORA_MODULES=$(<"$MODEL_DIR/$LORA_MODULES_MANIFEST_FILE")

LAUNCH_COMMAND="vllm.entrypoints.openai.api_server \
--port 8080 \
--model $HF_MODEL_ID \
--max-model-len $MAX_MODEL_LEN \
--enable-lora \
--lora-modules $LORA_MODULES \
--max-loras $MAX_GPU_LORAS \
--max-cpu-loras $MAX_CPU_LORAS \
--max-num-seqs $MAX_NUM_SEQS"

# Check if ENFORCE_EAGER environment variable is 'true', append to l

Here are the relevant things to note from the previous cell output:
1. we force the server to run on port 8080, as required by SageMaker
2. selected base model is parameterized with `HF_MODEL_ID` env var
3. maximum allowed sequence length (input+output) is parameterized with `MAX_MODEL_LEN` env var; this is an important parameter, as many models' max len is larger than what a single A10G can hold, which would cause the server to error and exit at startup
4. maximum number of LoRA adapters that can run within the same batch on the GPU is parameterized with `MAX_GPU_LORAS` env var; the default in vLLM is 1, which would provide poor performance. To define an appropriate value for this parameter, you should take into consideration the total memory of the GPU you will be deploying on, memory required for each adapter, and the expected input/output lengths of the incoming payloads
5. maximum number of LoRA adapters that can be offloaded to CPU memory (RAM) for quick hotswapping is parameterized with `MAX_CPU_LORAS`. To define an appropriate value for this parameter, you should take into consideration the total RAM available in the instance type you will deploy on, and memory required for each adapter
6. maximum number of sequences that can be processed per iteration is parameterized with `MAX_NUM_SEQS`; it's important to tailor this to the GPU being used, as some GPU memory is pre-allocated based on its value
7. whether to enforce eager mode is parameterized with `ENFORCE_EAGER`; by default vLLM captures the model in CUDA graphs, which reduces latency but consumes an extra 1-3GB of memory, so CUDA graphs can be turned off by enforcing eager mode
8. the names (invocation target ids) and local paths for all LoRA adapters and their artifacts are listed within a manifest file (we will construct it later according to vLLM's `--lora-modules` [arg specification](https://docs.vllm.ai/en/latest/models/lora.html#serving-lora-adapters)), the name and directory of which we pass via the `LORA_MODULES_MANIFEST_FILE` and `MODEL_DIR` env vars
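
To make the mapping in the list above concrete, here is a minimal Python sketch of how the environment variables could be turned into server arguments. It is not the entrypoint script itself: the helper name `build_launch_args` and the example values are hypothetical, and appending `--enforce-eager` when `ENFORCE_EAGER` is `true` is an assumption based on the truncated comment at the end of the script.

```python
def build_launch_args(env, lora_modules):
    """Assemble server arguments from env vars, mirroring the entrypoint's flags."""
    args = [
        "--port", "8080",                          # fixed, as required by SageMaker
        "--model", env["HF_MODEL_ID"],
        "--max-model-len", env["MAX_MODEL_LEN"],
        "--enable-lora",
        "--lora-modules", *lora_modules.split(),   # NAME=PATH pairs from the manifest
        "--max-loras", env["MAX_GPU_LORAS"],
        "--max-cpu-loras", env["MAX_CPU_LORAS"],
        "--max-num-seqs", env["MAX_NUM_SEQS"],
    ]
    # Assumption: the script appends --enforce-eager only when the variable is 'true'.
    if env.get("ENFORCE_EAGER", "false").lower() == "true":
        args.append("--enforce-eager")
    return args


# Hypothetical values, chosen only for illustration.
example_env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.1",
    "MAX_MODEL_LEN": "4096",
    "MAX_GPU_LORAS": "20",
    "MAX_CPU_LORAS": "50",
    "MAX_NUM_SEQS": "100",
    "ENFORCE_EAGER": "true",
}
launch_args = build_launch_args(example_env, "1=/opt/ml/model/1/ 2=/opt/ml/model/2/")
print(" ".join(launch_args))
```

Every value arrives as a string because environment variables are strings; this is why the deployment configuration later serializes numbers and booleans before passing them.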

**Why do we have to pass all local directories for adapter artifacts in 8.?** --> At the time of writing, vLLM's LoRA serving feature does not allow for dynamic downloads of LoRA adapters from S3 or HF Hub as they are invoked. All adapters must be present locally on the underlying instance that the server runs on. That does not mean you have to include all the adapter artifacts in your container image, as this would be very rigid and unfriendly for image reusability. We will show you how downloading adapters from S3 can be done dynamically before the server starts up with the help of SageMaker in the next sections.

**Why a manifest file instead of just another environment variable?** --> SageMaker requires the JSON-encoded dictionary of env vars passed to the container to be under 1024 characters. This might not be enough to build the `--lora-modules` argument, especially as the number of adapters (i.e. modules) grows. With this in mind, we will build a manifest file that is downloaded and read into a variable before the vLLM server is started, sidestepping this limitation.
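
A quick way to see the problem is to measure the argument string directly. The sketch below builds the `--lora-modules` value for 50 adapters named `1` to `50` and compares the JSON-encoded size of passing it inline against passing only a manifest file name. The `LORA_MODULES` variable name and the helper `build_lora_modules_arg` are hypothetical, and the 1024-character figure is the limit quoted above.

```python
import json


def build_lora_modules_arg(adapter_names, model_dir="/opt/ml/model"):
    """Build the space-separated NAME=PATH string expected by --lora-modules."""
    return " ".join(f"{name}={model_dir}/{name}/" for name in adapter_names)


lora_modules_arg = build_lora_modules_arg(range(1, 51))

# Hypothetical alternative: pass the whole argument inline as one env var.
env_with_inline_modules = {"LORA_MODULES": lora_modules_arg}
# Approach used in this tutorial: pass only the manifest file name.
env_with_manifest = {"LORA_MODULES_MANIFEST_FILE": "lora_modules.txt"}

inline_len = len(json.dumps(env_with_inline_modules))
manifest_len = len(json.dumps(env_with_manifest))

print(f"--lora-modules argument: {len(lora_modules_arg)} characters")
print(f"inline env dict: {inline_len} characters, manifest env dict: {manifest_len} characters")
```

With 50 adapters the argument alone already exceeds 1024 characters, before any other env var is added, while the manifest-based dictionary stays small regardless of how many adapters are served.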


<div class="alert alert-block alert-info">
⚠️ The above approach is specific to the latest version of the vLLM container at the time of writing, and will likely change with updates to vLLM.
</div>

In [4]:
!pygmentize sagemaker_vllm/Dockerfile
!printf "\n\n\nEnd of Dockerfile ----------------"

ARG VERSION
FROM vllm/vllm-openai:$VERSION

# Make server compatible with SageMaker Hosting contract
RUN sed -i 's|/health|/ping|g' vllm/entrypoints/openai/api_server.py
RUN sed -i 's|/v1/completions|/invocations|g' vllm/entrypoints/openai/api_server.py

COPY sagemaker_entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]

In the output of the above cell, you can see that we:
* fix to a specific vLLM container version, which we pass when we build the image
* replace the `/health` and `/v1/completions` server routes by `/ping` and `/invocations` in the main server launch script, as required by SageMaker
* copy our entrypoint script to the container, and set it as the ENTRYPOINT command, as required by SageMaker

! NOTE !: if you change the vLLM base container version, check to make sure the string replacements above still work as intended, and the path to the main server launch script still holds
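
One way to run that check is to apply the same replacements in Python and fail loudly if a route string is missing, since `sed` silently does nothing when the pattern is absent. This is a sketch under assumptions: `rewrite_routes` is a hypothetical helper, and `sample_source` is a stand-in for the real `api_server.py`, which you would read from the container image instead.

```python
ROUTE_REPLACEMENTS = {
    "/health": "/ping",
    "/v1/completions": "/invocations",
}


def rewrite_routes(source):
    """Apply the Dockerfile's global replacements, failing if a route is absent."""
    missing = [old for old in ROUTE_REPLACEMENTS if old not in source]
    if missing:
        raise ValueError(f"routes not found, replacement would be a no-op: {missing}")
    for old, new in ROUTE_REPLACEMENTS.items():
        source = source.replace(old, new)
    return source


# Hypothetical stand-in for the route declarations in api_server.py.
sample_source = '@app.get("/health")\n@app.post("/v1/completions")\n'
patched = rewrite_routes(sample_source)
print(patched)
```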

## Activating Docker for JupyterLab in SageMaker Studio

Make sure to install Docker in SageMaker Studio JupyterLab. This can also be run in a terminal (File - New - Terminal).

In [5]:
%%bash
./setup_docker.sh

Reading package lists...
Building dependency tree...
Reading state information...
ca-certificates is already the newest version (20230311ubuntu0.22.04.1).
curl is already the newest version (7.81.0-1ubuntu1.16).
gnupg is already the newest version (2.2.27-3ubuntu2.1).
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.


gpg: cannot open '/dev/tty': No such device or address
curl: (23) Failure writing output to destination


Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
docker-ce-cli is already the newest version (5:20.10.24~3-0~ubuntu-jammy).
docker-compose-plugin is already the newest version (2.28.1-1~ubuntu.22.04~jammy).
0 upgraded, 0 newly installed, 0 to remove and 13 not upgraded.
Client: Docker Engine - Community
 Version:           20.10.24
 API version:       1.41
 Go version:        go1.19.7
 Git commit:        297e128
 Built:             Tue Apr  4 18:21:03 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.25
  API version:      1.41 (minimum vers



We are good to go! We build the new container image and push it to a new ECR repository. Note SageMaker [supports private Docker registries](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-containers-inference-private.html) as well.

In [14]:
%%bash -s {region}
algorithm_name="sagemaker-vllm"  # name of your algorithm
tag="v0.3.3"
region=$1

account=$(aws sts get-caller-identity --query Account --output text)

image_uri="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:${tag}"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" --region $region > /dev/null
fi

cd sagemaker_vllm/ && docker build --network=sagemaker --build-arg VERSION=$tag -t ${algorithm_name}:${tag} .

# Authenticate Docker to an Amazon ECR registry
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com

# Tag the image
docker tag ${algorithm_name}:${tag} ${image_uri}

# Push the image to the repository
docker push ${image_uri}

# Save image name to tmp file to use when deploying endpoint
echo $image_uri > /tmp/image_uri

Sending build context to Docker daemon  6.656kB
Step 1/8 : ARG VERSION
Step 2/8 : FROM vllm/vllm-openai:$VERSION
v0.3.3: Pulling from vllm/vllm-openai
aece8493d397: Pulling fs layer
45f7ea5367fe: Pulling fs layer
3d97a47c3c73: Pulling fs layer
12cd4d19752f: Pulling fs layer
da5a484f9d74: Pulling fs layer
5e5846364eee: Pulling fs layer
fd355de1d1f2: Pulling fs layer
3480bb79c638: Pulling fs layer
e7016935dd60: Pulling fs layer
0e5efb84a642: Pulling fs layer
089120e96587: Pulling fs layer
d0eb2e2560f1: Pulling fs layer
f57296e87535: Pulling fs layer
19a1b014b8cd: Pulling fs layer
a4618d76678f: Pulling fs layer
16b49417b04a: Pulling fs layer
e7016935dd60: Waiting
0e5efb84a642: Waiting
089120e96587: Waiting
d0eb2e2560f1: Waiting
f57296e87535: Waiting
19a1b014b8cd: Waiting
a4618d76678f: Waiting
16b49417b04a: Waiting
12cd4d19752f: Waiting
5e5846364eee: Waiting
fd355de1d1f2: Waiting
3480bb79c638: Waiting
da5a484f9d74: Waiting
aece8493d397: Verifying Checksum
aece8493d397: Download complete
45

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



Login Succeeded
The push refers to repository [626723862963.dkr.ecr.us-east-1.amazonaws.com/sagemaker-vllm]
ec691392a2a4: Preparing
359b0c42c430: Preparing
a39435363212: Preparing
e7e638fe62c8: Preparing
12947d07ed5f: Preparing
067e9baa9a0d: Preparing
214b6ff61148: Preparing
0fd53473730f: Preparing
ca336086e060: Preparing
c501b4875b93: Preparing
674396d66abf: Preparing
600c676771a0: Preparing
6ac15100dff6: Preparing
40f0eb1871b9: Preparing
8d113b7b997c: Preparing
cd77f58b80cd: Preparing
e4b1bddcbe63: Preparing
765423415d69: Preparing
7b9433fba79b: Preparing
256d88da4185: Preparing
674396d66abf: Waiting
600c676771a0: Waiting
6ac15100dff6: Waiting
40f0eb1871b9: Waiting
8d113b7b997c: Waiting
cd77f58b80cd: Waiting
e4b1bddcbe63: Waiting
765423415d69: Waiting
7b9433fba79b: Waiting
256d88da4185: Waiting
067e9baa9a0d: Waiting
214b6ff61148: Waiting
0fd53473730f: Waiting
c501b4875b93: Waiting
ca336086e060: Waiting
12947d07ed5f: Layer already exists
067e9baa9a0d: Layer already exists
214b6ff61148

<a id="download_adapter"></a>
## Download adapter from HuggingFace Hub and push it to S3

We are going to simulate storing our adapter weights on S3, and having SageMaker download them upfront when we provision the endpoint. This enables most scenarios, including deployment after you’ve finetuned your own adapters and pushed them to S3, as well as securing deployments with no internet access inside your VPC, as detailed in this [blog post](https://www.philschmid.de/sagemaker-llm-vpc#2-upload-the-model-to-amazon-s3).

We first download an adapter trained with Mistral Instruct v0.1 as the base model to a local directory. This particular adapter was trained on GSM8K, a grade school math dataset.

In [15]:
from pathlib import Path
from huggingface_hub import snapshot_download

HF_MODEL_ID = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
# create model dir
model_dir = Path('mistral-adapter')
model_dir.mkdir(exist_ok=True)

# Download model from Hugging Face into model_dir
snapshot_download(
    HF_MODEL_ID,
    local_dir=str(model_dir), # download to model dir
    local_dir_use_symlinks=False, # use no symlinks to save disk space
    revision="main", # use a specific revision, e.g. refs/pr/21
    cache_dir='/home/ec2-user/SageMaker/.cache/'
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

'/home/sagemaker-user/notebooks/sagemaker/03_multi_adapter_hosting_sagemaker_vllm/mistral-adapter'

We copy this same adapter `n_adapters` times to different S3 prefixes in our SageMaker session bucket, simulating a large number of adapters we want to serve on the same endpoint and underlying GPU. We name each leaf directory (the last part of the prefix) with an integer index, but you can change this to reflect the task or name of each adapter.

In [16]:
import os

s3 = boto3.client('s3')

def upload_folder_to_s3(local_path, s3_bucket, s3_prefix):
    for root, dirs, files in os.walk(local_path):
        for file in files:
            local_file_path = os.path.join(root, file)
            s3_object_key = os.path.join(s3_prefix, os.path.relpath(local_file_path, local_path))
            s3.upload_file(local_file_path, s3_bucket, s3_object_key)

# Upload the folder n_adapters times under different prefixes
n_adapters=50
base_prefix = 'vllm/mistral-adapters'
for i in range(1, n_adapters+1):
    prefix = f'{base_prefix}/{i}'
    upload_folder_to_s3(model_dir, sagemaker_session_bucket, prefix)
    print(f'Uploaded folder to S3: {sagemaker_session_bucket}/{prefix}')

Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/1
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/2
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/3
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/4
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/5
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/6
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/7
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/8
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/9
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/10
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/11
Uploaded folder to S3: sagemaker-us-east-1-626723862963/vllm/mistral-adapters/12
Uploaded folder to S3: sagemaker-us-e

<a id="manifest"></a>
## Build LoRA modules manifest file

Now, we build the manifest file. The `--lora-modules` arg syntax to follow is `MODULE_NAME_1=LOCAL_DIRECTORY_1 MODULE_NAME_2=LOCAL_DIRECTORY_2 ...`. 

To maintain consistency and ease of management, we want our module names (which we will invoke later) to match the names of the leaf directories on S3 that hold each adapter artifact. In order to do this, we list the contents of our `base_prefix` directory on S3, parse the leaf directory names, and use them to build the `--lora-modules` argument. In this case, the leaf names will be integers in the range of 1 to `n_adapters`; however, the following cells can be reused for any S3 prefix that holds directories with LoRA artifacts.  

In [17]:
def list_s3_leaf_directories(bucket_name, prefix):
    # Delimiter='/' groups keys by their next path component, returned as CommonPrefixes
    response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter='/')

    leaf_directories = []
    # Loop variable renamed so it no longer shadows the `prefix` argument
    for common_prefix in response.get('CommonPrefixes', []):
        directory_path = common_prefix['Prefix']
        directory_name = directory_path.rstrip('/').rsplit('/', 1)[-1]  # Extract the last part of the path, the leaf directory
        leaf_directories.append(directory_name)

    return leaf_directories


bucket_name = sagemaker_session_bucket
prefix = base_prefix+'/'
leaf_directories = list_s3_leaf_directories(bucket_name, prefix)
print(leaf_directories)

['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '5', '50', '6', '7', '8', '9']
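
Two details of the listing above are worth noting. The names come back in lexicographic order (`'10'` before `'2'`), which is harmless for the manifest, and a single `list_objects_v2` call returns at most 1000 entries. If you expect more adapters than that, a paginated variant along these lines may help. This is a sketch, not part of the original notebook: `list_leaf_directories` and `natural_sort` are hypothetical helpers, and the client is passed in explicitly.

```python
def list_leaf_directories(s3_client, bucket_name, prefix):
    """List leaf directory names under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    names = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter="/"):
        for common_prefix in page.get("CommonPrefixes", []):
            # Keep only the last path component, e.g. 'a/b/7/' -> '7'
            names.append(common_prefix["Prefix"].rstrip("/").rsplit("/", 1)[-1])
    return names


def natural_sort(names):
    """Sort numeric names by value, followed by non-numeric names alphabetically."""
    return sorted(names, key=lambda n: (0, int(n), "") if n.isdecimal() else (1, 0, n))


print(natural_sort(["1", "10", "11", "2", "math"]))
```

You would call it as `list_leaf_directories(s3, bucket_name, prefix)` with the same `boto3` client used in the cell above.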


Now, we build the manifest file and upload it to S3.

In [18]:
lora_modules_manifest_file = "lora_modules.txt"

with open(lora_modules_manifest_file, "w") as file:
    vllm_arg_pattern = " ".join([f"{leaf_dir}=/opt/ml/model/{leaf_dir}/" for leaf_dir in leaf_directories])
    file.write(vllm_arg_pattern)

# Upload to S3
s3.upload_file(lora_modules_manifest_file, sagemaker_session_bucket, f'{base_prefix}/{lora_modules_manifest_file}')

We can verify the file was uploaded to the correct directory (our `base_prefix`) in the following cell. If there is no output, the file is not where it should be, and you should verify the previous steps.

In [19]:
! aws s3 ls "s3://$sagemaker_session_bucket/$base_prefix/" | grep "$lora_modules_manifest_file"

2024-07-05 12:23:48       1031 lora_modules.txt
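
Beyond checking that the file exists, it can be useful to confirm that its content follows the `NAME=PATH` syntax before deploying, since a malformed manifest would only surface as a server startup failure. The sketch below parses a manifest string into a dictionary and rejects malformed or duplicate entries; `parse_lora_modules` is a hypothetical helper and the example string is a shortened stand-in for the real file.

```python
def parse_lora_modules(manifest_text):
    """Parse 'NAME=PATH NAME=PATH ...' into a {name: path} dict, validating entries."""
    modules = {}
    for entry in manifest_text.split():
        name, separator, path = entry.partition("=")
        if not separator or not name or not path:
            raise ValueError(f"malformed entry: {entry!r}")
        if name in modules:
            raise ValueError(f"duplicate module name: {name!r}")
        modules[name] = path
    return modules


# Shortened stand-in for the content of lora_modules.txt.
example_manifest = "1=/opt/ml/model/1/ 2=/opt/ml/model/2/"
modules = parse_lora_modules(example_manifest)
print(modules)
```

To check the real file, pass it the result of `Path(lora_modules_manifest_file).read_text()`.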


<a id="deploy"></a>
## Deploy SageMaker endpoint


Now we deploy a SageMaker endpoint, pointing to our `base_prefix` as the `model_data` parameter.

Let's dissect what is happening here:
* as explained in the SageMaker [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-load-artifacts), SageMaker downloads model artifacts under the provided `S3URI` to the `/opt/ml/model` directory; your container has read-only access to this directory
* by specifying that our data is in an `S3Prefix` and `CompressionType` is `None`, you do not need to tar.gz the `base_prefix` directory; SageMaker will download all the files and directories in our `base_prefix` in uncompressed format, replicating the S3 directory structure (i.e. the `base_prefix` dir structure will match the `/opt/ml/model` dir structure). This is why we pass `/opt/ml/model` as the `MODEL_DIR` in the next cell, and why we placed the lora manifest file in the root of the `base_prefix` 
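
The directory mapping described above can be sketched as a small function that translates an S3 object key under `base_prefix` into the path it would have inside the container. This only illustrates the mapping and is not SageMaker code: `container_path_for_key` is a hypothetical helper, and `adapter_config.json` is just an example file name.

```python
import posixpath


def container_path_for_key(s3_key, base_prefix, model_dir="/opt/ml/model"):
    """Map an S3 key under base_prefix to its location under model_dir."""
    prefix = base_prefix.rstrip("/") + "/"
    if not s3_key.startswith(prefix):
        raise ValueError(f"{s3_key!r} is not under {prefix!r}")
    # The part of the key after the prefix is replicated as-is under model_dir
    return posixpath.join(model_dir, s3_key[len(prefix):])


print(container_path_for_key("vllm/mistral-adapters/lora_modules.txt", "vllm/mistral-adapters"))
print(container_path_for_key("vllm/mistral-adapters/7/adapter_config.json", "vllm/mistral-adapters"))
```

The second path shows why the manifest entries point at `/opt/ml/model/<leaf_dir>/`: each adapter directory keeps its leaf name after the download.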


In [20]:
from huggingface_hub import login
login()
from pathlib import Path
hf_token = Path("/home/sagemaker-user/.cache/huggingface/token").read_text().strip()  # strip any trailing newline

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [21]:
import json
import datetime

from sagemaker import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Retrieve image_uri from tmp file
image_uri = !cat /tmp/image_uri
# Increased health check timeout to give time for model download
health_check_timeout = 800
# Endpoint configs
number_of_gpu = 1
instance_type = "ml.g5.xlarge"
endpoint_name = sagemaker.utils.name_from_base("sm-vllm")

# Env vars required for server launch
config = {
  'MODEL_DIR': '/opt/ml/model', # root dir for adapter dirs and manifest file
  'LORA_MODULES_MANIFEST_FILE': lora_modules_manifest_file, # manifest file name
  'HF_MODEL_ID': "mistralai/Mistral-7B-Instruct-v0.1", # model_id from hf.co/models
  'HUGGING_FACE_HUB_TOKEN': hf_token,
  'MAX_MODEL_LEN': json.dumps(4096),  # max sequence length (input + output tokens)
  'MAX_GPU_LORAS': json.dumps(20), # max number of adapters usable in single batch
  'MAX_CPU_LORAS': json.dumps(50), # max number of adapters that can be held in CPU mem
  'MAX_NUM_SEQS': json.dumps(100), # max number of sequences per iteration
  'ENFORCE_EAGER': json.dumps(False), # whether to turn off CUDA graphs and enforce eager mode (saves GPU mem) 
}

# Create SM Model, pass in model data as a whole prefix of uncompressed model artifacts
vllm_model = Model(
    image_uri=image_uri[0],
    model_data={
        'S3DataSource':{
            'S3Uri': f's3://{sagemaker_session_bucket}/{base_prefix}/',
            'S3DataType': 'S3Prefix',
            'CompressionType': 'None'}},
    env=config,
    role=role,
)

vllm_predictor = vllm_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

-----------------------------------*

UnexpectedStatusException: Error hosting endpoint sm-vllm-2024-07-05-12-23-48-834: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.. Try changing the instance type or reference the troubleshooting page https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html

In [None]:
# You can reinstantiate the Predictor object if you restart the notebook or Predictor is None
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
# After a notebook restart `endpoint_name` is no longer defined: set it to the name of your deployed endpoint
# endpoint_name = "<your-endpoint-name>"

vllm_predictor = Predictor(
    endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

<a id="compare"></a>
## Invoke base model and adapter, compare outputs

We can invoke the base Mistral model, as well as any of the adapters downloaded to our endpoint! vLLM takes care of loading them from local disk, continuously batching requests for different adapters, and managing GPU memory and CPU RAM by loading/offloading adapters.

Let’s inspect the difference between the base model’s response and the adapter’s response:

In [None]:
prompt = '[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]'

payload_base = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": prompt,
    "max_tokens": 64,
    "temperature":0
}


payload_adapter = {
    "model": "1",
    "prompt": prompt,
    "max_tokens": 64,
    "temperature":0
}

response_base = vllm_predictor.predict(payload_base)
response_adapter = vllm_predictor.predict(payload_adapter)

print(f'Base model output:\n-------------\n {response_base["choices"][0]["text"]}')
print(f'\nAdapter output:\n-------------\n {response_adapter["choices"][0]["text"]}')

You can also check out the full details of the response

In [None]:
response_adapter
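
If you only need a few fields from the response, a small helper can pull them out. The sketch below assumes the response follows the OpenAI-style completions format (a `choices` list plus a `usage` block); `summarize_completion` is a hypothetical helper, and `sample_response` is made-up data for illustration, not output from the endpoint.

```python
def summarize_completion(response):
    """Extract the generated text and a few metadata fields from a completion response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "model": response.get("model"),
        "text": choice["text"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Made-up response in the assumed format, for illustration only.
sample_response = {
    "model": "1",
    "choices": [{"index": 0, "text": " example text ", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 64, "total_tokens": 74},
}
summary = summarize_completion(sample_response)
print(summary)
```

A `finish_reason` of `length` would indicate the generation stopped because it reached `max_tokens`, which is worth checking when comparing base and adapter outputs with a small token budget.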

<a id="benchmark"></a>
## Benchmark single adapter vs. random access to adapters



First, we call each of the adapters once in sequence, so that they are already loaded into either GPU or CPU memory before the benchmark starts. We want to exclude disk read latency from the benchmark metrics.

In [None]:
from tqdm import tqdm

for i in tqdm(range(1,n_adapters+1)):
    payload_adapter = {
        "model": str(i),
        "prompt": prompt,
        "max_tokens": 64,
        "temperature":0
    }
    vllm_predictor.predict(payload_adapter)

Now we are ready to benchmark. For the single adapter case, we invoke the adapter `total_requests` times from `num_threads` concurrent clients.

For the multi-adapter case, we invoke a random adapter from any of the clients, until all adapters have been invoked `total_requests//num_adapters` times.

In [None]:
# Adjust if you run into connection pool errors
# import botocore

# Configure botocore to use a larger connection pool
# config = botocore.config.Config(max_pool_connections=100)

In [None]:
import threading
import time
import random


# Configuration
total_requests = 300
num_adapters = 50
num_threads = 20  # Adjust based on your system capabilities


# Shared lock and counters for # invocations of each adapter 
adapter_counters = [total_requests // num_adapters] * num_adapters
counters_lock = threading.Lock()

def invoke_adapter(aggregate_latency, single_adapter=False):
    global total_requests
    latencies = []
    while True:
        with counters_lock:
            if single_adapter:
                adapter_id = 1
                if total_requests > 0:
                    total_requests -= 1
                else:
                    break
            else:
                # Find an adapter that still needs to be called
                remaining_adapters = [i for i, count in enumerate(adapter_counters) if count > 0]
                if not remaining_adapters:
                    break
                adapter_id = random.choice(remaining_adapters) + 1
                adapter_counters[adapter_id - 1] -= 1

        prompt = '[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]'

        payload_adapter = {
            "model": str(adapter_id),
            "prompt": prompt,
            "max_tokens": 64,
            "temperature":0
        }
        start_time = time.time()
        response_adapter = vllm_predictor.predict(payload_adapter)
        latency = time.time() - start_time
        latencies.append(latency)

    aggregate_latency.extend(latencies)

def benchmark_scenario(single_adapter=False):
    threads = []
    all_latencies = []
    start_time = time.time()

    for _ in range(num_threads):
        thread_latencies = []
        all_latencies.append(thread_latencies)
        thread = threading.Thread(target=invoke_adapter, args=(thread_latencies, single_adapter))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    total_latency = sum([sum(latencies) for latencies in all_latencies])
    total_requests_made = sum([len(latencies) for latencies in all_latencies])
    average_latency = total_latency / total_requests_made
    throughput = total_requests_made / (time.time() - start_time)

    print(f"Total Time: {time.time() - start_time}s")
    print(f"Average Latency: {average_latency}s")
    print(f"Throughput: {throughput} requests/s")

# Run benchmarks
print("Benchmarking: Single Adapter Multiple Times")
benchmark_scenario(single_adapter=True)

print("\nBenchmarking: Multiple Adapters with Random Access")
benchmark_scenario()
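
Average latency can hide slow outliers, so you may also want tail percentiles when comparing the two scenarios. The sketch below uses the nearest-rank method on a flat list of latencies; `latency_percentile` is a hypothetical helper that is not part of the benchmark above, and `sample_latencies` is made-up data.

```python
import math


def latency_percentile(latencies, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    # Nearest rank: the smallest value with at least pct% of samples at or below it
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in seconds, with one slow outlier.
sample_latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 4.0]
p50 = latency_percentile(sample_latencies, 50)
p90 = latency_percentile(sample_latencies, 90)
print(f"p50: {p50}s, p90: {p90}s, max: {max(sample_latencies)}s")
```

To use it with the benchmark, flatten the per-thread lists first, for example `[latency for thread_latencies in all_latencies for latency in thread_latencies]`.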


<a id="cleanup"></a>
## Cleanup endpoint resources

In [None]:
vllm_predictor.delete_endpoint()