# re:Invent 2024 Workshop

# Using SageMaker Efficient Multi-Adapter Serving to host LoRA adapters at Scale

Multi-adapter serving lets you host multiple fine-tuned models cost-efficiently on a single endpoint. With a multi-adapter approach, one base LLM can handle many different tasks. In this example you will use a pre-trained LoRA adapter that was fine-tuned from Llama 3.1 8B Instruct on the [ECTSum dataset](https://huggingface.co/datasets/mrSoul7766/ECTSum).

You will also see how to dynamically load these adapters using [SageMaker Inference Components](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/). This example specifically explores the Inference Component Adapter feature, which lets you load hundreds of adapters on a SageMaker real-time endpoint.

![](./images/ic-adapter-architecture.png)

## Step 1: Setup

### Fetch and import dependencies 
Ignore incompatibility errors

In [1]:
%pip uninstall -q -y autogluon-multimodal autogluon-timeseries autogluon-features autogluon-common autogluon-core
%pip install boto3==1.35.68 --quiet --upgrade
%pip install sagemaker==2.235.2 --quiet --upgrade
%pip install -Uq datasets==3.0.0

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.13.3 requires botocore<1.34.163,>=1.34.70, but you have botocore 1.35.71 which is incompatible.
amazon-sagemaker-sql-magic 0.1.3 requires sqlparse==0.5.0, but you have sqlparse 0.5.1 which is incompatible.
langchain-aws 0.1.18 requires boto3<1.35.0,>=1.34.131, but you have boto3 1.35.68 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.2.3 which is incompatible.
virtualenv 20.21.0 requires platformdirs<4,>=2.4, but you have platform

## Restart kernel before continuing 
## Menu Bar > Kernel > Restart Kernel...

In [2]:
import sagemaker
import boto3
import json
import os
import sys
sys.path.append(os.path.dirname(os.getcwd()))

from utilities.helpers import (
    pretty_print_html, 
    set_meta_llama_params,
    print_dialog,
    format_messages,
    write_eula,
    read_eula
)

print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

boto3 version: 1.35.68
sagemaker version: 2.235.2


### Configure development environment and boto3 clients

In [3]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name

sm_client = boto3.client(service_name="sagemaker")
sm_runtime = boto3.client(service_name="sagemaker-runtime")

## Step 2: Create a model, endpoint configuration and endpoint

# License/EULA
# Please review [Llama LICENSE](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/LICENSE) before continuing!

In [4]:
from ipywidgets import Dropdown

eula_dropdown = Dropdown(
    options=["True", "False"],
    value="False",
    description="**Please accept Llama 3.1 8B Instruct EULA to continue:**",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(eula_dropdown)

Dropdown(description='**Please accept Llama 3.1 8B Instruct EULA to continue:**', index=1, layout=Layout(width…

In [6]:
llama_eula = f'{str(eula_dropdown.value.capitalize())}'
print(f"Your Llama 3.1 EULA attribute is set to 👉 {llama_eula}")

Your Llama 3.1 EULA attribute is set to 👉 True


In [7]:
_ = write_eula(llama_eula)

In [8]:
read_eula()

'True'

## For this workshop, the model artifacts have already been downloaded to S3 for you.

In [13]:
model_id_pathsafe = "llama3_1_8b_instruct"
s3_model_path = f"s3://{bucket}/sagemaker/models/base/{model_id_pathsafe}/"

s3_model_path

's3://sagemaker-us-east-1-975050153094/sagemaker/models/base/llama3_1_8b_instruct/'

### Select a Large Model Inference (LMI) container image

Select one of the [available Large Model Inference (LMI) container images for hosting](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). Efficient adapter inference capability is available in `0.31.0-lmi13.0.0` and higher. Ensure that you are using the image URI for the region that corresponds with your deployment region.

In [14]:
inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

print(f"Inference container image:: {inference_image_uri}")

Inference container image:: 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124


### Configure model container environment

Create a container environment for the hosting container. LMI container parameters can be found in the [LMI User Guides](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/index.html).

By using the `OPTION_MAX_LORAS` and `OPTION_MAX_CPU_LORAS` parameters, you can control how adapters are loaded and unloaded into GPU/CPU memory. The `OPTION_MAX_LORAS` parameter defines the number of adapters that will be held in GPU memory, and any additional adapters will be offloaded to CPU memory. The `OPTION_MAX_CPU_LORAS` parameter controls the number of adapters that will be held in CPU memory, with any additional adapters being offloaded to local SSD. In the following example, the container will hold 30 adapters in GPU memory, and 70 adapters in CPU memory.

```
env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "30",
    "OPTION_MAX_CPU_LORAS": "70",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}
```
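To make the tiering arithmetic concrete, the sketch below splits a number of registered adapters across GPU, CPU, and disk for a given pair of limits. `adapter_tier_counts` is a hypothetical helper written for this illustration; it is not part of the LMI container or the SageMaker SDK, and it assumes the tiers simply fill in order (GPU, then CPU, then disk) as described above.

```python
def adapter_tier_counts(num_adapters: int, max_loras: int, max_cpu_loras: int) -> dict:
    """Return how many adapters sit in each tier once all are registered.

    Assumption: tiers fill in order, GPU first, then CPU, then local disk.
    """
    if min(num_adapters, max_loras, max_cpu_loras) < 0:
        raise ValueError("counts and limits must be non-negative")
    in_gpu = min(num_adapters, max_loras)
    in_cpu = min(num_adapters - in_gpu, max_cpu_loras)
    on_disk = num_adapters - in_gpu - in_cpu
    return {"gpu": in_gpu, "cpu": in_cpu, "disk": on_disk}


# 120 adapters with the example limits of 30 (GPU) and 70 (CPU)
print(adapter_tier_counts(120, 30, 70))  # {'gpu': 30, 'cpu': 70, 'disk': 20}

# 3 adapters with both limits set to 1
print(adapter_tier_counts(3, 1, 1))      # {'gpu': 1, 'cpu': 1, 'disk': 1}
```

With both limits at `1`, a third adapter is enough to push one adapter all the way to disk, which makes tier swaps easy to trigger with only a few adapters.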


Later in this workshop you will test scenarios where you will force adapters to swap between different tiers. To make this easier, you will set the `OPTION_MAX_LORAS` property to `1` and the `OPTION_MAX_CPU_LORAS` to `1`. This will allow you to hold 1 adapter in GPU memory and 1 in CPU memory before moving adapters to disk.

In [15]:
env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "1",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

env

{'HF_MODEL_ID': 's3://sagemaker-us-east-1-975050153094/sagemaker/models/base/llama3_1_8b_instruct/',
 'OPTION_ROLLING_BATCH': 'lmi-dist',
 'OPTION_MAX_ROLLING_BATCH_SIZE': '16',
 'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
 'OPTION_ENABLE_LORA': 'true',
 'OPTION_MAX_LORAS': '1',
 'OPTION_MAX_CPU_LORAS': '1',
 'OPTION_DTYPE': 'fp16',
 'OPTION_MAX_MODEL_LEN': '6000'}

### Create a model object

With your container image and environment defined, you can create a SageMaker model object that you will use to create an inference component later.

In [16]:
model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")
print(f'Model name: {model_name}')

Model name: llama-3-1-8b-instruct-2024-11-29-19-00-31-830


In [17]:
create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image_uri, 
        "Environment": env,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model ARN: {model_arn}")

Created Model ARN: arn:aws:sagemaker:us-east-1:975050153094:model/llama-3-1-8b-instruct-2024-11-29-19-00-31-830


### Create an endpoint configuration

To create a SageMaker endpoint, you need an endpoint configuration. When using Inference Components, you do not specify a model in the endpoint configuration. You will load the model as a component later on.

In [18]:
# Set variant name and instance type for hosting
endpoint_config_name = f"{model_name}"
variant_name = "AllTraffic"
instance_type = "ml.g5.2xlarge"
model_data_download_timeout_in_seconds = 900
container_startup_health_check_timeout_in_seconds = 900

initial_instance_count = 1

In [19]:
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ]
)

print(f"Created Endpoint Config ARN: {create_endpoint_config_response['EndpointConfigArn']}")

Created Endpoint Config ARN: arn:aws:sagemaker:us-east-1:975050153094:endpoint-config/llama-3-1-8b-instruct-2024-11-29-19-00-31-830


### Create inference endpoint

Create your empty SageMaker endpoint. You will use this to host your base model and adapter inference components later.

In [20]:
endpoint_name = f"{model_name}"

print(f'Endpoint name: {endpoint_name}')

Endpoint name: llama-3-1-8b-instruct-2024-11-29-19-00-31-830


#### This step can take around 5 minutes

In [None]:
%%time

create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name, EndpointConfigName = endpoint_config_name
)

sess.wait_for_endpoint(endpoint_name)

print(f"Created Endpoint ARN: {create_endpoint_response['EndpointArn']}")

----

### Create base model inference component

With your endpoint created, you can now create the inference component (IC) for the base model. This is the base component that the adapter components you create later will depend on.

Notable parameters here are the [`ComputeResourceRequirements`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InferenceComponentComputeResourceRequirements.html). This component-level configuration determines the amount of resources the component needs (memory, vCPUs, accelerators). The adapters will share these resources with the base component.
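As a rough sanity check on these requirements, the sketch below estimates how many copies of a component could fit on one instance. `InstanceCapacity` and `copies_that_fit` are hypothetical names for this illustration, and the ml.g5.2xlarge figures (32 GiB of memory, 1 GPU) are assumptions to verify against the current instance specifications. SageMaker's actual placement logic also accounts for overhead that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class InstanceCapacity:
    memory_mb: int
    accelerators: int


def copies_that_fit(capacity: InstanceCapacity, min_memory_mb: int,
                    accelerators_per_copy: int) -> int:
    """Upper bound on component copies per instance, ignoring system overhead."""
    if min_memory_mb <= 0:
        raise ValueError("min_memory_mb must be positive")
    by_memory = capacity.memory_mb // min_memory_mb
    if accelerators_per_copy <= 0:
        return by_memory  # CPU-only component: memory is the only constraint here
    by_accelerators = capacity.accelerators // accelerators_per_copy
    return min(by_memory, by_accelerators)


# Assumed capacity for ml.g5.2xlarge: 32 GiB memory, 1 GPU
g5_2xlarge = InstanceCapacity(memory_mb=32 * 1024, accelerators=1)

# Values used in this workshop: 10000 MB and 1 accelerator per copy
print(copies_that_fit(g5_2xlarge, 10000, 1))  # 1
```

Under these assumptions the single GPU, not memory, is the limiting resource, which is consistent with a copy count of `1` for the base component.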


#### This step can take around 7 minutes

In [None]:
%%time

base_inference_component_name = f"base-{model_name}"
print(f"Base inference component name: {base_inference_component_name}")

variant_name = "AllTraffic"

initial_copy_count = 1
min_memory_required_in_mb = 10000
number_of_accelerator_devices_required = 1

base_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = base_inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)

sess.wait_for_inference_component(base_inference_component_name)

print(f"\nCreated Base inference component ARN: {base_create_inference_component_response['InferenceComponentArn']}")

-------------!
Created Base inference component ARN: arn:aws:sagemaker:us-east-1:975050153094:inference-component/base-llama-3-1-8b-instruct-2024-11-29-19-00-31-830
CPU times: user 88 ms, sys: 9.39 ms, total: 97.4 ms
Wall time: 5min 43s


### View logs for the base inference component (and adapters after they're loaded)

In [24]:
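import urllib.parse  # import the submodule explicitly; a bare "import urllib" does not guarantee urllib.parse is loaded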

cw_path = urllib.parse.quote_plus(f'/aws/sagemaker/InferenceComponents/{base_inference_component_name}', safe='', encoding=None, errors=None)

print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

You can view your inference component logs here:

 https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fsagemaker%2FInferenceComponents%2Fbase-llama-3-1-8b-instruct-2024-11-29-19-00-31-830


### Create the Inference Components (ICs) for the adapters

In this example you’ll create a single adapter, but you could host up to hundreds of them per endpoint. They will need to be compressed and uploaded to S3.

The adapter package has the following files at the root of the archive with no sub-folders:

![](./images/adapter_files.png)
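If you prefer to build the archive from Python, the sketch below writes every file to the root of a `.tar.gz` and then checks that the result contains no sub-folders. `package_adapter` and `archive_is_flat` are hypothetical helpers for this illustration, and the two files created here are empty stand-ins rather than a real adapter.

```python
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def package_adapter(adapter_dir, archive_path):
    """Write every file in adapter_dir to the root of a .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(Path(adapter_dir).iterdir()):
            if path.is_file():
                tar.add(path, arcname=path.name)  # arcname drops parent folders
    return archive_path


def archive_is_flat(archive_path):
    """True if no member is nested in (or is) a sub-folder of the archive root."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if len(parts) > 1 or (member.isdir() and parts):
                return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "ectsum-adapter"
    adapter_dir.mkdir()
    # Placeholder files standing in for real adapter artifacts
    (adapter_dir / "adapter_config.json").write_text("{}")
    (adapter_dir / "adapter_model.safetensors").write_bytes(b"placeholder")

    archive = package_adapter(adapter_dir, Path(tmp) / "ectsum-adapter.tar.gz")
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
    flat = archive_is_flat(archive)

print(names, flat)
```

Running the check before uploading catches the common mistake of archiving the parent folder instead of its contents.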

For this example, an adapter was fine-tuned using QLoRA and [Fully Sharded Data Parallel (FSDP)](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-v2.html) on the training split of the [ECTSum dataset](https://huggingface.co/datasets/mrSoul7766/ECTSum). Training took 21 minutes on an ml.p4d.24xlarge instance and cost ~$13 using current [on-demand pricing](https://aws.amazon.com/sagemaker/pricing/).

#### Compress and copy local adapter to S3

In [25]:
ectsum_adapter_filename = "ectsum-adapter.tar.gz"
ectsum_adapter_local_path = "../adapters/ectsum-adapter/"
ectsum_adapter_s3_uri = f"s3://{bucket}/adapters/{ectsum_adapter_filename}"

!tar -cvzf {ectsum_adapter_filename} -C {ectsum_adapter_local_path} . 

!aws s3 cp ./{ectsum_adapter_filename} {ectsum_adapter_s3_uri}

./
./README.md
./adapter_config.json
./adapter_model.safetensors
./special_tokens_map.json
./tokenizer.json
./tokenizer_config.json
./training_args.bin
upload: ./ectsum-adapter.tar.gz to s3://sagemaker-us-east-1-975050153094/adapters/ectsum-adapter.tar.gz


### Create ECTSum adapter inference component

For each adapter you are going to deploy, you need to specify an `InferenceComponentName`, an `ArtifactUrl` with the S3 location of the adapter archive, and a `BaseInferenceComponentName` to create the connection between the base model IC and the new adapter ICs. You will repeat this process for each additional adapter.
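Because every adapter uses the same request shape, it can help to build the arguments in one place. The sketch below assembles the keyword arguments described above. `adapter_component_request` is a hypothetical helper, and the endpoint, component, and bucket names are placeholders.

```python
def adapter_component_request(name: str, endpoint_name: str,
                              base_component_name: str, artifact_s3_uri: str) -> dict:
    """Build keyword arguments for an adapter inference component request."""
    if not artifact_s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {artifact_s3_uri!r}")
    return {
        "InferenceComponentName": name,
        "EndpointName": endpoint_name,
        "Specification": {
            # Links the adapter to the base model inference component
            "BaseInferenceComponentName": base_component_name,
            "Container": {"ArtifactUrl": artifact_s3_uri},
        },
    }


# Placeholder names for illustration only
requests = [
    adapter_component_request(
        name=f"ic{i}-ectsum-demo",
        endpoint_name="demo-endpoint",
        base_component_name="base-demo",
        artifact_s3_uri="s3://example-bucket/adapters/ectsum-adapter.tar.gz",
    )
    for i in (1, 2, 3)
]
print([r["InferenceComponentName"] for r in requests])
```

Each dictionary can then be passed to the client as `sm_client.create_inference_component(**request)`.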

#### This step can take around 2 minutes

In [26]:
%%time

ic1_adapter_name = f"ic-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic1_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic1_adapter_name)

print(f"\nCreated Adapter inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

--!
Created Adapter inference component ARN: arn:aws:sagemaker:us-east-1:975050153094:inference-component/ic-ectsum-base-llama-3-1-8b-instruct-2024-11-29-19-00-31-830
CPU times: user 38.1 ms, sys: 4 ms, total: 42.1 ms
Wall time: 1min 4s


Look at the base inference component logs again.

They should show a line that looks like:

`Registered adapter <ADAPTER_NAME> from /opt/ml/models/ ... successfully`.

In [27]:
print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

You can view your inference component logs here:

 https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fsagemaker%2FInferenceComponents%2Fbase-llama-3-1-8b-instruct-2024-11-29-19-00-31-830


## Step 3: Invoking the Endpoint

First you will pull a random data point from the ECTSum test split. You'll use the `text` field to invoke the model and the `summary` field to compare with ground truth later.

In [40]:
from datasets import load_dataset
dataset_name = "mrSoul7766/ECTSum"

test_dataset = load_dataset(dataset_name, split="test")

# Due to GPU memory limitations on ml.g5.2xlarge, the max sequence length is limited to 6000 tokens,
# and some of the ECTSum samples are too large to fit.
# The token count is estimated as (number of characters / 4). This code loops until it gets
# a sample with an estimate of at most 5500 tokens so that inference does not throw errors.

valid_test_value = False
while not valid_test_value:
    test_item = test_dataset.shuffle().select(range(1))
    sample_size = len(test_item["text"][0])/4
    if sample_size > 5500:
        print(f'sample size {sample_size} > 5500, fetching new sample.')
    else:
        print(f'sample_size {sample_size}')
        valid_test_value = True

ground_truth_response = test_item["summary"]

sample size 5674.25 > 5500, fetching new sample.
sample_size 5045.5
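The sampling cell estimates token count as characters divided by 4 and retries until a sample fits. The sketch below shows a variant of the same idea that filters first and then picks, so it cannot loop forever if no sample fits. `estimate_tokens` and `pick_sample` are hypothetical helpers, and the 4-characters-per-token ratio is only a rough heuristic, not the Llama tokenizer's real count.

```python
import random


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate; a real tokenizer will give a different number."""
    return len(text) / chars_per_token


def pick_sample(texts, max_tokens: int = 5500, seed=None) -> str:
    """Pick a random text whose estimated token count is within max_tokens."""
    candidates = [t for t in texts if estimate_tokens(t) <= max_tokens]
    if not candidates:
        raise ValueError(f"no sample fits within {max_tokens} estimated tokens")
    return random.Random(seed).choice(candidates)


# Synthetic stand-ins for earnings call transcripts
texts = ["a" * 40000, "b" * 100]
sample = pick_sample(texts, seed=0)
print(len(sample), estimate_tokens(sample))  # 100 25.0
```

The 5500 threshold leaves some headroom under the 6000-token `OPTION_MAX_MODEL_LEN` for the prompt template and the generated tokens.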


Next you will build a prompt to invoke the model for earnings summarization, filling in the source text with a random item from the ECTSum dataset. 

In [41]:
prompt =f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
                You are an AI assistant trained to summarize earnings calls. Provide a concise summary of the call, capturing the key points and overall context. Focus on quarter over quarter revenue, earnings per share, changes in debt, highlighted risks, and growth opportunities.
                <|eot_id|><|start_header_id|>user<|end_header_id|>
                Summarize the following earnings call:

                {test_item["text"][0]}
                <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

### Plain base model with no adapters

To test the base model, specify the `EndpointName` for the endpoint you created earlier and the name of the base inference component as `InferenceComponentName` along with your prompt and other inference parameters in the `Body` parameter.
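Since the base model and each adapter are invoked with the same request shape, the arguments can be built by a small helper. The sketch below is one way to do that. `build_invoke_kwargs` and `parse_generated_text` are hypothetical names, and the response body shown is a hand-written stand-in, not real endpoint output.

```python
import json


def build_invoke_kwargs(endpoint_name: str, component_name: str,
                        prompt: str, **parameters) -> dict:
    """Build keyword arguments for invoke_endpoint on a given inference component."""
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": component_name,
        "Body": json.dumps({"inputs": prompt, "parameters": parameters}),
        "ContentType": "application/json",
    }


def parse_generated_text(raw_body: bytes) -> str:
    """Extract the generated text from a JSON response body."""
    return json.loads(raw_body.decode("utf8"))["generated_text"]


kwargs = build_invoke_kwargs(
    "demo-endpoint", "base-demo", "Summarize the following earnings call: ...",
    do_sample=True, top_p=0.9, temperature=0.9, max_new_tokens=125,
)
print(json.loads(kwargs["Body"])["parameters"])

# Hand-written stand-in for response["Body"].read()
print(parse_generated_text(b'{"generated_text": "q2 earnings per share $1.48."}'))
```

Switching between the base model and an adapter then only requires changing the `component_name` argument before calling `sm_runtime.invoke_endpoint(**kwargs)`.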

In [42]:
%%time

component_to_invoke = base_inference_component_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 125}
        }
    ),
    ContentType = "application/json",
)

base_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"]}\n\n')
print(f'Base Model Response:\n\n {base_response}\n')

Ground Truth:

 ['sees fy adjusted earnings per share $6.35 to $6.85.\nsees fy revenue $7.75 billion to $7.95 billion.\nq2 adjusted earnings per share $1.48 excluding items.']


Base Model Response:

 

**Oshkosh Corporation Q2 2021 Earnings Call Summary**

**Key Highlights:**

1. **Revenue**: Q2 2021 revenue was $1.9 billion, a 6.5% increase from the prior year and $140 million above expectations.
2. **Adjusted Earnings Per Share**: $1.48, exceeding expectations and prior-year results.
3. **Segment Performance**:
	* Access Equipment: Strong sales growth, driven by increased demand in Asia and North America.
	* Defense: Sales decreased due to lower FMTV sales, but backlog remains robust at $

CPU times: user 15.1 ms, sys: 334 μs, total: 15.4 ms
Wall time: 5.97 s


### Invoke ECTSum adapter

To invoke the adapter, use the adapter inference component name in your `invoke_endpoint` call.

In [43]:
%%time

component_to_invoke = ic1_adapter_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 125}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"]}\n\n')
print(f'Adapter Model Response:\n\n {adapter_response}\n')

Ground Truth:

 ['sees fy adjusted earnings per share $6.35 to $6.85.\nsees fy revenue $7.75 billion to $7.95 billion.\nq2 adjusted earnings per share $1.48 excluding items.']


Adapter Model Response:

 
                Here is a concise summary of the call:

                q2 earnings per share $1.48.
q2 sales $1.92 billion versus refinitiv ibes estimate of $1.86 billion.
sees fy adjusted earnings per share $6.35 to $6.85.
sees fy sales $7.75 billion to $7.95 billion.
q2 sales $1.92 billion.
sees fy sales $7.75 billion to $7.95 billion.
sees fy adjusted earnings per share $6.35 to $6.85.
sees fy sales $7

CPU times: user 4.36 ms, sys: 19 μs, total: 4.38 ms
Wall time: 6.21 s


### Compare outputs

Compare the outputs of the base model and the adapter to the ground truth. In this test, the base model's response is arguably better formatted, but the adapter's response is significantly closer to the ground truth, which is the goal. The metrics in the next section quantify this difference.

In [44]:
print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print("\n----------------------------------\n")
print(f'Base Model Response:\n\n {base_response}')
print("\n----------------------------------\n")
print(f'Adapter Model Response:\n\n {adapter_response}')

Ground Truth:

 sees fy adjusted earnings per share $6.35 to $6.85.
sees fy revenue $7.75 billion to $7.95 billion.
q2 adjusted earnings per share $1.48 excluding items.



----------------------------------

Base Model Response:

 

**Oshkosh Corporation Q2 2021 Earnings Call Summary**

**Key Highlights:**

1. **Revenue**: Q2 2021 revenue was $1.9 billion, a 6.5% increase from the prior year and $140 million above expectations.
2. **Adjusted Earnings Per Share**: $1.48, exceeding expectations and prior-year results.
3. **Segment Performance**:
	* Access Equipment: Strong sales growth, driven by increased demand in Asia and North America.
	* Defense: Sales decreased due to lower FMTV sales, but backlog remains robust at $

----------------------------------

Adapter Model Response:

 
                Here is a concise summary of the call:

                q2 earnings per share $1.48.
q2 sales $1.92 billion versus refinitiv ibes estimate of $1.86 billion.
sees fy adjusted earnings per s

To validate the true adapter performance, you can use a tool like [fmeval](https://github.com/aws/fmeval) to run an evaluation of summarization accuracy. This will calculate the METEOR, ROUGE, and BERTScore metrics for the adapter versus the base model. Doing so against the test split of ECTSum yields the following results:

![](./images/fmeval-overall.png)

The fine-tuned adapter shows a 59% increase in METEOR score, a 159% increase in ROUGE score, and an 8.6% increase in BERTScore. The following diagram shows the frequency distribution of scores for each metric, with the adapter scoring higher more often on all of them.

Since the adapter is already loaded into GPU memory, model latency is largely unaffected, with only a 2% difference between invoking the base model directly and invoking the adapter. If the adapter has to be loaded from CPU memory or disk, the first invocation incurs a cold-start delay while it is loaded to GPU.

![](./images/fmeval-histogram.png)
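For intuition about what overlap-based metrics such as ROUGE reward, the sketch below computes a simple unigram F1 between a candidate and a reference. This is a deliberately simplified stand-in written for illustration. It is not the fmeval implementation and will not reproduce the scores above. The two candidate strings are single lines excerpted from the responses shown earlier.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercased, whitespace-split tokens (a toy overlap metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "sees fy adjusted earnings per share $6.35 to $6.85."
adapter_line = "sees fy adjusted earnings per share $6.35 to $6.85."
base_line = "Adjusted Earnings Per Share: $1.48, exceeding expectations and prior-year results."

adapter_score = unigram_f1(adapter_line, reference)
base_score = unigram_f1(base_line, reference)
print(f"adapter: {adapter_score:.3f}  base: {base_score:.3f}")
```

The adapter line matches the terse phrasing of the reference, so it scores higher than the well-formatted base model line, which shares only a few tokens with it.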

### Upload a new ECTSum adapter artifact and update the live adapter inference component

Since adapters are managed as Inference Components, you can update them on a running endpoint. SageMaker handles unloading and deregistering the old adapter, then loading and registering the new adapter on every copy of the base IC across all instances that serve this endpoint. To update an adapter IC, call the `update_inference_component` API with the existing IC name and the S3 path to the new compressed adapter archive.

You can train a new adapter, or re-upload the existing adapter artifact to test this functionality.
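An update is asynchronous, so the notebook waits for the component to return to service before invoking it again. The sketch below shows roughly what such a wait loop can look like if you poll the status yourself. It assumes `describe_inference_component` returns an `InferenceComponentStatus` field with values such as `InService` and `Failed`, which you should confirm in the API reference. `wait_for_component` and `FakeSageMakerClient` are hypothetical names, and the fake client exists only so the sketch runs without AWS access.

```python
import time


def wait_for_component(sm_client, component_name, poll_seconds=15,
                       timeout_seconds=1800, sleep=time.sleep):
    """Poll until the inference component is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_inference_component(
            InferenceComponentName=component_name
        )
        status = description["InferenceComponentStatus"]  # assumed field name
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"{component_name} failed: {description.get('FailureReason')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{component_name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


class FakeSageMakerClient:
    """Offline stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self.calls = 0

    def describe_inference_component(self, InferenceComponentName):
        self.calls += 1
        return {"InferenceComponentStatus": self._statuses.pop(0)}


fake = FakeSageMakerClient(["Updating", "Updating", "InService"])
print(wait_for_component(fake, "ic-ectsum-demo", sleep=lambda seconds: None))
```

In the notebook itself, `sess.wait_for_inference_component` plays this role, so a hand-written loop is only needed if you want custom timeouts or logging.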

In [45]:
new_ectsum_adapter_s3_uri = f"s3://{bucket}/lora-adapters/new-ectsum-adapter.tar.gz"

!aws s3 cp ./ectsum-adapter.tar.gz {new_ectsum_adapter_s3_uri}

upload: ./ectsum-adapter.tar.gz to s3://sagemaker-us-east-1-975050153094/lora-adapters/new-ectsum-adapter.tar.gz


#### This step can take around 5 minutes

In [46]:
%%time

update_inference_component_response = sm_client.update_inference_component(
    InferenceComponentName = ic1_adapter_name,
    Specification={
        "Container": {
            "ArtifactUrl": new_ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic1_adapter_name)

print(f'Updated inference component adapter ARN: {update_inference_component_response["InferenceComponentArn"]}')

---------------!Updated inference component adapter ARN: arn:aws:sagemaker:us-east-1:975050153094:inference-component/ic-ectsum-base-llama-3-1-8b-instruct-2024-11-29-19-00-31-830
CPU times: user 103 ms, sys: 3.9 ms, total: 107 ms
Wall time: 5min 23s


If you view your inference component logs (link below), you will see log entries for the deregistration of the old adapter and the registration of the new one.

You should see something similar to:
`[INFO ] PyProcess - W-200-0d1e4741a42db26-stdout: [1,0]<stdout>:INFO::Unregistered adapter ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401 successfully`

`[INFO ] PyProcess - W-200-0d1e4741a42db26-stdout: [1,0]<stdout>:INFO::Registered adapter ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401 from /opt/ml/models/container_340043819279-ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401-1732570150851-MaeveWestworldService-1.0.9353.0 successfully`

In [47]:
print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

You can view your inference component logs here:

 https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fsagemaker%2FInferenceComponents%2Fbase-llama-3-1-8b-instruct-2024-11-29-19-00-31-830


### Retest with updated adapter

In [48]:
%%time

component_to_invoke = ic1_adapter_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 125}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print(f'Updated Adapter Model Response:\n\n {adapter_response}\n')

Ground Truth:

 sees fy adjusted earnings per share $6.35 to $6.85.
sees fy revenue $7.75 billion to $7.95 billion.
q2 adjusted earnings per share $1.48 excluding items.


Updated Adapter Model Response:

 
                Here is a concise summary of the call:

                compname posts q2 adjusted earnings per share $1.48.
q2 sales rose 6.5 percent to $1.9 billion.
q2 earnings per share $1.48.
sees q3 sales to rise about 40 percent from q3 of 2020.
qtrly consolidated sales rose 6.5 percent to $1.9 billion.
sees 2021 sales to be in the range of $7.75 billion to $7.95 billion.
sees 2021 adjusted earnings per share to be in

CPU times: user 19.1 ms, sys: 0 ns, total: 19.1 ms
Wall time: 6.3 s


## Step 4: Swapping adapters between GPU/CPU/disk

To illustrate the swapping of adapters between tiers, you will create two more adapter inference components. For simplicity, you can reuse the same adapter artifact from earlier.

When you register a new adapter, it is loaded into GPU memory. If this exceeds `OPTION_MAX_LORAS`, the least recently used (LRU) adapter is evicted from GPU to the CPU tier. If that move in turn exceeds `OPTION_MAX_CPU_LORAS`, the LRU adapter in CPU memory is evicted to disk.

Since you have set up `OPTION_MAX_LORAS` to `1` and `OPTION_MAX_CPU_LORAS` to `1` in the earlier section, the registration of IC2 in the next step will move the original adapter to CPU, and the subsequent registration of IC3 after that will move IC2 from GPU > CPU and IC1 from CPU > Disk. Invoking adapters not currently in GPU will incur a cold start penalty on the first invocation.
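The eviction behavior described above can be modeled with a small simulation. The sketch below is a toy model of the three tiers under the stated LRU rules. `TieredAdapterCache` is a hypothetical class for illustration only; it does not reflect how the LMI container is actually implemented, and it tracks placement only, not latency.

```python
class TieredAdapterCache:
    """Toy model of GPU/CPU/disk adapter tiers with LRU eviction."""

    def __init__(self, max_gpu: int, max_cpu: int):
        self.max_gpu, self.max_cpu = max_gpu, max_cpu
        # Each list is ordered from most to least recently used
        self.gpu, self.cpu, self.disk = [], [], []

    def _promote(self, name: str) -> None:
        for tier in (self.gpu, self.cpu, self.disk):
            if name in tier:
                tier.remove(name)
        self.gpu.insert(0, name)
        while len(self.gpu) > self.max_gpu:   # evict LRU adapter from GPU to CPU
            self.cpu.insert(0, self.gpu.pop())
        while len(self.cpu) > self.max_cpu:   # evict LRU adapter from CPU to disk
            self.disk.insert(0, self.cpu.pop())

    def register(self, name: str) -> None:
        self._promote(name)

    def invoke(self, name: str) -> str:
        """Return where the adapter was found, then move it into GPU."""
        if name in self.gpu:
            source = "NONE"
        elif name in self.cpu:
            source = "FROM_CPU"
        elif name in self.disk:
            source = "FROM_DISK"
        else:
            raise KeyError(f"adapter {name!r} is not registered")
        self._promote(name)
        return source


cache = TieredAdapterCache(max_gpu=1, max_cpu=1)
for adapter in ("ic1", "ic2", "ic3"):
    cache.register(adapter)
print(cache.gpu, cache.cpu, cache.disk)  # ['ic3'] ['ic2'] ['ic1']

for adapter in ("ic3", "ic2", "ic2", "ic1"):
    print(adapter, cache.invoke(adapter), cache.gpu, cache.cpu, cache.disk)
```

After the three registrations the model places IC3 in GPU, IC2 in CPU, and IC1 on disk, matching the sequence described above.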

In [49]:
%%time

ic2_adapter_name = f"ic2-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic2_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic2_adapter_name)

print(f"\nCreated Adapter 2 inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

---!
Created Adapter 2 inference component ARN: arn:aws:sagemaker:us-east-1:975050153094:inference-component/ic2-ectsum-base-llama-3-1-8b-instruct-2024-11-29-19-00-31-830
CPU times: user 41.9 ms, sys: 495 μs, total: 42.4 ms
Wall time: 1min 21s


In [51]:
%%time

ic3_adapter_name = f"ic3-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic3_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic3_adapter_name)

print(f"\nCreated Adapter 3 inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

----!
Created Adapter 3 inference component ARN: arn:aws:sagemaker:us-east-1:975050153094:inference-component/ic3-ectsum-base-llama-3-1-8b-instruct-2024-11-29-19-00-31-830


In [None]:
import time

#tiers 0 - GPU, 1 - CPU, 2 - Disk
tiers = [ ic3_adapter_name, ic2_adapter_name, ic1_adapter_name ]

cycles = 2

invocation_order = [
    #base_inference_component_name, #baseline
    ic3_adapter_name, #ic3 in GPU already.  GPU: ic3 CPU: ic2 DISK: ic1
    ic2_adapter_name, #swap ic2 from CPU.   GPU: ic2 CPU: ic3 DISK: ic1
    ic2_adapter_name, #ic2 is still in GPU. GPU: ic2 CPU: ic3 DISK: ic1
    ic1_adapter_name, #swap ic1 from disk.  GPU: ic1 CPU: ic2 DISK: ic3
    ic1_adapter_name, #ic1 is still in GPU. GPU: ic1 CPU: ic2 DISK: ic3
    ic2_adapter_name, #swap ic2 from CPU.   GPU: ic2 CPU: ic1 DISK: ic3
    ic3_adapter_name, #swap ic3 from disk.  GPU: ic3 CPU: ic2 DISK: ic1
    # back to the starting configuration
]

no_swaps = []
cpu_swaps = []
disk_swaps = []

swap_type = 0

for cycle in range(cycles):
    for invocation in invocation_order:
    
        if invocation == base_inference_component_name or tiers.index(invocation) == 0:
            # base model, or adapter already resident in GPU: no swap needed
            swap_type = "NONE"
        elif tiers.index(invocation) == 1:
            tiers[1] = tiers[0]
            tiers[0] = invocation
            swap_type = "FROM_CPU"
        elif tiers.index(invocation) == 2:
            tiers[2] = tiers[1]
            tiers[1] = tiers[0]
            tiers[0] = invocation
            swap_type = "FROM_DISK"


        component_to_invoke = invocation
        
        start = time.time()*1000
        
        response_model = sm_runtime.invoke_endpoint(
            EndpointName = endpoint_name,
            InferenceComponentName = component_to_invoke,
            Body = json.dumps(
                {
                    "inputs": prompt,
                    "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 1}
                }
            ),
            ContentType = "application/json",
        )
    
        end = time.time()*1000

        total = int(end - start)
        
        if swap_type == "NONE":
            no_swaps.append(total)
        elif swap_type == "FROM_CPU":
            cpu_swaps.append(total)
        elif swap_type == "FROM_DISK":
            disk_swaps.append(total)
    
        print(f'call to [{invocation.split("-")[0]}] {total} ms. swap: [{swap_type}] [ GPU: {tiers[0].split("-")[0]} CPU: {tiers[1].split("-")[0]} Disk: {tiers[2].split("-")[0]} ]')

no_swaps_count = len(no_swaps)
no_swaps_avg = int(sum(no_swaps)/len(no_swaps))
cpu_swaps_count = len(cpu_swaps)
cpu_swaps_avg = int(sum(cpu_swaps)/len(cpu_swaps))
disk_swaps_count = len(disk_swaps)
disk_swaps_avg = int(sum(disk_swaps)/len(disk_swaps))

call to [ic3] 1185 ms. swap: [NONE] [ GPU: ic3 CPU: ic2 Disk: ic ]
call to [ic2] 1244 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic3 Disk: ic ]
call to [ic2] 1174 ms. swap: [NONE] [ GPU: ic2 CPU: ic3 Disk: ic ]
call to [ic] 1237 ms. swap: [FROM_DISK] [ GPU: ic CPU: ic2 Disk: ic3 ]
call to [ic] 1173 ms. swap: [NONE] [ GPU: ic CPU: ic2 Disk: ic3 ]
call to [ic2] 1269 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic Disk: ic3 ]
call to [ic3] 1236 ms. swap: [FROM_DISK] [ GPU: ic3 CPU: ic2 Disk: ic ]
call to [ic3] 1202 ms. swap: [NONE] [ GPU: ic3 CPU: ic2 Disk: ic ]
call to [ic2] 1216 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic3 Disk: ic ]
call to [ic2] 1173 ms. swap: [NONE] [ GPU: ic2 CPU: ic3 Disk: ic ]
call to [ic] 1242 ms. swap: [FROM_DISK] [ GPU: ic CPU: ic2 Disk: ic3 ]
call to [ic] 1175 ms. swap: [NONE] [ GPU: ic CPU: ic2 Disk: ic3 ]
call to [ic2] 1240 ms. swap: [FROM_CPU] [ GPU: ic2 CPU: ic Disk: ic3 ]
call to [ic3] 1238 ms. swap: [FROM_DISK] [ GPU: ic3 CPU: ic2 Disk: ic ]
call to [ic3] 1178 ms. swap: [

In [94]:
import pandas as pd
import numpy as np
np_no_swaps = np.array(no_swaps)
np_cpu_swaps = np.array(cpu_swaps)
np_disk_swaps = np.array(disk_swaps)

data = {
    "count": [no_swaps_count, cpu_swaps_count, disk_swaps_count],
    "average": [no_swaps_avg, cpu_swaps_avg, disk_swaps_avg],
    "+latency avg": [0, cpu_swaps_avg-no_swaps_avg, disk_swaps_avg-no_swaps_avg],
    "+%latency avg": [0, ((cpu_swaps_avg-no_swaps_avg)/no_swaps_avg)*100, ((disk_swaps_avg-no_swaps_avg)/no_swaps_avg)*100],
    "p50": [int(np.percentile(np_no_swaps, 50)), int(np.percentile(np_cpu_swaps, 50)), int(np.percentile(np_disk_swaps, 50))],
    "p75": [int(np.percentile(np_no_swaps, 75)), int(np.percentile(np_cpu_swaps, 75)), int(np.percentile(np_disk_swaps, 75))],
    "p99": [int(np.percentile(np_no_swaps, 99)), int(np.percentile(np_cpu_swaps, 99)), int(np.percentile(np_disk_swaps, 99))]
}


df = pd.DataFrame(data, index = ["no swap", "cpu swap", "disk swap"])

df

| | count | average | +latency avg | +%latency avg | p50 | p75 | p99 |
|---|---|---|---|---|---|---|---|
| no swap | 1500 | 1135 | 0 | 0.0 | 1127 | 1131 | 1299 |
| cpu swap | 1000 | 1189 | 54 | 4.757709 | 1186 | 1190 | 1244 |
| disk swap | 1000 | 1190 | 55 | 4.845815 | 1187 | 1191 | 1243 |


If you were to run a similar test for 500 cycles, you'd see the following:
![](./images/adapter-load-latency.png)
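
The tier bookkeeping used in the latency test can also be replayed offline, without invoking the endpoint. The sketch below is a minimal stand-alone version of that logic: it assumes the same simplified model as the test (one adapter per tier, and the most recently invoked adapter moves to the GPU tier), which is the notebook's assumption rather than a readout of the endpoint's actual state. The helper name `simulate_tiers` and the short adapter names are hypothetical.

```python
def simulate_tiers(initial_tiers, invocations):
    """Replay invocations against a [GPU, CPU, Disk] tier list.

    Returns a list of (adapter, swap_type, tiers_after) tuples.
    This mirrors the bookkeeping in the latency test; it is an assumed
    model of adapter placement, not a query of real endpoint state.
    """
    tiers = list(initial_tiers)  # copy so the caller's list is not mutated
    labels = ["NONE", "FROM_CPU", "FROM_DISK"]  # indexed by tier position
    history = []
    for adapter in invocations:
        position = tiers.index(adapter)
        # Move the requested adapter to the front (GPU tier); every adapter
        # that was ahead of it shifts down by one tier.
        tiers.insert(0, tiers.pop(position))
        history.append((adapter, labels[position], tuple(tiers)))
    return history


# Hypothetical short names standing in for the three adapter ICs.
order = ["ic3", "ic2", "ic2", "ic1", "ic1", "ic2", "ic3"]
for adapter, swap, tiers in simulate_tiers(["ic3", "ic2", "ic1"], order):
    print(f"{adapter}: {swap:9} GPU={tiers[0]} CPU={tiers[1]} Disk={tiers[2]}")
```

Replaying the invocation order this way makes it easy to check that one cycle produces three calls with no swap, two swaps from CPU, and two swaps from disk, and that the tiers end in their starting configuration.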

## Step 5: Clean up the environment

To delete an adapter, call the `delete_inference_component` API with the name of the adapter's inference component (IC).

In [27]:
sess.delete_inference_component(ic1_adapter_name, wait = True)
sess.delete_inference_component(ic2_adapter_name, wait = True)
sess.delete_inference_component(ic3_adapter_name, wait = True)

print(f'Adapter Component {ic1_adapter_name} deleted.')
print(f'Adapter Component {ic2_adapter_name} deleted.')
print(f'Adapter Component {ic3_adapter_name} deleted.')

Deleting the base model IC also deletes any adapter ICs associated with it.

In [28]:
sess.delete_inference_component(base_inference_component_name, wait = True)

print(f'Base Component {base_inference_component_name} deleted.')
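
Before removing the endpoint itself, it can be useful to confirm that no inference components are still attached to it. The sketch below is one way to do that, assuming a boto3 SageMaker client and its `list_inference_components` call filtered with `EndpointNameEquals`; the helper name `list_remaining_components` is hypothetical and not part of the SageMaker SDK. Components that are still being deleted may continue to appear in the listing for a short time.

```python
def list_remaining_components(sm_client, endpoint_name):
    """Return the names of inference components still listed for an endpoint.

    `sm_client` is assumed to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker"). Results are paginated via NextToken.
    """
    names = []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sm_client.list_inference_components(**kwargs)
        names.extend(
            ic["InferenceComponentName"]
            for ic in page.get("InferenceComponents", [])
        )
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # fetch the next page


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# remaining = list_remaining_components(boto3.client("sagemaker"), endpoint_name)
# print(remaining or "No inference components remain on the endpoint.")
```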

Finally, delete the running endpoint, its endpoint configuration, and the model.

In [29]:
sess.delete_endpoint(endpoint_name)
print(f'Endpoint {endpoint_name} deleted.')

sess.delete_endpoint_config(endpoint_config_name)
print(f'Endpoint Configuration {endpoint_config_name} deleted.')

sess.delete_model(model_name)
print(f'Model {model_name} deleted.')