# Using SageMaker Efficient Multi-Adapter Serving to host LoRA adapters at Scale

Multi-adapter serving allows multiple fine-tuned models to be hosted cost-efficiently on a single endpoint. With a multi-adapter approach, you can tackle multiple different tasks with a single base LLM. In this example you will use a pre-trained LoRA adapter that was fine-tuned from Llama 3.1 8B Instruct on the [ECTSum dataset](https://huggingface.co/datasets/mrSoul7766/ECTSum).

You will also see how to dynamically load these adapters using [SageMaker Inference Components](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/). This example specifically explores the Inference Component Adapter feature, which allows you to load hundreds of adapters on a SageMaker real-time endpoint.

![](./images/ic-adapter-architecture.png)

## Step 1: Setup

### Fetch and import dependencies 
You can ignore any dependency incompatibility errors reported during installation.

In [None]:
%pip install -Uq datasets==3.0.0 --no-warn-conflicts

In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
#!sudo apt-get install git-lfs
#!git clone https://github.com/aws-samples/sagemaker-genai-hosting-examples.git

## Restart kernel before continuing 
## Menu Bar > Kernel > Restart Kernel...

In [None]:
import sagemaker
import boto3
import json

print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

### Configure development environment and boto3 clients

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name

sm_client = boto3.client(service_name="sagemaker")
sm_runtime = boto3.client(service_name="sagemaker-runtime")

## Step 2: Deploy a model to SageMaker IC-based endpoint

### Select a Large Model Inference (LMI) container image

Select one of the [available Large Model Inference (LMI) container images for hosting](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). Efficient adapter inference capability is available in `0.31.0-lmi13.0.0` and higher. Ensure that you are using the image URI for the region that corresponds with your deployment region.

In [None]:
#inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"
inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124"

print(f"Inference container image: {inference_image_uri}")
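Since efficient adapter inference requires `0.31.0-lmi13.0.0` or higher, a quick sanity check on the chosen tag can catch a mismatch before deployment. The following is a minimal sketch, not part of the SageMaker SDK: `lmi_supports_adapters` is a hypothetical helper, and it assumes the image tag starts with a `major.minor.patch` DJL version, as the tags above do.

```python
import re


def lmi_supports_adapters(image_uri, minimum=(0, 31, 0)):
    """Return True if the leading version in the image tag is >= minimum.

    Hypothetical helper: assumes the tag begins with 'major.minor.patch',
    for example '0.32.0-lmi14.0.0-cu124'.
    """
    tag = image_uri.rsplit(":", 1)[-1]
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"Could not parse a version from tag: {tag!r}")
    version = tuple(int(part) for part in match.groups())
    return version >= minimum


# Example with the tag format used in this notebook
example_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124"
print(lmi_supports_adapters(example_uri))
```

In the notebook you could call `lmi_supports_adapters(inference_image_uri)` right after setting the image URI.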

### Configure model container environment

Create a container environment for the hosting container. LMI container parameters can be found in the [LMI User Guides](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/index.html).

By using the `OPTION_MAX_LORAS` and `OPTION_MAX_CPU_LORAS` parameters, you can control how adapters are loaded and unloaded into GPU/CPU memory. The `OPTION_MAX_LORAS` parameter defines the number of adapters that will be held in GPU memory. The `OPTION_MAX_CPU_LORAS` parameter controls the number of adapters that will be held in CPU memory. It is important to note that adapters which are loaded to GPU have to be precached in CPU memory and will occupy space in the CPU cache. This means `OPTION_MAX_CPU_LORAS` should be set to `OPTION_MAX_LORAS + <number of adapters you want to cache in CPU>`. Any adapters beyond this will be offloaded to local SSD. 

In the following example, the container will hold 30 adapters in GPU memory, and 70 adapters in CPU memory. Out of the 70, 30 will be precached adapters that already reside in GPU, leaving you with 40 slots free.

```
env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "30",
    "OPTION_MAX_CPU_LORAS": "70",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}
```
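The relationship between the two settings is simple arithmetic, so it can be derived rather than hand-computed. The sketch below uses a hypothetical helper, `lora_cache_env`, that takes the number of adapters to keep in GPU memory and the number of additional adapters to cache in CPU memory, and returns the two environment values as strings.

```python
def lora_cache_env(gpu_adapters, extra_cpu_adapters):
    """Build the two LoRA cache settings (hypothetical helper).

    OPTION_MAX_CPU_LORAS must also cover the adapters precached for GPU,
    so it is the sum of both counts.
    """
    if gpu_adapters < 1 or extra_cpu_adapters < 0:
        raise ValueError("gpu_adapters must be >= 1 and extra_cpu_adapters >= 0")
    return {
        "OPTION_MAX_LORAS": str(gpu_adapters),
        "OPTION_MAX_CPU_LORAS": str(gpu_adapters + extra_cpu_adapters),
    }


# 30 adapters in GPU plus 40 additional CPU slots
print(lora_cache_env(30, 40))
```

The result can be merged into the container environment with `env.update(lora_cache_env(30, 40))`.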


Later in this workshop you will test scenarios where you will force adapters to swap between different tiers. To make this easier, you will set the `OPTION_MAX_LORAS` property to `1` and the `OPTION_MAX_CPU_LORAS` to `2`. This will allow you to hold 1 adapter in GPU memory and 1 in CPU memory (plus 1 precached from GPU) before moving adapters to disk.

---
You can deploy a model to a SageMaker endpoint from several sources:
- SageMaker JumpStart
- HuggingFace model hub
- Amazon S3 bucket
---

#### Please choose only ONE deployment option below

### Option 1: Deploy a model from SageMaker JumpStart

In [None]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id, model_version = "meta-textgeneration-llama-3-1-8b-instruct", "2.7.2"

model_name = endpoint_name = sagemaker.utils.name_from_base("test")
base_inference_component_name = "base-" + model_name

env = {
    "HF_MODEL_ID": "/opt/ml/model",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

jumpstart_model = JumpStartModel(model_id=model_id,
                                 model_version=model_version,
                                 name=model_name,
                                 image_uri=inference_image_uri,
                                 env=env)

jumpstart_model.deploy(
    accept_eula=True,
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    container_startup_health_check_timeout=900,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=base_inference_component_name,
    resources=ResourceRequirements(requests={"num_accelerators": 1, "memory": 4096, "copies": 1,}),
)

### Option 2: Deploy a model from HuggingFace model hub

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id = "meta-llama/Llama-3.1-8B-Instruct"

model_name = endpoint_name = sagemaker.utils.name_from_base("test")
base_inference_component_name = "base-" + model_name

env = {
    "HF_MODEL_ID": model_id,
    "HF_TOKEN": "<YOUR_HF_TOKEN>",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

lmi_model = sagemaker.Model(image_uri = inference_image_uri,
                            env = env,
                            role = role,
                            name = model_name)


lmi_model.deploy(instance_type = "ml.g5.2xlarge",
                 initial_instance_count = 1,
                 container_startup_health_check_timeout = 900,
                 endpoint_name = endpoint_name,
                 endpoint_type = sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
                 inference_component_name = base_inference_component_name,
                 resources = ResourceRequirements(requests={"num_accelerators": 1, "memory": 4096, "copies": 1}))

### Option 3: Deploy a model from an Amazon S3 bucket

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id = "s3://YOUR_BUCKET"

model_name = endpoint_name = sagemaker.utils.name_from_base("test")
base_inference_component_name = "base-" + model_name

env = {
    "HF_MODEL_ID": model_id,
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "1",
    "OPTION_MAX_CPU_LORAS": "2",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

lmi_model = sagemaker.Model(image_uri = inference_image_uri,
                            env = env,
                            role = role,
                            name = model_name)


lmi_model.deploy(instance_type = "ml.g5.2xlarge",
                 initial_instance_count = 1,
                 container_startup_health_check_timeout = 900,
                 endpoint_name = endpoint_name,
                 endpoint_type = sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
                 inference_component_name = base_inference_component_name,
                 resources = ResourceRequirements(requests={"num_accelerators": 1, "memory": 4096, "copies": 1}))

### View logs for the base inference component (and adapters after they're loaded)

In [None]:
import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded

cw_path = urllib.parse.quote_plus(f'/aws/sagemaker/InferenceComponents/{base_inference_component_name}', safe='', encoding=None, errors=None)

print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

### Create the Inference Components (ICs) for the adapters

In this example you’ll create a single adapter, but you could host up to hundreds of them per endpoint. They will need to be compressed and uploaded to S3.

The adapter package has the following files at the root of the archive with no sub-folders:

![](./images/adapter_files.png)

For this example, an adapter was fine-tuned using QLoRA and [Fully Sharded Data Parallel (FSDP)](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-v2.html) on the training split of the [ECTSum dataset](https://huggingface.co/datasets/mrSoul7766/ECTSum). Training took 21 minutes on an ml.p4d.24xlarge and cost ~$13 using current [on-demand pricing](https://aws.amazon.com/sagemaker/pricing/).

#### Compress and copy local adapter to S3

In [None]:
ectsum_adapter_filename = "ectsum-adapter.tar.gz"
ectsum_adapter_local_path = "~/sagemaker-genai-hosting-examples/genai-recipes/Multi-LoRA-Adapters/SM-Managed-Multi-Adapter-Deployment/adapters/ectsum-adapter/"
ectsum_adapter_s3_uri = f"s3://{bucket}/adapters/{ectsum_adapter_filename}"
print(ectsum_adapter_s3_uri)

!tar -cvzf {ectsum_adapter_filename} -C {ectsum_adapter_local_path} .

!aws s3 cp ./{ectsum_adapter_filename} {ectsum_adapter_s3_uri}
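Because the adapter files must sit at the root of the archive with no sub-folders, it can be worth checking the archive layout. The following is a minimal sketch using Python's standard `tarfile` module; `nested_members` is a hypothetical helper and is not part of the SageMaker tooling.

```python
import tarfile


def nested_members(archive_path):
    """Return archive entries that are not plain files at the archive root."""
    offenders = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = member.name
            if name.startswith("./"):
                name = name[2:]  # `tar -C <dir> .` prefixes entries with "./"
            if name in ("", "."):
                continue  # the root directory entry itself
            if "/" in name or not member.isfile():
                offenders.append(member.name)
    return offenders


# Usage in the notebook (an empty list means the archive is flat):
# print(nested_members(ectsum_adapter_filename))
```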

### Create ECTSum adapter inference component

For each adapter you are going to deploy, you need to specify an `InferenceComponentName`, an `ArtifactUrl` with the S3 location of the adapter archive, and a `BaseInferenceComponentName` to create the connection between the base model IC and the new adapter ICs. You will repeat this process for each additional adapter.

#### This step can take around 2 minutes

In [None]:
%%time

ic1_adapter_name = f"ic1-ectsum-{model_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic1_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic1_adapter_name)

print(f"\nCreated Adapter inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")
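When you have several adapters, the same request shape repeats with only the name and artifact location changing. The sketch below builds the keyword arguments for each `create_inference_component` call from a mapping of adapter IC names to S3 URIs. `adapter_component_requests` is a hypothetical helper, and the names and URIs in the example are placeholders.

```python
def adapter_component_requests(endpoint_name, base_component_name, adapters):
    """Build create_inference_component kwargs for each adapter.

    adapters: mapping of adapter IC name -> S3 URI of the adapter archive.
    """
    return [
        {
            "InferenceComponentName": name,
            "EndpointName": endpoint_name,
            "Specification": {
                "BaseInferenceComponentName": base_component_name,
                "Container": {"ArtifactUrl": s3_uri},
            },
        }
        for name, s3_uri in adapters.items()
    ]


# Placeholder names and URIs for illustration only
requests = adapter_component_requests(
    "my-endpoint",
    "base-my-model",
    {
        "adapter-a": "s3://my-bucket/adapters/adapter-a.tar.gz",
        "adapter-b": "s3://my-bucket/adapters/adapter-b.tar.gz",
    },
)
print(len(requests))

# In the notebook, each request could then be submitted with:
# for request in requests:
#     sm_client.create_inference_component(**request)
#     sess.wait_for_inference_component(request["InferenceComponentName"])
```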

Look at the base inference component logs again.

It should show a line that looks like:

`Registered adapter <ADAPTER_NAME> from /opt/ml/models/ ... successfully`.

In [None]:
print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

## Step 3: Invoking the Endpoint

First you will pull a random datapoint from the ECTSum test split. You'll use the `text` field to invoke the model and the `summary` field to compare with ground truth later.

In [None]:
from datasets import load_dataset
dataset_name = "mrSoul7766/ECTSum"

test_dataset = load_dataset(dataset_name, split="test")

# Due to GPU memory limitations on ml.g5.2xlarge, the max sequence length is limited to 6000 tokens,
# and some of the ECTSum samples are too large.
# Token count is estimated as characters / 4. The loop repeats until it draws a sample
# estimated at 5500 tokens or fewer so that inference does not throw errors.

valid_test_value = False
while not valid_test_value:
    test_item = test_dataset.shuffle().select(range(1))
    sample_size = len(test_item["text"][0])/4
    if sample_size > 5500:
        print(f'sample size {sample_size} > 5500, fetching new sample.')
    else:
        print(f'sample_size {sample_size}')
        valid_test_value = True

ground_truth_response = test_item["summary"]

Next you will build a prompt to invoke the model for earnings summarization, filling in the source text with a random item from the ECTSum dataset. 

In [None]:
prompt =f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
                You are an AI assistant trained to summarize earnings calls. Provide a concise summary of the call, capturing the key points and overall context. Focus on quarter over quarter revenue, earnings per share, changes in debt, highlighted risks, and growth opportunities.
                <|eot_id|><|start_header_id|>user<|end_header_id|>
                Summarize the following earnings call:

                {test_item["text"][0]}
                <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

### Plain base model with no adapters

To test the base model, specify the `EndpointName` for the endpoint you created earlier and the name of the base inference component as `InferenceComponentName` along with your prompt and other inference parameters in the `Body` parameter.

In [None]:
%%time

component_to_invoke = base_inference_component_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 125}
        }
    ),
    ContentType = "application/json",
)

base_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print(f'Base Model Response:\n\n {base_response}\n')

### Invoke ECTSum adapter

To invoke the adapter, use the adapter inference component name in your `invoke_endpoint` call.

In [None]:
%%time

component_to_invoke = ic1_adapter_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 125}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print(f'Adapter Model Response:\n\n {adapter_response}\n')
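The base model and adapter invocations differ only in the `InferenceComponentName`. As a minimal sketch, the request can be built by a small helper so that the target component is the only thing that changes. `invoke_request` is a hypothetical name, and the endpoint and component names in the example are placeholders.

```python
import json


def invoke_request(endpoint_name, component_name, prompt, max_new_tokens=125):
    """Build invoke_endpoint kwargs targeting one inference component."""
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": component_name,
        "Body": json.dumps(
            {
                "inputs": prompt,
                "parameters": {
                    "do_sample": True,
                    "top_p": 0.9,
                    "temperature": 0.9,
                    "max_new_tokens": max_new_tokens,
                },
            }
        ),
        "ContentType": "application/json",
    }


# Placeholder names for illustration only
request = invoke_request("my-endpoint", "my-adapter-ic", "Summarize: ...")
print(request["InferenceComponentName"])

# In the notebook:
# response = sm_runtime.invoke_endpoint(**invoke_request(endpoint_name, ic1_adapter_name, prompt))
```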

### Compare outputs

Compare the outputs of the base model and the adapter to the ground truth. In this test, notice that while the base model's output may look more polished, the adapter's response is significantly closer to the ground truth, which is the goal. The metrics in the next section quantify this difference.

In [None]:
print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print("\n----------------------------------\n")
print(f'Base Model Response:\n\n {base_response}')
print("\n----------------------------------\n")
print(f'Adapter Model Response:\n\n {adapter_response}')

To measure adapter performance more rigorously, you can use a tool like [fmeval](https://github.com/aws/fmeval) to run an evaluation of summarization accuracy. This will calculate the METEOR, ROUGE, and BERTScore metrics for the adapter versus the base model. Doing so against the test split of ECTSum yields the following results:

![](./images/fmeval-overall.png)

The fine-tuned adapter shows a 59% increase in METEOR score, a 159% increase in ROUGE score, and an 8.6% increase in BERTScore. The following diagram shows the frequency distribution of scores for each metric, with the adapter scoring higher more often across all of them.

Since the adapter is already loaded into GPU memory, model latency is largely unaffected, with only a difference of 2% between direct base model invocation and the adapter. If the adapter is loaded from CPU memory or disk, it will incur a cold start delay for the first load to GPU.

![](./images/fmeval-histogram.png)

## Step 4: Swapping adapters between GPU/CPU/disk

To illustrate the swapping of adapters between different tiers, you will create 2 more adapter inference components. For simplicity, you can reuse the same adapter artifact from earlier.

When registering new adapters, the newest registration moves into GPU and if `OPTION_MAX_LORAS` is exceeded, will evict the least recently used (LRU) adapter to the CPU tier. If this move causes `OPTION_MAX_CPU_LORAS` to be exceeded, the LRU adapter from the CPU is then evicted to disk.

Since you have set up `OPTION_MAX_LORAS` to `1` and `OPTION_MAX_CPU_LORAS` to `2` in the earlier section, the registration of IC2 in the next step will:
- precache IC2 in CPU
- load IC2 in GPU
- evict IC1 to CPU

The subsequent registration of IC3 will:
- precache IC3 to CPU
- evict IC1 from CPU (available from disk)
- load IC3 in GPU
- evict IC2 from GPU (already precached in CPU)
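The registration sequence above can be sketched as a single most-recently-used list that is sliced into tiers. This is a simplified mental model, not the container's actual implementation: `touch` and `tiers` are hypothetical helpers, and the model assumes strict LRU ordering in which every GPU-resident adapter also occupies a CPU slot.

```python
def touch(lru_order, adapter):
    """Return a new most-recently-used-first list after registering or invoking `adapter`."""
    return [adapter] + [name for name in lru_order if name != adapter]


def tiers(lru_order, max_loras, max_cpu_loras):
    """Split a most-recently-used-first list into GPU, CPU-only, and disk tiers."""
    gpu = lru_order[:max_loras]
    cpu_only = lru_order[max_loras:max_cpu_loras]  # CPU slots not taken by GPU precaching
    disk = lru_order[max_cpu_loras:]
    return gpu, cpu_only, disk


order = []
for adapter in ["ic1", "ic2", "ic3"]:  # registration order used in this step
    order = touch(order, adapter)
    gpu, cpu_only, disk = tiers(order, max_loras=1, max_cpu_loras=2)
    print(f"after {adapter}: GPU={gpu} CPU={cpu_only} disk={disk}")
```

After all three registrations the model ends with IC3 in GPU, IC2 in CPU, and IC1 on disk.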

Invoking adapters not currently in GPU will incur a cold start penalty on the first invocation. `max_new_tokens` is set to `1` to focus on the cold start impact.

In [None]:
%%time

ic2_adapter_name = f"ic2-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic2_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic2_adapter_name)

print(f"\nCreated Adapter 2 inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

In [None]:
%%time

ic3_adapter_name = f"ic3-ectsum-{base_inference_component_name}"

adapter_create_inference_component_response = sm_client.create_inference_component(
    InferenceComponentName = ic3_adapter_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic3_adapter_name)

print(f"\nCreated Adapter 3 inference component ARN: {adapter_create_inference_component_response['InferenceComponentArn']}")

In [None]:
import time

#starting tier indexes 0 - GPU, 1 - CPU, 2 - Disk
tiers = [ ic3_adapter_name, ic2_adapter_name, ic1_adapter_name ]

cycles = 10

invocation_order = [
    ic3_adapter_name, #ic3 in GPU already.  GPU: ic3 CPU: ic2 DISK: ic1
    ic2_adapter_name, #swap ic2 from CPU.   GPU: ic2 CPU: ic3 DISK: ic1
    ic2_adapter_name, #ic2 is still in GPU. GPU: ic2 CPU: ic3 DISK: ic1
    ic1_adapter_name, #swap ic1 from disk.  GPU: ic1 CPU: ic2 DISK: ic3
    ic1_adapter_name, #ic1 is still in GPU. GPU: ic1 CPU: ic2 DISK: ic3
    ic2_adapter_name, #swap ic2 from CPU.   GPU: ic2 CPU: ic1 DISK: ic3
    ic3_adapter_name, #swap ic3 from disk.  GPU: ic3 CPU: ic2 DISK: ic1
    # back to the starting configuration
]

no_swaps = []
cpu_swaps = []
disk_swaps = []

swap_type = ""

for cycle in range(cycles):
    for invocation in invocation_order:

        if invocation == base_inference_component_name or tiers.index(invocation) == 0:
            #do nothing
            swap_type = "NONE"
            pass
        elif tiers.index(invocation) == 1:
            tiers[1] = tiers[0]
            tiers[0] = invocation
            swap_type = "FROM_CPU"
        elif tiers.index(invocation) == 2:
            tiers[2] = tiers[1]
            tiers[1] = tiers[0]
            tiers[0] = invocation
            swap_type = "FROM_DISK"


        component_to_invoke = invocation

        start = time.time()*1000

        response_model = sm_runtime.invoke_endpoint(
            EndpointName = endpoint_name,
            InferenceComponentName = component_to_invoke,
            Body = json.dumps(
                {
                    "inputs": prompt,
                    "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 1}
                }
            ),
            ContentType = "application/json",
        )

        end = time.time()*1000

        total = int(end - start)

        if swap_type == "NONE":
            no_swaps.append(total)
        elif swap_type == "FROM_CPU":
            cpu_swaps.append(total)
        elif swap_type == "FROM_DISK":
            disk_swaps.append(total)

        print(f'call to [{invocation.split("-")[0]}] {total} ms. swap: [{swap_type}] [ GPU: {tiers[0].split("-")[0]} CPU: {tiers[1].split("-")[0]} Disk: {tiers[2].split("-")[0]} ]')

no_swaps_count = len(no_swaps)
no_swaps_avg = int(sum(no_swaps)/len(no_swaps))
cpu_swaps_count = len(cpu_swaps)
cpu_swaps_avg = int(sum(cpu_swaps)/len(cpu_swaps))
disk_swaps_count = len(disk_swaps)
disk_swaps_avg = int(sum(disk_swaps)/len(disk_swaps))

In [None]:
import pandas as pd
import numpy as np
np_no_swaps = np.array(no_swaps)
np_cpu_swaps = np.array(cpu_swaps)
np_disk_swaps = np.array(disk_swaps)

data = {
    "count": [no_swaps_count, cpu_swaps_count, disk_swaps_count],
    "average": [no_swaps_avg, cpu_swaps_avg, disk_swaps_avg],
    "+latency avg": [0, cpu_swaps_avg-no_swaps_avg, disk_swaps_avg-no_swaps_avg],
    "+%latency avg": [0, ((cpu_swaps_avg-no_swaps_avg)/no_swaps_avg)*100, ((disk_swaps_avg-no_swaps_avg)/no_swaps_avg)*100],
    "p50": [int(np.percentile(np_no_swaps, 50)), int(np.percentile(np_cpu_swaps, 50)), int(np.percentile(np_disk_swaps, 50))],
    "p75": [int(np.percentile(np_no_swaps, 75)), int(np.percentile(np_cpu_swaps, 75)), int(np.percentile(np_disk_swaps, 75))],
    "p99": [int(np.percentile(np_no_swaps, 99)), int(np.percentile(np_cpu_swaps, 99)), int(np.percentile(np_disk_swaps, 99))]
}


df = pd.DataFrame(data, index = ["no swap", "cpu swap", "disk swap"])

df

If you were to run a similar test on 1000 cycles (7000 invocations), you'd see the following:

![](./images/adapter-load-latency-1000.png)

## Step 5: Upload a new ECTSum adapter artifact and update the live adapter inference component

Since adapters are managed as Inference Components, you can update them on a running endpoint. SageMaker handles unloading and deregistering the old adapter and loading and registering the new adapter on every base IC, across all of the instances it is running on for this endpoint. To update an adapter IC, use the `update_inference_component` API and supply the existing IC name and the S3 path to the new compressed adapter archive.

You can train a new adapter, or re-upload the existing adapter artifact to test this functionality.

In [None]:
new_ectsum_adapter_s3_uri = f"s3://{bucket}/adapters/new-ectsum-adapter.tar.gz"
print(new_ectsum_adapter_s3_uri)

!aws s3 cp ./ectsum-adapter.tar.gz {new_ectsum_adapter_s3_uri}

#### This step can take around 5 minutes

In [None]:
%%time

update_inference_component_response = sm_client.update_inference_component(
    InferenceComponentName = ic1_adapter_name,
    Specification={
        "Container": {
            "ArtifactUrl": new_ectsum_adapter_s3_uri
        },
    },
)

sess.wait_for_inference_component(ic1_adapter_name)

print(f'\nUpdated inference component adapter ARN: {update_inference_component_response["InferenceComponentArn"]}')

If you view your inference component logs (link below), you will see log entries for the deregistration of the old adapter and the registration of the new one.

You should see something similar to:
`[INFO ] PyProcess - W-200-0d1e4741a42db26-stdout: [1,0]<stdout>:INFO::Unregistered adapter ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401 successfully`

`[INFO ] PyProcess - W-200-0d1e4741a42db26-stdout: [1,0]<stdout>:INFO::Registered adapter ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401 from /opt/ml/models/container_340043819279-ic-ectsum-base-llama-3-1-8b-instruct-2024-11-25-20-41-07-401-1732570150851-MaeveWestworldService-1.0.9353.0 successfully`

In [None]:
print(f'You can view your inference component logs here:\n\n https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups/log-group/{cw_path}')

### Retest with updated adapter

In [None]:
%%time

component_to_invoke = ic1_adapter_name

response_model = sm_runtime.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 125}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

print(f'Ground Truth:\n\n {test_item["summary"][0]}\n\n')
print(f'Updated Adapter Model Response:\n\n {adapter_response}\n')

## Step 6: Clean up the environment

If you need to delete an adapter, call the `delete_inference_component` API with the IC name to remove it. 

In [None]:
sess.delete_inference_component(ic1_adapter_name, wait = True)
print(f'Adapter Component {ic1_adapter_name} deleted.')

sess.delete_inference_component(ic2_adapter_name, wait = True)
print(f'Adapter Component {ic2_adapter_name} deleted.')

sess.delete_inference_component(ic3_adapter_name, wait = True)
print(f'Adapter Component {ic3_adapter_name} deleted.')

Deleting the base model IC also automatically deletes any adapter ICs associated with it.

In [None]:
sess.delete_inference_component(base_inference_component_name, wait = True)

print(f'Base Component {base_inference_component_name} deleted.')
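Before deleting the endpoint, you may want to confirm that no inference components remain on it. The following is a minimal sketch that pages through the boto3 `list_inference_components` API filtered by endpoint name. `remaining_components` is a hypothetical helper, and it assumes the client is the SageMaker boto3 client created earlier.

```python
def remaining_components(sm_client, endpoint_name):
    """Return the names of all inference components still attached to an endpoint."""
    names = []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sm_client.list_inference_components(**kwargs)
        names.extend(
            component["InferenceComponentName"]
            for component in page.get("InferenceComponents", [])
        )
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # request the next page of results


# Usage in the notebook (an empty list means everything was deleted):
# print(remaining_components(sm_client, endpoint_name))
```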

Clean up the running endpoint and its configuration.

In [None]:
sess.delete_endpoint(endpoint_name)
print(f'Endpoint {endpoint_name} deleted.')

sess.delete_endpoint_config(endpoint_name)
print(f'Endpoint Configuration {endpoint_name} deleted.')

sess.delete_model(model_name)
print(f'Model {model_name} deleted.')