# Utilizing Inference Components (ICs) to Host Multiple LLMs on SageMaker Real-Time Endpoints
In this example we use [SageMaker Inference Components](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/) to host both a [Qwen](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct?sagemaker_deploy=true) and an [OpenLlama](https://huggingface.co/openlm-research/open_llama_7b) model on a single endpoint.

Unlike traditional SageMaker Real-Time Endpoints, we follow the flow of Endpoint Config -> Endpoint -> IC (1...n). In this case we create the endpoint and then add a Qwen model and an OpenLlama model, each as its own IC. ICs are similar to SageMaker Model objects in that we define the model data and container information. The difference is that we can also enable AutoScaling at the IC level, which we will explore in the next notebook. For now you can picture the IC architecture as follows:

![ic-arch](ic-arch.png)

An IC builds on the SageMaker model construct and adds two parameters:
1. <b>Hardware Resource Requirements</b>: This is the hardware you reserve for this specific component. In this case we work with an 8 GPU instance and reserve 4 GPUs for the OpenLlama model and 1 for the Qwen model. This is not optimal usage; the idea is to showcase how you can use one endpoint to host multiple LLMs. Ideally you also set up appropriate scaling at the component level, which we will explore in coming sections.
2. <b>Copy Count</b>: This is the number of copies of the IC to run, and it is the dimension along which each IC scales individually, for example through AutoScaling driven by CloudWatch (CW) metrics. Each copy retains the hardware resource requirements you have allocated.
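
To make the reservation arithmetic concrete, here is a minimal sketch of how the 8 GPUs are divided between the two ICs. The helper `remaining_accelerators` is hypothetical (it is not part of the SageMaker SDK) and only models the bookkeeping described above: each copy of an IC reserves its full per-copy accelerator count.

```python
# Hypothetical bookkeeping helper, not a SageMaker API.
def remaining_accelerators(total_accelerators, reservations):
    """Return how many accelerators are still free on one instance.

    reservations maps an IC name to (accelerators_per_copy, copy_count).
    """
    used = sum(per_copy * copies for per_copy, copies in reservations.values())
    if used > total_accelerators:
        raise ValueError(
            f"Reserved {used} accelerators but only {total_accelerators} exist"
        )
    return total_accelerators - used


# Reservations used in this notebook: 4 GPUs for OpenLlama, 1 for Qwen.
reservations = {"openllama": (4, 1), "qwen": (1, 1)}
free = remaining_accelerators(8, reservations)
print(f"Free GPUs on the instance: {free}")

# Extra Qwen copies (1 GPU each) that would still fit on the same instance.
print(f"Additional Qwen copies that fit: {free // 1}")
```

With these numbers, 3 GPUs remain unreserved, which is why the allocation in this example is described as not optimal.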

## Setup

In [None]:
!pip install sagemaker boto3 --upgrade --quiet

In [None]:
import boto3
import sagemaker
import time
from time import gmtime, strftime

#Setup
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")
boto_session = boto3.session.Session()
s3 = boto_session.resource('s3')
region = boto_session.region_name
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
print(f"Role ARN: {role}")
print(f"Region: {region}")

## SM Endpoint Config and Endpoint Creation
We first create the endpoint configuration and the endpoint, and then allocate resources to each inference component afterwards. Note that you might need to request a limit increase for ml.g5.48xlarge, the instance type behind the endpoint.

In [None]:
# endpoint config name
epc_name = "ic-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(f"Endpoint Config Name: {epc_name}")

# Container Parameters, increase health check for LLMs: 
variant_name = "AllTraffic"
instance_type = "ml.g5.48xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

# Setting up managed AutoScaling
initial_instance_count = 1
max_instance_count = 2
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

# Endpoint Config Creation
endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=epc_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            # can set to least outstanding or random: https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

In [None]:
#Endpoint Creation
endpoint_name = "ic-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
#Monitor creation
describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
    time.sleep(60)
print(describe_endpoint_response)
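
The polling loop above runs until the status leaves `Creating`, with no upper bound on the wait. As a hedged alternative, here is a minimal sketch of a reusable poller with a timeout. The helper name `wait_for_status` and its parameters are assumptions made for illustration; `describe` is any zero-argument callable, such as `lambda: client.describe_endpoint(EndpointName=endpoint_name)`. The demo at the bottom uses a simulated status sequence so the block runs without AWS access.

```python
import time


# Hypothetical helper: polls until the status is no longer transient.
def wait_for_status(describe, status_key, transient=("Creating",),
                    poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Call describe() repeatedly and return the first non-transient response."""
    waited = 0
    while True:
        response = describe()
        status = response[status_key]
        print(status)
        if status not in transient:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence standing in for real describe_endpoint calls.
fake_statuses = iter(["Creating", "Creating", "InService"])
final = wait_for_status(
    lambda: {"EndpointStatus": next(fake_statuses)},
    "EndpointStatus",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(final)
```

The same helper could be reused for the inference component polling later in the notebook by passing `"InferenceComponentStatus"` as the status key. For the endpoint itself, boto3 also ships a built-in waiter, `client.get_waiter("endpoint_in_service")`.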

## Inference Component Creation
For an IC you first create a SM Model object, which the IC references, and then specify the compute resource requirements and copy count in the <b>create_inference_component</b> API call: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_inference_component.html.

### OpenLlama Inference Component Creation
For OpenLlama we deploy using the LMI container with the NVIDIA TensorRT-LLM backend. For background on LMI, see this video: https://www.youtube.com/watch?v=Q-Kz5Yi0QiQ.

In [None]:
# First create the model object
openllama_env = {
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MODEL_ID": "openlm-research/open_llama_7b",
    "OPTION_ROLLING_BATCH": "trtllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16"
}
# TensorRT-LLM LMI image URI for the OpenLlama container (hardcoded to us-east-1; adjust the region to match yours)
openllama_lmi_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-tensorrtllm0.12.0-cu125"
# create SM model object for OpenLlama
ollama_model_name = sagemaker.utils.name_from_base("lmi-openllama-7b")
print(ollama_model_name)

# model object for the OpenLlama LMI deployment
create_model_response = client.create_model(
    ModelName=ollama_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": openllama_lmi_image_uri,
        "Environment": openllama_env,
    }
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

# Inference Component creation for OpenLlama
ollama_ic_name = "ollama-ic" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
variant_name = "AllTraffic"

# inference component creation
create_ollama_ic_response = client.create_inference_component(
    InferenceComponentName=ollama_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": ollama_model_name,
        "ComputeResourceRequirements": {
            # enables tensor parallel
            "NumberOfAcceleratorDevicesRequired": 4,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # can setup autoscaling for copies
    RuntimeConfig={"CopyCount": 1},
)

print("IC OpenLlama Arn: " + create_ollama_ic_response["InferenceComponentArn"])

In [None]:
describe_ic_ollama_response = client.describe_inference_component(
    InferenceComponentName=ollama_ic_name)

while describe_ic_ollama_response["InferenceComponentStatus"] == "Creating":
    describe_ic_ollama_response = client.describe_inference_component(InferenceComponentName=ollama_ic_name)
    print(describe_ic_ollama_response["InferenceComponentStatus"])
    time.sleep(30)
print(describe_ic_ollama_response)

### Qwen Inference Component Creation
For the Qwen IC we use the TGI container, which simplifies deployment from Hugging Face (HF). This also showcases how you can use a different container and serving stack for each model, whereas with Multi-Model Endpoints (MME) all models share a single container.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

qwen_tgi_image_uri = get_huggingface_llm_image_uri("huggingface", version="3.0.1")
qwen_model = {
    "Image": qwen_tgi_image_uri,
    "Environment": {"HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct"},
}
qwen_model_name = "qwen-model" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(f"Qwen Model Name: {qwen_model_name}")

# create qwen model object
create_qwen_model_response = client.create_model(
    ModelName=qwen_model_name,
    ExecutionRoleArn=role,
    Containers=[qwen_model],
)
print("Qwen Model Arn: " + create_qwen_model_response["ModelArn"])

qwen_ic_name = "qwen-ic" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
variant_name = "AllTraffic"

# qwen inference component creation
create_qwen_ic_response = client.create_inference_component(
    InferenceComponentName=qwen_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": qwen_model_name,
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # can setup autoscaling for copies
    RuntimeConfig={"CopyCount": 1},
)

print("IC Qwen Arn: " + create_qwen_ic_response["InferenceComponentArn"])

In [None]:
describe_ic_qwen_response = client.describe_inference_component(
    InferenceComponentName=qwen_ic_name)

while describe_ic_qwen_response["InferenceComponentStatus"] == "Creating":
    describe_ic_qwen_response = client.describe_inference_component(InferenceComponentName=qwen_ic_name)
    print(describe_ic_qwen_response["InferenceComponentStatus"])
    time.sleep(60)
print(describe_ic_qwen_response)
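
Once both ICs are created, it can be useful to confirm what is actually hosted on the endpoint. The following is a minimal sketch that uses the `list_inference_components` API filtered by endpoint name. The helper name `summarize_components` is hypothetical, and it takes the SageMaker client as an argument so it can be called as `summarize_components(client, endpoint_name)` in this notebook.

```python
# Hypothetical helper: maps each IC on an endpoint to its current status.
def summarize_components(sm_client, endpoint_name):
    """Return {inference_component_name: status} for one endpoint."""
    statuses = {}
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sm_client.list_inference_components(**kwargs)
        for component in page["InferenceComponents"]:
            name = component["InferenceComponentName"]
            statuses[name] = component["InferenceComponentStatus"]
        # Follow pagination until no further token is returned.
        token = page.get("NextToken")
        if not token:
            return statuses
        kwargs["NextToken"] = token
```

If both components were created successfully, each entry in the returned dictionary should report `InService`.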

## Sample Inference
Here we use the same [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html) API call, but add the `InferenceComponentName` parameter to specify the target IC.

In [None]:
# OpenLlama
import json
payload = "What is the capital of the United States?"
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=ollama_ic_name, #specify IC name
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(
        {
            "inputs": payload,
            "parameters": {
                "max_new_tokens": 200  # Adjust this value as needed
                },
        }
    ),
)
result = json.loads(response["Body"].read().decode())['generated_text']
result

In [None]:
# Qwen
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=qwen_ic_name, #specify IC name
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(
        {
            "inputs": payload,
            "parameters": {
                "max_new_tokens": 200  # Adjust this value as needed
                },
        }
    ),
)
result = json.loads(response["Body"].read().decode())[0]['generated_text']
result
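
Note that the two cells above parse the response differently: the OpenLlama (LMI) response is read as a JSON object, while the Qwen (TGI) response is read as a JSON list with one element. As a minimal sketch, a single helper can handle both shapes. The name `extract_generated_text` is hypothetical, and the two response shapes are assumed from the parsing code above rather than from a container specification.

```python
import json


# Hypothetical helper covering the two response shapes used in this notebook.
def extract_generated_text(raw_body):
    """Return the generated text from an LMI-style or TGI-style JSON body."""
    parsed = json.loads(raw_body)
    # TGI-style responses wrap the result in a single-element list.
    if isinstance(parsed, list):
        parsed = parsed[0]
    return parsed["generated_text"]


# Simulated bodies standing in for response["Body"].read().decode().
lmi_style = json.dumps({"generated_text": "Washington, D.C."})
tgi_style = json.dumps([{"generated_text": "Washington, D.C."}])
print(extract_generated_text(lmi_style))
print(extract_generated_text(tgi_style))
```

In the notebook this would be called as `extract_generated_text(response["Body"].read().decode())` for either IC.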

## Cleanup
Make sure to delete both the ICs and the endpoint.

In [None]:
client.delete_inference_component(InferenceComponentName=qwen_ic_name)
client.delete_inference_component(InferenceComponentName=ollama_ic_name)
client.delete_endpoint(EndpointName = endpoint_name)
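
Inference component deletion is asynchronous, so a more defensive cleanup waits until no ICs remain on the endpoint before deleting it. The sketch below assumes that behavior. The helper name `delete_endpoint_when_empty` and its parameters are hypothetical, while `list_inference_components` and `delete_endpoint` are existing SageMaker client calls.

```python
import time


# Hypothetical helper: waits for all ICs to disappear, then deletes the endpoint.
def delete_endpoint_when_empty(sm_client, endpoint_name, poll_seconds=15,
                               timeout_seconds=900, sleep=time.sleep):
    """Delete the endpoint once no inference components are listed on it."""
    waited = 0
    while True:
        page = sm_client.list_inference_components(
            EndpointNameEquals=endpoint_name
        )
        if not page["InferenceComponents"]:
            break
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"ICs still present on {endpoint_name} after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
    sm_client.delete_endpoint(EndpointName=endpoint_name)
```

In this notebook it would be called as `delete_endpoint_when_empty(client, endpoint_name)` after the two `delete_inference_component` calls. The endpoint configuration and the two model objects also remain after the cell above runs; they can be removed with `client.delete_endpoint_config(EndpointConfigName=epc_name)` and `client.delete_model(ModelName=...)` for `ollama_model_name` and `qwen_model_name`.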