# Open-Llama LMI Container Deployment on Amazon SageMaker Real-Time Endpoints
In this notebook we use the Large Model Inference (LMI) container to deploy a sample [OSS Llama variant](https://huggingface.co/openlm-research/open_llama_7b) on [SageMaker Real-Time Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html). The LMI container exposes different LLM serving backends, such as TensorRT-LLM and vLLM, that we can use to deploy Open-Llama; the example here uses TensorRT-LLM. In the coming sections we'll use Inference Components to deploy multiple LLMs on a single endpoint efficiently.

#### License: Apache-2.0

## Credits/Resources
- [Original Notebook](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Open-Llama/LMI/open_llama_7b.ipynb)
- [LMI Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html)
- [TRT-LLM User Guide](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html)

## Setup
Instantiate our usual SageMaker clients and set up the S3 bucket for model data.

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

## Container Specification
To work with the LMI container you can provide either a `serving.properties` file, as we did with our traditional ML model examples, or a Python dictionary of environment variables. These settings control how the LLM is served, for example [Tensor Parallel](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#tensor-parallelism-configuration) and [Server Side Batching](https://aws.amazon.com/blogs/machine-learning/improve-throughput-performance-of-llama-2-models-using-amazon-sagemaker/). We also define the backend LLM serving engine we want to use, which in this case is [NVIDIA's TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).

Note that we also specify the <b>HF Model ID</b>, which pulls the model artifacts from that Hugging Face repo. If you have a custom model, you can instead specify the S3 path to the model data in the `OPTION_MODEL_ID` key listed below.

In [None]:
env = {
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MODEL_ID": "openlm-research/open_llama_7b",
    "OPTION_ROLLING_BATCH": "trtllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16"
}

# TensorRT-LLM LMI image URI (pinned to us-east-1; adjust the region if your session runs elsewhere)
trt_llm_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-tensorrtllm0.12.0-cu125"
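
The two configuration styles carry the same settings. As a rough sketch, assuming the LMI convention that an `option.<name>` line in `serving.properties` corresponds to an `OPTION_<NAME>` environment variable, the `OPTION_*` entries above could be written out in properties form as shown here. The helper name `env_to_serving_properties` is hypothetical, and keys without the `OPTION_` prefix (such as `SERVING_LOAD_MODELS`) are skipped because this sketch only handles the `option.` mapping.

```python
def env_to_serving_properties(env: dict) -> str:
    """Render OPTION_* environment variables as serving.properties lines.

    Assumes the convention OPTION_<NAME> <-> option.<name>; other keys are skipped.
    """
    lines = []
    for key, value in env.items():
        if key.startswith("OPTION_"):
            name = key[len("OPTION_"):].lower()
            lines.append(f"option.{name}={value}")
    return "\n".join(lines)


# Same settings as the env dictionary used in this notebook
example_env = {
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MODEL_ID": "openlm-research/open_llama_7b",
    "OPTION_ROLLING_BATCH": "trtllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
}
print(env_to_serving_properties(example_env))
```

Keeping the settings in a dictionary makes it easy to tweak a value such as the batch size between deployments without repackaging a model archive.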

## SageMaker Constructs Creation
Here we create the usual SageMaker objects: a model that captures the container and its configuration, an endpoint configuration that specifies the hardware for deployment, and the endpoint itself.

In [None]:
model_name = sagemaker.utils.name_from_base("lmi-openllama-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": trt_llm_image_uri,
        "Environment": env,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Enable least outstanding requests (LOR) routing: https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        },
    ],
)
endpoint_config_response
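
With `OPTION_TENSOR_PARALLEL_DEGREE` set to `max`, the model is sharded across all GPUs on the instance. A back-of-the-envelope estimate helps sanity-check the instance choice. It assumes roughly 7 billion parameters stored in 16-bit precision and the 4 A10G GPUs (24 GB each) of an ml.g5.12xlarge. These figures are illustrative assumptions, not measurements; the KV cache, activations, and TensorRT-LLM engine overhead add to the real footprint.

```python
# Illustrative assumptions, not measured values
num_params = 7e9       # approximate parameter count of a 7B model
bytes_per_param = 2    # 16-bit weights
num_gpus = 4           # GPUs on an ml.g5.12xlarge
gpu_memory_gb = 24     # memory per A10G GPU

weights_gb = num_params * bytes_per_param / 1e9
weights_per_gpu_gb = weights_gb / num_gpus

print(f"Total weights: {weights_gb:.1f} GB")
print(f"Per GPU with tensor parallelism: {weights_per_gpu_gb:.1f} GB of {gpu_memory_gb} GB")
```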

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

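print("Arn: " + resp["EndpointArn"])

# Stop here if the endpoint did not come up; FailureReason is set when creation fails
if status != "InService":
    raise RuntimeError(f"Endpoint creation ended in {status}: {resp.get('FailureReason', 'unknown')}")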
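
The polling loop above runs until the status leaves `Creating`, with no upper bound on the wait. A minimal sketch of a reusable variant with a timeout and an explicit failure check is shown here. The function name `wait_for_endpoint` and its parameters are hypothetical; `describe_endpoint` is passed in as a callable so the helper can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint leaves 'Creating', then return the final description.

    `describe_endpoint` is any callable with the same keyword interface as
    sm_client.describe_endpoint. Raises on timeout or on a non-InService status.
    """
    waited = 0
    while True:
        resp = describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no failure reason reported")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in this notebook (requires AWS credentials):
# resp = wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)
```

boto3 also ships an `endpoint_in_service` waiter, `sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)`, which may be simpler if its default polling settings suit your deployment.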

### Sample Inference

In [None]:
# boto3 inference sample
import json
content_type = "application/json"
payload = {"inputs": "Who is Roger Federer?"}  # optionally add generation settings under a "parameters" key

# sample inference
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=json.dumps(payload))
result = json.loads(response['Body'].read().decode())['generated_text']
print(result)
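
The request above sends only the prompt. A hedged sketch of a payload that also carries generation parameters is shown here. The parameter names (`max_new_tokens`, `temperature`, `top_p`) follow the Hugging Face-style schema that LMI documents, but the values are illustrative and support may vary by backend. The helper names `build_payload` and `extract_text` are hypothetical.

```python
import json


def build_payload(prompt, max_new_tokens=128, temperature=0.7, top_p=0.9):
    """Build a JSON request body with generation parameters (values are illustrative)."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    })


def extract_text(raw_body):
    """Return generated_text from a response body given as bytes or str.

    Accepts both {"generated_text": ...} and [{"generated_text": ...}],
    on the assumption that the exact shape may vary by container version.
    """
    parsed = json.loads(raw_body)
    if isinstance(parsed, list):
        parsed = parsed[0]
    return parsed["generated_text"]


# Usage in this notebook (requires a running endpoint):
# response = smr_client.invoke_endpoint(
#     EndpointName=endpoint_name,
#     ContentType="application/json",
#     Body=build_payload("Who is Roger Federer?", max_new_tokens=64),
# )
# print(extract_text(response["Body"].read()))
```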

## Cleanup

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
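
Deleting the endpoint releases the instance, but the endpoint configuration and model objects created earlier remain in the account. A minimal sketch for removing them as well is shown here. The helper name `delete_config_and_model` is hypothetical, while `delete_endpoint_config` and `delete_model` are standard boto3 SageMaker client calls.

```python
def delete_config_and_model(sm_client, endpoint_config_name, model_name):
    """Delete the endpoint configuration and the model left behind by the endpoint."""
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


# Usage in this notebook (requires AWS credentials):
# delete_config_and_model(sm_client, endpoint_config_name, model_name)
```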