# Llama2-7B Single Model Deployment
In this example we'll use the Large Model Inference (LMI) container to optimize hosting of an LLM on SageMaker Inference. Factors we will look at include:

- LMI TensorRT-LLM Optimizations
- Batching Techniques (Paged Attention)
- AutoScaling at Hardware Level (SageMaker)

#### Credits/Reference
This notebook is derived from the following [sample](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/llm-workshop/llama2-7b-batching-throughput/llama2-7b-batching-throughput.ipynb) and [blog](https://aws.amazon.com/blogs/machine-learning/improve-throughput-performance-of-llama-2-models-using-amazon-sagemaker/). Please refer to these for a deeper analysis and understanding of batching.

## Setup

In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade --quiet

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

In [None]:
model_bucket = sess.default_bucket()  # bucket to house model artifacts
s3_code_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-fp16/code"  # folder within bucket where code artifact will go

s3_model_prefix = "hf-large-model-djl/meta-llama/Llama-2-7b-fp16/model"  # folder within bucket where model artifact will go
region = sess.boto_region_name  # public attribute instead of the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

#### [OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3

If you want to download your own copy of the model and upload it to an S3 location in your AWS account, follow the steps below. Otherwise, skip to the next step.

In [None]:
"""from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This will download the model into the current directory, wherever the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "TheBloke/Llama-2-7b-fp16"
# Only download model weights, config, and tokenizer files
allow_patterns = ["*.json", "*.txt", "*.model", "*.safetensors", "*.bin", "*.chk", "*.pth"]

# - Use snapshot_download to download the model, since the repository stores the model files using LFS
model_download_path = snapshot_download(
    repo_id=model_name, cache_dir=local_model_path, allow_patterns=allow_patterns
)"""

In [None]:
# upload files from local to S3 location
# pretrained_model_location = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
# print(f"Model uploaded to --- > {pretrained_model_location}")

In [None]:
# Cleanup locally stored model files post S3 upload
#!rm -rf {model_download_path}

#### Define a variable to contain the S3 URL of the model location

In [None]:
# S3 URL of the model location. For demo purposes, we use the Llama-2-7b-fp16 model artifacts from our S3 bucket
pretrained_model_location = f"s3://sagemaker-example-files-prod-{region}/models/llama-2/fp16/7B/"

In [None]:
!aws s3 ls {pretrained_model_location}

## Paged Attention Batching
#### serving.properties for Paged Attention

In [None]:
!rm -rf code_llama2_7b_fp16
!mkdir -p code_llama2_7b_fp16

In [None]:
%%writefile code_llama2_7b_fp16/serving.properties
engine=MPI
option.tensor_parallel_degree=4
option.rolling_batch=trtllm
option.paged_attention = true
option.max_rolling_batch_prefill_tokens = 16080
option.max_rolling_batch_size=64
option.model_loading_timeout = 900
option.model_id = {{model_id}}

In [None]:
# we plug in the appropriate model location into our `serving.properties`
props_path = Path("code_llama2_7b_fp16/serving.properties")
template = jinja_env.from_string(props_path.read_text())
# write_text closes the file for us, so the rendered content is flushed before pygmentize reads it
props_path.write_text(template.render(model_id=pretrained_model_location))
!pygmentize code_llama2_7b_fp16/serving.properties | cat -n
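
Before packaging, it can help to sanity-check the rendered file. The sketch below is not part of the original notebook: `parse_properties` is a hypothetical helper that only handles simple `key=value` lines, the `model_id` value is a placeholder, and the check assumes the `ml.g5.12xlarge` instance used later exposes 4 GPUs.

```python
def parse_properties(text):
    """Parse simple key=value lines (hypothetical helper; skips blanks and # comments)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Same style of settings as the serving.properties above; model_id is a placeholder.
sample = """
engine=MPI
option.tensor_parallel_degree=4
option.rolling_batch=trtllm
option.max_rolling_batch_size=64
option.model_id = s3://example-bucket/models/llama-2/
"""

props = parse_properties(sample)
gpus_on_instance = 4  # assumption: ml.g5.12xlarge has 4 GPUs
tp_degree = int(props["option.tensor_parallel_degree"])
assert tp_degree <= gpus_on_instance, "tensor_parallel_degree exceeds available GPUs"
assert "{{" not in props["option.model_id"], "template placeholder was not rendered"
print(props["option.model_id"])
```

To check the real file, pass `Path("code_llama2_7b_fp16/serving.properties").read_text()` instead of `sample`.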

#### Retrieve DJL TensorRT Image

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=sess.boto_session.region_name,
    version="0.26.0",
)

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_llama2_7b_fp16

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

## Deploy Endpoint for Paged Attention Batching

In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("Llama-2-7b-fp16-mpi")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 900,
            "ContainerStartupHealthCheckTimeoutInSeconds": 900,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
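
The loop above polls until the status leaves `Creating`, but it has no upper time limit and does not distinguish `InService` from `Failed`. A minimal sketch of a bounded variant follows. `wait_for_endpoint` is a hypothetical helper, not part of the SageMaker SDK, and it takes the describe call as a plain callable so the demonstration runs without AWS access.

```python
import time


def wait_for_endpoint(describe, timeout_s=1800, poll_s=60, sleep=time.sleep):
    """Poll until the status leaves 'Creating' or the timeout elapses.

    `describe` is any zero-argument callable returning a dict with an
    'EndpointStatus' key, e.g. a lambda wrapping sm_client.describe_endpoint.
    """
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint still Creating after {waited}s")
        sleep(poll_s)
        waited += poll_s
        status = describe()["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"endpoint creation ended with status {status}")
    return status


# Offline demonstration with a fake describe call (no AWS access needed).
fake_statuses = iter(["Creating", "Creating", "InService"])
final = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(final)
```

Against the real endpoint this would be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. Boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be simpler if custom status printing is not needed.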

#### Sample Inference

In [None]:
payload = {"inputs": "Who is Roger Federer?", 
           "parameters": {"max_new_tokens":128, "do_sample":True}}

In [None]:
json.dumps(payload)

In [None]:
import json

runtime_client = boto3.client('sagemaker-runtime')
content_type = "application/json"

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=json.dumps(payload))
result = json.loads(response['Body'].read().decode())['generated_text']
print(result)
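
The cell above assumes the response body is a JSON object with a `generated_text` key. A small sketch of a more defensive parser is shown below. `extract_generated_text` is a hypothetical helper, and the body used here is a made-up stand-in for `response['Body'].read()`.

```python
import json


def extract_generated_text(body_bytes):
    """Decode an endpoint response body and return its 'generated_text' field."""
    data = json.loads(body_bytes.decode("utf-8"))
    if not isinstance(data, dict) or "generated_text" not in data:
        raise ValueError(f"unexpected response: {data!r}")
    return data["generated_text"]


# Stand-in for response["Body"].read(); the text is made up for illustration.
fake_body = json.dumps({"generated_text": "example output"}).encode("utf-8")
print(extract_generated_text(fake_body))
```

Raising on an unexpected shape makes a container error message visible immediately instead of surfacing later as a bare `KeyError`.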

In [None]:
%%time

# sequential test
for i in range(20):
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=json.dumps(payload))
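
`%%time` reports only the total for all 20 calls. If per-request latency is of interest, a sketch along these lines could be used. `time_calls` and `summarize` are hypothetical helpers, and the workload here is a local stand-in rather than a real endpoint call.

```python
import statistics
import time


def time_calls(fn, n=20):
    """Call fn() n times and return per-call latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return mean, median, and max latency for a list of timings."""
    return {
        "mean": statistics.mean(latencies),
        "p50": statistics.median(latencies),
        "max": max(latencies),
    }


# Stand-in workload; swap in the invoke_endpoint call to measure the endpoint.
stats = summarize(time_calls(lambda: sum(range(1000)), n=20))
print(stats)
```

To measure the endpoint, pass a lambda that wraps the `runtime_client.invoke_endpoint(...)` call from the cell above.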

## Load Testing

For Load Testing we'll use the open source Python framework Locust. With Locust we can simulate concurrent users to generate traffic. For a deeper guide, please refer to this [blog](https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/). For the test we provide two scripts:

- distributed.sh: Controls the number of users and workers to increase traffic (TPS)
- locust_script.py: Python script that defines the task to test; in this case it is our invoke_endpoint REST API call.
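
To illustrate what "simulating concurrent users" means, here is a standard-library sketch using threads. It is not Locust and not the provided `locust_script.py`. `run_concurrent` is a hypothetical helper, and the request is a short sleep standing in for an endpoint call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(fn, users=4, requests_per_user=5):
    """Run fn() from `users` threads and return (completed_calls, elapsed_seconds)."""

    def user_loop():
        # Each simulated user sends its requests one after another.
        for _ in range(requests_per_user):
            fn()
        return requests_per_user

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(user_loop) for _ in range(users)]
        completed = sum(f.result() for f in futures)
    return completed, time.perf_counter() - start


# Stand-in request that sleeps briefly; replace with an invoke_endpoint call.
completed, elapsed = run_concurrent(lambda: time.sleep(0.01), users=4, requests_per_user=5)
print(f"{completed} calls in {elapsed:.2f}s ({completed / elapsed:.1f} calls/s)")
```

Unlike the sequential test earlier, several requests are in flight at once, which is the situation rolling batching is designed for.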

In [None]:
#!pip install locust

In [None]:
!which locust

In [None]:
%%bash -s "$endpoint_name"
./distributed.sh $1

In [None]:
import pandas as pd
locust_data = pd.read_csv('results_stats.csv')
for index, row in locust_data.head(n=2).iterrows():
    print(index, row)

### Monitor Metrics via CloudWatch

You can also inspect hardware and invocation metrics in CloudWatch. For direct access, open the endpoint in the SageMaker Endpoint UI and use the settings tab to explore the metrics in more detail.

<div style="display: flex;">
    <img src="images/instance-metrics-one.png" alt="instance-metrics-one" style="width: 50%; height: auto;">
    <img src="images/instance-metrics-two.png" alt="instance-metrics-two" style="width: 50%; height: auto;">
</div>

-----------------
You can also view invocation metrics such as Model Latency and Overhead Latency. Most importantly, we will use the InvocationsPerInstance metric to drive AutoScaling.

![invocations](images/invocation-metrics.png)
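
The same metric can also be read programmatically. The sketch below only builds the keyword arguments for CloudWatch's `get_metric_statistics` call, so it runs without AWS access. `invocations_per_instance_query` is a hypothetical helper, and the endpoint name is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def invocations_per_instance_query(endpoint_name, variant_name="variant1", minutes=15):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "InvocationsPerInstance",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Sum"],
    }


query = invocations_per_instance_query("my-endpoint")  # placeholder endpoint name
print(query["MetricName"], query["Period"])
```

With credentials available, the datapoints could be fetched with `boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]`.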

## AutoScaling

You can also enable AutoScaling at the endpoint level on Amazon SageMaker. Before configuring AutoScaling, load test a single instance behind the endpoint to determine how much throughput one instance can sustain. Once you have that number and have chosen the appropriate instance type, you can define your scaling policy with Managed AutoScaling. For a deeper dive into AutoScaling with SageMaker Inference, refer to this [blog](https://towardsdatascience.com/autoscaling-sagemaker-real-time-endpoints-b1b6e6731c59).
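
One way to turn a single-instance load test result into a scaling target is the rule of thumb from the SageMaker auto scaling documentation: multiply the peak requests per second by a safety factor (0.5 is the suggested starting point) and by 60, since InvocationsPerInstance is a per-minute metric. The sketch below assumes that formula. `target_invocations_per_instance` is a hypothetical helper and the peak RPS value is illustrative only, not a measurement from this notebook.

```python
def target_invocations_per_instance(max_rps, safety_factor=0.5):
    """Convert a single-instance peak requests/second into a per-minute target."""
    if max_rps <= 0 or not 0 < safety_factor <= 1:
        raise ValueError("max_rps must be > 0 and safety_factor in (0, 1]")
    return max_rps * safety_factor * 60


# Illustrative number only: suppose the load test peaked at 2 requests/second.
print(target_invocations_per_instance(2.0))  # 2.0 * 0.5 * 60 = 60.0
```

The result would be used as the `TargetValue` of a target tracking policy on `SageMakerVariantInvocationsPerInstance`.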

We will set up a Managed AutoScaling policy via Application Auto Scaling using the Boto3 SDK.

In [None]:
# AutoScaling client
asg = boto3.client('application-autoscaling')

# The scalable resource is a production variant, identified by its resource ID.
# Replace variant1 with your production variant name.
resource_id = f"endpoint/{endpoint_name}/variant/variant1"

# scaling configuration
response = asg.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', 
    MinCapacity=1,
    MaxCapacity=4
)

# Target tracking scaling policy
response = asg.put_scaling_policy(
    PolicyName=f'Request-ScalingPolicy-{endpoint_name}',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # target: 5 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300,  # seconds to wait after a scale-in before another scale-in can start
        'ScaleOutCooldown': 60  # seconds to wait after a scale-out before another scale-out can start
    }
)

Our AutoScaling policy will now be reflected in the UI:

![asg-policy](images/pre-asg.png)

In [None]:
request_duration = 60 * 15 # 15 minutes
end_time = time.time() + request_duration
print(f"test will run for {request_duration} seconds")
while time.time() < end_time:
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=json.dumps(payload))

We can now see the endpoint updating and eventually scaling out to the desired instance count:

![asg-policy](images/scaled-up.png)
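
The instance counts shown in the UI are also available in the `describe_endpoint` response under `ProductionVariants`. The sketch below extracts them from a response dictionary. `variant_instance_counts` is a hypothetical helper, and the response shown is a trimmed, made-up example so the code runs without AWS access.

```python
def variant_instance_counts(describe_response):
    """Map variant name -> (current, desired) instance counts."""
    return {
        v["VariantName"]: (v.get("CurrentInstanceCount"), v.get("DesiredInstanceCount"))
        for v in describe_response.get("ProductionVariants", [])
    }


# Trimmed, made-up response in the shape returned by describe_endpoint.
fake_response = {
    "EndpointStatus": "Updating",
    "ProductionVariants": [
        {"VariantName": "variant1", "CurrentInstanceCount": 1, "DesiredInstanceCount": 2}
    ],
}
print(variant_instance_counts(fake_response))
```

Against the live endpoint this would be `variant_instance_counts(sm_client.describe_endpoint(EndpointName=endpoint_name))`.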