# Serve Llama2-13b on SageMaker using the LMI container


In this notebook, we deploy the [Llama2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) model on SageMaker using the [SageMaker Large Model Inference (LMI) container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). For the purpose of this notebook, we'll use the weights from the following sources:
- https://huggingface.co/TheBloke/Llama-2-13B-fp16
- https://huggingface.co/TheBloke/Llama-2-13B-Chat-fp16

However, you can use the same approach to deploy the model using any other Llama2 weights like https://huggingface.co/meta-llama/Llama-2-13b-chat-hf, etc.

For information on Llama2, please refer to the paper [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/pdf/2307.09288.pdf).

This notebook explains how to deploy a model optimized for latency and throughput. A tuning guide is available in the [LLM Tuning Guide](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). Generative AI use cases follow a few key patterns, and each pattern needs different settings when hosting the model. The use cases fall broadly into three categories:

1. Chatbot / QA. These applications need the ability to handle large model inputs and large model outputs. Contextual applications require prescriptive and factual responses from the LLM, which can be controlled by setting the appropriate decoding parameters. Latency and accuracy are top priorities.
2. Summarization. These applications usually have a large payload as input to the model and will have a small to medium-length output payload. If we run these as batches, we have some tolerance for latency, but throughput is a major concern. 
3. Generation. These applications usually have a smaller input payload, but open-ended generation can involve generating the full content length of the model. Throughput is of major concern.

# License agreement
View the license information at https://huggingface.co/meta-llama before using the model.
This is a sample notebook and is not intended for production use. Please refer to the license at https://github.com/aws/mit-0.

## Install, import the required libraries, and set some variables

In [2]:
%pip install sagemaker boto3 huggingface_hub awscli --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
import json
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


### Select the appropriate configuration parameters and container

To optimize the deployment of Large Language Models (LLMs), you need to choose the appropriate model partitioning framework, batching technique, batch size, tensor parallelism degree, and so on. The right configuration depends on the use case.

Hence, based on the use case, you need to:
1. Set the configuration parameters for the container.
2. Select the appropriate container image to be used for inference.

## Use case: Open-ended generation (chatbots, etc.)
Consider the following scenarios:
- Prompts with small input size and a small generated text
- Prompts with a small input size that generate a large number of tokens

Applications like chatbots, etc. have the above characteristics and also need to support high throughput. This needs to be taken into consideration while selecting the configuration parameters.

![chatbot](images/chatbot.png)

### Set the configuration parameters using environment variables
1. `SERVING_LOAD_MODELS`: Specifies the engine that will be used for this workload. In this case we'll be hosting a model using the **Python** engine.

2. `OPTION_MODEL_ID`: Set this to the Amazon S3 URI of the location that contains the model. When this is set, the container uses [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This speeds up deployments because the DJL inference container uses an optimized transfer path from S3 to the hosting instance.
If you want to download the model from huggingface.co instead, set `OPTION_MODEL_ID` to the model id of a pre-trained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository.

3. `OPTION_TENSOR_PARALLEL_DEGREE`: Set to the number of Inferentia devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. In this example we use the `ml.inf2.24xlarge` instance, which has 6 accelerators; hence this is set to 6.

4. `OPTION_ROLLING_BATCH`: Selects the technique used for continuous (iteration-level) batching, which merges concurrent requests that arrive at different times into the same inference batch.
Scenarios that involve open-ended generation and chatbots need high throughput. [vLLM](https://arxiv.org/pdf/2309.06180.pdf) is a fast LLM inference and serving framework that uses techniques like PagedAttention and continuous batching to improve throughput. Hence, we set the `rolling_batch` parameter to `vllm`. When using `vllm`, you can also set some [additional parameters](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md#vllm).

5. `OPTION_MAX_ROLLING_BATCH_SIZE`: The maximum number of concurrent requests that the model server places in a batch for inference. Clients can still send more requests to the endpoint; the extra requests are queued.


For more information on the available options, please refer to the [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md)
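
The five settings above can be assembled programmatically. The following is a minimal sketch, not part of the original notebook: `build_lmi_env` and `model_source` are hypothetical helper names. It relies on the fact that SageMaker expects every container environment value to be a string, so numeric settings are converted before use.

```python
def build_lmi_env(model_id, tensor_parallel_degree, rolling_batch="vllm",
                  max_rolling_batch_size=32, dtype="fp16"):
    """Build the container environment (hypothetical helper).

    All values are converted to strings because the container environment
    is a string-to-string map.
    """
    if int(tensor_parallel_degree) < 1:
        raise ValueError("tensor_parallel_degree must be a positive integer")
    if int(max_rolling_batch_size) < 1:
        raise ValueError("max_rolling_batch_size must be a positive integer")
    env = {
        # Same engine setting as used in this notebook.
        "SERVING_LOAD_MODELS": "test::Python=/opt/ml/model",
        # Either an S3 URI (s3://...) or a Hugging Face Hub model id.
        "OPTION_MODEL_ID": model_id,
        "OPTION_TENSOR_PARALLEL_DEGREE": tensor_parallel_degree,
        "OPTION_ROLLING_BATCH": rolling_batch,
        "OPTION_MAX_ROLLING_BATCH_SIZE": max_rolling_batch_size,
        "OPTION_DTYPE": dtype,
    }
    return {key: str(value) for key, value in env.items()}


def model_source(model_id):
    """Report where the container will fetch the weights from (hypothetical helper)."""
    return "s3" if model_id.startswith("s3://") else "huggingface-hub"


env_sketch = build_lmi_env("TheBloke/Llama-2-13B-Chat-fp16", 6)
print(model_source(env_sketch["OPTION_MODEL_ID"]), env_sketch)
```

Keeping the numeric settings as function arguments makes it easier to try different batch sizes or parallelism degrees without editing the dictionary by hand.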


### Select the relevant Large Model Inference container
SageMaker offers optimized [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that contain different frameworks for model parallelism, enabling inference of LLMs on multiple accelerators.

In this scenario, since we use `vllm` as the batching technique, we use the `NeuronX` container, which includes frameworks like NeuronX and vllm.

In [4]:
vllm_image_uri = image_uris.retrieve(
    framework="djl-neuronx", region=sess.boto_session.region_name, version="0.26.0"
)

env_generation = {
    "HUGGINGFACE_HUB_CACHE": "/tmp",
    "TRANSFORMERS_CACHE": "/tmp",
    "SERVING_LOAD_MODELS": "test::Python=/opt/ml/model",
    "OPTION_MODEL_ID": "TheBloke/Llama-2-13B-Chat-fp16",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_TENSOR_PARALLEL_DEGREE": "6",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
    "OPTION_DTYPE": "fp16",
}

In [5]:
print(vllm_image_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-neuronx-sdk2.16.0
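
The printed URI encodes the registry account, the region, the repository, and the image tag. As a quick sanity check that the NeuronX variant was resolved, the URI can be split into its parts. This is only a sketch; `parse_ecr_image_uri` is a hypothetical helper that assumes the `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout seen in the output above.

```python
def parse_ecr_image_uri(uri):
    """Split an ECR image URI into its components (hypothetical helper)."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    account_id, _, remainder = registry.partition(".dkr.ecr.")
    region = remainder.split(".")[0]
    return {
        "account_id": account_id,
        "region": region,
        "repository": repository,
        "tag": tag,
    }


parts = parse_ecr_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-neuronx-sdk2.16.0"
)
# Fail early if a non-NeuronX image was selected by mistake.
if "neuronx" not in parts["tag"]:
    raise ValueError(f"Expected a NeuronX image, got tag {parts['tag']}")
print(parts)
```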


In [6]:
# - Select the appropriate environment variable which will tune the deployment server.
env = env_generation  # use this in case it is 'generation' task

# - now we select the appropriate container
inference_image_uri = vllm_image_uri  # use this in case it is 'generation' task


print(f"Environment variables are ---- > {env}")
print(f"Image going to be used is ---- > {inference_image_uri}")

Environment variables are ---- > {'HUGGINGFACE_HUB_CACHE': '/tmp', 'TRANSFORMERS_CACHE': '/tmp', 'SERVING_LOAD_MODELS': 'test::Python=/opt/ml/model', 'OPTION_MODEL_ID': 'TheBloke/Llama-2-13B-Chat-fp16', 'OPTION_TRUST_REMOTE_CODE': 'true', 'OPTION_TENSOR_PARALLEL_DEGREE': '6', 'OPTION_ROLLING_BATCH': 'vllm', 'OPTION_MAX_ROLLING_BATCH_SIZE': '32', 'OPTION_DTYPE': 'fp16'}
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-neuronx-sdk2.16.0


### Steps to create the endpoint
- Create the model using the inference container image.

- Create the endpoint config with the instance type, the instance count, and the container startup health check timeout.

- Create the endpoint from the endpoint config.

In this notebook we leverage the boto3 SDK. You can also use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/).

In [7]:
instance_type = "ml.inf2.24xlarge"
# The endpoint name is derived from the model name in the endpoint config cell below.

### Create the Model
Use the `inference_image_uri` to create a model object. The endpoint can also use the least outstanding requests (LOR) routing strategy; see [Minimize real-time inference latency by using Amazon SageMaker routing strategies](https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/). This SageMaker feature has been shown to reduce latency by 10% or more when multiple instances are configured to serve the endpoint.

In [8]:
model_name = sagemaker.utils.name_from_base("lmi-llama2-13b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

lmi-llama2-13b-2024-03-08-16-13-56-006
Created Model: arn:aws:sagemaker:us-east-1:972812897072:model/lmi-llama2-13b-2024-03-08-16-13-56-006


### Create an endpoint config
Create an endpoint configuration using the appropriate instance type. Set `ContainerStartupHealthCheckTimeoutInSeconds` high enough to cover both the time taken to download the LLM weights from S3 or the model hub and the time taken to load the model onto the accelerators.

In [9]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:972812897072:endpoint-config/lmi-llama2-13b-2024-03-08-16-13-56-006-config',
 'ResponseMetadata': {'RequestId': '11f2057c-4a42-43c0-8269-d1e0994e193f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '11f2057c-4a42-43c0-8269-d1e0994e193f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '126',
   'date': 'Fri, 08 Mar 2024 16:14:07 GMT'},
  'RetryAttempts': 0}}
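
The endpoint config above does not set a routing strategy explicitly. If you want to request the least outstanding requests strategy mentioned earlier, the production variant accepts a `RoutingConfig` entry. The sketch below only builds the variant dictionary; `build_production_variant` is a hypothetical helper, and the defaults mirror the values used in this notebook.

```python
# Routing strategies accepted by the ProductionVariant RoutingConfig field.
ROUTING_STRATEGIES = {"LEAST_OUTSTANDING_REQUESTS", "RANDOM"}


def build_production_variant(model_name, instance_type, instance_count=1,
                             routing_strategy="LEAST_OUTSTANDING_REQUESTS",
                             startup_timeout_seconds=2400):
    """Return one ProductionVariants entry with an explicit routing strategy."""
    if routing_strategy not in ROUTING_STRATEGIES:
        raise ValueError(f"Unknown routing strategy: {routing_strategy}")
    return {
        "VariantName": "variant1",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": instance_count,
        "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_seconds,
        "RoutingConfig": {"RoutingStrategy": routing_strategy},
    }


variant = build_production_variant("lmi-llama2-13b-example", "ml.inf2.24xlarge")
print(variant)
```

The resulting dictionary could be passed as the single element of `ProductionVariants` in `sm_client.create_endpoint_config`. With only one instance, as in this notebook, the routing strategy is unlikely to matter; it becomes relevant once `InitialInstanceCount` is greater than 1.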

### Create an endpoint using the model and endpoint config

In [10]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:972812897072:endpoint/lmi-llama2-13b-2024-03-08-16-13-56-006-endpoint


#### This step can take ~15 mins or longer

In [11]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:972812897072:endpoint/lmi-llama2-13b-2024-03-08-16-13-56-006-endpoint
Status: InService
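
The polling loop above exits on any status other than `Creating`, so a failed deployment is only visible in the last printed line. A slightly more defensive variant, sketched below under the assumption that you want an exception on failure and an upper bound on the wait, takes the describe call as an argument. `wait_for_endpoint` is a hypothetical helper; in the notebook you would pass `sm_client.describe_endpoint` as the first argument.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            # For example "Failed"; FailureReason is only present on failure.
            reason = resp.get("FailureReason", "no failure reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still creating after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for sm_client.describe_endpoint so the sketch runs without AWS access.
_fake_statuses = iter(["Creating", "Creating", "InService"])


def _fake_describe(EndpointName):
    return {"EndpointStatus": next(_fake_statuses), "EndpointArn": "arn:example"}


result = wait_for_endpoint(_fake_describe, "example-endpoint", sleep=lambda seconds: None)
```

boto3 also provides an `endpoint_in_service` waiter (`sm_client.get_waiter("endpoint_in_service")`) that may be sufficient if you do not need custom logging.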


### Invoke the endpoint with a sample prompt

In [12]:
# Use this prompt and these parameters to test the summarization use case
prompt = """Briefly summarize this paragraph: Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition.
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input.
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend’s Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages."""
params = {"max_new_tokens": 64, "temperature": 0.1}

In [13]:
# Use this prompt and these parameters for a chatbot, QA, or open-ended generation task
prompt = "Amazon.com is the best"
params = {"max_new_tokens": 100, "do_sample": False}
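
Both cells above assign `prompt` and `params`, so whichever cell runs last determines what is sent to the endpoint. One way to keep the two parameter sets side by side is a small lookup keyed by use case. This is a sketch only: `USE_CASE_PARAMS` and `build_request_body` are hypothetical names, and the parameter values are copied from the two cells above.

```python
import json

# Decoding parameters per use case, copied from the cells above.
USE_CASE_PARAMS = {
    "summarization": {"max_new_tokens": 64, "temperature": 0.1},
    "generation": {"max_new_tokens": 100, "do_sample": False},
}


def build_request_body(prompt_text, use_case):
    """Return the JSON request body for the given prompt and use case."""
    if use_case not in USE_CASE_PARAMS:
        raise KeyError(f"Unknown use case: {use_case}")
    return json.dumps({"inputs": prompt_text, "parameters": USE_CASE_PARAMS[use_case]})


body = build_request_body("Amazon.com is the best", "generation")
print(body)
```

The returned string can be passed as `Body` to `smr_client.invoke_endpoint`.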

In [14]:
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": prompt, "parameters": params}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

'{"generated_text": "place to find and buy all kinds of products, including books, electronics, clothing, home goods, and more. Here are some of the reasons why Amazon.com is the best place to shop online:\\n\\n1. Wide selection: Amazon.com offers a vast selection of products, including millions of books, tens of thousands of DVDs, and hundreds of thousands of other items. You\'re sure to find what you\'re looking for on Amazon.\\n2. Con"}'
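
The response body is a JSON string with a single `generated_text` field that holds only the continuation, not the prompt. The text above stops mid-word ("2. Con"), presumably because generation reached the `max_new_tokens` limit of 100. A minimal sketch for extracting the text follows; `extract_generated_text` is a hypothetical helper that assumes the response shape shown above.

```python
import json


def extract_generated_text(raw_body):
    """Decode the endpoint response and return the generated text."""
    if isinstance(raw_body, bytes):
        raw_body = raw_body.decode("utf-8")
    payload = json.loads(raw_body)
    return payload["generated_text"]


# Shortened sample in the same shape as the response shown above.
sample = b'{"generated_text": "place to find and buy all kinds of products"}'
completion = extract_generated_text(sample)
print("Amazon.com is the best " + completion)
```

In the notebook, the argument would be `response_model["Body"].read()`. Note that the body stream can only be read once.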

## Clean up the environment

In [15]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

{'ResponseMetadata': {'RequestId': '013ebfaf-a088-449b-84ac-eb0e95b305e9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '013ebfaf-a088-449b-84ac-eb0e95b305e9',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Fri, 08 Mar 2024 16:30:23 GMT'},
  'RetryAttempts': 0}}
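
If one of the three delete calls above raises (for example, because a resource was already deleted), the remaining resources are left behind and the instance may keep accruing charges. The sketch below attempts every deletion and collects the errors instead of stopping at the first one. `cleanup` is a hypothetical helper; it catches the generic `Exception` to stay self-contained, whereas real code would more likely catch `botocore.exceptions.ClientError`.

```python
def cleanup(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model; return any errors by resource."""
    steps = [
        ("endpoint", sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint_config", sm_client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", sm_client.delete_model, {"ModelName": model_name}),
    ]
    errors = {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as exc:  # assumption: broad catch for illustration only
            errors[label] = exc
    return errors


class _FakeClient:
    """Stand-in for the boto3 SageMaker client so the sketch runs offline."""

    def __init__(self):
        self.deleted = []

    def delete_endpoint(self, EndpointName):
        self.deleted.append(("endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        raise RuntimeError("endpoint config not found")

    def delete_model(self, ModelName):
        self.deleted.append(("model", ModelName))


fake = _FakeClient()
errors = cleanup(fake, "example-endpoint", "example-config", "example-model")
print(fake.deleted, errors)
```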

#### Resources
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)
- [Deep Java Library for Large Model Inference](https://docs.djl.ai/docs/serving/serving/docs/large_model_inference.html)