# Deploy Falcon 40B on Amazon SageMaker using LMI TensorRT-LLM

## Resources
- [Falcon-40B model card](https://huggingface.co/tiiuae/falcon-40b)
- [LMI Configuration Documentation](https://docs.djl.ai/docs/serving/serving/docs/lmi/configurations_large_model_inference_containers.html)
- [DJL-Demo Samples](https://github.com/deepjavalibrary/djl-demo/tree/2a5152f578f5954b8b68acdee18eed4e2a75c81f/aws/sagemaker/large-model-inference/sample-llm)

## TensorRT-LLM

Amazon SageMaker offers LMI deep learning containers (DLCs) to help customers maximize the utilization of available resources and improve performance. The latest LMI DLCs offer continuous batching of inference requests to improve throughput, efficient inference collective operations to improve latency, and the latest TensorRT-LLM library from NVIDIA to maximize performance on GPUs. The LMI TensorRT-LLM DLC offers a low-code interface that simplifies compilation with TensorRT-LLM: it requires only the model ID and optional model parameters, and the LMI DLC manages all of the heavy lifting involved in building the TensorRT-LLM optimized model. Customers can also use the latest quantization techniques (GPTQ, AWQ, SmoothQuant) with LMI DLCs.

In this example we walk through how to deploy and perform inference on the **Falcon 40B model** using the **Large Model Inference (LMI)** container provided by AWS, with **DJL Serving** and **TensorRT-LLM**. The **Falcon 40B model** is a causal decoder-only model. We will deploy on an `ml.g5.48xlarge` instance.

*Please note: the smaller Falcon-7B fits on a g5.2xlarge instance for inference, but with just-in-time (JIT) compilation there is not enough memory on a g5.2xlarge to compile the model. As an alternative, you can use ahead-of-time (AOT) compilation, copy the compiled model to S3, and then use a g5.2xlarge for inference.*
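
As a rough sanity check on instance sizing, the sketch below estimates the memory needed just to hold model weights in 16-bit precision, and how that splits across GPUs under tensor parallelism. The parameter counts (about 7 and 40 billion), the 2 bytes per parameter, and the 24 GiB per GPU figure for G5 instances are approximations assumed here, not values taken from this notebook. The estimate ignores the KV cache, activations, and the extra workspace that compilation needs.

```python
GIB = 1024 ** 3  # bytes in one GiB


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def per_gpu_weights_gib(num_params: float, tensor_parallel_degree: int,
                        bytes_per_param: int = 2) -> float:
    """Weights per GPU when they are sharded evenly across GPUs."""
    return weights_gib(num_params, bytes_per_param) / tensor_parallel_degree


GPU_MEMORY_GIB = 24  # assumed memory of a single G5 GPU

# (label, approximate parameter count, tensor parallel degree): illustrative values
for name, params, tp in [("falcon-7b", 7e9, 1), ("falcon-40b", 40e9, 8)]:
    total = weights_gib(params)
    per_gpu = per_gpu_weights_gib(params, tp)
    print(f"{name}: ~{total:.1f} GiB of weights, ~{per_gpu:.1f} GiB per GPU "
          f"at TP={tp} (GPU memory assumed: {GPU_MEMORY_GIB} GiB)")
```

Under these assumptions the 40B weights alone do not fit on a single 24 GiB GPU, which is consistent with sharding the model across several GPUs.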

## Step 1: Setup

In [9]:
# %pip install sagemaker --upgrade --quiet

In [10]:
import sagemaker
import boto3
import json
print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

boto3 version: 1.34.39
sagemaker version: 2.209.0


In [11]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

## Step 2: Create a model, endpoint configuration and endpoint

Retrieve the ECR image URI for the DJL TensorRT-LLM large model inference container. The image URI is looked up from the framework name, AWS region, and framework version, which lets us dynamically select the right Docker image for our environment.

The `sagemaker.image_uris` module provides functions for generating ECR image URIs for pre-built SageMaker Docker images. See the available Large Model Inference DLCs [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

In [12]:
version = "0.26.0"
inference_image_uri = sagemaker.image_uris.retrieve(
    "djl-tensorrtllm", region=region, version=version
)
print(f"Image going to be used is ----> {inference_image_uri}")

Image going to be used is ----> 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122
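
To make the structure of the returned URI easier to read, here is a minimal sketch that splits it into account, region, repository, and tag. The helper `parse_ecr_image_uri` and its regular expression are hypothetical and only cover tagged URIs under the `amazonaws.com` domain, like the one printed above.

```python
import re

# Assumed shape: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri: str) -> dict:
    """Split a tagged ECR image URI into its named components."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a tagged ECR image URI: {uri!r}")
    return match.groupdict()


uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122"
parts = parse_ecr_image_uri(uri)
for key, value in parts.items():
    print(f"{key}: {value}")
```

The tag carries the container version together with the bundled TensorRT-LLM and CUDA versions, so printing it is a quick way to confirm which build the endpoint will run.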


In [13]:
model_name = sagemaker.utils.name_from_base("falcon40b-trtllm")
print(model_name)

env = {
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
    "OPTION_MODEL_ID" : "tiiuae/falcon-40b",
    "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    "OPTION_MAX_ROLLING_BATCH": "32",
    "OPTION_MAX_INPUT_LEN": "512",
    "OPTION_MAX_OUTPUT_LEN": "256",
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image_uri, 
        "Environment": env,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

falcon40b-trtllm-2024-02-26-21-59-32-663
Created Model: arn:aws:sagemaker:us-west-2:461312420708:model/falcon40b-trtllm-2024-02-26-21-59-32-663
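
The LMI configuration documentation linked under Resources describes `OPTION_`-prefixed environment variables as equivalents of `option.` keys in a `serving.properties` file. The sketch below illustrates that naming convention for the `env` dictionary used above. The helper `env_to_serving_properties` is hypothetical and is not part of the container or the SDK; it also checks that every value is a string, since the SageMaker `Environment` map only accepts string values.

```python
def env_to_serving_properties(env: dict) -> str:
    """Render OPTION_* variables as the equivalent option.* property lines."""
    lines = []
    for key, value in env.items():
        if not isinstance(value, str):
            raise TypeError(f"{key} must be a string, got {type(value).__name__}")
        if key.startswith("OPTION_"):
            # OPTION_MODEL_ID -> option.model_id
            lines.append(f"option.{key[len('OPTION_'):].lower()}={value}")
    return "\n".join(lines)


env = {
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
    "OPTION_MODEL_ID": "tiiuae/falcon-40b",
    "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    "OPTION_MAX_ROLLING_BATCH": "32",
    "OPTION_MAX_INPUT_LEN": "512",
    "OPTION_MAX_OUTPUT_LEN": "256",
}

properties = env_to_serving_properties(env)
print(properties)
```

Variables without the `OPTION_` prefix, such as `SERVING_LOAD_MODELS`, are left out because they configure the model server rather than a model option.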


The two cells below deploy the model to a SageMaker endpoint for real-time inference. `instance_type` sets the instance type that backs the endpoint, and the endpoint name is derived from the model name. The endpoint configuration specifies a long container startup health check timeout (1800 seconds), because compiling and loading the TensorRT-LLM model on the GPU instance takes time.

In [14]:
endpoint_config_name = f"{model_name}-config"
instance_type = "ml.g5.48xlarge"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants = [
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1800,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:461312420708:endpoint-config/falcon40b-trtllm-2024-02-26-21-59-32-663-config',
 'ResponseMetadata': {'RequestId': 'abc2f0fd-c4d4-4272-975d-1ec4cf771c01',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'abc2f0fd-c4d4-4272-975d-1ec4cf771c01',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '128',
   'date': 'Mon, 26 Feb 2024 21:59:32 GMT'},
  'RetryAttempts': 0}}

In [15]:
endpoint_name = f"{model_name}-endpoint"
create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name, EndpointConfigName = endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:461312420708:endpoint/falcon40b-trtllm-2024-02-26-21-59-32-663-endpoint


### This step can take ~10 minutes or longer, so please be patient

In [16]:
# Use the session helper to block until the endpoint is ready
sess.wait_for_endpoint(endpoint_name)

----------------------------!

{'EndpointName': 'falcon40b-trtllm-2024-02-26-21-59-32-663-endpoint',
 'EndpointArn': 'arn:aws:sagemaker:us-west-2:461312420708:endpoint/falcon40b-trtllm-2024-02-26-21-59-32-663-endpoint',
 'EndpointConfigName': 'falcon40b-trtllm-2024-02-26-21-59-32-663-config',
 'ProductionVariants': [{'VariantName': 'variant1',
   'DeployedImages': [{'SpecifiedImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122',
     'ResolvedImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference@sha256:e4a14395bab025b5cea5b950e6003fd3a98d4b82e0f933fc8b5735e6e0018b3b',
     'ResolutionTime': datetime.datetime(2024, 2, 26, 21, 59, 34, 835000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2024, 2, 26, 21, 59, 34, 55000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 2, 26, 22, 13, 39, 555000,
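
For more control over polling than the session helper gives, a wait loop can be written by hand. The sketch below is one possible shape, not the SDK's implementation: `wait_until_in_service` is a hypothetical helper that takes any callable with the same keyword signature as `sm_client.describe_endpoint`, so it can be exercised here with a stand-in instead of a live endpoint. The poll interval and timeout defaults are arbitrary.

```python
import time


def wait_until_in_service(describe, endpoint_name, poll_seconds=30,
                          timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint is InService."""
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for sm_client.describe_endpoint so the sketch runs without AWS access.
_fake_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_fake_statuses)}


result = wait_until_in_service(fake_describe, "demo-endpoint",
                               poll_seconds=0, sleep=lambda seconds: None)
print(result)
```

With a real client, the call would be `wait_until_in_service(sm_client.describe_endpoint, endpoint_name)`.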

## Step 3: Invoke the Endpoint

In [17]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName = endpoint_name,
    Body = json.dumps(
        {
            "inputs": "What is AWS re:invent? Where does it happen every year?",
            "parameters": {"max_new_tokens": 256, "do_sample": True},
        }
    ),
    ContentType = "application/json",
)

response_model["Body"].read().decode("utf8")

CPU times: user 14.1 ms, sys: 522 µs, total: 14.7 ms
Wall time: 7.66 s


'{"generated_text": "\\nAWS re:Invent is a learning conference hosted by Amazon Web Services for the global cloud computing community. The event hosts more than 65,000 attendees from 150 countries and features more than 2,500 sessions and workshops.\\nThe event is held in Las Vegas, Nevada, United States.\\nWhat are the AWS re:invent 2022 dates?\\nAWS re:Invent 2022 will be held from 27th November to 2nd December 2022 at the Venetian in Las Vegas, Nevada.\\nWho attends AWS re:invent?\\nThe event is attended by AWS customers, partners, and employees.\\nWhat are the AWS re:invent 2022 topics?\\nThe event will cover topics like cloud computing, artificial intelligence, machine learning, and more.\\nWhat is the AWS re:invent 2022 schedule?\\nThe event will be held from 27th November to 2nd December 2022.\\nWhat is the AWS re:invent 2022 agenda?\\nThe event will cover topics like cloud computing, artificial intelligence, machine learning, and more.\\nWhat is the AWS re:invent 2022 registrat
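
The response body is a JSON string, so the generated text is easier to read once it is decoded. A minimal sketch, assuming the body has the `{"generated_text": ...}` shape shown above; `extract_generated_text` is a hypothetical helper, and `sample_body` is a shortened stand-in for the real response.

```python
import json


def extract_generated_text(raw_body: str) -> str:
    """Decode the JSON response body and return the generated text."""
    payload = json.loads(raw_body)
    return payload["generated_text"]


# Shortened stand-in for response_model["Body"].read().decode("utf8")
sample_body = (
    '{"generated_text": "\\nAWS re:Invent is a learning conference hosted by '
    'Amazon Web Services for the global cloud computing community."}'
)

text = extract_generated_text(sample_body)
print(text.strip())
```

Decoding also turns the escaped `\n` sequences into real line breaks, which makes long completions much easier to scan.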

## Step 4: Clean up the environment

In [18]:
# Delete the endpoint first so it stops incurring charges, then its configuration and the model
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(model_name)
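
If one of the delete calls raises, the calls after it never run and resources can be left behind. The sketch below shows one way to make cleanup best-effort. `run_cleanup` is a hypothetical helper, and the fake delete functions stand in for the `sess.delete_*` calls so the example runs without AWS access.

```python
def run_cleanup(steps):
    """Run every (label, action) pair and collect failures instead of stopping."""
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((label, exc))
    return failures


# Stand-ins for the real calls, e.g. lambda: sess.delete_endpoint(endpoint_name)
deleted = []


def fake_delete(name):
    deleted.append(name)


def failing_delete():
    raise RuntimeError("simulated failure")


failures = run_cleanup([
    ("endpoint", lambda: fake_delete("endpoint")),
    ("endpoint config", failing_delete),
    ("model", lambda: fake_delete("model")),
])

for label, exc in failures:
    print(f"Could not delete {label}: {exc}")
print(f"Deleted: {deleted}")
```

Anything reported in `failures` still needs to be removed by hand, for example from the SageMaker console.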