# Serving Large Language Models (LLMs) at Scale on AWS

## Introduction


Serving a large language model (LLM) at scale means delivering the model as a service to a large number of users efficiently. This requires a robust infrastructure that can handle many concurrent requests with low latency.

LLM inference at scale can be achieved through various techniques such as model parallelism, data parallelism, efficient data handling, optimized inference, asynchronous inference, auto-scaling, load balancing, optimized hardware, caching, and using model serving frameworks. The goal is to minimize the latency of individual requests, reduce the computational complexity of the model, and improve the overall system throughput to meet the demands of a larger number of users. 

By serving an LLM at scale, you can make it accessible to a wider audience and enable them to use the model to generate text, translate languages, answer questions, and perform other natural language processing tasks quickly and efficiently.

In this article, I will use Amazon SageMaker to show how we can control the resources used to serve our LLMs at scale, such as the number of GPUs, the amount of memory, and the number of replicas assigned to serve a varying volume of requests. I will also show how to attach an auto-scaling policy to the serving endpoint so that it scales automatically when the workload varies.

## AWS Deep Learning Containers


Large Language Models (LLMs) are at the forefront of innovation in artificial intelligence, capturing the attention of academic institutions, tech companies, and enthusiasts with their sophisticated capabilities. Models built on architectures like GPT and Llama have rapidly gained traction for a wide range of uses such as language comprehension, conversational interfaces, and automated content creation. This surge in demand has led many companies to explore and integrate LLM-driven features into their products.

However, deploying LLMs on a large scale involves complex engineering challenges. To ensure a seamless user experience, hosting services for LLMs need to maintain quick response times while supporting numerous users simultaneously. Because these models demand substantial resources, standard inference frameworks often lack the optimizations needed for efficient resource use and good performance.

Key optimizations that can enhance LLM hosting include:

* Tensor parallelism, which spreads computation across multiple processing units.
* Model quantization, which reduces the model’s memory usage.
* Dynamic batching, which groups incoming requests to increase processing throughput.
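
To make the effect of quantization concrete, here is a back-of-the-envelope sketch of how much memory the model weights alone occupy at different precisions. The function name `weight_memory_gb` and the 8-billion-parameter figure are illustrative assumptions, and the estimate ignores the KV cache, activations, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


num_params = 8e9  # assumption: a model with roughly 8 billion parameters

# Compare the weight footprint at three common precisions
for name, bits in [("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_memory_gb(num_params, bits):.0f} GB")
```

Halving the number of bits per parameter halves the weight footprint, which is why quantization can let a model fit on fewer or smaller GPUs.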


Recently, AWS has released a new Hugging Face Deep Learning Container (DLC) for inference with Large Language Models (LLMs). This new Hugging Face LLM DLC is powered by Text Generation Inference (TGI), an open source, purpose-built solution for deploying and serving Large Language Models. TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs. The Hugging Face LLM DLC incorporates all the aforementioned optimizations as standard features, simplifying the large-scale deployment of LLMs.

# Serving Llama 3 at Scale in SageMaker

Let's start off by installing the required modules:

In [None]:
!pip install -U sagemaker transformers

In [3]:
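import sagemaker
import boto3

role = sagemaker.get_execution_role()  # IAM role used by SageMaker
sess = sagemaker.session.Session()  # SageMaker session
region = sess.boto_region_name
bucket = sess.default_bucket()  # default S3 bucket of the session

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region}")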


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::609362070692:role/service-role/AmazonSageMaker-ExecutionRole-20231122T115899
sagemaker session region: us-east-1


Next, we need to retrieve the container URI and pass it to the <code>HuggingFaceModel</code> class through its <code>image_uri</code> argument. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the <code>get_huggingface_llm_image_uri</code> function provided by the SageMaker SDK.

In [5]:
from huggingface_hub import login

# login(token="Your Hugging Face access token")

from sagemaker.huggingface import get_huggingface_llm_image_uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
)

print(f"llm image uri: {llm_image}")

llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.0.2-gpu-py310-cu121-ubuntu22.04


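To deploy the model to Amazon SageMaker, we create a <code>HuggingFaceModel</code> and define our endpoint configuration, including <code>HF_MODEL_ID</code>, <code>instance_type</code>, etc. The configuration below uses <code>NousResearch/Llama-2-7b-chat-hf</code>, with the Llama 3 8B Instruct model ID left in a comment as an alternative. We will use an ml.g5.48xlarge instance, which has 8 NVIDIA A10G GPUs and 192 GB of GPU memory (24 GB per GPU). The weights of a 7B to 8B parameter model in float16 take roughly 14 to 16 GB, so a single replica fits on one GPU.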

In [7]:
import json
from sagemaker.huggingface import HuggingFaceModel
 
# sagemaker config
instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 300
 
# Define model and endpoint configuration parameters
config = {
  'HF_MODEL_ID': "NousResearch/Llama-2-7b-chat-hf", #  "meta-llama/Meta-Llama-3-8B-Instruct"
  'SM_NUM_GPUS': "1", # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': "2048",  # Max length of the input text, in tokens
  'MAX_TOTAL_TOKENS': "4096",  # Max total tokens per request (input + generated)
  'MAX_BATCH_TOTAL_TOKENS': "8192", # Limits the number of tokens that can be processed in parallel during generation
}
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
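
The three token limits in the configuration are related to each other. As a rough sanity check, and assuming TGI's usual constraints (the input limit must be strictly smaller than the total limit, and the total limit must not exceed the batch limit), you could validate them before deploying. The helper `check_token_limits` is hypothetical and not part of any SDK.

```python
def check_token_limits(env: dict) -> int:
    """Validate the token limits; return the max number of new tokens per request."""
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    max_batch = int(env["MAX_BATCH_TOTAL_TOKENS"])

    # Assumed constraint: the prompt budget must leave room for generation
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # Assumed constraint: one full request must fit into a batch
    if max_total > max_batch:
        raise ValueError("MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS")

    # Tokens left for generation when the prompt uses the full input budget
    return max_total - max_input


example_env = {
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
    "MAX_BATCH_TOTAL_TOKENS": "8192",
}
print(check_token_limits(example_env))
```

With these values, a request that uses the full 2048-token input budget can still generate up to 2048 new tokens.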

# Multi-model deployment on SageMaker
Multi-model endpoints offer a scalable and cost-effective way to deploy models. They use a single set of resources and a shared serving container to host all the models together. Because the models share the same hardware, the endpoint's resources are used more efficiently, which makes a multi-model endpoint cheaper than deploying each model on its own separate endpoint. Figure 1 compares a single-model endpoint with a multi-model endpoint.

You can use Amazon SageMaker to deploy one or more models to an endpoint. If you deploy multiple models to the same endpoint, they share the resources available there, including ML compute instances, CPUs, and accelerators. The most flexible way to deploy multiple models to an endpoint is to define each model as an <i>inference component</i>.

## Inference components
An inference component is a SageMaker hosting object that you can use to deploy a model to an endpoint. In the inference component settings, you specify the model, the endpoint, and how the model utilizes the resources that the endpoint hosts. 

With <code>ResourceRequirements</code> you can assign endpoint resources to a model. These resources include CPU cores, accelerators, and memory. Let's see them in code:


<center><figure><img src="imgs/SME-MME.png" alt="drawing" width="800"/><figcaption>Fig. 1: single-model endpoint vs. multi-model endpoint</figcaption></figure></center> 

In [14]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

llama3_resource_config = ResourceRequirements(
    requests = {
        "copies": 8, #1
        "num_accelerators": 1, #2
        "num_cpus": 18,  #3 
        "memory": 72 * 1024,  #4
    },
)

In the above configuration, #1 is the number of copies (replicas) of the model, #2 is the number of GPUs dedicated to each copy, #3 is the number of CPUs assigned to each copy, and #4 is the amount of memory, in MB, dedicated to each copy. But what are the right values in this configuration? Since we defined 8 replicas on an instance with 8 GPUs, each replica gets 1 GPU. CPU and memory cannot simply be divided by 8, however, because SageMaker sets aside some CPU and memory resources for task management. More precisely, it reserves the equivalent of 2 replica shares for this purpose. For example, an ml.g5.48xlarge instance has 192 vCPUs and 768 GB of memory, so one share is $192/8 = 24$ vCPUs. Therefore, each replica gets $(192 - 2 \times 24)/8 = 18$ CPUs and $(768 - 2 \times 768/8)/8 = 72$ GB of memory, which is the <code>72 * 1024</code> MB used in the code.
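
The arithmetic above can be sketched as a small helper. This follows the article's rule of thumb of holding back two replica-sized shares; the function name `per_replica_resources` and the `reserved_shares` parameter are illustrative assumptions, not an official SageMaker formula.

```python
def per_replica_resources(total_vcpus, total_memory_gb, copies, reserved_shares=2):
    """Split instance CPU and memory across copies after holding back a reserve.

    Returns (vCPUs per copy, memory per copy in MB).
    """
    cpu_share = total_vcpus / copies      # one equal share of the CPUs
    mem_share = total_memory_gb / copies  # one equal share of the memory

    # Remove the reserved shares, then divide what is left among the copies
    cpus = (total_vcpus - reserved_shares * cpu_share) / copies
    memory_mb = (total_memory_gb - reserved_shares * mem_share) / copies * 1024
    return int(cpus), int(memory_mb)


# ml.g5.48xlarge: 192 vCPUs and 768 GB of memory, 8 model copies
num_cpus, memory_mb = per_replica_resources(192, 768, copies=8)
print(num_cpus, memory_mb)
```

The two returned values correspond to the `num_cpus` and `memory` entries passed to <code>ResourceRequirements</code>.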

Finally, we deploy the model:

In [None]:
%%time
import uuid
from sagemaker.enums import EndpointType

llm = llm_model.deploy(
    initial_instance_count=1, 
    instance_type=instance_type, 
    resources=llama3_resource_config, 
    container_startup_health_check_timeout=health_check_timeout, #1
    endpoint_name=f"llama3-chat-{str(uuid.uuid4())}",
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED, #2
    tags=[{"Key": "aKey", "Value": "aValue"}],
    model_name="llama3-chat"
)

-----!-----------------------------------------------------!CPU times: user 373 ms, sys: 46.6 ms, total: 419 ms
Wall time: 21min 8s


In the code above, #1 specifies the health check timeout in seconds, which is how long the container has to start responding to health checks. If the CloudWatch logs indicate a health check timeout, you should increase this value. #2 sets the endpoint type: in order to attach the resource requirements, the endpoint must be inference-component based, as discussed above.

# AutoScaling 

Amazon SageMaker offers automatic scaling (auto scaling) for hosted models, which dynamically adjusts the number of instances based on workload changes. When the workload increases, auto scaling activates additional instances. Conversely, when the workload decreases, it removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

First, we need to define a scaling policy that adds or removes instances for our production endpoint in response to workload changes.


# Autoscaling Policies

There are three main types of autoscaling policies for SageMaker Endpoints: target tracking, simple, and step scaling:

* Target Tracking:
With the target tracking scaling policy, you choose an Amazon CloudWatch metric and a target value, such as <code>SageMakerVariantInvocationsPerInstance</code> = 100, and SageMaker keeps the metric at, or close to, 100. This approach is very common due to its ease of configuration.

* Simple Scaling:
The simple scaling policy triggers a scaling event based on a specified metric at a defined threshold with a fixed amount of scaling. For instance, "when <code>SageMakerVariantInvocationsPerInstance</code> > 1000, add 10 instances." This strategy requires more configuration but offers greater control compared to target tracking.

* Step Scaling:
You can use step scaling when you require an advanced configuration, such as specifying how many instances to deploy under what conditions. For example, "when <code>SageMakerVariantInvocationsPerInstance</code> > 1000, add 10 instances; when <code>SageMakerVariantInvocationsPerInstance</code> > 2000, add 50 instances." This approach demands the most configuration but provides the highest level of control, especially for handling spiky traffic.
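
As an illustration of the step scaling example, the following sketch builds a `StepScalingPolicyConfiguration` dictionary of the kind accepted by Application Auto Scaling's `put_scaling_policy` (with `PolicyType="StepScaling"`). It assumes a CloudWatch alarm, not shown here, with a threshold of 1000 on the invocations metric; step bounds are expressed relative to that threshold. The helper `adjustment_for` is hypothetical and only mimics how a step would be selected.

```python
ALARM_THRESHOLD = 1000  # assumption: alarm threshold on invocations per instance

step_scaling_config = {
    "AdjustmentType": "ChangeInCapacity",  # add a fixed number of instances
    "MetricAggregationType": "Average",
    "Cooldown": 300,
    "StepAdjustments": [
        # breach of 0 to 1000 above the threshold: add 10 instances
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 1000, "ScalingAdjustment": 10},
        # breach of 1000 or more above the threshold: add 50 instances
        {"MetricIntervalLowerBound": 1000, "ScalingAdjustment": 50},
    ],
}


def adjustment_for(metric_value, config=step_scaling_config, threshold=ALARM_THRESHOLD):
    """Return the capacity change the steps would select, or 0 below the threshold.

    Whether a value exactly at the threshold counts as a breach depends on the
    alarm's comparison operator, which this toy helper does not model.
    """
    breach = metric_value - threshold
    for step in config["StepAdjustments"]:
        lower = step.get("MetricIntervalLowerBound", float("-inf"))
        upper = step.get("MetricIntervalUpperBound", float("inf"))
        if lower <= breach < upper:
            return step["ScalingAdjustment"]
    return 0


for value in (500, 1500, 2500):
    print(value, "->", adjustment_for(value))
```

Each additional step lets you respond more aggressively to larger breaches, which is what makes step scaling suited to spiky traffic.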

In [16]:
autoscale = boto3.Session().client(service_name="application-autoscaling")

First, we need to register the resource as a scalable target.

In [18]:
endpoint_name = llm.endpoint_name
autoscale.register_scalable_target(    
    ServiceNamespace="sagemaker", #1
    ResourceId="endpoint/" + endpoint_name + "/variant/AllTraffic", #2
    ScalableDimension="sagemaker:variant:DesiredInstanceCount", #3
    MinCapacity=1, #4
    MaxCapacity=2, #5
    RoleARN=role,
    SuspendedState={
        "DynamicScalingInSuspended": False,
        "DynamicScalingOutSuspended": False,
        "ScheduledScalingSuspended": False,
        
    },
)

{'ScalableTargetARN': 'arn:aws:application-autoscaling:us-east-1:609362070692:scalable-target/056mc0015e0cc4ea489497ea06e58faa340c',
 'ResponseMetadata': {'RequestId': 'caa5d34c-9b7f-4fee-886a-04588d39e217',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'caa5d34c-9b7f-4fee-886a-04588d39e217',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '131',
   'date': 'Tue, 28 May 2024 23:12:21 GMT'},
  'RetryAttempts': 0}}

In the above configuration, we defined: #1 the AWS service namespace; #2 the identifier of the resource associated with the scalable target; #3 the scalable dimension of the target (here, the number of instances for a SageMaker endpoint variant); and #4, #5 the minimum and maximum capacity that you plan to scale in and scale out to. Please refer to the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling/client/register_scalable_target.html) for other configurable parameters.

After registering the endpoint variant as a scalable target, we can create or update the scaling policy for that target using <code>put_scaling_policy</code>.

In [19]:
autoscale.put_scaling_policy(
    PolicyName="autoscale-policy-llama3-8b",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/" + endpoint_name + "/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 400, # 400% of 800% total GPU utilization (8 GPUs)
        "CustomizedMetricSpecification":
        {
            "MetricName": "GPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name },
                {"Name": "VariantName", "Value": "AllTraffic"}
            ],
            "Statistic": "Average",
            "Unit": "Percent"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    }
)

{'PolicyARN': 'arn:aws:autoscaling:us-east-1:609362070692:scalingPolicy:c0015e0c-c4ea-4894-97ea-06e58faa340c:resource/sagemaker/endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic:policyName/autoscale-policy-llama3-8b',
 'Alarms': [{'AlarmName': 'TargetTracking-endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic-AlarmHigh-e3070ea0-e84e-4491-b1e7-6bef30bdb5da',
   'AlarmARN': 'arn:aws:cloudwatch:us-east-1:609362070692:alarm:TargetTracking-endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic-AlarmHigh-e3070ea0-e84e-4491-b1e7-6bef30bdb5da'},
  {'AlarmName': 'TargetTracking-endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic-AlarmLow-6c117972-4125-40a9-a1ed-1ae17f75cb73',
   'AlarmARN': 'arn:aws:cloudwatch:us-east-1:609362070692:alarm:TargetTracking-endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic-AlarmLow-6c117972-4125-40a9-a1ed-1ae17f75cb73'}],
 'ResponseMetadata

Here, <code>PolicyType</code> must be one of the auto-scaling policies described earlier. <code>TargetValue</code> is the value of the metric that the policy tries to maintain by scaling the defined scalable dimension (e.g., sagemaker:variant:DesiredInstanceCount) out or in. For example, in the above configuration, when <code>GPUUtilization</code> exceeds 400 percent, a new instance is added to our deployed endpoint.

A cooldown period is the time the scaling policy waits before initiating another scaling action. This mechanism helps prevent over-scaling.

<code>ScaleOutCooldown</code>: After a successful scale-out, the auto scaler starts the cooldown period. The policy will not increase the desired capacity again until the cooldown period ends, unless a larger scale-out is triggered.

<code>ScaleInCooldown</code>: To protect application availability, further scale-in activities are blocked until the scale-in cooldown period has ended. The default value is 300 seconds for both.
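
The interplay between the target value and the cooldowns can be sketched with a toy decision function. This is a simplified model for intuition only, not the actual Application Auto Scaling algorithm: the function `desired_instances`, its proportional sizing rule, and its parameters are assumptions, and it ignores the case where a larger scale-out overrides the cooldown.

```python
import math


def desired_instances(current, metric, target, min_cap, max_cap,
                      seconds_since_scale_out, seconds_since_scale_in,
                      scale_out_cooldown=60, scale_in_cooldown=300):
    """Toy target-tracking decision with cooldown gating."""
    # Proportional sizing: capacity that would bring the metric back to target
    proposed = math.ceil(current * metric / target)
    # Never leave the registered MinCapacity/MaxCapacity range
    proposed = max(min_cap, min(max_cap, proposed))

    if proposed > current and seconds_since_scale_out < scale_out_cooldown:
        return current  # still cooling down after the last scale-out
    if proposed < current and seconds_since_scale_in < scale_in_cooldown:
        return current  # still cooling down after the last scale-in
    return proposed


# One instance at 600% GPU utilization against a 400% target, no recent activity
print(desired_instances(1, 600, 400, 1, 2, 1000, 1000))
```

A short scale-out cooldown combined with a longer scale-in cooldown, as configured in this article, makes the endpoint quick to add capacity and slow to remove it.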


Figure 2 shows how auto-scaling works. 

<center><figure><img src="imgs/scaling.png" alt="drawing" width="800"/><figcaption>Fig. 2: Auto-scaling Policy</figcaption></figure></center> 

## Trigger autoscaling

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config["HF_MODEL_ID"])

# Conversational messages
messages = [
  {"role": "system", "content": "You are a helpful Travel Assistant."},
  {"role": "user", "content": "Where is a good vacation destination in North America for summer?"},
]

# generation parameters
parameters = {    
    "top_p": 0.6,
    "temperature": 0.9,
}

for i in range(0, 100):
    res = llm.predict(
      {
        "inputs": tokenizer.apply_chat_template(messages, tokenize=False),
        "parameters": parameters
       })

In the above code snippet, <code>apply_chat_template</code> converts the messages to the chat template of the underlying LLM. Figure 3 shows the result of this code (endpoint triggering) on the CloudWatch dashboard.
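
To give an intuition for what a chat template does, here is a hand-written approximation of the Llama 2 chat format. It is for illustration only: the function `format_llama2_chat` is hypothetical, it handles just one optional system message followed by one user message, and the template shipped with the tokenizer remains the authoritative version (other models, such as Llama 3, use a different format).

```python
def format_llama2_chat(messages):
    """Approximate the Llama 2 chat prompt for [system?, user] messages."""
    system_block = ""
    if messages and messages[0]["role"] == "system":
        # The system prompt is wrapped in <<SYS>> markers inside the first turn
        system_block = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]

    user_text = messages[0]["content"].strip()
    # The user turn is enclosed in [INST] ... [/INST]
    return f"<s>[INST] {system_block}{user_text} [/INST]"


example_messages = [
    {"role": "system", "content": "You are a helpful Travel Assistant."},
    {"role": "user", "content": "Where is a good vacation destination for summer?"},
]
print(format_llama2_chat(example_messages))
```

The model sees a single string with special markers rather than a list of messages, which is why the prompt must be rendered with the template that matches the deployed model.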



<center><figure><img src="imgs/cloud-watch.png" alt="drawing" width="1000"/><figcaption>Fig. 3: GPU utilization of the endpoint</figcaption></figure></center> 

In [21]:
autoscale.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/" + endpoint_name + "/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MaxResults=100
)

{'ScalingActivities': [{'ActivityId': '7bbe8709-4519-4e52-99f1-a0da6a87ad26',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic',
   'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
   'Description': 'Setting desired instance count to 2.',
   'Cause': 'monitor alarm TargetTracking-endpoint/llama3-chat-7933751b-cce5-40ed-b7a4-ab1a9c182a04/variant/AllTraffic-AlarmHigh-e3070ea0-e84e-4491-b1e7-6bef30bdb5da in state ALARM triggered policy autoscale-policy-llama3-8b',
   'StartTime': datetime.datetime(2024, 5, 28, 23, 16, 11, 162000, tzinfo=tzlocal()),
   'StatusCode': 'InProgress',
   'StatusMessage': 'Successfully set desired instance count to 2. Waiting for change to be fulfilled by sagemaker.'}],
 'ResponseMetadata': {'RequestId': 'a2d5fab8-819c-4d1a-966e-15852efec88e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a2d5fab8-819c-4d1a-966e-15852efec88e',
   'content-type': 'app

Looking closely at the output, we can see that the desired number of instances has been successfully set to 2.

# Experiment cost

As you can see, it takes about 22 minutes to deploy the model to an endpoint. The loop to trigger auto-scaling takes around 4 minutes. The usage rate of an <code>ml.g5.48xlarge</code> instance is $\$20.36$ per hour, which comes to around $\$7.47$ to deploy the model and $\$1.36$ for the auto-scaling triggering loop. Considering the whole experiment, it will cost you around $\$10$ to successfully run this notebook.
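
The estimate above is simple proportional arithmetic, sketched below. The hourly rate is the one quoted in this article and may differ by region and over time; the function name `cost_usd` is an illustrative assumption.

```python
HOURLY_RATE_USD = 20.36  # ml.g5.48xlarge rate quoted in this article


def cost_usd(minutes, hourly_rate=HOURLY_RATE_USD, instances=1):
    """Cost of keeping `instances` instances running for `minutes` minutes."""
    return minutes / 60 * hourly_rate * instances


deploy_cost = cost_usd(22)  # model deployment
loop_cost = cost_usd(4)     # auto-scaling trigger loop
print(f"deploy: ${deploy_cost:.2f}, trigger loop: ${loop_cost:.2f}")
```

The `instances` argument matters once auto-scaling has added a second instance, since both are billed until the endpoint is deleted.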

# Don't forget to clean up

We need to delete the inference component, the model, and the endpoint to stop any unwanted charges. Otherwise, since this is an expensive instance, you will pay for your forgetfulness!

In [22]:
# Look up the inference component attached to our endpoint
inference_component = llm_model.sagemaker_session.list_inference_components(
    endpoint_name_equals=llm.endpoint_name
).get("InferenceComponents")[0].get("InferenceComponentName")
# Delete the inference component first, then the model and the endpoint
llm_model.sagemaker_session.delete_inference_component(inference_component_name=inference_component)
llm.delete_model()
llm.delete_endpoint()