# Serving Large Language Models (LLMs) at Scale on AWS

## Introduction


Large Language Model (LLM) serving at scale means delivering an LLM as a service to a large number of users efficiently. This requires robust infrastructure that can handle a high volume of requests with low latency.

LLM inference at scale can be achieved through various techniques such as model parallelism, data parallelism, efficient data handling, optimized inference, asynchronous inference, auto-scaling, load balancing, optimized hardware, caching, and using model serving frameworks. The goal is to minimize the latency of individual requests, reduce the computational complexity of the model, and improve the overall system throughput to meet the demands of a larger number of users. 

By serving an LLM at scale, you can make it accessible to a wider audience and enable them to use the model to generate text, translate languages, answer questions, and perform other natural language processing tasks quickly and efficiently.

In this article, I use Amazon SageMaker to show how to control the resources used to serve LLMs at scale, such as the number of GPUs, the amount of memory, and the number of replicas assigned to handle a varying volume of requests. I also show how to attach an auto-scaling policy to the serving endpoint so that it scales automatically as the workload varies.

## AWS Deep Learning Containers


Large Language Models (LLMs) have moved to the forefront of innovation in artificial intelligence, capturing the attention of academic institutions, tech companies, and enthusiasts with their sophisticated capabilities. Models built on architectures like GPT and Llama have rapidly gained traction for a wide range of uses such as language comprehension, conversational interfaces, and automated content creation. This surge in demand has led many companies to explore and integrate LLM-driven features into their products.

However, deploying LLMs on a large scale involves complex engineering challenges. To ensure a seamless user experience, hosting services for LLMs need to maintain quick response times while supporting numerous users simultaneously. Because these models demand substantial resources, standard inference frameworks often fall short of delivering the optimizations needed for efficient resource use and good performance.

Key optimizations that can enhance LLM hosting include:

* Tensor parallelism, which spreads computation across multiple processing units.
* Model quantization, which reduces the model’s memory usage.
* Dynamic batching of requests, which increases processing throughput.
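
To make the effect of quantization and tensor parallelism on memory concrete, here is a rough back-of-the-envelope sketch. The helper name `weight_memory_gb` and the bytes-per-parameter table are illustrative assumptions, and the estimate covers model weights only (no KV cache, activations, or framework overhead).

```python
# Rough weight-memory estimate (weights only; ignores KV cache and activations).
# BYTES_PER_PARAM and weight_memory_gb are illustrative names, not part of any SDK.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str, tensor_parallel_degree: int = 1) -> float:
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    # Tensor parallelism shards the weights roughly evenly across GPUs.
    return total_bytes / tensor_parallel_degree / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"8B parameters in {dtype}: ~{weight_memory_gb(8e9, dtype):.1f} GB")
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, and sharding across N GPUs divides the per-GPU share by N.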


Recently, AWS has released a new Hugging Face Deep Learning Container (DLC) for inference with Large Language Models (LLMs). This new Hugging Face LLM DLC is powered by Text Generation Inference (TGI), an open source, purpose-built solution for deploying and serving Large Language Models. TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs. The Hugging Face LLM DLC incorporates all the aforementioned optimizations as standard features, simplifying the large-scale deployment of LLMs.

# Serving Llama 3 at Scale in SageMaker

Let's start off by installing the required modules

In [None]:
!pip install -U sagemaker transformers

In [4]:
import sagemaker
import boto3

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name
bucket = sess.default_bucket()  # bucket to house artifacts

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region}")

sagemaker role arn: arn:aws:iam::609362070692:role/service-role/AmazonSageMaker-ExecutionRole-20231122T115899
sagemaker session region: us-east-1


Next, we need to retrieve the container URI and pass it to the <code>HuggingFaceModel</code> class through its <code>image_uri</code> argument. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the <code>get_huggingface_llm_image_uri</code> method provided by the SageMaker SDK.

In [20]:
from sagemaker.huggingface import get_huggingface_llm_image_uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
)
from huggingface_hub import login

login(token="Your token")

print(f"llm image uri: {llm_image}")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04


To deploy Llama 3 8B Instruct to Amazon SageMaker, we create a <code>HuggingFaceModel</code> model class and define our endpoint configuration, including the <code>HF_MODEL_ID</code>, <code>instance_type</code>, etc. We will use an ml.p3.8xlarge instance, which has 4 NVIDIA V100 GPUs with 64 GB of GPU memory in total, and assign one GPU to each model replica.

In [21]:
import json
from sagemaker.huggingface import HuggingFaceModel
 
# sagemaker config
instance_type = "ml.p3.8xlarge"
number_of_gpu = 4
health_check_timeout = 300
 
# Define Model and Endpoint configuration parameters
config = {
  'HF_MODEL_ID': "meta-llama/Meta-Llama-3-8B-Instruct",  # gated model on the Hugging Face Hub
  'SM_NUM_GPUS': "1",  # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': "2048",  # Max length of input text
  'MAX_TOTAL_TOKENS': "4096",  # Max length of the generation (including input text)
  # Limits the number of tokens that can be processed in parallel during generation.
  # The context window of Llama 3 models is 8192 tokens.
  'MAX_BATCH_TOTAL_TOKENS': "8192",
  # Assumption: the container needs its own token to download the gated weights;
  # the local login() above does not reach the endpoint.
  'HUGGING_FACE_HUB_TOKEN': "Your token",
}
 
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

With <code>ResourceRequirements</code> you can assign endpoint resources to a model. These resources include CPU cores, accelerators, and memory.

In [22]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

llama3_resource_config = ResourceRequirements(
    requests = {
        "copies": 4,  # Number of replicas
        "num_accelerators": 1,  # Number of GPUs per replica
        "num_cpus": 6,  # CPU cores per replica: 32 vCPUs // number of replicas, minus headroom for management
        "memory": 40 * 1024,  # Minimum memory (MB) per replica: 244 GB // number of replicas, minus headroom for management
    },
)
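
The per-replica numbers above follow from dividing the instance's resources by the number of replicas and leaving some headroom. The sketch below shows that arithmetic; `split_instance`, `ReplicaResources`, and the reserve values are hypothetical and were chosen only so that the result matches the configuration above for an instance with 4 GPUs, 32 vCPUs, and 244 GB of memory.

```python
from dataclasses import dataclass


@dataclass
class ReplicaResources:
    copies: int
    num_accelerators: int
    num_cpus: int
    memory_mb: int

    def as_requests(self) -> dict:
        """Return the dict shape used for the `requests` argument above."""
        return {
            "copies": self.copies,
            "num_accelerators": self.num_accelerators,
            "num_cpus": self.num_cpus,
            "memory": self.memory_mb,
        }


def split_instance(total_gpus: int, total_vcpus: int, total_memory_gb: int,
                   gpus_per_copy: int = 1,
                   cpu_reserve: int = 8,          # assumed headroom, not an AWS value
                   memory_reserve_gb: int = 84    # assumed headroom, not an AWS value
                   ) -> ReplicaResources:
    """Divide what is left after the reserve evenly across replicas."""
    copies = total_gpus // gpus_per_copy
    if copies < 1:
        raise ValueError("not enough GPUs for a single replica")
    num_cpus = (total_vcpus - cpu_reserve) // copies
    memory_mb = (total_memory_gb - memory_reserve_gb) // copies * 1024
    return ReplicaResources(copies, gpus_per_copy, num_cpus, memory_mb)


if __name__ == "__main__":
    print(split_instance(total_gpus=4, total_vcpus=32, total_memory_gb=244).as_requests())
```

How much headroom to reserve is a judgment call; treat the defaults here as placeholders to tune for your own workload.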

In [None]:
%%time
import uuid
from sagemaker.enums import EndpointType

# Keep the name in a variable; the autoscaling calls later in the notebook reuse it
endpoint_name = f"llama3-chat-{uuid.uuid4()}"  # name needs to be unique

llm = llm_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type=instance_type, # base instance type
    resources=llama3_resource_config, # resource config for multi-replica
    container_startup_health_check_timeout=health_check_timeout, # seconds (300 s = 5 minutes) allowed for loading the model
    endpoint_name=endpoint_name,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED, # needed to use resource config
    tags=[{"Key": "aKey", "Value": "aValue"}],
    model_name="llama3-chat"
)

# AutoScaling 

Amazon SageMaker offers automatic scaling (auto scaling) for hosted models, which dynamically adjusts the number of instances based on workload changes. When the workload increases, auto scaling activates additional instances. Conversely, when the workload decreases, it removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

First, we need to define a scaling policy that adds or removes instances for our production endpoint in response to workload changes.


# Autoscaling Policies

There are three main types of autoscaling policies for SageMaker Endpoints: target tracking, simple, and step scaling:

* Target Tracking:
With the target tracking scaling policy, you choose an Amazon CloudWatch metric and a target value, such as SageMakerVariantInvocationsPerInstance = 100, and SageMaker keeps SageMakerVariantInvocationsPerInstance at or close to 100. This approach is very common due to its ease of configuration.

* Simple Scaling:
The simple scaling policy triggers a scaling event based on a specified metric at a defined threshold with a fixed amount of scaling. For instance, "when SageMakerVariantInvocationsPerInstance > 1000, add 10 instances." This strategy requires more configuration but offers greater control compared to target tracking.

* Step Scaling:
You can use step scaling when you require an advanced configuration, such as specifying how many instances to deploy under what conditions. For example, "when SageMakerVariantInvocationsPerInstance > 1000, add 10 instances; when SageMakerVariantInvocationsPerInstance > 2000, add 50 instances." This approach demands the most configuration but provides the highest level of control, especially for handling spiky traffic.
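
To illustrate the step scaling example above, here is a simplified simulation in plain Python. It is not the Application Auto Scaling API: the `Step` tuple layout and `step_adjustment` are hypothetical, and the bounds are expressed directly as metric values to mirror the prose.

```python
from typing import List, Optional, Tuple

# (lower_bound, upper_bound, instances_to_add); upper_bound None means unbounded.
# This layout is an illustrative assumption, not the AWS StepAdjustments schema.
Step = Tuple[float, Optional[float], int]

STEPS: List[Step] = [
    (1000, 2000, 10),   # metric in (1000, 2000]  -> add 10 instances
    (2000, None, 50),   # metric above 2000       -> add 50 instances
]


def step_adjustment(metric_value: float, steps: List[Step]) -> int:
    """Return how many instances to add for the first step that matches."""
    for lower, upper, instances_to_add in steps:
        if metric_value > lower and (upper is None or metric_value <= upper):
            return instances_to_add
    return 0  # no step matched, so no scaling action


if __name__ == "__main__":
    for value in (500, 1500, 2500):
        print(value, "->", step_adjustment(value, STEPS))
```

Simple scaling corresponds to the special case with a single step.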

In [None]:
autoscale = boto3.Session().client(service_name="application-autoscaling")

First, we need to register the resource as a scalable target.

In [None]:
autoscale.register_scalable_target(
    ServiceNamespace="sagemaker", #1
    ResourceId="endpoint/" + endpoint_name + "/variant/AllTraffic", #2
    ScalableDimension="sagemaker:variant:DesiredInstanceCount", #3
    MinCapacity=1, #4
    MaxCapacity=2, #5
    RoleARN=role,
)

In the above configuration, we have defined: #1 the AWS service namespace; #2 the identifier of the resource associated with the scalable target; #3 the scalable dimension, i.e. the property that is scaled (here, the number of instances behind a SageMaker endpoint variant); #4 and #5 the minimum and maximum capacity that you plan to scale in and scale out to. Please refer to the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling/client/register_scalable_target.html) for other configurable parameters.

After registering the endpoint variant as a scalable target, we can create or update the scaling policy for that target using <code>put_scaling_policy</code>.

In [None]:
autoscale.put_scaling_policy(
    PolicyName="autoscale-policy-llama3-8b",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/" + endpoint_name + "/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 20,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'            
        }, 
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 300,
    }
)

Here, we need to define <code>PolicyType</code>, which is one of the auto-scaling policies described earlier. <code>TargetValue</code> is the value of the metric that the policy tries to maintain by scaling the defined scalable dimension (e.g., sagemaker:variant:DesiredInstanceCount) out or in. For example, in the above configuration, when the number of invocations per instance (SageMakerVariantInvocationsPerInstance) exceeds 20 per minute, a new instance is added to our deployed endpoint.
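
As a rough mental model of target tracking, the desired capacity can be approximated as proportional to how far the metric is from its target, clamped to the registered minimum and maximum. This is only an approximation for intuition; the actual behavior is driven by CloudWatch alarms managed by Application Auto Scaling, and `desired_instances` is a hypothetical helper.

```python
import math


def desired_instances(current_instances: int,
                      invocations_per_instance: float,
                      target_value: float,
                      min_capacity: int,
                      max_capacity: int) -> int:
    """Proportional estimate of the instance count needed to return to the target."""
    raw = math.ceil(current_instances * invocations_per_instance / target_value)
    # Never go below MinCapacity or above MaxCapacity.
    return max(min_capacity, min(max_capacity, raw))


if __name__ == "__main__":
    # Values mirror the configuration above: TargetValue=20, MinCapacity=1, MaxCapacity=2.
    print(desired_instances(1, 45, 20, 1, 2))  # load well above target
    print(desired_instances(2, 5, 20, 1, 2))   # load well below target
```

With a maximum capacity of 2, even a large spike can add at most one instance in this setup.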

A cooldown period defines how long the scaling policy waits before initiating another scaling action. This mechanism helps prevent over-scaling.

<code>ScaleOutCooldown</code>: After a successful scale-out, the auto scaler starts counting down the cooldown period. The policy will not increase the desired capacity again unless a larger scale-out is triggered or the cooldown period ends.

<code>ScaleInCooldown</code>: To protect application availability, further scale-in activities are suspended until the scale-in cooldown period has ended. The default value is 300 seconds for both.

Figure 1 shows how auto-scaling works. 

<center><figure><img src="imgs/scaling.png" alt="drawing" width="800"/><figcaption>Fig. 1: Auto-scaling Policy</figcaption></figure></center> 

## Trigger autoscaling

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config["HF_MODEL_ID"])

# Conversational messages
messages = [
  {"role": "system", "content": "You are a helpful Travel Assistant."},
  {"role": "user", "content": "Where is a good vacation destination in North America for summer?"},
]

# generation parameters
parameters = {
    "do_sample" : True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 50,
    "repetition_penalty": 1.03,
    "return_full_text": False,
}

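# Render the prompt once; add_generation_prompt=True appends the assistant header
# so that the model answers instead of continuing the user turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Send 100 sequential requests to put load on the endpoint
for i in range(100):
    res = llm.predict({"inputs": prompt, "parameters": parameters})
    print(res)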

In the above code snippet, <code>apply_chat_template</code> converts the messages into the chat template of the underlying LLM.
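
To show what such a template produces, the sketch below renders messages by hand in what I understand to be the Llama 3 Instruct prompt format. `render_llama3_prompt` is a hypothetical helper written for illustration; the template shipped with the tokenizer remains the authoritative source and should be used in practice.

```python
from typing import Dict, List


def render_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Hand-rolled approximation of the Llama 3 Instruct chat format."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Trailing assistant header: this is the part that
    # add_generation_prompt=True adds in apply_chat_template.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "You are a helpful Travel Assistant."},
        {"role": "user", "content": "Where is a good vacation destination?"},
    ]
    print(render_llama3_prompt(example))
```

Each model family uses its own special tokens, which is why relying on the tokenizer's template is preferable to hard-coding one.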

In [None]:
autoscale.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/" + endpoint_name + "/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MaxResults=100
)
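
The response of <code>describe_scaling_activities</code> is a dictionary whose <code>ScalingActivities</code> list can be long. A small helper like the one below can condense it to one line per activity. `summarize_activities` is a hypothetical helper, and the sample response is made up for illustration; only the field names follow the boto3 response shape.

```python
from typing import Any, Dict, List


def summarize_activities(response: Dict[str, Any]) -> List[str]:
    """Condense each scaling activity into 'start | status | description'."""
    lines = []
    for activity in response.get("ScalingActivities", []):
        lines.append(
            f"{activity.get('StartTime')} | "
            f"{activity.get('StatusCode')} | "
            f"{activity.get('Description')}"
        )
    return lines


if __name__ == "__main__":
    # Fabricated sample for demonstration only, not real endpoint output.
    sample = {
        "ScalingActivities": [
            {
                "StartTime": "2024-01-01T00:00:00Z",
                "StatusCode": "Successful",
                "Description": "Setting desired instance count to 2.",
            }
        ]
    }
    for line in summarize_activities(sample):
        print(line)
```

In the notebook, you would pass the return value of the call above to this helper instead of the sample dictionary.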