# Auto Scaling of SageMaker Endpoints

SageMaker allows you to automatically scale out (increase the number of instances) and scale in (decrease the number of instances) for real-time endpoints and asynchronous endpoints. When inference traffic increases, scaling out maintains steady endpoint performance while keeping costs to a minimum. When inference traffic decreases, scaling in allows you to minimize inference costs. For real-time endpoints, the minimum instance count is 1; asynchronous endpoints can scale in to 0 instances. The following diagram shows this:

In this example, we will learn how to apply the target tracking autoscaling policy to a real-time endpoint. 

## Deploy Real-time Endpoint

We start by creating a regular SageMaker real-time endpoint. Follow the steps below to create an endpoint with the `T5-small` model from the Hugging Face model hub, which can be used for different NLP tasks such as summarization, translation, text classification, and others.

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()

In [None]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
TRANSFORMER_VERSION = "4.17.0"

In [None]:
# Model parameters
hub = {
	'HF_MODEL_ID':'t5-small',
	'HF_TASK':'translation'
}

huggingface_model = HuggingFaceModel(
	transformers_version=TRANSFORMER_VERSION,
	pytorch_version=PYTORCH_VERSION,
	py_version=PYTHON_VERSION,
	env=hub,
	role=role, 
)

predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type='ml.m5.xlarge'
)

Once the endpoint is deployed, let's send a test sample:

In [None]:
response = predictor.predict({
	'inputs': "Berlin is the capital and largest city of Germany by both area and population"
})

print(response)
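
To work with the result programmatically, the translated text has to be pulled out of `response`. The following is a minimal sketch that assumes the endpoint returns the Hugging Face translation pipeline format, that is, a list of dictionaries with a `translation_text` key. The helper name `extract_translations` is hypothetical and not part of the SageMaker SDK.

```python
def extract_translations(response):
    """Return the translated strings from a translation-pipeline style response.

    Assumption: `response` is a list of dicts, each with a 'translation_text' key.
    """
    if not isinstance(response, list):
        raise TypeError(f"Expected a list, got {type(response).__name__}")
    return [item["translation_text"] for item in response]
```

In the notebook, `extract_translations(response)` would then give a list with one string per input.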

## Applying Auto Scaling Policies

Next, we apply Auto Scaling to the running endpoint. For this, we will create two autoscaling resources: a scalable target and a scaling policy. The scalable target defines a specific AWS resource that we want to scale using the Application Auto Scaling service. In the following code snippet, we instantiate the client for the Application Auto Scaling service and register our SageMaker endpoint as a scalable target with the following parameters:
- `ResourceId` parameter defines a reference to a specific endpoint and production variant.
- `ScalableDimension` parameter for SageMaker resources always references the number of instances behind the production variant. 
- `MinCapacity` and `MaxCapacity` define the instance scaling range.

In [58]:
import boto3

as_client = boto3.client('application-autoscaling')
 
resource_id=f"endpoint/{predictor.endpoint_name}/variant/AllTraffic"
policy_name = f'Request-ScalingPolicy-{predictor.endpoint_name}'
scalable_dimension = 'sagemaker:variant:DesiredInstanceCount'

# define scaling configuration
response = as_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=1,
    MaxCapacity=4
)
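
The `ResourceId` format and the capacity range are easy to get wrong, so it can help to check them before calling `register_scalable_target`. The sketch below is one possible way to do that. The helper names `build_resource_id` and `validate_capacity_range` are hypothetical, the endpoint name is a made-up example, and the lower bound of 1 follows the statement above about real-time endpoints.

```python
def build_resource_id(endpoint_name, variant_name="AllTraffic"):
    """Build the Application Auto Scaling ResourceId for an endpoint variant."""
    if not endpoint_name or not variant_name:
        raise ValueError("endpoint_name and variant_name must be non-empty")
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def validate_capacity_range(min_capacity, max_capacity):
    """Check the range used in this notebook: at least 1 instance and min <= max."""
    if min_capacity < 1:
        raise ValueError("MinCapacity must be at least 1 for a real-time endpoint")
    if max_capacity < min_capacity:
        raise ValueError("MaxCapacity must be greater than or equal to MinCapacity")
    return min_capacity, max_capacity


print(build_resource_id("my-endpoint"))  # "my-endpoint" is a hypothetical name
print(validate_capacity_range(1, 4))
```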

Next, we will create a policy for our scalable target. The scaling policy defines how the endpoint should be scaled based on the target metric.

In [None]:
response = as_client.put_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
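        'TargetValue': 10.0, # target average invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300, # seconds after a scale-in activity before another scale-in can start
        'ScaleOutCooldown': 60 # seconds to wait for a previous scale-out activity to take effect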
    }
)
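
`SageMakerVariantInvocationsPerInstance` is the average number of invocations per minute per instance, so with `TargetValue` set to 10.0 a rough capacity estimate is the total load divided by 10, rounded up and clamped to the registered capacity range. The sketch below shows only this back-of-the-envelope arithmetic; it is not the algorithm Application Auto Scaling uses internally, and the function name and load figures are hypothetical.

```python
import math


def estimate_instance_count(total_invocations_per_minute, target_per_instance,
                            min_capacity=1, max_capacity=4):
    """Estimate how many instances keep per-instance load at or below the target.

    Plain planning arithmetic, not the service's actual scaling algorithm.
    """
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    # Clamp to the MinCapacity / MaxCapacity range of the scalable target.
    return max(min_capacity, min(max_capacity, needed))


for load in (5, 25, 200):  # hypothetical invocations per minute
    print(load, "->", estimate_instance_count(load, target_per_instance=10.0))
```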

## Auto Scaling Endpoint

Now, let's test that our endpoint actually scales automatically according to the policy applied above. For this, we need to generate sufficient inference traffic to breach the target metric value for a duration longer than the scale-out cooldown period. For this purpose, we can use the [Locust.io load testing framework](https://locust.io/), which provides a simple mechanism to mimic various load patterns. Follow the instructions in the notebook to create a Locust configuration for your endpoint and provide your AWS credentials for authorization purposes.

1. We start by installing the Locust Python package locally:

In [None]:
! pip install -r "../utils/load_testing/requirements.txt"

2. Next, we need to create a config file that Locust uses to send inference requests to the SageMaker endpoint. Run the cell below to create the configuration file. Make sure to fill in the following placeholder parameters correctly:
    - AWS region.
    - your endpoint name.
    - AWS access and secret keys.


In [None]:
%%writefile ../utils/load_testing/config.py

# provide configuration parameters
# TODO: clean up config from personal data

HOST = 'runtime.sagemaker.<USE YOUR REGION>.amazonaws.com'
REGION = '<USE YOUR REGION>'
# replace the url below with the sagemaker endpoint you are load testing
ENDPOINT_NAME = "USE YOUR ENDPOINT NAME"
SAGEMAKER_ENDPOINT_URL = f'https://{HOST}/endpoints/{ENDPOINT_NAME}/invocations'
ACCESS_KEY = '<USE YOUR AWS ACCESS KEY HERE>'
SECRET_KEY = '<USE YOUR AWS SECRET KEY HERE>'
# replace the content type below as per your requirements
CONTENT_TYPE = 'application/json'
METHOD = 'POST'
SERVICE = 'sagemaker'
SIGNED_HEADERS = 'content-type;host;x-amz-date'
CANONICAL_QUERY_STRING = ''
ALGORITHM = 'AWS4-HMAC-SHA256'
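
The config above spells out the region in more than one place and stores the access keys in a file. One possible alternative, sketched below, derives the host and URL from a single region value and reads the keys from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper names `build_invocation_url` and `load_credentials_from_env` are hypothetical, as are the region and endpoint name in the usage lines.

```python
import os


def build_invocation_url(region, endpoint_name):
    """Return (host, url) for invoking a SageMaker endpoint in the given region."""
    host = f"runtime.sagemaker.{region}.amazonaws.com"
    url = f"https://{host}/endpoints/{endpoint_name}/invocations"
    return host, url


def load_credentials_from_env(environ=os.environ):
    """Read the access key pair from the standard AWS environment variables."""
    access_key = environ.get("AWS_ACCESS_KEY_ID")
    secret_key = environ.get("AWS_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        raise RuntimeError("AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are not set")
    return access_key, secret_key


host, url = build_invocation_url("us-east-1", "my-endpoint")  # hypothetical values
print(host)
print(url)
```

Keeping the keys out of `config.py` also avoids committing personal data by accident.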

3. Now, run the following command in a separate console to start generating simultaneous requests to the SageMaker endpoint, where `-u` is the number of concurrent users and `-r` is the spawn rate (users per second):
```bash 
    locust -f ../utils/load_testing/locustfile.py --headless -u 20 -r 1 --run-time 5m
```

4. During the load test, you can observe your endpoint status as well as the associated scaling alarms in the Amazon CloudWatch console. First, you can see that scale-out and scale-in alarms have been configured based on the provided cooldown periods and target metric value:

    <img src="static/scaling_alerts.png" width="600">

5. After the initial scale-out cooldown period has passed, the scale-out alarm switches to the In alarm state, which causes the endpoint to scale out. Note that in the following screenshot, the red line is the desired value of the tracking metric, while the blue line is the number of invocations per endpoint instance:

    <img src="static/triggered_alert.png" width="600">

6. After scale-out is triggered, your endpoint status changes from `InService` to `Updating`. Now we can call the `describe_endpoint()` method to confirm that the number of instances has increased. Since we are generating a sufficiently large concurrent load in a short period, SageMaker immediately scales our endpoint to the maximum number of instances. Run the cell below to observe how SageMaker updates the instance count behind the endpoint once scaling out is triggered.

In [None]:
import time

sm_client = sagemaker_session.sagemaker_client # SageMaker boto3 client

endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
status = endpoint_description['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
    status = endpoint_description['EndpointStatus']
    instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")

7. Once the load test has finished and the scale-in cooldown period has passed, we should expect our endpoint to scale in. Run the cell below to get the current instance count.

In [None]:
endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Endpoint instance count: {instance_count}")

## Updating Endpoint Manually

SageMaker also allows you to manually update the instance count behind an endpoint. Let's see how this can be done.

1. First, we delete the scaling policy and deregister the scalable target to disable Auto Scaling for our endpoint:

In [None]:
response = as_client.delete_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension
)

response = as_client.deregister_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension
)
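
To confirm that Auto Scaling is really disabled, one option is to ask Application Auto Scaling whether the endpoint variant is still registered. The sketch below wraps the `describe_scalable_targets` call in a hypothetical helper, `is_autoscaling_registered`, that takes the client as an argument so it can be exercised without AWS access.

```python
def is_autoscaling_registered(as_client, resource_id, scalable_dimension):
    """Return True if the resource is still registered as a scalable target."""
    response = as_client.describe_scalable_targets(
        ServiceNamespace="sagemaker",
        ResourceIds=[resource_id],
        ScalableDimension=scalable_dimension,
    )
    # An empty list means no scalable target matches this resource.
    return len(response.get("ScalableTargets", [])) > 0
```

In the notebook, `is_autoscaling_registered(as_client, resource_id, scalable_dimension)` would be expected to return `False` once the deregistration above has succeeded.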

2. Next, we add one more instance to the current instance fleet behind the endpoint:

In [None]:
endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']

print(f"Current instance count: {instance_count}")

target_instance_count = (int(instance_count)+1)

sm_client.update_endpoint_weights_and_capacities(EndpointName=predictor.endpoint_name,
                            DesiredWeightsAndCapacities=[
                                {
                                    'VariantName': 'AllTraffic',
                                    'DesiredInstanceCount': target_instance_count
                                }
                            ])

3. Let's observe the endpoint update process and confirm that the instance count is updated:

In [None]:
endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
status = endpoint_description['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
    status = endpoint_description['EndpointStatus']
    instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")

endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Current instance count: {instance_count}")
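
The polling loops in this notebook run until the status leaves `Updating`, with no upper bound on the wait. A variant with a timeout and a configurable poll interval could look like the sketch below. The helper name `wait_while_updating` and its default values are assumptions; it accepts any zero-argument callable that returns a `describe_endpoint`-style dictionary.

```python
import time


def wait_while_updating(describe, poll_seconds=15, timeout_seconds=1800):
    """Poll `describe()` until the endpoint leaves 'Updating' or the timeout expires.

    `describe` is any zero-argument callable returning a dict with 'EndpointStatus'.
    Returns the last description seen.
    """
    deadline = time.monotonic() + timeout_seconds
    description = describe()
    while description["EndpointStatus"] == "Updating":
        if time.monotonic() >= deadline:
            raise TimeoutError("endpoint is still updating after the timeout")
        time.sleep(poll_seconds)
        description = describe()
    return description
```

In the notebook it could be called as `wait_while_updating(lambda: sm_client.describe_endpoint(EndpointName=predictor.endpoint_name))`, after which the instance count can be read from the returned description.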

### Resource Clean up

Run the following cell to delete the cloud resources:

In [None]:
sm_client.delete_endpoint(EndpointName=predictor.endpoint_name)
sm_client.delete_model(ModelName=huggingface_model.name)  # pass the model name, not the model object