# AutoScaling SageMaker Endpoints

This sample reuses the notebook code from Part 1 of the Inference Video Series to create a SageMaker Endpoint: https://www.youtube.com/watch?v=omFOOr4elnc&list=PLThJtS7RDkOeo9mpNjFVnIGDyiazAm9Uk&index=2. For more context on the code and further details, please follow the original notebook.

In this notebook we explore how to enable AutoScaling for SageMaker Endpoints, expanding on the previous section, which covered [Load Testing SM Endpoints](https://www.youtube.com/watch?v=ZURoZZbiqj0&t=1120s). Note that there are also features such as [Scale Down to Zero](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/) for Real-Time Endpoints, which we'll explore in more depth in future sections.

### Additional Resources/Credits/References

- [AutoScaling Blog](https://towardsdatascience.com/autoscaling-sagemaker-real-time-endpoints-b1b6e6731c59/)
- [Scale Down To Zero Blog](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/)
- [AutoScaling Docs](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html)
- [Transformers AutoScaling Sample](https://github.com/philschmid/huggingface-sagemaker-workshop-series/blob/main/workshop_2_going_production/lab3_autoscaling.ipynb)

## Setup & Environment
We will be working in an ml.c5.2xlarge SageMaker Classic Notebook Instance using a conda_python3 kernel. You can optionally use SageMaker Studio or any other environment where you have proper credentials for working with SageMaker. Here we scale our endpoint up to two ml.c5.xlarge instances, so ensure that your account has quota for that many or submit a limit increase request.

## Endpoint Creation
For an end-to-end, explained guide to pre-trained model deployment, please refer to the earlier notebook: https://github.com/RamVegiraju/SageMaker-Deployment/blob/master/SM-Inference-Video-Series/Pre-Trained-Model-Dept/pre-trained-sklearn-model-dept.ipynb

In [None]:
!pip install -U sagemaker boto3 --quiet

In [None]:
# standard library
import json
import os
import pickle
import subprocess
import tarfile
import time
from pathlib import Path
from time import gmtime, strftime

# third-party
import boto3
import joblib
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute, preferred over the private _region_name
account_id = sess.account_id()
s3_model_prefix = "djl-sme-sklearn-regression" 

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [None]:
%%sh
python3 local_model.py
tar -cvpzf model.tar.gz model.joblib requirements.txt model.py serving.properties

In [None]:
# upload model data to S3
with open("model.tar.gz", "rb") as f:
    s3_client.upload_fileobj(f, bucket, "{}/model.tar.gz".format(s3_model_prefix))
sme_artifacts = "s3://{}/{}/{}".format(bucket, s3_model_prefix, "model.tar.gz")
# the image URI is region-specific: this one targets us-east-1, so replace it with the URI for your region; we use the CPU image here
inference_image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-cpu-full'
print(f"Pushing the data to the following location: {sme_artifacts}")
print(f"Using the following serving image: {inference_image_uri}")

In [None]:
#Step 1: Model Creation
sme_model_name = "sklearn-djl-sme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + sme_model_name)

create_model_response = sm_client.create_model(
    ModelName=sme_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "Mode": "SingleModel", "ModelDataUrl": sme_artifacts},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

# Step 2: Endpoint configuration (EPC) creation
variant_name = "sklearnvariant"
sme_epc_name = "sklearn-djl-sme-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=sme_epc_name,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "ModelName": sme_model_name,
            "InstanceType": "ml.c5.xlarge",
            "InitialInstanceCount": 1
        },
    ],
)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

# Step 3: Endpoint (EP) creation
sme_endpoint_name = "sklearn-djl-ep-sme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=sme_endpoint_name,
    EndpointConfigName=sme_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

#Monitor creation
describe_endpoint_response = sm_client.describe_endpoint(EndpointName=sme_endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = sm_client.describe_endpoint(EndpointName=sme_endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
    time.sleep(15)
print(describe_endpoint_response)
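
The monitoring loop above exits on any status other than `Creating`, including `Failed`. As a minimal sketch of a stricter wait, the hypothetical helper below (`wait_for_endpoint` is not part of boto3 or the original notebook) polls until the endpoint is `InService`, raises on a terminal failure status, and gives up after a timeout. The default timeout and the injectable `sleep` and `clock` arguments are assumptions made for illustration and testing.

```python
import time

# Statuses that mean "still working, keep polling".
TRANSITIONAL_STATUSES = ("Creating", "Updating", "SystemUpdating", "RollingBack")


def wait_for_endpoint(describe, endpoint_name, poll_seconds=15, timeout_seconds=1800,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the endpoint is InService, fails, or the timeout expires.

    `describe` is any callable shaped like sm_client.describe_endpoint.
    """
    deadline = clock() + timeout_seconds
    while True:
        response = describe(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return response
        if status not in TRANSITIONAL_STATUSES:
            # For example "Failed"; FailureReason is only present on failure.
            reason = response.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
        if clock() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_endpoint(sm_client.describe_endpoint, sme_endpoint_name)`.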

In [None]:
# sample invocation
import json
content_type = "application/json"
request_body = '[[0.5]]' #replace with your request body

response = smr_client.invoke_endpoint(
    EndpointName=sme_endpoint_name,
    ContentType=content_type,
    Body=request_body)
result = json.loads(response['Body'].read().decode())
print(result)

## Enabling AutoScaling for SageMaker Endpoint
SageMaker Real-Time Endpoints are integrated with <b>Application AutoScaling</b>: https://docs.aws.amazon.com/autoscaling/application/userguide/what-is-application-auto-scaling.html. 

For a SageMaker Endpoint variant, we can define different types of scaling policies using the CloudWatch metrics that SM Endpoints support: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation. Here we use <b>invocations per instance</b> as the target metric, but you can also AutoScale on another metric of your choice (for example, CPU or GPU utilization if you want to scale based on hardware saturation).
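
To build intuition for how a per-instance target translates into an instance count, here is a simplified back-of-the-envelope sketch. The function `estimate_desired_instances` is hypothetical and is not how Application Auto Scaling is implemented: the real service acts through CloudWatch alarms and honors cooldowns. The sketch only assumes that capacity is sized so each instance receives roughly the target number of invocations per minute, clamped to the registered minimum and maximum.

```python
import math


def estimate_desired_instances(invocations_per_minute, target_per_instance,
                               min_count, max_count):
    """Rough estimate of the instance count a target-tracking policy aims for."""
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    # Enough instances that each one sees at most the target load.
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # Never go below the minimum or above the maximum capacity.
    return max(min_count, min(max_count, needed))


# Target of 10 invocations per instance per minute, capacity between 1 and 2.
for load in (5, 10, 15, 200):
    print(load, estimate_desired_instances(load, 10.0, 1, 2))
```

With a maximum of two instances, any sustained load above 10 invocations per minute is enough to request the second instance in this simplified model.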

In [None]:
# AutoScaling client
asg = boto3.client('application-autoscaling')

# Resource type is variant and the unique identifier is the resource ID.
resource_id=f"endpoint/{sme_endpoint_name}/variant/{variant_name}"

# Instance count
min_instance_count = 1
max_instance_count = 2

# scaling configuration
response = asg.register_scalable_target(
    ServiceNamespace='sagemaker',  # namespace of the service that owns the scalable resource
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', 
    MinCapacity=min_instance_count,
    MaxCapacity=max_instance_count
)

# Target tracking scaling policy
# Metric we use is invocations per instance: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation
response = asg.put_scaling_policy(
    PolicyName=f'Request-ScalingPolicy-{sme_endpoint_name}',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 10.0, # target of 10 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 400, # seconds to wait after a scale-in before another can start; raised so we can display the higher instance count later
        'ScaleOutCooldown': 60 # seconds to wait after a scale-out before another can start
    }
)
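
Before load testing, it can help to confirm that the scalable target and policy were actually registered. The sketch below uses the `describe_scalable_targets` and `describe_scaling_policies` calls of the Application Auto Scaling client; the wrapper `summarize_autoscaling` and the shape of its return value are assumptions made for illustration.

```python
def summarize_autoscaling(asg_client, resource_id, namespace="sagemaker"):
    """Return the registered capacity bounds, policies, and alarms for a variant."""
    targets = asg_client.describe_scalable_targets(
        ServiceNamespace=namespace, ResourceIds=[resource_id]
    )["ScalableTargets"]
    policies = asg_client.describe_scaling_policies(
        ServiceNamespace=namespace, ResourceId=resource_id
    )["ScalingPolicies"]
    return {
        # (min, max) instance counts per registered target
        "capacity": [(t["MinCapacity"], t["MaxCapacity"]) for t in targets],
        # (policy name, target value) for each target-tracking policy
        "policies": [
            (p["PolicyName"], p["TargetTrackingScalingPolicyConfiguration"]["TargetValue"])
            for p in policies
            if p["PolicyType"] == "TargetTrackingScaling"
        ],
        # CloudWatch alarms attached to the policies, if any are reported
        "alarms": [a["AlarmName"] for p in policies for a in p.get("Alarms", [])],
    }
```

In this notebook, `summarize_autoscaling(asg, resource_id)` should report a capacity of `(1, 2)` and one policy with a target value of `10.0`.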

## Test AutoScaling
Here we send requests for a fixed duration at a rate that exceeds our scaling target of 10 invocations per instance per minute. We should see the instance count scale to two, with the endpoint status changing to Updating during that timeframe.

In [None]:
# code snippet borrowed from following NB: https://github.com/philschmid/huggingface-sagemaker-workshop-series/blob/main/workshop_2_going_production/lab3_autoscaling.ipynb
request_duration = 250
end_time = time.time() + request_duration
print(f"test will run for {request_duration} seconds")
while time.time() < end_time:
    resp = smr_client.invoke_endpoint(EndpointName=sme_endpoint_name, 
                                      Body=request_body, 
                                      ContentType=content_type)

In [None]:
# re-run this cell if needed; the status should be "Updating" while AutoScaling adds an instance
response = sm_client.describe_endpoint(EndpointName=sme_endpoint_name)
status = response['EndpointStatus']
instance_count = response['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Status: {status}")
print(f"Current Instance count: {instance_count}")

# poll while the endpoint is updating to see the instance count increase over time
while status == 'Updating':
    time.sleep(15)
    response = sm_client.describe_endpoint(EndpointName=sme_endpoint_name)
    status = response['EndpointStatus']
    instance_count = response['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")
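
If the instance count does not change, the scaling activity history usually explains why. The sketch below wraps the `describe_scaling_activities` call of the Application Auto Scaling client; the helper name `recent_scaling_activities` and the fields it keeps are assumptions chosen for readability.

```python
def recent_scaling_activities(asg_client, resource_id, limit=5, namespace="sagemaker"):
    """Return a compact view of the most recent scaling activities for a variant."""
    response = asg_client.describe_scaling_activities(
        ServiceNamespace=namespace,
        ResourceId=resource_id,
        MaxResults=limit,
    )
    return [
        {
            "start": activity.get("StartTime"),
            "status": activity.get("StatusCode"),
            "description": activity.get("Description"),
            "cause": activity.get("Cause"),
        }
        for activity in response["ScalingActivities"]
    ]
```

In this notebook, `recent_scaling_activities(asg, resource_id)` can be printed after the load test to see whether a scale-out was triggered and what caused it.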

## Cleanup
Delete your endpoint so that you do not incur any further costs; optionally, you can enable scale down to zero instead. Also make sure to stop your notebook instance after use.

In [None]:
sm_client.delete_endpoint(EndpointName=sme_endpoint_name)