# Autoscale SageMaker endpoints

## In this notebook, we will explore three distinct ways to scale SageMaker endpoints.

![Auto Scaling Groups](as-basic-diagram.png)

#### 1.	Autoscale using Auto Scaling Groups and Simple Scaling
#### 2.	Autoscale using Auto Scaling Groups and Step Scaling 
#### 3.	Autoscale on demand, without defining a trigger a priori – using `update_endpoint_weights_and_capacities` API call

![Update API](update_api.png)

With step scaling and simple scaling, you choose the scaling metrics and threshold values for the CloudWatch alarms that trigger the scaling process. You also define how your Auto Scaling group should be scaled when a threshold is in breach for a specified number of evaluation periods. We strongly recommend that you use a target tracking scaling policy to scale on a metric like average CPU utilization or the `SageMakerVariantInvocationsPerInstance` metric.

Metrics that decrease when capacity increases and increase when capacity decreases can be used to proportionally scale out or in the number of instances using target tracking. 
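As a rough illustration of this proportional relationship, the sketch below estimates the instance count that would bring a per-instance metric back to its target. The helper name `estimate_desired_instances` and the clamping arguments are assumptions made for illustration. The real target tracking logic is managed by Application Auto Scaling and also takes cooldowns and alarm evaluation periods into account.

```python
import math


def estimate_desired_instances(current_instances, metric_value, target_value,
                               min_capacity=1, max_capacity=2):
    """Rough proportional estimate (hypothetical helper, not the service's algorithm).

    If each instance currently sees `metric_value` and we want it to see
    `target_value`, capacity has to change by the ratio of the two.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_instances * metric_value / target_value)
    # Keep the result inside the registered capacity range.
    return max(min_capacity, min(max_capacity, raw))


# One instance handling 25 invocations per minute against a target of 10.
print(estimate_desired_instances(1, 25.0, 10.0, max_capacity=5))
```

With a `MaxCapacity` of 2, as registered later in this notebook, the same load would be capped at two instances.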

![SageMaker AutoScaling](SM-AS.png)

You still have the option to use step scaling as an additional policy for a more advanced configuration. For example, you can configure a more aggressive response when demand reaches a certain level.

Step scaling policies and simple scaling policies are two of the dynamic scaling options available for you to use. The main difference between the policy types is the step adjustments that you get with step scaling policies. When step adjustments are applied, and they increase or decrease the current capacity of your Auto Scaling group, the adjustments vary based on the size of the alarm breach.


In [None]:
import pprint
import boto3
import time
from time import gmtime, strftime
from sagemaker import get_execution_role
import sagemaker
import json
from IPython.display import clear_output

pp = pprint.PrettyPrinter(indent=4, depth=4)
role = get_execution_role()
sagemaker_client = boto3.Session().client(service_name='sagemaker')

# Copy over the endpoint name from our other notebook

endpoint_name = 'chazarey-mxnet-serving-160-gpu-py2-2020-06-11-04-39-32-640'

In [None]:
response = sagemaker_client.describe_endpoint(
    EndpointName=endpoint_name
)
pp.pprint(response) # We are interested in 'EndpointStatus', 'CurrentInstanceCount' and 'DesiredInstanceCount'; the same values can be seen in the console

### Apply a scaling policy to autoscale based on number of times per minute that each instance is invoked

In [None]:
client = boto3.client('application-autoscaling') # Common class representing Application Auto Scaling for SageMaker amongst other services

resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic' # This is the format in which application autoscaling references the endpoint

response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=2
)

response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', # The namespace of the AWS service that provides the resource. 
    ResourceId=resource_id, # Resource ID of the endpoint variant (endpoint/<name>/variant/<variant>)
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # SageMaker supports only Instance Count
    PolicyType='TargetTrackingScaling', # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 10.0, # The target value for the metric. - here the metric is - SageMakerVariantInvocationsPerInstance
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance', # is the average number of times per minute that each instance for a variant is invoked. 
        },
        'ScaleInCooldown': 600, # The cooldown period helps you prevent your Auto Scaling group from launching or terminating 
                                # additional instances before the effects of previous activities are visible. 
                                # You can configure the length of time based on your instance startup time or other application needs.
                                # ScaleInCooldown - The amount of time, in seconds, after a scale in activity completes before another scale in activity can start. 
        'ScaleOutCooldown': 300 # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        
        # 'DisableScaleIn': True|False - Indicates whether scale in by the target tracking policy is disabled. 
                            # If the value is true , scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    }
)

### Wait while the endpoint status is 'Updating'

In [None]:
%%time

# We need to wait until the status is no longer 'Updating' before applying another scaling operation

response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print("Status: " + status)
    clear_output(wait=True)
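The same polling pattern is repeated several times in this notebook, so it can be wrapped in a small helper. The sketch below is one possible way to do that; the function name `wait_for_endpoint`, the `poll_seconds` and `timeout_seconds` parameters, and their defaults are assumptions for illustration. It takes the describe call as an argument so that it can be tried without AWS credentials.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name,
                      poll_seconds=5, timeout_seconds=1800):
    """Poll until the endpoint leaves 'Updating' and return the final status.

    `describe_endpoint` is any callable with the same keyword interface as
    `sagemaker_client.describe_endpoint` (hypothetical helper, illustrative defaults).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
        if status != 'Updating':
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} is still 'Updating' after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage in this notebook (requires AWS credentials):
# status = wait_for_endpoint(sagemaker_client.describe_endpoint, endpoint_name)
# print("Status: " + status)
```

Polling every few seconds instead of every second reduces the number of API calls, and the timeout keeps the cell from running indefinitely if the endpoint never leaves the 'Updating' state.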

### Apply a scaling policy based on CPUUtilization metric

In [None]:
response = client.put_scaling_policy(
    PolicyName='CPUUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 90.0,
        'CustomizedMetricSpecification':
        {
            'MetricName': 'CPUUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name },
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average', # Possible - 'Statistic': 'Average'|'Minimum'|'Maximum'|'SampleCount'|'Sum'
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 300
    }
)

### Wait while the endpoint status is 'Updating'

In [None]:
%%time

response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print("Status: " + status)
    clear_output(wait=True)

### Apply a step scaling policy based on OverheadLatency metric

In [None]:
response = client.put_scaling_policy(
    PolicyName='OverheadLatency-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='StepScaling', 
    StepScalingPolicyConfiguration={
        'AdjustmentType': 'ChangeInCapacity', # 'PercentChangeInCapacity'|'ExactCapacity' Specifies whether the ScalingAdjustment value in a StepAdjustment 
                                              # is an absolute number or a percentage of the current capacity.
        'StepAdjustments': [ # A set of adjustments that enable you to scale based on the size of the alarm breach.
            {
                'MetricIntervalLowerBound': 0.0, # The lower bound for the difference between the alarm threshold and the CloudWatch metric.
                 # 'MetricIntervalUpperBound': 100.0, # The upper bound for the difference between the alarm threshold and the CloudWatch metric.
                'ScalingAdjustment': 1 # The amount by which to scale, based on the specified adjustment type. 
                                       # A positive value adds to the current capacity while a negative number removes from the current capacity.
            },
        ],
        # 'MinAdjustmentMagnitude': 1, # The minimum number of instances to scale. - only for 'PercentChangeInCapacity'
        'Cooldown': 120,
        'MetricAggregationType': 'Average', # 'Minimum'|'Maximum'
    }
)
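A step scaling policy does not name a metric by itself; it acts when a CloudWatch alarm that lists the policy as an action goes into the alarm state. The sketch below builds the arguments for such an alarm on `OverheadLatency`. The helper name, the alarm name, the threshold (in microseconds), the period, and the number of evaluation periods are all illustrative assumptions and should be derived from load testing your own endpoint.

```python
def build_overhead_latency_alarm(endpoint_name, variant_name, policy_arn,
                                 threshold_microseconds=100000.0):
    """Build keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper).

    All numeric values here are placeholders chosen for illustration.
    """
    return {
        'AlarmName': f'{endpoint_name}-OverheadLatency-High',  # assumed naming scheme
        'Namespace': 'AWS/SageMaker',
        'MetricName': 'OverheadLatency',
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name},
        ],
        'Statistic': 'Average',
        'Period': 60,              # seconds per evaluation period
        'EvaluationPeriods': 3,    # consecutive breaching periods before the alarm fires
        'Threshold': threshold_microseconds,
        'ComparisonOperator': 'GreaterThanOrEqualToThreshold',
        'AlarmActions': [policy_arn],  # the step scaling policy to trigger
    }


# Usage in this notebook (requires AWS credentials); `response` is the result
# of the put_scaling_policy call for the step scaling policy:
# cloudwatch = boto3.client('cloudwatch')
# cloudwatch.put_metric_alarm(
#     **build_overhead_latency_alarm(endpoint_name, 'AllTraffic', response['PolicyARN'])
# )
```

The `MetricIntervalLowerBound` and `MetricIntervalUpperBound` values in the policy are then interpreted relative to this alarm's threshold.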

Another example, scaling by a different amount at each step:

```json
{
  "AdjustmentType": "ChangeInCapacity",
  "MetricAggregationType": "Average",
  "Cooldown": 60,
  "StepAdjustments": [ 
    {
      "MetricIntervalLowerBound": 0,
      "MetricIntervalUpperBound": 15,
      "ScalingAdjustment": 1
    },
    {
      "MetricIntervalLowerBound": 15,
      "MetricIntervalUpperBound": 25,
      "ScalingAdjustment": 2
    },
    {
      "MetricIntervalLowerBound": 25,
      "ScalingAdjustment": 3
    }
  ]
}
```
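To make the step configuration above concrete, the following sketch picks the adjustment for a given metric value. It assumes a scale-out alarm, where the breach is the metric value minus the alarm threshold, the lower bound is inclusive, and the upper bound is exclusive. The function name `select_step_adjustment` and the threshold of 50 are hypothetical and used only for illustration.

```python
def select_step_adjustment(metric_value, alarm_threshold, step_adjustments):
    """Return the ScalingAdjustment whose interval contains the breach size.

    Hypothetical helper that mirrors the interval semantics for a scale-out
    alarm; returns 0 when no step matches (for example, no breach).
    """
    breach = metric_value - alarm_threshold
    for step in step_adjustments:
        lower = step.get('MetricIntervalLowerBound', float('-inf'))
        upper = step.get('MetricIntervalUpperBound', float('inf'))
        if lower <= breach < upper:
            return step['ScalingAdjustment']
    return 0


steps = [
    {'MetricIntervalLowerBound': 0, 'MetricIntervalUpperBound': 15, 'ScalingAdjustment': 1},
    {'MetricIntervalLowerBound': 15, 'MetricIntervalUpperBound': 25, 'ScalingAdjustment': 2},
    {'MetricIntervalLowerBound': 25, 'ScalingAdjustment': 3},
]

# With an assumed alarm threshold of 50, larger breaches add more instances.
for value in (60, 65, 80):
    print(value, select_step_adjustment(value, 50, steps))
```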

### Wait while the endpoint status is 'Updating'

In [None]:
%%time

response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print("Status: " + status)
    clear_output(wait=True)

### List the scaling policies attached to this target

In [None]:
response = client.describe_scaling_policies(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id  # Only list the policies attached to this endpoint variant
)

for i in response['ScalingPolicies']:
    print('')
    pp.pprint(i['PolicyName'])
    print('')
    if('TargetTrackingScalingPolicyConfiguration' in i):
        pp.pprint(i['TargetTrackingScalingPolicyConfiguration'])
    else:
        pp.pprint(i['StepScalingPolicyConfiguration'])
    print('')

## Scale without defining a policy - use - ```update_endpoint_weights_and_capacities```

### Let's see if we have the endpoint available

In [None]:
%%time

response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print("Status: " + status)
    clear_output(wait=True)

In [None]:
response = sagemaker_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'AllTraffic',
            'DesiredInstanceCount': 5
        },
    ]
)
response = sagemaker_client.describe_endpoint(
    EndpointName=endpoint_name
)
pp.pprint(response) # We are interested in 'EndpointStatus', 'CurrentInstanceCount' and 'DesiredInstanceCount'; the same values can be seen in the console

### Wait while the endpoint status is 'Updating'

In [None]:
%%time

response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print("Status: " + status)
    clear_output(wait=True)

### Do it again, but scale down

In [None]:
response = sagemaker_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'AllTraffic',
            'DesiredInstanceCount': 1
        },
    ]
)
response = sagemaker_client.describe_endpoint(
    EndpointName=endpoint_name
)
pp.pprint(response)

### Debug - List the scaling activities performed so far

In [None]:
# Provides descriptive information about the scaling activities in the specified namespace from the previous six weeks.

response = client.describe_scaling_activities(
    ServiceNamespace='sagemaker'
)
pp.pprint(response)
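The full response can be verbose. As a minimal sketch, the helper below reduces it to one row per activity. The function name `summarize_activities` is hypothetical, and the sample response is made-up data shaped like the `ScalingActivities` list, used here only so the snippet runs without AWS credentials.

```python
def summarize_activities(response):
    """Return (status, resource id, description) for each scaling activity.

    Hypothetical helper; missing keys are returned as None.
    """
    return [
        (a.get('StatusCode'), a.get('ResourceId'), a.get('Description'))
        for a in response.get('ScalingActivities', [])
    ]


# Made-up sample with the same shape as a describe_scaling_activities response.
sample_response = {
    'ScalingActivities': [
        {
            'StatusCode': 'Successful',
            'ResourceId': 'endpoint/example-endpoint/variant/AllTraffic',
            'Description': 'Setting desired instance count to 2.',
        },
    ]
}

for row in summarize_activities(sample_response):
    print(row)
```

In the notebook, the same helper can be called on the `response` returned by `client.describe_scaling_activities`.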

# Cleanup

In [None]:
# Delete all the policies attached to this target

# You can delete a scaling policy with the AWS Management Console, 
# the AWS CLI, or the Application Auto Scaling API. You must delete a scaling policy if you wish to update a model's endpoint.

response = client.describe_scaling_policies(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id  # Restrict deletion to the policies attached to this endpoint variant
)

for i in response['ScalingPolicies']:
    print('')
    pp.pprint(i['PolicyName'])
    print('')
    #pp.pprint(i['TargetTrackingScalingPolicyConfiguration']) or pp.pprint(i['StepScalingPolicyConfiguration'])
    print('')
    response = client.delete_scaling_policy(
        PolicyName=i['PolicyName'],
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount'
    )

In [None]:
# Deregister scalable target

response = client.deregister_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount'
)

### References:

* How to define an autoscaling policy - https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-define.html
* How to load test to derive a scaling strategy - https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-scaling-loadtest.html
* API references: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html
* SageMaker CloudWatch Metrics Definitions - https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html
* Customized metric specification - https://docs.aws.amazon.com/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html
* Publishing custom metrics - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html