## Auto scaling

In this example we apply a simple target-tracking scaling policy to a TensorFlow Hugging Face endpoint and load it with synthetic traffic.

In [None]:
# Download artifacts for DistilBert model for Question-Answering task

# -p creates parent directories and does not fail if the cell is re-run
! mkdir -p distilbert-base-uncased-distilled-squad/1
! mkdir -p distilbert-base-uncased-distilled-squad/code

! wget https://huggingface.co/distilbert-base-cased-distilled-squad/resolve/main/saved_model.tar.gz
! tar -zxvf saved_model.tar.gz -C distilbert-base-uncased-distilled-squad/1

! cp 1_src/inference.py distilbert-base-uncased-distilled-squad/code
! cp 1_src/requirements.txt distilbert-base-uncased-distilled-squad/code

In [4]:
!tar -C "$PWD" -czf distilbert-base-uncased-distilled-squad.tar.gz distilbert-base-uncased-distilled-squad/

### Upload model data to S3

In [44]:

import sagemaker
from sagemaker import get_execution_role
import os 

sagemaker_session = sagemaker.Session()
#role = get_execution_role()  # TODO: replace it
role="arn:aws:iam::941656036254:role/service-role/AmazonSageMaker-ExecutionRole-20210904T193230" # TODO: this has to be replaced

bucket = sagemaker_session.default_bucket()
prefix = 'auto-scaling'
s3_path = 's3://{}/{}'.format(bucket, prefix)


In [45]:
model_data = sagemaker_session.upload_data('distilbert-base-uncased-distilled-squad.tar.gz',
                                           bucket,
                                           os.path.join(prefix, 'model-artifacts'))     

print(model_data)                       


s3://sagemaker-us-east-1-941656036254/auto-scaling/model-artifacts/distilbert-base-uncased-distilled-squad.tar.gz


In [46]:
from sagemaker.tensorflow import TensorFlowModel

env = {"NLP_TASK": "question-answering"}

# The "Model" object doesn't create a SageMaker Model until a Transform Job or Endpoint is created.
tensorflow_serving_model = TensorFlowModel(model_data=model_data,
                                 name="qa-tensorflow",
                                 role=role,
                                 framework_version='2.8',
                                 env=env,
                                 sagemaker_session=sagemaker_session)

In [47]:
instance = "ml.c5.2xlarge"

predictor = tensorflow_serving_model.deploy(initial_instance_count=1, instance_type=instance)

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Using already existing model: qa-tensorflow


----!

In [50]:

sm_client = sagemaker_session.sagemaker_client
runtime_sm_client = sagemaker_session.sagemaker_runtime_client



In [51]:
sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)

{'EndpointName': 'qa-tensorflow-2022-08-18-11-38-23-027',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:941656036254:endpoint/qa-tensorflow-2022-08-18-11-38-23-027',
 'EndpointConfigName': 'qa-tensorflow-2022-08-18-11-38-23-027',
 'ProductionVariants': [{'VariantName': 'AllTraffic',
   'DeployedImages': [{'SpecifiedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.8-cpu',
     'ResolvedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference@sha256:d72f9623bab06fcf97cef4cad7a5748926f002a62503fc06c89fb29f09a2beaf',
     'ResolutionTime': datetime.datetime(2022, 8, 18, 7, 38, 24, 597000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2022, 8, 18, 7, 38, 23, 581000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2022, 8, 18, 7, 40, 1, 793000, tzinfo=tzlocal()),
 'ResponseMetadata': {'

## Testing the endpoint

The single request below is only a smoke test. It should eventually be replaced with a Locust-based load test along the lines of https://github.com/arunprsh/SageMaker-Load-Testing

In [52]:
import json

article = r"""
The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.
"""

question="What kind of forest is Amazon?"


In [53]:
# Prepare the input in the TF Serving request format

from transformers import DistilBertTokenizer, TFDistilBertForQuestionAnswering
import tensorflow as tf
import numpy as np

max_length = 384
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

encoded_input = tokenizer(question, article, padding='max_length', max_length=max_length)
encoded_input = dict(encoded_input)
qa_inputs = [{
    "input_ids": np.array(encoded_input["input_ids"]).tolist(),
    "attention_mask": np.array(encoded_input["attention_mask"]).tolist(),
}]
# TF Serving expects the list of examples under the "instances" key
qa_inputs = {"instances": qa_inputs}

Some layers from the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased-distilled-squad and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
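
The request body built above follows the TensorFlow Serving REST row format: a JSON object whose `instances` list holds one dictionary per example, keyed by input name. The sketch below shows the same shape without a tokenizer. The helper name `build_payload`, the toy token IDs, and the padding ID of 0 are assumptions made for illustration only.

```python
import json


def build_payload(token_ids, max_length, pad_id=0):
    """Pad one tokenized example and wrap it in the TF Serving row format."""
    if len(token_ids) > max_length:
        raise ValueError("example is longer than max_length")
    n_pad = max_length - len(token_ids)
    instance = {
        "input_ids": token_ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding positions
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }
    return {"instances": [instance]}


toy_ids = [101, 1327, 1912, 102]  # made-up IDs, not real tokenizer output
payload = build_payload(toy_ids, max_length=8)
body = json.dumps(payload)  # string suitable for the Body of an invocation
print(body)
```

Every example in `instances` must have the same keys, which is why padding to a fixed `max_length` is convenient here.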


In [54]:
import numpy as np

tf_response = runtime_sm_client.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(qa_inputs),
)

In [55]:
predictions = json.loads(tf_response["Body"].read().decode())

In [56]:
answer_start_index = int(tf.math.argmax(predictions['predictions'][0]['output_0']))
answer_end_index = int(tf.math.argmax(predictions['predictions'][0]['output_1']))

predict_answer_tokens = encoded_input["input_ids"][answer_start_index : answer_end_index + 1]
# Use a separate name so the raw endpoint response is not overwritten
answer = tokenizer.decode(predict_answer_tokens)

print(f"Question: {question}, answer: {answer}")


Question: What kind of forest is Amazon?, answer: moist broadleaf forest
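
The post-processing above takes the argmax of the start logits and of the end logits and decodes the tokens in between. A minimal sketch of that step in plain Python follows. The tokens and logits are invented for illustration, and `extract_answer_span` is a hypothetical helper name, not part of any library.

```python
def extract_answer_span(start_logits, end_logits):
    """Return the indices of the highest start logit and the highest end logit."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__)
    return start, end


# Invented tokens and scores, standing in for real model output
tokens = ["what", "forest", "?", "a", "moist", "broadleaf", "forest", "."]
start_logits = [0.1, 0.2, 0.0, 0.3, 4.5, 1.0, 0.7, 0.1]
end_logits = [0.0, 0.3, 0.1, 0.2, 0.9, 1.1, 5.2, 0.4]

start, end = extract_answer_span(start_logits, end_logits)
# The two argmaxes are independent, so end can fall before start
answer = " ".join(tokens[start:end + 1]) if start <= end else ""
print(answer)
```

Because the two positions are chosen independently, a production pipeline would usually also handle the case where the end index precedes the start index; this sketch simply returns an empty string.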


In [57]:
payload_file = "3_src/payload.json"
# Use a context manager so the file is flushed and closed
with open(payload_file, "w") as f:
    json.dump(qa_inputs, f)

## Applying scaling policies

We start with a simple target-tracking policy.

In [58]:
import boto3 

# Application Auto Scaling client; it handles scaling for SageMaker among other services
as_client = boto3.client('application-autoscaling')

In [59]:
# Resource type is variant and the unique identifier is the resource ID.
resource_id=f"endpoint/{predictor.endpoint_name}/variant/AllTraffic"
policy_name = f'Request-ScalingPolicy-{predictor.endpoint_name}'
scalable_dimension = 'sagemaker:variant:DesiredInstanceCount'

# Register the endpoint variant as a scalable target with 1 to 4 instances
response = as_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=1,
    MaxCapacity=4
)


# Target-tracking scaling policy
response = as_client.put_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 10.0, # target for average invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300, # seconds after a scale-in before another scale-in can start
        'ScaleOutCooldown': 60 # seconds after a scale-out before another scale-out can start
    }
)
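
To get a feel for what this policy asks for, it helps to do the arithmetic by hand. With a target of 10 invocations per instance per minute and a capacity range of 1 to 4, the sketch below computes the smallest instance count that keeps the per-instance rate at or below the target. This is a rough mental model only, not the actual algorithm used by Application Auto Scaling, which also depends on CloudWatch alarms and cooldowns. The function name `instances_needed` and the load values are hypothetical.

```python
import math


def instances_needed(total_invocations_per_minute, target_per_instance,
                     min_capacity, max_capacity):
    """Smallest count keeping the per-instance rate at or below target, clamped to the range."""
    raw = math.ceil(total_invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, raw))


# Hypothetical load levels, in total invocations per minute across the endpoint
for load in (5, 25, 40, 400):
    n = instances_needed(load, target_per_instance=10.0,
                         min_capacity=1, max_capacity=4)
    print(f"{load} invocations/min -> {n} instance(s)")
```

Any load above 40 invocations per minute saturates the range at 4 instances, which is why a fairly modest load test is enough to drive this endpoint to its maximum capacity.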

## Running load tests

In [None]:
! pip install -r "../utils/load_testing/requirements.txt"

In [32]:
%%writefile ../utils/load_testing/config.py

# provide configuration parameters
# TODO: clean up config from personal data

HOST = 'runtime.sagemaker.us-east-1.amazonaws.com'
REGION = 'us-east-1'
# replace the endpoint name below with the SageMaker endpoint you are load testing
ENDPOINT_NAME = "qa-tensorflow-2022-08-16-12-55-04-479"
SAGEMAKER_ENDPOINT_URL = f'https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/{ENDPOINT_NAME}/invocations'
ACCESS_KEY = '<USE YOUR AWS ACCESS KEY HERE>'
SECRET_KEY = '<USE YOUR AWS SECRET KEY HERE>'
# replace the content type below as per your requirements
CONTENT_TYPE = 'application/json'
METHOD = 'POST'
SERVICE = 'sagemaker'
SIGNED_HEADERS = 'content-type;host;x-amz-date'
CANONICAL_QUERY_STRING = ''
ALGORITHM = 'AWS4-HMAC-SHA256'

Overwriting ../utils/load_testing/config.py
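
The constants `ALGORITHM`, `SIGNED_HEADERS` and `SERVICE` suggest that the load-testing script signs each request with AWS Signature Version 4. The locustfile itself is not shown here, so that is an assumption. As a sketch of one part of that scheme, the signing key is derived from the secret key by a chain of HMAC-SHA256 operations over the date, region and service. The secret below is a placeholder, and `derive_signing_key` is a hypothetical helper name.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str,
                       region: str, service: str) -> bytes:
    """Derive a SigV4 signing key; date_stamp is YYYYMMDD in UTC."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder secret, not a real credential
key = derive_signing_key("EXAMPLE-SECRET", "20220818", "us-east-1", "sagemaker")
print(len(key))  # length of a SHA-256 digest in bytes
```

The derived key is scoped to one day, one region and one service, so a long-running load test that crosses midnight UTC needs to derive it again.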


## Start Locust

Run the command below in a console.

In [None]:
! locust -f ../utils/load_testing/locustfile.py --headless -u 20 -r 1 --run-time 5m
# -u: peak number of concurrent users
# -r: spawn rate (users started per second)

In [68]:
import time 

endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
status = endpoint_description['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
    status = endpoint_description['EndpointStatus']
    instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")

Status: InService
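
The polling loop above runs for as long as the endpoint reports `Updating`, with no upper bound. A minimal sketch of the same idea with a timeout is shown below. It takes the describe call as a parameter so it can run here against a stand-in; `wait_while_updating` and `fake_describe` are hypothetical names, and the default interval and timeout are arbitrary.

```python
import time


def wait_while_updating(describe, poll_seconds=5.0, timeout_seconds=1800.0,
                        sleep=time.sleep):
    """Poll describe() until the status leaves 'Updating' or the timeout elapses."""
    waited = 0.0
    description = describe()
    while description["EndpointStatus"] == "Updating":
        if waited >= timeout_seconds:
            raise TimeoutError("endpoint is still updating")
        sleep(poll_seconds)
        waited += poll_seconds
        description = describe()
    return description


# Stand-in for sm_client.describe_endpoint, so the sketch runs without AWS access
_fake_statuses = iter(["Updating", "Updating", "InService"])


def fake_describe():
    return {
        "EndpointStatus": next(_fake_statuses),
        "ProductionVariants": [{"CurrentInstanceCount": 2}],
    }


final = wait_while_updating(fake_describe, poll_seconds=0.0, sleep=lambda s: None)
print(final["EndpointStatus"])
```

With the real client, `describe` could be `lambda: sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)`. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which may be a simpler option when only the final status matters.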


## Manually updating the endpoint

Details are described here: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-scaling.html

In [None]:
# First, delete the scaling policy

response = as_client.delete_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension
)

print(response)

In [73]:
response = as_client.deregister_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension
)

print(response)

{'ResponseMetadata': {'RequestId': '4c2d0f4b-d565-4be7-9ad0-00870c477c23', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '4c2d0f4b-d565-4be7-9ad0-00870c477c23', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'date': 'Thu, 18 Aug 2022 12:06:20 GMT'}, 'RetryAttempts': 0}}
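
The two cleanup calls above share the same identifying arguments, so they can be wrapped in one helper. The sketch below is one way to do that, assuming the same call order as in this notebook. `remove_autoscaling` and `RecordingClient` are hypothetical names; the recording client only stands in for the real Application Auto Scaling client so the example runs without AWS access.

```python
def remove_autoscaling(as_client, resource_id, policy_name,
                       scalable_dimension="sagemaker:variant:DesiredInstanceCount"):
    """Delete the scaling policy, then deregister the scalable target."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": scalable_dimension,
    }
    as_client.delete_scaling_policy(PolicyName=policy_name, **common)
    as_client.deregister_scalable_target(**common)


class RecordingClient:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def delete_scaling_policy(self, **kwargs):
        self.calls.append(("delete_scaling_policy", kwargs))

    def deregister_scalable_target(self, **kwargs):
        self.calls.append(("deregister_scalable_target", kwargs))


client = RecordingClient()
# Made-up endpoint and policy names, for illustration only
remove_autoscaling(client, "endpoint/my-endpoint/variant/AllTraffic", "my-policy")
print([name for name, _ in client.calls])
```

In the notebook, the call would be `remove_autoscaling(as_client, resource_id, policy_name)`.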


In [79]:
# get current instance count

endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Current instance count: {instance_count}")

Current instance count: 4


In [80]:
sm_client.update_endpoint_weights_and_capacities(EndpointName=predictor.endpoint_name,
                            DesiredWeightsAndCapacities=[
                                {
                                    'VariantName': 'AllTraffic',
                                    'DesiredInstanceCount': 1
                                }
                            ])

{'EndpointArn': 'arn:aws:sagemaker:us-east-1:941656036254:endpoint/qa-tensorflow-2022-08-18-11-38-23-027',
 'ResponseMetadata': {'RequestId': '41729f43-28ca-49bc-9471-d70a15a1d3bf',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '41729f43-28ca-49bc-9471-d70a15a1d3bf',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '105',
   'date': 'Thu, 18 Aug 2022 12:11:49 GMT'},
  'RetryAttempts': 0}}

In [84]:
endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
status = endpoint_description['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
    status = endpoint_description['EndpointStatus']
    instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")

Status: InService


In [85]:
endpoint_description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Current instance count: {instance_count}")

Current instance count: 1


In [None]:
# boto3 client methods accept keyword arguments only
sm_client.delete_endpoint(EndpointName=predictor.endpoint_name)