# 5) Model deployment
In this notebook, we will deploy the model, set up autoscaling, and finally invoke the endpoint a few times as a sanity check.

In [49]:
import sagemaker
import boto3
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from sagemaker.debugger import Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig, ProfilerRule, rule_configs
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.model_monitor import DataCaptureConfig
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor
from PIL import Image
import io
import base64
import json
import pprint

session = sagemaker.Session()

bucket = session.default_bucket()
print("Default Bucket: {}".format(bucket))

region = session.boto_region_name
print("AWS Region: {}".format(region))

role = get_execution_role()
print("RoleArn: {}".format(role))

prefix = "capstone-inventory-project"

Default Bucket: sagemaker-us-east-1-646714458109
AWS Region: us-east-1
RoleArn: arn:aws:iam::646714458109:role/service-role/AmazonSageMaker-ExecutionRole-20211122T183493


Let's first deploy our model using our own inference script, `inference.py`.  
To keep track of the model's inferences, we attach a data capture configuration that stores every request and response in S3.

In [44]:
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=f"s3://{bucket}/{prefix}/data_capture"
)
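Once the endpoint receives traffic, the captured requests and responses are written under the `data_capture` prefix as JSON Lines files. The sketch below decodes one such record. It assumes the commonly documented layout (a `captureData` object with `endpointInput` and `endpointOutput` sections, each carrying `data` and `encoding` fields). The helper names and the sample record are hypothetical, so verify them against a real capture file before relying on them.

```python
import base64
import json


def decode_payload(section):
    """Decode one side (input or output) of a capture record.

    Assumes the section carries 'data' and 'encoding' keys; BASE64 is
    typical for binary payloads such as JPEG images.
    """
    data = section["data"]
    if section.get("encoding") == "BASE64":
        return base64.b64decode(data)
    return data  # e.g. JSON or CSV text stored as-is


def decode_capture_record(line):
    """Parse one JSON Lines entry and return (request, response)."""
    capture = json.loads(line)["captureData"]
    request = decode_payload(capture["endpointInput"])
    response = decode_payload(capture["endpointOutput"])
    return request, response


# Hypothetical record built by hand, for illustration only.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "image/jpeg",
            "mode": "INPUT",
            "data": base64.b64encode(b"fake-jpeg-bytes").decode(),
            "encoding": "BASE64",
        },
        "endpointOutput": {
            "observedContentType": "application/json",
            "mode": "OUTPUT",
            "data": "[[0.1, 0.9]]",
            "encoding": "JSON",
        },
    }
})

request, response = decode_capture_record(sample_line)
print(len(request), json.loads(response))
```

In practice, each line of a capture file downloaded from the `destination_s3_uri` location would be passed to `decode_capture_record`.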

In [45]:
pytorch_model = PyTorchModel(model_data="s3://sagemaker-us-east-1-646714458109/capstone-inventory-project/main_training/pytorch-training-2022-01-17-10-08-10-837/output/model.tar.gz", 
                             role=role, 
                             entry_point='scripts/inference.py',
                             py_version='py3',
                             framework_version='1.5')

Amazon Elastic Inference is used to speed up inference cost-effectively: a low-cost GPU-powered accelerator (`ml.eia2.medium`) is attached to the CPU instance hosting the endpoint. According to AWS, this configuration can reduce inference costs by up to 75% compared to a dedicated GPU instance.

In [47]:
predictor = pytorch_model.deploy(initial_instance_count=1, 
                                 data_capture_config=data_capture_config,
                                 instance_type='ml.m5.large',
                                 accelerator_type='ml.eia2.medium' # Low cost GPU
                                )  

-----------------!

## Autoscaling
To reduce potential latency, a custom scaling policy is set up.  
Up to 3 instances can run to meet demand, based on CPU usage. More specifically, the policy tracks a target average CPU utilization of 70%: when utilization rises above that target, another instance is added behind the endpoint, with a 30-second cooldown between scaling activities. This policy was implemented following this [documentation](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/).

In [50]:
pp = pprint.PrettyPrinter(indent=4, depth=4)
role = get_execution_role()
sagemaker_client = boto3.Session().client(service_name='sagemaker')
endpoint_name = 'pytorch-inference-eia-2022-01-18-12-59-02-036'
response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
pp.pprint(response)

# Define a client to work with the autoscaling options
client_auto = boto3.client('application-autoscaling')  # Application Auto Scaling client, shared by SageMaker and other services

{   'CreationTime': datetime.datetime(2022, 1, 18, 12, 59, 2, 417000, tzinfo=tzlocal()),
    'DataCaptureConfig': {   'CaptureStatus': 'Started',
                             'CurrentSamplingPercentage': 100,
                             'DestinationS3Uri': 's3://sagemaker-us-east-1-646714458109/capstone-inventory-project/data_capture',
                             'EnableCapture': True},
    'EndpointArn': 'arn:aws:sagemaker:us-east-1:646714458109:endpoint/pytorch-inference-eia-2022-01-18-12-59-02-036',
    'EndpointConfigName': 'pytorch-inference-eia-2022-01-18-12-59-02-036',
    'EndpointName': 'pytorch-inference-eia-2022-01-18-12-59-02-036',
    'EndpointStatus': 'InService',
    'LastModifiedTime': datetime.datetime(2022, 1, 18, 13, 7, 24, 717000, tzinfo=tzlocal()),
    'ProductionVariants': [   {   'CurrentInstanceCount': 1,
                                  'CurrentWeight': 1.0,
                                  'DeployedImages': [{...}],
                                  'Desir

In [51]:
# Format in which Application Auto Scaling references the endpoint variant;
# "AllTraffic" is the variant name assigned by default at deployment.
resource_id = 'endpoint/' + endpoint_name + '/variant/' + 'AllTraffic'

response = client_auto.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=3
)

In [53]:
response = client_auto.put_scaling_policy(
    PolicyName='CPUUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'CustomizedMetricSpecification':
        {
            'MetricName': 'CPUUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name },
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average', # One of 'Average' | 'Minimum' | 'Maximum' | 'SampleCount' | 'Sum'
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 30,   # Seconds to wait after a scale-in activity
        'ScaleOutCooldown': 30   # Seconds to wait after a scale-out activity
    }
)

The policy was correctly implemented:  
![Scaling policy](images/scaling_policy.png "Scaling policy")
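The same check can also be made programmatically. The sketch below lists the policies attached to a variant through the Application Auto Scaling `describe_scaling_policies` call. The `summarize_scaling_policies` helper and the `FakeAutoScalingClient` stand-in are hypothetical names introduced for illustration; the fake client only exists so the example runs offline.

```python
def summarize_scaling_policies(autoscaling_client, resource_id):
    """Return (policy name, target value) pairs for a SageMaker variant."""
    response = autoscaling_client.describe_scaling_policies(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
    )
    summary = []
    for policy in response["ScalingPolicies"]:
        config = policy.get("TargetTrackingScalingPolicyConfiguration", {})
        summary.append((policy["PolicyName"], config.get("TargetValue")))
    return summary


class FakeAutoScalingClient:
    """Offline stand-in (hypothetical) that mimics the response shape."""

    def describe_scaling_policies(self, **kwargs):
        return {
            "ScalingPolicies": [
                {
                    "PolicyName": "CPUUtil-ScalingPolicy",
                    "TargetTrackingScalingPolicyConfiguration": {
                        "TargetValue": 70.0
                    },
                }
            ]
        }


policies = summarize_scaling_policies(
    FakeAutoScalingClient(),
    "endpoint/my-endpoint/variant/AllTraffic",  # placeholder resource id
)
print(policies)
```

In the notebook, passing `client_auto` and `resource_id` instead of the fake client would query the real policy.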

## Invoke using the boto3 client
Let's first invoke the endpoint directly using boto3.

In [54]:
with open("images/test.jpg", "rb") as f:
    image_data = f.read()

In [55]:
runtime = boto3.Session().client('sagemaker-runtime')

response = runtime.invoke_endpoint(EndpointName=endpoint_name,  # The name of the endpoint we created
                                   ContentType='image/jpeg',    # The data format the endpoint expects
                                   Body=image_data)             # Raw bytes of the test image


In [56]:
json.loads(response['Body'].read().decode())

[[-2.5246334075927734,
  -0.3356003761291504,
  0.7459352612495422,
  1.120079517364502,
  1.032776117324829]]
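The endpoint returns one raw score per class rather than probabilities (some values are negative). As a minimal sketch, assuming these scores are unnormalized logits, a softmax turns them into probabilities and the largest one gives the predicted class index. How an index maps to a class label depends on the training setup and is not shown here.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores copied from the response above (first and only row).
logits = [
    -2.5246334075927734,
    -0.3356003761291504,
    0.7459352612495422,
    1.120079517364502,
    1.032776117324829,
]

probabilities = softmax(logits)
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_index, [round(p, 3) for p in probabilities])
```

For this response the highest score sits at index 3 (zero-based), closely followed by index 4, so the model is not strongly confident between those two classes.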

## Invoke using Lambda (by providing a URL)
Let's now make a prediction using a Lambda function as an intermediary (the Lambda function code is available in the lambda.py file). 
1) We invoke the Lambda function and send a URL (pointing to the S3 image) as the payload.  
2) The function invokes the endpoint using the URL as input.  
3) The endpoint downloads the image, performs inference, and returns the prediction to the Lambda function.  
4) The prediction is returned by the Lambda function.
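
The steps above can be sketched as a handler like the one below. This is not the content of lambda.py, only an illustration under stated assumptions: the endpoint is assumed to accept an `application/json` body of the form `{"url": ...}`, and `make_handler`, `FakeRuntime`, and the example URL are hypothetical names. The runtime client is injected so the sketch runs offline; in Lambda it would be `boto3.client("sagemaker-runtime")`.

```python
import io
import json


def make_handler(runtime_client, endpoint_name):
    """Build a Lambda-style handler that forwards an image URL to the endpoint."""

    def handler(event, context):
        # Step 2: forward the URL to the endpoint (content type is an assumption).
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"url": event["url"]}),
        )
        # Steps 3-4: read the prediction and return it to the caller.
        prediction = json.loads(response["Body"].read().decode())
        return {"statusCode": 200, "body": prediction}

    return handler


class FakeRuntime:
    """Offline stand-in (hypothetical) for the sagemaker-runtime client."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body  # keep the request for inspection
        return {"Body": io.BytesIO(b"[[0.1, 0.9]]")}


fake_runtime = FakeRuntime()
handler = make_handler(fake_runtime, "my-endpoint")
result = handler({"url": "https://example.com/image.jpg"}, None)
print(result)
```

Returning the prediction under a `body` key matches how the notebook reads the Lambda response further down.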

In [57]:
client = boto3.client('lambda')

In [58]:
response = client.invoke(
    FunctionName='inference_capstone',
    Payload='{"url": "https://sagemaker-us-east-1-646714458109.s3.amazonaws.com/capstone-inventory-project/data/train/1/00014.jpg"}',
)

In [59]:
print(response)

{'ResponseMetadata': {'RequestId': '37027ec4-9054-4840-9b09-909305be2102', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 18 Jan 2022 13:12:33 GMT', 'content-type': 'application/json', 'content-length': '131', 'connection': 'keep-alive', 'x-amzn-requestid': '37027ec4-9054-4840-9b09-909305be2102', 'x-amzn-remapped-content-length': '0', 'x-amz-executed-version': '$LATEST', 'x-amzn-trace-id': 'root=1-61e6bcbf-0f4793ac7357c1fa22c31f4e;sampled=0'}, 'RetryAttempts': 0}, 'StatusCode': 200, 'ExecutedVersion': '$LATEST', 'Payload': <botocore.response.StreamingBody object at 0x7f499468a910>}


In [60]:
json.loads(response['Payload'].read().decode())["body"]

[[-2.5246334075927734,
  -0.3356003761291504,
  0.7459352612495422,
  1.120079517364502,
  1.032776117324829]]