# Load Testing & Profiling SageMaker Endpoints

In this sample we reuse the notebook code from Part 1 of the Inference Video Series to create a SageMaker endpoint: https://www.youtube.com/watch?v=omFOOr4elnc&list=PLThJtS7RDkOeo9mpNjFVnIGDyiazAm9Uk&index=2. For more context on the code, see the original notebook.

In this notebook we quickly create the endpoint and then focus on load testing it with [Locust](https://locust.io/), an open source Python load testing tool. For more on Locust, see the official documentation and the starter blog post below:
- <b>Docs</b>: https://docs.locust.io/en/stable/
- <b>Starter Blog</b>: https://towardsdatascience.com/why-load-testing-is-essential-to-take-your-ml-app-to-production-faab0df1c4e1?sk=408d5c906510883bd6ac615df24103d2

## Setup & Environment
We work in an ml.c5.4xlarge SageMaker Classic Notebook Instance with the conda_python3 kernel. To increase the concurrency of your load tests, you can move to an instance type with more CPU cores; this sample keeps the setup minimal.

## Endpoint Creation
For an end-to-end walkthrough of deploying a pre-trained model, see the earlier notebook: https://github.com/RamVegiraju/SageMaker-Deployment/blob/master/SM-Inference-Video-Series/Pre-Trained-Model-Dept/pre-trained-sklearn-model-dept.ipynb

In [None]:
!pip install -U sagemaker boto3 scikit-learn locust --quiet

In [None]:
# standard library
import json
import os
import pickle
import subprocess
import tarfile
import time
from pathlib import Path
from time import gmtime, strftime

# third-party
import boto3
import joblib
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()
s3_model_prefix = "djl-sme-sklearn-regression" 

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [None]:
%%sh
python3 local_model.py
tar -cvpzf model.tar.gz model.joblib requirements.txt model.py serving.properties

In [None]:
# upload model data to S3
with open("model.tar.gz", "rb") as f:
    s3_client.upload_fileobj(f, bucket, "{}/model.tar.gz".format(s3_model_prefix))
sme_artifacts = "s3://{}/{}/{}".format(bucket, s3_model_prefix, "model.tar.gz")
# replace this with the ECR image URI for your region; this sample uses the CPU image in us-east-1
inference_image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-cpu-full'
print(f"Pushing the data to the following location: {sme_artifacts}")
print(f"Using the following serving image: {inference_image_uri}")

In [None]:
#Step 1: Model Creation
sme_model_name = "sklearn-djl-sme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + sme_model_name)

create_model_response = sm_client.create_model(
    ModelName=sme_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "Mode": "SingleModel", "ModelDataUrl": sme_artifacts},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

# Step 2: Endpoint configuration (EPC) creation
sme_epc_name = "sklearn-djl-sme-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=sme_epc_name,
    ProductionVariants=[
        {
            "VariantName": "sklearnvariant",
            "ModelName": sme_model_name,
            "InstanceType": "ml.c5.xlarge",
            "InitialInstanceCount": 1
        },
    ],
)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

# Step 3: Endpoint (EP) creation
sme_endpoint_name = "sklearn-djl-ep-sme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=sme_endpoint_name,
    EndpointConfigName=sme_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

# Monitor creation: poll every 15 seconds until the endpoint leaves the "Creating" state
describe_endpoint_response = sm_client.describe_endpoint(EndpointName=sme_endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    time.sleep(15)  # wait before re-checking, so no sleep follows the final status
    describe_endpoint_response = sm_client.describe_endpoint(EndpointName=sme_endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
print(describe_endpoint_response)
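
The polling loop above exits as soon as the status is no longer "Creating", so a failed deployment looks the same as a successful one until you read the printed response. A minimal sketch of a stricter variant is shown below. The function name `wait_for_endpoint`, its defaults, and the timeout are assumptions for illustration and not part of the original notebook; the only API call it relies on is `describe_endpoint`, which the cell above already uses.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=15, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService.

    `client` is any object with a boto3-style describe_endpoint method,
    for example the sm_client created earlier in this notebook.
    The name and defaults of this helper are hypothetical.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return response
        if status != "Creating":
            # Any other status (for example "Failed") is treated as an error here.
            reason = response.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still Creating after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

In this notebook it would be called as `wait_for_endpoint(sm_client, sme_endpoint_name)`. boto3 also provides a built-in `endpoint_in_service` waiter through `sm_client.get_waiter`, which may be preferable if you do not need custom error handling.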

In [None]:
# sample invocation
import json
content_type = "application/json"
request_body = '[[0.5]]'  # replace with your request body

response = smr_client.invoke_endpoint(
    EndpointName=sme_endpoint_name,
    ContentType=content_type,
    Body=request_body)
result = json.loads(response['Body'].read().decode())
print(result)

## Load Testing & Profiling the SageMaker Endpoint with Locust
To scale throughput or concurrency, adjust the user and worker counts in the distributed.sh shell script. As you increase the worker count, make sure the environment running the load test has enough compute to generate the load you are targeting. With [Locust Distributed Mode](https://docs.locust.io/en/stable/running-distributed.html) you can also run load tests across multiple machines (for example, EC2 instances) as you scale your traffic.
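
The contents of distributed.sh are not shown in this notebook, so the following is only a sketch of how the user and worker counts typically map onto Locust's command-line flags in distributed mode. The helper name `build_locust_commands`, the script name `locust_script.py`, and all numeric values are assumptions for illustration. The `results` CSV prefix is chosen to match the `results_stats.csv` file read later in this notebook.

```python
def build_locust_commands(locustfile, users, spawn_rate, workers,
                          run_time="1m", csv_prefix="results"):
    """Return (master_cmd, worker_cmds) as argument lists for subprocess.

    Hypothetical helper: it only assembles commands, it does not run them.
    """
    master_cmd = [
        "locust", "-f", locustfile,
        "--master", "--headless",
        "--expect-workers", str(workers),  # master waits for this many workers
        "-u", str(users),                  # total number of simulated users
        "-r", str(spawn_rate),             # users spawned per second
        "-t", run_time,                    # stop the test after this duration
        "--csv", csv_prefix,               # writes <prefix>_stats.csv and related files
    ]
    worker_cmd = ["locust", "-f", locustfile, "--worker", "--master-host", "127.0.0.1"]
    # One identical worker command per worker process on this machine.
    return master_cmd, [list(worker_cmd) for _ in range(workers)]


# Example values only; tune users and workers to the cores available.
master_cmd, worker_cmds = build_locust_commands(
    "locust_script.py", users=20, spawn_rate=5, workers=4
)
print(" ".join(master_cmd))
print(f"{len(worker_cmds)} worker commands")
```

Each argument list can be passed to `subprocess.Popen`. A common rule of thumb is one worker process per CPU core, which is why a larger notebook instance allows a higher worker count. How the endpoint name reaches the Locust script is left to the script itself.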

In [None]:
%%bash -s "$sme_endpoint_name"
chmod +x distributed.sh
./distributed.sh $1

In [None]:
import pandas as pd

# Load the stats CSV produced by the load test and print the first two rows
locust_data = pd.read_csv('results_stats.csv')
for index, row in locust_data.head(n=2).iterrows():
    print(index, row)
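
Printing whole rows gives every column at once. To focus on the numbers that usually matter for an endpoint (request volume, failures, latency, throughput), a small sketch using only the standard library is shown below. The column names are assumed from the header that recent Locust releases write to the stats CSV; check them against the first line of your own `results_stats.csv`. The helper name `summarize_stats` is hypothetical.

```python
import csv


def _to_float(value):
    """Convert a CSV cell to float, or None if it is missing or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None  # for example a placeholder such as "N/A"


def summarize_stats(lines, columns=("Request Count", "Failure Count",
                                    "Average Response Time", "95%", "Requests/s")):
    """Return {request name: {column: value}} for each row of a stats CSV.

    `lines` is any iterable of CSV text lines, such as an open file.
    Column names are assumptions; adjust them to match your file's header.
    """
    summary = {}
    for row in csv.DictReader(lines):
        summary[row["Name"]] = {col: _to_float(row.get(col)) for col in columns}
    return summary
```

To use it on the file from the load test, open `results_stats.csv` with `open("results_stats.csv", newline="")` and pass the file object to `summarize_stats`.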

## Cleanup
Be sure to delete your endpoint to avoid incurring further costs.

In [None]:
sm_client.delete_endpoint(EndpointName=sme_endpoint_name)
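
Deleting the endpoint leaves the endpoint configuration and the model created earlier registered in your account. If you want to remove those as well, a minimal sketch follows. The helper name `delete_endpoint_resources` is an assumption for illustration; the three calls it makes are standard boto3 SageMaker client methods.

```python
def delete_endpoint_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its endpoint configuration and model.

    `client` is a boto3-style SageMaker client such as sm_client.
    Hypothetical helper; it does not remove the model artifact from S3.
    """
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)
```

With the names used in this notebook, the call would be `delete_endpoint_resources(sm_client, sme_endpoint_name, sme_epc_name, sme_model_name)`. Use it in place of the cell above, not in addition to it, since deleting an endpoint that no longer exists raises an error.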