# Sample SKLearn Multi-Model Endpoint Deployment 
In this example we take the sample SKLearn model that we've been working with in this series and make 300 copies of its model artifact to simulate a multi-model deployment on SageMaker Real-Time Inference. We then deploy these 300 models behind a single SageMaker endpoint using the Multi-Model Endpoints (MME) feature.

Note: MME deployment on GPU-based instances is a little different and will be covered in later parts of this series. If you would like to get started early, refer to this blog: https://medium.com/towards-data-science/host-hundreds-of-nlp-models-utilizing-sagemaker-multi-model-endpoints-backed-by-gpu-instances-1ec215886248?sk=def3b784378ab48190f37e6d1c2f3d00

## Setup & Environment
We will be working in an ml.c5.4xlarge SageMaker Classic Notebook Instance using a conda_python3 kernel. You can also use SageMaker Studio or a smaller notebook instance.

## Additional Resources
- [Load Testing DJL MME AWS Blog](https://aws.amazon.com/blogs/machine-learning/run-ml-inference-on-unplanned-and-spiky-traffic-using-amazon-sagemaker-multi-model-endpoints/)
- [Inference Playlist Series](https://www.youtube.com/watch?v=pVVKqiMiArc&list=PLThJtS7RDkOeo9mpNjFVnIGDyiazAm9Uk)

In [None]:
!pip install -U sagemaker boto3

In [None]:
import json
import os
import pickle
import subprocess
import tarfile
import time
from pathlib import Path
from time import gmtime, strftime

import boto3
import joblib
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute, preferred over the private _region_name
account_id = sess.account_id()
s3_model_prefix = "djl-mme-sklearn-regression" 

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

## Tarball Creation & Multi-Model Copy Creation
We wrap our artifacts into a model tarball and, for this dummy MME case, upload 300 copies of that tarball to S3. All of the model tarballs need to sit under the same S3 prefix for MME to work properly.
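
If you would rather stay in Python than shell out to `tar`, the packaging step can be sketched with the standard-library `tarfile` module. This is a minimal sketch, not the notebook's own method: the `build_model_tarball` and `list_tarball_members` helper names are hypothetical, and it assumes the four files named in this notebook sit in the directory you pass in.

```python
import tarfile
from pathlib import Path


def build_model_tarball(source_dir, file_names, output_path="model.tar.gz"):
    """Pack the listed files into a gzip-compressed tarball with flat member names."""
    source_dir = Path(source_dir)
    missing = [name for name in file_names if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing model files: {missing}")

    with tarfile.open(str(output_path), "w:gz") as tar:
        for name in file_names:
            # arcname keeps each file at the archive root, like the tar command
            tar.add(str(source_dir / name), arcname=name)
    return Path(output_path)


def list_tarball_members(tarball_path):
    """Return the sorted member names of a tarball for a quick sanity check."""
    with tarfile.open(str(tarball_path), "r:gz") as tar:
        return sorted(tar.getnames())
```

For example, `build_model_tarball(".", ["model.joblib", "requirements.txt", "model.py", "serving.properties"])` would produce a `model.tar.gz` in the working directory, and `list_tarball_members("model.tar.gz")` lets you confirm that every file landed at the archive root.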

In [None]:
%%sh
python3 local_model.py
tar -cvpzf model.tar.gz model.joblib requirements.txt model.py serving.properties

In [None]:
%%time
# we make 300 copies of the tarball as a dummy; you can replace this with tarballs of your actual models
for i in range(300):
    with open("model.tar.gz", "rb") as f:
        s3_client.upload_fileobj(f, bucket, "{}/sklearn-{}.tar.gz".format(s3_model_prefix,i))

In [None]:
mme_artifacts = "s3://{}/{}/".format(bucket, s3_model_prefix)
mme_artifacts

In [None]:
# verify all 300 tarballs are present
!aws s3 ls {mme_artifacts}

## SageMaker Inference Objects Creation
Here we create the same inference constructs we always would (model, endpoint configuration, endpoint), but we also set the container mode to Multi-Model.

### Model Creation
In this case we use DJL Serving, the same model server/container we used for Single Model Endpoints. You can choose whichever container/server you are comfortable with. We use DJL here because it supports per-model worker scaling, which lets us handle different traffic patterns more robustly at the serving level itself.
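
The container image URI is region-specific. One way to avoid hard-coding it is to assemble it from the session's region, as in the sketch below. `djl_inference_image_uri` is a hypothetical helper, and its default registry account and tag are simply the values used in this notebook. Some regions host the images under a different account or domain suffix, so treat the defaults as an assumption and check the published image list for your region.

```python
def djl_inference_image_uri(region, tag="0.29.0-cpu-full", registry_account="763104351884"):
    """Assemble an ECR image URI for the DJL inference container.

    The defaults are assumptions copied from this notebook, not a lookup
    against any AWS API.
    """
    if not region:
        raise ValueError("region must be a non-empty string such as 'us-east-1'")
    return f"{registry_account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"
```

With the `region` variable defined earlier, `djl_inference_image_uri(region)` would return the us-east-1 URI used in this notebook when the session runs in us-east-1.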

In [None]:
# replace this with your ECR image URI based off of your region, we are utilizing the CPU image here
inference_image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-cpu-full'

#Step 1: Model Creation
mme_model_name = "sklearn-djl-mme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + mme_model_name)

# here we specify the mode as Multi-Model and the S3 path with all our artifacts
create_model_response = sm_client.create_model(
    ModelName=mme_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "Mode": "MultiModel", "ModelDataUrl": mme_artifacts},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

### EPC Creation
Make sure you have the capacity/instance limit needed here; in the worst case, submit a limit increase request. To size the capacity behind an MME endpoint, consider the number of models, the average model size, and the memory of the instance type you choose. MME loads a model into the instance's memory when it is invoked, so also consider your traffic patterns: how many models might be loaded in memory at the same time, and whether there will be enough available memory at peak traffic.
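
The sizing reasoning above can be turned into a back-of-the-envelope calculation. The sketch below is only a rough estimate under stated assumptions: the helper names are hypothetical, `reserved_fraction` is an assumed share of memory held back for the OS and model server, and a model's in-memory footprint is usually larger than its tarball size, so measure it rather than trusting the file size.

```python
import math


def estimate_resident_models(instance_memory_gib, avg_model_size_mib, reserved_fraction=0.25):
    """Rough count of models that fit in one instance's memory at the same time."""
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    if avg_model_size_mib <= 0:
        raise ValueError("avg_model_size_mib must be positive")
    # memory left for models after the assumed reservation, in MiB
    usable_mib = instance_memory_gib * 1024 * (1 - reserved_fraction)
    return math.floor(usable_mib / avg_model_size_mib)


def instances_needed(peak_concurrent_models, models_per_instance):
    """Instances required to keep the peak working set of models in memory."""
    if models_per_instance <= 0:
        raise ValueError("models_per_instance must be positive")
    return max(1, math.ceil(peak_concurrent_models / models_per_instance))
```

As a hypothetical example, a 32 GiB instance with 100 MiB models and a quarter of memory reserved gives `estimate_resident_models(32, 100)` of 245 models, so keeping all 300 resident at once would call for `instances_needed(300, 245)`, that is 2 instances.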

In [None]:
#Step 2: EPC Creation
mme_epc_name = "sklearn-djl-mme-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=mme_epc_name,
    ProductionVariants=[
        {
            "VariantName": "sklearnvariant",
            "ModelName": mme_model_name,
            "InstanceType": "ml.c5d.4xlarge",
            "InitialInstanceCount": 2
        },
    ],
)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

### Endpoint Creation

In [None]:
#Step 3: EP Creation
mme_endpoint_name = "sklearn-djl-ep-mme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=mme_endpoint_name,
    EndpointConfigName=mme_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
#Monitor creation
describe_endpoint_response = sm_client.describe_endpoint(EndpointName=mme_endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = sm_client.describe_endpoint(EndpointName=mme_endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
    time.sleep(15)
print(describe_endpoint_response)

## Sample Inference
Here we use the same invoke_endpoint API call to interact with the endpoint: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html. The key difference from single model endpoints is that we specify a TargetModel parameter, which is the filename of the tarball you uploaded to S3 under the MME prefix, for example "sklearn-1.tar.gz" (this will change depending on what you named your files).
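
To keep the TargetModel string consistent with the upload naming, the call can be wrapped in a small helper. This is a sketch under assumptions: `target_model_name` and `invoke_model` are hypothetical names, the `sklearn-{index}.tar.gz` pattern and the 0-based numbering mirror the upload loop in this notebook, and the response body is assumed to be JSON.

```python
import json


def target_model_name(index, total_models=300):
    """Build the TargetModel filename for a 0-based model index."""
    if not 0 <= index < total_models:
        raise ValueError(f"index must be between 0 and {total_models - 1}")
    return f"sklearn-{index}.tar.gz"


def invoke_model(runtime_client, endpoint_name, index, payload, total_models=300):
    """Invoke one model behind the endpoint and return the decoded JSON result."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        TargetModel=target_model_name(index, total_models),
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read().decode())
```

With the clients defined earlier, `invoke_model(smr_client, mme_endpoint_name, 290, [[0.5]])` would target `sklearn-290.tar.gz`. The range check rejects an index such as 300 before any request is sent, rather than letting the endpoint fail on a tarball that was never uploaded.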

In [None]:
import json
content_type = "application/json"
request_body = '[[0.5]]' #replace with your request body


# sample inference, the target model string should look like sklearn-modelversion.tar.gz
# initial request might take a little longer as the model is loaded into memory
response = smr_client.invoke_endpoint(
    EndpointName=mme_endpoint_name,
    ContentType=content_type,
    TargetModel = "sklearn-290.tar.gz",
    Body=request_body)
result = json.loads(response['Body'].read().decode())
print(result)

In [None]:
# sample inference across many models, this cell might take a little while to run
import random

for i in range(500):
    # randint is inclusive on both ends; the uploaded copies are numbered 0-299
    random_model = random.randint(0, 299)
    target_model = f"sklearn-{random_model}.tar.gz"
    print(f"Invoking following model: {target_model}")
    response = smr_client.invoke_endpoint(
        EndpointName=mme_endpoint_name,
        ContentType=content_type,
        TargetModel = target_model,
        Body=request_body)
    result = json.loads(response['Body'].read().decode())
    print(result)

## Cleanup
Be sure to delete your endpoint so that you do not incur any further costs.
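
Deleting the endpoint stops the instance charges, but the endpoint configuration, the model object, and the 300 tarballs in S3 remain. A fuller teardown could be sketched as follows. The helper names are hypothetical, the key pattern assumes the upload naming used in this notebook, and the S3 deletion is irreversible, so only run it against dummy copies.

```python
def teardown_endpoint(sagemaker_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model object."""
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sagemaker_client.delete_model(ModelName=model_name)


def delete_model_copies(s3_client, bucket, prefix, total_models=300):
    """Delete the dummy tarballs uploaded under the MME prefix; returns the key count."""
    keys = [{"Key": f"{prefix}/sklearn-{i}.tar.gz"} for i in range(total_models)]
    # delete_objects accepts at most 1000 keys per request, so send in batches
    for start in range(0, len(keys), 1000):
        s3_client.delete_objects(
            Bucket=bucket, Delete={"Objects": keys[start:start + 1000]}
        )
    return len(keys)
```

Using the names from this notebook, that would be `teardown_endpoint(sm_client, mme_endpoint_name, mme_epc_name, mme_model_name)` followed by `delete_model_copies(s3_client, bucket, s3_model_prefix)`. If you use this helper, skip the single `delete_endpoint` call, since the endpoint will already be gone.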

In [None]:
sm_client.delete_endpoint(EndpointName = mme_endpoint_name)