## Multi-Model Endpoints on SageMaker

A multi-model endpoint (**"MME"**) is a special type of SageMaker endpoint that lets you host thousands of models behind a single endpoint. It is a good fit for similarly sized models with relatively low resource requirements that can be served from the same inference container.

In this code sample, we will learn how to deploy two NLP models simultaneously using an MME. One model analyzes the sentiment of German text, while the other analyzes the sentiment of English text. We will use the HuggingFace PyTorch container and the following models from the HuggingFace Model Hub: `distilbert-base-uncased-finetuned-sst-2-english` and `oliverguhr/german-sentiment-bert`.

### Prerequisites

Run the cell below to install the Python dependencies for this example:

In [None]:
! pip install -r requirements.txt

## Prepare Model Packages for MME
SageMaker MME requires you to create a separate package for each model and upload it to Amazon S3. Follow the steps below to prepare two packages with English and German models:

1. We start by fetching the models from the HuggingFace Model Hub and saving them locally. We also run inference locally, using a positive English sample and a negative German sample, to check that the models work.

In [None]:
import os

import torch
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizer)

# Loading English model from HuggingFace Model Hub
EN_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
en_tokenizer = DistilBertTokenizer.from_pretrained(EN_MODEL)
en_model = DistilBertForSequenceClassification.from_pretrained(EN_MODEL)

# Running inference locally
inputs = en_tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = en_model(**inputs).logits

predicted_class_id = logits.argmax().item()
predictions = en_model.config.id2label[predicted_class_id]

print(f"Expected: positive, actual: {predictions}")

# Saving model locally
en_model_path = "models/english_sentiment"
os.makedirs(en_model_path, exist_ok=True)

en_model.save_pretrained(save_directory=en_model_path)
en_tokenizer.save_pretrained(save_directory=en_model_path)

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer

# Loading German model from HuggingFace Model Hub
GER_MODEL = "oliverguhr/german-sentiment-bert"
ger_tokenizer = BertTokenizer.from_pretrained(GER_MODEL)
ger_model = BertForSequenceClassification.from_pretrained(GER_MODEL)

# Running inference locally
inputs = ger_tokenizer("Das ist gar nicht mal so gut", return_tensors="pt")
with torch.no_grad():
    logits = ger_model(**inputs).logits

predicted_class_id = logits.argmax().item()
predictions = ger_model.config.id2label[predicted_class_id]

print(f"Expected: negative, actual: {predictions}")

# Saving model locally
ger_model_path = "models/german_sentiment"
os.makedirs(ger_model_path, exist_ok=True)

# Save the German model and tokenizer (not the English ones) to the German path
ger_model.save_pretrained(save_directory=ger_model_path)
ger_tokenizer.save_pretrained(save_directory=ger_model_path)
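
The cells above report only the arg-max label. When a confidence score is also useful, the logits can be converted to probabilities with a softmax. The sketch below is a minimal, dependency-free illustration; the logit values and the label map are made-up placeholders, not outputs of the models above. In the notebook itself, `torch.softmax(logits, dim=-1)` would do the same job on the real tensors.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values, for illustration only.
example_logits = [-2.0, 3.0]
example_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{example_id2label[best]} ({probs[best]:.3f})")
```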

2. An MME has the same requirements for inference scripts as a single-model endpoint. Run the cell below to review the inference scripts, paying attention to the functions for model loading (`model_fn()`), inference (`predict_fn()`), and data pre- and post-processing (`input_fn()` and `output_fn()`, respectively).

In [None]:
# inference script for English model
! pygmentize 1_src/en_inference.py

# inference script for German model
! pygmentize 1_src/ger_inference.py
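
The scripts in `1_src/` are not reproduced here. For orientation, a handler with these four functions might look roughly like the sketch below. It is an assumption about the general shape, not the contents of `en_inference.py` or `ger_inference.py`: the JSON-string payload, the returned dictionary, and the use of the `Auto*` classes are all illustrative choices.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def model_fn(model_dir):
    """Load the model and tokenizer that were saved with save_pretrained()."""
    # Imported lazily so the pre/post-processing helpers stay importable
    # in environments where transformers is not installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer


def input_fn(request_body, content_type=JSON_CONTENT_TYPE):
    """Deserialize the request; this sketch assumes a JSON-encoded string."""
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)


def predict_fn(text, model_and_tokenizer):
    """Tokenize the text, run a forward pass, and map the arg-max to a label."""
    import torch

    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    class_id = logits.argmax(dim=-1).item()
    return {"label": model.config.id2label[class_id]}


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    """Serialize the prediction for the response body."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)
```

Returning the model and tokenizer together from `model_fn()` keeps `predict_fn()` free of global state, which matters on an MME because several models share one container process.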


3. Next, we need to package the model and inference code for the MME. SageMaker expects a specific directory structure that differs between PyTorch and TensorFlow containers. For PyTorch containers, the model and code should be packaged into a single tar.gz archive with the following structure:
```text
model.tar.gz/
    |- model.pth  # and any other model artifacts
    |- code/
        |- inference.py
        |- requirements.txt  # optional
```

Run the code below to prepare model packages for the English and German models:


In [None]:
! mkdir -p models/english_sentiment/code
! cp 1_src/en_inference.py models/english_sentiment/code/inference.py
! tar -czvf models/english_sentiment.tar.gz -C models/english_sentiment/ .

In [None]:
! mkdir -p models/german_sentiment/code
! cp 1_src/ger_inference.py models/german_sentiment/code/inference.py
! tar -czvf models/german_sentiment.tar.gz -C models/german_sentiment/ .
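
The same layout can be produced and checked from Python with the standard `tarfile` module. The sketch below is an alternative illustration, not part of the original workflow: it builds an archive in a temporary directory from placeholder files (`config.json` and the stub script are stand-ins) and then verifies that `code/inference.py` sits at the expected path inside the archive.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model(model_dir, script_path, archive_path):
    """Create a tar.gz with model artifacts at the root and the script under code/."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Model artifacts go at the archive root.
        for item in sorted(Path(model_dir).iterdir()):
            tar.add(str(item), arcname=item.name)
        # The handler is stored as code/inference.py regardless of its source name.
        tar.add(str(script_path), arcname="code/inference.py")
    return archive_path


def archive_members(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "english_sentiment"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")
    script = Path(tmp) / "en_inference.py"
    script.write_text("# placeholder handler\n")

    archive = package_model(model_dir, script, Path(tmp) / "english_sentiment.tar.gz")
    members = archive_members(archive)
    print(members)
```

Listing the members before uploading is a cheap way to catch a misplaced handler script early.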

4. Finally, we upload the model packages to Amazon S3 using the SageMaker Session object. Note that both model packages are stored under the same S3 prefix (variable `mm_data_path`).

In [None]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role() 

bucket = sagemaker_session.default_bucket()
prefix = 'multi-model'
mm_data_path = f"s3://{bucket}/{prefix}/"
region = sagemaker_session.boto_region_name

en_model_data = sagemaker_session.upload_data(
    "models/english_sentiment.tar.gz", bucket=bucket, key_prefix=prefix
)
ger_model_data = sagemaker_session.upload_data(
    "models/german_sentiment.tar.gz", bucket=bucket, key_prefix=prefix
)
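
When the endpoint is invoked, each model is selected through the `TargetModel` parameter, whose value is the archive's path relative to this shared prefix. The helper below is a small sketch of that mapping; `target_model_for` is a hypothetical name, and the bucket in the example is a placeholder.

```python
def target_model_for(model_uri, prefix_uri):
    """Return the model's key relative to the shared S3 prefix."""
    if not prefix_uri.endswith("/"):
        prefix_uri += "/"
    if not model_uri.startswith(prefix_uri):
        raise ValueError(f"{model_uri} is not stored under {prefix_uri}")
    return model_uri[len(prefix_uri):]


# Placeholder bucket name, for illustration only.
example_prefix = "s3://example-bucket/multi-model/"
example_uri = "s3://example-bucket/multi-model/english_sentiment.tar.gz"
print(target_model_for(example_uri, example_prefix))
```

In the notebook, the same call could be made with `en_model_data` and `mm_data_path`.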


## Deploy the Multi-Model Endpoint
Once the model packages are prepared, we are ready to create the MME that hosts them. Follow the steps below:

1. We start by identifying the appropriate SageMaker inference container. Run the cell below to get the container URI:

In [None]:
from sagemaker import image_uris

HF_VERSION = '4.17.0'
PT_VERSION = 'pytorch1.10.2'

pt_container_uri = image_uris.retrieve(framework='huggingface',
                                region=region,
                                version=HF_VERSION,
                                image_scope='inference',
                                base_framework_version=PT_VERSION,
                                instance_type='ml.c5.xlarge')

print(pt_container_uri)

2. Then, we need to configure the MME parameters. Specifically, we must define the MultiModel mode. Note that we provide two specific environment variables – `SAGEMAKER_PROGRAM` and `SAGEMAKER_SUBMIT_DIRECTORY` – so that the SageMaker inference framework knows how to register the model handler:

In [64]:
container  = {
    'Image': pt_container_uri,
    'ContainerHostname': 'MultiModel',
    'Mode': 'MultiModel',
    'ModelDataUrl': mm_data_path,
    'Environment': {
        'SAGEMAKER_PROGRAM': 'inference.py',
        'SAGEMAKER_SUBMIT_DIRECTORY': mm_data_path
    }
}


3. The last step in configuring the MME is to create a SageMaker model, an endpoint configuration, and the endpoint itself. When creating the model, we must provide the MultiModel-enabled container from the preceding step. Note that we use the SageMaker boto3 client (variable `sm_client`) to deploy the MME.

In [66]:
import datetime

# Timestamp suffix keeps resource names unique across runs
unique_id = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_name = f"mme-sentiment-model-{unique_id}"

sm_client = sagemaker_session.sagemaker_client

create_model_response = sm_client.create_model(
    ModelName=model_name,
    PrimaryContainer=container,
    ExecutionRoleArn=role,
)

In [67]:
endpoint_config_name = f"{model_name}-ep-config"
instance_type = "ml.m5.4xlarge"

endpoint_config = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "prod",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        },
    ],
)

In [68]:
import time
endpoint_name = f"{model_name}-ep"

endpoint = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

# Code to wait for MME deployment completion
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
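
The loop above exits on any status other than `Creating`, including `Failed`, and has no upper bound on waiting time. A slightly more defensive variant is sketched below. The function name, the timeout default, and the injected `describe` callable are illustrative assumptions; in the notebook, `describe` would be `sm_client.describe_endpoint`.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=1800):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            # Any other status here means creation did not succeed.
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still creating after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

A call such as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)` would replace the loop. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which covers the same need with less code.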

## Testing MME

Once the endpoint has been created, we can invoke our models. To do so, the invocation request must include the `TargetModel` parameter, which names the model archive relative to the S3 prefix configured for the endpoint. Execute the cells below to get predictions from both the English and German sentiment models.

In [52]:
import json

runtime_sm_client = sagemaker_session.sagemaker_runtime_client

ger_input = "Der Test verlief positiv."
en_input = "Test results are positive."

In [None]:
# getting response from English model
en_response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    TargetModel="english_sentiment.tar.gz",
    Body=json.dumps(en_input),
)

predictions = json.loads(en_response["Body"].read().decode())
print(predictions)

In [None]:
# getting response from German model
ger_response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    TargetModel="german_sentiment.tar.gz",
    Body=json.dumps(ger_input),
)

predictions = json.loads(ger_response["Body"].read().decode())
print(predictions)

## Resource Cleanup

Run the cell below to delete the cloud resources created in this example:

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)