## Multi-Container Endpoints on SageMaker

A multi-container endpoint (**"MCE"**) allows you to host up to 15 inference containers on a single endpoint, with each container serving its own model. MCEs are a good fit when models require different runtime environments (containers) but no single model can fully utilize the available instance resources on its own. They also suit scenarios where models are invoked at different times.

In this example, we will run an inference workload with two NLP models using different runtime environments: TensorFlow and PyTorch. We will host the Q&A model in a TensorFlow container and the text summarization model in a PyTorch container.

### Prerequisites

Run the cell below to install the Python dependencies for this example:

In [None]:
! pip install -r requirements.txt

## Prepare Model Packages for MCE

To deploy an MCE, we need to prepare a separate package for each target model. Depending on the model framework (PyTorch or TensorFlow), the model package structure will be slightly different. Follow the steps below to prepare the model packages.


### Prepare TensorFlow Model Package
SageMaker expects the following package structure for TensorFlow models:
```python
    model.tar.gz/
                |--[model_version_number]/
                                        |--variables
                                        |--saved_model.pb
                code/
                    |--inference.py
                    |--requirements.txt # optional
```

Follow the steps below to prepare the TensorFlow model package:
1. We start by fetching the model from the HuggingFace Hub. Note that we download the model bundle `saved_model.tar.gz`, which is ready to be deployed on TensorFlow Serving.

In [None]:
# Create model package local directories
! mkdir -p distilbert-base-uncased-distilled-squad/1
! mkdir -p distilbert-base-uncased-distilled-squad/code

# Download artifacts for TensorFlow DistilBert model for Question-Answering task
! wget https://huggingface.co/distilbert-base-cased-distilled-squad/resolve/main/saved_model.tar.gz
! tar -zxvf saved_model.tar.gz -C distilbert-base-uncased-distilled-squad/1

2. Next, we prepare the inference script. Note that in our case we use the same inference script for both the PyTorch and TensorFlow models (thanks to HuggingFace's robust `pipeline` API!). Execute the cell below to copy the inference code and the `requirements.txt` file into the model package, and then archive the model package into a tarball. Feel free to review the inference script by running the `pygmentize 2_src/inference.py` command in a separate cell.

In [1]:
# Copy files into model package directory
! cp 2_src/inference.py distilbert-base-uncased-distilled-squad/code
! cp 2_src/requirements.txt distilbert-base-uncased-distilled-squad/code

# Archive model package
!tar -C "$PWD" -czf distilbert-base-uncased-distilled-squad.tar.gz distilbert-base-uncased-distilled-squad/
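Before uploading, it can be worth confirming what actually ended up at the root of the archive. The sketch below is one possible way to do that with Python's standard `tarfile` module. The helper name `top_level_entries` is our own, and the demo builds a throwaway package with empty placeholder files in a temporary directory rather than touching the real model files.

```python
import tarfile
import tempfile
from pathlib import Path


def top_level_entries(archive_path):
    """Return the sorted first path components found in a gzipped tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
    return sorted({Path(name).parts[0] for name in names if Path(name).parts})


# Demo on a throwaway package (placeholder files, not a real model).
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "pkg"
    (pkg / "1").mkdir(parents=True)
    (pkg / "code").mkdir()
    (pkg / "1" / "saved_model.pb").write_bytes(b"")
    (pkg / "code" / "inference.py").write_text("# placeholder\n")

    archive = Path(tmp) / "pkg.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Add each child under its own name so it sits at the archive root
        for child in pkg.iterdir():
            tar.add(child, arcname=child.name)

    demo_entries = top_level_entries(archive)
    print(demo_entries)
```

Calling `top_level_entries("distilbert-base-uncased-distilled-squad.tar.gz")` on the archive created in the cell above shows which names sit at the root of the tarball, so they can be compared against the expected package structure.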

### Prepare PyTorch Model Package

SageMaker expects the following package structure for PyTorch models:
```python
        model.tar.gz/
                |- model.pth # and any other model artifacts
                |- code/
                        |- inference.py
                        |- requirements.txt # optional
```

Follow the steps below to prepare the text summarization model package:
1. We fetch the model artifacts and save them locally using the HuggingFace model and tokenizer APIs:

In [3]:
import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Downloading model from Model Hub
SUM_MODEL = "sshleifer/distilbart-cnn-6-6"
sum_model = AutoModelForSeq2SeqLM.from_pretrained(SUM_MODEL)
sum_tokenizer = AutoTokenizer.from_pretrained(SUM_MODEL)

# Saving model locally
sum_model_path = "distilbart-cnn-6-6"
os.makedirs(sum_model_path, exist_ok=True)

# Save both the model weights and the tokenizer files into the package directory
sum_model.save_pretrained(save_directory=sum_model_path)
sum_tokenizer.save_pretrained(save_directory=sum_model_path)

2. Next, we copy the inference code and dependencies into the model package and create a single archive:

In [6]:
# Copy inference code and dependencies
! mkdir -p distilbart-cnn-6-6/code
! cp 2_src/inference.py distilbart-cnn-6-6/code
! cp 2_src/requirements.txt distilbart-cnn-6-6/code

# Create model package tarball
!tar -C "$PWD" -czf distilbart-cnn-6-6.tar.gz distilbart-cnn-6-6/

### Upload model data to S3

Finally, we upload both packages to Amazon S3:

In [1]:

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = 'multi-container'
s3_path = 's3://{}/{}'.format(bucket, prefix)

In [28]:
qa_model_data = sagemaker_session.upload_data('distilbert-base-uncased-distilled-squad.tar.gz',
                                           bucket,
                                           os.path.join(prefix, 'model-artifacts'))

summarization_model_data = sagemaker_session.upload_data('distilbart-cnn-6-6.tar.gz',
                                           bucket,
                                           os.path.join(prefix, 'model-artifacts'))    
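`upload_data` returns the S3 URI of the uploaded file, which we later pass as `ModelDataUrl`. As a rough illustration, and assuming the URI has the form `s3://<bucket>/<key_prefix>/<filename>`, the sketch below composes the expected value so it can be compared with what the upload returned. The helper name `expected_s3_uri` and the bucket name `my-bucket` are placeholders of ours.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, filename):
    """Compose the S3 URI for a file stored at <bucket>/<key_prefix>/<filename>."""
    # S3 keys always use forward slashes, hence posixpath rather than os.path
    key = posixpath.join(key_prefix, posixpath.basename(filename))
    return f"s3://{bucket}/{key}"


# "my-bucket" is a placeholder; the real name comes from default_bucket().
demo_uri = expected_s3_uri(
    "my-bucket",
    posixpath.join("multi-container", "model-artifacts"),
    "distilbart-cnn-6-6.tar.gz",
)
print(demo_uri)
```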

## Deploy MCE

Now we are ready to configure and deploy our MCE.

### Configure Inference Container
To do this, we need to configure a runtime container for each target model. Execute the cells below to fetch the SageMaker TensorFlow and PyTorch container images, associate the model artifacts with each container, and provide runtime configuration via environment variables (`qa_env` for the TensorFlow container and `summarization_env` for the PyTorch container). Note that our inference script relies on the `NLP_TASK` variable to identify which inference pipeline to run (refer to the inference script for details).

In [64]:
region = sagemaker_session.boto_region_name
instance_type = "ml.m5.4xlarge"

In [66]:
qa_env = {
    "NLP_TASK" : "question-answering"
}

tf_inference_image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.8",
    py_version="py38",
    instance_type=instance_type,
    image_scope="inference",
)

tensorflow_container = {
    "ContainerHostname": "tensorflow-distilbert-qa",
    "Image": tf_inference_image_uri,
    "ModelDataUrl": qa_model_data,
    "Environment" : qa_env
}


In [65]:
summarization_env = {
    "NLP_TASK" : "summarization",
    "SAGEMAKER_PROGRAM" : "inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": summarization_model_data,
}

pt_inference_image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="1.9.0",
    py_version="py38",
    instance_type=instance_type,
    image_scope="inference",
)

pytorch_container = {
    "ContainerHostname": "pytorch-bart-summarizer",
    "Image": pt_inference_image_uri,
    "ModelDataUrl": summarization_model_data,
    "Environment" : summarization_env
}
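To make the role of `NLP_TASK` more concrete, here is a minimal sketch of how a shared inference script could dispatch on that variable. This is not the actual content of `2_src/inference.py`: the handler functions are hypothetical stand-ins for the HuggingFace pipelines, and `select_handler` is a name we made up for illustration.

```python
import os


# Hypothetical stand-ins for the real HuggingFace pipelines.
def run_question_answering(payload):
    return {"task": "question-answering", "received": sorted(payload)}


def run_summarization(payload):
    return {"task": "summarization", "received": sorted(payload)}


HANDLERS = {
    "question-answering": run_question_answering,
    "summarization": run_summarization,
}


def select_handler(environ=os.environ):
    """Pick a handler based on the NLP_TASK environment variable."""
    task = environ.get("NLP_TASK")
    if task not in HANDLERS:
        raise ValueError(f"Unsupported or missing NLP_TASK: {task!r}")
    return HANDLERS[task]


# Pass a plain dict instead of os.environ to simulate the container settings.
handler = select_handler({"NLP_TASK": "summarization"})
demo_result = handler({"article": "...", "max_length": 100})
print(demo_result)
```

Because each container receives its own `Environment` dictionary, the same script can behave differently in the TensorFlow and PyTorch containers without any code changes.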


### Creating MCE Endpoint

To create the model, endpoint configuration, and endpoint, we use the SageMaker boto3 client. Run the cells below to do this. Note that we supply both the TensorFlow container and the PyTorch container to a single model. We also set the inference execution mode to `Direct`, so that we can invoke each model directly.

In [69]:
import datetime

sm_client = sagemaker_session.sagemaker_client # SageMaker boto3 client

unique_id = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

model_name = f"mce-nlp-model-{unique_id}"

create_model_response = sm_client.create_model(
    ModelName=model_name,
    Containers=[tensorflow_container, pytorch_container],
    InferenceExecutionConfig={"Mode": "Direct"},
    ExecutionRoleArn=role,
)
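Since we later route requests by container hostname, a small pre-flight check on the container definitions can catch mistakes before calling `create_model`. The sketch below checks conventions this notebook relies on (an explicit, unique `ContainerHostname` per container, an `Image`, and at most 15 containers). It is not an exhaustive statement of the API's own validation rules, and `check_direct_mode_containers` is a hypothetical helper.

```python
MAX_CONTAINERS = 15  # limit mentioned at the top of this notebook


def check_direct_mode_containers(containers):
    """Return a list of problems found in the container definitions."""
    problems = []
    if not 1 <= len(containers) <= MAX_CONTAINERS:
        problems.append(f"expected 1..{MAX_CONTAINERS} containers, got {len(containers)}")

    hostnames = [c.get("ContainerHostname") for c in containers]
    for index, (container, hostname) in enumerate(zip(containers, hostnames)):
        if not hostname:
            problems.append(f"container {index} has no ContainerHostname")
        if "Image" not in container:
            problems.append(f"container {index} has no Image")

    named = [h for h in hostnames if h]
    if len(named) != len(set(named)):
        problems.append("ContainerHostname values must be unique")
    return problems


# Placeholder image URIs; the real ones come from sagemaker.image_uris.retrieve.
demo_containers = [
    {"ContainerHostname": "tensorflow-distilbert-qa", "Image": "<tf-image-uri>"},
    {"ContainerHostname": "pytorch-bart-summarizer", "Image": "<pt-image-uri>"},
]
demo_problems = check_direct_mode_containers(demo_containers)
print(demo_problems)
```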

In [70]:
endpoint_config_name = f"{model_name}-ep-config"

endpoint_config = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "prod",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        },
    ],
)

In [71]:
import time 

endpoint_name = f"{model_name}-ep"

endpoint = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

# Code to wait for MCE deployment completion
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
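The loop above polls until the status leaves `Creating`, with no upper bound on the wait. As a possible variation, the sketch below adds a timeout to the same polling idea. `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning a status string, so it can be exercised here with a fake status source instead of a real endpoint.

```python
import time


def wait_for_status(get_status, in_progress=("Creating",), poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it leaves the in-progress states or the timeout expires."""
    waited = 0
    status = get_status()
    while status in in_progress:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Demo with a fake status source and a no-op sleep so it finishes instantly.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

With the real client, `get_status` could be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`. boto3 also ships an `endpoint_in_service` waiter for the SageMaker client, which may be a simpler option if its default polling settings suit your case.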

## Running Inference 

Once the MCE is deployed, we can run inference with the Q&A and summarization models. For this, we use a paragraph about the Amazon rainforest. We expect the summarization model to condense the article into a shorter paragraph, and the Q&A model to answer a question based on the input article.

Run the cell below to define the article and the question.

In [7]:
import json

article = r"""
The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.
"""

question="What is Spanish name for Amazon?"


Our Q&A model is implemented in the TensorFlow framework and requires some initial preparation to match the TensorFlow Serving model signature. Run the cell below to tokenize the text and form the payload according to the model signature:

In [None]:
#  preparing data for TF Serving format
import numpy as np
import tensorflow as tf
from transformers import DistilBertTokenizer

max_length = 384
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

encoded_input = tokenizer(question, article, padding='max_length', max_length=max_length)
encoded_input = dict(encoded_input)
qa_inputs = [{"input_ids": np.array(encoded_input["input_ids"]).tolist(), "attention_mask":np.array(encoded_input["attention_mask"]).tolist()}]
qa_inputs = {"instances" : qa_inputs}
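To illustrate the shape of the request that the cell above produces, here is a small sketch that builds the same `{"instances": [...]}` structure by hand, without a tokenizer. The token ids are made up, `build_tf_serving_payload` is a hypothetical helper, and a padding id of `0` is an assumption for this example.

```python
import json


def build_tf_serving_payload(input_ids, max_length, pad_id=0):
    """Pad token ids to max_length and wrap them in an 'instances' request."""
    if len(input_ids) > max_length:
        raise ValueError("input is longer than max_length")
    padding = max_length - len(input_ids)
    return {
        "instances": [
            {
                "input_ids": input_ids + [pad_id] * padding,
                # 1 marks real tokens, 0 marks padding
                "attention_mask": [1] * len(input_ids) + [0] * padding,
            }
        ]
    }


# Made-up token ids, for illustration only.
demo_payload = build_tf_serving_payload([101, 1327, 1110, 102], max_length=8)
print(json.dumps(demo_payload))
```

Both lists always have length `max_length`, which is why the real cell pads with `padding='max_length'` before sending the request.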

Now we can send an inference request to the TensorFlow container. Note that we supply the `TargetContainerHostname` header so that SageMaker knows where to route our inference request:

In [22]:
runtime_sm_client = sagemaker_session.sagemaker_runtime_client

tf_response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    TargetContainerHostname="tensorflow-distilbert-qa",
    Body=json.dumps(qa_inputs),
)

# Processing predictions

predictions = json.loads(tf_response["Body"].read().decode())
answer_start_index = int(tf.math.argmax(predictions['predictions'][0]['output_0']))
answer_end_index = int(tf.math.argmax(predictions['predictions'][0]['output_1']))

predict_answer_tokens = encoded_input["input_ids"][answer_start_index : answer_end_index + 1]
# Decode the selected token ids back into text
answer = tokenizer.decode(predict_answer_tokens)

print(f"Question: {question}, answer: {answer}")
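The post-processing above picks the most likely start and end positions and slices the tokens between them. The sketch below shows the same idea in plain Python on toy data, so it can be followed without TensorFlow. The tokens and logits are made up, and returning an empty list when the end precedes the start is our own choice for the example.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def extract_answer_span(tokens, start_logits, end_logits):
    """Return the tokens between the highest-scoring start and end positions."""
    start = argmax(start_logits)
    end = argmax(end_logits)
    if end < start:
        return []  # the two argmaxes disagree, so there is no valid span
    return tokens[start : end + 1]


# Toy tokens and made-up logits, not real model output.
demo_tokens = ["[CLS]", "spanish", "name", "[SEP]", "selva", "amazónica", "[SEP]"]
demo_start = [0.1, 0.0, 0.2, 0.0, 4.5, 1.0, 0.0]
demo_end = [0.0, 0.1, 0.0, 0.0, 0.9, 5.2, 0.0]
demo_span = extract_answer_span(demo_tokens, demo_start, demo_end)
print(demo_span)
```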

Run the cell below to get a summary of the text using the PyTorch model:

In [None]:
summarization_input = {"article":article, "max_length":100}

pt_result = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    TargetContainerHostname="pytorch-bart-summarizer", 
    Body=json.dumps(summarization_input),
)

# The model output is in the streaming "Body" of the response
print(pt_result["Body"].read().decode())
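The response returned by `invoke_endpoint` carries the model output in a streaming `Body` object, so it has to be read and decoded before use. The sketch below shows one way to do that, using `io.BytesIO` as a stand-in for the streaming body. The `summary` key is a made-up example, since the actual response format depends on the inference script.

```python
import io
import json


def read_json_body(response):
    """Read and JSON-decode the 'Body' stream of an invoke_endpoint-style response."""
    raw = response["Body"].read()
    return json.loads(raw.decode("utf-8"))


# io.BytesIO stands in for the streaming body; the payload is made up.
fake_body = json.dumps({"summary": "A short summary."}).encode("utf-8")
fake_response = {"Body": io.BytesIO(fake_body)}
parsed = read_json_body(fake_response)
print(parsed["summary"])
```

Note that the body is a stream and can only be read once, so keep the parsed result in a variable if you need it again.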

## Resource Cleanup

Run the cell below to delete the cloud resources:

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName = model_name)
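If one of the delete calls fails (for example, because a resource was already removed), the remaining calls in the cell above are never reached. As one possible alternative, the sketch below runs every cleanup step and collects the failures instead of stopping at the first one. `delete_all` is a hypothetical helper, demonstrated here with fake delete functions rather than a real client.

```python
def delete_all(steps):
    """Run each (label, delete_fn) step in order and collect failures."""
    failures = []
    for label, delete_fn in steps:
        try:
            delete_fn()
        except Exception as error:  # keep going so later resources still get deleted
            failures.append((label, str(error)))
    return failures


# Fake delete functions: the second one simulates a failing call.
deleted = []


def failing_delete():
    raise RuntimeError("endpoint config not found")


demo_failures = delete_all([
    ("endpoint", lambda: deleted.append("endpoint")),
    ("endpoint config", failing_delete),
    ("model", lambda: deleted.append("model")),
])
print(deleted, demo_failures)
```

With the real client, each step could wrap one of the calls above, for example `("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name))`.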