# Run multiple NLP models on GPU with SageMaker Multi-Model Endpoints

<div class="alert alert-info"> 💡 <strong> Note </strong>
SageMaker Multi-Model Endpoint with GPU support is a beta feature and is not recommended for production use cases
</div>

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker accelerates innovation within your organization by providing purpose-built tools for every step of ML development, including labeling, data preparation, feature engineering, statistical bias detection, AutoML, training, tuning, hosting, explainability, monitoring, and workflow automation.

Customers train ML models tailored to individual users, granular market segments, hyper-personalized content, and so on. For example, a call center analytics application that uses an NLP translation service to serve customers in different geographic locations may train a custom model for each language. Building a large number of custom models increases the cost of inference and of managing the models. These challenges become more pronounced when not all models are accessed at the same rate but all still need to be available at all times.


This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g4dn`.


## SageMaker Multi-Model Endpoints with GPU

SageMaker multi-model endpoints (MME) provide a scalable and cost-effective way to deploy large numbers of ML models in the cloud. They let you deploy multiple ML models behind a single endpoint and serve them from a single serving container. Until now, MME has been available only on CPU-based instance types, which prevented customers from using it for deep learning models that need GPU acceleration. With the new private beta feature, customers can host and serve GPU-based deep learning models on a SageMaker multi-model endpoint.

In [None]:
from IPython import display
display.Image("./images/mme-gpu.png")

## How it works

SageMaker MME with GPU works with NVIDIA Triton Inference Server.

[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.

Some key features of Triton are:
* **Support for Multiple frameworks**: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats. 
* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.
* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.
* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.

In this notebook, we deploy multiple `bert-large-uncased` models to a SageMaker multi-model endpoint backed by an `ml.g4dn.xlarge` instance.

### Installs

Install the dependencies required to package the model and run inference against Triton server, and update SageMaker, boto3, awscli, and related packages.

In [4]:
!pip install -qU pip awscli boto3 sagemaker transformers==4.9.1
!pip install nvidia-pyindex
!pip install tritonclient[http]

#### Imports

In [5]:
# general imports
import boto3
import json
import os
import re
import copy
import time
from time import gmtime, strftime
import numpy as np
import datetime
import pprint
import pandas as pd

# sagemaker
import sagemaker
from sagemaker import get_execution_role

# triton
import tritonclient.http as httpclient

# transformers
from transformers import BertTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


#### Set Variables

Below we set the SageMaker session, clients, and other variables, and define the IAM role that gives Amazon SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

In [None]:
sess = boto3.Session()
region = sess.region_name
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = "bert_mme_gpu"

# Service clients used throughout the notebook
sm_client = sess.client("sagemaker", region_name=region)
runtime_sm_client = sess.client("sagemaker-runtime", region_name=region)
s3_client = sess.client("s3", region_name=region)
cw_client = sess.client("cloudwatch", region_name=region)

print(f"SageMaker Role: {role}")
print(f"Region Name: {region}")

In [None]:
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = "bert-mme-gpu-" + ts
print(f"SageMaker Model Name: {sm_model_name}")

#### Amazon SageMaker Triton Inference Server Deep Learning Container Image

Set `mme_triton_image_uri` based on the `account_id` and `region` information

In [None]:
#ap-south-1 region
mme_triton_image_uri = "850464037171.dkr.ecr.ap-south-1.amazonaws.com/tritonserver:22.07-py3"
#us-east-1 region
#mme_triton_image_uri = "785573368785.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.07-py3"
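
The region-specific choice above can also be expressed as a small lookup. This is a minimal sketch that assumes only the two URIs listed in this notebook. The `TRITON_IMAGE_URIS` table and the `get_triton_image_uri` helper are hypothetical names, and any other region would need its own entry.

```python
# Hypothetical lookup table built only from the two URIs listed in this notebook.
TRITON_IMAGE_URIS = {
    "ap-south-1": "850464037171.dkr.ecr.ap-south-1.amazonaws.com/tritonserver:22.07-py3",
    "us-east-1": "785573368785.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.07-py3",
}


def get_triton_image_uri(region_name):
    """Return the Triton image URI for a region, or raise if the region is not listed."""
    try:
        return TRITON_IMAGE_URIS[region_name]
    except KeyError:
        raise ValueError(
            f"No Triton image URI configured for region {region_name!r}"
        ) from None
```

With this helper, `mme_triton_image_uri = get_triton_image_uri(region)` fails early with a clear message when the notebook runs in a region that has no entry.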

## NVIDIA Triton Setup with Amazon SageMaker

1. We use [generate_models.sh](./workspace/generate_models.sh) to generate the `bert-large-uncased` model for use with NVIDIA Triton Inference Server.
2. The script that loads the pre-trained `bert-large-uncased` model and saves it is [pt_exporter.py](./workspace/pt_exporter.py).
3. The pre-trained model is JIT-traced with a dummy input and stored in TorchScript format as `model.pt`.
4. After the model is serialized, we package it in the layout that Triton and SageMaker expect.
5. We use the pre-configured `config.pbtxt` file provided with this repo to specify the model [configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) that Triton uses to load the model.
6. We tar the model directory and upload it to S3 to later create a [SageMaker Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html).

<div class="alert alert-info"> 💡 <strong> NOTE </strong>
The script below uses Docker and therefore does not work in an Amazon SageMaker Studio notebook. Please use an Amazon SageMaker notebook instance to run this notebook.
</div>

### Workflow overview

#### 1. PyTorch model
In this step, we load the pre-trained `bert-large-uncased` model and save it as a `model.pt` file. The model is JIT-traced with a dummy input to produce TorchScript (see `pt_exporter.py`).

In [None]:
!docker run --gpus=all --rm -it \
            -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.07-py3 \
            /bin/bash generate_model_pytorch.sh

The script saves the model in this [workspace](./workspace/) directory

#### 2. Build Model Repository

We used the pre-configured `config.pbtxt` file provided with this repo to specify model [configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) which Triton uses to load the model.

The model repository contains the model to serve (in our case `model.pt`) and a configuration file with the input/output specifications and metadata.

**Note**: Amazon SageMaker expects the model tarball to have a top-level directory with the same name as the model defined in `config.pbtxt`. Below is the sample model directory structure:

```
bert
├── 1
│   └── model.pt
└── config.pbtxt
```

In [None]:
!mkdir -p triton-serve-pt/bert/

In [None]:
%%writefile triton-serve-pt/bert/config.pbtxt
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [512]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT32
    dims: [512]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [512, 768]
  },
  {
    name: "1634__1"
    data_type: TYPE_FP32
    dims: [768]
  }
]

#### 3. Export model artifacts to S3

SageMaker expects the model artifacts in the format below, which must also satisfy the Triton container requirements for model name, version directory, and `config.pbtxt`. We `tar` the folder containing the model file as a `.tar.gz` archive and upload it to S3.

In [None]:
!mkdir -p triton-serve-pt/bert/1/
!mv -f workspace/model.pt triton-serve-pt/bert/1/
!tar -C triton-serve-pt/ -czf bert_pt_v0.tar.gz bert
model_uri_pt = sagemaker_session.upload_data(path="bert_pt_v0.tar.gz", key_prefix="bert_mme_gpu")
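
Before uploading, it can help to confirm that the archive has the layout described above. The following is a minimal sketch; `check_triton_tarball` is a hypothetical helper, not part of SageMaker or Triton, and it checks only the directory layout, not the contents of `config.pbtxt`.

```python
import posixpath
import tarfile


def check_triton_tarball(tar_path, model_name):
    """Return a list of layout problems found in a model tarball (empty if none)."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = {posixpath.normpath(name) for name in tar.getnames()}

    problems = []

    # The archive should contain exactly one top-level directory named after the model.
    top_level = {name.split("/")[0] for name in names if name != "."}
    if top_level != {model_name}:
        problems.append(
            f"expected a single top-level directory {model_name!r}, found {sorted(top_level)}"
        )

    # config.pbtxt sits directly under the model directory.
    if f"{model_name}/config.pbtxt" not in names:
        problems.append("missing config.pbtxt")

    # At least one file must live under a numeric version directory, e.g. bert/1/model.pt.
    has_version_file = any(
        len(parts) == 3 and parts[0] == model_name and parts[1].isdigit()
        for parts in (name.split("/") for name in names)
    )
    if not has_version_file:
        problems.append("no file found under a numeric version directory")

    return problems
```

For example, `check_triton_tarball("bert_pt_v0.tar.gz", "bert")` returns an empty list when the archive matches the expected layout.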

#### 4. Creating copies of model to be loaded to MME

We create three copies of the `bert-large-uncased` model artifact, for a total of four models to serve from the SageMaker multi-model endpoint (MME). In practice, there could be thousands of custom models, depending on the ML application and use case.

In [None]:
!aws s3 cp s3://$bucket/$prefix/bert_pt_v0.tar.gz s3://$bucket/$prefix/bert_pt_v1.tar.gz
!aws s3 cp s3://$bucket/$prefix/bert_pt_v0.tar.gz s3://$bucket/$prefix/bert_pt_v2.tar.gz
!aws s3 cp s3://$bucket/$prefix/bert_pt_v0.tar.gz s3://$bucket/$prefix/bert_pt_v3.tar.gz
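
The same copies can be made from Python with the S3 client's `copy_object` call, which is convenient when the number of copies is a variable. This is a sketch under assumptions: `copy_model_artifact` is a hypothetical helper, the `bert_pt_v{i}.tar.gz` naming follows the commands above, and `copy_object` handles objects up to 5 GB in a single call.

```python
def copy_model_artifact(s3, bucket, prefix, source_name, num_copies):
    """Copy one model tarball to bert_pt_v1 .. bert_pt_v{num_copies} under the same prefix."""
    copied_keys = []
    for i in range(1, num_copies + 1):
        target_key = f"{prefix}/bert_pt_v{i}.tar.gz"
        # Server-side copy: the object is not downloaded to the notebook instance.
        s3.copy_object(
            Bucket=bucket,
            Key=target_key,
            CopySource={"Bucket": bucket, "Key": f"{prefix}/{source_name}"},
        )
        copied_keys.append(target_key)
    return copied_keys
```

Calling `copy_model_artifact(s3_client, bucket, prefix, "bert_pt_v0.tar.gz", 3)` would produce the same three objects as the CLI commands.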


#### 5. Create SageMaker Endpoint

Now that we have four models, we start by creating a [SageMaker model](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) from the model files we uploaded to S3 in the previous step.

In this step we also provide an additional Environment Variable i.e. `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` which specifies the name of the model to be loaded by Triton. **The value of this key should match the folder name in the model package uploaded to s3**. This variable is optional in case of a single model. In case of ensemble models, this key **has to be** specified for Triton to startup in SageMaker.

Additionally, customers can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` for optimizing the thread counts.

`model_data_url` is the S3 prefix that contains all the models the SageMaker multi-model endpoint will load and serve predictions from. `Mode` indicates how SageMaker hosts the model; here it is `MultiModel`.

In [None]:
!aws s3 ls s3://$bucket/$prefix/

In [None]:
model_data_url = f"s3://{bucket}/{prefix}/"

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "bert"},
}
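
The optional thread-count variables mentioned above go into the same `Environment` map. The sketch below assumes a hypothetical `build_container` helper. It converts the counts to strings because the `Environment` map takes string values; any numbers passed in are illustrative, not tuned recommendations.

```python
def build_container(
    image_uri,
    model_data_url,
    default_model_name,
    thread_count=None,
    buffer_manager_thread_count=None,
):
    """Build a MultiModel container definition with optional Triton thread settings."""
    environment = {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model_name}

    # Only set the optional variables when a value is supplied.
    if thread_count is not None:
        environment["SAGEMAKER_TRITON_THREAD_COUNT"] = str(thread_count)
    if buffer_manager_thread_count is not None:
        environment["SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT"] = str(
            buffer_manager_thread_count
        )

    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Mode": "MultiModel",
        "Environment": environment,
    }
```

For example, `build_container(mme_triton_image_uri, model_data_url, "bert", thread_count=8)` yields the container definition above plus one extra environment entry.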

Once the image and data location are set, we create the model with `create_model`, specifying the `ModelName` and the container definition.

In [None]:
sm_model_name = "bert-mme-gpu-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Using the model above, we create an [endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where we can specify the type and number of instances we want in the endpoint.

In [None]:
endpoint_config_name = "bert-mme-gpu-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the endpoint configuration above, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to **InService** once the deployment succeeds.

In [None]:
endpoint_name = "bert-mme-gpu-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
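
The loop above keeps polling while the status is `Creating` and does not distinguish success from failure once it exits. A slightly more defensive variant is sketched below. `wait_for_endpoint` is a hypothetical helper that takes the describe call as a parameter, and the timeout value is an assumption. boto3 also provides a built-in `endpoint_in_service` waiter that serves a similar purpose.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=60, timeout_seconds=3600):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # Any other status (for example Failed) means the deployment did not succeed.
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

In this notebook it would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.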

#### 6. Run Inference

####  Create payload 

In [6]:
def tokenize_text(text):
    enc = BertTokenizer.from_pretrained("bert-large-uncased")
    encoded_text = enc(text, padding="max_length", max_length=512, truncation=True)
    return encoded_text["input_ids"], encoded_text["attention_mask"]

To change the payload size (token length), make the following changes:
1. Change the `shape` in the JSON payload below to reflect the new token length.
2. Change `max_length` in the `tokenize_text` function to the new token length.
3. Change the `dims` of the input IDs and attention mask in the `config.pbtxt` under the `triton*` folder to the new length.
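
One way to keep the first two changes in sync is to derive the payload shape from the token lists instead of hard-coding it. This is a minimal sketch; `build_payload` is a hypothetical helper, and `config.pbtxt` still has to be updated separately.

```python
def build_payload(input_ids, attention_mask):
    """Build the inference request body with shapes derived from the token lists."""
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have the same length")

    seq_len = len(input_ids)
    return {
        "inputs": [
            {"name": "INPUT__0", "shape": [1, seq_len], "datatype": "INT32", "data": input_ids},
            {"name": "INPUT__1", "shape": [1, seq_len], "datatype": "INT32", "data": attention_mask},
        ]
    }
```

With this helper, `payload = build_payload(*tokenize_text(text))` always reports the length the tokenizer actually produced.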

In [7]:
text_triton = """
                Create payload JSON and upload it on S3. 
                This will be used by Inference Recommender to run the load test.
              """

input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "INPUT__0", "shape": [1, 512], "datatype": "INT32", "data": input_ids},
        {"name": "INPUT__1", "shape": [1, 512], "datatype": "INT32", "data": attention_mask},
    ]
}


In [12]:
print(input_ids)
print(len(input_ids)) # should be 512

[101, 3443, 18093, 1046, 3385, 1998, 2039, 11066, 2009, 2006, 1055, 2509, 1012, 2023, 2097, 2022, 2109, 2011, 28937, 16755, 2121, 2000, 2448, 1996, 7170, 3231, 1012, 102, 0, 0, 0, ... (output truncated)

When invoking a multi-model endpoint, we use `TargetModel` to specify the name of the model artifact that should serve the request.

In [None]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload),
    TargetModel='bert_pt_v0.tar.gz'
)

print(json.loads(response["Body"].read().decode("utf8")))
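
SageMaker resolves `TargetModel` relative to the `ModelDataUrl` prefix configured on the model. The sketch below shows which S3 object a given `TargetModel` value refers to; `resolve_target_model` is a hypothetical helper used only for illustration, and the bucket name in the example is made up.

```python
def resolve_target_model(model_data_url, target_model):
    """Return the S3 URI that a TargetModel value refers to under ModelDataUrl."""
    return model_data_url.rstrip("/") + "/" + target_model.lstrip("/")


print(resolve_target_model("s3://my-bucket/bert_mme_gpu/", "bert_pt_v0.tar.gz"))
# s3://my-bucket/bert_mme_gpu/bert_pt_v0.tar.gz
```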

Let's invoke different models at random to simulate traffic to the MME endpoint.

In [None]:
import random

for i in range(10):
    n = random.randint(0,3)

    response = runtime_sm_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/octet-stream",
            Body=json.dumps(payload),
            TargetModel=f"bert_pt_v{n}.tar.gz"
        )
    response = json.loads(response["Body"].read().decode("utf8"))
    output = response['outputs'][0]['data']
    print(output)
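
Printing the raw `data` list produces a very long output. A compact summary of each output tensor is often easier to read. The sketch below assumes the response lists each output with `name`, `datatype`, `shape`, and a flat `data` list; `summarize_outputs` is a hypothetical helper.

```python
import math


def summarize_outputs(response_json):
    """Summarize each output tensor and check that data length matches its shape."""
    summary = {}
    for out in response_json["outputs"]:
        expected = math.prod(out["shape"])
        if len(out["data"]) != expected:
            raise ValueError(
                f"{out['name']}: expected {expected} values for shape {out['shape']}, "
                f"got {len(out['data'])}"
            )
        summary[out["name"]] = {
            "shape": out["shape"],
            "datatype": out["datatype"],
            "num_values": expected,
        }
    return summary
```

Inside the loop above, `print(summarize_outputs(response))` would then replace `print(output)`.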

In [None]:
# Clean up: delete the endpoint first, then the configuration and model it depends on
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

## Conclusion

This notebook provides an overview of the new private beta feature for hosting multiple deep learning models with Amazon SageMaker multi-model endpoints (MME).