# Using Triton Model Server

**NVIDIA Triton** is an open source model server developed by NVIDIA. It supports multiple DL frameworks and backends (such as TensorFlow, PyTorch, ONNX, Python, and OpenVINO), as well as various hardware platforms and runtime environments (NVIDIA GPUs, x86 and ARM CPUs, and AWS Inferentia). Triton can be used for inference in cloud and data center environments, as well as on edge or mobile devices, and it is optimized for performance and scalability on a range of CPU and GPU platforms. NVIDIA also provides specialized utilities for performance analysis and model analysis that help tune Triton’s performance.

You can run Triton model servers on SageMaker by using a pre-built SageMaker DL container. Note that SageMaker Triton containers are not open source. You can find the latest list of Triton containers here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only.

In this example, we will deploy a PyTorch ResNet50 image classification model using Triton. First, we need to compile the model for the TensorRT runtime; then, the compiled model will be packaged and deployed to the Triton model server.

## Prerequisites

There are several prerequisites for compiling the ResNet50 model for the TensorRT runtime:
- A compilation environment with preinstalled dependencies. For this, we will use the official NVIDIA PyTorch container `nvcr.io/nvidia/pytorch:22.05-py3`.
- A compute instance with NVIDIA Docker and CUDA, for example, a `g4dn` instance.


## Compiling Model for TensorRT

To compile the model, we need to run the compilation code inside the compilation environment, the NVIDIA PyTorch container. Let's review the compilation code below.

### Preparing Compilation Code

The compilation code is available in `3_src/compile_tensorrt.py`. The key code blocks are highlighted below:
1. We will start by loading the model from PyTorch Hub, setting it to evaluation mode, and placing it on the GPU device:
```python
    MODEL_NAME = "resnet50"
    model = (
        torch.hub.load("pytorch/vision:v0.10.0", MODEL_NAME, pretrained=True)
        .eval()
        .to(device)
    )
```
2. Next, we will compile it using the Torch-TensorRT compiler (`torch_tensorrt`). As part of the compiler configuration, we specify the expected inputs and the target precision. The snippet below declares a single static input shape with a batch size of 1. To serve a range of batch sizes with dynamic batching, `torch_tensorrt.Input` also accepts `min_shape`, `opt_shape`, and `max_shape` arguments in place of a fixed shape:
```python
    trt_model = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
        enabled_precisions={torch.float32},
    )
```
3. We then save the compiled model:
```python
    torch.jit.save(trt_model, os.path.join(model_dir, "model.pt"))
```

### Running Compilation Code

Once the compilation code is ready, we can run it inside the NVIDIA PyTorch container.

1. First, start a Docker container with the following command in a separate console (not in a Jupyter notebook):
`docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v $PWD/Chapter09/3_src:/workspace/3_src nvcr.io/nvidia/pytorch:22.05-py3`
2. Your console session will open inside a container, where you can execute the compilation script by running the `python 3_src/compile_tensorrt.py` command.

Because `3_src` is mounted into the container as a volume, the resulting `model.pt` file will also be available outside of the Docker container in the `3_src` directory.

## Preparing Model Package

Once the model is compiled, we need to prepare the model package. As mentioned previously, Triton uses a configuration file with a specific convention to define model signatures and runtime configuration.


### Creating Inference Configuration
Run the cell below to create the `config.pbtxt` file that we can use to host the ResNet50 model. Here, we define batching parameters (the max batch size and dynamic batching config), input and output signatures, as well as model copies and the target hardware environment (via the `instance_group` object):

In [None]:
%%writefile ./3_src/resnet50/config.pbtxt
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 1, 3, 224, 224 ] }
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1, 1000, 1, 1 ]
    reshape { shape: [ 1, 1000 ] }
  }
]
dynamic_batching {
  preferred_batch_size: 16
  max_queue_delay_microseconds: 1000
}
instance_group {
  count: 1
  kind: KIND_GPU
}

### Packaging Model Artifacts

To deploy the compiled model with its configuration, we need to bundle everything into a single `tar.gz` archive and upload it to Amazon S3. The following listing shows the directory structure within the model archive, where `1` is the model version directory:

```text
resnet50
    |- 1
        |- model.pt
    |- config.pbtxt
```

Execute the cell below to prepare the model package:


In [None]:
!tar -czvf 3_src/resnet50.tar.gz -C 3_src resnet50
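
As an alternative to the shell command, the following minimal sketch builds the archive with Python's standard `tarfile` module and returns the member paths, so you can confirm that `resnet50` sits at the archive root before uploading. The helper name `package_model` and the throwaway directory with placeholder files are illustrative assumptions, not part of the original notebook.

```python
import os
import tarfile
import tempfile


def package_model(model_repo_dir, archive_path):
    """Archive a Triton model directory so that its name is the archive root."""
    model_name = os.path.basename(os.path.normpath(model_repo_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops parent folders such as "3_src/" from the member paths.
        tar.add(model_repo_dir, arcname=model_name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Usage with a throwaway directory that mimics 3_src/resnet50 (placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "3_src", "resnet50")
    os.makedirs(os.path.join(repo, "1"))
    for rel_path in ("config.pbtxt", os.path.join("1", "model.pt")):
        with open(os.path.join(repo, rel_path), "wb") as f:
            f.write(b"placeholder")
    members = package_model(repo, os.path.join(tmp, "resnet50.tar.gz"))
    print(members)
```

In the notebook, the equivalent call would be `package_model("3_src/resnet50", "3_src/resnet50.tar.gz")`.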

Finally, we upload the model archive to Amazon S3. For this, we instantiate a SageMaker `Session` object.

In [None]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = 'triton'
s3_path = 's3://{}/{}'.format(bucket, prefix)


# The archive was written to the 3_src directory by the tar command above.
model_data = sagemaker_session.upload_data("3_src/resnet50.tar.gz",
                                           bucket,
                                           prefix)
print(model_data)

## Deploying Triton Endpoint

The Triton inference container is not supported by the SageMaker Python SDK. Hence, we will need to use the boto3 SageMaker client to deploy the model.

1. First, we need to identify the correct Triton image. Use the following code to find the Triton container URI based on your version of the Triton server (we used `22.05` for both model compilation and serving) and your AWS region:

In [15]:
account_id_map = {
    'us-east-1': '785573368785',
    'us-east-2': '007439368137',
    'us-west-1': '710691900526',
    'us-west-2': '301217895009',
    'eu-west-1': '802834080501',
    'eu-west-2': '205493899709',
    'eu-west-3': '254080097072',
    'eu-north-1': '601324751636',
    'eu-south-1': '966458181534',
    'eu-central-1': '746233611703',
    'ap-east-1': '110948597952',
    'ap-south-1': '763008648453',
    'ap-northeast-1': '941853720454',
    'ap-northeast-2': '151534178276',
    'ap-southeast-1': '324986816169',
    'ap-southeast-2': '355873309152',
    'cn-northwest-1': '474822919863',
    'cn-north-1': '472730292857',
    'sa-east-1': '756306329178',
    'ca-central-1': '464438896020',
    'me-south-1': '836785723513',
    'af-south-1': '774647643957'
}

import boto3

region = boto3.Session().region_name
if region not in account_id_map:
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
# The image tag matches the Triton release used in this example (22.05).
triton_image_uri = f"{account_id_map[region]}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.05-py3"
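
The URI construction can also be wrapped in a small function, which makes the region check and the China-region domain switch easy to verify in isolation. This is a sketch: the function name `build_triton_image_uri` and its `tag` parameter are hypothetical, and the two account IDs are copied from the map above only to exercise both branches.

```python
def build_triton_image_uri(region, account_ids, tag="22.05-py3"):
    """Return the ECR URI of the SageMaker Triton image for a region."""
    if region not in account_ids:
        raise ValueError(f"Unsupported region: {region}")
    # China regions use a different DNS suffix.
    domain = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_ids[region]}.dkr.ecr.{region}.{domain}/sagemaker-tritonserver:{tag}"


# Two entries copied from the account map above, enough to cover both branches.
sample_accounts = {"us-east-1": "785573368785", "cn-north-1": "472730292857"}
us_uri = build_triton_image_uri("us-east-1", sample_accounts)
cn_uri = build_triton_image_uri("cn-north-1", sample_accounts)
print(us_uri)
print(cn_uri)
```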

2. Next, we create the model, which defines the model data and serving container, as well as other parameters, such as environment variables. Note that we use the low-level SageMaker client (`sm_client`, taken from the session's `sagemaker_client` attribute) for this.

In [None]:
import time

sm_client = sagemaker_session.sagemaker_client
runtime_sm_client = sagemaker_session.sagemaker_runtime_client

sm_model_name = "triton-resnet50-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_data,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet50"},
}

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

3. After that, we can define the endpoint configuration:

In [None]:
endpoint_config_name = "triton-resnet50-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.4xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

4. Now, we are ready to deploy our endpoint. The polling loop below waits until the endpoint leaves the `Creating` state.

In [None]:
endpoint_name = "triton-resnet50-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
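
The loop above runs for as long as the endpoint reports `Creating` and does not distinguish a successful deployment from a failed one. A minimal sketch of a bounded variant is shown below; the helper name `wait_for_endpoint` and its parameters are hypothetical, and a fake status source stands in for `sm_client.describe_endpoint` so the example runs without AWS access.

```python
import time


def wait_for_endpoint(describe, timeout_s=1800, poll_s=60, sleep=time.sleep):
    """Poll describe() until the status leaves "Creating" or the timeout expires."""
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_s:
            raise TimeoutError(f"Endpoint still creating after {waited} seconds")
        sleep(poll_s)
        waited += poll_s
        status = describe()["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"Endpoint ended in status {status}")
    return status


# Stand-in for sm_client.describe_endpoint that reports "Creating" twice.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final_status)
```

With the real client, the call would look like `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which covers the same need.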

## Running Inference

To run inference, we first download and preprocess a sample image to match the expected model input.

In [None]:
from PIL import Image
import numpy as np
import boto3

s3_client = boto3.client('s3')
s3_client.download_file(
    "sagemaker-sample-files",
    "datasets/image/pets/shiba_inu_dog.jpg",
    "shiba_inu_dog.jpg"
)

def get_sample_image():
    image_path = "./shiba_inu_dog.jpg"
    img = Image.open(image_path).convert("RGB")
    img = img.resize((224, 224))
    img = (np.array(img).astype(np.float32) / 255) - np.array(
        [0.485, 0.456, 0.406], dtype=np.float32
    ).reshape(1, 1, 3)
    img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
    img = np.transpose(img, (2, 0, 1))
    return img.tolist()
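
To make the arithmetic in `get_sample_image` concrete, here is a dependency-free sketch of the same two steps on a tiny image: per-channel standardization with the ImageNet mean and standard deviation values used above, followed by the reordering from height-width-channel to channel-height-width layout. The helper names `normalize_pixel` and `hwc_to_chw` are illustrative only.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    ]


def hwc_to_chw(image):
    """Reorder a nested list from [height][width][channel] to [channel][height][width]."""
    channels = len(image[0][0])
    return [[[pixel[c] for pixel in row] for row in image] for c in range(channels)]


# A 1x2 image: one white pixel and one black pixel.
tiny_image = [[normalize_pixel((255, 255, 255)), normalize_pixel((0, 0, 0))]]
chw = hwc_to_chw(tiny_image)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
```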

Next, we construct a payload according to the model signature defined in `config.pbtxt`. Take a look at the following inference call. The response will follow a defined output signature as well:

In [None]:
import json

payload = {
    "inputs": [
        {
            "name": "input__0",  # must match the input name in config.pbtxt exactly
            "shape": [1, 3, 224, 224],
            "datatype": "FP32",
            "data": get_sample_image(),
        }
    ]
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload)
)

print(json.loads(response["Body"].read().decode("utf8")))
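
The printed response contains raw class scores rather than a label. As a sketch of how it could be post-processed, the helper below applies a softmax and returns the highest-scoring class indices. It assumes the parsed response holds an `outputs` list whose first entry carries a flat `data` list for a single image; the function name `top_k_classes` and the four-class fake response are illustrative.

```python
import math


def top_k_classes(response, k=5):
    """Return (class_index, probability) pairs for the k highest-scoring classes."""
    logits = response["outputs"][0]["data"]
    # Softmax with the maximum subtracted first for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    ranked = sorted(range(len(logits)), key=lambda i: probabilities[i], reverse=True)
    return [(i, probabilities[i]) for i in ranked[:k]]


# Hand-written stand-in for a parsed response with four classes instead of 1,000.
fake_response = {
    "outputs": [
        {"name": "output__0", "datatype": "FP32", "shape": [1, 4], "data": [0.1, 2.0, -1.0, 0.5]}
    ]
}
top2 = top_k_classes(fake_response, k=2)
print(top2)
```

Mapping the returned indices to human-readable names requires the ImageNet class list, which is not part of this example.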

## Resource Cleanup

Execute the cell below to delete the endpoint, the endpoint configuration, and the model:

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)