In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Triton for Recommender Systems

The Triton Inference Server allows us to deploy our model to the web regardless of cloud provider, and it supports a number of different machine learning frameworks such as TensorFlow and PyTorch.

**Objectives:**

Learn how to deploy a model to Triton
1. Deploy saved NVTabular and PyTorch models to Triton Inference Server
2. Send requests for predictions

**Pull and start Inference docker container**

Before connecting to the Triton server, we launch the inference Docker container and then load the ensemble `t4r_pytorch` into the inference server. This is done with the commands below.

Launch the Docker container:
```
docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <path_to_saved_models>:/root/models/ nvcr.io/nvidia/merlin/merlin-inference:0.6
```

This command mounts your local model-repository folder, which contains the models saved in the previous notebook, at the /root/models directory inside the merlin-inference Docker container.

Start the Triton server:
After starting the merlin-inference container, you can start the Triton server with the command below. You need to provide the correct path to the models folder.
```
tritonserver --model-repository=<path_to_models> --model-control-mode=explicit
```
Note: The model-repository path for our example is /root/models. Because the server runs with `--model-control-mode=explicit`, the models are not loaded yet. Below, we will ask the Triton server to load the saved ensemble model.

## 1. Deploy PyTorch and NVTabular Model to Triton Inference Server

Our Triton server has already been launched and is ready to accept requests. We already exported the saved PyTorch model in the previous notebook and generated the config files for Triton Inference Server.

In [1]:
# Import dependencies
import os
from time import time

import argparse
import numpy as np
import pandas as pd
import sys
import cudf

## 1.2. Review exported files

Triton expects each model in the repository to follow a specific directory structure:

```
<model-name>/
[config.pbtxt]
<version-name>/
  [model.savedmodel]/
    <pytorch_saved_model_files>/
      ...
```
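
The layout above can be checked programmatically before asking Triton to load anything. The sketch below is a minimal, hypothetical helper (`check_model_dir` is not part of Triton or NVTabular). It assumes that version directories are named with integers such as `1/`. A missing version directory is one possible cause of a `no version is available` error like the one shown later in this notebook.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems found in one model directory.

    Assumptions (not taken from the notebook): the config file is named
    `config.pbtxt`, and version directories have integer names like `1/`.
    """
    model_dir = Path(model_dir)
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Collect subdirectories whose names consist only of digits.
    versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not versions:
        problems.append("no numeric version directory")
    return problems
```

Calling `check_model_dir("/root/models/t4r_pytorch")` inside the container would return an empty list when both pieces are present.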

Let's check out our model repository layout. You can install the `tree` utility with `apt-get install tree`, and then run `!tree /workspace/models/` to print the model repository layout.

Triton needs a [config file](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_configuration.html) to understand how to interpret the model. Let's look at the generated config file. It defines the input columns with their data types and dimensions, and the output layer. Creating this config file manually can be complicated, so NVTabular provides the `export_pytorch_ensemble` function to generate it when exporting a PyTorch model for Triton.



The config file needs the following information:
* [name](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/protobuf_api/model_config.proto.html#_CPPv4N6nvidia15inferenceserver11ModelConfig4nameE): The name of our model. Must be the same name as the parent folder.
* [platform](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/protobuf_api/model_config.proto.html#_CPPv4N6nvidia15inferenceserver11ModelConfig8platformE): The type of framework serving the model.
* [input](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/protobuf_api/model_config.proto.html#_CPPv4N6nvidia15inferenceserver11ModelConfig5inputE): The input our model expects.
  * `name`: Should correspond with the model input name.
  * `data_type`: Should correspond to the input's data type.
  * `dims`: The dimensions of the *request* for the input, as in the dimensions of the data the user passes to us.
* [output](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/protobuf_api/model_config.proto.html#_CPPv4N6nvidia15inferenceserver11ModelConfig6outputE): The output parameters of our model.
  * `name`: Should correspond with the model output name.
  * `data_type`: Should correspond to the output's data type.
  * `dims`: The dimensions of the output.
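
The rule that `name` must match the parent folder is easy to verify with a few lines of Python. The sketch below is a rough shortcut rather than a protobuf parser: both function names are hypothetical, and it assumes the top-level `name` field appears on its own unindented line, while the nested input and output names are indented.

```python
import re
from pathlib import Path


def config_model_name(config_text):
    """Extract the top-level `name` field from config.pbtxt text.

    Assumption: the top-level field is written as an unindented
    `name: "..."` line, so indented input/output names are skipped.
    """
    match = re.search(r'^name:\s*"([^"]+)"', config_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def name_matches_folder(config_path):
    """Return True when the config's `name` equals its parent folder's name."""
    config_path = Path(config_path)
    return config_model_name(config_path.read_text()) == config_path.parent.name
```

For a repository mounted at /root/models, `name_matches_folder("/root/models/t4r_pytorch/config.pbtxt")` would be expected to return `True`.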

## 1.3. Loading Model

Next, let's build a client to connect to our server. This InferenceServerClient object is what we'll be using to talk to Triton.

In [2]:
import tritonhttpclient

try:
    triton_client = tritonhttpclient.InferenceServerClient(url="10.110.20.127:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))
triton_client.is_server_live()

client created.
GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>




True
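
When the server has only just been started, the liveness call above may fail or return `False` until startup finishes. A minimal sketch of a retry loop follows. `wait_until` and its parameters are hypothetical names, not part of the Triton client, and the probe is passed in as a callable so that `triton_client.is_server_live` can be plugged in.

```python
import time


def wait_until(probe, timeout_s=30.0, interval_s=1.0, sleep=time.sleep):
    """Call `probe()` until it returns True or `timeout_s` elapses.

    `probe` is any zero-argument callable. Exceptions raised by the probe
    (for example, connection errors while the server is still starting)
    are treated as "not ready yet". This is an assumption of the sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # not reachable yet; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until(triton_client.is_server_live, timeout_s=60)` would poll the client created above for up to a minute.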

In [3]:
triton_client.get_model_repository_index()

POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '77'}>
bytearray(b'[{"name":"t4r_pytorch"},{"name":"t4r_pytorch_nvt"},{"name":"t4r_pytorch_pt"}]')


[{'name': 't4r_pytorch'},
 {'name': 't4r_pytorch_nvt'},
 {'name': 't4r_pytorch_pt'}]

- We load the ensemble model

In [4]:
model_name = "t4r_pytorch"
triton_client.load_model(model_name=model_name)

POST /v2/repository/models/t4r_pytorch/load, headers None

<HTTPSocketPoolResponse status=400 headers={'content-type': 'application/json', 'content-length': '65'}>


InferenceServerException: failed to load 't4r_pytorch', no version is available

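If all models are loaded successfully, you should see a successfully loaded status next to each model name in your terminal. The output captured above instead shows a failed request (`no version is available`); in that case, check that each model folder in the repository contains a version subdirectory.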
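
Instead of reading the terminal, the load result can also be checked from the client. The sketch below wraps the load request and a readiness check in one hypothetical helper, `load_and_check`. It assumes the client object provides `load_model` and `is_model_ready`, as the Triton HTTP client does, and it returns the error text rather than raising.

```python
def load_and_check(client, model_name):
    """Ask Triton to load `model_name` and report whether it is ready.

    Returns a `(ready, error_message)` tuple. `client` is assumed to
    behave like the InferenceServerClient created earlier.
    """
    try:
        client.load_model(model_name=model_name)
    except Exception as exc:  # e.g. an InferenceServerException
        return False, str(exc)
    return client.is_model_ready(model_name), ""
```

With the client from this notebook, `load_and_check(triton_client, "t4r_pytorch")` would return `False` together with the server's message when the load fails.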

## 2. Send Requests for Predictions

- Load raw data for inference: We select 20 interactions (rows 20 through 39) and filter out sessions with fewer than 2 interactions

In [None]:
import cudf
batch = cudf.read_parquet('Oct-2019.parquet').iloc[20:40,:]
batch.head()

In [None]:
# Count interactions per session once, then keep sessions seen more than once
session_counts = batch.user_session.value_counts()
sessions_to_use = session_counts[session_counts > 1].index.values
filtered_batch = batch[batch.user_session.isin(sessions_to_use)]
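
The filtering rule in the cell above can be illustrated on toy data using only the standard library. This is a sketch of the same idea rather than the cuDF code itself: the helper name and the sample rows are made up for illustration.

```python
from collections import Counter


def sessions_with_min_interactions(session_ids, min_count=2):
    """Return the set of session IDs that occur at least `min_count` times."""
    counts = Counter(session_ids)
    return {sid for sid, n in counts.items() if n >= min_count}


# Toy data (hypothetical): one (session_id, event_type) row per interaction.
rows = [("s1", "view"), ("s2", "view"), ("s1", "cart"), ("s3", "view"), ("s1", "view")]
keep = sessions_with_min_interactions(sid for sid, _ in rows)
filtered_rows = [row for row in rows if row[0] in keep]
```

Here only session `s1` has two or more interactions, so `filtered_rows` keeps its three rows and drops the single-interaction sessions.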

In [None]:
import nvtabular.inference.triton as nvt_triton
import tritonclient.grpc as grpcclient

# Convert every column of the filtered batch into a Triton input tensor
inputs = nvt_triton.convert_df_to_triton_input(filtered_batch.columns, filtered_batch, grpcclient.InferInput)

output_names = ["output"]

# Request each named output explicitly
outputs = []
for col in output_names:
    outputs.append(grpcclient.InferRequestedOutput(col))

MODEL_NAME_NVT = "t4r_pytorch"

with grpcclient.InferenceServerClient("10.110.20.127:8001") as client:
    # Pass the requested outputs along with the inputs
    response = client.infer(MODEL_NAME_NVT, inputs, outputs=outputs)
    for col in output_names:
        print(col, ':\n', response.as_numpy(col))

In [None]:
response.as_numpy('output').shape