<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./images/DLI_Header.png"></a></div>

# Triton for Recommender Systems

The [Triton Inference Server](https://github.com/triton-inference-server/server/blob/main/README.md#documentation) allows us to deploy our model to the web regardless of cloud provider, and it supports a number of different machine learning frameworks such as TensorFlow and PyTorch.

## Objectives
* Learn how to deploy a model to Triton
  * [1. Deploy TensorFlow Model to Triton Inference Server](#1.-Deploy-TensorFlow-Model-to-Triton-Inference-Server)
      * [1.1 Export a Model](#1.1-Export-a-Model)
      * [1.2 Review exported files](#1.2-Review-exported-files)
      * [1.3 Loading a Model](#1.3-Loading-a-Model)
  * [2. Send requests for predictions](#2.-Send-requests-for-predictions)
* Learn how to record deployment metrics
  * [3. Server Metrics](#3.-Server-Metrics)

## 1. Deploy TensorFlow Model to Triton Inference Server

Our Triton server has already been launched and is ready to receive requests. First, we need to export the saved TensorFlow model from Lab 2 and generate the config file for Triton Inference Server. NVTabular provides an easy-to-use function that handles both tasks.

### 1.1 Export a Model

In [1]:
# External dependencies
import os
from time import time

import argparse
import numpy as np
import pandas as pd
import sys

import tritonhttpclient

import cudf
import tritonclient.grpc as grpcclient
import nvtabular.inference.triton as nvt_triton

Let's unzip the model that we saved as a zip file in the previous notebook, then load it so that we can pass it to the NVTabular `export_tensorflow_model()` function below.

In [2]:
!unzip data/task_2_model.zip

Archive:  data/task_2_model.zip
   creating: task2_model/
  inflating: task2_model/saved_model.pb  
   creating: task2_model/variables/
  inflating: task2_model/variables/variables.data-00000-of-00001  
  inflating: task2_model/variables/variables.index  
   creating: task2_model/assets/


Next, we will load the TensorFlow model.

In [3]:
import tensorflow as tf

model = tf.keras.models.load_model('task2_model')

Since we will need the output name of the last layer to make predictions later, let's print it out using `model.output_names`.

In [4]:
model.output_names

['tf.__operators__.add']

We can export the model to `model_repository`. This folder is shared between the Docker container for the Jupyter notebook and the Docker container that runs Triton Inference Server, so Triton will have access to the model files.

In [5]:
import nvtabular

# generate the TF saved model
from nvtabular.inference.triton.ensemble import export_tensorflow_model

tf_config = export_tensorflow_model(model, "wnd_tf", "model_repository/wnd_tf", version=1)

INFO:tensorflow:Assets written to: model_repository/wnd_tf/1/model.savedmodel/assets


To free GPU memory, we will restart the notebook kernel.

In [6]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### 1.2 Review exported files

Let's look at the files `export_tensorflow_model` created. Triton expects [a specific directory structure](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md) for our models. The folder `/model_repository` is shared with our server, and it expects the following format:

```
<model_repository_path>/
  <model-name>/
    [config.pbtxt]
    <version-name>/
      [model.savedmodel]/
        <tensorflow_saved_model_files>/
          ...
```
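
To make the layout rule concrete, here is a minimal sketch that builds a throwaway folder in that shape and checks it. The helper `find_servable_models` is hypothetical (it is not part of Triton or NVTabular), and it only inspects folder names, not the model files themselves.

```python
import tempfile
from pathlib import Path


def find_servable_models(repository: Path) -> dict:
    """Map each model folder to its numeric version folders.

    Hypothetical helper: it only checks the layout described above and
    does not validate the model files themselves.
    """
    models = {}
    for model_dir in sorted(p for p in repository.iterdir() if p.is_dir()):
        versions = sorted(
            int(v.name) for v in model_dir.iterdir() if v.is_dir() and v.name.isdigit()
        )
        has_saved_model = any(
            (model_dir / str(v) / "model.savedmodel").is_dir() for v in versions
        )
        if versions and has_saved_model:
            models[model_dir.name] = versions
    return models


# Build a throwaway repository that mimics the expected structure.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    (repo / "wnd_tf" / "1" / "model.savedmodel").mkdir(parents=True)
    (repo / "wnd_tf" / "config.pbtxt").write_text('name: "wnd_tf"\n')
    (repo / "not_a_model").mkdir()  # no version folder, so it is skipped
    layout = find_servable_models(repo)

print(layout)
```

Pointing the same helper at the real `model_repository` folder would be one way to sanity-check an export before asking Triton to load it.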

In [7]:
!tree model_repository

model_repository
└── wnd_tf
    ├── 1
    │   └── model.savedmodel
    │       ├── assets
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt

5 directories, 4 files


Let's look at the generated config file. Triton needs a [config file](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) to understand how to interpret the model: it defines the input columns with their data types and dimensions, as well as the output layer. Writing this file by hand can be tedious and error-prone, so `export_tensorflow_model` automatically creates both the config file and the required folder structure for us.

The config file needs the following information:
* name: The name of our model. Must be the same name as the parent folder.
* platform: The type of framework serving the model.
* input: The input our model expects.
  * `name`: Should correspond with the model input name.
  * `data_type`: Should correspond to the input's data type.
  * `dims`: The dimensions of the *request* for the input, as in the dimensions of the data the user passes to us.
  * `reshape`: (Optional) How to reshape the data from the client before passing it to our model. For example, the minimum dims from the client is `[1]`, but like Keras, Triton adds a dimension for batching. If our model expects only `[batch_size]` as a dimension, we can reshape our data to `[]` (empty brackets) to account for that. The config generated for this lab does not contain a `reshape` entry.
* output: The output parameters of our model.
  * `name`: Should correspond with the model output name. In this case, we're using the name automatically assigned by TensorFlow.
  * `data_type`: Should correspond to the output's data type.
  * `dims`: The dimensions of the output.
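
To see how these fields fit together, here is a minimal sketch that renders such a file by hand for two of our inputs. The helper `render_config` is hypothetical and written only for illustration; it is not how NVTabular builds the file, and it covers only the fields listed above.

```python
def render_config(name, platform, inputs, outputs):
    """Render a minimal Triton model config as protobuf text.

    Hypothetical helper for illustration; it is not how NVTabular builds
    the file. `inputs` and `outputs` are lists of (name, data_type, dims).
    """
    lines = [f'name: "{name}"', f'platform: "{platform}"']
    for section, tensors in (("input", inputs), ("output", outputs)):
        for tensor_name, data_type, dims in tensors:
            lines.append(f"{section} {{")
            lines.append(f'  name: "{tensor_name}"')
            lines.append(f"  data_type: {data_type}")
            lines.extend(f"  dims: {d}" for d in dims)
            lines.append("}")
    return "\n".join(lines) + "\n"


config_text = render_config(
    name="wnd_tf",
    platform="tensorflow_savedmodel",
    inputs=[
        ("user_index", "TYPE_INT64", [-1, 1]),
        ("price_filled", "TYPE_FP32", [-1, 1]),
    ],
    outputs=[("tf.__operators__.add", "TYPE_FP32", [-1, 1])],
)
print(config_text)
```

Even for two inputs the file is repetitive, which is why generating it from the model is less error-prone than typing it.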

In [1]:
!cat model_repository/wnd_tf/config.pbtxt

name: "wnd_tf"
platform: "tensorflow_savedmodel"
input {
  name: "brand_index"
  data_type: TYPE_INT64
  dims: -1
  dims: 1
}
input {
  name: "category_0_2_index"
  data_type: TYPE_INT32
  dims: -1
  dims: 1
}
input {
  name: "category_1_2_index"
  data_type: TYPE_INT32
  dims: -1
  dims: 1
}
input {
  name: "item_index"
  data_type: TYPE_INT64
  dims: -1
  dims: 1
}
input {
  name: "price_filled"
  data_type: TYPE_FP32
  dims: -1
  dims: 1
}
input {
  name: "salesRank_Electronics"
  data_type: TYPE_FP32
  dims: -1
  dims: 1
}
input {
  name: "user_index"
  data_type: TYPE_INT64
  dims: -1
  dims: 1
}
output {
  name: "tf.__operators__.add"
  data_type: TYPE_FP32
  dims: -1
  dims: 1
}
backend: "tensorflow"


### 1.3 Loading a Model

Now we can communicate with the Triton Inference Server and send the request to load the model. First, let's verify that the server is ready by using [curl](https://curl.haxx.se/) to make a `GET` request.

In [2]:
!curl -i triton:8000/v2/health/ready

HTTP/1.1 200 OK
Content-Length: 0
Content-Type: text/plain



Next, let's build a client to connect to our server. This [InferenceServerClient](https://github.com/triton-inference-server/client) object is what we'll be using to talk to Triton.

In [3]:
import tritonhttpclient

import cudf
import tritonclient.grpc as grpcclient
import nvtabular.inference.triton as nvt_triton

import numpy as np
import pandas as pd

In [4]:
try:
    triton_client = tritonhttpclient.InferenceServerClient(url="triton:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

client created.


We can verify that our server is ready to go by using [is_server_live](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L259). [get_model_repository_index](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L555) will also show what folders are in Triton's model repository.

In [5]:
triton_client.is_server_live()

GET v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True
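
Both health checks so far (the `curl` call to `v2/health/ready` and `is_server_live()`, which requests `v2/health/live`) boil down to a plain HTTP `GET` that answers with status 200. As a rough sketch, the readiness check could also be written with only the standard library. The function name `server_is_ready` and the two-second timeout are our own choices, not part of any Triton client.

```python
import urllib.error
import urllib.request


def server_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True when GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts, and HTTP errors.
        return False
```

From this notebook's container, `server_is_ready("http://triton:8000")` would be the equivalent of the `curl` call above, since `triton:8000` is the host and port used throughout this lab.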

In [6]:
triton_client.get_model_repository_index()

POST v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '19'}>
bytearray(b'[{"name":"wnd_tf"}]')


[{'name': 'wnd_tf'}]

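Now that everything is configured, let's get the model loaded! The export step already placed our model in the shared repository under version `1`. Unless the config specifies a different version policy, Triton serves only the latest (highest-numbered) version, and since `1` is the only version folder, that is the one it will load.

Because the repository folder is shared with the Triton container, there is nothing left to copy into the server.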

We've set Triton's [Model Control Mode](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-explicit) to `EXPLICIT`, meaning it won't automatically pick up a model placed in its directory. This is done on line 13 of our `docker-compose.yml` file in the [previous lab](3-02_docker.ipynb). We could have used [POLL](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md) mode instead, but polling does not pick up changes immediately.

To load our model, we'll use [load_model](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L601). When we want to remove it from the Triton server, we can use [unload_model](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L634).

In [7]:
model_name = "wnd_tf"
triton_client.load_model(model_name=model_name)

POST v2/repository/models/wnd_tf/load, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'wnd_tf'
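
The verbose log shows that `load_model` is a `POST` to `v2/repository/models/wnd_tf/load`. As a sketch of what that looks like without the client library, the snippet below builds (but does not send) the matching request objects. The helper `repository_request` is hypothetical, the empty request body is an assumption, and the `unload` path is the counterpart endpoint from Triton's model repository API.

```python
import urllib.request


def repository_request(base_url: str, model_name: str, action: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for loading or unloading a model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url}/v2/repository/models/{model_name}/{action}"
    # Assumption: an empty body is enough for a plain load or unload.
    return urllib.request.Request(url, data=b"", method="POST")


load_req = repository_request("http://triton:8000", "wnd_tf", "load")
unload_req = repository_request("http://triton:8000", "wnd_tf", "unload")
print(load_req.get_method(), load_req.full_url)
print(unload_req.get_method(), unload_req.full_url)
```

Passing either object to `urllib.request.urlopen` would actually send it; in this lab we stick with the client's `load_model` and `unload_model`.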


Now that the model is loaded, we can use [get_model_metadata](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L429) to see our model's inputs and outputs.

In [8]:
triton_client.get_model_metadata(model_name=model_name)

GET v2/models/wnd_tf, headers None
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '577'}>
bytearray(b'{"name":"wnd_tf","versions":["1"],"platform":"tensorflow_savedmodel","inputs":[{"name":"brand_index","datatype":"INT64","shape":[-1,1]},{"name":"category_0_2_index","datatype":"INT32","shape":[-1,1]},{"name":"category_1_2_index","datatype":"INT32","shape":[-1,1]},{"name":"item_index","datatype":"INT64","shape":[-1,1]},{"name":"price_filled","datatype":"FP32","shape":[-1,1]},{"name":"salesRank_Electronics","datatype":"FP32","shape":[-1,1]},{"name":"user_index","datatype":"INT64","shape":[-1,1]}],"outputs":[{"name":"tf.__operators__.add","datatype":"FP32","shape":[-1,1]}]}')


{'name': 'wnd_tf',
 'versions': ['1'],
 'platform': 'tensorflow_savedmodel',
 'inputs': [{'name': 'brand_index', 'datatype': 'INT64', 'shape': [-1, 1]},
  {'name': 'category_0_2_index', 'datatype': 'INT32', 'shape': [-1, 1]},
  {'name': 'category_1_2_index', 'datatype': 'INT32', 'shape': [-1, 1]},
  {'name': 'item_index', 'datatype': 'INT64', 'shape': [-1, 1]},
  {'name': 'price_filled', 'datatype': 'FP32', 'shape': [-1, 1]},
  {'name': 'salesRank_Electronics', 'datatype': 'FP32', 'shape': [-1, 1]},
  {'name': 'user_index', 'datatype': 'INT64', 'shape': [-1, 1]}],
 'outputs': [{'name': 'tf.__operators__.add',
   'datatype': 'FP32',
   'shape': [-1, 1]}]}
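
Since the metadata already lists every tensor name and datatype, we don't have to retype them when building requests. Here is a small sketch, using a trimmed copy of the dictionary above (two of the seven inputs); the helper `tensor_specs` is our own, hypothetical name.

```python
# Trimmed copy of the metadata dictionary shown above (two of the seven inputs).
metadata = {
    "name": "wnd_tf",
    "inputs": [
        {"name": "brand_index", "datatype": "INT64", "shape": [-1, 1]},
        {"name": "price_filled", "datatype": "FP32", "shape": [-1, 1]},
    ],
    "outputs": [{"name": "tf.__operators__.add", "datatype": "FP32", "shape": [-1, 1]}],
}


def tensor_specs(metadata: dict, key: str) -> list:
    """Return (name, datatype) pairs for the "inputs" or "outputs" of a model."""
    return [(tensor["name"], tensor["datatype"]) for tensor in metadata[key]]


input_specs = tensor_specs(metadata, "inputs")
output_specs = tensor_specs(metadata, "outputs")
print(input_specs)
print(output_specs)
```

With the real response, `tensor_specs(triton_client.get_model_metadata(model_name=model_name), "inputs")` should yield one pair per model input.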

Ok, time to shine! Let's make a request to our server!

## 2. Send requests for predictions

We can use [InferInput](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L1449) to describe the tensors we'll be sending to the server. It needs the name of the input, the shape of the tensor we'll be passing to the server, and its datatype.

Then, we can use [set_data_from_numpy](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L1513) to pass it a NumPy array.

We'll use some fake data for now. The first row of our batch will have all `1`s and the second will have all `2`s.

In [9]:
inputs = []
outputs = []
batch_size = 2
inputs.append(tritonhttpclient.InferInput("user_index", [batch_size, 1], "INT64"))
inputs.append(tritonhttpclient.InferInput("item_index", [batch_size, 1], "INT64"))
inputs.append(tritonhttpclient.InferInput("brand_index", [batch_size, 1], "INT64"))
inputs.append(tritonhttpclient.InferInput("price_filled", [batch_size, 1], "FP32"))
inputs.append(tritonhttpclient.InferInput("salesRank_Electronics", [batch_size, 1], "FP32"))
inputs.append(tritonhttpclient.InferInput("category_0_2_index", [batch_size, 1], "INT32"))
inputs.append(tritonhttpclient.InferInput("category_1_2_index", [batch_size, 1], "INT32"))

inputs[0].set_data_from_numpy(np.array([[1], [2]], dtype=np.int64))
inputs[1].set_data_from_numpy(np.array([[1], [2]], dtype=np.int64))
inputs[2].set_data_from_numpy(np.array([[1], [2]], dtype=np.int64))
inputs[3].set_data_from_numpy(np.array([[1.0], [2.0]], dtype=np.float32))
inputs[4].set_data_from_numpy(np.array([[1.0], [2.0]], dtype=np.float32))
inputs[5].set_data_from_numpy(np.array([[1], [2]], dtype=np.int32))
inputs[6].set_data_from_numpy(np.array([[1], [2]], dtype=np.int32))

outputs.append(
    tritonhttpclient.InferRequestedOutput("tf.__operators__.add", binary_data=False)
)
results = triton_client.infer(model_name, inputs, outputs=outputs).get_response()

POST v2/models/wnd_tf/infer, headers {'Inference-Header-Content-Length': 759}
b'{"inputs":[{"name":"user_index","shape":[2,1],"datatype":"INT64","parameters":{"binary_data_size":16}},{"name":"item_index","shape":[2,1],"datatype":"INT64","parameters":{"binary_data_size":16}},{"name":"brand_index","shape":[2,1],"datatype":"INT64","parameters":{"binary_data_size":16}},{"name":"price_filled","shape":[2,1],"datatype":"FP32","parameters":{"binary_data_size":8}},{"name":"salesRank_Electronics","shape":[2,1],"datatype":"FP32","parameters":{"binary_data_size":8}},{"name":"category_0_2_index","shape":[2,1],"datatype":"INT32","parameters":{"binary_data_size":8}},{"name":"category_1_2_index","shape":[2,1],"datatype":"INT32","parameters":{"binary_data_size":8}}],"outputs":[{"name":"tf.__operators__.add","parameters":{"binary_data":false}}]}\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\
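
The `binary_data_size` values in the request header above follow directly from each tensor's shape and datatype: the number of elements times the bytes per element. A quick sketch of that arithmetic (the function name is ours, and the lookup table covers only the three datatypes this model uses):

```python
from math import prod

# Bytes per element for the three datatypes used by this model.
BYTES_PER_ELEMENT = {"INT32": 4, "INT64": 8, "FP32": 4}


def binary_data_size(shape, datatype):
    """Size in bytes of a dense tensor with the given shape and datatype."""
    return prod(shape) * BYTES_PER_ELEMENT[datatype]


for name, datatype in [
    ("user_index", "INT64"),
    ("price_filled", "FP32"),
    ("category_0_2_index", "INT32"),
]:
    print(name, binary_data_size([2, 1], datatype))
```

For our batch of two this gives 16 bytes for the `INT64` inputs and 8 bytes for the `FP32` and `INT32` inputs, matching the log.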

We'll get a bunch of data returned from our response, but the important one is the `"data"` at the very end. That's our prediction from our model!

In [10]:
results["outputs"][0]["data"]

[4.675514221191406, 4.7014875411987305]
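
Indexing with `[0]` works because we requested a single output. As a slightly more defensive sketch, we could look the output up by name instead. The helper `output_data` is hypothetical, and `example_response` is a hand-written stand-in: its two values are the predictions printed above, while the remaining fields are assumed for illustration.

```python
def output_data(response: dict, output_name: str) -> list:
    """Return the flat "data" list of the named output from an inference response."""
    for output in response["outputs"]:
        if output["name"] == output_name:
            return output["data"]
    raise KeyError(f"no output named {output_name!r}")


# Hand-written stand-in for the response dictionary; the two values are the
# predictions printed above, the other fields are assumed for illustration.
example_response = {
    "model_name": "wnd_tf",
    "outputs": [
        {
            "name": "tf.__operators__.add",
            "datatype": "FP32",
            "shape": [2, 1],
            "data": [4.675514221191406, 4.7014875411987305],
        }
    ],
}

predictions = output_data(example_response, "tf.__operators__.add")
print(predictions)
```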

This seems like a lot of work for only two predictions. Can we give it something meatier? We have loaded in the data from our previous labs. Let's try running our validation data from Lab 2 through the server.

In [11]:
ratings = pd.read_csv("./data/task_2_wide_and_deep.csv")
ratings = ratings[ratings["valid"]]
ratings.head()

| Unnamed: 0 | reviewerID | asin | overall | unixReviewTime | brand | category_0_0 | category_0_1 | category_0_2 | category_0_3 | category_1_0 | ... | user_index | item_index | brand_index | als_prediction | user_embed_0 | user_embed_1 | item_embed_0 | item_embed_1 | category_0_2_index | category_1_2_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | A236926FEQNZE5 | B00006HVLW | 3.0 | 1390608000 | BELKIN | Electronics | Computers & Accessories | Cables & Accessories | Surge Protectors | | ... | 55593 | 1685 | 307 | 3.031549 | 1.267751 | -2.118827 | 0.668555 | -1.117375 | 28 | 101 |
| 33 | A9SRRW7T0HY70 | B00006HVLW | 4.0 | 1395619200 | BELKIN | Electronics | Computers & Accessories | Cables & Accessories | Surge Protectors | | ... | 155591 | 1685 | 307 | 4.308138 | 1.801603 | -3.011068 | 0.699708 | -1.169436 | 28 | 101 |
| 34 | A17R3BS1KYP1OO | B00006HVLW | 3.0 | 1388707200 | BELKIN | Electronics | Computers & Accessories | Cables & Accessories | Surge Protectors | | ... | 11029 | 1685 | 307 | 3.47802 | 1.454458 | -2.430878 | 0.574669 | -0.96046 | 28 | 101 |
| 64 | A3SOVP1ITVGD9A | B00007GQLU | 5.0 | 1384473600 | Canon | Electronics | Camera & Photo | Lenses | Camera Lenses | | ... | 142291 | 2065 | 573 | 4.575386 | 1.514265 | -2.53083 | 0.679521 | -1.135702 | 82 | 101 |
| 96 | A39ZPRN0LTNHFD | B000068P8W | 5.0 | 1363132800 | Monster | Electronics | Computers & Accessories | Cables & Accessories | Cables & Interconnects | | ... | 115847 | 1435 | 1970 | 4.266848 | 1.613634 | -2.696908 | 0.729389 | -1.21905 | 28 | 101 |


Let's try to be a little more efficient with our code this time. We'll use a `for` loop to construct our inputs.

In [12]:
columns = [
    ('user_index', "INT64"),
    ('item_index', "INT64"),
    ('brand_index', "INT64"),
    ('price_filled', "FP32"),
    ('salesRank_Electronics', "FP32"),
    ('category_0_2_index', "INT32"),
    ('category_1_2_index', "INT32")
]

dtypes = {
    "INT32": np.int32,
    "INT64": np.int64,
    "FP32": np.float32
}

inputs = []
batch_size = 64
for column in columns:
    name = column[0]
    dtype = dtypes[column[1]]
    data = np.expand_dims(np.array(ratings.head(batch_size)[name], dtype=dtype), axis=-1)
    inputs.append(tritonhttpclient.InferInput(name, [batch_size, 1], column[1]))
    inputs[-1].set_data_from_numpy(data)

results = triton_client.infer(model_name, inputs, outputs=outputs).get_response()

print("\nprediction results:\n", results["outputs"][0]["data"])

POST v2/models/wnd_tf/infer, headers {'Inference-Header-Content-Length': 777}
b'{"inputs":[{"name":"user_index","shape":[64,1],"datatype":"INT64","parameters":{"binary_data_size":512}},{"name":"item_index","shape":[64,1],"datatype":"INT64","parameters":{"binary_data_size":512}},{"name":"brand_index","shape":[64,1],"datatype":"INT64","parameters":{"binary_data_size":512}},{"name":"price_filled","shape":[64,1],"datatype":"FP32","parameters":{"binary_data_size":256}},{"name":"salesRank_Electronics","shape":[64,1],"datatype":"FP32","parameters":{"binary_data_size":256}},{"name":"category_0_2_index","shape":[64,1],"datatype":"INT32","parameters":{"binary_data_size":256}},{"name":"category_1_2_index","shape":[64,1],"datatype":"INT32","parameters":{"binary_data_size":256}}],"outputs":[{"name":"tf.__operators__.add","parameters":{"binary_data":false}}]})\xd9\x00\x00\x00\x00\x00\x00\xc7_\x02\x00\x00\x00\x00\x00\x15+\x00\x00\x00\x00\x00\x00\xd3+\x02\x00\x00\x00\x00\x00\x87\xc4\x01\x00\x00\x00\x0
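
The cell above only sends the first 64 validation rows. If we wanted to score every row, one option is to walk the DataFrame in fixed-size slices and send one request per slice. The sketch below only computes the slice boundaries; the helper `batch_bounds` is hypothetical, and the row count of 150 is made up for the example.

```python
def batch_bounds(n_rows: int, batch_size: int):
    """Yield (start, stop) index pairs that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


bounds = list(batch_bounds(n_rows=150, batch_size=64))
print(bounds)  # the last batch is shorter than batch_size
```

Each pair could be used as `ratings.iloc[start:stop]`. Note that the shape passed to `InferInput` would then have to be `[stop - start, 1]` rather than a fixed `[batch_size, 1]`, because the final slice is usually smaller.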

## 3. Server Metrics

Not only can we serve our model at scale, but we can also gather metrics on it. This is crucial: finding the right metric to optimize for with recommender systems is not a trivial task. Check out this [great paper](https://www.kdd.org/exploration_files/19-1-Article3.pdf) explaining common pitfalls.

The short version is this:
* Recommender systems create a feedback loop between users and recommendations. Popular items train our models that these are good recommendations, thus serving them to more users and perpetuating the loop.
* Try to avoid metrics that are biased by human behavior. For instance, click-through rate is commonly used in the advertising space, but if we're not careful, optimizing for it teaches the model which position on a web page is popular rather than which content is.

At the end of the day, the goal is to increase user engagement. Triton automatically serves usage metrics using [Prometheus](https://prometheus.io/). Copy and paste the URL (web address) for this notebook and assign it to `my_url` below. Run the cell to see the metrics for our model. [Here](https://github.com/triton-inference-server/server/blob/main/docs/metrics.md) is a list of available metrics, but a good one to start with is `nv_inference_count`, which displays how many predictions have been made.

In [13]:
import IPython
my_url = "http://dli-337469e8f67f-e74638.aws.labs.courses.nvidia.com/lab/lab/tree/3-03_triton.ipynb"
prometheus_url = my_url.rsplit(".com", 1)[0] + ".com:9090/graph"
IPython.display.IFrame(prometheus_url, width=700, height=500)
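
If we wanted to read a counter programmatically rather than through the Prometheus UI, we could parse the plain-text format that Prometheus scrapes. The sketch below assumes that text has already been fetched (by default Triton serves it on port 8002 under `/metrics`, which this lab's setup may or may not expose). `SAMPLE_METRICS` is illustrative: the numbers are made up and a real server may attach different labels. The regular expression is deliberately simple and ignores optional timestamps.

```python
import re

# Illustrative sample in the Prometheus text format; the numbers are made up
# and the label set on a real server may differ.
SAMPLE_METRICS = """\
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="wnd_tf",version="1"} 66
nv_inference_request_success{model="wnd_tf",version="1"} 2
"""

METRIC_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def metric_values(text: str, metric_name: str) -> dict:
    """Map the raw label string of each matching sample to its float value."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if match and match.group("name") == metric_name:
            values[match.group("labels") or ""] = float(match.group("value"))
    return values


counts = metric_values(SAMPLE_METRICS, "nv_inference_count")
print(counts)
```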

## Wrap Up

We can take this a little further and hook these results into a service like [Grafana](https://grafana.com/) as explained in [this excellent blog post](https://blog.einstein.ai/benchmarking-tensorrt-inference-server/) by the Salesforce team, but for now, we have all the pieces to build an end-to-end recommender system.

Feeling ready? Head on over to [the next lab](3-04_assessment.ipynb) to put these new skills into action!

In [14]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}
