<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./images/DLI_Header.png"></a></div>

# Triton for Recommender Systems

The [Triton Inference Server](https://github.com/triton-inference-server/server/blob/main/README.md#documentation) allows us to deploy our model to the web regardless of cloud provider, and it supports a number of different machine learning frameworks such as TensorFlow and PyTorch.

## Objectives
* Learn how to deploy a model to Triton
  * [1. Deploy TensorFlow Model to Triton Inference Server](#1.-Deploy-TensorFlow-Model-to-Triton-Inference-Server)
      * [1.1 Export a Model](#1.1-Export-a-Model)
      * [1.2 Review exported files](#1.2-Review-exported-files)
      * [1.3 Loading a Model](#1.3-Loading-a-Model)
  * [2. Sent requests for predictions](#2.-Sent-requests-for-predictions)
* Learn how to record deployment metrics
  * [3. Server Metrics](#3.-Server-Metrics)

## 1. Deploy TensorFlow Model to Triton Inference Server

Our Triton server has already been launched and is ready to receive requests. First, we need to export the saved TensorFlow model from Lab 2 and generate the config file for Triton Inference Server. NVTabular provides an easy-to-use function that handles both tasks.

### 1.1 Export a Model

In [1]:
!pip install tritonclient

Collecting tritonclient
  Downloading tritonclient-2.20.0-py3-none-manylinux1_x86_64.whl (7.9 MB)
Collecting python-rapidjson>=0.9.1
  Downloading python_rapidjson-1.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Installing collected packages: python-rapidjson, tritonclient
Successfully installed python-rapidjson-1.6 tritonclient-2.20.0


In [2]:
!pip install tritonclient[http]

Collecting geventhttpclient>=1.4.4
  Downloading geventhttpclient-1.5.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (77 kB)
Collecting gevent>=0.13
  Downloading gevent-21.12.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (5.8 MB)
Collecting brotli
  Downloading Brotli-1.0.9-cp37-cp37m-manylinux1_x86_64.whl (357 kB)
Collecting zope.interface
  Downloading zope.interface-5.4.0-cp37-cp37m-manylinux2010_x86_64.whl (251 kB)

In [3]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 300, done.
remote: Counting objects: 100% (129/129), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 300 (delta 72), reused 43 (delta 21), pack-reused 171
Receiving objects: 100% (300/300), 86.52 KiB | 17.30 MiB/s, done.
Resolving deltas: 100% (134/134), done.
PLEASE READ FOR 21.06
********************************************************************************************************
Another release, another script change.  We had to revise the script, which now:
1. Does a more comprehensive install
2. Includes BlazingSQL
3. is far easier for everyone to understand and maintain

The script will require you to add these 5 cells to your notebook.  We have also created a new startup template: 
https://colab.research.google.com/drive/1TAAi_szMfWqRfHVfjGSqnGVLr_ztzUM9?usp=sharing

CHANGES T
CELL 1:
    # This get the RAPIDS-Colab install files and test check your GPU.  Run cells 1 and 2 onl

In [4]:
!nvidia-smi

Fri Apr 22 08:32:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
# This get the RAPIDS-Colab install files and test check your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:7 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:13 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:14 h

In [1]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:44
🔁 Restarting kernel...


In [1]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

‚ú®üç∞‚ú® Everything looks OK!


In [2]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Found existing installation: cffi 1.14.5
Uninstalling cffi-1.14.5:
  Successfully uninstalled cffi-1.14.5
Found existing installation: cryptography 3.4.5
Uninstalling cryptography-3.4.5:
  Successfully uninstalled cryptography-3.4.5
Collecting cffi==1.15.0
  Downloading cffi-1.15.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (427 kB)
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 21.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
failed with initial frozen solve. Retrying with flexible solve.
failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - cudatoolkit=11.2
    - dask-sql
    - gcsfs
    - llvmlite
    - openssl
    - python=3.7
    - 

In [3]:
# External dependencies
import os
from time import time

import argparse
import numpy as np
import pandas as pd
import sys

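# tritonclient.http replaces the deprecated top-level tritonhttpclient module;
# the alias keeps the name used in the rest of the notebook.
import tritonclient.http as tritonhttpclient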

import cudf
import tritonclient.grpc as grpcclient




ModuleNotFoundError: ignored

In [4]:
!pip install nvtabular

Collecting nvtabular
  Downloading nvtabular-1.0.0.tar.gz (3.2 MB)

In [6]:
#!pip uninstall protobuf

Found existing installation: protobuf 3.16.0
Uninstalling protobuf-3.16.0:
  Would remove:
    /usr/local/lib/python3.7/site-packages/protobuf-3.16.0-py3.7.egg-info
Proceed (y/n)? y
  Successfully uninstalled protobuf-3.16.0


In [7]:
#!pip install --no-binary=protobuf protobuf

Collecting protobuf
  Downloading protobuf-3.20.0.tar.gz (216 kB)

In [10]:
#!export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'

In [23]:
import nvtabular

In [17]:
import nvtabular.inference as nvi

In [20]:
import nvtabular.inference.triton 

ModuleNotFoundError: ignored

Let's unzip the model that we saved as a zip file in the previous notebook, and then load it to be able to use it in the NVTabular `export_tensorflow_model()` function below. 

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
!unzip /content/drive/MyDrive/Recommender_Intelligent_Systems/RecommenderSystems/data/task_2_model.zip

Next, we will load the TensorFlow model.

In [None]:
import tensorflow as tf

model = tf.keras.models.load_model('task2_model')

Since we will need the name of the output layer to request predictions later, let's print it using `model.output_names`.

In [None]:
model.output_names

We can export the model to `model_repository`. This folder is shared between the Docker container that runs the Jupyter notebook and the Docker container that runs Triton Inference Server, so Triton has access to the model files.

In [None]:
# generate the TF saved model
from nvtabular.inference.triton.ensemble import export_tensorflow_model

tf_config = export_tensorflow_model(model, "wnd_tf", "model_repository/wnd_tf", version=1)

To free GPU memory, we will restart the notebook kernel.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### 1.2 Review exported files

Let's look at the files `export_tensorflow_model` created. Triton expects [a specific directory structure](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md) for our models. The folder `/model_repository` is shared with our server, and it expects the following format:

```
<model_repository_path>/
  <model-name>/
    [config.pbtxt]
    <version-name>/
      [model.savedmodel]/
        <tensorflow_saved_model_files>/
          ...
```
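
Before asking Triton to load anything, the layout above can be checked from Python. The following is a minimal, standard-library sketch; the helper name `check_model_dir` is hypothetical, and it assumes the TensorFlow SavedModel sits in a `model.savedmodel` directory inside a numeric version folder, as in the tree above.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems found in one Triton model directory.

    Hypothetical helper: it only checks the layout sketched above
    (config.pbtxt plus <version>/model.savedmodel/), not the file contents.
    """
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return ["model directory does not exist"]

    problems = []
    # This lab always generates a config file, so its absence is reported.
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version folders are the sub-directories with purely numeric names.
    versions = sorted(
        p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )
    if not versions:
        problems.append("no numeric version folder")

    for version in versions:
        if not (version / "model.savedmodel").is_dir():
            problems.append(f"version {version.name}: missing model.savedmodel/")

    return problems
```

Calling `check_model_dir("model_repository/wnd_tf")` should return an empty list when the export produced the expected structure.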

In [None]:
!tree model_repository

Let's look at the generated config file. Triton needs a [config file](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) to understand how to interpret the model: it defines the input columns with their data types and dimensions, and the output layer. Creating this file by hand can be complicated, so `export_tensorflow_model` automatically creates both the config file and the required folder structure for us.

The config file needs the following information:
* name: The name of our model. Must be the same name as the parent folder.
* platform: The type of framework serving the model.
* input: The input our model expects.
  * `name`: Should correspond with the model input name.
  * `data_type`: Should correspond to the input's data type.
  * `dims`: The dimensions of the *request* for the input, as in the dimensions of the data the user passes to us.
  * `reshape`: How to reshape the data from the client before passing it to our model. In this case, the minimum dims from the client is `[1]`, but, like Keras, Triton adds a leading dimension for batching. If our model expects only `[batch_size]` as a dimension, we can reshape our data to `[]` (empty brackets) to account for that.
* output: The output parameters of our model.
  * `name`: Should correspond with the model output name. In this case, we're using the name automatically assigned by TensorFlow.
  * `data_type`: Should correspond to the output's data type.
  * `dims`: The dimensions of the output.
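
To make the field list concrete, here is a small sketch that renders a config in the same text format. It is illustrative only: `render_config` is a hypothetical helper, not what `export_tensorflow_model` runs, and the output data type and dims used in the example are assumptions. The file printed in the next cell is the authoritative version and may contain additional fields.

```python
def render_config(name, platform, inputs, outputs):
    """Render a minimal Triton-style config.pbtxt as text (illustrative only).

    `inputs` and `outputs` are lists of (tensor_name, data_type, dims) tuples.
    """

    def render_tensors(section, tensors):
        blocks = []
        for tensor_name, data_type, dims in tensors:
            blocks.append(
                "  {\n"
                f'    name: "{tensor_name}"\n'
                f"    data_type: {data_type}\n"
                f"    dims: {list(dims)}\n"
                "  }"
            )
        return f"{section} [\n" + ",\n".join(blocks) + "\n]"

    lines = [f'name: "{name}"', f'platform: "{platform}"']
    lines.append(render_tensors("input", inputs))
    lines.append(render_tensors("output", outputs))
    return "\n".join(lines) + "\n"


example = render_config(
    name="wnd_tf",  # must match the parent folder name
    platform="tensorflow_savedmodel",
    inputs=[("user_index", "TYPE_INT64", [1]), ("price_filled", "TYPE_FP32", [1])],
    # Output data type and dims are assumed here for illustration.
    outputs=[("tf.__operators__.add", "TYPE_FP32", [1])],
)
print(example)
```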

In [None]:
!cat model_repository/wnd_tf/config.pbtxt

### 1.3 Loading a Model

Now we can communicate with the Triton Inference Server and send the request to load the model. First, let's verify that the server is ready by using [curl](https://curl.haxx.se/) to make a `GET` request to its health endpoint.

In [None]:
!curl -i triton:8000/v2/health/ready
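
The same readiness check can be made from Python, which is convenient inside scripts. The sketch below uses only the standard library; `server_ready` is a hypothetical helper name, and the base URL passed to it (for example `http://triton:8000`) is assumed to match the host and port used in the `curl` call above.

```python
import urllib.error
import urllib.request


def server_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    url = f"{base_url}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts and error replies.
        return False
```

For this lab, `server_ready("http://triton:8000")` would be the equivalent of the `curl` command.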

Next, let's build a client to connect to our server. This [InferenceServerClient](https://github.com/triton-inference-server/client) object is what we'll be using to talk to Triton.

In [None]:
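# tritonclient.http replaces the deprecated top-level tritonhttpclient module;
# the alias keeps the name used in the rest of the notebook.
import tritonclient.http as tritonhttpclient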

import cudf
import tritonclient.grpc as grpcclient
import nvtabular.inference.triton as nvt_triton

import numpy as np
import pandas as pd

In [None]:
try:
    triton_client = tritonhttpclient.InferenceServerClient(url="triton:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

We can verify that our server is up by using [is_server_live](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L259). [get_model_repository_index](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L555) will also show which models are in Triton's model repository. (The links point to the gRPC client source; the HTTP client we created exposes methods with the same names.)

In [None]:
triton_client.is_server_live()

In [None]:
triton_client.get_model_repository_index()

Now that everything is configured, let's get the model loaded! The export step already wrote our model to version folder `1` inside the shared `model_repository`, so we do not need to create a version or copy any files by hand. By default, Triton serves the latest (highest-numbered) version of a model.

We've set Triton's [Model Control Mode](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-explicit) to `EXPLICIT`, meaning it will not automatically pick up a model placed in its directory. This is done on line 13 of our `docker-compose.yml` file in the [previous lab](3-02_docker.ipynb). We could have used [POLL](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md) mode instead, but polling does not detect changes immediately.

To load our model, we'll use [load_model](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L601). When we want to remove it from the Triton server, we can use [unload_model](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L634).

In [None]:
model_name = "wnd_tf"
triton_client.load_model(model_name=model_name)
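
In `EXPLICIT` mode it can be useful to confirm that a model is actually ready before sending requests. The sketch below wraps the load call with a readiness poll. `load_and_wait` is a hypothetical helper; it assumes only that the client object exposes `load_model()` and `is_model_ready()`, which the tritonclient HTTP and gRPC clients provide.

```python
import time


def load_and_wait(client, model_name, timeout_s=30.0, poll_s=0.5):
    """Ask Triton to load a model, then poll until it reports ready.

    Returns True once the model is ready, or False if the timeout expires.
    """
    client.load_model(model_name)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.is_model_ready(model_name):
            return True
        time.sleep(poll_s)
    return False
```

With the client from this notebook, the call would look like `load_and_wait(triton_client, model_name)`.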

Now that the model is loaded, we can use [get_model_metadata](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L429) to see our model's inputs and outputs.

In [None]:
triton_client.get_model_metadata(model_name=model_name)

Ok, time to shine! Let's make a request to our server!

## 2. Sent requests for predictions

We can use [InferInput](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L1449) to describe the tensors we'll be sending to the server. It needs the name of the input, the shape of the tensor we'll be passing to the server, and its datatype.

Then, we can use [set_data_from_numpy](https://github.com/triton-inference-server/client/blob/12d8a2a7318ccb4a367a09a42b80feba53f3944a/src/python/library/tritonclient/grpc/__init__.py#L1513) to pass it a NumPy array.

We'll use some fake data for now. The first row of our batch will have all `1`s and the second will have all `2`s.

In [None]:
inputs = []
outputs = []
batch_size = 2
inputs.append(tritonhttpclient.InferInput("user_index", [batch_size, 1], "INT64"))
inputs.append(tritonhttpclient.InferInput("item_index", [batch_size, 1], "INT64"))
inputs.append(tritonhttpclient.InferInput("brand_index", [batch_size, 1], "INT64"))
inputs.append(tritonhttpclient.InferInput("price_filled", [batch_size, 1], "FP32"))
inputs.append(tritonhttpclient.InferInput("salesRank_Electronics", [batch_size, 1], "FP32"))
inputs.append(tritonhttpclient.InferInput("category_0_2_index", [batch_size, 1], "INT32"))
inputs.append(tritonhttpclient.InferInput("category_1_2_index", [batch_size, 1], "INT32"))

inputs[0].set_data_from_numpy(np.array([[1], [2]], dtype=np.int64))
inputs[1].set_data_from_numpy(np.array([[1], [2]], dtype=np.int64))
inputs[2].set_data_from_numpy(np.array([[1], [2]], dtype=np.int64))
inputs[3].set_data_from_numpy(np.array([[1.0], [2.0]], dtype=np.float32))
inputs[4].set_data_from_numpy(np.array([[1.0], [2.0]], dtype=np.float32))
inputs[5].set_data_from_numpy(np.array([[1], [2]], dtype=np.int32))
inputs[6].set_data_from_numpy(np.array([[1], [2]], dtype=np.int32))

outputs.append(
    tritonhttpclient.InferRequestedOutput("tf.__operators__.add", binary_data=False)
)
results = triton_client.infer(model_name, inputs, outputs=outputs).get_response()

We'll get a bunch of data returned from our response, but the important one is the `"data"` at the very end. That's our prediction from our model!

In [None]:
results["outputs"][0]["data"]
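
With `binary_data=False`, each output arrives as a flat list next to its `shape`. The sketch below regroups that list into one entry per row of the batch. `predictions_per_row` is a hypothetical helper, and the sample response is hand-written with made-up values; only the `outputs`, `name`, `shape` and `data` keys are relied on.

```python
def predictions_per_row(response, output_name):
    """Split one output's flat `data` list into one list per batch row."""
    for out in response["outputs"]:
        if out["name"] == output_name:
            batch = out["shape"][0]
            width = len(out["data"]) // batch if batch else 0
            return [out["data"][i * width:(i + 1) * width] for i in range(batch)]
    raise KeyError(output_name)


# Hand-written response for illustration; the values are made up.
sample = {
    "model_name": "wnd_tf",
    "outputs": [
        {
            "name": "tf.__operators__.add",
            "datatype": "FP32",
            "shape": [2, 1],
            "data": [0.25, 0.75],
        }
    ],
}
print(predictions_per_row(sample, "tf.__operators__.add"))  # [[0.25], [0.75]]
```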

This seems like a lot of work for only two predictions. Can we give it something meatier? We have loaded in the data from our previous labs. Let's try running our validation data from Lab 2 through the server.

In [None]:
ratings = pd.read_csv("./data/task_2_wide_and_deep.csv")
ratings = ratings[ratings["valid"]]
ratings.head()

Let's try to be a little more efficient with our code this time. We'll use a `for` loop to construct our inputs.

In [None]:
columns = [
    ('user_index', "INT64"),
    ('item_index', "INT64"),
    ('brand_index', "INT64"),
    ('price_filled', "FP32"),
    ('salesRank_Electronics', "FP32"),
    ('category_0_2_index', "INT32"),
    ('category_1_2_index', "INT32")
]

dtypes = {
    "INT32": np.int32,
    "INT64": np.int64,
    "FP32": np.float32
}

inputs = []
batch_size = 64
for column in columns:
    name = column[0]
    dtype = dtypes[column[1]]
    data = np.expand_dims(np.array(ratings.head(batch_size)[name], dtype=dtype), axis=-1)
    inputs.append(tritonhttpclient.InferInput(name, [batch_size, 1], column[1]))
    inputs[-1].set_data_from_numpy(data)

results = triton_client.infer(model_name, inputs, outputs=outputs).get_response()

print("\nprediction results:\n", results["outputs"][0]["data"])
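
The cell above sends only the first 64 validation rows. To cover the whole validation set, the rows can be walked in consecutive batches. The sketch below shows just the index arithmetic; `batch_ranges` is a hypothetical helper and the row count in the example is arbitrary.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover n_rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_rows)


print(list(batch_ranges(150, 64)))  # [(0, 64), (64, 128), (128, 150)]
```

Each pair can be used to slice the data, for example `ratings.iloc[start:stop]`, with the `InferInput` shape set to `[stop - start, 1]` so that a shorter final batch is described correctly.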

## 3. Server Metrics

Not only can we scale the serving of our model, but we can also gather metrics on it. This is crucial because finding the right metric to optimize for with recommender systems is not a trivial task. Check out this [great paper](https://www.kdd.org/exploration_files/19-1-Article3.pdf) explaining common pitfalls.

The short version is this:
* Recommender systems create a feedback loop between users and recommendations. Popular items train our models that these are good recommendations, thus serving them to more users and perpetuating the loop.
* Try to avoid metrics that are biased by human behavior. For instance, click-through rate is commonly used in the advertising space, but if we are not careful, optimizing for it teaches the model which position on a web page is popular rather than which content is.

At the end of the day, the goal is to increase user engagement. Triton automatically serves usage metrics using [Prometheus](https://prometheus.io/). Copy and paste the URL (web address) for this notebook and set it to `my_url` below. Run the cell to see the metrics for our model. [Here](https://github.com/triton-inference-server/server/blob/main/docs/metrics.md) is a list of available metrics, but a good one to start with is `nv_inference_count` which displays how many predictions have been made.

In [None]:
import IPython
my_url = "COPY_NOTEBOOK_URL"
prometheus_url = my_url.rsplit(".com", 1)[0] + ".com:9090/graph"
IPython.display.IFrame(prometheus_url, width=700, height=500)
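
Triton also exposes the raw numbers as plain text in the Prometheus format (by default on port 8002 under `/metrics`; whether that port is reachable in this lab's setup is an assumption). The sketch below extracts one metric per model from such text. `metric_values` is a hypothetical helper and the sample text is hand-written for illustration.

```python
import re

# One sample per line: name{label="value",...} number
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def metric_values(text, metric_name):
    """Return {model label: summed value} for one metric in Prometheus text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        model = labels.get("model", "")
        values[model] = values.get(model, 0.0) + float(match.group("value"))
    return values


# Hand-written sample for illustration; the numbers are made up.
sample = """\
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="wnd_tf",version="1"} 66
nv_inference_request_success{model="wnd_tf",version="1"} 2
"""
print(metric_values(sample, "nv_inference_count"))  # {'wnd_tf': 66.0}
```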

## Wrap Up

We can take this a little further and hook these results into a service like [Grafana](https://grafana.com/), as explained in [this excellent blog post](https://blog.einstein.ai/benchmarking-tensorrt-inference-server/) by the Salesforce team. For now, though, we have all the pieces to build an end-to-end recommender system.

Feeling ready? Head on over to [the next lab](3-04_assessment.ipynb) to put these new skills into action!

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
