In [1]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# DLRM Triton Inference Demo

## Overview

Recommendation system (RecSys) inference involves determining an ordered list of items with which the query user is most likely to interact. For very large commercial databases with millions to hundreds of millions of items to choose from (such as advertisements or apps), an item retrieval step is usually carried out first to reduce the number of items to a more manageable quantity, e.g. a few hundred to a few thousand. Retrieval methods include computationally light algorithms such as approximate neighborhood search, random forests and filtering based on user preferences. A deep learning based RecSys is then invoked to re-rank the items, and those with the highest scores are presented to the user. This process is illustrated by Google's app recommendation system in Figure 1.

![DLRM_model](recsys_inference.PNG)

Figure 1: Google’s app recommendation process. [Source](https://arxiv.org/pdf/1606.07792.pdf).

As we can see, for each query user, the number of user-item pairs to score can be as large as a few thousand. This places a heavy load on the RecSys inference server, which must sustain high throughput to serve many users concurrently while keeping latency low enough to meet the stringent latency thresholds of online commerce engines.

The NVIDIA Triton Inference Server [9] provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. Triton automatically manages and makes use of all the available GPUs.

Next, we will see how to prepare the DLRM model for inference with the Triton inference server and how Triton handles this workload.

### Learning objectives

This notebook demonstrates the steps for preparing a pre-trained DLRM model for deployment and inference with the NVIDIA [Triton inference server](https://github.com/NVIDIA/triton-inference-server). 

## Content
1. [Requirements](#1)
1. [Prepare model for inference](#2)
1. [Start the Triton inference server](#3)
1. [Testing server with the performance client](#4)


<a id="1"></a>
## 1. Requirements


### 1.1 Docker container
The most convenient way to use the NVIDIA DLRM model is via a Docker container, which provides a self-contained, isolated and reproducible environment for all experiments.

First, clone the repository:

```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Recommendation/DLRM
```

To execute this notebook, first build the following inference container:

```
docker build -t dlrm-inference . -f triton/Dockerfile
```

Start an interactive Docker session with:

```
docker run -it --rm --gpus device=0 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v <PATH_TO_SAVED_MODEL>:/models -v <PATH_TO_EXPORT_MODEL>:/repository -v <PATH_TO_PREPROCESSED_DATA>:/data dlrm-inference bash
```
where:

- PATH_TO_SAVED_MODEL: directory containing the trained DLRM models with `.pt` extension.
 
- PATH_TO_EXPORT_MODEL: directory which will contain the converted model to be used with the NVIDIA Triton inference server.

- PATH_TO_PREPROCESSED_DATA: path to the preprocessed Criteo Terabyte dataset, containing 3 binary data files (`test_data.bin`, `train_data.bin` and `val_data.bin`) and a JSON file `model_size.json`, totalling ~650GB.
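
Before mounting the dataset directory, it can help to confirm that the files listed above are actually present. The following is a minimal sketch; the helper name `missing_dataset_files` is hypothetical and not part of the DLRM repository, and the demo runs on a temporary directory so that it works anywhere.

```python
import tempfile
from pathlib import Path

# File names taken from the dataset description above.
EXPECTED_FILES = ("test_data.bin", "train_data.bin", "val_data.bin", "model_size.json")


def missing_dataset_files(data_dir):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a temporary directory that contains only model_size.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model_size.json").write_text("{}")
    print(missing_dataset_files(tmp))
    # ['test_data.bin', 'train_data.bin', 'val_data.bin']
```

Inside the container, the same check would be `missing_dataset_files("/data")`, which should return an empty list.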

Within the interactive Docker bash session, start Jupyter with:

```
export PYTHONPATH=/workspace/dlrm
jupyter notebook --ip 0.0.0.0 --port 8888
```

Then open the Jupyter interface on your host machine at http://localhost:8888. Within the container, this demo notebook is located at `/workspace/dlrm/notebooks`.

### 1.2 Hardware
This notebook can be executed on any CUDA-enabled NVIDIA GPU with at least 24GB of GPU memory, although for efficient mixed precision inference, a [Tensor Core NVIDIA GPU](https://www.nvidia.com/en-us/data-center/tensorcore/) is desired (Volta, Turing or newer architectures). 

In [3]:
!nvidia-smi

Sat Apr  4 00:55:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-PCIE...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   30C    P0    37W / 250W |  19757MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    
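
To compare the memory reported by `nvidia-smi` with the requirement above, the used and total figures can be read from the memory column. This is a small sketch that parses the text line shown in the output; `parse_gpu_memory` is a hypothetical helper, and the regular expression assumes the `<used>MiB / <total>MiB` layout seen here.

```python
import re


def parse_gpu_memory(line):
    """Extract (used_mib, total_mib) from an nvidia-smi memory column."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    if match is None:
        raise ValueError("no memory usage found in line")
    return int(match.group(1)), int(match.group(2))


# Line copied from the nvidia-smi output above.
sample = "| N/A   30C    P0    37W / 250W |  19757MiB / 32510MiB |      0%      Default |"
used, total = parse_gpu_memory(sample)
print(f"used={used} MiB, total={total} MiB, free={total - used} MiB")
# used=19757 MiB, total=32510 MiB, free=12753 MiB
```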

<a id="2"></a>
## 2. Prepare model for inference

We first convert the model to a format accepted by the NVIDIA Triton inference server. Triton accepts TorchScript and ONNX, amongst other formats.

To export the model into a Triton-compatible format, we provide the deployer.py [script](../triton/deployer.py).

### TorchScript
TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.

We provide two options to convert models to TorchScript:
- `--ts-script`: convert to TorchScript using `torch.jit.script`
- `--ts-trace`: convert to TorchScript using `torch.jit.trace`


In the conversion below, we assume:

- The trained model is stored at /models/dlrm_model_fp16.pt

- The maximum batch size that Triton will handle is 65536.

- The processed dataset directory is /data, which contains a `model_size.json` file.

In [12]:
%%bash
python ../triton/deployer.py \
--ts-script \
--triton-model-name dlrm-ts-script-16 \
--triton-max-batch-size 65536 \
--save-dir /repository \
-- --model_checkpoint /models/dlrm_model_fp16.pt  \
--fp16 \
--batch_size 4096 \
--num_numerical_features 13 \
--embedding_dim 128 \
--top_mlp_sizes 1024 1024 512 256 1 \
--bottom_mlp_sizes 512 256 128 \
--interaction_op dot \
--hash_indices \
--dataset /data \
--dump_perf_data ./perfdata

deploying model dlrm-ts-script-16 in format pytorch_libtorch
done


### ONNX

[ONNX](https://onnx.ai/) is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

The pre-trained DLRM PyTorch model can be converted to ONNX with:

In [6]:
%%bash
python ../triton/deployer.py \
--onnx \
--triton-model-name dlrm-onnx-16 \
--triton-max-batch-size 4096 \
--save-dir /repository \
-- --model_checkpoint /models/dlrm_model_fp16.pt  \
--fp16 \
--batch_size 4096 \
--num_numerical_features 13 \
--embedding_dim 128 \
--top_mlp_sizes 1024 1024 512 256 1 \
--bottom_mlp_sizes 512 256 128 \
--interaction_op dot \
--hash_indices \
--dataset /data \
--dump_perf_data ./perfdata

deploying model dlrm-onnx-16 in format onnxruntime_onnx
done


  "If indices include negative values, the exported graph will produce incorrect results.")
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
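
The TorchScript and ONNX cells above differ only in the format flag, the Triton model name and the maximum batch size. As an illustration, the sketch below assembles the same argument list in Python so that several exports can be scripted. The function `deployer_command` is a hypothetical helper; every flag and value is copied from the cells above, and nothing else about `deployer.py` is assumed.

```python
import shlex

# Model arguments copied from the cells above; they follow the bare "--" separator.
MODEL_ARGS = [
    "--model_checkpoint", "/models/dlrm_model_fp16.pt",
    "--fp16",
    "--batch_size", "4096",
    "--num_numerical_features", "13",
    "--embedding_dim", "128",
    "--top_mlp_sizes", "1024", "1024", "512", "256", "1",
    "--bottom_mlp_sizes", "512", "256", "128",
    "--interaction_op", "dot",
    "--hash_indices",
    "--dataset", "/data",
    "--dump_perf_data", "./perfdata",
]

# Format flags mentioned in this notebook.
FORMAT_FLAGS = ("--ts-script", "--ts-trace", "--onnx")


def deployer_command(format_flag, model_name, max_batch_size, save_dir="/repository"):
    """Assemble the deployer.py argument list for one export format."""
    if format_flag not in FORMAT_FLAGS:
        raise ValueError(f"unsupported format flag: {format_flag}")
    return [
        "python", "../triton/deployer.py",
        format_flag,
        "--triton-model-name", model_name,
        "--triton-max-batch-size", str(max_batch_size),
        "--save-dir", save_dir,
        "--",
        *MODEL_ARGS,
    ]


cmd = deployer_command("--onnx", "dlrm-onnx-16", 4096)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the notebook directory inside the container.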


<a id="3"></a>
## 3. Start the Triton inference server

*Note: this step must be done outside of the current Docker container.*

Open a bash window on the **host machine** and execute the following commands:

```
docker pull nvcr.io/nvidia/tensorrtserver:20.03-py3
docker run -d --rm --gpus device=0 --ipc=host --network=host -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <PATH_TO_MODEL_REPOSITORY>:/repository nvcr.io/nvidia/tensorrtserver:20.03-py3 trtserver --model-store=/repository --log-verbose=1 --model-control-mode=explicit
```

where:

- PATH_TO_MODEL_REPOSITORY: directory on the host machine containing the models converted in section 2 above.

Note that each DLRM model will require ~19GB of GPU memory.

Within the model repository directory on the inference server (shown below as `/models`; with the command above it is mounted at `/repository`), the structure should look similar to the following:

```
/models
`-- dlrm-onnx-16
    |-- 1
    |   `-- model.onnx
    |       |-- bottom_mlp.0.weight
    |       |-- bottom_mlp.2.weight
    |       |-- bottom_mlp.4.weight
    |       |-- embeddings.0.weight
    |       |-- embeddings.1.weight
    |       |-- embeddings.10.weight
    |       |-- embeddings.11.weight
    |       |-- embeddings.12.weight
    |       |-- embeddings.13.weight
    |       |-- embeddings.14.weight
    |       |-- embeddings.15.weight
    |       |-- embeddings.17.weight
    |       |-- embeddings.18.weight
    |       |-- embeddings.19.weight
    |       |-- embeddings.2.weight
    |       |-- embeddings.20.weight
    |       |-- embeddings.21.weight
    |       |-- embeddings.22.weight
    |       |-- embeddings.23.weight
    |       |-- embeddings.24.weight
    |       |-- embeddings.25.weight
    |       |-- embeddings.3.weight
    |       |-- embeddings.4.weight
    |       |-- embeddings.6.weight
    |       |-- embeddings.7.weight
    |       |-- embeddings.8.weight
    |       |-- embeddings.9.weight
    |       |-- model.onnx
    |       |-- top_mlp.0.weight
    |       |-- top_mlp.2.weight
    |       |-- top_mlp.4.weight
    |       `-- top_mlp.6.weight
    `-- config.pbtxt
```
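
A quick structural check of the repository can catch a failed export before the server is started. The sketch below only verifies what the tree above shows: each model directory has a `config.pbtxt` and at least one numeric version subdirectory. `check_model_repository` is a hypothetical helper, not a Triton tool, and the demo uses a temporary directory.

```python
import tempfile
from pathlib import Path


def check_model_repository(repo_dir):
    """Return a list of problems found in a model repository laid out as above."""
    problems = []
    for model_dir in sorted(p for p in Path(repo_dir).iterdir() if p.is_dir()):
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems


# Demo on a temporary directory that mimics the tree above.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "dlrm-onnx-16" / "1").mkdir(parents=True)
    print(check_model_repository(repo))  # ['dlrm-onnx-16: missing config.pbtxt']
    (Path(repo) / "dlrm-onnx-16" / "config.pbtxt").write_text("")
    print(check_model_repository(repo))  # []
```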

<a id="4"></a>
## 4. Testing server with the performance client

After model deployment has completed, we can test the deployed model against the Criteo test dataset. 

Note: This requires mounting the Criteo test data to, e.g. `/data/test_data.bin`. Within the dataset directory, there must also be a `model_size.json` file.

In [9]:
%%bash
python ../triton/client.py \
--triton-server-url localhost:8000 \
--protocol HTTP \
--triton-model-name dlrm-onnx-16 \
--num_numerical_features 13 \
--dataset_config /data/model_size.json \
--inference_data /data/test_data.bin \
--batch_size 4096 \
--fp16

Process is terminated.


The Triton inference server comes with a [performance client](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/optimization.html#perf-client) which is designed to stress test the server using multiple client threads.

The perf_client generates inference requests to your model and measures the throughput and latency of those requests. To get representative results, the perf_client measures the throughput and latency over a time window, and then repeats the measurements until it gets stable values. By default the perf_client uses average latency to determine stability, but you can use the `--percentile` flag to stabilize results based on that confidence level instead. For example, if `--percentile=95` is used, the results will be stabilized using the 95th percentile request latency.
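
To make the idea of "repeating until stable" concrete, here is a minimal sketch of a stability check and a percentile calculation. The exact rule used by perf_client is not described in this notebook; the window of three measurements and the 10% tolerance below are assumptions chosen for illustration, and both function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (0 < pct <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_stable(measurements, window=3, tolerance=0.10):
    """True if the last `window` measurements all lie within `tolerance` of their mean."""
    if len(measurements) < window:
        return False
    recent = measurements[-window:]
    mean = sum(recent) / window
    return all(abs(m - mean) <= tolerance * mean for m in recent)


# Illustrative average latencies (usec) from three consecutive measurement windows.
avg_latencies = [60428, 66310, 59617]
print(is_stable(avg_latencies[:2]))  # False: fewer than three windows so far
print(is_stable(avg_latencies))      # True: all three are within 10% of their mean

# Stabilizing on a percentile means applying the same check to, e.g., p95 values.
print(percentile([10, 20, 30, 40], 95))  # 40
```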

### Request Concurrency

By default perf_client measures your model’s latency and throughput using the lowest possible load on the model. To do this perf_client sends one inference request to the server and waits for the response. When that response is received, the perf_client immediately sends another request, and then repeats this process during the measurement windows. The number of outstanding inference requests is referred to as the request concurrency, and so by default perf_client uses a request concurrency of 1.

Using the `--concurrency-range <start>:<end>:<step>` option you can have perf_client collect data for a range of request concurrency levels. Use the `--help` option to see complete documentation for this and other options.
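
The request pattern described above is a closed loop: each client thread sends a request, waits for the response, and immediately sends the next one, so the number of threads equals the number of outstanding requests. The sketch below imitates that pattern with a stand-in request function; it is not perf_client code, and `run_closed_loop` and `fake_request` are hypothetical names.

```python
import threading
import time


def run_closed_loop(send_request, concurrency, window_sec):
    """Keep `concurrency` requests outstanding for `window_sec` seconds; return completed count."""
    deadline = time.monotonic() + window_sec
    counts = [0] * concurrency  # one slot per worker, so no lock is needed

    def worker(idx):
        while time.monotonic() < deadline:
            send_request()  # blocks until the "response" arrives
            counts[idx] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)


def fake_request():
    """Stand-in for a real inference call; simply waits 10 ms."""
    time.sleep(0.01)


for c in (1, 2, 4):
    done = run_closed_loop(fake_request, concurrency=c, window_sec=0.2)
    print(f"concurrency={c}: {done} requests completed")
```

With a real server, higher concurrency typically raises throughput until the server saturates, after which additional requests mostly add queueing latency.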
    


In [13]:
%%bash
/workspace/install/bin/perf_client \
--max-threads 10 \
-m dlrm-onnx-16 \
-x 1 \
-p 5000 \
-v -i gRPC \
-u localhost:8001 \
-b 4096 \
-l 5000 \
--concurrency-range 1:10 \
--input-data ./perfdata \
-f result.csv

*** Measurement Settings ***
  Batch size: 4096
  Measurement window: 5000 msec
  Latency limit: 5000 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 67993.6 infer/sec. Avg latency: 60428 usec (std 22260 usec)
  Pass [2] throughput: 61440 infer/sec. Avg latency: 66310 usec (std 21723 usec)
  Pass [3] throughput: 68812.8 infer/sec. Avg latency: 59617 usec (std 22128 usec)
  Client: 
    Request count: 84
    Throughput: 68812.8 infer/sec
    Avg latency: 59617 usec (standard deviation 22128 usec)
    p50 latency: 71920 usec
    p90 latency: 80018 usec
    p95 latency: 83899 usec
    p99 latency: 88054 usec
    Avg gRPC time: 58773 usec (marshal 274 usec + response wait 58458 usec + unmarshal 41 usec)
  Server: 
    Request count: 102
    Avg request latency: 57208 usec (overhead 6 usec + queue 20184 usec + compute 37018 usec)

Request concurrency: 2
  Pass [1] thro
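
The figures in this report can be cross-checked by hand. The reported throughput is consistent with request count × batch size ÷ measurement window, and the server-side latency is the sum of its overhead, queue and compute components. The short sketch below redoes that arithmetic with the concurrency-1 numbers printed above; the function name `throughput` is only for illustration.

```python
def throughput(request_count, batch_size, window_msec):
    """Inferences per second achieved during one measurement window."""
    return request_count * batch_size / (window_msec / 1000)


# Client side: 84 requests of batch size 4096 in a 5000 msec window.
print(throughput(84, 4096, 5000))  # 68812.8

# Server side: overhead + queue + compute, in usec.
print(6 + 20184 + 37018)  # 57208
```

Here roughly a third of the server latency (20184 of 57208 usec) is time spent waiting in the queue rather than computing.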



### Visualizing Latency vs. Throughput

The perf_client provides the -f option to generate a file containing CSV output of the results.
You can import the CSV file into a spreadsheet to help visualize the latency vs inferences/second tradeoff as well as see some components of the latency. Follow these steps:
- Open this [spreadsheet](https://docs.google.com/spreadsheets/d/1IsdW78x_F-jLLG4lTV0L-rruk0VEBRL7Mnb-80RGLL4)

- Make a copy from the File menu “Make a copy…”

- Open the copy

- Select the A1 cell on the “Raw Data” tab

- From the File menu select “Import…”

- Select “Upload” and upload the file

- Select “Replace data at selected cell” and then select the “Import data” button

![DLRM_model](latency_vs_throughput.PNG)
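
As an alternative to the spreadsheet, the CSV file can be read directly in Python. The sketch below is a generic reader; the header names in the sample (`Concurrency`, `Inferences/Second`, `Server Queue`, `Server Compute`) and all values are made up for illustration, so check the first line of your own `result.csv` and adjust the column list accordingly. `read_columns` is a hypothetical helper.

```python
import csv
import io


def read_columns(csv_text, columns):
    """Return each CSV row as a list of floats for the requested columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[float(row[name]) for name in columns] for row in reader]


# Synthetic stand-in for result.csv; header names and numbers are illustrative only.
sample = """Concurrency,Inferences/Second,Server Queue,Server Compute
1,1000,10,90
2,1800,30,95
"""

wanted = ["Concurrency", "Inferences/Second", "Server Queue", "Server Compute"]
for concurrency, ips, queue, compute in read_columns(sample, wanted):
    print(f"concurrency={concurrency:.0f} throughput={ips:.0f} infer/sec "
          f"queue+compute={queue + compute:.0f} usec")
```

For a real file, pass the file contents instead of `sample`, for example `read_columns(open("result.csv").read(), wanted)`.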


# Conclusion

In this notebook, we walked through the complete process of preparing a pre-trained DLRM model for inference with the Triton inference server. We then stress-tested the server with the performance client to measure inference throughput and latency.

## What's next
Now it's time to deploy your own DLRM model with Triton. 