In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

## Overview

In this notebook, we show how to run inference with our trained deep learning recommender model using Triton Inference Server. We deploy the NVTabular workflow and the HugeCTR model as an ensemble. For each request, Triton Inference Server feeds the input data through the NVTabular workflow and passes its output to the HugeCTR model.

As we saw in the previous notebook, [movielens-HugeCTR](https://github.com/NVIDIA/NVTabular/blob/main/examples/inference_triton/inference-HugeCTR/movielens-HugeCTR.ipynb), NVTabular provides the function `export_hugectr_ensemble`. It saves not only the NVTabular workflow but also the trained HugeCTR model and the ensemble model, so that all three can be served by Triton Inference Server.

## Getting Started

We first write a configuration file, `ps.json`, that points to the stored model weights and the network configuration.

In [2]:
%%writefile '/model/models/ps.json'

{
    "supportlonglong": true,
    "models": [
        {
            "model": "movielens",
            "sparse_files": ["/model/models/movielens/1/0_sparse_1900.model"],
            "dense_file": "/model/models/movielens/1/_dense_1900.model",
            "network_file": "/model/models/movielens/1/movielens.json"
        }
    ]
}

Overwriting /model/models/ps.json
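
Before starting the server, it can help to confirm that the configuration parses as JSON and that every file it references exists. The following is a minimal sketch, not part of NVTabular or HugeCTR: the helper name `missing_model_files` is hypothetical, and the demo writes a throwaway config to a temporary directory so that it runs without the real model files.

```python
import json
import os
import tempfile


def missing_model_files(config_path):
    """Return the paths listed in a ps.json-style config that do not exist on disk."""
    with open(config_path) as f:
        config = json.load(f)  # raises json.JSONDecodeError on malformed JSON

    missing = []
    for model in config["models"]:
        paths = list(model["sparse_files"]) + [model["dense_file"], model["network_file"]]
        missing.extend(p for p in paths if not os.path.exists(p))
    return missing


# Demo on a throwaway directory (assumption: the real files live under /model/models/).
demo_dir = tempfile.mkdtemp()
dense_path = os.path.join(demo_dir, "_dense_1900.model")
open(dense_path, "w").close()  # only the dense file exists in this demo

demo_config = {
    "supportlonglong": True,
    "models": [
        {
            "model": "movielens",
            "sparse_files": [os.path.join(demo_dir, "0_sparse_1900.model")],
            "dense_file": dense_path,
            "network_file": os.path.join(demo_dir, "movielens.json"),
        }
    ],
}
demo_config_path = os.path.join(demo_dir, "ps.json")
with open(demo_config_path, "w") as f:
    json.dump(demo_config, f)

demo_missing = missing_model_files(demo_config_path)
print(demo_missing)
```

For the real deployment, calling `missing_model_files("/model/models/ps.json")` should return an empty list once the exported files are in place.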


Let's import required libraries.

In [3]:
# Note: despite the alias, this is Triton's gRPC client; it connects to port 8001 below.
import tritonclient.grpc as httpclient

import cudf
import numpy as np

### Load Models on Triton Server

At this stage, launch the Triton Inference Server Docker container with the following command:

```
docker run -it --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/model nvcr.io/nvidia/merlin/merlin-inference:0.6
```

After the container has started, start Triton Inference Server with the command below:

```
tritonserver --model-repository=<path_to_models> --backend-config=hugectr,ps=<path_to_models>/ps.json --model-control-mode=explicit
```
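
To make the `<path_to_models>` placeholder concrete, the sketch below assembles the same command line for a given repository path. The helper name `tritonserver_command` is hypothetical; the flags are exactly those shown above, and `/model/models/` is the repository path used in this notebook.

```python
def tritonserver_command(model_repo):
    """Build the tritonserver argument list for a given model repository path."""
    repo = model_repo.rstrip("/")  # avoid a double slash before ps.json
    return [
        "tritonserver",
        f"--model-repository={repo}",
        f"--backend-config=hugectr,ps={repo}/ps.json",
        "--model-control-mode=explicit",
    ]


cmd = tritonserver_command("/model/models/")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` inside the container, or joined and pasted into a shell.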

In some cases, the HugeCTR backend library for Triton has to be copied to the model folder:
```
cp /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so <path_to_models>/movielens/
```

Note: The model repository path is `/model/models/`, and the path of the network JSON file is `/model/models/movielens/1/movielens.json`. Because the server runs with `--model-control-mode=explicit`, the models have not been loaded yet. We initialize a Triton client and then ask the server to load the saved ensemble.

In [4]:
# disable warnings
import warnings

warnings.filterwarnings("ignore")

In [5]:
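# The top-level `tritonhttpclient` module is deprecated in favor of `tritonclient.http`.
import tritonclient.http as tritonhttpclient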

try:
    triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

client created.




In [6]:
triton_client.is_server_live()

GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True
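
The server may need some time after launch before it answers health checks. One way to handle this is to poll until a check succeeds. The sketch below is a generic helper under that assumption; `wait_until` and `fake_ready` are hypothetical names, and the stand-in predicate lets the example run without a server.

```python
import time


def wait_until(predicate, timeout_s=60.0, interval_s=1.0):
    """Call predicate() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if predicate():
                return True
        except Exception:
            # Treat connection errors as "not ready yet" and keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in predicate that succeeds on the third call.
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_ready, timeout_s=5.0, interval_s=0.01)
print(ready, calls["n"])
```

With the client created above, the equivalent call would be `wait_until(triton_client.is_server_ready)`.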

In [7]:
triton_client.get_model_repository_index()

POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '72'}>
bytearray(b'[{"name":"movielens"},{"name":"movielens_ens"},{"name":"movielens_nvt"}]')


[{'name': 'movielens'}, {'name': 'movielens_ens'}, {'name': 'movielens_nvt'}]

Let's load our models into Triton Inference Server.

In [8]:
%%time

triton_client.load_model(model_name="movielens_nvt")

POST /v2/repository/models/movielens_nvt/load, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'movielens_nvt'
CPU times: user 0 ns, sys: 3.97 ms, total: 3.97 ms
Wall time: 3.04 s


In [9]:
%%time

triton_client.load_model(model_name="movielens")

POST /v2/repository/models/movielens/load, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'movielens'
CPU times: user 2.49 ms, sys: 968 µs, total: 3.46 ms
Wall time: 5.59 s


Finally, we load our ensemble model `movielens_ens`.

In [10]:
%%time

triton_client.load_model(model_name="movielens_ens")

POST /v2/repository/models/movielens_ens/load, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'movielens_ens'
CPU times: user 0 ns, sys: 2.45 ms, total: 2.45 ms
Wall time: 104 ms
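
The three load calls above follow a fixed order: the NVTabular workflow, the HugeCTR model, and then the ensemble. A minimal sketch of wrapping that sequence in a loop is shown below. The helper `load_models_in_order` and the `FakeClient` class are hypothetical; `FakeClient` only mimics the `load_model` and `is_model_ready` methods of the Triton HTTP client so that the example runs without a server.

```python
def load_models_in_order(client, model_names):
    """Load each model, then record whether the server reports it as ready."""
    status = {}
    for name in model_names:
        client.load_model(model_name=name)
        status[name] = client.is_model_ready(name)
    return status


class FakeClient:
    """Hypothetical stand-in that records calls; not part of tritonclient."""

    def __init__(self):
        self.loaded = []

    def load_model(self, model_name):
        self.loaded.append(model_name)

    def is_model_ready(self, model_name):
        return model_name in self.loaded


fake = FakeClient()
status = load_models_in_order(fake, ["movielens_nvt", "movielens", "movielens_ens"])
print(status)
```

Passing `triton_client` instead of `fake` would issue the same requests as the cells above and report readiness for each model.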


Let's send a request to Triton Inference Server and print the response. Our example has no continuous columns, so the only inputs are the categorical columns `userId` and `movieId`.

In [11]:
from tritonclient.utils import np_to_triton_dtype

model_name = "movielens_ens"
col_names = ["userId", "movieId"]
# read in a batch of data to get transforms for
batch = cudf.read_parquet("/model/data/valid.parquet", num_rows=64)[col_names]
print(batch, "\n")

# convert the batch to Triton inputs
columns = [(col, batch[col]) for col in col_names]
inputs = []

col_dtypes = [np.int64, np.int64]
for i, (name, col) in enumerate(columns):
    d = col.values_host.astype(col_dtypes[i])
    d = d.reshape(len(d), 1)
    inputs.append(httpclient.InferInput(name, d.shape, np_to_triton_dtype(col_dtypes[i])))
    inputs[i].set_data_from_numpy(d)
# placeholder variables for the output
outputs = []
outputs.append(httpclient.InferRequestedOutput("OUTPUT0"))
# make the request
with httpclient.InferenceServerClient("localhost:8001") as client:
    response = client.infer(model_name, inputs, request_id=str(1), outputs=outputs)

print("predicted sigmoid result:\n", response.as_numpy("OUTPUT0"))

          userId  movieId
15347762   99476   104374
16647840  107979     2634
23915192  155372     1614
10052313   65225     7153
12214125   79161      500
...          ...      ...
17138306  111072     1625
21326655  138575    81591
5664631    36671     8861
217658      1535   111759
11842246   76766   109487

[64 rows x 2 columns] 



InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
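
The request cell reshapes every column into an `int64` array of shape `(n, 1)` before wrapping it in an `InferInput`. That step can be checked in isolation with plain NumPy, without cuDF or a running server. The helper `to_column_tensors` below is a hypothetical name used only for this sketch, and the sample values are the first three rows of the batch printed above.

```python
import numpy as np


def to_column_tensors(columns, dtype=np.int64):
    """Turn {name: 1-D sequence} into {name: array of shape (n, 1)} with the given dtype."""
    tensors = {}
    for name, values in columns.items():
        arr = np.asarray(values).astype(dtype)
        tensors[name] = arr.reshape(len(arr), 1)
    return tensors


# First three rows of the batch printed above.
sample = {"userId": [99476, 107979, 155372], "movieId": [104374, 2634, 1614]}
tensors = to_column_tensors(sample)
for name, arr in tensors.items():
    print(name, arr.shape, arr.dtype)
```

Each resulting array has one row per example and a single column, which is the shape the request cell passes to `InferInput`.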