In [1]:
# Copyright 2020 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_merlin_getting-started-movielens-04-triton-inference-with-tf/nvidia_logo.png" style="width: 90px; float: right;">

## Serve Recommendations from the TensorFlow Model

This notebook was created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container.

The last step is to deploy the ETL workflow and the saved model to production. In production, the input data must be transformed exactly as it was during training ETL. Therefore, we deploy the Workflow with the Merlin Model as an ensemble model to Triton Inference Server. The ensemble model guarantees that the same transformations are applied to the raw inputs.

<img src="./imgs/triton-tf.png" width="25%">


### Launching and Starting the Triton Server

Before we get started, start the container for Triton Inference Server with the following command. The `-v` argument mounts your local `model-repository` folder, which contains the models saved in the previous notebook (`03-Training-with-TF.ipynb`).

```
docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/model/ nvcr.io/nvidia/merlin/merlin-tensorflow:latest
```


After you have started the container, you can start Triton Inference Server with the following command.
You need to provide the correct path to the `models` directory.

```
tritonserver --model-repository=path_to_models --backend-config=tensorflow,version=2 --model-control-mode=explicit 
```

Note: The model-repository path is `/root/nvt-examples/models/ensemble`. Because the server runs with `--model-control-mode=explicit`, the models haven't been loaded yet. Below, we will request the Triton server to load the saved ensemble model.
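
Before pointing `--model-repository` at a directory, it can help to confirm that the directory has the layout Triton expects, with one subdirectory per model. The following is a minimal sketch and not part of the original workflow: `list_model_dirs` is a hypothetical helper, and it assumes that every model directory contains a `config.pbtxt` file, which may not hold for every backend.

```python
import os


def list_model_dirs(repo_path):
    """Return the names of subdirectories of repo_path that hold a config.pbtxt.

    Hypothetical helper for illustration only. It assumes each model
    directory ships its own config.pbtxt file.
    """
    found = []
    for name in sorted(os.listdir(repo_path)):
        model_dir = os.path.join(repo_path, name)
        # A directory counts as a model only if it carries a config file.
        if os.path.isfile(os.path.join(model_dir, "config.pbtxt")):
            found.append(name)
    return found
```

Called on the repository path from the note above, it should list the same model names that the repository index reports later in this notebook.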

In [3]:
# External dependencies
import os
from time import time

# Get dataframe library - cudf or pandas
from merlin.core.dispatch import get_lib
df_lib = get_lib()

import tritonclient.grpc as grpcclient
import nvtabular.inference.triton as nvt_triton

We define our base directory, containing the data.

In [4]:
# path to preprocessed data
INPUT_DATA_DIR = os.environ.get(
    "INPUT_DATA_DIR", os.path.expanduser("~/nvt-examples/movielens/data/")
)

Let's deactivate the warnings before sending requests.

In [5]:
import warnings

warnings.filterwarnings("ignore")

### Loading Ensemble Model with Triton Inference Server

At this stage, you should have launched the Triton Inference Server docker container with the instructions above.

Let's connect to the Triton Inference Server. Use Triton’s ready endpoint to verify that the server and the models are ready for inference. Replace `localhost` with your host IP address.

In [6]:
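# tritonclient.http replaces the deprecated top-level tritonhttpclient module
import tritonclient.http as httpclient

try:
    # Port 8000 is the HTTP port published by the docker command above
    triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))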

client created.




We check if the server is alive.

In [8]:
triton_client.is_server_live()

GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True

The liveness request returns HTTP status 200 if Triton is live and a non-200 status if it is not.

We check the available models in the repository:
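
Liveness and readiness are separate health endpoints, so a server can be live before it is ready to serve requests. The sketch below polls the client's `is_server_ready()` method until it returns `True` or a timeout expires. `wait_until_ready` is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions rather than recommended values.

```python
import time


def wait_until_ready(client, timeout_s=30.0, poll_s=0.5):
    """Poll client.is_server_ready() until it succeeds or timeout_s elapses.

    Hypothetical helper: `client` is any object exposing is_server_ready(),
    such as the HTTP client created above.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if client.is_server_ready():
                return True
        except Exception:
            # The server may not accept connections yet; keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With the client from the previous cell, `wait_until_ready(triton_client)` would block until the server reports ready or 30 seconds pass.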

In [9]:
triton_client.get_model_repository_index()

POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '89'}>
bytearray(b'[{"name":"0_transformworkflow"},{"name":"1_predicttensorflow"},{"name":"ensemble_model"}]')


[{'name': '0_transformworkflow'},
 {'name': '1_predicttensorflow'},
 {'name': 'ensemble_model'}]

We load the ensemble model.

In [10]:
%%time

triton_client.load_model(model_name="ensemble_model")

POST /v2/repository/models/ensemble_model/load, headers None
{}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'ensemble_model'
CPU times: user 0 ns, sys: 2.68 ms, total: 2.68 ms
Wall time: 4.68 s
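
In explicit model-control mode, a load request for a name that is not in the repository fails on the server side. The sketch below checks the repository index first and then confirms readiness after loading. `load_if_listed` is a hypothetical helper; it assumes the HTTP client, whose `get_model_repository_index()` returns a list of dictionaries with a `name` key, as in the output shown earlier.

```python
def load_if_listed(client, model_name):
    """Load model_name only if the repository index lists it.

    Hypothetical helper: returns the result of client.is_model_ready()
    after the load request, and raises ValueError for unknown names.
    """
    index = client.get_model_repository_index()
    available = {entry["name"] for entry in index}
    if model_name not in available:
        raise ValueError(
            f"{model_name!r} is not in the repository; available: {sorted(available)}"
        )
    client.load_model(model_name=model_name)
    # Readiness is checked separately because loading can still fail.
    return client.is_model_ready(model_name)
```

For this notebook the call would be `load_if_listed(triton_client, "ensemble_model")`.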


## Predicting

Let's now craft a request and obtain a response from the Triton Inference Server.

We will use the first 3 rows of `userId` and `movieId` as input.

In [11]:
batch = df_lib.read_parquet(
    os.path.join(INPUT_DATA_DIR, "valid.parquet"), num_rows=3, columns=["userId", "movieId"]
)
print(batch)

          userId  movieId
20502031  133324    47099
68125        544    60950
6908746    44825      459


We now send the request.

In [12]:
inputs = nvt_triton.convert_df_to_triton_input(["userId", "movieId"], batch, grpcclient.InferInput)

outputs = [
    grpcclient.InferRequestedOutput(col)
    for col in ["rating/binary_classification_task"]
]

with grpcclient.InferenceServerClient("localhost:8001") as client:
    response = client.infer("ensemble_model", inputs, request_id="1", outputs=outputs)

Let's decode the response and see what information we receive.

In [13]:
scores = response.as_numpy("rating/binary_classification_task")
print(scores, scores.shape)

[[0.47853076]
 [0.47652945]
 [0.4694085 ]] (3, 1)


Each returned score is the predicted probability that the user with the given `userId` will rate the movie referenced in the `movieId` column highly.
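
To make the scores easier to read, each one can be paired with the `userId` and `movieId` that produced it. The following is a minimal sketch: `pair_scores` is a hypothetical helper, and the 0.5 decision threshold is an assumption for illustration, not a value taken from the trained model.

```python
def pair_scores(user_ids, movie_ids, scores, threshold=0.5):
    """Pair each (userId, movieId) with its score and a thresholded flag.

    Hypothetical helper: `scores` is expected to have shape (n, 1), like
    the array printed above. The threshold is an illustrative assumption.
    """
    rows = []
    for user_id, movie_id, row in zip(user_ids, movie_ids, scores):
        score = float(row[0])  # each row holds a single probability
        rows.append(
            {
                "userId": int(user_id),
                "movieId": int(movie_id),
                "score": score,
                "predicted_high": score >= threshold,
            }
        )
    return rows


# Values copied from the batch and the response printed above.
rows = pair_scores(
    [133324, 544, 44825],
    [47099, 60950, 459],
    [[0.47853076], [0.47652945], [0.4694085]],
)
for row in rows:
    print(row)
```

With the assumed threshold of 0.5, none of the three pairs would be flagged as a predicted high rating, because all three scores are below it.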