In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

# Deploying Ranking Models with Merlin Systems

NVIDIA Merlin is an open source framework that accelerates and scales end-to-end recommender system pipelines. The Merlin framework is divided into several subcomponents, including Merlin-Core, Merlin-Models, NVTabular, and Merlin-Systems. This example focuses on Merlin Systems.

The purpose of the Merlin Systems library is to make it easy for Merlin users to move their recommender systems from development to deployment on Triton Inference Server. It extends the same user-friendly API that users are accustomed to in NVTabular so that it can also deploy recommender system components to Triton.

Before you continue with this notebook, make sure that you have a working NVTabular workflow and a trained model stored in an accessible location. Merlin Systems takes the data preprocessing workflow defined in NVTabular and loads it into Triton Inference Server as a model. It then does the same for the trained model. The rest of this notebook takes a closer look at how Merlin Systems simplifies deploying to Triton.

## Learning Objectives

This Jupyter notebook example demonstrates how to:
- deploy an NVTabular workflow and a ranking model to Triton Inference Server as an ensemble
- send a request to Triton
- generate prediction results for a given query (a batch)

## Starting Triton Inference Server

After we export the ensemble, we are ready to start the Triton Inference Server. The server is installed in all the Merlin inference containers. If you are not using one of our containers, then ensure it is installed in your environment. For more information, see the Triton Inference Server documentation.

You can start the server by running the following command:

`tritonserver --model-repository=<path to the saved ensemble folder>`

For the `--model-repository` argument, specify the same value as the `ensemble_export_path` that you specified previously when executing the `inference.py` script.

After you run the tritonserver command, wait until your terminal shows messages like the following example:

I0414 18:29:50.741833 4067 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001 <br>
I0414 18:29:50.742197 4067 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000 <br>
I0414 18:29:50.783470 4067 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002 <br>
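
Instead of watching the terminal for these messages, you can poll the server from Python until it reports that it is ready. The following is a minimal sketch, not part of Merlin Systems: `wait_until_ready` is a hypothetical helper name, and the timeout and interval values are arbitrary assumptions. It accepts any zero-argument callable, so it can wrap the `is_server_ready()` method of the Triton HTTP client.

```python
import time


def wait_until_ready(check, timeout_s=120.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `timeout_s` elapses.

    `check` is any zero-argument callable. Exceptions (for example, a refused
    connection while the server is still starting) count as "not ready yet".
    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # server not reachable yet; keep polling
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage with the Triton HTTP client (requires a running server):
#   import tritonclient.http as client
#   triton_client = client.InferenceServerClient(url="localhost:8000")
#   wait_until_ready(triton_client.is_server_ready)
```

Passing the check as a callable keeps the helper independent of the client library, which also makes it easy to exercise without a server.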

Import libraries.

In [2]:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
import numpy as np
import pandas as pd
from nvtabular.workflow import Workflow
import tritonclient.grpc as grpcclient

2023-05-11 22:43:16.509753: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  warn(f"PyTorch dtype mappings did not load successfully due to an error: {exc.msg}")


Load the saved NVTabular workflow. We use the workflow's input schema below when we send a request to Triton.

In [3]:
input_path = os.environ.get("INPUT_FOLDER", "/workspace/Tenrec/dataset/workflow/")
workflow_stored_path = input_path  # the folder that holds the saved workflow
workflow = Workflow.load(workflow_stored_path)

In [4]:
workflow.input_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged
0,user_id,(),"DType(name='int32', element_type=<ElementType....",False,False
1,item_id,(),"DType(name='int32', element_type=<ElementType....",False,False
2,video_category,(),"DType(name='int8', element_type=<ElementType.I...",False,False
3,gender,(),"DType(name='int8', element_type=<ElementType.I...",False,False
4,age,(),"DType(name='int8', element_type=<ElementType.I...",False,False


Load the saved output names as a list.

In [5]:
# np.load opens and closes the file itself when given a path
outputs = np.load("outputs.npy", allow_pickle=True).tolist()
print(outputs)

['click/binary_output', 'follow/binary_output']


We prepare a batch request to send to Triton, and Triton returns a response that contains a probability value for each target column. Because the NVTabular workflow and the ranking model are served together as a pipeline, we can send raw data as the request, and the served NVTabular workflow transforms it in the same way it transformed the training set during the preprocessing step.

Note that in this example we do not create the request from the raw `.csv` file, because during preprocessing we prepared the data and removed some users and items based on the minimum frequencies we set. Instead, we use the raw validation data that was generated after the train and eval split step.

In [6]:
original_data_path = os.environ.get("INPUT_FOLDER", "/workspace/Tenrec/dataset/_cache/02/")
batch = pd.read_parquet(os.path.join(original_data_path, "eval/", "part.0.parquet"), columns=workflow.input_schema.column_names).reset_index(drop=True)
batch = batch.iloc[:10, :]

In [7]:
batch

Unnamed: 0,user_id,item_id,video_category,gender,age
0,14402,412,0,0,0
1,276438,8176,1,0,0
2,215817,62881,0,0,0
3,117574,74582,1,0,0
4,195714,34982,0,0,0
5,117050,9769,0,0,0
6,219293,7750,0,0,0
7,252272,59629,0,0,0
8,232320,12737,0,0,0
9,168813,148443,0,0,0


## Deploy models on Triton Inference Server

First we need to ensure that we have a client connected to the server that we started. To do this, we use the Triton HTTP client library.

In [8]:
import tritonclient.http as client

# Create a triton client
try:
    triton_client = client.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

client created.


In [9]:
# ensure triton is in a good state
triton_client.is_server_live()
triton_client.get_model_repository_index()

GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '191'}>
bytearray(b'[{"name":"0_transformworkflowtriton","version":"1","state":"READY"},{"name":"1_predicttensorflowtriton","version":"1","state":"READY"},{"name":"executor_model","version":"1","state":"READY"}]')


[{'name': '0_transformworkflowtriton', 'version': '1', 'state': 'READY'},
 {'name': '1_predicttensorflowtriton', 'version': '1', 'state': 'READY'},
 {'name': 'executor_model', 'version': '1', 'state': 'READY'}]
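
The repository index above lists every model in the ensemble along with its state. As a small sketch, the check below turns that listing into an explicit readiness test. `models_not_ready` is a hypothetical helper name, and the `index` literal is copied from the output above; in the notebook it would be the return value of `triton_client.get_model_repository_index()`.

```python
def models_not_ready(repository_index):
    """Return the names of models whose state is not "READY"."""
    return [m["name"] for m in repository_index if m.get("state") != "READY"]


# Copied from the output above for illustration.
index = [
    {"name": "0_transformworkflowtriton", "version": "1", "state": "READY"},
    {"name": "1_predicttensorflowtriton", "version": "1", "state": "READY"},
    {"name": "executor_model", "version": "1", "state": "READY"},
]

pending = models_not_ready(index)
if pending:
    print("Models not ready:", pending)
else:
    print("All models are ready.")
```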

Now that our server is running, we can send requests to it. The code below builds a request from the workflow's input schema and the batch, sends it to Triton, and prints the response.

In [10]:
from merlin.systems.triton.utils import send_triton_request
response = send_triton_request(workflow.input_schema, batch, outputs)
print(response)

{'click/binary_output': array([[0.5073921 ],
       [0.50442815],
       [0.5070812 ],
       [0.5021189 ],
       [0.5028127 ],
       [0.5052893 ],
       [0.5060355 ],
       [0.51079565],
       [0.50457317],
       [0.50388616]], dtype=float32), 'follow/binary_output': array([[0.49480468],
       [0.49666217],
       [0.49884534],
       [0.4969691 ],
       [0.4962534 ],
       [0.49507955],
       [0.49622113],
       [0.49824178],
       [0.49707925],
       [0.49556646]], dtype=float32)}
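
The response maps each output name to one probability per row of the batch. One way to use these values is to pair them with the item IDs and sort by score, as in the sketch below. `rank_items` is a hypothetical helper, and the literals are copied from the batch and the `click/binary_output` values above. In this batch every row belongs to a different user, so the resulting order is purely illustrative.

```python
def rank_items(item_ids, scores, top_k=None):
    """Pair each item with its score and sort from highest to lowest score.

    `scores` is a list of single-element rows, the shape of the arrays in the
    response above (for a NumPy array, call `.tolist()` first).
    """
    flat = [row[0] for row in scores]
    ranked = sorted(zip(item_ids, flat), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k] if top_k is not None else ranked


# Values copied from the batch and the response shown above.
item_ids = [412, 8176, 62881, 74582, 34982, 9769, 7750, 59629, 12737, 148443]
click_scores = [
    [0.5073921], [0.50442815], [0.5070812], [0.5021189], [0.5028127],
    [0.5052893], [0.5060355], [0.51079565], [0.50457317], [0.50388616],
]

for item_id, score in rank_items(item_ids, click_scores, top_k=3):
    print(item_id, score)
```

In the notebook, the same call could take `batch["item_id"].tolist()` and `response["click/binary_output"].tolist()` as its inputs.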


## Summary

Congratulations on completing this quick start guide example series!

In this quick start example series, you have preprocessed and transformed the data with NVTabular, trained a single-task or multi-task model with Merlin Models, and then finally deployed these models on Triton Inference Server.