In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Getting Started with Merlin Systems 

## Overview

NVIDIA Merlin is an open source framework that accelerates and scales end-to-end recommender system pipelines on GPU. The Merlin framework is made up of several subcomponents, including merlin-core, merlin-models, nvtabular and merlin-systems. Merlin Systems is the focus of this example.

The purpose of the Merlin Systems library is to make it easy for Merlin users to move their recommender systems from development to deployment on Triton. It extends the same user-friendly API that users are accustomed to in NVTabular so that it also covers deploying recommender system components to Triton.

There are a few prerequisites before we continue with this notebook. Please ensure you have a working workflow and a trained model stored in an accessible location. As previously mentioned, Merlin Systems takes the data preprocessing workflow defined in NVTabular and loads it into Triton as a model. It then does the same for the trained model. Let's take a closer look, in the rest of this notebook, at how Merlin Systems simplifies deploying to tritonserver.


**Be sure to check out the other components of the Merlin framework; they can help you.**

### Learning objectives

In this notebook, we learn how to deploy an NVTabular Workflow and a trained model from Merlin Models to tritonserver.
- Load NVTabular Workflow
- Load Pre-trained Merlin Models model
- Create Ensemble Graph
- Export Ensemble Graph
- Run Tritonserver
- Send Request to Tritonserver

### Dataset

In this notebook, we will be leveraging the Alibaba dataset [add link here]. It is important to note that the steps we take in this notebook are general and can be applied to any workflow and model. To see how the data is transformed, check the NVTabular example for the Alibaba dataset [link here nvtabular example]. To see how a model is trained on the Alibaba dataset, check the merlin-models project [link here merlin-models].

### Tools

- NVTabular
- Merlin Models
- Merlin Systems

## Load Workflow

First, we will load the workflow created in this example [link to nvtabular workflow example for alibaba dataset].

In [2]:
import os
import nvtabular
input_path = "/merlin-models-data/movielens/"

In [3]:
from nvtabular.workflow import Workflow
workflow_stored_path = os.path.join(input_path, "workflow")

workflow = Workflow.load(workflow_stored_path)

In [4]:
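# Drop "rating" from the workflow inputs; presumably it is the training
# target and is not part of the requests sent at serving time.
workflow = workflow.remove_inputs(["rating"])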

## Load Model

After loading the workflow, we will load the model. This model was trained with the output of the workflow [merlin models alibaba dataset].

In [5]:
import tensorflow as tf
tf_model_path = os.path.join(input_path, "model")

model = tf.keras.models.load_model(tf_model_path)

2022-03-27 03:26:17.198959: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-27 03:26:19.335383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21040 MB memory:  -> device: 0, name: Quadro RTX 8000, pci bus id: 0000:15:00.0, compute capability: 7.5
2022-03-27 03:26:19.336065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 46062 MB memory:  -> device: 1, name: Quadro RTX 8000, pci bus id: 0000:2d:00.0, compute capability: 7.5


## Create the Ensemble Graph

Once we have both the model and the workflow loaded, we can start creating the Ensemble Graph. This graph is created by the user, and its goal is to describe the path of data through your full system. In this example we will only be serving a workflow with a model, but you can add other components that might be necessary to comply with business logic requirements.

For this example, because we have two components, a model and a workflow, we will require two operators. These Ensemble operators, also known as Inference Operators, are meant to abstract away all the "hard parts" of loading a specific component (i.e. workflow or model) into tritonserver.

In the following code block we will leverage two Inference operators, the TransformWorkflow operator and the PredictTensorflow operator. The TransformWorkflow operator will ensure the workflow is correctly saved and packaged with the required config, so that tritonserver will know how to load it. The PredictTensorflow operator will do something similar with the model we loaded before.

Let's give it a try.

In [6]:
from merlin.systems.dag.ops.workflow import TransformWorkflow
from merlin.systems.dag.ops.tensorflow import PredictTensorflow

triton_chain = workflow.input_schema.column_names >> TransformWorkflow(workflow) >> PredictTensorflow(model)
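
To make the `>>` syntax in the cell above less mysterious, here is a toy sketch of how operator overloading can turn a column list and a sequence of operators into a linked graph. `ToyOp` and `describe` are hypothetical names invented for this illustration; this is not how Merlin Systems is implemented, only a minimal model of the chaining idea.

```python
class ToyOp:
    """Hypothetical stand-in for an inference operator; not a Merlin class."""

    def __init__(self, name):
        self.name = name
        self.parents = []  # upstream nodes (or column lists) feeding this operator

    def __rshift__(self, other):
        # `self >> other`: register `self` as the upstream step of `other`.
        other.parents.append(self)
        return other

    def __rrshift__(self, columns):
        # `["a", "b"] >> self`: a plain list on the left selects input columns.
        self.parents.append(list(columns))
        return self


def describe(node):
    """Walk back from the last node and return the steps in execution order."""
    steps = []
    while isinstance(node, ToyOp):
        steps.append(node.name)
        node = node.parents[0] if node.parents else None
    if node is not None:
        steps.append(node)  # the column selection at the head of the chain
    return list(reversed(steps))


chain = ["userId", "movieId"] >> ToyOp("transform") >> ToyOp("predict")
print(describe(chain))
```

Because a Python list does not define `>>`, the interpreter falls back to the right-hand operand's `__rrshift__`, which is why a bare list of column names can start the chain.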


## Export Graph as Ensemble

The last step is to create the ensemble artifacts that tritonserver can consume. To make these artifacts we need to import the Ensemble class. It is responsible for interpreting the graph and exporting the correct files for tritonserver.

To create an Ensemble object, the class consumes a graph. The ensemble object is then bound to the supplied graph. The next step is to export the graph artifacts.

Exporting an ensemble graph requires two things: the path to export the graph to, and a schema representing the starting input of the graph. In other words, the inputs to the graph are the inputs to the first operator of your graph. In the code below, the schema is passed to the Ensemble constructor and the path is passed to `export`.
This example uses the workflow's input schema, which contains the `userId` and `movieId` columns. The commented-out lines show how a request schema could instead be built by hand from a ColumnSchema, for example for a single "user_id" input.

Once we have the graph, the export path and the request schema, we can create the Ensemble and export it.

Let's take a look below.

In [7]:
workflow.input_schema

[{'name': 'userId', 'tags': set(), 'properties': {}, 'dtype': dtype('int64'), 'is_list': False, 'is_ragged': False}, {'name': 'movieId', 'tags': set(), 'properties': {}, 'dtype': dtype('int64'), 'is_list': False, 'is_ragged': False}]

In [8]:
from merlin.systems.dag.ensemble import Ensemble
from merlin.schema import Schema, ColumnSchema
import numpy as np


ensemble = Ensemble(triton_chain, workflow.input_schema)


export_path = os.path.join(input_path, "ensemble")
# request_schema = Schema(
#     [
#         ColumnSchema("user_id", dtype=np.int64),
#     ]
# )

ensemble.export(export_path)

2022-03-27 03:26:21.025378: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: /merlin-models-data/movielens/ensemble/1_predicttensorflow/1/model.savedmodel/assets


INFO:tensorflow:Assets written to: /merlin-models-data/movielens/ensemble/1_predicttensorflow/1/model.savedmodel/assets


(name: "ensemble_model"
 platform: "ensemble"
 input {
   name: "userId"
   data_type: TYPE_INT64
   dims: -1
   dims: -1
 }
 input {
   name: "movieId"
   data_type: TYPE_INT64
   dims: -1
   dims: -1
 }
 output {
   name: "output_1"
   data_type: TYPE_FP32
   dims: -1
   dims: -1
 }
 ensemble_scheduling {
   step {
     model_name: "0_transformworkflow"
     model_version: -1
     input_map {
       key: "movieId"
       value: "movieId"
     }
     input_map {
       key: "userId"
       value: "userId"
     }
     output_map {
       key: "movie_id"
       value: "movie_id_0"
     }
     output_map {
       key: "user_id"
       value: "user_id_0"
     }
   }
   step {
     model_name: "1_predicttensorflow"
     model_version: -1
     input_map {
       key: "movie_id"
       value: "movie_id_0"
     }
     input_map {
       key: "user_id"
       value: "user_id_0"
     }
     output_map {
       key: "output_1"
       value: "output_1"
     }
   }
 },
 [name: "0_transformworkflow

## Verification of Ensemble Artifacts

Once the ensemble export has completed successfully, we can check the export path for the graph artifacts. You should see one folder per operator, named with an ordering number followed by an operator identifier (e.g. 0_transformworkflow, 1_predicttensorflow).

Inside each of those folders, there should be a config.pbtxt and a folder with a number, representing a version, usually 1. The artifacts for the given operator are found inside the version folder. These artifacts vary depending on the operator in question. 

Please see the snapshot below for verification.
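
The layout described above can also be checked programmatically. The following is a minimal sketch using only the standard library; `summarize_model_repository` is a hypothetical helper written for this notebook, and the temporary directory is a throwaway stand-in so the sketch runs without a real export.

```python
import tempfile
from pathlib import Path


def summarize_model_repository(export_path):
    """Map each model folder to whether it has a config and which versions exist."""
    summary = {}
    for model_dir in sorted(Path(export_path).iterdir()):
        if not model_dir.is_dir():
            continue
        versions = sorted(
            child.name
            for child in model_dir.iterdir()
            if child.is_dir() and child.name.isdigit()
        )
        summary[model_dir.name] = {
            "has_config": (model_dir / "config.pbtxt").is_file(),
            "versions": versions,
        }
    return summary


# Throwaway layout for demonstration only; the folder names mirror the model
# names that appear in the exported ensemble config.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["0_transformworkflow", "1_predicttensorflow"]:
        (Path(tmp) / name / "1").mkdir(parents=True)
        (Path(tmp) / name / "config.pbtxt").touch()
    demo_summary = summarize_model_repository(tmp)

for model_name, info in demo_summary.items():
    print(model_name, info)
```

Against a real export you would call `summarize_model_repository(export_path)` and confirm that every folder reports a config file and at least one version.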

## Starting Triton Server

After we have exported the ensemble, we are ready to start the Triton server. First, ensure it is installed in your environment; otherwise, see the install information here [link to triton build and install documentation]. Once installation is verified, you can start the Triton server with the following command:

`tritonserver --model-repository=/ensemble_export_path/`

The export path should be the same one used above during the ensemble export.
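
If you prefer to launch the server from Python rather than a shell, the command can be assembled as an argument list and handed to `subprocess`. This is a sketch under the assumption that the `tritonserver` binary is on your `PATH`; `tritonserver_command` and `start_tritonserver` are hypothetical helper names, and only the `--model-repository` flag shown above is used.

```python
import shlex
import subprocess


def tritonserver_command(model_repository):
    """Build the argument list for the tritonserver command shown above."""
    return ["tritonserver", f"--model-repository={model_repository}"]


def start_tritonserver(model_repository):
    """Launch tritonserver in the background and return the process handle.

    Not called here: it requires the tritonserver binary to be installed.
    """
    return subprocess.Popen(tritonserver_command(model_repository))


cmd = tritonserver_command("/ensemble_export_path/")
print(shlex.join(cmd))
```

Passing a list instead of a single string avoids shell quoting problems when the export path contains spaces.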

## Sending a request to Triton

Now that our tritonserver instance is running, we can send a request to it. This request is composed of values that correspond to the request schema supplied when creating the ensemble.

In this case, the request carries the columns in the workflow's input schema, `userId` and `movieId`. In the code below we create a request, send it to Triton, and then inspect the response to show the full end-to-end experience.

In [9]:
import tritonhttpclient

# create triton client
try:
    triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

client created.




In [10]:
# ensure triton is in a good state
triton_client.is_server_live()
triton_client.get_model_repository_index()
triton_client.load_model(model_name="ensemble_model")

GET /v2/health/live, headers None


ConnectionRefusedError: [Errno 111] Connection refused
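
The `ConnectionRefusedError` in the output above typically means nothing was listening on `localhost:8000` when the cell ran, for example because tritonserver had not been started or was still loading models. One way to guard against this is to poll the server's HTTP readiness endpoint before sending requests. The sketch below uses only the standard library; `wait_until_ready` is a hypothetical helper, and the default URL and timeouts are assumptions you should adjust.

```python
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=30.0, interval_s=1.0):
    """Poll the readiness endpoint; return True once it answers with HTTP 200."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(
                f"{base_url}/v2/health/ready", timeout=interval_s
            ) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or a non-2xx reply: not ready (yet).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` before the health checks in the cell above returns `False` instead of raising if the server never comes up within the timeout.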

In [None]:
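import tritonclient.grpc as grpcclient
import nvtabular.inference.triton as nvt_triton
from merlin.core.dispatch import get_lib

# Assumption: df_lib is the dataframe library (cudf or pandas) returned by
# merlin.core.dispatch.get_lib(); the original cell did not define it.
df_lib = get_lib()

# read in data for request
batch = df_lib.read_parquet(
    os.path.join(input_path, "valid.parquet"), num_rows=3, columns=workflow.input_schema.column_names
)
print(batch)


# create inputs and outputs; names must match the exported ensemble config
# (inputs "userId" and "movieId", output "output_1")
inputs = nvt_triton.convert_df_to_triton_input(
    workflow.input_schema.column_names, batch, grpcclient.InferInput
)

outputs = [grpcclient.InferRequestedOutput(col) for col in ["output_1"]]

# send request to tritonserver; "ensemble_model" is the name in the exported config
with grpcclient.InferenceServerClient("localhost:8001") as client:
    response = client.infer("ensemble_model", inputs, request_id="1", outputs=outputs)


# access individual response columns to get values back.
for col in ["output_1"]:
    print(col, response.as_numpy(col), response.as_numpy(col).shape)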
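
For readers who want to see what such a request looks like on the wire, the sketch below builds the JSON body that the HTTP/REST form of the inference protocol expects (posted to `/v2/models/ensemble_model/infer`). It is an illustration only: `build_infer_request` is a hypothetical helper, the ID values are made up, and the `[rows, 1]` shape is an assumption based on the two dimensions each input declares in the exported config.

```python
import json


def build_infer_request(columns, request_id="1", output_names=("output_1",)):
    """Build a JSON-serializable inference request.

    columns: dict mapping an input name to a list of integer values.
    """
    inputs = [
        {
            "name": name,
            "shape": [len(values), 1],  # assumed: one value per row
            "datatype": "INT64",  # matches TYPE_INT64 in the exported config
            "data": [int(v) for v in values],
        }
        for name, values in columns.items()
    ]
    return {
        "id": request_id,
        "inputs": inputs,
        "outputs": [{"name": name} for name in output_names],
    }


# Made-up IDs, used only to show the structure of the payload.
body = build_infer_request({"userId": [1, 2, 3], "movieId": [10, 20, 30]})
print(json.dumps(body, indent=2))
```

The client libraries build an equivalent structure for you, so this is mainly useful for debugging or for calling the server from a language without a Triton client.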