In [1]:
# Copyright 2020 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Model Deployment with Merlin Inference API

## Overview

In the previous notebook we showed how to preprocess data with NVTabular and train a TF MLP model using the NVTabular TF dataloader. We also learned how to save a workflow and a trained TF model. In this notebook, we show example request scripts sent to Triton Inference Server:
- to transform new/streaming data with the NVTabular library
- to generate prediction results for new data from the trained model
- to deploy the end-to-end pipeline.

## Getting Started

In [9]:
# External dependencies
import os

import numpy as np
from tritonclient.utils import np_to_triton_dtype
import tritonclient.http as httpclient
import nvtabular
import cudf
from timeit import default_timer as timer
from datetime import timedelta

In [2]:
BASE_DIR = '/working_dir/data/'

We define our base directory, containing the raw and processed data.

In [3]:
!ls $BASE_DIR

ml-25m	train  train.parquet  valid  valid.parquet


## Verify Triton Is Running Correctly

Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system use curl to access the HTTP endpoint that indicates server status.

In [13]:
!curl -i 10.110.20.127:8000/v2/health/ready

HTTP/1.1 200 OK
Content-Length: 0
Content-Type: text/plain



The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
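
The same readiness check can be done from Python. The following is a minimal sketch using only the standard library; the helper names `health_url` and `is_ready` are hypothetical and not part of any Triton client package, and the host string is whatever address your server listens on.

```python
import urllib.request


def health_url(host: str) -> str:
    """Build the readiness URL for a Triton HTTP endpoint given as 'host:port'."""
    return f"http://{host}/v2/health/ready"


def is_ready(host: str, timeout: float = 2.0) -> bool:
    """Return True only if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError, so an unreachable
        # server or an error status is treated as "not ready".
        return False
```

Calling `is_ready("10.110.20.127:8000")` would mirror the `curl` command above and return `True` when the server answers with status 200.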

In this step we generate the `config.pbtxt` file and save our workflow as a .pkl file, so that it can be loaded again at inference time to transform test (newly arriving) datasets. This step serializes the workflow that was created from the training set.

In [27]:
from nvtabular.inference.triton import generate_triton_model
generate_triton_model(workflow, "movielens_nvt", "/working_dir/models/movielens_nvt/")

Let's send a request to the running Triton Inference Server using our raw validation set in parquet format. The request loads the saved NVTabular workflow and then transforms the new dataset samples.

In [11]:
# read in the workflow (to get input/output schema to call triton with)

workflow = nvtabular.Workflow.load("/working_dir/models/movielens_nvt/1/workflow")

# read in a batch of data to get transforms for
batch = cudf.read_parquet("/working_dir/data/valid.parquet", num_rows=2)[workflow.column_group.input_column_names]

print(batch)

# convert the batch to Triton inputs
columns = [(col, batch[col][0:2]) for col in workflow.column_group.input_column_names]
inputs = []

col_dtypes = [np.int32, np.int32, np.float32]

for i, (name, col) in enumerate(columns):
    d = col.values_host.astype(col_dtypes[i])
    d = d.reshape(len(d),1)
    inputs.append(httpclient.InferInput(name, d.shape, np_to_triton_dtype(col_dtypes[i])))
    inputs[i].set_data_from_numpy(d)

# placeholder variables for the output
outputs = [httpclient.InferRequestedOutput(name) for name in workflow.column_group.columns]

# make the request
with httpclient.InferenceServerClient("<ip-host>:8000") as client:
    response = client.infer("movielens_nvt", inputs, request_id="1",outputs=outputs)
    
# convert output from triton back to a nvt dataframe  
output = cudf.DataFrame({col: response.as_numpy(col).T[0] for col in workflow.column_group.columns})
print(output)



          userId  movieId  rating
15347762   99476   104374     3.5
16647840  107979     2634     4.0
   userId  movieId  rating
0   99476    19997       1
1  107979     2543       1
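
For readers curious about what such a request looks like on the wire, `client.infer` posts to the `/v2/models/<model-name>/infer` endpoint. The sketch below builds an equivalent JSON body for the two rows printed above; the helper names `infer_input` and `build_request` are hypothetical, and the client library can also send tensor data in a binary form rather than as JSON lists.

```python
import json


def infer_input(name, datatype, values):
    """Describe one input tensor of shape [len(values), 1] in the v2 JSON format."""
    return {
        "name": name,
        "shape": [len(values), 1],
        "datatype": datatype,
        "data": list(values),
    }


def build_request(columns, output_names, request_id="1"):
    """Assemble a JSON body for POST /v2/models/<model-name>/infer."""
    return {
        "id": request_id,
        "inputs": [infer_input(n, t, v) for n, t, v in columns],
        "outputs": [{"name": n} for n in output_names],
    }


# Column values copied from the batch printed above.
columns = [
    ("userId", "INT32", [99476, 107979]),
    ("movieId", "INT32", [104374, 2634]),
    ("rating", "FP32", [3.5, 4.0]),
]
body = build_request(columns, ["userId", "movieId", "rating"])
print(json.dumps(body, indent=2))
```

Each input carries its name, shape, and datatype, which is the same information passed to `httpclient.InferInput` in the cell above.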


## Running the MovieLens rating classification example

In the `movielens_TF` notebook we saved our TF model with the following script:

```
model.save('/working_dir/models/movielens_tf/1/model.savedmodel')
```

A minimal model repository for a TensorFlow SavedModel model is:
```
  <model-repository-path>/
    <model-name>/
      config.pbtxt
      1/
        model.savedmodel/
           <saved-model files>
```
Let's check out our model repository layout.

In [None]:
# !apt-get install tree

In [7]:
!tree /working_dir/models/movielens_tf

/working_dir/models/movielens_tf
├── 1
│   └── model.savedmodel
│       ├── assets
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── config.pbtxt

4 directories, 4 files
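
The layout shown above can also be checked programmatically before starting the server. This is a minimal sketch, assuming the minimal SavedModel layout described earlier and a version directory named `1`; `missing_entries` is a hypothetical helper, not a Triton utility.

```python
from pathlib import Path


def missing_entries(model_dir, version="1"):
    """List the entries of the minimal SavedModel layout absent from model_dir."""
    root = Path(model_dir)
    expected = [
        root / "config.pbtxt",
        root / version / "model.savedmodel" / "saved_model.pb",
    ]
    # Report paths relative to the model directory for readability.
    return [str(p.relative_to(root)) for p in expected if not p.exists()]
```

An empty list from `missing_entries("/working_dir/models/movielens_tf")` would indicate that both the configuration file and the saved model are in place.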


You can see that we have a config.pbtxt file. Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a `config.pbtxt` file specified as [ModelConfig protobuf](https://github.com/triton-inference-server/server/blob/r20.12/src/core/model_config.proto).
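
To give a feel for what such a file contains, the sketch below generates a plausible `config.pbtxt` for this model as text. It is an illustration, not the file used in this notebook: the input names and `TYPE_INT32` types are taken from the request cell that follows, while the `TYPE_FP32` output type, the `[ 1 ]` dims, and the `max_batch_size` value are assumptions. The helpers `tensor_block` and `savedmodel_config` are hypothetical.

```python
def tensor_block(name, data_type, dims):
    """Format one input or output entry of a model configuration."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        "  {\n"
        f'    name: "{name}"\n'
        f"    data_type: {data_type}\n"
        f"    dims: [ {dims_text} ]\n"
        "  }"
    )


def savedmodel_config(name, inputs, outputs, max_batch_size):
    """Render a minimal configuration for a TensorFlow SavedModel model."""
    lines = [
        f'name: "{name}"',
        'platform: "tensorflow_savedmodel"',
        f"max_batch_size: {max_batch_size}",
        "input [",
        ",\n".join(tensor_block(*t) for t in inputs),
        "]",
        "output [",
        ",\n".join(tensor_block(*t) for t in outputs),
        "]",
    ]
    return "\n".join(lines) + "\n"


config_text = savedmodel_config(
    "movielens_tf",
    inputs=[("userId", "TYPE_INT32", [1]), ("movieId", "TYPE_INT32", [1])],
    outputs=[("dense_3", "TYPE_FP32", [1])],  # output type is an assumption
    max_batch_size=1024,  # assumed value, not taken from the notebook
)
print(config_text)
```

With a non-zero `max_batch_size`, the batch dimension is implicit, so `dims: [ 1 ]` corresponds to the `(n, 1)` arrays sent in the request.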

In [25]:
from tritonclient.utils import *
import tritonclient.http as httpclient
import nvtabular
import cudf
from timeit import default_timer as timer
from datetime import timedelta


# read in a batch of data to get predictions for
batch = cudf.read_parquet("/working_dir/data/valid/*.parquet", num_rows=2)

# keep the first two rows and drop the "rating" column, which is not sent to the model
batch = batch[0:2]
batch = batch.drop(columns=["rating"])

inputs = [] 

for i, col in enumerate(batch.columns):
    d = batch[col].values_host.astype(np.int32)
    d = d.reshape(len(d),1)
    inputs.append(httpclient.InferInput(col, d.shape, np_to_triton_dtype(np.int32)))
    inputs[i].set_data_from_numpy(d)

outputs = [httpclient.InferRequestedOutput("dense_3")]

with httpclient.InferenceServerClient("<ip-host>:8000") as client:
    response = client.infer("movielens_tf", inputs, request_id="1",outputs=outputs)

print(response.as_numpy("dense_3"))

[[0.6248126]
 [0.6249962]]


## End-to-End Inference Pipeline

In the request example below, we feed a raw, unprocessed parquet file and obtain the final prediction results from the last layer of the TF model that we built in the `movielens_TF` notebook. The output is one predicted value per input row, which the code labels as a softmax result.

In [26]:
from tritonclient.utils import *
import tritonclient.http as httpclient
import nvtabular
import cudf
from timeit import default_timer as timer
from datetime import timedelta

# read in a batch of raw data to send through the end-to-end pipeline
batch = cudf.read_parquet("/working_dir/data/valid.parquet", num_rows=2)
batch = batch[0:2]

print(batch, "\n")

# convert the batch to Triton inputs
inputs = []

col_names = ['userId_ens', 'movieId_ens', 'rating_ens'] 
col_dtypes = [np.int32, np.int32, np.float32]

for i, col in enumerate(batch.columns):
    d = batch[col].values_host.astype(col_dtypes[i])
    d = d.reshape(len(d),1)
    inputs.append(httpclient.InferInput(col_names[i], d.shape, np_to_triton_dtype(col_dtypes[i])))
    inputs[i].set_data_from_numpy(d)

# placeholder variables for the output
outputs = [httpclient.InferRequestedOutput("predicted_rating")]

# make the request
with httpclient.InferenceServerClient("<ip-host>:8000") as client:
    response = client.infer("movielens", inputs, request_id="1",outputs=outputs)

print("predicted softmax result:\n", response.as_numpy('predicted_rating'))

          userId  movieId  rating
15347762   99476   104374     3.5
16647840  107979     2634     4.0 

predicted softmax result:
 [[0.6248126]
 [0.6249962]]
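
The cells above import `timer` and `timedelta` without using them. One way they could be used to measure request latency is sketched below; `timed` is a hypothetical helper, and `send_request` is a stand-in for a real `client.infer(...)` call so that the example runs without a server.

```python
from datetime import timedelta
from timeit import default_timer as timer


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock time as a timedelta)."""
    start = timer()
    result = fn(*args, **kwargs)
    return result, timedelta(seconds=timer() - start)


def send_request(n_rows):
    # Hypothetical stand-in for a client.infer(...) call.
    return [0.0] * n_rows


result, elapsed = timed(send_request, 2)
print("request took", elapsed)
```

Wrapping the `client.infer` call in `timed` would give the round-trip time of a single request, including the NVTabular transform and the TF model in the end-to-end case.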
