In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="https://developer.download.nvidia.com//notebooks/dlsw-notebooks/remtting-started-session-based-03-serving-session-based-model-torch-backend/nvidia_logo.png" style="width: 90px; float: right;">

# Serving a Session-based Recommendation model with Torch Backend

This notebook was created using the latest stable [merlin-pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-pytorch/tags) container.

Before running this notebook, you should have already executed the `01-ETL-with-NVTabular.ipynb` and `02-session-based-XLNet-with-PyT.ipynb` notebooks and saved the NVT workflow and the trained session-based model.

In this notebook, you will learn how to serve a trained Transformer-based PyTorch model on NVIDIA [Triton Inference Server](https://github.com/triton-inference-server/server) (TIS) with the Torch backend, using the [Merlin Systems](https://github.com/NVIDIA-Merlin/systems) library. One common way to do inference with a trained model is to use TorchScript, an intermediate representation of a PyTorch model that can be run in Python as well as in a high-performance environment such as C++. [TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html) is the recommended model format for scaled inference and deployment. The TIS [PyTorch (LibTorch) backend](https://github.com/triton-inference-server/pytorch_backend) is designed to run TorchScript models using the PyTorch C++ API.

[Triton Inference Server](https://github.com/triton-inference-server/server) (TIS) simplifies the deployment of AI models at scale in production. TIS provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. It supports a number of different machine learning frameworks such as TensorFlow and PyTorch.

### Import required libraries

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import cudf
import glob
import torch 

from transformers4rec import torch as tr
from merlin.io import Dataset

from merlin.core.dispatch import make_df  # noqa
from merlin.systems.dag import Ensemble  # noqa
from merlin.systems.dag.ops.pytorch import PredictPyTorch  # noqa



We define the input, output, and model paths.

In [3]:
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "/workspace/data")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", f"{INPUT_DATA_DIR}/sessions_by_day")
model_path = os.environ.get("model_path", f"{INPUT_DATA_DIR}/saved_model")

### Set the schema object

We create the schema object by reading the `schema.pbtxt` file generated by the NVTabular pipeline in the previous notebook, `01-ETL-with-NVTabular`.

In [4]:
from merlin_standard_lib import Schema
SCHEMA_PATH = os.environ.get("INPUT_SCHEMA_PATH", "/workspace/data/processed_nvt/schema.pbtxt")
schema = Schema().from_proto_text(SCHEMA_PATH)

We need to load the saved model to be able to serve it on TIS.

In [5]:
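import cloudpickle

# Load the pickled model saved by the previous notebook; the context
# manager makes sure the file handle is closed after loading.
with open(os.path.join(model_path, "t4rec_model_class.pkl"), "rb") as f:
    loaded_model = cloudpickle.load(f)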

Switch the model to eval mode. Calling `model.eval()` before tracing sets the dropout and batch normalization layers to evaluation mode. Failing to do this might yield inconsistent inference results.
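To see why eval mode matters, here is a minimal sketch using a standalone dropout layer. The layer and the input tensor are made up for illustration and are not part of the notebook's model. In training mode dropout randomly zeroes elements, while in eval mode it passes the input through unchanged.

```python
import torch

# Hypothetical standalone layer, not part of the notebook's model.
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()  # training mode: elements are randomly zeroed
y_train = dropout(x)

dropout.eval()  # eval mode: dropout is a no-op
y_eval = dropout(x)

print("elements zeroed in train mode:", int((y_train == 0).sum()))
print("output identical to input in eval mode:", torch.equal(y_eval, x))
```

A model traced while still in training mode would bake this random behavior into the exported graph, which is why the switch happens before tracing.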

In [6]:
model = loaded_model.cuda()
model.eval()

Model(
  (heads): ModuleList(
    (0): Head(
      (body): SequentialBlock(
        (0): TabularSequenceFeatures(
          (to_merge): ModuleDict(
            (continuous_module): SequentialBlock(
              (0): ContinuousFeatures(
                (filter_features): FilterFeatures()
                (_aggregation): ConcatFeatures()
              )
              (1): SequentialBlock(
                (0): DenseBlock(
                  (0): Linear(in_features=2, out_features=64, bias=True)
                  (1): ReLU(inplace=True)
                )
              )
              (2): AsTabular()
            )
            (categorical_module): SequenceEmbeddingFeatures(
              (filter_features): FilterFeatures()
              (embedding_tables): ModuleDict(
                (item_id-list): Embedding(503, 64, padding_idx=0)
                (category-list): Embedding(126, 64, padding_idx=0)
              )
            )
          )
          (_aggregation): ConcatFeatures()
        

### Trace the model

We serve the model with the PyTorch backend, which executes TorchScript models. All models created in PyTorch using the Python API must be traced or scripted to produce a TorchScript model. To trace the model, we use the [torch.jit.trace](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) API, which takes the model as a Python function or `torch.nn.Module`, together with an example input that is passed to the function while tracing.
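As a rough illustration of tracing a module that takes a dictionary of tensors, here is a minimal sketch with a toy model. `TinyModel`, its sizes, and the example tensor are hypothetical and only mimic the input style of the session-based model; they are not the notebook's model.

```python
import torch


class TinyModel(torch.nn.Module):
    """Hypothetical toy model that takes a dict of tensors as input."""

    def __init__(self, num_items=10, dim=4):
        super().__init__()
        self.embedding = torch.nn.Embedding(num_items, dim, padding_idx=0)
        self.scorer = torch.nn.Linear(dim, num_items)

    def forward(self, inputs):
        # Average the item embeddings over the sequence, then score every item.
        pooled = self.embedding(inputs["item_id-list"]).mean(dim=1)
        return self.scorer(pooled)


toy_model = TinyModel().eval()
example = {"item_id-list": torch.tensor([[1, 2, 0], [3, 0, 0]])}

# The dict is wrapped in a tuple so it is passed as one positional argument.
toy_traced = torch.jit.trace(toy_model, (example,), strict=True)

with torch.no_grad():
    same = torch.allclose(toy_traced(example), toy_model(example))
print("traced output matches eager output:", same)
```

Tracing records the operations executed for the example input, so comparing the traced and eager outputs on the same input is a cheap sanity check before exporting.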

In [7]:
train_paths = os.path.join(OUTPUT_DIR, f"{1}/train.parquet")
dataset = Dataset(train_paths)

In [8]:
sparse_max = {'age_days-list': 20,
 'weekday_sin-list': 20,
 'item_id-list': 20,
 'category-list': 20}

from transformers4rec.torch.utils.data_utils import MerlinDataLoader

def generate_dataloader(schema, dataset, batch_size=128, seq_length=20):
    loader = MerlinDataLoader.from_schema(
            schema,
            dataset,
            batch_size=batch_size,
            max_sequence_length=seq_length,
            shuffle=False,
            sparse_as_dense=True,
            sparse_max=sparse_max
        )
    return loader

Create a dict of tensors to pass as the example input to `torch.jit.trace()`.

In [9]:
loader = generate_dataloader(schema, dataset)
train_dict = next(iter(loader))

Let's check out the `item_id-list` column in the `train_dict` dictionary.

In [10]:
train_dict[0]['item_id-list']

tensor([[27, 26,  7,  ..., 32, 14,  0],
        [15,  5,  5,  ...,  0,  0,  0],
        [17, 12,  9,  ...,  0,  0,  0],
        ...,
        [30, 13, 21,  ...,  0,  0,  0],
        [19, 14,  8,  ...,  0,  0,  0],
        [11, 27, 16,  ...,  0,  0,  0]], device='cuda:0')

In [11]:
traced_model = torch.jit.trace(model, train_dict[0], strict=True)



Generate the model input and output schemas to pass to the `PredictPyTorch` operator below.

In [12]:
input_schema = model.input_schema
output_schema = model.output_schema

In [13]:
input_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.int_domain.min,properties.int_domain.max
0,age_days-list,"(Tags.LIST, Tags.CONTINUOUS)","DType(name='float32', element_type=<ElementTyp...",True,False,0,0
1,weekday_sin-list,"(Tags.LIST, Tags.CONTINUOUS)","DType(name='float32', element_type=<ElementTyp...",True,False,0,0
2,item_id-list,"(Tags.CATEGORICAL, Tags.ITEM_ID, Tags.ITEM, Ta...","DType(name='int64', element_type=<ElementType....",True,False,0,502
3,category-list,"(Tags.LIST, Tags.CATEGORICAL)","DType(name='int64', element_type=<ElementType....",True,False,0,125


Let's create a folder in which we can store the exported models and the config files.

In [14]:
import shutil
ens_model_path = os.environ.get("ens_model_path", f"{INPUT_DATA_DIR}/models")
# Make sure we start from a clean export directory
if os.path.isdir(ens_model_path):
    shutil.rmtree(ens_model_path)
os.mkdir(ens_model_path)

We use the `PredictPyTorch` operator, which takes a PyTorch model and packages it correctly for Triton Inference Server to run on the PyTorch backend.

In [15]:
torch_op = input_schema.column_names >> PredictPyTorch(
    traced_model, input_schema, output_schema
)

The last step is to create the ensemble artifacts that Triton Inference Server can consume. To make these artifacts, we use the `Ensemble` class. The class is responsible for interpreting the graph and exporting the correct files for the server.

When we create an `Ensemble` object, we supply the graph and a schema representing the starting input of the graph. The inputs to the ensemble graph are the inputs to the first operator of our graph. After we create the `Ensemble`, we export the graph by supplying an export path to the `ensemble.export` function. This returns an ensemble config, which represents the entire inference pipeline, and a list of node-specific configs.

In [16]:
ensemble = Ensemble(torch_op, input_schema)
ens_config, node_configs = ensemble.export(ens_model_path)
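To check what the export wrote to disk, one option is to list the files under the export directory. The helper below is a small sketch using only the standard library; `list_exported_files` is a hypothetical name, and the exact files it prints depend on the installed Merlin Systems version.

```python
import os


def list_exported_files(root):
    """Return the sorted relative paths of all files under `root`."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            paths.append(os.path.relpath(full_path, root))
    return sorted(paths)


# Usage in this notebook (assumes `ens_model_path` was exported above):
# for path in list_exported_files(ens_model_path):
#     print(path)
```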

## Starting Triton Server

It is time to deploy all the models as an ensemble model to Triton Inference Server (TIS). After we export the ensemble, we are ready to start TIS. You can start the Triton server by running the following command in your terminal:

`tritonserver --model-repository=<ensemble_export_path>`

For the `--model-repository` argument, specify the same path as the export_path that you specified previously in the `ensemble.export` method. This command will launch the server and load all the models to the server. Once all the models are loaded successfully, you should see READY status printed out in the terminal for each loaded model.

In [17]:
import tritonclient.http as client

# Create a triton client
try:
    triton_client = client.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

client created.


After we create the client and verify that it is connected to the server instance, we can communicate with the server and ensure that all the models are loaded correctly.

In [18]:
# ensure triton is in a good state
triton_client.is_server_live()
triton_client.get_model_repository_index()

GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '121'}>
bytearray(b'[{"name":"0_predictpytorchtriton","version":"1","state":"READY"},{"name":"executor_model","version":"1","state":"READY"}]')


[{'name': '0_predictpytorchtriton', 'version': '1', 'state': 'READY'},
 {'name': 'executor_model', 'version': '1', 'state': 'READY'}]
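The repository index is a list of dictionaries, so the READY check can also be done programmatically instead of by eye. The following is a minimal sketch; `models_not_ready` is a hypothetical helper, and the sample index is copied from the output above.

```python
def models_not_ready(repository_index):
    """Return the names of models whose state is not READY."""
    return [
        entry["name"]
        for entry in repository_index
        if entry.get("state") != "READY"
    ]


# Index copied from the output above.
index = [
    {"name": "0_predictpytorchtriton", "version": "1", "state": "READY"},
    {"name": "executor_model", "version": "1", "state": "READY"},
]
print(models_not_ready(index))  # → []
```

In the notebook, the same helper could be applied to the return value of `triton_client.get_model_repository_index()`; an empty list means every model loaded successfully.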

### Send request to Triton and get the response

The last step of a machine learning (ML)/deep learning (DL) pipeline is to deploy the model to production, and get responses for a given query or a set of queries.
In this section, we generate a dataframe that we can send as a request to TIS. Note that this is a transformed dataframe. We also need our dataset's list columns to be padded to the max sequence length that was set in the ETL pipeline.

We do not serve the raw dataframe because, in a production setting, we want to transform the input data the same way it was transformed during training (ETL). We need to apply the same mean/std to continuous features and use the same categorical mapping to convert the categories to contiguous integers before we use the deployed DL model for a prediction. Therefore, we use a transformed dataset that is processed in the same way as the train set.
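The padding requirement can be sketched in plain Python. The helper below right-pads each session with zeros up to a fixed length, which matches the layout of the padded tensors shown in this notebook. `pad_sequences` is a hypothetical name, and truncating longer sessions to their first `max_length` items is an assumption made for this sketch, not necessarily what the data loader does.

```python
def pad_sequences(sequences, max_length=20, pad_value=0):
    """Right-pad (or truncate) each sequence to exactly `max_length` items."""
    padded = []
    for seq in sequences:
        # Assumption for this sketch: keep the first `max_length` items.
        trimmed = list(seq)[:max_length]
        padded.append(trimmed + [pad_value] * (max_length - len(trimmed)))
    return padded


sessions = [[14, 71, 45], [6, 4]]
print(pad_sequences(sessions, max_length=5))
# → [[14, 71, 45, 0, 0], [6, 4, 0, 0, 0]]
```

In the cell below, the data loader takes care of this step through the `sparse_as_dense` and `sparse_max` arguments, so no manual padding is needed.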

In [19]:
eval_batch_size = 32
eval_paths = os.path.join(OUTPUT_DIR, f"{1}/valid.parquet")
eval_dataset = Dataset(eval_paths, shuffle=False)
eval_loader = generate_dataloader(schema, eval_dataset, batch_size=eval_batch_size)
test_dict = next(iter(eval_loader))

In [21]:
test_dict[0]['item_id-list'][0]

tensor([ 14,  71,  45,  35, 140,  89,   7, 115, 196,  19,   2,  10,   0,   0,
          0,   0,   0,   0,   0,   0], device='cuda:0')

In [22]:
df_cols = {}
for name, tensor in test_dict[0].items():
    if name in input_schema.column_names:
        df_cols[name] = tensor.cpu().numpy()
        if len(tensor.shape) > 1:
            df_cols[name] = list(df_cols[name])
            
df = make_df(df_cols)
print(df.shape)
df.head()

(32, 4)


Unnamed: 0,age_days-list,weekday_sin-list,item_id-list,category-list
0,"[0.9509504, 0.3658292, 0.10605793, 0.8901615, ...","[0.9222485, 0.1284022, 0.92028487, 0.3788347, ...","[14, 71, 45, 35, 140, 89, 7, 115, 196, 19, 2, ...","[3, 14, 8, 6, 27, 16, 2, 20, 31, 4, 1, 3, 0, 0..."
1,"[0.23776619, 0.062151734, 0.059320305, 0.37635...","[0.75332737, 0.18823138, 0.5440263, 0.27081072...","[6, 4, 42, 97, 208, 5, 50, 45, 7, 2, 0, 0, 0, ...","[1, 2, 7, 18, 34, 1, 9, 8, 2, 1, 0, 0, 0, 0, 0..."
2,"[0.6510976, 0.002470178, 0.19554594, 0.6035013...","[0.0155129675, 0.067784436, 0.6556247, 0.90605...","[25, 38, 126, 2, 14, 10, 8, 14, 16, 28, 0, 0, ...","[5, 7, 21, 1, 3, 3, 1, 3, 3, 5, 0, 0, 0, 0, 0,..."
3,"[0.62920743, 0.7574743, 0.1393074, 0.14867006,...","[0.44066542, 0.6632927, 0.51982445, 0.8328001,...","[4, 12, 26, 19, 23, 124, 22, 2, 50, 38, 0, 0, ...","[2, 3, 5, 4, 4, 22, 4, 1, 9, 7, 0, 0, 0, 0, 0,..."
4,"[0.4540216, 0.66014326, 0.4065639, 0.90007794,...","[0.5709135, 0.41235211, 0.21241243, 0.01835139...","[33, 29, 46, 15, 14, 27, 38, 115, 60, 122, 0, ...","[6, 6, 8, 3, 3, 5, 7, 20, 11, 21, 0, 0, 0, 0, ..."


Once our models are successfully loaded into TIS, we can send a request to TIS and get a response for our query with the `send_triton_request` utility function.

In [23]:
from merlin.systems.triton.utils import send_triton_request
response = send_triton_request(input_schema, df[input_schema.column_names], output_schema.column_names)
print(response)

{'next-item': array([[ -9.769284 ,  -3.3535378,  -3.5593104, ..., -10.696345 ,
         -9.082857 ,  -9.554779 ],
       [ -9.769166 ,  -3.3535283,  -3.5592926, ..., -10.696279 ,
         -9.082819 ,  -9.55474  ],
       [ -9.768643 ,  -3.3534937,  -3.559177 , ..., -10.696127 ,
         -9.0826   ,  -9.554597 ],
       ...,
       [ -9.769294 ,  -3.3535573,  -3.559361 , ..., -10.696278 ,
         -9.082909 ,  -9.554747 ],
       [ -9.769636 ,  -3.3535905,  -3.5594552, ..., -10.696384 ,
         -9.083048 ,  -9.554836 ],
       [ -9.769545 ,  -3.353582 ,  -3.5594208, ..., -10.696352 ,
         -9.083025 ,  -9.554812 ]], dtype=float32)}


In [24]:
response['next-item'].shape

(32, 503)

We return a response for each request in the df. Each row of the `response['next-item']` array holds one logit per item in the catalog, plus one logit for the padded, OOV, or null item. The first score in each row corresponds to that padded, OOV, or null item. Note that we don't have OOV or null items in our synthetically generated datasets.
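To turn a row of logits into recommendations, one simple approach is to pick the highest-scoring item ids while skipping index 0, which is reserved for the padded, OOV, or null item. The sketch below uses only the standard library; `top_k_items` and the toy scores are hypothetical.

```python
import heapq


def top_k_items(scores, k=5, skip_ids=(0,)):
    """Return the k item ids with the highest scores, ignoring `skip_ids`."""
    candidates = [
        (score, item_id)
        for item_id, score in enumerate(scores)
        if item_id not in skip_ids
    ]
    return [item_id for _score, item_id in heapq.nlargest(k, candidates)]


# Toy scores: index 0 has the highest value but is skipped.
toy_scores = [5.0, -3.4, -3.6, -10.7, -9.1]
print(top_k_items(toy_scores, k=2))  # → [1, 2]
```

For the response above, a call such as `top_k_items(response['next-item'][0].tolist(), k=5)` would give the top five encoded item ids for the first session. These are the integer ids produced by the ETL categorical mapping, so they would still need to be mapped back to the original item ids.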

This is the end of this suite of examples. You performed feature engineering with NVTabular, trained Transformer-based session-based recommendation models with Transformers4Rec, deployed a trained model to Triton Inference Server with the Torch backend, and sent requests to and received responses from the server. If you would like to learn how to serve a TF4Rec model with the Python backend, please visit this [example](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/end-to-end-session-based/02-End-to-end-session-based-with-Yoochoose-PyT.ipynb).