In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

## Deploying a Multi-Stage RecSys into Production with Merlin Systems and Triton Inference Server

By the time you open this notebook, we expect that you have already run the first notebook, `01-Building-Recommender-Systems-with-Merlin.ipynb`, and exported all the required files and models.

We are going to generate recommended items for a given user query (user_id) by following the steps described in the figure below.

![tritonensemble](../images/triton_ensemble.png)

The Merlin Systems library provides a set of operators for serving multi-stage recommender systems built with TensorFlow on [Triton Inference Server](https://github.com/triton-inference-server/server) (TIS) easily and efficiently. Below, we go through these operators and demonstrate their use in serving a multi-stage system on Triton.

### Import required libraries and functions

In [None]:
%pip install tensorflow "feast<0.20" faiss-gpu

In [None]:
import os
import numpy as np
import pandas as pd
import feast
import faiss
from nvtabular import ColumnSchema, Schema

from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.session_filter import FilterCandidates
from merlin.systems.dag.ops.softmax_sampling import SoftmaxSampling
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.unroll_features import UnrollFeatures
from merlin.systems.triton.utils import run_triton_server, run_ensemble_on_tritonserver

We use the `configure_tensorflow` function to prevent TensorFlow from claiming the entire GPU memory. With this function, we let TensorFlow allocate 50% of the available GPU memory.

In [3]:
from nvtabular.loader.tf_utils import configure_tensorflow
configure_tensorflow()

04/05/2022 06:52:55 PM INFO:init


<function tensorflow.python.dlpack.dlpack.from_dlpack(dlcapsule)>

### Register our features on feature store

The Feast feature registry is a central catalog of all the feature definitions and their related metadata (read more [here](https://docs.feast.dev/getting-started/architecture-and-components/registry)). We have defined our user and item features in the `user_features.py` and `item_features.py` files. With `FeatureView()`, users can register their organization's data sources in Feast and then use those data sources for both training and online inference. In the `user_features.py` and `item_features.py` files, we tell Feast where to find the user and item features.

Before we move on to the next steps, we need to run the `feast apply` command as shown below. This registers our features: it creates the feature registry, which stores all entity and feature view definitions, and sets up a local SQLite online store called `online_store.db`.

In [4]:
BASE_DIR = os.environ.get("BASE_DIR", "/Merlin/examples/Deploying-multi-stage-RecSys/")

# define feature repo path
feast_repo_path = BASE_DIR + "feature_repo/"

In [5]:
%cd $feast_repo_path
!feast apply

/Merlin/examples/Deploying-multi-stage-RecSys/feature_repo
Created data source /Merlin/examples/Deploying-multi-stage-RecSys/feature_repo/data/user_features.parquet
Created data source /Merlin/examples/Deploying-multi-stage-RecSys/feature_repo/data/item_features.parquet
Created entity item_id
Created entity user_id
Created feature view item_features
Created feature view user_features

Created sqlite table feature_repo_item_features
Created sqlite table feature_repo_user_features



### Loading features from the offline store into the online store

After running `feast apply` to register our features and create our local online store, we need to perform [materialization](https://docs.feast.dev/how-to-guides/running-feast-in-production). Materialization keeps the online store up to date and ready for prediction: it runs a job that loads feature data from our feature view sources into the online store. As we add new features to our offline stores, we can keep materializing them so that the online store holds the latest feature values for each user.
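
To make the idea of "finding the latest feature values" concrete, here is a minimal conceptual sketch in plain Python. It is not Feast's implementation: the `materialize_latest` helper, the `event_timestamp` field name, and the sample rows are all assumptions made up for illustration. It keeps, for each user, the newest row whose timestamp falls inside the requested window.

```python
from datetime import datetime, timezone


def materialize_latest(rows, start, end):
    """Keep the newest row per user whose timestamp lies in [start, end]."""
    latest = {}
    for row in rows:
        ts = row["event_timestamp"]
        if not (start <= ts <= end):
            continue  # outside the materialization window
        key = row["user_id"]
        if key not in latest or ts > latest[key]["event_timestamp"]:
            latest[key] = row
    return latest


utc = timezone.utc
# Made-up offline rows, for illustration only.
offline_rows = [
    {"user_id": 1, "user_age": 3, "event_timestamp": datetime(2021, 1, 1, tzinfo=utc)},
    {"user_id": 1, "user_age": 4, "event_timestamp": datetime(2022, 1, 1, tzinfo=utc)},
    {"user_id": 2, "user_age": 7, "event_timestamp": datetime(2020, 6, 1, tzinfo=utc)},
]

online_store = materialize_latest(
    offline_rows,
    start=datetime(1995, 1, 1, 1, 1, 1, tzinfo=utc),
    end=datetime(2025, 1, 1, 1, 1, 1, tzinfo=utc),
)
print(online_store[1]["user_age"])  # the 2022 row wins for user 1
```

A dictionary keyed by entity id is a reasonable mental model for the online store here: a lookup by `user_id` returns one row of current feature values.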

When you run the `feast materialize ..` command below, the message <i>Materializing 2 feature views from 1995-01-01 01:01:01+00:00 to 2025-01-01 01:01:01+00:00 into the sqlite online store</i> is printed.

Note that the materialization step takes some time.

In [6]:
!feast materialize 1995-01-01T01:01:01 2025-01-01T01:01:01

Materializing 2 feature views from 1995-01-01 01:01:01+00:00 to 2025-01-01 01:01:01+00:00 into the sqlite online store.

item_features:
100%|█████████████████████████████████████████████████████████| 1298/1298 [00:00<00:00, 5214.52it/s]
user_features:
100%|█████████████████████████████████████████████████████████| 1322/1322 [00:00<00:00, 1621.24it/s]


Now, let's check our feature_repo structure again after we ran `apply` and `materialize` commands.

In [7]:
# set up the base dir for the feature store
feature_repo_path = os.path.join(BASE_DIR, 'feature_repo')
!tree $feature_repo_path

/Merlin/examples/Deploying-multi-stage-RecSys/feature_repo
├── __init__.py
├── data
│   ├── item_features.parquet
│   ├── online_store.db
│   ├── registry.db
│   └── user_features.parquet
├── feature_store.yaml
├── item_features.py
└── user_features.py

1 directory, 8 files


### Set up Faiss index, create feature store client and objects for the Triton ensemble

Create a folder for the Faiss index.

In [8]:
# create the folder only if it does not exist yet
os.makedirs(BASE_DIR + "faiss_index", exist_ok=True)

Define the paths for the ranking model, the retrieval model, and the Faiss index.

In [9]:
faiss_index_path = BASE_DIR + 'faiss_index' + "/index.faiss"
retrieval_model_path = BASE_DIR + "query_tower/"
ranking_model_path = BASE_DIR + "dlrm/"

Create the request schema that we are going to use when sending a request to Triton Inference Server (TIS).

In [10]:
request_schema = Schema(
    [
        ColumnSchema("user_id", dtype=np.int32),
    ]
)

The `QueryFaiss` operator creates an interface between a FAISS Approximate Nearest Neighbors (ANN) index and Triton Inference Server. For a given input query vector, it runs an ANN search to find the ids of the top-k nearest items in the index.

`setup_faiss` is a utility function that creates a Faiss index from the item embedding vectors, using L2 distance.
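
As a rough illustration of what a top-k L2 search returns, the sketch below does an exact, brute-force search in plain Python. This is an assumption-laden stand-in, not Faiss and not approximate: the `topk_l2` helper and the toy vectors are hypothetical, and a real ANN index trades some exactness for speed on large catalogs.

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def topk_l2(query, item_vectors, k):
    """Return the row positions of the k vectors closest to the query."""
    distances = (
        (l2_squared(query, vec), idx) for idx, vec in enumerate(item_vectors)
    )
    return [idx for _, idx in heapq.nsmallest(k, distances)]


# Toy item embeddings (made up); the row position plays the role of the item id.
item_vectors = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]]
candidate_ids = topk_l2([0.1, 0.1], item_vectors, k=2)
print(candidate_ids)  # nearest first: [2, 0]
```

In the pipeline, the query vector is the output of the query tower for the requesting user, and the returned ids become the candidate items for the ranking stage.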

In [11]:
from merlin.systems.dag.ops.faiss import QueryFaiss, setup_faiss 

item_embeddings = np.ascontiguousarray(
    pd.read_parquet(BASE_DIR + "item_embeddings.parquet").to_numpy()
)
setup_faiss(item_embeddings, faiss_index_path)

Create feature store client.

In [12]:
feature_store = feast.FeatureStore(feast_repo_path)

Fetch user features from the feature store with the `QueryFeast` operator. `QueryFeast` ensures that our Feast feature store can communicate correctly with Triton Inference Server for the feature lookups in the ensemble.

In [13]:
from merlin.systems.dag.ops.feast import QueryFeast 

user_features = ["user_id"] >> QueryFeast.from_feature_view(
    store=feature_store,
    view="user_features",
    column="user_id",
    include_id=True,
)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  ValueType.FLOAT: (np.float, False, False),


Retrieve the top-K candidate items that are relevant for a given user using the retrieval model. We use the `PredictTensorflow()` operator, which takes a TensorFlow model and packages it correctly for TIS to run with the TensorFlow backend.

In [14]:
retrieval = (
    user_features
    >> PredictTensorflow(retrieval_model_path)
    >> QueryFaiss(faiss_index_path, topk=100)
)

2022-04-05 18:53:03.680149: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-05 18:53:04.788233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


Fetch the item features from the feature store for the candidate items returned by the retrieval step above.

In [15]:
item_features = retrieval["candidate_ids"] >> QueryFeast.from_feature_view(
    store=feature_store,
    view="item_features",
    column="candidate_ids",
    output_prefix="item",
    include_id=True,
)

Merge the user features and item features to create the full set of combined features that were used in model training. We do this with the `UnrollFeatures` operator, which takes a target column and joins the "unroll" columns to it. This helps when broadcasting a set of user features to a set of items.
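
The broadcasting idea can be sketched in a few lines of plain Python. This is only a conceptual illustration, not the operator's code: the `unroll` helper, the `item_category` column, and the sample values are hypothetical. One row of user features is repeated for every candidate item so that each ranking input row carries both sets of columns.

```python
def unroll(item_rows, user_row):
    """Attach the same user feature values to every candidate item row."""
    return [{**item, **user_row} for item in item_rows]


# Made-up values, for illustration only.
user_row = {"user_id": 1, "user_age": 4}
item_rows = [
    {"item_id": 392, "item_category": 7},
    {"item_id": 267, "item_category": 2},
]

combined = unroll(item_rows, user_row)
print(combined[1])
```

After this step there is one row per candidate item, which is the shape the ranking model expects.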

In [16]:
user_features_to_unroll = [
    "user_id",
    "user_shops",
    "user_profile",
    "user_group",
    "user_gender",
    "user_age",
    "user_consumption_2",
    "user_is_occupied",
    "user_geography",
    "user_intentions",
    "user_brands",
    "user_categories",
]

combined_features = item_features >> UnrollFeatures(
    "item_id", user_features[user_features_to_unroll]
)

Rank the combined features using the trained ranking model, which is a DLRM model in this example. We feed the path of the ranking model to the `PredictTensorflow()` operator.

In [17]:
ranking = combined_features >> PredictTensorflow(ranking_model_path)

For the ordering we use the `SoftmaxSampling()` operator. Given the input ids and predictions, this operator sorts the items in descending order of predicted relevance, introduces some randomization into the ordering by sampling items from the softmax of the predicted relevance scores, and finally returns the top-k ordered items.
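
One way to sample an ordering from softmax weights is sketched below in plain Python. It is a sketch under stated assumptions, not necessarily how the operator is implemented: the `softmax_sample_topk` helper is hypothetical, scores are assumed to be divided by the temperature, and the sampling uses the standard "exponential race" trick for weighted sampling without replacement.

```python
import math
import random


def softmax_sample_topk(item_ids, scores, topk, temperature=1.0, seed=None):
    """Sample topk distinct items, favoring items with higher scores."""
    rng = random.Random(seed)
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    # Unnormalized softmax weights, floored so the division below is safe.
    weights = [max(math.exp(s - peak), 1e-300) for s in scaled]
    # Exponential race: draw Exp(1) / weight per item; smallest keys win.
    keys = [-math.log(1.0 - rng.random()) / w for w in weights]
    order = sorted(range(len(item_ids)), key=keys.__getitem__)
    return [item_ids[i] for i in order[:topk]]


ids = [392, 267, 1107, 968]        # made-up candidate ids
scores = [0.9, 0.8, 0.2, 0.1]      # made-up relevance scores
picked = softmax_sample_topk(ids, scores, topk=2, temperature=20.0, seed=0)
print(picked)
```

Under the division-by-temperature assumption, a high temperature flattens the weights so the ordering is close to a random shuffle, while a temperature near zero makes the result approach a plain sort by score.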

In [18]:
ordering = combined_features["item_id"] >> SoftmaxSampling(
    relevance_col=ranking["output_1"], topk=10, temperature=20.0
)

### Export Graph as Ensemble
The last step is to create the ensemble artifacts that TIS can consume. To make these artifacts, we use the `Ensemble` class imported above. This class represents an entire ensemble consisting of multiple models that run sequentially in TIS, initiated by an inference request. It is responsible for interpreting the graph and exporting the correct files for TIS.

When we create an `Ensemble` object, we pass it the graph and a schema representing the starting input of the graph. After we create the ensemble object, we export the graph by supplying an export path to the `ensemble.export()` function. This returns an ensemble config, which represents the entire inference pipeline, and a list of node-specific configs.

Create the folder to export the models and config files.

In [19]:
# create the folder only if it does not exist yet
os.makedirs(BASE_DIR + "poc_ensemble", exist_ok=True)

In [20]:
# define the path that all the models and config files are exported to
export_path = BASE_DIR + "poc_ensemble"

ensemble = Ensemble(ordering, request_schema)
ens_config, node_configs = ensemble.export(export_path)

Let's check our export_path structure

In [21]:
!tree $export_path

/Merlin/examples/Deploying-multi-stage-RecSys/poc_ensemble
├── 0_queryfeast
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── 1_predicttensorflow
│   ├── 1
│   │   └── model.savedmodel
│   │       ├── assets
│   │       ├── keras_metadata.pb
│   │       ├── saved_model.pb
│   │       └── variables
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
│   └── config.pbtxt
├── 2_queryfaiss
│   ├── 1
│   │   ├── index.faiss
│   │   │   └── index.faiss
│   │   └── model.py
│   └── config.pbtxt
├── 3_queryfeast
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── 4_unrollfeatures
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── 5_predicttensorflow
│   ├── 1
│   │   └── model.savedmodel
│   │       ├── asse
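
The listing above follows the usual Triton model repository layout: each model directory holds a `config.pbtxt` and a numbered version folder. If `tree` is not available, a small standard-library check such as the sketch below can confirm that layout. The `check_model_repository` helper is hypothetical, and the demo runs on a throwaway directory rather than on the real `export_path`.

```python
import os
import tempfile


def check_model_repository(repo_path):
    """Map each model dir to True if it has config.pbtxt and a version dir."""
    report = {}
    for name in sorted(os.listdir(repo_path)):
        model_dir = os.path.join(repo_path, name)
        if not os.path.isdir(model_dir):
            continue
        has_config = os.path.isfile(os.path.join(model_dir, "config.pbtxt"))
        has_version = any(
            entry.isdigit() and os.path.isdir(os.path.join(model_dir, entry))
            for entry in os.listdir(model_dir)
        )
        report[name] = has_config and has_version
    return report


# Demo on a temporary directory that mimics part of the exported tree.
with tempfile.TemporaryDirectory() as demo_repo:
    for model in ("0_queryfeast", "1_predicttensorflow"):
        os.makedirs(os.path.join(demo_repo, model, "1"))
        open(os.path.join(demo_repo, model, "config.pbtxt"), "w").close()
    os.makedirs(os.path.join(demo_repo, "broken_model"))  # no config, no version
    report = check_model_repository(demo_repo)

print(report)
```

To inspect the real export, call `check_model_repository(export_path)` after `ensemble.export()` has run.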

### Retrieving Recommendations from Triton

It is time to deploy all the models as an ensemble to Triton Inference Server, which the Merlin Systems library makes easy. We can launch our Triton server, load our models, and get a response for our query with the utility function `run_ensemble_on_tritonserver()`.

In [22]:
# create a request to be sent to TIS
from merlin.core.dispatch import make_df

request = make_df({"user_id": [1]})
request["user_id"] = request["user_id"].astype(np.int32)

response = run_ensemble_on_tritonserver(
    export_path, ensemble.graph.output_schema.column_names, request, "ensemble_model"
)

I0405 18:53:12.285197 12809 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0405 18:53:12.285284 12809 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0405 18:53:12.285289 12809 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0405 18:53:12.285293 12809 tensorflow.cc:2216] backend configuration:
{"cmdline":{"version":"2"}}
I0405 18:53:12.434720 12809 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fd4f8000000' with size 268435456
I0405 18:53:12.435114 12809 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0405 18:53:12.440615 12809 model_repository_manager.cc:994] loading: 0_queryfeast:1
I0405 18:53:12.541023 12809 model_repository_manager.cc:994] loading: 1_predicttensorflow:1
I0405 18:53:12.544542 12809 backend.cc:46] TRITONBACKEND_Initialize: nvtabular
I0405 18:53:12.544572 12809 backend.cc:53] Triton TRITONBACKEND API version: 1.8
I0405 18:53:12.544585 12809 backend.cc:56] 'nvtabular' TRI

Signal (2) received.


I0405 18:53:23.800413 12809 server.cc:267] Timeout 29: Found 7 live models and 0 in-flight non-inference requests
I0405 18:53:24.811206 12809 server.cc:267] Timeout 28: Found 7 live models and 0 in-flight non-inference requests
 0# 0x0000555EA4C27299 in /opt/tritonserver/bin/tritonserver
 1# 0x00007FD58C687210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007FD531CF2F2E in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
 3# TRITONBACKEND_ModelInstanceFinalize in /opt/tritonserver/backends/nvtabular/libtriton_nvtabular.so
 4# 0x00007FD58D224FC4 in /opt/tritonserver/bin/../lib/libtritonserver.so
 5# 0x00007FD58D21E3B9 in /opt/tritonserver/bin/../lib/libtritonserver.so
 6# 0x00007FD58D21EB1D in /opt/tritonserver/bin/../lib/libtritonserver.so
 7# 0x00007FD58D0A20D7 in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x00007FD58CA75DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 9# 0x00007FD58CEF3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
10# clone in /usr/lib/x86_64-linux-gnu/libc

Convert our response to a numpy array and print it out.

In [23]:
output = response.as_numpy("ordered_ids")
output

array([[ 392],
       [ 267],
       [1107],
       [ 968],
       [ 457],
       [ 750],
       [ 669],
       [ 789],
       [1237],
       [1164]], dtype=int32)

Note that these item ids are encoded values, not the original raw values. We will eventually add reverse dictionary lookup functionality to map these encoded item ids back to their original raw ids with one line of code. If you want to do it now, you can map these ids to their original values using the `unique.item_id.parquet` file stored in the `categories` folder.
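
The reverse lookup itself is simple once the mapping is loaded. The sketch below assumes, without having verified it against the file, that the row position in `unique.item_id.parquet` corresponds to the encoded id; the `decode_ids` helper and the toy lookup table are hypothetical. With pandas, the table could be built from the parquet file's `item_id` column instead of the hard-coded list.

```python
def decode_ids(encoded_ids, raw_ids_by_code):
    """Translate encoded ids back to raw ids; unknown codes map to None."""
    return [
        raw_ids_by_code[code] if 0 <= code < len(raw_ids_by_code) else None
        for code in encoded_ids
    ]


# Made-up lookup table: position = encoded id, value = raw id.
raw_ids_by_code = [None, "sku-17", "sku-03", "sku-42"]
print(decode_ids([2, 3, 9], raw_ids_by_code))  # ['sku-03', 'sku-42', None]
```

Returning `None` for out-of-range codes keeps the output aligned with the input, so a missing mapping is visible rather than silently dropped.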

That's it! You have finished deploying a multi-stage recommender system on Triton Inference Server using the Merlin framework.