In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

## Deploying the Model into Production with Merlin Systems and Triton Inference Server

By the time you open this notebook, we expect that you have already run the first notebook, `01-Building-Recommender-Systems-PoC.ipynb`, and exported all the required files and models.

We are going to generate recommended items for a given user query (user_id) by following the steps described in the figure below.

![tritonensemble](../images/triton_ensemble.png)

The Merlin Systems library provides a set of operators for serving multi-stage recommender systems built with TensorFlow on [Triton Inference Server](https://github.com/triton-inference-server/server) (TIS) easily and efficiently. Below, we go through these operators and demonstrate how to use them to serve a multi-stage system on Triton.

### Import required libraries and functions

In [2]:
import numpy as np
import cudf
import feast
import faiss
import pandas as pd

from nvtabular import ColumnSchema, Schema

from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.session_filter import FilterCandidates
from merlin.systems.dag.ops.softmax_sampling import SoftmaxSampling
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.unroll_features import UnrollFeatures

from merlin.systems.triton.utils import run_triton_server, run_ensemble_on_tritonserver



### Feast Apply

We have defined our user and item features in the `user_features.py` and `item_features.py` files. With `FeatureView()`, users can register their organization's data sources in Feast and then use those data sources for both training and online inference. In these two files, we tell Feast where to find the user and item features.


Before moving on to the next steps, we need to run the `apply` command as shown below:

```
# open a terminal and navigate to the `feature_repo` folder

cd /Merlin/examples/PoC/feature_repo

# run the following command

feast apply
```

Running `feast apply` registers our features: it creates the feature registry, stores all entity and feature view definitions, and sets up a local SQLite online store called `online_store.db`.

### Feast Materialize

After running `apply`, which registered our features and created our local online store, we need to perform the [materialization](https://docs.feast.dev/how-to-guides/running-feast-in-production) step. Materialization keeps the online store up to date and ready for prediction: it runs a job that loads feature data from our feature view sources into the online store. As we add new features to our offline stores, we can continuously materialize them so that the online store holds the latest feature values for each user.

When you run the `feast materialize` command below, you will see the message <i>Materializing 2 feature views from 1995-01-01 01:01:01+00:00 to 2025-01-01 01:01:01+00:00 into the sqlite online store</i> in your terminal.

```
# open a terminal and navigate to the `feature_repo` folder

cd /Merlin/examples/PoC/feature_repo

# run the following commands

feast materialize 1995-01-01T01:01:01 2025-01-01T01:01:01
```

Note that the materialization step takes some time.

Now, let's check our `feature_repo` structure again after running the `apply` and `materialize` commands.
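To make the idea behind materialization concrete, here is a minimal pure-Python sketch. It is not Feast's implementation: the `materialize` function, the row layout, and the sample values are illustrative assumptions. It only shows the core behavior described above, keeping the newest feature values per user within the requested time window.

```python
from datetime import datetime


def materialize(offline_rows, start, end):
    """Toy materialization: keep the newest row per user within [start, end]."""
    online_store = {}
    latest_seen = {}
    for row in offline_rows:
        ts = row["event_timestamp"]
        if not (start <= ts <= end):
            continue  # rows outside the window are not loaded
        user = row["user_id"]
        if user not in latest_seen or ts > latest_seen[user]:
            latest_seen[user] = ts
            # Store only the feature columns, keyed by the entity id.
            online_store[user] = {
                k: v for k, v in row.items() if k not in ("user_id", "event_timestamp")
            }
    return online_store


# Made-up offline rows, for illustration only.
rows = [
    {"user_id": 1, "event_timestamp": datetime(2020, 1, 1), "user_age": 3},
    {"user_id": 1, "event_timestamp": datetime(2021, 6, 1), "user_age": 4},
    {"user_id": 2, "event_timestamp": datetime(2019, 3, 1), "user_age": 2},
    {"user_id": 3, "event_timestamp": datetime(1990, 1, 1), "user_age": 5},
]
online = materialize(rows, datetime(1995, 1, 1, 1, 1, 1), datetime(2025, 1, 1, 1, 1, 1))
print(online)
```

In this toy data, user 1 keeps only the more recent of its two rows, and user 3 is skipped because its only row predates the window.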

In [3]:
!tree ./feature_repo

./feature_repo
├── __init__.py
├── data
│   ├── item_features.parquet
│   ├── online_store.db
│   ├── registry.db
│   └── user_features.parquet
├── feature_store.yaml
├── item_features.py
└── user_features.py

1 directory, 8 files


We use the `configure_tensorflow` function to prevent TensorFlow from claiming the entire GPU memory. With this function, TensorFlow is allowed to allocate 50% of the available GPU memory.

In [5]:
from nvtabular.loader.tf_utils import configure_tensorflow
configure_tensorflow()

<function tensorflow.python.dlpack.dlpack.from_dlpack(dlcapsule)>

Create a folder for the FAISS index.

In [7]:
import os
os.makedirs("tmp", exist_ok=True)  # do not fail if the folder already exists

Define paths for the ranking model, retrieval model, Feast feature repo, and FAISS index.

In [8]:
base_path = "/Merlin/examples/PoC/"
faiss_index_path = base_path + 'tmp' + "/index.faiss"
feast_repo_path = base_path + "feature_repo/"
retrieval_model_path = base_path + "query_tower/"
ranking_model_path = base_path + "dlrm/"

Create the request schema that we will use when sending a request to Triton Inference Server (TIS).

In [None]:
request_schema = Schema(
    [
        ColumnSchema("user_id", dtype=np.int32),
    ]
)

The `QueryFaiss` operator creates an interface between a FAISS Approximate Nearest Neighbors (ANN) index and Triton Inference Server. For a given input query vector, we run an ANN search to find the ids of the top-k nearest items in the index.

The `QueryFeast` operator ensures that our Feast feature store can communicate correctly with Triton server for the feature lookups in the ensemble.

`setup_faiss` is a utility function that creates a FAISS index from the embedding vectors using L2 distance.

In [None]:
from merlin.systems.dag.ops.faiss import QueryFaiss, setup_faiss 
from merlin.systems.dag.ops.feast import QueryFeast 

item_embeddings = np.ascontiguousarray(
    pd.read_parquet(base_path + "item_embeddings.parquet").to_numpy()
)

feature_store = feast.FeatureStore(feast_repo_path)
setup_faiss(item_embeddings, faiss_index_path)
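To illustrate what a top-k L2 search returns, here is a small pure-Python sketch. It performs an exact brute-force search instead of the approximate search that FAISS provides, and the function names and vectors are made up for illustration. The retrieval idea is the same: given a query (user) vector, return the positions of the closest item vectors.

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_topk(index_vectors, query, topk):
    """Exact nearest-neighbor search; returns (ids, distances), closest first."""
    scored = ((l2_squared(vec, query), idx) for idx, vec in enumerate(index_vectors))
    nearest = heapq.nsmallest(topk, scored)
    return [idx for _, idx in nearest], [dist for dist, _ in nearest]


# Made-up 2-dimensional item embeddings and a user (query) embedding.
item_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
user_vector = [1.0, 1.0]

candidate_ids, distances = search_topk(item_vectors, user_vector, topk=2)
print(candidate_ids)
```

A brute-force scan compares the query against every item, which is why an ANN index is preferred once the item catalog becomes large.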

Fetch user features with `QueryFeast` operator from the feature store.

In [None]:
user_features = ["user_id"] >> QueryFeast.from_feature_view(
    store=feature_store,
    path=feast_repo_path,
    view="user_features",
    column="user_id",
    include_id=True,
)

Retrieve the top-k candidate items that are relevant for a given user using the retrieval model. We use the `PredictTensorflow()` operator, which takes a TensorFlow model and packages it correctly for TIS to run with the TensorFlow backend.

In [11]:
retrieval = (
    user_features
    >> PredictTensorflow(retrieval_model_path)
    >> QueryFaiss(faiss_index_path, topk=100)
)

2022-03-30 00:40:16.934151: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-30 00:40:18.123955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0
2022-03-30 00:40:20.257596: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1034311152 exceeds 10% of free system memory.






Fetch the item features from the feature store for the candidate items returned by the retrieval step above.

In [12]:
item_features = retrieval["candidate_ids"] >> QueryFeast.from_feature_view(
    store=feature_store,
    path=feast_repo_path,
    view="item_features",
    column="candidate_ids",
    output_prefix="item",
    include_id=True,
)

Merge the user features and item features to create the full set of combined features that were used in model training. We use the `UnrollFeatures` operator, which takes a target column and joins the "unroll" columns to the target. This helps when broadcasting a single set of user features to a set of items.

In [None]:
user_features_to_unroll = [
    "user_id",
    "user_shops",
    "user_profile",
    "user_group",
    "user_gender",
    "user_age",
    "user_consumption_2",
    "user_is_occupied",
    "user_geography",
    "user_intentions",
    "user_brands",
    "user_categories",
]

combined_features = item_features >> UnrollFeatures(
    "item_id", user_features[user_features_to_unroll]
)
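The broadcasting idea can be sketched in plain Python. This is only a conceptual illustration, not the `UnrollFeatures` implementation: the `unroll_features` helper and the sample rows are hypothetical. Each candidate item row receives a copy of the selected user features, so that every row has the full feature set the ranking model expects.

```python
def unroll_features(item_rows, user_row, user_columns):
    """Attach the selected user columns to every item row."""
    user_part = {col: user_row[col] for col in user_columns}
    return [{**item, **user_part} for item in item_rows]


# Made-up single user row and two candidate item rows.
user = {"user_id": 1, "user_age": 4, "user_gender": 2}
items = [
    {"item_id": 10, "item_category": 7},
    {"item_id": 11, "item_category": 3},
]

combined = unroll_features(items, user, ["user_id", "user_age"])
for row in combined:
    print(row)
```

Only the columns listed in `user_columns` are copied, which mirrors selecting a subset of user features before unrolling.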

Rank the combined features using the trained ranking model, which is a DLRM model in this example. We pass the path of the ranking model to the `PredictTensorflow()` operator.

In [13]:
ranking = combined_features >> PredictTensorflow(ranking_model_path)

2022-03-30 00:40:25.126858: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 788046592 exceeds 10% of free system memory.
2022-03-30 00:40:25.127006: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 788046592 exceeds 10% of free system memory.


For the ordering, we use the `SoftmaxSampling()` operator. Given the input ids and predictions, this operator sorts the inputs in descending order while introducing some randomization into the ordering by sampling items from the softmax of the predicted relevance scores, and finally returns the top-k ordered items.

In [None]:
ordering = combined_features["item_id"] >> SoftmaxSampling(
    relevance_col=ranking["output_1"], topk=10, temperature=20.0
)
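One common way to sample an ordering from a softmax without replacement is the exponential-sort trick (Efraimidis and Spirakis). The sketch below uses that technique in pure Python. It is an illustration under that assumption, not necessarily how `SoftmaxSampling` is implemented, and the function name and sample scores are hypothetical.

```python
import math
import random


def softmax_sample_topk(item_ids, scores, topk, temperature=1.0, rng=None):
    """Sample topk ids without replacement, weighted by softmax(scores / temperature)."""
    if rng is None:
        rng = random.Random()
    keys = []
    for score in scores:
        # key = Exp(1) / exp(score / temperature), computed in log space for stability.
        noise = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
        keys.append(math.log(noise) - score / temperature)
    # Smallest keys come first; high scores tend to get small keys.
    order = sorted(range(len(item_ids)), key=keys.__getitem__)
    return [item_ids[i] for i in order[:topk]]


# Made-up candidate ids and relevance scores.
picked = softmax_sample_topk(
    [10, 11, 12, 13, 14], [0.1, 0.9, 0.5, 0.3, 0.7],
    topk=3, temperature=20.0, rng=random.Random(0),
)

# With a very low temperature the ordering becomes effectively deterministic.
sharp = softmax_sample_topk(
    ["a", "b"], [0.0, 100.0], topk=1, temperature=0.01, rng=random.Random(1)
)
print(picked, sharp)
```

A higher temperature flattens the distribution and makes the ordering more random, while a lower temperature makes it approach a plain sort by score.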

### Export Graph as Ensemble
The last step is to create the ensemble artifacts that TIS can consume. To make these artifacts, we use the `Ensemble` class imported earlier. This class represents an entire ensemble consisting of multiple models that run sequentially in TIS, initiated by an inference request. It is responsible for interpreting the graph and exporting the correct files for TIS.

When we create an `Ensemble` object, we pass the graph and a schema representing the starting input of the graph. After we create the ensemble object, we export the graph, supplying an export path to the `ensemble.export()` function. This returns an ensemble config, which represents the entire inference pipeline, and a list of node-specific configs.

Create the folder to export the models and config files.

In [None]:
os.makedirs("poc_ensemble", exist_ok=True)  # do not fail if the folder already exists

In [14]:
# define the path where all the models and config files are exported to
export_path = '/Merlin/examples/PoC/poc_ensemble/'

ensemble = Ensemble(ordering, request_schema)
ens_config, node_configs = ensemble.export(export_path)

Let's check the structure of our `export_path`.

In [1]:
!tree ./poc_ensemble

./poc_ensemble
├── 0_queryfeast
│   ├── 1
│   │   ├── __pycache__
│   │   │   └── model.cpython-38.pyc
│   │   └── model.py
│   └── config.pbtxt
├── 1_predicttensorflow
│   ├── 1
│   │   └── model.savedmodel
│   │       ├── assets
│   │       ├── keras_metadata.pb
│   │       ├── saved_model.pb
│   │       └── variables
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
│   └── config.pbtxt
├── 2_queryfaiss
│   ├── 1
│   │   ├── index.faiss
│   │   │   └── index.faiss
│   │   └── model.py
│   └── config.pbtxt
├── 3_queryfeast
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── 4_unrollfeatures
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── 5_predicttensorflow
│   ├── 1
│   │   └── model.savedmodel
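As the listing shows, each exported model lives in its own directory with a `config.pbtxt` file and a numbered version folder. If `tree` is not available, a short helper can list the model directories in an exported repository. The sketch below is a hypothetical utility, not part of Merlin Systems. It builds a throwaway directory so that it runs on its own; in the notebook you would pass `export_path` instead.

```python
import os
import tempfile


def list_model_dirs(repo_path):
    """Return sorted names of subdirectories that contain a config.pbtxt file."""
    return sorted(
        name
        for name in os.listdir(repo_path)
        if os.path.isfile(os.path.join(repo_path, name, "config.pbtxt"))
    )


# Build a small stand-in repository for demonstration purposes only.
with tempfile.TemporaryDirectory() as repo:
    for name in ("0_queryfeast", "1_predicttensorflow"):
        os.makedirs(os.path.join(repo, name, "1"))  # version folder
        open(os.path.join(repo, name, "config.pbtxt"), "w").close()
    os.makedirs(os.path.join(repo, "scratch"))  # no config.pbtxt, so it is ignored
    found = list_model_dirs(repo)

print(found)
```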

### Retrieving Recommendations from Triton

It is time to deploy all the models as an ensemble model to Triton Inference Server, which is easy to do with the Merlin Systems library. We can launch our Triton server, load our models, and get a response for our query with the utility function `run_ensemble_on_tritonserver()`.

In [15]:
# create a request to be sent to TIS
from merlin.core.dispatch import make_df

request = make_df({"user_id": [1]})
request["user_id"] = request["user_id"].astype(np.int32)

response = run_ensemble_on_tritonserver(
    export_path, ensemble.graph.output_schema.column_names, request, "ensemble_model"
)

I0330 00:40:28.996325 1892 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0330 00:40:28.996416 1892 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0330 00:40:28.996421 1892 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0330 00:40:28.996425 1892 tensorflow.cc:2216] backend configuration:
{"cmdline":{"version":"2"}}
I0330 00:40:29.150263 1892 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f96b4000000' with size 268435456
I0330 00:40:29.150648 1892 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0330 00:40:29.156027 1892 model_repository_manager.cc:994] loading: 0_queryfeast:1
I0330 00:40:29.256479 1892 model_repository_manager.cc:994] loading: 1_predicttensorflow:1
I0330 00:40:29.263985 1892 backend.cc:46] TRITONBACKEND_Initialize: nvtabular
I0330 00:40:29.264019 1892 backend.cc:53] Triton TRITONBACKEND API version: 1.8
I0330 00:40:29.264031 1892 backend.cc:56] 'nvtabular' TRITONBACKEND 

Signal (2) received.


I0330 00:40:44.842471 1892 server.cc:252] Waiting for in-flight requests to complete.
I0330 00:40:44.842502 1892 model_repository_manager.cc:1026] unloading: ensemble_model:1
I0330 00:40:44.842628 1892 model_repository_manager.cc:1026] unloading: 6_softmaxsampling:1
I0330 00:40:44.842775 1892 model_repository_manager.cc:1026] unloading: 5_predicttensorflow:1
I0330 00:40:44.842846 1892 model_repository_manager.cc:1132] successfully unloaded 'ensemble_model' version 1
I0330 00:40:44.842869 1892 model_repository_manager.cc:1026] unloading: 4_unrollfeatures:1
I0330 00:40:44.843011 1892 model_repository_manager.cc:1026] unloading: 3_queryfeast:1
I0330 00:40:44.843074 1892 backend.cc:160] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0330 00:40:44.843092 1892 model_repository_manager.cc:1026] unloading: 2_queryfaiss:1
Signal (I0330 00:40:44.843153 1892 backend.cc:160] TRITONBACKEND_ModelInstanceFinalize: delete instance stateI0330 00:40:44.843172 1892 tensorflow.cc:2363] TRITON

Convert our response to a numpy array and print it out.

In [18]:
output = response.as_numpy('ordered_ids')
output

array([[2628383],
       [1780891],
       [ 397955],
       [2573239],
       [1255680],
       [ 505277],
       [2084603],
       [ 365618],
       [ 229051],
       [1323574]], dtype=int32)

Note that these item ids are encoded values, not the original raw values. We will eventually add reverse dictionary lookup functionality to map these encoded item ids back to their original raw ids with one line of code. If you want to do it now, you can map these ids to their original values using the `unique.item_id.parquet` file stored in the `categories` folder.

That's it! You have finished deploying a multi-stage recommender system on Triton Inference Server using the Merlin framework.
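The reverse lookup can be sketched as follows. This is a toy example that assumes the encoded id equals the row position of the raw id in the lookup table. Please verify that assumption against the contents of `unique.item_id.parquet` for your NVTabular version, because index offsets (for example, for null or out-of-vocabulary values) can differ. The `decode_item_ids` helper and the raw ids are made up for illustration.

```python
def decode_item_ids(encoded_ids, raw_ids_by_position):
    """Map encoded ids back to raw ids, assuming encoded id == row position."""
    return [raw_ids_by_position[i] for i in encoded_ids]


# Made-up lookup table; position 0 is assumed to be reserved (no raw id).
raw_ids_by_position = [None, "sku-884", "sku-102", "sku-553"]

decoded = decode_item_ids([3, 1], raw_ids_by_position)
print(decoded)
```

In practice, the lookup table would be loaded from the parquet file rather than written by hand.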