In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

## Training a DLRM model with TensorFlow

In the previous notebooks, we downloaded the MovieLens data, converted it to parquet files, and then used the NVTabular library to process the data, join data frames, and create input features. In this notebook we will use the NVIDIA Merlin Models library to build and train a Deep Learning Recommendation Model [(DLRM)](https://arxiv.org/abs/1906.00091), an architecture originally proposed by Facebook in 2019.

Figure 1 illustrates the DLRM architecture. The model was introduced as a personalization deep learning model that uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed [here](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5694074).

![DLRM](images/DLRM.png)

<p>Figure 1. DLRM architecture. Image source: <a href="https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/DLRM">NVIDIA DL Examples</a></p>

DLRM accepts two types of features: categorical and numerical.
- For each categorical feature, an embedding table provides a dense representation of each unique value.
- Numerical features are fed to the model as dense features and then transformed by a simple neural network referred to as the "bottom MLP". This part of the network consists of a series of linear layers with ReLU activations.
- The output of the bottom MLP and the embedding vectors are then fed into the `dot product interaction` operation (see the pairwise interaction step in the figure). The output of the dot interaction is then concatenated with the features resulting from the bottom MLP (we apply a skip-connection there) and fed into the "top MLP", which is also a series of dense layers with activations (a fully connected NN).
- The model outputs a single number (here we use a sigmoid function to generate probabilities), which can be interpreted as the likelihood of a certain user clicking on an ad, watching a movie, or viewing a news page.
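
The data flow described above can be sketched in plain NumPy. This is a minimal illustration with random weights and hypothetical toy sizes (`batch_size`, `emb_dim`, `cardinalities`, and the helper names are all assumptions made for this example); it is not the Merlin Models implementation, and it only shows how the tensor shapes fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, layer_dims):
    """Create random (weight, bias) pairs for a stack of dense layers."""
    layers = []
    for out_dim in layer_dims:
        layers.append((rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)))
        in_dim = out_dim
    return layers

def mlp(x, layers):
    """Apply dense layers, each followed by a ReLU activation."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

def dot_interaction(vectors):
    """Pairwise dot products between stacked vectors (upper triangle only).

    vectors has shape (batch, n_vectors, dim); the result has shape
    (batch, n_vectors * (n_vectors - 1) // 2).
    """
    gram = np.einsum("bid,bjd->bij", vectors, vectors)
    rows, cols = np.triu_indices(vectors.shape[1], k=1)
    return gram[:, rows, cols]

# Toy sizes, chosen only for illustration
batch_size, emb_dim, n_dense = 4, 8, 3
cardinalities = {"user": 100, "item": 50}

emb_tables = {k: rng.normal(scale=0.1, size=(n, emb_dim)) for k, n in cardinalities.items()}
bottom = init_mlp(n_dense, [16, emb_dim])  # bottom MLP ends at the embedding size
n_vectors = len(cardinalities) + 1         # embeddings plus bottom MLP output
n_pairs = n_vectors * (n_vectors - 1) // 2
top = init_mlp(n_pairs + emb_dim, [16, 8])
w_out = rng.normal(scale=0.1, size=(8, 1))

# Random inputs: one id per categorical feature, three dense values per row
cat_inputs = {k: rng.integers(0, n, size=batch_size) for k, n in cardinalities.items()}
dense_inputs = rng.normal(size=(batch_size, n_dense))

dense_vec = mlp(dense_inputs, bottom)                             # (batch, emb_dim)
emb_vecs = [emb_tables[k][cat_inputs[k]] for k in cardinalities]  # each (batch, emb_dim)
stacked = np.stack([dense_vec] + emb_vecs, axis=1)                # (batch, 3, emb_dim)
interactions = dot_interaction(stacked)                           # (batch, 3)
top_input = np.concatenate([interactions, dense_vec], axis=1)     # skip-connection
probs = 1.0 / (1.0 + np.exp(-(mlp(top_input, top) @ w_out)))      # sigmoid, (batch, 1)
print(probs.shape)
```

Note that the bottom MLP must end at the embedding dimension so that its output can be stacked with the embedding vectors before the interaction step.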

## Import Libraries

In [2]:
import os
import glob
import numpy as np
import pandas as pd
import nvtabular as nvt

import merlin_models.tf as ml
from merlin_standard_lib import Schema, Tag

from nvtabular.loader.tensorflow import KerasSequenceLoader
import tensorflow as tf

In [3]:
import logging
# disable WARNING, INFO and DEBUG logging everywhere
logging.disable(logging.WARNING) 

In [4]:
# Avoid Numba low occupancy warnings
from numba import config
config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The Merlin Models library relies on a `schema` object to automatically build all the layers needed to represent, normalize, and aggregate input features. As you can see below, `schema.pbtxt` is a protobuf text file that contains feature metadata, including statistics such as cardinality and min and max values, and tags features based on their characteristics and dtypes (e.g., categorical, continuous, list, integer).

We already generated our `schema.pbtxt` file in the previous notebook using NVTabular. Now we read this schema file to create a `schema` object.

In [9]:
from merlin_standard_lib import Schema
SCHEMA_PATH = "/workspace/data/movielens/train/schema.pbtxt"
schema = Schema().from_proto_text(SCHEMA_PATH)
!head -30 $SCHEMA_PATH

feature {
  name: "movieId"
  type: INT
  int_domain {
    name: "movieId"
    min: 0
    max: 56677
    is_categorical: true
  }
  annotation {
    tag: "item_id"
    tag: "item"
    tag: "categorical"
    extra_metadata {
      type_url: "type.googleapis.com/google.protobuf.Struct"
      value: "\n\021\n\013num_buckets\022\002\010\000\n\033\n\016freq_threshold\022\t\021\000\000\000\000\000\000\000\000\n\025\n\010max_size\022\t\021\000\000\000\000\000\000\000\000\n\030\n\013start_index\022\t\021\000\000\000\000\000\000\000\000\n2\n\010cat_path\022&\032$.//categories/unique.movieId.parquet\nG\n\017embedding_sizes\0224*2\n\030\n\013cardinality\022\t\021\000\000\000\000\240\254\353@\n\026\n\tdimension\022\t\021\000\000\000\000\000\000\200@"
    }
  }
}
feature {
  name: "userId"
  type: INT
  int_domain {
    name: "userId"
    min: 0
    max: 162542
    is_categorical: true
  }
  annotation {
    tag: "categorical"


## Define the Input module

Below we define our input blocks using the `ml.ContinuousFeatures` and `ml.EmbeddingFeatures` classes. The `from_schema()` method processes the schema and creates the necessary layers to represent features and aggregate them.

In the next cells, the whole model is built with a few lines of code. Here is a brief explanation of the main classes and functions:

- The [DotProductInteraction](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/layers/interaction.py#L22) class implements the factorization machine style feature interaction layer suggested by the DLRM and DeepFM architectures. Here we do not pass an interaction type, and the `None` interaction type defaults to the standard factorization machine style interaction.
- [TabularBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/core.py#L661) is a subclass of the `Block` class that accepts a dictionary of tensors as input and supports the integration of many commonly used operations. This class has additional methods to apply transformations and aggregations to inputs for pre- and post-processing.
- The [ParallelBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/core.py#L1152) class merges multiple layers or TabularBlocks into a single output of TabularData, which is a dictionary of tensors. In this example, this class outputs two parallel branches: the continuous block and the categorical block.
- [BinaryClassificationTask](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/prediction/classification.py#L30) supports the binary prediction task. We also support other prediction tasks, like next-item prediction and regression.

In [6]:
con_schema = schema.select_by_tag(Tag.CONTINUOUS)
cat_schema = schema.select_by_tag(Tag.CATEGORICAL)
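
To make the idea of tag-based selection concrete, here is a toy sketch in plain Python. It is not how `select_by_tag` is implemented in Merlin; `toy_schema` and `names_with_tag` are hypothetical names. The `movieId` and `userId` tags mirror the `schema.pbtxt` excerpt printed earlier, while the `continuous` tags on the other two features are assumptions made for this illustration.

```python
# Toy stand-in for a schema: each feature has a name and a set of tags.
# Only a subset of the notebook's features is listed here.
toy_schema = [
    {"name": "movieId", "tags": {"item_id", "item", "categorical"}},
    {"name": "userId", "tags": {"categorical"}},
    {"name": "TE_movieId_rating", "tags": {"continuous"}},  # assumed tag
    {"name": "userId_count", "tags": {"continuous"}},       # assumed tag
]

def names_with_tag(schema, tag):
    """Return the names of all features that carry the given tag."""
    return [feature["name"] for feature in schema if tag in feature["tags"]]

categorical = names_with_tag(toy_schema, "categorical")
continuous = names_with_tag(toy_schema, "continuous")
print(categorical)
print(continuous)
```

Splitting the schema this way lets each branch of the model (embeddings for categorical features, an MLP for continuous ones) be built from only the features it needs.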

In [7]:
top_block_inputs = {}

top_block_inputs["continuous"] = ml.ContinuousFeatures.from_schema(con_schema).connect(ml.MLPBlock([128, 64]))

We use the `ContinuousFeatures` layer to build the dense input for the continuous features, and then feed it to the MLP block with the `connect` method.

In [8]:
embedding_dim = 64
top_block_inputs["categorical"] = ml.EmbeddingFeatures.from_schema(
    cat_schema, embedding_dim_default=embedding_dim
)

In [9]:
dot_product = ml.TabularBlock(aggregation="stack").connect(ml.DotProductInteraction())
top_block_outputs = (
    ml.ParallelBlock(top_block_inputs)
    .connect_with_shortcut(
        dot_product, shortcut_filter=ml.Filter("continuous"), aggregation="concat"
    )
    .connect(ml.MLPBlock([128, 64]))
)

In [10]:
model = top_block_outputs.connect(ml.BinaryClassificationTask("rating"))

2021-12-08 23:45:09.832763: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-08 23:45:11.306734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


In addition to this low-level API, there is also a high-level API with which you can define a DLRM model in a single statement, as follows:
    
```
ml.DLRMBlock(
    schema, bottom_block=ml.MLPBlock([128, 64]), top_block=ml.MLPBlock([128, 64])
).connect(ml.BinaryClassificationTask("rating"))
```

### Define Data Loader

We're ready to start training. We'll use the NVTabular `KerasSequenceLoader` to read chunks of parquet files. `KerasSequenceLoader` manages shuffling by loading chunks of data from different parts of the full dataset, concatenating them, shuffling the result, and then iterating through this super-chunk sequentially in batches. The number of "parts" of the dataset that get sampled, or "partitions", is controlled by the `parts_per_chunk` kwarg, while the size of each part is controlled by the `buffer_size` kwarg, which refers to a fraction of available GPU memory (you can read more about it [here](https://nvidia-merlin.github.io/NVTabular/main/training/tensorflow.html) and [here](https://nvidia-merlin.github.io/NVTabular/main/api/tensorflow_dataloader.html?highlight=kerassequence#nvtabular.loader.tensorflow.KerasSequenceLoader)). Using more chunks leads to better randomness, especially at the epoch level where physically disparate samples can be brought into the same batch, but can reduce throughput if you use too many. In any case, the speed of the parquet reader makes much larger buffer sizes feasible.
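
The shuffling strategy described above can be illustrated with a small pure-Python sketch. This is a conceptual model only, not the `KerasSequenceLoader` implementation: the function name `chunked_shuffle_batches` and the toy partitions are hypothetical, and the real loader works on GPU dataframes rather than Python lists.

```python
import random

def chunked_shuffle_batches(partitions, parts_per_chunk, batch_size, seed=0):
    """Yield batches by shuffling inside super-chunks built from several partitions."""
    rng = random.Random(seed)
    order = list(range(len(partitions)))
    rng.shuffle(order)  # visit partitions in a random order
    for start in range(0, len(order), parts_per_chunk):
        # Concatenate a few partitions into one super-chunk
        chunk = []
        for idx in order[start:start + parts_per_chunk]:
            chunk.extend(partitions[idx])
        rng.shuffle(chunk)  # shuffle rows within the super-chunk
        # Iterate through the super-chunk sequentially in batches
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]

# Six toy partitions of ten consecutive row ids each
partitions = [list(range(p * 10, (p + 1) * 10)) for p in range(6)]
batches = list(chunked_shuffle_batches(partitions, parts_per_chunk=2, batch_size=5))
print(len(batches))
```

In this sketch every batch mixes rows from at most `parts_per_chunk` partitions, which is why a larger value gives better mixing at the cost of holding more data in memory at once.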

Note that the `genres` column is a multi-hot column: it is fed to the dataloader as a sparse tensor and then converted to a dense representation. Based on our analysis, a `genres` entry contains at most 10 values, so we set the sequence length for the multi-hot columns to 10 in the `sparse_features_max` dictionary below.

In [11]:
# Define categorical and continuous columns
x_cat_names, x_cont_names = ['userId', 'movieId', 'genres'], ['TE_movieId_rating','userId_count']

# dictionary representing max sequence length for each column
sparse_features_max = {'genres': 10}

def get_dataloader(paths_or_dataset, batch_size=4096):
    dataloader = KerasSequenceLoader(
        paths_or_dataset,
        batch_size=batch_size,
        label_names=['rating'],
        cat_names=x_cat_names,
        cont_names=x_cont_names,
        sparse_names=list(sparse_features_max.keys()),
        sparse_max=sparse_features_max,
        sparse_as_dense=True,
    )
    return dataloader.map(lambda X, y: (X, tf.reshape(y, (-1,))))
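
To illustrate what converting a multi-hot column to a dense representation means, here is a minimal NumPy sketch. It assumes the ragged column is stored as a flat array of values plus the number of entries per row; that layout and the helper name `ragged_to_dense` are assumptions for this example, not the loader's internal API.

```python
import numpy as np

def ragged_to_dense(values, row_lengths, max_len, pad_value=0):
    """Pad (or truncate) each variable-length row to max_len entries."""
    values = np.asarray(values)
    dense = np.full((len(row_lengths), max_len), pad_value, dtype=values.dtype)
    start = 0
    for row, length in enumerate(row_lengths):
        keep = min(length, max_len)  # rows longer than max_len are truncated
        dense[row, :keep] = values[start:start + keep]
        start += length
    return dense

# Three genre lists: [1, 4], [2, 12] and [2, 8, 1, 16]
values = np.array([1, 4, 2, 12, 2, 8, 1, 16])
row_lengths = [2, 2, 4]
dense = ragged_to_dense(values, row_lengths, max_len=10)
print(dense)
```

Every row ends up with the same length, so the column can be passed to the model as a regular two-dimensional tensor.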

### Start Training and Evaluation

In [12]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/movielens/")
train_paths = glob.glob(os.path.join(OUTPUT_DIR, "train/*.parquet"))
eval_paths = glob.glob(os.path.join(OUTPUT_DIR, "valid/*.parquet"))

In [13]:
model.compile(optimizer="adam", run_eagerly=False)

In [14]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(train_paths) 
losses = model.fit(train_loader, epochs=3)
model.reset_metrics()

# Evaluate
print('*'*20)
print("Start evaluation")
eval_loader = get_dataloader(eval_paths) 
eval_metrics = model.evaluate(eval_loader, return_dict=True)

print('*'*20)
print("Eval results")
print('\n' + '*'*20 + '\n')
for key in sorted(eval_metrics.keys()):
    print(" %s = %s" % (key, str(eval_metrics[key]))) 

********************
Launch training
********************

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: annotated name 'output' can't be nonlocal (tmpn84hxazp.py, line 36)


2021-12-08 23:45:26.273241: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/3
Epoch 2/3
Epoch 3/3
********************
Start evaluation
********************
Eval results

********************

 loss = 0.472271203994751
 rating/binary_classification_task/auc = 0.8262367248535156
 rating/binary_classification_task/binary_accuracy = 0.762577474117279
 rating/binary_classification_task/precision = 0.7839897871017456
 rating/binary_classification_task/recall = 0.856126070022583
 regularization_loss = 0
 total_loss = 0.472271203994751
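
For readers who want to see how binary accuracy, precision, and recall relate to the predicted probabilities, here is a small pure-Python sketch on made-up labels and scores. It assumes predictions strictly above a 0.5 threshold count as positive, which is the Keras default for these metrics; the helper `binary_metrics` and the toy data are hypothetical and unrelated to the results above.

```python
def binary_metrics(labels, probs, threshold=0.5):
    """Compute accuracy, precision and recall for thresholded probabilities."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return {
        "binary_accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy data, not taken from the MovieLens evaluation
labels = [1, 0, 1, 1, 0, 1]
probs = [0.9, 0.6, 0.4, 0.8, 0.2, 0.7]
metrics = binary_metrics(labels, probs)
print(metrics)
```

Unlike these three metrics, AUC does not depend on a single threshold, since it summarizes ranking quality across all thresholds.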


### Perform Prediction

Let's use the validation set to perform prediction for a given user.

In [5]:
valid = pd.read_parquet("/workspace/data/movielens/valid/part_0.parquet")

In [6]:
batch = valid[valid['userId']==69789].reset_index(drop=True)
batch.head()

| | movieId | userId | genres | rating | TE_movieId_rating | userId_count |
|---|---|---|---|---|---|---|
| 0 | 1207 | 69789 | [1, 4] | 0 | 0.869776 | 1.66064 |
| 1 | 484 | 69789 | [2, 12] | 1 | 0.47538 | 1.66064 |
| 2 | 574 | 69789 | [2, 8, 1, 16] | 1 | 0.717452 | 1.66064 |
| 3 | 91 | 69789 | [3, 5, 7, 4] | 0 | 0.702393 | 1.66064 |
| 4 | 517 | 69789 | [11, 6, 7, 4] | 0 | 0.680634 | 1.66064 |


We first need to pad the `genres` column so that we can create a dictionary of tensors to serve as input to `model.predict()`. We could also use the NVTabular `ListSlice` op for that, but since we will only process a couple of rows, we can do it with the function defined below.

In [17]:
def padding(s, seq_length=10):
    padding_len = seq_length - len(s)
    padded = np.pad(s, (0, padding_len), 'constant', constant_values=(0))
    return padded

padded_genres = batch['genres'].apply(padding)
batch.loc[:,'genres'] = padded_genres.values

In [18]:
# Convert our dataframe to a dictionary of tensors
def tf_tensor_dict(df):
    import tensorflow as tf

    data = df.to_dict("list")

    return {key: tf.convert_to_tensor(value) for key, value in data.items()}

In [19]:
batch_tensor = tf_tensor_dict(batch)
batch_tensor

{'movieId': <tf.Tensor: shape=(22,), dtype=int32, numpy=
 array([1207,  484,  574,   91,  517,  221,   80,  205,  125,  808,  104,
         630,  331,    1,  486,  192,  307,  292,  129,  413, 1294,  526],
       dtype=int32)>,
 'userId': <tf.Tensor: shape=(22,), dtype=int32, numpy=
 array([69789, 69789, 69789, 69789, 69789, 69789, 69789, 69789, 69789,
        69789, 69789, 69789, 69789, 69789, 69789, 69789, 69789, 69789,
        69789, 69789, 69789, 69789], dtype=int32)>,
 'genres': <tf.Tensor: shape=(22, 10), dtype=int32, numpy=
 array([[ 1,  4,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 2, 12,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 2,  8,  1, 16,  0,  0,  0,  0,  0,  0],
        [ 3,  5,  7,  4,  0,  0,  0,  0,  0,  0],
        [11,  6,  7,  4,  0,  0,  0,  0,  0,  0],
        [ 5,  2,  1,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  6,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 3,  5,  7,  0,  0,  0,  0,  0,  0,  0],
        [ 3, 

In [20]:
# Perform prediction for userId 69789
predictions = model.predict(batch_tensor)
predictions

array([[0.52605325],
       [0.17234999],
       [0.45934787],
       [0.3444392 ],
       [0.29376656],
       [0.33181232],
       [0.67826164],
       [0.49523795],
       [0.5502221 ],
       [0.64141977],
       [0.70937765],
       [0.8737097 ],
       [0.37274352],
       [0.8956236 ],
       [0.5559566 ],
       [0.49068797],
       [0.6118884 ],
       [0.1779553 ],
       [0.61409414],
       [0.6127957 ],
       [0.7583447 ],
       [0.47995976]], dtype=float32)

The predictions are probabilities that show the likelihood of a user liking a movie. Which movie will user `69789` like most?

In [22]:
batch['predict_proba'] = predictions

# select the row where the estimated probability is highest
batch[batch['predict_proba']==batch['predict_proba'].max()][['movieId', 'userId']]

| | movieId | userId |
|---|---|---|
| 13 | 1 | 69789 |


Based on the estimated probabilities, we can say that user `69789` would like movie `1` most, with a probability of `0.8956236`.
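
The same idea extends from the single best movie to a ranked short list. The sketch below reuses the movie ids and predicted probabilities printed above (flattened to one dimension) and picks the top `k` candidates with NumPy; the variable names `k`, `top_idx`, and `top_movies` are illustrative.

```python
import numpy as np

# Movie ids and predicted probabilities for userId 69789, copied from the outputs above
movie_ids = np.array([1207, 484, 574, 91, 517, 221, 80, 205, 125, 808, 104,
                      630, 331, 1, 486, 192, 307, 292, 129, 413, 1294, 526])
predictions = np.array([0.52605325, 0.17234999, 0.45934787, 0.3444392, 0.29376656,
                        0.33181232, 0.67826164, 0.49523795, 0.5502221, 0.64141977,
                        0.70937765, 0.8737097, 0.37274352, 0.8956236, 0.5559566,
                        0.49068797, 0.6118884, 0.1779553, 0.61409414, 0.6127957,
                        0.7583447, 0.47995976])

k = 3
top_idx = np.argsort(predictions)[::-1][:k]  # indices of the k highest probabilities
top_movies = movie_ids[top_idx]
print(top_movies)  # → [   1  630 1294]
```

Sorting by predicted probability is the usual way to turn per-item scores like these into a recommendation list.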