In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

## Training a DLRM model with TensorFlow

In the previous notebooks, we downloaded the MovieLens data, converted it to parquet files, and then used the NVTabular library to process the data, join data frames, and create input features. In this notebook, we will use the NVIDIA Merlin Models library to build and train a Deep Learning Recommendation Model [(DLRM)](https://arxiv.org/abs/1906.00091), an architecture originally proposed by Facebook in 2019.

Figure 1 illustrates the DLRM architecture. The model was introduced as a personalization deep learning model that uses embeddings to process sparse features representing categorical data and a multilayer perceptron (MLP) to process dense features, and then explicitly models the interactions between these features using the statistical techniques proposed [here](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5694074).

![DLRM](../images/DLRM.png)

<p>Figure 1. DLRM architecture. Image source: <a href="https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/DLRM">NVIDIA DL Examples</a></p>

DLRM accepts two types of features: categorical and numerical.
- For each categorical feature, an embedding table provides a dense representation of each unique value.
- Numerical features are fed to the model as dense features and then transformed by a simple neural network referred to as the "bottom MLP". This part of the network consists of a series of linear layers with ReLU activations.
- The output of the bottom MLP and the embedding vectors are then fed into the `dot product interaction` operation (see the Pairwise interaction step). The output of the dot interaction is then concatenated with the features resulting from the bottom MLP (we apply a skip-connection there) and fed into the "top MLP", which is also a series of dense layers with activations (a fully connected NN).
- The model outputs a single number (here we use a sigmoid function to generate probabilities), which can be interpreted as the likelihood of a certain user clicking on an ad, watching a movie, or viewing a news page.
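
The steps listed above can be sketched in plain NumPy. This is a minimal illustration of the data flow only, not the Merlin Models implementation: the helper names (`mlp`, `init_mlp`, `dlrm_forward`), the layer sizes, the random weights, and the toy inputs are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, layers):
    # Apply a stack of linear layers, each followed by a ReLU activation
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

def init_mlp(in_dim, layer_dims):
    # Small random weights and zero biases (illustrative initialization only)
    layers = []
    for out_dim in layer_dims:
        layers.append((rng.normal(0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)))
        in_dim = out_dim
    return layers

def dlrm_forward(cat_ids, dense, tables, bottom, top, w_out, b_out):
    # 1. Embedding lookup: one dense vector per categorical feature
    emb = [table[ids] for table, ids in zip(tables, cat_ids)]  # each (batch, dim)
    # 2. Bottom MLP transforms the numerical features
    bottom_out = mlp(dense, bottom)                            # (batch, dim)
    # 3. Pairwise dot products between all embedding / bottom-MLP vectors
    stacked = np.stack(emb + [bottom_out], axis=1)             # (batch, n, dim)
    dots = stacked @ stacked.transpose(0, 2, 1)                # (batch, n, n)
    rows, cols = np.tril_indices(stacked.shape[1], k=-1)       # each pair once
    interactions = dots[:, rows, cols]                         # (batch, n*(n-1)/2)
    # 4. Skip-connection: concatenate the bottom MLP output with the interactions
    top_in = np.concatenate([bottom_out, interactions], axis=1)
    # 5. Top MLP, then a single sigmoid output
    logits = mlp(top_in, top) @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logits))

dim = 8                                            # toy embedding size
tables = [rng.normal(0, 0.1, (10, dim)),           # toy table with 10 unique values
          rng.normal(0, 0.1, (5, dim))]            # toy table with 5 unique values
cat_ids = [np.array([0, 3, 9, 3]), np.array([4, 0, 1, 2])]
dense = rng.normal(size=(4, 3))                    # 4 rows, 3 numerical features
bottom = init_mlp(3, [16, dim])                    # last layer must match dim
top = init_mlp(dim + 3, [16, 8])                   # 3 vectors -> 3 pairwise terms
w_out, b_out = rng.normal(0, 0.1, (8, 1)), np.zeros(1)

probs = dlrm_forward(cat_ids, dense, tables, bottom, top, w_out, b_out)
print(probs.shape)
```

Note that the bottom MLP must end in a layer of the same width as the embeddings, otherwise the vectors cannot be stacked for the dot products.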

## Import Libraries

In [2]:
import os
import glob
import numpy as np
import pandas as pd
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

import merlin_models.tf as ml
from merlin_standard_lib import Schema, Tag

2021-12-13 21:42:04.377035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


In [3]:
import logging
# disable WARNING, INFO and DEBUG logging everywhere
logging.disable(logging.WARNING)
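
`logging.disable(level)` suppresses every message at that level and below, so the call above silences `WARNING` as well as `INFO` and `DEBUG`, while `ERROR` and `CRITICAL` still pass. A small check using only the standard library (the logger name `demo` is arbitrary):

```python
import logging

logger = logging.getLogger("demo")  # arbitrary logger name for this check

logging.disable(logging.WARNING)
warning_suppressed = not logger.isEnabledFor(logging.WARNING)
error_enabled = logger.isEnabledFor(logging.ERROR)

# Restore normal behaviour so later logging is unaffected
logging.disable(logging.NOTSET)

print(warning_suppressed, error_enabled)
```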

In [4]:
# Avoid Numba low occupancy warnings
from numba import config
config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The Merlin Models library relies on a `schema` object to automatically build all the layers needed to represent, normalize, and aggregate input features. As you can see below, `schema.pbtxt` is a protobuf text file that contains feature metadata, including statistics such as cardinality and min and max values, and that tags features based on their characteristics and dtypes (e.g., categorical, continuous, list, integer).

We have already generated our `schema.pbtxt` file in the previous notebook using NVTabular. Now we read this schema file to create a `schema` object.

In [5]:
from merlin_standard_lib import Schema
SCHEMA_PATH = "/workspace/data/movielens/train/schema.pbtxt"
schema = Schema().from_proto_text(SCHEMA_PATH)
!head -30 $SCHEMA_PATH

feature {
  name: "movieId"
  type: INT
  int_domain {
    name: "movieId"
    min: 0
    max: 56690
    is_categorical: true
  }
  annotation {
    tag: "item"
    tag: "categorical"
    tag: "item_id"
    extra_metadata {
      type_url: "type.googleapis.com/google.protobuf.Struct"
      value: "\n\021\n\013num_buckets\022\002\010\000\n\033\n\016freq_threshold\022\t\021\000\000\000\000\000\000\000\000\n\025\n\010max_size\022\t\021\000\000\000\000\000\000\000\000\n\030\n\013start_index\022\t\021\000\000\000\000\000\000\000\000\n2\n\010cat_path\022&\032$.//categories/unique.movieId.parquet\nG\n\017embedding_sizes\0224*2\n\030\n\013cardinality\022\t\021\000\000\000\000@\256\353@\n\026\n\tdimension\022\t\021\000\000\000\000\000\000\200@"
    }
  }
}
feature {
  name: "userId"
  type: INT
  int_domain {
    name: "userId"
    min: 0
    max: 162542
    is_categorical: true
  }
  annotation {
    tag: "user"


Let's remove the original `rating` and `title` columns from the schema because we do not want to feed them to the model.

In [6]:
schema = schema.remove_by_name(['rating', 'title'])

We can easily print the feature names in the schema, including the binary target column, `rating_b`.

In [7]:
schema.column_names

['movieId',
 'userId',
 'genres',
 'rating_b',
 'TE_movieId_rating',
 'userId_count']

## Define the Input module

Below we define our input blocks using the `ml.EmbeddingFeatures` and `ml.ContinuousFeatures` classes. The `from_schema()` method processes the schema and creates the necessary layers to represent features and aggregate them.

In the following cells, the whole model is built with a few lines of code. Here is a brief explanation of the main classes and functions:

- [DotProductInteraction](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/layers/interaction.py#L22) class implements the factorization machine style feature interaction layer suggested by the DLRM and DeepFM architectures. Here we do not feed an interaction type, and the `None` interaction type defaults to the standard factorization machine style interaction.
- [TabularBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/core.py) is a subclass of the `Block` class that accepts a dictionary of tensors as input and supports the integration of many commonly used operations. This class has additional methods to apply transformations and aggregations to inputs for pre- and post-processing.
- [ParallelBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/core.py) class merges multiple layers or TabularBlocks into a single output of TabularData, which is a dictionary of tensors. In this example, this class outputs two parallel layers of continuous and categorical blocks.
- [BinaryClassificationTask](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/prediction/classification.py#L30) supports the binary prediction task. We also support other prediction tasks, like next-item prediction and regression.

Select continuous and categorical columns from schema using feature tags.

In [8]:
con_schema = schema.select_by_tag(Tag.CONTINUOUS)
cat_schema = schema.select_by_tag(Tag.CATEGORICAL)
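
To make the idea of tag-based selection concrete, here is a minimal sketch that uses plain Python dictionaries instead of the real `Schema` class. The list-of-dicts structure, the helper functions, and the tag assignments are hypothetical and only mimic the behaviour described above. They are not the `merlin_standard_lib` API.

```python
# Hypothetical, simplified stand-in for a schema: one dict per feature.
# The tag assignments below are assumptions made for illustration only.
features = [
    {"name": "movieId", "tags": {"categorical", "item"}},
    {"name": "userId", "tags": {"categorical", "user"}},
    {"name": "genres", "tags": {"categorical"}},
    {"name": "rating_b", "tags": {"target"}},
    {"name": "TE_movieId_rating", "tags": {"continuous"}},
    {"name": "userId_count", "tags": {"continuous"}},
]

def select_by_tag(features, tag):
    """Keep only the features annotated with the given tag."""
    return [f for f in features if tag in f["tags"]]

def column_names(features):
    """Return the feature names in their original order."""
    return [f["name"] for f in features]

cat_names = column_names(select_by_tag(features, "categorical"))
con_names = column_names(select_by_tag(features, "continuous"))
print(cat_names)
print(con_names)
```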

In the DLRM architecture, categorical features are processed using embeddings. Below, for each categorical feature, we create an embedding table used to provide dense representation to each unique value of this feature. The dense vector values in the embedding tables are learned during model training.

In [9]:
embedding_dim = 64

embeddings = ml.EmbeddingFeatures.from_schema(
    cat_schema, options=ml.EmbeddingOptions(embedding_dim_default=embedding_dim)
)

We use the `ContinuousFeatures` layer to build the dense layer for the continuous features and then feed it to the MLP layer (bottom block) with the `connect` method.

In [10]:
bottom_block = ml.MLPBlock([128, 64])
bottom_block = ml.ContinuousFeatures.from_schema(con_schema).connect(bottom_block)

The `ParallelBlock` class outputs two parallel layers of continuous and categorical blocks, so that we can compute the dot product easily.

In [11]:
interaction_inputs = ml.ParallelBlock({"embeddings": embeddings, "bottom_block": bottom_block})

Below, we create the `dot product interaction` by taking the dot product of the bottom MLP layer output and the embedding layer created from the categorical features. Then we apply the `skip-connection` by concatenating the bottom MLP results with the interaction layer results.

In [12]:
def DotProductInteractionBlock():
    return ml.SequentialBlock(ml.DotProductInteraction(), pre_aggregation="stack")

In [13]:
top_block_inputs = interaction_inputs.connect_with_shortcut(
    DotProductInteractionBlock(), shortcut_filter=ml.Filter("continuous"), aggregation="concat"
)
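
Stacking the inputs for the dot product only works if every vector has the same width, which is why the bottom MLP above ends in a 64-unit layer that matches `embedding_dim = 64`. The following back-of-the-envelope count shows how wide the input to the top MLP would be. It assumes that each unordered pair of vectors contributes one dot product, that self-interactions are excluded, and that the multi-hot `genres` embeddings are aggregated into a single vector. These are assumptions about the layer defaults and have not been checked against the library.

```python
from itertools import combinations

embedding_dim = 64  # also the width of the last bottom MLP layer

# One vector per categorical feature, plus the bottom MLP output
vectors = ["movieId", "userId", "genres", "bottom_mlp"]

# Assumption: one dot product per unordered pair, no self-interactions
pairs = list(combinations(vectors, 2))
num_interactions = len(pairs)

# Skip-connection: interactions are concatenated with the bottom MLP output
top_input_dim = num_interactions + embedding_dim
print(num_interactions, top_input_dim)
```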

We then create the top MLP block and feed our concatenated features to the top block.

In [14]:
top_block = ml.MLPBlock([128, 32])
top_block_outputs = top_block_inputs.connect(top_block)

Finally, we connect our top block to the `BinaryClassificationTask` head to be able to do binary classification, and create our `model` object.

In [15]:
model = top_block_outputs.connect(ml.BinaryClassificationTask("rating_b"))

In addition to this low-level API code, we also have a high-level API that lets you define a DLRM model in a single statement, as follows:

```
ml.DLRMBlock(
    schema, bottom_block=ml.MLPBlock([128, 64]), top_block=ml.MLPBlock([128, 64])
).connect(ml.BinaryClassificationTask("rating_b"))
```

### Define Data Loader

We're ready to start training. We'll use the NVTabular `KerasSequenceLoader` to read chunks of parquet files. `KerasSequenceLoader` manages shuffling by loading chunks of data from different parts of the full dataset, concatenating them, shuffling the result, and then iterating through this super-chunk sequentially in batches. The number of "parts" of the dataset that get sampled, or "partitions", is controlled by the `parts_per_chunk` kwarg, while the size of each of these parts is controlled by the `buffer_size` kwarg, which refers to a fraction of available GPU memory (you can read more about it [here](https://nvidia-merlin.github.io/NVTabular/main/training/tensorflow.html) and [here](https://nvidia-merlin.github.io/NVTabular/main/api/tensorflow_dataloader.html?highlight=kerassequence#nvtabular.loader.tensorflow.KerasSequenceLoader)). Using more chunks leads to better randomness, especially at the epoch level where physically disparate samples can be brought into the same batch, but can reduce throughput if you use too many. In any case, the speed of the parquet reader makes much larger buffer sizes feasible.
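
The chunked shuffling strategy described above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the `KerasSequenceLoader` implementation: the function name, the toy partitions, and the choice to visit partitions in random order are assumptions.

```python
import random

def chunked_shuffle_batches(partitions, parts_per_chunk, batch_size, seed=0):
    """Yield batches by shuffling within groups ("super-chunks") of partitions."""
    rng = random.Random(seed)
    order = list(range(len(partitions)))
    rng.shuffle(order)  # draw partitions from different places in the dataset
    for start in range(0, len(order), parts_per_chunk):
        # Concatenate several partitions into one super-chunk
        chunk = []
        for idx in order[start:start + parts_per_chunk]:
            chunk.extend(partitions[idx])
        rng.shuffle(chunk)  # shuffle rows inside the super-chunk
        # Iterate through the super-chunk sequentially in batches
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]

# Toy dataset: 6 partitions of 10 consecutive row ids each
partitions = [list(range(p * 10, (p + 1) * 10)) for p in range(6)]
batches = list(chunked_shuffle_batches(partitions, parts_per_chunk=3, batch_size=8))
seen = sorted(row for batch in batches for row in batch)
print(len(batches))
```

With a larger `parts_per_chunk`, rows from more partitions can end up in the same batch, at the cost of holding a bigger super-chunk in memory.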

Note that the `genres` column is a multi-hot column: it is fed to the dataloader as a sparse tensor and then converted to a dense representation. Based on our analysis, the `genres` column has at most 10 entries per row, so we set the sequence length for the multi-hot columns to 10 in the `sparse_features_max` dictionary below.

In [16]:
# Define categorical and continuous columns
x_cat_names, x_cont_names = ['userId', 'movieId', 'genres'], ['TE_movieId_rating','userId_count']

# dictionary representing max sequence length for each column
sparse_features_max = {'genres': 10}

def get_dataloader(paths_or_dataset, batch_size=4096):
    dataloader = KerasSequenceLoader(
        paths_or_dataset,
        batch_size=batch_size,
        label_names=['rating_b'],
        cat_names=x_cat_names,
        cont_names=x_cont_names,
        sparse_names=list(sparse_features_max.keys()),
        sparse_max=sparse_features_max,
        sparse_as_dense=True,
    )
    return dataloader.map(lambda X, y: (X, tf.reshape(y, (-1,))))

### Start Training and Evaluation

In [17]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/movielens/")
train_paths = glob.glob(os.path.join(OUTPUT_DIR, "train/*.parquet"))
eval_paths = glob.glob(os.path.join(OUTPUT_DIR, "valid/*.parquet"))

In [18]:
import tensorflow as tf
model.compile(optimizer="adam", run_eagerly=False)

In [20]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(train_paths) 
losses = model.fit(train_loader, epochs=3)
model.reset_metrics()

# Evaluate
print('*'*20)
print("Start evaluation")
eval_loader = get_dataloader(eval_paths) 
eval_metrics = model.evaluate(eval_loader, return_dict=True)

print('*'*20 + '\n')
print("Eval results")
print('\n' + '*'*20 + '\n')
for key in sorted(eval_metrics.keys()):
    print(" %s = %s" % (key, str(eval_metrics[key]))) 

********************
Launch training
********************

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: annotated name 'output' can't be nonlocal (tmpv3h2awnj.py, line 36)


2021-12-13 20:03:55.020480: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/3
Epoch 2/3
Epoch 3/3
********************
Start evaluation
********************

Eval results

********************

 loss = 0.49038803577423096
 rating_b/binary_classification_task/auc = 0.8250362873077393
 rating_b/binary_classification_task/binary_accuracy = 0.7617262601852417
 rating_b/binary_classification_task/precision = 0.7810465693473816
 rating_b/binary_classification_task/recall = 0.860145628452301
 regularization_loss = 0
 total_loss = 0.49038803577423096
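
For reference, the accuracy, precision, and recall reported above can be computed from predicted probabilities as in the sketch below. The labels and probabilities are made up for the example, and the 0.5 decision threshold is an assumption (it is the usual default for binary metrics).

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision and recall for binary labels."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 1, 0, 1, 0]
y_prob = [0.9, 0.4, 0.2, 0.8, 0.7]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```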


### Perform Prediction

Let's use the validation set and perform prediction for a given user.

In [134]:
valid = pd.read_parquet("/workspace/data/movielens/valid/part_0.parquet")

In [135]:
batch = valid[valid['userId']==15488].reset_index(drop=True)
batch.head()

Unnamed: 0,movieId,userId,genres,rating_b,TE_movieId_rating,userId_count,rating,title
0,79,15488,"[1, 6]",1,0.845424,1.899144,5.0,"Beautiful Mind, A (2001)"
1,175,15488,"[2, 6]",1,0.708734,1.899144,4.0,While You Were Sleeping (1995)
2,2371,15488,"[2, 8]",0,0.213532,1.899144,1.0,Police Academy 6: City Under Siege (1989)
3,1580,15488,[2],1,0.706598,1.899144,5.0,Major League (1989)
4,471,15488,"[3, 11]",1,0.710984,1.899144,4.0,"Thomas Crown Affair, The (1999)"


Filter out the columns that are not used in the model training.

In [136]:
batch_input = batch[['movieId', 'userId', 'genres', 'TE_movieId_rating','userId_count']]

We first need to pad the `genres` column to be able to create a dictionary of tensors to serve as input to `model.predict()`. We could also use the NVTabular `ListSlice` op for that, but since we will only process a few rows, we can do it with the function defined below.

In [137]:
def padding(s, seq_length=10):
    padding_len = seq_length - len(s)
    padded = np.pad(s, (0, padding_len), 'constant', constant_values=(0))
    return padded

padded_genres = batch_input['genres'].apply(padding)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
batch_input = batch_input.assign(genres=padded_genres)


Convert our dataframe to a dictionary of tensors to feed to the model.

In [138]:
def tf_tensor_dict(df):
    data = df.to_dict("list")
    return {key: tf.convert_to_tensor(value) for key, value in data.items()}

In [139]:
batch_tensor = tf_tensor_dict(batch_input)

In [140]:
# Perform prediction for userId 15488
predictions = model.predict(batch_tensor)

The predictions are probabilities that show how likely the user is to like each movie.

In [141]:
batch['predict_proba'] = predictions
batch[['userId', 'movieId', 'title','rating', 'rating_b', 'predict_proba']]

Unnamed: 0,userId,movieId,title,rating,rating_b,predict_proba
0,15488,79,"Beautiful Mind, A (2001)",5.0,1,0.999110
1,15488,175,While You Were Sleeping (1995),4.0,1,0.987507
2,15488,2371,Police Academy 6: City Under Siege (1989),1.0,0,0.110548
3,15488,1580,Major League (1989),5.0,1,0.984806
4,15488,471,"Thomas Crown Affair, The (1999)",4.0,1,0.992783
...,...,...,...,...,...,...
69,15488,70,Léon: The Professional (a.k.a. The Professiona...,5.0,1,0.997819
70,15488,183,Amadeus (1984),4.5,1,0.998248
71,15488,3297,Never Say Never Again (1983),3.5,1,0.982946
72,15488,1062,"Transporter, The (2002)",4.0,1,0.985573


Let's find the top-5 movies that can be recommended to user `15488` based on the predicted probabilities.

In [142]:
ranked = np.argsort(predictions, axis=0)
indices = ranked[::-1][:5].reshape(-1)
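
The ranking step above relies on `np.argsort` returning indices in ascending order, so the result is reversed before taking the first `k` entries. A small check on made-up scores, shaped `(n_items, 1)` like the model output, together with an equivalent variant that sorts the negated scores:

```python
import numpy as np

# Made-up scores shaped (n_items, 1), for illustration only
scores = np.array([[0.2], [0.9], [0.5], [0.7]])
k = 2

# Ascending argsort along the item axis, then reverse and keep the first k
ranked = np.argsort(scores, axis=0)
top_k = ranked[::-1][:k].reshape(-1)

# Equivalent: sort the negated scores so the largest come first
top_k_alt = np.argsort(-scores.reshape(-1))[:k]

print(top_k, top_k_alt)
```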

In [132]:
batch.iloc[indices, :][['movieId', 'userId', 'rating', 'predict_proba', 'title']]

Unnamed: 0,movieId,userId,rating,predict_proba,title
0,79,15488,5.0,0.99911,"Beautiful Mind, A (2001)"
67,213,15488,5.0,0.999062,"Lock, Stock & Two Smoking Barrels (1998)"
16,2,15488,5.0,0.998943,Forrest Gump (1994)
33,73,15488,4.5,0.998461,Goodfellas (1990)
68,26,15488,5.0,0.998405,Apollo 13 (1995)


We can see that the model predicted the top-5 movies with high confidence, and the user's actual ratings for these movies (4.5 to 5.0) are consistent with the model's predictions.