In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

## MERLIN MODELS

Recommender systems (RecSys) attempt to accurately predict users' preferences and help users find and select products (items) from a given catalog, thereby enhancing their engagement and overall satisfaction with online services. Several traditional (e.g. collaborative filtering) and deep learning (DL) based RecSys models have been built and used in industry within the last decade, and DL-based recommender systems have recently become the norm for large-scale industry applications. In contrast to other domains, however, DL recommender models aren't able to effectively leverage the parallel compute and high memory bandwidth that GPUs have to offer. To address the challenges of preprocessing, training and deploying recommender systems at scale, NVIDIA developed an open source framework called [Merlin](https://github.com/NVIDIA-Merlin). One of the libraries of the Merlin framework is Merlin Models, which is being developed to provide all the tools needed to design an end-to-end RecSys pipeline. It offers flexibility at each stage: multiple input processing/representation modules, different layers for designing the model's architecture, different prediction heads, loss functions, different negative sampling techniques, etc.

The goal of the `Merlin Models` library is to make it easy for users in industry or academia to train and deploy recommender models, with best practices baked into the library. This lets users in industry easily train standard models on their own datasets and get high-performance, GPU-accelerated models into production. It also lets researchers build custom models by incorporating standard components of deep learning recommender models, and then benchmark their new models on example offline datasets.

### Core features

- Support for diverse input types (e.g. categorical and continuous), architectures (e.g. two-tower or sequential) or tasks (e.g. binary, multi-class classification, multi-task)
- Flexible building blocks to define a broad range of models with various prediction tasks, loss functions and negative sampling techniques 
- Flexible APIs targeted to both production and research
- Unified API enables users to create models in TensorFlow or PyTorch
- GPU-optimized data-loading, model training and serving
- Deep integration with NVTabular for ETL and model serving
- Multi-task learning is a first-class citizen
- State-of-the-art negative sampling techniques

### Learning objectives

In this introductory notebook we aim to

- introduce the building blocks of the Merlin Models library
- show how to seamlessly integrate the NVTabular and Merlin Models libraries
- train DL-based recommender models for the `BinaryClassification` task with optimized TensorFlow data loaders in only a few lines of code

## Import Libraries

In [2]:
import os
import glob
import nvtabular as nvt
import merlin_models.tf as mm

from merlin.schema import Schema
from merlin.schema.io.tensorflow_metadata import TensorflowMetadata
from merlin.schema.tags import Tags

2022-02-21 16:12:18.793008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


## ETL with NVTabular

We use the [MovieLens25M](https://grouplens.org/datasets/movielens/25m/) dataset, which is a popular dataset for recommender systems and is widely used in academic publications. We developed a utility function for downloading, converting and preprocessing the data. For data preprocessing and feature engineering, we use the [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) library.

In [3]:
# disable WARNING, INFO and DEBUG logging everywhere
import logging
logging.disable(logging.WARNING)

Avoid Numba low occupancy warnings

We define our base input directory, containing the data.

In [4]:
INPUT_DATA_DIR = os.environ.get(
    "INPUT_DATA_DIR", os.path.expanduser("/workspace/data/movielens/")
)

With the help of a utility function, we first download and unzip the data. Second, we apply basic preprocessing, split the data into train and validation sets, and save them as parquet files. Afterwards, we preprocess the train and validation parquet files and generate features for model training using NVTabular.

Let's download the MovieLens 25M dataset, process it, and save the resulting files to disk in parquet format.

In [5]:
from merlin_standard_lib.utils.data_etl_utils import movielens_download_etl
movielens_download_etl(INPUT_DATA_DIR, 'ml-25m')

You can visit `01-Download-Convert.ipynb` and `02-ETL-with-NVTabular.ipynb` in the `ETL` folder to see the same pipeline coded in Jupyter notebooks.

### Schema Object

The Merlin Models library relies on a schema object that describes the input features; from it, the library automatically builds all the layers needed to represent, normalize and aggregate those features. As you can see below, `schema.pbtxt` is a protobuf text file that contains feature metadata, including statistics such as cardinality and min and max values, as well as tags based on the features' characteristics and dtypes (e.g., categorical, continuous, list, item_id). We can tag our target column, and even add the prediction task (such as binary, regression or multiclass) as a tag for the target column in the `schema.pbtxt` file. The schema provides a standard representation for metadata that is useful when training machine learning or deep learning models.

The metadata and tags loaded from the schema are used to automatically set the parameters of Merlin models. Certain modules have a `from_schema()` method that instantiates their parameters and layers from the schema.

Our `schema.pbtxt` file has already been generated with `NVTabular` during the ETL step. We start by reading this schema file to create a schema object.
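
To make the idea of tag-driven model building more concrete, here is a minimal, library-free sketch. It is not the `merlin.schema` API: `toy_schema`, `select_by_tag` and `embedding_table_shapes` are hypothetical names used only for illustration. The `max` values mirror the `schema.pbtxt` excerpt shown in this notebook, and the embedding dimension of 64 follows the default mentioned later in the model cell; the target tag is an assumption.

```python
# Toy stand-in for a schema: plain dicts, NOT the merlin.schema API.
toy_schema = [
    {"name": "movieId", "tags": {"categorical", "item_id", "item"}, "max": 56586},
    {"name": "userId", "tags": {"categorical"}, "max": 162542},
    {"name": "TE_movieId_rating", "tags": {"continuous"}},
    {"name": "rating_binary", "tags": {"target"}},  # hypothetical target tag
]


def select_by_tag(schema, tag):
    """Return the features that carry the given tag."""
    return [feature for feature in schema if tag in feature["tags"]]


def embedding_table_shapes(schema, dim=64):
    """Derive one (num_ids, dim) embedding table per categorical feature.

    Ids are assumed to run from 0 to `max`, so a table needs max + 1 rows.
    """
    return {
        feature["name"]: (feature["max"] + 1, dim)
        for feature in select_by_tag(schema, "categorical")
    }


shapes = embedding_table_shapes(toy_schema)
continuous_names = [f["name"] for f in select_by_tag(toy_schema, "continuous")]
print(shapes)
print(continuous_names)
```

The point of the sketch is only that, once features are tagged, the code that builds layers never has to hard-code column names.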

In [6]:
SCHEMA_PATH = os.path.join(INPUT_DATA_DIR, 'ml-25m/train/schema.pbtxt')

# Read schema.pbtxt from the directory that contains SCHEMA_PATH, so INPUT_DATA_DIR is respected
schema = TensorflowMetadata.from_proto_text_file(os.path.dirname(SCHEMA_PATH)).to_merlin_schema()

!head -30 $SCHEMA_PATH

feature {
  name: "movieId"
  type: INT
  int_domain {
    name: "movieId"
    min: 0
    max: 56586
    is_categorical: true
  }
  annotation {
    tag: "categorical"
    tag: "item_id"
    tag: "item"
    extra_metadata {
      type_url: "type.googleapis.com/google.protobuf.Struct"
      value: "\n\021\n\013num_buckets\022\002\010\000\n\033\n\016freq_threshold\022\t\021\000\000\000\000\000\000\000\000\n\025\n\010max_size\022\t\021\000\000\000\000\000\000\000\000\n\030\n\013start_index\022\t\021\000\000\000\000\000\000\000\000\n2\n\010cat_path\022&\032$.//categories/unique.movieId.parquet\nG\n\017embedding_sizes\0224*2\n\030\n\013cardinality\022\t\021\000\000\000\000@\241\353@\n\026\n\tdimension\022\t\021\000\000\000\000\000\000\200@"
    }
  }
}
feature {
  name: "userId"
  type: INT
  int_domain {
    name: "userId"
    min: 0
    max: 162542
    is_categorical: true
  }
  annotation {
    tag: "categorical"


## Building an MLP with Merlin Models

Let's start by building an MLP model for a binary classification task using the Merlin Models library. As a starting point, we use only the `['movieId', 'userId', 'genres']` input features. Note that our target column is `rating_binary`, which was created as a binary target column in the `02-ETL-with-NVTabular` notebook.

We can easily remove the features we do not want to feed to the model with the `without` method of the schema object, which returns a new schema.

In [7]:
schema = schema.without(['rating', 'title', 'TE_movieId_rating', 'userId_count'])

In [8]:
schema.column_names

['movieId', 'userId', 'genres', 'rating_binary']

Below we build a TF MLP model using the following building blocks:
- [InputBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/block/inputs.py) is the entry block of the model. It accepts a `schema` object and other arguments such as `aggregation`, continuous projection and embedding options. This function creates the continuous and embedding layers and connects them via a [ParallelBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/core.py#L1184). If the `aggregation` argument is not set, it returns a dictionary of tensors, one per input feature; otherwise it merges the tensors into one using the given aggregation method.
- [ParallelBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/core.py#L1184) takes a list of layers as input, stacks them in parallel, and outputs a dictionary of tensors.
- [MLPBlock](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/block/mlp.py) defines a multi-layer perceptron block, where each dimension is used to create a fully connected Dense layer. In addition to the `dimensions` argument, we can also pass arguments such as `activation`, `use_bias` and `dropout` to `MLPBlock`.
- [BinaryClassificationTask](https://github.com/NVIDIA-Merlin/models/blob/main/merlin_models/tf/prediction/classification.py#L30) supports the binary prediction task. The Merlin Models library also supports other prediction tasks, such as next-item prediction and regression.

In [9]:
# default emb_dim is 64
model = mm.Model(
    mm.InputBlock(schema), 
    mm.MLPBlock([64, 32]),
    mm.BinaryClassificationTask("rating_binary")
)

As you can see, we start with an `InputBlock`, then create an `MLPBlock`, and finally connect the `MLPBlock` to the `BinaryClassificationTask` head to perform binary classification; together these form our model. This is similar to linearly stacking layers into a `tf.keras.Model`, as done with `tf.keras.Sequential`.
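
The stacking can be made concrete with a little arithmetic. The sketch below is plain Python, not Merlin Models code, and the helper names are hypothetical. It assumes that the three input features are each embedded into 64 dimensions and concatenated before the MLP, and that the binary head is a single Dense unit; the real `InputBlock` may aggregate features differently, so treat the numbers as an illustration of how `[64, 32]` maps to Dense layers rather than as the exact parameter count of the model above.

```python
def dense_layer_shapes(input_dim, dimensions):
    """Return the (in_features, out_features) shape of each stacked Dense layer."""
    shapes = []
    previous = input_dim
    for units in dimensions:
        shapes.append((previous, units))
        previous = units
    return shapes


def count_params(shapes):
    """Weights plus biases for a list of Dense layer shapes."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Assumption: 3 features ('movieId', 'userId', 'genres') x 64-dim embeddings, concatenated.
input_dim = 3 * 64
mlp_shapes = dense_layer_shapes(input_dim, [64, 32])
head_shapes = dense_layer_shapes(32, [1])  # assumed single-unit binary head

mlp_params = count_params(mlp_shapes)
head_params = count_params(head_shapes)
print(mlp_shapes, mlp_params)
print(head_shapes, head_params)
```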

### Define Data Loader

We're ready to start training. We'll use the NVTabular `KerasSequenceLoader` for reading chunks of parquet files. `KerasSequenceLoader` manages shuffling by loading chunks of data from different parts of the full dataset, concatenating them, shuffling the result, and then iterating through this super-chunk sequentially in batches. The number of "parts" (or "partitions") of the dataset that get sampled is controlled by the `parts_per_chunk` kwarg, while the size of each of these parts is controlled by the `buffer_size` kwarg, which refers to a fraction of available GPU memory (you can read more about it [here](https://nvidia-merlin.github.io/NVTabular/main/training/tensorflow.html) and [here](https://nvidia-merlin.github.io/NVTabular/main/api/tensorflow_dataloader.html?highlight=kerassequence#nvtabular.loader.tensorflow.KerasSequenceLoader)). Using more chunks leads to better randomness, especially at the epoch level, where physically disparate samples can be brought into the same batch, but using too many can hurt throughput. In any case, the speed of the parquet reader makes feasible buffer sizes much larger.
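
The shuffling strategy described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the NVTabular implementation: `chunked_shuffle_batches` is a hypothetical helper, partitions are modeled as in-memory lists, and its `parts_per_chunk` argument merely mirrors the kwarg of the same name.

```python
import random


def chunked_shuffle_batches(partitions, parts_per_chunk, batch_size, seed=0):
    """Yield batches by shuffling "super-chunks" built from several partitions."""
    rng = random.Random(seed)
    # Visit the partitions in random order so each chunk mixes distant parts of the data.
    order = list(range(len(partitions)))
    rng.shuffle(order)
    for start in range(0, len(order), parts_per_chunk):
        # Concatenate `parts_per_chunk` partitions into one super-chunk.
        chunk = []
        for index in order[start:start + parts_per_chunk]:
            chunk.extend(partitions[index])
        # Shuffle within the super-chunk, then emit it sequentially in batches.
        rng.shuffle(chunk)
        for offset in range(0, len(chunk), batch_size):
            yield chunk[offset:offset + batch_size]


# Toy dataset: 6 partitions holding 4 consecutive row ids each.
partitions = [list(range(p * 4, p * 4 + 4)) for p in range(6)]
batches = list(chunked_shuffle_batches(partitions, parts_per_chunk=2, batch_size=3))
seen_rows = sorted(row for batch in batches for row in batch)
print(len(batches))
```

Every row is still visited exactly once per pass; increasing `parts_per_chunk` mixes rows from more partitions into the same batch at the cost of holding a larger super-chunk in memory.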

Note that the `genres` column is a multi-hot column: it is fed to the data loader as a sparse tensor and then converted to a dense representation. The `Dataset` class is schema-aware, so each list column (multi-hot feature) is automatically converted to a sparse tensor.
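
As a rough illustration of what converting a multi-hot column to a dense representation involves, the sketch below pads variable-length rows to a common width. It is plain Python rather than the data loader's actual code, and it assumes the list column is stored as a flat list of values plus per-row lengths and that 0 is used for padding; `ragged_to_dense` is a hypothetical helper.

```python
def ragged_to_dense(values, row_lengths, pad_value=0):
    """Pad variable-length rows (flat values + row lengths) into a dense 2-D list."""
    max_len = max(row_lengths, default=0)
    dense = []
    start = 0
    for length in row_lengths:
        row = values[start:start + length]
        # Right-pad short rows so that every row has the same width.
        dense.append(row + [pad_value] * (max_len - length))
        start += length
    return dense


# Three rows with 2, 1 and 3 encoded genre ids respectively (made-up ids).
values = [3, 7, 1, 2, 5, 9]
row_lengths = [2, 1, 3]
dense = ragged_to_dense(values, row_lengths)
print(dense)
```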

Note that we do not have continuous columns so we only feed categorical columns to the model.

In [10]:
import tensorflow as tf
import merlin_models.tf.dataset as tf_dataloader


def get_dataloader(paths_or_dataset, batch_size=4096, shuffle=True):
    # Build a schema-aware data loader that yields (features, label) batches.
    dataloader = tf_dataloader.Dataset(
        paths_or_dataset,
        batch_size=batch_size,
        label_names=['rating_binary'],
        shuffle=shuffle,
        schema=schema,
    )
    # Flatten the label tensor to shape (batch_size,).
    return dataloader.map(lambda X, y: (X, tf.reshape(y, (-1,))))

Define the parquet file paths for model training.

In [11]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/movielens/ml-25m")
train_paths = glob.glob(os.path.join(OUTPUT_DIR, "train/*.parquet"))
eval_paths = glob.glob(os.path.join(OUTPUT_DIR, "valid/*.parquet"))

In [12]:
import tensorflow as tf
model.compile(optimizer="adam", run_eagerly=False)

To train and evaluate our model, we use the `.fit()` and `.evaluate()` methods, as in `tf.keras`.

In [13]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(nvt.Dataset(train_paths), shuffle=True) 
losses = model.fit(train_loader, epochs=1)
model.reset_metrics()

# Evaluate
print('*'*20 + '\n')
print("Start evaluation")
print('*'*20 + '\n')
eval_loader = get_dataloader(nvt.Dataset(eval_paths), shuffle=False) 
eval_metrics = model.evaluate(eval_loader, return_dict=True)

********************
Launch training
********************



2022-02-21 16:12:25.771594: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: annotated name 'output' can't be nonlocal (__autograph_generated_file4o0az23x.py, line 36)
********************

Start evaluation
********************







#### Connect Method

Now we will build the same MLP model using the `connect` method. The `connect` method respects sequential order: the first argument is run first. It works like the `SequentialBlock` class, which lets users represent a sequence of Keras layers. `SequentialBlock` is a Keras Layer that can be used instead of `tf.keras.Sequential`, which is actually a Keras Model.
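
The ordering semantics can be illustrated with a tiny stand-in written in plain Python. `ToyBlock` is hypothetical and unrelated to the Merlin Models implementation; it only shows that chaining with `connect` applies the blocks from left to right.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: wraps functions applied in order."""

    def __init__(self, *fns):
        self.fns = list(fns)

    def connect(self, *others):
        # Append the other blocks' functions after our own, preserving order.
        fns = list(self.fns)
        for other in others:
            fns.extend(other.fns)
        return ToyBlock(*fns)

    def __call__(self, x):
        for fn in self.fns:
            x = fn(x)
        return x


double = ToyBlock(lambda x: x * 2)
add_one = ToyBlock(lambda x: x + 1)
square = ToyBlock(lambda x: x * x)

# double runs first, then add_one, then square: ((3 * 2) + 1) ** 2
pipeline = double.connect(add_one, square)
result = pipeline(3)
print(result)
```

Swapping the order changes the result, which is why the first argument to `connect` matters.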

In [14]:
model = mm.InputBlock(schema).connect(mm.MLPBlock([64, 32]), mm.BinaryClassificationTask("rating_binary"))

In [15]:
model.compile(optimizer="adam", run_eagerly=False)

In [16]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(nvt.Dataset(train_paths), shuffle=True) 
eval_loader = get_dataloader(nvt.Dataset(eval_paths), shuffle=False) 
losses = model.fit(train_loader, validation_data=eval_loader, epochs=1)

********************
Launch training
********************



Before moving to the next section we need to restart our kernel so that we can free up the GPU memory.

In [17]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

If the automatic restart does not succeed, you can restart the kernel manually.

### ParallelBlock

Now we will add continuous features and create a new model using the `ParallelBlock` class. We will create one block (like a branch) for continuous features and another for categorical features, and then connect them. Note that this architecture differs from the ones above.

In [17]:
import os
import glob
import nvtabular as nvt
import merlin_models.tf as mm
from merlin.schema import ColumnSchema, Schema
from merlin.schema.io.tensorflow_metadata import TensorflowMetadata
from merlin.schema.tags import Tags, TagSet

In [18]:
INPUT_DATA_DIR = os.environ.get(
    "INPUT_DATA_DIR", os.path.expanduser("/workspace/data/movielens/")
)

In [19]:
SCHEMA_PATH = os.path.join(INPUT_DATA_DIR, 'ml-25m/train/schema.pbtxt')
# Read schema.pbtxt from the directory that contains SCHEMA_PATH, so INPUT_DATA_DIR is respected
schema = TensorflowMetadata.from_proto_text_file(os.path.dirname(SCHEMA_PATH)).to_merlin_schema()

In [20]:
schema = schema.without(['rating', 'title'])

In [21]:
schema.column_names

['movieId',
 'userId',
 'genres',
 'TE_movieId_rating',
 'userId_count',
 'rating_binary']

Note that `TE_movieId_rating` and `userId_count` are the continuous features.

In [22]:
con_schema = schema.select_by_tag(Tags.CONTINUOUS)
cat_schema = schema.select_by_tag(Tags.CATEGORICAL)

In [23]:
con_schema.column_names

['TE_movieId_rating', 'userId_count']

In [24]:
cat_schema.column_names

['movieId', 'userId', 'genres']

In [25]:
cont_block = mm.InputBlock(con_schema).connect(mm.MLPBlock([256, 64]))
cat_block = mm.InputBlock(cat_schema).connect(mm.MLPBlock([128, 64]))

body = mm.ParallelBlock({'continuous_block': cont_block, 'categorical_block': cat_block}, aggregation='concat')

model = body.connect(mm.BinaryClassificationTask("rating_binary"))
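
Conceptually, a parallel block runs each named branch on the same inputs and then either returns the per-branch outputs as a dictionary or merges them. The following plain-Python sketch illustrates that behavior with a `concat` aggregation; `parallel_block` and the two toy branches are hypothetical and only mimic the idea, not the `mm.ParallelBlock` implementation. The input values are made up.

```python
def parallel_block(branches, aggregation=None):
    """Run named branches side by side; optionally concatenate their outputs."""

    def call(inputs):
        outputs = {name: branch(inputs) for name, branch in branches.items()}
        if aggregation is None:
            return outputs  # dictionary with one output per branch
        if aggregation == "concat":
            merged = []
            for name in branches:  # keep the order in which branches were declared
                merged.extend(outputs[name])
            return merged
        raise ValueError(f"unsupported aggregation: {aggregation}")

    return call


def continuous_branch(inputs):
    # Toy branch: pass the continuous features through unchanged.
    return [inputs["TE_movieId_rating"], inputs["userId_count"]]


def categorical_branch(inputs):
    # Toy branch: a fake one-value "embedding" per categorical id.
    return [float(inputs["movieId"]), float(inputs["userId"])]


branches = {"continuous_block": continuous_branch, "categorical_block": categorical_branch}
inputs = {"TE_movieId_rating": 0.8, "userId_count": 0.5, "movieId": 3, "userId": 7}

as_dict = parallel_block(branches)(inputs)
concatenated = parallel_block(branches, aggregation="concat")(inputs)
print(as_dict)
print(concatenated)
```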

In [26]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/movielens/ml-25m")
train_paths = glob.glob(os.path.join(OUTPUT_DIR, "train/*.parquet"))
eval_paths = glob.glob(os.path.join(OUTPUT_DIR, "valid/*.parquet"))

Notice below that this time the schema passed to the data loader also contains continuous columns.

In [27]:
import tensorflow as tf
import merlin_models.tf.dataset as tf_dataloader


def get_dataloader(paths_or_dataset, batch_size=4096, shuffle=True):
    # Build a schema-aware data loader that yields (features, label) batches.
    dataloader = tf_dataloader.Dataset(
        paths_or_dataset,
        batch_size=batch_size,
        label_names=['rating_binary'],
        shuffle=shuffle,
        schema=schema,
    )
    # Flatten the label tensor to shape (batch_size,).
    return dataloader.map(lambda X, y: (X, tf.reshape(y, (-1,))))

In [28]:
import tensorflow as tf
model.compile(optimizer="adam", run_eagerly=False)

In [30]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(nvt.Dataset(train_paths), shuffle=True) 
eval_loader = get_dataloader(nvt.Dataset(eval_paths), shuffle=False) 
losses = model.fit(train_loader, validation_data=eval_loader, epochs=1)

********************
Launch training
********************



2022-02-21 16:29:33.329018: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: annotated name 'output' can't be nonlocal (__autograph_generated_filei28804tf.py, line 36)


In this way, we can easily build `BinaryClassificationTask` models with the Merlin Models library in a couple of lines. For more advanced ranking examples, please visit the `ranking` folder and explore the other example notebooks.