## MERLIN MODELS

Recommender systems (RecSys) predict users' preferences and help users find and select products (items) from a given catalog, thus enhancing their engagement and overall satisfaction with online services. Several traditional (e.g. collaborative filtering) and deep learning (DL) based RecSys models have been built and used in industry within the last decade, and DL-based recommender systems have recently become the norm for large-scale industry applications. In contrast to other domains, however, DL recommender models are not able to effectively leverage the parallel compute and high memory bandwidth that GPUs have to offer. To address the challenges of preprocessing, training and deploying recommender systems at scale, NVIDIA developed an open source framework called [Merlin](https://github.com/NVIDIA-Merlin). One of its libraries, Merlin Models, is being developed to provide the tools needed to design an end-to-end RecSys pipeline, with flexibility at each stage: multiple input processing and representation modules, different layers for designing the model's architecture, different prediction heads and loss functions, different negative sampling techniques, etc.

The goal of the `Merlin Models` library is to make it easy for users in industry or academia to train and deploy recommender models with best practices baked into the library. This lets users in industry easily train standard models on their own datasets and get high-performance, GPU-accelerated models into production. It also lets researchers build custom models from standard components of deep learning recommender models, and then benchmark their new models on example offline datasets.

### Core features

- Support for diverse input types (e.g. categorical and continuous), architectures (e.g. two-tower or sequential) or tasks (e.g. binary, multi-class classification, multi-task)
- Flexible building blocks to define a broad range of models with various prediction tasks, loss functions and negative sampling techniques 
- Flexible APIs targeted to both production and research
- Unified API enables users to create models in TensorFlow or PyTorch.
- GPU-optimized data-loading, model training and serving
- Deep integration with NVTabular for ETL and model serving
- Multi-task learning is a first-class citizen
- State-of-the-art negative sampling techniques

### Learning objectives

In this introductory notebook we aim to:

- introduce the building blocks of the Merlin Models library
- show how to seamlessly integrate the NVTabular and Merlin Models libraries
- train DL-based recommender models with optimized TensorFlow data loaders in only a few lines of code

### Import Libraries

In [1]:
import os
import glob
import shutil
import nvtabular as nvt
import numpy as np

from nvtabular.loader.tensorflow import KerasSequenceLoader

import merlin_models.tf as ml
from merlin_standard_lib import Schema, Tag

2022-01-06 15:08:35.352097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


## ETL with NVTabular

We use the [MovieLens25M](https://grouplens.org/datasets/movielens/25m/) dataset, a popular dataset for recommender systems that is also used in academic publications. To download and convert the raw dataset, run the `01-Download-Convert.ipynb` notebook first. Here, we use parquet files processed with the [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) library; run the `02-ETL-with-NVTabular.ipynb` notebook to generate the input features and processed parquet files.

Avoid Numba low occupancy warnings

In [2]:
from numba import config
config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

We define our base input directory, containing the data.

In [3]:
INPUT_DATA_DIR = os.environ.get(
    "INPUT_DATA_DIR", os.path.expanduser("/workspace/data/movielens/")
)

### Schema Object

The Merlin Models library relies on a schema object to automatically build all the layers needed to represent, normalize and aggregate input features. As you can see below, `schema.pbtxt` is a protobuf text file that contains feature metadata, including statistics such as cardinality and min and max values, and that tags features based on their characteristics and dtypes (e.g., categorical, continuous, list, integer).
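To make the role of the schema more concrete, here is a minimal, library-free sketch of the idea: features carry tags and value ranges, and downstream code can filter them or derive layer sizes from that metadata. `ToyFeature`, `select_by_tag`, `remove_by_name` and `embedding_table_sizes` are hypothetical helpers written for this illustration and are not the `merlin_standard_lib` API. The `max` values for `movieId` and `userId` are taken from this dataset's `schema.pbtxt`; the `rating` entry and the tags beyond `user_id` are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyFeature:
    """Hypothetical stand-in for one `feature { ... }` entry of a schema file."""
    name: str
    max_value: int          # largest encoded id, like `int_domain.max`
    tags: Tuple[str, ...]   # like the `annotation.tag` entries


def select_by_tag(features, tag):
    """Keep only the features annotated with `tag`."""
    return [f for f in features if tag in f.tags]


def remove_by_name(features, names):
    """Return a new list without the named features (the input is not modified)."""
    drop = set(names)
    return [f for f in features if f.name not in drop]


def embedding_table_sizes(features):
    """Ids run from 0 to max_value inclusive, so a table needs max_value + 1 rows."""
    return {f.name: f.max_value + 1 for f in select_by_tag(features, "categorical")}


toy_schema = [
    ToyFeature("movieId", 56690, ("item_id", "item", "categorical")),
    ToyFeature("userId", 162542, ("user_id", "categorical")),  # tags partly assumed
    ToyFeature("rating", 5, ("continuous",)),                  # made-up entry
]

inputs = remove_by_name(toy_schema, ["rating"])
print([f.name for f in inputs])
print(embedding_table_sizes(inputs))
```

The point of the sketch is only that a model-building library can size and wire input layers from metadata alone, without the user listing every column by hand.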

We have already generated our `schema.pbtxt` file in the previous notebook using NVTabular. Now we read this schema file to create a schema object.

In [4]:
from merlin_standard_lib import Schema
SCHEMA_PATH = "/workspace/data/movielens/train/schema.pbtxt"
schema = Schema().from_proto_text(SCHEMA_PATH)
!head -30 $SCHEMA_PATH

feature {
  name: "movieId"
  type: INT
  int_domain {
    name: "movieId"
    min: 0
    max: 56690
    is_categorical: true
  }
  annotation {
    tag: "item_id"
    tag: "item"
    tag: "categorical"
    extra_metadata {
      type_url: "type.googleapis.com/google.protobuf.Struct"
      value: "\n\021\n\013num_buckets\022\002\010\000\n\033\n\016freq_threshold\022\t\021\000\000\000\000\000\000\000\000\n\025\n\010max_size\022\t\021\000\000\000\000\000\000\000\000\n\030\n\013start_index\022\t\021\000\000\000\000\000\000\000\000\n2\n\010cat_path\022&\032$.//categories/unique.movieId.parquet\nG\n\017embedding_sizes\0224*2\n\030\n\013cardinality\022\t\021\000\000\000\000@\256\353@\n\026\n\tdimension\022\t\021\000\000\000\000\000\000\200@"
    }
  }
}
feature {
  name: "userId"
  type: INT
  int_domain {
    name: "userId"
    min: 0
    max: 162542
    is_categorical: true
  }
  annotation {
    tag: "user_id"


## Building an MLP with Merlin Models

Let's start by building an MLP model for a binary classification task using the Merlin Models library. As a starting point, we are going to use only the `['movieId', 'userId', 'genres']` input features. Note that our target column is `rating_b`, which is created in the `02-ETL-with-NVTabular` notebook.

We can easily remove the features we do not want to use as model inputs with the `remove_by_name` method of the schema object, which returns a new schema.

In [7]:
schema = schema.remove_by_name(['rating', 'title', 'TE_movieId_rating', 'userId_count'])

In [8]:
schema.column_names

['movieId', 'userId', 'genres']

Below we build a TF MLP model using three main blocks:
- InputBlock
- MLPBlock
- BinaryClassificationTask

In [None]:
# default emb_dim is 64
model = ml.Model(
    ml.InputBlock(schema), 
    ml.MLPBlock([64, 32]),
    ml.BinaryClassificationTask("rating_b")
)

### Define Data Loader

We're ready to start training. We'll use the NVTabular `KerasSequenceLoader` to read chunks of parquet files. `KerasSequenceLoader` manages shuffling by loading chunks of data from different parts of the full dataset, concatenating and shuffling them, and then iterating through this super-chunk sequentially in batches. The number of "parts" of the dataset that get sampled, or "partitions", is controlled by the `parts_per_chunk` kwarg, while the size of each of these parts is controlled by the `buffer_size` kwarg, which refers to a fraction of available GPU memory (you can read more about it [here](https://nvidia-merlin.github.io/NVTabular/main/training/tensorflow.html) and [here](https://nvidia-merlin.github.io/NVTabular/main/api/tensorflow_dataloader.html?highlight=kerassequence#nvtabular.loader.tensorflow.KerasSequenceLoader)). Using more parts per chunk leads to better randomness, especially at the epoch level, where physically disparate samples can be brought into the same batch, but it can reduce throughput if you use too many. In any case, the speed of the parquet reader makes much larger buffer sizes feasible.
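The shuffling strategy described above can be sketched in plain Python. This is a simplified illustration, not NVTabular's implementation: the function name `chunked_shuffle_batches` and the toy data are hypothetical, each part has a fixed size here instead of being derived from `buffer_size`, and everything runs on the CPU.

```python
import random


def chunked_shuffle_batches(parts, parts_per_chunk, batch_size, seed=0):
    """Yield batches by shuffling inside super-chunks built from several parts."""
    rng = random.Random(seed)
    order = list(range(len(parts)))
    rng.shuffle(order)  # visit parts in random order so distant parts can meet
    for start in range(0, len(order), parts_per_chunk):
        chunk = []
        for idx in order[start:start + parts_per_chunk]:
            chunk.extend(parts[idx])  # concatenate the selected parts
        rng.shuffle(chunk)            # shuffle rows inside the super-chunk
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]  # iterate sequentially in batches


# Toy dataset: 6 parts holding 4 consecutive row ids each
parts = [list(range(p * 4, (p + 1) * 4)) for p in range(6)]
batches = list(chunked_shuffle_batches(parts, parts_per_chunk=3, batch_size=4))
for batch in batches:
    print(batch)
```

With `parts_per_chunk=1` each batch could only mix rows from a single part; raising it lets rows from far-apart parts share a batch, at the cost of holding more data in memory at once.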

Note that the `genres` column is a multi-hot column: it is fed to the dataloader as a sparse tensor and then converted to a dense representation. Based on our analysis, each row of the `genres` column has at most 10 entries, so we set the sequence length for the multi-hot columns to 10 in the `sparse_features_max` dictionary below.
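As a rough illustration of what a dense representation with a fixed sequence length means, the sketch below pads variable-length lists to a common length. The helper `pad_to_dense`, the right-padding direction and the padding value `0` are assumptions made for this example; the dataloader's actual conversion may differ in these details.

```python
def pad_to_dense(rows, max_len, pad_value=0):
    """Turn variable-length rows into fixed-length rows of size `max_len`."""
    dense = []
    for row in rows:
        clipped = list(row[:max_len])  # truncate rows longer than max_len
        dense.append(clipped + [pad_value] * (max_len - len(clipped)))
    return dense


# Hypothetical encoded genre ids for three movies
genres = [[3, 7], [1], [2, 5, 9, 11]]
dense_genres = pad_to_dense(genres, max_len=10)
for row in dense_genres:
    print(row)
```

Every row now has the same length, so the batch can be stored as a regular 2-D tensor of shape `(batch_size, 10)`.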

In [9]:
# Define categorical and continuous columns
x_cat_names, x_cont_names = ['userId', 'movieId', 'genres'], []

# dictionary representing max sequence length for each column
sparse_features_max = {'genres': 10}

def get_dataloader(paths_or_dataset, batch_size=4096):
    dataloader = KerasSequenceLoader(
        paths_or_dataset,
        batch_size=batch_size,
        label_names=['rating_b'],
        cat_names=x_cat_names,
        cont_names=x_cont_names,
        sparse_names=list(sparse_features_max.keys()),
        sparse_max=sparse_features_max,
        sparse_as_dense=True,
    )
    return dataloader.map(lambda X, y: (X, tf.reshape(y, (-1,))))

In [10]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/movielens/")
train_paths = glob.glob(os.path.join(OUTPUT_DIR, "train/*.parquet"))
eval_paths = glob.glob(os.path.join(OUTPUT_DIR, "valid/*.parquet"))

In [8]:
import tensorflow as tf
model.compile(optimizer="adam", run_eagerly=False)

In [None]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(train_paths) 
losses = model.fit(train_loader, epochs=1)
model.reset_metrics()

# Evaluate
print('*'*20 + '\n')
print("Start evaluation")
eval_loader = get_dataloader(eval_paths) 
eval_metrics = model.evaluate(eval_loader, return_dict=True)

Now we will build the same MLP model with the `connect` method.

In [11]:
model = ml.InputBlock(schema).connect(ml.MLPBlock([64, 32]), ml.BinaryClassificationTask("rating_b"))

In [13]:
model.compile(optimizer="adam", run_eagerly=False)

In [15]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(train_paths) 
losses = model.fit(train_loader, epochs=1)
model.reset_metrics()

# Evaluate
print('*'*20 + '\n')
print("Start evaluation")
eval_loader = get_dataloader(eval_paths) 
eval_metrics = model.evaluate(eval_loader, return_dict=True)

********************
Launch training
********************

********************

Start evaluation


## 3rd Example: Parallel MLP Branches

In [11]:
# default emb_dim is 64

block1 = ml.MLPBlock([256, 64])
block2 = ml.MLPBlock([128, 64])

body = ml.InputBlock(schema).connect_branch(block1, block2, aggregation='concat')

model = body.connect(ml.BinaryClassificationTask("rating_b"))

In [12]:
import tensorflow as tf
model.compile(optimizer="adam", run_eagerly=False)

In [13]:
print('*'*20)
print("Launch training")
print('*'*20 + '\n')
train_loader = get_dataloader(train_paths) 
losses = model.fit(train_loader, epochs=1)
model.reset_metrics()

# Evaluate
print('*'*20 + '\n')
print("Start evaluation")
eval_loader = get_dataloader(eval_paths) 
eval_metrics = model.evaluate(eval_loader, return_dict=True)

********************
Launch training
********************



2021-12-17 16:37:06.049981: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2021-12-17 16:37:07.282600: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: annotated name 'output' can't be nonlocal (tmp4cg7dpwn.py, line 36)



********************

Start evaluation





Recommender systems need to be highly scalable, matching millions of items with billions of users at a latency of ~100 milliseconds. This scalability requirement has led industry to widely adopt two-stage recommender systems, which consist of efficient candidate generation model(s) in the first stage and a more powerful ranking model in the second stage.
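The two-stage idea can be sketched with a toy example: a cheap first stage narrows the catalog to a few candidates, and a more expensive second stage reorders only those candidates. The function names, the embedding vectors and the popularity feature below are all made up for illustration; real systems would use trained models for both stages.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_vec, item_vecs, k):
    """Stage 1: cheap dot-product scoring over the catalog, keep the top k ids."""
    return heapq.nlargest(k, item_vecs, key=lambda i: dot(user_vec, item_vecs[i]))


def rank(user_vec, candidates, item_vecs, popularity):
    """Stage 2: a richer (toy) score, computed only for the retrieved candidates."""
    def score(item_id):
        return dot(user_vec, item_vecs[item_id]) + 0.5 * popularity[item_id]
    return sorted(candidates, key=score, reverse=True)


user = [1.0, 0.0]
items = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [-0.5, 0.2]}
popularity = {"a": 0.0, "b": 1.0, "c": 0.2, "d": 0.9}

candidates = retrieve(user, items, k=2)
ranked = rank(user, candidates, items, popularity)
print(candidates)
print(ranked)
```

Only the two retrieved candidates reach the second stage, which is why the ranking model can afford to be more powerful than the candidate generator.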

### Two-Tower Retrieval Model

In [5]:
schema = schema.remove_by_name(['rating', 'title'])

In [6]:
schema.column_names

['movieId', 'userId', 'genres', 'rating_b']

In [7]:
# Define categorical and continuous columns
x_cat_names, x_cont_names = ['userId', 'movieId', 'genres'], []

# dictionary representing max sequence length for each column
sparse_features_max = {'genres': 10}

def get_dataloader(paths_or_dataset, batch_size=20):
    dataloader = KerasSequenceLoader(
        paths_or_dataset,
        batch_size=batch_size,
        label_names=['rating_b'],
        cat_names=x_cat_names,
        cont_names=x_cont_names,
        sparse_names=list(sparse_features_max.keys()),
        sparse_max=sparse_features_max,
        sparse_as_dense=True,
    )
    return dataloader.map(lambda X, y: (X, tf.reshape(y, (-1,))))

With in-batch negative sampling, the negatives for each example are taken from the other items in the same batch, so the number of negatives depends on `batch_size`.
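A minimal sketch of in-batch negative sampling, under the assumption that each row of a batch is one (user, positive item) pair and that scores are dot products of tower outputs. The helper names and the vectors are hypothetical, and this is not claimed to be how `ItemRetrievalTask` is implemented.

```python
import math


def in_batch_scores(user_vecs, item_vecs):
    """Score every user against every item in the batch (a B x B matrix)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in item_vecs] for u in user_vecs]


def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under a softmax over `logits`."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


# Batch of 3 (user, positive item) pairs: row i of `users` goes with row i of `items`
users = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
items = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

scores = in_batch_scores(users, items)
# The positive for row i sits on the diagonal; the other columns act as negatives
losses = [softmax_cross_entropy(row, i) for i, row in enumerate(scores)]
negatives_per_example = len(scores[0]) - 1

print(negatives_per_example)
print(losses)
```

With a batch of size B, each example gets B - 1 negatives for free. One caveat of this simple version is that an item the user actually liked may appear elsewhere in the batch and be treated as a negative.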

In [8]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/movielens/")
train_paths = glob.glob(os.path.join(OUTPUT_DIR, "train/*.parquet"))
eval_paths = glob.glob(os.path.join(OUTPUT_DIR, "valid/*.parquet"))

In [9]:
train_loader = get_dataloader(train_paths) 
# losses = model.fit(train_loader, epochs=1)
# model.reset_metrics()

In [None]:
import tensorflow as tf

# Two-tower retrieval model; the query (user) tower is an MLP with 512 and 256 units
two_tower = ml.TwoTowerBlock(schema, query_tower=ml.MLPBlock([512, 256]))
model = two_tower.connect(ml.ItemRetrievalTask(softmax_temperature=2))

model.compile(optimizer="adam", run_eagerly=False)
losses = model.fit(train_loader, epochs=1)

Item retrieval is treated as a multi-class classification task because each example has one positive item and multiple negatives.
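The multi-class view can be illustrated with a softmax over one positive and several negatives. The sketch below also shows the effect of a softmax temperature, assuming the common convention of dividing the logits by the temperature; whether `softmax_temperature` in Merlin Models follows exactly this convention is an assumption here, and the logits are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by `temperature` (assumed convention)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Index 0 is the positive item, the remaining entries are negatives
logits = [4.0, 1.0, 0.0]
sharp = softmax(logits, temperature=1.0)
smooth = softmax(logits, temperature=2.0)

print(sharp)
print(smooth)
```

Under this convention, a temperature above 1 flattens the distribution, so the positive item receives a smaller share of the probability mass and the negatives contribute more to the loss.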

### NCF example with Binary classification

In [None]:
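# model[0] is the TwoTowerBlock

# Extract the query (user) tower for inference
user_model = model[0]['query']

# `user_id` is a placeholder for a batch of user features; it is not defined in this notebook
user_model(user_id)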