In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Applying to your own dataset with Merlin Models and NVTabular

## Overview

In [01-getting-started.ipynb](01-getting-started.ipynb), we provide a getting-started example that trains a DLRM model on the MovieLens 1M dataset. In this notebook, we explore how Merlin Models uses the ETL output from [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular/).

### Learning objectives

This notebook explains how NVTabular and Merlin Models are linked together. We discuss the concept of the `schema` file.

## Merlin

[Merlin](https://developer.nvidia.com/nvidia-merlin) is an open-source framework for building large-scale (deep learning) recommender systems. It is designed to support recommender systems end-to-end, from ETL to training to deployment, on CPU or GPU. Common deep learning frameworks such as TensorFlow and PyTorch are integrated. Its key benefits are easy-to-use APIs, GPU acceleration, and scaling to multi-GPU or multi-node systems.

Merlin Models and NVTabular are components of Merlin. They are designed to work closely together. 

[Merlin Models](https://github.com/NVIDIA-Merlin/models/) is a library that makes it easy for users in industry or academia to train and deploy recommender models, with best practices baked into the library. It lets users in industry train standard models against their own datasets and get high-performance, GPU-accelerated models into production. It also lets researchers build custom models by incorporating standard components of deep learning recommender models, and then benchmark their new models on example offline datasets.

[NVTabular](https://github.com/NVIDIA-Merlin/NVTabular/) is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets and train deep learning (DL) based recommender systems. It provides a high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS Dask-cuDF library.

## Integration of NVTabular and Merlin Models

<img src="images/schema.png">

If you use NVTabular for feature engineering, NVTabular provides, in addition to the data, a `schema` file describing the dataset structure. NVTabular automatically detects some types of `Tags`. Other `Tags` have to be provided manually.

Let's take a look at the MovieLens 1M example.

In [3]:
import os
import shutil
import pandas as pd
import nvtabular as nvt
import merlin.io

import merlin.models.tf as mm

from os import path
from nvtabular import ops
from merlin.core.utils import download_file
from merlin.models.data.movielens import get_movielens
from merlin.schema.tags import Tags

2022-03-15 14:51:49.328297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16255 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:0b:00.0, compute capability: 7.0
2022-03-15 14:51:49.329539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29922 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2022-03-15 14:51:49.330666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29924 MB memory:  -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
2022-03-15 14:51:49.331690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29924 MB memory:  -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id

We use the `get_movielens` utility function to download, extract, and preprocess the dataset.

In [4]:
train, valid = get_movielens(variant="ml-1m")



When NVTabular processes the data, it persists the schema as a file on disk. The dataset also exposes the schema as a property.

The `schema` can be interpreted as a list of the features in the dataset, where each element describes one feature. Each element contains the feature's name, some properties (e.g. statistics), and multiple tags.

In [5]:
train.schema

[{'name': 'userId', 'tags': {<Tags.USER_ID: 'user_id'>, <Tags.USER: 'user'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 0.0, 'start_index': 0.0, 'cat_path': './/categories/unique.userId.parquet', 'embedding_sizes': {'cardinality': 6041.0, 'dimension': 210.0}, 'domain': {'min': 0, 'max': 6041}}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}, {'name': 'movieId', 'tags': {<Tags.ITEM: 'item'>, <Tags.ITEM_ID: 'item_id'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 0.0, 'start_index': 0.0, 'cat_path': './/categories/unique.movieId.parquet', 'embedding_sizes': {'cardinality': 3680.0, 'dimension': 159.0}, 'domain': {'min': 0, 'max': 3680}}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}, {'name': 'title', 'tags': {<Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 0.0, 'start_index

We can select features by name.

In [6]:
train.schema.select_by_name("userId")

[{'name': 'userId', 'tags': {<Tags.USER_ID: 'user_id'>, <Tags.USER: 'user'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 0.0, 'start_index': 0.0, 'cat_path': './/categories/unique.userId.parquet', 'embedding_sizes': {'cardinality': 6041.0, 'dimension': 210.0}, 'domain': {'min': 0, 'max': 6041}}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}]

Alternatively, we can select them by `Tag`. We append `column_names` to the result to receive only the names, without the additional metadata.

In [7]:
# All categorical features
train.schema.select_by_tag(Tags.CATEGORICAL).column_names

['userId',
 'movieId',
 'title',
 'genres',
 'gender',
 'age',
 'occupation',
 'zipcode']

In [8]:
# All continuous features
train.schema.select_by_tag(Tags.CONTINUOUS).column_names

['TE_age_rating',
 'TE_gender_rating',
 'TE_occupation_rating',
 'TE_zipcode_rating',
 'TE_movieId_rating',
 'TE_userId_rating']

In [9]:
# All targets
train.schema.select_by_tag(Tags.TARGET).column_names

['rating_binary', 'rating']

In [10]:
# All features related to the item
train.schema.select_by_tag(Tags.ITEM).column_names

['movieId', 'genres', 'TE_movieId_rating']

In [11]:
# All features related to the user
train.schema.select_by_tag(Tags.USER).column_names

['userId',
 'TE_age_rating',
 'TE_gender_rating',
 'TE_occupation_rating',
 'TE_zipcode_rating',
 'TE_userId_rating']
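
To make the selection semantics concrete, here is a toy model in plain Python. It is not Merlin's implementation: the dictionary layout and the `select_by_tag` and `column_names` helpers are assumptions made for illustration, and only the feature names and tags are taken from the outputs above.

```python
# Toy schema: one dict per feature. Names and tags mirror a subset of the
# outputs shown above; the data layout itself is a simplification.
toy_schema = [
    {"name": "userId", "tags": {"user_id", "user", "categorical"}},
    {"name": "movieId", "tags": {"item_id", "item", "categorical"}},
    {"name": "genres", "tags": {"item", "categorical"}},
    {"name": "TE_movieId_rating", "tags": {"item", "continuous"}},
    {"name": "TE_userId_rating", "tags": {"user", "continuous"}},
    {"name": "rating_binary", "tags": {"target"}},
]


def select_by_tag(schema, tag):
    """Return the features that carry `tag`, preserving schema order."""
    return [feature for feature in schema if tag in feature["tags"]]


def column_names(schema):
    """Return only the feature names, dropping the other metadata."""
    return [feature["name"] for feature in schema]


item_features = column_names(select_by_tag(toy_schema, "item"))
user_features = column_names(select_by_tag(toy_schema, "user"))
print(item_features)
print(user_features)
```

A feature can carry several tags at once, which is why `movieId` appears in both the categorical selection and the item selection.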

The `schema` is a great way to combine feature engineering and model training into one end-to-end pipeline. Many popular (deep learning) recommender models define their architecture based on feature types.

DLRM applies embedding layers to each categorical input feature and applies an MLP (called the bottom MLP) to the continuous input features.

The Two Tower model applies an MLP (with embedding layers for categorical features) to all item features (called the item tower) and another MLP to all user features (called the user tower).

The `schema` file contains all the information required to build the architecture. If the dataset changes (e.g. more features are added), the same code can still be used to define the architecture.
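
The claim that tags are enough to derive an architecture can be sketched without any deep learning code. The example below is a simplified illustration, not how Merlin Models builds its blocks: `plan_dlrm` and `plan_two_tower` are hypothetical helpers, and the feature list is a hand-written subset of the schema shown earlier.

```python
# Hand-written subset of the schema; only names and tags matter here.
features = [
    {"name": "userId", "tags": {"user", "categorical"}},
    {"name": "movieId", "tags": {"item", "categorical"}},
    {"name": "genres", "tags": {"item", "categorical"}},
    {"name": "TE_userId_rating", "tags": {"user", "continuous"}},
    {"name": "TE_movieId_rating", "tags": {"item", "continuous"}},
    {"name": "rating_binary", "tags": {"target"}},
]


def names_with(features, tag):
    """Names of all features carrying `tag`, in schema order."""
    return [f["name"] for f in features if tag in f["tags"]]


def plan_dlrm(features):
    # DLRM: embeddings for categorical inputs, bottom MLP for continuous ones.
    return {
        "embeddings": names_with(features, "categorical"),
        "bottom_mlp": names_with(features, "continuous"),
    }


def plan_two_tower(features):
    # Two Tower: one tower per side, split by the item and user tags.
    return {
        "item_tower": names_with(features, "item"),
        "user_tower": names_with(features, "user"),
    }


dlrm_plan = plan_dlrm(features)
two_tower_plan = plan_two_tower(features)

# Adding a feature changes the plan, but not the planning code.
features.append({"name": "TE_age_rating", "tags": {"user", "continuous"}})
updated_plan = plan_dlrm(features)
print(updated_plan["bottom_mlp"])
```

The planning functions never mention a column by name, so the same code keeps working when features are added to or removed from the dataset.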

## Applying NVTabular and Merlin Models to your own dataset

We now have a solid understanding of why the schema matters and how it works. Let's take a look at how to apply it to your own dataset.

The best way is to use [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular/) for the feature engineering step. We will look at a minimal example for the MovieLens dataset.

We download the dataset if it is not already available on disk.

In [13]:
input_path = os.environ.get(
    "INPUT_DATA_DIR",
    os.path.expanduser("~/merlin-models-data/movielens/")
)
name = "ml-1m"
download_file(
    "http://files.grouplens.org/datasets/movielens/ml-1m.zip",
    os.path.join(input_path, "ml-1m.zip"),
)

downloading ml-1m.zip: 5.93MB [00:02, 2.63MB/s]                                 
unzipping files: 100%|█████████████████████████| 5/5 [00:00<00:00, 36.20files/s]


We preprocess the dataset and split it into training and validation.

In [14]:
ratings = pd.read_csv(
    os.path.join(input_path, "ml-1m/ratings.dat"),
    sep="::",
    names=["userId", "movieId", "rating", "timestamp"],
    engine="python",  # multi-character separators require the Python parser
)
ratings = ratings.sample(len(ratings), replace=False)

num_valid = int(len(ratings) * 0.2)
train = ratings[:-num_valid]
valid = ratings[-num_valid:]
train.to_parquet(os.path.join(input_path, name, "train.parquet"))
valid.to_parquet(os.path.join(input_path, name, "valid.parquet"))
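
The split above shuffles the rows differently on every run because no seed is set. If reproducibility matters, one option is to fix the seed; with pandas this would typically mean passing `random_state` to `sample`. The plain-Python sketch below shows the same shuffle-then-split logic on a few sample lines in the `ratings.dat` format. The `parse_ratings` and `shuffle_split` helpers, the seed value, and the sample rows are assumptions for illustration.

```python
import random


def parse_ratings(lines):
    """Parse 'userId::movieId::rating::timestamp' lines into dicts of ints."""
    columns = ["userId", "movieId", "rating", "timestamp"]
    return [
        dict(zip(columns, map(int, line.strip().split("::"))))
        for line in lines
    ]


def shuffle_split(rows, valid_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the last rows for validation."""
    rows = list(rows)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(rows)
    split = len(rows) - int(len(rows) * valid_fraction)
    return rows[:split], rows[split:]


sample_lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::914::3::978301968",
    "1::3408::4::978300275",
    "1::2355::5::978824291",
]
rows = parse_ratings(sample_lines)
train_rows, valid_rows = shuffle_split(rows)
print(len(train_rows), len(valid_rows))
```

Slicing at an explicit `split` index also avoids the edge case where a very small dataset yields zero validation rows and a slice such as `rows[:-0]` would return an empty training set.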



We use NVTabular to define a feature engineering pipeline. 

NVTabular already implements many common calculations, called `ops`. An `op` can be applied to a `ColumnGroup` with the overloaded `>>` operator.<br><br>
**Example:**<br>
```python
features = [ column_name, ...] >> op1 >> op2 >> ...
```
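
For readers curious how a plain Python list can appear on the left of `>>`, here is a toy illustration of operator overloading. It is not NVTabular's code: `ToyNode`, `ToyOp`, and their methods are invented for this sketch, and only the general mechanism (`__rshift__` and `__rrshift__`) is standard Python.

```python
class ToyNode:
    """A column selection plus the ops queued to run on it."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = list(ops)

    def __rshift__(self, op):
        # node >> op: return a new node with the op appended to the chain.
        return ToyNode(self.columns, self.ops + [op])

    def apply(self, row):
        """Run every queued op, in order, on the selected columns of one row."""
        out = dict(row)
        for op in self.ops:
            for col in self.columns:
                out[col] = op.fn(out[col])
        return out


class ToyOp:
    """Wraps a function that transforms a single value."""

    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, columns):
        # [names] >> op: lists do not define >>, so Python falls back to the
        # right-hand operand's __rrshift__.
        return ToyNode(columns, [self])


double = ToyOp(lambda x: x * 2)
add_one = ToyOp(lambda x: x + 1)

pipeline = ["a", "b"] >> double >> add_one
result = pipeline.apply({"a": 1, "b": 10, "c": 5})
print(result)
```

Column `c` is left unchanged because it is not part of the selection, and the ops run left to right in the order they were chained.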

We need to perform the following steps:
- Categorify `userId` and `movieId` so that the values are contiguous integers from 0 ... |C|
- Transform the rating column into a binary target, mapping ratings `>3` to `1` and all others to `0`
- Add Tags with `ops.AddMetadata` for `item_id`, `user_id`, `item`, `user` and `target`.
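
The first two steps can be sketched in plain Python to show what the transformations compute. This is an illustration only, not NVTabular's implementation: `fit_categories` and `encode` are hypothetical helpers, and both the ordering by first appearance and the use of index 0 for unseen values are assumptions of this sketch.

```python
def fit_categories(values):
    """Map each distinct value to a contiguous integer, starting at 1.

    Index 0 is kept free for values not seen during fitting (an assumption
    of this sketch).
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def encode(values, mapping):
    """Replace raw values by their integer codes; unseen values become 0."""
    return [mapping.get(value, 0) for value in values]


# Raw IDs are sparse and arbitrary; the encoded IDs are dense.
raw_movie_ids = [1193, 661, 1193, 914, 3408]
movie_mapping = fit_categories(raw_movie_ids)
encoded = encode(raw_movie_ids + [9999], movie_mapping)
print(encoded)

# Binary target: ratings above 3 become 1, all others 0.
ratings = [5, 3, 4, 1]
rating_binary = [int(r > 3) for r in ratings]
print(rating_binary)
```

Dense integer codes matter because they are used as row indices into an embedding table, so the table only needs one row per distinct category plus the reserved index.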

In [15]:
cat_features = ["userId", "movieId"] >> ops.Categorify(dtype="int32")

In [16]:
feats_itemId = (
    cat_features["movieId"] >> ops.AddMetadata(tags=["item_id", "item"])
)
feats_userId = (
    cat_features["userId"] >> ops.AddMetadata(tags=["user_id", "user"])
)
feats_target = (
    nvt.ColumnSelector(["rating"])
    >> ops.LambdaOp(lambda col: (col > 3).astype("int32"))
    >> ops.AddMetadata(tags=["binary_classification", "target"])
    >> nvt.ops.Rename(name="rating_binary")
)
output = feats_itemId + feats_userId + feats_target

We apply the workflow to our dataset.

In [17]:
# ToDo replace with fit_transform

workflow = nvt.Workflow(output)

train_dataset = nvt.Dataset([os.path.join(input_path, name, "train.parquet")])
valid_dataset = nvt.Dataset([os.path.join(input_path, name, "valid.parquet")])

if path.exists(os.path.join(input_path, name, "train")):
    shutil.rmtree(os.path.join(input_path, name, "train"))
if path.exists(os.path.join(input_path, name, "valid")):
    shutil.rmtree(os.path.join(input_path, name, "valid"))

workflow.fit(train_dataset)
workflow.transform(train_dataset).to_parquet(
    output_path=os.path.join(input_path, name, "train"),
    out_files_per_proc=1,
    shuffle=False,
)
workflow.transform(valid_dataset).to_parquet(
    output_path=os.path.join(input_path, name, "valid"),
    out_files_per_proc=1,
    shuffle=False,
)
# Save the workflow
workflow.save(os.path.join(input_path, name, "workflow"))

We can load the data as a Merlin Dataset object.

In [18]:
train = merlin.io.Dataset(
    os.path.join(input_path, name, "train"), engine="parquet"
)
valid = merlin.io.Dataset(
    os.path.join(input_path, name, "valid"), engine="parquet"
)

We can train and evaluate our model.

In [19]:
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask(
        train.schema.select_by_tag(Tags.TARGET).column_names[0]
    ),
)

model.compile(optimizer="adam")
model.fit(train, batch_size=1024)
model.evaluate(valid, batch_size=1024)

2022-03-15 14:52:01.170856: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.




2022-03-15 14:52:12.331654: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: cond/then/_0/cond/cond/branch_executed/_101




[0.7367859482765198,
 0.8161250352859497,
 0.7265610694885254,
 0.7926554679870605,
 0.5303010940551758,
 0.0,
 0.5303010940551758]

We can take a look at the schema.

In [20]:
train.schema

[{'name': 'movieId', 'tags': {<Tags.ITEM: 'item'>, <Tags.ITEM_ID: 'item_id'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 0.0, 'start_index': 0.0, 'cat_path': './/categories/unique.movieId.parquet', 'embedding_sizes': {'cardinality': 3682.0, 'dimension': 159.0}, 'domain': {'min': 0, 'max': 3682}}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}, {'name': 'userId', 'tags': {<Tags.USER_ID: 'user_id'>, <Tags.USER: 'user'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0.0, 'max_size': 0.0, 'start_index': 0.0, 'cat_path': './/categories/unique.userId.parquet', 'embedding_sizes': {'cardinality': 6041.0, 'dimension': 210.0}, 'domain': {'min': 0, 'max': 6041}}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}, {'name': 'rating_binary', 'tags': {<Tags.BINARY_CLASSIFICATION: 'binary_classification'>, <Tags.TARGET: 'target'>}, 'properties': {}, 'dtype': dtype('i

As we prepared only a minimal example, our schema has only three features: `movieId`, `userId` and `rating_binary`.

NVTabular can automatically add `Tags` for certain operations. For example, the output of `Categorify` is always a categorical feature and will be tagged accordingly. Similarly, the output of `Normalize` is always continuous.

You can take a look at the full example of our util function for MovieLens in our repository.

You can learn more about NVTabular, its functionality and supported ops by visiting our [GitHub repository]() or exploring the [examples](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples), such as [`Getting Started MovieLens`](https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/02-ETL-with-NVTabular.ipynb) or [`Scaling Criteo`](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples/scaling-criteo).

The easiest way is to use NVTabular to generate a schema file. Alternatively, you can create a schema file manually.