In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Getting Started with `Merlin dataloader`

This notebook was created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow) container.

## Overview

[Merlin dataloader](https://github.com/NVIDIA-Merlin/dataloader) is a library for constructing highly optimized dataloaders to feed `TensorFlow/Keras` and `PyTorch` models during training. You preprocess your data using a [Merlin NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) workflow and hand the dataset over to `Merlin dataloader`.

In this notebook, we will download the MovieLens dataset of movie ratings, briefly process the data with NVTabular, and output it as a `Merlin Dataset`. We will then construct a dataloader and train a simple `MatrixFactorization` model in `TensorFlow/Keras`.

### Learning objectives

- Learn how `Merlin dataloader` integrates with `NVTabular` (a library for preprocessing tabular data on the GPU)
- Understand `Merlin dataloader` high-level concepts
- Use `Merlin dataloader` to train a `TensorFlow/Keras` model

## Downloading the dataset

### MovieLens25M

[MovieLens25M](https://grouplens.org/datasets/movielens/25m/) is a popular dataset for recommender systems and is widely used in academic publications. It contains 25M movie ratings for 62,000 movies given by 162,000 users. Many projects use only the user/item/rating information, but the original dataset also provides metadata for the movies, such as their genres.

In this notebook, we will only use the user-movie pairs along with the ratings a user assigned to the movie.

Let's begin by downloading the [`MovieLens 25M Dataset`](https://grouplens.org/datasets/movielens/).

In [1]:
rm -rf ml-25m*

In [2]:
ls

01a-Getting-started-Pytorch.ipynb     README.md    index.html
01b-Getting-started-Tensorflow.ipynb  categories/


In [2]:
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip

--2022-11-16 00:31:17--  https://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2022-11-16 00:32:07 (5.25 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]



In [3]:
%%capture

!apt update
!apt install unzip
!unzip -q ml-25m.zip

We have now downloaded and extracted the data and can read in the ratings.
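
If `apt` is not available (for example, outside the container), the archive can also be unpacked with Python's standard `zipfile` module. The following is a minimal sketch; the helper name `extract_archive` is hypothetical and not part of Merlin.

```python
import zipfile


def extract_archive(archive_path, target_dir="."):
    """Unpack a zip archive into target_dir and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Usage in this notebook (assumes ml-25m.zip is in the working directory):
# extract_archive("ml-25m.zip")
```

This avoids the `apt install unzip` step at the cost of a few extra lines of Python.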

In [4]:
from merlin.core.dispatch import get_lib

In [5]:
ratings = get_lib().read_csv('ml-25m/ratings.csv')

The `ratings.csv` file stores the ratings users have given to movies. Let's process the data, pass the resulting NVTabular dataset to `Merlin dataloader`, and train a simple MatrixFactorization model that we will construct in `TensorFlow/Keras`.

In [6]:
ratings.head()

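|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|------------|
| 0 | 1 | 296 | 5.0 | 1147880044 |
| 1 | 1 | 306 | 3.5 | 1147868817 |
| 2 | 1 | 307 | 5.0 | 1147868828 |
| 3 | 1 | 665 | 5.0 | 1147878820 |
| 4 | 1 | 899 | 3.5 | 1147868510 |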


In [7]:
from nvtabular import Dataset, Workflow, ops
from merlin.schema.tags import Tags

dataset = Dataset(ratings)

# Encode IDs as contiguous integers; values seen fewer than 100 times share one index.
user_and_movie_ids = ['userId', 'movieId'] >> ops.Categorify(freq_threshold=100)
# Tag the rating column as the target so the dataloader returns it as the label.
rating = ['rating'] >> ops.AddTags([Tags.TARGET])
workflow = Workflow(user_and_movie_ids + rating)

processed = workflow.fit_transform(dataset)

We processed our `Merlin Dataset` with NVTabular, which performed the operations on the GPU. To learn more about the NVTabular library, see its [GitHub repository](https://github.com/NVIDIA-Merlin/NVTabular).
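
To make the effect of `Categorify(freq_threshold=100)` more concrete: the op replaces raw IDs with contiguous integer indices, and values seen fewer than `freq_threshold` times share a single index. The pure-Python sketch below illustrates that idea only. It is not NVTabular's implementation, the function name `encode_with_threshold` is hypothetical, and the exact index layout NVTabular uses (for example, which indices are reserved) can differ between versions.

```python
from collections import Counter


def encode_with_threshold(values, freq_threshold):
    """Map values to contiguous integer ids; infrequent values share id 0."""
    counts = Counter(values)
    # Keep only values that occur at least freq_threshold times,
    # ordered by descending frequency (ties broken by value for determinism).
    frequent = sorted(
        (v for v, c in counts.items() if c >= freq_threshold),
        key=lambda v: (-counts[v], v),
    )
    mapping = {value: index for index, value in enumerate(frequent, start=1)}
    # Anything not in the mapping falls back to the shared id 0.
    return [mapping.get(v, 0) for v in values], mapping


movie_ids = [296, 306, 296, 665, 296, 306]
encoded, mapping = encode_with_threshold(movie_ids, freq_threshold=2)
print(encoded)  # [1, 2, 1, 0, 1, 2]
```

Movie 665 appears only once, so it falls below the threshold and is mapped to the shared index.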

Now that we have preprocessed our data, let's instantiate the `dataloader`.

In [8]:
from merlin.loader.tensorflow import Loader
loader = Loader(processed, batch_size=65536)

2022-11-16 00:32:56.129491: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 00:32:56.129853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 00:32:56.130008: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
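
The `batch_size` determines how many steps make up one epoch. A quick back-of-the-envelope sketch, assuming roughly 25 million rows and that the final partial batch is kept:

```python
import math

n_rows = 25_000_000   # approximate size of MovieLens 25M
batch_size = 65_536   # value passed to Loader above

steps_per_epoch = math.ceil(n_rows / batch_size)
print(steps_per_epoch)  # 382
```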


In [9]:
batch = next(iter(loader))

2022-11-16 00:32:56.323968: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-16 00:32:56.325172: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 00:32:56.325378: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 00:32:56.325533: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning

In [7]:
!pip install merlin-dataloader

Collecting merlin-dataloader
  Downloading merlin-dataloader-0.0.2.tar.gz (44 kB)
     |████████████████████████████████| 44 kB 2.1 MB/s eta 0:00:01
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: merlin-dataloader
  Building wheel for merlin-dataloader (PEP 517) ... done
  Created wheel for merlin-dataloader: filename=merlin_dataloader-0.0.2-py3-none-any.whl size=29203 sha256=91d964a33714577018cd70fed533f3eb37104e4976e057d63488ea38e6726b83
  Stored in directory: /root/.cache/pip/wheels/d5/ce/8c/31476c01e0b5a2278110fe2092bdd911efb0e5b83d0d3550ca
Successfully built merlin-dataloader
Installing collected packages: merlin-dataloader
Successfully installed merlin-dataloader-0.0.2


From the `loader` we obtain a batch of data: a tuple containing a dictionary of feature tensors and a tensor of targets, all of which have already been moved to the GPU.

In [10]:
batch

({'userId': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[10893],
         [   24],
         [26219],
         ...,
         [23129],
         [ 3882],
         [41289]])>,
  'movieId': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[1512],
         [   0],
         [ 561],
         ...,
         [2174],
         [1467],
         [ 619]])>},
 <tf.Tensor: shape=(65536, 1), dtype=float64, numpy=
 array([[3. ],
        [4. ],
        [4.5],
        ...,
        [3. ],
        [2.5],
        [5. ]])>)
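
The structure of `batch` (a feature dictionary plus a target) can be mimicked with a plain-Python generator. This sketch only illustrates the shape of the data the loader yields. It does not reflect how `Merlin dataloader` is implemented, and `iterate_batches` is a hypothetical helper.

```python
def iterate_batches(columns, target_name, batch_size):
    """Yield (features, target) pairs from a dict of equally long column lists."""
    n_rows = len(columns[target_name])
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        features = {
            name: values[start:stop]
            for name, values in columns.items()
            if name != target_name
        }
        yield features, columns[target_name][start:stop]


# The first five rows shown by ratings.head() earlier (raw, un-encoded IDs).
columns = {
    "userId": [1, 1, 1, 1, 1],
    "movieId": [296, 306, 307, 665, 899],
    "rating": [5.0, 3.5, 5.0, 5.0, 3.5],
}
features, target = next(iterate_batches(columns, "rating", batch_size=2))
print(features)  # {'userId': [1, 1], 'movieId': [296, 306]}
print(target)    # [5.0, 3.5]
```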

Let us now construct a simple MatrixFactorization model and train for a single epoch.

In [13]:
import tensorflow as tf

class MatrixFactorization(tf.keras.Model):
    def __init__(self, n_factors):
        super().__init__()
        # Treat the schema's domain max as the largest encoded id (inclusive), hence the + 1.
        n_users = processed.schema['userId'].properties['domain']['max'] + 1
        n_movies = processed.schema['movieId'].properties['domain']['max'] + 1
        self.user_embeddings = tf.keras.layers.Embedding(n_users, n_factors)
        self.movie_embeddings = tf.keras.layers.Embedding(n_movies, n_factors)

    def call(self, batch, training=False):
        # Each lookup returns shape (batch_size, 1, n_factors); drop the middle axis.
        user_embs = tf.squeeze(self.user_embeddings(batch['userId']), axis=1)
        movie_embs = tf.squeeze(self.movie_embeddings(batch['movieId']), axis=1)

        # The dot product of user and movie embeddings is the predicted rating.
        return tf.reduce_sum(user_embs * movie_embs, axis=1)

In [14]:
model = MatrixFactorization(64)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-2), loss=tf.keras.losses.MeanSquaredError())
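
The model's prediction for a (user, movie) pair is the dot product of the two embedding vectors. The sketch below reproduces that computation with plain Python lists and made-up embedding values. It is only meant to illustrate what `call` computes, not to replace the Keras model.

```python
def predict_rating(user_embedding, movie_embedding):
    """Dot product of two equally sized embedding vectors."""
    return sum(u * m for u, m in zip(user_embedding, movie_embedding))


# Hypothetical 3-dimensional embeddings (the notebook uses 64 factors).
user_table = {0: [0.5, 1.0, -1.0], 1: [1.0, 0.0, 2.0]}
movie_table = {0: [2.0, 1.0, 0.0], 1: [1.0, 1.0, 1.0]}

example_batch = {"userId": [0, 1], "movieId": [1, 0]}
predictions = [
    predict_rating(user_table[u], movie_table[m])
    for u, m in zip(example_batch["userId"], example_batch["movieId"])
]
print(predictions)  # [0.5, 2.0]
```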

Let's first calculate the Mean Squared Error loss before training.

In [15]:
model.evaluate(loader)



13.613266944885254
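
A loss this large is expected before training. Keras initializes embeddings with small random values, so the initial predictions are close to zero and the MSE is approximately the mean of the squared ratings. A minimal sketch using the five ratings displayed by `ratings.head()` earlier (not the full dataset), with the predictions assumed to be exactly zero:

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


ratings_sample = [5.0, 3.5, 5.0, 5.0, 3.5]
untrained_predictions = [0.0] * len(ratings_sample)  # assumption: predictions near zero

print(mean_squared_error(ratings_sample, untrained_predictions))  # 19.9
```

The sample gives a larger value than the loss measured on the full dataset because these five ratings happen to be high.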

Let us now train for a single epoch.

In [16]:
model.fit(loader, epochs=1)

2022-11-16 00:39:56.476031: W tensorflow/core/common_runtime/forward_type_inference.cc:231] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_BOOL
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_LEGACY_VARIANT
    }
  }
}

	while inferring type of node 'mean_squared_error/cond/output/_11'




<keras.callbacks.History at 0x7fd8d0d1d8b0>

In [17]:
model.evaluate(loader)



0.772408127784729

After a single epoch of training on all 25,000,000 ratings, the Mean Squared Error has dropped from about 13.61 to about 0.77.

In a more realistic scenario, we would have held out a validation dataset, trained for more epochs, tuned hyperparameters, and so on.
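
As one example of such a refinement, a random hold-out split can be sketched in plain Python. The helper `train_validation_split`, the fixed seed, and the 80/20 ratio are illustrative assumptions, not part of this notebook's workflow.

```python
import random


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle row indices and split the rows into train and validation lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_validation = int(len(rows) * validation_fraction)
    validation = [rows[i] for i in indices[:n_validation]]
    train = [rows[i] for i in indices[n_validation:]]
    return train, validation


# (userId, movieId, rating) tuples from the first five rows of ratings.csv.
rows = [(1, 296, 5.0), (1, 306, 3.5), (1, 307, 5.0), (1, 665, 5.0), (1, 899, 3.5)]
train_rows, validation_rows = train_validation_split(rows)
print(len(train_rows), len(validation_rows))  # 4 1
```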

Nonetheless, the objective here was to quickly show you the ropes of integrating `Merlin dataloader` with `TensorFlow/Keras`.

In the subsequent example, let us take a look at training a model with `Merlin dataloader` and `Keras`. 