In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Getting Started with `Merlin dataloader`

This notebook was created using the latest stable [merlin-pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-pytorch) container.

## Overview

[Merlin dataloader](https://github.com/NVIDIA-Merlin/dataloader) is a library for constructing highly optimized dataloaders to feed `TensorFlow/Keras` and `PyTorch` models during training. You preprocess your data using a [Merlin NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) workflow and hand the dataset over to `Merlin dataloader`.

In this notebook we will download the MovieLens dataset, which consists of movie ratings. We will briefly process the data using NVTabular and output it to a `Merlin Dataset`. Subsequently, we will construct a dataloader and build a simple `MatrixFactorization` model in vanilla PyTorch.

### Learning objectives

- Learn how `Merlin dataloader` integrates with `NVTabular` (a library for preprocessing tabular data on the GPU)
- Understand `Merlin dataloader` high-level concepts
- Use `Merlin dataloader` to train a `PyTorch` model

## Downloading the dataset

### MovieLens25M

[MovieLens25M](https://grouplens.org/datasets/movielens/25m/) is a popular dataset for recommender systems and is widely used in academic publications. It contains 25M movie ratings for 62,000 movies given by 162,000 users. Many projects use only the user/item/rating information of MovieLens, but the original dataset also provides metadata for the movies, such as each movie's genres.

In this notebook, we will only use the user-movie pairs along with the ratings a user assigned to the movie.

Let's begin by downloading the [`MovieLens 25M Dataset`](https://grouplens.org/datasets/movielens/).

In [2]:
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip


In [3]:
%%capture

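# Install unzip non-interactively (-y answers the confirmation prompt), then extract the archive.
!apt update
!apt install -y unzip
!unzip -q ml-25m.zip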

We have now downloaded and extracted the data and can read in the ratings.

In [1]:
from merlin.core.dispatch import get_lib

In [2]:
ratings = get_lib().read_csv('ml-25m/ratings.csv')

The `ratings.csv` file stores the ratings that users have given to movies. Let's process the data, pass the resulting NVTabular dataset to `Merlin dataloader`, and train a simple matrix factorization model that we will construct in `PyTorch`.

In [3]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 296 | 5.0 | 1147880044 |
| 1 | 1 | 306 | 3.5 | 1147868817 |
| 2 | 1 | 307 | 5.0 | 1147868828 |
| 3 | 1 | 665 | 5.0 | 1147878820 |
| 4 | 1 | 899 | 3.5 | 1147868510 |


In [4]:
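from nvtabular import Dataset, Workflow
from nvtabular import ops

dataset = Dataset(ratings)

# Encode user and movie IDs as contiguous integers; IDs seen fewer than
# 100 times share a single index.
user_and_movie_ids = ['userId', 'movieId'] >> ops.Categorify(freq_threshold=100)
# Pass the rating column through unchanged.
workflow = Workflow(user_and_movie_ids + 'rating')

processed = workflow.fit_transform(dataset)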



We processed our `Merlin Dataset` using NVTabular, which performed the operations on the GPU. To learn more about the NVTabular library, see the [NVTabular repository](https://github.com/NVIDIA-Merlin/NVTabular).
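
To make the `Categorify(freq_threshold=100)` step more concrete, here is a minimal plain-Python sketch of frequency-thresholded encoding. It is not NVTabular's implementation: the helper names `fit_frequency_encoding` and `encode`, the choice of index 0 for infrequent values, and the ordering of the assigned indices are all assumptions made for illustration.

```python
from collections import Counter


def fit_frequency_encoding(values, freq_threshold):
    """Give each value seen at least `freq_threshold` times its own integer ID.

    ID 0 is reserved for infrequent or unseen values (an assumption of this
    sketch, not necessarily NVTabular's exact layout).
    """
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= freq_threshold)
    return {value: idx for idx, value in enumerate(frequent, start=1)}


def encode(values, mapping):
    """Encode values, sending anything missing from the mapping to ID 0."""
    return [mapping.get(v, 0) for v in values]


movie_ids = [296, 296, 296, 306, 306, 665]
mapping = fit_frequency_encoding(movie_ids, freq_threshold=2)
encoded = encode(movie_ids, mapping)

print(mapping)  # {296: 1, 306: 2}
print(encoded)  # [1, 1, 1, 2, 2, 0]
```

With a threshold of 2, the movie ID that appears only once falls into the shared index, so rare IDs do not each get their own embedding row.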

Now that we have preprocessed our data, let's instantiate the `dataloader`.

In [8]:
from merlin.loader.torch import Loader
loader = Loader(processed, batch_size=65536)



In [9]:
batch = next(iter(loader))

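From the `loader` we obtain a batch of data: a tuple whose first element is a dictionary of tensors that have already been moved to the GPU. The second element is `None` here, so all three columns, including `rating`, live in that dictionary.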

In [10]:
batch

({'userId': tensor([[    0],
          [    0],
          [    0],
          ...,
          [28656],
          [28656],
          [28656]], device='cuda:0'),
  'movieId': tensor([[   3],
          [ 880],
          [ 943],
          ...,
          [3504],
          [   7],
          [1226]], device='cuda:0'),
  'rating': tensor([5.0000, 3.5000, 5.0000,  ..., 3.0000, 2.0000, 2.0000], device='cuda:0',
         dtype=torch.float64)},
 None)
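
The batch printed above is a two-element tuple: a dictionary of feature columns and a second slot that is `None` here. The following plain-Python sketch mimics that shape with lists instead of GPU tensors. `iterate_batches` and its `target_name` parameter are hypothetical stand-ins for illustration, not part of `Merlin dataloader`.

```python
def iterate_batches(columns, batch_size, target_name=None):
    """Yield (features, target) tuples, `batch_size` rows at a time."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        features = {
            name: values[start:stop]
            for name, values in columns.items()
            if name != target_name
        }
        # Without a designated target column, the second slot stays None.
        target = columns[target_name][start:stop] if target_name is not None else None
        yield features, target


columns = {
    "userId": [0, 0, 1],
    "movieId": [3, 880, 7],
    "rating": [5.0, 3.5, 2.0],
}

batches = list(iterate_batches(columns, batch_size=2))
features, target = batches[0]
print(features["rating"], target)  # [5.0, 3.5] None
```

Because the ratings sit in the feature dictionary, code that consumes a batch reads both inputs and labels from `batch[0]`.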

Let us now construct a simple MatrixFactorization model and train for a single epoch.
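
In equation form, a matrix factorization model of this kind predicts a rating as the dot product of a user embedding and a movie embedding, and training minimizes the mean squared error over a batch. The symbols $p_u$ (user embedding), $q_i$ (movie embedding), $k$ (the number of factors, `n_factors` in the code), $r_{ui}$ (the observed rating), and $B$ (a batch of user/movie pairs) are notation introduced here, not taken from the notebook.

```latex
\hat{r}_{ui} = p_u^\top q_i = \sum_{f=1}^{k} p_{uf}\, q_{if},
\qquad
\mathcal{L} = \frac{1}{|B|} \sum_{(u,i) \in B} \left( \hat{r}_{ui} - r_{ui} \right)^2
```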

In [11]:
from torch import nn

In [12]:
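class DotProduct(nn.Module):
    """Matrix factorization: score a (user, movie) pair by the dot product of their embeddings."""

    def __init__(self, n_factors):
        super().__init__()
        # Size each table from the schema's domain maximum. The + 1 keeps the
        # largest ID a valid index if that maximum is inclusive, and costs
        # only one unused row otherwise.
        n_users = processed.schema['userId'].properties['domain']['max'] + 1
        n_movies = processed.schema['movieId'].properties['domain']['max'] + 1
        self.user_embeddings = nn.Embedding(n_users, n_factors)
        self.movie_embeddings = nn.Embedding(n_movies, n_factors)

    def forward(self, batch):
        # batch[0] is the feature dictionary; each ID tensor has shape (batch_size, 1)
        user_embs = self.user_embeddings(batch[0]['userId'])
        movie_embs = self.movie_embeddings(batch[0]['movieId'])

        # Drop the singleton dimension, then take the row-wise dot product
        return (user_embs.squeeze(1) * movie_embs.squeeze(1)).sum(dim=1)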

In [13]:
from torch.optim import Adam
import torch

In [14]:
lr=1e-2
optim=Adam
weight_decay=0

model = DotProduct(64).cuda()
optimizer = optim(model.parameters(), lr=lr, weight_decay=weight_decay)
criterion = nn.MSELoss()

Let's first calculate the Mean Squared Error loss before training.

In [15]:
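def calculate_loss(model):
    """Return the mean squared error of `model` over every row served by `loader`."""
    model.eval()
    total_loss = 0.0
    n = 0
    with torch.no_grad():
        for batch in loader:
            # Cast the float64 ratings to float32 to match the model output
            targets = batch[0]['rating'].float()
            batch_size = targets.shape[0]
            # criterion returns the batch mean, so weight it by the batch size
            total_loss += criterion(model(batch), targets).item() * batch_size
            n += batch_size
    return total_loss / n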
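
`calculate_loss` multiplies each batch's mean loss by the batch size before summing. The small plain-Python sketch below, using made-up squared errors, shows why: when the last batch is smaller than the others, a plain average of batch means gives it too much weight, while the size-weighted average matches the mean over all rows.

```python
def mean(xs):
    return sum(xs) / len(xs)


# Made-up per-row squared errors, split into batches of 2 (the last batch has 1 row).
squared_errors = [4.0, 0.0, 1.0, 9.0, 16.0]
batch_size = 2
batches = [
    squared_errors[i:i + batch_size]
    for i in range(0, len(squared_errors), batch_size)
]

# Averaging the batch means treats the 1-row batch like a full batch.
naive = mean([mean(b) for b in batches])

# Weighting each batch mean by its size recovers the mean over all rows.
weighted = sum(mean(b) * len(b) for b in batches) / sum(len(b) for b in batches)

print(weighted == mean(squared_errors))  # True
print(naive > weighted)                  # True
```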

In [16]:
calculate_loss(model)

76.42800231738009

Let us now train for a single epoch.

In [17]:
%%time

model.train()
for batch in loader:
    loss = criterion(model(batch), batch[0]['rating'].float())

    # compute gradient and do an update step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

CPU times: user 740 ms, sys: 83.3 ms, total: 823 ms
Wall time: 829 ms
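
To show what "compute gradient and do an update step" amounts to for a dot-product model, here is a toy sketch in plain Python. It deliberately swaps in plain stochastic gradient descent for the Adam optimizer used above, uses Python lists in place of embedding tables, and trains on a handful of ratings invented for illustration. The helper names are hypothetical.

```python
import random


def predict(p, q):
    """Dot product of a user vector and a movie vector."""
    return sum(pf * qf for pf, qf in zip(p, q))


def mse(ratings, users, movies):
    """Mean squared error over all (user, movie, rating) triples."""
    return sum((predict(users[u], movies[m]) - r) ** 2 for u, m, r in ratings) / len(ratings)


def sgd_epoch(ratings, users, movies, lr):
    """One pass over the ratings, updating both vectors after each example."""
    for u, m, r in ratings:
        p, q = users[u], movies[m]
        err = predict(p, q) - r
        # d(err**2)/dp = 2 * err * q and d(err**2)/dq = 2 * err * p
        users[u] = [pf - lr * 2 * err * qf for pf, qf in zip(p, q)]
        movies[m] = [qf - lr * 2 * err * pf for pf, qf in zip(p, q)]


rng = random.Random(0)
n_factors = 4
users = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in range(3)}
movies = {m: [rng.gauss(0, 0.1) for _ in range(n_factors)] for m in range(3)}
ratings = [(0, 0, 5.0), (0, 1, 3.5), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 3.0), (2, 2, 1.0)]

loss_before = mse(ratings, users, movies)
for _ in range(200):
    sgd_epoch(ratings, users, movies, lr=0.01)
loss_after = mse(ratings, users, movies)

print(f"MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The PyTorch loop above does the same thing at scale: `loss.backward()` computes these gradients for every embedding row touched by the batch, and `optimizer.step()` applies the update.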


In [18]:
calculate_loss(model)

13.262424255887305

After a single epoch over all 25 million ratings, which took under a second of wall-clock time, the mean squared error dropped from about 76.4 to about 13.3.

In a more realistic scenario we would have held out a validation dataset, trained for more epochs, tuned hyperparameters, and so on.
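
As a starting point for holding out validation data, the sketch below draws a random split of row indices in plain Python. The function name `train_valid_split`, the 20% validation fraction, and the seed are assumptions for illustration; the resulting index lists could then be used to select rows from the ratings dataframe before building separate datasets and loaders.

```python
import random


def train_valid_split(n_rows, valid_fraction=0.2, seed=42):
    """Return (train_indices, valid_indices) as disjoint lists of row positions."""
    indices = list(range(n_rows))
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_valid = int(n_rows * valid_fraction)
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = train_valid_split(10)
print(len(train_idx), len(valid_idx))  # 8 2
```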

Nonetheless, the objective here was to quickly show you the ropes on integrating `Merlin dataloader` with `PyTorch`.