In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Getting Started with Merlin dataloader and PyTorch

This notebook was created using the latest stable [merlin-pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-pytorch) container.

## Overview

[Merlin dataloader](https://github.com/NVIDIA-Merlin/dataloader) is a library for constructing highly optimized dataloaders to accelerate training pipelines in TensorFlow (Keras) and PyTorch. In this example, we build a simple pipeline that trains a matrix factorization model in PyTorch with Merlin dataloader on the MovieLens dataset.

The core features of Merlin dataloader:

- Accelerates pipelines by up to 10x compared to other dataloaders
- Handles larger-than-memory datasets by streaming data from disk
- Supports common data formats: CSV, Parquet, Avro
- Supports distributed training

### Learning objectives

- Using Merlin dataloader to train a PyTorch Model

# Downloading and preparing the dataset

We will base our example on the [MovieLens25M](https://grouplens.org/datasets/movielens/25m/) dataset.

In [2]:
from merlin.core.utils import download_file
from merlin.core.dispatch import get_lib

from merlin.io import Dataset
from merlin.loader.torch import Loader

import torch
from torch import nn
from torch.optim import Adam


In [None]:
DATA_PATH = '/workspace'

In [3]:
download_file("http://files.grouplens.org/datasets/movielens/ml-25m.zip", DATA_PATH + "/ml-25m.zip")

# Training a PyTorch Model with Merlin dataloader

In [4]:
ratings = get_lib().read_csv(DATA_PATH + '/ml-25m/ratings.csv')
ratings.head()

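|   | userId | movieId | rating | timestamp  |
|---|--------|---------|--------|------------|
| 0 | 1      | 296     | 5.0    | 1147880044 |
| 1 | 1      | 306     | 3.5    | 1147868817 |
| 2 | 1      | 307     | 5.0    | 1147868828 |
| 3 | 1      | 665     | 5.0    | 1147878820 |
| 4 | 1      | 899     | 3.5    | 1147868510 |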


Each row of `ratings.csv` stores the rating a user gave a movie. Let's load the data directly from disk into a `Merlin Dataset` and train a simple `MatrixFactorization` model that we will construct in `PyTorch`.

In [5]:
dataset = Dataset(DATA_PATH + '/ml-25m/ratings.csv')

Let us now instantiate the `dataloader`.

In [6]:
loader = Loader(dataset, batch_size=65536)
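As a rough sanity check on what this batch size implies, the number of batches per epoch can be worked out by hand. The sketch below assumes the 25,000,095 ratings reported in the MovieLens 25M README and assumes the final partial batch is kept; `n_rows` is hard-coded here rather than read from the file.

```python
import math

# Assumption: row count taken from the MovieLens 25M README, not computed from the file.
n_rows = 25_000_095
batch_size = 65_536  # same value passed to Loader above

# Number of batches if the final, shorter batch is kept.
n_batches = math.ceil(n_rows / batch_size)
last_batch_rows = n_rows - (n_batches - 1) * batch_size

print(n_batches, last_batch_rows)  # 382 30879
```

Under these assumptions the last batch is shorter than the others, which matters whenever per-batch losses are averaged.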

As is, the `loader` yields each batch as a tuple whose first element is a dictionary of tensors keyed by column name and whose second element is `None`.

In [7]:
batch = next(loader)
loader.stop()
batch

({'userId': tensor([[  1],
          [  1],
          [  1],
          ...,
          [526],
          [526],
          [526]], device='cuda:0'),
  'movieId': tensor([[296],
          [306],
          [307],
          ...,
          [479],
          [480],
          [481]], device='cuda:0'),
  'timestamp': tensor([[1147880044],
          [1147868817],
          [1147868828],
          ...,
          [ 874932743],
          [ 874933291],
          [ 874931351]], device='cuda:0'),
  'rating': tensor([5.0000, 3.5000, 5.0000,  ..., 3.0000, 2.0000, 2.0000], device='cuda:0',
         dtype=torch.float64)},
 None)

Let us now construct a simple `MatrixFactorization` model and train for a single epoch.

In [9]:
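class MatrixFactorization(nn.Module):
    def __init__(self, n_factors):
        super().__init__()
        # IDs index the embedding tables directly, so each table needs max ID + 1 rows.
        self.user_embeddings = nn.Embedding(ratings['userId'].max() + 1, n_factors)
        self.movie_embeddings = nn.Embedding(ratings['movieId'].max() + 1, n_factors)

    def forward(self, batch):
        # batch is the (features, None) tuple from the loader; ID tensors have shape (batch_size, 1).
        user_embs = self.user_embeddings(batch[0]['userId'])
        movie_embs = self.movie_embeddings(batch[0]['movieId'])

        # Drop the singleton dimension, then take the per-row dot product as the predicted rating.
        return (user_embs.squeeze(1) * movie_embs.squeeze(1)).sum(dim=1)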
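To see the shapes involved without a GPU or the full dataset, the sketch below runs the same forward computation on a hand-made batch. `TinyMF`, `fake_batch`, and the table sizes are hypothetical stand-ins that are not part of the notebook; the batch only mimics the `(features, None)` tuple shown earlier.

```python
import torch
from torch import nn


class TinyMF(nn.Module):
    """Hypothetical, CPU-friendly copy of the model with explicit table sizes."""

    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.user_embeddings = nn.Embedding(n_users, n_factors)
        self.movie_embeddings = nn.Embedding(n_movies, n_factors)

    def forward(self, batch):
        features = batch[0]
        # (batch_size, 1) IDs -> (batch_size, 1, n_factors) -> (batch_size, n_factors)
        user_embs = self.user_embeddings(features["userId"]).squeeze(1)
        movie_embs = self.movie_embeddings(features["movieId"]).squeeze(1)
        # Per-row dot product gives one predicted rating per example.
        return (user_embs * movie_embs).sum(dim=1)


torch.manual_seed(0)

# Hand-made batch imitating the loader output: column vectors of IDs, float64 ratings.
fake_batch = (
    {
        "userId": torch.tensor([[1], [1], [2]]),
        "movieId": torch.tensor([[3], [4], [3]]),
        "rating": torch.tensor([5.0, 3.5, 4.0], dtype=torch.float64),
    },
    None,
)

tiny_model = TinyMF(n_users=3, n_movies=5, n_factors=4)
predictions = tiny_model(fake_batch)
print(predictions.shape, predictions.dtype)
```

The predictions come out as `float32` while the ratings are `float64`, which is why the training loop casts the target with `.float()` before computing the loss.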

In [10]:
lr = 1e-2
optim = Adam
weight_decay = 0

device = "cuda" if torch.cuda.is_available() else "cpu"

model = MatrixFactorization(64).to(device)
optimizer = optim(model.parameters(), lr=lr, weight_decay=weight_decay)
criterion = nn.MSELoss()
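In symbols, a sketch of what the model and `nn.MSELoss` compute for a batch $B$ of (user, movie) pairs. The notation is ours, not the notebook's: $\mathbf{p}_u$ and $\mathbf{q}_i$ denote the rows of `user_embeddings` and `movie_embeddings`, $r_{ui}$ is the observed rating, and 64 is the `n_factors` value passed above.

```latex
\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i = \sum_{k=1}^{64} p_{uk}\, q_{ik},
\qquad
\mathcal{L}(B) = \frac{1}{|B|} \sum_{(u,i) \in B} \left( \hat{r}_{ui} - r_{ui} \right)^2
```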

In [11]:
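def evaluate(model):
    """Return the mean squared error over every batch of the (training) loader."""
    model.eval()
    loss = 0
    n = 0
    with torch.no_grad():
        for batch in loader:
            batch_size = batch[0]['rating'].shape[0]
            # criterion returns the batch mean; scale by batch size to get the batch sum.
            loss += criterion(model(batch), batch[0]['rating']) * batch_size
            n += batch_size
    # Total squared error divided by total rows gives the mean over all rows seen.
    return loss.item() / n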
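The reason `evaluate` multiplies each batch loss by its batch size is that batches need not be equally long. The toy numbers below are hypothetical and only illustrate the arithmetic: weighting by row count recovers the mean over all rows, while a plain average of batch means over-weights a short batch.

```python
# Hypothetical per-batch results: (mean squared error of the batch, rows in the batch).
batch_results = [(2.0, 4), (1.0, 4), (5.0, 2)]

# Weight each batch mean by its row count, as evaluate() does.
weighted_sum = sum(mse * rows for mse, rows in batch_results)
total_rows = sum(rows for _, rows in batch_results)
dataset_mse = weighted_sum / total_rows

# A plain average of batch means gives the 2-row batch as much say as a 4-row batch.
naive_mse = sum(mse for mse, _ in batch_results) / len(batch_results)

print(dataset_mse, naive_mse)
```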

Let us now train for a single epoch.

In [12]:
%%time

model.train()
for batch in loader:

    loss = criterion(model(batch), batch[0]['rating'].float())

    # compute gradient and do an update step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

CPU times: user 2.16 s, sys: 66.6 ms, total: 2.23 s
Wall time: 2.23 s


In [13]:
evaluate(model)

16.63651518486087
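The value above is a mean squared error, so it is expressed in squared rating units. Taking the square root gives an RMSE on the same scale as the ratings, which is easier to interpret. A minimal sketch, using only the number printed above:

```python
import math

mse = 16.63651518486087  # value returned by evaluate(model) above
rmse = math.sqrt(mse)

print(round(rmse, 2))  # 4.08
```

An RMSE of about 4.08 is large relative to the 0.5 to 5.0 star scale of MovieLens ratings, which suggests that a single epoch is far from enough for this model. Since `evaluate` reuses the training `loader`, this is also a training-set figure rather than a held-out score.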

## Conclusion

We demonstrated how to train a PyTorch model with Merlin dataloader. Merlin dataloader can accelerate existing PyTorch pipelines with minimal code changes. 

# Next Steps

Merlin dataloader is part of NVIDIA Merlin, an open-source framework for recommender systems. In this example, we looked at only one specific use case: accelerating existing training pipelines.

We also offer [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular), a library that accelerates and scales feature engineering.

Our libraries are designed to work closely together. We recommend checking out our examples:

* [Getting Started with NVTabular: Process Tabular Data On GPU](https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/01-Getting-started.ipynb)
* [Getting Started MovieLens: Training with PyTorch](https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/getting-started-movielens/03-Training-with-PyTorch.ipynb)