In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Getting Started with `Merlin dataloader`

This notebook was created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow) container.

## Overview

[Merlin dataloader](https://github.com/NVIDIA-Merlin/dataloader) is a library for constructing highly optimized dataloaders that feed `TensorFlow/Keras` and `PyTorch` models during training. You preprocess your data using a [Merlin NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) workflow and hand the resulting dataset over to `Merlin dataloader`.

In this notebook we will download the MovieLens dataset of movie ratings and load it directly from a CSV file on disk into a `Merlin Dataset`. We will then construct a dataloader and build a simple `MatrixFactorization` model in TensorFlow/Keras.

### Learning objectives

- Understand `Merlin dataloader` high-level concepts
- Become familiar with `Merlin dataloader` and understand the shape of its outputs

# Downloading the dataset

### MovieLens25M

[MovieLens25M](https://grouplens.org/datasets/movielens/25m/) is a popular dataset for recommender systems and is widely used in academic publications. It contains 25M movie ratings for 62,000 movies given by 162,000 users. Many projects use only the user/item/rating information of MovieLens, but the original dataset also provides metadata for the movies, for example the genres each movie belongs to.

In this notebook, we will only use the user-movie pairs along with the ratings a user assigned to the movie.

Let's begin by downloading the [`MovieLens 25M Dataset`](https://grouplens.org/datasets/movielens/).

In [2]:
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip

--2022-11-16 00:31:17--  https://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2022-11-16 00:32:07 (5.25 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]



In [3]:
%%capture

!apt update
!apt install unzip
!unzip -q ml-25m.zip

We have now downloaded and extracted the data and can read in the ratings.

In [1]:
from merlin.core.dispatch import get_lib

In [2]:
ratings = get_lib().read_csv('ml-25m/ratings.csv')
ratings.head()

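|   | userId | movieId | rating | timestamp  |
|---|--------|---------|--------|------------|
| 0 | 1      | 296     | 5.0    | 1147880044 |
| 1 | 1      | 306     | 3.5    | 1147868817 |
| 2 | 1      | 307     | 5.0    | 1147868828 |
| 3 | 1      | 665     | 5.0    | 1147878820 |
| 4 | 1      | 899     | 3.5    | 1147868510 |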


The `ratings.csv` file stores the ratings that users have given to movies. Let's load the data directly from disk into a `Merlin Dataset` and train a simple `MatrixFactorization` model that we will construct in `TensorFlow`.

In [3]:
from merlin.io import Dataset

dataset = Dataset('ml-25m/ratings.csv')

Let us now instantiate the `dataloader`.

In [4]:
from merlin.loader.tensorflow import Loader
loader = Loader(dataset, batch_size=65536)

2022-11-16 01:52:40.081385: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 01:52:40.081755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 01:52:40.081949: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
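With a batch size of 65536, the number of batches per epoch follows from simple arithmetic. The sketch below uses the rounded figure of 25 million ratings quoted earlier and assumes the final, partially filled batch is kept rather than dropped; `n_rows` and the other variable names are illustrative only.

```python
import math

n_rows = 25_000_000   # approximate number of ratings (assumption: rounded figure)
batch_size = 65536    # same value passed to Loader above

# Number of batches per epoch if the last, smaller batch is kept.
n_batches = math.ceil(n_rows / batch_size)

# Rows left over for the final batch.
last_batch_rows = n_rows - (n_batches - 1) * batch_size

print(n_batches, last_batch_rows)  # 382 30784
```

A large batch size like this keeps the number of steps per epoch small, which is one reason a full pass over the data can finish quickly.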


As is, the `loader` outputs each batch as a tuple: a dictionary that maps column names to tensors, followed by `None`.

In [5]:
batch = next(iter(loader))
batch

2022-11-16 01:52:41.673770: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-16 01:52:41.674969: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 01:52:41.675174: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-16 01:52:41.675330: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning

({'userId': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[ 76153],
         [158537],
         [ 73683],
         ...,
         [128780],
         [106222],
         [126524]])>,
  'movieId': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[96610],
         [ 1242],
         [ 1047],
         ...,
         [ 5610],
         [ 2028],
         [78499]])>,
  'timestamp': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[1544993745],
         [1090906537],
         [ 862850547],
         ...,
         [1078376169],
         [1434313376],
         [1527204337]])>,
  'rating': <tf.Tensor: shape=(65536, 1), dtype=float64, numpy=
  array([[5. ],
         [3.5],
         [5. ],
         ...,
         [3. ],
         [5. ],
         [3. ]])>},
 None)

`TensorFlow` expects the targets in the second position of the tuple, where the loader currently returns `None`.

Let's write a function that rearranges each batch into that structure on the fly.

In [6]:
def process_batch(data, _):
    return ({'userId': data['userId'], 'movieId': data['movieId']}, data['rating'])

loader._map_fns = [process_batch]

We now have the data in the shape that `Tensorflow` expects.

In [7]:
batch = next(iter(loader))
batch

({'userId': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[  8168],
         [ 63125],
         [127567],
         ...,
         [ 55148],
         [ 88981],
         [ 25953]])>,
  'movieId': <tf.Tensor: shape=(65536, 1), dtype=int64, numpy=
  array([[ 2952],
         [ 1250],
         [  293],
         ...,
         [68358],
         [99114],
         [55241]])>},
 <tf.Tensor: shape=(65536, 1), dtype=float64, numpy=
 array([[3.5],
        [3. ],
        [4. ],
        ...,
        [4. ],
        [4. ],
        [0.5]])>)
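The same idea generalizes to any choice of feature and target columns. The sketch below is a hypothetical helper, `make_process_batch`, which is not part of Merlin; it builds a function with the same `(data, _)` signature as `process_batch` above. Plain Python lists stand in for tensors so the example runs without a GPU or TensorFlow.

```python
def make_process_batch(feature_cols, target_col):
    """Build a function that maps (features_dict, None) to (features, target)."""
    def process_batch(data, _):
        # Keep only the requested feature columns.
        features = {name: data[name] for name in feature_cols}
        # Move the target column into the second position of the tuple.
        return features, data[target_col]
    return process_batch


# Plain Python lists stand in for the tensors the loader would yield.
fake_batch = (
    {
        "userId": [[1], [1]],
        "movieId": [[296], [306]],
        "timestamp": [[1147880044], [1147868817]],
        "rating": [[5.0], [3.5]],
    },
    None,
)

process_batch = make_process_batch(["userId", "movieId"], "rating")
features, targets = process_batch(*fake_batch)

print(sorted(features))  # ['movieId', 'userId']
print(targets)           # [[5.0], [3.5]]
```

Because the function only indexes into the dictionary, it should behave the same way when the values are tensors instead of lists.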

Let us now construct a simple `MatrixFactorization` model.

In [8]:
import tensorflow as tf

class MatrixFactorization(tf.keras.Model):
    def __init__(self, n_factors):
        super().__init__()
        # Raw IDs are used directly as indices, so each table needs max_id + 1 rows.
        self.user_embeddings = tf.keras.layers.Embedding(int(ratings['userId'].max()) + 1, n_factors)
        self.movie_embeddings = tf.keras.layers.Embedding(int(ratings['movieId'].max()) + 1, n_factors)
        
    def call(self, batch, training=False):
        user_embs = self.user_embeddings(batch['userId'])
        movie_embs = self.movie_embeddings(batch['movieId'])
        
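        # Squeeze only the length-1 axis so that a batch of size 1 keeps its batch dimension.
        tensor = tf.squeeze(user_embs, axis=1) * tf.squeeze(movie_embs, axis=1)
        return tf.reduce_sum(tensor, 1)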
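To make the forward pass concrete, here is a dependency-free sketch of what the model computes for one (user, movie) pair: the predicted rating is the dot product of the two embedding vectors. The tiny tables and their values are made up for illustration and are not learned weights. The sketch also shows why a table indexed directly by raw ID needs `max_id + 1` rows when IDs start at 1.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


max_user_id, max_movie_id, n_factors = 2, 3, 2

# Raw IDs are used as row indices, so valid indices run from 0 to max_id.
# That requires max_id + 1 rows; row 0 is simply never used when IDs start at 1.
user_table = [[0.0] * n_factors for _ in range(max_user_id + 1)]
movie_table = [[0.0] * n_factors for _ in range(max_movie_id + 1)]

# Illustrative values only.
user_table[2] = [1.0, 2.0]
movie_table[3] = [0.5, 1.5]

# Predicted rating for user 2 and movie 3: 1.0 * 0.5 + 2.0 * 1.5
prediction = dot(user_table[2], movie_table[3])
print(prediction)  # 3.5
```

The Keras model does the same thing for a whole batch at once: an element-wise product of the two embedding matrices followed by a sum over the factor axis.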

In [9]:
model = MatrixFactorization(64)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-2), loss=tf.keras.losses.MeanSquaredError())

Let's first calculate the Mean Squared Error loss before training.

In [10]:
model.evaluate(loader)



13.613324165344238

Let us now train for a single epoch.

In [11]:
model.fit(loader, epochs=1)

2022-11-16 01:53:06.514200: W tensorflow/core/common_runtime/forward_type_inference.cc:231] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_BOOL
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_LEGACY_VARIANT
    }
  }
}

	while inferring type of node 'mean_squared_error/cond/output/_11'




<keras.callbacks.History at 0x7f118af504c0>

In [12]:
model.evaluate(loader)



6.169384479522705

The model has improved after a single epoch over all 25 million ratings: the mean squared error dropped from about 13.6 to about 6.2.

In a more realistic scenario we would have held out a validation dataset, trained for more epochs, tuned the hyperparameters and so on.

Nonetheless, the objective here was to quickly show you the ropes on integrating `Merlin dataloader` with `TensorFlow`.
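As a pointer for the validation split mentioned above, one common choice for recommender data (though not the only one) is to hold out the most recent interactions. The sketch below is a minimal illustration in plain Python; `time_based_split` and `valid_fraction` are hypothetical names, and on the full dataset you would more likely do this on the DataFrame before building the `Dataset`.

```python
def time_based_split(rows, valid_fraction=0.2):
    """Split rows into (train, valid), putting the most recent rows in valid."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    n_valid = max(1, round(len(ordered) * valid_fraction))
    return ordered[:-n_valid], ordered[-n_valid:]


# The five rows shown by ratings.head() earlier in the notebook.
rows = [
    {"userId": 1, "movieId": 296, "rating": 5.0, "timestamp": 1147880044},
    {"userId": 1, "movieId": 306, "rating": 3.5, "timestamp": 1147868817},
    {"userId": 1, "movieId": 307, "rating": 5.0, "timestamp": 1147868828},
    {"userId": 1, "movieId": 665, "rating": 5.0, "timestamp": 1147878820},
    {"userId": 1, "movieId": 899, "rating": 3.5, "timestamp": 1147868510},
]

train, valid = time_based_split(rows)
print(len(train), len(valid))           # 4 1
print([r["movieId"] for r in valid])    # [296]
```

Evaluating on rows the model never saw during training gives a more honest picture than the training-set loss reported in this notebook.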