In [1]:
%%bash
cd /core
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
pip install . --no-deps

cd /dataloader
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
pip install . --no-deps

cd /nvtabular
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
pip install . --no-deps

cd /workspace
pip install . --no-deps

cd /systems
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
pip install . --no-deps

cd /transformers4rec
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
pip install . --no-deps

In [2]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_models_entertainment-with-pretrained-embeddings/nvidia_logo.png" style="width: 90px; float: right;">

# Training with pretrained embeddings

## Overview

In this use case we will consider how we might train with pretrained embeddings.

Pretrained embeddings can allow our model to include information from additional modalities (for instance, we might want to grab CNN descriptors of product images). They can also come from other models that we train on our data. For example, we might train a word2vec model on the sequence of purchased items by a customer and want to include this information in our retrieval or ranking model.

The use cases are many, but this particular example will focus on the technical aspects of working with pretrained embeddings.

We will use a synthetic version of the MovieLens 100k dataset and emulate a scenario where we would have a pretrained embedding for each of the movies in the dataset.

### Learning objectives

- Training with pretrained embeddings
- Understanding [the Schema file](https://github.com/NVIDIA-Merlin/core/blob/main/merlin/schema/schema.py)

## Downloading and preparing the dataset

In [3]:
import merlin.models.tf as mm
from merlin.schema.tags import Tags
import tensorflow as tf
from merlin.models.tf.blocks import *
from merlin.datasets.synthetic import generate_data

import numpy as np


In [4]:
train = generate_data('movielens-100k', num_rows=100_000)
train.schema = train.schema.excluding_by_name(["title"])

In [5]:
target_column = train.schema.select_by_tag(Tags.TARGET).column_names[1]
target_column

'rating_binary'

# Passing the embeddings directly to our model using `TensorInitializer`

One way of passing the embeddings directly to our model is using the `TensorInitializer` as part of the `mm.Embeddings`.

This is a straightforward method that works well with small embedding tables.

Let's begin by looking at the schema, which holds vital information about our dataset. We can extract the embedding table size for the `movieId` column from it.

In [6]:
train.schema['movieId'].properties['embedding_sizes']['cardinality']

1680.0

From the schema, we can tell that the cardinality of `movieId` is 1680. Index 0 will be used in case an unknown `movieId` is encountered.

To accommodate this extra row, we initialize an embedding table of shape (1681, 128).

In [7]:
pretrained_movie_embs = np.random.random((1681, 128))

In [8]:
schema = train.schema

This is only a mock-up embedding table. In reality, this is where we would pass in embeddings produced by another model.

The embedding dimension of 128 is arbitrary. We could have specified some other value here, though multiples of 8 generally tend to work well.
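As a rough sketch of what "passing embeddings from another model" could look like, the snippet below assembles a table from a hypothetical mapping of encoded `movieId` to vector. The name `external_vectors` and the handful of IDs in it are made up for illustration; the only things taken from the text above are the table shape and the convention that row 0 is reserved for unknown IDs.

```python
import numpy as np

rng = np.random.default_rng(0)
cardinality, dim = 1680, 128

# Hypothetical output of another model: encoded movieId -> 128-d vector.
# Only three IDs are present to keep the example small.
external_vectors = {movie_id: rng.random(dim) for movie_id in (1, 2, 1680)}

# Row 0 is reserved for unknown IDs, so the table needs cardinality + 1 rows.
table = np.zeros((cardinality + 1, dim), dtype=np.float32)
for movie_id, vector in external_vectors.items():
    table[movie_id] = vector

print(table.shape)
```

IDs without a pretrained vector keep an all-zero row in this sketch; whether zeros, a mean vector, or random values are the better fallback depends on the use case.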

We need to update the schema properties of our `movieId` column since we will not be using the default embedding dimension.

In [9]:
schema['movieId'].properties['embedding_sizes'] = {
    'cardinality': float(pretrained_movie_embs.shape[0]), 
    'dimension': float(pretrained_movie_embs.shape[1])
}

In [10]:
train.schema = schema

In [11]:
train.schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.freq_threshold,properties.max_size,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.cat_path,properties.num_buckets,properties.start_index,properties.domain.min,properties.domain.max,properties.domain.name
0,movieId,"(Tags.CATEGORICAL, Tags.ITEM, Tags.ID)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,1681.0,128.0,.//categories/unique.movieId.parquet,,0.0,1.0,1680.0,movieId
1,userId,"(Tags.USER, Tags.CATEGORICAL, Tags.ID)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,943.0,74.0,.//categories/unique.userId.parquet,,0.0,0.0,943.0,userId
2,genres,"(Tags.CATEGORICAL, Tags.ITEM)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,216.0,32.0,.//categories/unique.genres.parquet,,0.0,0.0,216.0,genres
3,TE_movieId_rating,(Tags.CONTINUOUS),"DType(name='float64', element_type=<ElementTyp...",False,False,,,,,,,,,,
4,userId_count,(Tags.CONTINUOUS),"DType(name='float32', element_type=<ElementTyp...",False,False,,,,,,,,,,
5,gender,"(Tags.USER, Tags.CATEGORICAL)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,2.0,16.0,.//categories/unique.gender.parquet,,0.0,0.0,2.0,gender
6,zip_code,"(Tags.USER, Tags.CATEGORICAL)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,795.0,67.0,.//categories/unique.zip_code.parquet,,0.0,0.0,795.0,zip_code
7,rating,"(Tags.REGRESSION, Tags.TARGET)","DType(name='int64', element_type=<ElementType....",False,False,,,,,,,,,,
8,rating_binary,"(Tags.BINARY_CLASSIFICATION, Tags.TARGET)","DType(name='int32', element_type=<ElementType....",False,False,,,,,,,,,,
9,age,"(Tags.USER, Tags.CATEGORICAL)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,8.0,16.0,.//categories/unique.age.parquet,,0.0,0.0,8.0,age
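The default dimensions in the schema output above (74 for `userId`, 67 for `zip_code`, 32 for `genres`, 16 for `gender` and `age`) are consistent with a common rule of thumb that scales the dimension as `1.6 * cardinality ** 0.56` and clips it to a fixed range. This is an inference from the numbers, not something stated in this notebook, and `default_embedding_dim` is a hypothetical helper rather than a Merlin API.

```python
def default_embedding_dim(cardinality, minimum=16, maximum=512):
    """Assumed rule of thumb: 1.6 * cardinality ** 0.56, clipped to [minimum, maximum]."""
    return int(min(max(minimum, round(1.6 * cardinality ** 0.56)), maximum))


# Cardinalities as reported in the schema output.
columns = [("userId", 943), ("zip_code", 795), ("genres", 216), ("gender", 2), ("age", 8)]
for name, cardinality in columns:
    print(name, cardinality, default_embedding_dim(cardinality))
```

The `movieId` row no longer follows this pattern because we overwrote its dimension with 128 to match the pretrained table.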


## Building the model

We now have everything we need to construct a model and train on our custom embeddings. To do so, we pass a `TensorInitializer` to `mm.Embeddings` and set the `trainable` argument to `False` for `movieId`, so that our pretrained embeddings are frozen and not updated during model training.

In [12]:
embed_dims = {}
embed_dims["movieId"] = pretrained_movie_embs.shape[1]

embeddings_init={
    "movieId": mm.TensorInitializer(pretrained_movie_embs),
}

embeddings_block = mm.Embeddings(
    train.schema.select_by_tag(Tags.CATEGORICAL),
    infer_embedding_sizes=True,
    embeddings_initializer=embeddings_init,
    trainable={'movieId': False},
    dim=embed_dims,
)
input_block = mm.InputBlockV2(train.schema, categorical=embeddings_block)

Let us now feed our `input_block` into our model.

In [13]:
model = mm.DCNModel(
    train.schema,
    depth=2,
    input_block=input_block,
    deep_block=mm.MLPBlock([64, 32]),
    prediction_tasks=mm.BinaryOutput(target_column)
)

We could have created the model without passing the `embeddings_block` to the `input_block`. The model would still be able to infer how to construct itself (what should be the dimensionality of the input layer and so on) from the information contained in the schema.

However, passing a `TensorInitializer` through the `embeddings_block` tells our model to use our embedding table (`pretrained_movie_embs`) for that particular column of our dataset (`movieId`), as opposed to randomly initializing a brand new embedding matrix. For categorical columns where we do not provide this information, the model falls back to the standard logic, which is to create an embedding table of appropriate size and initialize it randomly.

Additionally, we set the `trainable` parameter for our pre-trained embeddings to `False` to ensure the embeddings will not be modified during training.
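To make the effect of freezing concrete, here is a minimal NumPy sketch of a single gradient step in which only tables flagged as trainable are updated. It is a conceptual illustration, not how Merlin or Keras implement `trainable`; the toy tables and the all-ones gradients are stand-ins chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy embedding tables: one pretrained and frozen, one trainable.
tables = {
    "movieId": rng.random((5, 4)),
    "userId": rng.random((3, 4)),
}
trainable = {"movieId": False, "userId": True}

# Stand-in gradients; a real model would obtain these from backpropagation.
grads = {name: np.ones_like(table) for name, table in tables.items()}

before = {name: table.copy() for name, table in tables.items()}
learning_rate = 0.1
for name, table in tables.items():
    if trainable[name]:
        table -= learning_rate * grads[name]  # in-place update of trainable tables only

for name in tables:
    print(name, "changed:", not np.array_equal(tables[name], before[name]))
```

The frozen table keeps exactly the values it was initialized with, which is the behaviour we want when the vectors come from another model.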

## Training

We train our model with `AUC` as our metric.

As we use synthetic data, the AUC score will not improve significantly.
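The sketch below illustrates why, under the assumption that the synthetic labels are essentially independent of the features. In that case no scoring function can rank positives above negatives better than chance, and AUC stays near 0.5. The `roc_auc` helper is written for this example and ignores tied scores.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=np.float64)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)
scores = rng.random(10_000)  # scores carry no information about the labels
auc = roc_auc(labels, scores)
print(round(float(auc), 3))
```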

In [14]:
%%time
opt = tf.keras.optimizers.Adagrad(learning_rate=1e-1)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])

CPU times: user 14.1 ms, sys: 777 µs, total: 14.9 ms
Wall time: 13.7 ms


In [15]:
model.fit(train, batch_size=1024, epochs=5)

Epoch 1/5




Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f05590a70a0>

In [16]:
history = model.history.history

# Passing the `EmbeddingOperator` to the `Loader`

Another way of training with pretrained embeddings is to create a custom `Loader` and equip it with the ability to feed embeddings to our model.

When we use `mm.Embeddings` and `TensorInitializer` as above, the embeddings are moved to the GPU and can be considered part of our model. That might become problematic if the embedding table is large.
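A quick back-of-the-envelope calculation shows how fast a dense table grows. The sketch assumes float32 storage (4 bytes per value), and the 50-million-row catalogue is a hypothetical figure chosen only for comparison.

```python
def table_size_gib(num_rows, dim, bytes_per_value=4):
    """Memory needed to hold a dense embedding table, in GiB (float32 by default)."""
    return num_rows * dim * bytes_per_value / 2**30


small = table_size_gib(1681, 128)        # the table used in this example
large = table_size_gib(50_000_000, 128)  # hypothetical large catalogue
print(f"{small:.6f} GiB vs {large:.1f} GiB")
```

Our table is well under a megabyte, while the hypothetical one would take more than 20 GiB before counting optimizer state or the rest of the model.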

With the approach below, the pretrained embeddings are passed to the model as part of each batch. We do not hold the embedding table, which depending on the scenario might consist of millions of rows, in GPU memory.
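Conceptually, a per-batch lookup amounts to indexing the table with the IDs in the batch and attaching the resulting rows as an extra feature. The NumPy sketch below shows that idea only; it is not the implementation of the dataloader's `EmbeddingOperator`, and `attach_embeddings` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.random((1681, 128)).astype(np.float32)  # stays in host memory


def attach_embeddings(batch, table, lookup_key, embedding_name):
    """Return a copy of the batch with one pretrained vector per row added."""
    out = dict(batch)
    out[embedding_name] = table[batch[lookup_key]]  # integer-array indexing = row lookup
    return out


batch = {"movieId": np.array([3, 0, 1680, 3])}
batch = attach_embeddings(batch, table, "movieId", "pretrained_movie_embeddings")
print(batch["pretrained_movie_embeddings"].shape)
```

Only `batch_size` rows of the table travel with each batch, so the memory that reaches the model scales with the batch size rather than with the number of items.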

We can reuse the train data and the embedding information we generated above.

In [17]:
pretrained_movie_embs.shape

(1681, 128)

In [18]:
from merlin.dataloader.ops.embeddings import EmbeddingOperator

loader = mm.Loader(
    train,
    batch_size=1024,
    transforms=[
        EmbeddingOperator(
            pretrained_movie_embs,
            lookup_key="movieId",
            embedding_name="pretrained_movie_embeddings",
        ),
    ],
)

We need to recreate the model, because the embeddings will now be passed to it straight from the dataloader.

In [19]:
loader.output_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.freq_threshold,properties.max_size,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.cat_path,properties.num_buckets,properties.start_index,properties.domain.min,properties.domain.max,properties.domain.name,properties.value_count.min,properties.value_count.max
0,movieId,"(Tags.CATEGORICAL, Tags.ITEM, Tags.ID)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,1681.0,128.0,.//categories/unique.movieId.parquet,,0.0,1.0,1680.0,movieId,,
1,userId,"(Tags.USER, Tags.CATEGORICAL, Tags.ID)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,943.0,74.0,.//categories/unique.userId.parquet,,0.0,0.0,943.0,userId,,
2,genres,"(Tags.CATEGORICAL, Tags.ITEM)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,216.0,32.0,.//categories/unique.genres.parquet,,0.0,0.0,216.0,genres,,
3,TE_movieId_rating,(Tags.CONTINUOUS),"DType(name='float64', element_type=<ElementTyp...",False,False,,,,,,,,,,,,
4,userId_count,(Tags.CONTINUOUS),"DType(name='float32', element_type=<ElementTyp...",False,False,,,,,,,,,,,,
5,gender,"(Tags.USER, Tags.CATEGORICAL)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,2.0,16.0,.//categories/unique.gender.parquet,,0.0,0.0,2.0,gender,,
6,zip_code,"(Tags.USER, Tags.CATEGORICAL)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,795.0,67.0,.//categories/unique.zip_code.parquet,,0.0,0.0,795.0,zip_code,,
7,rating,"(Tags.REGRESSION, Tags.TARGET)","DType(name='int64', element_type=<ElementType....",False,False,,,,,,,,,,,,
8,rating_binary,"(Tags.BINARY_CLASSIFICATION, Tags.TARGET)","DType(name='int32', element_type=<ElementType....",False,False,,,,,,,,,,,,
9,age,"(Tags.USER, Tags.CATEGORICAL)","DType(name='int32', element_type=<ElementType....",False,False,0.0,0.0,8.0,16.0,.//categories/unique.age.parquet,,0.0,0.0,8.0,age,,


In [27]:
embeddings_block = mm.Embeddings(
    loader.output_schema.select_by_tag(Tags.CATEGORICAL).remove_col('movieId'),
)

pretrained_embeddings = mm.PretrainedEmbeddings(
    loader.output_schema.select_by_tag(Tags.EMBEDDING)
)

input_block = mm.InputBlockV2(loader.output_schema, categorical=embeddings_block, pretrained_embeddings=pretrained_embeddings)

model = mm.DCNModel(
    loader.output_schema,
    depth=2,
    input_block=input_block,
    deep_block=mm.MLPBlock([64, 32]),
    prediction_tasks=mm.BinaryOutput(target_column)
)

## Training

In [26]:
%%time
opt = tf.keras.optimizers.Adagrad(learning_rate=1e-1)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])

model.fit(loader, epochs=5)

KeyError: 'rating'

In [23]:
# history_with_embeddings = model.history.history

AttributeError: 'NoneType' object has no attribute 'history'

The model trains using pretrained embeddings.