In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Training with pretrained embeddings

## Overview

This example shows how to train a model with pretrained embeddings.

Pretrained embeddings let our model include information from additional modalities (for instance, CNN descriptors of product images). They can also come from other models that we train on our own data. For example, we might train a word2vec model on the sequence of items each customer purchased and include the resulting item vectors in our retrieval or ranking model.

The use cases are many, but this particular example will focus on the technical aspects of working with pretrained embeddings.

We will use a synthetic version of the MovieLens 100k dataset and emulate a scenario where we would have a pretrained embedding for each of the movies in the dataset.

### Learning objectives

- Training with pretrained embeddings
- Understanding [the Schema file](https://github.com/NVIDIA-Merlin/core/blob/main/merlin/schema/schema.py)

## Downloading and preparing the dataset

In [2]:
import merlin.models.tf as mm
from merlin.schema.tags import Tags
import tensorflow as tf
from merlin.models.tf.prediction_tasks.classification import BinaryClassificationTask
from merlin.models.tf.blocks import *
from merlin.datasets.synthetic import generate_data

import numpy as np


In [3]:
train = generate_data('movielens-100k', num_rows=100_000)

In [4]:
target_column = train.schema.select_by_tag(Tags.TARGET).column_names[1]
target_column

'rating_binary'

The schema holds vital information about our dataset. We can extract the embedding table size for the `movieId` column from it.

In [5]:
train.schema

| | name | tags | dtype | is_list | is_ragged | properties.freq_threshold | properties.max_size | properties.embedding_sizes.dimension | properties.embedding_sizes.cardinality | properties.cat_path | properties.num_buckets | properties.start_index | properties.domain.min | properties.domain.max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | movieId | (Tags.CATEGORICAL, Tags.ITEM, Tags.ITEM_ID) | int32 | False | False | 0.0 | 0.0 | 102.0 | 1680.0 | .//categories/unique.movieId.parquet | | 0.0 | 0.0 | 1680.0 |
| 1 | userId | (Tags.USER, Tags.CATEGORICAL, Tags.USER_ID) | int32 | False | False | 0.0 | 0.0 | 74.0 | 943.0 | .//categories/unique.userId.parquet | | 0.0 | 0.0 | 943.0 |
| 2 | genres | (Tags.CATEGORICAL, Tags.ITEM) | int32 | False | False | 0.0 | 0.0 | 32.0 | 216.0 | .//categories/unique.genres.parquet | | 0.0 | 0.0 | 216.0 |
| 3 | TE_movieId_rating | (Tags.CONTINUOUS) | float64 | False | False | | | | | | | | | |
| 4 | userId_count | (Tags.CONTINUOUS) | float32 | False | False | | | | | | | | | |
| 5 | gender | (Tags.USER, Tags.CATEGORICAL) | int32 | False | False | 0.0 | 0.0 | 16.0 | 2.0 | .//categories/unique.gender.parquet | | 0.0 | 0.0 | 2.0 |
| 6 | zip_code | (Tags.USER, Tags.CATEGORICAL) | int32 | False | False | 0.0 | 0.0 | 67.0 | 795.0 | .//categories/unique.zip_code.parquet | | 0.0 | 0.0 | 795.0 |
| 7 | rating | (Tags.TARGET, Tags.REGRESSION) | int64 | False | False | | | | | | | | | |
| 8 | rating_binary | (Tags.BINARY_CLASSIFICATION, Tags.TARGET) | int32 | False | False | | | | | | | | | |
| 9 | age | (Tags.USER, Tags.CATEGORICAL) | int32 | False | False | 0.0 | 0.0 | 16.0 | 8.0 | .//categories/unique.age.parquet | | 0.0 | 0.0 | 8.0 |


From the schema, we can tell that the cardinality of `movieId` is 1680. The embedding at index 0 is reserved for any unknown `movieId` the model encounters, so the embedding table needs one extra row, for a total of 1681.

In [6]:
train.schema.column_schemas['movieId'].properties['embedding_sizes']['cardinality']

1680.0

In [7]:
pretrained_movie_embs = np.random.random((1681, 64))
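
The random matrix above stands in for real pretrained vectors. As a minimal sketch of how such a table might be assembled when the vectors come from another model, the snippet below fills one row per encoded `movieId` and leaves row 0 for unknown ids. The `pretrained_vectors` dictionary and the choice of a zero vector for row 0 are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

# Values taken from the notebook: movieId cardinality and embedding width.
cardinality = 1680
embedding_dim = 64

# Hypothetical source of pretrained vectors, keyed by encoded movieId (1..cardinality).
# Random values are used here only so that the sketch is self-contained.
rng = np.random.default_rng(seed=0)
pretrained_vectors = {
    movie_id: rng.random(embedding_dim) for movie_id in range(1, cardinality + 1)
}

# One extra row: index 0 is reserved for unknown movieIds.
# Assumption: the unknown row is left as zeros.
embedding_table = np.zeros((cardinality + 1, embedding_dim), dtype=np.float32)
for movie_id, vector in pretrained_vectors.items():
    embedding_table[movie_id] = vector

print(embedding_table.shape)
```

The important property is that row `i` of the table holds the vector for encoded id `i`, so the table order has to match the id encoding used in the dataset.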

Let us now feed this into our model.

## Building the model

In [8]:
model = mm.DCNModel(
    train.schema,
    depth=2,
    deep_block=mm.MLPBlock([64, 32]),
    prediction_tasks=mm.BinaryClassificationTask(target_column),
    embedding_options=mm.EmbeddingOptions(
        embeddings_initializers={
            "movieId": mm.TensorInitializer(pretrained_movie_embs),
        }
    )
)
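
Conceptually, initializing an embedding table from a tensor means that, before any training step, looking up an id returns the corresponding pretrained row. The NumPy sketch below illustrates that lookup outside of any framework. The `lookup` helper is hypothetical, and its fallback of out-of-range ids to row 0 is an assumption made to mirror the "unknown id" convention described earlier; it is not a description of how Merlin Models handles ids internally.

```python
import numpy as np


def lookup(table, ids):
    """Return one embedding row per id; ids outside the table fall back to row 0."""
    ids = np.asarray(ids)
    in_range = (ids >= 0) & (ids < table.shape[0])
    safe_ids = np.where(in_range, ids, 0)
    return table[safe_ids]


# Stand-in for the pretrained matrix, with the same shape as in the notebook.
rng = np.random.default_rng(seed=0)
table = rng.random((1681, 64))

# Two known ids and one id that is outside the table.
batch = lookup(table, [3, 1680, 5000])
print(batch.shape)
```

During training the rows may then be updated by the optimizer like any other weights, depending on how the embedding layer is configured.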



## Training

In [9]:
%%time
opt = tf.keras.optimizers.Adagrad(learning_rate=1e-1)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
model.fit(train, batch_size=1024, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 13.6 s, sys: 1.5 s, total: 15.1 s
Wall time: 8.23 s


<keras.callbacks.History at 0x7f9abfc2a310>

The model trains, and it starts from the pretrained embeddings we supplied for `movieId`.