In [1]:
!pwd

/workspace/examples/usecases


In [2]:
import os

In [3]:
os.chdir('/workspace')

In this use case we will consider how we might train with pretrained embeddings.

Pretrained embeddings can allow our model to include information from other modalities (for instance, we might want to grab CNN descriptors of product images). They can also come from other models that we train on our data. For example, we might train a word2vec model on the sequence of purchased items by a customer and want to include this information in our retrieval or ranking model.

The use cases are many, but this particular example will focus just on the technical aspects of working with pretrained embeddings.

We will use the MovieLens 100k dataset and emulate a scenario where we would have a pretrained embedding for each of the movies in the train dataset.

In [4]:
import merlin.models.tf as mm
from merlin.datasets.entertainment import get_movielens
from merlin.schema.tags import Tags
import tensorflow as tf
from merlin.models.tf.prediction_tasks.classification import BinaryClassificationTask
from merlin.models.tf.blocks import *

import numpy as np


In [5]:
train, valid = get_movielens(variant="ml-100k")



In [6]:
target_column = train.schema.select_by_tag(Tags.TARGET).column_names[1]
target_column

'rating_binary'

From the schema, we can tell that there are 1681 known movie ids in the dataset. The movieId to movie mapping is stored in `.//categories/unique.movieId.parquet`. Let's read the file and take a closer look.

In [7]:
import cudf

movieIds = cudf.read_parquet('.//categories/unique.movieId.parquet')

In [8]:
movieIds

      movieId  movieId_size
0        <NA>             0
1          50           495
2         100           443
3         181           439
4         258           412
...       ...           ...
1676     1678             1
1677     1679             1
1678     1680             1
1679     1681             1


The highest movie id in the train dataset is 1682, with id 0 reserved for movies not seen in the train set.
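
As a rough illustration of the "id 0 is reserved" convention, the sketch below encodes raw movie ids into table indices and sends anything unknown to 0. The helper names `build_id_lookup` and `encode_ids` are hypothetical and not part of Merlin; the real mapping comes from the parquet file read above.

```python
def build_id_lookup(known_raw_ids):
    """Map each known raw id to an encoded id starting at 1; 0 stays reserved."""
    return {raw_id: idx for idx, raw_id in enumerate(known_raw_ids, start=1)}


def encode_ids(raw_ids, lookup):
    """Encode raw ids, sending any id missing from the lookup to 0."""
    return [lookup.get(raw_id, 0) for raw_id in raw_ids]


# The first four raw ids shown in the mapping table above.
lookup = build_id_lookup([50, 100, 181, 258])

# 9999 is a made-up id that was never seen, so it falls back to 0.
encoded = encode_ids([100, 9999, 50], lookup)
print(encoded)  # [2, 0, 1]
```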

Let's create a mock embedding matrix. In a regular scenario, that is where our pretrained embeddings would go.

In [9]:
pretrained_movie_embs = np.random.random((1682, 64))
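
With real pretrained vectors, the rows of the matrix would presumably need to line up with the encoded ids rather than be random. The following is a minimal sketch of one way to assemble such a matrix, assuming row `i` of the table corresponds to encoded id `i` and row 0 is the reserved slot. `build_embedding_matrix`, the toy `pretrained` dictionary, and the small-random fallback for movies without a vector are all illustrative assumptions, not Merlin APIs.

```python
import numpy as np


def build_embedding_matrix(raw_ids, pretrained, dim, seed=0):
    """Return a (len(raw_ids), dim) matrix whose row i holds the vector
    for the raw id at encoded index i.

    raw_ids[0] is expected to be None (the reserved "unseen" slot) and
    its row is left as zeros. Raw ids without a pretrained vector get a
    small random vector instead (an arbitrary choice for this sketch).
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(raw_ids), dim), dtype=np.float32)
    for idx, raw_id in enumerate(raw_ids):
        if raw_id is None:
            continue  # reserved row stays all zeros
        vec = pretrained.get(raw_id)
        if vec is None:
            vec = rng.normal(scale=0.01, size=dim)
        matrix[idx] = vec
    return matrix


# Toy data: encoded index -> raw movie id, mirroring the head of the table above.
raw_ids = [None, 50, 100, 181, 258]
# Hypothetical pretrained vectors; movie 181 is deliberately missing.
pretrained = {50: np.ones(4), 100: np.full(4, 2.0), 258: np.full(4, 3.0)}

emb = build_embedding_matrix(raw_ids, pretrained, dim=4)
print(emb.shape)  # (5, 4)
```

The resulting array could then take the place of the random matrix, provided its shape matches what the model expects for `movieId`.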

Let us now feed this into our model.

In [10]:
model = mm.DCNModel(
    train.schema,
    depth=2,
    deep_block=mm.MLPBlock([64, 32]),
    prediction_tasks=mm.BinaryClassificationTask(target_column),
    embedding_options=mm.EmbeddingOptions(
        embeddings_initializers={
            "movieId": mm.TensorInitializer(pretrained_movie_embs),
        }
    )
)



In [11]:
%%time
opt = tf.keras.optimizers.Adagrad(learning_rate=1e-1)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
model.fit(train, validation_data=valid, batch_size=1024, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 13 s, sys: 1.15 s, total: 14.2 s
Wall time: 8.7 s


<keras.callbacks.History at 0x7fe3a516d910>

The model trains, and we have made use of the pretrained embeddings.
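
For intuition about what an initialized embedding table contributes, an embedding lookup is conceptually just row selection by encoded id. The NumPy sketch below illustrates that idea with a small made-up table. It is not how Merlin or Keras implement the layer internally, and the names `table` and `batch_ids` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny stand-in for a pretrained table: 6 encoded ids, 4 dimensions each.
table = rng.random((6, 4))

# A batch of encoded movie ids; 0 is the reserved "unseen" id.
batch_ids = np.array([1, 3, 3, 0])

# Lookup is plain integer indexing: one table row per id in the batch.
vectors = table[batch_ids]
print(vectors.shape)  # (4, 4)
```

Because the same id always selects the same row, the pretrained values are what the model starts from at the beginning of training.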