# 01-04 : Retrieval using a sequential model

Recommender systems are often composed of two stages:

1. The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.

2. The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.

This notebook is going to focus on the first stage, retrieval.

Retrieval models are often composed of two sub-models:

1. A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features.

2. A candidate model computing the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then combined, typically with a dot product, to give a query-candidate affinity score. Higher scores indicate a better match between the candidate and the query.
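
As a minimal sketch of this scoring step, assume the two towers have already produced their embeddings. The vectors below are made-up values, not outputs from this notebook; the affinity is taken as the dot product of the query vector with each candidate vector:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; the values are illustrative only.
query_embedding = np.array([0.5, -1.0, 0.0, 2.0])
candidate_embeddings = np.array([
    [1.0, 0.0, 0.0, 1.0],   # candidate 0
    [0.0, 1.0, 0.0, 0.0],   # candidate 1
    [0.5, -1.0, 0.0, 2.0],  # candidate 2 (identical to the query)
])

# One affinity score per candidate: the dot product with the query.
scores = candidate_embeddings @ query_embedding

# Candidate indices sorted from best to worst match.
ranking = np.argsort(-scores)

print(scores)   # [ 2.5  -1.    5.25]
print(ranking)  # [2 0 1]
```

Because each score is a single dot product, all candidates can be scored with one matrix-vector multiplication, which is what keeps this stage cheap.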

## References

- [Recommending movies: retrieval using a sequential model](https://www.tensorflow.org/recommenders/examples/sequential_retrieval)
- [Item-to-item recommendation and sequential recommendation](https://www.youtube.com/watch?v=ZBaKzw938oM)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

import tensorflow as tf
import tensorflow_recommenders as tfrs

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer



## 1. The dataset

We will use the RetailRocket source dataset as prepared for the GRU4Rec paper:
https://github.com/JohnnyFoulds/GRU4Rec/blob/master/notebooks/01_%20preprocess/01-02_retailrocket.ipynb

The dataset is already split into training, validation, and test sets as tab separated files. The columns are:

- `SessionId` - the id of the session. In one session there are one or many items.
- `ItemId` - the id of the item.
- `Time` - the event time.

In [2]:
data_path = '../../data/RetailRocket'
model_path = '../../models/RetailRocket'

# file paths for the data files
train_path = f'{data_path}/retailrocket_processed_view_train_tr.tsv'
validation_path = f'{data_path}/retailrocket_processed_view_train_valid.tsv'
test_path = f'{data_path}/retailrocket_processed_view_test.tsv'

In [3]:
# load the datasets
df_train = pd.read_csv(train_path, sep='\t').sample(frac=0.3, random_state=42)
df_validation = pd.read_csv(validation_path, sep='\t')
df_test = pd.read_csv(test_path, sep='\t')

In [4]:
# show the shape of the datasets
print('Train      :', df_train.shape)
print('Validation :', df_validation.shape)
print('Test       :', df_test.shape)

Train      : (215097, 3)
Validation : (33812, 3)
Test       : (29148, 3)


In [5]:
# head of the training set
display(df_train.head())

Unnamed: 0,SessionId,ItemId,Time
150593,363804,451942,1433134557529
700326,1684469,441756,1440134153810
673447,1615632,357925,1434672250853
48855,117260,2129,1435772219980
515183,1231963,7804,1432004462904


In [6]:
# unique sessions in each dataset
print('Train      :', df_train.SessionId.nunique())
print('Validation :', df_validation.SessionId.nunique())
print('Test       :', df_test.SessionId.nunique())

Train      : 119950
Validation : 9408
Test       : 8036


In [7]:
# session length statistics for each dataset
print('--- Train ---')
print(df_train.groupby('SessionId').size().describe())
print('--- Validation ---')
print(df_validation.groupby('SessionId').size().describe())
print('--- Test ---')
print(df_test.groupby('SessionId').size().describe())

--- Train ---
count    119950.000000
mean          1.793222
std           2.249649
min           1.000000
25%           1.000000
50%           1.000000
75%           2.000000
max          97.000000
dtype: float64
--- Validation ---
count    9408.000000
mean        3.593963
std         5.100989
min         2.000000
25%         2.000000
50%         2.000000
75%         4.000000
max       147.000000
dtype: float64
--- Test ---
count    8036.000000
mean        3.627178
std         5.460967
min         2.000000
25%         2.000000
50%         2.000000
75%         4.000000
max       200.000000
dtype: float64


## 2. Preparing the dataset

In [8]:
# convert the items ids to strings for tokenization
df_train['ItemId'] = df_train['ItemId'].astype(str)
df_validation['ItemId'] = df_validation['ItemId'].astype(str)
df_test['ItemId'] = df_test['ItemId'].astype(str)

### 2.1 Sequence Creation

The first step involves creating sequences of item interactions for each session. This requires grouping the data by SessionId and ordering it within each group based on the Time column. Each sequence represents a series of item interactions within a session.

In [9]:
# Sort by SessionId and Time to ensure the order is correct
df_train_sorted = df_train.sort_values(by=['SessionId', 'Time'])
df_validation_sorted = df_validation.sort_values(by=['SessionId', 'Time'])
df_test_sorted = df_test.sort_values(by=['SessionId', 'Time'])

# Create sequences of ItemIds grouped by SessionId
train_sequences = df_train_sorted.groupby('SessionId')['ItemId'].apply(list)
validation_sequences = df_validation_sorted.groupby('SessionId')['ItemId'].apply(list)
test_sequences = df_test_sorted.groupby('SessionId')['ItemId'].apply(list)

In [10]:
train_sequences.head(5)

SessionId
2             [216305]
6     [253615, 344723]
8             [164941]
74            [321706]
79            [233200]
Name: ItemId, dtype: object

In [11]:
# drop the sessions with only one item
train_sequences = train_sequences[train_sequences.map(len) > 1]
train_sequences.head(5)

SessionId
6                           [253615, 344723]
133                          [169956, 45520]
135                         [400946, 400946]
211                           [248862, 1152]
226    [397068, 18519, 27248, 10034, 254301]
Name: ItemId, dtype: object

### 2.2 Tokenization (Categorical Features Encoding)

We need to ensure that ItemIds are treated as categorical inputs.

Create a tokenizer that encodes ItemIds as integers. The indices 0 and 1 are reserved: 0 for padding and 1 for out-of-vocabulary items.
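
The reserved-index convention can be illustrated with a small hand-rolled mapping. This is only a sketch of the idea: `build_vocabulary` and `encode` are hypothetical helpers, and they assign indices in order of first appearance, which is not necessarily the order the Keras `Tokenizer` uses.

```python
PAD_INDEX = 0  # reserved for padding
OOV_INDEX = 1  # reserved for out-of-vocabulary items


def build_vocabulary(item_ids):
    """Assign indices 2, 3, ... to item ids in order of first appearance."""
    vocabulary = {}
    for item_id in item_ids:
        if item_id not in vocabulary:
            vocabulary[item_id] = len(vocabulary) + 2
    return vocabulary


def encode(sequence, vocabulary):
    """Map item ids to indices, falling back to OOV_INDEX for unknown ids."""
    return [vocabulary.get(item_id, OOV_INDEX) for item_id in sequence]


vocabulary = build_vocabulary(["253615", "344723", "164941"])
encoded = encode(["344723", "999999", "253615"], vocabulary)
print(encoded)  # [3, 1, 2]
```

The unknown id `"999999"` maps to 1, and no real item ever receives index 0, so padding can be told apart from items later on.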

In [12]:
# get a list of the unique item ids across all datasets
unique_items = pd.concat([df_train, df_validation, df_test]).ItemId.unique()

# use keras to map the item ids to a sequential list of integer values,
# 0 should be for padding and 1 for out of vocabulary items
tokenizer = Tokenizer(num_words=len(unique_items) + 2, oov_token=1)
tokenizer.fit_on_texts(unique_items)

# save the tokenizer
tokenizer_path = f'{model_path}/item_id_tokenizer.json'
with open(tokenizer_path, 'w') as file:
    file.write(tokenizer.to_json())

In [13]:
# tokenize the sequences
train_sequences_tokenized = tokenizer.texts_to_sequences(train_sequences)
validation_sequences_tokenized = tokenizer.texts_to_sequences(validation_sequences)
test_sequences_tokenized = tokenizer.texts_to_sequences(test_sequences)

In [14]:
train_sequences_tokenized[:5]

[[965, 2568],
 [29597, 14106],
 [1455, 1455],
 [1260, 6973],
 [178, 18563, 10273, 14326, 2998]]

### 2.3 Padding

To handle sessions of varying lengths, we'll need to pad the sequences so that they all have the same length, making them suitable for batch processing.
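
As a rough illustration of what padding does, the sketch below uses a hypothetical helper, `pad_post`. It is written to mimic `pad_sequences` with `padding='post'` and the default `truncating='pre'`: short sequences are filled with zeros at the end, and long sequences keep only their most recent items.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Keep the last `maxlen` items, then pad at the end up to `maxlen`."""
    truncated = sequence[-maxlen:]
    return truncated + [pad_value] * (maxlen - len(truncated))


short = pad_post([178, 18563, 10273, 14326], maxlen=10)
long = pad_post(list(range(1, 13)), maxlen=10)

print(short)  # [178, 18563, 10273, 14326, 0, 0, 0, 0, 0, 0]
print(long)   # [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```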

In [15]:
# Determine the maximum sequence length for padding
#max_sequence_length = max(map(len, train_input))
max_sequence_length = 10

In [16]:
# use the last item as the target and the rest as the input
def split_input_target(sequence):
    return sequence[:-1], sequence[-1]

train_sequences_input = list(map(split_input_target, train_sequences_tokenized))
validation_sequences_input = list(map(split_input_target, validation_sequences_tokenized))
test_sequences_input = list(map(split_input_target, test_sequences_tokenized))

In [17]:
train_sequences_input[:5]

[([965], 2568),
 ([29597], 14106),
 ([1455], 1455),
 ([1260], 6973),
 ([178, 18563, 10273, 14326], 2998)]

In [18]:
# separate into input and target arrays
train_input, y_train = map(list, zip(*train_sequences_input))
validation_input, y_validation = map(list, zip(*validation_sequences_input))
test_input, y_test = map(list, zip(*test_sequences_input))

In [19]:
pprint(train_input[:5])
print('-'*10)
pprint(y_train[:5])

[[965], [29597], [1455], [1260], [178, 18563, 10273, 14326]]
----------
[2568, 14106, 1455, 6973, 2998]


In [20]:
# Determine the maximum sequence length for padding
#max_sequence_length = max(map(len, train_input))
max_sequence_length = 10

# pad the sequences
X_train = pad_sequences(train_input, maxlen=max_sequence_length, padding='post')
X_validation = pad_sequences(validation_input, maxlen=max_sequence_length, padding='post')
X_test = pad_sequences(test_input, maxlen=max_sequence_length, padding='post')

In [21]:
X_train[:5]

array([[  965,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [29597,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [ 1455,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [ 1260,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [  178, 18563, 10273, 14326,     0,     0,     0,     0,     0,
            0]], dtype=int32)

### 2.4 Create TensorFlow Datasets

In [22]:
# Map each element of the dataset to the corresponding feature
def map_feature(sequence, target):
    return {'context_item_id': sequence, 'label_item_id': [target]}

# Create a TensorFlow dataset
train_ds = tf.data.Dataset \
    .from_tensor_slices((X_train, y_train)).map(map_feature)
validation_ds = tf.data.Dataset \
    .from_tensor_slices((X_validation, y_validation)).map(map_feature)
test_ds = tf.data.Dataset \
    .from_tensor_slices((X_test, y_test)).map(map_feature)

for x in train_ds.take(1).as_numpy_iterator():
  pprint(x)

{'context_item_id': array([965,   0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=int32),
 'label_item_id': array([2568], dtype=int32)}



Now our datasets contain only a sequence of historical item IDs and the label of the next item ID. Each `context_item_id` feature has shape [10] because the sequences were padded to a length of 10 in the padding step.

We need one more thing before we can start building the model - the vocabulary for our item IDs.

In [23]:
# build a dataset of all token indices known to the tokenizer
# (index 0 is reserved for padding and is not part of index_word)
item_ids = tf.data.Dataset.from_tensor_slices(list(tokenizer.index_word.keys()))

# largest token index; the embedding tables need max_item_id + 1 rows
max_item_id = max(tokenizer.index_word.keys())
#unique_item_ids = max(tokenizer.index_word.keys()) + 1

## 3. Implementing a sequential model

Here we still use the two-tower architecture. Specifically, the query tower uses a Gated Recurrent Unit (GRU) layer to encode the sequence of historical items, while the candidate tower stays a plain embedding lookup for the candidate item.

In [24]:
embedding_dimension = 32

query_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=max_item_id + 1,
        output_dim=embedding_dimension,
        input_length=max_sequence_length), 
    tf.keras.layers.GRU(embedding_dimension),
])

candidate_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=max_item_id + 1,
        output_dim=embedding_dimension,
        input_length=max_sequence_length), 
])

The metrics, task, and full model are defined similarly to the basic retrieval model.
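
To make the reported `top_k_categorical_accuracy` values easier to read, here is a conceptual NumPy sketch of top-K accuracy for a retrieval model. It is not the TFRS implementation, which differs in detail; `top_k_accuracy` and the toy embeddings are assumptions made for illustration.

```python
import numpy as np


def top_k_accuracy(query_embeddings, candidate_embeddings, true_indices, k):
    """Fraction of queries whose true candidate is among the k best scores."""
    scores = query_embeddings @ candidate_embeddings.T   # (queries, candidates)
    top_k = np.argsort(-scores, axis=1)[:, :k]           # best k per query
    hits = (top_k == np.asarray(true_indices)[:, None]).any(axis=1)
    return hits.mean()


# Toy data: 4 one-hot candidates, so each score row equals the query itself.
candidates = np.eye(4)
queries = np.array([
    [0.9, 0.1, 0.0, 0.0],   # true candidate 0 is ranked first
    [0.0, 0.2, 0.7, 0.1],   # true candidate 1 is ranked second
    [0.1, 0.6, 0.3, 0.0],   # true candidate 2 is ranked second
])
true_indices = [0, 1, 2]

top_1 = top_k_accuracy(queries, candidates, true_indices, k=1)
top_2 = top_k_accuracy(queries, candidates, true_indices, k=2)
print(top_1, top_2)
```

In this toy case only the first query is a top-1 hit, while all three are top-2 hits, so accuracy rises from one third to 1.0 as K grows.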

In [25]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=item_ids.batch(128).map(candidate_model)
)

task = tfrs.tasks.Retrieval(
  metrics=metrics
)

In [26]:
class Model(tfrs.Model):

    def __init__(self, query_model, candidate_model):
        super().__init__()
        self._query_model = query_model
        self._candidate_model = candidate_model

        self._task = task

    def compute_loss(self, features, training=False):
        item_history = features["context_item_id"]
        item_next_label = features["label_item_id"]

        query_embedding = self._query_model(item_history)       
        candidate_embedding = self._candidate_model(item_next_label)

        return self._task(query_embedding, candidate_embedding, compute_metrics=not training)

## 4. Fitting and evaluating

In [27]:
model = Model(query_model, candidate_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [28]:
cached_train = train_ds.shuffle(10_000).batch(5_000).cache()
cached_test = test_ds.batch(2560).cache()

In [29]:
history = model.fit(cached_train, epochs=100)

Epoch 1/100


 1/10 [==>...........................] - ETA: 9s - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 42585.9453 - regularization_loss: 0.0000e+00 - total_loss: 42585.9453



Epoch 2/100
[...]
Epoch 78/100
[output truncated]

In [30]:
query_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 10, 32)            1140320   
                                                                 
 gru (GRU)                   (None, 32)                6336      
                                                                 
Total params: 1146656 (4.37 MB)
Trainable params: 1146656 (4.37 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [31]:
candidate_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 10, 32)            1140320   
                                                                 
Total params: 1140320 (4.35 MB)
Trainable params: 1140320 (4.35 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [32]:
model.evaluate(cached_test, return_dict=True)

ValueError: in user code:

    File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/keras/src/engine/training.py", line 2066, in test_function  *
        return step_function(self, iterator)
    File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/keras/src/engine/training.py", line 2049, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/keras/src/engine/training.py", line 2037, in run_step  **
        outputs = model.test_step(data)
    File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/tensorflow_recommenders/models/base.py", line 88, in test_step
        loss = self.compute_loss(inputs, training=False)
    File "/tmp/ipykernel_456904/4164478603.py", line 17, in compute_loss
        return self._task(query_embedding, candidate_embedding, compute_metrics=not training)
    File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/tmp/__autograph_generated_filek0lisqwi.py", line 159, in tf__call
        ag__.if_stmt(ag__.ld(compute_metrics), if_body_5, else_body_5, get_state_7, set_state_7, (), 0)
    File "/tmp/__autograph_generated_filek0lisqwi.py", line 155, in if_body_5
        ag__.for_stmt(ag__.ld(self)._factorized_metrics, None, loop_body_1, get_state_6, set_state_6, (), {'iterate_names': 'metric'})
    File "/tmp/__autograph_generated_filek0lisqwi.py", line 154, in loop_body_1
        ag__.converted_call(ag__.ld(update_ops).append, (ag__.converted_call(ag__.ld(metric).update_state, (ag__.ld(query_embeddings), ag__.ld(candidate_embeddings)[:ag__.converted_call(ag__.ld(tf).shape, (ag__.ld(query_embeddings),), None, fscope)[0]]), dict(true_candidate_ids=ag__.ld(candidate_ids)), fscope),), None, fscope)
    File "/tmp/__autograph_generated_filessrsh__e.py", line 128, in tf__update_state
        ag__.if_stmt(ag__.ld(true_candidate_ids) is not None, if_body_2, else_body_2, get_state_4, set_state_4, ('top_k_predictions', 'true_candidate_ids'), 0)
    File "/tmp/__autograph_generated_filessrsh__e.py", line 101, in else_body_2
        y_pred = ag__.converted_call(ag__.ld(tf).concat, ([ag__.ld(positive_scores), ag__.ld(top_k_predictions)],), dict(axis=1), fscope)

    ValueError: Exception encountered when calling layer 'retrieval' (type Retrieval).
    
    in user code:
    
        File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/tensorflow_recommenders/tasks/retrieval.py", line 197, in call  *
            update_ops.append(
        File "/home/johnny/swan/miniconda3/envs/dsm150-2024/lib/python3.10/site-packages/tensorflow_recommenders/metrics/factorized_top_k.py", line 183, in update_state  *
            y_pred = tf.concat([positive_scores, top_k_predictions], axis=1)
    
        ValueError: Shape must be rank 3 but is rank 2 for '{{node retrieval/concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](retrieval/Sum, retrieval/streaming/ReduceDataset, retrieval/concat/axis)' with input shapes: [?,1,32], [?,?], [].
    
    
    Call arguments received by layer 'retrieval' (type Retrieval):
      • query_embeddings=tf.Tensor(shape=(None, 32), dtype=float32)
      • candidate_embeddings=tf.Tensor(shape=(None, 1, 32), dtype=float32)
      • sample_weight=None
      • candidate_sampling_probability=None
      • candidate_ids=None
      • compute_metrics=True
      • compute_batch_metrics=True
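
The call arguments in the traceback show query embeddings of shape `(None, 32)` but candidate embeddings of shape `(None, 1, 32)`. A likely cause is that `map_feature` wraps the label as `[target]`, so each batch of labels has shape `(batch, 1)` and the embedding lookup keeps that extra axis. The NumPy sketch below reproduces the shapes with a made-up embedding table. It is only an illustration of the mismatch, not a verified fix for this notebook.

```python
import numpy as np

# Hypothetical embedding table: 100 items, 32 dimensions.
embedding_table = np.random.default_rng(0).normal(size=(100, 32))

# Labels as produced by wrapping each target in a list and then batching.
labels = np.array([[5], [17], [42]])                     # shape (3, 1)

# The lookup keeps the extra axis, giving rank-3 candidate embeddings.
candidate_embedding = embedding_table[labels]            # shape (3, 1, 32)

# Dropping the trailing axis first yields the rank-2 shape the query has.
squeezed_labels = np.squeeze(labels, axis=-1)            # shape (3,)
candidate_embedding_fixed = embedding_table[squeezed_labels]  # shape (3, 32)

print(candidate_embedding.shape, candidate_embedding_fixed.shape)
```

In the notebook, the equivalent change would presumably be either to return `target` instead of `[target]` in `map_feature`, or to apply `tf.squeeze(item_next_label, axis=-1)` before calling the candidate model in `compute_loss`. Since training ran with the rank-3 candidates, it may be worth retraining after the change rather than only re-running the evaluation.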


In [None]:
test_ds.take(1).as_numpy_iterator().next()