# 01-04 : Retrieval using a sequential model

Recommender systems are often composed of two stages:

1. The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.

2. The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.

This notebook is going to focus on the first stage, retrieval.

Retrieval models are often composed of two sub-models:

1. A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features.

2. A candidate model computing the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then multiplied together (typically as a dot product of the two vectors) to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
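
As a minimal sketch of that scoring step, assume the two sub-models have already produced their embeddings and that the score is a plain dot product. The vectors and item names below are made up for illustration and do not come from the dataset:

```python
# Hypothetical 3-dimensional embeddings; real models learn these vectors.
query_embedding = [0.5, -1.0, 2.0]
candidate_embeddings = {
    "item_a": [1.0, 0.0, 1.0],
    "item_b": [-1.0, 2.0, 0.0],
    "item_c": [0.0, -1.0, 1.0],
}

def affinity(query, candidate):
    """Dot product of two equally sized vectors."""
    return sum(q * c for q, c in zip(query, candidate))

scores = {name: affinity(query_embedding, vec)
          for name, vec in candidate_embeddings.items()}

# Rank candidates from best to worst match.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

Retrieval then amounts to keeping the top-scoring candidates from this ranking.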

## References

- [Recommending movies: retrieval using a sequential model](https://www.tensorflow.org/recommenders/examples/sequential_retrieval)
- [Item-to-item recommendation and sequential recommendation](https://www.youtube.com/watch?v=ZBaKzw938oM)

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

import tensorflow as tf

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer



## 1. The dataset

We will use the RetailRocket source dataset as prepared for the GRU4Rec paper:
https://github.com/JohnnyFoulds/GRU4Rec/blob/master/notebooks/01_%20preprocess/01-02_retailrocket.ipynb

The dataset is already split into training, validation, and test sets, stored as tab-separated files. The columns are:

- `SessionId` - the id of the session. A session contains one or more items.
- `ItemId` - the id of the item.
- `Time` - the event time.

In [2]:
data_path = '../../data/RetailRocket'
model_path = '../../models/RetailRocket'

# file paths for the data files
train_path = f'{data_path}/retailrocket_processed_view_train_tr.tsv'
validation_path = f'{data_path}/retailrocket_processed_view_train_valid.tsv'
test_path = f'{data_path}/retailrocket_processed_view_test.tsv'

In [3]:
# load the datasets
# note: sample() draws 30% of the training rows (events), not whole sessions
df_train = pd.read_csv(train_path, sep='\t').sample(frac=0.3, random_state=42)
df_validation = pd.read_csv(validation_path, sep='\t')
df_test = pd.read_csv(test_path, sep='\t')

In [4]:
# show the shape of the datasets
print('Train      :', df_train.shape)
print('Validation :', df_validation.shape)
print('Test       :', df_test.shape)

Train      : (215097, 3)
Validation : (33812, 3)
Test       : (29148, 3)


In [5]:
# head of the training set
display(df_train.head())

| | SessionId | ItemId | Time |
|---|---|---|---|
| 150593 | 363804 | 451942 | 1433134557529 |
| 700326 | 1684469 | 441756 | 1440134153810 |
| 673447 | 1615632 | 357925 | 1434672250853 |
| 48855 | 117260 | 2129 | 1435772219980 |
| 515183 | 1231963 | 7804 | 1432004462904 |


In [6]:
# unique sessions in each dataset
print('Train      :', df_train.SessionId.nunique())
print('Validation :', df_validation.SessionId.nunique())
print('Test       :', df_test.SessionId.nunique())

Train      : 119950
Validation : 9408
Test       : 8036


In [7]:
# distribution of session lengths in each dataset
print('--- Train ---')
print(df_train.groupby('SessionId').size().describe())
print('--- Validation ---')
print(df_validation.groupby('SessionId').size().describe())
print('--- Test ---')
print(df_test.groupby('SessionId').size().describe())

--- Train ---
count    119950.000000
mean          1.793222
std           2.249649
min           1.000000
25%           1.000000
50%           1.000000
75%           2.000000
max          97.000000
dtype: float64
--- Validation ---
count    9408.000000
mean        3.593963
std         5.100989
min         2.000000
25%         2.000000
50%         2.000000
75%         4.000000
max       147.000000
dtype: float64
--- Test ---
count    8036.000000
mean        3.627178
std         5.460967
min         2.000000
25%         2.000000
50%         2.000000
75%         4.000000
max       200.000000
dtype: float64


## 2. Preparing the dataset

In [8]:
# convert the item ids to strings for tokenization
df_train['ItemId'] = df_train['ItemId'].astype(str)
df_validation['ItemId'] = df_validation['ItemId'].astype(str)
df_test['ItemId'] = df_test['ItemId'].astype(str)

### 2.1 Sequence Creation

The first step involves creating sequences of item interactions for each session. This requires grouping the data by SessionId and ordering it within each group based on the Time column. Each sequence represents a series of item interactions within a session.
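
To make the grouping and ordering concrete, here is a small pure-Python sketch on made-up events. The tuples and the helper name `build_sequences` are illustrative assumptions, not part of the notebook, which uses pandas for this step:

```python
from collections import defaultdict

# Made-up (SessionId, ItemId, Time) events, deliberately out of order.
events = [
    (2, "b", 30),
    (1, "x", 10),
    (2, "a", 20),
    (1, "y", 15),
    (2, "c", 40),
]

def build_sequences(rows):
    """Group events by session and order each session's items by time."""
    sessions = defaultdict(list)
    for session_id, item_id, _time in sorted(rows, key=lambda r: (r[0], r[2])):
        sessions[session_id].append(item_id)
    return dict(sessions)

sequences = build_sequences(events)
print(sequences)
```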

In [9]:
# Sort by SessionId and Time to ensure the order is correct
df_train_sorted = df_train.sort_values(by=['SessionId', 'Time'])
df_validation_sorted = df_validation.sort_values(by=['SessionId', 'Time'])
df_test_sorted = df_test.sort_values(by=['SessionId', 'Time'])

# Create sequences of ItemIds grouped by SessionId
train_sequences = df_train_sorted.groupby('SessionId')['ItemId'].apply(list)
validation_sequences = df_validation_sorted.groupby('SessionId')['ItemId'].apply(list)
test_sequences = df_test_sorted.groupby('SessionId')['ItemId'].apply(list)

In [10]:
train_sequences.head(5)

SessionId
2             [216305]
6     [253615, 344723]
8             [164941]
74            [321706]
79            [233200]
Name: ItemId, dtype: object

In [11]:
# drop the sessions with only one item
train_sequences = train_sequences[train_sequences.map(len) > 1]
train_sequences.head(5)

SessionId
6                           [253615, 344723]
133                          [169956, 45520]
135                         [400946, 400946]
211                           [248862, 1152]
226    [397068, 18519, 27248, 10034, 254301]
Name: ItemId, dtype: object

### 2.2 Tokenization (Categorical Features Encoding)

We need to ensure that ItemIds are treated as categorical inputs.

We create a tokenizer that encodes ItemIds as integers. Two values are reserved: 0 for padding and 1 for out-of-vocabulary items.
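
The reserved-id scheme can be sketched without Keras. This is an illustrative stand-in, not the Keras `Tokenizer` itself: the helpers `build_vocabulary` and `encode` are hypothetical, and ids are assigned in first-seen order here, whereas Keras orders its vocabulary by frequency:

```python
PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for items never seen when building the vocabulary

def build_vocabulary(item_ids):
    """Assign ids 2, 3, ... to items in first-seen order."""
    vocabulary = {}
    for item_id in item_ids:
        if item_id not in vocabulary:
            vocabulary[item_id] = len(vocabulary) + 2
    return vocabulary

def encode(sequence, vocabulary):
    """Map item ids to integers, falling back to OOV_ID for unknown items."""
    return [vocabulary.get(item_id, OOV_ID) for item_id in sequence]

vocabulary = build_vocabulary(["451942", "441756", "357925", "441756"])
encoded = encode(["357925", "999999", "451942"], vocabulary)
print(vocabulary)
print(encoded)
```

The unknown item `"999999"` maps to 1, and no real item ever receives 0, so padding stays distinguishable from data.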

In [12]:
# get a list of the unique item ids across all datasets
unique_items = pd.concat([df_train, df_validation, df_test]).ItemId.unique()

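# use keras to map the item ids to a sequential list of integer values,
# 0 is reserved for padding and 1 for out of vocabulary items;
# a string OOV token (rather than the integer 1) still receives index 1 and
# cannot collide with an item id when the tokenizer is saved to JSON
tokenizer = Tokenizer(num_words=len(unique_items) + 2, oov_token='<OOV>')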
tokenizer.fit_on_texts(unique_items)

# save the tokenizer
tokenizer_path = f'{model_path}/item_id_tokenizer.json'
with open(tokenizer_path, 'w') as file:
    file.write(tokenizer.to_json())

In [13]:
# tokenize the sequences
train_sequences_tokenized = tokenizer.texts_to_sequences(train_sequences)
validation_sequences_tokenized = tokenizer.texts_to_sequences(validation_sequences)
test_sequences_tokenized = tokenizer.texts_to_sequences(test_sequences)

In [14]:
train_sequences_tokenized[:5]

[[965, 2568],
 [29597, 14106],
 [1455, 1455],
 [1260, 6973],
 [178, 18563, 10273, 14326, 2998]]

### 2.3 Padding

To handle sessions of varying lengths, we'll need to pad the sequences so that they all have the same length, making them suitable for batch processing.
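
A minimal pure-Python sketch of this step is shown below. The helper `pad_post` is hypothetical and is meant to mirror what `pad_sequences` does with `padding='post'` and its default `truncating='pre'`: zeros are appended on the right, and sequences longer than `maxlen` keep only their most recent items.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Right-pad each sequence to maxlen; keep only the last maxlen items."""
    padded = []
    for sequence in sequences:
        trimmed = sequence[-maxlen:]  # drop the oldest items if too long
        padded.append(trimmed + [pad_value] * (maxlen - len(trimmed)))
    return padded

batch = pad_post([[965], [178, 18563, 10273, 14326], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(batch)
```

Every row now has the same length, so the batch can be stacked into a single 2-D array.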

In [15]:
# fixed context length used for padding and truncation
max_sequence_length = 10

In [16]:
# use the last item as the target and the rest as the input
def split_input_target(sequence):
    return sequence[:-1], sequence[-1]

train_sequences_input = list(map(split_input_target, train_sequences_tokenized))
validation_sequences_input = list(map(split_input_target, validation_sequences_tokenized))
test_sequences_input = list(map(split_input_target, test_sequences_tokenized))

In [17]:
train_sequences_input[:5]

[([965], 2568),
 ([29597], 14106),
 ([1455], 1455),
 ([1260], 6973),
 ([178, 18563, 10273, 14326], 2998)]

In [18]:
def map_features(sequence):
    # sequence is an (input_items, target_item) tuple
    return {
        'context_item_id': sequence[0],
        'label_item_id': [sequence[1]]
    }

list(map(map_features, train_sequences_input[:5]))

[{'context_item_id': [965], 'label_item_id': [2568]},
 {'context_item_id': [29597], 'label_item_id': [14106]},
 {'context_item_id': [1455], 'label_item_id': [1455]},
 {'context_item_id': [1260], 'label_item_id': [6973]},
 {'context_item_id': [178, 18563, 10273, 14326], 'label_item_id': [2998]}]

In [19]:
# separate into input and target arrays
train_input, y_train = map(list, zip(*train_sequences_input))
validation_input, y_validation = map(list, zip(*validation_sequences_input))
test_input, y_test = map(list, zip(*test_sequences_input))

In [20]:
pprint(train_input[:5])
print('-'*10)
pprint(y_train[:5])

[[965], [29597], [1455], [1260], [178, 18563, 10273, 14326]]
----------
[2568, 14106, 1455, 6973, 2998]


In [21]:
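# Fixed maximum sequence length; uncomment to use the longest training input instead
#max_sequence_length = max(map(len, train_input))
max_sequence_length = 10

# pad the sequences with trailing zeros; longer inputs keep their last
# max_sequence_length items (pad_sequences truncates from the front by default)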
X_train = pad_sequences(train_input, maxlen=max_sequence_length, padding='post')
X_validation = pad_sequences(validation_input, maxlen=max_sequence_length, padding='post')
X_test = pad_sequences(test_input, maxlen=max_sequence_length, padding='post')

In [22]:
X_train[:5]

array([[  965,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [29597,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [ 1455,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [ 1260,     0,     0,     0,     0,     0,     0,     0,     0,
            0],
       [  178, 18563, 10273, 14326,     0,     0,     0,     0,     0,
            0]], dtype=int32)

### 2.4 Create TensorFlow Datasets

In [23]:
# Map each element of the dataset to the corresponding feature
def map_feature(sequence, target):
    return {'context_item_id': sequence, 'label_item_id': target}

# Create a TensorFlow dataset
train_ds = tf.data.Dataset \
    .from_tensor_slices((X_train, y_train)).map(map_feature)
validation_ds = tf.data.Dataset \
    .from_tensor_slices((X_validation, y_validation)).map(map_feature)
test_ds = tf.data.Dataset \
    .from_tensor_slices((X_test, y_test)).map(map_feature)

for x in train_ds.take(1).as_numpy_iterator():
  pprint(x)

{'context_item_id': array([965,   0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=int32),
 'label_item_id': 2568}

