In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_models_ecommerce-session-based-next-item-prediction-for-fashion/nvidia_logo.png" style="width: 90px; float: right;">

# Session-Based Next Item Prediction for Fashion E-Commerce

This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container. 

## Overview

The NVIDIA Merlin team participated in the [RecSys 2022 challenge](http://www.recsyschallenge.com/2022/index.html) and secured 3rd position. This notebook presents several of the techniques used in that solution.

### Learning Objective

In this notebook, we apply important concepts that improve recommender systems and that we leveraged in our RecSys solution:
- MultiClass next item prediction head with Merlin Models
- Sequential input features representing user sessions
- Label Smoothing 
- Temperature Scaling
- Weight Tying
- Learning Rate Scheduler

### Brief Description of the Concepts

##### Label smoothing
In recommender systems, datasets are often noisy: a user cannot view every item before making a decision, so the observed choice is not necessarily the best one. Noisy examples can result in large gradients and confuse the model. Label smoothing addresses the problem of noisy examples by smoothing the target probabilities, which discourages overconfident predictions.

$$
\begin{array}{l}
y_{l} = (1 - \alpha) \cdot y_{o} + \frac{\alpha}{L} \\
\alpha: \text{label smoothing hyper-parameter } (0 \leq \alpha \leq 1) \\
L: \text{total number of label classes} \\
y_{o}: \text{one-hot encoded label vector}
\end{array}
$$

When α is 0, we have the original one-hot encoded labels, and as α increases, we move towards smoothed labels. Read [this](https://arxiv.org/abs/1906.02629) paper to learn more about it.
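
The formula above can be checked on a toy label vector. The following is a minimal pure-Python sketch, not the implementation used by Keras or Merlin Models; the function name `smooth_labels` and the 4-class example are assumptions made for illustration.

```python
def smooth_labels(one_hot, alpha):
    """Apply label smoothing: y_l = (1 - alpha) * y_o + alpha / L."""
    num_classes = len(one_hot)  # L in the formula
    return [(1.0 - alpha) * y + alpha / num_classes for y in one_hot]


# Toy example with L = 4 classes; the true class is index 1.
one_hot = [0.0, 1.0, 0.0, 0.0]
smoothed = smooth_labels(one_hot, alpha=0.2)

# The true class keeps roughly 0.85 of the probability mass,
# every other class receives roughly 0.05, and the total stays 1.
print(smoothed)
```

With `alpha=0.0` the function returns the original one-hot vector, which matches the statement above.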


##### Temperature Scaling
Similar to Label Smoothing, Temperature Scaling is used to reduce the overconfidence of a model. We divide the logits (the inputs to the softmax function) by a scalar parameter $T$. For more information on Temperature Scaling, read [this](https://arxiv.org/pdf/1706.04599.pdf) paper.
$$ \text{softmax}(z)_{i} = \frac{e^{z_{i} / T}}{\sum_{j} e^{z_{j} / T}} $$
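
To see the effect of the temperature, here is a small pure-Python sketch of a softmax with temperature. The function name `softmax_with_temperature` and the example logits are assumptions for illustration; in the notebook the scaling is configured through Merlin Models rather than written by hand.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits that are first divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
probs_t1 = softmax_with_temperature(logits, temperature=1.0)
probs_t2 = softmax_with_temperature(logits, temperature=2.0)

# A temperature above 1 flattens the distribution:
# the largest probability is lower for T=2 than for T=1.
print(probs_t1)
print(probs_t2)
```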


##### Weight Tying
Weight Tying can be applied to multi-class classification problems in which we predict items and also use previously viewed items as input. The output of the final layer (without activation function) is multiplied with the same item embedding table that represents the input items, resulting in a vector with one logit per item id. The advantage is that the gradient path to the item embeddings is short. For more information, read [this](https://arxiv.org/pdf/1608.05859v3.pdf) paper.
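
The multiplication with the embedding table can be sketched in a few lines of plain Python. This is an illustration only: the function `tied_logits`, the 3-item table and the 2-dimensional embeddings are hypothetical, and the real model performs this step inside Merlin Models.

```python
def tied_logits(hidden, item_embeddings):
    """Compute one logit per item as the dot product of the model output
    with each row of the shared item embedding table."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in item_embeddings]


# Hypothetical embedding table: 3 items, embedding dimension 2.
item_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Output of the last layer; its size must equal the embedding dimension.
hidden = [1.0, 2.0]

logits = tied_logits(hidden, item_embeddings)
print(logits)  # [1.0, 2.0, 3.0]
```

Because the same table is used for input lookup and for producing the logits, no separate output weight matrix of size (number of items x hidden dimension) is needed.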

## Downloading and preparing the dataset

We will import the required libraries.

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import tensorflow as tf
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
import glob

import nvtabular as nvt
from merlin.io import Dataset
from merlin.schema import Schema, Tags
from nvtabular.ops import (
    AddMetadata,
)

from tensorflow.keras import regularizers

import merlin.models.tf as mm
from merlin.models.tf import InputBlock
from merlin.models.tf.models.base import Model
from merlin.models.tf.transforms.bias import LogitsTemperatureScaler
from merlin.models.tf.prediction_tasks.next_item import ItemsPredictionWeightTying

from merlin.core.dispatch import get_lib

2022-11-15 23:36:41.283384: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-15 23:36:42.472778: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2022-11-15 23:36:42.472873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16254 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:15:00.0, compute capability: 7.0


###  Dressipi
[Dressipi](http://www.recsyschallenge.com/2022/dataset.html) hosted the [Recsys2022 challenge](http://www.recsyschallenge.com/2022/index.html) and provided an anonymized dataset. It contains 1.1 M online retail sessions that resulted in a purchase. It provides details about items that were viewed in a session, the item purchased at the end of the session and numerous features of those items. The item features are categorical IDs and are not interpretable.

The task of this competition was to predict, given the sequence of items viewed in a session, which item will be purchased at the end of that session.

<img src="http://www.recsyschallenge.com/2022/images/session_purchase_data.jpeg" alt="dressipi_dataset" style="width: 400px; float: center;">  


### Dataset

We provide a function `get_dressipi2022`, which preprocesses the dataset. Currently, the dataset cannot be downloaded automatically, so it needs to be downloaded manually. To use this function, prepare the data by following these 3 steps:
1. Sign up and download the data from [dressipi-recsys2022.com](https://www.dressipi-recsys2022.com/).
2. Unzip the raw data to a directory.
3. Set `DATA_FOLDER` to that directory.

If you do not want to use this dataset to run our examples, you can also opt for synthetic data. Synthetic data can be generated by running:

```python
    from merlin.datasets.synthetic import generate_data
    train, valid = generate_data("dressipi2022-preprocessed", num_rows=10000, set_sizes=(0.8, 0.2))
```

In [2]:
from merlin.datasets.ecommerce import get_dressipi2022

DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '/workspace/data/dressipi_recsys2022'
)

train, valid = get_dressipi2022(DATA_FOLDER)

The dataset contains:
- `session_id`, id of a session, in which a user viewed and purchased an item. 
- `item_id` which was viewed at a given `timestamp` in a session
- `purchase_id` which is the id of item bought at the end of the session 

In addition to `timestamp`, we have `day` and `date` features for representing the chronological order in which items were viewed.

The items in the Dressipi dataset have many features, out of which we took the 22 most important ones, namely
`f_3, f_4, f_5, f_7, f_17, f_24, f_30, f_45, f_46, f_47, f_50, f_53, f_55, f_56, f_58, f_61, f_63, f_65, f_68, f_69, f_72, f_73`.

In [3]:
train.to_ddf().head()

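|  | session_id | item_id | date | f_3 | f_5 | f_7 | f_17 | f_24 | f_45 | f_47 | ... | f_61 | f_63 | f_65 | f_68 | f_69 | f_72 | f_73 | timestamp | day | purchase_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21970 | 26077 | 2020-04-27 06:39:19.329 | -1 | -1 | 619 | 378 | -1 | -1 | 96 | ... | 706 | 861 | 521 | 662 | 592 | 263 | 544 | 1587969559329 | 117 | 24868 |
| 1 | 21970 | 6140 | 2020-04-27 06:39:46.460 | -1 | -1 | 619 | 378 | -1 | -1 | 96 | ... | 706 | 861 | 521 | 383 | 592 | 655 | 544 | 1587969586460 | 117 | 24868 |
| 2 | 21970 | 19770 | 2020-04-27 06:40:13.598 | -1 | -1 | 536 | 378 | -1 | -1 | 96 | ... | 706 | 861 | 521 | 97 | 592 | 7 | 544 | 1587969613598 | 117 | 24868 |
| 3 | 21970 | 6140 | 2020-04-27 06:41:03.256 | -1 | -1 | 619 | 378 | -1 | -1 | 96 | ... | 706 | 861 | 521 | 383 | 592 | 655 | 544 | 1587969663256 | 117 | 24868 |
| 4 | 21970 | 22885 | 2020-04-27 06:41:40.288 | 793 | 605 | 798 | 378 | -1 | 559 | 549 | ... | 706 | 861 | 521 | 745 | 592 | 75 | 544 | 1587969700288 | 117 | 24868 |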


## Feature Engineering with NVTabular

We use NVTabular for Feature Engineering. If you want to learn more about NVTabular, we recommend the [examples in the NVTabular GitHub Repository](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples).

### Categorify

We want to use embedding layers for our categorical features. First, we need to Categorify them, so that they are encoded as contiguous integers.

The features `item_id` and `purchase_id` belong to the same category. If `item_id` is 8432 and `purchase_id` is 8432, they are the same item. When we apply Categorify, we want to keep this connection. We can achieve this by encoding them jointly, which is done by passing them as a nested list: `[['item_id', 'purchase_id']]`.
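
The idea of a joint encoding can be illustrated with a shared vocabulary in plain Python. This is a sketch of the concept only and not how NVTabular's `Categorify` assigns indices internally; the helper names `fit_joint_vocab` and `encode`, the sorted index order and the use of `0` for unseen ids are assumptions made for this example.

```python
def fit_joint_vocab(*columns, start_index=1):
    """Build one id -> index mapping from the union of several columns."""
    unique_values = sorted({value for column in columns for value in column})
    return {value: idx for idx, value in enumerate(unique_values, start=start_index)}


def encode(column, vocab, unknown_index=0):
    """Map raw ids to contiguous integers; unseen ids get `unknown_index`."""
    return [vocab.get(value, unknown_index) for value in column]


item_id = [8432, 17, 8432]
purchase_id = [8432, 99]

vocab = fit_joint_vocab(item_id, purchase_id)
encoded_items = encode(item_id, vocab)
encoded_purchases = encode(purchase_id, vocab)

# Raw id 8432 receives the same index in both columns.
print(encoded_items, encoded_purchases)  # [3, 1, 3] [3, 2]
```

If the two columns were encoded independently, the same raw id could end up with two different indices, and the link between viewed and purchased items would be lost.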

We will use only 2 of the categorical item features in this example.

In [4]:
%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = [['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify(start_index=1, dtype='int32')

features = ['session_id', 'timestamp', 'date'] + cat_features

CPU times: user 86 µs, sys: 31 µs, total: 117 µs
Wall time: 121 µs


### Group the Data by Sessions

Currently, every row in the dataset is a viewed item. Our goal is to predict the item purchased after the last view in a session. Therefore, we group the dataset by `session_id` to have one row for each prediction.

Each row will have a sequence of encoded item ids with which a user interacted. The last item of a session has special importance, as it is closest to the user's intention. We will keep the last viewed item as a separate feature.

The NVTabular `Groupby` op enables the transformation.

First, we define how the different columns should be aggregated:
- Keep the first occurrence of `date`
- Keep the last item and concatenate all items to a list (results are 2 features)
- Keep the first occurrence of `purchase_id` (purchase_id should be the same for all rows of one session)

In [6]:
to_aggregate = {}
to_aggregate['date'] = ["first"]
to_aggregate['item_id'] = ["last", "list"]
to_aggregate['purchase_id'] = ["first"]   

In addition, we concatenate the values of each item feature into a list.

In [7]:
for name in item_features_names: 
    to_aggregate[name] = ['list']

In [8]:
to_aggregate

{'date': ['first'],
 'item_id': ['last', 'list'],
 'purchase_id': ['first'],
 'f_47': ['list'],
 'f_68': ['list']}

We want to sort the dataframe by `date` and group the rows by `session_id`.

In [9]:
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session_id"], 
    sort_cols=["date"],
    aggs= to_aggregate,
    name_sep="_")
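
To make the effect of this aggregation concrete, here is a plain-Python sketch that mimics it on three toy rows. It is not the NVTabular implementation; the function `group_sessions` and the sample rows are hypothetical, and only a subset of the columns is shown.

```python
from collections import defaultdict


def group_sessions(rows):
    """Sort rows by date, group them by session_id and aggregate each group."""
    sessions = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["date"]):
        sessions[row["session_id"]].append(row)

    result = []
    for session_id, items in sessions.items():
        result.append({
            "session_id": session_id,
            "date_first": items[0]["date"],                # "first" aggregation
            "item_id_last": items[-1]["item_id"],          # "last" aggregation
            "item_id_list": [r["item_id"] for r in items], # "list" aggregation
            "purchase_id_first": items[0]["purchase_id"],
        })
    return result


# Toy rows, deliberately out of chronological order.
rows = [
    {"session_id": 1, "item_id": 20, "date": "2020-04-27 06:39:46", "purchase_id": 7},
    {"session_id": 1, "item_id": 10, "date": "2020-04-27 06:39:19", "purchase_id": 7},
    {"session_id": 2, "item_id": 30, "date": "2020-04-28 10:00:00", "purchase_id": 9},
]

grouped = group_sessions(rows)
for session in grouped:
    print(session)
```

The output column names follow the pattern `<column>_<aggregation>`, which corresponds to `name_sep="_"` in the cell above.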

Merlin Models can infer the neural network architecture from the dataset schema. We will Tag the columns accordingly based on the type of each column. If you want to learn more, we recommend our [Dataset Schema Example](https://github.com/NVIDIA-Merlin/models/blob/main/examples/02-Merlin-Models-and-NVTabular-integration.ipynb).

In [10]:
item_last = (
    groupby_features['item_id_last'] >> 
    AddMetadata(tags=[Tags.ITEM, Tags.ITEM_ID])
)
item_list = (
    groupby_features['item_id_list'] >> 
    AddMetadata(
        tags=[Tags.ITEM, Tags.ITEM_ID, Tags.LIST, Tags.SEQUENCE]
    )
)
feature_list = (
    groupby_features[[name+'_list' for name in item_features_names]] >> 
    AddMetadata(
        tags=[Tags.SEQUENCE, Tags.ITEM, Tags.LIST]
    )
)
target_feature = (
    groupby_features['purchase_id_first'] >> 
    AddMetadata(tags=[Tags.TARGET])
)
other_features = groupby_features['session_id', 'date_first']

groupby_features = item_last + item_list + feature_list + other_features +  groupby_features['purchase_id_first']


### Truncate for a Maximum Sequence Length

We want to truncate the sequential features to a maximum length. We first define which columns are sequential features and which are non-sequential. We then truncate each sequence by keeping its last 3 elements.

In [11]:
list_features = [name+'_list' for name in item_features_names] + ['item_id_list']
nonlist_features = ['session_id', 'date_first', 'item_id_last', 'purchase_id_first']

In [12]:
SESSIONS_MAX_LENGTH = 3
truncated_features = groupby_features[list_features] >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH) >> nvt.ops.Rename(postfix = '_seq')

final_features = groupby_features[nonlist_features] + truncated_features
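
Slicing with a negative start index keeps the most recent elements of each sequence. The plain-Python sketch below shows the same idea on toy lists; the helper name `keep_last` is an assumption for illustration, and sequences shorter than the limit are left unchanged.

```python
def keep_last(sequence, max_length):
    """Keep only the last `max_length` elements of a sequence."""
    return sequence[-max_length:]


long_session = [538, 4177, 7920, 11594, 5192]
short_session = [5192]

truncated_long = keep_last(long_session, 3)
truncated_short = keep_last(short_session, 3)

print(truncated_long)   # [7920, 11594, 5192]
print(truncated_short)  # [5192]
```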

We initialize our NVTabular workflow.

In [13]:
workflow = nvt.Workflow(final_features)

We call fit and transform, similar to the scikit-learn API.

Categorify maps item_ids (and purchase_ids) that do not occur in the train dataset to a special category `0` in the validation dataset. This can bias the validation metrics. In our example, almost all item_ids in the validation dataset are also present in the train dataset, so we neglect this effect.

In [14]:
# fit data
workflow.fit(train)

# transform and save data
workflow.transform(train).to_parquet(os.path.join(DATA_FOLDER, "train/"), output_files=2)
workflow.transform(valid).to_parquet(os.path.join(DATA_FOLDER, "valid/"))



### Sort the Training Dataset by Time

The train dataset contains the data from Jan 2020 to April 2021, and the validation dataset covers May 2021. As the data is split by time, we noticed that we achieve higher validation scores when we sort the training data by time and do not apply shuffling.

In [15]:
df = get_lib().read_parquet(
    glob.glob(
        os.path.join(DATA_FOLDER, "train/*.parquet")
    )
)
df = df.sort_values('date_first').reset_index(drop=True)
df.to_parquet(os.path.join(DATA_FOLDER, "train_sorted.parquet"))

Let's review the transformed dataset.

In [16]:
df.head()

|  | session_id | date_first | item_id_last | purchase_id_first | f_47_list_seq | f_68_list_seq | item_id_list_seq |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3747794 | 2020-01-01 00:00:01.359 | 7920 | 13757 | [3, 6, 14] | [2, 10, 8] | [538, 4177, 7920] |
| 1 | 3458777 | 2020-01-01 00:00:21.440 | 11594 | 17900 | [2, 2, 2] | [10, 45, 7] | [14809, 7840, 11594] |
| 2 | 4350716 | 2020-01-01 00:00:48.505 | 5192 | 14217 | [14] | [9] | [5192] |
| 3 | 2579761 | 2020-01-01 00:06:37.801 | 9675 | 12251 | [9, 8, 9] | [15, 4, 4] | [19431, 16369, 9675] |
| 4 | 2048031 | 2020-01-01 00:08:19.297 | 14877 | 12751 | [2, 6, 2] | [6, 2, 10] | [779, 15326, 14877] |


## Training an MLP with sequential input with Merlin Models

We train a Sequential-Multi-Layer Perceptron model, which averages the embeddings of the sequential input features (e.g. `item_id_list_seq`) and concatenates the result with the embeddings of the categorical features (e.g. `item_id_last`). We visualize the architecture in the figure below.

<img src="../images/mlp_ecommerce.png"  width="30%">
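
The averaging and concatenation step can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions: a single hypothetical embedding table with 3 items and dimension 2, shared by the sequence and the last item. In the notebook, Merlin Models builds this from the schema.

```python
def mean_pool(vectors):
    """Average a list of equally sized vectors element-wise."""
    length = len(vectors)
    return [sum(values) / length for values in zip(*vectors)]


# Hypothetical embedding table: item index -> embedding of dimension 2.
embedding_table = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [2.0, 2.0]}

item_id_list_seq = [1, 2, 3]  # items viewed in the session
item_id_last = 3              # last viewed item

sequence_embeddings = [embedding_table[i] for i in item_id_list_seq]
pooled = mean_pool(sequence_embeddings)

# Concatenate the pooled sequence embedding with the last-item embedding.
mlp_input = pooled + embedding_table[item_id_last]

print(pooled)     # [1.0, 1.0]
print(mlp_input)  # [1.0, 1.0, 2.0, 2.0]
```

The resulting vector has a fixed size regardless of the session length, which is what allows a plain MLP to consume variable-length sessions.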

### Dataloader

We initialize the dataloaders to train the neural network models. First, we define the Merlin `Dataset` objects.

In [5]:
train = Dataset(os.path.join(DATA_FOLDER, 'train_sorted.parquet'))
valid = Dataset(os.path.join(DATA_FOLDER, 'valid/*.parquet'))



As we loaded, sorted and saved the train dataset without using NVTabular, the parquet file no longer contains a schema. We can copy the schema from valid to train.

In [6]:
train.schema = valid.schema
schema_model = train.schema.select_by_name(
        ['item_id_list_seq', 'item_id_last','f_47_list_seq', 'f_68_list_seq']
)

#### Hyperparameters

We use the following hyperparameters, which we found during our experiments.

In [7]:
EPOCHS = int(os.environ.get(
    "EPOCHS", 
    '3'
))
BATCH_SIZE = 1024
LEARNING_RATE = 0.01
DROPOUT = 0.2 
LABEL_SMOOTHING = 0.2
TEMPERATURE_SCALING = 2

#### Data loader

The dataloader shuffles the data by default. We initialize the `Loader` for the training dataset with `shuffle=False` to preserve the chronological order.

In [8]:
loader = mm.Loader(train, batch_size=BATCH_SIZE, transform=mm.ToTarget(train.schema, "purchase_id_first", one_hot=True),  shuffle = False)
val_loader = mm.Loader(valid, batch_size=BATCH_SIZE, transform=mm.ToTarget(train.schema, "purchase_id_first", one_hot=True),  shuffle=False)

### Build the Sequential MLP with Merlin Models

Now we create an `InputBlockV2` that embeds the input features and concatenates the resulting embeddings. Note that we define the embedding dimensions manually.

In [9]:
manual_dims = {
    'item_id_list_seq': 256, 
    'item_id_last': 256,
    'f_47_list_seq': 16,
    'f_68_list_seq': 16
}

In [22]:
input_block = mm.InputBlockV2(
        schema_model,
        categorical=mm.Embeddings(schema_model,
                                 dim=manual_dims
                                )
)

Before the loss is calculated, we want to transform the model output:

1. We apply `weight-tying` and multiply the model output with the embedding weights from ITEM_ID. The embedding dimensions and the model output dimensions have to be the same (256 in our example).
2. We transform the ground truth into OneHot representation
3. We apply Temperature Scaling

Now, we will build a model with a 2-layer MLPBlock, `input_block` as the input and `prediction_task` as the task. The output dimension of the MLPBlock must match the embedding dimension of `item_id_list_seq`, since we use weight tying.

In [23]:
mlp_block = mm.MLPBlock(
        [128, 256], 
        no_activation_last_layer=True, 
        dropout=0.2
    )

In [24]:
item_id_name = train.schema.select_by_tag(Tags.ITEM_ID).first.properties['domain']['name']
item_id_name

'item_id_purchase_id'

Next, we define the prediction task. Our objective is to predict which item is purchased at the end of the session. This is a multi-class classification task, and the default_loss in the `CategoricalOutput` class is "categorical_crossentropy". The [CategoricalOutput](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/outputs/classification.py#L112) class applies `weight-tying` when we provide the `EmbeddingTable` related to the target feature through the `to_call` argument. Note that in our example we pass the embedding table for the `item_id_purchase_id` domain name: the `item_id_list_seq` and `item_id_last` input columns were jointly encoded and therefore share the same embedding table.

In [25]:
prediction_task= mm.CategoricalOutput(
        to_call=input_block["categorical"][item_id_name],
        logits_temperature=TEMPERATURE_SCALING,
        target='purchase_id_first',
    )

In [26]:
model_mlp = mm.Model(input_block, mlp_block, prediction_task)

### Fit the Model


We initialize the optimizer with an `ExponentialDecay` learning rate scheduler and compile the model, as with any other TensorFlow Keras model. The competition was evaluated based on MRR@100.

In [27]:
initial_learning_rate = LEARNING_RATE

exp_decay_lr_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100000,
    decay_rate=0.96,
    staircase=True)
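
As a reading aid, the schedule above can be written out by hand. The sketch below follows the formula documented for Keras `ExponentialDecay` (`initial_learning_rate * decay_rate ^ (step / decay_steps)`, with integer division when `staircase=True`); the function name `exponential_decay` is an assumption, and the code is an illustration rather than the Keras implementation.

```python
def exponential_decay(step, initial_learning_rate=0.01, decay_steps=100000,
                      decay_rate=0.96, staircase=True):
    """Learning rate after `step` optimizer steps."""
    if staircase:
        exponent = step // decay_steps  # decay in discrete jumps
    else:
        exponent = step / decay_steps   # decay continuously
    return initial_learning_rate * decay_rate ** exponent


for step in [0, 99999, 100000, 250000]:
    print(step, exponential_decay(step))
```

With `staircase=True` and `decay_steps=100000`, the learning rate stays at its initial value for the first 100,000 optimizer steps and is multiplied by 0.96 at each further multiple of 100,000.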

In [28]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=exp_decay_lr_scheduler,
)

model_mlp.compile(
    optimizer=optimizer,
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
        label_smoothing=LABEL_SMOOTHING
    ),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)
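
For readers unfamiliar with the metric, the following pure-Python sketch shows how MRR@k is commonly defined: the reciprocal of the rank of the true item within the top-k predictions, or 0 if it is not among them, averaged over all sessions. The function `mrr_at_k` and the toy rankings are assumptions for illustration and not the Merlin Models implementation.

```python
def mrr_at_k(ranked_item_lists, targets, k=100):
    """Mean reciprocal rank of the target item within the top-k predictions."""
    total = 0.0
    for ranked_items, target in zip(ranked_item_lists, targets):
        top_k = ranked_items[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)  # ranks start at 1
    return total / len(targets)


# Three toy sessions: predicted items sorted by descending score.
ranked = [[5, 3, 9], [3, 5, 9], [9, 3, 5]]
targets = [5, 5, 7]

# Target ranks: 1st, 2nd, not found -> (1 + 1/2 + 0) / 3
score = mrr_at_k(ranked, targets, k=100)
print(score)  # 0.5
```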

We call `.fit` to train the model.

In [29]:
%%time
history = model_mlp.fit(
    loader,
    validation_data=val_loader,
    epochs=EPOCHS,
)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 4min 46s, sys: 2min 18s, total: 7min 4s
Wall time: 5min 54s


We can evaluate the model.

In [30]:
metrics_mlp = model_mlp.evaluate(val_loader, batch_size=BATCH_SIZE, return_dict=True)
metrics_mlp['mrr_at_100']



0.09053944051265717

## Training a Bi-LSTM with Merlin Models

In this section, we train a Bi-LSTM model, which concatenates the embedding vectors of all sequential features (`item_id_list_seq`, `f_47_list_seq`, `f_68_list_seq`) per step (3 steps in this example). The concatenated vectors are processed by a BiLSTM. The hidden state of the BiLSTM is concatenated with the embedding vectors of the categorical features (`item_id_last`). Then we connect it with a Multi-Layer Perceptron block. We visualize the architecture in the figure below.

<img src="../images/bi-lstm_ecommerce.png"  width="30%">

### Build Bi-LSTM model

Now we will create a Bi-LSTM model using the `tf.keras.layers.Bidirectional` API. We connect the input block for the sequential features with a `BiLSTM` block. First, we create two input blocks, which take the sequential and the categorical features, respectively, and return their embeddings.

#### Hyperparameters
We only update the `LEARNING_RATE`, as in our experiments the Bi-LSTM gave better results with a lower learning rate.


In [31]:
LEARNING_RATE = 0.005

We define the embedding dimensions, manually.

In [32]:
seq_inputs = mm.InputBlockV2(
        schema_model.select_by_name(
            ['item_id_list_seq', 'f_47_list_seq', 'f_68_list_seq']
        ),
        categorical=mm.Embeddings(
            schema_model.select_by_name(
            ['item_id_list_seq', 'f_47_list_seq', 'f_68_list_seq']
        ),
            sequence_combiner=None,
            dim=manual_dims
                                
        )
)

cat_inputs = mm.InputBlockV2(
        schema_model.select_by_name(
            ['item_id_last']
        ),
        categorical=mm.Embeddings(
            schema_model.select_by_name(
            ['item_id_last']
        ),
            dim=manual_dims
                                
        )
)

We connect the sequential input block to the BiLSTM model. We leverage the `tf.keras.layers.Bidirectional` API.

In [33]:
dense_block = seq_inputs.connect(tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64,
        return_sequences=False, 
        dropout=0.05,
        kernel_regularizer=regularizers.l2(1e-4),
    )
))

Now, we combine dense block with input block of categorical features by concatenating them.

In [34]:
concats = mm.ParallelBlock(
    {'dense_block': dense_block, 
     'cat_inputs': cat_inputs},
    aggregation='concat'
)

Next, we build a 2-layer MLPBlock which is used as a projection layer.

In [35]:
mlp_block = mm.MLPBlock(
                [128,256],
                activation='relu',
                no_activation_last_layer=True,
                dropout=DROPOUT,
            )

Now, we define the prediction task. As we saw above in our MLP implementation, our objective is multi-class classification.

Before the loss is calculated, we want to transform the model output:

- We apply  `weight-tying` (see in the beginning) and multiply the model output with the embedding weights from ITEM_ID. The embedding dimensions and the model output dimensions have to be the same (256 in our example).
- We transform the ground truth into OneHot representation
- We apply Temperature Scaling

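As above, these transformations are configured in `mm.CategoricalOutput`: the `to_call` argument receives the item embedding table for weight tying, and `logits_temperature` sets the temperature. The one-hot encoding of the target is handled by the `mm.ToTarget` transform of the data loaders. All of this happens before the loss is calculated.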

In [36]:
item_id_name = train.schema.select_by_tag(Tags.ITEM_ID).first.properties['domain']['name']

In [37]:
prediction_task= mm.CategoricalOutput(
    to_call=seq_inputs["categorical"][item_id_name],
    logits_temperature=TEMPERATURE_SCALING,
    target='purchase_id_first',
)

Now, we will build a model by chaining the `concats`, the `mlp_block` and the `prediction_task` layers.

In [38]:
model_bi_lstm = Model(concats, mlp_block, prediction_task)

### Fit the Model
We initialize the optimizer and compile the model, as with any other TensorFlow Keras model. The competition was evaluated based on MRR@100.

In [39]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE
)

model_bi_lstm.compile(
    optimizer=optimizer,
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
        label_smoothing=LABEL_SMOOTHING
    ),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)

Now, we train the model.

In [40]:
history = model_bi_lstm.fit(
    loader,
    validation_data=val_loader,
    epochs=EPOCHS,
)

2022-11-15 22:43:59.797594: I tensorflow/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8500


Epoch 1/3




Epoch 2/3
Epoch 3/3


We can evaluate the model.

In [41]:
metrics_bi_lstm = model_bi_lstm.evaluate(val_loader, batch_size=BATCH_SIZE, return_dict=True)
metrics_bi_lstm['mrr_at_100']



0.12781542539596558

Note that the final `mrr_at_100` score printed above is the average over all steps, whereas the value shown in the progress bar is the score of the last n_steps (by default 1).

## Training a Transformer-based Model

In recent years, several deep learning-based algorithms have been proposed for recommendation systems, and their adoption in industry deployments has been growing steeply. In particular, NLP-inspired approaches have been successfully adapted for sequential and session-based recommendation problems, which are important for many domains like e-commerce, news and streaming media. Session-Based Recommender Systems (SBRS) have been proposed to model the sequence of interactions within the current user session, where a session is a short sequence of user interactions typically bounded by user inactivity. They have recently gained popularity due to their ability to capture short-term or contextual user preferences towards items.

The field of NLP has evolved significantly within the last decade, particularly due to the increased usage of deep learning. As a result, state-of-the-art NLP approaches have inspired RecSys practitioners and researchers to adapt those architectures, especially for sequential and session-based recommendation problems. Here, we use one of the state-of-the-art Transformer-based architectures, [XLNet](https://arxiv.org/abs/1906.08237), with the Causal Language Modeling (CLM) training technique for a multi-class classification task. For this, we leverage the popular HuggingFace’s Transformers NLP library, which makes it possible to experiment with cutting-edge implementations of such architectures for sequential and session-based recommendation problems.

Now, we replace the BiLSTM model with a Transformer-based architecture. We train an `XLNet` model, which concatenates the embedding vectors of all sequential features (`item_id_list_seq`, `f_47_list_seq`, `f_68_list_seq`) per step in the sequential input block and uses the self-attention mechanism. To learn more about the self-attention mechanism, you can take a look at this [paper](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) and this [post](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html).
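
As a rough intuition for self-attention, the sketch below implements plain scaled dot-product attention in pure Python, with queries, keys and values all set to the input vectors. This is a simplification and an assumption made for illustration: real Transformer layers, including XLNet, use learned projections, multiple heads and further architecture-specific details that are omitted here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    max_value = max(values)
    exps = [math.exp(v - max_value) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Similarity of this step to every step in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)])
    return outputs, all_weights


# A toy "session" of two steps with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(x)
print(weights)
print(outputs)
```

Every step attends to all steps of the session, and each attention row sums to 1, so the output for a step is a weighted mix of the embeddings in the sequence.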

In [10]:
item_id_name = train.schema.select_by_tag(Tags.ITEM_ID).first.properties['domain']['name']

In [11]:
seq_inputs = mm.InputBlockV2(
        schema_model.select_by_name(
            ['item_id_list_seq', 'f_47_list_seq', 'f_68_list_seq']
        ),
        categorical=mm.Embeddings(
            schema_model.select_by_name(
            ['item_id_list_seq', 'f_47_list_seq', 'f_68_list_seq']
        ),
            sequence_combiner=None,
            dim=manual_dims
                                
        )
)

cat_inputs = mm.InputBlockV2(
        schema_model.select_by_name(
            ['item_id_last']
        ),
        categorical=mm.Embeddings(
            schema_model.select_by_name(
            ['item_id_last']
        ),
            dim=manual_dims
                                
        )
)

We can check the output from the sequential input block and its dimension. We obtain a 3-D sequence representation (batch_size, sequence_length, sum_of_emb_dim_of_features).

In [12]:
batch = mm.sample_batch(train, batch_size=128, include_targets=False, to_ragged=True)

In [13]:
seq_inputs(batch).shape

TensorShape([128, None, 288])

In [14]:
cat_inputs(batch).shape

TensorShape([128, 256])

The sequence_length dimension is printed as `None` because the sequences in a batch have variable length.

Let's create a sequential block (a `SequentialBlock` represents a sequence of Keras layers) in which we connect the sequential input block with an MLPBlock and then an XLNetBlock. The MLPBlock is used as a projection block: because of the residual connections in the Transformer model, the dimension of the input to the XLNetBlock must match its hidden dimension (`d_model`). The MLPBlock therefore projects the output of the `seq_inputs` block (288) to 256.

In [15]:
mlp_block = mm.MLPBlock(
                [32,256],
                activation='relu',
                no_activation_last_layer=True,
                dropout=DROPOUT,
            )

In [16]:
dense_block = mm.SequentialBlock(
    seq_inputs,
    mlp_block,
    mm.XLNetBlock(
        d_model=256,
        n_head=4,
        n_layer=2,
        post='sequence_mean',
    )
)

The output of the XLNetBlock is a 2-D tensor `(batch_size, d_model)`, which is then fed to the final output layer.

In [17]:
dense_block(batch).shape

2022-11-15 23:37:20.451144: I tensorflow/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8500


TensorShape([128, 256])

Let's concatenate the output of transformer block with the output of the categorical block.

In [18]:
concats = mm.ParallelBlock(
    {'dense_block': dense_block, 
     'cat_inputs': cat_inputs},
    aggregation='concat'
)

The concat layer shape would be the total output dimension (256 + 256) of two layers that are concatenated.

In [19]:
concats(batch).shape

TensorShape([128, 512])

In [20]:
prediction_task= mm.CategoricalOutput(
    to_call=seq_inputs["categorical"][item_id_name],
    logits_temperature=TEMPERATURE_SCALING,
    target='purchase_id_first',
)

In [21]:
# `mlp_block` is already part of `dense_block` and was built for 288-dimensional inputs,
# so reusing it on the 512-dimensional output of `concats` raises a ValueError.
# We create a separate projection block; its output dimension (256) matches the
# item embedding dimension, as required for weight tying.
mlp_block2 = mm.MLPBlock(
                [128,256],
                activation='relu',
                no_activation_last_layer=True,
                dropout=DROPOUT,
            )
model_transformer = mm.Model(concats, mlp_block2, prediction_task)

We train the model.

In [22]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.01
)

model_transformer.compile(
    optimizer=optimizer,
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
        label_smoothing=LABEL_SMOOTHING
    ),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)

In [23]:
model_transformer.fit(loader, 
                      validation_data=val_loader,
                      epochs=EPOCHS
                     )


We evaluate the model.

In [None]:
metrics_transformer = model_transformer.evaluate(val_loader, return_dict=True)

## Summary

In this example, we focused on concepts that are relevant for a broad range of recommender system use cases, here applied to a session-based recommendation task. If you compare the MRR to the results of the ACM RecSys'22 competition, you will notice that the MRR can be much higher. The following additional techniques can be applied to improve the MRR:
- Data Augmentations - in the RecSys'22 challenge, we used many different techniques to increase the size of the training dataset. The techniques are specific to the dataset, so we did not include them in the example
- Additional item features - we focused on only a few item features
- Stacking - we stacked 17 models with a two-step approach
- Ensemble - we ensembled 3 different stacked models
- Hyperparameter Search - we ran multiple HPO jobs to find the best hyperparameters

In addition, the MRR for June (test data) was in general higher than for May (validation).