In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_models_ecommerce-session-based-next-item-prediction-for-fashion/nvidia_logo.png" style="width: 90px; float: right;">

# Session-Based Next Item Prediction for Fashion E-Commerce

This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container. 

## Overview

The NVIDIA Merlin team participated in the [Recsys2022 challenge](http://www.recsyschallenge.com/2022/index.html) and secured 3rd place. This notebook contains the various techniques used in the solution.

### Learning Objective

In this notebook, we will apply important concepts that improve recommender systems. We leveraged them for our RecSys solution:
- MultiClass next item prediction head with Merlin Models
- Sequential input features representing user sessions
- Label Smoothing 
- Temperature Scaling
- Weight Tying
- Learning Rate Scheduler

### Brief Description of the Concepts

##### Label smoothing
In recommender systems, we often have noisy datasets. A user cannot view all items to make the best decision. Noisy examples can result in large gradients and confuse the model. Label smoothing addresses the problem of noisy examples by smoothing the target probabilities to avoid overconfident predictions.

$$  \begin{array}{l}
y_{l} \ =\ ( 1\ -\ \alpha \ ) \ *\ y_{o} \ +\ ( \alpha \ /\ L)\\
\alpha :\ Label\ smoothing\ hyper-parameter\ ( 0 \leq \alpha \leq 1 ) \\
L:\ Total\ number\ of\ label\ classes\\
y_{o} :\ One-hot\ encoded\ label\ vector
\end{array}
$$

When α is 0, we have the original one-hot encoded labels, and as α increases, we move towards smoothed labels. Read [this](https://arxiv.org/abs/1906.02629) paper to learn more about it.
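
To make the formula concrete, here is a minimal pure-Python sketch that applies it to a single one-hot label. The helper name `smooth_labels` and the toy label with 4 classes are assumptions for illustration only; in the notebook itself the smoothing is handled by the `label_smoothing` argument of the Keras loss.

```python
def smooth_labels(one_hot, alpha):
    """Return (1 - alpha) * y_o + alpha / L for a one-hot label vector."""
    num_classes = len(one_hot)  # L in the formula
    return [(1.0 - alpha) * y + alpha / num_classes for y in one_hot]


# Toy label (assumption): 4 classes, the true class is index 2.
one_hot = [0.0, 0.0, 1.0, 0.0]
smoothed = smooth_labels(one_hot, alpha=0.2)

# The true class keeps most of the probability mass (about 0.85),
# the remaining classes each receive about 0.05.
print(smoothed)
```

The smoothed vector still sums to 1, so it remains a valid target distribution for the cross-entropy loss.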


##### Temperature Scaling
Similar to Label Smoothing, Temperature Scaling reduces the overconfidence of a model. We divide the logits (inputs to the softmax function) by a scalar parameter (T). For more information on Temperature Scaling read [this](https://arxiv.org/pdf/1706.04599.pdf) paper.
$$ softmax\ =\ \frac{e\ ^{( z_{i} \ /\ \ T)}}{\sum _{j} \ e^{( z_{j} \ /\ T)} \ } $$
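
The effect of T can be sketched in a few lines of plain Python. The function name `softmax_with_temperature` and the toy logits are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature T."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 0.0]  # toy logits (assumption)
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)

# With T = 2 the largest probability shrinks and the distribution flattens,
# while the ranking of the classes stays the same.
print(sharp)
print(soft)
```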


##### Weight Tying
Weight Tying can be applied to multi-class classification problems in which we predict items and have previously viewed items as an input. The final output layer (without activation function) is multiplied with the transposed item embedding matrix, resulting in a vector with a logit for each item id. The advantage is that the gradient path to the item embeddings is short. For more information read [this](https://arxiv.org/pdf/1608.05859v3.pdf) paper.
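
As a rough illustration of the idea, the sketch below scores every item by the dot product between the network output and that item's embedding. The function `tied_logits` and the tiny embedding table are hypothetical; the notebook itself relies on `ItemsPredictionWeightTying` from Merlin Models for this step.

```python
def tied_logits(hidden, item_embeddings):
    """One logit per item: dot product of the model output with each item embedding."""
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in item_embeddings]


# Hypothetical embedding table: 4 items, embedding dimension 3.
item_embeddings = [
    [0.1, 0.0, -0.2],
    [0.5, 0.3, 0.0],
    [-0.4, 0.2, 0.1],
    [0.0, -0.1, 0.6],
]

# Output of the last layer; its size must equal the embedding dimension.
hidden = [1.0, 2.0, 3.0]

logits = tied_logits(hidden, item_embeddings)
predicted_item = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted_item)
```

Because the same table is used for the input lookup and for the output scores, no separate output projection with one row per item has to be learned.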

## Downloading and preparing the dataset

We will import the required libraries.

In [1]:
import os
import glob

import nvtabular as nvt
from merlin.io import Dataset
from merlin.schema import Schema, Tags
from nvtabular.ops import (
    AddMetadata,
)


import tensorflow as tf

from tensorflow.keras import regularizers
from merlin.models.tf.dataset import BatchedDataset

import merlin.models.tf as mm
from merlin.models.tf import InputBlock
from merlin.models.tf.models.base import Model
from merlin.models.tf.transforms.bias import LogitsTemperatureScaler
from merlin.models.tf.prediction_tasks.next_item import ItemsPredictionWeightTying

from merlin.core.dispatch import get_lib
import merlin.models.tf.dataset as tf_dataloader

2022-08-26 03:53:32.985517: I tensorflow/core/platform/cpu_feature_guard.cc:152] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-26 03:53:35.564738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16255 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB-LS, pci bus id: 0000:86:00.0, compute capability: 7.0


###  Dressipi
[Dressipi](http://www.recsyschallenge.com/2022/dataset.html) hosted the [Recsys2022 challenge](http://www.recsyschallenge.com/2022/index.html) and provided an anonymized dataset. It contains 1.1 M online retail sessions that resulted in a purchase. It provides details about items that were viewed in a session, the item purchased at the end of the session and numerous features of those items. The item features are categorical IDs and are not interpretable.

The task of this competition was to predict, given a sequence of viewed items, which item will be purchased at the end of a session.

<img src="http://www.recsyschallenge.com/2022/images/session_purchase_data.jpeg" alt="dressipi_dataset" style="width: 400px; float: center;">  


### Dataset

We provide a function `get_dressipi2022` which preprocesses the dataset. Currently, the dataset cannot be downloaded automatically, so it needs to be downloaded manually. To use this function, prepare the data by following these 3 steps:
1. Sign up and download the data from [dressipi-recsys2022.com](https://www.dressipi-recsys2022.com/).
2. Unzip the raw data to a directory.
3. Set `DATA_FOLDER` to that directory.

If you do not want to use this dataset to run our examples, you can also opt for synthetic data. Synthetic data can be generated by running:

```python
    from merlin.datasets.synthetic import generate_data
    train, valid = generate_data("dressipi2022-preprocessed")
```

In [2]:
from merlin.datasets.ecommerce import get_dressipi2022

DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '~/data/dressipi_recsys2022'
)

train, valid = get_dressipi2022(DATA_FOLDER)

The dataset contains:
- `session_id`, id of a session, in which a user viewed and purchased an item. 
- `item_id` which was viewed at a given `timestamp` in a session
- `purchase_id` which is the id of item bought at the end of the session 

In addition to `timestamp`, we have `day` and `date` features for representing the chronological order in which items were viewed.

The items in the Dressipi dataset have many features, out of which we took the 22 most important ones, namely
`f_3 ,f_4 ,f_5 ,f_7 ,f_17 ,f_24 ,f_30 ,f_45 ,f_46 ,f_47 ,f_50 ,f_53 ,f_55 ,f_56 ,f_58 ,f_61 ,f_63 ,f_65 ,f_68 ,f_69 ,f_72 ,f_73`.

In [4]:
train.to_ddf().head()

| | session_id | item_id | date | f_3 | f_5 | f_7 | f_17 | f_24 | f_45 | f_47 | ... | f_61 | f_63 | f_65 | f_68 | f_69 | f_72 | f_73 | timestamp | day | purchase_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13441 | 19420 | 2020-08-14 18:42:59.970 | 793 | 605 | 536 | 378 | 588 | 177 | 242 | ... | 808 | 816 | 579 | 739 | 805 | 75 | -1 | 1597430579970 | 226 | 23039 |
| 1 | 13441 | 22734 | 2020-08-14 18:43:17.426 | 793 | 605 | 536 | 378 | -1 | 559 | 36 | ... | 706 | 861 | 521 | 373 | 780 | 75 | 544 | 1597430597426 | 226 | 23039 |
| 2 | 13441 | 13369 | 2020-08-14 18:44:53.299 | -1 | -1 | 394 | -1 | 588 | -1 | 123 | ... | 706 | 861 | -1 | 373 | 805 | 75 | 544 | 1597430693299 | 226 | 23039 |
| 3 | 13441 | 23304 | 2020-08-14 18:45:33.083 | -1 | -1 | 452 | 378 | -1 | -1 | 165 | ... | 706 | 861 | 521 | 393 | 592 | 75 | -1 | 1597430733083 | 226 | 23039 |
| 4 | 13441 | 19653 | 2020-08-14 18:45:47.108 | -1 | -1 | 536 | 378 | 588 | -1 | 516 | ... | 462 | 861 | 521 | 379 | 805 | 75 | 544 | 1597430747108 | 226 | 23039 |


## Feature Engineering with NVTabular

We use NVTabular for Feature Engineering. If you want to learn more about NVTabular, we recommend the [examples in the NVTabular GitHub Repository](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples).

### Categorify

We want to use embedding layers for our categorical features. First, we need to Categorify them so that they are encoded as contiguous integers.

The features `item_id` and `purchase_id` belong to the same category. If `item_id` is 8432 and `purchase_id` is 8432, they are the same item. When we apply Categorify, we want to keep this connection. We can achieve this by encoding them jointly, providing them as a nested list: `[['item_id', 'purchase_id']]`.

We will use only 2 of the categorical item features in this example.
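
The joint encoding of `item_id` and `purchase_id` can be pictured with the following pure-Python sketch, which builds one shared vocabulary from both columns. The helper `fit_joint_vocab`, the toy ids, and the choice to assign indices in sorted order starting at 1 are assumptions made for illustration; they are not a description of how `Categorify` orders its categories internally.

```python
def fit_joint_vocab(*columns):
    """Map every distinct raw id found in any of the columns to one contiguous integer.

    Index 0 is left unused in this sketch (assumption), so it stays free
    for values that were never seen during fitting.
    """
    unique_ids = sorted(set().union(*columns))
    return {raw_id: index for index, raw_id in enumerate(unique_ids, start=1)}


# Toy columns (assumption): the raw id 8432 appears in both of them.
item_id = [8432, 1200, 977]
purchase_id = [977, 8432, 8432]

vocab = fit_joint_vocab(item_id, purchase_id)
encoded_items = [vocab[i] for i in item_id]
encoded_purchases = [vocab[i] for i in purchase_id]

# The same raw id receives the same encoded id in both columns.
print(encoded_items)      # [3, 2, 1]
print(encoded_purchases)  # [1, 3, 3]
```

If the two columns were encoded independently, the raw id 8432 could end up with two different integers, and the model could no longer tell that a viewed item and a purchased item are the same product.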

In [5]:
%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = ['session_id', ['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify()

features = ['timestamp','date'] + cat_features

CPU times: user 188 µs, sys: 0 ns, total: 188 µs
Wall time: 208 µs


### Group the Data by Sessions

Currently, every row is a viewed item in the dataset. Our goal is to predict the item purchased after the last view in a session. Therefore, we group the dataset by `session_id` to have one row for each prediction.

Each row will have a sequence of encoded item ids with which a user interacted. The last item of a session has special importance as it is closer to the user's intention. We will keep the last viewed item as a separate feature.

The NVTabular `GroupBy` op enables the transformation. 

First, we define how the different columns should be aggregated:
- Keep the first occurrence of `date`
- Keep the last item and concatenate all items to a list (results are 2 features)
- Keep the first occurrence of `purchase_id` (purchase_id should be the same for all rows of one session)

In [6]:
to_aggregate = {}
to_aggregate['date'] = ["first"]
to_aggregate['item_id'] = ["last", "list"]
to_aggregate['purchase_id'] = ["first"]   

In addition, we concatenate each item feature to a list.

In [7]:
for name in item_features_names: 
    to_aggregate[name] = ['list']

In [8]:
to_aggregate

{'date': ['first'],
 'item_id': ['last', 'list'],
 'purchase_id': ['first'],
 'f_47': ['list'],
 'f_68': ['list']}

We want to sort the dataframe by `date` and group the rows by `session_id`.

In [9]:
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session_id"], 
    sort_cols=["date"],
    aggs= to_aggregate,
    name_sep="_")
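
To show what these aggregations produce, here is a small pure-Python sketch of the same logic on three toy rows: sort by date, group by `session_id`, then take `first`, `last` and `list`. The function `group_sessions` and the sample rows are hypothetical; the output column names mimic the `name_sep="_"` convention used above.

```python
from collections import defaultdict


def group_sessions(rows):
    """Sort rows by date, then build one aggregated record per session_id."""
    sessions = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["date"]):
        sessions[row["session_id"]].append(row)

    grouped = []
    for session_id, views in sessions.items():
        grouped.append({
            "session_id": session_id,
            "date_first": views[0]["date"],                   # "first"
            "item_id_last": views[-1]["item_id"],             # "last"
            "item_id_list": [v["item_id"] for v in views],    # "list"
            "purchase_id_first": views[0]["purchase_id"],     # "first"
        })
    return grouped


# Toy rows (assumption), deliberately not in chronological order.
rows = [
    {"session_id": 1, "date": "2020-08-14 18:43", "item_id": 22, "purchase_id": 99},
    {"session_id": 2, "date": "2020-08-14 18:40", "item_id": 7, "purchase_id": 5},
    {"session_id": 1, "date": "2020-08-14 18:42", "item_id": 19, "purchase_id": 99},
]

grouped = group_sessions(rows)
for record in grouped:
    print(record)
```

Sorting before grouping matters: without it, `item_id_last` for session 1 would be 19 instead of the item that was actually viewed last.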

Merlin Models can infer the neural network architecture from the dataset schema. We will Tag the columns accordingly based on the type of each column. If you want to learn more, we recommend our [Dataset Schema Example](https://github.com/NVIDIA-Merlin/models/blob/main/examples/02-Merlin-Models-and-NVTabular-integration.ipynb).

In [10]:
item_last = (
    groupby_features['item_id_last'] >> 
    AddMetadata(tags=[Tags.ITEM, Tags.ITEM_ID])
)
item_list = (
    groupby_features['item_id_list'] >> 
    AddMetadata(
        tags=[Tags.ITEM, Tags.ITEM_ID, Tags.LIST, Tags.SEQUENCE]
    )
)
feature_list = (
    groupby_features[[name+'_list' for name in item_features_names]] >> 
    AddMetadata(
        tags=[Tags.SEQUENCE, Tags.ITEM, Tags.LIST]
    )
)
target_feature = (
    groupby_features['purchase_id_first'] >> 
    AddMetadata(tags=[Tags.TARGET])
)
other_features = groupby_features['session_id', 'date_first']

groupby_features = item_last + item_list + feature_list + other_features + target_feature


### Truncating and Padding to a Maximum Sequence Length

We want to truncate and pad the sequential features. We define which columns are sequential features and which are non-sequential. We truncate each sequence by keeping the last 3 elements.

In [11]:
list_features = [name+'_list' for name in item_features_names] + ['item_id_list']
nonlist_features = ['session_id', 'date_first', 'item_id_last', 'purchase_id_first']

In [12]:
SESSIONS_MAX_LENGTH = 3
truncated_features = groupby_features[list_features] >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True) >> nvt.ops.Rename(postfix = '_seq')

final_features = groupby_features[nonlist_features] + truncated_features
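
The effect of slicing with a negative start and padding can be sketched in plain Python. The helper `truncate_and_pad` is hypothetical, and padding on the right with `0` is an assumption that is consistent with rows such as `[5726, 0, 0]` in the transformed dataset.

```python
def truncate_and_pad(sequence, max_length, pad_value=0):
    """Keep the last `max_length` elements and right-pad shorter sequences."""
    kept = sequence[-max_length:]
    return kept + [pad_value] * (max_length - len(kept))


print(truncate_and_pad([670, 4633, 8558, 901, 12], 3))  # [8558, 901, 12]
print(truncate_and_pad([5726], 3))                       # [5726, 0, 0]
```

Keeping the end of the sequence rather than the beginning preserves the most recent views, which are closest to the purchase.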

We initialize our NVTabular workflow.

In [13]:
workflow = nvt.Workflow(final_features)

We call fit and transform similar to the scikit-learn API.

Categorify will map item_ids (and purchase_ids) that do not occur in the train dataset to a special category `0` in the validation dataset. This can bias the validation metrics. In our example, almost all item_ids in validation are available in train, so we neglect this effect.
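
One way to check how much this matters for a given split is to measure the share of validation rows whose id never appears in the training data. The sketch below does this on toy lists; the function `unseen_fraction` and the ids are assumptions for illustration and are not part of the notebook's workflow.

```python
def unseen_fraction(train_ids, valid_ids):
    """Share of validation rows whose id never appears in the training data."""
    known = set(train_ids)
    unseen = sum(1 for item in valid_ids if item not in known)
    return unseen / len(valid_ids)


# Toy ids (assumption): item 99 only occurs in the validation split.
train_ids = [10, 11, 12, 13, 10, 11]
valid_ids = [10, 13, 99, 12]

print(unseen_fraction(train_ids, valid_ids))  # 0.25
```

A value close to zero supports the decision to ignore the effect; a larger value would suggest that the validation metrics are noticeably affected by the unknown category.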

In [14]:
# fit data
workflow.fit(train)

# transform and save data
workflow.transform(train).to_parquet(os.path.join(DATA_FOLDER, "train/"), output_files=10)
workflow.transform(valid).to_parquet(os.path.join(DATA_FOLDER, "valid/"), output_files=1)



### Sort the Training Dataset by Time

The train dataset contains the data from Jan 2020 to April 2021 and the validation dataset covers May 2021. As the data is split by time, we noticed that we achieve higher validation scores when we sort the training data by time and do not apply shuffling.

In [15]:
df = get_lib().read_parquet(
    glob.glob(
        os.path.join(DATA_FOLDER, "train/*.parquet")
    )
)
df = df.sort_values('date_first').reset_index(drop=True)
df.to_parquet(os.path.join(DATA_FOLDER, "train_sorted.parquet"))

Let's review the transformed dataset.

In [16]:
df.head()

| | session_id | date_first | item_id_last | purchase_id_first | f_47_list_seq | f_68_list_seq | item_id_list_seq |
|---|---|---|---|---|---|---|---|
| 773236 | 144306 | 2020-01-01 00:00:01.359 | 8558 | 14343 | [2, 5, 13] | [1, 8, 7] | [670, 4633, 8558] |
| 919765 | 102504 | 2020-01-01 00:00:21.440 | 12231 | 18295 | [1, 1, 1] | [8, 45, 6] | [15367, 8469, 12231] |
| 546675 | 993759 | 2020-01-01 00:00:48.505 | 5726 | 14805 | [13, 0, 0] | [9, 0, 0] | [5726, 0, 0] |
| 837402 | 9972 | 2020-01-01 00:06:37.801 | 10329 | 12877 | [7, 8, 7] | [14, 3, 3] | [19797, 16836, 10329] |
| 415676 | 357643 | 2020-01-01 00:08:19.297 | 15432 | 13374 | [1, 5, 1] | [5, 1, 8] | [938, 15840, 15432] |


In [17]:
del df

## Training an MLP with Sequential Input with Merlin Models

We train a Sequential-Multi-Layer Perceptron model, which averages the sequential input features (e.g. `item_id_list_seq`) and concatenates the resulting embeddings with the categorical embeddings (e.g. `item_id_last`). We visualize the architecture in the figure below.

<img src="../images/mlp_ecommerce.png"  width="30%">
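
The averaging and concatenation steps can be illustrated with a tiny pure-Python sketch. The helper `mean_pool` and the 2-dimensional toy embeddings are assumptions; the sketch also ignores how padded positions are treated, which is handled inside Merlin Models and is not shown here.

```python
def mean_pool(sequence_embeddings):
    """Average a list of equally sized embedding vectors element-wise."""
    length = len(sequence_embeddings)
    return [sum(values) / length for values in zip(*sequence_embeddings)]


# Toy embeddings (assumption): 3 sequence steps, embedding dimension 2.
item_id_list_seq_emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
item_id_last_emb = [0.5, -0.5]

pooled = mean_pool(item_id_list_seq_emb)   # one vector for the whole sequence
mlp_input = pooled + item_id_last_emb      # list concatenation

print(pooled)     # [3.0, 4.0]
print(mlp_input)  # [3.0, 4.0, 0.5, -0.5]
```

Averaging discards the order of the views, which is one reason the last viewed item is also fed in as a separate feature.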

### Hyperparameters

We use the following hyperparameters, which we found during experimentation.

In [3]:
EPOCHS = int(os.environ.get(
    "EPOCHS", 
    '10'
))
BATCH_SIZE = 512
LEARNING_RATE = 0.2
DROPOUT = 0.2 
LABEL_SMOOTHING = 0.2
TEMPERATURE_SCALING = 2

### Dataloader

We initialize the dataloaders to train the neural network models. First, we define the Merlin `Dataset` objects.

In [4]:
train = Dataset(os.path.join(DATA_FOLDER, 'train_sorted.parquet'))
valid = Dataset(os.path.join(DATA_FOLDER, 'valid/*.parquet'))



Because we loaded, sorted, and saved the train dataset without using NVTabular, the parquet file no longer contains a schema. We can copy the schema from valid to train.

In [5]:
train.schema = valid.schema
schema_model = train.schema

By default, the dataloader shuffles the data. Because we want to keep the chronological order of the training data, we initialize `tf_dataloader.BatchedDataset` for the training dataset with `shuffle=False`.

In [6]:
train_dl = tf_dataloader.BatchedDataset(
    train,
    batch_size = BATCH_SIZE,
    shuffle = False 
)

### Build the Sequential MLP with Merlin Models

Now we will create an InputBlock that takes the sequential features, concatenates them, and returns the sequence of interaction embeddings.

We define the embedding dimensions, manually.



In [7]:
manual_dims = {
    'item_id_list_seq': 256, 
    'item_id_last': 256,
    'f_47_list_seq': 16,
    'f_68_list_seq': 16
}

emb_options = mm.EmbeddingOptions(
        embedding_dims=manual_dims,
        infer_embedding_sizes=False,
)

In [8]:
input_block = InputBlock(
    schema_model.select_by_name(
        ['item_id_list_seq', 'item_id_last', 'f_47_list_seq', 'f_68_list_seq']
    ), 
    aggregation='concat',
    embedding_options = emb_options
)

Next, we define the prediction task. Our objective is multi-class classification: predicting which item is purchased at the end of the session. Therefore, we use the `MultiClassClassificationTask` task.

Before the loss is calculated, we want to transform the model output:
1. We apply L2 norm to the model logits.
2. We apply Weight Tying (see the description at the beginning) and multiply the model output with the embedding weights from ITEM_ID. The embedding dimensions and the model output dimensions have to be the same (256 in our example).
3. We transform the ground truth into a one-hot representation.
4. We apply Temperature Scaling.

The pipeline is called `prediction_call` and we add it as the `pre` parameter to `MultiClassClassificationTask`. The pipeline is executed before the loss is calculated.

In [9]:
prediction_call = mm.L2Norm().connect(
    ItemsPredictionWeightTying(schema_model), 
    mm.ToOneHot(),
    LogitsTemperatureScaler(temperature=TEMPERATURE_SCALING)
)

prediction_task = mm.MultiClassClassificationTask(
    target_name="purchase_id_first",
    pre=prediction_call,
)

Now, we will build a model with a 2-layer MLPBlock, using `input_block` as the input and `prediction_task` as the task.

In [10]:
model_mlp = mm.Model.from_block(
    mm.MLPBlock(
        [128,256], 
        no_activation_last_layer=True, 
        dropout=DROPOUT
    ),
    schema_model, 
    input_block=input_block,
    prediction_tasks=prediction_task
)

2022-08-26 03:58:01.963545: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


### Fit the Model

We initialize the optimizer with an `ExponentialDecay` learning rate scheduler and compile the model, as with any other TensorFlow Keras model. The competition was evaluated based on MRR@100.

In [26]:
initial_learning_rate = LEARNING_RATE

exp_decay_lr_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100000,
    decay_rate=0.96,
    staircase=True)
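
For intuition, the schedule can be written out in plain Python. This is a sketch of the documented decay rule, `initial_learning_rate * decay_rate ** (step / decay_steps)`, with integer division when `staircase=True`; the function name `exponential_decay` is our own and is not part of Keras.

```python
def exponential_decay(step, initial_learning_rate, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` optimizer steps under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        exponent = step // decay_steps  # the rate changes only at multiples of decay_steps
    return initial_learning_rate * decay_rate ** exponent


for step in (0, 99_999, 100_000, 250_000):
    print(step, exponential_decay(step, 0.2, 100_000, 0.96, staircase=True))
```

With `staircase=True`, the learning rate stays at its initial value for the first 100000 optimizer steps and is then multiplied by 0.96 at each further multiple of `decay_steps`.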

In [27]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=exp_decay_lr_scheduler,
)

model_mlp.compile(
    optimizer=optimizer,
    run_eagerly=True,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
        label_smoothing=LABEL_SMOOTHING
    ),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)

We call `.fit` to train the model.

In [28]:
%%time
history = model_mlp.fit(
    train_dl,
    validation_data=valid,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    schema=schema_model,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 40min 17s, sys: 1min 18s, total: 41min 36s
Wall time: 38min 6s


We can evaluate the model.

In [39]:
metrics_mlp = model_mlp.evaluate(valid, batch_size=BATCH_SIZE, return_dict=True)
metrics_mlp['mrr_at_100']



0.14556337893009186
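
To clarify what this number measures, the following sketch computes the mean reciprocal rank at k on toy predictions. The function `mrr_at_k` and the sample data are assumptions for illustration; the value above comes from the metric implementation in Merlin Models, not from this code.

```python
def mrr_at_k(ranked_items_per_session, purchased_items, k=100):
    """Mean reciprocal rank of the purchased item within the top-k predictions."""
    total = 0.0
    for ranked, target in zip(ranked_items_per_session, purchased_items):
        top_k = ranked[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)  # ranks start at 1
        # Sessions whose purchase is not in the top k contribute 0.
    return total / len(purchased_items)


# Toy data (assumption): three sessions with ranked predictions, best first.
predictions = [[7, 3, 9], [4, 8, 1], [2, 5, 6]]
purchased = [3, 4, 0]

print(mrr_at_k(predictions, purchased, k=3))  # (1/2 + 1 + 0) / 3 = 0.5
```

A score of about 0.146 therefore corresponds, very roughly, to the purchased item sitting around rank 7 on average, although the metric is dominated by the sessions where the item is ranked near the top.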

## Training a Bi-LSTM with Merlin Models


We train a Bi-LSTM model, which concatenates the embedding vectors for all sequential features (`item_id_list_seq`, `f_47_list_seq`, `f_68_list_seq`) at each step (3 steps in this example). The concatenated vectors are processed by a Bi-LSTM. The hidden state of the Bi-LSTM is concatenated with the embedding vectors of the categorical features (`item_id_last`). Then we connect it with a Multi-Layer Perceptron block. We visualize the architecture in the figure below.

<img src="../images/bi-lstm_ecommerce.png"  width="30%">

### Hyperparameters
We only updated the `LEARNING_RATE`, as in our experiments we found that the Bi-LSTM gave better results with a lower learning rate.

In [11]:
LEARNING_RATE = 0.005

### Build the Input Blocks with Merlin Models
Now we will create two InputBlocks, one for the sequential features and one for the categorical features. Each concatenates its features and returns the interaction embeddings.

We define the embedding dimensions, manually.

In [12]:
manual_dims = {
    'item_id_list_seq': 256, 
    'item_id_last': 256,
    'f_47_list_seq': 16,
    'f_68_list_seq': 16
}

seq_inputs = InputBlock(
        schema_model.select_by_name(
            ['item_id_list_seq', 'f_47_list_seq', 'f_68_list_seq']
        ),
        aggregation='concat',
        seq=True,
        max_seq_length=3,
        embedding_options=mm.EmbeddingOptions(
            embedding_dims=manual_dims,
            infer_embedding_sizes=False
        ),
        split_sparse=True,
)

cat_inputs = InputBlock(
        schema_model.select_by_name(
            ['item_id_last']
        ),
        aggregation='concat',
        embedding_options=mm.EmbeddingOptions(
            embedding_dims=manual_dims,
            infer_embedding_sizes=False
        )
)

### Build Bi-LSTM model

Now we will create a Bi-LSTM model by defining a custom layer of type `mm.Block`. It takes the sequential interaction embeddings as input.

In [13]:
class BiLSTM(mm.Block):
    def __init__(self, hidden_dim= 64, **kwargs):
        self.hidden_dim = hidden_dim
        lstm = tf.keras.layers.LSTM(
            hidden_dim, 
            return_sequences=False, 
            dropout=0.05,
            kernel_regularizer=regularizers.l2(1e-4)
        )
        self.lstm = tf.keras.layers.Bidirectional(lstm)
        
        super().__init__(**kwargs)
        
    def call(self, inputs, training=False, **kwargs) -> tf.Tensor:  
        interactions = inputs['input_sequence']
        sequence_representation = self.lstm(interactions)
        return sequence_representation
    
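    def compute_output_shape(self, input_shape):
        input_shape = input_shape['input_sequence']
        # return_sequences=False yields one vector per session:
        # forward and backward states concatenated (2 * hidden_dim).
        return (input_shape[0], self.hidden_dim * 2)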
    
    
bilstm = BiLSTM(hidden_dim=64)

We connect the `InputBlock` for sequential features with the `BiLSTM` block.

In [14]:
dense_block = mm.ParallelBlock({'input_sequence': seq_inputs}).connect(bilstm)

Now, we combine the `BiLSTM` block with the `InputBlock` of categorical features by concatenating them.

In [15]:
concats = mm.ParallelBlock(
    {'dense_block': dense_block, 
     'cat_inputs': cat_inputs},
    aggregation='concat'
)

Next, we will build a 2-layer MLPBlock.

In [16]:
mlp_block = mm.MLPBlock(
                [128,256],
                activation='relu',
                no_activation_last_layer=True,
                dropout=DROPOUT,
            )

Now, we define the prediction task. As we saw above in our MLP implementation, our objective is multi-class classification.

Before the loss is calculated, we want to transform the model output:

- We apply Weight Tying (see the description at the beginning) and multiply the model output with the embedding weights from ITEM_ID. The embedding dimensions and the model output dimensions have to be the same (256 in our example).
- We transform the ground truth into a one-hot representation.
- We apply Temperature Scaling.

Here we don't apply L2 norm to the model logits because it was detrimental to the Bi-LSTM model.

As above, the pipeline is called `prediction_call` and we add it as the `pre` parameter to `MultiClassClassificationTask`. The pipeline is executed before the loss is calculated.

In [17]:
prediction_call = ItemsPredictionWeightTying(schema_model).connect(
    mm.ToOneHot(),
    LogitsTemperatureScaler(temperature=TEMPERATURE_SCALING)
)

prediction_task = mm.MultiClassClassificationTask(
    target_name="purchase_id_first",
    pre=prediction_call,
)

Now, we will build a model by chaining the `concats`, the `mlp_block` and the `prediction_task`.

In [18]:
model_bi_lstm = Model(concats, mlp_block, prediction_task)

### Fit the Model
We initialize the optimizer and compile the model, as with any other TensorFlow Keras model. The competition was evaluated based on MRR@100.

In [19]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE
)

model_bi_lstm.compile(
    optimizer=optimizer,
    run_eagerly=True,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
        label_smoothing=LABEL_SMOOTHING
    ),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)

Now, we train the model.

In [20]:
history = model_bi_lstm.fit(
    train_dl,
    validation_data=valid,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    schema=schema_model,
)

2022-08-26 03:58:25.555132: I tensorflow/stream_executor/cuda/cuda_dnn.cc:379] Loaded cuDNN version 8400


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We can evaluate the model.

In [21]:
metrics_bi_lstm = model_bi_lstm.evaluate(valid, batch_size=BATCH_SIZE, return_dict=True)
metrics_bi_lstm['mrr_at_100']



0.11864075064659119

In this example, we focused on concepts that are relevant for a broad range of recommender system use cases. If you compare the MRR to the competition results, you will notice that the MRR can be much higher. The following additional techniques can be applied to improve the MRR:
- Data Augmentations - we used a lot of different techniques to increase the training dataset. The techniques are specific to the dataset and we did not include them in the example.
- Additional item features - we focused on only a few item features
- Stacking - we stacked 17 models with a two-step approach
- Ensemble - we ensembled 3 different stacked models
- Hyperparameter Search - we ran multiple HPO jobs to find the best hyperparameters
- etc.

In addition, the MRR on the June data (test data) was in general higher than on the May data (validation).