In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Session-Based Next Item Prediction for Fashion E-Commerce

This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container. 

## Overview

The NVIDIA Merlin team participated in the [RecSys2022 challenge](http://www.recsyschallenge.com/2022/index.html) and secured 3rd position. This notebook demonstrates several of the techniques used in that solution.

### Learning Objective

In this notebook, we will apply important concepts that improve recommender systems. We used them in our RecSys2022 challenge submission:
- MultiClass next item prediction head with Merlin Models
- Sequential input features representing user sessions
- Label Smoothing 
- Temperature Scaling
- Weight Tying

### Brief Description of the Concepts

##### Label smoothing
When the probabilities predicted by a classification model are higher than its accuracy, we say the model is overconfident. Label smoothing helps to prevent this by transforming the one-hot encoded labels into smoothed labels:
$$  \begin{array}{l}
y_{l} = (1 - \alpha)\, y_{o} + \alpha / L\\
\alpha :\ \text{label smoothing factor}\\
L :\ \text{total number of label classes}\\
y_{o} :\ \text{one-hot encoded label vector}
\end{array}
$$
When α is 0, we have the original one-hot encoded labels, and as α increases, we move towards smoothed labels. Read [this](https://arxiv.org/abs/1906.02629) paper to learn more about it.
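
The formula above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the implementation used by Keras or Merlin Models; the helper name `smooth_labels` and the toy label vector are assumptions made for this example.

```python
def smooth_labels(one_hot, alpha):
    """Apply y_l = (1 - alpha) * y_o + alpha / L to a one-hot label vector."""
    num_classes = len(one_hot)  # L in the formula above
    return [(1.0 - alpha) * y + alpha / num_classes for y in one_hot]


one_hot = [0.0, 0.0, 1.0, 0.0]            # toy label: class 2 out of 4
smoothed = smooth_labels(one_hot, alpha=0.2)
print(smoothed)                            # true class near 0.85, others near 0.05
```

The smoothed vector still sums to 1, and with `alpha=0` the function returns the original one-hot labels.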


##### Temperature Scaling
Like label smoothing, temperature scaling reduces the overconfidence of a model. We divide the logits (the inputs to the softmax function) by a scalar parameter $T$ before applying the softmax. For more information on temperature scaling, read [this](https://arxiv.org/pdf/1706.04599.pdf) paper.
$$ \mathrm{softmax}(z)_{i} = \frac{e^{z_{i} / T}}{\sum_{j} e^{z_{j} / T}} $$
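
As a minimal sketch of the formula above, the following plain-Python function divides the logits by a temperature before the softmax. The function name `softmax_with_temperature` and the toy logits are assumptions for illustration; the notebook itself relies on `LogitsTemperatureScaler` for this step.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature T."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 0.0]                     # toy logits
probs_t1 = softmax_with_temperature(logits, temperature=1.0)
probs_t2 = softmax_with_temperature(logits, temperature=2.0)
print(max(probs_t1), max(probs_t2))          # the T=2 distribution is less peaked
```

A temperature above 1 flattens the distribution without changing the ranking of the classes.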


##### Weight Tying
In this technique, the weights of the embedding layer, which converts the input items to embeddings, are reused as the weights of the softmax layer, which converts the hidden layer output to scores over the items. This substantially reduces the number of parameters and can help the model train better. For more information, read [this](https://arxiv.org/pdf/1608.05859v3.pdf) paper.
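
The idea can be sketched with a toy embedding table in plain Python. The sizes, the random values, and the helpers `embed` and `tied_logits` are hypothetical and only illustrate the sharing; they are not the Merlin Models implementation. Note that this kind of sharing requires the hidden output to have the same dimension as the item embeddings.

```python
import random

random.seed(0)
num_items, dim = 6, 4  # hypothetical toy sizes

# One embedding table, used on the input side AND as the output weights.
item_embeddings = [[random.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(num_items)]


def embed(item_id):
    """Input side: look up the embedding of a viewed item."""
    return item_embeddings[item_id]


def tied_logits(hidden):
    """Output side: score every item with the same table (hidden @ E^T)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in item_embeddings]


hidden = embed(2)             # stand-in for the network's hidden output
logits = tied_logits(hidden)  # one score per item

untied_params = num_items * dim + dim * num_items  # separate input/output tables
tied_params = num_items * dim                      # a single shared table
print(len(logits), untied_params, tied_params)
```

With tied weights, the output layer adds no new weight matrix, so the item-related parameter count is halved in this toy setup.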

## Downloading and preparing the dataset

We will import the required libraries.

In [2]:
import os
import cupy
import cudf
import dask_cudf
import numpy as np
import pandas as pd 

import nvtabular as nvt
from merlin.dag import ColumnSelector
from merlin.io import Dataset
from merlin.schema import Schema, Tags
from nvtabular.ops import (
    AddMetadata,
)

import tensorflow as tf

from tensorflow.keras import regularizers
from merlin.models.tf.dataset import BatchedDataset
from merlin.models.tf.utils.tf_utils import extract_topk

import merlin.models.tf as mm
from merlin.models.tf import InputBlock
from merlin.models.tf.models.base import Model
from merlin.models.tf.core.aggregation import SequenceAggregation, SequenceAggregator
from merlin.models.tf.core.transformations import (
    ItemsPredictionWeightTying,
    L2Norm,
    LogitsTemperatureScaler,
)

2022-08-19 14:07:38.392579: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 14:07:40.734255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16255 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB-LS, pci bus id: 0000:0a:00.0, compute capability: 7.0


###  Dressipi
[Dressipi](http://www.recsyschallenge.com/2022/dataset.html) hosted the [Recsys2022 challenge](http://www.recsyschallenge.com/2022/index.html) and provided an anonymized dataset. It contains 1.1 M online retail sessions that resulted in a purchase. It provides details about items that were viewed in a session, the item purchased at the end of the session and numerous features of those items. The item features are categorical IDs and are not interpretable.

The task of this competition was to predict, given the sequence of items viewed in a session, which item will be purchased at the end of that session.

<img src="http://www.recsyschallenge.com/2022/images/session_purchase_data.jpeg" alt="dressipi_dataset" style="width: 400px; float: center;">  


### Dataset

We provide a function `get_dressipi2022` which preprocesses the dataset. Currently, we can't download this dataset automatically, so it needs to be downloaded manually. To use this function, prepare the data by following these 3 steps:
1. Sign up and download the data from [dressipi-recsys2022.com](https://www.dressipi-recsys2022.com/).
2. Unzip the raw data to a directory.
3. Set `DATA_FOLDER` to that directory.

If you want to run our examples without downloading this dataset, you can also opt for synthetic data. Synthetic data can be generated by running:

```python
    from merlin.datasets.synthetic import generate_data
    train, valid = generate_data("dressipi2022-preprocessed")
```

In [3]:
from merlin.datasets.ecommerce import get_dressipi2022

DATA_FOLDER = '../../../../../dressipi_recsys2022'
train, valid = get_dressipi2022(DATA_FOLDER)

The dataset contains:
- `session_id`: the id of a session in which a user viewed and purchased an item.
- `item_id`: the id of the item that was viewed at a given `timestamp` in a session.
- `purchase_id`: the id of the item bought at the end of the session.

In addition to `timestamp`, we have the `day` and `date` features, which represent the chronological order in which items were viewed.

The items in the Dressipi dataset have many features, out of which we took the 22 most important ones, namely
`f_3 ,f_4 ,f_5 ,f_7 ,f_17 ,f_24 ,f_30 ,f_45 ,f_46 ,f_47 ,f_50 ,f_53 ,f_55 ,f_56 ,f_58 ,f_61 ,f_63 ,f_65 ,f_68 ,f_69 ,f_72 ,f_73`

In [4]:
train.to_ddf().head().columns

Index(['session_id', 'item_id', 'date', 'f_3', 'f_5', 'f_7', 'f_17', 'f_24',
       'f_45', 'f_47', 'f_50', 'f_55', 'f_56', 'f_58', 'f_61', 'f_63', 'f_65',
       'f_68', 'f_69', 'f_72', 'f_73', 'timestamp', 'day', 'purchase_id'],
      dtype='object')

## Feature Engineering with NVTabular

We use NVTabular for Feature Engineering. If you want to learn more about NVTabular, we recommend the [examples in the NVTabular GitHub Repository](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples).

### Categorify

We want to use embedding layers for our categorical features. First, we need to Categorify them so that they are encoded as contiguous integers.

The features `item_id` and `purchase_id` belong to the same category. If `item_id` is 8432 and `purchase_id` is 8432, they are the same item. When we apply Categorify, we want to keep this connection. We can achieve this by encoding them jointly, which is done by passing them as a nested list: `[['item_id', 'purchase_id']]`.
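
To illustrate what joint encoding means, here is a small plain-Python sketch that builds a single vocabulary shared by both columns. The helper `fit_joint_vocabulary`, the toy ids, and the choice to reserve index 0 are assumptions for this example; the actual index assignment made by `Categorify` may differ.

```python
def fit_joint_vocabulary(*columns):
    """Build one raw-id -> index mapping shared by all given columns.

    Index 0 is left unused here, as a placeholder for padding/unknown values.
    """
    unique_values = sorted({value for column in columns for value in column})
    return {value: index for index, value in enumerate(unique_values, start=1)}


item_id = [8432, 17, 903]        # toy viewed items
purchase_id = [903, 8432, 55]    # toy purchased items

vocab = fit_joint_vocabulary(item_id, purchase_id)
encoded_items = [vocab[v] for v in item_id]
encoded_purchases = [vocab[v] for v in purchase_id]
print(encoded_items, encoded_purchases)
```

Because both columns share one mapping, the raw id 8432 receives the same encoded value whether it appears as a viewed item or as a purchase.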

We will use only two of the categorical item features (`f_47` and `f_68`) in this example.

In [5]:
%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = ['session_id', ['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify()

features = ['timestamp','date'] + cat_features

CPU times: user 0 ns, sys: 151 µs, total: 151 µs
Wall time: 172 µs


### GroupBy the data by sessions.

Currently, every row in the dataset is a viewed item. Our goal is to predict the item purchased after the last view in a session. Therefore, we group the dataset by `session_id` to have one row for each prediction.

Each row will have a sequence of encoded item ids with which a user interacted. The last item of a session has special importance as it is closest to the user's intention. We will keep the last viewed item as a separate feature.

The NVTabular Op `GroupBy` enables the transformation. 

First, we define how the different columns should be aggregated:
- Keep the first occurrence of `date`
- Keep the last item and concatenate all items to a list (resulting in 2 features)
- Keep the first occurrence of `purchase_id` (`purchase_id` should be the same for all rows of one session)
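
The aggregation rules listed above can be sketched in plain Python, without NVTabular, to show what one output row per session looks like. The toy rows and the helper `aggregate_sessions` are hypothetical and serve only as an illustration; the output names follow the `<column>_<aggregation>` pattern used in this notebook.

```python
from collections import defaultdict

# Toy view events: one row per viewed item (made-up values).
rows = [
    {"session_id": 1, "date": 3, "item_id": 30, "purchase_id": 99},
    {"session_id": 1, "date": 1, "item_id": 10, "purchase_id": 99},
    {"session_id": 2, "date": 2, "item_id": 20, "purchase_id": 77},
    {"session_id": 1, "date": 2, "item_id": 40, "purchase_id": 99},
]


def aggregate_sessions(rows):
    """Sort by date, group by session, and apply first/last/list aggregations."""
    sessions = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["date"]):
        sessions[row["session_id"]].append(row)

    aggregated = []
    for session_id, views in sessions.items():
        aggregated.append({
            "session_id": session_id,
            "date_first": views[0]["date"],
            "item_id_last": views[-1]["item_id"],
            "item_id_list": [v["item_id"] for v in views],
            "purchase_id_first": views[0]["purchase_id"],
        })
    return aggregated


result = aggregate_sessions(rows)
for row in result:
    print(row)
```

Sorting before grouping is what makes "first", "last" and the list order meaningful in time.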

In [6]:
to_aggregate = {}
to_aggregate['date'] = ["first"]
to_aggregate['item_id'] = ["last", "list"]
to_aggregate['purchase_id'] = ["first"]   

In addition, we collect the values of each item feature into a list.

In [7]:
for name in item_features_names: 
    to_aggregate[name] = ['list']

In [8]:
to_aggregate

{'date': ['first'],
 'item_id': ['last', 'list'],
 'purchase_id': ['first'],
 'f_47': ['list'],
 'f_68': ['list']}

We want to sort the rows by `date` and group them by `session_id`.

In [9]:
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session_id"], 
    sort_cols=["date"],
    aggs= to_aggregate,
    name_sep="_")

Merlin Models can infer the neural network architecture from the dataset schema. We tag the columns according to their role in the dataset. If you want to learn more, we recommend our [Dataset Schema Example](https://github.com/NVIDIA-Merlin/models/blob/main/examples/02-Merlin-Models-and-NVTabular-integration.ipynb).

In [10]:
item_last = (
    groupby_features['item_id_last'] >> 
    AddMetadata(tags=[Tags.ITEM, Tags.ITEM_ID])
)
item_list = (
    groupby_features['item_id_list'] >> 
    AddMetadata(
        tags=[Tags.ITEM, Tags.ITEM_ID, Tags.LIST, Tags.SEQUENCE]
    )
)
feature_list = (
    groupby_features[[name+'_list' for name in item_features_names]] >> 
    AddMetadata(
        tags=[Tags.SEQUENCE, Tags.ITEM, Tags.LIST]
    )
)
target_feature = (
    groupby_features['purchase_id_first'] >> 
    AddMetadata(tags=[Tags.TARGET])
)
other_features = groupby_features['session_id', 'date_first']

groupby_features = item_last + item_list + feature_list + other_features + target_feature


### Truncate and Padding for a Maximum Sequence Length

We want to truncate and pad the sequential features so that every session has the same sequence length. We first define which columns are sequential features and which are not, and then truncate each sequence by keeping its last 3 elements.
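
As a rough sketch of this step, the plain-Python helper below keeps the last elements of a sequence and pads shorter ones. The name `slice_and_pad`, the pad value 0, and padding on the right are assumptions chosen to match the `[5726, 0, 0]` pattern visible in the transformed data later in this notebook; it assumes `max_length` is at least 1.

```python
def slice_and_pad(sequence, max_length, pad_value=0):
    """Keep the last `max_length` elements and right-pad shorter sequences."""
    truncated = sequence[-max_length:]
    return truncated + [pad_value] * (max_length - len(truncated))


print(slice_and_pad([670, 4633, 8558, 12], max_length=3))  # long session: truncated
print(slice_and_pad([5726], max_length=3))                 # short session: padded
```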

In [11]:
list_features = [name+'_list' for name in item_features_names] + ['item_id_list']
nonlist_features = ['session_id', 'date_first', 'item_id_last', 'purchase_id_first']

In [12]:
SESSIONS_MAX_LENGTH = 3
truncated_features = groupby_features[list_features] >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True) >> nvt.ops.Rename(postfix = '_seq')

final_features = groupby_features[nonlist_features] + truncated_features

We initialize our NVTabular workflow.

In [13]:
workflow = nvt.Workflow(final_features)

We call `fit` and `transform`, similar to the scikit-learn API.

In [14]:
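dataset = nvt.Dataset(cudf.concat([train.to_ddf().compute(), valid.to_ddf().compute()]))
# fit data
workflow.fit(dataset)

# transform data; `train_2` and `valid_2` are kept for the sorting step below
# (assumption: the cells that originally defined them are not shown here)
train_2 = workflow.transform(train)
valid_2 = workflow.transform(valid)
train_2.to_parquet(os.path.join(DATA_FOLDER, "train/"), output_files=10)
valid_2.to_parquet(os.path.join(DATA_FOLDER, "valid/"), output_files=10)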



### Sort the Training Dataset by Time

We sort both datasets by `date_first` so that the sessions are in chronological order, and then save them for training a model.

In [20]:
%%time
train_ds = Dataset(train_2.to_ddf().sort_values('date_first'), schema=train_2.schema)
valid_ds = Dataset(valid_2.to_ddf().sort_values('date_first'), schema=valid_2.schema)

CPU times: user 10.5 ms, sys: 3.48 ms, total: 14 ms
Wall time: 12.6 ms


In [21]:
train_ds.to_parquet(os.path.join(DATA_FOLDER, "train/"), output_files=10)
valid_ds.to_parquet(os.path.join(DATA_FOLDER, "valid/"), output_files=10)

## Training - MLP

A sequential Multi-Layer Perceptron model that uses the average of the sequence as the final representation.

#### Hyperparameters

In [22]:
SEED = 42
EPOCHS = 10
BATCH_SIZE = 512
LEARNING_RATE = 0.2 #3e-1
CLIPNORM = True
DROPOUT= 0.2 
LABEL_SMOOTHING = 0.2
TEMPERATURE_SCALING = 2
OPTIMIZER_NAME = 'adam'
LOSS='CategoricalCrossentropy'

tf.keras.utils.set_random_seed(SEED)

#### Load the processed dataset

In [24]:
train = Dataset(os.path.join(DATA_FOLDER, 'train/*.parquet'), shuffle=False,)
valid = Dataset(os.path.join(DATA_FOLDER, 'valid/*.parquet'), shuffle=False,)



In [25]:
import merlin.models.tf.dataset as tf_dataloader

train_dl = tf_dataloader.BatchedDataset(
    train,
    batch_size = 512,
    shuffle=False, 
)

#### Schema 
Let's visualize the schema. From it, we will select the features for training the MLP model.

In [26]:
train.schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.domain.min,properties.domain.max,properties.domain.name
0,session_id,(Tags.CATEGORICAL),int64,False,False,,0.0,0.0,0.0,.//categories/unique.session_id.parquet,1000001.0,512.0,0.0,1000001.0,
1,date_first,(),float64,False,False,,,,,,,,,,
2,item_id_last,"(Tags.ITEM, Tags.ITEM_ID, Tags.CATEGORICAL)",int64,False,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0.0,23619.0,item_id_purchase_id
3,purchase_id_first,"(Tags.TARGET, Tags.CATEGORICAL)",int64,False,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0.0,23619.0,item_id_purchase_id
4,f_47_list_seq,"(Tags.ITEM, Tags.SEQUENCE, Tags.CATEGORICAL, T...",int64,True,False,,0.0,0.0,0.0,.//categories/unique.f_47.parquet,18.0,16.0,0.0,18.0,f_47
5,f_68_list_seq,"(Tags.ITEM, Tags.SEQUENCE, Tags.CATEGORICAL, T...",int64,True,False,,0.0,0.0,0.0,.//categories/unique.f_68.parquet,50.0,16.0,0.0,50.0,f_68
6,item_id_list_seq,"(Tags.ITEM, Tags.SEQUENCE, Tags.ITEM_ID, Tags....",int64,True,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0.0,23619.0,item_id_purchase_id


In this model, we will use only two features, `item_id_last` and `item_id_list_seq`, for training.

In [27]:
schema_model = train.schema.select_by_name(['item_id_list_seq', 'item_id_last', 'purchase_id_first'])
schema_model

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.domain.min,properties.domain.max,properties.domain.name
0,item_id_list_seq,"(Tags.ITEM, Tags.SEQUENCE, Tags.ITEM_ID, Tags....",int64,True,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0,23619,item_id_purchase_id
1,item_id_last,"(Tags.ITEM, Tags.ITEM_ID, Tags.CATEGORICAL)",int64,False,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0,23619,item_id_purchase_id
2,purchase_id_first,"(Tags.TARGET, Tags.CATEGORICAL)",int64,False,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0,23619,item_id_purchase_id


Let's visualise the dataset

In [28]:
train.head()

Unnamed: 0,session_id,date_first,item_id_last,purchase_id_first,f_47_list_seq,f_68_list_seq,item_id_list_seq
0,144306,2020-01-01 00:00:01.359,8558,14343,"[2, 5, 13]","[1, 8, 7]","[670, 4633, 8558]"
1,102504,2020-01-01 00:00:21.440,12231,18295,"[1, 1, 1]","[8, 45, 6]","[15367, 8469, 12231]"
2,993759,2020-01-01 00:00:48.505,5726,14805,"[13, 0, 0]","[9, 0, 0]","[5726, 0, 0]"
3,9972,2020-01-01 00:06:37.801,10329,12877,"[7, 8, 7]","[14, 3, 3]","[19797, 16836, 10329]"
4,357643,2020-01-01 00:08:19.297,15432,13374,"[1, 5, 1]","[5, 1, 8]","[938, 15840, 15432]"


#### Build the Model
Now we will create an `InputBlock` that takes the sequential features, concatenates them, and returns the sequence of interaction embeddings.

In [29]:
emb_options = mm.EmbeddingOptions(
        embedding_dims = None,
        embedding_dim_default=256,
        infer_embedding_sizes=False,
)

input_block = InputBlock(
    schema_model.select_by_name(['item_id_list_seq', 'item_id_last']), 
    aggregation='concat',
    embedding_options = emb_options
)

The multi-class classification prediction head will have:
- L2 normalization (`L2Norm`)
- Weight Tying
- Labels as one-hot encoded vectors, used for label smoothing
- Temperature Scaling to reduce the overconfidence of the model
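
To show how these pieces fit together, here is a plain-Python sketch of the computation for a single session: normalize the hidden vector, score it against the shared item embeddings, divide by the temperature, and compute the cross-entropy against a smoothed one-hot label. All names and toy values are assumptions for illustration, the sketch omits any bias term, and it is not the Merlin Models implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def head_logits(hidden, item_embeddings, temperature):
    """Normalize, score against the shared item embeddings, then divide by T."""
    h = l2_normalize(hidden)
    return [sum(a * b for a, b in zip(h, row)) / temperature
            for row in item_embeddings]


def smoothed_cross_entropy(logits, target, alpha):
    """Cross-entropy between softmax(logits) and the smoothed one-hot label."""
    shift = max(logits)
    log_z = shift + math.log(sum(math.exp(z - shift) for z in logits))
    num_classes = len(logits)
    loss = 0.0
    for i, z in enumerate(logits):
        label = (1.0 - alpha) * (1.0 if i == target else 0.0) + alpha / num_classes
        loss -= label * (z - log_z)
    return loss


item_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # toy table with 3 items
hidden = [3.0, 4.0]                                      # toy hidden vector

logits = head_logits(hidden, item_embeddings, temperature=2.0)
loss = smoothed_cross_entropy(logits, target=2, alpha=0.2)
print(logits, loss)
```

In this toy case, the third item has the embedding most aligned with the hidden vector, so it receives the highest logit.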

In [30]:
prediction_call = L2Norm().connect(
    ItemsPredictionWeightTying(schema_model), 
    mm.LabelToOneHot(),
    LogitsTemperatureScaler(temperature=TEMPERATURE_SCALING)
)

prediction_task = mm.MultiClassClassificationTask(
    target_name="purchase_id_first",
    pre=prediction_call,
)

Now, we will build a model with an `MLPBlock` to get the hidden representation of the sequence.

In [31]:
model_mlp = mm.Model.from_block(
    mm.MLPBlock([128,256], no_activation_last_layer=True, dropout=0.2),
    schema_model, 
    input_block=input_block,
    prediction_tasks=prediction_task
)

#### Define the Optimizer

In [32]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE,
)

model_mlp.compile(
    optimizer=optimizer,
    run_eagerly=True,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True, label_smoothing=LABEL_SMOOTHING),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)

#### Model Training

In [None]:
%%time
history = model_mlp.fit(
    train_dl,
    validation_data=valid,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    schema=schema_model,
)

Epoch 1/10
Epoch 2/10
 167/1799 [=>............................] - ETA: 2:39 - loss: 8.2953 - recall_at_100: 0.2852 - mrr_at_100: 0.0681 - ndcg_at_100: 0.1095 - map_at_100: 0.0681 - precision_at_100: 0.0029 - regularization_loss: 0.0000e+00

### Model Evaluation

In [30]:
model_mlp.evaluate(valid_ds, batch_size=1024, return_dict=True)



{'loss': 8.017659187316895,
 'recall_at_100': 0.4807376563549042,
 'mrr_at_100': 0.1448851227760315,
 'ndcg_at_100': 0.21048013865947723,
 'map_at_100': 0.1448851227760315,
 'precision_at_100': 0.004807377699762583,
 'regularization_loss': 0.0}
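
Because every session has exactly one purchased item, some of these metrics are directly related: `precision_at_100` is `recall_at_100` divided by 100, and `map_at_100` equals `mrr_at_100`, which matches the values reported above. The following plain-Python sketch illustrates recall, precision and MRR at `k` for this single-target setting; the helper `topk_metrics` and the toy rankings are assumptions, not the Merlin Models metric implementation.

```python
def topk_metrics(ranked_lists, targets, k):
    """Recall@k, precision@k and MRR@k when each session has one target item."""
    num_sessions = len(targets)
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top.index(target) + 1)
    return {
        "recall": hits / num_sessions,
        "precision": hits / (num_sessions * k),
        "mrr": reciprocal_rank_sum / num_sessions,
    }


ranked_lists = [[5, 3, 9], [2, 7, 1], [4, 8, 6]]  # toy top-3 recommendations
targets = [3, 2, 0]                               # toy purchased items
metrics = topk_metrics(ranked_lists, targets, k=3)
print(metrics)
```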

In [31]:
def generate_recommendations(pred, df_agg, batch_size=1024, n_topk=100):
    print('Mask Predictions')
    print('Generate Top100 Recommendations')
    out_pred = []
    out_score = []
    for i in range(0, pred.shape[0]//batch_size+1):
        batch_start = (i)*batch_size
        batch_end = min((i+1)*batch_size, pred.shape[0])
        pred_tmp = pred[batch_start:batch_end]
        cp_pred = cupy.asarray(pred_tmp)
        pred_idx = cupy.argsort(-cp_pred)
        pred_idx = cupy.asnumpy(pred_idx)
        for j in range(pred_idx.shape[0]):
            topk = []
            score = []
            for k in range(n_topk):
                idx = pred_idx[j][k]
                topk.append(idx)
                score.append(pred_tmp[j][idx])
            out_pred.append(topk)
            out_score.append(score)
    
    print('Transform Top100 Recommendations')
    metadata = df_agg[['session_id', 'purchase_id_first']].to_pandas().values.tolist()
    out = []
    for i, ex in enumerate(metadata):
        session_id = ex[0]
        purchase = ex[1]
        for k in range(n_topk):
            out.append([session_id, purchase, out_pred[i][k], out_score[i][k]])

    df_rec = cudf.DataFrame(out)
    df_rec.columns = ['session_id', 'purchased', 'rec', 'score']
    return(df_rec)

def evaluate(df, add_folds=False):
    print('Model evaluation')
    df = df.drop_duplicates(['session_id', 'rec'])
    df = df.sort_values(['session_id', 'score'], ascending=False)
    df['dummy'] = 1
    df['rank'] = df[['session_id', 'dummy']].groupby('session_id').cumsum()
    df = df[df['rank']<=100]
    df.drop('dummy', inplace=True, axis=1)
    df['mrr'] = 1/df['rank']
    df.loc[df['purchased']!=df['rec'], 'mrr'] = 0
    out = {}
    mrr = df[df['purchased']==df['rec']]['mrr'].sum()/df['session_id'].drop_duplicates().shape[0]
    out['total'] = mrr
    return(out)

In [32]:
%%time
predictions = model_mlp.predict(valid, batch_size=1024, verbose=1)
ddf = valid.to_ddf()
ddf = ddf[['session_id', 'purchase_id_first']].compute()
df_rec = generate_recommendations(predictions, ddf)
val_mrr = evaluate(df_rec)['total']
print('MRR: ',val_mrr)

Mask Predictions
Generate Top100 Recommendations
Transform Top100 Recommendations
Model evaluation
MRR:  0.14488511239789104
CPU times: user 51.9 s, sys: 12.1 s, total: 1min 3s
Wall time: 1min 2s


## Training Bi-LSTM

#### Hyperparameters

In [33]:
SEED = 42
EPOCHS = 10
BATCH_SIZE = 1024 #512
LEARNING_RATE = 3e-1
CLIPNORM = True
DROPOUT= 0.2 #0.01
LABEL_SMOOTHING = 0.2
TEMPERATURE_SCALING = 2
OPTIMIZER_NAME = 'adam'
LOSS='CategoricalCrossentropy'

BI_LSTM_HIDDEN_DIM = 64
tf.keras.utils.set_random_seed(SEED)

### Model

Let's create a `BiLSTM` block that receives an `inputs` dictionary containing the sequence of interaction embeddings under the key `input_sequence`.

In [34]:
class BiLSTM(mm.Block):
    def __init__(self, hidden_dim=64, **kwargs):
        # initialize the base layer before assigning sub-layers
        super().__init__(**kwargs)
        self.hidden_dim = hidden_dim
        lstm = tf.keras.layers.LSTM(hidden_dim, return_sequences=False, dropout=0.05,
                                    kernel_regularizer=regularizers.l2(1e-4))
        self.lstm = tf.keras.layers.Bidirectional(lstm)

    def call(self, inputs, training=False, **kwargs) -> tf.Tensor:
        # the sequence of interaction embeddings is stored under 'input_sequence'
        interactions = inputs['input_sequence']
        sequence_representation = self.lstm(interactions)
        return sequence_representation

    def compute_output_shape(self, input_shape):
        input_shape = input_shape['input_sequence']
        # return_sequences=False: one vector per session, with the forward
        # and backward states concatenated
        return (input_shape[0], self.hidden_dim * 2)


bilstm = BiLSTM(hidden_dim=BI_LSTM_HIDDEN_DIM)

For the Bi-LSTM model, let's use only `item_id_list_seq` for training

In [35]:
schema_model = train.schema.select_by_name(['item_id_list_seq', 'purchase_id_first'])
schema_model

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.domain.min,properties.domain.max,properties.domain.name
0,item_id_list_seq,"(Tags.SEQUENCE, Tags.ITEM, Tags.ITEM_ID, Tags....",int64,True,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0,23619,item_id_purchase_id
1,purchase_id_first,"(Tags.CATEGORICAL, Tags.TARGET)",int64,False,False,,0.0,0.0,0.0,.//categories/unique.item_id_purchase_id.parquet,23619.0,450.0,0,23619,item_id_purchase_id


#### Build the Model 

Let's create an `InputBlock` that takes the sequential features, concatenates them, and returns the sequence of interaction embeddings.

In [36]:
inputs = InputBlock(
        schema_model,
        aggregation='concat',
        seq=True,
        max_seq_length=20,
        embedding_options=mm.EmbeddingOptions(
            embedding_dim_default=256,
            infer_embedding_sizes=True,
            infer_embedding_sizes_multiplier=2,
            infer_embeddings_ensure_dim_multiple_of_8=True
        ),
        split_sparse=True,
)

In [37]:
dense_block = mm.ParallelBlock({'input_sequence': inputs}).connect(bilstm)

An `MLPBlock` to get the hidden representation of the sequence:

In [38]:
mlp_block = mm.MLPBlock(
                [64, 32],
                activation='relu',
                no_activation_last_layer=True,
                dropout=DROPOUT,
            )

A multi-class classification prediction head which has:
- L2 normalization (`L2Norm`)
- Weight Tying
- Labels as one-hot encoded vectors, used for label smoothing
- Temperature Scaling to reduce the overconfidence of the model

In [39]:
prediction_call = L2Norm().connect(
    ItemsPredictionWeightTying(schema_model), 
    mm.LabelToOneHot(), 
    LogitsTemperatureScaler(temperature=TEMPERATURE_SCALING)
)

prediction_task = mm.MultiClassClassificationTask(
    target_name="purchase_id_first",
    pre=prediction_call,
)

Now, we connect all the blocks together to build a model.

In [40]:
model_bi_lstm = Model(dense_block, mlp_block, prediction_task)

#### Define the Optimizer

In [42]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE,
    clipnorm=CLIPNORM
)

model_bi_lstm.compile(
    optimizer=optimizer,
    run_eagerly=True,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True, label_smoothing=LABEL_SMOOTHING),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[100])
)

### Model Training

In [44]:
%%time
history = model_bi_lstm.fit(
    train,
    validation_data=valid,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    schema=schema_model,
)

2022-08-16 06:23:00.631926: I tensorflow/stream_executor/cuda/cuda_dnn.cc:379] Loaded cuDNN version 8400


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 6/10
Epoch 7/10
Epoch 8/10




### Model Evaluation

In [45]:
model_bi_lstm.evaluate(valid_ds, batch_size=1024, return_dict=True)



{'loss': 13.174932479858398,
 'recall_at_100': 0.22673992812633514,
 'mrr_at_100': 0.029583772644400597,
 'ndcg_at_100': 0.06532561033964157,
 'map_at_100': 0.029583772644400597,
 'precision_at_100': 0.002267398638650775,
 'regularization_loss': 4.286446571350098}

In [46]:
%%time
predictions = model_bi_lstm.predict(valid, batch_size=1024, verbose=1)
ddf = valid.to_ddf()
ddf = ddf[['session_id', 'purchase_id_first']].compute()
df_rec = generate_recommendations(predictions, ddf)
val_mrr = evaluate(df_rec)['total']
print('MRR: ',val_mrr)

Mask Predictions
Generate Top100 Recommendations
Transform Top100 Recommendations
Model evaluation
MRR:  0.029583778549516854
CPU times: user 50.5 s, sys: 12.6 s, total: 1min 3s
Wall time: 1min 8s
