In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

## 3. Session-based Recommendations with MERLIN

Session-based recommendation, a sub-area of sequential recommendation, has been an important task in online services like e-commerce and news portals, where most users either browse anonymously or may have very distinct interests for different sessions. Session-Based Recommender Systems (SBRS) have been proposed to model the sequence of interactions within the current user session, where a session is a short sequence of user interactions typically bounded by user inactivity. They have recently gained popularity due to their ability to capture short-term and contextual user preferences towards items.

Many methods have been proposed to leverage the sequence of interactions that occur during a session, including session-based k-NN algorithms like V-SkNN [1] and neural approaches like GRU4Rec [2]. In addition, state-of-the-art NLP approaches have inspired RecSys practitioners and researchers to apply the self-attention mechanism and Transformer-based architectures to sequential [3] and session-based recommendation [4, 5].

<img src="./images/sessionbased.png" width=800 height=400/>

In this tutorial, we introduce Merlin Models, an open-source library designed to provide standard models for recommender systems, with high-quality implementations that range from classic machine learning models to advanced deep learning models. Through Merlin Models, we can also import Transformer architectures and their configuration classes from the Hugging Face (HF) Transformers NLP library.

In addition, Merlin Models provides the extra blocks needed for recommendation, e.g., input feature normalization and aggregation, and heads for recommendation and sequence classification/prediction. We also extend the HF Trainer class to allow for evaluation with RecSys metrics.

### Merlin Models
Merlin Models is a library that makes it easy for users in industry or academia to train and deploy recommender models, with best practices baked into the library. It lets users in industry easily train standard models on their own datasets and get high-performance, GPU-accelerated models into production. It also lets researchers build custom models from the standard components of deep learning recommender models, and then benchmark their new models on example offline datasets.

Core features are:
- Unified API enables users to create models in TensorFlow 
- Seamless integration with NVTabular for feature engineering and model serving
- Flexible APIs targeted to both production and research
- Many different recommender system architectures (tabular, two-tower, sequential) or tasks (binary, multi-class classification, multi-task)

**Learning Objectives:**

In this lab, participants will learn how to:
- build an MLP model architecture for the next-item prediction task with a multi-class classification objective
- train and evaluate an MLP model using sequential categorical and continuous input features

### 3.1. Import Libraries

In [2]:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
import glob
import gc
import numpy as np

from merlin.schema.tags import Tags
from merlin.io.dataset import Dataset

import merlin.models.tf as mm
from merlin.models.tf.core.aggregation import SequenceAggregator

import tensorflow as tf

2023-02-23 20:08:45.367666: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.




2023-02-23 20:08:49.565700: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2023-02-23 20:08:49.565799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16249 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:2d:00.0, compute capability: 7.0


In [3]:
seed=42
tf.random.set_seed(seed)
np.random.seed(seed)

Define data path.

In [4]:
DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '/workspace/data/'
)

Read in the train and validation sets as Merlin `Dataset` objects. Note that these datasets have a schema associated with them.

In [5]:
train = Dataset(os.path.join(DATA_FOLDER, "train/*.parquet"))
valid = Dataset(os.path.join(DATA_FOLDER, "valid/*.parquet"))



Define target column.

In [6]:
target = train.schema.select_by_tag(Tags.SEQUENCE).column_names[0]
target

'city_id_list'

### 3.2. Next-item prediction with an MLP Model

We train a Multi-Layer Perceptron (MLP) model using the sequential categorical and continuous features that we created in the previous notebook. We use the last item in the `city_id_list` column as the target. The architecture is shown in the figure below.

<img src="./images/mlp.png" width=600 height=200/>

Let's start with defining our hyper-parameters.

In [7]:
EPOCHS = int(os.environ.get(
    "EPOCHS", 
    '3'
))

dmodel = int(os.environ.get(
    "dmodel", 
    '64'
))

BATCH_SIZE = 1024
LEARNING_RATE = 0.005

#### 3.2.2. Define the Schema object

Merlin Models can infer a neural network architecture from the dataset schema. Remember that we added tags to each feature in the NVTabular workflow. You can now see these tags in `train.schema`. If you want to learn more, we recommend our [Dataset Schema Example](https://github.com/NVIDIA-Merlin/models/blob/main/examples/02-Merlin-Models-and-NVTabular-integration.ipynb).

Below, we define the schema object using the `select_by_name` method, which selects the input columns that we want to train our model with. Alternatively, one can use the `select_by_tag` method to create subschemas.

In [8]:
train.schema = train.schema.select_by_name(['city_id_list','booker_country_list', 'hotel_country_list',
                                            'weekday_checkin_list','weekday_checkout_list',
                                            'month_checkin_list', 'length_of_stay_list', 'num_city_visited']
                                          )

valid.schema = train.schema
schema_model = train.schema

In [9]:
train.schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.domain.min,properties.domain.max,properties.domain.name,properties.value_count.min,properties.value_count.max
0,city_id_list,"(Tags.LIST, Tags.ID, Tags.CATEGORICAL, Tags.SE...","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,39665.0,512.0,0.0,39664.0,city_id,0.0,10.0
1,booker_country_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.booker_country_list.parquet,6.0,16.0,0.0,5.0,booker_country_list,0.0,10.0
2,hotel_country_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.hotel_country_list.parquet,196.0,31.0,0.0,195.0,hotel_country_list,0.0,10.0
3,weekday_checkin_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkin.parquet,13.0,16.0,0.0,12.0,checkin,0.0,10.0
4,weekday_checkout_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkout.parquet,8.0,16.0,0.0,7.0,checkout,0.0,10.0
5,month_checkin_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkin.parquet,13.0,16.0,0.0,12.0,checkin,0.0,10.0
6,length_of_stay_list,"(Tags.LIST, Tags.CONTINUOUS, Tags.SEQUENCE)","DType(name='float64', element_type=<ElementTyp...",True,True,,,,,,,,,,,0.0,10.0
7,num_city_visited,"(Tags.CONTINUOUS, Tags.CONTEXT)","DType(name='float64', element_type=<ElementTyp...",False,False,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,39665.0,512.0,0.0,39664.0,city_id,,


In [10]:
train.schema.column_names

['city_id_list',
 'booker_country_list',
 'hotel_country_list',
 'weekday_checkin_list',
 'weekday_checkout_list',
 'month_checkin_list',
 'length_of_stay_list',
 'num_city_visited']

#### 3.2.3. Build the MLP architecture

We start with a simple model, an MLP, to predict the next city to visit. Our goal is to take a given sequence and predict where a user (traveler) will travel next. Since we train the MLP to predict the last item in the sequence as the target, we use the [SequencePredictLast](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/sequence.py#L255) training approach below.

We build a model from three parts: `input_block` as the input layer, a 2-layer `MLPBlock`, and `prediction_task` as the prediction head. Let's create the input block, which takes the sequential features, concatenates them, and returns the interaction embeddings.

- **InputBlockV2** is the input block for sequential features. Based on a `schema` object and options set by the user, it dynamically creates all the necessary layers (e.g. embedding layers) to encode, normalize, and aggregate categorical and continuous features.

Note that with the `dim` argument we can set a fixed dimension for all categorical features, or pass a dictionary of embedding dimensions to set a different dimension for each categorical feature. In this example, we set the embedding dimension to `16` for every categorical feature.

The `continuous` argument lets us apply an aggregation to continuous list columns. Note that the `length_of_stay_list` input feature is a continuous list column; below we apply mean aggregation to the values of this column for each input sequence.
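
To make the idea of mean aggregation over a ragged list column concrete, here is a minimal pure-Python sketch. It is only a conceptual illustration and not how `SequenceAggregator` is implemented; the helper name `mean_aggregate`, the toy values, and the choice to map an empty sequence to `0.0` are assumptions made for this example.

```python
# Toy ragged batch: each inner list holds one session's length-of-stay values.
# The numbers are made up for illustration.
length_of_stay_batch = [
    [1.0, 3.0, 2.0],
    [4.0],
    [2.0, 2.0, 5.0, 3.0],
]


def mean_aggregate(ragged_batch):
    """Collapse each variable-length sequence to a single mean value.

    Empty sequences map to 0.0 (an assumption made for this sketch).
    """
    return [sum(seq) / len(seq) if seq else 0.0 for seq in ragged_batch]


aggregated = mean_aggregate(length_of_stay_batch)
print(aggregated)  # one scalar per session
```

After aggregation, every session contributes exactly one value for this feature, regardless of how many cities were visited.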

In [11]:
input_block = mm.InputBlockV2(
    schema_model,
    categorical=mm.Embeddings(
        schema_model.select_by_tag(Tags.CATEGORICAL), 
        sequence_combiner="mean",
        dim=16
    ),
    continuous=mm.Continuous(post=SequenceAggregator("mean")),
)

Let's check the output tensor dimension from input block. Let's define a batch and feed that to `input_block`.

In [12]:
batch = mm.sample_batch(train, batch_size=BATCH_SIZE, include_targets=False, to_ragged=True)

In [13]:
input_block(batch).shape

TensorShape([1024, 98])

The output is a 2-D tensor of shape (batch_size, total embedding dimension of the categorical features + one dimension per continuous feature). The six categorical features contribute 6 x 16 = 96 dimensions, and the remaining two come from the continuous input features (`length_of_stay_list` and `num_city_visited`), giving 98.
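
The width of the input block output can be checked with a few lines of arithmetic. The sketch below uses the column names from the schema selected earlier and assumes that each continuous feature contributes exactly one dimension after aggregation.

```python
# Column names taken from the schema selected above.
categorical_columns = [
    "city_id_list", "booker_country_list", "hotel_country_list",
    "weekday_checkin_list", "weekday_checkout_list", "month_checkin_list",
]
continuous_columns = ["length_of_stay_list", "num_city_visited"]

embedding_dim = 16  # the `dim` value passed to mm.Embeddings

categorical_width = len(categorical_columns) * embedding_dim
continuous_width = len(continuous_columns)  # assumed: one scalar per feature
output_width = categorical_width + continuous_width

print(categorical_width, continuous_width, output_width)  # 96 2 98
```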

Define an MLP block with two hidden layers.

In [14]:
mlp_block = mm.MLPBlock(
        [128, dmodel], 
        no_activation_last_layer=True, 
    )
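
For intuition, the following pure-Python sketch shows what a two-layer MLP with no activation on the last layer computes. It is a toy forward pass, not the `MLPBlock` implementation: the layer sizes are shrunk for readability, the weights are random, and ReLU on the first layer is an assumption of this sketch. The helper names `dense`, `relu`, and `random_layer` are hypothetical.

```python
import random


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(inputs @ weights + biases)."""
    outputs = []
    for j in range(len(biases)):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(activation(z) if activation else z)
    return outputs


def relu(z):
    return max(0.0, z)


def random_layer(rng, n_in, n_out):
    """Small random weights and zero biases for a toy layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(0)
n_in, n_hidden, n_out = 6, 8, 4  # toy sizes standing in for 98, 128 and dmodel
w1, b1 = random_layer(rng, n_in, n_hidden)
w2, b2 = random_layer(rng, n_hidden, n_out)

x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]
hidden = dense(x, w1, b1, activation=relu)  # first layer with ReLU (assumed)
output = dense(hidden, w2, b2)              # last layer left linear
print(len(hidden), len(output))
```

Leaving the last layer linear means the output can take negative values, which matches the signed values in the tensor printed by the notebook.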

Similarly, we can check the output of `mlp_block`. The MLP block projects its input to `dmodel` dimensions, since we set the number of units in the second layer to `dmodel` (`64` by default).

In [15]:
mlp_block(input_block(batch))

<tf.Tensor: shape=(1024, 64), dtype=float32, numpy=
array([[ 0.06843954,  0.03452714,  0.00337486, ...,  0.10803191,
        -0.0130931 ,  0.02420143],
       [-0.01964401,  0.01131665, -0.01562893, ...,  0.0337671 ,
        -0.01954725, -0.00809142],
       [ 0.01400696, -0.00550668, -0.07824401, ...,  0.06993526,
        -0.06954388,  0.00429055],
       ...,
       [-0.1097099 , -0.01124749, -0.03097691, ...,  0.23646343,
        -0.05934976, -0.0206476 ],
       [ 0.02554754,  0.00850445,  0.02615668, ...,  0.1427507 ,
        -0.02937096,  0.03578164],
       [ 0.03435782,  0.02314084,  0.02957557, ...,  0.09582867,
        -0.02388796,  0.01308565]], dtype=float32)>

Next, we define the prediction task. Our objective is to predict the city visited at the end of the trip session. This is a multi-class classification task, and the `default_loss` in the `CategoricalOutput` class is `categorical_crossentropy`.

In [16]:
prediction_task = mm.CategoricalOutput(
    schema_model.select_by_name(target), 
    default_loss="categorical_crossentropy"
)
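
As a reminder of what categorical cross-entropy measures, here is a small stand-alone sketch that computes it from raw scores (logits) for a single example with a one-hot target. The function name `softmax_cross_entropy` and the toy logits are made up for illustration; this is not the Keras implementation.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    # -log(softmax(logits)[target_index])
    return log_sum_exp - logits[target_index]


toy_logits = [2.0, 0.5, -1.0, 0.0]  # scores for four hypothetical cities
loss_correct = softmax_cross_entropy(toy_logits, target_index=0)
loss_wrong = softmax_cross_entropy(toy_logits, target_index=2)
uniform_loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=1)

print(loss_correct, loss_wrong, uniform_loss)
```

The loss is small when the target class has the highest score and large when it has a low score; with uniform scores over `n` classes it equals `log(n)`.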

In [17]:
model_mlp = mm.Model(input_block, mlp_block, prediction_task)

The **SequencePredictLast** class prepares sequential inputs and targets for last-item prediction. Note that the last city in each `city_id_list` sequence is reserved as the target and is not part of the input sequence during training.

In [18]:
predict_last = mm.SequencePredictLast(schema=schema_model.select_by_tag(Tags.SEQUENCE), target=target)
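
Conceptually, last-item prediction splits each session into an input prefix and a target. The sketch below shows that split for a single Python list; it is only an illustration of the idea, since the real transform operates on batched tensors. The helper name `split_last_item` and the city ids are hypothetical.

```python
def split_last_item(sequence):
    """Return (inputs, target): all items but the last, and the last item."""
    if len(sequence) < 2:
        raise ValueError("need at least two items to form inputs and a target")
    return sequence[:-1], sequence[-1]


# Hypothetical encoded city ids for one trip.
trip = [17, 4, 256, 31]
inputs, label = split_last_item(trip)
print(inputs, label)  # [17, 4, 256] 31
```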

#### 3.2.4. Metrics

The following information retrieval metrics are used to compute the `Top-k` accuracy of recommendation lists containing all items (see keras [documentation](https://www.tensorflow.org/ranking/api_docs/python/tfr/keras/metrics)):

**Normalized Discounted Cumulative Gain (NDCG@k)**: NDCG accounts for the rank of the relevant item in the recommendation list. It is a more fine-grained metric than hit rate (HR), which only verifies whether the relevant item is among the top-k items.

**Recall@k**: Also known as HitRate@k when there is only one relevant item in the recommendation list. Recall just verifies whether the relevant item is among the top-k items.

**Mean Reciprocal Rank (MRR@k)**: This is the user-averaged value of the Reciprocal Rank@k, which is the inverse rank (one divided by the rank) of the first item among the top-k recommendations that is in the test data (see [reference](https://cran.r-project.org/web/packages/recometrics/recometrics.pdf)).
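
The three metrics can be worked through by hand for the single-relevant-item case used in this notebook. The sketch below is a plain-Python illustration for one session, not the Merlin or TensorFlow Ranking implementation; the function names and the city list are made up for the example.

```python
import math


def rank_of(relevant_item, ranked_items, k):
    """1-based rank of the relevant item within the top-k, or None."""
    top_k = ranked_items[:k]
    return top_k.index(relevant_item) + 1 if relevant_item in top_k else None


def recall_at_k(relevant_item, ranked_items, k):
    return 1.0 if rank_of(relevant_item, ranked_items, k) else 0.0


def mrr_at_k(relevant_item, ranked_items, k):
    rank = rank_of(relevant_item, ranked_items, k)
    return 1.0 / rank if rank else 0.0


def ndcg_at_k(relevant_item, ranked_items, k):
    # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    rank = rank_of(relevant_item, ranked_items, k)
    return 1.0 / math.log2(rank + 1) if rank else 0.0


ranked = ["paris", "rome", "lisbon", "vienna", "oslo"]  # made-up recommendations
metric_fns = (recall_at_k, mrr_at_k, ndcg_at_k)
hit = [fn("lisbon", ranked, 4) for fn in metric_fns]   # relevant item at rank 3
miss = [fn("oslo", ranked, 4) for fn in metric_fns]    # relevant item at rank 5

print(hit)
print(miss)
```

With the relevant city at rank 3, Recall@4 is 1, MRR@4 is 1/3, and NDCG@4 is 1/log2(4) = 0.5. When the relevant city falls outside the top 4, all three metrics are 0. Dataset-level values are averages of these per-session scores.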

The goal of the [Booking.com WSDM WebTour 2021 Challenge](https://ceur-ws.org/Vol-2855/challenge_short_1.pdf) was to predict (and recommend) the final city (`city_id`) of each trip (`utrip_id`). The quality of the predictions is evaluated on the top four recommended cities for each trip (session) using the Top-4 Accuracy metric. The best-performing team in the challenge (the NVIDIA team) achieved an Accuracy@4 of 0.5939, using a blend of Transformers, GRUs, and feed-forward multi-layer perceptrons. With this in mind, we report top-4 metric values (`Recall@4`, `MRR@4`, `NDCG@4`) from the model training and evaluation steps.

In [19]:
from merlin.models.tf.metrics.topk import (
    MRRAt,
    NDCGAt,
    RecallAt,
    TopKMetricsAggregator,
)
metrics_agg = TopKMetricsAggregator(RecallAt(4), MRRAt(4), NDCGAt(4))

Compile and train the model.

In [20]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE,
)

model_mlp.compile(
    optimizer=optimizer,
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
    ),
    metrics=[metrics_agg]
)

In [21]:
%%time
history = model_mlp.fit(
    train,
    epochs=EPOCHS,
    batch_size=512,
    pre=predict_last
)



Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 2min 26s, sys: 20.8 s, total: 2min 47s
Wall time: 1min 20s


Evaluate the model using the validation set.

In [22]:
model_mlp.evaluate(
    valid,
    batch_size=BATCH_SIZE,
    pre=predict_last,
    return_dict=True
)



{'loss': 3.5418004989624023,
 'recall_at_4': 0.5628074407577515,
 'mrr_at_4': 0.40140968561172485,
 'ndcg_at_4': 0.44224879145622253,
 'regularization_loss': 0.0,
 'loss_batch': 3.611255645751953}

In [23]:
del model_mlp
gc.collect()

135

### Summary

In this example, we focused on concepts that are relevant for a broad range of recommender system use cases, in particular the session-based recommendation task. We started with a simple architecture, an MLP model, and used the high-level APIs provided by the Merlin Models library. Note that we did not perform hyper-parameter tuning, set an optimizer scheduler, or apply other techniques to improve model accuracy; all of these can be applied on your end. For this tutorial we keep it simple and demonstrate how the building blocks of the Merlin Models library can be used to create custom architectures for session-based recommendation models.

In the next notebook, we will build an RNN-based model, and see if it can help us improve our accuracy metrics.

### References

[1] Malte Ludewig and Dietmar Jannach. 2018. Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction 28, 4-5 (2018), 331–390.<br>
[2] Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of the 27th ACM international conference on information and knowledge management. 843–852.<br>
[3] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450. <br>
[4] Shiming Sun, Yuanhe Tang, Zemei Dai, and Fu Zhou. 2019. Self-attention network for session-based recommendation with streaming data input. IEEE Access 7 (2019), 110499–110509.<br>
[5] Gabriel de Souza P. Moreira, et al. 2021. Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based Recommendation. In RecSys '21.

Please execute the cell below to shut down the kernel before moving on to the next notebook `04-Next-item-prediction-with-LSTM`.

In [24]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}