In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

## 5. Next item prediction with a Transformer-based model

In recent years, several deep learning-based algorithms have been proposed for recommendation systems, and their adoption in industry deployments has grown steeply. In particular, NLP-inspired approaches have been successfully adapted to sequential and session-based recommendation problems, which are important in many domains such as e-commerce, news, and streaming media. Session-Based Recommender Systems (SBRS) model the sequence of interactions within the current user session, where a session is a short sequence of user interactions typically bounded by user inactivity. They have recently gained popularity because they can capture short-term or contextual user preferences towards items.

The field of NLP has evolved significantly within the last decade, particularly due to the increased use of deep learning. As a result, state-of-the-art NLP approaches have inspired RecSys practitioners and researchers to adapt those architectures, especially for sequential and session-based recommendation problems. Here, we use one of the state-of-the-art Transformer-based architectures, XLNet, trained with the Causal Language Modeling (CLM) technique on a multi-class classification task. For this, we leverage the popular HuggingFace Transformers NLP library, which makes it possible to experiment with cutting-edge implementations of such architectures for sequential and session-based recommendation problems.

### 5.1.1. What is a Transformer?
The Transformer is a competitive alternative to models based on Recurrent Neural Networks (RNNs) for a range of sequence modeling tasks. The Transformer architecture [1] was introduced in the NLP domain to solve sequence-to-sequence tasks, relying entirely on the self-attention mechanism to compute representations of its input and output. The Transformer outperforms RNNs thanks to three mechanisms:

- Non-sequential: Transformer computations are parallelized across the sequence, whereas RNN computations are inherently sequential. This results in a significant speed-up in training time.
- Self-attention mechanisms: Transformers rely entirely on self-attention mechanisms that directly model relationships between all item IDs in a sequence.
- Positional encodings: A representation of the position of each item in a sequence, which gives the model information about item order.
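
To make the self-attention bullet concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention as described in [1]. The toy sizes and random weights are assumptions chosen for illustration, and this is not how the HuggingFace XLNet layers are implemented internally. Note that all positions are processed in a single matrix product, which is what makes the computation non-sequential.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) item embeddings; w_*: (d_model, d_head) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4            # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (5, 4)
print(weights.shape)  # (5, 5)
```

Each row of `weights` says how much one position attends to every other position in the sequence.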

<center><img src="./images/rnn_transformers.png" width=600 height=200/></center>

**Learning Objectives:**
- Train and evaluate a Transformer-based model (XLNet) for the next-item prediction task
- Apply the weight-tying technique

In [2]:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
import gc
import numpy as np

import tensorflow as tf



The next cell sets all random seeds for the program (Python, NumPy, and TensorFlow), which helps make the Keras program deterministic.

In [3]:
seed=42
tf.keras.utils.set_random_seed(
    seed
)

In [4]:
from merlin.schema.tags import Tags
from merlin.io.dataset import Dataset
import merlin.models.tf as mm




In [5]:
DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '/workspace/data/'
)

In [6]:
train = Dataset(os.path.join(DATA_FOLDER, "train/*.parquet"))
valid = Dataset(os.path.join(DATA_FOLDER, "valid/*.parquet"))



In [7]:
target = train.schema.select_by_tag(Tags.SEQUENCE).column_names[0]
target

'city_id_list'

In [8]:
EPOCHS = int(os.environ.get(
    "EPOCHS", 
    '3'
))

dmodel = int(os.environ.get(
    "dmodel", 
    '64'
))

BATCH_SIZE = 1024
LEARNING_RATE = 0.003

### 5.1.2. Building an XLNet Model with the Merlin Models TensorFlow API

The Merlin Models Transformer API wraps the HuggingFace transformer layers in a Merlin Models Block class called `TransformerBlock` and offers different pre-training approaches to train and evaluate the model on RecSys data.

Using the Merlin Transformer API, you can define your Transformer-based model with common RecSys techniques such as negative sampling, top-k candidate generation, and weight tying.

The API consists of three main steps: 

- **Inputs preparation:** Specialized pre-processing blocks prepare the embeddings expected by the HuggingFace transformer layer. This includes generating mask information at inference (if needed), converting ragged inputs to dense tensors, and preparing the dictionary inputs required by the HF layer.

- **Target generation:** Generate targets from the input sequence of candidate IDs using [SequenceTransform](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/sequence.py#L77) instances based on the training and evaluation strategy.

- **Output post-processing:** Specialized post-processing blocks select the relevant information from the HF layer's output, convert the output hidden representation to a RaggedTensor, and summarize the sequence of hidden vectors into one vector representing the entire input sequence.


The diagram below gives a high-level view of the building blocks of the Merlin Transformers API.

<center><img src="./images/TransformerAPI.png" width=600 height=200/></center>

Let's visualize the workflow inside of the `TransformerBlock`.



<center><img src="./images/transformerblock.png" width=800 height=200/></center>

Now, let's read in the train and validation sets as Merlin Dataset objects. Note that these datasets have a schema associated with them.

In [9]:
train = Dataset(os.path.join(DATA_FOLDER, "train/*.parquet"))
valid = Dataset(os.path.join(DATA_FOLDER, "valid/*.parquet"))

In [10]:
train.schema = train.schema.select_by_name(['city_id_list','booker_country_list', 'hotel_country_list',
                                            'weekday_checkin_list','weekday_checkout_list',
                                            'month_checkin_list','num_city_visited', 'length_of_stay_list']
                                          )

In [11]:
seq_schema = train.schema.select_by_tag(Tags.SEQUENCE)

In [12]:
context_schema = train.schema.select_by_tag(Tags.CONTEXT)

In [13]:
target_schema = train.schema.select_by_tag(Tags.ITEM_ID)
target = target_schema.column_names[0]
target

'city_id_list'

In [14]:
mlp_block = mm.MLPBlock(
                [128,dmodel],
                activation='relu',
                no_activation_last_layer=True,
            )

Define the `input_block`.

In [15]:
input_block = mm.InputBlockV2(
    train.schema,    
    embeddings=mm.Embeddings(
        seq_schema.select_by_tag(Tags.CATEGORICAL), 
        sequence_combiner=None,
        dim=dmodel
        ),
    post=mm.BroadcastToSequence(context_schema, seq_schema),
)

We can check the output shape of the input block using a batch.

In [16]:
batch = mm.sample_batch(train, batch_size=128, include_targets=False, to_ragged=True)
print(input_block(batch).shape)

(128, None, 386)


Let's create a sequential block that connects the sequential input block with an MLPBlock and then an XLNetBlock (a `SequentialBlock` chains a sequence of Keras layers). The XLNet architecture [2] was originally proposed to be trained with the Permutation Language Modeling (PLM) technique, which combines the advantages of autoregressive (Causal LM) and autoencoding (Masked LM) approaches. However, the Merlin Models TF API lets us decouple the model architecture from the masking approach. In this example, we perform next-item prediction with the Causal Language Modeling (CLM) approach: an autoregressive model with sliding-window predictions, where only the items up to position `n` (the left context) are used to predict the target at position `n+1`.
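
The "left context only" constraint of CLM can be pictured as a lower-triangular attention mask. The following NumPy sketch is an illustration of that idea under toy assumptions (uniform scores, a sequence of length 4). How the XLNet layers build and apply their masks internally is not shown here.

```python
import numpy as np

def causal_mask(seq_len):
    # True where position i (row) may attend to position j (column), i.e. j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive zero attention weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
mask = causal_mask(seq_len)
# With all-zero scores, row i spreads its weight uniformly over positions 0..i.
weights = masked_softmax(np.zeros((seq_len, seq_len)), mask)
print(mask.astype(int))
print(weights.round(2))
```

Every entry above the diagonal of `weights` is zero, so no position can look at items that come after it.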

Below, we use an MLPBlock as a projection block so that the output dimension of the input block matches the hidden dimension (`d_model`) of the XLNetBlock. This matters because of the residual connections in the Transformer model: a layer's input is added to its output, so both must have the same dimension. In the batch above, the input block outputs 386 features per position, which the MLPBlock projects down to `dmodel` (64 by default).
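
As a rough illustration of why the dimensions must match, the NumPy sketch below projects a 386-dimensional input to 64 dimensions and then applies a residual addition. The single linear projection and the `toy_sublayer` function are hypothetical stand-ins for the MLPBlock and a Transformer sub-layer, not the Merlin implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, input_dim, d_model = 2, 3, 386, 64  # 386 and 64 as in this notebook

def toy_sublayer(h, w):
    # Stand-in for an attention or feed-forward sub-layer mapping d_model -> d_model.
    return np.tanh(h @ w)

x = rng.normal(size=(batch, seq_len, input_dim))        # like the input block output
w_proj = rng.normal(size=(input_dim, d_model)) * 0.05   # hypothetical linear projection
w_sub = rng.normal(size=(d_model, d_model)) * 0.05

h = x @ w_proj                      # (2, 3, 64): now matches d_model
out = h + toy_sublayer(h, w_sub)    # residual connection: shapes agree
print(h.shape, out.shape)           # (2, 3, 64) (2, 3, 64)

try:
    x + toy_sublayer(h, w_sub)      # (2, 3, 386) + (2, 3, 64) cannot be added
    residual_ok = True
except ValueError:
    residual_ok = False
print(residual_ok)                  # False
```

Without the projection, the residual addition fails because the last dimensions (386 and 64) do not match.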

Here we instantiate an [XLNet block](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transformers/block.py#L399) by setting the parameters (e.g., d_model, n_head, n_layer, etc.). You can learn more about these parameters [here](https://huggingface.co/docs/transformers/model_doc/xlnet).

- d_model:  Dimensionality of the encoder layers and the pooler layer.
- n_head:  Number of attention heads for each attention layer in the Transformer encoder.
- n_layer: Number of hidden layers in the Transformer encoder.

In [17]:
xlnet_block = mm.XLNetBlock(d_model=dmodel, n_head=4, n_layer=2)

In [18]:
dense_block = mm.SequentialBlock(
    input_block,
    mlp_block,
    xlnet_block
)

In [19]:
mlp_block2 = mm.MLPBlock(
                [128,dmodel],
                activation='relu',
                no_activation_last_layer=True,
            )

The [CategoricalOutput](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/outputs/classification.py#L114) class supports weight tying when we pass the EmbeddingTable of the target feature through the `to_call` argument.

**Weight tying:** Sharing the weight matrix between the input embedding layer and the output softmax layer. That is, instead of using two weight matrices, we use a single one. The intuition is that fewer parameters reduce the risk of overfitting, so weight tying can be considered a form of regularization.
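
The idea can be sketched in a few lines of NumPy: the same table is used once for the embedding lookup and once, transposed, to score every item. The sizes, the random `hidden` vector, and the variable names are assumptions for illustration only. This is not the `CategoricalOutput` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d_model = 10, 4                   # toy sizes

# One shared matrix: a row per item.
embedding_table = rng.normal(size=(num_items, d_model))

# Input side: look up embeddings for a session of item IDs.
item_ids = np.array([3, 7, 1])
inputs = embedding_table[item_ids]           # (3, 4)

# Output side: score every item by reusing the same matrix, transposed.
hidden = rng.normal(size=(d_model,))         # stand-in for the final hidden state
logits = hidden @ embedding_table.T          # (10,)

probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over all items

untied_params = 2 * num_items * d_model      # separate input and output matrices
tied_params = num_items * d_model            # one shared matrix
print(inputs.shape, logits.shape)            # (3, 4) (10,)
print(untied_params, tied_params)            # 80 40
```

With tying, the item-related parameter count is halved, which is where the regularization effect comes from.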

In [20]:
item_id_name = train.schema.select_by_tag(Tags.ITEM_ID).first.properties['domain']['name']
print(item_id_name)

city_id


In [21]:
prediction_task = mm.CategoricalOutput(
    to_call=input_block["categorical"][item_id_name],
)

In [22]:
model_transformer = mm.Model(dense_block, mlp_block2, prediction_task)

In [23]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE,
)

#### 5.1.2.1. Next-item prediction with Causal Language Modeling (CLM)

To train our XLNet architecture with the CLM technique and then evaluate it, we use two sequence transform classes: `SequencePredictNext` and `SequencePredictLast`.

[SequencePredictNext](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/sequence.py): Prepares sequential inputs and targets for next-item prediction. The target is extracted from the shifted sequence of item IDs, and the sequential input features are truncated at the last position. With this training technique, we are able to train the XLNet model with the Causal Language Modeling (CLM) approach.

[SequencePredictLast](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/sequence.py): Prepares sequential inputs and targets for last-item prediction. The target is extracted from the last element of the sequence of item IDs, and the sequential input features are truncated before the last position.
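
The difference between the two transforms is easiest to see on a single toy session. The plain-Python functions below are a hypothetical illustration of the descriptions above. They are not the Merlin classes, which operate on batched (ragged) tensors and on all sequential features.

```python
def predict_next(item_ids):
    """Inputs drop the last item; targets are the sequence shifted left by one."""
    return item_ids[:-1], item_ids[1:]

def predict_last(item_ids):
    """Inputs drop the last item; the single target is the last item."""
    return item_ids[:-1], item_ids[-1]

session = [10, 20, 30, 40]    # hypothetical city IDs visited in one trip
print(predict_next(session))  # ([10, 20, 30], [20, 30, 40])
print(predict_last(session))  # ([10, 20, 30], 40)
```

In the next-item setup every position contributes a training target, while the last-item setup yields exactly one target per session, which suits evaluation.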

In [24]:
%%time
model_transformer.compile(
    run_eagerly=False, optimizer=optimizer, loss="categorical_crossentropy",
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[4]),
)
# SequencePredictNext builds next-item targets from the item-id sequence (CLM training).
predict_next = mm.SequencePredictNext(schema=seq_schema, target=target, transformer=xlnet_block)
model_transformer.fit(train, batch_size=512, epochs=3, pre=predict_next)



Epoch 1/3




Epoch 2/3
Epoch 3/3
CPU times: user 1min 27s, sys: 13.5 s, total: 1min 41s
Wall time: 1min 26s


<keras.callbacks.History at 0x7f7ee9e0d3a0>

We will mask the last item using `SequencePredictLast` and perform evaluation.

In [25]:
predict_last = mm.SequencePredictLast(schema=seq_schema, target=target, transformer=xlnet_block)

In [26]:
valid.schema = train.schema

In [27]:
model_transformer.evaluate(
    valid,
    batch_size=1024,
    pre=predict_last,
    return_dict=True
)



{'loss': 3.576197624206543,
 'recall_at_4': 0.542291522026062,
 'mrr_at_4': 0.37504497170448303,
 'ndcg_at_4': 0.417227566242218,
 'map_at_4': 0.37504497170448303,
 'precision_at_4': 0.1355728805065155,
 'regularization_loss': 0.0,
 'loss_batch': 3.534959077835083}
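
To help read these numbers, here is a small pure-Python sketch of top-k metrics for the case where each example has exactly one relevant item (the held-out last item). The example rankings are made up, and the Merlin metric implementations may differ in detail. With a single relevant item, precision@k equals recall@k divided by k and average precision equals the reciprocal rank, which is consistent with the reported values (`precision_at_4` is `recall_at_4 / 4`, and `map_at_4` equals `mrr_at_4`).

```python
def recall_at_k(ranked_items, target, k):
    """1.0 if the single true item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def mrr_at_k(ranked_items, target, k):
    """Reciprocal of the 1-based rank of the true item within the top-k, else 0.0."""
    top_k = ranked_items[:k]
    return 1.0 / (top_k.index(target) + 1) if target in top_k else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical (ranked predictions, true next item) pairs, not taken from the dataset.
examples = [
    ([5, 2, 9, 1], 2),   # hit at rank 2
    ([7, 3, 8, 4], 6),   # miss
    ([1, 6, 2, 3], 1),   # hit at rank 1
]
k = 4
recall = mean(recall_at_k(r, t, k) for r, t in examples)
mrr = mean(mrr_at_k(r, t, k) for r, t in examples)
precision = recall / k   # one relevant item per example

print(round(recall, 4), round(mrr, 4), round(precision, 4))  # 0.6667 0.5 0.1667
```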

### Summary

Congratulations on finishing this tutorial. We have walked you through how to tackle a next-item prediction task using a publicly available dataset. We hope you gained hands-on experience that you can take back to your organization to build custom, accelerated session-based recommender models.

We demonstrated how to start with data analysis, then prepare and transform the data and create new features, and finally build, train, and evaluate models on the prepared sequential input dataset.

We introduced the NVIDIA Merlin framework, particularly the [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) and [Models](https://github.com/NVIDIA-Merlin/models) libraries. The Merlin Models session-based TF API is an open-source library designed to help the RecSys community quickly and easily explore ML models and the latest NLP developments for sequential and session-based recommendation tasks. We saw how easy it is to build models for session-based tasks using the Models high-level APIs.

Note that we did not explore hyperparameter tuning or extensive feature engineering. The following additional techniques can be applied to improve the accuracy metrics:

- Data augmentation: in the [WSDM'21 Booking challenge](https://web.ec.tuwien.ac.at/webtour21/), we used different techniques to augment the training dataset. These techniques are specific to the dataset, so we did not include them in this tutorial.
- Creating additional features.
- Hyperparameter search: we can run multiple HPO jobs to find the best hyperparameters. The simplest approach could be tuning `learning_rate`, the number of epochs, and `batch_size`.
- Adding regularization techniques, or model calibration techniques such as label smoothing or temperature scaling.
- Ensembling multiple models.
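
As one example from the list above, label smoothing replaces the one-hot target with a slightly softened distribution. The NumPy sketch below shows the standard formula with an assumed `epsilon` of 0.1 and toy class counts. It was not part of the experiments in this tutorial. In Keras, the `label_smoothing` argument of `tf.keras.losses.CategoricalCrossentropy` applies the same idea.

```python
import numpy as np

def smooth_labels(target_ids, num_classes, epsilon=0.1):
    """Move `epsilon` of the probability mass from the true class to a uniform prior."""
    one_hot = np.eye(num_classes)[target_ids]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

# Two toy targets over four classes.
smoothed = smooth_labels(np.array([2, 0]), num_classes=4, epsilon=0.1)
print(smoothed)
```

Each row still sums to 1: the true class gets 0.925 and every other class gets 0.025.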

Please execute the cell below to shut down the kernel.

In [28]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### References

[1] Vaswani, A., et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).<br>
[2] Understanding XLNet, BorealisAI. Available online: https://www.borealisai.com/en/blog/understanding-xlnet/<br>
[3] Moreira, G. de S. P., et al. (2021). Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based Recommendation. RecSys'21.<br>
[4] Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.