In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# 1. Introduction

In the previous notebook, we went through our ETL pipeline with the NVTabular library and created session-based features to be used in training a session-based recommendation model. In this notebook, we will learn how to:

- Accelerate data loading of multiple features in PyTorch using the NVTabular library
- Train and evaluate an RNN-based (GRU) session-based recommendation model
- Train and evaluate a Transformer-based (XLNET-MLM) session-based recommendation model
- Integrate side information (additional features) into transformer architectures in order to improve recommendation accuracy

# 2. Session-based Recommendation

Session-based recommendation, a sub-area of sequential recommendation, has been an important task in online services like e-commerce and news portals, where most users either browse anonymously or may have very distinct interests for different sessions. Session-Based Recommender Systems (SBRS) have
been proposed to model the sequence of interactions within the current user session, where a session is a short sequence of user interactions typically bounded by user inactivity. They have recently gained popularity due to their ability to capture short-term and contextual user preferences towards items.
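
To make the notion of a session concrete, here is a minimal sketch of splitting a stream of events into sessions by inactivity. The helper `split_into_sessions` and the 30-minute threshold are assumptions chosen for illustration; they are not part of Transformers4Rec or of the ETL pipeline from the previous notebook.

```python
from typing import List, Tuple


def split_into_sessions(
    events: List[Tuple[int, str]], max_gap_seconds: int = 1800
) -> List[List[str]]:
    """Group (timestamp, item_id) events into sessions.

    A new session starts whenever the gap to the previous event exceeds
    max_gap_seconds (a hypothetical 30-minute threshold by default).
    """
    sessions: List[List[str]] = []
    previous_ts = None
    for ts, item_id in sorted(events):
        if previous_ts is None or ts - previous_ts > max_gap_seconds:
            sessions.append([])  # inactivity gap: open a new session
        sessions[-1].append(item_id)
        previous_ts = ts
    return sessions


events = [(0, "a"), (60, "b"), (120, "c"), (5000, "d"), (5030, "e")]
sessions = split_into_sessions(events)
print(sessions)  # [['a', 'b', 'c'], ['d', 'e']]
```

Each inner list is one session, which is the unit that the models in this notebook consume.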


Many methods have been proposed to leverage the sequence of interactions that occur during a session, including session-based k-NN algorithms like V-SkNN [1] and neural approaches like GRU4Rec [2]. In addition, state-of-the-art NLP approaches have inspired RecSys practitioners and researchers to leverage the self-attention mechanism and Transformer-based architectures for sequential and session-based recommendation [3].

# 3. Transformers4Rec Library

In this tutorial, we introduce an open source library, [Transformers4Rec](https://github.com/NVIDIA-Merlin/Transformers4Rec), which leverages the popular [HuggingFace’s Transformers](https://github.com/huggingface/transformers) NLP library and makes it possible to experiment with cutting-edge implementations of such architectures for sequential and session-based recommendation problems.

Transformers4Rec supports multiple input features and provides configurable building blocks that can be easily combined for custom architectures:

- [TabularSequenceFeatures](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/features/sequence.py#L74) class that reads from schema and creates an input block. This input module combines different types of features (continuous, categorical & text) to a sequence.
- [MaskSequence](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/masking.py#L28) to define masking schema and prepare the masked inputs and labels for the selected LM task.
- [TransformerBlock](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/block/transformer.py#L38) class that supports HuggingFace Transformers for session-based and sequential-based recommendation models.
- [SequentialBlock](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/block/base.py#L61) that creates the body by mimicking [torch.nn.sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) class. It is designed to define our model as a sequence of layers.
- [Head](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/model/head.py) where we define the prediction task of the model.
- [NextItemPredictionTask](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/model/head.py#L236) that is the class to support next item prediction task.
- [Trainer](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/trainer.py#L34) manages the model training and evaluation.

In Figure 1, we present the Transformers4Rec meta-architecture that we train for next-item prediction in this tutorial. Although we can use `product-id` as the only input feature, the figure shows both categorical and numerical features. That means we can also use side information when training a Transformers4Rec model, which we will do in Section 3.2.4.

<div align="center"><img src="images/tf4rec_meta.png", width=500, height=500></div>
<p><center>Figure 1. Transformers4Rec meta-architecture.</center></p>

From HF Transformers, Transformers4Rec leverages only the transformer architecture building blocks and their configuration classes. Transformers4Rec provides the additional blocks necessary for recommendation, e.g., input feature normalization and aggregation, and heads for recommendation and sequence classification/prediction. We also extend their Trainer class to allow for evaluation with RecSys metrics.

In the `Features Processing Module` of the meta-architecture, the input features are processed. Categorical features are represented by embeddings. Numerical features can be represented as a scalar, projected by a fully-connected (FC) layer to multiple dimensions, or represented as a weighted average of embeddings by the technique Soft One-Hot embeddings (more info in our [paper's online appendix](https://github.com/NVIDIA-Merlin/publications/blob/main/2021_acm_recsys_transformers4rec/Appendices/Appendix_A-Techniques_used_in_Transformers4Rec_Meta-Architecture.md)).
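
As a rough illustration of the "weighted average of embeddings" idea, the sketch below turns a scalar into softmax weights over a small embedding table and averages the rows. The way the weights are produced here (a per-row linear projection followed by a softmax) is an assumption for illustration, as are all names and values; see the linked appendix for the actual technique.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_one_hot_embedding(
    value: float,
    proj_weight: List[float],
    proj_bias: List[float],
    embedding_table: List[List[float]],
) -> List[float]:
    """Represent a scalar as a softmax-weighted average of embedding rows."""
    # One score per embedding row (assumed linear projection of the scalar).
    scores = [w * value + b for w, b in zip(proj_weight, proj_bias)]
    weights = softmax(scores)
    dim = len(embedding_table[0])
    return [
        sum(weights[i] * embedding_table[i][d] for i in range(len(weights)))
        for d in range(dim)
    ]


table = [[1.0, 0.0], [0.0, 1.0]]  # 2 embedding rows of dimension 2
embedding = soft_one_hot_embedding(0.0, [1.0, -1.0], [0.0, 0.0], table)
print(embedding)  # [0.5, 0.5]
```

With an input of 0.0 both rows receive the same score, so the result is the plain average of the two rows.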

The features are optionally normalized (with layer normalization) and then aggregated. The current feature aggregation options are: `concat`, `stack`,
`Element-wise sum`  and  `Element-wise sum & item multiplication`.  You can learn more about these aggregation methods [here](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/docs/source/core_features.md).
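
The difference between two of these options can be sketched for a single interaction as follows. This is a plain-Python illustration with made-up vectors, not the library's implementation: `concat` grows the output width, while the element-wise sum requires all feature vectors to share the same width.

```python
from typing import List

Vector = List[float]


def concat(features: List[Vector]) -> Vector:
    """Concatenate per-feature vectors; output width is the sum of the widths."""
    return [value for vector in features for value in vector]


def elementwise_sum(features: List[Vector]) -> Vector:
    """Sum per-feature vectors; all vectors must share the same width."""
    width = len(features[0])
    if any(len(vector) != width for vector in features):
        raise ValueError("element-wise sum needs equal-width vectors")
    return [sum(values) for values in zip(*features)]


item_emb = [0.5, -1.0]      # hypothetical item-id embedding
category_emb = [1.5, 2.0]   # hypothetical category embedding
print(concat([item_emb, category_emb]))           # [0.5, -1.0, 1.5, 2.0]
print(elementwise_sum([item_emb, category_emb]))  # [2.0, 1.0]
```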

The other blocks of the meta-architecture are explained in the following sections.

## 3.1 Training an RNN-based Session-based Recommendation Model

In this section, we use a type of Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU) [4] architecture, to do next-item prediction using the sequence of events per user in a given session. The order of events carries information that we want to capture in order to make accurate and relevant recommendations. The input of the GRU network is the current state of the session, while the output is the next item to be interacted with (e.g., clicked, viewed, or purchased) in the session. Basically, for each item in a given session, we generate the output as the predicted preference over the items, i.e., the likelihood of each being the next.


Figure 2 illustrates the logic of predicting the next item in a given session. We treat the recommendation as a multi-class classification problem and use cross-entropy loss. In our first example, we use a GRU architecture instead of the `Transformer block` shown in Figure 1.

<div align="center"><img src="images/gru_based.png", width=600, height=600></div>
<p><center>Figure 2. Next item prediction with RNN.</center></p>
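
Treating next-item prediction as multi-class classification means that each item in the catalogue is one class. The sketch below computes the cross-entropy loss for one prediction over a toy catalogue of four items; the logits are made-up numbers, and the code is a plain-Python illustration rather than the loss implementation used by the library.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log-probability of the target item under a softmax."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


# Scores over a toy catalogue of 4 items; the true next item is index 2.
logits = [0.1, 0.3, 2.0, -1.0]
loss = cross_entropy(logits, target_index=2)

# A model that scores all items equally gets a loss of ln(number of items).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
print(round(uniform_loss, 4))  # 1.3863, i.e. ln(4)
```

The loss shrinks as the model assigns more probability mass to the item that was actually interacted with next.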

### 3.1.1 Import Libraries and Modules

In [1]:
import os
import glob

import torch 
import transformers4rec.torch as tr

from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt

- Create the Schema object from `schema` file.

In [2]:
from merlin_standard_lib import Schema
# Define schema object to pass it to the TabularSequenceFeatures class
SCHEMA_PATH = 'schema_tutorial.pb'
schema = Schema().from_proto_text(SCHEMA_PATH)
schema = schema.select_by_name(['product_id-list_seq'])

The Transformers4Rec library relies on a schema object to create the `TabularSequenceFeatures` input block. As you can see below, `schema.pb` is a protobuf text file that contains metadata, including statistics about features such as cardinality and min and max values, and also tags each feature based on its characteristics and dtype (e.g., categorical, continuous, list, integer). We can tag our target column and even add the prediction task, such as `binary`, `regression` or `multiclass`, as a tag for the target column in the `schema.pb` file. `schema.pb` provides a standard representation of metadata that is useful when training machine learning or deep learning models.

The metadata loaded from `Schema` and its tags are used to automatically set the parameters of Transformers4Rec models. Certain Transformers4Rec modules have a `from_schema` method to instantiate their parameters and layers from the protobuf text file.

Although in this tutorial we did not automatically generate schema file from the NVTabular workflow, with the new NVTabular release we are able to define a standard and generic protobuf schema.

Let's view the content of `schema.pb`

In [3]:
!head -30 $SCHEMA_PATH

feature {
  name: "user_session"
  type: INT
  int_domain {
    name: "user_session"
    min: 1
    max: 1877365
    is_categorical: false
  }
  annotation {
    tag: "groupby_col"
  }
}
feature {
  name: "category_id-list_seq"
  value_count {
    min: 2
    max: 20
  }
  type: INT
  int_domain {
    name: "category_id-list_seq"
    min: 1
    max: 566
    is_categorical: true
  }
  annotation {
    tag: "list"
    tag: "categorical"
    tag: "item"


Initialize the sequential tabular module that converts the inputs and aggregates them into a single interaction tensor. We use the `sequential-concat` aggregation method here. In the input block, we also define the masking method (see Section 3.2.2 for details).


We define our input block using the `TabularSequenceFeatures` class. The `from_schema` method parses the schema directly, accepts categorical and continuous sequential inputs, and supports data augmentation, data aggregation (sequential-concat and elementwise-sum), the projection of the interaction embeddings, and the masking tasks. The `max_sequence_length` argument defines the maximum sequence length of our sequential input.

In [4]:
sequence_length, d_model = 20, 128
inputs = tr.TabularSequenceFeatures.from_schema(
        schema,
        max_sequence_length= sequence_length,
        masking = 'causal',
    )
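
To give an intuition for what a fixed maximum sequence length implies, the sketch below brings variable-length sessions to a common length. The exact padding and truncation behavior of the library and the dataloader is not shown in this notebook, so the helper `pad_or_truncate`, the right-padding, and the padding id 0 are all assumptions for illustration.

```python
from typing import List


def pad_or_truncate(item_ids: List[int], max_length: int, pad_id: int = 0) -> List[int]:
    """Return exactly max_length ids: keep the first items, right-pad with pad_id."""
    clipped = item_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


print(pad_or_truncate([5, 9, 3], max_length=5))           # [5, 9, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=5))  # [1, 2, 3, 4, 5]
```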

- Define a SequentialBlock

In [5]:
body = tr.SequentialBlock(
        inputs,
        tr.MLPBlock([d_model]),
        tr.Block(torch.nn.GRU(input_size=d_model, hidden_size=d_model, num_layers=1), [None, 20, d_model])
)

In our experiments published in our [ACM RecSys'21 paper](https://github.com/NVIDIA-Merlin/publications/blob/main/2021_acm_recsys_transformers4rec/recsys21_transformers4rec_paper.pdf) [7], we used the next item prediction head. It was composed of an output layer using the tying embeddings technique, i.e., weight-tying the projection layer to the item embedding matrix weights, followed by a softmax layer to predict the relevance scores over all items.

Weight-tying, also known as `Tying Embeddings`, was originally proposed by the NLP community to tie the weights of the input (item id) embedding matrix to the output projection layer. It has been shown to be a very effective technique in extensive experimentation for competitions and empirical analysis (for more details see our [paper](https://github.com/NVIDIA-Merlin/publications/blob/main/2021_acm_recsys_transformers4rec/recsys21_transformers4rec_paper.pdf) and its online [appendix](https://github.com/NVIDIA-Merlin/publications/blob/main/2021_acm_recsys_transformers4rec/Appendices/Appendix_A-Techniques_used_in_Transformers4Rec_Meta-Architecture.md)). You can enable this option as follows.

- We link the body (GRU-based in this section) to the inputs and the prediction task to get the final PyTorch `Model` class.

In [6]:
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True, hf_format=True, 
                              metrics=[NDCGAt(top_ks=[10, 20], labels_onehot=True),  RecallAt(top_ks=[10, 20], labels_onehot=True)]),
    inputs=inputs,
)
model = tr.Model(head)

Projecting inputs of NextItemPredictionTask to'64' As weight tying requires the input dimension '128' to be equal to the item-id embedding dimension '64'
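
The message above reflects a constraint of weight tying: the vector fed to the output layer must have the same width as the item embeddings, so the body output is first projected down. The sketch below illustrates the idea with toy sizes (4 and 2 standing in for 128 and 64). All matrices and names are made up for illustration; this is not the library's implementation.

```python
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Item embedding table: 3 items, embedding dimension 2.
item_embeddings: Matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Projection from the body output width (4) to the embedding width (2).
projection: Matrix = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

hidden = [2.0, -1.0, 0.5, 0.5]               # body output for one position
projected = matvec(projection, hidden)       # now matches the embedding width
logits = matvec(item_embeddings, projected)  # output layer reuses the embedding table
print(projected)  # [2.0, -1.0]
print(logits)     # [2.0, -1.0, 1.0]
```

Because the output scores are dot products with the rows of the embedding table, no separate output weight matrix over the item catalogue has to be learned.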


- Initialize the Dataloader

We use the optimized NVTabular PyTorch dataloader, which has the following benefits:
- removing bottlenecks from data loading by processing large chunks of data at a time instead of item by item
- processing datasets that don’t fit within the GPU or CPU memory by streaming from the disk
- reading data directly into the GPU memory and removing CPU-GPU communication
- preparing batches asynchronously on the GPU to avoid CPU-GPU communication
- supporting commonly used formats such as parquet

In [7]:
# import NVTabular dependencies
from nvtabular import Dataset as NVTDataset
from nvtabular.loader.torch import DLDataLoader
from nvtabular.loader.torch import TorchAsyncItr as NVTDataLoader


x_cat_names, x_cont_names = ['product_id-list_seq'], []

#   dictionary representing max sequence length for column
sparse_features_max = {
    fname: sequence_length
    for fname in x_cat_names + x_cont_names
}

# define collate function
def dataloader_collate_dict(inputs):
    # Gets only the features dict
    inputs = inputs[0][0]
    return inputs

class DLDataLoaderWrapper(DLDataLoader):
    def __init__(self, *args, **kwargs) -> None:
        if "batch_size" in kwargs:
            self._batch_size = kwargs.pop("batch_size")
        super().__init__(*args, **kwargs)


**Daily Fine-Tuning: Training over a time window**

Now that the model is defined, we are going to launch training. For that, Transformers4Rec extends the HF Transformers `Trainer` class to adapt the evaluation loop to the session-based recommendation task and the calculation of ranking metrics. The original `train()` method is not modified, meaning that we leverage the efficient training implementation from that library, which manages, for example, half-precision (FP16) training.

- Set training arguments

In [8]:
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

#Set arguments for training 
train_args = T4RecTrainingArguments(local_rank = -1, 
                                    dataloader_drop_last = False,
                                    report_to = [],   # set empty list to avoid w&b login
                                    gradient_accumulation_steps = 1,
                                    per_device_train_batch_size = 256, 
                                    per_device_eval_batch_size = 32,
                                    output_dir = "./tmp", 
                                    max_sequence_length=sequence_length,
                                    learning_rate=0.00071,
                                    num_train_epochs=3,
                                    logging_steps=200,
                                   )

**Instantiate the Trainer**

In [9]:
# Instantiate the T4Rec Trainer, which manages training and evaluation
trainer = Trainer(
    model=model,
    args=train_args,
    schema=schema,
    compute_metrics=True,
)

- Define the output folder of the processed parquet files

In [10]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/sessions_by_day")

Here, we use a for loop to fit and evaluate the model, conducting time-based fine-tuning by iteratively training and evaluating over a sliding time window: at each iteration, we train the model on the training data of a specific time index <i>t</i>, then evaluate on the validation data of the next index <i>t + 1</i>. We set the start index to 1 and the final index to 4, so the model is trained on days 1 to 3 and evaluated on days 2 to 4.
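
The pairing of training and evaluation indices can be sketched in isolation as follows. The helper `sliding_windows` is a hypothetical name introduced for illustration; the directory layout mirrors the default `OUTPUT_DIR` defined above.

```python
import os


def sliding_windows(start: int, end: int):
    """Yield (train_index, eval_index) pairs; `end` is exclusive for training."""
    for t in range(start, end):
        yield t, t + 1


base_dir = "/workspace/data/sessions_by_day"
for train_idx, eval_idx in sliding_windows(1, 4):
    train_path = os.path.join(base_dir, str(train_idx), "train.parquet")
    eval_path = os.path.join(base_dir, str(eval_idx), "valid.parquet")
    print(train_path, "->", eval_path)

pairs = list(sliding_windows(1, 4))
print(pairs)  # [(1, 2), (2, 3), (3, 4)]
```

Note that the final index is never used for training, only for evaluation.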

In [11]:
%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))

    train_dataset = NVTDataset(
    train_paths,
    engine="parquet")
    
    eval_dataset = NVTDataset(
    eval_paths,
    engine="parquet")

    train_loader = NVTDataLoader(
    dataset=train_dataset,
    batch_size= train_args.per_device_train_batch_size,
    shuffle=False,
    cats=x_cat_names,
    conts=x_cont_names,
    device=0,
    labels=[],
    sparse_names=x_cat_names + x_cont_names,
    sparse_max=sparse_features_max,
    sparse_as_dense=True,
    drop_last=False,
    )

    eval_loader = NVTDataLoader(
    dataset=eval_dataset,
    batch_size= train_args.per_device_eval_batch_size,
    shuffle=False,
    cats=x_cat_names,
    conts=x_cont_names,
    device=0,
    labels=[],
    sparse_names=x_cat_names + x_cont_names,
    sparse_max=sparse_features_max,
    sparse_as_dense=True,
    drop_last=False,
    )
    
    dl_loader = DLDataLoaderWrapper(
    train_loader, collate_fn=dataloader_collate_dict, batch_size=train_args.per_device_train_batch_size,
    )

    dl_loader_eval = DLDataLoaderWrapper(
    eval_loader, collate_fn=dataloader_collate_dict, batch_size=train_args.per_device_eval_batch_size,
    )

    trainer.train_dataloader = dl_loader
    trainer.eval_dataloader = dl_loader_eval
    
    # Train on day related to time_index 
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')

    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step +=1
    # Evaluate on the following day
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key]))) 
    trainer.wipe_memory()

***** Running training *****
  Num examples = 112128
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1314


********************
Launch training for day 1 are:
********************



Step,Training Loss
200,9.7648
400,8.6243
600,8.9669
800,8.273
1000,8.7742
1200,8.2807


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 2 are:	

********************

 epoch = 3.0
 eval/loss = 8.917440414428711
 eval/next-item/ndcg_at_10 = 0.03845135122537613
 eval/next-item/ndcg_at_20 = 0.04783741384744644
 eval/next-item/recall_at_10 = 0.07420255243778229
 eval/next-item/recall_at_20 = 0.11145464330911636
 eval_runtime = 6.1148
 eval_samples_per_second = 2171.785
 eval_steps_per_second = 17.008


***** Running training *****
  Num examples = 106240
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1245


********************
Launch training for day 2 are:
********************



Step,Training Loss
200,8.9212
400,8.2718
600,8.6469
800,8.025
1000,8.5101
1200,7.8981


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 3 are:	

********************

 epoch = 3.0
 eval/loss = 8.575119018554688
 eval/next-item/ndcg_at_10 = 0.05184777081012726
 eval/next-item/ndcg_at_20 = 0.06420676410198212
 eval/next-item/recall_at_10 = 0.09867884963750839
 eval/next-item/recall_at_20 = 0.14793671667575836
 eval_runtime = 6.1897
 eval_samples_per_second = 1985.249
 eval_steps_per_second = 15.51


***** Running training *****
  Num examples = 97792
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1146


********************
Launch training for day 3 are:
********************



Step,Training Loss
200,8.5965
400,8.09
600,8.1987
800,7.9441
1000,7.9368


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 4 are:	

********************

 epoch = 3.0
 eval/loss = 8.273717880249023
 eval/next-item/ndcg_at_10 = 0.06169569119811058
 eval/next-item/ndcg_at_20 = 0.07696080207824707
 eval/next-item/recall_at_10 = 0.11755990982055664
 eval/next-item/recall_at_20 = 0.17798247933387756
 eval_runtime = 7.5932
 eval_samples_per_second = 2048.158
 eval_steps_per_second = 16.067
CPU times: user 2min 19s, sys: 30.9 s, total: 2min 49s
Wall time: 2min 49s
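
The `Total optimization steps` values in the logs above appear consistent with the number of batches per epoch at the per-device batch size of 256, multiplied by the 3 training epochs. The check below is simple arithmetic on the numbers printed in the logs; the function name is hypothetical.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Batches per epoch (last partial batch kept) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs


for day, num_examples in [(1, 112128), (2, 106240), (3, 97792)]:
    print(day, total_optimization_steps(num_examples, batch_size=256, epochs=3))
# 1 1314
# 2 1245
# 3 1146
```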


Let's write the model's evaluation accuracy results to a text file so that we can compare the models at the end.

In [37]:
with open("results.txt", 'w') as f: 
    f.write('GRU accuracy results:')
    f.write('\n')
    for key, value in  model.compute_metrics().items(): 
        f.write('%s:%s\n' % (key, value.item()))

### Metrics

We have extended the HuggingFace Transformers `Trainer` class (PyTorch only) to support the evaluation of RecSys metrics. The following information retrieval metrics are used to compute the Top-20 accuracy of recommendation lists containing all items: <br> 
- **Normalized Discounted Cumulative Gain (NDCG@20):** NDCG accounts for the rank of the relevant item in the recommendation list and is a more fine-grained metric than HR, which only verifies whether the relevant item is among the top-k items.

- **Hit Rate (HR@20)**: Also known as `Recall@k` when there is only one relevant item in the recommendation list. HR just verifies whether the relevant item is among the top-k items.
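
For the case of a single relevant item per recommendation list, both metrics can be sketched in a few lines. This is a plain-Python illustration of the definitions, not the `NDCGAt` and `RecallAt` classes used in the code above; the item ids are made up.

```python
import math
from typing import List


def hit_rate_at_k(ranked_items: List[int], relevant_item: int, k: int) -> float:
    """1.0 if the relevant item appears in the top-k, else 0.0."""
    return 1.0 if relevant_item in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items: List[int], relevant_item: int, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top_k = ranked_items[:k]
    if relevant_item not in top_k:
        return 0.0
    rank = top_k.index(relevant_item) + 1  # 1-based position in the list
    return 1.0 / math.log2(rank + 1)


ranking = [42, 7, 13, 99]  # items sorted by predicted score, best first
print(hit_rate_at_k(ranking, relevant_item=13, k=3))  # 1.0
print(ndcg_at_k(ranking, relevant_item=13, k=3))      # 0.5
print(ndcg_at_k(ranking, relevant_item=99, k=3))      # 0.0
```

The hit rate is the same whether the relevant item is ranked first or third, while NDCG rewards the higher position.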

#### Restart the kernel to free our GPU memory

In [38]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

At this stage, if the kernel does not restart automatically, please restart it manually to free the GPU memory so that you can move on to training the next session-based model, a SOTA Transformer-based deep learning model, [XLNet](https://arxiv.org/pdf/1906.08237.pdf).

## 3.2. Training a Transformer-based Session-based Recommendation Model

### 3.2.1 What Are Transformers?

The Transformer is a competitive alternative to models based on Recurrent Neural Networks (RNNs) for a range of sequence modeling tasks. The Transformer architecture [5] was introduced in the NLP domain as a novel architecture for sequence-to-sequence tasks that relies entirely on the self-attention mechanism to compute representations of its input and output. The Transformer outperforms RNNs thanks to three mechanisms:

- Non-sequential: Transformer computations can be parallelized, whereas RNN computations are inherently sequential. This results in a significant speed-up in training time.
- Self-attention mechanisms: Transformers rely entirely on self-attention mechanisms that directly model relationships between all item-ids in a sequence.
- Positional encodings: A representation of the location or “position” of items in a sequence, which is used to give order context to the model architecture.

<div align="center"><img src="images/transformer_vs_rnn.png", width=600, height=600></div>
<p><center> Figure 3. Transformer vs vanilla RNN.</center></p>

Figure 3 illustrates the differences between the Transformer (self-attention based) and a vanilla RNN architecture. As we can see, an RNN cannot be parallelized because it processes the sequence step by step over time (notice the sequential path from previous cells to the current one). On the other hand, the Transformer is a more powerful architecture because the self-attention mechanism is capable of representing dependencies within the sequence of tokens, favors parallel processing, and handles longer sequences.

As illustrated in the [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) paper, the original Transformer model is made up of an encoder and a decoder, each of which is a stack of what we can call transformer blocks. In our meta-architecture, we use the encoder block of the Transformer architecture.
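
The self-attention computation can be sketched in plain Python as below. This is a simplified, single-head illustration that uses the inputs directly as queries, keys, and values (no learned projections, no positional encodings); it is not the implementation used by HF Transformers. Each output row depends only on the inputs, not on previous outputs, which is why the rows can be computed in parallel.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, causal: bool = False) -> Matrix:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(x[0])
    output = []
    for i, query in enumerate(x):
        # Position i scores every position j (or only j <= i when causal).
        visible = range(i + 1) if causal else range(len(x))
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(dim) for j in visible
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * x[j][d] for w, j in zip(weights, visible)) for d in range(dim)]
        )
    return output


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 item embeddings of dim 2
bidirectional = self_attention(sequence)
left_to_right = self_attention(sequence, causal=True)
print(left_to_right[0])  # [1.0, 0.0]: the first position can only see itself
```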

<div align="center"><img src="images/encoder.png", width=300, height=300></div>
<p><center> Figure 4. Encoder block of the Transformer Architecture.</center></p>

### 3.2.2. XLNet-MLM

Here, we use XLNet as the Transformer block in our meta-architecture and train the model with the `Masked Language Model (MLM)` masking method. XLNet [9] was proposed as a generalized autoregressive (AR) pretraining method that uses a permutation language modeling (PLM) objective to combine the advantages of AR and autoencoding (AE) methods. XLNet's main contribution is a modified language model training objective that learns conditional distributions for all permutations of tokens in a sequence [8].

The MLM pre-training objective was introduced in the `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding` paper [7]. Figure 5 illustrates the masking methods, causal language modeling (LM) and masked LM, that we use in the RNN and XLNet-MLM models, respectively. Causal LM is the task of predicting the token following a sequence of tokens, where the model only attends to the left context, i.e., it models the probability of a token given the previous tokens in a sentence [6]. MLM randomly masks some of the tokens in the input sequence, and the objective is to predict the original vocabulary id of each masked token based only on its context. The Transformer layer is allowed to use positions on the right (future information) during training. During inference, all past items are visible to the Transformer layer, which tries to predict the next item. In our experiments [7], we obtained very promising accuracy results with XLNET-MLM, which allows the use of future information during training and performs a type of data augmentation by masking different positions of the sequences in each training epoch. Therefore, in this tutorial we use masked LM as the masking method.

<div align="center"><img src="images/masking.png", width=600, height=600></div>
<p><center>Figure 5. Causal and Masked Language Model masking methods.</center></p>
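
The two masking methods can be sketched on a toy session as follows. The helper names, the `[MASK]` placeholder, and the masking probability are assumptions for illustration; the library's `MaskSequence` classes operate on tensors and may differ in detail. `mask_last` mimics the evaluation-style setup in which only the final item of a session is hidden and predicted.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def causal_lm_pairs(session: List[str]) -> List[Tuple[List[str], str]]:
    """Pair each prefix (left context only) with the item that follows it."""
    return [(session[:i], session[i]) for i in range(1, len(session))]


def mlm_mask(
    session: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace items with MASK; labels keep the original item at masked slots."""
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for item in session:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(item)
        else:
            inputs.append(item)
            labels.append(None)
    return inputs, labels


def mask_last(session: List[str]) -> Tuple[List[str], List[Optional[str]]]:
    """Hide only the final item, so that all past items stay visible."""
    return session[:-1] + [MASK], [None] * (len(session) - 1) + [session[-1]]


session = ["shoes", "socks", "laces", "polish"]
print(causal_lm_pairs(session)[0])  # (['shoes'], 'socks')
print(mask_last(session))
# A different random draw in each epoch yields different masked positions.
masked_inputs, masked_labels = mlm_mask(session, mask_prob=0.3, rng=random.Random(0))
print(masked_inputs, masked_labels)
```

Because the masked positions change from epoch to epoch, the same session produces different training examples over time, which is the data augmentation effect mentioned above.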

###  3.2.3 Train XLNET-MLM for Next Item Prediction

Now we are going to leverage XLNET-masked LM (MLM) model to do next item prediction.

In [6]:
import os
import glob

import torch 
import transformers4rec.torch as tr

from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt

As we did above, we start by defining our schema object and selecting a sub-schema with only the `product-id` feature, so that our XLNET-MLM architecture is trained with a single feature. For the masking task, we use the `mlm` method.

In [7]:
from merlin_standard_lib import Schema

# Define schema object to pass it to the TabularSequenceFeatures class
SCHEMA_PATH = 'schema_tutorial.pb'
schema = Schema().from_proto_text(SCHEMA_PATH)

# Create a sub-schema only with the selected features
schema = schema.select_by_name(['product_id-list_seq'])

- Define Input block

In [8]:
#Input 
sequence_length, d_model = 20, 192
# Define input module to process tabular input-features and to prepare masked inputs
inputs= tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=sequence_length,
    d_output=d_model,
    masking="mlm",
)

We build an `XLNetConfig` object to update the config class of the transformer architecture with the specified arguments (d_model, n_head, etc.), which define the model architecture, and then use it to instantiate the corresponding XLNet model.

The `TransformerBlock` class is created to support HF Transformers for session-based and sequential-based recommendation models. `NextItemPredictionTask` is the class that supports the next item prediction task.

In [9]:
# Define XLNetConfig class and set default parameters for HF XLNet config  
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=4, n_layer=2, total_seq_length=sequence_length
)
# Define the model block including: inputs, masking, projection and transformer block.
body = tr.SequentialBlock(
    inputs, tr.MLPBlock([192]), tr.TransformerBlock(transformer_config, masking=inputs.masking)
)

# Define the head for the next item prediction task
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True, hf_format=True, 
                                     metrics=[NDCGAt(top_ks=[10, 20], labels_onehot=True),  RecallAt(top_ks=[10, 20], labels_onehot=True)]),
    inputs=inputs,
)

# Get the end-to-end Model class 
model = tr.Model(head)

Projecting inputs of NextItemPredictionTask to'64' As weight tying requires the input dimension '192' to be equal to the item-id embedding dimension '64'


**Set training arguments**

An additional argument `data_loader_engine` is defined to automatically load the features needed for training using the schema. The default value is nvtabular for optimized GPU-based data-loading. Optionally the PyarrowDataLoader (pyarrow) can also be used as a basic option, but it is slower and works only for small datasets, as the full data is loaded to CPU memory.

In [10]:
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

#Set arguments for training 
training_args = T4RecTrainingArguments(
            output_dir="./tmp",
            max_sequence_length=20,
            data_loader_engine='nvtabular',
            num_train_epochs=3, 
            dataloader_drop_last=False,
            per_device_train_batch_size = 256,
            per_device_eval_batch_size = 32,
            gradient_accumulation_steps = 1,
            learning_rate=0.000666,
            report_to = [],
            logging_steps=200,
        )

**Instantiate the trainer**

In [11]:
# Instantiate the T4Rec Trainer, which manages training and evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    schema=schema,
    compute_metrics=True,
)

- Define the output folder of the processed parquet files

In [12]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/sessions_by_day")

- Now, we do time-based fine-tuning of the model by iteratively training and evaluating using a sliding time window.

In [13]:
%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index 
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step +=1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key]))) 
    trainer.wipe_memory()

********************
Launch training for day 1 are:
********************



***** Running training *****
  Num examples = 112128
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1314


Step,Training Loss
200,9.9462
400,9.0255
600,8.7804
800,8.6416
1000,8.5375
1200,8.5041


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 2 are:	

********************

 epoch = 3.0
 eval/loss = 8.737692832946777
 eval/next-item/ndcg_at_10 = 0.04813284054398537
 eval/next-item/ndcg_at_20 = 0.05864527076482773
 eval/next-item/recall_at_10 = 0.09230072796344757
 eval/next-item/recall_at_20 = 0.1340019553899765
 eval_runtime = 7.175
 eval_samples_per_second = 1850.871
 eval_steps_per_second = 14.495
********************
Launch training for day 2 are:
********************



***** Running training *****
  Num examples = 106240
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1245


Step,Training Loss
200,8.6244
400,8.5141
600,8.3579
800,8.2962
1000,8.1938
1200,8.17


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 3 are:	

********************

 epoch = 3.0
 eval/loss = 8.403376579284668
 eval/next-item/ndcg_at_10 = 0.06307362020015717
 eval/next-item/ndcg_at_20 = 0.0768769383430481
 eval/next-item/recall_at_10 = 0.12020877748727798
 eval/next-item/recall_at_20 = 0.17509378492832184
 eval_runtime = 6.9589
 eval_samples_per_second = 1765.795
 eval_steps_per_second = 13.795
********************
Launch training for day 3 are:
********************



***** Running training *****
  Num examples = 97792
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1146


Step,Training Loss
200,8.2878
400,8.1976
600,8.0675
800,8.0375
1000,7.9296


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 4 are:	

********************

 epoch = 3.0
 eval/loss = 8.089454650878906
 eval/next-item/ndcg_at_10 = 0.07268889248371124
 eval/next-item/ndcg_at_20 = 0.08917907625436783
 eval/next-item/recall_at_10 = 0.1365627497434616
 eval/next-item/recall_at_20 = 0.20200979709625244
 eval_runtime = 8.8626
 eval_samples_per_second = 1754.792
 eval_steps_per_second = 13.766
CPU times: user 6min 56s, sys: 15.3 s, total: 7min 11s
Wall time: 2min 34s
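
The loop above follows a simple incremental schedule: train on day `t`, then evaluate on day `t + 1`. A minimal sketch of just the schedule logic, without any training, is shown below. The helper `sliding_window_pairs` is a hypothetical name introduced for illustration, and the directory layout mirrors the `{day}/train.parquet` and `{day}/valid.parquet` patterns used in the loop.

```python
import os


def sliding_window_pairs(output_dir, start_index, final_index):
    """Yield (train_day, eval_day, train_path, eval_path) for each time window."""
    for day in range(start_index, final_index):
        train_path = os.path.join(output_dir, str(day), "train.parquet")
        eval_path = os.path.join(output_dir, str(day + 1), "valid.parquet")
        yield day, day + 1, train_path, eval_path


schedule = list(sliding_window_pairs("/workspace/data/sessions_by_day", 1, 4))
for train_day, eval_day, train_path, eval_path in schedule:
    print(f"train on day {train_day} -> evaluate on day {eval_day}")
```

Because the same `trainer` and `model` objects are reused across iterations, the weights learned on one day are the starting point for the next day.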


- Append the evaluation accuracy metrics to the existing `results.txt` file.

In [14]:
with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM accuracy results:')
    f.write('\n')
    for key, value in  model.compute_metrics().items(): 
        f.write('%s:%s\n' % (key, value.item()))

### Restart the kernel to free our GPU memory

In [22]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

If the kernel does not restart automatically at this stage, restart it manually to free GPU memory before moving on to the next session-based model: XLNet trained with side information.

### 3.2.4 Train XLNET-MLM with Side Information for Next Item Prediction

It is common practice in RecSys to leverage additional tabular features, such as item (product) metadata and user context, to give the model more information for meaningful predictions. With that motivation, in this section we will use additional features to train our XLNET-MLM architecture. We have already inspected our `schema.pb` and seen that it includes the features and their tags. Now it is time to use the additional features that we created in the `02_ETL_with_NVTAbular` notebook.

In [1]:
import os
import glob
import nvtabular as nvt

import torch 
import transformers4rec.torch as tr

from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt

This time, we want you to do a coding exercise: replace the `FIXME` placeholders in the cells with the proper code.

In [2]:
# Define the categorical and continuous columns to feed to the training model
x_cat_names = ['product_id-list_seq', 'category_id-list_seq', 'brand-list_seq']
x_cont_names = ['product_recency_days_log_norm-list_seq', 'et_dayofweek_sin-list_seq', 'et_dayofweek_cos-list_seq', 
                'price_log_norm-list_seq', 'relative_price_to_avg_categ_id-list_seq']

from merlin_standard_lib import Schema

# Define schema object to pass it to the TabularSequenceFeatures class
SCHEMA_PATH ='schema_tutorial.pb'
schema = Schema().from_proto_text(SCHEMA_PATH)
schema = schema.select_by_name(x_cat_names + x_cont_names)

Below, we define the `continuous_projection` argument so that all numerical features are concatenated and projected by a number of MLP layers.

In [3]:
#Input 
sequence_length, d_model = 20, 192
# Define input module to process tabular input-features and to prepare masked inputs
inputs= tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=sequence_length,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)


# Define XLNetConfig class and set default parameters for HF XLNet config  
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=4, n_layer=2, total_seq_length=sequence_length
)
# Define the model block including: inputs, masking, projection and transformer block.
body = tr.SequentialBlock(
    inputs, tr.MLPBlock([192]), tr.TransformerBlock(transformer_config, masking=inputs.masking)
)

# Define the head related to next item prediction task 
head = tr.Head(
    body,
    tr.NextItemPredictionTask(
        weight_tying=True,
        hf_format=True,
        metrics=[
            NDCGAt(top_ks=[10, 20], labels_onehot=True),
            RecallAt(top_ks=[10, 20], labels_onehot=True),
        ],
    ),
    inputs=inputs,
)

# Get the end-to-end Model class 
model = tr.Model(head)

Projecting inputs of NextItemPredictionTask to'64' As weight tying requires the input dimension '192' to be equal to the item-id embedding dimension '64'
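
The message above comes from weight tying: the output layer reuses the item-id embedding table to score items, so the sequence representation must have the same width as an item embedding. The toy sketch below illustrates the idea in plain Python with made-up dimensions (4 instead of 192, 2 instead of 64) and made-up numbers. It is an illustration under those assumptions, not the Transformers4Rec implementation.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy item-embedding table: 3 items, embedding dimension 2.
item_embeddings = [
    [1.0, 0.0],  # item 0
    [0.0, 1.0],  # item 1
    [1.0, 1.0],  # item 2
]

# Toy transformer output for one sequence position: hidden dimension 4.
hidden = [0.5, 1.0, 2.0, 0.0]

# Projection from the hidden dimension (4) down to the embedding dimension (2).
projection = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.5, 0.0],
]

projected = matvec(projection, hidden)  # now has the embedding width
# Weight tying: item scores are dot products with the shared embedding table.
logits = matvec(item_embeddings, projected)
best_item = max(range(len(logits)), key=logits.__getitem__)
print(projected, logits, best_item)
```

Without the projection step, the 4-dimensional hidden vector could not be multiplied with the 2-dimensional item embeddings, which is the mismatch the library message reports for 192 versus 64.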


- Set training arguments

In [4]:
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

#Set arguments for training 
training_args = T4RecTrainingArguments(
            output_dir="./tmp",
            max_sequence_length=20,
            data_loader_engine='nvtabular',
            num_train_epochs=3, 
            dataloader_drop_last=False,
            per_device_train_batch_size = 256,
            per_device_eval_batch_size = 32,
            gradient_accumulation_steps = 1,
            learning_rate=0.000666,
            report_to = [],
            logging_steps=200,
)

In [5]:
# Instantiate the T4Rec Trainer, which manages training and evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    schema=schema,
    compute_metrics=True,
)

- Define the output folder of the processed parquet files

In [6]:
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/workspace/data/sessions_by_day")

In [7]:
%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index 
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step +=1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key]))) 
    trainer.wipe_memory()

********************
Launch training for day 1 are:
********************



***** Running training *****
  Num examples = 112128
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1314


Step,Training Loss
200,9.8001
400,8.9141
600,8.6313
800,8.4953
1000,8.3863
1200,8.3459


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 2 are:	

********************

 epoch = 3.0
 eval/loss = 8.59162712097168
 eval/next-item/ndcg_at_10 = 0.054925039410591125
 eval/next-item/ndcg_at_20 = 0.06662589311599731
 eval/next-item/recall_at_10 = 0.10150063782930374
 eval/next-item/recall_at_20 = 0.14810346066951752
 eval_runtime = 9.4227
 eval_samples_per_second = 1409.362
 eval_steps_per_second = 11.037
********************
Launch training for day 2 are:
********************



***** Running training *****
  Num examples = 106240
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1245


Step,Training Loss
200,8.4523
400,8.3149
600,8.12
800,8.0447
1000,7.9408
1200,7.9076


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 3 are:	

********************

 epoch = 3.0
 eval/loss = 8.177560806274414
 eval/next-item/ndcg_at_10 = 0.07021629810333252
 eval/next-item/ndcg_at_20 = 0.08569338172674179
 eval/next-item/recall_at_10 = 0.13521449267864227
 eval/next-item/recall_at_20 = 0.1965421736240387
 eval_runtime = 9.1669
 eval_samples_per_second = 1340.47
 eval_steps_per_second = 10.472
********************
Launch training for day 3 are:
********************



***** Running training *****
  Num examples = 97792
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 1146


Step,Training Loss
200,8.0273
400,7.9118
600,7.7424
800,7.683
1000,7.5656


Saving model checkpoint to ./tmp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Saving model checkpoint to ./tmp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 4 are:	

********************

 epoch = 3.0
 eval/loss = 7.783203125
 eval/next-item/ndcg_at_10 = 0.08664782345294952
 eval/next-item/ndcg_at_20 = 0.10513769090175629
 eval/next-item/recall_at_10 = 0.16290904581546783
 eval/next-item/recall_at_20 = 0.23647256195545197
 eval_runtime = 11.6329
 eval_samples_per_second = 1336.895
 eval_steps_per_second = 10.487
CPU times: user 7min 28s, sys: 17.2 s, total: 7min 46s
Wall time: 3min 10s
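
For next-item prediction, each evaluation example has a single target item, so the ranking metrics printed above reduce to simple functions of the rank at which the target appears. The sketch below shows the standard single-target definitions of Recall@k and NDCG@k. It is an illustration only: the function names are hypothetical, and the `RecallAt` and `NDCGAt` classes in Transformers4Rec may differ in implementation details such as batching and tie handling.

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the target is among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """Single-target NDCG: 1 / log2(rank + 1) if the target is in the top k, else 0.0."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based rank of the target
    return 1.0 / math.log2(rank + 1)


ranked = ["item_7", "item_3", "item_9", "item_1"]
print(recall_at_k(ranked, "item_9", k=2))
print(recall_at_k(ranked, "item_9", k=3))
print(ndcg_at_k(ranked, "item_9", k=3))
```

Under these definitions, NDCG@k also rewards placing the target near the top of the list, so for a given k it can never exceed Recall@k.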


- Append the accuracy results of XLNet-MLM with side information to `results.txt`.

In [8]:
with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in  model.compute_metrics().items(): 
        f.write('%s:%s\n' % (key, value.item()))

- Export the workflow and model in the format required by Triton Inference Server

After model training and evaluation are completed, we can save our trained model. Here, we also use NVTabular’s `export_pytorch_ensemble` function, which creates the model files and config files to be served by [Triton Inference Server](https://github.com/triton-inference-server/server). NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production.

- Load the workflow that we saved in the ETL notebook.

In [9]:
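import nvtabular as nvt
# INPUT_DATA_DIR is undefined after the kernel restart; the default path below is an assumption.
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "/workspace/data")
workflow_path = os.path.join(INPUT_DATA_DIR, 'workflow_etl')
workflow = nvt.Workflow.load(workflow_path)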

In [10]:
# Dictionary mapping each column to its maximum sequence length
sparse_features_max = {
    fname: sequence_length
    for fname in x_cat_names + x_cont_names
}

sparse_features_max

{'product_id-list_seq': 20,
 'category_id-list_seq': 20,
 'brand-list_seq': 20,
 'product_recency_days_log_norm-list_seq': 20,
 'et_dayofweek_sin-list_seq': 20,
 'et_dayofweek_cos-list_seq': 20,
 'price_log_norm-list_seq': 20,
 'relative_price_to_avg_categ_id-list_seq': 20}

- Export the workflow and model in the format required by Triton Inference Server

NVTabular’s `export_pytorch_ensemble` function enables us to create model files and config files to be served to Triton Inference Server.

In [11]:
from nvtabular.inference.triton import export_pytorch_ensemble
export_pytorch_ensemble(
    model,
    workflow,
    sparse_max=sparse_features_max,
    name= "t4r_pytorch",
    model_path= "/workspace/models",
    label_columns =[],
)

# Wrap Up

Congratulations on finishing this notebook. In this tutorial, we have presented Transformers4Rec, an open source library designed to enable RecSys researchers and practitioners to quickly and easily explore the latest developments in NLP for sequential and session-based recommendation tasks.

Before we move on to the next notebook, `04-Inference-with-Triton`, let's print out our `results.txt` file.

In [1]:
!cat results.txt

GRU accuracy results:
next-item/ndcg_at_10:0.06393890082836151
next-item/ndcg_at_20:0.07904116064310074
next-item/recall_at_10:0.12045864760875702
next-item/recall_at_20:0.1801726371049881

XLNET accuracy results:
next-item/ndcg_at_10:0.07268889248371124
next-item/ndcg_at_20:0.08917907625436783
next-item/recall_at_10:0.1365627497434616
next-item/recall_at_20:0.20200979709625244

XLNet-MLM with side information accuracy results:
next-item/ndcg_at_10:0.08664782345294952
next-item/ndcg_at_20:0.10513769090175629
next-item/recall_at_10:0.16290904581546783
next-item/recall_at_20:0.23647256195545197


**In the end, using side information gives the best results. Why is that? Do you have an idea?**
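
One way to make the comparison concrete is to compute the relative improvement of each model over the GRU baseline from the numbers printed above. The snippet below hard-codes the `ndcg_at_10` and `recall_at_20` values from `results.txt`; the helper name `relative_lift` and the dictionary labels are hypothetical.

```python
# Values copied from the results.txt output above.
results = {
    "GRU": {
        "ndcg_at_10": 0.06393890082836151,
        "recall_at_20": 0.1801726371049881,
    },
    "XLNet-MLM": {
        "ndcg_at_10": 0.07268889248371124,
        "recall_at_20": 0.20200979709625244,
    },
    "XLNet-MLM + side info": {
        "ndcg_at_10": 0.08664782345294952,
        "recall_at_20": 0.23647256195545197,
    },
}


def relative_lift(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


for name, metrics in results.items():
    for metric, value in metrics.items():
        lift = relative_lift(value, results["GRU"][metric])
        print(f"{name:22s} {metric:13s} {value:.4f} ({lift:+.1f}% vs GRU)")
```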

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

# References

1. Malte Ludewig and Dietmar Jannach. 2018. Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction 28, 4-5 (2018), 331–390.
2. Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of the 27th ACM international conference on information and knowledge management. 843–852.
3. Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
4. Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
5. Vaswani, A., et al. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
6. Lample, Guillaume, and Alexis Conneau. "Cross-lingual language model pretraining." arXiv preprint arXiv:1901.07291 (2019).
7. Gabriel De Souza P. Moreira, et al. (2021). Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based Recommendation. RecSys'21.
8. Understanding XLNet, BorealisAI. Online available: https://www.borealisai.com/en/blog/understanding-xlnet/
9. Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." Advances in neural information processing systems 32 (2019).