In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Session-based Recommendation with XLNet

In this notebook we use [Transformers4Rec](https://github.com/NVIDIA-Merlin/Transformers4Rec), which builds on [Hugging Face Transformers](https://github.com/huggingface/transformers) and makes it possible to experiment with cutting-edge implementations of Transformer architectures for sequential and session-based recommendation problems. We show how to build a session-based recommendation model with [XLNet](https://arxiv.org/abs/1906.08237), and how to train and evaluate it with the optimized [NVTabular PyTorch dataloader](https://nvidia.github.io/NVTabular/main/training/pytorch.html). The XLNet architecture was designed to leverage the best of both auto-regressive language modeling and auto-encoding through its Permutation Language Modeling training method. In this example we use XLNet with the masked language modeling (MLM) training method, which showed very promising results in the experiments conducted in our [ACM RecSys'21 paper](https://github.com/NVIDIA-Merlin/publications/blob/main/2021_acm_recsys_transformers4rec/recsys21_transformers4rec_paper.pdf).

In the previous notebook we went through our ETL pipeline with the NVTabular library and created session-based features for training a session-based recommendation model. In this notebook we will cover:

- Accelerating data loading of multiple features in PyTorch using the NVTabular library
- Training and evaluating a Transformer-based (XLNet-MLM) session-based recommendation model with side information (additional features)

## Build a DL model with the Transformers4Rec library

- Import the required libraries

In [2]:
import os
import glob
import torch 

import transformers4rec.torch as torch4rec
from transformers4rec.torch import TabularSequenceFeatures, MLPBlock, SequentialBlock, Head, TransformerBlock

from transformers4rec.utils.schema import DatasetSchema
from transformers4rec.torch.model.head import NextItemPredictionTask
from transformers4rec.config import transformer
from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt

The Transformers4Rec library relies on a schema object to create `TabularSequenceFeatures`. As shown below, `schema.pb` is a protobuf file that contains feature metadata, including statistics such as cardinality and min and max values, and that tags each feature according to its characteristics and dtype (e.g., categorical, continuous, list, integer).

- Manually set the schema

In [3]:
# Define the schema object to pass to TabularSequenceFeatures
SCHEMA_PATH = "./schema.pb"
schema = DatasetSchema.from_proto(SCHEMA_PATH)
!head -30 $SCHEMA_PATH

feature {
  name: "session_id"
  type: INT
  int_domain {
    name: "session_id"
    min: 1
    max: 11562158
    is_categorical: false
  }
  annotation {
    tag: "groupby_col"
  }
}
feature {
  name: "category-list_trim"
  value_count {
    min: 2
    max: 20
  }
  type: INT
  int_domain {
    name: "category-list_trim"
    min: 1
    max: 334
    is_categorical: true
  }
  annotation {
    tag: "list"
    tag: "categorical"
    tag: "item"


In [4]:
# Select the features for training
schema = schema.select_by_name(['item_id-list_trim', 'category-list_trim', 'timestamp/weekday/sin-list_trim', 'timestamp/age_days-list_trim'])
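
To make the effect of selecting features by name concrete, here is a minimal sketch in plain Python. It does not use the Transformers4Rec `DatasetSchema` class: `ToyFeature`, `toy_schema`, and the standalone `select_by_name` function are hypothetical stand-ins, and the tags marked as assumed are illustrative only, since the schema output above is truncated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyFeature:
    """Hypothetical stand-in for one schema entry (name plus tags)."""
    name: str
    tags: List[str] = field(default_factory=list)


def select_by_name(features: List[ToyFeature], names: List[str]) -> List[ToyFeature]:
    """Keep only the features whose name is listed, preserving schema order."""
    wanted = set(names)
    return [f for f in features if f.name in wanted]


toy_schema = [
    ToyFeature("session_id", ["groupby_col"]),
    ToyFeature("category-list_trim", ["list", "categorical", "item"]),
    ToyFeature("item_id-list_trim", ["list", "categorical"]),            # tags assumed
    ToyFeature("timestamp/age_days-list_trim", ["list", "continuous"]),  # tags assumed
]

selected = select_by_name(toy_schema, ["item_id-list_trim", "category-list_trim"])
selected_names = [f.name for f in selected]
print(selected_names)
```

In this sketch `session_id` is dropped because it is not listed, which mirrors why the grouping column is left out of the model inputs in the cell above.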

### Define the sequential input module

Below we define our `input` block using the `TabularSequenceFeatures` [class](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/features/sequence.py#L91). Its `from_schema` method parses the schema directly and accepts categorical and continuous sequential inputs. It supports data augmentation, aggregation of the input features with the `sequential-concat`, `elementwise-sum`, and `element-wise sum & item multiplication` techniques, projection of the interaction embeddings, and the masking tasks.

`max_sequence_length` defines the maximum length of the sequential input. If the `continuous_projection` argument is set, all numerical features are concatenated and projected by an MLP block, so that the continuous features are represented by a vector of user-defined size, which is `64` in this example.
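
The difference between concatenating feature vectors and summing them element-wise can be illustrated with a small plain-Python sketch. The helper names `concat` and `elementwise_sum` and the toy four-dimensional vectors are assumptions made for illustration; they are not the library's implementation and do not reflect the dimensions used in this notebook.

```python
from typing import List

Vector = List[float]


def concat(vectors: List[Vector]) -> Vector:
    """Concatenate per-feature vectors: output size is the sum of input sizes."""
    out: Vector = []
    for v in vectors:
        out.extend(v)
    return out


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum per-feature vectors position by position: all sizes must match."""
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("element-wise sum requires vectors of equal size")
    return [sum(values) for values in zip(*vectors)]


# Toy representations of one interaction (values are made up).
item_emb = [0.5, -1.0, 2.0, 0.0]
category_emb = [1.0, 1.0, -2.0, 0.5]
continuous_proj = [0.25, 0.25, 0.0, 0.5]

concatenated = concat([item_emb, category_emb, continuous_proj])
summed = elementwise_sum([item_emb, category_emb, continuous_proj])

print(len(concatenated))  # 12 values: 3 features x 4 dimensions
print(summed)             # 4 values: same size as each input
```

Concatenation keeps every feature's values separate at the cost of a wider vector, while the element-wise sum keeps the width fixed but requires all features to share the same dimension.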

In [4]:
inputs = TabularSequenceFeatures.from_schema(
        schema,
        max_sequence_length=20,
        continuous_projection=64,
        d_output=100,
        masking="mlm",
    )
inputs.masking.device = 'cuda'
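
As an illustration of what the `masking="mlm"` option refers to, the sketch below applies masked language modeling to a single session in plain Python: some item ids are replaced by a mask id, and the original ids at those positions become the prediction targets. The `mask_session` helper, the `MASK_ID` value, and the masking probability of `0.2` are assumptions chosen for the example, not the library's defaults or its implementation.

```python
import random
from typing import List, Optional, Tuple

MASK_ID = 0  # hypothetical id reserved for the mask token


def mask_session(
    item_ids: List[int], mask_prob: float, rng: random.Random
) -> Tuple[List[int], List[Optional[int]]]:
    """Return (masked_inputs, labels); labels are None at unmasked positions."""
    masked_inputs: List[int] = []
    labels: List[Optional[int]] = []
    for item in item_ids:
        if rng.random() < mask_prob:
            masked_inputs.append(MASK_ID)  # hide the item from the model
            labels.append(item)            # and ask the model to recover it
        else:
            masked_inputs.append(item)
            labels.append(None)
    # Guarantee at least one target per non-empty session by masking the last item.
    if item_ids and all(label is None for label in labels):
        masked_inputs[-1] = MASK_ID
        labels[-1] = item_ids[-1]
    return masked_inputs, labels


session = [12, 7, 33, 5, 18]
masked_inputs, labels = mask_session(session, mask_prob=0.2, rng=random.Random(42))
print(masked_inputs)
print(labels)
```

Masking only the last position corresponds to the next-item prediction setting typically used at evaluation time.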

### Define the Transformer Block

- LM task + HF Transformer architecture + next-item prediction task.
    - The [T4RecConfig](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/config/transformer.py#L8) class updates the config class of the Transformer architecture with the specified arguments and then loads the related model. Here we use it to instantiate an XLNet model according to the arguments (`d_model`, `n_head`, etc.) that define the model architecture.
    - The [TransformerBlock](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/block/transformer.py#L37) class supports HF Transformers for session-based and sequential recommendation models.
    - [NextItemPredictionTask](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/model/head.py#L238) is the class that supports the next-item prediction task. Other prediction [tasks](https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/transformers4rec/torch/model/head.py) are supported as well.

In [5]:
# Define XLNetConfig class and set default parameters for HF XLNet config  
transformer_config = transformer.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)
# Define the model block including: inputs, masking, projection and transformer block.
body = torch4rec.SequentialBlock(
    inputs, torch4rec.MLPBlock([64]), torch4rec.TransformerBlock(transformer_config, masking=inputs.masking)
)

# Define the head related to next item prediction task 
head = torch4rec.Head(
    body,
    torch4rec.NextItemPredictionTask(weight_tying=True, hf_format=True, 
                                     metrics=[NDCGAt(top_ks=[100, 200], labels_onehot=True),  RecallAt(top_ks=[100, 200], labels_onehot=True)]),
    inputs=inputs,
)

# Get the end-to-end Model class 
model = torch4rec.Model(head)
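
The `weight_tying=True` argument refers to a technique that is commonly understood as reusing the item embedding table as the output layer, so that each candidate item is scored by the dot product between the final hidden state and that item's embedding. The plain-Python sketch below illustrates this idea under that assumption; the `score_items` and `top_k` helpers and the two-dimensional toy values are hypothetical and are not the library's implementation.

```python
from typing import List


def score_items(hidden: List[float], item_embeddings: List[List[float]]) -> List[float]:
    """Score every item as the dot product of the hidden state and its embedding."""
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in item_embeddings]


def top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring items, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy embedding table: row i is the embedding of item i (values are made up).
item_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.5],
]
hidden = [2.0, 1.0]  # toy final hidden state for one session

scores = score_items(hidden, item_embeddings)
ranked = top_k(scores, k=2)
print(scores)  # [2.0, 1.0, 3.0, -1.5]
print(ranked)  # [2, 0]
```

Because the same table is used for input and output, no separate output projection matrix of size (number of items x hidden size) has to be learned.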

Note that we can easily define an RNN-based model inside the `SequentialBlock` instead of a Transformer-based model. You can explore this [tutorial](https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/main/tutorial) for a GRU-based model example.

### Train the model 

As the data loader we use the optimized NVTabular PyTorch dataloader. You can learn more about it [here](https://nvidia.github.io/NVTabular/main/training/pytorch.html).

- **Set Training arguments**

In [6]:
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer
# Set arguments for training
train_args = T4RecTrainingArguments(local_rank = -1, dataloader_drop_last = True, data_loader_engine='nvtabular',
                                  report_to = [], debug = ["r"], gradient_accumulation_steps = 1,
                                  per_device_train_batch_size = 256, per_device_eval_batch_size = 32,
                                  output_dir = ".", lr_scheduler_type='cosine', 
                                  learning_rate_num_cosine_cycles_by_epoch=1.5,
                                  max_sequence_length=20, fp16=False, no_cuda=False, )
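
To give an intuition for `lr_scheduler_type='cosine'` combined with `learning_rate_num_cosine_cycles_by_epoch=1.5`, the sketch below computes a learning-rate multiplier with the general cosine formula and no warmup. It assumes that the total number of cycles is the per-epoch value multiplied by the number of epochs; that assumption, the `cosine_lr_factor` helper, and the use of 12 steps over 3 epochs (taken from the training logs further down) are illustrative and not taken from the library source.

```python
import math
from typing import List


def cosine_lr_factor(step: int, total_steps: int, num_cycles: float) -> float:
    """Multiplier in [0, 1] applied to the base learning rate at a given step."""
    progress = step / max(1, total_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))


total_steps = 12          # optimization steps reported in the training logs
num_epochs = 3            # epochs reported in the training logs
cycles_per_epoch = 1.5    # learning_rate_num_cosine_cycles_by_epoch
num_cycles = cycles_per_epoch * num_epochs  # assumption: cycles scale with epochs

factors: List[float] = [
    cosine_lr_factor(step, total_steps, num_cycles) for step in range(total_steps + 1)
]
for step, factor in enumerate(factors):
    print(f"step {step:2d}: lr multiplier = {factor:.3f}")
```

Under these assumptions the multiplier starts at 1, oscillates several times during training, and ends close to 0 at the last step.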

# Daily Fine-Tuning: Training over a time window

Here we do daily fine-tuning: we train on the first day and evaluate on the second day, then resume from the model trained in the previous step to train on the second day and evaluate on the third day, and so on.
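
The schedule described above can be sketched as a sequence of (train day, eval day) pairs. The `daily_windows` helper is a hypothetical name introduced for illustration; the day range 1 to 7 and the path pattern follow the training loop used in this notebook.

```python
from typing import Iterator, List, Tuple


def daily_windows(first_day: int, last_day: int) -> Iterator[Tuple[int, int]]:
    """Yield (train_day, eval_day) pairs: train on day d, evaluate on day d + 1."""
    for day in range(first_day, last_day):
        yield day, day + 1


pairs: List[Tuple[int, int]] = list(daily_windows(1, 7))
for train_day, eval_day in pairs:
    train_path = f"/workspace/data/sessions_by_day/{train_day}/train.parquet"
    eval_path = f"/workspace/data/sessions_by_day/{eval_day}/valid.parquet"
    print(f"train on {train_path} -> evaluate on {eval_path}")
```

Each day is used first for evaluation and then for training, so the model is always evaluated on data that it has not yet seen.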

In this example, the session-based recommendation model is evaluated with traditional Top-N ranking metrics: Normalized Discounted Cumulative Gain (NDCG@n) and Hit Rate (HR@n). The model defined above reports them at cutoffs of 100 and 200 through the `NDCGAt` and `RecallAt` metrics. NDCG accounts for the rank of the relevant item in the recommendation list and is a more fine-grained metric than HR, which only verifies whether the relevant item is among the top-n items. HR@n is equivalent to Recall@n when there is only one relevant item in the recommendation list.
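
For the case of a single relevant item per session, both metrics reduce to simple expressions that the following plain-Python sketch makes explicit. The functions `recall_at_k` and `ndcg_at_k` are hypothetical helpers written for illustration; they are not the `RecallAt` and `NDCGAt` classes used in this notebook, and the ranked list is made up.

```python
import math
from typing import List


def recall_at_k(ranked_items: List[int], relevant_item: int, k: int) -> float:
    """1.0 if the relevant item appears in the top-k list, else 0.0 (equals HR@k)."""
    return 1.0 if relevant_item in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items: List[int], relevant_item: int, k: int) -> float:
    """NDCG@k with one relevant item: 1 / log2(rank + 1), or 0.0 if outside top-k."""
    top = ranked_items[:k]
    if relevant_item not in top:
        return 0.0
    rank = top.index(relevant_item) + 1  # 1-based position in the list
    # The ideal DCG is 1.0 (relevant item at rank 1), so DCG equals NDCG here.
    return 1.0 / math.log2(rank + 1)


ranked = [42, 7, 13, 99, 5]  # toy recommendation list, best first

recall_hit = recall_at_k(ranked, relevant_item=13, k=3)
ndcg_hit = ndcg_at_k(ranked, relevant_item=13, k=3)
recall_miss = recall_at_k(ranked, relevant_item=99, k=3)
ndcg_miss = ndcg_at_k(ranked, relevant_item=99, k=3)

print(recall_hit, ndcg_hit)    # 1.0 0.5
print(recall_miss, ndcg_miss)  # 0.0 0.0
```

The item at rank 3 counts as a full hit for recall but only contributes 0.5 to NDCG, which shows why NDCG is the more fine-grained of the two.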

In [7]:
# Instantiate the T4Rec Trainer, which manages training and evaluation
trainer = Trainer(
    model=model,
    args=train_args,
    schema=schema,
    compute_metrics=True,
)

In [8]:
start_time_window_index = 1
final_time_window_index = 7
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(f"/workspace/data/sessions_by_day/{time_index_train}/train.parquet")
    eval_paths = glob.glob(f"/workspace/data/sessions_by_day/{time_index_eval}/valid.parquet")  
    print(train_paths)
    # Train on day related to time_index 
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step +=1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    eval_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(eval_metrics.keys()):
        print(" %s = %s" % (key, str(eval_metrics[key]))) 
    trainer.wipe_memory()

['/workspace/data/sessions_by_day/1/train.parquet']
********************
Launch training for day 1 are:
********************



***** Running training *****
  Num examples = 1024
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 12


Training completed. Do not forget to share your model on huggingface.co/models =)




***** Running training *****
  Num examples = 1024
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 12


********************
Eval results for day 2 are:	

********************

 epoch = 3.0
 eval_loss = 10.860979080200195
 eval_ndcgat_100 = 0.0
 eval_ndcgat_200 = 0.0
 eval_recallat_100 = 0.0
 eval_recallat_200 = 0.0
 eval_runtime = 0.0746
 eval_samples_per_second = 857.947
 eval_steps_per_second = 13.405
['/workspace/data/sessions_by_day/2/train.parquet']
********************
Launch training for day 2 are:
********************



Training completed. Do not forget to share your model on huggingface.co/models =)


***** Running training *****
  Num examples = 1024
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 12


********************
Eval results for day 3 are:	

********************

 epoch = 3.0
 eval_loss = 10.81531047821045
 eval_ndcgat_100 = 0.0
 eval_ndcgat_200 = 0.0005674263811670244
 eval_recallat_100 = 0.0
 eval_recallat_200 = 0.004273504484444857
 eval_runtime = 0.092
 eval_samples_per_second = 1043.754
 eval_steps_per_second = 10.872
['/workspace/data/sessions_by_day/3/train.parquet']
********************
Launch training for day 3 are:
********************



Training completed. Do not forget to share your model on huggingface.co/models =)


***** Running training *****
  Num examples = 1024
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 12


********************
Eval results for day 4 are:	

********************

 epoch = 3.0
 eval_loss = 10.836214065551758
 eval_ndcgat_100 = 0.0
 eval_ndcgat_200 = 0.0
 eval_recallat_100 = 0.0
 eval_recallat_200 = 0.0
 eval_runtime = 0.0746
 eval_samples_per_second = 858.235
 eval_steps_per_second = 13.41
['/workspace/data/sessions_by_day/4/train.parquet']
********************
Launch training for day 4 are:
********************



Training completed. Do not forget to share your model on huggingface.co/models =)


***** Running training *****
  Num examples = 1024
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 12


********************
Eval results for day 5 are:	

********************

 epoch = 3.0
 eval_loss = 10.765353202819824
 eval_ndcgat_100 = 0.0
 eval_ndcgat_200 = 0.0
 eval_recallat_100 = 0.0
 eval_recallat_200 = 0.0
 eval_runtime = 0.0982
 eval_samples_per_second = 977.112
 eval_steps_per_second = 10.178
['/workspace/data/sessions_by_day/5/train.parquet']
********************
Launch training for day 5 are:
********************



Training completed. Do not forget to share your model on huggingface.co/models =)


***** Running training *****
  Num examples = 1024
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 1
  Total optimization steps = 12


********************
Eval results for day 6 are:	

********************

 epoch = 3.0
 eval_loss = 10.712799072265625
 eval_ndcgat_100 = 0.0
 eval_ndcgat_200 = 0.0
 eval_recallat_100 = 0.0
 eval_recallat_200 = 0.0
 eval_runtime = 0.0987
 eval_samples_per_second = 972.559
 eval_steps_per_second = 10.131
['/workspace/data/sessions_by_day/6/train.parquet']
********************
Launch training for day 6 are:
********************



Training completed. Do not forget to share your model on huggingface.co/models =)




********************
Eval results for day 7 are:	

********************

 epoch = 3.0
 eval_loss = 10.695266723632812
 eval_ndcgat_100 = 0.0
 eval_ndcgat_200 = 0.00139007403049618
 eval_recallat_100 = 0.0
 eval_recallat_200 = 0.009389671497046947
 eval_runtime = 0.0979
 eval_samples_per_second = 980.18
 eval_steps_per_second = 10.21


- **Save the model**

In [9]:
trainer._save_model_and_checkpoint(save_model_class=True)

Saving model checkpoint to ./checkpoint-13
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


- **Reload the model**

In [10]:
trainer.load_model_trainer_states_from_checkpoint('./checkpoint-%s'%trainer.state.global_step)

- **Re-compute eval metrics of validation data**

In [11]:
eval_data_paths = glob.glob("/workspace/data/sessions_by_day/7/valid.parquet")

In [12]:
# set new data from day 7
eval_metrics = trainer.evaluate(eval_dataset=eval_data_paths, metric_key_prefix='eval')
for key in sorted(eval_metrics.keys()):
    print("  %s = %s" % (key, str(eval_metrics[key])))

  epoch = 3.0
  eval_loss = 10.681629180908203
  eval_ndcgat_100 = 0.0
  eval_ndcgat_200 = 0.0012942211469635367
  eval_recallat_100 = 0.0
  eval_recallat_200 = 0.008733624592423439
  eval_runtime = 0.1121
  eval_samples_per_second = 856.627
  eval_steps_per_second = 8.923


We can easily log and visualize model training and evaluation with [Weights & Biases (W&B)](https://wandb.ai/home), [TensorBoard](https://www.tensorflow.org/tensorboard), and [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger). By default, Hugging Face uses Weights & Biases (W&B) to log training and evaluation metrics. W&B offers convenient experiment management, including config logging, and plots the evolution of losses and metrics over time. See the [Transformers4Rec repository](https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/main/transformers4rec) for instructions on using Weights & Biases.