In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

## Next item prediction with a Transformer-based model

In recent years, several deep learning-based algorithms have been proposed for recommender systems, and their adoption in industry deployments has grown steeply. In particular, NLP-inspired approaches have been successfully adapted for sequential and session-based recommendation problems, which are important for many domains such as e-commerce, news and streaming media. Session-Based Recommender Systems (SBRS) model the sequence of interactions within the current user session, where a session is a short sequence of user interactions typically bounded by user inactivity. They have recently gained popularity due to their ability to capture short-term or contextual user preferences towards items.

The field of NLP has evolved significantly within the last decade, particularly due to the increased usage of deep learning. As a result, state-of-the-art NLP approaches have inspired RecSys practitioners and researchers to adapt those architectures, especially for sequential and session-based recommendation problems. Here, we use a state-of-the-art Transformer-based architecture, XLNet, with the Causal Language Modeling (CLM) training technique for a multi-class classification task. For this, we leverage HuggingFace’s popular Transformers NLP library, which makes it possible to experiment with cutting-edge implementations of such architectures for sequential and session-based recommendation problems.

### 5.1.1. What is a Transformer?
The Transformer is a competitive alternative to models based on Recurrent Neural Networks (RNNs) for a range of sequence modeling tasks. The Transformer architecture [6] was introduced in the NLP domain to solve sequence-to-sequence tasks, relying entirely on the self-attention mechanism to compute representations of its input and output. The Transformer outperforms RNNs thanks to three mechanisms:

- Non-sequential: Transformer computations are parallelized across the sequence, whereas RNN computations are inherently sequential. This results in a significant reduction in training time.
- Self-attention mechanisms: Transformers rely entirely on self-attention mechanisms that directly model relationships between all item-ids in a sequence.
- Positional encodings: A representation of the location or “position” of items in a sequence, which gives the model the order context that self-attention alone does not provide.
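The three mechanisms above can be illustrated with a minimal NumPy sketch of single-head self-attention over a short item sequence. It is a generic, simplified Transformer layer and not XLNet itself (XLNet uses relative positional encodings); the sinusoidal encoding, the toy sizes, and all variable names are assumptions made for illustration.

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return encoding


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over all positions at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # toy sizes, chosen only for this example

# Item embeddings carry no order information; the encoding adds it.
item_embeddings = rng.normal(size=(seq_len, d_model))
x = item_embeddings + positional_encoding(seq_len, d_model)

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, attention = self_attention(x, w_q, w_k, w_v)

print(output.shape)     # one d_model vector per position
print(attention.shape)  # one weight for every pair of positions
```

Every row of `attention` is computed with the same matrix products, with no loop over time steps, which is what makes the computation parallelizable.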

**Learning Objectives:**
- Train and evaluate a Transformer-based model (XLNet) for the next-item prediction task
- Apply weight-tying technique

In [3]:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
import gc
import numpy as np

from merlin.schema.tags import Tags
from merlin.io.dataset import Dataset

import merlin.models.tf as mm
from merlin.models.tf.core.aggregation import SequenceAggregator
from merlin.models.tf.transforms.tensor import ListToDense, ListToRagged

import tensorflow as tf

In [4]:
seed=42
tf.random.set_seed(seed)
np.random.seed(seed)

In [5]:
DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '/workspace/data'
)

In [6]:
train = Dataset(os.path.join(DATA_FOLDER, "train/*.parquet"))
valid = Dataset(os.path.join(DATA_FOLDER, "valid/*.parquet"))



In [7]:
target = train.schema.select_by_tag(Tags.SEQUENCE).column_names[0]
target

'city_id_list'

In [8]:
EPOCHS = int(os.environ.get(
    "EPOCHS", 
    '3'
))

dmodel = int(os.environ.get(
    "dmodel", 
    '64'
))

BATCH_SIZE = 1024
LEARNING_RATE = 0.003
DROPOUT = 0.0
LABEL_SMOOTHING = 0.2
TEMPERATURE_SCALING = 2

In [9]:
manual_dims = {
    'city_id_list': dmodel, 
    'hotel_country_list' :16
}

### XLNET MODEL

In [10]:
train = Dataset(os.path.join(DATA_FOLDER, "train/*.parquet"))
valid = Dataset(os.path.join(DATA_FOLDER, "valid/*.parquet"))

In [11]:
train.schema = train.schema.select_by_name(['city_id_list','booker_country_list', 'hotel_country_list',
                                            'weekday_checkin_list','weekday_checkout_list',
                                            'month_checkin_list','num_city_visited', 'length_of_stay_list']
                                          )

In [12]:
seq_schema = train.schema.select_by_tag(Tags.SEQUENCE)

In [13]:
context_schema = train.schema.select_by_tag(Tags.CONTEXT)

In [14]:
target_schema = train.schema.select_by_tag(Tags.ITEM_ID)
target = target_schema.column_names[0]
target

'city_id_list'

In [15]:
mlp_block = mm.MLPBlock(
                [128,dmodel],
                activation='relu',
                no_activation_last_layer=True,
                dropout=DROPOUT,
            )

Let's create a sequential block that connects the sequential inputs block to an MLPBlock and then to an XLNetBlock (a SequentialLayer represents a sequence of Keras layers). The MLPBlock is used as a projection block to match the output dimension of the inputs block with the hidden dimension (d_model) of the XLNetBlock. In other words, because the Transformer uses residual connections, which add a layer's input to its output, the output dimension of the input pipeline must equal d_model, and the MLPBlock takes care of that.

In [16]:
input_block = mm.InputBlockV2(
    train.schema,
    embeddings=mm.Embeddings(
        seq_schema.select_by_tag(Tags.CATEGORICAL), 
        sequence_combiner=None,
        dim=manual_dims
        ),
    post=mm.BroadcastToSequence(context_schema, seq_schema),
)

dense_block = mm.SequentialBlock(
    input_block,
    mlp_block,
    mm.XLNetBlock(d_model=dmodel, n_head=4, n_layer=2, 
                   pre=mm.ReplaceMaskedEmbeddings(),
                   post="inference_hidden_state",
                  )
)

In [None]:
batch = mm.sample_batch(train, batch_size=128, include_targets=False, to_ragged=True)

In [28]:
input_block(batch).shape

TensorShape([128, None, 114])
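The input block emits 114-dimensional vectors per step, while the XLNetBlock above is configured with `d_model=64`. The following NumPy sketch shows, under simplified assumptions, why a projection is needed before a residual connection. The two dense layers only mimic the shape behavior of `MLPBlock([128, dmodel])`; the random weights and the stand-in sub-layer output are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, input_dim, d_model = 4, 6, 114, 64

# Stand-in for the input block output: (batch, sequence, 114).
inputs = rng.normal(size=(batch_size, seq_len, input_dim))

# Two dense layers: ReLU on the first, no activation on the last,
# mirroring no_activation_last_layer=True in shape only.
w1, b1 = rng.normal(scale=0.05, size=(input_dim, 128)), np.zeros(128)
w2, b2 = rng.normal(scale=0.05, size=(128, d_model)), np.zeros(d_model)
hidden = np.maximum(inputs @ w1 + b1, 0.0)
projected = hidden @ w2 + b2  # (batch, sequence, d_model)

# Stand-in for the output of a Transformer sub-layer of width d_model.
sublayer_out = rng.normal(size=(batch_size, seq_len, d_model))

# A residual connection adds input and output element-wise.
residual = projected + sublayer_out

# Without the projection the addition is not defined.
try:
    inputs + sublayer_out
    mismatch = False
except ValueError:
    mismatch = True

print(projected.shape, residual.shape, mismatch)
```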

In [18]:
mlp_block2 = mm.MLPBlock(
                [128,dmodel],
                activation='relu',
                no_activation_last_layer=True,
            )

The `CategoricalOutput` class supports weight tying when we pass the EmbeddingTable of the target feature through the `to_call` argument.

**Weight Tying:** Sharing the weight matrix between the input-to-embedding layer and the output-to-softmax layer. That is, instead of using two weight matrices, we use only one. The intuition is that this reduces the number of parameters and thereby helps combat overfitting, so weight tying can be considered a form of regularization.
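As a rough illustration of the idea, the NumPy sketch below uses a single embedding table both to look up input items and to score all items at the output. It is not Merlin's implementation: the names are hypothetical, the mean over item vectors merely stands in for the Transformer output, and dividing the logits by a temperature is an assumption about what `logits_temperature` does.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d_model, temperature = 10, 4, 2.0

# One table, used on both the input and the output side.
embedding_table = rng.normal(size=(num_items, d_model))

# Input side: look up the embeddings of the items in a session.
item_ids = np.array([3, 7, 1])
item_vectors = embedding_table[item_ids]  # (3, d_model)

# Stand-in for the sequence representation produced by the model.
hidden = item_vectors.mean(axis=0)  # (d_model,)

# Output side: score every item with the same table instead of a
# separate (d_model, num_items) output matrix.
logits = embedding_table @ hidden  # (num_items,)
scaled_logits = logits / temperature

# Softmax over items.
exp = np.exp(scaled_logits - scaled_logits.max())
probs = exp / exp.sum()

print(probs.shape, probs.sum())
```

With tying, the output layer adds no extra `num_items * d_model` parameters, and a positive temperature changes the confidence of the distribution but not the ranking of the items.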

In [19]:
item_id_name = train.schema.select_by_tag(Tags.ITEM_ID).first.properties['domain']['name']
print(item_id_name)

city_id


In [20]:
prediction_task = mm.CategoricalOutput(
    to_call=input_block["categorical"][item_id_name],
    logits_temperature=TEMPERATURE_SCALING,
)

In [21]:
model_transformer = mm.Model(dense_block, mlp_block2, prediction_task)

In [22]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE,
)

In [23]:
%%time
model_transformer.compile(run_eagerly=False, optimizer=optimizer, loss="categorical_crossentropy",
              metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[4])
             )
model_transformer.fit(train, batch_size=512, epochs=3, pre=mm.SequenceMaskRandom(schema=seq_schema, target=target, masking_prob=0.1))

2023-02-23 00:26:28.697454: I tensorflow/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8700


Epoch 1/3
2023-02-23 00:26:53.317384: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model/sequential_block_4/xl_net_block/replace_masked_embeddings/RaggedWhere/Assert/AssertGuard/branch_executed/_170


Epoch 2/3
Epoch 3/3
CPU times: user 4min 49s, sys: 36.1 s, total: 5min 26s
Wall time: 4min 7s


<keras.callbacks.History at 0x7f8802818e50>

In [24]:
predict_last = mm.SequenceMaskLast(schema=seq_schema, target=target)

In [25]:
valid.schema = train.schema

In [26]:
model_transformer.evaluate(
    valid,
    batch_size=1024,
    pre=predict_last,
    return_dict=True
)

2023-02-23 00:30:27.757963: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model/sequential_block_4/xl_net_block/replace_masked_embeddings/RaggedWhere/Assert/AssertGuard/branch_executed/_131




{'loss': 0.3793621063232422,
 'recall_at_4': 0.47486501932144165,
 'mrr_at_4': 0.32167568802833557,
 'ndcg_at_4': 0.3602368235588074,
 'map_at_4': 0.32167568802833557,
 'precision_at_4': 0.11871625483036041,
 'regularization_loss': 0.0,
 'loss_batch': 0.3642217814922333}
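In the results above, `precision_at_4` equals `recall_at_4` divided by 4, and `mrr_at_4` equals `map_at_4`. Both are consistent with having exactly one target item per sequence. The NumPy sketch below reproduces these relationships on toy data; it is a plain illustration with hypothetical scores and not the code behind `TopKMetricsAggregator`.

```python
import numpy as np


def top_k_metrics(scores, targets, k):
    """Top-k metrics when each row has exactly one relevant item."""
    top_k = np.argsort(-scores, axis=1)[:, :k]  # best k item ids per row
    hits = top_k == targets[:, None]            # at most one True per row
    hit_any = hits.any(axis=1)
    ranks = hits.argmax(axis=1) + 1             # 1-based rank of the hit
    return {
        "recall": hit_any.mean(),
        "precision": (hits.sum(axis=1) / k).mean(),
        "mrr": np.where(hit_any, 1.0 / ranks, 0.0).mean(),
    }


# Toy scores for 3 sessions over 5 items (hypothetical values).
scores = np.array([
    [0.1, 0.5, 0.2, 0.9, 0.3],
    [0.8, 0.1, 0.6, 0.2, 0.4],
    [0.3, 0.2, 0.1, 0.7, 0.6],
])
targets = np.array([1, 3, 3])

metrics = top_k_metrics(scores, targets, k=2)
print(metrics)
```

With a single relevant item, average precision reduces to the reciprocal rank of that item, which is why MAP and MRR coincide.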

- `ReplaceMaskedEmbeddings` and `SequenceMaskRandom` are used together: replacement embeddings are learned during training
- `ReplaceMaskedEmbeddings` and `SequencePredictNext` are used together: replacement embeddings are not learned during training

We can also use an XLNet block with `SequencePredictLast` and a `post` that summarizes the sequence by its last position. The choice between these options determines how the representation of the sequence is learned.
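To make the difference between the training and evaluation masking schemes concrete, here is a small NumPy sketch of random masking versus last-item masking on one session. It only illustrates the idea and is not how `SequenceMaskRandom` or `SequenceMaskLast` are implemented; the `MASK` placeholder id, the helper names, and the rule of forcing at least one masked position are assumptions of this sketch.

```python
import numpy as np

MASK = -1  # hypothetical placeholder id for the replacement embedding


def mask_random(item_ids, masking_prob, rng):
    """Mask each position independently with probability masking_prob."""
    mask = rng.random(len(item_ids)) < masking_prob
    if not mask.any():
        # Assumption: keep at least one target per session.
        mask[rng.integers(len(item_ids))] = True
    return mask


def mask_last(item_ids):
    """Mask only the final position, as done at evaluation time."""
    mask = np.zeros(len(item_ids), dtype=bool)
    mask[-1] = True
    return mask


session = np.array([12, 7, 33, 5, 19])  # toy city ids
rng = np.random.default_rng(42)

for name, mask in [
    ("random", mask_random(session, 0.1, rng)),
    ("last", mask_last(session)),
]:
    inputs = np.where(mask, MASK, session)  # what the model sees
    targets = session[mask]                 # what it must predict
    print(name, inputs, targets)
```

Random masking lets the model learn from any position in the sequence during training, while last-item masking matches the next-item prediction task used for evaluation.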