In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_models-transformers-net-item-prediction/nvidia_logo.png" style="width: 90px; float: right;">

# Transformer-based architecture for next-item prediction task

## Overview

In this use case, we will train a Transformer-based architecture for the next-item prediction task.

We will use the [booking.com dataset](https://github.com/bookingcom/ml-dataset-mdt) to train a session-based model. The dataset contains 1,166,835 anonymized hotel reservations in the train set and 378,667 in the test set. Each reservation is part of a customer's trip (identified by `utrip_id`), which consists of consecutive reservations.

We will reshape the data to organize it into 'sessions'. Each session will be a full customer itinerary in chronological order. The goal will be to predict the `city_id` of the final reservation of each trip.


### Learning objectives

- Training a Transformer-based architecture for the next-item prediction task

## Downloading and preparing the dataset

We will download the dataset using functionality provided by Merlin Models. The dataset can be found on GitHub [here](https://github.com/bookingcom/ml-dataset-mdt).

In [2]:
from merlin.core.dispatch import get_lib
from merlin.datasets.ecommerce import get_booking
import numpy as np

from nvtabular import Dataset, Workflow
from nvtabular import ops
from merlin.schema.tags import Tags

import merlin.models.tf as mm

get_booking('/workspace/data')
train = get_lib().read_csv('/workspace/data/train_set.csv', parse_dates=['checkin', 'checkout'])

2023-01-10 00:50:16.355919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-10 00:50:16.356319: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-10 00:50:16.356459: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-10 00:50:16.557048: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate 

Each reservation belongs to a trip identified by `utrip_id`. During each trip, a customer visits several destinations.

In [3]:
train.head()

| | user_id | checkin | checkout | city_id | device_class | affiliate_id | booker_country | hotel_country | utrip_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000027 | 2016-08-13 | 2016-08-14 | 8183 | desktop | 7168 | Elbonia | Gondal | 1000027_1 |
| 1 | 1000027 | 2016-08-14 | 2016-08-16 | 15626 | desktop | 7168 | Elbonia | Gondal | 1000027_1 |
| 2 | 1000027 | 2016-08-16 | 2016-08-18 | 60902 | desktop | 7168 | Elbonia | Gondal | 1000027_1 |
| 3 | 1000027 | 2016-08-18 | 2016-08-21 | 30628 | desktop | 253 | Elbonia | Gondal | 1000027_1 |
| 4 | 1000033 | 2016-04-09 | 2016-04-11 | 38677 | mobile | 359 | Gondal | Cobra Island | 1000033_1 |


We will train on sequences of features such as `city_id` and `booker_country`. Based on this information, our model will attempt to predict the next `city_id` (the next hop in the journey).

We will train a transformer model that can work with sequences of variable length within a batch. This functionality is provided to us out of the box and doesn't require any changes to the architecture. Thanks to it we do not have to pad or trim our sequences to any particular length -- our model can make effective use of all of the data!

There is one exception. For the masked language model that we will be training, we need to discard sequences that are shorter than two hops. This makes sense, as there is nothing our model could learn from an itinerary with a single destination on it!
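To make "sequences of variable length within a batch" more concrete, here is a minimal sketch in plain Python of one common way to store such a batch without padding: a flat list of values plus row offsets. The helper names `to_ragged` and `row` are hypothetical, the third trip is made-up toy data, and this only illustrates the general idea of a ragged batch; it is not a description of how Merlin Models handles it internally.

```python
from typing import List, Tuple


def to_ragged(sequences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten variable-length sequences into (values, offsets)."""
    values: List[int] = []
    offsets: List[int] = [0]
    for seq in sequences:
        values.extend(seq)
        offsets.append(len(values))  # where the next sequence starts
    return values, offsets


def row(values: List[int], offsets: List[int], i: int) -> List[int]:
    """Recover the i-th sequence from the flat representation."""
    return values[offsets[i]:offsets[i + 1]]


# Toy itineraries of different lengths (city IDs).
trips = [[8183, 15626, 60902, 30628], [38677], [63, 1160, 87]]

# Single-destination itineraries carry no context to learn from, so drop them.
usable = [trip for trip in trips if len(trip) >= 2]

values, offsets = to_ragged(usable)
print(offsets)                  # [0, 4, 7]
print(row(values, offsets, 1))  # [63, 1160, 87]
```

No padding values are stored, and each sequence keeps its original length.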

Let us begin by splitting the data into a train and validation set based on trip ID.

Let's see how many unique trips there are in the dataset. Along the way, let us also shuffle the trips so that our validation set consists of a random sample of the trips.

In [4]:
utrip_ids = train.sample(frac=1).utrip_id.unique()
len(utrip_ids)

217686

Now let's assign data to our train and validation sets. We also sort the data by `utrip_id` and `checkin`. This way we ensure our sequences of visited `city_id` values will be in chronological order!

In [5]:
train_set_utrip_ids = utrip_ids[:int(0.8 * utrip_ids.shape[0])]
validation_set_utrip_ids = utrip_ids[int(0.8 * utrip_ids.shape[0]):]

train_set = train[train.utrip_id.isin(train_set_utrip_ids)].sort_values(['utrip_id', 'checkin'])
validation_set = train[train.utrip_id.isin(validation_set_utrip_ids)].sort_values(['utrip_id', 'checkin'])
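Splitting on `utrip_id` rather than on individual rows keeps every reservation of a trip in the same set. Here is a minimal sketch of the same idea in plain Python. The toy rows, the trip IDs other than the first two, and the fixed seed are assumptions made for illustration and are not part of the notebook.

```python
import random

# Toy reservations as (utrip_id, city_id) pairs; five trips in total.
reservations = [
    ("1000027_1", 8183), ("1000027_1", 15626), ("1000027_1", 60902),
    ("1000033_1", 38677), ("1000033_1", 101),
    ("trip_c", 102), ("trip_c", 103),
    ("trip_d", 104), ("trip_d", 105),
    ("trip_e", 106), ("trip_e", 107),
]

# Shuffle the unique trip IDs, then cut at 80%.
trip_ids = sorted({utrip_id for utrip_id, _ in reservations})
rng = random.Random(0)  # fixed seed so the toy split is reproducible
rng.shuffle(trip_ids)

cut = int(0.8 * len(trip_ids))
train_ids = set(trip_ids[:cut])
valid_ids = set(trip_ids[cut:])

# Assign whole trips to one side or the other.
train_rows = [r for r in reservations if r[0] in train_ids]
valid_rows = [r for r in reservations if r[0] in valid_ids]

# No trip appears in both sets, and no row is lost.
assert train_ids.isdisjoint(valid_ids)
assert len(train_rows) + len(valid_rows) == len(reservations)
```

Because the cut is made on trip IDs, an itinerary is never split between training and validation.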

## Preprocessing with NVTabular

We can now begin with data preprocessing.

We will combine trips into "sessions", discard trips that are too short and calculate total trip length.

We will use NVTabular for this work. It offers optimized tabular data preprocessing operators that run on the GPU. If you would like to learn more about the NVTabular library, please take a look [here](https://github.com/NVIDIA-Merlin/NVTabular).
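Before turning to NVTabular, it may help to see the three steps (combine reservations into sessions, compute the trip length, discard short trips) on a handful of rows. The following is a minimal sketch in plain Python on toy rows taken from the preview of the data. In this toy sample the second trip has only one row, so it is dropped; that is an artifact of the sample, not a statement about the real dataset.

```python
from collections import defaultdict

MINIMUM_SESSION_LENGTH = 2

# (utrip_id, checkin, city_id) rows, deliberately out of order.
rows = [
    ("1000027_1", "2016-08-16", 60902),
    ("1000027_1", "2016-08-13", 8183),
    ("1000033_1", "2016-04-09", 38677),
    ("1000027_1", "2016-08-14", 15626),
    ("1000027_1", "2016-08-18", 30628),
]

# Step 1: collect the reservations of each trip.
by_trip = defaultdict(list)
for utrip_id, checkin, city_id in rows:
    by_trip[utrip_id].append((checkin, city_id))

sessions = {}
for utrip_id, visits in by_trip.items():
    visits.sort()  # ISO-formatted dates sort chronologically as strings
    cities = [city for _, city in visits]
    # Steps 2 and 3: record the length and keep only long-enough sessions.
    if len(cities) >= MINIMUM_SESSION_LENGTH:
        sessions[utrip_id] = {
            "city_id_list": cities,
            "city_id_count": len(cities),
        }

print(sessions)
# {'1000027_1': {'city_id_list': [8183, 15626, 60902, 30628], 'city_id_count': 4}}
```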

In [6]:
train_set_dataset = Dataset(train_set)
validation_set_dataset = Dataset(validation_set)

In [7]:
weekday_checkin = (
    ["checkin"]
    >> ops.LambdaOp(lambda col: col.dt.weekday)
    >> ops.Rename(name="weekday_checkin")
)

weekday_checkout = (
    ["checkout"]
    >> ops.LambdaOp(lambda col: col.dt.weekday)
    >> ops.Rename(name="weekday_checkout")
)


groupby_features = (
    ['city_id', 'booker_country', 'utrip_id', 'hotel_country', 'checkin']
    + weekday_checkin
    + weekday_checkout
) >> ops.Groupby(
    groupby_cols=['utrip_id'],
    aggs={
        'city_id': ['list', 'count'],
        'booker_country': ['list'],
        'hotel_country': ['list'],
        'weekday_checkin': ['list'],
        'weekday_checkout': ['list']
    },
    sort_cols="checkin"
)

groupby_features_city = groupby_features['city_id_list'] >> ops.Categorify() >> ops.AddTags([Tags.SEQUENCE, Tags.ITEM, Tags.ITEM_ID])
groupby_features_country = (
    groupby_features['booker_country_list', 'hotel_country_list', 'weekday_checkin_list', 'weekday_checkout_list']
    >> ops.Categorify() >> ops.AddTags([Tags.SEQUENCE, Tags.ITEM])
)
city_id_count = groupby_features['city_id_count'] >> ops.AddTags([Tags.CONTEXT, Tags.ITEM, Tags.CONTINUOUS])

# Filter out sessions with fewer than 2 interactions
MINIMUM_SESSION_LENGTH = 2
filtered_sessions = (
    groupby_features_city + groupby_features_country + city_id_count
    >> ops.Filter(f=lambda df: df["city_id_count"] >= MINIMUM_SESSION_LENGTH)
)
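The `weekday_checkin` and `weekday_checkout` features above are derived with `dt.weekday`, which numbers the days from Monday (0) to Sunday (6). Python's standard `datetime.date.weekday()` follows the same convention, so we can check it on the check-in dates of the first trip from the data preview:

```python
from datetime import date

# Check-in dates of trip 1000027_1 from the preview of the raw data.
checkins = [
    date(2016, 8, 13),
    date(2016, 8, 14),
    date(2016, 8, 16),
    date(2016, 8, 18),
]

# Monday is 0 and Sunday is 6.
weekday_checkin = [d.weekday() for d in checkins]
print(weekday_checkin)  # [5, 6, 1, 3] -> Saturday, Sunday, Tuesday, Thursday
```

Note that the workflow subsequently passes these columns through `Categorify`, so the integers that appear in the processed data are category codes and need not equal the raw weekday numbers.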

In [8]:
wf = Workflow(filtered_sessions)

In [9]:
train_set_processed = wf.fit_transform(train_set_dataset)
validation_set_processed = wf.transform(validation_set_dataset)



Our data consists of sequences of visited `city_id` values, sequences of `booker_country`, `hotel_country`, and check-in and check-out weekday values (all represented as integer categories), and a `city_id_count` column (which contains the number of reservations in a trip).

In [10]:
train_set_processed.compute().head()

| | city_id_list | booker_country_list | hotel_country_list | weekday_checkin_list | weekday_checkout_list | city_id_count |
|---|---|---|---|---|---|---|
| 0 | [8240, 156, 2278, 2097] | [3, 3, 3, 3] | [3, 3, 3, 3] | [5, 7, 4, 3] | [7, 4, 2, 7] | 4 |
| 1 | [63, 1160, 87, 619, 63] | [1, 1, 1, 1, 1] | [1, 1, 1, 1, 1] | [5, 1, 4, 3, 5] | [6, 4, 2, 5, 4] | 5 |
| 2 | [7, 6, 24, 1051, 65, 52, 3] | [2, 2, 2, 2, 2, 2, 2] | [2, 2, 2, 16, 16, 3, 3] | [5, 1, 2, 6, 5, 7, 4] | [6, 3, 1, 5, 7, 4, 3] | 7 |
| 3 | [1032, 757, 140, 3] | [2, 2, 2, 2] | [19, 19, 19, 3] | [1, 4, 2, 3] | [4, 3, 2, 5] | 4 |
| 4 | [3603, 262, 662, 250, 359] | [1, 1, 1, 1, 1] | [30, 30, 30, 30, 30] | [1, 3, 6, 5, 1] | [2, 1, 5, 6, 3] | 5 |
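The integers in these lists are category codes rather than the raw values. As a rough illustration, here is a minimal sketch of frequency-based integer encoding, which is the general idea behind an operator like `Categorify`. The functions `fit_encoding` and `encode`, the choice to reserve index 0 for unseen values, and the alphabetical tie-breaking are assumptions made for this sketch and may not match NVTabular's exact conventions. "Atlantis" is a made-up value that stands in for a category absent from the training data.

```python
from collections import Counter
from typing import Dict, Iterable, List


def fit_encoding(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Map each category seen in training to an integer, most frequent first."""
    counts = Counter(value for seq in sequences for value in seq)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    # Index 0 is kept free for values that were never seen (an assumption).
    return {value: index for index, value in enumerate(ordered, start=1)}


def encode(seq: List[str], mapping: Dict[str, int]) -> List[int]:
    """Encode a sequence, sending unseen values to 0."""
    return [mapping.get(value, 0) for value in seq]


# Fit on the training sequences only, then apply the same mapping elsewhere.
train_sequences = [
    ["Gondal", "Gondal", "Gondal"],
    ["Cobra Island", "Gondal"],
    ["Elbonia"],
]
mapping = fit_encoding(train_sequences)
encoded = encode(["Gondal", "Cobra Island", "Atlantis"], mapping)

print(mapping)  # {'Gondal': 1, 'Cobra Island': 2, 'Elbonia': 3}
print(encoded)  # [1, 2, 0]
```

This mirrors the way the workflow is fitted on the train set and then only applied to the validation set, so both sets share one mapping.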


We are now ready to train our model.

Here is the schema of the data that our model will use.

In [11]:
seq_schema = train_set_processed.schema.select_by_tag(Tags.SEQUENCE)
seq_schema

| | name | tags | dtype | is_list | is_ragged | properties.num_buckets | properties.freq_threshold | properties.max_size | properties.start_index | properties.cat_path | properties.domain.min | properties.domain.max | properties.domain.name | properties.embedding_sizes.cardinality | properties.embedding_sizes.dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | city_id_list | (Tags.ID, Tags.ITEM_ID, Tags.SEQUENCE, Tags.CA... | int64 | True | True | | 0 | 0 | 0 | .//categories/unique.city_id_list.parquet | 0 | 37202 | city_id_list | 37203 | 512 |
| 1 | booker_country_list | (Tags.ITEM, Tags.SEQUENCE, Tags.CATEGORICAL) | int64 | True | True | | 0 | 0 | 0 | .//categories/unique.booker_country_list.parquet | 0 | 5 | booker_country_list | 6 | 16 |
| 2 | hotel_country_list | (Tags.ITEM, Tags.SEQUENCE, Tags.CATEGORICAL) | int64 | True | True | | 0 | 0 | 0 | .//categories/unique.hotel_country_list.parquet | 0 | 194 | hotel_country_list | 195 | 31 |
| 3 | weekday_checkin_list | (Tags.ITEM, Tags.SEQUENCE, Tags.CATEGORICAL) | int64 | True | True | | 0 | 0 | 0 | .//categories/unique.weekday_checkin_list.parquet | 0 | 7 | weekday_checkin_list | 8 | 16 |
| 4 | weekday_checkout_list | (Tags.ITEM, Tags.SEQUENCE, Tags.CATEGORICAL) | int64 | True | True | | 0 | 0 | 0 | .//categories/unique.weekday_checkout_list.par... | 0 | 7 | weekday_checkout_list | 8 | 16 |


Let's also identify the target column.

In [12]:
target = train_set_processed.schema.select_by_tag(Tags.SEQUENCE).column_names[0]
target

'city_id_list'

## Constructing the model

Let's construct our model.

We can specify various hyperparameters, such as the number of heads and number of layers to use.

For the transformer portion of our model, we will use the `XLNet` architecture. Additionally, we are passing `mm.ReplaceMaskedEmbeddings()` as our `pre` block. We will be training a masked language model, and this block replaces the embeddings at the masked positions of our sequences.

Later, when we run the `fit` method on our model, we will specify a `masking_prob` of `0.3`. Through the combination of these parameters, our model will train on sequences where any given timestep is masked with a probability of 0.3, and it will be our model's training task to infer the target value for that step!

To summarize, Masked Language Modeling is implemented by using two blocks in combination:

* `SequenceMaskRandom()` - Used as the `pre` for `model.fit()`, it randomly selects items from the sequence to be masked as prediction targets, using Keras masking.
* `ReplaceMaskedEmbeddings()` - Used as the `pre` for a `TransformerBlock`, it replaces the input embeddings at the masked positions with a dummy trainable embedding, to avoid leaking the targets.
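The two steps can be illustrated on a single session of item IDs. The following is a minimal sketch in plain Python, not the Merlin Models implementation: `mask_random` is a hypothetical helper, the integer `MASK` stands in for the dummy trainable embedding, and any special handling the library may apply (for example when no position or every position ends up masked) is not modelled here.

```python
import random
from typing import List, Optional, Tuple

MASK = -1  # placeholder standing in for the trainable mask embedding


def mask_random(
    seq: List[int], masking_prob: float, rng: random.Random
) -> Tuple[List[int], List[Optional[int]]]:
    """Mask each position independently with probability `masking_prob`."""
    inputs: List[int] = []
    targets: List[Optional[int]] = []
    for item in seq:
        if rng.random() < masking_prob:
            inputs.append(MASK)   # hide the item from the model
            targets.append(item)  # ...and ask the model to predict it
        else:
            inputs.append(item)
            targets.append(None)  # no loss is computed at this position
    return inputs, targets


rng = random.Random(42)  # fixed seed for a reproducible illustration
session = [8240, 156, 2278, 2097]
inputs, targets = mask_random(session, masking_prob=0.3, rng=rng)

# Every position is either visible (no target) or masked (target = original).
for original, seen, target in zip(session, inputs, targets):
    assert (seen == MASK and target == original) or (
        seen == original and target is None
    )
```

Replacing the masked items in the input is what prevents the targets from leaking into the model.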

In [13]:
dmodel = 48
mlp_block = mm.MLPBlock(
    [128, dmodel],
    activation='relu',
    no_activation_last_layer=True,
)
model = mm.Model(
    mm.InputBlockV2(
        seq_schema,
        embeddings=mm.Embeddings(
            train_set_processed.schema.select_by_tag(Tags.CATEGORICAL), sequence_combiner=None
        ),
    ),
    mlp_block,
    mm.XLNetBlock(
        d_model=dmodel,
        n_head=4,
        n_layer=2,
        pre=mm.ReplaceMaskedEmbeddings(),
        post="inference_hidden_state",
    ),
    mm.CategoricalOutput(
        train_set_processed.schema.select_by_name(target),
        default_loss="categorical_crossentropy",
    ),
)

## Model training

In [14]:
model.compile(run_eagerly=False, optimizer='adam', loss="categorical_crossentropy")
model.fit(
    train_set_processed,
    batch_size=64,
    epochs=5,
    pre=mm.SequenceMaskRandom(schema=seq_schema, target=target, masking_prob=0.3),
)



Epoch 1/5


2023-01-10 00:50:29.840560: I tensorflow/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8500








2023-01-10 00:50:37.152856: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model/xl_net_block/replace_masked_embeddings/RaggedWhere/Assert/AssertGuard/branch_executed/_95


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f7e65771430>

## Model evaluation

We have trained our model.

However, the metrics reported during training come from the masked language modeling task: a portion of the steps in each sequence was masked, and the metrics were calculated on those masked positions.

In practice, we probably care more about how well our model does on the next-item prediction task, as it mimics the scenario in which the model would likely be used.

Let's measure the performance of the model on a task where it attempts to predict the last item in a sequence.

We will mask the last item using `SequenceMaskLast` and run inference.
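Conceptually, masking the last item turns each session into a context and a single target. Here is a minimal sketch in plain Python; `mask_last` is a hypothetical helper and `MASK` is a placeholder, so this only illustrates the idea and is not the `SequenceMaskLast` implementation.

```python
from typing import List, Tuple

MASK = -1  # placeholder standing in for the masked position


def mask_last(seq: List[int]) -> Tuple[List[int], int]:
    """Hide the final item of a session and return it as the target."""
    if len(seq) < 2:
        raise ValueError("need at least one item of context and one to predict")
    return seq[:-1] + [MASK], seq[-1]


inputs, target = mask_last([8240, 156, 2278, 2097])
print(inputs)  # [8240, 156, 2278, -1]
print(target)  # 2097
```

Unlike random masking, exactly one position per session is scored, and it is always the final hop.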

In [15]:
model.evaluate(
    validation_set_processed,
    batch_size=128,
    pre=mm.SequenceMaskLast(schema=validation_set_processed.schema, target=target),
    return_dict=True
)

2023-01-10 00:59:14.752516: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model/xl_net_block/replace_masked_embeddings/RaggedWhere/Assert/AssertGuard/branch_executed/_74




{'loss': 0.33035674691200256,
 'recall_at_10': 0.5542407035827637,
 'mrr_at_10': 0.3065614700317383,
 'ndcg_at_10': 0.3654871881008148,
 'map_at_10': 0.3065614700317383,
 'precision_at_10': 0.05542406067252159,
 'regularization_loss': 0.0,
 'loss_batch': 0.28873005509376526}
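To help read these numbers, here is a minimal sketch of the standard top-k ranking metrics for the case where each session has exactly one relevant item (the masked last city). The function `metrics_at_k` and the toy ranking are assumptions made for illustration; whether Merlin Models computes its metrics in exactly this way is not something this sketch claims.

```python
import math
from typing import Dict, List


def metrics_at_k(ranked: List[int], target: int, k: int = 10) -> Dict[str, float]:
    """Top-k metrics for one session with a single relevant item."""
    top_k = ranked[:k]
    if target not in top_k:
        return {"recall": 0.0, "precision": 0.0, "mrr": 0.0, "ndcg": 0.0}
    rank = top_k.index(target) + 1  # 1-based position of the hit
    return {
        "recall": 1.0,                      # the only relevant item was found
        "precision": 1.0 / k,               # one hit among k recommendations
        "mrr": 1.0 / rank,                  # reciprocal rank of the hit
        "ndcg": 1.0 / math.log2(rank + 1),  # ideal DCG is 1 for a single item
    }


# Toy ranking of 11 candidate cities, best first.
ranked_cities = [17, 3, 42, 8, 99, 5, 61, 23, 70, 11, 2]

hit = metrics_at_k(ranked_cities, target=42)   # found at rank 3
miss = metrics_at_k(ranked_cities, target=2)   # ranked 11th, outside the top 10

print(hit)   # recall 1.0, precision 0.1, mrr 0.333..., ndcg 0.5
print(miss)  # all zeros
```

With a single relevant item, precision at 10 is recall at 10 divided by 10, and average precision reduces to the reciprocal rank. This is consistent with the results above, where `precision_at_10` is about one tenth of `recall_at_10` and `map_at_10` equals `mrr_at_10`.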

## Summary

We have trained a transformer model for the next-item prediction task using language model masking.

For another session-based example that goes deeper into data preprocessing and covers several advanced techniques (Weight Tying, Temperature Scaling), please see [Session-Based Next Item Prediction for Fashion E-Commerce](https://github.com/NVIDIA-Merlin/models/blob/t4rec_use_case/examples/usecases/ecommerce-session-based-next-item-prediction-for-fashion.ipynb).