In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions anda
# limitations under the License.
# ==============================================================================

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_models_entertainment-with-pretrained-embeddings/nvidia_logo.png" style="width: 90px; float: right;">

# Transformer-based architecture for next-item prediction task

## Overview

In this use case we will train a Transformer-based architecture for next-item prediction task.

We will use the [booking.com dataset](https://github.com/bookingcom/ml-dataset-mdt) to train a session-based model. The dataset contains 1,166,835 of anonymized hotel reservations in the train set and 378,667 in the test set. Each reservation is a part of a customer's trip (identified by `utrip_id`) which includes consecutive reservations.

We will reshape the data to organize it into 'sessions'. Each session will be a full customer itinerary in chronological order. The goal will be to predict the city_id of the final reservation of each trip.


### Learning objectives

- Training a Transformer-based architecture for next-item prediction task

## Downloading and preparing the dataset

You can download the full dataset from GitHub [here](https://github.com/bookingcom/ml-dataset-mdt). Please place it alongside this notebook (or alternatively, change the `DATAPATH` to point to where it is located).

In [1]:
from merlin.core.dispatch import get_lib
import numpy as np

DATAPATH = 'ml-dataset-mdt'

itineraries = get_lib().read_csv(f'{DATAPATH}/train_set.csv', parse_dates=['checkin'])

Each reservation has a unique `utrip_id`. During each trip a customer vists several destinations.

In [2]:
itineraries.head()

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
0,1000027,2016-08-13,2016-08-14,8183,desktop,7168,Elbonia,Gondal,1000027_1
1,1000027,2016-08-14,2016-08-16,15626,desktop,7168,Elbonia,Gondal,1000027_1
2,1000027,2016-08-16,2016-08-18,60902,desktop,7168,Elbonia,Gondal,1000027_1
3,1000027,2016-08-18,2016-08-21,30628,desktop,253,Elbonia,Gondal,1000027_1
4,1000033,2016-04-09,2016-04-11,38677,mobile,359,Gondal,Cobra Island,1000033_1


We will train on sequences of `city_id` and `booker_country` and based on this information, our model will attempt to predict the next `city_id` (the next hop in the journey).

We will train a transformer model that can work with sequences of variable length within a batch. This functionality is provided to us out of the box and doesn't require any changes to the architecture. Thanks to it we do not have to pad or trim our sequences to any particular length -- our model can make effective use of all of the data!

With one exception. For a masked language model that we will be training, we need to discard sequences that are shorter than two hops. This makes sense as there is nothing our model could learn if it was only presented with an itinerary with a single destination on it!

Let us begin by splitting the data into a train and validation set based on trip ID.

In [3]:
utrip_ids = itineraries.sample(frac=1).utrip_id.unique()
len(utrip_ids)

217686

In [5]:
train_set_utrip_ids = utrip_ids[:int(0.8 * utrip_ids.shape[0])]
validation_set_utrip_ids = utrip_ids[int(0.8 * utrip_ids.shape[0]):]

train_set = itineraries[itineraries.utrip_id.isin(train_set_utrip_ids)].sort_values(['utrip_id', 'checkin'])
validation_set = itineraries[itineraries.utrip_id.isin(validation_set_utrip_ids)].sort_values(['utrip_id', 'checkin'])

In [6]:
from nvtabular import *
from nvtabular import ops
from merlin.models.tf import Loader

from merlin.schema.tags import Tags

2022-11-07 10:15:14.060190: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-07 10:15:14.161119: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-07 10:15:14.715579: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib
2022-1



2022-11-07 10:15:15.214611: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-07 10:15:15.215113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-07 10:15:15.215273: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-07 10:15:15.238098: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

In [7]:
# from merlin.datasets.synthetic import generate_data
# ds = generate_data('bookingdotcom', 10000)

# feats = ['booker_country', 'utrip_id', 'device_class', 'affiliate_id', 'booker_country', 'hotel_country'] >> ops.Categorify() >> ops.TagAsItemFeatures()
# feats += ['city_id'] >> ops.Categorify() >> ops.TagAsItemID()
# feats += ['user_id'] >> ops.Categorify() >> ops.TagAsUserID()


# wf = Workflow(feats)
# o = wf.fit_transform(Dataset(itineraries))

# o.to_parquet('ds.parquet')

We can now begin with data preprocessing.

We will combine trips into "sessions", discard trips that are too short and calculate total trip length.

We will use nvtabular for this work. It offers optimized tabular data preprocessing operators that run on the GPU. If you would like to learn more about this software library, please take a look [here](https://github.com/NVIDIA-Merlin/NVTabular).

In [8]:
from nvtabular import *
from nvtabular import ops
from merlin.models.tf import Loader

from merlin.schema.tags import Tags

In [9]:
train_set_dataset = Dataset(train_set)
validation_set_dataset = Dataset(validation_set)

In [10]:
groupby_features = ['city_id', 'booker_country', 'utrip_id'] >> ops.Groupby(
    groupby_cols=['utrip_id'],
    aggs={
        'city_id': ['list', 'count'],
        'booker_country': ['list']
    }
)

groupby_features_city = groupby_features['city_id_list'] >> ops.Categorify() >> ops.AddTags([Tags.SEQUENCE, Tags.ITEM, Tags.ITEM_ID])
groupby_features_country = groupby_features['booker_country_list'] >> ops.Categorify() >> ops.AddTags([Tags.SEQUENCE, Tags.ITEM])
city_id_count = groupby_features['city_id_count'] >> ops.AddTags([Tags.CONTEXT, Tags.ITEM, Tags.CONTINUOUS])

# Filter out sessions with less than 2 interactions 
MINIMUM_SESSION_LENGTH = 2
filtered_sessions = groupby_features_city + groupby_features_country + city_id_count >> ops.Filter(f=lambda df: df["city_id_count"] >= MINIMUM_SESSION_LENGTH) 

In [11]:
wf = Workflow(filtered_sessions)

In [12]:
train_set_processed = wf.fit_transform(train_set_dataset)
validation_set_processed = wf.fit_transform(validation_set_dataset)



Our data consists of a sequence of visited `city_ids`, a sequence of `booker_countries` (represented as integer categories) and a `city_id_count` column (which contains the count of visited cities in a trip).

In [13]:
train_set_processed.compute().head()

Unnamed: 0,city_id_list,booker_country_list,city_id_count
0,"[6735, 155, 2356, 2446]","[3, 3, 3, 3]",4
1,"[885, 811, 137, 3]","[2, 2, 2, 2]",4
2,"[8, 402, 73, 6, 144, 272, 77, 77, 767, 6099]","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]",10
3,"[116, 151, 54, 341, 467]","[3, 3, 3, 3, 3]",5
4,"[1, 390, 313, 512, 257]","[1, 1, 1, 1, 1]",5


We are now ready to train our model.

In [14]:
import merlin.models.tf as mm

Let's identify two schemas. The first one for sequential features, the other for context features (`city_id_count`) that we will broadcast to the entire sequence.

Here is the schema of the data that our model will use.

In [15]:
seq_schema = train_set_processed.schema.select_by_tag(Tags.SEQUENCE)
seq_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.domain.min,properties.domain.max,properties.domain.name,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension
0,city_id_list,"(Tags.ID, Tags.ITEM_ID, Tags.CATEGORICAL, Tags...",int64,True,True,,0,0,0,.//categories/unique.city_id_list.parquet,0,37224,city_id_list,37225,512
1,booker_country_list,"(Tags.CATEGORICAL, Tags.SEQUENCE, Tags.ITEM)",int64,True,True,,0,0,0,.//categories/unique.booker_country_list.parquet,0,5,booker_country_list,6,16


Let's also identify the target column.

In [16]:
target = train_set_processed.schema.select_by_tag(Tags.SEQUENCE).column_names[0]
target

'city_id_list'

## Constructing the model

We begin by defining our `Loader`. It will be responsible for batching our data and passing it to the model.

In [17]:
loader = Loader(train_set_processed, batch_size=64, shuffle=True)

And now onto model construction.

We can specify various hyperparameters, such as the number of heads and number of layers to use.

In [18]:
model = mm.Model(
    mm.InputBlockV2(
        seq_schema,
        embeddings=mm.Embeddings(
            train_set_processed.schema.select_by_tag(Tags.CATEGORICAL), sequence_combiner=None
        ),
    ),
    mm.XLNetBlock(d_model=40, n_head=4, n_layer=2, pre=mm.ReplaceMaskedEmbeddings()),
    mm.CategoricalOutput(
        train_set_processed.schema.select_by_name(target),
        default_loss="categorical_crossentropy",
    ),
)

## Model training

In [20]:
model.compile(run_eagerly=False, optimizer='adam', loss="categorical_crossentropy")
model.fit(loader, epochs=5, pre=mm.SequenceMaskRandom(schema=seq_schema, target=target, masking_prob=0.3))

Epoch 1/5


2022-11-07 10:16:05.942620: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model/xl_net_block/replace_masked_embeddings/RaggedWhere/Assert/AssertGuard/branch_executed/_9


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f75b78dc400>

## Model evaluation

We have trained our model.

But in training the metrics come from a masked language modelling task. A portion of steps in the sequence was masked for each example. The metrics were calculated on this task.

In reality, we probably care how well our model does on the next item prediction task (as it mimics the scenario in which the model would be likely to be used).

Let's measure the performance of the model on a task where it attempts to predict the last item in a sequence.

We will mask the last item using `SequenceMaskLast` and run inference.

In [21]:
loader_eval = Loader(validation_set_processed, batch_size=128, shuffle=False)

In [22]:
model.evaluate(loader_eval, batch_size=128, pre=mm.SequenceMaskLast(schema=validation_set_processed.schema, target=target))

2022-11-07 10:24:10.846370: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model/xl_net_block/replace_masked_embeddings/RaggedWhere/Assert/AssertGuard/branch_executed/_9




[0.33792245388031006,
 0.54012131690979,
 0.30102217197418213,
 0.3579924404621124,
 0.30102217197418213,
 0.05401213467121124,
 0.0]