In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

## 4. Next Item Prediction with an RNN-based Model

In the previous example, we built, trained, and evaluated an MLP model using sequential and scalar input features. Although we obtained promising accuracy results, an MLP model does not take sequential patterns into account: we had to average the embedding values of the list categorical features and aggregate (average) the list continuous values.
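
To see why averaging discards order, consider the small sketch below in plain Python. The two-dimensional embedding table and the city ids are made up for illustration: two trips that visit the same cities in opposite order collapse to the same averaged vector.

```python
# Toy embedding table: hypothetical city ids mapped to 2-d vectors.
embeddings = {
    101: [1.0, 0.0],
    202: [0.0, 1.0],
    303: [1.0, 1.0],
}

def mean_pool(city_ids):
    """Average the embedding vectors of a sequence, dimension by dimension."""
    vectors = [embeddings[c] for c in city_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

trip_a = [101, 202, 303]
trip_b = [303, 202, 101]  # same cities, reversed order

pooled_a = mean_pool(trip_a)
pooled_b = mean_pool(trip_b)
print(pooled_a == pooled_b)  # the visiting order is lost after averaging
```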

In this example, we use a type of Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM) network [1, 2], to do next-city prediction using a sequence of events (trip history in our example) per user in a given trip (session). RNNs, including LSTMs, are a class of neural networks used for modeling sequence data such as time series or natural language.

There are sequential patterns in our dataset that we want to capture to provide more relevant recommendations. We can do that using an LSTM model.

<center><img src="./images/RNN.png" width=300 height=200/></center>

In general, LSTM models are coupled with the Causal Language Modeling (CLM) pre-training scheme, which is the task of predicting the token that follows a sequence of tokens. The model only attends to the left context, i.e. it models the probability of a token given the previous tokens (city_ids in our case) in a sequence [3].
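
In symbols, for a session of city ids $x_1, \dots, x_T$ (notation introduced here only for illustration), this left-to-right factorization can be written as:

$$
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
$$

Each factor conditions only on the tokens to the left of position $t$, which is what "attends to the left context" means.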

In our case, the input of the LSTM layer is a representation of the user interaction, the internal LSTM hidden state encodes a representation of the session based on past interactions and the outputs are the next-item predictions. For a given trip session, our proposed LSTM model generates logit values as predictions for the user's preference of the next item in the sequence. These logit values represent the likelihood of each item being the next one, and are based on the hidden representation of the last item in the session.

To do so, we train an LSTM model using the [SequencePredictLast](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/sequence.py#L254) technique, where we use the entire sequence to predict the next item (city), rather than using a sliding-window approach to next-item prediction.
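
The difference between the two approaches can be sketched in plain Python. This is only an illustration of the idea, not the Merlin Models implementation; both helper functions and the city ids are hypothetical.

```python
def predict_last_split(sequence):
    """Use everything except the final item as input and the final item as target."""
    return sequence[:-1], sequence[-1]

def sliding_window_splits(sequence):
    """Emit one (prefix, next item) pair for every position after the first."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

trip = [11, 42, 7, 99]  # hypothetical city ids for one trip

inputs, target = predict_last_split(trip)
print(inputs, target)               # [11, 42, 7] 99
print(sliding_window_splits(trip))  # [([11], 42), ([11, 42], 7), ([11, 42, 7], 99)]
```

Predict-last yields a single training example per trip, whereas the sliding approach yields one example per position.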

**Learning Objectives:**<br>

In this lab, participants will learn how to:
- build an LSTM model using Merlin Models
- train and evaluate the LSTM model using sequential categorical and continuous input features for the next-item prediction task.

### 4.1. Import Libraries

In [2]:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
import glob
import gc
import numpy as np

import tensorflow as tf

2023-02-27 16:02:17.049101: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Let's set the seeds to reduce randomness and make model execution as deterministic as possible.

In [3]:
seed=42
tf.random.set_seed(seed)
np.random.seed(seed)

Import Merlin APIs.

In [4]:
from merlin.schema.tags import Tags
from merlin.io.dataset import Dataset

import merlin.models.tf as mm



  warn(f"PyTorch dtype mappings did not load successfully due to an error: {exc.msg}")
2023-02-27 16:02:21.970490: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-27 16:02:23.965928: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2023-02-27 16:02:23.966023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16249 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:2d:00.0, compute capability: 7.0
  from .autonotebook import tqdm as notebook_tqdm


Define data path.

In [5]:
DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '/workspace/data/'
)

Read in the train and validation sets as Merlin Dataset objects. Note that these datasets have a schema associated with them.

In [6]:
train = Dataset(os.path.join(DATA_FOLDER, "train/*.parquet"))
valid = Dataset(os.path.join(DATA_FOLDER, "valid/*.parquet"))



Define target column.

In [7]:
target = train.schema.select_by_tag(Tags.SEQUENCE).column_names[0]
target

'city_id_list'

Let's start with defining our hyper-parameters.

In [8]:
EPOCHS = int(os.environ.get(
    "EPOCHS", 
    '3'
))

dmodel = int(os.environ.get(
    "dmodel", 
    '64'
))

BATCH_SIZE = 1024
LEARNING_RATE = 0.005

Here we define the schema object. We create a subschema by selecting the input features by name using the `select_by_name` method.

In [9]:
train.schema = train.schema.select_by_name(['city_id_list','booker_country_list', 'hotel_country_list',
                                            'weekday_checkin_list','weekday_checkout_list',
                                            'month_checkin_list','num_city_visited', 'length_of_stay_list']
                                          )

valid.schema = train.schema

seq_schema = train.schema.select_by_tag(Tags.SEQUENCE)

In [10]:
train.schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.domain.min,properties.domain.max,properties.domain.name,properties.value_count.min,properties.value_count.max
0,city_id_list,"(Tags.ITEM_ID, Tags.ID, Tags.SEQUENCE, Tags.CA...","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,39665.0,512.0,0.0,39664.0,city_id,0.0,10.0
1,booker_country_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.booker_country_hotel_coun...,196.0,31.0,0.0,195.0,booker_country_hotel_country,0.0,10.0
2,hotel_country_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.booker_country_hotel_coun...,196.0,31.0,0.0,195.0,booker_country_hotel_country,0.0,10.0
3,weekday_checkin_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkin.parquet,8.0,16.0,0.0,7.0,checkin,0.0,10.0
4,weekday_checkout_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkout.parquet,8.0,16.0,0.0,7.0,checkout,0.0,10.0
5,month_checkin_list,"(Tags.LIST, Tags.CATEGORICAL, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkin.parquet,8.0,16.0,0.0,7.0,checkin,0.0,10.0
6,num_city_visited,"(Tags.CONTEXT, Tags.CONTINUOUS)","DType(name='float64', element_type=<ElementTyp...",False,False,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,39665.0,512.0,0.0,39664.0,city_id,,
7,length_of_stay_list,"(Tags.LIST, Tags.SEQUENCE, Tags.CONTINUOUS)","DType(name='float64', element_type=<ElementTyp...",True,True,,,,,,,,,,,0.0,10.0


Define the context schema, which represents the context features (`num_city_visited` in our case). We define a separate schema for the context feature because we need to broadcast this scalar feature to a list feature, and the broadcast step requires its schema.

In [11]:
context_schema = train.schema.select_by_tag(Tags.CONTEXT)
context_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.domain.min,properties.domain.max,properties.domain.name
0,num_city_visited,"(Tags.CONTEXT, Tags.CONTINUOUS)","DType(name='float64', element_type=<ElementTyp...",False,False,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,39665.0,512.0,0.0,39664.0,city_id


We also set the embedding dimension of the categorical features. Note that with the `dim` argument, we feed a fixed dimension for all categorical features. In this example, we set the embedding dimension to `16` for each categorical feature.

### 4.2. Build an LSTM model architecture

In this section, we train an LSTM model, which lets us use the past sequence of interactions as is. The input block concatenates the embedding vectors of all sequential features per step, and then concatenates the continuous features that we created in the `ETL-with-NVTabular` notebook to each step of the input sequence. The concatenated vectors are processed by an LSTM layer, which we then connect to a Multi-Layer Perceptron block. We use the last item in the `city_id_list` column as the target.

We visualize the model architecture in the figure below.

<img src="./images/LSTM.png" width=600 height=400/>

Define the input block.

In [12]:
input_block = mm.InputBlockV2(
    train.schema,
    embeddings=mm.Embeddings(
        seq_schema.select_by_tag(Tags.CATEGORICAL), 
        sequence_combiner=None,
        dim=16
        ),
    post=mm.BroadcastToSequence(context_schema, seq_schema),
)

Note that we set `sequence_combiner` to None here, since we do not want to combine the embeddings across the positions of a given sequence.

We also add the `post` argument to `InputBlockV2` to broadcast the scalar features to list features. With the [BroadcastToSequence](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/features.py#L814) class, we are able to replicate the scalar value (e.g. `num_city_visited` in our example) for each position in the sequence to match the timesteps of the sequence features.
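
Conceptually, broadcasting a scalar to a sequence works as in the plain-Python sketch below. The helper `broadcast_to_sequence` and the sample values are hypothetical and only mimic the idea; they are not the Merlin Models implementation.

```python
def broadcast_to_sequence(scalar_values, sequences):
    """Repeat each row's scalar once per timestep of that row's sequence."""
    return [[value] * len(seq) for value, seq in zip(scalar_values, sequences)]

# Two hypothetical trips of different lengths (a ragged batch).
city_id_list = [[11, 42, 7], [5, 8]]
num_city_visited = [3.0, 2.0]  # one scalar per trip

broadcasted = broadcast_to_sequence(num_city_visited, city_id_list)
print(broadcasted)  # [[3.0, 3.0, 3.0], [2.0, 2.0]]
```

After this step the scalar feature has one value per timestep, so it can be concatenated with the sequential features position by position.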

Let's check the output shape of the input_block.

In [13]:
batch = mm.sample_batch(train, batch_size=BATCH_SIZE, include_targets=False, to_ragged=True)

In [14]:
input_block(batch).shape

TensorShape([1024, None, 98])

We obtain a 3-D sequence representation of shape (`batch_size, sequence_length, sum_of_emb_dim_of_features + number_of_continuous_features`). The total embedding dimension of the six categorical features is 96 (6 x 16), and the two extra dimensions come from the two continuous input features (`length_of_stay_list` and `num_city_visited`). The sequence_length dimension is printed as `None` because the sequence length varies across the examples in a batch.
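
As a quick sanity check, the size of the last dimension can be recomputed by hand. The sketch below simply restates the feature names from the schema selected earlier and the `dim=16` setting.

```python
# Feature names copied from the schema selected earlier in this notebook.
categorical_features = [
    "city_id_list", "booker_country_list", "hotel_country_list",
    "weekday_checkin_list", "weekday_checkout_list", "month_checkin_list",
]
continuous_features = ["length_of_stay_list", "num_city_visited"]

embedding_dim = 16  # the `dim` value passed to mm.Embeddings

# Each categorical feature contributes 16 values per step, each continuous feature 1.
last_dim = len(categorical_features) * embedding_dim + len(continuous_features)
print(last_dim)  # 98
```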

Once we define the input block, we can easily connect it to an LSTM block (layer) using the [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) API. Note that we use the `connect()` method to do so. We set the `units` parameter to `64`, which is the dimensionality of the output space.

In [15]:
from tensorflow.keras import regularizers

dense_block = input_block.connect(
    tf.keras.layers.LSTM(
        64,
        return_sequences=False,
        kernel_regularizer=regularizers.l2(1e-4),
    )
)

When `return_sequences` is set to `False` in a `tf.keras` LSTM layer, only the last hidden state (h_N, the state after the final timestep) is returned, which captures an abstract representation of the input sequence.
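
The effect of this flag can be illustrated with a toy scalar LSTM written in plain Python. This is a simplified sketch under stated assumptions, not the Keras implementation: all weights are fixed to an arbitrary value of 0.5, biases are zero, and the three gates share the same weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_lstm(inputs, return_sequences=False):
    """Scalar LSTM with illustrative fixed weights (0.5) and zero biases."""
    w = u = 0.5
    h, c = 0.0, 0.0
    hidden_states = []
    for x in inputs:
        z = w * x + u * h
        i, f, o = sigmoid(z), sigmoid(z), sigmoid(z)  # gates share weights in this toy
        g = math.tanh(z)          # candidate cell update
        c = f * c + i * g         # new cell state
        h = o * math.tanh(c)      # new hidden state
        hidden_states.append(h)
    # Either every per-step hidden state, or only the one after the final step.
    return hidden_states if return_sequences else hidden_states[-1]

steps = [1.0, -0.5, 2.0]
all_states = toy_lstm(steps, return_sequences=True)
last_state = toy_lstm(steps, return_sequences=False)
print(len(all_states))               # one hidden state per timestep
print(last_state == all_states[-1])  # the single output is the final hidden state
```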

We can add an MLP block as a projection layer. We noticed that adding an extra layer increases the accuracy metrics in this example.

In [16]:
mlp_block = mm.MLPBlock(
    [128, dmodel],
    activation='relu',
    no_activation_last_layer=True,
)

Next, we define the prediction task. Our objective is to predict the city visited at the end of the trip session, which is a multi-class classification task. Accordingly, the `default_loss` in the `CategoricalOutput` class is `categorical_crossentropy`.

In [17]:
prediction_task = mm.CategoricalOutput(
    seq_schema.select_by_name(target), 
    default_loss="categorical_crossentropy"
)

In [18]:
model_lstm = mm.Model(dense_block, mlp_block, prediction_task)

Define optimizer and compile the model.

In [19]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=LEARNING_RATE
)

model_lstm.compile(
    optimizer=optimizer,
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, 
    ),
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[4])
)
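
Because the model outputs logits, the loss is configured with `from_logits=True`, meaning the softmax is applied inside the loss. The plain-Python sketch below shows the standard textbook computation for a single example; the logit values are made up, and this is not the Keras source code.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; subtract the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])

logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores for four candidate cities
loss_correct = cross_entropy_from_logits(logits, target_index=0)
loss_wrong = cross_entropy_from_logits(logits, target_index=2)
print(loss_correct < loss_wrong)  # the loss is lower when the true city has the top logit
```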

The [SequencePredictLast](https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/transforms/sequence.py#L254) class indicates that the last item in the sequence is the target. As a result, all the sequential input features are truncated before the last position, and the target is extracted as the last element of the sequence of city_ids. Accordingly, the LSTM should return the hidden representation of the last item of the truncated input sequence.

In [20]:
predict_last = mm.SequencePredictLast(schema=seq_schema.select_by_tag(Tags.SEQUENCE), target=target)

In [21]:
%%time
history = model_lstm.fit(
    train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    pre=predict_last
)

2023-02-27 15:58:51.148439: I tensorflow/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8700


Epoch 1/3




Epoch 2/3
Epoch 3/3
CPU times: user 1min 35s, sys: 10.9 s, total: 1min 46s
Wall time: 1min 5s


Evaluate the model.

In [22]:
model_lstm.evaluate(
    valid,
    batch_size=BATCH_SIZE,
    pre=predict_last,
    return_dict=True
)



{'loss': 3.390162706375122,
 'recall_at_4': 0.583443284034729,
 'mrr_at_4': 0.4348030388355255,
 'ndcg_at_4': 0.472442090511322,
 'map_at_4': 0.4348030388355255,
 'precision_at_4': 0.14586082100868225,
 'regularization_loss': 0.016703125089406967,
 'loss_batch': 3.327507495880127}
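
To help interpret these numbers, the sketch below implements the standard definitions of a few top-k metrics for the case of a single relevant item per session. The helper functions and sample ranking are hypothetical and are not the Merlin Models implementation. With one relevant item, precision@k equals recall@k divided by k, which is consistent with the reported values (0.5834 / 4 ≈ 0.1459), and average precision coincides with the reciprocal rank, which is why `map_at_4` equals `mrr_at_4`.

```python
def recall_at_k(ranked_items, target, k):
    """1.0 if the single relevant item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def precision_at_k(ranked_items, target, k):
    """Fraction of the top k that is relevant (at most one item here)."""
    return recall_at_k(ranked_items, target, k) / k

def mrr_at_k(ranked_items, target, k):
    """Reciprocal of the target's 1-based rank if it is in the top k, else 0.0."""
    top_k = ranked_items[:k]
    return 1.0 / (top_k.index(target) + 1) if target in top_k else 0.0

ranked = [42, 7, 99, 11, 5]  # hypothetical city ids sorted by predicted score
target = 7

print(recall_at_k(ranked, target, 4))     # 1.0
print(precision_at_k(ranked, target, 4))  # 0.25
print(mrr_at_k(ranked, target, 4))        # 0.5
```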

### Summary

In this lab, we learned how to build an RNN-based model that leverages the sequential input features for the next-item prediction task. We used an LSTM model, and we can observe that this architecture gave us some boost in model accuracy metrics. This might not be surprising, since with an LSTM we do not have to average the sequential feature embeddings as we did in the MLP architecture.

Note that we did not perform hyperparameter tuning. We used the same hyperparameters for the MLP and LSTM models, but this does not mean that each model performs well with the same set of hyperparameters.

At this point, you might still want to explore different architectures and try more sophisticated, yet computationally more expensive ones, such as Transformer-based architectures. Let's move on to the next notebook `05-Next-item-prediction-with-Transformers` to build one using Merlin Models.

Please execute the cell below to shut down the kernel before moving on to the next notebook.

In [23]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### References

[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.<br>
[2] Staudemeyer, Ralf C., and Eric Rothstein Morris. "Understanding LSTM - a tutorial into long short-term memory recurrent neural networks." arXiv preprint arXiv:1909.09586 (2019). Available online: https://arxiv.org/pdf/1909.09586.pdf<br>
[3] Lample, Guillaume, and Alexis Conneau. "Cross-lingual language model pretraining." arXiv preprint arXiv:1901.07291 (2019).