In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

## 2. Data Preprocessing and Feature Engineering with MERLIN

### 2.1. Feature Engineering on GPU with NVTabular

Merlin [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular) is a feature engineering and preprocessing library for tabular data, designed to manipulate terabyte-scale datasets and to train deep learning (DL) based recommender systems. It provides high-level abstractions that simplify code, and it accelerates computation on the GPU using the RAPIDS Dask-cuDF library. To learn more about NVTabular, we recommend the examples in the NVTabular GitHub [repository](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples).

With NVTabular, you can:

- process datasets that exceed GPU and CPU memory without having to worry about scale
- focus on what to do with the data and not how to do it by using abstraction at the operation level
- prepare datasets quickly and easily for experimentation so that more models can be trained

**Learning Objectives**

Our goal is to predict the next city to be visited in a session. Therefore, we reshape the data into sessions; that is, we generate sequential features per session (per trip). Each session will be a full customer itinerary in chronological order.

Below, we perform the following data operations with NVTabular:
- Encode categorical columns with the `Categorify()` operator
- Create temporal features with `LambdaOp`
- Create a new continuous feature using `LambdaOp`
- Group the dataset into sessions with the `Groupby` op
- Transform continuous features with the `LogOp` and `Normalize` operators
- Truncate the sequences using `ListSlice`
- Export the preprocessed datasets as parquet files together with a schema file

### 2.2. Import Libraries

In [2]:
import os

import glob
import cudf 
import gc
import nvtabular as nvt
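# Import only the operators used in this notebook instead of a wildcard import.
from nvtabular.ops import (
    AddTags,
    Categorify,
    Filter,
    Groupby,
    LambdaOp,
    ListSlice,
    LogOp,
    Normalize,
    Rename,
)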

from merlin.schema.tags import Tags
from merlin.io.dataset import Dataset

2023-02-24 21:36:29.248378: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  warn(f"PyTorch dtype mappings did not load successfully due to an error: {exc.msg}")


Define the raw dataset path.

In [3]:
DATA_FOLDER = os.environ.get(
    "DATA_FOLDER", 
    '/workspace/data/'
)

Read in the train and valid parquet files as cudf data frames.

In [4]:
train=cudf.read_parquet(os.path.join(DATA_FOLDER, "train.parquet"))
valid=cudf.read_parquet(os.path.join(DATA_FOLDER, "valid.parquet"))

Let's look at the raw input features and see which ones we can use directly and which new ones we can derive from them. The goal of feature engineering is to adapt the data to the task at hand, primarily to improve the model's predictive power. We can start by selecting the features that are most relevant for predicting the target. From there, we can engineer new features that might correlate better with the target.

In [5]:
print(train.head())

   user_id    checkin   checkout  city_id device_class  affiliate_id  \
0  2000964 2015-12-31 2016-01-01    63341       mobile          8151   
1  2595109 2015-12-31 2016-01-01    27404       mobile           359   
2   727105 2015-12-31 2016-01-01    18820       mobile           359   
3  1032571 2016-01-01 2016-01-02    21996       mobile          9924   
4   110418 2016-01-01 2016-01-02     3763      desktop          9924   

         booker_country hotel_country   utrip_id  
0  The Devilfire Empire  Cobra Island  2000964_1  
1  The Devilfire Empire  Cobra Island  2595109_1  
2  The Devilfire Empire  Cobra Island   727105_1  
3  The Devilfire Empire  Cobra Island  1032571_1  
4  The Devilfire Empire  Glubbdubdrib   110418_1  


The `city_id` column is the main feature for us to use. Using a traveler's history, our goal is to predict the next city the traveler will visit. Note that we could build a model using only the sequence of `city_id` values as input. However, we can also explore which other features to feed to the model to improve its accuracy.

Our dataset has timestamp (`checkin` and `checkout`) columns. We can create temporal features such as the <i>weekday</i>, <i>month</i> or another temporal feature from the `checkin` or `checkout` columns. These features can capture users' temporal behavior and tell us which cities are preferred at which times.

The location of the booker and the location of the hotel can also be two important features for predicting the next city a traveler will visit.

Similarly, we can create other features, such as the length of each stay or the number of cities visited in a given user trip.

Create temporal features and categorify them.

In [6]:
weekday_checkin = (
    ["checkin"]
    >> LambdaOp(lambda col: col.dt.weekday)
    >> Categorify()
    >> Rename(name="weekday_checkin")
)

weekday_checkout = (
    ["checkout"]
    >> LambdaOp(lambda col: col.dt.weekday)
    >> Categorify()
    >> Rename(name="weekday_checkout")
)

month_checkin = (
    ["checkin"]
    >> LambdaOp(lambda col: col.dt.month)
    >> Categorify() 
    >> Rename(name="month_checkin")
)
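To make the lambda functions above concrete, here is a minimal CPU sketch of the values they extract before `Categorify()` is applied. It uses Python's standard `datetime` module as a stand-in for the cuDF `.dt` accessor (an assumption made so the snippet runs without a GPU), and the two dates are the check-in dates from the first rows printed earlier.

```python
from datetime import date

# Check-in dates taken from the first rows of the training set printed above.
checkins = [date(2015, 12, 31), date(2016, 1, 1)]

# date.weekday() follows the same convention as the .dt.weekday accessor:
# Monday is 0 and Sunday is 6.
weekday_checkin = [d.weekday() for d in checkins]
month_checkin = [d.month for d in checkins]

print(weekday_checkin)  # [3, 4] -> Thursday, Friday
print(month_checkin)    # [12, 1]
```

Because `Categorify()` then replaces these raw values with integer category ids, the numbers in the transformed dataset need not match the calendar values.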

Create a new continuous feature from the length of each stay, in days.

In [7]:
def length_stay(col, gdf):
    stay_length = (gdf['checkout'] - col).dt.days
    return stay_length

    
length_of_stay = (['checkin'] 
                  >> LambdaOp(length_stay, dependency=['checkout']) 
                  >> LogOp() 
                  >> Normalize()
                  >> AddTags([Tags.SEQUENCE])
                  >> Rename(name="length_of_stay")
                 )
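The chain above can be traced by hand on a few rows. The sketch below is a plain-Python approximation, not NVTabular's implementation: it assumes `LogOp` computes `log(1 + x)` and that `Normalize` standardizes with the mean and standard deviation of the fitted data (the population standard deviation is used here purely for illustration). The first two stays come from the rows printed earlier; the third is a hypothetical three-night stay added so the values are not all identical.

```python
import math
from datetime import date
from statistics import fmean, pstdev

# (checkin, checkout) pairs; the last one is a hypothetical example.
stays = [
    (date(2015, 12, 31), date(2016, 1, 1)),
    (date(2016, 1, 1), date(2016, 1, 2)),
    (date(2016, 1, 2), date(2016, 1, 5)),
]

# Step 1: length of stay in days, as in length_stay().
days = [(checkout - checkin).days for checkin, checkout in stays]

# Step 2: log transform to compress long stays (assumed to be log(1 + x)).
logged = [math.log1p(d) for d in days]

# Step 3: standardize to zero mean and unit variance.
mean, std = fmean(logged), pstdev(logged)
normalized = [(x - mean) / std for x in logged]

print(days)  # [1, 1, 3]
print([round(x, 3) for x in normalized])
```

In the real workflow the mean and standard deviation are computed once on the training set during `workflow.fit()` and then reused for the validation set.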

Let's group the interactions (each stay in a trip) into sessions. Currently, every row in the dataset is one visited city. Our goal is to predict (and recommend) the final city (`city_id`) of each trip (`utrip_id`). Therefore, we group the dataset by `utrip_id` to get one row per prediction. Each row will hold the sequence of encoded city ids that the user visited. The NVTabular `Groupby` op performs this transformation by sorting the rows by `checkin` date and then aggregating the interactions per `utrip_id` using the aggregation methods we define below.

In [8]:
city_cat = ['city_id'] >> Categorify() 

# jointly encode
location = [['booker_country', 'hotel_country']] >> Categorify()

# Filter out the rows where the encoded city_id is 0.
# This matters for the validation set, where out-of-vocabulary (OOV) cities are mapped to 0.
filtered_feats= (
    city_cat + ['utrip_id', 'checkin'] + location + weekday_checkin + weekday_checkout + month_checkin + length_of_stay 
    >> Filter(f=lambda df: df["city_id"]!=0)
)

groupby_features = (filtered_feats
                    >> Groupby(
                        groupby_cols=['utrip_id'],
                        aggs={
                            'city_id': ['list', 'count', 'last'],
                            'booker_country': ['list'],
                            'hotel_country': ['list'],
                            'weekday_checkin': ['list'],
                            'weekday_checkout': ['list'],
                            "month_checkin": ['list'],
                            "length_of_stay": ['list'],
                        },
                        sort_cols=["checkin"]
                    )
                   )

groupby_features_city = (groupby_features['city_id_list'] 
                         >> AddTags([Tags.ITEM, Tags.ITEM_ID, Tags.SEQUENCE])
                        )

groupby_features_country = (
    groupby_features['booker_country_list', 'hotel_country_list']
    >> AddTags([Tags.SEQUENCE])
)

groupby_features_time = (
    groupby_features['weekday_checkin_list', 'weekday_checkout_list', 'month_checkin_list']
    >> AddTags([Tags.SEQUENCE])
)
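As an illustration of what the `Groupby` op produces for `city_id`, the following plain-Python sketch sorts a handful of rows by `checkin` and then computes the `list`, `count` and `last` aggregations per `utrip_id`. The rows and trip ids are hypothetical, and the output column names simply mirror the `<column>_<agg>` names used above.

```python
from collections import defaultdict

# Hypothetical interaction rows, deliberately out of chronological order.
rows = [
    {"utrip_id": "A_1", "checkin": "2016-01-03", "city_id": 7},
    {"utrip_id": "B_1", "checkin": "2016-01-01", "city_id": 5},
    {"utrip_id": "A_1", "checkin": "2016-01-01", "city_id": 3},
    {"utrip_id": "A_1", "checkin": "2016-01-02", "city_id": 9},
    {"utrip_id": "B_1", "checkin": "2016-01-04", "city_id": 3},
]

# Sort by check-in date first (ISO date strings sort chronologically),
# then collect the city ids of each trip in that order.
sessions = defaultdict(list)
for row in sorted(rows, key=lambda r: r["checkin"]):
    sessions[row["utrip_id"]].append(row["city_id"])

# One output row per trip with the three aggregations requested for city_id.
grouped = {
    trip: {
        "city_id_list": cities,
        "city_id_count": len(cities),
        "city_id_last": cities[-1],
    }
    for trip, cities in sessions.items()
}

print(grouped["A_1"])  # {'city_id_list': [3, 9, 7], 'city_id_count': 3, 'city_id_last': 7}
print(grouped["B_1"])  # {'city_id_list': [5, 3], 'city_id_count': 2, 'city_id_last': 3}
```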

We truncate the sequence features to their last `SESSIONS_MAX_LENGTH` items, which is set to 10 in this example. In addition, we filter out the sessions that have fewer than 2 stays.

In [9]:
SESSIONS_MAX_LENGTH= 10
truncated_features = (groupby_features_city + groupby_features_country + groupby_features_time + groupby_features['length_of_stay_list']
                      >> ListSlice(-SESSIONS_MAX_LENGTH) 
                     )

# Filter out sessions with fewer than 2 interactions
MINIMUM_SESSION_LENGTH = 2
filtered_sessions = (groupby_features['utrip_id',  'city_id_count'] + truncated_features 
                     >> Filter(f=lambda df: df["city_id_count"] >= MINIMUM_SESSION_LENGTH)
                    )

num_city_visited = (filtered_sessions['city_id_count']
               >> LogOp() 
               >> Normalize()
               >> Rename(name="num_city_visited")
               >> AddTags([Tags.CONTEXT,Tags.CONTINUOUS])
              )

list_feats = ['city_id_list', 'booker_country_list', 'hotel_country_list', 'weekday_checkin_list', 'weekday_checkout_list', 'month_checkin_list', 'length_of_stay_list']
outputs = filtered_sessions[list_feats, 'utrip_id'] + num_city_visited 
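The effect of `ListSlice(-SESSIONS_MAX_LENGTH)` followed by the `Filter` on `city_id_count` can be sketched with ordinary Python slicing. The sessions below are hypothetical; the point is that a negative start index keeps the most recent items, and that the count is taken before truncation, so it can exceed the list length.

```python
SESSIONS_MAX_LENGTH = 10
MINIMUM_SESSION_LENGTH = 2

# Hypothetical sessions of encoded city ids, in chronological order.
sessions = {
    "long_trip": list(range(1, 13)),  # 12 cities
    "short_trip": [4, 8, 15],
    "single_stop": [42],
}

processed = {}
for trip, cities in sessions.items():
    count = len(cities)  # counted before truncation, like city_id_count
    if count < MINIMUM_SESSION_LENGTH:
        continue  # drop sessions that are too short to predict from
    processed[trip] = {
        # A negative start keeps the last SESSIONS_MAX_LENGTH items.
        "city_id_list": cities[-SESSIONS_MAX_LENGTH:],
        "city_id_count": count,
    }

print(processed["long_trip"])   # {'city_id_list': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 'city_id_count': 12}
print(processed["short_trip"])  # {'city_id_list': [4, 8, 15], 'city_id_count': 3}
print("single_stop" in processed)  # False
```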

Initialize the NVTabular workflow graph. When we initialize a `Workflow` with our pipeline, the workflow organizes the input and output columns.

In [10]:
workflow = nvt.Workflow(outputs)

Create NVTabular `Dataset` objects from our raw data frames. Then we calculate the statistics for this workflow on the training set with the `workflow.fit()` method, so that the workflow can use these statistics to transform any given input. Note that when we export the files to disk, we also export a `schema.pbtxt` file that we will use during the modeling step.

In [11]:
train_dataset = Dataset(train)
valid_dataset = Dataset(valid)

# fit data
workflow.fit(train_dataset)

# transform train set and save data to disk
workflow.transform(train_dataset).to_parquet(os.path.join(DATA_FOLDER, "train/"))



Now we can transform our validation set and export the transformed dataset to disk as parquet files.

In [12]:
workflow.transform(valid_dataset).to_parquet(os.path.join(DATA_FOLDER, "valid/"))
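Fitting on the training set and transforming the validation set with the same statistics explains why some validation cities end up encoded as 0. The sketch below illustrates the idea in plain Python. It is not NVTabular's implementation: the id assignment (ids starting at 1 in descending frequency order, 0 reserved for unseen values) is an assumption for illustration, and the exact scheme depends on the NVTabular version. City `99999` is a hypothetical id that never appears in the training data.

```python
from collections import Counter

train_cities = [63341, 27404, 27404, 18820, 27404, 18820]
valid_cities = [27404, 99999, 18820]  # 99999 is not in the training data

# "Fit": build the vocabulary from the training set only.
# Assumption: ids start at 1, most frequent first; 0 is reserved for OOV.
counts = Counter(train_cities)
vocab = {city: idx for idx, (city, _) in enumerate(counts.most_common(), start=1)}

# "Transform": apply the fitted vocabulary; unseen cities fall back to 0.
encoded_valid = [vocab.get(city, 0) for city in valid_cities]

# Same idea as Filter(f=lambda df: df["city_id"] != 0) in the workflow.
filtered_valid = [idx for idx in encoded_valid if idx != 0]

print(vocab)           # {27404: 1, 18820: 2, 63341: 3}
print(encoded_valid)   # [1, 0, 2]
print(filtered_valid)  # [1, 2]
```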

We can inspect the output schema of the workflow to see what metadata it stores.

In [13]:
workflow.output_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.start_index,properties.cat_path,properties.domain.min,properties.domain.max,properties.domain.name,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.value_count.min,properties.value_count.max
0,utrip_id,(),"DType(name='object', element_type=<ElementType...",False,False,,,,,,,,,,,,
1,city_id_list,"(Tags.LIST, Tags.SEQUENCE, Tags.CATEGORICAL, T...","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,0.0,39664.0,city_id,39665.0,512.0,0.0,10.0
2,booker_country_list,"(Tags.CATEGORICAL, Tags.LIST, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.booker_country_hotel_coun...,0.0,195.0,booker_country_hotel_country,196.0,31.0,0.0,10.0
3,hotel_country_list,"(Tags.CATEGORICAL, Tags.LIST, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.booker_country_hotel_coun...,0.0,195.0,booker_country_hotel_country,196.0,31.0,0.0,10.0
4,weekday_checkin_list,"(Tags.CATEGORICAL, Tags.LIST, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkin.parquet,0.0,7.0,checkin,8.0,16.0,0.0,10.0
5,weekday_checkout_list,"(Tags.CATEGORICAL, Tags.LIST, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkout.parquet,0.0,7.0,checkout,8.0,16.0,0.0,10.0
6,month_checkin_list,"(Tags.CATEGORICAL, Tags.LIST, Tags.SEQUENCE)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,0.0,.//categories/unique.checkin.parquet,0.0,7.0,checkin,8.0,16.0,0.0,10.0
7,length_of_stay_list,"(Tags.CONTINUOUS, Tags.LIST, Tags.SEQUENCE)","DType(name='float64', element_type=<ElementTyp...",True,True,,,,,,,,,,,0.0,10.0
8,num_city_visited,"(Tags.CONTINUOUS, Tags.CONTEXT)","DType(name='float64', element_type=<ElementTyp...",False,False,,0.0,0.0,0.0,.//categories/unique.city_id.parquet,0.0,39664.0,city_id,39665.0,512.0,,


Let's print the head of our preprocessed train dataset. Notice that each example (row) is now a session, and the sequential features describing the user's interactions have been converted to lists of matching length.

In [14]:
df=cudf.read_parquet(os.path.join(DATA_FOLDER, 'train', 'part_0.parquet'))

In [15]:
print(df.head(2))

    utrip_id             city_id_list booker_country_list hotel_country_list  \
0  1000027_1  [8264, 154, 2312, 2027]        [2, 2, 2, 2]       [1, 1, 1, 1]   
1  1000033_1  [62, 1258, 90, 629, 62]     [1, 1, 1, 1, 1]    [4, 4, 4, 4, 4]   

  weekday_checkin_list weekday_checkout_list month_checkin_list  \
0         [5, 7, 4, 3]          [7, 4, 2, 7]       [0, 0, 0, 0]   
1      [5, 1, 4, 3, 5]       [6, 4, 2, 5, 4]    [6, 6, 6, 6, 6]   

                                 length_of_stay_list  num_city_visited  
0  [-0.736162543296814, 0.4681011438369751, 0.468...         -0.798553  
1  [0.4681011438369751, -0.736162543296814, 0.468...         -0.085908  
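A quick way to confirm that the list columns of a session line up is to compare their lengths row by row. The sketch below copies the fully displayed list columns from the two rows above into plain Python dictionaries (`length_of_stay_list` is left out because its values are truncated in the printout); `session_length` is a hypothetical helper, not part of NVTabular.

```python
# Values copied from the two preprocessed rows printed above.
rows = [
    {
        "utrip_id": "1000027_1",
        "city_id_list": [8264, 154, 2312, 2027],
        "booker_country_list": [2, 2, 2, 2],
        "hotel_country_list": [1, 1, 1, 1],
        "weekday_checkin_list": [5, 7, 4, 3],
        "weekday_checkout_list": [7, 4, 2, 7],
        "month_checkin_list": [0, 0, 0, 0],
    },
    {
        "utrip_id": "1000033_1",
        "city_id_list": [62, 1258, 90, 629, 62],
        "booker_country_list": [1, 1, 1, 1, 1],
        "hotel_country_list": [4, 4, 4, 4, 4],
        "weekday_checkin_list": [5, 1, 4, 3, 5],
        "weekday_checkout_list": [6, 4, 2, 5, 4],
        "month_checkin_list": [6, 6, 6, 6, 6],
    },
]


def session_length(row):
    """Return the common length of all list columns, or raise if they differ."""
    lengths = {len(value) for value in row.values() if isinstance(value, list)}
    if len(lengths) != 1:
        raise ValueError(f"list columns differ in length for {row['utrip_id']}")
    return lengths.pop()


lengths = {row["utrip_id"]: session_length(row) for row in rows}
print(lengths)  # {'1000027_1': 4, '1000033_1': 5}
```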


Save the workflow.

In [16]:
workflow.save(os.path.join(DATA_FOLDER, "workflow_etl"))

In [17]:
del train, valid, train_dataset, valid_dataset, df
gc.collect()

1017

### Summary

In this lab, we learned how to transform our dataset and create sequential features to train and evaluate a session-based recommendation model.

Please execute the cell below to shut down the kernel before moving on to the next notebook `03-Next-item-prediction-with-MLP`.

In [18]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}