# Session-based Recs with Transformers4Rec: ETL

This notebook follows the step-by-step Transformers4Rec tutorial:
https://nvidia-merlin.github.io/Transformers4Rec/main/examples/tutorial/index.html

## Imports

In [2]:
import os
import numpy as np
import pandas as pd
import glob

## Read E-Commerce Data
Preprocessing already carried out:
- `event_time` $\rightarrow$ `event_time_ts`: the time the event happened (UTC), stored as integer nanoseconds since the Unix epoch
- `prod_first_event_time_ts`: the timestamp at which an item was first seen
- removed rows where `user_session` is null (2 rows removed)
- label encoded `user_session` $\rightarrow$ integers
- removed consecutively repeated (user, item) interactions
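
The preprocessing code itself is not part of this notebook. As a rough illustration only, the steps listed above could look like the sketch below on a toy frame. The function name `preprocess_events`, the toy data, and the choice to detect repeats within a session (rather than per `user_id`) are assumptions, not the code that produced `2019-Oct-Processed.csv`.

```python
import pandas as pd


def preprocess_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the listed preprocessing steps."""
    # drop rows without a session id
    out = raw.dropna(subset=["user_session"]).copy()

    # event_time (string) -> integer nanoseconds since the Unix epoch (UTC)
    event_time = pd.to_datetime(out["event_time"], utc=True)
    epoch = pd.Timestamp("1970-01-01", tz="UTC")
    out["event_time_ts"] = (event_time - epoch) // pd.Timedelta(1, unit="ns")

    # first time each product appears anywhere in the data
    out["prod_first_event_time_ts"] = (
        out.groupby("product_id")["event_time_ts"].transform("min")
    )

    # label encode the session id as integers
    out["user_session"] = out["user_session"].astype("category").cat.codes

    # drop an interaction if the previous event in the same session
    # (assumption: "user" means session here) was on the same product
    out = out.sort_values(["user_session", "event_time_ts"])
    previous_item = out.groupby("user_session")["product_id"].shift()
    out = out[out["product_id"] != previous_item]

    return out.drop(columns="event_time").reset_index(drop=True)


# toy data: session "a" views product 1 twice in a row, one row has no session
toy = pd.DataFrame({
    "event_time": [
        "2019-10-01T00:00:00Z", "2019-10-01T00:00:05Z", "2019-10-01T00:00:09Z",
        "2019-10-01T00:00:20Z", "2019-10-01T00:00:30Z", "2019-09-30T23:00:00Z",
    ],
    "user_session": ["a", "a", "a", "a", None, "b"],
    "product_id": [1, 1, 2, 1, 3, 2],
})
clean = preprocess_events(toy)
print(clean)
```

Only the immediately repeated view of product 1 is dropped; the later return to product 1 after viewing product 2 is kept, since it is not a consecutive repeat.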

In [6]:
# define where the data is kept
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", '../data/')
df = pd.read_csv(os.path.join(INPUT_DATA_DIR, '2019-Oct-Processed.csv'), index_col=0)

In [7]:
df.head()

Unnamed: 0,event_type,product_id,category_id,category_code,brand,price,user_id,event_time_ts,user_session,prod_first_event_time_ts
0,view,1004768,2053013555631882655,electronics.smartphone,samsung,251.47,546521725,1570361085000000000,2,1569890856000000000
1,view,1005098,2053013555631882655,electronics.smartphone,samsung,152.58,546521725,1570361154000000000,2,1569896681000000000
2,view,1005073,2053013555631882655,electronics.smartphone,samsung,1153.03,546521725,1570361159000000000,2,1569888057000000000
3,view,1004871,2053013555631882655,electronics.smartphone,samsung,286.6,546521725,1570361199000000000,2,1569896322000000000
4,view,1004751,2053013555631882655,electronics.smartphone,samsung,197.15,546521725,1570361213000000000,2,1569896971000000000


In [8]:
df.shape

(6390928, 10)

In [9]:
df.isnull().any()

event_type                  False
product_id                  False
category_id                 False
category_code                True
brand                        True
price                       False
user_id                     False
event_time_ts               False
user_session                False
prod_first_event_time_ts    False
dtype: bool

## Categorical Feature Encoding
- label encode all categorical features
- add one to all category encodings as `0` is reserved for padding

In [23]:
cat_feats = ['user_session', 'category_code', 'brand', 'user_id', 'product_id', 'category_id', 'event_type']

In [24]:
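for cat in cat_feats:
    # label encode: pandas assigns codes 0..n-1, and -1 to missing values
    df[cat] = df[cat].astype('category').cat.codes
    # shift by one so real categories start at 1 (0 is reserved for padding);
    # note that missing values (-1) therefore become 0 as well
    df[cat] = df[cat] + 1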
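
To make the effect of the encoding concrete, here is a small sketch on a toy `brand` column (the toy data and the `<missing>` placeholder are illustrative assumptions). It shows that a null value ends up with id `0`, the same id reserved for padding, which matters here because `category_code` and `brand` contain nulls. The second variant is one possible way to give nulls their own id, if that distinction is wanted.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"brand": ["samsung", np.nan, "apple", "samsung"]})

# same recipe as the loop above: category codes, then shift by one
codes = toy["brand"].astype("category").cat.codes  # NaN is coded as -1
toy["brand_id"] = codes + 1                        # so NaN becomes 0

# optional variant (assumption, not part of the tutorial recipe):
# treat missing values as a category of their own before encoding
filled = toy["brand"].fillna("<missing>")
toy["brand_id_no_pad_clash"] = filled.astype("category").cat.codes + 1

print(toy)
```

With the second variant every row gets an id of at least `1`, so `0` stays exclusive to padding.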

In [25]:
df.head()

Unnamed: 0,event_type,product_id,category_id,category_code,brand,price,user_id,event_time_ts,user_session,prod_first_event_time_ts
0,3,790,85,95,2076,251.47,598877,1570361085000000000,1,1569890856000000000
1,3,1046,85,95,2076,152.58,598877,1570361154000000000,1,1569896681000000000
2,3,1022,85,95,2076,1153.03,598877,1570361159000000000,1,1569888057000000000
3,3,875,85,95,2076,286.6,598877,1570361199000000000,1,1569896322000000000
4,3,780,85,95,2076,197.15,598877,1570361213000000000,1,1569896971000000000
