# 1 - Introduction

Name: M Naufal Indriatmoko

LinkedIn: https://bit.ly/naufal-linkedin

GitHub: https://bit.ly/naufal-git

---
# 2 - Importing Libraries

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

---
# 3 - Data Loading

In [2]:
file = 'dataset/first question/user-events.csv'
df = pd.read_csv(file)

Dataset overview:

In [3]:
df.head()

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-09-24 11:57:06 UTC | view | 1996170 | 2144415922528452715 | electronics.telephone | | 31.9 | 1515915625519388267 | LJuJVLEjPT |
| 1 | 2020-09-24 11:57:26 UTC | view | 139905 | 2144415926932472027 | computers.components.cooler | zalman | 17.16 | 1515915625519380411 | tdicluNnRY |
| 2 | 2020-09-24 11:57:27 UTC | view | 215454 | 2144415927158964449 | | | 9.81 | 1515915625513238515 | 4TMArHtXQy |
| 3 | 2020-09-24 11:57:33 UTC | view | 635807 | 2144415923107266682 | computers.peripherals.printer | pantum | 113.81 | 1515915625519014356 | aGFYrNgC08 |
| 4 | 2020-09-24 11:57:36 UTC | view | 3658723 | 2144415921169498184 | | cameronsino | 15.87 | 1515915625510743344 | aa4mmk0kwQ |


No column descriptions are provided, but the column names are self-explanatory. Prices are assumed to be in USD.

In [29]:
df.event_type.value_counts()

view        793748
cart         54035
purchase     37346
Name: event_type, dtype: int64

The event_type column would be the target variable for a classification model that predicts purchase events. Purchases make up only about 4% of all events, so the classes are heavily imbalanced.
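
The class shares behind the counts above can be computed directly. The sketch below is a minimal illustration that rebuilds the counts from the printed output instead of loading the file; the names `counts` and `shares` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Event counts copied from the value_counts() output above.
counts = pd.Series({"view": 793748, "cart": 54035, "purchase": 37346})

# Share of each event type, as a percentage of all events.
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

On the full DataFrame, `df['event_type'].value_counts(normalize=True)` should give the same shares as fractions.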

In [10]:
df.dtypes

event_time       datetime64[ns, UTC]
event_type                    object
product_id                     int64
category_id                    int64
category_code                 object
brand                         object
price                        float64
user_id                        int64
user_session                  object
event_day                      int64
event_hour                     int64
event_minute                   int64
dtype: object

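The event_time column holds date and time information, but when first read from the CSV it is stored as 'object' (plain strings). The output above was captured after the conversion cell below had already been run, which is why it already shows the converted dtype and the three derived columns. Converting the column to 'datetime' format makes the analysis more convenient.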

In [9]:
df['event_time'] = pd.to_datetime(df['event_time'])
df['event_day'] = df['event_time'].dt.day
df['event_hour'] = df['event_time'].dt.hour
df['event_minute'] = df['event_time'].dt.minute

In [12]:
df.dtypes

event_time       datetime64[ns, UTC]
event_type                    object
product_id                     int64
category_id                    int64
category_code                 object
brand                         object
price                        float64
user_id                        int64
user_session                  object
event_day                      int64
event_hour                     int64
event_minute                   int64
dtype: object

The event_time column is now in 'datetime' format. Three additional columns hold the day of the month, hour, and minute of each event.
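
As a small self-contained illustration of the conversion, the sketch below parses two timestamps copied from the dataset preview and derives the same three columns. Passing `utc=True` is an assumption added here to make the time zone handling explicit; the original cell relies on the default parsing. The frame name `sample` is hypothetical.

```python
import pandas as pd

# Two timestamps copied from the df.head() preview.
sample = pd.DataFrame({
    "event_time": ["2020-09-24 11:57:06 UTC", "2020-09-24 11:57:26 UTC"]
})

# utc=True (an assumption, not in the original cell) forces a tz-aware UTC result.
sample["event_time"] = pd.to_datetime(sample["event_time"], utc=True)

# Derived columns, using the same accessors as the notebook.
sample["event_day"] = sample["event_time"].dt.day        # day of the month
sample["event_hour"] = sample["event_time"].dt.hour
sample["event_minute"] = sample["event_time"].dt.minute
print(sample.dtypes)
```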

In [14]:
# additional info
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns.')
print('Total number of missing values: ', df.isnull().sum().sum())
missval_list = df.columns[df.isnull().any()].tolist()
print('Columns with missing values: ', missval_list)
for j in missval_list:
    # multiply by 100 so the printed value is a true percentage
    print(f'Percentage of missing values in {j}: ', round(df[j].isnull().sum() / df.shape[0] * 100, 2), '%')

The dataset has 885129 rows and 12 columns.
Total number of missing values:  448748
Columns with missing values:  ['category_code', 'brand', 'user_session']
Percentage of missing values in category_code:  26.69 %
Percentage of missing values in brand:  23.99 %
Percentage of missing values in user_session:  0.02 %


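Three columns contain missing values: roughly 27% of category_code and 24% of brand are missing, along with a small number of user_session values. They will be handled later in the data preprocessing step.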
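
A more compact way to get per-column missing-value percentages is `isnull().mean()`, which yields the fraction of missing values directly. The sketch below uses a small hypothetical frame (`toy`) in place of the real dataset, so the numbers it prints are not those of the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset.
toy = pd.DataFrame({
    "category_code": ["electronics.telephone", np.nan, np.nan, "computers.peripherals.printer"],
    "brand": [np.nan, "zalman", "pantum", "cameronsino"],
    "price": [31.9, 17.16, 9.81, 113.81],
})

# Fraction of missing values per column, converted to a percentage.
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually have missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```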

---
# 4 - Exploratory Data Analysis (EDA)

## 4.1 - Descriptive Statistics

In [28]:
df.describe(include='all', datetime_is_numeric=True)

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session | event_day | event_hour | event_minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 885129 | 885129 | 885129.0 | 885129.0 | 648910 | 672765 | 885129.0 | 885129.0 | 884964 | 885129.0 | 885129.0 | 885129.0 |
| unique | | 3 | | | 107 | 999 | | | 490398 | | | |
| top | | view | | | computers.components.videocards | asus | | | nFlhu5QzOd | | | |
| freq | | 793748 | | | 116717 | 27706 | | | 572 | | | |
| mean | 2020-12-14 11:05:10.680594944+00:00 | | 1906621.0 | 2.144423e+18 | | | 146.328713 | 1.515916e+18 | | 16.056461 | 12.41559 | 29.630257 |
| min | 2020-09-24 11:57:06+00:00 | | 102.0 | 2.144416e+18 | | | 0.22 | 1.515916e+18 | | 1.0 | 0.0 | 0.0 |
| 25% | 2020-11-05 20:48:22+00:00 | | 698803.0 | 2.144416e+18 | | | 26.46 | 1.515916e+18 | | 9.0 | 8.0 | 15.0 |
| 50% | 2020-12-14 15:34:14+00:00 | | 1452883.0 | 2.144416e+18 | | | 65.71 | 1.515916e+18 | | 16.0 | 12.0 | 30.0 |
| 75% | 2021-01-23 07:16:12+00:00 | | 3721194.0 | 2.144416e+18 | | | 190.49 | 1.515916e+18 | | 24.0 | 17.0 | 45.0 |
| max | 2021-02-28 23:59:09+00:00 | | 4183880.0 | 2.227847e+18 | | | 64771.06 | 1.515916e+18 | | 31.0 | 23.0 | 59.0 |


Some insights from the table above:
- The dataset covers a little over five months, from 24 September 2020 to 28 February 2021.
- Prices range from $0.22 to $64,771.06.
- There are 999 unique brands in the dataset.
- computers.components.videocards is the most frequent product category (116,717 events).
- Asus is the most frequent brand (27,706 events).
- The most frequent event type is 'view' (793,748 events).
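
The time span stated in the list above can be checked from the `min` and `max` rows of the table. A minimal sketch, with the two timestamps copied from the table; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# First and last event times, copied from the describe() output.
first_event = pd.Timestamp("2020-09-24 11:57:06", tz="UTC")
last_event = pd.Timestamp("2021-02-28 23:59:09", tz="UTC")

# Timedelta between the two; .days gives the number of whole days.
span = last_event - first_event
print(span.days)
```

On the full DataFrame, the same value should come from `df['event_time'].max() - df['event_time'].min()`.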

---
# 5 - Data Preprocessing

---
# 6 - Model Definition

---
# 7 - Model Training

---
# 8 - Model Evaluation

---
# 9 - Conclusion