<a href="https://colab.research.google.com/github/MeinHserhT/CS14115/blob/main/Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

GitHub link:
https://github.com/MeinHserhT/CS14115/blob/main/Final.ipynb

References:
- Exploratory data analysis: https://www.kaggle.com/code/jsaguiar/complete-exploratory-analysis-all-columns/notebook
- Feature engineering and model training: https://www.kaggle.com/competitions/ga-customer-revenue-prediction/discussion/82614

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/'gStore Revenue Prediction'/data

Mounted at /content/drive
/content/drive/MyDrive/gStore Revenue Prediction/data


In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
from os import listdir
from datetime import datetime, timedelta
import ast

import gc
gc.enable()

import warnings
warnings.filterwarnings('ignore')

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]
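
To make the behaviour of `get_most_common` concrete, here is a small check on a made-up list (the browser names are illustrative only, not taken from the dataset). On ties, `max` returns the first maximal item, so the value seen first wins.

```python
from collections import Counter

get_most_common = lambda values: max(Counter(values).items(), key=lambda x: x[1])[0]

# Toy session values for one visitor (made up for illustration)
browsers = ['Chrome', 'Safari', 'Chrome', 'Firefox']
most_common_browser = get_most_common(browsers)
print(most_common_browser)
```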

# 1.&nbsp;Understand the problem

- Overview: The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

- Goal: Analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to <b>predict revenue per customer</b>

- Data format: 
    + Each row in the dataset is one visit to the store. 
    + <b>Not all rows in test_v2.csv will correspond to a row in the submission</b>, but all unique fullVisitorIds will correspond to a row in the submission.
    + Due to the formatting of fullVisitorId you must <b>load the Id's as strings in order for all Id's to be properly unique!</b>
    + There are multiple columns which contain <b>JSON blobs of varying depth</b>. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.
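
A minimal sketch of the two data-format points above, using two made-up rows rather than the real files: reading `fullVisitorId` as a string keeps leading zeros (so ids stay unique), and a JSON column such as `totals` can be flattened into sub-columns. The ids and values here are assumptions for illustration only.

```python
import io
import json
import pandas as pd

# Two made-up rows in the same layout as the competition CSVs
csv_text = (
    'fullVisitorId,totals\n'
    '0001234567890123456,"{""hits"": ""1"", ""pageviews"": ""1""}"\n'
    '1234567890123456,"{""hits"": ""4"", ""transactionRevenue"": ""15000000""}"\n'
)

# dtype=str keeps leading zeros, so the two ids above stay distinct
df = pd.read_csv(io.StringIO(csv_text), dtype={'fullVisitorId': str})

# Parse the JSON blob and flatten it into 'totals.<key>' columns
totals = pd.json_normalize(df['totals'].apply(json.loads).tolist()).add_prefix('totals.')
df = pd.concat([df.drop(columns='totals'), totals], axis=1)
print(df)
```

If the id column were parsed as an integer, both rows would collapse to the same number, which is why the string dtype matters.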


- Training data: user transactions collected from the GStore worldwide, from 01/08/2016 to 30/04/2018.
- Test data: ALL users' transactions in a later period.
 + Public LB: calculated for visitors during the timeframe 01/05/2018 to 15/10/2018.
 + Private LB: calculated on the future-looking timeframe 01/12/2018 to 31/01/2019, for that **same** set of users.
 
 $\Rightarrow$ Therefore, the score of a submission on the public LB timeframe will differ from its private LB score, which is recalculated on the future timeframe.
 
 
- Input: all transactions of a user from 01/05/2018 to 15/10/2018.
- Output: total revenue of that user during the prediction period (01/12/2018 to 31/01/2019).
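
The dates above explain the constants 168, 46 and 62 used in the sliding-window code of section 5.2. A quick check with the standard library:

```python
from datetime import date

# Competition dates quoted above
input_start, input_end = date(2018, 5, 1), date(2018, 10, 15)
output_start, output_end = date(2018, 12, 1), date(2019, 1, 31)

input_days = (input_end - input_start).days + 1      # inclusive length of the input window
gap_days = (output_start - input_end).days - 1       # days strictly between the two windows
output_days = (output_end - output_start).days + 1   # inclusive length of the prediction window
print(input_days, gap_days, output_days)
```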
 
 We are predicting the <b>natural log of the sum of all transactions per user</b>. 
 
$$
y_{user} = \sum_{i=1}^{n} transaction_{user_i} 
$$
$$
target_{user} = \ln({y_{user}+1})
$$
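
As a small worked example of the two formulas, using made-up revenue values (the column name `revenue` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up sessions: visitor 'A' buys twice, visitor 'B' never buys
sessions = pd.DataFrame({
    'fullVisitorId': ['A', 'A', 'B'],
    'revenue': [10.0, 5.0, 0.0],
})

# y_user = sum of the user's transactions; target_user = ln(y_user + 1)
y_user = sessions.groupby('fullVisitorId')['revenue'].sum()
target_user = np.log1p(y_user)
print(target_user)
```

Adding 1 before taking the logarithm keeps the target defined (and equal to 0) for users with no revenue.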
 

- External Data: is <b>permitted</b> for this competition. This includes the <a href="https://support.google.com/analytics/answer/6367342#access&zippy=%2Cin-this-article">Google Merchandise Store Demo Account</a>. Although the Demo Account contains the predicted variable, final standings will not benefit from access to this external data, because it requires future-looking predictions.

- Evaluation Metric:
 
 Submissions are scored on the root mean squared error. RMSE is defined as:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum^n_{i=1}(y_i - \hat{y}_i)^2} $$
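
A minimal sketch of the metric in NumPy, evaluated on made-up targets and predictions (the helper name `rmse` is ours, not part of the competition code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up log-revenue targets and predictions
score = rmse([0.0, 2.0, 0.0], [0.0, 1.0, 1.0])
print(score)
```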

![](https://drive.google.com/uc?export=view&id=1RXHGiWn7RGhTvxjnpFnpzRS4munE4pM6)


# 2.&nbsp;Prepare data
Follow the link: https://www.kaggle.com/code/minhvngc/exploration

- Because the dataset is large (train_v2.csv: 25.41GB, test_v2.csv: 7.62GB), we read it in chunks with `chunksize`, extract the JSON columns (based on: https://www.kaggle.com/code/codlife/pre-processing-for-huge-train-data-with-chunksize/notebook), and save each column to a separate file (each file contains all values of one column, e.g. train_fullVisitorId.csv, test_device.browser, ...).
- After processing the first chunk, we found that the train_hits file is very large (1~2GB) and we had about 17 chunks to read, so we could not save it (the Kaggle maximum output size is 20GB: https://www.kaggle.com/docs/notebooks).

- We do not use the 'hits' column because it is very large (~20GB according to https://www.kaggle.com/competitions/ga-customer-revenue-prediction/discussion/71048).
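
The chunked reading described above can be sketched as follows. This is not the preparation notebook itself: the CSV is a four-row stand-in built in memory, the chunk size is tiny, and the columns are kept in a dictionary instead of being written to one file per column.

```python
import io
import json
import pandas as pd

# Stand-in for train_v2.csv: four made-up rows with one JSON column
csv_text = 'fullVisitorId,device\n' + ''.join(
    f'{i:019d},"{{""browser"": ""Chrome"", ""isMobile"": false}}"\n' for i in range(4)
)

column_parts = {}  # column name -> list of per-chunk Series
reader = pd.read_csv(io.StringIO(csv_text), dtype={'fullVisitorId': str}, chunksize=2)
for chunk in reader:
    # Flatten the JSON column of this chunk into 'device.<key>' columns
    flat = pd.json_normalize(chunk['device'].apply(json.loads).tolist()).add_prefix('device.')
    flat.index = chunk.index
    chunk = pd.concat([chunk.drop(columns='device'), flat], axis=1)
    for col in chunk.columns:
        column_parts.setdefault(col, []).append(chunk[col])

# In the notebook each column is written to its own file; here they stay in memory
columns = {col: pd.concat(parts) for col, parts in column_parts.items()}
print(sorted(columns))
```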





# 3.&nbsp;Explore data

Follow the link: https://colab.research.google.com/drive/121tKvyBSNPEtPmau4dhL1NjIFrzw9xZl

> For each file (column), we read it and explore the frequency of its values and its description.

#  4.&nbsp;Preprocess data

## 4.1 Load data
- Read files and merge them into train dataframe and test dataframe. 
- Notice: fullVisitorId must be read as <b>string</b>


In [None]:
lst_file = listdir()
lst_train_file = [f for f in lst_file if 'train' in f]
lst_test_file = [f for f in lst_file if 'test' in f]

train_default_df = pd.DataFrame()
test_default_df = pd.DataFrame()

In [None]:
%%time
for f in lst_train_file:
    col = f[:-4]
    if col == 'train_fullVisitorId':
        train_default_df[col] = pd.read_csv(f, names = ['index', col], dtype=str).set_index('index')[col].to_list()
    else:
        train_default_df[col] = pd.read_csv(f, names = ['index', col],).set_index('index')[col]

train_df = train_default_df.copy()
train_df.columns = [c[6:] for c in train_df.columns]
train_df.head(2)

CPU times: user 27 s, sys: 3.65 s, total: 30.7 s
Wall time: 52 s


index,channelGrouping,customDimensions,date,device.browser,device.browserVersion,device.browserSize,device.deviceCategory,device.flashVersion,device.isMobile,device.language,device.mobileDeviceBranding,device.mobileDeviceInfo,device.mobileDeviceModel,device.mobileDeviceMarketingName,device.operatingSystem,device.mobileInputSelector,device.operatingSystemVersion,device.screenColors,device.screenResolution,fullVisitorId,geoNetwork.city,geoNetwork.cityId,geoNetwork.continent,geoNetwork.country,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.metro,geoNetwork.networkDomain,geoNetwork.networkLocation,geoNetwork.region,geoNetwork.subContinent,socialEngagementType,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.sessionQualityDim,totals.timeOnSite,totals.totalTransactionRevenue,totals.transactionRevenue,totals.transactions,totals.visits,trafficSource.adContent,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.campaignCode,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source,visitId,visitNumber,visitStartTime
0,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",20171016,Firefox,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,3162355547410993243,not available in demo dataset,not available in demo dataset,Europe,Germany,not available in demo dataset,not available in demo dataset,not available in demo dataset,(not set),not available in demo dataset,not available in demo dataset,Western Europe,Not Socially Engaged,1.0,1,1.0,1.0,1.0,,,,,1,,,not available in demo dataset,,,,,(not set),,,water bottle,organic,,google,1508198450,1,1508198450
1,Referral,"[{'index': '4', 'value': 'North America'}]",20171016,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Chrome OS,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,8934116514970143966,Cupertino,not available in demo dataset,Americas,United States,not available in demo dataset,not available in demo dataset,San Francisco-Oakland-San Jose CA,(not set),not available in demo dataset,California,Northern America,Not Socially Engaged,,2,,2.0,2.0,28.0,,,,1,,,not available in demo dataset,,,,,(not set),,,,referral,/a/google.com/transportation/mtv-services/bikes/bike2workmay2016,sites.google.com,1508176307,6,1508176307


In [None]:
%%time
test_default_df = pd.DataFrame()
for f in lst_test_file:
    col = f[:-4]
    if col == 'test_fullVisitorId':
        test_default_df[col] = pd.read_csv(f, names=['index', col], dtype=str).set_index('index')[col].to_list()
    else:
        test_default_df[col] = pd.read_csv(f, names=['index', col]).set_index('index')[col]

test_df = test_default_df.copy()
test_df.columns = [c[5:] for c in test_df.columns]
# test_df.head(2)

CPU times: user 6.96 s, sys: 932 ms, total: 7.89 s
Wall time: 8.63 s


## 4.2 Drop bad columns

### 4.2.1 Columns with only 1 value

In [None]:
only_1_lst = [col for col in train_df.columns 
                    if len(train_df[col].value_counts(dropna=False)) == 1]
train_df[only_1_lst].head()

index,device.browserVersion,device.browserSize,device.flashVersion,device.language,device.mobileDeviceBranding,device.mobileDeviceInfo,device.mobileDeviceModel,device.mobileDeviceMarketingName,device.mobileInputSelector,device.operatingSystemVersion,device.screenColors,device.screenResolution,geoNetwork.cityId,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.networkLocation,socialEngagementType,totals.visits,trafficSource.adwordsClickInfo.criteriaParameters
0,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
1,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
2,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
3,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
4,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset


- Columns whose only value is `'not available in demo dataset'` are not provided in this competition.
- Column `socialEngagementType` is always `'Not Socially Engaged'` because this is an ecommerce site that does not contain any social engagement (likes, loves, ...).
- Column `totals.visits` is always `1`: according to the Google Analytics export schema, the value is `1` for sessions with interaction events.

In [None]:
train_df.drop(columns = only_1_lst, inplace=True)
test_df.drop(columns = only_1_lst, inplace=True)

### 4.2.2 Handle column 'customDimensions'
This column has a special format: a list containing a single JSON object -> split it into 2 columns: `customDimensions.index` and `customDimensions.value`

In [None]:
train_df['customDimensions'].value_counts(dropna=False)

[{'index': '4', 'value': 'North America'}]      768223
[]                                              333235
[{'index': '4', 'value': 'EMEA'}]               313991
[{'index': '4', 'value': 'APAC'}]               222071
[{'index': '4', 'value': 'South America'}]       45553
[{'index': '4', 'value': 'Central America'}]     25264
Name: customDimensions, dtype: int64

In [None]:
def handle_customDimensions(df):
    # convert the string representation of a list into a real list
    df['customDimensions'] = df['customDimensions'].apply(ast.literal_eval)

    # keep the single dict in the list; use an empty dict when the list is empty
    df['customDimensions'] = df['customDimensions'].apply(lambda x: x[0] if len(x) == 1 else {})

    # flatten the dicts into 'customDimensions.index' and 'customDimensions.value'
    splitted_df = pd.json_normalize(df['customDimensions'].tolist())
    splitted_df.index = df.index
    for col in splitted_df.columns:
        df['customDimensions.' + col] = splitted_df[col]
    df.drop('customDimensions', axis=1, inplace=True)

handle_customDimensions(train_df)
handle_customDimensions(test_df)

In [None]:
train_df.drop('customDimensions.index', axis=1, inplace=True)
test_df.drop('customDimensions.index', axis=1, inplace=True)

In [None]:
temp_df = train_df[['customDimensions.value', 'geoNetwork.continent']].drop_duplicates()
temp_df = temp_df.groupby('customDimensions.value', dropna=False)['geoNetwork.continent'].apply(list).to_frame()
temp_df

customDimensions.value,geoNetwork.continent
APAC,"[Asia, Oceania, Americas, Europe, (not set), Africa]"
Central America,"[Americas, Europe]"
EMEA,"[Europe, Asia, Americas, Africa, (not set)]"
North America,"[Americas, Europe, Asia, (not set), Africa, Oceania]"
South America,"[Americas, Europe]"
,"[Europe, Asia, Americas, Oceania, (not set), Africa]"


In [None]:
# https://en.wikipedia.org/wiki/North_America
temp_df = train_df[['customDimensions.value', 'geoNetwork.continent', 'geoNetwork.subContinent', 'geoNetwork.country']]
temp_df = temp_df.drop_duplicates()
temp_df = temp_df[(temp_df['customDimensions.value'] == 'North America') & (temp_df['geoNetwork.continent'] == 'Americas')]
temp_df

index,customDimensions.value,geoNetwork.continent,geoNetwork.subContinent,geoNetwork.country
1,North America,Americas,Northern America,United States
14,North America,Americas,Northern America,Canada
19785,North America,Americas,South America,Venezuela
222928,North America,Americas,South America,Colombia
373973,North America,Americas,Central America,Costa Rica
396716,North America,Americas,Central America,Mexico
1030353,North America,Americas,South America,Bolivia
1310915,North America,Americas,Caribbean,Puerto Rico
1373630,North America,Americas,South America,Brazil
1591621,North America,Americas,Caribbean,Dominican Republic


In [None]:
train_df.drop('customDimensions.value', axis=1, inplace=True)
test_df.drop('customDimensions.value', axis=1, inplace=True)

### 4.2.3 Handle column 'visitId'

Combined with `fullVisitorId`, this column identifies a session ⟶ it is not a useful feature for the learning model.

In [None]:
train_df.drop(columns = ['visitId'], inplace=True)
test_df.drop(columns = ['visitId'], inplace=True)

In [None]:
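# treat the '(not set)' placeholder in trafficSource.campaign as missing
train_df['trafficSource.campaign'] = train_df['trafficSource.campaign'].replace('(not set)', np.nan)
test_df['trafficSource.campaign'] = test_df['trafficSource.campaign'].replace('(not set)', np.nan)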

### 4.2.4 Handle columns with too many `NaN` values



In [None]:
percent_missing = train_df.isnull().sum() * 100 / len(train_df)
percent_missing[percent_missing > 90].sort_values(ascending=False)

trafficSource.campaignCode                      99.999941
totals.totalTransactionRevenue                  98.916256
totals.transactionRevenue                       98.916256
totals.transactions                             98.913622
trafficSource.adContent                         96.210525
trafficSource.adwordsClickInfo.adNetworkType    95.593727
trafficSource.adwordsClickInfo.page             95.593727
trafficSource.adwordsClickInfo.isVideoAd        95.593727
trafficSource.adwordsClickInfo.slot             95.593727
trafficSource.adwordsClickInfo.gclId            95.585005
trafficSource.campaign                          93.923272
dtype: float64

- Column `trafficSource.campaignCode` only exists in `train_df`.
- Columns `totals.totalTransactionRevenue` and `totals.transactions` are used to calculate the target ⟶ cannot be dropped.
- Column `totals.transactionRevenue` is deprecated, according to this [*source*](https://support.google.com/analytics/answer/3437719?hl=vi).
- The other columns have too many missing values ⟶ can be dropped.

In [None]:
nan_lst = percent_missing[percent_missing > 90].index.to_list()
nan_lst.remove('totals.totalTransactionRevenue')
nan_lst.remove('totals.transactions')

train_df.drop(columns = nan_lst, inplace=True)

nan_lst.remove('trafficSource.campaignCode')
test_df.drop(columns = nan_lst, inplace=True)

### 4.2.5 Handle column 'visitStartTime'

This column is the Unix timestamp of the session start time; its date information is already available in `date`.
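
As an illustration, the `visitStartTime` of the first row shown in section 4.1 can be decoded with pandas. The value is interpreted as seconds since the Unix epoch in UTC.

```python
import pandas as pd

# visitStartTime of the first row shown in section 4.1
visit_start_time = 1508198450

# Interpret the value as seconds since the Unix epoch (UTC)
start = pd.to_datetime(visit_start_time, unit='s')
print(start)
```

The UTC calendar day can differ from the `date` column (20171016 for that row), presumably because `date` is recorded in the store's local time zone; that explanation is an assumption and is not verified here.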

In [None]:
train_df.drop('visitStartTime', axis=1, inplace=True)
test_df.drop('visitStartTime', axis=1, inplace=True)

### 4.2.6 Handle column 'geoNetwork.networkDomain'
This is a high-cardinality categorical column, so we drop it.

In [None]:
train_df.drop('geoNetwork.networkDomain', axis=1, inplace=True)
test_df.drop('geoNetwork.networkDomain', axis=1, inplace=True)

## 4.3 Handle missing value


### 4.3.1 Categorical columns

In [None]:
cat_cols = train_df.select_dtypes(include=['object']).columns.tolist()
percent_missing = train_df[cat_cols].isnull().sum() * 100 / len(train_df)
percent_missing.sort_values(ascending=False)

trafficSource.isTrueDirect    68.711209
trafficSource.referralPath    66.852910
trafficSource.keyword         61.626014
channelGrouping                0.000000
device.browser                 0.000000
device.deviceCategory          0.000000
device.operatingSystem         0.000000
fullVisitorId                  0.000000
geoNetwork.city                0.000000
geoNetwork.continent           0.000000
geoNetwork.country             0.000000
geoNetwork.metro               0.000000
geoNetwork.region              0.000000
geoNetwork.subContinent        0.000000
trafficSource.medium           0.000000
trafficSource.source           0.000000
dtype: float64

- Column 'trafficSource.isTrueDirect' is a Boolean column -> convert [True/False] to [1/0].
- Column 'trafficSource.referralPath' has NaN when 'trafficSource.medium' is "referral" -> we don't need to use it.
- Column 'trafficSource.keyword': we cannot fill its NaN values because we do not understand their meaning (they are somehow different from 'not set' and 'not provided') -> we don't use it.

In [None]:
def handle_category(df):
    df['trafficSource.isTrueDirect'] = df['trafficSource.isTrueDirect'].replace(True, 1)
    df.drop('trafficSource.referralPath', axis=1, inplace=True)
    df.drop('trafficSource.keyword',axis=1, inplace=True)

handle_category(train_df)
handle_category(test_df)

### 4.3.2 Numeric columns 

In [None]:
num_cols = train_df._get_numeric_data().columns.to_list()
percent_missing = train_df[num_cols].isnull().sum() * 100 / len(train_df)
percent_missing.sort_values(ascending=False)

totals.totalTransactionRevenue    98.916256
totals.transactions               98.913622
trafficSource.isTrueDirect        68.711209
totals.timeOnSite                 51.178076
totals.bounces                    48.980910
totals.sessionQualityDim          48.893983
totals.newVisits                  23.467676
totals.pageviews                   0.013990
date                               0.000000
device.isMobile                    0.000000
totals.hits                        0.000000
visitNumber                        0.000000
dtype: float64

In [None]:
train_df[train_df['totals.pageviews'].isnull()]['totals.totalTransactionRevenue'].value_counts(dropna=False).to_frame().reset_index().rename(columns={'index': 'totalTransactionRevenue', 'totals.totalTransactionRevenue': 'value_counts'}, index={0: 'NULL pageviews'})

Unnamed: 0,totalTransactionRevenue,value_counts
NULL pageviews,,239


- 'pageviews': a NaN value is unreasonable (possibly an outlier) -> drop rows with NaN pageviews

In [None]:
def handle_numeric(df):
    df['totals.totalTransactionRevenue'] = df['totals.totalTransactionRevenue'].fillna(0.)
    df['totals.transactions'] = df['totals.transactions'].fillna(0.) 
    df['trafficSource.isTrueDirect'] = df['trafficSource.isTrueDirect'].fillna(0.)
    df['totals.timeOnSite'] = df['totals.timeOnSite'].fillna(0.) 
    df['totals.bounces'] = df['totals.bounces'].fillna(0.)
    df['totals.sessionQualityDim'] = df['totals.sessionQualityDim'].fillna(0.)
    df['totals.newVisits'] = df['totals.newVisits'].fillna(0.)
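    # drop rows with NaN pageviews in place; reassigning `df` would not change the caller's dataframe
    df.dropna(subset=['totals.pageviews'], inplace=True)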
handle_numeric(train_df)
handle_numeric(test_df)

In [None]:
# boolean columns have 2 values
train_df[[col for col in num_cols if len(train_df[col].value_counts()) == 2]]

index,device.isMobile,totals.bounces,totals.newVisits,trafficSource.isTrueDirect
0,False,1.0,1.0,0.0
1,False,0.0,0.0,0.0
2,True,0.0,1.0,1.0
3,False,0.0,1.0,0.0
4,False,0.0,1.0,0.0
...,...,...,...,...
1708332,False,0.0,1.0,0.0
1708333,True,0.0,1.0,0.0
1708334,True,0.0,1.0,0.0
1708335,False,0.0,1.0,0.0


In [None]:
train_df['device.isMobile'] = train_df['device.isMobile'].astype(int)
test_df['device.isMobile'] = test_df['device.isMobile'].astype(int)

In [None]:
train_df

index,channelGrouping,date,device.browser,device.deviceCategory,device.isMobile,device.operatingSystem,fullVisitorId,geoNetwork.city,geoNetwork.continent,geoNetwork.country,geoNetwork.metro,geoNetwork.region,geoNetwork.subContinent,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.sessionQualityDim,totals.timeOnSite,totals.totalTransactionRevenue,totals.transactions,trafficSource.isTrueDirect,trafficSource.medium,trafficSource.source,visitNumber
0,Organic Search,20171016,Firefox,desktop,0,Windows,3162355547410993243,not available in demo dataset,Europe,Germany,not available in demo dataset,not available in demo dataset,Western Europe,1.0,1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,organic,google,1
1,Referral,20171016,Chrome,desktop,0,Chrome OS,8934116514970143966,Cupertino,Americas,United States,San Francisco-Oakland-San Jose CA,California,Northern America,0.0,2,0.0,2.0,2.0,28.0,0.0,0.0,0.0,referral,sites.google.com,6
2,Direct,20171016,Chrome,mobile,1,Android,7992466427990357681,not available in demo dataset,Americas,United States,not available in demo dataset,not available in demo dataset,Northern America,0.0,2,1.0,2.0,1.0,38.0,0.0,0.0,1.0,(none),(direct),1
3,Organic Search,20171016,Chrome,desktop,0,Windows,9075655783635761930,not available in demo dataset,Asia,Turkey,not available in demo dataset,not available in demo dataset,Western Asia,0.0,2,1.0,2.0,1.0,1.0,0.0,0.0,0.0,organic,google,1
4,Organic Search,20171016,Chrome,desktop,0,Windows,6960673291025684308,not available in demo dataset,Americas,Mexico,not available in demo dataset,not available in demo dataset,Central America,0.0,2,1.0,2.0,1.0,52.0,0.0,0.0,0.0,organic,google,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708332,Social,20170104,Chrome,desktop,0,Windows,5123779100307500332,not available in demo dataset,Americas,Puerto Rico,not available in demo dataset,not available in demo dataset,Caribbean,0.0,17,1.0,15.0,0.0,626.0,0.0,0.0,0.0,referral,youtube.com,1
1708333,Social,20170104,Chrome,mobile,1,Android,7231728964973959842,not available in demo dataset,Asia,Sri Lanka,not available in demo dataset,not available in demo dataset,Southern Asia,0.0,18,1.0,13.0,0.0,258.0,0.0,0.0,0.0,referral,youtube.com,1
1708334,Social,20170104,Android Webview,mobile,1,Android,5744576632396406899,Seoul,Asia,South Korea,(not set),Seoul,Eastern Asia,0.0,24,1.0,21.0,0.0,991.0,0.0,0.0,0.0,referral,youtube.com,1
1708335,Social,20170104,Chrome,desktop,0,Windows,2709355455991750775,not available in demo dataset,Asia,Indonesia,not available in demo dataset,not available in demo dataset,Southeast Asia,0.0,24,1.0,22.0,0.0,1274.0,0.0,0.0,0.0,referral,facebook.com,1


- Categorical columns: 12 (except 'fullVisitorId')
- Boolean columns: 4
- Numerical columns: 7 (except 'date')

# 5.&nbsp;Feature engineering

## 5.1 Handle column 'date'
Change column's datatype from int to datetime

In [None]:
train_df["date"] = pd.to_datetime(train_df["date"].astype(str), format="%Y%m%d")
test_df["date"] = pd.to_datetime(test_df["date"].astype(str), format="%Y%m%d")

## 5.2 Create TimeFrame

In [None]:
cat_cols = train_df.select_dtypes(include=['object']).columns.tolist()
cat_cols.remove('fullVisitorId')

In [None]:
category_dict = {}
for col in cat_cols:
    category_dict[col] = sorted(set(train_df[col]))

In [None]:
# based on the referenced solution's idea, we simulate the submission timeframe with a sliding window over the train dataframe
def getTimeFramewithFeatures(df, start_date):
    # input timeframe
    df_168 = df[(df['date'] >= start_date) & (df['date'] < start_date + np.timedelta64(168, 'D'))]
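    # use the bounds of the 168-day input window so the date features are comparable across windows
    max_date = df_168['date'].max()
    min_date = df_168['date'].min()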

    # output timeframe
    df_62 = df[(df['date'] >= start_date + np.timedelta64(168+46, 'D')) & (df['date'] < start_date + np.timedelta64(168+46+62, 'D'))]
    df_62 = df_62.groupby('fullVisitorId')[['totals.totalTransactionRevenue']].sum().apply(np.log1p, axis=1).reset_index()
    df_62.rename(columns={'totals.totalTransactionRevenue': 'target'}, inplace=True)
    df_62['isReturned'] = 1
    df_visitorId = pd.DataFrame(data={'fullVisitorId': df_168['fullVisitorId'].unique()})
    df_62 = pd.merge(df_visitorId, df_62, how='left')
    df_62.fillna(0, inplace=True)

    percent_dict = {}
    for col in cat_cols:
        # share of sessions per category value; naming the axis and the series explicitly works across pandas versions
        percent_dict[col] = df_168[col].value_counts(normalize=True).rename_axis(col).rename('%' + col).reset_index()
    
    df_168 = df_168.groupby('fullVisitorId').agg({
        # date
        'date': [
            ('firstSes', 'min'), 
            ('lastSes', 'max'),
            ('unique', 'nunique'),
            ('session_cnt', 'count')
        ],        
        
        # categorical
        'geoNetwork.city': [('geoNetwork.city', get_most_common)], 
        'geoNetwork.region': [('geoNetwork.region', get_most_common)],
        'trafficSource.source' : [('trafficSource.source', get_most_common)], 
        'geoNetwork.country' : [('geoNetwork.country', get_most_common)],
        'device.browser' : [('device.browser', get_most_common)],
        'geoNetwork.metro': [('geoNetwork.metro', get_most_common)],
        'device.operatingSystem': [('device.operatingSystem', get_most_common)],
        'geoNetwork.subContinent': [('geoNetwork.subContinent', get_most_common)],
        'channelGrouping':[('channelGrouping', get_most_common)],
        'trafficSource.medium': [('trafficSource.medium', get_most_common)],
        'geoNetwork.continent': [('geoNetwork.continent', get_most_common)],
        'device.deviceCategory': [('device.deviceCategory', get_most_common)],

        #boolean
        'device.isMobile': [('device.isMobile', 'mean')],
        'totals.bounces': [('totals.bounces', 'mean')],
        'totals.newVisits':  [('totals.newVisits', 'mean')],
        'trafficSource.isTrueDirect':  [('trafficSource.isTrueDirect', 'mean')],

        # numerical
        'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('totals.transactions_sum', 'sum')],
        'visitNumber': [('visitNumber_max', 'max')],
        'totals.timeOnSite': [
            ('timeOnSite_sum', 'sum'),
            ('timeOnSite_min', 'min'),
            ('timeOnSite_max', 'max'),
            ('timeOnSite_mean', 'mean'),
        ],
        'totals.sessionQualityDim': [
            ('sessionQualityDim_sum', 'sum'),
            ('sessionQualityDim_min', 'min'),
            ('sessionQualityDim_max', 'max'),
            ('sessionQualityDim_mean', 'mean'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'mean'),
        ],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'mean'),
        ],
    })
    df_168.columns = df_168.columns.droplevel()

    df_168['interval'] = (df_168['lastSes'] - df_168['firstSes']).astype(int)/10**9/86400
    df_168['firstSes'] = (df_168['firstSes'] - min_date).astype(int)/10**9/86400
    df_168['lastSes'] = (max_date - df_168['lastSes']).astype(int)/10**9/86400

    df_168 = df_168.reset_index()

    for col in cat_cols:
        df_168 = pd.merge(df_168, percent_dict[col])

    final_df = pd.merge(df_168, df_62)
    return final_df
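
The three date features above turn timedeltas into fractional days by casting to integer nanoseconds. A minimal sketch on made-up timestamps (the `sessions` frame and its values are hypothetical) suggests that `.dt.total_seconds() / 86400` gives the same numbers without depending on the timedelta unit:

```python
import pandas as pd

# Hypothetical first/last session timestamps for two visitors.
sessions = pd.DataFrame({
    'firstSes': pd.to_datetime(['2017-01-01 00:00:00', '2017-01-10 12:00:00']),
    'lastSes': pd.to_datetime(['2017-01-03 00:00:00', '2017-01-10 18:00:00']),
})

delta = sessions['lastSes'] - sessions['firstSes']

# Notebook style: nanoseconds -> seconds -> days.
# The explicit cast to nanoseconds makes the divisor 10**9 valid for any input unit.
interval_ns = delta.astype('timedelta64[ns]').astype('int64') / 10**9 / 86400

# Equivalent form that does not depend on the stored unit.
interval_days = delta.dt.total_seconds() / 86400
```

Both series hold the gap between first and last session in days, so either form could back the `interval`, `firstSes` and `lastSes` features.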

In [None]:
def getDataFrameBySlidingWindow(input_df, window_size, output_path):
    start_date = input_df['date'].min()

    while start_date + np.timedelta64(168+46+62,'D') < input_df['date'].max():
        df = getTimeFramewithFeatures(input_df, start_date)

        for col in cat_cols:
            default_dict = pd.Series(sorted(set(train_df[col]))).to_dict()
            reversed_dict = dict(zip(default_dict.values(), default_dict.keys()))
            df[col] = df[col].map(reversed_dict)

        if start_date == input_df['date'].min():
            df.to_csv(output_path, index=False)
        else:
            df.to_csv(output_path, index=False, header=False, mode='a')
        start_date = start_date + window_size
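
As I read the loop above, each frame spans 168 + 46 + 62 = 276 days, and a new frame is produced every `window_size` days as long as a full frame still fits before the last date. The sketch below reproduces only that date arithmetic with the standard library; the function name `window_starts`, the constant names and the date range are illustrative assumptions, not part of the notebook.

```python
from datetime import date, timedelta

# Assumed meaning of the three numbers: feature period, gap, target period.
FEATURE_DAYS, GAP_DAYS, TARGET_DAYS = 168, 46, 62
FRAME_DAYS = FEATURE_DAYS + GAP_DAYS + TARGET_DAYS  # 276

def window_starts(first_day, last_day, stride_days):
    """Yield start dates while a full 276-day frame still ends before last_day."""
    start = first_day
    while start + timedelta(days=FRAME_DAYS) < last_day:
        yield start
        start += timedelta(days=stride_days)

# Hypothetical date range, for illustration only.
starts = list(window_starts(date(2016, 8, 1), date(2017, 12, 31), stride_days=62))
```

A smaller stride yields more, heavily overlapping frames, which is why the 1-month run above takes longer than the 168-day run.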

### 5.2.1 Create Validation dataset (last 276 days in train_v2.csv)



In [None]:
start_date = train_df['date'].max() - np.timedelta64(168+46+62,'D')
valid_df = train_df[(train_df['date'] > start_date) & (train_df['date'] <= train_df['date'].max())]
train_df = train_df[train_df['date'] <= start_date]
print('train dataset', train_df.shape)
print('validation dataset:', valid_df.shape)

train dataset (895081, 25)
validation dataset: (813256, 25)


In [None]:
%%time
valid = getTimeFramewithFeatures(valid_df, valid_df['date'].min())
for col in cat_cols:
    default_dict = pd.Series(sorted(set(train_df[col]))).to_dict()
    reversed_dict = dict(zip(default_dict.values(), default_dict.keys()))
    valid[col] = valid[col].map(reversed_dict)
valid.to_csv('output/valid.csv', index=False)

CPU times: user 1min 12s, sys: 628 ms, total: 1min 13s
Wall time: 1min 16s


### 5.2.2 Slide window with stride: 1 month

In [None]:
%%time
getDataFrameBySlidingWindow(train_df, np.timedelta64(1,'M'), 'output/ws_1_month.csv')

CPU times: user 3min 24s, sys: 1.83 s, total: 3min 26s
Wall time: 3min 33s


### 5.2.3 Slide window with stride: 46 days

In [None]:
%%time
getDataFrameBySlidingWindow(train_df, np.timedelta64(46, 'D'), 'output/ws_46_days.csv')

CPU times: user 2min 19s, sys: 1.1 s, total: 2min 20s
Wall time: 2min 28s


### 5.2.4 Slide window with stride: 62 days

In [None]:
%%time
getDataFrameBySlidingWindow(train_df, np.timedelta64(62, 'D'), 'output/ws_62_days.csv')

CPU times: user 2min 17s, sys: 1.16 s, total: 2min 19s
Wall time: 2min 24s


### 5.2.5 Slide window with stride: 168 days

In [None]:
%%time
getDataFrameBySlidingWindow(train_df, np.timedelta64(168, 'D'), 'output/ws_168_days.csv')

CPU times: user 1min 10s, sys: 565 ms, total: 1min 11s
Wall time: 1min 13s


# 6.&nbsp;Choose model: LightGBM

> LightGBM is a gradient boosting framework based on decision trees that aims to:

- increase the efficiency of the model
- reduce memory usage.

LightGBM grows trees leaf-wise (best-first): at each step it splits the leaf that gives the largest loss reduction, whereas many other boosting implementations grow trees depth-wise (level-wise). In other words, LightGBM grows trees vertically while level-wise algorithms grow trees horizontally.
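
A small arithmetic sketch of why the growth strategy matters. The helper names are hypothetical; the only LightGBM fact used is that `num_leaves` defaults to 31.

```python
def max_leaves_level_wise(depth: int) -> int:
    """A fully grown level-wise binary tree of a given depth has 2**depth leaves."""
    return 2 ** depth

def max_depth_leaf_wise(num_leaves: int) -> int:
    """Worst case for leaf-wise growth: every split extends the same branch."""
    return num_leaves - 1

# With about 31 leaves, a balanced tree stays at depth 5,
# while a leaf-wise tree may in principle become a chain of depth 30.
balanced_leaves = max_leaves_level_wise(5)
deepest_chain = max_depth_leaf_wise(31)
```

This is one reason leaf-wise models are usually constrained through `num_leaves`, `max_depth` and `min_data_in_leaf`.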

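It uses two novel techniques:
- Gradient-based One-Side Sampling (GOSS): a sampling method that keeps the instances with large gradients and randomly down-samples the instances with small gradients.
- Exclusive Feature Bundling (EFB): bundles mutually exclusive features (features that rarely take non-zero values at the same time) to reduce the number of features and speed up tree learning.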
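
To make GOSS more concrete, here is a rough NumPy sketch of the sampling step only, not LightGBM's internal implementation. The function name `goss_sample` is made up; the rates 0.2 and 0.1 mirror LightGBM's `top_rate` and `other_rate` defaults.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return sampled row indices and per-row weights in the spirit of GOSS."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rows sorted by decreasing absolute gradient.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]                      # always kept
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Small-gradient rows are up-weighted to compensate for the down-sampling.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
```

Only 30% of the rows are used for the next tree, yet the up-weighting keeps the estimated split gains roughly comparable to those on the full data.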

**Advantages:**
- *Faster training speed and higher efficiency*: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training.
- *Lower memory usage*: Replacing continuous values with discrete bins lowers memory usage.
- *Often better accuracy than level-wise boosting*: The leaf-wise split approach produces more complex trees than a level-wise approach, which is the main factor behind its higher accuracy.
- *Compatibility with large datasets*: It performs equally well on large datasets while training significantly faster than XGBoost.
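
The histogram idea can be sketched in a few lines. The example below uses equal-frequency bins purely for illustration; LightGBM's actual bin-finding logic differs, and `bin_feature` and the synthetic `time_on_site` values are hypothetical. The cap of 255 bins mirrors LightGBM's default `max_bin`.

```python
import numpy as np

def bin_feature(values, max_bin=255):
    """Map continuous values to small integer bin ids (illustrative only)."""
    inner_q = np.linspace(0.0, 1.0, max_bin + 1)[1:-1]   # interior quantiles
    edges = np.unique(np.quantile(values, inner_q))
    bins = np.searchsorted(edges, values, side='right').astype(np.uint8)
    return bins, edges

rng = np.random.default_rng(0)
time_on_site = rng.exponential(scale=120.0, size=10_000)  # synthetic feature
bins, edges = bin_feature(time_on_site)
```

Each value now fits in one byte instead of eight, and a split search only has to scan the bin boundaries rather than every distinct value.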

**Disadvantages:**
- *Overfitting*: Leaf-wise splitting produces more complex trees, which can lead to overfitting.
- *Small datasets*: Because it is sensitive to overfitting, LightGBM can easily overfit small datasets.

*Source:* https://www.kaggle.com/general/264327

> **Level-wise:**

<img src='https://drive.google.com/uc?id=1Aq7vJbXb9NLrm7H-S1m8AT2NEkVJuAlW' width=600>

> **Leaf-wise:**

<img src='https://drive.google.com/uc?id=1TNxATnK-nFacko8qMJNqq9mEDAO_9fw0' width=700>

**Missing value handling:**
- Missing value handling is enabled by default. Disable it by setting `use_missing=false`.
- NA (NaN) represents missing values by default. Set `zero_as_missing=true` to treat zeros as missing instead.
- When `zero_as_missing=false` (default), unrecorded values in sparse matrices (and LibSVM-format files) are treated as zeros.
- When `zero_as_missing=true`, NA and zeros (including unrecorded values in sparse matrices and LibSVM-format files) are treated as missing.


**Categorical feature handling:**
- LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories. This often performs better than one-hot encoding.
- Use `categorical_feature` to specify the categorical features. Refer to the parameter `categorical_feature` in Parameters.
- Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647).
- Use `min_data_per_group`, `cat_smooth` to deal with over-fitting (when `#data` is small or `#category` is large).
- For a categorical feature with high cardinality (`#category` is large), it often works best to treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories in a low-dimensional numeric space.
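
One common way to obtain integer codes is pandas' `category` dtype, sketched below on a made-up frame. Note that pandas encodes missing values as -1; as far as I know, LightGBM treats negative values in categorical features as missing, but that is worth checking against the version in use.

```python
import pandas as pd

# Hypothetical sessions; column names follow the dataset's naming style.
visits = pd.DataFrame({
    'device.browser': ['Chrome', 'Safari', None, 'Chrome'],
    'totals.hits': [3, 1, 7, 2],
})

# Categories are sorted alphabetically, so Chrome -> 0 and Safari -> 1.
codes = visits['device.browser'].astype('category').cat.codes
n_categories = visits['device.browser'].nunique()
```

Comparing `n_categories` with the number of rows gives a quick hint on whether a column is high-cardinality enough to be treated as numeric instead.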

**Parameters:**
- `task`: default = `'train'`; options include `'train'` and `'predict'` (alias `'prediction'`); specifies whether to train a model or to predict with one.

- `application` (alias of `objective`): default = `'regression'`, options include:
    - `'regression'`: regression task
    - `'binary'`: binary classification

- `data`: type=string; path to the training data that LightGBM trains from.
- `feature_fraction`: default=1; the fraction of features randomly selected for each iteration (tree).
- `bagging_fraction`: default=1; the fraction of data randomly selected for each iteration, generally used to speed up training and reduce overfitting.
- `bagging_freq`: 0 disables bagging; k performs bagging at every k-th iteration. Every k-th iteration, LightGBM randomly selects `bagging_fraction * 100 %` of the data to use for the next k iterations.
- `min_data_in_leaf`: minimal number of data points in one leaf.
- `categorical_feature`: type=string; specifies the categorical features to use when training the model.
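
For reference, the parameters above could be collected in a dictionary like the one below. The values are placeholders chosen for illustration, not the settings tuned later in this notebook, and the key names follow LightGBM's documented spelling (`bagging_freq`).

```python
# Hypothetical values for illustration; not the notebook's tuned settings.
params = {
    'objective': 'regression',   # `application` is an alias of `objective`
    'feature_fraction': 0.8,     # use 80% of the features per tree
    'bagging_fraction': 0.8,     # use 80% of the rows when bagging
    'bagging_freq': 5,           # re-sample the rows every 5 iterations
    'min_data_in_leaf': 100,     # larger values give smoother, smaller trees
}

# Bagging only takes effect when the frequency is non-zero
# and the fraction is below 1.
bagging_enabled = params['bagging_freq'] > 0 and params['bagging_fraction'] < 1.0
```

Such a dictionary can be passed to `lgb.train` or unpacked into `lgb.LGBMRegressor(**params)`.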

# 7.&nbsp;Train model


In [None]:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error 

In [None]:
compare_dict={'labelEncode':{},
              'percent':{}}

In [None]:
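def train_model(input_file, idea, output_name, unused_cols, cat_index=None):
    """Train LightGBM on one sliding-window file and store its validation MSE.

    Relies on the globals `valid` (prepared with the same `unused_cols`) and `compare_dict`.
    """
    train = pd.read_csv(input_file, low_memory=False, dtype={'fullVisitorId': str})
    train.drop(columns=unused_cols, inplace=True)
    train.drop_duplicates(inplace=True)

    model = lgb.LGBMRegressor()
    # Categorical column indices are passed to fit(), where the scikit-learn API documents them.
    model.fit(train.drop(['target'], axis=1), train['target'],
              categorical_feature=cat_index if cat_index is not None else 'auto')

    # Score on the held-out validation frame.
    y_pred = model.predict(valid.drop(['target'], axis=1))
    compare_dict[idea][output_name] = mean_squared_error(valid['target'], y_pred)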

## 7.1 Idea01: Label Encoding
- Number each category using the sorted set of categories found in the training set
- Pass these columns to LightGBM through `categorical_feature`
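
The encoding step used earlier in the notebook can be tried on a toy column. The series `train_col` and `valid_col` are made up; the dictionary construction is the same as in the sliding-window code.

```python
import pandas as pd

# Hypothetical category values seen in training and validation.
train_col = pd.Series(['Chrome', 'Safari', 'Firefox', 'Chrome'])
valid_col = pd.Series(['Safari', 'Edge', 'Chrome'])

# Same construction as in the notebook: index -> category, then reversed.
default_dict = pd.Series(sorted(set(train_col))).to_dict()
reversed_dict = dict(zip(default_dict.values(), default_dict.keys()))

# Categories missing from the training set ('Edge') become NaN.
encoded = valid_col.map(reversed_dict)
```

Because unseen categories turn into NaN, they end up on LightGBM's missing-value path rather than receiving an arbitrary code.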

In [None]:
label_unused_cols = ['fullVisitorId', '%device.browser', '%device.deviceCategory', '%device.operatingSystem',
                '%geoNetwork.city', '%geoNetwork.continent', '%geoNetwork.country',
                '%geoNetwork.metro', '%geoNetwork.region', '%geoNetwork.subContinent', 
                '%trafficSource.medium', '%trafficSource.source', '%channelGrouping', 'isReturned']

In [None]:
cate_index = [i for i in range(4, 16)]
cate_index

[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

### 7.1.1 Valid df

In [None]:
valid = pd.read_csv('output/valid.csv', low_memory=False, dtype={'fullVisitorId': str})
valid.drop(columns=label_unused_cols, inplace=True)
valid.drop_duplicates(inplace=True)
valid.shape

(309065, 41)

### 7.1.2 Train df

In [None]:
%%time
train_model('output/ws_1_month.csv', 'labelEncode', 'stride_1_month', label_unused_cols, cate_index)
train_model('output/ws_46_days.csv', 'labelEncode', 'stride_46_days', label_unused_cols, cate_index)
train_model('output/ws_62_days.csv', 'labelEncode', 'stride_62_days', label_unused_cols, cate_index)
train_model('output/ws_168_days.csv', 'labelEncode', 'stride_168_days', label_unused_cols, cate_index)

CPU times: user 1min 23s, sys: 1.44 s, total: 1min 24s
Wall time: 1min 4s


## 7.2 Idea02: Percent Category
- Replace each categorical column with the frequency of its category within the 168-day input of each time frame (numeric columns are used instead of categorical columns)
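
The construction of `percent_dict` is not shown in this section, so the sketch below is only one plausible reading: for each category value, compute its share of sessions in the window and merge it back as a `%`-prefixed column. The frames `sessions` and `visitors` are hypothetical.

```python
import pandas as pd

col = 'device.browser'

# Hypothetical sessions inside one 168-day window.
sessions = pd.DataFrame({col: ['Chrome', 'Chrome', 'Safari', 'Chrome']})

# Share of sessions per category value, as a two-column lookup table.
share = (sessions[col].value_counts(normalize=True)
         .rename('%' + col)
         .rename_axis(col)
         .reset_index())

# Hypothetical per-visitor frame; merge joins on the shared column `col`.
visitors = pd.DataFrame({'fullVisitorId': ['a', 'b'], col: ['Chrome', 'Safari']})
merged = pd.merge(visitors, share)
```

The resulting `%device.browser` column is numeric, so the model can use it without any categorical configuration.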

In [None]:
percent_unused_cols = ['fullVisitorId', 'geoNetwork.city',
              'geoNetwork.region', 'trafficSource.source', 'geoNetwork.country',
              'device.browser', 'geoNetwork.metro', 'device.operatingSystem',
              'geoNetwork.subContinent', 'channelGrouping', 'trafficSource.medium',
              'geoNetwork.continent', 'device.deviceCategory', 'isReturned']

### 7.2.1 Valid df

In [None]:
valid = pd.read_csv('output/valid.csv', low_memory=False, dtype={'fullVisitorId': str})
valid.drop(columns=percent_unused_cols, inplace=True)
valid.drop_duplicates(inplace=True)
valid.shape

(309098, 41)

### 7.2.2 Train df

In [None]:
%%time
train_model('output/ws_1_month.csv', 'percent', 'stride_1_month', percent_unused_cols)
train_model('output/ws_46_days.csv', 'percent', 'stride_46_days', percent_unused_cols)
train_model('output/ws_62_days.csv', 'percent', 'stride_62_days', percent_unused_cols)
train_model('output/ws_168_days.csv', 'percent', 'stride_168_days', percent_unused_cols)

CPU times: user 1min 40s, sys: 1.61 s, total: 1min 42s
Wall time: 1min 13s


## 7.3 Comparison
- Idea02 gives a lower validation MSE than Idea01 for every stride
- Smaller strides mostly give better results


In [None]:
df = pd.DataFrame(compare_dict).transpose()
df

| | stride_1_month | stride_46_days | stride_62_days | stride_168_days |
|---|---|---|---|---|
| labelEncode | 0.144981 | 0.14877 | 0.148119 | 0.160208 |
| percent | 0.144088 | 0.147198 | 0.145321 | 0.153288 |
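
The values above come from `mean_squared_error`, while the competition is scored with RMSE. Converting is a one-liner; the sketch below uses the two 1-month results from the table, and the variable names are illustrative.

```python
import math

# Validation MSE for the 1-month stride, copied from the table above.
mse = {'labelEncode': 0.144981, 'percent': 0.144088}

# RMSE is the square root of MSE, so the ranking of the two ideas is unchanged.
rmse = {name: math.sqrt(value) for name, value in mse.items()}
best_idea = min(rmse, key=rmse.get)
```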


## 7.4 Tune Parameter


In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [None]:
%%time
getDataFrameBySlidingWindow(pd.concat([train_df, valid_df], ignore_index=True), np.timedelta64(1, 'M'), 'output/full_df.csv')
train_df=[]
valid_df=[]

CPU times: user 12min 47s, sys: 8.77 s, total: 12min 56s
Wall time: 13min 16s


In [None]:
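rs_params = {
    # 'num_leaves': sp_randint(6, 30),
    # Tuples and lists are sampled as discrete choices, not ranges: each entry below has two options.
    'boosting_type': ['gbdt', 'dart'], #, 'goss'],
    'bagging_fraction': (0.5, 0.8),
    # LightGBM documents this parameter as `bagging_freq`; the key is left unchanged to match the log below.
    'bagging_frequency': (5, 8),
    'feature_fraction': (0.5, 0.8),
    'min_data_in_leaf': (90, 120),
    'min_child_weight': [1e-3, 1e-2],
}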
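
scikit-learn treats a list or tuple in `param_distributions` as a set of discrete options, so `(0.5, 0.8)` means "0.5 or 0.8", not the interval between them. A quick count of the grid, sketched below with the same dictionary, shows why the search log reports 64 candidates even though `n_iter=100` is requested.

```python
# Same search space as in the cell above.
rs_params = {
    'boosting_type': ['gbdt', 'dart'],
    'bagging_fraction': (0.5, 0.8),
    'bagging_frequency': (5, 8),
    'feature_fraction': (0.5, 0.8),
    'min_data_in_leaf': (90, 120),
    'min_child_weight': [1e-3, 1e-2],
}

# Number of distinct combinations when every entry is a discrete choice.
n_candidates = 1
for options in rs_params.values():
    n_candidates *= len(options)

# For a continuous range one would use a distribution instead,
# e.g. scipy.stats.uniform(0.5, 0.3) for values in [0.5, 0.8].
```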

In [None]:
train = pd.read_csv('output/full_df.csv', low_memory=False, dtype={'fullVisitorId': str})
train.drop(columns=percent_unused_cols, inplace=True)
train.drop_duplicates(inplace=True)

reg = lgb.LGBMRegressor(max_depth=-1)
rs_cv = RandomizedSearchCV(estimator=reg,
                           param_distributions=rs_params,
                           cv = 3,
                           n_iter=100,
                           verbose=20)

search = rs_cv.fit(train.drop(['target'],axis=1), train['target'])

Fitting 3 folds for each of 64 candidates, totalling 192 fits
[CV 1/3; 1/64] START bagging_fraction=0.5, bagging_frequency=5, boosting_type=gbdt, feature_fraction=0.5, min_child_weight=0.001, min_data_in_leaf=90
[CV 1/3; 1/64] END bagging_fraction=0.5, bagging_frequency=5, boosting_type=gbdt, feature_fraction=0.5, min_child_weight=0.001, min_data_in_leaf=90;, score=0.029 total time=  24.0s
[CV 2/3; 1/64] START bagging_fraction=0.5, bagging_frequency=5, boosting_type=gbdt, feature_fraction=0.5, min_child_weight=0.001, min_data_in_leaf=90
[CV 2/3; 1/64] END bagging_fraction=0.5, bagging_frequency=5, boosting_type=gbdt, feature_fraction=0.5, min_child_weight=0.001, min_data_in_leaf=90;, score=0.029 total time=  23.0s
[CV 3/3; 1/64] START bagging_fraction=0.5, bagging_frequency=5, boosting_type=gbdt, feature_fraction=0.5, min_child_weight=0.001, min_data_in_leaf=90
[CV 3/3; 1/64] END bagging_fraction=0.5, bagging_frequency=5, boosting_type=gbdt, feature_fraction=0.5, min_child_weight=0.001

In [None]:
search.best_params_

{'min_data_in_leaf': 90,
 'min_child_weight': 0.001,
 'feature_fraction': 0.5,
 'boosting_type': 'dart',
 'bagging_frequency': 5,
 'bagging_fraction': 0.5}

In [None]:
search.best_score_

0.021849578437470707
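
Since no `scoring` argument is given, the search falls back to the estimator's own `score` method, which for scikit-learn style regressors is the coefficient of determination R². The value above is therefore presumably an R², not an error. A plain-Python sketch of that metric, with a hypothetical helper name and toy data:

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed without any library."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy target that is mostly zero, similar in spirit to revenue data.
y_true = [0.0, 0.0, 0.0, 4.0]
r2_mean_baseline = r2_score_manual(y_true, [1.0, 1.0, 1.0, 1.0])  # predict the mean
r2_perfect = r2_score_manual(y_true, y_true)
```

A score close to 0 means the model is only slightly better than always predicting the mean. Passing `scoring='neg_root_mean_squared_error'` to `RandomizedSearchCV` would make the search optimize the competition metric directly.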

In [None]:
model = lgb.LGBMRegressor(max_depth=-1)
model.set_params(**search.best_params_)
model.fit(train.drop(['target'],axis=1), train['target'])

LGBMRegressor(bagging_fraction=0.5, bagging_frequency=5, boosting_type='dart',
              feature_fraction=0.5, min_data_in_leaf=90)

# 8.&nbsp;Predict output

In [None]:
%%time
test = getTimeFramewithFeatures(test_df, test_df['date'].min())
test.drop(columns=percent_unused_cols[1:], inplace=True)
test.shape

CPU times: user 42.4 s, sys: 157 ms, total: 42.6 s
Wall time: 42.6 s


(296530, 42)

In [None]:
%%time
y_pred = model.predict(test.drop(['target', 'fullVisitorId'], axis=1))
prediction = pd.DataFrame(zip(test['fullVisitorId'], y_pred), columns = ['fullVisitorId', 'PredictedLogRevenue'])
# Revenue cannot be negative: clip predictions at 0 (avoids chained assignment).
prediction['PredictedLogRevenue'] = prediction['PredictedLogRevenue'].clip(lower=0)

prediction.to_csv('output/prediction.csv', index=False)
prediction

CPU times: user 2.76 s, sys: 34 ms, total: 2.79 s
Wall time: 2.67 s


| | fullVisitorId | PredictedLogRevenue |
|---|---|---|
| 0 | 0000018966949534117 | 0.003037 |
| 1 | 0036693306580175184 | 0.003037 |
| 2 | 0037203376062803825 | 0.095187 |
| 3 | 0051869652803526521 | 0.003037 |
| 4 | 0066480182734334520 | 0.003037 |
| ... | ... | ... |
| 296525 | 992838595332277097 | 0.001249 |
| 296526 | 7047034012929122042 | 0.001249 |
| 296527 | 0308499825805587947 | 0.001705 |
| 296528 | 1447216801700455400 | 0.001249 |
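
For context, the competition defines the per-visitor target as the natural log of the summed revenue plus one, as far as I understand its rules. Whether the `target` column in this notebook is built exactly this way is an assumption, and the helper names below are hypothetical.

```python
import math

def log_revenue(transaction_revenues):
    """ln(1 + total revenue) for one visitor; 0.0 when nothing was bought."""
    return math.log1p(sum(transaction_revenues))

def clip_prediction(pred):
    """The target is never negative, so negative predictions are set to 0."""
    return max(pred, 0.0)

no_purchase = log_revenue([])
two_purchases = log_revenue([10.0, 5.0])
clipped = clip_prediction(-0.002)
```

Most visitors buy nothing, so most true targets are exactly 0, which matches the many near-zero predictions in the table above.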


# 9.&nbsp;Reflection
Vũ Công Minh:
- Learned many new things:
  + How to handle a large dataset (using chunksize)
  + How to handle categorical columns
  + How the referenced author simulates time frames
  + The author's approach of splitting training into two models (classification and regression)
- Difficulties:
  + Choosing a suitable topic.
  + Too much data.

Võ Ngọc Minh:
- Learned how to handle a large dataset
- Practiced being careful while working
- Difficulties:
  + Exploring many different approaches.

Improvements
- Tune more hyperparameters to improve accuracy
- Possibly drop some unimportant columns
- Ensemble the models