<a href="https://colab.research.google.com/github/MeinHserhT/CS14115/blob/main/Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References:
- For data visualization: https://www.kaggle.com/code/jsaguiar/complete-exploratory-analysis-all-columns/notebook 
- Feature engineering and training model: https://www.kaggle.com/competitions/ga-customer-revenue-prediction/discussion/82614 

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/'gStore Revenue Prediction'/data

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/gStore Revenue Prediction/data


In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
from os import listdir
from datetime import datetime, timedelta
import ast

import gc
gc.enable()

import warnings
warnings.filterwarnings('ignore')

# 1.&nbsp;Understand the problem

- Overview: The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

- Goal: Analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to <b>predict revenue per customer</b>

- Data format: 
    + Each row in the dataset is one visit to the store. 
    + <b>Not all rows in test_v2.csv will correspond to a row in the submission</b>, but all unique fullVisitorIds will correspond to a row in the submission.
    + Due to the formatting of fullVisitorId you must <b>load the Id's as strings in order for all Id's to be properly unique!</b>
    + There are multiple columns which contain JSON blobs of varying depth. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.


- Data train: user transactions which are collected from GStore around the world 01/08/2016 to 30/04/2018.
- Data test: ALL users' transactions in the future time.
 + Public LB: is being calculated for those visitors during the same timeframe of 01/05/2018 to 15/10/2018
 + Private LB: is being calculated on the future-looking timeframe of 01/12/2018 to 31/01/2019 - for those **same** set of users. 
 
 $\Rightarrow$ Therefore, your submission that is intended for the public LB timeframe will be different from the private LB timeframe, which will be rescored/recalculated on the future timeframe.
 
 
- Input: All transactions of a user from 01/05/2018 to 15/10/2018.
- Output: Total revenue of that user during the predicting time. (01/12/2018 to 31/01/2019)
 
 We are predicting the <b>natural log of the sum of all transactions per user</b>. 
 
$$
y_{user} = \sum_{i=1}^{n} transaction_{user_i} 
$$
$$
target_{user} = \ln({y_{user}+1})
$$
 

- External Data: is <b>permitted</b> for this competition. This includes the <a href="https://support.google.com/analytics/answer/6367342#access&zippy=%2Cin-this-article">Google Merchandise Store Demo Account</a>. Although the Demo Account contains the predicted variable, final standings will not benefit from access to this external data, because it requires future-looking predictions.

- Evaluation Metric:
 
 Submissions are scored on the root mean squared error. RMSE is defined as:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum^n_{i=1}(y_i - \hat{y}_i)^2} $$

![](https://drive.google.com/uc?export=view&id=1RXHGiWn7RGhTvxjnpFnpzRS4munE4pM6)


# 2.&nbsp;Prepare data

Follow the link: https://www.kaggle.com/code/minhvngc/exploration



# 3.&nbsp;Explore data

Follow the link: https://colab.research.google.com/drive/121tKvyBSNPEtPmau4dhL1NjIFrzw9xZl

#  4.&nbsp;Preprocess data

## 4.1 Load data

In [3]:
lst_file = listdir()
lst_train_file = [f for f in lst_file if 'train' in f]
lst_test_file = [f for f in lst_file if 'test' in f]

train_default_df = pd.DataFrame()
test_default_df = pd.DataFrame()

In [4]:
for f in lst_train_file:
    col = f[:-4]
    if col == 'fullVisitorId':
        train_default_df[col] =  pd.read_csv(f, names = ['index', col], dtype=str)\
                                    .set_index('index')[col]
    train_default_df[col] = pd.read_csv(f, names = ['index', col],)\
                                .set_index('index')[col]

train_df = train_default_df.copy()
train_df.columns = [c[6:] for c in train_df.columns]
train_df.head(2)

Unnamed: 0_level_0,channelGrouping,customDimensions,date,device.browser,device.browserVersion,device.browserSize,device.deviceCategory,device.flashVersion,device.isMobile,device.language,device.mobileDeviceBranding,device.mobileDeviceInfo,device.mobileDeviceModel,device.mobileDeviceMarketingName,device.operatingSystem,device.mobileInputSelector,device.operatingSystemVersion,device.screenColors,device.screenResolution,fullVisitorId,geoNetwork.city,geoNetwork.cityId,geoNetwork.continent,geoNetwork.country,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.metro,geoNetwork.networkDomain,geoNetwork.networkLocation,geoNetwork.region,geoNetwork.subContinent,socialEngagementType,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.sessionQualityDim,totals.timeOnSite,totals.totalTransactionRevenue,totals.transactionRevenue,totals.transactions,totals.visits,trafficSource.adContent,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.campaignCode,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source,visitId,visitNumber,visitStartTime
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1
0,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",20171016,Firefox,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,3162355547410993243,not available in demo dataset,not available in demo dataset,Europe,Germany,not available in demo dataset,not available in demo dataset,not available in demo dataset,(not set),not available in demo dataset,not available in demo dataset,Western Europe,Not Socially Engaged,1.0,1,1.0,1.0,1.0,,,,,1,,,not available in demo dataset,,,,,(not set),,,water bottle,organic,,google,1508198450,1,1508198450
1,Referral,"[{'index': '4', 'value': 'North America'}]",20171016,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Chrome OS,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,8934116514970143966,Cupertino,not available in demo dataset,Americas,United States,not available in demo dataset,not available in demo dataset,San Francisco-Oakland-San Jose CA,(not set),not available in demo dataset,California,Northern America,Not Socially Engaged,,2,,2.0,2.0,28.0,,,,1,,,not available in demo dataset,,,,,(not set),,,,referral,/a/google.com/transportation/mtv-services/bikes/bike2workmay2016,sites.google.com,1508176307,6,1508176307


In [5]:
for f in lst_test_file:
    col = f[:-4]
    if col == 'fullVisitorId':
        test_default_df[col] = pd.read_csv(f, names=['index', col], dtype=str)\
                                    .set_index('index')[col]
    test_default_df[col] = pd.read_csv(f, names=['index', col])\
                                .set_index('index')[col]

test_df = test_default_df.copy()
test_df.columns = [c[5:] for c in test_df.columns]
test_df.head(2)

Unnamed: 0_level_0,channelGrouping,device.deviceCategory,device.browserSize,date,device.browserVersion,device.browser,customDimensions,device.mobileDeviceMarketingName,device.mobileDeviceInfo,device.mobileDeviceBranding,device.screenColors,device.isMobile,device.mobileDeviceModel,device.operatingSystemVersion,device.operatingSystem,device.language,device.mobileInputSelector,device.flashVersion,device.screenResolution,geoNetwork.city,fullVisitorId,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.continent,geoNetwork.metro,geoNetwork.country,geoNetwork.cityId,geoNetwork.networkDomain,geoNetwork.region,geoNetwork.subContinent,socialEngagementType,geoNetwork.networkLocation,totals.bounces,totals.hits,totals.sessionQualityDim,totals.totalTransactionRevenue,totals.pageviews,totals.newVisits,totals.transactionRevenue,totals.transactions,totals.timeOnSite,trafficSource.isTrueDirect,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.campaign,trafficSource.adwordsClickInfo.gclId,trafficSource.keyword,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.adContent,totals.visits,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.slot,visitNumber,trafficSource.source,trafficSource.referralPath,trafficSource.medium,visitId,visitStartTime
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1
0,Organic Search,mobile,not available in demo dataset,20180511,not available in demo dataset,Chrome,"[{'index': '4', 'value': 'APAC'}]",not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,True,not available in demo dataset,not available in demo dataset,Android,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,(not set),7460955084541987166,not available in demo dataset,not available in demo dataset,Asia,(not set),India,not available in demo dataset,unknown.unknown,Delhi,Southern Asia,Not Socially Engaged,not available in demo dataset,,4,1,,3.0,,,,973.0,True,,(not set),,(not provided),not available in demo dataset,(not set),1,,,,2,google,(not set),organic,1526099341,1526099341
1,Direct,desktop,not available in demo dataset,20180511,not available in demo dataset,Chrome,"[{'index': '4', 'value': 'North America'}]",not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,Macintosh,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,San Francisco,460252456180441002,not available in demo dataset,not available in demo dataset,Americas,San Francisco-Oakland-San Jose CA,United States,not available in demo dataset,(not set),California,Northern America,Not Socially Engaged,not available in demo dataset,,4,1,,3.0,,,,49.0,True,,(not set),,(not set),not available in demo dataset,(not set),1,,,,166,(direct),(not set),(none),1526064483,1526064483


## 4.2 Drop bad columns

### 4.2.1 Columns with only 1 value

In [6]:
only_1_lst = [col for col in train_df.columns 
                    if len(train_df[col].value_counts(dropna=False)) == 1]
train_df[only_1_lst].head()

Unnamed: 0_level_0,device.browserVersion,device.browserSize,device.flashVersion,device.language,device.mobileDeviceBranding,device.mobileDeviceInfo,device.mobileDeviceModel,device.mobileDeviceMarketingName,device.mobileInputSelector,device.operatingSystemVersion,device.screenColors,device.screenResolution,geoNetwork.cityId,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.networkLocation,socialEngagementType,totals.visits,trafficSource.adwordsClickInfo.criteriaParameters
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
0,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
1,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
2,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
3,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset
4,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Not Socially Engaged,1,not available in demo dataset


- Columns have `'not available in demo dataset'` are not provided in this competition.
- Column `socialEngagementType` has `'Not Socially Engaged'` because this is an ecommerce site that does not contain any social engagement (like, loves, ...)
- Column `totals.visits` has `1` because `1` represent for sessions with interaction events. (???)

In [7]:
train_df.drop(columns = only_1_lst, inplace=True)
test_df.drop(columns = only_1_lst, inplace=True)

### 4.2.2 Handle column `customDimensions`
This column has a special format: List of a json -> Split it into 2 columns: `customDimensions.index` and `customDimensions.value`

In [8]:
train_df['customDimensions'].value_counts(dropna=False)

[{'index': '4', 'value': 'North America'}]      768223
[]                                              333235
[{'index': '4', 'value': 'EMEA'}]               313991
[{'index': '4', 'value': 'APAC'}]               222071
[{'index': '4', 'value': 'South America'}]       45553
[{'index': '4', 'value': 'Central America'}]     25264
Name: customDimensions, dtype: int64

In [9]:
def handle_customDimensions(df):
    # convert string representation of list to a list
    df['customDimensions'] = df['customDimensions'].apply(lambda x: ast.literal_eval(x))

    # fill empty string
    df['customDimensions'] = df['customDimensions'].apply(lambda x: x[0] if len(x)==1 else "{}")

    # convert json string
    splitted_df = pd.json_normalize(df['customDimensions'])
    df[['customDimensions.' + col for col in splitted_df.columns]] = splitted_df[splitted_df.columns]
    df.drop('customDimensions', axis=1, inplace=True)

handle_customDimensions(train_df)
handle_customDimensions(test_df)

In [10]:
train_df.drop('customDimensions.index', axis=1, inplace=True)
test_df.drop('customDimensions.index', axis=1, inplace=True)

In [11]:
temp_df = train_df[['customDimensions.value', 'geoNetwork.continent']].drop_duplicates()
temp_df = temp_df.groupby('customDimensions.value', dropna=False)['geoNetwork.continent'].apply(list).to_frame()
temp_df

Unnamed: 0_level_0,geoNetwork.continent
customDimensions.value,Unnamed: 1_level_1
APAC,"[Asia, Oceania, Americas, Europe, (not set), Africa]"
Central America,"[Americas, Europe]"
EMEA,"[Europe, Asia, Americas, Africa, (not set)]"
North America,"[Americas, Europe, Asia, (not set), Africa, Oceania]"
South America,"[Americas, Europe]"
,"[Europe, Asia, Americas, Oceania, (not set), Africa]"


In [12]:
# https://en.wikipedia.org/wiki/North_America
temp_df = train_df[['customDimensions.value', 'geoNetwork.continent', 'geoNetwork.subContinent', 'geoNetwork.country']]
temp_df = temp_df.drop_duplicates()
temp_df = temp_df[(temp_df['customDimensions.value'] == 'North America') & (temp_df['geoNetwork.continent'] == 'Americas')]
temp_df

Unnamed: 0_level_0,customDimensions.value,geoNetwork.continent,geoNetwork.subContinent,geoNetwork.country
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,North America,Americas,Northern America,United States
14,North America,Americas,Northern America,Canada
19785,North America,Americas,South America,Venezuela
222928,North America,Americas,South America,Colombia
373973,North America,Americas,Central America,Costa Rica
396716,North America,Americas,Central America,Mexico
1030353,North America,Americas,South America,Bolivia
1310915,North America,Americas,Caribbean,Puerto Rico
1373630,North America,Americas,South America,Brazil
1591621,North America,Americas,Caribbean,Dominican Republic


In [13]:
train_df.drop('customDimensions.value', axis=1, inplace=True)
test_df.drop('customDimensions.value', axis=1, inplace=True)

### 4.2.3 Handle column `visitId`

This column is used to distinguish session when it combines with `fullVisitorId` ⟶ unnecessary feature in learning model.

In [14]:
train_df.drop(columns = ['visitId'], inplace=True)
test_df.drop(columns = ['visitId'], inplace=True)

In [15]:
train_df['trafficSource.campaign'].replace('(not set)', np.nan, inplace=True)
test_df['trafficSource.campaign'].replace('(not set)', np.nan, inplace=True)

### 4.2.4 Handle columns with too much `NaN`



In [16]:
percent_missing = train_df.isnull().sum() * 100 / len(train_df)
percent_missing[percent_missing > 90].sort_values(ascending=False)

trafficSource.campaignCode                      99.999941
totals.totalTransactionRevenue                  98.916256
totals.transactionRevenue                       98.916256
totals.transactions                             98.913622
trafficSource.adContent                         96.210525
trafficSource.adwordsClickInfo.adNetworkType    95.593727
trafficSource.adwordsClickInfo.page             95.593727
trafficSource.adwordsClickInfo.isVideoAd        95.593727
trafficSource.adwordsClickInfo.slot             95.593727
trafficSource.adwordsClickInfo.gclId            95.585005
trafficSource.campaign                          93.923272
dtype: float64

- Column `trafficSource.campaignCode` only exists in `train_df`.
- Column `totals.totalTransactionRevenue`, `totals.transactions` are used in calculating the target columns ⟶ can not be dropped.
- Column `totals.transactionRevenue` is deprecated. Based on [*source*](https://support.google.com/analytics/answer/3437719?hl=vi)
- Other columns have many missing values ⟶ can be dropped.

In [17]:
nan_lst = percent_missing[percent_missing > 90].index.to_list()
nan_lst.remove('totals.totalTransactionRevenue')
nan_lst.remove('totals.transactions')

train_df.drop(columns = nan_lst, inplace=True)

nan_lst.remove('trafficSource.campaignCode')
test_df.drop(columns = nan_lst, inplace=True)

### 4.2.5 Handle column `visitStartTime'

This column represents unix timestamp of start_time which includes date information

In [18]:
train_df.drop('visitStartTime', axis=1, inplace=True)
test_df.drop('visitStartTime', axis=1, inplace=True)

## 4.4 Handle missing value


### 4.4.1 Categorical columns

In [19]:
cat_cols = train_df.select_dtypes(include=['object']).columns.tolist()
percent_missing = train_df[cat_cols].isnull().sum() * 100 / len(train_df)
percent_missing.sort_values(ascending=False)

trafficSource.isTrueDirect    68.711209
trafficSource.referralPath    66.852910
trafficSource.keyword         61.626014
channelGrouping                0.000000
geoNetwork.networkDomain       0.000000
trafficSource.medium           0.000000
geoNetwork.subContinent        0.000000
geoNetwork.region              0.000000
geoNetwork.metro               0.000000
device.browser                 0.000000
geoNetwork.country             0.000000
geoNetwork.continent           0.000000
geoNetwork.city                0.000000
fullVisitorId                  0.000000
device.operatingSystem         0.000000
device.deviceCategory          0.000000
trafficSource.source           0.000000
dtype: float64

In [20]:
def handle_category(df):
    df['trafficSource.isTrueDirect'] = df['trafficSource.isTrueDirect'].replace(True, 1)
    df.drop('trafficSource.referralPath', axis=1, inplace=True)
    df.drop('trafficSource.keyword',axis=1, inplace=True)

handle_category(train_df)
handle_category(test_df)

### 4.4.2 Numeric columns 

In [21]:
num_cols = train_df._get_numeric_data().columns.to_list()
percent_missing = train_df[num_cols].isnull().sum() * 100 / len(train_df)
percent_missing.sort_values(ascending=False)

totals.totalTransactionRevenue    98.916256
totals.transactions               98.913622
trafficSource.isTrueDirect        68.711209
totals.timeOnSite                 51.178076
totals.bounces                    48.980910
totals.sessionQualityDim          48.893983
totals.newVisits                  23.467676
totals.pageviews                   0.013990
date                               0.000000
device.isMobile                    0.000000
totals.hits                        0.000000
visitNumber                        0.000000
dtype: float64

In [22]:
train_df[train_df['totals.pageviews'].isnull()]['totals.totalTransactionRevenue'].value_counts(dropna=False).to_frame().reset_index().rename(columns={'index': 'totalTransactionRevenue', 'totals.totalTransactionRevenue': 'value_counts'}, index={0: 'NULL pageviews'})

Unnamed: 0,totalTransactionRevenue,value_counts
NULL pageviews,,239


In [23]:
def handle_numeric(df):
    df['totals.totalTransactionRevenue'] = df['totals.totalTransactionRevenue'].fillna(0.)
    df['totals.transactions'] = df['totals.transactions'].fillna(0.) 
    df['trafficSource.isTrueDirect'] = df['trafficSource.isTrueDirect'].fillna(0.)
    df['totals.timeOnSite'] = df['totals.timeOnSite'].fillna(0.) 
    df['totals.bounces'] = df['totals.bounces'].fillna(0.)
    df['totals.sessionQualityDim'] = df['totals.sessionQualityDim'].fillna(0.)
    df['totals.newVisits'] = df['totals.newVisits'].fillna(0.)
    df = df[df['totals.pageviews'].notnull()]    
handle_numeric(train_df)
handle_numeric(test_df)

In [24]:
train_df[[col for col in num_cols if len(train_df[col].value_counts()) == 2]]

Unnamed: 0_level_0,device.isMobile,totals.bounces,totals.newVisits,trafficSource.isTrueDirect
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,False,1.0,1.0,0.0
1,False,0.0,0.0,0.0
2,True,0.0,1.0,1.0
3,False,0.0,1.0,0.0
4,False,0.0,1.0,0.0
...,...,...,...,...
1708332,False,0.0,1.0,0.0
1708333,True,0.0,1.0,0.0
1708334,True,0.0,1.0,0.0
1708335,False,0.0,1.0,0.0


In [25]:
train_df['device.isMobile'] = train_df['device.isMobile'].astype(int)
test_df['device.isMobile'] = test_df['device.isMobile'].astype(int)

## 4.5 Handle high cardinality

In [26]:
cat_cols = train_df.select_dtypes(include=['object']).columns.tolist()
cat_df = pd.DataFrame()
cat_df['train_distinct_values'] = train_df[cat_cols].nunique().sort_values(ascending=False)
cat_df['test_distinct_values'] = test_df[cat_cols].nunique().sort_values(ascending=False)
cat_df.reset_index(inplace=True)
cat_df.rename(columns={'index': 'column_name'}, inplace=True)
cat_df['common_values'] = cat_df.apply(lambda x: len(set(train_df[x['column_name']].value_counts().index).intersection(test_df[x['column_name']].value_counts().index)),axis=1)
cat_df

Unnamed: 0,column_name,train_distinct_values,test_distinct_values,common_values
0,fullVisitorId,1367971,316850,2476
1,geoNetwork.networkDomain,41982,15934,9511
2,geoNetwork.city,956,503,362
3,geoNetwork.region,483,269,234
4,trafficSource.source,345,192,146
5,geoNetwork.country,228,208,207
6,device.browser,129,62,30
7,geoNetwork.metro,123,82,75
8,device.operatingSystem,24,22,20
9,geoNetwork.subContinent,23,23,23


In [27]:
def handle_high(col, train, test):
    common = list(set(train[col].unique()).intersection(test[col].unique()))

    train[col] = pd.factorize(pd.Categorical(train[col], categories=common))[0]
    train[col].replace(-1, len(common), inplace=True)
    test[col] = pd.factorize(pd.Categorical(test[col], categories=common))[0]
    test[col].replace(-1, len(common), inplace=True)

col_lst = cat_df[1:]['column_name'].to_list()
for col in col_lst:
    handle_high(col, train_df, test_df)

In [28]:
train_df

Unnamed: 0_level_0,channelGrouping,date,device.browser,device.deviceCategory,device.isMobile,device.operatingSystem,fullVisitorId,geoNetwork.city,geoNetwork.continent,geoNetwork.country,geoNetwork.metro,geoNetwork.networkDomain,geoNetwork.region,geoNetwork.subContinent,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.sessionQualityDim,totals.timeOnSite,totals.totalTransactionRevenue,totals.transactions,trafficSource.isTrueDirect,trafficSource.medium,trafficSource.source,visitNumber
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
0,0,20171016,0,0,0,0,3162355547410993243,0,0,0,0,0,0,0,1.0,1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0,0,1
1,1,20171016,1,0,0,1,8934116514970143966,1,1,1,1,0,1,1,0.0,2,0.0,2.0,2.0,28.0,0.0,0.0,0.0,1,1,6
2,2,20171016,1,1,1,2,7992466427990357681,0,1,1,0,1,0,1,0.0,2,1.0,2.0,1.0,38.0,0.0,0.0,1.0,2,2,1
3,0,20171016,1,0,0,0,9075655783635761930,0,2,2,0,2,0,2,0.0,2,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0,0,1
4,0,20171016,1,0,0,0,6960673291025684308,0,1,3,0,3,0,3,0.0,2,1.0,2.0,1.0,52.0,0.0,0.0,0.0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708332,6,20170104,1,0,0,0,5123779100307500332,0,1,72,0,388,0,18,0.0,17,1.0,15.0,0.0,626.0,0.0,0.0,0.0,1,13,1
1708333,6,20170104,1,1,1,2,7231728964973959842,0,2,66,0,2,0,5,0.0,18,1.0,13.0,0.0,258.0,0.0,0.0,0.0,1,13,1
1708334,6,20170104,7,1,1,2,5744576632396406899,66,2,50,3,2,47,9,0.0,24,1.0,21.0,0.0,991.0,0.0,0.0,0.0,1,13,1
1708335,6,20170104,1,0,0,0,2709355455991750775,0,2,28,0,2,0,6,0.0,24,1.0,22.0,0.0,1274.0,0.0,0.0,0.0,1,36,1


- Categorical columns: 13 (except 'fullVisitorId')
- Boolean columns: 4
- Numerical columns: 7 (except 'date')

# 5.&nbsp;Feature engineering

## 5.1 Train data

In [29]:
from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

train_df["date"] = pd.to_datetime(train_df["date"], infer_datetime_format=True, format="%Y%m%d")
test_df["date"] = pd.to_datetime(test_df["date"], infer_datetime_format=True, format="%Y%m%d")

In [30]:
def getTimeFramewithFeatures(tr, k=1):
    # train timeframe
    tf = tr.loc[(tr['date'] >= min(tr['date']) + timedelta(days=168*(k-1))) 
              & (tr['date'] < min(tr['date']) + timedelta(days=168*k))]
    
    tf_fvid = set(tr.loc[(tr['date'] >= min(tr['date']) + timedelta(days=168*k + 46)) 
                & (tr['date'] < min(tr['date']) + timedelta(days=168*k + 46 + 62))]['fullVisitorId'])
    # user id in the test timeframe appeared in train timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    tf_tst = tr[tr['fullVisitorId'].isin(set(tf_returned['fullVisitorId']))
            & (tr['date'] >= min(tr['date']) + timedelta(days=168*k + 46))
            & (tr['date'] < min(tr['date']) + timedelta(days=168*k + 46 + 62))]

    # for returned user
    tf_target = tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]\
                        .sum().apply(np.log1p, axis=1).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # for new user
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf_maxdate = max(tf['date'])
    tf_mindate = min(tf['date'])

    tf = tf.groupby('fullVisitorId').agg({
        # date
        'date': [
            ('firstSes', 'min'), 
            ('lastSes', 'max'),
            ('unique', 'nunique'),
            ('session_cnt', 'count')
        ],        
        # categorical
        'geoNetwork.networkDomain': [('geoNetwork.networkDomain',  get_most_common)],
        'geoNetwork.city': [('geoNetwork.city',  get_most_common)], 
        'geoNetwork.region': [('geoNetwork.region',  get_most_common)],
        'trafficSource.source' : [('trafficSource.source',  get_most_common)], 
        'geoNetwork.country' : [('geoNetwork.country',  get_most_common)],
        'device.browser' : [('device.browser',  get_most_common)],
        'geoNetwork.metro': [('geoNetwork.metro',  get_most_common)],
        'device.operatingSystem': [('device.operatingSystem',  get_most_common)],
        'geoNetwork.subContinent': [('geoNetwork.subContinent',  get_most_common)],
        'channelGrouping':[('channelGrouping',  get_most_common)],
        'trafficSource.medium': [('trafficSource.medium',  get_most_common)],
        'geoNetwork.continent': [('geoNetwork.continent',  get_most_common)],
        'device.deviceCategory': [('device.deviceCategory',  get_most_common)],
        #boolean
        'device.isMobile': [('device.isMobile',  'mean')],
        'totals.bounces': [('totals.bounces',  'mean')],
        'totals.newVisits':  [('totals.newVisits',  'mean')],
        'trafficSource.isTrueDirect':  [('trafficSource.isTrueDirect',  'mean')],

        # numeric
        'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('totals.transactions_sum', 'sum')],
        'visitNumber': [('visitNumber_max', 'max')],
        'totals.timeOnSite': [
            ('timeOnSite_sum', 'sum'),
            ('timeOnSite_min', 'min'),
            ('timeOnSite_max', 'max'),
            ('timeOnSite_max', 'mean'),
        ],
        'totals.sessionQualityDim': [
            ('sessionQualityDim_max', 'sum'),
            ('sessionQualityDim_max', 'min'),
            ('sessionQualityDim_min', 'max'),
            ('sessionQualityDim_mean', 'mean'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'mean'),
        ],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'mean'),
        ],
    })

    tf.columns = tf.columns.droplevel()

    tf['interval'] = (tf['lastSes'] - tf['firstSes']).astype(int)/10**9/86400
    tf['firstSes'] = (tf['firstSes'] - tf_mindate).astype(int)/10**9/86400
    tf['lastSes'] = (tf_maxdate - tf['lastSes']).astype(int)/10**9/86400

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf

In [31]:
%time df1 = getTimeFramewithFeatures(train_df, k=1)
%time df2 = getTimeFramewithFeatures(train_df, k=2)
%time df3 = getTimeFramewithFeatures(train_df, k=3)

train_1 = pd.concat([df1, df2, df3], ignore_index=True)
train_1.to_csv('output/self_ft/train_1.csv', index=False)

CPU times: user 1min 17s, sys: 397 ms, total: 1min 17s
Wall time: 1min 21s
CPU times: user 1min, sys: 267 ms, total: 1min
Wall time: 1min
CPU times: user 1min 17s, sys: 655 ms, total: 1min 18s
Wall time: 1min 19s


In [32]:
tf_maxdate = max(test_df['date'])
tf_mindate = min(test_df['date'])

tf = test_df.groupby('fullVisitorId').agg({
    # date
    'date': [
        ('firstSes', 'min'), 
        ('lastSes', 'max'),
        ('unique', 'nunique'),
        ('session_cnt', 'count')
    ],        
    # categorical
    'geoNetwork.networkDomain': [('geoNetwork.networkDomain',  get_most_common)],
    'geoNetwork.city': [('geoNetwork.city',  get_most_common)], 
    'geoNetwork.region': [('geoNetwork.region',  get_most_common)],
    'trafficSource.source' : [('trafficSource.source',  get_most_common)], 
    'geoNetwork.country' : [('geoNetwork.country',  get_most_common)],
    'device.browser' : [('device.browser',  get_most_common)],
    'geoNetwork.metro': [('geoNetwork.metro',  get_most_common)],
    'device.operatingSystem': [('device.operatingSystem',  get_most_common)],
    'geoNetwork.subContinent': [('geoNetwork.subContinent',  get_most_common)],
    'channelGrouping':[('channelGrouping',  get_most_common)],
    'trafficSource.medium': [('trafficSource.medium',  get_most_common)],
    'geoNetwork.continent': [('geoNetwork.continent',  get_most_common)],
    'device.deviceCategory': [('device.deviceCategory',  get_most_common)],
    #boolean
    'device.isMobile': [('device.isMobile',  'mean')],
    'totals.bounces': [('totals.bounces',  'mean')],
    'totals.newVisits':  [('totals.newVisits',  'mean')],
    'trafficSource.isTrueDirect':  [('trafficSource.isTrueDirect',  'mean')],

    # numeric
    'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
    'totals.transactions': [('totals.transactions_sum', 'sum')],
    'visitNumber': [('visitNumber_max', 'max')],
    'totals.timeOnSite': [
        ('timeOnSite_sum', 'sum'),
        ('timeOnSite_min', 'min'),
        ('timeOnSite_max', 'max'),
        ('timeOnSite_max', 'mean'),
    ],
    'totals.sessionQualityDim': [
        ('sessionQualityDim_max', 'sum'),
        ('sessionQualityDim_max', 'min'),
        ('sessionQualityDim_min', 'max'),
        ('sessionQualityDim_mean', 'mean'),
    ],
    'totals.pageviews': [
        ('pageviews_sum', 'sum'),
        ('pageviews_min', 'min'),
        ('pageviews_max', 'max'),
        ('pageviews_mean', 'mean'),
    ],
    'totals.hits': [
        ('hits_sum', 'sum'),
        ('hits_min', 'min'), 
        ('hits_max', 'max'), 
        ('hits_mean', 'mean'),
    ],
})

tf.columns = tf.columns.droplevel()

tf['interval'] = (tf['lastSes'] - tf['firstSes']).astype(int)/10**9/86400
tf['firstSes'] = (tf['firstSes'] - tf_mindate).astype(int)/10**9/86400
tf['lastSes'] = (tf_maxdate - tf['lastSes']).astype(int)/10**9/86400
tf.to_csv('output/self_ft/test_X.csv', index=False)

# 6.&nbsp;Choose model

## LightGBM

> is a gradient boosting framework based on decision trees to increases: 

- the efficiency of the model
- reduces memory usage.

Light GBM splits the tree leaf-wise with the best fit whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. In other words, Light GBM grows trees vertically while other algorithms grow trees horizontally.

It uses two novel techniques:
- Gradient-based One Side Sampling (GOSS): sampling method which down samples the instances on basis of gradients.
- Exclusive Feature Bundling (EFB): down sample the feature to speed up tree learning.

**Advantage:**
- *Faster training speed and higher efficiency*: Light GBM uses a histogram-based algorithm i.e it buckets continuous feature values into discrete bins which fasten the training procedure.
- *Lower memory usage*: Replaces continuous values to discrete bins which results in lower memory usage.
- *Better accuracy than any other boosting algorithm*: It produces much more complex trees by following leaf wise split approach rather than a level-wise approach which is the main factor in achieving higher accuracy.
- *Compatibility with Large Datasets*: It is capable of performing equally well with large datasets with a significant reduction in training time as compared to XGBoost.

**Disadvantage:**
- *Overfitting*: Light GBM split the tree leaf-wise which can lead to overfitting as it produces much complex trees.
- *Compatibility with Datasets*: Light GBM is sensitive to overfitting and thus can easily overfit small data.

*Source:* https://www.kaggle.com/general/264327

> **Level-wise:**

<img src='https://drive.google.com/uc?id=1Aq7vJbXb9NLrm7H-S1m8AT2NEkVJuAlW' width=600>

> **Leaf-wise:**

<img src='https://drive.google.com/uc?id=1TNxATnK-nFacko8qMJNqq9mEDAO_9fw0' width=700>

**Missing value handle:**
- Enables the missing value handle by default. Disable it by setting `use_missing=false`.
- Uses NA (NaN) to represent missing values by default. Change it to use zero by setting `zero_as_missing=true`.
- When `zero_as_missing=false` (default), the unrecorded values in sparse matrices (and LightSVM) are treated as zeros.
- When `zero_as_missing=true`, NA and zeros (including unrecorded values in sparse matrices (and LightSVM)) are treated as missing.


**Categorical handle:**
- LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories. This often performs better than one-hot encoding.
- Use `categorical_feature` to specify the categorical features. Refer to the parameter `categorical_feature` in Parameters.
- Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647).
- Use `min_data_per_group`, `cat_smooth` to deal with over-fitting (when `#data` is small or `#category` is large).
- For a categorical feature with high cardinality (`#category` is large), it often works best to treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories in a low-dimensional numeric space.

**Parameters:**
- `task`: default = `'train'`; options = `'train'`, `'prediction'`; Specifies the task we wish to perform which is either train or prediction.

- `application`: default = `'regression'`, options:
    - `'regression'`: perform regression task
    - `'binary'`: Binary classification

- `data`: type=string; training data, LightGBM will train from this data.
- `num_iterations`: number of boosting iterations to be performed; default=100; type=int.
- `num_leaves`: number of leaves in one tree ; default=31 ; type=int.
- `device`: default= cpu ; options = gpu,cpu. Device on which we want to train our model. Choose GPU for faster training.
- `max_depth`: Specify the max depth to which tree will grow. This parameter is used to deal with overfitting.

- `feature_fraction`: default=1 ; specifies the fraction of features to be taken for each iteration
- `bagging_fraction`: default=1 ; specifies the fraction of data to be used for each iteration and is generally used to speed up the training and avoid overfitting.
- `min_gain_to_split`: default=.1 ; min gain to perform splitting
- `max_bin `: max number of bins to bucket the feature values.
- `min_data_in_bin` : min number of data in one bin
- `num_threads`: default=OpenMP_default, type=int ;Number of threads for Light GBM.
label : type=string ; specify the label column
- `categorical_feature` : type=string ; specify the categorical features we want to use for training our model
- `num_class`: default=1; type=int ; used only for multi-class classification

# 7.&nbsp;Train model

In [40]:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [43]:
data = pd.read_csv('output/self_ft/train_1.csv', low_memory=False, dtype={'fullVisitorId': str})
x_pred = pd.read_csv('output/self_ft/test_X.csv', dtype={'fullVisitorId': str})

In [44]:
x = data.drop(['fullVisitorId','target'],axis=1)
y = data['target']
# train and test split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.33,random_state=42)

In [45]:
x_train.columns

Index(['firstSes', 'lastSes', 'unique', 'session_cnt',
       'geoNetwork.networkDomain', 'geoNetwork.city', 'geoNetwork.region',
       'trafficSource.source', 'geoNetwork.country', 'device.browser',
       'geoNetwork.metro', 'device.operatingSystem', 'geoNetwork.subContinent',
       'channelGrouping', 'trafficSource.medium', 'geoNetwork.continent',
       'device.deviceCategory', 'device.isMobile', 'totals.bounces',
       'totals.newVisits', 'trafficSource.isTrueDirect',
       'totals.totalTransactionRevenue_sum', 'totals.transactions_sum',
       'visitNumber_max', 'timeOnSite_sum', 'timeOnSite_min', 'timeOnSite_max',
       'timeOnSite_max.1', 'sessionQualityDim_max', 'sessionQualityDim_max.1',
       'sessionQualityDim_min', 'sessionQualityDim_mean', 'pageviews_sum',
       'pageviews_min', 'pageviews_max', 'pageviews_mean', 'hits_sum',
       'hits_min', 'hits_max', 'hits_mean', 'interval'],
      dtype='object')

In [50]:
cat_index = [4,5,6,7,8,9,10,11,12,13,14,15,16]

In [51]:
%%time
params = {
    'num_leaves': [7, 14, 21, 28, 31, 50],
    'learning_rate': [0.05, 0.03, 0.003],
    'max_depth': [-1, 3, 5],
    'n_estimators': [50, 100, 200, 500],
}

grid = GridSearchCV(lgb.LGBMRegressor(max_bin=256, categorical_features=cat_index, random_state=0), params, scoring='r2', cv=5)
grid.fit(x_train, y_train)

lgbm_tuned = grid.best_estimator_

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.68 µs


KeyboardInterrupt: ignored

In [None]:
# model = lgb.LGBMRegressor(max_bin=256, categorical_features=x_train.columns[4:7], learning_rate=0.05,random_state=42)
# model.fit(x_train,y_train,eval_set=[(x_test,y_test)], verbose=20, eval_metric='rmse')

# 8.&nbsp;Predict output

In [None]:
sr = x_pred.dtypes
lst_cat = sr[sr == 'object'].index.tolist()[1:]

x_pred[lst_cat] = x_pred[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])

In [None]:
pred = lgbm_tuned.predict(x_pred.drop(columns='fullVisitorId'))
prediction = pd.DataFrame(zip(x_pred['fullVisitorId'], pred), columns = ['fullVisitorId', 'PredictedLogRevenue'])
pd.DataFrame(prediction).to_csv('output/self_ft/prediction.csv', index=False)
prediction