# Google Analytics Customer Revenue Prediction

* Data Exploration
* 2018.09.17 ~

## 1. Import Required Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 2. Data Loading

In [None]:
train = pd.read_csv("../data/train.csv", index_col="fullVisitorId", parse_dates=['date'])
test = pd.read_csv("../data/test.csv",index_col="fullVisitorId", parse_dates=['date'])
submission = pd.read_csv("../data/sample_submission.csv",index_col="fullVisitorId")

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

In [None]:
print(submission.shape)
submission.head()


### **[comment_180917]**
* How should the JSON-formatted column values be converted into DataFrame form?

## 3. Convert JSON to df

In [2]:
import json
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas releases

In [3]:
def load_df(csv_path='../data/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path,
                     parse_dates=['date'],
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important: keep IDs as strings so leading zeros are not lost
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    return df
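
To make the flattening step in `load_df` easier to follow, here is a minimal sketch on a tiny in-memory frame. The toy values are made up for illustration, and the sketch assumes a pandas version that exposes `pd.json_normalize` at the top level (pandas 1.0 or later).

```python
import json
import pandas as pd

# Toy frame with one JSON-encoded column (values are made up for illustration).
toy = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = toy["device"].apply(json.loads)
device_df = pd.json_normalize(parsed.tolist())
device_df.columns = [f"device.{c}" for c in device_df.columns]

# Replace the raw JSON column with the flattened columns (aligned on the index).
flat = toy.drop(columns="device").join(device_df)
print(flat.columns.tolist())
```

Each key of the JSON object becomes its own `device.*` column, which is the same naming scheme `load_df` produces for the four JSON columns.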

In [4]:
%%time
train = load_df()
test = load_df('../data/test.csv')

CPU times: user 3min 48s, sys: 7.67 s, total: 3min 56s
Wall time: 3min 53s


## 4. Feature Exploration

### Data Fields
* fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
* channelGrouping - The channel via which the user came to the Store.
* date - The date on which the user visited the Store.
* device - The specifications for the device used to access the Store.
* geoNetwork - This section contains information about the geography of the user.
* sessionId - A unique identifier for this visit to the store.
* socialEngagementType - Engagement type, either "Socially Engaged" or "Not Socially Engaged".
* totals - This section contains aggregate values across the session.
* trafficSource - This section contains information about the Traffic Source from which the session originated.
* visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
* visitNumber - The session number for this user. If this is the first session, then this is set to 1.
* visitStartTime - The timestamp (expressed as POSIX time).
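
The note on `visitId` above can be illustrated with a small sketch. The rows are made up, and `session_key` is a hypothetical column name introduced only for this example.

```python
import pandas as pd

# Made-up sessions: two different visitors share the same visitId.
sessions = pd.DataFrame({
    "fullVisitorId": ["111", "222", "111"],
    "visitId": [1472830385, 1472830385, 1472880147],
})

# visitId alone is not unique across users ...
print(sessions["visitId"].is_unique)

# ... but the (fullVisitorId, visitId) pair is.
sessions["session_key"] = (
    sessions["fullVisitorId"] + "_" + sessions["visitId"].astype(str)
)
print(sessions["session_key"].is_unique)
```

The `sessionId` values shown in `train.head()` follow this same `fullVisitorId_visitId` pattern.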

In [5]:
train.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device.browser,device.browserSize,...,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.campaignCode,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source
0,Organic Search,2016-09-02,1131660440785968503,1131660440785968503_1472830385,Not Socially Engaged,1472830385,1,1472830385,Chrome,not available in demo dataset,...,,,,(not set),,,(not provided),organic,,google
1,Organic Search,2016-09-02,377306020877927890,377306020877927890_1472880147,Not Socially Engaged,1472880147,1,1472880147,Firefox,not available in demo dataset,...,,,,(not set),,,(not provided),organic,,google
2,Organic Search,2016-09-02,3895546263509774583,3895546263509774583_1472865386,Not Socially Engaged,1472865386,1,1472865386,Chrome,not available in demo dataset,...,,,,(not set),,,(not provided),organic,,google
3,Organic Search,2016-09-02,4763447161404445595,4763447161404445595_1472881213,Not Socially Engaged,1472881213,1,1472881213,UC Browser,not available in demo dataset,...,,,,(not set),,,google + online,organic,,google
4,Organic Search,2016-09-02,27294437909732085,27294437909732085_1472822600,Not Socially Engaged,1472822600,2,1472822600,Chrome,not available in demo dataset,...,,,,(not set),,True,(not provided),organic,,google


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 903653 entries, 0 to 903652
Data columns (total 55 columns):
channelGrouping                                      903653 non-null object
date                                                 903653 non-null datetime64[ns]
fullVisitorId                                        903653 non-null object
sessionId                                            903653 non-null object
socialEngagementType                                 903653 non-null object
visitId                                              903653 non-null int64
visitNumber                                          903653 non-null int64
visitStartTime                                       903653 non-null int64
device.browser                                       903653 non-null object
device.browserSize                                   903653 non-null object
device.browserVersion                                903653 non-null object
device.deviceCategory                           

### 4-1 ID ?
* The data contains three kinds of ID: 'fullVisitorId', 'visitId', and 'sessionId'.


In [7]:
print("fullVisitorId:", train['fullVisitorId'].unique().shape[0])
print("visitId:", train['visitId'].unique().shape[0])
print("sessionId:", train['sessionId'].unique().shape[0])
train.shape[0]

fullVisitorId: 714167
visitId: 886303
sessionId: 902755


903653
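
The counts above show fewer distinct `sessionId` values than rows (903653 - 902755 = 898), so some session IDs appear more than once. A minimal sketch for locating such rows, using made-up data:

```python
import pandas as pd

# Made-up rows in which one sessionId appears twice.
toy = pd.DataFrame({
    "sessionId": ["a_1", "a_1", "b_2", "c_3"],
    "totals.hits": [3, 5, 1, 2],
})

# Rows beyond the first occurrence of each sessionId
# (equals number of rows minus number of unique IDs).
n_extra = toy["sessionId"].duplicated().sum()

# Every row that shares its sessionId with another row, for inspection.
dupes = toy[toy["sessionId"].duplicated(keep=False)]
print(n_extra, len(dupes))
```

Running the same two lines against `train` would list the repeated sessions so they can be examined before deciding how to treat them.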

### 4-2 Feature by Feature

* Explore each feature to decide whether to drop it.
* Decide the dtype of each feature.
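
One way to shortlist drop candidates is to look for columns that hold a single value in every row. This is only a sketch on made-up data, not necessarily the criterion used for the `drop` list in this notebook.

```python
import pandas as pd

# Made-up frame: two constant columns and one informative column.
toy = pd.DataFrame({
    "device.browserSize": ["not available in demo dataset"] * 3,
    "device.deviceCategory": ["desktop", "mobile", "desktop"],
    "socialEngagementType": ["Not Socially Engaged"] * 3,
})

# A column with one distinct value (NaN counted as a value) carries no signal.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```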

In [8]:
train.columns

Index(['channelGrouping', 'date', 'fullVisitorId', 'sessionId',
       'socialEngagementType', 'visitId', 'visitNumber', 'visitStartTime',
       'device.browser', 'device.browserSize', 'device.browserVersion',
       'device.deviceCategory', 'device.flashVersion', 'device.isMobile',
       'device.language', 'device.mobileDeviceBranding',
       'device.mobileDeviceInfo', 'device.mobileDeviceMarketingName',
       'device.mobileDeviceModel', 'device.mobileInputSelector',
       'device.operatingSystem', 'device.operatingSystemVersion',
       'device.screenColors', 'device.screenResolution', 'geoNetwork.city',
       'geoNetwork.cityId', 'geoNetwork.continent', 'geoNetwork.country',
       'geoNetwork.latitude', 'geoNetwork.longitude', 'geoNetwork.metro',
       'geoNetwork.networkDomain', 'geoNetwork.networkLocation',
       'geoNetwork.region', 'geoNetwork.subContinent', 'totals.bounces',
       'totals.hits', 'totals.newVisits', 'totals.pageviews',
       'totals.transactionRevenue

* After checking the values of each feature, the columns are grouped by type as follows.

In [9]:
categorical = ['channelGrouping','device.deviceCategory','trafficSource.adwordsClickInfo.adNetworkType']
float_col = ['visitNumber','totals.bounces','totals.hits','totals.newVisits', 'totals.pageviews', 'totals.transactionRevenue','totals.visits']
float_col_test = ['visitNumber','totals.bounces','totals.hits','totals.newVisits', 'totals.pageviews','totals.visits']
date=['date']
drop=['socialEngagementType', 'visitStartTime', 'device.browser', 'device.browserSize','device.browserVersion','device.flashVersion', 'device.isMobile','device.mobileDeviceModel','device.language','device.mobileDeviceBranding',
    'device.mobileDeviceInfo', 'device.mobileDeviceMarketingName', 'device.mobileInputSelector','device.operatingSystem','device.operatingSystemVersion','device.screenColors',
     'device.screenResolution','geoNetwork.city', 'geoNetwork.cityId','geoNetwork.continent','geoNetwork.country','geoNetwork.latitude', 'geoNetwork.longitude', 'geoNetwork.metro',
       'geoNetwork.networkDomain', 'geoNetwork.networkLocation',
       'geoNetwork.region', 'geoNetwork.subContinent','trafficSource.adContent','trafficSource.adwordsClickInfo.criteriaParameters','trafficSource.adwordsClickInfo.gclId',
      'trafficSource.adwordsClickInfo.isVideoAd','trafficSource.adwordsClickInfo.page','trafficSource.adwordsClickInfo.slot','trafficSource.campaignCode','trafficSource.medium',
      'trafficSource.referralPath','trafficSource.source','trafficSource.isTrueDirect']
test_drop = ['socialEngagementType', 'visitStartTime', 'device.browser', 'device.browserSize','device.browserVersion','device.flashVersion', 'device.isMobile','device.mobileDeviceModel','device.language','device.mobileDeviceBranding',
    'device.mobileDeviceInfo', 'device.mobileDeviceMarketingName', 'device.mobileInputSelector','device.operatingSystem','device.operatingSystemVersion','device.screenColors',
     'device.screenResolution','geoNetwork.city', 'geoNetwork.cityId','geoNetwork.continent','geoNetwork.country','geoNetwork.latitude', 'geoNetwork.longitude', 'geoNetwork.metro',
       'geoNetwork.networkDomain', 'geoNetwork.networkLocation',
       'geoNetwork.region', 'geoNetwork.subContinent','trafficSource.adContent','trafficSource.adwordsClickInfo.criteriaParameters','trafficSource.adwordsClickInfo.gclId',
      'trafficSource.adwordsClickInfo.isVideoAd','trafficSource.adwordsClickInfo.page','trafficSource.adwordsClickInfo.slot','trafficSource.medium',
      'trafficSource.referralPath','trafficSource.source','trafficSource.isTrueDirect']
ID = ['visitId', 'sessionId']
boolen = ['trafficSource.campaign','trafficSource.keyword']

* drop useless columns

In [10]:
train = train.drop(drop, axis=1)

In [11]:
print(train.shape)
train.head()

(903653, 16)


Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,visitId,visitNumber,device.deviceCategory,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.transactionRevenue,totals.visits,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.campaign,trafficSource.keyword
0,Organic Search,2016-09-02,1131660440785968503,1131660440785968503_1472830385,1472830385,1,desktop,1,1,1.0,1,,1,,(not set),(not provided)
1,Organic Search,2016-09-02,377306020877927890,377306020877927890_1472880147,1472880147,1,desktop,1,1,1.0,1,,1,,(not set),(not provided)
2,Organic Search,2016-09-02,3895546263509774583,3895546263509774583_1472865386,1472865386,1,desktop,1,1,1.0,1,,1,,(not set),(not provided)
3,Organic Search,2016-09-02,4763447161404445595,4763447161404445595_1472881213,1472881213,1,desktop,1,1,1.0,1,,1,,(not set),google + online
4,Organic Search,2016-09-02,27294437909732085,27294437909732085_1472822600,1472822600,2,mobile,1,1,,1,,1,,(not set),(not provided)


In [12]:
test = test.drop(test_drop, axis=1)

In [13]:
print(test.shape)
test.head()

(804684, 15)


Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,visitId,visitNumber,device.deviceCategory,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.visits,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.campaign,trafficSource.keyword
0,Organic Search,2017-10-16,6167871330617112363,6167871330617112363_1508151024,1508151024,2,desktop,,4,,4,1,,(not set),(not provided)
1,Organic Search,2017-10-16,643697640977915618,0643697640977915618_1508175522,1508175522,1,desktop,,5,1.0,5,1,,(not set),(not provided)
2,Organic Search,2017-10-16,6059383810968229466,6059383810968229466_1508143220,1508143220,1,desktop,,7,1.0,7,1,,(not set),(not provided)
3,Organic Search,2017-10-16,2376720078563423631,2376720078563423631_1508193530,1508193530,1,mobile,,8,1.0,4,1,,(not set),(not provided)
4,Organic Search,2017-10-16,2314544520795440038,2314544520795440038_1508217442,1508217442,1,desktop,,9,1.0,4,1,,(not set),(not provided)


* Parsing date column

In [14]:
# Note: 'week' holds the day of the week (Monday=0 ... Sunday=6), not the week number.
train['year']=train['date'].dt.year
train['month']=train['date'].dt.month
train['day']=train['date'].dt.day
train['week']=train['date'].dt.dayofweek
test['year']=test['date'].dt.year
test['month']=test['date'].dt.month
test['day']=test['date'].dt.day
test['week']=test['date'].dt.dayofweek

train = train.drop('date', axis=1)
test = test.drop('date', axis=1)
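
As a quick check of what `dt.dayofweek` returns, the sketch below applies it to the two dates that appear in the `train` and `test` previews.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2016-09-02", "2017-10-16"]))

# dayofweek counts from Monday=0 to Sunday=6.
day_index = dates.dt.dayofweek.tolist()
day_names = dates.dt.day_name().tolist()
print(day_index, day_names)
```

2016-09-02 was a Friday (index 4) and 2017-10-16 a Monday (index 0), which matches the `week` values shown in the outputs of this notebook.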

In [15]:
train.head()

Unnamed: 0,channelGrouping,fullVisitorId,sessionId,visitId,visitNumber,device.deviceCategory,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.transactionRevenue,totals.visits,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.campaign,trafficSource.keyword,year,month,day,week
0,Organic Search,1131660440785968503,1131660440785968503_1472830385,1472830385,1,desktop,1,1,1.0,1,,1,,(not set),(not provided),2016,9,2,4
1,Organic Search,377306020877927890,377306020877927890_1472880147,1472880147,1,desktop,1,1,1.0,1,,1,,(not set),(not provided),2016,9,2,4
2,Organic Search,3895546263509774583,3895546263509774583_1472865386,1472865386,1,desktop,1,1,1.0,1,,1,,(not set),(not provided),2016,9,2,4
3,Organic Search,4763447161404445595,4763447161404445595_1472881213,1472881213,1,desktop,1,1,1.0,1,,1,,(not set),google + online,2016,9,2,4
4,Organic Search,27294437909732085,27294437909732085_1472822600,1472822600,2,mobile,1,1,,1,,1,,(not set),(not provided),2016,9,2,4


In [16]:
test.head()

Unnamed: 0,channelGrouping,fullVisitorId,sessionId,visitId,visitNumber,device.deviceCategory,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.visits,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.campaign,trafficSource.keyword,year,month,day,week
0,Organic Search,6167871330617112363,6167871330617112363_1508151024,1508151024,2,desktop,,4,,4,1,,(not set),(not provided),2017,10,16,0
1,Organic Search,643697640977915618,0643697640977915618_1508175522,1508175522,1,desktop,,5,1.0,5,1,,(not set),(not provided),2017,10,16,0
2,Organic Search,6059383810968229466,6059383810968229466_1508143220,1508143220,1,desktop,,7,1.0,7,1,,(not set),(not provided),2017,10,16,0
3,Organic Search,2376720078563423631,2376720078563423631_1508193530,1508193530,1,mobile,,8,1.0,4,1,,(not set),(not provided),2017,10,16,0
4,Organic Search,2314544520795440038,2314544520795440038_1508217442,1508217442,1,desktop,,9,1.0,4,1,,(not set),(not provided),2017,10,16,0


In [17]:
for col in float_col:
    train[col]=train[col].astype(float)

    
for col in float_col_test:
    test[col]=test[col].astype(float)    
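
`astype(float)` raises an error if a column contains a string that cannot be parsed as a number. If that ever happens, `pd.to_numeric` with `errors="coerce"` is a possible alternative; the sketch below uses made-up values.

```python
import pandas as pd

# Made-up raw values as they might come out of the JSON columns.
raw = pd.Series(["1", "12", None, "n/a"])

# Unparseable entries become NaN instead of raising an error.
as_float = pd.to_numeric(raw, errors="coerce")
print(as_float)
```

The trade-off is that bad values are silently turned into NaN, so it is worth counting the missing values before and after the conversion.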



In [18]:
train.dtypes

channelGrouping                                  object
fullVisitorId                                    object
sessionId                                        object
visitId                                           int64
visitNumber                                     float64
device.deviceCategory                            object
totals.bounces                                  float64
totals.hits                                     float64
totals.newVisits                                float64
totals.pageviews                                float64
totals.transactionRevenue                       float64
totals.visits                                   float64
trafficSource.adwordsClickInfo.adNetworkType     object
trafficSource.campaign                           object
trafficSource.keyword                            object
year                                              int64
month                                             int64
day                                             

In [19]:
test.dtypes

channelGrouping                                  object
fullVisitorId                                    object
sessionId                                        object
visitId                                           int64
visitNumber                                     float64
device.deviceCategory                            object
totals.bounces                                  float64
totals.hits                                     float64
totals.newVisits                                float64
totals.pageviews                                float64
totals.visits                                   float64
trafficSource.adwordsClickInfo.adNetworkType     object
trafficSource.campaign                           object
trafficSource.keyword                            object
year                                              int64
month                                             int64
day                                               int64
week                                            

In [20]:
train[boolen].head()

Unnamed: 0,trafficSource.campaign,trafficSource.keyword
0,(not set),(not provided)
1,(not set),(not provided)
2,(not set),(not provided)
3,(not set),google + online
4,(not set),(not provided)


In [21]:
train.loc[train['trafficSource.campaign']!="(not set)", ["trafficSource.campaign"]]=1
train.loc[train['trafficSource.campaign']=="(not set)", ["trafficSource.campaign"]]=0

In [22]:
test.loc[test['trafficSource.campaign']!="(not set)", ["trafficSource.campaign"]]=1
test.loc[test['trafficSource.campaign']=="(not set)", ["trafficSource.campaign"]]=0

In [23]:
# Encode the keyword as a flag: 1 when a real keyword is present, 0 when it is
# missing (NaN) or "(not provided)". Treating NaN as 0 is the assumed intent:
# fillna(0) without assignment leaves the column unchanged, and
# NaN != "(not provided)" evaluates to True, so NaN must be handled explicitly.
for df in (train, test):
    has_keyword = df['trafficSource.keyword'].notna() & (df['trafficSource.keyword'] != "(not provided)")
    df['trafficSource.keyword'] = has_keyword.astype(int)

In [24]:
train[categorical].head()

Unnamed: 0,channelGrouping,device.deviceCategory,trafficSource.adwordsClickInfo.adNetworkType
0,Organic Search,desktop,
1,Organic Search,desktop,
2,Organic Search,desktop,
3,Organic Search,desktop,
4,Organic Search,mobile,


In [25]:
train = pd.get_dummies(train,columns=['trafficSource.adwordsClickInfo.adNetworkType'])
train = pd.get_dummies(train,columns=['channelGrouping'])
train = pd.get_dummies(train,columns=['device.deviceCategory'])

test = pd.get_dummies(test,columns=['trafficSource.adwordsClickInfo.adNetworkType'])
test = pd.get_dummies(test,columns=['channelGrouping'])
test = pd.get_dummies(test,columns=['device.deviceCategory'])
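
Because `get_dummies` is applied to `train` and `test` separately, a category that occurs in only one of them produces a column the other lacks. A minimal sketch of one way to align the two, on made-up data:

```python
import pandas as pd

# Made-up data: "Referral" occurs only in the training frame.
train_toy = pd.DataFrame({"channelGrouping": ["Direct", "Social", "Referral"]})
test_toy = pd.DataFrame({"channelGrouping": ["Direct", "Social"]})

train_d = pd.get_dummies(train_toy, columns=["channelGrouping"])
test_d = pd.get_dummies(test_toy, columns=["channelGrouping"])

# Reindex test to the training columns; categories unseen in test
# become all-zero columns, and test-only columns are dropped.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d.columns.tolist())
```

On the real frames the target column `totals.transactionRevenue` exists only in `train`, so it would need to be excluded from the column list used for reindexing.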

In [26]:
train.head()

Unnamed: 0,fullVisitorId,sessionId,visitId,visitNumber,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.transactionRevenue,totals.visits,...,channelGrouping_Affiliates,channelGrouping_Direct,channelGrouping_Display,channelGrouping_Organic Search,channelGrouping_Paid Search,channelGrouping_Referral,channelGrouping_Social,device.deviceCategory_desktop,device.deviceCategory_mobile,device.deviceCategory_tablet
0,1131660440785968503,1131660440785968503_1472830385,1472830385,1.0,1.0,1.0,1.0,1.0,,1.0,...,0,0,0,1,0,0,0,1,0,0
1,377306020877927890,377306020877927890_1472880147,1472880147,1.0,1.0,1.0,1.0,1.0,,1.0,...,0,0,0,1,0,0,0,1,0,0
2,3895546263509774583,3895546263509774583_1472865386,1472865386,1.0,1.0,1.0,1.0,1.0,,1.0,...,0,0,0,1,0,0,0,1,0,0
3,4763447161404445595,4763447161404445595_1472881213,1472881213,1.0,1.0,1.0,1.0,1.0,,1.0,...,0,0,0,1,0,0,0,1,0,0
4,27294437909732085,27294437909732085_1472822600,1472822600,2.0,1.0,1.0,,1.0,,1.0,...,0,0,0,1,0,0,0,0,1,0


In [28]:
train.isnull().sum()

fullVisitorId                                                        0
sessionId                                                            0
visitId                                                              0
visitNumber                                                          0
totals.bounces                                                  453023
totals.hits                                                          0
totals.newVisits                                                200593
totals.pageviews                                                   100
totals.transactionRevenue                                       892138
totals.visits                                                        0
trafficSource.campaign                                               0
trafficSource.keyword                                                0
year                                                                 0
month                                                                0
day   

In [31]:
train['totals.bounces']

0         1.0
1         1.0
2         1.0
3         1.0
4         1.0
5         1.0
6         1.0
7         1.0
8         1.0
9         1.0
10        1.0
11        1.0
12        1.0
13        1.0
14        1.0
15        1.0
16        1.0
17        1.0
18        1.0
19        1.0
20        1.0
21        1.0
22        1.0
23        1.0
24        1.0
25        1.0
26        1.0
27        1.0
28        1.0
29        1.0
         ... 
903623    NaN
903624    NaN
903625    NaN
903626    NaN
903627    NaN
903628    NaN
903629    NaN
903630    NaN
903631    NaN
903632    NaN
903633    NaN
903634    NaN
903635    NaN
903636    NaN
903637    NaN
903638    NaN
903639    NaN
903640    NaN
903641    NaN
903642    NaN
903643    NaN
903644    NaN
903645    NaN
903646    NaN
903647    NaN
903648    NaN
903649    NaN
903650    NaN
903651    NaN
903652    NaN
Name: totals.bounces, Length: 903653, dtype: float64
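
The NaN values in `totals.bounces` look like "no bounce" rather than unknown data, since the column otherwise only holds 1.0. Under that assumption (not verified in this notebook), and the similar assumption for `totals.newVisits` and `totals.transactionRevenue`, a possible next step is to fill them with 0. The values below are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the pattern seen above: 1.0 or NaN.
toy = pd.DataFrame({
    "totals.bounces": [1.0, np.nan, np.nan],
    "totals.newVisits": [1.0, 1.0, np.nan],
    "totals.transactionRevenue": [np.nan, 37860000.0, np.nan],
})

# Share of missing values per column.
print(toy.isnull().mean())

# Assumption: a missing value in these columns means "did not happen".
flag_cols = ["totals.bounces", "totals.newVisits", "totals.transactionRevenue"]
filled = toy.copy()
filled[flag_cols] = filled[flag_cols].fillna(0)
print(filled)
```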