# Project - Google Analytics Customer Revenue Preprocessing

## Presenting the initial data: 

<b>Data Fields: </b>

<b>fullVisitorIdv</b> - A unique identifier for each user of the Google Merchandise Store. <br>
<b>channelGrouping</b> - The channel via which the user came to the Store.<br>
<b>date</b> - The date on which the user visited the Store.<br>
<b>device </b>- The specifications for the device used to access the Store.<br>
<b>geoNetwork</b> - This section contains information about the geography of the user.<br>
<b>sessionId</b> - A unique identifier for this visit to the store.<br>
<b>socialEngagementType</b> - Engagement type, either "Socially Engaged" or "Not Socially Engaged".<br>
<b>totals</b> - This section contains aggregate values across the session.<br>
<b>trafficSource</b> - This section contains information about the Traffic Source from which the session originated.<br>
<b>visitId</b> - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.<br>
<b>visitNumber</b> - The session number for this user. If this is the first session, then this is set to 1.<br>
<b>visitStartTime</b> - The timestamp (expressed as POSIX time).<br>

# Objectives: 

The main objectives of this project are :

* Load the data so everything is in tabular format (some columns contain JSON so it you will need to find ways to separate those into independent columns)
* Identify the variables that need special processing (removing or infering missing values, removing columns that don't contain useful information)
* Run visualizations to better understand the data

tip : the ```pd.read_csv``` function has some very useful arguments that will help you read the data properly. Use the ```converters``` argument along with the ```json.loads``` function in order to read the json columns , ```dtype``` enables you to set the type of specific columns, you may also use the ```skiprows``` argument in order to load a fraction of the dataset for faster subsequent processing.

## Importing necessary libraries

In [1]:
# Necessary librarys
import os # it's a operational system library, to set some informations
import random # random is to generate random values

import pandas as pd # to manipulate data frames 
import numpy as np # to work with matrix

import matplotlib.pyplot as plt # to graphics plot
import seaborn as sns # a good library to graphic plots

# Importing librarys to use on interactive graphs
from plotly.offline import init_notebook_mode, iplot, plot 
import plotly.graph_objs as go 

import json # to convert json in df
from pandas.io.json import json_normalize # to normalize the json file

# to set a style to all graphs
plt.style.use('fivethirtyeight')
init_notebook_mode(connected=True)

## Some columns are in Json format so it will be necessary to handle them.

# Importing the datasets

In [6]:
%%time 
# %%time is used to calculate the timing of code chunk execution #

# We will import the data using the name and extension that will be concatenated with dir_path
df = pd.read_csv("s3://full-stack-bigdata-datasets/Machine Learning Supervisé/projects/preprocessing_linear_models/Google_dataset_sample.csv", low_memory=False) 
# The same to test dataset
#df_test = json_read("test.csv") 

CPU times: user 2.4 s, sys: 423 ms, total: 2.83 s
Wall time: 5.03 s


In [7]:
# This command shows the first 5 rows of our dataset
pd.set_option('display.max_columns', 500)

df.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device.browser,device.browserVersion,device.browserSize,device.operatingSystem,device.operatingSystemVersion,device.isMobile,device.mobileDeviceBranding,device.mobileDeviceModel,device.mobileInputSelector,device.mobileDeviceInfo,device.mobileDeviceMarketingName,device.flashVersion,device.language,device.screenColors,device.screenResolution,device.deviceCategory,geoNetwork.continent,geoNetwork.subContinent,geoNetwork.country,geoNetwork.region,geoNetwork.metro,geoNetwork.city,geoNetwork.cityId,geoNetwork.networkDomain,geoNetwork.latitude,geoNetwork.longitude,geoNetwork.networkLocation,totals.visits,totals.hits,totals.pageviews,totals.bounces,totals.newVisits,totals.transactionRevenue,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.keyword,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.isTrueDirect,trafficSource.referralPath,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adContent,trafficSource.campaignCode
0,Organic Search,20160902,4763447161404445595,4763447161404445595_1472881213,Not Socially Engaged,1472881213,1,1472881213,UC Browser,not available in demo dataset,not available in demo dataset,Linux,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Asia,Southeast Asia,Indonesia,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,unknown.unknown,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,1,1.0,1.0,1.0,,(not set),google,organic,google + online,not available in demo dataset,,,,,,,,,
1,Organic Search,20160902,1905672039242460897,1905672039242460897_1472817241,Not Socially Engaged,1472817241,1,1472817241,Chrome,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Asia,Southern Asia,Pakistan,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,unknown.unknown,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,1,1.0,1.0,1.0,,(not set),google,organic,(not provided),not available in demo dataset,,,,,,,,,
2,Organic Search,20160902,3696906537737368442,3696906537737368442_1472856874,Not Socially Engaged,1472856874,1,1472856874,Chrome,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Americas,South America,Argentina,Buenos Aires,(not set),Buenos Aires,not available in demo dataset,phonevision.com.ar,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,1,1.0,1.0,1.0,,(not set),google,organic,(not provided),not available in demo dataset,,,,,,,,,
3,Organic Search,20160902,8794587387581803040,8794587387581803040_1472816048,Not Socially Engaged,1472816048,1,1472816048,Internet Explorer,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Europe,Western Europe,Germany,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,hafele.com,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,1,1.0,1.0,1.0,,(not set),google,organic,(not provided),not available in demo dataset,,,,,,,,,
4,Organic Search,20160902,1438836965936298791,1438836965936298791_1472833322,Not Socially Engaged,1472833322,1,1472833322,Internet Explorer,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Europe,Northern Europe,Denmark,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,tdc.net,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,1,1.0,1.0,1.0,,(not set),google,organic,(not provided),not available in demo dataset,,,,,,,,,


## Knowing the missing values

## Creating the function to handle with date 

In [8]:
# library of datetime
from datetime import datetime

# This function is to extract date features
def date_process(df):
    df["date"] = pd.to_datetime(df["date"], format="%Y%m%d") # seting the column as pandas datetime
    df["_weekday"] = df['date'].dt.weekday #extracting week day
    df["_day"] = df['date'].dt.day # extracting day
    df["_month"] = df['date'].dt.month # extracting day
    df["_year"] = df['date'].dt.year # extracting day
    df['_visitHour'] = (df['visitStartTime'].apply(lambda x: str(datetime.fromtimestamp(x).hour))).astype(int)
    
    return df #returning the df after the transformations

In [9]:
df_train = date_process(df) #calling the function that we created above

## Deal with missing values

In [10]:
def NumericalColumns(df):    # fillna numeric feature
    df['totals.pageviews'].fillna(1, inplace=True) #filling NA's with 1
    df['totals.newVisits'].fillna(0, inplace=True) #filling NA's with 0
    df['totals.bounces'].fillna(0, inplace=True)   #filling NA's with 0
    df['trafficSource.isTrueDirect'].fillna(False, inplace=True) # filling boolean with False
    df['trafficSource.adwordsClickInfo.isVideoAd'].fillna(True, inplace=True) # filling boolean with True
    df["totals.transactionRevenue"] = df["totals.transactionRevenue"].fillna(0.0).astype(float) #filling NA with zero
    df['totals.pageviews'] = df['totals.pageviews'].astype(int) # setting numerical column as integer
    df['totals.newVisits'] = df['totals.newVisits'].astype(int) # setting numerical column as integer
    df['totals.bounces'] = df['totals.bounces'].astype(int)  # setting numerical column as integer
    df["totals.hits"] = df["totals.hits"].astype(float) # setting numerical to float
    df['totals.visits'] = df['totals.visits'].astype(int) # seting as int

    return df #return the transformed dataframe

In [14]:
from sklearn import preprocessing

def Normalizing(df):
    # Use MinMaxScaler to normalize the column
    df["totals.hits"] =  (df['totals.hits'] - min(df['totals.hits'])) / (max(df['totals.hits'])  - min(df['totals.hits']))
    # normalizing the transaction Revenue
    df['totals.transactionRevenue'] = df_train['totals.transactionRevenue'].apply(lambda x: np.log10(x+1))
    # return the modified df
    return df 

In [15]:
# call the function to transform the numerical columns
df_train = NumericalColumns(df_train)

# Call the function that will normalize some features
df_train = Normalizing(df_train)

In [16]:
df_train.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device.browser,device.browserVersion,device.browserSize,device.operatingSystem,device.operatingSystemVersion,device.isMobile,device.mobileDeviceBranding,device.mobileDeviceModel,device.mobileInputSelector,device.mobileDeviceInfo,device.mobileDeviceMarketingName,device.flashVersion,device.language,device.screenColors,device.screenResolution,device.deviceCategory,geoNetwork.continent,geoNetwork.subContinent,geoNetwork.country,geoNetwork.region,geoNetwork.metro,geoNetwork.city,geoNetwork.cityId,geoNetwork.networkDomain,geoNetwork.latitude,geoNetwork.longitude,geoNetwork.networkLocation,totals.visits,totals.hits,totals.pageviews,totals.bounces,totals.newVisits,totals.transactionRevenue,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.keyword,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.isTrueDirect,trafficSource.referralPath,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adContent,trafficSource.campaignCode,_weekday,_day,_month,_year,_visitHour
0,Organic Search,2016-09-02,4763447161404445595,4763447161404445595_1472881213,Not Socially Engaged,1472881213,1,1472881213,UC Browser,not available in demo dataset,not available in demo dataset,Linux,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Asia,Southeast Asia,Indonesia,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,unknown.unknown,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,0.0,1,1,1,0.0,(not set),google,organic,google + online,not available in demo dataset,False,,,,,,True,,,4,2,9,2016,7
1,Organic Search,2016-09-02,1905672039242460897,1905672039242460897_1472817241,Not Socially Engaged,1472817241,1,1472817241,Chrome,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Asia,Southern Asia,Pakistan,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,unknown.unknown,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,0.0,1,1,1,0.0,(not set),google,organic,(not provided),not available in demo dataset,False,,,,,,True,,,4,2,9,2016,13
2,Organic Search,2016-09-02,3696906537737368442,3696906537737368442_1472856874,Not Socially Engaged,1472856874,1,1472856874,Chrome,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Americas,South America,Argentina,Buenos Aires,(not set),Buenos Aires,not available in demo dataset,phonevision.com.ar,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,0.0,1,1,1,0.0,(not set),google,organic,(not provided),not available in demo dataset,False,,,,,,True,,,4,2,9,2016,0
3,Organic Search,2016-09-02,8794587387581803040,8794587387581803040_1472816048,Not Socially Engaged,1472816048,1,1472816048,Internet Explorer,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Europe,Western Europe,Germany,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,hafele.com,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,0.0,1,1,1,0.0,(not set),google,organic,(not provided),not available in demo dataset,False,,,,,,True,,,4,2,9,2016,13
4,Organic Search,2016-09-02,1438836965936298791,1438836965936298791_1472833322,Not Socially Engaged,1472833322,1,1472833322,Internet Explorer,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,desktop,Europe,Northern Europe,Denmark,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,tdc.net,not available in demo dataset,not available in demo dataset,not available in demo dataset,1,0.0,1,1,1,0.0,(not set),google,organic,(not provided),not available in demo dataset,False,,,,,,True,,,4,2,9,2016,18


In [17]:
# let's remove the date, sessionId, visitId, visitNumber, visitStartTime, geoNetwork.region, geoNetwork.metro, geoNetwork.city, geoNetwork.networkDomain
# let's transform all the _weekday, _day, _month, _year, _visitHour to string so we can dummyfy them
df_clean = df_train.drop(["date", "sessionId", "visitId", "visitNumber", "visitStartTime", "geoNetwork.region", "geoNetwork.metro", "geoNetwork.city", "geoNetwork.networkDomain",
"trafficSource.source",	"trafficSource.medium", "trafficSource.isTrueDirect",	"trafficSource.adwordsClickInfo.isVideoAd",	"trafficSource.campaignCode", "geoNetwork.continent",	"geoNetwork.subContinent", "_day"], axis = 1)

transform_to_string = ["_weekday", "_month", "_year", "_visitHour"]
for col in transform_to_string:
    df_clean[col] = df_clean[col].astype(str)

df_id = df_clean["fullVisitorId"]
df_no_id = df_clean.drop(["fullVisitorId"], axis=1)

object_variables = df_no_id.select_dtypes(include = "object")
non_object_variables = df_no_id.select_dtypes(exclude = "object")

category_to_replace = []
for col in object_variables :
    value_proportion_table = object_variables[col].value_counts()/len(object_variables)
    columns_to_replace = [col for col in value_proportion_table.keys() if value_proportion_table[col]>0.01]
    category_to_replace.append(columns_to_replace)

for i, col in enumerate(object_variables.columns) :
    object_variables[col] = np.where(object_variables[col].isin(category_to_replace[i]),object_variables[col], "others")

df_no_id = pd.concat([object_variables,non_object_variables], axis=1)

# dummyfy the variables
df_clean = pd.get_dummies(df_no_id, drop_first=True)
df_clean["fullVisitorId"] = df_id
df_clean.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,device.isMobile,totals.visits,totals.hits,totals.pageviews,totals.bounces,totals.newVisits,totals.transactionRevenue,trafficSource.adwordsClickInfo.page,channelGrouping_Direct,channelGrouping_Organic Search,channelGrouping_Paid Search,channelGrouping_Referral,channelGrouping_Social,channelGrouping_others,device.browser_Edge,device.browser_Firefox,device.browser_Internet Explorer,device.browser_Safari,device.browser_others,device.operatingSystem_Chrome OS,device.operatingSystem_Linux,device.operatingSystem_Macintosh,device.operatingSystem_Windows,device.operatingSystem_iOS,device.operatingSystem_others,device.deviceCategory_mobile,device.deviceCategory_tablet,geoNetwork.country_Brazil,geoNetwork.country_Canada,geoNetwork.country_France,geoNetwork.country_Germany,geoNetwork.country_India,geoNetwork.country_Indonesia,geoNetwork.country_Italy,geoNetwork.country_Japan,geoNetwork.country_Mexico,geoNetwork.country_Netherlands,geoNetwork.country_Philippines,geoNetwork.country_Poland,geoNetwork.country_Russia,geoNetwork.country_Spain,geoNetwork.country_Taiwan,geoNetwork.country_Thailand,geoNetwork.country_Turkey,geoNetwork.country_United Kingdom,geoNetwork.country_United States,geoNetwork.country_Vietnam,geoNetwork.country_others,trafficSource.campaign_AW - Dynamic Search Ads Whole Site,trafficSource.campaign_Data Share Promo,trafficSource.campaign_others,trafficSource.keyword_6qEhsCssdK0z36ri,trafficSource.keyword_others,trafficSource.referralPath_/analytics/web/,trafficSource.referralPath_/yt/about/,trafficSource.referralPath_/yt/about/es-419/,trafficSource.referralPath_/yt/about/pt-BR/,trafficSource.referralPath_/yt/about/ru/,trafficSource.referralPath_/yt/about/th/,trafficSource.referralPath_/yt/about/tr/,trafficSource.referralPath_/yt/about/vi/,trafficSource.referralPath_others,trafficSource.adwordsClickInfo.slot_others,trafficSource.adwordsClickInfo.adNetworkType_others,_weekday_1,_weekday_2,_weekday_3,_weekday_4,_weekday_5,_weekday_6,_month_10,_month_11,_month_12,_month_2,_month_3,_month_4,_month_5,_month_6,_month_7,_month_8,_month_9,_year_2017,_visitHour_1,_visitHour_10,_visitHour_11,_visitHour_12,_visitHour_13,_visitHour_14,_visitHour_15,_visitHour_16,_visitHour_17,_visitHour_18,_visitHour_19,_visitHour_2,_visitHour_20,_visitHour_21,_visitHour_22,_visitHour_23,_visitHour_3,_visitHour_4,_visitHour_5,_visitHour_6,_visitHour_7,_visitHour_8,_visitHour_9,fullVisitorId
0,False,1,0.0,1,1,1,0.0,,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,4763447161404445595
1,False,1,0.0,1,1,1,0.0,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1905672039242460897
2,False,1,0.0,1,1,1,0.0,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3696906537737368442
3,False,1,0.0,1,1,1,0.0,,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8794587387581803040
4,False,1,0.0,1,1,1,0.0,,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1438836965936298791


In [18]:
df_clean.shape

(126410, 106)

In [19]:
# let's group and aggregate data thanks to the fullVisitorId
df_agg = df_clean.groupby("fullVisitorId").sum()
df_agg.head()

Unnamed: 0_level_0,device.isMobile,totals.visits,totals.hits,totals.pageviews,totals.bounces,totals.newVisits,totals.transactionRevenue,trafficSource.adwordsClickInfo.page,channelGrouping_Direct,channelGrouping_Organic Search,channelGrouping_Paid Search,channelGrouping_Referral,channelGrouping_Social,channelGrouping_others,device.browser_Edge,device.browser_Firefox,device.browser_Internet Explorer,device.browser_Safari,device.browser_others,device.operatingSystem_Chrome OS,device.operatingSystem_Linux,device.operatingSystem_Macintosh,device.operatingSystem_Windows,device.operatingSystem_iOS,device.operatingSystem_others,device.deviceCategory_mobile,device.deviceCategory_tablet,geoNetwork.country_Brazil,geoNetwork.country_Canada,geoNetwork.country_France,geoNetwork.country_Germany,geoNetwork.country_India,geoNetwork.country_Indonesia,geoNetwork.country_Italy,geoNetwork.country_Japan,geoNetwork.country_Mexico,geoNetwork.country_Netherlands,geoNetwork.country_Philippines,geoNetwork.country_Poland,geoNetwork.country_Russia,geoNetwork.country_Spain,geoNetwork.country_Taiwan,geoNetwork.country_Thailand,geoNetwork.country_Turkey,geoNetwork.country_United Kingdom,geoNetwork.country_United States,geoNetwork.country_Vietnam,geoNetwork.country_others,trafficSource.campaign_AW - Dynamic Search Ads Whole Site,trafficSource.campaign_Data Share Promo,trafficSource.campaign_others,trafficSource.keyword_6qEhsCssdK0z36ri,trafficSource.keyword_others,trafficSource.referralPath_/analytics/web/,trafficSource.referralPath_/yt/about/,trafficSource.referralPath_/yt/about/es-419/,trafficSource.referralPath_/yt/about/pt-BR/,trafficSource.referralPath_/yt/about/ru/,trafficSource.referralPath_/yt/about/th/,trafficSource.referralPath_/yt/about/tr/,trafficSource.referralPath_/yt/about/vi/,trafficSource.referralPath_others,trafficSource.adwordsClickInfo.slot_others,trafficSource.adwordsClickInfo.adNetworkType_others,_weekday_1,_weekday_2,_weekday_3,_weekday_4,_weekday_5,_weekday_6,_month_10,_month_11,_month_12,_month_2,_month_3,_month_4,_month_5,_month_6,_month_7,_month_8,_month_9,_year_2017,_visitHour_1,_visitHour_10,_visitHour_11,_visitHour_12,_visitHour_13,_visitHour_14,_visitHour_15,_visitHour_16,_visitHour_17,_visitHour_18,_visitHour_19,_visitHour_2,_visitHour_20,_visitHour_21,_visitHour_22,_visitHour_23,_visitHour_3,_visitHour_4,_visitHour_5,_visitHour_6,_visitHour_7,_visitHour_8,_visitHour_9
fullVisitorId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1
114156543135683,0,1,0.0,1,1,1,0.0,0.0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
450371054833295,0,1,0.0,1,1,1,0.0,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
538867824729259,0,1,0.0,1,1,1,0.0,0.0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
639845445148063,0,1,0.014028,7,0,1,0.0,0.0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
72202462202136,1,1,0.0,1,1,1,0.0,0.0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [20]:
# Let's separate target and training variables

y = df_agg["totals.transactionRevenue"]
X = df_agg.drop(["totals.transactionRevenue"], axis=1)

In [21]:
# split the data between train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3)

In [22]:
# normalize the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = pd.DataFrame(sc.fit_transform(X_train), columns = X_train.columns, index= X_train.index)
X_test = pd.DataFrame(sc.fit_transform(X_test), columns = X_test.columns, index= X_test.index)

X_train.head()

Unnamed: 0_level_0,device.isMobile,totals.visits,totals.hits,totals.pageviews,totals.bounces,totals.newVisits,trafficSource.adwordsClickInfo.page,channelGrouping_Direct,channelGrouping_Organic Search,channelGrouping_Paid Search,channelGrouping_Referral,channelGrouping_Social,channelGrouping_others,device.browser_Edge,device.browser_Firefox,device.browser_Internet Explorer,device.browser_Safari,device.browser_others,device.operatingSystem_Chrome OS,device.operatingSystem_Linux,device.operatingSystem_Macintosh,device.operatingSystem_Windows,device.operatingSystem_iOS,device.operatingSystem_others,device.deviceCategory_mobile,device.deviceCategory_tablet,geoNetwork.country_Brazil,geoNetwork.country_Canada,geoNetwork.country_France,geoNetwork.country_Germany,geoNetwork.country_India,geoNetwork.country_Indonesia,geoNetwork.country_Italy,geoNetwork.country_Japan,geoNetwork.country_Mexico,geoNetwork.country_Netherlands,geoNetwork.country_Philippines,geoNetwork.country_Poland,geoNetwork.country_Russia,geoNetwork.country_Spain,geoNetwork.country_Taiwan,geoNetwork.country_Thailand,geoNetwork.country_Turkey,geoNetwork.country_United Kingdom,geoNetwork.country_United States,geoNetwork.country_Vietnam,geoNetwork.country_others,trafficSource.campaign_AW - Dynamic Search Ads Whole Site,trafficSource.campaign_Data Share Promo,trafficSource.campaign_others,trafficSource.keyword_6qEhsCssdK0z36ri,trafficSource.keyword_others,trafficSource.referralPath_/analytics/web/,trafficSource.referralPath_/yt/about/,trafficSource.referralPath_/yt/about/es-419/,trafficSource.referralPath_/yt/about/pt-BR/,trafficSource.referralPath_/yt/about/ru/,trafficSource.referralPath_/yt/about/th/,trafficSource.referralPath_/yt/about/tr/,trafficSource.referralPath_/yt/about/vi/,trafficSource.referralPath_others,trafficSource.adwordsClickInfo.slot_others,trafficSource.adwordsClickInfo.adNetworkType_others,_weekday_1,_weekday_2,_weekday_3,_weekday_4,_weekday_5,_weekday_6,_month_10,_month_11,_month_12,_month_2,_month_3,_month_4,_month_5,_month_6,_month_7,_month_8,_month_9,_year_2017,_visitHour_1,_visitHour_10,_visitHour_11,_visitHour_12,_visitHour_13,_visitHour_14,_visitHour_15,_visitHour_16,_visitHour_17,_visitHour_18,_visitHour_19,_visitHour_2,_visitHour_20,_visitHour_21,_visitHour_22,_visitHour_23,_visitHour_3,_visitHour_4,_visitHour_5,_visitHour_6,_visitHour_7,_visitHour_8,_visitHour_9
fullVisitorId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1
9427870688042783967,0.867244,-0.195738,-0.144457,-0.132253,-0.673633,0.11047,-0.089394,-0.276217,0.517198,-0.099283,-0.225012,-0.42751,-0.052703,-0.049148,-0.154935,-0.073584,1.167185,-0.186095,-0.079866,-0.14729,-0.329267,-0.524847,1.539985,-0.050886,0.988575,-0.136501,-0.148167,-0.12002,-0.117704,-0.101689,-0.233319,-0.101769,-0.102876,-0.099484,-0.107858,-0.094337,-0.095809,-0.089796,-0.112814,-0.090165,-0.080468,-0.155547,-0.154812,-0.178214,0.36295,-0.137164,-0.398836,-0.066082,-0.118201,-0.076029,-0.056122,-0.625763,-0.093718,-0.206601,-0.119056,-0.11793,-0.119973,-0.121661,-0.131656,-0.092324,0.05992,-0.185136,-0.184673,1.515668,-0.412507,-0.404626,-0.386268,-0.337749,-0.343465,-0.291173,-0.340356,-0.245193,-0.213275,-0.242774,-0.240158,-0.227882,-0.227195,-0.238344,-0.247076,-0.234336,0.312274,-0.17106,-0.202517,-0.200929,-0.203624,-0.209798,-0.213967,-0.231293,-0.242548,-0.249835,-0.249308,-0.255365,-0.202338,-0.251188,-0.233664,-0.237584,-0.229176,-0.201128,-0.205504,4.543914,-0.195913,-0.197456,-0.194028,-0.204512
9294875467542463120,-0.439296,-0.195738,-0.201055,-0.202585,-0.673633,0.11047,-0.089394,-0.276217,-0.582488,-0.099283,-0.225012,0.913306,-0.052703,-0.049148,-0.154935,-0.073584,1.167185,-0.186095,-0.079866,-0.14729,0.593493,-0.524847,-0.277658,-0.050886,-0.409636,-0.136501,-0.148167,-0.12002,-0.117704,-0.101689,-0.233319,-0.101769,-0.102876,-0.099484,5.620562,-0.094337,-0.095809,-0.089796,-0.112814,-0.090165,-0.080468,-0.155547,-0.154812,-0.178214,-0.380278,-0.137164,-0.398836,-0.066082,-0.118201,-0.076029,-0.056122,0.22158,-0.093718,-0.206601,6.629056,-0.11793,-0.119973,-0.121661,-0.131656,-0.092324,-0.758178,-0.185136,-0.184673,-0.403559,-0.412507,-0.404626,1.762051,-0.337749,-0.343465,-0.291173,-0.340356,-0.245193,-0.213275,-0.242774,-0.240158,-0.227882,-0.227195,-0.238344,-0.247076,2.068235,-0.589062,-0.17106,-0.202517,-0.200929,-0.203624,-0.209798,-0.213967,-0.231293,-0.242548,-0.249835,-0.249308,-0.255365,-0.202338,-0.251188,-0.233664,-0.237584,-0.229176,-0.201128,-0.205504,-0.198999,-0.195913,-0.197456,-0.194028,-0.204512
9456658598495244920,-0.439296,-0.195738,-0.257653,-0.272917,0.398372,0.11047,-0.089394,-0.276217,-0.582488,-0.099283,-0.225012,0.913306,-0.052703,-0.049148,-0.154935,-0.073584,1.167185,-0.186095,-0.079866,-0.14729,0.593493,-0.524847,-0.277658,-0.050886,-0.409636,-0.136501,-0.148167,-0.12002,-0.117704,-0.101689,-0.233319,-0.101769,-0.102876,-0.099484,-0.107858,-0.094337,-0.095809,-0.089796,-0.112814,-0.090165,-0.080468,-0.155547,-0.154812,-0.178214,-0.380278,-0.137164,1.284322,-0.066082,-0.118201,-0.076029,-0.056122,0.22158,-0.093718,1.854989,-0.119056,-0.11793,-0.119973,-0.121661,-0.131656,-0.092324,-0.758178,-0.185136,-0.184673,-0.403559,1.625716,-0.404626,-0.386268,-0.337749,-0.343465,-0.291173,-0.340356,-0.245193,-0.213275,-0.242774,-0.240158,-0.227882,-0.227195,-0.238344,2.026229,-0.234336,-0.589062,-0.17106,-0.202517,-0.200929,-0.203624,-0.209798,-0.213967,-0.231293,-0.242548,-0.249835,-0.249308,-0.255365,-0.202338,-0.251188,-0.233664,-0.237584,-0.229176,-0.201128,-0.205504,4.543914,-0.195913,-0.197456,-0.194028,-0.204512
6891084756410794026,-0.439296,0.542776,0.025336,0.149074,0.398372,0.11047,-0.089394,-0.276217,1.616884,-0.099283,-0.225012,-0.42751,-0.052703,-0.049148,-0.154935,5.599213,-0.401913,-0.186095,-0.079866,-0.14729,-0.329267,1.633875,-0.277658,-0.050886,-0.409636,-0.136501,-0.148167,-0.12002,-0.117704,-0.101689,-0.233319,-0.101769,-0.102876,-0.099484,-0.107858,-0.094337,-0.095809,-0.089796,-0.112814,-0.090165,-0.080468,-0.155547,-0.154812,-0.178214,1.106178,-0.137164,-0.398836,-0.066082,-0.118201,-0.076029,-0.056122,-0.625763,-0.093718,-0.206601,-0.119056,-0.11793,-0.119973,-0.121661,-0.131656,-0.092324,0.878017,0.599956,0.60031,-0.403559,-0.412507,3.697897,-0.386268,-0.337749,-0.343465,-0.291173,-0.340356,-0.245193,-0.213275,-0.242774,-0.240158,-0.227882,-0.227195,-0.238344,4.299533,-0.234336,-0.589062,-0.17106,-0.202517,-0.200929,-0.203624,-0.209798,4.042264,3.757493,-0.242548,-0.249835,-0.249308,-0.255365,-0.202338,-0.251188,-0.233664,-0.237584,-0.229176,-0.201128,-0.205504,-0.198999,-0.195913,-0.197456,-0.194028,-0.204512
2387066031914314449,-0.439296,-0.195738,-0.257653,-0.272917,0.398372,-7.136843,-0.089394,-0.276217,0.517198,-0.099283,-0.225012,-0.42751,-0.052703,-0.049148,2.827051,-0.073584,-0.401913,-0.186095,-0.079866,-0.14729,-0.329267,0.554514,-0.277658,-0.050886,-0.409636,-0.136501,-0.148167,-0.12002,-0.117704,-0.101689,-0.233319,-0.101769,-0.102876,-0.099484,-0.107858,-0.094337,-0.095809,-0.089796,-0.112814,-0.090165,-0.080468,-0.155547,-0.154812,-0.178214,-0.380278,-0.137164,1.284322,-0.066082,-0.118201,-0.076029,-0.056122,-0.625763,-0.093718,-0.206601,-0.119056,-0.11793,-0.119973,-0.121661,-0.131656,-0.092324,0.05992,-0.185136,-0.184673,-0.403559,-0.412507,-0.404626,-0.386268,-0.337749,2.08434,-0.291173,-0.340356,2.019723,-0.213275,-0.242774,-0.240158,-0.227882,-0.227195,-0.238344,-0.247076,-0.234336,-0.589062,-0.17106,-0.202517,-0.200929,-0.203624,-0.209798,-0.213967,-0.231293,3.580548,-0.249835,-0.249308,-0.255365,-0.202338,-0.251188,-0.233664,-0.237584,-0.229176,-0.201128,-0.205504,-0.198999,-0.195913,-0.197456,-0.194028,-0.204512


In [27]:
# let's train a model
from statsmodels.api import OLS

X_train["constant"] = 1
X_test["constant"] = 1
model = OLS(y_train, X_train)

In [28]:
model_fit = model.fit()

In [29]:
model_fit.summary()

0,1,2,3
Dep. Variable:,totals.transactionRevenue,R-squared:,0.42
Model:,OLS,Adj. R-squared:,0.419
Method:,Least Squares,F-statistic:,490.9
Date:,"Fri, 18 Dec 2020",Prob (F-statistic):,0.0
Time:,16:38:33,Log-Likelihood:,48513.0
No. Observations:,70000,AIC:,-96820.0
Df Residuals:,69896,BIC:,-95870.0
Df Model:,103,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
device.isMobile,0.0212,0.017,1.233,0.217,-0.012,0.055
totals.visits,6.262e+10,4.72e+10,1.327,0.184,-2.99e+10,1.55e+11
totals.hits,0.0121,0.003,3.785,0.000,0.006,0.018
totals.pageviews,0.0810,0.003,23.897,0.000,0.074,0.088
totals.bounces,0.0078,0.001,9.820,0.000,0.006,0.009
totals.newVisits,0.0020,0.000,4.316,0.000,0.001,0.003
trafficSource.adwordsClickInfo.page,0.0030,0.007,0.407,0.684,-0.012,0.018
channelGrouping_Direct,-3.36e+10,2.53e+10,-1.327,0.184,-8.32e+10,1.6e+10
channelGrouping_Organic Search,-4.205e+10,3.17e+10,-1.327,0.184,-1.04e+11,2.01e+10

0,1,2,3
Omnibus:,110395.44,Durbin-Watson:,1.997
Prob(Omnibus):,0.0,Jarque-Bera (JB):,511234205.812
Skew:,9.354,Prob(JB):,0.0
Kurtosis:,421.247,Cond. No.,538000000000000.0


In [30]:
# remove strongly correlated variables
corr = X.corr()
high_corr = corr > 0.95
high_corr_list = [(i,j) for i in range(corr.shape[0]) for j in range(corr.shape[0]) if i != j and high_corr.iloc[i,j]]
high_corr_list

[(1, 61),
 (1, 62),
 (2, 3),
 (3, 2),
 (47, 50),
 (50, 47),
 (61, 1),
 (61, 62),
 (62, 1),
 (62, 61)]

In [31]:
no_keep = []
unique_couples = []
for couple in high_corr_list :
    if (couple[1],couple[0]) not in unique_couples:
        unique_couples.append(couple)
        no_keep.append(couple[1])

X_train = X_train.drop(X_train.columns[no_keep], axis=1)
X_test = X_test.drop(X_test.columns[no_keep], axis=1)

In [32]:
print(X.columns[no_keep])

Index(['trafficSource.adwordsClickInfo.slot_others',
       'trafficSource.adwordsClickInfo.adNetworkType_others',
       'totals.pageviews', 'trafficSource.keyword_6qEhsCssdK0z36ri',
       'trafficSource.adwordsClickInfo.adNetworkType_others'],
      dtype='object')


In [33]:
model = OLS(y_train, X_train)
model_fit = model.fit()
model_fit.summary()

0,1,2,3
Dep. Variable:,totals.transactionRevenue,R-squared:,0.415
Model:,OLS,Adj. R-squared:,0.414
Method:,Least Squares,F-statistic:,500.6
Date:,"Fri, 18 Dec 2020",Prob (F-statistic):,0.0
Time:,16:38:55,Log-Likelihood:,48218.0
No. Observations:,70000,AIC:,-96240.0
Df Residuals:,69900,BIC:,-95320.0
Df Model:,99,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
device.isMobile,0.0187,0.017,1.086,0.277,-0.015,0.053
totals.visits,1.881e+09,3.31e+10,0.057,0.955,-6.31e+10,6.68e+10
totals.hits,0.0871,0.001,141.988,0.000,0.086,0.088
totals.bounces,0.0054,0.001,7.078,0.000,0.004,0.007
totals.newVisits,0.0022,0.000,4.702,0.000,0.001,0.003
trafficSource.adwordsClickInfo.page,0.0220,0.004,5.184,0.000,0.014,0.030
channelGrouping_Direct,-1.01e+09,1.78e+10,-0.057,0.955,-3.59e+10,3.38e+10
channelGrouping_Organic Search,-1.263e+09,2.22e+10,-0.057,0.955,-4.49e+10,4.23e+10
channelGrouping_Paid Search,-4.962e+08,8.74e+09,-0.057,0.955,-1.76e+10,1.66e+10

0,1,2,3
Omnibus:,111340.388,Durbin-Watson:,1.998
Prob(Omnibus):,0.0,Jarque-Bera (JB):,472577507.49
Skew:,9.567,Prob(JB):,0.0
Kurtosis:,405.07,Cond. No.,338000000000000.0


In [38]:
# Now let's try lasso in order to remove bad predictors
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

params = {'alpha' : [10**(-a) for a in range(10)]}
lasso = Lasso()
grid = GridSearchCV(lasso,param_grid=params, cv = 10, verbose=1)

grid.fit(X_train,y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

Objective did not converge. You might want to increase the number of iterations. Duality gap: 101.03862535795793, tolerance: 0.15473778397559532


Objective did not converge. You might want to increase the number of iterations. Duality gap: 75.8965298476287, tolerance: 0.1640723187765464


Objective did not converge. You might want to increase the number of iterations. Duality gap: 65.49735244335977, tolerance: 0.16499283553720895


Objective did not converge. You might want to increase the number of iterations. Duality gap: 17.84338358283685, tolerance: 0.166130589789124


Objective did not converge. You might want to increase the number of iterations. Duality gap: 16.547769679970997, tolerance: 0.15925528758754312


Objective did not converge. You might want to increase the number of iterations. Duality gap: 41.39297007682592, tolerance: 0.16104407575209753


Objective did not converge. You might want t


Objective did not converge. You might want to increase the number of iterations. Duality gap: 473.15277707684425, tolerance: 0.16499283553720895


Objective did not converge. You might want to increase the number of iterations. Duality gap: 473.3925832013896, tolerance: 0.166130589789124


Objective did not converge. You might want to increase the number of iterations. Duality gap: 467.76812761675274, tolerance: 0.15925528758754312


Objective did not converge. You might want to increase the number of iterations. Duality gap: 463.7354200469688, tolerance: 0.16104407575209753


Objective did not converge. You might want to increase the number of iterations. Duality gap: 449.63057064549713, tolerance: 0.15416720519158922


Objective did not converge. You might want to increase the number of iterations. Duality gap: 439.1303843395132, tolerance: 0.14318111161870756


Objective did not converge. You might want to increase the number of iterations. Duality gap: 454.88287023748626, toleranc

GridSearchCV(cv=10, estimator=Lasso(),
             param_grid={'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 1e-05, 1e-06,
                                   1e-07, 1e-08, 1e-09]},
             verbose=1)

In [39]:
print(grid.best_params_)

{'alpha': 0.001}


In [40]:
best_model = grid.best_estimator_
print("Score on the train set :", best_model.score(X_train,y_train))
print("Score on the test set :", best_model.score(X_test,y_test))

Score on the train set : 0.4106467480884891
Score on the test set : 0.17259610335638587


In [41]:
print("columns that have been removed with lasso : ", X_train.columns[best_model.coef_==0])

columns that have been removed with lasso :  Index(['totals.visits', 'totals.bounces',
       'trafficSource.adwordsClickInfo.page', 'channelGrouping_Organic Search',
       'device.browser_Firefox', 'device.browser_others',
       'device.operatingSystem_Macintosh', 'device.operatingSystem_Windows',
       'device.operatingSystem_iOS', 'device.deviceCategory_mobile',
       'geoNetwork.country_Brazil', 'geoNetwork.country_Germany',
       'geoNetwork.country_India', 'geoNetwork.country_Italy',
       'geoNetwork.country_Mexico', 'geoNetwork.country_Netherlands',
       'geoNetwork.country_Philippines', 'geoNetwork.country_Poland',
       'geoNetwork.country_Russia', 'geoNetwork.country_Turkey',
       'trafficSource.campaign_AW - Dynamic Search Ads Whole Site',
       'trafficSource.keyword_others', 'trafficSource.referralPath_/yt/about/',
       'trafficSource.referralPath_/yt/about/th/', '_weekday_1', '_weekday_2',
       '_weekday_3', '_weekday_4', '_month_12', '_month_3', '_month_

In [42]:
print("columns that have been kept with lasso : ", X_train.columns[best_model.coef_!=0])

columns that have been kept with lasso :  Index(['device.isMobile', 'totals.hits', 'totals.newVisits',
       'channelGrouping_Direct', 'channelGrouping_Paid Search',
       'channelGrouping_Referral', 'channelGrouping_Social',
       'channelGrouping_others', 'device.browser_Edge',
       'device.browser_Internet Explorer', 'device.browser_Safari',
       'device.operatingSystem_Chrome OS', 'device.operatingSystem_Linux',
       'device.operatingSystem_others', 'device.deviceCategory_tablet',
       'geoNetwork.country_Canada', 'geoNetwork.country_France',
       'geoNetwork.country_Indonesia', 'geoNetwork.country_Japan',
       'geoNetwork.country_Spain', 'geoNetwork.country_Taiwan',
       'geoNetwork.country_Thailand', 'geoNetwork.country_United Kingdom',
       'geoNetwork.country_United States', 'geoNetwork.country_Vietnam',
       'geoNetwork.country_others', 'trafficSource.campaign_Data Share Promo',
       'trafficSource.campaign_others',
       'trafficSource.referralPath_/an

In [43]:
# try with ridge
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

params = {'alpha':np.arange(0,1000,10)} # determine the range of parameters to try
ridge = Ridge() # create an instance of the model

grid = GridSearchCV(ridge, params, cv=10, verbose = 1)
grid_fit = grid.fit(X_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:   39.2s finished


In [44]:
print(grid.best_params_)

{'alpha': 990}


In [45]:
best_model = grid.best_estimator_
print("Score on the train set :", best_model.score(X_train,y_train))
print("Score on the test set :", best_model.score(X_test,y_test))

Score on the train set : 0.41456803686697974
Score on the test set : 0.16324611210817042
