# Google Store revenue prediction

In this notebook we will be exploring the Google Store dataset from Kaggle, and attempt to predict a customer's total revenue. First, some setup:

In [1]:
#imports
import pandas as pd
import numpy as np
import os
import json
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import lightgbm as lgb
import seaborn as sns
import datetime
import squarify
import plotly.offline as py
py.init_notebook_mode(connected=True)

color = sns.color_palette()

%matplotlib inline

from sklearn.model_selection import train_test_split
from scipy.stats import kurtosis, skew # it's to explore some statistics of numerical values
from plotly import tools
from pandas import json_normalize
from sklearn import model_selection, preprocessing, metrics

# Importing librarys to use on interactive graphs
from plotly.offline import init_notebook_mode, iplot, plot 
import plotly.graph_objs as go 

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999


def load_df(csv_path='data/train.csv', nrows=100000):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df

Next, we need to load the data:

In [2]:
%%time
train = load_df()
test = load_df("data/test.csv")
pd.set_option('display.max_columns', None)

Loaded train.csv. Shape: (100000, 55)
Loaded test.csv. Shape: (100000, 53)
Wall time: 33.3 s


seeing if our data is missing any values

In [3]:
def missing_values(data):
    total = data.isnull().sum().sort_values(ascending = False) # getting the sum of null values and ordering
    percent = (data.isnull().sum() / data.isnull().count() * 100 ).sort_values(ascending = False) #getting the percent and order of null
    df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']) # Concatenating the total and percent
    print("Total columns at least one Values: ")
    print (df[~(df['Total'] == 0)]) # Returning values of nulls different of 0
    
    print("\n Total of Sales % of Total: ", round((train[train['totals.transactionRevenue'] != np.nan]['totals.transactionRevenue'].count() / len(train['totals.transactionRevenue']) * 100),4))
    
    return 

In [4]:
missing_values(train)

Total columns at least one Values: 
                                              Total  Percent
trafficSource.campaignCode                    99999   99.999
trafficSource.adContent                       98675   98.675
totals.transactionRevenue                     98601   98.601
trafficSource.adwordsClickInfo.isVideoAd      97426   97.426
trafficSource.adwordsClickInfo.adNetworkType  97426   97.426
trafficSource.adwordsClickInfo.slot           97426   97.426
trafficSource.adwordsClickInfo.page           97426   97.426
trafficSource.adwordsClickInfo.gclId          97375   97.375
trafficSource.isTrueDirect                    69546   69.546
trafficSource.referralPath                    63527   63.527
trafficSource.keyword                         55782   55.782
totals.bounces                                51084   51.084
totals.newVisits                              22737   22.737
totals.pageviews                                  7    0.007

 Total of Sales % of Total:  1.399


Our target have just 1.3% of non-null values
8 columns with 97%+ of missing values
4 columns with 50%+ of missing values
1 column with 22.19%
1 column with 0.011%

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 55 columns):
 #   Column                                             Non-Null Count   Dtype 
---  ------                                             --------------   ----- 
 0   channelGrouping                                    100000 non-null  object
 1   date                                               100000 non-null  int64 
 2   fullVisitorId                                      100000 non-null  object
 3   sessionId                                          100000 non-null  object
 4   socialEngagementType                               100000 non-null  object
 5   visitId                                            100000 non-null  int64 
 6   visitNumber                                        100000 non-null  int64 
 7   visitStartTime                                     100000 non-null  int64 
 8   device.browser                                     100000 non-null  object
 9   devic

Lets see how many unique visitors we have in train and test set, and the number of common visitors

In [6]:
print("Number of unique visitors in train set : ",train.fullVisitorId.nunique(), " out of rows : ",train.shape[0])
print("Number of unique visitors in test set : ",test.fullVisitorId.nunique(), " out of rows : ",test.shape[0])
print("Number of common visitors in train and test set : ",len(set(train.fullVisitorId.unique()).intersection(set(test.fullVisitorId.unique())) ))

Number of unique visitors in train set :  89213  out of rows :  100000
Number of unique visitors in test set :  88041  out of rows :  100000
Number of common visitors in train and test set :  349


TODO: Maybe remove/replace

Lets look at variable names found in train dataset but not in test dataset

In [7]:
print("Variables not in test but in train : ", set(train.columns).difference(set(test.columns)))

Variables not in test but in train :  {'totals.transactionRevenue', 'trafficSource.campaignCode'}


We see that apart from our target "totals.transactionRevenue" there is the variable "trafficSource.campaignCode" not present in test dataset. This needs to be removed. We also drop the constant variables we got earlier. "sessionId" will also be removed seeing as it is a unique identifier of the visit.

In [8]:
cols_to_drop = const_cols + ['sessionId']

train_df = train.drop(cols_to_drop + ["trafficSource.campaignCode"], axis=1)
test_df = test.drop(cols_to_drop, axis=1)

NameError: name 'const_cols' is not defined

In [9]:
train_df.shape

NameError: name 'train_df' is not defined

In [10]:
train_df.nunique()

NameError: name 'train_df' is not defined

#### Find kurtosis and Skewness of Transaction Revenue

In [11]:
pd.DataFrame(train.skew(),columns=["totals.transactionRevenue"])

Unnamed: 0,totals.transactionRevenue
date,0.165344
fullVisitorId,0.173518
sessionId,0.173518
visitId,0.258577
visitNumber,19.161959
visitStartTime,0.258577
device.isMobile,1.102698
totals.visits,0.0
totals.hits,9.970617
totals.pageviews,9.056392


In [12]:
pd.DataFrame(train.kurt(),columns=["totals.transactionRevenue"])

Unnamed: 0,totals.transactionRevenue
date,-1.967244
fullVisitorId,-0.5099
sessionId,-0.5099
visitId,-1.177327
visitNumber,464.222545
visitStartTime,-1.177327
device.isMobile,-0.784072
totals.visits,0.0
totals.hits,248.448182
totals.pageviews,229.658284


In [13]:
def scatter_plot(cnt_srs, color):
    trace = go.Scatter(
        x=cnt_srs.index[::-1],
        y=cnt_srs.values[::-1],
        showlegend=False,
        marker=dict(
            color=color,
        ),
    )
    return trace

train_df['date'] = train_df['date'].apply(lambda x: datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])))
cnt_srs = train_df.groupby('date')['totals.transactionRevenue'].agg(['size', 'count'])
cnt_srs.columns = ["count", "count of non-zero revenue"]
cnt_srs = cnt_srs.sort_index()
#cnt_srs.index = cnt_srs.index.astype('str')
trace1 = scatter_plot(cnt_srs["count"], 'red')
#trace2 = scatter_plot(cnt_srs["count of non-zero revenue"], 'blue')

fig = tools.make_subplots(rows=2, cols=1, vertical_spacing=0.08,
                          subplot_titles=["Date - Count", "Date - Non-zero Revenue count"])
fig.append_trace(trace1, 1, 1)
#fig.append_trace(trace2, 2, 1)
fig['layout'].update(height=400, width=800, paper_bgcolor='rgb(233,233,233)', title="Date Plots")
py.iplot(fig, filename='date-plots')

NameError: name 'train_df' is not defined

We have data from 1 Aug, 2016 to 31 July, 2017 in our training dataset
In Nov 2016, though there is an increase in the count of visitors, there is 
no increase in non-zero revenue counts during that time period(relative to the mean).

In [14]:
test_df['date'] = test_df['date'].apply(lambda x: datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])))
cnt_srs = test_df.groupby('date')['fullVisitorId'].size()


trace = scatter_plot(cnt_srs, 'red')

layout = go.Layout(
    height=400,
    width=800,
    paper_bgcolor='rgb(233,233,233)',
    title='Dates in Test set'
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="ActivationDate")

NameError: name 'test_df' is not defined

In the test set, we have dates from 2 Aug, 2017 to 30 Apr, 2018. So there are no common dates between train and test set. So it might be a good idea to do time based validation for this dataset.

### Device Browser

In [15]:
# the top 10 of browsers represent % of total
print("Percentual of Browser usage: ")
print(train_df['device.browser'].value_counts()[:7] ) # printing the top 7 percentage of browsers

# seting the graph size
plt.figure(figsize=(14,6))

# Let explore the browser used by users
sns.countplot(train_df[train_df['device.browser']\
                       .isin(train_df['device.browser']\
                             .value_counts()[:10].index.values)]['device.browser'], palette="hls") # It's a module to count the category's
plt.title("TOP 10 Most Frequent Browsers", fontsize=20) # Adding Title and seting the size
plt.xlabel("Browser Names", fontsize=16) # Adding x label and seting the size
plt.ylabel("Count", fontsize=16) # Adding y label and seting the size
plt.xticks(rotation=45) # Adjust the xticks, rotating the labels

plt.show() #use plt.show to render the graph that we did above

Percentual of Browser usage: 


NameError: name 'train_df' is not defined

In [16]:
# the top 5 of browsers represent % of total
print("Percentual of Operational System: ")
print(train_df['device.operatingSystem'].value_counts()[:5]) # printing the top 7 percentage of browsers

# seting the graph size
plt.figure(figsize=(14,7))

# let explore the browser used by users
sns.countplot(train_df["device.operatingSystem"], palette="hls") # It's a module to count the category's
plt.title("Operational System used Count", fontsize=20) # seting the title size
plt.xlabel("Operational System Name", fontsize=16) # seting the x label size
plt.ylabel("OS Count", fontsize=16) # seting the y label size
plt.xticks(rotation=45) # Adjust the xticks, rotating the labels

plt.show() #use plt.show to render the graph that we did above

Percentual of Operational System: 


NameError: name 'train_df' is not defined

### Geographic information

In [17]:
# the top 8 of browsers represent % of total
print("Description of SubContinent count: ")
print(train_df['geoNetwork.subContinent'].value_counts()[:8]) # printing the top 7 percentage of browsers

# seting the graph size
plt.figure(figsize=(16,7))

# let explore the browser used by users
sns.countplot(train_df[train_df['geoNetwork.subContinent']\
                       .isin(train_df['geoNetwork.subContinent']\
                             .value_counts()[:15].index.values)]['geoNetwork.subContinent'], palette="hls") # It's a module to count the category's
plt.title("TOP 15 most frequent SubContinents", fontsize=20) # seting the title size
plt.xlabel("subContinent Names", fontsize=18) # seting the x label size
plt.ylabel("SubContinent Count", fontsize=18) # seting the y label size
plt.xticks(rotation=45) # Adjust the xticks, rotating the labels

plt.show() #use plt.show to render the graph that we did above

Description of SubContinent count: 


NameError: name 'train_df' is not defined

In [18]:
country_tree = train_df["geoNetwork.country"].value_counts() #counting the values of Country

print("Description most frequent countrys: ")
print(country_tree[:15]) #printing the 15 top most 

country_tree = round((train_df["geoNetwork.country"].value_counts()[:30] \
                       / len(train_df['geoNetwork.country']) * 100),2)

plt.figure(figsize=(14,5))
g = squarify.plot(sizes=country_tree.values, label=country_tree.index, 
                  value=country_tree.values,
                  alpha=.4, color=color)
g.set_title("'TOP 30 Countrys - % size of total",fontsize=20)
g.set_axis_off()
plt.show()

NameError: name 'train_df' is not defined

### Paying Customers

In [19]:
payingCustomers = train_df.loc[train_df['totals.transactionRevenue'].notna()]
payingCustomers
payingCustomers["fullVisitorId"]

NameError: name 'train_df' is not defined

In [20]:
payingCustomers.describe()

NameError: name 'payingCustomers' is not defined

In [21]:
payingCustomers.info()

NameError: name 'payingCustomers' is not defined

## Transformers

#### Transformer for removing unwanted features

In [22]:
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class FeatureReducer(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features
    
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.drop(self.features, axis=1)

#### Transformers for labeling, converting and imputing the data

In [23]:
class Labeler(BaseEstimator, TransformerMixin):
    def __init__(self, cat_cols):
        self.cat_cols = cat_cols
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        for col in self.cat_cols:
            lbl = preprocessing.LabelEncoder()
            lbl.fit(list(X[col].values.astype('str')))
            X[col] = lbl.transform(list(X[col].values.astype('str')))
        return X

class Floatinator(BaseEstimator, TransformerMixin):
    def __init__(self, num_cols):
        self.num_cols = num_cols
    
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        for col in self.num_cols:
            X[col] = X[col].astype(float)
        return X
    
class SimplerImputer(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        for col in self.cols:
            X[col].fillna(0.0, inplace=True)
        return X


#### Splitting the data

## Collect all columns which need to be dropped

In [24]:
useless_cols = [col for col in train.columns 
                if train[col].isna().all() 
                or train[col].eq("not available in demo dataset").all()
                or train[col].nunique(dropna=False)==1]
useless_cols = useless_cols + ["trafficSource.campaignCode"] + ["sessionId"]
useless_cols


elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison



['socialEngagementType',
 'device.browserVersion',
 'device.browserSize',
 'device.operatingSystemVersion',
 'device.mobileDeviceBranding',
 'device.mobileDeviceModel',
 'device.mobileInputSelector',
 'device.mobileDeviceInfo',
 'device.mobileDeviceMarketingName',
 'device.flashVersion',
 'device.language',
 'device.screenColors',
 'device.screenResolution',
 'geoNetwork.cityId',
 'geoNetwork.latitude',
 'geoNetwork.longitude',
 'geoNetwork.networkLocation',
 'totals.visits',
 'trafficSource.adwordsClickInfo.criteriaParameters',
 'trafficSource.campaignCode',
 'sessionId']

#### Declare categorical columns

In [25]:
cat_cols = ["channelGrouping", "device.browser", 
            "device.deviceCategory", "device.operatingSystem", 
            "geoNetwork.city", "geoNetwork.continent", 
            "geoNetwork.country", "geoNetwork.metro",
            "geoNetwork.networkDomain", "geoNetwork.region", 
            "geoNetwork.subContinent", "trafficSource.adContent", 
            "trafficSource.adwordsClickInfo.adNetworkType", 
            "trafficSource.adwordsClickInfo.gclId", 
            "trafficSource.adwordsClickInfo.page", 
            "trafficSource.adwordsClickInfo.slot", "trafficSource.campaign",
            "trafficSource.keyword", "trafficSource.medium", 
            "trafficSource.referralPath", "trafficSource.source",
            'trafficSource.adwordsClickInfo.isVideoAd', 'trafficSource.isTrueDirect']

### Declare numerical columns

In [26]:
num_cols = ["totals.hits", "totals.pageviews", "visitNumber", "visitStartTime", 'totals.bounces',  'totals.newVisits']

### Imports

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [29]:
for col in categorical_cols:
    print(col)
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train_df[col].values.astype('str')) + list(test_df[col].values.astype('str')))
    train_df[col] = lbl.transform(list(train_df[col].values.astype('str')))
    test_df[col] = lbl.transform(list(test_df[col].values.astype('str')))

channelGrouping


NameError: name 'train_df' is not defined

In [30]:
"""label_encoding_pipeline = Pipeline([
    ("encode_labels", ColumnTransformer(
        ("label_transformer", LabelEncoder(), categorical_cols)
    ))
])"""

full_pipeline = Pipeline([
    ("impute_revenue", ColumnTransformer(
        ('revenue_imputer', SimpleImputer(strategy="constant", fill_value=0), "totals.transactionRevenue")
    )),
    ('reduce', FeatureReducer(useless_cols)),
    ("encode_labels", ColumnTransformer(
        ("label_transformer", LabelEncoder(), categorical_cols)
    ))
])

"""
train_pipeline = Pipeline([
    full_pipeline,
    ("impute_revenue", ColumnTransformer([
        ('revenue_imputer', SimpleImputer(strategy="constant", fill_value=0), "totals.transactionRevenue")
    ]))
])"""

'\ntrain_pipeline = Pipeline([\n    full_pipeline,\n    ("impute_revenue", ColumnTransformer([\n        (\'revenue_imputer\', SimpleImputer(strategy="constant", fill_value=0), "totals.transactionRevenue")\n    ]))\n])'

## Get data

In [31]:
train = load_df(nrows=100000, csv_path="data/train.csv")
test = load_df("data/test.csv", 100000)
pd.set_option('display.max_columns', None)

Loaded train.csv. Shape: (100000, 55)
Loaded test.csv. Shape: (100000, 53)


#### Declare categorical columns

In [40]:
cat_cols = ["channelGrouping", "device.browser", 
            "device.deviceCategory", "device.operatingSystem", 
            "geoNetwork.city", "geoNetwork.continent", 
            "geoNetwork.country", "geoNetwork.metro",
            "geoNetwork.networkDomain", "geoNetwork.region", 
            "geoNetwork.subContinent", "trafficSource.adContent", 
            "trafficSource.adwordsClickInfo.adNetworkType", 
            "trafficSource.adwordsClickInfo.gclId", 
            "trafficSource.adwordsClickInfo.page", 
            "trafficSource.adwordsClickInfo.slot", "trafficSource.campaign",
            "trafficSource.keyword", "trafficSource.medium", 
            "trafficSource.referralPath", "trafficSource.source",
            'trafficSource.adwordsClickInfo.isVideoAd', 'trafficSource.isTrueDirect']

### Declare numerical columns

In [41]:
num_cols = ["totals.transactionRevenue", "totals.hits", "totals.pageviews", "visitNumber", "visitStartTime", 'totals.bounces',  'totals.newVisits']

## Preprocess data for training

In [42]:
# Remove useless columns
train = FeatureReducer(useless_cols).transform(train)

KeyError: "['socialEngagementType' 'device.browserVersion' 'device.browserSize'\n 'device.operatingSystemVersion' 'device.mobileDeviceBranding'\n 'device.mobileDeviceModel' 'device.mobileInputSelector'\n 'device.mobileDeviceInfo' 'device.mobileDeviceMarketingName'\n 'device.flashVersion' 'device.language' 'device.screenColors'\n 'device.screenResolution' 'geoNetwork.cityId' 'geoNetwork.latitude'\n 'geoNetwork.longitude' 'geoNetwork.networkLocation' 'totals.visits'\n 'trafficSource.adwordsClickInfo.criteriaParameters'\n 'trafficSource.campaignCode' 'sessionId'] not found in axis"

In [43]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 34 columns):
 #   Column                                        Non-Null Count   Dtype 
---  ------                                        --------------   ----- 
 0   channelGrouping                               100000 non-null  object
 1   date                                          100000 non-null  int64 
 2   fullVisitorId                                 100000 non-null  object
 3   visitId                                       100000 non-null  int64 
 4   visitNumber                                   100000 non-null  int64 
 5   visitStartTime                                100000 non-null  int64 
 6   device.browser                                100000 non-null  object
 7   device.operatingSystem                        100000 non-null  object
 8   device.isMobile                               100000 non-null  bool  
 9   device.deviceCategory                         100000 non-nul

In [44]:
# Impute values
cols_to_impute = [
    "totals.transactionRevenue",
    "totals.pageviews",
    "totals.bounces",
    "totals.newVisits"
]
train = SimplerImputer(cols_to_impute).transform(train)

In [45]:
train = Labeler(cat_cols).transform(train)

In [46]:
train = Floatinator(num_cols).transform(train)

In [47]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 34 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   channelGrouping                               100000 non-null  int64  
 1   date                                          100000 non-null  int64  
 2   fullVisitorId                                 100000 non-null  object 
 3   visitId                                       100000 non-null  int64  
 4   visitNumber                                   100000 non-null  float64
 5   visitStartTime                                100000 non-null  float64
 6   device.browser                                100000 non-null  int64  
 7   device.operatingSystem                        100000 non-null  int64  
 8   device.isMobile                               100000 non-null  bool   
 9   device.deviceCategory                         100

In [48]:
train.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,visitId,visitNumber,visitStartTime,device.browser,device.operatingSystem,device.isMobile,device.deviceCategory,geoNetwork.continent,geoNetwork.subContinent,geoNetwork.country,geoNetwork.region,geoNetwork.metro,geoNetwork.city,geoNetwork.networkDomain,totals.hits,totals.pageviews,totals.bounces,totals.newVisits,totals.transactionRevenue,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.keyword,trafficSource.isTrueDirect,trafficSource.referralPath,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adContent
0,4,20160902,1131660440785968503,1472830385,1.0,1472830000.0,5,12,False,0,3,21,168,93,0,118,6360,1.0,1.0,1.0,1.0,0.0,0,49,5,5,1,527,4,2,2389,2,1,21
1,4,20160902,377306020877927890,1472880147,1.0,1472880000.0,8,7,False,0,5,1,9,217,52,289,1762,1.0,1.0,1.0,1.0,0.0,0,49,5,5,1,527,4,2,2389,2,1,21
2,4,20160902,3895546263509774583,1472865386,1.0,1472865000.0,5,12,False,0,4,19,151,49,0,145,6597,1.0,1.0,1.0,1.0,0.0,0,49,5,5,1,527,4,2,2389,2,1,21
3,4,20160902,4763447161404445595,1472881213,1.0,1472881000.0,26,6,False,0,3,16,76,217,52,289,6597,1.0,1.0,1.0,1.0,0.0,0,49,5,203,1,527,4,2,2389,2,1,21
4,4,20160902,27294437909732085,1472822600,2.0,1472823000.0,5,1,True,1,4,13,174,217,52,289,6597,1.0,1.0,1.0,0.0,0.0,0,49,5,5,0,527,4,2,2389,2,1,21


## Split data

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train, 
    train["totals.transactionRevenue"], test_size=0.4, random_state=42)

In [50]:
y_train.head()

40507    0.0
72707    0.0
90912    0.0
28532    0.0
13006    0.0
Name: totals.transactionRevenue, dtype: float64

In [51]:
y_test.describe()

count    4.000000e+04
mean     1.813006e+06
std      3.129121e+07
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      2.365500e+09
Name: totals.transactionRevenue, dtype: float64

In [57]:
from sklearn.ensemble import GradientBoostingRegressor

#X_train = X_train.drop('totals.transactionRevenue', axis=1)
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_train, y_train)
gbr.score(X_test.drop('totals.transactionRevenue', axis=1), y_test)

-0.7378455520438922

## Storing the trained model

In [61]:
from joblib import dump

dump(gbr, "estimator.joblib")

['estimator.joblib']

## Deploying the model
At this point, we set up our server and move the joblib file to the server folder before deploying the server. The next step is to make sure our server actually works. To do this, we simply send some test data and see if the response is as expected.

In [60]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 75721 to 40093
Data columns (total 34 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   channelGrouping                               40000 non-null  int64  
 1   date                                          40000 non-null  int64  
 2   fullVisitorId                                 40000 non-null  object 
 3   visitId                                       40000 non-null  int64  
 4   visitNumber                                   40000 non-null  float64
 5   visitStartTime                                40000 non-null  float64
 6   device.browser                                40000 non-null  int64  
 7   device.operatingSystem                        40000 non-null  int64  
 8   device.isMobile                               40000 non-null  bool   
 9   device.deviceCategory                         40000 non-n

In [58]:
import requests
import json
 
json_data = X_test.head().to_json()
headers = {'Content-type': 'application/json'}
url = "https://hvl-dat158-ml2.herokuapp.com/"

req = requests.post(url, data=json_data, headers=headers)
preds = req.json()
preds[preds < 0] = 0
preds_df = pd.DataFrame(preds)
preds_df

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [None]:
from sklearn.metrics import mean_squared_error as mse
mse(y_test,pred)

## Deployment

Suggestion for deployment: Write a web API which can receive data like one or more rows from the test dataset and return a prediction for that data. We are thinking the client will send the data as JSON and receive a JSON response.

In order to make this happen, we will need to have a way to transform our data to JSON format, as well as a way to transform it back to a dataframe. 

In [None]:
train.select_dtypes(exclude=["number","bool_","object_"])

In [None]:
train.select_dtypes(np.number).head()

In [None]:
train.select_dtypes(exclude=["number","bool_","object_"])