# Google Analytics Customer Revenue Prediction

This notebook documents the steps taken to build a model that predicts the natural log of the transaction revenue of GStore customers. The full description is available at https://www.kaggle.com/c/ga-customer-revenue-prediction.

First of all, we import the libraries we need.

In [None]:
import os
import json
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing
from sklearn import neighbors
from pandas import json_normalize  # pandas >= 1.0; formerly pandas.io.json.json_normalize
%matplotlib notebook
import matplotlib.pyplot as plot

Then we need to load the files that contain our data. Note that some columns of the dataset are in JSON format. Each of these columns packs several pieces of information that can't be used directly. To make them available to our machine learning algorithm, we append to the dataset one additional column for each field contained in each JSON column. The function used here is taken from a Kaggle kernel by Julian Peller (https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields/notebook).

In [None]:
def load_df(csv_path='../input/train_v2.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
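
To see what the flattening does, here is a minimal sketch on a made-up two-row frame. The column values are invented for illustration, and `pd.json_normalize` assumes pandas 1.0 or later; the real function above does the same thing for all four JSON columns.

```python
import json
import pandas as pd

# Toy stand-in for the raw CSV: one column holds JSON strings (made-up values).
raw = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = raw["device"].apply(json.loads)
device_df = pd.json_normalize(parsed.tolist())
device_df.columns = [f"device.{sub}" for sub in device_df.columns]

# Replace the JSON column with the expanded columns, aligned on the row index.
flat = raw.drop(columns="device").join(device_df)
print(flat)
```

Each key of the JSON object becomes its own column, prefixed with the name of the column it came from.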

Now we are ready to retrieve the needed information in a proper way.

In [None]:
ga_customers_train = load_df(nrows=200000)
ga_customers_test = load_df(csv_path='../input/test_v2.csv')

Let's have a look at our data: what do they look like?

In [None]:
ga_customers_train.head(5)

Since we don't need the information in the 'customDimensions' and 'hits' columns (I don't even know what they represent), we can drop them from the dataframe. The train set also contains some fields with constant values, which give us no additional information. To save time and memory, it is better to drop them from the train dataframe, and from the test one too: the model can't learn anything from those features, so they are useless there as well.

In [None]:
ga_customers_train = ga_customers_train.drop(columns=['customDimensions', 'hits'])
ga_customers_test = ga_customers_test.drop(columns=['customDimensions', 'hits'])
const_cols = [c for c in ga_customers_train.columns if ga_customers_train[c].nunique(dropna=False)==1 ]
ga_customers_train = ga_customers_train.drop(columns=const_cols)
ga_customers_test = ga_customers_test.drop(columns=const_cols)
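
The constant-column check can be tried on a small made-up frame (all names and values below are invented for illustration). Note the effect of `dropna=False`: a column that is entirely missing counts as constant, while a column mixing one value with `NaN` does not.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome"],
    "language": ["not available"] * 3,      # constant string
    "adContent": [np.nan, np.nan, np.nan],  # constant: every value is missing
    "bounces": [1, np.nan, 1],              # NaN counts as a second distinct value
})

# Same test as in the notebook: exactly one distinct value, NaN included.
const_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
reduced = toy.drop(columns=const_cols)

print(const_cols)
print(list(reduced.columns))
```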

Since we are predicting the natural log of the sum of all transactions per user, let's sum up all transactions per user and then compute their natural log.

In [None]:
ga_customers_train['totals.transactionRevenue'] = ga_customers_train['totals.transactionRevenue'].astype('float')
total_transactions_per_user = ga_customers_train.groupby('fullVisitorId')['totals.transactionRevenue'].sum().reset_index()
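
As a small worked example of this aggregation, here is a sketch on made-up sessions (the `revenue` column and its values are invented). It also shows why `np.log1p` is convenient: users without any purchase sum to 0, and `log1p(0)` is 0 instead of minus infinity.

```python
import numpy as np
import pandas as pd

sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "revenue": [10.0, 5.0, np.nan, 20.0],  # NaN means no purchase in that session
})

# Sum per user: NaN values are skipped, so a user with no purchase gets 0.0.
per_user = sessions.groupby("fullVisitorId")["revenue"].sum().reset_index()

# log1p(x) = ln(1 + x), which is well defined for x = 0.
per_user["logRevenue"] = np.log1p(per_user["revenue"])
print(per_user)
```

The log is applied after summing, so user "a" gets ln(1 + 15), not ln(1 + 10) + ln(1 + 5).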

Now let's plot the result.

In [None]:
plot.figure(figsize=(8,6))
plot.scatter(range(total_transactions_per_user.shape[0]), np.sort(np.log1p(total_transactions_per_user["totals.transactionRevenue"].values)))
plot.xlabel('index', fontsize=12)
plot.ylabel('log1p(Transaction Revenue)', fontsize=12)
plot.show()

It is really interesting to see that just a small part of the customers of the online shop has bought something. This agrees with the part of the problem description that says:
> The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

In fact, the percentage of real customers in this case is even smaller:

In [None]:
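# Count users whose summed revenue is positive and divide by the number of users
# (DataFrame.size would count every cell, not every row)
real_customers = (total_transactions_per_user["totals.transactionRevenue"] > 0).sum()
print("Percentage of real customers (people who bought something):")
print(real_customers / total_transactions_per_user.shape[0] * 100)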

Now that we have a properly formatted dataframe and an idea of how the data are distributed, let's run some experiments to predict our result. We have to build a model able to predict the transaction revenue of a user's online-store session. For example, it would be great if we could get a decision tree to work with these data.

There is just one problem: scikit-learn's decision trees don't accept categorical (string) data. So we use sklearn's `LabelEncoder` to encode our categorical features. Note that encoding is useful only for categorical features; applied to numerical ones it would only distort the data and destroy their meaning. So we have to check which fields are strings and which are already numerical, in order to encode only the columns containing strings.
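
As a quick illustration of what `LabelEncoder` does, here is a minimal sketch on a made-up list of browser names. Each distinct string is mapped to an integer, following the sorted order of the values.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

browsers = np.array(["Chrome", "Safari", "Chrome", "Firefox"])

encoder = LabelEncoder()
codes = encoder.fit_transform(browsers)

print(encoder.classes_)  # distinct values, sorted
print(codes)             # position of each value in classes_

# The mapping can be reversed to get the original strings back.
restored = encoder.inverse_transform(codes)
```

The integer codes carry an arbitrary (alphabetical) order that has nothing to do with the data, which is one more reason not to apply this to features that are already numerical.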

In [None]:
ga_customers_train.dtypes

Here we can see that a lot of features are marked as objects, even though we know that some of them are numbers and others are strings. To fix this we have to cast those fields manually.

Before casting these features, however, it is better to merge the train and test dataframes, so that both share the same feature set and the same feature encoding. We can then split them back into a training set and a test set by taking, respectively, the first `ga_customers_train.shape[0]` rows and the remaining ones.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
mixed_dataset = pd.concat([ga_customers_train, ga_customers_test], sort=False)
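
To make the reason for merging concrete, here is a minimal sketch on two made-up frames. An encoder fitted on the training rows alone fails as soon as the test set contains a value it has never seen, while an encoder fitted on the merged frame gives one consistent mapping that can be split back by position.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"browser": ["Chrome", "Safari"]})
test = pd.DataFrame({"browser": ["Opera", "Chrome", "Safari"]})

# Fitting on the training rows alone: "Opera" is unknown at transform time.
train_only = LabelEncoder().fit(train["browser"])
try:
    train_only.transform(test["browser"])
    unseen_error = False
except ValueError:
    unseen_error = True

# Fitting on the merged frame covers every value of both parts.
merged = pd.concat([train, test], ignore_index=True, sort=False)
merged["browser"] = LabelEncoder().fit_transform(merged["browser"])

# Split back by position: the first len(train) rows are the training part.
n_train = len(train)
train_part = merged.iloc[:n_train]
test_part = merged.iloc[n_train:]
print(unseen_error, merged["browser"].tolist())
```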

The snippet of code below casts every feature to a proper type.

**NB** Some features can be simply cast to a numerical value or to a string. Others need to be cast to a unicode type of string, in order to handle some specific records containing unicode characters.
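
The difference between the two string types can be sketched with plain NumPy on two made-up city names. This assumes the pandas cast ends up in the same NumPy conversion, which I have not verified for every pandas version. A fixed-width byte string such as `'|S2048'` is filled by encoding each value as ASCII, so a value with an accented character can't be stored in it.

```python
import numpy as np

cities = np.array(["Paris", "São Paulo"], dtype=object)

# Casting to a byte-string dtype encodes each value as ASCII,
# which fails on the accented character.
try:
    cities.astype("S32")
    bytes_cast_failed = False
except UnicodeEncodeError:
    bytes_cast_failed = True

# A unicode string dtype keeps every character.
as_unicode = cities.astype("U32")
print(bytes_cast_failed, as_unicode)
```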

In [None]:
mixed_dataset['channelGrouping'] = mixed_dataset['channelGrouping'].fillna('none').astype('category')
# We don't train our model on ID features, so we don't need to handle their types
mixed_dataset['device.browser'] = mixed_dataset['device.browser'].fillna('none').astype('category')
mixed_dataset['device.deviceCategory'] = mixed_dataset['device.deviceCategory'].fillna('none').astype('category')
mixed_dataset['device.operatingSystem'] = mixed_dataset['device.operatingSystem'].fillna('none').astype('category')
mixed_dataset['geoNetwork.city'] = mixed_dataset['geoNetwork.city'].fillna('none').astype('|S2048')
mixed_dataset['geoNetwork.country'] = mixed_dataset['geoNetwork.country'].fillna('none').astype('unicode')
mixed_dataset['geoNetwork.continent'] = mixed_dataset['geoNetwork.continent'].fillna('none').astype('|S2048')
mixed_dataset['geoNetwork.metro'] = mixed_dataset['geoNetwork.metro'].fillna('none').astype('|S2048')
mixed_dataset['geoNetwork.networkDomain'] = mixed_dataset['geoNetwork.networkDomain'].fillna('none').astype('|S2048')
mixed_dataset['geoNetwork.region'] = mixed_dataset['geoNetwork.region'].fillna('none').astype('|S2048')
mixed_dataset['geoNetwork.subContinent'] = mixed_dataset['geoNetwork.subContinent'].fillna('none').astype('|S2048')
mixed_dataset['totals.bounces'] = pd.to_numeric(mixed_dataset['totals.bounces'].fillna(0))
mixed_dataset['totals.hits'] = pd.to_numeric(mixed_dataset['totals.hits'].fillna(0))
mixed_dataset['totals.newVisits'] = pd.to_numeric(mixed_dataset['totals.newVisits'].fillna(0))
mixed_dataset['totals.pageviews'] = pd.to_numeric(mixed_dataset['totals.pageviews'].fillna(0))
mixed_dataset['totals.sessionQualityDim'] = pd.to_numeric(mixed_dataset['totals.sessionQualityDim'].fillna(0))
mixed_dataset['totals.timeOnSite'] = pd.to_numeric(mixed_dataset['totals.timeOnSite'].fillna(0))
mixed_dataset['totals.transactionRevenue'] = pd.to_numeric(mixed_dataset['totals.transactionRevenue'].fillna(0))
mixed_dataset['totals.totalTransactionRevenue'] = pd.to_numeric(mixed_dataset['totals.totalTransactionRevenue'].fillna(0))
mixed_dataset['totals.transactions'] = pd.to_numeric(mixed_dataset['totals.transactions'].fillna(0))
mixed_dataset['trafficSource.adContent'] = mixed_dataset['trafficSource.adContent'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.adwordsClickInfo.adNetworkType'] = mixed_dataset['trafficSource.adwordsClickInfo.adNetworkType'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.adwordsClickInfo.gclId'] = mixed_dataset['trafficSource.adwordsClickInfo.gclId'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.adwordsClickInfo.isVideoAd'] = mixed_dataset['trafficSource.adwordsClickInfo.isVideoAd'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.adwordsClickInfo.page'] = pd.to_numeric(mixed_dataset['trafficSource.adwordsClickInfo.page'].fillna(0))
mixed_dataset['trafficSource.adwordsClickInfo.slot'] = mixed_dataset['trafficSource.adwordsClickInfo.slot'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.campaign'] = mixed_dataset['trafficSource.campaign'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.campaignCode'] = mixed_dataset['trafficSource.campaignCode'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.isTrueDirect'] = mixed_dataset['trafficSource.isTrueDirect'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.keyword'] = mixed_dataset['trafficSource.keyword'].fillna('none').astype('unicode')
mixed_dataset['trafficSource.medium'] = mixed_dataset['trafficSource.medium'].fillna('none').astype('|S2048')
mixed_dataset['trafficSource.referralPath'] = mixed_dataset['trafficSource.referralPath'].fillna('none').astype('unicode')
mixed_dataset['trafficSource.source'] = mixed_dataset['trafficSource.source'].fillna('none').astype('|S2048')

We have cast all the object features. Let's check the actual types now.

In [None]:
mixed_dataset.dtypes

We can see that while some features have been cast correctly, others are still marked as objects. Those features behave like strings anyway, so this is not a real problem.

Now we can proceed with the real encoding.

In [None]:
les = []
features = ['channelGrouping', 
            'device.browser',
            'device.deviceCategory',
            'device.operatingSystem', 
            'geoNetwork.city', 
            'geoNetwork.country',
            'geoNetwork.metro',
            'geoNetwork.networkDomain', 
            'geoNetwork.region',
            'geoNetwork.continent',
            'geoNetwork.subContinent',
            'trafficSource.adContent',
            'trafficSource.adwordsClickInfo.adNetworkType',
            'trafficSource.adwordsClickInfo.gclId',
            'trafficSource.adwordsClickInfo.isVideoAd',
            'trafficSource.adwordsClickInfo.slot', 
            'trafficSource.campaign',
            'trafficSource.campaignCode',
            'trafficSource.isTrueDirect', 
            'trafficSource.keyword',
            'trafficSource.medium', 
            'trafficSource.referralPath',
            'trafficSource.source']
for feature in features:
    le = preprocessing.LabelEncoder()
    try:
        le.fit(mixed_dataset[feature])
        mixed_dataset[feature] = le.transform(mixed_dataset[feature])
        les.append(le)
    except (SystemError, UnicodeEncodeError):
        print("Error: can't handle " + feature + " data type (maybe it is a unicode string).")

Let's check the result.

In [None]:
print(mixed_dataset.values[0,:])

OK, the entire merged dataset is now properly formatted. We can divide the result into two objects (`X_train` and `X_test`), both having the same number of features.

In [None]:
X_train = mixed_dataset.drop(columns=['totals.transactionRevenue','fullVisitorId', 'visitId']).values[0:ga_customers_train.shape[0],:]
X_test = mixed_dataset.drop(columns=['totals.transactionRevenue','fullVisitorId', 'visitId']).values[ga_customers_train.shape[0]:mixed_dataset.shape[0],:]
y = mixed_dataset['totals.transactionRevenue'].values[0:ga_customers_train.shape[0]]

Let's create and train the regressor.

In [None]:
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y)

Now that we have our regressor, let's try to apply it to the test set.

In [None]:
tree_result = regressor.predict(X_test)

Well, we have a result. However, we don't know how good our predictions are. Let's make some considerations:
* our regressor is a simple decision tree regressor, trained on a single dataset
* it would be better to have different trees trained on different datasets, in order to obtain different regressors
* once we have different regressors, we can combine their predictions to get better predictions
* `RandomForestRegressor` from sklearn does exactly this: it takes a dataset, draws different subsets from it, fits a decision tree regressor on each subset and makes predictions by combining those of the individual trees.

So, let's try to apply a random forest regressor to our data.

In [None]:
random_forest = RandomForestRegressor(n_estimators=100, bootstrap=True)
random_forest.fit(X_train, y)
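
The idea behind the forest can be sketched by hand on synthetic data. This is only an illustration of bootstrap sampling plus averaging, not the actual sklearn implementation, which has more options (such as `max_features`). The data and the number of trees below are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: a noisy sine curve.
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X) row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=i).fit(X[idx], y[idx]))

# Each tree predicts on its own; the ensemble prediction is their mean.
X_new = np.linspace(0, 10, 5).reshape(-1, 1)
all_preds = np.stack([tree.predict(X_new) for tree in trees])
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

Every tree sees a slightly different dataset, so the individual errors partly cancel out when the predictions are averaged.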

In [None]:
ra_predictions = random_forest.predict(X_test)

`ra_predictions` is likely to be a better predictor for our purpose, since it averages many trees instead of relying on a single one, although we haven't validated either model. So it's not a bad idea to create the submission file from these predictions.
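
One way to check how good the predictions are is to hold out part of the labelled data and compare the models on it. The sketch below does this on synthetic data with `train_test_split` and the root mean squared error. The data, the split size and the choice of metric are assumptions made for illustration, not something measured on the GStore dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem with three features.
X = rng.uniform(0, 10, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Keep 25% of the rows aside; the models never see them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    rmse[name] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    print(name, rmse[name])
```

The same pattern could be applied to `X_train` and `y` from this notebook before building the submission.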

In [None]:
submission = ga_customers_test[['fullVisitorId']].copy()
submission.loc[:, 'PredictedLogRevenue'] = ra_predictions
grouped_test = submission[['fullVisitorId', 'PredictedLogRevenue']].groupby('fullVisitorId').sum().reset_index()
grouped_test["PredictedLogRevenue"] = np.log1p(grouped_test["PredictedLogRevenue"])
grouped_test.to_csv('submit.csv',index=False)