- Overview: The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As a result, marketing teams are challenged to make appropriate investments in promotional strategies.

- Goal: Analyze a customer dataset from the Google Merchandise Store (also known as GStore, where Google swag is sold) to <b>predict revenue per customer</b>.

- Data format: 
    + Each row in the dataset is one visit to the store. 
    + <b>Not all rows in test_v2.csv will correspond to a row in the submission</b>, but all unique fullVisitorIds will correspond to a row in the submission.
    + Because of the formatting of fullVisitorId, you must <b>load the IDs as strings for all IDs to be properly unique!</b>
    + There are multiple columns which contain JSON blobs of varying depth. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.
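
To see why the string dtype matters, here is a minimal sketch on made-up IDs (the values are illustrative, not taken from the dataset). It assumes the failure mode is leading zeros being dropped during numeric parsing, which would make two distinct IDs collapse into one.

```python
import io
import pandas as pd

# Hypothetical IDs for illustration only: they differ just by a leading zero.
csv_text = "fullVisitorId,visitId\n0123456789,1\n123456789,2\n"

# Default parsing infers an integer column and drops the leading zero.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps both IDs distinct.
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": str})

n_unique_numeric = as_numbers["fullVisitorId"].nunique()
n_unique_string = as_strings["fullVisitorId"].nunique()
print(n_unique_numeric, n_unique_string)
```

The same `dtype` argument can be passed whenever a file containing `fullVisitorId` is read.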


- Input: user transactions collected from GStore visitors around the world.

- Output: predicted transactions for ALL users in the future time period of December 1st 2018 through January 31st 2019.
 + Public LB: calculated for those visitors over the test-set timeframe of 5/1/18 to 10/15/18.
 + Private LB: calculated on the future-looking timeframe of 12/1/18 to 1/31/19, for the same set of users.
 
 => Therefore, a submission intended for the public LB timeframe will differ from one intended for the private LB timeframe, which will be rescored/recalculated on the future timeframe.
 
 We are predicting the <b>natural log of the sum of all transactions per user</b> (plus one, so that a user with no revenue maps to a target of zero):
 
$$
y_{user} = \sum_{i=1}^{n} transaction_{user_i} 
$$
$$
target_{user} = \ln({y_{user}+1})
$$
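
The two formulas above can be sketched in pandas as follows. This is a toy illustration: the visit rows are made up, and treating a missing `transactionRevenue` as zero is an assumption rather than something the dataset description states.

```python
import numpy as np
import pandas as pd

# Toy visit-level data; the IDs and revenue values are made up for illustration.
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "transactionRevenue": [10.0, 5.0, np.nan, 0.0],
})

# Assumption: a missing revenue means the visit produced no purchase.
revenue_per_user = (
    visits.assign(transactionRevenue=visits["transactionRevenue"].fillna(0.0))
    .groupby("fullVisitorId")["transactionRevenue"]
    .sum()
)

# np.log1p(y) computes ln(y + 1), matching the target definition.
target = np.log1p(revenue_per_user).rename("PredictedLogRevenue").reset_index()
print(target)
```

The result has one row per `fullVisitorId`, which is the shape the submission file expects.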
 

- External Data: <b>permitted</b> for this competition. This includes the <a href="https://support.google.com/analytics/answer/6367342#access&zippy=%2Cin-this-article">Google Merchandise Store Demo Account</a>. Although the Demo Account contains the predicted variable, final standings will not benefit from access to this external data, because the competition requires future-looking predictions.

# 1. Data exploration

In [13]:
import numpy as np
import pandas as pd
import os
import json
from pandas import json_normalize

dir = '../input/ga-customer-revenue-prediction/'
for _, _, filenames in os.walk(dir):
    for filename in filenames:
        print(filename)

 <b>train_v2.csv</b>: the updated training set - contains user transactions from August 1st 2016 to April 30th 2018.
 
 <b>test_v2.csv</b>: the updated test set - contains user transactions from May 1st 2018 to October 15th 2018.
 
 <b>sample_submission_v2.csv</b>: an updated sample submission file in the correct format. It contains all fullVisitorIds in test_v2.csv. Your submission's PredictedLogRevenue column should make forward-looking predictions for each of these fullVisitorIds for the timeframe of December 1st 2018 to January 31st 2019.

## 1.1 Data
- `fullVisitorId`: A unique identifier for each user of the Google Merchandise Store.
- `channelGrouping`: The channel via which the user came to the Store.
- `date`: The date on which the user visited the Store.
- `device`: The specifications for the device used to access the Store.
- `geoNetwork`: This section contains information about the geography of the user.
- `socialEngagementType`: Engagement type, either "Socially Engaged" or "Not Socially Engaged".
- `totals`: This section contains aggregate values across the session.
- `trafficSource`: This section contains information about the Traffic Source from which the session originated.
- `visitId`: An identifier for this session. This is part of the value usually stored as the `_utmb` cookie. It is unique only within a given user; for a <b>completely unique ID</b>, use a combination of `fullVisitorId` and `visitId`.
- `visitNumber`: The session number for this user. If this is the first session, then this is set to 1.
- `visitStartTime`: The timestamp (expressed as POSIX time).
- `hits`: This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
- `customDimensions`: This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
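
Two of the field notes above translate directly into code: building a completely unique session ID, and converting the POSIX `visitStartTime` into a timestamp. The sketch below uses made-up rows, and the column names `sessionId` and `visitStart` are hypothetical names chosen for illustration, not columns in the dataset.

```python
import pandas as pd

# Toy rows; the IDs and timestamps are made up for illustration.
sessions = pd.DataFrame({
    "fullVisitorId": ["0001", "0001", "0002"],
    "visitId": [1472830385, 1472880147, 1472830385],
    "visitStartTime": [1472830385, 1472880147, 1472830390],
})

# visitId alone can repeat across users, so combine it with fullVisitorId.
# "sessionId" is a hypothetical column name.
sessions["sessionId"] = (
    sessions["fullVisitorId"] + "_" + sessions["visitId"].astype(str)
)

# POSIX time counts seconds since 1970-01-01 UTC.
sessions["visitStart"] = pd.to_datetime(sessions["visitStartTime"], unit="s")
print(sessions[["sessionId", "visitStart"]])
```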

In [14]:
# %%time
# train_v2 = pd.read_csv(dir + 'train_v2.csv')
# test_v2 = pd.read_csv(dir + 'test_v2.csv')

In [15]:
%%time
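# Read fullVisitorId as a string so that the IDs stay unique (see the data-format notes).
train_v1 = pd.read_csv(dir + 'train.csv', dtype={'fullVisitorId': str}, low_memory=False)
test_v1 = pd.read_csv(dir + 'test.csv', dtype={'fullVisitorId': str}, low_memory=False)
sample_submission_v1 = pd.read_csv(dir + 'sample_submission.csv', dtype={'fullVisitorId': str})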

In [16]:
train_v1.sample(n=5, random_state=1)

In [17]:
test_v1.sample(n=5, random_state=1)

In [18]:
sample_submission_v1.sample(n=5, random_state=1)

In [19]:
print('train_v1 shape:\t', train_v1.shape)
print('test_v1 shape:\t', test_v1.shape)
print('sample_submission_v1 shape:\t', sample_submission_v1.shape)

In [20]:
def load_df(csv_path, nrows=None):
    """Read a visits CSV and flatten its JSON columns into 'column.subcolumn' columns."""
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in JSON_COLUMNS},  # parse JSON strings into dicts
                     dtype={'fullVisitorId': 'str'},  # Important!! Keeps the IDs unique
                     nrows=nrows,
                     low_memory=False)
    
    for column in JSON_COLUMNS:
        # Expand the dicts into a DataFrame with one column per JSON key.
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        # Swap the raw JSON column for its flattened columns, aligned on the row index.
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
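
To make the flattening step concrete, here is a small self-contained sketch of what `json_normalize` does to a nested JSON column. The JSON strings and their keys are invented for illustration and are not the dataset's actual schema.

```python
import json
import pandas as pd

# Made-up JSON strings shaped like a nested column; the keys are illustrative.
raw = pd.Series([
    '{"browser": "Chrome", "screen": {"width": 1920}}',
    '{"browser": "Safari", "screen": {"width": 1280}}',
])

# Parse each string into a dict, then flatten nested keys with a "." separator.
parsed = [json.loads(s) for s in raw]
flat = pd.json_normalize(parsed)

# Prefix with the source column name, as load_df does.
flat.columns = [f"device.{c}" for c in flat.columns]
print(flat)
```

Nested keys become dotted column names such as `device.screen.width`, so each JSON blob ends up as ordinary flat columns.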

In [21]:
%%time
train_flat_v1 = load_df(dir + 'train.csv')
test_flat_v1 = load_df(dir + 'test.csv')

In [22]:
train_flat_v1.sample(n=5, random_state=1)

In [23]:
test_flat_v1.sample(n=5, random_state=1)

## 1.2 Evaluation Metric

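Submissions are scored on the root mean squared error (RMSE), defined as:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum^n_{i=1}(y_i - \hat{y}_i)^2} $$

where $n$ is the number of users, $\hat{y}_i$ is the predicted log revenue for user $i$, and $y_i$ is the actual log-transformed revenue for that user ($target_{user}$ as defined under Output).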
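
As a quick check of the formula, RMSE can be computed with NumPy as sketched below. The helper name `rmse` and the sample values are illustrative; this is not the competition's official scoring code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up log-revenue targets scored against an all-zeros prediction.
score = rmse([0.0, 2.0, 4.0], [0.0, 0.0, 0.0])
print(score)
```

Scoring an all-zeros prediction like this gives a simple baseline to compare a model's validation score against.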
