# Google Analytics- Customer Revenue Retention

by [Raul Maldonado](https://www.linkedin.com/in/raulm8/)

![GA Image](https://www.digitaltechnology.institute/wp-content/uploads/2018/03/google-analytics.gif)



## Introduction

RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has [partnered with Google Cloud and Kaggle](https://www.kaggle.com/c/ga-customer-revenue-prediction) to demonstrate the business impact that thorough data analysis can have.

We analyze Google Merchandise Store (GStore) customer dataset to for associations in revenue and customers, and how to predict for customer reveue.

## Import

In [1]:
import pandas as pd


In [2]:
parse_dates = ['date']
ga_trainDf = pd.read_csv('../Resources/Data/ZipFiles/train_v2.csv.zip',\
                            compression='zip',nrows=5000, parse_dates=parse_dates )

In [3]:
ga_trainDf.head()

Unnamed: 0,channelGrouping,customDimensions,date,device,fullVisitorId,geoNetwork,hits,socialEngagementType,totals,trafficSource,visitId,visitNumber,visitStartTime
0,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",2017-10-16,"{""browser"": ""Firefox"", ""browserVersion"": ""not ...",3162355547410993243,"{""continent"": ""Europe"", ""subContinent"": ""Weste...","[{'hitNumber': '1', 'time': '0', 'hour': '17',...",Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1508198450,1,1508198450
1,Referral,"[{'index': '4', 'value': 'North America'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",8934116514970143966,"{""continent"": ""Americas"", ""subContinent"": ""Nor...","[{'hitNumber': '1', 'time': '0', 'hour': '10',...",Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""referralPath"": ""/a/google.com/transportation...",1508176307,6,1508176307
2,Direct,"[{'index': '4', 'value': 'North America'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",7992466427990357681,"{""continent"": ""Americas"", ""subContinent"": ""Nor...","[{'hitNumber': '1', 'time': '0', 'hour': '17',...",Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""campaign"": ""(not set)"", ""source"": ""(direct)""...",1508201613,1,1508201613
3,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",9075655783635761930,"{""continent"": ""Asia"", ""subContinent"": ""Western...","[{'hitNumber': '1', 'time': '0', 'hour': '9', ...",Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1508169851,1,1508169851
4,Organic Search,"[{'index': '4', 'value': 'Central America'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",6960673291025684308,"{""continent"": ""Americas"", ""subContinent"": ""Cen...","[{'hitNumber': '1', 'time': '0', 'hour': '14',...",Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1508190552,1,1508190552


In [4]:
def string_cleaning(rawString):
    convertedString = rawString.replace("\"","")\
                        .replace("\'","")\
                        .replace("{","")\
                        .replace("}","")\
                        .split(',')
    return(convertedString)

def geographic_parse(rawString):
    convertStr = string_cleaning(rawString)
    return(convertStr[0], convertStr[1])
# ga_trainDf['geoNetwork'][0].replace("\"","").replace("\'","").replace("{","").replace("}","").split(',')
ga_trainDf['geoNetwork_new'] = ga_trainDf['geoNetwork'].transform(geographic_parse)
ga_trainDf['Continent'] = ga_trainDf['geoNetwork_new'].transform(lambda x: x[0].split(":")[1])
ga_trainDf['Sub-Continent'] = ga_trainDf['geoNetwork_new'].transform(lambda x: x[1].split(":")[1])

In [5]:
def sessionInput(string):
    stringFormatted = 0
    try:
        stringFormatted = int(string[5].split(":")[1])
    
    except:
        stringFormatted = 0
    return 0


ga_trainDf['totals_new'] = ga_trainDf['totals'].apply(string_cleaning)

In [6]:
ga_trainDf['totals_new'][:1]

0    [visits: 1,  hits: 1,  pageviews: 1,  bounces:...
Name: totals_new, dtype: object

In [12]:
'''
marketing_metrics_parser
Param 1: metricsArray
Param 2; index

Impute 0 value for missing marketing metrics from
original dictionary or list (e.g. A returned row 
would have visits and hits, but no pageviews data
--thus hitting and error)
'''
def marketing_metrics_parser(metricsArray, index):
    try:
        return(metricsArray[index].split(":")[1])
    except:
        return(0)
        

In [13]:

##Segment: visits,hits, pageviews, bounces, newVisits, sessionQualityDim
ga_trainDf['visits'] = ga_trainDf['totals_new'].apply(marketing_metrics_parser,index=(0))
ga_trainDf['hits'] = ga_trainDf['totals_new'].transform(marketing_metrics_parser,index=(1)) 
ga_trainDf['pageviews'] = ga_trainDf['totals_new'].transform(marketing_metrics_parser,index=(2))
ga_trainDf['bounces'] = ga_trainDf['totals_new'].transform(marketing_metrics_parser,index=(3)) 
ga_trainDf['newVisits'] = ga_trainDf['totals_new'].transform(marketing_metrics_parser,index=(4))
ga_trainDf['sessionQualityDim'] = ga_trainDf['totals_new'].transform(sessionInput)

In [14]:
ga_trainDf['deviceType'] = ga_trainDf['device'].apply(\
                    lambda x: x.split(',')[0][13:-1])

In [15]:
ga_trainDf['Region'] = ga_trainDf.customDimensions.apply(\
                                        lambda x: x[x.find('\'value\':')+10:-3])

In [16]:
def trafficSource_cleaning(trafficString):
    trafficList_cleaned = string_cleaning(trafficString)
    trafficHash = {}
    for keyVal in trafficList_cleaned[:-1]:
        parsedItems= keyVal.split(":")
        trafficHash[parsedItems[0]] = parsedItems[1]
    hasKeyList = list(trafficHash.keys())
    print(hasKeyList)
    if "campaign" not in hasKeyList:
        trafficHash["campaign"] = "(not set)"
    if "referralPath" not in hasKeyList:
        trafficHash["referralPath"] = "(not set)"
    if "source" not in hasKeyList:
        trafficHash["source"] = "(not set)"
    if "medium" not in hasKeyList:
        trafficHash["medium"] = "(not set)"
    if "keyword" not in hasKeyList:
        trafficHash["keyword"] = "(not set)"
    if "adwordsClickInfo" not in hasKeyList:
        trafficHash["adwordsClickInfo"] = "(not set)"
    print(trafficHash)
    return(trafficHash)
        
    '''for all key pairs being mapped in hash, if one a key is not in a list, then input values'''
    '''from that hash, we then create columns our of each key-value pair'''
    
ga_trainDf['trafficSource'][:10].transform(lambda x: trafficSource_cleaning(x)["source"])

# # ##Segment: referralPath, campaign, source, medium, adwordsClickInfo
# ga_trainDf['Ca'] = ga_trainDf['totals_new'].transform(lambda x: x[1].split(":")[1]) 
# ga_trainDf['pageviews'] = ga_trainDf['totals_new'].transform(lambda x: x[2].split(":")[1])
# ga_trainDf['bounces'] = ga_trainDf['totals_new'].transform(lambda x: x[3].split(":")[1]) 
# ga_trainDf['newVisits'] = ga_trainDf['totals_new'].transform(lambda x: x[4].split(":")[1])
# ga_trainDf['sessionQualityDim'] = ga_trainDf['totals_new'].transform(sessionInput)

['campaign', ' source', ' medium', ' keyword']
{'campaign': ' (not set)', ' source': ' google', ' medium': ' organic', ' keyword': ' water bottle', 'referralPath': '(not set)', 'source': '(not set)', 'medium': '(not set)', 'keyword': '(not set)', 'adwordsClickInfo': '(not set)'}
['referralPath', ' campaign', ' source', ' medium']
{'referralPath': ' /a/google.com/transportation/mtv-services/bikes/bike2workmay2016', ' campaign': ' (not set)', ' source': ' sites.google.com', ' medium': ' referral', 'campaign': '(not set)', 'source': '(not set)', 'medium': '(not set)', 'keyword': '(not set)', 'adwordsClickInfo': '(not set)'}
['campaign', ' source', ' medium', ' adwordsClickInfo']
{'campaign': ' (not set)', ' source': ' (direct)', ' medium': ' (none)', ' adwordsClickInfo': ' criteriaParameters', 'referralPath': '(not set)', 'source': '(not set)', 'medium': '(not set)', 'keyword': '(not set)', 'adwordsClickInfo': '(not set)'}
['campaign', ' source', ' medium', ' keyword']
{'campaign': ' (not

0    (not set)
1    (not set)
2    (not set)
3    (not set)
4    (not set)
5    (not set)
6    (not set)
7    (not set)
8    (not set)
9    (not set)
Name: trafficSource, dtype: object

In [17]:
ga_trainDf['trafficSource'][55]

'{"campaign": "(not set)", "source": "google", "medium": "organic", "keyword": "(not provided)", "adwordsClickInfo": {"criteriaParameters": "not available in demo dataset"}}'

In [18]:
ga_trainDf.head()
# Remove device, geoNetwork, totals, customDimensions, totals_new

Unnamed: 0,channelGrouping,customDimensions,date,device,fullVisitorId,geoNetwork,hits,socialEngagementType,totals,trafficSource,...,Continent,Sub-Continent,totals_new,visits,pageviews,bounces,newVisits,sessionQualityDim,deviceType,Region
0,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",2017-10-16,"{""browser"": ""Firefox"", ""browserVersion"": ""not ...",3162355547410993243,"{""continent"": ""Europe"", ""subContinent"": ""Weste...",1,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",...,Europe,Western Europe,"[visits: 1, hits: 1, pageviews: 1, bounces:...",1,1,1,1,0,Firefox,EMEA
1,Referral,"[{'index': '4', 'value': 'North America'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",8934116514970143966,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",2,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""referralPath"": ""/a/google.com/transportation...",...,Americas,Northern America,"[visits: 1, hits: 2, pageviews: 2, timeOnSi...",1,2,28,2,0,Chrome,North America
2,Direct,"[{'index': '4', 'value': 'North America'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",7992466427990357681,"{""continent"": ""Americas"", ""subContinent"": ""Nor...",2,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""campaign"": ""(not set)"", ""source"": ""(direct)""...",...,Americas,Northern America,"[visits: 1, hits: 2, pageviews: 2, timeOnSi...",1,2,38,1,0,Chrome,North America
3,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",9075655783635761930,"{""continent"": ""Asia"", ""subContinent"": ""Western...",2,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",...,Asia,Western Asia,"[visits: 1, hits: 2, pageviews: 2, timeOnSi...",1,2,1,1,0,Chrome,EMEA
4,Organic Search,"[{'index': '4', 'value': 'Central America'}]",2017-10-16,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",6960673291025684308,"{""continent"": ""Americas"", ""subContinent"": ""Cen...",2,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""2"", ""pageviews"": ""2"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",...,Americas,Central America,"[visits: 1, hits: 2, pageviews: 2, timeOnSi...",1,2,52,1,0,Chrome,Central America
