<a href="https://colab.research.google.com/github/MeinHserhT/CS14115/blob/main/Full.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Overview: The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

- Goal: Analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to <b>predict revenue per customer</b>

- Data format: 
    + Each row in the dataset is one visit to the store. 
    + <b>Not all rows in test_v2.csv will correspond to a row in the submission</b>, but all unique fullVisitorIds will correspond to a row in the submission.
    + Because of how fullVisitorId is formatted, you must <b>load the IDs as strings so that every ID stays unique!</b>
    + There are multiple columns which contain JSON blobs of varying depth. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.
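A minimal sketch of how a JSON column such as totals could be flattened into ordinary columns. The two-row CSV below is a made-up stand-in for the real file, and the revenue value in it is illustrative only; the approach (parse with `json.loads`, expand with `pd.json_normalize`) is one common option, not necessarily the one used later in this notebook.

```python
import io
import json

import pandas as pd

# Tiny made-up stand-in for the raw CSV; the real file has many more columns.
raw_csv = io.StringIO(
    'fullVisitorId,totals\n'
    '0824839726118485274,"{""visits"": ""1"", ""transactionRevenue"": ""37860000""}"\n'
    '3162355547410993243,"{""visits"": ""1""}"\n'
)

df = pd.read_csv(
    raw_csv,
    dtype={'fullVisitorId': str},       # keep IDs as strings
    converters={'totals': json.loads},  # parse each JSON blob into a dict
)

# Expand the dicts into flat columns named totals.<key>.
totals = pd.json_normalize(df['totals'].tolist()).add_prefix('totals.')
df = df.drop(columns='totals').join(totals)

print(df)
```

Rows whose blob has no transactionRevenue key end up with a missing value in `totals.transactionRevenue`.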


- Training data: user transactions collected from the GStore worldwide from 01/08/2016 to 30/04/2018.
- Test data: ALL users' transactions in a future period.
 + Public LB: calculated for visitors during the timeframe of 01/05/2018 to 15/10/2018.
 + Private LB: calculated on the future-looking timeframe of 01/12/2018 to 31/01/2019, for that **same** set of users.
 
 $\Rightarrow$ Therefore, the score a submission receives on the public LB timeframe will differ from its private LB score, which is recalculated on the future timeframe.
 
 
- Input: all transactions of a user from 01/05/2018 to 15/10/2018.
- Output: total revenue of that user during the prediction period (01/12/2018 to 31/01/2019).
 
 We are predicting the <b>natural log of the sum of all transactions per user</b>. 
 
$$
y_{user} = \sum_{i=1}^{n} transaction_{user_i} 
$$
$$
target_{user} = \ln({y_{user}+1})
$$
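The two formulas can be sketched in pandas as follows. The toy frame and its column name `transactionRevenue` are assumptions for illustration, as is treating a missing revenue as no purchase.

```python
import numpy as np
import pandas as pd

# Toy visit-level data: one row per visit (values are made up).
visits = pd.DataFrame({
    'fullVisitorId': ['a', 'a', 'b', 'c'],
    'transactionRevenue': [10.0, 5.0, np.nan, 0.0],
})

# y_user: sum of all transactions per user (NaN is skipped, so it counts as 0).
y_user = visits.groupby('fullVisitorId')['transactionRevenue'].sum()

# target_user = ln(y_user + 1); log1p computes exactly this.
target_user = np.log1p(y_user)

print(target_user)
```

Using `np.log1p` keeps users with zero revenue at a target of exactly 0.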
 

- External Data: is <b>permitted</b> for this competition. This includes the <a href="https://support.google.com/analytics/answer/6367342#access&zippy=%2Cin-this-article">Google Merchandise Store Demo Account</a>. Although the Demo Account contains the predicted variable, final standings will not benefit from access to this external data, because it requires future-looking predictions.

- Evaluation Metric

Submissions are scored on the root mean squared error. RMSE is defined as:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum^n_{i=1}(y_i - \hat{y}_i)^2} $$
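For local validation, the metric can be reproduced with a few lines of NumPy. This is a sketch; the function name `rmse` is chosen here for illustration, and both arguments are expected to be on the log scale defined above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
print(rmse([0.0, 0.0], [3.0, 4.0]))
```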

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive/'gStore Revenue Prediction'/data

/content/drive/MyDrive/gStore Revenue Prediction/data


In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
import os
import json
from pandas import json_normalize
from scipy.stats import norm
from datetime import datetime, timedelta
import ast

import gc
gc.enable()

Descriptions of all columns:
- https://brandee.edu.vn/glossary/3437719-analytics-en/
- https://support.google.com/analytics/answer/3437719?hl=vi 

# EXPLORE ALL COLUMNS
Based on: https://www.kaggle.com/code/jsaguiar/complete-exploratory-analysis-all-columns/notebook

## 1. fullVisitorId
A unique identifier for each user of the Google Merchandise Store.
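A small illustration of why the IDs have to be read as strings. The in-memory CSV is a stand-in for the real file. With default type inference pandas parses the column as integers, so an ID with a leading zero loses that zero and could then collide with a different ID.

```python
import io

import pandas as pd

csv_text = 'fullVisitorId\n0824839726118485274\n3162355547410993243\n'

# Default inference: the column becomes int64 and the leading zero is lost.
as_int = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype: the ID is kept exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'fullVisitorId': str})

print(as_int['fullVisitorId'].iloc[0])
print(as_str['fullVisitorId'].iloc[0])
```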

In [None]:
explore_train_df = pd.read_csv('train_fullVisitorId.csv', names = ['index', 'train_fullVisitorId'], dtype={'train_fullVisitorId': 'str'}).set_index('index')
explore_train_df

Unnamed: 0_level_0,train_fullVisitorId
index,Unnamed: 1_level_1
0,3162355547410993243
1,8934116514970143966
2,7992466427990357681
3,9075655783635761930
4,6960673291025684308
...,...
1708332,5123779100307500332
1708333,7231728964973959842
1708334,5744576632396406899
1708335,2709355455991750775


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_fullVisitorId,value_counts
0,1957458976293878100,400
1,7282998257608986241,315
2,3884810646891698298,268
3,0824839726118485274,258
4,7477638593794484792,218
...,...,...
1323725,3585417242472829270,1
1323726,3585411503564862793,1
1323727,3585408508447373708,1
1323728,3585392490236113827,1


In [None]:
explore_test_df = pd.read_csv('test_fullVisitorId.csv', names = ['index', 'test_fullVisitorId'], dtype={'test_fullVisitorId': 'str'}).set_index('index')
explore_test_df

Unnamed: 0_level_0,test_fullVisitorId
index,Unnamed: 1_level_1
0,7460955084541987166
1,460252456180441002
2,3461808543879602873
3,975129477712150630
4,8381672768065729990
...,...
401584,6701149525099562370
401585,6154541330147351453
401586,6013469762773705448
401587,4565378823441900999


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_fullVisitorId,value_counts
0,0650107116874202739,105
1,7706472452740899006,86
2,7282998257608986241,85
3,6246006502985590876,72
4,8801084265240272984,71
...,...,...
296525,3663544797782090129,1
296526,3663548931131262300,1
296527,3663556135730628346,1
296528,366357980994158949,1


In [None]:
len(set(vl_count_train_df['train_fullVisitorId']) - set(vl_count_test_df['test_fullVisitorId']))

1320971

## 2. channelGrouping
Channel Groupings are rule-based groupings of your traffic sources.

In [None]:
explore_train_df = pd.read_csv('train_channelGrouping.csv', names = ['index', 'train_channelGrouping']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_channelGrouping
index,Unnamed: 1_level_1
0,Organic Search
1,Referral
2,Direct
3,Organic Search
4,Organic Search


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_channelGrouping,value_counts
0,Organic Search,738963
1,Social,354971
2,Direct,273134
3,Referral,211307
4,Display,51283
5,Paid Search,45627
6,Affiliates,32915
7,(Other),137


In [None]:
# based on: https://dashthis.com/blog/google-analytics-display-traffic/
des = {
"Organic Search": "This traffic found your site in a search engine such as Google or Bing. If you’re focusing on optimizing pages for search engines, this is an important channel to watch;",
"Display": "This traffic found your site by clicking on an ad that you ran on another website. Banner ads on blogs and image ads on news sites are some common generators of display traffic;",
"Direct": "This traffic came to your site by entering your URL directly into the address bar of browsers. Keep an eye on this one if you've been running offline or traditional media ads like print, TV, or radio, because they require audiences to remember and type out your web address;",
"Referral": "This traffic followed a backlink from another website to yours, and you'll see this traffic if it doesn't fall under one of the other buckets;",
"Paid Search": "this traffic comes from your paid search ads which appear in the search results of Bing, Google, or other search network players like AOL and Ask.com;",
"Social": "This traffic will be counted from people who find your page through an associated social media account. Check in on users who are landing on your page as a result of social media accounts like Facebook, LinkedIn, or Twitter;",
"Email": "This traffic clicked on links from email campaigns, follow up emails, and even email signatures;",
"(Other)": "If GA greets your web traffic with a shrug emoji, they'll throw it in this channel. Note that there are often better ways to group this traffic.",
}
des_df = pd.DataFrame(list(des.items()), columns = ['train_channelGrouping', 'description'])

In [None]:
pd.merge(vl_count_train_df, des_df, on=['train_channelGrouping']).set_index('train_channelGrouping')

Unnamed: 0_level_0,value_counts,description
train_channelGrouping,Unnamed: 1_level_1,Unnamed: 2_level_1
Organic Search,738963,"This traffic found your site in a search engine such as Google or Bing. If you’re focusing on optimizing pages for search engines, this is an important channel to watch;"
Social,354971,"This traffic will be counted from people who find your page through an associated social media account. Check in on users who are landing on your page as a result of social media accounts like Facebook, LinkedIn, or Twitter;"
Direct,273134,"This traffic came to your site by entering your URL directly into the address bar of browsers. Keep an eye on this one if you've been running offline or traditional media ads like print, TV, or radio, because they require audiences to remember and type out your web address;"
Referral,211307,"This traffic followed a backlink from another website to yours, and you'll see this traffic if it doesn't fall under one of the other buckets;"
Display,51283,This traffic found your site by clicking on an ad that you ran on another website. Banner ads on blogs and image ads on news sites are some common generators of display traffic;
Paid Search,45627,"this traffic comes from your paid search ads which appear in the search results of Bing, Google, or other search network players like AOL and Ask.com;"
(Other),137,"If GA greets your web traffic with a shrug emoji, they'll throw it in this channel. Note that there are often better ways to group this traffic."
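Note that Affiliates appears in the value counts but not in the merged table: `pd.merge` defaults to an inner join, and the description dictionary has no entry for it. A sketch of how a left join with `indicator=True` keeps such rows and flags them (the two small frames and their shortened descriptions are placeholders):

```python
import pandas as pd

# Placeholder frames mirroring the structure of the ones used above.
counts = pd.DataFrame({
    'train_channelGrouping': ['Organic Search', 'Affiliates'],
    'value_counts': [738963, 32915],
})
descriptions = pd.DataFrame({
    'train_channelGrouping': ['Organic Search', 'Email'],
    'description': ['Traffic from a search engine', 'Traffic from email links'],
})

# how='left' keeps every channel from counts; _merge records whether a match was found.
merged = counts.merge(descriptions, on='train_channelGrouping', how='left', indicator=True)

# Channels that have counts but no description.
missing = merged.loc[merged['_merge'] == 'left_only', 'train_channelGrouping'].tolist()
print(missing)
```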


In [None]:
explore_test_df = pd.read_csv('test_channelGrouping.csv', names = ['index', 'test_channelGrouping']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_channelGrouping
index,Unnamed: 1_level_1
0,Organic Search
1,Direct
2,Organic Search
3,Direct
4,Organic Search


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_channelGrouping,value_counts
0,Organic Search,198378
1,Direct,76076
2,Referral,59505
3,Social,36881
4,Paid Search,12834
5,Affiliates,10833
6,Display,7076
7,(Other),6


## 3. Date (format into Day-Month-Year)
The date on which the user visited the Store.

In [None]:
explore_train_df = pd.read_csv('train_date.csv', names = ['index', 'train_date']).set_index('index')
explore_train_df

Unnamed: 0_level_0,train_date
index,Unnamed: 1_level_1
0,20171016
1,20171016
2,20171016
3,20171016
4,20171016
...,...
1708332,20170104
1708333,20170104
1708334,20170104
1708335,20170104


In [None]:
explore_train_df['train_date'] = explore_train_df['train_date'].apply(lambda x: str(x)[:4] + ' ' + str(x)[4:6] + ' ' + str(x)[6:])
explore_train_df[['year', 'month', 'day']] = explore_train_df['train_date'].str.split(' ', expand=True)
explore_train_df.drop(columns=['train_date']).sort_values(by=['year', 'month', 'day'])

Unnamed: 0_level_0,year,month,day
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
995708,2016,08,01
995709,2016,08,01
995710,2016,08,01
995711,2016,08,01
995712,2016,08,01
...,...,...,...
1032486,2018,04,30
1032487,2018,04,30
1032488,2018,04,30
1032489,2018,04,30
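An alternative to slicing the string by hand is to parse the `YYYYMMDD` integers with `pd.to_datetime`, which also gives direct access to the date parts and to a Day-Month-Year rendering. A sketch on three sample values:

```python
import pandas as pd

dates = pd.DataFrame({'train_date': [20171016, 20160801, 20180430]})

# Parse the integer dates; the explicit format avoids ambiguous guessing.
parsed = pd.to_datetime(dates['train_date'].astype(str), format='%Y%m%d')

dates['year'] = parsed.dt.year
dates['month'] = parsed.dt.month
dates['day'] = parsed.dt.day
dates['formatted'] = parsed.dt.strftime('%d-%m-%Y')  # Day-Month-Year

dates = dates.sort_values('train_date')
print(dates)
```

Unlike the string columns above, the extracted parts are integers, so they sort numerically without zero padding.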


In [None]:
explore_test_df = pd.read_csv('test_date.csv', names = ['index', 'test_date']).set_index('index')
explore_test_df

Unnamed: 0_level_0,test_date
index,Unnamed: 1_level_1
0,20180511
1,20180511
2,20180511
3,20180511
4,20180511
...,...
401584,20180907
401585,20180907
401586,20180907
401587,20180907


In [None]:
explore_test_df.sort_values('test_date')

Unnamed: 0_level_0,test_date
index,Unnamed: 1_level_1
343283,20180501
342758,20180501
342759,20180501
342760,20180501
342761,20180501
...,...
12683,20181015
12684,20181015
12685,20181015
12687,20181015


## 4. Device
The specifications for the device used to access the Store.

### 4.1 device.browser (fill 'Others' for browsers ranked below CocCoc by value count)
The browser used (e.g., "Chrome" or "Firefox")

In [None]:
explore_train_df = pd.read_csv('train_device.browser.csv', names = ['index', 'train_device.browser']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.browser
index,Unnamed: 1_level_1
0,Firefox
1,Chrome
2,Chrome
3,Chrome
4,Chrome


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.browser,value_counts
0,Chrome,1173056
1,Safari,312165
2,Firefox,63845
3,Internet Explorer,35474
4,Android Webview,34266
...,...,...
124,DoCoMo,1
125,Dillo,1
126,DDG-Android-3.1.1,1
127,Changa 99695759,1


In [None]:
# temp_df = vl_count_df.apply(lambda x: x if x['value_counts'] > 941 else {'device.browser': 'Other', 'value_counts': x['value_counts']}, axis = 1)
# temp_df['percent'] = temp_df['value_counts'] / 1708337 * 100
# temp_df.groupby('device.browser').sum().sort_values('value_counts')
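The commented-out cell above hints at folding rare browsers into a single bucket. One way this could be done directly on the column is sketched below; the sample data and the cut-off `min_count` are hypothetical and would need to be tuned on the real counts.

```python
import pandas as pd

# Made-up sample of the browser column.
browsers = pd.Series(
    ['Chrome'] * 5 + ['Safari'] * 3 + ['Dillo', 'DoCoMo'],
    name='device.browser',
)

min_count = 2  # hypothetical cut-off: browsers seen fewer times become 'Others'

counts = browsers.value_counts()
rare = counts[counts < min_count].index

# Keep frequent browsers as they are and replace the rare ones.
grouped = browsers.where(~browsers.isin(rare), 'Others')

print(grouped.value_counts())
```

The same list of rare values should be applied to the test set so that both sets share one category vocabulary.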

In [None]:
vl_count_train_df.groupby('value_counts')['train_device.browser'].apply(list).to_frame()

Unnamed: 0_level_0,train_device.browser
value_counts,Unnamed: 1_level_1
1,"[ejpxuidzlmagvthsfbqnkwyocr, efkaxnbyohqtspzlvcwrjmigdu, eosutpkiahjzvdgcwxlmyfqbrn, epxmjusghnvircdfkwqlotzbay, ecwozghsufybtdkjrlvxpamiqn, ecgiwapzltrkujdhmqsbxfonvy, flobzsdixhuwqakptjmcrveygn, jdbknvrluyeaxoipgwczmthsqf, flwadqukonrjegpbisyxztvhcm, rpfanjzoxyemsgbtichqkudwlv, ymzsbiduaejrchvxlwkfnqgtop, wvsmagudcqeytijorlhxnfzkbp, wncrmxukofqljsgvzahiybpdet, wfpknuqxovyilmrdzbhgtecjas, wdhtapevfnqzskcroxgjmiybul, vjebamzrktwcysxpdlonhiufqg, uybjlgntzwpacihremkqsxdovf, ujvrzsonxihlgaqdmkwtbfcpey, uhdypcxbgzajmeqwlofnrsitkv, tfowdqmibyshaklxuregpcnzvj, subjectAgent: NoticiasBoom, starmaker, rbydojcflwzvnuaepmsgxhiktq, fspmihbxzowgnuctrqykjlvade, ohukwejvqmdtibfrzpycgxanls, ohfgqlpiuyknvmbctszjarxdwe, njroiedbwpmvykqlatxzuhcfgs, mhwxofpevcagujznbsiqlrkytd, lxjwoyfivgdbkqtuzsrmhencpa, lpmqaxwbzyteokrfusnjhvdigc, lhkbrtuwomdeafnqygvxcspizj, kqebrzuwmiycxdvtoljnhsfpga, jscatcher, dkagwlhmfqxercuozpnbvtsiyj, ighfsbrmpoctzjqxlywdenvuka, hbijxvdyrgnatwzmlcpkfusqoe, dohyinzpvbsktjeguxmrqcwafl, NokiaC7-00, cnwmpegudakrqzljtvfxohbysi, cajrnbtvqwfkolzyxushpdgime, ;__CT_JOB_ID__:a80e8e16-6e98-455b-885a-a4dd40f3d344;, ;__CT_JOB_ID__:a7ed0808-e70c-4b19-b1a3-1018bbb7dc7f;, ;__CT_JOB_ID__:a4f837b8-8d78-4c42-ba9a-d870cf1a4a7e;, ;__CT_JOB_ID__:a24a8978-e5e8-4dc9-af66-c4ed89ea25d7;, ;__CT_JOB_ID__:97909e28-4228-4b55-8ad5-cc791f2b583c;, ;__CT_JOB_ID__:89e59554-ad41-4e94-957b-f12bd012530c;, ;__CT_JOB_ID__:85da5736-a78e-45a9-837e-f5a53e5cd725;, ;__CT_JOB_ID__:7e575295-571e-4e82-9254-7f2c8bbb9183;, ;__CT_JOB_ID__:76fd1acb-e365-43c0-b967-908bcf5d5b59;, ;__CT_JOB_ID__:6e9dcf2f-f58f-4938-91e3-77e00868177b;, ;__CT_JOB_ID__:65da7e5f-0f05-4b5d-8d31-1f4d470a2b82;, ;__CT_JOB_ID__:58e2ecba-7666-4a10-b498-8216457ce472;, ;__CT_JOB_ID__:4333777f-bb0c-4a18-935e-df5658dbce2d;, ;__CT_JOB_ID__:2e0eca60-83ab-482d-bb81-343d113254fb;, ;__CT_JOB_ID__:2547db0b-ec43-452a-a0d4-ff42b7dc7907;, ;__CT_JOB_ID__:0b39e7ca-1431-42e3-ba1f-9d8951a65840;, ;__CT_JOB_ID__:0a075729-93a5-43d0-9638-4cbd41d5f5a5;, 
;__CT_JOB_ID__:d14534ff-e2fc-4692-92aa-e34508f1c418;, ;__CT_JOB_ID__:dd6177aa-1baa-4007-9b38-b7cab4f7611c;, ;__CT_JOB_ID__:fe02e46f-b6ae-41f1-8563-3b40bbb623a9;, M5, bsfnwveckhgpdoyjxmizruqtla, ajsqixbltuvwpmdcokfyzhgren, afjurnqyolshpibxczdwktmvge, [Use default User-agent string] LIVRENPOCHE, User Agent, TCL P500M, Reddit, KINGSUN-F4, ADM, IE with Chrome Frame, Hisense M20-M_LTE, HTC802t_TD, DoCoMo, Dillo, DDG-Android-3.1.1, Changa 99695759, zurcqesbhljxmpwdgnvkoyafit]"
2,"[NokiaE52-1, Amazon.com, Android Runtime]"
3,"[MQQBrowser, LYF_LS_4002_11, ThumbSniper, SAMSUNG-SM-B355E Opera]"
4,"[no-ua, CSM Click, Autn-WKOOP, Netscape, DASH_JR_3G]"
5,"[DESKTOP, Browser]"
6,"[Konqueror, YE]"
7,"[Nichrome, 0]"
8,"[+Simple Browser, Lunascape]"
12,[(not set)]
15,[Playstation Vita Browser]


In [None]:
explore_test_df = pd.read_csv('test_device.browser.csv', names = ['index', 'test_device.browser']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.browser
index,Unnamed: 1_level_1
0,Chrome
1,Chrome
2,Chrome
3,Chrome
4,Internet Explorer


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.browser,value_counts
0,Chrome,305526
1,Safari,57892
2,Firefox,12527
3,Opera Mini,5221
4,Internet Explorer,4528
...,...,...
57,;__CT_JOB_ID__:f3b7bb35-9ce9-4a6d-81eb-0bc9f815ea36;,1
58,;__CT_JOB_ID__:fc3791f1-f340-420c-a606-c299c26381fe;,1
59,;__CT_JOB_ID__:051f8c23-a890-4634-b34d-1aeebc201356;,1
60,GameSessions CEF,1


### 4.2 device.browserSize (*drop this column*)

The browser size in length and width.

In [None]:
explore_train_df = pd.read_csv('train_device.browserSize.csv', names = ['index', 'train_device.browserSize']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.browserSize
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.browserSize,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.browserSize.csv', names = ['index', 'test_device.browserSize']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.browserSize
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.browserSize,value_counts
0,not available in demo dataset,401589
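Columns like this one hold a single value in every row, which is why they are marked for dropping. Instead of checking each column by hand, constant columns could be detected programmatically. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'device.browser': ['Chrome', 'Safari', 'Chrome'],
    'device.browserSize': ['not available in demo dataset'] * 3,
    'device.browserVersion': ['not available in demo dataset'] * 3,
})

# A column is constant when it has at most one distinct value (NaN counted as a value).
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

df = df.drop(columns=constant_cols)
print(constant_cols)
```

In practice the check would be run on the training set and the same columns dropped from the test set.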


### 4.3 device.browserVersion (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.browserVersion.csv', names = ['index', 'train_device.browserVersion']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.browserVersion
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.browserVersion,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.browserVersion.csv', names = ['index', 'test_device.browserVersion']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.browserVersion
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.browserVersion,value_counts
0,not available in demo dataset,401589


### 4.4 device.deviceCategory
The type of device (Mobile, Tablet, Desktop).

In [None]:
explore_train_df = pd.read_csv('train_device.deviceCategory.csv', names = ['index', 'train_device.deviceCategory']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.deviceCategory
index,Unnamed: 1_level_1
0,desktop
1,desktop
2,mobile
3,desktop
4,desktop


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.deviceCategory,value_counts
0,desktop,1171579
1,mobile,471336
2,tablet,65422


In [None]:
explore_test_df = pd.read_csv('test_device.deviceCategory.csv', names = ['index', 'test_device.deviceCategory']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.deviceCategory
index,Unnamed: 1_level_1
0,mobile
1,desktop
2,desktop
3,mobile
4,tablet


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.deviceCategory,value_counts
0,desktop,277648
1,mobile,111725
2,tablet,12216


### 4.5 device.mobileDeviceInfo (*drop this column*)
The branding, model, and marketing name used to identify the mobile device.

In [None]:
explore_train_df = pd.read_csv('train_device.mobileDeviceInfo.csv', names = ['index', 'train_device.mobileDeviceInfo']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.mobileDeviceInfo
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.mobileDeviceInfo,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.mobileDeviceInfo.csv', names = ['index', 'test_device.mobileDeviceInfo']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.mobileDeviceInfo
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.mobileDeviceInfo,value_counts
0,not available in demo dataset,401589


### 4.6 device.mobileDeviceMarketingName (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.mobileDeviceMarketingName.csv', names = ['index', 'train_device.mobileDeviceMarketingName']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.mobileDeviceMarketingName
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.mobileDeviceMarketingName,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.mobileDeviceMarketingName.csv', names = ['index', 'test_device.mobileDeviceMarketingName']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.mobileDeviceMarketingName
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.mobileDeviceMarketingName,value_counts
0,not available in demo dataset,401589


### 4.7 device.mobileDeviceModel (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.mobileDeviceModel.csv', names = ['index', 'train_device.mobileDeviceModel']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.mobileDeviceModel
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.mobileDeviceModel,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.mobileDeviceModel.csv', names = ['index', 'test_device.mobileDeviceModel']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.mobileDeviceModel
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.mobileDeviceModel,value_counts
0,not available in demo dataset,401589


### 4.8 device.mobileInputSelector (*drop this column*)
Selector (e.g., touchscreen, joystick, clickwheel, stylus) used on the mobile device.

In [None]:
explore_train_df = pd.read_csv('train_device.mobileInputSelector.csv', names = ['index', 'train_device.mobileInputSelector']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.mobileInputSelector
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.mobileInputSelector,value_counts
0,not available in demo dataset,1708337


In [None]:

explore_test_df = pd.read_csv('test_device.mobileInputSelector.csv', names = ['index', 'test_device.mobileInputSelector']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.mobileInputSelector
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.mobileInputSelector,value_counts
0,not available in demo dataset,401589


### 4.9 device.mobileDeviceMarketingName (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.mobileDeviceMarketingName.csv', names = ['index', 'train_device.mobileDeviceMarketingName']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.mobileDeviceMarketingName
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.mobileDeviceMarketingName,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.mobileDeviceMarketingName.csv', names = ['index', 'test_device.mobileDeviceMarketingName']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.mobileDeviceMarketingName
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.mobileDeviceMarketingName,value_counts
0,not available in demo dataset,401589


### 4.10 device.operatingSystem (fill 'not set')

In [None]:
explore_train_df = pd.read_csv('train_device.operatingSystem.csv', names = ['index', 'train_device.operatingSystem']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.operatingSystem
index,Unnamed: 1_level_1
0,Windows
1,Chrome OS
2,Android
3,Windows
4,Windows


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.operatingSystem,value_counts
0,Windows,619720
1,Macintosh,438514
2,Android,299386
3,iOS,219334
4,Linux,63971
5,Chrome OS,51318
6,(not set),11815
7,Windows Phone,1675
8,Samsung,911
9,Tizen,709


In [None]:
# Short description for each operating system value; np.nan marks values without a clear description.
des = {
"Windows": "Microsoft Windows, also called Windows and Windows OS, computer operating system (OS) developed by Microsoft Corporation to run personal computers (PCs)",
"Macintosh": "macOS is a Unix operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac computers",
"Android": "Android OS is a Linux-based mobile operating system that primarily runs on smartphones and tablets",
"iOS": "iOS is a mobile operating system for Apple-manufactured devices. iOS runs on the iPhone, iPad, iPod Touch and Apple TV",
"Linux": "Linux is a Unix-like, open source and community-developed operating system (OS) for computers, servers, mainframes, mobile devices and embedded devices",
"Chrome OS": "Google ChromeOS is a fast, secure and versatile cloud-first operating system that is easy to manage and powers Chromebooks and other Chrome devices",
"(not set)": np.nan,
"Windows Phone": "Windows Mobile was a Microsoft operating system that targeted smartphones and Pocket PCs (but no longer supported)",
"Samsung": np.nan,
"Tizen": "Tizen, developed by Samsung and Intel, is a Linux-based open-source operating system that can support smartphones, tablets, and PCs in addition to TVs",
"BlackBerry": "BlackBerry OS is a proprietary mobile operating system designed specifically for Research In Motion's (RIM) BlackBerry devices",
"OS/2": "OS/2 (Operating System/2) is a series of computer operating systems, initially created by Microsoft and IBM. The name stands for Operating_System/2, because it was introduced as part of the same generation change release as IBM's Personal_System/2 (PS/2) line of second-generation personal computers.",
"Xbox": "The Xbox system software is the operating system developed exclusively for the Xbox consoles",
"Nintendo Wii": "Wii is a home console from Nintendo. Launched in 2006, it introduced motion controlled gaming to a wide audience of Nintendo fans and people who didn't traditionally play video games",
"Firefox OS": "Firefox OS (project name: Boot to Gecko, also known as B2G) is a discontinued open-source operating system – made for smartphones, tablet computers, smart TVs and dongles designed by Mozilla and external contributors",
"Nintendo WiiU": "Wii U is Nintendo's first high definition home console, a powerful system with a controller that changes the way you can play games and connect together",
"FreeBSD": "FreeBSD is a free and open-source Unix-like operating system descended from the Berkeley Software Distribution (BSD), which was based on Research Unix",
"Playstation Vita": "The PlayStation Vita (PS Vita, or Vita) is a handheld video game console developed and marketed by Sony Interactive Entertainment.",
"Nintendo 3DS": "The Nintendo 3DS system software is the updatable operating system used by the Nintendo 3DS",
"SunOS": "SunOS is a Unix-branded operating system developed by Sun Microsystems for their workstation and server computer systems",
"OpenBSD": "The OpenBSD project produces a FREE, multi-platform 4.4BSD-based UNIX-like operating system",
"Nokia": np.nan, # Asha or Symbian...
"SymbianOS": "Symbian OS is an operating system for mobile phones primarily used on Nokia advanced or data enabled smart phones",
"NTT DoCoMo": "NTT Docomo, Inc. (株式会社NTTドコモ, Kabushiki gaisha Entītī Dokomo, formerly and still colloquially stylized as NTT DoCoMo until 2008) is a Japanese mobile phone operator"
}

In [None]:
explore_test_df = pd.read_csv('test_device.operatingSystem.csv', names = ['index', 'test_device.operatingSystem']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.operatingSystem
index,Unnamed: 1_level_1
0,Android
1,Macintosh
2,Chrome OS
3,iOS
4,Windows


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.operatingSystem,value_counts
0,Windows,138005
1,Macintosh,102791
2,Android,65436
3,iOS,53307
4,Chrome OS,20684
5,Linux,16324
6,(not set),4382
7,Samsung,388
8,Tizen,93
9,Windows Phone,91
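
The cells in this notebook repeat the pattern `value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'})`. That rename depends on the counts column being labeled `0`, which appears to hold for older pandas releases; in pandas 2.x the counts are named `count` instead, so the rename would silently do nothing. The sketch below is one version-agnostic way to build the same table. `value_count_table` is a hypothetical helper name, and the small frame is a stand-in for `explore_test_df`.

```python
import pandas as pd


def value_count_table(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count each distinct value (NaN included) in one column, most frequent first."""
    counts = df[column].value_counts(dropna=False)
    # Name the index and the counts explicitly instead of relying on
    # version-specific defaults.
    return counts.rename_axis(column).reset_index(name="value_counts")


# Tiny stand-in for explore_test_df; the real frame is loaded from CSV above.
demo_df = pd.DataFrame(
    {"test_device.operatingSystem": ["Windows", "Android", "Windows", None]}
)
demo_counts = value_count_table(demo_df, "test_device.operatingSystem")
print(demo_counts)
```

Because the column names are set explicitly, later cells that index `['value_counts']` should keep working regardless of the installed pandas version.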


### 4.11 device.operatingSystemVersion (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.operatingSystemVersion.csv', names = ['index', 'train_device.operatingSystemVersion']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.operatingSystemVersion
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.operatingSystemVersion,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.operatingSystemVersion.csv', names = ['index', 'test_device.operatingSystemVersion']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.operatingSystemVersion
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.operatingSystemVersion,value_counts
0,not available in demo dataset,401589
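
Columns marked *drop this column* hold a single value ("not available in demo dataset") in both train and test, so they carry no signal. Rather than listing them by hand, they could be detected automatically. The following is a minimal sketch under that assumption; `constant_columns` is a hypothetical helper and `demo_df` is a made-up miniature of the flattened data.

```python
import pandas as pd

PLACEHOLDER = "not available in demo dataset"


def constant_columns(df: pd.DataFrame) -> list:
    """Return the columns that contain at most one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Hypothetical miniature of the flattened training frame.
demo_df = pd.DataFrame(
    {
        "device.operatingSystem": ["Windows", "Android", "iOS"],
        "device.operatingSystemVersion": [PLACEHOLDER] * 3,
        "device.isMobile": [False, True, True],
    }
)
to_drop = constant_columns(demo_df)
reduced_df = demo_df.drop(columns=to_drop)
print(to_drop)
```

Running the check on the training frame and dropping the same list from the test frame keeps the two feature sets aligned.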


### 4.12 device.isMobile

In [None]:
explore_train_df = pd.read_csv('train_device.isMobile.csv', names = ['index', 'train_device.isMobile']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.isMobile
index,Unnamed: 1_level_1
0,False
1,False
2,True
3,False
4,False


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.isMobile,value_counts
0,False,1171670
1,True,536667


In [None]:
explore_test_df = pd.read_csv('test_device.isMobile.csv', names = ['index', 'test_device.isMobile']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.isMobile
index,Unnamed: 1_level_1
0,True
1,False
2,False
3,True
4,True


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.isMobile,value_counts
0,False,277621
1,True,123968
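
A quick way to check that `device.isMobile` is distributed similarly in train and test is to turn the counts above into shares. The sketch below only reuses the four numbers printed in the outputs; `mobile_share` is a hypothetical helper.

```python
# Counts copied from the value_counts outputs above.
train_counts = {False: 1171670, True: 536667}
test_counts = {False: 277621, True: 123968}


def mobile_share(counts: dict) -> float:
    """Fraction of sessions flagged as mobile."""
    return counts[True] / (counts[True] + counts[False])


train_share = mobile_share(train_counts)
test_share = mobile_share(test_counts)
print(f"train: {train_share:.2%}, test: {test_share:.2%}")
```

Both shares come out near 31%, which suggests no large shift in the mobile/desktop mix between the two periods.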


### 4.13 device.mobileDeviceBranding (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.mobileDeviceBranding.csv', names = ['index', 'train_device.mobileDeviceBranding']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.mobileDeviceBranding
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.mobileDeviceBranding,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.mobileDeviceBranding.csv', names = ['index', 'test_device.mobileDeviceBranding']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.mobileDeviceBranding
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.mobileDeviceBranding,value_counts
0,not available in demo dataset,401589


### 4.14 device.flashVersion (*drop this column*)
The version of the Adobe Flash plugin that is installed on the browser.

In [None]:
explore_train_df = pd.read_csv('train_device.flashVersion.csv', names = ['index', 'train_device.flashVersion']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.flashVersion
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.flashVersion,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.flashVersion.csv', names = ['index', 'test_device.flashVersion']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.flashVersion
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.flashVersion,value_counts
0,not available in demo dataset,401589


### 4.15 device.language (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.language.csv', names = ['index', 'train_device.language']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.language
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.language,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.language.csv', names = ['index', 'test_device.language']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.language
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.language,value_counts
0,not available in demo dataset,401589


### 4.16 device.screenColors (*drop this column*)
Number of colors supported by the display, expressed as the bit-depth (e.g., "8-bit", "24-bit", etc.).

In [None]:
explore_train_df = pd.read_csv('train_device.screenColors.csv', names = ['index', 'train_device.screenColors']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.screenColors
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.screenColors,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.screenColors.csv', names = ['index', 'test_device.screenColors']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.screenColors
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.screenColors,value_counts
0,not available in demo dataset,401589


### 4.17 device.screenResolution (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_device.screenResolution.csv', names = ['index', 'train_device.screenResolution']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_device.screenResolution
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_device.screenResolution,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_device.screenResolution.csv', names = ['index', 'test_device.screenResolution']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_device.screenResolution
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_device.screenResolution,value_counts
0,not available in demo dataset,401589


## 5. GeoNetwork
This section contains information about the geography of the user.

Reference: https://support.google.com/analytics/answer/6160484  

### 5.1 geoNetwork.continent

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.continent.csv', names = ['index', 'train_geoNetwork.continent']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.continent
index,Unnamed: 1_level_1
0,Europe
1,Americas
2,Americas
3,Asia
4,Americas


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.continent,value_counts
0,Americas,877403
1,Asia,396719
2,Europe,368037
3,Africa,35481
4,Oceania,28180
5,(not set),2517


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.continent.csv', names = ['index', 'test_geoNetwork.continent']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.continent
index,Unnamed: 1_level_1
0,Asia
1,Americas
2,Americas
3,Americas
4,Americas


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.continent,value_counts
0,Americas,216932
1,Asia,85768
2,Europe,83870
3,Oceania,7870
4,Africa,6767
5,(not set),382


### 5.2 geoNetwork.subContinent

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.subContinent.csv', names = ['index', 'train_geoNetwork.subContinent']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.subContinent
index,Unnamed: 1_level_1
0,Western Europe
1,Northern America
2,Northern America
3,Western Asia
4,Central America


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.subContinent,value_counts
0,Northern America,768345
1,Southeast Asia,121634
2,Southern Asia,121062
3,Western Europe,115153
4,Northern Europe,111693
5,Eastern Asia,91072
6,South America,75112
7,Eastern Europe,74007
8,Southern Europe,67184
9,Western Asia,60966


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.subContinent.csv', names = ['index', 'test_geoNetwork.subContinent']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.subContinent
index,Unnamed: 1_level_1
0,Southern Asia
1,Northern America
2,Northern America
3,Northern America
4,Northern America


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.subContinent,value_counts
0,Northern America,193790
1,Southern Asia,32269
2,Eastern Asia,30024
3,Western Europe,27594
4,Northern Europe,27077
5,South America,17013
6,Southern Europe,16223
7,Southeast Asia,16029
8,Eastern Europe,12976
9,Australasia,7827


In [None]:
len(set(vl_count_test_df['test_geoNetwork.subContinent']) - set(vl_count_train_df['train_geoNetwork.subContinent']))

0
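
The cell above checks whether the test set contains sub-continents that never occur in training. The same check can be run directly on the raw columns, which avoids depending on the intermediate count tables. A minimal sketch, assuming two pandas Series; `unseen_categories` and the sample values are hypothetical.

```python
import pandas as pd


def unseen_categories(train: pd.Series, test: pd.Series) -> set:
    """Values that occur in the test column but never in the training column."""
    return set(test.dropna().unique()) - set(train.dropna().unique())


# Hypothetical stand-ins for the two subContinent columns.
train_col = pd.Series(["Northern America", "Western Europe", "Southern Asia"])
test_col = pd.Series(["Northern America", "Southern Asia", "Polynesia"])
new_values = unseen_categories(train_col, test_col)
print(new_values)
```

An empty result, as in the notebook output above, means any encoding fitted on the training values will cover every test value.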

### 5.3 geoNetwork.country (ALL 249-> mapping)
Ref: https://stackoverflow.com/questions/47856976/number-of-total-countries-in-google-analytics-is-around-230-while-google-search 

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.country.csv', names = ['index', 'train_geoNetwork.country']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.country
index,Unnamed: 1_level_1
0,Germany
1,United States
2,United States
3,Turkey
4,Mexico


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.country,value_counts
0,United States,717217
1,India,105317
2,United Kingdom,73341
3,Canada,51057
4,Germany,38516
...,...,...
223,St. Pierre & Miquelon,1
224,Norfolk Island,1
225,Solomon Islands,1
226,Montserrat,1


In [None]:
vl_count_train_df.groupby('value_counts')['train_geoNetwork.country'].apply(list).to_frame()

Unnamed: 0_level_0,train_geoNetwork.country
value_counts,Unnamed: 1_level_1
1,"[Anguilla, Micronesia, Eritrea, Samoa, São Tomé & Príncipe, St. Helena, Tonga, St. Pierre & Miquelon, Norfolk Island, Solomon Islands, Montserrat, Åland Islands]"
2,"[St. Barthélemy, American Samoa, Marshall Islands]"
3,[Cook Islands]
4,"[Comoros, Dominica, Seychelles, Vanuatu]"
5,"[Equatorial Guinea, St. Martin]"
...,...
38516,[Germany]
51057,[Canada]
73341,[United Kingdom]
105317,[India]
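
The grouped table shows a long tail of countries with only a handful of sessions. The notebook does not say how these are handled; one common option, shown here purely as an assumption, is to merge rare levels into a single bucket before encoding. `bucket_rare`, the `min_count` threshold and the `"Other"` label are all hypothetical.

```python
import pandas as pd


def bucket_rare(values: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with a single label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep a value unless it is in the rare set; NaN is left untouched.
    return values.where(~values.isin(rare), other)


demo_countries = pd.Series(
    ["United States"] * 4 + ["India"] * 3 + ["Anguilla", "Tonga"]
)
bucketed = bucket_rare(demo_countries, min_count=2)
print(bucketed.value_counts())
```

The threshold should be derived from the training data only and then applied unchanged to the test data.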


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.country.csv', names = ['index', 'test_geoNetwork.country']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.country
index,Unnamed: 1_level_1
0,India
1,United States
2,United States
3,United States
4,United States


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.country,value_counts
0,United States,180794
1,India,28900
2,United Kingdom,18375
3,Canada,12985
4,Japan,10787
...,...,...
203,Bhutan,1
204,St. Barthélemy,1
205,St. Kitts & Nevis,1
206,St. Vincent & Grenadines,1


### 5.4 geoNetwork.region
The region from which sessions originate, derived from IP addresses. In the U.S., a region is a state, such as New York.

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.region.csv', names = ['index', 'train_geoNetwork.region']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.region
index,Unnamed: 1_level_1
0,not available in demo dataset
1,California
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / len(explore_train_df) * 100
vl_count_train_df

Unnamed: 0,train_geoNetwork.region,value_counts,percent
0,not available in demo dataset,932959,54.612117
1,California,206669,12.097672
2,(not set),49774,2.913594
3,New York,49733,2.911194
4,England,25824,1.511646
...,...,...,...
478,Maha Sarakham,6,0.000351
479,Braga,6,0.000351
480,Binh Phuoc,6,0.000351
481,Abruzzo,6,0.000351


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.region.csv', names = ['index', 'test_geoNetwork.region']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.region
index,Unnamed: 1_level_1
0,Delhi
1,California
2,not available in demo dataset
3,Texas
4,California


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.region,value_counts
0,not available in demo dataset,206434
1,California,56683
2,New York,13869
3,(not set),11894
4,England,8123
...,...,...
264,Federal District,6
265,Atlantico,6
266,Central Visayas,6
267,Daejeon,6
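
In the training output above, "not available in demo dataset" and "(not set)" together account for about 57.5% of `geoNetwork.region` values. If both are to be treated as missing (an assumption, since the notebook does not state this), they can be masked to NaN so that pandas' missing-value tools apply. `mask_placeholders` is a hypothetical helper.

```python
import pandas as pd

# Placeholder strings seen in the outputs above; treating both as missing is an assumption.
PLACEHOLDERS = ["not available in demo dataset", "(not set)"]


def mask_placeholders(values: pd.Series) -> pd.Series:
    """Turn placeholder strings into NaN so pandas treats them as missing."""
    return values.mask(values.isin(PLACEHOLDERS))


demo_region = pd.Series(
    ["not available in demo dataset", "California", "(not set)", "Delhi"]
)
cleaned = mask_placeholders(demo_region)
missing_share = cleaned.isna().mean()
print(cleaned.tolist(), missing_share)
```

The same masking would apply to `geoNetwork.metro` and `geoNetwork.city`, which use the same two placeholder strings.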


### 5.5 geoNetwork.metro

Ref: https://www.thebalancecareers.com/what-is-a-designated-market-area-dma-2315180

The Designated Market Area (DMA) from which sessions originate. A DMA is a geographic region in which Nielsen, the ratings company, measures how television is viewed.

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.metro.csv', names = ['index', 'train_geoNetwork.metro']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.metro
index,Unnamed: 1_level_1
0,not available in demo dataset
1,San Francisco-Oakland-San Jose CA
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / len(explore_train_df) * 100
vl_count_train_df

Unnamed: 0,train_geoNetwork.metro,value_counts,percent
0,not available in demo dataset,932959,54.612117
1,(not set),386896,22.647522
2,San Francisco-Oakland-San Jose CA,182745,10.697245
3,New York NY,50419,2.951350
4,London,23643,1.383978
...,...,...,...
118,Tallahassee FL-Thomasville GA,6,0.000351
119,Dayton OH,6,0.000351
120,Des Moines-Ames IA,6,0.000351
121,Springfield-Holyoke MA,6,0.000351


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.metro.csv', names = ['index', 'test_geoNetwork.metro']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.metro
index,Unnamed: 1_level_1
0,(not set)
1,San Francisco-Oakland-San Jose CA
2,not available in demo dataset
3,Houston TX
4,Los Angeles CA


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.metro,value_counts
0,not available in demo dataset,206434
1,(not set),89826
2,San Francisco-Oakland-San Jose CA,50162
3,New York NY,14016
4,London,7446
...,...,...
77,Rochester NY,7
78,Harlingen-Weslaco-Brownsville-McAllen TX,6
79,Waco-Temple-Bryan TX,6
80,Las Vegas NV,6


### 5.6 geoNetwork.city
Ref: https://www.quora.com/What-is-the-difference-between-a-city-and-a-region 

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.city.csv', names = ['index', 'train_geoNetwork.city']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.city
index,Unnamed: 1_level_1
0,not available in demo dataset
1,Cupertino
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / len(explore_train_df) * 100
vl_count_train_df

Unnamed: 0,train_geoNetwork.city,value_counts,percent
0,not available in demo dataset,932959,54.612117
1,Mountain View,74110,4.338137
2,(not set),65867,3.855621
3,New York,49460,2.895213
4,San Francisco,36960,2.163508
...,...,...,...
951,Daly City,4,0.000234
952,Morgan Hill,4,0.000234
953,North Creek,4,0.000234
954,Boise,3,0.000176


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.city.csv', names = ['index', 'test_geoNetwork.city']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.city
index,Unnamed: 1_level_1
0,(not set)
1,San Francisco
2,not available in demo dataset
3,Houston
4,Irvine


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.city,value_counts
0,not available in demo dataset,206434
1,San Francisco,20084
2,(not set),18659
3,New York,13794
4,Sunnyvale,9258
...,...,...
498,McAllen,6
499,Marseille,6
500,Campbell,4
501,Saratoga,3


### 5.7 geoNetwork.cityId (*drop this column*)

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.cityId.csv', names = ['index', 'train_geoNetwork.cityId']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.cityId
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.cityId,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.cityId.csv', names = ['index', 'test_geoNetwork.cityId']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.cityId
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.cityId,value_counts
0,not available in demo dataset,401589


### 5.8 geoNetwork.latitude (*drop this column*)
The approximate latitude of users' city, derived from their IP addresses or Geographical IDs. Locations north of the equator have positive latitudes and locations south of the equator have negative latitudes.

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.latitude.csv', names = ['index', 'train_geoNetwork.latitude']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.latitude
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.latitude,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.latitude.csv', names = ['index', 'test_geoNetwork.latitude']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.latitude
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.latitude,value_counts
0,not available in demo dataset,401589


### 5.9 geoNetwork.longitude (*drop this column*)
The approximate longitude of users' city, derived from their IP addresses or Geographical IDs. Locations east of the prime meridian have positive longitudes and locations west of the prime meridian have negative longitudes.


In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.longitude.csv', names = ['index', 'train_geoNetwork.longitude']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.longitude
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.longitude,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.longitude.csv', names = ['index', 'test_geoNetwork.longitude']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.longitude
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.longitude,value_counts
0,not available in demo dataset,401589


### 5.10 geoNetwork.networkDomain (top20)
The domain name of the user's ISP, derived from the domain name registered to the ISP's IP address.



In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.networkDomain.csv', names = ['index', 'train_geoNetwork.networkDomain']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.networkDomain
index,Unnamed: 1_level_1
0,(not set)
1,(not set)
2,windjammercable.net
3,unknown.unknown
4,prod-infinitum.com.mx


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / len(explore_train_df) * 100
vl_count_train_df

Unnamed: 0,train_geoNetwork.networkDomain,value_counts,percent
0,(not set),499049,29.212562
1,unknown.unknown,269796,15.792903
2,comcast.net,55486,3.247954
3,rr.com,28715,1.680874
4,verizon.net,26547,1.553967
...,...,...,...
41977,hochland.com,1,0.000059
41978,hochrheinnet.de,1,0.000059
41979,omygods.com,1,0.000059
41980,hockaday.org,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.networkDomain.csv', names = ['index', 'test_geoNetwork.networkDomain']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.networkDomain
index,Unnamed: 1_level_1
0,unknown.unknown
1,(not set)
2,onlinecomputerworks.com
3,(not set)
4,com


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.networkDomain,value_counts
0,(not set),137978
1,unknown.unknown,56949
2,comcast.net,14194
3,rr.com,6787
4,verizon.net,6481
...,...,...
15929,novell.com,1
15930,novelent.com,1
15931,novator.ru,1
15932,novatel.bg,1


In [None]:
vl_count_train_df.head(20)

Unnamed: 0,train_geoNetwork.networkDomain,value_counts,percent
0,(not set),499049,29.212562
1,unknown.unknown,269796,15.792903
2,comcast.net,55486,3.247954
3,rr.com,28715,1.680874
4,verizon.net,26547,1.553967
5,ttnet.com.tr,17078,0.999686
6,comcastbusiness.net,16826,0.984934
7,hinet.net,15933,0.932661
8,virginm.net,12594,0.737208
9,cox.net,10722,0.627628


In [None]:
vl_count_test_df.head(20)

Unnamed: 0,test_geoNetwork.networkDomain,value_counts
0,(not set),137978
1,unknown.unknown,56949
2,comcast.net,14194
3,rr.com,6787
4,verizon.net,6481
5,hinet.net,4763
6,comcastbusiness.net,3972
7,sbcglobal.net,3230
8,virginm.net,2640
9,optonline.net,2246
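
The "(top20)" note in the heading suggests keeping only the most frequent network domains and collapsing the rest, since the column has tens of thousands of distinct values. The sketch below shows one way this could look; it is an assumption about the intent, not the notebook's confirmed method. `top_n_levels`, `collapse_to_levels` and the `"other"` label are hypothetical.

```python
import pandas as pd


def top_n_levels(train_values: pd.Series, n: int = 20) -> list:
    """The n most frequent values in the training column."""
    return train_values.value_counts().head(n).index.tolist()


def collapse_to_levels(values: pd.Series, levels: list, other: str = "other") -> pd.Series:
    """Keep values found in `levels`; map everything else (NaN included) to one label."""
    return values.where(values.isin(levels), other)


demo_train = pd.Series(["comcast.net"] * 3 + ["rr.com"] * 2 + ["hochland.com"])
demo_test = pd.Series(["comcast.net", "novell.com", "rr.com"])

# Levels are chosen on the training data only, then reused for the test data.
levels = top_n_levels(demo_train, n=2)
train_encoded = collapse_to_levels(demo_train, levels)
test_encoded = collapse_to_levels(demo_test, levels)
print(test_encoded.tolist())
```

Note that "(not set)" and "unknown.unknown" are the two most frequent values in both sets, so they would occupy two of the twenty slots unless they are masked first.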


### 5.11 geoNetwork.networkLocation (*drop this column*)
The names of the service providers used to reach the property. For example, if most users of the website come via the major cable internet service providers, its value will be these service providers' names.

In [None]:
explore_train_df = pd.read_csv('train_geoNetwork.networkLocation.csv', names = ['index', 'train_geoNetwork.networkLocation']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_geoNetwork.networkLocation
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_geoNetwork.networkLocation,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_geoNetwork.networkLocation.csv', names = ['index', 'test_geoNetwork.networkLocation']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_geoNetwork.networkLocation
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_geoNetwork.networkLocation,value_counts
0,not available in demo dataset,401589


## 6. SocialEngagementType (*drop this column*)
Engagement type, either "Socially Engaged" or "Not Socially Engaged".

In [None]:
explore_train_df = pd.read_csv('train_socialEngagementType.csv', names = ['index', 'train_socialEngagementType']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_socialEngagementType
index,Unnamed: 1_level_1
0,Not Socially Engaged
1,Not Socially Engaged
2,Not Socially Engaged
3,Not Socially Engaged
4,Not Socially Engaged


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_socialEngagementType,value_counts
0,Not Socially Engaged,1708337


In [None]:
explore_test_df = pd.read_csv('test_socialEngagementType.csv', names = ['index', 'test_socialEngagementType']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_socialEngagementType
index,Unnamed: 1_level_1
0,Not Socially Engaged
1,Not Socially Engaged
2,Not Socially Engaged
3,Not Socially Engaged
4,Not Socially Engaged


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_socialEngagementType,value_counts
0,Not Socially Engaged,401589


## 7. Totals 
This section contains aggregate values across the session.

### 7.1 totals.bounces (NaN -> 0)
Total bounces (for convenience). For a bounced session the value is 1; otherwise it is null.
> A bounce occurs when a user opens a single page on your site and then exits without triggering any other requests to the Analytics server during that session.

In [None]:
explore_train_df = pd.read_csv('train_totals.bounces.csv', names = ['index', 'train_totals.bounces']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.bounces
index,Unnamed: 1_level_1
0,1.0
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_totals.bounces,value_counts
0,1.0,871578
1,,836759


In [None]:
explore_test_df = pd.read_csv('test_totals.bounces.csv', names = ['index', 'test_totals.bounces']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.bounces
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.bounces,value_counts
0,,218911
1,1.0,182678
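
The "NaN -> 0" note in the heading can be implemented in one step, because a null here means "the session did not bounce" rather than "unknown". A minimal sketch on a made-up Series; the `int8` dtype is an assumption chosen to keep the flag compact.

```python
import numpy as np
import pandas as pd

demo_bounces = pd.Series([1.0, np.nan, np.nan, 1.0], name="totals.bounces")

# NaN means "did not bounce", so 0 is a faithful fill value rather than an imputation.
bounces_flag = demo_bounces.fillna(0).astype("int8")
print(bounces_flag.tolist())
```

The same conversion fits any other column that is documented as 1-or-null.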


### 7.2 totals.hits
Total number of hits within the session.

In [None]:
explore_train_df = pd.read_csv('train_totals.hits.csv', names = ['index', 'train_totals.hits']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.hits
index,Unnamed: 1_level_1
0,1
1,2
2,2
3,2
4,2


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_totals.hits,value_counts
0,1,864064
1,2,237499
2,3,134435
3,4,80875
4,5,63687
...,...,...
292,262,1
293,260,1
294,259,1
295,204,1


In [None]:
explore_test_df = pd.read_csv('test_totals.hits.csv', names = ['index', 'test_totals.hits']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.hits
index,Unnamed: 1_level_1
0,4
1,4
2,4
3,5
4,5


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.hits,value_counts
0,1,181092
1,2,46337
2,3,33946
3,4,21227
4,5,17081
...,...,...
203,197,1
204,203,1
205,153,1
206,208,1


### 7.3 totals.newVisits (NaN -> 0)
Total number of new users in session (for convenience). If this is the first visit, this value is 1, otherwise it is null.

In [None]:
explore_train_df = pd.read_csv('train_totals.newVisits.csv', names = ['index', 'train_totals.newVisits']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.newVisits
index,Unnamed: 1_level_1
0,1.0
1,
2,1.0
3,1.0
4,1.0


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_totals.newVisits,value_counts
0,1.0,1307430
1,,400907


In [None]:
explore_test_df = pd.read_csv('test_totals.newVisits.csv', names = ['index', 'test_totals.newVisits']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.newVisits
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,1.0


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.newVisits,value_counts
0,1.0,286065
1,,115524


### 7.4 totals.pageviews (handle outliers?)
Total number of pageviews within the session.
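
The heading leaves the outlier question open. Two common options are sketched below on toy data: capping at an upper quantile and a log transform. The 0.99 quantile and all variable names are assumptions chosen for illustration, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with a long right tail, similar in shape to the counts shown in this section.
pageviews = pd.Series([1.0, 2.0, 2.0, 3.0, 5.0, 220.0, np.nan], name="totals.pageviews")

# Option 1: cap values at an upper quantile (0.99 is a hypothetical threshold).
upper = pageviews.quantile(0.99)
capped = pageviews.clip(upper=upper)

# Option 2: compress the tail with log(1 + x); NaN values stay NaN.
logged = np.log1p(pageviews)

print(capped.max(), logged.max())
```

Capping keeps the original scale, while the log transform keeps the ordering of every value; which one suits the model better would need to be checked on validation data.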

In [None]:
explore_train_df = pd.read_csv('train_totals.pageviews.csv', names = ['index', 'train_totals.pageviews']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.pageviews
index,Unnamed: 1_level_1
0,1.0
1,2.0
2,2.0
3,2.0
4,2.0


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_totals.pageviews,value_counts
0,1.0,876328
1,2.0,249794
2,3.0,142896
3,4.0,86666
4,5.0,64712
...,...,...
226,213.0,1
227,215.0,1
228,219.0,1
229,220.0,1


In [None]:
explore_test_df = pd.read_csv('test_totals.pageviews.csv', names = ['index', 'test_totals.pageviews']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.pageviews
index,Unnamed: 1_level_1
0,3.0
1,3.0
2,3.0
3,4.0
4,4.0


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.pageviews,value_counts
0,1.0,183926
1,2.0,52223
2,3.0,36301
3,4.0,23540
4,5.0,18418
...,...,...
151,141.0,1
152,147.0,1
153,149.0,1
154,152.0,1


### 7.5 totals.sessionQualityDim (NaN -> 0)
An estimate of how close a particular session was to transacting, ranging from 1 to 100, calculated for each session. A value closer to 1 indicates a low session quality, or far from transacting, while a value closer to 100 indicates a high session quality, or very close to transacting. A value of 0 indicates that Session Quality is not calculated for the selected time range.

In [None]:
explore_train_df = pd.read_csv('train_totals.sessionQualityDim.csv', names = ['index', 'train_totals.sessionQualityDim']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.sessionQualityDim
index,Unnamed: 1_level_1
0,1.0
1,2.0
2,1.0
3,1.0
4,1.0


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_totals.sessionQualityDim,value_counts,percent
0,,835274,48.893983
1,1.0,717560,42.003422
2,2.0,52517,3.074159
3,3.0,17066,0.998983
4,4.0,9552,0.559140
...,...,...,...
96,96.0,52,0.003044
97,97.0,27,0.001580
98,98.0,9,0.000527
99,99.0,5,0.000293


In [None]:
explore_test_df = pd.read_csv('test_totals.sessionQualityDim.csv', names = ['index', 'test_totals.sessionQualityDim']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.sessionQualityDim
index,Unnamed: 1_level_1
0,1
1,1
2,1
3,1
4,1


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.sessionQualityDim,value_counts
0,1,304045
1,2,31789
2,3,9741
3,4,5526
4,5,3681
...,...,...
94,95,33
95,96,26
96,97,14
97,98,3


### 7.6 totals.timeOnSite (NaN -> 0)
Total time of the session, expressed in seconds.


In [None]:
explore_train_df = pd.read_csv('train_totals.timeOnSite.csv', names = ['index', 'train_totals.timeOnSite']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.timeOnSite
index,Unnamed: 1_level_1
0,
1,28.0
2,38.0
3,1.0
4,52.0


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_totals.timeOnSite,value_counts,percent
0,,874294,51.178076
1,5.0,9862,0.577287
2,4.0,9738,0.570028
3,6.0,9150,0.535609
4,7.0,8221,0.481228
...,...,...,...
4770,4356.0,1,0.000059
4771,3452.0,1,0.000059
4772,3243.0,1,0.000059
4773,4364.0,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_totals.timeOnSite.csv', names = ['index', 'test_totals.timeOnSite']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.timeOnSite
index,Unnamed: 1_level_1
0,973.0
1,49.0
2,24.0
3,25.0
4,49.0


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.timeOnSite,value_counts
0,,183686
1,10.0,2242
2,7.0,2205
3,9.0,2186
4,8.0,2185
...,...,...
3574,3120.0,1
3575,3122.0,1
3576,3127.0,1
3577,3129.0,1


### 7.7 totals.totalTransactionRevenue
Total transaction revenue, expressed as the value passed to Analytics multiplied by 10^6 (e.g., 2.40 would be given as 2400000).
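
To make the 10^6 scaling concrete, the sketch below converts raw values back to currency units and builds a per-user log target. It runs on toy data; the column name `revenue_raw` and the use of `log1p` (that is, ln(1 + x), so that users without purchases map to 0) are assumptions, since the overview only states that the target is the natural log of the per-user sum.

```python
import numpy as np
import pandas as pd

# Toy sessions; fullVisitorId is kept as a string, as the overview requires.
sessions = pd.DataFrame({
    "fullVisitorId": ["001", "001", "002", "003"],
    "totals.totalTransactionRevenue": [2_400_000.0, np.nan, np.nan, 24_990_000.0],
})

# Sessions without a purchase are NaN; treat them as zero revenue.
sessions["revenue_raw"] = sessions["totals.totalTransactionRevenue"].fillna(0)

# Undo the 10^6 scaling: 2400000 corresponds to 2.40 currency units.
sessions["revenue_units"] = sessions["revenue_raw"] / 1e6

# Sum the raw revenue per user, then take ln(1 + sum) (assumed form of the target).
per_user = sessions.groupby("fullVisitorId")["revenue_raw"].sum()
target = np.log1p(per_user)

print(target)
```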

In [None]:
explore_train_df = pd.read_csv('train_totals.totalTransactionRevenue.csv', names = ['index', 'train_totals.totalTransactionRevenue']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.totalTransactionRevenue
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_totals.totalTransactionRevenue,value_counts,percent
0,,1689823,98.916256
1,24990000.0,147,0.008605
2,23990000.0,143,0.008371
3,22990000.0,134,0.007844
4,25990000.0,120,0.007024
...,...,...,...
8502,105850000.0,1,0.000059
8503,105790000.0,1,0.000059
8504,105770000.0,1,0.000059
8505,105760000.0,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_totals.totalTransactionRevenue.csv', names = ['index', 'test_totals.totalTransactionRevenue']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.totalTransactionRevenue
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.totalTransactionRevenue,value_counts
0,,396995
1,24990000.0,54
2,25990000.0,45
3,28990000.0,38
4,30990000.0,38
...,...,...
2457,11490000.0,1
2458,78970000.0,1
2459,78960000.0,1
2460,78950000.0,1


### 7.8 totals.transactionRevenue (*drop this column*)
This field is deprecated. Use "totals.totalTransactionRevenue" instead (see above).

In [None]:
explore_train_df = pd.read_csv('train_totals.transactionRevenue.csv', names = ['index', 'train_totals.transactionRevenue']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.transactionRevenue
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_totals.transactionRevenue,value_counts,percent
0,,1689823,98.916256
1,16990000.0,308,0.018029
2,19990000.0,248,0.014517
3,39980000.0,220,0.012878
4,18990000.0,219,0.012819
...,...,...,...
7247,87600000.0,1,0.000059
7248,87570000.0,1,0.000059
7249,87490000.0,1,0.000059
7250,87470000.0,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_totals.transactionRevenue.csv', names = ['index', 'test_totals.transactionRevenue']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.transactionRevenue
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.transactionRevenue,value_counts
0,,396995
1,21990000.0,208
2,17590000.0,198
3,35180000.0,126
4,23990000.0,106
...,...,...
1946,79360000.0,1
1947,79160000.0,1
1948,79110000.0,1
1949,78970000.0,1


### 7.9 totals.transactions (NaN -> 0)
Total number of ecommerce transactions within the session.


In [None]:
explore_train_df = pd.read_csv('train_totals.transactions.csv', names = ['index', 'train_totals.transactions']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.transactions
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_totals.transactions,value_counts,percent
0,,1689778,98.913622
1,1.0,18048,1.056466
2,2.0,420,0.024585
3,3.0,43,0.002517
4,4.0,17,0.000995
5,5.0,11,0.000644
6,6.0,6,0.000351
7,7.0,5,0.000293
8,8.0,3,0.000176
9,12.0,2,0.000117


In [None]:
explore_test_df = pd.read_csv('test_totals.transactions.csv', names = ['index', 'test_totals.transactions']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.transactions
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.transactions,value_counts
0,,395284
1,1.0,6151
2,2.0,141
3,3.0,4
4,4.0,3
5,5.0,3
6,6.0,2
7,9.0,1


### 7.10 totals.visits (drop this column)
The number of sessions (for convenience). This value is 1 for sessions with interaction events. The value is null if there are no interaction events in the session.
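
A column that holds a single value in every row carries no information for a model. The sketch below shows one way to detect such columns automatically; the toy frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame: two constant columns and one informative column.
df = pd.DataFrame({
    "totals.visits": [1, 1, 1, 1],
    "totals.hits": [1, 2, 2, 5],
    "socialEngagementType": ["Not Socially Engaged"] * 4,
})

# A column is constant when it has exactly one distinct value, counting NaN as a value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Drop the constant columns and keep the rest.
df_reduced = df.drop(columns=constant_cols)

print(constant_cols)
```

Counting NaN as a value (`dropna=False`) matters here: a column with only 1 and NaN, such as the bounce flag, is not constant and would be kept.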


In [None]:
explore_train_df = pd.read_csv('train_totals.visits.csv', names = ['index', 'train_totals.visits']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_totals.visits
index,Unnamed: 1_level_1
0,1
1,1
2,1
3,1
4,1


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_totals.visits,value_counts
0,1,1708337


In [None]:
explore_test_df = pd.read_csv('test_totals.visits.csv', names = ['index', 'test_totals.visits']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_totals.visits
index,Unnamed: 1_level_1
0,1
1,1
2,1
3,1
4,1


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_totals.visits,value_counts
0,1,401589


## 8. Traffic Source 
This section contains information about the Traffic Source from which the session originated.

### 8.1 trafficSource.adContent (drop this column)
The ad content of the traffic source. Can be set by the utm_content URL parameter.
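
One way to quantify how sparse a column like this is, without hard-coding the row count, is sketched below on toy data. The variable names are assumptions; the notebook does not state its reason for dropping the column, so the missing share is shown only as a diagnostic.

```python
import numpy as np
import pandas as pd

# Toy column: mostly missing, like the training data shown in this subsection.
ad_content = pd.Series(
    [np.nan, np.nan, np.nan, "Google Merchandise Store", np.nan],
    name="trafficSource.adContent",
)

# Share of missing rows, in percent.
missing_pct = ad_content.isna().mean() * 100

# Share of every value (NaN included), in percent, without a hard-coded denominator.
share = ad_content.value_counts(dropna=False, normalize=True) * 100

print(missing_pct)
print(share)
```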


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adContent.csv', names = ['index', 'train_trafficSource.adContent']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adContent
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.adContent,value_counts,percent
0,,1643600,96.210525
1,Google Merchandise Store,39566,2.316054
2,Google Merchandise Collection,6762,0.395824
3,Placement Accessores 300 x 250,3040,0.177951
4,Smart display ad - 8/17/2017,2664,0.155941
...,...,...,...
72,Google store,2,0.000117
73,Swag w/ Google Logos,1,0.000059
74,Google Apparel,1,0.000059
75,GA Help Center,1,0.000059


In [None]:
vl_count_train_df.groupby('value_counts')['train_trafficSource.adContent'].apply(list).to_frame()

Unnamed: 0_level_0,train_trafficSource.adContent
value_counts,Unnamed: 1_level_1
1,"[Swag w/ Google Logos, Google Apparel, GA Help Center, Men's Apparel from Google]"
2,"[google store, cool, Free Shipping!, Google store]"
3,"[Men's-Outerwear Google Apparel, Ad from 2/17/17, url_builder, Full auto ad NATIVE ONLY, free shipping, Full auto ad with Primary Color]"
4,[Official Google Merchandise - Fast Shipping]
5,[{KeyWord:Google Branded Outerwear}]
6,[Drinkware 120x600]
7,"[Full auto ad TEXT/NATIVE, Office 2018 - 120 x 600, visit us again]"
8,"[Google Paraphernalia, Google Store, test_tyler_hr_merchant]"
9,[Want Google Sunglasses]
10,[{KeyWord:Google Branded Apparel}]


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adContent.csv', names = ['index', 'test_trafficSource.adContent']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adContent
index,Unnamed: 1_level_1
0,(not set)
1,(not set)
2,(not set)
3,(not set)
4,(not set)


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adContent,value_counts
0,(not set),390841
1,YouTube Merchandise Collection,4901
2,Google Merchandise Collection,3397
3,Official Google Merchandise,825
4,Smart display ad - 8/17/2017,469
5,Google Office Merchandise,426
6,Google Apparel Merchandise,144
7,BQ,141
8,Official Google Branded Bags,122
9,Official Google Drinkware,81


### 8.2 trafficSource.adwordsClickInfo.adNetworkType (drop this column)
Network Type. Takes one of the following values: 
- Google Search
- Content
- Search partners
- Ad Exchange
- Yahoo Japan Search
- Yahoo Japan AFS
- unknown


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adwordsClickInfo.adNetworkType.csv', names = ['index', 'train_trafficSource.adwordsClickInfo.adNetworkType']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adwordsClickInfo.adNetworkType
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.adwordsClickInfo.adNetworkType,value_counts,percent
0,,1633063,95.593727
1,Content,42223,2.471585
2,Google Search,33043,1.93422
3,Search partners,8,0.000468


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adwordsClickInfo.adNetworkType.csv', names = ['index', 'test_trafficSource.adwordsClickInfo.adNetworkType']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adwordsClickInfo.adNetworkType
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adwordsClickInfo.adNetworkType,value_counts
0,,390984
1,Google Search,10136
2,Content,469


### 8.3 trafficSource.adwordsClickInfo.criteriaParameters (*drop this column*)

Descriptive string for the targeting criterion.


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adwordsClickInfo.criteriaParameters.csv', names = ['index', 'train_trafficSource.adwordsClickInfo.criteriaParameters']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adwordsClickInfo.criteriaParameters
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_trafficSource.adwordsClickInfo.criteriaParameters,value_counts
0,not available in demo dataset,1708337


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adwordsClickInfo.criteriaParameters.csv', names = ['index', 'test_trafficSource.adwordsClickInfo.criteriaParameters']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adwordsClickInfo.criteriaParameters
index,Unnamed: 1_level_1
0,not available in demo dataset
1,not available in demo dataset
2,not available in demo dataset
3,not available in demo dataset
4,not available in demo dataset


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adwordsClickInfo.criteriaParameters,value_counts
0,not available in demo dataset,401589


### 8.4 trafficSource.adwordsClickInfo.gclId (*drop this column*)
The Google Click ID.


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adwordsClickInfo.gclId.csv', names = ['index', 'train_trafficSource.adwordsClickInfo.gclId']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adwordsClickInfo.gclId
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.adwordsClickInfo.gclId,value_counts,percent
0,,1632914,95.585005
1,CN_Whvvc_9UCFd6LswodGTgKCQ,74,0.004332
2,Cj0KEQjwmIrJBRCRmJ_x7KDo-9oBEiQAuUPKMufMpuG3ZdwYO8GTsjiBFd5MPHStZa9y_9NCrI8X97oaAglc8P8HAQ,70,0.004098
3,COT1-vPT4tYCFZWNswodcwsHxg,60,0.003512
4,CN3fusbjvtYCFQsmhgodIEQO-g,51,0.002985
...,...,...,...
59004,CNrQoY3G5dYCFYY1aQodYqkIow,1,0.000059
59005,CNrQvuWA29YCFYWRjwodjRYHFA,1,0.000059
59006,CNrSoKH5ns8CFdgegQodafIMpg,1,0.000059
59007,CNrSp6-AyNYCFctXDQodnQEOOg,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adwordsClickInfo.gclId.csv', names = ['index', 'test_trafficSource.adwordsClickInfo.gclId']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adwordsClickInfo.gclId
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adwordsClickInfo.gclId,value_counts
0,,390977
1,EAIaIQobChMIg7ea6KLl2gIVCZXICh0VTgpLEAEYASAAEgJQIvD_BwE,43
2,Cj0KCQjw_7HdBRDPARIsAN_ltcIh7a6GM8F-QcdRE5-wDQWQAALlEJGusZkDWWGKR_XJdAqombo1NLsaAq-5EALw_wcB,26
3,EAIaIQobChMI8MWOisKg2wIVUJd3Ch2RWgW7EAEYASAAEgLq2_D_BwE,19
4,EAIaIQobChMImu3Fooau3QIVw7rACh2jWAsnEAAYASAAEgIhH_D_BwE,17
...,...,...
9012,CjwKCAjw54fdBRBbEiwAW28S9oH5tTho5n21q2zKSutdsiVu-qH8_o4VlnJ7DyyaUE91G585-MBVexoCqc8QAvD_BwE,1
9013,CjwKCAjw54fdBRBbEiwAW28S9nqUVAjN-FFaYukXAM9flwxjFuEDNFQMRqDH507xUIRFpDyCrOcl4BoCjsoQAvD_BwE,1
9014,CjwKCAjw54fdBRBbEiwAW28S9nlxj9H9QA4wtMoCuAdgWOeH_tVqXGO-s0yAl0blNCH-3OWwRNvMVBoCp5EQAvD_BwE,1
9015,CjwKCAjw54fdBRBbEiwAW28S9nDh01gIZft3cAH5d2USV5g7ijzjVY4aakf7ERybZj3h4nIvkGKdhBoCPxQQAvD_BwE,1


### 8.5 trafficSource.adwordsClickInfo.isVideoAd (fillna -> True)
True if it is a TrueView video ad.


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adwordsClickInfo.isVideoAd.csv', names = ['index', 'train_trafficSource.adwordsClickInfo.isVideoAd']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adwordsClickInfo.isVideoAd
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.adwordsClickInfo.isVideoAd,value_counts,percent
0,,1633063,95.593727
1,False,75274,4.406273


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adwordsClickInfo.isVideoAd.csv', names = ['index', 'test_trafficSource.adwordsClickInfo.isVideoAd']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adwordsClickInfo.isVideoAd
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adwordsClickInfo.isVideoAd,value_counts
0,,390984
1,False,10605


### 8.6 trafficSource.adwordsClickInfo.page (drop this column)
Page number in search results where the ad was shown.

In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adwordsClickInfo.page.csv', names = ['index', 'train_trafficSource.adwordsClickInfo.page']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adwordsClickInfo.page
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.adwordsClickInfo.page,value_counts,percent
0,,1633063,95.593727
1,1.0,73913,4.326605
2,2.0,1057,0.061873
3,3.0,172,0.010068
4,4.0,80,0.004683
5,5.0,30,0.001756
6,6.0,10,0.000585
7,7.0,6,0.000351
8,9.0,3,0.000176
9,8.0,1,5.9e-05


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adwordsClickInfo.page.csv', names = ['index', 'test_trafficSource.adwordsClickInfo.page']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adwordsClickInfo.page
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adwordsClickInfo.page,value_counts
0,,390984
1,1.0,10600
2,2.0,5


### 8.7 trafficSource.adwordsClickInfo.slot (drop this column)

Position of the Ad. Takes one of the following values:
- RHS
- Top


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.adwordsClickInfo.slot.csv', names = ['index', 'train_trafficSource.adwordsClickInfo.slot']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.adwordsClickInfo.slot
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.adwordsClickInfo.slot,value_counts,percent
0,,1633063,95.593727
1,RHS,42750,2.502434
2,Top,32447,1.899333
3,Google Display Network,77,0.004507


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.adwordsClickInfo.slot.csv', names = ['index', 'test_trafficSource.adwordsClickInfo.slot']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.adwordsClickInfo.slot
index,Unnamed: 1_level_1
0,
1,
2,
3,
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.adwordsClickInfo.slot,value_counts
0,,390984
1,Google search: Top,10099
2,Google Display Network,459
3,Top,26
4,Google search: Other,11
5,RHS,10


### 8.8 trafficSource.campaign (drop this column)
The campaign value. Usually set by the utm_campaign URL parameter.

In [None]:
explore_train_df = pd.read_csv('train_trafficSource.campaign.csv', names = ['index', 'train_trafficSource.campaign']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.campaign
index,Unnamed: 1_level_1
0,(not set)
1,(not set)
2,(not set)
3,(not set)
4,(not set)


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.campaign,value_counts,percent
0,(not set),1604526,93.923272
1,Data Share Promo,32914,1.926669
2,1000557 | GA | US | en | Hybrid | GDN Text+Banner | AS,24410,1.428875
3,1000557 | GA | US | en | Hybrid | GDN Remarketing,15149,0.886769
4,AW - Dynamic Search Ads Whole Site,15146,0.886593
5,AW - Accessories,7972,0.466653
6,Smart Display Campaign,2664,0.155941
7,"""google + redesign/Accessories March 17"" All Users Similar Audiences",1179,0.069014
8,"Page: contains ""/google+redesign/drinkware"" Similar Audiences",611,0.035766
9,"""google + redesign/Accessories March 17"" All Users",562,0.032897


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.campaign.csv', names = ['index', 'test_trafficSource.campaign']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.campaign
index,Unnamed: 1_level_1
0,(not set)
1,(not set)
2,(not set)
3,(not set)
4,(not set)


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.campaign,value_counts
0,(not set),378933
1,Data Share Promo,10831
2,AW - YouTube Brand,4802
3,AW - Bags,1381
4,AW - Apparel,1228
5,AW - Office,1211
6,AW - Google Brand,1066
7,Run of Network Line Item,619
8,Smart Display Campaign,469
9,"""google + redesign/Accessories March 17"" All Users Similar Audiences",244


### 8.9 trafficSource.campaignCode (*drop this column*)
Value of the utm_id campaign tracking parameter, used for manual campaign tracking.


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.campaignCode.csv', names = ['index', 'train_trafficSource.campaignCode']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.campaignCode
index,Unnamed: 1_level_1
100000,
100001,
100002,
100003,
100004,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 100000 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.campaignCode,value_counts,percent
0,,99999,99.999
1,11251kjhkvahf,1,0.001


In [None]:
os.path.exists('test_trafficSource.campaignCode.csv')

False
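
Since this column has no counterpart in the test export, a model trained on it could not use it at prediction time. A minimal sketch of aligning the two frames follows; the toy frames and variable names are assumptions.

```python
import pandas as pd

# Toy frames: the training data has one column that the test data lacks.
train = pd.DataFrame({
    "trafficSource.campaign": ["(not set)"],
    "trafficSource.campaignCode": [None],
})
test = pd.DataFrame({"trafficSource.campaign": ["(not set)"]})

# Columns that exist only in the training data.
train_only = [c for c in train.columns if c not in test.columns]

# Keep only the columns both frames share.
train_aligned = train.drop(columns=train_only)

print(train_only)
```

In the real data the target column would also appear only on the training side, so it would need to be excluded from this check before dropping.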

### 8.10 trafficSource.isTrueDirect (fillna -> False)
True if the source of the session was Direct (meaning the user typed the name of your website URL into the browser or came to your site via a bookmark). This field will also be true if 2 successive but distinct sessions have exactly the same campaign details. Otherwise NULL.
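
The `fillna -> False` rule can be sketched as follows on a toy Series. Because the column only ever holds True or NaN, comparing against True gives the same result as filling the gaps with False, and it returns a proper boolean dtype. The variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy column as pandas would read it from CSV: True or NaN, stored as object dtype.
is_true_direct = pd.Series(
    [np.nan, np.nan, True, np.nan],
    dtype=object,
    name="trafficSource.isTrueDirect",
)

# NaN == True evaluates to False, so this is equivalent to filling NaN with False.
is_true_direct_bool = is_true_direct.eq(True)

print(is_true_direct_bool.tolist())
```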

In [None]:
explore_train_df = pd.read_csv('train_trafficSource.isTrueDirect.csv', names = ['index', 'train_trafficSource.isTrueDirect']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.isTrueDirect
index,Unnamed: 1_level_1
0,
1,
2,True
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_trafficSource.isTrueDirect,value_counts
0,,1173819
1,True,534518


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.isTrueDirect.csv', names = ['index', 'test_trafficSource.isTrueDirect']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.isTrueDirect
index,Unnamed: 1_level_1
0,True
1,True
2,True
3,True
4,


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.isTrueDirect,value_counts
0,,253180
1,True,148409


### 8.11 trafficSource.keyword (drop this column)
The keyword of the traffic source, usually set when the trafficSource.medium is "organic" or "cpc". Can be set by the utm_term URL parameter.


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.keyword.csv', names = ['index', 'train_trafficSource.keyword']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.keyword
index,Unnamed: 1_level_1
0,water bottle
1,
2,
3,(not provided)
4,(not provided)


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.keyword,value_counts,percent
0,,1052780,61.626014
1,(not provided),568933,33.303324
2,(User vertical targeting),25918,1.517148
3,(automatic matching),18464,1.080817
4,6qEhsCssdK0z36ri,10870,0.636291
...,...,...,...
4542,google mechanidise,1,0.000059
4543,google mechendise store,1,0.000059
4544,google men shirt,1,0.000059
4545,google men's,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.keyword.csv', names = ['index', 'test_trafficSource.keyword']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.keyword
index,Unnamed: 1_level_1
0,(not provided)
1,(not set)
2,(not provided)
3,(not set)
4,(not provided)


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.keyword,value_counts
0,(not provided),190785
1,(not set),158519
2,,40226
3,Google Merchandise Store,2624
4,youtuber merch,2315
...,...,...
673,google mens,1
674,google men's,1
675,google meechandise,1
676,google mechandise store,1


### 8.12 trafficSource.medium ('(not set)' -> '(none)')
The medium of the traffic source. Could be "organic", "cpc", "referral", or the value of the utm_medium URL parameter.
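
The relabeling noted in the heading can be sketched as below. This assumes the intended labels are the parenthesized strings `(not set)` and `(none)` that appear in the data; the variable names are illustrative.

```python
import pandas as pd

# Toy column using the labels that appear in the value counts of this subsection.
medium = pd.Series(["organic", "(none)", "(not set)", "cpc"], name="trafficSource.medium")

# Fold the rare "(not set)" label into "(none)".
medium_clean = medium.replace({"(not set)": "(none)"})

print(medium_clean.tolist())
```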


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.medium.csv', names = ['index', 'train_trafficSource.medium']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.medium
index,Unnamed: 1_level_1
0,organic
1,referral
2,(none)
3,organic
4,organic


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.medium,value_counts,percent
0,organic,591783,34.640882
1,(none),565957,33.129119
2,referral,432963,25.344121
3,cpc,75603,4.425532
4,affiliate,32915,1.926728
5,cpm,8982,0.525774
6,(not set),134,0.007844


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.medium.csv', names = ['index', 'test_trafficSource.medium']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.medium
index,Unnamed: 1_level_1
0,organic
1,(none)
2,organic
3,(none)
4,organic


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.medium,value_counts
0,organic,198378
1,(none),111307
2,referral,61155
3,cpc,13303
4,affiliate,10833
5,cpm,6607
6,(not set),6
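
The note in the heading ('not set' -> 'none') suggests folding the rare `(not set)` label (134 train rows, 6 test rows) into `(none)`. A minimal sketch of that step, assuming the literal labels shown in the tables above; the `medium` Series is toy data, not the real column.

```python
import pandas as pd

# Toy stand-in for the trafficSource.medium column (hypothetical values).
medium = pd.Series(['organic', '(none)', '(not set)', 'referral', '(not set)'])

# Fold the rare '(not set)' label into '(none)'; other labels are unchanged.
medium_clean = medium.replace('(not set)', '(none)')

print(medium_clean.value_counts())
```

The same call could be applied to both the train and the test column so that they share one label set.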


### 8.13 trafficSource.referralPath
If trafficSource.medium is "referral", then this is set to the path of the referrer. (The host name of the referrer is in trafficSource.source.)


In [None]:
explore_train_df = pd.read_csv('train_trafficSource.referralPath.csv', names = ['index', 'train_trafficSource.referralPath']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.referralPath
index,Unnamed: 1_level_1
0,
1,/a/google.com/transportation/mtv-services/bikes/bike2workmay2016
2,
3,
4,


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.referralPath,value_counts,percent
0,,1142073,66.852910
1,/,138293,8.095183
2,/yt/about/,79163,4.633922
3,/analytics/web/,33112,1.938259
4,/yt/about/tr/,14600,0.854632
...,...,...,...
3192,/intl/eu/yt/about/policies/,1,0.000059
3193,/intl/eu/yt/about/brand-resources/,1,0.000059
3194,/intl/et/yt/about/brand-resources/,1,0.000059
3195,/intl/es_es/permissions/using-the-logo.html,1,0.000059


In [None]:
vl_count_train_df.groupby('value_counts')['train_trafficSource.referralPath'].apply(list).to_frame()

Unnamed: 0_level_0,train_trafficSource.referralPath
value_counts,Unnamed: 1_level_1
1,"[/pin/132574782761087311, /zimbra/m/zmain, /pluginfile.php/72719/mod_resource/content/1/CourseWork_Web%20Metrics_Briefing.pdf, /questions, /presentation/d/1V7rLQihMCNQeBZOl2qOpd6vWIXjFBUDT-nwATG4kEcU/edit, /presentation/d/1-o__O8tPphC0jesb9AcJ21EJj8H5h585yhpYTtulHkI/present, /pin/A7B6jgAQwHkDlnLy-r4AAAA, /payload, /presentation/d/1V7rLQihMCNQeBZOl2qOpd6vWIXjFBUDT-nwATG4kEcU/present, /presentation/d/1aaKFgSzAoOoJnD7bjw98G61XBiX_9OPRdWWEJUR5gSg/present, /~kristiyan/2082_pure_mix/index.html, /yt/space/rio/unlock.html, /zwzp9zSOvd, /yt/startcreating/ALL_de/index.html, /presentation/d/1ZaQRTK3_vgwq0yCaaIGeQ5ecprXTASxgy0YhA6fn9_4/present, /pin/2251868541515811, /presentation/d/1xnPwOFSGz6L2OMzRsW8RtFarEKTDIdf4s-RdvoNcBVU/present, /pin/227572587401480066, /yt/space/tokyo/learn.html, /popular, /pin/299911656420971465/, /pin/297730225350839312/, /pin/299911656420971467/, /presentation/d/1WlROpIUO7tLDg8fQiSRtM-jQNpH3zB0zYheZ7YDcngE/edit, /pin/276619602092009519, /pin/345932815112616204/, /pagead/gadgets/teracent_product_template_V1/DoubleImg_1a_728x90.html, /pluginfile.php/444966/mod_resource/content/1/Case%20Analysis%204%20-%20Google%20Merchandise%20Store.pdf, /yt/space/toronto/unlock.html, /presentation/d/15qwsJ6SqDfv-uT_LL3NV02Bq5nafZRchYBbo_u7O1R0/edit, /yt/videomasthead/zh-HK/index.html, /presentation/d/1MX8JUFZ9jScwkH-ib69oJLG4iIoDRTRd3g_rHznwOzQ/present, /pin/426082814733567791/, /pin/426082814733567802/, /yt/space/toronto/learn.html, /presentation/d/1JBr-NZ4LvqyV0ZACcp2d5hWA3AdmVND8N_uB8sHUAVY/present, /pin/458452437059157764, /pin/564920346990344470/, /presentation/d/1IzXiRpzcevazUPjgHRYxZRo4-OItPGjBE4b66emIP_w/present, /pin/714242822123061311, /presentation/d/1G0bS8tSEb8R_cqbsz5PrFoM7nyu-lGQuiJ-o-TRBLW8/present, /pin/723038915146464327, /presentation/d/17fftl9gIkiuFNQoLukKU1BtkJndgyTcrzQCcmKOiqbo/present, /pin/426082814733567791, /presentation/d/1NVU5Px90ygSVOuDYwspjUENaeqCxkzRIdEhwWeNgrQs/present, /yt/space/tokyo/location.html, /profile/Rebecca-Sealfon, 
/presentation/d/1031aZFf2rJXaNts1RUIHp8uhHZzDExUdHu_0D0Spza4/edit, /presentation/d/1R-rol-7i9m7c0FairImVGHgaIft87pwH2Sxn6GGuTCc/present, /presentation/d/1031aZFf2rJXaNts1RUIHp8uhHZzDExUdHu_0D0Spza4/present, /presentation/d/12OuetR02BmdwoOZl2B_ecm7uEhxy-MoObZWUvu3rK2o/present, /pt-BR/golangbr/events/226482468/, /zimbra/h/printmessage, /pin/372180356689276551, /presentation/d/1QPz_RQ_bEulRvvNtHrsWIdvqG4Vkeg4GM70ki_raDg0/present, /yt/about/gl/index.html, /presentation/d/1OCihfMr3WLnr2fTHOuu2sw_hPuyh7whd6ov1Xm-4sqs/edit, /yt/space/toronto/index.html, /pluginfile.php/70656/mod_resource/content/2/TrabajoFinal1_Analytics.pdf, /presentation/d/1jG1soueprUc0Ust1ZfGD5pHRLMIsi-hyejX9XnDr2Xw/present, /yt/videomasthead/tr/index.html, /yt/space/tokyo/creative.html, /presentation/d/1yyX2DMnr5NaQqckh92Gy-YECS9hZm30cUMl-UUscs6E/present, /webmasters/verification/verification, /r/GooglePixel/comments/76whng/google_pixel_2xl_review_megathread, /yt/dev/it/dev-stories.html, /yt/lineups/de/us-beauty.html, /yt/dev/id/dev-stories.html, /spreadsheets/d/1z4gtD4puUkHp-61vjGSIKlxORFccQ1vktEF12NQTBx4/edit, /yt/lineups/de/de-autos.html, /store/music, /success-stories/hostelworld-drives-259-increase-in-app-installs-with-ambitious-speak-the-world-campaign, /yt/lineups/de/channel.html, /yt/lineups/ca-music.html, /tagger, /yt/lineups/br-science.html, /yt/space/rio/access.html, /yt/lineups/ar/us-beauty.html, /yt/dev/ja/api-resources.html, /trends/, /yt/lineups/ar/italy.html, /yt/lineups/ar/germany.html, /u/0/messages, /u/0/search/Virginia%20Poltrack, /yt/lineups/ae-music.html, /yt/lineups/ae-beauty.html, /you%20boy%20tube, /yt/lineups/de/us-comedy.html, /spreadsheets/d/1uKmjEpG7fHQaH_hLtEE80AY57PH73RPJ71XrsaCVgjY/edit, /spreadsheets/d/1foXke_jLIWYidivCFCf1PQAkh6LB5Ka26RqIHDuuR2M/edit, /yt/dev/el/api-resources.html, /yt/lineups/es-419/brazil.html, /spreadsheets/d/1MswhKIvHEk0BJ2DJ1wIVsDgbitDMA4587dreP_1dPQc/edit, /yt/dev/el/demos.html, /yt/lineups/es-419/br-family.html, /yt/lineups/es-419/, 
/yt/lineups/en/us-music.html, /youtube/forum/AAAAtMBnpzw3xfCBccRfq4, /yt/lineups/en/us-entertainment.html, /yt/dev/el/index.html, ...]"
2,"[/yt/smartoffline/airtelin/, /a/google.com/nestlabs/our-team/product-marketing, /EQsmUQOCA3, /presentation/d/1d4D0NuEXJDD--zKyWzkm1IHjr7k89Coq5yf1BKtZq3M/present, /intl/tr/permissions/using-the-logo.html, /_/scs/mail-static/_/js/k=gmail.main.en.sXDiEpUnPe0.O/m=m_i,t,it/am=nhGvDGD-3_uDcQ3DgK701brz33u-Xyo_e7nH_ycDROlVoP_N_h_A_4H-tI0C/rt=h/d=1/rs=AHGWq9Bm-1qj1hhkb8cG9EpeilyYgWjoxA, /intl/ja/yt/creators/ambassador/, /intl/vi/yt/creators-for-change/get-involved.html, /presentation/d/1wY-9lwYD9WnZS5aqhaCTx6q0LVbRMX9IMv0L_Y7j3AI/present, /document/d/1vu8s73msqyinG4FwaGHnsl3eyQyLMgZQCPSL0JnnrZ8/edit, /5833404830711808, /yt/devices/es-419/index.html, /oQpecIYHBW, /gp/baby-reg/paul-hobbs-tricia-hobbs-november-2016-upland, /presentation/d/1chItZzpBiQ9dElYqjfcT5AelknXyBCahN-HbWKQ4m-g/edit, /yt/devices/fr/, /gp/aw/ls/ref=aw_wl_vv_nxt_1, /intl/vi/yt/creators/benefits/silver/, /intl/fi_ALL/yt/advertise/, /3HFkTPo8YI, /yt/originals/ko/index.html, /2jOketZr1r, /yt/dev/ko/index.html, /intl/th/yt/nextup/, /hangouts/_/webjuice.dk/dan, /mail/mu/mp/656/, /hangouts/_/google.com/nordic-nal, /hangouts/_/hillcompanyinc.com/google-cloud, /r/nexus5x/comments/443w2p/google_prop_device_cradle_5x/, /_/scs/mail-static/_/js/k=gmail.main.en.dUc3eoDK4q8.O/m=m_i,t/am=OosHBMD8v-8PxjWMAjLSByrM-57nm0_lh53A4_8T8VHIqoD_m_0_gM8DvWkLBQ/rt=h/d=1/rs=AHGWq9D8IoFFkinhcl7oxOGr0l3JNdFzUw, /yt/devices/es/index.html, /intl/ko/yt/about/copyright/fair-use/, /yt/devices/pt-BR/index.html, /intl/ja/yt/creators/benefits/bronze/, /yt/family/, /r/startups/comments/33ky3t/stickers/, /intl/hu/yt/about/copyright/fair-use/, /yt/space/berlin/access.html, /drive/my-drive, /a/google.com/bigquerydemo/, /mail/mu/mp/594/, /a/google.com/nestlabs/facilities/security/cybersecurity, /www/2082_pure_mix/index.html, /yt/music/fil/index.html, /_/scs/mail-static/_/js/k=gmail.main.en.yASPeSEQ5YQ.O/m=m_i,t,it/am=OotXDrD_7_3BuIZhQFf6at3573--vdR-6Oce_58MiCKvAv1v9v8g_g_0p20U/rt=h/d=1/rs=AHGWq9DaFJKL46M3S_ik8o8ZOlXeAiF4Gg, 
/gp/registry/wishlist/ref=nav_youraccount_wl, /dynamic_in_page_V1/Responsive_murray2_GpaSingleIframe_preview.html, /intl/fa/yt/about/experiences/, /groups/epiuj2015/search/, /hangouts/_/calendar/dmMydnBwdm9udXZyYzRpdXJsNWd2NGJnZjRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ.5v5lq1fpfhnjfa21vr2qnul8r7, /intl/ja_ALL/yt/advertise/resources/, /wa6v1nSZFK, /presentation/d/1h6uWpQYy8Fjm9Cvb0f4_8Q37NpPX0DdoQ8lV2xFy5Q0/present, /intl/ja/yt/space/tokyo/access.html, /intl/tr/yt/creators-for-change/creators.html, /intl/en-GB/yt/creators-for-change/ambassadors-and-fellows.html, /yt/advertise/resources/incorporating-video-into-your-marketing-strategy/, /yt/about/copyright/index.html, /intl/fi/yt/about/brand-resources/, /hangouts/_/g-retail.com/rm-weekly, /intl/tr/yt/creators/benefits/opal/, /presentation/d/1o2u7Ms-IW95qCOxGE1p0SPTdHhNXPutRPnkq1OCCcGs/present, /flx/warn/, /why-youtube, /gp/baby-reg/paulandtricia-hobbs-november-2016-upland, /intl/uk/yt/creators-for-change/, /yt/dev/nl/demos.html, /hangouts/_/g-retail.com/kbanta, /yt/advertise/agency/case-studies/agency-brighton/, /a/google.com/nestlabs/home/announcements, /a-mysterious-pair-of-google-headphones-have-shown-up-in-fcc-filings/articleshow/58386374.cms, /intl/tr/yt/advertise/running-a-video-ad/, /a/buffaloseminary.org/2014-15-digital-literacy/module-10, /intl/uk/permissions/using-the-logo.html, /intl/el/yt/about/experiences/, /YSZB3NcKgb, /yt/space/berlin/calendar.html, /mail/mu/mp/479/, /mail/mu/mp/15/, /mail/mu/mp/469/, /p235brw0kjb/, /intl/ca/yt/about/policies/, /mail/mu/mp/132/, /If2T0stIhs, /intl/id/yt/creators/support-and-guidance/, /intl/id/yt/creators/partner-managers/, /mail/mu/mp/116/, /intl/bn/yt/about/copyright/, /mail/ca/u/1/, /l/5AQGpZkHeAQFza2tBs8PALqvRhZJ1xlV68CHBzi2K1DsB5Q/https%3A%2F%2Fshop.googlemerchandisestore.com%2FGoogle%2BRedesign%2FBags%2FBackpacks%2FGoogle%2BAlpine%2BStyle%2BBackpack.axd, /intl/bn/yt/about/brand-resources/, /KtfjgWJtWs, /KwcplBgcuY, /local/business/admin/bb/10248288688305646485, 
/r/golang, /loZDU6fUOn, /intl/ar_ALL/yt/advertise/talk-to-us/, /F1NiAYlb96, /L1do1UvNby, /intl/ar_ALL/yt/advertise/success-story/revzilla/, ...]"
3,"[/mail/mu/mp/807/, /yt/lineups/en-GB/, /selfie, /intl/pt-BR/yt/creators/events/calendar/, /mail/mu/mp/819/, /ads/richmedia/studio/pv2/46337864/20161102083329730/index.html, /spreadsheets/d/1i-8yHNQrtDU_vWitLZX6ggJo6X4bVeaVnBlOpx5_zs4/edit, /ads/richmedia/studio/pv2/45926887/dirty/index.html, /ads/richmedia/studio/pv2/46337864/dirty/index.html, /intl/pl/yt/creators/support-and-guidance/, /yt/lineups/fr/netherlands.html, /intl/pl/yt/creators-for-change/, /calendar/render, /intl/pl/yt/about/copyright/fair-use/, /mail/mu/mp/800/, /yt/advertise/success-story/rokenbok/, /intl/no/yt/about/experiences/, /yt/advertise/success-story/tuft-and-needle/, /sHr0apUT6e, /mail/mu/mp/704/, /intl/sw/yt/about/press/, /intl/fr-CA/yt/about/brand-resources/, /intl/gl/yt/about/copyright/, /intl/es-419/yt/creators/benefits-and-awards/, /safeframe/1-0-23/html/container.html, /intl/support-and-guidance/yt/creators/, /mail/mu/mp/760/, /yt/lineups/pt-BR/united-states.html, /safeframe/1-0-6/html/container.html, /intl/sr/yt/about/brand-resources/, /intl/ru/yt/creators/benefits/bronze/, /share, /intl/ru/yt/creators-for-change/get-involved.html, /intl/ro/yt/advertise/, /ads/richmedia/studio/pv2/47540503/20170228160422287/index.html, /pluginfile.php/70656/mod_resource/content/4/TrabajoFinal1_Analytics.pdf, /ads/richmedia/studio/pv2/60131022/20170327142959851/index.html, /intl/ro/permissions/using-the-logo.html, /ads/richmedia/studio/pv2/60143059/20170405153153640/index.html, /yt/dev/en/index.html, /spreadsheets/d/19BDJXdfpPmRe_pKfxAE8m3FOTiQotdZ4RM-Kcjnv5TM/edit, /mail/mu/mp/929/, /intl/de/yt/space/berlin/calendar.html, /youtube-works, /ads/richmedia/studio/pv2/60219295/20170601112307638/index.html, /intl/pt-BR/yt/creators-for-change/get-involved.html, /yt/lineups/en-GB/united-states.html, /intl/pt-BR/yt/space/rio/unlock.html, /intl/fr/yt/space/paris/access.html, /yt/dev/es/api-resources.html, /ads/richmedia/studio/pv2/47540409/20170301113940661/index.html, /yt/lineups/en/mexico.html, 
/intl/ro/yt/creators/benefits/silver/, /r/Android/comments/7bi63c/bank_of_america_is_offering_these_for_setting/.compact, /intl/fr/yt/creators/benefits-and-awards/, /ads/richmedia/studio/pv2/47475561/20170210110646716/index.html, /ads/richmedia/studio/pv2/47509701/20170214145638974/index.html, /mail/mu/mp/859/, /intl/de_ALL/yt/advertise/how-it-works/, /ads/richmedia/studio/pv2/47513302/20170216112251907/index.html, /pin/345932815112616204, /DVbDmqErUv, /yt/dev/de/index.html, /intl/fr/yt/creators/benefits/opal/, /site/northcountycomputer/permalinks, /axDlhy4117, /ads/richmedia/studio/pv2/47540308/20170228152609756/index.html, /yt/advertise/success-story/missouri-star-quilt/, /intl/is/yt/about/policies/, /yt/advertise/resources/measure-impact-of-video-views-on-advertising-success/, /intl/fi_ALL/yt/advertise/how-it-works/, /yt/space/london/learn.html, /intl/ar/yt/creators/partner-managers/, /intl/es/yt/creators-for-change/ambassadors-and-fellows.html, /intl/es/yt/creators/ambassador/, /intl/zh-TW/yt/nextup/, /r/google/comments/3kxsxp/anyone_remember_when_google_gave_these_out_for/, /intl/ar/yt/creators/benefits/opal/, /intl/ar/yt/creators/benefits/graphite/, /intl/FR_fr/permissions/using-the-logo.html, /intl/zh-TW/yt/about/copyright/fair-use/, /NtbQbYrykE, /Y0UobRoJgN, /yt/about/sl/index.html, /Zj76VTPTRw, /_/scs/mail-static/_/js/k=gmail.main.en.Fn1T0YFQTe8.O/m=m_i,t/am=nhGPDLD_7_3BuIZRQFf6QoV57z3fLWk_coHH_ydNgLhWffxv9v8g_g_05i0U/rt=h/d=1/rs=AHGWq9CWzz6INdSwAbt1VRyR7PEfsMq3hQ, /intl/zh-CN/yt/creators/, /i, /intl/zh-CN/yt/advertise/, /_/scs/mail-static/_/js/k=gmail.main.en.K6x2xY7Z8sk.O/m=m_i,t,it/am=nhGPDGCejPuDcQ3DgK70Vbrz338-X9J-9HKP6k-YAFG9CvTf7P9B_B_oTdso/rt=h/d=1/rs=AHGWq9AUFhMpRDf9Ex-ypKSUpnNTPqLJng, /intl/ar/yt/space/index.html, /intl/es/yt/about/copyright/fair-use/, /yt/lineups/us-beauty.html, /mail/mu/mp/433/, /mail/mu/mp/35/, /yt/space/rio/, /intl/da_ALL/yt/advertise/making-a-video-ad/, /yt/space/paris/calendar.html, /yt/about/lv/index.html, 
/yt/about/lt/index.html, ...]"
4,"[/intl/pl_ALL/yt/advertise/making-a-video-ad/, /intl/it/yt/creators/benefits/opal/, /intl/pl_ALL/yt/advertise/running-a-video-ad/, /intl/pl/yt/creators/learn-and-connect/, /spreadsheets/d/1akPSHrjkumnRX8uNCWMlZ4L_fwUY4ynbqS-GutUYl7o/edit, /yt/dev/th/api-resources.html, /intl/it/yt/creators-for-change/, /intl/it/yt/creators/support-and-guidance/, /intl/pt-BR/yt/about/copyright/fair-use/, /aw/overview, /yt/dev/tr/api-resources.html, /intl/ko/yt/creators-for-change/, /cL3V1uLNaa, /yt/dev/vi/showcase.html, /yt/devices/zh-TW/index.html, /document/d/1rJ2acNCB_lfZlody9QSxRsty64KUabDr8Tv5r_9W5os/edit, /intl/ko/yt/creators/awards/, /intl/ko/yt/creators/benefits/, /intl/ja/yt/space/tokyo/unlock.html, /feedback/2016/3/about/dee/by/dee/SELF, /yt/lineups/ar/, /yt/lineups/ar/brazil.html, /feedback/2017/1/about/cucchiaro/by/cucchiaro/SELF, /yt/lineups/ar/canada.html, /intl/ja/yt/creators/partner-managers/, /intl/ja/yt/creators/nextup/, /yt/lineups/ar/netherlands.html, /intl/ja/yt/creators/events/calendar/, /intl/ml/yt/about/copyright/, /forum/embed/, /intl/mr/yt/about/copyright/, /cr, /yt/devices/es/, /yt/devices/de/, /yt/devices/ar/index.html, /intl/iw/yt/about/brand-resources/, /class_sections/29619/assignments/1604433, /yt/dev/zh-TW/demos.html, /class_sections/29617/assignments/1588723, /attachment/u/0/, /intl/en-GB_ALL/yt/advertise/running-a-video-ad/, /intl/pt-BR/yt/creators/benefits/bronze/, /r/Android/comments/7651gu/the_google_pixel_2_has_a_hidden_but_disabled_dark/, /r/androidcirclejerk/comments/73uvrg/official_blob_emoji_stickers_from_google_only_230/, /mail/mu/mp/18/, /mail/mu/mp/170/, /IUI6ugD07C, /r/findfashion/comments/5j1sci/google_backpack/, /loZ30NuZKs, /link, /yt/space/mumbai/learn.html, /khcOOdGodG, /Ps70eHmIcf, /item, /Redirect/ToExternal/, /intl/zh-TW/yt/creators/benefits/opal/, /yt/space/losangeles/location.html, /intl/zh-TW/yt/creators-for-change/get-involved.html, /UbMMqIvec9, /intl/zh-TW/yt/advertise/, /yt/space/newyork/unlock.html, /mail/mu/mp/492/, 
/intl/zh-HK/yt/about/brand-resources/, /presentation/d/1vxSdibn1QLUzgriz-ohvKzA1vXUABFU7pse_12BxRbE/present, /pEnXcP0CYj, /pB61DHl95y, /pagead/html/r20171206/r20170110/zrt_lookup.html, /owa/redir.aspx, /zimbra/mail, /pagead/render_post_ads_v1.html, /neo/b/message, /nSOKfTS6aS, /APfyhRAkfL, /mjQh01wrDG, /yt/space/toronto/calendar.html, /matttproud/gochecklist/blob/master/README.md, /maps/d/u/0/viewer, /C1HS0hIP0E, /yt/space/tokyo/unlock.html, /pin/714242822122965250, /mail/mu/mp/785/, /Visit-Google-Headquarters, /r/marketing/comments/6at5fs/vendor_for_online_company_apparelmerch_store/, /yt/lineups/en/united-kingdom.html, /ads-publisher-controls/drx/4/creativereview, /intl/ta/yt/about/copyright/, /safeframe/1-0-13/html/container.html, /intl/sv/yt/creators/learn-and-connect/, /ads/richmedia/studio/pv2/45680226/dirty/index.html, /intl/sv/permissions/using-the-logo.html, /search.php, /yt/lineups/ko/, /intl/ru/yt/creators/partner-managers/, /intl/ru/yt/creators/benefits-and-awards/, /intl/ru/yt/advertise/making-a-video-ad/, /ads/richmedia/studio/pv2/47369778/20170126222357887/index.html, /intl/ro/yt/creators/learn-and-connect/, /yt/lineups/es/france.html, /yt/lineups/es-419/mexico.html, /spreadsheets/d/1MauBgOONLrgZ5XBXLcgBGqEMZyW_0EyEXNQ5WTZBFfU/edit, ...]"
5,"[/a/google.com/nestlabs/facilities/food, /yt/space/berlin/facilities.html, /a/google.com/nestlabs/facilities/rews-team, /a/google.com/nestlabs/facilities/on-site-amenities, /intl/et/yt/about/policies/, /intl/learn-and-connect/yt/creators/, /triage/, /a/google.com/nestlabs/-nestlife/nest-holiday-party, /intl/vi/permissions/using-the-logo.html, /intl/vi/yt/creators/partner-managers/, /intl/vi/yt/creators/nextup/, /a/buffaloseminary.org/2014-15-digital-literacy/new-module-4---google, /yt/music/ca/index.html, /intl/ko/yt/creators/events/calendar/, /a/google.com/nestlabs/home/announcements/nestproductteamishiring, /yt/lineups/saudi-arabia.html, /intl/tr/yt/about/copyright/fair-use/, /ads/richmedia/studio/pv2/46877600/dirty/index.html, /analytics/answer/6164470, /intl/pt-BR/yt/nextup/, /aw/ads, /ads/richmedia/studio/pv2/60273899/dirty/index.html, /aw/ads/edit/display, /intl/pt-BR_ALL/yt/advertise/resources/how-to-make-a-video-ad-that-fits-your-marketing-strategy/, /intl/pt-BR_ALL/yt/advertise/talk-to-us/, /intl/fr/yt/creators/support-and-guidance/, /blog/2018/04/27/validating-google-analytics-hits-network-tab/, /site/northcountycomputer/, /intl/fr/yt/creators/benefits/graphite/, /yt/lineups/es/united-states.html, /yt/dev/de/api-resources.html, /intl/fr/yt/creators-for-change/get-involved.html, /yt/dev/ar/dev-stories.html, /yt/dev/ru/index.html, /intl/ru/yt/space/index.html, /yt/lineups/ja/, /intl/sl/yt/about/policies/, /intl/partner-managers/yt/creators/, /intl/no/yt/about/press/, /intl/gl/yt/about/press/, /intl/es_ALL/permissions/using-the-logo.html, /intl/fil/yt/about/press/, /intl/te/yt/about/policies/, /yt/lineups/sweden.html, /intl/nl/yt/about/experiences/, /getpluginfile/photo_quality_17_labels/index.html, /intl/th/yt/creators/nextup/, /tagmanager/answer/6164470, /intl/es_ALL/yt/advertise/, /XgNQ4oQwcd, /yt/space/paris/, /mail/mu/mp/327/, /intl/it/yt/creators/learn-and-connect/, /intl/ja/yt/creators/benefits/opal/, /mads/gma, 
/intl/es-419/permissions/using-the-logo.html, /intl/es-419_ALL/yt/advertise/running-a-video-ad/, /IBmpuzFwCe, /yt/dev/tr/demos.html, /yt/dev/zh-TW/index.html, /BNRXVnfS6u, /yt/dev/pt-BR/index.html, /intl/id/yt/creators/benefits/opal/, /yt/space/london/access.html, /mail/mu/mp/399/, /intl/ar_ALL/yt/advertise/resources/, /intl/it_ALL/yt/advertise/how-it-works/, /mail/mu/mp/423/, /mail/mu/mp/426/, /intl/de_ALL/yt/advertise/resources/optimizing-your-video-marketing-campaigns/, /mail/mu/mp/438/, /mail/mu/mp/459/, /intl/es-419/yt/creators/nextup/, /publicconstrt300/, /yt/dev/tr/showcase.html, /intl/de/yt/creators/support-and-guidance/, /mail/mu/mp/720/, /yt/dev/vi/dev-stories.html, /webchat/u/0/frame, /5893803076747264, /SyGQD0CxXj, /opportunity/job/2000000046743, /document/d/1uELQ97D_yoYkVD_C7Jqc2i1R2gQfn9H0KjfVbu3m-kU/edit, /intl/zh-HK/yt/about/experiences/, /yt/space/london/unlock.html, /iNTz4Le7x0, /0x50/items/208dcb46005533a9d889, /yt/devices/vi/index.html, /yt/dev/th/demos.html, /intl/ar/yt/about/copyright/fair-use/, /yt/dev/th/index.html, /intl/ar/yt/creators/awards/, /en/, /evaluation/endor/mturk, /OHt4vFnFOy, /mail/mu/mp/933/, /intl/it/yt/creators/benefits/graphite/, /intl/ar/yt/creators/learn-and-connect/, /neo/m/message, /intl/ar/yt/space/, ...]"
...,...
14600,[/yt/about/tr/]
33112,[/analytics/web/]
79163,[/yt/about/]
138293,[/]


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.referralPath.csv', names = ['index', 'test_trafficSource.referralPath']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.referralPath
index,Unnamed: 1_level_1
0,(not set)
1,(not set)
2,(not set)
3,(not set)
4,(not set)


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.referralPath,value_counts
0,(not set),305203
1,/,37276
2,/analytics/app/,12740
3,/a/google.com/googletopia/discounts-deals-and-free-stuff/alphabet-google-discounts,4227
4,/yt/creators/,1653
...,...,...
1750,/intl/iw/yt/creators-for-change/,1
1751,/page/lesson/sound,1
1752,/intl/it_it/permissions/using-the-logo.html,1
1753,/intl/it_ALL/yt/advertise/running-a-video-ad/,1
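
In the outputs above, the train column appears to mark a missing path as NaN while the test column uses the placeholder `(not set)`. A minimal sketch of bringing both to one representation, assuming that placeholder is the only one in use; `normalize_missing` is a hypothetical helper and the Series are toy data.

```python
import pandas as pd

# Toy stand-ins: train encodes a missing path as NaN, test as '(not set)'.
train_path = pd.Series([None, '/', '/yt/about/'], dtype=object)
test_path = pd.Series(['(not set)', '/', '/analytics/app/'], dtype=object)


def normalize_missing(paths: pd.Series) -> pd.Series:
    """Turn the '(not set)' placeholder into a real missing value."""
    # where() keeps entries for which the condition is True, NaN otherwise.
    return paths.where(paths != '(not set)')


train_norm = normalize_missing(train_path)
test_norm = normalize_missing(test_path)

print(train_norm.isna().sum(), test_norm.isna().sum())
```

With one representation for "missing", value counts from train and test can be compared directly.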


### 8.14 trafficSource.source
The source of the traffic source. Could be the name of the search engine, the referring hostname, or a value of the utm_source URL parameter.

In [None]:
explore_train_df = pd.read_csv('train_trafficSource.source.csv', names = ['index', 'train_trafficSource.source']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_trafficSource.source
index,Unnamed: 1_level_1
0,google
1,sites.google.com
2,(direct)
3,google
4,google


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_trafficSource.source,value_counts,percent
0,google,658384,38.539469
1,(direct),565975,33.130173
2,youtube.com,329450,19.284837
3,analytics.google.com,37436,2.191371
4,Partners,32931,1.927664
...,...,...,...
340,m.wikihow.com,1,0.000059
341,fr.yhs4.search.yahoo.com,1,0.000059
342,cz.pinterest.com,1,0.000059
343,dailydot.com,1,0.000059
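
The `percent` columns in these cells divide by the hard-coded row count 1708337. A small sketch of deriving the denominator from the data instead, so the same cell could be reused for the test files; the `source` Series is toy data.

```python
import pandas as pd

# Toy stand-in for one exported column (hypothetical values).
source = pd.Series(['google', '(direct)', 'google', None, 'youtube.com'])

# Absolute counts, keeping missing values as their own bucket.
counts = source.value_counts(dropna=False)

# Share of rows in percent, using the Series length as the denominator.
summary = counts.to_frame('value_counts')
summary['percent'] = summary['value_counts'] / len(source) * 100

print(summary)
```

Passing the column name to `to_frame` also avoids relying on the default name that `value_counts` gives its result, which differs between pandas versions.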


In [None]:
explore_test_df = pd.read_csv('test_trafficSource.source.csv', names = ['index', 'test_trafficSource.source']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_trafficSource.source
index,Unnamed: 1_level_1
0,google
1,(direct)
2,google
3,(direct)
4,google


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_trafficSource.source,value_counts
0,google,208597
1,(direct),111318
2,youtube.com,28093
3,analytics.google.com,12721
4,Partners,10836
...,...,...
187,docs.google.com,1
188,hume.google.com,1
189,results.searchlock.com,1
190,ro.search.yahoo.com,1


## 9. visitId (drop this column)
An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.

In [None]:
explore_train_df = pd.read_csv('train_visitId.csv', names = ['index', 'train_visitId']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_visitId
index,Unnamed: 1_level_1
0,1508198450
1,1508176307
2,1508201613
3,1508169851
4,1508190552


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_visitId,value_counts,percent
0,1513125098,28,0.001639
1,1513124981,28,0.001639
2,1513124949,26,0.001522
3,1513124997,24,0.001405
4,1513125008,24,0.001405
...,...,...,...
1665797,1488726052,1,0.000059
1665798,1488726043,1,0.000059
1665799,1488725980,1,0.000059
1665800,1488725740,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_visitId.csv', names = ['index', 'test_visitId']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_visitId
index,Unnamed: 1_level_1
0,1526099341
1,1526064483
2,1526067157
3,1526107551
4,1526060254


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_visitId,value_counts
0,1526625863,9
1,1526628427,7
2,1525803386,7
3,1539387034,7
4,1533081067,7
...,...,...
393176,1529473398,1
393177,1529473395,1
393178,1529473393,1
393179,1529473295,1
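
The counts above show the same visitId occurring in several rows, which fits the description: the value is only unique per user. A minimal sketch of the suggested combination of fullVisitorId and visitId; the `sessionId` column name and the toy IDs are assumptions for illustration.

```python
import pandas as pd

# Toy visits (hypothetical IDs); fullVisitorId is kept as a string.
visits = pd.DataFrame({
    'fullVisitorId': ['0001', '0002', '0001'],
    'visitId': [1513125098, 1513125098, 1513125200],
})

# visitId alone repeats across users ...
print(visits['visitId'].is_unique)

# ... so combine it with fullVisitorId to build a session key.
visits['sessionId'] = visits['fullVisitorId'] + '_' + visits['visitId'].astype(str)
print(visits['sessionId'].is_unique)
```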


## 10. visitNumber
The session number for this user. If this is the first session, then this is set to 1.

In [None]:
explore_train_df = pd.read_csv('train_visitNumber.csv', names = ['index', 'train_visitNumber']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_visitNumber
index,Unnamed: 1_level_1
0,1
1,6
2,1
3,1
4,1


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_visitNumber,value_counts,percent
0,1,1307430,76.532324
1,2,182542,10.685362
2,3,70962,4.153864
3,4,37886,2.217712
4,5,23314,1.364719
...,...,...,...
452,352,1,0.000059
453,403,1,0.000059
454,402,1,0.000059
455,401,1,0.000059


In [None]:
explore_test_df = pd.read_csv('test_visitNumber.csv', names = ['index', 'test_visitNumber']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_visitNumber
index,Unnamed: 1_level_1
0,2
1,166
2,2
3,4
4,1


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_visitNumber,value_counts
0,1,286065
1,2,52547
2,3,21500
3,4,11309
4,5,6967
...,...,...
381,350,1
382,351,1
383,352,1
384,353,1
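
Since visitNumber is 1 for a user's first session, it can be turned into simple per-visit and per-user signals. A rough sketch under that reading; the column names `isFirstVisit` and `max_visit` are hypothetical and the data is a toy example.

```python
import pandas as pd

# Toy visits (hypothetical values).
visits = pd.DataFrame({
    'fullVisitorId': ['a', 'a', 'b', 'c'],
    'visitNumber': [1, 2, 1, 6],
})

# A session is the user's first one when visitNumber equals 1.
visits['isFirstVisit'] = visits['visitNumber'].eq(1)

# Highest session number seen per user within the data.
max_visit = visits.groupby('fullVisitorId')['visitNumber'].max()

print(visits)
print(max_visit)
```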


## 11. visitStartTime 
The timestamp (expressed as POSIX time).

In [None]:
explore_train_df = pd.read_csv('train_visitStartTime.csv', names = ['index', 'train_visitStartTime']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_visitStartTime
index,Unnamed: 1_level_1
0,1508198450
1,1508176307
2,1508201613
3,1508169851
4,1508190552


In [None]:
explore_test_df = pd.read_csv('test_visitStartTime.csv', names = ['index', 'test_visitStartTime']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_visitStartTime
index,Unnamed: 1_level_1
0,1526099341
1,1526064483
2,1526067157
3,1526107551
4,1526060254
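
POSIX time counts seconds since 1970-01-01 00:00:00 UTC, so the raw integers can be converted to timestamps before deriving calendar features. A minimal sketch using the first three training values shown above; the feature names `hour` and `dayofweek` are illustrative choices, and the result is in UTC rather than the visitor's local time.

```python
import pandas as pd

# First rows of train_visitStartTime shown above.
start = pd.Series([1508198450, 1508176307, 1508201613])

# Interpret the integers as seconds since the Unix epoch (UTC).
start_dt = pd.to_datetime(start, unit='s', utc=True)

# Simple calendar features derived from the timestamp.
features = pd.DataFrame({
    'hour': start_dt.dt.hour,
    'dayofweek': start_dt.dt.dayofweek,  # Monday=0, Sunday=6
})

print(start_dt)
print(features)
```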


## 12. customDimensions
This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.

In [None]:
explore_train_df = pd.read_csv('train_customDimensions.csv', names = ['index', 'train_customDimensions']).set_index('index')
explore_train_df.head()

Unnamed: 0_level_0,train_customDimensions
index,Unnamed: 1_level_1
0,"[{'index': '4', 'value': 'EMEA'}]"
1,"[{'index': '4', 'value': 'North America'}]"
2,"[{'index': '4', 'value': 'North America'}]"
3,"[{'index': '4', 'value': 'EMEA'}]"
4,"[{'index': '4', 'value': 'Central America'}]"


In [None]:
vl_count_train_df = explore_train_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_customDimensions,value_counts,percent
0,"[{'index': '4', 'value': 'North America'}]",768223,44.969055
1,[],333235,19.506397
2,"[{'index': '4', 'value': 'EMEA'}]",313991,18.379922
3,"[{'index': '4', 'value': 'APAC'}]",222071,12.99925
4,"[{'index': '4', 'value': 'South America'}]",45553,2.666511
5,"[{'index': '4', 'value': 'Central America'}]",25264,1.478865


In [None]:
explore_test_df = pd.read_csv('test_customDimensions.csv', names = ['index', 'test_customDimensions']).set_index('index')
explore_test_df.head()

Unnamed: 0_level_0,test_customDimensions
index,Unnamed: 1_level_1
0,"[{'index': '4', 'value': 'APAC'}]"
1,"[{'index': '4', 'value': 'North America'}]"
2,"[{'index': '4', 'value': 'North America'}]"
3,"[{'index': '4', 'value': 'North America'}]"
4,"[{'index': '4', 'value': 'North America'}]"


In [None]:
vl_count_test_df = explore_test_df.value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_test_df

Unnamed: 0,test_customDimensions,value_counts
0,"[{'index': '4', 'value': 'North America'}]",193825
1,"[{'index': '4', 'value': 'EMEA'}]",69929
2,"[{'index': '4', 'value': 'APAC'}]",63088
3,[],60581
4,"[{'index': '4', 'value': 'South America'}]",9565
5,"[{'index': '4', 'value': 'Central America'}]",4601


In [None]:
import ast

# parse the string representation of each list into a Python list
explore_train_df['train_customDimensions'] = explore_train_df['train_customDimensions'].apply(ast.literal_eval)

# keep the single dict of each list; use an empty dict when the list is empty
explore_train_df['train_customDimensions'] = explore_train_df['train_customDimensions'].apply(lambda x: x[0] if len(x) == 1 else {})

# flatten the dicts into one column per key
explore_train_df = pd.json_normalize(explore_train_df['train_customDimensions'])
explore_train_df.columns = ['train_customDimensions' + '_' + col for col in explore_train_df.columns]
explore_train_df.head()

Unnamed: 0,train_customDimensions_index,train_customDimensions_value
0,4,EMEA
1,4,North America
2,4,North America
3,4,EMEA
4,4,Central America


### 12.1 CustomDimensions Index (drop this column)
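The index (slot number) of the custom dimension. Only index 4 appears in the training data, so the column carries no information beyond whether a dimension is set.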

In [None]:
vl_count_train_df = explore_train_df['train_customDimensions_index'].to_frame().value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df

Unnamed: 0,train_customDimensions_index,value_counts
0,4.0,1375102
1,,333235


In [None]:
# based on https://support.google.com/analytics/answer/2709828?hl=en#zippy=%2Cin-this-article 
des = {
"4": "Product-level scope"
}
des_df = pd.DataFrame(list(des.items()), columns = ['train_customDimensions_index', 'description'])

In [None]:
pd.merge(vl_count_train_df, des_df, on=['train_customDimensions_index'], how='outer').set_index('train_customDimensions_index')

Unnamed: 0_level_0,value_counts,description
train_customDimensions_index,Unnamed: 1_level_1,Unnamed: 2_level_1
4.0,1375102,Product-level scope
,333235,


### 12.2 CustomDimensions Value

In [None]:
vl_count_train_df = explore_train_df['train_customDimensions_value'].to_frame().value_counts(dropna=False).to_frame().rename(columns={0:'value_counts'}).reset_index()
vl_count_train_df['percent'] = vl_count_train_df['value_counts'] / 1708337 * 100
vl_count_train_df

Unnamed: 0,train_customDimensions_value,value_counts,percent
0,North America,768223,44.969055
1,,333235,19.506397
2,EMEA,313991,18.379922
3,APAC,222071,12.99925
4,South America,45553,2.666511
5,Central America,25264,1.478865


In [None]:
des = {
"North America": "North America",
"EMEA": "Europe, the Middle East and Africa",
"APAC": "Asia-Pacific",
"South America": "South America",
"Central America": "Central America"
}
des_df = pd.DataFrame(list(des.items()), columns = ['train_customDimensions_value', 'description'])

In [None]:
pd.merge(vl_count_train_df, des_df, on=['train_customDimensions_value'], how='outer').set_index('train_customDimensions_value')

Unnamed: 0_level_0,value_counts,percent,description
train_customDimensions_value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
North America,768223,44.969055,North America
,333235,19.506397,
EMEA,313991,18.379922,"Europe, the Middle East and Africa"
APAC,222071,12.99925,Asia-Pacific
South America,45553,2.666511,South America
Central America,25264,1.478865,Central America


# FEATURE ENGINEERING
Based on this solution: https://www.kaggle.com/competitions/ga-customer-revenue-prediction/discussion/82614 

In [None]:
from os import listdir
lst_file = listdir()
lst_train_file = [f for f in lst_file if 'train' in f]
lst_test_file = [f for f in lst_file if 'train' not in f and 'csv' in f]

train_default_df = pd.DataFrame()
test_default_df = pd.DataFrame()

In [None]:
for f in lst_train_file:
    col = f[:-4]  # file name without the '.csv' extension
    # fullVisitorId must be read as a string; read as a number,
    # distinct Ids can no longer be told apart
    dtype = {col: str} if col.endswith('fullVisitorId') else None
    train_default_df[col] = pd.read_csv(f, names = ['index', col],
                             low_memory=False, dtype=dtype).set_index('index')[col]

df = train_default_df.copy()
df.columns = [c[6:] for c in df.columns]  # strip the 'train_' prefix
df.head(2)

Unnamed: 0_level_0,channelGrouping,customDimensions,date,device.browser,device.browserVersion,device.browserSize,device.deviceCategory,device.flashVersion,device.isMobile,device.language,device.mobileDeviceBranding,device.mobileDeviceInfo,device.mobileDeviceModel,device.mobileDeviceMarketingName,device.operatingSystem,device.mobileInputSelector,device.operatingSystemVersion,device.screenColors,device.screenResolution,fullVisitorId,geoNetwork.city,geoNetwork.cityId,geoNetwork.continent,geoNetwork.country,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.metro,geoNetwork.networkDomain,geoNetwork.networkLocation,geoNetwork.region,geoNetwork.subContinent,socialEngagementType,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.sessionQualityDim,totals.timeOnSite,totals.totalTransactionRevenue,totals.transactionRevenue,totals.transactions,totals.visits,trafficSource.adContent,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.campaignCode,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source,visitId,visitNumber,visitStartTime
0,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",20171016,Firefox,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,3162355547410993243,not available in demo dataset,not available in demo dataset,Europe,Germany,not available in demo dataset,not available in demo dataset,not available in demo dataset,(not set),not available in demo dataset,not available in demo dataset,Western Europe,Not Socially Engaged,1.0,1,1.0,1.0,1.0,,,,,1,,,not available in demo dataset,,,,,(not set),,,water bottle,organic,,google,1508198450,1,1508198450
1,Referral,"[{'index': '4', 'value': 'North America'}]",20171016,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Chrome OS,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,8934116514970143966,Cupertino,not available in demo dataset,Americas,United States,not available in demo dataset,not available in demo dataset,San Francisco-Oakland-San Jose CA,(not set),not available in demo dataset,California,Northern America,Not Socially Engaged,,2,,2.0,2.0,28.0,,,,1,,,not available in demo dataset,,,,,(not set),,,,referral,/a/google.com/transportation/mtv-services/bikes/bike2workmay2016,sites.google.com,1508176307,6,1508176307
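The overview asks for `fullVisitorId` to be loaded as a string. A minimal sketch of why, using a made-up two-row file in the same `index,value` layout as the per-column CSVs (the Id values here are invented for illustration and are not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical file content: two different Ids that differ only by leading zeros.
raw = "0,0000123\n1,123\n"

# Default parsing turns both Ids into the integer 123.
as_number = pd.read_csv(io.StringIO(raw), names=['index', 'fullVisitorId']).set_index('index')

# Restricting dtype=str to the Id column keeps the Ids apart
# and leaves the index column numeric.
as_string = pd.read_csv(io.StringIO(raw), names=['index', 'fullVisitorId'],
                        dtype={'fullVisitorId': str}).set_index('index')

print(as_number['fullVisitorId'].nunique())  # 1: the two Ids collapsed into one
print(as_string['fullVisitorId'].nunique())  # 2: the two Ids stay distinct
```

Passing the dtype as a dict rather than `dtype=str` matters here: the index column stays numeric, so the column still aligns with the other per-column files when it is assigned into the shared DataFrame.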


In [None]:
for f in lst_test_file:
    col = f[:-4]  # file name without the '.csv' extension
    # fullVisitorId must be read as a string; read as a number,
    # distinct Ids can no longer be told apart
    dtype = {col: str} if col.endswith('fullVisitorId') else None
    test_default_df[col] = pd.read_csv(f, names = ['index', col],
                             low_memory=False, dtype=dtype).set_index('index')[col]

test_df = test_default_df.copy()
test_df.columns = [c[5:] for c in test_df.columns]  # strip the 'test_' prefix
test_df.head(2)

Unnamed: 0_level_0,channelGrouping,device.deviceCategory,device.browserSize,date,device.browserVersion,device.browser,customDimensions,device.mobileDeviceMarketingName,device.mobileDeviceInfo,device.mobileDeviceBranding,device.screenColors,device.isMobile,device.mobileDeviceModel,device.operatingSystemVersion,device.operatingSystem,device.language,device.mobileInputSelector,device.flashVersion,device.screenResolution,geoNetwork.city,fullVisitorId,geoNetwork.longitude,geoNetwork.latitude,geoNetwork.continent,geoNetwork.metro,geoNetwork.country,geoNetwork.cityId,geoNetwork.networkDomain,geoNetwork.region,geoNetwork.subContinent,socialEngagementType,geoNetwork.networkLocation,totals.bounces,totals.hits,totals.sessionQualityDim,totals.totalTransactionRevenue,totals.pageviews,totals.newVisits,totals.transactionRevenue,totals.transactions,totals.timeOnSite,trafficSource.isTrueDirect,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.campaign,trafficSource.adwordsClickInfo.gclId,trafficSource.keyword,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.adContent,totals.visits,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.slot,visitNumber,trafficSource.source,trafficSource.referralPath,trafficSource.medium,visitId,visitStartTime
0,Organic Search,mobile,not available in demo dataset,20180511,not available in demo dataset,Chrome,"[{'index': '4', 'value': 'APAC'}]",not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,True,not available in demo dataset,not available in demo dataset,Android,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,(not set),7460955084541987166,not available in demo dataset,not available in demo dataset,Asia,(not set),India,not available in demo dataset,unknown.unknown,Delhi,Southern Asia,Not Socially Engaged,not available in demo dataset,,4,1,,3.0,,,,973.0,True,,(not set),,(not provided),not available in demo dataset,(not set),1,,,,2,google,(not set),organic,1526099341,1526099341
1,Direct,desktop,not available in demo dataset,20180511,not available in demo dataset,Chrome,"[{'index': '4', 'value': 'North America'}]",not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,Macintosh,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,San Francisco,460252456180441002,not available in demo dataset,not available in demo dataset,Americas,San Francisco-Oakland-San Jose CA,United States,not available in demo dataset,(not set),California,Northern America,Not Socially Engaged,not available in demo dataset,,4,1,,3.0,,,,49.0,True,,(not set),,(not set),not available in demo dataset,(not set),1,,,,166,(direct),(not set),(none),1526064483,1526064483


In [None]:
drop_lst = ['trafficSource.campaignCode', # >90% NaN
            'trafficSource.adContent', # >90% NaN
            'trafficSource.keyword', # >90% NaN
            'trafficSource.adwordsClickInfo.slot', # >90% NaN
            'trafficSource.campaign', # >90% NaN
            'trafficSource.adwordsClickInfo.page', # >90% NaN
            'trafficSource.adwordsClickInfo.gclId', # >90% NaN
            'trafficSource.adwordsClickInfo.adNetworkType', # >90% NaN
            'totals.transactionRevenue',] # deprecated
            # 'visitStartTime'] # detail information for date

# also drop every column that holds a single value (NaN included): it carries no information
for col in df.columns:
    if df[col].nunique(dropna=False) == 1:
        drop_lst.append(col)

len(drop_lst)

28
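A small sketch of the single-value check on a toy frame (column names and values below are made up, apart from the "not available in demo dataset" placeholder seen in the data). It illustrates why NaN is counted as a value: a column such as `totals.bounces`, which is either `1.0` or missing, has two distinct values and is therefore kept.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'constant': ['not available in demo dataset'] * 3,  # one value everywhere
    'all_nan': [np.nan] * 3,                            # NaN everywhere
    'flag_or_nan': [1.0, np.nan, 1.0],                  # NaN counts as a second value
    'varied': ['a', 'b', 'a'],
})

# Same criterion as the notebook cell: exactly one distinct value, NaN included.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)  # ['constant', 'all_nan']
```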

In [None]:
df.drop(columns = drop_lst, inplace=True)
# drop_lst[0] ('trafficSource.campaignCode') is skipped: the test data has no such column
test_df.drop(columns = drop_lst[1:], inplace=True)


import ast

def parse_custom_dimension_value(x):
    # customDimensions is a string such as "[{'index': '4', 'value': 'EMEA'}]" or "[]"
    if x == '[]':
        return None
    return ast.literal_eval(x.strip('][')).get('value')

df['customDimensions.value'] = df['customDimensions'].apply(parse_custom_dimension_value)
test_df['customDimensions.value'] = test_df['customDimensions'].apply(parse_custom_dimension_value)

df.drop(columns=['customDimensions'], inplace=True)
test_df.drop(columns=['customDimensions'], inplace=True)


# an explicit format makes infer_datetime_format redundant (it is deprecated in pandas 2.x)
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
test_df["date"] = pd.to_datetime(test_df["date"], format="%Y%m%d")

In [None]:
def fill_na(df):
    """Fill missing values in place and cast the flag columns to bool."""
    df['totals.bounces'] = df['totals.bounces'].fillna(0.).astype(bool)
    df['trafficSource.referralPath'] = df['trafficSource.referralPath'].fillna('').astype(str)
    df['trafficSource.isTrueDirect'] = df['trafficSource.isTrueDirect'].fillna(False).astype(bool)
    # unlike the other flags, a missing isVideoAd is filled with True
    df['trafficSource.adwordsClickInfo.isVideoAd'] = df['trafficSource.adwordsClickInfo.isVideoAd'].fillna(True).astype(bool)
    df['totals.transactions'] = df['totals.transactions'].fillna(0.)
    df['totals.totalTransactionRevenue'] = df['totals.totalTransactionRevenue'].fillna(0.)
    df['totals.timeOnSite'] = df['totals.timeOnSite'].fillna(0.)
    df['totals.sessionQualityDim'] = df['totals.sessionQualityDim'].fillna(0.)
    df['totals.pageviews'] = df['totals.pageviews'].fillna(0.)
    df['totals.newVisits'] = df['totals.newVisits'].fillna(0.).astype(bool)
    df['customDimensions.value'] = df['customDimensions.value'].fillna('')

fill_na(df)
fill_na(test_df)

In [None]:
from collections import Counter
def get_most_common(values):
    """Return the most frequent value (the mode); on a tie, the value seen first wins."""
    return max(Counter(values).items(), key=lambda x: x[1])[0]
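A short usage sketch of `get_most_common` as a per-visitor aggregation, on a made-up visits table (the helper is restated so the block runs on its own). It uses the same `(name, function)` tuple form as the `agg` calls in this notebook.

```python
from collections import Counter
import pandas as pd

def get_most_common(values):
    # most frequent value; on a tie the value seen first wins
    return max(Counter(values).items(), key=lambda x: x[1])[0]

# Toy sessions: visitor 'a' mostly uses Chrome, visitor 'b' has a tie.
visits = pd.DataFrame({
    'fullVisitorId': ['a', 'a', 'a', 'b', 'b'],
    'device.browser': ['Chrome', 'Safari', 'Chrome', 'Firefox', 'Safari'],
})

modes = visits.groupby('fullVisitorId').agg({
    'device.browser': [('device.browser_mode', get_most_common)],
})
modes.columns = modes.columns.droplevel()
print(modes)  # a -> Chrome, b -> Firefox (first of the tied values)
```

Because ties are resolved by order of appearance, the result for a visitor can depend on how the sessions are sorted before grouping.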

## solution_ft + solution_k

In [None]:
from datetime import timedelta
import numpy as np

def getTimeFramewithFeatures(tr, k=1):
    start = tr['date'].min()
    # target window: 62 days, starting 46 days after the k-th 168-day train window
    tgt_from = start + timedelta(days=168*k + 46)
    tgt_to = start + timedelta(days=168*k + 46 + 62)
    in_target = (tr['date'] >= tgt_from) & (tr['date'] < tgt_to)

    # train timeframe: the k-th block of 168 days
    tf = tr.loc[(tr['date'] >= start + timedelta(days=168*(k-1)))
              & (tr['date'] < start + timedelta(days=168*k))]
    # user ids seen in the target timeframe
    tf_fvid = set(tr.loc[in_target, 'fullVisitorId'])
    # users from the train timeframe who came back in the target timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    # target-timeframe sessions of those returning users
    tf_tst = tr[tr['fullVisitorId'].isin(set(tf_returned['fullVisitorId'])) & in_target]

    # returning users: target = log1p of the revenue summed over the target timeframe
    tf_target = np.log1p(tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]
                         .sum()).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # users who did not come back in the target timeframe: target = 0
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf_maxdate = max(tf['date'])
    tf_mindate = min(tf['date'])

    tf = tf.groupby('fullVisitorId').agg({
        'channelGrouping': [('channelGrouping_max', 'max')],
        'date': [
            ('firstSes', 'min'), 
            ('lastSes', 'max'),
            ('unique', 'nunique')
        ],
        'visitNumber': [('visitNumber_max', 'max')],
        'device.browser': [('browser_max',  'max')],
        'device.operatingSystem': [('operatingSystem_max',  'max')],
        'device.deviceCategory': [('deviceCategory_max',  'max')],
        'geoNetwork.continent': [('continent_max',  'max')],
        'geoNetwork.subContinent': [('subContinent_max',  'max')],
        'geoNetwork.country': [('country_max',  'max')],
        'geoNetwork.region': [('region_max',  'max')],
        'geoNetwork.metro': [('metro_max',  'max')],
        'geoNetwork.city': [('city_max',  'max')],
        'geoNetwork.networkDomain': [('networkDomain_max', 'max')],
        'trafficSource.source': [('source_max',  'max')],
        'trafficSource.medium': [('medium_max',  'max')],
        'trafficSource.adwordsClickInfo.isVideoAd': [('isVideoAd_mean',  'mean')],
        'device.isMobile': [('isMobile_mean', 'mean')],
        'trafficSource.isTrueDirect': [('isTrueDirect_mean', 'mean')],
        'totals.bounces': [('totals.bounces_sum',  'sum')],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'median'),  # NOTE: named "_mean" but holds the median
            ('hits_std', 'std'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'median'),  # NOTE: named "_mean" but holds the median
            ('pageviews_std', 'std'),
        ],
        'visitStartTime': [('visitStartTime_cnt', 'count')],
        'totals.totalTransactionRevenue': [('totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('transactions_sum', 'sum')],
    })

    tf.columns = tf.columns.droplevel()
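    # express the session dates as day counts
    # interval: days between the user's first and last session
    tf['interval'] = (tf['lastSes'] - tf['firstSes']).dt.total_seconds() / 86400
    # firstSes: days from the earliest session in the window to the user's first session
    tf['firstSes'] = (tf['firstSes'] - tf_mindate).dt.total_seconds() / 86400
    # lastSes: days from the user's last session to the latest session in the window
    tf['lastSes'] = (tf_maxdate - tf['lastSes']).dt.total_seconds() / 86400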

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf
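The constants 168, 46 and 62 in `getTimeFramewithFeatures` appear to mirror the competition calendar: a 168-day input period, a 46-day gap, and a 62-day prediction period. The sketch below checks that arithmetic and lists the windows for each `k`. It assumes the earliest training date is 01/08/2016, as stated in the overview; `windows` and the constant names are hypothetical helpers introduced only for this illustration.

```python
from datetime import date, timedelta

TRAIN_DAYS, GAP_DAYS, TARGET_DAYS = 168, 46, 62

# The constants match the dates given in the overview.
assert (date(2018, 10, 16) - date(2018, 5, 1)).days == TRAIN_DAYS    # 01/05/2018 to 15/10/2018
assert (date(2018, 12, 1) - date(2018, 10, 16)).days == GAP_DAYS     # 16/10/2018 to 30/11/2018
assert (date(2019, 2, 1) - date(2018, 12, 1)).days == TARGET_DAYS    # 01/12/2018 to 31/01/2019

def windows(start, k):
    """Half-open [from, to) feature and target windows for fold k."""
    feat_from = start + timedelta(days=TRAIN_DAYS * (k - 1))
    feat_to = start + timedelta(days=TRAIN_DAYS * k)
    tgt_from = feat_to + timedelta(days=GAP_DAYS)
    tgt_to = tgt_from + timedelta(days=TARGET_DAYS)
    return (feat_from, feat_to), (tgt_from, tgt_to)

start = date(2016, 8, 1)  # assumption: earliest date in the training data
for k in range(1, 5):
    feat, tgt = windows(start, k)
    print(k, 'features', feat[0], feat[1], 'target', tgt[0], tgt[1])
```

Under that assumption, the target window for `k=4` starts on 2018-07-20, which is after the stated end of the training data (30/04/2018). With training data alone, fold 4 would therefore contain no target sessions, and every user in it would get a target of 0.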

In [None]:
%%time
df1 = getTimeFramewithFeatures(df, k=1)
df2 = getTimeFramewithFeatures(df, k=2)
df3 = getTimeFramewithFeatures(df, k=3)
df4 = getTimeFramewithFeatures(df, k=4)
train = pd.concat([df1, df2, df3, df4], ignore_index=True)
train



CPU times: user 38min 41s, sys: 3min 4s, total: 41min 46s
Wall time: 36min 20s
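How the `target` column is formed can be seen on a toy example. This is a minimal sketch with invented visitor Ids and revenue values, not the notebook's exact code path: revenue is summed per user over the target window first, `log1p` is applied to the sum, and users from the feature window who do not show up in the target window get 0.

```python
import numpy as np
import pandas as pd

# Users seen in the feature window (hypothetical Ids).
feature_users = {'u1', 'u2', 'u3'}

# Sessions in the target window; 'u4' was never seen in the feature window.
target_sessions = pd.DataFrame({
    'fullVisitorId': ['u1', 'u1', 'u4'],
    'totals.totalTransactionRevenue': [1_000_000.0, 0.0, 5_000_000.0],
})

# Keep only returning users, sum their revenue, then take log1p of the sum.
returned = target_sessions[target_sessions['fullVisitorId'].isin(feature_users)]
target = np.log1p(returned.groupby('fullVisitorId')['totals.totalTransactionRevenue'].sum())

# Users who did not return get a target of 0.
target = target.reindex(sorted(feature_users), fill_value=0.0).rename('target')
print(target)
```

Taking the log after the sum matters: the log of a sum of sessions is not the sum of per-session logs, and the competition metric is defined on the log of the per-user total.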


In [None]:
sr = train.dtypes
lst_cat = sr[sr == 'object'].index.tolist()[1:]

train[lst_cat] = train[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
train.to_csv('output/solution_ft_solution_k/train_pre.csv', index=False)

In [None]:
tf_maxdate = max(test_df['date'])
tf_mindate = min(test_df['date'])

tf = test_df.groupby('fullVisitorId').agg({
    'channelGrouping': [('channelGrouping_max', 'max')],
    'date': [
        ('firstSes', 'min'), 
        ('lastSes', 'max'),
        ('unique', 'nunique')
    ],
    'visitNumber': [('visitNumber_max', 'max')],
    'device.browser': [('browser_max',  'max')],
    'device.operatingSystem': [('operatingSystem_max',  'max')],
    'device.deviceCategory': [('deviceCategory_max',  'max')],
    'geoNetwork.continent': [('continent_max',  'max')],
    'geoNetwork.subContinent': [('subContinent_max',  'max')],
    'geoNetwork.country': [('country_max',  'max')],
    'geoNetwork.region': [('region_max',  'max')],
    'geoNetwork.metro': [('metro_max',  'max')],
    'geoNetwork.city': [('city_max',  'max')],
    'geoNetwork.networkDomain': [('networkDomain_max', 'max')],
    'trafficSource.source': [('source_max',  'max')],
    'trafficSource.medium': [('medium_max',  'max')],
    'trafficSource.adwordsClickInfo.isVideoAd': [('isVideoAd_mean',  'mean')],
    'device.isMobile': [('isMobile_mean', 'mean')],
    'trafficSource.isTrueDirect': [('isTrueDirect_mean', 'mean')],
    'totals.bounces': [('totals.bounces_sum',  'sum')],
    'totals.hits': [
        ('hits_sum', 'sum'),
        ('hits_min', 'min'), 
        ('hits_max', 'max'), 
        ('hits_mean', 'median'),  # NOTE: named "_mean" but holds the median
        ('hits_std', 'std'),
    ],
    'totals.pageviews': [
        ('pageviews_sum', 'sum'),
        ('pageviews_min', 'min'),
        ('pageviews_max', 'max'),
        ('pageviews_mean', 'median'),  # NOTE: named "_mean" but holds the median
        ('pageviews_std', 'std'),
    ],
    'visitStartTime': [('visitStartTime_cnt', 'count')],
    'totals.totalTransactionRevenue': [('totalTransactionRevenue_sum', 'sum')],
    'totals.transactions': [('transactions_sum', 'sum')],
})

tf.columns = tf.columns.droplevel()
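# express the session dates as day counts
# interval: days between the user's first and last session
tf['interval'] = (tf['lastSes'] - tf['firstSes']).dt.total_seconds() / 86400
# firstSes: days from the earliest session in the test data to the user's first session
tf['firstSes'] = (tf['firstSes'] - tf_mindate).dt.total_seconds() / 86400
# lastSes: days from the user's last session to the latest session in the test data
tf['lastSes'] = (tf_maxdate - tf['lastSes']).dt.total_seconds() / 86400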



In [None]:
sr = tf.dtypes
lst_cat = sr[sr == 'object'].index.tolist()

tf[lst_cat] = tf[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
tf.to_csv('output/solution_ft_solution_k/X_test.csv')
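One thing to watch in the two encoding cells: `pd.factorize` is called separately on `train` and on the test features, so the same category can receive different integer codes in the two files. The sketch below shows the effect on toy data and one possible remedy, a mapping built from the union of both columns. `encode_with_shared_codes` is a hypothetical helper, not part of the referenced solution.

```python
import pandas as pd

def encode_with_shared_codes(train_col, test_col):
    """Encode two columns with one mapping built from the union of their values."""
    categories = sorted(set(train_col.dropna()) | set(test_col.dropna()))
    code_of = {value: code for code, value in enumerate(categories)}

    def encode(col):
        # missing values get -1, the same convention pd.factorize uses
        return col.map(code_of).fillna(-1).astype(int)

    return encode(train_col), encode(test_col)

train_browser = pd.Series(['Chrome', 'Firefox', 'Safari'])
test_browser = pd.Series(['Safari', 'Chrome'])

# Separate factorization: 'Safari' is 2 in train but 1 in test.
separate_train = pd.factorize(train_browser, sort=True)[0]
separate_test = pd.factorize(test_browser, sort=True)[0]
print(separate_train.tolist(), separate_test.tolist())  # [0, 1, 2] [1, 0]

# Shared mapping: 'Safari' is 2 in both.
shared_train, shared_test = encode_with_shared_codes(train_browser, test_browser)
print(shared_train.tolist(), shared_test.tolist())      # [0, 1, 2] [2, 0]
```

Whether this changes the model's score depends on how the encoded columns are used downstream; the sketch only illustrates that the codes are otherwise not comparable between the two files.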

## solution_ft + self_k

In [None]:
def getTimeFramewithFeatures(tr, k=1):
    start = tr['date'].min()
    # target window: 62 days, starting 46 days after the k-th 168-day train window
    tgt_from = start + timedelta(days=168*k + 46)
    tgt_to = start + timedelta(days=168*k + 46 + 62)
    # for k == 4 the target sessions are taken from test_df instead of the training data
    src = test_df if k == 4 else tr
    in_target = (src['date'] >= tgt_from) & (src['date'] < tgt_to)

    # train timeframe: the k-th block of 168 days
    tf = tr.loc[(tr['date'] >= start + timedelta(days=168*(k-1)))
              & (tr['date'] < start + timedelta(days=168*k))]
    # user ids seen in the target timeframe
    tf_fvid = set(src.loc[in_target, 'fullVisitorId'])
    # users from the train timeframe who came back in the target timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    # target-timeframe sessions of those returning users
    tf_tst = src[src['fullVisitorId'].isin(set(tf_returned['fullVisitorId'])) & in_target]

    # returning users: target = log1p of the revenue summed over the target timeframe
    tf_target = np.log1p(tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]
                         .sum()).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # users who did not come back in the target timeframe: target = 0
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf_maxdate = max(tf['date'])
    tf_mindate = min(tf['date'])

    tf = tf.groupby('fullVisitorId').agg({
        'channelGrouping': [('channelGrouping_max', 'max')],
        'date': [
            ('firstSes', 'min'), 
            ('lastSes', 'max'),
            ('unique', 'nunique')
        ],
        'visitNumber': [('visitNumber_max', 'max')],
        'device.browser': [('browser_max',  'max')],
        'device.operatingSystem': [('operatingSystem_max',  'max')],
        'device.deviceCategory': [('deviceCategory_max',  'max')],
        'geoNetwork.continent': [('continent_max',  'max')],
        'geoNetwork.subContinent': [('subContinent_max',  'max')],
        'geoNetwork.country': [('country_max',  'max')],
        'geoNetwork.region': [('region_max',  'max')],
        'geoNetwork.metro': [('metro_max',  'max')],
        'geoNetwork.city': [('city_max',  'max')],
        'geoNetwork.networkDomain': [('networkDomain_max', 'max')],
        'trafficSource.source': [('source_max',  'max')],
        'trafficSource.medium': [('medium_max',  'max')],
        'trafficSource.adwordsClickInfo.isVideoAd': [('isVideoAd_mean',  'mean')],
        'device.isMobile': [('isMobile_mean', 'mean')],
        'trafficSource.isTrueDirect': [('isTrueDirect_mean', 'mean')],
        'totals.bounces': [('totals.bounces_sum',  'sum')],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'median'),  # NOTE: named "_mean" but holds the median
            ('hits_std', 'std'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'median'),  # NOTE: named "_mean" but holds the median
            ('pageviews_std', 'std'),
        ],
        'visitStartTime': [('visitStartTime_cnt', 'count')],
        'totals.totalTransactionRevenue': [('totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('transactions_sum', 'sum')],
    })

    tf.columns = tf.columns.droplevel()
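    # express the session dates as day counts
    # interval: days between the user's first and last session
    tf['interval'] = (tf['lastSes'] - tf['firstSes']).dt.total_seconds() / 86400
    # firstSes: days from the earliest session in the window to the user's first session
    tf['firstSes'] = (tf['firstSes'] - tf_mindate).dt.total_seconds() / 86400
    # lastSes: days from the user's last session to the latest session in the window
    tf['lastSes'] = (tf_maxdate - tf['lastSes']).dt.total_seconds() / 86400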

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf

In [None]:
%%time
df1 = getTimeFramewithFeatures(df, k=1)
df2 = getTimeFramewithFeatures(df, k=2)
df3 = getTimeFramewithFeatures(df, k=3)
df4 = getTimeFramewithFeatures(df, k=4)
train = pd.concat([df1, df2, df3, df4], ignore_index=True)
train



CPU times: user 38min 31s, sys: 3min 1s, total: 41min 32s
Wall time: 36min 9s


In [None]:
sr = train.dtypes
lst_cat = sr[sr == 'object'].index.tolist()[1:]

train[lst_cat] = train[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
train.to_csv('output/solution_ft_self_k/train_pre.csv', index=False)

In [None]:
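# the test features do not depend on k: tf is still the encoded test frame built in the previous section
tf.to_csv('output/solution_ft_self_k/X_test.csv')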

## self_ft + solution_k

In [None]:
def getTimeFramewithFeatures(tr, k=1):
    start = tr['date'].min()
    # target window: 62 days, starting 46 days after the k-th 168-day train window
    tgt_from = start + timedelta(days=168*k + 46)
    tgt_to = start + timedelta(days=168*k + 46 + 62)
    in_target = (tr['date'] >= tgt_from) & (tr['date'] < tgt_to)

    # train timeframe: the k-th block of 168 days
    tf = tr.loc[(tr['date'] >= start + timedelta(days=168*(k-1)))
              & (tr['date'] < start + timedelta(days=168*k))]
    # user ids seen in the target timeframe
    tf_fvid = set(tr.loc[in_target, 'fullVisitorId'])
    # users from the train timeframe who came back in the target timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    # target-timeframe sessions of those returning users
    tf_tst = tr[tr['fullVisitorId'].isin(set(tf_returned['fullVisitorId'])) & in_target]

    # returning users: target = log1p of the revenue summed over the target timeframe
    tf_target = np.log1p(tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]
                         .sum()).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # users who did not come back in the target timeframe: target = 0
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf = tf.groupby('fullVisitorId').agg({
        'channelGrouping': [('channelGrouping_mode', get_most_common)],
        'visitNumber': [('visitNumber_max', 'max')],
        'device.browser': [('device.browser_mode',  get_most_common)],
        'geoNetwork.city': [('geoNetwork.city_mode',  get_most_common)],
        'totals.bounces': [('totals.bounces_mode',  get_most_common)],
        'trafficSource.source': [('trafficSource.source_mode',  get_most_common)],
        'trafficSource.referralPath': [('trafficSource.referralPath_mode',  get_most_common)],
        'trafficSource.medium': [('trafficSource.medium_mode',  get_most_common)],
        'trafficSource.isTrueDirect': [('trafficSource.isTrueDirect_mode',  get_most_common)],
        'trafficSource.adwordsClickInfo.isVideoAd': [('trafficSource.adwordsClickInfo.isVideoAd_mode',  get_most_common)],
        'device.operatingSystem': [('device.operatingSystem_mode',  get_most_common)],
        'device.isMobile': [('device.isMobile_mode',  get_most_common)],
        'device.deviceCategory': [('device.deviceCategory_mode',  get_most_common)],
        'geoNetwork.metro': [('geoNetwork.metro_mode',  get_most_common)],
        'geoNetwork.networkDomain': [('geoNetwork.networkDomain_mode',  get_most_common)],
        'geoNetwork.region': [('geoNetwork.region_mode',  get_most_common)],
        'geoNetwork.subContinent': [('geoNetwork.subContinent_mode',  get_most_common)],
        'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('totals.transactions_sum', 'sum')],
        'totals.timeOnSite': [
            ('timeOnSite_sum', 'sum'),
            ('timeOnSite_min', 'min'),
            ('timeOnSite_max', 'max'),
        ],
        'totals.sessionQualityDim': [
            ('sessionQualityDim_max', 'max'),
            ('sessionQualityDim_mean', 'mean'),
            ('sessionQualityDim_min', 'min'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'mean'),
        ],
        'totals.newVisits': [('totals.newVisits',  get_most_common)],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'mean'),
        ],
        'geoNetwork.country': [('geoNetwork.country_mode',  get_most_common)],
        'geoNetwork.continent': [('geoNetwork.continent_mode',  get_most_common)],
        'customDimensions.value': [('customDimensions.value_mode',  get_most_common)]
    })

    tf.columns = tf.columns.droplevel()

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf

In [None]:
%%time
df1 = getTimeFramewithFeatures(df, k=1)
df2 = getTimeFramewithFeatures(df, k=2)
df3 = getTimeFramewithFeatures(df, k=3)
df4 = getTimeFramewithFeatures(df, k=4)
train = pd.concat([df1, df2, df3, df4], ignore_index=True)
train

CPU times: user 6min 3s, sys: 67.4 ms, total: 6min 3s
Wall time: 6min 11s


In [None]:
sr = train.dtypes
lst_cat = sr[sr == 'object'].index.tolist()[1:]

train[lst_cat] = train[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
train.to_csv('output/self_ft_solution_k/train_pre.csv', index=False)

In [None]:
tf_maxdate = max(test_df['date'])
tf_mindate = min(test_df['date'])

tf = test_df.groupby('fullVisitorId').agg({
    'channelGrouping': [('channelGrouping_mode', get_most_common)],
    'visitNumber': [('visitNumber_max', 'max')],
    'device.browser': [('device.browser_mode',  get_most_common)],
    'geoNetwork.city': [('geoNetwork.city_mode',  get_most_common)],
    'totals.bounces': [('totals.bounces_mode',  get_most_common)],
    'trafficSource.source': [('trafficSource.source_mode',  get_most_common)],
    'trafficSource.referralPath': [('trafficSource.referralPath_mode',  get_most_common)],
    'trafficSource.medium': [('trafficSource.medium_mode',  get_most_common)],
    'trafficSource.isTrueDirect': [('trafficSource.isTrueDirect_mode',  get_most_common)],
    'trafficSource.adwordsClickInfo.isVideoAd': [('trafficSource.adwordsClickInfo.isVideoAd_mode',  get_most_common)],
    'device.operatingSystem': [('device.operatingSystem_mode',  get_most_common)],
    'device.isMobile': [('device.isMobile_mode',  get_most_common)],
    'device.deviceCategory': [('device.deviceCategory_mode',  get_most_common)],
    'geoNetwork.metro': [('geoNetwork.metro_mode',  get_most_common)],
    'geoNetwork.networkDomain': [('geoNetwork.networkDomain_mode',  get_most_common)],
    'geoNetwork.region': [('geoNetwork.region_mode',  get_most_common)],
    'geoNetwork.subContinent': [('geoNetwork.subContinent_mode',  get_most_common)],
    'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
    'totals.transactions': [('totals.transactions_sum', 'sum')],
    'totals.timeOnSite': [
        ('timeOnSite_sum', 'sum'),
        ('timeOnSite_min', 'min'),
        ('timeOnSite_max', 'max'),
    ],
    'totals.sessionQualityDim': [
        ('sessionQualityDim_max', 'max'),
        ('sessionQualityDim_mean', 'mean'),
        ('sessionQualityDim_min', 'min'),
    ],
    'totals.pageviews': [
        ('pageviews_sum', 'sum'),
        ('pageviews_min', 'min'),
        ('pageviews_max', 'max'),
        ('pageviews_mean', 'mean'),
    ],
    'totals.newVisits': [('totals.newVisits',  get_most_common)],
    'totals.hits': [
        ('hits_sum', 'sum'),
        ('hits_min', 'min'), 
        ('hits_max', 'max'), 
        ('hits_mean', 'mean'),
    ],
    'geoNetwork.country': [('geoNetwork.country_mode',  get_most_common)],
    'geoNetwork.continent': [('geoNetwork.continent_mode',  get_most_common)],
    'customDimensions.value': [('customDimensions.value_mode',  get_most_common)]
})

tf.columns = tf.columns.droplevel()

In [None]:
sr = tf.dtypes
lst_cat = sr[sr == 'object'].index.tolist()

tf[lst_cat] = tf[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
tf.to_csv('output/self_ft_solution_k/X_test.csv')

## self_ft + self_k

In [None]:
def getTimeFramewithFeatures(tr, k=1):
    start = tr['date'].min()
    # target window: 62 days, starting 46 days after the k-th 168-day train window
    tgt_from = start + timedelta(days=168*k + 46)
    tgt_to = start + timedelta(days=168*k + 46 + 62)
    # for k == 4 the target sessions are taken from test_df instead of the training data
    src = test_df if k == 4 else tr
    in_target = (src['date'] >= tgt_from) & (src['date'] < tgt_to)

    # train timeframe: the k-th block of 168 days
    tf = tr.loc[(tr['date'] >= start + timedelta(days=168*(k-1)))
              & (tr['date'] < start + timedelta(days=168*k))]
    # user ids seen in the target timeframe
    tf_fvid = set(src.loc[in_target, 'fullVisitorId'])
    # users from the train timeframe who came back in the target timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    # target-timeframe sessions of those returning users
    tf_tst = src[src['fullVisitorId'].isin(set(tf_returned['fullVisitorId'])) & in_target]

    # returning users: target = log1p of the revenue summed over the target timeframe
    tf_target = np.log1p(tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]
                         .sum()).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # users who did not come back in the target timeframe: target = 0
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf_maxdate = max(tf['date'])
    tf_mindate = min(tf['date'])

    tf = tf.groupby('fullVisitorId').agg({
        'channelGrouping': [('channelGrouping_mode', get_most_common)],
        'visitNumber': [('visitNumber_max', 'max')],
        'device.browser': [('device.browser_mode',  get_most_common)],
        'geoNetwork.city': [('geoNetwork.city_mode',  get_most_common)],
        'totals.bounces': [('totals.bounces_mode',  get_most_common)],
        'trafficSource.source': [('trafficSource.source_mode',  get_most_common)],
        'trafficSource.referralPath': [('trafficSource.referralPath_mode',  get_most_common)],
        'trafficSource.medium': [('trafficSource.medium_mode',  get_most_common)],
        'trafficSource.isTrueDirect': [('trafficSource.isTrueDirect_mode',  get_most_common)],
        'trafficSource.adwordsClickInfo.isVideoAd': [('trafficSource.adwordsClickInfo.isVideoAd_mode',  get_most_common)],
        'device.operatingSystem': [('device.operatingSystem_mode',  get_most_common)],
        'device.isMobile': [('device.isMobile_mode',  get_most_common)],
        'device.deviceCategory': [('device.deviceCategory_mode',  get_most_common)],
        'geoNetwork.metro': [('geoNetwork.metro_mode',  get_most_common)],
        'geoNetwork.networkDomain': [('geoNetwork.networkDomain_mode',  get_most_common)],
        'geoNetwork.region': [('geoNetwork.region_mode',  get_most_common)],
        'geoNetwork.subContinent': [('geoNetwork.subContinent_mode',  get_most_common)],
        'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('totals.transactions_sum', 'sum')],
        'totals.timeOnSite': [
            ('timeOnSite_sum', 'sum'),
            ('timeOnSite_min', 'min'),
            ('timeOnSite_max', 'max'),
        ],
        'totals.sessionQualityDim': [
            ('sessionQualityDim_max', 'max'),
            ('sessionQualityDim_mean', 'mean'),
            ('sessionQualityDim_min', 'min'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'mean'),
        ],
        'totals.newVisits': [('totals.newVisits',  get_most_common)],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'mean'),
        ],
        'geoNetwork.country': [('geoNetwork.country_mode',  get_most_common)],
        'geoNetwork.continent': [('geoNetwork.continent_mode',  get_most_common)],
        'customDimensions.value': [('customDimensions.value_mode',  get_most_common)]
    })

    tf.columns = tf.columns.droplevel()

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf

In [None]:
%%time
df1 = getTimeFramewithFeatures(df, k=1)
df2 = getTimeFramewithFeatures(df, k=2)
df3 = getTimeFramewithFeatures(df, k=3)
df4 = getTimeFramewithFeatures(df, k=4)
train = pd.concat([df1, df2, df3, df4], ignore_index=True)
train

CPU times: user 5min 59s, sys: 0 ns, total: 5min 59s
Wall time: 6min


In [None]:
sr = train.dtypes
# skip the first object column (presumably fullVisitorId, which must stay a string key)
lst_cat = sr[sr == 'object'].index.tolist()[1:]

train[lst_cat] = train[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
train.to_csv('output/self_ft_self_k/train_pre.csv', index=False)

In [None]:
tf.to_csv('output/self_ft_self_k/X_test.csv')

## self_ft(add date ft) + solution_k
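
This variant adds four date features per visitor: days from the window start to the first session, days from the last session to the window end, days between first and last session, and the number of distinct visit dates. A minimal sketch on a toy visit log, assuming `date` is a datetime column; the IDs and dates are made up for illustration.

```python
import pandas as pd

# Toy visit log; IDs and dates are illustrative only.
visits = pd.DataFrame({
    'fullVisitorId': ['a', 'a', 'a', 'b'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-05', '2017-01-05', '2017-01-03']),
})
window_min, window_max = visits['date'].min(), visits['date'].max()

# One row per visitor: first date, last date, number of distinct dates.
feats = visits.groupby('fullVisitorId')['date'].agg(
    firstSes='min', lastSes='max', unique='nunique')

# Convert the timestamps to day counts relative to the window edges.
feats['interval'] = (feats['lastSes'] - feats['firstSes']).dt.total_seconds() / 86400
feats['firstSes'] = (feats['firstSes'] - window_min).dt.total_seconds() / 86400
feats['lastSes'] = (window_max - feats['lastSes']).dt.total_seconds() / 86400
print(feats)
```

Note that `interval` has to be computed before `firstSes` and `lastSes` are overwritten with day counts, which is the same order the notebook code uses.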

In [None]:
def getTimeFramewithFeatures(tr, k=1):
    # window boundaries: 168-day feature window, 46-day gap, 62-day target window
    start = tr['date'].min()
    train_start = start + timedelta(days=168*(k-1))
    train_end = start + timedelta(days=168*k)
    target_start = train_end + timedelta(days=46)
    target_end = target_start + timedelta(days=62)

    # train timeframe
    tf = tr.loc[(tr['date'] >= train_start) & (tr['date'] < train_end)]
    in_target = (tr['date'] >= target_start) & (tr['date'] < target_end)
    # user id in the test timeframe
    tf_fvid = set(tr.loc[in_target, 'fullVisitorId'])
    # user id in the test timeframe appeared in train timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    # test timeframe, restricted to returned users
    tf_tst = tr[tr['fullVisitorId'].isin(set(tf_returned['fullVisitorId'])) & in_target]

    # for returned user
    tf_target = tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]\
                        .sum().apply(np.log1p, axis=1).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # for new user
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf_maxdate = max(tf['date'])
    tf_mindate = min(tf['date'])

    tf = tf.groupby('fullVisitorId').agg({
        'channelGrouping': [('channelGrouping_mode', get_most_common)],
        'date': [
            ('firstSes', 'min'), 
            ('lastSes', 'max'),
            ('unique', 'nunique')
        ],
        'visitNumber': [('visitNumber_max', 'max')],
        'device.browser': [('device.browser_mode',  get_most_common)],
        'geoNetwork.city': [('geoNetwork.city_mode',  get_most_common)],
        'totals.bounces': [('totals.bounces_mode',  get_most_common)],
        'trafficSource.source': [('trafficSource.source_mode',  get_most_common)],
        'trafficSource.referralPath': [('trafficSource.referralPath_mode',  get_most_common)],
        'trafficSource.medium': [('trafficSource.medium_mode',  get_most_common)],
        'trafficSource.isTrueDirect': [('trafficSource.isTrueDirect_mode',  get_most_common)],
        'trafficSource.adwordsClickInfo.isVideoAd': [('trafficSource.adwordsClickInfo.isVideoAd_mode',  get_most_common)],
        'device.operatingSystem': [('device.operatingSystem_mode',  get_most_common)],
        'device.isMobile': [('device.isMobile_mode',  get_most_common)],
        'device.deviceCategory': [('device.deviceCategory_mode',  get_most_common)],
        'geoNetwork.metro': [('geoNetwork.metro_mode',  get_most_common)],
        'geoNetwork.networkDomain': [('geoNetwork.networkDomain_mode',  get_most_common)],
        'geoNetwork.region': [('geoNetwork.region_mode',  get_most_common)],
        'geoNetwork.subContinent': [('geoNetwork.subContinent_mode',  get_most_common)],
        'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('totals.transactions_sum', 'sum')],
        'totals.timeOnSite': [
            ('timeOnSite_sum', 'sum'),
            ('timeOnSite_min', 'min'),
            ('timeOnSite_max', 'max'),
        ],
        'totals.sessionQualityDim': [
            ('sessionQualityDim_max', 'max'),
            ('sessionQualityDim_mean', 'mean'),
            ('sessionQualityDim_min', 'min'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'mean'),
        ],
        'totals.newVisits': [('totals.newVisits',  get_most_common)],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'mean'),
        ],
        'geoNetwork.country': [('geoNetwork.country_mode',  get_most_common)],
        'geoNetwork.continent': [('geoNetwork.continent_mode',  get_most_common)],
        'customDimensions.value': [('customDimensions.value_mode',  get_most_common)]
    })

    tf.columns = tf.columns.droplevel()
    # express the session dates as day counts relative to the window edges
    tf['interval'] = (tf['lastSes'] - tf['firstSes']).dt.total_seconds() / 86400
    tf['firstSes'] = (tf['firstSes'] - tf_mindate).dt.total_seconds() / 86400
    tf['lastSes'] = (tf_maxdate - tf['lastSes']).dt.total_seconds() / 86400

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf

In [None]:
%%time
df1 = getTimeFramewithFeatures(df, k=1)
df2 = getTimeFramewithFeatures(df, k=2)
df3 = getTimeFramewithFeatures(df, k=3)
df4 = getTimeFramewithFeatures(df, k=4)
train = pd.concat([df1, df2, df3, df4], ignore_index=True)
train



CPU times: user 5min 56s, sys: 0 ns, total: 5min 56s
Wall time: 5min 57s


In [None]:
sr = train.dtypes
# skip the first object column (presumably fullVisitorId, which must stay a string key)
lst_cat = sr[sr == 'object'].index.tolist()[1:]

train[lst_cat] = train[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
train.to_csv('output/self_ft_solution_k/train_pre_date.csv', index=False)
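
`pd.factorize` assigns integer codes from the values present in each call, so encoding `train` and the test features in separate calls does not guarantee that the same category gets the same code in both. The sketch below shows the effect on toy data and one possible way to share a vocabulary through `pd.Categorical`. The toy frames and the column choice are illustrative, and this is not what the notebook cells do.

```python
import pandas as pd

# Toy frames; the values are illustrative only.
col = 'device.browser_mode'
train_toy = pd.DataFrame({col: ['Chrome', 'Safari', 'Chrome']})
test_toy = pd.DataFrame({col: ['Safari', 'UC Browser']})

# Separate calls: codes depend on the values seen in each frame,
# so 'Safari' is 1 in train_toy but 0 in test_toy.
sep_train = pd.factorize(train_toy[col], sort=True)[0]
sep_test = pd.factorize(test_toy[col], sort=True)[0]

# Shared vocabulary: sorted union of the categories from both frames.
categories = sorted(pd.concat([train_toy[col], test_toy[col]]).dropna().unique())
shared_train = pd.Categorical(train_toy[col], categories=categories).codes
shared_test = pd.Categorical(test_toy[col], categories=categories).codes

print(sep_train, sep_test)
print(shared_train, shared_test)
```

Tree-based models only see the integer codes, so a shared mapping keeps a split learned on the training codes meaningful on the test codes.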

In [None]:
tf_maxdate = max(test_df['date'])
tf_mindate = min(test_df['date'])

tf = test_df.groupby('fullVisitorId').agg({
    'channelGrouping': [('channelGrouping_mode', get_most_common)],
    'date': [
        ('firstSes', 'min'), 
        ('lastSes', 'max'),
        ('unique', 'nunique')
    ],
    'visitNumber': [('visitNumber_max', 'max')],
    'device.browser': [('device.browser_mode',  get_most_common)],
    'geoNetwork.city': [('geoNetwork.city_mode',  get_most_common)],
    'totals.bounces': [('totals.bounces_mode',  get_most_common)],
    'trafficSource.source': [('trafficSource.source_mode',  get_most_common)],
    'trafficSource.referralPath': [('trafficSource.referralPath_mode',  get_most_common)],
    'trafficSource.medium': [('trafficSource.medium_mode',  get_most_common)],
    'trafficSource.isTrueDirect': [('trafficSource.isTrueDirect_mode',  get_most_common)],
    'trafficSource.adwordsClickInfo.isVideoAd': [('trafficSource.adwordsClickInfo.isVideoAd_mode',  get_most_common)],
    'device.operatingSystem': [('device.operatingSystem_mode',  get_most_common)],
    'device.isMobile': [('device.isMobile_mode',  get_most_common)],
    'device.deviceCategory': [('device.deviceCategory_mode',  get_most_common)],
    'geoNetwork.metro': [('geoNetwork.metro_mode',  get_most_common)],
    'geoNetwork.networkDomain': [('geoNetwork.networkDomain_mode',  get_most_common)],
    'geoNetwork.region': [('geoNetwork.region_mode',  get_most_common)],
    'geoNetwork.subContinent': [('geoNetwork.subContinent_mode',  get_most_common)],
    'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
    'totals.transactions': [('totals.transactions_sum', 'sum')],
    'totals.timeOnSite': [
        ('timeOnSite_sum', 'sum'),
        ('timeOnSite_min', 'min'),
        ('timeOnSite_max', 'max'),
    ],
    'totals.sessionQualityDim': [
        ('sessionQualityDim_max', 'max'),
        ('sessionQualityDim_mean', 'mean'),
        ('sessionQualityDim_min', 'min'),
    ],
    'totals.pageviews': [
        ('pageviews_sum', 'sum'),
        ('pageviews_min', 'min'),
        ('pageviews_max', 'max'),
        ('pageviews_mean', 'mean'),
    ],
    'totals.newVisits': [('totals.newVisits',  get_most_common)],
    'totals.hits': [
        ('hits_sum', 'sum'),
        ('hits_min', 'min'), 
        ('hits_max', 'max'), 
        ('hits_mean', 'mean'),
    ],
    'geoNetwork.country': [('geoNetwork.country_mode',  get_most_common)],
    'geoNetwork.continent': [('geoNetwork.continent_mode',  get_most_common)],
    'customDimensions.value': [('customDimensions.value_mode',  get_most_common)]
})

tf.columns = tf.columns.droplevel()
# express the session dates as day counts relative to the window edges
tf['interval'] = (tf['lastSes'] - tf['firstSes']).dt.total_seconds() / 86400
tf['firstSes'] = (tf['firstSes'] - tf_mindate).dt.total_seconds() / 86400
tf['lastSes'] = (tf_maxdate - tf['lastSes']).dt.total_seconds() / 86400



In [None]:
sr = tf.dtypes
lst_cat = sr[sr == 'object'].index.tolist()

tf[lst_cat] = tf[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
tf.to_csv('output/self_ft_solution_k/X_test_date.csv')

## self_ft(add date ft) + self_k
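
The target for each feature window is built in two parts: visitors who return in the target window get `log1p` of their summed revenue, and visitors who do not return get 0. A minimal sketch of that construction on toy data; the IDs and revenue values are made up, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data; IDs and revenue values are illustrative only.
feature_ids = {'a', 'b', 'c'}          # visitors seen in the feature window
target_visits = pd.DataFrame({         # visits that fall in the target window
    'fullVisitorId': ['a', 'a', 'd'],
    'totals.totalTransactionRevenue': [10.0, 5.0, 99.0],
})

# Returned visitors: in the feature window AND in the target window.
returned = target_visits[target_visits['fullVisitorId'].isin(feature_ids)]
target = (returned.groupby('fullVisitorId')['totals.totalTransactionRevenue']
          .sum().pipe(np.log1p).rename('target').reset_index())

# Visitors from the feature window who never show up in the target window.
non_returning = pd.DataFrame({
    'fullVisitorId': sorted(feature_ids - set(target_visits['fullVisitorId'])),
    'target': 0.0,
})
target = pd.concat([target, non_returning], ignore_index=True)
print(target)
```

Visitor `d` appears only in the target window, so it contributes no row: every row of the result corresponds to a visitor with features.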

In [None]:
def getTimeFramewithFeatures(tr, k=1):
    # window boundaries: 168-day feature window, 46-day gap, 62-day target window
    start = tr['date'].min()
    train_start = start + timedelta(days=168*(k-1))
    train_end = start + timedelta(days=168*k)
    target_start = train_end + timedelta(days=46)
    target_end = target_start + timedelta(days=62)

    # train timeframe
    tf = tr.loc[(tr['date'] >= train_start) & (tr['date'] < train_end)]
    # for k == 4 the target period is read from test_df, otherwise from tr
    src = test_df if k == 4 else tr
    in_target = (src['date'] >= target_start) & (src['date'] < target_end)
    # user id in the test timeframe
    tf_fvid = set(src.loc[in_target, 'fullVisitorId'])
    # user id in the test timeframe appeared in train timeframe
    tf_returned = tf[tf['fullVisitorId'].isin(tf_fvid)]
    # test timeframe, restricted to returned users
    tf_tst = src[src['fullVisitorId'].isin(set(tf_returned['fullVisitorId'])) & in_target]

    # for returned user
    tf_target = tf_tst.groupby('fullVisitorId')[['totals.totalTransactionRevenue']]\
                        .sum().apply(np.log1p, axis=1).reset_index()
    tf_target.rename(columns={'totals.totalTransactionRevenue': 'target'}, 
                     inplace=True)
    # for new user
    tf_nonret = pd.DataFrame()
    tf_nonret['fullVisitorId'] = list(set(tf['fullVisitorId']) - tf_fvid)    
    tf_nonret['target'] = 0
    
    tf_target = pd.concat([tf_target, tf_nonret], axis=0).reset_index(drop=True)

    tf_maxdate = max(tf['date'])
    tf_mindate = min(tf['date'])

    tf = tf.groupby('fullVisitorId').agg({
        'channelGrouping': [('channelGrouping_mode', get_most_common)],
        'date': [
            ('firstSes', 'min'), 
            ('lastSes', 'max'),
            ('unique', 'nunique')
        ],
        'visitNumber': [('visitNumber_max', 'max')],
        'device.browser': [('device.browser_mode',  get_most_common)],
        'geoNetwork.city': [('geoNetwork.city_mode',  get_most_common)],
        'totals.bounces': [('totals.bounces_mode',  get_most_common)],
        'trafficSource.source': [('trafficSource.source_mode',  get_most_common)],
        'trafficSource.referralPath': [('trafficSource.referralPath_mode',  get_most_common)],
        'trafficSource.medium': [('trafficSource.medium_mode',  get_most_common)],
        'trafficSource.isTrueDirect': [('trafficSource.isTrueDirect_mode',  get_most_common)],
        'trafficSource.adwordsClickInfo.isVideoAd': [('trafficSource.adwordsClickInfo.isVideoAd_mode',  get_most_common)],
        'device.operatingSystem': [('device.operatingSystem_mode',  get_most_common)],
        'device.isMobile': [('device.isMobile_mode',  get_most_common)],
        'device.deviceCategory': [('device.deviceCategory_mode',  get_most_common)],
        'geoNetwork.metro': [('geoNetwork.metro_mode',  get_most_common)],
        'geoNetwork.networkDomain': [('geoNetwork.networkDomain_mode',  get_most_common)],
        'geoNetwork.region': [('geoNetwork.region_mode',  get_most_common)],
        'geoNetwork.subContinent': [('geoNetwork.subContinent_mode',  get_most_common)],
        'totals.totalTransactionRevenue': [('totals.totalTransactionRevenue_sum', 'sum')],
        'totals.transactions': [('totals.transactions_sum', 'sum')],
        'totals.timeOnSite': [
            ('timeOnSite_sum', 'sum'),
            ('timeOnSite_min', 'min'),
            ('timeOnSite_max', 'max'),
        ],
        'totals.sessionQualityDim': [
            ('sessionQualityDim_max', 'max'),
            ('sessionQualityDim_mean', 'mean'),
            ('sessionQualityDim_min', 'min'),
        ],
        'totals.pageviews': [
            ('pageviews_sum', 'sum'),
            ('pageviews_min', 'min'),
            ('pageviews_max', 'max'),
            ('pageviews_mean', 'mean'),
        ],
        'totals.newVisits': [('totals.newVisits',  get_most_common)],
        'totals.hits': [
            ('hits_sum', 'sum'),
            ('hits_min', 'min'), 
            ('hits_max', 'max'), 
            ('hits_mean', 'mean'),
        ],
        'geoNetwork.country': [('geoNetwork.country_mode',  get_most_common)],
        'geoNetwork.continent': [('geoNetwork.continent_mode',  get_most_common)],
        'customDimensions.value': [('customDimensions.value_mode',  get_most_common)]
    })

    tf.columns = tf.columns.droplevel()
    # express the session dates as day counts relative to the window edges
    tf['interval'] = (tf['lastSes'] - tf['firstSes']).dt.total_seconds() / 86400
    tf['firstSes'] = (tf['firstSes'] - tf_mindate).dt.total_seconds() / 86400
    tf['lastSes'] = (tf_maxdate - tf['lastSes']).dt.total_seconds() / 86400

    tf = pd.merge(tf, tf_target, left_on='fullVisitorId', right_on='fullVisitorId')
    return tf

In [None]:
%%time
df1 = getTimeFramewithFeatures(df, k=1)
df2 = getTimeFramewithFeatures(df, k=2)
df3 = getTimeFramewithFeatures(df, k=3)
df4 = getTimeFramewithFeatures(df, k=4)
train = pd.concat([df1, df2, df3, df4], ignore_index=True)
train



CPU times: user 5min 58s, sys: 33.3 ms, total: 5min 58s
Wall time: 5min 58s


In [None]:
sr = train.dtypes
# skip the first object column (presumably fullVisitorId, which must stay a string key)
lst_cat = sr[sr == 'object'].index.tolist()[1:]

train[lst_cat] = train[lst_cat].apply(lambda x: pd.factorize(x, sort=True)[0])
train.to_csv('output/self_ft_self_k/train_pre_date.csv', index=False)

In [None]:
tf.to_csv('output/self_ft_self_k/X_test_date.csv')