## 1. Meta Data Summary

Source: https://www.kaggle.com/c/outbrain-click-prediction/forums/t/24332/data-summary-all-tables

In [82]:
import pandas as pd
summary = pd.read_csv('data_summary.csv')
summary.head()

Unnamed: 0,table_name,group_by,group_by_value,target_column,column_number,data_type,row_count,distinct_values,missing_values,blank_values,...,fraction_blank,mean,variance,min,max,first_quartile,median,third_quartile,most_frequent_values,mfv_frequencies
0,clicks_train,,,display_id,1,text,87141731,16874593,0,0.0,...,0.0,,,1.0,8.0,,,,"{1830897,1834278,1836704,1836878,1834701,18328...","{85403,85363,85352,85297,85268,85267,85252,851..."
1,clicks_train,,,ad_id,2,text,87141731,478950,0,0.0,...,0.0,,,1.0,6.0,,,,"{173005,123742,180923,151028,173006,138353,347...","{269177,246080,231818,228730,219236,195557,186..."
2,clicks_train,,,clicked,3,text,87141731,2,0,0.0,...,0.0,,,1.0,1.0,,,,"{0,1}","{70267138,16874593}"
3,documents_categories,,,document_id,1,text,5481475,2828649,0,0.0,...,0.0,,,1.0,7.0,,,,"{1405282,1530506,2063154,1439841,1629124,17452...","{5396,5396,5377,5370,5370,5369,5368,5362,5356,..."
4,documents_categories,,,category_id,2,text,5481475,97,0,0.0,...,0.0,,,4.0,4.0,,,,"{1403,1702,1902,1513,1808,1100,1907,2004,1408,...","{572107,408499,292878,276203,241966,212249,181..."


### !!! The 'document_id' in events.csv and the 'document_id' in promoted_content.csv do not refer to the same thing

In [14]:
mask = summary['target_column']=='document_id'

summary.loc[mask,['table_name','row_count','distinct_values','most_frequent_values']]

Unnamed: 0,table_name,row_count,distinct_values,most_frequent_values
3,documents_categories,5481475,2828649,"{1405282,1530506,2063154,1439841,1629124,17452..."
6,documents_meta,2999334,2999334,"{1405282,2063154,1530506,1629124,1636523,16040..."
10,documents_topics,11325960,2495423,"{1033527,827899,279563,1438841,1496084,1545884..."
15,events,23120126,894060,"{1179111,394689,1827718,38922,7054,1788295,210..."
20,page_views,2034275448,2885425,"{1179111,394689,2191,7054,38922,1154100,357569..."
26,page_views_sample,9999999,59849,"{1811567,234,42744,1858440,1780813,60164,17904..."
32,promoted_content,559583,185709,"{1804537,1617964,1109881,1150712,1109919,11332..."
37,documents_entities,5537552,1791420,"{2285116,2388474,289047,2468402,1384898,277672..."


The 'document_id' in events.csv refers to the article that a user browsed (e.g. a news article on cnn.com).

The 'document_id' in promoted_content.csv refers to the page that a click on the ad leads to (the 'ad landing page' in web analysts' vocabulary).
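To keep the two roles apart when joining tables, it can help to rename the columns before merging. The sketch below uses tiny made-up frames standing in for clicks_train.csv, events.csv, and promoted_content.csv; the names `page_document_id` and `landing_document_id` are our own and do not appear in the dataset.

```python
import pandas as pd

# Toy stand-ins for the real CSVs (all values are made up for illustration).
clicks = pd.DataFrame({'display_id': [1, 1, 2],
                       'ad_id': [10, 11, 10],
                       'clicked': [0, 1, 1]})
events = pd.DataFrame({'display_id': [1, 2], 'document_id': [500, 501]})
promoted = pd.DataFrame({'ad_id': [10, 11], 'document_id': [900, 901]})

# Rename before merging so the two document_id columns cannot be confused.
events = events.rename(columns={'document_id': 'page_document_id'})
promoted = promoted.rename(columns={'document_id': 'landing_document_id'})

# One row per (display, ad) pair, with both the page and the landing document.
joined = (clicks
          .merge(events, on='display_id', how='left')
          .merge(promoted, on='ad_id', how='left'))
print(joined)
```

Without the rename, pandas would suffix the clashing columns as `document_id_x` and `document_id_y`, which is easy to misread later.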

## 2. Evaluation (Mean Average Precision @ 12)

#### Remember the MAP@12 problem we discussed at our last meeting? It has been answered in the post 'The correct order of NOT clicked ADs' on the Kaggle forum.

Source: https://www.kaggle.com/c/outbrain-click-prediction/forums/t/24467/the-correct-order-of-not-clicked-ads

Q: I am wondering how Outbrain (or anybody) can know the correct order of the not clicked ADs since the user clicked just a single AD from any given display_id? Probabilistic exercises are all well and good, but how the exact order can be decided apart from that?

A: Average precision is independent of orders of no-click. It is all about the order/rank of the click ad : )
   avg precision of every display_id is basically (1/1-based-position of the clicked ad). MAP is the mean of ap across all display_ids.

In [None]:
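import numpy as np

def apk(actual, predicted, k=12):
    """Average precision at k for one display_id.

    actual: list of clicked ad_ids; predicted: ad_ids ranked best first.
    """
    if not actual:
        return 0.0
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant ad appears.
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    """Mean of apk over all display_ids."""
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])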
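Because each display_id has exactly one clicked ad, the forum answer above reduces average precision to the reciprocal rank of that ad. Here is a small sketch of that special case; `ap_single_click` is a hypothetical helper name, not part of the competition code.

```python
def ap_single_click(clicked_ad, ranked_ads, k=12):
    """AP@k when exactly one ad was clicked.

    Returns 1 / (1-based rank of the clicked ad), or 0.0 if the ad
    is not among the first k predictions.
    """
    for position, ad in enumerate(ranked_ads[:k], start=1):
        if ad == clicked_ad:
            return 1.0 / position
    return 0.0

# One display with four candidate ads; ad 'c' was clicked.
first = ap_single_click('c', ['c', 'a', 'b', 'd'])   # clicked ad ranked first
third = ap_single_click('c', ['a', 'b', 'c', 'd'])   # clicked ad ranked third
# Swapping the non-clicked ads does not change the score.
swapped = ap_single_click('c', ['b', 'a', 'c', 'd'])
print(first, third, swapped)
```

Only the position of the clicked ad matters, so a model only has to push that ad as high as possible within each display.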

## 3. Geo Info

Source: https://www.kaggle.com/andreyg/outbrain-click-prediction/explore-user-base-by-geo

In [84]:
import pycountry as pc

# Build a dict that maps alpha-2 codes to country names
# (current pycountry releases name the attribute alpha_2; older ones used alpha2)
alpha2_code = {country.alpha_2: country.name for country in pc.countries}

#### 3.1 get country

In [None]:
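# np.str was removed from recent NumPy releases; the built-in str works as a dtype
df = pd.read_csv("../input/page_views.csv", usecols=['uuid', 'geo_location'],
                 dtype={'uuid': str, 'geo_location': str})
df.dropna(inplace=True)
# Drop codes that are not countries ('EU' and '--')
df = df.loc[~df.geo_location.isin(['EU', '--']), :]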
df = df.drop_duplicates('uuid', keep='first')

df['geo_location'] = df['geo_location'].str[:2]

#### 3.2 get state

In [None]:
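# Needs the full geo_location string, so run this before geo_location is cut to two characters in 3.1
usa = df.loc[df.geo_location.str[:2] == 'US', ['uuid', 'geo_location']].copy()

usa.columns = ['uuid', 'State']

# Characters 3-4 hold the state code (e.g. 'US>CA>803' -> 'CA')
usa['State'] = usa['State'].str[3:5]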
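The slicing above relies on the layout of `geo_location`, which we assume to be `country>state>DMA` with trailing parts optionally missing (e.g. `US>CA>803`, `GB>H9`, `US`). The sketch below splits on `>` instead of fixed character positions; `parse_geo` is our own helper, not something from the kernel.

```python
def parse_geo(geo_location):
    """Split 'country>state>DMA' into a 3-tuple.

    Missing trailing parts are returned as None, so 'US' gives
    ('US', None, None).
    """
    parts = geo_location.split('>')
    parts += [None] * (3 - len(parts))
    return tuple(parts[:3])

for raw in ['US>CA>803', 'GB>H9', 'US']:
    print(raw, '->', parse_geo(raw))
```

With pandas, the same split can be done column-wise via `df['geo_location'].str.split('>', expand=True)`.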

## 4. Time

#### method 1 (btw this is from the kernel 'unveiling-page-views-csv-with-pyspark'. We can look into it more once we start studying Spark in our Data Management course next week!)

Source: https://github.com/gabrielspmoreira/static_resources/blob/gh-pages/Kaggle-Outbrain-PageViews_EventsAnalytics.ipynb

In [None]:
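from datetime import datetime

def convert_odd_timestamp(timestamp_ms_relative):
    # Timestamps are milliseconds relative to this offset (value from the source notebook)
    TIMESTAMP_DELTA = 1465876799998
    for i in timestamp_ms_relative:
        # fromtimestamp() returns local time on the machine running the code
        yield datetime.fromtimestamp((int(i) + TIMESTAMP_DELTA) // 1000)

# df must include the 'timestamp' column (e.g. loaded from events.csv or page_views.csv)
for i in convert_odd_timestamp(df['timestamp'].values):
    print(i)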

#### method 2

In [None]:
df["hour"] = (df.timestamp // (3600 * 1000)) % 24
df["day"] = df.timestamp // (3600 * 24 * 1000)
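The two methods can be checked against each other on a few values. The sketch below assumes the offset `1465876799998` from method 1 and converts explicitly in UTC, since plain `datetime.fromtimestamp` uses the local time zone of the machine. The function names are ours.

```python
from datetime import datetime, timezone

TIMESTAMP_DELTA = 1465876799998  # ms offset taken from method 1

def to_utc_datetime(timestamp_ms_relative):
    """Method 1: absolute UTC datetime for a relative millisecond timestamp."""
    seconds = (int(timestamp_ms_relative) + TIMESTAMP_DELTA) // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def relative_hour_and_day(timestamp_ms_relative):
    """Method 2: hour and day counted from timestamp 0, not from midnight."""
    hour = (timestamp_ms_relative // (3600 * 1000)) % 24
    day = timestamp_ms_relative // (3600 * 24 * 1000)
    return hour, day

# A timestamp one day and one hour after the start of the data.
one_day_one_hour = 25 * 3600 * 1000
print(to_utc_datetime(one_day_one_hour))
print(relative_hour_and_day(one_day_one_hour))
```

Under this offset, timestamp 0 falls just before 04:00 UTC on 2016-06-14, so the `hour` from method 2 is shifted relative to the UTC clock hour, and a `day` bucket does not start at UTC midnight.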

## 5. Cross Validation

Source: https://www.kaggle.com/c/outbrain-click-prediction/forums/t/24255/cv-vs-lb

Problem: a normal CV split is not a representative sample of the test data (see the train/test distributions in @joconnor's kernel)

Solution:
1. CV using random split (estimation for 'present' data)
2. CV using time based split (estimation for 'future' data)
3. Final CV = weighted mean from 1 and 2

Train: 80% of the train data from days 0 to 10

Validation: the remaining 20% of the train data from days 0 to 10, plus all of day 11 and day 12
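The split above can be sketched as follows. This is only an illustration under our own assumptions: the split is done per display_id (so all ads of one display stay together), the random part holds out roughly 20% instead of exactly 20%, and the weight in `final_cv` is a placeholder because the forum post does not give one.

```python
import random

def split_display_ids(display_days, holdout_days=(11, 12),
                      valid_fraction=0.2, seed=0):
    """Split display_ids into train and validation lists.

    display_days maps display_id -> day index. Every display from a
    holdout day goes to validation ('future' data); displays from the
    other days go to validation with probability valid_fraction
    ('present' data).
    """
    rng = random.Random(seed)
    train_ids, valid_ids = [], []
    for display_id, day in sorted(display_days.items()):
        if day in holdout_days or rng.random() < valid_fraction:
            valid_ids.append(display_id)
        else:
            train_ids.append(display_id)
    return train_ids, valid_ids

def final_cv(random_score, time_score, weight=0.5):
    """Weighted mean of the random-split and time-split scores."""
    return weight * random_score + (1 - weight) * time_score

# Toy data: 1300 displays spread evenly over days 0..12.
display_days = {i: i % 13 for i in range(1300)}
train_ids, valid_ids = split_display_ids(display_days)
print(len(train_ids), len(valid_ids))
```

Scoring the validation rows from days 0 to 10 and from days 11 to 12 separately yields the two numbers that `final_cv` combines.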

## 6. Leakage

See https://www.kaggle.com/its7171/outbrain-click-prediction/leakage-solution/discussion for an explanation.

These kernels go through other possible ways to get the leak:

https://www.kaggle.com/its7171/outbrain-click-prediction/is-landing-access-for-ad-clicks-in-page-views-csv
    
https://www.kaggle.com/its7171/outbrain-click-prediction/leakage-solution/code
    
https://www.kaggle.com/jiweiliu/outbrain-click-prediction/extract-leak-in-30-mins-with-small-memory

## 7. Model/algorithm we should probably pay attention to

Source: https://www.kaggle.com/c/outbrain-click-prediction/forums/t/24595/a-war-of-ffms
        
@rcarson in the post 'A war of FFMs': 

"For people who has only small machine like me, the key is the willingness to go low level and customized classifiers. As you all know, FFM is the nuclear weapon for CTR. Currently I'm developing my own FFM in c++. I can get 0.683 LB with 5 GB memory, 2 hours with a single FFM (excluding feature engineering time). My ffm is largely based on Libffm which is a wonderful library. I managed to improve its out of core performance and add some tricks to improve both apk and speed."

There is already some Python code implementing a factorization machine (FM) for this competition

Check https://www.kaggle.com/qqgeogor/outbrain-click-prediction/keras-based-fm

Also, someone has written a Python wrapper for libFM (the FM library, not Libffm)!!!

https://github.com/jfloff/pywFM