# Click-Through Rate Prediction

> In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.
>
> [Link](https://www.kaggle.com/c/avazu-ctr-prediction)

## Libraries

 * **NumPy:** Useful for linear algebra and other mathematical utilities
 * **Pandas:** Library that enables working with dataframes
 * **Dask:** Provides functionality that mimics numpy arrays and pandas dataframes, while performing out-of-core computations
 * **Matplotlib:** Useful for fast and non-interactive visualizations
 * **Plotly:** Visualization library, with a lot of interactive functionality
 * **scikit-learn:** Library of machine learning algorithms, useful for exploratory and predictive data analysis

Start by clearing variables from previous runs:

In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [27]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import matplotlib
import matplotlib.pyplot as plt
import sklearn

Set Matplotlib and Plotly to render inline throughout the notebook:

In [25]:
# Matplotlib
%matplotlib inline

# Plotly
init_notebook_mode(connected=True)

Beautify Matplotlib:

In [57]:
matplotlib.style.use('ggplot')

Import the dataset into a Pandas dataframe. First, define a parser for the `hour` column, whose values are formatted as YYMMDDHH:

In [58]:
# Parse 'YYMMDDHH' strings such as '14102100' into timestamps
date_parser = lambda x: pd.to_datetime(x, format='%y%m%d%H')
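
As a quick sanity check of the format string, the sketch below parses a single value. The sample string `'14102100'` is an assumption chosen to match the first `hour` shown in the dataframe head (2014-10-21); it is not read from the file here.

```python
from datetime import datetime

import pandas as pd

# Assumed sample value: two digits each for year, month, day and hour.
raw_hour = '14102100'

# pandas and the standard library should agree on the same format string.
parsed = pd.to_datetime(raw_hour, format='%y%m%d%H')
reference = datetime.strptime(raw_hour, '%y%m%d%H')

print(parsed)  # 2014-10-21 00:00:00
```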

Data types for the features. These were chosen by inspecting the data contents and, for the integer columns, the range of values each one takes.

In [59]:
# Use the built-in str and object; the np.str and np.object aliases
# no longer exist in NumPy >= 1.24.
data_types = {
    'id': str,
    'click': np.bool_,
    'hour': str,
    'C1': np.uint16,
    'banner_pos': np.uint16,
    'site_id': object,
    'site_domain': object,
    'site_category': object,
    'app_id': object,
    'app_domain': object,
    'app_category': object,
    'device_id': object,
    'device_ip': object,
    'device_model': object,
    'device_type': np.uint16,
    'device_conn_type': np.uint16,
    'C14': np.uint16,
    'C15': np.uint16,
    'C16': np.uint16,
    'C17': np.uint16,
    'C18': np.uint16,
    'C19': np.uint16,
    # C20 ranges from -1 to 100248, so it needs a signed 32-bit type
    'C20': np.int32,
    'C21': np.uint16
}
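
One way to choose an integer type from an observed range is sketched below. The helper `smallest_int_dtype` is hypothetical (it is not part of NumPy or of this notebook), and the ranges are the `min` and `max` values reported by `train_df.describe()` further down.

```python
import numpy as np


def smallest_int_dtype(lo, hi):
    """Return the narrowest NumPy integer dtype that holds every value in [lo, hi]."""
    # Unsigned types only work when the column has no negative values.
    if lo >= 0:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        candidates = (np.int8, np.int16, np.int32, np.int64)
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return candidate
    raise ValueError(f'no integer dtype can hold the range [{lo}, {hi}]')


# (min, max) pairs copied from the describe() output of this notebook.
observed_ranges = {'C14': (375, 24052), 'C20': (-1, 100248), 'C21': (1, 255)}
suggested = {name: smallest_int_dtype(lo, hi) for name, (lo, hi) in observed_ranges.items()}
```

Under these ranges the helper suggests `uint16` for `C14`, `int32` for `C20` and `uint8` for `C21`. A wider type than the minimum is still valid; it only costs memory.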

In [7]:
%%time
train_df = pd.read_csv('./data/train/train.csv',
                       dtype=data_types,
                       parse_dates=['hour'],
                       date_parser=date_parser)


Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.



Wall time: 12min 53s
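
The full load above took almost 13 minutes. For quick aggregate checks, one option is to stream the file in pieces with the `chunksize` parameter of `pd.read_csv`, so that only one chunk is in memory at a time. The sketch below is a minimal illustration on a small in-memory CSV with made-up rows standing in for `train.csv`; the chunk size of 2 is only for demonstration.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv: same column names, invented rows.
sample_csv = io.StringIO(
    "id,click,hour\n"
    "1,0,14102100\n"
    "2,1,14102100\n"
    "3,0,14102101\n"
    "4,0,14102101\n"
    "5,1,14102102\n"
)

clicks = 0
rows = 0
# Read only the column that is needed, a few rows at a time.
for chunk in pd.read_csv(sample_csv, usecols=['click'], chunksize=2):
    clicks += int(chunk['click'].sum())
    rows += len(chunk)

ctr = clicks / rows  # overall click-through rate of the sample
```

On the real file, a much larger chunk size (an assumption: something on the order of a million rows) would be a more sensible starting point.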


Extract some basic information about the data:

In [54]:
train_df.iloc[:, 10:20].head()

| | app_category | device_id | device_ip | device_model | device_type | device_conn_type | C14 | C15 | C16 | C17 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 07d7df22 | a99f214a | ddd2926e | 44956a24 | 1 | 2 | 15706 | 320 | 50 | 1722 |
| 1 | 07d7df22 | a99f214a | 96809ac8 | 711ee120 | 1 | 0 | 15704 | 320 | 50 | 1722 |
| 2 | 07d7df22 | a99f214a | b3cf8def | 8a4875bd | 1 | 0 | 15704 | 320 | 50 | 1722 |
| 3 | 07d7df22 | a99f214a | e8275b8f | 6332421a | 1 | 0 | 15706 | 320 | 50 | 1722 |
| 4 | 07d7df22 | a99f214a | 9644d0bf | 779d90c2 | 1 | 0 | 18993 | 320 | 50 | 2161 |


In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40428967 entries, 0 to 40428966
Data columns (total 24 columns):
id                  object
click               int64
hour                datetime64[ns]
C1                  int64
banner_pos          int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type         int64
device_conn_type    int64
C14                 int64
C15                 int64
C16                 int64
C17                 int64
C18                 int64
C19                 int64
C20                 int64
C21                 int64
dtypes: datetime64[ns](1), int64(13), object(10)
memory usage: -827572848.0+ bytes
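
The memory figure reported above is negative, which cannot be a real measurement; it looks like an overflow in the summary (an assumption, not verified here). A more reliable number can usually be obtained with `memory_usage(deep=True)`, which also counts the contents of `object` columns. The sketch below uses a small made-up series of identifiers, similar in shape to `site_id`, and compares it with the same data stored as a categorical.

```python
import pandas as pd

# Hypothetical column: a few distinct hashed identifiers repeated many times.
site_ids = pd.Series(['1fbe01fe', 'fe8cc448', '1fbe01fe'] * 1000, dtype=object)

# deep=True measures the Python string objects, not just the pointers to them.
as_object = site_ids.memory_usage(deep=True)
as_category = site_ids.astype('category').memory_usage(deep=True)

print(f'object:   {as_object} bytes')
print(f'category: {as_category} bytes')
```

Columns with few distinct values relative to their length tend to benefit most from the categorical representation.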


In [None]:
train_df.head()

| | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000009418151094273 | 0 | 2014-10-21 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
| 1 | 10000169349117863715 | 0 | 2014-10-21 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 2 | 10000371904215119486 | 0 | 2014-10-21 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 3 | 10000640724480838376 | 0 | 2014-10-21 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15706 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 4 | 10000679056417042096 | 0 | 2014-10-21 | 1005 | 1 | fe8cc448 | 9166c161 | 0569f928 | ecad2386 | 7801e8d9 | ... | 1 | 0 | 18993 | 320 | 50 | 2161 | 0 | 35 | -1 | 157 |


In [56]:
train_df.describe()

| | click | C1 | banner_pos | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 | 40428970.0 |
| mean | 0.1698056 | 1004.968 | 0.2880146 | 1.015305 | 0.331315 | 18841.81 | 318.8831 | 60.10201 | 2112.601 | 1.432499 | 227.1444 | 53216.85 | 83.38229 |
| std | 0.375462 | 1.094586 | 0.506382 | 0.5274336 | 0.8547935 | 4959.457 | 21.2725 | 47.29538 | 609.4124 | 1.326227 | 351.0221 | 49956.82 | 70.28996 |
| min | 0.0 | 1001.0 | 0.0 | 0.0 | 0.0 | 375.0 | 120.0 | 20.0 | 112.0 | 0.0 | 33.0 | -1.0 | 1.0 |
| 25% | 0.0 | 1005.0 | 0.0 | 1.0 | 0.0 | 16920.0 | 320.0 | 50.0 | 1863.0 | 0.0 | 35.0 | -1.0 | 23.0 |
| 50% | 0.0 | 1005.0 | 0.0 | 1.0 | 0.0 | 20346.0 | 320.0 | 50.0 | 2323.0 | 2.0 | 39.0 | 100048.0 | 61.0 |
| 75% | 0.0 | 1005.0 | 1.0 | 1.0 | 0.0 | 21894.0 | 320.0 | 50.0 | 2526.0 | 3.0 | 171.0 | 100093.0 | 101.0 |
| max | 1.0 | 1012.0 | 7.0 | 5.0 | 5.0 | 24052.0 | 1024.0 | 1024.0 | 2758.0 | 3.0 | 1959.0 | 100248.0 | 255.0 |


In [None]:
iplot([go.Histogram(x=train_df['click'])])
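
Since `click` only takes the values 0 and 1, its mean is the overall click-through rate (about 0.17 in the `describe()` output above). A lighter alternative to passing every row to the histogram is to count the two classes first and plot only the counts. The sketch below shows the counting step on a small made-up series; the values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical click labels standing in for train_df['click'].
click = pd.Series([0, 0, 1, 0, 1, 0], name='click')

# Number of non-clicks (0) and clicks (1), ordered by label.
counts = click.value_counts().sort_index()

# Mean of a 0/1 column is the fraction of ones, i.e. the CTR.
ctr = click.mean()
```

The resulting two counts could then be passed to a bar trace such as `go.Bar(x=counts.index, y=counts.values)` instead of sending the raw column to `go.Histogram`.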