# DSI Capstone - Part 2 

### Summary
For my capstone project, I will be entering the Avazu Click-Through Rate Prediction competition on Kaggle (https://www.kaggle.com/c/avazu-ctr-prediction).  This competition ended three years ago but is still open for late submissions.

### Goals
My primary goal is to achieve a score that would have placed in the top third of the leaderboard during the contest. That requires a score of 0.3930207 or better, which would have ranked 534th of 1604 entrants (the precision matters in this contest, where many scores are separated by very small margins). The top score in this contest was 0.3791384. Submissions are evaluated by logarithmic loss, so lower scores are better.
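
To make the metric concrete, here is a minimal sketch of binary logarithmic loss in NumPy. The function name `log_loss`, the clipping constant `eps`, and the toy labels are assumptions for illustration; Kaggle's scorer may differ in details such as clipping. The toy labels use a 17% positive rate only to mirror the class balance observed later in this notebook.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so log(0) never occurs for over-confident values
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

# Toy labels with a 17% positive rate (illustrative, not competition data)
y = np.array([1] * 17 + [0] * 83)

# Always predicting the base rate gives a simple reference score
baseline = log_loss(y, np.full(len(y), 0.17))

# Confidently wrong predictions are punished very heavily
confident_wrong = log_loss(y, 1 - y)

print(f"constant-rate baseline: {baseline:.4f}")
print(f"confidently wrong:      {confident_wrong:.4f}")
```

A constant prediction equal to the base click rate is a useful floor: any model worth keeping should score below it.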

A secondary goal is to understand Neural Networks in depth, since this is one of the modeling techniques I plan to use.

A potential third (stretch) goal would be to test whether a model pipeline developed for this contest could be quickly repurposed for a different click prediction contest (e.g., https://www.kaggle.com/c/criteo-display-ad-challenge).

### Data
This is a classification problem where the goal is to predict whether a user will click a given text ad. There are ~40MM rows in the training data set, representing 10 days of data from the Avazu site, and ~4MM rows in the test set, representing one day of data.  

Preliminary data analysis follows in this notebook. In general, the data appears to be well-formatted and consistent, with no null values in either the training or the test set.


### Methodology
I plan to begin with a Neural Network approach, based on my interest in learning more about NN topology. I am planning to use a Count Binarization ("bin counting") technique to represent high-cardinality categorical values such as Device ID and Device IP with conditional click probabilities. I am considering separate pipelines for ads which the model has seen before and ads which are new. All models will be cross-validated in training.
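
One way the conditional-probability encoding could look is sketched below. This is an illustration under assumptions, not the final pipeline: the helper names `fit_click_rates` and `apply_click_rates`, the `prior_weight` smoothing parameter, and the toy `device_ip` values are all hypothetical. Each category is replaced by a smoothed estimate of P(click | category), and categories never seen in training fall back to the global click rate.

```python
import pandas as pd

def fit_click_rates(df, col, target="click", prior_weight=20):
    """Smoothed P(click | category), computed from training rows only."""
    global_rate = df[target].mean()
    stats = df.groupby(col)[target].agg(["sum", "count"])
    # Additive smoothing: rare categories shrink toward the global rate
    rates = (stats["sum"] + prior_weight * global_rate) / (stats["count"] + prior_weight)
    return rates, global_rate

def apply_click_rates(series, rates, global_rate):
    """Map categories to learned rates; unseen categories get the global rate."""
    return series.map(rates).fillna(global_rate)

# Toy training frame (hypothetical values, not rows from the Avazu data)
toy_train = pd.DataFrame({
    "device_ip": ["a", "a", "a", "b", "b", "c"],
    "click":     [1,   0,   1,   0,   0,   1],
})

rates, global_rate = fit_click_rates(toy_train, "device_ip", prior_weight=2)
encoded = apply_click_rates(pd.Series(["a", "c", "zzz"]), rates, global_rate)
print(encoded.tolist())
```

Because the encoding is built from the target, the rates should probably be fitted on folds that exclude the rows being encoded; otherwise cross-validation scores are likely to look better than they really are.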

The winning approaches for the Criteo competition (another click prediction competition) included interesting techniques such as Field-aware Factorization Machines and Vowpal Wabbit's implementation of Logistic Regression. One of these might be an interesting alternative to test once the Neural Network work reaches a good point.

With 6GB of training data, I anticipate that model training and feature engineering may take time, depending on the approaches involved. I plan to create smaller subsets of the training data for initial model fitting and evaluation, but I do anticipate that final model training will be time-intensive. If necessary, I may look for a cloud instance with more cores/RAM than my local system, which is a dual-core MacBook Air with 8GB of RAM.
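
A smaller working set can be drawn without loading the full file into memory by reading it in chunks. The sketch below is one possible approach; the helper name `sample_csv`, the sampling fraction, and the chunk size are assumptions, and a tiny in-memory CSV stands in for the real training file so the example runs anywhere.

```python
import io
import pandas as pd

def sample_csv(source, frac=0.01, chunksize=1_000_000, seed=42):
    """Read a large CSV in chunks and keep a random fraction of each chunk."""
    parts = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Vary the seed per chunk so chunks are not all sampled the same way
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)

# Tiny in-memory stand-in for the real training file (hypothetical rows)
rows = "\n".join(f"{i},{int(i % 5 == 0)}" for i in range(1000))
demo_csv = io.StringIO("id,click\n" + rows)

small = sample_csv(demo_csv, frac=0.1, chunksize=250)
print(small.shape)
```

For the real data the call might look like `sample_csv('./assets/train', frac=0.01)`. Uniform row sampling ignores time order, so a sample taken by day may be a better fit if validation is meant to mimic the train/test split.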

In [1]:
# Exploratory Data Analysis for DSI Capstone Project

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix, autocorrelation_plot
import seaborn as sns

In [None]:
sample = pd.read_csv('./assets/sampleSubmission')
train = pd.read_csv('./assets/train')
test = pd.read_csv('./assets/test')

# note that we have ~10x more data in our training data set than in our testing data set
print(f"Sample submission shape: {sample.shape} - Train shape: {train.shape} - Test shape: {test.shape}")

In [None]:
# Examining the sample submission format

sample.head(1)

In [None]:
# Examining the training data format

train.head(1)

In [None]:
# Examining the test data format - it looks identical to train except for the "click" column, as expected

test.head(1)

In [None]:
# Verifying that only "click" is present in train but not in test

[item for item in list(train.columns.values) if item not in list(test.columns.values)]

In [None]:
# And that there are no columns in test which are not in train

[item for item in list(test.columns.values) if item not in list(train.columns.values)]

In [None]:
# Checking nulls in training data - looks good

train.isnull().sum().sum()

In [None]:
# And in test data - also looking good

test.isnull().sum().sum()

In [None]:
# Reviewing data types 

train.info()

In [None]:
test.info()

In [None]:
# Our class balance is 83% no-click / 17% click: imbalanced, although not as extreme as in cases like disease detection
# TODO: click-through rates are typically well below 17% - worth reviewing the data notes to understand this better

train.click.value_counts(normalize=True)

In [None]:
# Getting a sense for the different columns in our dataset

In [None]:
# How many unique values does each column contain?

# nunique counts distinct values per column; dropna=False matches len(col.unique())
unique_values = train.nunique(dropna=False).to_frame(name="unique_vals")
unique_values

In [None]:
# Let's look at the numeric columns we have
numeric_cols = train.select_dtypes(exclude='object').drop(columns=['id'])


In [None]:
numeric_cols.head()
#sns.pairplot(numeric_cols)

In [None]:
# Let's examine the object columns in more detail since they will need to be transformed

# these all look like anonymized categorical values... pretty straightforward
train.select_dtypes(include='object').head(10)

In [None]:
# looking into Hour... looks like it's a sequential list of dates with hour identifiers from 00 to 23 
hour_counts = train.groupby(['hour']).size().reset_index(name='counts')
hour_counts.head(24)

In [None]:
# specifically this looks like observations drawn over a 10 day period
# hour-of-the-day, day-of-the-week, and day-of-the-month may be relevant

# "Day One" events by hour
plt.xscale('linear')
plt.plot(hour_counts.hour[0:24], hour_counts.counts[0:24])

In [None]:
# "Day Two" events by hour
plt.plot(hour_counts.hour[24:48], hour_counts.counts[24:48])
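
The calendar features mentioned above (hour-of-the-day, day-of-the-week, day-of-the-month) could be derived as in the following sketch. It assumes the `hour` column encodes YYMMDDHH, which the preview above suggests but this notebook does not confirm; the helper name `add_time_features` and the sample values are hypothetical.

```python
import pandas as pd

def add_time_features(df, col="hour"):
    """Split a YYMMDDHH-style value into calendar features (format assumed)."""
    ts = pd.to_datetime(df[col].astype(str), format="%y%m%d%H")
    out = df.copy()
    out["hour_of_day"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0
    out["day_of_month"] = ts.dt.day
    return out

# Illustrative values only, written in the assumed YYMMDDHH format
demo = add_time_features(pd.DataFrame({"hour": [14102100, 14102123, 14102200]}))
print(demo)
```

With only about ten days of training data, day-of-the-month is unlikely to generalize to the test day, so hour-of-the-day and day-of-the-week are probably the more useful of the three.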

In [None]:
# Initial hypothesis: BalancedBaggingClassifier (from imbalanced-learn) may perform well.
# Experiment: "bin counting" with probabilities to replace categoricals such as IP / device ID