![title](http://oi65.tinypic.com/2nvf248.jpg)

In [1]:
# Numerical Python and Pandas for data manipulation
import numpy as np
import pandas as pd

In [2]:
# Necessary libraries from Sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [3]:
# Visualization
from matplotlib import pyplot
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

In [4]:
# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
import xgboost
from xgboost import plot_importance

In [5]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Import data set from Kaggle Competition
Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fradulent traffic.

The dataset contains approximately 200,000,000 clicks over 4 days!

In [6]:
# Thanks to Bojan Tunguz on Kaggle for suggesting to skip the first 150 million rows.
train = pd.read_csv("train.csv", skiprows=150000000, nrows=50000000)

In [7]:
train.columns =
['ip', 'app', 'device', 'os', 'channel', 'click_time', 'attributed_time', 'is_attributed']

In [8]:
train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,111186,12,1,22,178,2017-11-09 05:19:45,,0
1,143839,12,1,17,178,2017-11-09 05:19:45,,0
2,87609,18,1,19,107,2017-11-09 05:19:45,,0
3,123924,2,1,25,469,2017-11-09 05:19:45,,0
4,105861,29,1,19,343,2017-11-09 05:19:45,,0


# Exploring Fraudulent Cases
The basic goal is to see a few rows where fraud (is_attributed = 1) happened. 

Something important stands out: transactions happen 24/7. It's possible, that we need to feature engineer the timestamp with an advanced technique called Sin / Cosine. The reason for this is that the traditional approach of breaking up date data into multiple features might not be sufficient as e.g. 5 minutes before midnight and 5 minutes after midnight are pretty close. By using the traditional Panda date data handling, we would miss this.

In [9]:
# Show the top rows where column 'is_attributed' equals 1
train[train['is_attributed'] == 1].head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
412,322273,29,1,14,213,2017-11-09 05:19:45,2017-11-09 05:21:39,1
627,324343,45,1,61,419,2017-11-09 05:19:45,2017-11-09 05:19:53,1
1212,356183,19,0,24,213,2017-11-09 05:19:46,2017-11-09 05:24:41,1
1329,47595,9,1,15,334,2017-11-09 05:19:46,2017-11-09 07:12:10,1
1344,77041,10,1,13,317,2017-11-09 05:19:46,2017-11-09 09:20:20,1


In [10]:
train.shape

(34903890, 8)

# Class Distribution

In [11]:
# How is the target distributed? 
target_counts = train.groupby('is_attributed').size()
target_counts

is_attributed
0    34815719
1       88171
dtype: int64

# Breaking Up Date Data into Multiple Features
The goal is to break up the timestamp into seperate columns of year, month, day, hour and minute.

In [67]:
df = pd.DataFrame(data=train, columns=['click_time'])

In [68]:
df.click_time = pd.to_datetime(df.click_time)

In [69]:
df['new_formatted_date'] = df.click_time.dt.strftime('%d/%m/%y %H:%M')

In [70]:
df.new_formatted_date.head(3)

0    09/11/17 05:19
1    09/11/17 05:19
2    09/11/17 05:19
Name: new_formatted_date, dtype: object

In [71]:
# pandas.Series.dt
df['month'] = df.click_time.dt.month
df['day'] = df.click_time.dt.day
df['year'] = df.click_time.dt.year
df['hour'] = df.click_time.dt.hour
df['minute'] = df.click_time.dt.minute
df.head(3)

Unnamed: 0,click_time,new_formatted_date,month,day,year,hour,minute
0,2017-11-09 05:19:45,09/11/17 05:19,11,9,2017,5,19
1,2017-11-09 05:19:45,09/11/17 05:19,11,9,2017,5,19
2,2017-11-09 05:19:45,09/11/17 05:19,11,9,2017,5,19


In [72]:
print('Unique values of month:', df.month.unique())
print('Unique values of day:', df.day.unique())
print('Unique values of year:', df.year.unique())
print('Unique values of hour:', df.hour.unique())
print('Unique values of minute:', df.minute.unique())

Unique values of month: [11]
Unique values of day: [9]
Unique values of year: [2017]
Unique values of hour: [ 5  6  7  8  9 10 11 12 13 14 15 16]
Unique values of minute: [19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59  0  1  2  3  4  5  6
  7  8  9 10 11 12 13 14 15 16 17 18]


We only have a single month and year. Therefore, we only care about day/hour/minute as a feature.

# The Magic
Now the magic happens. We map each cyclical variable onto a circle such that the lowest value for that variable appears right next to the largest value. We compute the x- and y- component of that point using sin and cos trigonometric functions. You remember your unit circle, right? Here's what it looks like for the "hours" variable. Zero (midnight) is on the right, and the hours increase counterclockwise around the circle. In this way, 23:59 is very close to 00:00, as it should be.

Source: David Kaleko (http://blog.davidkaleko.com/feature-engineering-cyclical-features.html)

![title](http://i65.tinypic.com/2akh56x.jpg)

In [73]:
# Day
df['day_sin'] = np.sin(df.day*(2.*np.pi/30))
df['day_cos'] = np.cos(df.day*(2.*np.pi/30))

In [74]:
# Hour
df['hour_sin'] = np.sin(df.day*(2.*np.pi/24))
df['hour_cos'] = np.cos(df.day*(2.*np.pi/24))

In [75]:
# Minute
df['minute_sin'] = np.sin(df.day*(2.*np.pi/60))
df['minute_cos'] = np.cos(df.day*(2.*np.pi/60))

# Concatenate Data Frames

In [76]:
# Concatenate
concatenated = pd.concat([train, df], axis=1)

# Define X and y

In [77]:
# Define y = is_attributed
y = concatenated['is_attributed']

In [78]:
# Sanity check
y.head(3)

0    0
1    0
2    0
Name: is_attributed, dtype: int64

In [79]:
# Define X
X = concatenated[['ip', 'app', 'device', 'os', 'channel', 'day_sin','day_cos', 'hour_sin', 'hour_cos',
                  'minute_sin', 'minute_cos']]

In [80]:
# Sanity check
X.head(3)

Unnamed: 0,ip,app,device,os,channel,day_sin,day_cos,hour_sin,hour_cos,minute_sin,minute_cos
0,111186,12,1,22,178,0.951057,-0.309017,0.707107,-0.707107,0.809017,0.587785
1,143839,12,1,17,178,0.951057,-0.309017,0.707107,-0.707107,0.809017,0.587785
2,87609,18,1,19,107,0.951057,-0.309017,0.707107,-0.707107,0.809017,0.587785


# Split Data into Training and Test Sets


In [81]:
validation_size = 0.10 # 10% is enough for this large dataset
seed = 99
X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size = validation_size, 
                                                     random_state=seed, shuffle=True )

In [82]:
num_folds = 5
kfold = KFold(n_splits=5, random_state=seed)

# Machine Learning Algorithm

In [93]:
# fit the model to the training data
model = XGBClassifier(base_score=0.5,
                      booster='gbtree',
                      colsample_bylevel=1,
                      colsample_bytree=1, 
                      gamma=0, 
                      learning_rate=0.1, 
                      max_delta_step=0, 
                      max_depth=4, 
                      min_child_weight=100, 
                      missing=None, 
                      n_estimators=40,
                      n_jobs=1, 
                      nthread=None, 
                      objective='binary:logistic', 
                      random_state=99,
                      reg_alpha=0, 
                      reg_lambda=1, 
                      scale_pos_weight=150, 
                      seed=None,
                      silent=True, 
                      subsample=1)

In [94]:
model.fit(X_train, y_train)

[14:51:40] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=4, min_child_weight=100, missing=None, n_estimators=40,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=99, reg_alpha=0, reg_lambda=1, scale_pos_weight=150,
       seed=None, silent=True, subsample=1)

In [95]:
num_folds = 5
seed = 99

kfold = KFold(n_splits=num_folds, random_state=seed)
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 98.137% (0.019%)


In [96]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [97]:
from sklearn.metrics import classification_report
predicted = model.predict(X_test)
report = classification_report(y_test, predicted)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.98      0.99   3481675
           1       0.10      0.83      0.18      8714

   micro avg       0.98      0.98      0.98   3490389
   macro avg       0.55      0.91      0.59   3490389
weighted avg       1.00      0.98      0.99   3490389



In [98]:
scoring = 'roc_auc'
results = cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

[15:11:34] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[15:24:21] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[15:37:12] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[15:49:50] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[16:02:25] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
AUC: 0.957 (0.007)
