# This notebook does the following:
1. Load Criteo Terabyte Click Logs Day 15 into a pandas DataFrame
2. Process and format the data
3. Train a scikit-learn random forest model
4. Run predictions and calculate accuracy

Requirements:
pandas, numpy, scikit-learn (sklearn.ensemble, sklearn.metrics), matplotlib

Download Day 15 of the Criteo Click Logs dataset from a terminal:
wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_15.gz
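
The downloaded archive has to be decompressed before the notebook reads it. A minimal sketch of doing that from Python with the standard library, streaming so the file never has to fit in memory. The helper name `decompress_gz` and the paths in the example call are assumptions for illustration; running `gunzip` in the terminal would do the same job.

```python
import gzip
import shutil


def decompress_gz(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .gz file to dst_path (hypothetical helper)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj copies in fixed-size chunks instead of loading everything.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example call, assuming the download sits in /data:
# decompress_gz("/data/day_15.gz", "/data/day_15")
```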

In [1]:
# Optional: install the libraries below if they are not already available on the cluster.
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install scikit-learn



In [2]:
file = '/data/day_15' # decompress day_15.gz after downloading it; day_15 is a tab-separated text file.
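
A later cell reads a test set from `/data/day_15_test`, described there as the last 1M rows of day_15, but the step that creates that file is not shown. One possible way to produce it, sketched with the standard library. `write_last_rows` is a hypothetical helper, and this is not necessarily how the original test file was made.

```python
from collections import deque


def write_last_rows(src_path, dst_path, n_rows):
    """Copy the last n_rows lines of src_path into dst_path."""
    with open(src_path) as src:
        # A bounded deque keeps only the most recent n_rows lines in memory.
        tail = deque(src, maxlen=n_rows)
    with open(dst_path, "w") as dst:
        dst.writelines(tail)
    return len(tail)


# Example call, assuming the decompressed file sits in /data:
# write_last_rows("/data/day_15", "/data/day_15_test", 1_000_000)
```

For the held-out rows to stay unseen, the training read (`nrows=first_row_taken`) must stop before the rows copied here.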

In [3]:
# readline() reads only the first line of the file.
with open(file) as f:
    print(f.readline()) 

0	2	9		1		0	0	3	1	0		1036		4db5cd76	310b1fd7	bfbe69f6	bc892e1f	1315f676	6fcd6dcb	e7222fbe	b2a2bd17	25dd8f9a	2d40282b	4f91b406	a81c2672	a77a4a56	be4ee537	57469cbd	4cdc3efa	1f7fc70b	b8170bba	9512c20b	31a9f3b3	228aee9b	b74c6548	59f9dd38	165fbf32	0b3c06d0	2ccea557
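
The printed line is one tab-separated record. The header defined later in this notebook names 40 columns, the first being the click label, and the notebook keeps the next 13 as numerical features. A small sketch of splitting one such line into those groups. `parse_record` is a hypothetical helper, treating the remaining 26 fields as categorical values is an assumption based on how the sample looks, and the categorical values in `sample` are repeated placeholders rather than real data.

```python
def parse_record(line):
    """Split one tab-separated record into (label, numeric, categorical)."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    # Empty numeric fields are missing values; keep them as None here.
    numeric = [int(v) if v != "" else None for v in fields[1:14]]
    categorical = fields[14:]
    return label, numeric, categorical


# Label and numeric fields copied from the line printed above;
# the 26 categorical fields are placeholders.
sample = "0\t2\t9\t\t1\t\t0\t0\t3\t1\t0\t\t1036\t\t" + "\t".join(["4db5cd76"] * 26)
label, numeric, categorical = parse_record(sample)
print(label, numeric, len(categorical))
```

The notebook itself handles the missing values differently: it lets pandas read them as NaN and then replaces them with 0.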



In [4]:
%%time

import pandas as pd
import numpy as np
header = ['col' + str(i) for i in range(1, 41)] # 40 columns in total; according to Criteo, the first column is the click label.

first_row_taken = 50_000_000 # number of rows passed to pd.read_csv(nrows=...); lower it if your compute resources are limited.
# total number of rows in day15 is 20B
# smaller alternatives: 20M or 30M rows

"""
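This cell does the following:
1. Reads the first `first_row_taken` rows of day_15 into a DataFrame
2. Trains a random forest model on the numerical columns
3. Predicts on the test file and prints the accuracy
%%time reports the total run time of the cell.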
""" 
df = pd.read_csv(file, nrows=first_row_taken, delimiter='\t', names=header)

# keep the label (col1) and the 13 numerical columns (col2 to col14)
df_sliced = df.iloc[:, 0:14]

# separate the label Y from the features
Y = df_sliced.pop('col1') # first column is binary (click or not)

# cast the features to float32 and fill missing values with 0
df_sliced = df_sliced.astype(np.float32).fillna(0)

from sklearn.ensemble import RandomForestClassifier

# Random forest parameters
# n_streams = 8 # optimization
max_depth = 10
n_bins = 16 # defined but not passed to the scikit-learn model below
n_trees = 10

rf_model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_trees)
rf_model.fit(df_sliced, Y)

# test data: the last 1M rows of day_15
test_file = '/data/day_15_test'
with open(test_file) as g:
    print(g.readline()) 
    
# apply the same preprocessing to the test data
test_df = pd.read_csv(test_file, delimiter='\t', names=header) 
test_df_sliced = test_df.iloc[:, 0:14]
test_Y = test_df_sliced.pop('col1')
test_df_sliced = test_df_sliced.astype(np.float32).fillna(0)

# predict on the test set and measure accuracy
pred_df = rf_model.predict(test_df_sliced)

from sklearn import metrics
# Model accuracy
print("Accuracy:", metrics.accuracy_score(test_Y, pred_df))

0	1	3	0	7	5	0	0	3575	6	0	4	11976	1	f438eac0	e7c8f4b4	5b913d0f	f2463ffb	729e35ab	6fcd6dcb	27f43f86	312aa74b	25dd8f9a	96bd225a	3861b8d7	f1b49bb9	a77a4a56	672e9cf8	96fd88a3	ae30c32c	1f7fc70b	b6bc86c5	108a0699	5865ea16	d55ec182	f11ef8d0	483383ee	d7b3dff0	321935cd	2ba8d787

Accuracy: 0.96592
CPU times: user 43min 50s, sys: 26.8 s, total: 44min 17s
Wall time: 47min 21s
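
An accuracy figure is easier to interpret next to a trivial baseline. If most rows in the test set are non-clicks, a model that always predicts 0 already scores high, so it helps to compare against that number. A minimal sketch using only the standard library. `majority_baseline_accuracy` is a hypothetical helper, and the toy labels are made up, not taken from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    _, majority_count = counts.most_common(1)[0]
    return majority_count / sum(counts.values())


toy_labels = [0] * 97 + [1] * 3  # made-up example: 3% clicks
print("Baseline accuracy:", majority_baseline_accuracy(toy_labels))
```

In this notebook the helper could be called as `majority_baseline_accuracy(test_Y)`. If the result is close to the model's accuracy, metrics such as `log_loss` or `roc_auc_score` from `sklearn.metrics` may be more informative than accuracy alone.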


Data Source: https://labs.criteo.com/2013/12/download-terabyte-click-logs/

Project Inspiration: https://towardsdatascience.com/mobile-ads-click-through-rate-ctr-prediction-44fdac40c6ff

Mapping objects to a set of integers with a hash function before using them in XGBoost: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087

Regularization, variance, and overfitting concepts: https://www.youtube.com/watch?v=Q81RR3yKn30

XGBoost playlist by StatQuest: https://www.youtube.com/watch?v=OtD8wVaFm6E&list=PLblh5JKOoLULU0irPgs1SnKO6wqVjKUsQ

Visualizing XGBClassifier with eval_metric error and logloss: https://setscholars.net/wp-content/uploads/2019/02/visualise-XgBoost-model-with-learning-curves-in-Python.html