# Avazu dataset processing

In [1]:
import pandas as pd
import numpy as np
import time

from sklearn.model_selection import StratifiedShuffleSplit

In [2]:
# time how long it takes to read the original training dataset (in seconds)
start = time.time()
dt = pd.read_csv('./train.csv')
end = time.time()
print(end-start)

246.88031196594238


In [3]:
# output the original dataset size
dt.shape

(40428967, 24)

Because train.csv is sorted by time, reading only the first 1,000,000 records would risk sampling bias, since those records cover only the earliest part of the time range.
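The bias from reading only the head of a time-sorted file can be illustrated on a small synthetic table. This is only a sketch: the `toy` frame, its 10 hours, and the sample sizes are made-up values for illustration, not properties of the Avazu data.

```python
import numpy as np
import pandas as pd

# Synthetic, time-sorted log: 10 "hours" with 1,000 rows each (hypothetical data).
n_hours, rows_per_hour = 10, 1000
toy = pd.DataFrame({"hour": np.repeat(np.arange(n_hours), rows_per_hour)})

# Taking the first 2,000 rows of a time-sorted table only sees the earliest hours.
head_sample = toy.head(2000)

# A uniform random sample of the same size draws from the whole time range.
random_sample = toy.sample(n=2000, random_state=0)

head_hours = sorted(head_sample["hour"].unique().tolist())
random_hours = sorted(random_sample["hour"].unique().tolist())

print("hours covered by head sample:  ", head_hours)
print("hours covered by random sample:", random_hours)
```

The head sample covers only hours 0 and 1, while the random sample spans the full range, which is the effect the sampling step in this notebook is meant to avoid.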

Instead, the records should be divided into homogeneous subgroups, called strata, and an appropriate number of instances should be drawn from each stratum so that the sample is representative of the full dataset.

Here we stratify on the label: we draw a fraction of 0.005 (0.5%) of the records from each of the click = 0 and click = 1 classes, which shrinks the dataset enough to make subsequent operations manageable.

In [7]:
# stratified sampling on the label; save the result as an Avazu sub-dataset (CSV content, ".txt" extension)

split = StratifiedShuffleSplit(n_splits=1, train_size=0.005, random_state=42)

# split() yields positional indices, so select the rows with iloc
for train_index, test_index in split.split(dt, dt['click']):
    strat_train_set = dt.iloc[train_index]
    strat_train_set.to_csv("avazu_20w.txt", index=False, header=True)
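One way to check that a stratified split preserves the label distribution is to compare the positive rate before and after sampling. The sketch below does this on synthetic labels rather than on `dt`; the 17% positive rate, the `toy` frame, and the 5% sampling fraction are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with roughly a 17% positive rate (an assumed value, not measured from Avazu).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"click": (rng.random(100_000) < 0.17).astype(int)})

# Draw a 5% stratified sample; split() returns positional indices.
split = StratifiedShuffleSplit(n_splits=1, train_size=0.05, random_state=42)
sample_idx, _ = next(split.split(toy, toy["click"]))
toy_sample = toy.iloc[sample_idx]

# The click rate in the sample should closely match the click rate in the full table.
full_rate = toy["click"].mean()
sample_rate = toy_sample["click"].mean()
print(f"full: {full_rate:.4f}  sample: {sample_rate:.4f}")
```

The same comparison could be run on the real data with `dt['click'].mean()` and `strat_train_set['click'].mean()`.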

In [5]:
# preview the sub-dataset (read back the file written by the sampling step)

samp = pd.read_csv('avazu_20w.txt')
samp

,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.199007e+19,0,14102704,1005,0,2ee82a0f,ad063404,f66779e6,ecad2386,7801e8d9,...,1,0,16859,320,50,1887,3,39,-1,23
1,4.523558e+18,0,14102611,1005,0,9a977531,a434fa42,f028772b,ecad2386,7801e8d9,...,1,0,20251,320,50,2323,0,687,100081,48
2,1.663450e+19,0,14102804,1005,0,85f751fd,c4e18dd6,50e219e0,e2fcccd2,5c5a694b,...,1,0,22680,320,50,2528,0,39,100079,221
3,1.705210e+18,0,14102316,1005,1,b7e9786d,b12b9f85,f028772b,ecad2386,7801e8d9,...,1,0,21980,320,50,2532,0,679,100075,48
4,9.639365e+18,0,14102517,1005,0,5b08c53b,7687a86e,3e814130,ecad2386,7801e8d9,...,1,0,17653,300,250,1994,2,39,100084,33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202139,1.505666e+19,0,14102509,1005,1,856e6d3f,58a89a43,f028772b,ecad2386,7801e8d9,...,1,0,16208,320,50,1800,3,167,100077,23
202140,1.370729e+19,0,14102102,1002,0,84c7ba46,c4e18dd6,50e219e0,ecad2386,7801e8d9,...,0,0,21300,320,50,2446,3,171,100228,156
202141,1.145843e+19,1,14102804,1002,0,2bf49fc4,c4e18dd6,50e219e0,ecad2386,7801e8d9,...,0,0,6563,320,50,572,2,39,-1,32
202142,4.609230e+18,0,14102719,1005,0,85f751fd,c4e18dd6,50e219e0,92f5800b,ae637522,...,1,3,21191,320,50,2424,1,161,100189,71


In [6]:
# number of rows in the stratified sub-dataset
len(samp)

202144