This script shows how you can use the Keras library to build neural networks for binary classification. <br />
The main steps are:

1. Scale train and test data.
2. Encode the target variables (one-hot)
3. Set up the model architecture
4. Train the model
5. Predict


In [1]:
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
print ('The train data has {} rows and {} columns'.format(train.shape[0],train.shape[1]))
print ('The test data has {} rows and {} columns'.format(test.shape[0],test.shape[1]))

The train data has 12137810 rows and 10 columns
The test data has 3706907 rows and 9 columns


In [4]:
train.head()

Unnamed: 0,ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid,click
0,IDsrk7SoW,2017-01-14 09:42:09,4709696.0,887235,17714,20301556,e,Firefox,,0
1,IDmMSxHur,2017-01-18 17:50:53,5189467.0,178235,21407,9434818,b,Mozilla Firefox,Desktop,0
2,IDVLNN0Ut,2017-01-11 12:46:49,98480.0,518539,25085,2050923,a,Edge,,0
3,ID32T6wwQ,2017-01-17 10:18:43,8896401.0,390352,40339,72089744,c,Firefox,Mobile,0
4,IDqUShzMg,2017-01-14 16:02:33,5635120.0,472937,12052,39507200,d,Mozilla Firefox,Desktop,0


In [5]:
# imputing missing values
train['siteid'].fillna(-999, inplace=True)
test['siteid'].fillna(-999, inplace=True)

train['browserid'].fillna("None",inplace=True)
test['browserid'].fillna("None", inplace=True)

train['devid'].fillna("None",inplace=True)
test['devid'].fillna("None",inplace=True)

In [7]:
# create time-based features

train['datetime'] = pd.to_datetime(train['datetime'])
test['datetime'] = pd.to_datetime(test['datetime'])

train['tweekday'] = train['datetime'].dt.weekday
test['tweekday'] = test['datetime'].dt.weekday

train['thour'] = train['datetime'].dt.hour
test['thour'] = test['datetime'].dt.hour

train['tminute'] = train['datetime'].dt.minute
test['tminute'] = test['datetime'].dt.minute
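
To make the time-based features concrete, here is a minimal sketch that derives the same three values with the standard library for the first timestamp shown in `train.head()`. The helper name `time_features` is hypothetical; like pandas' `.dt.weekday`, `datetime.weekday()` counts from Monday = 0.

```python
from datetime import datetime

def time_features(timestamp):
    """Return (weekday, hour, minute) for a 'YYYY-MM-DD HH:MM:SS' string."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # weekday() counts from Monday = 0 to Sunday = 6
    return parsed.weekday(), parsed.hour, parsed.minute

features = time_features("2017-01-14 09:42:09")
print(features)  # (5, 9, 42): a Saturday, at 09:42
```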

In [141]:
# create aggregate features
site_offer_count = train.groupby(['siteid','offerid']).size().reset_index()
site_offer_count.columns = ['siteid','offerid','site_offer_count']

site_offer_count_test = test.groupby(['siteid','offerid']).size().reset_index()
site_offer_count_test.columns = ['siteid','offerid','site_offer_count']

site_cat_count = train.groupby(['siteid','category']).size().reset_index()
site_cat_count.columns = ['siteid','category','site_cat_count']

site_cat_count_test = test.groupby(['siteid','category']).size().reset_index()
site_cat_count_test.columns = ['siteid','category','site_cat_count']

site_mcht_count = train.groupby(['siteid','merchant']).size().reset_index()
site_mcht_count.columns = ['siteid','merchant','site_mcht_count']

site_mcht_count_test = test.groupby(['siteid','merchant']).size().reset_index()
site_mcht_count_test.columns = ['siteid','merchant','site_mcht_count']


In [158]:
# joining all files
agg_df = [site_offer_count,site_cat_count,site_mcht_count]
agg_df_test = [site_offer_count_test,site_cat_count_test,site_mcht_count_test]

for x in agg_df:
    train = train.merge(x)
    
for x in agg_df_test:
    test = test.merge(x)
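
The aggregate features are simply counts of how often each key pair occurs, attached back to every row. A minimal sketch of that idea in plain Python, using made-up `(siteid, offerid)` pairs rather than the real data:

```python
from collections import Counter

# Hypothetical toy rows: (siteid, offerid) pairs, one per impression
rows = [(1, 10), (1, 10), (1, 20), (2, 10), (1, 10)]

# Counterpart of groupby(['siteid', 'offerid']).size()
pair_counts = Counter(rows)

# Counterpart of merging the counts back: every row gets its pair's count
site_offer_count = [pair_counts[pair] for pair in rows]
print(site_offer_count)  # [3, 3, 1, 1, 3]
```

Because the counts are computed separately on `train` and `test`, the same pair can receive a different count in each set.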


In [28]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
for c in list(train.select_dtypes(include=['object']).columns):
    if c != 'ID':
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(list(train[c].values))
        test[c] = lbl.transform(list(test[c].values))        
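
The encoder is fitted on the train and test values together so that a category appearing only in the test set still gets a code. A small sketch of that behaviour, assuming (as `LabelEncoder` does) that codes follow the sorted order of the distinct values; `fit_label_codes` and the toy columns are hypothetical:

```python
def fit_label_codes(*columns):
    """Map each distinct value across all columns to an integer code."""
    distinct = sorted(set().union(*columns))
    return {value: code for code, value in enumerate(distinct)}

# Hypothetical toy columns; 'Chrome' occurs only in the test column
train_browser = ["Firefox", "Edge", "None", "Firefox"]
test_browser = ["Chrome", "Edge"]

codes = fit_label_codes(train_browser, test_browser)
print([codes[v] for v in train_browser])  # [2, 1, 3, 2]
print([codes[v] for v in test_browser])   # [0, 1]
```

Fitting on the train column alone would leave `'Chrome'` without a code, and the lookup for the test column would fail.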

In [163]:
# sample 1 million rows (roughly 8% of the data) to avoid memory troubles
# if you have access to large machines, you can use more data for training

train = train.sample(int(1e6))  # n must be an integer, not the float 1e6
print (train.shape)



(1000000, 16)


In [164]:
# select feature columns (drop the identifier, the raw timestamp and the target)
cols_to_use = [x for x in train.columns if x not in ['ID','datetime','click']]

In [165]:
# standardise data before training
scaler = StandardScaler().fit(train[cols_to_use])

strain = scaler.transform(train[cols_to_use])
stest = scaler.transform(test[cols_to_use])
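
Note that the scaler is fitted on the training data only and then applied to both sets. The sketch below shows the same idea for a single column in plain Python; the function names and toy values are hypothetical, and it uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Centre and scale values with previously learned statistics."""
    return [(value - mean) / std for value in column]

# Hypothetical toy values for a single feature
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 11.0]

mean, std = fit_scaler(train_col)               # statistics come from train only
scaled_train = transform(train_col, mean, std)  # mean 0, standard deviation 1
scaled_test = transform(test_col, mean, std)    # reuses the train statistics
```

The scaled test column generally does not have mean 0 and standard deviation 1, which is expected: it is measured on the scale learned from the training data.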

In [167]:
# train validation split
X_train, X_valid, Y_train, Y_valid = train_test_split(strain, train.click, test_size = 0.5, random_state=2017)

In [168]:
print (X_train.shape)
print (X_valid.shape)
print (Y_train.shape)
print (Y_valid.shape)

(500000, 13)
(500000, 13)
(500000,)
(500000,)


In [169]:
# model architecture
def keras_model(train):
    
    input_dim = train.shape[1]
    classes = 2
    
    model = Sequential()
    model.add(Dense(100, activation = 'relu', input_shape = (input_dim,))) #layer 1
    model.add(Dense(30, activation = 'relu')) #layer 2
    model.add(Dense(classes, activation = 'sigmoid')) #output
    model.compile(optimizer = 'adam', loss='binary_crossentropy',metrics = ['accuracy'])
    return model

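# stop training once validation accuracy has not improved for 3 consecutive epochs
# ('val_acc' is the name used by older Keras releases; newer ones report 'val_accuracy')
callback = EarlyStopping(monitor='val_acc',patience=3)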

### Now, let's understand the architecture of this neural network:

1. We have 13 input features.
2. We connect these 13 features to 100 neurons in the first hidden layer (call it layer 1).
3. Picture it this way: each line connecting an input to a neuron carries a weight, which is initialised randomly.
4. Each neuron in layer 1 computes a weighted sum of its inputs (bias + w0x0 + w1x1 + ...) and passes it through the `relu` activation function.
5. `relu` works this way: if the weighted sum is less than zero, it outputs 0; if the weighted sum is positive, it passes the value through unchanged.
6. The output of layer 1 is the input to layer 2, which has 30 neurons. Again, each weighted sum passes through the `relu` activation function.
7. Finally, the output of layer 2 is fed into the final layer, which has 2 neurons, one per class. Their weighted sums pass through the `sigmoid` function, which keeps each output between 0 and 1 so that it can be read as a probability, and these outputs are the predictions.
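
The steps above can be sketched as a forward pass in plain Python, written without Keras so that each weighted sum is visible. This is an illustration under assumptions: the weights are drawn uniformly from [-0.1, 0.1] with zero biases (not Keras' actual initialiser), the input row is made up, and the helper names are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(bias + sum of w_i * x_i) per neuron."""
    return [
        activation(b + sum(w * x for w, x in zip(neuron_w, inputs)))
        for neuron_w, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Randomly initialised weights (one list per neuron) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(2017)
sizes = [13, 100, 30, 2]  # inputs, layer 1, layer 2, output
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

x = [rng.gauss(0, 1) for _ in range(13)]  # one made-up standardised input row
h1 = dense(x, *layers[0], relu)           # layer 1: 100 neurons
h2 = dense(h1, *layers[1], relu)          # layer 2: 30 neurons
out = dense(h2, *layers[2], sigmoid)      # output: 2 values in (0, 1)

# Each layer has n_in * n_out weights plus n_out biases
n_params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(len(h1), len(h2), len(out))  # 100 30 2
print(n_params)                    # 4492
```

The parameter count breaks down as 1,400 for layer 1, 3,030 for layer 2 and 62 for the output layer.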

In [171]:
# one hot target columns
Y_train = to_categorical(Y_train)
Y_valid = to_categorical(Y_valid)
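
For a 0/1 target, one-hot encoding turns each label into a row with one column per class. A minimal sketch of the transformation (the helper `one_hot` is hypothetical, not the Keras implementation):

```python
def one_hot(labels, num_classes=2):
    """Turn integer class labels into one-hot rows."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

print(one_hot([0, 1, 0]))  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Column 1 is the indicator for `click = 1`, which is why the later cells select `[:,1]` from both the targets and the predictions.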

In [173]:
# train model
model = keras_model(X_train)
model.fit(X_train, Y_train, batch_size=1000, epochs=50, callbacks=[callback], validation_data=(X_valid, Y_valid), shuffle=True)

Train on 500000 samples, validate on 500000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50


<keras.callbacks.History at 0x7f12846a5940>
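
The log above ends at epoch 24 of the 50 requested, which is what one would expect if the `EarlyStopping` callback halted training. The sketch below illustrates the patience rule for a metric that should increase; the function `stopping_epoch` and the accuracy values are hypothetical and are not taken from this run.

```python
def stopping_epoch(val_scores, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0  # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, not taken from the run above
scores = [0.90, 0.93, 0.95, 0.95, 0.94, 0.95, 0.96]
print(stopping_epoch(scores))  # 6
```

In this toy series the best value is reached at epoch 3, the next three epochs fail to beat it, and training stops at epoch 6 without ever seeing the better score at epoch 7. A larger `patience` trades longer training for a lower chance of stopping too early.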

In [177]:
# check validation AUC
from sklearn.metrics import roc_auc_score

vpreds = model.predict_proba(X_valid)[:,1]
roc_auc_score(y_true = Y_valid[:,1], y_score=vpreds)



0.96653631540205431
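
The score above is a ROC AUC, which can be read as the probability that a randomly chosen click receives a higher predicted score than a randomly chosen non-click. A brute-force sketch of that definition on made-up values (the function `roc_auc` is hypothetical and compares every pair, so it is only suitable for toy data):

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted click probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Unlike accuracy, this measure depends only on the ranking of the scores, not on a decision threshold.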

In [178]:
# predict on test data
test_preds = model.predict_proba(stest)[:,1]



In [180]:
# create submission file
submit = pd.DataFrame({'ID':test.ID, 'click':test_preds})
submit.to_csv('keras_starter.csv', index=False)