# Keras Implementation Tutorial

In [1]:
import keras

Using TensorFlow backend.


In [2]:
from keras.models import Sequential
from keras.layers import Dense

In [113]:
import numpy as np
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

## 0. The Data

In this tutorial, we study which attributes are associated with diabetes among Pima Indians. The dataset is from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes). All samples are females at least 21 years old of Pima Indian heritage.

This is a binary classification problem, where Class = 1 means the patient has diabetes. The dataset has 768 instances (samples) and 8 attributes:

* **(1) Number of times pregnant**
* **(2) Plasma glucose concentration at 2 hours in an oral glucose tolerance test**
* **(3) Diastolic blood pressure (mm Hg)**
* **(4) Triceps skin fold thickness (mm)**
* **(5) 2-Hour serum insulin (mu U/ml)**
* **(6) Body mass index (weight in kg/(height in m)^2)**
* **(7) Diabetes pedigree function**
* **(8) Age (years)**

In [47]:
import pandas as pd
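# Caution: read_csv treats the first row as a header by default. If the CSV has no header row,
# one sample is consumed as column names (767 rows below instead of 768); pass header=None to keep it.
dataset = pd.read_csv("pima-indians-diabetes.csv")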
dataset.columns = ['num_pregnant','glucose_concentration','blood_pressure','skin_fold_thickness','serum_insulin','body_mass_index','diabetes_pedigree','age','class']
dataset.head()

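| | num_pregnant | glucose_concentration | blood_pressure | skin_fold_thickness | serum_insulin | body_mass_index | diabetes_pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |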


In [33]:
dataset.shape

(767, 9)

### Split the dataset into training and test sets

In [51]:
X = dataset.iloc[:,:8]
Y = dataset.iloc[:,8]

In [70]:
from sklearn.model_selection import cross_val_score, train_test_split  # replaces the removed sklearn.cross_validation module
from sklearn import metrics
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
type(X_train), type(Y_train)

(pandas.core.frame.DataFrame, pandas.core.series.Series)

In [99]:
X_train.shape, Y_train.shape

((536, 8), (536,))

In [100]:
X_test.shape, Y_test.shape

((231, 8), (231,))
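
The shapes above follow from simple split arithmetic. The sketch below is only an illustration: `split_sizes` is a hypothetical helper (not a scikit-learn function), and it assumes the test count is rounded up with the remainder used for training, which is consistent with the shapes printed above.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest is used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(767, 0.3)
print(n_train, n_test)
```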

Note that, unlike scikit-learn, Keras here expects NumPy arrays as input. We therefore convert the pandas objects to NumPy arrays:

In [77]:
# .values returns the underlying NumPy array (as_matrix() was removed from newer pandas)
X_train_arr = X_train.values
Y_train_arr = Y_train.values

In [72]:
X_test_arr = X_test.values
Y_test_arr = Y_test.values

In [79]:
type(X_train_arr), type(Y_train_arr)

(numpy.ndarray, numpy.ndarray)

## 1. Define the Neural Network

### (a) model I: 3-layer network, neuron numbers are 12-8-1, relu activation function

First we consider a network with one hidden layer, so the architecture is **input-hidden-output**:

In [73]:
# create model
model = Sequential()
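# input_dim=8 declares the 8 input features; `kernel_initializer` is the Keras 2 name of the older `init` argument
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))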

The input layer has 12 neurons (it receives the 8 features through `input_dim=8`), the hidden layer has 8 neurons, and the output layer always has 1 neuron. Since this is a binary classification problem, we use the **sigmoid** activation function in the output layer, so that the output can be read as the probability of having diabetes.
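
To get a feeling for the size of this 12-8-1 network, we can count its trainable parameters by hand. This is a minimal sketch; `dense_param_count` is a hypothetical helper written for this illustration, and it assumes every layer is fully connected with one bias per neuron.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix plus one bias per unit
        fan_in = units
    return total

# 8 input features feeding layers of 12, 8 and 1 neurons:
# (8*12 + 12) + (12*8 + 8) + (8*1 + 1) = 221
print(dense_param_count(8, [12, 8, 1]))
```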

For the input and hidden layers, we can choose **sigmoid** or **[relu](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))** as the activation function. **relu** is popular in deep learning, particularly in convolutional neural networks. Later we revisit the same problem using **sigmoid** activations and compare the performance.
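
The two activation functions mentioned above are simple to write down. The following is a plain-Python sketch for scalar inputs, intended only to show their shapes; it is not how Keras implements them internally.

```python
import math

def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), round(sigmoid(z), 4))
```

Because **sigmoid** always returns a value between 0 and 1, it is a natural choice for the output layer of a binary classifier.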


We use **logarithmic loss**, which for a **binary** classification problem is called **binary_crossentropy** in Keras, as the loss function to minimize. As the optimizer we use **Adam**, an efficient variant of stochastic gradient descent; see the paper [“Adam: A Method for Stochastic Optimization”](https://arxiv.org/abs/1412.6980) for details. The metric reported here is "accuracy".

In [74]:
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
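
For intuition, the logarithmic loss named above can be computed by hand. This is a minimal sketch of the standard formula, averaged over a batch; the clipping constant `eps` and the example labels and probabilities are illustrative assumptions, not values taken from Keras or from this dataset.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean log loss over a batch; eps clips probabilities to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# hypothetical labels and predicted probabilities
print(binary_crossentropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

Confident correct predictions contribute a loss close to 0, while confident wrong predictions are penalized heavily.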

#### Training model

We set the number of epochs to **epochs=150** and the size of each mini-batch to **batch_size=10**. Keep in mind that, unlike scikit-learn, Keras here takes NumPy arrays as input.

In [82]:
# `epochs` is the Keras 2 name of the older `nb_epoch` argument
model.fit(X_train_arr, Y_train_arr, epochs=150, batch_size=10, verbose=0)

<keras.callbacks.History at 0x115e07a58>
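
To see what these two settings mean in terms of weight updates, we can do the arithmetic by hand. This is a small sketch; `updates_per_epoch` is a hypothetical helper, and it assumes the last, smaller mini-batch is also used for an update.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_train = 536  # training rows, from X_train.shape above
per_epoch = updates_per_epoch(n_train, 10)
print(per_epoch, per_epoch * 150)  # updates per epoch, updates over 150 epochs
```

With a mini-batch size of 10 this gives 54 updates per epoch (the last batch holds only 6 samples), or 8100 updates over 150 epochs.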

#### Evaluate the model

In [83]:
scores = model.evaluate(X_train_arr, Y_train_arr)
print('  Loss: ', scores[0], ' , acc:', scores[1])



In [84]:
scores = model.evaluate(X_test_arr, Y_test_arr)
print('  Loss: ', scores[0], ' , acc:', scores[1])

  Loss:  0.577939843074  , acc: 0.757575757834


Here **acc** is the accuracy, defined as the fraction of samples that are predicted correctly. We can reproduce it by hand:

In [85]:
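predictions = model.predict(X_test_arr)
rounded = np.array([round(x[0]) for x in predictions])  # threshold the probabilities at 0.5
diff = rounded - Y_test_arr                              # 0 where correct, +1 or -1 where wrong
a = np.dot(diff, diff)                                   # number of misclassified samples
print(1.0-a/len(rounded))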

0.757575757576


We can also inspect the predicted probability of diabetes for individual patients:

In [86]:
# for this sigmoid output, predict returns the same probabilities as the older predict_proba
print (model.predict(X_test_arr)[:10])

 [ 0.2329852 ]
 [ 0.55269057]
 [ 0.14207026]
 [ 0.15011092]
 [ 0.10766293]
 [ 0.85442162]
 [ 0.02469895]
 [ 0.21705227]
 [ 0.08255263]]


#### Double mini-batch size

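Now we double the mini-batch size and train again. Note that calling `fit` on the same model continues from the weights learned above rather than starting from scratch:

In [87]:
model.fit(X_train_arr, Y_train_arr, epochs=150, batch_size=20, verbose=0)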
scores = model.evaluate(X_test_arr, Y_test_arr)
print('  Loss: ', scores[0], ' , acc:', scores[1])

  Loss:  0.675078886928  , acc: 0.696969697228


#### Implement SGD

Next we try plain SGD as the optimizer. In this run, the test accuracy is lower than with **adam**:

In [128]:
from keras.optimizers import SGD
#model.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.01, momentum=0.9, nesterov=True))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

In [129]:
model.fit(X_train_arr, Y_train_arr, epochs=150, batch_size=10, verbose=0)
scores = model.evaluate(X_test_arr, Y_test_arr);
print('  Loss: ', scores[0], ' , acc:', scores[1])

  Loss:  0.647219739693  , acc: 0.649350650125
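
To make the SGD update rule concrete, here is a minimal sketch of classical SGD with optional momentum on a single scalar parameter. The toy loss `(w - 3)^2`, the helper `sgd_minimize`, and the step count are assumptions made for this illustration; the values `lr=0.01` and `momentum=0.9` mirror the commented-out line above.

```python
def sgd_minimize(grad, w0, lr=0.01, momentum=0.0, steps=200):
    """Classical SGD with optional momentum on a single scalar parameter."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)  # accumulate a running step
        w = w + velocity
    return w

def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

print(sgd_minimize(grad, w0=0.0))                # plain SGD
print(sgd_minimize(grad, w0=0.0, momentum=0.9))  # SGD with momentum
```

On this toy problem the momentum variant ends closer to the minimum after the same number of steps, which is one reason the learning rate and momentum are worth tuning before comparing SGD with **adam**.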


### (b) model II: 3-layer network, neuron numbers are 40-30-1, relu activation function

In [91]:
model2 = Sequential()
model2.add(Dense(40, input_dim=8, kernel_initializer='uniform', activation='relu'))
model2.add(Dense(30, kernel_initializer='uniform', activation='relu'))
model2.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

In [92]:
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [95]:
model2.fit(X_train_arr, Y_train_arr, epochs=150, batch_size=10, verbose=0)
scores = model2.evaluate(X_train_arr, Y_train_arr);
print('  Loss: ', scores[0], ' , acc:', scores[1])



In [96]:
scores = model2.evaluate(X_test_arr, Y_test_arr);
print('  Loss: ', scores[0], ' , acc:', scores[1])



With more neurons, the model fits the training data more accurately but is less accurate on the test dataset. In other words, using too many neurons for this small dataset leads to overfitting!

### (c) model III: 3-layer network, neuron numbers are 12-8-1, sigmoid activation function

Now we replace **relu** with **sigmoid** as the activation function in every layer. In this run, the test accuracy (about 0.72) is lower than with **relu** (about 0.76).

In [97]:
model3 = Sequential()
model3.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='sigmoid'))
model3.add(Dense(8, kernel_initializer='uniform', activation='sigmoid'))
model3.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.fit(X_train_arr, Y_train_arr, epochs=150, batch_size=10, verbose=0)
scores = model3.evaluate(X_test_arr, Y_test_arr);
print('  Loss: ', scores[0], ' , acc:', scores[1])

  Loss:  0.555261078851  , acc: 0.72294372346


### (d) model IV: 4-layer network, neuron numbers are 12-8-8-1, relu activation function

Here we use two hidden layers, each with 8 neurons.

In [98]:
model4 = Sequential()
model4.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model4.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model4.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model4.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.fit(X_train_arr, Y_train_arr, epochs=150, batch_size=10, verbose=0)
scores = model4.evaluate(X_test_arr, Y_test_arr);
print('  Loss: ', scores[0], ' , acc:', scores[1])

  Loss:  0.574795645037  , acc: 0.748917749176


Compared with the three-layer results above, the four-layer network does not improve accuracy. This suggests that, for this problem, additional hidden layers may not be necessary, probably because the dataset is small.

## 2. Prediction

In [210]:
# use the NumPy array, consistent with the training calls above
predictions = model.predict(X_test_arr)
print (predictions[:10])

[[ 0.02870761]
 [ 0.2803773 ]
 [ 0.33776325]
 [ 0.64599019]
 [ 0.36038035]
 [ 0.14201277]
 [ 0.08908633]
 [ 0.12896512]
 [ 0.85859615]
 [ 0.89892173]]


In [211]:
rounded = [round(x[0]) for x in predictions]
print (rounded[:10])

[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
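
Rounding the sigmoid output is the same as applying a decision threshold of 0.5. The sketch below makes the threshold explicit, using the ten probabilities printed above; `to_labels` is a hypothetical helper introduced only for this illustration.

```python
# Probabilities copied from the prediction output above
probs = [0.02870761, 0.2803773, 0.33776325, 0.64599019, 0.36038035,
         0.14201277, 0.08908633, 0.12896512, 0.85859615, 0.89892173]

def to_labels(probabilities, threshold=0.5):
    """Map each probability to class 1 if it exceeds the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]

print(to_labels(probs))  # [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]
```

Writing the threshold as a parameter makes it easy to try other cut-offs, for example a lower one if missing a diabetic patient is considered more costly than a false alarm.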


In [202]:
predictions2 = model.predict(X_test_arr)  # predict already returns the class-1 probabilities



## 3. Compare to Logistic-Regression, SVM and Random-Forest models

Next we compare the neural network with other machine learning models: logistic regression, support vector machine (SVM) and random forest. We select the hyperparameters by cross-validation on the training set (10-fold for logistic regression, 5-fold for the SVM, and a 3-fold grid search for the random forest), and then compute the accuracy of the best model on the test dataset. We will see that on this small dataset, logistic regression, SVM and random forest reach accuracy similar to that of the neural network.
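
As a reminder of what k-fold cross-validation does, the sketch below splits sample indices into folds and uses each fold once for validation. It is a simplified illustration: `kfold_indices` is a hypothetical helper, the folds are contiguous and unshuffled, and it ignores any class stratification that a library implementation may apply.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more sample
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 536 training samples split into 10 folds
fold_sizes = [len(val) for _, val in kfold_indices(536, 10)]
print(fold_sizes)
```

Each model is then trained on the training part of a fold and scored on its validation part, and the scores are averaged, which is the quantity printed in the loops below.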

### (a) Logistic regression

In [114]:
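from sklearn import linear_model
best_logreg_model = None
max_score = -1
best_reg = -1
# In scikit-learn, C is the inverse of the regularization strength (larger C = weaker penalty)
for regularization_param in [0.01, 0.1, 1, 2, 10, 100, 500, 1000, 5000]:
    logreg = linear_model.LogisticRegression(penalty='l2', C=regularization_param)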
    cv_score = cross_val_score(logreg, X_train, Y_train, cv=10)
    print (regularization_param, np.mean(cv_score))
    if np.mean(cv_score) > max_score:
        max_score = np.mean(cv_score)
        best_logreg_model = logreg
        best_reg = regularization_param
        
best_logreg_model.fit(X_train, Y_train)
print ('best reg =', best_reg)
print ('accuracy = ', best_logreg_model.score(X_test, Y_test))

0.01 0.686617749825
0.1 0.720230607966
1 0.781656184486
2 0.787281621244
10 0.783647798742
100 0.783682739343
500 0.783682739343
1000 0.783682739343
5000 0.783682739343
best reg = 2
accuracy =  0.761904761905


### (b) Support-vector-machine

In [121]:
from sklearn import svm
best_svm_model = None
best_Cs = -1
max_score = -1
for Cs in [0.1, 0.2, 0.5, 0.7, 1.0, 2.0, 5.0]:
    svc = svm.SVC(kernel='linear', C=Cs, probability=True)
    # cross-validating an SVM can be slow because every fold requires a full fit
    cv_score = cross_val_score(svc, X_train, Y_train, cv=5) 
    print (Cs, np.mean(cv_score))
    if np.mean(cv_score) > max_score:
        max_score = np.mean(cv_score)
        best_svm_model = svc
        best_Cs = Cs

best_svm_model.fit(X_train, Y_train)
print ('best C =', best_Cs)
print ('accuracy = ', best_svm_model.score(X_test, Y_test))

0.1 0.777933541018
0.2 0.776064382139
0.5 0.77978539287
0.7 0.776047075112
1.0 0.777916233991
2.0 0.776047075112
5.0 0.774177916234
best C = 0.5
accuracy =  0.753246753247


In [126]:
print ('accuracy = ', best_svm_model.score(X_test, Y_test))  # score() returns the mean accuracy, not the MSE

accuracy =  0.753246753247


### (c) Random forest

In [122]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search module
rf = RandomForestClassifier()
parameters = {'n_estimators': [4,6,8],'max_depth':[5,10,15],'min_samples_leaf':[10,20]}
model_cv_grid = GridSearchCV(rf,parameters,scoring='roc_auc',verbose=2,n_jobs=-1)
model_cv_grid.fit(X_train,Y_train)
best_rf_model = model_cv_grid.best_estimator_

Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] max_depth=5, min_samples_leaf=10, n_estimators=4 ................
[CV] max_depth=5, min_samples_leaf=10, n_estimators=4 ................
[CV] max_depth=5, min_samples_leaf=10, n_estimators=4 ................
[CV] max_depth=5, min_samples_leaf=10, n_estimators=6 ................
[CV] ....... max_depth=5, min_samples_leaf=10, n_estimators=4 -   0.0s
[CV] ....... max_depth=5, min_samples_leaf=10, n_estimators=4 -   0.0s
[CV] ....... max_depth=5, min_samples_leaf=10, n_estimators=4 -   0.0s
[CV] max_depth=5, min_samples_leaf=10, n_estimators=6 ................
[CV] max_depth=5, min_samples_leaf=10, n_estimators=6 ................
[CV] ....... max_depth=5, min_samples_leaf=10, n_estimators=6 -   0.0s
[CV] max_depth=5, min_samples_leaf=10, n_estimators=8 ................
[CV] max_depth=5, min_samples_leaf=10, n_estimators=8 ................
[CV] ....... max_depth=5, min_samples_leaf=10, n_estimators=6 -   0.1s
[CV] ....... max

[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:    0.9s finished


In [130]:
best_rf_model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=20,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=8, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [127]:
print ('accuracy = ', best_rf_model.score(X_test, Y_test))

accuracy =  0.757575757576


In [124]:
importance = best_rf_model.feature_importances_
attribute = X.columns

v = sorted(range(len(importance)), key=lambda k: importance[k], reverse=True)
sorted_importance = [importance[i] for i in v]
sorted_attribute = [attribute[i] for i in v]

df_importance = pd.DataFrame({'variable': sorted_attribute, 'importance' : sorted_importance})
df_importance.sort_index().head(5)

| | importance | variable |
|---|---|---|
| 0 | 0.339148 | glucose_concentration |
| 1 | 0.217821 | body_mass_index |
| 2 | 0.126481 | age |
| 3 | 0.10789 | diabetes_pedigree |
| 4 | 0.072164 | num_pregnant |
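
The ranking step itself does not need index bookkeeping. As a minimal sketch, the same ordering can be obtained by sorting (name, importance) pairs directly; the values below are copied from the table above, and only the top five features are included.

```python
# Importances copied from the table above (top five features only)
importance = {
    'glucose_concentration': 0.339148,
    'body_mass_index': 0.217821,
    'age': 0.126481,
    'diabetes_pedigree': 0.10789,
    'num_pregnant': 0.072164,
}

# sort the (name, score) pairs by score, largest first
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f'{name:<24}{score:.3f}')
```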


## Reference

* 1. [Develop Your First Neural Network in Python With Keras Step-By-Step](http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/)