# Pima Indians Diabetes Database

## Using a Keras based neural network to predict diabetes


- Atul Acharya

This notebook shows how to use a simple Keras-based neural network to predict diabetes. It implements:

- a 3-layer NN 
- model checkpointing / saving
- plotting history

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

seed = 42
np.random.seed(seed)

In [None]:
# load Pima dataset
pdata = pd.read_csv('../input/diabetes.csv')
pdata.head()

Let's look at the summary statistics of the dataset.

In [None]:
pdata.describe()

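Looks like there are some 0-entries in the dataset. For measurements such as glucose or BMI a value of 0 is not plausible, so these most likely stand for missing values. Let's count them.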

In [None]:
# let's count the 0-entries for these fields

zero_fields = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def check_zero_entries(data, fields):
    """ List number of 0-entries in each of the given fields"""
    for field in fields:
        print('field %s: num 0-entries: %d' % (field, len(data.loc[ data[field] == 0, field ])))

check_zero_entries(pdata, zero_fields)

[Thanks to **_ManasviKundalia_** for the interpretation.]

As one can see, there are several "0" entries, especially for SkinThickness and Insulin. At least some of these fields (e.g. **Insulin**) matter for diabetes prediction.

What to do? 

Let's split into Train/Test datasets, and then replace the 0-entries in each split with the mean of that field's non-zero values.
We don't want to impute over the entire dataset at once, since the Test rows would then influence the values used for training, which would distort the measured performance on the Test set.
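
A stricter variant of this idea, which is not what the cells below do, is to compute each field's mean on the Train split only and reuse it for the Test split, so that no Test statistics are involved at all. The sketch below illustrates that variant; the helper name `impute_zeros_with_train_mean` and the toy DataFrames are made up for illustration.

```python
import pandas as pd

def impute_zeros_with_train_mean(train, test, fields):
    """Replace 0-entries in `fields` using means computed on `train` only.

    Returns imputed copies; the inputs are left untouched.
    """
    train, test = train.copy(), test.copy()
    for field in fields:
        # cast to float so the (usually fractional) mean can be stored
        train[field] = train[field].astype(float)
        test[field] = test[field].astype(float)
        # mean of the non-zero training values only
        train_mean = train.loc[train[field] != 0, field].mean()
        train.loc[train[field] == 0, field] = train_mean
        test.loc[test[field] == 0, field] = train_mean
    return train, test

# toy data, made up for illustration
toy_train = pd.DataFrame({'Insulin': [0, 100, 200], 'BMI': [30.0, 0.0, 20.0]})
toy_test = pd.DataFrame({'Insulin': [0, 50], 'BMI': [0.0, 40.0]})
new_train, new_test = impute_zeros_with_train_mean(
    toy_train, toy_test, ['Insulin', 'BMI'])
print(new_test)
```

On the real data this would be called with `X_train`, `X_test` and `zero_fields` before the frames are converted to arrays.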


In [None]:
# First - split into Train/Test
from sklearn.model_selection import train_test_split

features = list(pdata.columns.values)
features.remove('Outcome')
print(features)
X = pdata[features]
y = pdata['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print(X_train.shape)
print(X_test.shape)

In [None]:
# let's fix the 0-entries for a field in the dataset with its mean value
def impute_zero_field(data, field):
    """Replace the 0-entries of `field` in place with the mean of its non-zero values"""
    data[field] = data[field].astype(float)   # the mean is usually fractional
    is_zero = data[field] == 0
    avg = data.loc[~is_zero, field].mean()
    k = is_zero.sum()   # num of 0-entries
    data.loc[is_zero, field] = avg
    print('Field: %s; fixed %d entries with value: %.3f' % (field, k, avg))

In [None]:
# Fix it for Train dataset
for field in zero_fields:
    impute_zero_field(X_train, field)

In [None]:
# double check for the Train dataset
check_zero_entries(X_train, zero_fields)

In [None]:
# Fix for Test dataset
for field in zero_fields:
    impute_zero_field(X_test, field)

In [None]:
# double check for the Test dataset
check_zero_entries(X_test, zero_fields)

In [None]:
# Convert to NumPy arrays so that the field names aren't included
X_train = X_train.values
y_train = y_train.values
X_test  = X_test.values
y_test  = y_test.values

### Neural Network model

We define a 3-layer NN model in Keras:

- 1st layer: 12 nodes, with ReLU activation
- 2nd layer: 8 nodes, with ReLU activation
- 3rd layer: output, with sigmoid activation
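
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: each fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes the 8 input features used in the model cell; the helper name `dense_param_count` is made up for illustration.

```python
def dense_param_count(n_inputs, n_units):
    """Trainable parameters in a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

layer_sizes = [8, 12, 8, 1]  # 8 input features, then 12, 8 and 1 units
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [108, 104, 9] 221
```

With only a few hundred parameters the model is small, which is worth keeping in mind when reading the training curves.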

In [None]:
NB_EPOCHS = 1000  # num of epochs to train for
BATCH_SIZE = 16

## Create our model
model = Sequential()

# 1st layer: input_dim=8, 12 nodes, RELU
# (Keras 2 renamed the Keras 1 argument 'init' to 'kernel_initializer')
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
# 2nd layer: 8 nodes, RELU
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
# output layer: dim=1, activation sigmoid
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy',   # since we are predicting 0/1
             optimizer='adam',
             metrics=['accuracy'])

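# checkpoint: store the best model (highest validation accuracy)
# note: Keras >= 2.3 reports this metric as 'val_accuracy';
# older versions call it 'val_acc'
ckpt_model = 'pima-weights.best.hdf5'
checkpoint = ModelCheckpoint(ckpt_model, 
                            monitor='val_accuracy',
                            verbose=1,
                            save_best_only=True,
                            mode='max')
callbacks_list = [checkpoint]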

print('Starting training...')
# train the model, store the results for plotting
history = model.fit(X_train,
                    y_train,
                    validation_data=(X_test, y_test),
                    epochs=NB_EPOCHS,   # called 'nb_epoch' in Keras 1
                    batch_size=BATCH_SIZE,
                    callbacks=callbacks_list,
                    verbose=0)

In [None]:
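# Model accuracy
# the history key is 'acc' in older Keras versions and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])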
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()

In [None]:
# Model Loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()

In [None]:
# print final accuracy
scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

So we see several things:

- We get an accuracy of about **78%**, which is decent but not great.
- After about 300 epochs, the model does not really improve.
- After about 500 epochs, the loss starts to increase; a test loss that rises while the training loss keeps falling indicates overfitting.
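
Note that `model.evaluate` above scores the weights from the last epoch, while the checkpoint file holds the weights from the best epoch. To find where that best epoch falls, one can scan the recorded history. The sketch below uses a made-up list in place of the real validation accuracy values from `history.history`, and the helper name `best_epoch` is hypothetical.

```python
def best_epoch(values, mode='max'):
    """Return (epoch, value) of the best entry; epochs are counted from 1."""
    pick = max if mode == 'max' else min
    # on ties, max/min return the first (earliest) index
    idx = pick(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# made-up numbers standing in for the recorded validation accuracy
toy_val_acc = [0.65, 0.70, 0.78, 0.77, 0.78, 0.74]
epoch, value = best_epoch(toy_val_acc)
print('best val accuracy %.2f at epoch %d' % (value, epoch))
```

Calling it with `mode='min'` on the validation loss gives the epoch after which the loss stops improving.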

A few things could be done to improve the results:

- Different model architecture (num of nodes, etc)
- Dropout
- Adaptive learning rate
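
To make the Dropout suggestion concrete, here is a small NumPy sketch of inverted dropout, the scheme that Keras's built-in `Dropout` layer applies during training. The function name `inverted_dropout` and the toy activations are made up for illustration; in the actual model one would add `Dropout` layers between the `Dense` layers rather than hand-roll this.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the activations and rescale the rest.

    Dividing by the keep probability leaves the expected activation
    unchanged, so nothing needs to be rescaled at prediction time.
    """
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
hidden = np.ones((4, 12))  # pretend output of the 12-node layer for 4 samples
dropped = inverted_dropout(hidden, rate=0.25, rng=rng)
print(dropped)
```

Each surviving entry is scaled up to 1/0.75, and the dropped ones are set to 0; at prediction time the layer is simply skipped.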

Any other suggestions for improvement? Thanks for taking a look.