## Diabetes Diagnosis
We are going to see how a neural network could be used to diagnose diabetes. As you go through this notebook, pay attention to how little human intervention is needed: the neural network does most of the work for us!

In [None]:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
import numpy as np
from sklearn.model_selection import train_test_split

### Exercise 1: Load the data

Before we load the data, we must upload it to Colab. On the left you should see four symbols; the bottom one opens the file browser.

Read the line below and work out where the correct place to upload the data is. Hint: to create a new folder, right-click under `sample_data`.


In [None]:
dataframe = pd.read_csv("Data/diabetes.csv")

Now that the data is loaded, let's have a look at it:

In [None]:
dataframe.head()

### Exercise 2

You may notice that there are a lot of '0' values in this data. With a touch of common sense, we can work out which are legitimate zeros and which are missing values. It's perfectly normal to have had 0 pregnancies, but I hope to never meet anyone with a skin thickness of 0!

Use the missing value imputation methods you have seen previously to fill in missing values in this data set.

In [None]:
# Missing Value Imputation
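
One possible approach, sketched on a tiny made-up table rather than on the real data: mark the impossible zeros as missing, then fill them with the column median. The column names `Pregnancies` and `SkinThickness`, the list `zero_means_missing`, and all the values are assumptions for illustration; check `dataframe.columns` for the real names, and feel free to use a different imputation method.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from diabetes.csv).
toy = pd.DataFrame({
    "Pregnancies":   [0, 2, 1, 0],     # 0 is a legitimate value here
    "SkinThickness": [20, 0, 30, 0],   # 0 is impossible, so it means "missing"
})

# Columns where a zero cannot be a real measurement (assumed list).
zero_means_missing = ["SkinThickness"]

imputed = toy.copy()
# Step 1: turn the impossible zeros into NaN so pandas treats them as missing.
imputed[zero_means_missing] = imputed[zero_means_missing].replace(0, np.nan)
# Step 2: fill each gap with the median of the observed values in that column.
imputed[zero_means_missing] = imputed[zero_means_missing].fillna(
    imputed[zero_means_missing].median()
)

print(imputed)
```

The legitimate zeros in `Pregnancies` are left alone because that column is not in the list.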

We now need to separate the supervision labels from the features that we want to learn from. In this case the `Outcome` column contains the supervision labels, which tell us whether a person has diabetes or not.

In [None]:
df_label = dataframe['Outcome']
df_features = dataframe.drop(columns='Outcome')  # positional axis argument is not accepted by recent pandas
print(df_label.head())
df_features.head()

In [None]:
data = np.array(df_features)
label = np.array(df_label)
print(data.shape, label.shape)

#### Split the data into train and test portions
We need to split the data into training and testing sets. We will use the `train_test_split` function from sklearn to split our samples into 80% train samples and 20% test samples.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df_features, df_label, test_size=0.2, random_state=42)
# Print both shapes; a notebook cell only displays its last bare expression
print(x_train.shape, x_test.shape)

#### Build the model
We will now build the neural network.

To do this, we call `model = Sequential()` (remember we imported Sequential in the first cell of this notebook). It is called sequential because neural networks are a sequence: layer 1 then layer 2 then layer 3 etc. So the model is called `model` and it is sequential.

In [None]:
model = Sequential()

#### Add layers
Next we need to add layers. In keras, a fully connected layer (like you saw in the lecture) is called a Dense layer. It works as follows:
    `model.add(Dense(number of neurons, input dimension (optional), activation function))`
    
The *input dimension* only needs to be included for the first layer. For the following layers, keras will automatically set the input dimension to the number of neurons in the previous layer.

The *number of neurons* is the number of features we consider at each layer. 
Note the last layer has only one neuron. This is because our labels are one dimensional. When we have $n>2$ output classes, we'll need $n$ outputs.

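The *activation* is a non-linear function that is applied at each layer. We will discuss this in the next lecture. In the final layer we use a sigmoid, which squashes the output to a value between 0 and 1 that can be read as the probability of diabetes; the loss function then compares this value with the true label. Again, we will discuss loss functions in detail on Thursday.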

In [None]:
model.add(Dense(50, input_dim=8, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
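
To make the idea of a Dense layer concrete, here is a minimal NumPy sketch of what the four layers above compute on a forward pass. It is not how Keras is implemented internally; the helper names (`dense`, `relu`, `sigmoid`), the random weights, and the made-up batch of 3 samples are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, weights, bias, activation):
    """One fully connected layer: weighted sum of the inputs, then a non-linearity."""
    return activation(x @ weights + bias)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes mirror the cell above: 8 inputs -> 50 -> 50 -> 50 -> 1 output.
sizes = [8, 50, 50, 50, 1]
activations = [relu, relu, relu, sigmoid]

# Random weights stand in for the values that training would learn.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=(3, 8))  # a made-up batch of 3 samples with 8 features each
for (w, b), act in zip(params, activations):
    x = dense(x, w, b, act)
    print(x.shape)           # the second dimension is that layer's number of neurons

# Trainable parameters: one weight per input-output pair plus one bias per neuron.
n_params = sum(w.size + b.size for w, b in params)
print(n_params)
```

Each layer's input dimension is simply the previous layer's number of neurons, which is why only the first layer needs `input_dim`. The parameter count can be compared with the total reported by `model.summary()`.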

Lastly, to compile the model, we use `model.compile`. This requires three things, namely the loss, the optimiser, and the metric to report during training:

*Loss*: For classification we use the cross-entropy loss. In this case we only have two classes, so we need `binary_crossentropy`.

*Optimiser*: The optimiser is what we use to update the weights in the network. Traditionally we used Stochastic Gradient Descent (SGD), but a few years ago 'Adam' was proposed, and it usually outperforms SGD.

*Metrics*: These are reported during training but do not affect the weight updates. We use `accuracy`, the fraction of samples classified correctly.

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
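
As a rough illustration of what `binary_crossentropy` measures, the sketch below computes the mean of $-[y\log p + (1-y)\log(1-p)]$ over a few samples in NumPy. The function is a hand-written stand-in, not the Keras implementation, and the labels and predicted probabilities are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the predictions so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_sample))

labels = [1, 0, 1]
mostly_right = binary_crossentropy(labels, [0.9, 0.1, 0.8])  # confident and correct
mostly_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.2])  # confident and incorrect
print(mostly_right, mostly_wrong)
```

Confident wrong predictions are penalised much more heavily than confident correct ones, which is what pushes the sigmoid output towards the true label during training.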

#### Training (and testing) the model

To train the model, we use `model.fit()`, which takes several arguments:

*Training data*: The training features and training labels, in our case `x_train` and `y_train`.

*Epochs*: How many times we will pass through all of the training data.

*Batch size*: How many samples we will consider before each update of the weights.

*Validation_data*: Data used to check the performance on unseen samples after each epoch. We will use the test data for this. You might also want to split 10% of your training data off to use for validation, so that your test data remains completely unseen until after training.

In [None]:
model.fit(x_train,y_train, epochs=1000, batch_size=10, validation_data=(x_test, y_test))
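
The sketch below illustrates two of the ideas above: how batch size determines the number of weight updates per epoch, and how 10% of the training data could be held out for validation. The helper names `updates_per_epoch` and `hold_out_validation` and the demo arrays are hypothetical, chosen only for this example.

```python
import math
import numpy as np

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one epoch; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)

def hold_out_validation(x, y, fraction=0.1, seed=0):
    """Shuffle the training data and split off `fraction` of it for validation."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(x))
    n_val = int(round(fraction * len(x)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Made-up data: 100 training samples with 8 features each.
x_demo = np.arange(800).reshape(100, 8)
y_demo = np.arange(100) % 2

x_fit, y_fit, x_val, y_val = hold_out_validation(x_demo, y_demo, fraction=0.1)
print(x_fit.shape, x_val.shape)
print(updates_per_epoch(len(x_fit), batch_size=10))
```

The held-out arrays could then be passed as `validation_data=(x_val, y_val)`, keeping `x_test` for a single evaluation at the end. Keras's `fit` also has a `validation_split` argument that does a similar job.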

Let's try a couple of samples from the test set to see what the model predicts:

In [None]:
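# x_test may be a DataFrame, so convert to an array before indexing by row
sample1 = np.asarray(x_test)[0:1]  # first test row, shape (1, 8)
sample2 = np.asarray(x_test)[1:2]  # second test row, shape (1, 8)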
sample1

In [None]:
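# predict_classes() was removed from recent Keras versions; predict() returns
# the sigmoid output, a probability between 0 and 1, which we threshold at 0.5
probability = model.predict(sample1)[0, 0]
result = int(probability > 0.5)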

if result==0:
    print("NO Diabetes")
else:
    print("Diabetes")

### Exercise 3
Try stuff!

Add more layers, change the number of neurons in each layer (it doesn't have to be the same in every layer), or change the optimiser to `'sgd'`, and see what the highest accuracy you can get is. You can also try these things in the iterative development mission of your project.

#### Overfitting
Compare the training accuracy and the testing accuracy (it is called val_accuracy above). The training accuracy is much higher. This is perfectly normal, but also a good indication that the model is memorising the data. Recall from the lecture that this is called overfitting.
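
One way to put a number on this comparison is to average the gap between the two accuracies over the last few epochs. The sketch below works on a plain dictionary with made-up values; it assumes the same layout as the `history` attribute of the object returned by `model.fit`, with keys `accuracy` and `val_accuracy` (older Keras versions may name them `acc` and `val_acc`). The helper `accuracy_gap` is hypothetical.

```python
def accuracy_gap(history, last_n=3):
    """Average train-minus-validation accuracy over the last `last_n` epochs."""
    train = history["accuracy"][-last_n:]
    val = history["val_accuracy"][-last_n:]
    return sum(train) / len(train) - sum(val) / len(val)

# Made-up numbers laid out like a Keras training history.
fake_history = {
    "accuracy":     [0.70, 0.80, 0.90, 0.95, 0.99],
    "val_accuracy": [0.68, 0.74, 0.76, 0.75, 0.74],
}

gap = accuracy_gap(fake_history)
print(round(gap, 3))
```

A gap that keeps growing while the validation accuracy stalls or falls is the pattern described above.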

Overfitting can be tackled by a technique called dropout, where a proportion, $p$, of the nodes within each layer of the neural network are randomly switched off at every training step.
We will try $p=0.3$, which switches off 30% of the nodes on average, with a different random set chosen each step. All nodes are used again when the model makes predictions.
This stops layers of hidden neurons being overly reliant on a small number of nodes, which can often happen when the data set is small and easy to memorise.

To add a dropout layer, we simply call `model.add(Dropout(0.3))`. In this instance we define a new model called `model2`.

In [None]:
model2 = Sequential()
model2.add(Dense(50, input_dim=8, activation='relu'))
model2.add(Dropout(0.3))
model2.add(Dense(50, activation='relu'))
model2.add(Dropout(0.3))
model2.add(Dense(50, activation='relu'))
model2.add(Dropout(0.3))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.fit(x_train,y_train, epochs=1000, batch_size=70, validation_data=(x_test, y_test))
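
To see what a dropout layer does to the values passing through it, here is a small NumPy sketch of "inverted" dropout, where the surviving values are scaled up by $1/(1-p)$ so that the average activation stays the same. The function name `dropout` and the all-ones input are assumptions for illustration, not the Keras source.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, rescale the survivors."""
    if not training or p == 0.0:
        return activations               # at prediction time every node is kept
    keep = rng.random(activations.shape) >= p
    return activations * keep / (1.0 - p)

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 50))       # made-up activations from a 50-neuron layer

dropped = dropout(layer_output, p=0.3, rng=rng)
print((dropped == 0).mean())             # fraction of nodes switched off, close to 0.3
print(dropped.mean())                    # close to 1.0 thanks to the rescaling
```

Because a different random mask is drawn each time, no single node can be relied on, which is the effect described above.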

### Exercise 4

We've improved the test accuracy, but the training accuracy has massively decreased. This might be because too many neurons have been dropped out and we are no longer learning effectively. Try changing the dropout parameter to see what gives the best results.