## Diabetes Diagnosis
We are going to see how a neural network can be used to diagnose diabetes. As you go through this notebook, pay attention to how little human intervention is needed: the neural network does almost all of the work for us!

In [None]:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
import numpy as np
from sklearn.model_selection import train_test_split

### Exercise 1: Load the data

Before we load the data, we must upload it to Colab. On the left you should see four symbols; the bottom one opens the file browser.

Read the line below and work out where the correct place to upload the data is. Hint: to create a new folder, right-click under `sample data`.


In [None]:
dataframe = pd.read_csv("Data/diabetes.csv")

Now that the data is loaded, let's have a look at it:

In [None]:
dataframe.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Exercise 2

You may notice that there are a lot of '0' values in this data. With a touch of common sense, we can work out which are legitimate zeros and which are missing values. It's perfectly normal to have had 0 pregnancies, but I hope to never meet anyone with a skin thickness of 0!

Use the missing value imputation methods you have seen previously to fill in missing values in this data set.
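
Before imputing anything, it can help to count how many zeros each column contains. The sketch below does this on the five rows shown by `dataframe.head()` above; the small `sample` frame is retyped here purely for illustration, and on the full data the same expression would be applied to `dataframe`.

```python
import pandas as pd

# The five rows shown by dataframe.head(), retyped here for illustration only.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# (sample == 0) is a table of True/False values; summing counts the Trues per column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

A count on its own does not say whether a zero is legitimate; that still takes the common-sense check described above.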

### Solution
First, let's take a look at the columns individually.

In [None]:
dataframe['Insulin']

0        0
1        0
2        0
3       94
4      168
      ... 
763    180
764      0
765    112
766      0
767      0
Name: Insulin, Length: 768, dtype: int64

Experiment to see what happens. We are going to need either the `.where()` method or the `.replace()` method. `where` keeps the values for which the condition is true and replaces all the others, which makes it easy to get backwards, so I will use `replace` for the solution. First, though, here is how to do a simple replacement with `where`.

In [None]:
dataframe['Insulin'].where(dataframe['Insulin']==0)  # Looks promising: values where the condition is True are kept, the rest become NaN

0      0.0
1      0.0
2      0.0
3      NaN
4      NaN
      ... 
763    NaN
764    0.0
765    NaN
766    0.0
767    0.0
Name: Insulin, Length: 768, dtype: float64

In [None]:
dataframe['Insulin'].where(dataframe['Insulin']==0, -1)  # This replaces the wrong values (the non-zeros) with -1

0      0
1      0
2      0
3     -1
4     -1
      ..
763   -1
764    0
765   -1
766    0
767    0
Name: Insulin, Length: 768, dtype: int64

In [None]:
dataframe['Insulin'].where(dataframe['Insulin']!=0, -1)  # This is what we want: the zeros become -1

0       -1
1       -1
2       -1
3       94
4      168
      ... 
763    180
764     -1
765    112
766     -1
767     -1
Name: Insulin, Length: 768, dtype: int64

In [None]:
dataframe['Insulin'] = dataframe['Insulin'].where(dataframe['Insulin']!=0, -1)  # This works, but it is quite messy!
dataframe

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,-1,33.6,0.627,50,1
1,1,85,66,29,-1,26.6,0.351,31,0
2,8,183,64,0,-1,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,-1,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,-1,30.1,0.349,47,1


In [None]:
# We altered the original dataframe above, so let's reload it and try the same thing with .replace()
dataframe = pd.read_csv("Data/diabetes.csv")

In [None]:
dataframe_replace = dataframe.replace(0, -1)  # Much easier, but this also replaces legitimate zeros (e.g. in Pregnancies and Outcome), so we need to select columns
dataframe_replace  # Use dataframe_replace so we do not overwrite the original dataframe

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,-1,33.6,0.627,50,1
1,1,85,66,29,-1,26.6,0.351,31,-1
2,8,183,64,-1,-1,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,-1
4,-1,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,-1
764,2,122,70,27,-1,36.8,0.340,27,-1
765,5,121,72,23,112,26.2,0.245,30,-1
766,1,126,60,-1,-1,30.1,0.349,47,1


In [None]:
columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction']  # Columns where a zero cannot be a real measurement
dataframe_replace = dataframe.copy()  # Work on a copy; without .copy() we would overwrite the original dataframe

# Simple missing value imputation: replace 0 with -1 in the selected columns only
dataframe_replace[columns] = dataframe[columns].replace(0, -1)
dataframe_replace  # This seems to have worked

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,-1,33.6,0.627,50,1
1,1,85,66,29,-1,26.6,0.351,31,0
2,8,183,64,-1,-1,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,-1,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,-1,-1,30.1,0.349,47,1


In [None]:
# Now let's try more complicated types of missing value imputation
dataframe['Insulin'].where(dataframe['Insulin']!=0).mean()  # .where() turns the zeros into NaN and .mean() ignores NaN, so this is the mean of the non-zero values


155.5482233502538

In [None]:
# That seems to work, so let's replace the zeros in each selected column with that column's mean
dataframe_mean = dataframe.copy()
for column in columns:
  column_mean = dataframe[column].where(dataframe[column]!=0).mean()  # mean of the non-zero values in the column
  dataframe_mean[column] = dataframe[column].replace(0, column_mean)  # replace 0s with that mean

dataframe_mean

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.00000,155.548223,33.6,0.627,50,1
1,1,85.0,66.0,29.00000,155.548223,26.6,0.351,31,0
2,8,183.0,64.0,29.15342,155.548223,23.3,0.672,32,1
3,1,89.0,66.0,23.00000,94.000000,28.1,0.167,21,0
4,0,137.0,40.0,35.00000,168.000000,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.00000,180.000000,32.9,0.171,63,0
764,2,122.0,70.0,27.00000,155.548223,36.8,0.340,27,0
765,5,121.0,72.0,23.00000,112.000000,26.2,0.245,30,0
766,1,126.0,60.0,29.15342,155.548223,30.1,0.349,47,1


In [None]:
# Now let's do the same with a random number between the minimum and maximum
dataframe_rand = dataframe.copy()

for column in columns:
  column_max = dataframe[column].max()
  column_min = dataframe[column].where(dataframe[column]!=0).min()  # we need the 'where' here, otherwise we would get 0 as the minimum
  rand = np.random.uniform(column_min, column_max)  # one random number from the uniform distribution between min and max (you can try the normal distribution as well)
  dataframe_rand[column] = dataframe[column].replace(0, rand)

dataframe_rand  # This assigns one random value per column. How could you assign a different value to every zero?

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.000000,17.717941,33.6,0.627,50,1
1,1,85.0,66.0,29.000000,17.717941,26.6,0.351,31,0
2,8,183.0,64.0,47.469012,17.717941,23.3,0.672,32,1
3,1,89.0,66.0,23.000000,94.000000,28.1,0.167,21,0
4,0,137.0,40.0,35.000000,168.000000,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.000000,180.000000,32.9,0.171,63,0
764,2,122.0,70.0,27.000000,17.717941,36.8,0.340,27,0
765,5,121.0,72.0,23.000000,112.000000,26.2,0.245,30,0
766,1,126.0,60.0,47.469012,17.717941,30.1,0.349,47,1


### End of Solution
You can try feeding the differently imputed dataframes (`dataframe_replace`, `dataframe_mean`, `dataframe_rand`) into the models below, in place of `dataframe`, to see what works best.

We now need to separate the supervision labels from the features that we want to learn from. In this case the `Outcome` column contains the labels, which tell us whether a person has diabetes or not.

In [None]:
df_label = dataframe['Outcome']
df_features = dataframe.drop(columns='Outcome')  # the positional form drop('Outcome', 1) is rejected by pandas 2.x
print(df_label.head())
df_features.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [None]:
data = np.array(df_features)
label = np.array(df_label)
print(data.shape, label.shape)

(768, 8) (768,)


#### Split the data into train and test portions
We need to split the data into training and testing sets. We will use the `train_test_split` function from sklearn to put 80% of our samples in the training set and 20% in the test set.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=42)
x_train.shape  # not displayed: a notebook cell only shows the value of its last expression
x_test.shape

(154, 8)
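
To see both shapes at once and to check what `random_state` does, here is a small sketch on stand-in arrays that have the same shapes as `data` and `label`. The `_demo` names and the contents of the arrays are made up; only the shapes match this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as `data` and `label` in this notebook.
data_demo = np.arange(768 * 8).reshape(768, 8)
label_demo = np.arange(768) % 2

a_train, a_test, b_train, b_test = train_test_split(
    data_demo, label_demo, test_size=0.2, random_state=42
)
print(a_train.shape, a_test.shape)  # (614, 8) (154, 8)

# Using the same random_state again reproduces exactly the same split.
a_train_again, _, _, _ = train_test_split(
    data_demo, label_demo, test_size=0.2, random_state=42
)
print(np.array_equal(a_train, a_train_again))  # True
```

Fixing `random_state` is what makes results comparable between runs when you change the model later on.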

#### Build the model
We will now build the neural network.

To do this, we call `model = Sequential()` (remember we imported Sequential in the first cell of this notebook). It is called sequential because neural networks are a sequence: layer 1 then layer 2 then layer 3 etc. So the model is called `model` and it is sequential.

In [None]:
model = Sequential()

#### Add layers
Next we need to add layers. In Keras, a fully connected layer (like you saw in the lecture) is called a Dense layer. It works as follows:
    `model.add(Dense(number of neurons, input dimension (optional), activation function))`
    
The *input dimension* only needs to be included for the first layer. For the following layers, Keras automatically uses the number of neurons in the previous layer as the input dimension.

The *number of neurons* is the number of features we consider at each layer. 
Note that the last layer has only one neuron. This is because we have only two classes, so a single output (the probability of diabetes) is enough. With $n>2$ output classes, we would need $n$ outputs.

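The *activation* is a non-linear function that is applied to the output of each layer. We will discuss this in the next lecture. In the final layer, the activation is a sigmoid, which squashes the output to a value between 0 and 1 that can be read as the probability of diabetes. The loss function then compares this output with the label; again, we will discuss loss functions in detail on Thursday.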
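
To make the role of the activation concrete, here is a minimal NumPy sketch of what a single fully connected layer computes: a weighted sum of its inputs plus a bias, passed through the activation. The helper names (`relu`, `sigmoid`, `dense_forward`) and the tiny weights are made up for illustration; Keras does all of this internally.

```python
import numpy as np

def relu(z):
    """ReLU keeps positive values and sets negative values to zero."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(x, weights, bias, activation):
    """One fully connected layer: activation(x @ weights + bias)."""
    return activation(x @ weights + bias)

# One sample with 2 features, a hidden layer of 3 neurons, then 1 output neuron.
x = np.array([[1.0, 2.0]])
w_hidden = np.array([[0.5, -1.0, 0.0],
                     [0.25, -0.5, 1.0]])
b_hidden = np.zeros(3)
w_out = np.array([[1.0], [1.0], [1.0]])
b_out = np.zeros(1)

hidden = dense_forward(x, w_hidden, b_hidden, relu)      # weighted sums are [1, -2, 2]
output = dense_forward(hidden, w_out, b_out, sigmoid)    # sigmoid of 1 + 0 + 2
print(hidden, output)
```

The middle hidden neuron has a negative weighted sum, so ReLU sets it to zero, and the sigmoid turns the final sum into a number that can be read as a probability.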

In [None]:
model.add(Dense(50, input_dim=8, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
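
As a quick sanity check on the size of this network, each Dense layer has one weight per (input, neuron) pair plus one bias per neuron. The sketch below does that arithmetic for the layer sizes used above; `dense_param_count` is a made-up helper name, and `model.summary()` should report the same numbers.

```python
def dense_param_count(n_inputs, n_neurons):
    """Weights (n_inputs * n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

layer_sizes = [8, 50, 50, 50, 1]  # inputs, three hidden layers, one output

# Pair each layer's input size with its number of neurons.
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)  # [450, 2550, 2550, 51] 5601
```

With only 614 training samples and several thousand trainable weights, it is not surprising that a network like this can memorise its training data.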

Lastly, to compile the model, we use `model.compile`. This requires three things: the loss, the optimiser, and the metric to report.
*Loss*: For classification we use the cross-entropy loss. In this case we only have two classes, so we need `binary_crossentropy`.
*Optimiser*: The optimiser is what we use to update the weights in the network. Traditionally this was Stochastic Gradient Descent (SGD), but 'Adam', proposed in 2014, is now a common default and often converges faster than plain SGD.
*Metric*: The metric is what is reported during training so that we can monitor progress; the network minimises the loss, not the metric. We use `accuracy`.
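
For intuition about what `binary_crossentropy` measures, here is a NumPy sketch of the standard formula, $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$. The function name and the example probabilities are made up, and the small `eps` clip is an assumption added here to avoid `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never sees exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

labels = [1, 0, 1, 0]
confident_and_right = binary_cross_entropy(labels, [0.9, 0.1, 0.9, 0.1])
unsure = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident_and_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.1, 0.9])
print(confident_and_right, unsure, confident_and_wrong)
```

The loss is small when the model is confident and right, equals $\log 2$ when it always answers 0.5, and grows quickly when it is confident and wrong.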

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

#### Training (and testing) the model

To train the model, we use `model.fit()`, which takes several arguments:
*Training data*: the training features and training labels, in our case `x_train` and `y_train`.
*Epochs*: how many times we pass through all of the training data.
*Batch size*: how many samples are used for each weight update.
*Validation_data*: data used to measure performance on unseen samples after each epoch. We will use the test data for this. You might also want to split 10% of your training data off to use for validation, so that your test data remains completely unseen until after training.
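
If you want to try the 10% validation split suggested above, one option is to call `train_test_split` a second time on the training portion. This is a sketch on stand-in arrays with the same shapes as `x_train` and `y_train`; the `_demo`, `x_fit` and `x_val` names are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as the training split in this notebook.
x_train_demo = np.zeros((614, 8))
y_train_demo = np.zeros(614)

# Hold back 10% of the training samples for validation.
x_fit, x_val, y_fit, y_val = train_test_split(
    x_train_demo, y_train_demo, test_size=0.1, random_state=42
)
print(x_fit.shape, x_val.shape)  # (552, 8) (62, 8)
```

You would then train on `x_fit`, `y_fit`, pass `validation_data=(x_val, y_val)` to `model.fit`, and keep `x_test`, `y_test` for a single final check after training.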

In [None]:
model.fit(x_train,y_train, epochs=1000, batch_size=10, validation_data=(x_test, y_test))

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
...

Let's test a couple of samples from the test set to see what the model predicts.

In [None]:
sample1 = np.array([x_test[0]])
sample2 = np.array([x_test[1]])
sample1

In [None]:
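# predict_classes was removed from recent Keras versions, so threshold the sigmoid output at 0.5 instead
result = int(model.predict(sample1)[0, 0] > 0.5)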

if result==0:
    print("NO Diabetes")
else:
    print("Diabetes")
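
The sigmoid output is a probability between 0 and 1, so turning it into a class means comparing it with a threshold (0.5 is the usual default). The sketch below applies this to several samples at once and computes an accuracy. The helper name, the probabilities and the labels are all made up; in the notebook the probabilities would come from `model.predict(x_test)`.

```python
import numpy as np

def probabilities_to_labels(probabilities, threshold=0.5):
    """Return 1 where the predicted probability exceeds the threshold, else 0."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Made-up probabilities with the shape that model.predict returns: (n_samples, 1).
fake_probabilities = np.array([[0.92], [0.08], [0.51], [0.49]])
predicted = probabilities_to_labels(fake_probabilities).ravel()

# Made-up true labels, used only to show how accuracy is computed.
fake_truth = np.array([1, 0, 0, 0])
accuracy = (predicted == fake_truth).mean()
print(predicted, accuracy)  # [1 0 1 0] 0.75
```

For a diagnostic task, you may prefer a threshold other than 0.5 if missing a case of diabetes is worse than raising a false alarm.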

### Exercise 3
Try stuff!

Add more layers, change the number of neurons in each layer (there doesn't have to be the same number in each layer), or change the optimiser to `'sgd'`, and see what the highest accuracy you can get is. You can also try these things in the iterative development mission of your project.

#### Overfitting
Compare the training accuracy and the testing accuracy (it is called val_accuracy above). The training accuracy is much higher. This is perfectly normal, but also a good indication that the model is memorising the data. Recall from the lecture that this is called overfitting.

Overfitting can be tackled by a technique called dropout, where a proportion, $p$, of the nodes within a layer are randomly switched off at each training step.
We will try $p=0.3$, which switches off roughly 30% of the nodes each time.
This stops the network from becoming overly reliant on a small number of nodes, which can easily happen when the data set is small and easy to memorise.
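
Here is a NumPy sketch of what a dropout layer does to its inputs during training. It zeroes each value with probability $p$ and scales the survivors by $1/(1-p)$ so that the expected sum is unchanged, which is also how the Keras `Dropout` layer behaves; at prediction time dropout is switched off. The function name and the all-ones input are made up for illustration.

```python
import numpy as np

def dropout(activations, p, rng):
    """Zero each activation with probability p and rescale the survivors by 1 / (1 - p)."""
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((1000, 50))  # stand-in for one batch of hidden-layer outputs

dropped = dropout(activations, p=0.3, rng=rng)
fraction_zeroed = (dropped == 0).mean()
print(round(fraction_zeroed, 2), round(dropped.mean(), 2))
```

Because a different random mask is drawn at every step, no single node can be relied on, which pushes the network towards more robust features.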

To add a dropout layer, we simply call `model.add(Dropout(0.3))`. Here we define a new model called `model2`:

In [None]:
model2 = Sequential()
model2.add(Dense(50, input_dim=8, activation='relu'))
model2.add(Dropout(0.3))
model2.add(Dense(50, activation='relu'))
model2.add(Dropout(0.3))
model2.add(Dense(50, activation='relu'))
model2.add(Dropout(0.3))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.fit(x_train,y_train, epochs=1000, batch_size=70, validation_data=(x_test, y_test))

### Exercise 4

We've improved the test accuracy, but the training accuracy has decreased massively. This might be because too many neurons have been dropped out and we are no longer learning effectively. Try changing the dropout parameter to see what gives the best results.