# Neural Network classification model to detect diabetes

Diabetes Dataset columns:
1. Pregnancies - Number of times pregnant.
2. Glucose - Plasma glucose concentration.
3. BloodPressure - Diastolic blood pressure (mm Hg).
4. SkinThickness - Skinfold thickness (mm).
5. Insulin - 2-hour serum insulin (mu U/ml).
6. BMI - Body mass index (weight in kg / (height in m)^2).
7. DiabetesPedigreeFunction - Diabetes pedigree function.
8. Age - Age in years.
9. Outcome - “1” represents the presence of diabetes while “0” represents the absence of it. (This is the target variable.)

**AIM-** Our aim is to build a classification model with good accuracy that predicts whether someone has diabetes, using the *scikit-learn* `MLPClassifier` (a multi-layer perceptron neural network).

### Loading the Required Libraries and Modules

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

### Reading the Data and Performing Basic Data Checks

In [41]:
df = pd.read_csv('diabetes.csv')
df.head()

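| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |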


In [42]:
df.shape

(768, 9)

In [43]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
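
The summary reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. Such values are not physiologically plausible, so they probably stand for missing measurements; that reading is an assumption, since the source does not say so. A minimal sketch of counting the zeros per column, using the five rows printed by `df.head()` as a small stand-in for the full file (the names `sample` and `zero_counts` are illustrative):

```python
import pandas as pd

# The five rows shown by df.head(), used as a small stand-in for diabetes.csv.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count how many entries in each column are exactly zero.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the real data the same expression would be applied to `df[columns]` for the columns of interest.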


### Creating Arrays for the Features and the Response Variable

In [87]:
target_column = ['Outcome'] 
predictors = list(df.columns[0:8])
df[predictors] = df[predictors]/df[predictors].max()
df.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352941 | 0.743719 | 0.590164 | 0.353535 | 0.0 | 0.500745 | 0.259091 | 0.617284 | 1 |
| 1 | 0.058824 | 0.427136 | 0.540984 | 0.292929 | 0.0 | 0.396423 | 0.145041 | 0.382716 | 0 |
| 2 | 0.470588 | 0.919598 | 0.52459 | 0.0 | 0.0 | 0.347243 | 0.277686 | 0.395062 | 1 |
| 3 | 0.058824 | 0.447236 | 0.540984 | 0.232323 | 0.111111 | 0.418778 | 0.069008 | 0.259259 | 0 |
| 4 | 0.0 | 0.688442 | 0.327869 | 0.353535 | 0.198582 | 0.642325 | 0.945455 | 0.407407 | 1 |


In [82]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 0.22618 | 0.19821 | 0.0 | 0.058824 | 0.176471 | 0.352941 | 1.0 |
| Glucose | 768.0 | 0.60751 | 0.160666 | 0.0 | 0.497487 | 0.58794 | 0.704774 | 1.0 |
| BloodPressure | 768.0 | 0.566438 | 0.158654 | 0.0 | 0.508197 | 0.590164 | 0.655738 | 1.0 |
| SkinThickness | 768.0 | 0.207439 | 0.161134 | 0.0 | 0.0 | 0.232323 | 0.323232 | 1.0 |
| Insulin | 768.0 | 0.094326 | 0.136222 | 0.0 | 0.0 | 0.036052 | 0.150414 | 1.0 |
| BMI | 768.0 | 0.47679 | 0.117499 | 0.0 | 0.406855 | 0.4769 | 0.545455 | 1.0 |
| DiabetesPedigreeFunction | 768.0 | 0.19499 | 0.136913 | 0.032231 | 0.100723 | 0.153926 | 0.258781 | 1.0 |
| Age | 768.0 | 0.410381 | 0.145188 | 0.259259 | 0.296296 | 0.358025 | 0.506173 | 1.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
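
Dividing by the column maximum of the whole table means that rows later held out for testing also influence the scaling. One possible alternative, sketched here on synthetic data rather than on `diabetes.csv`, is to split first and learn the maxima from the training rows only. `MaxAbsScaler` from scikit-learn divides each column by its maximum absolute value, which matches `df / df.max()` for non-negative features; the names `X_demo`, `y_demo`, and `scaler` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Synthetic non-negative features standing in for the eight predictors.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 200, size=(100, 8))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)

# Learn the per-column maxima from the training rows only.
scaler = MaxAbsScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# The test rows reuse the training maxima, so they may slightly exceed 1.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.max(axis=0))
```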


### Creating the Training and Test Datasets

In [88]:
X = df[predictors].values
y = df[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)

(537, 8)
(231, 8)
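
The mean of `Outcome` (0.348958 over 768 rows) implies 268 positive cases, so the classes are imbalanced. A plain random split can leave the two subsets with different positive rates. As a sketch of one option, the `stratify` argument of `train_test_split` keeps the class ratio nearly equal in both subsets. The labels below only mimic the class balance of the dataset, and the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class balance as the Outcome column: 268 ones out of 768.
y_demo = np.array([1] * 268 + [0] * 500)
# Placeholder features; only the labels matter for this illustration.
X_demo = np.arange(768).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40, stratify=y_demo
)

# Fraction of positive cases in each subset.
print(y_tr.mean(), y_te.mean())
```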


In [89]:
X

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.25909091,
        0.61728395],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.14504132,
        0.38271605],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.27768595,
        0.39506173],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.10123967,
        0.37037037],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.14421488,
        0.58024691],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.13016529,
        0.28395062]])

### Building, Predicting, and Evaluating the Neural Network Model

In [55]:
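from sklearn.neural_network import MLPClassifier

# three hidden layers of 8 units each, ReLU activations, Adam optimizer
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
# ravel() flattens the (n, 1) target column to 1-D, which avoids the DataConversionWarning
mlp.fit(X_train, y_train.ravel())

predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)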
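
The classifier above is created without a `random_state`, so its weight initialization differs from run to run and the reported numbers may not reproduce exactly. The following sketch, on synthetic data from `make_classification`, illustrates that fixing `random_state` gives repeatable predictions. The helper `fit_and_predict` and the seed value are hypothetical choices for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data with eight features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Train the same architecture with a fixed seed and return its predictions."""
    model = MLPClassifier(hidden_layer_sizes=(8, 8, 8), activation='relu',
                          solver='adam', max_iter=500, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_demo)

# Two independent fits with the same seed.
pred_a = fit_and_predict(1)
pred_b = fit_and_predict(1)
print(np.array_equal(pred_a, pred_b))
```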


In [77]:
train_conf_matrix = confusion_matrix(y_train,predict_train)
print(train_conf_matrix)
print('*'*10)
test_conf_matrix = confusion_matrix(y_test,predict_test)
print(test_conf_matrix)

[[321  37]
 [ 77 102]]
**********
[[129  13]
 [ 42  47]]
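
In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, so for binary labels the layout is `[[TN, FP], [FN, TP]]`. A short sketch that unpacks the test-set matrix printed above with `ravel()`:

```python
import numpy as np

# Test-set confusion matrix from the output above
# (rows: actual 0/1, columns: predicted 0/1).
test_conf_matrix = np.array([[129, 13],
                             [42, 47]])

# ravel() flattens row by row, giving TN, FP, FN, TP in that order.
tn, fp, fn, tp = test_conf_matrix.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```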


In [67]:
def confusion_metrics(conf_matrix):
    # slice the 2x2 confusion matrix into its four cells
    # (rows are actual classes, columns are predicted classes)
    TP = conf_matrix[1][1]
    TN = conf_matrix[0][0]
    FP = conf_matrix[0][1]
    FN = conf_matrix[1][0]
    print('True Positives:', TP)
    print('True Negatives:', TN)
    print('False Positives:', FP)
    print('False Negatives:', FN)
    
    
    # calculate accuracy
    conf_accuracy = (float (TP+TN) / float(TP + TN + FP + FN))
    
    # calculate mis-classification
    conf_misclassification = 1- conf_accuracy
    
    # calculate the sensitivity
    conf_sensitivity = (TP / float(TP + FN))
    # calculate the specificity
    conf_specificity = (TN / float(TN + FP))
    
    # calculate precision: share of predicted positives that are actually positive
    conf_precision = (TP / float(TP + FP))
    # calculate f_1 score
    conf_f1 = 2 * ((conf_precision * conf_sensitivity) / (conf_precision + conf_sensitivity))
    print('-'*50)
    print(f'Accuracy: {round(conf_accuracy,2)}') 
    print(f'Mis-Classification: {round(conf_misclassification,2)}') 
    print(f'Sensitivity or TPR: {round(conf_sensitivity,2)}')
    print(f'Specificity or TNR: {round(conf_specificity,2)}') 
    print(f'Precision: {round(conf_precision,2)}')
    print(f'f_1 Score: {round(conf_f1,2)}')

Performance of the model on training and testing data

In [80]:
print("For training data:")
confusion_metrics(train_conf_matrix)
print('+'*50)
print("For testing data:")
confusion_metrics(test_conf_matrix)

For training data:
True Positives: 102
True Negatives: 321
False Positives: 37
False Negatives: 77
--------------------------------------------------
Accuracy: 0.79
Mis-Classification: 0.21
Sensitivity or TPR: 0.57
Specificity or TNR: 0.9
Precision: 0.73
f_1 Score: 0.64
++++++++++++++++++++++++++++++++++++++++++++++++++
For testing data:
True Positives: 47
True Negatives: 129
False Positives: 13
False Negatives: 42
--------------------------------------------------
Accuracy: 0.76
Mis-Classification: 0.24
Sensitivity or TPR: 0.53
Specificity or TNR: 0.91
Precision: 0.78
f_1 Score: 0.63
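
As a cross-check, the same quantities can be computed with scikit-learn's own metric functions, where precision is TP / (TP + FP) and recall is the sensitivity TP / (TP + FN). The sketch below rebuilds label vectors that reproduce the test-set counts (TN = 129, FP = 13, FN = 42, TP = 47); the vectors are synthetic and carry no information beyond those four counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts taken from the test-set confusion matrix.
tn, fp, fn, tp = 129, 13, 42, 47

# Rebuild label vectors that produce exactly these four counts.
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)        # sensitivity
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print('Accuracy:', round(accuracy, 2))
print('Recall (sensitivity):', round(recall, 2))
print('Precision:', round(precision, 2))
print('F1:', round(f1, 2))
```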


### Conclusion

In this notebook, a neural network classification model is trained on the diabetes dataset. The model achieves an accuracy of about 79% on the training data and 76% on the test data. It could be improved further through cross-validation, feature engineering, or tuning the arguments of the neural network estimator.
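
As a rough sketch of the cross-validation idea, the scaler and the classifier can be combined in a pipeline and scored with `cross_val_score`, so that the scaling is re-fitted inside each fold. The data here is synthetic (`make_classification`), and the fold count, seed, and names `pipeline` and `scores` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in for the eight predictors and the binary outcome.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling is part of the pipeline, so each fold learns its own maxima.
pipeline = make_pipeline(
    MaxAbsScaler(),
    MLPClassifier(hidden_layer_sizes=(8, 8, 8), max_iter=500, random_state=0),
)

# Five-fold cross-validated accuracy.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a sense of how much a single train/test split can vary.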

In [97]:
#mlp.predict([[1,89,66,23,94,28.1,0.167,31]])    # lower BMI and age

array([0], dtype=int64)

In [98]:
#mlp.predict([[1,89,66,23,94,38.1,0.167,51]])   # higher BMI and age: the model predicts diabetes

array([1], dtype=int64)
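
The model was trained on features divided by their column maxima, so a new sample should arguably be scaled the same way before it is passed to `mlp.predict`. A minimal sketch, assuming the maxima reported by `df.describe()` earlier (the names `col_max`, `raw_sample`, and `scaled_sample` are illustrative):

```python
import numpy as np

# Column maxima from the describe() output, in predictor order:
# Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
# DiabetesPedigreeFunction, Age.
col_max = np.array([17.0, 199.0, 122.0, 99.0, 846.0, 67.1, 2.42, 81.0])

# A raw sample in original units, as in the first prediction cell.
raw_sample = np.array([[1, 89, 66, 23, 94, 28.1, 0.167, 31]])

# Apply the same max-based scaling that was used on the training features.
scaled_sample = raw_sample / col_max
print(scaled_sample.round(6))

# With the fitted model available, the call would be: mlp.predict(scaled_sample)
```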