# Diabetes Prediction

## Import

In [1]:
%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_predict,
    cross_val_score,
    learning_curve,
    train_test_split,
)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC, LinearSVC
from scipy.stats import randint as sp_randint
pd.set_option('display.max_rows', 100)

## Data Loading

Scikit-Learn works with NumPy arrays, so we load the *.csv file with pandas and then convert it to a NumPy array.

This dataset is provided by the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict diagnostically whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Refer to https://www.kaggle.com/uciml/pima-indians-diabetes-database for more details. The dataset has the following columns:

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 


#### Task: given the attributes described above, predict whether the patient is diabetic.


In [2]:
# load the file
data = pd.read_csv("./data/diabetes.csv")

# print the shape of the data (rows, columns)
print("The number of rows and Columns are: ", data.shape)

# names of the columns in the data
print(data.columns)

# let's count the samples in each target class
len(data[data['Outcome']==0]), len(data[data['Outcome']==1])
# therefore the most frequent class is 0



The number of rows and Columns are:  (768, 9)
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')


(500, 268)

In [3]:
# convert the data into a NumPy array
dataset = data.values

# separate the data into X and Y (features, target)
X = dataset[:, 0:8]

# Y is kept as a column vector here, so it has to be flattened with ravel() before fitting;
# leaving it 1-D would avoid that extra step
Y = dataset[:, -1].reshape((-1, 1))
X.shape, Y.shape

((768, 8), (768, 1))
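As a rough sketch of the alternative mentioned in the comment above, the features can be selected by column name and the target kept 1-D. The tiny `toy` frame below is made up for illustration and only stands in for `diabetes.csv`; the names `toy`, `X_toy`, and `y_toy` are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for diabetes.csv (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Select the features by name instead of by position, then convert to NumPy
X_toy = toy.drop(columns="Outcome").to_numpy(dtype=float)

# Keep the target 1-D so that no ravel() is needed before fitting
y_toy = toy["Outcome"].to_numpy()

print(X_toy.shape, y_toy.shape)
```

Selecting by name keeps the split correct even if the column order of the file changes.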

## Data Normalization
Many gradient-based methods are highly sensitive to feature scaling. Therefore, before running an algorithm, we perform either normalization or standardization.

Refer to http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing for more information.

In [4]:
# normalize the data: rescale each sample (row) to unit L2 norm
normalized_x = preprocessing.normalize(X)

# standardize the data: center each feature to zero mean and scale it to unit variance
standarized_x = preprocessing.scale(X)

# let's print the min and max values of the normalized and standardized data
print(np.min(normalized_x), np.max(normalized_x))
print(np.min(standarized_x), np.max(standarized_x))


0.0 0.9736822155037493
-4.060473872668307 6.65283937836846
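The two calls above do different things, which is easy to miss: `preprocessing.normalize` works per sample (row), while `preprocessing.scale` works per feature (column). A minimal sketch on a made-up 3x2 matrix `A` (an assumption for illustration, not the diabetes data):

```python
import numpy as np
from sklearn import preprocessing

# Made-up matrix: two features on very different scales
A = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

# Each ROW is rescaled to unit L2 norm
row_normalized = preprocessing.normalize(A)

# Each COLUMN is centered to mean 0 and scaled to unit variance
col_standardized = preprocessing.scale(A)

print(np.linalg.norm(row_normalized, axis=1))   # row norms
print(col_standardized.mean(axis=0))            # column means
print(col_standardized.std(axis=0))             # column standard deviations
```

Only the standardized version puts the features on a common scale, which is why the rest of the notebook continues with `standarized_x`.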


## Data split 
Split the data into training and testing sets: 80% for training and 20% for testing.

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(standarized_x, Y, train_size=0.80, random_state=42)
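Note that the split above is applied to data that was standardized on all 768 rows, so the test rows contribute to the mean and variance used for scaling. One possible variation, sketched here on synthetic data from `make_classification` (an assumption, since the CSV is not bundled with this example), is to split first, fit the scaler on the training part only, and stratify on the target so both parts keep the same class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows and features as the dataset
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Split first; stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, stratify=y_demo, random_state=42
)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

print(X_tr_std.shape, X_te_std.shape)
print(y_tr.mean(), y_te.mean())
```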

## Logistic Regression

Use logistic regression for binary classification.

In [6]:
# let's define the model
model = LogisticRegression(solver='liblinear')

# train the model
model.fit(X_train, Y_train.ravel())

# let's make predictions with the trained model on unseen data
predicted = model.predict(X_test)

# let's measure the performance
acc = accuracy_score(Y_test, predicted)

# report the results
print("The accuracy is ", acc)


The accuracy is  0.7532467532467533


What accuracy would a dumb classifier achieve by always predicting the most frequent class?

In [7]:
# calculate the percentage of ones
# because Y_test only contains ones and zeros, its mean is the percentage of ones
class_one_percentage = Y_test.mean()
print("percentage of zeros = ", 1 - class_one_percentage, "percentage of ones = ", class_one_percentage)



percentage of zeros =  0.6428571428571428 percentage of ones =  0.35714285714285715


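This means that a dumb model that always predicts 0 would be right 64.3% of the time.

The logistic regression accuracy of 75.3% is therefore only about 11 percentage points better than a dumb model, so classification accuracy on its own can make a model look better than it is.

This baseline is a good way to know the minimum we should achieve with our models.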
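The same baseline can also be obtained from scikit-learn's `DummyClassifier`. The sketch below uses made-up labels with the 99/55 class split of the test set; `X_demo` and `y_demo` are illustrative names, and the all-zero feature column is a placeholder because this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with the same 99/55 class split as the test set
y_demo = np.array([0] * 99 + [1] * 55)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print("baseline accuracy =", baseline_acc)
```

In practice the baseline would be fitted on the training labels and scored on the test labels, like any other model.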

#### Notes:
- Classification accuracy is the easiest classification metric to understand
- But it does not tell you the underlying distribution of response values
- We examine that distribution by calculating the dumb classifier's accuracy
- And it does not tell you what "types" of errors your classifier is making


In [8]:
# let's calculate the confusion matrix values
# IMPORTANT: the first argument is always the true values and the second is the predicted values
# let's print the total number of test samples
print("Number of test samples are: ", len(Y_test))

# get the confusion matrix
confusion = metrics.confusion_matrix(Y_test, predicted)

# let's display it
print(confusion)


Number of test samples are:  154
[[79 20]
 [18 37]]


scikit-learn puts the actual classes in the rows and the predicted classes in the columns, with class 0 first, so the matrix reads [[TN, FP], [FN, TP]]:

True Positives (TP): we correctly predicted that they do have diabetes = 37

True Negatives (TN): we correctly predicted that they don't have diabetes = 79

False Positives (FP): we incorrectly predicted that they do have diabetes (falsely predicted positive) = 20

False Negatives (FN): we incorrectly predicted that they don't have diabetes (falsely predicted negative) = 18

Based on these numbers you can calculate accuracy, sensitivity (recall), specificity, false positive rate, and precision.

In [9]:
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]

print ("All of these metrics based on 0.5 threshold ....")
# use float to perform true division, not integer division
print("Accuracy = ", (TP + TN) / float(TP + TN + FP + FN))

sensitivity = TP / float(FN + TP)
print("sensitivity = ",sensitivity)

specificity = TN / (TN + FP)

print("specificity = ",specificity)

false_positive_rate = FP / float(TN + FP)

print("false_positive_rate = ",false_positive_rate)

precision = TP / float(TP + FP)

print("precision = ", precision)

print("recall = ", sensitivity)




All of these metrics based on 0.5 threshold ....
Accuracy =  0.7532467532467533
sensitivity =  0.6727272727272727
specificity =  0.797979797979798
false_positive_rate =  0.20202020202020202
precision =  0.6491228070175439
recall =  0.6727272727272727
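Because the index order of the confusion matrix is easy to mix up, it can help to unpack it with `ravel()` and cross-check against scikit-learn's own metric functions. For labels 0 and 1, `confusion_matrix(y_true, y_pred).ravel()` returns the counts in the order TN, FP, FN, TP. The sketch below rebuilds label vectors that reproduce the matrix `[[79 20] [18 37]]` printed earlier; `y_true` and `y_pred` are constructed for illustration and are not the notebook's actual predictions.

```python
import numpy as np
from sklearn import metrics

# Label vectors that reproduce the confusion matrix [[79, 20], [18, 37]]
# (rows = actual class, columns = predicted class)
y_true = np.array([0] * 99 + [1] * 55)
y_pred = np.array([0] * 79 + [1] * 20 + [0] * 18 + [1] * 37)

# For binary labels 0/1, ravel() yields the counts in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Cross-check the hand-computed ratios against the library functions
recall = metrics.recall_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
print(recall, tp / (tp + fn))
print(precision, tp / (tp + fp))
```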


## Logistic Regression with Polynomials

Use logistic regression for binary classification after mapping the features to polynomial terms, and use a regularizer to reduce overfitting.

In [10]:
# let's define the model, set the regularization parameter C to 0.0001 (smaller C means stronger regularization)
model = LogisticRegression(C = 0.0001, solver='liblinear')

# define the polynomial transformation (degree 2 by default)
poly = PolynomialFeatures(include_bias=False)

# let's map the features: fit on the training set, then reuse the same mapping on the test set
poly_X_Train = poly.fit_transform(X_train)
poly_X_test = poly.transform(X_test)

# print their shapes
print("The shape of Poly X Train/Test is: ", poly_X_Train.shape, poly_X_test.shape)

# Note: the polynomial features should be standardized again at this point; this cell skips that step
# fit the model with the mapped features
model.fit(poly_X_Train, Y_train.ravel())

# let's predict with the model
predicted = model.predict(poly_X_test)

# let's compute the accuracy
acc = accuracy_score(Y_test.ravel(), predicted)

# let's report the results
print("The accuracy is: ", acc)



The shape of Poly X Train/Test is:  (614, 44) (154, 44)
The accuracy is:  0.6558441558441559
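One way to make sure the polynomial features are re-scaled, and that every step is fitted on the training data only, is to chain the steps in a scikit-learn pipeline. This is a sketch on synthetic data from `make_classification`, not the notebook's method; `C=1.0` and the name `poly_model` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in with the same number of rows and features as the dataset
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42
)

# Expand to degree-2 terms, re-standardize them, then fit the classifier
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(C=1.0, solver="liblinear"),
)
poly_model.fit(X_tr, y_tr)

# All steps except the classifier, applied to the test set
n_poly_features = poly_model[:-1].transform(X_te).shape[1]
print(n_poly_features, poly_model.score(X_te, y_te))
```

With 8 input features and degree 2, the expansion has 44 columns, matching the shape printed above.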


The previous cells used the holdout method. It is much better to use cross-validation. For this small dataset I suggest using k-fold with K=5.

In [11]:
# let's define the model. set the regularization parameter C to 0.0001
model = LogisticRegression(C=0.0001, solver='liblinear', penalty='l2')

# define the polynomial transformation
poly = PolynomialFeatures(degree=6, include_bias=False)

# map the features
poly_X = poly.fit_transform(X)
standarized_poly_x = preprocessing.scale(poly_X)

# let's use cross-validation for prediction
y_predict = cross_val_predict(model, standarized_poly_x, Y.ravel(), cv=5, verbose=1)

print("The metrics Classification Report is Given as \n", metrics.classification_report(Y.ravel(), y_predict))
confusion = metrics.confusion_matrix(Y.ravel(), y_predict)
print("The confusion matrix scores are: \n", confusion)

# let's extract TN, FP, FN, TP from the confusion matrix above
# scikit-learn lays it out as [[TN, FP], [FN, TP]] (rows = actual, columns = predicted)
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]

# let's compute the performance metrics
print ("All of these metrics based on 0.5 threshold ....")
# use float to perform true division, not integer division
print("Accuracy = ", (TP + TN) / float(TP + TN + FP + FN))

sensitivity = TP / float(FN + TP)
print("sensitivity = ",sensitivity)

specificity = TN / (TN + FP)

print("specificity = ",specificity)

false_positive_rate = FP / float(TN + FP)

print("false_positive_rate = ",false_positive_rate)

precision = TP / float(TP + FP)

print("precision = ", precision)

print("recall = ", sensitivity)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


The metrics Classification Report is Given as 
               precision    recall  f1-score   support

         0.0       0.80      0.81      0.80       500
         1.0       0.64      0.62      0.63       268

    accuracy                           0.74       768
   macro avg       0.72      0.71      0.72       768
weighted avg       0.74      0.74      0.74       768

The confusion matrix scores are: 
 [[405  95]
 [102 166]]
All of these metrics based on 0.5 threshold ....
Accuracy =  0.7434895833333334
sensitivity =  0.6194029850746269
specificity =  0.81
false_positive_rate =  0.19
precision =  0.6360153256704981
recall =  0.6194029850746269


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.3s finished
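When only a score per fold is needed, `cross_val_score` is a compact alternative to `cross_val_predict`. The sketch below runs 5-fold cross-validation on synthetic data from `make_classification` (an assumption, used so the example runs without the CSV). The scaler sits inside the pipeline, so it is refitted on the training part of each fold; `cv_model`, `folds`, and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows and features as the dataset
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Scaling happens inside each fold because it is part of the pipeline
cv_model = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

# K=5 folds, stratified so each fold keeps the overall class balance
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_model, X_demo, y_demo, cv=folds, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a sense of how stable the estimate is on a dataset this small.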


Apply grid search to find the best hyperparameters (C and penalty).

In [12]:
C = 10.**np.arange(-7, 2)
penalty = ['l1', 'l2']
hyperparameters = dict(C=C, penalty=penalty)
print(' Hyperparameters are:\n', hyperparameters)

# let's build the logistic regression model
logistic = LogisticRegression(solver='liblinear')
clf = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)
best_model = clf.fit(standarized_poly_x, Y.ravel())

# let's get the best parameters
print("Best Penalty is: ", best_model.best_params_['penalty'])
print("Best C is: ", best_model.best_params_['C'])


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


 Hyperparameters are:
 {'C': array([1.e-07, 1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00,
       1.e+01]), 'penalty': ['l1', 'l2']}
Fitting 5 folds for each of 18 candidates, totalling 90 fits




Best Penalty is:  l1
Best C is:  0.1


[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:  1.3min finished
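A grid search can also be run over a whole pipeline, so that scaling is refitted within every fold. Parameters of a pipeline step are addressed as `<step name>__<parameter>`. The sketch below uses synthetic data and a smaller grid than the cell above to keep it quick; the step names `scale` and `clf` and the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": 10.0 ** np.arange(-3, 2),
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best candidate, which is a quick way to compare it with earlier results.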


In [None]:
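# let's evaluate the tuned model with cross-validated predictions
# note: best_model is the whole GridSearchCV object, so the grid search is re-run inside each of the 5 folds
y_predicted = cross_val_predict(best_model, standarized_poly_x, Y.ravel(), cv=5)

# let's print the performance metrics and keep the confusion matrix for the next cell
confusion = metrics.confusion_matrix(Y.ravel(), y_predicted)
print("Metrics Classification Report:\n", metrics.classification_report(Y.ravel(), y_predicted))
print("Confusion Matrix is:\n", confusion)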

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 18 candidates, totalling 90 fits


In [None]:
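# let's compute the remaining performance metrics
# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] (rows = actual, columns = predicted)
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]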

print ("All of these metrics based on 0.5 threshold ....")
# use float to perform true division, not integer division
print("Accuracy = ", (TP + TN) / float(TP + TN + FP + FN))

sensitivity = TP / float(FN + TP)
print("sensitivity = ",sensitivity)

specificity = TN / (TN + FP)

print("specificity = ",specificity)

false_positive_rate = FP / float(TN + FP)

print("false_positive_rate = ",false_positive_rate)

precision = TP / float(TP + FP)

print("precision = ", precision)

print("recall = ", sensitivity)
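All of the metrics above depend on the default 0.5 decision threshold. A rough sketch of how the threshold trades sensitivity for specificity follows, using `predict_proba` on synthetic data from `make_classification`. The data, the thresholds 0.3/0.5/0.7, and the names `proba_pos` and `recalls` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows and features as the dataset
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42
)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# Probability of the positive class (column 1) for each test sample
proba_pos = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive whenever the probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(threshold, recalls[threshold])
```

Lowering the threshold can only add predicted positives, so sensitivity (recall) never decreases as the threshold drops; the cost is more false positives.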