## Case Diabetes
The diabetes dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.  
  
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: a function which scores likelihood of diabetes based on family history  
- Age: Age (years)
- Outcome: Class variable (0 or 1); 268 of the 768 records are 1, the others are 0  
  
Answer the following questions:
1. What is the number of rows and columns in the dataset?
1. How many persons have diabetes and how many do not?
1. What is the minimum value for each predictor?
1. Blood pressure: The data contain 0 values for blood pressure, but a living person cannot have a diastolic blood pressure of zero. How many records have value == 0?
1. Plasma glucose level: Even after fasting, the glucose level would not be as low as zero. How many records have value == 0?
1. BMI: Should not be 0 or close to zero unless the person is severely underweight, which could be life-threatening. How many records have value == 0?
1. Remove the rows which have zero as the value for 'BloodPressure', 'BMI' or 'Glucose'.
1. Give the new dimensions of the dataset.
1. Use an ensemble method to try to predict the value for Outcome.
1. Calculate the accuracy and the precision.
1. The model can be used to determine whether a person might develop diabetes. Write a function that takes values for all features to predict this.

In [2]:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/diabetes.csv'
diabetes = pd.read_csv(url)
print(diabetes.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [3]:
# 1 What is the number of rows and columns in the dataset?
diabetes.shape

(768, 9)

In [4]:
# 2 How many persons actually have diabetes and how many haven't?
diabetes.groupby('Outcome')['Outcome'].count()

Outcome
0    500
1    268
Name: Outcome, dtype: int64

In [5]:
# 3 What is the minimum value for each predictor?
diabetes.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count        768.0       768.0          768.0          768.0       768.0   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min            0.0         0.0            0.0            0.0         0.0   
25%            1.0        99.0           62.0            0.0         0.0   
50%            3.0       117.0           72.0           23.0        30.5   
75%            6.0      140.25           80.0           32.0      127.25   
max           17.0       199.0          122.0           99.0       846.0   

             BMI  DiabetesPedigreeFunction        Age   Outcome  
count      768.0                     768.0      768.0     768.0  
mean   31.992578                  0.471876  33.240885  0.348958  
std      7.88416                  0.331329  11.760232  0.476951  
min          0.0                     0.078       21.0       0.0  
25%         27.3                   0.24375       24.0       0.0  
50%         32.0                    0.3725       29.0       0.0  
75%         36.6                   0.62625       41.0       1.0  
max         67.1                      2.42       81.0       1.0  


In [6]:
# 4 Blood pressure: The data contain 0 values for blood pressure, but a living person cannot have
# a diastolic blood pressure of zero. How many records have BloodPressure == 0?
diabetes[diabetes.BloodPressure == 0]['BloodPressure'].count()

35

In [7]:
# 5 Plasma glucose level: Even after fasting, the glucose level would not be as low as zero, so zero is an invalid reading.
# How many records have Glucose == 0?
diabetes[diabetes.Glucose == 0]['Glucose'].count()

5

In [8]:
# 6 BMI: Should not be 0 or close to zero unless the person is severely underweight, which could be life-threatening.
# How many records have BMI == 0?
diabetes[diabetes.BMI == 0]['BMI'].count()

11
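
The three zero counts above can also be obtained in a single step. The following is a minimal sketch, assuming a small made-up DataFrame `toy` (hypothetical values, not rows from the dataset) in place of `diabetes`:

```python
import pandas as pd

# Hypothetical toy data standing in for the diabetes DataFrame.
toy = pd.DataFrame({
    'Glucose':       [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 0],
    'BMI':           [33.6, 26.6, 0.0, 28.1],
})

# Comparing with 0 yields a boolean frame; summing counts the True values per column.
zero_counts = (toy[['Glucose', 'BloodPressure', 'BMI']] == 0).sum()
print(zero_counts)
```

Applied to the real data, the same expression on `diabetes` should reproduce the counts found in cells 6 to 8.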

In [9]:
# 7 Remove the rows which have zero for 'BloodPressure', 'Glucose' or 'BMI'
diabetes = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]

In [10]:
# 8 Give the new dimensions of diabetes
diabetes.shape

(724, 9)
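
Because a zero in these columns really means "not measured", another way to express the same filter is to mark the zeros as missing and keep only the complete rows. This is a sketch of that alternative, not the approach used in the cell above; `toy` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the diabetes DataFrame.
toy = pd.DataFrame({
    'Glucose':       [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 66],
    'BMI':           [33.6, 26.6, 23.3, 28.1],
    'Insulin':       [0, 0, 0, 94],  # zeros here are left alone, as in the exercise
})

cols = ['Glucose', 'BloodPressure', 'BMI']

# Treat zeros in the selected columns as missing values.
marked = toy[cols].replace(0, np.nan)

# Keep only the rows in which none of the selected columns is missing.
cleaned = toy[marked.notna().all(axis=1)]
print(cleaned.shape)
```

Marking the zeros as `NaN` first also leaves the door open to imputing them later instead of dropping the rows.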

In [11]:
# 9 Use an ensemble method to try to predict the value for Outcome
from sklearn.model_selection import train_test_split
X = diabetes.drop('Outcome',axis=1)
y = diabetes['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

lr = LogisticRegression(solver='newton-cg')
# Four random forests that differ only in the number of trees
rf250 = RandomForestClassifier(n_estimators=250)
rf300 = RandomForestClassifier(n_estimators=300)
rf350 = RandomForestClassifier(n_estimators=350)
rf400 = RandomForestClassifier(n_estimators=400)
gnb = GaussianNB()

model = VotingClassifier(estimators=[('lr', lr), ('rf250', rf250), ('rf300', rf300), ('rf350', rf350),
                                     ('rf400', rf400), ('gnb', gnb)], voting='soft')
model.fit(X_train, y_train)

y_test2 = model.predict(X_test)
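
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below illustrates this on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the diabetes set) with a smaller, hypothetical ensemble named `voter`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; random_state is fixed so the run is repeatable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
                ('gnb', GaussianNB())],
    voting='soft')
voter.fit(X, y)

# Soft voting by hand: average the members' class probabilities,
# then take the most probable class for each sample.
mean_proba = np.mean([est.predict_proba(X) for est in voter.estimators_], axis=0)
manual_pred = voter.classes_[np.argmax(mean_proba, axis=1)]

# True when the hand-computed vote matches the ensemble's own prediction.
print(np.array_equal(manual_pred, voter.predict(X)))
```

Soft voting requires every member to implement `predict_proba`, which holds for the three classifier types used here.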

In [12]:
# 10 Calculate the accuracy and the precision. For calculating the precision you can either 
# program the formula yourself (as we have done before) or you can use the function precision_score
# from sklearn.metrics
from sklearn.metrics import accuracy_score,precision_score
print('Accuracy = %4.1f %%' % (accuracy_score(y_test, y_test2)*100))
print('Precision = %4.1f %%' % (precision_score(y_test, y_test2)*100))

Accuracy = 78.9 %
Precision = 71.7 %
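
For the option of programming the formulas yourself: accuracy is (TP + TN) / (TP + TN + FP + FN) and precision is TP / (TP + FP). A minimal sketch with made-up labels (hypothetical values, not the predictions of the model above):

```python
# Hypothetical labels, chosen only to illustrate the formulas.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)

print('Accuracy = %4.1f %%' % (accuracy * 100))
print('Precision = %4.1f %%' % (precision * 100))
```

Precision only looks at the cases predicted as positive, so it can differ noticeably from accuracy when the classes are imbalanced, as they are in this dataset (268 positives out of 768).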


In [15]:
# 11 The model can be used to determine whether a person might develop diabetes or not.
# Write a function that takes values for all features to predict this.
def risk_patient(model, Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age):
    # Build a one-row DataFrame whose column names and order match the training features
    patient = pd.DataFrame([{'Pregnancies': Pregnancies, 'Glucose': Glucose, 'BloodPressure': BloodPressure,
                             'SkinThickness': SkinThickness, 'Insulin': Insulin, 'BMI': BMI,
                             'DiabetesPedigreeFunction': DiabetesPedigreeFunction, 'Age': Age}])
    return model.predict(patient)

print(risk_patient(model,0,50,60,50,500,33,0.5,50))
    
print(risk_patient(model,4,171,100,50,500,40,2,70))  

[0]
[1]
