#### Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Column details are as follows:
- Pregnancies : Number of times pregnant
- Glucose : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure : Diastolic blood pressure (mm Hg)
- SkinThickness : Triceps skin fold thickness (mm)
- Insulin : 2-Hour serum insulin (mu U/ml)
- BMI : Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction : Diabetes pedigree function
- Age : Age (years)
- Outcome : Class variable (0 or 1); 268 of 768 are 1, the others are 0

The target (y) variable is therefore Outcome.

In [1]:
#Importing necessary Libraries : 
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection
from sklearn import metrics
from sklearn.decomposition import PCA
from scipy.stats import zscore
import matplotlib.pyplot as plt 

In [3]:
colnames = ['preg', 'glu', 'bp', 'sft', 'ins', 'bmi', 'dpf', 'age', 'outcome']
prima_df = pd.read_csv("prima-indians-diabetes.data",names=colnames)

In [4]:
X=prima_df[['preg', 'glu', 'bp', 'sft', 'ins', 'bmi', 'dpf', 'age']]
Y=prima_df['outcome']

In [6]:
# Standardizing the x-variables (zero mean, unit variance):
sc=StandardScaler()
X=sc.fit_transform(X)
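
The cell above fits the scaler on the full dataset before any cross-validation split, so each held-out fold contributes to the mean and variance used for scaling. One common alternative is to place the scaler in a pipeline so that it is re-fit on the training part of every fold. The sketch below illustrates this under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification` (not the diabetes data), and `pipe` and `kf_demo` are hypothetical names introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (768 rows, 8 features).
# This is an assumption for illustration, not the real data.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=2
)

# The pipeline re-fits StandardScaler on the training rows of each fold,
# so the held-out rows never influence the scaling parameters.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=2))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo, scoring="f1")
print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```

For a dataset of this size the effect on the scores is usually small, but the pipeline form keeps the evaluation free of information from the test folds.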

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
base_knn=KNeighborsClassifier(n_neighbors=7,weights='distance')
base_nb=GaussianNB()
base_LR=LogisticRegression(random_state=2)
base_rf=RandomForestClassifier(n_estimators=101,random_state=2)
gb_model=GradientBoostingClassifier(n_estimators=50,random_state=2)

In [8]:
bag_knn=BaggingClassifier(base_estimator=base_knn,n_estimators=17,random_state=2)

In [9]:
bag_LR=BaggingClassifier(base_estimator=base_LR,n_estimators=15,random_state=2)
boost_LR = AdaBoostClassifier(base_estimator=base_LR,n_estimators=50,random_state=2)

In [10]:
bag_nb=BaggingClassifier(base_estimator=base_nb,n_estimators=15,random_state=2)
boost_nb = AdaBoostClassifier(base_estimator=base_nb,n_estimators=51,random_state=2)

In [11]:
boost_rf=AdaBoostClassifier(base_estimator=base_rf,n_estimators=50,random_state=2)

In [12]:
bag_dt=BaggingClassifier(n_estimators=15,random_state=2)
boost_dt = AdaBoostClassifier(n_estimators=50,random_state=2)

In [13]:
stacked = VotingClassifier(estimators = [('Boosted_LR',boost_LR),('RF', base_rf), ('Boosted_DT', boost_dt)],voting='soft')

In [14]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Here we apply ensemble techniques. The following models are built:
- Boosted Logistic Regression - n_estimators=50
- Random Forest - n_estimators=101
- Boosted Decision Tree - n_estimators=50
- Gradient Boosting - n_estimators=50
- Stacked Model - This is not a single model but 3 models combined into one by soft voting (a VotingClassifier with voting='soft'): Boosted Logistic Regression, Random Forest and Boosted Decision Tree. Combining models this way lets us use multiple models as one and check their joint performance.

All of the above models are evaluated with 5-fold cross-validation.
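
To make the combined model concrete: the `stacked` object defined earlier is a `VotingClassifier` with `voting='soft'`, which averages the class probabilities of its members and predicts the class with the highest average. The sketch below checks this by hand. It assumes synthetic data from `make_classification` and two simple members (`lr`, `rf`) chosen only to keep the example fast; they are not the notebook's exact models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration, not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2)

lr = LogisticRegression(random_state=2)
rf = RandomForestClassifier(n_estimators=25, random_state=2)
voter = VotingClassifier(estimators=[("LR", lr), ("RF", rf)], voting="soft")
voter.fit(X_demo, y_demo)

# Soft voting by hand: average the members' predicted probabilities,
# then pick the class with the largest averaged probability.
member_probas = [est.predict_proba(X_demo) for est in voter.estimators_]
avg_proba = np.mean(member_probas, axis=0)
manual_pred = voter.classes_[np.argmax(avg_proba, axis=1)]

same = np.array_equal(manual_pred, voter.predict(X_demo))
print("Manual soft vote matches VotingClassifier:", same)
```

This differs from stacking in the stricter sense, where a second-level model is trained on the members' predictions; scikit-learn provides `StackingClassifier` for that.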


In [15]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=5,shuffle=True,random_state=2)
for model, name in zip([boost_LR,base_rf,boost_dt,gb_model,stacked], ['BoostLR','RF','BoostedDT','GradientBoost','stacked']):
    k=0  # fold counter
    # Row 0 = Healthy (class 0), row 1 = Diabetic (class 1); one column per fold
    recall=np.zeros((2,5))
    prec=np.zeros((2,5))
    for train,test in kf.split(X,Y):
        Xtrain,Xtest=X[train,:],X[test,:]
        Ytrain,Ytest=Y[train],Y[test]
        model.fit(Xtrain,Ytrain)
        Y_predict=model.predict(Xtest)
        # Confusion matrix: rows are true classes, columns are predicted classes
        cm=metrics.confusion_matrix(Ytest,Y_predict)
        for i in np.arange(0,2):
            recall[i,k]=cm[i,i]/cm[i,:].sum()  # correct / all true members of class i
            prec[i,k]=cm[i,i]/cm[:,i].sum()    # correct / all predictions of class i
        k=k+1
    # Per-class F1 for every fold: harmonic mean of precision and recall
    fscore=2*(recall*prec)/(recall+prec)
    # The printed values are per-class F1 scores (not a weighted average);
    # the figure after +/- is the sample variance across folds, not the standard deviation
    print("f1_weighted for Healthy: %0.02f (+/- %0.5f) [%s]" % (np.mean(fscore[0,:]), np.var(fscore[0,:],ddof=1), name ))
    print("f1_weighted for Diabetic: %0.02f (+/- %0.5f) [%s]" % (np.mean(fscore[1,:]), np.var(fscore[1,:],ddof=1), name ))
    

f1_weighted for Healthy: 0.83 (+/- 0.00048) [BoostLR]
f1_weighted for Diabetic: 0.62 (+/- 0.00222) [BoostLR]
f1_weighted for Healthy: 0.81 (+/- 0.00029) [RF]
f1_weighted for Diabetic: 0.62 (+/- 0.00171) [RF]
f1_weighted for Healthy: 0.81 (+/- 0.00028) [BoostedDT]
f1_weighted for Diabetic: 0.60 (+/- 0.00683) [BoostedDT]
f1_weighted for Healthy: 0.81 (+/- 0.00046) [GradientBoost]
f1_weighted for Diabetic: 0.61 (+/- 0.00254) [GradientBoost]
f1_weighted for Healthy: 0.81 (+/- 0.00020) [stacked]
f1_weighted for Diabetic: 0.62 (+/- 0.00166) [stacked]
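
The loop above derives recall, precision and F1 for each class from the confusion matrix. As a sanity check on that arithmetic, the sketch below repeats the calculation on a small hand-made example and compares it with scikit-learn's `f1_score(average=None)`, which returns one F1 value per class. The arrays `y_true` and `y_pred` are invented toy labels, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (assumption for illustration): 5 healthy (0) and 3 diabetic (1).
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / all true members
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / all predictions
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred, average=None)
print("Manual per-class F1: ", f1_manual)
print("sklearn per-class F1:", f1_sklearn)
```

Note also that the figure printed after `+/-` in the output above is the sample variance across folds (`np.var(..., ddof=1)`). Taking its square root gives the standard deviation, for example roughly 0.022 for the BoostLR Healthy score.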


In [16]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_curve, auc
kf=KFold(n_splits=5,shuffle=True,random_state=2)
for model, name in zip([boost_LR,base_rf,boost_dt,gb_model,stacked], ['BoostLR','RF','BoostedDT','GradientBoost','stacked']):
    roc_auc=[]  # one AUC value per fold
    for train,test in kf.split(X,Y):
        Xtrain,Xtest=X[train,:],X[test,:]
        Ytrain,Ytest=Y[train],Y[test]
        model.fit(Xtrain,Ytrain)
        Y_predict=model.predict(Xtest)
        # Note: the ROC curve is built from hard 0/1 predictions, so it has a single
        # operating point and the resulting AUC equals the balanced accuracy
        fpr,tpr, _ = roc_curve(Ytest,Y_predict)
        roc_auc.append(auc(fpr, tpr))
    # The figure after +/- is the sample variance across folds, not the standard deviation
    print("AUC scores: %0.02f (+/- %0.5f) [%s]" % (np.mean(roc_auc), np.var(roc_auc,ddof=1), name ))
    

AUC scores: 0.72 (+/- 0.00086) [BoostLR]
AUC scores: 0.71 (+/- 0.00047) [RF]
AUC scores: 0.70 (+/- 0.00229) [BoostedDT]
AUC scores: 0.71 (+/- 0.00100) [GradientBoost]
AUC scores: 0.71 (+/- 0.00042) [stacked]
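
The AUC values above come from `roc_curve` applied to the hard 0/1 output of `model.predict`. With only two distinct score values the ROC curve has a single interior point, and the area under it equals the balanced accuracy (the mean of sensitivity and specificity). AUC is more commonly computed from continuous scores such as `predict_proba(...)[:, 1]`. The sketch below compares the two on synthetic data; the data, the single train/test split and the logistic regression model are assumptions made for illustration and do not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=2
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)
clf = LogisticRegression(random_state=2).fit(X_tr, y_tr)

# AUC from hard labels: identical to balanced accuracy.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

# AUC from predicted probabilities of the positive class: uses the full ranking.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", bal_acc)
print("AUC from probabilities:", auc_from_scores)
```

All five models in the notebook expose `predict_proba`, so the same substitution could be made inside the cross-validation loop if a ranking-based AUC is wanted.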


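#### The Boosted Logistic Regression model scores highest on Healthy F1 (0.83) and AUC (0.72) and ties for the best Diabetic F1 (0.62), although its margin over the other models is small (0.01 to 0.02). F1 and AUC scores are displayed above.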