Problem Statement

The objective is to predict whether or not a patient has diabetes, based on
the diagnostic measurements included in the dataset. All patients are females
of at least 21 years of age and of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target
variable, Outcome. Predictor variables include the number of pregnancies the
patients have had, their BMI, insulin level, age, and so on.

In [1]:
# Importing required libraries
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

In [2]:
# Load the dataset and extract the column values as an array.
dataframe = pd.read_csv('pima-indians-diabetes.csv')
array = dataframe.values
X = array[:, 0:8]  # first eight columns: predictor variables
Y = array[:, 8]    # ninth column: target variable (Outcome)
# Set the random seed to 7 and the number of trees to 30.
seed = 7
num_trees = 30

In [6]:
# Build classifiers using AdaBoost and XGBoost, starting with AdaBoost.
# AdaBoost uses a decision tree as its default base estimator.
# KFold shuffles the samples (with the fixed seed) and splits them into 10 folds.
# cross_val_score evaluates the model so that each fold serves once as the
# validation set while the remaining nine folds form the training set.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees)
results = model_selection.cross_val_score(model,X,Y,cv=kfold)
print(results.mean())

0.7552802460697198


AdaBoost gives a mean cross-validated accuracy of about 75.5%.
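
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how a shuffled k-fold split partitions sample indices. The helper `kfold_indices` is hypothetical and written only for illustration; it is not scikit-learn's implementation, and its shuffle order will differ from `KFold`'s for the same seed.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, val_idx) pairs (hypothetical helper, not scikit-learn code)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Illustrative sizes only; these are not the dimensions of the diabetes dataset.
folds = list(kfold_indices(n_samples=23, n_splits=5, seed=7))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: {len(train_idx)} train, {len(val_idx)} validation")
```

Every sample lands in exactly one validation fold, so each one is used for validation once and for training in all the other folds.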

In [8]:
# Similarly, apply the XGBoost algorithm using XGBClassifier from the xgboost package.
from xgboost import XGBClassifier

seed = 7
num_trees = 30

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = XGBClassifier(n_estimators=num_trees)
results = model_selection.cross_val_score(model,X,Y,cv=kfold)
print(results.mean())

0.7382433356117566


XGBoost gives a mean accuracy of about 74%, slightly lower than AdaBoost's 75.5%.
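
A mean over ten folds hides how much the scores vary from fold to fold, so a gap of under two percentage points may fall within that spread. The following is a minimal sketch of reporting the spread alongside the mean, assuming scikit-learn is installed. The data is synthetic (generated with `make_classification`), so the printed numbers say nothing about the diabetes dataset; the names `X_demo` and `y_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 samples, 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = AdaBoostClassifier(n_estimators=30, random_state=7)

# cross_val_score returns one accuracy per fold.
scores = cross_val_score(model, X_demo, y_demo, cv=kfold)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running the same report for both classifiers with the same `kfold` object shows whether the difference between their means is large relative to the fold-to-fold variation.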