In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('diabetes.csv')

In [3]:
df.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [12]:
X=df.drop('Outcome',axis=1)
y=df['Outcome']

## Create training and test datasets

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=0.2,stratify=y)



Here stratify is set to y so that the proportion of each class stays the same in both the training and the test data. For example, if 60% of the samples belong to class 1 and 40% to class 0, the training split and the test split both keep that 60/40 distribution.
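As a quick check of what stratification does, the sketch below splits a synthetic label vector and compares the share of class 1 before and after the split. The 500/268 class counts and the names X_demo and y_demo are assumptions chosen for illustration; they are not read from diabetes.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed class counts for illustration only (not read from diabetes.csv).
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of class 1 in the full data, the training split, and the test split.
frac_all, frac_train, frac_test = y_demo.mean(), y_tr.mean(), y_te.mean()
print(f"all={frac_all:.3f} train={frac_train:.3f} test={frac_test:.3f}")
```

Without stratify, the three fractions could drift apart, which matters most when one class is much rarer than the other.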

In [15]:
X_train.shape

(614, 8)

In [16]:
X_test.shape

(154, 8)

In [17]:
y_train.shape

(614,)

In [18]:
y_test.shape

(154,)

## AdaBoost Classifier

In [21]:
from sklearn.ensemble import AdaBoostClassifier

In [22]:
ada=AdaBoostClassifier(random_state=42)

In [23]:
##train the model

ada.fit(X_train,y_train)

In [25]:
## test and predict

y_pred=ada.predict(X_test)

In [26]:
### Calculate score on training data
ada.score(X_train,y_train)

0.8420195439739414

In [29]:
### Calculate score on test data
ada.score(X_test,y_test)

0.7597402597402597

In [30]:
from sklearn.metrics import classification_report

In [31]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.81      0.81       100
           1       0.65      0.67      0.66        54

    accuracy                           0.76       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.76      0.76       154



## Hyperparameter Tuning

base_estimator: The type of weak learner to boost. It can be any classifier whose fit method accepts sample weights, such as a decision tree, logistic regression, or SVC. It cannot be KNN, because KNeighborsClassifier does not accept sample weights. By default, the base estimator is DecisionTreeClassifier(max_depth=1). In scikit-learn 1.2 and later this parameter is named estimator.
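One way to check whether a candidate base learner accepts per-sample weights is scikit-learn's has_fit_parameter helper. The sketch below is illustrative only: the candidates dictionary is a hypothetical name, and the estimators in it are simply the ones mentioned above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

# Hypothetical shortlist of base learners to inspect.
candidates = {
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=1),
    "LogisticRegression": LogisticRegression(),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# True when the estimator's fit() accepts a sample_weight argument.
supports_weights = {
    name: has_fit_parameter(est, "sample_weight")
    for name, est in candidates.items()
}
for name, ok in supports_weights.items():
    print(f"{name}: {ok}")
```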

n_estimators: The maximum number of weak learners to train. Boosting stops early if a perfect fit is reached. By default, n_estimators is 50.

learning_rate: The weight applied to each classifier at every boosting iteration; a smaller value shrinks the contribution of each classifier. There is a trade-off between learning_rate and n_estimators. By default, it is 1.0.


algorithm: Either SAMME or SAMME.R. SAMME.R uses the predicted class probabilities to update the additive model, while SAMME uses only the predicted class labels. In the scikit-learn documentation's comparison of the two algorithms, SAMME.R typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations. The two also differ in when boosting stops: in such a comparison SAMME.R breaks off once the estimator error goes above 1/2, whereas with SAMME on a multi-class problem the estimator weight can remain positive even when the error is equal to or greater than 1/2, so the misclassified training samples keep gaining weight and the test error can keep decreasing even after 600 iterations. Recent scikit-learn releases (1.4 and later) deprecate SAMME.R, so check the documentation for the installed version before including it in a grid.
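The remark about SAMME's estimator weight staying positive can be made concrete with the SAMME weight formula: learning_rate * (ln((1 - err) / err) + ln(K - 1)) for K classes. The function below is a standalone sketch of that formula, not a call into scikit-learn, and samme_weight is a hypothetical helper name.

```python
import math


def samme_weight(err, n_classes, learning_rate=1.0):
    """Estimator weight under SAMME for a weighted training error `err`."""
    return learning_rate * (
        math.log((1.0 - err) / err) + math.log(n_classes - 1.0)
    )


# With two classes the weight drops to zero at err = 0.5 ...
w_binary = samme_weight(0.5, n_classes=2)
# ... but with three classes an error of 0.55 still earns a positive weight,
# because the break-even point moves to err = 1 - 1/K = 2/3.
w_three = samme_weight(0.55, n_classes=3)
w_break_even = samme_weight(2.0 / 3.0, n_classes=3)

print(w_binary, w_three, w_break_even)
```

The diabetes problem in this notebook is binary (K = 2), so here the break-even error is exactly 1/2; the positive-weight behaviour above 1/2 only appears with three or more classes.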

In [32]:
from sklearn.model_selection import GridSearchCV

In [38]:
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(random_state=42)

In [47]:
parameters = {
    'n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 20],
    'learning_rate': [(0.97 + x / 100) for x in range(0, 8)],
    'algorithm': ['SAMME', 'SAMME.R']
}
clf = GridSearchCV(ada, parameters, cv=5, verbose=1, n_jobs=-1)
clf.fit(X_train, y_train)


Fitting 5 folds for each of 192 candidates, totalling 960 fits


In [48]:
clf.best_estimator_
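Besides best_estimator_, a fitted GridSearchCV also exposes best_params_ and best_score_, and with the default refit=True the best estimator has already been retrained on the whole training set. The sketch below illustrates this on synthetic data from make_classification; the data, the deliberately small grid, and the name search are assumptions for illustration, not the notebook's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, random_state=42
)

# Deliberately tiny grid so the example runs quickly.
param_grid = {"n_estimators": [5, 10], "learning_rate": [0.5, 1.0]}
search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning hyperparameter combination
print(search.best_score_)   # its mean cross-validated accuracy

# With refit=True (the default), best_estimator_ is already fitted,
# so it can predict without a separate fit() call.
preds = search.best_estimator_.predict(X_demo)
```

Reading the settings from best_params_ avoids copying them by hand into a new AdaBoostClassifier.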

In [49]:
ada=AdaBoostClassifier(algorithm='SAMME', n_estimators=12, random_state=42)

In [50]:
ada.fit(X_train,y_train)

In [51]:
## test and predict

y_pred=ada.predict(X_test)

In [52]:
### Calculate score on training data
ada.score(X_train,y_train)

0.7964169381107492

In [53]:
### Calculate score on test data
ada.score(X_test,y_test)

0.7727272727272727

In [54]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.87      0.83       100
           1       0.71      0.59      0.65        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.74       154
weighted avg       0.77      0.77      0.77       154
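The two classification reports can be compared more directly through their confusion matrices. The counts below are not notebook output; they are reconstructed from the reported recall and support values (for example, recall 0.81 on 100 class-0 samples implies 81 correct), so treat them as an inference. The helper report_from_counts is a hypothetical name.

```python
def report_from_counts(tn, fp, fn, tp):
    """Return accuracy plus precision and recall for class 1 from binary counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }


# Counts inferred from the two reports (assumption, not notebook output).
default_model = report_from_counts(tn=81, fp=19, fn=18, tp=36)
tuned_model = report_from_counts(tn=87, fp=13, fn=22, tp=32)

for name, rep in [("default", default_model), ("tuned", tuned_model)]:
    print(name, {k: round(v, 2) for k, v in rep.items()})
```

On these counts the tuned model gains test accuracy (0.76 to 0.77) and precision on class 1 (0.65 to 0.71) but loses recall on class 1 (0.67 to 0.59).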

