In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Boosting

Boosting is an ensemble method in which predictors are trained sequentially, each one trying to correct the errors made by its predecessor. More formally, boosting combines many weak learners into a strong learner. A weak learner is a model that does only slightly better than random guessing. For example, a decision tree with a maximum depth of one, known as a decision stump, is a weak learner.
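
To make the idea of a weak learner concrete, here is a minimal sketch of a decision stump fitted to a synthetic dataset. The data comes from `make_classification` and is purely illustrative (it is not the breast cancer data used later); the names `X_demo`, `y_demo` and `stump` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, used only for illustration (assumption: not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# A decision stump: a tree that is allowed a single split
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo)

# A weak learner only needs to beat random guessing (0.5 on a balanced binary problem)
stump_train_acc = stump.score(X_demo, y_demo)
print("Stump depth:", stump.get_depth())
print("Stump training accuracy: {:.3f}".format(stump_train_acc))
```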

### AdaBoost

AdaBoost stands for Adaptive Boosting. In AdaBoost, each predictor pays more attention to the instances wrongly predicted by its predecessor by constantly changing the weights of training instances. Furthermore, each predictor is assigned a coefficient alpha that weighs its contribution in the ensemble's final prediction. Alpha depends on the predictor's training error.

#### AdaBoost: Training

AdaBoost training involves N predictors in total.

First, predictor 1 is trained on the initial dataset (X, y), and its training error is determined. This error is used to compute alpha 1, predictor 1's coefficient. Alpha 1 is then used to determine the weights W(2) of the training instances for predictor 2.

Here, the incorrectly predicted instances acquire higher weights. When the weighted instances are used to train predictor 2, this predictor is forced to pay more attention to the incorrectly predicted instances. This process is repeated sequentially, until the N predictors forming the ensemble are trained.
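
The training loop described above can be sketched from scratch. This is a minimal illustration of the classic discrete AdaBoost update for labels in {-1, +1}, not the exact algorithm that scikit-learn runs internally. The helper names `adaboost_fit` and `adaboost_predict` and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost sketch for labels in {-1, +1}; returns (stumps, alphas)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # W(1): uniform instance weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)         # train on the weighted instances
        miss = stump.predict(X) != y
        # Weighted training error, clipped to avoid division by zero
        err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # predictor coefficient
        # Misclassified instances gain weight, correct ones lose weight
        w = w * np.exp(np.where(miss, alpha, -alpha))
        w /= w.sum()                             # renormalize to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Alpha-weighted vote of all stumps; ties are resolved towards +1."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(votes >= 0, 1, -1)

# Illustrative synthetic data (assumption: not the notebook's dataset)
X_toy, y01 = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.0, random_state=0,
)
y_toy = 2 * y01 - 1                              # map {0, 1} to {-1, +1}

stumps, alphas = adaboost_fit(X_toy, y_toy, n_rounds=10)
ensemble_acc = np.mean(adaboost_predict(stumps, alphas, X_toy) == y_toy)
print("Training accuracy of the 10-stump ensemble: {:.3f}".format(ensemble_acc))
```

Each round refits a stump on the same data but with different weights, which is what forces later predictors to concentrate on the instances that earlier ones got wrong.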

#### Learning Rate

An important training parameter is the learning rate, eta. It is a positive number, typically between 0 and 1, that shrinks the coefficient alpha of each trained predictor. There is a trade-off between the learning rate and the number of estimators: a smaller learning rate should be compensated by a larger number of estimators.
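
As a rough illustration of the shrinkage, assume the coefficient takes the form alpha = eta * ln((1 - err) / err), which is one common way to write it for two classes. The function name `shrunk_alpha` and the error value 0.2 are chosen for this example only.

```python
import numpy as np

def shrunk_alpha(err, eta):
    """Coefficient of a predictor with weighted error `err`, shrunk by learning rate `eta`.

    Assumes the form alpha = eta * ln((1 - err) / err).
    """
    return eta * np.log((1.0 - err) / err)

err = 0.2  # hypothetical weighted training error of one predictor
for eta in (1.0, 0.5, 0.1):
    alpha = shrunk_alpha(err, eta)
    boost = np.exp(alpha)  # factor by which a misclassified instance's weight grows
    print("eta={:.1f}  alpha={:.3f}  weight multiplier={:.3f}".format(eta, alpha, boost))
```

With eta = 1 a misclassified instance's weight is multiplied by 4 before renormalization, while eta = 0.5 only doubles it. Each predictor therefore changes the weights less, which is why more estimators are needed to reach the same fit.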

#### AdaBoost Classification in sklearn (Breast Cancer dataset)

Let's fit an AdaBoostClassifier to the breast cancer dataset and evaluate its ROC AUC score. First, load the dataset and encode the diagnosis as a binary label. Then, after importing AdaBoostClassifier, DecisionTreeClassifier, roc_auc_score, and train_test_split, split the data into 80% train and 20% test.

In [2]:
df = pd.read_csv("wbc.csv")
#df.head()

In [3]:
# Encode the diagnosis as a binary label: malignant ("M") -> 1, otherwise 0
df["labels"] = (df["diagnosis"] == "M").astype("int64")

In [4]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32,labels
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,,1
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,,1
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,,1
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,,1
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,,1


In [5]:
X = df.drop(['id', 'diagnosis','Unnamed: 32', "labels"], axis = 1)

In [6]:
y = df["labels"]
y.unique()

array([1, 0], dtype=int64)

In [7]:
# Import models and utility functions
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [8]:
# Set seed for reproducibility
SEED = 1
# Split data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=SEED)

In [9]:
# Instantiate a classification-tree 'dt'
dt = DecisionTreeClassifier(random_state=SEED, max_depth = 2)

In [10]:
# Instantiate an AdaBoost classifier 'adb_clf'
# (scikit-learn >= 1.2 renames 'base_estimator' to 'estimator')
adb_clf = AdaBoostClassifier(base_estimator=dt, random_state=1)

In [11]:
adb_clf.get_params()

{'algorithm': 'SAMME.R',
 'base_estimator__ccp_alpha': 0.0,
 'base_estimator__class_weight': None,
 'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': 2,
 'base_estimator__max_features': None,
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__min_impurity_decrease': 0.0,
 'base_estimator__min_impurity_split': None,
 'base_estimator__min_samples_leaf': 1,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__random_state': 1,
 'base_estimator__splitter': 'best',
 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1),
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': 1}

In [12]:
# Fit 'adb_clf' to the training set
adb_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   random_state=1)

In [13]:
pred = adb_clf.predict(X_test)

In [14]:
# Predict the test set probabilities of positive class
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]

In [15]:
# Evaluate test-set roc_auc_score
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
# Print adb_clf_roc_auc_score
print('ROC AUC score: {:.3f}'.format(adb_clf_roc_auc_score))

ROC AUC score: 0.988


In [16]:
# Print test set accuracy (in percent) of the untuned adb_clf
print("Accuracy Score: {:.3f}".format(accuracy_score(y_test, pred)*100))

Accuracy Score: 93.860


In [17]:
# error_rate = []

# for n in range(1,200):
#     model = AdaBoostClassifier(n_estimators=n)
#     model.fit(X_train, y_train)
#     prediction = model.predict(X_test)
    
#     err = 1 - accuracy_score(y_test, prediction)
#     error_rate.append(err)

In [18]:
# plt.figure(figsize=(10,5))
# plt.plot(range(1,200), error_rate)

In [19]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [20]:
# Define the grid of hyperparameters 'params_ad'
params_ad = {'n_estimators': list(range(80,200,10)) , "learning_rate": [0.04,0.05,0.06,0.07]}

In [21]:
# Instantiate a 10-fold CV grid search object 'grid_ad'
grid_ad = GridSearchCV(estimator=adb_clf, param_grid=params_ad, scoring='accuracy', cv=10, n_jobs=-1)

In [22]:
# Fit 'grid_ad' to the training data
grid_ad.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                                                random_state=1),
                                          random_state=1),
             n_jobs=-1,
             param_grid={'learning_rate': [0.04, 0.05, 0.06, 0.07],
                         'n_estimators': [80, 90, 100, 110, 120, 130, 140, 150,
                                          160, 170, 180, 190]},
             scoring='accuracy')

In [23]:
# Extract best hyperparameters from 'grid_ad'
best_hyperparams = grid_ad.best_params_
print('Best hyperparameters:\n', best_hyperparams)

# Extract best CV score from 'grid_ad'
best_CV_score = grid_ad.best_score_
print('Best CV accuracy: {:.3f}'.format(best_CV_score))

# Extract best model from 'grid_ad'
best_model = grid_ad.best_estimator_

# Evaluate test set accuracy
test_acc = best_model.score(X_test,y_test)

# Print test set accuracy of the untuned AdaBoost ('pred' still holds adb_clf's predictions)
print("Accuracy Score of Untuned AdaBoost:{:.3f}".format(accuracy_score(y_test, pred)*100))

# Print test set accuracy of the tuned AdaBoost
print("Accuracy Score of Tuned AdaBoost:{:.3f}".format(test_acc*100))

Best hyperparameters:
 {'learning_rate': 0.04, 'n_estimators': 170}
Best CV accuracy
Accuracy Score of Untuned AdaBoost:93.860
Accuracy Score of Tuned AdaBoost:97.368


In [24]:
from sklearn.metrics import confusion_matrix, f1_score, classification_report

In [25]:
pred = best_model.predict(X_test)
# Print the confusion matrix
cm = confusion_matrix(y_test, pred)
print('Confusion matrix:\n', cm)

# Print the F1 score
score = f1_score(y_test, pred)
print('F1-Score: {:.3f}'.format(score))

Confusion matrix:
 [[72  0]
 [ 3 39]]
F1-Score: 0.963


In [26]:
target_names=['B','M']
print(classification_report(y_test,pred,target_names=target_names))

              precision    recall  f1-score   support

           B       0.96      1.00      0.98        72
           M       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114
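
The figures in the report can be recomputed by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix (rows are true classes B and M, columns are predicted classes) and treats M as the positive class.

```python
import numpy as np

# Confusion matrix reported above: rows = true (B, M), columns = predicted (B, M)
cm = np.array([[72, 0],
               [3, 39]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()                      # 111 / 114
precision = tp / (tp + fp)                           # 39 / 39
recall = tp / (tp + fn)                              # 39 / 42
f1 = 2 * precision * recall / (precision + recall)   # 78 / 81

print("Accuracy:  {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall:    {:.3f}".format(recall))
print("F1-Score:  {:.3f}".format(f1))
```

The three false negatives (malignant cases predicted as benign) are the only errors, which is why precision for M is 1.00 while recall drops to 0.93.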



In [27]:
# Predict the test set probabilities of positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Evaluate test-set roc_auc_score
best_model_roc_auc_score = roc_auc_score(y_test, y_pred_proba)

# Print best_model_roc_auc_score
print('ROC AUC score: {:.3f}'.format(best_model_roc_auc_score))

ROC AUC score: 0.987
