# **Boosting Machine Learning Models**

## Boosting

In this module we will cover a powerful ensemble method called boosting. Boosted ensembles use weak learners as base models: simple models that tend to suffer from high bias and therefore underfit the data.

Boosting is a sequential learning technique in which each base model builds on the previous one. Each subsequent model aims to improve the performance of the final ensemble by attempting to correct the errors made at the previous stage.

There are two important decisions that need to be made to perform boosted ensembling:
- Sequential Fitting Method
- Aggregation Method

Two boosting algorithms that will be covered in detail in this module are **Adaptive Boosting** and **Gradient Boosting**.

While boosting can be applied to any base machine learning algorithm, we will demonstrate it with an extremely popular choice of base estimator, the decision tree. Recall that decision trees are a commonly used and powerful machine learning algorithm, in part because they are easy to interpret. Additionally, the training data requires very little manipulation (no need for standardization, removal of collinearity, etc.).

The major limitation of decision trees is that they tend to suffer from high variance and are therefore prone to overfitting. A tree can keep adding decisions until it has effectively memorized the training data, so it does not generalize well to unseen data. In the following exercises we will explore how to work past these limitations while using decision trees for boosting.
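
To make the variance point concrete, here is a minimal sketch comparing an unrestricted tree with a decision stump. It uses synthetic data from `make_classification` as a hypothetical stand-in (not the dataset used later in this module), and the names `X_demo`, `deep_tree` and `stump` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (hypothetical stand-in dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# An unrestricted tree keeps splitting until its leaves are pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A decision stump is limited to a single split
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

deep_train_acc = deep_tree.score(X_tr, y_tr)
deep_test_acc = deep_tree.score(X_te, y_te)
stump_train_acc = stump.score(X_tr, y_tr)
stump_test_acc = stump.score(X_te, y_te)

print(f"deep tree (depth {deep_tree.get_depth()}): train={deep_train_acc:.3f}, test={deep_test_acc:.3f}")
print(f"stump     (depth {stump.get_depth()}): train={stump_train_acc:.3f}, test={stump_test_acc:.3f}")
```

A large gap between train and test accuracy for the deep tree is the signature of high variance, while the stump's similar (and lower) train and test scores are the signature of high bias.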

<img src="bag_vs_boost.png" width="40%" height="40%">

## Adaptive Boosting

Adaptive Boosting (or AdaBoost) is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees.

For AdaBoost, the **Sequential Fitting Method** is accomplished by updating the weight attached to each of the training dataset observations as we proceed from one base model to the next. The **Aggregation Method** is a weighted sum of those base models where the model weight is dependent on the error of that particular estimator.

The training of an AdaBoost model is the process of determining the training dataset observation weights at each step as well as the final weight for each base model for aggregation.
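
The two decisions described above can be sketched from scratch. The following is a minimal illustration of the classic discrete AdaBoost update for labels in {-1, +1}, run on synthetic data; it is an assumption-laden sketch and not necessarily the exact variant scikit-learn runs internally. All names (`obs_weights`, `model_weights`, `ensemble_predict`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical); relabel {0, 1} -> {-1, +1} for the classic formulation
X_demo, y01 = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
y_demo = 2 * y01 - 1

n_rounds = 5
n = len(y_demo)
obs_weights = np.full(n, 1.0 / n)   # start with equal weight on every observation
stumps, model_weights = [], []

for _ in range(n_rounds):
    # Sequential fitting: each stump is trained on the current observation weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=obs_weights)
    pred = stump.predict(X_demo)

    # Weighted error of this stump (clipped to avoid division by zero)
    err = np.clip(obs_weights[pred != y_demo].sum(), 1e-10, 1 - 1e-10)
    # Model weight: the lower the error, the larger the say in the final vote
    alpha_m = 0.5 * np.log((1 - err) / err)

    # Increase the weight of misclassified points, decrease the rest, then renormalize
    obs_weights = obs_weights * np.exp(-alpha_m * y_demo * pred)
    obs_weights = obs_weights / obs_weights.sum()

    stumps.append(stump)
    model_weights.append(alpha_m)

def ensemble_predict(X):
    """Aggregation: sign of the weighted sum of the stump predictions."""
    scores = sum(a * s.predict(X) for a, s in zip(model_weights, stumps))
    return np.where(scores >= 0, 1, -1)

train_acc = np.mean(ensemble_predict(X_demo) == y_demo)
print("model weights:", np.round(model_weights, 3))
print(f"training accuracy of the 5-stump ensemble: {train_acc:.3f}")
```

Observations that a stump gets wrong receive a larger weight in the next round, which is what forces the next stump to focus on them.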

<img src="adaboost.png" width="40%" height="40%">

Let’s take this opportunity to implement AdaBoost on a real dataset and solve a classification problem.

We will be using a dataset from UCI’s Machine Learning Repository to predict drug usage (here, whether a respondent uses mushrooms) from a set of demographic and personality characteristics.

In [13]:
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

In [3]:
# import the data
drugs = pd.read_csv("drug_consumption.csv")
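# filter out unnecessary columns: ID, plus every column after SS except Mushrooms
drugs.drop("ID", inplace=True, axis=1)
drugs.drop(columns=drugs.columns[12:27], inplace=True)  # the 15 columns between SS and Mushrooms
drugs.drop(columns=drugs.columns[-3:], inplace=True)    # the 3 columns after Mushrooms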
drugs.head()

| | Age | Gender | Education | Country | Ethnicity | Nscore | Escore | Oscore | AScore | Cscore | Impulsive | SS | Mushrooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25-34 | M | Doctorate degree | UK | White | -0.67825 | 1.93886 | 1.43533 | 0.76096 | -0.14277 | -0.71126 | -0.21575 | CL0 |
| 1 | 35-44 | M | Professional certificate/ diploma | UK | White | -0.46725 | 0.80523 | -0.84732 | -1.6209 | -1.0145 | -1.37983 | 0.40148 | CL1 |
| 2 | 18-24 | F | Masters degree | UK | White | -0.14882 | -0.80615 | -0.01928 | 0.59042 | 0.58489 | -1.37983 | -1.18084 | CL0 |
| 3 | 35-44 | F | Doctorate degree | UK | White | 0.73545 | -1.6334 | -0.45174 | -0.30172 | 1.30612 | -0.21712 | -0.21575 | CL2 |
| 4 | 65+ | F | Left school at 18 years | Canada | White | -0.67825 | -0.30033 | -1.55521 | 2.03972 | 1.63088 | -1.37983 | -1.54858 | CL0 |


In [4]:
# map the categorical variables into a numeric representation
age_map = {"18-24": 0, "25-34": 1, "35-44": 2, "45-54": 3, "55-64": 4, "65+": 5}
educ_map = {"Left school before 16 years": 0, "Left school at 16 years": 1, "Left school at 17 years": 2, "Left school at 18 years": 3,
            "Some college or university, no certificate or degree": 4, "Professional certificate/ diploma": 5, "University degree": 6, 
            "Masters degree": 7, "Doctorate degree": 8}
user_map = {'CL0': 0, 'CL1': 0, 'CL2': 0, 'CL3': 1, 'CL4': 1, 'CL5': 1, 'CL6': 1}

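# apply the hand-written ordinal maps; values missing from a map are left unchanged
mappings = [age_map, educ_map, user_map]
for col, mapping in zip(['Age', 'Education', 'Mushrooms'], mappings):
    drugs[col] = drugs[col].map(lambda x: mapping.get(x, x))

# encode the remaining nominal columns as numeric codes (categories are sorted alphabetically)
enc = OrdinalEncoder()
encoded_cols = enc.fit_transform(drugs[['Gender', 'Country', 'Ethnicity']])
drugs[['Gender', 'Country', 'Ethnicity']] = encoded_cols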
drugs.head()

| | Age | Gender | Education | Country | Ethnicity | Nscore | Escore | Oscore | AScore | Cscore | Impulsive | SS | Mushrooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 8 | 5.0 | 6.0 | -0.67825 | 1.93886 | 1.43533 | 0.76096 | -0.14277 | -0.71126 | -0.21575 | 0 |
| 1 | 2 | 1.0 | 5 | 5.0 | 6.0 | -0.46725 | 0.80523 | -0.84732 | -1.6209 | -1.0145 | -1.37983 | 0.40148 | 0 |
| 2 | 0 | 0.0 | 7 | 5.0 | 6.0 | -0.14882 | -0.80615 | -0.01928 | 0.59042 | 0.58489 | -1.37983 | -1.18084 | 0 |
| 3 | 2 | 0.0 | 8 | 5.0 | 6.0 | 0.73545 | -1.6334 | -0.45174 | -0.30172 | 1.30612 | -0.21712 | -0.21575 | 0 |
| 4 | 5 | 0.0 | 3 | 1.0 | 6.0 | -0.67825 | -0.30033 | -1.55521 | 2.03972 | 1.63088 | -1.37983 | -1.54858 | 0 |
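
The hand-written dictionaries above fix the order of the ordinal categories explicitly. As an alternative sketch, `OrdinalEncoder` accepts a `categories` argument that does the same job; without it, the encoder orders categories by sorting them, which is fine for nominal columns but not for ordered ones. The toy frame and the names `toy`, `age_order` and `age_encoder` below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with a few age brackets (hypothetical data)
toy = pd.DataFrame({"Age": ["25-34", "65+", "18-24", "35-44"]})

# Spell out the intended order so that the integer codes respect it
age_order = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
age_encoder = OrdinalEncoder(categories=[age_order])

# fit_transform returns a 2D array with one column per input column
toy["Age_encoded"] = age_encoder.fit_transform(toy[["Age"]]).ravel()
print(toy)
```

Keeping the order inside the encoder means the same fitted object can later be reused on new data with `transform`.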


In [8]:
# split the data into 70% train and 30% test
y = drugs.Mushrooms
X = drugs.drop('Mushrooms', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, shuffle=True, stratify=y, random_state=0)

# create a base classifier in the form of a decision stump (a decision tree with two leaf nodes)
decision_stump = DecisionTreeClassifier(max_depth=1)
print(decision_stump.get_params())

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 1, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}
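
The `stratify=y` argument in the split above keeps the class proportions the same in both partitions, which matters when the positive class is the minority. A minimal sketch on hypothetical labels (the 77/23 class mix and the names `X_demo`, `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 770 negatives and 230 positives
y_demo = np.array([0] * 770 + [1] * 230)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, shuffle=True, stratify=y_demo, random_state=0
)

# With stratification, the share of positives is (almost) identical in both partitions
print(f"positive share overall: {y_demo.mean():.3f}")
print(f"positive share in train: {y_tr.mean():.3f}")
print(f"positive share in test:  {y_te.mean():.3f}")
```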


In [9]:
# create an adaboost classification model with the decision stump as the base classifier and 5 trees
# (in scikit-learn 1.2+ this parameter is named `estimator`; `base_estimator` was removed in 1.4)
ada_clf = AdaBoostClassifier(base_estimator=decision_stump, n_estimators=5)
print(ada_clf.get_params())

{'algorithm': 'SAMME.R', 'base_estimator__ccp_alpha': 0.0, 'base_estimator__class_weight': None, 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': 1, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_samples_leaf': 1, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__random_state': None, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 1.0, 'n_estimators': 5, 'random_state': None}


In [10]:
# fit the model to the training data
ada_clf.fit(X_train, y_train)
# make predictions on the test data
y_pred = ada_clf.predict(X_test)

# assess model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 score: {f1}')

Accuracy: 0.8091872791519434
Precision: 0.5948275862068966
Recall: 0.5307692307692308
F1 score: 0.5609756097560975
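
To see how each additional stump changes the ensemble, `staged_predict` yields the ensemble's predictions after every boosting stage. The sketch below uses synthetic data as a hypothetical stand-in and relies on the default base estimator (a depth-1 decision tree), so its numbers will not match the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=1000, n_features=12, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# No base estimator is passed: the default is a decision stump
ada_demo = AdaBoostClassifier(n_estimators=5, random_state=0).fit(X_tr, y_tr)

# One set of test predictions per boosting stage
staged_acc = [accuracy_score(y_te, stage_pred) for stage_pred in ada_demo.staged_predict(X_te)]
for i, acc in enumerate(staged_acc, start=1):
    print(f"after {i} stump(s): test accuracy = {acc:.3f}")

final_acc = accuracy_score(y_te, ada_demo.predict(X_te))
```

The last staged value coincides with the accuracy of the fully trained ensemble, so the list can be used to judge whether more or fewer estimators would have been enough.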


In [11]:
# view the confusion matrix
test_conf_matrix = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1, 0]), 
    index=['actual yes', 'actual no'], 
    columns=['predicted yes', 'predicted no']
)
print(f'Confusion Matrix:\n{test_conf_matrix.to_string()}')

Confusion Matrix:
            predicted yes  predicted no
actual yes             69            61
actual no              47           389
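
As a cross-check, the four metrics reported earlier can be recomputed by hand from the counts in this confusion matrix. The variable names below are illustrative.

```python
# Counts read off the confusion matrix above
tp, fn = 69, 61    # actual yes: predicted yes / predicted no
fp, tn = 47, 389   # actual no:  predicted yes / predicted no

accuracy_cm = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that are correct
precision_cm = tp / (tp + fp)                   # share of predicted users who are actual users
recall_cm = tp / (tp + fn)                      # share of actual users the model found
f1_cm = 2 * precision_cm * recall_cm / (precision_cm + recall_cm)  # harmonic mean of the two

print(f"Accuracy:  {accuracy_cm:.4f}")
print(f"Precision: {precision_cm:.4f}")
print(f"Recall:    {recall_cm:.4f}")
print(f"F1 score:  {f1_cm:.4f}")
```

The high accuracy is driven largely by the 389 correctly identified non-users; recall shows that only about half of the actual users are detected.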


## Gradient Boosting

Gradient Boosting is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees, in which case the ensemble is known as Gradient Boosted Trees.

For Gradient Boosting, the **Sequential Fitting Method** is accomplished by fitting each base model to the negative gradient of the loss function with respect to the previous stage's predictions (for squared-error loss, this is simply the residual error). The **Aggregation Method** is a weighted sum of those base models where the model weight is a constant, the learning rate.

The training of a Gradient Boosted model is the process of determining the base model errors at each step and using them to determine how best to formulate the subsequent base model.

<img src="gbm.png" width="40%" height="40%">

The walkthrough below focuses on a Gradient Boosted Trees model.

Our first step is to fit an estimator, the 1st Base Model. Recall that the base estimators for boosting algorithms tend to be simple and high bias. In contrast to our AdaBoost example, which used the simplest form of decision tree (the decision stump, with only 1 level), gradient boosted trees tend to include a few more decision branches. Gradient boosted trees often have up to 32 leaf nodes, which corresponds to a tree depth of 5 levels. In this walkthrough, we limit the depth of the base estimators to 2, corresponding to at most 4 leaf nodes. (The scikit-learn model fit later in this section keeps the default `max_depth` of 3.)

Once the 1st Base Model is trained, the residual errors (`h_1`) of the model on the training data are determined. The residual error is the difference between the actual and predicted values for each of the training data instances.

$$h_1=y_{actual} - y_{1(predicted)}$$

The magnitude of the errors will be greater for the training data instances where the model's predictions were poor and smaller for the instances where the model fit the data well.
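
The residual is also what ties this walkthrough back to the "negative gradient" description of the Sequential Fitting Method. As a short derivation, assume a squared-error loss (an assumption made here for illustration; the classifier fit later in this section reports `log_loss`) and let `F` denote the current prediction:

$$L(y_{actual}, F) = \tfrac{1}{2}\left(y_{actual} - F\right)^2 \quad\Rightarrow\quad -\left.\frac{\partial L}{\partial F}\right|_{F = y_{1(predicted)}} = y_{actual} - y_{1(predicted)} = h_1$$

Under this loss, fitting the next base model to the residuals is the same as fitting it to the negative gradient. Other loss functions give different "pseudo-residuals", but the procedure is otherwise unchanged.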

In the next stage of the sequential learning process, we fit the 2nd Base Model. Here is where the interesting part comes in. Instead of fitting the model to the target values `y_actual` as we are typically used to doing in machine learning, we actually fit the model on the errors of the previous stage, in this case `h_1`. The 2nd Base Model is literally learning from the mistakes of the 1st Base Model through those residuals that were calculated.

The results of the 2nd Base Model, which was tasked with fitting the errors of the 1st Base Model, are multiplied by a constant learning rate, `alpha`, and added to the results of the 1st Base Model to give a set of updated predictions, `y_2(predicted)`.

The residual errors of the 2nd stage are then calculated from the updated predictions:

$$h_2=y_{actual} - y_{2(predicted)}$$

The subsequent stages repeat the same steps. At stage `N`, the base model is fit on the errors calculated at the previous stage `h_(N-1)`. The new model that is fit is multiplied by the constant learning rate `alpha` and added to the predictions of the previous stage.
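
The stages described above can be sketched from scratch for a regression problem. This is a minimal illustration under stated assumptions (squared-error loss, synthetic sine-wave data, depth-2 regression trees, and a hypothetical learning rate of 0.5); it is not how scikit-learn implements gradient boosting internally, and all names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical): a noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_actual = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

alpha = 0.5      # constant learning rate (hypothetical value)
n_stages = 10    # total number of base models

# Stage 1: the first base model is fit directly on the targets
first_model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_actual)
y_pred = first_model.predict(X_demo)
mse_per_stage = [np.mean((y_actual - y_pred) ** 2)]

later_models = []
for _ in range(2, n_stages + 1):
    h = y_actual - y_pred   # residual errors of the previous stage
    # Stage N: the base model is fit on the residuals, not on the targets
    model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, h)
    # Scale by the learning rate and add to the running predictions
    y_pred = y_pred + alpha * model.predict(X_demo)
    later_models.append(model)
    mse_per_stage.append(np.mean((y_actual - y_pred) ** 2))

def boosted_predict(X):
    """Aggregate: first model plus alpha times every later model."""
    out = first_model.predict(X)
    for m in later_models:
        out = out + alpha * m.predict(X)
    return out

for stage, mse in enumerate(mse_per_stage, start=1):
    print(f"stage {stage:2d}: training MSE = {mse:.4f}")
```

The training error shrinks from stage to stage because each new tree only has to explain what the ensemble so far has left unexplained.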

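Training stops once we have reached the predefined number of estimators or, if early stopping is enabled, once the error stops improving between iterations, and we end up with the resultant ensemble model. In scikit-learn, early stopping is opt-in: it only applies when `n_iter_no_change` is set, and it monitors the loss on a held-out validation fraction of the training data.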
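
As a sketch of the second stopping condition, the example below enables early stopping in `GradientBoostingClassifier` through `n_iter_no_change`. The data is synthetic and the specific values (500 estimators, patience of 5) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=1000, n_features=12, n_informative=5, random_state=0)

gb_early = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=5,       # stop if the validation loss has not improved for 5 stages
    validation_fraction=0.1,  # share of the training data held out to monitor the loss
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=0,
)
gb_early.fit(X_demo, y_demo)

# n_estimators_ holds the number of trees actually kept after early stopping
print(f"trees requested: {gb_early.n_estimators}, trees fit: {gb_early.n_estimators_}")
```

With `n_iter_no_change=None` (the default), the model always fits exactly `n_estimators` trees.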

In [18]:
# create a gradient boosting classification model with 15 trees
gb_clf = GradientBoostingClassifier(n_estimators=15)
print(gb_clf.get_params())

{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'log_loss', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 15, 'n_iter_no_change': None, 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}


In [19]:
# fit the model to the training data
gb_clf.fit(X_train, y_train)
# make predictions on the test data
y_pred = gb_clf.predict(X_test)

# assess model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 score: {f1}')

Accuracy: 0.8021201413427562
Precision: 0.5957446808510638
Recall: 0.4307692307692308
F1 score: 0.5


In [20]:
# view the confusion matrix
test_conf_matrix = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1, 0]), 
    index=['actual yes', 'actual no'], 
    columns=['predicted yes', 'predicted no']
)
print(f'Confusion Matrix:\n{test_conf_matrix.to_string()}')

Confusion Matrix:
            predicted yes  predicted no
actual yes             56            74
actual no              38           398
