# Bagging and Boosting

# 1. Introduction

### 1.1 Overview of Ensemble Learning:

- Ensemble learning is a powerful technique in machine learning where multiple models are combined to create a more robust and accurate predictive model.
- The idea is to leverage the collective intelligence of multiple models to make better predictions than any individual model could achieve alone.
- Ensemble learning draws inspiration from the wisdom of crowds, where diverse opinions lead to better decisions than relying on a single perspective.
- By aggregating the predictions from multiple models, ensemble learning aims to improve generalization and reduce the risk of overfitting.
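
The benefit of aggregation can be made concrete with a little arithmetic. The sketch below assumes an idealized setting that the text above does not spell out: every model is a binary classifier with the same accuracy, and the models make their errors independently of one another. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes n_models is odd (no ties) and that each model is right with
    probability p_correct, independently of the others.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the ensemble's accuracy grows with the number of voters. Real models have correlated errors, so the gain in practice is smaller.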

### 1.2 Importance of Ensemble Learning in Data Science:

- Data science often deals with complex and noisy datasets where building a single strong model may be challenging.
- Ensemble learning helps address the limitations of individual models by combining their strengths and compensating for their weaknesses.
- Ensemble methods can significantly boost the accuracy and performance of predictive models, making them invaluable in various real-world applications.
- Ensemble learning is widely used in competitions like Kaggle, where top-performing models often utilize sophisticated ensemble techniques.
- In practice, ensembles are often more reliable than a single model, because aggregation reduces dependence on any one model's errors and captures more diverse patterns.

### 1.3 What to Expect in This Session:

- In this session, we will delve into two essential ensemble learning techniques: bagging and boosting.
- We will understand the underlying principles of each technique, explore their strengths and weaknesses, and discover when to use them.
- We will also demonstrate coding examples using Python and scikit-learn to solidify your understanding and enable you to implement these techniques in your projects.
- By the end of this session, you will have a clear understanding of how bagging and boosting contribute to the success of predictive models and the importance of ensemble learning in data science.

# 2. Bagging (Bootstrap Aggregating)

### 2.1 Definition and Purpose of Bagging:

- Bagging, short for Bootstrap Aggregating, is an ensemble learning technique that aims to improve the accuracy and robustness of predictive models.
- The primary purpose of bagging is to reduce variance and combat overfitting, which are common challenges in machine learning.
- By combining predictions from multiple base learners, bagging reduces the risk of relying too heavily on any one model's idiosyncrasies.

### 2.2 How Bagging Works:

1. Bootstrapping and Creating Subsets:
- Bagging starts by creating multiple subsets of the original training data through bootstrapping.
- Bootstrapping involves randomly sampling data with replacement, resulting in diverse subsets with the same size as the original dataset.


2. Training Multiple Base Learners:
- Each subset is used to train an independent base learner (model), such as decision trees, SVMs, or neural networks.
- These base learners are trained independently, and their predictions are combined later to form the ensemble.


3. Aggregating Predictions:
- Once all base learners are trained, the final prediction is made by aggregating their individual predictions.
- For regression tasks, predictions are averaged, while for classification tasks, majority voting is used to decide the final prediction.
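
The bootstrapping and aggregation steps above can be sketched with NumPy alone. The arrays below are made-up toy values, not outputs of real models; they only illustrate sampling with replacement, averaging, and majority voting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: bootstrap. Draw n row indices with replacement from n rows.
n_rows = 10
boot_idx = rng.integers(0, n_rows, size=n_rows)
print("bootstrap indices:", boot_idx)
print("distinct rows drawn:", len(np.unique(boot_idx)), "of", n_rows)

# Step 3: aggregation. Rows are learners, columns are test samples.
regression_preds = np.array([[2.0, 3.0],
                             [4.0, 5.0],
                             [6.0, 1.0]])
averaged = regression_preds.mean(axis=0)  # average per test sample

class_preds = np.array([[0, 1],
                        [0, 2],
                        [1, 2]])
# Majority vote per test sample (ties go to the smallest label)
voted = np.array([np.bincount(col).argmax() for col in class_preds.T])

print("averaged:", averaged)
print("voted:", voted)
```

Because sampling is with replacement, some rows appear several times in a bootstrap sample and others not at all, which is what makes the subsets differ from one another.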


### 2.3 Advantages of Bagging:

1. Reducing Variance and Overfitting:
- Bagging's main advantage is its ability to reduce variance and combat overfitting.
- By combining predictions from diverse models, it results in a more balanced and stable ensemble model.


2. Robustness to Noisy Data and Outliers:
- Bagging is robust to noisy data and outliers, as it diminishes the impact of individual data points through aggregation.
- Outliers are less likely to affect the overall prediction due to the averaging or voting process.
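
The variance reduction can be written down explicitly. As a sketch, assume the ensemble averages $B$ base learners $\hat f_1, \dots, \hat f_B$ whose predictions at a point $x$ each have variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration and do not appear elsewhere in this session):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
  = \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as more learners are added, while the first term remains. This is why bagging helps most when the base learners are only weakly correlated.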


### 2.4 Common Algorithms Using Bagging:

1. Random Forest:
- Random Forest is one of the most popular bagging algorithms that uses decision trees as base learners.
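- It builds a large number of decision trees; each tree is trained on a bootstrapped subset of the data and considers only a random subset of the features at each split, which makes the trees less correlated with one another.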
- The final prediction is made by aggregating the predictions of all individual trees.

2. Bagged Decision Trees:
- Bagged Decision Trees, or simply Bagging with decision trees, is a straightforward bagging approach.
- It uses the same bootstrap-and-aggregate procedure as Random Forest, but each tree considers all features at every split, whereas Random Forest additionally restricts each split to a random subset of the features.

3. Bagged SVMs (Support Vector Machines):
- Bagging can also be applied to other base learners like Support Vector Machines (SVMs).
- Bagged SVMs utilize subsets of data for training multiple SVM models, and their predictions are aggregated to make the final prediction.
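
As a minimal sketch of bagging with different base learners, the example below wraps a decision tree and an SVM in scikit-learn's `BaggingClassifier`. The Iris dataset, the 70/30 split, and the choice of 10 estimators are assumptions made here to keep the example small and offline; they are not part of the session's main example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

base_learners = {
    "Bagged decision trees": DecisionTreeClassifier(random_state=0),
    "Bagged SVMs": SVC(),
}

scores = {}
for name, base in base_learners.items():
    # The base learner is passed positionally because its keyword name
    # differs between scikit-learn versions.
    bag = BaggingClassifier(base, n_estimators=10, random_state=0)
    bag.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, bag.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same wrapper works for any scikit-learn estimator, so swapping the base learner is a one-line change.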

### 2.5 Coding Example with Random Forest (Python and scikit-learn):

#### 2.5.1 Importing Libraries

In [28]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
import numpy as np

These are the necessary libraries for the following tasks:

- **pandas:** For data handling (e.g., creating DataFrames).
- **fetch_openml:** Fetches datasets from the OpenML repository.
- **DecisionTreeClassifier:** The decision tree classifier we'll use.
- **RandomForestClassifier:** The Random Forest ensemble method.
- **train_test_split:** Splits the data into training and testing sets.
- **accuracy_score:** Computes the accuracy of predictions.
- **resample:** Helps in bootstrapping samples.
- **numpy:** For numerical operations.

#### 2.5.2 Loading Data

In [29]:
# Load the covertype dataset from OpenML
covertype = fetch_openml('covertype', version=4)
X = covertype.data
y = covertype.target.astype('int') - 1  # Shift labels to range 0-6

This code fetches the **covertype** dataset from OpenML, and then extracts the data and target labels. The labels are shifted from 1-7 to 0-6 for ease of use.

#### 2.5.3 Data Subset (for speed)

In [30]:
# Select a subset of data for demonstration purposes (and for faster execution)
X = X[:20000]
y = y[:20000]

Given the large size of the **covertype** dataset, we take a subset of 20,000 samples for faster demonstration.

#### 2.5.4 Data Splitting

In [31]:
# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

This splits the data into training and testing sets. 30% of the data is reserved for testing.

#### 2.5.5 Results Storage Initialization

In [32]:
# Initialize list to store results of each model
results = []

An empty list to store accuracy results for each model.

#### 2.5.6 Simple Decision Tree Training and Prediction

In [33]:
# Train and test a simple decision tree classifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
results.append(['Simple Decision Tree', accuracy_score(y_test, tree_pred)])

This section trains a basic decision tree on the training data, makes predictions on the test data, and stores the accuracy in the results list.

#### 2.5.7 Random Forest Training and Prediction

In [34]:
# Train and test a Random Forest with 5 trees
rf_small = RandomForestClassifier(n_estimators=5, random_state=42)
rf_small.fit(X_train, y_train)
rf_small_pred = rf_small.predict(X_test)
results.append(['Random Forest (5 trees)', accuracy_score(y_test, rf_small_pred)])

Trains a Random Forest model with 5 trees (small for demonstration), makes predictions, and adds the accuracy to the results list.

#### 2.5.8 Manual Bagging Implementation

In [35]:
# Manual implementation of Bagging
n_learners = 5  # Number of learners (trees) for bagging
predictions = []  # List to store predictions of each learner

# Bootstrapping and training for each learner
for i in range(n_learners):
    boot_X, boot_y = resample(X_train, y_train, replace=True, n_samples=len(X_train))
    tree = DecisionTreeClassifier()
    tree.fit(boot_X, boot_y)
    predictions.append(tree.predict(X_test))

This manually implements bagging by bootstrapping the training data **n_learners** times, training a decision tree on each bootstrap sample, and storing its predictions.

#### 2.5.9 Mode Calculation for Bagging Predictions

In [36]:
# Function to compute mode for an array
def compute_mode(array):
    return np.bincount(array).argmax()

# Convert predictions list of arrays to a 2D numpy array
predictions_array = np.array(predictions)

# Calculate mode of predictions for bagging
mode_predictions = []
for column in predictions_array.T:  # Now we transpose the numpy array
    mode_predictions.append(compute_mode(column))
mode_predictions = np.array(mode_predictions)
results.append(['Manual Bagging (Mode)', accuracy_score(y_test, mode_predictions)])

For each sample in the test set, this computes the mode (most common prediction) across the **n_learners** decision trees and stores the results.

#### 2.5.10 Median Calculation for Bagging Predictions

In [37]:
# Calculate median of predictions for bagging
median_predictions = np.median(predictions, axis=0).astype(int)
results.append(['Manual Bagging with Median', accuracy_score(y_test, median_predictions)])

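This section demonstrates another way to aggregate predictions by using the median. Note that the median is only meaningful when the labels have a natural order; the cover types here are nominal categories, so the mode (majority vote) is the appropriate aggregator and the median is shown for comparison only. It then stores the results.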
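
The difference between the two aggregation rules is easiest to see on a single test sample. The votes below are hypothetical values chosen for illustration; they are not taken from the covertype run.

```python
import numpy as np

# Hypothetical votes from five learners for ONE test sample (nominal class ids)
votes = np.array([0, 0, 2, 5, 6])
mode_vote = np.bincount(votes).argmax()   # class chosen most often
median_vote = int(np.median(votes))       # middle value after sorting

# With an even number of learners the median can fall between two classes
even_votes = np.array([0, 0, 3, 3])
even_median = int(np.median(even_votes))  # 1.5 truncated to 1

print("mode:", mode_vote, "median:", median_vote, "even-case median:", even_median)
```

In the first case the median picks a class that only one learner voted for, and in the second it picks a class that no learner predicted at all. The two rules agree whenever most learners agree, which is why their accuracies can still end up close.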

#### 2.5.11 Convert Results to DataFrame and Display

In [38]:
# Convert the results list to a DataFrame and display it
df_results = pd.DataFrame(results, columns=['Model', 'Accuracy'])

# Sort the DataFrame based on the 'Accuracy' column in descending order
df_results = df_results.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

# Display the sorted DataFrame
df_results.head()

|   | Model | Accuracy |
|---|-------|----------|
| 0 | Manual Bagging (Mode) | 0.841 |
| 1 | Manual Bagging with Median | 0.835667 |
| 2 | Random Forest (5 trees) | 0.827667 |
| 3 | Simple Decision Tree | 0.815167 |


This converts the **results** list into a pandas DataFrame and displays it for easy comparison of model accuracies.

### 2.6 Exercise

1. Basic Understanding:

- Question: What is the main principle behind the Bagging technique? How does it help in reducing overfitting?
- Answer: Bagging, or Bootstrap Aggregating, involves training multiple models independently on different bootstrapped subsets of the data and then aggregating their predictions. By averaging out the errors, bagging helps in reducing the variance and thus overfitting.

2. Practical Implementation:

- Exercise: Implement bagging from scratch using Decision Trees as the base learner on the Iris dataset. Compare its performance with a single Decision Tree.
- Hint: Remember to bootstrap samples for each tree and average the predictions for regression or take a mode for classification.

3. Deep Dive:

- Question: How does out-of-bag (OOB) error estimation work in Bagging?
- Answer: Each bootstrapped subset contains, on average, about 63.2% of the unique samples of the dataset. The remaining 36.8% that are not sampled are called out-of-bag samples. Each base learner (e.g., tree) in bagging can be validated using its OOB samples, which gives an approximately unbiased error estimate without a separate validation set. The OOB error is the average error over the training samples, where each sample is predicted using only the trees that did not have it in their bootstrap sample.

4. Advanced Topic:

- Exercise: Study Random Forests. How do they extend the basic idea of bagging? Implement a Random Forest using scikit-learn on the Wine dataset and tweak its hyperparameters.
- Hint: Look into parameters like max_features, max_depth, and min_samples_split.
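
The 63.2% figure in the out-of-bag answer above can be checked with a short simulation. This is a sketch using only the standard library; the sample size of 1000 and the 200 repetitions are arbitrary choices, and `unique_fraction` is a hypothetical helper name.

```python
import math
import random


def unique_fraction(n_samples: int, rng: random.Random) -> float:
    """Fraction of distinct rows in one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples


rng = random.Random(0)
n_samples = 1000
fractions = [unique_fraction(n_samples, rng) for _ in range(200)]
mean_fraction = sum(fractions) / len(fractions)

# Probability that a given row appears at least once in n draws with replacement
expected = 1 - (1 - 1 / n_samples) ** n_samples

print(f"simulated in-bag fraction: {mean_fraction:.3f}")
print(f"1 - (1 - 1/n)^n:           {expected:.3f}")
print(f"limit 1 - 1/e:             {1 - math.exp(-1):.3f}")
```

A given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e (about 36.8%) as n grows.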


# 3. Boosting

### 3.1 Definition and Purpose of Boosting:

- Boosting is an ensemble learning technique that focuses on creating a strong learner by combining multiple weak learners iteratively.
- The main purpose of boosting is to improve the accuracy and performance of predictive models by giving more emphasis to misclassified instances during training.
- Boosting learns from its mistakes in an adaptive manner, continually refining its predictions to achieve high accuracy on complex datasets.

### 3.2 How Boosting Works:

1. Sequential Learning and Weighted Misclassifications:
- Boosting works in a sequential manner, where each base learner is trained based on the performance of the previous ones.
- During training, it assigns higher weights to misclassified instances, making them more influential in subsequent iterations.


2. Iterative Training of Base Learners:
- In each iteration, a new base learner is trained to correct the mistakes made by the ensemble so far.
- The base learners are typically weak models (e.g., shallow decision trees), which keeps each individual learner simple and helps limit overfitting.


3. Emphasizing Difficult Instances:
- Boosting focuses on challenging instances that are frequently misclassified by previous base learners.
- By repeatedly emphasizing these difficult instances, boosting ensures that the ensemble model pays more attention to them and gradually improves its performance.
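
One round of reweighting can be worked through by hand. The sketch below follows the classical binary AdaBoost update with labels in {-1, +1}; the five labels and predictions are made-up values, and library implementations (including scikit-learn's multi-class variant) differ in their details.

```python
import math

# Toy binary problem with labels in {-1, +1}; values are made up for illustration
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]  # the weak learner gets the last sample wrong

weights = [1 / len(y_true)] * len(y_true)  # start with uniform weights

# Weighted error of the weak learner
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Learner weight: larger when the weighted error is smaller
alpha = 0.5 * math.log((1 - error) / error)

# Increase weights of misclassified samples, decrease the rest, then normalize
unnormalized = [
    w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)
]
total = sum(unnormalized)
new_weights = [w / total for w in unnormalized]

print("error:", error)
print("alpha:", round(alpha, 4))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.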

### 3.3 Advantages of Boosting:

1. Adaptive Learning and High Accuracy:
- Boosting's adaptive learning approach allows it to learn from misclassifications and significantly improve predictive accuracy.
- It is particularly effective in handling complex relationships in data, making it suitable for various real-world applications.


2. Model Versatility and Feature Importance:
- Boosting can be applied with various base learners, such as decision trees, SVMs, and neural networks.
- Additionally, many boosting algorithms provide feature importance scores, enabling us to identify the most influential features in the model's decision-making process.
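
As a brief illustration of feature importance scores, the sketch below fits scikit-learn's `GradientBoostingClassifier` and ranks the features by its `feature_importances_` attribute. The Wine dataset and the choice of 50 estimators are assumptions made here to keep the example small.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier

wine = load_wine()
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(wine.data, wine.target)

# Impurity-based importances, normalized to sum to 1
ranked = sorted(
    zip(wine.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print(f"{name:<30s} {importance:.3f}")
```

These scores describe how much each feature contributed to the splits inside the fitted trees; they are a property of this particular model rather than a statement about causality.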

### 3.4 Common Algorithms Using Boosting:

1. AdaBoost (Adaptive Boosting):
- AdaBoost is one of the earliest and most popular boosting algorithms.
- It assigns higher weights to misclassified instances and combines the predictions of weak learners to create a strong ensemble model.
- AdaBoost is suitable for both classification and regression tasks.


2. Gradient Boosting Machines (GBM):
- GBM builds base learners sequentially, focusing on the gradients of the loss function to optimize the model's performance.
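- Each new learner is fit to the negative gradient of the loss with respect to the current ensemble's predictions (for squared error, these are simply the residuals), which amounts to gradient descent in function space.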
- GBM is widely used for regression and classification tasks and is known for its high accuracy and flexibility.


3. XGBoost and LightGBM:
- XGBoost and LightGBM are optimized implementations of gradient boosting that are efficient and scalable.
- They utilize advanced techniques like parallel processing and tree-pruning to achieve better performance.
- These algorithms are popular in data science competitions and real-world applications due to their speed and accuracy.
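
The core loop of gradient boosting is short enough to write out. The sketch below is a simplified version for regression with squared error, where the negative gradient is the residual; the synthetic sine data, the tree depth, the learning rate of 0.1, and the 50 rounds are all assumptions chosen for illustration, and production libraries add many refinements on top.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Take a small step in the direction suggested by the new tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {mse_history[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

To predict on new data, one would add the initial constant to the sum of `learning_rate * tree.predict(X_new)` over all stored trees.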


### 3.5 Coding Example with AdaBoost (Python and scikit-learn):

#### 3.5.1 Import necessary libraries

In [40]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
import numpy as np

This section imports all the necessary libraries and modules we'll use throughout the demonstration.

#### 3.5.2 Load and preprocess data

In [41]:
# Load the covertype dataset from OpenML
covertype = fetch_openml('covertype', version=4)
X = covertype.data
y = covertype.target.astype('int') - 1  # Shift labels to range 0-6

# Select a subset of data for demonstration purposes (and for faster execution)
X = X[:20000]
y = y[:20000]

#### 3.5.3 Split data into training and test sets

In [42]:
# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

This line splits our data into training and test sets. 70% of the data is used for training and 30% for testing.

#### 3.5.4 Initialize a list to store results

In [43]:
# Initialize list to store results of each model
results = []

We'll store the accuracy results of each model in this list and then convert it to a DataFrame at the end.

#### 3.5.5 Train and test a simple decision tree classifier

In [44]:
# Train and test a simple decision tree classifier
tree = DecisionTreeClassifier(max_depth=1)  # Setting depth to 1, as AdaBoost typically uses 'stumps'
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
results.append(['Simple Decision Tree', accuracy_score(y_test, tree_pred)])

We train a simple decision tree "stump" (tree of depth 1) on the training data and then predict and evaluate its accuracy on the test set.

#### 3.5.6 Train and test a Random Forest with 5 trees

In [45]:
rf_small = RandomForestClassifier(n_estimators=5, random_state=42)
rf_small.fit(X_train, y_train)
rf_small_pred = rf_small.predict(X_test)
results.append(['Random Forest (5 trees)', accuracy_score(y_test, rf_small_pred)])

Next, we train a random forest ensemble with 5 trees and then predict and evaluate its accuracy on the test set.

#### 3.5.7 Train and test AdaBoost with 5 weak learners

In [46]:
# Train and test AdaBoost with 5 weak learners (stumps)
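# The weak learner is passed positionally: its keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2, and the old name was removed in 1.4
adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=5, random_state=42)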
adaboost.fit(X_train, y_train)
adaboost_pred = adaboost.predict(X_test)
results.append(['AdaBoost (5 learners)', accuracy_score(y_test, adaboost_pred)])

Here, we train an AdaBoost classifier using decision tree stumps as weak learners. We use 5 such stumps. After training, we predict on the test set and evaluate the accuracy.

#### 3.5.8 Convert the results list to a sorted DataFrame and display it

In [47]:
# Convert the results list to a DataFrame
df_results = pd.DataFrame(results, columns=['Model', 'Accuracy'])

# Sort the DataFrame based on the 'Accuracy' column in descending order
df_results = df_results.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

# Display the sorted DataFrame
df_results.head()

|   | Model | Accuracy |
|---|-------|----------|
| 0 | Random Forest (5 trees) | 0.827667 |
| 1 | AdaBoost (5 learners) | 0.488667 |
| 2 | Simple Decision Tree | 0.3985 |


### 3.6 Exercise

1. Basic Understanding:

- Question: Describe how boosting works. How is it different from bagging?
- Answer: Boosting trains learners sequentially, where each subsequent learner tries to correct the mistakes of the previous ones. Unlike bagging, which aims to reduce variance, boosting primarily aims to reduce bias, although it can reduce variance as well.

2. Practical Implementation:

- Exercise: Implement AdaBoost using scikit-learn on the digits dataset. Compare its performance with a single Decision Tree.
- Hint: Use AdaBoostClassifier with DecisionTreeClassifier as the base estimator.

3. Deep Dive:

- Question: What is the role of learning rate in boosting algorithms like AdaBoost and Gradient Boosting?
- Answer: The learning rate (`learning_rate` in scikit-learn, `eta` in XGBoost) scales the contribution of each learner to the final prediction. A smaller learning rate requires more boosting rounds to achieve the same reduction in error, but it often results in a model that generalizes better.

4. Advanced Topic:

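- Exercise: Explore XGBoost, a popular gradient boosting library. Install it and apply it to the Boston Housing dataset (note that `load_boston` was removed from scikit-learn in version 1.2, so you may need another source for the data or a substitute such as California Housing). Compare its performance with basic Gradient Boosting from scikit-learn.
- Hint: The native XGBoost training API expects the data as a DMatrix, while the scikit-learn style wrapper (`XGBRegressor`) accepts arrays directly. Look into parameters like eta, max_depth, and subsample.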
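
The trade-off between learning rate and number of rounds can be illustrated with simple arithmetic. The sketch below assumes a deliberately idealized case that real boosting does not satisfy: every weak learner fits the current residual exactly, so each round multiplies the remaining error by (1 - learning rate). The function name `rounds_needed` is hypothetical.

```python
import math


def rounds_needed(learning_rate: float, target_fraction: float = 0.01) -> int:
    """Rounds until the residual shrinks to target_fraction of its start.

    Idealized: assumes every weak learner fits the current residual exactly,
    so each round multiplies the residual by (1 - learning_rate).
    """
    if not 0 < learning_rate < 1:
        raise ValueError("learning_rate must be strictly between 0 and 1")
    return math.ceil(math.log(target_fraction) / math.log(1 - learning_rate))


for lr in (0.5, 0.1, 0.01):
    print(f"learning_rate={lr}: {rounds_needed(lr)} rounds")
```

Even in this simplified picture, dividing the learning rate by ten multiplies the required number of rounds by roughly ten, which is why the two hyperparameters are usually tuned together.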

# 4. Strengths and Weaknesses

### 4.1 Comparison of Bagging and Boosting:

- Bagging and Boosting are both ensemble learning techniques that combine multiple models to improve predictive performance.
- Bagging aims to reduce variance and overfitting by aggregating predictions from diverse models.
- Boosting focuses on adaptive learning and aims to correct mistakes made by previous base learners in an iterative manner.

### 4.2 Strengths of Bagging:

1. Reduced Variance and Overfitting:
- Bagging reduces the risk of overfitting by combining predictions from multiple models with diverse training subsets.

2. Robustness to Noisy Data:
- Bagging is robust to noisy data and outliers since it aggregates predictions, minimizing the impact of individual data points.

3. Efficient Parallelization:
- Training base learners in bagging can be easily parallelized, making it efficient and scalable.
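
In scikit-learn, this parallelism is exposed through the `n_jobs` parameter. The sketch below trains the same Random Forest serially and in parallel on a small synthetic dataset (an assumption made here for speed) and checks that both produce the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=1 trains the trees one after another; n_jobs=-1 uses all CPU cores
serial = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

# Same seed, same forest: parallelism changes the wall-clock time, not the model
same_predictions = np.array_equal(serial.predict(X), parallel.predict(X))
print("identical predictions:", same_predictions)
```

On a dataset this small the speed-up is negligible; the benefit shows up with many trees or many rows.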

### 4.3 Weaknesses of Bagging:

1. Lack of Interpretability:
- A bagged ensemble may lack interpretability, especially when it combines many complex base learners, such as the deep decision trees in a Random Forest.


2. Bias Preservation:
- Bagging may not address bias in the base learners; if the base models are biased, the ensemble may inherit the bias.


3. Limited Performance Improvement:
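- Bagging may not improve performance significantly if the base learners are already stable (low variance), such as linear models, because models trained on different bootstrap samples then make nearly identical predictions.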


### 4.4 Strengths of Boosting:

1. Adaptive Learning and High Accuracy:
- Boosting adaptively learns from mistakes and focuses on challenging instances, leading to high accuracy.

2. Model Versatility and Feature Importance:
- Boosting can be used with various base learners and often provides feature importance scores for better understanding.

### 4.5 Weaknesses of Boosting:

1. Sensitivity to Noisy Data and Outliers:
- Boosting may be sensitive to noisy data and outliers, affecting model performance.

2. Slower Training and Limited Parallelization:
- Training base learners sequentially makes boosting slower than bagging, and it may be less efficient to parallelize.

3. Potential Overfitting:
- Boosting may be prone to overfitting, especially if the number of iterations is too high.
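
One common way to guard against too many iterations is to track validation accuracy after each boosting round. The sketch below uses `staged_predict` from scikit-learn's `GradientBoostingClassifier`; the synthetic dataset with 10% label noise, the 70/30 split, and the 200 rounds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 10% label noise (flip_y) so that overfitting is possible
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

n_rounds = 200
model = GradientBoostingClassifier(n_estimators=n_rounds, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_rounds learners
val_accuracy = [accuracy_score(y_val, pred) for pred in model.staged_predict(X_val)]
best_round = max(range(n_rounds), key=val_accuracy.__getitem__) + 1

print(f"best validation accuracy {val_accuracy[best_round - 1]:.3f} at round {best_round}")
print(f"validation accuracy after all {n_rounds} rounds: {val_accuracy[-1]:.3f}")
```

If the best round is well before the last one, refitting with fewer estimators (or a smaller learning rate) is a reasonable next step.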

### 4.6 Choosing Between Bagging and Boosting: Considerations:
- The choice between bagging and boosting depends on the dataset, the complexity of the problem, and the available computational resources.
- Bagging is suitable when the base models have high variance (for example, deep decision trees) and the goal is to reduce variance and overfitting.
- Boosting is preferred when high accuracy is crucial, and the dataset is not significantly affected by noisy data and outliers.

# 5. Scenario-Based Questions

1. High Variance Model:
- You've trained a deep decision tree on your dataset and noticed that it performs extremely well on the training data but poorly on the validation data.

- Question: Which ensemble technique might help remedy this, bagging or boosting?
- Answer: Bagging. The described scenario suggests overfitting, which is a result of high variance. Bagging is more appropriate for reducing variance.

2. High Bias Model:
- Your team trained a shallow decision tree (i.e., a decision stump) on a complex dataset. The model performs poorly on both training and validation sets.

- Question: Which ensemble technique might help improve this model's performance, bagging or boosting?
- Answer: Boosting. The model is underfitting the data, which is a sign of high bias. Boosting aims to reduce bias by sequentially correcting the errors of the previous models.

3. Large Dataset:
- You have a very large dataset and are concerned about the training time. You're considering an ensemble technique to improve your model's performance.

- Question: Which technique, bagging or boosting, would typically be faster in training?
- Answer: Bagging. Boosting trains models sequentially, where each model tries to correct the errors of the previous one, which can be time-consuming. Bagging's models are independent of one another and can be trained in parallel, which often makes it faster, especially with large datasets.

4. Noisy Data:
- Your dataset contains a significant amount of noise, and outliers are causing models to underperform.

- Question: Which ensemble method, bagging or boosting, might be more robust to such noise and why?
- Answer: Bagging. Boosting might overemphasize the outliers by giving them higher weights, leading to overfitting. Bagging, by averaging out predictions, is generally more robust to noisy data.

5. Model Diversity:
- You're working on an ensemble model, and you have access to various diverse base models. You believe the errors in these models are largely uncorrelated.

- Question: Which ensemble technique might benefit more from this diversity, bagging or boosting?
- Answer: Bagging. Bagging benefits greatly from the independence of errors among base models. If each model makes different errors, bagging can average them out, leading to a strong combined prediction.

6. Information on Error Types:
- After evaluating a model, you've noticed that it's making many types of errors, but the frequency of each type is low.

- Question: If you had to choose an ensemble technique to correct diverse error types, would you pick bagging or boosting?
- Answer: Boosting. Boosting is designed to sequentially correct the errors of previous models, making it suitable for addressing diverse types of errors.

7. Final Model Interpretability:
- You're working on a healthcare project where the interpretability of the model is crucial. Doctors want to understand how the model makes decisions.

- Question: Which ensemble technique, bagging or boosting, might be more challenging in terms of interpretability, and why?
- Answer: Both techniques can be challenging for interpretability as they combine multiple models. However, boosting, especially with many iterations, can be more challenging because it focuses on correcting errors sequentially, leading to a more complex combined model.
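
The high-bias scenario above can be tried out directly. The sketch below compares a single decision stump with bagged and boosted stumps on a synthetic two-moons dataset; the dataset, the 50 estimators, and the use of gradient boosting (rather than AdaBoost) as the boosting method are all assumptions made here for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


def stump():
    """A fresh depth-1 tree, i.e. a deliberately high-bias base learner."""
    return DecisionTreeClassifier(max_depth=1, random_state=0)


models = {
    "Single stump": stump(),
    "Bagged stumps": BaggingClassifier(stump(), n_estimators=50, random_state=0),
    # Gradient boosting with depth-1 trees stands in for "boosted stumps"
    "Boosted stumps": GradientBoostingClassifier(
        max_depth=1, n_estimators=50, random_state=0
    ),
}

scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Following the reasoning in the answers above, one would expect bagging to change little here, since averaging many stumps does not remove their shared bias, while boosting has room to improve on the single stump.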