# Bagging Ensemble


1. Bagging involves creating multiple samples of the training dataset.
2. Decision trees are fitted on each sample.
3. Variations in the training datasets lead to differences in the fitted decision trees.
4. Predictions from individual trees are combined using simple statistics like voting or averaging.
5. Bootstrap sampling is employed where examples (rows) are randomly drawn from the dataset with replacement.
6. Bootstrap sampling ensures that each sample may contain duplicates of some rows.
This process is key to the effectiveness of bagging, providing diversity in the ensemble members.
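
The steps above can be sketched by hand. This is a minimal illustration, not how BaggingClassifier is implemented internally: the choice of 11 trees (odd, to avoid tied votes), the seeds, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small synthetic binary classification dataset (labels are 0 and 1)
X, y = make_classification(random_state=1)
rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n_rows = len(y)
n_trees = 11  # illustrative choice; an odd number avoids tied votes

trees = []
for _ in range(n_trees):
    # bootstrap sample: draw row indices with replacement, same size as the dataset
    idx = rng.integers(0, n_rows, size=n_rows)
    # fit one decision tree on this sample
    trees.append(DecisionTreeClassifier(random_state=1).fit(X[idx], y[idx]))

# sampling with replacement means some rows repeat and others are left out
n_distinct = len(np.unique(idx))
print('Distinct rows in the last bootstrap sample: %d of %d' % (n_distinct, n_rows))

# combine the trees by majority vote
all_preds = np.array([tree.predict(X) for tree in trees])  # shape (n_trees, n_rows)
vote_share = all_preds.mean(axis=0)  # fraction of trees voting for class 1
ensemble_pred = (vote_share > 0.5).astype(int)
print('Training accuracy of the vote: %.3f' % np.mean(ensemble_pred == y))
```

Each tree sees a slightly different dataset, which is where the diversity among ensemble members comes from.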

Bagging is available in scikit-learn via the BaggingClassifier and BaggingRegressor classes, which
use a decision tree as the base model by default. You can specify the number of trees to
create via the n_estimators argument.

In [None]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the ensemble model
model = BaggingClassifier(n_estimators=50)

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.947 (0.081)


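Changing the base learner: the example below uses a random forest, rather than a single decision tree, as the base model.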

In [None]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier  # Importing Random Forest
from sklearn.ensemble import BaggingClassifier

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the base learner (Random Forest)
base_learner = RandomForestClassifier()  # Using Random Forest as the base learner

# configure the ensemble model with the specified base learner
# (the argument is named estimator from scikit-learn 1.2; older versions use base_estimator)
model = BaggingClassifier(estimator=base_learner, n_estimators=60)

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Mean Accuracy: 0.963 (0.060)


# Random Forest Ensemble

1. Random Forest is an extension of the bagging ensemble method.
2. Similar to bagging, Random Forest fits a decision tree on various bootstrap samples of the training dataset.
3. In addition to sampling the data, Random Forest also randomly samples the features (columns) of each dataset.
4. When constructing each decision tree, split points are chosen in the data.
5. Instead of considering all features for choosing a split point, Random Forest restricts the features to a random subset.
6. For instance, if there were 10 features, Random Forest might limit the features to a subset of 3 for each split point evaluation.
7. This feature sampling enhances the diversity among the trees in the ensemble and leads to improved generalization performance.
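
The feature sampling in the points above can be illustrated with a few lines of NumPy. This is only a sketch of the idea, assuming 10 features and a subset of 3 as in the example; the variable names are made up for illustration and no tree is actually grown.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n_features = 10

# a square-root rule, truncated to an integer, gives 3 for 10 features
max_features = max(1, int(np.sqrt(n_features)))

# each split point draws its own random subset of candidate features,
# sampled without replacement from all of the columns
for split in range(3):
    candidates = rng.choice(n_features, size=max_features, replace=False)
    print('Split %d considers features:' % split, np.sort(candidates))
```

Because every split point draws a fresh subset, two trees fitted on the same rows can still end up with quite different structure.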

The random forest ensemble is available in scikit-learn via the RandomForestClassifier
and RandomForestRegressor classes. You can specify the number of trees to create via the
n_estimators argument and the number of randomly selected features to consider at each split
point via the max_features argument, which is set to the square root of the number of features
in your dataset by default.

In [None]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the ensemble model with a specified number of trees and max_features
model = RandomForestClassifier(n_estimators=100, max_features="sqrt")  # Using the square root of the number of features as max_features

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Mean Accuracy: 0.957 (0.072)


# AdaBoost Ensemble

1. Boosting involves adding models sequentially to the ensemble where new models aim to correct errors made by prior models.
2. The more ensemble members that are added, the fewer errors are expected, at least up to the limit supported by the data, beyond which the ensemble starts to overfit.
3. AdaBoost works by fitting decision trees on versions of the training dataset weighted so that the tree focuses more on examples that prior members got wrong.
4. AdaBoost uses simple trees known as decision stumps, which make a single decision on one input variable before predicting.
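
A single round of the reweighting described above can be sketched as follows. This uses the classic binary AdaBoost update and is an illustration only; scikit-learn's AdaBoostClassifier differs in its details, and the variable names here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=1)

# start with equal weight on every training example
weights = np.full(len(y), 1.0 / len(y))

# fit a decision stump (a tree of depth 1) on the weighted data
stump = DecisionTreeClassifier(max_depth=1, random_state=1)
stump.fit(X, y, sample_weight=weights)
missed = stump.predict(X) != y

# weighted error of the stump and the resulting importance (alpha) of this member
error = np.sum(weights[missed])
alpha = 0.5 * np.log((1.0 - error) / error)

# increase the weight of missed examples, decrease the rest, then renormalize
new_weights = weights * np.exp(np.where(missed, alpha, -alpha))
new_weights /= new_weights.sum()

print('Weighted error of the stump: %.3f' % error)
print('Total weight on missed examples before: %.3f, after: %.3f'
      % (weights[missed].sum(), new_weights[missed].sum()))
```

The next stump would be fitted with new_weights, so the examples the first stump got wrong carry more influence on where it places its split.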

AdaBoost is available in scikit-learn via the AdaBoostClassifier and AdaBoostRegressor classes,
which use a decision tree (decision stump) as the base model by default. You can specify the
number of trees to create via the n_estimators argument.

In [None]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the ensemble model
model = AdaBoostClassifier(n_estimators=50)

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Mean Accuracy: 0.947 (0.088)


You can also change the base learner that is used (note that it must support weighted training data).


In [None]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the base learner (decision tree)
base_learner = DecisionTreeClassifier(max_depth=1)

# configure the ensemble model with the specified base learner and SAMME algorithm
# (the argument is named estimator from scikit-learn 1.2; older versions use base_estimator)
model = AdaBoostClassifier(estimator=base_learner, n_estimators=50, algorithm='SAMME')

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Mean Accuracy: 0.950 (0.081)


# Gradient Boosting Ensemble


1. Gradient boosting is a boosting ensemble algorithm that extends AdaBoost.
2. It reframes boosting as an additive model under a statistical framework.
3. Allows for the use of arbitrary loss functions, enhancing flexibility.
4. Utilizes loss penalties (shrinkage) to mitigate overfitting.
5. Introduces the concept of bagging to ensemble members, including sampling of training dataset rows and columns.
6. Stochastic gradient boosting is a variant that involves random sampling of rows and columns during training.
7. Particularly effective for structured or tabular data.
8. Can be slow to fit due to the sequential addition of models.
9. More efficient implementations have been developed, such as:
   - Extreme Gradient Boosting (XGBoost)
   - Light Gradient Boosting Machines (LightGBM)
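
The row and column sampling mentioned above can be tried in scikit-learn through the subsample and max_features arguments of GradientBoostingClassifier. The values 0.8 and 'sqrt' below are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(random_state=1)

# subsample < 1.0 fits each tree on a random fraction of the rows,
# max_features limits the columns considered at each split point
model = GradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    subsample=0.8,        # illustrative value: 80% of rows per tree
    max_features='sqrt',  # illustrative value: square root of the column count
    random_state=1,
)
model.fit(X, y)

train_accuracy = model.score(X, y)
print('Training accuracy: %.3f' % train_accuracy)
```

Training accuracy is printed only to confirm the model fits; it says nothing about generalization, which should be estimated with cross-validation as in the other examples.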

## Introduction to Gradient Boosting:
- Gradient Boosting is an ensemble learning method used for both regression and classification tasks.
- It builds a strong predictive model by combining multiple weak models (typically decision trees) sequentially.

## Key Components:
- **Weak Learners:** Typically decision trees with limited depth, called "base learners" or "weak learners."
- **Loss Function:** A loss function is optimized to minimize errors between actual and predicted values.
- **Gradient Descent:** Errors (residuals) from the previous step are used to fit the next base learner.

## Sequential Learning Process:
- Starts with an initial weak learner that predicts the target variable.
- Subsequent learners are trained to correct the errors made by the previous ones.
- Each new learner focuses on minimizing the loss function with respect to the residuals of the previous model.
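
The sequential process above can be written out by hand for regression with squared error, where the negative gradient of the loss is simply the residual. This is a minimal sketch under that assumption, not the scikit-learn implementation; the dataset size, tree depth, and variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

learning_rate = 0.1  # shrinkage applied to each tree's contribution
n_trees = 50

# initial model: predict the mean of the target for every row
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    # for squared error, the negative gradient is just the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=1)
    tree.fit(X, residuals)
    # add a shrunken version of the new tree's prediction to the running total
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('Training MSE before boosting: %.1f' % initial_mse)
print('Training MSE after %d trees: %.1f' % (n_trees, final_mse))
```

To predict for new rows, the same sum is replayed: start from the training mean and add learning_rate times each stored tree's prediction.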

## Boosting Mechanism:
- Unlike AdaBoost, gradient boosting does not reweight the training instances. Each weak learner is fitted to the negative gradient of the loss with respect to the current predictions (the residuals, in the case of squared error).
- The final prediction is the sum of the contributions of all the weak learners, each scaled by the learning rate.

## Benefits:
- **Improved Accuracy:** Gradient boosting often yields higher accuracy compared to individual weak learners.
- **Handles Complex Data:** Capable of capturing complex relationships in data.
- **Robustness:** Regularization options such as shrinkage, subsampling, and shallow trees help control overfitting, although adding too many trees can still overfit.
- **Flexibility:** Supports various loss functions and can be customized through hyperparameters.

## Popular Implementations:
- **XGBoost:** An optimized and scalable implementation of gradient boosting.
- **LightGBM:** A gradient boosting framework that uses tree-based learning algorithms.
- **CatBoost:** A gradient boosting library that handles categorical features efficiently.
- **Scikit-learn GradientBoostingRegressor/Classifier:** Part of the scikit-learn library, offering a simple interface for gradient boosting.


Gradient boosting is available in scikit-learn via the GradientBoostingClassifier and
GradientBoostingRegressor classes, which use a decision tree as the base model by default.
You can specify the number of trees to create via the n_estimators argument and the learning
rate that controls the contribution from each tree via the learning_rate argument, which defaults
to 0.1.

In [1]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the ensemble model
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.2)

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Mean Accuracy: 0.930 (0.100)


# Voting Ensemble

1. Voting ensembles utilize simple statistics to combine predictions from multiple models.
2. It involves fitting multiple different model types on the same training dataset.
3. For regression problems, predictions are typically averaged.
4. In classification, hard voting is used, selecting the class label with the most votes.
5. Hard voting is effective when the base models are diverse and have different decision boundaries.
6. Soft voting is an alternative, where predicted probabilities are summed and the label with the largest summed probability is selected.
7. Soft voting requires base models that can predict class probabilities, and is often preferred when they can.
8. Soft voting can lead to better performance as it incorporates the confidence level of each model's prediction.
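
The difference between hard and soft voting can be seen on a single example. The probabilities below are made-up numbers for three hypothetical models, chosen so that the two schemes disagree.

```python
import numpy as np

# hypothetical predicted probabilities for one example from three models;
# columns are class 0 and class 1
probabilities = np.array([
    [0.45, 0.55],  # model A leans slightly towards class 1
    [0.40, 0.60],  # model B leans slightly towards class 1
    [0.95, 0.05],  # model C is very confident in class 0
])

# hard voting: each model casts one vote for its most likely class
votes = probabilities.argmax(axis=1)
hard_label = np.bincount(votes, minlength=2).argmax()

# soft voting: sum the probabilities per class and pick the largest total
summed = probabilities.sum(axis=0)
soft_label = summed.argmax()

print('Votes:', votes)
print('Hard voting label:', hard_label)
print('Summed probabilities:', summed)
print('Soft voting label:', soft_label)
```

Here two lukewarm votes for class 1 win the hard vote, while the summed probabilities (1.80 versus 1.20) favour class 0 because the third model is far more confident.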

Voting ensembles are available in scikit-learn via the VotingClassifier and VotingRegressor
classes.

A list of base models can be provided as an argument to the model, and each model in
the list must be a tuple with a name and the model, e.g. ('lr', LogisticRegression()).

The type of voting used for classification can be specified via the voting argument and set to either
'soft' or 'hard'.

In [3]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the models to use in the ensemble
models = [('lr', LogisticRegression()), ('nb', GaussianNB())]

# configure the ensemble model
model = VotingClassifier(models, voting='hard')

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.957 (0.062)


# Stacking Ensemble

1. Stacking combines predictions from multiple base models, similar to voting.
2. Unlike voting, stacking employs another machine learning model, called a meta-model, to learn how to best combine the predictions of the base models.
3. The meta-model is often a linear model like linear regression for regression tasks or logistic regression for classification tasks, but it can be any machine learning model.
4. The meta-model is trained on the predictions made by the base models on out-of-sample data.
5. Out-of-sample data refers to data that is not included in the training set of a machine learning model. It's used to evaluate how well the model generalizes to new, unseen observations.
6. Out-of-sample data is generated using k-fold cross-validation for each base model, and all out-of-fold predictions are stored.
7. Out-of-fold predictions refer to the predictions made by a machine learning model on data points that were not used during the training process for that particular fold in k-fold cross-validation.
8. Base models are then trained on the entire training dataset.
9. The meta-model learns from the out-of-fold predictions, determining which base models to trust, the degree of trust, and under which circumstances to trust them.
10. Internally, stacking uses k-fold cross-validation to train the meta-model, but the evaluation of stacking models can be done using various methods such as train-test split or k-fold cross-validation.
11. The evaluation of the model is separate from the internal resampling-for-training process.
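
The training procedure described above can be sketched manually with cross_val_predict. This is a simplified illustration of the idea, not what StackingClassifier does line for line; using only the class 1 probability as the meta-feature and 3 folds are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=1)
base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=1)]

# collect out-of-fold predicted probabilities of class 1 from each base model;
# every row is predicted by a model that did not see it during fitting
oof_columns = []
for base in base_models:
    oof = cross_val_predict(base, X, y, cv=3, method='predict_proba')[:, 1]
    oof_columns.append(oof)
meta_X = np.column_stack(oof_columns)

# the meta-model learns how to combine the base models' predictions
meta_model = LogisticRegression()
meta_model.fit(meta_X, y)

# the base models are then refit on the entire training dataset
for base in base_models:
    base.fit(X, y)

print('Meta-feature matrix shape:', meta_X.shape)
print('Meta-model coefficients:', meta_model.coef_)
```

The meta-model's coefficients give a rough sense of how much it relies on each base model.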







Stacking ensembles are available in scikit-learn via the StackingClassifier and
StackingRegressor classes. A list of base models can be provided as an argument to the
model, and each model in the list must be a tuple with a name and the model, e.g. ('lr',
LogisticRegression()). The meta-learner can be specified via the final_estimator argument,
and the resampling strategy can be specified via the cv argument, which can simply be set to an
integer indicating the number of cross-validation folds.

In [4]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# create the synthetic classification dataset
X, y = make_classification(random_state=1)

# configure the models to use in the ensemble
models = [('knn', KNeighborsClassifier()), ('tree', DecisionTreeClassifier())]

# configure the ensemble model
model = StackingClassifier(models, final_estimator=LogisticRegression(), cv=3)

# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.927 (0.085)
