# Bagging and Pasting

The idea is to use the same training algorithm for every predictor, but to train each one on a different random subset of the training set.

##### Bagging:
When sampling is performed with replacement
##### Pasting:
When sampling is performed without replacement

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
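The difference between the two sampling schemes can be sketched with NumPy. This is only an illustration: the training-set size `m`, the subset size of 6, and the seed are arbitrary assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10                      # hypothetical training-set size
indices = np.arange(m)

# Bagging: sample WITH replacement, so one index may appear several
# times in the subset used by a single predictor.
bagging_subset = rng.choice(indices, size=m, replace=True)

# Pasting: sample WITHOUT replacement, so every index in the subset
# used by a single predictor is unique.
pasting_subset = rng.choice(indices, size=6, replace=False)

print("bagging:", np.sort(bagging_subset))
print("pasting:", np.sort(pasting_subset))
```

Drawing a fresh subset for each predictor means the same instance can still appear in the subsets of several predictors under either scheme.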

This sampling and training process is represented in Figure 7-4.

<img src='img_4.png'>

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. 

The aggregation function is typically the *statistical mode* for classification (i.e., the most frequent prediction, just like a hard voting classifier), or the average for regression.
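As a minimal sketch of these two aggregation functions, the snippet below uses small hand-made prediction arrays. The arrays and the helper name `majority_vote` are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical class predictions: 5 predictors (rows) x 4 new instances (columns).
class_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
    [0, 1, 2, 1],
])

def majority_vote(preds):
    """Return the most frequent label (statistical mode) in each column."""
    return np.array([np.bincount(col).argmax() for col in preds.T])

print(majority_vote(class_preds))   # [0 1 2 1]

# Hypothetical regression predictions: 3 predictors x 2 new instances.
reg_preds = np.array([[2.0, 3.0], [4.0, 5.0], [3.0, 4.0]])
print(reg_preds.mean(axis=0))       # [3. 4.]
```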

Each individual predictor has a higher bias than if it were trained on the original training set, but
aggregation reduces both bias and variance.

Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the
original training set.

Predictors can all be trained in parallel, on different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons why bagging and pasting are such popular methods: they scale very well.

## Bagging and Pasting in Scikit-Learn


In [3]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [4]:
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data[:, 2:], data.target)

In [5]:
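# The base estimator is passed positionally: the keyword argument was renamed
# from base_estimator to estimator in scikit-learn 1.2.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, n_jobs=-1)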

In [7]:
bag_clf.fit(X_train, y_train)


In [8]:
y_pred = bag_clf.predict(X_test)

The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.
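One way to see the soft-voting behaviour is to check that the predicted class is the one with the highest averaged probability. The sketch below is an illustration on the full iris data with an arbitrary `n_estimators=50` and seed, not the notebook's own cell.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=42)
clf.fit(X, y)

# Decision Trees expose predict_proba(), so soft voting applies.
print(hasattr(DecisionTreeClassifier(), "predict_proba"))

# predict_proba() returns the class probabilities averaged over predictors;
# the predicted label is the class with the highest averaged probability.
proba = clf.predict_proba(X[:5])
soft_vote = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(soft_vote, clf.predict(X[:5])))
```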

<img src='img_5.png'>

As you can see, the ensemble’s predictions will likely generalize much better than the single Decision Tree’s predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power you can use cross-validation to evaluate both bagging and pasting and select the one that works best.
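A comparison of this kind could look like the following sketch. The choices of `n_estimators=100`, `max_samples=0.5`, five folds, and the fixed seed are assumptions made for the example. Note that `max_samples` is kept below 1.0 because pasting with the full training set would give every predictor the same data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]   # petal length and width, as in the cells above

results = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.5, bootstrap=bootstrap,
                            random_state=42)
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(clf, X, y, cv=5).mean()

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

On a dataset this small the two scores are likely to be close, so the comparison is more informative on larger problems.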

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated so the ensemble’s variance is reduced.

## Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all.

By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set.
This means that only about 63% of the training instances are sampled on average for each predictor. As m grows, this ratio approaches:

$1 - e^{-1} \approx 63\%$
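The 63% figure follows from a short argument: each draw misses a given instance with probability 1 - 1/m, so after m independent draws the instance is still unsampled with probability (1 - 1/m)^m, which tends to $e^{-1}$ as m grows. The sketch below checks this numerically; the helper name `sampled_fraction`, the values of m, and the seed are choices made for the illustration.

```python
import numpy as np

def sampled_fraction(m):
    """Probability that a given instance is drawn at least once
    in m draws with replacement from m instances."""
    return 1 - (1 - 1 / m) ** m

for m in (10, 100, 1000, 100_000):
    print(m, round(sampled_fraction(m), 4))

# Empirical check: draw one bootstrap sample and count distinct instances.
rng = np.random.default_rng(0)
m = 10_000
sample = rng.integers(0, m, size=m)
unique_ratio = np.unique(sample).size / m
print(unique_ratio, 1 - np.exp(-1))
```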

The remaining 37% of the training instances that are not sampled are called *out-of-bag* (oob) instances. Note that they are not the same 37% for all predictors.


Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set or cross-validation.
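To make the idea concrete, here is a hand-rolled sketch for a single predictor: draw one bootstrap sample, train a tree on it, and score the tree on the instances it never saw. This is not how BaggingClassifier is implemented internally, and the seed and use of the full iris data are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
m = len(X)
rng = np.random.default_rng(42)

# One bootstrap sample: m indices drawn with replacement.
bag_idx = rng.integers(0, m, size=m)

# Out-of-bag instances are those whose index was never drawn.
oob_mask = np.ones(m, dtype=bool)
oob_mask[bag_idx] = False

tree = DecisionTreeClassifier(random_state=42).fit(X[bag_idx], y[bag_idx])

# Evaluate the predictor only on instances it did not see during training.
oob_accuracy = tree.score(X[oob_mask], y[oob_mask])
print(oob_mask.sum(), "oob instances, accuracy:", oob_accuracy)
```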



In [11]:
bag_clf_oob = BaggingClassifier(DecisionTreeClassifier(), oob_score=True, 
                                n_estimators = 500, max_samples=100, 
                                bootstrap=True, n_jobs = -1)

In [13]:
bag_clf_oob.fit(X_train, y_train)
bag_clf_oob.oob_score_

0.9642857142857143

In [14]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, bag_clf_oob.predict(X_test))

0.9473684210526315

In [15]:
bag_clf_oob.oob_decision_function_

array([[0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.14027149, 0.85972851],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.98173516, 0.01826484],
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.16402116, 0.83597884],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.        ],
       [0.

The oob decision function for each training instance is also available through the
oob_decision_function_ variable. In this case (since the base estimator has a
predict_proba() method) the decision function returns the class probabilities for each
training instance.

In [17]:
bag_clf_oob.oob_decision_function_[2]

array([0.        , 0.14027149, 0.85972851])

For example, the oob evaluation estimates that the third training instance (index 2) has an 85.97% probability of belonging to the $3^{rd}$ class and a 14.03% probability of belonging to the $2^{nd}$ class.

## Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters:
1. max_features and 
2. bootstrap_features

They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.


This is particularly useful when you are dealing with high-dimensional inputs (such as images).

*Random Patches Method*: Sampling both training instances and features.

*Random Subspaces Method*: Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features smaller than 1.0).

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
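The two methods can be sketched as two BaggingClassifier configurations. The specific values (`max_samples=0.7`, `max_features=0.5`, 100 estimators, the seed) are assumptions for the illustration, and iris has only four features, so this shows the mechanics rather than a realistic high-dimensional use case.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random Patches: sample both training instances and features.
random_patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42)

# Random Subspaces: keep all training instances, sample only features.
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42)

for name, clf in [("patches", random_patches), ("subspaces", random_subspaces)]:
    clf.fit(X, y)
    # estimators_features_ lists the feature indices each predictor was given.
    print(name, clf.estimators_features_[:3])
```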