## Ensemble Learning
- If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. 
- A group of predictors is called an ***ensemble***; thus, this technique is called ***ensemble learning***, and an ensemble algorithm is called an ***ensemble method***.
- For example:
    - You can train a group of decision tree classifiers, each on a different random subset of the training set.
    - You can then obtain the predictions of all the individual trees, and the class that gets the most votes is the ensemble's prediction.
    - Such an ensemble of decision trees is called a ***random forest***, and despite its simplicity, this is one of the most powerful machine learning algorithms available today.
- You will often use ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor.

### Voting Classifiers
- Suppose you have trained a few classifiers, each one achieving about 80% accuracy.
- *You may have a logistic regression classifier, an SVM classifier, a random forest classifier, a k-nearest neighbors classifier, and perhaps a few more*.
- A very simple way to create an even better classifier is to aggregate the predictions of each classifier: the class that gets the most votes is the ensemble's prediction.
- This majority-vote classifier is called a *hard-voting* classifier.
- Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. 
- In fact, even if each classifier is a *weak learner* (meaning it only does slightly better than random guessing), the ensemble can still be a *strong learner* (achieving high accuracy), provided there are a sufficient number of weak learners in the ensemble and they are sufficiently diverse.

***Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.***
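
The weak-learner claim above can be checked with a quick simulation. This is a minimal sketch that assumes perfectly independent classifiers, each correct 51% of the time (an idealization; real classifiers trained on the same data make correlated errors, so the actual gain is smaller). The names `n_classifiers`, `n_instances`, and `p_correct` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classifiers = 1000   # number of weak learners in the hypothetical ensemble
n_instances = 10_000   # number of simulated test instances
p_correct = 0.51       # each weak learner is only slightly better than chance

# For each instance, draw how many of the independent classifiers are correct
correct_votes = rng.binomial(n_classifiers, p_correct, size=n_instances)

# The majority vote is right when more than half of the classifiers are right
ensemble_accuracy = (correct_votes > n_classifiers / 2).mean()
print(f"Single classifier: {p_correct:.2f}, majority vote: {ensemble_accuracy:.3f}")
```

Under the independence assumption, the majority vote is correct far more often than any single classifier, which is why diversity matters so much.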

- Scikit-Learn's `VotingClassifier` class:
    - Just give it a list of name/predictor pairs, and use it like a normal classifier.

In [24]:
# Using VotingClassifier on the make_moons dataset
# We will load and split the moons dataset into a training set and a test set, then we'll create and train a voting classifier
# composed of three diverse classifiers


from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42, probability=True))
    ]
)

voting_clf.fit(X_train, y_train)

- When you fit a `VotingClassifier`, it clones every estimator and fits the clones.
- The original estimators are available via the `estimators` attribute, while the fitted clones are available via the `estimators_` attribute.

In [25]:
# Let's look at each fitted classifier's accuracy on the test set
for name, clf in voting_clf.named_estimators_.items():
    print(f"{name} = {clf.score(X_test, y_test)}")

lr = 0.864
rf = 0.896
svc = 0.896


In [26]:
# For each classifier, show the prediction of that classifier on the first instance of the test set
for clf in voting_clf.estimators_:
    y_pred = clf.predict(X_test[:1])
    print(y_pred)

[1]
[1]
[0]


In [27]:
# So if we use the predict method of the voting classifier, it should give us the most frequent prediction among the three classifiers
voting_clf.predict(X_test[:1])

array([1])

In [28]:
# Now if we look at the voting classifier's score on the test set, it should be higher than each individual classifier
voting_clf.score(X_test, y_test)

0.912

- If all classifiers are able to estimate class probabilities (i.e. if they all have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.
- This is called *soft voting*. 
- It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
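
To make the difference between hard and soft voting concrete, here is a small sketch with made-up class probabilities for three hypothetical classifiers (the numbers are assumptions chosen for illustration, not outputs of the models in this notebook).

```python
import numpy as np

# Rows: three hypothetical classifiers; columns: P(class 0), P(class 1)
probas = np.array([
    [0.45, 0.55],  # weakly prefers class 1
    [0.48, 0.52],  # weakly prefers class 1
    [0.90, 0.10],  # strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft_prediction = probas.mean(axis=0).argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Two lukewarm votes for class 1 win the hard vote, but the single confident vote for class 0 wins once probabilities are averaged.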

In [29]:
voting_clf.voting = "soft"

In [30]:
# Taking the predicted probabilities of each class from each classifier for the first test instance
import numpy as np

class_0_probs = []
class_1_probs = []
for clf in voting_clf.estimators_:
    class_0_probs.append(clf.predict_proba(X_test[:1])[0][0])
    class_1_probs.append(clf.predict_proba(X_test[:1])[0][1])

class_0_probs = np.array(class_0_probs)
class_1_probs = np.array(class_1_probs)

In [31]:
# Computing the mean predicted probability for each class across the three classifiers
print(class_0_probs)
print(class_0_probs.mean())
print(class_1_probs)
print(class_1_probs.mean())

[0.49900001 0.48       0.56979485]
0.5162649531151421
[0.50099999 0.52       0.43020515]
0.48373504688485797


In [32]:
# So now with soft voting, we should expect the prediction on the first test instance to be class 0
voting_clf.predict(X_test[:1])

array([0])

In [33]:
# But our overall score on the test set is
voting_clf.score(X_test, y_test)

0.92

### Bagging and Pasting
- Another way to get a diverse set of predictors is to use the same training algorithm for every predictor but train them on different random subsets of the training set.
- When sampling is performed *with* replacement, this method is called ***bagging*** (short for *bootstrap aggregating*). 
- When sampling is performed *without* replacement, this method is called ***pasting***.
- In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
- Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. 
- The aggregation is typically the *statistical mode* for classification, or the average for regression.
- Predictors can all be trained in parallel, via different CPU cores or even different servers.
- Similarly, predictions can be made in parallel.
- This is one of the reasons bagging and pasting are such popular methods: ***they scale very well***.
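
The difference between the two sampling schemes can be sketched with NumPy index sampling. This is only an illustration of sampling with and without replacement, not how `BaggingClassifier` is implemented; `m` and `sample_size` are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 1000           # size of the hypothetical training set
sample_size = 100  # instances drawn for one predictor

# Bagging: sample WITH replacement, so the same index can appear several times
bagging_idx = rng.choice(m, size=sample_size, replace=True)

# Pasting: sample WITHOUT replacement, so every index is unique
pasting_idx = rng.choice(m, size=sample_size, replace=False)

print("unique indices (bagging):", len(np.unique(bagging_idx)))
print("unique indices (pasting):", len(np.unique(pasting_idx)))
```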

In [34]:
# The following code trains an ensemble of 500 decision tree classifiers
# Each is trained on 100 training instances randomly sampled from the training set with replacement
# So this is a bagging example (set bootstrap=False to use pasting)

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500, # The number of classifiers to use
    max_samples=100, # How many instances used to train each classifier
    n_jobs=-1, # use all available cores for training and predictions
    random_state=42
)

In [35]:
bag_clf.fit(X_train, y_train)

***A `BaggingClassifier` automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities.***
- Overall, bagging often results in better models than pasting, which explains why it's generally preferred.

### Out-of-Bag Evaluation
- With bagging, some training instances may be sampled several times for any given predictor, while others may not be sampled at all.
- By default, a `BaggingClassifier` samples $m$ training instances with replacement `(bootstrap=True)`, where $m$ is the size of the training set.
- With this process, it can be shown mathematically that only about 63% of the training instances are sampled on average for each predictor.
- The remaining 37% of the training instances that are not sampled are called *out-of-bag* (OOB) instances. *Note: They are not the same 37% for all predictors*.
- A bagging ensemble can be evaluated using OOB instances, without the need for a separate validation set: indeed, if there are enough estimators, then each instance in the training set will likely be an OOB instance of several estimators, so these estimators can be used to make a fair ensemble prediction for that instance. 
- Once you have a prediction for each instance, you can compute the ensemble's prediction accuracy (or any other metric).
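
The 63% figure comes from the fact that an instance is missed by all $m$ draws with probability $(1 - 1/m)^m$, which approaches $e^{-1} \approx 0.37$ as $m$ grows. The sketch below checks this both analytically and by simulation; `m` is an arbitrary illustrative size.

```python
import math
import random

m = 10_000  # hypothetical training set size

# Probability that a given instance is drawn at least once in m draws with replacement
analytic = 1 - (1 - 1 / m) ** m
limit = 1 - math.exp(-1)  # value as m grows to infinity

# Simulate one bootstrap sample and count the distinct instances it contains
random.seed(42)
sampled = {random.randrange(m) for _ in range(m)}
simulated = len(sampled) / m

print(f"analytic: {analytic:.4f}, limit: {limit:.4f}, simulated: {simulated:.4f}")
```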

In [36]:
# Using the OOB instances to get a score in sklearn
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

bag_clf.fit(X_train, y_train)

In [37]:
bag_clf.oob_score_

0.896

In [38]:
# According to this OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

### Random Patches and Random Subspaces
- The `BaggingClassifier` class supports sampling the features as well.
- Sampling is controlled by two hyperparameters: `max_features` and `bootstrap_features`. 
- They work the same way as `max_samples` and `bootstrap`, but for feature sampling instead of instance sampling. 
- Thus, each predictor will be trained on a random subset of the input features.
- This technique is particularly useful when you are dealing with high-dimensional inputs (such as images), as it can considerably speed up training.
- Sampling both training instances and features is called the ***random patches*** method.
- Keeping all training instances (by setting `bootstrap=False` and `max_samples=1.0`) but sampling features (by setting `bootstrap_features` to `True` and/or `max_features` to a value smaller than `1.0`) is called the *random subspaces* method.
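
As a rough sketch of how the two methods are configured, the code below builds one ensemble of each kind on a synthetic 20-feature dataset (the dataset and the variable names `X_hd`, `y_hd`, `random_subspaces`, and `random_patches` are assumptions for illustration). The fitted `estimators_features_` attribute lists the feature indices each predictor was given.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features, used only for illustration
X_hd, y_hd = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Random subspaces: keep every training instance, sample only the features
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

# Random patches: sample both the training instances and the features
random_patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

random_subspaces.fit(X_hd, y_hd)
random_patches.fit(X_hd, y_hd)

# Feature indices drawn for the first predictor of each ensemble
print(random_subspaces.estimators_features_[0])
print(random_patches.estimators_features_[0])
```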

### Random Forests
- A random forest is an ensemble of decision trees, generally trained via the bagging method, typically with `max_samples` set to the size of the training set.
- Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class, which is more convenient and optimized for decision trees.

In [39]:
# Training a random forest classifier with 500 trees, each limited to a maximum number of 16 leaf nodes
# using all available CPU cores
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    n_jobs=-1,
    random_state=42
)

rnd_clf.fit(X_train, y_train)

In [40]:
y_pred_rf = rnd_clf.predict(X_test) 
accuracy_score(y_test, y_pred_rf)  # accuracy_score expects (y_true, y_pred)

0.912

- The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.
- By default, it samples $\sqrt{n}$ features (where $n$ is the total number of features).
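
To see how this relates to plain bagging, a random forest can be approximated by a `BaggingClassifier` whose base tree samples features at each split via `max_features="sqrt"`. The sketch below regenerates the moons data under new names (`X_m`, `y_m`, and so on) so it is self-contained; it treats the two ensembles as roughly equivalent and does not claim their predictions match exactly.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=42)

# Bagging ensemble of trees that consider a random subset of features at each split
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100, random_state=42,
)

# The dedicated class, configured the same way
forest = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=42
)

bag_acc = bag_of_trees.fit(X_tr, y_tr).score(X_te, y_te)
forest_acc = forest.fit(X_tr, y_tr).score(X_te, y_te)
print(f"bagged trees: {bag_acc:.3f}, random forest: {forest_acc:.3f}")
```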

### Extra Trees
- When you are growing a tree in a random forest, at each node only a random subset of the features is considered for splitting. 
- It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. 
- A forest of such extremely random trees is called an *extremely randomized trees* (or *extra-trees* for short) ensemble.
- Once again, this technique trades more bias for a lower variance.
- It also makes extra-trees classifiers much faster to train than regular random forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

***It is hard to tell beforehand whether a `RandomForestClassifier` will perform better or worse than an `ExtraTreesClassifier`. Generally, the only way to know is to try both and compare them using cross-validation.*** 
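
A minimal sketch of such a comparison, assuming 5-fold cross-validation on a freshly generated moons dataset (the names `X_m`, `y_m`, `candidates`, and `cv_scores` are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "extra-trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

# Mean accuracy over 5 folds for each candidate
cv_scores = {
    name: cross_val_score(model, X_m, y_m, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Which model comes out ahead depends on the dataset, so the result here says nothing general about either algorithm.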

### Feature Importance
- Random forests make it easy to measure the relative importance of each feature.
- Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest.
- Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1.
- You can access the result using the `feature_importances_` attribute.

In [41]:
# Training a RandomForestClassifier on the iris dataset
# Outputting each feature's importance

from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)

0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


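- So we can see that petal length (44%) and petal width (42%) are the most important features, while sepal length and sepal width matter much less (11% and 2%, respectively).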
- Random forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

### Boosting
- *Boosting* refers to any ensemble method that can combine several weak learners into a strong learner. 
- The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.
- There are many boosting methods available, but by far the most popular are *AdaBoost* (adaptive boosting) and *Gradient Boosting*.

### AdaBoost
- One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfit. 
- This results in new predictors focusing more and more on the hard cases. 
- This is the technique used by AdaBoost.
- For example:
    - When training an AdaBoost classifier, the algorithm first trains a base classifier (such as a decision tree) and uses it to make predictions on the training set.
    - The algorithm then increases the relative weight of misclassified training instances.
    - Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.
    
***There is one important drawback to this sequential learning technique: training cannot be parallelized since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.***

#### The AdaBoost Algorithm
- Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$.
- A first predictor is trained, and its weighted error rate $r_1$ is computed on the training set.
- *Weighted error rate of the $j^{th}$ predictor*
$$
r_j = \sum^m_{\underset{\hat{y}^{(i)}_j \neq y^{(i)}}{i=1}} w^{(i)} \text{ where } \hat{y}^{(i)}_j \text{ is the } j^{th} \text{ predictor's prediction for the } i^{th} \text{ instance }
$$
*for each instance of the training set where the predictor incorrectly predicted the value, we sum up the total weights of these instances*

- The predictor's weight $\alpha_j$ is then computed using:
$$
\alpha_j = \eta \log \frac{1 - r_j}{r_j}
$$
- where:
    - $\eta$ is the learning rate hyperparameter (defaults to 1)
- The more accurate a predictor is, the higher its weight will be.
- If it's just guessing randomly, then its weight will be close to zero.
- However, if it is most often wrong (i.e. less accurate than random guessing), then its weight will be negative.
- Next, the AdaBoost algorithm updates the instance weights:
- *Weight update rule, **which boosts the weights of the misclassified instances**:*
$$
\text{for } i = 1, 2, \dots, m \\
w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}^{(i)}_j = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}^{(i)}_j \neq y^{(i)}
\end{cases}
$$

- Then all the instance weights are normalized (i.e., divided by $\sum^m_{i=1}w^{(i)}$)
- Finally, a new predictor is trained using the updated weights, and the whole process is repeated: the new predictor's weight is computed, the instance weights are updated, then another predictor is trained, and so on. 
- The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
- To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights $\alpha_j$.
- The predicted class is the one that receives the majority of weighted votes.
- *AdaBoost predictions (where $N$ is the number of predictors)*
$$
\hat{y}(\textbf{x}) = \underset{k}{\text{argmax}} \sum^N_{\underset{\hat{y}_j(\textbf{x}) = k}{j=1}} \alpha_j
$$
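
One iteration of the weight computations above can be traced by hand. The sketch below uses five hypothetical instances and a hypothetical predictor that gets exactly one of them wrong; it follows the formulas in this section and is not Scikit-Learn's implementation.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])  # the third instance is misclassified
learning_rate = 1.0                 # eta in the formulas above

m = len(y_true)
weights = np.full(m, 1 / m)         # every instance starts with weight 1/m

# Weighted error rate: total weight of the misclassified instances
misclassified = y_pred != y_true
r_j = weights[misclassified].sum()  # 0.2

# Predictor weight: log(0.8 / 0.2) = log(4), roughly 1.386
alpha_j = learning_rate * np.log((1 - r_j) / r_j)

# Boost the misclassified instances, then normalize so the weights sum to 1
weights[misclassified] *= np.exp(alpha_j)
weights /= weights.sum()

print("r_j:", r_j, "alpha_j:", alpha_j)
print("updated weights:", weights)
```

After the update, the single misclassified instance carries half of the total weight (0.5), while each correctly classified instance drops to 0.125, so the next predictor is pushed to get that instance right.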

In [42]:
# Trains an AdaBoost classifier based on 30 decision stumps
# A decision stump is a decision tree with max_depth=1
# In other words, a tree composed of a single decision node plus two leaf nodes
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42
)
ada_clf.fit(X_train, y_train)



***If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator***

### Gradient Boosting
- Another very popular boosting algorithm is *gradient boosting*.
- Just like AdaBoost, gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
- However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the *residual errors* made by the previous predictor.

In [43]:
# Let's go through a simple regression example, using decision trees as the base predictor
# This is called gradient tree boosting or gradient boosted regression trees
# First, let's generate a noisy quadratic dataset
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100) # y = 3x^2 + Gaussian noise

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

In [44]:
# Next, we'll train a second DecisionTreeRegressor on the residual errors made by the first predictor:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

In [45]:
# And then we'll train a third regressor on the residual errors made by the second predictor
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

- So essentially what happens:
    - Train the initial model on the training set
    - Get the predictions of the initial model 
    - The training set for the next model will be the correct labels/values minus the predicted values
        - For example, suppose the correct labels are `[100, 200, 300]` and the model predicts `[90, 198, 300]`
        - Then the training set for the next model will be `[(100 - 90), (200 - 198), (300 - 300)] = [10, 2, 0]`
    - So on and so forth.
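
The steps above can be written as a loop that works for any number of trees. This is a minimal sketch on a separately generated noisy quadratic dataset (`X_demo`, `y_demo`, `n_trees`, and `mse_history` are illustrative names); it has no learning rate or initial constant prediction, so it is not a reimplementation of `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.random((100, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

n_trees = 3
trees = []
mse_history = []
residuals = y_demo.copy()  # the first tree is trained on the raw targets

for i in range(n_trees):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42 + i)
    tree.fit(X_demo, residuals)
    # Whatever this tree failed to explain becomes the next tree's target
    residuals = residuals - tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(float(np.mean(residuals ** 2)))

# The ensemble prediction is the sum of the individual trees' predictions
ensemble_pred = sum(tree.predict(X_demo) for tree in trees)
print("training MSE after each tree:", mse_history)
```

The training error shrinks (or at least never grows) with each added tree, because every tree is fitted to what its predecessors left unexplained.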

In [46]:
# Now we have an ensemble containing three trees
# It can make predictions simply by adding up the predictions of all the trees
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.49484029, 0.04021166, 0.75026781])

In [47]:
# We can use GradientBoostingRegressor to train GBRT ensembles more easily
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=3,
    learning_rate=1.0, random_state=42
)
gbrt.fit(X, y)

- The `learning_rate` hyperparameter scales the contribution of each tree.
- If you set it to a low value, such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better.
    - This is a regularization technique called *shrinkage*. 
- To find the optimal number of trees, you could perform cross-validation using `GridSearchCV` or `RandomizedSearchCV`, as usual, but there's a simpler way:
    - If you set the `n_iter_no_change` hyperparameter to an integer value, say 10, then the `GradientBoostingRegressor` will automatically stop adding more trees during training if it sees that the last 10 trees did not help.
        - This is simply early stopping, but with a little bit of patience: it tolerates having no progress for a few iterations before it stops.

In [48]:
# Training the ensemble using early stopping
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42
)
gbrt_best.fit(X, y)

- If you set `n_iter_no_change` too low, training may stop too early and the model will underfit. 
- But if you set it too high, it will overfit instead.

In [49]:
gbrt_best.n_estimators_

92

- When `n_iter_no_change` is set, the `fit()` method automatically splits the training set into a smaller training set and a validation set.
- This allows it to evaluate the model's performance each time it adds a new tree.
- The size of the validation set is controlled by the `validation_fraction` hyperparameter, which is 10% by default.
- The `tol` hyperparameter determines the maximum performance improvement that still counts as negligible. It defaults to 0.0001.
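
The patience logic can be sketched in plain Python. The function below is a simplified illustration of early stopping with a tolerance, applied to a made-up list of validation losses; `stopping_iteration` is a hypothetical helper and not Scikit-Learn's exact stopping criterion.

```python
def stopping_iteration(val_losses, patience=10, tol=1e-4):
    """Return the 1-based iteration at which training would stop."""
    best = float("inf")
    stale = 0  # consecutive iterations without a meaningful improvement
    for i, loss in enumerate(val_losses, start=1):
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return len(val_losses)  # never ran out of patience


# Hypothetical validation losses: improvement stalls after the fourth tree
losses = [1.0, 0.8, 0.7, 0.69, 0.70, 0.71, 0.72]
print(stopping_iteration(losses, patience=3))  # stops at iteration 7
print(stopping_iteration(losses, patience=2))  # stops at iteration 6
```

A larger `patience` tolerates a longer plateau before giving up, which mirrors the role of `n_iter_no_change`.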

- The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter, which specifies the fraction of training instances to be used for training each tree.
- For example:
    - If `subsample=0.25`, then each tree is trained on 25% of the training instances, selected randomly. 
- This technique trades a higher bias for a lower variance.
- It also speeds up training considerably. This is called *stochastic gradient boosting*.

### Histogram-Based Gradient Boosting
- Scikit-Learn also provides another GBRT implementation, optimized for large datasets: *histogram-based gradient boosting (HGB)*. 
- It works by binning the input features, replacing them with integers.
- The number of bins is controlled by the `max_bins` hyperparameter, which defaults to 255 and cannot be set any higher than this.
- Working with integers makes it possible to use faster and more memory-efficient data structures.
- As a result, this implementation has a computational complexity of $O(b \times m)$ instead of $O(n \times m \times \log(m))$, where $b$ is the number of bins, $m$ is the number of training instances, and $n$ is the number of features.
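
To illustrate what binning means, the sketch below replaces a continuous feature with small integer bin indices, using evenly spaced quantiles as thresholds. This is an assumption-laden illustration of the idea only; Scikit-Learn's internal binning may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)  # one hypothetical continuous feature
max_bins = 255

# max_bins - 1 interior thresholds, placed at evenly spaced quantiles
quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
thresholds = np.quantile(feature, quantiles)

# Each value is replaced by the index of the bin it falls into (0 to 254)
binned = np.searchsorted(thresholds, feature).astype(np.uint8)

print(binned[:10], binned.dtype)
```

With at most 255 bins, every binned value fits in a single byte, which is what makes the compact data structures mentioned above possible.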