## Ensemble Learning 
Suppose you ask a complex question to thousands of random people, then aggregate their answers. In
many cases you will find that this aggregated answer is better than an expert’s answer. This is called the
wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as
classifiers or regressors), you will often get better predictions than with the best individual predictor. A
group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an
Ensemble Learning algorithm is called an Ensemble method.

For example, you can train a group of **Decision Tree classifiers**, each on a **different random subset** of the
training set. To make predictions, you just obtain the predictions of all **individual trees**, then predict the
class that gets the most votes. Such an ensemble of **Decision Trees** is called a **Random Forest**, and despite its simplicity, this is one of the most **powerful Machine Learning algorithms** available today.

### Voting Classifiers
Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may have a
Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors
classifier, and perhaps a few more.

A very simple way to create an even better classifier is to **aggregate the predictions** of each classifier and
predict the class that gets the **most votes**. This majority-vote classifier is called a **hard voting classifier**.

![VotingClassifiers Img](votingClassifier.PNG)

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in
the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than
random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there
are a sufficient number of weak learners and they are sufficiently diverse.

For example, suppose you build an ensemble containing 1,000 classifiers that are individually correct only
51% of the time (barely better than random guessing). If you predict the majority voted class, you can
hope for up to 75% accuracy! However, this is only true if all classifiers are perfectly independent,
making uncorrelated errors, which is clearly not the case since they are trained on the same data. They are
likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing
the ensemble’s accuracy.
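
The idealized case above can be checked numerically. The sketch below assumes perfectly independent classifiers (the assumption the text warns rarely holds) and sums the binomial probabilities of every outcome in which a strict majority is correct. The function name `majority_vote_accuracy` is introduced here purely for illustration.

```python
import math


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent classifiers is correct."""
    log_p = math.log(p_correct)
    log_q = math.log(1.0 - p_correct)
    total = 0.0
    # Sum P(exactly k classifiers correct) over every k that forms a strict majority.
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        # log of the binomial coefficient C(n, k), computed in log space to avoid overflow
        log_binom = (math.lgamma(n_classifiers + 1)
                     - math.lgamma(k + 1)
                     - math.lgamma(n_classifiers - k + 1))
        total += math.exp(log_binom + k * log_p + (n_classifiers - k) * log_q)
    return total


for n in (1, 100, 1_000, 10_000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The probability grows with the number of voters only because the errors are assumed uncorrelated; with correlated classifiers the real gain is smaller.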


The following code creates and trains a voting classifier in Scikit-Learn, composed of three diverse
classifiers:

In [36]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# make_moons returns a (data, labels) tuple of NumPy arrays
data, labels = make_moons(n_samples=10000)
# random_state fixes the shuffling, so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.20, random_state=42)

In [37]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
# probability=True gives SVC a predict_proba() method, which soft voting requires
svm_clf = SVC(probability=True)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'
)

# Fit each classifier (and the ensemble) and report its accuracy on the test set
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

# Scores from another run, kept for comparison:
# LogisticRegression 0.864
# RandomForestClassifier 0.872
# SVC 0.888
# VotingClassifier 0.896


LogisticRegression 0.895
RandomForestClassifier 1.0
SVC 1.0
VotingClassifier 0.999



If all classifiers are able to estimate **class probabilities** (i.e., they have a predict_proba() method),
then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the
individual classifiers. This is called **soft voting**. It often achieves higher performance than hard voting
because it gives more weight to highly confident votes.

To choose between the two, set **voting="hard"** or **voting="soft"**; soft voting requires that all classifiers can estimate class probabilities. This is not the case for the **SVC** class by default, so you need to set its probability hyperparameter to True (this makes the SVC class use cross-validation to estimate class probabilities, slowing down training, and adds
a predict_proba() method). The preceding code already uses soft voting with SVC(probability=True); in the run shown, the
voting classifier reaches 0.999 accuracy.
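
To see why the two voting schemes can disagree, here is a small NumPy sketch on one instance. The probabilities are made up for illustration and do not come from the classifiers trained above: two classifiers lean weakly towards class 0 while a third is very confident in class 1.

```python
import numpy as np

# Hypothetical predicted probabilities for a single instance.
# Rows are classifiers, columns are classes 0 and 1.
probas = np.array([
    [0.55, 0.45],  # classifier A leans slightly towards class 0
    [0.60, 0.40],  # classifier B leans slightly towards class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, mean_probas)
```

Hard voting follows the two lukewarm votes, whereas soft voting lets the confident classifier outweigh them.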

### Bagging and Pasting

Another approach is to use the same training algorithm for every predictor, but to train them on different
**random subsets** of the training set. When **sampling** is performed **with replacement**, this method is called
**bagging** (short for **bootstrap aggregating**). When sampling is performed without replacement, it is called **pasting**.

In other words, both **bagging** and **pasting** allow training instances to be **sampled several times** across
multiple **predictors**, but only **bagging** allows training instances to be **sampled several times for the same
predictor**.
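
The difference between the two sampling schemes can be sketched with NumPy. The sizes below (10 instances, 3 predictors, subsets of 6) are arbitrary values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances, n_predictors, subset_size = 10, 3, 6

# Bagging: sample WITH replacement, so an index may repeat inside one subset.
bagging_subsets = [rng.choice(n_instances, size=subset_size, replace=True)
                   for _ in range(n_predictors)]

# Pasting: sample WITHOUT replacement, so indices are unique inside each subset.
pasting_subsets = [rng.choice(n_instances, size=subset_size, replace=False)
                   for _ in range(n_predictors)]

for i, (bag, paste) in enumerate(zip(bagging_subsets, pasting_subsets)):
    print(f"predictor {i}: bagging={sorted(bag)} pasting={sorted(paste)}")
```

In both cases a given instance can show up in the subsets of several predictors; only the bagging subsets can contain the same instance twice.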


Predictors can all be trained in parallel, via different **CPU cores** or even different **servers**. Similarly, predictions can be made in parallel. This is one of the reasons why bagging and pasting are such popular methods: they scale very well.



#### Bagging and Pasting in Scikit-Learn
Scikit-Learn offers a simple API for both bagging and pasting with the **BaggingClassifier** class (or BaggingRegressor for regression). The following code trains an ensemble of **500 Decision Tree classifiers**, each trained on 100 training instances **randomly sampled** from the training set **with replacement** (this is an example of bagging, but if you want to use pasting instead, just set **bootstrap=False**). The **n_jobs** parameter tells Scikit-Learn the number of **CPU cores** to use for training and predictions (-1 tells Scikit-Learn to use all available cores):

In [38]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging
ends up with a slightly higher bias than pasting, but this also means that predictors end up being less
correlated so the ensemble’s variance is reduced. Overall, bagging often results in better models, which
explains why it is generally preferred. However, if you have spare time and CPU power you can use
cross-validation to evaluate both bagging and pasting and select the one that works best.
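
A minimal sketch of that comparison follows. It assumes a small noisy moons dataset and only 50 trees, both chosen here to keep the run short rather than taken from the text above; the only setting that differs between the two ensembles is `bootstrap`.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data: a small, noisy moons dataset (not the one used earlier).
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

scores = {}
for name, bootstrap in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50,
        max_samples=100, bootstrap=bootstrap, random_state=42,
    )
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Which variant wins depends on the data, so the scores should be read as a comparison on this toy set only.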


#### Out-of-Bag Evaluation
With bagging, some instances may be sampled several times for any given predictor, while others may not
be sampled at all. By default a BaggingClassifier samples m training instances with replacement
(bootstrap=True), where m is the size of the training set. This means that only about 63% of the training
instances are sampled on average for each predictor. The remaining 37% of the training instances that are
not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors.
Since a predictor never sees the oob instances during training, it can be evaluated on these instances,
without the need for a separate validation set or cross-validation. You can evaluate the ensemble itself by
averaging out the oob evaluations of each predictor.
In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an
automatic oob evaluation after training. The following code demonstrates this. The resulting evaluation
score is available through the oob_score_ variable:

In [43]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True
)
bag_clf.fit(X_train, y_train)
# The oob decision function for each training instance is also available: bag_clf.oob_decision_function_
bag_clf.oob_score_

0.99962499999999999

In [44]:
# The accuracy on the test set is close to the oob score above
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)


0.99850000000000005
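
The "about 63%" figure quoted above can be derived directly. When m instances are drawn with replacement from a set of size m, a given instance is missed by every draw with probability (1 - 1/m)^m, which approaches 1/e as m grows. The sketch below compares that formula with a quick simulation; the helper name `expected_sampled_fraction` is introduced here for illustration.

```python
import math
import random


def expected_sampled_fraction(m: int) -> float:
    """Chance that a given instance is drawn at least once in m draws with replacement."""
    return 1.0 - (1.0 - 1.0 / m) ** m


random.seed(0)
m = 10_000
# Simulate one bootstrap sample and count how many distinct instances it contains.
sampled = {random.randrange(m) for _ in range(m)}
observed_fraction = len(sampled) / m

print("formula:   ", expected_sampled_fraction(m))
print("limit:     ", 1 - math.exp(-1))
print("simulation:", observed_fraction)
```

The remaining instances, roughly 37% of the training set, are the out-of-bag instances for that predictor.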

### Random Patches and Random Subspaces
The **BaggingClassifier** class supports sampling the features as well. This is controlled by two hyperparameters: **max_features** and **bootstrap_features**. They work the same way as **max_samples** and **bootstrap**, but for **feature sampling** instead of **instance sampling**. Thus, each predictor will be trained on a **random subset** of the input features.
This is particularly useful when you are dealing with **high-dimensional inputs** (such as images). Sampling both training instances and features is called the **Random Patches method**. Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method. Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
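
As an illustration of the two configurations, the sketch below sets up one ensemble of each kind. The dataset (20 synthetic features) and the sampling ratios are assumptions picked for the example, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data with 20 features, so that feature sampling has something to sample.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)

# Random Patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True, random_state=42,
)

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False, random_state=42,
)

for clf in (patches_clf, subspaces_clf):
    clf.fit(X, y)

# estimators_features_ holds the feature indices handed to each tree.
print(sorted(subspaces_clf.estimators_features_[0]))
```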

### Random Forests
As we have discussed, a **Random Forest** is an ensemble of **Decision Trees**, generally trained via the **bagging method** (or sometimes pasting), typically with **max_samples** set to the size of the training set. Instead of building a **BaggingClassifier** and passing it a **DecisionTreeClassifier**, you can instead use the **RandomForestClassifier** class, which is more convenient and optimized for **Decision Trees** (similarly, there is a **RandomForestRegressor** class for **regression** tasks). 
The following code trains a **Random Forest classifier** with 500 trees (each limited to a maximum of **16 leaf nodes**), using all available CPU cores:

In [45]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test) 

With a few exceptions, a **RandomForestClassifier** has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a **BaggingClassifier** to control the ensemble itself. The **Random Forest** algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model. 

The following BaggingClassifier is roughly equivalent to the previous **RandomForestClassifier**:

In [46]:
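bag_clf = BaggingClassifier(
    # max_features="sqrt" makes each split consider a random subset of features, as a Random Forest does
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1
)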

#### Extra-Trees
When you are growing a tree in a Random Forest, at each node only a random subset of the features is
considered for splitting (as discussed earlier). It is possible to make trees even more random by also
using random thresholds for each feature rather than searching for the best possible thresholds (like
regular Decision Trees do).
A forest of such extremely random trees is simply called an Extremely Randomized Trees ensemble (or
Extra-Trees for short). Once again, this trades more bias for a lower variance. It also makes Extra-Trees
much faster to train than regular Random Forests since finding the best possible threshold for each feature
at every node is one of the most time-consuming tasks of growing a tree.
You can create an Extra-Trees classifier using Scikit-Learn’s ExtraTreesClassifier class. Its API is
identical to the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the
same API as the RandomForestRegressor class.


#### TIP
It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier.
Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using
grid search).
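
A minimal sketch of such a comparison is shown below. The dataset, the number of trees, and the small `max_leaf_nodes` grid are assumptions chosen to keep the search fast; a real search would usually cover more hyperparameters.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumed toy data: a small, noisy moons dataset.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Hypothetical grid: only the tree size is tuned here.
param_grid = {"max_leaf_nodes": [8, 16, None]}

results = {}
for model in (RandomForestClassifier(n_estimators=50, random_state=42),
              ExtraTreesClassifier(n_estimators=50, random_state=42)):
    # Grid search with 3-fold cross-validation, scored by accuracy
    search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    results[type(model).__name__] = (search.best_score_, search.best_params_)

for name, (score, params) in results.items():
    print(f"{name}: {score:.3f} with {params}")
```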

#### Feature Importance
Lastly, if you look at a single Decision Tree, important features are likely to appear closer to the root of the tree, while unimportant features will often appear closer to the leaves (or not at all). Scikit-Learn estimates a feature’s importance by measuring how much the tree nodes that use that feature reduce impurity, weighted by the number of training samples that reach each node and averaged across all trees in the forest. It computes this automatically for every feature after training and scales the results so that they sum to 1. You can access the result using the feature_importances_ variable. For example, the following code trains a RandomForestClassifier on the iris dataset and outputs each feature’s importance. It seems that the most important features are the petal length (44%) and width (45%), while sepal length and width are rather unimportant in comparison (8% and 2%, respectively):

In [49]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score*100)

sepal length (cm) 7.97603502116
sepal width (cm) 2.23496633708
petal length (cm) 44.3933630567
petal width (cm) 45.3956355851


Similarly, you can train a Random Forest classifier on the MNIST dataset (introduced in Chapter 3) and
plot each pixel’s importance to see which parts of the image the classifier relies on.

Random Forests are very handy to get a quick understanding of what features actually matter, in particular
if you need to perform feature selection.


### Boosting

Boosting (originally called hypothesis boosting) refers to any **Ensemble method** that can combine several **weak learners** into a **strong learner**. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are **AdaBoost** (short for Adaptive Boosting) and **Gradient Boosting**. 

#### AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training
instances that the predecessor underfitted. This results in new predictors focusing more and more on the
hard cases. This is the technique used by AdaBoost.
For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained
and used to make predictions on the training set. The relative weight of misclassified training instances is
then increased. A second classifier is trained using the updated weights, it again makes predictions on
the training set, the weights are updated once more, and so on.

The first classifier
gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job
on these instances, and so on. The plot on the right represents the same sequence of predictors except that
the learning rate is halved (i.e., the misclassified instance weights are boosted half as much at every
iteration). As you can see, this sequential learning technique has some similarities with Gradient Descent,
except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost
adds predictors to the ensemble, gradually making it better.
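
The reweighting step described above can be sketched in a few lines of NumPy. The text does not spell out the formulas, so this sketch assumes the usual AdaBoost update: the predictor weight is the learning rate times log((1 - r) / r), where r is the weighted error rate, and the weights of misclassified instances are multiplied by the exponential of that value before renormalizing. The function name `adaboost_round` and the toy labels are illustrative.

```python
import numpy as np


def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One reweighting step; returns (predictor_weight, new_instance_weights).

    Assumes the weighted error rate is strictly between 0 and 1.
    """
    misclassified = y_true != y_pred
    # Weighted error rate of the current predictor
    error = weights[misclassified].sum() / weights.sum()
    # Predictor weight: larger for more accurate predictors, 0 at chance level
    alpha = learning_rate * np.log((1.0 - error) / error)
    # Boost the weights of the misclassified instances, then renormalize
    new_weights = weights * np.exp(alpha * misclassified)
    return alpha, new_weights / new_weights.sum()


y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0])  # hypothetical predictions: only the last one is wrong
weights = np.full(len(y_true), 1 / len(y_true))

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
alpha_half, new_weights_half = adaboost_round(weights, y_true, y_pred, learning_rate=0.5)

print(alpha, new_weights)
print(alpha_half, new_weights_half)
```

With the learning rate halved, the misclassified instance still gains weight, but less sharply, which matches the slower correction described above.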

The following code trains an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn’s
AdaBoostClassifier class (as you might expect, there is also an AdaBoostRegressor class). A
Decision Stump is a Decision Tree with max_depth=1 — in other words, a tree composed of a single
decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:

In [50]:
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    # algorithm is left at its default: the "SAMME.R" option is deprecated in recent Scikit-Learn releases
    learning_rate=0.5
)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.5, n_estimators=200, random_state=None)

If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly
regularizing the base estimator.

### Gradient Boosting
Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting
works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However,
instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the
new predictor to the residual errors made by the previous predictor.
Let’s go through a simple regression example using Decision Trees as the base predictors (of course
Gradient Boosting also works great with classification tasks). This is called Gradient Tree Boosting, or
Gradient Boosted Regression Trees (GBRT). First, let’s fit a DecisionTreeRegressor to the training set
(for example, a noisy quadratic training set):

In [52]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

# Now train a second DecisionTreeRegressor on the residual errors made by the first predictor
y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)

# Then train a third regressor on the residual errors made by the second predictor
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

# The ensemble now contains three trees. It can make predictions on a new instance X_new
# simply by adding up the predictions of all the trees:
# y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
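
To watch the residual-fitting idea at work on the kind of noisy quadratic data mentioned above, the sketch below generates such a dataset (the coefficients, noise level, and variable names `X_quad` and `y_quad` are assumptions made for this example) and records the training error of the ensemble after each tree is added.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: y = 3x^2 plus Gaussian noise.
rng = np.random.default_rng(42)
X_quad = rng.uniform(-0.5, 0.5, size=(100, 1))
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = []
training_mse = []
residual = y_quad.copy()
for _ in range(3):
    # Each new tree is fit to whatever the ensemble built so far still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_quad, residual)
    trees.append(tree)
    residual = residual - tree.predict(X_quad)
    # The ensemble prediction is the sum of the predictions of all trees so far.
    ensemble_pred = sum(t.predict(X_quad) for t in trees)
    training_mse.append(mean_squared_error(y_quad, ensemble_pred))

print(training_mse)
```

The training error cannot increase from one tree to the next here, because each tree predicts the mean residual in each of its leaves.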

A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor class.
Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision
Trees (e.g., max_depth, min_samples_leaf, and so on), as well as hyperparameters to control the
ensemble training, such as the number of trees (n_estimators). The following code creates the same
ensemble as the previous one:

In [53]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=3, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such
as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually
generalize better. This is a regularization technique called shrinkage. Figure 7-10 shows two GBRT
ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the
training set, while the one on the right has too many trees and overfits the training set.

In order to find the optimal number of trees, you can use early stopping (see Chapter 4). A simple way to
implement this is to use the staged_predict() method: it returns an iterator over the predictions made
by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a
GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the
optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [55]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(data, labels)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)
# Validation error of the ensemble with 1 tree, 2 trees, ..., 120 trees
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# argmin is zero-based (index 0 corresponds to one tree), so add 1 to get the number of trees
bst_n_estimators = int(np.argmin(errors)) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=119,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

It is also possible to implement early stopping by actually stopping training early (instead of training a
large number of trees first and then looking back to find the optimal number). You can do so by setting
warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called,
allowing incremental training. The following code stops training when the validation error does not
improve for five iterations in a row:

In [56]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # early stopping

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies
the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then
each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by
now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique
is called Stochastic Gradient Boosting.

It is possible to use Gradient Boosting with other cost functions. This is controlled by the loss hyperparameter (see Scikit-Learn’s
documentation for more details).

### Stacking
The last Ensemble method we will discuss in this chapter is called stacking (short for stacked
generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to
aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this
aggregation? Figure 7-12 shows such an ensemble performing a regression task on a new instance. Each
of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor
(called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction
(3.0).
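
One way to try this idea is Scikit-Learn’s StackingRegressor class. The sketch below is only an illustration under assumed settings: the synthetic dataset, the three base regressors, and the linear blender are choices made for this example, not the setup behind the figure described above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumed toy regression data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("knn", KNeighborsRegressor()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=LinearRegression(),  # the blender (meta learner)
    cv=5,  # the blender is trained on cross-validated base predictions
)
stack.fit(X_tr, y_tr)

# R^2 score of the stacked ensemble on held-out data
print(stack.score(X_te, y_te))
```

Training the blender on cross-validated predictions keeps it from simply trusting whichever base predictor memorized the training set best.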