# Ensemble Learning and Random Forests

A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

### Important point in this cell
As an example of an Ensemble method, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.

## Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.

This voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.

How is this possible? The following analogy can help shed some light on this mystery.
 Suppose you have a slightly biased coin that has a 51% chance of coming up heads
 and 49% chance of coming up tails. If you toss it 1,000 times, you will generally get
 more or less 510 heads and 490 tails, and hence a majority of heads. If you do the
 math, you will find that the probability of obtaining a majority of heads after 1,000
 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g.,
 with 10,000 tosses, the probability climbs over 97%). This is due to the law of large
 numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the
 probability of heads (51%). 
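
The coin-toss figures can be checked numerically. The sketch below sums the binomial probabilities of every outcome with strictly more heads than tails; the function name `prob_majority_heads` is made up for this illustration. Counting only strict majorities gives a value slightly below the "close to 75%" quoted above for 1,000 tosses, and a value above 97% for 10,000 tosses.

```python
import math


def prob_majority_heads(n_tosses, p_heads=0.51):
    """Probability of getting strictly more heads than tails in n_tosses tosses."""
    log_p, log_q = math.log(p_heads), math.log(1.0 - p_heads)
    total = 0.0
    # Sum the binomial probabilities for every head count above n_tosses / 2.
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # Work in log space so that large binomial coefficients do not overflow.
        log_coeff = (math.lgamma(n_tosses + 1)
                     - math.lgamma(k + 1)
                     - math.lgamma(n_tosses - k + 1))
        total += math.exp(log_coeff + k * log_p + (n_tosses - k) * log_q)
    return total


p_1000 = prob_majority_heads(1_000)
p_10000 = prob_majority_heads(10_000)
print(f"1,000 tosses:  {p_1000:.3f}")
print(f"10,000 tosses: {p_10000:.3f}")
```

The same reasoning applies to an ensemble only if the classifiers make independent errors, which is why diversity matters so much in practice.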

### Training Ensemble Classifiers

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [2]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
X, y = make_moons(n_samples=100, noise=0.15)

In [4]:
len(X)

100

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [6]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

In [7]:
voting_clf = VotingClassifier(
 estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
 voting='hard')

In [8]:
voting_clf.fit(X_train, y_train)

In [9]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.85
RandomForestClassifier 0.95
SVC 1.0
VotingClassifier 0.95


### Important point in this cell
If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method). If you modify the preceding code to use soft voting, the accuracy you get depends on the random train/test split. The original figure of over 91.2% accuracy was not measured on the split shown above (where hard voting already reaches 0.95), so treat it as indicative only.
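
As a minimal sketch of the soft voting setup described above: the dataset size, noise level, `random_state` values and the `_sv` variable names are assumptions chosen for this illustration, not the settings used in the cells above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data; fixed seeds make the run repeatable.
X_sv, y_sv = make_moons(n_samples=500, noise=0.30, random_state=42)
X_sv_train, X_sv_test, y_sv_train, y_sv_test = train_test_split(
    X_sv, y_sv, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # probability=True gives SVC a predict_proba() method (slower training).
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average the class probabilities instead of counting votes
)
soft_voting_clf.fit(X_sv_train, y_sv_train)

soft_accuracy = accuracy_score(y_sv_test, soft_voting_clf.predict(X_sv_test))
print("soft voting accuracy:", soft_accuracy)
```

Comparing this number with a `voting="hard"` run on the same split is the fairest way to see whether soft voting helps for a given dataset.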

##  Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
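
The difference between the two sampling schemes can be illustrated with NumPy alone. This is only a sketch: the training-set size of 10 and the subset size of 6 are arbitrary values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10                      # hypothetical training-set size
indices = np.arange(m)

# Bagging: sample WITH replacement, so the same index can appear several times.
bagging_sample = rng.choice(indices, size=m, replace=True)

# Pasting: sample WITHOUT replacement, so every index appears at most once.
pasting_sample = rng.choice(indices, size=6, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
```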

Once all predictors are trained, the ensemble can make a prediction for a new
 instance by simply aggregating the predictions of all predictors. The aggregation
 function is typically the statistical mode (i.e., the most frequent prediction, just like a
 hard voting classifier) for classification, or the average for regression. Each individual
 predictor has a higher bias than if it were trained on the original training set, but
 aggregation reduces both bias and variance. Generally, the net result is that the
 ensemble has a similar bias but a lower variance than a single predictor trained on the
 original training set.
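
The two aggregation functions mentioned above can be sketched on made-up predictions. The arrays below are hypothetical outputs of already-trained predictors, with one row per predictor and one column per instance.

```python
import numpy as np

# Hypothetical class predictions of 5 predictors (rows) for 4 instances (columns).
class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])
# Classification: take the statistical mode (most frequent class) per instance.
majority_vote = np.array([np.bincount(col).argmax() for col in class_preds.T])

# Hypothetical regression predictions of 3 predictors for 2 instances.
reg_preds = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
    [3.0, 4.0],
])
# Regression: take the average per instance.
mean_pred = reg_preds.mean(axis=0)

print(majority_vote)  # → [0 1 1 0]
print(mean_pred)      # → [3. 4.]
```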

### Important point 
Predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging and pasting are such popular methods: they scale very well.

## Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression). The following code trains an ensemble of 500 Decision Tree classifiers: each is trained on 80 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):

In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [11]:
bag_clf = BaggingClassifier(
 DecisionTreeClassifier(), n_estimators=500,
 max_samples=80, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
 

In [12]:
y_pred = bag_clf.predict(X_test)

### Important point
The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.

In [13]:
accuracy_score(y_test, y_pred)

0.95

## Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor,
 while others may not be sampled at all. By default a BaggingClassifier samples m
 training instances with replacement (bootstrap=True), where m is the size of the
 training set. This means that only about 63% of the training instances are sampled on
 average for each predictor. The remaining 37% of the training instances that are not
 sampled are called out-of-bag (oob) instances. Note that they are not the same 37%
 for all predictors.
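
The 63% figure follows from a short calculation: when drawing m instances with replacement, a given instance is missed by every draw with probability (1 - 1/m)^m, which tends to 1/e ≈ 0.37 as m grows. The sketch below checks this both with the formula and with a small simulation; the helper name `expected_sampled_fraction` and the value m = 1000 are assumptions made for the example.

```python
import math
import random


def expected_sampled_fraction(m):
    """Expected fraction of distinct instances in a bootstrap sample of size m."""
    return 1 - (1 - 1 / m) ** m


random.seed(0)
m = 1000
# Draw m indices with replacement and count how many distinct ones were picked.
bootstrap_sample = [random.randrange(m) for _ in range(m)]
observed_fraction = len(set(bootstrap_sample)) / m

print("formula:  ", round(expected_sampled_fraction(m), 4))
print("limit:    ", round(1 - math.exp(-1), 4))
print("simulated:", observed_fraction)
```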

### Important
Since a predictor never sees the oob instances during training, it can be evaluated on
 these instances, without the need for a separate validation set. You can evaluate the
 ensemble itself by averaging out the oob evaluations of each predictor.

In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to
 request an automatic oob evaluation after training.

In [14]:
# oob_score=True requests an out-of-bag evaluation once training is done
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)

In [15]:
bag_clf.fit(X_train, y_train)

In [16]:
bag_clf.oob_score_

0.9125

In [17]:
y_pred = bag_clf.predict(X_test)

In [18]:
accuracy_score(y_test, y_pred)

0.95

The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case (since the base estimator has a predict_proba() method), the decision function returns the class probabilities for each training instance. For example, in the output below the oob evaluation estimates that the first training instance has a 100% probability of belonging to the positive class (and 0% of belonging to the negative class), while the fifth instance is split evenly between the two classes:

In [19]:
bag_clf.oob_decision_function_

array([[0.        , 1.        ],
       [1.        , 0.        ],
       [0.98404255, 0.01595745],
       [0.95321637, 0.04678363],
       [0.5       , 0.5       ],
       [0.0326087 , 0.9673913 ],
       [0.72251309, 0.27748691],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.79096045, 0.20903955],
       [0.46774194, 0.53225806],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.65625   , 0.34375   ],
       [0.6974359 , 0.3025641 ],
       [1.        , 0.        ],
       [0.98780488, 0.01219512],
       [0.96756757, 0.03243243],
       [0.        , 1.        ],
       [0.0797546 , 0.9202454 ],
       [0.1875    , 0.8125    ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.78125   , 0.21875   ],
       [0.02197802, 0.97802198],
       [1.        , 0.        ],
       [0.75806452, 0.24193548],
       [0.03658537, 0.96341463],
       [0.        , 1.        ],
       [0.

## Random Patches and Random Subspaces

### Important point
 
The BaggingClassifier class supports sampling the features as well. Sampling is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

### Important, definition of Features

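Features are the input variables (columns) of your dataset that the model uses to make its predictions; they are distinct from the label, which is the value being predicted. For example, to predict the price of a car you would need features such as mileage, age of the car, and engine type.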

This technique is particularly useful when you are dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called the Random Patches method. Keeping all training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (by setting bootstrap_features to True and/or max_features to a value smaller than 1.0) is called the Random Subspaces method.
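
The two methods differ only in how the BaggingClassifier hyperparameters are set. The sketch below is one possible configuration: the synthetic 20-feature dataset, the sampling ratios and the `_hd` variable names are assumptions chosen for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features, standing in for high-dimensional inputs.
X_hd, y_hd = make_classification(n_samples=200, n_features=20, random_state=42)

# Random Patches: sample both the training instances and the features.
random_patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42)

# Random Subspaces: keep every training instance, sample only the features.
random_subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=42)

random_patches_clf.fit(X_hd, y_hd)
random_subspaces_clf.fit(X_hd, y_hd)

# estimators_features_ lists the feature indices drawn for each predictor.
print(random_subspaces_clf.estimators_features_[0])
```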

## Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (similarly, there is a RandomForestRegressor class for regression tasks). The following code uses all available CPU cores to train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes):

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

In [22]:
y_pred_rf = rnd_clf.predict(X_test)

### Important
With a few exceptions, RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. The algorithm results in greater tree diversity, which (again) trades a higher bias for a lower variance, generally yielding an overall better model.

In [23]:
# This BaggingClassifier of Decision Trees is roughly equivalent to the previous RandomForestClassifier
bag_clf = BaggingClassifier(
 DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
 n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)

In [24]:
y_pred_rf = bag_clf.predict(X_test)

## Extra-Trees (IMPORTANT)

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting (as discussed earlier). It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).

A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this technique trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

You can create an Extra-Trees classifier using Scikit-Learn’s ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class.
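
Because the API matches RandomForestClassifier, switching to Extra-Trees is a one-line change. The sketch below assumes a synthetic moons dataset and arbitrary hyperparameter values; the `_et` variable names are made up for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data, kept separate from the notebook's X_train / X_test.
X_et, y_et = make_moons(n_samples=500, noise=0.30, random_state=42)
X_et_train, X_et_test, y_et_train, y_et_test = train_test_split(
    X_et, y_et, random_state=42)

# Same constructor arguments as a RandomForestClassifier would take.
extra_clf = ExtraTreesClassifier(
    n_estimators=100, max_leaf_nodes=16, n_jobs=-1, random_state=42)
extra_clf.fit(X_et_train, y_et_train)

extra_accuracy = accuracy_score(y_et_test, extra_clf.predict(X_et_test))
print("Extra-Trees accuracy:", extra_accuracy)
```

It is hard to tell in advance whether a Random Forest or an Extra-Trees ensemble will perform better on a given dataset; trying both and comparing them with cross-validation is usually the only way to know.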

## Feature Importance(IMPORTANT)

Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the feature_importances_ variable.

The following code trains a RandomForestClassifier on the iris dataset and outputs each feature’s importance. In the run shown here, the most important features are the petal width (about 45%) and petal length (about 43%), while sepal length and width are rather unimportant in comparison (about 10% and 2%, respectively):

In [25]:
from sklearn.datasets import load_iris

In [26]:
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

In [27]:
iris["data"]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [32]:
iris['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [28]:
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(f'feature name {name} and score:- {score}')

feature name sepal length (cm) and score:- 0.09768304982752864
feature name sepal width (cm) and score:- 0.02330916466481691
feature name petal length (cm) and score:- 0.4292289745633719
feature name petal width (cm) and score:- 0.44977881094428274


### Important
Random Forests are very handy to get a quick understanding of which features actually matter, in particular if you need to perform feature selection.
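
One way to turn feature importances into feature selection is Scikit-Learn's SelectFromModel transformer. This is a sketch of one possible approach, not something the cells above use: the "mean" threshold, the number of trees and the `iris_demo` name are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris_demo = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(iris_demo.data, iris_demo.target)

# Keep only the features whose importance is at least the mean importance.
# prefit=True reuses the forest trained above instead of fitting a new one.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(iris_demo.data)

kept = [name for name, keep
        in zip(iris_demo.feature_names, selector.get_support()) if keep]
print("kept features:", kept)
print("reduced shape:", X_selected.shape)
```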

## BOOSTING

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

### AdaBoost (Adaptive Boosting)

When training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.

#### Difference between Gradient Descent and AdaBoost
Instead of tweaking a single predictor’s parameters to minimize a cost function, as Gradient Descent does, AdaBoost adds predictors to the ensemble, gradually making it better.

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

#### NOTE
There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

## AdaBoost Working

### STEP 1 
Each instance weight $w^{(i)}$ is initially set to $1/m$. A first predictor is trained, and its weighted error rate $r_1$ is computed on the training set.

### STEP 2
The predictor’s weight $\alpha_j$ is then computed. The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong (i.e., less accurate than random guessing), then its weight will be negative.

### STEP 3
Next, the AdaBoost algorithm updates the instance weights, which boosts the weights of the misclassified instances. Then all the instance weights are normalized.

### STEP 4
Finally, a new predictor is trained using the updated weights, and the whole process is repeated (the new predictor’s weight is computed, the instance weights are updated, then another predictor is trained, and so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
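
The quantities named in the four steps can be written out explicitly. What follows is the usual textbook formulation of AdaBoost, given as a reference sketch: the learning-rate symbol $\eta$ and the notation $\hat{y}_j^{(i)}$ for the $j$-th predictor's prediction on instance $i$ are assumptions, since the steps above do not define them.

$$
r_j = \frac{\sum_{i:\ \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}},
\qquad
\alpha_j = \eta \, \log \frac{1 - r_j}{r_j}
$$

$$
w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
\qquad \text{then} \qquad
w^{(i)} \leftarrow \frac{w^{(i)}}{\sum_{k=1}^{m} w^{(k)}}
$$

These formulas agree with the description in STEP 2: a predictor with $r_j = 0.5$ (random guessing) gets $\alpha_j = 0$, and a predictor with $r_j > 0.5$ gets a negative weight.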

### AdaBoost Predictions
To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of weighted votes (the argmax over classes).
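
The training steps and the weighted vote can be sketched from scratch in a few lines. This is a simplified illustration for a two-class problem, not Scikit-Learn's implementation: the dataset, the learning rate `eta`, the number of predictors and the helper `ada_predict` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ada, y_ada = make_moons(n_samples=200, noise=0.2, random_state=42)
m = len(X_ada)
eta = 1.0            # learning rate (assumed value)
n_predictors = 20    # desired number of predictors (assumed value)

weights = np.full(m, 1.0 / m)   # STEP 1: every instance weight starts at 1/m
stumps, alphas = [], []

for _ in range(n_predictors):
    # Train a Decision Stump on the currently weighted training set.
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_ada, y_ada, sample_weight=weights)
    wrong = stump.predict(X_ada) != y_ada

    # Weighted error rate, clipped so that the logarithm below stays finite.
    r = np.clip(weights[wrong].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = eta * np.log((1 - r) / r)   # STEP 2: predictor weight
    stumps.append(stump)
    alphas.append(alpha)
    if not wrong.any():                 # a perfect predictor was found
        break

    weights[wrong] *= np.exp(alpha)     # STEP 3: boost misclassified instances
    weights /= weights.sum()            # ... then normalize all the weights


def ada_predict(X):
    """Return the class with the largest sum of predictor weights."""
    votes = np.zeros((len(X), 2))       # one column per class (0 and 1)
    rows = np.arange(len(X))
    for stump, alpha in zip(stumps, alphas):
        votes[rows, stump.predict(X)] += alpha
    return votes.argmax(axis=1)


train_accuracy = np.mean(ada_predict(X_ada) == y_ada)
print("training accuracy:", train_accuracy)
```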

## AdaBoost in Scikit-Learn

Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which stands for
 Stagewise Additive Modeling using a Multiclass Exponential loss function). When there
 are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate
 class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use
 a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class
 probabilities rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn’s AdaBoostClassifier class (as you might expect, there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1, in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:

In [29]:
from sklearn.ensemble import AdaBoostClassifier

In [33]:
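# Note: algorithm="SAMME.R" is deprecated in recent scikit-learn releases;
# if your version rejects it, drop the argument to fall back to SAMME.
ada_clf = AdaBoostClassifier(
 DecisionTreeClassifier(max_depth=1), n_estimators=200,
 algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)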

#### TIP
If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.

## Gradient Boosting

Just like AdaBoost,
 Gradient Boosting works by sequentially adding predictors to an ensemble, each one
 correcting its predecessor. However, instead of tweaking the instance weights at every
 iteration like AdaBoost does, this method tries to fit the new predictor to the residual
 errors made by the previous predictor.

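Gradient Boosting also works great with regression tasks. When Decision Trees are used as the base predictors, this is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).

First, let’s fit a DecisionTreeRegressor to the training set. A noisy quadratic training set is the typical example; the cells below instead reuse the moons X and y created earlier, so the targets are the 0/1 class labels: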

In [34]:
from sklearn.tree import DecisionTreeRegressor

#### Training First predictor

In [36]:
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

#### Next, we’ll train a second DecisionTreeRegressor on the residual errors made by the first predictor:

In [37]:
y2 = y - tree_reg1.predict(X)
y2

array([-0.13793103,  0.        ,  0.        , -0.13793103,  0.86206897,
       -0.13793103, -0.13793103, -0.13793103,  0.        ,  0.86206897,
       -0.13793103,  0.        , -0.13793103,  0.        , -0.13793103,
        0.        ,  0.        ,  0.        ,  0.        ,  0.86206897,
        0.        ,  0.        ,  0.86206897,  0.        ,  0.        ,
       -0.13793103,  0.        ,  0.        , -0.13793103, -0.13793103,
       -0.13793103,  0.        , -0.13793103,  0.        , -0.13793103,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.86206897,  0.        ,
        0.        , -0.13793103,  0.        , -0.13793103,  0.        ,
       -0.13793103, -0.13793103,  0.        , -0.13793103,  0.        ,
       -0.13793103, -0.13793103, -0.13793103,  0.        ,  0.        ,
        0.        , -0.13793103, -0.13793103, -0.13793103, -0.13793103,
       -0.13793103, -0.13793103,  0.        , -0.13793103, -0.13

In [38]:
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

#### Then we train a third regressor on the residual errors made by the second predictor:

In [39]:
y3 = y2 - tree_reg2.predict(X)

In [40]:
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X,y3)

#### Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

In [65]:
X_new = X[3]

In [66]:
X_new.shape

(2,)

In [67]:
X_new = X_new.reshape(1,-1)

In [68]:
X_new

array([[0.32433632, 0.88126557]])

In [69]:
y_pred = tree_reg1.predict(X_new)
y_pred

array([0.13793103])

In [71]:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([-0.00889163])
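
The same residual-fitting procedure can be written as a loop on a noisy quadratic dataset. This is a minimal sketch: the synthetic data, the `_quad` variable names and the helper `ensemble_predict` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic training set (synthetic, for illustration only).
rng = np.random.default_rng(42)
X_quad = rng.uniform(-0.5, 0.5, size=(100, 1))
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees, stage_mse = [], []
residuals = y_quad.copy()
for _ in range(3):
    # Each new tree is fitted to what the previous trees still get wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_quad, residuals)
    trees.append(tree)
    residuals = residuals - tree.predict(X_quad)
    stage_mse.append(np.mean(residuals ** 2))


def ensemble_predict(X_new):
    """Add up the predictions of all the trees."""
    return sum(tree.predict(X_new) for tree in trees)


print("training MSE after each tree:", stage_mse)
```

The training error can only go down (or stay equal) as trees are added, because each tree's leaf values are the means of the residuals that fall into that leaf.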

A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor class. Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g., max_depth, min_samples_leaf), as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators). The following code creates an ensemble comparable to the previous one: three depth-2 trees, each fitted to the residuals of its predecessors, although Scikit-Learn starts from a constant initial prediction rather than from a first tree fitted directly to y:

In [72]:
from sklearn.ensemble import GradientBoostingRegressor

In [73]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

#### Note
The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.

### Important
In order to find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [74]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [75]:
X_train, X_val, y_train, y_val = train_test_split(X, y)

In [82]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

In [84]:
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
errors

[0.22027882142857144,
 0.1906110747063723,
 0.17050472201588726,
 0.14998140529485515,
 0.13349677897780338,
 0.1223331340658766,
 0.11100009808867506,
 0.09977505593679409,
 0.09228595064675296,
 0.08736117787727295,
 0.08226849004867803,
 0.07878153552203054,
 0.07620362689551341,
 0.07516572123709107,
 0.07346425465108297,
 0.07315548659825448,
 0.07228428095143254,
 0.06924840064777732,
 0.06812498093055389,
 0.06809270691093264,
 0.06796655815073338,
 0.06626978330081403,
 0.06643784079808546,
 0.06635095502594787,
 0.06382278227185517,
 0.06383797459649916,
 0.06376223521669509,
 0.061698160548311974,
 0.06180788048636669,
 0.06187710954642034,
 0.06186131545217432,
 0.06005814877202423,
 0.06044052615211406,
 0.06055746207496067,
 0.05935218993105232,
 0.05945400053289835,
 0.05966473864171993,
 0.05917685964638922,
 0.05927294530007571,
 0.057721875333302475,
 0.057853345822086524,
 0.05557162612665618,
 0.05515026131206759,
 0.05474274473976542,
 0.053478950259952134,
 0.05313

In [85]:
np.argmin(errors) + 1

119

In [86]:
bst_n_estimators = np.argmin(errors) + 1

In [87]:
gbrt_best = GradientBoostingRegressor(max_depth=2,n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:

In [88]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

In [90]:
min_val_error = float("inf")
error_going_up = 0  # consecutive iterations without improvement (name used by the loop below)

In [91]:
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping
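
GradientBoostingRegressor can also handle this kind of early stopping internally through its n_iter_no_change and validation_fraction hyperparameters. The sketch below shows one way to use them; the synthetic data, the `_es` names and all hyperparameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data (synthetic, for illustration only).
rng = np.random.default_rng(42)
X_es = rng.uniform(-0.5, 0.5, size=(300, 1))
y_es = 3 * X_es[:, 0] ** 2 + 0.05 * rng.standard_normal(300)

gbrt_es = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.1,
    n_estimators=500,          # upper bound on the number of trees
    n_iter_no_change=5,        # stop after 5 iterations without improvement
    validation_fraction=0.2,   # share of the training data held out internally
    random_state=42)
gbrt_es.fit(X_es, y_es)

# n_estimators_ is the number of trees actually trained.
print("trees trained:", gbrt_es.n_estimators_)
```

Unlike the manual loop, the validation data here is carved out of the training set passed to fit(), so a separate X_val is not needed.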
        

### IMPORTANT
The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a lower variance. It also speeds up training considerably. This is called Stochastic Gradient Boosting.
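
A minimal sketch of Stochastic Gradient Boosting with subsample=0.25 follows. The synthetic dataset, the `_sgb` names and the other hyperparameter values are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data (synthetic, for illustration only).
rng = np.random.default_rng(42)
X_sgb = rng.uniform(-0.5, 0.5, size=(200, 1))
y_sgb = 3 * X_sgb[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# Each tree is trained on a random 25% of the training instances.
sgb = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42)
sgb.fit(X_sgb, y_sgb)

train_mse = np.mean((y_sgb - sgb.predict(X_sgb)) ** 2)
print("training MSE:", train_mse)
```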

## XGBoost

XGBoost stands for Extreme Gradient Boosting. This package was initially developed by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC), and it aims to be extremely fast, scalable, and portable. In fact, XGBoost is often an important component of the winning entries in ML competitions. XGBoost’s API is quite similar to Scikit-Learn’s:

In [93]:
import xgboost

In [97]:
xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

XGBoost also offers several nice features, such as automatically taking care of early
 stopping:

Early Stopping
When a model is trained with early stopping, the native Python interface behaves differently from the sklearn and R interfaces. By default, the R and sklearn interfaces automatically use best_iteration, so predictions come from the best model. With the native Python interface, however, xgboost.Booster.predict() and xgboost.Booster.inplace_predict() use the full model. Users can combine the best_iteration attribute with the iteration_range parameter to get the same behavior. The save_best parameter of xgboost.callback.EarlyStopping might also be useful.

In [99]:
# Older XGBoost releases accepted early_stopping_rounds=2 in fit(); recent releases expect it
# in the XGBRegressor constructor instead, so this call trains without early stopping.
xgb_reg.fit(X_train, y_train,
 eval_set=[(X_val, y_val)])
y_pred = xgb_reg.predict(X_val)

[0]	validation_0-rmse:0.39428
[1]	validation_0-rmse:0.33011
[2]	validation_0-rmse:0.29614
[3]	validation_0-rmse:0.28055
[4]	validation_0-rmse:0.27481
[5]	validation_0-rmse:0.27375
[6]	validation_0-rmse:0.27460
[7]	validation_0-rmse:0.27224
[8]	validation_0-rmse:0.27090
[9]	validation_0-rmse:0.27103
[10]	validation_0-rmse:0.27091
[11]	validation_0-rmse:0.27076
[12]	validation_0-rmse:0.26957
[13]	validation_0-rmse:0.26862
[14]	validation_0-rmse:0.26837
[15]	validation_0-rmse:0.26815
[16]	validation_0-rmse:0.26765
[17]	validation_0-rmse:0.26740
[18]	validation_0-rmse:0.26725
[19]	validation_0-rmse:0.26692
[20]	validation_0-rmse:0.26672
[21]	validation_0-rmse:0.26658
[22]	validation_0-rmse:0.26641
[23]	validation_0-rmse:0.26623
[24]	validation_0-rmse:0.26608
[25]	validation_0-rmse:0.26606
[26]	validation_0-rmse:0.26602
[27]	validation_0-rmse:0.26601
[28]	validation_0-rmse:0.26593
[29]	validation_0-rmse:0.26591
[30]	validation_0-rmse:0.26589
[31]	validation_0-rmse:0.26588
[32]	validation_0-

## Stacking

Stacking is short for stacked generalization.

#### IDEA
It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation?

Suppose you have 3 predictors, each predicting a different value (3.1, 2.7, and 2.9) for a new instance. The final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

### Working

### STEP 1
First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer.

### STEP 2
Next, the first layer’s predictors are used to make predictions on the second (held-out) set. This ensures that the predictions are “clean,” since the predictors never saw these instances during training.

### STEP 3
For each instance in the hold-out set, there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value, given the first layer’s predictions.
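
The three steps can be sketched by hand with a hold-out set. This is an illustration under stated assumptions rather than a reference implementation: the synthetic regression data, the choice of first-layer predictors, the Linear Regression blender and the helper `stacked_predict` are all picked for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X_st, y_st = make_regression(n_samples=400, n_features=5, noise=10.0,
                             random_state=42)
X_rest, X_unseen, y_rest, y_unseen = train_test_split(
    X_st, y_st, test_size=0.2, random_state=42)

# STEP 1: split the training set in two and train the first layer on subset 1.
X_layer1, X_hold, y_layer1, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
first_layer = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=4, random_state=42),
    KNeighborsRegressor(),
]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)

# STEP 2: "clean" predictions on the held-out subset (never seen in training).
hold_preds = np.column_stack([p.predict(X_hold) for p in first_layer])

# STEP 3: the blender learns the target from the three predicted values.
blender = LinearRegression()
blender.fit(hold_preds, y_hold)


def stacked_predict(X_new):
    """Run new instances through the first layer, then through the blender."""
    layer1_preds = np.column_stack([p.predict(X_new) for p in first_layer])
    return blender.predict(layer1_preds)


print(stacked_predict(X_unseen[:3]))
```

Scikit-Learn also provides StackingClassifier and StackingRegressor classes, which build the blender's training set with cross-validation instead of a single hold-out split.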

### Conclusion

It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression), to get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially.