**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I** 

**The Fundamentals of Machine Learning** 

---

**CHAPTER 7 - Ensemble Learning and Random Forests** 

---

Chapter 7 explores Ensemble Learning, a powerful technique based on the wisdom of the crowd principle. By aggregating predictions from multiple predictors, ensemble methods often achieve better results than individual models. This chapter covers voting classifiers, bagging, boosting, stacking, and Random Forests. 

---

**Voting Classifiers**   
A simple way to create a better classifier is to aggregate predictions from multiple classifiers and predict the class that gets the most votes. This majority-vote classifier is called a **hard voting classifier**. 

**Figure 7-1. Training diverse classifiers**   
![Figure7-1.jpg](./07.Chapter-07/Figure7-1.jpg) 

**Figure 7-2. Hard voting classifier predictions**   
![Figure7-2.jpg](./07.Chapter-07/Figure7-2.jpg) 

The voting classifier often achieves higher accuracy than the best classifier in the ensemble. Even if each classifier is a **weak learner** (only slightly better than random guessing), the ensemble can be a **strong learner** (achieving high accuracy), provided there are sufficient weak learners that are sufficiently diverse.

**The Law of Large Numbers**   
Consider a biased coin with a 51% chance of heads and a 49% chance of tails. Over 1,000 tosses you expect roughly 510 heads and 490 tails, and the probability of ending up with a majority of heads is about 75%. With 10,000 tosses, this probability climbs above 97%. This is due to the **law of large numbers**: as the number of tosses increases, the ratio of heads gets closer and closer to the true probability (51%). 

**Figure 7-3. The law of large numbers**   
![Figure7-3.jpg](./07.Chapter-07/Figure7-3.jpg) 
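
The probabilities quoted above can be checked with an exact binomial sum. The sketch below is only an illustration: the helper name prob_heads_at_least is made up for this example, and the computation is done in log space so that the large binomial coefficients do not overflow. For 1,000 tosses the strict-majority probability comes out slightly below 75%, and it gets closer to that figure when a 500/500 tie is also counted.

```python
import math

def prob_heads_at_least(n_tosses, p_heads, k_min):
    """Exact binomial probability of getting at least k_min heads."""
    log_p, log_q = math.log(p_heads), math.log(1.0 - p_heads)
    total = 0.0
    for k in range(k_min, n_tosses + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), kept in log space to avoid overflow
        log_pmf = (math.lgamma(n_tosses + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * log_p + (n_tosses - k) * log_q)
        total += math.exp(log_pmf)
    return total

p_strict_1k = prob_heads_at_least(1_000, 0.51, 501)     # more heads than tails
p_with_tie_1k = prob_heads_at_least(1_000, 0.51, 500)   # a 500/500 tie also counts
p_strict_10k = prob_heads_at_least(10_000, 0.51, 5_001)

print(f"1,000 tosses, strict majority:  {p_strict_1k:.3f}")
print(f"1,000 tosses, tie counted too:  {p_with_tie_1k:.3f}")
print(f"10,000 tosses, strict majority: {p_strict_10k:.3f}")
```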

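Similarly, suppose you build an ensemble of 1,000 classifiers that are each correct only 51% of the time. If you predict the majority-voted class, you can hope for up to 75% accuracy. However, this holds only if all classifiers are perfectly independent and make uncorrelated errors, which is rarely the case when they are trained on the same data. Ensemble methods therefore work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms.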

**Code Example: Voting Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

Evaluating each classifier's accuracy on the test set:

In [None]:
>>> from sklearn.metrics import accuracy_score
>>> for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
...     clf.fit(X_train, y_train)
...     y_pred = clf.predict(X_test)
...     print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
...
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.888
VotingClassifier 0.904

The voting classifier slightly outperforms all individual classifiers. 

**Soft Voting**   
If all classifiers can estimate class probabilities (have a predict_proba() method), you can use **soft voting**: predict the class with the highest class probability averaged over all classifiers. This often achieves higher performance than hard voting because it gives more weight to highly confident votes. Replace voting="hard" with voting="soft" and ensure all classifiers can estimate probabilities. For SVC, set probability=True (this uses cross-validation to estimate probabilities, slowing training). With soft voting, the classifier achieves over 91.2% accuracy. 
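
To see why hard and soft voting can disagree, consider the small NumPy sketch below. The probability values are invented for illustration and do not come from any trained model: two classifiers lean weakly toward class 1, while a third is very confident in class 0.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for ONE instance:
# each row is [P(class 0), P(class 1)].
probas = np.array([
    [0.45, 0.55],   # classifier A leans slightly toward class 1
    [0.45, 0.55],   # classifier B leans slightly toward class 1
    [0.95, 0.05],   # classifier C is very confident in class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, "with averaged probabilities", mean_proba)
```

Here the hard vote follows the two weak votes, while the soft vote follows the single confident classifier, which is the "more weight to highly confident votes" effect described above.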

---

**Bagging and Pasting**   
Another approach to get diverse classifiers is using the same training algorithm for every predictor, training them on different random subsets of the training set. When sampling is performed **with replacement**, this method is called **bagging** (bootstrap aggregating). When sampling is performed **without replacement**, it's called **pasting**. 

**Figure 7-4. Bagging and pasting involves training several predictors on different random samples of the training set**   
![Figure7-4.jpg](./07.Chapter-07/Figure7-4.jpg) 

Once all predictors are trained, the ensemble aggregates predictions using the statistical mode (most frequent prediction) for classification or average for regression. Each individual predictor has higher bias than if trained on the original set, but aggregation reduces both bias and variance. Generally, the ensemble has similar bias but lower variance than a single predictor. 

Predictors can all be trained in parallel via different CPU cores or servers, making bagging and pasting very scalable.

**Bagging and Pasting in Scikit-Learn**

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

The code trains an ensemble of 500 Decision Tree classifiers, each on 100 training instances randomly sampled from the training set with replacement. The n_jobs parameter specifies the number of CPU cores to use for training and predictions (-1 uses all available cores). For pasting instead of bagging, set bootstrap=False. 

**Note**: BaggingClassifier automatically performs soft voting if the base classifier can estimate class probabilities (has predict_proba() method). 

**Figure 7-5. A single Decision Tree (left) versus a bagging ensemble of 500 trees (right)**   
![Figure7-5.jpg](./07.Chapter-07/Figure7-5.jpg) 

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting. However, the extra diversity also means the predictors end up less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred.

**Out-of-Bag Evaluation**   
With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default, BaggingClassifier samples m training instances with replacement, where m is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% are called **out-of-bag (oob)** instances, and they are not the same 37% for every predictor. 
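
The 63% figure follows from a short calculation: a given instance is missed by all m draws with probability (1 - 1/m)^m, which approaches exp(-1) ≈ 0.37 as m grows. The sketch below checks this with a single simulated bootstrap sample; the training set size m = 10,000 is an arbitrary choice for illustration.

```python
import math
import random

m = 10_000                     # hypothetical training set size
rng = random.Random(42)

# Draw m indices with replacement, as one bagging predictor would.
sampled = {rng.randrange(m) for _ in range(m)}
sampled_fraction = len(sampled) / m

# Closed form: an instance is missed by all m draws with probability (1 - 1/m)^m.
expected_fraction = 1.0 - (1.0 - 1.0 / m) ** m
limit = 1.0 - math.exp(-1.0)   # limit as m grows, about 0.632

print(f"simulated: {sampled_fraction:.3f}, closed form: {expected_fraction:.3f}")
```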

Since a predictor never sees oob instances during training, it can be evaluated on them without needing a separate validation set. You can evaluate the ensemble by averaging oob evaluations of each predictor.

In [None]:
>>> bag_clf = BaggingClassifier(
...     DecisionTreeClassifier(), n_estimators=500,
...     bootstrap=True, n_jobs=-1, oob_score=True)
...
>>> bag_clf.fit(X_train, y_train)
>>> bag_clf.oob_score_
0.90133333333333332

According to this oob evaluation, the classifier is likely to achieve about 90.1% accuracy on the test set. Checking on the test set:

In [None]:
>>> from sklearn.metrics import accuracy_score
>>> y_pred = bag_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred)
0.91200000000000003

The oob decision function for each training instance is available through oob_decision_function_. It returns class probabilities for each training instance:

In [None]:
>>> bag_clf.oob_decision_function_
array([[0.31746032, 0.68253968],
       [0.34117647, 0.65882353],
       [1.        , 0.        ],
       ...
       [1.        , 0.        ],
       [0.03108808, 0.96891192],
       [0.57291667, 0.42708333]])

**Random Patches and Random Subspaces**   
BaggingClassifier supports sampling features as well. Sampling is controlled by max_features and bootstrap_features, which work like max_samples and bootstrap but for feature sampling. Each predictor trains on a random subset of input features. 

This is particularly useful for high-dimensional inputs like images. Sampling both training instances and features is called the **Random Patches method**. Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features < 1.0) is called the **Random Subspaces method**. 

Sampling features results in even more predictor diversity, trading more bias for lower variance. 
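
As an illustration of the two configurations, the sketch below sets up both methods with BaggingClassifier. The synthetic dataset from make_classification, the 25 estimators, and the 0.5 sampling ratios are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20 features (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25,
    max_samples=1.0, bootstrap=False,             # all instances, no resampling
    max_features=0.5, bootstrap_features=False,   # half of the features per tree
    random_state=42)
subspaces_clf.fit(X, y)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25,
    max_samples=0.5, bootstrap=True,
    max_features=0.5, bootstrap_features=False,
    random_state=42)
patches_clf.fit(X, y)

# Each tree records which feature indices it was trained on.
n_features_per_tree = {len(f) for f in subspaces_clf.estimators_features_}
print("features seen by each tree:", n_features_per_tree)
```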

---

**Random Forests**   
A **Random Forest** is an ensemble of Decision Trees, generally trained via bagging (or sometimes pasting), typically with max_samples set to the training set size. Instead of using BaggingClassifier with DecisionTreeClassifier, use the more convenient and optimized RandomForestClassifier class (or RandomForestRegressor for regression).

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

RandomForestClassifier has all the hyperparameters of DecisionTreeClassifier (to control how trees are grown), with a few exceptions, plus all the hyperparameters of BaggingClassifier (to control the ensemble itself). 

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, trading higher bias for lower variance, generally yielding a better overall model. 

The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

In [None]:
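# max_features="sqrt" makes each split consider a random subset of features,
# which is the extra randomness a Random Forest adds on top of bagging.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)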

**Extra-Trees**   
When growing a tree in a Random Forest, at each node only a random subset of features is considered for splitting. It's possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. 

A forest of such extremely random trees is called an **Extremely Randomized Trees ensemble** (Extra-Trees). This technique trades more bias for lower variance. It also makes Extra-Trees much faster to train than Random Forests because finding the best possible threshold is one of the most time-consuming tasks. 

Create an Extra-Trees classifier using ExtraTreesClassifier (or ExtraTreesRegressor). Its API is identical to RandomForestClassifier. 

**Note**: It's hard to tell in advance whether RandomForestClassifier or ExtraTreesClassifier will perform better. Generally, the only way to know is to try both and compare them using cross-validation (tuning the hyperparameters of each with grid search).
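
A minimal sketch of such a comparison is shown below. It assumes the moons dataset as stand-in data and uses default hyperparameters apart from n_estimators; a fuller comparison would also tune each model.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data (an assumption for this example).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

candidates = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

mean_accuracy = {}
for name, clf in candidates.items():
    # 5-fold cross-validated accuracy for each candidate
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```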

**Feature Importance**   
Random Forests make it easy to measure the relative importance of each feature. Scikit-Learn measures feature importance by looking at how much tree nodes using that feature reduce impurity on average (across all trees in the forest). More precisely, it's a weighted average where each node's weight equals the number of training samples associated with it. 

Scikit-Learn computes this score automatically after training, scaling results so the sum of all importances equals 1. Access the result using feature_importances_.

In [None]:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, score)
...
sepal length (cm) 0.112492250999
sepal width (cm) 0.0231192882825
petal length (cm) 0.441030464364
petal width (cm) 0.423357996355

The most important features are petal length (44%) and width (42%), while sepal length and width are relatively unimportant (11% and 2%). 

**Figure 7-6. MNIST pixel importance (according to a Random Forest classifier)**   
![Figure7-6.jpg](./07.Chapter-07/Figure7-6.jpg) 

Random Forests are handy for quickly understanding what features matter, particularly for feature selection. 

---

**Boosting**   
**Boosting** (originally hypothesis boosting) refers to any Ensemble method that combines several weak learners into a strong learner. The general idea is to train predictors sequentially, each trying to correct its predecessor. The most popular boosting methods are **AdaBoost** (Adaptive Boosting) and **Gradient Boosting**.

**AdaBoost**   
One way for a new predictor to correct its predecessor is to pay more attention to training instances that the predecessor underfitted. This results in new predictors focusing more on hard cases. This is the technique used by AdaBoost. 

When training an AdaBoost classifier, the algorithm first trains a base classifier (like a Decision Tree) and makes predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. It trains a second classifier using updated weights, makes predictions, updates weights, and so on. 

**Figure 7-7. AdaBoost sequential training with instance weight updates**   
![Figure7-7.jpg](./07.Chapter-07/Figure7-7.jpg) 

**Figure 7-8. Decision boundaries of consecutive predictors**   
![Figure7-8.jpg](./07.Chapter-07/Figure7-8.jpg) 

Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset (in this example, each predictor is a highly regularized SVM classifier with an RBF kernel). The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on. The plot on the right shows the same sequence of predictors with the learning rate halved (i.e., misclassified instance weights are boosted half as much at every iteration). 

Once all predictors are trained, the ensemble makes predictions like bagging or pasting, except predictors have different weights depending on their overall accuracy on the weighted training set. 

**Important drawback**: This sequential learning technique cannot be parallelized (or only partially) since each predictor can only be trained after the previous predictor has been trained and evaluated. It doesn't scale as well as bagging or pasting.

**AdaBoost Algorithm Details**   
Each instance weight w<sup>(i)</sup> is initially set to 1/m, where m is the number of training instances. A first predictor is trained, and its weighted error rate r<sub>1</sub> is computed on the training set. 

**Equation 7-1. Weighted error rate of the jth predictor**   
![Eq7-1.jpg](./07.Chapter-07/Eq7-1.jpg) 

Where ŷ<sub>j</sub><sup>(i)</sup> is the jth predictor's prediction for the ith instance. 

The predictor's weight α<sub>j</sub> is then computed using Equation 7-2, where η is the learning rate hyperparameter (defaults to 1). The more accurate the predictor, the higher its weight. If it's just guessing randomly, its weight is close to zero. If it's most often wrong (less accurate than random guessing), its weight is negative. 

**Equation 7-2. Predictor weight**   
![Eq7-2.jpg](./07.Chapter-07/Eq7-2.jpg) 

Next, AdaBoost updates instance weights using Equation 7-3, which boosts weights of misclassified instances. 

**Equation 7-3. Weight update rule**   
![Eq7-3.jpg](./07.Chapter-07/Eq7-3.jpg) 

Then all instance weights are normalized (i.e., divided by the sum of all weights, Σ w<sup>(i)</sup>). 

Finally, a new predictor is trained using updated weights, and the process repeats. The algorithm stops when the desired number of predictors is reached or when a perfect predictor is found. 

To make predictions, AdaBoost computes the predictions of all predictors and weighs them using the predictor weights α<sub>j</sub>. The predicted class is the one that receives the majority of weighted votes. 

**Equation 7-4. AdaBoost predictions**   
![Eq7-4.jpg](./07.Chapter-07/Eq7-4.jpg) 

Where N is the number of predictors.
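
The steps above can be sketched from scratch as follows. This is an illustration, not Scikit-Learn's implementation. It assumes that Equations 7-1 to 7-4 take their usual AdaBoost form: the weighted error rate is the share of total weight on misclassified instances, α<sub>j</sub> = η log((1 − r<sub>j</sub>) / r<sub>j</sub>), and misclassified weights are multiplied by exp(α<sub>j</sub>). The moons dataset, the 50 stumps, and η = 0.5 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)  # labels are 0/1
m = len(X)
eta = 0.5            # learning rate
n_predictors = 50

w = np.full(m, 1.0 / m)          # every instance weight starts at 1/m
stumps, alphas = [], []
for _ in range(n_predictors):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=w)
    wrong = stump.predict(X) != y
    r = w[wrong].sum() / w.sum()              # weighted error rate (Eq. 7-1)
    r = np.clip(r, 1e-10, 1 - 1e-10)          # guard against log(0) or division by zero
    alpha = eta * np.log((1 - r) / r)         # predictor weight (Eq. 7-2)
    w[wrong] *= np.exp(alpha)                 # boost misclassified instances (Eq. 7-3)
    w /= w.sum()                              # normalize the weights
    stumps.append(stump)
    alphas.append(alpha)

def predict(X_new):
    """Weighted vote over all stumps (Eq. 7-4)."""
    votes = np.zeros((len(X_new), 2))
    for stump, alpha in zip(stumps, alphas):
        pred = stump.predict(X_new)
        votes[np.arange(len(X_new)), pred] += alpha
    return votes.argmax(axis=1)

train_accuracy = (predict(X) == y).mean()
print(f"training accuracy: {train_accuracy:.3f}")
```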

**Scikit-Learn Implementation**

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

The code trains an AdaBoost classifier based on 200 Decision Stumps. A **Decision Stump** is a Decision Tree with max_depth=1, that is, a tree with a single decision node plus two leaf nodes. This is the default base estimator for AdaBoostClassifier. 

Scikit-Learn uses a multiclass version of AdaBoost called **SAMME** (Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e., they have a predict_proba() method), Scikit-Learn can use a variant called **SAMME.R** (the R stands for "Real"), which relies on class probabilities rather than predictions and generally performs better. Note that SAMME.R was deprecated in Scikit-Learn 1.4; on recent releases, omit the algorithm argument from the code above. 

**Note**: If your AdaBoost ensemble is overfitting, try reducing the number of estimators or more strongly regularizing the base estimator.

**Gradient Boosting**   
Another popular boosting algorithm is **Gradient Boosting**. Like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each correcting its predecessor. However, instead of tweaking instance weights at every iteration like AdaBoost, this method tries to fit the new predictor to the residual errors made by the previous predictor. 

**Simple Regression Example**   
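Consider a simple regression example that uses Decision Trees as the base predictors (Gradient Boosting also works great with regression tasks). This is called **Gradient Tree Boosting** or **Gradient Boosted Regression Trees (GBRT)**. First, fit a DecisionTreeRegressor to the training set: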

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

Next, train a second DecisionTreeRegressor on the residual errors made by the first predictor:

In [None]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Then train a third regressor on the residual errors made by the second predictor:

In [None]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

Now the ensemble containing three trees can make predictions by adding up predictions of all trees:

In [None]:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

**Figure 7-9. Gradient Boosting: the first predictor (top left) is trained normally, then each consecutive predictor (middle left and lower left) is trained on the previous predictor's residuals; the right column shows the resulting ensemble's predictions**   
![Figure7-9.jpg](./07.Chapter-07/Figure7-9.jpg) 

**Using GradientBoostingRegressor**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

The code creates the same ensemble as the previous one. The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value like 0.1, you'll need more trees in the ensemble to fit the training set, but predictions will usually generalize better. This is a regularization technique called **shrinkage**. 
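
To make the effect of shrinkage concrete, the sketch below generalizes the manual three-tree example into a loop with a learning rate. It is a simplified illustration: the helper names fit_boosted_trees, predict, and training_mse are made up here, the data is synthetic, and unlike GradientBoostingRegressor this version has no constant initial prediction, so the first tree is shrunk as well.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy quadratic data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

def fit_boosted_trees(X, y, n_trees, learning_rate):
    """Fit each tree to the residuals left by the shrunken trees before it."""
    trees, residual = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual)
        residual = residual - learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict(trees, X, learning_rate):
    """Sum the trees' predictions, each scaled by the learning rate."""
    return learning_rate * sum(tree.predict(X) for tree in trees)

def training_mse(n_trees, learning_rate):
    trees = fit_boosted_trees(X, y, n_trees, learning_rate)
    return float(np.mean((y - predict(trees, X, learning_rate)) ** 2))

mse_3_trees_fast = training_mse(3, 1.0)      # no shrinkage
mse_3_trees_slow = training_mse(3, 0.1)      # shrinkage, too few trees
mse_200_trees_slow = training_mse(200, 0.1)  # shrinkage, many more trees

print(mse_3_trees_fast, mse_3_trees_slow, mse_200_trees_slow)
```

With a learning rate of 0.1, three trees leave most of the signal unexplained on the training set, and many more trees are needed to fit it. Whether the slower ensemble generalizes better has to be checked on held-out data.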

**Figure 7-10. GBRT ensembles with not enough predictors (left) and too many (right)**   
![Figure7-10.jpg](./07.Chapter-07/Figure7-10.jpg) 

**Early Stopping**   
To find the optimal number of trees, use early stopping. A simple way is using the staged_predict() method: it returns an iterator over predictions made by the ensemble at each training stage (with one tree, two trees, etc.).

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

**Figure 7-11. Tuning the number of trees using early stopping**   
![Figure7-11.jpg](./07.Chapter-07/Figure7-11.jpg) 

It's also possible to implement early stopping by actually stopping training early (instead of training many trees first and looking back). Set warm_start=True to make Scikit-Learn keep existing trees when fit() is called, allowing incremental training.

In [None]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

The code stops training when validation error doesn't improve for five iterations in a row. 
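
GradientBoostingRegressor can also run a similar check internally through its n_iter_no_change and validation_fraction hyperparameters. The sketch below is one possible setup, not the approach used above; the synthetic data and the specific values (500 trees at most, 20% held out, patience of 5) are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic noisy quadratic data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.random((200, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=500,
    validation_fraction=0.2,   # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 rounds without sufficient improvement
    random_state=42)
gbrt.fit(X, y)

# n_estimators_ holds the number of trees actually trained.
print("Trees actually trained:", gbrt.n_estimators_)
```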

**Stochastic Gradient Boosting**   
GradientBoostingRegressor also supports a subsample hyperparameter, specifying the fraction of training instances used for training each tree. For example, if subsample=0.25, each tree is trained on 25% of training instances, selected randomly. This technique trades higher bias for lower variance and speeds up training considerably. This is called **Stochastic Gradient Boosting**. 
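
A minimal sketch of Stochastic Gradient Boosting follows, assuming synthetic data and 100 trees. When subsample is below 1.0, Scikit-Learn also records an out-of-bag improvement estimate per stage in oob_improvement_.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic noisy quadratic data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.random((200, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# Each tree is trained on a random 25% of the training instances.
sgb = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42)
sgb.fit(X, y)

# One out-of-bag improvement value per boosting stage.
print(sgb.oob_improvement_[:5])
```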

**Note**: Gradient Boosting can also be used with other cost functions. This is controlled by the loss hyperparameter.

**XGBoost**   
An optimized implementation of Gradient Boosting is available in the popular Python library **XGBoost** (Extreme Gradient Boosting). Initially developed by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC), it aims to be extremely fast, scalable, and portable. XGBoost is often an important component of winning entries in ML competitions.

In [None]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

XGBoost's API is quite similar to Scikit-Learn's. XGBoost offers several nice features, such as automatically taking care of early stopping:

In [None]:
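# Recent XGBoost releases take early_stopping_rounds in the constructor
# (older releases accepted it as a fit() argument instead).
xgb_reg = xgboost.XGBRegressor(early_stopping_rounds=2)
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)])
y_pred = xgb_reg.predict(X_val)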

---

**Stacking**   
The last Ensemble method discussed is **stacking** (short for stacked generalization). It's based on a simple idea: instead of using trivial functions (like hard voting) to aggregate predictions of all predictors in an ensemble, why not train a model to perform this aggregation? 

**Figure 7-12. Aggregating predictions using a blending predictor**   
![Figure7-12.jpg](./07.Chapter-07/Figure7-12.jpg) 

The figure shows an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and the final predictor (called a **blender** or **meta learner**) takes these predictions as inputs and makes the final prediction (3.0).

**Training the Blender**   
To train the blender, a common approach is using a **hold-out set**. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer. 

**Figure 7-13. Training the first layer**   
![Figure7-13.jpg](./07.Chapter-07/Figure7-13.jpg) 

Next, the first layer's predictors make predictions on the second (held-out) set. This ensures predictions are "clean," since predictors never saw these instances during training. For each instance in the hold-out set, there are three predicted values. Create a new training set using these predicted values as input features (so the new training set has three input features per instance), keeping the target values. The blender is trained on this new training set, learning to predict the target value given the first layer's predictions. 

**Figure 7-14. Training the blender**   
![Figure7-14.jpg](./07.Chapter-07/Figure7-14.jpg) 
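
The hold-out procedure can be sketched by hand as follows. The synthetic dataset, the three first-layer regressors, and the Linear Regression blender are all assumptions chosen for illustration; any regressors could take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.random((400, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(400)

# Subset 1 trains the first layer; subset 2 (the hold-out set) trains the blender.
X_layer1, X_holdout, y_layer1, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=42)

first_layer = [
    DecisionTreeRegressor(max_depth=4, random_state=42),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=50, random_state=42),
]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)

def first_layer_predictions(X_in):
    """One column of predictions per first-layer predictor."""
    return np.column_stack([p.predict(X_in) for p in first_layer])

# "Clean" predictions on instances the first layer never saw.
X_blend = first_layer_predictions(X_holdout)      # shape (n_holdout, 3)
blender = LinearRegression().fit(X_blend, y_holdout)

# Predicting for a new instance goes through both layers in order.
X_new = np.array([[0.2, 0.7]])
y_new = blender.predict(first_layer_predictions(X_new))
print(y_new)
```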

**Multilayer Stacking**   
It's possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression) to get a whole layer of blenders. The trick is to split the training set into three subsets: the first trains the first layer, the second creates the training set for the second layer (using first layer predictions), and the third creates the training set for the third layer (using second layer predictions). Once done, make a prediction for a new instance by going through each layer sequentially. 

**Figure 7-15. Predictions in a multilayer stacking ensemble**   
![Figure7-15.jpg](./07.Chapter-07/Figure7-15.jpg) 

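Scikit-Learn versions before 0.22 don't support stacking directly, but it's not too hard to implement yourself. Since version 0.22, the StackingClassifier and StackingRegressor classes in sklearn.ensemble are available; they train the blender on cross-validated predictions rather than on a single hold-out set. Alternatively, use an open source implementation such as DESlib.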
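
As a sketch of library-based stacking, the example below assumes a Scikit-Learn release that ships sklearn.ensemble.StackingRegressor (version 0.22 or later). The dataset and the choice of first-layer predictors are illustrative only. Note that this class builds the blender's training set from out-of-fold predictions (cv=5 here) instead of a single hold-out split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

stack_reg = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor(max_depth=4, random_state=42)),
                ("knn", KNeighborsRegressor(n_neighbors=5)),
                ("rf", RandomForestRegressor(n_estimators=50, random_state=42))],
    final_estimator=LinearRegression(),   # the blender
    cv=5)                                 # out-of-fold predictions train the blender
stack_reg.fit(X, y)

# transform() returns the first layer's predictions, i.e. the blender's inputs.
blender_inputs = stack_reg.transform(X[:3])
y_pred = stack_reg.predict(X[:3])
print(blender_inputs.shape, y_pred.shape)
```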