# Ensemble Learning and Random Forests

**Use a group of models to make a classification or regression prediction.**

## Voting Classifiers

* A simple way to create an even better classifier is to aggregate the predictions of several classifiers and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.

* Ensemble methods work best when the predictors are as independent from one another as possible.

* One way to get diverse classifiers is to train them using very different algorithms.
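Why independence matters can be illustrated with a small simulation. This is only a sketch under idealized assumptions: a binary task, classifiers whose errors are perfectly independent, and made-up sizes (`n_classifiers`, `n_instances`, `p_correct` are hypothetical values, not taken from the notebook).

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical ensemble size
n_instances = 5_000    # hypothetical number of test instances
p_correct = 0.51       # each classifier is barely better than chance

# correct[i, j] is True when classifier j gets instance i right.
# Drawing every entry independently models perfectly independent errors.
correct = rng.random((n_instances, n_classifiers)) < p_correct

# On a binary task, a hard vote is right when more than half of the voters are right.
majority_correct = correct.sum(axis=1) > n_classifiers / 2

single_accuracy = correct[:, 0].mean()
ensemble_accuracy = majority_correct.mean()
print(f"single classifier: {single_accuracy:.3f}")
print(f"majority vote:     {ensemble_accuracy:.3f}")
```

Real classifiers trained on the same data make correlated errors, so the gain in practice is smaller than in this idealized setting.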



In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators=[('lr', log_clf),('rf', rnd_clf),('svc', svm_clf)],
                             voting='hard')

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomF...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         n_jobs=1, voting='hard', weights=None)

In [5]:
#Look at each classifier's accuracy on the test set:

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
    

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.888
VotingClassifier 0.904


* If all classifiers are able to estimate class probabilities, we can tell sklearn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting.

* Soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes.
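A minimal soft voting sketch, assuming the same moons data as above (the split is rebuilt here so the cell runs on its own). The name `soft_voting_clf` and the fixed `random_state` values are illustrative choices, and SVC needs `probability=True` to expose class probabilities.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # SVC only provides predict_proba when probability=True
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average the predicted class probabilities
)
soft_voting_clf.fit(X_train, y_train)

soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print("soft voting accuracy:", soft_accuracy)
```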

## Bagging and Pasting

* Another way to get a diverse set of classifiers is to use the same training algorithm for every predictor, but to train on different random subsets of the training set. 

* When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).

* When sampling is performed without replacement, it is called pasting.

* Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.



In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                           max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)



In [8]:
print(accuracy_score(y_test, y_pred))

0.904
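For comparison, pasting can be sketched with the same class by setting `bootstrap=False`. The names `paste_clf` and `first_sample`, the smaller `n_estimators`, and the fixed seeds are assumptions made for this illustration; the data split is rebuilt so the cell is self-contained.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=100,
    bootstrap=False,  # sample without replacement: pasting
    random_state=42,
)
paste_clf.fit(X_train, y_train)
paste_accuracy = accuracy_score(y_test, paste_clf.predict(X_test))
print("pasting accuracy:", paste_accuracy)

# Without replacement, no training index repeats within one predictor's sample.
first_sample = paste_clf.estimators_samples_[0]
print(len(first_sample), len(set(first_sample)))
```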


### Out-of-Bag Evaluation

* With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. When each predictor draws as many instances as the training set contains (the default), only about 63% of the training instances are sampled on average for each predictor. The remaining 37% that are not sampled are called out-of-bag (oob) instances.

* Since a predictor never sees the oob instances during training, it can be evaluated on these instances.
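The 63% figure can be checked numerically. The sketch below assumes a hypothetical training-set size `m` and draws one bootstrap sample of the same size; the exact expression is the probability that a given instance is picked at least once in `m` draws with replacement.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # hypothetical training-set size

# One bootstrap sample: m draws with replacement from m instances.
sample = rng.integers(0, m, size=m)
simulated_in_bag = np.unique(sample).size / m

# Probability that a given instance is drawn at least once in m draws.
exact_in_bag = 1 - (1 - 1 / m) ** m
# Limit of that probability as m grows: 1 - 1/e.
limit_in_bag = 1 - np.exp(-1)

print(f"simulated: {simulated_in_bag:.4f}")
print(f"exact:     {exact_in_bag:.4f}")
print(f"limit:     {limit_in_bag:.4f}")
```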

In [11]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                           bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.89600000000000002

In [12]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.90400000000000003

In [13]:
bag_clf.oob_decision_function_

array([[ 0.32571429,  0.67428571],
       [ 0.31638418,  0.68361582],
       [ 1.        ,  0.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.08333333,  0.91666667],
       [ 0.31976744,  0.68023256],
       [ 0.01492537,  0.98507463],
       [ 0.98421053,  0.01578947],
       [ 0.98387097,  0.01612903],
       [ 0.75129534,  0.24870466],
       [ 0.        ,  1.        ],
       [ 0.79166667,  0.20833333],
       [ 0.87951807,  0.12048193],
       [ 0.98019802,  0.01980198],
       [ 0.05820106,  0.94179894],
       [ 0.        ,  1.        ],
       [ 0.96842105,  0.03157895],
       [ 0.95652174,  0.04347826],
       [ 0.99378882,  0.00621118],
       [ 0.02777778,  0.97222222],
       [ 0.35802469,  0.64197531],
       [ 0.90865385,  0.09134615],
       [ 1.        ,  0.        ],
       [ 0.95808383,  0.04191617],
       [ 0.        ,  1.        ],
       [ 0.99470899,  0.00529101],
       [ 1.        ,  0.        ],
       [ 0.        ,

## Random Patches and Random Subspaces

* Sampling both training instances and features is called the Random Patches method.
* Keeping all training instances but sampling features is called the Random Subspaces method.
* Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
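Both methods can be sketched with `BaggingClassifier`, whose `max_features` and `bootstrap_features` parameters control feature sampling. The synthetic 20-feature dataset and the names `X_wide`, `patches_clf`, and `subspaces_clf` are assumptions for this example, since the two-feature moons data leaves little room for sampling features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with enough features to make feature sampling meaningful.
X_wide, y_wide = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wide, y_wide, random_state=42
)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

# Random Subspaces: keep all training instances, sample only features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

patches_score = patches_clf.fit(Xw_train, yw_train).score(Xw_test, yw_test)
subspaces_score = subspaces_clf.fit(Xw_train, yw_train).score(Xw_test, yw_test)
print("random patches:  ", patches_score)
print("random subspaces:", subspaces_score)
```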

## Random Forests

* A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method.


In [15]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

In [16]:
accuracy_score(y_test, y_pred_rf)

0.92000000000000004
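A Random Forest can be approximated by bagging Decision Trees that consider a random subset of features at each split. The sketch below is a rough comparison rather than an exact equivalence; the names `bag_of_trees` and `forest`, the choice of `max_features="sqrt"`, and the reduced `n_estimators` are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging of trees that look at a random feature subset at every split.
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200,
    random_state=42,
)
forest = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42
)

bag_pred = bag_of_trees.fit(X_train, y_train).predict(X_test)
forest_pred = forest.fit(X_train, y_train).predict(X_test)

# Fraction of test instances on which the two ensembles agree.
agreement = (bag_pred == forest_pred).mean()
print("agreement between the two ensembles:", agreement)
```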

## Extra-Trees

* When we are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting.

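* Trees can be made even more random by also using random thresholds for each feature rather than searching for the best possible threshold. A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (or Extra-Trees for short). This trades more bias for a lower variance.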
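Scikit-Learn provides `ExtraTreesClassifier` with the same interface as `RandomForestClassifier`. A minimal sketch on the moons data follows; the name `extra_clf` and the hyperparameter values are illustrative assumptions, and the split is rebuilt so the cell runs on its own.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Split thresholds are drawn at random instead of being optimized per feature.
extra_clf = ExtraTreesClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42
)
extra_clf.fit(X_train, y_train)

extra_accuracy = accuracy_score(y_test, extra_clf.predict(X_test))
print("extra-trees accuracy:", extra_accuracy)
```

Whether Extra-Trees or a Random Forest does better is hard to tell in advance; comparing them with cross-validation is usually the only way to find out.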

## Feature Importance

* In a single Decision Tree, important features are likely to appear closer to the root of the tree, while unimportant features will often appear closer to the leaves. 

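* It is therefore possible to estimate a feature's importance from where it appears across all trees in the forest. Scikit-Learn does this by measuring how much the tree nodes that use the feature reduce impurity on average, weighting each node by the number of training samples it covers.

* We can access the result through the feature_importances_ attribute.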



In [17]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [18]:
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.0956045546963
sepal width (cm) 0.0221267719485
petal length (cm) 0.418076917646
petal width (cm) 0.464191755709
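The forest-level scores are an average of per-tree scores. The sketch below checks this by averaging each fitted tree's own impurity-based `feature_importances_`; the names `forest`, `per_tree`, and `mean_importances` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

# Each fitted tree exposes its own normalized, impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importances = per_tree.mean(axis=0)

for name, score in zip(iris["feature_names"], mean_importances):
    print(name, round(score, 4))

# The scores are normalized, so they sum to 1.
print("sum:", mean_importances.sum())
```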


## Boosting (hypothesis boosting)

* The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

### AdaBoost

* One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.

* There is one important drawback to this sequential learning technique: it cannot be parallelized, since each predictor can only be trained after the previous predictor has been trained and evaluated.
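The reweighting idea can be sketched by hand for a binary problem. This is a simplified discrete AdaBoost loop written for illustration, not Scikit-Learn's implementation; the number of rounds, the learning rate, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

m = len(X_train)
weights = np.full(m, 1 / m)  # start with uniform instance weights
learning_rate = 0.5
stumps, alphas = [], []

for _ in range(50):
    # Each stump is trained on the current instance weights.
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_train, y_train, sample_weight=weights)
    wrong = stump.predict(X_train) != y_train

    # Weighted error rate of this predictor, clipped to avoid log(0).
    r = np.clip(weights[wrong].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    # Predictor weight: the more accurate the stump, the larger its say.
    alpha = learning_rate * np.log((1 - r) / r)

    # Boost the weights of misclassified instances, then renormalize.
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote: labels are 0/1, so map predictions to -1/+1 before summing.
votes = sum(a * (2 * s.predict(X_test) - 1) for s, a in zip(stumps, alphas))
y_pred_manual = (votes > 0).astype(int)
manual_accuracy = (y_pred_manual == y_test).mean()
print("manual AdaBoost accuracy:", manual_accuracy)
```

Each round depends on the weights produced by the previous one, which is why the loop cannot be parallelized across predictors.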



In [19]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,algorithm="SAMME.R",
                            learning_rate=0.5)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.5, n_estimators=200, random_state=None)

In [20]:
ada_clf_pred = ada_clf.predict(X_test)
accuracy_score(y_test, ada_clf_pred)

0.89600000000000002

### Gradient Boosting

* Gradient Boosting works by sequentially adding predictors to an ensemble. Instead of reweighting instances as AdaBoost does, it fits each new predictor to the residual errors made by the previous predictor.



In [22]:
import numpy as np
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:,0]**2 + 0.05 * np.random.randn(100)

In [23]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=42,
           splitter='best')

In [24]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=42,
           splitter='best')

In [25]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=42,
           splitter='best')

In [26]:
X_new = np.array([[0.8]])

In [27]:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

In [28]:
y_pred

array([ 0.75026781])

In [29]:
# The easy way

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=3, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)

In [30]:
gbrt_pred = gbrt.predict(X_new)

In [31]:
gbrt_pred

array([ 0.75026781])

## Stacking (short for stacked generalization)

* This ensemble method is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation?
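Recent Scikit-Learn versions (0.22 and later) include `StackingClassifier`, which trains the aggregating model on cross-validated predictions of the base predictors. The sketch below is one possible setup on the moons data; the choice of base predictors, the logistic regression blender, and the name `stack_clf` are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    # The blender learns how to combine the base predictors' outputs.
    final_estimator=LogisticRegression(),
    cv=5,  # base predictions used to train the blender come from 5-fold CV
)
stack_clf.fit(X_train, y_train)

stack_accuracy = accuracy_score(y_test, stack_clf.predict(X_test))
print("stacking accuracy:", stack_accuracy)
```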