In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

##Import dataset

In [2]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#standard scaling
X_train = (X_train - np.mean(X_train))/np.std(X_train)
X_test = (X_test - np.mean(X_test))/np.std(X_test)

The dataset is composed by the following feature

#Ensamble Learning
combine few good predictors to get one more accurate one predctor (wisdom of the crowd)

###Voting classifier
aggregate the predictions of each classifier and predict the class that gets the most votes. This **majority-vote** classifier is called hard classifier

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [4]:
#set three classifiers
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

#aggregate the three classifiers 
voting_clf = VotingClassifier(
    estimators = [("lr",log_clf), ("rf",rnd_clf),("svc",svm_clf)],
    voting = "hard"
)

Lets look at each classifier accuracy on the test set

In [5]:
from sklearn.metrics import accuracy_score

In [6]:
for clf in (log_clf ,rnd_clf ,svm_clf, voting_clf):
  clf.fit(X_train,y_train)
  y_preds = clf.predict(X_test)
  print(f"{clf.__class__.__name__} {accuracy_score(y_preds,y_test)}")

LogisticRegression 0.84
RandomForestClassifier 0.92
SVC 0.92
VotingClassifier 0.928


The voting classifiers looks like the best amog the previous!
If every classifier in the ensable has a method to ouptut the probabilities of each class, you can do the same with the entire ensamble using the "soft" method. Usually this obtains better results.

#Bagging and Pasting

These approaches use he same training algorithm for every predictor, but train them on different subsets of the training set. When sampling is performed **with** replacement, the method is called bagging, pasting otherwise

###Bagging of decision trees
bagging of decision trees, is able to generalize so much better than a single decision tree

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators = 500,
    max_samples = 1.0, bootstrap = True, n_jobs = -1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_pred, y_test)

0.904

####**Out of bag evaluation**
Some instances may be sampled several times during bootstrapping, while others may not be sampled at all, these are called out-of-bag instances.

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set

In [9]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators = 500,
    max_samples = 1.0, bootstrap = True, n_jobs = -1, oob_score = True)

bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=500,
                  n_jobs=-1, oob_score=True)

In [10]:
bag_clf.oob_score_

0.8986666666666666

In [11]:
accuracy_score(bag_clf.predict(X_test), y_test)

0.904

The oob score its pretty accurate!

###Random Forest
we can simply recrate the previous bagging using a RandomForestClassifier class


In [12]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators= 500, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.896

It is also possible to use random forests that uses random thresholds for each feature. These are called Extreme Random Forests and are provided by the scikit library as ExtraTrees.
ExtraTrees trades more bias for lower variance.

- The **bias** error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).


In [13]:
from sklearn.ensemble import ExtraTreesClassifier

rnd_clf = ExtraTreesClassifier(n_estimators= 500, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.904

Random forests can also be used for **Feature Selection**. In fact a feature is important if it is able to create subsets with as much purity as possible.
We can access the feature importance via **feature_importances_** variable.

In [15]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

for name,score in zip(iris["feature_names"], rnd_clf.feature_importances_):
  print(name,score)

sepal length (cm) 0.10640985970854759
sepal width (cm) 0.025253139106723152
petal length (cm) 0.4333483985051396
petal width (cm) 0.43498860267958966


#Boosting
This is another ensemble solution. The most famous boost methods are AdaBoosting and Gradient Boost. 

##AdaBoost
Each new predictor (model) in the esnemble should focus on correct the instances that its predecessor underfitted, weighting the missclassified instances.
The boosting cannot be parallelized, because each predictor should wait for the previous one.
In scikit learn the "SAMME" algorithm is used for multiclass labels AdaBoost. While "SAMME.R" relies on probabilities instead of predictions, usually performs better.

In [16]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth = 1), n_estimators = 250,
    algorithm = "SAMME.R", learning_rate = 0.5
)

ada_clf.fit(X_train,y_train)
y_preds = ada_clf.predict(X_test)
accuracy_score(y_preds, y_test)


0.912

##Gradient Boosting
Similar to AdaBoosting but instead of working on the weights, each predictor tries to fit the residuals errors of the previous predictor.
Let's see how implement it from scratch.
The implementation can be optimized adopting an early stopping.

In [19]:
from sklearn.ensemble import GradientBoostingClassifier

In [20]:
gbrt = GradientBoostingClassifier(max_depth = 2, n_estimators = 3, learning_rate = 1.0)

gbrt.fit(X_train, y_train)
y_preds = gbrt.predict(X_test)
accuracy_score(y_preds, y_test)


0.904

An optimized implementation of the gradient boosting is provided by the **XGBoost** library.

In [21]:
import xgboost

In [23]:
xgb_reg = xgboost.XGBClassifier()

xgb_reg.fit(X_train, y_train)
y_preds = xgb_reg.predict(X_test)
accuracy_score(y_preds,y_test)

0.92

#Stacking
This is the last ensemble method. Instead of aggregating the predictors with trivial methods like majority voting, we train a model to perform the aggregation. Each tree predicts a certain value, and the final predictor called blender or meta-learner takes these predictions and output the final value.