# Ensemble Learning and Random Forests

A group of predictors is called an ensemble; thus, this technique is called **Ensemble Learning**, and an Ensemble Learning algorithm is called an **Ensemble Method**.

## Voting Classifiers

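Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more. A simple way to combine them is to aggregate their predictions and predict the class that gets the most votes; this majority-vote scheme is called a hard voting classifier.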

In [1]:
import numpy as np

heads_proba = 0.51  # probability of heads for a slightly biased coin
# np.random.rand(10000, 10) returns 10,000 rows (tosses) by 10 columns (independent series)
# of uniform random numbers in [0, 1); values below heads_proba count as heads (1).
coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)
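The biased-coin simulation illustrates why many weak predictors can add up to a strong one: over enough tosses, the fraction of heads settles close to the underlying probability. The following is a minimal sketch of that idea, assuming 10,000 tosses in 10 independent series; the seed and the name `cumulative_heads_ratio` are choices made here for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an assumption, used only for reproducibility
heads_proba = 0.51
n_tosses, n_series = 10_000, 10

# 1 where the toss came up heads, 0 otherwise
coin_tosses = (rng.random((n_tosses, n_series)) < heads_proba).astype(np.int32)

# Running fraction of heads after each toss, one column per series
toss_counts = np.arange(1, n_tosses + 1).reshape(-1, 1)
cumulative_heads_ratio = coin_tosses.cumsum(axis=0) / toss_counts

# Fraction of heads in each series after the final toss
print(cumulative_heads_ratio[-1])
```

Each of the 10 final ratios should sit near 0.51, and a majority of heads becomes more likely as the number of tosses grows.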

In [2]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Generate a noisy two-class "moons" dataset.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Split into training and test sets; the fixed random seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators = [('lr',log_clf), ('rf',rnd_clf),('svc',svm_clf)],
    voting = 'hard'
)

voting_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

Let us look at each classifier's accuracy on the test set:

In [4]:
from sklearn.metrics import accuracy_score
for clf in (log_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__,accuracy_score(y_test,y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.896


In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators = [('lr',log_clf), ('rf',rnd_clf),('svc',svm_clf)],
    voting = 'soft'
)

voting_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()),
                             ('svc', SVC(probability=True))],
                 voting='soft')
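With `voting='soft'`, scikit-learn's `VotingClassifier` averages the class probabilities returned by each estimator's `predict_proba` and predicts the class with the highest average, which is why `SVC` needs `probability=True` here. The sketch below reproduces that averaging by hand on the same moons data; the `random_state` values and the names `avg_proba` and `manual_pred` are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# random_state values are an assumption, added only for reproducibility
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
voting_clf.fit(X_train, y_train)

# Average the class probabilities of the fitted sub-estimators by hand
avg_proba = np.mean(
    [est.predict_proba(X_test) for est in voting_clf.estimators_], axis=0
)
manual_pred = voting_clf.classes_[np.argmax(avg_proba, axis=1)]

# Compare the hand-computed predictions with the ensemble's own
print(np.array_equal(manual_pred, voting_clf.predict(X_test)))
```

Because soft voting weighs confident predictions more heavily, it can behave differently from hard voting when the estimators disagree.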

# Bagging and Pasting in Scikit-Learn

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train each one on a different random subset of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
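The difference between the two sampling schemes can be sketched with NumPy alone. The toy sizes, the seed, and the names `bagging_idx` and `pasting_idx` below are assumptions chosen for illustration; in scikit-learn, the same choice is made through the `bootstrap` parameter of `BaggingClassifier` (`True` for bagging, `False` for pasting).

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility
n_instances = 10  # toy training-set size

# Bagging: sample with replacement, so an index may appear more than once
bagging_idx = rng.choice(n_instances, size=10, replace=True)

# Pasting: sample without replacement, so every sampled index is unique
pasting_idx = rng.choice(n_instances, size=6, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

Each predictor in the ensemble would be trained on the instances selected by one such index array.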

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf=BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100,bootstrap=True,n_jobs=-1
)

bag_clf.fit(X_train,y_train)
y_pred=bag_clf.predict(X_test)


In [7]:
y_pred[0]

0

## Out of Bag Evaluation


In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),n_estimators=500,
    bootstrap=True,n_jobs=-1,oob_score=True
)

bag_clf.fit(X_train,y_train)
bag_clf.oob_score_

0.9013333333333333

According to this out-of-bag score, this Bagging Classifier is likely to achieve about 90.1% accuracy on the test set.

In [9]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test,y_pred)

0.912
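The out-of-bag score is derived from per-instance class probabilities, which a fitted `BaggingClassifier` exposes as `oob_decision_function_` when `oob_score=True`. The sketch below recomputes the score from that array; the `random_state` value and the names `oob_proba`, `oob_pred`, and `manual_oob_score` are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# random_state is an assumption, added only for reproducibility
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True, random_state=42,
)
bag_clf.fit(X_train, y_train)

# One row per training instance, one column per class; each row is the
# average prediction of the trees that never saw that instance in training
oob_proba = bag_clf.oob_decision_function_
oob_pred = bag_clf.classes_[np.argmax(oob_proba, axis=1)]
manual_oob_score = np.mean(oob_pred == y_train)

print(oob_proba[:3])
print(manual_oob_score, bag_clf.oob_score_)
```

Since every training instance is left out of roughly a third of the bootstrap samples, this gives a validation estimate without setting aside a separate validation set.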

# Random Forests

In [10]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)
rnd_clf.fit(X_train,y_train)

y_pred_rf=rnd_clf.predict(X_test)

In [11]:
y_pred

array([1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])

## A Bagging Classifier Equivalent to the Previous Random Forest

In [12]:
# Bagging classifier built from decision trees that use random splits

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random",max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1
)
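To check how closely the two models behave, both can be fitted on the same training set and their test predictions compared. This is a rough sketch rather than a proof of equivalence: the `random_state` values and the names `y_pred_bag` and `agreement` are assumptions added for illustration, and `n_jobs` is omitted for simplicity.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# random_state values are an assumption, added only for reproducibility
rnd_clf = RandomForestClassifier(
    n_estimators=500, max_leaf_nodes=16, random_state=42
)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, random_state=42,
)

rnd_clf.fit(X_train, y_train)
bag_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
y_pred_bag = bag_clf.predict(X_test)

# Fraction of test instances on which the two ensembles predict the same class
agreement = np.mean(y_pred_rf == y_pred_bag)
print(agreement)
```

The two ensembles are built from different random trees, so the predictions need not match exactly, but a high agreement rate suggests they learn similar decision boundaries.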