# Ensemble Learning

Ensemble learning is a technique in which we aggregate the predictions of a group of predictors, for example by predicting the class that gets the most votes.
A group of predictors is called an ensemble.


## Voting Classifier

In majority voting we create an ensemble of different classifiers, train them on the training set, and then predict the class that receives the most votes.

This majority voting classifier is known as a hard voting classifier.
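To make the voting rule concrete, here is a minimal sketch of hard voting done by hand with NumPy. The `predictions` array and the `hard_vote` helper are hypothetical and only illustrate the idea; they are not part of Scikit-Learn. Ties are broken in favour of the lowest class label, which is a choice made for this sketch.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per sample.
predictions = np.array([
    [0, 1, 1, 0],  # classifier A
    [0, 1, 0, 0],  # classifier B
    [1, 1, 0, 0],  # classifier C
])

def hard_vote(preds):
    """Return the most frequently predicted class for each sample (column)."""
    n_classes = int(preds.max()) + 1
    # counts has shape (n_classes, n_samples): votes per class for each sample
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    # argmax picks the class with the most votes (lowest label on a tie)
    return counts.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)
```

For the third sample, classifier A votes for class 1 but is outvoted by B and C, so the ensemble predicts class 0.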


In [30]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

make_moons

<function sklearn.datasets._samples_generator.make_moons(n_samples=100, *, shuffle=True, noise=None, random_state=None)>

In [39]:
X, y = make_moons(n_samples=1000,shuffle=True, noise=0.10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [32]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Ensemble three classifiers (logistic regression, random forest, SVC)
log_reg = LogisticRegression()
forest_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_reg), ('rd', forest_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)

In [33]:
# Compare each classifier's accuracy on the test set
from sklearn.metrics import accuracy_score

for clf in (log_reg, svm_clf, forest_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8787878787878788
SVC 0.9696969696969697
RandomForestClassifier 0.9393939393939394
VotingClassifier 0.9696969696969697


If all the classifiers are able to estimate class probabilities (i.e., they have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all individual classifiers. This is called soft voting.
It often achieves higher performance than hard voting because it gives more weight to highly confident votes. To use it, replace `voting="hard"` with `voting="soft"`.
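The difference between the two voting schemes can be sketched on a single sample. The probabilities below are made up for illustration: two classifiers lean weakly toward class 0, while a third is very confident in class 1.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample; one row per classifier.
probas = np.array([
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.10, 0.90],  # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard vote:", hard_winner)
print("soft vote:", soft_winner, "with mean probabilities", mean_proba)
```

Here hard voting picks class 0 (two votes to one), whereas soft voting picks class 1 because the confident vote from classifier C dominates the average.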

In [34]:
voting_clf_soft = VotingClassifier(
    # SVC is left out: with its default probability=False it has no predict_proba()
    estimators=[('lr', log_reg), ('rd', forest_clf)],
    voting='soft'
)
voting_clf_soft.fit(X_train, y_train)  # fit the soft voting ensemble

In [36]:
for clf in (log_reg,svm_clf, forest_clf, voting_clf, voting_clf_soft):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8787878787878788
SVC 0.9696969696969697
RandomForestClassifier 0.9696969696969697
VotingClassifier 0.9696969696969697
VotingClassifier 0.9696969696969697


## Bagging and Pasting

Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
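The two sampling schemes can be sketched with NumPy. This only illustrates how the training subsets are drawn; the sizes `n_train` and `n_subset` are arbitrary values chosen for the example, and this is not how Scikit-Learn implements it internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_subset = 10, 6  # hypothetical training-set and subset sizes

# Bagging: sample indices WITH replacement, so an instance may appear several times.
bagging_idx = rng.choice(n_train, size=n_subset, replace=True)

# Pasting: sample indices WITHOUT replacement, so every index is distinct.
pasting_idx = rng.choice(n_train, size=n_subset, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

Each predictor in the ensemble would get its own independently drawn subset of this kind.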

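The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set `bootstrap=False`). Setting `oob_score=True` additionally evaluates the ensemble on the out-of-bag instances, that is, the training instances a given predictor did not sample.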

The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (–1 tells Scikit-Learn to use all available cores):

In [48]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()
bag_clf = BaggingClassifier(
    tree_clf, n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True
)
bag_clf.fit(X_train, y_train)

In [49]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9818181818181818

In [50]:
bag_clf.oob_score_  # out-of-bag (OOB) accuracy estimate of the ensemble

0.9805970149253731

In [51]:
bag_clf.oob_decision_function_  # OOB class probabilities for each training instance

array([[0.04100228, 0.95899772],
       [0.06966292, 0.93033708],
       [0.07305936, 0.92694064],
       ...,
       [0.97742664, 0.02257336],
       [0.71917808, 0.28082192],
       [0.04418605, 0.95581395]])

## Random Forests

In [55]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

In [57]:
accuracy_score(y_test, y_pred_rf)

0.996969696969697