<a href="https://colab.research.google.com/github/Richish/hands_on_ml/blob/master/7_1_ensemble_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hard voting classifier

Suppose you have trained a few classifiers, each one achieving about x% accuracy.
A very simple way to create an even better classifier is to aggregate the predictions of all the classifiers and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
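
To make the majority vote concrete, here is a minimal NumPy sketch. The `predictions` array and the `hard_vote` helper are made up for illustration (they are not part of scikit-learn); each row holds one classifier's predicted labels for five instances.

```python
import numpy as np

# Hypothetical predictions from three classifiers on five instances
# (rows = classifiers, columns = instances).
predictions = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0],
])

def hard_vote(predictions):
    """Return the most frequent label per instance (ties go to the lowest label)."""
    n_classes = predictions.max() + 1
    # Count the votes each class receives for every instance (one column at a time).
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    # counts has shape (n_classes, n_instances); pick the class with the most votes.
    return counts.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)
```

With an odd number of binary classifiers there are no ties, so the tie-breaking rule only matters for other setups.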

Even if each classifier is a weak learner (meaning
it does only slightly better than random guessing), the ensemble can still be a
strong learner (achieving high accuracy), provided there are a sufficient number of
weak learners and they are sufficiently diverse, i.e. they tend to make different, largely uncorrelated errors.



## Analogy from probability theory:
Suppose you have a slightly biased coin that has a 51% chance of coming up heads,
and 49% chance of coming up tails. If you toss it 1,000 times, you will generally get
more or less 510 heads and 490 tails, and hence a majority of heads. If you do the
math, you will find that the probability of obtaining a majority of heads after 1,000
tosses is close to 75%. The more you toss the coin, the higher the probability (e.g.,
with 10,000 tosses, the probability climbs over 97%). This is due to the law of large
numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the
probability of heads (51%).
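
The numbers above can be checked with a short binomial computation. This is a sketch using only the standard library; `prob_majority_heads` is a helper name introduced here, and it counts a strict majority (more heads than tails), so the result for 1,000 tosses can differ by a couple of percentage points from a figure that also counts an exact 500-500 split.

```python
import math

def prob_majority_heads(n_tosses, p_heads):
    """P(number of heads > n_tosses / 2), summed in log space to avoid underflow."""
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # log of the binomial pmf: log C(n, k) + k log p + (n - k) log(1 - p)
        log_pmf = (
            math.lgamma(n_tosses + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_tosses - k + 1)
            + k * math.log(p_heads)
            + (n_tosses - k) * math.log(1 - p_heads)
        )
        total += math.exp(log_pmf)
    return total

p_1000 = prob_majority_heads(1_000, 0.51)
p_10000 = prob_majority_heads(10_000, 0.51)
print("1,000 tosses: {:.4f}".format(p_1000))
print("10,000 tosses: {:.4f}".format(p_10000))
```

Keep in mind that the analogy assumes independent tosses; classifiers trained on the same data are not independent, which is why diversity matters.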

## Example of voting classifier on moons dataset

We will use an ensemble of an SVC, a random forest, and a logistic regression classifier.

In [1]:
# data prep
from sklearn.datasets import make_moons

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=10_000, noise=0.4, random_state=42)
X.shape, y.shape

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test

(array([[-0.56413534,  0.29283681],
        [-1.16033479,  0.96512577],
        [-0.06598769, -0.15191052],
        ...,
        [ 0.38876425, -0.78662881],
        [ 2.50492832,  0.21133631],
        [ 0.35428745,  0.74582457]]), array([[ 0.69945888, -0.8734481 ],
        [ 1.7764418 ,  0.13222334],
        [-1.14450821,  0.24446319],
        ...,
        [ 0.66336269,  0.79833307],
        [-0.6493245 ,  1.19920859],
        [-0.09883144,  0.40961263]]), array([0, 0, 1, ..., 1, 1, 0]), array([1, 1, 0, ..., 0, 0, 0]))

In [2]:
# training each classifier with no hyperparameter tuning
lr_clf = LogisticRegression()
rf_clf = RandomForestClassifier()
svc_clf = SVC()

lr_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
svc_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
# seeing the individual performance for each classifier
for clf in (lr_clf, rf_clf, svc_clf):
    y_pred = clf.predict(X_test)
    acc_score = accuracy_score(y_true=y_test, y_pred=y_pred)
    print("{}: {}".format(clf.__class__.__name__, acc_score))



LogisticRegression: 0.8415
RandomForestClassifier: 0.8545
SVC: 0.874


In [3]:
# checking performance of a voting classifier based on exact same models

voting_clf = VotingClassifier(estimators=[('lr', lr_clf), ('rf', rf_clf), ('svc', svc_clf)], voting='hard')
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
acc_score = accuracy_score(y_true=y_test, y_pred=y_pred)
print("voting clf: {}".format(acc_score)) 
# A voting classifier often scores higher than each of its constituents, though that did not happen in this particular example.
# Here the SVC alone is simply a very good fit for this data pattern.


voting clf: 0.87


## Soft voting classifier
Instead of counting votes as in a hard voting classifier, here we ask each classifier for its class probabilities (predict_proba), average them across classifiers, and predict the class with the highest average probability. In essence, more weight is given to high-confidence votes.

Often, soft voting classifiers perform better than hard voting ones.
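
A small NumPy sketch of how the two schemes can disagree. The probabilities below are invented for a single instance; they stand in for what three classifiers' `predict_proba` calls might return.

```python
import numpy as np

# Hypothetical predict_proba outputs for ONE instance from three classifiers
# (rows = classifiers, columns = classes 0 and 1).
probas = np.array([
    [0.45, 0.55],   # weakly prefers class 1
    [0.45, 0.55],   # weakly prefers class 1
    [0.95, 0.05],   # strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_winner = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_winner = probas.mean(axis=0).argmax()

print("hard:", hard_winner, "soft:", soft_winner)
```

Two lukewarm votes for class 1 win under hard voting, while the single confident vote for class 0 dominates the average under soft voting.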

In [6]:
# checking performance of a soft voting classifier
svc_clf_with_prob = SVC(probability=True)  # enables SVC's predict_proba method, which is off by default
soft_voting_clf = VotingClassifier(estimators=[('lr', lr_clf), ('rf', rf_clf), ('svc', svc_clf_with_prob)], voting='soft')
soft_voting_clf.fit(X_train, y_train)
y_pred_soft = soft_voting_clf.predict(X_test)
acc_score_soft = accuracy_score(y_true=y_test, y_pred=y_pred_soft)
print("voting clf: {}".format(acc_score_soft)) 

voting clf: 0.8715


# Bagging and pasting 
These scale really well.

These work like the hard/soft voting classifiers above, with one difference: instead of using a different algorithm for each predictor, every predictor uses the same algorithm (SVC, random forest, or some other). What differs is the training data: each predictor is trained on a different random sample of the training set, so the data fed to predictor 1 differs from the data fed to predictor 2.
All the predictors can be trained in parallel. Once they are trained, each one makes its prediction, and the results are combined in the same way as before (hard or soft voting).

Bagging is short for bootstrap aggregating.

## Difference between bagging and pasting:
Bagging: sampling is performed with replacement.

Pasting: sampling is performed without replacement.

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
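
The difference comes down to one flag when drawing the training indices for a single predictor. A minimal sketch with NumPy; the toy sizes `n_train` and `n_draw` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_draw = 10, 6  # toy sizes, chosen only for illustration

# Bagging: draw with replacement, so the same index may appear more than once.
bagging_idx = rng.choice(n_train, size=n_draw, replace=True)

# Pasting: draw without replacement, so every index in the sample is distinct.
pasting_idx = rng.choice(n_train, size=n_draw, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

Repeating either draw for each predictor gives every predictor its own subset, which is where the ensemble's diversity comes from.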

## How to combine the results after individual predictions:
Once all predictors are trained, the ensemble can make a prediction for a new
instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

## Why is bagging/pasting so scalable:
Predictors can all be trained in parallel, on different
CPU cores or even different servers. Similarly, predictions can be made in parallel.
This is one of the reasons why bagging and pasting are such popular methods: they scale very well.
## Bagging using sklearn
The following code trains an
ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly
sampled from the training set with replacement (this is an example of bagging,
but if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter
tells Scikit-Learn the number of CPU cores to use for training and predictions
(-1 tells Scikit-Learn to use all available cores).

The BaggingClassifier automatically performs soft voting
instead of hard voting if the base classifier can estimate class probabilities
(i.e., if it has a predict_proba() method), which is the case
with Decision Tree classifiers.

In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

acc_score = accuracy_score(y_true=y_test, y_pred=y_pred)
acc_score


0.8715

## Which one to prefer, bagging or pasting:
Bootstrapping introduces a bit more diversity in the subsets that each predictor is
trained on, so bagging ends up with a slightly higher bias than pasting, but this also
means that predictors end up being less correlated so the ensemble’s variance is
reduced. Overall, bagging often results in better models, which explains why it is generally
preferred.

## Out of bag evaluation

With a bagging classifier, training instances are sampled at random with replacement, so for any given predictor some instances are sampled several times while others are never sampled at all. In fact, if the sample size equals the training set size (the default for BaggingClassifier), probability theory says that only about 63% of the training instances are sampled on average for each predictor; the remaining 37% or so are that predictor's out-of-bag (oob) instances. Note that they are not the same 37% for every predictor.
Since a predictor never sees its oob instances during training, they can be treated as a validation set: each predictor is evaluated on its own oob instances (before we predict/evaluate on the test set).
This process is called out-of-bag evaluation
(evaluation on the samples that were left out of the training bag).
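
The 63% figure follows from a short calculation: when drawing m instances with replacement from m, the chance that a given instance is never drawn is (1 - 1/m)^m, which approaches 1/e (about 37%) as m grows. The sketch below checks this by simulation; m = 8,000 matches the training set size used earlier (80% of 10,000), and the seed is arbitrary.

```python
import numpy as np

m = 8_000  # training set size used above (80% of 10,000 samples)
rng = np.random.default_rng(0)

# Draw one bootstrap sample: m indices chosen with replacement.
sample = rng.integers(0, m, size=m)

# Fraction of distinct training instances that made it into the bag.
in_bag_fraction = np.unique(sample).size / m

# Theoretical value: 1 - (1 - 1/m)^m, which tends to 1 - 1/e for large m.
expected_fraction = 1 - (1 - 1 / m) ** m

print("simulated in-bag fraction:", in_bag_fraction)
print("theoretical in-bag fraction:", expected_fraction)
```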

In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to
request an automatic oob evaluation after training.
The resulting evaluation score is available through the oob_score_ attribute.
In [11]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_ 
# The oob score is generally a good indicator of test-set accuracy, since each instance is scored only by predictors that never saw it during training.

0.836375

In [12]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.85

The oob decision function for each training instance is also available through the
oob_decision_function_ attribute. In this case (since the base estimator has a
predict_proba() method) the decision function returns the class probabilities for each
training instance.

In [13]:
bag_clf.oob_decision_function_

array([[0.93604651, 0.06395349],
       [0.99393939, 0.00606061],
       [0.32820513, 0.67179487],
       ...,
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.74054054, 0.25945946]])

## Random Patches and Random Subspaces
These are all about feature sampling, which is especially useful when the number of features is large, e.g. in image classification.

In sklearn's BaggingClassifier, feature sampling is controlled by max_features and bootstrap_features (these work
the same way as max_samples and bootstrap, but for features instead of samples).

### Random Patches: 
Sampling both training instances and features is called the Random
Patches method.

### Random Subspaces
Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.

### Advantage:
Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
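
A minimal sketch of both settings with BaggingClassifier. The moons data has only two features, so this example assumes a synthetic 20-feature dataset from `make_classification`; the dataset, the variable names, and the specific fractions (0.5 and 0.7) are illustrative choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset with enough features for feature sampling to matter.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# Random Subspaces: keep all training instances, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

subspaces_clf.fit(X_demo, y_demo)
patches_clf.fit(X_demo, y_demo)

# estimators_features_ lists the feature indices drawn for each base estimator.
n_features_per_tree = [len(f) for f in subspaces_clf.estimators_features_]
print(n_features_per_tree[:5])
```

Each tree sees only half of the feature columns here, so the trees differ from one another even when they share the same training instances.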