<center><h2>Ensemble Learning, The Wisdom of the Crowds</h2></center>

<center><img src="https://www.ibmbigdatahub.com/sites/default/files/styles/open_graph_image/public/blog_iotcrowd_thumbnail.jpg?itok=XQEqPMWi" width="90%"/></center>

By The End Of This Session You Should Be Able To:
----

- Define ensembling in your own words.
- Explain why ensembling is a useful ML technique.
- Define and give examples of the 3 most common ensembling methods:
    1. Stacking
    2. Bagging
    3. Boosting

Ensemble Methods, aka Ensembling
------

Combine multiple ML models to obtain better predictive performance than any single model could achieve alone.

Techniques for combining several weak learners to produce a single strong learner.

How the weak become strong together
-------

<center><img src="http://jasonleaster.github.io/images/img_for_2015_12_13/stump.png" width="55%"/></center>

Suppose we have a decision stump classifier with 70% accuracy, but the Bayes error is 0%, so a perfect classifier is theoretically possible.  
Is this good enough?

Student Activity: Think, Pair, & Share
------

Suppose we have 3 completely independent decision stump classifiers each with an accuracy of 70%.

If we take a majority vote, what is the accuracy?

$$\binom{3}{2}(.7)^2(.3)^1 + \binom{3}{3}(.7)^3(.3)^0$$

In [None]:
%reset -fs

In [13]:
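import operator as op
from functools import reduce

def n_choose_k(n, k):
    """Return the binomial coefficient C(n, k) using exact integer arithmetic."""
    k = min(k, n-k)  # C(n, k) == C(n, n-k); the smaller k means fewer multiplications
    numerator = reduce(op.mul, range(n, n-k, -1), 1)  # n * (n-1) * ... * (n-k+1)
    denominator = reduce(op.mul, range(1, k+1), 1)    # k!
    return numerator//denominator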

In [14]:
from scipy.special import binom

In [15]:
majority_accuracy = ((n_choose_k(3, 2)*(.7**2)*(.3**1)) + 
                     (n_choose_k(3, 3)*(.7**3)*(.3**0)))
print(f"{majority_accuracy:.1%}")

78.4%


In [None]:
majority_accuracy = ((binom(3, 2)*(.7**2)*(.3**1)) + 
                     (binom(3, 3)*(.7**3)*(.3**0)))
print(f"{majority_accuracy:.1%}")

Suppose we now have 5 completely independent decision stump classifiers each with an accuracy of 70%.

What is the majority vote accuracy?

$$\binom{5}{3}(.7)^3(.3)^2 + \binom{5}{4}(.7)^4(.3)^1 + \binom{5}{5}(.7)^5(.3)^0$$

In [16]:
majority_accuracy = ((n_choose_k(5, 3)*(.7**3)*(.3**2)) + 
                     (n_choose_k(5, 4)*(.7**4)*(.3**1)) + 
                     (n_choose_k(5, 5)*(.7**5)*(.3**0)))
print(f"{majority_accuracy:.1%}")

83.7%


If we had 101 such classifiers, the majority vote accuracy would exceed 99.9%!
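The same calculation generalizes to any odd number of independent classifiers. The sketch below uses `math.comb` from the standard library. The function name `majority_vote_accuracy` is made up for illustration, and full independence is an idealized assumption that real classifiers rarely satisfy.

```python
from math import comb

def majority_vote_accuracy(n_classifiers, accuracy):
    """Probability that a strict majority of independent classifiers is correct.

    Assumes an odd number of classifiers (no ties) that err independently.
    """
    majority = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * accuracy**k * (1 - accuracy)**(n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )

for n in (3, 5, 101):
    print(f"{n:>3} classifiers: {majority_vote_accuracy(n, 0.7):.3%}")
```

In practice the gain is smaller, because models trained on the same data tend to make correlated errors.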

Bagging, aka Bootstrap Aggregating
----

<center><img src="https://www.oreilly.com/library/view/python-machine-learning/9781783555130/graphics/3547_07_06.jpg" width="50%"/></center>

Fit multiple models in parallel and independently.

Each model gets a vote on the final prediction.

Bootstrap Statistical Procedure
-----

<center><img src="https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2016/10/bootstrap-sample.png" width="75%"/></center>

Draw many random samples from the data with replacement, then compute the statistic of interest on each sample.

Bootstrap Sampling Steps
-----

1. Start with your dataset of size $n$
1. Sample from your dataset with replacement to create 1 bootstrap sample of size $n$. Because sampling is with replacement, many observations will be repeated and others will be left out
1. Repeat $B$ times to get $B$ bootstrap samples
1. Each bootstrap sample can then be used as a separate dataset for model fitting
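The steps above can be sketched with the standard library alone. The data values, the choice of `B = 1000`, and the use of the mean as the statistic are all made up for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]  # made-up observations
n = len(data)
B = 1000  # number of bootstrap samples

bootstrap_means = []
for _ in range(B):
    sample = random.choices(data, k=n)  # draw n items with replacement
    bootstrap_means.append(mean(sample))

print(f"Mean of the original data: {mean(data):.2f}")
print(f"Mean of bootstrap means:   {mean(bootstrap_means):.2f}")
```

For bagging, each `sample` would be used to fit a separate model instead of computing a mean.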

Bootstrapping
-----

<center><img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/image20-768x289.png" width="75%"/></center>

Bagging
----

<center><img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/Screenshot-from-2018-05-08-13-11-49-768x580.png" width="55%"/></center>

You can bag with any collection of algorithms, giving each one a vote in the final prediction.

Although bagging is usually applied to tree-based algorithms, it can be used with any type of algorithm.

For regression problems (predicting a continuous value), we average the values given by all the models.

For classification problems (predicting a categorical value), we choose the label with the most votes.
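A minimal sketch of these two aggregation rules, using only the standard library. The function names and the predictions are hypothetical, and each list stands for the outputs of five models on a single data point.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(predictions):
    """Bagged regression output: the average of the models' predictions."""
    return mean(predictions)

def aggregate_classification(predictions):
    """Bagged classification output: the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five models for a single data point
print(aggregate_regression([3.1, 2.9, 3.4, 3.0, 3.1]))
print(aggregate_classification(["setosa", "virginica", "setosa", "setosa", "versicolor"]))
```

When labels tie, `Counter.most_common` returns the label it saw first, so an odd number of models is a simple way to avoid ties in binary problems.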

In [None]:
%reset -fs

In [None]:
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [17]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

In [18]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

lr_clf = LogisticRegression()
dt_clf = DecisionTreeClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators = [('lr', lr_clf), 
                                            ('dt', dt_clf), 
                                            ('svc',svm_clf)],
                              voting = 'hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test) # 🍾

1.0

Source: https://towardsdatascience.com/ensemble-learning-in-machine-learning-getting-started-4ed85eb38e00

Types of Voting
------

__Hard voting__: Each classifier in the ensemble predicts a class label, and the final prediction is the label that wins a simple majority vote.

__Soft voting__: Can only be done when all your classifiers can calculate probabilities for the outcomes. 

Soft voting averages the probabilities calculated by the individual classifiers and predicts the class with the highest average probability.
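The two voting schemes can disagree. The sketch below uses hypothetical probabilities from three classifiers on one data point, where a single very confident classifier outweighs two lukewarm ones under soft voting.

```python
from collections import Counter

# Hypothetical class probabilities [P(class 0), P(class 1)] from three classifiers
probabilities = [
    [0.90, 0.10],  # classifier 1: very confident in class 0
    [0.40, 0.60],  # classifier 2: leans toward class 1
    [0.45, 0.55],  # classifier 3: leans toward class 1
]

# Hard voting: each classifier votes for its most probable class
votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest
n_classes = len(probabilities[0])
average = [sum(p[c] for p in probabilities) / len(probabilities) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=average.__getitem__)

print("Hard voting predicts class", hard_prediction)
print("Soft voting predicts class", soft_prediction)
```

In scikit-learn, soft voting is requested with `voting='soft'` on `VotingClassifier`, and `SVC` needs `probability=True` to supply probabilities.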

Why do Bagging?
------

- Usually improves performance on the evaluation metric.
- Less likely to overfit, since each model is fit to a different bootstrap resample of the dataset.
- Improves stability of estimates. If ML people ever made error bars, they would be smaller.

Stacking (not just for Transformers) 
-----

An ensemble learning technique that uses predictions from previous models to build a new model. 

Two variations:

1. Pipeline
1. Metalearner

Stacking with a Metalearner
-----

<center><img src="https://hsto.org/getpro/habr/post_images/c28/f6a/e29/c28f6ae298041c65eba7a97d3fbcce8e.png" width="75%"/></center>

The metalearner learns how best to combine the predictions of the base learners.
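A minimal sketch of this idea with scikit-learn's `StackingClassifier`, on the iris data. The choice of base learners, the logistic regression metalearner, and `cv=5` are assumptions made for illustration, not a recommended configuration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base learners make predictions; the metalearner is trained on those predictions
stack_clf = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=42)),
                ("svc", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold predictions are used to train the metalearner
)
stack_clf.fit(X_train, y_train)
stacking_accuracy = stack_clf.score(X_test, y_test)
print(f"{stacking_accuracy:.1%}")
```

Training the metalearner on out-of-fold predictions keeps it from simply trusting whichever base learner memorized the training set.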

Another common example of Stacking
-----

1. First, cluster the data (or apply topic modeling). 
2. Then fit a separate classifier for each cluster. Different clusters may have different feature importances.
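These two steps might look like the following sketch on the iris data. `KMeans` with 2 clusters and a decision tree per cluster are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: cluster the training data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
train_clusters = kmeans.labels_

# Step 2: fit a separate classifier on the points in each cluster
models = {}
for c in np.unique(train_clusters):
    mask = train_clusters == c
    models[c] = DecisionTreeClassifier(random_state=42).fit(X_train[mask], y_train[mask])

# Prediction: route each test point to the classifier of its cluster
test_clusters = kmeans.predict(X_test)
y_pred = np.empty_like(y_test)
for c, model in models.items():
    mask = test_clusters == c
    if mask.any():
        y_pred[mask] = model.predict(X_test[mask])

accuracy = (y_pred == y_test).mean()
print(f"{accuracy:.1%}")
```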

Challenge Question
-------

If a data point is incorrectly predicted by several of the models for a systematic reason, will combining the predictions provide better results? 

No. We need a method that corrects errors across models.

Boosting
----

<center><img src="https://littleml.files.wordpress.com/2017/03/boosted-trees-process.png" width="100%"/></center>

A sequential process, where each subsequent model attempts to correct the errors of the previous models. 
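One way to picture this is to fit each new model to the residuals left by the models before it. The sketch below does that in the style of gradient boosting for squared error, on made-up 1-D data. `fit_stump`, `predict_stump`, and the learning rate of 0.5 are illustrative choices, not part of any library.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate splits with both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

xs = [1, 2, 3, 4, 5, 6, 7, 8]                    # made-up feature values
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 5.1]    # made-up targets

learning_rate = 0.5
predictions = [sum(ys) / len(ys)] * len(xs)  # start from the mean of the targets
mse_history = []
for _ in range(20):
    residuals = [y - p for y, p in zip(ys, predictions)]  # errors of the ensemble so far
    stump = fit_stump(xs, residuals)                       # new model targets those errors
    predictions = [p + learning_rate * predict_stump(stump, x)
                   for p, x in zip(predictions, xs)]
    mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

print(f"MSE after 1 stump:   {mse_history[0]:.4f}")
print(f"MSE after 20 stumps: {mse_history[-1]:.4f}")
```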

Boosting algorithms
-----

- XGBoost (Currently, the most popular)
- AdaBoost
- Gradient Boosting Machine (GBM)


<center><h2> ML 2 will cover boosting</h2></center>

When should you choose Stacking?
------

It is generally a good idea to pipe the outputs of one model into the inputs of another model.

However, creating a meta-learner that is able to choose among heterogeneous models is often too complex for the gain in performance.

When should you choose Bagging?
------

If you have time and enough data, bagging is a good choice because it rarely hurts model performance and usually improves it.

When should you choose Boosting?
------

Boosting is a good idea if the highest possible model performance is required. 

However, it is more complex than Bagging (harder to implement and harder to debug).

Check for understanding
-----

Will Bagging increase or decrease Bias?

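Mostly neither. Bagging leaves Bias roughly unchanged, since every model is the same kind of learner fit to a resample of the same data.

The final prediction error is still often lower than that of any individual model, mainly because Variance drops.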

Check for understanding
-----

Will Bagging increase or decrease Variance?

Decrease. Bagging rarely increases Variance and most of the time decreases it.

By combining predictions, the ensemble is less likely to overfit to the specific attributes of any one training set. 
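A small simulation can illustrate the effect. It makes an idealized assumption: each model's prediction is the true value plus independent Gaussian noise. Bagged models are correlated in practice, so the real reduction is smaller. The function name and all numbers are made up for illustration.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the sketch is reproducible

def noisy_prediction(true_value=10.0, noise_sd=2.0):
    """One model's prediction: the truth plus independent Gaussian noise."""
    return random.gauss(true_value, noise_sd)

n_trials = 2000
n_models = 25

single = [noisy_prediction() for _ in range(n_trials)]
bagged = [mean(noisy_prediction() for _ in range(n_models)) for _ in range(n_trials)]

print(f"Variance of a single model:     {pvariance(single):.3f}")
print(f"Variance of a {n_models}-model average: {pvariance(bagged):.3f}")
```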

Check for understanding
-----

How does Bagging sample the training data?  
How does Boosting sample the training data?

Bagging samples uniformly at random, with replacement.

Boosting gives higher preference to data points that earlier models predicted incorrectly. 
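As one concrete example of that preference, here is a sketch of an AdaBoost-style weight update for binary classification. Other boosting methods work differently, and the function name and the five-sample scenario are hypothetical. The update assumes the weak learner's weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One AdaBoost-style weight update for binary classification.

    weights: current sample weights (summing to 1)
    is_correct: whether the current weak learner got each sample right
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # the weak learner's say in the final vote
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.2] * 5
is_correct = [True, True, False, True, True]  # hypothetical: only the third sample is wrong
new_weights, alpha = adaboost_reweight(weights, is_correct)
print([round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample holds half of the total weight, so the next weak learner is pushed to get it right.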

Summary
------

- Ensembles are collections of models that often perform better than any individual model.
- Stacking is chaining the output of one model as the input of another model.
- Boosting is where subsequent models learn to correct the errors of earlier models.
- Bagging is repeatedly sampling the training data and fitting a model to each resampling.