# Ensemble Methods
Ensemble methods combine multiple machine learning models. The models can all be of the same type or completely different from one another!
<img width=50% src='img/captain_planet.jpg'/>

> "With our powers combined..."

## Some Types of Ensembles
We can break up the different types of ensembles into a few main types.

### Stacking
<img src='img/stacking.jpg' width=50%/>

#### Different Models, Same Data
- A form of averaging multiple models
- Typically uses the same training data for every model
- The innovation comes from the use of different kinds of models

#### Meta-Classifier/Meta-Regressor
- First, we ask several different models to make predictions about the target
- Rather than taking a simple average or vote to determine the outcome, feed these results into a final model that makes the prediction based on the other models’ predictions
- If it seems like we are approaching a neural network...you are correct!
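
The meta-classifier idea above can be sketched with scikit-learn's `StackingClassifier`. This is only a minimal sketch on synthetic data from `make_classification`. The choice of base models, the logistic regression final model, and all `*_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any tabular classification set would do)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# First layer: several different kinds of models trained on the same data
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
]

# Second layer: a final model that learns from the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_tr, y_tr)

# transform() returns the base-model outputs that are fed to the final model
meta_features = stack.transform(X_te)
print(meta_features.shape)
print(f'Stacked accuracy: {stack.score(X_te, y_te):.3f}')
```

The final model never sees the original features here, only what the base models predicted.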

### Bagging 
![](img/bag_of_marbles.jpg)
> Train weak learners, combine together into one via voting


- Many models naturally overfit
- Randomization → New models
- New models overfit in different ways
- Aggregation → Smooth over different ways of overfitting to reduce variance

#### Aggregation
- **B**ootstrap **AGG**regating
- Algorithm to repeat many times:
    + Create a sample from your data
    + Train a model (e.g. a decision tree) on that sample
- Final model comes by averaging over those many models
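
The loop above can be written out by hand in a few lines. This is a rough sketch on synthetic data, assuming 25 trees and a simple majority vote. In practice `BaggingClassifier` does this bookkeeping for us.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # hypothetical number of trees (odd, so votes cannot tie)
trees = []
for _ in range(n_models):
    # Bootstrap: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Aggregate: average the 0/1 predictions and take the majority class
votes = np.mean([tree.predict(X_te) for tree in trees], axis=0)
bagged_pred = (votes > 0.5).astype(int)

print(f'Single tree accuracy:  {trees[0].score(X_te, y_te):.3f}')
print(f'Bagged trees accuracy: {np.mean(bagged_pred == y_te):.3f}')
```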

#### 3 Options:
1. Train each model on random sample
2. Choose a random set of features at each decision point
3. Choose a path at random!

### Boosting
<img src="img/try_fail_success.jpg" width=50%/>

> New model attempts to predict where the previous model made mistakes

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. The steps below show how boosting works.

1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd1-e1526989432375.png)

5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights. (Here, the three misclassified blue-plus points will be given higher weights.)
7. Another model is created and predictions are made on the dataset. (This model tries to correct the errors from the previous model.)
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd2-e1526989487878.png)

8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/11/boosting10-300x205.png)

Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd4-e1526551014644.png)
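
The reweighting steps above can be sketched as an AdaBoost-style loop. This is a simplified illustration rather than a reference implementation. It assumes labels encoded as -1/+1, uses depth-1 trees ("stumps") as weak learners, and fits each stump on the full data with sample weights instead of drawing a subset. The data is synthetic and the 20 rounds are an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; labels recoded to -1/+1 for the weight update
X_demo, y01 = make_classification(n_samples=400, n_features=8, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)

n = len(y_demo)
weights = np.full(n, 1 / n)  # all points start with equal weight
stumps, alphas = [], []

for _ in range(20):  # hypothetical number of boosting rounds
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this weak learner (clipped to avoid division by zero)
    err = np.clip(weights[pred != y_demo].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's say in the final vote

    # Misclassified points (y * pred == -1) gain weight, correct ones lose weight
    weights *= np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Weighted vote of all weak learners."""
    total = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(total >= 0, 1, -1)

print(f'First stump accuracy: {stumps[0].score(X_demo, y_demo):.3f}')
print(f'Boosted accuracy:     {np.mean(boosted_predict(X_demo) == y_demo):.3f}')
```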

# Aggregating: Averaging, Bagging, and Random Forests

We can imagine training different models (maybe with different conditions) and then having them "vote" on what they think is best.

> Let's prepare some data to do some examples

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier,\
ExtraTreesClassifier, VotingClassifier
from sklearn.metrics import r2_score, accuracy_score, confusion_matrix, classification_report

from sklearn.model_selection import train_test_split, GridSearchCV,\
cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingRegressor

In [None]:
df = pd.read_csv('data/salaries.csv')
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum().sum()

In [None]:
# Convert Target to 0 or 1
df['Target'] = (df['Target'] == '>50K').astype(int)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['Relationship','Sex','Target'],axis=1), df['Target'], random_state=42)

In [None]:
# `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2
ohe = OneHotEncoder(drop='first', sparse_output=False)
ohe.fit(X_train.select_dtypes('object'))

In [None]:
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
X_tr_ohe = pd.DataFrame(ohe.transform(X_train.select_dtypes('object')),
                        columns=ohe.get_feature_names_out(),
                        index=X_train.index)

In [None]:
X_tr_ohe.head()

In [None]:
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
X_te_ohe = pd.DataFrame(ohe.transform(X_test.select_dtypes('object')),
                        columns=ohe.get_feature_names_out(),
                        index=X_test.index)

## Averaging
> Each model uses the same data to train and then we "vote" to make a prediction

### Model 1 - Logistic Regression

In [None]:
lr = LogisticRegression()

lr.fit(X_tr_ohe, y_train)

In [None]:
#cross validate to see performance 
scores = cross_val_score(estimator=lr, X=X_tr_ohe,
                        y=y_train, cv=5)
scores

In [None]:
lr.score(X_te_ohe, y_test)

### Model 2 - KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=knn, X=X_tr_ohe,
                y=y_train, cv=5)


In [None]:
scores

In [None]:
knn.score(X_te_ohe, y_test)

### Model 3 - Decision Tree

In [None]:
dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=dt, X=X_tr_ohe,
               y=y_train, cv=5)
scores

In [None]:
dt.score(X_te_ohe, y_test)

### Averaging the Models
#### Building a `VotingClassifier`

> Of course there's a Scikit-Learn class for that

In [None]:
avg = VotingClassifier(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('dt', dt)])
avg.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=avg, X=X_tr_ohe,
               y=y_train, cv=5)
scores

In [None]:
avg.score(X_te_ohe, y_test)

#### Weighted Averaging with the `VotingClassifier`
> Even if the vote is 50-50, you'd probably side with the "smart" ones more

This meta-estimator is not as good as one of our base estimators, so in this case the averaging did not work very well. Since the decision tree is performing better than the logistic regression and the k-nearest-neighbors model, however, we might instead build a meta-estimator that takes a **weighted average** of the base estimators' predictions, weighting (or biasing) it in favor of the better-performing base estimators. Suppose we weight the tree 40%, the knn model 20%, and the logistic regression 40%:

In [None]:
w_avg = VotingClassifier(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('dt', dt)],
    weights=[0.4, 0.2, 0.4])
w_avg.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=w_avg, X=X_tr_ohe,
                        y=y_train, cv=5)
scores

In [None]:
w_avg.score(X_te_ohe, y_test)
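
With the default `voting='hard'`, the weights scale each model's vote rather than averaging probabilities. The sketch below imitates that tally in NumPy. The prediction matrix, the `weighted_hard_vote` helper, and the `tree_heavy` weights are made up for illustration.

```python
import numpy as np

def weighted_hard_vote(predictions, weights):
    """Pick, for each row, the class with the largest total vote weight.

    predictions: (n_samples, n_models) array of integer class labels.
    """
    n_classes = predictions.max() + 1
    return np.array([
        np.argmax(np.bincount(row, weights=weights, minlength=n_classes))
        for row in predictions
    ])

# Hypothetical 0/1 predictions; columns are (lr, knn, dt)
preds = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
])

majority = weighted_hard_vote(preds, [1, 1, 1])           # plain vote
doc_weighted = weighted_hard_vote(preds, [0.4, 0.2, 0.4])  # weights used above
tree_heavy = weighted_hard_vote(preds, [0.2, 0.2, 0.6])    # hypothetical weights

print(majority)      # [1 0 1 0]
print(doc_weighted)  # [1 0 1 0]
print(tree_heavy)    # [0 1 1 0]
```

With weights of 0.4/0.2/0.4 no single model holds more than half of the total weight, so in a two-class problem any two models that agree still outvote the third, and the hard vote matches the plain majority. A weighting such as `[0.2, 0.2, 0.6]` is what lets one model override the other two.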

## Bagging
A single decision tree will often overfit your training data. One remedy is to train many trees on resampled data and combine them:

- Take a sample of your X_train and fit a decision tree to it.
- Put that sample back (i.e. sample with replacement) and repeat.
- When you've got as many trees as you like, make use of all your individual trees' predictions to come up with some holistic prediction. 
    - (Most obviously, we could take the average of our predictions, but there are other methods we might try.)

### Why is it called bagging?
* Because we're resampling our data with replacement, we're *bootstrapping*.
* Because we're making use of our many samples' predictions, we're *aggregating*.
* Because we're bootstrapping and aggregating all in the same algorithm, we're *bagging*.

### Back to the Example Data

In [None]:
# Instantiate a BaggingClassifier
# Note the base estimator is by default a decision tree
bag = BaggingClassifier(n_estimators=100,
                       verbose=1,
                       random_state=1)

In [None]:
bag.fit(X_tr_ohe, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=bag, X=X_tr_ohe,
               y=y_train, cv=5)
scores

In [None]:
# Score on test

bag.score(X_te_ohe, y_test)

### Fitting a Random Forest

### An Aside Story - Bananas 🍌

![Many individual yellow bananas](img/bananas.jpg)

Banana trees can be susceptible to [Panama disease](https://en.wikipedia.org/wiki/Panama_disease).

They're all clones!

Similarly, all the Decision Trees will be the same if given the same data! (A clone!!!)

### The Goods & The Bads

**The Goods**

- Super friend! - Captain Planet
- High performance 
    + low variance
- Transparent - Look at each individual tree to see its decisions 
    + inherited from Decision Trees

**The Bads**

- We got so many trees to plant...
- Computationally expensive
- Memory
    + all trees stored in memory
    + think back to k-Nearest Neighbors

### Breed a Variety of Trees
Let's add an extra layer of randomization: Instead of using all the features of my model to optimize a branch at each node, I'll just choose a subset of my features.

That's the essence of a random forest model. Note that there are now two levels of random sampling happening: To build a new tree, I'll be taking only some of my data points; and at any branching point in a tree, I'll be using only some of my features to determine the split.

### Steps:
1. Save a portion of data for validation (**out-of-bag**), the rest for training (**bag**)
2. The data for training (**bag**) is then split up by randomly selecting predictors
3. Grow/train your tree with the training data using just those features
4. Take our validation set (**out-of-bag**), keep only the columns used in our tree from the previous step, and predict on this *out-of-bag* data using the tree
5. Measure how well the tree did on that data (the *out-of-bag error*)
6. Repeat to make new trees and use the result to "vote" for the final decision
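
scikit-learn handles the out-of-bag bookkeeping for us when `oob_score=True` is set. Below is a minimal sketch on synthetic data, where the data set and the parameter values are assumptions for illustration. Note that in scikit-learn the out-of-bag rows for each tree are simply the rows its bootstrap sample missed, and the random subset of features is redrawn at every split rather than once per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

forest = RandomForestClassifier(n_estimators=200,
                                max_features='sqrt',  # features tried per split
                                oob_score=True,       # score on left-out rows
                                random_state=1)
forest.fit(X_tr, y_tr)

# Each training row is scored only by the trees that did not train on it
print(f'Out-of-bag accuracy: {forest.oob_score_:.3f}')
print(f'Test accuracy:       {forest.score(X_te, y_te):.3f}')
```

The out-of-bag score gives a validation-style estimate without setting aside a separate validation split.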

### Back to the Example Data

> Here's the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) on `RandomForestClassifier`

In [None]:
# Instantiate a RandomForestClassifier

rfr = RandomForestClassifier(max_features='sqrt',
                            max_samples=0.5,
                            random_state=1)

In [None]:
# Fit it

rfr.fit(X_tr_ohe, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=rfr, X=X_tr_ohe,
               y=y_train, cv=5)
scores

In [None]:
# Score on test

score = rfr.score(X_te_ohe, y_test)
score

### Cool Features of Random Forests
There are some extra investigations we can do with random forests since they're built of decision trees.

> **NOTE**
>
> Not all of these are _specific_ to random forests and can be applied to other (ensemble) models

#### Investigate Your Forest 🌲🌲👀🌲🌲
We can check out our trained estimators after training the ensemble. This isn't necessarily unique to random forests, but since the base model is always a decision tree we can really investigate how the model is working!

In [None]:
model_estimators = rfr.estimators_ 
print(len(model_estimators))
model_estimators

In [None]:
print(f'Overall model\'s score was {score:.3f}')
print('='*70)

for model in model_estimators[:5]:
    display(model)
    model_score = model.score(X_te_ohe, y_test)
    print(f'\tModel gave score of {model_score:.3f}')

#### Feature Importance

We can use the [`.feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_) property of the trained model to get an idea of which features mattered the most.

In [None]:
rfr.feature_importances_

In [None]:
feat_import = {name: score 
                   for name, score 
                       in zip(X_te_ohe.columns, rfr.feature_importances_)
}
feat_import
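
A dictionary is hard to scan when there are many one-hot columns, so it can help to put the importances in a sorted `pandas.Series`. This is a small sketch on synthetic data, and the `feature_i` column names are placeholders rather than columns from the salary data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names
X_arr, y_demo = make_classification(n_samples=500, n_features=6,
                                    n_informative=3, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(forest.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.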

### Extremely Randomized Trees (Extra Trees)

Sometimes we might want even one more bit of randomization. Instead of always choosing the *optimal* branching path, we might just choose a branching path at random. If we're doing that, then we've got extremely randomized trees.

There are now **three** levels of randomization: sampling of data, sampling of features, and random selection of branching paths.

In [None]:
# Instantiate an ExtraTreesClassifier

etr = ExtraTreesClassifier(max_features='sqrt',
                         max_samples=0.5,
                         bootstrap=True,
                         random_state=1)

In [None]:
# Fit it

etr.fit(X_tr_ohe, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=etr, X=X_tr_ohe,
               y=y_train, cv=5)
scores

In [None]:
# Score on test

etr.score(X_te_ohe, y_test)

# Stacking

Remember weighted averaging? Stacking uses another model to estimate those weights for us. This means we'll have one layer of base estimators and another layer that is "**trained to optimally combine the model predictions to form a new set of predictions**". See [this short blog post](https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/) for more.

## Initial Data Prep

In [None]:
import xlrd
import os
wb = xlrd.open_workbook('data/Sales Report.xls',
                        logfile=open(os.devnull, 'w'))

sales = pd.read_excel(wb)
sales = sales.dropna()

In [None]:
sales.dtypes

In [None]:
sales['Category'].value_counts()

In [None]:
sales['Sub-Category'].value_counts()

In [None]:
X_num = sales[['Discount', 'Profit']].columns
X_cat = sales[['Category', 'Sub-Category']].columns

In [None]:
X = sales[['Discount', 'Profit',
          'Category', 'Sub-Category']]
y = sales['Sales']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Setting Up a Pipeline

In [None]:
numTrans = Pipeline(steps=[
    ('scaler', StandardScaler())
])
catTrans = Pipeline(steps=[
    # `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2
    ('ohe', OneHotEncoder(drop='first',
                          sparse_output=False))
])

In [None]:
pp = ColumnTransformer(transformers=[
    ('num', numTrans, X_num),
    ('cat', catTrans, X_cat)
])

In [None]:
pp.fit(X_train)

In [None]:
X_tr_pp = pp.transform(X_train)

## Setting Up a Stack

In [None]:
estimators = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('dt', DecisionTreeRegressor())
]

sr = StackingRegressor(estimators)

In [None]:
sr.fit(X_tr_pp, y_train)

In [None]:
X_test_pp = pp.transform(X_test)

In [None]:
sr.score(X_test_pp, y_test)
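
The preprocessing and the stack can also be chained in a single `Pipeline`, so that fitting and scoring take the raw columns directly. The sketch below assumes a small synthetic frame whose column names mirror the ones above. The values, the category labels, and the formula for `Sales` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Made-up data standing in for the sales report
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    'Discount': rng.uniform(0, 0.5, n),
    'Profit': rng.normal(50, 20, n),
    'Category': rng.choice(['Furniture', 'Technology', 'Office Supplies'], n),
})
demo['Sales'] = (100 + 3 * demo['Profit'] - 80 * demo['Discount']
                 + rng.normal(0, 5, n))

X_demo = demo.drop(columns='Sales')
y_demo = demo['Sales']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scale numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Discount', 'Profit']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Category']),
])

stack = StackingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('dt', DecisionTreeRegressor(random_state=42)),
])

# One object that preprocesses and then stacks
model = Pipeline(steps=[('pp', preprocess), ('stack', stack)])
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)
print(f'Stacked R^2 on held-out data: {r2:.3f}')
```

Keeping the transformer inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking information from the validation rows.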

## Comparison with Base Estimators

In [None]:
lr = LinearRegression().fit(X_tr_pp, y_train)
lr.score(X_test_pp, y_test)

In [None]:
knn = KNeighborsRegressor().fit(X_tr_pp, y_train)
knn.score(X_test_pp, y_test)

In [None]:
rt = DecisionTreeRegressor().fit(X_tr_pp, y_train)
rt.score(X_test_pp, y_test)

### Pros and Cons of Random Forests
**Pros:**
* Strong performance: because this is an ensemble algorithm, the model is naturally resistant to noise and variance in the data, and generally tends to perform quite well.

* Interpretability: since each tree in the random forest is a Glass-Box Model (meaning that the model is interpretable, allowing us to see how it arrived at a certain decision), the overall random forest is as well!

**Cons:**
* Computational complexity: On large datasets, the runtime can be quite slow compared to other algorithms.

* Memory usage: Random forests tend to have a larger memory footprint than other models. It's not uncommon to see random forests that were trained on large datasets have memory footprints in the tens, or even hundreds, of MB.