In [5]:
from sklearn.datasets import make_regression
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import load_boston
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

## Ensemble model with Bagging

The guiding principle behind ensemble models is to combine weak learners to create a strong learner. Bagging does this by creating subsets of the training data through resampling (sampling with replacement) and training an ML model of choice (in our example, decision trees) on each subset. This produces numerous models, each slightly different from the others. By averaging the predictions of these individual learners for a given observation, we should get more robust results that better account for variance in the data than those of an individual learner.

### Pseudocode for Bagging

In [None]:
"""
Create our strong learner by bagging weak learners. Note that the code below 
will not run, it is only an outline of the general implementation.
"""
# Assume `data` is defined
trees = []
number_of_trees = 100
for i in range(number_of_trees):
  subset_data = resample(data)
  tree = DecisionTreeModel().fit(subset_data)
  trees.append(tree)

"""
Predict for target variable by running all weak learners on an observation
and averaging the result (or taking the mode if target variable is categorical).
"""
# Assume `x_test` is defined, where x_test is the observation we want to predict for
results = []
for tree in trees:
  # collect each tree's prediction so that they can be averaged
  results.append(tree.predict(x_test))
pred = mean(results)
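
The outline above can be turned into a small runnable sketch. This version assumes synthetic data from `make_regression`, scikit-learn's `DecisionTreeRegressor` as the weak learner, and `sklearn.utils.resample` for the bootstrap step; the names `X_demo`, `y_demo`, and `bagged_pred` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# Synthetic regression data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, y_train = X_demo[:150], y_demo[:150]
X_test = X_demo[150:]

# Fit each tree on its own bootstrap sample (rows drawn with replacement)
trees = []
number_of_trees = 25
for i in range(number_of_trees):
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    trees.append(DecisionTreeRegressor(random_state=i).fit(X_boot, y_boot))

# Average the per-tree predictions for every test observation
all_preds = np.array([tree.predict(X_test) for tree in trees])  # (number_of_trees, n_test)
bagged_pred = all_preds.mean(axis=0)
```

Each row of `all_preds` holds one tree's predictions, so averaging over `axis=0` gives one bagged prediction per test observation.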

### Bagging (Regressor) with scikit-learn

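For this example, we will load the Boston home prices dataset provided by sklearn. The target variable is the median home price. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below require an older version.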

Learn more about this dataset [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset).

In [2]:
data = load_boston()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
display(X.head())
display(y)

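| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |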


0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Length: 506, dtype: float64
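
For installations where `load_boston` is no longer available, the same loading pattern can be tried on another bundled dataset. The sketch below uses `load_diabetes` purely as a stand-in (an assumption, not the notebook's choice); the names `X_alt` and `y_alt` are illustrative, and any scores obtained on this data will differ from the ones reported in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): load_diabetes ships with scikit-learn and needs no download
diabetes = load_diabetes()
X_alt = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y_alt = pd.Series(diabetes.target)

# Inspect the shapes of the feature table and the target
print(X_alt.shape, y_alt.shape)
```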

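We will now evaluate the mean absolute error (MAE) of an ensemble model with the following numbers of learners: 1, 10, 25, and 50. Because scikit-learn's scorers follow a higher-is-better convention, the `neg_mean_absolute_error` scoring option reports the negated MAE, so scores closer to zero are better.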

In [3]:
scores = {1: [], 10: [], 25: [], 50: []}
for num_estimators in scores:
  # define the model
  model = BaggingRegressor(n_estimators=num_estimators)
  # evaluate the model
  cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
  n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
  scores[num_estimators] = n_scores
# report performance
scores_df = pd.DataFrame.from_dict(scores)
display(scores_df.head())

| | 1 | 10 | 25 | 50 |
|---|---|---|---|---|
| 0 | -3.380392 | -2.087647 | -1.988549 | -2.022314 |
| 1 | -3.286275 | -2.411765 | -2.601725 | -2.444353 |
| 2 | -3.454902 | -1.891176 | -1.808863 | -1.840431 |
| 3 | -4.962745 | -3.588039 | -3.664392 | -3.366784 |
| 4 | -2.556863 | -2.142549 | -1.891843 | -1.888 |


Now that we have multiple scores for each value of `n_estimators`, let's see how the different numbers of estimators fare against each other by averaging the scores for each value of `n_estimators`.

Here, `n_estimators` is the number of trees in the ensemble model. Seeing how it affects the error supports our understanding of how bagging creates strong learners.

In [4]:
scores_df.mean()

1    -3.252587
10   -2.337840
25   -2.247047
50   -2.218703
dtype: float64

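The averages above show that as the number of estimators (i.e., trees) increases, the mean absolute error of the ensemble shrinks: from about 3.25 with a single tree to about 2.22 with 50. Most of the gain comes from going from 1 to 10 trees; beyond that, the improvement is small.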
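
Averages alone hide how much the scores vary across folds. As a rough sketch, the comparison can be summarised with both the mean and the standard deviation of the (sign-flipped) MAE. This example assumes synthetic data from `make_regression` and a plain `KFold` so that it runs quickly; `X_syn`, `y_syn`, and `summary_df` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption), kept small so the sketch runs quickly
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

summary = {}
for n in (1, 10, 25):
    model = BaggingRegressor(n_estimators=n, random_state=0)
    neg_mae = cross_val_score(model, X_syn, y_syn,
                              scoring='neg_mean_absolute_error', cv=cv)
    mae = -neg_mae  # flip the sign to recover the usual (positive) MAE
    summary[n] = {'mean_mae': mae.mean(), 'std_mae': mae.std()}

# One row per ensemble size, one column per statistic
summary_df = pd.DataFrame(summary).T
print(summary_df)
```

A smaller `mean_mae` indicates a better model, and `std_mae` gives a sense of how stable that estimate is across folds.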

## Extending bagging with Random Forests

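A random forest is very similar to bagging, with one addition: besides resampling the training data, each iteration also drops a few features, so that every tree is trained on a randomly chosen subset of the features rather than all of them. This adds another level of randomness to the generation of trees and further accounts for variance. (scikit-learn's `RandomForestRegressor` draws the candidate features at each split rather than once per tree; this is controlled by its `max_features` parameter.)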

### Pseudocode for Random Forests
The pseudocode below is **very** similar to the one above, except for the `resample` line.

In [None]:
"""
Create our strong learner by bagging weak learners. Note that the code below 
will not run, it is only an outline of the general implementation.
"""
# Assume `data` is defined
trees = []
number_of_trees = 100
for i in range(number_of_trees):
  subset_data = drop_random_features(resample(data))
  tree = DecisionTreeModel().fit(subset_data)
  trees.append(tree)

"""
Predict for target variable by running all weak learners on an observation
and averaging the result (or taking the mode if target variable is categorical).
"""
# Assume `x_test` is defined, where x_test is the observation we want to predict for
results = []
for tree in trees:
  # collect each tree's prediction so that they can be averaged
  results.append(tree.predict(x_test))
pred = mean(results)
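
A runnable sketch of this outline is given below. It follows the simplified per-tree feature dropping described here, not scikit-learn's per-split sampling. The synthetic data, the choice of `DecisionTreeRegressor`, and the value of `features_per_tree` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# Synthetic regression data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, y_train = X_demo[:150], y_demo[:150]
X_test = X_demo[150:]

rng = np.random.default_rng(0)
n_features = X_train.shape[1]
features_per_tree = 4  # hypothetical choice: keep 4 of the 6 features for each tree

forest = []  # list of (tree, feature_indices) pairs
for i in range(25):
    # bootstrap the rows, then keep a random subset of the columns
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    cols = rng.choice(n_features, size=features_per_tree, replace=False)
    tree = DecisionTreeRegressor(random_state=i).fit(X_boot[:, cols], y_boot)
    forest.append((tree, cols))

# Each tree predicts using only the columns it was trained on; then average
forest_pred = np.mean([tree.predict(X_test[:, cols]) for tree, cols in forest], axis=0)
```

Storing the column indices next to each tree matters: at prediction time every tree must receive the same features it saw during training.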

### Random Forest (Regressor) with scikit-learn

In [36]:
scores = {1: [], 10: [], 25: [], 50: []}
for num_estimators in scores:
  # define the model
  model = RandomForestRegressor(n_estimators=num_estimators)
  # evaluate the model
  cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
  n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
  scores[num_estimators] = n_scores
# report performance
scores_df = pd.DataFrame.from_dict(scores)
display(scores_df.head())

| | 1 | 10 | 25 | 50 |
|---|---|---|---|---|
| 0 | -3.611765 | -2.184118 | -1.848 | -1.870196 |
| 1 | -3.288235 | -2.633725 | -2.332784 | -2.465294 |
| 2 | -3.001961 | -1.844706 | -1.761569 | -1.761176 |
| 3 | -3.907843 | -3.690588 | -3.386431 | -3.418706 |
| 4 | -2.827451 | -1.878235 | -1.827216 | -1.787961 |


Let's perform the same analysis of the number of estimators that we did above.

In [37]:
scores_df.mean()

1    -3.196889
10   -2.314654
25   -2.205537
50   -2.178932
dtype: float64

The results are consistent with what we observed above: the mean absolute error falls from about 3.20 with a single tree to about 2.18 with 50 trees. Increasing the number of estimators improves the performance of the ensemble model, supporting the benefit of using an ensemble model as opposed to a single learner.
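
To see the effect of feature subsampling itself, the `max_features` parameter of `RandomForestRegressor` can be varied. To my understanding, the regressor considers all features at each split by default, which makes it behave much like bagged trees, so the sketch below sets `max_features` explicitly. The synthetic data and the value `0.5` are assumptions chosen for illustration, and which setting wins will depend on the dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption), kept small so the sketch runs quickly
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    'bagging': BaggingRegressor(n_estimators=25, random_state=0),
    # a float max_features is interpreted as a fraction of the features per split
    'forest_all_features': RandomForestRegressor(n_estimators=25, max_features=1.0, random_state=0),
    'forest_half_features': RandomForestRegressor(n_estimators=25, max_features=0.5, random_state=0),
}

mae_by_model = {}
for name, model in candidates.items():
    neg_mae = cross_val_score(model, X_syn, y_syn,
                              scoring='neg_mean_absolute_error', cv=cv)
    mae_by_model[name] = -neg_mae.mean()  # positive MAE, lower is better

print(mae_by_model)
```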