# Labs Data Science Workshop: Super Ensemble Regressor

### Workshop Outline
1. Introduction & Motivation
2. Regressor Model
3. Ensemble Model
4. Super Ensemble Model

## Step 1 - Introduction & Motivation
<br>

### Problem
We've trained several Regressor models, but none of them are achieving the results we're looking for.
We need a better model!
<br>

### Goal
Build a regression model that trains in under 10 seconds and achieves a test MSE below 120.
<br>

### Solution
What if we combine multiple models to make one giant model?
<br>

### Restrictions
Due to time constraints, we will skip all of the following steps in this workshop. In a real project, you really should do them.
- Data Cleaning
- Scaling
- Imputing
- Encoding
- Data Engineering
- Hyperparameter Tuning
- Data Analysis
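
Several of the skipped steps (imputing, scaling) can be folded into a single scikit-learn `Pipeline`, so they need not cost much extra code. The following is a minimal sketch on made-up data, not the workshop dataset; the choice of `SimpleImputer`, `StandardScaler` and `Ridge` is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical, not the workshop dataset): 200 rows, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 5% of the values to simulate missing data.
X[rng.random(X.shape) < 0.05] = np.nan

# Impute -> scale -> fit, all inside one estimator, so the preprocessing
# statistics are learned from whatever data is passed to fit().
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    Ridge(),
)
pipeline.fit(X, y)
r2 = pipeline.score(X, y)
print(f"R^2 on the toy data: {r2:.2f}")
```

Because the pipeline behaves like any other estimator, it could in principle be dropped into the loops and stacks used later in this workshop.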

## Setup

Choosing a model: [Scikit Cheat Sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

In [33]:
import os
from time import perf_counter

import pandas
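from sklearn.linear_model import (
    ARDRegression, BayesianRidge, LassoCV, LinearRegression,
    PassiveAggressiveRegressor, RANSACRegressor, RidgeCV,
    SGDRegressor, TheilSenRegressor,
)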
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

### Load Data

In [34]:
df = pandas.read_csv(os.path.join("data", "dataset2.csv"))
df.head()

| | A | B | C | D | E | Target |
|---|---|---|---|---|---|---|
| 0 | 3.882026 | 3.882026 | 3.882026 | 3.882026 | 3.882026 | 2.559627 |
| 1 | 3.200079 | 3.200079 | 3.200079 | 3.200079 | 3.200079 | 5.924739 |
| 2 | 3.489369 | 3.489369 | 3.489369 | 3.489369 | 3.489369 | -7.223148 |
| 3 | 4.120447 | 4.120447 | 4.120447 | 4.120447 | 4.120447 | -1.95286 |
| 4 | 3.933779 | 3.933779 | 3.933779 | 3.933779 | 3.933779 | -0.719467 |


In [35]:
df.shape

(5000, 6)

In [36]:
df.describe()

| | A | B | C | D | E | Target |
|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.577287 | 0.570648 | 0.601463 | 0.609685 | 0.605202 | -0.109464 |
| std | 1.519043 | 1.509097 | 1.511941 | 1.499876 | 1.491528 | 49.658474 |
| min | -3.6352 | -3.295561 | -3.9224 | -3.688365 | -3.856375 | -203.648769 |
| 25% | -0.524725 | -0.526124 | -0.488893 | -0.468482 | -0.482765 | -26.337749 |
| 50% | 0.305244 | 0.288422 | 0.326872 | 0.290739 | 0.333431 | -1.507371 |
| 75% | 1.550164 | 1.477511 | 1.543315 | 1.576715 | 1.529544 | 28.055682 |
| max | 4.379678 | 4.479084 | 4.379678 | 4.379678 | 4.379678 | 209.467517 |


In [37]:
df.corr()

| | A | B | C | D | E | Target |
|---|---|---|---|---|---|---|
| A | 1.0 | 0.650384 | 0.64142 | 0.644478 | 0.651553 | -0.025961 |
| B | 0.650384 | 1.0 | 0.649802 | 0.640194 | 0.654123 | -0.018336 |
| C | 0.64142 | 0.649802 | 1.0 | 0.643924 | 0.640466 | 0.566533 |
| D | 0.644478 | 0.640194 | 0.643924 | 1.0 | 0.644547 | -0.018216 |
| E | 0.651553 | 0.654123 | 0.640466 | 0.644547 | 1.0 | -0.032977 |
| Target | -0.025961 | -0.018336 | 0.566533 | -0.018216 | -0.032977 | 1.0 |


### Train/Test Split

In [38]:
target = df.columns[-1]
features = df.columns.drop(target)

X_train, X_test, y_train, y_test = train_test_split(
    df[features],
    df[target],
    random_state=42,
    test_size=0.2,
)

## Step 2 - Regressor Models, Review
<br>

### Problem
Review - Regressors vs Classifiers
<br>

### Solution
1. Regressors are for predicting continuous values.
    - Examples?
2. Classifiers are for predicting discrete values.
    - Examples?
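
One way to make the distinction concrete is to fit both kinds of model on the same feature. This is a small sketch on invented data (hours studied, exam score, pass/fail); none of these names or numbers come from the workshop dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=(100, 1))  # hypothetical feature: hours studied

# Continuous target: an exam score -> regression problem.
score = 40 + 5 * hours[:, 0] + rng.normal(scale=2, size=100)
# Discrete target: a pass/fail label -> classification problem.
passed = (score >= 65).astype(int)

regressor = LinearRegression().fit(hours, score)
classifier = LogisticRegression().fit(hours, passed)

new_hours = np.array([[2.0], [8.0]])
reg_pred = regressor.predict(new_hours)   # real-valued predictions
clf_pred = classifier.predict(new_hours)  # class labels drawn from {0, 1}
print("Predicted scores:", reg_pred)
print("Predicted labels:", clf_pred)
```

The regressor can return any real number, while the classifier can only return one of the labels it saw during training.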

### Base Model: LinearRegression

In [39]:
base_model = LinearRegression()

start = perf_counter()
base_model.fit(X_train, y_train)
stop = perf_counter()
duration = stop - start

print(f"Algorithm: {base_model}")
print(f"Train Time: {duration:.2f}s")
print(f"Test Score: {base_model.score(X_test, y_test):.2%}")
print(f"MSE: {mean_squared_error(y_test, base_model.predict(X_test))}")

Algorithm: LinearRegression()
Train Time: 0.00s
Test Score: 79.62%
MSE: 512.802013271985


### Other Linear Models

In [40]:
models = [
    BayesianRidge(),
    RidgeCV(),
    LassoCV(random_state=42),
    SGDRegressor(random_state=42),
    PassiveAggressiveRegressor(random_state=42),
    RANSACRegressor(random_state=42),
]
for model in models:
    start = perf_counter()
    model.fit(X_train, y_train)
    stop = perf_counter()
    duration = stop - start
    print(f"Algorithm: {model}")
    print(f"Train Time: {duration:.2f}s")
    print(f"Test Score: {model.score(X_test, y_test):.2%}")
    print(f"MSE: {mean_squared_error(y_test, model.predict(X_test))}\n")

Algorithm: BayesianRidge()
Train Time: 0.00s
Test Score: 79.62%
MSE: 512.799626406829

Algorithm: RidgeCV()
Train Time: 0.00s
Test Score: 79.62%
MSE: 512.7998128942497

Algorithm: LassoCV(random_state=42)
Train Time: 0.05s
Test Score: 79.62%
MSE: 512.7977588863672

Algorithm: SGDRegressor(random_state=42)
Train Time: 0.01s
Test Score: 79.62%
MSE: 512.7126778441786

Algorithm: PassiveAggressiveRegressor(random_state=42)
Train Time: 0.00s
Test Score: 72.54%
MSE: 690.863888355196

Algorithm: RANSACRegressor(random_state=42)
Train Time: 0.03s
Test Score: 77.32%
MSE: 570.7119748792113



### Are There Any Questions?
[Allow time for only a few questions]
<br>

### Check For Understanding Questions
1. True or False: It's a good idea to skip steps such as data cleaning, hyperparameter tuning, and data engineering.
2. Of the linear models we tested, which one performed best?
3. What is the fundamental difference between a regressor and a classifier?
4. What is the go-to evaluation metric for regression models?
<br>

### Check For Understanding Answers
1. False. We only dropped these techniques for this workshop due to time constraints.
2. Effectively a five-way tie at MSE ≈ 512.8 and score 79.62%: LinearRegression, BayesianRidge, RidgeCV, LassoCV, SGDRegressor. SGDRegressor is lowest by a hair (512.71).
3. Regressors predict continuous values, classifiers predict discrete values.
4. Mean Squared Error.
<br>
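
To see what the two printed numbers mean, here is a small worked example on four hand-picked values (invented for illustration). It computes MSE by hand, checks it against `mean_squared_error`, and shows how the "Test Score" printed by `model.score` (R² for scikit-learn regressors) relates to MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -1.0, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# R^2 compares the model's MSE with the MSE of always predicting the mean.
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
r2_manual = 1 - mse_manual / baseline_mse

print(mse_manual, mse_sklearn)
print(r2_manual, r2_score(y_true, y_pred))
```

MSE is in squared units of the target, so it is only comparable between models evaluated on the same data, whereas R² is unitless.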

### Next Steps
If you want to know more: [Suggest resources for further understanding]


## Step 3 - Ensemble Models
<br>

### Problem
Model not working?
<br>

### Solution
Try an ensemble model!

#### Stacked Linear Ensemble

In [41]:
from sklearn.ensemble import StackingRegressor

In [42]:
model = StackingRegressor(
    estimators=[
        ("LR", LinearRegression()),
        ("BRR", BayesianRidge()),
        ("RCV", RidgeCV()),
        ("LCV", LassoCV(random_state=42)),
        ("SGDR", SGDRegressor(random_state=42)),
        ("PAR", PassiveAggressiveRegressor(random_state=42)),
        ("RANSAC", RANSACRegressor(random_state=42)),
    ],
    final_estimator=BayesianRidge(),
)

start = perf_counter()
model.fit(X_train, y_train)
stop = perf_counter()
duration = stop - start

print(f"Workers: {', '.join(model.named_estimators)}")
print(f"Executive: {model.final_estimator}")
print(f"Train Time: {duration:.2f}s")
print(f"Test Score: {model.score(X_test, y_test):.2%}")
print(f"MSE: {mean_squared_error(y_test, model.predict(X_test))}")

Workers: LR, BRR, RCV, LCV, SGDR, PAR, RANSAC
Executive: BayesianRidge()
Train Time: 0.63s
Test Score: 79.62%
MSE: 512.7959961191419
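
For intuition about what a stack does internally, the sketch below builds a two-worker stack by hand. It is a simplified picture, assuming the default behaviour of `StackingRegressor` (the executive is trained on cross-validated worker predictions, and the workers are then refit on all the training data); the toy data and the choice of workers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): one non-linear and one linear effect.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

workers = [Ridge(), DecisionTreeRegressor(max_depth=4, random_state=42)]

# Level 0: out-of-fold predictions from each worker become the
# training features for the executive.
meta_features = np.column_stack(
    [cross_val_predict(worker, X, y, cv=5) for worker in workers]
)

# Level 1: the executive learns how to combine the workers' predictions.
executive = LinearRegression().fit(meta_features, y)

# For prediction, each worker is refit on all of the training data.
for worker in workers:
    worker.fit(X, y)


def stacked_predict(X_new):
    """Run the workers, then let the executive combine their outputs."""
    level0 = np.column_stack([worker.predict(X_new) for worker in workers])
    return executive.predict(level0)


preds = stacked_predict(X[:5])
print(preds)
```

The executive only ever sees the workers' predictions, so it can add value only when those predictions differ from one another.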


#### Prebuilt Ensemble Models

In [43]:
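from sklearn.ensemble import (
    AdaBoostRegressor, ExtraTreesRegressor, GradientBoostingRegressor,
    HistGradientBoostingRegressor, RandomForestRegressor,
)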

In [44]:
models = [
    ("GBR", GradientBoostingRegressor(random_state=42)),
    ("HGBR", HistGradientBoostingRegressor(random_state=42)),
    ("ET", ExtraTreesRegressor(random_state=42)),
    ("ABR", AdaBoostRegressor(random_state=42)),
    ("RFR", RandomForestRegressor(random_state=42)),
]
for name, model in models:
    start = perf_counter()
    model.fit(X_train, y_train)
    stop = perf_counter()
    duration = stop - start
    print(f"Algorithm: {name}")
    print(f"Train Time: {duration:.2f}s")
    print(f"Test Score: {model.score(X_test, y_test):.2%}")
    print(f"MSE: {mean_squared_error(y_test, model.predict(X_test))}")
    print()

Algorithm: GBR
Train Time: 0.45s
Test Score: 95.36%
MSE: 116.83276574521398

Algorithm: HGBR
Train Time: 0.40s
Test Score: 95.21%
MSE: 120.54505468217337

Algorithm: ET
Train Time: 0.57s
Test Score: 94.94%
MSE: 127.33486237863517

Algorithm: ABR
Train Time: 0.19s
Test Score: 94.93%
MSE: 127.50951001560519

Algorithm: RFR
Train Time: 1.05s
Test Score: 94.83%
MSE: 130.16930958453065



### Are There Any Questions?
[Allow time for only a few questions]
<br>

### Check For Understanding Questions
1. Why did the stack of linear models only achieve results similar to the best model in the stack?
<br>

### Check For Understanding Answers
1. Because the linear models all fit nearly the same function, so their predictions are almost identical and stacking them adds no new information.
<br>
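
The similarity of the linear models can be checked directly by correlating their predictions. This sketch uses synthetic data with a non-linear target (an assumption, chosen so that a tree and a linear fit genuinely differ) and adds a `DecisionTreeRegressor` for contrast.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression, RidgeCV
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): non-linear in the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 5 * np.sin(2 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.2, size=500)

models = {
    "LR": LinearRegression(),
    "BRR": BayesianRidge(),
    "RCV": RidgeCV(),
    "Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}
preds = np.column_stack([m.fit(X, y).predict(X) for m in models.values()])

# Pairwise correlation between the models' predictions
# (rows and columns follow the order of the dict above).
corr = np.corrcoef(preds, rowvar=False)
print(np.round(corr, 3))
```

The linear models should correlate with each other almost perfectly, while the tree stands apart, which is the kind of diversity a stack can exploit.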

### Next Steps
If you want to know more: [Scikit Ensemble Models](https://scikit-learn.org/stable/modules/ensemble.html)


## Step 4 - Super Ensemble Models
<br>

### Problem
Model Still Not Working?
<br>

### Solution
Try A Super Ensemble Regressor... Mo Powah!

#### Super Ensemble

In [45]:
model = StackingRegressor(
    estimators=[
        ("LR", LinearRegression()),
        ("RCV", RidgeCV()),
        ("LCV", LassoCV(random_state=42)),
        ("SGDR", SGDRegressor(random_state=42)),
        ("ARDR", ARDRegression()),
        ("PAR", PassiveAggressiveRegressor(random_state=42)),
        ("TSR", TheilSenRegressor(random_state=42)),
        ("RANSAC", RANSACRegressor(random_state=42)),
        ("RFR", RandomForestRegressor(random_state=42)),
        ("HGBR", HistGradientBoostingRegressor(random_state=42)),
        ("ABR", AdaBoostRegressor(random_state=42)),
        ("ET", ExtraTreesRegressor(random_state=42)),
    ],
    final_estimator=StackingRegressor(
        estimators=[
            ("LR", LinearRegression()),
            ("RCV", RidgeCV()),
            ("LCV", LassoCV(random_state=42)),
            ("SGDR", SGDRegressor(random_state=42)),
            ("ARDR", ARDRegression()),
            ("PAR", PassiveAggressiveRegressor(random_state=42)),
            ("TSR", TheilSenRegressor(random_state=42)),
            ("RANSAC", RANSACRegressor(random_state=42)),
            ("RFR", RandomForestRegressor(random_state=42)),
            ("HGBR", HistGradientBoostingRegressor(random_state=42)),
            ("ABR", AdaBoostRegressor(random_state=42)),
            ("ET", ExtraTreesRegressor(random_state=42)),
        ],
        final_estimator=GradientBoostingRegressor(random_state=42),
    ),
)

start = perf_counter()
model.fit(X_train, y_train)
stop = perf_counter()
duration = stop - start

print(f"Workers: {', '.join(model.named_estimators)}")
print(f"Executive: {model.final_estimator}")
print(f"Train Time: {duration:.2f}s")
print(f"Test Score: {model.score(X_test, y_test):.2%}")
print(f"MSE: {mean_squared_error(y_test, model.predict(X_test))}")

Workers: LR, RCV, LCV, SGDR, ARDR, PAR, TSR, RANSAC, RFR, HGBR, ABR, ET
Executive: StackingRegressor(estimators=[('LR', LinearRegression()), ('RCV', RidgeCV()),
                              ('LCV', LassoCV(random_state=42)),
                              ('SGDR', SGDRegressor(random_state=42)),
                              ('ARDR', ARDRegression()),
                              ('PAR',
                               PassiveAggressiveRegressor(random_state=42)),
                              ('TSR', TheilSenRegressor(random_state=42)),
                              ('RANSAC', RANSACRegressor(random_state=42)),
                              ('RFR', RandomForestRegressor(random_state=42)),
                              ('HGBR',
                               HistGradientBoostingRegressor(random_state=42)),
                              ('ABR', AdaBoostRegressor(random_state=42)),
                              ('ET', ExtraTreesRegressor(random_state=42))],
                  final_estima

In [46]:
from sklearn.svm import SVR

In [47]:
model = StackingRegressor(
    estimators=[
        ("GBR", GradientBoostingRegressor(random_state=42)),
        ("BRR", BayesianRidge()),
        ("SVR", SVR()),
        ("RCV", RidgeCV()),
        ("LCV", LassoCV(random_state=42)),
    ],
    final_estimator=StackingRegressor(
        estimators=[
            ("GBR", GradientBoostingRegressor(random_state=42)),
            ("BRR", BayesianRidge()),
            ("SVR", SVR()),
            ("RCV", RidgeCV()),
            ("LCV", LassoCV(random_state=42)),
        ],
        final_estimator=GradientBoostingRegressor(),
    ),
)

start = perf_counter()
model.fit(X_train, y_train)
stop = perf_counter()
duration = stop - start

print(f"Workers: {', '.join(model.named_estimators)}")
print(f"Executive: {model.final_estimator}")
print(f"Train Time: {duration:.2f}s")
print(f"Test Score: {model.score(X_test, y_test):.2%}")
print(f"MSE: {mean_squared_error(y_test, model.predict(X_test))}")

Workers: GBR, BRR, SVR, RCV, LCV
Executive: StackingRegressor(estimators=[('GBR',
                               GradientBoostingRegressor(random_state=42)),
                              ('BRR', BayesianRidge()), ('SVR', SVR()),
                              ('RCV', RidgeCV()),
                              ('LCV', LassoCV(random_state=42))],
                  final_estimator=GradientBoostingRegressor())
Train Time: 12.37s
Test Score: 95.20%
MSE: 120.64349460650865


#### Ensemble - Double Stack

In [48]:
model = StackingRegressor(
    estimators=[
        ("GBR", GradientBoostingRegressor(random_state=42)),
        ("HGBR", HistGradientBoostingRegressor(random_state=42)),
        ("ET", ExtraTreesRegressor(random_state=42)),
    ],
    final_estimator=GradientBoostingRegressor(random_state=42),
)

start = perf_counter()
model.fit(X_train, y_train)
stop = perf_counter()
duration = stop - start

print(f"Workers: {', '.join(model.named_estimators)}")
print(f"Executive: {model.final_estimator}")
print(f"Train Time: {duration:.2f}s")
print(f"Test Score: {model.score(X_test, y_test):.2%}")
print(f"MSE: {mean_squared_error(y_test, model.predict(X_test))}")

Workers: GBR, HGBR, ET
Executive: GradientBoostingRegressor(random_state=42)
Train Time: 7.35s
Test Score: 95.54%
MSE: 112.25137112560077


### Are There Any Questions?
[Allow time for questions]
<br>

### Check For Understanding Questions
1. Why not just use all the models?
2. True or False. When using an ensemble, it's more effective to use similar models as it boosts the ensemble's performance.
3. Is an ensemble of ensembles of ensembles possible?
<br>

### Check For Understanding Answers
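1. Adding similar models typically doesn't improve the result unless they are flexible learners like the prebuilt ensemble models, and every extra model adds training time (the nested stack with SVR took 12.37s, over our 10-second goal).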
2. False.
3. Yes.
<br>
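
To illustrate the third answer, the sketch below nests three `StackingRegressor` levels, with a small random forest (itself an ensemble) as the innermost executive. The data, the worker choices, and `cv=3` are assumptions made to keep the example quick; it is not tuned and says nothing about whether such nesting helps.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical, not the workshop dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=200)


def make_workers():
    """Return fresh, unfitted workers for one level of the stack."""
    return [
        ("LR", LinearRegression()),
        ("DT", DecisionTreeRegressor(max_depth=4, random_state=42)),
    ]


# Innermost stack: its executive is a small random forest.
inner = StackingRegressor(
    estimators=make_workers(),
    final_estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    cv=3,
)
# Each outer stack uses the stack below it as its executive.
middle = StackingRegressor(estimators=make_workers(), final_estimator=inner, cv=3)
outer = StackingRegressor(estimators=make_workers(), final_estimator=middle, cv=3)

outer.fit(X, y)
sample_preds = outer.predict(X[:4])
print(sample_preds)
```

Each extra level multiplies the number of fits, so training time grows quickly with depth.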

### Next Steps
If you want to know more: [Scikit Documentation](https://scikit-learn.org)

As an exercise, try to work out which model or combination of models gives the best MSE for the training time it requires.
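
A small helper can make that comparison less repetitive. The sketch below wraps the timing and scoring pattern used throughout this notebook in a hypothetical `evaluate` function and ranks candidates by MSE; the stand-in data is invented so the block runs on its own, and in the notebook you would pass the existing `X_train`, `X_test`, `y_train`, `y_test` instead.

```python
from time import perf_counter

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(name, model, X_train, X_test, y_train, y_test):
    """Fit the model and return its training time, R^2 and test MSE."""
    start = perf_counter()
    model.fit(X_train, y_train)
    duration = perf_counter() - start
    return {
        "name": name,
        "train_time_s": duration,
        "r2": model.score(X_test, y_test),
        "mse": mean_squared_error(y_test, model.predict(X_test)),
    }


# Stand-in data (hypothetical); replace with the workshop split.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = [
    ("LR", LinearRegression()),
    ("GBR", GradientBoostingRegressor(random_state=42)),
]
results = [
    evaluate(name, model, X_train, X_test, y_train, y_test)
    for name, model in candidates
]

# Rank by MSE, lowest (best) first, and show the cost in training time.
for row in sorted(results, key=lambda r: r["mse"]):
    print(f"{row['name']}: MSE {row['mse']:.3f} in {row['train_time_s']:.2f}s")
```

Sorting by MSE while printing the training time alongside makes the trade-off behind the workshop goal (under 10 seconds, MSE below 120) easy to read off.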
