# Exercise 10 - Stacking

In this exercise you will implement an ensemble method by learning a stacked regressor.

- paul.kahlmeyer@uni-jena.de

### Submission
- Deadline of submission: 21.06.23 23:59
- Submission on [moodle page](https://moodle.uni-jena.de/course/view.php?id=43681)


### Help
If you cannot solve a task, you can use the saved values in the `help` directory to continue working on the other tasks:
- Load arrays with [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.load.html)
```
import numpy as np
np.load('help/array_name.npy')
```
- Load functions with [Dill](https://dill.readthedocs.io/en/latest/dill.html)
```
import dill
with open('help/some_func.pkl', 'rb') as f:
    func = dill.load(f)
```

# The Dataset

We will use a real world dataset used for predicting the [quality of red wine](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009).
Although the quality is a discrete value between 0 and 10, we treat its prediction as a regression task.

### Task 1

Load the dataset stored in `dataset.csv` and split it into `X` and `y`.

In [10]:
# TODO: load data
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import sklearn
from tqdm.notebook import tqdm

df = pd.read_csv("dataset.csv")
y = np.array(df['quality'])
X = np.array(df.iloc[:,:11])

df

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |


## $R^2$ Score

Sklearn uses the [$R^2$ score](https://en.wikipedia.org/wiki/Coefficient_of_determination) as a quality measure for regressors. Given true values $y$ and predicted values $\hat{y}$ the $R^2$ score is defined as 

\begin{align*}
R^2(y, \hat{y}) &= 1-\cfrac{\sum_{i=1}^m(y_i-\hat{y}_i)^2}{\sum_{i=1}^m(y_i - \bar{y})^2}\,,
\end{align*}
where $\bar{y}$ is the average of $y$.

This value is 1 if the predictions match the true values exactly, 0 if we always predict the average $\bar{y}$, and negative if our predictions are worse than this simple baseline.\
In short we aim for a value $>0$ and close to $1$.
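
As a quick sanity check of the three cases above, the following sketch evaluates the formula on a small made-up array. The values and the helper name `r2_demo` are illustrative assumptions and not part of the exercise data.

```python
import numpy as np

def r2_demo(y, y_hat):
    # 1 - (residual sum of squares) / (total sum of squares)
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# made-up labels with mean 5.0
y_true = np.array([3.0, 5.0, 6.0, 6.0])

perfect = r2_demo(y_true, y_true)                       # exact predictions
baseline = r2_demo(y_true, np.full_like(y_true, 5.0))   # always predict the mean
worse = r2_demo(y_true, y_true[::-1])                   # reversed labels as predictions

print(perfect, baseline, worse)
```

The first value is 1, the second is 0, and the third is negative because the reversed predictions have a larger squared error than the mean baseline.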

### Task 2

Implement the $R^2$ score.\
Then use scikit-learn's [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model to fit on the dataset and calculate the $R^2$ score.\
Compare your result to the `.score` method of the regressor.

In [11]:
def r2_score(X : np.ndarray, y : np.ndarray, y_hat : np.ndarray) -> float:
    '''
    Calculates the coefficient of determination.
    
    @Params:
        X... features (not needed for the computation)
        y... labels
        y_hat... predictions
    
    @Returns:
        score in (-inf, 1]
    '''
    
    # TODO: implement
    return 1 - np.sum((y-y_hat)**2) / np.sum((y-np.mean(y))**2)


# TODO: calculate r2 score for linear regressor, compare with .score
lr = LinearRegression()
lr.fit(X, y)
y_hat = lr.predict(X)

score0 = r2_score(None, y, y_hat)
score1 = lr.score(X, y)

"R2 score", score0, ".score", score1

('R2 score', 0.36055170303868833, '.score', 0.36055170303868833)

# Stacking

The main idea in stacking is to 
1. learn several heterogeneous base models on the original data
2. learn a meta model on the predictions of the base models

<div>
<img src="images/stacking.png" width="600"/>
</div>
The hope is that the meta model can learn to combine the strengths of the base models (e.g. if model 1 fails, model 3 is strong).
Note that, in contrast to bagging and boosting, the base models do not have to use the same method (e.g. decision trees).

## Base Models

First, let's select a set of base models. We can choose from the wide pool of regression methods.

Here we want to use the following models:
- [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- Polynomial Regression of degree 2 (use a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) of [Polynomial Features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) followed by Linear Regression)
- [KNN Regression](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
- [Decision Tree Regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
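
The polynomial regression entry in the list above can be sketched as a two-step pipeline. This is a minimal illustration on synthetic data; the arrays `X_demo` and `y_demo` and the quadratic relationship are assumptions made for the example and are not the wine dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic data (assumption): y depends quadratically on a single feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 - X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# degree-2 polynomial regression: expand the features, then fit a linear model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_demo, y_demo)

# plain linear regression for comparison
linear_model = LinearRegression().fit(X_demo, y_demo)

linear_r2 = linear_model.score(X_demo, y_demo)
poly_r2 = poly_model.score(X_demo, y_demo)
print("linear R2:    ", linear_r2)
print("polynomial R2:", poly_r2)
```

On this toy data the pipeline fits the training set better than the plain linear model, since the degree-2 features contain the linear ones. Whether this carries over to held-out data depends on the dataset.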


### Task 3
Create a list of base models and evaluate them using cross-validation (average over 10 folds).

In [12]:
# TODO: create base models

degree = 2
alpha = 1e-3

from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

from sklearn.model_selection import cross_val_score

linear_regression = LinearRegression()
knn_regression = KNeighborsRegressor()
dtree_regression = DecisionTreeRegressor()
polynomial_regression = make_pipeline(PolynomialFeatures(degree), LinearRegression())

regressors = [linear_regression, polynomial_regression, knn_regression, dtree_regression]
scores = []

# TODO: estimate avg. crossvalidation score for each base model
for r in regressors:
    # cross_val_score clones and refits the model on every fold,
    # so no separate call to r.fit is needed here
    scores.append(np.mean(cross_val_score(r, X, y, cv=10)))

scores

[0.23554709694307557,
 0.19022921981676175,
 -0.07634595535227609,
 -0.48844774679177255]

## Meta Model

The meta model uses the predictions of the base models to predict $y$. One can thus view the base models as a feature map for the meta model.

In order to train the meta model, **we need the predictions of the base models on unseen data** since this is the scenario we would face at inference time. A simple method is to use **out-of-fold predictions** during training:

1. separate the data into k folds.
2. hold out one of the folds and train the base models on the other folds.
3. predict the held out fold using the base models.
4. repeat the above two steps k times to obtain out-of-fold predictions for all k folds.
5. feed all the out-of-fold predictions as features (training data) to the meta model.
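
The steps above can be sketched with scikit-learn's `cross_val_predict`, which returns one out-of-fold prediction per sample. This is a minimal illustration under stated assumptions: the data is synthetic, and the names `X_demo`, `y_demo`, `base_models` and `meta_features` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data (assumption, not the wine dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

base_models = [LinearRegression(), DecisionTreeRegressor(random_state=0)]
cv = KFold(n_splits=5)

# one column of out-of-fold predictions per base model;
# each row was predicted by a model that did not see that row during training
meta_features = np.column_stack(
    [cross_val_predict(model, X_demo, y_demo, cv=cv) for model in base_models]
)

# the meta model is trained on these out-of-fold predictions
meta_model = LinearRegression().fit(meta_features, y_demo)
print(meta_features.shape)
```

The resulting matrix has one row per sample and one column per base model, which is the shape the meta model expects as training input.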


### Task 4

Implement the out-of-fold method below.\
Calculate the $R^2$-Score on the out-of-fold predictions for each of the base models.

In [13]:
from sklearn.model_selection import KFold

def oof_prediction(model : sklearn.base.BaseEstimator, X : np.ndarray, y : np.ndarray, k = 5, permutate : bool = False) -> np.ndarray:
    '''
    Calculates out-of-fold predictions.
    
    @Params:
        model... class with a .fit and .predict method
        X... samples
        y... labels
        k... number of folds
        permutate... whether to shuffle the samples before splitting
        
    @Returns:
        predictions, in the same order as the rows of X
    '''
    # TODO: implement
    kf = KFold(k, shuffle=permutate)

    # write each fold's predictions back to its original positions,
    # so the result stays aligned with y even when the folds are shuffled
    predictions = np.empty(X.shape[0], dtype=float)

    for train_index, test_index in kf.split(X):
        model.fit(X[train_index], y[train_index])
        predictions[test_index] = model.predict(X[test_index])

    return predictions

k = 10
# TODO: calculate r2 score for oof predictions for each base model
oof_scores = []
for r in regressors:
    # for reproducibility
    predictions = oof_prediction(r, X, y, k, permutate=False)
    oof_scores.append(r2_score(X, y, predictions))

print(regressors)
oof_scores

[LinearRegression(), Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('linearregression', LinearRegression())]), KNeighborsRegressor(), DecisionTreeRegressor()]


[0.332343863940908,
 0.2807938638434405,
 0.023571220255133585,
 -0.3510335317224569]

Now let's put everything together.

### Task 5

Implement the following `Stacking` class. Keep in mind the following things:
- the meta model is trained on out-of-fold predictions of the base models
- the base models are trained on the given dataset
- when predicting, we just use the predictions of the base models (no out-of-fold) as input for the meta model

Use your class to learn a stacked regressor with **linear regression as meta model** and the base models from Task 3. Evaluate it using cross-validation (average over 10 folds) and compare the score to those of the base models (Task 3).

In [14]:
def smallestDivisor(number: int):
    # smallest divisor of `number` that is at least 5 (candidate fold count)
    divisor = 5
    while number % divisor != 0:
        divisor+=1
    return divisor

smallestDivisor(1439) # 1439 is prime, so the only such divisor is 1439 itself

1439

In [15]:
class StackedRegressor(sklearn.base.BaseEstimator):
    
    def __init__(self, base_models : list, meta_model : sklearn.base.BaseEstimator, k : int = 10):
        self.base_models = base_models
        self.meta_model = meta_model
        # number of folds used for the out-of-fold predictions
        self.k = k
        
    def fit(self, X : np.ndarray, y : np.ndarray):
        '''
        Learns base and meta models.
        
        @Params:
            X... features
            y... labels
        '''
        # TODO: implement
        y_hats = []

        print("Fit base models")
        print(f"k={self.k}")
        for m in tqdm(self.base_models):
            # make oof predictions
            y_hats.append(oof_prediction(m, X, y, self.k, permutate=False))

            # fit base models on the whole dataset
            m.fit(X, y)

        # train the meta model on the out-of-fold predictions (one column per base model)
        self.meta_model.fit(np.stack(y_hats, axis=1), y)
        return self
    
    def predict(self, X : np.ndarray) -> np.ndarray:
        '''
        Given features, predicts labels.
        
        @Params:
            X... features
            
        @Returns:
            labels as array
        '''
        # TODO: implement
        # feature map with base models
        y_hats = []
        for m in tqdm(self.base_models):
            y_hats.append(m.predict(X))

        # predict with metamodel
        return self.meta_model.predict(np.stack(y_hats, axis=1))

    
    def score(self, X, y):
        '''
        R2-Score, needed for crossvalidation.
        
        @Params:
            X... features
            y... labels
            
        @Returns:
            R2 score of the predictions for X.
        '''
        # TODO: implement
        return r2_score(X, y, self.predict(X))

# TODO: fit stacked model
sr = StackedRegressor(regressors, LinearRegression())
sr.fit(X, y)

# TODO: evaluate with crossvalidation, compare to base models
np.mean(cross_val_score(sr, X, y, cv=10))

Fit base models
k=10



0.247949445130604

### Task 6

Use the [scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html) to learn a stacked regressor.\
Evaluate it using cross-validation (average over 10 folds) and compare the score to the score from Task 5.

Note that minor differences can occur due to a more advanced oof-prediction used in sklearn.
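
For orientation, a minimal sketch of the scikit-learn class on synthetic data follows. The data, the estimator names `"lr"` and `"knn"`, and the choice of `cv=5` are assumptions made for the example. The `cv` parameter controls the internal cross-validation that generates the training features for the final estimator.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# synthetic data (assumption, not the wine dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

stack = StackingRegressor(
    estimators=[("lr", LinearRegression()), ("knn", KNeighborsRegressor())],
    final_estimator=LinearRegression(),
    cv=5,  # folds used internally for the out-of-fold predictions
)

# outer cross-validation; the stacked model is cloned and refit on every fold
scores = cross_val_score(stack, X_demo, y_demo, cv=5)
print(scores.mean())
```

Base models are passed as `(name, estimator)` pairs, and the meta model is passed as `final_estimator`.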

In [81]:
# TODO: fit sklearn stacked model
from sklearn.ensemble import StackingRegressor
from itertools import chain, combinations

from sklearn.linear_model import QuantileRegressor
from sklearn.kernel_ridge import KernelRidge

def powerset(iterable):
    s = list(iterable)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s)+1)))

results = []

def stacking_regressor_helper(estimators):
    # iterate over every non-empty subset of the candidate base models
    for subset in tqdm(powerset(estimators)):
        if not subset:
            continue

        subset = list(subset)
        names = [name for name, _ in subset]
        print("Training:", names)

        reg = StackingRegressor(
            estimators=subset,
            final_estimator=LinearRegression()
        )

        # TODO: evaluate with crossvalidation, compare to custom model
        # cross_val_score clones and fits the model on every fold,
        # so no separate call to reg.fit is needed
        results.append((names, np.mean(cross_val_score(reg, X, y, cv=10))))

estimators = [
    ('Linear Regression', linear_regression),
    ('Polynomial Regression', polynomial_regression),
    ('KNN Regression', knn_regression),
    ('Paul ist cool 😎', dtree_regression),
    ('Kernel Ridge', KernelRidge()),
    ('Quantile Regressor', QuantileRegressor())
]
stacking_regressor_helper(estimators)

  0%|          | 0/64 [00:00<?, ?it/s]

Training: ['Linear Regression']
Training: ['Polynomial Regression']
Training: ['KNN Regression']
Training: ['Paul ist cool 😎']
Training: ['Kernel Ridge']
Training: ['Quantile Regressor']




KeyboardInterrupt: 

### Task 7
Try at least two different combinations of regressors for base models and meta model and report the average cross-validation score.
[Here](https://scikit-learn.org/stable/supervised_learning.html) you can find an overview page of sklearn estimators.

In [77]:
# TODO: try different combinations

# see task 6
pd.DataFrame(results)

| | 0 | 1 |
|---|---|---|
| 0 | [Linear Regression] | 0.236917 |
| 1 | [Polynomial Regression] | 0.168551 |
| 2 | [KNN Regression] | 0.000667 |
| 3 | [Paul ist cool 😎] | 0.026877 |
| 4 | [Kernel Ridge] | 0.237912 |
| 5 | [Linear Regression, Polynomial Regression] | 0.24276 |
| 6 | [Linear Regression, KNN Regression] | 0.237251 |
| 7 | [Linear Regression, Paul ist cool 😎] | 0.238822 |
| 8 | [Linear Regression, Kernel Ridge] | 0.236933 |
| 9 | [Polynomial Regression, KNN Regression] | 0.178466 |
