## General

### Framework for each dataset

1. Plot the response and check for multicollinearity and non-linear relationships: do we need to log-transform y or x?
2. Set a loss function and justify the choice (RMSE, MAE, ...)
3. Run different models (LM, Naive Bayes, trees, ...) and optimise parameters with 5-fold CV
4. For each model, choose the "optimal" parameters after CV and save the loss
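
Steps 2 to 4 can be sketched with scikit-learn's `GridSearchCV`. This is a minimal sketch under assumptions, not the project code: the data is synthetic, the `rmse` and `mae` helpers are written out here for illustration (the project's own helpers may differ), and Ridge with these `alpha` values is only a placeholder model and grid.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more than MAE."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to outliers than RMSE."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))


# Synthetic data (assumption: stands in for one of the project datasets)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = x @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# Step 3: tune the penalty with 5-fold CV. scikit-learn maximises scores,
# so the loss is negated via greater_is_better=False.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring=make_scorer(rmse, greater_is_better=False),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(x, y)

# Step 4: keep the "optimal" parameters and the corresponding CV loss
best_params = search.best_params_
best_cv_rmse = -search.best_score_  # flip the sign back to a loss
print(best_params, round(best_cv_rmse, 3))
```

Swapping `rmse` for `mae` in `make_scorer` is enough to compare how the choice of loss changes the selected parameters.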

### Final question

1. Compare the results of the algorithms across the different datasets
2. Which one performed best in which setting and why?
3. How sensitive are models to changes in parameters?
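
One possible way to approach question 3 is scikit-learn's `validation_curve`, which refits a model over a range of values for a single parameter. The sketch below assumes synthetic data, a decision tree, and `max_depth` as the parameter of interest. The "sensitivity" number at the end is an ad hoc summary chosen for illustration, not an established metric.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (assumption: placeholder for a real dataset)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(x[:, 0]) + 0.5 * x[:, 1] + rng.normal(scale=0.2, size=300)

depths = [1, 2, 4, 8, 16]
train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    x,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated losses; flip the sign and average over the 5 folds
valid_rmse = -valid_scores.mean(axis=1)
for depth, loss in zip(depths, valid_rmse):
    print(f"max_depth={depth:>2}  CV RMSE={loss:.3f}")

# Spread of the CV loss across the grid as a crude sensitivity summary
sensitivity = valid_rmse.max() - valid_rmse.min()
print(f"spread of CV RMSE over max_depth: {sensitivity:.3f}")
```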

## Inputs Tobias Kr

1. Create a list of models and a list of parameters that we can loop over for all datasets. I suggest *[Model (parameters)]*:
    - LM (robust I/O, normalise I/O) *(robust regression models: https://scikit-learn.org/stable/auto_examples/linear_model/plot_robust_fit.html)*
    - Ridge ($\lambda$)
    - Lasso ($\lambda$)
    - Bayesian Ridge
    - Decision Tree Regression (so many parameters o.o)
    - Random Forest Regression (also so many o.o)
    - k-nn Regressor (Scaling of Variables, Distance, k)
    - Auto-ML ? 🎉
    
*List of all scikit algorithms here: https://scikit-learn.org/stable/supervised_learning.html*
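
The model and parameter lists suggested above could be written down as follows. This is a sketch, not an agreed design: the `candidates` structure and all grid values are hypothetical placeholders. Note that scikit-learn calls the Ridge/Lasso penalty $\lambda$ `alpha`. `ParameterGrid` expands each grid into a flat list of parameter dicts, which can then be looped over for every dataset.

```python
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical candidate list: (name, estimator class, parameter grid).
# Grid values are placeholders, not tuned recommendations.
candidates = [
    ("LM", LinearRegression, {"fit_intercept": [True, False]}),
    ("Ridge", Ridge, {"alpha": [0.1, 1.0, 10.0]}),
    ("Lasso", Lasso, {"alpha": [0.01, 0.1, 1.0]}),
    ("BayesianRidge", BayesianRidge, {}),  # empty grid: defaults only
    ("DecisionTree", DecisionTreeRegressor,
     {"max_depth": [3, 5], "min_samples_leaf": [1, 5]}),
    ("kNN", KNeighborsRegressor, {"n_neighbors": [3, 5, 10], "p": [1, 2]}),
]

# Expand every grid into a flat list of parameter dicts
expanded = {name: list(ParameterGrid(grid)) for name, _, grid in candidates}

for name, _, _ in candidates:
    print(f"{name}: {len(expanded[name])} parameter combination(s)")
```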

---

## Showcase Classes

---

- **DataSetting:**
    - The environment for each dataset (or possibly all datasets). It lets us run all fits and regressions with one command, collect all outputs, and find the best models.
    - Inputs:
        - y, x, models, loss_function
    - Methods:
        - evaluate_all() ... runs all models with all parameter settings
        - collect_losses() ... collects all losses from all models and parameter settings
- **Regressor:**
    - Stores a model and various combinations of parameters as a list of dictionaries. Calling the method fit_all() fits the model with every parameter combination and saves the losses.
    - Inputs:
        - name, model, parameters, (possibly loss_function)
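
To make the two class descriptions concrete, here is a minimal sketch of how they might fit together. It is not the code in FullCode.py. The method bodies, the in-sample loss computation, the `fit_all(x, y, loss_function)` signature, and the list returned by `collect_losses()` are all assumptions made for illustration. The `rmse` helper is likewise a stand-in.

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeRegressor


def rmse(y_true, y_pred):
    """Root mean squared error (stand-in for the project's helper)."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


class Regressor:
    """One algorithm plus a list of parameter dicts to try (sketch)."""

    def __init__(self, name, model, parameters):
        self.name = name
        self.model = model            # estimator class, not an instance
        self.parameters = parameters  # list of keyword-argument dicts
        self.losses = []

    def fit_all(self, x, y, loss_function):
        """Fit once per parameter dict and store the in-sample loss."""
        self.losses = []
        for params in self.parameters:
            fitted = self.model(**params).fit(x, y)
            self.losses.append(loss_function(y, fitted.predict(x)))
        return self.losses


class DataSetting:
    """Holds one dataset and runs every Regressor on it (sketch)."""

    def __init__(self, y, x, models, loss_function):
        self.y, self.x = y, x
        self.models = models
        self.loss_function = loss_function

    def evaluate_all(self):
        """Fit all models with all parameter settings and report timing."""
        for model in self.models:
            start = time.perf_counter()
            model.fit_all(self.x, self.y, self.loss_function)
            print(f"{model.name} done in {time.perf_counter() - start:.3f} seconds.")

    def collect_losses(self):
        """Return (name, loss, parameters) for every fitted combination."""
        return [
            (model.name, loss, params)
            for model in self.models
            for loss, params in zip(model.losses, model.parameters)
        ]


# Usage on synthetic data (assumption: placeholder for a real dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x[:, 0] - 2 * x[:, 1] + rng.normal(scale=0.1, size=100)

tree = Regressor("DecisionTree", DecisionTreeRegressor,
                 [{"max_depth": 1}, {"max_depth": 5}])
ds = DataSetting(y=y, x=x, models=[tree], loss_function=rmse)
ds.evaluate_all()
results = ds.collect_losses()
```

Passing the estimator class rather than an instance keeps each fit independent, since a fresh estimator is built for every parameter dict.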
        
### Simple example

***Attention:** If you change the source file, you must restart the Jupyter kernel :)*

```python
# import numpy
import numpy as np

# load classes
from DataSetting import *
from Regressor import *
from HelpFunctions import *

# choose two models
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# get a dataset
# note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
dat = load_boston()
x = dat['data']
y = dat['target']

# set parameters for models as a list of dicts
parameters_randomforest = [{'max_depth':3, 'min_samples_leaf':1},
                           {'max_depth':6, 'bootstrap':False}]

parameters_decisiontree = [{'max_depth':1, 'max_leaf_nodes':2},
                           {'max_depth':3, 'max_leaf_nodes':2},
                           {'max_depth':5, 'max_leaf_nodes':5},
                           {'max_depth':7, 'max_leaf_nodes':5}]

# create Regressors (a Regressor is one algorithm and a list of parameters)
models = [Regressor(name="RandomForest",
                    model=RandomForestRegressor,
                    parameters=parameters_randomforest),
          Regressor(name="DecisionTree",
                    model=DecisionTreeRegressor,
                    parameters=parameters_decisiontree)]

# create a DataSetting
ds = DataSetting(y=y, x=x, models=models, loss_function=rmse)

# fit all models
ds.evaluate_all()

# show results
ds.collect_losses()
```

```
RandomForest done in 0.046 seconds.
DecisionTree done in 0.005 seconds.

RandomForest 	 2.65 	 {'max_depth': 3, 'min_samples_leaf': 1}
RandomForest 	 1.818 	 {'max_depth': 6, 'bootstrap': False}
DecisionTree 	 4.389 	 {'max_depth': 1, 'max_leaf_nodes': 2}
DecisionTree 	 4.389 	 {'max_depth': 3, 'max_leaf_nodes': 2}
DecisionTree 	 3.338 	 {'max_depth': 5, 'max_leaf_nodes': 5}
DecisionTree 	 3.338 	 {'max_depth': 7, 'max_leaf_nodes': 5}
```

---

## Full code

See FullCode.py.

---

## To Dos:

- Introduce (5-fold) cross-validation: create a matrix with (5) columns storing the IDs used in each CV fold.
- "collect_losses()" should return a vector with all loss values.
- Introduce a function like "collect_losses()" that keeps only the model with the best parameters.
- Research more parameters and add them.
- Add more ToDos 🎉
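
The first and third to-dos could look roughly like the sketch below. It is one interpretation, not a settled design. The fold matrix is taken to be a boolean matrix with one column per fold marking the held-out rows, which copes with folds of unequal size. The names `make_fold_matrix`, `cv_loss`, and `best_parameters` are hypothetical, and `rmse` is again a stand-in helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def rmse(y_true, y_pred):
    """Root mean squared error (stand-in for the project's helper)."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def make_fold_matrix(n_samples, n_folds=5, seed=0):
    """Boolean (n_samples, n_folds) matrix; column k marks the held-out rows of fold k."""
    folds = np.zeros((n_samples, n_folds), dtype=bool)
    splitter = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for k, (_, test_idx) in enumerate(splitter.split(np.arange(n_samples))):
        folds[test_idx, k] = True
    return folds


def cv_loss(model, params, x, y, folds, loss_function):
    """Average held-out loss of model(**params) over the columns of folds."""
    losses = []
    for k in range(folds.shape[1]):
        test = folds[:, k]
        fitted = model(**params).fit(x[~test], y[~test])
        losses.append(loss_function(y[test], fitted.predict(x[test])))
    return float(np.mean(losses))


def best_parameters(model, parameters, x, y, folds, loss_function):
    """Return (loss, params) for the parameter dict with the lowest CV loss."""
    scored = [(cv_loss(model, p, x, y, folds, loss_function), p) for p in parameters]
    return min(scored, key=lambda pair: pair[0])


# Usage on synthetic data (assumption: placeholder for a real dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=(103, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=103)

folds = make_fold_matrix(n_samples=len(y))
grid = [{"fit_intercept": True}, {"fit_intercept": False}]
best_loss, best_params = best_parameters(LinearRegression, grid, x, y, folds, rmse)
print(best_params, round(best_loss, 3))
```

Building the fold matrix once and reusing it for every model means all models are compared on identical splits.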