## 3. Models and evaluation

For both `m1` and `m2`, the same types of models are fit on each feature set. A simple linear regression serves as the baseline. In addition, polynomial regression, tree regression and singular value regression are tried. Grid search is used to tune hyperparameters. If training turns out to be too computationally expensive, Dask-ML is used instead.
Each model is assessed using 5-fold cross-validation. Measures of least squared error, sum squared error and mean squared error are produced, as well as accuracy and AUROC.

Lasso model in sklearn: https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html
Option: put the code in sklearn pipelines

### Setup

In [2]:
import os
from datetime import datetime

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

In [9]:
def frmt(scores):
    # R² is reported as-is: it can be negative, and np.abs would hide that
    r2 = scores['test_r2'].mean()
    # sklearn's neg_* scorers return negated errors, so flip the sign
    mse = -scores['test_neg_mean_squared_error'].mean()
    rmse = -scores['test_neg_root_mean_squared_error'].mean()
    mar = -scores['test_neg_mean_absolute_error'].mean()  # mean absolute error

    return {'r2': r2, 'mse': mse, 'rmse': rmse, 'mar': mar}
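
The `neg_*` scorers in scikit-learn return negated errors so that a higher score is always better, whereas R² is returned unchanged and can fall below zero. The following is a minimal sketch of that sign convention on synthetic data; `X_demo`, `y_demo` and `demo_scores` are illustrative names and not part of the project's feature sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=['r2', 'neg_mean_squared_error'],
)

# neg_* scores are never positive; flip the sign to recover the error
demo_mse = -demo_scores['test_neg_mean_squared_error'].mean()
# R² needs no sign flip
demo_r2 = demo_scores['test_r2'].mean()
```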

In [10]:
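def log(model, fts, estimator, results):
    # Append one CSV row per experiment: model, timestamp, feature set, estimator, metrics
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    line = f"{model},{timestamp},{fts},{estimator},{results['r2']},{results['mar']},{results['mse']},{results['rmse']}\n"

    # Distinct name for the file handle so it does not shadow log()
    with open('log.csv', 'a') as logfile:
        logfile.write(line)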

In [11]:
y1 = pd.read_csv('featsets/y1.csv', header=None)
y2 = pd.read_csv('featsets/y2.csv', header=None)

https://towardsdatascience.com/mse-and-bias-variance-decomposition-77449dd2ff55
https://machinelearningmastery.com/regression-metrics-for-machine-learning/
https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

### 3.1 Linear regression 

In [None]:
measures = ['r2', 'neg_mean_squared_error', 'neg_root_mean_squared_error', 'neg_mean_absolute_error']

#### m1

In [47]:
fts = '00_first_try'
X = pd.read_csv(f'featsets/X1/{fts}.csv', delimiter=',', header=None)

model = LinearRegression()
scores = cross_validate(model, X, y1, cv=5, scoring=measures)

log('m1', fts, 'linear_regression', frmt(scores))

#### m2

In [76]:
fts = '00_first_try'
X = pd.read_csv(f'featsets/X2/{fts}.csv', delimiter=',', header=None)

model = LinearRegression()
scores = cross_validate(model, X, y2, cv=5, scoring=measures)

log('m2', fts, 'linear_regression', frmt(scores))

### 3.2 Polynomial regression

> Use dask https://ml.dask.org/preprocessing.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures

#### m1

In [None]:
# fts = '00_first_try'
# X = pd.read_csv(f'featsets/X1/{fts}.csv', delimiter=',', header=None)

# model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression(fit_intercept=False))
# scores = cross_validate(model, X, y1, cv=5, scoring=measures)

# log('m1', fts, 'polynomial_regression', frmt(scores))
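
One likely reason the polynomial fit becomes expensive is the growth in the number of columns: with the default settings, `PolynomialFeatures(degree=d)` turns `n` input features into `C(n + d, d)` output features, bias column included. The sketch below cross-checks that formula against scikit-learn on a dummy matrix; `n_poly_features` is a hypothetical helper, and the feature counts in the loop are arbitrary examples, not the sizes of the actual feature sets.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree=2):
    """Number of output columns of PolynomialFeatures (bias included)."""
    return comb(n_features + degree, degree)


# Cross-check the formula against scikit-learn on a small dummy matrix
poly = PolynomialFeatures(degree=2).fit(np.zeros((1, 10)))
assert poly.n_output_features_ == n_poly_features(10)

# Arbitrary example sizes to show how quickly the column count grows
for n in (10, 100, 1000):
    print(n, n_poly_features(n))
```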

#### m2

### 3.3 Ridge
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge

In [12]:
from sklearn.linear_model import Ridge

measures = ['r2', 'neg_mean_squared_error', 'neg_root_mean_squared_error', 'neg_mean_absolute_error']

#### m1

In [13]:
fts = '00_first_try'
X = pd.read_csv(f'featsets/X1/{fts}.csv', delimiter=',', header=None)

model = Ridge(alpha=1.0)
scores = cross_validate(model, X, y1, cv=5, scoring=measures)

log('m1', fts, 'ridge_regression', frmt(scores))
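
The run above uses a fixed `alpha=1.0`. As a rough sketch of how a grid search could tune it, the block below runs `GridSearchCV` over a handful of `alpha` values on synthetic data. The grid, the choice of RMSE as selection metric, and the names `X_demo`, `y_demo` and `search` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

# Illustrative grid; a real search would be adapted to the feature sets
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    Ridge(), param_grid, cv=5,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['alpha']
best_rmse = -search.best_score_  # flip the sign of the negated scorer
```

After fitting, `search.best_estimator_` holds a `Ridge` model refit on all of the data with the selected `alpha`.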

#### m2

### 3.4 Lasso

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html#sklearn.linear_model.LassoLarsCV

In [15]:
from sklearn.linear_model import Lasso
# Warning raised during fitting:
#   Objective did not converge. You might want to increase the number of iterations,
#   check the scale of the features or consider increasing regularisation.
#   Duality gap: 1.939e+12, tolerance: 4.214e+08
# https://github.com/dask/dask-ml/issues/101

#### m1

In [16]:
fts = '00_first_try'
X = pd.read_csv(f'featsets/X1/{fts}.csv', delimiter=',', header=None)

model = Lasso(alpha=1.0)
scores = cross_validate(model, X, y1, cv=5, scoring=measures)

log('m1', fts, 'lasso', frmt(scores))
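
The convergence warning for Lasso suggests checking the scale of the features and increasing the number of iterations. A minimal sketch of both, assuming unscaled features are the cause: standardise inside a pipeline so the scaler is fit on each training fold only, and raise `max_iter`. The synthetic data, `alpha=0.1` and `max_iter=10_000` are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with columns on very different scales (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 1e2, 1e4, 1e6])
y_demo = X_demo @ np.array([3.0, 0.02, 1e-4, 2e-6]) + rng.normal(size=200)

# The scaler is refit inside every CV fold, so no test-fold statistics leak in
scaled_lasso = make_pipeline(
    StandardScaler(),
    Lasso(alpha=0.1, max_iter=10_000),
)

# With a single scoring string, results are stored under 'test_score'
demo_scores = cross_validate(scaled_lasso, X_demo, y_demo, cv=5, scoring='r2')
demo_r2 = demo_scores['test_score'].mean()
```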

Output: the convergence warning quoted above was raised at `cd_fast.enet_coordinate_descent` five times.


### 3.5 Tree regression

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeRegressor.html#sklearn.tree.ExtraTreeRegressor
https://george-jen.gitbook.io/data-science-and-apache-spark/decision-tree-regression

In [17]:
from sklearn.tree import DecisionTreeRegressor

#### m1

In [None]:
fts = '00_first_try'
X = pd.read_csv(f'featsets/X1/{fts}.csv', delimiter=',', header=None)


In [18]:
model = DecisionTreeRegressor(random_state=0)
scores = cross_validate(model, X, y1, cv=5, scoring=measures)

log('m1', fts, 'tree_regression', frmt(scores))

#### m2