# Train, Validate $\rightarrow$ Train, Test

### Focus: Linear and Ridge Regression

## Introduction
When constructing a model, limited data can become an issue. 
To detect overfitting, some portion of the data must be withheld as a test set. 
However, if the test set is used repeatedly to compare or tune models, the chosen model can end up overfitting *to the test set* unless a separate validation step is used. 
For this reason, `scikit-learn` provides a number of utilities for cross-validation.

## References
1. [scikit-learn documentation - LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
2. [scikit-learn documentation - Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)


## Setting up the model

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.dummy import DummyRegressor
from collections import OrderedDict

# load dataset 
raw = load_diabetes()
X = raw.data  # keep all ten features (.data is a 2-D array: samples x features)
y = raw.target  # the target is a single continuous value per sample, so it can all be kept

# test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# we'll compare linear regression and ridge regression against a dummy baseline on the diabetes data
basline_reg_model = DummyRegressor()
linear_reg_model = LinearRegression()
ridge_reg_model = Ridge()
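The split above is unseeded, so every rerun yields slightly different numbers. The following is a minimal sketch of a reproducible split, and of what the baseline actually does. The seed `random_state=0` and the names `baseline` and `baseline_pred` are assumptions made for illustration, not part of the lab.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# random_state=0 is an arbitrary seed chosen for this sketch;
# any fixed integer makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# DummyRegressor's default strategy ("mean") ignores the features and
# always predicts the mean of the training targets
baseline = DummyRegressor().fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

print("Constant prediction:", baseline_pred[0])
print("Training target mean:", y_train.mean())
```

Any model worth keeping should beat this constant prediction, which is why the baseline is carried through the rest of the comparison.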


## Cross-validation
Though a manual CV workflow was described in [the cross-validation lab](./CrossValidation.ipynb), the automated `cross_val_score()` will work well enough for this example.
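For orientation, here is a rough sketch of what `cross_val_score()` does for a regressor when `cv=5`: it splits the data into five consecutive folds without shuffling, fits a fresh copy of the model on four of them, and scores it on the fifth. The sketch uses the full diabetes data for brevity, and the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

manual_scores = []
# KFold(n_splits=5) without shuffling is what cv=5 resolves to for regressors
for fit_idx, val_idx in KFold(n_splits=5).split(X):
    # clone() gives an unfitted copy so that folds do not share state
    fold_model = clone(model).fit(X[fit_idx], y[fit_idx])
    # .score() on a regressor returns R^2 on the held-out fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(model, X, y, cv=5)
print("Manual:   ", manual_scores)
print("Automated:", auto_scores)
```

The two arrays should agree, which is why the one-line helper is enough here.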

In [2]:
print('Features: ', raw.feature_names)

Features:  ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


In [3]:
# automated CV step

baseline_scores = cross_val_score(basline_reg_model, X_train, y_train, cv=5)
linear_scores = cross_val_score(linear_reg_model, X_train, y_train, cv=5)
ridge_score = cross_val_score(ridge_reg_model, X_train, y_train, cv=5)
print("Baseline score :",baseline_scores) 
print("Linear score :",linear_scores) 
print("Ridge score  :", ridge_score)

Baseline score : [-1.80462803e-03 -3.36823685e-06 -1.40266582e-02 -8.86415630e-07
 -7.02248801e-03]
Linear score : [0.44978621 0.59947543 0.49575483 0.57183858 0.43842298]
Ridge score  : [0.40684985 0.44450481 0.38745555 0.41612711 0.35038474]
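Five numbers per model are hard to compare at a glance, so a common habit is to condense them into a mean and a spread. The sketch below does this for the fold scores printed above; the values are copied from that output and will differ on a rerun because the split is unseeded.

```python
import numpy as np

# per-fold R^2 values copied from the output above
linear_scores = np.array([0.44978621, 0.59947543, 0.49575483, 0.57183858, 0.43842298])
ridge_scores = np.array([0.40684985, 0.44450481, 0.38745555, 0.41612711, 0.35038474])

for name, fold_scores in [("Linear", linear_scores), ("Ridge", ridge_scores)]:
    # mean summarizes typical performance, std shows fold-to-fold variation
    print(f"{name}: mean R^2 = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

On these numbers the mean R^2 is roughly 0.51 for linear regression and roughly 0.40 for ridge regression.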


Note that cross-validation is performed on the training set only. Each score measures how well the model, fitted on four of the five folds, predicts the remaining held-out fold for this regression task. For regressors, `cross_val_score()` uses the estimator's default scorer, which is **R^2 (coefficient of determination)**. The best possible R^2 score is 1.0, a model that always predicts the mean of the held-out targets scores 0.0, and the score can be negative because a model can be arbitrarily worse than that. This explains the baseline scores just below zero: the dummy model predicts the mean of the folds it was fitted on, which differs slightly from the mean of the held-out fold. See [the `r2_score` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score). 
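To make the R^2 definition concrete, it can be written as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of the true values being scored. The sketch below computes this by hand on toy values (the numbers and the helper `r2_by_hand` are made up for illustration) and checks it against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0])  # toy targets with mean 2.5

perfect = r2_by_hand(y_true, y_true)              # exact predictions
mean_pred = r2_by_hand(y_true, np.full(4, 2.5))   # constant equal to the mean
off_mean = r2_by_hand(y_true, np.full(4, 3.0))    # constant slightly off the mean

print(perfect, mean_pred, off_mean)
print(np.isclose(off_mean, r2_score(y_true, np.full(4, 3.0))))
```

Exact predictions give 1.0, the mean of the scored targets gives 0.0, and a constant that misses that mean (here 3.0 instead of 2.5) gives -0.2, the same pattern as the baseline scores.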

## Training the new model

Since the CV scores for the linear and ridge models are well above the baseline, we can fit each model on all the data in the training set and evaluate it against the test set:

In [4]:
# fit new model
basline_reg_model.fit(X_train, y_train)
linear_reg_model.fit(X_train, y_train)
ridge_reg_model.fit(X_train, y_train)

# model.predict() returns predicted values
y_pred_baseline = basline_reg_model.predict(X_test)
y_pred_linear = linear_reg_model.predict(X_test)
y_pred_ridge = ridge_reg_model.predict(X_test)


### Model metrics

In [5]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
from pandas import DataFrame
import numpy as np

scores = [
    [mean_squared_error(y_test, y_pred_baseline),
     mean_absolute_error(y_test, y_pred_baseline),
     mean_absolute_percentage_error(y_test, y_pred_baseline),
     r2_score(y_test, y_pred_baseline)
    ],    
    [mean_squared_error(y_test, y_pred_linear),
     mean_absolute_error(y_test, y_pred_linear),
     mean_absolute_percentage_error(y_test, y_pred_linear),
     r2_score(y_test, y_pred_linear)
    ],    
    [mean_squared_error(y_test, y_pred_ridge),
     mean_absolute_error(y_test, y_pred_ridge),
     mean_absolute_percentage_error(y_test, y_pred_ridge),
     r2_score(y_test, y_pred_ridge)
    ],
    
]

scores_matrix = DataFrame(scores, 
                          columns=['MSE', 'MAE', 'MAPE', 'R2'],
                         index=['Baseline', 'Linear Regression', 'Ridge Regression'])

scores_matrix = np.around(scores_matrix, 2)
print(scores_matrix)

                       MSE    MAE  MAPE    R2
Baseline           5420.61  64.27  0.67 -0.02
Linear Regression  3137.69  45.20  0.44  0.41
Ridge Regression   3280.30  48.55  0.49  0.38
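As a reading aid for the table, the three error metrics can be computed by hand. The sketch below uses small made-up arrays (`y_true` and `y_hat` are not taken from the lab) and compares the results with the `sklearn.metrics` functions used above.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# toy values chosen for easy arithmetic
y_true = np.array([100.0, 200.0, 50.0, 250.0])
y_hat = np.array([110.0, 180.0, 60.0, 250.0])

err = y_true - y_hat
mse = np.mean(err ** 2)                       # squares penalize large errors more
mae = np.mean(np.abs(err))                    # average error in target units
mape = np.mean(np.abs(err) / np.abs(y_true))  # average relative error, as a fraction

print(mse, mae, mape)
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mape, mean_absolute_percentage_error(y_true, y_hat)))
```

Note that scikit-learn reports MAPE as a fraction rather than a percentage, so a value of 0.44 in the table corresponds to an average relative error of about 44%.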


**Comment: On this split, Linear Regression outperforms both Ridge Regression and the baseline on every metric. For MSE, MAE, and MAPE, lower is better; for R^2, higher is better.**
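One plausible reason Ridge trails here is that it was used with its default penalty, `alpha=1.0`, which may be strong for this dataset because the diabetes features are pre-scaled to small magnitudes. The sketch below shows how the penalty could be compared using cross-validation on the training set only, so that the test set stays untouched. The candidate grid `alphas` and the seed `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
# random_state=0 is an arbitrary seed for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

alphas = [0.001, 0.01, 0.1, 1.0]  # hypothetical candidates; 1.0 is Ridge's default
mean_cv_r2 = {}
for alpha in alphas:
    # 5-fold CV on the training set only; the test set is not consulted
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
    mean_cv_r2[alpha] = fold_scores.mean()

best_alpha = max(mean_cv_r2, key=mean_cv_r2.get)
for alpha, score in mean_cv_r2.items():
    print(f"alpha={alpha}: mean CV R^2 = {score:.3f}")
print("Best alpha by CV:", best_alpha)
```

Whichever value wins, the selection is made with validation folds, and the test set is then used once for the final comparison, which is the train, validate, then train, test pattern this lab is about.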