# Cross Validation: Mini Tutorial with Boston Housing Data

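This tutorial shows how to use cross-validation with `quantcore.glm` on the [scikit-learn Boston housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). Note that `load_boston` was removed in scikit-learn 1.2, so the code below requires an earlier release.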

*Note:* If you wish to explore this dataset further, there are several resources online, such as [this blog post](https://medium.com/@amitg0161/sklearn-linear-regression-tutorial-with-boston-house-dataset-cde74afd460a).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
from quantcore.glm import GeneralizedLinearRegressorCV

## Load the data

In [2]:
boston = datasets.load_boston()
df_bos = pd.DataFrame(boston.data, columns = boston.feature_names)
df_bos['PRICE'] = boston.target
df_bos = df_bos[df_bos['PRICE'] <= 40] # remove outliers
df_bos.head()

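|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |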


## Fit model

We fit our `GeneralizedLinearRegressorCV` model using typical regularized least squares (Normal family). As the name implies, the best model is selected by cross-validation.
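
To make the selection step concrete, here is a minimal sketch of K-fold cross-validation over a grid of penalty strengths. It is an illustration only and not how `quantcore.glm` implements the search: it uses a closed-form ridge (pure L2) fit on synthetic data, and the helper names `ridge_fit` and `cv_select_alpha` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n, p = Xc.shape
    # Minimizes 1/(2n) * ||yc - Xc w||^2 + alpha/2 * ||w||^2
    coef = np.linalg.solve(Xc.T @ Xc / n + alpha * np.eye(p), Xc.T @ yc / n)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def cv_select_alpha(X, y, alphas, n_folds=5, seed=0):
    """Return the alpha with the lowest mean validation MSE across folds."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once, then cut them into n_folds chunks
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_mse = []
    for alpha in alphas:
        fold_mse = []
        for k in range(n_folds):
            val_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            coef, intercept = ridge_fit(X[train_idx], y[train_idx], alpha)
            pred = X[val_idx] @ coef + intercept
            fold_mse.append(np.mean((y[val_idx] - pred) ** 2))
        mean_mse.append(np.mean(fold_mse))
    best = int(np.argmin(mean_mse))
    return alphas[best], mean_mse


# Synthetic data for illustration only (not the Boston data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_alpha, mean_mse = cv_select_alpha(X_demo, y_demo, alphas)
print(f"Selected alpha: {best_alpha}")
```

Each candidate `alpha` is scored only on data that was held out from its fit, so the selected value reflects out-of-sample error rather than training error.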

Some important parameters:

- **alphas**: For each model, `alpha` is the constant that multiplies the penalty term and therefore determines the regularization strength. For `GeneralizedLinearRegressorCV()`, `alphas` is the list of alphas for which to compute the models. If `None` (preferred), the alphas are set automatically. The best value is chosen and stored as `self.alpha_`.
- **l1_ratio**: For each model, `l1_ratio` is the elastic net mixing parameter (`0 <= l1_ratio <= 1`). For `l1_ratio = 0`, the penalty is an L2 penalty. For `l1_ratio = 1`, it is an L1 penalty. For `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2. For `GeneralizedLinearRegressorCV()`, if you pass `l1_ratio` as an array, the `fit` method will choose the best value of `l1_ratio` and store it as `self.l1_ratio_`.
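
The way `alpha` and `l1_ratio` combine can be sketched as follows. This assumes the common elastic net form `alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)`; the helper `elastic_net_penalty` is hypothetical and the coefficients are made up for illustration, so check the library documentation for the exact definition it uses.

```python
import numpy as np


def elastic_net_penalty(coef, alpha, l1_ratio):
    """Elastic net penalty for a coefficient vector (intercept excluded)."""
    coef = np.asarray(coef, dtype=float)
    l1_term = np.sum(np.abs(coef))     # L1 norm of the coefficients
    l2_term = 0.5 * np.sum(coef ** 2)  # half the squared L2 norm
    return alpha * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)


# Illustrative coefficients, not fitted values
coef = np.array([1.0, -2.0, 0.0, 0.5])
for ratio in (0.0, 0.4, 1.0):
    penalty = elastic_net_penalty(coef, alpha=0.1, l1_ratio=ratio)
    print(f"l1_ratio={ratio}: penalty={penalty:.4f}")
```

With `l1_ratio=0.0` only the L2 term contributes, with `l1_ratio=1.0` only the L1 term does, and doubling `alpha` doubles the penalty in every case.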

In [3]:
X = df_bos[["CRIM", "ZN", "CHAS", "NOX", "RM", "AGE", "TAX", "B", "LSTAT"]]
y = df_bos["PRICE"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=5)

glmcv = GeneralizedLinearRegressorCV(
    family='normal',
    alphas=None,
    l1_ratio=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
    fit_intercept=True,
    max_iter=150
)
glmcv.fit(X_train, y_train)
print(f"Chosen alpha:    {glmcv.alpha_}")
print(f"Chosen l1 ratio: {glmcv.l1_ratio_}")

Chosen alpha:    0.021157592673796407
Chosen l1 ratio: 0.4


## Test

In [4]:
# mean_squared_error expects (y_true, y_pred); squared=False returns the RMSE
print(f"Train RMSE: {mean_squared_error(y_train, glmcv.predict(X_train), squared=False)}")
print(f"Test  RMSE: {mean_squared_error(y_test, glmcv.predict(X_test), squared=False)}")

Train RMSE: 3.8043984268543487
Test  RMSE: 2.8062480574512425
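
For reference, the RMSE reported above is the square root of the mean squared residual. A minimal NumPy sketch of the same quantity, using a hypothetical helper `rmse` and made-up numbers rather than the model's predictions:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values for illustration: residuals are 1, 0 and -2
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))
```

Because the residuals are squared, the result has the same units as `PRICE` and is unchanged if the two arguments are swapped.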
