<h1><center>STAT/CSE 416 Lab 2: Ridge and LASSO Regression</center></h1>
<center><b>Section:</b> AA/AB</center>
<center><b>Instructor:</b> Datong</center>
<center><b>TA:</b> Long Nguyen</center>
<center><b>Date:</b> October 12, 2023</center>

*Adapted from slides and notebooks from STAT/CSE 416 Spring 2021 TAs.*

## 1. Introduction

This lab will cover:
- Regularization by the $\ell_2$-penalty.
- Regularization by the $\ell_1$-penalty.
- Cross-validation to select hyperparameters.

## 2. Concepts

### 2.1. Why regularization?

**Regularization** generally refers to adjustments to the learning objective or optimization algorithm that "smooth" the resulting predictor $\hat{f}$ in an effort to **decrease variance** and **prevent overfitting**. (See [here](https://en.wikipedia.org/wiki/Overfitting) for examples.)

A common framework for regularization involves decreasing the magnitude of the coefficients in the coefficient vector $w$. We choose some measure $R(w)$ of the magnitude of the coefficients and add it, weighted by a hyperparameter $\lambda$, to the training objective:
$$
\min_{w} \text{RSS}(w) + \lambda R(w).
$$
Note that these two terms are often in competition with one another: do we care more about fitting the data (keeping $\text{RSS}(w)$ small) or about having small coefficients (keeping $R(w)$ small)? The hyperparameter $\lambda$ controls this trade-off.
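
To make the trade-off concrete, here is a minimal sketch that evaluates $\text{RSS}(w) + \lambda R(w)$ for two candidate coefficient vectors. The toy data, the helper names `regularized_objective` and `squared_l2`, and the choice of a squared $\ell_2$ measure for $R(w)$ are all assumptions made for illustration, not part of the lab's code.

```python
import numpy as np

def squared_l2(w):
    """One possible magnitude measure R(w): the sum of squared coefficients."""
    return float(w @ w)

def regularized_objective(w, X, y, lam, penalty):
    """Evaluate RSS(w) + lam * R(w) for the linear predictor X @ w."""
    residuals = y - X @ w
    rss = float(residuals @ residuals)
    return rss + lam * penalty(w)

# Toy data (made up): y = 2 * x exactly, with a single feature.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

w_fit = np.array([2.0])    # fits the data perfectly: RSS = 0, R(w) = 4
w_small = np.array([1.0])  # smaller coefficient:     RSS = 14, R(w) = 1

for lam in [0.0, 1.0, 10.0]:
    print(lam,
          regularized_objective(w_fit, X, y, lam, squared_l2),
          regularized_objective(w_small, X, y, lam, squared_l2))
```

Comparing the two columns for each value of `lam` shows which candidate the objective prefers as the weight on $R(w)$ changes.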

**Exercise 2.1:**
- Does increasing $\lambda$ increase or decrease the values of the $w$ parameters?
- Please explain why.
- What would we expect to happen if we set $\lambda < 0$?

### 2.2. Ridge vs Lasso

Ridge regression uses the $\ell_2$ penalty, that is
$$
R(w) = ||w||_2^2 = \sum_{j=1}^d |w_j|^2 = |w_1|^2 + ... + |w_d|^2.
$$
For those familiar with vector calculus or linear algebra, this is just the squared Euclidean length of $w$.
LASSO regression uses the $\ell_1$ penalty, that is
$$
R(w) = ||w||_1 = \sum_{j=1}^d |w_j| = |w_1| + ... + |w_d|.
$$
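
As a quick numerical check of the two penalties, the sketch below computes both for a small example vector. The vector `w` is made up for illustration.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])  # example coefficient vector (made up)

l2_penalty = np.sum(w ** 2)           # ridge penalty: 9 + 16 + 0
l1_penalty = np.sum(np.abs(w))        # LASSO penalty: 3 + 4 + 0
euclidean_length = np.linalg.norm(w)  # square root of the ridge penalty

print(l2_penalty, l1_penalty, euclidean_length)
```

Note that the ridge penalty is the square of `np.linalg.norm(w)`, while the LASSO penalty simply adds up absolute values.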

## 3. Preprocessing

The basic preprocessing steps will be viewing and standardizing the data.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Read in weather data and view.
weather = pd.read_csv("weather.csv")
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   STA     20000 non-null  int64  
 1   YR      20000 non-null  int64  
 2   MO      20000 non-null  int64  
 3   DA      20000 non-null  int64  
 4   MAX     20000 non-null  float64
 5   MIN     20000 non-null  float64
dtypes: float64(2), int64(4)
memory usage: 937.6 KB


In [3]:
train_data, test_data = train_test_split(weather, random_state=42)

input_cols = ['STA', 'YR', 'MO', 'DA', 'MIN']

# TODO:
train_X = train_data[input_cols]
train_y = train_data["MAX"]

test_X = test_data[input_cols]
test_y = test_data["MAX"]

In [13]:
train_X.head()

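|       | STA   | YR | MO | DA | MIN  |
|-------|-------|----|----|----|------|
| 5514  | 50403 | 44 | 1  | 23 | 72.0 |
| 1266  | 16405 | 42 | 7  | 25 | 41.0 |
| 5864  | 10002 | 44 | 9  | 5  | 74.0 |
| 15865 | 62701 | 43 | 4  | 18 | 70.0 |
| 12892 | 31701 | 43 | 5  | 19 | 66.0 |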


Check the shape of each set to make sure it makes sense!

In [6]:
print("train input data shape:", train_X.shape)
print("train target data shape:", train_y.shape)
print()
print("test input data shape:", test_X.shape)
print("test target data shape:", test_y.shape)

train input data shape: (15000, 5)
train target data shape: (15000,)

test input data shape: (5000, 5)
test target data shape: (5000,)
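
The 15000/5000 split above follows from the default behavior of `train_test_split`, which holds out 25% of the rows when neither `test_size` nor `train_size` is given. A minimal sketch on a made-up array (the names `toy`, `toy_train`, and `toy_test` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy array with 20 rows and 2 columns (made up for illustration).
toy = np.arange(40).reshape(20, 2)

# No test_size given, so 25% of the rows go to the test set.
toy_train, toy_test = train_test_split(toy, random_state=42)

print(toy_train.shape, toy_test.shape)
```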


**Exercise 3.1:** Standardize the training and test set input data (`X`) using statistics computed from the training set only. To do this, use the [Standard Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from sklearn.

In [8]:
# Fit on the training inputs only; StandardScaler ignores any target argument.
scaler = StandardScaler().fit(train_X)

# TODO:
train_X_norm = None
test_X_norm = None
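
As an illustration of the fit-on-train, transform-both pattern, here is a sketch on made-up arrays. The names `toy_train`, `toy_test`, and `toy_scaler` are hypothetical; in the lab the same pattern would apply to `train_X` and `test_X`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays (made up); each column is one feature.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[4.0, 40.0]])

# Learn the per-column mean and standard deviation from the training rows only.
toy_scaler = StandardScaler().fit(toy_train)

toy_train_norm = toy_scaler.transform(toy_train)
toy_test_norm = toy_scaler.transform(toy_test)  # reuses the training statistics

print(toy_train_norm.mean(axis=0))  # approximately 0 in every column
print(toy_train_norm.std(axis=0))   # approximately 1 in every column
print(toy_test_norm)                # not centered: the test row lies above the training mean
```

Reusing the training statistics on the test set keeps information about the test data out of the preprocessing step.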

## 4. Ridge Regression

We create a [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) linear model with a regularization coefficient of `1.0`.

Note: This coefficient is referred to as "lambda" ($\lambda$) in the course material and "alpha" ($\alpha$) in the `sklearn` docs. They are the same thing!
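
One way to convince yourself that `alpha` plays the role of $\lambda$ is to compare `Ridge` against the minimizer of $\text{RSS}(w) + \lambda ||w||_2^2$ computed by hand. The sketch below does this on synthetic data and turns off the intercept so that the closed form $(X^T X + \lambda I)^{-1} X^T y$ applies directly. The data and the names `X_toy`, `y_toy`, and `w_closed` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (made up): 50 rows, 3 features, small noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 1.0
model = Ridge(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy)

# Minimizer of RSS(w) + alpha * ||w||_2^2 when there is no intercept.
w_closed = np.linalg.solve(X_toy.T @ X_toy + alpha * np.eye(3), X_toy.T @ y_toy)

print(model.coef_)
print(w_closed)
```

The two printed vectors should agree up to numerical tolerance.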

In [9]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

**Exercise 4.1:** Train the model using the training data and output the training error.

In [10]:
# TODO: 
ridge_model = None

**Exercise 4.2:** Define a function `rmse(model, X, y)` that calculates the root mean squared error (RMSE) for a given model, input, and target data.

In [11]:
# TODO:
def rmse(model, X, y):
    pass
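
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below evaluates that formula on two made-up arrays; it is not the exercise solution, which additionally needs to obtain predictions from the model.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

squared_errors = (y_true - y_pred) ** 2       # [1, 0, 4]
rmse_value = np.sqrt(np.mean(squared_errors))  # sqrt(5 / 3)

print(rmse_value)
```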

**Exercise 4.3:** Perform 5-fold cross validation with your Ridge model. Output the array of errors (length 5) as well as the mean error. You should use [Cross Validation Score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross_val_scor) from `sklearn` to do this.

In [None]:
# TODO: 
ridge_CV_scores = cross_val_score(None, None, None, cv=5, scoring=rmse)
ridge_CV_scores
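
`cross_val_score` accepts a callable for `scoring` as long as it has the signature `(estimator, X, y)` and returns a single number. The sketch below shows the mechanics on synthetic data; `toy_rmse`, `X_toy`, and `y_toy` are hypothetical names, and the data are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def toy_rmse(estimator, X, y):
    """Scorer with the (estimator, X, y) signature expected by cross_val_score."""
    predictions = estimator.predict(X)
    return float(np.sqrt(np.mean((y - predictions) ** 2)))

# Synthetic regression data (made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One score per fold: the model is refit on 4 folds and scored on the held-out fold.
scores = cross_val_score(Ridge(alpha=1.0), X_toy, y_toy, cv=5, scoring=toy_rmse)

print(scores)
print(scores.mean())
```

Because the scorer returns a raw error, lower values are better here, unlike sklearn's built-in scoring strings, which follow a higher-is-better convention.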

Perform 5-fold cross validation on Ridge models with a range of `alpha` values. For each `alpha`, print the `alpha` value and the corresponding mean CV score.

In [None]:
for reg_coef in [0.1, 1, 10, 100, 1000, 10000]:
    ridge_model = Ridge(alpha=reg_coef)
    ridge_CV_scores = cross_val_score(ridge_model, train_X_norm, train_y, cv=5, scoring=rmse)
    print(reg_coef, ridge_CV_scores.mean(), sep='\t')
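
Rather than reading the best value off the printed table, one can record the mean CV error per `alpha` and take the minimum. The following is a sketch on synthetic data, with hypothetical names (`toy_rmse`, `mean_errors`, `best_alpha`) and a made-up candidate grid.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def toy_rmse(estimator, X, y):
    """RMSE scorer with the (estimator, X, y) signature."""
    return float(np.sqrt(np.mean((y - estimator.predict(X)) ** 2)))

# Synthetic regression data (made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

candidates = [0.1, 1, 10, 100, 1000]
mean_errors = {}
for alpha in candidates:
    scores = cross_val_score(Ridge(alpha=alpha), X_toy, y_toy, cv=5, scoring=toy_rmse)
    mean_errors[alpha] = scores.mean()

# Lower RMSE is better, so pick the alpha with the smallest mean CV error.
best_alpha = min(mean_errors, key=mean_errors.get)
print(best_alpha, mean_errors[best_alpha])
```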

Take a look at how the weights of Ridge models change as you change the regularization coefficient!

In [None]:
for reg_coef in [1, 100, 10e4, 10e7, 10e12]:
    ridge_model = Ridge(alpha=reg_coef)
    ridge_model.fit(train_X_norm, train_y)
    print(f"Lambda: \t {reg_coef}")
    print(f"Intercept: \t {ridge_model.intercept_}")
    print(f"Coefficients: \t {ridge_model.coef_}")
    print("------------------")

## 5. Regularization with LASSO

Create a [LASSO](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) linear model with a regularization coefficient of `1.0`.

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso_model = Lasso().fit(train_X_norm, train_y)
rmse(lasso_model, train_X_norm, train_y)

Perform 5-fold cross validation with your LASSO model. Output the array of errors (length 5) as well as the mean error.

In [None]:
lasso_CV_scores = cross_val_score(Lasso(), train_X_norm, train_y, cv=5, scoring=rmse)
print(lasso_CV_scores, lasso_CV_scores.mean())  # per-fold errors, then their mean

Perform 5-fold cross validation on LASSO models with a range of `alpha` values. For each `alpha`, print the `alpha` value and the corresponding mean CV score.

In [None]:
for reg_coef in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100, 1000]:
    lasso_model = Lasso(alpha=reg_coef)
    lasso_CV_scores = cross_val_score(lasso_model, train_X_norm, train_y, cv=5, scoring=rmse)
    print(reg_coef, lasso_CV_scores.mean(), sep='\t')

Take a look at how the weights of LASSO models change as you change the regularization coefficient! For reasons we might not delve into in this course, LASSO tends to find **sparse** solutions, i.e. solutions that have many zeros in the coefficient vector.

In [None]:
for reg_coef in [0.001, 0.1, 1, 10, 100]:
    lasso_model = Lasso(alpha=reg_coef)
    lasso_model.fit(train_X_norm, train_y)
    print(lasso_model.intercept_, lasso_model.coef_)
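
To see the sparsity effect in isolation, the sketch below fits LASSO on synthetic data in which only the first two of five features influence the target, and counts the coefficients that are exactly zero for each `alpha`. The data and the names `X_toy`, `y_toy`, and `zero_counts` are assumptions for illustration. Note that sklearn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` values are not on the same scale as `Ridge`'s.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (made up): only features 0 and 1 affect the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

zero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X_toy, y_toy).coef_
    zero_counts[alpha] = int(np.sum(coef == 0))  # coefficients set exactly to zero
    print(alpha, zero_counts[alpha], np.round(coef, 3))
```

With a large enough `alpha`, every coefficient is driven to exactly zero and the model predicts with the intercept alone.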

**Exercise 5.1:** Using the regularization coefficient that leads to the best validation error, compute test scores for a Ridge and LASSO model.

In [None]:
# TODO:
print("Ridge", rmse(Ridge(alpha=None).fit(train_X_norm, train_y), test_X_norm, test_y))
print("LASSO", rmse(Lasso(alpha=None).fit(train_X_norm, train_y), test_X_norm, test_y))