### 3.4 Shrinkage Methods
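In this section we focus on shrinkage methods, which shrink the regression coefficient estimates toward zero. Because shrinkage is a continuous process, these methods often have lower variance than subset selection, which keeps or discards each predictor outright. The lower variance can translate into lower prediction error.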

#### 3.4.1 Ridge Regression
We'll begin with ridge regression, also commonly known as L2 regularization. Ridge regression shrinks the coefficients by imposing a penalty on their size. As in OLS, the ridge coefficients minimize a residual sum of squares, but with an added penalty $\lambda \beta^T \beta$, where $\lambda \ge 0$ controls the amount of shrinkage:
$$
RSS(\lambda) = (y-X\beta)^T (y- X\beta) + \lambda \beta^T \beta
$$
Differentiating with respect to $\beta$ and setting the result equal to zero yields the closed-form solution:
$$
\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y
$$
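As a minimal sketch of the closed-form solution above, the snippet below solves the ridge normal equations with NumPy on synthetic data. The helper name `ridge_closed_form` and the `*_demo` arrays are illustrative assumptions, not part of the prostate analysis.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return beta solving (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    # Solving the linear system avoids forming the matrix inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic, centered data for illustration only (assumption: no intercept column).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
y_demo -= y_demo.mean()

beta_ols = ridge_closed_form(X_demo, y_demo, lam=0.0)     # lam = 0 recovers OLS
beta_ridge = ridge_closed_form(X_demo, y_demo, lam=10.0)  # lam > 0 shrinks the coefficients
```

With $\lambda = 0$ the function reduces to ordinary least squares, and increasing $\lambda$ reduces the norm of the coefficient vector.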
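Let us also introduce the singular value decomposition (SVD) of the centered input matrix $X$, which will be very useful in understanding ridge regression. For an $N \times p$ matrix $X$ ($N$ observations, $p$ predictors), the SVD has the form below, where $U$ ($N \times p$) and $V$ ($p \times p$) have orthonormal columns and $D$ is a $p \times p$ diagonal matrix of singular values $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$: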
$$
X = UDV^{T}
$$
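To illustrate why the SVD is useful here, the sketch below computes the thin SVD with `np.linalg.svd` and checks that the ridge fitted values equal $U \, \mathrm{diag}\left(\frac{d_j^2}{d_j^2 + \lambda}\right) U^T y$, which follows from substituting $X = UDV^T$ into the ridge solution. The data and the `*_svd` names are synthetic and purely illustrative.

```python
import numpy as np

# Synthetic centered inputs and response (assumption: illustration only).
rng = np.random.default_rng(1)
X_svd = rng.normal(size=(40, 4))
X_svd -= X_svd.mean(axis=0)
y_svd = rng.normal(size=40)
lam_svd = 5.0

# Thin SVD: U is N x p, d holds the singular values (descending), Vt is V^T.
U, d, Vt = np.linalg.svd(X_svd, full_matrices=False)

# Ridge shrinks the coordinate of y along each u_j by d_j^2 / (d_j^2 + lambda).
shrink = d**2 / (d**2 + lam_svd)
fit_svd = U @ (shrink * (U.T @ y_svd))

# The same fitted values from the closed-form ridge coefficients.
beta_direct = np.linalg.solve(X_svd.T @ X_svd + lam_svd * np.eye(4), X_svd.T @ y_svd)
fit_direct = X_svd @ beta_direct
```

Directions with small singular values $d_j$ receive the most shrinkage, since their factors $d_j^2 / (d_j^2 + \lambda)$ are closest to zero.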

In [3]:
# Import libraries.
import pandas as pd
import numpy as np
import scipy.stats
import math
import seaborn as sns

# Read data.
data = pd.read_table('prostate.txt')
data.drop('Unnamed: 0', axis=1, inplace=True)
data.head(3)

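|   | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.579818 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.430783 | T |
| 1 | -0.994252 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 2 | -0.510826 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.162519 | T |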


In [4]:
# Grab train / test mask and target.
mask = data.pop('train')
y_ = data.pop('lpsa')


# Normalize predictors with zscores.
data = data.apply(scipy.stats.zscore)


# Select training data (copy X so the insert below modifies an independent frame).
y_train = y_[mask == 'T']
X_train = data[mask == 'T'].copy()


# Insert intercept column.
X_train.insert(0, 'Intercept', 1)

Before training our model, we need to discuss how to choose the penalty hyperparameter $\lambda$. A convenient way to do so is through the effective degrees of freedom (EDF) of the ridge fit, given by the function $df\left(\lambda\right)$, which is monotonically decreasing in $\lambda$:
$$
df\left(\lambda\right) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}
$$
where $d_j$ are the diagonal entries of the matrix $D$ from the SVD of $X$, that is, the singular values of $X$. The penalty ranges over $\lambda \in \left[0, \infty \right)$, and the EDFs correspondingly range over $df\left(\lambda \right) \in \left(0, p \right]$, where $p$ is the number of predictors in our dataset: $df(0) = p$ (no shrinkage, assuming all $d_j > 0$), and $df\left(\lambda\right) \to 0$ as $\lambda \to \infty$.
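Because $df\left(\lambda\right)$ is monotonically decreasing, one way to obtain the $\lambda$ that corresponds to a desired EDF is numerical root-finding. The sketch below assumes this approach, using `scipy.optimize.brentq`. The helper names `effective_df` and `lambda_for_df` and the singular values in `d_demo` are hypothetical and not taken from the prostate data.

```python
import numpy as np
from scipy.optimize import brentq

def effective_df(d, lam):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lam)."""
    return np.sum(d**2 / (d**2 + lam))

def lambda_for_df(d, target_df):
    """Find lam >= 0 such that effective_df(d, lam) equals target_df."""
    p = len(d)
    if not 0 < target_df <= p:
        raise ValueError("target_df must lie in (0, p]")
    if target_df == p:
        return 0.0  # no shrinkage
    # Upper bracket: at this lam the EDF is guaranteed to fall below target_df.
    hi = d.max() ** 2 * p / target_df
    return brentq(lambda lam: effective_df(d, lam) - target_df, 0.0, hi)

# Hypothetical singular values, for illustration only.
d_demo = np.array([10.0, 5.0, 2.0, 1.0])
lam_demo = lambda_for_df(d_demo, target_df=2.0)
```

Evaluating `lambda_for_df` on a grid of target values such as `0, 1, ..., p` would give a set of candidate penalties that are evenly spaced on the EDF scale rather than on the $\lambda$ scale.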