# 3 Ridge Regression

## VIF

Assume we are doing multivariate linear regression $y = X\beta +\epsilon$ where $X\in\mathbb R^{n\times (k+1)}$.

<br>

We have assumed that the data matrix $X$ has full column rank. When $X$ is rank deficient, or has very small singular values, some features are (nearly) linear combinations of the others. This is called multicollinearity. We should detect it to avoid mistakes.


To test for multicollinearity, we can take feature $j\ (1\leqslant j\leqslant k)$ and fit a linear model to column $x_{*j}$ using the remaining columns $[x_{*0},x_{*1},\dotsc,x_{*(j-1)},x_{*(j+1)},\dotsc,x_{*k}]$. If the model has a high $R^2$, then feature $x_{*j}$ is close to a linear combination of the other features.

Explicitly, we can define VIF (variance inflation factor) to be
$${\rm VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{1 - \frac{{\rm SSR}_j}{{\rm SST}_j}}=\frac{{\rm SST}_j}{{\rm SSE}_j}$$
where $R_j^2$ is the $R^2$ of the regression on $x_{*j}$ with the remaining features.

<br>

When there is no multicollinearity, $R_j^2 = 0$ and ${\rm VIF}_j = 1$. When there is strong multicollinearity, $R_j^2\rightarrow 1$ and ${\rm VIF}_j\rightarrow +\infty$.
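The definition above can be sketched directly in NumPy: regress each feature column on all the other columns and take ${\rm SST}_j/{\rm SSE}_j$. This is a minimal sketch, not a library routine. The function name `vif` and the synthetic data `X_demo` are illustrative assumptions, and column 0 of the input is assumed to be the all-ones intercept column.

```python
import numpy as np

def vif(X):
    """VIF of each non-intercept column of X (column 0 is assumed to be all ones)."""
    n, p = X.shape
    out = np.empty(p - 1)
    for j in range(1, p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)  # remaining columns, intercept included
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        sse = np.sum((target - others @ coef) ** 2)   # residual sum of squares
        sst = np.sum((target - target.mean()) ** 2)   # total sum of squares
        out[j - 1] = sst / sse
    return out

# Illustrative data: x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X_demo = np.column_stack([np.ones(n), x1, x2, x3])
vif_demo = vif(X_demo)
```

With the near-dependent column `x3` included, all three VIFs should be large. Dropping `x3` should bring the remaining two close to one.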



### Conditional Variance

**Theorem** Denote by $X_j\in\mathbb R^n$ the $(j+1)$-th column of $X$ (do not forget that the first column is full of ones). Then
$${\rm VIF}_j = \frac{{\rm Var}X_j}{{\rm Var}(X_j|X_0,X_1,\dotsc,X_{j-1},X_{j+1},\dotsc,X_k)}$$

**Proof** On the one hand, ${\rm Var}  X_j = \frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar x_{*j})^2=\frac{1}{n-1}{\rm SST}_j$. On the other,
$${\rm Var}(X_j|X_0,X_1,\dotsc,X_{j-1},X_{j+1},\dotsc,X_k)=
  {\rm Var}(X_j - \mathbb E(X_j|\dotsc)).$$
Recall that 

### Inverse Entry

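**Theorem** Denote by $X_j\in\mathbb R^n$ the $(j+1)$-th column of $X$ (do not forget that the first column is full of ones). Then ${\rm VIF}_j$ is ${\rm SST}_j$ times the $(j+1,j+1)$ entry of the matrix $(X^TX)^{-1}$. Equivalently, the $(j+1,j+1)$ entry of $\sigma^2(X^TX)^{-1}$, which is ${\rm Var}(\hat\beta_j)$, equals $\sigma^2\,{\rm VIF}_j/{\rm SST}_j$.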


**Proof** Let $q= [0,\dotsc,1,\dotsc,0]^T\in\mathbb R^{k+1}$, a vector with the $(j+1)$-th entry set to one. We can then extract the $(j+1)$-th column of $X$ by $Xq$. Regressing $Xq$ on the rest of the features amounts to solving
$$\min_{\beta}\left\{\Vert Xq - X\beta\Vert:\quad q^T\beta = 0\right\}.$$

If we denote by $Q\in\mathbb R^{(k+1)\times k}$ a matrix whose columns form an orthonormal basis of the orthogonal complement of $q$, and write $\beta = Qu$, then the solution is given by
$$\beta = Qu= Q(Q^TX^TXQ)^{-1}Q^TX^TXq.$$
Also, denote $e = [1,\dotsc,1]^T\in\mathbb R^{n}$ to be a vector full of ones. Then, $\frac 1n e^TXq$ is the average of $Xq$. Hence,
$${\rm VIF}_j =\frac{{\rm SST}_j}{{\rm SSE}_j} = \frac{\Vert Xq - \frac 1n ee^TXq\Vert^2}{\Vert Xq - X\beta\Vert^2}.$$


## Ridge Regression

Ridge regression with regularization factor $\lambda\ (\lambda \geqslant 0)$ estimates $\hat\beta$ by
$$\hat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty.$$
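As a minimal sketch of this formula, the estimator can be computed by solving the linear system $(X^TX+\lambda I)\beta = X^Ty$ rather than forming the inverse explicitly. The function name `ridge_fit` and the synthetic data are illustrative assumptions. As in the formula above, every coefficient is penalized, including the intercept.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data with an intercept column of ones.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=100)

beta_ols = ridge_fit(X, y, 0.0)      # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y, 10.0)   # lam > 0 shrinks the coefficient vector
```

Setting `lam = 0` recovers the least squares solution when $X$ has full column rank, and increasing `lam` reduces the norm of the estimate.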

### Regularization

**Theorem** The ridge regression estimator $\hat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty$ is the minimizer of the following:

$$\hat\beta_\lambda = \mathop{\rm argmin}\limits_{\beta}\left\{\Vert y - X\beta\Vert^2 +\lambda \Vert \beta\Vert^2\right\}.$$

**Proof**
$$\begin{aligned}
\Vert y - X\beta\Vert^2 +\lambda \Vert \beta\Vert^2
=\beta^T(X^TX+\lambda I)\beta - 2y^TX\beta + y^Ty.
\end{aligned}$$

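This is a quadratic function of $\beta$ with leading matrix $a = X^TX+\lambda I$ (positive definite when $\lambda > 0$) and linear coefficient $b = -2X^Ty$, so its minimum is reached at 
$$-\frac{b}{2a} = -\frac 12\left(X^TX+\lambda I\right)^{-1}\left(-2X^Ty\right) = \hat\beta_\lambda. $$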
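The minimizer property can be checked numerically. The sketch below, on made-up data, verifies that the gradient $2(X^TX+\lambda I)\beta - 2X^Ty$ vanishes at $\hat\beta_\lambda$ and that random perturbations of $\hat\beta_\lambda$ only increase the penalized objective. The helper name `ridge_objective` is an illustrative assumption.

```python
import numpy as np

def ridge_objective(X, y, beta, lam):
    """Penalized least squares loss ||y - X beta||^2 + lam ||beta||^2."""
    r = y - X @ beta
    return r @ r + lam * (beta @ beta)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 2.0

A = X.T @ X + lam * np.eye(4)
beta_hat = np.linalg.solve(A, X.T @ y)

# Gradient of the objective evaluated at the closed-form solution.
grad = 2 * A @ beta_hat - 2 * X.T @ y

# Objective at the solution versus at 20 randomly perturbed points.
base = ridge_objective(X, y, beta_hat, lam)
worse = [
    ridge_objective(X, y, beta_hat + 0.1 * rng.normal(size=4), lam) > base
    for _ in range(20)
]
```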

### Singular Value Shrinkage

Apply the (thin) singular value decomposition to $X$, say $X = UDV^T$ where $U\in\mathbb R^{n\times (k+1)}$ has orthonormal columns, $V\in\mathbb R^{(k+1)\times (k+1)}$ is orthogonal, and $D\in\mathbb R^{(k+1)\times (k+1)}$ is diagonal with positive entries.

Observing that $(VD^2V^T+\lambda I)V(D^2+\lambda I)^{-1}V^T = V(D^2+\lambda I)V^TV(D^2+\lambda I)^{-1}V^T = I$, we can see that 

$$\hat\beta_\lambda = (VD^2V^T+\lambda I)^{-1}VDU^Ty = V(D^2+\lambda I)^{-1}D U^Ty.$$
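A short numerical sketch of the SVD form, on made-up data: the estimate computed from the singular values agrees with the direct solve. Multiplying the formula by $X = UDV^T$ gives fitted values $UD(D^2+\lambda I)^{-1}DU^Ty$, so the component of $y$ along the $i$-th left singular vector is scaled by $d_i^2/(d_i^2+\lambda)$, which is the shrinkage the heading refers to. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 3.0

# Thin SVD: X = U @ diag(d) @ Vt, with U of shape (30, 5) and Vt of shape (5, 5).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# beta = V (D^2 + lam I)^{-1} D U^T y, using elementwise operations on d.
beta_svd = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Shrinkage factor applied to the fitted values along each left singular vector.
shrink = d**2 / (d**2 + lam)
fitted_svd = U @ (shrink * (U.T @ y))
```

Directions with small singular values get factors close to zero, while directions with large singular values are left almost unchanged.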



### Statistics

Some statistical properties of the ridge regression estimator $\hat\beta_\lambda$ are listed below (the second form of the squared bias requires $\lambda > 0$):
$$\left\{\begin{aligned}& \Vert \mathbb E \hat\beta_\lambda - \beta\Vert^2 = \lambda^2\beta^T(X^TX+\lambda I)^{-2}\beta
=\beta^T\left(\frac 1\lambda X^TX+ I\right)^{-2}\beta
\\ & {\rm Cov}(\hat\beta_\lambda) =\sigma^2 (X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}
\preceq \sigma^2(X^TX+\lambda I)^{-1}
\end{aligned}\right.$$

**Proof** Note that 
$$(X^TX+\lambda I)^{-1}(X^TX+\lambda I) = I\quad \Rightarrow\quad (X^TX+\lambda I)^{-1}X^TX =I -\lambda (X^TX+\lambda I)^{-1}.$$

The squared bias $ \Vert \mathbb E\hat\beta_\lambda - \beta\Vert^2$ is given by
$$\begin{aligned}\left\Vert \mathbb E((X^TX+\lambda I)^{-1}X^Ty - \beta) \right\Vert^2 & = 
\left\Vert (X^TX+\lambda I)^{-1}X^TX\beta - \beta\right\Vert^2
\\ &= \left\Vert\left( (X^TX+\lambda I)^{-1}X^TX  - I\right)\beta\right\Vert^2
\\ &= \left\Vert -\lambda (X^TX+\lambda I)^{-1}\beta \right \Vert^2
\\ &= \lambda^2\beta^T(X^TX+\lambda I)^{-2}\beta
\end{aligned}
$$

The derivation of the covariance is rather trivial.
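For completeness, the covariance step can be spelled out as follows, assuming the usual homoscedastic noise model ${\rm Cov}(y) = {\rm Cov}(\epsilon) = \sigma^2 I$ (an assumption the notes use implicitly) and the rule ${\rm Cov}(Ay) = A\,{\rm Cov}(y)A^T$:

$$\begin{aligned}{\rm Cov}(\hat\beta_\lambda) &= (X^TX+\lambda I)^{-1}X^T\,{\rm Cov}(y)\,X(X^TX+\lambda I)^{-1}
\\ &= \sigma^2 (X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}
\\ &= \sigma^2\left[(X^TX+\lambda I)^{-1} - \lambda (X^TX+\lambda I)^{-2}\right]
\preceq \sigma^2 (X^TX+\lambda I)^{-1}.
\end{aligned}$$

The third line uses the identity $(X^TX+\lambda I)^{-1}X^TX = I - \lambda (X^TX+\lambda I)^{-1}$ from the start of the proof, and the inequality holds because $\lambda (X^TX+\lambda I)^{-2}$ is positive semidefinite for $\lambda\geqslant 0$.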


### Mean Squared Error

The mean squared error for $\hat\beta_\lambda$ to estimate $\beta$ is given by
$$\begin{aligned}{\rm MSE}(\hat\beta_\lambda) &=\Vert \mathbb E(\hat \beta_\lambda) - \beta\Vert^2 + \mathbb E(\Vert \hat\beta_\lambda - \mathbb E\hat\beta_\lambda \Vert^2)
\\ &=\Vert \mathbb E(\hat \beta_\lambda) - \beta\Vert^2 + \mathbb E{\rm tr}\left[\left(\hat\beta_\lambda - \mathbb E\hat\beta_\lambda\right) \left(\hat\beta_\lambda - \mathbb E\hat\beta_\lambda\right) ^T\right]
\\ &= \lambda^2\beta^T (X^TX+\lambda I)^{-2}\beta + {\rm tr}\left[\sigma^2 (X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}\right]
\\ &= \lambda^2\beta^T (X^TX+\lambda I)^{-2}\beta +\sigma^2 {\rm tr}\left[X^TX(X^TX+\lambda I)^{-2}\right]
\end{aligned}$$



#### Orthogonal Case

We first study a special case, when $X^TX = I$. In this case, 
$${\rm MSE} = \frac{\lambda^2\beta^T\beta}{(1+\lambda)^2} + \frac{\sigma^2(k+1)}{(1+\lambda)^2}
\quad{\rm and}\quad \frac{\partial\,{\rm MSE}}{\partial \lambda}=\frac{2\beta^T\beta \lambda}{(1+\lambda)^3}-\frac{2\sigma^2(k+1)}{(1+\lambda)^3}.$$

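Therefore, $\lambda_*= \dfrac{\sigma^2(k+1)}{\beta^T\beta}$ minimizes the MSE: the derivative is negative for $\lambda<\lambda_*$ and positive for $\lambda>\lambda_*$, so the MSE at $\lambda_*$ is strictly smaller than that of ordinary least squares ($\lambda = 0$). In general, however, a closed-form expression for the best $\lambda_*$ seems hard to obtain.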
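The orthogonal case can be checked numerically. The sketch below evaluates the general MSE formula on a design with orthonormal columns and compares a grid search over $\lambda$ with the closed-form $\lambda_*$. The true $\beta$ and $\sigma^2$ are made-up values chosen for illustration (in practice they are unknown), and `ridge_mse` is a hypothetical helper name.

```python
import numpy as np

def ridge_mse(X, beta, sigma2, lam):
    """Squared bias plus total variance of the ridge estimator."""
    p = X.shape[1]
    G = X.T @ X
    A_inv = np.linalg.inv(G + lam * np.eye(p))
    bias2 = lam**2 * (beta @ A_inv @ A_inv @ beta)
    var = sigma2 * np.trace(G @ A_inv @ A_inv)
    return bias2 + var

rng = np.random.default_rng(3)
# QR factorization gives a matrix with orthonormal columns, so Q^T Q = I.
Q, _ = np.linalg.qr(rng.normal(size=(40, 4)))
beta = np.array([1.0, -2.0, 0.5, 3.0])   # illustrative true coefficients
sigma2 = 1.5                             # illustrative noise variance

# Closed-form optimum; here the number of columns is k + 1 = 4.
lam_star = sigma2 * Q.shape[1] / (beta @ beta)

# Grid search over lambda for comparison.
grid = np.linspace(0.0, 2.0, 2001)
mse = np.array([ridge_mse(Q, beta, sigma2, lam) for lam in grid])
lam_grid = grid[np.argmin(mse)]
```

The grid minimizer should agree with `lam_star` up to the grid spacing, and the MSE at `lam_star` should be below the value at `lam = 0`.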

### Leave-One-Out-Cross-Validation (LOOCV)

Leave-one-out cross-validation (LOOCV) can help select the hyperparameter $\lambda$ in ridge regression. The procedure is as follows.

For each data point $x_i\ (i=1,\dotsc,n)$, i.e. the $i$-th row of $X\in\mathbb R^{n\times (k+1)}$ written as a column vector, remove it from $X$ to obtain $X_{(i)}\in\mathbb R^{(n-1)\times (k+1)}$, and remove $y_i$ from $y$ to obtain $y_{(i)}$. Fit a ridge regression model $\hat\beta_{\lambda, (i)}$ on $(X_{(i)}, y_{(i)})$ and validate it on the held-out point: ${\rm CV}_i = \Vert y_i - x_i^T\hat\beta_{\lambda,(i)}\Vert^2$. Repeating this for $i = 1,\dotsc,n$ gives losses ${\rm CV}_1,\dotsc,{\rm CV}_n$, whose average is ${\rm CV} = \frac 1n \sum_{i=1}^n {\rm CV}_i$. If we denote by $\hat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty$ the ridge regression fitted on all data, then the average loss can be computed without refitting:
$${\rm CV} = \frac 1n \sum_{i=1}^n {\rm CV}_i = \frac 1n \sum_{i=1}^n \left(\frac{y_i -\hat y_i}{1 - h_{ii}} \right)^2$$
where $\hat y = X\hat\beta_\lambda$ is the prediction with the full model and $h_{ii}$ is the $(i,i)$-th entry of $X(X^TX+\lambda I)^{-1}X^T$.

We select the $\lambda$ with the smallest loss ${\rm CV}$.

**Proof** Since $X_{(i)}^TX_{(i)} = X^TX - x_ix_i^T$, the Sherman-Morrison-Woodbury formula gives
$$\begin{aligned}(X_{(i)}^TX_{(i)} + \lambda I)^{-1}  & = (X^TX +\lambda I - x_ix_i^T)^{-1} = 
(X^TX+\lambda I)^{-1} +\frac{(X^TX+\lambda I)^{-1}x_ix_i^T(X^TX+\lambda I)^{-1}}{1-x_i^T(X^TX+\lambda I)^{-1}x_i}
\\ &= (X^TX+\lambda I)^{-1} +\frac{(X^TX+\lambda I)^{-1}x_ix_i^T(X^TX+\lambda I)^{-1}}{1-h_{ii}}.\end{aligned}
$$

Thus we can show that each $\hat\beta_{\lambda,(i)}$ is given by 
$$\begin{aligned}\hat\beta_{\lambda,(i)} & = (X_{(i)}^TX_{(i)} + \lambda I)^{-1}X_{(i)}^Ty_{(i)}
= \left[(X^TX+\lambda I)^{-1} +\frac{(X^TX+\lambda I)^{-1}x_ix_i^T(X^TX+\lambda I)^{-1}}{1-h_{ii}}\right](X^Ty - x_iy_i)
\\ &= \hat\beta_{\lambda}-(X^TX+\lambda I)^{-1}x_iy_i + \frac{(X^TX+\lambda I)^{-1}x_i\left(x_i^T\hat\beta_\lambda - h_{ii}y_i\right)}{1 - h_{ii}}
\\ &= \hat\beta_{\lambda}+ \frac{(X^TX+\lambda I)^{-1}x_i\left(\hat y_i-y_i\right)}{1 - h_{ii}}.
\end{aligned}
$$

Therefore,
$$\begin{aligned}{\rm CV} &= \frac 1n \sum_{i=1}^n \Vert y_i - x_i^T\hat\beta_{\lambda,(i)}\Vert^2=\frac 1n\sum_{i=1}^n \left\Vert y_i - x_i^T\left(\hat\beta_\lambda + \frac{(X^TX+\lambda I)^{-1}x_i(\hat y_i - y_i)}{1 - h_{ii}}\right)\right\Vert^2
\\ &= \frac 1n\sum_{i=1}^n \left\Vert y_i - \hat y_i -\frac{h_{ii}(\hat y_i - y_i)}{1 - h_{ii}}\right\Vert^2
= \frac 1n \sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - h_{ii}}\right)^2\end{aligned}
$$
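The identity just proved can be sanity-checked numerically. The sketch below compares the shortcut based on $h_{ii}$ with a brute-force loop that refits the model $n$ times, on made-up data. The function names `loocv_shortcut` and `loocv_brute_force` and the candidate grid `lams` are illustrative assumptions.

```python
import numpy as np

def loocv_shortcut(X, y, lam):
    """LOOCV loss from a single fit, using the diagonal of X (X^T X + lam I)^{-1} X^T."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def loocv_brute_force(X, y, lam):
    """LOOCV loss obtained by refitting the ridge model n times."""
    n, p = X.shape
    losses = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask dropping row i
        Xi, yi = X[keep], y[keep]
        beta_i = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        losses[i] = (y[i] - X[i] @ beta_i) ** 2
    return losses.mean()

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=60)

lams = [0.01, 0.1, 1.0, 10.0]             # illustrative candidate values
cv_fast = [loocv_shortcut(X, y, lam) for lam in lams]
cv_slow = [loocv_brute_force(X, y, lam) for lam in lams]
best_lam = lams[int(np.argmin(cv_fast))]
```

The two lists should agree up to floating point error, while the shortcut needs only one linear solve per value of $\lambda$ instead of $n$.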

### K-Fold Cross Validation 

K-fold cross validation is a generalization of leave-one-out-cross-validation (LOOCV). It splits the rows of the data matrix into $K$ parts. For each part $i$, we remove it from $X$ to obtain $X_{(i)}$, fit a ridge regression $\hat\beta_{\lambda,(i)}$ on the remaining rows, and validate it on part $i$. The validation losses are then averaged over $i = 1,\dotsc,K$.

When $K = n$, K-fold cross validation is exactly LOOCV.
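A minimal NumPy sketch of this procedure follows. It assumes the rows are shuffled before splitting and that each fold's loss is its mean squared validation error. The function name `kfold_cv_ridge`, the `seed` parameter, and the synthetic data are illustrative choices, not part of the notes.

```python
import numpy as np

def kfold_cv_ridge(X, y, lam, K, seed=0):
    """Average validation loss of ridge regression over K folds."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)        # K nearly equal groups of row indices
    fold_losses = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)
        Xt, yt = X[train], y[train]
        beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
        fold_losses.append(np.mean((y[held_out] - X[held_out] @ beta) ** 2))
    return float(np.mean(fold_losses))

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=40)

cv5 = kfold_cv_ridge(X, y, lam=1.0, K=5)
cv_n = kfold_cv_ridge(X, y, lam=1.0, K=40)   # K = n: every fold is a single row

# LOOCV loss from the hat-matrix shortcut, for comparison with K = n.
H = X @ np.linalg.solve(X.T @ X + 1.0 * np.eye(3), X.T)
loocv = np.mean(((y - H @ y) / (1.0 - np.diag(H))) ** 2)
```

With `K` equal to the number of rows, `cv_n` should match `loocv`, which illustrates the remark that K-fold reduces to LOOCV.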