# Regularization Methods

Ridge and lasso regression are two forms of regularized regression. These methods seek to alleviate the consequences of multicollinearity.

## Ridge

Ridge regression (RR) is motivated by a constrained minimization problem, which can be formulated as 

$$\min \sum_{i=1}^n (y_i - X_i^T \beta)^2$$

$$\hbox{s.t.} \quad \sum_{j=1}^p \beta_j^2 \leq t \quad \hbox{for} \quad t \geq 0 $$

The feasible set for this problem is constrained to be 

$$s(t) = \{ \underset{\sim}{\beta} \in \mathcal{R}^p : ||\underset{\sim}{\beta}||_2^2 \leq t \}$$

where $\underset{\sim}{\beta}$ does not include the intercept $\beta_0$.

<u>Notes:</u>

The RR estimates are not equivariant under rescaling of the predictors $X_j$ because of the L2 penalty. This difficulty is circumvented by standardizing the predictors (centering them and scaling them to a common norm). In this chapter, the design matrix $X$ will be the centered matrix. In addition, we exclude the intercept $\beta_0$. The use of an L2 penalty in a least squares problem is sometimes referred to as Tikhonov regularization. Using a Lagrange multiplier, the above constrained minimization problem is equivalent to 

$$\min \sum_{i=1}^n (y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 \quad \hbox{for} \quad \lambda \geq 0$$

There is a one-to-one correspondence between $t$ in the first minimization problem and $\lambda$ in this new formulation. A constrained optimization problem is said to be a convex optimization problem if both the objective function and the constraints are convex functions. Our constrained optimization problem is a convex minimization problem in $\beta$. 

### Analytical minimization
The PRSS (penalized residual sum of squares) for RR is 

$$\begin{align*}
PRSS(\beta, \lambda) &= \sum_{i=1}^n (y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 \\
&= (y - X \beta)^T (y - X \beta) + \lambda \beta^T \beta
\end{align*}$$


$$\begin{align*}
\frac{\partial PRSS}{\partial \beta} &= 2 (X^T X) \beta - 2 X^T y + 2 \lambda \beta = 0\\
&\Longleftrightarrow 2(X^T X) \beta + 2 \lambda \beta = 2 X^T y\\
&\Longleftrightarrow (X^T X + \lambda I) \beta = X^T y\\
\end{align*}$$

The ridge estimator is $\hat{\beta}_R = (X^T X + \lambda I)^{-1} X^T y$.

Since we are adding a positive constant to the diagonal of $X^TX$, the matrix $X^T X + \lambda I$ is invertible for any $\lambda > 0$, even if $X^T X$ is singular. Historically, this was the main motivation behind the adoption of this particular extension of OLS theory. The formula also shows that $\hat{\beta}_R$ is still a linear function of the observed values, $y$. Finally, $\hat{\beta}_R$ is related to the classical OLS estimator by $\hat{\beta}_{R} = [I + \lambda (X^T X)^{-1}]^{-1} \hat{\beta}_{OLS}$, assuming that $X^T X$ is non-singular.

<u>Proof:</u>


$$\begin{align*}
\hat{\beta}_R &= [I + \lambda (X^T X)^{-1}]^{-1} (X^T X)^{-1} X^T y\\
&\hbox{\textcolor{green}{using the fact that $(AB)^{-1} = B^{-1}A^{-1}$}}\\
&=(X^T X [ I + \lambda (X^T X)^{-1}])^{-1} X^T y\\
&=(X^T X + \lambda I)^{-1} X^T y\\
\end{align*}$$

This shows that the ridge estimator is simply a downweighted version of the OLS estimator.
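The closed form and its relation to OLS can be checked numerically. The following is a minimal NumPy sketch on simulated data; the function name `ridge_closed_form`, the sample sizes, and the value of `lam` are illustrative choices, not part of the derivation above.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X^T X + lam I)^{-1} X^T y, computed via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)            # centered design matrix, as assumed in the text
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()                  # centering y removes the intercept

beta_ridge = ridge_closed_form(X, y, lam)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Relation to OLS: beta_R = [I + lam (X^T X)^{-1}]^{-1} beta_OLS
W = np.linalg.inv(np.eye(p) + lam * np.linalg.inv(X.T @ X))
beta_via_ols = W @ beta_ols

print(np.allclose(beta_ridge, beta_via_ols))
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience; both routes implement the same formula.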

### Bias and variance

Ridge estimation produces a biased estimator of the true parameter $\beta$. We know that $E[Y|X] = X\beta$.

$$\begin{align*}
E [ \hat{\beta}_R | X] &= [X^T X + \lambda I]^{-1} X^T X \beta\\
&= (X^T X + \lambda I)^{-1} (X^T X + \lambda I - \lambda I) \beta \\
&= (I - \lambda (X^T X + \lambda I)^{-1}) \beta\\
&= \beta - \lambda ( X^T X + \lambda I)^{-1} \beta\\
\end{align*}$$

The bias of $\hat{\beta}_R$ is $-\lambda (X^T X + \lambda I)^{-1} \beta$, which vanishes at $\lambda = 0$ and grows in magnitude as $\lambda$ increases. Even though the vector of ridge estimators incurs a greater bias, it possesses a smaller variance than the vector of OLS estimators. One may compare the two by taking the trace of the variance matrices of the two methods. The solution $\hat{\beta}_R$ is indexed by $\lambda$, i.e. for each $\lambda$, we have a solution.
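As a sketch of the trace comparison, assume additionally that $\hbox{Var}(y | X) = \sigma^2 I$ and that $X$ has full column rank with singular values $d_1, ..., d_p > 0$ (both $\sigma^2$ and the $d_j$ are notation introduced here for illustration). Then

$$\begin{align*}
\hbox{Var}(\hat{\beta}_R | X) &= \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}\\
\hbox{tr} \, \hbox{Var}(\hat{\beta}_R | X) &= \sigma^2 \sum_{j=1}^p \frac{d_j^2}{(d_j^2 + \lambda)^2} \leq \sigma^2 \sum_{j=1}^p \frac{1}{d_j^2} = \hbox{tr} \, \hbox{Var}(\hat{\beta}_{OLS} | X)\\
\end{align*}$$

with strict inequality whenever $\lambda > 0$, since $d_j^4 < (d_j^2 + \lambda)^2$.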

$\lambda$ is the shrinkage parameter:

* $\lambda$ controls the size of the coefficients

* $\lambda$ controls the amount of regularization

* as $\lambda$ goes to 0, we obtain the LS solution

* as $\lambda$ goes to $\infty$, $\hat{\beta}_R \xrightarrow{} 0$ (intercept-only model)
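The role of $\lambda$ listed above can be illustrated numerically. This is a small sketch on simulated data (the data, the helper `ridge`, and the grid of `lambdas` are assumptions made for the example): the norm of the ridge solution decreases as $\lambda$ grows, starting from the LS solution at $\lambda = 0$ and approaching 0 for very large $\lambda$.

```python
import numpy as np

# Simulated, centered data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()

def ridge(X, y, lam):
    """Ridge solution for a given shrinkage parameter lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lambdas]

for lam, nrm in zip(lambdas, norms):
    print(f"lambda = {lam:>9g}   ||beta_R||_2 = {nrm:.4f}")

# At lambda = 0 the ridge solution is the least-squares solution.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(ridge(X, y, 0.0), beta_ls))
```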

### Data augmentation solution

The L2-PRSS is

$$\begin{align*}
PRSS(\beta, \lambda) &= \sum_{i=1}^n (y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2\\
&= \sum_{i=1}^n (y_i - X_i^T \beta)^2 + \sum_{j=1}^p (0 - \sqrt{\lambda} \beta_j)^2\\
\end{align*}$$

***

So, the L2 criterion can be recast as an ordinary least-squares (LS) problem for an augmented data set.

$$\underset{\sim}{X_\lambda} = \begin{pmatrix}
X\\
\sqrt{\lambda} I_p\\
\end{pmatrix} \qquad \underset{\sim}{y_\lambda} = \begin{pmatrix}
y\\
0\\
\end{pmatrix}$$

i.e.

$$\underset{\sim}{X_\lambda} = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{np} \\
\sqrt{\lambda} & 0 & \dots & 0 \\
0 & \sqrt{\lambda} & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sqrt{\lambda} \\
\end{pmatrix}$$

$$\underset{\sim}{y_\lambda} = \begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_n\\
0\\
\vdots\\
0\\
\end{pmatrix}$$

So, the LS solution for the augmented data set is 

$$(X_\lambda^T X_\lambda)^{-1} X_\lambda^T y_\lambda = \begin{bmatrix}
\begin{pmatrix} X^T & \sqrt{\lambda} I_p \\ \end{pmatrix} \begin{pmatrix} X \\ \sqrt{\lambda}I_p \end{pmatrix}\\
\end{bmatrix}^{-1} \begin{pmatrix}
X^T & \sqrt{\lambda} I_p\\
\end{pmatrix} \begin{bmatrix}
y \\
0 \\
\end{bmatrix} = (X^T X + \lambda I_p)^{-1} X^T y = \hat{\beta}_R$$
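The augmentation identity is easy to verify numerically. Below is a short NumPy sketch on simulated data (the data and the value of `lam` are arbitrary choices for illustration): ordinary least squares on the augmented data should reproduce the ridge estimate.

```python
import numpy as np

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(2)
n, p, lam = 30, 3, 5.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Augmented data: stack sqrt(lam) * I_p under X and p zeros under y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Ordinary least squares on the augmented data set.
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Direct ridge solution for comparison.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_aug, beta_ridge))
```

One practical consequence is that any least-squares routine can be reused to fit ridge regression, at the cost of $p$ extra rows.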

## Lasso

Tibshirani introduced the lasso (least absolute shrinkage and selection operator) in JRSS, 1996. The lasso, in contrast to RR, tries to produce a sparse solution in the sense that several of the slope parameters will be set exactly to 0. With the lasso, only a subset of the variables is included in the final model.

### Constrained optimization

Lasso is formulated with respect to the centered matrix $X$. The L1 penalty is applied solely to the slope coefficients, so the intercept $\beta_0$ is excluded from the penalty term. Lasso can be expressed as a constrained minimization problem

$$\min \sum_{i=1}^n (y_i - X_i^T \beta)^2$$

$$\hbox{s.t.} \quad \sum_{j=1}^p |\beta_j| \leq t \quad \hbox{for} \quad t \geq 0$$

Using a Lagrange multiplier, the Lagrangian is 

$$\sum_{i=1}^n (y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|$$

where $\lambda \geq 0$ and there is a one-to-one correspondence between $t$ and $\lambda$. Unlike RR, lasso does not admit a closed-form solution in general. The L1 penalty makes the solution non-linear in the $y_i$. Again, we have a tuning parameter $\lambda$ that controls the amount of regularization, and the solutions indexed by $\lambda$ trace out a "path". If $\lambda = 0$, there is no shrinkage and hence we obtain the LS solution.

Often, we believe that many of the $\beta_j$ should be 0. So, we seek a sparse solution: a large enough $\lambda$ will set some coefficients exactly to 0. In this sense, lasso performs model selection for us.

The original implementation involves quadratic programming techniques from convex optimization. <u>Efron et al (Annals of Statistics, 2004)</u> proposed LARS (least angle regression), which computes the lasso path efficiently. An interesting modification is called forward stagewise; in many cases, it gives the same result as lasso.

Lasso is often used in high-dimensional problems. How can we efficiently compute the lasso solution? The lasso objective $(y - X\beta)^T (y - X \beta) + \lambda ||\beta||_1$ is <u>not</u> differentiable everywhere on $\mathcal{R}^p$. Many strategies exist for minimizing the lasso objective function:

* LARS

* Coordinate descent

### Coordinate descent optimization

Minimize a function $f : \mathcal{R}^n \xrightarrow{} \mathcal{R}$ of the form $f(X) = g(X) + \sum_i h_i (X_i)$ (with $g$ convex and differentiable, and each $h_i$ convex).

**Strategy**: minimize over each coordinate separately while cycling through the coordinates. Start with an initial guess $\underset{\sim}{X}^{(0)}$ and repeat for $k=0,1,2,...$

$$\underset{\sim}{X}^{(0)} = (X_1^{(0)}, X_2^{(0)}, X_3^{(0)}, ..., X_n^{(0)})$$

$$\begin{align*}
X_1^{(k+1)} &= \arg\min_{X_1} f(X_1, X_2^{(k)}, X_3^{(k)}, ..., X_n^{(k)}) \\
X_2^{(k+1)} &= \arg\min_{X_2} f(X_1^{(k+1)}, X_2, X_3^{(k)}, ..., X_n^{(k)}) \\
X_3^{(k+1)} &= \arg\min_{X_3} f(X_1^{(k+1)}, X_2^{(k+1)}, X_3, ..., X_n^{(k)}) \\
\vdots &\\
X_n^{(k+1)} &= \arg\min_{X_n} f(X_1^{(k+1)}, X_2^{(k+1)}, X_3^{(k+1)}, ..., X_{n-1}^{(k+1)}, X_n) \\
\end{align*}$$

<u>Note:</u> After we solve for $X_i^{(k+1)}$, we use its new value from then on.

This technique was neglected in the past but has gained popularity recently. **Does this procedure always converge to an extreme point of the objective function in general? No!**
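A small counterexample makes the failure concrete. The sketch below uses a toy convex function chosen for illustration (it does not come from these notes): $f(x_1, x_2) = |x_1 - x_2| + \frac{1}{2}(x_1^2 + x_2^2)$, whose non-smooth term is not separable across coordinates. Each one-dimensional minimization is done by a simple grid search, which is an assumption made to keep the example dependency-free.

```python
import numpy as np

def f(x1, x2):
    # Convex, but the non-smooth term |x1 - x2| is NOT separable across coordinates.
    return np.abs(x1 - x2) + 0.5 * (x1**2 + x2**2)

grid = np.linspace(-3.0, 3.0, 6001)   # candidate values for the 1-D grid search

def coordinate_descent(x1, x2, n_cycles=20):
    """Cyclic coordinate descent with a grid search for each 1-D subproblem."""
    for _ in range(n_cycles):
        x1 = grid[np.argmin(f(grid, x2))]   # minimize over x1 with x2 fixed
        x2 = grid[np.argmin(f(x1, grid))]   # minimize over x2 with the new x1
    return x1, x2

x1, x2 = coordinate_descent(1.0, 1.0)
print(x1, x2, f(x1, x2))   # the iterates do not leave the neighbourhood of (1, 1)
print(f(0.0, 0.0))         # value at the origin, the global minimizer
```

Starting from $(1, 1)$, neither coordinate move can decrease $f$, so the iterates stay put even though the origin has a strictly smaller objective value.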

Does coordinate descent work for lasso? Yes! We exploit the fact that the non-differentiable part of the objective is separable. 

Let $\underset{\sim}{X} = (X_1, X_2, ..., X_p)^T$

**Theorem (See Tseng, 2001)**: 

Suppose $f(\underset{\sim}{X}) = f_o(\underset{\sim}{X}) + \sum_{i=1}^p f_i(X_i) \qquad (\underset{\sim}{X} \in \mathcal{R}^p)$

i. $f_o : \mathcal{R}^p \xrightarrow{} \mathcal{R}$ is convex and continuously differentiable

ii. $f_i : \mathcal{R} \xrightarrow{} \mathcal{R}$ is convex $(i=1, ..., p)$

iii. The level set $\mathcal{X}^{(0)} = \{ \underset{\sim}{X} \in \mathcal{R}^p : f(\underset{\sim}{X}) \leq f(X^{(0)}) \}$ is compact

iv. $f$ is continuous on $\mathcal{X}^{(0)}$

Then every limit point of the sequence $(X^{(k)})_{k\geq 1}$ generated by cyclic coordinate descent is a global minimizer of $f$. In other words, the work of <u>Tseng (2001)</u> proves that for such $f$, any limit point $X^\star$ of $X^{(k)}, k=1,2,...$ is a minimizer of $f$. Since the iterates stay in a compact set, $X^{(k)}$ has subsequences converging to such an $X^\star$ (Bolzano-Weierstrass), and $f(X^{(k)})$ converges to $f^\star = f(X^\star)$.

<u>Remark:</u>

The order in which we cycle through the coordinates is arbitrary: we can use any permutation of $\{1, 2, ...,n \}$. We can also replace individual coordinates with blocks of coordinates everywhere. <u>Friedman et al (2007)</u> suggested using the coordinate descent algorithm to solve the lasso problem (a better algorithm than LARS).

<u>Digression: Subdifferential calculus</u>

Suppose $f$ is convex and differentiable. Then $f(y) \geq f(x) + \nabla f(x)^T (y - x)$ for all $x, y$.

Definition 1: Let $f: \mathcal{R}^n \xrightarrow{} \mathcal{R}$ be a continuous function (not necessarily differentiable). A vector $g \in \mathcal{R}^n$ is a subgradient of $f$ at $X \in \mathcal{R}^n$ iff $f(y) - f(X) \geq g^T (y - X)$ for all $y \in \mathcal{R}^n$. The set of all subgradients at $X$, $\partial f(X) = \{ g \in \mathcal{R}^n | g \hbox{ is a subgradient of f at X} \}$, is called the subdifferential of $f$ at $X$. The reason why subgradients are so useful in convex optimization is that they always exist for convex functions. 

<u>Proposition 1</u>: Let $f: \mathcal{R}^n \xrightarrow{} \mathcal{R}$ be a convex continuous function, then $\partial f(X) \neq \emptyset$ for all $X \in \mathcal{R}^n$.

<u>Proposition 2</u>: Let $f: \mathcal{R}^n \xrightarrow{} \mathcal{R}$ be a continuous function, then $\partial f(x)$ is a closed and convex set.


In cases where $f$ is differentiable, the relationship between subgradients and the gradient is given by the following proposition:

<u>Proposition 3</u>: Let $f: \mathcal{R}^n \xrightarrow{} \mathcal{R}$ be a convex continuous function differentiable at $X \in \mathcal{R}^n$, then $$\partial f(x) = \{ \nabla f(x) \}$$

<u>Proposition 4</u>: Let $f: \mathcal{R}^n \xrightarrow{} \mathcal{R}$ be a continuous function and assume that for some $X\in \mathcal{R}^n$, $\partial f(X) = \{ g \}$. Then $f$ is differentiable at $X$ and $\nabla f(X) = g$.

**Example 1:** Let $f: \mathcal{R} \xrightarrow{} \mathcal{R}$ with $f(x) = |x|$, $x \in \mathcal{R}$. The only point at which $f$ is not differentiable is $x=0$. At this point, subderivatives are characterized by the inequality $f(x) - f(0) \geq gx$ for all $x$ (i.e. $|x| \geq gx$). This simplifies to $g \in [-1, 1]$. For $x \neq 0$, the subderivative coincides with the derivative. In summary, 

$$\partial f(x) = \begin{cases}
1 \quad \hbox{if} \quad x > 0 \\
[-1,1] \quad \hbox{if} \quad x = 0 \\
-1 \quad \hbox{if} \quad x < 0 \\
\end{cases}$$

Just as stationary points characterize optimal points of differentiable convex functions, we have the following proposition.

<u>Proposition 5</u>: Let $f$ be a convex continuous function. Then $X^\star \in \mathcal{R}^n$ is a global minimizer of $f$ iff $0 \in \partial f(X^\star)$.

### Solution of Lasso using coordinate descent algorithm

We want to minimize 

$$f(\beta) = \frac{1}{2} \sum_{i=1}^n (y_i - \sum_{j=1}^p X_{ij} \beta_j )^2 + \lambda \sum_{j=1}^p |\beta_j|$$

First assume that we have 1 predictor ($p=1$). So, $f(\beta) = \frac{1}{2} \sum_{i=1}^n (y_i - \beta X_i)^2 + \lambda |\beta|$. For $\beta > 0$, $f$ is differentiable and its only subgradient at $\beta$ is $\sum (-X_i) (y_i - \beta X_i) + \lambda$. Equivalently, $-\sum X_i y_i + \beta \sum X_i^2 + \lambda$, which equals $-\sum X_i y_i + \beta + \lambda$ when the predictor is scaled so that $\sum X_i^2 = 1$.

Use a similar argument for $\beta < 0$; at $\beta = 0$ the subdifferential is an interval. So, the subgradients of $f$ at $\beta$ are

$$\begin{cases}
-\sum X_i y_i + \beta + \lambda \quad \hbox{if} \quad \beta > 0\\
-\sum X_i y_i + \beta - \lambda \quad \hbox{if} \quad \beta < 0\\
[-\sum X_i y_i - \lambda, \; -\sum X_i y_i + \lambda] \quad \hbox{if} \quad \beta = 0\\
\end{cases}$$

Next if $\beta > 0$, setting the subgradient of $f$ at $\beta$ to 0 yields

$$-\sum X_i y_i + \beta + \lambda = 0 \Longleftrightarrow \beta = \sum X_i y_i - \lambda $$

For $\beta < 0$

$$-\sum X_i y_i + \beta - \lambda = 0 \Longleftrightarrow \beta = \sum X_i y_i + \lambda $$

If $\beta = 0$, the condition $0 \in \partial f(0)$ gives

$$- \lambda \leq \sum X_i y_i \leq \lambda $$

Hence, with $\hat{\beta}_{OLS} = \sum X_i y_i$ (assuming $\sum X_i^2 = 1$), the Lasso solution for $p=1$ is 

$$\hat{\beta}_L = \begin{cases}
\hat{\beta}_{OLS} - \lambda \quad \hbox{if} \quad \hat{\beta}_{OLS} > \lambda\\
\hat{\beta}_{OLS} + \lambda \quad \hbox{if} \quad \hat{\beta}_{OLS} < -\lambda\\
0 \quad \hbox{if} \quad \hat{\beta}_{OLS} \in [-\lambda, \lambda]\\
\end{cases}$$

that is, $\hat{\beta}_L = S(\hat{\beta}_{OLS}, \lambda)$, where $S(z, \lambda) = \hbox{sign}(z) (|z| - \lambda)_+$ is the soft-thresholding operator. In the general case, we can rewrite $f(\beta)$ as
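The one-predictor solution is the soft-thresholding operator, $S(z, \lambda) = \hbox{sign}(z)(|z| - \lambda)_+$, applied to the OLS slope. Here is a minimal NumPy sketch that implements it and compares it with a brute-force grid minimization of the $p = 1$ objective. The simulated data, the value of `lam`, and the grid are assumptions made for this example, and the predictor is scaled so that $\sum X_i^2 = 1$.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Simulated single predictor (hypothetical, for illustration only).
rng = np.random.default_rng(3)
x = rng.normal(size=20)
x = x / np.linalg.norm(x)            # scale so that sum(x_i^2) = 1
y = 1.5 * x + 0.3 * rng.normal(size=20)
lam = 0.4

beta_ols = x @ y                     # OLS slope when sum(x_i^2) = 1
beta_lasso = soft_threshold(beta_ols, lam)

# Brute-force check: evaluate the p = 1 objective on a fine grid of beta values.
grid = np.linspace(-3.0, 3.0, 60001)
resid = y[:, None] - np.outer(x, grid)                 # shape (n, len(grid))
objective = 0.5 * (resid**2).sum(axis=0) + lam * np.abs(grid)
beta_grid = grid[np.argmin(objective)]

print(beta_ols, beta_lasso, beta_grid)
```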

$$f(\beta) = \frac{1}{2} \sum_{i=1}^n (y_i - \sum_{k \neq j} X_{ik} \hat{\beta}_k - X_{ij} \beta_j)^2 + \lambda \sum_{k\neq j} |\hat{\beta}_k| + \lambda |\beta_j | + ADDON$$

where $ADDON = \mu \sum_{k \neq j} \hat{\beta}_k^2 + \mu \beta_j^2$ is an optional L2 term that can be added on if wanted. Let $r_i = y_i - \sum_{k \neq j} X_{ik} \hat{\beta}_k$ be the partial residual. Leaving out the optional $ADDON$ term, the subgradients of the objective with respect to $\beta_j$ are 

$$\begin{cases}
-\sum X_{ij} r_i + \beta_j + \lambda \quad \hbox{if} \quad \beta_j > 0\\
-\sum X_{ij} r_i + \beta_j - \lambda \quad \hbox{if} \quad \beta_j < 0\\
[-\sum X_{ij} r_i - \lambda, \; -\sum X_{ij} r_i + \lambda] \quad \hbox{if} \quad \beta_j = 0\\
\end{cases}$$

as $\sum X_{ij}^2 = 1$. Setting the subgradients to 0, we get:

If $\beta_j < 0, \quad -\sum X_{ij} r_i + \beta_j - \lambda = 0 \Longleftrightarrow \beta_j = \sum X_{ij} r_i + \lambda < 0 \Rightarrow \sum X_{ij} r_i < - \lambda$.

If $\beta_j > 0, \quad \beta_j = \sum X_{ij} r_i - \lambda > 0 \Rightarrow \sum X_{ij} r_i > \lambda$.

If $\beta_j = 0, \quad -\lambda \leq \sum X_{ij} r_i \leq \lambda $

So, $$\hat{\beta}_j^L = \begin{cases}
\sum X_{ij} r_i + \lambda \quad \hbox{if} \quad \sum X_{ij} r_i < - \lambda \\
0 \quad \hbox{if} \quad \sum X_{ij} r_i \in [-\lambda, \lambda]\\
\sum X_{ij} r_i - \lambda \quad \hbox{if} \quad \sum X_{ij} r_i > \lambda \\
\end{cases}$$
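Cycling this coordinate update over $j = 1, ..., p$ gives a complete lasso solver. The following is a minimal NumPy sketch under the same assumption as above, namely that every column of $X$ has unit norm ($\sum_i X_{ij}^2 = 1$). The function names, the stopping rule (`n_cycles`, `tol`), and the simulated data are illustrative choices, not a reference implementation.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_cycles=200, tol=1e-10):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1.

    Assumes every column of X has unit Euclidean norm (sum_i X_ij^2 = 1).
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta                      # full residual y - X beta
    for _ in range(n_cycles):
        max_change = 0.0
        for j in range(p):
            # sum_i X_ij r_i, with r_i the partial residual that excludes predictor j
            rho = X[:, j] @ resid + beta[j]
            new_bj = soft_threshold(rho, lam)
            resid -= X[:, j] * (new_bj - beta[j])   # keep the full residual in sync
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:                  # stop once a full cycle changes nothing
            break
    return beta

# Simulated data with a sparse true coefficient vector (hypothetical).
rng = np.random.default_rng(4)
n, p = 100, 6
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)             # unit-norm columns
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
y = y - y.mean()

lam = 0.5
beta_hat = lasso_coordinate_descent(X, y, lam)
print(beta_hat)
```

A convenient correctness check is the subgradient optimality condition: with $g = X^T (y - X \hat{\beta})$, each nonzero $\hat{\beta}_j$ should satisfy $g_j = \lambda \, \hbox{sign}(\hat{\beta}_j)$ and each zero coefficient should satisfy $|g_j| \leq \lambda$.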

### Summary

Even though $X^T X$ may not be of full rank, both RR and lasso admit solutions. OLS breaks down when $p \gg n$, since $X^T X$ is then singular. Regularization tends to reduce prediction error. Assume that $p \gg n$. RR produces coefficient values for each of the $p$ variables. But, because of the L1 penalty, Lasso will set many of the coefficients exactly equal to 0. This means that Lasso produces sparse solutions, so Lasso takes care of model selection. <u>Zou and Hastie (2005)</u> propose the Elastic Net, whose penalty is a convex combination of the L1 and L2 penalties.
