## Linear Regression

We continue our discussion of linear regression in chapter 7.

### Review of last chapter

In practice, to get the estimated parameters $\hat{\beta}$, one solves the normal equations $XX^T \beta = XY^T$ directly instead of computing the inverse of $XX^T$. We say that a matrix is irregular (singular) if its inverse does not exist, and regular if it does. With real data, $XX^T$ is often nearly irregular. Possible reasons are:
+ the number of features is larger than the number of samples (p > N)
+ collinearity among the features in the observations
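
Both failure modes can be reproduced numerically. The following is a minimal sketch, assuming NumPy, synthetic Gaussian data, and the convention of these notes that $X$ has shape (p x N) and $Y \approx \beta^T X$; the helper name `ols_normal_equations` and all numeric values are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_normal_equations(X, Y):
    """Solve (X X^T) beta = X Y^T without forming an explicit inverse.

    X has shape (p, N) (features x samples); Y is a 1-D array of length N.
    """
    return np.linalg.solve(X @ X.T, X @ Y)

# Well-posed case: more samples than features, independent features.
p, N = 3, 50
X = rng.normal(size=(p, N))
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical true parameters
Y = beta_true @ X + 0.1 * rng.normal(size=N)
beta_hat = ols_normal_equations(X, Y)

# Collinear case: the third feature is almost a copy of the first one,
# so X X^T is nearly irregular (huge condition number).
X_col = X.copy()
X_col[2] = X_col[0] + 1e-6 * rng.normal(size=N)
cond_good = np.linalg.cond(X @ X.T)
cond_collinear = np.linalg.cond(X_col @ X_col.T)

# p > N case: X X^T is (p x p) but its rank is at most N, so it is irregular.
X_wide = rng.normal(size=(10, 4))
rank_wide = np.linalg.matrix_rank(X_wide @ X_wide.T)

print("estimate:", np.round(beta_hat, 3))
print("condition numbers:", cond_good, cond_collinear)
print("rank of 10x10 matrix X X^T with N=4:", rank_wide)
```

`np.linalg.solve` is used here because it factorizes the system instead of inverting the matrix, which matches the remark above about solving the normal equations directly.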

OLS has zero bias: on average the estimate equals the true parameter values, and as $N \rightarrow \infty$ we recover them. *But* when $XX^T$ is nearly irregular, OLS suffers from high variance.

### Bias-variance decomposition of mean squared error

Let's assume that the parameter is a scalar; the mean squared error of the estimate *w.r.t.* the true parameter is:

$$ \begin{split}
\mathbb{E}[(\hat{\theta} - \theta)^2] & = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2] \\
   & = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] + \mathbb{E}[(\mathbb{E}[\hat{\theta}] - \theta)^2] + 2 \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta)]\\
   & = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] + (\mathbb{E}[\hat{\theta}] - \theta)^2
\end{split}$$

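The cross term in the second line vanishes because $\mathbb{E}[\hat{\theta}] - \theta$ is a constant and $\mathbb{E}[\hat{\theta} - \mathbb{E}[\hat{\theta}]] = 0$. Of the two remaining terms, the first is the **variance** (i.e. the mean squared deviation of the parameter estimate from the expected parameter estimate); the second is the squared **bias** (i.e. how much the expected estimate deviates from the true parameter value).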
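
The decomposition can be checked empirically by repeating an experiment many times. The sketch below assumes NumPy and a toy setting that is not from the lecture: the parameter is the mean of a Gaussian, and two hypothetical estimators are compared, the sample mean and a deliberately shrunken version of it.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0        # true parameter (hypothetical value)
n_samples = 10     # observations per experiment
n_trials = 20000   # number of repeated experiments

data = rng.normal(loc=theta, scale=1.0, size=(n_trials, n_samples))

def decompose(estimates, theta):
    """Return empirical MSE, variance and squared bias of the estimates."""
    mse = np.mean((estimates - theta) ** 2)
    variance = np.var(estimates)                  # E[(est - E[est])^2]
    bias_sq = (np.mean(estimates) - theta) ** 2   # (E[est] - theta)^2
    return mse, variance, bias_sq

unbiased = data.mean(axis=1)       # sample mean: no bias, higher variance
shrunk = 0.8 * data.mean(axis=1)   # shrunken mean: biased, lower variance

for name, est in [("sample mean", unbiased), ("shrunken mean", shrunk)]:
    mse, variance, bias_sq = decompose(est, theta)
    print(f"{name}: mse={mse:.4f} variance={variance:.4f} bias^2={bias_sq:.4f}")
```

For both estimators the printed MSE equals variance plus squared bias, and the shrunken estimator trades some bias for a smaller variance, which is the idea behind the regularization methods in the next section.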

<img src="img/empirical-mse.png" alt="Drawing" style="width: 800px;"/>

## Regularized Linear Regression

General idea of regularization: reduce variance by limiting the flexibility of the model, at the price of introducing some bias, and choose the sweet spot between the two (the regularization strength) by cross-validation. Three concrete methods are discussed below.

### PCR (Principal Component Regression)

* project data to r-dim subspace (r < p)
* then do OLS (ordinary least squares) in that subspace

Here we keep only the eigenvectors belonging to the $r$ largest eigenvalues (out of $p$ in total) of $XX^T$, as small eigenvalues give unstable results (high uncertainty in the direction of the corresponding eigenvector).
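
The two steps above can be sketched as follows. This is a minimal illustration assuming NumPy, features that are already centered, and the (p x N) data layout used in these notes; `pcr_fit` is a hypothetical helper name and the data are synthetic.

```python
import numpy as np

def pcr_fit(X, Y, r):
    """Principal component regression, assuming X (p, N) is centered per feature."""
    # Eigen-decomposition of the symmetric (p x p) matrix X X^T.
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    # eigh returns eigenvalues in ascending order; keep the r largest.
    order = np.argsort(eigvals)[::-1][:r]
    U_r = eigvecs[:, order]    # (p, r) basis of the r-dimensional subspace
    lam_r = eigvals[order]     # (r,) retained eigenvalues
    Z = U_r.T @ X              # (r, N) projected data
    # OLS in the subspace: Z Z^T = diag(lam_r), so the solve is a division.
    gamma = (Z @ Y) / lam_r
    return U_r @ gamma         # coefficients in the original p coordinates

rng = np.random.default_rng(2)
p, N = 5, 200
X = rng.normal(size=(p, N))
X -= X.mean(axis=1, keepdims=True)   # center each feature
beta_true = rng.normal(size=p)       # hypothetical true parameters
Y = beta_true @ X + 0.1 * rng.normal(size=N)

beta_full = pcr_fit(X, Y, r=p)   # r = p reproduces plain OLS
beta_pcr = pcr_fit(X, Y, r=3)    # r < p restricts the fit to 3 directions
```

With $r = p$ nothing is discarded and the result coincides with OLS; the value of $r$ would be chosen by cross-validation, as described above.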

### Ridge Regression (1970s)

$$\min_\beta || Y - \beta^T X||^2_2 + \lambda ||\beta||^2_2 $$

The first term is the OLS objective; the second term biases the solution towards modest slopes. $\lambda$ is the regularization strength. This can be solved in closed form by taking the partial derivative w.r.t. $\beta$:

$$ \begin{split}
& \frac{\partial}{\partial \beta} \left( YY^T - 2 Y X^T \beta + \beta^T XX^T \beta + \lambda \beta^T \beta \right) \\
& = -2 YX^T + 2 \beta^T XX^T + 2 \lambda \beta^T
\end{split}$$

Setting this expression to zero, we get

$$ \beta^T (XX^T + \lambda I) = Y X^T $$

or an explicit expression for the parameter estimate

$$ \hat{\beta} = (XX^T + \lambda I)^{-1} X Y^T $$

Note the additional ridge regularization term (the identity matrix multiplied by a constant). While PCR completely discards the directions with small eigenvalues, ridge stabilizes the inversion by increasing all eigenvalues of $XX^T$ by the constant $\lambda$. In other words, the ridge penalty artificially inflates the covariance matrix $XX^T$ in all dimensions, i.e. it spreads out the data in all dimensions. As the regularization strength increases, all coefficients shrink towards zero together, but ridge cannot make the problem sparser by selectively setting certain coefficients to exactly zero (which would be useful for feature selection).
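
The closed-form estimate and the eigenvalue shift can be verified in a few lines. The sketch below assumes NumPy and synthetic data with two nearly collinear features; `ridge_fit` is a hypothetical helper name and the values of $\lambda$ are arbitrary.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge estimate (X X^T + lam I)^{-1} X Y^T, with X of shape (p, N)."""
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ Y)

rng = np.random.default_rng(3)
p, N = 4, 30
X = rng.normal(size=(p, N))
X[3] = X[0] + 1e-4 * rng.normal(size=N)   # nearly collinear features
Y = np.array([1.0, 0.0, -1.0, 0.0]) @ X + 0.1 * rng.normal(size=N)

# Ridge shifts every eigenvalue of X X^T by lam, which stabilizes the solve.
eig_plain = np.linalg.eigvalsh(X @ X.T)
eig_ridge = np.linalg.eigvalsh(X @ X.T + 1.0 * np.eye(p))

# Coefficients shrink as lam grows, but none of them becomes exactly zero.
lams = [0.01, 1.0, 100.0, 1e4]
betas = [ridge_fit(X, Y, lam) for lam in lams]
for lam, beta in zip(lams, betas):
    print(lam, np.round(beta, 4), "norm:", np.linalg.norm(beta))
```

The printed norms decrease with $\lambda$ while every coefficient stays nonzero, which illustrates why ridge alone does not perform feature selection.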

### LASSO regression (~1996)

Our aim is to be able to make the problem sparser (i.e. setting some coefficients to zero). We can introduce

$$ \min_\beta || Y - \beta^T X||^2_2 + \lambda || \beta||_0$$

where the L0-norm counts the number of nonzero components in the vector. For instance, when $p = 2$, besides the trivial solution $\beta = 0$, we have to compute three solutions:
1. $\beta_1 = 0, \beta_2 \neq 0$
2. $\beta_1 \neq 0, \beta_2 = 0$
3. $\beta_1 \neq 0, \beta_2 \neq 0$
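
The enumeration above can be written as a brute-force search over all supports (sets of nonzero coefficients). This is a minimal sketch assuming NumPy and synthetic data; `best_subset` is a hypothetical helper, and the search is only feasible for very small $p$ because it performs $2^p - 1$ least-squares fits.

```python
import itertools
import numpy as np

def best_subset(X, Y, lam):
    """Brute-force minimizer of ||Y - beta^T X||^2 + lam * ||beta||_0.

    X has shape (p, N); Y is a 1-D array of length N.
    Returns the best coefficient vector and the number of fits performed.
    """
    p = X.shape[0]
    best_beta = np.zeros(p)
    best_cost = float(Y @ Y)   # cost of the trivial solution beta = 0
    n_fits = 0
    for k in range(1, p + 1):
        for support in itertools.combinations(range(p), k):
            idx = list(support)
            Xs = X[idx]        # (k, N): only the selected features
            coef, *_ = np.linalg.lstsq(Xs.T, Y, rcond=None)
            resid = Y - coef @ Xs
            cost = float(resid @ resid) + lam * k
            n_fits += 1
            if cost < best_cost:
                best_cost = cost
                best_beta = np.zeros(p)
                best_beta[idx] = coef
    return best_beta, n_fits

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 100))
beta_true = np.array([2.0, 0.0, 0.0, -1.0])   # hypothetical sparse parameters
Y = beta_true @ X + 0.1 * rng.normal(size=100)

beta_l0, n_fits = best_subset(X, Y, lam=1.0)
_, n_fits_p2 = best_subset(X[:2], Y, lam=1.0)
print("fits for p=4:", n_fits, "fits for p=2:", n_fits_p2)
print("selected coefficients:", np.round(beta_l0, 3))
```

For $p = 2$ the function performs exactly the three fits listed above; the count roughly doubles with every additional feature.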

Due to the combinatorial explosion of the number of possible solutions to try ($2^p$ sparsity patterns), we need something more tractable than the L0-norm. Hence we introduce the $L_p$-norm (the $p$ here is not related to the dimension of the features, so we use $l$ instead in the formula below to avoid confusion):

$$ ||\beta||_l = \left( \sum_{j=1}^{p} |\beta_j|^l \right)^{1/l} $$

<img src="img/equicountour.png" alt="Drawing" style="width: 400px;"/>

The equicontour lines (the surfaces on which the regularizer takes a constant value) for intermediate values of $l$ can be interpolated from the ones shown above. The regularizers are convex only when $l \geq 1$, and among these only $l = 1$ still has corners on the coordinate axes, which is what produces sparse solutions. We therefore use the L1-norm for LASSO regression:

$$ \min_\beta \mathrm{SSQ}(\beta | X, Y) = \min_\beta (Y - \beta^T X) (Y - \beta^T X)^T + \lambda || \beta||_1 $$

We can also write the regularization term as follows

$$ || \beta||_1 = a(\beta)^T \beta$$

where $a(\beta) \in \{-1, +1\}^p$ is the vector of signs of the components of $\beta$, by the definition of the L1-norm. Within each orthant (the generalization of a quadrant to higher dimensions) $a(\beta)$ is constant, so $||\beta||_1$ is just a plane there. The squared-error term $YY^T + \beta^T XX^T \beta - 2 \beta^T XY^T$ is a parabola, and adding a plane to a parabola gives another parabola. In other words, the objective is a different parabola in each orthant. Moreover, the parabolae match up at the orthant boundaries since the function $||\beta||_1$ is continuous across orthants. Thanks to convexity, every local minimum is also a global minimum, and the minimum is unique if $XX^T$ has full rank.
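
Because the objective is convex, simple iterative schemes can find its minimum. The notes do not prescribe a solver; the sketch below swaps in proximal gradient descent (ISTA) as one common choice, assuming NumPy and synthetic data. The function names, the iteration count and the values of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Shrink each component towards zero by tau; small components become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, Y, lam, n_iter=5000):
    """Minimize ||Y - beta^T X||^2 + lam * ||beta||_1 by proximal gradient steps."""
    G = X @ X.T                                      # (p, p)
    b = X @ Y                                        # (p,)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1])   # 1 / Lipschitz constant
    beta = np.zeros(X.shape[0])
    for _ in range(n_iter):
        grad = 2.0 * (G @ beta - b)                  # gradient of the squared error
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def lasso_objective(X, Y, beta, lam):
    """Objective written with the sign vector a(beta), so a^T beta = ||beta||_1."""
    resid = Y - beta @ X
    a = np.where(beta >= 0, 1.0, -1.0)
    return resid @ resid + lam * (a @ beta)

rng = np.random.default_rng(4)
p, N = 6, 80
X = rng.normal(size=(p, N))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])   # hypothetical sparse truth
Y = beta_true @ X + 0.1 * rng.normal(size=N)

beta_lasso = lasso_ista(X, Y, lam=20.0)
lam_max = 2.0 * np.max(np.abs(X @ Y))   # at or above this value, beta = 0 is optimal
beta_zero = lasso_ista(X, Y, lam=1.01 * lam_max)
print("nonzero coefficients at lam=20:", np.count_nonzero(beta_lasso))
print("nonzero coefficients above lam_max:", np.count_nonzero(beta_zero))
```

The soft-thresholding step is what sets coefficients to exactly zero, in contrast to the ridge solution, which only shrinks them.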

<img src="img/lasso-energy.png" alt="Drawing" style="width: 800px;"/>

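Let's now compare the solutions of LASSO and ridge by using the equivalent constrained formulations, i.e. minimizing $||Y - \beta^T X||^2_2$ subject to $||\beta||_1 \leq \kappa$ (LASSO) or $||\beta||^2_2 \leq \kappa$ (ridge), to see why we get sparse solutions in the case of LASSO but not ridge.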

<img src="img/lasso-vs-ridge.png" alt="Drawing" style="width: 800px;"/>

Moreover, the space of sparse solutions becomes larger as $\kappa$ is reduced (or equivalently if $\lambda$ is increased):

<img src="img/space-sparse.png" alt="Drawing" style="width: 800px;"/>

Now, as an example of linear regression, consider computed tomography:
* measurements $y$ are the detected absorptions, with shape $(1 \times n)$, where $n$ is the number of rays
* unknowns $\beta$ are the intensities of the pixels, with shape $(p \times 1)$, where $p$ is the number of pixels
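
To make the setup concrete, here is a toy sketch assuming NumPy and a hypothetical geometry that is not taken from the lecture: a 2 x 2 image ($p = 4$) probed by two horizontal and two vertical rays ($n = 4$), where each measurement is the sum of the pixels a ray passes through.

```python
import numpy as np

# Pixels are numbered row by row: 0 1 / 2 3.
# X[j, i] = 1 if ray i passes through pixel j (shape p x n).
# Rays: row 0, row 1, column 0, column 1 (hypothetical geometry).
X = np.array([
    [1, 0, 1, 0],   # pixel 0 lies on row 0 and column 0
    [1, 0, 0, 1],   # pixel 1 lies on row 0 and column 1
    [0, 1, 1, 0],   # pixel 2 lies on row 1 and column 0
    [0, 1, 0, 1],   # pixel 3 lies on row 1 and column 1
], dtype=float)

beta_true = np.array([1.0, 0.0, 0.0, 2.0])   # hypothetical pixel intensities
y = beta_true @ X                            # noise-free ray sums, shape (n,)

# The two row sums and the two column sums both add up to the total
# intensity, so the rays are dependent and X X^T is irregular.
rank = np.linalg.matrix_rank(X @ X.T)

# A small ridge term makes the system solvable.
lam = 1e-6
beta_ridge = np.linalg.solve(X @ X.T + lam * np.eye(4), X @ y)

print("rank of X X^T:", rank)
print("ridge reconstruction:", np.round(beta_ridge, 3))
print("reproduced measurements:", np.round(beta_ridge @ X, 3))
```

The reconstruction reproduces all four measurements but differs from the image that generated them, which shows why tomography with too few rays needs regularization or additional prior knowledge.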

<img src="img/tomography.png" alt="Drawing" style="width: 800px;"/>