# COMP 562 – Lecture 4

$$
\renewcommand{\xx}{\mathbf{x}}
\renewcommand{\yy}{\mathbf{y}}
\renewcommand{\zz}{\mathbf{z}}
\renewcommand{\loglik}{\log\mathcal{L}}
\renewcommand{\likelihood}{\mathcal{L}}
\renewcommand{\Data}{\textrm{Data}}
\renewcommand{\given}{|}
\renewcommand{\MLE}{\textrm{MLE}}
\renewcommand{\tth}{\textrm{th}}
\renewcommand{\Gaussian}[2]{\mathcal{N}\left(#1,#2\right)}
\renewcommand{\norm}[1]{\left\lVert#1\right\rVert}
\renewcommand{\ones}{\mathbf{1}}
$$

# Linear Regression -- Matrix Form

A general multiple-regression model can be written as

$$
y \given \xx = \beta_0 + \sum_j x_j \beta_j + \epsilon, \ \ \ \ \epsilon \sim \Gaussian{0}{\sigma^2}
$$
Written out for each of the $N$ samples, this is

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i \ \ \ \ \textrm{for} \ i=1,\ldots,N
$$

In matrix form, we can rewrite this model as 

$$
\left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array}\right]_{N \times 1} = \left[\begin{array}{ccccc} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \ldots & x_{Np} \end{array}\right]_{N \times (p+1)} \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{array}\right]_{(p+1) \times 1} + \left[\begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array}\right]_{N \times 1}
$$

This can be rewritten more simply as:
$$
\yy = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}
$$
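As a minimal sketch of the matrix form, the NumPy snippet below builds a design matrix with a leading column of ones and checks that $\mathbf{X}\beta + \epsilon$ agrees with the sample-by-sample sum. The sizes, coefficient values, noise level, and variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 5, 2                              # hypothetical sizes: 5 samples, 2 features
features = rng.normal(size=(N, p))       # raw feature values x_{ij}

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(N), features])   # shape (N, p + 1)

beta = np.array([1.0, 2.0, -0.5])        # (beta_0, beta_1, beta_2), arbitrary values
eps = rng.normal(scale=0.1, size=N)      # Gaussian noise with sigma = 0.1

# Matrix form: y = X beta + eps
y = X @ beta + eps

# The same model written one sample at a time, as in the scalar equation.
y_loop = np.array([
    beta[0] + sum(features[i, j] * beta[j + 1] for j in range(p)) + eps[i]
    for i in range(N)
])

print(X.shape)                  # (N, p + 1)
print(np.allclose(y, y_loop))   # both forms give the same y
```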

# Linear Regression -- Closed Form Solution for $\beta$

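Recall that maximizing the log-likelihood with respect to $\beta$ is equivalent to minimizing the residual sum of squares (**RSS**), or equivalently the mean squared error (**MSE**), since $MSE = RSS/N$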

$$
RSS = \sum_{i=1}^N \left(y_i-(\beta_0 + \sum_j x_{i,j} \beta_j)\right)^2 = \sum_{i=1}^N e_i^2 = \mathbf{e}^{T} \mathbf{e} = \left[\begin{array}{cccc} e_1 & e_2 & \dots & e_N \end{array}\right]_{1 \times N} \ \left[\begin{array}{c} e_1 \\ e_2 \\ \vdots \\ e_N \end{array}\right]_{N \times 1}
$$

$$
\begin{aligned}
RSS &= (\yy - \mathbf{X} \mathbf{\beta})^{T} (\yy- \mathbf{X} \mathbf{\beta}) \\
&= (\yy^{T} - \mathbf{\beta}^{T} \mathbf{X}^{T}) (\yy - \mathbf{X} \mathbf{\beta}) \\
&= \yy^{T} \yy - \beta^{T} \mathbf{X}^{T} \yy - \yy^{T} \mathbf{X} \mathbf{\beta} + \beta^{T} \mathbf{X}^{T} \mathbf{X} \beta \\
&= \yy^{T} \yy - 2 \beta^{T} \mathbf{X}^{T} \yy + \beta^{T} \mathbf{X}^{T} \mathbf{X} \beta
\end{aligned}
$$

The last step uses the fact that a scalar equals its own transpose, i.e. $\beta^{T} \mathbf{X}^{T} \yy = (\yy^{T} \mathbf{X} \mathbf{\beta})^{T} = \yy^{T} \mathbf{X} \mathbf{\beta}$



To find the $\mathbf{\beta}$ that minimizes $RSS$, we solve the following equation:

$$
\nabla_\beta \ RSS = -2 \mathbf{X}^{T} \yy + 2 \mathbf{X}^{T} \mathbf{X} \beta = 0
$$

When $\mathbf{X}^{T} \mathbf{X}$ is invertible, the solution to this linear system of equations is called the **ordinary least squares** or **OLS** solution

$$
\hat{\beta} = \beta^{\textrm{OLS}} = \beta^{\textrm{MLE}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\yy
$$
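The closed form can be sketched in NumPy as follows, assuming synthetic data with made-up coefficients. Rather than forming the inverse explicitly, the sketch solves the normal equations $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\yy$ with `np.linalg.solve`, compares the result with `np.linalg.lstsq`, and checks that the gradient of the RSS vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (all values here are illustrative assumptions).
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# OLS via the normal equations: (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the RSS at beta_hat; it should be numerically zero.
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
print(np.abs(grad).max())
```

Solving the linear system is generally preferred over computing $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly, since it is cheaper and numerically more stable.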

# Linear Regression -- Closed Form Solution for $\sigma^2$

Recall the log-likelihood function

$$
\loglik(\beta_0,\beta|\yy,\xx) = \sum_{i=1}^N \left[ -\frac{1}{2}\log 2\pi\sigma^2 -\frac{1}{2\sigma^2}\left(y_i-(\beta_0 + \sum_j x_{i,j} \beta_j)\right)^2\right]
$$

which can be written in matrix form as

$$
\loglik(\beta\given\yy,\xx) = -\frac{N}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(\yy -  \mathbf{X} \beta)^T(\yy -  \mathbf{X} \beta) 
$$

Taking the derivative with respect to $\sigma^2$, setting it to zero, and plugging in $\beta^\MLE$ yields

$$
(\sigma^{2})^\MLE = \frac{1}{N} (\yy -  \mathbf{X} \beta^\MLE)^T(\yy -  \mathbf{X} \beta^\MLE)  = \frac{1}{N} \sum_{i=1}^N \left(y_i - (\beta_0^\MLE + \sum_j x_{i,j} \beta_j^\MLE)\right)^2
$$

**<font color='red'> Please verify $(\sigma^{2})^\MLE$ at home </font>**
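One way to sanity-check the formula numerically (not a substitute for the derivation) is sketched below. It assumes synthetic data with an arbitrary true noise level, computes $(\sigma^2)^\MLE$ as the mean squared residual, and confirms that the log-likelihood at that value is higher than at a few nearby values of $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data; the true coefficients and noise level are assumptions.
N, p = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, -1.0, 2.0])
sigma_true = 0.7
y = X @ beta_true + rng.normal(scale=sigma_true, size=N)

# MLE of beta (the OLS solution), then the residuals.
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_mle

# Closed-form MLE of sigma^2: RSS divided by N.
sigma2_mle = residuals @ residuals / N

def log_likelihood(sigma2):
    """Gaussian log-likelihood as a function of sigma^2, with beta fixed at its MLE."""
    return -0.5 * N * np.log(2 * np.pi * sigma2) - (residuals @ residuals) / (2 * sigma2)

# The closed-form value should beat nearby candidate values.
nearby = sigma2_mle * np.array([0.8, 0.9, 1.1, 1.2])
is_max = all(log_likelihood(sigma2_mle) > log_likelihood(s) for s in nearby)

print(sigma2_mle, sigma_true**2)
print(is_max)
```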

* **Overfitting**: A highly flexible, complex model fits every minor variation in the training data, including noise
    * High variance and low bias

* **Underfitting**: A model that is too simple to capture the true relationships in the data
    * Low variance and high bias

* **Model Selection**: Picking the right model from a variety of models of different complexity

**<font color='red'> Q: Which model overfits/underfits the data? </font>**

# Bias-Variance Tradeoff

The **mean squared error** or **MSE** may be decomposed into **bias**, **variance**, and irreducible error components:

$$
\underbrace{\mathop{\mathbb{E}}(y - \hat{y})^2}_{\mathbf{MSE}} = \underbrace{(\mathop{\mathbb{E}}(\hat{y}) - y)^2}_{\mathbf{Bias^2}} + \underbrace{\mathop{\mathbb{E}} \left[ (\hat{y} - \mathop{\mathbb{E}}(\hat{y}))^2 \right]}_{\mathbf{Variance}} + \underbrace{\sigma^2_{e}}_{\mathbf{Irreducible \ Error}}
$$
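The decomposition can be illustrated with a small simulation, sketched below under several assumptions: a hypothetical true function $\sin(2\pi x)$, a fixed noise level, polynomial models of degree 1, 3, and 7, and a single test point `x0`. Bias is measured against the noiseless value of the true function at `x0`, and the expectation is approximated by refitting on many independently drawn training sets.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Hypothetical true function (an assumption for this sketch)."""
    return np.sin(2 * np.pi * x)

sigma_e = 0.3                       # noise standard deviation (irreducible error)
x_train = np.linspace(0, 1, 20)     # fixed training inputs
x0 = 0.35                           # test point where bias and variance are measured
n_repeats = 2000                    # number of simulated training sets

def simulate(degree):
    """Return (bias^2, variance) of a degree-`degree` polynomial fit at x0."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(scale=sigma_e, size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coeffs, x0)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    return bias2, variance

results = {degree: simulate(degree) for degree in (1, 3, 7)}
for degree, (bias2, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

In this setup the straight line (degree 1) should show the largest squared bias, while the most flexible model (degree 7) should show the largest variance.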

<img src="./Images/biasvariance.png" width="800" align="center"/>

# Ill-Posed Problems

**<font color='red'> Q: What happens if you are solving a linear system $Ax = y$ and there are more unknowns than equations? </font>**

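In our setting, with $N$ samples and $p$ features, linear regression is ill-posed if $p>N$: there are more unknown coefficients than equations, so infinitely many $\beta$ fit the data equally well

Another example of an ill-posed linear regression problem arises when two predictors are exact copies of each other. The columns of $\mathbf{X}$ are then linearly dependent, so $\mathbf{X}^T\mathbf{X}$ is singular and cannot be inverted

This is a problem even if $p<N$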
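Both failure modes can be seen numerically. The sketch below uses made-up data: one design matrix with a duplicated predictor and one with more features than samples. In each case $\mathbf{X}^T\mathbf{X}$ is rank-deficient, and for the duplicated predictor two different coefficient vectors give identical predictions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Case 1: two copies of the same predictor, with far fewer features than samples.
N = 50
x1 = rng.normal(size=N)
X_dup = np.column_stack([np.ones(N), x1, x1])    # columns: intercept, x1, x1 again
gram_dup = X_dup.T @ X_dup
rank_dup = np.linalg.matrix_rank(gram_dup)        # 2, although the matrix is 3 x 3

# Different coefficients, same predictions: the weight can be moved freely
# between the two identical columns.
beta_a = np.array([1.0, 3.0, 0.0])
beta_b = np.array([1.0, 0.0, 3.0])
same_fit = np.allclose(X_dup @ beta_a, X_dup @ beta_b)

# Case 2: more features than samples.
X_wide = rng.normal(size=(5, 10))                 # 5 samples, 10 features
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)   # at most 5, matrix is 10 x 10

print(rank_dup, same_fit, rank_wide)
```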

# Ridge Regression

Adding an $\ell_2$ penalty on the coefficients to the linear regression log-likelihood yields **ridge regression**

$$
\loglik(\beta_0,\beta|\yy,\xx) = \sum_{i=1}^N \left[ -\frac{1}{2}\log 2\pi\sigma^2 -\frac{1}{2\sigma^2}\left(y_i-(\beta_0 + \sum_j x_{i,j} \beta_j)\right)^2\right] - \underbrace{\frac{\lambda}{2} \sum_{j} \beta_{j}^2}_{\textrm{ridge penalty}}
$$

All those sums can get cumbersome, so we will use norms
1. $\ell_2$ norm $\norm{\xx} = \sqrt{\sum_{i} x_i^2}$
2. $\ell_1$ norm $\norm{\xx}_1 = \sum_{i} \left|x_i\right|$

$$
\loglik(\beta\given\yy,\xx) =  -\frac{1}{2\sigma^2}\norm{\yy - \mathbf{X} \mathbf{\beta}}^2  \underbrace{-\frac{\lambda}{2}\norm{\beta}^2}_{\textrm{ridge penalty}}+ \textrm{const.}
$$


# Ridge Regression -- Computing Gradients

$$
\loglik(\beta\given\yy,\xx) =  -\frac{1}{2\sigma^2}\norm{\yy - \mathbf{X} \mathbf{\beta}}^2  \underbrace{-\frac{\lambda}{2}\norm{\beta}^2}_{\textrm{ridge penalty}}+ \textrm{const.}
$$

Computing the gradient and setting it to zero

$$
\nabla_\beta \loglik(\beta\given\yy,\xx) = \frac{1}{\sigma^2}\mathbf{X}^T(\yy - \mathbf{X} \mathbf{\beta}) - \lambda\beta = 0
$$

yields

$$
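\beta^{\MLE} = (\mathbf{X}^T\mathbf{X} + \lambda\sigma^2 \mathbf{I}_{p+1})^{-1}(\mathbf{X}^T\yy)
$$

where $\mathbf{I}_{p+1}$ is the $(p+1) \times (p+1)$ identity matrix, matching the size of $\mathbf{X}^T\mathbf{X}$. For any $\lambda > 0$ the matrix being inverted is positive definite, so this solution exists even when $\mathbf{X}^T\mathbf{X}$ itself is singular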

Contrast this with the closed-form solution of unregularized linear regression

$$
\beta^{\textrm{MLE}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\yy
$$
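A minimal sketch of the ridge closed form is given below, using a design matrix with a duplicated predictor, for which the unregularized solution does not exist. The values of `lam` and `sigma2` are arbitrary assumptions; in practice $\sigma^2$ is unknown and the product $\lambda\sigma^2$ is usually treated as a single tuning parameter.

```python
import numpy as np

rng = np.random.default_rng(5)

# Duplicated predictor: X^T X is singular, so plain OLS has no unique solution.
N = 50
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1, x1])
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=N)

lam, sigma2 = 1.0, 0.01          # hypothetical penalty strength and noise variance

# Ridge closed form: (X^T X + lam * sigma2 * I) beta = X^T y
A = X.T @ X + lam * sigma2 * np.eye(X.shape[1])
beta_ridge = np.linalg.solve(A, X.T @ y)

# Gradient of the penalized log-likelihood at the solution; should be ~0.
grad = X.T @ (y - X @ beta_ridge) / sigma2 - lam * beta_ridge

print(beta_ridge)                # the weight of 3 is split between the two copies
print(np.abs(grad).max())
```

The penalty makes the solution unique: the two identical columns receive equal coefficients that together account for the effect of the single underlying predictor.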

# Ridge Regression -- Not Regularizing the Intercept

The bias/intercept coefficient $\beta_0$ is typically not regularized in a linear regression

Shrinking $\beta_0$ toward zero may prevent us from finding the correct relationship, for example when the responses $y_i$ have a large mean

$$
\nabla \loglik(\beta_0,\beta,\sigma^2\given\xx,\yy) = \left[\begin{array}{c} 
\sum_{i=1}^N -\frac{1}{\sigma^2}\left(y_i-(\beta_0 + \sum_j x_{i,j} \beta_j)\right)(-1) \\
\sum_{i=1}^N -\frac{1}{\sigma^2}\left(y_i-(\beta_0 + \sum_j x_{i,j} \beta_j)\right)(-x_{i,1}) - \lambda \beta_1 \\
\vdots\\
\sum_{i=1}^N -\frac{1}{\sigma^2}\left(y_i-(\beta_0 + \sum_j x_{i,j} \beta_j)\right)(-x_{i,p}) - \lambda \beta_p
\end{array}
\right]
$$
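The gradient with respect to $(\beta_0, \beta_1, \ldots, \beta_p)$ can be sketched in vectorized form and checked against central finite differences. The function names, data, and parameter values below are assumptions for illustration; constants that do not depend on $\beta$ are dropped from the objective.

```python
import numpy as np

def penalized_loglik(beta, X, y, sigma2, lam):
    """Log-likelihood minus the ridge penalty on beta_1..beta_p (constants dropped)."""
    r = y - X @ beta
    return -0.5 * (r @ r) / sigma2 - 0.5 * lam * (beta[1:] @ beta[1:])

def gradient(beta, X, y, sigma2, lam):
    """Gradient of penalized_loglik; the intercept beta[0] gets no penalty term."""
    r = y - X @ beta
    grad = X.T @ r / sigma2
    grad[1:] -= lam * beta[1:]
    return grad

rng = np.random.default_rng(6)
N, p = 30, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)
sigma2, lam = 1.0, 0.5

# Central finite differences, one coordinate direction at a time.
h = 1e-5
numeric = np.array([
    (penalized_loglik(beta + h * e, X, y, sigma2, lam)
     - penalized_loglik(beta - h * e, X, y, sigma2, lam)) / (2 * h)
    for e in np.eye(p + 1)
])
analytic = gradient(beta, X, y, sigma2, lam)

print(np.allclose(numeric, analytic, atol=1e-5))
```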

Note that  $\beta_0$ is **not** regularized


Remember our closed form solution for ridge regression

$$
\beta^{\MLE} = (\mathbf{X}^T\mathbf{X} + \lambda\sigma^2 \mathbf{I}_{p+1})^{-1}(\mathbf{X}^T\yy)
$$

Updating our closed form solution so that $\beta_0$ is not regularized yields

$$
\beta^{\MLE} = \left(\mathbf{X}^T\mathbf{X} + \lambda\sigma^2 \left[\begin{array}{ccccc} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{array}\right]_{(p+1) \times (p+1)}\right)^{-1}(\mathbf{X}^T\yy)
$$
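The effect of leaving $\beta_0$ unpenalized can be sketched as follows, assuming synthetic data with a sizeable intercept and a deliberately strong penalty. The helper `ridge` and all numeric values are hypothetical. The sketch compares the fully penalized solution with the one that uses the modified identity matrix, and also checks that adding a constant to $\yy$ changes only $\beta_0$ when the intercept is not regularized.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data with a large intercept (values are assumptions).
N, p = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([5.0, 1.0, -1.0]) + rng.normal(scale=0.5, size=N)

lam, sigma2 = 200.0, 0.25        # deliberately strong penalty: lam * sigma2 = 50

def ridge(X, y, penalty_matrix):
    """Solve (X^T X + lam * sigma2 * penalty_matrix) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * sigma2 * penalty_matrix, X.T @ y)

I = np.eye(p + 1)
D = I.copy()
D[0, 0] = 0.0                    # zero in the top-left entry: no penalty on beta_0

beta_full = ridge(X, y, I)       # intercept is shrunk toward zero
beta_unpen = ridge(X, y, D)      # intercept is left free

# Shifting y by a constant should move only the unpenalized intercept.
shift = 10.0
beta_shifted = ridge(X, y + shift, D)

print(beta_full)
print(beta_unpen)
print(beta_shifted - beta_unpen)
```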