# Q2 {-}
## Linear regression: bias-variance tradeoff, CV, and variable selection {-}

Consider a dataset with $m$ data points $(x^i, y^i)$, $x^i \in \mathbb{R}^n$, generated by the following linear model:

$$
y^i = {\beta^*}^T x^i + \epsilon^i, \qquad i = 1, \ldots, m,
$$

where the $\epsilon^i$ are i.i.d. Gaussian noise terms with zero mean and variance $\sigma^2$, and $\beta^*$ is the true parameter.
Consider the ridge regression estimator:

$$
\hat{\beta}(\lambda) = \text{argmin}_{\beta}{\frac{1}{m} \sum_{i=1}^{m}(y^i - \beta^T x^i)^2 + \lambda ||\beta||_2^2}
$$

where $\lambda \geq 0$ is the regularization parameter.

#### (a) Find the closed form solution for $\hat{\beta}(\lambda)$ and its distribution conditioning on ${x^i}$ (i.e., treat them as fixed). {-}

##### **Answer** {-}
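Stacking the $x^i$ as the rows of $X \in \mathbb{R}^{m \times n}$ and multiplying the objective by $m$ (which only rescales $\lambda$; we write $\lambda$ for $m\lambda$ from here on) gives the matrix form: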

$$
|| y-X \beta||_2^2 + \lambda || \beta ||_2^2\ \ \ \ (1)
$$

Taking the gradient of (1) with respect to $\beta$ and setting it equal to 0 yields

$$
-2X^T(y - X\beta) + 2\lambda\beta = 0
\\
-2X^Ty + 2X^TX\beta + 2\lambda\beta = 0
\\
-X^Ty + X^TX\beta + \lambda\beta = 0\ \ \ \ (2)
$$

Solving (2) for $\beta$ yields

$$
X^Ty = X^TX\beta + \lambda\beta
\\
X^Ty = (X^TX + \lambda I)\beta
$$

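$$
\boxed{\hat{\beta}(\lambda) = (X^TX + \lambda I)^{-1}X^Ty}
$$

Conditioning on the $x^i$, we have $y = X\beta^* + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I)$, so $\hat{\beta}(\lambda)$ is a linear function of a Gaussian vector and is itself Gaussian:

$$
\hat{\beta}(\lambda) \sim N\left((X^TX + \lambda I)^{-1}X^TX\beta^*,\ \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}\right)
$$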
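The closed form can be checked numerically. The following is a minimal NumPy sketch, assuming the matrix-form objective $||y - X\beta||_2^2 + \lambda||\beta||_2^2$; the function name `ridge_closed_form` and the synthetic data are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam I)^{-1} X^T y without forming the inverse explicitly."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_star = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ beta_star + rng.normal(scale=0.5, size=100)

lam = 2.0
beta_hat = ridge_closed_form(X, y, lam)

# The gradient of ||y - X b||^2 + lam ||b||^2 should vanish at beta_hat.
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad, 0.0, atol=1e-8))
```

Solving the linear system with `np.linalg.solve` is generally preferable to computing the matrix inverse, since it is cheaper and numerically more stable.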


#### (b) Calculate the bias $E[x^T \hat{\beta}(\lambda)] - x^T \beta^*$ as a function of $\lambda$ and some fixed test point $x$. {-}

##### **Answer** {-}

Since $y$ in the answer above is

$$
y = X\beta^* + \epsilon
$$

we can rewrite the estimator as

$$
\hat{\beta}(\lambda) = (X^TX + \lambda I)^{-1} X^T(X\beta^* + \epsilon)
$$

Expanding produces:

$$
\hat{\beta}(\lambda) = (X^TX + \lambda I)^{-1} X^TX\beta^* + (X^TX + \lambda I)^{-1} X^T \epsilon
$$

Taking the expectation with $X$ held fixed gives

$$
E[\hat{\beta}(\lambda)] = (X^TX + \lambda I)^{-1} X^TX\beta^* + (X^TX + \lambda I)^{-1} X^T E[\epsilon]
$$

Since $\epsilon$ is i.i.d. with mean 0 and variance $\sigma^2$, $E[\epsilon] = 0$, which reduces the above to

$$
E[\hat{\beta}(\lambda)] = (X^TX + \lambda I)^{-1} X^TX\beta^*
$$

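Therefore, for a fixed test point $x$,

$$
E[x^T\hat{\beta}(\lambda)] - x^T\beta^* = x^T(X^TX + \lambda I)^{-1} X^TX\beta^* - x^T\beta^*
$$

and, because $(X^TX + \lambda I)^{-1} X^TX - I = -\lambda (X^TX + \lambda I)^{-1}$,

$$
\boxed{E[x^T\hat{\beta}(\lambda)] - x^T\beta^* = -\lambda\, x^T (X^TX + \lambda I)^{-1} \beta^*}
$$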
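As a sanity check on the bias at a fixed test point, the sketch below evaluates $x^T(X^TX + \lambda I)^{-1}X^TX\beta^* - x^T\beta^*$ and compares it with the equivalent compact form $-\lambda\, x^T(X^TX + \lambda I)^{-1}\beta^*$. The helper name `ridge_bias` and the synthetic `X`, `x`, and `beta_star` are assumptions made for illustration.

```python
import numpy as np

def ridge_bias(X, x, beta_star, lam):
    """Bias of x^T beta_hat(lam): x^T (X^T X + lam I)^{-1} X^T X beta* - x^T beta*."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    expected_beta = np.linalg.solve(A, X.T @ X @ beta_star)  # E[beta_hat(lam)]
    return x @ expected_beta - x @ beta_star

# Illustrative synthetic design, true parameter, and test point.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
beta_star = np.array([2.0, -1.0, 0.5, 0.0])
x = rng.normal(size=4)
lam = 5.0

bias = ridge_bias(X, x, beta_star, lam)
# Equivalent compact form: -lam * x^T (X^T X + lam I)^{-1} beta*.
compact = -lam * x @ np.linalg.solve(X.T @ X + lam * np.eye(4), beta_star)
print(bias, compact)
```

The two printed values should agree up to floating-point error, and the bias vanishes at $\lambda = 0$ whenever $X^TX$ is invertible.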



#### (c) Calculate the variance term $E[(x^T \hat{\beta}(\lambda) - E[x^T \hat{\beta}(\lambda)])^2]$ as a function of λ {-}

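Substituting $y = X\beta^* + \epsilon$ into $\hat{\beta}(\lambda) = (X^TX + \lambda I)^{-1}X^Ty$ and subtracting the mean leaves only the noise term:

$$
\hat{\beta}(\lambda) - E[\hat{\beta}(\lambda)] = (X^TX + \lambda I)^{-1} X^T\epsilon
$$

Since the variance of a zero-mean vector $C$ is $E[CC^T]$, we have

$$
E[(X^TX + \lambda I)^{-1} X^T\epsilon\epsilon^T X (X^TX + \lambda I)^{-1}] \\ 
= (X^TX + \lambda I)^{-1} X^T E[\epsilon\epsilon^T] X (X^TX + \lambda I)^{-1}
$$

Noting that $E[\epsilon\epsilon^T] = \sigma^2 I$, the covariance of $\hat{\beta}(\lambda)$ is $\sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$, and the variance of the prediction at the test point $x$ is

$$
\boxed{\sigma^2\, x^T (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1} x}
$$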
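The prediction variance at a test point $x$, $\sigma^2 x^T(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}x$, can be compared against a simulation. This is a rough Monte Carlo sketch with $X$ held fixed and only the noise redrawn; the name `ridge_variance`, the number of replications, and all data values are assumptions.

```python
import numpy as np

def ridge_variance(X, x, sigma, lam):
    """Variance of x^T beta_hat(lam): sigma^2 x^T A^{-1} X^T X A^{-1} x, A = X^T X + lam I."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    v = np.linalg.solve(A, x)  # A^{-1} x (A is symmetric)
    return sigma**2 * v @ (X.T @ X) @ v

# Illustrative synthetic setup.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
beta_star = np.array([1.0, 0.0, -1.0])
x = np.array([0.5, -1.0, 2.0])
sigma, lam = 1.0, 10.0

# Monte Carlo: redraw the noise many times with X held fixed.
A = X.T @ X + lam * np.eye(3)
v = np.linalg.solve(A, x)
noise = rng.normal(scale=sigma, size=(20000, 100))
Y = X @ beta_star + noise   # each row is one simulated response vector
preds = (Y @ X) @ v         # x^T beta_hat(lam) for every simulated dataset

print(ridge_variance(X, x, sigma, lam), np.var(preds))
```

The empirical variance should land close to the analytic value, with the gap shrinking as the number of replications grows.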

#### Use the results from parts (b) and (c) and the bias-variance decomposition to analyze the impact of λ in the mean squared error. Specifically, which term dominates when λ is small, and large, respectively? {-}

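Since $MSE = bias^2 + var$, the two results above show how $\lambda$ trades one term against the other. As $\lambda \to 0$, the estimator approaches ordinary least squares: the bias goes to 0 while the variance approaches its largest value, $\sigma^2 x^T (X^TX)^{-1} x$ (assuming $X^TX$ is invertible), so the variance term dominates when $\lambda$ is small. As $\lambda$ grows, $\hat{\beta}(\lambda)$ is shrunk toward 0: the variance goes to 0 while the bias approaches $-x^T\beta^*$, so the squared bias term dominates when $\lambda$ is large.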
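One way to see which term dominates is to evaluate both over a range of $\lambda$. The sketch below uses the identity $(X^TX + \lambda I)^{-1}X^TX - I = -\lambda(X^TX + \lambda I)^{-1}$ to write the bias as $-\lambda\, x^T(X^TX + \lambda I)^{-1}\beta^*$; the data, the test point, and the $\lambda$ grid are illustrative assumptions.

```python
import numpy as np

# Illustrative synthetic setup.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
beta_star = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
x = rng.normal(size=5)
sigma = 1.0
G = X.T @ X

def bias_sq_and_var(lam):
    """Squared bias and variance of x^T beta_hat(lam) at the fixed test point x."""
    A_inv_x = np.linalg.solve(G + lam * np.eye(5), x)
    bias = -lam * A_inv_x @ beta_star
    var = sigma**2 * A_inv_x @ G @ A_inv_x
    return bias**2, var

for lam in [1e-3, 1e-1, 1e1, 1e3, 1e5]:
    b2, v = bias_sq_and_var(lam)
    print(f"lambda={lam:g}  bias^2={b2:.3e}  variance={v:.3e}")
```

Reading the printed table from top to bottom shows the squared bias growing and the variance shrinking as $\lambda$ increases.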

#### Now suppose we have m = 100 samples. Write a pseudo-code to explain how to use cross validation to find the optimal λ. {-}

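* Initialize a grid of candidate values for the hyperparameter $\lambda$
* Shuffle the $m = 100$ samples and divide them into $k$ folds (e.g., $k = 5$, 20 samples each)
* For each candidate $\lambda$ DO:
    * For each fold DO:
        * Hold out the fold as the validation set
        * Train the ridge model on the other $k-1$ folds
        * Evaluate the squared error on the holdout fold
    * end
    * Average the $k$ holdout errors to get the CV error for this $\lambda$
* end
* Select the $\lambda$ with the smallest CV error and refit on all $m$ samples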
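The cross-validation procedure can be sketched in NumPy as follows. This is one possible implementation, assuming $k = 5$ folds, a logarithmic grid of $\lambda$ values, and the matrix-form ridge solution; the names `ridge_fit` and `cv_select_lambda` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_select_lambda(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the lowest mean validation MSE over k folds."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(m), k)  # shuffled index folds
    mean_errors = []
    for lam in lambdas:
        fold_errors = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            beta = ridge_fit(X[train_idx], y[train_idx], lam)
            resid = y[val_idx] - X[val_idx] @ beta
            fold_errors.append(np.mean(resid**2))
        mean_errors.append(np.mean(fold_errors))  # CV error for this lambda
    best = int(np.argmin(mean_errors))
    return lambdas[best], np.array(mean_errors)

# Illustrative synthetic data with m = 100 samples.
rng = np.random.default_rng(4)
m, n = 100, 10
X = rng.normal(size=(m, n))
beta_star = rng.normal(size=n)
y = X @ beta_star + rng.normal(scale=1.0, size=m)

lambdas = np.logspace(-3, 3, 13)
best_lam, errors = cv_select_lambda(X, y, lambdas, k=5)
beta_final = ridge_fit(X, y, best_lam)  # refit on all m samples
print(best_lam)
```

The same folds are reused for every $\lambda$, so the candidates are compared on identical train/validation splits.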

#### Explain if we would like to perform variable selection, how should we change the regularization term in Equation (1) to achieve this goal. {-}

If we want the model to perform variable selection, the regularizer should drive coefficients to exactly zero (sparsity). We can do this by changing ridge regression to LASSO regression, which replaces the squared $\ell_2$ penalty $||\beta||_2^2$ with the $\ell_1$ penalty $||\beta||_1 = \sum_j |\beta_j|$ (L1 regularization):

$$
|| y-X \beta||_2^2 + \lambda || \beta ||_1
$$
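To illustrate the sparsity effect, the sketch below minimizes the L1-penalized objective with proximal gradient descent (ISTA), which is only one of several possible solvers, and compares the result with the ridge solution at the same $\lambda$. The function names, the iteration count, the value $\lambda = 50$, and the synthetic data are all assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize ||y - X b||_2^2 + lam ||b||_1 by proximal gradient descent (ISTA)."""
    # Step size 1/L, where L = 2 * sigma_max(X)^2 is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Illustrative synthetic data with a sparse true parameter.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
beta_star = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.5])
y = X @ beta_star + rng.normal(scale=0.5, size=100)

lam = 50.0
beta_lasso = lasso_ista(X, y, lam)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
print("lasso:", np.round(beta_lasso, 3))
print("ridge:", np.round(beta_ridge, 3))
```

The soft-thresholding step can set coefficients exactly to zero, whereas the ridge solution shrinks coefficients without generally zeroing them.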