# Coordinate Descent and Soft-Thresholding for Lasso Regression

## Introduction  

This notebook covers the following topics in lasso regression:
- Solving lasso regression with an orthogonal design matrix using soft-thresholding
- Solving multivariate lasso regression using coordinate descent when the design matrix is properly scaled
- Re-scaling the multivariate lasso regression solution back onto the original scale of the data  

This notebook requires some prior knowledge of regression analysis and lasso regression.

## 1. Soft-thresholding a lasso problem

In a classical regression problem, the assumption placed on data is:  
$$y = X\beta + \epsilon$$  
where $y$ is a vector of size $(n, 1)$, $X$ is a matrix of size $(n, p)$, $\beta$ is a vector of size $(p, 1)$, and $\epsilon$ is a vector of size $(n, 1)$ with $\epsilon \sim N(0, \sigma^2 I_n)$.  
Define the loss function of lasso regression and expand it around the OLS estimate $\hat{\beta}^{OLS}$:  

$$\frac{1}{2n}|| y - X\beta ||^2 + \lambda|| \beta ||_1 = \frac{1}{2n}|| y - X\hat{\beta}^{OLS} + X\hat{\beta}^{OLS} - X\beta||^2 + \lambda|| \beta ||_1$$  
  
$$\hspace{3.2cm}=\frac{1}{2n}|| y - X\hat{\beta}^{OLS}||^2 + \frac{1}{2n}|| X\hat{\beta}^{OLS} - X\beta||^2+ \lambda|| \beta ||_1$$  
The cross-product term from expanding the square vanishes:
$$2(y - X\hat{\beta}^{OLS})^T( X\hat{\beta}^{OLS} - X\beta)=2r^{T}(X\hat{\beta}^{OLS} - X\beta)=0$$  

because $X\hat{\beta}^{OLS} - X\beta$ lies in the column space of $X$, which is orthogonal to the OLS residual $r = y - X\hat{\beta}^{OLS}$.
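
The orthogonality argument can be checked numerically. The following is a minimal sketch on simulated data; the sample size, coefficients, and random seed are illustrative assumptions, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# OLS fit and its residual r = y - X beta_ols
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols

# For an arbitrary beta, X beta_ols - X beta lies in the column space of X,
# so its inner product with r should be zero up to floating-point error.
beta = rng.normal(size=p)
cross_term = 2 * r @ (X @ beta_ols - X @ beta)
print(abs(cross_term) < 1e-8)
```

Because `beta` is drawn at random, the check illustrates that the cross term vanishes for any choice of $\beta$, not only at the minimizer.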

Now since $\frac{1}{2n}||y - X\hat{\beta}^{OLS}||^2$ is not a function of $\beta$, the loss function is minimized by minimizing $\frac{1}{2n}||X\hat{\beta}^{OLS} - X\beta||^2+ \lambda||\beta||_1$.

We have:  
$$\begin{aligned}
\hat{\beta}^{lasso} &= \arg\min_{\beta}\ \frac{1}{2n}||X\hat{\beta}^{OLS} - X\beta||^2+ \lambda||\beta||_1 \\
&= \arg\min_{\beta}\ \frac{1}{2n}(\hat{\beta}^{OLS} - \beta)^{T}X^{T}X(\hat{\beta}^{OLS} - \beta) + \lambda||\beta||_1 \\
&= \arg\min_{\beta}\ \frac{1}{2}(\hat{\beta}^{OLS} - \beta)^{T}(\hat{\beta}^{OLS} - \beta) + \lambda||\beta||_1 \qquad (\text{assuming } X^{T}X=nI) \\
&= \arg\min_{\beta}\ \sum_{j=1}^{p}\left[\frac{1}{2}(\hat{\beta}_{j}^{OLS} - \beta_{j})^2+ \lambda|\beta_{j}|\right]
\end{aligned}$$
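
As a sanity check on this reduction, the sketch below builds a design matrix with $X^{T}X=nI$ and compares the full lasso objective with the coordinate-wise sum plus the constant OLS term. The data, the value of `lam`, and the helper names `lasso_objective` and `separable_objective` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, p, lam = 40, 4, 0.3

# Q has orthonormal columns, so X = sqrt(n) * Q satisfies X^T X = n I
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([2.0, 0.0, -1.0, 0.1]) + rng.normal(size=n)

# With X^T X = n I, the OLS solution (X^T X)^{-1} X^T y reduces to X^T y / n
beta_ols = X.T @ y / n

def lasso_objective(beta):
    """Full lasso loss: (1/2n) ||y - X beta||^2 + lam * ||beta||_1."""
    return np.sum((y - X @ beta) ** 2) / (2 * n) + lam * np.sum(np.abs(beta))

def separable_objective(beta):
    """Coordinate-wise form: sum_j [ 0.5 (beta_ols_j - beta_j)^2 + lam |beta_j| ]."""
    return np.sum(0.5 * (beta_ols - beta) ** 2 + lam * np.abs(beta))

# Term that does not depend on beta
const = np.sum((y - X @ beta_ols) ** 2) / (2 * n)

beta = rng.normal(size=p)  # arbitrary test point
gap = lasso_objective(beta) - (const + separable_objective(beta))
print(abs(gap) < 1e-10)
```

The two objectives differ only by `const`, so they share the same minimizer, and the separable form can be minimized one coordinate at a time.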

**Because the objective is a sum of terms that each involve a single coordinate, each lasso estimate can be solved for individually from the corresponding OLS estimate.**

For each $\beta_{j}$:
$$\hat{\beta}_{j}^{lasso}=\arg\min_{\beta}\ \frac{1}{2}(\beta - \hat{\beta}_{j}^{OLS})^{2}+\lambda|\beta|$$  
Since $|\beta|$ is not differentiable at zero, set the subgradient with respect to $\beta$ to zero. This yields the soft-thresholding rule:
$$\hat{\beta}_{j}^{lasso}=\operatorname{sign}(\hat{\beta}_{j}^{OLS})\left(|\hat{\beta}_{j}^{OLS}|-\lambda\right)_{+}$$
where $(x)_{+}=\max(x, 0)$.
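
The soft-thresholding rule translates directly into a vectorized function. This is a minimal sketch assuming NumPy; the function name `soft_threshold` and the example coefficients are illustrative and not taken from any particular library.

```python
import numpy as np

def soft_threshold(z, lam):
    """Apply sign(z) * max(|z| - lam, 0) elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical OLS estimates under an orthogonal design (X^T X = n I)
beta_ols = np.array([3.0, -0.5, 0.2, -2.0])
beta_lasso = soft_threshold(beta_ols, lam=1.0)
print(beta_lasso)
```

Coefficients whose magnitude is below $\lambda$ are set exactly to zero, while the remaining ones are shrunk toward zero by $\lambda$. This is how lasso performs variable selection in the orthogonal case.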
