# How Ridge Shrinks Coefficients in Gradient Descent
> $X$ is the feature matrix ($n \times p$), $y$ is the target vector, and $\beta$ is the vector of coefficients.

- $X^TX$ is a $p\times p$ matrix. It is invertible if the features are not perfectly collinear.
- $(X^TX)^{-1}$ undoes the effect of collinearity between the features and normalizes the contribution of each feature to the final solution.
- $X^Ty$ measures how strongly each feature co-varies with the target variable $y$ (it is proportional to the correlation when the features and $y$ are centered and standardized).
- Putting them together gives the coefficients of linear regression (the ordinary least squares solution):
  $$ \beta = (X^TX)^{-1}(X^Ty) $$
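
As a minimal sketch of the formula above, the following NumPy snippet computes the least squares coefficients on synthetic data. The data, the seed, and the names `true_beta`, `beta_ols`, and `beta_lstsq` are assumptions made for illustration. It solves the linear system $(X^TX)\beta = X^Ty$ instead of forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

# Synthetic data (assumption): 100 samples, 3 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

# Normal equation: solve (X^T X) beta = X^T y rather than inverting X^T X.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", beta_ols)
print("np.linalg.lstsq:", beta_lstsq)
```

Both routes should agree closely, and with this little noise the estimates should land near `true_beta`.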


# Ridge Regression:
$$ J(\beta) = \text{Error} = ||y-X\beta||^2 + \lambda ||\beta||^2$$
$$ \text{SSR} = ||y-X\beta||^2$$
$$ \text{Penalty} = \lambda ||\beta||^2 $$
- Gradient of the SSR:
$$ \nabla\,\text{SSR} = -2X^T(y-X\beta)$$
- Gradient of the penalty:
$$ \nabla\,\text{Penalty} = 2\lambda \beta $$

- Combining them gives the gradient of the cost function $J(\beta)$:
$$\nabla J(\beta) = -2X^T(y-X\beta) + 2\lambda \beta $$
Gradient descent minimizes $J(\beta)$ by repeatedly stepping in the direction opposite to this gradient.
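
The ridge gradient can be sanity-checked numerically. The sketch below compares the analytic expression $-2X^T(y-X\beta) + 2\lambda\beta$ with a central finite-difference approximation of the cost. The helper names `ridge_cost` and `ridge_grad`, the random data, and the value of `lam` are all assumptions made for illustration.

```python
import numpy as np

def ridge_cost(beta, X, y, lam):
    """Sum of squared residuals plus lam * ||beta||^2."""
    r = y - X @ beta
    return r @ r + lam * (beta @ beta)

def ridge_grad(beta, X, y, lam):
    """Analytic gradient: -2 X^T (y - X beta) + 2 lam beta."""
    return -2 * X.T @ (y - X @ beta) + 2 * lam * beta

# Random test problem (assumption): 20 samples, 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)
lam = 0.5

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (ridge_cost(beta + eps * e, X, y, lam)
     - ridge_cost(beta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])

print("analytic:", ridge_grad(beta, X, y, lam))
print("numeric: ", numeric)
```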

The value of $\beta$ is updated until convergence, where $\alpha$ is the learning rate:
$$ \beta \leftarrow \beta - \alpha \nabla J(\beta) $$
$$ \beta \leftarrow \beta - \alpha [-2X^T(y-X\beta) + 2\lambda \beta] $$
$$\beta \leftarrow \beta + 2\alpha X^T(y - X\beta) - 2\alpha\lambda\beta$$
$$\beta \leftarrow \beta (1 - 2\alpha\lambda) + 2\alpha X^T(y - X\beta)$$

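Here, the term $2\alpha X^T(y - X\beta)$ moves $\beta$ toward a smaller sum of squared residuals, just as in ordinary least squares, while the factor $(1-2\alpha\lambda)$ shrinks $\beta$ toward zero. The strength of this shrinkage is set by $\lambda$, the penalty we tune in ridge regression.

- For each update, $\beta$ is first scaled by $(1 - 2\alpha\lambda)$, i.e. it loses a fraction $2\alpha\lambda$ of its current value (assuming $0 < 2\alpha\lambda < 1$). Because the shrinkage is proportional to $\beta$, coefficients get smaller but generally do not reach exactly zero.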
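
The update rule can be sketched in a few lines of NumPy. The data, `lam`, `alpha`, and the iteration count below are illustrative assumptions, not tuned values. The result is compared with the standard closed-form ridge solution $(X^TX + \lambda I)^{-1}X^Ty$, which is not derived in these notes but is what the gradient descent iterates converge to when $\alpha$ is small enough.

```python
import numpy as np

# Synthetic data (assumption): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

lam = 10.0     # penalty strength (illustrative)
alpha = 1e-3   # learning rate (illustrative); too large a value diverges

beta = np.zeros(3)
for _ in range(1000):
    # beta <- beta * (1 - 2*alpha*lam) + 2*alpha * X^T (y - X beta)
    beta = (1 - 2 * alpha * lam) * beta + 2 * alpha * X.T @ (y - X @ beta)

# Reference solutions for comparison.
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("gradient descent:", beta)
print("closed form:     ", beta_closed)
print("||ridge|| vs ||ols||:", np.linalg.norm(beta), np.linalg.norm(beta_ols))
```

The ridge coefficients should have a smaller norm than the unpenalized least squares coefficients, which is the shrinkage the $(1 - 2\alpha\lambda)$ factor produces.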

# Lasso Regression
- Unlike ridge regression, which uses a quadratic (L2) penalty, lasso regression adds an L1 penalty that is linear in the absolute value of each coefficient.
- The linear penalty shrinks small coefficients more aggressively than the quadratic penalty does.
- #### Sparsity:
    - Lasso regression can drive some coefficients to exactly zero.
    - For small coefficients, the penalty is strong enough to make them zero.
    - For larger coefficients, the penalty shrinks them but does not necessarily make them zero.
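
To make the difference in shrinkage concrete, the sketch below compares the pull each penalty exerts on coefficients of different sizes. It uses the gradient of the quadratic penalty, $2\lambda\beta$, and the subgradient of the linear penalty, $\lambda\,\text{sign}(\beta)$. The coefficient values and `lam` are arbitrary choices for illustration.

```python
import numpy as np

lam = 1.0
beta = np.array([0.01, 0.1, 1.0, 10.0])  # small to large coefficients

ridge_pull = 2 * lam * beta        # gradient of lam * ||beta||^2
lasso_pull = lam * np.sign(beta)   # subgradient of lam * ||beta||_1 (beta != 0)

# Pull measured relative to the size of each coefficient.
ridge_relative = ridge_pull / np.abs(beta)  # the same for every coefficient
lasso_relative = lasso_pull / np.abs(beta)  # grows as the coefficient shrinks

for b, r, l in zip(beta, ridge_relative, lasso_relative):
    print(f"beta={b:>5}: ridge relative pull={r:.2f}, lasso relative pull={l:.2f}")
```

The ridge pull fades as a coefficient approaches zero, while the lasso pull keeps the same magnitude, which is why small coefficients can be pushed all the way to zero.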

- The cost function for lasso regression:
$$ J(\beta) = ||y-X\beta||^2 + \lambda||\beta||_1$$
    - where $||\beta||_1 = \sum_j |\beta_j|$ is the sum of the absolute values of the coefficients.

- For the gradient calculation, $|\beta_j|$ is not differentiable at $0$, so we use a subgradient, with $\text{sign}(\beta)$ applied element-wise and $\text{sign}(0) = 0$:
$$ \nabla J(\beta) = -2X^T(y-X\beta)+\lambda\,\text{sign}(\beta)$$
- Update rule, repeated until convergence:
$$ \beta \leftarrow \beta - \alpha \nabla J(\beta) $$
$$ \beta \leftarrow \beta - \alpha[-2X^T(y-X\beta)+\lambda\,\text{sign}(\beta)]$$
$$ \beta \leftarrow \beta + 2\alpha X^T(y-X\beta)- \alpha\lambda\,\text{sign}(\beta)$$
    - here, the term $2\alpha X^T(y-X\beta)$ aims to reduce the residuals
    - the term $\alpha\lambda\,\text{sign}(\beta)$ shrinks each coefficient toward zero by a constant amount $\alpha\lambda$ per step, regardless of its magnitude
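
The lasso update can be sketched in the same way, using the sign of each coefficient as the subgradient of its absolute value (`np.sign` returns 0 at 0). The data, `lam`, `alpha`, and the helper `lasso_cost` are assumptions for illustration; the third true coefficient is deliberately set to zero. Note that plain subgradient steps tend to hover very close to zero rather than land exactly on it, so dedicated solvers such as coordinate descent or proximal methods are typically used when exact zeros are needed.

```python
import numpy as np

# Synthetic data (assumption): the third true coefficient is zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=100)

lam = 20.0     # penalty strength (illustrative)
alpha = 1e-3   # learning rate (illustrative)

def lasso_cost(beta):
    """Sum of squared residuals plus lam * sum of |beta_j|."""
    r = y - X @ beta
    return r @ r + lam * np.abs(beta).sum()

beta = np.zeros(3)
for _ in range(2000):
    # Subgradient of the lasso cost at the current beta.
    grad = -2 * X.T @ (y - X @ beta) + lam * np.sign(beta)
    beta = beta - alpha * grad

print("lasso (subgradient descent):", beta)
print("cost at zero:", lasso_cost(np.zeros(3)), "cost at beta:", lasso_cost(beta))
```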