# LASSO REGRESSION

### Lasso Regression Loss Function
The Lasso (Least Absolute Shrinkage and Selection Operator) regression loss function is given by:

$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|
$

where:
- $m$ is the number of training examples.
- $h_{\theta}(x) = \theta^T x$ is the hypothesis or predicted value.
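- $x^{(i)}$ and $y^{(i)}$ are the feature vector and actual target value of the $i$-th training example.
- $\lambda \geq 0$ is the regularization parameter, which controls the strength of the $L1$ penalty.
- $\theta_j$ is the $j$-th model parameter (coefficient).
- $n$ is the number of model parameters.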
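The loss above can be evaluated directly with NumPy. The following is a minimal sketch, assuming `X` is an $m \times n$ array whose rows are the $x^{(i)}$ and that every entry of `theta` is penalized; the function name `lasso_loss` and the tiny dataset are hypothetical, chosen only for illustration.

```python
import numpy as np

def lasso_loss(theta, X, y, lam):
    """Squared-error term with the 1/(2m) factor, plus the L1 penalty."""
    m = X.shape[0]
    residuals = X @ theta - y                     # h_theta(x^(i)) - y^(i) for all i
    squared_error_term = (residuals @ residuals) / (2 * m)
    l1_term = lam * np.sum(np.abs(theta))         # lambda * sum_j |theta_j|
    return squared_error_term + l1_term

# Hypothetical toy data: m = 2 examples, n = 2 parameters
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.5, -0.5])

loss = lasso_loss(theta, X, y, lam=0.1)
print(loss)  # approximately 2.225 (2.125 from the squared error, 0.1 from the penalty)
```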

### Gradient of the Loss Function
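To perform gradient descent, we need a subgradient of the loss function with respect to $\theta$, because the $L1$ term is not differentiable at $\theta_j = 0$. Away from zero, the derivative of $|\theta_j|$ is $\text{sign}(\theta_j)$; at zero, any value in $[-1, 1]$ is a valid subgradient, and the usual choice is $0$. With that choice, a subgradient of $J$ with respect to $\theta_j$ is: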

$
\frac{\partial J(\theta)}{\partial \theta_j} =
\begin{cases}
\frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda & \text{if } \theta_j > 0 \\
\frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \lambda & \text{if } \theta_j < 0 \\
\frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} & \text{if } \theta_j = 0
\end{cases}
$

Because $\text{sign}(0) = 0$, all three cases can be written as a single vectorized expression:

$
\nabla_{\theta} J(\theta) = \frac{1}{m} X^T (X\theta - y) + \lambda \cdot \text{sign}(\theta)
$

where $\text{sign}(\theta)$ is the sign function applied elementwise: for each $\theta_j$ it returns $-1$, $0$, or $1$ according to whether $\theta_j$ is negative, zero, or positive.
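The vectorized subgradient translates almost line for line into NumPy. This is a sketch under the same assumptions as the formula ($X$ is $m \times n$, all parameters penalized); `lasso_subgradient` is a hypothetical name. `np.sign` returns `0` for zero entries, which matches the $\theta_j = 0$ case.

```python
import numpy as np

def lasso_subgradient(theta, X, y, lam):
    """One valid subgradient of the Lasso loss: (1/m) X^T (X theta - y) + lam * sign(theta)."""
    m = X.shape[0]
    squared_error_grad = X.T @ (X @ theta - y) / m
    l1_subgrad = lam * np.sign(theta)   # np.sign(0) == 0, so zero coefficients get no penalty term
    return squared_error_grad + l1_subgrad

# Hypothetical toy data; the second coefficient is exactly zero
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.5, 0.0])

g = lasso_subgradient(theta, X, y, lam=0.1)
print(g)  # approximately [-0.9, -1.5]
```

Only the first component picks up the $+\lambda$ term here, since the second coefficient is zero.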

### Gradient Descent Update Rule
The update rule for $\theta$ using gradient descent is:

$
\theta := \theta - \alpha \nabla_{\theta} J(\theta)
$

Substituting the subgradient, we get:

$
\theta := \theta - \alpha \left( \frac{1}{m} X^T (X\theta - y) + \lambda \cdot \text{sign}(\theta) \right)
$

where:
- $\alpha$ is the learning rate.
- $X$ is the $m \times n$ matrix of input features, whose $i$-th row is $x^{(i)}$.
- $y$ is the vector of the $m$ target values.
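The update rule can be run as a simple loop. The sketch below is illustrative only: the function name `lasso_gradient_descent`, the zero initialization, the fixed iteration count, and the synthetic noiseless data are all assumptions, not part of the derivation above.

```python
import numpy as np

def lasso_gradient_descent(X, y, lam=0.1, alpha=0.01, n_iters=1000):
    """Repeat theta := theta - alpha * subgradient for a fixed number of steps."""
    m, n = X.shape
    theta = np.zeros(n)  # assumed starting point
    for _ in range(n_iters):
        subgrad = X.T @ (X @ theta - y) / m + lam * np.sign(theta)
        theta = theta - alpha * subgrad
    return theta

# Hypothetical synthetic data: the second feature has no effect on y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, 0.0, -1.0])
y = X @ true_theta

theta_hat = lasso_gradient_descent(X, y, lam=0.1, alpha=0.05, n_iters=2000)
print(theta_hat)
```

On data like this, the two informative coefficients should come out somewhat smaller in magnitude than their true values (the shrinkage effect), while the irrelevant one stays close to zero.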

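The update therefore includes a shrinkage term, $\alpha \lambda \cdot \text{sign}(\theta)$, that pulls every nonzero coefficient toward zero by the same fixed amount per step. Unlike Ridge regression, where the penalty is quadratic (squared values of $\theta_j$) and its pull weakens as a coefficient shrinks, the Lasso penalty is linear in $|\theta_j|$, so its pull stays constant however small the coefficient is. This leads to sparsity in the model: at the Lasso solution, some coefficients can be exactly zero. With the plain subgradient update, such coefficients typically oscillate in a small band around zero rather than landing on it exactly. This sparsity makes Lasso particularly useful for feature selection in high-dimensional datasets.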
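To make the contrast with Ridge concrete, the sketch below compares the penalty's contribution to the gradient for a few coefficient values. It assumes a Ridge penalty of the form $\lambda \sum_j \theta_j^2$, whose gradient is $2\lambda\theta_j$; the text above does not fix the Ridge scaling, so that form and the variable names are assumptions.

```python
import numpy as np

lam = 0.1
theta = np.array([2.0, 0.5, 0.01, -0.01])

lasso_pull = lam * np.sign(theta)   # constant magnitude lam for every nonzero coefficient
ridge_pull = 2 * lam * theta        # proportional to the coefficient (assumed Ridge form)

for t, l, r in zip(theta, lasso_pull, ridge_pull):
    print(f"theta={t:+.2f}  lasso={l:+.3f}  ridge={r:+.3f}")
```

For the coefficients near zero, the Ridge term becomes negligible while the Lasso term keeps its full magnitude $\lambda$.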