# Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that performs both regularization and variable selection to enhance prediction accuracy and interpretability.

### 1. Mathematical Objective
Lasso minimizes the sum of squared residuals plus a penalty proportional to the sum of the absolute values of the coefficients:

$$ \min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} $$

Where:
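- $\lambda \ge 0$ is the regularization parameter. scikit-learn calls it `alpha`, but its `Lasso` objective scales the squared-error term by $1/(2n)$, so its `alpha` corresponds to $\lambda / (2n)$ in the formula above.
- $\lambda \sum_{j=1}^{p} |\beta_j|$ is the **L1 penalty**. The intercept $\beta_0$ is not penalized.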
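
To make the objective concrete, here is a minimal sketch that evaluates it directly with NumPy. The function name `lasso_objective` and the toy numbers are assumptions chosen for illustration; it follows the formula above as written (no $1/(2n)$ scaling).

```python
import numpy as np

def lasso_objective(X, y, beta0, beta, lam):
    """Residual sum of squares plus the L1 penalty (intercept unpenalized)."""
    residuals = y - beta0 - X @ beta
    rss = np.sum(residuals ** 2)
    l1_penalty = lam * np.sum(np.abs(beta))
    return rss + l1_penalty

# Tiny hand-checkable example (made-up numbers): beta fits y exactly.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

perfect_fit = lasso_objective(X, y, 0.0, beta, lam=0.0)  # RSS only: 0.0
penalized = lasso_objective(X, y, 0.0, beta, lam=0.5)    # 0 + 0.5 * (1 + 2) = 1.5
print(perfect_fit, penalized)
```

Even with a perfect fit, a positive $\lambda$ makes the objective nonzero, which is what pushes the optimizer toward smaller coefficients.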

### 2. Key Characteristics
*   **Feature Selection:** Unlike Ridge (L2), Lasso can shrink coefficients to exactly zero, effectively performing automatic feature selection.
*   **Sparsity:** It produces "sparse" models, which are easier to interpret in high-dimensional datasets.
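*   **Scaling Requirement:** The penalty depends on the magnitude of the coefficients, which in turn depends on each feature's units, so features should be standardized (mean=0, variance=1) before fitting. Otherwise a feature recorded on a small numeric scale needs a large coefficient and is penalized more heavily than its importance warrants.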
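
The scaling point can be illustrated with a short sketch. The helper `standardize` and the distance values are assumptions made up for this example; the same quantity recorded in two different units becomes identical after standardization, so the L1 penalty treats it the same way in both cases.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

# The same (made-up) distances, recorded in metres and in kilometres.
distance_m = np.array([[1000.0], [2000.0], [3000.0]])
distance_km = distance_m / 1000.0

z_m, _, _ = standardize(distance_m)
z_km, _, _ = standardize(distance_km)

print(np.allclose(z_m, z_km))  # True
```

In practice the mean and standard deviation would be computed on the training data only and reused for new data, and a constant column (zero variance) would need special handling because this sketch divides by `std` directly.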

### 3. Bias-Variance Trade-off
*   **Increasing $\lambda$:** Increases bias but decreases variance (prevents overfitting).
*   **Decreasing $\lambda$:** Decreases bias but increases variance (approaches OLS).


As $\lambda$ increases, Lasso drives more and more coefficients to zero, removing those features from the model and keeping only the ones with the strongest predictive signal. Pushed too far, this leads to underfitting; conversely, a $\lambda$ that is too small keeps nearly every feature and risks overfitting.

There are 3 main things to know about alpha ($\lambda$) in Lasso:

1. Alpha = Penalty Strength: It controls how strongly large coefficients (and therefore complex models) are penalized.
2. Alpha = 0 (No Penalty): This is just ordinary Linear Regression (OLS). All features are used, even useless ones.
3. Alpha > 0 (Feature Killer): As you increase alpha, Lasso forces the coefficients of the least useful features to become exactly zero. This effectively deletes those features from the model.

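As a rough guide (what counts as small or large depends on the scale of the data):
- Small Alpha (e.g., 0.01): Keeps most features, slight regularization (Low Bias, Higher Variance).
- Large Alpha (e.g., 10): Keeps only the most important features, kills the rest (High Bias, Low Variance).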
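
The effect of alpha can be seen in a small sweep. This is a sketch on synthetic data, not a benchmark: the data-generating setup (10 features, only 3 of them relevant) and the alpha grid are assumptions chosen for illustration, and it uses scikit-learn's `Lasso`, whose `alpha` follows that library's own scaling of the objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 10 standard-normal
# features, of which only the first 3 actually influence y.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.standard_normal((n_samples, n_features))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_beta + 0.5 * rng.standard_normal(n_samples)

# Simple hold-out split to compare training and test error.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

alphas = [0.01, 0.1, 1.0, 10.0]
n_nonzero, train_mse, test_mse = [], [], []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train, y_train)
    n_nonzero.append(int(np.count_nonzero(model.coef_)))
    train_mse.append(float(np.mean((y_train - model.predict(X_train)) ** 2)))
    test_mse.append(float(np.mean((y_test - model.predict(X_test)) ** 2)))

for alpha, k, tr, te in zip(alphas, n_nonzero, train_mse, test_mse):
    print(f"alpha={alpha:<5} nonzero={k:>2}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

On data like this, the number of nonzero coefficients should fall as alpha grows, and at the largest alpha the model should collapse to the intercept alone, which shows up as a much higher error on both splits (underfitting). In practice alpha is usually chosen by cross-validation rather than from a fixed grid like this one.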

### Comparison: Ridge Regression vs. Lasso Regression

| Feature | Ridge Regression ($L_2$ Regularization) | Lasso Regression ($L_1$ Regularization) |
| :--- | :--- | :--- |
| **Penalty Term** | Adds squared magnitude of coefficients: $\lambda \sum_{j=1}^p \beta_j^2$ | Adds absolute magnitude of coefficients: $\lambda \sum_{j=1}^p |\beta_j|$ |
| **Coefficient Shrinkage** | Shrinks coefficients toward zero, but in general not to exactly zero for any finite $\lambda$. | Can shrink coefficients exactly to zero. |
| **Feature Selection** | Does **not** perform feature selection; retains all variables. | Performs **automatic feature selection** by nullifying irrelevant features. |
| **Solution Type** | Analytical solution exists (closed-form). | No closed-form solution in general (requires numerical optimization such as Coordinate Descent). |
| **Computational Complexity** | Generally faster to compute. | Slightly more computationally intensive due to non-differentiable absolute value. |
| **Best Used When...** | Most features are useful and have small/medium effects. | Only a few features are significant (sparse models). |
| **Multicollinearity** | Handles multicollinearity by distributing the penalty among correlated variables. | Tends to pick one variable from a group of correlated variables and ignores the rest. |
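
The "Solution Type" and "Coefficient Shrinkage" rows can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production solver: the function names are hypothetical, the data are assumed to be centered (no intercept), both penalties use the unscaled forms from the table, and the toy design matrix is the identity so the result can be checked by hand.

```python
import numpy as np

def soft_threshold(rho, t):
    """S(rho, t) = sign(rho) * max(|rho| - t, 0)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - t, 0.0)

def ridge_closed_form(X, y, lam):
    """Minimize RSS + lam * sum(beta**2): beta = (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize RSS + lam * sum(|beta|) by cycling over coordinates."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq_norms = np.sum(X ** 2, axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: what is left of y without feature j's contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho_j = X[:, j] @ r_j
            # One-dimensional minimizer: soft-threshold at lam / 2, then rescale.
            beta[j] = soft_threshold(rho_j, lam / 2.0) / col_sq_norms[j]
    return beta

# Toy orthonormal design (made-up numbers), so the OLS solution is y itself.
X = np.eye(3)
y = np.array([3.0, 0.5, -2.0])
lam = 2.0

lasso_beta = lasso_coordinate_descent(X, y, lam)
ridge_beta = ridge_closed_form(X, y, lam)
print(lasso_beta)  # [2, 0, -1]: the small middle coefficient is exactly zero
print(ridge_beta)  # [1, 1/6, -2/3]: every coefficient shrunk, none exactly zero
```

Lasso subtracts a fixed amount ($\lambda/2$ here) from each coefficient and clips at zero, while Ridge divides every coefficient by the same factor ($1 + \lambda$ here), which is why only Lasso removes features outright.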

#### Geometric Interpretation
*   **Ridge ($L_2$):** The constraint region is a **circle/hypersphere**. The elliptical contours of the least squares error function usually hit the circle at a point where no coordinate is zero.
*   **Lasso ($L_1$):** The constraint region is a **diamond/hyper-octahedron**. The error contours often hit the "corners" of the diamond on an axis, resulting in coefficients becoming exactly zero.
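
These constraint regions come from the equivalent constrained form of the two problems. As a sketch, writing $t \ge 0$ for a coefficient "budget" (a symbol introduced here for illustration), each value of $\lambda$ corresponds to some $t$ for which the penalized and constrained problems share the same solution:

$$ \text{Ridge:} \quad \min_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t $$

$$ \text{Lasso:} \quad \min_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t $$

With $p = 2$, the Ridge constraint $\beta_1^2 + \beta_2^2 \le t$ is a disk of radius $\sqrt{t}$, while the Lasso constraint $|\beta_1| + |\beta_2| \le t$ is a diamond with corners at $(\pm t, 0)$ and $(0, \pm t)$. Every corner lies on an axis, so a solution at a corner has one coefficient equal to zero.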
