# LASSO Regression (L1 Regularization)

## 1. Definition
**LASSO** = **L**east **A**bsolute **S**hrinkage and **S**election **O**perator.

It is a regression analysis method that performs both **variable selection** and **regularization** in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

---

## 2. The Cost Function (Math)

Lasso adds an **L1 penalty** term to the standard least-squares loss, the residual sum of squares (RSS). Dividing the RSS by $n$ gives the Mean Squared Error (MSE), which only rescales $\lambda$.

$$J(\beta) = \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{k} |\beta_j|}_{\text{L1 Penalty}}$$

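* **$\lambda$ (Lambda):** The tuning parameter that sets the penalty strength. $\lambda = 0$ gives ordinary least squares; larger values shrink coefficients more and set more of them to exactly zero.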
* **$|\beta_j|$:** The absolute value of the coefficients.
* **Note:** We do **not** regularize the intercept ($\beta_0$).
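
As a quick numeric check of the formula above, here is a minimal sketch that evaluates the cost for a given set of coefficients. The function name `lasso_cost` and the toy numbers are assumptions made for illustration. Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ used here.

```python
import numpy as np

def lasso_cost(X, y, beta0, beta, lam):
    """Residual sum of squares plus lam * sum(|beta_j|).

    The intercept beta0 enters the predictions but not the penalty.
    """
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))

# Toy data (hypothetical values chosen so the arithmetic is easy to follow)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.25])

# Residuals are 0.5, 1.0, 1.5 -> RSS = 3.5; penalty = 2.0 * (0.5 + 0.25) = 1.5
cost = lasso_cost(X, y, beta0=0.5, beta=beta, lam=2.0)
print(cost)  # 5.0
```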

---

## 3. Why Coefficients Hit Zero (Geometric Intuition)



* **Ridge (L2):** The constraint region is a **circle** ($\beta_1^2 + \beta_2^2 \le t$ in two dimensions, for some budget $t$). Its boundary is smooth, so the error contours typically touch it at a point off the axes. Coefficients shrink toward 0 but are generally not exactly 0.
* **Lasso (L1):** The constraint region is a **diamond** ($|\beta_1| + |\beta_2| \le t$). Its corners lie on the axes, and the elliptical error contours often make first contact with the region at one of these sharp corners.
    * **Result:** When the solution lands on a corner, one of the coefficients is exactly 0.
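
The same contrast can be seen algebraically in the simplest possible case: one coefficient at a time, with a unit-scaled feature. The sketch below assumes the one-dimensional objectives written in the comments (the factor 0.5 is a convention chosen here to keep the formulas tidy); the function names are illustrative only.

```python
import numpy as np

def ridge_shrink(b, lam):
    # Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    # Minimizer of 0.5*(b - beta)**2 + lam*|beta|: soft-thresholding
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

ols = np.array([3.0, 0.8, -0.3])  # hypothetical unpenalized estimates
lam = 1.0

ridge = ridge_shrink(ols, lam)  # every entry is scaled down, none becomes 0
lasso = lasso_shrink(ols, lam)  # entries with |b| <= lam become exactly 0

print("Ridge:", ridge)
print("Lasso:", lasso)
```

Ridge divides every estimate by the same factor, so a nonzero estimate stays nonzero. Lasso subtracts a fixed amount from the magnitude, so small estimates are cut off at zero.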



---

## 4. Optimization (How it solves)

Unlike ordinary least squares or Ridge Regression, Lasso has **no general closed-form solution** (you can't solve it in one step with matrix algebra) because the absolute value function $|\beta|$ is not differentiable at zero. It is solved iteratively instead.

**Algorithms used:**
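1.  **Coordinate Descent:** (Most common) Optimizes one coefficient at a time while holding the others fixed; each one-dimensional update has a simple soft-thresholding solution.
2.  **LARS (Least Angle Regression):** A modified version traces the entire Lasso solution path as $\lambda$ varies; it is efficient when the number of features is much larger than the number of samples.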
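
To make the coordinate descent idea concrete, here is a minimal NumPy sketch. It assumes centered data with no intercept and uses the objective $\frac{1}{2n}\lVert y - X\beta \rVert^2 + \alpha \lVert \beta \rVert_1$ (the scaling scikit-learn documents for `Lasso`). The function names, the fixed number of sweeps, and the synthetic data are illustrative choices, not a production solver.

```python
import numpy as np

def soft_threshold(rho, alpha):
    # Closed-form solution of the one-dimensional Lasso problem
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # (1/n) * x_j^T x_j for each column
    residual = y - X @ beta             # full residual, updated in place below
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]      # remove feature j's contribution
            rho = X[:, j] @ residual / n       # correlation with the partial residual
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * beta[j]      # put the updated contribution back
    return beta

# Synthetic data (an assumption for illustration): only features 0 and 3 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=200)
X = X - X.mean(axis=0)
y = y - y.mean()

beta = lasso_coordinate_descent(X, y, alpha=0.1)
print(beta)
```

The irrelevant features end up at exactly zero because their correlation with the residual never exceeds the threshold `alpha`.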

---

## 5. Lasso vs. Ridge (Comparison)

| Topic                 | **Lasso (L1)**                                    | **Ridge (L2)**                                 |
| --------------------- | ------------------------------------------------- | ---------------------------------------------- |
| **Penalty type**      | Uses **absolute value** of coefficients           | Uses **square** of coefficients                |
| **Feature selection** | **Yes**: sets some coefficients to exactly **0**  | **No**: makes coefficients **small**, but not 0 |
| **Multicollinearity** | Tends to keep **one** correlated feature and drop the others | Keeps **all** features but shrinks them        |
| **Solution method**   | Solved using **iterative algorithms**             | Has a **direct (closed-form)** solution        |
| **Best use case**     | When **few features are important**               | When **all features matter a little**          |
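
The feature-selection row of the table can be checked with a small experiment. This sketch assumes a synthetic dataset in which only 3 of 20 features carry signal; the `alpha` values are arbitrary choices for illustration, and the two models' `alpha` parameters are not on the same scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative assumption): 20 features, only 3 informative
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=1.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Zero coefficients, Lasso:", n_zero_lasso)
print("Zero coefficients, Ridge:", n_zero_ridge)
```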


---

## 6. Critical Requirement: Feature Scaling

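**Standardize your features** before using Lasso.
* **Why?** Lasso penalizes the *magnitude* of coefficients, and a coefficient's magnitude depends on the units of its feature.
* If "Salary" is in the hundred-thousands (100,000) and "Age" is in double digits (40), "Salary" needs only a tiny coefficient to affect the prediction while "Age" needs a much larger one. Lasso therefore penalizes "Age" more heavily simply because of its units, not because it is less important.
* **Fix:** Use `StandardScaler`, fitted on the training data only.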
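
One way to apply the fix is to chain `StandardScaler` and `Lasso` in a scikit-learn pipeline, so the scaler is always fitted on whatever data the model is fitted on. The sketch below uses made-up "Salary" and "Age" columns that are, by construction, equally important once measured in standard deviations; the data and the `alpha` value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
salary = rng.normal(100_000, 20_000, n)   # large raw units
age = rng.normal(40, 10, n)               # small raw units

# Both features contribute equally per standard deviation
y = 3 * (salary - 100_000) / 20_000 + 3 * (age - 40) / 10 + rng.normal(0, 1, n)
X = np.column_stack([salary, age])

# Without scaling: coefficient sizes mostly reflect the units of each column
raw = Lasso(alpha=1.0).fit(X, y)
raw_coef = raw.coef_

# With scaling: the penalty treats both features on the same footing
scaled = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
scaled_coef = scaled.named_steps["lasso"].coef_

print("Unscaled coefficients:", raw_coef)
print("Scaled coefficients:  ", scaled_coef)
```

In the scaled fit the two coefficients come out close to each other, which matches how the data was generated. In the unscaled fit they differ by orders of magnitude, so the same penalty weighs very differently on each.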

---

## 7. The Weakness & The Fix (Elastic Net)

**The Problem:**
If two features are highly correlated (e.g., *Height* and *Leg Length*), Lasso tends to keep one, more or less arbitrarily, and shrink the other to zero. Which one survives can change with small changes in the data, so the selection is unstable.

**The Solution: Elastic Net**
Combines the L1 (Lasso) and L2 (Ridge) penalties.
$$J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{k} |\beta_j| + \lambda_2 \sum_{j=1}^{k} \beta_j^2$$
* It tends to give correlated features similar coefficients (Ridge effect), so they are kept or dropped as a group (Lasso effect).
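
The grouping behaviour can be illustrated with two nearly identical features. This is a sketch on made-up data; scikit-learn's `ElasticNet` expresses the two penalties through `alpha` and `l1_ratio` rather than separate $\lambda_1$ and $\lambda_2$, and the values chosen here are arbitrary. How Lasso splits the weight between the two columns can vary with the data and the solver, so only the Elastic Net result should be read as typical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)

# Two almost identical features (hypothetical stand-ins for Height and Leg Length)
x1 = z + 0.01 * rng.normal(size=n)
x2 = z + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * z + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

The L2 part of the Elastic Net penalty makes it costly to put all the weight on one of two near-duplicates, so the two coefficients end up close to each other.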

---

# Applications of L1 Regularization (Lasso)

## 1. Lasso Regression (Linear Models)
This is the most common application of L1 regularization.

* **Goal:** To prevent overfitting in linear models while simultaneously performing **Feature Selection**.
* **Mechanism:** It adds the absolute value of coefficients ($|\beta|$) to the cost function.
* **Result:** It forces the coefficients of weak or irrelevant features to become **exactly zero**, effectively removing them from the equation.
* **Best For:** Creating **Sparse Models** (models where most weights are zero) which are easier to interpret and faster to run.

---

## 2. Logistic Regression with L1
L1 regularization is not limited to regression; it is widely used in classification tasks.

* **Goal:** To identify which features most strongly discriminate between classes (e.g., "Spam" vs. "Not Spam").
* **How it works:** The solver (like `liblinear` in Scikit-Learn) minimizes the Log Loss + L1 Penalty.
* **Use Case:** High-dimensional classification, such as:
    * **Text Classification:** Determining which specific words (features) predict the sentiment of a review.
    * **Genomics:** Identifying which specific genes (out of thousands) predict a disease.

---

## 3. Neural Networks 
In Deep Learning, L1 regularization is applied to the weight matrices of the layers.

* **Goal:** Reduce model complexity and memory footprint.
* **Input Layer Pruning:** Applying L1 to the weights of the **first hidden layer** is a useful trick. It pushes the weights attached to irrelevant input features toward zero; an input whose outgoing weights are all zero is ignored entirely, so the layer acts as an automatic feature selector for raw data.
* **Network Sparsity:**
    * **L2 (Weight Decay):** Makes weights small (diffuse).
    * **L1 Regularization:** Drives many weights to zero or very close to it (sparse). This enables "Network Pruning" (compressing the model by removing zero-weight connections).
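
The input-pruning idea can be sketched without a deep learning framework. The example below is a deliberately simplified assumption: a single linear layer stands in for the first layer of a network, and training uses a proximal (soft-thresholding) update, which is one way to obtain exact zeros. Adding an L1 term to the loss and running plain gradient descent usually leaves weights small but not exactly zero. All names and values are hypothetical.

```python
import numpy as np

def proximal_l1_step(W, grad, lr, lam):
    """Gradient step on the data loss, then soft-threshold every weight."""
    W_new = W - lr * grad
    return np.sign(W_new) * np.maximum(np.abs(W_new) - lr * lam, 0.0)

rng = np.random.default_rng(0)
n, n_inputs, n_hidden = 500, 6, 3
X = rng.normal(size=(n, n_inputs))

# Only inputs 0 and 1 influence the targets; inputs 2..5 are irrelevant
W_true = np.zeros((n_inputs, n_hidden))
W_true[0] = [2.0, -1.0, 0.5]
W_true[1] = [1.5, 0.5, -2.0]
Y = X @ W_true + 0.1 * rng.normal(size=(n, n_hidden))

# Stand-in for first-layer weights, trained on a mean-squared-error loss
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
for _ in range(500):
    grad = X.T @ (X @ W1 - Y) / n
    W1 = proximal_l1_step(W1, grad, lr=0.1, lam=0.05)

# An input is pruned when every weight leaving it is exactly zero
pruned_inputs = np.flatnonzero(np.all(W1 == 0, axis=1))
print("Pruned inputs:", pruned_inputs)
```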

---

## 4. Summary: When to Apply L1?

| Domain | Application | Benefit |
| :--- | :--- | :--- |
| **Finance** | Predicting credit risk | Removes irrelevant financial indicators to explain *why* a loan was rejected. |
| **Bioinformatics** | DNA Analysis | Selects the handful of genes associated with a trait out of 20,000 candidates. |
| **Computer Vision** | Compressed Sensing | Reconstructs images from fewer measurements by assuming the image signal is sparse in some basis. |
| **NLP** | Sentiment Analysis | Isolates key keywords (e.g., "Terrible", "Amazing") and ignores filler words (e.g., "the", "is"). |



In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# California housing: 8 numeric features, target is the median house value
data = fetch_california_housing()
X = data.data
y = data.target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Lasso model; scikit-learn's `alpha` plays the role of lambda.
# Note: the features are used unscaled here, so the penalty does not
# treat them equally (see the feature-scaling section).
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluation on the held-out test set
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))


Coefficients: [ 3.92693362e-01  1.50810624e-02 -0.00000000e+00  0.00000000e+00
  1.64168387e-05 -3.14918929e-03 -1.14291203e-01 -9.93076483e-02]
Intercept: -7.698845419807455
MSE: 0.6135115198058131
R2 Score: 0.5318167610318159


In [3]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Diabetes dataset: 10 features, already mean-centered and scaled by scikit-learn
data = load_diabetes()
X = data.data
y = data.target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Coefficients printed as exactly 0 belong to features Lasso dropped
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))


Coefficients: [   0.         -152.66477923  552.69777529  303.36515791  -81.36500664
   -0.         -229.25577639    0.          447.91952518   29.64261704]
Intercept: 151.57485282893947
MSE: 2798.193485169719
R2 Score: 0.4718547867276227
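

In the output above, the coefficients at positions 0, 5 and 7 are zero. The sketch below repeats the same fit and maps the coefficients back to the dataset's `feature_names`, which makes the selection easier to read. The variable names `kept` and `dropped` are illustrative, and the exact split may depend on the scikit-learn version.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = Lasso(alpha=0.1).fit(X_train, y_train)

# Pair each coefficient with its feature name and split by zero / non-zero
kept = [name for name, c in zip(data.feature_names, model.coef_) if c != 0]
dropped = [name for name, c in zip(data.feature_names, model.coef_) if c == 0]

print("Kept:   ", kept)
print("Dropped:", dropped)
```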


In [4]:
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression dataset. With the default n_informative=10, all 10
# features carry signal, so Lasso has nothing to drop at this alpha.
X, y = make_regression(
    n_samples=500,
    n_features=10,
    noise=10,
    random_state=42
)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))


Coefficients: [41.82204111 64.12395241 18.73201774 46.72243239 23.9565675  16.39228576
 82.69535426 64.17889918  7.4871428  28.65860581]
Intercept: -0.2927999172299187
MSE: 97.1743302138308
R2 Score: 0.9950667951885857
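

All three runs above fix `alpha=0.1` by hand. Since `alpha` is the tuning parameter ($\lambda$), it is usually chosen by cross-validation instead. Here is a minimal sketch using `LassoCV` on the same synthetic data; the 5-fold setting is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try a grid of alpha values with 5-fold cross-validation on the training set
cv_model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

best_alpha = cv_model.alpha_
test_r2 = r2_score(y_test, cv_model.predict(X_test))

print("Selected alpha:", best_alpha)
print("Test R2:", test_r2)
```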
