# Lasso Regression (L1 Regularization)


## Summary

* **Lasso Regression**, also known as **L1 Regularization**, is linear regression with an L1 penalty; it reduces overfitting and also performs **feature selection**.
* It modifies the cost function by adding a penalty term: **Lambda ($\lambda$)** multiplied by the sum of the **absolute values of the slopes** (coefficients).
* A key property of Lasso Regression is that it can reduce the coefficients of less important features to exactly **zero**.
* This zeroing out of coefficients effectively removes the corresponding feature from the model, making it ideal for datasets with many features.

## Exam Notes

### Feature Selection vs. Overfitting

**Question:** Why do we use Lasso Regression compared to Ridge Regression?

**Answer:**  
Both techniques reduce overfitting by shrinking coefficients, but Ridge Regression keeps every feature in the model with a small non-zero coefficient, whereas **Lasso Regression** can also perform **feature selection**.  
It helps identify and retain only the most important features in a model.

### Relationship Between Lambda and Slope

**Question:** What happens to the slope when Lambda increases in Lasso Regression?

**Answer:**  
As **Lambda ($\lambda$)** increases, the magnitude of the **slope ($\theta$)** shrinks toward zero.  
Unlike Ridge Regression, where the slope gets smaller but does not reach exactly zero, in Lasso Regression the slope can eventually become **zero**, indicating that the corresponding feature has been removed from the model.

## Lasso Regression Details

**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that includes a regularization term.  
It is also referred to as **L1 Regularization**.

### The Cost Function

The cost function for Lasso Regression is the standard Mean Squared Error (MSE) plus a penalty term based on the **absolute value** of the slopes.

$$
J(\theta) =
\frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
+
\lambda \sum_{j=1}^{n} |\theta_j|
$$

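* **First Term:** Standard Mean Squared Error (MSE) from linear regression, averaged over the $m$ training examples (the extra factor of $\frac{1}{2}$ is a convention that simplifies the gradient).
* **Second Term:** Regularization penalty using the **absolute value** of each of the $n$ feature coefficients (L1 norm), not the square as in Ridge Regression.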
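As a minimal sketch, the two terms of the cost function can be computed directly. The function name `lasso_cost`, the tiny dataset, and the choice to leave the intercept `theta0` unpenalized are assumptions made for illustration.

```python
def lasso_cost(theta0, theta, X, y, lam):
    """Lasso cost: (1/2m) * sum of squared errors + lam * sum of |theta_j|.

    theta0 is the intercept and is left unpenalized (an assumption here).
    """
    m = len(y)
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = theta0 + sum(t * x for t, x in zip(theta, row))
        squared_error += (prediction - target) ** 2
    l1_penalty = lam * sum(abs(t) for t in theta)
    return squared_error / (2 * m) + l1_penalty


# Hypothetical dataset: three samples, two features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [3.0, 2.0, 4.0]
theta = [1.0, 0.5]

cost_no_penalty = lasso_cost(0.0, theta, X, y, lam=0.0)    # first term only
cost_with_penalty = lasso_cost(0.0, theta, X, y, lam=0.5)  # adds 0.5 * (1.0 + 0.5)
print(cost_no_penalty, cost_with_penalty)
```

With `lam=0.0` only the squared-error term remains; any positive `lam` adds a cost proportional to the total absolute size of the coefficients.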

### Feature Selection Mechanism

The primary use case for Lasso Regression is **feature selection**.

* **Unimportant Features:**  
  Features with weak correlation to the output tend to have their coefficients shrunk to **zero**.

* **Important Features:**  
  Features strongly correlated with the output retain **non-zero** coefficients.
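
One way to see why the L1 penalty produces exact zeros is to look at a single feature in isolation. The sketch below assumes one standardized feature (mean zero, unit variance) and the $\frac{1}{2m}$ cost scaling used above; under those assumptions the Lasso solution is the unregularized coefficient pushed toward zero by $\lambda$ and clipped at zero (often called soft-thresholding). The Ridge comparison assumes a penalty of $\lambda \theta^2$, which is not defined in these notes. The coefficient values are hypothetical.

```python
def soft_threshold(value, lam):
    """Lasso update for one standardized feature: shrink by lam, clip at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def ridge_shrink(value, lam):
    """Ridge update for the same setup, assuming a penalty of lam * theta**2."""
    return value / (1.0 + 2.0 * lam)


# Hypothetical unregularized coefficients for three features.
unregularized = [0.9, -0.4, 0.05]
lam = 0.1

lasso = [soft_threshold(t, lam) for t in unregularized]
ridge = [ridge_shrink(t, lam) for t in unregularized]

print("lasso:", lasso)  # the third coefficient is exactly 0.0
print("ridge:", ridge)  # every coefficient is smaller but still non-zero
```

Any coefficient whose unregularized magnitude is at most `lam` lands on exactly zero under Lasso, while the Ridge rescaling only makes it smaller.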

### Example Scenario

Consider a model with four features: $x_1, x_2, x_3, x_4$.

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4
$$

Suppose the initial coefficients are:

* $\theta_1 = 0.65$
* $\theta_2 = 0.72$
* $\theta_3 = 0.034$
* $\theta_4 = 0.12$

Here, $x_3$ has the smallest coefficient ($\theta_3 = 0.034$), which suggests it contributes little to the output, assuming the features are on comparable scales.  
When Lasso Regression is applied with a sufficiently large $\lambda$, this coefficient is penalized and reduced to **zero**.

The equation becomes:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + 0 \times x_3 + \theta_4 x_4
$$

The term $0 \times x_3$ disappears, effectively removing feature $x_3$ from the model.

### Relationship Between Lambda ($\lambda$) and Slope ($\theta$)

Understanding the relationship between **Lambda** and the **Slope** explains how feature selection occurs.

1. **$\lambda = 0$**  
   * No penalty is applied  
   * Model behaves like standard Linear Regression

2. **$\lambda$ Increases**  
   * Penalty grows stronger  
   * Coefficients begin to shrink

3. **Coefficient Becomes Zero**  
   * At sufficiently large $\lambda$, some coefficients become **exactly zero**
   * Corresponding features are removed from the model

This behavior makes **Lasso Regression** an effective automatic feature selection technique, especially useful for datasets with a large number of input features.
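
The three stages above can be sketched with a small cyclic coordinate descent solver, one common way to fit Lasso (the notes do not name a fitting algorithm, so this choice is an assumption). The dataset is hypothetical and built so the three feature columns are uncorrelated with unit variance, which makes the result easy to check by hand: the unpenalized slopes are 2.0, 0.5, and 0.05.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2m) * SSE + lam * sum|theta_j| by cyclic coordinate descent."""
    m, n = len(X), len(X[0])
    theta0, theta = 0.0, [0.0] * n
    for _ in range(sweeps):
        # Unpenalized intercept: mean residual after removing the feature terms.
        theta0 = sum(
            y[i] - sum(theta[k] * X[i][k] for k in range(n)) for i in range(m)
        ) / m
        for j in range(n):
            # Average product of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(m):
                partial = theta0 + sum(
                    theta[k] * X[i][k] for k in range(n) if k != j
                )
                rho += X[i][j] * (y[i] - partial)
            rho /= m
            scale = sum(X[i][j] ** 2 for i in range(m)) / m
            theta[j] = soft_threshold(rho, lam) / scale
    return theta0, theta


# Hypothetical data: all sign combinations, so the columns are uncorrelated.
signs = [
    [1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
    [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1],
]
X = [[float(v) for v in row] for row in signs]
y = [1.0 + 2.0 * a + 0.5 * b + 0.05 * c for a, b, c in X]

path = {}
for lam in [0.0, 0.1, 1.0, 3.0]:
    _, theta = lasso_coordinate_descent(X, y, lam)
    path[lam] = theta
    print(f"lambda={lam}: {[round(t, 3) for t in theta]}")
```

At `lam=0.0` all three slopes are recovered. At `lam=0.1` the weakest slope is exactly zero, at `lam=1.0` only the strongest survives, and at `lam=3.0` every slope is zero. On real data the features are usually correlated, so the order in which coefficients vanish is less clean than in this constructed example.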
