#### Why Does Lasso Regression Create Sparsity?

Lasso Regression minimizes the following objective function:
$$
\text{Loss} = \sum (y - \hat{y})^2 + \lambda \sum_i |w_i|
$$
The key difference from Ridge Regression is the L1 penalty term:
$$
\lambda \sum_i |w_i|
$$
This absolute value term is the reason Lasso produces sparse models.

---

#### 1) Role of the L1 Penalty

The L1 penalty adds a cost proportional to the absolute value of the coefficients.

Unlike L2 (w²), the absolute value function:

- Is not differentiable at zero
- Has a sharp "V" shape
- Applies constant pressure toward zero

This makes small coefficients more likely to become exactly zero.

---

#### 2) Gradient Behavior Near Zero

For Ridge (L2):
$$
\frac{d}{dw}(w^2) = 2w
$$
When w is small, the gradient is also small.  
So coefficients shrink smoothly but rarely become exactly zero.

For Lasso (L1):
$$
\frac{d}{dw}|w| =
\begin{cases}
+1, & \text{if } w > 0 \\
-1, & \text{if } w < 0
\end{cases}
$$
Near zero, the gradient does not become small; its magnitude stays at 1.  
This constant push can drive coefficients all the way to zero.
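The contrast between the two gradients can be checked numerically. The snippet below is a minimal sketch; the helper names `l2_grad` and `l1_grad` are made up for illustration.

```python
import numpy as np

def l2_grad(w):
    """Derivative of w**2."""
    return 2.0 * w

def l1_grad(w):
    """Derivative of |w| for w != 0 (it is undefined at exactly 0)."""
    return np.sign(w)

# As w approaches zero, the L2 gradient vanishes while the L1 gradient stays at 1.
for w in [1.0, 0.1, 0.001]:
    print(f"w={w:<6} L2 gradient={l2_grad(w):<6} L1 gradient={l1_grad(w)}")
```

The L2 column shrinks in step with `w`, while the L1 column keeps the same magnitude no matter how small `w` gets.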

---

#### 3) Geometric Interpretation (Most Important Intuition)

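Lasso can also be written as a constrained problem:

$$
\min_w \; \text{RSS}(w) \quad \text{subject to} \quad \|w\|_1 \le t
$$

In two dimensions, the L1 constraint region forms a diamond shape.

As the RSS contours (ellipses) expand outward from the unconstrained least-squares solution, they are more likely to first touch the diamond at one of its sharp corners.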

Those corners lie on the coordinate axes.

When the solution touches a corner:
- One or more coefficients become exactly zero.

This is sparsity.

---

#### 4) Why Ridge Does Not Create Sparsity

Ridge uses an L2 constraint:
$$
\|w\|_2 \le t
$$
The constraint region is circular (smooth boundary).

Since there are no sharp corners:
- The solution rarely lies exactly on an axis.
- Coefficients shrink but almost never become exactly zero.

---

#### 5) Optimization Insight

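Because $|w|$ is not differentiable at zero, the optimality condition is stated with subgradients.

At zero, the subgradient of $|w|$ is not a single value but the whole interval $[-1, 1]$.

Any pull from the data whose magnitude is at most $\lambda$ can be cancelled by some value in that interval, so zero satisfies the optimality condition for a whole range of data sets.

As a result, many coefficients are set exactly to zero.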
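For a single coefficient, this can be written out as a short sketch. Here $f(w)$ stands for the smooth part of the loss (the RSS); the symbol $f$ is notation introduced only for this illustration.

$$
\partial |w| =
\begin{cases}
\{+1\}, & w > 0 \\
[-1, 1], & w = 0 \\
\{-1\}, & w < 0
\end{cases}
$$

$$
w = 0 \text{ minimizes } f(w) + \lambda |w|
\iff
0 \in f'(0) + \lambda [-1, 1]
\iff
|f'(0)| \le \lambda
$$

The last condition holds for an entire interval of values of $f'(0)$, which is why exact zeros are common rather than a coincidence.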

---

#### 6) Final Conclusion

Lasso creates sparsity because:

- The L1 constraint region has sharp corners that lie on the coordinate axes.
- It applies constant shrinkage pressure.
- The optimization process prefers axis-aligned solutions.

Therefore, Lasso performs automatic feature selection by forcing some coefficients to exactly zero.

#### Simple Linear Regression (1 Feature)

We assume a linear relationship between X and Y:

$$
Y = mX + b
$$

Where:

- $m$ = slope  
- $b$ = intercept  

---

#### Intercept Formula

The intercept can be written as:

$$
b = \bar{y} - m\bar{x}
$$

Where:

$$
\bar{y} = \text{mean}(y)
$$

$$
\bar{x} = \text{mean}(x)
$$

---

#### Slope Formula (Closed-Form Solution)

$$
m = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

The numerator is the sample covariance between $X$ and $Y$, up to a factor of $1/n$ (or $1/(n-1)$).

The denominator is the sample variance of $X$, up to the same factor.

The factor cancels in the ratio, so we can also write:

$$
m = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}
$$
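The closed-form formulas can be checked on a small example. This is a minimal sketch with NumPy; the data values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared deviations of x.
m = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: forces the fitted line through the point (x_bar, y_bar).
b = y_bar - m * x_bar

# Same slope via Cov(X, Y) / Var(X); both use ddof=1, so the factor cancels.
m_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Cross-check against NumPy's degree-1 least-squares polynomial fit.
m_ref, b_ref = np.polyfit(x, y, 1)

print(m, b)
```

All three ways of computing the slope agree up to floating-point rounding.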

---

#### Final Regression Equation

Once $m$ and $b$ are computed:

$$
\hat{Y} = mX + b
$$

---

#### Ridge Regression (Closed-Form Slope in 1D)

In simple linear regression, the slope is:

$$
m = 
\frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}
     {\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

---

#### Ridge Regression Modification

When we add L2 regularization (Ridge), the denominator changes:

$$
m =
\frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}
     {\sum_{i=1}^{n} (x_i - \bar{x})^2 + \lambda}
$$

---

#### What Changed?

The only difference is:

$$
\sum (x_i - \bar{x})^2
\quad \longrightarrow \quad
\sum (x_i - \bar{x})^2 + \lambda
$$

This additional $\lambda$ term:

- Increases the denominator
- Shrinks the slope value
- Prevents very large coefficients
- Reduces variance

---

#### Key Insight

As $\lambda$ increases:

$$
m \rightarrow 0
$$

But for any finite $\lambda$ it never becomes exactly zero (as long as the numerator is nonzero); it only approaches zero as $\lambda \to \infty$.

This is why Ridge shrinks coefficients but does not create sparsity.
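A quick numerical sketch of this behavior, using hypothetical toy data and a helper named `ridge_slope` that is introduced only for this example:

```python
import numpy as np

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

s_xy = np.sum((y - y.mean()) * (x - x.mean()))
s_xx = np.sum((x - x.mean()) ** 2)

def ridge_slope(lam):
    """1D Ridge slope: lambda is added to the denominator."""
    return s_xy / (s_xx + lam)

# The slope keeps shrinking as lambda grows, but stays strictly positive.
for lam in [0.0, 1.0, 10.0, 100.0, 1e6]:
    print(f"lambda={lam:>9}: m={ridge_slope(lam):.6f}")
```

Even at a very large $\lambda$ the slope is tiny but still nonzero.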

#### Lasso Regression in 1D (Setup)

We start from the simple linear model:

$$
\hat{y} = mx + b
$$

The intercept can be written as:

$$
b = \bar{y} - m\bar{x}
$$

---

#### Lasso Loss Function

The Lasso objective function is:

$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda |m|
$$

Substituting $\hat{y}_i = mx_i + b$:

$$
L = \sum_{i=1}^{n} (y_i - (mx_i + b))^2 + \lambda |m|
$$

---

#### Substitute Intercept Expression

Using:

$$
b = \bar{y} - m\bar{x}
$$

We substitute into the loss:

$$
L =
\sum_{i=1}^{n}
\left(
y_i - m x_i - \bar{y} + m \bar{x}
\right)^2
+ \lambda |m|
$$

---

#### Goal

We now want to find:

$$
m = ?
$$

That minimizes:

$$
L = \sum_{i=1}^{n}
(y_i - m x_i - \bar{y} + m \bar{x})^2
+ \lambda |m|
$$

This is where L1 regularization makes the solution different from Ridge.

Because of the absolute value term:

$$
|m|
$$

The function is not differentiable at $m = 0$.

This leads to the soft-thresholding solution and sparsity.

---

We start from:

$$
L =
\sum_{i=1}^{n}
(y_i - m x_i - \bar{y} + m \bar{x})^2
+ \lambda |m|
$$

---

#### Case 1: Assume $m > 0$

If $m > 0$, then:

$$
|m| = m
$$

So the loss becomes:

$$
L =
\sum_{i=1}^{n}
(y_i - m x_i - \bar{y} + m \bar{x})^2
+ \lambda m
$$

---

#### Simplify the Expression

Notice that the term inside the square can be regrouped:

$$
y_i - m x_i - \bar{y} + m \bar{x} = (y_i - \bar{y}) - m(x_i - \bar{x})
$$

So the loss becomes:

$$
L =
\sum_{i=1}^{n}
\left(
(y_i - \bar{y}) - m(x_i - \bar{x})
\right)^2
+ \lambda m
$$

---

#### Differentiate w.r.t. $m$

Take derivative:

$$
\frac{dL}{dm}
=
-2 \sum_{i=1}^{n}
(x_i - \bar{x})
\left[
(y_i - \bar{y}) - m(x_i - \bar{x})
\right]
+ \lambda
$$

---

#### Set Derivative = 0

$$
-2 \sum (x_i - \bar{x})(y_i - \bar{y})
+ 2m \sum (x_i - \bar{x})^2
+ \lambda = 0
$$

Rearranging:

$$
2m \sum (x_i - \bar{x})^2
=
2 \sum (x_i - \bar{x})(y_i - \bar{y})
- \lambda
$$

---

#### Solve for $m$

$$
m =
\frac{
\sum (x_i - \bar{x})(y_i - \bar{y})
- \frac{\lambda}{2}
}{
\sum (x_i - \bar{x})^2
}
$$
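As a sanity check, the formula can be compared with a brute-force minimization of the loss over a grid of slopes. This is only a sketch: the data and the value of `lam` are hypothetical, and `lam` is chosen small enough that the minimizer stays positive, as the derivation assumes.

```python
import numpy as np

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
lam = 4.0  # arbitrary penalty strength, small enough that m stays positive

dx, dy = x - x.mean(), y - y.mean()

# Closed-form slope from the m > 0 derivation (penalty written as lam * |m|).
m_formula = (np.sum(dx * dy) - lam / 2) / np.sum(dx ** 2)

# Brute-force check: evaluate the loss on a fine grid of candidate slopes.
m_grid = np.linspace(-1.0, 3.0, 40001)  # step of 1e-4
residuals = dy[None, :] - m_grid[:, None] * dx[None, :]
loss = np.sum(residuals ** 2, axis=1) + lam * np.abs(m_grid)
m_numeric = m_grid[np.argmin(loss)]

print(m_formula, m_numeric)
```

The two values should agree up to the grid resolution.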

---

#### Interpretation (Important)

Compare with OLS slope:

$$
m_{OLS} =
\frac{
\sum (x_i - \bar{x})(y_i - \bar{y})
}{
\sum (x_i - \bar{x})^2
}
$$

Lasso subtracts:

$$
\frac{\lambda}{2}
$$

from the numerator.

This is the shrinkage effect.

---

##### What If We Use 2λ|m| Before Differentiation?

The original Lasso loss function is written as:

$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda |m|
$$

However, it is often convenient to redefine the penalty with a factor of 2 before differentiating:

$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2\lambda |m|
$$

---

##### Why Add the Factor 2?

When differentiating the squared error term, we get:

$$
\frac{d}{dm}(z^2) = 2z \frac{dz}{dm}
$$

So the derivative of the RSS part naturally produces a factor of 2.

If we keep the original loss:

$$
L = RSS + \lambda |m|
$$

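Then after differentiation (for $m > 0$), writing $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$, we get:

$$
2mS_{xx} - 2S_{xy} + \lambda = 0
$$

We would then have to divide everything by 2, which introduces a messy $\lambda/2$ term.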

---

##### Cleaner Form Using 2λ|m|

If we instead define:

$$
L = RSS + 2\lambda |m|
$$

Then after differentiation (for $m > 0$):

$$
-2S_{xy} + 2mS_{xx} + 2\lambda = 0
$$

Dividing by 2 gives:

$$
mS_{xx} = S_{xy} - \lambda
$$

So:

$$
m = \frac{S_{xy} - \lambda}{S_{xx}}
$$

This form is cleaner and avoids extra fractions.

---

##### Important Insight

Multiplying the penalty term by 2 does NOT change the model conceptually.

It simply rescales $\lambda$: the $2\lambda|m|$ form with parameter $\lambda$ gives the same solution as the original $\lambda|m|$ form with parameter $2\lambda$.

Since $\lambda$ is a hyperparameter, scaling it does not affect the theory. It only changes its numerical value.

This is a common mathematical trick used to simplify differentiation.

##### 1)  Lasso (For $m > 0$)

Using the $2\lambda|m|$ form, for the case when $m > 0$ (which requires $\sum (y_i - \bar{y})(x_i - \bar{x}) > \lambda$), the solution becomes:

$$
m =
\frac{
\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) - \lambda
}{
\sum_{i=1}^{n} (x_i - \bar{x})^2
}
$$

##### 2)  Lasso (For $m = 0$, Boundary Case)

Without any penalty ($\lambda = 0$), the slope is:

$$
m =
\frac{
\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})
}{
\sum_{i=1}^{n} (x_i - \bar{x})^2
}
$$

This is simply the Ordinary Least Squares (OLS) solution.

If the shrinkage term $\lambda$ is large enough such that:

$$
\left| \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \right| \le \lambda
$$

then neither the $m > 0$ formula nor the $m < 0$ formula gives a slope with the sign it assumes, and the optimal solution becomes:

$$
m = 0
$$

##### 3) Lasso (For $m < 0$)

For the case when $m < 0$ (which requires $\sum (y_i - \bar{y})(x_i - \bar{x}) < -\lambda$), the solution becomes:

$$
m =
\frac{
\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) + \lambda
}{
\sum_{i=1}^{n} (x_i - \bar{x})^2
}
$$
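The three cases can be collected into one function, commonly called soft-thresholding. The sketch below assumes the $2\lambda|m|$ form of the penalty; the function name `lasso_slope_1d` and its arguments are hypothetical, with `s_xy` and `s_xx` standing for the two sums in the formulas above.

```python
def lasso_slope_1d(s_xy, s_xx, lam):
    """Minimizer of RSS + 2*lam*|m| in one dimension (needs s_xx > 0, lam >= 0)."""
    if s_xy > lam:
        # Case m > 0: subtract lam from the numerator.
        return (s_xy - lam) / s_xx
    if s_xy < -lam:
        # Case m < 0: add lam to the numerator.
        return (s_xy + lam) / s_xx
    # Case |s_xy| <= lam: the penalty wins and the slope is exactly zero.
    return 0.0


print(lasso_slope_1d(19.7, 10.0, 5.0))    # positive branch
print(lasso_slope_1d(-19.7, 10.0, 5.0))   # negative branch
print(lasso_slope_1d(19.7, 10.0, 25.0))   # zero branch
```

In every branch the numerator moves toward zero by `lam`, and it is clipped at zero rather than allowed to change sign.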

#### So Why Sparsity?

##### Why Does Lasso Create Sparsity? (Intuitive Explanation)

Let’s understand this in simple words.

We derived that for the case when $m > 0$:

$$
m = \frac{S_{xy} - \lambda}{S_{xx}}
$$

where:

- $S_{xy}$ depends only on the data
- $\lambda$ is the regularization strength
- $S_{xx} > 0$

---

##### What Happens When We Increase λ?

- $S_{xy}$ is fixed (assume $S_{xy} > 0$, so the unpenalized slope is positive).
- As $\lambda$ increases, the numerator decreases.
- So the value of $m$ keeps decreasing.

When:

$$
\lambda = S_{xy}
$$

then:

$$
m = 0
$$

So the slope becomes exactly zero.

---

##### What Happens If We Increase λ Even More?

Mathematically, the formula suggests $m$ would become negative.

But here is the important point:

That formula was derived assuming $m > 0$.

If $m$ tries to become negative, the assumption breaks.

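So we must check the $m < 0$ case instead. Its formula, $m = (S_{xy} + \lambda)/S_{xx}$, gives a positive value whenever $S_{xy} > 0$, which contradicts $m < 0$.

Neither branch is valid, so the only remaining candidate is $m = 0$. When we check the optimization conditions carefully, we find: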

Once $m$ reaches zero, it stays at zero.

It does NOT cross to the negative side.

---

##### Why Does It Stay at Zero?

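Because at $m = 0$, the optimality condition is satisfied whenever $|S_{xy}| \le \lambda$: the pull of the data toward a nonzero slope is no stronger than the resistance of the penalty.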

In simple words:

- The regularization penalty is strong enough.
- The model decides that having this feature is not worth it.
- So it removes the feature completely.

The coefficient becomes exactly zero and stays there.

---

##### Intuition in Simple Words

Lasso says:

"If this feature is not strong enough to overcome the penalty, remove it."

As λ increases:

- Small coefficients shrink.
- When they become small enough, they collapse to zero.
- After reaching zero, they do not flip sign.
- They remain zero.

This is called **sparsity**.

---


Lasso creates sparsity because:

- The L1 penalty subtracts a fixed amount ($\lambda$) from the numerator of the slope, no matter how small the slope already is.
- When the slope becomes zero, it gets "stuck" there.
- Increasing λ further does not revive it.
- The feature is effectively removed from the model.

That is why Lasso performs automatic feature selection.
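The difference from Ridge shows up clearly when both slopes are computed over a range of $\lambda$ values. This is a minimal sketch on hypothetical toy data; it assumes the $2\lambda|m|$ form for Lasso and the $\lambda m^2$ form for Ridge, and the helper names are made up for illustration.

```python
import numpy as np

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

dx, dy = x - x.mean(), y - y.mean()
s_xy = np.sum(dx * dy)   # roughly 19.7 for this data
s_xx = np.sum(dx ** 2)   # 10.0 for this data

def lasso_slope(lam):
    """1D Lasso slope for the loss RSS + 2*lam*|m| (soft-thresholding)."""
    return np.sign(s_xy) * max(abs(s_xy) - lam, 0.0) / s_xx

def ridge_slope(lam):
    """1D Ridge slope for the loss RSS + lam*m**2."""
    return s_xy / (s_xx + lam)

# Lasso reaches exactly zero once lam exceeds |s_xy|; Ridge only approaches it.
for lam in [0.0, 5.0, 10.0, 20.0, 30.0]:
    print(f"lambda={lam:>5}: lasso m={lasso_slope(lam):.4f}, ridge m={ridge_slope(lam):.4f}")
```

For the two largest values of $\lambda$ the Lasso slope is exactly zero and stays there, while the Ridge slope is still positive.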

#### Value of $m$ in Ridge Regression

For Ridge Regression, the loss function is:

$$
L =
\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
+ \lambda m^2
$$

After differentiating with respect to $m$ and setting the derivative equal to zero, the slope becomes:

$$
m =
\frac{
\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})
}{
\sum_{i=1}^{n} (x_i - \bar{x})^2 + \lambda
}
$$
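The differentiation step can be sketched as follows. It assumes, as in the Lasso derivation, that the intercept $b = \bar{y} - m\bar{x}$ has been substituted and is not penalized, and it uses the shorthand $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$.

$$
L =
\sum_{i=1}^{n}
\left(
(y_i - \bar{y}) - m(x_i - \bar{x})
\right)^2
+ \lambda m^2
$$

$$
\frac{dL}{dm}
= -2S_{xy} + 2mS_{xx} + 2\lambda m = 0
\quad \Longrightarrow \quad
m = \frac{S_{xy}}{S_{xx} + \lambda}
$$

Because the penalty's derivative $2\lambda m$ is itself proportional to $m$, it ends up in the denominator instead of being subtracted from the numerator, so the slope is scaled down but not pushed to exactly zero.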