

📘 **Ridge Regression - Notes**


🔹 **What is Ridge Regression?**

Ridge Regression is a **regularized version of Linear Regression** that adds a penalty term to reduce model complexity and prevent **overfitting**.

> 🚨 It helps when we have **multicollinearity** (features are highly correlated) or when the number of features is **close to or greater than** the number of observations.



🔹 **Why do we need it?**

In ordinary linear regression with $p$ features:

$$
y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p
$$

We minimize the **Mean Squared Error (MSE)**:

$$
\mathcal{L}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2
$$

Here’s what each part means:

* $y_i$: actual target value for the $i$-th data point.
* $\hat{y}_i$: predicted value for the $i$-th data point.
* $x_i$: the feature vector (a column vector) for the $i$-th data point.
* $w$: the weight vector (also a column vector) that our model is trying to learn.

Now, **$w^T x_i$** means the transpose of $w$ multiplied by $x_i$:

* $w^T$: transpose of the weight vector $w$.
* When you do $w^T x_i$, you get a scalar (single number), which becomes the prediction $\hat{y}_i$.

### Why transpose?

If:

* $w \in \mathbb{R}^p$ (a column vector with $p$ elements, one per feature).
* $x_i \in \mathbb{R}^p$ (also a column vector with $p$ elements).

Then:

* $w^T \in \mathbb{R}^{1 \times p}$ is a row vector.
* $w^T x_i$ is then a dot product, resulting in a scalar.

So, **$w^T x_i$** is the predicted value for the $i$-th observation in a linear regression model. The intercept $w_0$ fits this form if each $x_i$ carries a constant 1 as an extra entry.
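The dot product and the MSE above can be sketched in NumPy. This is a minimal illustration only: the toy data and the weight vector `w` are made-up values, not fitted to anything.

```python
import numpy as np

# Hypothetical toy data: 4 observations, 2 features (illustrative values only).
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 11.0])
w = np.array([2.0, 1.0])  # assumed weight vector, not a fitted one

# w^T x_i for one observation is a dot product, i.e. a single number.
y_hat_0 = w @ X[0]        # 2*1 + 1*2 = 4.0

# Stacking every x_i as a row of X gives all predictions at once: y_hat = X w.
y_hat = X @ w             # [4., 4., 7., 11.]

# Mean squared error: (1/n) * sum_i (y_i - w^T x_i)^2
mse = np.mean((y - y_hat) ** 2)   # 0.25
```

Only the first observation is mispredicted (by 1), so the MSE is $1/4$.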



But if:

* There are **too many features**
* Or features are **highly correlated**

→ The fitted coefficients become **unstable**: small changes in the data can cause large swings in the weights, and individual coefficients may become very large.
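A rough sketch of this instability, assuming synthetic data where the second feature is almost a copy of the first. The data, the noise levels, and the helper `ols_fit` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # nearly identical to x1 (highly correlated)
X = np.column_stack([x1, x2])

# A huge condition number means X^T X is close to singular.
cond = np.linalg.cond(X.T @ X)

def ols_fit(X, y):
    """Ordinary least squares via NumPy's least-squares solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Same underlying relationship (y = x1 + x2), two different noise draws.
y_a = x1 + x2 + 0.1 * rng.normal(size=n)
y_b = x1 + x2 + 0.1 * rng.normal(size=n)
w_a = ols_fit(X, y_a)
w_b = ols_fit(X, y_b)

print("condition number:", cond)
print("weights, noise draw A:", w_a)
print("weights, noise draw B:", w_b)
```

The sum $w_1 + w_2$ is well determined by the data, but how it is split between the two near-duplicate features is not, so the individual weights can differ a lot between the two fits.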



🔹 **Ridge Regression Loss Function**

Ridge adds a **penalty** (regularization term):

$$
\mathcal{L}_{\text{ridge}}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \sum_{j=1}^{p} w_j^2
$$

Where:

* $\lambda$ = Regularization parameter (controls penalty strength)
* $\sum_{j=1}^{p} w_j^2$ = squared L2 norm of the weights, $\lVert w \rVert_2^2$
* Note: Bias term $w_0$ is often not penalized.
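A minimal sketch of this loss in NumPy. The function name `ridge_loss` and the optional intercept argument `w0` are assumptions made for illustration; the intercept is kept out of the penalty, as noted above.

```python
import numpy as np

def ridge_loss(X, y, w, lam, w0=0.0):
    """(1/n) * sum_i (y_i - w0 - w^T x_i)^2 + lam * sum_j w_j^2.

    The intercept w0 is deliberately not included in the penalty.
    """
    residuals = y - (w0 + X @ w)
    mse = np.mean(residuals ** 2)
    penalty = lam * np.sum(w ** 2)
    return mse + penalty

# Hypothetical toy data and weights (illustrative values only).
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 11.0])
w = np.array([2.0, 1.0])

loss_unpenalized = ridge_loss(X, y, w, lam=0.0)  # plain MSE: 0.25
loss_penalized = ridge_loss(X, y, w, lam=0.1)    # 0.25 + 0.1 * (2^2 + 1^2) = 0.75
```

With the same weights, a larger $\lambda$ only raises the loss through the penalty term, which is what pushes the optimizer toward smaller weights.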



🔹 **Key Intuition**

* Ridge **shrinks the weights**, but **doesn’t make them exactly zero**.
* Controls overfitting by accepting a little extra **bias** in exchange for lower **variance**.



🔹 **Derivative of Ridge Loss Function** (Univariate Case for Simplicity)

Take a simple model with one feature and no intercept:

$$
\hat{y}_i = w x_i
$$

Loss function:

$$
\mathcal{L}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w x_i)^2 + \lambda w^2
$$

Now derive the gradient w.r.t. $w$:

$$
\frac{d}{dw} \left[ \frac{1}{n} \sum (y_i - w x_i)^2 + \lambda w^2 \right]
= \frac{1}{n} \sum -2 x_i (y_i - w x_i) + 2\lambda w
$$

$$
= -\frac{2}{n} \sum x_i (y_i - w x_i) + 2\lambda w
$$

Set gradient to zero for optimal $w$:

$$
-\frac{2}{n} \sum x_i (y_i - w x_i) + 2\lambda w = 0
$$

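Solving for $w$ gives $w = \dfrac{\sum x_i y_i}{\sum x_i^2 + n\lambda}$: the penalty enlarges the denominator, so the weight shrinks toward zero. The multivariate analogue of this condition is the regularized normal equation.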
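As a sanity check, the zero-gradient condition can be solved for $w$ and compared against plain gradient descent on the same loss. This is a sketch under assumed toy data; the learning rate and iteration count are arbitrary choices that happen to converge here.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x, so the unpenalized optimum is w = 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
lam = 0.5
n = len(x)

def gradient(w):
    """dL/dw = -(2/n) * sum_i x_i (y_i - w x_i) + 2 * lam * w"""
    return -2.0 / n * np.sum(x * (y - w * x)) + 2.0 * lam * w

# Rearranging gradient(w) = 0 gives w = sum(x*y) / (sum(x^2) + n*lam).
w_closed = np.sum(x * y) / (np.sum(x ** 2) + n * lam)   # 60 / 32 = 1.875

# Gradient descent from w = 0 should reach the same value.
w_gd = 0.0
learning_rate = 0.01
for _ in range(2000):
    w_gd -= learning_rate * gradient(w_gd)
```

Both routes give $w = 1.875$, slightly below the unpenalized value of 2, which is the shrinkage effect in its simplest form.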



🔹 **Closed-Form Solution (Matrix Form)**

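For the multivariate case with:

* $X$ = feature matrix ($n \times p$, one row per observation)
* $y$ = target vector ($n \times 1$)
* $w$ = weight vector ($p \times 1$)

Setting the gradient of the ridge loss to zero gives:

$$
w = (X^T X + n\lambda I)^{-1} X^T y
$$

Where:

* $I$ is the $p \times p$ identity matrix
* The factor $n$ comes from the $\frac{1}{n}$ in front of the squared error; if the loss is written as a plain sum of squares, the solution reads $(X^T X + \lambda I)^{-1} X^T y$
* For $\lambda > 0$, the matrix $X^T X + n\lambda I$ is always invertible, even when $X^T X$ itself is singular (e.g. perfectly collinear features, or $p > n$)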
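A minimal NumPy sketch of the closed-form solution. The function name `ridge_closed_form` and its argument `penalty` are assumptions for illustration: `penalty` stands for whatever scalar multiplies $I$ in the formula above. There is no intercept here, to keep the example short.

```python
import numpy as np

def ridge_closed_form(X, y, penalty):
    """Solve (X^T X + penalty * I) w = X^T y for w."""
    p = X.shape[1]
    A = X.T @ X + penalty * np.eye(p)
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Two perfectly collinear columns (second = 2 * first): X^T X is singular,
# so ordinary least squares has no unique solution.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

rank_gram = np.linalg.matrix_rank(X.T @ X)   # 1, i.e. not full rank
w = ridge_closed_form(X, y, penalty=1.0)     # [14/71, 28/71]
```

Even though $X^T X$ has rank 1, the penalized system has a unique solution, roughly $[0.197, 0.394]$.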



🔹 **Effect of Lambda (λ):**

* $\lambda = 0$ → same as Linear Regression
* $\lambda \to \infty$ → coefficients shrink toward zero
* Choosing $\lambda$ is critical → use **Cross-Validation**
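A sketch of both effects, assuming synthetic data and a hand-rolled k-fold split. The helper names `fit_ridge` and `cv_mse`, the grid of $\lambda$ values, and the data itself are all illustrative choices, not a recommended recipe.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/n) * ||y - Xw||^2 + lam * ||w||^2 (no intercept, for brevity)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Average validation MSE over k contiguous folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        w = fit_ridge(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return float(np.mean(errors))

# Hypothetical synthetic data: 40 observations, 5 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(scale=0.5, size=40)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]

# Shrinkage: the norm of the fitted weights falls as lambda grows.
norms = [np.linalg.norm(fit_ridge(X, y, lam)) for lam in lambdas]

# Cross-validation: pick the lambda with the lowest average validation error.
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:<6} ||w||={norm:.4f}  cv_mse={scores[lam]:.4f}")
print("selected lambda:", best_lam)
```

The printed table shows the weight norm decreasing monotonically with $\lambda$, while the validation error is what decides which value to keep.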



🔹 **Pros and Cons**

✅ Pros:

* Handles multicollinearity well
* Improves model generalization
* Prevents overfitting

❌ Cons:

* Doesn't perform feature selection (unlike Lasso)
* All coefficients are reduced but **none become zero**


🔹 **Ridge vs. Linear vs. Lasso**

| Property            | Linear Regression | Ridge Regression         | Lasso Regression          |
| ------------------- | ----------------- | ------------------------ | ------------------------- |
| Regularization      | ❌ No              | ✅ L2                     | ✅ L1                      |
| Coefficients Shrink | ❌ No              | ✅ Yes (not to exactly 0) | ✅ Yes (some to exactly 0) |
| Feature Selection   | ❌ No              | ❌ No                     | ✅ Yes                     |
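The difference in the "Coefficients Shrink" row can be seen in the one-feature, no-intercept case, where all three fits have simple closed forms. This is a sketch under that assumption only: the lasso formula below is the standard soft-thresholding result for a single feature with penalty $\lambda |w|$ added to the MSE, it is not derived in these notes, and the function names are made up.

```python
import numpy as np

def ols_1d(x, y):
    """argmin_w (1/n) * sum_i (y_i - w x_i)^2"""
    return np.mean(x * y) / np.mean(x ** 2)

def ridge_1d(x, y, lam):
    """argmin_w (1/n) * sum_i (y_i - w x_i)^2 + lam * w^2"""
    return np.mean(x * y) / (np.mean(x ** 2) + lam)

def lasso_1d(x, y, lam):
    """argmin_w (1/n) * sum_i (y_i - w x_i)^2 + lam * |w|  (soft-thresholding)"""
    a = np.mean(x ** 2)
    b = np.mean(x * y)
    return np.sign(b) * max(abs(b) - lam / 2.0, 0.0) / a

# Hypothetical data lying exactly on y = 0.5 x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 1.0, 1.5, 2.0])

lam = 10.0                      # deliberately strong penalty
w_ols = ols_1d(x, y)            # 0.5
w_ridge = ridge_1d(x, y, lam)   # 3.75 / 17.5, about 0.214: shrunk but nonzero
w_lasso = lasso_1d(x, y, lam)   # 0.0: the weight is removed entirely
```

With the same strong penalty, ridge scales the weight down while lasso sets it to exactly zero, which is why only lasso performs feature selection.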



📌 Summary:

> Ridge Regression = **Linear Regression + L2 penalty**


<font color = "Red">**Continue with Lecture 55 from 100 Days of Machine Learning (Lasso and Elastic Net left)**</font>


