#### Vector Form of Loss (Ridge Regression)

#### Scalar Form (Sum of Squared Errors)

$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

#### Prediction in Vector Form

$$
\hat{y} = Xw
$$

####  Loss in Vector Form

$$
L = (Xw - y)^T (Xw - y)
$$

####  L2 Regularization Term

$$
\lambda \|w\|^2
$$

Since,

$$
\|w\|^2 = w^T w
$$

####  Final Ridge Regression Loss

$$
L = (Xw - y)^T (Xw - y) + \lambda w^T w
$$

##### if x1, x2 ,x3 x4 .....xn are input columns

$$
X =
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \dots & x_n^{(1)} \\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \dots & x_n^{(2)} \\
x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \dots & x_n^{(3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_1^{(m)} & x_2^{(m)} & x_3^{(m)} & \dots & x_n^{(m)}
\end{bmatrix}
$$

#### Gradient Descent Update Rule

$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w}
$$

Now we compute the gradient \( \frac{\partial L}{\partial w} \) for Ridge Regression.

---

#### Ridge Loss Function (Vector Form)

$$
L = (Xw - y)^T (Xw - y) + \lambda w^T w
$$

---

#### Gradient of the Loss Function

$$
\frac{\partial L}{\partial w}
=
\frac{\partial}{\partial w}
\left[
(Xw - y)^T (Xw - y)
+
\lambda w^T w
\right]
$$

---

#### Gradient of First Term

$$
\frac{\partial}{\partial w}
(Xw - y)^T (Xw - y)
=
2 X^T (Xw - y)
$$

---

#### Gradient of Regularization Term

$$
\frac{\partial}{\partial w}
\lambda w^T w
=
2 \lambda w
$$

---

#### Final Gradient

$$
\frac{\partial L}{\partial w}
=
2 X^T (Xw - y)
+
2 \lambda w
$$

---

#### Final Weight Update Rule for Ridge Regression

$$
w_{\text{new}}
=
w_{\text{old}}
-
\eta
\left(
2 X^T (Xw - y)
+
2 \lambda w
\right)
$$

# OR........

#### Ridge Loss Function (with 1/2 scaling)

$$
L = \frac{1}{2}(Xw - y)^T (Xw - y) + \frac{1}{2}\lambda w^T w
$$

---

#### Step 1: Expand the Quadratic Term

First rewrite:

$$
(Xw - y)^T = w^T X^T - y^T
$$

So,

$$
L =
\frac{1}{2}
(w^T X^T - y^T)(Xw - y)
+
\frac{1}{2}\lambda w^T w
$$

---

#### Step 2: Multiply the Terms

$$
L =
\frac{1}{2}
\left[
w^T X^T X w
-
w^T X^T y
-
y^T X w
+
y^T y
\right]
+
\frac{1}{2}\lambda w^T w
$$

---

#### Step 3: Take Derivative w.r.t. w

Now differentiate term-by-term.

1) $$ \frac{\partial}{\partial w} (w^T X^T X w) = 2 X^T X w $$

2) $$ \frac{\partial}{\partial w} (w^T X^T y) = X^T y $$

3) $$ \frac{\partial}{\partial w} (y^T X w) = X^T y $$

4) $$ \frac{\partial}{\partial w} (y^T y) = 0 $$

5) $$ \frac{\partial}{\partial w} \left( \frac{1}{2}\lambda w^T w \right) = \lambda w $$

---

#### Step 4: Combine Everything

Because of the 1/2 in front, the 2 cancels.

Final gradient:

$$
\frac{dL}{dw}
=
X^T X w
-
X^T y
+
\lambda w
$$

---

#### Final Compact Form

$$
\frac{dL}{dw}
=
X^T (Xw - y)
+
\lambda w
$$

$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w}
$$

## let's code this in python 

In [17]:
# Using SGDRegressor (L2 = Ridge)

In [18]:
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
import numpy as np

X, y = load_diabetes(return_X_y=True)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=43
)

from sklearn.linear_model import SGDRegressor

reg = SGDRegressor(
    penalty='l2',
    max_iter=500,
    eta0=0.1,
    learning_rate='constant',
    alpha=0.001
)

reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

print("R2 score", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)

R2 score 0.4226329657491993
[ -21.78308475 -118.38584487  388.45896744  193.4126863    -4.87454601
  -68.27275091 -175.55323483  141.57747142  354.10451479  100.16440325]
[177.85718022]


In [19]:
# Using Ridge (Closed-form / Solver-based)

In [21]:
from sklearn.linear_model import Ridge

reg = Ridge(
    alpha=0.001,
    max_iter=500,
    solver='sparse_cg'
)

reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

print("R2 score", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)

R2 score 0.5421306472529078
[ -60.55051435 -225.36087698  529.87642791  259.28548334 -738.97717811
  407.97223878  105.56321825  214.51629966  795.66969963   35.44396968]
152.0759468102823


| Situation            | Best Solver    |
| -------------------- | -------------- |
| Small dataset        | `svd`          |
| Medium dense dataset | `cholesky`     |
| Large sparse dataset | `sparse_cg`    |
| Huge dataset         | `sag` / `saga` |


| Solver      | Type                    | How It Works                            | Best For                             | Speed     | Stability         | Supports Sparse | Notes                              |
| ----------- | ----------------------- | --------------------------------------- | ------------------------------------ | --------- | ----------------- | --------------- | ---------------------------------- |
| `auto`      | Auto-select             | Chooses best solver automatically       | General use                          | Depends   | Depends           | Yes             | Default option                     |
| `svd`       | Direct                  | Singular Value Decomposition            | Small datasets, ill-conditioned data | Slow      | ⭐⭐⭐⭐⭐ (Very High) | No              | Most numerically stable            |
| `cholesky`  | Direct                  | Cholesky decomposition of (X^TX)        | Medium dense datasets                | Fast      | ⭐⭐⭐               | No              | Efficient but less stable than SVD |
| `lsqr`      | Iterative               | Least Squares QR-based iterative method | Large datasets                       | Fast      | ⭐⭐⭐⭐              | Yes             | Memory efficient                   |
| `sparse_cg` | Iterative               | Conjugate Gradient method               | Large sparse data                    | Fast      | ⭐⭐⭐⭐              | Yes             | No explicit matrix inverse         |
| `sag`       | Iterative (GD-based)    | Stochastic Average Gradient             | Large datasets                       | Very Fast | ⭐⭐⭐               | Yes             | Requires feature scaling           |
| `saga`      | Iterative (Improved GD) | Variant of SAG                          | Very large datasets                  | Very Fast | ⭐⭐⭐               | Yes             | Supports L1 & ElasticNet           |


## building own class

In [22]:
class MyRidgeGD:
    
    def __init__(self, epochs=500, learning_rate=0.01, alpha=0.001):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.alpha = alpha
        self.coef_ = None
        self.intercept_ = None
        
    def fit(self, X_train, y_train):
        
        m, n = X_train.shape
        
        # Add bias column
        X_train = np.insert(X_train, 0, 1, axis=1)
        
        # Initialize theta
        theta = np.zeros(n + 1)
        
        for _ in range(self.epochs):
            
            # Gradient core
            gradient = X_train.T @ (X_train @ theta - y_train)
            
            # Regularization (DO NOT regularize bias)
            reg = self.alpha * theta
            reg[0] = 0
            
            gradient += reg
            
            # Average over samples
            gradient = gradient / m
            
            # Update
            theta = theta - self.learning_rate * gradient
        
        self.intercept_ = theta[0]
        self.coef_ = theta[1:]
    
    def predict(self, X_test):
        return X_test @ self.coef_ + self.intercept_

In [27]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [29]:
model = MyRidgeGD(epochs=1000, learning_rate=0.1, alpha=0.001)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

from sklearn.metrics import r2_score
print("R2:", r2_score(y_test, y_pred))

print(reg.coef_)
print(reg.intercept_)

R2: 0.5410641628046949
[ 1.19597937  0.17851711  5.34158267  3.20284639  1.86339584  1.62023936
 -3.57840332  3.95766448  5.00484777  2.96974345]
140.01445559898588
