
1. Linear Model (Forward Pass): The forward function computes the predicted output y_pred using the standard linear equation:

y_pred = X @ W + b

Where:

X is your input feature matrix.
W (weights) are the coefficients that determine the influence of each feature.
b (bias) is the intercept term.
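
To make the shapes concrete, here is a minimal sketch of the forward pass on a tiny hand-made input. The names X_demo, W_demo, and b_demo and their values are made up for this illustration and are not part of the notebook's data.

```python
import numpy as np

def forward(X, W, b):
    # X: (N, D), W: (D, 1), b: (1,) is broadcast across the N rows
    return X @ W + b

# hypothetical toy values, chosen so the arithmetic is easy to follow
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])       # 2 samples, 3 features
W_demo = np.array([[0.5], [-1.0], [2.0]])  # one weight per feature
b_demo = np.array([0.1])

y_pred_demo = forward(X_demo, W_demo, b_demo)
# row 1: 0.5*1 - 1*2 + 2*3 + 0.1 = 4.6
# row 2: 0.5*4 - 1*5 + 2*6 + 0.1 = 9.1
print(y_pred_demo.shape)
print(y_pred_demo)
```

The output has one row per sample, which is why the loss and gradient formulas later divide by N.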
2. Mean Squared Error (MSE) Loss: The mse_loss function measures how well your model's predictions y_pred match the actual true values y. It calculates the average of the squared differences between them:

MSE = (1/N) * sum((y_pred - y_true)^2)

This is a common metric to quantify the error, and the goal of training is to minimize this error.
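
As a small worked example of the formula, the sketch below evaluates the MSE on three made-up predictions. The arrays y_true_demo and y_pred_demo are assumptions for illustration only.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # average of the squared differences
    return np.mean((y_pred - y_true) ** 2)

# hypothetical values: errors are 0.5, 0.0, and -1.0
y_true_demo = np.array([[1.0], [2.0], [3.0]])
y_pred_demo = np.array([[1.5], [2.0], [2.0]])

# squared errors: 0.25, 0.0, 1.0, so MSE = 1.25 / 3
mse_demo = mse_loss(y_pred_demo, y_true_demo)
print(mse_demo)
```

Note that the single error of 1.0 contributes four times as much as the error of 0.5, since errors are squared before averaging.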

3. L2 Regularization Loss: The l2_loss function adds a penalty to the total loss based on the magnitude of the weights W. This helps prevent overfitting by discouraging the model from assigning very large values to the weights.

L2_Loss = 0.5 * lam * sum(W^2)

Where lam (lambda) is the regularization strength. A larger lam means a stronger penalty on the weights.
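
The effect of lam can be seen by evaluating the penalty for a fixed weight vector. This is a minimal sketch; W_demo is a made-up weight vector and the lam values are arbitrary.

```python
import numpy as np

def l2_loss(W, lam):
    return 0.5 * lam * np.sum(W * W)

# hypothetical weights: sum of squares = 4 + 9 + 1 = 14
W_demo = np.array([[2.0], [-3.0], [1.0]])

# the penalty grows linearly with lam: 0.5 * lam * 14
penalties = {lam: l2_loss(W_demo, lam) for lam in (0.0, 0.1, 1.0)}
for lam, penalty in penalties.items():
    print(f"lam = {lam:3.1f} | penalty = {penalty:.2f}")
```

With lam = 0 the penalty vanishes and the model reduces to plain least squares.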

4. Total Loss: The overall loss that the model tries to minimize is the sum of the MSE loss and the L2 regularization loss:

Total_Loss = MSE_Loss + L2_Loss

5. Gradient Calculation (compute_grads): This is the core of the optimization process. It computes the gradient of Total_Loss with respect to each parameter (W and b), which tells us in which direction and how strongly each parameter should be adjusted to reduce the loss.

Gradient of Loss wrt Predictions (dL_dy): First, it calculates how the MSE loss changes with respect to the predictions:

dL_dy = (2/N) * (y_pred - y)

Gradient of Loss wrt Weights (dW): Using the chain rule, it calculates the gradient for the weights. This includes the gradient from the MSE term and the gradient from the L2 regularization term:

dW = X.T @ dL_dy + lam * W

The lam * W term comes directly from the derivative of 0.5 * lam * sum(W^2) with respect to W.

Gradient of Loss wrt Bias (db): Similarly, the gradient for the bias is calculated:

db = sum(dL_dy)

6. Gradient Descent Update: In each epoch (iteration), the W and b parameters are updated by moving a small step (lr, the learning rate) in the opposite direction of their respective gradients:

W = W - lr * dW
b = b - lr * db

This iterative process gradually moves the W and b values towards the ones that minimize the Total_Loss function.
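
To see a single update in isolation, the sketch below applies one gradient descent step to a tiny made-up problem and compares the total loss before and after. The toy data and the helper total_loss are assumptions for this illustration; lr and lam use the same values as the notebook.

```python
import numpy as np

# toy data, chosen only for illustration (not from the notebook)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([[1.0], [2.0], [3.0]])
W = np.zeros((2, 1))
b = np.zeros((1,))
lr, lam = 0.05, 0.1

def total_loss(X, y, W, b, lam):
    # hypothetical helper: MSE plus the L2 penalty on W
    y_pred = X @ W + b
    return np.mean((y_pred - y) ** 2) + 0.5 * lam * np.sum(W * W)

loss_before = total_loss(X, y, W, b, lam)

# gradients, using the same formulas as in the text
N = X.shape[0]
dL_dy = (2 / N) * (X @ W + b - y)
dW = X.T @ dL_dy + lam * W
db = np.sum(dL_dy, axis=0)

# one gradient descent step
W = W - lr * dW
b = b - lr * db

loss_after = total_loss(X, y, W, b, lam)
print(f"before: {loss_before:.4f} | after: {loss_after:.4f}")
```

For this problem and step size the loss drops after one step. A much larger lr could overshoot and increase the loss instead.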

1. dL_dy (Gradient of Loss with respect to Predictions y_pred)
This term represents how sensitive the Mean Squared Error (MSE) loss is to changes in your model's predictions (y_pred).

The MSE loss function is defined as: L_MSE = (1/N) * Σ(y_pred - y_true)²

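To find dL_dy (which is ∂L_MSE / ∂y_pred), we differentiate L_MSE with respect to a single prediction y_pred_i. Only the i-th term of the sum depends on y_pred_i, so the sum drops out:

∂L_MSE / ∂y_pred_i = ∂/∂y_pred_i [ (1/N) * Σ_j (y_pred_j - y_true_j)² ]
∂L_MSE / ∂y_pred_i = (1/N) * 2 * (y_pred_i - y_true_i) * ∂/∂y_pred_i (y_pred_i - y_true_i)
∂L_MSE / ∂y_pred_i = (1/N) * 2 * (y_pred_i - y_true_i) * 1
∂L_MSE / ∂y_pred_i = (2/N) * (y_pred_i - y_true_i)

Stacking these over all i gives the vector form: dL_dy = (2/N) * (y_pred - y_true)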

This dL_dy is then a vector where each element tells you how much the loss changes for a unit change in the corresponding prediction.
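
One way to sanity-check this formula is to compare it with a numerical derivative. The sketch below does so for the first prediction of a two-sample example. The values and the step size eps are assumptions for illustration.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

# hypothetical two-sample example
y_true = np.array([[1.0], [2.0]])
y_pred = np.array([[2.0], [0.0]])
N = y_true.shape[0]

# analytic gradient: (2/2) * [[2 - 1], [0 - 2]] = [[1], [-2]]
dL_dy = (2 / N) * (y_pred - y_true)

# central finite difference with respect to the first prediction only
eps = 1e-6
bump = np.array([[eps], [0.0]])
numeric_0 = (mse_loss(y_pred + bump, y_true)
             - mse_loss(y_pred - bump, y_true)) / (2 * eps)

print(dL_dy.ravel(), numeric_0)
```

The numerical estimate should agree with the first entry of dL_dy up to floating-point rounding.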

2. dW (Gradient of Loss with respect to Weights W)
The dW term represents how sensitive the total loss (MSE loss + L2 regularization loss) is to changes in your model's weights W.

The total loss is: L_total = L_MSE + L_L2
So, dW = ∂L_total / ∂W = ∂L_MSE / ∂W + ∂L_L2 / ∂W

Let's break it down:

a) Contribution from MSE Loss (∂L_MSE / ∂W)

We use the chain rule, knowing that y_pred = X @ W + b. Each weight W_k affects every prediction, so we sum over the samples:
∂L_MSE / ∂W_k = Σ_i (∂L_MSE / ∂y_pred_i) * (∂y_pred_i / ∂W_k)

We already found ∂L_MSE / ∂y_pred_i = dL_dy_i.
For ∂y_pred_i / ∂W_k: since y_pred_i = Σ_j X_ij W_j + b, we get ∂y_pred_i / ∂W_k = X_ik.
So ∂L_MSE / ∂W_k = Σ_i X_ik * dL_dy_i, which is the k-th entry of X.T @ dL_dy. The transpose appears because the sum runs over the sample index i.
Therefore, ∂L_MSE / ∂W = X.T @ dL_dy

b) Contribution from L2 Regularization Loss (∂L_L2 / ∂W)

The L2 regularization loss is defined as: L_L2 = 0.5 * lam * Σ(W²) = 0.5 * lam * WᵀW (for vector W)

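To find ∂L_L2 / ∂W, we differentiate L_L2 with respect to a single weight W_k. Only the k-th term of the sum depends on W_k:
∂L_L2 / ∂W_k = ∂/∂W_k [ 0.5 * lam * Σ_j W_j² ]
∂L_L2 / ∂W_k = 0.5 * lam * 2 * W_k
∂L_L2 / ∂W_k = lam * W_k

In vector form: ∂L_L2 / ∂W = lam * W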

c) Combining them for dW

dW = X.T @ dL_dy + lam * W

This is exactly what you see in the compute_grads function.
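
The combined formula can be checked against a numerical gradient of the total loss. The sketch below is one way to do this, assuming random test data from an arbitrary seed and a hypothetical helper total_loss; it is not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the demo
X = rng.normal(size=(20, 3))
y = rng.normal(size=(20, 1))
W = rng.normal(size=(3, 1))
b = np.zeros((1,))
lam = 0.1

def total_loss(W):
    # hypothetical helper: MSE plus the L2 penalty, as a function of W only
    y_pred = X @ W + b
    return np.mean((y_pred - y) ** 2) + 0.5 * lam * np.sum(W * W)

# analytic gradient from the derivation
N = X.shape[0]
dL_dy = (2 / N) * (X @ W + b - y)
dW = X.T @ dL_dy + lam * W

# numerical gradient by central differences, one weight at a time
eps = 1e-6
dW_num = np.zeros_like(W)
for k in range(W.shape[0]):
    step = np.zeros_like(W)
    step[k, 0] = eps
    dW_num[k, 0] = (total_loss(W + step) - total_loss(W - step)) / (2 * eps)

max_diff = np.max(np.abs(dW - dW_num))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A very small difference suggests the analytic gradient, including the lam * W term, matches the loss being minimized.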

3. db (Gradient of Loss with respect to Bias b)
The db term represents how sensitive the total loss is to changes in your model's bias b.

Since the L2 regularization is only applied to weights W and not bias b, the db gradient only comes from the MSE loss.

db = ∂L_total / ∂b = ∂L_MSE / ∂b

Again, we use the chain rule, summing over the samples because b affects every prediction:
∂L_MSE / ∂b = Σ_i (∂L_MSE / ∂y_pred_i) * (∂y_pred_i / ∂b)

We know ∂L_MSE / ∂y_pred_i = dL_dy_i.
For ∂y_pred_i / ∂b: since y_pred = X @ W + b and the scalar b is broadcast across all samples, ∂y_pred_i / ∂b = 1 for every i. A change in b shifts each prediction by the same amount.
Therefore, ∂L_MSE / ∂b = Σ_i dL_dy_i, the sum of dL_dy over all samples.

This is why in the compute_grads function, db = np.sum(dL_dy, axis=0).
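
A short sketch of this sum on made-up numbers follows. The data is hypothetical and chosen so that every prediction equals -0.5, and db_alt is an illustrative name for an equivalent way of writing the same quantity.

```python
import numpy as np

# hypothetical toy values
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([[1.0], [0.0], [2.0]])
W = np.array([[0.5], [-0.5]])
b = np.zeros((1,))

y_pred = X @ W + b              # every row equals -0.5 for these values
N = X.shape[0]
dL_dy = (2 / N) * (y_pred - y)  # shape (3, 1)

# summing over axis 0 collapses the sample dimension, so db has b's shape
db = np.sum(dL_dy, axis=0)

# equivalent view: twice the mean residual
db_alt = 2 * np.mean(y_pred - y)

print(db.shape, db, db_alt)
```

The residuals are -1.5, -0.5, and -2.5, so db = (2/3) * (-4.5) = -3. The negative sign means the update step increases b.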

These gradients are then used in the gradient descent update step to adjust W and b in the direction that minimizes the total loss.



In [1]:
import numpy as np

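# synthetic data: 100 samples, 3 features, targets from known weights plus Gaussian noise
np.random.seed(0)
X = np.random.randn(100, 3)
true_W = np.array([[2.0], [-3.0], [1.0]])
y = X @ true_W + 0.5 * np.random.randn(100, 1)

# parameters to learn: random initial weights, zero bias
W = np.random.randn(3, 1)
b = np.zeros((1,))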

def forward(X, W, b):
    return X @ W + b

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true)**2)

def l2_loss(W, lam):
    return 0.5 * lam * np.sum(W * W)

def compute_grads(X, y, y_pred, W, lam):
    N = X.shape[0]

    # gradient of MSE wrt predictions
    dL_dy = (2/N) * (y_pred - y)

    # gradients wrt parameters
    dW = X.T @ dL_dy + lam * W   # L2 gradient added here
    db = np.sum(dL_dy, axis=0)

    return dW, db

lr = 0.05
lam = 0.1   # L2 strength

for epoch in range(200):
    y_pred = forward(X, W, b)

    loss = mse_loss(y_pred, y) + l2_loss(W, lam)

    dW, db = compute_grads(X, y, y_pred, W, lam)

    # gradient descent update
    W -= lr * dW
    b -= lr * db

    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | Loss = {loss:.4f}")

Epoch   0 | Loss = 11.8049
Epoch  20 | Loss = 0.9744
Epoch  40 | Loss = 0.8899
Epoch  60 | Loss = 0.8890
Epoch  80 | Loss = 0.8890
Epoch 100 | Loss = 0.8890
Epoch 120 | Loss = 0.8890
Epoch 140 | Loss = 0.8890
Epoch 160 | Loss = 0.8890
Epoch 180 | Loss = 0.8890
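
To illustrate how lam shrinks the learned weights, the sketch below repeats the same training procedure for a few values of lam and reports the norm of W. The train helper is hypothetical and starts W at zero rather than at random values, which is an assumption made here so that the runs are directly comparable.

```python
import numpy as np

def train(X, y, lam, lr=0.05, epochs=200):
    # hypothetical helper wrapping the same update rule as the loop above
    W = np.zeros((X.shape[1], 1))
    b = np.zeros((1,))
    N = X.shape[0]
    for _ in range(epochs):
        dL_dy = (2 / N) * (X @ W + b - y)
        W = W - lr * (X.T @ dL_dy + lam * W)
        b = b - lr * np.sum(dL_dy, axis=0)
    return W, b

# same synthetic data recipe as the notebook cell
np.random.seed(0)
X = np.random.randn(100, 3)
true_W = np.array([[2.0], [-3.0], [1.0]])
y = X @ true_W + 0.5 * np.random.randn(100, 1)

norms = []
for lam in (0.0, 0.1, 1.0):
    W, _ = train(X, y, lam)
    norms.append(np.linalg.norm(W))
    print(f"lam = {lam:3.1f} | ||W|| = {norms[-1]:.4f}")
```

The weight norm should decrease as lam grows, which is the shrinkage effect the L2 penalty is meant to produce.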
