# Intuition: Why L1 and L2 Regularization Can Reduce Overfitting
- Extreme coefficients (weights) are unlikely to generalize well.
- Penalizing the sum of the weights means the model has to “distribute” a limited weight budget, so most of this “resource” goes to the simple features that explain most of the variance, while complex features get small or zero weights.
- L1 is useful for variable (feature) selection because it can drive coefficients exactly to zero.
- L2 shrinks coefficients but is much less likely to produce exactly zero coefficients.
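
A minimal sketch of the last two bullets, assuming an orthonormal design ($X^T X = I$). In that special case both penalized least-squares problems have closed forms: with the objective $\frac{1}{2}\|y - Xw\|^2 + \lambda \|w\|_1$ the solution is a soft-thresholding of the unpenalized (OLS) coefficients, and with $\frac{1}{2}\|y - Xw\|^2 + \frac{\lambda}{2}\|w\|^2$ it is a rescaling by $1/(1+\lambda)$. The coefficient values, `lam`, and the function names are made up for illustration.

```python
import numpy as np

def lasso_orthonormal(w_ols, lam):
    # Soft-thresholding: move every coefficient towards zero by lam,
    # and clip the ones that would cross zero to exactly zero.
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def ridge_orthonormal(w_ols, lam):
    # Uniform rescaling: coefficients shrink but never become exactly zero.
    return w_ols / (1.0 + lam)

w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])  # hypothetical OLS coefficients
lam = 0.5

w_l1 = lasso_orthonormal(w_ols, lam)
w_l2 = ridge_orthonormal(w_ols, lam)

print("L1:", w_l1)
print("L2:", w_l2)
print("exact zeros  L1:", int(np.sum(w_l1 == 0)), " L2:", int(np.sum(w_l2 == 0)))
```

The three small coefficients fall below the threshold and are zeroed by L1, while L2 keeps all five nonzero.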

## Normalization
- We did not normalize the variable in the synthetic data set because there was only one. With several features, we normally convert every X variable to a standard score (z-score) so that all of them are zero-centered and on the same scale. The penalty depends only on the size of each coefficient, and a feature with a small numeric range needs a larger coefficient to have the same effect, so with features on different scales regularization squashes some coefficients more than others.
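
As a rough sketch of the standardization described above, using plain NumPy. The helper name `standardize`, the `eps` guard against constant columns, and the two synthetic features are assumptions for illustration only.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Column-wise z-score: zero mean and unit standard deviation per feature."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # eps avoids division by zero for a constant column
    return (X - mean) / (std + eps), mean, std

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X = np.column_stack([
    rng.normal(50_000, 10_000, size=200),
    rng.normal(0.5, 0.1, size=200),
])

X_std, mean, std = standardize(X)
print("means:", X_std.mean(axis=0))
print("stds: ", X_std.std(axis=0))
```

When there is a separate test set, the `mean` and `std` computed on the training data should be reused to transform it, rather than recomputed.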

# Mean Squared Error (MSE) Loss with L1 and L2 Regularization

The MSE loss with L1 and L2 regularization is given by:

$$
L = \frac{1}{2m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right)^2 + \lambda_1 \sum_{j} \left| w_j \right| + \frac{\lambda_2}{2} \sum_{j} w_j^2
$$

Here:
- $m$ is the number of data points.
- $y_i$ is the true label, and $\hat{y}_i$ is the prediction.
- $\lambda_1$ and $\lambda_2$ are the L1 and L2 regularization coefficients, respectively.
- $w_j$ are the model weights; the bias $b$ is not penalized.
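
The formula can be written out term by term in NumPy. This is only a sketch: the function name `regularized_mse` and the tiny data set are hypothetical, and a linear prediction $\hat{y}_i = w^T x_i + b$ is assumed.

```python
import numpy as np

def regularized_mse(w, b, X, y, lam1, lam2):
    m = X.shape[0]
    residual = y - (X @ w + b)              # y_i - y_hat_i
    mse = np.sum(residual ** 2) / (2 * m)   # 1/(2m) * sum of squared residuals
    l1 = lam1 * np.sum(np.abs(w))           # lambda_1 * sum |w_j|
    l2 = 0.5 * lam2 * np.sum(w ** 2)        # lambda_2 / 2 * sum w_j^2
    return mse + l1 + l2

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, -1.0])
b = 0.5

# By hand: residuals are [1.5, 2.5], so mse = 8.5 / 4 = 2.125,
# l1 = 0.1 * 2 = 0.2 and l2 = 0.5 * 0.2 * 2 = 0.2.
loss = regularized_mse(w, b, X, y, lam1=0.1, lam2=0.2)
print(loss)
```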

# Gradient of Loss with Respect to Weights and Bias

To compute the gradients step by step:

## 1. Compute the gradient of the MSE loss without regularization

Let $\hat{y}_i = w^T x_i + b$ be the predicted value. The gradient of the loss with respect to $w$ and $b$ is:

$$
\frac{\partial L_{\text{MSE}}}{\partial w} = -\frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right) x_i
$$

$$
\frac{\partial L_{\text{MSE}}}{\partial b} = -\frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right)
$$
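
One way to sanity-check these two expressions is to compare them against central finite differences of the loss. The sketch below does that on random data; the function names, the data, and the step size `eps` are illustrative choices.

```python
import numpy as np

def mse_loss(w, b, X, y):
    m = X.shape[0]
    return np.sum((y - (X @ w + b)) ** 2) / (2 * m)

def mse_grad(w, b, X, y):
    m = X.shape[0]
    residual = y - (X @ w + b)        # y_i - y_hat_i
    grad_w = -(X.T @ residual) / m    # -1/m * sum (y_i - y_hat_i) x_i
    grad_b = -np.sum(residual) / m    # -1/m * sum (y_i - y_hat_i)
    return grad_w, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b = 0.3

grad_w, grad_b = mse_grad(w, b, X, y)

# Central finite differences, one coordinate at a time
eps = 1e-6
num_grad_w = np.array([
    (mse_loss(w + eps * e, b, X, y) - mse_loss(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(3)
])
num_grad_b = (mse_loss(w, b + eps, X, y) - mse_loss(w, b - eps, X, y)) / (2 * eps)

print(np.allclose(grad_w, num_grad_w, atol=1e-6))
print(np.isclose(grad_b, num_grad_b, atol=1e-6))
```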

## 2. Add the L1 regularization gradient

The L1 term is not differentiable at $w_j = 0$, so we use a subgradient with the convention $\text{sign}(0) = 0$. With respect to each weight $w_j$ it is:

$$
\frac{\partial \lambda_1 \sum_{j} \left| w_j \right|}{\partial w_j} = \lambda_1 \cdot \text{sign}(w_j)
$$
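
In NumPy this is a one-liner, since `np.sign` returns 0 for a zero input. The helper name `l1_subgradient` and the example weights are hypothetical.

```python
import numpy as np

def l1_subgradient(w, lam1):
    # np.sign(0) == 0, which is a valid subgradient of |w_j| at w_j = 0
    return lam1 * np.sign(w)

w = np.array([2.0, -0.5, 0.0])
g = l1_subgradient(w, lam1=0.1)
print(g)
```

Note that the magnitude of the penalty gradient is the same for the large weight and the small one. This constant pull is what lets L1 push small weights all the way to zero.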

## 3. Add the L2 regularization gradient

The gradient of the L2 regularization term with respect to each weight $w_j$ is:

$$
\frac{\partial \frac{\lambda_2}{2} \sum_{j} w_j^2}{\partial w_j} = \lambda_2 w_j
$$
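
Because this gradient is proportional to the weight itself, a gradient step on the penalty alone multiplies each weight by $(1 - \eta \lambda_2)$, where $\eta$ is the learning rate. The small sketch below applies only that penalty step (no data term) with made-up values for `lam2`, `lr`, and the starting weights.

```python
import numpy as np

lam2, lr = 0.5, 0.1
w0 = np.array([2.0, -0.5, 0.01])  # hypothetical starting weights

w = w0.copy()
for _ in range(100):
    w = w - lr * lam2 * w  # penalty-only step: w <- (1 - lr * lam2) * w

print(w)
```

The weights decay geometrically towards zero but never reach it exactly, which matches the earlier remark that L2 rarely produces zero coefficients.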

## 4. Combine the gradients

The total gradient with respect to $w$ is:

$$
\frac{\partial L}{\partial w} = -\frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right) x_i + \lambda_1 \cdot \text{sign}(w) + \lambda_2 w
$$

The bias is not regularized, so the total gradient with respect to $b$ is the MSE gradient alone:

$$
\frac{\partial L}{\partial b} = -\frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right)
$$
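
Putting the pieces together, the sketch below implements the combined gradient and runs plain gradient descent on synthetic data. Everything concrete here (function names, `true_w`, the noise level, `lam1`, `lam2`, `lr`, the number of steps) is an assumption chosen for illustration, not a recommended setting.

```python
import numpy as np

def loss_fn(w, b, X, y, lam1, lam2):
    m = X.shape[0]
    residual = y - (X @ w + b)
    return (np.sum(residual ** 2) / (2 * m)
            + lam1 * np.sum(np.abs(w))
            + 0.5 * lam2 * np.sum(w ** 2))

def total_grad(w, b, X, y, lam1, lam2):
    m = X.shape[0]
    residual = y - (X @ w + b)
    # MSE term + L1 subgradient + L2 gradient
    grad_w = -(X.T @ residual) / m + lam1 * np.sign(w) + lam2 * w
    grad_b = -np.sum(residual) / m  # bias is not regularized
    return grad_w, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, 0.0, -1.0])  # second feature is irrelevant
y = X @ true_w + 0.5 + 0.1 * rng.normal(size=200)

w, b = np.zeros(3), 0.0
lam1, lam2, lr = 0.01, 0.01, 0.1
losses = []
for _ in range(300):
    losses.append(loss_fn(w, b, X, y, lam1, lam2))
    grad_w, grad_b = total_grad(w, b, X, y, lam1, lam2)
    w, b = w - lr * grad_w, b - lr * grad_b

print("first loss:", losses[0], "last loss:", losses[-1])
print("w:", w, "b:", b)
```

With plain subgradient steps, the weight of the irrelevant feature hovers near zero rather than landing on it exactly; getting exact zeros usually needs a proximal (soft-thresholding) update instead.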


In [1]:
import numpy as np
import matplotlib.pyplot as plt


class LinearModel:

    def __init__(self, input_dim, output_dim, l1_lambda=None, l2_lambda=None):
        # X has shape (n, p): n rows of data, p features.
        # weight projects input_dim -> output_dim, shape (p, output_dim).
        # Weights are sampled from a normal distribution scaled by
        # sqrt(2 / input_dim) (He-style initialization).
        self.weight = np.random.randn(input_dim, output_dim) * np.sqrt(2. / input_dim)
        self.bias = np.zeros(output_dim)
        # Regularization strengths used in backward(); None disables the term.
        self.l1_lambda = l1_lambda
        self.l2_lambda = l2_lambda

    def __call__(self, x):
        self.x = x  # cache the input for backward()
        return x @ self.weight + self.bias

    def backward(self, loss_grad):
        # loss_grad is c = 2/n * (y_pred - y), shape (n, output_dim)
        # dL/dw = X^T @ c
        # dL/db = sum of c over the rows
        self.weight_grad = self.x.T @ loss_grad  # (p, n) @ (n, output_dim) -> (p, output_dim)
        self.bias_grad = loss_grad.sum(axis=0)  # axis=0 sums down the rows

        if self.l1_lambda is not None:
            # L1 subgradient: lambda_1 * sign(w)
            self.weight_grad += self.l1_lambda * np.sign(self.weight)

        if self.l2_lambda is not None:
            # L2 gradient: lambda_2 * w
            self.weight_grad += self.l2_lambda * self.weight

    def step(self, lr):
        # gradient descent update
        self.weight = self.weight - lr * self.weight_grad
        self.bias = self.bias - lr * self.bias_grad
    

class Loss:

    def __init__(self, l1_lambda, l2_lambda, model):
        self.l1_lambda = l1_lambda
        self.l2_lambda = l2_lambda
        self.model = model

    def __call__(self, y_true, y_pred):
        # y_true must have the same shape as y_pred, (n, output_dim),
        # otherwise the subtraction broadcasts to an (n, n) matrix.
        self.y_true, self.y_pred = y_true, y_pred
        # Plain mean (1/n, no extra 1/2), hence the factor 2 in get_loss_grad().
        loss = np.mean((y_true - y_pred) ** 2)
        # Penalty terms; a lambda of None disables the term, as in LinearModel.backward().
        if self.l1_lambda is not None:
            loss += self.l1_lambda * np.sum(np.abs(self.model.weight))
        if self.l2_lambda is not None:
            # Factor 1/2 so that the gradient is l2_lambda * weight, as in backward().
            loss += 0.5 * self.l2_lambda * np.sum(self.model.weight ** 2)
        return loss

    def get_loss_grad(self):
        """MSE = 1/n * sum((y - y_pred)**2), where y_pred = X @ W + b.
        dl/dw = dl/dy_pred * dy_pred/dw = 2/n * (y_pred - y) * X = c * X
        dl/db = dl/dy_pred * dy_pred/db = 2/n * (y_pred - y) * 1 = c * 1
        Only c = 2/n * (y_pred - y) is returned here; the multiplication
        with X happens inside LinearModel.backward().
        """
        n = self.y_true.shape[0]
        # Dividing by n averages the gradient over the batch. Without it the
        # gradient grows with the batch size and the update can overshoot the
        # minimum (the loss explodes) unless the learning rate is reduced.
        return 2 / n * (self.y_pred - self.y_true)

    
def one_epoch_train(i, X, y_true, model, loss):
    y_pred = model(X)  # forward pass
    loss_val = loss(y_true, y_pred)
    loss_grad = loss.get_loss_grad()
    print(f"loss: {loss_val}")
    if i % 10 == 0:
        # plot the current fit every 10 epochs (assumes a single feature in X)
        plt.plot(X, y_pred)

    model.backward(loss_grad)  # compute weight and bias gradients
    model.step(0.02)  # gradient descent update with learning rate 0.02


def train(X, y_true, model, loss, n_epoch):
    for i in range(n_epoch):
        one_epoch_train(i, X, y_true, model, loss)