## Mean Squared Error (MSE) Loss

**Definition:**  
The Mean Squared Error (MSE) measures the average squared difference between predictions and true values.  
It is mainly used in **regression problems**, where the goal is to predict continuous values.

---

**Formula (per batch of size \(n\)):**

$$
L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$

- \( \hat{y}_i \): model prediction  
- \( y_i \): true target  
- \( n \): number of samples (for multi-dimensional targets, the total number of elements being averaged)  
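
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on three hand-picked values. The helper name `mse` and the sample numbers are illustrative assumptions, not part of the notebook's own code.

```python
def mse(preds, targets):
    """Mean of squared differences over n paired samples."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / n


# Hypothetical toy data: the errors are 1, 0, and -2
preds = [2.0, 0.0, 3.0]
targets = [1.0, 0.0, 5.0]

# Squared errors are 1, 0, and 4, so the mean is 5/3
loss = mse(preds, targets)
print(loss)
```

The result is 5/3, roughly 1.667: the single error of size 2 contributes four fifths of the total.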

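Some texts scale the loss by \( \tfrac{1}{2} \) (applied to the sum of squared errors) or \( \tfrac{1}{2n} \) (applied to the mean), so that the factor of 2 produced by differentiating the square cancels. The \( \tfrac{1}{2n} \) version reads:

$$
L = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$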


---

**Gradient with respect to predictions (\(\hat{y}_i\)):**

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i)
$$

- If the \( \tfrac{1}{2n} \) version is used, the gradient becomes:  

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{1}{n} (\hat{y}_i - y_i)
$$
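
One way to sanity-check both gradient formulas is to compare them against central finite differences. The sketch below does this in plain Python; the function names, the `scale` parameter (1.0 for the plain mean, 0.5 for the \( \tfrac{1}{2n} \) version), and the toy data are assumptions made for illustration.

```python
def mse(preds, targets, scale=1.0):
    """scale * mean squared error; scale=0.5 gives the 1/(2n) convention."""
    n = len(preds)
    return scale * sum((p - t) ** 2 for p, t in zip(preds, targets)) / n


def mse_grad_analytic(preds, targets, scale=1.0):
    """Closed-form gradient: scale * 2/n * (y_hat_i - y_i)."""
    n = len(preds)
    return [scale * 2.0 * (p - t) / n for p, t in zip(preds, targets)]


def mse_grad_numeric(preds, targets, scale=1.0, eps=1e-6):
    """Central finite-difference estimate of dL/dy_hat_i."""
    grads = []
    for i in range(len(preds)):
        up, down = list(preds), list(preds)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(up, targets, scale) - mse(down, targets, scale)) / (2 * eps))
    return grads


# Hypothetical toy data
preds = [2.0, 0.0, 3.0]
targets = [1.0, 0.0, 5.0]

max_diffs = {}
for scale in (1.0, 0.5):
    analytic = mse_grad_analytic(preds, targets, scale)
    numeric = mse_grad_numeric(preds, targets, scale)
    max_diffs[scale] = max(abs(a - b) for a, b in zip(analytic, numeric))
    print(f"scale={scale}: analytic={analytic}, max |diff|={max_diffs[scale]:.2e}")
```

With `scale=0.5` every analytic entry is exactly half of its `scale=1.0` counterpart, which is the cancellation of the factor 2 described above.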

---

**Key Idea:**  
- MSE penalizes large errors more strongly because of the squared term.  
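- Taking the mean keeps the scale of the loss comparable across batch sizes, whereas a summed loss grows with \(n\).  
- The gradient is proportional to the error \( \hat{y}_i - y_i \), so a gradient-descent step (moving against the gradient) shrinks the gap between predictions and targets.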
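
The three points above can be illustrated with a few lines of plain Python. This is a rough sketch on made-up numbers; the helper `mse`, the error size of 0.5, and the learning rate of 0.1 are all assumptions chosen for the demonstration.

```python
def mse(preds, targets):
    """Mean of squared differences."""
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / n


# 1) Squaring: a 10x larger error gives a 100x larger loss
small = mse([1.0], [0.0])
large = mse([10.0], [0.0])
print("loss ratio:", large / small)

# 2) Mean vs. sum: same per-sample error (0.5), different batch sizes
mean_by_n, sum_by_n = {}, {}
for n in (4, 400):
    preds, targets = [0.5] * n, [0.0] * n
    mean_by_n[n] = mse(preds, targets)
    sum_by_n[n] = mean_by_n[n] * n
    print(f"n={n}: mean={mean_by_n[n]}, sum={sum_by_n[n]}")

# 3) One gradient-descent step on a single prediction (n = 1)
pred, target, lr = 3.0, 1.0, 0.1   # lr is an assumed learning rate
grad = 2.0 * (pred - target)       # dL/dy_hat for n = 1
new_pred = pred - lr * grad        # step against the gradient
print("gap before:", abs(pred - target), "gap after:", abs(new_pred - target))
```

The mean stays the same for both batch sizes while the sum scales with `n`, and the single update step moves the prediction closer to its target.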


In [1]:
import torch
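def mse_loss(y_hat, y, reduction='mean'):
    """Squared error between y_hat and y, reduced by 'mean', 'sum', or 'none'."""
    diff = y_hat - y
    loss = diff ** 2  # element-wise squared error

    if reduction == 'mean':
        return loss.mean()  # average over all elements
    elif reduction == 'sum':
        return loss.sum()
    elif reduction == 'none':
        return loss  # unreduced, same shape as the inputs
    else:
        raise ValueError('reduction must be "mean" | "sum" | "none"')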

In [2]:
torch.manual_seed(0)
y = torch.randn(4, 3)
y_hat = torch.randn(4, 3, requires_grad=True)

print("Our mean loss:", mse_loss(y_hat, y, reduction="mean").item())

import torch.nn as nn
criterion = nn.MSELoss()  # default reduction is "mean"
torch_loss = criterion(y_hat, y)
print(f'Torch Mean loss: {torch_loss.item()}')


Our mean loss: 2.255875825881958
Torch Mean loss: 2.255875825881958


# Gradient

In [3]:
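def mse_grad(y_hat, y, reduction='mean'):
    """Analytic gradient of the MSE loss with respect to y_hat."""
    diff = y_hat - y
    if reduction == 'mean':
        # n is the total number of elements, matching loss.mean()
        return (2 * diff) / y_hat.numel()
    elif reduction == 'sum':
        return 2 * diff
    elif reduction == 'none':
        # element-wise derivative of each unreduced squared error
        return 2 * diff
    else:
        raise ValueError("reduction must be 'mean' | 'sum' | 'none'")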

In [5]:
torch.manual_seed(0)
y = torch.randn(4, 3)
y_hat = torch.randn(4, 3, requires_grad=True)

# our loss
L = mse_loss(y_hat, y, reduction="mean")
print("Our loss:", L.item())

# torch loss
import torch.nn as nn
torch_loss = nn.MSELoss(reduction="mean")(y_hat, y)
print("Torch loss:", torch_loss.item())

# gradients: autograd
y_hat.grad = None
L.backward()
grad_autograd = y_hat.grad.detach().clone()

# gradients: manual
grad_manual = mse_grad(y_hat.detach(), y, reduction="mean")

print("Autograd grad (first row):", grad_autograd[0])
print("Manual grad (first row): ", grad_manual[0])
print("Max |diff|:", (grad_autograd - grad_manual).abs().max().item())


Our loss: 2.255875825881958
Torch loss: 2.255875825881958
Autograd grad (first row): tensor([-0.3996,  0.2323,  0.1846])
Manual grad (first row):  tensor([-0.3996,  0.2323,  0.1846])
Max |diff|: 2.9802322387695312e-08
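
The check above covers only `reduction="mean"`. As a possible follow-up, the sketch below repeats the comparison for `reduction="sum"`, where the analytic gradient is simply `2 * (y_hat - y)` with no division by the element count. It is self-contained and uses `torch.nn.functional.mse_loss` as the reference rather than the notebook's own helpers; the variable names are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y = torch.randn(4, 3)
y_hat = torch.randn(4, 3, requires_grad=True)

# Reference: autograd gradient of the summed squared error
loss_sum = F.mse_loss(y_hat, y, reduction="sum")
loss_sum.backward()
grad_autograd = y_hat.grad.detach().clone()

# Analytic gradient for the sum reduction
grad_manual = 2 * (y_hat.detach() - y)

max_diff = (grad_autograd - grad_manual).abs().max().item()
print("Max |diff| (sum reduction):", max_diff)
```

Dividing `grad_manual` by `y_hat.numel()` would recover the gradient for the mean reduction, which is the relationship between the two branches of `mse_grad`.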
