# Linear Layer Backward Pass (Gradients)

The linear (fully connected) layer computes:

$$
y = xW^T + b
$$

- $x \in \mathbb{R}^{N \times d_{in}}$ (input batch)  
- $W \in \mathbb{R}^{d_{out} \times d_{in}}$ (weights)  
- $b \in \mathbb{R}^{d_{out}}$ (bias)  
- $y \in \mathbb{R}^{N \times d_{out}}$ (output)  

---

## Backward Derivatives

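Let $dY = \frac{\partial L}{\partial y} \in \mathbb{R}^{N \times d_{out}}$ be the upstream gradient of a scalar loss $L$, and write $dX$, $dW$, $db$ for the gradients of $L$ with respect to $x$, $W$, $b$.

1. Gradient w.r.t. the input: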
$$
dX = dY \cdot W
$$

2. Gradient w.r.t. the weights:
$$
dW = dY^T \cdot x
$$

3. Gradient w.r.t. the bias (a sum over the batch dimension, where $dY_i$ is the $i$-th row of $dY$):
$$
db = \sum_{i=1}^N dY_i
$$

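Each gradient has the same shape as the quantity it belongs to. $dX$ is passed back to the preceding layer, while $dW$ and $db$ are used to update this layer's parameters.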
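
The three formulas follow from writing the forward pass element-wise and applying the chain rule. A short derivation, where $L$ is any scalar loss, $dY_{ij} = \partial L / \partial y_{ij}$, and the index names $i$, $j$, $k$ are introduced here for illustration:

$$
\begin{aligned}
y_{ij} &= \sum_{k} x_{ik} W_{jk} + b_j \\
dX_{ik} &= \sum_{j} dY_{ij}\,\frac{\partial y_{ij}}{\partial x_{ik}} = \sum_{j} dY_{ij} W_{jk} = (dY \cdot W)_{ik} \\
dW_{jk} &= \sum_{i} dY_{ij}\,\frac{\partial y_{ij}}{\partial W_{jk}} = \sum_{i} dY_{ij} x_{ik} = (dY^T \cdot x)_{jk} \\
db_{j} &= \sum_{i} dY_{ij}\,\frac{\partial y_{ij}}{\partial b_{j}} = \sum_{i} dY_{ij}
\end{aligned}
$$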


In [1]:
import numpy as np

# Dummy data
N, d_in, d_out = 4, 3, 2
x = np.random.randn(N, d_in)
W = np.random.randn(d_out, d_in)
b = np.random.randn(d_out)

# Forward pass
y = x @ W.T + b

# Assume upstream gradient (from next layer / loss)
dY = np.random.randn(N, d_out)

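# Backward pass
dX = dY @ W          # (N, d_out) @ (d_out, d_in) -> (N, d_in)
dW = dY.T @ x        # (d_out, N) @ (N, d_in)     -> (d_out, d_in)
db = dY.sum(axis=0)  # sum over the batch         -> (d_out,)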

print("dX shape:", dX.shape)
print("dW shape:", dW.shape)
print("db shape:", db.shape)


dX shape: (4, 3)
dW shape: (2, 3)
db shape: (2,)
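
Matching shapes do not prove the values are right. One common way to test the formulas themselves is a finite-difference gradient check. The sketch below is not part of the original notebook: the helper names `linear_forward`, `linear_backward`, and `numerical_grad`, the step size `eps`, and the surrogate loss `sum(y * dY)` (chosen so that its gradient with respect to `y` is exactly `dY`) are all assumptions made for illustration.

```python
import numpy as np

def linear_forward(x, W, b):
    return x @ W.T + b

def linear_backward(dY, x, W):
    # Manual gradients from the formulas above
    dX = dY @ W
    dW = dY.T @ x
    db = dY.sum(axis=0)
    return dX, dW, db

def numerical_grad(f, p, eps=1e-6):
    """Central-difference gradient of the scalar f() w.r.t. array p.

    p is perturbed in place and restored after each evaluation.
    """
    grad = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        f_plus = f()
        p[idx] = old - eps
        f_minus = f()
        p[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
N, d_in, d_out = 4, 3, 2
x = rng.normal(size=(N, d_in))
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
dY = rng.normal(size=(N, d_out))

def loss():
    # Surrogate scalar loss whose gradient w.r.t. y is dY
    return np.sum(linear_forward(x, W, b) * dY)

dX, dW, db = linear_backward(dY, x, W)
max_err = max(
    np.abs(dX - numerical_grad(loss, x)).max(),
    np.abs(dW - numerical_grad(loss, W)).max(),
    np.abs(db - numerical_grad(loss, b)).max(),
)
print("max abs error:", max_err)
```

The error should be tiny, limited only by floating-point rounding, because this surrogate loss is linear in each of `x`, `W`, and `b`.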


# Verifying with PyTorch Autograd
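We can check the gradient shapes against PyTorch’s `autograd`. The cell below draws fresh random tensors and uses `loss = y.sum()`, which makes the upstream gradient a matrix of ones, so it confirms the shapes rather than the values computed in the NumPy cell.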


In [2]:
import torch

# Dummy data (reuses N, d_in, d_out from the previous cell)
x = torch.randn(N, d_in, requires_grad=True)
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)

# Forward
y = x @ W.T + b

# Dummy loss (sum of outputs)
loss = y.sum()

# Backward
loss.backward()

print("PyTorch dX:", x.grad.shape)
print("PyTorch dW:", W.grad.shape)
print("PyTorch db:", b.grad.shape)


PyTorch dX: torch.Size([4, 3])
PyTorch dW: torch.Size([2, 3])
PyTorch db: torch.Size([2])
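
Because the loss in the PyTorch cell is `y.sum()`, the upstream gradient entering the layer is a matrix of ones. Plugging that into the manual formulas gives simple closed forms, which is what autograd's values should equal in that setting. A NumPy-only sketch, with the seed and variable names chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out = 4, 3, 2
x = rng.normal(size=(N, d_in))
W = rng.normal(size=(d_out, d_in))

# loss = y.sum()  =>  dL/dy is 1 for every entry of y
dY = np.ones((N, d_out))

dX = dY @ W          # every row equals the column sums of W
dW = dY.T @ x        # every row equals the column sums of x
db = dY.sum(axis=0)  # every entry equals the batch size N

ok_X = np.allclose(dX, np.tile(W.sum(axis=0), (N, 1)))
ok_W = np.allclose(dW, np.tile(x.sum(axis=0), (d_out, 1)))
ok_b = np.allclose(db, np.full(d_out, N))
print("dX rows = column sums of W:", ok_X)
print("dW rows = column sums of x:", ok_W)
print("db entries = N:", ok_b)
```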


# Linear Layer + MSE Loss: Forward & Backward

We use a linear (fully connected) layer:
$$
y = x W^{\top} + b
$$
- $x \in \mathbb{R}^{N \times d_{in}},\;\; W \in \mathbb{R}^{d_{out} \times d_{in}},\;\; b \in \mathbb{R}^{d_{out}},\;\; y \in \mathbb{R}^{N \times d_{out}}$.

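We choose a **mean squared error (MSE)** loss against targets $t \in \mathbb{R}^{N \times d_{out}}$, averaged over the batch and scaled by $\tfrac{1}{2}$ so that the factor of 2 from differentiating the square cancels: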
$$
L = \frac{1}{2N}\sum_{i=1}^{N}\lVert y_i - t_i \rVert_2^2
$$

---

## Backward (using the chain rule)

First, the loss gradient w.r.t. layer output:
$$
\frac{\partial L}{\partial y} = dY = \frac{1}{N}(y - t)
$$
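
This follows from differentiating the loss with respect to a single output entry $y_{ij}$. Only the term with the matching indices contributes (the summation indices $m$, $n$ are introduced here for illustration):

$$
L = \frac{1}{2N}\sum_{m=1}^{N}\sum_{n=1}^{d_{out}} (y_{mn} - t_{mn})^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial y_{ij}} = \frac{1}{2N}\cdot 2\,(y_{ij} - t_{ij}) = \frac{1}{N}(y_{ij} - t_{ij})
$$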

Then the linear layer gradients:
$$
\begin{aligned}
dX &= dY \cdot W \\
dW &= dY^{\top} \cdot x \\
db &= \sum_{i=1}^{N} dY_{i}
\end{aligned}
$$

- $dX=\frac{\partial L}{\partial x}$ is needed to pass gradients to earlier layers.  
- $dW=\frac{\partial L}{\partial W}$ and $db=\frac{\partial L}{\partial b}$ are used to update parameters.


In [3]:
import numpy as np

# Reproducibility
rng = np.random.default_rng(0)

# Dummy data
N, d_in, d_out = 5, 4, 3
x = rng.normal(size=(N, d_in))
t = rng.normal(size=(N, d_out))

# Parameters
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=(d_out,))

# Forward 
y = x @ W.T + b 

# MSE loss
residual = y - t
L = 0.5 / N * np.sum(residual**2)

# Backward
dY = (1.0 / N) * residual
dX = dY @ W
dW = dY.T @ x
db = dY.sum(axis=0)

print(f"Loss: {L:.6f}")
print("dX:", dX.shape, " dW:", dW.shape, " db:", db.shape)


Loss: 7.859075
dX: (5, 4)  dW: (3, 4)  db: (3,)


In [4]:
import torch

# Convert NumPy -> Torch with the same values
# (cast to float32; the NumPy gradients above are float64)
xt = torch.tensor(x, dtype=torch.float32, requires_grad=True)
Wt = torch.tensor(W, dtype=torch.float32, requires_grad=True)
bt = torch.tensor(b, dtype=torch.float32, requires_grad=True)
tt = torch.tensor(t, dtype=torch.float32)

# Forward
yt = xt @ Wt.T + bt
Lt = 0.5 / N * torch.sum((yt - tt)**2)

# Backward
Lt.backward()

# Compare (allow small numerical diffs)
def np_allclose(a, b):
    return np.allclose(a, b, atol=1e-6, rtol=1e-5)

ok_W = np_allclose(Wt.grad.detach().numpy(), dW)
ok_b = np_allclose(bt.grad.detach().numpy(), db)
ok_X = np_allclose(xt.grad.detach().numpy(), dX)

print(f"Loss torch: {Lt.item():.6f}")
print("Match dW:", ok_W, "| Match db:", ok_b, "| Match dX:", ok_X)


Loss torch: 7.859075
Match dW: True | Match db: True | Match dX: True
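
Since $dW$ and $db$ exist to update the parameters, a few steps of plain gradient descent make that use concrete. This is a minimal sketch and not part of the original notebook: the helper name `mse_loss_and_grads`, the learning rate `lr`, and the number of steps are assumptions chosen for illustration.

```python
import numpy as np

def mse_loss_and_grads(x, t, W, b):
    """Return L = 1/(2N) * sum((y - t)^2) and the gradients dW, db."""
    N = x.shape[0]
    residual = x @ W.T + b - t
    loss = 0.5 / N * np.sum(residual**2)
    dY = residual / N
    return loss, dY.T @ x, dY.sum(axis=0)

rng = np.random.default_rng(0)
N, d_in, d_out = 5, 4, 3
x = rng.normal(size=(N, d_in))
t = rng.normal(size=(N, d_out))
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=(d_out,))

lr = 0.01  # hypothetical learning rate, kept small for stability
losses = []
for step in range(100):
    loss, dW, db = mse_loss_and_grads(x, t, W, b)
    losses.append(loss)
    # Gradient descent: move each parameter against its gradient
    W -= lr * dW
    b -= lr * db

print(f"loss before: {losses[0]:.6f}  after: {losses[-1]:.6f}")
```

With a sufficiently small learning rate the loss should shrink from step to step, since this loss is a convex quadratic in `W` and `b`.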
