# Linear (Fully Connected) Layer — Forward Pass

**Goal:** Implement the forward pass of a linear (fully connected) layer.  
This layer applies an **affine** transformation (a linear map followed by a bias shift) to its inputs.

---

## Definitions & Shapes

- Batch input **X** shape: `(N, d_in)`
- Weights **W** shape: `(d_in, d_out)`
- Bias **b** shape: `(d_out,)`
- Output **Y** shape: `(N, d_out)`

---

## Formulas

**Single example** (vector x):

$$
y = W^{\top} x + b
$$

**Batch form** (matrix X):

$$
Y = XW + \mathbf{1}\, b^{\top} \;\equiv\; Y = XW + b
$$

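Here, $ \mathbf{1} $ is an all-ones column vector of length $ N $, so $ \mathbf{1}\, b^{\top} $ copies $ b $ into every row; in code, the bias is added by **broadcasting** instead of building that matrix.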
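
As a quick sanity check of the batch formula, the sketch below builds the explicit $ \mathbf{1}\, b^{\top} $ term and compares it with the broadcast form. The sizes `N = 5`, `d_in = 3`, `d_out = 4` and the variable names are illustrative choices, not values fixed by the text above.

```python
import torch

torch.manual_seed(0)
N, d_in, d_out = 5, 3, 4          # illustrative sizes (assumption)
X = torch.randn(N, d_in)
W = torch.randn(d_in, d_out)
b = torch.randn(d_out)

# Explicit form: all-ones column (N, 1) times b^T (1, d_out) gives an (N, d_out) matrix
ones = torch.ones(N, 1)
Y_explicit = X @ W + ones @ b.unsqueeze(0)

# Broadcast form: b with shape (d_out,) is stretched across the N rows
Y_broadcast = X @ W + b

print(Y_explicit.shape)
print(torch.allclose(Y_explicit, Y_broadcast))
```

Both forms give the same `(N, d_out)` result; the broadcast version simply avoids materializing the repeated bias matrix.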

**PyTorch convention** (`nn.Linear(in_features, out_features)`) stores the weight transposed, $ \tilde W = W^{\top} $ with shape `(d_out, d_in)`, and computes:

$$
Y = X\, \tilde W^{\top} + b
$$
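
The two conventions can be compared directly. This is a minimal sketch, assuming a small `nn.Linear(3, 4)` layer chosen only for illustration: it reads the stored weight, transposes it into the `(d_in, d_out)` layout used in the definitions above, and checks that `X @ W + b` reproduces the layer's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 3, 4                 # illustrative sizes (assumption)
layer = nn.Linear(d_in, d_out)
X = torch.randn(5, d_in)

W_tilde = layer.weight.detach()    # stored as (d_out, d_in)
b = layer.bias.detach()            # (d_out,)
W = W_tilde.t()                    # (d_in, d_out), the layout used in the formulas

Y_layer = layer(X).detach()        # what nn.Linear computes
Y_manual = X @ W + b               # Y = X W + b with W = W_tilde^T

print(W_tilde.shape, W.shape)
print(torch.allclose(Y_layer, Y_manual, atol=1e-6))
```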

---

## Intuition

Each output dimension $ j $ is a weighted sum of inputs plus a bias:

$$
Y_{:,j} = X \, W_{:,j} + b_j
$$
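
The column-wise view can be checked numerically. The sketch below, with illustrative sizes and an arbitrarily chosen column index `j = 2`, computes one output column from a single weight column and compares it with the matching column of the full product.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 3)      # (N, d_in)
W = torch.randn(3, 4)      # (d_in, d_out)
b = torch.randn(4)         # (d_out,)

Y = X @ W + b              # full batch output, (N, d_out)

j = 2                      # arbitrary output dimension (assumption)
col_j = X @ W[:, j] + b[j] # weighted sum of inputs for output j, shape (N,)

print(col_j.shape)
print(torch.allclose(Y[:, j], col_j, atol=1e-6))
```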

---

## Complexity

- Time: $ \mathcal{O}(N \cdot d_{\text{in}} \cdot d_{\text{out}}) $ 
- Params: $ d_{\text{in}} \cdot d_{\text{out}} + d_{\text{out}} $
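
To make the counts concrete, the following sketch compares the parameter formula with what `nn.Linear` actually allocates and evaluates the multiply-add count of the matrix product. The sizes are illustrative assumptions.

```python
import torch.nn as nn

N, d_in, d_out = 5, 3, 4                      # illustrative sizes (assumption)
layer = nn.Linear(d_in, d_out)

# Parameters actually held by the layer: weight (d_out, d_in) plus bias (d_out,)
n_params = sum(p.numel() for p in layer.parameters())
expected_params = d_in * d_out + d_out

# Multiply-adds in X @ W: one per (row, input, output) triple
macs = N * d_in * d_out

print(n_params, expected_params)  # 16 16
print(macs)                       # 60
```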

---

## Shape Checklist

- `X: (N, d_in)`
- `W: (d_in, d_out)` → use `X @ W` (a PyTorch-style weight of shape `(d_out, d_in)` needs `X @ weight.T`)
- `b: (d_out,)` → broadcasts over batch
- `Y: (N, d_out)`
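
A common way to trip over this checklist is to mix the two weight layouts. The sketch below, using illustrative sizes, shows that multiplying by a `(d_out, d_in)` weight without transposing raises a `RuntimeError`, while the transposed product has the expected shape.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 3)        # (N, d_in)
W_pt = torch.randn(4, 3)     # PyTorch-style weight, (d_out, d_in)

# Inner dimensions disagree (3 vs 4), so the untransposed product fails
try:
    X @ W_pt
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Transposing restores the (d_in, d_out) layout from the checklist
Y = X @ W_pt.t()

print(mismatch_raised)
print(Y.shape)
```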


In [1]:
# Imports

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import numpy as np
import matplotlib.pyplot as plt

import math

In [9]:
def linear_forward(X, weight, bias=None):
    """
    X:       (N, d_in)
    weight:  (d_out, d_in)    # same as nn.Linear
    bias:    (d_out,) or None
    returns: (N, d_out)
    """
    Y = X @ weight.t()
    if bias is not None:
        Y = Y + bias
    return Y


In [10]:
torch.manual_seed(0)

# Check that gradients flow back to the input, weight, and bias
X = torch.randn(5, 3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)   # (d_out, d_in), as in nn.Linear
b = torch.randn(4, requires_grad=True)
loss = linear_forward(X, W, b).pow(2).mean()
loss.backward()
print(X.grad.shape, W.grad.shape, b.grad.shape)
 

torch.Size([5, 3]) torch.Size([4, 3]) torch.Size([4])
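
Beyond checking gradient shapes, the function can be compared with PyTorch's built-in `torch.nn.functional.linear`, which uses the same `(d_out, d_in)` weight layout. This is a minimal sketch that redefines `linear_forward` so it runs on its own; the tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def linear_forward(X, weight, bias=None):
    # Same logic as the cell above: weight has shape (d_out, d_in)
    Y = X @ weight.t()
    if bias is not None:
        Y = Y + bias
    return Y

torch.manual_seed(0)
X = torch.randn(5, 3)
W = torch.randn(4, 3)
b = torch.randn(4)

# With bias
same_with_bias = torch.allclose(linear_forward(X, W, b), F.linear(X, W, b), atol=1e-6)
# Without bias
same_no_bias = torch.allclose(linear_forward(X, W), F.linear(X, W), atol=1e-6)

print(same_with_bias, same_no_bias)
```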


In [4]:
class MyLinear(nn.Module):
    """
    A manual Linear layer:
      X: (N, d_in)
      weight: (d_out, d_in)
      bias: (d_out,)
      forward: Y = X @ weight.T + bias
    """
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Parameters
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter("bias", None)

        self.reset_parameters()

    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, X):
        Y = X.matmul(self.weight.T)
        if self.bias is not None:
            Y = Y + self.bias
        return Y


In [5]:

# random batch
torch.manual_seed(0)
X = torch.randn(4, 3)

# reference layer
ref = nn.Linear(3, 2, bias=True)

# our layer with same params
mine = MyLinear(3, 2, bias=True)
with torch.no_grad():
    mine.weight.copy_(ref.weight) 
    mine.bias.copy_(ref.bias)

# outputs should match
Y_ref  = ref(X)
Y_mine = mine(X)
print(torch.allclose(Y_ref, Y_mine, atol=1e-6)) 


True


In [6]:
target = torch.randn(4, 2)
opt = torch.optim.SGD(mine.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(mine(X), target)
    loss.backward()
    opt.step()

# Note: this is the loss computed before the final optimizer step
print("final loss:", loss.item())

final loss: 0.17782405018806458


In [7]:
class MLP(nn.Module):
    def __init__(self, d_in, hidden, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d_out),
        )
    def forward(self, x):
        return self.net(x)


In [8]:
model = MLP(d_in=3, hidden=8, d_out=2)
out = model(torch.randn(5, 3))
out

tensor([[-0.2854,  0.2482],
        [-0.3773,  0.0953],
        [-0.3697, -0.1341],
        [-0.4097,  0.1283],
        [-0.4590,  0.0427]], grad_fn=<AddmmBackward0>)