## Batch Normalization

**Goal:** Normalize activations within a mini-batch to stabilize and speed up training.

### Training Mode

For a mini-batch of $m$ values of a single feature, $x = (x_1, \dots, x_m)$ (statistics are computed independently for each feature):

1. **Batch mean**  
   $$ \mu_B = \frac{1}{m} \sum_{i=1}^m x_i $$

2. **Batch variance**  
   $$ \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 $$

3. **Normalize**  
   $$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$

4. **Scale and shift (learned parameters)**  
   $$ y_i = \gamma \hat{x}_i + \beta $$

- $\gamma$: scaling factor (learned)  
- $\beta$: shift factor (learned)  
- $\epsilon$: small constant for numerical stability  
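
The four steps above can be traced on a tiny batch. The following is a minimal sketch, assuming a made-up batch of 4 samples with 2 features and the usual initial values $\gamma = 1$, $\beta = 0$; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical mini-batch: m = 4 samples, 2 features (values chosen for easy arithmetic).
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
eps = 1e-5
gamma = np.ones(2)   # identity scale (assumed initial value)
beta = np.zeros(2)   # zero shift (assumed initial value)

mu_B = x.mean(axis=0)                      # step 1: per-feature mean -> [2.5, 25.0]
var_B = ((x - mu_B) ** 2).mean(axis=0)     # step 2: divides by m, not m - 1 -> [1.25, 125.0]
x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # step 3: normalize
y = gamma * x_hat + beta                   # step 4: scale and shift

print("mean per feature:", y.mean(axis=0))
print("var per feature: ", y.var(axis=0))
```

Although the two features differ in scale by a factor of 10, both columns of `y` end up with mean close to 0 and variance close to 1 (slightly below 1 because of $\epsilon$).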

### Inference Mode
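- Use the running averages of $\mu$ and $\sigma^2$ accumulated during training instead of batch statistics, so the output for one sample no longer depends on the rest of the batch: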
  $$ y_i = \gamma \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$
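
Because every quantity in the inference formula is fixed, the layer reduces to a per-feature affine map $y = a x + b$ with $a = \gamma / \sqrt{\sigma^2 + \epsilon}$ and $b = \beta - a \mu$. The sketch below checks this numerically; the statistics and parameters are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))

# Hypothetical frozen statistics and learned parameters for 3 features.
running_mean = np.array([0.5, -1.0, 2.0])
running_var = np.array([4.0, 0.25, 1.0])
gamma = np.array([1.5, 0.5, 2.0])
beta = np.array([0.1, 0.0, -0.3])
eps = 1e-5

# Direct evaluation of the inference formula.
y_direct = gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

# Same computation folded into one scale and one offset per feature.
scale = gamma / np.sqrt(running_var + eps)
offset = beta - scale * running_mean
y_folded = scale * x + offset

print(np.allclose(y_direct, y_folded))
```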

### Benefits
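- Intended to reduce *internal covariate shift*, the change in a layer's input distribution as earlier layers update (the original motivation; how much this explains the benefit is debated)  
- Enables higher learning rates  
- Adds a mild regularization effect, since each sample's normalization depends on the other samples in its mini-batch  
- Improves convergence speed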


## Batch Normalization Forward (NumPy Implementation)

We implement BatchNorm manually for both **training** and **inference**.

- During **training**, batch statistics ($\mu_B, \sigma_B^2$) are used.  
- During **inference**, running averages of mean/variance are used instead.  


In [1]:
import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones((1, dim))   # scale
        self.beta = np.zeros((1, dim))   # shift
        self.momentum = momentum
        self.eps = eps

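        # running stats (for inference)
        # NOTE: both start at zero here for simplicity; nn.BatchNorm1d starts running_var at 1.
        # After only one update they are still far from the batch statistics,
        # which is why the inference output below differs so much from the training output.
        self.running_mean = np.zeros((1, dim))
        self.running_var = np.zeros((1, dim))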

    def forward(self, x, training=True):
        if training:
            batch_mean = np.mean(x, axis=0, keepdims=True)
            batch_var = np.var(x, axis=0, keepdims=True)

            # normalize
            x_hat = (x - batch_mean) / np.sqrt(batch_var + self.eps)
            out = self.gamma * x_hat + self.beta

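            # update running stats as exponential moving averages
            # (momentum weights the OLD value here; PyTorch's momentum weights the NEW value,
            #  so momentum=0.9 here corresponds to momentum=0.1 there)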
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch_mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch_var

            return out
        else:
            # inference: use running averages
            x_hat = (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
            out = self.gamma * x_hat + self.beta
            return out


# Example usage
np.random.seed(0)
x = np.random.randn(4, 3)  # batch of 4, 3 features
bn = BatchNorm(dim=3)

print("Training forward:")
print(bn.forward(x, training=True))

print("\nInference forward:")
print(bn.forward(x, training=False))


Training forward:
[[ 0.59662445 -0.2123117   0.67702389]
 [ 1.2697554   1.67649844 -1.39017913]
 [-0.55240527 -0.92221048 -0.46643536]
 [-1.31397458 -0.54197626  1.17959061]]

Inference forward:
[[ 7.27530483  1.39868512  3.15781239]
 [ 9.40374094  7.37118196 -3.37892902]
 [ 3.64207961 -0.84604458 -0.45794154]
 [ 1.23400192  0.35627195  4.74698802]]
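
The inference output above is much larger than the training output because the running statistics start at zero and have received only one update. The following sketch, which re-implements just the moving-average update from the class above (same batch, same `momentum=0.9`), suggests how the gap shrinks as more updates are applied; repeating the same batch is an artificial setup used only to isolate the effect.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4, 3)          # same batch as the example above
momentum, eps = 0.9, 1e-5

batch_mean = x.mean(axis=0)
batch_var = x.var(axis=0)
y_train = (x - batch_mean) / np.sqrt(batch_var + eps)   # gamma = 1, beta = 0

running_mean = np.zeros(3)         # zero init, as in the class above
running_var = np.zeros(3)

gaps = {}
for step in range(1, 101):
    # Same exponential moving average as BatchNorm.forward(training=True).
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    if step in (1, 10, 100):
        y_infer = (x - running_mean) / np.sqrt(running_var + eps)
        gaps[step] = np.abs(y_infer - y_train).max()
        print(f"step {step:3d}: max |inference - training| = {gaps[step]:.4f}")
```

After $k$ updates on identical batches the running value equals $(1 - 0.9^k)$ times the batch statistic, so the two modes agree only once enough training steps have been taken.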


## Batch Normalization Forward (PyTorch)

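PyTorch provides `nn.BatchNorm1d`, `nn.BatchNorm2d`, and `nn.BatchNorm3d` for 
different input shapes.

- `BatchNorm1d`: for fully-connected layers (2D input: `[batch, features]`) or sequences (3D input: `[batch, channels, length]`)
- `BatchNorm2d`: for CNNs with images (4D input: `[batch, channels, H, W]`)
- `BatchNorm3d`: for volumetric data (5D input: `[batch, channels, D, H, W]`)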

The layer automatically:
1. Computes batch mean and variance during training.  
2. Uses running averages during inference.  
3. Maintains learnable parameters $\gamma$ and $\beta$.

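Formula ($\mu$ and $\sigma^2$ are the batch statistics in training mode and the running averages in evaluation mode):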
$$ y_i = \gamma \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$
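
As a sanity check on the formula, the training-mode output of `nn.BatchNorm1d` can be reproduced by hand. This is a minimal sketch assuming a freshly constructed layer with default settings; the seed and input shape are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)               # arbitrary seed, for reproducibility only
x = torch.randn(4, 3)
bn = nn.BatchNorm1d(num_features=3)

bn.train()
y_layer = bn(x)

# Manual version: per-feature batch mean and biased (divide-by-m) variance.
mu = x.mean(dim=0)
var = ((x - mu) ** 2).mean(dim=0)
y_manual = bn.weight * (x - mu) / torch.sqrt(var + bn.eps) + bn.bias

print(torch.allclose(y_layer, y_manual, atol=1e-5))
```

Here `bn.weight` and `bn.bias` play the roles of $\gamma$ and $\beta$.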


In [2]:
import torch
import torch.nn as nn

# Example: BatchNorm for fully connected layer
bn = nn.BatchNorm1d(num_features=3)

# Fake input: batch of 4 samples, 3 features each
x = torch.randn(4, 3)

# Training mode
bn.train()
y_train = bn(x)
print("Training forward:\n", y_train)

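# Inference mode: eval() switches BN to its running stats but does not disable autograd,
# so the output still carries a grad_fn; wrap the call in torch.no_grad() to avoid that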
bn.eval()
y_test = bn(x)
print("\nInference forward:\n", y_test)


Training forward:
 tensor([[-9.7447e-01,  2.0465e-05,  1.9892e-01],
        [-8.9459e-01, -1.4601e+00, -1.2333e+00],
        [ 1.4361e+00,  1.3634e+00, -4.5864e-01],
        [ 4.3300e-01,  9.6719e-02,  1.4930e+00]],
       grad_fn=<NativeBatchNormBackward0>)

Inference forward:
 tensor([[ 0.3533, -0.2773,  0.3221],
        [ 0.3810, -1.9022, -0.4627],
        [ 1.1895,  1.2399, -0.0382],
        [ 0.8415, -0.1697,  1.0313]], grad_fn=<NativeBatchNormBackward0>)
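
The two outputs differ because, after a single training step, the running statistics are still close to their initial values (mean 0, variance 1). The sketch below reconstructs them by hand under the default `momentum=0.1`. Note two details of PyTorch's update: `momentum` weights the new batch statistic, and the running variance is updated with the unbiased (divide by $m - 1$) estimate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                   # arbitrary seed, for reproducibility only
x = torch.randn(4, 3)
bn = nn.BatchNorm1d(num_features=3)    # defaults: momentum=0.1, eps=1e-5

bn.train()
with torch.no_grad():
    bn(x)                              # one training-mode call updates the running stats

m = x.shape[0]
batch_mean = x.mean(dim=0)
biased_var = ((x - batch_mean) ** 2).mean(dim=0)
unbiased_var = biased_var * m / (m - 1)

# Running stats start at mean = 0 and var = 1.
expected_mean = (1 - bn.momentum) * torch.zeros(3) + bn.momentum * batch_mean
expected_var = (1 - bn.momentum) * torch.ones(3) + bn.momentum * unbiased_var

bn.eval()
with torch.no_grad():
    y_eval = bn(x)
    y_manual = bn.weight * (x - expected_mean) / torch.sqrt(expected_var + bn.eps) + bn.bias

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))
print(torch.allclose(bn.running_var, expected_var, atol=1e-6))
print(torch.allclose(y_eval, y_manual, atol=1e-5))
```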


## BatchNorm in a CNN (Conv2d → BN → ReLU)

**Why BN here?**  
In CNNs, BatchNorm is applied **per channel** after a convolution and before the nonlinearity:

> `Conv2d → BatchNorm2d → ReLU → (optional MaxPool / Dropout)`

For a 4D input $x \in \mathbb{R}^{N \times C \times H \times W}$, `BatchNorm2d(C)` normalizes each channel using all $N \cdot H \cdot W$ activations of that channel in the batch:

- **Training stats** (per channel):  
  $$\mu_c = \frac{1}{N\!HW}\sum_{n,h,w} x_{n,c,h,w}, \quad
    \sigma_c^2 = \frac{1}{N\!HW}\sum_{n,h,w} (x_{n,c,h,w}-\mu_c)^2$$
- **Normalize, then scale/shift**:  
  $$\hat{x}_{n,c,h,w}=\frac{x_{n,c,h,w}-\mu_c}{\sqrt{\sigma_c^2+\epsilon}}, \quad
    y_{n,c,h,w}=\gamma_c \hat{x}_{n,c,h,w}+\beta_c$$

During **inference**, running means/vars accumulated during training are used.
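
The per-channel formulas amount to reducing over the batch and both spatial axes while keeping the channel axis. A minimal NumPy sketch of the training-mode computation, assuming $\gamma_c = 1$, $\beta_c = 0$ and a randomly generated input with arbitrary shape:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 8, 3, 5, 5
x = rng.normal(loc=2.0, scale=3.0, size=(N, C, H, W))   # deliberately not zero-mean
gamma = np.ones((1, C, 1, 1))    # assumed initial scale
beta = np.zeros((1, C, 1, 1))    # assumed initial shift
eps = 1e-5

# Reduce over N, H, W (axes 0, 2, 3); one statistic per channel remains.
mu_c = x.mean(axis=(0, 2, 3), keepdims=True)
var_c = x.var(axis=(0, 2, 3), keepdims=True)

x_hat = (x - mu_c) / np.sqrt(var_c + eps)
y = gamma * x_hat + beta

print("stat shape:", mu_c.shape)
print("per-channel mean:", y.mean(axis=(0, 2, 3)))
print("per-channel var: ", y.var(axis=(0, 2, 3)))
```

Keeping the reduced axes with `keepdims=True` lets the `(1, C, 1, 1)` statistics broadcast against the `(N, C, H, W)` input.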


In [3]:
# PyTorch CNN with BatchNorm2d — full working demo
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a tiny CNN
class SmallBNConvNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(
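            # bias=False: BatchNorm's beta already provides a per-channel shift,
            # so a separate conv bias would be redundant
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1, bias=False),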
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, num_classes)  # for 32x32 inputs: two 2x2 pools give 32x32 -> 16x16 -> 8x8, with 32 channels
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        logits = self.head(x)
        return logits

# Create dummy image batch
torch.manual_seed(0)
N, C, H, W = 8, 3, 32, 32
X = torch.randn(N, C, H, W)
y = torch.randint(0, 10, (N,)) # dummy labels for 10 classes

# Instantiate model, loss, optimizer 
model = SmallBNConvNet(in_channels=C, num_classes=10)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

print(model)

# TRAINING mode
model.train()
logits_train = model(X)
loss = criterion(logits_train, y)
opt.zero_grad()
loss.backward()
opt.step()

print(f"\n[TRAIN] logits shape: {logits_train.shape}, loss: {loss.item():.4f}")

# EVAL/INFERENCE mode
model.eval()
with torch.no_grad():
    logits_eval = model(X)  # same X just to show it runs; normally we'd use new data
print(f"[EVAL ] logits shape: {logits_eval.shape}")

# check a BN layer's running stats
bn1 = model.block1[1]
print(f"\nBN1 running_mean shape: {bn1.running_mean.shape}")
print(f"BN1 running_var   shape: {bn1.running_var.shape}")


SmallBNConvNet(
  (block1): Sequential(
    (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (block2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=2048, out_features=10, bias=True)
  )
)

[TRAIN] logits shape: torch.Size([8, 10]), loss: 2.6602
[EVAL ] logits shape: torch.Size([8, 10])

BN1 running_mean shape: torch.Size([16])
BN1 running_var   shape: torch.Size([16])
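
The convolutions in the model above are created with `bias=False`. A likely reason is that a per-channel bias added before BatchNorm is cancelled when the channel mean is subtracted. The sketch below illustrates this with a small NumPy stand-in for training-mode BatchNorm2d; `batchnorm2d_train` and the random "conv output" are hypothetical helpers for this demonstration, not part of the model.

```python
import numpy as np

def batchnorm2d_train(x, eps=1e-5):
    """Training-mode per-channel normalization (gamma = 1, beta = 0 assumed)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
conv_out = rng.normal(size=(8, 16, 4, 4))   # stand-in for a convolution's output
bias = rng.normal(size=(1, 16, 1, 1))       # hypothetical per-channel conv bias

without_bias = batchnorm2d_train(conv_out)
with_bias = batchnorm2d_train(conv_out + bias)

# The constant shift changes the channel mean by the same amount, so it cancels.
print(np.allclose(without_bias, with_bias))
```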
