This notebook compares the training loss and test accuracy on MNIST of a small fully connected neural network with and without Batch Normalization.

Batch Normalization is a technique used in deep learning to stabilize and speed up the training process of neural networks by normalizing the inputs of each layer. It was introduced by Sergey Ioffe and Christian Szegedy in 2015 and has since become a standard practice in training deep neural networks.


1. **Normalization**:
   - Batch normalization normalizes the inputs to a layer (that is, the outputs of the previous layer) using statistics computed over the current mini-batch.
   - For each feature, it subtracts the batch mean and divides by the batch standard deviation, so that the normalized values have approximately zero mean and unit variance within the batch.

2. **Learnable Parameters**:
   - After normalization, batch normalization introduces two learnable parameters: 
     - **Gamma (γ)**: A scaling factor.
     - **Beta (β)**: A shifting factor.
   - These parameters allow the network to undo the normalization if it is optimal for the learning task, effectively giving the model the flexibility to adjust the scale and mean of the normalized output.

3. **Where It's Applied**:
   - The original paper applies batch normalization before the activation function, directly after the linear transform. Applying it after the activation, as the network in this notebook does, is also common in practice.

4. **Effect on Training**:
   - **Stabilizes Learning**: By normalizing the inputs to each layer, batch normalization helps in stabilizing the learning process, which can allow for higher learning rates.
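   - **Reduces Internal Covariate Shift**: The original paper motivates batch normalization as a way to reduce internal covariate shift, the change in the distribution of each layer's inputs as earlier layers update during training. Later work has questioned whether this is the main reason it helps, but in practice models with batch normalization often converge faster.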
   - **Regularization Effect**: Batch normalization has a slight regularization effect, as the noise introduced by estimating statistics (mean and variance) from mini-batches acts as a form of regularization, reducing the need for dropout in some cases.

5. **Training and Inference Mode**:
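   - During training, batch normalization uses the statistics (mean and variance) of the current batch and updates a running average of them.
   - During inference, it uses the running averages collected during training, so the output for a given input does not depend on the other samples in the batch. In PyTorch, `model.train()` and `model.eval()` switch between the two modes.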

6. **Mathematical Formulation**:
   Given a mini-batch of \( m \) values \( x_1, \dots, x_m \) of one input feature to a layer, the batch normalization process involves:
   1. **Compute the mean and variance**:
      \[
      \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
      \]
   2. **Normalize the input**:
      \[
      \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
      \]
      where \( \epsilon \) is a small constant added for numerical stability.
   3. **Scale and shift**:
      \[
      y_i = \gamma \hat{x}_i + \beta
      \]
      where \( \gamma \) and \( \beta \) are learnable parameters.
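
The three steps above can be sketched directly in PyTorch and checked against the built-in layer. This is a minimal illustration, not the library's implementation: `batch_norm_forward` is a hypothetical helper name, the input is random data, and it assumes the `nn.BatchNorm1d` defaults (γ initialized to 1, β to 0, `eps=1e-5`) with the layer in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Hypothetical helper: batch normalization over the batch dimension (dim 0)."""
    mu = x.mean(dim=0)                        # step 1: per-feature batch mean
    var = x.var(dim=0, unbiased=False)        # step 1: per-feature variance (divide by m)
    x_hat = (x - mu) / torch.sqrt(var + eps)  # step 2: normalize
    return gamma * x_hat + beta               # step 3: scale and shift

# Random batch: 64 samples, 4 features, deliberately not zero-mean/unit-variance
x = torch.randn(64, 4) * 3.0 + 5.0
gamma = torch.ones(4)   # same initial value as BatchNorm1d's weight
beta = torch.zeros(4)   # same initial value as BatchNorm1d's bias

y_manual = batch_norm_forward(x, gamma, beta)

bn = nn.BatchNorm1d(4)
bn.train()              # training mode: normalize with the current batch statistics
y_builtin = bn(x)

print("matches nn.BatchNorm1d:", torch.allclose(y_manual, y_builtin, atol=1e-4))
print("per-feature mean:", y_manual.mean(dim=0))
print("per-feature variance:", y_manual.var(dim=0, unbiased=False))
```

With γ = 1 and β = 0 the output is just the normalized input, so the printed per-feature mean and variance should be close to 0 and 1. Other values of `gamma` and `beta` rescale and shift the output, which is the flexibility described under "Learnable Parameters".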

### Benefits of Batch Normalization:
- **Faster Training**: By allowing higher learning rates, batch normalization can significantly speed up the training process.
- **Reduced Sensitivity to Initialization**: Models become less sensitive to the choice of initial weights, making it easier to train deep networks.
- **Improved Generalization**: The regularization effect can lead to better generalization on unseen data.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Transformations for the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Loading MNIST dataset
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Define a simple neural network
class Net(nn.Module):
    def __init__(self, use_batch_norm=False):
        super(Net, self).__init__()
        self.use_batch_norm = use_batch_norm

        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

        # Batch Normalization layers
        if use_batch_norm:
            self.bn1 = nn.BatchNorm1d(512)
            self.bn2 = nn.BatchNorm1d(256)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))

        # Apply batch normalization if specified
        if self.use_batch_norm:
            x = self.bn1(x)

        x = torch.relu(self.fc2(x))

        # Apply batch normalization if specified
        if self.use_batch_norm:
            x = self.bn2(x)

        x = self.fc3(x)
        return x

# Function to train the model
def train_model(model, criterion, optimizer, epochs=3):
    model.train()  # BatchNorm layers normalize with batch statistics while training
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(trainloader):

            optimizer.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 100 == 99:
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 100))
                running_loss = 0.0

# Training without batch normalization
model_no_bn = Net(use_batch_norm=False)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_no_bn.parameters(), lr=0.01, momentum=0.9)

print("Training without Batch Normalization:")
train_model(model_no_bn, criterion, optimizer)

# Training with batch normalization
model_with_bn = Net(use_batch_norm=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_with_bn.parameters(), lr=0.01, momentum=0.9)

print("\nTraining with Batch Normalization:")
train_model(model_with_bn, criterion, optimizer)

Training without Batch Normalization:
[1,   100] loss: 1.166
[1,   200] loss: 0.446
[1,   300] loss: 0.361
[1,   400] loss: 0.323
[1,   500] loss: 0.288
[1,   600] loss: 0.267
[1,   700] loss: 0.231
[1,   800] loss: 0.246
[1,   900] loss: 0.210
[2,   100] loss: 0.200
[2,   200] loss: 0.174
[2,   300] loss: 0.171
[2,   400] loss: 0.157
[2,   500] loss: 0.155
[2,   600] loss: 0.145
[2,   700] loss: 0.135
[2,   800] loss: 0.146
[2,   900] loss: 0.145
[3,   100] loss: 0.132
[3,   200] loss: 0.127
[3,   300] loss: 0.109
[3,   400] loss: 0.117
[3,   500] loss: 0.110
[3,   600] loss: 0.106
[3,   700] loss: 0.099
[3,   800] loss: 0.094
[3,   900] loss: 0.095

Training with Batch Normalization:
[1,   100] loss: 0.464
[1,   200] loss: 0.253
[1,   300] loss: 0.219
[1,   400] loss: 0.172
[1,   500] loss: 0.183
[1,   600] loss: 0.152
[1,   700] loss: 0.161
[1,   800] loss: 0.149
[1,   900] loss: 0.143
[2,   100] loss: 0.118
[2,   200] loss: 0.106
[2,   300] loss: 0.098
[2,   400] loss: 0.102
[2,   

In [None]:
# Transformations for the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Loading MNIST test dataset
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

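# Function to calculate accuracy
def get_accuracy(model, dataloader):
    model.eval()  # BatchNorm layers use running statistics instead of batch statistics
    correct = 0
    total = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for inputs, labels in dataloader:
            outputs = model(inputs)
            # The index of the largest logit is the predicted class
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total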

# Print accuracy for both models
print("\nAccuracy without Batch Normalization: {:.2%}".format(get_accuracy(model_no_bn, testloader)))
print("Accuracy with Batch Normalization: {:.2%}".format(get_accuracy(model_with_bn, testloader)))



Accuracy without Batch Normalization: 96.77%
Accuracy with Batch Normalization: 97.33%
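
Because batch normalization behaves differently in training and inference mode, the mode a model is in during evaluation affects its outputs. The sketch below shows the difference on a standalone `nn.BatchNorm1d` layer with random input. It is an illustration under those assumptions and does not use the models trained above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)  # running_mean starts at 0, running_var at 1
x = torch.randn(32, 3) * 2.0 + 10.0  # random batch with mean near 10

# Training mode: normalize with the batch statistics and update the running averages
bn.train()
y_train = bn(x)
running_mean_after_train = bn.running_mean.clone()

# Inference mode: normalize with the stored running averages, which are not updated
bn.eval()
with torch.no_grad():
    y_eval = bn(x)
running_mean_after_eval = bn.running_mean.clone()

print("train-mode output mean:", y_train.mean(dim=0))
print("eval-mode output mean: ", y_eval.mean(dim=0))
print("running mean after one training batch:", running_mean_after_train)
print("running mean unchanged by eval pass:",
      torch.equal(running_mean_after_train, running_mean_after_eval))
```

After a single training batch the running averages have moved only a fraction of the way toward the batch statistics (the default `momentum` is 0.1), so the eval-mode output is far from zero-mean here. After many batches the running averages approach the data statistics, and the two modes give similar outputs.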


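In this single run, the network with batch normalization converges somewhat faster (training loss of 0.464 versus 1.166 after the first 100 batches) and reaches slightly higher test accuracy (97.33% versus 96.77%).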