# Neural Network Training Techniques

## Batch Normalization Study

This notebook demonstrates the impact of batch normalization on neural network training by comparing two architectures:
1. A baseline feed-forward network
2. The same architecture enhanced with batch normalization layers

Both networks are trained on the FashionMNIST dataset to empirically compare convergence speed and final performance.

## Batch Normalization
In this study we train a feed-forward neural network with batch normalization. Training deep neural networks is difficult in part because the distribution of each layer's inputs shifts as the weights of the preceding layers are updated. Each layer is then learning against a "moving target", which slows down training. Batch normalization addresses this problem by normalizing the inputs to a layer over each mini-batch and then applying a learnable scale and shift. This stabilizes the learning process and can greatly decrease training time. The technique was introduced in [this paper](https://arxiv.org/abs/1502.03167).
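To make the normalization step concrete, here is a minimal sketch that normalizes a synthetic mini-batch by hand and compares the result with a freshly constructed `nn.BatchNorm1d` layer. The tensor shape, the seed, and the variable names are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic mini-batch: 8 samples, 4 features, deliberately off-center and scaled.
x = torch.randn(8, 4) * 3.0 + 5.0

# Manual batch normalization (training mode): normalize each feature
# with the mean and the biased variance computed over the batch dimension.
eps = 1e-5
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + eps)

# A new nn.BatchNorm1d layer has scale (weight) 1 and shift (bias) 0,
# so in training mode it should reproduce the manual result.
bn = nn.BatchNorm1d(num_features=4, eps=eps)
bn.train()
y = bn(x)

print(torch.allclose(x_hat, y, atol=1e-5))
```

During training the layer also learns its scale and shift, so its output is no longer forced to have zero mean and unit variance if another distribution serves the next layer better.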

We will be working with the [FashionMNIST](https://pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html) dataset by Zalando, which consists of $28\times 28$ grayscale images in 10 classes, just like the [MNIST](https://pytorch.org/vision/main/generated/torchvision.datasets.MNIST.html) dataset. Instead of handwritten digits, the classes are items of clothing such as shoes, t-shirts, and dresses.

### Baseline Network
We construct a feed-forward neural network with 3 linear layers and a ReLU after each of the first 2. The first hidden layer has size 64 and the second has size 32; the third layer maps to the 10 output classes. This is a multi-class classification problem, so we use cross-entropy loss (provided by [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)). The optimizer is stochastic gradient descent (provided by [torch.optim.SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html)) with a learning rate of 0.001. The network is trained on the training data for 5 epochs, and accuracy on the **test** set is reported after each epoch. For the structure of the loop, refer to [PyTorch Training Loop](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#the-training-loop) and [Per-Epoch Activity](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#per-epoch-activity).
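One way the baseline and its training loop could look is sketched below. The names `BaselineNet`, `train_one_epoch`, `evaluate`, and `fit` are hypothetical, and the input size of 784 and output size of 10 are assumptions derived from the $28 \times 28$ images and the 10 classes; this is not necessarily how the notebook implements it.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class BaselineNet(nn.Module):
    """Hypothetical baseline: 784 -> 64 -> 32 -> 10 feed-forward classifier."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),             # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(28 * 28, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 10),        # raw logits; CrossEntropyLoss handles the softmax
        )

    def forward(self, x):
        return self.layers(x)


def train_one_epoch(model, loader, criterion, optimizer):
    """Run one pass over the training data and update the weights."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()


@torch.no_grad()
def evaluate(model, loader):
    """Return the fraction of correctly classified samples in the loader."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total


def fit(model, train_loader, test_loader, epochs=5, lr=0.001):
    """Train with SGD and record the test accuracy after every epoch."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    accuracies = []
    for _ in range(epochs):
        train_one_epoch(model, train_loader, criterion, optimizer)
        accuracies.append(evaluate(model, test_loader))
    return accuracies
```

Once the data loaders exist, a call such as `fit(BaselineNet(), fashion_train_loader, fashion_test_loader)` would return a list with one test accuracy per epoch.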

**Note**: The data comes in the format of $28 \times 28$ tensors, so we flatten it before training.
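The flattening step can be done in several equivalent ways. The sketch below uses a synthetic batch in place of real FashionMNIST images, assuming the `(N, 1, 28, 28)` layout that `transforms.ToTensor()` produces for single-channel images.

```python
import torch
import torch.nn as nn

# Synthetic stand-in for one batch of images: (N, C, H, W) = (64, 1, 28, 28).
images = torch.rand(64, 1, 28, 28)

flat_a = images.view(images.size(0), -1)     # reshape by hand, keeping the batch dimension
flat_b = torch.flatten(images, start_dim=1)  # functional form
flat_c = nn.Flatten()(images)                # layer form; flattens from dim 1 by default

print(flat_a.shape)  # torch.Size([64, 784])
```

Using `nn.Flatten()` as the first layer of the model keeps the flattening inside the network, so the training and evaluation loops can pass batches through unchanged.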

In [6]:
import torch
import torchvision
import torch.nn as nn
from torchvision import transforms
import matplotlib.pyplot as plt
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets
from tqdm import tqdm

In [7]:
# load train and test set
fashion_trainset = torchvision.datasets.FashionMNIST('data/', train=True, download=True, transform=transforms.ToTensor())
fashion_testset = torchvision.datasets.FashionMNIST('data/', train=False, download=True, transform=transforms.ToTensor())

In [8]:
# build the train and test loaders (only the training data is shuffled)
fashion_train_loader = DataLoader(dataset=fashion_trainset, batch_size=64, shuffle=True)
fashion_test_loader = DataLoader(dataset=fashion_testset, batch_size=64, shuffle=False)

### Network with Batch Normalization
We construct a second network with the same layer sizes and training settings as the baseline, but this time insert a batch normalization layer (refer to [nn.BatchNorm1d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html)) after each hidden linear layer and before its activation function. We train this network in the same way and save the test accuracies.
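A minimal sketch of such a network follows, assuming the same 784 → 64 → 32 → 10 layout as the baseline. The class name `BatchNormNet` is hypothetical. The second half illustrates why switching between `model.train()` and `model.eval()` matters once batch normalization is present.

```python
import torch
import torch.nn as nn


class BatchNormNet(nn.Module):
    """Hypothetical variant of the baseline: Linear -> BatchNorm1d -> ReLU per hidden layer."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 64),
            nn.BatchNorm1d(64),   # normalize the 64 pre-activations
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),   # normalize the 32 pre-activations
            nn.ReLU(),
            nn.Linear(32, 10),    # output logits are left unnormalized
        )

    def forward(self, x):
        return self.layers(x)


torch.manual_seed(0)
model = BatchNormNet()
batch = torch.randn(16, 1, 28, 28)  # synthetic stand-in for a batch of images

# Training mode: batch statistics are used and the running estimates are updated.
model.train()
train_logits = model(batch)

# Evaluation mode: the stored running estimates are used, so even a
# single sample can be passed through the network.
model.eval()
with torch.no_grad():
    single_logits = model(batch[:1])
```

Forgetting to call `model.eval()` before measuring test accuracy would make the predictions depend on the composition of each test batch.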

### Performance Comparison
We plot the per-epoch test accuracies of the two networks and compare their convergence speed and final accuracy.
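A possible plotting helper is sketched below. The function name `plot_accuracies` and its arguments are hypothetical. It assumes two equally long lists holding one test accuracy per epoch, one list for each network.

```python
import matplotlib.pyplot as plt


def plot_accuracies(baseline_acc, batchnorm_acc):
    """Plot per-epoch test accuracy of both networks on shared axes."""
    epochs = list(range(1, len(baseline_acc) + 1))
    fig, ax = plt.subplots()
    ax.plot(epochs, baseline_acc, marker="o", label="Baseline")
    ax.plot(epochs, batchnorm_acc, marker="o", label="With batch normalization")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Test accuracy")
    ax.set_xticks(epochs)
    ax.legend()
    return fig
```

Drawing both curves on the same axes makes any gap in early epochs (convergence speed) and in the last epoch (final performance) directly visible.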