## Why Batch Normalization
- Faster convergence
- Smoother loss landscape, which makes training more stable
- Reduced internal covariate shift (the distribution of each layer's inputs shifts during training as earlier layers are updated)
- Higher learning rates become usable

![image.png](attachment:image.png)

## How to do Batch Normalization
Example for `BatchNorm1d`:
![image-2.png](attachment:image-2.png)
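
As a rough illustration of the `BatchNorm1d` case, the sketch below passes a small batch through PyTorch's `nn.BatchNorm1d` in training mode and checks the per-feature statistics of the output. The batch size, feature count, and the shift and scale applied to the input are arbitrary values chosen for this example, not taken from the figure.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative batch: 8 samples, 3 features, deliberately shifted and scaled.
x = torch.randn(8, 3) * 5.0 + 10.0

bn = nn.BatchNorm1d(num_features=3)  # gamma starts at 1, beta at 0
bn.train()                           # training mode: normalize with batch statistics
y = bn(x)

# Statistics are taken over the batch dimension, separately for each feature.
per_feature_mean = y.mean(dim=0)
per_feature_var = y.var(dim=0, unbiased=False)
print(per_feature_mean)  # each entry should be close to 0
print(per_feature_var)   # each entry should be close to 1
```

Each column (feature) is normalized independently across the batch, which is why the statistics are computed along `dim=0`.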

## Procedure of Batch Normalization
![image-3.png](attachment:image-3.png)

Steps:
1. Calculate the mean and variance of the batch of activations $h_1, \dots, h_n$:

    $\text{Mean: } \mu = \frac{1}{n} \sum_{i=1}^{n} h_i$

    $\text{Variance: } \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (h_i - \mu)^2$

2. Standardize the original input so that the batch of activations has mean 0 and variance 1 ($\epsilon$ is a small constant added for numerical stability):

    $\text{Normalize: } h_{i(norm)} = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$

3. Rescale and shift with the learnable parameters $\gamma$ and $\beta$ to restore the representational power of the activations:

    $y_i = \gamma h_{i(norm)} + \beta$

4. Update the running mean and variance used at inference with an exponential moving average of the batch statistics

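Note: The original paper places the BatchNorm layer before the activation (right after the linear or convolutional layer); placing it after the activation is also common in practice.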
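
The four steps above can be sketched by hand and compared against `nn.BatchNorm1d`. This is a minimal sketch, not a definitive implementation: `batchnorm_forward` is a hypothetical helper written for this example, and it assumes PyTorch's conventions, namely dividing by $\sqrt{\sigma^2 + \epsilon}$ with `eps=1e-5`, a moving-average `momentum` of 0.1, and the unbiased variance estimate for the running variance.

```python
import torch
from torch import nn

def batchnorm_forward(h, gamma, beta, running_mean, running_var,
                      momentum=0.1, eps=1e-5):
    """Training-mode batch norm for an input of shape (batch, features)."""
    n = h.shape[0]
    # Step 1: per-feature mean and (biased) variance over the batch.
    mu = h.mean(dim=0)
    var = h.var(dim=0, unbiased=False)
    # Step 2: standardize to zero mean and unit variance.
    h_norm = (h - mu) / torch.sqrt(var + eps)
    # Step 3: rescale and shift with the learnable parameters.
    out = gamma * h_norm + beta
    # Step 4: exponential moving average of the statistics for inference.
    # Assumption: the running variance uses the unbiased estimate, as PyTorch does.
    new_mean = (1 - momentum) * running_mean + momentum * mu
    new_var = (1 - momentum) * running_var + momentum * var * n / (n - 1)
    return out, new_mean, new_var

torch.manual_seed(0)
h = torch.randn(16, 4) * 3.0 + 2.0  # illustrative batch: 16 samples, 4 features

out, run_mean, run_var = batchnorm_forward(
    h, gamma=torch.ones(4), beta=torch.zeros(4),
    running_mean=torch.zeros(4), running_var=torch.ones(4))

# Reference layer with the same initial state (gamma=1, beta=0, mean=0, var=1).
bn = nn.BatchNorm1d(4)
bn.train()
ref = bn(h)

print(torch.allclose(out, ref, atol=1e-5))
print(torch.allclose(run_mean, bn.running_mean, atol=1e-6))
print(torch.allclose(run_var, bn.running_var, atol=1e-5))
```

After calling `bn.eval()`, the layer stops using batch statistics and normalizes with `running_mean` and `running_var` instead, which is what step 4 prepares for.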


In [None]:
# Hyperparameters

LR_BASE = 0.01  # learning rate for the baseline network
LR_BN = 0.01  # learning rate for the BatchNorm network

num_iterations = 10000 #50000
valid_steps = 50 # training iterations before validation

verbose = True

In [None]:

# Libs
import numpy as np
import torch
from torch import nn
import torchvision
import torchvision.datasets as datasets
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# Seeds
torch.manual_seed(0)
np.random.seed(0)