## Batch Normalization

In [2]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

#### Image Normalization
<font size=2>

To avoid vanishing gradients (e.g. with a Sigmoid activation), which occur when the inputs fall into the flat, saturated regions (plateaus) of the function shown below, the data should be normalized.
    
<div>
<img src = "SigmoidPlateau.png" style = "zoom: 60%" />
</div>
    
The normalization formula is given below; it shifts and rescales the data to zero mean and unit variance, in the ideal case the standard normal distribution $N(0,1)$:
    
$$ \tilde{d} = \frac{d - \mu}{\sigma} $$
    
where $d$ is the original data, $\tilde{d}$ is the normalized data, $\mu$ is the mean of the original data, and $\sigma$ is the standard deviation of the original data.
    
Note: $\mu$ and $\sigma$ are statistics computed from the original data itself.
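
As a minimal sketch of the formula above, the statistics can be computed per channel and applied by broadcasting. The tensor `images` and its shape are hypothetical and chosen only for illustration; a real dataset would supply its own statistics.

```python
import torch

torch.manual_seed(0)
# hypothetical dataset: 8 images, 3 channels, 4x4 pixels
images = torch.rand(8, 3, 4, 4)

# one mean and one standard deviation per channel,
# computed over the batch and spatial dimensions
mu = images.mean(dim=(0, 2, 3))
sigma = images.std(dim=(0, 2, 3))

# reshape to [1, 3, 1, 1] so each statistic broadcasts over its channel
normalized = (images - mu.view(1, 3, 1, 1)) / sigma.view(1, 3, 1, 1)

print(normalized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(normalized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```

After this step every channel has approximately zero mean and unit standard deviation, whatever the scale of the original values was.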

In [9]:
# image dataset with 2 images, 3 channels, 2x2 pixels each
img_data = torch.rand(2,3,2,2)
# 'mean' and 'std' are the per-channel mean and standard deviation (3 channels),
# which are statistics computed from the image dataset
# and are used to normalize the input data to roughly zero mean and unit variance
normalizer = transforms.Normalize(mean=[0.45,0.52,0.48],std=[0.23,0.236,0.24])
out = normalizer(img_data)
print(out)

tensor([[[[ 0.8679,  1.5408],
          [ 0.8657, -0.6081]],

         [[-0.2586, -0.6329],
          [ 1.2518, -0.3239]],

         [[ 0.0646,  1.5763],
          [ 1.4138,  1.9536]]],


        [[[-1.8601, -1.4089],
          [ 0.7828,  0.1823]],

         [[ 1.3136,  1.1332],
          [ 1.8170, -1.8920]],

         [[-0.7052, -1.6217],
          [ 1.0318,  1.0576]]]])


#### Batch Normalization
<font size=2>
    
Within one batch, normalization is applied separately to each channel. For example, take a batch of 5 images with 16 channels and 28x28 pixels: **[5,16,28,28]**, or the flattened version: **[5,16,784]**. The statistics are computed over every dimension except the one of size **16** (i.e. the channel dimension), so there is one $\mu$ and one $\sigma$ per channel, 16 of each in total.
    
The following image illustrates how the data in a batch is normalized:
    
<div>
<img src = "BatchNorm.png" style = "zoom:50%" />
</div>
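
The per-channel reduction described above can be sketched with plain tensor operations. The shapes follow the example in the text; the variable names are illustrative only.

```python
import torch

torch.manual_seed(0)
# batch of 5 images, 16 channels, 28x28 pixels
batch = torch.rand(5, 16, 28, 28)

# reduce over batch, height and width; only the channel dimension remains
mu = batch.mean(dim=(0, 2, 3))
var = batch.var(dim=(0, 2, 3), unbiased=False)
print(mu.shape, var.shape)  # torch.Size([16]) torch.Size([16])

# the flattened layout [5, 16, 784] yields the same per-channel statistics
flat = batch.view(5, 16, 784)
mu_flat = flat.mean(dim=(0, 2))
print(torch.allclose(mu, mu_flat, atol=1e-6))
```

Whether the spatial dimensions are kept as 28x28 or flattened to 784 makes no difference to the statistics, because both layouts average over the same elements of each channel.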
    
The $\mu$ and $\sigma$ are calculated from the original data for each channel. Let $\mu_{1}$ and $\sigma_{1}$ be the mean and standard deviation of **channel 1**, calculated from the output $z^{1}$. Then $\tilde{z}^{1}$ is the normalized $z^{1}$:
    
$$ \tilde{z}^{1} = \frac{z^{1} - \mu_{1}}{\sigma_{1}} $$
    
If we then want to **scale** and **shift** the data to a new distribution with mean $\beta$ and standard deviation $\gamma$, i.e. $N(\beta,\gamma^{2})$, where $\beta$ and $\gamma$ are parameters of the network, that is to say they are **learned**:
    
$$ \hat{z}^{1} = \gamma \odot \tilde{z}^{1} + \beta $$
    
The complete sequence of formulas for batch normalization is:
    
<div>
<img src = "BatchNormPipeline.png" style = "zoom:50%" />
</div>
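
The normalize, scale and shift steps can be reproduced by hand and compared against `nn.BatchNorm1d`. This is a sketch under the assumption that the layer is in training mode with its default settings (`eps=1e-5`, `momentum=0.1`); the names `z_tilde`, `z_hat`, `manual_running_mean` and `manual_running_var` are illustrative and not part of the PyTorch API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z = torch.rand(100, 16, 784)

layer = nn.BatchNorm1d(16)  # a new layer starts in training mode
expected = layer(z)

# per-channel batch statistics (biased variance is used for normalizing)
mu = z.mean(dim=(0, 2), keepdim=True)
var = z.var(dim=(0, 2), unbiased=False, keepdim=True)

# normalize, then scale by gamma (weight) and shift by beta (bias)
z_tilde = (z - mu) / torch.sqrt(var + layer.eps)
gamma = layer.weight.view(1, 16, 1)
beta = layer.bias.view(1, 16, 1)
z_hat = gamma * z_tilde + beta
print(torch.allclose(z_hat, expected, atol=1e-5))

# the layer also keeps running estimates, updated once per training batch:
# running = (1 - momentum) * running + momentum * batch_statistic
momentum = layer.momentum
manual_running_mean = momentum * mu.flatten()  # initial running mean is 0
manual_running_var = (1 - momentum) * torch.ones(16) \
    + momentum * z.var(dim=(0, 2), unbiased=True)  # initial running var is 1
print(torch.allclose(layer.running_mean, manual_running_mean, atol=1e-6))
print(torch.allclose(layer.running_var, manual_running_var, atol=1e-6))
```

For inputs drawn uniformly from [0, 1), the batch mean is about 0.5 and the variance about 1/12, so after a single batch the running mean is roughly 0.05 and the running variance roughly 0.908.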

In [19]:
# a batch of 100 images, 16 channels, 28x28 pixels, flattened to 784
data_1d = torch.rand(100,16,784)
# data_2d = torch.rand(100,16,28,28)  # 2d version
# the argument of nn.BatchNorm1d() is the number of channels of the input data
layer_1d = nn.BatchNorm1d(16)
# layer_2d = nn.BatchNorm2d(16)  # 2d version
out = layer_1d(data_1d)
# running_mean & running_var:
# running estimates of the per-channel mean and variance,
# updated with every batch passed through the layer in training mode
print('layer.running_mean: {} with shape of: {}'.format(layer_1d.running_mean, layer_1d.running_mean.shape))
print('layer.running_var: {} with shape of: {}'.format(layer_1d.running_var, layer_1d.running_var.shape))
# weight & bias:
# the learned scale 'gamma' (weight) and learned shift 'beta' (bias), which are updated by gradients
print('layer.weight: {} with shape of: {}'.format(layer_1d.weight, layer_1d.weight.shape))
print('layer.bias: {} with shape of: {}'.format(layer_1d.bias, layer_1d.bias.shape))
print('----------------------------------------------------------------------------------------------')
# all attributes of the current layer
print(vars(layer_1d))

layer.running_mean: tensor([0.0498, 0.0500, 0.0498, 0.0500, 0.0501, 0.0498, 0.0500, 0.0501, 0.0499,
        0.0500, 0.0501, 0.0499, 0.0500, 0.0500, 0.0499, 0.0500]) with shape of: torch.Size([16])
layer.running_var: tensor([0.9083, 0.9084, 0.9083, 0.9084, 0.9084, 0.9083, 0.9083, 0.9084, 0.9083,
        0.9083, 0.9083, 0.9084, 0.9084, 0.9083, 0.9083, 0.9084]) with shape of: torch.Size([16])
layer.weight: Parameter containing:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       requires_grad=True) with shape of: torch.Size([16])
layer.bias: Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       requires_grad=True) with shape of: torch.Size([16])
----------------------------------------------------------------------------------------------
{'training': True, '_parameters': OrderedDict([('weight', Parameter containing:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       requires_grad=True