### Batch Normalization

Just as normalizing the input data improves model performance, batch normalization (BN) is applied during neural network training to reduce internal covariate shift and speed up convergence. It can also act as a regularizer.

The main idea of BatchNorm is this: during training, for the current minibatch, we normalize the activations of each hidden layer so that every feature has zero mean and unit variance.

Then we apply an affine transform with learned parameters, so that the network can learn which scale and offset work best for the layer's activations.

**Fully Connected Layer:**  
$$
\boldsymbol{x} = \boldsymbol{W}\boldsymbol{u} + \boldsymbol{b} \\
\text{output} = \phi(\boldsymbol{x})
$$


**BN:**
$$
\text{output} = \phi(\text{BN}(\boldsymbol{x}))
$$


$$
\boldsymbol{y}^{(i)} = \text{BN}(\boldsymbol{x}^{(i)})
$$


$$
\boldsymbol{\mu}_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i = 1}^{m} \boldsymbol{x}^{(i)},
$$ 
$$
\boldsymbol{\sigma}_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m}(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B})^2,
$$


$$
\hat{\boldsymbol{x}}^{(i)} \leftarrow \frac{\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}}{\sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}},
$$

Here, $\epsilon > 0$ is a small constant that keeps the denominator strictly positive, even when the minibatch variance is zero.


$$
{\boldsymbol{y}}^{(i)} \leftarrow \boldsymbol{\gamma} \odot
\hat{\boldsymbol{x}}^{(i)} + \boldsymbol{\beta}.
$$

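The normalized value $\hat{\boldsymbol{x}}^{(i)}$ is then passed through an affine function with learnable parameters: the scale parameter $\boldsymbol{\gamma}$ (gamma) and the shift parameter $\boldsymbol{\beta}$ (beta).

If $\boldsymbol{\gamma} = \sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}$ and $\boldsymbol{\beta} = \boldsymbol{\mu}_\mathcal{B}$, the original activation is restored.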
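
The formulas above can be checked numerically. The following is a minimal sketch, assuming a toy minibatch of 8 examples with 4 features (the sizes, seed, and variable names are illustrative choices, not part of the original notes). It normalizes the batch and then shows that choosing the scale and shift from the batch statistics recovers the input.

```python
import torch

torch.manual_seed(0)
eps = 1e-5

# A toy minibatch: m = 8 examples, 4 features, deliberately off-center and scaled.
x = 3.0 * torch.randn(8, 4) + 5.0

# Minibatch statistics (biased variance, matching the 1/m formula above).
mu = x.mean(dim=0)
var = ((x - mu) ** 2).mean(dim=0)

# Normalize each feature with the minibatch statistics.
x_hat = (x - mu) / torch.sqrt(var + eps)

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization.
gamma = torch.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(x_hat.mean(dim=0))                 # close to 0 for every feature
print(x_hat.var(dim=0, unbiased=False))  # close to 1 for every feature
print(torch.allclose(y, x, atol=1e-4))   # the original activations are restored
```

In practice $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learned rather than set by hand; the point of the check is only that the affine step is expressive enough to cancel the normalization if that is what training prefers.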



### For Conv Layer
BN is applied after the conv layer and before the activation layer.

If the output has multiple channels, each channel is normalized separately, with its own scale and shift parameters.

Conv output shape: **channel_num × conv_output_size (height × width) × batch_size**

Calculation: for each channel, the mean and variance are computed over all batch_size × height × width elements.
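
As a rough illustration of the per-channel calculation, the sketch below computes one mean and one variance per channel and compares the result with `nn.BatchNorm2d` in training mode. It assumes PyTorch's `(batch_size, channels, height, width)` layout, which orders the axes differently from the shape written above; the tensor sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# PyTorch stores conv outputs as (batch_size, channels, height, width).
X = torch.randn(4, 3, 5, 5) * 2.0 + 1.0

# One mean and one variance per channel, pooled over batch, height and width.
mean = X.mean(dim=(0, 2, 3), keepdim=True)                 # shape (1, 3, 1, 1)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)  # biased variance
X_hat = (X - mean) / torch.sqrt(var + 1e-5)

# Reference: the built-in layer in training mode (gamma = 1, beta = 0 at init).
bn = nn.BatchNorm2d(3, eps=1e-5)
bn.train()
reference = bn(X)

print(mean.shape)
print(torch.allclose(X_hat, reference, atol=1e-4))
```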


### For Prediction
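- At prediction time, use exponential moving averages (EMA) of the mean and variance collected during training, instead of the statistics of the current minibatch.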
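
To make the moving-average idea concrete, here is a small sketch using the built-in `nn.BatchNorm1d`. It assumes the layer's defaults (`momentum=0.1`, running mean initialized to 0, running variance initialized to 1) and an arbitrary toy batch. Note that PyTorch's `momentum` weights the new batch statistic, so it corresponds to `1 - momentum` in the hand-written `batch_norm` function in this section, and that the built-in layer tracks the unbiased variance.

```python
import torch
from torch import nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)            # default momentum is 0.1
X = torch.randn(16, 3) + 2.0

# Expected running statistics after one training step:
#   new = (1 - momentum) * old + momentum * batch_statistic
m = bn.momentum
expected_mean = (1 - m) * torch.zeros(3) + m * X.mean(dim=0)
expected_var = (1 - m) * torch.ones(3) + m * X.var(dim=0, unbiased=True)

bn.train()
bn(X)                             # forward pass in training mode updates the running stats

bn.eval()
Y_eval = bn(X)                    # prediction mode: normalize with the running stats
manual = (X - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))
print(torch.allclose(bn.running_var, expected_var, atol=1e-6))
print(torch.allclose(Y_eval, manual, atol=1e-5))
```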

In [2]:
import time
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchvision
import sys
sys.path.append("/home/kesci/input/") 
import d2lzh1981 as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Choose between training mode and prediction mode
    if not is_training:
        # In prediction mode, normalize with the moving-average mean and variance
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:  # 2-D input: output of a fully connected layer
            # Compute the mean and variance over the batch dimension (dim=0)
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:  # 4-D input: output of a conv layer
            # Compute one mean and variance per channel (dim=1), averaging over
            # the batch, height and width; keepdim=True keeps the shape
            # (1, C, 1, 1) so that broadcasting against X works
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        # In training mode, normalize with the statistics of the current minibatch
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages (momentum is a fixed hyperparameter);
        # detach so the running statistics do not keep old autograd graphs alive
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean.detach()
        moving_var = momentum * moving_var + (1.0 - momentum) * var.detach()
    
    Y = gamma * X_hat + beta  # scale and shift
    
    return Y, moving_mean, moving_var

In [3]:
class BatchNorm(nn.Module):
    
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features) # FC layer output neurons
        else:
            shape = (1, num_features, 1, 1)  # channel num
        # initialize scale param to one, shift param to zero
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Moving statistics used at prediction time, initialized to zero
        # (plain tensors, not trained by gradient descent)
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)

    def forward(self, X):
        # If the moving statistics are not on the same device as X, move them there
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Store the updated moving_mean and moving_var
        # self.training defaults to True; calling .eval() sets it to False
        Y, self.moving_mean, self.moving_var = batch_norm(self.training, 
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
            
        return Y

### Implemented on LeNet

In [13]:
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            BatchNorm(6, num_dims=4), # num_dims=4 after conv
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            BatchNorm(16, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            BatchNorm(120, num_dims=2), # after FC
            nn.Sigmoid(),
            nn.Linear(120, 84),
            BatchNorm(84, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )
print(net)

Sequential(
  (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (1): BatchNorm()
  (2): Sigmoid()
  (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (4): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (5): BatchNorm()
  (6): Sigmoid()
  (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (8): FlattenLayer()
  (9): Linear(in_features=256, out_features=120, bias=True)
  (10): BatchNorm()
  (11): Sigmoid()
  (12): Linear(in_features=120, out_features=84, bias=True)
  (13): BatchNorm()
  (14): Sigmoid()
  (15): Linear(in_features=84, out_features=10, bias=True)
)


In [14]:
batch_size = 128  
# On CPU, use a smaller batch size
# batch_size = 16

def load_data_fashion_mnist(batch_size, resize=None, root='/home/kesci/input/FashionMNIST2065'):
    """Download the fashion mnist dataset and then load into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)

    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)

    return train_iter, test_iter
train_iter, test_iter = load_data_fashion_mnist(batch_size)

In [12]:
lr, num_epochs = 0.001, 50
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 0.0311, train acc 0.990, test acc 0.890, time 7.5 sec
epoch 2, loss 0.0315, train acc 0.990, test acc 0.894, time 7.9 sec
epoch 3, loss 0.0263, train acc 0.993, test acc 0.888, time 7.8 sec
epoch 4, loss 0.0274, train acc 0.992, test acc 0.892, time 7.9 sec
epoch 5, loss 0.0293, train acc 0.991, test acc 0.891, time 7.7 sec
epoch 6, loss 0.0258, train acc 0.992, test acc 0.891, time 7.9 sec
epoch 7, loss 0.0257, train acc 0.992, test acc 0.890, time 7.6 sec
epoch 8, loss 0.0246, train acc 0.992, test acc 0.891, time 7.5 sec
epoch 9, loss 0.0264, train acc 0.992, test acc 0.888, time 7.9 sec
epoch 10, loss 0.0248, train acc 0.992, test acc 0.888, time 8.0 sec
epoch 11, loss 0.0224, train acc 0.993, test acc 0.890, time 7.5 sec
epoch 12, loss 0.0237, train acc 0.993, test acc 0.889, time 8.1 sec
epoch 13, loss 0.0206, train acc 0.994, test acc 0.889, time 7.6 sec
epoch 14, loss 0.0230, train acc 0.993, test acc 0.891, time 7.8 sec
epoch 15, loss 0.0202, tr

### Built-in Batch Norm Function in PyTorch

In [15]:
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            nn.BatchNorm2d(6), # BatchNorm2d: for 4-D conv outputs (N, C, H, W)
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            nn.BatchNorm1d(120), # BatchNorm1d: for 2-D fully connected outputs (N, C)
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.BatchNorm1d(84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 1.0832, train acc 0.790, test acc 0.801, time 7.7 sec
epoch 2, loss 0.4375, train acc 0.866, test acc 0.849, time 7.5 sec
epoch 3, loss 0.3486, train acc 0.881, test acc 0.825, time 7.1 sec
epoch 4, loss 0.3180, train acc 0.891, test acc 0.866, time 7.2 sec
epoch 5, loss 0.2981, train acc 0.895, test acc 0.874, time 7.1 sec
epoch 6, loss 0.2840, train acc 0.899, test acc 0.852, time 7.1 sec
epoch 7, loss 0.2741, train acc 0.903, test acc 0.862, time 7.0 sec
epoch 8, loss 0.2648, train acc 0.906, test acc 0.834, time 7.1 sec
epoch 9, loss 0.2560, train acc 0.908, test acc 0.861, time 7.6 sec
epoch 10, loss 0.2480, train acc 0.911, test acc 0.789, time 6.9 sec
epoch 11, loss 0.2421, train acc 0.913, test acc 0.885, time 6.8 sec
epoch 12, loss 0.2345, train acc 0.915, test acc 0.857, time 6.8 sec
epoch 13, loss 0.2301, train acc 0.917, test acc 0.837, time 6.9 sec
epoch 14, loss 0.2237, train acc 0.920, test acc 0.823, time 6.9 sec
epoch 15, loss 0.2184, tr

## ResNet
In a CNN, once the network reaches a certain depth, adding more layers no longer improves performance and can even make the model worse.

### Residual Block
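To fit the identity mapping:
- Left (plain block): the layers must learn f(x) = x directly.
- Right (residual block): the layers only need to learn the residual f(x) - x = 0.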

![Image Name](https://cdn.kesci.com/upload/image/q5l8lhnot4.png?imageView2/0/w/600/h/600)

A residual block lets the input skip across layers, so the data propagates forward faster.
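
The identity-mapping argument can be illustrated with a deliberately small block. This is a sketch only: `TinyResidual` is a hypothetical single-conv block written for this example, not the `Residual` class used in these notes. When the residual branch is driven to zero, the block passes its input straight through.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyResidual(nn.Module):
    """Hypothetical minimal block: output = relu(g(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x) + x)

blk = TinyResidual(3)
# Drive the residual branch g to zero by zeroing its weight and bias.
nn.init.zeros_(blk.conv.weight)
nn.init.zeros_(blk.conv.bias)

# Inputs are non-negative, so the final ReLU leaves them unchanged.
x = torch.rand(2, 3, 6, 6)
print(torch.equal(blk(x), x))
```

A plain block without the shortcut would output all zeros under the same weights, which is why learning the identity is harder for it.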

In [16]:
class Residual(nn.Module):  
    # use_1x1conv adds a 1x1 conv on the shortcut so that X matches Y when the channel count or stride changes
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super(Residual, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)  # project X to the same shape as Y
        return F.relu(Y + X) 

In [17]:
blk = Residual(3, 3)
X = torch.rand((4, 3, 6, 6))
blk(X).shape 
# torch.Size([4, 3, 6, 6])

torch.Size([4, 3, 6, 6])

In [18]:
blk = Residual(3, 6, use_1x1conv=True, stride=2)
blk(X).shape 
# torch.Size([4, 6, 3, 3])

torch.Size([4, 6, 3, 3])

### ResNet Model
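- Conv (64 output channels, 7x7 kernel, stride 2, padding 3)
- Batch Norm
- ReLU
- MaxPooling (3x3, stride 2)
- Residual block module × 4 (each made of 2 residual blocks; a residual block with stride 2 reduces the height and width between modules)
- Global Average Pooling
- Fully connected layer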
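
The spatial sizes produced by this stack can be worked out by hand with the usual output-size formula, `floor((n + 2 * padding - kernel) / stride) + 1`. The sketch below applies it to a 224x224 input, the size used in the shape trace later in this section; `conv_out` is a helper name introduced here for illustration.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output size of a conv or pooling layer along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1

n = 224
n = conv_out(n, kernel=7, stride=2, padding=3)   # stem conv
sizes = [n]
n = conv_out(n, kernel=3, stride=2, padding=1)   # max pooling
sizes.append(n)
# Four residual modules: the first keeps the size, the others halve it.
for module_stride in (1, 2, 2, 2):
    n = conv_out(n, kernel=3, stride=module_stride, padding=1)
    sizes.append(n)

print(sizes)  # [112, 56, 56, 28, 14, 7]
```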

In [19]:
net = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64), 
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [20]:
def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels # the first block should have same input_channel/output_channel num
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)

net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))

In [21]:
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d()) # GlobalAvgPool2d output: (Batch, 512, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(512, 10))) 

In [22]:
# Example: how the feature map shape of one image changes through the ResNet model
X = torch.rand((1, 1, 224, 224))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)

0  output shape:	 torch.Size([1, 64, 112, 112])
1  output shape:	 torch.Size([1, 64, 112, 112])
2  output shape:	 torch.Size([1, 64, 112, 112])
3  output shape:	 torch.Size([1, 64, 56, 56])
resnet_block1  output shape:	 torch.Size([1, 64, 56, 56])
resnet_block2  output shape:	 torch.Size([1, 128, 28, 28])
resnet_block3  output shape:	 torch.Size([1, 256, 14, 14])
resnet_block4  output shape:	 torch.Size([1, 512, 7, 7])
global_avg_pool  output shape:	 torch.Size([1, 512, 1, 1])
fc  output shape:	 torch.Size([1, 10])


In [23]:
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 0.4184, train acc 0.847, test acc 0.839, time 152.5 sec
epoch 2, loss 0.2985, train acc 0.890, test acc 0.886, time 151.6 sec
epoch 3, loss 0.2628, train acc 0.903, test acc 0.887, time 151.6 sec
epoch 4, loss 0.2348, train acc 0.912, test acc 0.898, time 151.5 sec
epoch 5, loss 0.2130, train acc 0.922, test acc 0.903, time 152.6 sec


## DenseNet

![Image Name](https://cdn.kesci.com/upload/image/q5l8mi78yz.png?imageView2/0/w/600/h/600)

#### Main Blocks:
- Dense block: defines how the input and output are concatenated.
- Transition layer: keeps the number of channels from growing too large.
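
The interplay between the two blocks is mostly channel bookkeeping. As a sketch, assuming the configuration used for the model later in this section (64 starting channels, growth rate 32, four dense blocks of 4 convs each), the channel count evolves as follows; `dense_out_channels` is a helper name introduced here for illustration.

```python
def dense_out_channels(in_channels, num_convs, growth_rate):
    # Each conv block appends growth_rate channels to its input.
    return in_channels + num_convs * growth_rate

channels = 64
num_blocks = 4
trace = []
for i in range(num_blocks):
    channels = dense_out_channels(channels, num_convs=4, growth_rate=32)
    trace.append(("dense", channels))
    if i != num_blocks - 1:
        channels //= 2                    # the transition layer halves the channels
        trace.append(("transition", channels))

print(trace)
# [('dense', 192), ('transition', 96), ('dense', 224), ('transition', 112),
#  ('dense', 240), ('transition', 120), ('dense', 248)]
```

Without the transition layers the count would reach 64 + 4 * 4 * 32 = 576 channels by the end of the fourth block instead of 248.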

In [24]:
def conv_block(in_channels, out_channels): # for simpler use in below densenet model
    blk = nn.Sequential(nn.BatchNorm2d(in_channels), 
                        nn.ReLU(),
                        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
    return blk

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, out_channels): # num_convs: number of conv_block units in this dense block
        super(DenseBlock, self).__init__()
        net = []
        for i in range(num_convs):
            in_c = in_channels + i * out_channels # each concatenation adds out_channels to the input
            net.append(conv_block(in_c, out_channels))
        self.net = nn.ModuleList(net)
        self.out_channels = in_channels + num_convs * out_channels # calculate out_channel num

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # concat input and output on the channel dim
        return X

In [25]:
blk = DenseBlock(2, 3, 10)
X = torch.rand(4, 3, 8, 8)
Y = blk(X)
Y.shape 
# torch.Size([4, 23, 8, 8])

torch.Size([4, 23, 8, 8])

### Transition Block

- $1\times1$ conv layer: reduces the number of channels.
- Average pooling with stride 2: halves the height and width.
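
Why a $1\times1$ convolution changes only the channel count can be seen by writing it as a matrix multiply over the channel axis at every pixel. The sketch below assumes the same 23-to-10 channel reduction used in the example that follows; the batch size and the `einsum` formulation are illustrative choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(2, 23, 8, 8)

conv1x1 = nn.Conv2d(23, 10, kernel_size=1)
pool = nn.AvgPool2d(kernel_size=2, stride=2)

# A 1x1 convolution mixes channels independently at every pixel:
# it is a matrix multiply over the channel axis, plus a bias.
W = conv1x1.weight.view(10, 23)                      # (out_channels, in_channels)
manual = torch.einsum('oc,nchw->nohw', W, X) + conv1x1.bias.view(1, 10, 1, 1)

Y = conv1x1(X)
print(torch.allclose(Y, manual, atol=1e-5))
print(pool(Y).shape)    # channels 23 -> 10, height and width 8 -> 4
```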

In [27]:
def transition_block(in_channels, out_channels):
    blk = nn.Sequential(
            nn.BatchNorm2d(in_channels), 
            nn.ReLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2))
    return blk

blk = transition_block(23, 10)
blk(Y).shape 
# torch.Size([4, 10, 4, 4])

torch.Size([4, 10, 4, 4])

### DenseNet Model

In [28]:
net = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64), 
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [29]:
num_channels, growth_rate = 64, 32  # num_channels is current channel number
num_convs_in_dense_blocks = [4, 4, 4, 4]

for i, num_convs in enumerate(num_convs_in_dense_blocks):
    DB = DenseBlock(num_convs, num_channels, growth_rate)
    net.add_module("DenseBlosk_%d" % i, DB)
    # number of output channels of the dense block just added
    num_channels = DB.out_channels
    # add a transition block that halves the channel count between dense blocks
    if i != len(num_convs_in_dense_blocks) - 1:
        net.add_module("transition_block_%d" % i, transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2

In [30]:
net.add_module("BN", nn.BatchNorm2d(num_channels))
net.add_module("relu", nn.ReLU())
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d()) # GlobalAvgPool2d output: (Batch, num_channels, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(num_channels, 10))) 

X = torch.rand((1, 1, 96, 96))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)

0  output shape:	 torch.Size([1, 64, 48, 48])
1  output shape:	 torch.Size([1, 64, 48, 48])
2  output shape:	 torch.Size([1, 64, 48, 48])
3  output shape:	 torch.Size([1, 64, 24, 24])
DenseBlosk_0  output shape:	 torch.Size([1, 192, 24, 24])
transition_block_0  output shape:	 torch.Size([1, 96, 12, 12])
DenseBlosk_1  output shape:	 torch.Size([1, 224, 12, 12])
transition_block_1  output shape:	 torch.Size([1, 112, 6, 6])
DenseBlosk_2  output shape:	 torch.Size([1, 240, 6, 6])
transition_block_2  output shape:	 torch.Size([1, 120, 3, 3])
DenseBlosk_3  output shape:	 torch.Size([1, 248, 3, 3])
BN  output shape:	 torch.Size([1, 248, 3, 3])
relu  output shape:	 torch.Size([1, 248, 3, 3])
global_avg_pool  output shape:	 torch.Size([1, 248, 1, 1])
fc  output shape:	 torch.Size([1, 10])


In [32]:
batch_size = 256

train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=96)
lr, num_epochs = 0.001, 15
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 0.1900, train acc 0.931, test acc 0.911, time 67.1 sec
epoch 2, loss 0.1721, train acc 0.938, test acc 0.912, time 67.4 sec
epoch 3, loss 0.1597, train acc 0.941, test acc 0.887, time 67.6 sec
epoch 4, loss 0.1496, train acc 0.945, test acc 0.919, time 67.7 sec
epoch 5, loss 0.1379, train acc 0.950, test acc 0.925, time 67.7 sec
epoch 6, loss 0.1287, train acc 0.953, test acc 0.919, time 67.7 sec
epoch 7, loss 0.1197, train acc 0.955, test acc 0.925, time 67.7 sec
epoch 8, loss 0.1111, train acc 0.959, test acc 0.910, time 67.8 sec
epoch 9, loss 0.1022, train acc 0.962, test acc 0.932, time 67.7 sec
epoch 10, loss 0.0939, train acc 0.965, test acc 0.912, time 67.7 sec
epoch 11, loss 0.0864, train acc 0.968, test acc 0.932, time 67.7 sec
epoch 12, loss 0.0755, train acc 0.972, test acc 0.925, time 67.6 sec
epoch 13, loss 0.0728, train acc 0.974, test acc 0.928, time 67.7 sec
epoch 14, loss 0.0627, train acc 0.977, test acc 0.916, time 67.6 sec
epoch 15, l