# Lab 3 Part 2: Batch Normalization 

José María Martínez Marín 100443343

Azamat Ziiadinov 100460540

------------------------------------------------------
*Deep Learning. Master in Big Data Analytics*

*Pablo M. Olmos pamartin@ing.uc3m.es*

------------------------------------------------------

Batch normalization was introduced in Sergey Ioffe's and Christian Szegedy's 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf). The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to _layers within_ the network. 

> It's called **batch** normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current *batch*.
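
For reference, the per-feature transform defined in the paper can be written as below. Here $B = \{x_1, \dots, x_m\}$ is the mini-batch of values of one feature, $\epsilon$ is a small constant added for numerical stability, and $\gamma$ and $\beta$ are the learnable scale and shift parameters:

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
$$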

We will first analyze the effect of Batch Normalization (BN) in a simple NN with dense layers. Then you will be able to incorporate BN into the CNN that you designed in the first part of Lab 3.

Note: a large part of the following material is a personal wrap-up of [Facebook's Deep Learning Course in Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188). All credit goes to them!

## Batch Normalization in PyTorch<a id="implementation_1"></a>

This section of the notebook shows you one way to add batch normalization to a neural network built in PyTorch. 

The following cells import the packages we need in the notebook and load the MNIST dataset to use in our experiments.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  #To get figures with high quality!

import numpy as np
import torch
from torch import nn
from torch import optim
import matplotlib.pyplot as plt
import time

In [None]:
from google.colab import drive
drive.mount('/content/drive')

from torchvision import datasets, transforms

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])

sst_home='drive/MyDrive/'
#modify this path 
path=sst_home+'MNIST_data/MNIST_data'

trainset = datasets.MNIST(path, download=False, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

testset = datasets.MNIST(path, download=False, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)

Mounted at /content/drive


### Neural network classes

The following class, `MLP`, allows us to create identical neural networks **with and without batch normalization** to compare. We are defining a simple NN with **two dense layers** for classification; this design choice was made to support the discussion related to batch normalization and not to get the best classification accuracy.

Two important points about BN:

- We use PyTorch's [BatchNorm1d](https://pytorch.org/docs/stable/nn.html#batchnorm1d). This is the layer you use on the outputs of linear layers; you'll use [BatchNorm2d](https://pytorch.org/docs/stable/nn.html#batchnorm2d) for 2D outputs such as the feature maps produced by convolutional layers.
- We add the batch normalization layer **before** calling the activation function.


In [None]:
class MLP(nn.Module):
    def __init__(self,dimx,hidden1,hidden2,nlabels,use_batch_norm): #Nlabels will be 10 in our case
        
        super().__init__()
        
        # Keep track of whether or not this network uses batch normalization.
        self.use_batch_norm = use_batch_norm
        
        self.output1 = nn.Linear(dimx,hidden1)
        
        self.output2 = nn.Linear(hidden1,hidden2)        
        
        self.output3 = nn.Linear(hidden2,nlabels)
    
        self.relu = nn.ReLU()
        
        self.logsoftmax = nn.LogSoftmax(dim=1)
        
        if self.use_batch_norm:

            self.batch_norm1 = nn.BatchNorm1d(hidden1)
            
            self.batch_norm2 = nn.BatchNorm1d(hidden2)
            
        
    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.output1(x)
        if self.use_batch_norm:
            x = self.batch_norm1(x)
        x = self.relu(x)
        x = self.output2(x)
        if self.use_batch_norm:
            x = self.batch_norm2(x)        
        x = self.relu(x)
        x = self.output3(x)
        x = self.logsoftmax(x) 
        return x

> **Exercise:** 
> 
> - Create a validation set with 20% of the training set
> - Extend the class above to incorporate a training method where both training and validation losses are computed, and a method to evaluate the classification performance on a given set

**Note:** As with Dropout, a model with BN must be switched with `self.eval()` before validation or testing, and back with `self.train()` before training resumes. Setting a model to evaluation mode is important for models with batch normalization layers!

> * Training mode means that the batch normalization layers use **batch** statistics to normalize their inputs.
> * Evaluation mode, on the other hand, uses the running estimates of the **population** mean and variance accumulated during training, so predictions on validation and test data do not depend on the composition of each batch.
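
The difference between the two modes can be checked on a single layer. The following is a minimal sketch, assuming a made-up batch `x` of 64 samples with 3 features (not data from this lab). It relies on the default `nn.BatchNorm1d` settings, where the running mean starts at 0 and the running variance at 1.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical batch: 64 samples, 3 features, deliberately not zero-mean / unit-variance.
x = 5.0 + 2.0 * torch.randn(64, 3)

bn = nn.BatchNorm1d(3)  # default scale (weight) is 1 and default shift (bias) is 0

# Training mode: normalize with the statistics of the current batch.
bn.train()
y_train = bn(x)
print(y_train.mean(dim=0))  # close to 0 for every feature

# The forward pass above also updated the running estimates.
print(bn.running_mean, bn.running_var)

# Evaluation mode: normalize with the running estimates instead.
bn.eval()
with torch.no_grad():
    y_eval = bn(x)
    manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)

print(torch.allclose(y_eval, manual, atol=1e-5))
```

After a single training step the running estimates are still far from the true statistics of `x`, so `y_eval` differs clearly from `y_train`. This is why the estimates need many training batches before evaluation mode gives sensible outputs.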

In [None]:


import copy

numval = int(len(trainset)*0.8)

validloader = copy.deepcopy(trainloader)  # Creates a copy of the object 
full_trainloader = copy.deepcopy(trainloader)


#We take the first 80% of the images for training
trainloader.dataset.data = trainloader.dataset.data[:numval,:,:]
trainloader.dataset.targets = trainloader.dataset.targets[:numval]

#And the rest for validation
validloader.dataset.data = validloader.dataset.data[numval:,:,:]
validloader.dataset.targets = validloader.dataset.targets[numval:]
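
The cell above takes a contiguous 80/20 split. As an alternative sketch, assuming a random split is acceptable, `torch.utils.data.random_split` gives the same proportions without copying the loader. The `TensorDataset` below is a hypothetical stand-in for `trainset`, and the names `train_subset`, `valid_subset`, `train_loader` and `valid_loader` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for `trainset`: 100 fake 28x28 images with labels 0..9.
dataset = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))

n_train = int(len(dataset) * 0.8)
n_valid = len(dataset) - n_train

# A seeded generator keeps the split reproducible across runs.
train_subset, valid_subset = random_split(
    dataset, [n_train, n_valid], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
valid_loader = DataLoader(valid_subset, batch_size=64, shuffle=False)

print(len(train_subset), len(valid_subset))
```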


In [None]:
class MLP_extended(MLP):
    
    def __init__(self,dimx,hidden1,hidden2,nlabels,use_batch_norm,epochs=100,lr=0.001):   
        super().__init__(dimx,hidden1,hidden2,nlabels,use_batch_norm)  #To initialize MLP!      
        self.lr = lr #Learning Rate      
        self.optim = optim.Adam(self.parameters(), self.lr)      
        self.epochs = epochs       
        self.criterion = nn.NLLLoss()    
        self.loss_during_training = []
        # A list to store the loss evolution along validation    
        self.valid_loss_during_training = [] 

    def valid_loss(self, validloader): 
        if validloader is not None:  
            with torch.no_grad():
                self.eval()
                running_loss = 0.    
                for (images, labels) in validloader:
                    out = self.forward(images.view(images.shape[0], -1))
                    loss = self.criterion(out, labels)
                    running_loss += loss.item()    
                self.valid_loss_during_training.append(running_loss/len(validloader))    
        else:  
            raise ValueError('validloader must contain data.')        

    # set model back to train mode
        self.train() 
                  
    def trainloop(self,trainloader,validloader = None):     
        # Adam Loop     
        for e in range(int(self.epochs)):      
            running_loss = 0.

            for images, labels in trainloader:             
                self.optim.zero_grad()  #TO RESET GRADIENTS!
                out = self.forward(images.view(images.shape[0], -1))             
                loss = self.criterion(out, labels)
                running_loss += loss.item()
                loss.backward()
                self.optim.step()       

            self.loss_during_training.append(running_loss/len(trainloader))
            if validloader is not None:
                self.valid_loss(validloader)

            # Report progress after every epoch (e counts epochs)
            print("Training loss after %d iterations: %f" 
                  %(e,self.loss_during_training[-1]))
            # Only report the validation loss when a validation set was given
            if validloader is not None:
                print("Validation loss after %d epochs: %f" 
                      %(e,self.valid_loss_during_training[-1]))

    def accuracy(self, loader):
        accuracy = 0
        # Evaluation mode: BN layers use their running statistics, not batch statistics
        self.eval()
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for (images, labels) in loader:
                logprobs = self.forward(images.view(images.shape[0], -1)) # We use a log-softmax, to get log-probabilities
                top_p, top_class = logprobs.topk(1, dim=1)
                equals = (top_class == labels.view(images.shape[0], 1))
                accuracy += torch.mean(equals.type(torch.FloatTensor))
        # Set model back to train mode
        self.train()
        print("Accuracy %f" %(accuracy/len(loader)))
        return accuracy/len(loader)
        

### Create two different models for testing

* `net_batchnorm` uses batch normalization applied to the output of its hidden layers
* `net_no_norm` does not use batch normalization

Besides the normalization layers, everything about these models is the same.

> **Exercise:** Train both models and compare the evolution of the train/validation loss in both cases

In [None]:


net_batchnorm = MLP_extended(dimx=784, hidden1=128,hidden2=64, nlabels=10,use_batch_norm=True,epochs=10,lr=1e-3)

net_batchnorm.trainloop(trainloader, validloader)

print("Train Accuracy %f" %(net_batchnorm.accuracy(trainloader)))
print("Valid Accuracy %f" %(net_batchnorm.accuracy(validloader)))
print("Test Accuracy %f" %(net_batchnorm.accuracy(testloader)))

Training loss after 0 iterations: 0.290957
Validation loss after 0 epochs: 0.113285
Training loss after 1 iterations: 0.104279
Validation loss after 1 epochs: 0.101893
Training loss after 2 iterations: 0.073004
Validation loss after 2 epochs: 0.089217
Training loss after 3 iterations: 0.055506
Validation loss after 3 epochs: 0.085939
Training loss after 4 iterations: 0.046058
Validation loss after 4 epochs: 0.082477
Training loss after 5 iterations: 0.036773
Validation loss after 5 epochs: 0.077693
Training loss after 6 iterations: 0.031783
Validation loss after 6 epochs: 0.075806
Training loss after 7 iterations: 0.029539
Validation loss after 7 epochs: 0.084594
Training loss after 8 iterations: 0.026864
Validation loss after 8 epochs: 0.080450
Training loss after 9 iterations: 0.023114
Validation loss after 9 epochs: 0.073443
Accuracy 0.995750
Train Accuracy 0.995750
Accuracy 0.975399
Valid Accuracy 0.975399
Accuracy 0.976214
Test Accuracy 0.976214


In [None]:


net_no_norm = MLP_extended(dimx=784, hidden1=128,hidden2=64, nlabels=10, use_batch_norm = False ,epochs=10,lr=1e-3)

net_no_norm.trainloop(trainloader, validloader)

print("Train Accuracy %f" %(net_no_norm.accuracy(trainloader)))
print("Valid Accuracy %f" %(net_no_norm.accuracy(validloader)))
print("Test Accuracy %f" %(net_no_norm.accuracy(testloader)))

Training loss after 0 iterations: 0.463631
Validation loss after 0 epochs: 0.306246
Training loss after 1 iterations: 0.235319
Validation loss after 1 epochs: 0.235725
Training loss after 2 iterations: 0.173280
Validation loss after 2 epochs: 0.203273
Training loss after 3 iterations: 0.139725
Validation loss after 3 epochs: 0.177059
Training loss after 4 iterations: 0.116717
Validation loss after 4 epochs: 0.139388
Training loss after 5 iterations: 0.098232
Validation loss after 5 epochs: 0.156430
Training loss after 6 iterations: 0.086279
Validation loss after 6 epochs: 0.133406
Training loss after 7 iterations: 0.076281
Validation loss after 7 epochs: 0.127400
Training loss after 8 iterations: 0.071329
Validation loss after 8 epochs: 0.138808
Training loss after 9 iterations: 0.061316
Validation loss after 9 epochs: 0.149094
Accuracy 0.977578
Train Accuracy 0.977578
Accuracy 0.957083
Valid Accuracy 0.957083
Accuracy 0.961286
Test Accuracy 0.961286


As expected, the model with Batch Normalization is more accurate on the training, validation, and test sets (test accuracy 0.976 versus 0.961). Its losses are also markedly lower: after 10 epochs the validation loss is 0.073 with BN and 0.149 without it.
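
The comparison can also be read directly from the logs. The short sketch below copies the validation losses printed by the two training runs above and reports the best epoch of each. The helper `best_epoch` is an illustrative name, not part of the lab code.

```python
# Validation losses per epoch, copied from the two training logs above.
valid_bn = [0.113285, 0.101893, 0.089217, 0.085939, 0.082477,
            0.077693, 0.075806, 0.084594, 0.080450, 0.073443]
valid_no_bn = [0.306246, 0.235725, 0.203273, 0.177059, 0.139388,
               0.156430, 0.133406, 0.127400, 0.138808, 0.149094]

def best_epoch(losses):
    """Return (epoch index, loss) of the lowest validation loss."""
    epoch = min(range(len(losses)), key=losses.__getitem__)
    return epoch, losses[epoch]

print("With BN:   ", best_epoch(valid_bn))
print("Without BN:", best_epoch(valid_no_bn))

# Epochs at which the BN model is already below the best no-BN validation loss.
_, best_no_bn = best_epoch(valid_no_bn)
epochs_ahead = [e for e, loss in enumerate(valid_bn) if loss < best_no_bn]
print(epochs_ahead)
```

In these logs the BN model is below the best validation loss of the model without BN from the very first epoch.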

---
### Considerations for other network types

This notebook demonstrates batch normalization in a standard neural network with fully connected layers. You can also use batch normalization in other types of networks, but there are some special considerations.

#### ConvNets

Convolution layers consist of multiple feature maps. (Remember, the depth of a convolutional layer refers to its number of feature maps.) And the weights for each feature map are shared across all the inputs that feed into the layer. Because of these differences, batch normalizing convolutional layers requires batch/population mean and variance per feature map rather than per node in the layer.

> To apply batch normalization on the outputs of convolutional layers, we use [BatchNorm2d](https://pytorch.org/docs/stable/nn.html#batchnorm2d). To use it, we simply state the **number of input feature maps**. I.e. `nn.BatchNorm2d(num_features=nmaps)`
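
A minimal sketch of what "per feature map" means in practice, assuming a made-up activation tensor with 6 feature maps of size 14x14 (the shapes are illustrative only):

```python
import torch
from torch import nn

torch.manual_seed(0)

nmaps = 6  # number of feature maps, e.g. the output channels of a conv layer
x = 3.0 + 2.0 * torch.randn(8, nmaps, 14, 14)  # hypothetical (batch, maps, height, width)

bn2d = nn.BatchNorm2d(num_features=nmaps)
bn2d.train()
y = bn2d(x)

# One scale (weight) and one shift (bias) per feature map, not per pixel.
print(bn2d.weight.shape, bn2d.bias.shape)

# Statistics are pooled over batch, height and width for each feature map.
print(y.mean(dim=(0, 2, 3)))                 # close to 0 per map
print(y.var(dim=(0, 2, 3), unbiased=False))  # close to 1 per map
```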


#### RNNs

Batch normalization can work with recurrent neural networks, too, as shown in the 2016 paper [Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025). It's a bit more work to implement, but basically involves calculating the means and variances per time step instead of per layer. You can find an example where someone implemented recurrent batch normalization in PyTorch, in [this GitHub repo](https://github.com/jihunchoi/recurrent-batch-normalization-pytorch).

> **Exercise:** Using the CIFAR10 dataset, incorporate BN into your solution of Lab 3 (Part I). Compare the results with and without BN!

In [None]:

import torch
from torchvision import datasets, transforms, utils

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset2 = datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

trainloader2 = torch.utils.data.DataLoader(trainset2, batch_size=64,
                                          shuffle=True, num_workers=2)

testset2 = datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)

testloader2 = torch.utils.data.DataLoader(testset2, batch_size=64,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')



Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [None]:


import copy

numval2 = int(len(trainset2)*0.8)

validloader2 = copy.deepcopy(trainloader2)  # Creates a copy of the object 
full_trainloader2 = copy.deepcopy(trainloader2)


#We take the first 80% of the images for training
trainloader2.dataset.data = trainloader2.dataset.data[:numval2,:,:]
trainloader2.dataset.targets = trainloader2.dataset.targets[:numval2]

#And the rest for validation
validloader2.dataset.data = validloader2.dataset.data[numval2:,:,:]
validloader2.dataset.targets = validloader2.dataset.targets[numval2:]


In [None]:


class Lenet5(nn.Module):
    def __init__(self,dimx,use_batch_norm,nlabels, pr=0.2): #Nlabels will be 10 in our case
        super().__init__()
        # Keep track of whether or not this network uses batch normalization.
        self.use_batch_norm = use_batch_norm
        # convolutional layer (sees 32x32x3 CIFAR10 image tensor)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=6, 
                               kernel_size=5, stride=1, padding=0)     
        # convolutional layer (sees 14x14x6 tensor after the first pooling)
        self.conv2 = nn.Conv2d(6, 16, 5, padding=0)      
        # Max pool layer
        self.pool = nn.MaxPool2d(2, 2)
        # Linear layers
        self.linear1 = nn.Linear(in_features = 400, out_features = 120) #      
        self.linear2 = nn.Linear(in_features = 120, out_features = 84) #       
        self.linear3 = nn.Linear(in_features = 84, out_features = 10) #
        self.relu = nn.ReLU()      
        self.logsoftmax = nn.LogSoftmax(dim=1)     
        # adding the dropout component:
        self.dropout = nn.Dropout(pr)

        if self.use_batch_norm:
            self.batch_norm1 = nn.BatchNorm2d(num_features=6)
            self.batch_norm2 = nn.BatchNorm2d(num_features=16)  
            self.batch_norm3 = nn.BatchNorm1d(120) # first linear
            self.batch_norm4 = nn.BatchNorm1d(84) # second linear         

        # Spatial dimension of the Tensor at the output of the 2nd CNN
        self.final_dim = int(((dimx-4)/2-4)/2)
        
    def forward(self, x):
        # Pass the input tensor through the CNN operations
        x = self.conv1(x)
        if self.use_batch_norm:
            x = self.batch_norm1(x) 
        x = self.relu(x) 
        x = self.pool(x)
        x = self.conv2(x)
        if self.use_batch_norm:
            x = self.batch_norm2(x)
        x = self.relu(x)
        x = self.pool(x)
        # Flatten the tensor into a vector of size 16*final_dim*final_dim (400 for 32x32 inputs)
        x = x.view(x.size(0), -1)
        # Pass the tensor through the Dense Layers
        x = self.linear1(x) #
        if self.use_batch_norm:   # batch normalization #3
          x = self.batch_norm3(x)
        x = self.relu(x)
        x = self.dropout(x) 
        x = self.linear2(x) #    
        if self.use_batch_norm:   # batch normalization #4
          x = self.batch_norm4(x)  
        x = self.relu(x)
        x = self.dropout(x) 
        x = self.linear3(x) #
        x = self.logsoftmax(x) 
        return x
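
The value `in_features = 400` of `linear1` follows from the spatial sizes along the network. The sketch below reproduces that arithmetic with a small helper, `conv_out`, which is an illustrative name and not part of the lab code. It assumes the standard output-size formula for convolution and pooling layers.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # CIFAR10 images are 32x32
size = conv_out(size, kernel=5)             # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)   # pool:  28 -> 14
size = conv_out(size, kernel=5)             # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)   # pool:  10 -> 5

flat_features = 16 * size * size            # 16 feature maps after conv2
print(size, flat_features)
```

The final `size` agrees with `self.final_dim = int(((dimx-4)/2-4)/2)` for `dimx = 32`.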


In [None]:
class Lenet5_extended(Lenet5):    
    def __init__(self,dimx,nlabels,use_batch_norm,epochs=100,lr=0.001): 
        # Pass by keyword: Lenet5 expects (dimx, use_batch_norm, nlabels, pr)
        super().__init__(dimx, use_batch_norm=use_batch_norm, nlabels=nlabels, pr=0.3)  #To initialize Lenet5!
        self.lr = lr #Learning Rate
        self.optim = optim.Adam(self.parameters(), self.lr) 
        self.epochs = epochs   
        self.criterion = nn.NLLLoss()
       # A list to store the loss evolution along training and validation
        self.loss_during_training = [] 
        self.valid_loss_during_training = []


    def valid_loss(self, validloader): 
        if validloader is not None:  
            with torch.no_grad():
                self.eval()
                running_loss = 0.    
                for (images, labels) in validloader:
                   # out = self.forward(images.view(images.shape[0], -1))
                    out = self.forward(images)
                    loss = self.criterion(out, labels)
                    running_loss += loss.item()    
                self.valid_loss_during_training.append(running_loss/len(validloader))    
        else:  
            raise ValueError('validloader must contain data.')    

    # set model back to train mode
        self.train() 

    def trainloop(self,trainloader, validloader = None):       
        # Adam Loop       
        for e in range(int(self.epochs)):
            running_loss = 0.
            for (images, labels) in trainloader:             

                self.optim.zero_grad()  #TO RESET GRADIENTS!
                # out = self.forward(images.view(images.shape[0], -1))
                out = self.forward(images)
                loss = self.criterion(out, labels)
                running_loss += loss.item()
                loss.backward()
                self.optim.step()        
            self.loss_during_training.append(running_loss/len(trainloader))
            if validloader is not None:
                self.valid_loss(validloader) 

            # Report progress after every epoch (e counts epochs)
            print("Training loss after %d iterations: %f" 
                  %(e,self.loss_during_training[-1]))
            # Only report the validation loss when a validation set was given
            if validloader is not None:
                print("Validation loss after %d epochs: %f" 
                      %(e,self.valid_loss_during_training[-1]))   

    def accuracy(self, test_data):
        accuracy = 0
        # Evaluation mode: disables dropout and makes BN use its running statistics
        self.eval()
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for (images, labels) in test_data:
                logprobs = self.forward(images) # We use a log-softmax, to get log-probabilities
                top_p, top_class = logprobs.topk(1, dim=1)
                equals = (top_class == labels.view(images.shape[0], 1))
                accuracy += torch.mean(equals.type(torch.FloatTensor))
        # Set model back to train mode
        self.train()
        print("Accuracy %f" %(accuracy/len(test_data)))
        return accuracy/len(test_data)

In [None]:
my_CNN_Lenet = Lenet5_extended(dimx = 32, nlabels = 10, use_batch_norm = True, epochs = 20, lr = 1e-3)
my_CNN_Lenet.trainloop(trainloader2, validloader2)

print("Train data:")
my_CNN_Lenet.accuracy(full_trainloader2)
print("Test data:")
my_CNN_Lenet.accuracy(testloader2)  #

Training loss after 0 iterations: 1.631382
Validation loss after 0 epochs: 1.413077
Training loss after 1 iterations: 1.359642
Validation loss after 1 epochs: 1.219037
Training loss after 2 iterations: 1.251185
Validation loss after 2 epochs: 1.147089
Training loss after 3 iterations: 1.186236
Validation loss after 3 epochs: 1.104796
Training loss after 4 iterations: 1.136592
Validation loss after 4 epochs: 1.061871
Training loss after 5 iterations: 1.097977
Validation loss after 5 epochs: 1.029301
Training loss after 6 iterations: 1.071666
Validation loss after 6 epochs: 1.021443
Training loss after 7 iterations: 1.039811
Validation loss after 7 epochs: 1.022729
Training loss after 8 iterations: 1.022765
Validation loss after 8 epochs: 1.060643
Training loss after 9 iterations: 1.004440
Validation loss after 9 epochs: 0.999715
Training loss after 10 iterations: 0.982071
Validation loss after 10 epochs: 0.990055
Training loss after 11 iterations: 0.967940
Validation loss after 11 epoch

tensor(0.6112)

In [None]:
my_CNN_Lenet = Lenet5_extended(dimx = 32, nlabels = 10, use_batch_norm = False, epochs = 20, lr = 1e-3)
my_CNN_Lenet.trainloop(trainloader2, validloader2)

print("Train data:")
my_CNN_Lenet.accuracy(full_trainloader2)
print("Test data:")
my_CNN_Lenet.accuracy(testloader2)  #

Training loss after 0 iterations: 1.604086
Validation loss after 0 epochs: 1.316669
Training loss after 1 iterations: 1.328411
Validation loss after 1 epochs: 1.287307
Training loss after 2 iterations: 1.221542
Validation loss after 2 epochs: 1.125470
Training loss after 3 iterations: 1.161390
Validation loss after 3 epochs: 1.063552
Training loss after 4 iterations: 1.105915
Validation loss after 4 epochs: 1.019644
Training loss after 5 iterations: 1.072462
Validation loss after 5 epochs: 1.007428
Training loss after 6 iterations: 1.036811
Validation loss after 6 epochs: 1.017430
Training loss after 7 iterations: 1.007060
Validation loss after 7 epochs: 1.005187
Training loss after 8 iterations: 0.989100
Validation loss after 8 epochs: 1.011226
Training loss after 9 iterations: 0.967621
Validation loss after 9 epochs: 0.964540
Training loss after 10 iterations: 0.951535
Validation loss after 10 epochs: 1.016596
Training loss after 11 iterations: 0.928738
Validation loss after 11 epoch

tensor(0.6215)

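The two runs reach almost the same test accuracy (0.6112 with Batch Normalization versus 0.6215 without), and their loss curves are also very close. A gap of about one percentage point from a single run may well be within what different random initializations and batch orderings produce, so these results do not show a clear benefit or penalty from BN for this architecture. Other possible explanations are that BN does not smooth the objective function enough to matter in such a small network, or that the intermediate layers do not behave as expected. Because the two runs are so similar, it is also worth verifying that the `use_batch_norm` flag reaches `Lenet5` with the intended value in each case.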