# HPML LAB 3
### Sayan Banerjee
### sb7594

## Problem 1 - *Batch Normalization, Dropout, MNIST* 

Batch normalization and Dropout are used as effective regularization techniques. However, it is not clear
which one should be preferred and whether their benefits add up when used in conjunction. In this problem,
we will compare batch normalization, dropout, and their conjunction using MNIST and LeNet-5 (see e.g.,
http://yann.lecun.com/exdb/lenet/). LeNet-5 is one of the earliest convolutional neural networks developed
for image classification and its implementation in all major frameworks is available.

1. Explain the terms co-adaptation and internal covariance-shift. Use examples if needed. You may need
to refer to two papers mentioned below to answer this question. 

Answer:

### Co-adaptation
In a neural network sometimes hidden units(neurons) might become highly correlated. This is called co-adaptation. This happens when a unit changes in a particular way to fix mistakes of other neurons. This leads to over-fitting as it performs well on train data but does not geralise well in test data. The way to fix this problem is called dropout.<br>
As stated in the paper " Dropout: A Simple Way to Prevent Neural Networks from Overfitting.":<br>
<br>
"In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing.Therefore, units may change in a way that they fix up the mistakes of the other units.This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units. To observe this effect directly, we look at the first level features learned by neural networks trained on visual tasks with and without dropout."

### Internal Covariate Shift

In the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Internal covariate shift is defined as:<br>

"We define Internal Covariate Shift as the change in the
distribution of network activations due to the change in
network parameters during training"


This as the paper mentions can be fixed with batch nonrmalization.<br>
As stated by:<br>
"Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs"

2. Batch normalization is traditionally used in hidden layers, for the input layer standard normalization
is used. In standard normalization, the mean and standard deviation are calculated using the entire
training dataset whereas in batch normalization these statistics are calculated for each mini-batch.
Train LeNet-5 with standard normalization of input and batch normalization for hidden layers. What
are the learned batch norm parameters for each layer?

Answer:
Below.
The learned parameters for batch norm are the alpha and beta values , that is alpha for scale and beta for shift. This is to normalize the values.

In [2]:
from torch.nn import Module
from torch import nn
import torch.nn.functional as F
import torch



In [4]:
import numpy as np
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from torchvision import datasets, transforms

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


In [18]:
def get_accuracy(model, data_loader, device):
    correct_pred = 0 
    n = 0
    
    with torch.no_grad():
        model.eval()
        for X, y_true in data_loader:

            X = X.to(device)
            y_true = y_true.to(device)

            _, y_prob = model(X)
            _, predicted_labels = torch.max(y_prob, 1)

            n += y_true.size(0)
            correct_pred += (predicted_labels == y_true).sum()

    return correct_pred.float() / n

In [12]:
train_dataset = mnist.MNIST(root='./train', train=True, transform=transforms.Compose(
    [
        transforms.Resize((32, 32)),
        transforms.ToTensor(),
        transforms.Normalize(mean = (0.1307,), std = (0.3081,))
    ]),download=True)
test_dataset = mnist.MNIST(root='./test', train=False, transform=transforms.Compose(
    [
        transforms.Resize((32, 32)),
        transforms.ToTensor(),
        transforms.Normalize(mean = (0.1307,), std = (0.3081,))
    ]),download=True)
train_loader = DataLoader(train_dataset, batch_size=256)
test_loader = DataLoader(test_dataset, batch_size=256)


# transforms = transforms.Compose([transforms.Resize((32, 32)),
#                                  transforms.ToTensor()])


# train_dataset = mnist.MNIST(root='./train', train=True, transform=transforms,download=True)
# test_dataset = mnist.MNIST(root='./test', train=False, transform=transforms,download=True)
# train_loader = DataLoader(train_dataset, batch_size=256,shuffle=True)
# test_loader = DataLoader(test_dataset, batch_size=256,shuffle=False)

In [13]:
class LeNetNormInput(nn.Module):

    def __init__(self, n_classes):
        super(LeNetNormInput, self).__init__()
        
        self.feature_extractor = nn.Sequential(     
        nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
        nn.BatchNorm2d(6),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2),
        nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
        nn.BatchNorm2d(16),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2))
        

        self.classifier = nn.Sequential(
        nn.Linear(400, 120),
        nn.BatchNorm1d(120),
        nn.ReLU(),
        nn.Linear(120, 84),
        nn.BatchNorm1d(84),
        nn.ReLU(),
        nn.Linear(84, n_classes)
        )


    def forward(self, x):
        x = self.feature_extractor(x)
        x = x.reshape(x.size(0), -1)
        logits = self.classifier(x)
        probs = F.softmax(logits, dim=1)
        return logits, probs

In [14]:
model = LeNetNormInput(10).to(DEVICE)
optimizer = SGD(model.parameters(), lr=1e-1)
criterion = CrossEntropyLoss()
all_epoch = 10

train_loss_arr=[]
test_loss_arr=[]
for current_epoch in range(all_epoch):
    model.train()
    loss_acc=0
    for idx, (x_train, y_train) in enumerate(train_loader):
        optimizer.zero_grad()
        x_train=x_train.to(DEVICE)
        y_train = y_train.to(DEVICE)
        
        #forward
        y_hat_pred ,_= model(x_train)
        loss = criterion(y_hat_pred, y_train)
        loss_acc += loss.item() * x_train.size(0)

              
        #backward
        loss.backward()
        optimizer.step()
        epoch_loss_train = loss_acc / len(train_loader.dataset)
        train_loss_arr.append(epoch_loss_train)
    
    
    
    
    all_correct_num = 0
    all_sample_num = 0
    
    with torch.no_grad():
        model.eval()
        loss_acc=0
        
        for idx, (x_test, y_test) in enumerate(test_loader):
            x_test=x_test.to(DEVICE)
            y_test=y_test.to(DEVICE)
            y_hat_pred ,_= model(x_test)
            loss = criterion(y_hat_pred, y_test) 
            loss_acc += loss.item() * x_train.size(0)

        epoch_loss_test = loss_acc / len(test_loader.dataset)    
        test_loss_arr.append(epoch_loss_test)

        train_acc = get_accuracy(model, train_loader, device=DEVICE)
        valid_acc = get_accuracy(model, test_loader, device=DEVICE)
                
        print(f'Epoch: {current_epoch}\t'
            f'Train loss: {epoch_loss_train:.4f}\t'
            f'Valid loss: {epoch_loss_test:.4f}\t'
            f'Train accuracy: {100 * train_acc:.2f}\t'
            f'Valid accuracy: {100 * valid_acc:.2f}')


Epoch: 0	Train loss: 0.2195	Valid loss: 0.0275	Train accuracy: 98.12	Valid accuracy: 98.08
Epoch: 1	Train loss: 0.0576	Valid loss: 0.0183	Train accuracy: 98.87	Valid accuracy: 98.55
Epoch: 2	Train loss: 0.0389	Valid loss: 0.0150	Train accuracy: 99.15	Valid accuracy: 98.81
Epoch: 3	Train loss: 0.0291	Valid loss: 0.0131	Train accuracy: 99.39	Valid accuracy: 98.95
Epoch: 4	Train loss: 0.0224	Valid loss: 0.0121	Train accuracy: 99.52	Valid accuracy: 98.99
Epoch: 5	Train loss: 0.0174	Valid loss: 0.0116	Train accuracy: 99.58	Valid accuracy: 99.03
Epoch: 6	Train loss: 0.0137	Valid loss: 0.0118	Train accuracy: 99.61	Valid accuracy: 99.02
Epoch: 7	Train loss: 0.0109	Valid loss: 0.0118	Train accuracy: 99.69	Valid accuracy: 99.02
Epoch: 8	Train loss: 0.0087	Valid loss: 0.0130	Train accuracy: 99.65	Valid accuracy: 98.96
Epoch: 9	Train loss: 0.0071	Valid loss: 0.0123	Train accuracy: 99.74	Valid accuracy: 99.01


In [None]:
#learned batch norm params example
list(list(model.classifier.children())[1].parameters())

3. Next instead of standard normalization use batch normalization for the input layer also and train the
network. Plot the distribution of learned batch norm parameters for each layer (including input) using
violin plots. Compare the train/test accuracy and loss for the two cases? Did batch normalization for
the input layer improve performance?

Ans:<br>
As seen below in the output the performance is about the same albeit negligibly better(which is in the realm of error of running model performance)
    

In [20]:
#prepare data.

transforms = transforms.Compose([transforms.Resize((32, 32)),
                                 transforms.ToTensor()])


train_dataset = mnist.MNIST(root='./train', train=True, transform=transforms,download=True)
test_dataset = mnist.MNIST(root='./test', train=False, transform=transforms,download=True)
train_loader = DataLoader(train_dataset, batch_size=256,shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=256,shuffle=False)

In [14]:
class LeNetNormAll(nn.Module):

    def __init__(self, n_classes):
        super(LeNetNormAll, self).__init__()
        
        self.feature_extractor = nn.Sequential(
        nn.BatchNorm2d(1),            
        nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
        nn.BatchNorm2d(6),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2),
        nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
        nn.BatchNorm2d(16),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2))
        

        self.classifier = nn.Sequential(
        nn.Linear(400, 120),
        nn.BatchNorm1d(120),
        nn.ReLU(),
        nn.Linear(120, 84),
        nn.BatchNorm1d(84),
        nn.ReLU(),
        nn.Linear(84, n_classes)
        )


    def forward(self, x):
        x = self.feature_extractor(x)
        x = x.reshape(x.size(0), -1)
        logits = self.classifier(x)
        probs = F.softmax(logits, dim=1)
        return logits, probs

In [15]:

model = LeNetNormAll(10).to(DEVICE)
optimizer = SGD(model.parameters(), lr=1e-1)
criterion = CrossEntropyLoss()
all_epoch = 10

train_loss_arr=[]
test_loss_arr=[]
for current_epoch in range(all_epoch):
    model.train()
    loss_acc=0
    for idx, (x_train, y_train) in enumerate(train_loader):
        optimizer.zero_grad()
        x_train=x_train.to(DEVICE)
        y_train = y_train.to(DEVICE)
        
        #forward
        y_hat_pred ,_= model(x_train)
        loss = criterion(y_hat_pred, y_train)
        loss_acc += loss.item() * x_train.size(0)

              
        #backward
        loss.backward()
        optimizer.step()
        epoch_loss_train = loss_acc / len(train_loader.dataset)
        train_loss_arr.append(epoch_loss_train)
    
    
    
    
    all_correct_num = 0
    all_sample_num = 0
    
    with torch.no_grad():
        model.eval()
        loss_acc=0
        
        for idx, (x_test, y_test) in enumerate(test_loader):
            x_test=x_test.to(DEVICE)
            y_test=y_test.to(DEVICE)
            y_hat_pred ,_= model(x_test)
            loss = criterion(y_hat_pred, y_test) 
            loss_acc += loss.item() * x_train.size(0)

        epoch_loss_test = loss_acc / len(test_loader.dataset)    
        test_loss_arr.append(epoch_loss_test)

        train_acc = get_accuracy(model, train_loader, device=DEVICE)
        valid_acc = get_accuracy(model, test_loader, device=DEVICE)
                
        print(f'Epoch: {current_epoch}\t'
            f'Train loss: {epoch_loss_train:.4f}\t'
            f'Valid loss: {epoch_loss_test:.4f}\t'
            f'Train accuracy: {100 * train_acc:.2f}\t'
            f'Valid accuracy: {100 * valid_acc:.2f}')


    

Epoch: 0	Train loss: 0.2244	Valid loss: 0.0291	Train accuracy: 98.25	Valid accuracy: 98.10
Epoch: 1	Train loss: 0.0567	Valid loss: 0.0185	Train accuracy: 98.96	Valid accuracy: 98.55
Epoch: 2	Train loss: 0.0396	Valid loss: 0.0199	Train accuracy: 98.77	Valid accuracy: 98.36
Epoch: 3	Train loss: 0.0306	Valid loss: 0.0322	Train accuracy: 97.87	Valid accuracy: 97.44
Epoch: 4	Train loss: 0.0237	Valid loss: 0.0143	Train accuracy: 99.21	Valid accuracy: 98.75
Epoch: 5	Train loss: 0.0195	Valid loss: 0.0359	Train accuracy: 97.78	Valid accuracy: 96.93
Epoch: 6	Train loss: 0.0172	Valid loss: 0.0178	Train accuracy: 99.22	Valid accuracy: 98.55
Epoch: 7	Train loss: 0.0148	Valid loss: 0.0123	Train accuracy: 99.67	Valid accuracy: 98.97
Epoch: 8	Train loss: 0.0122	Valid loss: 0.0115	Train accuracy: 99.84	Valid accuracy: 99.07
Epoch: 9	Train loss: 0.0100	Valid loss: 0.0102	Train accuracy: 99.91	Valid accuracy: 99.21


In [None]:
plt.violinplot(list(list(model.classifier.children())[1].parameters())[0].detach().cpu().numpy())

In [None]:
plt.violinplot(list(list(model.classifier.children())[1].parameters())[1].detach().cpu().numpy())

4. Train the network without batch normalization but this time use dropout. For hidden layers use a
dropout probability of 0.5 and for input, layer take it to be 0.2 Compare test accuracy using dropout
to test accuracy obtained using batch normalization in parts 2 and 3.

ans:<br>
As seen below in the output the performance is a little bit worse than the above 2 models but again pretty negligible.(about a percent difference)

In [8]:
class LeNetDropOutNoBatch(nn.Module):

    def __init__(self, n_classes):
        super(LeNetDropOutNoBatch, self).__init__()
        
        self.feature_extractor = nn.Sequential(            
        nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
        nn.Dropout(p=0.2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2),
        nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
        nn.Dropout(p=0.5),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2))
        

        self.classifier = nn.Sequential(
        nn.Linear(400, 120),
        nn.Dropout(p=0.5),
        nn.ReLU(),
        nn.Linear(120, 84),
        nn.Dropout(p=0.5),
        nn.ReLU(),
        nn.Linear(84, n_classes)
        )


    def forward(self, x):
        x = self.feature_extractor(x)
        x = x.reshape(x.size(0), -1)
        logits = self.classifier(x)
        probs = F.softmax(logits, dim=1)
        return logits, probs

In [9]:
model = LeNetDropOutNoBatch(10).to(DEVICE)
optimizer = SGD(model.parameters(), lr=1e-1)
criterion = CrossEntropyLoss()
all_epoch = 10

train_loss_arr=[]
test_loss_arr=[]
for current_epoch in range(all_epoch):
    model.train()
    loss_acc=0
    for idx, (x_train, y_train) in enumerate(train_loader):
        optimizer.zero_grad()
        x_train=x_train.to(DEVICE)
        y_train = y_train.to(DEVICE)
        
        #forward
        y_hat_pred ,_= model(x_train)
        loss = criterion(y_hat_pred, y_train)
        loss_acc += loss.item() * x_train.size(0)

              
        #backward
        loss.backward()
        optimizer.step()
        epoch_loss_train = loss_acc / len(train_loader.dataset)
        train_loss_arr.append(epoch_loss_train)
    
    
    
    
    all_correct_num = 0
    all_sample_num = 0
    
    with torch.no_grad():
        model.eval()
        loss_acc=0
        
        for idx, (x_test, y_test) in enumerate(test_loader):
            x_test=x_test.to(DEVICE)
            y_test=y_test.to(DEVICE)
            y_hat_pred ,_= model(x_test)
            loss = criterion(y_hat_pred, y_test) 
            loss_acc += loss.item() * x_train.size(0)

        epoch_loss_test = loss_acc / len(test_loader.dataset)    
        test_loss_arr.append(epoch_loss_test)

        train_acc = get_accuracy(model, train_loader, device=DEVICE)
        valid_acc = get_accuracy(model, test_loader, device=DEVICE)
                
        print(f'Epoch: {current_epoch}\t'
            f'Train loss: {epoch_loss_train:.4f}\t'
            f'Valid loss: {epoch_loss_test:.4f}\t'
            f'Train accuracy: {100 * train_acc:.2f}\t'
            f'Valid accuracy: {100 * valid_acc:.2f}')


    

Epoch: 0	Train loss: 1.2538	Valid loss: 0.1318	Train accuracy: 93.32	Valid accuracy: 93.67
Epoch: 1	Train loss: 0.3491	Valid loss: 0.0792	Train accuracy: 95.96	Valid accuracy: 96.10
Epoch: 2	Train loss: 0.2492	Valid loss: 0.0944	Train accuracy: 95.23	Valid accuracy: 95.27
Epoch: 3	Train loss: 0.2069	Valid loss: 0.0491	Train accuracy: 97.21	Valid accuracy: 97.31
Epoch: 4	Train loss: 0.1827	Valid loss: 0.0422	Train accuracy: 97.64	Valid accuracy: 97.76
Epoch: 5	Train loss: 0.1643	Valid loss: 0.0398	Train accuracy: 97.82	Valid accuracy: 97.76
Epoch: 6	Train loss: 0.1515	Valid loss: 0.0364	Train accuracy: 98.05	Valid accuracy: 97.90
Epoch: 7	Train loss: 0.1384	Valid loss: 0.0377	Train accuracy: 98.11	Valid accuracy: 98.11
Epoch: 8	Train loss: 0.1323	Valid loss: 0.0323	Train accuracy: 98.14	Valid accuracy: 98.13
Epoch: 9	Train loss: 0.1278	Valid loss: 0.0268	Train accuracy: 98.41	Valid accuracy: 98.35


5.Now train the network using both batch normalization and dropout. How does the performance (test
accuracy) of the network compare with the cases with dropout alone and with batch normalization
alone?

ans:<br>
As seen below in the output the performance is a little bit worse than just using batch norm but similar to dropout.

In [19]:
class LeNetDropOutBatch(nn.Module):

    def __init__(self, n_classes):
        super(LeNetDropOutBatch, self).__init__()
        
        self.feature_extractor = nn.Sequential(            
        nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
        nn.BatchNorm2d(6),
        nn.Dropout(p=0.2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2),
        nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
        nn.BatchNorm2d(16),
        nn.Dropout(p=0.5),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2, stride = 2))
        

        self.classifier = nn.Sequential(
        nn.Linear(400, 120),
        nn.BatchNorm1d(120),
        nn.Dropout(p=0.5),
        nn.ReLU(),
        nn.Linear(120, 84),
        nn.BatchNorm1d(84),
        nn.Dropout(p=0.5),
        nn.ReLU(),
        nn.Linear(84, n_classes)
        )


    def forward(self, x):
        x = self.feature_extractor(x)
        x = x.reshape(x.size(0), -1)
        logits = self.classifier(x)
        probs = F.softmax(logits, dim=1)
        return logits, probs

In [20]:
model = LeNetDropOutBatch(10).to(DEVICE)
optimizer = SGD(model.parameters(), lr=1e-1)
criterion = CrossEntropyLoss()
all_epoch = 10

train_loss_arr=[]
test_loss_arr=[]
for current_epoch in range(all_epoch):
    model.train()
    loss_acc=0
    for idx, (x_train, y_train) in enumerate(train_loader):
        optimizer.zero_grad()
        x_train=x_train.to(DEVICE)
        y_train = y_train.to(DEVICE)
        
        #forward
        y_hat_pred ,_= model(x_train)
        loss = criterion(y_hat_pred, y_train)
        loss_acc += loss.item() * x_train.size(0)

              
        #backward
        loss.backward()
        optimizer.step()
        epoch_loss_train = loss_acc / len(train_loader.dataset)
        train_loss_arr.append(epoch_loss_train)
    
    
    
    
    all_correct_num = 0
    all_sample_num = 0
    
    with torch.no_grad():
        model.eval()
        loss_acc=0
        
        for idx, (x_test, y_test) in enumerate(test_loader):
            x_test=x_test.to(DEVICE)
            y_test=y_test.to(DEVICE)
            y_hat_pred ,_= model(x_test)
            loss = criterion(y_hat_pred, y_test) 
            loss_acc += loss.item() * x_train.size(0)

        epoch_loss_test = loss_acc / len(test_loader.dataset)    
        test_loss_arr.append(epoch_loss_test)

        train_acc = get_accuracy(model, train_loader, device=DEVICE)
        valid_acc = get_accuracy(model, test_loader, device=DEVICE)
                
        print(f'Epoch: {current_epoch}\t'
            f'Train loss: {epoch_loss_train:.4f}\t'
            f'Valid loss: {epoch_loss_test:.4f}\t'
            f'Train accuracy: {100 * train_acc:.2f}\t'
            f'Valid accuracy: {100 * valid_acc:.2f}')


    

Epoch: 0	Train loss: 0.6396	Valid loss: 0.1017	Train accuracy: 94.70	Valid accuracy: 95.56
Epoch: 1	Train loss: 0.2359	Valid loss: 0.0660	Train accuracy: 95.98	Valid accuracy: 96.48
Epoch: 2	Train loss: 0.1752	Valid loss: 0.0568	Train accuracy: 96.20	Valid accuracy: 96.47
Epoch: 3	Train loss: 0.1501	Valid loss: 0.0452	Train accuracy: 96.94	Valid accuracy: 97.22
Epoch: 4	Train loss: 0.1327	Valid loss: 0.0381	Train accuracy: 97.37	Valid accuracy: 97.64
Epoch: 5	Train loss: 0.1197	Valid loss: 0.0359	Train accuracy: 97.57	Valid accuracy: 97.79
Epoch: 6	Train loss: 0.1092	Valid loss: 0.0297	Train accuracy: 98.01	Valid accuracy: 98.08
Epoch: 7	Train loss: 0.1047	Valid loss: 0.0319	Train accuracy: 97.83	Valid accuracy: 98.01
Epoch: 8	Train loss: 0.0959	Valid loss: 0.0255	Train accuracy: 98.29	Valid accuracy: 98.46
Epoch: 9	Train loss: 0.0945	Valid loss: 0.0295	Train accuracy: 97.92	Valid accuracy: 98.02


## Problem 3

1. Calculate the number of parameters in Alexnet. You will have to show calculations for each layer and
then sum it to obtain the total number of parameters in Alexnet. When calculating you will need to
account for all the filters (size, strides, padding) at each layer. Look at Sec. 3.5 and Figure 2 in Alexnet
paper (see reference). Points will only be given when explicit calculations are shown for each layer.

Ans:
O = Size (width) of output image.<br>
I = Size (width) of input image.<br>
K = Size (width) of kernels used in the Conv Layer.<br>
N = Number of kernels.<br>
S = Stride of the convolution operation.<br>
P = Padding.<br>

For conv layer output,<br>
Output = $\frac{I-  K + 2P}{S} + 1$ <br>
for Maxpool,<br>
Output = (I-Pool size)/S +1<br>


For conv layer and fully connected layer,
W_c = Number of weights of the Conv Layer.<br>
B_c = Number of biases of the Conv Layer.<br>
P_c = Number of parameters of the Conv Layer.<br>
K = Size (width) of kernels used in the Conv Layer.<br>
N = Number of kernels.<br>
C = Number of channels of the input image.<br>

\begin{align*}W_c &= K^2 \times C \times N \\B_c &= N \\P_c &= W_c + B_c\end{align*}









Input = (227 -  11 + 2(0))/4 + 1 = 55 <br>
conv1: (11)^2(3)(96)+96 = 34944 parameters <br>
Output = (55 - 3)/2+1=27 <br>
pool1: 0 parameters; size: 27 x 27 x 96 <br>
Output = (27 -  5 + 2(2))/1 + 1 = 27 <br>
conv2: (5)^2(96)(256)+256=614656 parameters <br>
Output = (27 - 3)/2+1=13<br>
pool2: 0 parameters; size: 13 x 13 x 256 <br>
Output = (13 -  3 + 2(1))/1 + 1 = 13 <br>
conv3: (3)^2(256)(384)+384=885120 parameters <br>
Output = (13 -  3 + 2(1))/1 + 1 = 13<br>
conv4: 3)^2(384)(384)+384=1327488 parameters<br>
Output = (13 -  3 + 2(1))/1 + 1 = 13 <br>
conv5: 3)^2(384)(256)+256=884992 parameters<br>
Output = (13 - 3)/2+1=6 <br>
pool5: 0 parameters; size: 6 x 6 x 256 <br>
FC: (1)^2(9216)(4096)+4096=37752832 parameters<br>
FC: (1)^2(4096)(4096)+4096=16781312 parameters<br>
FC: (1)^2(4096)(1000)+1000=4097000 parameters<br>
If we sum all the above parameters , we get the total,
Total: 62,378,344 parameters


2. VGG (Simonyan et al.) has an extremely homogeneous architecture that only performs 3x3 convolutions
with stride 1 and pad 1 and 2x2 max pooling with stride 2 (and no padding) from the beginning to
the end. However VGGNet is very expensive to evaluate and uses a lot more memory and parameters.
Refer to VGG19 architecture on page 3 in Table 1 of the paper by Simonyan et al. You need to complete
Table 1 below for calculating activation units and parameters at each layer in VGG19 (without counting
biases). Its been partially filled for you. 

Ans:

INPUT: [224x224x3]                memory:  224*224*3=150K                    parameter(compute): 0<br>
CONV3-64: [224x224x64]          memory:  224*224*64=3.2M                    parameter(compute): (3*3*3)*64 = 1,728<br>
CONV3-64: [224x224x64]          memory:  224*224*64=3.2M                    parameter(compute): (3*3*64)*64 = 36,864<br>
POOL2: [112x112x64]          memory:  112*112*64=800K                        parameter(compute): 0<br>
CONV3-128: [112x112x128        ]  memory:  112*112*128=1.6M                    parameter(compute): (3*3*64)*128 = 73,728<br>
CONV3-128: [112x112x128        ]  memory:  112*112*128=1.6M                    parameter(compute): (3*3*128)*128 = 147,456<br>
POOL2: [56x56x128]          memory:  56*56*128=400K                            parameter(compute): 0<br>
CONV3-256: [56x56x256]          memory:  56*56*256=800K                         parameter(compute): (3*3*128)*256 = 294,912<br>
CONV3-256: [56x56x256]          memory:  56*56*256=800K                         parameter(compute): (3*3*256)*256 = 589,824<br>
CONV3-256: [56x56x256]          memory:  56*56*256=800K                        parameter(compute): (3*3*256)*256 = 589,824<br>
POOL2: [28x28x256]          memory:  28*28*256=200K                            parameter(compute): 0<br>
CONV3-512: [28x28x512]          memory:  28*28*512=400K                        parameter(compute): (3*3*256)*512 = 1,179,648<br>
CONV3-512: [28x28x512]          memory:  28*28*512=400K                        parameter(compute): (3*3*512)*512 = 2,359,296<br>
CONV3-512: [28x28x512]          memory:  28*28*512=400K                       parameter(compute): (3*3*512)*512 = 2,359,296<br>
POOL2: [14x14x512]          memory:  14*14*512=100K                             parameter(compute): 0<br>
CONV3-512: [14x14x512]          memory:  14*14*512=100K                     parameter(compute): (3*3*512)*512 = 2,359,296<br>
CONV3-512: [14x14x512]          memory:  14*14*512=100K                         parameter(compute): (3*3*512)*512 = 2,359,296<br>
CONV3-512: [14x14x512]          memory:  14*14*512=100K                         parameter(compute): (3*3*512)*512 = 2,359,296<br>
POOL2: [7x7x512]            memory:  7*7*512=25K                                  parameter(compute): 0<br>
FC: [1x1x4096]          memory:  4096                                          parameter(compute): 7*7*512*4096 = 102,760,448<br>
FC: [1x1x4096]          memory:  4096                                       parameter(compute): 4096*4096 = 16,777,216<br>
FC: [1x1x1000]          memory:  1000                                          parameter(compute): 4096*1000 = 4,096,000<br>

total          memory: 24M (approx)                                                     params: 138M parameters<br>


3. VGG architectures have smaller filters but deeper networks compared to Alexnet (3x3 compared to
11x11 or 5x5). Show that a stack of N convolution layers each of filter size F × F has the same
receptive field as one convolution layer with filter of size (NF − N + 1) × (NF − N + 1). Use this to
calculate the receptive field of 3 filters of size 5x5.

If we consider a stack of N layers of convolution of filter size F*F:<br>
Shape of kernel(receptive field)= S(shape)-F+1,<br>
For N layers, this becomes = S-N*(F+1)<br>
S-NF-N<br>
<br>
For one layer with filter (NF − N + 1)<br>
receptive field = S-(NF − N + 1)+1 = S-NF-N<br>
<br>
which is the same.<br>




4. The original Googlenet paper (Szegedy et al.) proposes two architectures for Inception module, shown
in Figure 2 on page 5 of the paper, referred to as naive and dimensionality reduction respectively.<br><br>
(a) What is the general idea behind designing an inception module (parallel convolutional filters of
different sizes with a pooling followed by concatenation) in a convolutional neural network ? (2)<br>
(b) Assuming the input to inception module (referred to as ”previous layer” in Figure 2 of the paper) has size 32x32x256, calculate the output size after filter concatenation for the naive and
dimensionality reduction inception architectures with number of filters given in Figure 1. (3)<br>
(c) Next calculate the total number of convolutional operations for each of the two inception architecture again assuming the input to the module has dimensions 32x32x256 and number of filters
given in Figure 1. (3)<br>
(d) Based on the calculations in part (c) explain the problem with naive architecture and how dimensionality reduction architecture helps (Hint: compare computational complexity). How much is the
computational saving ? (2+2)<br>

Ans 4a.<br>
In the inception module instead of using sparse matrices in the kernels for more computations , we convert them to a more dense format . This can effectively take advantage of more efficient in-built matrix-multiplication routines. The idea is to make computations more efficient by increasing the effective network size.



Ans 4b:<br>
Output size after filter concatenation,<br>
For the Naive: 32* 32 *(128+192+96+256) =688128<br>
For Dimensinality Reduction inception architecture: 32 * 32 * (128+192+96+64) =491520
So we can see that for the dimentionality reduction arch. the output size is smaller.


Ans 4c:<br>
Total no of Convolution ops:<br>
<br>
For the Naive implementation: <br>
Conv1 = 32 * 32 * 1 * 256 * 128 = 33554432 <br>
Conv3 = 32 * 32 * 9 * 256 * 128 = 301989888 <br>
Conv5 = 32 * 32 * 25 * 256 * 128 = 838860800 <br>
Total = 1174405120 <br>
For Dimensinality Reduction inception architecture: <br>
Conv1 = (32 * 32 * 256 * 128) + (32 * 32 * 256 * 128) + (32 * 32 * 256 * 64) = 92274688 <br>
Conv3 = 32 * 32 * 9 * 128 * 192 = 226492416 <br>
Conv5 = 32 * 32 * 25 * 32 * 96 = 78643200 <br>
Total = 397410304 <br>

Ans 4d:<br>
We can easily see that the total for the naive implementation vs the total for the dimensionality reduction arch is:<br>
1174405120 / 397410304= 2.95514511873