### AlexNet

![](alexnet.png)


#### Same Convolution

* A Same Convolution is a type of convolution where the output matrix is of the same dimension as the input matrix.
* For an nxn input matrix A and an fxf filter matrix F, the output of the convolution A*F is of dimension 
$$\left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\right) \text{ x } \left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\right)$$
s = stride   
p = padding  
* For a same convolution:
    - s = 1,  
    - p = $\frac{f - 1}{2}$, and   
    - f is an odd number
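
The conditions above can be checked numerically. The following is a minimal sketch, assuming the usual output-size formula floor((n + 2p - f)/s) + 1; the helper names `conv_output_size` and `same_padding` are hypothetical and not part of any library.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution over an n x n input
    with an f x f filter, padding p and stride s."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the output size equal to the input size
    when the stride is 1. Only defined for odd filter sizes."""
    if f % 2 == 0:
        raise ValueError("a same convolution needs an odd filter size")
    return (f - 1) // 2


# With s = 1 and p = (f - 1) / 2 the size is preserved for any odd f.
for f in (1, 3, 5, 7):
    p = same_padding(f)
    print(f"f={f}, p={p}: 32 -> {conv_output_size(32, f, p=p, s=1)}")
```

An even filter size has no symmetric padding that preserves the size, which is why the sketch raises an error in that case.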

### VGG-16

* Karen Simonyan and Andrew Zisserman (2014). Visual Geometry Group Lab of Oxford University

![](VGG.png)

* ~138 Million parameters
* 3x3 filters
* Stride = 1
* All convolutions are same convolutions
* Number of filters 64->128->256->512
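
The ~138 million figure can be reproduced with a short count. This is a sketch under stated assumptions that the notes above do not spell out: the layer layout is configuration D of the paper (13 convolutional and 3 fully connected layers), the input is 224x224x3, and there are 1000 output classes. The names `VGG16_CONV`, `VGG16_FC` and `vgg16_param_count` are hypothetical.

```python
# Assumed layout: numbers are output channels of 3x3 same convolutions,
# "M" marks a 2x2 max-pooling layer that halves the spatial size.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]  # assumed fully connected layer widths


def vgg16_param_count(in_channels=3, image_size=224):
    total = 0
    channels, size = in_channels, image_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # pooling has no parameters
        else:
            # (3*3*channels_in + 1 bias) weights per output channel
            total += (3 * 3 * channels + 1) * layer
            channels = layer
    features = channels * size * size  # flattened feature map
    for units in VGG16_FC:
        total += (features + 1) * units
        features = units
    return total


total_params = vgg16_param_count()
print(f"{total_params:,}")  # 138,357,544
```

Most of the parameters sit in the first fully connected layer, not in the convolutions.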

## ResNet

* He, Zhang, Ren, and Sun (2015). [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)

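* Why doesn't adding more layers improve training and test error?
    - Weight decay, small random initialization, and L2 regularization push the weights toward zero, so the added layers tend to learn F(x) = 0 rather than the identity mapping needed to at least match the shallower network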


![](MoreLayers.png)

### Residual Blocks

* Solution: Learn F(x) + x 

![](ResBlock.png)

<div style="font-size: 115%;">
$$a^{l+2} = ReLU((W^{l+2}\cdot{a^{l+1}}+b^{l+2}) + a^l)$$
</div>

* If the weights and bias are 0 (because of weight decay, small random initialization, L2 regularization), then, since $a^l \geq 0$ as the output of an earlier ReLU:

<div style="font-size: 115%;">
$$a^{l+2} = ReLU(a^l) = a^l$$
</div>

* The Residual block learns the identity function
* Called a "skip" or "short cut" connection
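
The identity behaviour can be illustrated with plain Python lists. This is a minimal sketch, not the notebook's model: it assumes fully connected layers, and it assumes the intermediate activation is $a^{l+1} = ReLU(W^{l+1}\cdot a^l + b^{l+1})$, which the formula above does not state explicitly. The helpers `relu`, `affine` and `residual_block` are hypothetical names.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def affine(W, b, v):
    """Compute W.v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    a_l1 = relu(affine(W1, b1, a_l))   # assumed form of layer l+1
    z_l2 = affine(W2, b2, a_l1)        # pre-activation of layer l+2
    # Skip connection: add the block input before the final ReLU.
    return relu([z + a for z, a in zip(z_l2, a_l)])


# a_l is itself a ReLU output, so it is non-negative.
a_l = relu([0.5, -1.0, 2.0])
zero_W = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

out = residual_block(a_l, zero_W, zero_b, zero_W, zero_b)
print(out)  # [0.5, 0.0, 2.0]
```

With all weights and biases at zero the block passes its input through unchanged, so adding the block cannot make the network worse than the shallower one.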

#### ResNet Architecture

![](ResNet.png)

### 1x1 Convolutions

* Dotted lines in the ResNet architecture mark where the number of channels increases
* A 1x1 convolution projects to a different number of channels (here, a higher one) without changing the spatial size of the feature map
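
A 1x1 convolution is the same small linear map applied independently at every pixel. The sketch below uses nested Python lists to make that explicit; the function `conv1x1` and the toy values are hypothetical and only meant as an illustration, not as an efficient implementation.

```python
def conv1x1(feature_map, weights, biases):
    """Apply a 1x1 convolution.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in (one row per output channel)
    biases:      C_out values
    """
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) + b
             for w_row, b in zip(weights, biases)]
            for pixel in row
        ]
        for row in feature_map
    ]


# 2 x 2 feature map with 2 channels, projected to 3 channels.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
weights = [[1.0, 0.0],   # copies channel 0
           [0.0, 1.0],   # copies channel 1
           [1.0, 1.0]]   # sums both channels
biases = [0.0, 0.0, 0.0]

projected = conv1x1(feature_map, weights, biases)
print(len(projected), len(projected[0]), len(projected[0][0]))  # 2 2 3
```

The height and width stay at 2 x 2 while the channel count changes from 2 to 3.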

In [1]:
def num_params(n,m,l,k):
    '''Number of parameters conv2d_2
        n,m = shape of kernel
        l = number of inputs
        k = number of outputs
        num_param = (n*m*l+1)*k'''
    print(f'(Kernel=({n}x{m}) * num_in={l} + 1)) * num_out={k} = {(n*m*l+1)*k}')
    return 

In [2]:
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
tf.__version__

'2.3.1'

In [4]:
# create model
model = Sequential()
model.add(Conv2D(512, (3,3), padding='same', 
                 activation='relu', input_shape=(256, 256, 3)))
# summarize model
num_params(3,3,3,512)
model.summary()

(Kernel=(3x3) * num_in=3 + 1)) * num_out=512 = 14336
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 256, 256, 512)     14336     
Total params: 14,336
Trainable params: 14,336
Non-trainable params: 0
_________________________________________________________________



#### 1x1, number of channels in equals number of channels out

* Size of feature map doesn't change

In [29]:
model.add(Conv2D(512, (1,1), activation='relu'))
num_params(1,1,512,512)
model.summary()

(Kernel=(1x1) * num_in=512 + 1)) * num_out=512 = 262656
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_12 (Conv2D)           (None, 256, 256, 512)     14336     
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 256, 256, 512)     262656    
Total params: 276,992
Trainable params: 276,992
Non-trainable params: 0
_________________________________________________________________


#### 1x1, number of channels decreases

In [30]:
model.add(Conv2D(64, (1,1), activation='relu'))
num_params(1,1,512,64)
model.summary()

(Kernel=(1x1) * num_in=512 + 1)) * num_out=64 = 32832
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_12 (Conv2D)           (None, 256, 256, 512)     14336     
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 256, 256, 512)     262656    
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 256, 256, 64)      32832     
Total params: 309,824
Trainable params: 309,824
Non-trainable params: 0
_________________________________________________________________


#### 1x1, number of channels increases

In [31]:
model.add(Conv2D(512, (1,1), activation='relu'))
# summarize model
num_params(1,1,64,512)
model.summary()

(Kernel=(1x1) * num_in=64 + 1)) * num_out=512 = 33280
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_12 (Conv2D)           (None, 256, 256, 512)     14336     
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 256, 256, 512)     262656    
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 256, 256, 64)      32832     
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 256, 256, 512)     33280     
Total params: 343,104
Trainable params: 343,104
Non-trainable params: 0
_________________________________________________________________


####  Use 1x1 convolution to reduce number of channels before applying larger convolution

input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)

input (256 depth) -> 4x4 convolution (256 depth)

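The bottom path (no 1x1 reduction) is ~3.7 times slower: it needs about 3.7 times as many multiplications per output position.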
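
The ratio can be checked by counting multiplications per output position for the two paths. This is a rough sketch that ignores biases, activations, and memory effects, so it estimates arithmetic cost rather than measured run time; `conv_mults` is a hypothetical helper.

```python
def conv_mults(kernel, c_in, c_out):
    """Multiplications needed per output position of a
    kernel x kernel convolution from c_in to c_out channels."""
    return kernel * kernel * c_in * c_out


# Top path: 1x1 reduction to 64 channels, then 4x4 convolution to 256.
bottleneck = conv_mults(1, 256, 64) + conv_mults(4, 64, 256)

# Bottom path: 4x4 convolution applied directly on 256 channels.
direct = conv_mults(4, 256, 256)

ratio = direct / bottleneck
print(bottleneck, direct, round(ratio, 2))  # 278528 1048576 3.76
```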

## ResNet Model

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchsummary import summary
torch.__version__

'1.5.0'

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [5]:
class Residual(nn.Module):
  """Residual block: two 3x3 convolutions plus a skip connection."""
  
  def __init__(self,input_channels, num_channels, use_1x1conv=False, strides=1, **kwargs):
    super(Residual, self).__init__(**kwargs)
    self.conv1 = nn.Conv2d(input_channels, num_channels,kernel_size=3, padding=1, stride=strides)
    self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
    # Optional 1x1 convolution on the skip path so that X matches Y
    # in channel count and spatial size when they differ
    if use_1x1conv:
      self.conv3 = nn.Conv2d(input_channels, num_channels, kernel_size=1, stride=strides)
    else:
      self.conv3 = None
    self.bn1 = nn.BatchNorm2d(num_channels)
    self.bn2 = nn.BatchNorm2d(num_channels)
    self.relu = nn.ReLU(inplace=True)
  
  def forward(self, X):
    Y = self.relu(self.bn1(self.conv1(X)))
    Y = self.bn2(self.conv2(Y))
    if self.conv3 is not None:
      X = self.conv3(X)
    Y += X  # skip connection: F(X) + X
    Y = self.relu(Y)
    return Y

In [6]:
def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
  blk = []
  for i in range(num_residuals):
    if i == 0 and not first_block:
      blk.append(Residual(input_channels, num_channels, use_1x1conv=True, strides=2))
    else:
      blk.append(Residual(num_channels, num_channels))
  return blk

In [7]:
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                    nn.BatchNorm2d(64),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [8]:
b2=nn.Sequential(*resnet_block(64,64,2,first_block=True))
b3=nn.Sequential(*resnet_block(64,128,2))
b4=nn.Sequential(*resnet_block(128,256,2))
b5=nn.Sequential(*resnet_block(256,512,2))
net=nn.Sequential(b1,
                  b2,b3,b4,b5,
                  nn.AdaptiveMaxPool2d((1,1)),
                  nn.Flatten(),
                  nn.Linear(512, 10))
net.to(device)
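
The spatial sizes that a 28x28 input takes on its way through this network can be traced by hand with the output-size formula floor((n + 2p - f)/s) + 1. The sketch below is only a size calculation and does not build the model; the helper `out_size` and the stage labels are hypothetical, and each stage lists only the layer that changes the size (the stride-2 3x3 convolution at the start of b3, b4 and b5).

```python
def out_size(n, f, p, s):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - f) // s + 1


# (label, kernel size, padding, stride) of the size-changing layer per stage
stages = [
    ("b1 conv 7x7", 7, 3, 2),
    ("b1 maxpool 3x3", 3, 1, 2),
    ("b2 (stride 1)", 3, 1, 1),
    ("b3 (stride 2)", 3, 1, 2),
    ("b4 (stride 2)", 3, 1, 2),
    ("b5 (stride 2)", 3, 1, 2),
]

size = 28  # Fashion-MNIST images are 28 x 28
sizes = []
for label, f, p, s in stages:
    size = out_size(size, f, p, s)
    sizes.append(size)
    print(f"{label:>15}: {size} x {size}")

print(sizes)  # [14, 7, 7, 4, 2, 1]
```

By the end of b5 the feature map is already 1x1, so the adaptive pooling layer has nothing left to reduce for inputs of this size.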

In [9]:
summary(net,(1,28,28))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 64, 14, 14]           3,200
       BatchNorm2d-2           [-1, 64, 14, 14]             128
              ReLU-3           [-1, 64, 14, 14]               0
         MaxPool2d-4             [-1, 64, 7, 7]               0
            Conv2d-5             [-1, 64, 7, 7]          36,928
       BatchNorm2d-6             [-1, 64, 7, 7]             128
              ReLU-7             [-1, 64, 7, 7]               0
            Conv2d-8             [-1, 64, 7, 7]          36,928
       BatchNorm2d-9             [-1, 64, 7, 7]             128
             ReLU-10             [-1, 64, 7, 7]               0
         Residual-11             [-1, 64, 7, 7]               0
           Conv2d-12             [-1, 64, 7, 7]          36,928
      BatchNorm2d-13             [-1, 64, 7, 7]             128
             ReLU-14             [-1, 6

In [4]:
def init_weights(m):
    # Xavier (Glorot) uniform initialization for linear and convolutional layers
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        torch.nn.init.xavier_uniform_(m.weight)
        
def evaluate_accuracy(data_iter, net, device):
    """Evaluate accuracy of a model"""
    net.eval()  # Switch to evaluation mode for Dropout, BatchNorm etc layers.
    acc_sum, n = torch.tensor([0], dtype=torch.float32, device=device), 0
    for X, y in data_iter:
        # Copy the data to device.
        X, y = X.to(device), y.to(device)
        with torch.no_grad():
            y = y.long()
            acc_sum += torch.sum((torch.argmax(net(X), dim=1) == y))
            n += y.shape[0]
    return acc_sum.item()/n

import time
def train_resnet(net, train_iter, test_iter, num_epochs, batch_size, device, lr=1e-3):
    print('training on', device)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        net.train() # Switch to training mode
        n, start = 0, time.time()
        train_l_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        train_acc_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        for X, y in train_iter:
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device) 
            y_hat = net(X) # Forward
            loss = criterion(y_hat, y)
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                y = y.long()
                train_l_sum += loss.float()
                train_acc_sum += (torch.sum((torch.argmax(y_hat, dim=1) == y))).float()
                n += y.shape[0]

        test_acc = evaluate_accuracy(test_iter, net, device) 
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'\
            % (epoch + 1, train_l_sum/n, train_acc_sum/n, test_acc, time.time() - start))


In [9]:
train_dataset = torchvision.datasets.FashionMNIST(
    root='.',
    train=True,
    transform=transforms.ToTensor(),
    download=True)
test_dataset = torchvision.datasets.FashionMNIST(
    root='.',
    train=False,
    transform=transforms.ToTensor(),
    download=True)

batch_size = 256
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

In [10]:
lr, num_epochs, batch_size = 0.05, 5, 256
net.apply(init_weights)
train_resnet(net, train_loader, test_loader, num_epochs, batch_size, device, lr)

training on cpu
epoch 1, loss 0.0081, train acc 0.632, test acc 0.440, time 674.2 sec
epoch 2, loss 0.0022, train acc 0.796, test acc 0.800, time 672.7 sec
epoch 3, loss 0.0017, train acc 0.834, test acc 0.834, time 674.7 sec
epoch 4, loss 0.0016, train acc 0.849, test acc 0.856, time 678.6 sec
epoch 5, loss 0.0014, train acc 0.863, test acc 0.859, time 675.6 sec


### Batch Normalization

* Ioffe and Szegedy (2015) [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)

* In each training iteration, BN normalizes the activations of each hidden layer node (on each layer where it is applied) by subtracting its mean and dividing by its standard deviation, both estimated from the current minibatch.

* Batch Normalization transforms the activation at a given layer from $\mathbf{a}$ to

$$\mathrm{BN}(\mathbf{a}) = \mathbf{\gamma} \odot \frac{\mathbf{a} - \hat{\mathbf{\mu}}}{\hat{\mathbf{\sigma}}} + \mathbf{\beta}$$

Where:  
$\hat{\mathbf{\mu}}$ is the estimate of the mean  
$\hat{\mathbf{\sigma}}$ is the estimate of the standard deviation   
$\mathbf{\gamma}$ is a coordinate-wise scaling coefficient   
$\mathbf{\beta}$ is an offset  
$\odot$ is elementwise multiplication

* For convolutional layers, batch normalization occurs after the convolution computation and before the application of the activation function.
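
The transformation can be sketched for a small minibatch in plain Python. This is a training-time illustration only, under two assumptions not stated above: a small constant `eps` is added to the variance for numerical stability, and the statistics are computed per feature over the batch dimension. The function name `batch_norm` is hypothetical.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize a minibatch given as a list of rows (one row per example).

    gamma, beta: per-feature scale and offset.
    eps: assumed stability constant, not part of the formula above.
    """
    n = len(batch)
    num_features = len(batch[0])
    # Per-feature mean and variance estimated from the current minibatch.
    mu = [sum(row[j] for row in batch) / n for j in range(num_features)]
    var = [sum((row[j] - mu[j]) ** 2 for row in batch) / n
           for j in range(num_features)]
    sigma = [math.sqrt(v + eps) for v in var]
    return [
        [gamma[j] * (row[j] - mu[j]) / sigma[j] + beta[j]
         for j in range(num_features)]
        for row in batch
    ]


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(x, 3) for x in row])
```

Both features end up on the same scale even though the second one is ten times larger in the input, which is the effect the normalization is after.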

### References

Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, DiveIntoDeepLearning

Andrew Ng, DeepLearning.AI

Jason Brownlee, A Gentle Introduction to 1×1 Convolutions to Manage Model Complexity