### What's wrong with Linear
* 4 hidden layers: \[784,256,256,256,256,10\]
    * ~390K parameters
    * ~1.6MB memory at 4 bytes per parameter
    * a heavy load for 80386-era hardware
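
The parameter and memory figures above can be reproduced with a quick count. This is a minimal sketch that assumes four 256-wide hidden layers (sizes 784, 256, 256, 256, 256, 10), 4-byte float weights, and "K" read as 1024; the variable names are illustrative only.

```python
# Assumed layer sizes: input 784, four hidden layers of width 256, output 10.
sizes = [784, 256, 256, 256, 256, 10]

# Each fully connected layer holds one weight per (input, output) pair.
weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
# Each layer also holds one bias per output unit.
biases = sum(sizes[1:])

print(weights)             # 399872
print(weights / 1024)      # 390.5, i.e. roughly 390K
print(weights * 4 / 1e6)   # about 1.6 (MB) at 4 bytes per weight
print(biases)              # 1034, negligible next to the weights
```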

## Convolution
* Receptive Field
* Weight sharing
    * e.g. LeNet-5
    * ~60K parameters
    * 6 Layers
* Convolution Operation
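
To make the effect of weight sharing concrete, the sketch below compares the parameter count of one convolution layer with a fully connected layer that produces an output of the same size. The shapes (a 32*32 single-channel input, six 5*5 kernels, a 28*28 output map) are assumptions chosen to resemble a LeNet-5-style first layer, and the helper names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size):
    # Each kernel has in_channels * k * k weights plus one bias,
    # and the same kernel is reused at every spatial position.
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def linear_params(n_in, n_out):
    # A fully connected layer has a separate weight for every input/output pair.
    return n_in * n_out + n_out


shared = conv_params(1, 6, 5)                     # convolution with weight sharing
dense = linear_params(32 * 32, 6 * 28 * 28)       # dense layer with the same output size

print(shared)   # 156
print(dense)    # 4821600
```

The convolution count does not depend on the image size at all, which is why a small network such as LeNet-5 can stay in the tens of thousands of parameters.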


### Notation
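* Input_channels: number of channels in the input, e.g. 3 for RGB
* Kernel_channels: number of kernels, which equals the number of output channels
* Kernel_size: spatial size of each kernel, e.g. 3*3
* Stride: number of pixels the kernel moves per step
* Padding: number of zeros added on each side of the input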

e.g. with 3*3 kernels, stride 1 and padding 1:

* x: \[b,3,28,28\]
* one k: \[3,3,3\]
* multi-k: \[16,3,3,3\]
* bias: \[16\]
* out: \[b,16,28,28\]
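
The spatial output size follows from kernel size, stride and padding as `floor((n + 2*padding - kernel_size) / stride) + 1` (ignoring dilation). A small sketch of that rule, with a hypothetical helper name `conv_out_size`:

```python
def conv_out_size(n, kernel_size, stride=1, padding=0):
    # Standard output-size rule for a convolution without dilation.
    return (n + 2 * padding - kernel_size) // stride + 1


print(conv_out_size(28, 3, stride=1, padding=0))   # 26
print(conv_out_size(28, 3, stride=1, padding=1))   # 28
print(conv_out_size(28, 3, stride=2, padding=1))   # 14
```

With a 3*3 kernel, stride 1 and padding 1 the input size is preserved, which matches the `[b,16,28,28]` output listed above.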


In [1]:
import torch

In [2]:
layer = torch.nn.Conv2d(1,3,kernel_size=3,stride=1,padding=0)
x = torch.rand(1,1,28,28)
out = layer.forward(x)
out.shape

torch.Size([1, 3, 26, 26])

In [4]:
layer = torch.nn.Conv2d(1,3,kernel_size=3,stride=1,padding=1)
out = layer.forward(x)
out.shape

torch.Size([1, 3, 28, 28])

In [5]:
layer = torch.nn.Conv2d(1,3,kernel_size=3,stride=2,padding=1)
out = layer.forward(x)
out.shape

torch.Size([1, 3, 14, 14])

In [7]:
out = layer(x) # preferred over layer.forward(x): __call__ also runs registered hooks
out.shape

torch.Size([1, 3, 14, 14])
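
The difference between `layer(x)` and `layer.forward(x)` can be observed with a forward hook. The following is a minimal sketch; the hook function `record_shape` and the list `calls` are illustrative names, not part of the original notebook.

```python
import torch

layer = torch.nn.Conv2d(1, 3, kernel_size=3, stride=2, padding=1)
x = torch.rand(1, 1, 28, 28)

calls = []  # output shapes recorded each time the hook fires


def record_shape(module, inputs, output):
    calls.append(tuple(output.shape))


handle = layer.register_forward_hook(record_shape)

layer(x)           # goes through __call__, so the hook fires
layer.forward(x)   # bypasses __call__, so the hook does not fire
handle.remove()

print(calls)       # [(1, 3, 14, 14)]
```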

In [10]:
layer.weight

Parameter containing:
tensor([[[[-0.0548,  0.3053, -0.0196],
          [ 0.0065,  0.1361, -0.1108],
          [-0.2680,  0.1364,  0.2705]]],


        [[[ 0.0743,  0.1881,  0.2415],
          [ 0.0644, -0.2529,  0.1107],
          [ 0.1143,  0.0075, -0.1950]]],


        [[[-0.0960, -0.2884, -0.1681],
          [ 0.2984,  0.1852,  0.1816],
          [ 0.2769,  0.2022, -0.0348]]]], requires_grad=True)

In [12]:
layer.weight.shape, layer.bias.shape

(torch.Size([3, 1, 3, 3]), torch.Size([3]))

In [22]:
# lower-level functional API: weights and bias are passed in explicitly
w = torch.rand(16,3,5,5)
b = torch.rand(16)
x = torch.randn(1,3,28,28)
out = torch.nn.functional.conv2d(x,w,b,stride=1,padding=1)
out.shape

torch.Size([1, 16, 26, 26])
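
The module and functional forms compute the same thing; the module simply owns its weight and bias. A short sketch that checks this, assuming the same shapes as the cell above (the seed and variable names are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=1)
x = torch.randn(1, 3, 28, 28)

# Module call: uses the parameters stored on the layer.
out_module = layer(x)
# Functional call: the same parameters are passed in by hand.
out_functional = F.conv2d(
    x, layer.weight, layer.bias, stride=layer.stride, padding=layer.padding
)

print(out_module.shape)                             # torch.Size([1, 16, 26, 26])
print(torch.allclose(out_module, out_functional))   # True
```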

## Pooling
* Downsample
* Upsample
* Max Pooling: take the max over 2*2 windows with stride 2 on the rectified feature map
* Average Pooling: take the mean over each window instead

In [23]:
x = out  # reuse the conv output from the previous cell
layer = torch.nn.MaxPool2d(2,stride=2)
out.shape  # shape before pooling

torch.Size([1, 16, 26, 26])

In [24]:
# max pooling
out = layer(x)
out.shape

torch.Size([1, 16, 13, 13])

In [25]:
# average pooling
out = torch.nn.functional.avg_pool2d(x,2,stride=2)
out.shape

torch.Size([1, 16, 13, 13])
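
Shapes alone do not show what pooling computes, so here is a small sketch on a 4*4 input with known values. The input tensor is made up for illustration.

```python
import torch
import torch.nn.functional as F

# One image, one channel, values 0..15 laid out row by row.
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_out = F.max_pool2d(x, 2, stride=2)   # largest value in each 2*2 window
avg_out = F.avg_pool2d(x, 2, stride=2)   # mean of each 2*2 window

print(max_out[0, 0])   # [[5, 7], [13, 15]]
print(avg_out[0, 0])   # [[2.5, 4.5], [10.5, 12.5]]
```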

In [26]:
# upsample - interpolate
out = torch.nn.functional.interpolate(x,scale_factor=2,mode='nearest')
out.shape

torch.Size([1, 16, 52, 52])

In [27]:
out = torch.nn.functional.interpolate(x,scale_factor=3,mode='nearest')
out.shape

torch.Size([1, 16, 78, 78])
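
With `mode='nearest'`, upsampling simply repeats each value. A minimal sketch on a made-up 2*2 input:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)

# Each input pixel becomes a 2*2 block of identical values.
up = F.interpolate(x, scale_factor=2, mode='nearest')

print(up.shape)   # torch.Size([1, 1, 4, 4])
print(up[0, 0])
# rows: [1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]
```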

## Batch_Norm
> avoid vanishing or exploding gradients, e.g. when sigmoid saturates

* feature scaling

* Batch norm on \[N,C,H*W\], e.g. \[6,3,784\]
    * take one mean per channel **C_i** of **C**, computed over N and H*W
    > C_0 -> mean_0, C_1 -> mean_1, C_2 -> mean_2
    * mini-batch mean -> mini-batch variance -> normalize -> scale and shift
* Layer norm
    * take one mean per sample **N_i** of **N**, computed over C and H*W
    > N_0 -> mean_0, ... , N_5 -> mean_5
* Instance norm
    * take one mean per instance and channel, computed over H*W
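
The three variants differ only in which axes the statistics are reduced over. The sketch below shows those reductions on a `[6,3,784]` tensor and then walks through the four batch norm steps by hand, comparing against `torch.nn.BatchNorm1d` in training mode. The names `gamma` and `beta` for scale and shift are a common convention assumed here, and `eps = 1e-5` is the PyTorch default.

```python
import torch

torch.manual_seed(0)
x = torch.rand(6, 3, 784)            # [N, C, H*W]

batch_mean = x.mean(dim=(0, 2))      # batch norm: one mean per channel, shape [3]
layer_mean = x.mean(dim=(1, 2))      # layer norm: one mean per sample, shape [6]
instance_mean = x.mean(dim=2)        # instance norm: one per sample and channel, shape [6, 3]

# Batch norm by hand: mean -> variance -> normalize -> scale and shift.
eps = 1e-5
mean = batch_mean.view(1, 3, 1)
var = x.var(dim=(0, 2), unbiased=False).view(1, 3, 1)   # biased variance for normalizing
x_hat = (x - mean) / torch.sqrt(var + eps)
gamma = torch.ones(1, 3, 1)          # scale, initialized to 1
beta = torch.zeros(1, 3, 1)          # shift, initialized to 0
manual = gamma * x_hat + beta

reference = torch.nn.BatchNorm1d(3)(x)   # a fresh module is in training mode
print(torch.allclose(manual, reference, atol=1e-5))   # True
```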

### Advantages
* Converge faster
* Better performance
* Robust
    * stable
    * larger learning rate

### Unit
* conv2d + (batch_norm + pool + ReLU)
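
One possible way to assemble such a unit is with `torch.nn.Sequential`. The channel counts, kernel size and ordering below are assumptions for illustration; the order of batch norm, pooling and ReLU varies between architectures.

```python
import torch

# A single conv unit: convolution, then batch norm, pooling and activation.
unit = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # keeps 28*28
    torch.nn.BatchNorm2d(16),                                    # one statistic per channel
    torch.nn.MaxPool2d(2, stride=2),                             # halves height and width
    torch.nn.ReLU(),
)

x = torch.randn(2, 3, 28, 28)
out = unit(x)
print(out.shape)   # torch.Size([2, 16, 14, 14])
```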

In [30]:
# BatchNorm1d
x = torch.rand(100,16,784)
layer = torch.nn.BatchNorm1d(16) # number of channels
out = layer(x)
layer.running_mean, layer.running_var, layer.weight, layer.bias

(tensor([0.0498, 0.0499, 0.0498, 0.0501, 0.0500, 0.0500, 0.0500, 0.0499, 0.0500,
         0.0500, 0.0499, 0.0499, 0.0501, 0.0502, 0.0500, 0.0501]),
 tensor([0.9083, 0.9084, 0.9083, 0.9084, 0.9084, 0.9083, 0.9083, 0.9083, 0.9083,
         0.9083, 0.9083, 0.9083, 0.9083, 0.9083, 0.9084, 0.9083]),
 Parameter containing:
 tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        requires_grad=True))
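
The running statistics shown above are not the batch statistics themselves: they start at mean 0 and variance 1 and move toward the batch values by a factor `momentum` (0.1 by default) on each training-mode forward pass. That is why uniform inputs with mean 0.5 give a running mean near 0.05. A sketch that reproduces the update by hand, with an arbitrary seed:

```python
import torch

torch.manual_seed(0)
x = torch.rand(100, 16, 784)
layer = torch.nn.BatchNorm1d(16)
layer(x)   # one forward pass in training mode updates the running stats

momentum = layer.momentum   # 0.1 by default
# running = (1 - momentum) * old + momentum * batch statistic
expected_mean = (1 - momentum) * torch.zeros(16) + momentum * x.mean(dim=(0, 2))
# The running variance is updated with the unbiased batch variance.
expected_var = (1 - momentum) * torch.ones(16) + momentum * x.var(dim=(0, 2), unbiased=True)

print(torch.allclose(layer.running_mean, expected_mean, atol=1e-5))   # True
print(torch.allclose(layer.running_var, expected_var, atol=1e-5))     # True
```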

In [33]:
# BatchNorm2d
x = torch.randn(4,16,7,7)
layer = torch.nn.BatchNorm2d(16)
out = layer(x)
vars(layer)

{'training': True,
 '_parameters': OrderedDict([('weight', Parameter containing:
               tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
                      requires_grad=True)),
              ('bias',
               Parameter containing:
               tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
                      requires_grad=True))]),
 '_buffers': OrderedDict([('running_mean',
               tensor([-0.0016,  0.0036,  0.0071,  0.0058,  0.0185, -0.0164,  0.0034, -0.0005,
                       -0.0035, -0.0070, -0.0060,  0.0083,  0.0095, -0.0077,  0.0038, -0.0055])),
              ('running_var',
               tensor([0.9921, 0.9902, 0.9935, 0.9979, 1.0072, 0.9997, 0.9847, 0.9753, 1.0047,
                       0.9902, 0.9924, 0.9950, 1.0086, 0.9834, 0.9894, 1.0026])),
              ('num_batches_tracked', tensor(1))]),
 '_non_persistent_buffers_set': set(),
 '_backward_hooks': OrderedDict(),
 ...

In [37]:
layer.eval()
vars(layer)

{'training': False,
 '_parameters': OrderedDict([('weight',
               Parameter containing:
               tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
                      requires_grad=True)),
              ('bias',
               Parameter containing:
               tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
                      requires_grad=True))]),
 '_buffers': OrderedDict([('running_mean',
               tensor([-0.0016,  0.0036,  0.0071,  0.0058,  0.0185, -0.0164,  0.0034, -0.0005,
                       -0.0035, -0.0070, -0.0060,  0.0083,  0.0095, -0.0077,  0.0038, -0.0055])),
              ('running_var',
               tensor([0.9921, 0.9902, 0.9935, 0.9979, 1.0072, 0.9997, 0.9847, 0.9753, 1.0047,
                       0.9902, 0.9924, 0.9950, 1.0086, 0.9834, 0.9894, 1.0026])),
              ('num_batches_tracked', tensor(1))]),
 '_non_persistent_buffers_set': set(),
 '_backward_hooks': OrderedDict(),
 ...
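
The practical effect of `layer.eval()` is that the layer stops updating its running statistics and normalizes with them instead of the batch statistics. A minimal sketch that checks both points, using the same shapes as the cells above and an arbitrary seed:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16, 7, 7)
layer = torch.nn.BatchNorm2d(16)
layer(x)        # training mode: running stats are updated once

layer.eval()    # switch to inference behavior
mean_before = layer.running_mean.clone()
out = layer(x)

# The running mean is left untouched by an eval-mode forward pass.
unchanged = torch.equal(layer.running_mean, mean_before)

# Eval-mode output uses the stored statistics (weight is 1 and bias is 0 at init).
mean = layer.running_mean.view(1, 16, 1, 1)
var = layer.running_var.view(1, 16, 1, 1)
manual = (x - mean) / torch.sqrt(var + layer.eps)

print(unchanged)                                  # True
print(torch.allclose(out, manual, atol=1e-5))     # True
```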