In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

## Define the network

In [2]:
# let's code a simple feed-forward neural network

class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
        
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        # 6 input channels, 16 output channels, 5x5 square convolution kernel
        self.conv2 = nn.Conv2d(6, 16, 5)
        
        # we also need some linear transformations; these are initialised by their in_ and out_features
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 16 output channels of conv2 times the 5x5 feature map left after pooling
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)  
        # 120 and 84 are the hidden layer sizes of the LeNet-style network pictured in the tutorial; 10 is the number of output classes
        
    def forward(self, x):
        # Max pooling over a (2, 2) window
        # => from a 2x2 window take the max value. relu = Rectified Linear unit; relu(x) = max(0,x)
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)  # this function replaced the num_flat_features from the tutorial
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
print(net)

Net(
  (conv1): Conv2d (1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d (6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120)
  (fc2): Linear(in_features=120, out_features=84)
  (fc3): Linear(in_features=84, out_features=10)
)


Since we defined a ```forward``` function, the ```backward``` function (where the gradients are computed) is defined automagically by ```autograd```.

The learnable parameters of a model are returned by ```net.parameters()```:

In [3]:
params = list(net.parameters())
print(params[0])
# NOTE: params[0] is the weight tensor of conv1, with shape 6 x 1 x 5 x 5:
# one 5x5 kernel for each of the 6 output channels and the single input channel.

Parameter containing:
(0 ,0 ,.,.) = 
 -0.0422  0.1326 -0.1497  0.1332 -0.1435
 -0.0981  0.1486 -0.0478  0.1822 -0.1259
 -0.0827 -0.0105 -0.1260 -0.0991 -0.0467
  0.1747  0.1547 -0.0756  0.1781  0.0076
  0.1063 -0.1987 -0.0100  0.0198 -0.0666

(1 ,0 ,.,.) = 
 -0.0246 -0.1099  0.1335 -0.1048 -0.0662
  0.1759 -0.1175 -0.0679 -0.0134  0.0980
 -0.0322  0.1630  0.1893  0.1355 -0.0755
  0.0227  0.1194  0.1236  0.1811  0.1407
 -0.1081  0.1927  0.1280  0.0220 -0.0808

(2 ,0 ,.,.) = 
 -0.0387  0.0119  0.1786 -0.0696 -0.0710
 -0.0696 -0.1785  0.1896 -0.0441  0.0579
 -0.1013  0.1831 -0.1188  0.0047  0.1406
  0.0967 -0.1223  0.1321 -0.0568 -0.1077
  0.1127 -0.1153  0.1979 -0.1258 -0.0429

(3 ,0 ,.,.) = 
 -0.0280  0.1213  0.1458 -0.0799  0.1522
  0.1565 -0.1184 -0.0131 -0.1579  0.1548
  0.0861 -0.1382  0.1314 -0.1928 -0.0272
  0.1141  0.1184 -0.0622  0.0567 -0.1393
 -0.1454  0.1954 -0.1382  0.0198 -0.1250

(4 ,0 ,.,.) = 
  0.0300 -0.1286 -0.1265  0.0209  0.1932
  0.0406  0.0997  0.0256 -0.1047 -0.15
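
To see where each parameter tensor sits, it helps to print names and shapes instead of values. The sketch below is a minimal illustration on a stand-alone layer with the same arguments as ```net.conv1```; it assumes a PyTorch version that provides ```named_parameters()``` and ```.shape``` on parameters. The first index of the weight tensor is the output channel, the second the input channel, which matches the ```(0 ,0 ,.,.)``` labels in the printout.

```python
import torch.nn as nn

# stand-alone layer with the same arguments as net.conv1 (illustration only)
conv1 = nn.Conv2d(1, 6, 5)

# map each parameter name to its shape
shapes = {name: tuple(p.shape) for name, p in conv1.named_parameters()}
print(shapes)  # {'weight': (6, 1, 5, 5), 'bias': (6,)}
```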

The input to the ```forward``` method is an ```autograd.Variable```, and the output is of the same type. 
The CNN expects input pictures of size $32 \times 32$. Why is that? Let's walk through it:

First, let us assume for the moment that we have a $32\times32\times1$ picture (only one colour channel, hence a BW picture). Our first layer ```net.conv1``` is a convolutional layer with a kernel size (KS) of $5$, i.e. $5\times5$ kernels, and $6$ output channels, which means $6$ kernels.

The default values in ```torch.nn.Conv2d``` for ```stride``` and ```padding``` are $1$ and $0$ respectively. Since we didn't set any padding ourselves, the kernel cannot be centred on the border pixels, so the output is smaller than the input. The output width is given by the function
```python
out = lambda wid,KS,pad,st: ((wid-KS+2*pad)/st+1)
```
with wid=width, KS=kernel size, pad=padding, st=stride. If it returns an integer, the kernel positions cover the input exactly; otherwise the result is rounded down and the last rows and columns are ignored. In our case we have ```out(32, 5, 0, 1)=28.0```.
To keep the output the same size as the input, we can set $\mathrm{pad} = (\mathrm{KS}-1)/2$ (for odd KS), provided that $\mathrm{st} = 1$.
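
A quick way to check the effect of padding is to push a random picture through two stand-alone layers and look at the output shapes. This is only a sketch: the layer names ```no_pad``` and ```same_pad``` are made up for the example, and it assumes a PyTorch version (0.4 or later) in which plain tensors can be passed to layers.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 1, 32, 32)            # one BW picture of 32x32

no_pad = nn.Conv2d(1, 6, 5)                # defaults: stride=1, padding=0
same_pad = nn.Conv2d(1, 6, 5, padding=2)   # pad = (KS - 1) / 2 = 2

print(no_pad(img).shape)    # torch.Size([1, 6, 28, 28])
print(same_pad(img).shape)  # torch.Size([1, 6, 32, 32])
```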

Without padding the width shrinks by $\mathrm{KS} -1 = 4$, so after ```net.conv1``` we end up with $28\times28\times6$, because we have 6 output channels. This is fed into the $2\times2$ max-pooling layer, so we get $14\times14\times6$: pooling halves width and height, but channels aren't pooled.
The same procedure applies again to layer ```net.conv2``` $\Rightarrow\ 10\times10\times16$, and after pooling $5\times5\times16$.

This output tensor is then flattened into a 1d vector per sample, such that ```net.fc1``` is able to use it. Hence the $16\cdot 5 \cdot 5 = 400$ input size.
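
The size bookkeeping can be traced with a few lines of plain Python. The helpers ```conv_out``` and ```pool_out``` are hypothetical names introduced for this sketch; they use integer division to mirror the rounding down that happens when the sizes do not divide evenly.

```python
def conv_out(width, kernel, pad=0, stride=1):
    """Output width of a convolution (rounded down)."""
    return (width - kernel + 2 * pad) // stride + 1

def pool_out(width, window=2):
    """Output width of non-overlapping pooling."""
    return width // window

w = 32
trace = [w]
for kernel in (5, 5):        # conv1 and conv2 both use 5x5 kernels
    w = conv_out(w, kernel)
    trace.append(w)
    w = pool_out(w)
    trace.append(w)

flat = 16 * w * w            # 16 channels after conv2
print(trace, flat)           # [32, 28, 14, 10, 5] 400
```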

## actual input, now!

Our input is going to be an ```autograd.Variable```, such that all the wonderful automagical stuff happens in the background. :-)
```nn.Conv2d``` takes a $4$-dimensional tensor: $\mathrm{nSamples} \times \mathrm{nChannels}\times \mathrm{height}\times \mathrm{width}$.

```torch.nn``` only supports mini-batches, not single samples. So if we had only a single sample, we'd have to use ```inp.unsqueeze(0)``` to add a fake batch dimension.
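
As a minimal illustration of the fake batch dimension (the variable names ```single``` and ```batch``` are made up for this sketch):

```python
import torch

single = torch.randn(1, 32, 32)   # one sample: nChannels x height x width
batch = single.unsqueeze(0)       # add a batch dimension of size 1 in front

print(single.shape)  # torch.Size([1, 32, 32])
print(batch.shape)   # torch.Size([1, 1, 32, 32])
```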

In [12]:
inp = Variable(torch.randn(1, 1, 32, 32))  
# what is a channel? think of an RGB picture - it has 3 channels, each for a certain colour. each channel has info
# about the whole picture
out = net(inp)
print(out)

Variable containing:
-0.0533  0.0633  0.0528 -0.0811 -0.0318 -0.0764  0.0818 -0.1272 -0.1108 -0.0568
[torch.FloatTensor of size 1x10]



Now we want to _zero_ the gradient buffers of all parameters and backpropagate using random gradients:

In [5]:
net.zero_grad()
out.backward(torch.randn(1, 10))

## going further, loss functions, yeeehaw!

A simple loss function is ```nn.MSELoss```, which computes the mean squared error between the input and the target.

In [13]:
output = net(inp)
target = Variable(torch.arange(1, 11).float().view(1, -1))  # a dummy target with the same 1x10 shape as the output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

Variable containing:
 39.0502
[torch.FloatTensor of size 1]
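
To make explicit what the criterion computes, the sketch below compares ```nn.MSELoss``` with a hand-written mean of squared differences. It uses a random stand-in for the network output rather than the real ```net```, and assumes PyTorch 0.4 or later (tensors without ```Variable```, ```dtype``` argument to ```torch.arange```).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)                                    # stand-in for net(inp)
target = torch.arange(1, 11, dtype=torch.float32).view(1, -1)  # same dummy target, shape 1x10

loss_builtin = nn.MSELoss()(output, target)
loss_manual = ((output - target) ** 2).mean()   # mean over all 10 elements

print(loss_builtin.item(), loss_manual.item())  # the two values agree
```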



## and now for something completely backprop

To backpropagate the error we just have to call the ```.backward()``` method of the loss, which is automagically there. We need to clear the existing gradients first, otherwise the new gradients are added to the existing ones.
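
The accumulation is easy to see on a toy example. This is a sketch with a made-up weight vector ```w```, assuming PyTorch 0.4 or later (```requires_grad=True``` on plain tensors):

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

(w * x).sum().backward()
first = w.grad.clone()          # tensor([1., 2., 3.])

(w * x).sum().backward()        # no zeroing: gradients are summed up
accumulated = w.grad.clone()    # tensor([2., 4., 6.])

w.grad.zero_()                  # clear the buffer, as net.zero_grad() does for all parameters
(w * x).sum().backward()
fresh = w.grad.clone()          # tensor([1., 2., 3.])
```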

We'll inspect the biases of ```net.conv1``` before and after the backprop.

In [14]:
# clear gradients:
net.zero_grad()

print(':: conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print(':: conv1.bias.grad after backward')
print(net.conv1.bias.grad)

:: conv1.bias.grad before backward
Variable containing:
 0
 0
 0
 0
 0
 0
[torch.FloatTensor of size 6]

:: conv1.bias.grad after backward
Variable containing:
-0.0060
 0.0450
-0.0926
 0.0781
 0.1819
-0.0768
[torch.FloatTensor of size 6]



now, the only thing that's left is...
## weight update

which is easiest to do with stochastic gradient descent (SGD). The procedure is simple:
```python
weight = weight - learning_rate * gradient
```

SIDE NOTE: I'm more or less copy-pasting text & code from the tutorial here. The formula above is plain _Gradient Descent_; the stochasticity in SGD does not come from the update rule but from computing the gradient on a randomly drawn mini-batch instead of the full data set. 

In [15]:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)


To utilize more update rules, without the need to code them ourselves, we can use the ```torch.optim``` package. It is used as follows:

In [16]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(inp)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update
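
That ```optim.SGD``` without momentum performs the same update as the hand-written rule can be checked on a small layer. This is a sketch, not the tutorial's code: the layers ```layer_manual``` and ```layer_optim``` and the random data are made up for the comparison, and it assumes PyTorch 0.4 or later (```torch.no_grad()```).

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
layer_manual = nn.Linear(3, 2)
layer_optim = copy.deepcopy(layer_manual)   # identical starting weights
x, y = torch.randn(4, 3), torch.randn(4, 2)
lr = 0.01

# hand-written update: weight = weight - learning_rate * gradient
F.mse_loss(layer_manual(x), y).backward()
with torch.no_grad():
    for p in layer_manual.parameters():
        p -= lr * p.grad

# the same step done by the optimizer
opt = optim.SGD(layer_optim.parameters(), lr=lr)
opt.zero_grad()
F.mse_loss(layer_optim(x), y).backward()
opt.step()

same = all(torch.allclose(a, b)
           for a, b in zip(layer_manual.parameters(), layer_optim.parameters()))
print(same)  # True
```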

## test section
below is just some testing of the above stuff, without any real reason. 

In [18]:
# test function to get an idea of how long a full forward and backward pass through the network takes
# still to add: weight update with learning rate, loss function, ..
def test(net):
    input = Variable(torch.randn(1,1,32,32))
    out = net(input)
    net.zero_grad()
    out.backward(torch.randn(1,10))
%timeit test(net)

def test2(net):
    optimizer.zero_grad()   # zero the gradient buffers
    output = net(inp)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step() 
%timeit test2(net)

955 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
896 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



 0.0544  0.0016  0.1104  0.0930  0.1158 -0.0569  0.0511  0.1915  0.1072  0.0430
[torch.FloatTensor of size 1x10]

