## Neural Networks in PyTorch

In [24]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Neural networks in PyTorch are built with the `torch.nn` package. A network is defined as a subclass of `nn.Module` whose `forward` method describes how an input tensor is transformed into an output. PyTorch then uses `autograd` to differentiate the model, so no backward method has to be written by hand.

### PyTorch Training Procedure

1. Define a neural network using `torch.nn.Module` with _learnable parameters_.
2. Choose and define a **loss function**.

Then, for each `(input, target)` pair in the dataset:

- Run `forward(input)`.
- Compute the loss between the network's output and the target.
- Propagate gradients using `.backward()` (step 3 below).
- Update the weights of the network using a _gradient-based optimiser_ (step 4 below).

### 1. Define a neural network

We use the following architecture:

- Conv 5x5: 1 input channel -> 6 output channels
- Conv 5x5: 6 input channels -> 16 output channels
- FC: 120 hidden units
- FC: 84 hidden units
- FC: 10 **output** units
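
The input size of the first fully connected layer follows from the layer arithmetic. The sketch below assumes a 32x32 single-channel input, convolutions with stride 1 and no padding, and a 2x2 max pool after each convolution. The helpers `conv_out` and `pool_out` are illustrative names, not PyTorch functions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Spatial size after a max pool whose stride equals its kernel size."""
    return size // kernel

size = 32                               # assumed input height and width
size = pool_out(conv_out(size, 5), 2)   # conv 5x5 then pool: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2)   # conv 5x5 then pool: 14 -> 10 -> 5
flat_features = 16 * size * size        # 16 channels of 5x5 feature maps
print(size, flat_features)              # 5 400
```

Under these assumptions the flattened feature vector has 16 * 5 * 5 = 400 entries, which is the number of inputs the first fully connected layer must accept.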

In [2]:
class CNN(nn.Module):
    
    def __init__(self):
        super().__init__()
        # LAYERS ==========================>
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        
    def forward(self, x):
        # conv1 -> relu -> max pool (2,2)
        x = F.max_pool2d(F.relu(self.conv1(x)), (2,2))
        # conv2 -> relu -> max pool (2,2) | note size is square so we only specify a single number for max_pool
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # unroll the intermediate output
        x = x.view(-1, self.num_flat_features(x))
        # fc1 -> relu
        x = F.relu(self.fc1(x))
        # fc2 -> relu
        x = F.relu(self.fc2(x))
        # fc3 -> output
        return self.fc3(x)
    
    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
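
The `num_flat_features` helper can also be replaced by a built-in call. As a sketch, `torch.flatten(x, 1)` flattens every dimension except the batch dimension and should give the same result as the `view` call in `forward`. The tensor `x` below is a random stand-in with the shape the second pooling step would produce for a 32x32 input, not an actual intermediate value from the network.

```python
import torch

x = torch.randn(4, 16, 5, 5)   # stand-in for the output of the second pooling step

# manual version, mirroring num_flat_features
num_features = 1
for s in x.size()[1:]:
    num_features *= s
manual = x.view(-1, num_features)

# built-in version: flatten all dimensions from index 1 onwards
builtin = torch.flatten(x, 1)

print(manual.shape, builtin.shape)
print(torch.equal(manual, builtin))
```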

In [20]:
# we can print a summary of the network
net = CNN()
print(net)

CNN(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


In [8]:
# the learnable parameters of the network are returned by `net.parameters()`
params = list(net.parameters())
print(len(params)) # number of parameter tensors (a weight and a bias for each of the 5 layers)
print(params[0].size()) # shape of conv1's weights

10
torch.Size([6, 1, 5, 5])
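
To see where the 10 parameter tensors come from, and how many scalar weights they hold in total, the parameters can be listed by name. This is a minimal sketch: the five layers are rebuilt in an `nn.ModuleList` (a stand-in for the `CNN` class, so the snippet runs on its own), which is why the printed names are indices rather than `conv1`, `fc1`, and so on.

```python
import torch.nn as nn

# the same five layers as the CNN above, rebuilt here so the snippet is self-contained
layers = nn.ModuleList([
    nn.Conv2d(1, 6, 5),
    nn.Conv2d(6, 16, 5),
    nn.Linear(16 * 5 * 5, 120),
    nn.Linear(120, 84),
    nn.Linear(84, 10),
])

# one weight tensor and one bias tensor per layer
for name, p in layers.named_parameters():
    print(name, tuple(p.shape), p.numel())

total = sum(p.numel() for p in layers.parameters() if p.requires_grad)
print(total)  # 61706
```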


#### Testing Forward Propagation

In [21]:
# try out forward with a random tensor
in_ = torch.randn(1,1,32,32) # note dimensionality: (N, C, H, W)
out = net(in_)
print(out)

tensor([[ 0.0109, -0.0894,  0.0501,  0.0767,  0.1023,  0.0595, -0.0435,  0.0773,
         -0.1163, -0.0197]], grad_fn=<AddmmBackward>)


#### Testing Backward Propagation

>Remember that PyTorch gradients are accumulated vector-Jacobian products: every call to `.backward()` adds to the values already stored in `.grad`. Before each backward pass we need to zero out the accumulated gradients, and it is good to get into this habit.

In [12]:
net.zero_grad()
# `out` is not a scalar, so `.backward()` needs a vector of the same shape for the vector-Jacobian product
out.backward(torch.randn(1,10), retain_graph=True)
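
The accumulation behaviour can be seen on a much smaller example. The sketch below uses a two-element tensor `w` (an illustrative stand-in for a network parameter) and calls `.backward()` repeatedly, with and without resetting the gradient in between.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()    # gradient of sum(3w) is 3 per element

(w * 3).sum().backward()
second = w.grad.clone()   # accumulated on top of the first call: 3 + 3 = 6

w.grad.zero_()            # reset before the next backward pass
(w * 3).sum().backward()
third = w.grad.clone()    # back to 3 per element

print(first, second, third)
```

Without the reset, the second gradient is the sum of two backward passes rather than the gradient of the current loss alone.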

### 2. Choose a Loss Function

PyTorch contains many different kinds of [loss functions](https://pytorch.org/docs/stable/nn.html#id51). In general, a **loss function** takes an $(output, target)$ pair, where the output is the result of the network's forward pass on an input, and computes a _scalar value_ that estimates how far the output is from the target.

A simple example of a loss function is the **Mean Square Error**:

In [15]:
# choose a random target
target = torch.randn(10)
print(target.size())
print(out.size())
# make the target the same shape as the output (`view` returns a new tensor, so reassign it)
target = target.view(1,-1)
print(target.size())

torch.Size([10])
torch.Size([1, 10])
torch.Size([1, 10])


In [22]:
# define the MSE loss function
mse = nn.MSELoss()

loss = mse(out, target)
print(loss)

tensor(0.9470, grad_fn=<MseLossBackward>)
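
With its default settings, `nn.MSELoss` averages the squared differences over all elements. The sketch below checks this against a hand-written version on two small fixed tensors, `pred` and `tgt`, which are illustrative values and not the network output above.

```python
import torch
import torch.nn as nn

pred = torch.tensor([[1.0, 2.0, 3.0]])
tgt = torch.tensor([[1.0, 0.0, 5.0]])

by_hand = ((pred - tgt) ** 2).mean()   # (0 + 4 + 4) / 3
built_in = nn.MSELoss()(pred, tgt)     # default reduction is the mean

print(by_hand.item(), built_in.item())
```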


Now `loss` is the **final output** of the computation graph `in_ -> CNN -> out -> MSELoss(out, target) -> loss`. Calling `.backward()` on `loss` propagates gradients backward along the edges of this graph, from the loss towards the input. The loss is differentiated with respect to every tensor in the graph that has `requires_grad=True`, and each such leaf tensor has its gradient accumulated in `.grad`.

To illustrate the graph, let's follow a few of these backward steps:

In [17]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # AccumulateGrad for a leaf parameter of fc3, not the ReLU

<MseLossBackward object at 0x7fa5430cf9e8>
<AddmmBackward object at 0x7fa5430cfb38>
<AccumulateGrad object at 0x7fa5430cf9e8>


### 3. Backpropagation

Now we call `loss.backward()` to backpropagate the error, which fills in the `.grad` attribute of each parameter.

In [23]:
# we need to clear existing gradients which we accumulated
net.zero_grad()

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([-0.0221,  0.0188, -0.0090, -0.0295, -0.0042,  0.0166])


### 4. Optimising Weights

We will use the simplest gradient-based optimiser used in modern machine learning, namely **stochastic gradient descent**.
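
For plain SGD, with no momentum or weight decay, each step moves a parameter against its gradient: `p <- p - lr * p.grad`. The sketch below compares this hand-written update with `optim.SGD` on a single two-element tensor `p`; the tensor and the toy loss are illustrative and not part of the network.

```python
import torch
import torch.optim as optim

lr = 0.1
p = torch.tensor([1.0, -2.0], requires_grad=True)
optimiser = optim.SGD([p], lr=lr)

loss = (p ** 2).sum()                 # toy loss whose gradient is 2 * p
loss.backward()

expected = p.detach() - lr * p.grad   # hand-written update, computed before stepping
optimiser.step()                      # library update, applied to p in place

print(expected, p.detach())
```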

In [25]:
optimiser = optim.SGD(net.parameters(), lr=0.01)

# one iteration of a training loop, on a random input and target
in_ = torch.randn(1,1,32,32)
target = torch.randn(10).view(1,-1)

optimiser.zero_grad()  # clear the gradient buffers
out = net(in_)
loss = mse(out, target)
loss.backward()
optimiser.step()  # apply the update
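
Repeating this step gives a complete training loop. The following is a minimal sketch under stated assumptions: the model is a compact `nn.Sequential` stand-in for the `CNN` class so that the snippet runs on its own, and the data is a single fixed batch of random inputs and targets rather than a real dataset.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# compact stand-in for the CNN above
model = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)
criterion = nn.MSELoss()
optimiser = optim.SGD(model.parameters(), lr=0.01)

# synthetic data: one fixed batch, (N, C, H, W) inputs and (N, 10) targets
inputs = torch.randn(8, 1, 32, 32)
targets = torch.randn(8, 10)

losses = []
for step in range(20):
    optimiser.zero_grad()                      # clear gradients from the previous step
    loss = criterion(model(inputs), targets)   # forward pass and loss
    loss.backward()                            # compute gradients
    optimiser.step()                           # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Because the targets are random, the loss only measures how well the network memorises this one batch; with a real dataset the loop would iterate over batches from a data loader instead.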