Matías Alloatti - 2020
This notebook was compiled from tutorials at:
1. https://pytorch.org/tutorials/
2. https://deeplizard.com/

# Neural Networks

### Neural networks can be constructed using the `torch.nn` package.

You have already had a glimpse of `autograd`; the `nn` package depends on it to define models and differentiate them. An `nn.Module` contains layers and a method `forward(input)` that returns the `output`.

A typical training procedure for a neural network is as follows:
   - Define the neural network that has some learnable parameters (or weights)
   - Iterate over a dataset of inputs
   - Process input through the network
   - Compute the loss (how far is the output from being correct)
   - Propagate gradients back into the network’s parameters
   - Update the weights of the network, typically using a simple update rule: `weight = weight - learning_rate * gradient`
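
The steps above can be sketched on a toy problem. This is only a minimal sketch, not code from the tutorials: the single linear layer, the random data, and the names `model`, `inputs`, `targets` and `learning_rate` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1. Define a network with learnable parameters (a single linear layer here).
model = nn.Linear(3, 1)

# Hypothetical toy dataset: 8 samples with 3 features each, plus 8 targets.
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

criterion = nn.MSELoss()
learning_rate = 0.01
losses = []

# 2. Iterate over the dataset (one full batch per step, for simplicity).
for step in range(20):
    output = model(inputs)               # 3. process input through the network
    loss = criterion(output, targets)    # 4. compute the loss
    losses.append(loss.item())

    model.zero_grad()                    # clear gradients from the previous step
    loss.backward()                      # 5. propagate gradients back to the parameters

    # 6. update rule: weight = weight - learning_rate * gradient
    with torch.no_grad():
        for p in model.parameters():
            p -= learning_rate * p.grad

print(losses[0], losses[-1])             # first and last loss values
```

On this small convex problem the loss should shrink from step to step; real networks follow the same loop with a larger model and a real dataset.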

In [2]:
# Libraries:
import torch
import torch.nn as nn
import torch.nn.functional as F

You only have to define the `forward` function; the `backward` function (where gradients are computed) is defined for you automatically by `autograd`. You can use any Tensor operation in the `forward` function.

In [4]:
# nn.Module is the base class for all neural network modules.
# Your own models should subclass it.
class Net(nn.Module):
# The __init__() method initializes an instance and is used to create the needed layers
    def __init__(self):
        super(Net, self).__init__()
        # nn.Conv2d() applies a 2d convolution
        # 2d convolution layer with 1 input channel (a grayscale image), 6 output channels (filters), and 3x3 convolution kernels:
        self.conv1 = nn.Conv2d(1, 6, 3)        # defaults: stride=1, padding=0
        # 2d convolution layer with 6 input channels, 16 output channels (filters), and 3x3 convolution kernels:
        self.conv2 = nn.Conv2d(6, 16, 3) 
        # nn.Linear() applies a linear transformation y = Wx + b; it takes in_features, out_features, and bias=True
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 16 channels * 6 * 6 spatial size after the second pooling
        self.fc2 = nn.Linear(120, 84)          # hidden layer: 120 -> 84 features
        self.fc3 = nn.Linear(84, 10)           # output layer: 84 -> 10 outputs

# The forward() method creates the structure of the network and through its layers transforms x:
    def forward(self, x):
        # F.max_pool2d() applies max pooling with a given kernel_size, stride, and padding (defaults: stride=None, padding=0)
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2)) # max pooling over a (2, 2) window 
        x = F.max_pool2d(F.relu(self.conv2(x)), 2) # you can specify just a single number if the size is square
        x = x.view(-1, self.num_flat_features(x))  # reshaping of x (flattening)
        x = F.relu(self.fc1(x))                    # applies a nonlinearity (relu) to linear transform (fc1)
        x = F.relu(self.fc2(x))                    # non linear activation (relu) after linear transform (fc2)
        x = self.fc3(x)                            # last linear transformation (fc3)
        return x

# auxiliary method to flatten features:
    def num_flat_features(self, x):
        size = x.size()[1:]                        # takes all dimensions except the batch dimension
        num_features = 1                           # num_features will capture the total number of features
        for s in size:                             # multiplies all dimensions in size to get total features
            num_features *= s
        return num_features

net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


In [17]:
# The learnable parameters of a model are returned by net.parameters():
params = list(net.parameters())
print('len() of parameter list:', len(params))
print("conv1's weights size:", params[0].size())  # conv1's weights

len() of parameter list: 10
conv1's weights size: torch.Size([6, 1, 3, 3])


In [18]:
# Let’s try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. 
# To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[ 0.0747, -0.0454,  0.1177, -0.0867,  0.1047, -0.0502, -0.0487,  0.0826,
         -0.0449,  0.0971]], grad_fn=<AddmmBackward>)


In [19]:
net.zero_grad()
out.backward(torch.randn(1, 10))

---
- Note: `torch.nn` only supports mini-batches: the whole package expects inputs that are a mini-batch of samples, not a single sample.

For example, `nn.Conv2d` takes a 4D Tensor of shape nSamples x nChannels x Height x Width. If you have a single sample, use `input.unsqueeze(0)` to add a fake batch dimension.
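
As a small illustration of the fake batch dimension, the sketch below pushes a single image through a convolution. The names `conv`, `sample` and `batch` are hypothetical, and the layer sizes simply mirror `conv1` from `Net`.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 3)

# A single grayscale image: channels x height x width (no batch dimension).
sample = torch.randn(1, 32, 32)

# unsqueeze(0) inserts a batch dimension of size 1 at the front.
batch = sample.unsqueeze(0)
print(batch.shape)   # torch.Size([1, 1, 32, 32])

out = conv(batch)
print(out.shape)     # torch.Size([1, 6, 30, 30])
```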

---

Before proceeding further, let’s recap all the classes you’ve seen so far:

- `torch.Tensor` - A multi-dimensional array with support for autograd operations like `backward()`. Also holds the gradient w.r.t. the tensor.
- `nn.Module` - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.
- `nn.Parameter` - A kind of `Tensor`, that is automatically registered as a parameter when assigned as an attribute to a `Module`.
- `autograd.Function` - Implements forward and backward definitions of an autograd operation. Every `Tensor` operation creates at least a single `Function` node that connects to functions that created a `Tensor` and encodes its history.

At this point, we covered:
- Defining a neural network
- Processing inputs and calling backward

Now, a short break to review some CNN concepts.

## CNNs: terminology, Output Size Formula and more
### Terminology:
CNNs (Convolutional Neural Networks) have convolutional layers. They're really useful for detecting patterns in images.
- Convolutional layer: applies a convolution operation with a certain number of filters.
- Convolutional filter: a small matrix, initialized with random numbers, that slides (convolves) across the input data and computes a dot product with each patch it covers.
- Convolutional kernel: another name for a filter; each kernel has a certain size.
- Output channels (`out_channels` in `nn.Conv2d`): one output channel is produced by applying each filter/kernel.
- Feature maps: another way to refer to the output channels. The name comes from the fact that, as the weights are updated, the patterns the filters detect come to represent features such as edges and other more sophisticated patterns.
- The filters are the weight tensors of the layer: convolving them with the input tensor produces the output channels.

### Algorithm:
    Color channels are passed in.
    Convolutions are performed using the weight tensor (filters).
    Feature maps are produced and passed forward.
    
Conceptually, we can think of the weight tensors as being distinct. However, what we really have in code is a single weight tensor that has an `out_channels` (filters) dimension. We can see this by checking the shape of the weight tensor with `self.conv1.weight.shape`. This tensor's shape is given by: [number of filters, number of input channels, filter height, filter width].
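
A quick way to check this is to build the layers on their own and print the weight shapes. In this minimal sketch the standalone names `conv1` and `conv2` are assumptions; the sizes are those used in `Net`.

```python
import torch.nn as nn

conv1 = nn.Conv2d(1, 6, 3)
conv2 = nn.Conv2d(6, 16, 3)

# Shape: [number of filters, number of input channels, filter height, filter width]
print(conv1.weight.shape)      # torch.Size([6, 1, 3, 3])
print(conv2.weight.shape)      # torch.Size([16, 6, 3, 3])

# Indexing the first dimension selects a single filter.
print(conv2.weight[0].shape)   # torch.Size([6, 3, 3])
```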

### Output Size Formula:
Let's have a look at the formula for computing the output size of the tensor after performing convolutional and pooling operations.
###### CNN Output Size Formula (Square)
With an $n \times n$ input, an $f \times f$ filter, a padding of $p$ and a stride of $s$, the output size $O$ is given by this formula: $$O=\frac{n-f+2p}{s}+1$$

This value will be the height and width of the output. When the division is not exact, PyTorch rounds the quotient down by default. If the input or the filter isn't square, the formula needs to be applied twice, once for the width and once for the height.
###### CNN Output Size Formula (Non-Square)
With an $n_{h} \times n_{w}$ input, an $f_{h} \times f_{w}$ filter, a padding of $p$ and a stride of $s$:

The height of the output $O_{h}$ is given by this formula: $$O_{h}=\frac{n_{h}-f_{h}+2p}{s}+1$$

The width of the output $O_{w}$ is given by this formula: $$O_{w}=\frac{n_{w}-f_{w}+2p}{s}+1$$
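
The formula can be turned into a small helper to trace sizes through a network. This is a sketch under stated assumptions: `conv_output_size` is a hypothetical helper, not a PyTorch function, and it uses integer division to mimic PyTorch's default rounding down.

```python
def conv_output_size(n, f, p=0, s=1):
    """Size along one dimension after a convolution or pooling step."""
    return (n - f + 2 * p) // s + 1   # // rounds down, as PyTorch does by default

# Net: 32x32 input, 3x3 convolutions, 2x2 max pooling with stride 2.
size = 32
size = conv_output_size(size, f=3)         # conv1: 30
size = conv_output_size(size, f=2, s=2)    # pool:  15
size = conv_output_size(size, f=3)         # conv2: 13
size = conv_output_size(size, f=2, s=2)    # pool:  6 (6.5 rounded down)
print(size)                                # 6, hence nn.Linear(16 * 6 * 6, 120)

# Non-square case: apply the formula once per dimension.
out_h = conv_output_size(28, f=5)          # 24
out_w = conv_output_size(20, f=3)          # 18
print(out_h, out_w)
```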

In [6]:
# Another example of a CNN (input tensor size: [1, 1, 28, 28] --> [batch size, color channels, height, width]):
class Net2(nn.Module):
    def __init__(self):
        super().__init__()
        # 2d convolutions:
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)  # bigger kernels than in the earlier Net
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5) # bigger kernels and fewer out_channels than in Net
        # linear transformations:
        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)
        
    def forward(self, t):
        t = t                                        # (1) input layer (identity function)
        t = self.conv1(t)                            # (2) hidden conv layer
        t = F.relu(t)                                # relu activation
        t = F.max_pool2d(t, kernel_size=2, stride=2) # 2d max pooling
        t = self.conv2(t)                            # (3) hidden conv layer
        t = F.relu(t)                                # relu activation
        t = F.max_pool2d(t, kernel_size=2, stride=2) # 2d max pooling  
        t = t.reshape(-1, 12 * 4 * 4)                # (4) hidden linear layer. This line flattens the tensor resulting in Size=([1,192])
        t = self.fc1(t)                              # linear transformation
        t = F.relu(t)                                # relu activation
        t = self.fc2(t)                              # (5) hidden linear layer
        t = F.relu(t)                                # relu activation
        t = self.out(t)                              # (6) output layer
        #t = F.softmax(t, dim=1)
        
        return t

net2 = Net2()
print(net2)

Net2(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=192, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=60, bias=True)
  (out): Linear(in_features=60, out_features=10, bias=True)
)


### ReLU activation function:
The call to the `F.relu()` function removes any negative values and replaces them with zeros. We can verify this by checking the `Tensor.min()` of the tensor before and after the call.

The `relu()` function can be expressed mathematically as: 
$$f(x)=\begin{cases} 0 & \text{if } x<0 \\ x & \text{if } x \geq 0\end{cases}$$
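
The minimum check mentioned above can be sketched on a tiny hand-made tensor. The values in `t` are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn.functional as F

t = torch.tensor([[-2.0, -0.5],
                  [ 0.0,  3.0]])
print(t.min())   # tensor(-2.)

r = F.relu(t)
print(r)         # negative entries replaced by zero, the rest unchanged
print(r.min())   # tensor(0.)

# relu is the same as clamping the tensor at zero from below.
same_as_clamp = torch.equal(r, t.clamp(min=0))
print(same_as_clamp)   # True
```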

### The max pooling operation
The pooling operation further reduces the shape of our tensor by extracting the maximum value from each 2x2 window of the tensor. To apply 2D max pooling over an input signal composed of several input planes, we use `torch.nn.MaxPool2d()`. It takes `kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False` as possible arguments.

The parameters `kernel_size, stride, padding, dilation` can either be:
- an `int` (same value for height and width)
- a `tuple` of two ints (1st int for height, 2nd for width)

If padding is non-zero, then the input is implicitly padded on both sides for padding number of points (current PyTorch documentation specifies negative infinity as the padding value for max pooling). Dilation controls the spacing between the kernel points. This [link](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md) has a nice visualization of what dilation, padding and stride do.
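
To see the operation on concrete numbers, here is a minimal sketch that pools a 4x4 input holding the values 0 to 15. The tensor `t` is an assumption for illustration. It uses both the functional form `F.max_pool2d` (as in the networks above) and the module form `nn.MaxPool2d`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One sample, one channel, 4x4 "image" with the values 0..15.
t = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

pooled = F.max_pool2d(t, kernel_size=2, stride=2)
print(pooled.shape)   # torch.Size([1, 1, 2, 2])
print(pooled)         # maxima of the four 2x2 windows: 5, 7, 13, 15

# The module form gives the same result; stride defaults to kernel_size.
pooled_module = nn.MaxPool2d(2)(t)
print(torch.equal(pooled, pooled_module))   # True
```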

Now we continue with the PyTorch tutorial.

## Loss Function

A loss function takes the (output, target) pair of inputs and computes a value that estimates how far the output is from the target. There are several different [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions) under the `nn` package. A simple one is `nn.MSELoss()`, which computes the mean squared error between the input and the target.

In [20]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.9672, grad_fn=<MseLossBackward>)
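
As a sanity check on what `nn.MSELoss()` computes, the following sketch compares it with the mean of the squared differences written out by hand. The random `output` here is only a stand-in for `net(input)`, and `manual` is a hypothetical name.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)   # stand-in for the network output
target = torch.randn(1, 10)   # dummy target with the same shape

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # mean of the squared differences

print(loss, manual)
print(torch.allclose(loss, manual))        # True
```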


Now, if you follow loss in the backward direction, using its `.grad_fn` attribute, you will see a graph of computations that looks like this:

    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
          -> view -> linear -> relu -> linear -> relu -> linear
          -> MSELoss
          -> loss

So, when we call `loss.backward()`, the loss is differentiated through the whole graph, and every Tensor in the graph that has `requires_grad=True` will have the gradient accumulated into its `.grad` Tensor.

In [24]:
# Following a few steps backward for illustration:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear (shown as Addmm)
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of that node (AccumulateGrad in the output below)

<MseLossBackward object at 0x7fc80a81fdd0>
<AddmmBackward object at 0x7fc80a817550>
<AccumulateGrad object at 0x7fc80a817910>


## Backprop

To backpropagate the error, all we have to do is call `loss.backward()`. You need to clear the existing gradients first, though; otherwise the new gradients will be accumulated into the existing ones.

Now we shall call `loss.backward()`, and have a look at conv1’s bias gradients before and after the backward.

In [25]:
net.zero_grad()                            # zeroes the gradient buffers of all parameters
print('conv1.bias.grad before backward:')   
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward:')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([-0.0057,  0.0105,  0.0310, -0.0020,  0.0147, -0.0188])
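
The accumulation behaviour is easy to see on a single tensor. In this minimal sketch, `w` and the toy expression `(w * 2).sum()` are assumptions for illustration; the gradient of that expression with respect to each entry of `w` is 2.

```python
import torch

w = torch.ones(3, requires_grad=True)

(w * 2).sum().backward()
first = w.grad.clone()
print(first)      # tensor([2., 2., 2.])

# A second backward without clearing adds to the existing gradient.
(w * 2).sum().backward()
second = w.grad.clone()
print(second)     # tensor([4., 4., 4.])

# After zeroing, backward yields the plain gradient again.
w.grad.zero_()
(w * 2).sum().backward()
third = w.grad.clone()
print(third)      # tensor([2., 2., 2.])
```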


#### Read Later:

The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is [here](https://pytorch.org/docs/stable/nn.html).

## Update the weights
The simplest update rule used in practice is Stochastic Gradient Descent (SGD):

    weight = weight - learning_rate * gradient
    
We can implement this using simple Python code:

In [33]:
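learning_rate = 0.01
with torch.no_grad():                      # the update itself must not be tracked by autograd
    for f in net.parameters():
        f.sub_(f.grad * learning_rate)     # weight = weight - learning_rate * gradient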

However, as you use neural networks, you will want to try different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. For this, PyTorch provides a small package, `torch.optim`, that implements all these methods. Using it is very simple:

In [34]:
import torch.optim as optim                      # import package

optimizer = optim.SGD(net.parameters(), lr=0.01) # create optimizer

# in your training loop:
optimizer.zero_grad()                            # zero the gradient buffers
# Gradient buffers must be set to zero manually with optimizer.zero_grad(),
# because gradients are accumulated, as explained in the Backprop section.
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()                                 # Does the update, Easy ;) 
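
To connect the two approaches, the sketch below checks that one step of plain `optim.SGD` (no momentum, no weight decay) matches the manual rule `weight = weight - learning_rate * gradient`. The small linear model, the random data, and the names `model_a`, `model_b`, `x` and `y` are assumptions for illustration only.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)   # identical starting weights

x = torch.randn(5, 4)
y = torch.randn(5, 2)
criterion = nn.MSELoss()
learning_rate = 0.01

# Model A: manual update rule.
criterion(model_a(x), y).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p -= learning_rate * p.grad

# Model B: the same step done by optim.SGD.
optimizer = optim.SGD(model_b.parameters(), lr=learning_rate)
optimizer.zero_grad()
criterion(model_b(x), y).backward()
optimizer.step()

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)   # True
```

Switching to another rule such as Adam only changes the line that creates the optimizer; the rest of the loop stays the same.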