Neural Networks in PyTorch
==========================

Neural networks can be constructed using the ``torch.nn`` package.

An ``nn.Module`` contains layers and a method ``forward(input)`` that takes the input tensor and returns the output of the neural network (or of any other module).

PyTorch provides the ``autograd`` package, which automatically computes the derivatives needed to train the model.

We can divide the typical procedure for training a neural net into 5 steps:

1. Define the neural network that has some learnable parameters (or
  weights)
2. Process input through the network, a.k.a. _______
3. Compute the loss (how far the output is from being correct), e.g. _______
4. Propagate gradients back into the network’s parameters using _______
5. Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``

## First things first, time for some imports!

In [3]:
import torch
from torch import nn, optim

# 1. Define the network

There are 2 ways to define a simple neural network:
1. Using built-in ``nn.Sequential``
2. Creating our own custom class inheriting from ``nn.Module``

In [4]:
num_inputs, num_hidden, num_outputs = 1000, 100, 10

sequential_model = nn.Sequential(
    nn.Linear(num_inputs, num_hidden),
    nn.ReLU(),
    nn.Linear(num_hidden, num_outputs),
)

In [5]:
class TwoLayerNet(nn.Module):
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(num_inputs, num_hidden)
        self.linear2 = torch.nn.Linear(num_hidden, num_outputs)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

custom_model = TwoLayerNet(num_inputs, num_hidden, num_outputs)

You only have to define the ``forward`` function; the ``backward``
function (where gradients are computed) is automatically defined for you
by ``autograd``. (If you need a custom backward pass, subclass ``torch.autograd.Function`` and implement its ``forward`` and ``backward`` static methods.)
You can use any of the Tensor operations in the ``forward`` function.
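
As a minimal sketch of a custom backward pass, the block below re-implements ReLU as a ``torch.autograd.Function`` subclass and compares its gradient with the built-in ``torch.relu``. The class name ``MyReLU`` and the test tensor are illustrative assumptions, not part of this notebook.

```python
import torch


class MyReLU(torch.autograd.Function):
    """Hypothetical example: ReLU with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Save the input so backward() can tell where the output was clamped.
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # The gradient passes through where x > 0 and is zero elsewhere.
        grad_input = grad_output.clone()
        grad_input[x <= 0] = 0
        return grad_input


torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)

MyReLU.apply(x).sum().backward()   # custom Functions are called through .apply
custom_grad = x.grad.clone()

x.grad = None                      # reset before the reference computation
torch.relu(x).sum().backward()
reference_grad = x.grad.clone()

print(torch.equal(custom_grad, reference_grad))
```

For ordinary models this is rarely needed, because ``autograd`` already differentiates every built-in Tensor operation used in ``forward``.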

## Let's print the networks and see what we get...

In [6]:
print(sequential_model)
print(custom_model)

Sequential(
  (0): Linear(in_features=1000, out_features=100, bias=True)
  (1): ReLU()
  (2): Linear(in_features=100, out_features=10, bias=True)
)
TwoLayerNet(
  (linear1): Linear(in_features=1000, out_features=100, bias=True)
  (linear2): Linear(in_features=100, out_features=10, bias=True)
)


## Wait What!!!! Are they different??

## Let's check the number of trainable parameters

In [7]:
print("Sequential Model: {}".format(sum(p.numel() for p in sequential_model.parameters() if p.requires_grad)))
print("Custom Model: {}".format(sum(p.numel() for p in custom_model.parameters() if p.requires_grad)))

Sequential Model: 101110
Custom Model: 101110
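
The printouts differ only because the custom model applies ReLU as a Tensor operation (``clamp(min=0)``) inside ``forward`` instead of registering an ``nn.ReLU`` submodule. The sketch below is one way to check that the two definitions compute the same function: it rebuilds both models, copies the weights from one into the other, and compares outputs. The variable names ``seq`` and ``custom`` are chosen here for illustration.

```python
import torch
from torch import nn

num_inputs, num_hidden, num_outputs = 1000, 100, 10

seq = nn.Sequential(
    nn.Linear(num_inputs, num_hidden),
    nn.ReLU(),
    nn.Linear(num_hidden, num_outputs),
)


class TwoLayerNet(nn.Module):
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        self.linear1 = nn.Linear(num_inputs, num_hidden)
        self.linear2 = nn.Linear(num_hidden, num_outputs)

    def forward(self, x):
        # ReLU applied functionally, so it does not show up in print(model)
        return self.linear2(self.linear1(x).clamp(min=0))


custom = TwoLayerNet(num_inputs, num_hidden, num_outputs)

# Copy the Sequential weights so both models hold identical parameters.
with torch.no_grad():
    custom.linear1.weight.copy_(seq[0].weight)
    custom.linear1.bias.copy_(seq[0].bias)
    custom.linear2.weight.copy_(seq[2].weight)
    custom.linear2.bias.copy_(seq[2].bias)

x = torch.randn(4, num_inputs)
same_output = torch.allclose(seq(x), custom(x))

# Parameter count by hand: weights plus biases of each Linear layer.
expected_params = (num_inputs * num_hidden + num_hidden) + (
    num_hidden * num_outputs + num_outputs
)
print(same_output, expected_params)
```

The hand count, (1000 × 100 + 100) + (100 × 10 + 10), gives the 101110 reported for both models above.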


# 2. Forward Pass

In [8]:
input = torch.randn(1, 1000)
out = custom_model(input)
print(out)

tensor([[ 0.1746,  0.1649,  0.5953, -0.2792, -0.1438, -0.1933,  0.0381,  0.2224,
          0.1697,  0.1465]], grad_fn=<AddmmBackward>)


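``torch.nn`` is designed to work on mini-batches: inputs carry a leading batch dimension, which is why ``input`` above has shape ``(1, 1000)`` rather than ``(1000,)``. A single sample can be given a batch dimension of size 1 with ``unsqueeze(0)``.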
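
To illustrate the batch dimension, here is a small sketch that turns a single sample into a mini-batch of size 1 before passing it to a layer. The names ``layer``, ``sample`` and ``batch`` are assumptions made for this example.

```python
import torch
from torch import nn

layer = nn.Linear(1000, 10)

sample = torch.randn(1000)     # one sample, no batch dimension
batch = sample.unsqueeze(0)    # shape (1, 1000): a mini-batch holding one sample

out = layer(batch)
print(batch.shape, out.shape)
```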

Before proceeding further, let's recap all the classes you’ve seen so far.

**Recap:**
  -  ``torch.Tensor`` - A *multi-dimensional array* with support for autograd
     operations like ``backward()``. Also *holds the gradient* w.r.t. the
     tensor.
  -  ``nn.Module`` - Neural network module. *Convenient way of
     encapsulating parameters*, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of Tensor that is *automatically
     registered as a parameter when assigned as an attribute to a*
     ``Module``.
  -  ``autograd.Function`` - Implements *forward and backward definitions
     of an autograd operation*. Every ``Tensor`` operation creates at
     least a single ``Function`` node that connects to functions that
     created a ``Tensor`` and *encodes its history*.
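
The recap can be made concrete with a short sketch. The ``Scale`` module below is hypothetical and exists only to show that an ``nn.Parameter`` attribute is registered automatically while a plain tensor attribute is not, and that the result of an operation records its history in ``grad_fn``.

```python
import torch
from torch import nn


class Scale(nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(1))  # registered as a parameter
        self.offset = torch.zeros(1)             # plain tensor: not registered

    def forward(self, x):
        return self.gain * x + self.offset


module = Scale()
param_names = [name for name, _ in module.named_parameters()]
print(param_names)

y = module(torch.tensor([3.0]))
print(y.grad_fn is not None)   # the output remembers the operation that made it

y.backward()
print(module.gain.grad)        # d(gain * x + offset) / d(gain) equals x
```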

# 3. Compute the Loss

A loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target.

There are several different
[loss functions](https://pytorch.org/docs/nn.html#loss-functions) under the
``nn`` package.
A simple one is ``nn.MSELoss``, which computes the mean squared error
between its input (here, the network output) and the target.

For example:

In [9]:
target = torch.randn(1, 10)  # a dummy target, for example
criterion = nn.MSELoss()

loss = criterion(out, target)
print(loss)

tensor(0.6360, grad_fn=<MseLossBackward>)
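
To make the definition concrete, the sketch below computes the mean squared error by hand and compares it with ``nn.MSELoss`` (using its default ``reduction='mean'``). The tensors are fresh random stand-ins for the network output and target, not the values printed above.

```python
import torch
from torch import nn

torch.manual_seed(0)
out = torch.randn(1, 10)      # stand-in for a network output
target = torch.randn(1, 10)   # dummy target

builtin_loss = nn.MSELoss()(out, target)
manual_loss = ((out - target) ** 2).mean()   # average of squared differences

print(torch.allclose(builtin_loss, manual_loss))
```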


# 4. Backprop

To backpropagate the error, all we have to do is call ``loss.backward()``.
You need to clear the existing gradients first, though; otherwise the new
gradients will be accumulated into the existing ones.

In [10]:
custom_model.zero_grad()     # zeroes the gradient buffers of all parameters

print('linear1.bias.grad before backward')
print(custom_model.linear1.bias.grad)

loss.backward()

print('linear1.bias.grad after backward')
print(custom_model.linear1.bias.grad)

linear1.bias.grad before backward
None
linear1.bias.grad after backward
tensor([ 0.0000e+00,  8.8082e-02,  0.0000e+00,  4.9246e-02,  0.0000e+00,
         0.0000e+00,  7.5798e-03,  0.0000e+00,  0.0000e+00,  0.0000e+00,
        -5.1210e-03, -6.6060e-03, -3.7342e-02, -6.1204e-03,  7.1041e-02,
         0.0000e+00,  0.0000e+00,  2.9379e-02,  4.1755e-02, -2.5439e-02,
         0.0000e+00,  1.8737e-02,  9.4239e-03,  1.3639e-02, -3.6942e-02,
         0.0000e+00, -1.0299e-03,  0.0000e+00,  0.0000e+00,  2.6386e-02,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  7.8013e-03, -1.3217e-02,
        -1.7844e-02,  0.0000e+00,  0.0000e+00,  3.8579e-02,  8.7822e-03,
        -2.1021e-05,  2.9333e-02,  0.0000e+00,  1.7800e-02, -5.9021e-03,
         0.0000e+00,  7.0990e-03,  0.0000e+00,  1.6522e-02,  5.4553e-02,
        -1.2442e-03,  0.0000e+00, -1.6706e-02,  0.0000e+00,  1.7266e-03,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  1.1967e-02,
        -1.2956e-02, -7.4938e-03,  5.4445e-02,  0.00

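Loss functions are very important: the gradients computed above are gradients of the loss, so the choice of loss determines what the network is trained to do.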

# 5. Gradient Descent

The simplest update rule used in practice is Stochastic Gradient
Descent (SGD):

     ``weight = weight - learning_rate * gradient``

We can implement this using simple Python code:

```python
learning_rate = 0.01
with torch.no_grad():  # the update itself must not be tracked by autograd
    for p in custom_model.parameters():
        p -= learning_rate * p.grad
```

However, as you use neural networks, you will want to use various
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
To enable this, PyTorch provides a small package, ``torch.optim``, that
implements all these methods. Using it is very simple:

In [11]:
# create your optimizer
optimizer = optim.SGD(custom_model.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = custom_model(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update
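
The cell above performs a single update. As a rough sketch of what "in your training loop" might look like, the block below repeats those same steps a few times on one fixed dummy sample and records the loss. The model is rebuilt here so the block is self-contained; the step count of 20 and the names ``inputs`` and ``losses`` are arbitrary choices for illustration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(1000, 100), nn.ReLU(), nn.Linear(100, 10))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(1, 1000)   # a single dummy sample with a batch dimension
target = torch.randn(1, 10)     # a dummy target

losses = []
for step in range(20):
    optimizer.zero_grad()                    # clear gradients from the last step
    loss = criterion(model(inputs), target)  # forward pass and loss
    loss.backward()                          # backprop
    optimizer.step()                         # apply the SGD update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

With real data, the fixed ``inputs`` and ``target`` would be replaced by mini-batches drawn from a dataset on each iteration.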

Why do we have to zero the gradient buffers manually?
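
One way to see the answer is that ``backward()`` adds to ``.grad`` rather than overwriting it, so gradients from earlier calls leak into later updates unless they are cleared. The toy tensor ``w`` below is an assumption made purely for this demonstration.

```python
import torch

w = torch.ones(3, requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()          # gradient of sum(2 * w) is 2 for every element

(2 * w).sum().backward()        # second call without zeroing in between
accumulated = w.grad.clone()    # the new gradient was added to the old one

w.grad.zero_()                  # what zero_grad() does for every parameter
(2 * w).sum().backward()
after_zero = w.grad.clone()

print(first)        # tensor([2., 2., 2.])
print(accumulated)  # tensor([4., 4., 4.])
print(after_zero)   # tensor([2., 2., 2.])
```

Accumulation is sometimes useful on purpose, for example to sum gradients over several small batches before a single ``optimizer.step()``, which is one reason clearing is left to the caller.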