In [1]:
%matplotlib inline


Neural Networks
===============

Neural networks can be constructed using the ``torch.nn`` package.

Now that you have had a glimpse of ``autograd``, note that ``nn`` depends on
``autograd`` to define models and differentiate them.
An ``nn.Module`` contains layers and a method ``forward(input)`` that
returns the ``output``.

For example, consider a perceptron:


It is a simple feed-forward network. It takes the input, feeds it
through several layers one after the other, and then finally gives the
output.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or
  weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``

Define the network: Single Layer Perceptron
-------------------------------------------



In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Perceptron(nn.Module):
    def __init__(self, input_size, num_class):
        super(Perceptron, self).__init__()
        self.fc = nn.Linear(input_size, num_class)
        # use the sigmoid function as the activation function
        self.sigmoid = torch.nn.Sigmoid()
    def forward(self, x):
        output = self.fc(x)
        output = self.sigmoid(output) 
        return output

In [3]:
??nn.Linear

Note that ``nn.Linear`` applies a linear transformation to the incoming data: $$y = xA^T + b$$
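
The formula above can be checked directly against a layer's stored parameters. The snippet below is a minimal sketch: the variable names (``layer``, ``x``, ``manual``, ``matches``) and the fixed seed are chosen here for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable (an arbitrary choice)

layer = nn.Linear(in_features=48, out_features=10)
x = torch.randn(48)

# nn.Linear stores A as `weight` with shape (out_features, in_features)
# and b as `bias` with shape (out_features,).
manual = x @ layer.weight.T + layer.bias

# The layer output and the hand-written formula should agree up to rounding.
matches = torch.allclose(layer(x), manual, atol=1e-5)
print(matches)
```

This also explains the parameter shapes: the weight is ``[10, 48]`` because it is transposed before being multiplied with the input.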


You just have to define the ``forward`` function, and the ``backward``
function (where gradients are computed) is automatically defined for you
using ``autograd``.
You can use any of the Tensor operations in the ``forward`` function.

The learnable parameters of a model are returned by its ``parameters()`` method.



In [4]:
params = list(Perceptron(input_size=48, num_class=10).parameters())
print(len(params))
print(params[0].size())  
print(params[1].size())  

2
torch.Size([10, 48])
torch.Size([10])


Let's try a random input of size 48.



In [5]:
input = torch.randn(48)
print(input.size())
model = Perceptron(input_size=48, num_class=10)
out = model(input)
print(out.size())

torch.Size([48])
torch.Size([10])


Zero the gradient buffers of all parameters:



In [6]:
model.zero_grad()

<div class="alert alert-info"><h4>Note</h4><p>``torch.nn`` only supports mini-batches. The entire ``torch.nn``
    package only supports inputs that are a mini-batch of samples, and not
    a single sample.

    For example, ``nn.Conv2d`` will take in a 4D Tensor of
    ``nSamples x nChannels x Height x Width``.

    If you have a single sample, just use ``input.unsqueeze(0)`` to add
    a fake batch dimension.</p></div>
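
As a small illustration of the ``unsqueeze(0)`` tip in the note, the sketch below adds a batch dimension to a single image before passing it to a convolution. The layer sizes and the 8x8 image are arbitrary values picked for this example.

```python
import torch
import torch.nn as nn

# Hypothetical layer and input sizes, chosen only for illustration.
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)

single_image = torch.randn(1, 8, 8)   # nChannels x Height x Width
batch = single_image.unsqueeze(0)     # nSamples x nChannels x Height x Width

out = conv(batch)
print(batch.shape, out.shape)
```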

Before proceeding further, let's recap all the classes you’ve seen so far.

**Recap:**
  -  ``torch.Tensor`` - A *multi-dimensional array* with support for autograd
     operations like ``backward()``. Also *holds the gradient* w.r.t. the
     tensor.
  -  ``nn.Module`` - Neural network module. *Convenient way of
     encapsulating parameters*, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of Tensor, that is *automatically
     registered as a parameter when assigned as an attribute to a*
     ``Module``.
  -  ``autograd.Function`` - Implements *forward and backward definitions
     of an autograd operation*. Every ``Tensor`` operation creates at
     least a single ``Function`` node that connects to functions that
     created a ``Tensor`` and *encodes its history*.
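
To make the ``nn.Parameter`` point concrete, here is a minimal sketch with a made-up module called ``Scale`` (not part of PyTorch). Only the attribute wrapped in ``nn.Parameter`` should show up among the module's parameters.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):  # hypothetical toy module for illustration
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.offset = torch.zeros(3)             # plain tensor, not registered

    def forward(self, x):
        return x * self.gain + self.offset

# named_parameters() lists only what was registered on the module.
names = [name for name, _ in Scale().named_parameters()]
print(names)
```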

**At this point, we covered:**
  -  Defining a neural network
  -  Processing inputs and calling backward

**Still Left:**
  -  Computing the loss
  -  Updating the weights of the network

Loss Function
-------------
A loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target.

There are several different
`loss functions <https://pytorch.org/docs/nn.html#loss-functions>`_ under the
``nn`` package.
A simple loss is ``nn.MSELoss``, which computes the mean-squared error
between the input and the target.

For example:



In [7]:
output = model(input)
target = torch.randn(10)  # a dummy target, for example
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(1.0273, grad_fn=<MseLossBackward0>)
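
To see what ``nn.MSELoss`` computes, the sketch below compares it with the mean of squared differences written out by hand. The tensors are random stand-ins for a model output and a target, not the values from the cell above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatability

output = torch.rand(10)   # stand-in for a model output
target = torch.randn(10)  # dummy target

loss = nn.MSELoss()(output, target)

# Mean-squared error written out explicitly.
manual = ((output - target) ** 2).mean()

same = torch.allclose(loss, manual)
print(same)
```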


Now, if you follow ``loss`` in the backward direction, using its
``.grad_fn`` attribute, you will see the graph of computations that
produced it: ``input -> fc (linear) -> sigmoid -> MSELoss -> loss``.

So, when we call ``loss.backward()``, the loss is differentiated w.r.t.
every Tensor in the graph that has ``requires_grad=True``, and each of
those Tensors has the gradient accumulated into its ``.grad`` attribute.

For illustration, let us follow a few steps backward:



In [8]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Sigmoid
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # Linear (shown as AddBackward0 below)

<MseLossBackward0 object at 0x7f7ad7727c90>
<SigmoidBackward0 object at 0x7f7ad77271d0>
<AddBackward0 object at 0x7f7ad7727c10>


Backprop
--------
To backpropagate the error, all we have to do is call ``loss.backward()``.
You need to clear the existing gradients first, though; otherwise the new
gradients will be accumulated into the existing ones.
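
The accumulation behaviour is easy to observe on a single tensor. This is a minimal sketch with a made-up two-element tensor ``w``; the manual ``w.grad.zero_()`` plays the role that ``zero_grad()`` plays for a module's parameters.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # d/dw sum(w^2) = 2w, so [2, 4]

(w ** 2).sum().backward()     # no reset: the new gradient is added on top
accumulated = w.grad.clone()  # [4, 8]

w.grad.zero_()                # manual reset of the gradient buffer
(w ** 2).sum().backward()
after_reset = w.grad.clone()  # back to [2, 4]

print(first, accumulated, after_reset)
```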


Now we shall call ``loss.backward()``, and have a look at the bias
gradients of ``fc`` before and after the backward pass.



In [9]:
model.zero_grad()     # zeroes the gradient buffers of all parameters

print('fc.bias.grad before backward')
print(model.fc.bias.grad)

loss.backward()

print('fc.bias.grad after backward')
print(model.fc.bias.grad)

fc.bias.grad before backward
None
fc.bias.grad after backward
tensor([-0.0067, -0.0137,  0.0287,  0.0041, -0.1154, -0.0769,  0.0350,  0.0452,
        -0.0260, -0.0090])


Now, we have seen how to use loss functions.

**Read Later:**

  The neural network package contains various modules and loss functions
  that form the building blocks of deep neural networks. A full list with
  documentation is `here <https://pytorch.org/docs/nn>`_.

**The only thing left to learn is:**

  - Updating the weights of the network

Update the weights
------------------
The simplest update rule used in practice is Stochastic Gradient
Descent (SGD):

     ``weight = weight - learning_rate * gradient``

We can implement this using simple Python code:

.. code:: python

    learning_rate = 0.01
    with torch.no_grad():  # keep the update itself out of the autograd graph
        for p in model.parameters():
            p.sub_(p.grad * learning_rate)

However, as you use neural networks, you want to use various different
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
To enable this, we built a small package: ``torch.optim`` that
implements all these methods. Using it is very simple:



In [10]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = model(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

.. Note::

      Observe how gradient buffers had to be manually set to zero using
      ``optimizer.zero_grad()``. This is because gradients are accumulated
      as explained in the `Backprop`_ section.
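
As a sanity check, one step of plain ``optim.SGD`` (no momentum or weight decay) should match the rule ``weight = weight - learning_rate * gradient``. The sketch below assumes a small stand-in layer and random data rather than the model above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed for repeatability

layer = nn.Linear(4, 2)  # small stand-in model
optimizer = optim.SGD(layer.parameters(), lr=0.01)

x, target = torch.randn(4), torch.randn(2)

optimizer.zero_grad()
loss = nn.MSELoss()(layer(x), target)
loss.backward()

# What the rule weight - learning_rate * gradient predicts for the weight.
expected = layer.weight.detach().clone() - 0.01 * layer.weight.grad

optimizer.step()
matches = torch.allclose(layer.weight.detach(), expected)
print(matches)
```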



## Multilayer Perceptron
For this example, I use only one hidden layer, but you can add as many hidden layers as you want. A feedforward network with at least one hidden layer is called a multilayer perceptron (MLP); with several hidden layers it is also called a deep feedforward network.


After the hidden layer, I use ReLU as the activation before the information is sent to the output layer. This introduces non-linearity into the linear output of the hidden layer, as mentioned earlier. ReLU is applied element-wise: any negative value is converted to 0, and all other values stay the same. For example, if the input is [-1,0,4,-5,6], then the function returns [0,0,4,0,6].
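
The ReLU example from the paragraph above can be reproduced in a couple of lines (the variable names are chosen here for illustration):

```python
import torch

values = torch.tensor([-1.0, 0.0, 4.0, -5.0, 6.0])

# ReLU keeps non-negative entries and replaces negative ones with 0.
result = torch.relu(values)
print(result)  # tensor([0., 0., 4., 0., 6.])
```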

As the output activation function, I use Sigmoid, because the example I want to show you later is a binary classification task, meaning there are two categories to predict. Sigmoid is a good choice because it maps the output to a value between 0 and 1, which can be read as the probability of the target being label 1. As said in the previous section, the choice of activation function depends on your task. Now, let’s see a binary classifier example using this model.
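
The sketch below shows how sigmoid outputs can be turned into class labels. The input values are made up, and the 0.5 cut-off is a common convention assumed here, not something fixed by the model.

```python
import torch

logits = torch.tensor([-4.0, 0.0, 4.0])  # hypothetical pre-activation values

# sigmoid(x) = 1 / (1 + exp(-x)), which always lies strictly between 0 and 1.
probs = torch.sigmoid(logits)

# Assumed decision rule: predict label 1 when the probability is at least 0.5.
labels = (probs >= 0.5).long()
print(probs, labels)
```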



In [11]:
class Feedforward(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Feedforward, self).__init__()
        self.input_size = input_size
        self.hidden_size  = hidden_size
        self.fc1 = torch.nn.Linear(self.input_size, self.hidden_size)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(self.hidden_size, 1)
        self.sigmoid = torch.nn.Sigmoid()
    def forward(self, x):
        hidden = self.fc1(x)
        relu = self.relu(hidden)
        output = self.fc2(relu)
        output = self.sigmoid(output)
        return output

In [12]:
# CREATE RANDOM DATA POINTS
import numpy as np
from sklearn.datasets import make_blobs
def blob_label(y, label, loc): # assign labels
    target = np.copy(y)
    for l in loc:
        target[y == l] = label
    return target
x_train, y_train = make_blobs(n_samples=40, n_features=2, cluster_std=1.5, shuffle=True)
x_train = torch.FloatTensor(x_train)
y_train = torch.FloatTensor(blob_label(y_train, 0, [0]))
y_train = torch.FloatTensor(blob_label(y_train, 1, [1,2,3]))
x_test, y_test = make_blobs(n_samples=10, n_features=2, cluster_std=1.5, shuffle=True)
x_test = torch.FloatTensor(x_test)
y_test = torch.FloatTensor(blob_label(y_test, 0, [0]))
y_test = torch.FloatTensor(blob_label(y_test, 1, [1,2,3]))

In [13]:
model = Feedforward(2, 10)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)

In [14]:
model.eval()
y_pred = model(x_test)
before_train = criterion(y_pred.squeeze(), y_test)
print('Test loss before training' , before_train.item())

Test loss before training 0.715816855430603


In [15]:
model.train()
num_epochs = 50
loss_hist = []
for epoch in range(num_epochs):
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(x_train)
    # Compute Loss
    loss = criterion(y_pred.squeeze(), y_train)
    loss_hist.append(loss.item())  # keep the loss curve for later inspection

    print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
    # Backward pass
    loss.backward()
    optimizer.step()

Epoch 0: train loss: 0.4507593512535095
Epoch 1: train loss: 0.4389355778694153
Epoch 2: train loss: 0.42835408449172974
Epoch 3: train loss: 0.41887903213500977
Epoch 4: train loss: 0.4103502333164215
Epoch 5: train loss: 0.40255647897720337
Epoch 6: train loss: 0.3953876495361328
Epoch 7: train loss: 0.388751357793808
Epoch 8: train loss: 0.3825705349445343
Epoch 9: train loss: 0.3767983317375183
Epoch 10: train loss: 0.3714091181755066
Epoch 11: train loss: 0.36629945039749146
Epoch 12: train loss: 0.3614397346973419
Epoch 13: train loss: 0.3567997217178345
Epoch 14: train loss: 0.3523494601249695
Epoch 15: train loss: 0.34807300567626953
Epoch 16: train loss: 0.3439566493034363
Epoch 17: train loss: 0.33998075127601624
Epoch 18: train loss: 0.3361317217350006
Epoch 19: train loss: 0.3323979079723358
Epoch 20: train loss: 0.3287695050239563
Epoch 21: train loss: 0.32523784041404724
Epoch 22: train loss: 0.32179561257362366
Epoch 23: train loss: 0.3184363842010498
Epoch 24: train los

Let’s start training. First I switch the module to training mode with ``model.train()``. For this model the call changes nothing, since the mode only affects layers such as dropout and batch normalization, but it is a good habit. ``optimizer.zero_grad()`` sets the gradients to zero before we start backpropagation. This is a necessary step because PyTorch accumulates gradients across backward passes, so the gradients from previous epochs would otherwise be added in.
After the forward pass and the loss computation, we perform the backward pass by calling ``loss.backward()``, which computes the gradients. Then ``optimizer.step()`` updates the weights accordingly.


### Evaluation
Okay, the training is now done. Let’s see how the test loss changed after training. Again, we switch the module back to evaluation mode and check the test loss, as in the example below.



In [16]:
model.eval()
y_pred = model(x_test)
after_train = criterion(y_pred.squeeze(), y_test) 
print('Test loss after Training' , after_train.item())

Test loss after Training 0.7768269777297974


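In this run the test loss did not improve: it rose from about 0.716 before training to about 0.777 after, even though the training loss fell steadily. One likely reason is that ``x_test`` comes from a separate ``make_blobs`` call with no fixed ``random_state``, so its cluster centers need not match those of the training set.

In order to improve the model, you can try out different values for your hyperparameters (e.g. hidden dimension size, number of epochs, learning rate). You can also try changing the structure of your model (e.g. adding more hidden layers) to see if your model improves. There are a number of hyperparameter and model selection techniques in popular use, but this is the general idea behind them. In the end, you can select the hyperparameters and the model structure that give you the best performance.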
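
One simple way to act on this is a small grid search. The sketch below is only an illustration under stated assumptions: it uses made-up toy data (``make_data``), a hypothetical helper ``fit_and_score``, an ``nn.Sequential`` model with the same layer layout as ``Feedforward``, and an arbitrary grid of hidden sizes and learning rates. It scores each setting on a held-out validation set rather than on the test set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatability

def make_data(n):
    # Hypothetical toy data: the label is 1 when the two features sum to a positive value.
    x = torch.randn(n, 2)
    y = (x.sum(dim=1) > 0).float()
    return x, y

x_tr, y_tr = make_data(200)
x_val, y_val = make_data(100)

def fit_and_score(hidden_size, lr, num_epochs=100):
    # Same layout as Feedforward: Linear -> ReLU -> Linear -> Sigmoid.
    model = nn.Sequential(nn.Linear(2, hidden_size), nn.ReLU(),
                          nn.Linear(hidden_size, 1), nn.Sigmoid())
    criterion = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    model.train()
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(model(x_tr).squeeze(1), y_tr)
        loss.backward()
        optimizer.step()

    # Score on held-out data without tracking gradients.
    model.eval()
    with torch.no_grad():
        return criterion(model(x_val).squeeze(1), y_val).item()

# Try every combination in a small, arbitrary grid and keep the lowest validation loss.
results = {(h, lr): fit_and_score(h, lr) for h in (5, 10) for lr in (0.01, 0.1)}
best = min(results, key=results.get)
print(best, results[best])
```

Selecting on a validation set keeps the test set untouched, so the final test loss remains an honest estimate of performance.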