In [None]:
from __future__ import print_function
%matplotlib inline

Attribution: 

   * Most material is adapted from [Justin Johnson's PyTorch Examples](https://github.com/jcjohnson/pytorch-examples)

# Feedforward Neural Networks

## Warm-up: numpy

We will first build a simple neural network in numpy, and then rebuild it in PyTorch to see how much more convenient it is to work in that framework.

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [None]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.dot(w1)
  h_relu = np.maximum(h, 0)
  y_pred = h_relu.dot(w2)
  
  # Compute and print loss
  loss = np.square(y_pred - y).sum()
  print(t, loss)
  
  # Backprop to compute gradients of the loss with respect to w1 and w2
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.T.dot(grad_y_pred)
  grad_h_relu = grad_y_pred.dot(w2.T)
  grad_h = grad_h_relu.copy()
  grad_h[h < 0] = 0
  grad_w1 = x.T.dot(grad_h)
 
  # Update weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

There are a few issues with the numpy implementation of our network.

First of all, numpy cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of [50x or greater](https://github.com/jcjohnson/cnn-benchmarks), so pure numpy is not really enough for large-scale learning.

Moreover, we had to derive and implement the gradients by hand. This is time-consuming and error-prone. For example, if we wanted to change the loss function, or use a different kind of nonlinearity in the hidden layer, we would need to update and test both the forward and the backward pass.
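
One common way to test a hand-written backward pass is to compare it against a finite-difference estimate of the gradient. The following is a minimal sketch of that idea, not part of the original example: the helper names `loss_and_grads`, `numerical_grad`, and `rel_error`, the tiny layer sizes, and the step size `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward and hand-written backward pass of the two-layer ReLU network."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

def numerical_grad(f, w, eps=1e-5):
    """Central-difference estimate of d f() / d w, perturbing w in place."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def rel_error(a, b):
    """Largest absolute difference, scaled by the largest gradient entry."""
    return np.abs(a - b).max() / max(np.abs(a).max(), np.abs(b).max())

# Tiny sizes (an assumption for illustration) keep the per-weight loop fast
rng = np.random.RandomState(0)
N, D_in, H, D_out = 4, 5, 3, 2
x, y = rng.randn(N, D_in), rng.randn(N, D_out)
w1, w2 = rng.randn(D_in, H), rng.randn(H, D_out)

_, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)
f = lambda: loss_and_grads(x, y, w1, w2)[0]

err_w1 = rel_error(grad_w1, numerical_grad(f, w1))
err_w2 = rel_error(grad_w2, numerical_grad(f, w2))
print(err_w1, err_w2)
```

If the analytic and numerical gradients disagree by more than a small relative error, the backward pass most likely contains a bug. A check like this has to be repeated every time the forward pass changes.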

Let's see how painful that can be.

### Exercises

1. This network uses a squared error function. Change this to use an exponential cost of the form:

   $ \sum_n \exp \left( \sum_j \left| y_j^{\text{pred}} - y_j \right| \right)$

   where $n$ indexes examples and $j$ indexes output dimensions.

2. Now change your hidden layer so that it uses sigmoid units, $a(x_i) = 1/ \left( 1 + \exp(-x_i) \right)$, instead of ReLU units.

## PyTorch: Tensors

PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to cast it to a new datatype.
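
As a small sketch of what this cast looks like, the snippet below moves a Tensor to the GPU only when one is available and otherwise stays on the CPU. The variable names are assumptions for illustration; the `.type(...)` call is the same one used in the example that follows.

```python
import torch

# A CPU tensor of 32-bit floats
x_cpu = torch.randn(3, 4).type(torch.FloatTensor)
print(x_cpu.is_cuda)

if torch.cuda.is_available():
    # Casting to the CUDA tensor type copies the data to the GPU
    x_gpu = x_cpu.type(torch.cuda.FloatTensor)
    print(x_gpu.is_cuda)
    # Casting back copies the data to the CPU again
    x_back = x_gpu.type(torch.FloatTensor)
else:
    # No GPU present: keep working with the CPU tensor
    x_back = x_cpu
```

Guarding the cast with `torch.cuda.is_available()` lets the same notebook run on machines with and without a GPU.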

We will now re-write our example above to use PyTorch Tensors. Let's first pretend we don't know anything about Variables and automatic differentiation.

In [None]:
import torch

dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss
  loss = (y_pred - y).pow(2).sum()
  print(t, loss)

  # Backprop to compute gradients of the loss with respect to w1 and w2
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

So what we've gained here is the ability to run our computations on either CPU or GPU, but we haven't gained any flexibility or time savings from the perspective of implementing our neural net.

In the above examples, we had to manually implement both the forward and backward passes of our neural network. This isn't a big deal for a small two-layer network, but it can quickly get hairy for large, complex networks.

By using PyTorch Variables instead of Tensors to implement our two-layer network, we no longer need to manually implement the backward pass through the network.
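
To make the idea concrete without giving away the exercise below, here is a minimal sketch on a toy function rather than the network. The function $(w \cdot x)^2$ and the values are assumptions chosen so the gradient is easy to verify by hand. In PyTorch 0.4 and later, `Variable` has been merged into `Tensor`, so the wrapper still works but simply returns a Tensor.

```python
import torch
from torch.autograd import Variable

# x is fixed data; w is the parameter we want gradients for
x = Variable(torch.Tensor([1.0, 2.0, 3.0]), requires_grad=False)
w = Variable(torch.Tensor([0.5, -1.0, 2.0]), requires_grad=True)

# Forward pass only: loss = (w . x)^2
loss = (w * x).sum().pow(2)

# Autograd walks the recorded graph and fills in w.grad
loss.backward()

# Hand-derived gradient for comparison: d loss / d w = 2 * (w . x) * x
manual_grad = 2.0 * (w.data * x.data).sum() * x.data
print(w.grad.data, manual_grad)
```

Only the forward computation is written out; the single call to `loss.backward()` replaces the hand-derived gradient line.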


### Exercises

1. Re-write the above Tensor-based example to use Variables and autograd:

  - You will need to do `from torch.autograd import Variable`
  - You will need to replace all Tensor creation by Variable creation
  - You will need to replace all the lines of gradient computation by a single call to `loss.backward()`
  - You can then depend directly on the gradient attributes of your Variables, e.g. `w1.grad.data` and `w2.grad.data`
  - You should manually zero out your gradients after performing a weight update (otherwise they will be accumulated)
  - Make sure you don't mix Tensors and Variables in computation
  
2. If you haven't yet done so, refactor your code to eliminate intermediate values: `h`, `h_relu`. Note that we no longer need to keep references to intermediate values when we are not implementing the backward pass by hand.
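
The gradient accumulation mentioned in exercise 1 can be seen on a toy example. This is a sketch with assumed values, not the exercise solution: calling `backward()` twice without zeroing adds the second gradient to the first.

```python
import torch
from torch.autograd import Variable

w = Variable(torch.ones(3), requires_grad=True)

# First backward pass: d/dw sum(2 * w) = 2 for every entry
(w * 2.0).sum().backward()
first = w.grad.data.clone()

# Second backward pass without zeroing: gradients are added to w.grad
(w * 2.0).sum().backward()
accumulated = w.grad.data.clone()

# Zero the gradient in place, then run backward again
w.grad.data.zero_()
(w * 2.0).sum().backward()
after_zero = w.grad.data.clone()

print(first, accumulated, after_zero)
```

In a training loop, this is why the gradients are zeroed once per iteration, after the weight update.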

**Further reading:**

You can read about [Defining your own autograd functions](http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions).