# LIS 640 Applied Deep Learning: Backpropagation


# Set up code

In [6]:
import utils
import torch
import matplotlib.pyplot as plt
%matplotlib inline


plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['font.size'] = 16
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Implementing a Two-Layer Network
In this exercise we will develop a Two-Layer Network with fully-connected layers to perform classification, and test it out on the MNIST dataset.

We train the network with a Cross-Entropy loss function. The network uses a ReLU activation function after the first fully connected layer. 

In other words, the network has the following architecture:

  input - fully connected layer - ReLU - fully connected layer - Softmax - Cross-Entropy

We denote the input example as $x$, the ground truth label as $y$, the first fully connected layer output as $h$, the activation function output as $a$, and the second fully connected layer output as $s$. The forward computation of the two-layer network is:

$h = W_1^Tx$

$a = ReLU(h)$

$s = W_2^Ta$

$loss = CrossEntropy(softmax(s),y)$ 

Here $softmax(z)_i=\frac{e^{z_i}}{\sum^K_{j=1}e^{z_j}}$ and $CrossEntropy(y',y)=-\sum_i y_i \log(y'_i)$, where $y'$ is the predicted distribution and $y$ is the one-hot ground truth label.
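
The forward computation can be sketched in a few lines of PyTorch. This is a minimal illustration, not the assignment's `nn_forward_pass`: the function names `two_layer_forward` and `softmax_cross_entropy`, the tensor sizes, and the random inputs are all assumptions made for the example. Inputs are stored as rows of a batch, so $h = W_1^Tx$ becomes `x @ W1`, and with a one-hot label the cross-entropy reduces to the negative log-probability of the true class.

```python
import torch

def two_layer_forward(x, W1, W2):
    """Return (scores, hidden activations) for a batch x of shape (N, D)."""
    h = x @ W1              # (N, H): row-wise form of h = W1^T x
    a = torch.relu(h)       # (N, H)
    s = a @ W2              # (N, C)
    return s, a

def softmax_cross_entropy(s, y):
    """Mean cross-entropy of softmax(s) against integer labels y of shape (N,)."""
    # Subtracting the row max leaves softmax unchanged but avoids overflow in exp.
    shifted = s - s.max(dim=1, keepdim=True).values
    log_probs = shifted - shifted.exp().sum(dim=1, keepdim=True).log()
    # Pick the log-probability of the true class for each example, then average.
    return -log_probs[torch.arange(s.shape[0]), y].mean()

# Illustrative sizes only: N=4 examples, D=5 features, H=3 hidden units, C=2 classes.
torch.manual_seed(0)
x = torch.randn(4, 5)
W1 = torch.randn(5, 3)
W2 = torch.randn(3, 2)
y = torch.tensor([0, 1, 1, 0])

scores, hidden = two_layer_forward(x, W1, W2)
loss = softmax_cross_entropy(scores, y)
print(scores.shape, loss.item())
```

The loss computed this way should agree with `torch.nn.functional.cross_entropy(scores, y)`, which is a convenient cross-check while debugging.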



## Play with toy data

The inputs to our network will be a batch of $N$ (`num_inputs`) $D$-dimensional vectors (`input_size`); the hidden layer will have $H$ hidden units (`hidden_size`), and we will predict classification scores for $C$ categories (`num_classes`). This means that the learnable weights of the network will have the following shapes:

*   W1: First layer weights; has shape (D, H)
*   W2: Second layer weights; has shape (H, C)

We will use the `utils.get_toy_data` function to generate random inputs, labels, and weights for a small toy model while we implement the network.
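
To make the shapes concrete, here is a hypothetical stand-in for a toy-data helper. It is not the course's `utils.get_toy_data`: the function name `make_toy_data`, the default sizes, and the scaling factors are assumptions chosen only to illustrate how $N$, $D$, $H$, and $C$ determine the tensor shapes.

```python
import torch

def make_toy_data(num_inputs=5, input_size=4, hidden_size=10, num_classes=3,
                  dtype=torch.float32, device='cpu'):
    """Hypothetical helper; sizes and scales are assumptions, not the course's values."""
    N, D, H, C = num_inputs, input_size, hidden_size, num_classes
    params = {
        'W1': 1e-4 * torch.randn(D, H, dtype=dtype, device=device),  # (D, H)
        'W2': 1e-4 * torch.randn(H, C, dtype=dtype, device=device),  # (H, C)
    }
    X = 10.0 * torch.randn(N, D, dtype=dtype, device=device)         # (N, D)
    y = torch.randint(C, (N,), device=device)                        # (N,) integer labels
    return X, y, params

toy_X, toy_y, params = make_toy_data()
for name, w in params.items():
    print(name, tuple(w.shape))
```

Passing `device='cpu'` explicitly, as this sketch does by default, keeps the example runnable on machines without CUDA.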

### Forward pass: compute scores
We want to write a function that takes as input the model weights and a batch of images and labels, and returns the loss and the gradient of the loss with respect to each model parameter.

However, rather than attempting to implement the entire function at once, we will take a staged approach and ask you to implement the full forward and backward pass one step at a time.

First we will implement the forward pass of the network in `nn_forward_pass`, which uses the weights to compute scores for all inputs.

Compute the scores and compare them with the reference answer. The total absolute difference should be smaller than 1e-10.

In [10]:
import torch

if torch.cuda.is_available():
  print('PyTorch can use GPUs!')
else:
  print('PyTorch cannot use GPUs.')

PyTorch cannot use GPUs.


In [11]:
import utils
from problem2 import nn_forward_pass

utils.reset_seed(0)
toy_X, toy_y, params = utils.get_toy_data()

# YOUR_TURN: Implement the score computation part of nn_forward_pass
scores, _ = nn_forward_pass(params, toy_X)
print('Your scores:')
print(scores)
print(scores.dtype)
print()
print('correct scores:')
correct_scores = torch.tensor([
        [ 9.7003e-08, -1.1143e-07, -3.9961e-08],
        [-7.4297e-08,  1.1502e-07,  1.5685e-07],
        [-2.5860e-07,  2.2765e-07,  3.2453e-07],
        [-4.7257e-07,  9.0935e-07,  4.0368e-07],
        [-1.8395e-07,  7.9303e-08,  6.0360e-07]], dtype=torch.float32, device=scores.device)
print(correct_scores)
print()

# The difference should be very small. We get < 1e-10
scores_diff = (scores - correct_scores).abs().sum().item()
print('Difference between your scores and correct scores: %.2e' % scores_diff)

AssertionError: Torch not compiled with CUDA enabled

### Forward pass: compute loss
Now, we implement the first part of `nn_forward_backward` that computes the loss.

For the data loss, we compute the Cross-Entropy loss. Note that the final loss should be the average over the $N$ input examples.

Run the following cell to check your implementation.

We compute the loss for the toy data, and compare with the answer computed by our implementation. The difference between the correct and computed loss should be less than `1e-4`.
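
A quick sanity check helps interpret the reference value used below. The reference loss is close to $\ln 3 \approx 1.0986$, which is what softmax cross-entropy gives when the predicted distribution over 3 classes is nearly uniform. That is plausibly the case here, since the reference scores for the toy data are on the order of 1e-7. The sketch below uses made-up scores and labels to illustrate the effect.

```python
import math
import torch
import torch.nn.functional as F

num_classes = 3
torch.manual_seed(0)
# Scores of magnitude 1e-7, comparable in scale to the toy reference scores.
scores = 1e-7 * torch.randn(5, num_classes)
labels = torch.tensor([0, 1, 2, 2, 1])   # arbitrary labels, for illustration only

loss = F.cross_entropy(scores, labels)   # mean over the 5 examples
print(loss.item(), math.log(num_classes))
```

If your loss on the toy data is far from $\ln C$ while the scores are tiny, the likely culprits are a missing average over $N$ or a softmax taken along the wrong dimension.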

In [None]:
import utils
from problem2 import nn_forward_backward

utils.reset_seed(0)
toy_X, toy_y, params = utils.get_toy_data()

# YOUR_TURN: Implement the loss computation part of nn_forward_backward
loss, _ = nn_forward_backward(params, toy_X, toy_y)
print('Your loss: ', loss.item())
correct_loss = 1.0986121892929077
print('Correct loss: ', correct_loss)
diff = abs(correct_loss - loss.item())  # absolute difference, so the sign cannot hide an error

# should be very small, we get < 1e-4
print('Difference: %.4e' % diff)

### Backward pass
Now implement the backward pass for the entire network in `nn_forward_backward`.

After doing so, we will use numeric gradient checking to see whether the analytic gradient computed by our backward pass matches a numeric gradient.

We will use the functions `utils.compute_numeric_gradient` and `utils.rel_error` to help with numeric gradient checking.
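
For intuition, numeric gradient checking can be sketched with a central difference, $\frac{f(x+h)-f(x-h)}{2h}$, applied to one entry at a time. The helpers below are hypothetical stand-ins and may differ from the course's `utils.compute_numeric_gradient` and `utils.rel_error` in signature and details; the step size `h` and the guard `eps` are assumptions.

```python
import torch

def numeric_gradient(f, x, h=1e-6):
    """Central-difference estimate of df/dx; x is perturbed in place and restored."""
    flat_x = x.view(-1)                   # shares memory with x (x must be contiguous)
    grad = torch.zeros_like(x)
    flat_grad = grad.view(-1)
    for i in range(flat_x.numel()):
        old = flat_x[i].item()
        flat_x[i] = old + h
        f_plus = f(x).item()
        flat_x[i] = old - h
        f_minus = f(x).item()
        flat_x[i] = old                   # restore the original entry
        flat_grad[i] = (f_plus - f_minus) / (2 * h)
    return grad

def relative_error(a, b, eps=1e-10):
    """Largest elementwise |a - b| / (|a| + |b|), guarded against division by zero."""
    return ((a - b).abs() / (a.abs() + b.abs()).clamp(min=eps)).max().item()

w = torch.tensor([1.0, -2.0, 3.0], dtype=torch.float64)
f = lambda t: (t ** 2).sum()              # analytic gradient is 2 * t
print(relative_error(numeric_gradient(f, w), 2 * w))
```

Because the tensor is perturbed in place, a loss function that closes over the same tensor sees each perturbation even if it ignores its argument. Double precision is used because finite differences lose accuracy quickly in `float32`.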


Hint: For gradient computation of Softmax Cross-Entropy loss, please refer to https://www.michaelpiseno.com/blog/2021/softmax-gradient/.

Next, compute the gradient of the loss with respect to the variables `W1` and `W2`. Since you (hopefully!) have a correctly implemented forward pass, you can debug your backward pass using a numeric gradient check.

You should see relative errors less than `1e-4` for all parameters.
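
As a rough guide to what the backward pass involves, the sketch below applies the chain rule to the architecture above and compares the result with PyTorch autograd. It is one possible arrangement under stated assumptions, not the reference solution: the function name `forward_backward`, the tensor sizes, and the absence of a regularization term are all choices made for the example. The key fact is that the gradient of the mean softmax cross-entropy with respect to the scores is $(softmax(s) - onehot(y))/N$.

```python
import torch

def forward_backward(W1, W2, x, y):
    """Loss and gradients for h = x W1, a = ReLU(h), s = a W2, softmax cross-entropy."""
    N = x.shape[0]
    h = x @ W1
    a = h.clamp(min=0)
    s = a @ W2
    probs = torch.softmax(s, dim=1)
    loss = -torch.log(probs[torch.arange(N), y]).mean()

    ds = probs.clone()
    ds[torch.arange(N), y] -= 1.0         # softmax(s) - one_hot(y)
    ds /= N                               # the loss is a mean over N examples
    dW2 = a.t() @ ds                      # (H, C)
    da = ds @ W2.t()                      # (N, H)
    dh = da * (h > 0)                     # ReLU passes gradient only where h > 0
    dW1 = x.t() @ dh                      # (D, H)
    return loss, {'W1': dW1, 'W2': dW2}

torch.manual_seed(0)
x = torch.randn(5, 4, dtype=torch.float64)
y = torch.tensor([0, 2, 1, 1, 0])
W1 = torch.randn(4, 6, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(6, 3, dtype=torch.float64, requires_grad=True)

with torch.no_grad():                     # manual gradients need no autograd graph
    loss, grads = forward_backward(W1, W2, x, y)

# Reference gradients from autograd on the same computation.
auto_loss = torch.nn.functional.cross_entropy(torch.relu(x @ W1) @ W2, y)
auto_W1, auto_W2 = torch.autograd.grad(auto_loss, [W1, W2])
print((grads['W1'] - auto_W1).abs().max().item())
print((grads['W2'] - auto_W2).abs().max().item())
```

Comparing against autograd is a complement to the numeric check, not a replacement: it is faster, but it only applies when the forward pass is written in differentiable PyTorch operations.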

In [None]:
import utils
from problem2 import nn_forward_backward

utils.reset_seed(0)

toy_X, toy_y, params = utils.get_toy_data(dtype=torch.float64)

# YOUR_TURN: Implement the gradient computation part of nn_forward_backward
#            When you implement the gradient computation part, you may need to 
#            implement the `hidden` output in nn_forward_pass, as well.
loss, grads = nn_forward_backward(params, toy_X, toy_y)

for param_name, grad in grads.items():
  param = params[param_name]
  f = lambda w: nn_forward_backward(params, toy_X, toy_y)[0]
  grad_numeric = utils.compute_numeric_gradient(f, param)
  error = utils.rel_error(grad, grad_numeric)
  print('%s max relative error: %e' % (param_name, error))

### Train the network
To train the network we will use stochastic gradient descent (SGD). 

Look at the function `nn_train` and fill in the missing sections to implement the training procedure.

Once you have implemented the method, run the code below to train a two-layer network on toy data. Your final training loss should be less than 1.0.
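
The overall shape of a mini-batch SGD loop can be sketched as follows. This is an illustration under assumptions, not the `nn_train` function: it uses autograd for the gradients instead of a hand-written backward pass, omits regularization, and trains on synthetic data whose sizes and labels are invented for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, H, C = 64, 4, 10, 3                       # illustrative sizes, not the toy data
X = torch.randn(N, D)
y = (X @ torch.randn(D, C)).argmax(dim=1)       # labels from a random linear teacher
params = {
    'W1': (0.1 * torch.randn(D, H)).requires_grad_(),
    'W2': (0.1 * torch.randn(H, C)).requires_grad_(),
}

learning_rate, batch_size, num_iters = 1e-1, 16, 200
loss_history = []
for it in range(num_iters):
    idx = torch.randint(N, (batch_size,))       # sample a mini-batch with replacement
    scores = torch.relu(X[idx] @ params['W1']) @ params['W2']
    loss = F.cross_entropy(scores, y[idx])
    grads = torch.autograd.grad(loss, list(params.values()))
    with torch.no_grad():
        for w, g in zip(params.values(), grads):
            w -= learning_rate * g              # vanilla SGD step
    loss_history.append(loss.item())

print(loss_history[0], loss_history[-1])
```

Mini-batch losses are noisy, so it is the trend of `loss_history` over many iterations, not any single value, that indicates whether training is working.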

In [None]:
import utils
from utils import get_toy_data
from problem2 import nn_forward_backward, nn_train, nn_predict

utils.reset_seed(0)
toy_X, toy_y, params = get_toy_data()

# YOUR_TURN: Implement the nn_train function.
#            You may need to check nn_predict function (the "pred_func") as well.
stats = nn_train(params, nn_forward_backward, nn_predict, toy_X, toy_y, toy_X, toy_y,
                 learning_rate=1e-1, reg=1e-6,
                 num_iters=200, verbose=False)

print('Final training loss: ', stats['loss_history'][-1])

# plot the loss history
plt.plot(stats['loss_history'], 'o')
plt.xlabel('Iteration')
plt.ylabel('training loss')
plt.title('Training Loss history')
plt.show()

## Testing our NN on a real dataset: MNIST
Now that you have implemented a two-layer network that passes gradient checks and works on toy data, it's time to load up our MNIST data so we can use it to train a classifier on a real dataset.

In [None]:
import utils


# Load the MNIST training and validation splits.
utils.reset_seed(0)
x_train, y_train, x_val, y_val = utils.load_data()

### Wrap all functions into a class
We will use the class `TwoLayerNet` to represent instances of our network. The network parameters are stored in the instance variable `self.params` where keys are string parameter names and values are PyTorch tensors.




### Train a network
To train our network we will use SGD. In addition, we will adjust the learning rate with an exponential learning rate schedule as optimization proceeds; after each epoch, we will reduce the learning rate by multiplying it by a decay rate.
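
Under this schedule the learning rate after $e$ completed epochs is $lr_0 \cdot decay^e$. The sketch below shows the bookkeeping in isolation; the training-set size `num_train` is an assumed value for illustration, and the actual SGD step is left as a comment.

```python
learning_rate = 1e-1
learning_rate_decay = 0.95
num_train, batch_size, num_iters = 10000, 1000, 500   # num_train is an assumed size
iterations_per_epoch = max(num_train // batch_size, 1)

lr_history = []
for it in range(num_iters):
    # ... one SGD step with the current learning_rate would go here ...
    if (it + 1) % iterations_per_epoch == 0:          # an epoch just finished
        learning_rate *= learning_rate_decay
    lr_history.append(learning_rate)

print(lr_history[-1])
```

With these assumed sizes an epoch is 10 iterations, so 500 iterations correspond to 50 decay steps. A decay rate close to 1 shrinks the step size gently, which lets early epochs make fast progress while later epochs settle.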

In [None]:
import utils
from problem2 import TwoLayerNet

input_size = 1 * 28 * 28
hidden_size = 36
num_classes = 10

# fix random seed before we generate a set of parameters
utils.reset_seed(0)
net = TwoLayerNet(input_size, hidden_size, num_classes, dtype=torch.float32, device='cpu')

# Train the network
stats = net.train(x_train, y_train,
                  x_val, y_val,
                  num_iters=500, batch_size=1000,
                  learning_rate=1e-1, learning_rate_decay=0.95,
                  verbose=True)

# Predict on the validation set
y_val_pred = net.predict(x_val)
val_acc = 100.0 * (y_val_pred == y_val).double().mean().item()
print('Validation accuracy: %.2f%%' % val_acc)