# Lab 2 Exercises for COMP 691 (Deep Learning)

In this lab we will learn some basics of PyTorch. You will:
- Implement a feedforward neural network using different implementation styles.
- Understand how to use torch autograd for calculating gradients.
- Learn how to use GPUs to speed up computation.

Save your answers for this lab as they will be used for part of Lab 3.

Start by making a **copy** of this notebook in your Google Colab.


# Exercise 1: Loading the dataset

Below we will create a dataloader for the MNIST training data using the torchvision package (following e.g. https://github.com/pytorch/examples/blob/master/mnist/main.py#L112-L120).

The dataloader iterates over the training set and outputs **mini-batches of 256** image samples.

**Note**: you do not need to use the image labels in the rest of this lab since you will not be doing any training.

Remarks about using GPU:

- The "device" variable allows us to select which device to place the data on. Modify your Colab (or local environment) to use a GPU.

- To use a GPU in Google Colab, go to Runtime, then choose "change runtime type" and select GPU as the hardware accelerator.

- In your Google Colab notebook, set the variable device to "cuda" and rerun the cell below so that the data is placed on the GPU inside the for loop.
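
If a GPU is not always available (for example when running locally), a common pattern is to fall back to the CPU. The following is a minimal sketch that relies only on the `torch.cuda.is_available()` check; `fake_batch` is a random stand-in for an MNIST mini-batch, not real data.

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for one MNIST mini-batch (random values, not real images).
fake_batch = torch.rand(256, 1, 28, 28)
fake_batch = fake_batch.to(device)

print(device, fake_batch.shape)
```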


In [None]:
from torchvision import datasets,transforms
import torch
dataset1 = datasets.MNIST('../data', train=True, download=True, transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(dataset1,
                                           batch_size=256,
                                           shuffle=True,
                                           drop_last=True)

device='cuda'

for (data, target) in train_loader:
  data = data.to(device)
  target = target.to(device)
print(data.shape)
print(target.shape)

torch.Size([256, 1, 28, 28])
torch.Size([256])


If you ran the code cell above, you will notice that the data is a tensor of shape ([256, 1, 28, 28]) = (batch_size, number of color channels, image height in pixels, image width in pixels).

# Exercise 2: Building a neural network from the ground up!

Network Architecture:
- Using only torch primitives (e.g. [torch.matmul](https://pytorch.org/docs/stable/generated/torch.matmul.html), [torch.relu](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html), etc.), implement a simple feedforward neural network with 2 hidden layers that takes MNIST digits (28x28) as input and outputs **a single scalar value**, i.e., the class. Avoid using any functions from the torch.nn module.


- You may select the hidden layer width (greater than 20) and activations (tanh, relu, sigmoid, others) as desired.  

- A typical layer will transform its inputs as follows: $y = σ (Wx+b) $, where $σ$ is the non-linear activation function.

- Initialize the weights with [uniform random values](https://pytorch.org/docs/stable/generated/torch.rand.html) in the range -1 to 1 and the [biases at 0](https://pytorch.org/docs/stable/generated/torch.zeros.html).
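
The initialization described in the last bullet can be sketched as follows. `torch.rand` samples from [0, 1), so multiplying by 2 and subtracting 1 maps the values to [-1, 1). The helper name `init_layer` and the width 32 are illustrative assumptions, not part of the lab template.

```python
import torch

def init_layer(in_features, out_features, device="cpu"):
    """Hypothetical helper: uniform weights in [-1, 1) and zero biases."""
    # torch.rand samples from [0, 1); rescale to [-1, 1).
    W = torch.rand(in_features, out_features, device=device) * 2 - 1
    # Enable gradients only after the arithmetic and device placement, so W
    # stays a leaf tensor and autograd stores its gradient in W.grad.
    W.requires_grad_()
    b = torch.zeros(1, out_features, device=device, requires_grad=True)
    return W, b

W0, b0 = init_layer(28 * 28, 32)
print(W0.shape, b0.shape, W0.is_leaf, b0.is_leaf)
```

Creating the tensor on the target device first and calling `requires_grad_()` last avoids ending up with a non-leaf tensor, whose `.grad` would not be populated by `.backward()`.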

Data:

Using the data obtained from Exercise 1, make a forward pass through the dataset in mini-batches of 256 (feed the network data). To check you are on the right track, the shape of your output should be ([256]).

**Hint:** Remember that the goal is to feed the MNIST images and get a class label for each image. In this exercise there is no training so do not expect that the label will be meaningful/correct!

Pay attention to the shape of the input and how it changes as it passes from one layer to the next in the forward pass. Ex: (256, 28*28) -> (256, hidden_size_1) -> (256, hidden_size_2) -> (256, 1). This will help you when constructing the layers of the network.
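
This shape bookkeeping can be checked on a random stand-in batch before touching the real data. The following is a sketch only: the hidden sizes 32 and 24 are arbitrary choices, and the final `squeeze` is one possible way to obtain an output of shape `[256]`.

```python
import torch

batch = torch.rand(256, 1, 28, 28)          # stand-in for an MNIST mini-batch
hidden_size_1, hidden_size_2 = 32, 24       # arbitrary widths (> 20)

W0 = torch.rand(28 * 28, hidden_size_1) * 2 - 1
W1 = torch.rand(hidden_size_1, hidden_size_2) * 2 - 1
W2 = torch.rand(hidden_size_2, 1) * 2 - 1
b0 = torch.zeros(1, hidden_size_1)
b1 = torch.zeros(1, hidden_size_2)
b2 = torch.zeros(1, 1)

x = batch.view(-1, 28 * 28)                 # (256, 784)
h1 = torch.relu(torch.matmul(x, W0) + b0)   # (256, 32)
h2 = torch.relu(torch.matmul(h1, W1) + b1)  # (256, 24)
out = torch.matmul(h2, W2) + b2             # (256, 1)
out_flat = out.squeeze(-1)                  # (256,)

for name, t in [("x", x), ("h1", h1), ("h2", h2), ("out", out), ("out_flat", out_flat)]:
    print(name, tuple(t.shape))
```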



In [None]:
import torch

## Initialize and track the parameters using a list or dictionary
## Each tensor is created directly on `device` and gradients are enabled last,
## so every entry is a leaf tensor whose .grad is filled in by .backward().
param_dict = {
    "W0": (torch.rand(28*28, 256, device=device) * 2 - 1).requires_grad_(),
    "b0": torch.zeros(1, 256, device=device, requires_grad=True),
    "W1": (torch.rand(256, 64, device=device) * 2 - 1).requires_grad_(),
    "b1": torch.zeros(1, 64, device=device, requires_grad=True),
    "W2": (torch.rand(64, 1, device=device) * 2 - 1).requires_grad_(),
    "b2": torch.zeros(1, 1, device=device, requires_grad=True),
    }

## Make sure your parameters in param_dict require gradient for training the network later!



## Define the network
def my_nn(input, param_dict):
    r"""Performs a single forward pass of a Neural Network with the given
    parameters in param_dict.

    Args:
        input (torch.tensor): Batch of images of shape (B, 1, H, W), where B
            is the number of input samples, and H and W are the image height
            and width respectively.
        param_dict (dict of torch.tensor): Dictionary containing the parameters
            of the neural network. Expects dictionary keys to be of format
            "W#" and "b#" where # is the layer number.

    Returns:
        torch.tensor: Neural network output of shape (B, 1)
    """
    #Reshape the input image from HxW to a flat vector
    x = input.view(-1 , 28*28)

    #Your code here
    linear1 = torch.relu(torch.matmul(x, param_dict['W0']) + param_dict['b0'])
    linear2 = torch.relu(torch.matmul(linear1, param_dict['W1']) + param_dict['b1'])
    output = torch.matmul(linear2, param_dict['W2']) + param_dict['b2']

    return output


## Perform forward pass
for (data, target) in train_loader:
  data, target = data.to(device), target.to(device)
  #forward pass
  output = my_nn(data, param_dict)
output

tensor([[-193.2576],
        [ -35.0444],
        [-112.0623],
        [ -67.1052],
        [-145.6943],
        [-177.6809],
        [-107.3818],
        [ -90.8652],
        [-139.0128],
        [-260.3366],
        [-181.4067],
        [-140.9895],
        [-101.0420],
        [ -63.6918],
        [-149.7373],
        [-163.3979],
        [  17.9486],
        [ -74.0628],
        [-185.5281],
        [ -49.0663],
        [ -64.9282],
        [-218.9828],
        [-181.3718],
        [-105.0346],
        [ -97.7683],
        [-117.8391],
        [-176.0170],
        [-112.1204],
        [-146.5531],
        [-215.6198],
        [-233.8642],
        [-167.5402],
        [-116.1994],
        [-137.9766],
        [ -49.9337],
        [ -94.0929],
        [-163.2580],
        [-339.6279],
        [-110.3439],
        [-195.0311],
        [-215.9258],
        [-154.2622],
        [-122.4509],
        [-169.9740],
        [-222.4158],
        [-146.5886],
        [ -63.8838],
        [-143

# Exercise 3: Implementing the same network using torch.nn.Module

Implement a new torch.nn.Module that performs the equivalent of the network in Exercise 2 and call it "model".

Initialize it with the same weights so that you can make a fair comparison between the two networks. For a layer such as **nn.Linear**(in_features, out_features), the way to do this is through **weight.data** = your desired weights. You can do a similar thing with the bias.

Validate that the outputs of this network are the same as those of the network in Exercise 2 on the MNIST training set.

In [None]:
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(28*28, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

model.to(device)
data = data.to(device)
output = model(data)
output

tensor([[ 1.5849e-02],
        [-2.5824e-02],
        [ 2.0500e-02],
        [-1.0753e-02],
        [ 7.8150e-03],
        [ 2.5953e-02],
        [-4.7853e-03],
        [ 3.4125e-02],
        [ 3.7620e-02],
        [ 7.1632e-02],
        [ 2.6456e-02],
        [ 6.3774e-02],
        [ 6.3280e-02],
        [ 4.0772e-03],
        [ 4.3197e-02],
        [-2.3406e-02],
        [ 1.3138e-02],
        [-2.7769e-02],
        [ 5.2831e-02],
        [-3.2046e-02],
        [ 2.0337e-03],
        [ 1.8843e-02],
        [ 4.6096e-02],
        [-3.2943e-02],
        [ 1.2248e-02],
        [ 1.2566e-02],
        [ 5.9859e-02],
        [ 1.2059e-02],
        [ 2.8155e-02],
        [ 6.6387e-02],
        [ 4.8159e-02],
        [ 1.4221e-02],
        [-1.0578e-02],
        [ 2.2787e-02],
        [ 4.2061e-02],
        [ 4.0826e-02],
        [ 8.6932e-03],
        [ 8.0189e-02],
        [ 2.7961e-02],
        [ 1.6780e-02],
        [-1.1999e-02],
        [ 8.3001e-03],
        [ 1.0009e-02],
        [ 1

## Exercise 3.1: Validating that the two implementations are equal.

First you will need to make sure the param_dict from Exercise 2 and the nn module version have the same parameters (weights and biases).

You can do this for example using: "**model.linear1.weight.data** = copy.deepcopy(param_dict['W0'].data.T)". For a model built with torch.nn.Sequential, index the layer instead, e.g. model[1].

**Note**: we do a deepcopy just to make sure this model is separate from the one in the above cell.
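
The transpose in the hint is needed because `nn.Linear` stores its weight with shape `(out_features, in_features)` and computes `x @ weight.T + bias`, whereas the Exercise 2 convention is `x @ W + b` with `W` of shape `(in_features, out_features)`. The sketch below uses made-up sizes (4 inputs, 3 outputs) and, as an alternative to assigning `.data`, copies the values in place with `copy_` under `torch.no_grad()`.

```python
import torch

torch.manual_seed(0)
W = torch.rand(4, 3) * 2 - 1     # Exercise 2 convention: (in_features, out_features)
b = torch.zeros(1, 3)

layer = torch.nn.Linear(4, 3)    # layer.weight has shape (3, 4)
with torch.no_grad():
    layer.weight.copy_(W.t())        # transpose to (out_features, in_features)
    layer.bias.copy_(b.squeeze(0))   # (1, 3) -> (3,)

x = torch.rand(5, 4)
manual = torch.matmul(x, W) + b
print(torch.allclose(layer(x), manual, atol=1e-6))
```

Because `copy_` writes into the existing parameter storage, the layer keeps its own tensors and does not share memory with `W` or `b`.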

In [None]:
model

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=256, bias=True)
  (2): ReLU()
  (3): Linear(in_features=256, out_features=64, bias=True)
  (4): ReLU()
  (5): Linear(in_features=64, out_features=1, bias=True)
)

In [None]:
for name, param in model.named_parameters():
  print(name)

1.weight
1.bias
3.weight
3.bias
5.weight
5.bias


In [None]:
import copy

with torch.no_grad():
  # nn.Linear stores weights as (out_features, in_features), hence the transpose.
  # squeeze(0) drops only the leading dimension, so each bias keeps shape (out_features,).
  model[1].weight.data = copy.deepcopy(param_dict['W0'].t())
  model[1].bias.data = copy.deepcopy(param_dict['b0'].squeeze(0))

  model[3].weight.data = copy.deepcopy(param_dict['W1'].t())
  model[3].bias.data = copy.deepcopy(param_dict['b1'].squeeze(0))

  model[5].weight.data = copy.deepcopy(param_dict['W2'].t())
  model[5].bias.data = copy.deepcopy(param_dict['b2'].squeeze(0))

# Run the assert statement to check if the outputs are roughly equal
for data, _ in train_loader:
    data = data.to(device)  # labels are not needed for this check
    assert (((model(data) - my_nn(data, param_dict)) ** 2).mean() < 1e-4)

print("All Clear!")

All Clear!


# Exercise 4: Calculating gradients.

For a single mini-batch of 256 samples (you can select any mini-batch), compute the gradient of the average of the neural network outputs (over the mini-batch) w.r.t. the weights.

### Let's break this down:

First you will need to get the mean/average of the outputs. Then you need to find the gradient of this mean w.r.t. the weights.

To find the gradient you can use torch autograd, simply by calling **.backward()** on the desired variable.

Your task is to print the gradients for the first layer weight and bias. You can use the model defined in either Exercise 2 or Exercise 3 for this.

**Note**: The network here is $f: \mathbf{R}^{HW}\rightarrow\mathbf{R}$, which means that your input layer has $HW$ neurons ($HW$ features) and your output layer has one output neuron (one scalar output = class). Since each batch has $256$ samples, the mean can be obtained by $o=\frac{1}{256}\sum_{i=0}^{255}f(x_i)$ or simply calling **.mean()** on the output of the network. You are asked to find $\nabla_w o$ and $\nabla_b o$. To access the gradient of each parameter you can call **.grad**.
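
As a sanity check on what `.backward()` computes, consider a single linear layer $f(x) = xw + b$ (a simplification, not the lab network). For the batch mean $o$, the gradient with respect to $b$ is 1 and the gradient with respect to $w$ is the mean input vector. The sketch below uses made-up sizes and compares autograd against these closed forms.

```python
import torch

torch.manual_seed(0)
x = torch.rand(256, 10)                            # stand-in batch with 10 features
w = (torch.rand(10, 1) * 2 - 1).requires_grad_()   # leaf tensor that tracks gradients
b = torch.zeros(1, 1, requires_grad=True)

o = (torch.matmul(x, w) + b).mean()                # scalar: average output over the batch
o.backward()                                       # fills w.grad and b.grad

expected_w_grad = x.mean(dim=0, keepdim=True).t()  # mean input, shape (10, 1)
print(torch.allclose(w.grad, expected_w_grad, atol=1e-5))
print(torch.allclose(b.grad, torch.ones(1, 1), atol=1e-5))
```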

In [None]:
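model.zero_grad()  # clear gradients left over from a previous run of this cell
output = model(data)
output.mean().backward()

# Print the gradients of the first linear layer only
for name, param in model.named_parameters():
    if name in ('1.weight', '1.bias'):
        print(name, param.grad)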

1.weight tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
1.bias tensor([-0.0433, -0.0967,  0.3761,  1.0731,  0.0091,  0.2243,  0.0079, -0.9251,
         0.5016,  0.3231, -0.7477,  0.1317,  0.9125, -0.2370, -0.3363, -0.7051,
         0.2523,  1.0858,  0.2168,  0.7179,  0.5952, -0.1126, -0.2810, -0.1135,
        -1.2988,  0.0150, -0.7024,  0.2878, -0.5861,  0.2274, -1.0872,  1.5264,
         0.0364,  0.9289, -0.5403, -0.0329,  0.4533, -0.0711,  0.0228, -0.0377,
        -0.4808, -0.2498, -1.0647, -0.4170, -0.2750,  0.7442,  0.4052, -0.9397,
        -0.2055,  0.2563, -0.0211, -0.2007, -1.1359,  0.9665, -0.3849, -0.1859,
        -0.6193, -1.8033, -0.1719, -0.3468,  0.4023, -0.6195,  0.8966, -0.5003,
         0.8709,  0.2455,  0.2526,  0.4900,  0.5987, -0.0866, -0.0692, -0

# Exercise 5: CPU or GPU?

Below you will find code for comparing the speed of a model on CPU and GPU, as well as comparing the speed of a forward pass to a forward/backward pass. Instantiate a version of your model from Exercise 3 (preferably a larger version, e.g. width 100 or 500) and run the timing code.

Write 1-2 sentences to summarize your observations about the relative speeds of CPU/GPU and forward/backward.
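
CUDA kernels are launched asynchronously, which is why the GPU timing code in this exercise calls `torch.cuda.synchronize()` before reading the clock. A hedged sketch of a reusable timer follows; `time_forward` is a hypothetical helper name, and the small stand-in model and random batch are illustrative only.

```python
import time
import torch

def time_forward(model, data, n_iters=10):
    """Hypothetical helper: seconds taken by n_iters forward passes."""
    on_gpu = data.is_cuda
    with torch.no_grad():
        model(data)                      # warm-up pass (not timed)
        if on_gpu:
            torch.cuda.synchronize()     # wait for the warm-up to finish
        start = time.perf_counter()
        for _ in range(n_iters):
            model(data)
        if on_gpu:
            torch.cuda.synchronize()     # wait for queued GPU work before stopping the clock
    return time.perf_counter() - start

small_model = torch.nn.Sequential(
    torch.nn.Linear(784, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
)
fake_data = torch.rand(256, 784)
elapsed = time_forward(small_model, fake_data)
print(f"10 forward passes: {elapsed:.4f} s")
```

On the CPU the synchronization calls are skipped, so the same helper can be used for both devices as long as the model and the data live on the same one.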

In [None]:
#Instantiate a model defined from (3) here
model = model

In [None]:
#Run on CPU
import time as timer
data = data.to('cpu')
model.cpu()

print('Running on CPU')

start = timer.time()
for _ in range(10):
  model(data)
print("Time taken forward", timer.time() - start)

start = timer.time()
for _ in range(10):
  out = model(data).mean()
  out.backward()
print("Time taken forward/backward", timer.time() - start)

Running on CPU
Time taken forward 0.0195157527923584
Time taken forward/backward 0.037970781326293945


In [None]:
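#Run on GPU
#initialize cuda
data = data.to('cuda')
model.cuda()
model(data)  # warm-up pass so that CUDA initialization is not timed
torch.cuda.synchronize()  # make sure the warm-up has finished before timing starts
print('Running on GPU')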


start = timer.time()
for _ in range(10):
  model(data)
torch.cuda.synchronize()
print("Time taken", timer.time() - start)

start = timer.time()
for _ in range(10):
  out = model(data).mean()
  out.backward()
torch.cuda.synchronize()
print("Time taken forward/backward", timer.time() - start)

Running on GPU
Time taken 0.004402637481689453
Time taken forward/backward 0.008754968643188477


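Summary of observations here:  
In the runs above, the GPU is roughly 4x faster than the CPU (about 0.004 s vs. 0.020 s for 10 forward passes), and on both devices a forward/backward pass takes about twice as long as a forward pass alone.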