# A1.2 Feed forward network

In this part of the assignment we will develop our own building blocks for constructing a feed forward network.
We will follow a modular approach so that we can use these building blocks in a feed forward architecture of our choice.

We will follow the logic of computation graphs: the layers and the loss act as compute nodes that work only with local information and communicate with their upstream and downstream blocks.

Instead of defining the forward and backward steps as functions that need to pass around cached variables, we will implement the compute nodes as stateful objects: instances of Python classes with `forward` and `backward` methods.
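To illustrate the idea of a stateful compute node, here is a minimal sketch. The `Square` class and its attribute names (`ins`, `outs`, `ins_grad`) are hypothetical and chosen only for this example; they are not part of `ann_code/layers.py`.

```python
import torch

class Square:
    """Toy compute node (hypothetical): squares its input element-wise."""

    def __init__(self):
        # local state, filled in by forward and backward
        self.ins = None
        self.outs = None
        self.ins_grad = None

    def forward(self, ins):
        # remember the input so that backward does not need a separate cache
        self.ins = ins
        self.outs = ins ** 2
        return self.outs

    def backward(self, outs_grad):
        # chain rule: local gradient (2 * ins) times the incoming gradient
        self.ins_grad = 2 * self.ins * outs_grad
        return self.ins_grad

node = Square()
y = node.forward(torch.tensor([1.0, -2.0, 3.0]))
g = node.backward(torch.ones(3))
```

Because the node keeps its own inputs, the caller only has to pass the incoming gradient to `backward`.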

We will then construct a 2-layer neural network and use our newly developed functionality to predict the target values and compute the parameter gradients.

Work through the cells below and complete the tasks indicated by <span style="color:red">**TODO**</span> here below and in the script `ann_code/layers.py` (replace `pass` with the appropriate code).

In [1]:
# necessary initialization
%load_ext autoreload
%autoreload 2

import torch

In [3]:
# load data
from helpers import load_data
in_data, labels = load_data(filename='./ann_data/toy_data.csv') # correct filename if necessary

# get data dimensions
num_inst, num_dim = in_data.shape
print(f"Number of instances: {num_inst}, input dimensions: {num_dim}.")

Number of instances: 90, input dimensions: 3.


## 1) Forward pass

We first work on the forward pass functionality of our layer objects.

### Linear layer

We start by defining the linear layer.
Complete the `__init__` and `forward` methods of the `Linear` class in `ann_code/layers.py`.

The class instances shall be initialized with the linear function parameters (weight and bias) as instance attributes.
The other local information (inputs, outputs and their gradients) shall also be defined as instance attributes and will be populated by the `forward` and `backward` methods.
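As a rough sketch of what such a layer could look like, assuming the weight has shape `(out_features, in_features)` and the bias has shape `(1, out_features)`: the class name `LinearSketch` is hypothetical, and the actual `Linear` class in `ann_code/layers.py` may organize its attributes differently.

```python
import torch

class LinearSketch:
    """Hypothetical stand-in for a linear layer; not the reference solution."""

    def __init__(self, W, b):
        # parameters: W is (out_features, in_features), b is (1, out_features)
        self.W = W
        self.b = b
        # local information, populated later by forward
        self.ins = None
        self.outs = None

    def forward(self, ins):
        self.ins = ins
        # (n, in_features) @ (in_features, out_features) + (1, out_features)
        self.outs = ins @ self.W.T + self.b
        return self.outs

W = torch.tensor([[1.0, 0.0, -1.0],
                  [2.0, 1.0, 0.0]])   # 2 outputs, 3 inputs
b = torch.tensor([[0.5, -0.5]])
x = torch.tensor([[1.0, 2.0, 3.0]])   # one instance
out = LinearSketch(W, b).forward(x)   # tensor([[-1.5, 3.5]])
```

The bias has a leading dimension of 1, so broadcasting adds it to every instance in the batch.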

In [4]:
# after implementing Linear class, check it here
from ann_code.layers import Linear

# initiate w and b buffers
# we use these for initiating the model parameters instead of the usual random init
# this is to make sure that your results and mine match
w_buffer = torch.logspace(start=0.1, end=10, steps=1000)
b_buffer = torch.logspace(start=0.1, end=10, steps=1000, base=2)

# linear layer dimensions
in_features = num_dim
out_features = 10

################################################################################
### START OF YOUR CODE                                                         #
### TODO: initiate a linear layer instance                                     #
################################################################################
# Initialize linear layer parameters from the buffers
# First extract from the buffers the necessary number of elements 
# followed by view() to get the correct shape
# e.g. for 2x3 w matrix with 6 elements in total do 
# w = w_buffer[:6].view(2, 3)
w = w_buffer[:out_features * in_features].view(out_features, in_features)
b = b_buffer[:out_features].view(1, -1)

# Instantiate the Linear layer object
linear_layer = Linear(w, b)
################################################################################
### END OF YOUR CODE                                                           #
################################################################################

# forward pass in_data through the layer
outputs = linear_layer.forward(in_data)

# check outputs for the first two data instances
print(f'Your outputs {outputs[:2,:]}')


Your outputs tensor([[ 1.0220,  1.0258,  1.0295,  1.0329,  1.0361,  1.0391,  1.0418,  1.0441,
          1.0462,  1.0479],
        [-0.4527, -0.5533, -0.6615, -0.7779, -0.9030, -1.0374, -1.1819, -1.3370,
         -1.5037, -1.6827]])


Expected outputs

`tensor([[ 1.0220,  1.0258,  1.0295,  1.0329,  1.0361,  1.0391,  1.0418,  1.0441,
          1.0462,  1.0479],
        [-0.4527, -0.5533, -0.6615, -0.7779, -0.9030, -1.0374, -1.1819, -1.3370,
         -1.5037, -1.6827]])`

### ReLU nonlinearity

We next define the class for the Rectified Linear Unit, an element-wise operation defined as $\mathrm{ReLU}(x) = \max(0, x)$.

Complete the `forward` method of the `Relu` class in `ann_code/layers.py`. Note that in this case, there are no parameters that should be included in the object instances as initial states.
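A possible shape for such a parameter-free node is sketched below. The class name `ReluSketch` is hypothetical, and `clamp(min=0)` is just one of several equivalent ways to compute the element-wise maximum with zero.

```python
import torch

class ReluSketch:
    """Hypothetical stand-in for a ReLU node; it has no parameters."""

    def __init__(self):
        # only local information, no weights or biases
        self.ins = None
        self.outs = None

    def forward(self, ins):
        self.ins = ins
        # element-wise max(0, x)
        self.outs = ins.clamp(min=0)
        return self.outs

x = torch.tensor([[0.5, -1.0, 0.0],
                  [-2.0, 3.0, -0.1]])
out = ReluSketch().forward(x)
```

Negative entries are replaced by zero and the shape of the input is preserved.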

In [5]:
# After implementing Relu class, check it here
from ann_code.layers import Relu

# relu instance
relu = Relu()

# forward pass in_data through the layer
outputs = relu.forward(in_data)

# check outputs for the first two data instances
print(f'Your outputs {outputs[:2,:]}')

Your outputs tensor([[0.8872, 0.0000, 0.3707],
        [0.0000, 1.3094, 0.0000]])


Expected outputs

`tensor([[0.8872, 0.0000, 0.3707],
        [0.0000, 1.3094, 0.0000]])`

### Define network with one hidden layer

We use the `Linear` and `Relu` classes to create a network with the following architecture.
We combine the layers through the `Model` class that I defined for you in `ann_code/layers.py`.

We will add the MSE loss in a later step; for now, do just the forward pass through the layers to obtain the predictions.
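Chaining layers in a forward pass amounts to feeding each layer's output into the next one. The sketch below shows this with toy nodes; `ModelSketch`, `AddOne` and `Double` are hypothetical names, and the provided `Model` class may be implemented differently.

```python
import torch

class AddOne:
    """Toy node (hypothetical): adds 1 to every element."""
    def forward(self, ins):
        return ins + 1

class Double:
    """Toy node (hypothetical): multiplies every element by 2."""
    def forward(self, ins):
        return ins * 2

class ModelSketch:
    """Hypothetical container that applies its layers in list order."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, ins):
        outs = ins
        for layer in self.layers:
            # the output of one layer is the input of the next
            outs = layer.forward(outs)
        return outs

model_sketch = ModelSketch([AddOne(), Double()])
out = model_sketch.forward(torch.tensor([1.0, 2.0]))   # (x + 1) * 2
```

The order of the list matters: swapping the two nodes would compute `2 * x + 1` instead.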

<center><img src="net_diagram.png">



In [6]:
# work with Model class to do the forward pass through the network
from ann_code.layers import Model, Linear, Relu

################################################################################
### START OF YOUR CODE                                                         #
################################################################################

# Initialize parameters for all layers from the w_buffer and b_buffer
# Extract the necessary number of elements from the buffers and reshape them as required
# Define all necessary layers as instances of the Linear and Relu classes

# For example, if you have 3 input features, 4 hidden units, and 1 output unit:
# Define the first linear layer with input features and hidden units

out_features= 1
hidden1_features = 4

w1= w_buffer[:hidden1_features * num_dim].view(hidden1_features,num_dim)
w2 = w_buffer[hidden1_features* num_dim: hidden1_features*num_dim + out_features*hidden1_features].view(out_features, hidden1_features)
b1= b_buffer[:hidden1_features].view(1, hidden1_features)
b2 = b_buffer[hidden1_features:hidden1_features+out_features].view(1, out_features)

################################################################################
### END OF YOUR CODE                                                           #
################################################################################
lin1= Linear(w1,b1)
lin2= Linear(w2,b2)
relu1 = Relu()

layers = [lin1,relu1, lin2]

# forward pass in_data through all layers to get predictions
model = Model(layers)
ypred = model.forward(in_data)


# check outputs for the first two data instances
print(f'Your outputs {ypred[:2,:]}')

Your outputs tensor([[8.1458],
        [1.1016]])


Expected output

`tensor([[8.1458],
        [1.1016]])`

## 2) MSE loss

We use the MSE loss functions defined in `ann_code/linear_regression.py` to get the MSE loss for our predictions and its gradient with respect to the predictions.
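For reference, a minimal sketch of an MSE forward and backward pair is given below. The function names `mse_forward_sketch` and `mse_backward_sketch` are hypothetical, and the sketch assumes the loss is the mean of the squared differences over all elements; the functions in `ann_code/linear_regression.py` may use a different cache or return signature.

```python
import torch

def mse_forward_sketch(ypred, y):
    # loss = mean((ypred - y)^2); keep the difference for the backward step
    diff = ypred - y
    loss = (diff ** 2).mean()
    return loss, diff

def mse_backward_sketch(diff):
    # d loss / d ypred = 2 * (ypred - y) / number of elements
    return 2 * diff / diff.numel()

ypred_toy = torch.tensor([[1.0], [3.0]])
y_toy = torch.tensor([[0.0], [1.0]])
loss_toy, cache_toy = mse_forward_sketch(ypred_toy, y_toy)   # (1 + 4) / 2 = 2.5
grad_toy = mse_backward_sketch(cache_toy)                    # [[1.0], [2.0]]
```

The returned gradient has the same shape as the predictions, so it can be passed directly to the `backward` method of the last layer.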

In [7]:
# use mse functions defined for linear regression to get the MSE and gradient with respect to predictions
from ann_code.linear_regression import mse_forward, mse_backward

loss, mse_cache = mse_forward(ypred, labels)
ypredgrad, _ = mse_backward(mse_cache)

## 3) Backward propagation

Finally, you need to implement the `backward` methods for the `Linear` and `Relu` classes.

Remember that you need to use the chain rule and combine the local and the upstream gradient to obtain the global gradients. Do not forget that ReLU is an element-wise operation.
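The chain rule for the two node types can be sketched as standalone functions. This assumes the linear layer computes `ins @ W.T + b` with `W` of shape `(out_features, in_features)`; the function names are hypothetical and this is not the reference solution for `ann_code/layers.py`.

```python
import torch

def linear_backward_sketch(ins, W, outs_grad):
    # outs_grad: gradient of the loss w.r.t. the layer outputs, shape (n, out)
    W_grad = outs_grad.T @ ins                     # (out, n) @ (n, in) -> (out, in)
    b_grad = outs_grad.sum(dim=0, keepdim=True)    # sum over instances -> (1, out)
    ins_grad = outs_grad @ W                       # (n, out) @ (out, in) -> (n, in)
    return W_grad, b_grad, ins_grad

def relu_backward_sketch(ins, outs_grad):
    # element-wise: the gradient passes through only where the input was positive
    return outs_grad * (ins > 0)

ins_toy = torch.tensor([[1.0, 2.0],
                        [3.0, 4.0]])
W_toy = torch.tensor([[0.5, -1.0]])
W_grad, b_grad, ins_grad = linear_backward_sketch(ins_toy, W_toy, torch.ones(2, 1))

relu_grad = relu_backward_sketch(torch.tensor([[-1.0, 2.0]]),
                                 torch.tensor([[5.0, 7.0]]))
```

With an all-ones incoming gradient, the bias gradient equals the number of instances, the weight gradient is the column-wise sum of the inputs, and every row of the input gradient equals `W`.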

In [8]:
# After implementing the backward passes of Linear class test it here

# do the backward pass of last linear layer

lin2.backward(torch.ones(num_inst, 1))

# check global gradients
print(f'Global gradient of loss with respect to weight parameters {lin2.W.g}')
print(f'Global gradient of loss with respect to bias parameters {lin2.b.g}')
print(f'Global gradient of loss with respect to linear layer inputs {lin2.ins.g[:2,:]}')

Global gradient of loss with respect to weight parameters tensor([[106.2968, 108.7577, 111.4530, 114.4143]])
Global gradient of loss with respect to bias parameters tensor([[90.]])
Global gradient of loss with respect to linear layer inputs tensor([[1.6555, 1.6937, 1.7328, 1.7728],
        [1.6555, 1.6937, 1.7328, 1.7728]])


Expected results

`Global gradient of loss with respect to weight parameters tensor([[106.2968, 108.7577, 111.4530, 114.4143]])`

`Global gradient of loss with respect to bias parameters tensor([[90.]])`

`Global gradient of loss with respect to linear layer inputs tensor([[1.6555, 1.6937, 1.7328, 1.7728],
        [1.6555, 1.6937, 1.7328, 1.7728]])`

In [9]:
# After implementing the backward passes of relu class test it here

# do the backward pass of relu
relu1.backward(torch.arange(num_inst*4).view(num_inst, 4))

# check global gradients
print(f'Global gradient of loss with respect to relu inputs {relu1.ins.g[:2,:]}')

Global gradient of loss with respect to relu inputs tensor([[0., 1., 2., 3.],
        [0., 0., 0., 0.]])


Expected results

`Global gradient of loss with respect to relu inputs tensor([[0., 1., 2., 3.],
        [0., 0., 0., 0.]])`

## Complete backward pass

We shall use the `Model` class to get the gradients of the loss with respect to the inputs and parameters of all the layers.
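A complete backward pass visits the layers in reverse order, handing each layer's input gradient to the layer before it. The sketch below unrolls this by hand for a Linear -> ReLU -> Linear network with a mean-squared-error loss and compares the result with `torch.autograd`. The data and parameters are random toy values, and all variable names are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)
y = torch.randn(5, 1)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(1, 4, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, 1, requires_grad=True)

# forward pass: Linear -> ReLU -> Linear -> MSE
h = x @ W1.T + b1
a = h.clamp(min=0)
ypred_toy = a @ W2.T + b2
loss_toy = ((ypred_toy - y) ** 2).mean()
loss_toy.backward()          # autograd reference gradients

# manual backward pass, last layer first
with torch.no_grad():
    g = 2 * (ypred_toy - y) / ypred_toy.numel()   # d loss / d ypred
    W2_grad = g.T @ a
    b2_grad = g.sum(dim=0, keepdim=True)
    g = g @ W2                                    # gradient w.r.t. ReLU outputs
    g = g * (h > 0)                               # gradient w.r.t. ReLU inputs
    W1_grad = g.T @ x
    b1_grad = g.sum(dim=0, keepdim=True)

print(torch.allclose(W1_grad, W1.grad, atol=1e-6))
print(torch.allclose(b1_grad, b1.grad, atol=1e-6))
```

Comparing hand-derived gradients with autograd in this way is a convenient sanity check for any new layer.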

In [10]:
from helpers import grad_model

# do the backward pass through the model
model.backward(ypredgrad)

# print out your gradients of loss with respect to the parameters of the 1st model layer
print(f'Your dLoss/dW1: {model.layers[0].W.g}')
print(f'Your dLoss/db1: {model.layers[0].b.g}')
print(f'Your dLoss/dins: {model.layers[0].ins.g[:2, :]}')

# print out correct gradients of loss with respect to the parameters of the 1st model layer
# these should be the same as your gradients from above
model_check = grad_model(model, in_data, labels)
print(f'Correct dLoss/dW1: {model_check.layers[0].W.grad}')
print(f'Correct dLoss/db1: {model_check.layers[0].b.grad}')
print(f'Correct dLoss/dins: {model_check.layers[0].ins.grad[:2, :]}')

Your dLoss/dW1: tensor([[10.4693,  6.8379,  4.1449],
        [10.5790,  7.0695,  4.3389],
        [10.8324,  7.2315,  4.4382],
        [11.0693,  7.3818,  4.5600]])
Your dLoss/db1: tensor([[31.2568, 31.9208, 32.6484, 33.4148]])
Your dLoss/dins: tensor([[1.6884, 1.7274, 1.7673],
        [0.0000, 0.0000, 0.0000]])
Correct dLoss/dW1: tensor([[10.4693,  6.8379,  4.1449],
        [10.5790,  7.0695,  4.3389],
        [10.8324,  7.2315,  4.4382],
        [11.0693,  7.3818,  4.5600]])
Correct dLoss/db1: tensor([[31.2568, 31.9208, 32.6484, 33.4148]])
Correct dLoss/dins: tensor([[1.6884, 1.7274, 1.7673],
        [0.0000, 0.0000, 0.0000]])


## 4) Multilayer feed forward network

Finally, use your `Linear` and `Relu` classes and combine them with the `Model` class to construct a more complicated network.

Define a network with the following architecture:
Linear: input_dim = 3, output_dim = 5 -> Relu ->
Linear: input_dim = 5, output_dim = 10 -> Relu ->
Linear: input_dim = 10, output_dim = 4 -> Relu ->
Linear: input_dim = 4, output_dim = 1

Initialize all the linear layers with parameters W and b sampled randomly from the standard normal distribution.

Combine the layers using the `Model` class and get the predictions (`forward` method).

Use the MSE forward and backward functions to get the loss and the gradient with respect to the predictions.

Use the `backward` method of `Model` to get all the gradients.
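One way to draw the parameters for this architecture from a standard normal distribution is sketched below using `torch.randn`. The list `dims` and the plain-tensor forward loop are assumptions made to keep the example self-contained; in the assignment the parameters would be wrapped in `Linear` and `Relu` instances and passed to `Model`.

```python
import torch

torch.manual_seed(0)

# layer widths: 3 -> 5 -> 10 -> 4 -> 1
dims = [3, 5, 10, 4, 1]

# one (W, b) pair per linear layer, W is (out, in) and b is (1, out)
params = [(torch.randn(d_out, d_in), torch.randn(1, d_out))
          for d_in, d_out in zip(dims[:-1], dims[1:])]

# plain-tensor forward pass to confirm that the shapes fit together
x_toy = torch.randn(2, 3)
out_toy = x_toy
for i, (W, b) in enumerate(params):
    out_toy = out_toy @ W.T + b
    if i < len(params) - 1:
        # ReLU after every linear layer except the last
        out_toy = out_toy.clamp(min=0)

shapes = [tuple(W.shape) for W, _ in params]
print(shapes, tuple(out_toy.shape))
```

Generating the shapes from a single list of widths avoids mismatches between the output dimension of one layer and the input dimension of the next.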

In [11]:
################################################################################
### START OF YOUR CODE                                                         #
### TODO: define mffn as instance of Model class                               #
################################################################################

# layer parameters; note these use deterministic torch.arange values, not random normal samples
w1 = torch.arange(5*3).view(5, 3).float()
b1 = torch.arange(5).view(1, 5).float()
w2 = torch.arange(10*5).view(10, 5).float()
b2 = torch.arange(10).view(1, 10).float()
w3 = torch.arange(4*10).view(4, 10).float()
b3 = torch.arange(4).view(1, 4).float()
w4 = torch.arange(1*4).view(1, 4).float()
b4 = torch.arange(1).view(1, 1).float()

# wrap the parameters in layer objects, listed in network order
layer1 = Linear(w1, b1)
activation1 = Relu()
layer2 = Linear(w2, b2)
activation2 = Relu()
layer3 = Linear(w3, b3)
activation3 = Relu()
layer4 = Linear(w4, b4)
layers = [layer1, activation1, layer2,
          activation2, layer3, activation3, layer4]

# define model using Model class
mffn = Model(layers)

# forward, mse, backward
ypred1 = mffn.forward(in_data)
loss1, mse_cache1 = mse_forward(ypred1, labels)
ypredgrad1, _ = mse_backward(mse_cache1)
mffn.backward(ypredgrad1)
################################################################################
### END OF YOUR CODE                                                           #
################################################################################

tensor([[9.7827e+09, 1.1063e+10, 1.2343e+10],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.3440e+10, 1.4921e+10, 1.6402e+10],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [2.5364e+10, 2.9435e+10, 3.3505e+10],
        [1.8863e+09, 2.2922e+09, 2.6981e+09],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [2.2337e+10, 2.5260e+10, 2.8183e+10],
        [2.5364e+10, 2.9435e+10, 3.3505e+10],
        [0.0000e+00, 2.1428e+07, 4.2856e+07],
        [5.1239e+10, 5.9462e+10, 6.7685e+10],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [3.8424e+10, 4.4590e+10, 5.0757e+10],
        [1.3137e+10, 1.5245e+10, 1.7353e+10],
        [1.1856e+09, 1.5705e+09, 1.9555e+09],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 2.2611e+07, 4.5223e+07],
        [5.9529e+10, 6.9083e+10, 7.8637e+10],
        [2.5349e+10, 2.8666e+10, 3.1983e+10],
        [2.3095e+10, 2.6118e+10, 2.9140e+10],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0

#### Check model architecture

In [12]:
# check architecture
from helpers import check_architecture

check_architecture(mffn)

Your NN architecture definitions seems CORRECT.


#### Check gradient computation

In [13]:
# print out your gradients of loss with respect to the parameters of the 1st model layer
print(f'Your dLoss/dW1: {mffn.layers[0].W.g}')
print(f'Your dLoss/db1: {mffn.layers[0].b.g}')
print(f'Your dLoss/dins: {mffn.layers[0].ins.g[:2, :]}') 
    
# print out correct gradients of loss with respect to the parameters of the 1st model layer
# these should be the same as your gradients from above
model_check = grad_model(mffn, in_data, labels)
print(f'Correct dLoss/dW1: {model_check.layers[0].W.grad}')
print(f'Correct dLoss/db1: {model_check.layers[0].b.grad}')
print(f'Correct dLoss/dins: {model_check.layers[0].ins.grad[:2, :]}')

Your dLoss/dW1: tensor([[1.1477e+08, 1.6356e+10, 3.2376e+10],
        [1.5370e+10, 1.7434e+10, 2.6436e+10],
        [1.8302e+10, 1.9640e+10, 2.4326e+10],
        [1.9257e+10, 2.0735e+10, 2.4839e+10],
        [2.0092e+10, 2.1464e+10, 2.5725e+10]], grad_fn=<MmBackward0>)
Your dLoss/db1: tensor([[5.0174e+10, 6.6295e+10, 7.0931e+10, 7.3397e+10, 7.6006e+10]],
       grad_fn=<SumBackward1>)
Your dLoss/dins: tensor([[9.7827e+09, 1.1063e+10, 1.2343e+10],
        [0.0000e+00, 0.0000e+00, 0.0000e+00]], grad_fn=<SliceBackward0>)
Correct dLoss/dW1: tensor([[1.1477e+08, 1.6356e+10, 3.2376e+10],
        [1.5370e+10, 1.7434e+10, 2.6436e+10],
        [1.8302e+10, 1.9640e+10, 2.4326e+10],
        [1.9257e+10, 2.0735e+10, 2.4839e+10],
        [2.0092e+10, 2.1464e+10, 2.5725e+10]])
Correct dLoss/db1: tensor([[5.0174e+10, 6.6295e+10, 7.0931e+10, 7.3397e+10, 7.6006e+10]])
Correct dLoss/dins: tensor([[9.7827e+09, 1.1063e+10, 1.2343e+10],
        [0.0000e+00, 0.0000e+00, 0.0000e+00]])
