<a href="https://colab.research.google.com/github/ali-vosoughi/Large-scale-nonlinear-causality/blob/main/project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = ""

---

# Implementation of deep neural network

You will use what you have learned in class to build a 3-layer feedforward neural network and apply it to a ten-class image classification problem. Along the way, you will learn how to implement backpropagation and optimization.

Let's get started!

## 1. Packages

Let's first import all the packages that you will need.

- **torch, torch.nn, torch.nn.functional** are the core modules of the PyTorch library for building deep learning projects in Python.
- **torchvision** is a computer vision library that goes hand in hand with PyTorch.
- **numpy** is the fundamental package for scientific computing in Python.
- **matplotlib** is a library for plotting graphs and images in Python.
- **math, random** are standard Python modules.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import random
import math
import numpy as np
import matplotlib.pyplot as plt
from project1_utils import *

print("Import packages successfully!")


A helper function is provided:

```python
def set_seed(seed):
    """
    TODO: Use random seed to ensure that results are reproducible.
    """
```
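The body of `set_seed` lives in `project1_utils` and is not shown here. As a rough sketch of what such a helper typically does, assuming it seeds Python's `random`, NumPy, and PyTorch (the name `set_seed_sketch` is hypothetical and chosen so it does not shadow the provided helper):

```python
import random

import numpy as np
import torch


def set_seed_sketch(seed: int) -> None:
    """Hypothetical stand-in for the provided set_seed helper."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's global RNG
    torch.manual_seed(seed)  # PyTorch's default generator


# Re-seeding with the same value reproduces the same random draws.
set_seed_sketch(1)
first = torch.rand(3)
set_seed_sketch(1)
second = torch.rand(3)
print(torch.equal(first, second))
```

The provided helper may seed additional sources of randomness; this sketch covers only the three libraries imported above.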

In [None]:
seed = 1
set_seed(seed)

## 2. Dataset

You will use the "cat vs. dog" dataset for this assignment.

Let's first load the dataset using PyTorch's dataset and data loader modules.

In [None]:
# the number of images in a batch
batch_size = 4

# load dataset
trainset = dataset(path='/u/cs298/project1/dataset/trainset.h5')
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)
testset = dataset(path='/u/cs298/project1/dataset/testset.h5')
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

# name of classes
classes = ('cat', 'dog')

print ("Number of training examples: " + str(trainset.length))
print ("Number of testing examples: " + str(testset.length))

Let's visualize some examples from the dataset. A helper function to show images is provided below:

```python
def imshow(images):
    """
    TODO: Display the input images in a plot
    """
```

In [None]:
# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

num_toshow = 4

# show images
imshow(torchvision.utils.make_grid(images[:num_toshow]))

# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(num_toshow)))

## 3. Build your feedforward neural network.

In this cell, you will be required to build a **3-layer multilayer perceptron (MLP)** to classify images into different categories.

As we know from class, **each layer** of an MLP can be written as the following mathematical operation:

$$z = W^T x + b$$

$$a = \sigma(z)$$

Here, $W, b$ denote the weights and biases, and $a, \sigma$ denote the activation and the activation function, respectively.
**The function is parameterized by $W, b$ as well as the choice of $\sigma(\cdot)$**.

Note that $\sigma(\cdot)$ may be the identity function, i.e., $\sigma(z) = z$.
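To make the shapes in the layer equation concrete, here is a minimal sketch on made-up tensors. It assumes the batch is stored row-wise, so `x` has shape `(B, input_dim)` and `W` has shape `(input_dim, output_dim)`; the sizes and values are illustrative only and are not the ones used in the assignment.

```python
import torch

# Illustrative sizes only: a batch of 2 inputs with 3 features, mapped to 4 outputs.
x = torch.ones(2, 3)           # shape (B, input_dim)
W = torch.full((3, 4), 0.5)    # shape (input_dim, output_dim)
b = torch.zeros(4)             # shape (output_dim,), broadcast over the batch

z = x.mm(W) + b                # pre-activation, shape (B, output_dim)
a = torch.relu(z)              # activation with sigma chosen as ReLU

print(z.shape, a.shape)
```

With this layout, each row of `x.mm(W)` equals $W^T x$ for the corresponding example, which is why the matrix product is written with `x` on the left.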

----

**Question 1 (6 points):** Now, let's implement functions at the layer level to do the following:

Hint: To implement $W^Tx+b$ in PyTorch, one way is to write it as `x.mm(W) + b`.

1. Given the desired input and output dimensions, generate the parameters $W, b$. (2 points)

In [None]:
def get_layer_params(input_dim: int, output_dim: int, batch_size: int, sigma):
    """
    Input: 
        input_dim: number of features in each input
        output_dim: number of features produced by the layer
        batch_size: number of examples from the training dataset used in the estimate of gradients, batch_size \
            controls the accuracy of the estimate of the error gradients when training neural networks
        sigma: activation function
    Output: 
        a dictionary of generated parameters
    """
    # TODO:
    w = None
    b = None
    
    # generate the parameters
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return {'w': w,
            'b': b,
            'sigma': sigma}
    

2. Given the layer parameters $W, b$ and the choice of $\sigma(\cdot)$, compute the output for layer input $X$. (4 points)

In [None]:
def layer_forward_computation(params, x):
    """
    Input: 
        params: parameters of each layer
        x: the input to the layer
    Output: 
        the output after and before activation
    """
    # unpack params
    w, b, sigma = params['w'], params['b'], params['sigma']
    
    a = None
    z = None
    
    # compute the output for layer
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return a, z

---

Back to building our 3-layer MLP for classification. If you have implemented the functions above correctly,
the process of putting everything together will now be very easy.

Just like other parts of your programming experience,
knowing how to efficiently abstract and modularize components of your program will be critical in deep learning.

**Architecture Requirement**:

We now describe in detail how our 3-layer MLP should be built in PyTorch.

1. In the dataset, each batch of input images is a tensor $X \in \mathbb{R}^{B \times 32 \times 32 \times 3}$, where $B$ denotes the batch size.
2. Flatten the image tensor to $X_{flattened} \in \mathbb{R}^{B \times 3072}$.
3. The remaining steps describe one specific architecture. It is not the only valid design, so feel free to change the hidden dimensions.
4. Layer1: set your parameters so the input is projected from $\mathbb{R}^{B \times 3072}$ to $\mathbb{R}^{B \times 240}$, and use ReLU as the activation function.
5. Layer2: set your parameters so the input is projected from $\mathbb{R}^{B \times 240}$ to $\mathbb{R}^{B \times 84}$, and use ReLU as the activation function.
6. Layer3: set your parameters so the input is projected from $\mathbb{R}^{B \times 84}$ to $\mathbb{R}^{B \times 10}$, and use the identity function as the activation function.
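As a quick arithmetic check of the dimensions listed above, the following sketch counts the trainable parameters per layer. It assumes each layer holds one weight matrix of shape `(input, output)` and one bias vector of length `output`, and it uses the default hidden sizes of 240 and 84; the counts change if you pick different hidden dimensions.

```python
# Layer sizes taken from the architecture list above.
layer_dims = [(3072, 240), (240, 84), (84, 10)]

# A flattened 32 x 32 x 3 image has 3072 entries.
assert 32 * 32 * 3 == layer_dims[0][0]

total = 0
for in_dim, out_dim in layer_dims:
    n_params = in_dim * out_dim + out_dim  # weights plus biases
    print(f"{in_dim:>4} -> {out_dim:>3}: {n_params} parameters")
    total += n_params

print("total:", total)
```

Almost all of the parameters sit in the first layer, because it maps the full flattened image to the first hidden representation.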

---

**Question 2 (4 points):** Let's build the 3-layer MLP using the pre-defined function **get_layer_params( )**.

In [None]:
""" TODO: define your layer parameters here """
layer1_params: dict = dict()
layer2_params: dict = dict()
layer3_params: dict = dict()

# build your network
# YOUR CODE HERE
raise NotImplementedError()

---

Now your whole network function is defined as below:

In [None]:
def net(x, params):
    assert len(params) == 3
    layer1_params, layer2_params, layer3_params = params
    
    l1_out = layer_forward_computation(layer1_params, x)
    l2_out = layer_forward_computation(layer2_params, l1_out[0])
    l3_out = layer_forward_computation(layer3_params, l2_out[0])
    return l1_out, l2_out, l3_out

## 4. Backpropagation and optimization

**Question 3 (6 points):** After computing the forward pass, you now need to compute gradients with respect to all tensors with `requires_grad=True`, e.g., the parameters of layer1. These gradients will be used to update the parameters via gradient descent.

**Requirements:** You will need to complete the functions **default_backprop( )** and **zero_grad( )** and use them in the training process.

Hint1: You can use `autograd` in PyTorch to compute gradients.

Hint2: You should manually zero the gradients after updating weights.
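The two hints can be illustrated on a single scalar, separate from the assignment's layer parameters. This is a minimal sketch with a made-up loss; it shows that `backward()` fills in `.grad` and that gradients accumulate across calls unless they are zeroed.

```python
import torch

# A single scalar parameter, tracked by autograd.
w = torch.tensor(2.0, requires_grad=True)

loss = (w * 3.0) ** 2     # loss = 9 * w^2, so d(loss)/dw = 18 * w
loss.backward()           # populates w.grad
first_grad = w.grad.item()

# Without zeroing, a second backward pass adds to the stored gradient.
loss = (w * 3.0) ** 2
loss.backward()
accumulated_grad = w.grad.item()

w.grad.zero_()            # reset in place before the next step
print(first_grad, accumulated_grad, w.grad.item())
```

The second value is twice the first, which is why stale gradients would distort the next update if they were left in place.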

---

Gradient descent minimizes the final objective function (loss) $J(\theta)$, parameterized by the model's parameters $\theta$, by updating the parameters in the opposite direction of the gradient $\nabla_\theta J(\theta)$ with respect to the parameters. The learning rate $\lambda$ determines the size of the steps you take to reach a (local) minimum.

Use the following equation to update the parameters of each layer in your network.

$$\large \theta = \theta - \lambda\cdot\nabla_\theta J(\theta)$$
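The update rule can be tried on a one-dimensional toy problem before applying it to the network. The sketch below assumes the hypothetical objective $J(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$; neither the objective nor the step size comes from the assignment.

```python
def grad_J(theta: float) -> float:
    """Gradient of the toy objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta = 0.0
step_size = 0.1  # plays the role of lambda in the update equation

for _ in range(100):
    theta = theta - step_size * grad_J(theta)

print(theta)
```

Each step shrinks the distance to the minimizer at 3 by a constant factor of 0.8, so `theta` ends up very close to 3.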

---

**Question 4 (3 points):** You will use the computed gradients to update the parameters of the feedforward network. For this step, complete the function **update_params( )**.

Hint: To access the gradient of a tensor `x` in PyTorch, use `x.grad`.

In [None]:
def update_params(params, learning_rate):
    #TODO: update the parameters of each layer
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""Default Backprop Helpers"""

def zero_grad(params):
    #TODO: set the gradients with respect to parameters as zero
    # YOUR CODE HERE
    raise NotImplementedError()
    
def default_backprop(loss, params, learning_rate):
    """
    Input: 
        loss: the objective function that can be used to compute gradients
        params: parameters of each layer
        learning_rate: the size of steps when updating parameters
    """    
    #TODO: compute gradients -> update parameters -> clear gradients
    # YOUR CODE HERE
    raise NotImplementedError()

---

## 5. Training loop

You will use a standard objective function for binary classification tasks, the **Binary Cross-Entropy Loss**, defined as follows:

$$\large L = -\frac{1}{N}\sum_{i=1}^{N}( y_i \cdot \log(p(y_i))+(1-y_i)\log(1-p(y_i)))$$

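where $y_i$ is the label (1 for dog and 0 for cat in our case), $p(y_i)$ is the predicted probability, and $N$ is the batch size. Note that the training cell uses `nn.CrossEntropyLoss`, which applies a softmax to the network outputs and computes the multi-class generalization of this loss.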
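As a worked instance of the formula, the sketch below evaluates the loss by hand on two made-up predictions. It assumes $p(y_i)$ is the predicted probability of class 1 (dog); the helper name `binary_cross_entropy` is hypothetical and is not part of the assignment code.

```python
import math


def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy; probs[i] is the predicted probability of class 1."""
    n = len(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


# Two examples: a dog (1) predicted at 0.9 and a cat (0) predicted at 0.2.
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
print(round(loss, 4))
```

Confident, correct predictions contribute terms close to zero, while confident, wrong predictions are penalized heavily because the logarithm diverges as its argument approaches zero.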


An initialization function is provided to help your network converge faster.

```python
def init_params(params):
    """
    TODO: Initialize the parameters of each layer
    """
```

In [None]:
# define the learning rate here
learning_rate = 0.001
n_epochs = 2 # how many epochs to run

# define loss function
criterion = nn.CrossEntropyLoss()

params = [layer1_params, layer2_params, layer3_params]

# initialize network parameters
init_params(params)
    
for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # Forward 
        x = torch.flatten(inputs, 1)  # flatten all dimensions except batch
        outputs = net(x, params)
        
        # Compute the loss using the final output
        final_out = outputs[-1][0]
        loss = criterion(final_out, labels)

        # Backpropagation
        default_backprop(loss, params, learning_rate)

        # print statistics
        running_loss += loss.item()
        if i % 400 == 399:  # print every 400 mini-batches
            print('[Epoch %d, Step %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 400))
            running_loss = 0.0

print('Finished Training')

## 6. Testing

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

x = torch.flatten(images, 1)
outputs = net(x, params)
final_out = outputs[-1][0]
_, predicted = torch.max(final_out, 1)

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))

**Task (1 point)**: Now test your trained model!

In [None]:
correct = 0
total = 0

# since you're not training, you don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        # calculate outputs by running images through the network
        x = torch.flatten(images, 1)
        outputs = net(x, params)
        final_out = outputs[-1][0]

        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(final_out.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 2000 test images: %d %%' % (
        100 * correct / total))
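The prediction and counting steps used in the evaluation loop can be checked on a tiny hand-made batch. This is a minimal sketch with made-up logits and labels, not outputs of the trained network; it shows that `torch.max` with a dimension argument returns both the maximum values and their indices, and that the indices serve as the predicted classes.

```python
import torch

# Made-up logits for 3 examples and 2 classes; not produced by the network.
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2],
                       [0.3, 0.9]])
labels = torch.tensor([0, 1, 0])

values, predicted = torch.max(logits, 1)   # max over the class dimension
correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)
print(predicted.tolist(), correct, accuracy)
```

Here two of the three predictions match their labels, so the accuracy on this toy batch is about 66.7 percent.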

In [None]:
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        x = torch.flatten(images, 1)
        outputs = net(x, params)
        final_out = outputs[-1][0]
        _, predictions = torch.max(final_out, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                         accuracy))