In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("A8.ipynb")

# Assignment 8


## **Due: June 2nd (Friday), 2023, 8:00pm (PT)**

### **Instructions:**

Your Jupyter notebook assignment will often have 3 elements: written answers, code answers, and quiz answers. For written answers, you may insert images of your handwritten work in code cells, or write your answers in markdown and LaTeX. For quiz answers, your `record.txt` file will record your answer choices in the quiz modules for submission. Both your quiz answers and code answers will be autograded on Gradescope. This assignment does not have the quiz portion.

For all elements, DO NOT MODIFY THE CELLS. Put your answers **only** in the answer cells given, and **do not delete cells**. If you fail to follow these instructions, you will lose points on your submission.

Make sure to show the steps of your solution for every question to receive credit, not just the final answer. You may search for information online, but you will need to write the code and find the solutions to the questions yourself. You will submit your .ipynb file and record.txt to Gradescope when you are finished.

### **Late Policy:**

Late assignments will be accepted at 75% credit up to 3 days late. Consult the syllabus for more info on the late policy.

### How to Include Your Math Written Answer?

You can use inline $\LaTeX$ in markdown (recommended) or use markdown's image syntax to include a picture of your written responses.

#### $\LaTeX$ (recommended)
[Here is a fantastic tutorial from Caltech about using $\LaTeX$ in Jupyter Notebook](http://chebe163.caltech.edu/2018w/handouts/intro_to_latex.html). You can also find various $\LaTeX$ tutorials and cheat sheets online.

#### Include Images
If you are still getting familiar with LaTeX, handwrite the response on paper or with a stylus. Take a picture or screenshot of your answer, and include that image in the Jupyter Notebook. Be sure to save that image in the `imgs/` directory. Let's say you have your Q1 response saved as `imgs/Q1.png`; the markdown syntax to include that image is `![Q1](imgs/Q1.png)`.

## Important Notice

You must check that both submission outputs on Gradescope (`Assignment 8` and `Assignment 8 - Manual Grading`) correctly reflect your work and responses. If you notice inconsistencies between your notebook and the Manual Grading portion, make a Campuswire post and we can help you with that.

**Other**

If you are not feeling comfortable with the programming assignments in this homework, it might help to take a look at [https://github.com/UCSD-COGS108/Tutorials](https://github.com/UCSD-COGS108/Tutorials).

## Neural Networks

<!-- BEGIN QUESTION -->

Neural networks are  function approximators (https://en.wikipedia.org/wiki/Universal_approximation_theorem) that are capable of learning rich mappings of inputs $x$ to targets $y$, often without explicit feature engineering. For an arbitrary function $f$, a neural network can be defined as $f(\mathbf{x};\mathbf{w})$ with inputs $\mathbf{x}$ and parameters $\mathbf{w}$. The set of parameters $\mathbf{w}$ can be optimized in a neural network through a special implementation of gradient descent: backpropagation.

Backpropagation is **the recursive application of the chain rule**. For a single perceptron (or any other algorithm) we know how to do gradient descent to change parameters so as to minimize errors. In a multi-layer neural network, that situation corresponds to the last layer of the network: that's where we know the error, and that's where we can directly optimize the final set of weights to minimize the loss. But what about the other layers? How do we modify them to help minimize loss at the output stage? We take the loss gradient calculated for the output layer and propagate that gradient backwards through the chain rule to the layer before. That is, we apportion "responsibility" for the errors at the output layer among the weights in the layer before that.

Before we can run backpropagation, we must first run a forward pass with our data: inputting $\mathbf{x}$ to our network, calculating the prediction $f(\mathbf{x};\mathbf{w})$, and comparing that prediction to the desired output $y$ by means of some loss function. After the forward pass, we can calculate the loss gradients for each step from back to front, applying the chain rule at each stage.

The gradient of a variable $f$ w.r.t a variable $q$ can be described as the responsibility a small change in $q$ has for causing change in $f$. What about when there's a variable in between $q$ and $f$? What if $f$ is a function of  $r$ which is a function of $q$? The chain rule says that to calculate the gradient $\frac{\partial f}{\partial q}$,  we must multiply the impact of $q$ on $r$ and the impact of $r$ on $f$. This is the recursive application of the chain rule, with the formula:

$$ \frac{\partial f}{\partial q} = \frac{\partial f}{\partial r}  \frac{\partial r}{\partial q} $$

Written this way, it looks as if the $\partial r$ terms cancel out when we multiply these gradients together! (This is a handy way to remember the rule rather than literal fraction arithmetic.) This flow of the gradients, where we calculate the impact of an earlier variable on a later variable by going back to front, is the mathematical mechanism behind how neural networks learn.
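As a quick numerical sanity check of the chain rule, here is a minimal sketch. It uses a made-up pair of functions, $r(q) = 3q + 1$ and $f(r) = r^2$ (illustrative assumptions, not the functions from this assignment), and compares the chain-rule product against a finite-difference estimate.

```python
# Hypothetical example functions (NOT the ones used in this assignment).
def r_of_q(q):
    return 3.0 * q + 1.0


def f_of_r(r):
    return r ** 2


def chain_rule_gradient(q):
    """df/dq computed as (df/dr) * (dr/dq)."""
    r = r_of_q(q)      # forward pass: intermediate value
    df_dr = 2.0 * r    # local gradient of f = r^2
    dr_dq = 3.0        # local gradient of r = 3q + 1
    return df_dr * dr_dq


def finite_difference_gradient(q, eps=1e-6):
    """Central-difference estimate of df/dq, used only as a check."""
    f_plus = f_of_r(r_of_q(q + eps))
    f_minus = f_of_r(r_of_q(q - eps))
    return (f_plus - f_minus) / (2.0 * eps)


analytic = chain_rule_gradient(2.0)
numeric = finite_difference_gradient(2.0)
print(analytic, numeric)  # the two values should agree closely
```

If the two numbers disagree noticeably, one of the local gradients is usually the culprit, which makes this kind of check a handy way to verify hand-derived answers.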


Instead of making you calculate backpropagation on a complicated neural network, we will be using a simple scaled linear function $f(x,w,z,b) = z(wx+b)$. But this method works for any network! You will compute all gradients and the missing values in the computational graph below given $x=2$, $w=5$, $b=-2$, $z=3$. On the forward pass, solve for the intermediary variables $q$ and $r$, as well as $f$; then compute their gradients. Filling in the values in the graph may help you construct the answer, but you will not be graded on the graph itself: you must put your answers in the questions below for points. Show your work using the formula for the chain rule.



![comp-graph](imgs/computational-graph.png)


Calculate the numerical values of $q$, $r$, and $f$. 

$q = $

$r =$

$f = $


_Points:_ 0.3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Intermediary Variables
Write $q$, $r$, and $f$ in terms of the two variables that are used to compute them. This will help you calculate their gradients.

Hint: For example, you would write $r$ in terms of $q$ and $b$.

_Points:_ 0.3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Intermediary Gradients

Use the intermediary equations you found above to compute the gradients. Show your work.

$\frac{\partial r}{\partial b}$,
$\frac{\partial r}{\partial q}$,
$\frac{\partial q}{\partial w}$,
$\frac{\partial q}{\partial x}$

_Points:_ 0.3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Final Local Gradients
Use the intermediary gradients calculated above to calculate the final gradients of the function $f$ w.r.t. each variable. Show your work.

$\frac{\partial f}{\partial f}$,
$\frac{\partial f}{\partial z}$,
$\frac{\partial f}{\partial r}$,
$\frac{\partial f}{\partial q}$,
$\frac{\partial f}{\partial b}$,
$\frac{\partial f}{\partial x}$,
$\frac{\partial f}{\partial w}$
 

_Points:_ 0.3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Gradient Descent Update Rule

Typically we update the weights of a neural network with the gradient of the loss function w.r.t. the weights, $\frac{\partial L}{\partial w}$. In this toy problem, since we have no loss, we will just use the gradient of the output $f$ instead. If we use the following gradient descent update rule for $w$ and $b$, what would the new values be? The learning rate $\alpha$ is $0.01$ in this case. Show your work.


$$w_t = w_{t-1} - \alpha(\frac{\partial f}{\partial w})$$
$$b_t = b_{t-1} - \alpha(\frac{\partial f}{\partial b})$$
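To see the update rule in action, the sketch below applies it to a made-up one-parameter loss $L(w) = (w - 3)^2$. The loss, the starting values, and the step size here are all hypothetical and chosen only for illustration; they are not the values from this question.

```python
def gradient_descent_step(param, grad, alpha):
    """One update: new_param = old_param - alpha * gradient."""
    return param - alpha * grad


# A single step with hypothetical numbers.
w_prev = 1.0
grad_w = 4.0
w_new = gradient_descent_step(w_prev, grad_w, alpha=0.1)
print(w_new)


def loss_gradient(w):
    """Gradient of the hypothetical loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


# Repeating the step moves w toward the minimizer of the hypothetical loss.
w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, loss_gradient(w), alpha=0.1)
print(w)  # ends up very close to 3
```

The sign matters: subtracting the gradient moves the parameter in the direction that decreases the function, and $\alpha$ controls how large each move is.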

_Points:_ 0.3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Digit Classification with a Feed Forward Neural Network

In this coding portion, most of the code is provided for you; you will only need to fill in the missing pieces. We will focus on creating a simple feed-forward neural network that classifies handwritten digits (0-9) from the MNIST dataset into their respective classes.
A feed-forward neural network consists of a large number of perceptrons organized in layers, where each unit in a layer is connected to all the units (neurons) in the previous layer.
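To make "each unit is connected to all the units in the previous layer" concrete, here is a minimal plain-Python sketch of a forward pass through a tiny fully connected network. The layer sizes, weights, and biases are made-up toy values, and the helper names `linear` and `relu` are defined here for illustration only (they are not PyTorch calls).

```python
def linear(inputs, weights, biases):
    """Fully connected layer: every output unit sees every input value."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]


def relu(values):
    """Rectified Linear Unit: negative values become 0."""
    return [max(0.0, v) for v in values]


# Hypothetical toy network: 3 inputs -> 2 hidden units -> 2 outputs.
x = [1.0, 2.0, 3.0]
hidden_weights = [[0.5, -1.0, 0.25], [-0.5, 0.5, 0.5]]  # one row per hidden unit
hidden_biases = [0.0, 1.0]
output_weights = [[1.0, -1.0], [2.0, 0.5]]              # one row per output unit
output_biases = [0.5, -0.5]

hidden = relu(linear(x, hidden_weights, hidden_biases))
output = linear(hidden, output_weights, output_biases)
print(hidden, output)
```

Each row of a weight matrix belongs to one unit, so the number of rows sets the layer's output size and the row length must match the size of the layer before it.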

Before we get started, we need to install PyTorch.
About PyTorch - https://pytorch.org/
PyTorch is an open-source machine learning and deep learning framework widely used in applications such as natural language processing, image classification, and computer vision. It was developed by Facebook’s AI Research lab and later adopted by several companies such as Uber, Twitter, Salesforce, and NVIDIA.
PyTorch comes with several specially developed modules like torchtext and torchvision, and other classes such as torch.nn, torch.optim, Dataset, and DataLoader, to help you create and train neural networks across different machine learning and deep learning areas.

In [None]:
# !pip install torch==1.12.0
# !pip install torchvision

Note: Please restart your kernel after pip installing the above packages. Click kernel > Restart.
Then run the following import cell.

In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'

About the Dataset

Torchvision provides many built-in datasets in the torchvision.datasets module, as well as utility classes for building your own datasets. - https://pytorch.org/vision/stable/datasets.html
We'll be using one of these built-in datasets for our classification task.
The MNIST dataset, also known as the Modified National Institute of Standards and Technology dataset, consists of 70,000 small square 28×28 grayscale images of handwritten digits from 0 to 9, divided into ten different classes. This dataset is mainly used for image classification with machine learning and deep learning models.
The MNIST database contains 60,000 training images and 10,000 testing images.
Load the MNIST dataset from PyTorch

The following cell transforms the dataset into a PyTorch-friendly format.

In [None]:
from torch.utils.data import Subset

# Read in train_data and test_data from built-in MNIST dataset, 
# transform into Pytorch tensor format

train_data = torchvision.datasets.MNIST(
    root='data',
    train=True,
    transform=transforms.ToTensor(),
    download=True
)

test_data = torchvision.datasets.MNIST(
    root='data',
    train=False,
    transform=transforms.ToTensor(),
    download=True
)

Since the entire dataset is around 70k images, which would take a long time to train on a CPU, we will take the first 10% of the images for our train and test sets. Don't worry if you don't fully understand the details - just run the code provided! If you are curious about the purpose and function of each line, feel free to make a Piazza post. PyTorch's dataloader takes a dataset object as input, which is responsible for loading and returning individual data samples. The dataloader then takes care of batching, shuffling, and multiprocessing the data samples, making it efficient to feed them into a deep learning model.
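For intuition about what "batching and shuffling" means, the plain-Python sketch below mimics the idea on a stand-in dataset of 6000 integer IDs. The function `iterate_batches` is a hypothetical helper written for this illustration; it is not how PyTorch's dataloader is implemented.

```python
import random


def iterate_batches(samples, batch_size, shuffle=False, seed=0):
    """Yield successive batches of samples, optionally in shuffled order."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


# Stand-in "dataset": 6000 integer IDs instead of real images.
fake_train_samples = list(range(6000))
batches = list(iterate_batches(fake_train_samples, batch_size=100, shuffle=True))

print(len(batches))      # number of batches
print(len(batches[0]))   # number of samples in the first batch
```

With 6000 samples and a batch size of 100, one pass over the data (one epoch) consists of 60 batches, and every sample appears in exactly one of them.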

In [None]:
# The full dataset has about 70k images (60k train + 10k test),
# which would take a long time to train on a CPU.
# We subset it and keep the first 10% of the train and test sets.
train_dataset = Subset(train_data, indices=range(len(train_data) // 10))
test_dataset = Subset(test_data, indices=range(len(test_data) // 10))

# Just run this cell to use Pytorch dataloader to load train and test sets 
# Note that we specify batch size = 100. 
# This means that we will have 60 batches in total 
# - and each batch contains 100 images for the train set 

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=100, 
                                           shuffle=True)

# Note that we specify batch size = 100. 
# This means that we will have 10 batches in total 
# - and each batch contains 100 images for the test set 
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=100, 
                                          shuffle=False)
# A quick check to ensure our train_loader loaded our 6000 train images properly 
print('For the train set:')
print('Total number of batches:', len(train_loader))
print('Number of images in each batch in train set:', train_loader.batch_size)
print('Total number of images in train set:', len(train_loader.dataset))
print()


# A quick check to ensure our test_loader loaded our 1000 test images properly 
print('For the test set:')
print('Total number of batches:', len(test_loader))
print('Number of images in each batch in test set:', test_loader.batch_size)
print('Total number of images in test set:', len(test_loader.dataset))

Great! Now that we have successfully loaded our train and test sets, let's take a quick look at what our test set images look like.
Please run the following code cell to take a look at 10 random images from our test set.
The ground truth labels are displayed in blue as the titles of the plots.

In [None]:
examples = iter(test_loader)
example_data, example_targets = next(examples)


params = {"text.color" : "blue",
          "xtick.color" : "black",
          "ytick.color" : "black"}
plt.rcParams.update(params)


import numpy as np
# Pick a random position inside each batch (every batch holds batch_size images)
indices = np.random.randint(0, test_loader.batch_size, size=10)


fig, axs = plt.subplots(2, 5, figsize=(10, 5))
axs = axs.flatten()
examples = iter(test_loader)

for i, index in enumerate(indices):
    # Get the image and ground truth label

    example_data, example_targets = next(examples)
    image, label = example_data[index][0], example_targets[index].item()

    # Plot the image with its ground truth
    axs[i].imshow(image.reshape(28, 28), cmap='gray')
    axs[i].set_title(f'GT: {label}')
    axs[i].axis('off')

plt.tight_layout()
plt.show()

## Creating A Neural Network with 1 Hidden Layer

Now, let's focus on building our fully connected neural network that will classify these test images into one of 10 different classes, i.e., the digits (0-9).

We will first define the hyperparameters for our neural network.
Given a handwritten digit image as the input to our model, what should be the input size of our fully connected network?

_Points:_ 0.5

In [None]:
examples = iter(test_loader)
example_data, example_targets = next(examples)
# Use the first image in the batch so this cell does not rely on a variable from an earlier cell
image, label = example_data[0][0], example_targets[0].item()

# Take a look at our input: image
print(image.shape)

# Based on that input, what should be our network's input size?
input_size = ...

## Linear Layers
The code cell below defines a fully connected neural network with a single hidden layer. Your job is to fill in the lines for the first and second linear layers.
You can accomplish this using the `nn.Linear` module and setting the appropriate input and output sizes for each layer - https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
Hint: Use the hyperparameters we define below!

_Points:_ 0.5

In [None]:
# Our hidden layer will have 500 units.
hidden_size = 500 

# num_classes = 10, since we want to classify digits into one of 10 classes 
num_classes = 10

num_epochs = 3

batch_size = 100

learning_rate = 0.001

In [None]:
# Fully connected neural network with one hidden layer

# The neural network is defined as a class called NeuralNet, which inherits from the nn.Module class in PyTorch. 
# This allows the network to take advantage of the built-in functionality of PyTorch for training and optimization.

class NeuralNet(nn.Module):
    
    # initializes the neural network and sets its parameters. 
    # It takes three arguments - input_size, hidden_size, and num_classes 
    # input_size - the size of the input layer, 
    # hidden_size - the number of neurons in the hidden layer, 
    # num_classes - the number of output classes
    
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.input_size = input_size

        # Your task: Create the first hidden layer using nn.Linear 
        # Hint: Think about the input size of the first layer. 
        # Since the hidden layer is next, what should the output size of this first linear layer be?
        self.l1 = ...
        
        
        # a Rectified Linear Unit (ReLU) activation function applied to first layer 
        self.relu = nn.ReLU()
        
        # Your task: Create the second linear layer using nn.Linear 
        # Hint: Think about the input size of the second layer (This layer is connected to the hidden layer!)
        # This layer produces the final output of the network so what should the output size be?
        self.l2 = ...
 

    # defines how the input data is processed through the neural network. 
    # connect each layer together as following:  l1 -> relu -> l2
    def forward(self, x):
        x = ...
        x = ...
        out = ...
        return out
    

# Create an instance of NeuralNet and store it in model
model = NeuralNet(input_size, hidden_size, num_classes)
model # View a brief summary of your model

## Training the network
Initially, the weights are randomly initialized, so our accuracy would be quite poor. We need to update these random weight values using a training loop and a cross-entropy loss. In the following code cells, we will set up the training loop for the network.
Your task will be to fill in the forward pass and the loss function within the loop; the loss calculates the error between the predicted output and the actual labels.
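For intuition about what a cross-entropy loss measures, here is a minimal plain-Python sketch for a single example. It assumes the usual definition, the negative log of the softmax probability assigned to the correct class; PyTorch's `nn.CrossEntropyLoss` is documented as combining log-softmax with a negative log-likelihood and, by default, averaging over the batch. The function name and the toy scores below are illustrative only.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one example: -log(softmax(logits)[target_index])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exp_shifted = [math.exp(v - max_logit) for v in logits]
    log_sum_exp = max_logit + math.log(sum(exp_shifted))
    return log_sum_exp - logits[target_index]


# Toy raw scores for a 10-class problem.
uniform_logits = [0.0] * 10          # the network has no preference
confident_logits = [0.0] * 10
confident_logits[3] = 10.0           # the network strongly prefers class 3

print(cross_entropy(uniform_logits, 3))    # equals log(10): pure guessing
print(cross_entropy(confident_logits, 3))  # small: confident and correct
print(cross_entropy(confident_logits, 7))  # large: confident but wrong
```

The loss is small when the network puts high probability on the true label and grows quickly when it is confidently wrong, which is what drives the weight updates during training.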

In [None]:
# Specify loss function
loss_func = nn.CrossEntropyLoss()

# Specify optimization algorithm to be used 
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

n_total_batches = len(train_loader)
print('Total Batches in train set:', n_total_batches)
losses = []

# Outer loop - runs over number of epochs 
for epoch in range(num_epochs):
    # Loop over each batch of images and labels
    for i, (images, labels) in enumerate(train_loader):  
        
        # original shape: [100, 1, 28, 28]
        # reshaped: [100, 784] to be able to pass into network
        images = images.reshape(-1, 28*28)

        # Your task: Fill in the forward pass
        # Forward pass - pass input image through network 
        outputs = ...
        
        # Your task: Fill in the loss function 
        # that calculates the error between the predicted output and the actual labels
        # Hint: We already defined this above. Think about what arguments a loss function should take. 
        loss = ...
 
        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

        if (i+1) % 10 == 0:
            print (f'Epoch [{epoch+1}/{num_epochs}], Batch [{i+1}/{n_total_batches}], Loss: {loss.item():.4f}')

Let's now measure the accuracy of our network on the test images after training.
We expect the accuracy to have improved, since the weights have been updated during training and the network should have learned to recognize visual features that distinguish the input images.
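The accuracy computation boils down to taking the highest-scoring class for each image and counting how often it matches the label. A minimal plain-Python sketch on made-up scores (the helper names and toy data are illustrative assumptions, not part of the assignment code):

```python
def argmax(scores):
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy_percent(all_scores, labels):
    """Percentage of examples whose predicted class matches the label."""
    predictions = [argmax(scores) for scores in all_scores]
    n_correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * n_correct / len(labels)


# Toy outputs for 4 examples and 3 classes.
toy_scores = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
toy_labels = [1, 0, 2, 1]

print(accuracy_percent(toy_scores, toy_labels))
```

Note that only the ranking of the scores matters for accuracy, so the raw network outputs can be used directly without converting them to probabilities first.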

In [None]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    n_correct = 0
    n_samples = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28)
        outputs = model(images)
        # max returns (value ,index)
        _, predicted = torch.max(outputs.data, 1)
        n_samples += labels.size(0)
        n_correct += (predicted == labels).sum().item()

    acc = 100.0 * n_correct / n_samples
    print(f'Accuracy of the network on the 1000 test images: {acc} %')

In [None]:
# Plot random 10 images from MNIST test set with ground truth and predicted label 

import numpy as np
# Pick a random position inside each batch (every batch holds batch_size images)
indices = np.random.randint(0, test_loader.batch_size, size=10)


fig, axs = plt.subplots(2, 5, figsize=(10, 5))
axs = axs.flatten()
examples = iter(test_loader)

for i, index in enumerate(indices):
    # Get the image and ground truth label

    example_data, example_targets = next(examples)
    image, label = example_data[index][0], example_targets[index].item()

    # Make a prediction with the model
    with torch.no_grad():
        image = image.reshape(-1, 28*28)

        prediction = model(image)
        predicted_label = torch.argmax(prediction, dim=1).item()

    # Plot the image with its ground truth and predicted labels
    axs[i].imshow(image.reshape(28, 28), cmap='gray')
    axs[i].set_title(f'GT: {label}, Pred: {predicted_label}')
    axs[i].axis('off')

plt.tight_layout()
plt.show()

We can see that the predictions now match our ground truth for most images.
It's important to note that the performance of the network can depend on several factors, such as the network architecture, hyperparameters, and the size and complexity of the dataset. Additionally, as with other supervised ML algorithms, the network may not generalize well to new, unseen data if it has overfitted to the training data.
Congratulations! You have successfully trained your first neural network and used it to classify 1000 test images from MNIST.

# End of A8

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

Please make sure to check the output of the Gradescope autograder. You are responsible for waiting and ensuring that the autograder executes normally for your submission. Please create a Campuswire post if you see errors in autograder execution.

In [None]:
grader.export(pdf=False, force_save=True, run_tests=True)