In [1]:
import torch
from torch import nn
import torchvision
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline
rng_seed = 1144

# Download MNIST
torchvision.datasets.MNIST('.', download=True)

# CIS680: Assignment 1: Deep Learning Basics
### Due:
* Part (a) Sept. 12 at 11:59 p.m.
* Part (b) Sept. 15 at 11:59 p.m.

### Instructions:
* Part (a) consists of parts 1, 2 and 3, and is due on September 12 at 11:59 p.m. EDT.
* Part (b) consists of part 4 and is due on September 15 at 11:59 p.m. EDT.
* This is a group assignment with one submission per group. It is expected that each member of the group will contribute
to solving each question. Be sure to specify your teammates when you submit to Gradescope! Collaborating with other
groups is not permitted.
* There is no single answer to most problems in deep learning, so the questions will
often be underspecified. You need to fill in the blanks and submit a solution that solves the
(practical) problem. Document the choices (hyperparameters, features, neural network
architectures, etc.) you made where specified.
* All the code should be written in Python. Use only PyTorch to complete this
homework.


## Plot Loss and Gradient
In this part, you will write code to plot the output and gradient for a single neuron with
Sigmoid activation and two different loss functions. As shown in Figure 1, you should
implement a single neuron with input 1, and calculate the different losses and the corresponding
error.

<div><img src="https://github.com/LukasZhornyak/CIS680_files/raw/e676f49897a77eb8d1774057e8ea5a216f0dc273/HW1/images/fig1.png" width=1200/></div>

<center>Figure 1: Network diagram for part 1.</center>

All the figures plotted in this part should have the same range of x-axis and y-axis. The
range should be centered at 0, but the extent should be picked so that the differences are
clearly visible.

A set of example plots is provided in Figure 2. Here we use ReLU (instead of Sigmoid)
activation and L2 loss as an example.

<div><img src="https://github.com/LukasZhornyak/CIS680_files/raw/e676f49897a77eb8d1774057e8ea5a216f0dc273/HW1/images/fig2.png" width=800/></div>
<center>Figure 2: Example plots with ReLU activation and L2 loss. Left: Output of ReLU function.
Middle: Loss plot with L2 loss. Right: Gradient plot.</center>
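
The ReLU example can be sketched roughly as follows. This is a minimal illustration, not the code that produced Figure 2: the grid extent of 5, the target `y = 0.5`, and the colormap are assumptions, and only the output and loss panels are drawn.

```python
import torch
import matplotlib.pyplot as plt

# Grid of weight/bias values centered at 0 (the extent of 5 is an arbitrary choice).
w = torch.linspace(-5.0, 5.0, 101)
b = torch.linspace(-5.0, 5.0, 101)
W, B = torch.meshgrid(w, b, indexing="ij")

x = 1.0   # the single neuron's input
y = 0.5   # assumed target for the L2 loss in this example

y_hat = torch.relu(W * x + B)   # neuron output for every (weight, bias) pair
loss = (y_hat - y) ** 2         # L2 loss on the same grid

fig = plt.figure(figsize=(10, 4))
for i, (Z, title) in enumerate([(y_hat, "ReLU output"), (loss, "L2 loss")]):
    ax = fig.add_subplot(1, 2, i + 1, projection="3d")
    ax.plot_surface(W.numpy(), B.numpy(), Z.numpy(), cmap="viridis")
    ax.set_xlabel("weight")
    ax.set_ylabel("bias")
    ax.set_title(title)
```

Using the same `W`, `B` grid for every surface keeps the x- and y-axis ranges identical across figures.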

<!-- BEGIN QUESTION -->

1. (3%) Plot a 3D figure showing the relations of output of Sigmoid function and weight/bias. To be specific, x-axis is weight, y-axis is bias, and z-axis is the output.

 Hint: Use the Python package matplotlib and the function `plot_surface` from `mpl_toolkits.mplot3d`
to draw 3D figures.

In [10]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

2. (3%) Experiment with L2 loss. The L2 loss is defined as $\mathcal{L}_{L2} = (\hat{y} - y)^2$, where $y$ is
the ground truth and $\hat{y}$ is the prediction. Let $y = 0.5$ and plot a 3D figure showing
the relations of L2 loss and weight/bias. To be specific, the x-axis is weight, y-axis is
bias, and z-axis is the L2 loss.

In [12]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

3. (4%) Experiment with back-propagation with L2 loss. Compute $\frac{\partial \mathcal{L}_{L2}}{\partial \text{weight}}$ and plot a 3D figure showing the relations of gradient and weight/bias. To be specific, the x-axis is weight, y-axis is bias, and z-axis is the gradient w.r.t. weight.

In [13]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

4. (3%) Experiment with cross-entropy loss. The cross-entropy loss is defined as $\mathcal{L}_{CE} = -(y \log{\hat{y}} + (1 - y)\log{(1 - \hat{y})})$, where $y$ is the ground truth probability and $\hat{y}$ is the
predicted probability. Let $y = 0.5$ and plot a 3D figure showing the relations of
cross-entropy loss and weight/bias. To be specific, the x-axis is weight, y-axis is bias,
and z-axis is the cross-entropy loss.
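
As a quick sanity check of the formula, the sketch below evaluates the cross-entropy loss for a few hand-picked predictions. The values in `y_hat` are hypothetical and chosen only for illustration; `F.binary_cross_entropy` is used as an independent check because it accepts soft targets such as $y = 0.5$.

```python
import torch
import torch.nn.functional as F

y = torch.tensor(0.5)                    # ground-truth probability from the question
y_hat = torch.tensor([0.1, 0.5, 0.9])    # hypothetical predictions, for illustration only

# Direct transcription of the formula; it returns inf or nan if y_hat is exactly 0 or 1.
ce_manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

# Built-in version; reduction="none" keeps one loss value per prediction.
ce_builtin = F.binary_cross_entropy(y_hat, y.expand_as(y_hat), reduction="none")
```

With a soft target of 0.5, the smallest value the loss can take is $\ln 2 \approx 0.693$ (at $\hat{y} = 0.5$), not zero, which is worth keeping in mind when choosing the z-axis range.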

In [14]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

5. (4%) Experiment with back-propagation with cross-entropy loss. Compute $\frac{\partial \mathcal{L}_{CE}}{\partial \text{weight}}$ and plot a 3D figure showing the relations of gradient and weight/bias. To be specific, the x-axis is weight, y-axis is bias, and z-axis is the gradient w.r.t. weight.
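
One possible way to obtain the gradient on a whole grid at once is to let autograd do the back-propagation, which is also handy for checking a hand-derived expression. The sketch below assumes a grid extent of 5 and the variable names `W`, `B`, and `grad_w`, none of which are prescribed by the assignment; swapping the `loss` line for the L2 definition gives the corresponding L2 gradient.

```python
import torch

w = torch.linspace(-5.0, 5.0, 101)
b = torch.linspace(-5.0, 5.0, 101)
W, B = torch.meshgrid(w, b, indexing="ij")
W = W.clone().requires_grad_(True)   # track gradients with respect to the weight grid

x, y = 1.0, 0.5                      # neuron input and ground-truth probability
y_hat = torch.sigmoid(W * x + B)
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

# Each loss entry depends only on its own W entry, so back-propagating the sum
# fills W.grad with the element-wise derivative of the loss w.r.t. the weight.
loss.sum().backward()
grad_w = W.grad                      # same shape as the grid, ready for plot_surface
```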

In [15]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

6. (3%) Explain what you observed from the above 5 plots. The explanation should include: 
 1. What's the difference between cross-entropy loss and L2 loss?
 2. What's the difference between the gradients from cross-entropy loss and L2 loss?
 3. Predict how these differences will influence the efficiency of learning.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Solving XOR with a 2-layer Perceptron (20%)
In this question you are asked to build and visualize a 2-layer perceptron that computes
the XOR function. The network architecture is shown in Figure 3. The MLP has 1 hidden
layer with 2 neurons. The activation function used for the hidden layer is the hyperbolic
tangent function. Since we aim to model a Boolean function, the output of the last layer is
passed through a sigmoid activation function to constrain it between 0 and 1.

<div><img src="https://github.com/LukasZhornyak/CIS680_files/raw/e676f49897a77eb8d1774057e8ea5a216f0dc273/HW1/images/fig3.png" width=800/></div>
<center>Figure 3: Graphical representation of the 2-layer Perceptron</center>

<!-- BEGIN QUESTION -->

1. (5%) Formulate the XOR approximation as an optimization problem using the cross
entropy loss. _Hint: Your dataset consists of just 4 points, $x_1 = (0,0)$, $x_2 = (0,1)$,
$x_3 = (1,0)$ and $x_4 = (1,1)$ with ground truth labels 0, 1, 1 and 0 respectively._

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

2. (10%) Use gradient descent to learn the network weights that optimize the loss. Intuitively, the 2-layer perceptron first performs a nonlinear mapping from $(x_1,x_2) \rightarrow (h_1,h_2)$ and then learns a linear classifier in the $(h_1,h_2)$ plane. At several steps
during training, visualize the image of each input point $x_i$ in the $(h_1,h_2)$ plane as well
as the decision boundary (separating line) of the classifier.
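
The quantities to visualize can be extracted along these lines. This is only a sketch with an untrained network: the names `hidden`, `output`, and `hidden_image_and_boundary` are hypothetical, the seed is arbitrary, and the dataset is the four points from the hint in the previous question. The separating line follows from the fact that the sigmoid output equals 0.5 exactly where its input is 0.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only to make this sketch repeatable

data = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = torch.tensor([[0.], [1.], [1.], [0.]])

hidden = nn.Sequential(nn.Linear(2, 2), nn.Tanh())   # (x1, x2) -> (h1, h2)
output = nn.Linear(2, 1)                             # linear classifier on (h1, h2)

def hidden_image_and_boundary(hidden, output, data):
    """Return the hidden-layer image of each input and two points on the decision line."""
    with torch.no_grad():
        h = hidden(data)                             # shape (4, 2), entries in (-1, 1)
        (v1, v2), c = output.weight[0], output.bias[0]
        # The boundary is the line v1*h1 + v2*h2 + c = 0 in the (h1, h2) plane.
        h1 = torch.tensor([-1.0, 1.0])
        h2 = -(v1 * h1 + c) / v2                     # assumes v2 != 0
    return h, h1, h2

h, line_h1, line_h2 = hidden_image_and_boundary(hidden, output, data)
```

Calling the helper every few hundred iterations and scattering `h` colored by `labels`, together with the line through `(line_h1, line_h2)`, gives one frame per training step.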

In [3]:
# Make your dataset here
data = ...
labels = ...

# Make your network here
network = ...

# Train and plot here
for i in range(1000):
    ...
    continue

In [None]:
grader.check("q2b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

3. (5%) What will happen if we don't use an activation function in the hidden layer? Will
the network be able to learn the XOR function? Justify your answer.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Train a Convolutional Neural Network (30%)
In this part you will be asked to train a convolutional neural network on the MNIST
dataset.

1. (10%) Build a Convolutional Neural Network with architecture as shown below:
| Layers | Hyper-parameters |
| :--- | :--- |
| Convolution 1 | Kernel size $= (5, 5, 32)$, SAME padding. Followed by BatchNorm and ReLU. |
| Pooling 1 | Average operation. Kernel size $= (2, 2)$. Stride $= 2$. Padding $= 0$. |
| Convolution 2 | Kernel size $= (5, 5, 32)$, SAME padding. Followed by BatchNorm and ReLU. |
| Pooling 2 | Average operation. Kernel size $= (2, 2)$. Stride $= 2$. Padding $= 0$. |
| Convolution 3 | Kernel size $= (5, 5, 64)$, SAME padding. Followed by BatchNorm and ReLU. |
| Pooling 3 | Average operation. Kernel size $= (2, 2)$. Stride $= 2$. Padding $= 0$. |
| Fully Connected 1 | Output channels $= 64$. Followed by BatchNorm and ReLU. |
| Fully Connected 2 | Output channels $= 10$. Followed by Softmax. |
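
To see how the spatial size evolves through the three convolution and pooling stages, a shape check like the one below can help when sizing Fully Connected 1. It is a sketch under assumptions: the kernel size $(5, 5, C)$ is read as a $5 \times 5$ kernel with $C$ output channels, SAME padding is realized as `padding=2`, and the helper name `stage` is hypothetical.

```python
import torch
from torch import nn

def stage(in_ch, out_ch):
    """One convolution + pooling stage from the table (interpretation assumed above)."""
    return nn.Sequential(
        # A 5x5 kernel with padding=2 keeps height and width at stride 1 ("SAME").
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        # 2x2 average pooling with stride 2 halves height and width, rounding down.
        nn.AvgPool2d(kernel_size=2, stride=2, padding=0),
    )

features = nn.Sequential(stage(1, 32), stage(32, 32), stage(32, 64))

x = torch.zeros(2, 1, 28, 28)               # dummy batch of two MNIST-sized images
out = features(x)
print(out.shape)                            # torch.Size([2, 64, 3, 3])
flat_features = out.flatten(1).shape[1]     # input width for the first fully connected layer
```

The spatial size goes 28, 14, 7, 3, so under these assumptions the flattened feature vector has $64 \cdot 3 \cdot 3 = 576$ entries.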

In [75]:
# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Create your network here
class DigitClassification(torch.nn.Module):
    def __init__(self):
        ...
        pass
        
        
    def forward(self, x):
        ...
        return x

# Instantiate your network here
model = ...

In [None]:
grader.check("q3a")

<!-- BEGIN QUESTION -->

2. (20%) Train the CNN on the MNIST dataset using the Cross Entropy loss. Report training and testing curves. Your model should reach $99\%$ accuracy on the
test dataset. (Hint: Normalize the images to the $(-1,1)$ range and use the Adam
optimizer).
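
The hint can be illustrated with the following sketch. It uses a random stand-in batch and a placeholder linear model rather than MNIST or the CNN above, and the learning rate is simply Adam's default in PyTorch, not a tuned value; all names here are assumptions.

```python
import torch

def normalize(images):
    """Map pixel values from [0, 1] to [-1, 1] via (x - mean) / std with mean = std = 0.5."""
    # torchvision.transforms.Normalize((0.5,), (0.5,)) applies the same formula per channel.
    return (images - 0.5) / 0.5

images = torch.rand(8, 1, 28, 28)           # stand-in for a batch of images already in [0, 1]
normalized = normalize(images)

# Placeholder model, only so that the optimizer has parameters to manage.
placeholder = torch.nn.Linear(28 * 28, 10)
optimizer = torch.optim.Adam(placeholder.parameters(), lr=1e-3)

# CrossEntropyLoss expects raw logits and applies log-softmax internally.
criterion = torch.nn.CrossEntropyLoss()
logits = placeholder(normalized.flatten(1))
targets = torch.zeros(8, dtype=torch.long)  # dummy class labels
loss = criterion(logits, targets)
```

Because `nn.CrossEntropyLoss` already includes the softmax step, feeding it softmax outputs would apply the normalization twice; one option is to keep the final Softmax for inference only.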

In [93]:
# Where your trained model will be saved (and where the autograder will load it)
model_path = 'model.pth'

# Adds your model to the output zip
grader.add_plugin_files("cis680_plugins.IncludeModelPlugin", model_path)

# Do not edit the line below, everything after it will be skipped by the autograder
# Do not attempt to train your model on the autograder, you will be timed out
## TRAINING_CODE

# Train your network here
num_epochs = 10
for epoch in range(num_epochs):
  print("Epoch %d/%d" % (epoch+1, num_epochs))

# Save to the path the autograder loads from
torch.save(model.state_dict(), model_path)

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)