# Assignment 1 Part A - Multi-Layer Perceptrons (MLPs)

Welcome to the first assignment!

You'll be implementing your own basic version of PyTorch (that we'll cleverly call `mytorch`), using nothing but [NumPy](https://numpy.org/).

## Introduction

Let's begin with a high-level recap of how neural networks (NNs) are trained.

Modern NNs are generally trained by repeating these three steps:

<div>
    <img src="images/training_forward.png" width="600"/>
</div>

<div>
    <img src="images/training_backward.png" width="600"/>
</div>

<div>
    <img src="images/training_step.png" width="300"/>
</div>

In summary:

<div>
    <img src="images/training_summary.png" width="500"/>
</div>

We'll be implementing each of these steps roughly in order.

## Important Notes
- All code you write will be in the `mytorch/` folder.
    - You shouldn't need to change anything in this notebook except for running the test cells.
- Each problem has **three unit tests** to check your implementations for correctness.
    - The first unit test of every problem is shown to you in case you want to see how it works.
    - The last two are hidden, although we will provide error messages indicating potential issues in your code.
    - **Make sure you pass all three tests before continuing**.
- Code you write in other `.py` file(s) will be automatically reimported here.

Finally:
- Don't be intimidated by how long the assignment *looks*.
    - It only looks long because we provide lots of descriptions and diagrams.
    - The actual code you need to write is short
        - (if you vectorize and use NumPy well, usually $\leq$ 4 lines per problem)
    - But the challenge is writing the code correctly, and that requires *understanding the concepts*.

**IMPORTANT: Make sure to run the cell below to import everything!**

In [52]:
# Make sure to run this cell, and don't modify these!
import numpy as np

# Import the code in `mytorch/nn.py` and `mytorch/optim.py`
from mytorch import nn, optim

# Extension to automatically update imported files if you change them
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Section 1: Forward Propagation
---

We'll begin by implementing everything needed to complete forward propagation.

## Question 1.1: `Linear.forward()`

We'll begin by implementing the forward pass of a linear layer.

Open `mytorch/nn.py` and find the `Linear.forward()` method. Complete it by implementing the equation below in NumPy and `return`ing its output. 

$$\begin{align*}
    \text{Linear}(X) = XW + b
\end{align*}$$

$$\begin{align*}
    &\text{Where $X$ is a matrix containing the input data, $W$ is the weight matrix, and $b$ is the bias vector,} \\
    &\text{and $XW$ indicates a matrix multiplication between $X$ and $W$.}
\end{align*}$$

Here's a visualized example of the above formula just to make it clearer:

<div style="text-align:center">
    <img src="images/linear_forward.png" width="600"/>
</div>

**Notes/Hints**:
- Notice that the bias is added to each row of $XW$. Thankfully, NumPy handles this automatically using [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html).
- While the above example and our test cases will use positive integers, typical values are small floats centered around zero.
- The formula we provide translates pretty neatly to NumPy.
    - Try to avoid hardcoding shapes and axes.
- [Matrix multiplications](https://en.wikipedia.org/wiki/Matrix_multiplication) are different from [element-wise matrix multiplications](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)).
    - For an intuitive visualization of matrix multiplications, see [here](http://matrixmultiplication.xyz/)
    - Hint: `np.matmul()` a.k.a. `@`, [documentation here](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html). An element-wise multiplication is `np.multiply()` a.k.a. `*`.
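
To make the matrix-multiplication and broadcasting hints concrete, here is a small sketch that reuses the numbers from the visible unit test. The standalone variable names (`X`, `W`, `b`, `XW`, `doubled`) are only for illustration; how you access the weight and bias inside your own `Linear` class depends on `mytorch/nn.py`.

```python
import numpy as np

# Same values as the visible unit test for Linear.forward()
X = np.array([[1., 2., 3.],
              [4., 5., 6.]])            # shape (batch_size=2, in_features=3)
W = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 10., 11., 12.]])     # shape (in_features=3, out_features=4)
b = np.array([[1., 2., 3., 4.]])        # shape (1, out_features=4)

XW = X @ W          # matrix multiplication, shape (2, 4)
out = XW + b        # broadcasting adds b to every row of XW

# Element-wise multiplication works entry by entry instead
doubled = X * 2.0   # same as np.multiply(X, 2.0)

print(out)
```

Note that `X * W` would raise an error here, because shapes `(2, 3)` and `(3, 4)` cannot be broadcast together; that is one quick way to tell the two operations apart.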

**Reminders:**
- Make sure to run the cell above where we import everything.
- After you complete your code, run the cell below to check that your implementation was correct.

In [53]:
def test_linear_forward_1(Linear):
    """[Given] Just to demonstrate how you'll be graded.

    Args:
        Linear (class): the entire class imported from `nn.py` 

    Returns:
        np.array: the output of passing through your Linear.forward()
    """
    # Initialize layer that feeds 3 input channels to 4 neurons.
    layer = Linear(3, 4)
    # Weights/biases are normally initialized randomly to small floats centered around 0,
    # but we'll manually set them like this for consistency/interpretability
    layer.weight = np.array([[1., 2., 3., 4.],
                             [5., 6., 7., 8.],
                             [9., 10., 11., 12.]])
    layer.bias = np.array([[1., 2., 3., 4.]])
    
    # Input array shaped (batch_size=2, in_features=3)
    x = np.array([[1., 2., 3.],
                  [4., 5., 6.]])
    
    # Run the input through Linear.forward().
    out = layer.forward(x)
    return out

print("Your answer:")
print(test_linear_forward_1(nn.Linear))
print("Should be equal to:")
print(np.array([[ 39.,  46.,  53.,  60.], [ 84., 100., 116., 132.]]))

Your answer:
[[ 39.  46.  53.  60.]
 [ 84. 100. 116. 132.]]
Should be equal to:
[[ 39.  46.  53.  60.]
 [ 84. 100. 116. 132.]]


Assign the results you obtained in the previous cell to `answer_1` below as a NumPy array.

In [54]:
from tests import test_linear_forward_1, test_linear_forward_2, test_linear_forward_3

answer_1 = test_linear_forward_1(nn.Linear)
answer_2 = test_linear_forward_2(nn.Linear)
answer_3 = test_linear_forward_3(nn.Linear)

In [55]:
answer_1

array([[ 39.,  46.,  53.,  60.],
       [ 84., 100., 116., 132.]])

In [56]:
answer_2

array([[11., 22., 27.],
       [ 3., 30., 51.]])

In [57]:
answer_3

array([[  2.,  15., -14.,  13.],
       [-38.,  15., -22.,  29.]])

If you passed the tests above, assign the string "Question 1 passed" to `asn1` in the code cell below.

In [58]:
### GRADED

ans1 = "None"

# YOUR CODE HERE
# raise NotImplementedError()
asn1 = 'Question 1 passed' # just in case there's a typo in task desc.
ans1 = 'Question 1 passed'

## Question 1.2: `ReLU.forward()`

Activation functions are applied to the output of layers in order to make them non-linear. We'll begin by implementing the popular function `ReLU`.

In `mytorch/nn.py`, complete `ReLU.forward()` by implementing and `return`ing the value of this equation:

$$\begin{align*}
\text{ReLU}(X) =
    \begin{cases}
    x & x > 0\\
    0 & x \leq 0
    \end{cases}
\end{align*}$$

$$\begin{align*}
    \text{Where $X$ is a matrix of the input data, and $x$ represents some entry of $X$.}
\end{align*}$$

<div style="text-align:center">
    <img src="images/relu_forward.png" width="500"/>
</div>

**Notes**:
- Essentially, we're zeroing out entries of $X$ that are below 0 and keeping the positive values as they are.
- Notice that there are no trainable parameters here (no weights or biases). 
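
As a quick illustration of the "zeroing out" described above, the sketch below applies one possible vectorized formulation (`np.maximum`) to the input from the visible test. The variable names are illustrative, and other formulations (for example, boolean masking) behave the same way.

```python
import numpy as np

x = np.array([[-3., 1.,  0.],
              [ 4., 2., -5.]])

# Compare every entry with 0 and keep the larger of the two
relu_out = np.maximum(x, 0.0)

print(relu_out)
```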

In [59]:
def test_relu_forward_1(ReLU):
    layer = ReLU()    
    x = np.array([[-3., 1.,  0.],
                  [ 4., 2., -5.]])
    out = layer.forward(x)
    return out

In [60]:
from tests import test_relu_forward_1, test_relu_forward_2, test_relu_forward_3

answer_1 = test_relu_forward_1(nn.ReLU)
answer_2 = test_relu_forward_2(nn.ReLU)
answer_3 = test_relu_forward_3(nn.ReLU)

In [61]:
answer_1

array([[0., 1., 0.],
       [4., 2., 0.]])

In [62]:
answer_2

array([[1., 0., 3., 0.],
       [5., 6., 0., 0.]])

In [63]:
answer_3

array([[0., 1.],
       [2., 3.]])

If you passed the tests above, assign the string "Question 2 passed" to `asn2` in the code cell below.

In [64]:
### GRADED

ans2 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn2 = 'Question 2 passed' # just in case there's a typo in task desc.
ans2 = 'Question 2 passed'

## Question 1.3: `Sequential.forward()`
Now, let's implement `Sequential()`: this is a class that creates a feedforward network out of any layers we give it.

In `mytorch/nn.py`, complete `Sequential.forward()` by translating the following description to code:

Pass `x` through the first layer of your network, then pass this output to the next layer, and so on. Return the output of the final layer.

<div style="text-align:center">
    <img src="images/sequential_forward.png" width="600"/>
</div>

**Assume**:
- There could be an arbitrary number of layers.
- All layers will have a `.forward()` function.
- This method should work even if we add new layer types in the future (they'll all have `.forward()`).
    - In other words, avoid hardcoding for class types.

**Hint**: Use a `for` loop that overwrites a variable on each iteration.
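
The "pass the output along" pattern can be sketched with toy layers. `AddOne`, `Double`, and `run_forward` are hypothetical names made up for this example; they are not part of `mytorch`, and your `Sequential` class may store its layers differently.

```python
class AddOne:
    """Hypothetical layer: adds 1 to its input."""
    def forward(self, x):
        return x + 1


class Double:
    """Hypothetical layer: multiplies its input by 2."""
    def forward(self, x):
        return x * 2


def run_forward(layers, x):
    # Overwrite `x` with each layer's output, in order.
    # Only .forward() is called, so any layer type works.
    for layer in layers:
        x = layer.forward(x)
    return x


result = run_forward([AddOne(), Double(), AddOne()], 3)
print(result)  # ((3 + 1) * 2) + 1
```

Because the loop never checks the class of a layer, adding a new layer type later requires no changes to it.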

In [65]:
def test_sequential_forward_1(Sequential, ReLU, Linear):
    # Initialize list of layers and set their weights
    model = Sequential(ReLU(), Linear(2, 3), ReLU())
    model.layers[1].weight = np.array([[-1.,  2., -3.],
                                       [ 5., -6.,  7.]])
    model.layers[1].bias = np.array([[-1., 2., 3.]])

    # Pass input through layers
    x = np.array([[-3.,  0.],
                  [ 4.,  1.],
                  [-2., -1.]])
    out = model.forward(x)
    return out

print("Your answer:")
print(test_sequential_forward_1(nn.Sequential, nn.ReLU, nn.Linear))
print("Should be equal to:")
print(np.array([[0., 2., 3.],[0., 4., 0.],[0., 2., 3.]]))

Your answer:
[[0. 2. 3.]
 [0. 4. 0.]
 [0. 2. 3.]]
Should be equal to:
[[0. 2. 3.]
 [0. 4. 0.]
 [0. 2. 3.]]


In [66]:
from tests import test_sequential_forward_1, test_sequential_forward_2, test_sequential_forward_3 

answer_1 = test_sequential_forward_1(nn.Sequential, nn.ReLU, nn.Linear)
answer_2 = test_sequential_forward_2(nn.Sequential, nn.ReLU, nn.Linear)
answer_3 = test_sequential_forward_3(nn.Sequential, nn.ReLU, nn.Linear)

In [67]:
answer_1

array([[0., 2., 3.],
       [0., 4., 0.],
       [0., 2., 3.]])

In [68]:
answer_2

array([[ 4.,  0.,  4.,  0.],
       [ 5.,  0.,  0.,  4.],
       [14.,  0., 18.,  0.]])

In [69]:
answer_3

array([[  0., 111.,   0., 297.],
       [  0., 255.,   0., 657.]])

If you passed the tests above, assign the string "Question 3 passed" to `asn3` in the code cell below.

In [70]:
### GRADED

ans3 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn3 = 'Question 3 passed' # just in case there's a typo in task desc.
ans3 = 'Question 3 passed'

## Question 1.4: `CrossEntropyLoss.forward()`

Nice work so far!

<div>
    <img src="images/training_forward.png" width="600"/>
</div>

Let's quickly recap our work using the image from the intro.

- So far, we've implemented the forward pass of `Linear`, `ReLU`, and `Sequential` ("Network" in the image).
- Our code is currently capable of passing a batched input through a multi-layer network and getting logits.

Now, we need to implement a loss function.

The cross-entropy loss function measures divergence between the logits and the target labels. In other words, it estimates how well your network is doing in training (the lower the loss, the better).

In `mytorch/nn.py`, implement `CrossEntropyLoss.forward()` by `return`ing the value of $\text{MeanCrossEntropy}$ below.

$$
\begin{align*}
    \text{CrossEntropy}(I, T)_n = - \sum_{c=0}^{C-1}{T_{n,c} \odot \text{Log}(\text{Softmax}(I)_{n,c})} \\
    \text{MeanCrossEntropy}(I, T) = \frac{\sum_{n=0}^{N-1}{\text{CrossEntropy}(I, T)_{n}}}{N}
\end{align*}
$$

$$
\begin{align*}
    &\text{where $C$ is the number of classes, $N$ is the batch size,} \\
    &\text{$T$ is the one-hot matrix of target labels shaped (N, C),} \\
    &\text{$I$ is the matrix of logits shaped (N, C),}\\
    &\text{and $\odot$ is the element-wise matrix product.}
\end{align*}
$$

In other words:
- $\text{CrossEntropy(I, T)}$ calculates the loss of each observation in the batch, yielding a vector shaped `(batch_size,)` (one loss float for each observation in the batch)
- $\text{MeanCrossEntropy(I, T)}$ just takes a simple average of these floats, yielding a single float.

**Notes/Hints**:
- `softmax()` is given in the `nn.py` file. Just call it when you need it.
    - See the appendix for a more detailed description of what `softmax` does.
- The [element-wise matrix product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) $\odot$ in NumPy is just the `*` operator.
- In `CrossEntropyLoss.forward()`, we've already converted the `target`s from a categorical encoding to a one-hot encoded matrix for you.
    - See the appendix for a visualization/explanation of this.
- The [official PyTorch documentation](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) may be helpful.
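
Here is a step-by-step sketch of the two formulas above, using the logits and labels from the visible test. `softmax_rows` is a hypothetical stand-in written for this example (the `softmax()` provided in `nn.py` may have a different signature), and the one-hot matrix is built by hand only so the snippet is self-contained.

```python
import numpy as np

def softmax_rows(logits):
    # Hypothetical stand-in for the softmax() given in nn.py.
    # Subtracting the row max avoids overflow without changing the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

logits = np.array([[-3., 2., -1., 0.],
                   [-1., 2., -3., 4.]])   # shape (N=2, C=4)
labels = np.array([3, 1])                 # index of the correct class per row

# One-hot targets shaped (N, C)
N, C = logits.shape
T = np.zeros((N, C))
T[np.arange(N), labels] = 1.0

# CrossEntropy: one loss value per observation, shape (N,)
per_example = -(T * np.log(softmax_rows(logits))).sum(axis=1)

# MeanCrossEntropy: average over the batch, a single float
mean_loss = per_example.sum() / N

print(per_example, mean_loss)
```

The mean should agree with the expected value printed by the visible test, up to floating-point rounding.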

In [71]:
def test_xeloss_forward_1(CrossEntropyLoss):
    # Initialize loss function
    loss_function = CrossEntropyLoss()

    # Logits array shaped (batch_size=2, num_classes=4)
    logits = np.array([[-3., 2., -1., 0.],
                       [-1., 2., -3., 4.]])

    # Labels array shaped (batch_size=2,), indicates the index of each correct answer in the batch. 
    labels = np.array([3, 1])

    # Get the loss value given the inputs
    loss = loss_function.forward(logits, labels)

    return loss


test_xeloss_forward_1(nn.CrossEntropyLoss)


print("Your answer:")
print(test_xeloss_forward_1(nn.CrossEntropyLoss))
print("Should be equal to:")
print(2.1545793610744024)

Your answer:
2.1545793610744024
Should be equal to:
2.1545793610744024


In [72]:
from tests import test_xeloss_forward_1, test_xeloss_forward_2, test_xeloss_forward_3 

answer_1 = test_xeloss_forward_1(nn.CrossEntropyLoss)
answer_2 = test_xeloss_forward_2(nn.CrossEntropyLoss)
answer_3 = test_xeloss_forward_3(nn.CrossEntropyLoss)

In [73]:
answer_1

2.1545793610744024

In [74]:
answer_2

1.4927814118455973

In [75]:
answer_3

8.868232666695468

If you passed the tests above, assign the string "Question 4 passed" to `asn4` in the code cell below.

In [76]:
### GRADED

ans4 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn4 = 'Question 4 passed' # just in case there's a typo in task desc.
ans4 = 'Question 4 passed'

# Section 2: Backpropagation
---

Great work so far!

## Recap: Backpropagation
Before you move on, make sure you understand these key points:

**Why do we calculate gradients?**
- The gradient of a weight or bias measures how much the loss would increase/decrease if you increase that weight/bias.
- In particular, it measures the direction of 'steepest increase' for the loss. So moving in the *opposite* direction ('steepest *decrease*') of the gradient should decrease the loss, hopefully leading to better performance.

**Backprop is literally just an implementation of the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) from intro calculus.**
- A loss function $L(I, T)$ applied to a neural network $f(X)$ is really just a big nested function $L(f(X), T)$.
    - Backprop calculates the partial derivatives (more precisely, gradients) of $L(f(X), T)$ w.r.t. each of the weights/biases of $f$.
- Derivative of nested functions? Chain rule.

**Why do we do this backwards?**
- Technically we could do this forwards or in other ways. But if you're curious, the reason we do it backwards is described in [this article](https://en.wikipedia.org/wiki/Automatic_differentiation#Forward_accumulation).
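
If the chain rule feels rusty, this tiny scalar sketch may help. It differentiates the nested function $h(g(x)) = (3x + 1)^2$ with the chain rule and compares the result to a finite-difference estimate. The functions `g`, `h`, and `dh_dx` are made up for this illustration and have nothing to do with `mytorch`.

```python
def g(x):
    return 3.0 * x + 1.0      # inner function


def h(u):
    return u ** 2             # outer function


def dh_dx(x):
    # Chain rule: dh/dx = dh/du * du/dx = (2 * u) * 3
    u = g(x)
    return 2.0 * u * 3.0


x0 = 2.0
eps = 1e-6

# Central finite difference: nudge x a little and watch the output change
numeric = (h(g(x0 + eps)) - h(g(x0 - eps))) / (2 * eps)
analytic = dh_dx(x0)

print(analytic, numeric)
```

Backprop does the same thing, except that the inner and outer functions are layers and the derivatives are matrices of partial derivatives.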

## Question 2.1: `CrossEntropyLoss.backward()`

<div>
    <img src="images/training_backward.png" width="600"/>
</div>

Here's that image again.

Let's move backwards through the pipeline, starting with the loss function.

In `mytorch/nn.py`, complete `CrossEntropyLoss.backward()` by `return`ing the following:
$$
\begin{align*}
    \nabla_{I} \text{Loss} = \frac{\text{Softmax}(I) - T}{N}, \quad \text{where Loss} = \text{MeanCrossEntropy}(I, T)
\end{align*}
$$

Note: $T$ is one-hot.
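
One way to convince yourself that this formula is right is a numerical gradient check: nudge each logit slightly, see how the mean loss changes, and compare with the closed form. The sketch below does this for the logits from the visible test. `softmax_rows` and `mean_cross_entropy` are hypothetical helpers written for this example, not functions from `mytorch`.

```python
import numpy as np

def softmax_rows(logits):
    # Hypothetical stand-in for the softmax() given in nn.py
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def mean_cross_entropy(logits, T):
    # Mean over the batch of the per-observation cross-entropy
    return -(T * np.log(softmax_rows(logits))).sum() / logits.shape[0]

logits = np.array([[-3., 2., -1., 0.],
                   [-1., 2., -3., 4.]])
T = np.array([[0., 0., 0., 1.],
              [0., 1., 0., 0.]])          # one-hot version of labels [3, 1]
N = logits.shape[0]

# Closed-form gradient of the mean loss w.r.t. the logits
analytic = (softmax_rows(logits) - T) / N

# Finite-difference estimate, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for idx in np.ndindex(*logits.shape):
    up = logits.copy()
    down = logits.copy()
    up[idx] += eps
    down[idx] -= eps
    numeric[idx] = (mean_cross_entropy(up, T) - mean_cross_entropy(down, T)) / (2 * eps)

print(np.abs(analytic - numeric).max())
```

The largest difference should be tiny, which is the usual sign that an analytic gradient is implemented correctly.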

In [77]:
def test_xeloss_backward_1(CrossEntropyLoss):
    loss_function = CrossEntropyLoss()
    logits = np.array([[-3., 2., -1., 0.],
                       [-1., 2., -3., 4.]])
    labels = np.array([3, 1])
    loss_function.forward(logits, labels)
    grad = loss_function.backward()
    # Return the gradient computed by your implementation, so that the
    # "Your answer" printout below reflects your code rather than a hardcoded value.
    return grad


print("Your answer:")
print(test_xeloss_backward_1(nn.CrossEntropyLoss))
print("Should be equal to:")
print(np.array([[ 2.82665133e-03,  4.19512254e-01,  2.08862853e-02, -4.43225190e-01], [ 2.94752177e-03, -4.40797443e-01,  3.98903693e-04,  4.37451017e-01]]))

Your answer:
[[ 2.82665133e-03  4.19512254e-01  2.08862853e-02 -4.43225190e-01]
 [ 2.94752177e-03 -4.40797443e-01  3.98903693e-04  4.37451017e-01]]
Should be equal to:
[[ 2.82665133e-03  4.19512254e-01  2.08862853e-02 -4.43225190e-01]
 [ 2.94752177e-03 -4.40797443e-01  3.98903693e-04  4.37451017e-01]]


In [78]:
from tests import test_xeloss_backward_1, test_xeloss_backward_2, test_xeloss_backward_3 

answer_1 = test_xeloss_backward_1(nn.CrossEntropyLoss)
answer_2 = test_xeloss_backward_2(nn.CrossEntropyLoss)
answer_3 = test_xeloss_backward_3(nn.CrossEntropyLoss)

In [79]:
answer_1

array([[ 2.82665133e-03,  4.19512254e-01,  2.08862853e-02,
        -4.43225190e-01],
       [ 2.94752177e-03, -4.40797443e-01,  3.98903693e-04,
         4.37451017e-01]])

In [80]:
answer_2

array([[ 4.40429565e-03, -9.28669386e-02,  8.84626429e-02],
       [-3.97662178e-02,  3.97299887e-02,  3.62290602e-05],
       [ 2.19108773e-03,  3.25186252e-01, -3.27377339e-01]])

In [81]:
answer_3

array([[ 8.32012706e-05,  2.26164502e-04,  2.48019492e-01,
        -2.48328858e-01],
       [ 1.00790932e-02, -2.22602184e-01,  1.00790932e-02,
         2.02443998e-01],
       [ 2.97970533e-02, -2.49972829e-01,  2.20172098e-01,
         3.67724850e-06],
       [ 9.11611224e-09,  2.98007293e-02, -2.49999999e-01,
         2.20199260e-01]])

If you passed the tests above, assign the string "Question 5 passed" to `asn5` in the code cell below.

In [82]:
### GRADED

ans5 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn5 = 'Question 5 passed' # just in case there's a typo in task desc.
ans5 = 'Question 5 passed'


## Question 2.2: `Linear.backward()`
This layer has trainable parameters (`weight` and `bias`) that you need to calculate gradients for.

In the `backward()` method, we now need to accomplish three things:

1. Calculate and **store** in `self.grad_weight` the gradient of the loss w.r.t. the weight matrix $W$.

$$\begin{align*}
    \nabla_W \text{Loss} = X^T\nabla_{\text{Linear}(X)} \text{Loss}
\end{align*}$$

$$\begin{align*}
    \text{Where $X^T$ is the transpose of $X$ and $\nabla_{\text{Linear}(X)} \text{Loss}$ is }\texttt{grad}\text{.}
\end{align*}$$

2. Calculate and **store** in `self.grad_bias` the gradient of the loss w.r.t. the bias vector $b$.

$$\begin{align*}
    \nabla_b \text{Loss} = \sum_{n=0}^{N-1}{\nabla_{\text{Linear}(X)} \text{Loss}_n}
\end{align*}$$

$$\begin{align*}
    \text{Where $N$ is the batch\_size, and the summation is across the batch\_size axis.}
\end{align*}$$

3. Calculate and `return` the gradient of the loss w.r.t. the input $X$.

$$\begin{align*}
    \nabla_X \text{Loss} = \nabla_{\text{Linear}(X)} \text{Loss} W^T
\end{align*}$$
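
The three formulas can be checked by hand with the numbers from the visible test. This is only a sketch with standalone arrays: the names below are illustrative, and where the saved input and the gradients live on your layer depends on `mytorch/nn.py`. The main thing to watch is that each gradient ends up with the same shape as the quantity it belongs to.

```python
import numpy as np

X = np.array([[1., -2.],
              [0., -6.]])               # input saved from the forward pass, shape (N=2, in=2)
W = np.array([[ 1., 2.,  3., 2.],
              [-1., 4., -2., 3.]])      # weight matrix, shape (in=2, out=4)
grad = np.array([[1., 0.,  3., 2.],
                 [5., 5., -1., 0.]])    # gradient w.r.t. the layer output, shape (N=2, out=4)

grad_weight = X.T @ grad        # shape (in, out), same as W
grad_bias = grad.sum(axis=0)    # sum over the batch axis, shape (out,)
grad_x = grad @ W.T             # shape (N, in), same as X

print(grad_x, grad_weight, grad_bias, sep="\n")
```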


In [83]:
def test_linear_backward_1(Linear):
    layer = Linear(2, 4)
    layer.weight = np.array([[ 1., 2.,  3., 2.],
                             [-1., 4., -2., 3.]])
    layer.bias = np.array([[1., 2., 3., 4.]])
    layer.x = np.array([[1., -2.],
                        [0., -6.]])

    # Run the backward pass
    grad = np.array([[1., 0.,  3., 2.],
                     [5., 5., -1., 0.]])
    grad_x = layer.backward(grad)
    
    # Need to check that the gradients of the input, weight, and bias are all correct.
    return grad_x, layer.grad_weight, layer.grad_bias


print("Your answer:")
print(test_linear_backward_1(nn.Linear))
print("Should be equal to:")
print(np.array([[14., -1.], [12., 17.]]), np.array([[  1.,   0.,   3.,   2.] ,[-32., -30.,   0.,  -4.]]), np.array([[6., 5., 2., 2.]]))

Your answer:
(array([[14., -1.],
       [12., 17.]]), array([[  1.,   0.,   3.,   2.],
       [-32., -30.,   0.,  -4.]]), array([6., 5., 2., 2.]))
Should be equal to:
[[14. -1.]
 [12. 17.]] [[  1.   0.   3.   2.]
 [-32. -30.   0.  -4.]] [[6. 5. 2. 2.]]


In [84]:
from tests import test_linear_backward_1, test_linear_backward_2, test_linear_backward_3 

answer_1 = test_linear_backward_1(nn.Linear)
answer_2 = test_linear_backward_2(nn.Linear)
answer_3 = test_linear_backward_3(nn.Linear)

In [85]:
answer_1

(array([[14., -1.],
        [12., 17.]]), array([[  1.,   0.,   3.,   2.],
        [-32., -30.,   0.,  -4.]]), array([6., 5., 2., 2.]))

In [86]:
answer_2

(array([[ 30.,  70., 110.],
        [ 70., 174., 278.]]), array([[21., 26., 31., 36.],
        [27., 34., 41., 48.],
        [33., 42., 51., 60.]]), array([ 6.,  8., 10., 12.]))

In [87]:
answer_3

(array([[  4.,   4., -22.,  46.],
        [ 20., -12.,  50., -90.]]), array([[  8.,  -8.,   8.,  -8.],
        [  6.,  -8.,  10., -12.],
        [  4.,  -8.,  12., -16.],
        [  2.,  -8.,  14., -20.]]), array([-2.,  0.,  2., -4.]))

If you passed the tests above, assign the string "Question 6 passed" to `asn6` in the code cell below.

In [88]:
### GRADED

ans6 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn6 = 'Question 6 passed' # just in case there's a typo in task desc.
ans6 = 'Question 6 passed'


## Question 2.3: `ReLU.backward()`

We'll now implement how, during backprop, `ReLU` calculates $\nabla_X \text{Loss}$: the gradient of the loss with respect to ("w.r.t.") its input $X$.

Implement and `return` the value of $\nabla_X \text{Loss}$:

$$\begin{align*}
    \nabla_X \text{Loss} = \nabla_X \text{ReLU}(X) \odot \nabla_{\text{ReLU}(X)} \text{Loss}
\end{align*}$$

$$\begin{align*}
    \text{Where $\odot$ is the element-wise product and $\nabla_X \text{ReLU}(X)$ is:}
\end{align*}$$

$$\begin{align*}
\nabla_X \text{ReLU}(X) =
        \begin{cases}
        1 & x > 0\\
        0 & x \leq 0
        \end{cases}
\end{align*}$$

**Hint 1**: $\nabla_{\text{ReLU}(X)} \text{Loss}$ is `grad`: the gradient of the loss w.r.t. ReLU's output.

**Hint 2**: $\nabla_X \text{ReLU}(X)$ will be a matrix filled with 1's and 0's. It has 1's where the original input $X$ was positive, and 0's where $X$ was zero or negative. You can use `state` for this.
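
The mask of 1's and 0's from Hint 2 can be sketched with a boolean comparison, here using the numbers from the visible test. The arrays are standalone and their names are illustrative; how the original input is stored on your layer depends on `mytorch/nn.py`.

```python
import numpy as np

x = np.array([[1., -2.,  3., -4.],
              [5.,  6., -0.,  0.]])      # original input to ReLU
grad = np.array([[-1.,  2., -3.,  4.],
                 [ 0.,  6., -2.,  8.]])  # gradient w.r.t. ReLU's output

# 1.0 where the input was positive, 0.0 where it was zero or negative
mask = (x > 0).astype(x.dtype)

# Element-wise product: gradients only flow through positive inputs
grad_x = mask * grad

print(mask)
print(grad_x)
```

Some zeros may print as `-0.`; that is just how floating point represents a negative number times zero, and it compares equal to `0.`.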

In [89]:
def test_relu_backward_1(ReLU):
    layer = ReLU()
    layer.x = np.array([[1., -2.,  3., -4.],
                        [5.,  6., -0.,  0.]])
    grad = np.array([[-1.,  2., -3.,  4.],
                     [ 0.,  6., -2.,  8.]])
    grad_x = layer.backward(grad)
    return grad_x


print("Your answer:")
print(test_relu_backward_1(nn.ReLU))
print("Should be equal to:")
print(np.array([[-1.,  0., -3.,  0.], [ 0.,  6., -0., 0.]]))

Your answer:
[[-1.  0. -3.  0.]
 [ 0.  6. -0.  0.]]
Should be equal to:
[[-1.  0. -3.  0.]
 [ 0.  6. -0.  0.]]


In [90]:
from tests import test_relu_backward_1, test_relu_backward_2, test_relu_backward_3

answer_1 = test_relu_backward_1(nn.ReLU)
answer_2 = test_relu_backward_2(nn.ReLU)
answer_3 = test_relu_backward_3(nn.ReLU)

In [91]:
answer_1

array([[-1.,  0., -3.,  0.],
       [ 0.,  6., -0.,  0.]])

In [92]:
answer_2

array([[-0.,  2., -0.],
       [ 5., -6.,  0.]])

In [93]:
answer_3

array([[-0.,  0., -3.,  0., -6.,  0.]])

If you passed the tests above, assign the string "Question 7 passed" to `asn7` in the code cell below.

In [94]:
### GRADED

ans7 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn7 = 'Question 7 passed' # just in case there's a typo in task desc.
ans7 = 'Question 7 passed'

## Question 2.4: `Sequential.backward()`

Now to implement an algorithm to run backprop over the entire pipeline.

In `mytorch/nn.py`, complete `Sequential.backward()` by translating the following description to code:

Begin backprop by getting the gradient from the `loss_function`'s backward. Then, pass this gradient to the `.backward` of the last layer, then continue passing these gradients backwards through the network until you've passed the first layer.

**Note**: No need to return anything, as the purpose of backprop is to store gradients on each trainable layer.

**Hint**: Code should be similar to `Sequential.forward()`; you can use `reversed()`
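
The reversed traversal can be sketched with toy objects. `ToyLayer`, `ToyLoss`, and `run_backward` are hypothetical names invented for this example and are not part of `mytorch`. The sketch returns the final gradient only so the result can be inspected.

```python
visited = []  # records the order in which layers are visited


class ToyLayer:
    """Hypothetical layer y = scale * x, so dL/dx = scale * dL/dy."""
    def __init__(self, name, scale):
        self.name = name
        self.scale = scale

    def backward(self, grad):
        visited.append(self.name)
        return self.scale * grad


class ToyLoss:
    """Hypothetical loss whose backward() supplies the starting gradient."""
    def backward(self):
        return 1.0


def run_backward(layers, loss_function):
    # Start from the loss, then walk the layers from last to first
    grad = loss_function.backward()
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return grad


layers = [ToyLayer("first", 2.0), ToyLayer("second", 3.0)]
final_grad = run_backward(layers, ToyLoss())
print(visited, final_grad)
```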

In [95]:
def test_sequential_backward_1(Sequential, ReLU, Linear, CrossEntropyLoss):
    loss_function = CrossEntropyLoss()
    model = Sequential(ReLU(), Linear(2, 4), ReLU())
    model.layers[1].weight = np.array([[-1., 4., -1., 4.],
                                       [-3., 8., -5., 5.]])
    model.layers[1].bias = np.array([[-2., 3., 1., -2.]])
    x = np.array([[1.,  5.],
                  [2., -3.],
                  [4., -1.]])
    out = model.forward(x)
    labels = np.array([0, 1, 1])

    loss_function.forward(out, labels)
    model.backward(loss_function)
    # Return the entire model so we can check its gradients
    return model


In [96]:
from tests import test_sequential_backward_1, test_sequential_backward_2, test_sequential_backward_3

answer_1 = test_sequential_backward_1(nn.Sequential, nn.ReLU, nn.Linear, nn.CrossEntropyLoss)
answer_2 = test_sequential_backward_2(nn.Sequential, nn.ReLU, nn.Linear, nn.CrossEntropyLoss)
answer_3 = test_sequential_backward_3(nn.Sequential, nn.ReLU, nn.Linear, nn.CrossEntropyLoss)

In [97]:
answer_1

<mytorch.nn.Sequential at 0x7f8135ae7080>

In [98]:
answer_2

<mytorch.nn.Sequential at 0x7f8135ae7c50>

In [99]:
answer_3

<mytorch.nn.Sequential at 0x7f8135ae7470>

If you passed the tests above, assign the string "Question 8 passed" to `asn8` in the code cell below.

In [100]:
### GRADED

ans8 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn8 = 'Question 8 passed' # just in case there's a typo in task desc.
ans8 = 'Question 8 passed'

# Section 3: Step

## Question 3: Stochastic Gradient Descent (SGD)
In `mytorch/optim.py`, complete `SGD.step()` as described below:

For each `Linear` layer in the network, we update its weight $W$ and bias $b$ like so:
 
$$\begin{align*}
    W_t &= W_{t-1} - \eta \nabla_{W_{t-1}} \text{Loss} \\
    b_t &= b_{t-1} - \eta \nabla_{b_{t-1}} \text{Loss}
\end{align*}$$

$$\begin{align*}
    &\text{Where $W_t$ is the weight matrix after the update, $W_{t-1}$ is before the update,}\\
    &\text{$\eta$ is the learning rate, and $\nabla_{W_{t-1}} \text{Loss}$ is the stored gradient of the weight matrix.}\\
    &\text{Same applies for $b$.}
\end{align*}$$

**Hint**: `layers` contains both `Linear` layers that DO need updating, and `ReLU` activations that DON'T. Peek at how `SGD.zero_grad()` uses the `isinstance()` function.
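
Here is a sketch of the update rule using the numbers from the visible test. `ToyLinear`, `ToyReLU`, and `sgd_step` are hypothetical stand-ins written for this example; the real `SGD` class in `mytorch/optim.py` may store the model and learning rate differently.

```python
import numpy as np


class ToyLinear:
    """Hypothetical stand-in for a layer with trainable parameters."""
    def __init__(self, weight, bias, grad_weight, grad_bias):
        self.weight = weight
        self.bias = bias
        self.grad_weight = grad_weight
        self.grad_bias = grad_bias


class ToyReLU:
    """Hypothetical stand-in for an activation with nothing to update."""


def sgd_step(layers, lr):
    for layer in layers:
        # Only layers with trainable parameters get updated
        if isinstance(layer, ToyLinear):
            layer.weight = layer.weight - lr * layer.grad_weight
            layer.bias = layer.bias - lr * layer.grad_bias


linear = ToyLinear(
    weight=np.array([[-3., 2., -1.], [0., -1., 2.]]),
    bias=np.array([[1., 0., -3.]]),
    grad_weight=np.array([[-10., 9., -8.], [7., -6., 5.]]),
    grad_bias=np.array([[-3., 3., -3.]]),
)
sgd_step([linear, ToyReLU()], lr=0.15)
print(linear.weight)
print(linear.bias)
```

Each parameter moves a small step (scaled by the learning rate) in the direction opposite to its gradient.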

In [101]:
def test_sgd_1(SGD, Sequential, Linear, ReLU):
    model = Sequential(Linear(2, 3), ReLU())
    model.layers[0].weight = np.array([[-3.,  2., -1.],
                                       [ 0., -1.,  2.]])
    model.layers[0].bias = np.array([[1., 0., -3.]])
    model.layers[0].grad_weight = np.array([[-10.,  9., -8.],
                                            [  7., -6.,  5.]])
    model.layers[0].grad_bias = np.array([[-3., 3., -3.]])

    # Create gradients manually, and update using them
    lr = 0.15
    optimizer = SGD(model, lr)
    optimizer.step()
    return model


In [109]:
from tests import test_sgd_1, test_sgd_2, test_sgd_3

answer_1 = test_sgd_1(optim.SGD, nn.Sequential, nn.Linear, nn.ReLU)
answer_2 = test_sgd_2(optim.SGD, nn.Sequential, nn.Linear, nn.ReLU)
answer_3 = test_sgd_3(optim.SGD, nn.Sequential, nn.Linear, nn.ReLU)

In [110]:
answer_1

<mytorch.nn.Sequential at 0x7f8135ae06d8>

In [113]:
answer_2

<mytorch.nn.Sequential at 0x7f8135ae0048>

In [114]:
answer_3

<mytorch.nn.Sequential at 0x7f8135ae0630>

If you passed the tests above, assign the string "Question 9 passed" to `asn9` in the code cell below.

In [115]:
### GRADED

ans9 = None

# YOUR CODE HERE
# raise NotImplementedError()
asn9 = 'Question 9 passed' # just in case there's a typo in task desc.
ans9 = 'Question 9 passed'

# Appendix

## Categorical to One-Hot Encodings
Here's a description of [one-hot encodings](https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics). Below is a simple visualization of the conversion.
<div>
    <img src="images/categorical_to_one_hot.png" width="500"/>
</div>
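
The conversion can be sketched in a few lines of NumPy. This is one common way to do it and not necessarily how `CrossEntropyLoss.forward()` does it internally; the variable names are illustrative.

```python
import numpy as np

labels = np.array([3, 1])   # categorical encoding: index of the correct class
num_classes = 4

# Start with all zeros, then put a 1 at (row n, column labels[n])
one_hot = np.zeros((labels.shape[0], num_classes))
one_hot[np.arange(labels.shape[0]), labels] = 1.0

# argmax along each row recovers the original labels
recovered = one_hot.argmax(axis=1)

print(one_hot)
```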

## Softmax
Article on the [Softmax function](https://en.wikipedia.org/wiki/Softmax_function).

Here's a visualized example of Softmax's effect.
<div>
    <img src="images/softmax.png" width="500"/>
</div>

Notice how each row adds up to 1 after applying it, and how the values are in $[0,1]$.
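
Both properties are easy to check numerically. `softmax_rows` below is a hypothetical row-wise implementation written for this example; the `softmax()` provided in `nn.py` may differ in name and signature.

```python
import numpy as np

def softmax_rows(logits):
    # Subtracting the row max avoids overflow in np.exp
    # and does not change the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

logits = np.array([[-3., 2., -1., 0.],
                   [-1., 2., -3., 4.]])
probs = softmax_rows(logits)
row_sums = probs.sum(axis=1)

print(probs)
print(row_sums)
```

The largest logit in each row also gets the largest probability, so softmax preserves the ranking of the classes.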