# Problem 1
Use back-propagation to calculate the gradients of $$f(W,x)=||\sigma(Wx)||^2$$
with respect to $x$ and $W$. Here, $\|\cdot\|^2$ is the L2 loss (the squared L2 norm), $W$ is a $3\times 3$ matrix, $x$ is a $3\times 1$ vector, and $\sigma(\cdot)$ is the ReLU function, applied element-wise.

$$
W = 
\begin{bmatrix}
W_{1,1} & W_{1,2} & W_{1,3}\\
W_{2,1} & W_{2,2} & W_{2,3}\\
W_{3,1} & W_{3,2} & W_{3,3}
\end{bmatrix},\ \ x = \begin{bmatrix} x_{1}\\ x_{2}\\ x_{3} \end{bmatrix}
$$
Let $z = Wx$:
$$
z = \begin{bmatrix}
W_{1,1}x_1 + W_{1,2}x_2 + W_{1,3}x_3\\
W_{2,1}x_1 + W_{2,2}x_2 + W_{2,3}x_3\\
W_{3,1}x_1 + W_{3,2}x_2 + W_{3,3}x_3
\end{bmatrix}
=
\begin{bmatrix}
z_{1}\\
z_{2}\\
z_{3}
\end{bmatrix}
$$
Applying the ReLU element-wise gives $a=\sigma(z)$:
$$
a = \begin{bmatrix}
\max(0,z_1)\\
\max(0,z_2)\\
\max(0,z_3)
\end{bmatrix} = \begin{bmatrix}
a_{1}\\
a_{2}\\
a_{3}
\end{bmatrix}$$

$$a_i =\begin{cases} z_i & z_i > 0 \\ 0 & z_i \leq 0 \end{cases} $$
Now we are left with $$f(W,x)=||a||^2$$
The gradient with respect to $a$ is computed component by component:
$$\dfrac{\partial f}{\partial a_i} = \dfrac{\partial }{\partial a_i}(a_1^2 + a_2^2 + a_3^2) = 2a_i$$
so the gradient of $f$ with respect to $a$ is
$$\nabla_af=
\begin{bmatrix}
2a_{1}\\
2a_{2}\\
2a_{3}
\end{bmatrix}=2a$$
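Next we want to find $\nabla_zf$. By the chain rule,
$$\dfrac{\partial{f}}{\partial{z}}=\dfrac{\partial{f}}{\partial{a}}\dfrac{\partial{a}}{\partial{z}}$$
and the derivative of the ReLU (taking the value at $z_i = 0$ to be $0$ by convention) is:
$$\dfrac{\partial{a_i}}{\partial{z_i}} = \begin{cases}1 & z_i > 0 \\ 0  & z_i \leq 0 \end{cases}$$
Since each $a_i$ depends only on $z_i$, the Jacobian is diagonal:
$$\dfrac{\partial{a}}{\partial{z}} = \begin{bmatrix}
I_{(z_1>0)} &0&0\\
0&I_{(z_2>0)}&0\\
0&0&I_{(z_3>0)}
\end{bmatrix}$$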
so we get
$$\nabla_zf=\dfrac{\partial{f}}{\partial{z}}= \begin{bmatrix}2a_1 \cdot I_{(z_1 > 0)} \\ 2a_2 \cdot I_{(z_2 > 0)} \\ 2a_3 \cdot I_{(z_3 > 0)} \end{bmatrix} = 
\begin{bmatrix}2a_1 \text{ if } z_1 > 0, \text{ else } 0 \\2a_2 \text{ if } z_2 > 0, \text{ else } 0 \\2a_3 \text{ if } z_3 > 0, \text{ else } 0 \end{bmatrix} = 2a\odot I_{(z>0)}$$
where $\odot$ denotes the element-wise product and $I_{(z>0)}$ is the vector of indicators $I_{(z_i>0)}$.

Now to find $\nabla_x f$:
$$\dfrac{\partial{f}}{\partial{x}}=\dfrac{\partial{f}}{\partial{z}}\dfrac{\partial{z}}{\partial{x}}$$
Since $z_k=\sum_i W_{k,i}x_i$, the entries of $\dfrac{\partial{z}}{\partial{x}}$ are:
$$\dfrac{\partial{z_k}}{\partial{x_i}}=W_{k,i}$$

so, component by component,
$$\dfrac{\partial{f}}{\partial{x_i}}=\sum_j\dfrac{\partial{f}}{\partial{z_j}}\dfrac{\partial{z_j}}{\partial{x_i}}=\sum_j 2a_j\cdot I_{(z_j>0)}\, W_{j,i}$$
which in matrix form is
$$\nabla_xf = W^T\cdot \nabla_zf$$
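The gradient $\nabla_Wf$ follows in the same way. Since $z_k=\sum_j W_{k,j}x_j$, the entry $W_{i,j}$ only affects $z_i$, with $\dfrac{\partial{z_i}}{\partial{W_{i,j}}}=x_j$, so $\dfrac{\partial{f}}{\partial{W_{i,j}}}=\dfrac{\partial{f}}{\partial{z_i}}\,x_j$. In matrix form:
$$\nabla_Wf = \nabla_zf \cdot x^T$$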
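
The two results can be sanity-checked numerically. The following is a minimal NumPy sketch, not part of the original solution: it implements the forward and backward pass derived above and compares the analytic gradients with central finite differences. The helper names `forward_backward` and `numerical_grad`, the random seed, and the tolerance are assumptions made for illustration, and $x$ is stored as a 1-D array of length 3 rather than a $3\times 1$ column.

```python
import numpy as np

def forward_backward(W, x):
    """Return f = ||relu(W x)||^2 and its gradients w.r.t. x and W."""
    z = W @ x                        # pre-activation, shape (3,)
    a = np.maximum(z, 0.0)           # ReLU applied element-wise
    f = float(a @ a)                 # squared L2 norm of a
    grad_z = 2.0 * a * (z > 0)       # 2a (element-wise) indicator of z > 0
    grad_x = W.T @ grad_z            # W^T * grad_z
    grad_W = np.outer(grad_z, x)     # grad_z * x^T
    return f, grad_x, grad_W

def numerical_grad(func, v, eps=1e-6):
    """Central finite differences of a scalar function func at array v."""
    g = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        step = np.zeros_like(v)
        step[idx] = eps
        g[idx] = (func(v + step) - func(v - step)) / (2 * eps)
    return g

# Hypothetical random test point (assumption: any W, x would do).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
x = rng.normal(size=3)

f, grad_x, grad_W = forward_backward(W, x)
num_grad_x = numerical_grad(lambda v: forward_backward(W, v)[0], x)
num_grad_W = numerical_grad(lambda M: forward_backward(M, x)[0], W)

print("grad_x matches:", np.allclose(grad_x, num_grad_x, atol=1e-5))
print("grad_W matches:", np.allclose(grad_W, num_grad_W, atol=1e-5))
```

The finite-difference comparison is only reliable away from the ReLU kink: if some $z_i$ lies within about `eps` of zero, the numerical estimate can disagree with the convention $\sigma'(0)=0$ used in the derivation.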

---

# Problem 2
In this problem, you need to use Gradient Descent (GD) to train the linear classifier from HW1, i.e., find the parameters $W$, and then use it to recognize handwritten digits. Continue to use "Cross Entropy" as the loss function.

Requirements: 
1) manually derive the gradients of the linear classifier when using cross-entropy as the loss function, and write code to implement it for recognizing handwritten digits
2) the test accuracy should be at least 85%

In [1]:
import numpy as np
from urllib import request
import gzip
import pickle
import math
import operator
import time
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

In [2]:
filename = [
["training_images","train-images-idx3-ubyte.gz"],
["test_images","t10k-images-idx3-ubyte.gz"],
["training_labels","train-labels-idx1-ubyte.gz"],
["test_labels","t10k-labels-idx1-ubyte.gz"]
]

def download_mnist():
    base_url = "https://ossci-datasets.s3.amazonaws.com/mnist/"
    for name in filename:
        print("Downloading "+name[1]+"...")
        request.urlretrieve(base_url+name[1], name[1])
    print("Download complete.")

def save_mnist():
    mnist = {}
    for name in filename[:2]:
        with gzip.open(name[1], 'rb') as f:
            mnist[name[0]] = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1,28*28)
    for name in filename[-2:]:
        with gzip.open(name[1], 'rb') as f:
            mnist[name[0]] = np.frombuffer(f.read(), np.uint8, offset=8)
    with open("mnist.pkl", 'wb') as f:
        pickle.dump(mnist,f)
    print("Save complete.")

def init():
    download_mnist()
    save_mnist()
#    print ((load()[0]).shape)
def load():
    with open("mnist.pkl",'rb') as f:
        mnist = pickle.load(f)
    return mnist["training_images"], mnist["training_labels"], mnist["test_images"], mnist["test_labels"]

if __name__ == '__main__':
    init()

Downloading train-images-idx3-ubyte.gz...
Downloading t10k-images-idx3-ubyte.gz...
Downloading train-labels-idx1-ubyte.gz...
Downloading t10k-labels-idx1-ubyte.gz...
Download complete.
Save complete.


Here I load the data and convert it to tensors

In [3]:
x_train, y_train, x_test, y_test = load()  # images are already flattened to (N, 784); convert the NumPy arrays to tensors below
X_train = torch.FloatTensor(x_train)
y_train = torch.LongTensor(y_train)
X_test = torch.FloatTensor(x_test)
y_test = torch.LongTensor(y_test)

Here I create the linear classifier as an `nn.Module`

In [4]:
class LinearClassifier(nn.Module):
    def __init__(self):
        super(LinearClassifier, self).__init__()
        self.linear = nn.Linear(784, 10)

    def forward(self, x):
        return self.linear(x)

I define the model as the previously defined linear classifier    
I also define the criterion as the cross-entropy loss

In [5]:
model = LinearClassifier()
criterion = nn.CrossEntropyLoss()
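
For reference, `nn.CrossEntropyLoss` takes raw logits and integer class labels, applies log-softmax followed by the negative log-likelihood, and averages over the batch by default. With logits $z = Wx + b$, the gradient of that loss with respect to the logits is the softmax output minus the one-hot label. The block below is a minimal NumPy sketch of this manual gradient under the default mean reduction; it is not the author's implementation, and the function names, the toy sizes (5 samples, 4 features, 3 classes), and the finite-difference check are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, with the row maximum subtracted for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_and_grads(W, b, X, y):
    """Mean cross-entropy of logits X @ W.T + b, plus gradients w.r.t. W and b."""
    n = X.shape[0]
    p = softmax(X @ W.T + b)                    # class probabilities, (n, classes)
    loss = -np.log(p[np.arange(n), y]).mean()   # mean negative log-likelihood
    grad_z = p.copy()
    grad_z[np.arange(n), y] -= 1.0              # softmax minus one-hot label
    grad_z /= n                                 # account for the mean over the batch
    grad_W = grad_z.T @ X                       # same shape as W: (classes, features)
    grad_b = grad_z.sum(axis=0)                 # same shape as b: (classes,)
    return loss, grad_W, grad_b

# Hypothetical toy data (assumption: sizes chosen only to keep the check fast).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
W = rng.normal(size=(3, 4)) * 0.1
b = np.zeros(3)

loss, grad_W, grad_b = cross_entropy_and_grads(W, b, X, y)

# Finite-difference check of a single weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (cross_entropy_and_grads(W_plus, b, X, y)[0]
           - cross_entropy_and_grads(W_minus, b, X, y)[0]) / (2 * eps)
print("analytic:", grad_W[1, 2], "numeric:", numeric)
```

A gradient descent step would then update the parameters as `W -= lr * grad_W` and `b -= lr * grad_b`, where the learning rate `lr` is a hyperparameter to be chosen.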

Standard Gradient (SG)