# Deep Learning
**Multilayer Perceptron (MLP)**: In this homework you will implement and train a 3-layer neural network that classifies images of handwritten digits from the MNIST dataset. The input to the network is a 28 × 28-pixel image, flattened into a 784-dimensional vector. The output is a vector of 10 probabilities (one for each digit). Specifically, the network should implement a function $g: \mathbb{R}^{784} \rightarrow \mathbb{R}^{10}$, where:

$$\mathbf{z}_{1} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$$
$$\mathbf{h}_1 = \mathrm{ReLU}(\mathbf{z}_1)$$
$$\mathbf{z}_2 = \mathbf{W}^{(2)}\mathbf{h}_1 + \mathbf{b}^{(2)}$$
$$\hat{\mathbf{y}} = g(\mathbf{x}) = \mathrm{Softmax}(\mathbf{z}_2)$$
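
As a rough illustration, the four equations can be sketched in NumPy for a single input. This assumes the column-vector convention of the formulas (so `W1` has shape (hidden, 784) and `W2` has shape (10, hidden)), whereas the starter code further down multiplies in the other order. The function name `forward`, the random initialization, and the random stand-in image are assumptions made only for this sketch.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass for one input vector x of shape (784,)."""
    z1 = W1 @ x + b1                 # hidden-layer pre-activation
    h1 = np.maximum(z1, 0.0)         # ReLU
    z2 = W2 @ h1 + b2                # output logits
    e = np.exp(z2 - np.max(z2))      # shift by the max for numerical stability
    y_hat = e / np.sum(e)            # softmax
    return z1, h1, z2, y_hat

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 784, 50, 10
W1 = rng.normal(scale=0.01, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_out, n_hidden))
b2 = np.zeros(n_out)
x = rng.random(n_in)                 # stand-in for a flattened image

z1, h1, z2, y_hat = forward(x, W1, b1, W2, b2)
print(y_hat.shape, y_hat.sum())
```

The output has one entry per digit and sums to 1, as a probability vector should.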

**Forward Propagation**: Compute the intermediate outputs $\mathbf{z}_{1}$, $\mathbf{h}_{1}$, $\mathbf{z}_{2}$, and $\hat{\mathbf{y}}$ by following the directed graph shown below:

![jupyter](./img/mlp.jpg)

**Loss function**: After forward propagation, you should use the cross-entropy loss function: 
$$ f_{CE}(\mathbf{W}^{(1)},\mathbf{b}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(2)}) =  - \frac{1}{n}\sum_{i=1}^{n} \sum_{k=1}^{10} \mathbf{y}_k^{(i)} \log \hat{\mathbf{y}}_k^{(i)} $$
where $n$ is the number of examples.
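
A small sketch of this loss for one-hot labels may help make the double sum concrete. The helper name `cross_entropy` and the `eps` guard against `log(0)` are assumptions for illustration, not part of the assignment.

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Mean cross-entropy; Y and Y_hat have shape (n, 10), rows of Y are one-hot."""
    # Inner sum over the 10 classes, then the average over the n examples
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

Y = np.eye(10)[[3, 7]]            # two one-hot labels: digits 3 and 7
Y_hat = np.full((2, 10), 0.1)     # a uniform prediction for both examples
loss = cross_entropy(Y, Y_hat)
print(loss)                       # close to log(10)
```

A uniform prediction over 10 classes gives a loss of about $\log 10$, which is a useful reference value when training starts from small random weights.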

**Backward Propagation**: To train the neural network, compute the gradient of the loss with respect to each parameter by backpropagation and update the parameters with stochastic gradient descent (SGD).
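
The SGD update itself is short: sample a mini-batch, compute the gradient on it, and step against the gradient. The sketch below shows that loop on a toy least-squares problem rather than on the network, so the data, the helper `grad_mse`, the learning rate, and the batch size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (assumed for illustration): recover w_true from noiseless linear data.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true

def grad_mse(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb)**2) on one mini-batch."""
    return Xb.T @ (Xb @ w - yb) / len(Xb)

w = np.zeros(2)
learning_rate, batch_size = 0.1, 16
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    w = w - learning_rate * grad_mse(X[idx], y[idx], w)        # SGD update

print(w)
```

For the network, `w` would be the packed parameter vector and `grad_mse` would be replaced by the gradient of the cross-entropy loss.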

# Question 1:
Compute the gradient of the loss with respect to each parameter:

$$ \frac{\partial f_{CE}}{\partial \mathbf{W}^{(2)}} =    - \frac{1}{n}\sum_{i=1}^{n} (\mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)}) {\mathbf{h}_1^{(i)}}^T$$


$$ \frac{\partial f_{CE}}{\partial \mathbf{b}^{(2)}} =    - \frac{1}{n}\sum_{i=1}^{n} (\mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)}) $$


$$ \frac{\partial f_{CE}}{\partial \mathbf{W}^{(1)}} =    - \frac{1}{n}\sum_{i=1}^{n} \frac{d \mathbf{h}_1^{(i)}}{d \mathbf{z}_1^{(i)}} {\mathbf{W}^{(2)}}^T  (\mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)})  {\mathbf{x}^{(i)}}^T $$


$$ \frac{\partial f_{CE}}{\partial \mathbf{b}^{(1)}} =     - \frac{1}{n}\sum_{i=1}^{n} \frac{d \mathbf{h}_1^{(i)}}{d \mathbf{z}_1^{(i)}} {\mathbf{W}^{(2)}}^T  (\mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)})  $$
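
One way to arrive at these expressions is the chain rule applied to a single example, with the index $(i)$ dropped. The shorthand $\boldsymbol{\delta}_1$, $\boldsymbol{\delta}_2$ is notation introduced only for this sketch. For a one-hot label $\mathbf{y}$ and the per-example loss $f = -\sum_{k} \mathbf{y}_k \log \hat{\mathbf{y}}_k$, the softmax and the logarithm combine into

$$ \boldsymbol{\delta}_2 = \frac{\partial f}{\partial \mathbf{z}_2} = \hat{\mathbf{y}} - \mathbf{y} $$

$$ \frac{\partial f}{\partial \mathbf{W}^{(2)}} = \boldsymbol{\delta}_2 \mathbf{h}_1^T, \qquad \frac{\partial f}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}_2 $$

$$ \boldsymbol{\delta}_1 = \frac{\partial f}{\partial \mathbf{z}_1} = \frac{d \mathbf{h}_1}{d \mathbf{z}_1} {\mathbf{W}^{(2)}}^T \boldsymbol{\delta}_2, \qquad \frac{d \mathbf{h}_1}{d \mathbf{z}_1} = \mathrm{diag}\left(\mathbb{1}[\mathbf{z}_1 > 0]\right) $$

$$ \frac{\partial f}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}_1 \mathbf{x}^T, \qquad \frac{\partial f}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}_1 $$

Averaging over the $n$ examples and writing $\hat{\mathbf{y}} - \mathbf{y} = -(\mathbf{y} - \hat{\mathbf{y}})$ gives the four formulas above.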

# Question 2: 
Implement stochastic gradient descent for the network shown above in the *Starter Code* below.

# Question 3: 
Verify that your implemented gradient functions are correct using the numerical derivative approximation in *scipy.optimize.check_grad*.

See the call to `check_grad` in the starter code.

Note that the discrepancy should be less than 0.01.
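
For reference, `scipy.optimize.check_grad(func, grad, x0)` compares an analytic gradient against a finite-difference approximation at `x0` and returns the norm of the difference. The toy function below is an assumption chosen only to show what a passing and a failing check look like.

```python
import numpy as np
from scipy.optimize import check_grad

def f(w):
    return np.sum(w ** 2) + np.sum(np.sin(w))

def grad_f(w):
    return 2 * w + np.cos(w)       # correct analytic gradient

def wrong_grad_f(w):
    return 2 * w                   # deliberately drops the cos term

w0 = np.array([0.5, -1.0, 2.0])
err_ok = check_grad(f, grad_f, w0)
err_bad = check_grad(f, wrong_grad_f, w0)
print(err_ok, err_bad)
```

A correct gradient yields a discrepancy close to zero, while the faulty one yields a value on the order of the missing term.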

# Question 4: 
Train the network using suitable hyperparameters (batch size, learning rate, etc.), and report the train accuracy and test accuracy in the *Starter Code* below.


**NOTE THAT**: You only need to submit this '.ipynb' file.

# Starter Code: 

Packing and unpacking W1, b1, W2, b2:

In [None]:
# Given a vector w containing all the weights and bias vectors, extract
# and return the individual weights and biases W1, b1, W2, b2.
# This is useful for performing a gradient check with check_grad.
def unpack (w):
    W1 = w[0:NUM_INPUT*NUM_HIDDEN].reshape(NUM_INPUT,NUM_HIDDEN)
    b1 = w[NUM_INPUT*NUM_HIDDEN:NUM_INPUT*NUM_HIDDEN+NUM_HIDDEN].reshape(NUM_HIDDEN)
    W2 = w[NUM_INPUT*NUM_HIDDEN+NUM_HIDDEN:NUM_INPUT*NUM_HIDDEN+NUM_HIDDEN+NUM_HIDDEN*NUM_OUTPUT].reshape(NUM_HIDDEN,NUM_OUTPUT)
    b2 = w[-NUM_OUTPUT:].reshape(NUM_OUTPUT)
    return W1, b1, W2, b2

# Given individual weights and biases W1, b1, W2, b2, concatenate them and
# return a vector w containing all of them.
# This is useful for performing a gradient check with check_grad.
def pack (W1, b1, W2, b2):
    W1f = W1.reshape(NUM_INPUT*NUM_HIDDEN)
    W2f = W2.reshape(NUM_HIDDEN*NUM_OUTPUT)
    b1f = b1.reshape(NUM_HIDDEN)
    b2f = b2.reshape(NUM_OUTPUT)
    w = np.concatenate((W1f,b1f,W2f,b2f),axis=0)
    return w
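
A quick way to gain confidence in a packing scheme like this is a round-trip test: pack, unpack, and compare. The sketch below uses a shape-driven variant with small illustrative sizes. The names `pack_params`, `unpack_params`, and `shapes` are hypothetical and not part of the starter code.

```python
import numpy as np

# Small illustrative sizes (assumptions; the assignment uses 784, 50, 10).
N_IN, N_HID, N_OUT = 4, 3, 2
shapes = [(N_IN, N_HID), (N_HID,), (N_HID, N_OUT), (N_OUT,)]

def pack_params(params):
    """Flatten every array and concatenate them into one vector."""
    return np.concatenate([p.ravel() for p in params])

def unpack_params(w):
    """Split the flat vector at the cumulative sizes and restore each shape."""
    sizes = [int(np.prod(s)) for s in shapes]
    offsets = np.cumsum(sizes)[:-1]
    return [chunk.reshape(s) for chunk, s in zip(np.split(w, offsets), shapes)]

rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in shapes]
w = pack_params(params)
restored = unpack_params(w)
round_trip_ok = all(np.array_equal(a, b) for a, b in zip(params, restored))
print(w.shape, round_trip_ok)
```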

Define ReLU, softmax, and the gradient of ReLU:

In [None]:
def ReLU(x):
    # Element-wise max(x, 0); returns a new array and leaves x unchanged
    return np.maximum(x, 0.)

def softmax(x):
    # Expects a 1-D vector of logits
    e = np.exp(x)
    return e/np.sum(e)

def d_ReLU(h1):
    # Jacobian of ReLU as a diagonal matrix: 1 where the unit is active, 0 elsewhere
    return np.diag((h1 > 0).astype(float))
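
Exponentiating large logits directly can overflow. A common remedy, sketched here under the hypothetical name `stable_softmax`, is to subtract the maximum logit first, which leaves the result mathematically unchanged. The `axis` and `keepdims` arguments also let the same function handle a batch of rows.

```python
import numpy as np

def stable_softmax(z):
    """Softmax over the last axis, shifted by the max to avoid overflow."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([1000.0, 1001.0, 1002.0])   # np.exp(z) alone would overflow
p = stable_softmax(z)
print(p)

batch = np.array([[0.0, 0.0], [1000.0, 0.0]])
p_batch = stable_softmax(batch)           # one probability vector per row
print(p_batch)
```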

fCE:

In [None]:
# Given training images X, associated labels Y, and a vector of combined weights
# and bias terms w, compute and return the cross-entropy (CE) loss. You might
# want to extend this function to return multiple arguments (in which case you
# will also need to modify slightly the gradient check code below).
def fCE (X, Y, w):
    cost = 0
    W1, b1, W2, b2 = unpack(w)
    n = len(X)
    for i in range(n):
        z1 = np.dot(X[i],W1)+b1
        h1 = ReLU(z1)
        z2 = np.dot(h1,W2)+b2
        y = softmax(z2)
        cost += (-1/n)*np.dot(Y[i],np.log(y).transpose())
    return cost

gradCE:

In [None]:
# Given training images X, associated labels Y, and a vector of combined weights
# and bias terms w, compute and return the gradient of fCE. You might
# want to extend this function to return multiple arguments (in which case you
# will also need to modify slightly the gradient check code below).
def gradCE(X,Y,w):
    W1, b1, W2, b2 = unpack(w)
    # initialize
    grad_W1 = W1*0.
    grad_b1 = b1*0.
    grad_W2 = W2*0.
    grad_b2 = b2*0.
    n = len(X)
    # compute the grad
    for i in range(n):
        z1 = np.dot(X[i],W1)+b1
        h1 = ReLU(z1)
        z2 = np.dot(h1,W2)+b2
        y = softmax(z2)
        # Per-example gradients, before the -1/n factor
        grad_b2_s = Y[i]-y
        grad_W2_s = np.outer(h1, grad_b2_s)
        grad_b1_s = np.dot(np.dot(grad_b2_s,W2.transpose()),d_ReLU(h1))
        grad_W1_s = np.outer(X[i], grad_b1_s)
        # Accumulate, scaled by -1/n
        grad_b2 += (-1./n)*grad_b2_s
        grad_W2 += (-1./n)*grad_W2_s
        grad_b1 += (-1./n)*grad_b1_s
        grad_W1 += (-1./n)*grad_W1_s
    # pack
    w_grad = pack(grad_W1,grad_b1,grad_W2,grad_b2)
    return w_grad
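
The per-example loop can also be written in batched form, which is usually much faster in NumPy. The following is a sketch rather than a drop-in replacement: it assumes the same row-vector convention as the code above (rows of `X` are examples), returns the gradients unpacked, and uses small made-up sizes. The function name `loss_and_grads` is hypothetical. A central finite difference on one weight serves as a spot check.

```python
import numpy as np

def loss_and_grads(X, Y, W1, b1, W2, b2):
    """Batched forward and backward pass (rows of X are examples)."""
    n = X.shape[0]
    Z1 = X @ W1 + b1                          # (n, hidden)
    H1 = np.maximum(Z1, 0.0)                  # ReLU
    Z2 = H1 @ W2 + b2                         # (n, out)
    Z2 = Z2 - Z2.max(axis=1, keepdims=True)   # stabilise the softmax
    P = np.exp(Z2)
    P /= P.sum(axis=1, keepdims=True)         # softmax per row
    loss = -np.sum(Y * np.log(P)) / n
    D2 = (P - Y) / n                          # gradient w.r.t. Z2
    gW2 = H1.T @ D2
    gb2 = D2.sum(axis=0)
    D1 = (D2 @ W2.T) * (Z1 > 0)               # backprop through ReLU
    gW1 = X.T @ D1
    gb1 = D1.sum(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

rng = np.random.default_rng(0)
n, d, h, k = 5, 6, 4, 3                       # tiny illustrative sizes
X = rng.normal(size=(n, d))
Y = np.eye(k)[rng.integers(0, k, size=n)]     # random one-hot labels
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=h)
W2, b2 = rng.normal(size=(h, k)), rng.normal(size=k)

loss, (gW1, gb1, gW2, gb2) = loss_and_grads(X, Y, W1, b1, W2, b2)

# Central finite difference on a single entry of W1
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (loss_and_grads(X, Y, W1p, b1, W2, b2)[0]
           - loss_and_grads(X, Y, W1m, b1, W2, b2)[0]) / (2 * eps)
gap = abs(numeric - gW1[0, 0])
print(gap)
```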


train:

In [None]:
# Forward pass: return the predicted class probabilities for a single input X.
def compute(X,w):
    W1, b1, W2, b2 = unpack(w)
    z1 = np.dot(X,W1)+b1
    h1 = ReLU(z1)
    z2 = np.dot(h1,W2)+b2
    y_pred = softmax(z2)
    return y_pred

# Given training and testing datasets and an initial set of weights/biases w,
# train the NN with mini-batch SGD.
# Return the train accuracy (measured on the last mini-batch) and the test accuracy.
def train (trainX, trainY, testX, testY, w, BATCH_SIZE = 64,NUM_ITERATION = 150):
    rate = 0.05
    test_acc = 0
    for i in range(NUM_ITERATION):
#         print("NUM_ITERATION:",i)
        idxs = np.random.permutation(trainX.shape[0])[0:BATCH_SIZE]
        X = np.atleast_2d(trainX[idxs,:])
        Y = np.atleast_2d(trainY[idxs,:])
        train_acc_num = 0
        for j in range(BATCH_SIZE):
            y_pred = compute(X[j],w)
            index = np.argwhere(Y[j]==1)[0][0]
            index_pred = np.argwhere(y_pred==np.max(y_pred))[0][0]
            if(index_pred == index):
                train_acc_num +=1 
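            # NOTE: this update sits inside the per-example loop, so every
            # mini-batch receives BATCH_SIZE gradient steps rather than one.
            w = w-rate*gradCE(X,Y,w)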
        train_acc = train_acc_num/BATCH_SIZE
        if train_acc>0.9:
            rate = 0.0001
        elif (train_acc>0.8) and (train_acc<=0.9):
            rate = 0.005
        else:
            rate = 0.05
    
    for j in range(len(testX)):
        y_pred = compute(testX[j],w)
        index = np.argwhere(testY[j]==1)[0][0]
        index_pred = np.argwhere(y_pred==np.max(y_pred))[0][0]
        if(index_pred == index):
            test_acc +=1
    test_acc = test_acc/len(testX)

    return train_acc, test_acc
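
The per-example accuracy loops above can be condensed with `np.argmax`. This is a minimal sketch under the assumption that predictions and one-hot labels are stacked row-wise. The helper name `accuracy` and the toy arrays are made up for illustration.

```python
import numpy as np

def accuracy(probs, Y):
    """Fraction of rows whose arg-max prediction matches the one-hot label."""
    return float(np.mean(np.argmax(probs, axis=1) == np.argmax(Y, axis=1)))

Y = np.eye(3)[[0, 1, 2, 1]]            # labels 0, 1, 2, 1
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],     # predicts class 1, label is 2
                  [0.2, 0.5, 0.3]])
acc = accuracy(probs, Y)
print(acc)                             # 0.75
```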

main:

In [8]:
import numpy as np
import scipy.optimize

NUM_INPUT = 784  # Number of input neurons
NUM_HIDDEN = 50  # Number of hidden neurons
NUM_OUTPUT = 10  # Number of output neurons
NUM_CHECK = 5  # Number of examples on which to check the gradient

# Load the images and labels from a specified dataset (train or test).
def loadData (which):
    images = np.load("data/mnist_{}_images.npy".format(which))
    labels = np.load("data/mnist_{}_labels.npy".format(which))
    return images, labels

if __name__ == "__main__":
    # Load data
    trainX, trainY = loadData("train")
    testX, testY = loadData("test")
    print("len(trainX): ", len(trainX))
    print("len(testX): ", len(testX))
    
    # Initialize weights randomly
    W1 = 2*(np.random.random(size=(NUM_INPUT, NUM_HIDDEN))/NUM_INPUT**0.5) - 1./NUM_INPUT**0.5
    b1 = 0.01 * np.ones((1,NUM_HIDDEN))
    W2 = 2*(np.random.random(size=(NUM_HIDDEN, NUM_OUTPUT))/NUM_HIDDEN**0.5) - 1./NUM_HIDDEN**0.5
    b2 = 0.01 * np.ones((1,NUM_OUTPUT))
    w = pack(W1, b1, W2, b2)

    # Check that the gradient is correct on just a few examples (randomly drawn).
    idxs = np.random.permutation(trainX.shape[0])[0:NUM_CHECK]
    discrepancy = scipy.optimize.check_grad(lambda w_: fCE(np.atleast_2d(trainX[idxs,:]), np.atleast_2d(trainY[idxs,:]), w_), \
                                    lambda w_: gradCE(np.atleast_2d(trainX[idxs,:]), np.atleast_2d(trainY[idxs,:]), w_), \
                                    w)
    print("discrepancy",discrepancy)
    if discrepancy < 0.01:
        print("My implemented cost and gradient functions are correct")



len(trainX):  10000
len(testX):  5000
1.279702296059427e-06


In [23]:
# Train the network and return the train accuracy and test accuracy
train_acc, test_acc=train(trainX, trainY, testX, testY, w, BATCH_SIZE = 64,NUM_ITERATION = 150)
print("train_acc:{}\ntest_acc:{}".format(train_acc,test_acc))

0.984375 0.8934
