# Simple MNIST NN from scratch

In this notebook, we implement a simple two-layer neural network and train it on the MNIST digit recognizer dataset. It is meant as an instructional example to help you better understand the math underlying neural networks.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

data = pd.read_csv('/content/drive/MyDrive/Practical Deep Learning/Lesson 5: Logistic Regression + Overfitting/train.csv')

In [3]:
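data = np.array(data)
m, n = data.shape  # m examples, n columns (1 label + 784 pixels)
np.random.shuffle(data) # shuffle before splitting into dev and training sets

# First 1000 shuffled rows form the dev set; transpose so each column is one example
data_dev = data[0:1000].T
Y_dev = data_dev[0]      # row 0 holds the labels
X_dev = data_dev[1:n]    # remaining rows hold the pixels
X_dev = X_dev / 255.     # scale pixel values to [0, 1]

# Remaining rows form the training set
data_train = data[1000:m].T
Y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train / 255.
_, m_train = X_train.shape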

In [4]:
X_train.shape

(784, 41000)

Our NN will have a simple two-layer architecture. The input layer $a^{[0]}$ has 784 units, one for each pixel of a 28x28 input image. The hidden layer $a^{[1]}$ has 10 units with ReLU activation, and the output layer $a^{[2]}$ has 10 units with softmax activation, one for each of the ten digit classes. There are no separate bias vectors: the code prepends a constant row of ones to the input (the `pad` helper), so the first-layer bias is absorbed into $W^{[1]}$ and the padded input has 785 rows.
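The notebook's `pad` helper prepends a row of ones to the input instead of keeping a bias vector. The sketch below checks on toy data that this is equivalent to an explicit bias. The name `pad_ones` and the small shapes are assumptions chosen for illustration; `pad_ones` mirrors what `pad` does.

```python
import numpy as np

def pad_ones(x):
    # Prepend a row of ones, as the notebook's `pad` helper does.
    return np.concatenate((np.ones((1, x.shape[1]), dtype=x.dtype), x), axis=0)

rng = np.random.default_rng(0)
X = rng.random((4, 3))        # toy input: 4 features, 3 examples
W_full = rng.random((2, 5))   # 2 units, each with 1 bias weight + 4 feature weights

b = W_full[:, :1]             # column 0 multiplies the ones row, so it acts as the bias
W = W_full[:, 1:]             # remaining columns are the ordinary weights

Z_padded = W_full @ pad_ones(X)   # weights applied to the padded input
Z_explicit = W @ X + b            # weights plus an explicit bias
same = np.allclose(Z_padded, Z_explicit)
print(same)
```

Note that only the input is padded, so in this notebook the second layer has no bias term.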

**Forward propagation**

$$Z^{[1]} = W^{[1]} X$$
$$A^{[1]} = \text{ReLU}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]} A^{[1]}$$
$$A^{[2]} = g_{\text{softmax}}(Z^{[2]})$$
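The softmax in the last step is applied to each column of $Z^{[2]}$ separately. A minimal sketch, assuming a column-per-example layout as in this notebook: subtracting each column's maximum before exponentiating leaves the result unchanged but avoids overflow for large logits. The function name `softmax_stable` and the example values are illustrative.

```python
import numpy as np

def softmax_stable(Z):
    # Shift each column by its max; softmax is invariant to a per-column constant.
    Z_shifted = Z - Z.max(axis=0, keepdims=True)
    exp_Z = np.exp(Z_shifted)
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

# The second column equals the first plus 999, so both columns
# must produce the same probabilities.
Z = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])
A = softmax_stable(Z)
col_sums = A.sum(axis=0)
print(A)
print(col_sums)
```

A direct `np.exp(Z)` on the second column would overflow to `inf` and yield `nan` after division, which the shifted version avoids.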

**Backward propagation**

$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \odot g^{[1]\prime} (Z^{[1]})$$
$$dW^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T}$$
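One way to gain confidence in these formulas is a numerical gradient check. The sketch below assumes the loss is the mean cross-entropy between $A^{[2]}$ and the one-hot labels (the loss is not stated explicitly in this notebook, but $dZ^{[2]} = A^{[2]} - Y$ is its gradient for a softmax output). All names, the toy shapes, and the random data are illustrative.

```python
import numpy as np

def softmax(Z):
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

def loss(W1, W2, X, Y_onehot):
    # Assumed loss: mean cross-entropy over the m examples (columns of X).
    A1 = np.maximum(0, W1 @ X)
    A2 = softmax(W2 @ A1)
    return -np.sum(Y_onehot * np.log(A2)) / X.shape[1]

def analytic_grads(W1, W2, X, Y_onehot):
    # Backward propagation formulas from the text.
    m = X.shape[1]
    Z1 = W1 @ X
    A1 = np.maximum(0, Z1)
    A2 = softmax(W2 @ A1)
    dZ2 = A2 - Y_onehot
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

def numeric_grad(f, W, eps=1e-6):
    # Central differences: perturb one entry of W at a time, in place.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.random((5, 8))                    # 5 input rows, 8 examples
labels = rng.integers(0, 3, size=8)       # 3 classes
Y_onehot = np.eye(3)[labels].T            # shape (3, 8)
W1 = rng.random((4, 5)) - 0.5
W2 = rng.random((3, 4)) - 0.5

dW1, dW2 = analytic_grads(W1, W2, X, Y_onehot)
num_dW1 = numeric_grad(lambda: loss(W1, W2, X, Y_onehot), W1)
num_dW2 = numeric_grad(lambda: loss(W1, W2, X, Y_onehot), W2)
max_err = max(np.abs(dW1 - num_dW1).max(), np.abs(dW2 - num_dW2).max())
print(max_err)
```

If the formulas are implemented correctly, the largest difference between the analytic and numerical gradients should be tiny compared with the gradients themselves.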

**Parameter updates**

$$W^{[2]} := W^{[2]} - \alpha dW^{[2]}$$
$$W^{[1]} := W^{[1]} - \alpha dW^{[1]}$$
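The same update rule can be seen in isolation on a one-parameter problem. This is a toy sketch, not part of the network: the objective $f(w) = (w - 3)^2$, the starting point, and the step count are assumptions chosen for illustration.

```python
# Toy objective (assumed for illustration): f(w) = (w - 3)^2, minimized at w = 3.
def grad_f(w):
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate, playing the same role as alpha in the update rule
w = 0.0       # arbitrary starting point
for _ in range(100):
    w = w - alpha * grad_f(w)   # w := w - alpha * dw
print(w)
```

Each step moves `w` against the gradient, so it approaches the minimizer; a larger `alpha` takes bigger steps but can overshoot.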

**Vars and shapes**

Forward prop

- $A^{[0]} = X$: 785 x m
- $Z^{[1]} \sim A^{[1]}$: 10 x m
- $W^{[1]}$: 10 x 785 (as $W^{[1]} A^{[0]} \sim Z^{[1]}$)
- $Z^{[2]} \sim A^{[2]}$: 10 x m
- $W^{[2]}$: 10 x 10 (as $W^{[2]} A^{[1]} \sim Z^{[2]}$)

Backprop

- $dZ^{[2]}$: 10 x m ($\sim A^{[2]}$)
- $dW^{[2]}$: 10 x 10
- $dZ^{[1]}$: 10 x m ($\sim A^{[1]}$)
- $dW^{[1]}$: 10 x 785
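These shapes can be checked mechanically by running one forward and backward pass on random data. The sketch below uses the notebook's layer sizes with an arbitrary batch of `m = 32` examples; the random inputs and labels are placeholders, not MNIST data.

```python
import numpy as np

m = 32                                           # arbitrary batch size for the check
rng = np.random.default_rng(0)
A0 = rng.random((785, m))                        # padded input: 784 pixels + ones row
W1 = rng.random((10, 785)) - 0.5
W2 = rng.random((10, 10)) - 0.5
Y = np.eye(10)[rng.integers(0, 10, size=m)].T    # random one-hot labels, (10, m)

# Forward pass
Z1 = W1 @ A0
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1
exp_Z = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = exp_Z / exp_Z.sum(axis=0, keepdims=True)

# Backward pass
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ A0.T / m

shapes = {
    "A0": A0.shape, "Z1": Z1.shape, "A1": A1.shape,
    "Z2": Z2.shape, "A2": A2.shape,
    "dZ2": dZ2.shape, "dW2": dW2.shape,
    "dZ1": dZ1.shape, "dW1": dW1.shape,
}
for name, shape in shapes.items():
    print(name, shape)
```

Each gradient has the same shape as the quantity it corresponds to, which is what makes the element-wise parameter updates well defined.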

In [5]:
def init_params():
    # Uniform random weights in [-0.5, 0.5).
    # Column 0 of W1 multiplies the padded row of ones, so it acts as the bias.
    W1 = np.random.rand(10, 785) - 0.5
    W2 = np.random.rand(10, 10) - 0.5
    return W1, W2

def pad(x):
    # Prepend a row of ones so the first-layer bias is absorbed into W1
    return np.concatenate((np.ones((1, x.shape[1]), dtype=x.dtype), x), axis=0)

def ReLU(Z):
    return np.maximum(0, Z)

def softmax(Z):
    # Column-wise softmax; subtracting each column's max avoids overflow
    # in np.exp without changing the result
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

def forward_prop(W1, W2, X):
    Z1 = np.dot(W1, X)
    A1 = ReLU(Z1)
    Z2 = np.dot(W2, A1)
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2

def ReLU_deriv(Z):
    # Derivative of ReLU: 1 where Z > 0, else 0
    # (the boolean array behaves as 0/1 when multiplied)
    return Z > 0

def one_hot(Y):
    # Encode integer labels as one-hot columns, shape (num_classes, m)
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T
    return one_hot_Y

def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size  # number of examples in the batch
    one_hot_Y = one_hot(Y)
    # Debug output: both shapes should be (10, m)
    print(A2.shape)
    print(one_hot_Y.shape)
    dZ2 = A2 - one_hot_Y
    dW2 = np.dot(dZ2, A1.T) / m
    dZ1 = np.dot(W2.T, dZ2) * ReLU_deriv(Z1)
    dW1 = np.dot(dZ1, X.T) / m
    return dW1, dW2

def update_params(W1, W2, dW1, dW2, alpha):
    # Gradient descent step with learning rate alpha
    W1 = W1 - alpha * dW1
    W2 = W2 - alpha * dW2
    return W1, W2

In [6]:
def get_predictions(A2):
    # Predicted class = row index of the largest probability in each column
    return np.argmax(A2, 0)

def get_accuracy(predictions, Y):
    print(predictions, Y)  # show predictions next to the true labels
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, alpha, iterations):
    # X must already be padded with the row of ones (see pad)
    W1, W2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, W2, X)
        dW1, dW2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, W2 = update_params(W1, W2, dW1, dW2, alpha=alpha)
        # Report training accuracy every 10 iterations
        if i % 10 == 0:
            print("Iteration: ", i)
            predictions = get_predictions(A2)
            print(get_accuracy(predictions, Y))
    return W1, W2

In [7]:
W1, W2 = gradient_descent(pad(X_train), Y_train, 0.10, 500)

(10, 41000)
(10, 41000)
Iteration:  0
[3 3 1 ... 9 4 4] [6 4 0 ... 8 0 5]
0.06939024390243903
Iteration:  10
[3 3 1 ... 3 0 4] [6 4 0 ... 8 0 5]
0.15065853658536585
Iteration:  20
[1 3 1 ... 3 0 4] [6 4 0 ... 8 0 5]
0.18024390243902438
Iteration:  30
[1 3 1 ... 3 0 4] [6 4 0 ... 

KeyboardInterrupt: 

The run shown above was interrupted early; a complete run reaches ~85% accuracy on the training set.

In [None]:
def make_predictions(X, W1, W2):
    # X must already be padded with the row of ones (see pad)
    _, _, _, A2 = forward_prop(W1, W2, X)
    predictions = get_predictions(A2)
    return predictions

def test_prediction(index, W1, W2):
    # Index with None to keep the column dimension: shape (784, 1)
    current_image = X_train[:, index, None]
    prediction = make_predictions(pad(current_image), W1, W2)
    label = Y_train[index]
    print("Prediction: ", prediction)
    print("Label: ", label)

    # Undo the [0, 1] scaling and reshape to 28x28 for display
    current_image = current_image.reshape((28, 28)) * 255
    plt.gray()
    plt.imshow(current_image, interpolation='nearest')
    plt.show()

Let's look at a couple of examples:

In [None]:
test_prediction(0, W1, W2)
test_prediction(1, W1, W2)
test_prediction(2, W1, W2)
test_prediction(3, W1, W2)

Finally, let's find the accuracy on the dev set:

In [None]:
dev_predictions = make_predictions(pad(X_dev), W1, W2)
get_accuracy(dev_predictions, Y_dev)

The dev set accuracy is 84%, close to the ~85% training accuracy, so the model generalizes well to data it was not trained on.
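A single accuracy number hides which digits the model confuses. A possible next step is a confusion matrix with per-class accuracy. The sketch below is illustrative: the function name `confusion_matrix` and the tiny three-class label arrays are assumptions, not outputs of this notebook. With the real data you would pass `dev_predictions` and `Y_dev` with `num_classes=10`.

```python
import numpy as np

def confusion_matrix(predictions, labels, num_classes=10):
    # cm[i, j] counts examples whose true class is i and predicted class is j.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (labels, predictions), 1)
    return cm

# Hypothetical toy data with 3 classes
labels = np.array([0, 1, 2, 2, 1, 0, 2])
predictions = np.array([0, 1, 2, 1, 1, 0, 0])

cm = confusion_matrix(predictions, labels, num_classes=3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # correct / total for each true class
overall_acc = cm.diagonal().sum() / cm.sum()
print(cm)
print(per_class_acc)
print(overall_acc)
```

The per-class division assumes every class appears at least once in `labels`; a class with no examples would produce a division by zero.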