# Building a Neural Network from Scratch
Our NN will have a simple two-layer architecture. Input layer $a^{[0]}$ will have 784 units corresponding to the 784 pixels in each 28x28 input image. A hidden layer $a^{[1]}$ will have 10 units with ReLU activation, and finally our output layer $a^{[2]}$ will have 10 units corresponding to the ten digit classes with softmax activation.

**Forward propagation**

$$Z^{[1]} = W^{[1]} X + b^{[1]}$$
$$A^{[1]} = g_{\text{ReLU}}(Z^{[1]}))$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$$
$$A^{[2]} = g_{\text{softmax}}(Z^{[2]})$$

**Backward propagation**

$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$
$$dB^{[2]} = \frac{1}{m} \Sigma {dZ^{[2]}}$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} .* g^{[1]\prime} (z^{[1]})$$
$$dW^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T}$$
$$dB^{[1]} = \frac{1}{m} \Sigma {dZ^{[1]}}$$

**Parameter updates**

$$W^{[2]} := W^{[2]} - \alpha dW^{[2]}$$
$$b^{[2]} := b^{[2]} - \alpha db^{[2]}$$
$$W^{[1]} := W^{[1]} - \alpha dW^{[1]}$$
$$b^{[1]} := b^{[1]} - \alpha db^{[1]}$$

**Vars and shapes**

Forward prop

- $A^{[0]} = X$: 784 x m
- $Z^{[1]} \sim A^{[1]}$: 10 x m
- $W^{[1]}$: 10 x 784 (as $W^{[1]} A^{[0]} \sim Z^{[1]}$)
- $B^{[1]}$: 10 x 1
- $Z^{[2]} \sim A^{[2]}$: 10 x m
- $W^{[1]}$: 10 x 10 (as $W^{[2]} A^{[1]} \sim Z^{[2]}$)
- $B^{[2]}$: 10 x 1

Backprop

- $dZ^{[2]}$: 10 x m ($~A^{[2]}$)
- $dW^{[2]}$: 10 x 10
- $dB^{[2]}$: 10 x 1
- $dZ^{[1]}$: 10 x m ($~A^{[1]}$)
- $dW^{[1]}$: 10 x 10
- $dB^{[1]}$: 10 x 1

# 0. Importing packages

In [83]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

path = r'digit-recognizer\train.csv'

data = pd.read_csv(path)


ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

Loaded the data into a pandas dataframe.

In [None]:
data.head()


Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Converting the data from a dataframe to a numpy array.

In [None]:
data = np.array(data)


Making sure that the model isn't overfitted, i.e., the model makes fairly accurate predictions for the training data but isn't generalised for the data it's supposed to have a high accuracy for. Setting aside a portion of the training data to perform cross-validation on to avoid overfitting.

Shuffling the data before we split the data into dev and training data. Note, `np.random.shuffle()` permutes the sequence in place.

In [None]:
np.random.shuffle(data)


In [None]:
data


array([[4, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [9, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       [5, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [None]:
# Storing the dimensions
m, n = data.shape

# m - Number of images; n - label + pixels for each image
m, n


(42000, 785)

Splitting the data into dev and training. We're using dev to cross validate and we're setting aside only 1000 images to do so.

In [None]:
# Transposing the data using only 1000 images
data_dev = data[:1000].T

# Storing the labels in YDev
Y_dev = data_dev[0]

# Storing the pixels
X_dev = data_dev[1:]
X_dev = X_dev / 255


In [None]:
# Storing the rest of the images
data_train = data[1000:m].T

# Extract labels
Y_train = data_train[0]

# Get the rest
X_train = data_train[1:]
X_train = X_train / 255


Printing details of all arrays implemented so far.

In [None]:
print(
    f'Printing dimensions of all existing arrays:\n(i) X - pixels\nX_dev: {X_dev.shape}\nX_train: {X_train.shape}\n\n(ii) Y - labels\nY_dev: {Y_dev.shape}\nY_train: {Y_train.shape}')


Printing dimensions of all existing arrays:
(i) X - pixels
X_dev: (784, 1000)
X_train: (784, 41000)

(ii) Y - labels
Y_dev: (1000,)
Y_train: (41000,)


# 1. Defining Initial Parameters

Defining a function to initialise the neural network by creating random weights. We use `rand()` to obtain a random value between 0 and 1 and then we subtract from those values to make sure the range in which our random values lie is `[-0.5, 0.5]`. 

In [None]:
def init_params():
    # There's 10 connections for each of the 784 nodes
    W1 = np.random.rand(10, n - 1) - 0.5

    # There's 10 biases
    b1 = np.random.rand(10, 1) - 0.5

    # Similarly,
    # There's 10 connections to 10 output nodes
    W2 = np.random.rand(10, 10) - 0.5
    b2 = np.random.rand(10, 1) - 0.5

    return W1, b1, W2, b2


Function implementing the ReLU (rectified linear unit) activation function.

In [None]:
def ReLU(Z):
    # Taking the maximum element-wise using numpy
    return np.maximum(0, Z)


Function implementing the softmax activation function

In [None]:
def softmax(Z):
    A = np.exp(Z) / sum(np.exp(Z))
    
    # Returning the probability
    return A


# 2. Forward Propagation

Defining a function to implement forward propagation through the neural net.

In [None]:
def forward_propagation(W1, b1, W2, b2, X):
    # Deactivated first layer
    Z1 = W1.dot(X) + b1
    
    # Activating Z1
    A1 = ReLU(Z1)
    
    # Creating the next layer's deactivated input
    Z2 = W2.dot(A1) + b2
    
    # Since the next layer is the output layer, we apply softmax
    A2 = softmax(Z2)
    
    return Z1, A1, Z2, A2


Function to implement one-hot encoding of Y. This is to represent the target classes as an array instead of a label.

In [None]:
def one_hot_encode(Y):
    # Encoding
    one_hot_encoded_df = pd.get_dummies(Y)

    # Taking the transpose so the columns represent images
    one_hot_encoded_array = np.array(one_hot_encoded_df).T

    return one_hot_encoded_array


Test to illustrate the working of `one_hot_encode(Y)`.

In [None]:
test = Y_train[:20]
test


array([1, 3, 7, 2, 5, 6, 6, 5, 0, 8, 9, 8, 7, 8, 7, 3, 1, 6, 6, 4],
      dtype=int64)

In [None]:
df = pd.get_dummies(test).T
df

ls = np.array(df)
ls


array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=uint8)

Comparing it to our function.

In [None]:
one_hot_encode(test)


array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=uint8)

Alternative one-hot encoding function.

In [None]:
def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    
    one_hot_Y[np.arange(Y.size), Y] = 1
    
    one_hot_Y = one_hot_Y.T
    
    return one_hot_Y


Function to implement the derivative of ReLU.

In [None]:
def derivative_ReLU(Z):
    # Returning 1 if the value is greater than 0. This is because the slope of the `linear` thing is 1.
    return Z > 0


# 3. Back Propagation

Function to back propagate through the neural network to calculate the differences in the weights and biases. 

In [None]:
def back_propagation(Z1, A1, Z2, A2, W2, X, Y):
    m = Y.size
    # one_hot_encoded_Y = one_hot_encode(Y)

    # dZ2 = A2 - one_hot_encoded_Y
    
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    
    dW2 = (1 / m) * dZ2.dot(A1.T)
    db2 = (1 / m) * np.sum(dZ2)

    dZ1 = W2.T.dot(dZ2) * derivative_ReLU(Z1)
    dW1 = (1 / m) * dZ1.dot(X.T)
    db1 = (1 / m) * np.sum(dZ1)
    
    return dW1, db1, dW2, db2


# 4. Updating Parameters

Function to update the parameters using learning rate `alpha`.

In [None]:
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    
    return W1, b1, W2, b2


# 5. Defining Gradient Descent

In [None]:
def get_predictions(A):
    # Returns the indices of the max values
    return np.argmax(A, 0)

Testing the `get_predictions()` function.

In [None]:
test, get_predictions(test)

(array([1, 3, 7, 2, 5, 6, 6, 5, 0, 8, 9, 8, 7, 8, 7, 3, 1, 6, 6, 4],
       dtype=int64),
 10)

In [None]:
def get_accuracy(predictions, Y):
    print(predictions, Y)
    
    return np.sum(predictions == Y) / Y.size

Gradient descent is an optimization algorithm which is commonly-used to train machine learning models and neural networks.  Training data helps these models learn over time, and the cost function within gradient descent specifically acts as a barometer, gauging its accuracy with each iteration of parameter updates. Until the function is close to or equal to zero, the model will continue to adjust its parameters to yield the smallest possible error. Once machine learning models are optimized for accuracy, they can be powerful tools for artificial intelligence (AI) and computer science applications. 

In [None]:
def gradient_descent(X, Y, epochs, alpha):
    # Defining weights and biases
    W1, b1, W2, b2 = init_params()

    for i in range(epochs):
        # Step 1
        Z1, A1, Z2, A2 = forward_propagation(W1, b1, W2, b2, X)
        
        # Step 2
        dW1, db1, dW2, db2 = back_propagation(Z1, A1, Z2, A2, W2, X, Y)
        
        # Step 3
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        
        # Every 10th iteration
        if i % 10 == 0:
            print("Iteration: ", i)
            
            # A2 is the output from the forward propagation
            predictions = get_predictions(A2)
            
            print(get_accuracy(predictions, Y))
    
    return W1, b1, W2, b2


# 6. Running the neural network

Running it on `X_train` and `Y_train`.

In [86]:
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 500, 1)

Iteration:  0
[0 0 0 ... 0 0 7] [1 3 7 ... 9 2 5]
0.10863414634146341
Iteration:  10
[9 3 3 ... 9 1 3] [1 3 7 ... 9 2 5]
0.2796585365853659
Iteration:  20
[8 5 9 ... 4 9 7] [1 3 7 ... 9 2 5]
0.46863414634146344
Iteration:  30
[2 3 3 ... 9 1 3] [1 3 7 ... 9 2 5]
0.5385365853658537
Iteration:  40
[8 3 9 ... 9 1 8] [1 3 7 ... 9 2 5]
0.5752195121951219
Iteration:  50
[2 3 9 ... 9 2 8] [1 3 7 ... 9 2 5]
0.7823170731707317
Iteration:  60
[2 3 9 ... 4 2 8] [1 3 7 ... 9 2 5]
0.799439024390244
Iteration:  70
[2 3 9 ... 9 2 8] [1 3 7 ... 9 2 5]
0.7723170731707317
Iteration:  80
[1 3 9 ... 9 1 8] [1 3 7 ... 9 2 5]
0.7841951219512195
Iteration:  90
[8 3 9 ... 9 2 8] [1 3 7 ... 9 2 5]
0.8246341463414634
Iteration:  100
[8 3 9 ... 9 2 8] [1 3 7 ... 9 2 5]
0.8322195121951219
Iteration:  110
[1 3 9 ... 9 2 8] [1 3 7 ... 9 2 5]
0.8061463414634147
Iteration:  120
[1 3 9 ... 9 2 5] [1 3 7 ... 9 2 5]
0.8509268292682927
Iteration:  130
[8 3 9 ... 9 2 5] [1 3 7 ... 9 2 5]
0.862
Iteration:  140
[8 3 9 ... 9 

Running it on `X_dev` and `Y_dev` to cross-validate the model to overcome overfitting.

In [85]:
W1, b1, W2, b2 = gradient_descent(X_dev, Y_dev, 500, 1)

Iteration:  0
[6 2 6 2 2 6 2 6 5 6 6 2 6 6 2 6 6 6 6 2 6 2 6 2 6 6 6 6 5 2 6 6 2 6 6 6 6
 5 6 5 6 2 0 2 6 6 6 9 2 2 2 6 6 2 6 2 2 6 2 4 6 2 2 2 6 6 5 2 5 9 2 2 2 6
 6 6 2 9 6 2 2 9 2 6 2 2 6 6 6 2 6 2 2 6 9 9 6 0 2 9 9 6 2 6 6 2 6 6 2 6 2
 2 9 5 6 5 6 5 6 6 5 5 2 2 5 2 2 6 2 6 5 6 2 2 6 6 6 5 0 2 6 6 6 2 5 6 0 6
 6 6 5 6 2 2 2 6 5 2 2 2 6 2 6 5 6 6 0 6 2 2 2 5 2 6 2 6 6 0 6 9 2 9 5 2 6
 6 5 6 2 2 2 2 6 2 6 5 2 2 6 6 6 2 6 6 5 6 2 7 6 6 6 2 6 6 2 6 6 2 6 2 2 2
 6 6 2 6 6 2 0 2 2 6 2 6 2 0 6 5 2 6 6 6 0 2 2 9 7 6 6 2 6 2 6 2 6 5 6 2 6
 9 5 2 2 5 6 6 2 9 6 6 5 6 6 6 2 2 0 2 6 2 6 5 6 5 5 9 2 2 0 5 6 5 6 6 6 0
 2 2 2 5 0 6 6 2 5 2 6 1 6 5 5 2 6 2 6 2 2 2 6 6 2 2 6 2 6 2 6 6 2 6 2 2 6
 5 2 6 5 2 5 2 6 2 6 0 5 2 6 2 5 6 9 6 6 6 6 5 2 2 6 6 6 2 6 6 6 5 2 5 6 2
 6 2 6 6 2 2 5 6 2 2 6 6 6 2 2 2 2 6 6 2 2 2 6 5 6 2 6 6 2 2 6 2 2 2 6 6 9
 5 6 6 6 6 6 2 6 6 6 6 2 2 6 9 6 6 6 0 6 6 9 2 2 5 2 6 6 6 5 9 2 2 9 5 2 9
 2 6 6 2 2 2 6 5 2 2 6 2 2 6 2 6 9 6 2 6 2 2 5 6 6 6 2 6 2 6 2 5 9 2 2 2 2
 6 6 6 2 2 