# Developing a Neural Network from Scratch

I will be making a neural network to solve a handwritten digit recognition problem from the MNIST data set.  No deep learning libraries such as Tensorflow and Pytorch will be used here. However, NumPy will be imported to increase efficiency and simulate the inner workings of deep learning libraries. Pandas and Matplotlib will also be used to help handle data.

In [8]:
import numpy as np
import pandas as pd

First, let's read in the training data and get familiar with it. The training set here has 1 column describing the label (which digit the picture represents) followed by columns describing the color value of the pixels.

In [9]:
train_data = pd.read_csv("./data/train.csv")
train_data.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's split the data into training and cross-validation (CV) sets. There are 42,000 examples so we will pick 10,000 random ones to use for CV

In [10]:
data = np.array(train_data)
m, n = data.shape
print(f"Shape of data before splitting: {m, n}")

# Shuffle the data first
np.random.shuffle(data)

# CV set
cv = data[:10000].T
X_cv = cv[1:n] / 255
y_cv = cv[0]

# Training set
train = data[10000:m].T
X_train = train[1:n] / 255
y_train = train[0]

Shape of data before splitting: (42000, 785)


Next, let's develop the Neural Network class we will be working with. This neural network will have 3 layers not including the input layer so I can get some practice developing multi-layered neural networks. The first layer will have 20 neurons, the second will have 15 neurons, and the last (output layer) will have 10 neurons corresponding to the 10 classes of digits.

In [11]:
class NeuralNetwork():

    def __init__(self):
        '''
        Initialize the neural network with random weights (w1, b1 , w2, b2, w3, b3)
        '''

        self.W1 = np.random.rand(20, 784) - 0.5
        self.b1 = np.random.rand(20, 1) - 0.5
        self.W2 = np.random.rand(15, 20) - 0.5
        self.b2 = np.random.rand(15, 1) - 0.5
        self.W3 = np.random.rand(10, 15) - 0.5
        self.b3 = np.random.rand(10, 1) - 0.5
    
    def reLU(self, Z):
        '''
        Defines reLU activation function

        Args:
            Z (np.ndarray): Wn * An-1 + bn
        
        Returns:
            reLU activation function applied to Z
        '''
        return np.maximum(0, Z)

    def reLU_derivative(self, Z):
        '''
        Returns the derivative of reLU function

        Args:
            Z (np.ndarray): Wn * An-1 + bn
        
        Returns:
            derivative of Z (1 or 0)
        '''

        return Z > 0

    def softmax(self, Z):
        '''
        Defines softmax activation function

        Args:
            Z (np.ndarray): Wn * An-1 + bn

        Returns:
            ndarray Z with softmax applied 
        '''

        return np.exp(Z) / sum(np.exp(Z))
    
    def forward_prop(self, X):
        '''
        Carry out forward propagation in neural network layers

        Args:
            X (np.ndarray): Training data
        
        Returns:
            6-tuple containing intermediate terms (Z1, A1, Z2, A2, Z3, A3)
        '''

        Z1 = np.dot(self.W1, X) + self.b1
        A1 = self.reLU(Z1)
        Z2 = np.dot(self.W2, A1) + self.b2
        A2 = self.reLU(Z2)
        Z3 = np.dot(self.W3, A2) + self.b3
        A3 = self.softmax(Z3)

        return Z1, A1, Z2, A2, Z3, A3
    
    def one_hot(self, Y):
        '''
        Uses one-hot encoding to encode each label
         
        Args:
            Y (np.ndarray): labels for data
        
        Returns:
            one_hot (np.ndarray): one-hot encoded ndarray for labels of Y
        '''

        # Use max(Y) + 1 for # of columns because thats how many digits there are (0-9)
        one_hot = np.zeros((Y.size, max(Y) + 1))
        # For each row, apply one-hot encoding at the specified Y_row = Y column and turn it from 0 to 1
        # Note: can do this iteratively, but numpy vectorizes this for us to make our NN a little more efficient
        one_hot[np.arange(Y.size), Y] = 1
        return one_hot.T
    
    # Note: some parameters of back prop are not used, but are included as parameters
    # because it's a little easier to use since it follows the pattern Z, A, Z, A, Z, A, X, Y
    # Slowdown because of this addition is negligible
    def back_prop(self, Z1, A1, Z2, A2, Z3, A3, X, Y):
        '''
        Carry out backward propagation in neural netowrk layers

        Args:
            Z1 (np.ndarray): W1 * A0 + b1
            A1 (np.ndarray): reLU(Z1)
            Z2 (np.ndarray): W2 * A1 + b2
            A2 (np.ndarray): reLU(Z2)
            Z3 (np.ndarray): W3 * A2 + b3
            A3 (np.ndarray): softmax(Z3)
            X (np.ndarray): data
            Y (np.ndarray): labels
        
        Returns:
            6-tuple containing gradients (dW1, db1, dW2, db2, dW3, db3)
        '''
        
        # Output layer
        one_hot_y = self.one_hot(Y)
        dZ3 = A3 - one_hot_y
        dW3 = 1 / m * np.dot(dZ3, A2.T)
        db3 = 1 / m * np.sum(dZ3)

        # 2nd layer
        dZ2 = np.dot(self.W3.T, dZ3) * self.reLU_derivative(Z2)
        dW2 = 1 / m * np.dot(dZ2, A1.T)
        db2 = 1 / m * np.sum(dZ2)

        # 1st layer
        dZ1 = np.dot(self.W2.T, dZ2) * self.reLU_derivative(Z1)
        dW1 = 1 / m * np.dot(dZ1, X.T)
        db1 = 1 / m * np.sum(dZ1)

        return dW3, db3, dW2, db2, dW1, db1

    def update(self, dW1, db1, dW2, db2, dW3, db3, learning_rate):
        '''
        Updates weights and biases for gradient descent

        Args:
            dW1 (float32): gradient for layer 1 weight
            db1 (float32): gradient for layer 1 bias
            dW2 (float32): gradient for layer 2 weight
            db2 (float32): gradient for layer 2 bias
            dW3 (float32): gradient for layer 3 weight
            db3 (float32): gradient for layer 3 bias
            learning_rate (float32): learning rate alpha for gradient descent
        '''

        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W3 -= learning_rate * dW3
        self.b3 -= learning_rate * db3
    
    def gradient_descent(self, X_train, y_train, learning_rate, iters):
        '''
        Runs gradient descent to train the neural network

        Args:
            X (np.ndarray): training data
            y (np.ndarray): training labels
            learning_rate (float32): learning rate alpha
            iters (int32): number of iterations to run gradient descent
        
        Returns:
            6-tuple containing trained weight and bias terms (W1, b1, W2, b2, W3, b3)
        '''
        
        for i in range(1, iters + 1):
            Z1, A1, Z2, A2, Z3, A3 = self.forward_prop(X_train)
            dW3, db3, dW2, db2, dW1, db1 = self.back_prop(Z1, A1, Z2, A2, Z3, A3, X_train, y_train)
            self.update(dW1, db1, dW2, db2, dW3, db3, learning_rate)

            # Print status report every 100 iterations
            if i % 100 == 0 or i == 1:
                print(f"Iteration: {i}")
                pred = np.argmax(A3, 0)
                print(f"Accuracy: {np.sum(pred == y_train) / y_train.size}")
        
        return self.W1, self.b1, self.W2, self.b2, self.W3, self.b3

That was a lot. Let's test it!

In [12]:
nn = NeuralNetwork()
W1, b1, W2, b2, W3, b3 = nn.gradient_descent(X_train, y_train, 0.05, 1000)

Iteration: 1
Accuracy: 0.09084375
Iteration: 100
Accuracy: 0.4350625
Iteration: 200
Accuracy: 0.60384375
Iteration: 300
Accuracy: 0.68528125
Iteration: 400
Accuracy: 0.730125
Iteration: 500
Accuracy: 0.76246875
Iteration: 600
Accuracy: 0.782875
Iteration: 700
Accuracy: 0.800875
Iteration: 800
Accuracy: 0.81359375
Iteration: 900
Accuracy: 0.82590625
Iteration: 1000
Accuracy: 0.834875


Now, let's try it on the CV data to make sure we aren't overfitting

In [13]:
nn_cv = NeuralNetwork()
W1, b1, W2, b2, W3, b3 = nn_cv.gradient_descent(X_cv, y_cv, 0.05, 1000)

Iteration: 1
Accuracy: 0.0916
Iteration: 100
Accuracy: 0.1779
Iteration: 200
Accuracy: 0.3049
Iteration: 300
Accuracy: 0.4038
Iteration: 400
Accuracy: 0.4929
Iteration: 500
Accuracy: 0.5616
Iteration: 600
Accuracy: 0.6133
Iteration: 700
Accuracy: 0.6561
Iteration: 800
Accuracy: 0.6835
Iteration: 900
Accuracy: 0.7096
Iteration: 1000
Accuracy: 0.7295


#### The final accuracy is not exact, but it appears to be generally similar. The model likely has a slight overfitting issue (i.e: high variance), so an architecture with 1 less layer might be a better fit here. However, since this is a practice project, 3 layers were used just to get used to formally defining multiple hidden layers.

#### Overall I'm pretty happy with how this turned out, and really excited to know that I can use this knowledge to debug models I make in the future using deep learning libraries!