# Fully connected neural network from scratch

## MNIST dataset
Handwritten digits dataset (classification task).
https://www.kaggle.com/datasets/hojjatk/mnist-dataset

Training set size = 60,000  
Test set size = 10,000

Four files: training data, training labels, test data, test labels

Each data sample is a 28x28 pixel image (784 pixels total per image).  
Each pixel is a value between 0 and 255 (0 is completely black, 255 is completely white).

<img src="attachment:a2c80960-20f9-41c9-8daa-1e906a606e11.png" width="50%">

In the X matrix, each row is one input sample, in this case one image. Each row has length 784, with one column per pixel. The X matrix is then transposed, so each column represents an image and has 784 rows.
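
As a small illustration of this layout, the sketch below builds a made-up batch of three "images" (random values, not real MNIST data) and transposes it. The names `X_rows` and `X_cols` are hypothetical and used only here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 3 samples, each flattened to 784 pixel values in 0..255
X_rows = rng.integers(0, 256, size=(3, 784))

# Transpose so that each column is one image
X_cols = X_rows.T

print(X_rows.shape)  # (3, 784)
print(X_cols.shape)  # (784, 3)

# Column j of the transposed matrix holds the same pixels as row j of the original
same_image = np.array_equal(X_cols[:, 1], X_rows[1])
```

With samples stored as columns, a weight matrix of shape (nodes, 784) can be multiplied directly against the whole batch.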

## Neural network architecture

<img src="attachment:29658bd3-33cb-40dd-abeb-424e342f91f9.png" width="40%">

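The input layer has 784 nodes, one for each pixel. In the code below, the hidden layer has 10 nodes (ReLU) and the output layer has 10 nodes (softmax), one for each digit.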

## Forward pass

<img src="attachment:c653a6d5-95b6-4728-a391-8a1940d38dc5.png" width="40%">

A[0] is the input matrix X (784 rows, m columns).  
m is the number of samples.

Z[1] equals the dot product of the weight matrix W[1] and the inputs A[0], plus the bias b[1].  
A[1] is the result of applying the activation function to Z[1], ReLU in this case.

<img src='attachment:4ed81514-cabb-4072-94e8-16bb8f7e7a10.png' width="40%">
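
To make the first layer concrete, here is a minimal sketch with toy sizes (4 inputs, 3 hidden nodes, 2 samples) instead of the real 784 and m. All `*_toy` names and values are assumptions chosen so the arithmetic can be checked by hand.

```python
import numpy as np

# Toy weight matrix: one row per hidden node, one column per input feature
W1_toy = np.array([[ 1.0, -1.0, 0.0, 0.5],
                   [ 0.0,  2.0, 1.0, 0.0],
                   [-1.0,  0.0, 0.0, 1.0]])   # shape (3, 4)

# One bias per hidden node; broadcasting adds it to every column (sample)
b1_toy = np.array([[0.0], [-1.0], [0.5]])      # shape (3, 1)

# Toy inputs: one column per sample
A0_toy = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 0.0],
                   [0.0, 2.0]])                # shape (4, 2)

Z1_toy = W1_toy.dot(A0_toy) + b1_toy           # pre-activations, shape (3, 2)
A1_toy = np.maximum(Z1_toy, 0)                 # ReLU sets negative entries to 0

print(Z1_toy)  # [[ 1.   0. ] [ 1.   1. ] [-0.5  2.5]]
print(A1_toy)  # [[1.  0. ] [1.  1. ] [0.  2.5]]
```

The only entry that ReLU changes here is the -0.5, which becomes 0.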

Z[2] equals the dot product of a second weight matrix W[2] and A[1], plus a second bias term b[2].  
A[2] is the result of applying another activation function to Z[2], this time softmax.  
Softmax turns each column of Z[2] into a probability distribution: one probability per outcome, summing to 1. There are ten possible outcomes, one for each written digit 0-9.

<img src='attachment:254ae7f5-a9a6-41de-bc62-531eb9306d66.png' width="40%">
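
The following sketch applies a column-wise softmax to a toy Z[2] with three outcomes and two samples. The function name `softmax_demo` and the input values are assumptions for illustration. The second column uses very large values to show why the column maximum is subtracted before exponentiating.

```python
import numpy as np

def softmax_demo(Z):
    # Subtracting the column maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large inputs
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

# Toy pre-activations: 3 outcomes (rows), 2 samples (columns)
Z2_toy = np.array([[2.0,  998.0],
                   [1.0, 1000.0],
                   [0.0,  999.0]])

probs = softmax_demo(Z2_toy)
column_sums = probs.sum(axis=0)        # each column sums to 1
predicted = np.argmax(probs, axis=0)   # most likely outcome per sample

print(predicted)  # [0 1]
```

The predicted class is simply the row with the largest probability in each column.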

## Backpropagation
Equations:  

<img src='attachment:cdfd53d9-386e-46f0-88ad-8cd2e83f6f24.png' width="40%">

dZ[2] measures the difference between the predicted output A[2] and the actual label y. If, for example, y is 4, it is represented as the one-hot vector:
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0].
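
A minimal sketch of one-hot encoding for a few labels is shown here. The labels in `y_toy` are made up, and the result is laid out with one column per sample to match the shape of A[2].

```python
import numpy as np

y_toy = np.array([4, 0, 9])   # three hypothetical labels
num_classes = 10              # digits 0-9

# Rows are digits, columns are samples
one_hot_toy = np.zeros((num_classes, y_toy.size))
one_hot_toy[y_toy, np.arange(y_toy.size)] = 1

first_column = one_hot_toy[:, 0]   # encoding of the label 4
print(first_column)  # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
```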

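dW[2] is the derivative of the loss function with respect to the weights in layer 2. It is dZ[2] multiplied by the transpose of A[1], divided by the number of examples m.  
db[2] is the derivative of the loss function with respect to the bias in layer 2. It is the average of dZ[2] over the m examples, giving one value per output node.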
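
The shapes of the layer 2 gradients can be checked on a toy batch. In this sketch the activations and predictions are invented (every prediction is a flat 10%), and the bias gradient is averaged along the sample axis so that each output node gets its own value. All `*_toy` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m_toy = 5                                  # hypothetical number of samples

A1_toy = rng.random((10, m_toy))           # hidden activations, one column per sample
A2_toy = np.full((10, m_toy), 0.1)         # pretend every digit is predicted at 10%

labels = np.array([3, 1, 4, 1, 5])
Y_toy = np.zeros((10, m_toy))
Y_toy[labels, np.arange(m_toy)] = 1        # one-hot labels, one column per sample

dZ2_toy = A2_toy - Y_toy
dW2_toy = dZ2_toy.dot(A1_toy.T) / m_toy                   # shape (10, 10), same as W[2]
db2_toy = np.sum(dZ2_toy, axis=1, keepdims=True) / m_toy  # shape (10, 1), same as b[2]

print(dW2_toy.shape, db2_toy.shape)  # (10, 10) (10, 1)
```

Digit 1 appears twice in the labels, so its row of dZ[2] is [0.1, -0.9, 0.1, -0.9, 0.1] and its bias gradient is -1.5 / 5 = -0.3. A digit that never appears gets 0.5 / 5 = 0.1.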

<img src='attachment:2886d135-63fc-46e7-9700-ece1aaefd985.png' width="40%">

dZ[1] is the derivative of the loss function with respect to the pre-activation values in layer 1. It is the transpose of the second weight matrix W[2] multiplied by dZ[2], then multiplied element-wise by the derivative of the activation function g'(Z[1]).

dW[1] is the derivative of the loss function with respect to the weights in layer 1. It is dZ[1] multiplied by the transpose of X, divided by the number of examples m.  

db[1] is the derivative of the loss function with respect to the bias in layer 1. It is the average of dZ[1] over the m examples, giving one value per hidden node.
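
One way to gain confidence in these formulas is a finite-difference check on a tiny network. The sketch below assumes the loss is the average cross-entropy, which is the loss for which dZ[2] = A[2] - Y holds with a softmax output. The sizes, the random data, and the helper name `loss_for` are all assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, m_toy = 6, 4, 3, 5   # toy sizes for illustration

X = rng.random((n_in, m_toy))
labels = rng.integers(0, n_out, size=m_toy)
Y = np.zeros((n_out, m_toy))
Y[labels, np.arange(m_toy)] = 1

W1 = rng.standard_normal((n_hidden, n_in))
b1 = rng.standard_normal((n_hidden, 1))
W2 = rng.standard_normal((n_out, n_hidden))
b2 = rng.standard_normal((n_out, 1))

def loss_for(W1_candidate):
    # Average cross-entropy of the network when layer 1 uses W1_candidate
    Z1 = W1_candidate.dot(X) + b1
    A1 = np.maximum(Z1, 0)
    Z2 = W2.dot(A1) + b2
    Z2 = Z2 - Z2.max(axis=0, keepdims=True)
    log_probs = Z2 - np.log(np.exp(Z2).sum(axis=0, keepdims=True))
    return -np.sum(Y * log_probs) / m_toy

# Analytic gradient from the backpropagation formulas
Z1 = W1.dot(X) + b1
A1 = np.maximum(Z1, 0)
Z2 = W2.dot(A1) + b2
exp_Z = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = exp_Z / exp_Z.sum(axis=0, keepdims=True)
dZ2 = A2 - Y
dZ1 = W2.T.dot(dZ2) * (Z1 > 0)
dW1 = dZ1.dot(X.T) / m_toy

# Numerical gradient: nudge each weight up and down and measure the loss change
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for i in range(n_hidden):
    for j in range(n_in):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW1_numeric[i, j] = (loss_for(W_plus) - loss_for(W_minus)) / (2 * eps)

max_diff = np.max(np.abs(dW1 - dW1_numeric))
```

If the formulas are implemented correctly, `max_diff` should be very small compared with the gradient values themselves.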

## Update parameters

<img src='attachment:2acbaaae-58e4-42ef-aa67-5e012a3929b7.png' width="40%">
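
A single update step can be sketched on a toy 2x2 weight matrix. The values of `W_toy`, `dW_toy` and the learning rate `alpha` are made up for the example.

```python
import numpy as np

alpha = 0.1                                   # learning rate (illustrative value)
W_toy = np.array([[0.5, -0.2],
                  [0.1,  0.3]])
dW_toy = np.array([[1.0, -1.0],
                   [0.0,  2.0]])              # pretend gradient of the loss

# Move each weight a small step against its gradient
W_new = W_toy - alpha * dW_toy
print(W_new)  # [[ 0.4 -0.1] [ 0.1  0.1]]
```

Weights with a positive gradient decrease, weights with a negative gradient increase, and a zero gradient leaves the weight unchanged.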


## Code

In [None]:
#Imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [None]:
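# Data processing: each row of train.csv is one image, with the label in the first column followed by 784 pixel columns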
data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')

In [None]:
data = np.array(data)
m, n = data.shape
np.random.shuffle(data) # shuffle before splitting into dev and training sets

data_dev = data[0:1000].T
Y_dev = data_dev[0]
X_dev = data_dev[1:n]
X_dev = X_dev / 255. # Normalise the data

data_train = data[1000:m].T
Y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train / 255. # Normalise again
_,m_train = X_train.shape

In [None]:
X_train.shape

(784, 41000)

In [None]:
np.random.rand(1)

array([0.22587045])

In [None]:
def init_params():
    W1 = np.random.randn(10, 784) * np.sqrt(1. / 784)
    b1 = np.zeros((10, 1))
    W2 = np.random.randn(10, 10) * np.sqrt(1. / 10)
    b2 = np.zeros((10, 1))
    return W1, b1, W2, b2

def ReLU(Z):
    return np.maximum(Z, 0)

def ReLU_deriv(Z):
    return Z > 0

def softmax(Z):
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2

def one_hot(Y):
    # Y.max() + 1 gives 10 classes as long as the digit 9 appears in Y;
    # hard-coding 10 also works and is safer for small batches
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T # transpose so each column is one sample, matching A2
    return one_hot_Y

def back_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = X.shape[1] # number of samples in X, rather than the global row count m
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = dZ2.dot(A1.T) / m # shape (10, 10), same as W2
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m # one value per output node, shape (10, 1)
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)
    dW1 = dZ1.dot(X.T) / m # shape (10, 784), same as W1
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m # one value per hidden node, shape (10, 1)
    return dW2, db2, dW1, db1

def update_params(W2, b2, W1, b1, dW2, db2, dW1, db1, alpha):
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    return W2, b2, W1, b1

In [None]:
def get_predictions(A2):
    return np.argmax(A2, 0)

def get_accuracy(predictions, Y):
    #print(predictions, Y)
    return np.sum(predictions==Y) / Y.size

def gradient_descent(X, Y, alpha, num_iterations):
    W1, b1, W2, b2 = init_params()
    for i in range(num_iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW2, db2, dW1, db1 = back_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W2, b2, W1, b1 = update_params(W2, b2, W1, b1, dW2, db2, dW1, db1, alpha)
        if i % 100 == 0:
            print("Iteration: ", i)
            predictions = get_predictions(A2)
            print(get_accuracy(predictions, Y))
    return W1, b1, W2, b2

In [None]:
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 0.1, 500)

Iteration:  0
0.12770731707317073
Iteration:  100
0.7362682926829268
Iteration:  200
0.8770975609756098
Iteration:  300
0.8939024390243903
Iteration:  400
0.9016341463414634


## Make predictions on dev set

In [None]:
def make_predictions(X, W1, b1, W2, b2):
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    new_predictions = get_predictions(A2)
    return new_predictions

In [None]:
dev_predictions = make_predictions(X_dev, W1, b1, W2, b2)
get_accuracy(dev_predictions, Y_dev)

0.898

## Make predictions on test set

In [None]:
X_test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')
X_test = np.array(X_test)
X_test = X_test.T / 255. # Normalise, as was done for the training data

In [None]:
X_test.shape

(784, 28000)

In [None]:
test_preds = make_predictions(X_test, W1, b1, W2, b2)

In [None]:
submission = pd.DataFrame({
    "ImageId": range(1, 28001),
    "Label": test_preds
})
submission.to_csv('submission.csv', index=False)