In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/digit-recognizer/sample_submission.csv
/kaggle/input/digit-recognizer/train.csv
/kaggle/input/digit-recognizer/test.csv


# **Introduction**
Welcome to my first notebook! I will be writing code along with my thought process and current knowledge upon the topics I will be diving into. Some basics of Machine Learning that I have learned through the wonders of YouTube are:
* Weights and Biases
* Forward and Back Propogation
* Basics of Gradient Descent
* Activation Functions & Softmax / Argmax Functions

I'm not sure how the formatting of Jupyter Notebooks should be or if I'm following protocol with my informal language, however, I believe it would be fun to show off this project as a sort of journey I went down and to have logs and comments that people can follow along to and not have it just be long bricks of text explaining what what does and how does this work because, I won't exactly know how everything works at this point! However, my goal is to learn the deeper level of the concepts I hold surface level information on to come back to this and continue my path down Machine Learning.

Hi! I'm **Aidan** and as of right now I am a Freshman in Stevens Institute of Technology, I HAVE made a previous submission regarding MNIST digits and Machine Learning and what not, but I have not made a formal Jupyter notebook so I will be trying to make one on an account that I will continue to use in the future.

In [2]:
# retrieve testing and training data
test_data = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')
train_data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')

So the first thing we have to do obviously is to get the data from the CSV files provided. Nothing really new learned here, I knew how to work with **pandas** before working on this (pretty important pre-requisite to Python Machine Learning). 

Thankfully, the competition provided the data in two seperate CSV files for training and testing (how kind!) so there will be no splitting of numpy arrays or splitting data into training and testing data, because the work was done for us!

In [3]:
# testing data
test_data = np.array(test_data)

test_data = test_data.T # transpose the data
X_test = test_data
X_test = X_test / 255

# training data
train_data = np.array(train_data)
np.random.shuffle(train_data) # shuffle the data around

train_data = train_data.T # transpose the data
Y_train = train_data[0]
X_train = train_data[1:]
X_train = X_train / 255

# **Data Conversion**
So here we convert the CSV data into numpy arrays so we can mold and mash the data however we want to.

> *FUNNY STORY*: When I first did this competition, I shuffled the test data, and I was so confused as to how my predictions were so off when submitting them to the competition, and it took forever to find it... First submission was an accuracy of 5.25%

The data is transposed (changing the rows into columns and vise versa) to make it easier to extract the Y values in the training data a.k.a. the expected output to the X data. The expected outputs are then stored in **Y_train** and the inputs are stored in **X_train**.

I'm actually unsure as to why we divide the data by *255*, I will figure this out in the future.

In [4]:
# initialize weights and biases
def init_params():
    '''
    Randomly generates the weights and biases for the first generation neural network.
    '''

    W1 = np.random.rand(10, 784) - 0.5
    b1 = np.random.rand(10, 1) - 0.5
    W2 = np.random.rand(10, 10) - 0.5
    b2 = np.random.rand(10, 1) - 0.5

    return W1, b1, W2, b2

# **Initialization**
Here we are pretty much initializing the weights and biases of the neural network. 

This is just a baseline for the neural network to use for the first training session, a.k.a. the first time we pass through the data through our network. Due to the randomization, it's expected that our neural network has around a 10% accuracy, since with these untouched, randomized weights and biases its like our neural network here is closing its eyes and completely guessing as to what the digit it is given is. 

As more training iterations occur, our network will begin "learning" and these weights and biases will be adjusted slowly to the weights and biases we want, which will slowly improve our accuracy.

In [5]:
def ReLU(Z):
    '''
    Pythonic ReLU function.
    '''

    return np.maximum(0, Z)

def softmax(Z):
    '''
    Pythonic softmax activation function.
    '''

    return np.exp(Z) / sum(np.exp(Z))

# forward propagation
def forward_prop(W1, b1, W2, b2, X):
    '''
    Runs the data from the input nodes to the output nodes in a 
    forward motion.

    X -- represents the input data / input layer
    '''

    z1 = W1.dot(X) + b1
    A1 = ReLU(z1)
    z2 = W2.dot(A1) + b2
    A2 = softmax(z2)

    return z1, A1, z2, A2

# **Forward Propogation**
So the first of the three sections I learned about neural networks. **Forward Propogation**. As of right now, the information I know about forward propogation is the data is being run through the neural network, a.k.a. some X (input) is run through the network, and the Y (output) is stored and, in the future, compared to the expected Y. 

Unfortunately, I don't understand all of the math behind the formulas used above, but I understand the *general* idea behind the formula as a whole!

### Activation Functions
The activation function used in this model is the ReLU function, which is seen in the below equations. What I learned activation functions do in neural networks is they determine how important a certain neuron is when processing the input. This function pretty much determines if a node should be activated or not, hence the name: *activation function*.

Here are a few of the equations seen above (m $ \rightarrow $ amount of training images):

$ A^{[0]} = X_{784 \times m} $

$ z^{[1]}_{10 \times m} = w^{[1]}_{10 \times 784} * A^{[0]}_{784 \times m} + b^{[1]}_{10 \times m} $

$ \text{ReLU} \rightarrow \begin{bmatrix} x \text{ if } x > 0 \\ 0 \text{ if } x \le 0 \end{bmatrix} $

$ A^{[1]} = g(z^{[1]}) \text{\{g -> activation function\}} = ReLU(z^{[1]}) $

$ z^{[2]}_{10 \times m} = w^{[2]}_{10 \times 10} * A^{[1]}_{10 \times m} + b^{[2]}_{10 \times m} $

$ \text{softmax} \rightarrow \frac{e^{z_i}}{\sum^K_{j=1}e^{z_j}} $

$ A^{[2]} = \text{softmax}(z^{[2]}) $

In [6]:
def one_hot(Y):
    '''
    Formats the expected value into an array of the correct size and format.
    '''

    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T

    return one_hot_Y

def deriv_ReLU(Z):
    '''
    Pythonic derivative of ReLU function.
    '''

    return Z > 0

# backward propagation
def back_prop(z1, A1, z2, A2, W1, W2, X, Y):
    '''
    Runs the data from the output nodes to the input nodes and
    calculates the error in a backward motion.
    '''

    m = Y.size
    one_hot_Y = one_hot(Y)

    dz2 = 2 * (A2 - one_hot_Y)
    dW2 = 1 / m * dz2.dot(A1.T)
    db2 = 1 / m * np.sum(dz2)
    dz1 = W2.T.dot(dz2) * deriv_ReLU(z1)
    dW1 = 1 / m * dz1.dot(X.T)
    db1 = 1 / m * np.sum(dz1)

    return dW1, db1, dW2, db2

# **Back Propogation**
Second of the three seconds is **back propagation**! This was the area where I learned the most about neural networks, as I already had a general idea that the way they trained was by feeding them input data and evaluating the output. However, I was uncertain of the process of evaluating said input, which I learned during my learning process with back propagation!

I understand basic ideas and mathematics behind back propagation, but I am also lacking when it comes to specifics and specific math behind the idea of back propagation. 

What I generally understand is these functions are trying to find the amount of error behind the output recieved from *forward propagation* and evaluate what weights and biases contributed the most and the amount of impact they had on the error, hence why we are finding "derivatives" of the weights and biases, as they represent the amount of an impact they had on the error within the output.

The weights and biases are then adjusted based on these derivatives, however we aren't on that stage yet. I'm a little loose on the math behind using the derivative of our activation function (ReLU), however the other function: *one_hot* I understand. We are isolating the output and formatting it into a $ 1 \times 10 $ array so we can take the difference of the expected output and our output *A2* that we got from our forward propagation functions.

I still have a lot to learn in this area, however, I believe that I learned a lot during my research and hands-on project! Here are some of the formulas I used (again; m $ \rightarrow $ amount of training images):

$ dz^{[2]}_{10 \times m} = 2 (A^{[2]}_{10 \times m} - Y_{10 \times m}) $

$ dw^{[2]}_{10 \times 10} = \frac1m dz^{[2]}_{10 \times m} A^{[1]T}_{m \times 10} $

$ db^{[2]}_{10 \times 1} = \frac1m \sum dz^{[2]} $

$ \text{ReLU'} \rightarrow \begin{bmatrix} 1 \text{ if } x > 0 \\ 0 \text{ if } x \le 0 \end{bmatrix} $

$ dz^{[1]}_{10 \times m} = w^{[2]T}_{10 \times 10} dz^{[2]}_{10 \times m} * g'(z^{[1]})_{10 \times m} = w^{[2]T}_{10 \times 10} dz^{[2]}_{10 \times m} * ReLU'(z^{[1]})_{10 \times m} $

$ dw^{[1]}_{10 \times 784} = \frac1m dz^{[1]}_{10 \times m} X^T_{m \times 784} $

$ db^{[1]}_{10 \times 1} = \frac1m \sum dz^{[1]}_{10 \times 1} $

In [7]:
def get_predictions(A2):
    return np.argmax(A2, 0)

def get_accuracy(predictions, Y):
    return np.sum(predictions == Y) / Y.size

def make_predictions(X, W1, b1, W2, b2):
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    predictions = get_predictions(A2)

    return predictions

In [8]:
# update parameters
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    '''
    Adjusts the weights and biases after front and back propagation is
    completed.
    '''

    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2

    return W1, b1, W2, b2

# gradient descent
def gradient_descent(X, Y, iterations, alpha):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        z1, A1, z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = back_prop(z1, A1, z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 50 == 0:
            print('Iteration: ', i)
            print('Accuracy: ', get_accuracy(get_predictions(A2), Y))

    return W1, b1, W2, b2

# **Updating Parameters & Gradient Descent**
This is the final phase of the neural network training process that I learned about and it really wrapped everything up in a nice shiny bow for me, showing the completely loop that the training of a neural network really is. 

The input data is converted into an observed output, this output is compared to the expected output, and this error is then used to adjust the weights and biases accordingly. Then we go through this loop over and over again! Although this may sound repeatative or boring to some, the learning process of this whole system was truly interesting to me and I'm excited to learn more! 

Anyways, going back on track, after we figure out what weights and biases had what effects on the error, we have to adjust the weights and biases accordingly so they do better the next time we run the input data through! We do this by updating the parameters using the function above. 

### Learning Rate
A new term introduced was a *learning rate* which is represented by the variable alpha above. The optimal (or just randomly chosen) value used in this model was 1.5, and this pretty much determines how much will the neural network adjust itself when an error is found.

Too much of a learning rate may cause occillation in the weights and biases by adjusting the weights and biases too much and being too sensitive to small errors, causing little to no learning within our neural network. The opposite could happen when the learning rate is too small, not allowing our neural network to make major enough changes to error when an error presents itself. This is why finding a sweet spot with learning rate is important.

The function as a whole runs the input data through the loop mentioned earlier, and constantly updates the parameters over and over until the desired amount of iterations are reached. The function then returns the weights and biases to be used by our neural network for testing or production!

In [9]:
def submit(W1, b1, W2, b2):
    submission_data = []
    
    for i in range(len(X_test[0])):
        current_image = X_test[:, i, None]
        prediction = make_predictions(X_test[:, i, None], W1, b1, W2, b2)
        submission_data.append([i + 1, prediction[0]])
        
    submission = pd.DataFrame(submission_data, columns=['ImageId', 'Label'])
    submission.to_csv('/kaggle/working/submission.csv', index=False)

In [10]:
def train():
    W1, b1, W2, b2 = gradient_descent(X_train, Y_train, 500, 0.15)
    submit(W1, b1, W2, b2)
    
train()

Iteration:  0
Accuracy:  0.10238095238095238
Iteration:  50
Accuracy:  0.5036190476190476
Iteration:  100
Accuracy:  0.7549523809523809
Iteration:  150
Accuracy:  0.8137142857142857
Iteration:  200
Accuracy:  0.8378571428571429
Iteration:  250
Accuracy:  0.8528571428571429
Iteration:  300
Accuracy:  0.860904761904762
Iteration:  350
Accuracy:  0.8677619047619047
Iteration:  400
Accuracy:  0.8734047619047619
Iteration:  450
Accuracy:  0.8775


# **Conclusion**
Overall this was a fun learning experience and I learned a lot! I'm not sure if this is the usual way of using Jupiter but it was cool making a journal like this to log my thoughts throughout the whole process.

Thank you so much for reading!