# Artificial Neural Network

*Note: The code in this document cannot run*

## 1. Introduction and Overview

### 1.1 Introduction

Neural Networks (NN) are the foundation of Deep Learning. With enough data and computational power, they can be applied to a very wide range of problems. While it is perfectly fine to treat an NN as a black box, it is much more exciting and educational to learn what lies behind the term and how these networks work.

As such, this project aims to create an Artificial Neural Network (ANN) from scratch using Python and NumPy. This documentation explains the flow of the ANN and how it works under the hood.

### 1.2 Overview

![Neural Network Diagram](../static/Neural%20Network.png)

Above is a very simplified diagram of a Neural Network. The coloured circles connected to each other are often referred to as neurons. A Neural Network consists of an Input Layer, one or more Hidden Layers and a final Output Layer. The input is taken in at the Input Layer and passed through the Hidden Layers before finally reaching the Output Layer.

For this particular ANN, the neurons are trained on a set of data, and their weights and biases are determined through backward propagation.

## 2. Neural Network

### 2.1 Initialization of Parameters

Referring to the code below, the parameter `layer_dims` holds the dimensions of the input, hidden and output layers. For instance, for a 4-layer Neural Network whose input layer has 784 neurons, whose two hidden layers have 128 and 64 neurons respectively and whose output layer has 10 neurons, `layer_dims` will be equal to [784, 128, 64, 10]. In general it is taken in as [Input Layer Dimension, Hidden Layer Dimension, ..., Output Layer Dimension].

The weights of each layer are randomly generated from a standard normal distribution and then multiplied by 0.01 to keep them small.
The weights form an $m \times n$ matrix. The weight matrix of the lth layer is as follows:

$$
weight_{l} = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
                             w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
                             \vdots & \vdots & \ddots & \vdots \\
                             w_{m,1} & w_{m,2} & \cdots & w_{m,n}
             \end{bmatrix}
$$

, where m is the current layer dimension and n is the previous layer dimension.

The biases of each neuron are initialized to zero. The initial bias vector is an $m \times 1$ column matrix, as such:

$$
bias_{l} = \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{m} \end{bmatrix}
$$

, where m is the current layer dimension.

In [None]:
import numpy as np

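def initialize_parameters(layer_dims: list[int]) -> dict:
    """Create small random weights and zero biases for every layer."""
    np.random.seed(1)  # Fixed seed so that runs are reproducible

    parameters = {}

    for l in range(1, len(layer_dims)):
        # Weights: shape (current layer dimension, previous layer dimension)
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        # Biases: shape (current layer dimension, 1), initialized to zero
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))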

    return parameters
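
To make the shapes concrete, the short sketch below builds parameters for the `layer_dims` example above and prints their shapes. It is a standalone illustration rather than the project's code: the helper name `parameter_shapes` is hypothetical, and it uses NumPy's `default_rng` instead of the global seed.

```python
import numpy as np

def parameter_shapes(layer_dims):
    """Return the (weight shape, bias shape) pair of every layer (hypothetical helper)."""
    rng = np.random.default_rng(1)
    shapes = {}
    for l in range(1, len(layer_dims)):
        # Same layout as described above: (current, previous) weights, (current, 1) biases
        W = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        b = np.zeros((layer_dims[l], 1))
        shapes[l] = (W.shape, b.shape)
    return shapes

shapes = parameter_shapes([784, 128, 64, 10])
for layer, (w_shape, b_shape) in shapes.items():
    print(f"layer {layer}: W {w_shape}, b {b_shape}")
```

Each weight matrix has one row per neuron of its own layer and one column per neuron of the layer feeding into it.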

### 2.2 Forward Propagation

Forward propagation is the process of taking in inputs and making a prediction.

#### 2.2.1 Linear Hypothesis

For the forward propagation of this particular neural network, and with the assumption that there is a single training example, the Linear Hypothesis is defined as follows:

$$
Z_{(1, m)} = A_{(1, n)} \cdot W_{(m, n)}^{T} + b_{(m, 1)}^{T}
$$

, where $Z_{(1, m)}$ is the $1 \times m$ linear output, $W_{(m, n)}$ is the $m \times n$ weight matrix, $A_{(1, n)}$ is the $1 \times n$ activated output from the previous layer (the input itself for the first layer) and $b_{(m, 1)}$ is the $m \times 1$ bias vector. The transposes are needed because each training example is stored as a row, while the weights and biases are stored with one row per neuron of the current layer.
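
As a small worked example of the Linear Hypothesis, the sketch below uses a previous layer of 3 neurons and a current layer of 2 neurons. It assumes the weights are stored as an $m \times n$ matrix (current layer by previous layer, as in Section 2.1) and the example is a row vector, so the weights and biases are transposed before use. The numbers are made up for illustration.

```python
import numpy as np

A_prev = np.array([[1.0, 2.0, 3.0]])   # 1 x n, one training example
W = np.array([[0.1, 0.2, 0.3],         # m x n, one row per neuron of the current layer
              [0.0, -1.0, 1.0]])
b = np.array([[0.5],                   # m x 1
              [-0.5]])

# Z = A_prev . W^T + b^T, giving a 1 x m row of linear outputs
Z = np.dot(A_prev, W.T) + b.T
print(Z.shape, Z)
```

The first entry is $1 \times 0.1 + 2 \times 0.2 + 3 \times 0.3 + 0.5 = 1.9$ and the second is $0 - 2 + 3 - 0.5 = 0.5$.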

#### 2.2.2 Activation Function

The activation function for this particular neural network is the sigmoid function. It is defined as follows: 

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
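
A minimal NumPy sketch of this function is shown below. The name `sigmoid` matches the helper called in the forward propagation code, but this particular definition is an assumption, since the original helper is not shown.

```python
import numpy as np

def sigmoid(x):
    """Element-wise sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

values = sigmoid(np.array([-2.0, 0.0, 2.0]))
print(values)
```

The output is always strictly between 0 and 1, equals 0.5 at $x = 0$, and satisfies $\sigma(-x) = 1 - \sigma(x)$.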

#### 2.2.3 Cache

The Linear Cache and Activation Cache are the inputs to the Linear Hypothesis and Activation Function respectively. They are stored and returned for the [backward propagation](#23-backward-propagation), which is explained in the next section.

In [None]:
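def sigmoid(Z: np.ndarray) -> np.ndarray:
    # Helper not shown in the original; follows the definition in Section 2.2.2
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X: np.ndarray, params: dict):
    # X holds one training example per row: (number of examples, input dimension)
    A = X
    caches = []

    # params holds one W and one b per layer, hence len(params) // 2 layers
    for l in range(1, (len(params) // 2) + 1):
        A_prev = A
        W = params['W'+str(l)]
        b = params['b'+str(l)]

        # Linear Hypothesis: W is (current, previous) and b is (current, 1),
        # so both are transposed to match the row-per-example layout of A_prev
        Z = np.dot(A_prev, W.T) + b.T

        # Linear Cache
        linear_cache = (A_prev, W, b)

        # Activation Function
        A = sigmoid(Z)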

        # Activation Cache
        activation_cache = Z

        cache = (linear_cache, activation_cache)
        caches.append(cache)
    
    return A, caches
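
The sketch below traces the shapes through a complete forward pass on a tiny network. It is self-contained and only illustrative: the name `forward_pass` is hypothetical, and it assumes (current, previous) weights, (current, 1) biases and one training example per row.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, params):
    """Illustrative forward pass that also records the caches."""
    A = X
    caches = []
    for l in range(1, len(params) // 2 + 1):
        A_prev, W, b = A, params['W' + str(l)], params['b' + str(l)]
        Z = A_prev @ W.T + b.T                 # (examples, current layer dimension)
        A = sigmoid(Z)
        caches.append(((A_prev, W, b), Z))     # (linear cache, activation cache)
    return A, caches

rng = np.random.default_rng(0)
layer_dims = [4, 3, 2]
params = {}
for l in range(1, len(layer_dims)):
    params['W' + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    params['b' + str(l)] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((5, 4))                # 5 examples, 4 input features
AL, caches = forward_pass(X, params)
print(AL.shape, len(caches), caches[0][1].shape)
```

There is one cache per layer, and the prediction has one row per example and one column per output neuron.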

### 2.3 Backward Propagation

Backward propagation is slightly more complex than forward propagation. In essence, it is the learning mechanism that tunes the weights and biases of each neuron. During training, the network evaluates the cost by comparing its predictions to the labels. The gradients of the cost are then determined via backward propagation and used to make small changes to the weights and biases of each neuron.
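
This document does not spell out the cost function. The expression used for `dAL` in Section 2.3.2 has the form of the derivative of the binary cross-entropy, so the sketch below assumes that cost. The function name `cross_entropy_cost` and the sample numbers are hypothetical.

```python
import numpy as np

def cross_entropy_cost(AL, Y):
    """Mean binary cross-entropy between predictions AL and labels Y (assumed cost)."""
    N = Y.shape[0]  # Number of training examples, one per row
    return float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / N)

Y = np.array([[1.0], [0.0], [1.0]])
good = np.array([[0.9], [0.1], [0.8]])   # predictions close to the labels
bad = np.array([[0.2], [0.7], [0.4]])    # predictions far from the labels

cost_good = cross_entropy_cost(good, Y)
cost_bad = cross_entropy_cost(bad, Y)
print(cost_good, cost_bad)
```

Predictions closer to the labels give a lower cost, which is the quantity that training tries to reduce.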

#### 2.3.1 Single Layer Backward Propagation (SLBP)

SLBP looks at the backward propagation process of a single layer. Referring to the code below, $dA = \frac{\partial cost}{\partial A}$, where A is the layer's output. In other words, dA is the change in cost with respect to (w.r.t.) the change in A.

The calculation of $dZ = \frac{\partial cost}{\partial Z}$, where Z is the layer's linear output (the input to the activation function), is derived from the chain rule:

$$
dZ = \frac{\partial cost}{\partial Z} = \frac{\partial cost}{\partial A} * \frac{\partial A}{\partial Z}
$$

, where $\frac{\partial cost}{\partial A} = dA$, $\frac{\partial A}{\partial Z}$ is the first order derivative of the [activation function](#222-activation-function) evaluated at Z, and $*$ denotes element-wise multiplication.
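
Since the activation function here is the sigmoid, its first order derivative has a closed form. A short derivation from the definition in Section 2.2.2, included for reference:

$$
\frac{d\sigma}{dx} = \frac{d}{dx} \left( 1 + e^{-x} \right)^{-1} = \frac{e^{-x}}{\left( 1 + e^{-x} \right)^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) \left( 1 - \sigma(x) \right)
$$

, so for this network the derivative of A w.r.t. Z is $A * (1 - A)$ element-wise. The helper `sigmoid_first_derivative` used in the code is assumed to compute this expression.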

To obtain dW, the formula is as follows:

$$
dW = \frac{\partial cost}{\partial W} = \frac{1}{N} \times \left( \frac{\partial cost}{\partial Z} \right)^{T} \cdot A_{prev}
$$

, where N is the total number of training examples (named `m` in the code), $dZ = \frac{\partial cost}{\partial Z}$ has the shape $N \times m$ and $A_{prev}$ is the $N \times n$ input to this particular layer. Do note that dW has the same $m \times n$ shape as the weight matrix, where m is the current layer dimension and n is the previous layer dimension.

To obtain db, the formula is as follows:

$$
db = \frac{\partial cost}{\partial b} = \frac{1}{N} \times \begin{bmatrix}
\sum_{i=1}^{N} dZ_{i,1} \\
\sum_{i=1}^{N} dZ_{i,2} \\
\vdots \\
\sum_{i=1}^{N} dZ_{i,m}
\end{bmatrix}
$$

, where N is the total number of training examples and m is the current layer dimension, making db an $m \times 1$ matrix with the same shape as the bias vector.
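
One way to gain confidence in the dW and db expressions is a numerical gradient check on a single layer. The sketch below is an illustration under stated assumptions, not the project's code: it writes N for the number of training examples, lays out dW like the $m \times n$ weight matrix, stores one example per row, and uses a simple squared-error cost (`layer_cost`, a hypothetical helper) purely to have something to differentiate.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def layer_cost(W, b, A_prev, T):
    """Illustrative cost: mean over examples of 0.5 * squared error after one sigmoid layer."""
    A = sigmoid(A_prev @ W.T + b.T)
    return np.sum(0.5 * (A - T) ** 2) / A_prev.shape[0]

def numerical_gradient(f, P, eps=1e-6):
    """Central-difference estimate of the gradient of f at P, one entry at a time."""
    grad = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        step = np.zeros_like(P)
        step[idx] = eps
        grad[idx] = (f(P + step) - f(P - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
N, n, m = 6, 4, 3                        # examples, previous dimension, current dimension
A_prev = rng.standard_normal((N, n))
T = rng.random((N, m))                   # arbitrary targets
W = rng.standard_normal((m, n))
b = rng.standard_normal((m, 1))

# Analytic gradients
A = sigmoid(A_prev @ W.T + b.T)
dA = A - T                               # derivative of the per-example loss w.r.t. A
dZ = dA * A * (1 - A)                    # chain rule, using sigma'(Z) = A * (1 - A)
dW = dZ.T @ A_prev / N                   # shape (m, n), same as W
db = dZ.sum(axis=0).reshape(-1, 1) / N   # shape (m, 1), same as b

# Numerical gradients for comparison
dW_num = numerical_gradient(lambda P: layer_cost(P, b, A_prev, T), W)
db_num = numerical_gradient(lambda P: layer_cost(W, P, A_prev, T), b)

print(np.max(np.abs(dW - dW_num)), np.max(np.abs(db - db_num)))
```

Both printed differences should be tiny, which indicates that the analytic gradients agree with the slope of the cost measured directly.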

In [None]:
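def sigmoid_first_derivative(Z):
    # Helper not shown in the original: sigma'(Z) = sigma(Z) * (1 - sigma(Z))
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)

def one_layer_back_propagation(dA, cache: tuple):
    linear_cache, activation_cache = cache

    # Chain rule: dZ = dA * sigma'(Z), element-wise
    Z = activation_cache
    dZ = dA * sigmoid_first_derivative(Z)

    A_prev, W, b = linear_cache
    m = A_prev.shape[0]  # Number of training examples (one per row)

    # dW matches W: (current, previous); db matches b: (current, 1)
    dW = (1 / m) * np.dot(dZ.T, A_prev)
    db = (1 / m) * np.sum(dZ, axis=0, keepdims=True).T
    # Gradient passed back to the previous layer: (examples, previous)
    dA_prev = np.dot(dZ, W)

    return dA_prev, dW, db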

#### 2.3.2 Backward Propagation Process

With the math of the [SLBP](#231-single-layer-backward-propagation-slbp) in place, backward propagation over the whole network simply applies it to every layer, from the output layer back to the first.

In [None]:
def backward_propagation(AL, Y, caches):
    gradients = {}
    L = len(caches)  # Number of layers
    Y = Y.reshape(AL.shape)

    # Derivative of the cost w.r.t. the output AL (form of the cross-entropy derivative)
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # Output layer: dW and db are stored under index L, dA under index L - 1
    current_cache = caches[L - 1]
    gradients['dA'+str(L-1)], gradients['dW'+str(L)], gradients['db'+str(L)] = one_layer_back_propagation(dAL, current_cache)

    # Remaining layers, from layer L - 1 down to layer 1
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = one_layer_back_propagation(gradients['dA'+str(l+1)], current_cache)
        gradients['dA' + str(l)] = dA_prev_temp
        gradients['dW' + str(l + 1)] = dW_temp
        gradients['db' + str(l + 1)] = db_temp
    
    return gradients


#### 2.3.3 Update of Parameters

Referring to the code below, the weights of each layer are updated with the following formula:

$$
W_{new} = W_{current} - rate_{learning} \times dW
$$

, where W is the weight matrix and dW is its gradient. Both have the same shape, $m \times n$, where m is the dimension of the current layer and n is the dimension of the previous layer.

Similarly, the biases are updated by:

$$
b_{new} = b_{current} - rate_{learning} \times db
$$

, where b is the bias vector and db is its gradient. Both have the same shape, $m \times 1$, where m is the dimension of the current layer.

In [None]:
def update_parameters(parameters, gradients, learning_rate):
    L = len(parameters) // 2  # Number of layers

    # range(1, L + 1) so that the output layer L is updated as well
    for l in range(1, L + 1):
        parameters['W' + str(l)] = parameters['W' + str(l)] - learning_rate * gradients['dW' + str(l)]
        parameters['b' + str(l)] = parameters['b' + str(l)] - learning_rate * gradients['db' + str(l)]

    return parameters
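
As a small numeric illustration of the update formulas, the sketch below applies one step to a layer with a single neuron and two inputs. The values and the learning rate of 0.1 are made up for the example.

```python
import numpy as np

learning_rate = 0.1
W = np.array([[1.0, 2.0]])      # 1 x 2 weight matrix
dW = np.array([[0.5, -1.0]])    # gradient, same shape as W
b = np.array([[0.0]])           # 1 x 1 bias vector
db = np.array([[0.2]])          # gradient, same shape as b

W_new = W - learning_rate * dW
b_new = b - learning_rate * db
print(W_new, b_new)

# Shape pitfall: a flat (m,) gradient subtracted from an (m, 1) bias broadcasts to (m, m)
mismatched = np.zeros((3, 1)) - np.zeros(3)
print(mismatched.shape)
```

Each parameter moves against its gradient: the first weight decreases to 0.95, the second increases to 2.1 and the bias becomes -0.02. The last two lines show why db should keep the same $m \times 1$ shape as the bias vector.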