The NN should be able to predict handwritten digits from $0$ to $9$ with more than $90\%$ accuracy. The MNIST data set is used to train the NN and to test its accuracy. The training data consists of $60{,}000$ grayscale images of $28\times 28$ pixels with corresponding labels, and the test data consists of $10{,}000$ grayscale images of the same size with corresponding labels.


The neural network has the following architecture:
- Input layer $(784\times 1)$
- Hidden layer $(100\times 1)$
- Output layer $(10\times 1)$




### Forward propagation

##### Input Layer
To optimize the program, the images are converted from matrices of size $28\times 28$ to vectors of size $784\times 1$. This vector is used as the input.
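
The flattening step can be sketched as follows. This is a minimal illustration in plain Python, assuming the image is stored as a list of 28 rows and is flattened in row-major order (the order is an assumption); `flatten_image` is a hypothetical helper name.

```python
def flatten_image(image):
    """Flatten a 2-D image (list of rows) into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]


# Hypothetical 28x28 image whose pixel at (r, c) holds the value r * 28 + c,
# so the flattened vector is simply 0, 1, 2, ..., 783.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
x = flatten_image(image)
```

After flattening, the pixel at row `r` and column `c` sits at index `r * 28 + c` of the input vector.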

##### Hidden layer
The hidden layer is a fully connected layer with 100 units. Each unit calculates the weighted sum of the input and adds a bias term.
$$
z_{j} = \sum_{i=0}^{m-1} (x_iw_{ij}) + b_j,\; m=784
$$
Here $z_j$ is the weighted sum plus bias of the *j'th* unit in the hidden layer. $x_i$ is each pixel of the input data and $w_{ij}$ is the weight between the *i'th* pixel and the *j'th* unit. $b_j$ is the bias of the *j'th* unit.

In practice this is computed with matrix multiplication. To include the bias, a row containing a $1$ is added to the input, so $X\in\mathbb{R}^{785\times 1}$. Likewise, the weight matrix is initialized with an extra column that holds the biases, so $W_0\in\mathbb{R}^{100\times 785}$.\
This enables,
$$
Z_0 = W_0X ,\; Z_0\in\mathbb{R}^{100\times 1}
$$
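
The sketch below checks on a tiny example that folding the bias into the weight matrix gives the same result as the explicit sum. It assumes the $1$ is appended as the last entry of the input and the biases form the last column of the weight matrix (the position is an assumption). The sizes and values are made up, and `matvec` is a hypothetical helper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Tiny made-up layer: 3 inputs, 2 units.
weights = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]
biases = [0.5, -0.5]
x = [1.0, 2.0, 3.0]

# Explicit form: weighted sum plus bias, one unit at a time.
z_explicit = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]

# Augmented form: bias stored as an extra column, 1 appended to the input.
w_aug = [row + [b] for row, b in zip(weights, biases)]
x_aug = x + [1.0]
z_aug = matvec(w_aug, x_aug)
```

Both forms produce the same vector, so the bias needs no separate handling in the matrix product.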


Afterwards, a non-linear activation function is applied to the sum. In this case the activation function is the rectified linear unit (*ReLU*).

$$
\text{ReLU}(x) = 
\begin{cases}
x, & \text{if } x > 0 \\
0, & \text{if } x \le 0
\end{cases}
$$
$$
a_j = \text{ReLU}(z_j) 
$$

ReLU can also be applied elementwise to the whole vector for quicker computation,
$$
A_0 = \text{ReLU}(Z_0),\; A_0\in\mathbb{R}^{100\times 1}
$$
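
As a small illustration, elementwise ReLU on a vector can be written as below. This is a plain-Python sketch with made-up values; `relu` is a hypothetical helper name.

```python
def relu(z):
    """Apply ReLU to every entry of a vector."""
    return [max(0.0, value) for value in z]


z0 = [-1.5, 0.0, 2.0]
a0 = relu(z0)
```

Negative and zero entries map to `0.0`, while positive entries pass through unchanged.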

During training, dropout is applied to the hidden layer activations $A_0$ before they are passed to the output layer. For each entry in $A_0$, a random value is sampled from the range $[0, 1]$. If the value is below the dropout rate (e.g. $40\%$), the entry is set to $0$. This stochastic removal of units helps prevent overfitting by forcing the network to avoid over-reliance on specific units.
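
A minimal sketch of this dropout step, assuming Python's `random` module as the random source; the helper name `dropout` and the seed are illustrative. It follows the description literally and does not rescale the surviving activations. Many implementations additionally divide the survivors by $1 - \text{rate}$ ("inverted dropout"), which is not stated here.

```python
import random


def dropout(activations, rate, rng):
    """Set each entry to 0 with probability `rate`; keep the others unchanged."""
    return [0.0 if rng.random() < rate else a for a in activations]


rng = random.Random(0)          # fixed seed so the example is reproducible
a0 = [1.0] * 1000               # made-up activations
dropped = dropout(a0, 0.4, rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With a rate of `0.4`, roughly 40% of the entries end up as zero. Dropout is only used during training, so the function would not be called at test time.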

##### Output layer

The output layer also calculates the weighted sum plus bias,
$$
Z_1 = W_1A_0
$$
with $W_1\in\mathbb{R}^{10\times 100}$, resulting in the vector $Z_1\in\mathbb{R}^{10\times 1}$.

Afterwards, the softmax function is applied as the activation function instead of ReLU. For the *i'th* output unit,
$$
\text{softmax}(Z)_i = \frac{e^{z_i}}{\sum_{j=0}^{10-1}e^{z_j}}
$$

This transforms the output vector into a probability distribution $\hat{y}\in\mathbb{R}^{10\times 1}$, where each entry $\hat{y}_i\in[0, 1]$ and,

$$
\sum_{i=0}^{10-1}\hat{y}_i=1
$$

This is ideal for multi-class classification, because each output unit represents the probability of the corresponding class.
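
The softmax function can be sketched as follows. Subtracting the largest entry before exponentiating is an added assumption, not something described above. It is a common trick that leaves the result unchanged but avoids overflow for large inputs. The input values are made up.

```python
import math


def softmax(z):
    """Turn a vector of scores into a probability distribution."""
    shift = max(z)                                # assumed stability trick
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


z1 = [2.0, 1.0, 0.1]
y_hat = softmax(z1)
predicted_class = y_hat.index(max(y_hat))
```

The entries of `y_hat` sum to 1, and the index of the largest entry is the predicted class.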

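The weight matrices $W_0$ and $W_1$ are initialized with He initialization, which helps prevent vanishing or exploding gradients.\
Each weight is drawn from a normal distribution with mean $0$ and standard deviation $\sqrt{2/m}$,
$$
W_j \sim \mathcal{N}\left(0, \frac{2}{m}\right)
$$
where $m$ is the number of input units of the layer and the second parameter is the variance.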
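
A minimal sketch of He initialization using Python's `random.Random.gauss`, which takes the standard deviation as its second argument. The helper name `he_init`, the seed, and the choice of $m = 784$ for the first layer are assumptions; how the extra bias column is initialized is not specified here.

```python
import math
import random


def he_init(n_out, n_in, rng):
    """Draw an n_out x n_in weight matrix with standard deviation sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)           # fixed seed so the example is reproducible
w0 = he_init(100, 784, rng)

# Empirical check: the sample standard deviation should be close to sqrt(2/784).
flat = [w for row in w0 for w in row]
mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
```

For $m = 784$ the target standard deviation is $\sqrt{2/784} \approx 0.0505$.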

### Back propagation

To compute the cost of the output, the cross-entropy loss is used.
$$
L(y, \hat{y}) = -\sum_{i=0}^{10-1}y_i\log(\hat{y}_i)
$$

The parameter $y$ is the one-hot vector of the true label.
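
The loss can be sketched as below. The small constant `eps` is an added assumption that guards against $\log(0)$, and the three-class vectors are made up for illustration.

```python
import math


def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy between a one-hot label y and predicted probabilities y_hat."""
    return -sum(t * math.log(p + eps) for t, p in zip(y, y_hat))


y = [0.0, 0.0, 1.0]              # true class is index 2
y_hat = [0.1, 0.2, 0.7]
loss = cross_entropy(y, y_hat)
```

Because $y$ is one-hot, only the term for the true class contributes, so the loss reduces to $-\log(\hat{y}_{\text{true}})$.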

However, training requires the gradient of the loss function. \
Combined with the softmax output, the derivative of $L$ with respect to $Z_1$ is,
$$
\frac{\partial L}{\partial Z_1} = \hat{y} - y
$$
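
This result can be checked numerically. The sketch below compares $\hat{y} - y$ with a central finite-difference approximation of the gradient of the softmax cross-entropy loss. The three-class example, the step size `h`, and the helper names are assumptions made for illustration.

```python
import math


def softmax(z):
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(z, y):
    """Cross-entropy of softmax(z) against the one-hot label y."""
    return -sum(t * math.log(p) for t, p in zip(y, softmax(z)))


z = [0.5, -1.0, 2.0]
y = [0.0, 1.0, 0.0]

analytic = [p - t for p, t in zip(softmax(z), y)]

h = 1e-6
numeric = []
for i in range(len(z)):
    z_plus, z_minus = z[:], z[:]
    z_plus[i] += h
    z_minus[i] -= h
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * h))

max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The two gradients agree up to the finite-difference error, which is a handy sanity check when implementing backpropagation by hand.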

The goal, however, is to compute the effect of any given weight on the output and adjust the weight accordingly. \
This requires the derivative of $L$ with respect to $W_0$,
$$
\frac{\partial L}{\partial W_0} 
$$
Applying the chain rule to this we get,
$$
\frac{\partial L}{\partial W_0} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial Z_1}\frac{\partial Z_1}{\partial A_0}\frac{\partial A_0}{\partial Z_0}\frac{\partial Z_0}{\partial W_0}
$$

This shows the idea of propagating backwards, starting from the output and moving all the way to the first layer.

This results in,
$$
\begin{align*}
\delta_1 &= \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial Z_1} = \hat{y} - y \in\mathbb{R}^{10\times 1} \\[10pt]
\delta_0 &= \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial Z_1}\frac{\partial Z_1}{\partial A_0}\frac{\partial A_0}{\partial Z_0}= (W_1^T\delta_1)\circ\text{ReLU}'(Z_0) \in\mathbb{R}^{100\times 1} \\[10pt]
\frac{\partial L}{\partial W_0} &= \delta_0 X^T \in\mathbb{R}^{100\times 785}
\end{align*}
$$

With the derivative of ReLU defined as,
$$
\text{ReLU}'(x) = 
\begin{cases}
1, & \text{if } x > 0 \\
0, & \text{if } x \le 0
\end{cases}
$$
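
The backward pass can be sketched on a tiny made-up network with 3 inputs plus a bias entry, 2 hidden units, and 2 outputs. All values and helper names are assumptions for illustration. Note that the error is propagated through the transpose of $W_1$, so that the dimensions match the hidden layer.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


x = [1.0, 2.0, 3.0, 1.0]             # input with a trailing 1 for the bias
z0 = [0.7, -0.2]                     # made-up hidden pre-activations
a0 = [max(0.0, v) for v in z0]       # ReLU
w1 = [[0.5, -0.5],
      [1.0, 2.0]]                    # output weights (2 outputs x 2 hidden)
y_hat = [0.8, 0.2]                   # made-up softmax output
y = [1.0, 0.0]                       # one-hot label

delta1 = [p - t for p, t in zip(y_hat, y)]
relu_grad = [1.0 if v > 0 else 0.0 for v in z0]
delta0 = [b * g for b, g in zip(matvec(transpose(w1), delta1), relu_grad)]

grad_w1 = outer(delta1, a0)          # same shape as w1
grad_w0 = outer(delta0, x)           # 2 hidden units x 4 input entries
```

The second hidden unit has a negative pre-activation, so its ReLU derivative is zero and no gradient flows back through it.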

Finally, the weights can be updated,
$$
W_0 \leftarrow W_0-\eta(\delta_0X^T) \\[10pt]
W_1 \leftarrow W_1-\eta(\delta_1 A_0^T)
$$
where $\eta$ is the learning rate, defined as,
$$
\eta(\text{epoch}) = 
\begin{cases}
0.01, & \text{epoch} < 5 \\
0.01 \cdot 0.95^{\text{epoch}}, & \text{otherwise}
\end{cases}
$$
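
The schedule and a single update step can be sketched as follows. The helper names and the small weight and gradient matrices are made up for illustration.

```python
def learning_rate(epoch):
    """Constant rate for the first 5 epochs, exponential decay afterwards."""
    if epoch < 5:
        return 0.01
    return 0.01 * 0.95 ** epoch


def sgd_step(weights, gradient, eta):
    """Return the weights after one gradient descent step of size eta."""
    return [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, gradient)]


rates = [learning_rate(epoch) for epoch in range(8)]

w = [[1.0, 2.0],
     [3.0, 4.0]]
grad = [[10.0, 0.0],
        [0.0, -10.0]]
w_new = sgd_step(w, grad, learning_rate(0))
```

Because the exponent is the full epoch count, the rate as written drops from $0.01$ to about $0.0077$ at epoch 5 and then decays by $5\%$ per epoch.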