### Prerequisites:

Linear Algebra (Matrix and Vector Operations)

Mathematical Analysis (Derivatives)

### Neural Network architecture and Forward Propagation procedure.

![architecture](architecture.png)

### Input layer (Layer 0):

$$A^{[0]} = X_{(784; m)}$$

The array of original images: each of the $m$ columns is one image flattened into a vector of 784 pixel values.

### Hidden layer (Layer 1):

$$Z^{[1]}_{(10; m)} = W^{[1]}_{(10; 784)} A^{[0]}_{(784; m)} + bias^{[1]}_{(10; 1)}$$

$$A^{[1]}_{(10; m)} = ActivationFunction(Z^{[1]}_{(10; m)}) = ReLU(Z^{[1]})$$

Where:

$W^{[1]}_{(10; 784)}$ - weights (connections between layers)

$bias^{[1]}_{(10; 1)}$ - constant added to shift the activation function (could also be interpreted as a systematic error term); the same column is added to each of the $m$ columns

$Z^{[1]}_{(10; m)}$ - linear combination of the weights with the original images, plus the bias term

$A^{[1]}_{(10; m)}$ - values of the neurons after the Activation Function is applied
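As a rough illustration of the shapes involved, the hidden-layer step could be sketched in NumPy as follows. The batch size `m = 5`, the random stand-in data, and the small random weight initialization are assumptions made for this example only; none of them is prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5                                        # assumed batch size, for illustration
X = rng.random((784, m))                     # stand-in for m flattened images
W1 = rng.standard_normal((10, 784)) * 0.01   # assumed initialization
b1 = np.zeros((10, 1))                       # one bias per hidden neuron

Z1 = W1 @ X + b1          # (10, 784) @ (784, m) -> (10, m); b1 broadcasts over columns
A1 = np.maximum(Z1, 0)    # ReLU applied element-wise

print(Z1.shape, A1.shape)  # (10, 5) (10, 5)
```

Storing the bias as a `(10, 1)` column lets NumPy broadcasting add it to every sample without copying it `m` times.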

#### Rectified Linear Unit (ReLU)

ReLU is the activation function for the first layer of our Neural Network. It is defined as the positive part of its argument:
$$
f(x)=\max(0,x)=\frac{x+|x|}{2}=
\begin{cases}
  x & \text{if } x>0,\\
  0 & \text{otherwise.}
\end{cases}
$$

<center><img src="relu.png" alt="ReLU" style="height: 200px; width:300px;" /></center>
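A minimal NumPy sketch of ReLU, which also checks that the two closed forms of the definition agree. The function name `relu` and the sample values are chosen for this example.

```python
import numpy as np

def relu(z):
    """Element-wise positive part of z."""
    return np.maximum(z, 0)

z = np.array([-2.0, -0.5, 0.0, 1.5])
a = relu(z)                                   # negatives become 0, positives pass through
same = np.allclose(a, (z + np.abs(z)) / 2)    # matches the (x + |x|) / 2 form
```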

### Output layer (Layer 2):

$$Z^{[2]}_{(10; m)} = W^{[2]}_{(10; 10)} A^{[1]}_{(10; m)} + bias^{[2]}_{(10; 1)}$$

$$A^{[2]}_{(10; m)} = ActivationFunction(Z^{[2]}_{(10; m)}) = SoftMax(Z^{[2]})$$

Where:

$W^{[2]}_{(10; 10)}$ - weights (connections between layers)

$bias^{[2]}_{(10; 1)}$ - constant added to shift the activation function (could also be interpreted as a systematic error term)

$Z^{[2]}_{(10; m)}$ - linear combination of the weights with the hidden-layer activations $A^{[1]}$, plus the bias term

$A^{[2]}_{(10; m)}$ - values of the neurons after the Activation Function is applied

#### Softmax

Softmax is the activation function that we use for the second layer of the Neural Network. It turns a vector $z$ of $K$ raw outputs into probabilities that sum to 1 (here $K = 10$):
$$softmax(z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K}e^{z_{j}}}$$

<center><img src="softmax_example.png" alt="Softmax example" style="height: 200px; width:400px;" /></center>

<center><img src="softmax_plot.png" alt="Softmax plot" style="height: 280px; width:400px;" /></center>
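The softmax formula could be sketched column-wise in NumPy as below. Subtracting the column maximum before exponentiating is an addition that is not part of the formula above: it is a common numerical-stability trick and leaves the result unchanged, because the shift cancels between numerator and denominator. The toy sizes (3 classes, 2 samples) are for illustration only.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (K, m)."""
    shifted = Z - Z.max(axis=0, keepdims=True)   # stability shift; result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

Z2 = np.array([[1.0, 0.0],
               [2.0, 0.0],
               [3.0, 0.0]])       # K = 3 classes, m = 2 samples (toy sizes)
A2 = softmax(Z2)                  # each column sums to 1
```

In the second column all inputs are equal, so every class gets probability 1/3; in the first column the largest input gets the largest probability.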

(Technically, there is one more step: we take the maximum value of the output vector from the Softmax function, and its index is the predicted label.)
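Putting both layers and the final arg-max step together, a forward pass might look like the sketch below. The function names `forward` and `predict`, the random stand-in images, and the weight initialization are assumptions for this example, not part of the original description.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass: returns Z1, A1, Z2, A2 for X of shape (784, m)."""
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)                          # ReLU
    Z2 = W2 @ A1 + b2
    exp = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
    A2 = exp / exp.sum(axis=0, keepdims=True)       # softmax per column
    return Z1, A1, Z2, A2

def predict(A2):
    """Index of the largest probability in each column."""
    return np.argmax(A2, axis=0)

rng = np.random.default_rng(0)
X = rng.random((784, 4))                            # 4 stand-in images
W1, b1 = rng.standard_normal((10, 784)) * 0.01, np.zeros((10, 1))
W2, b2 = rng.standard_normal((10, 10)) * 0.01, np.zeros((10, 1))

_, _, _, A2 = forward(X, W1, b1, W2, b2)
labels = predict(A2)                                # shape (4,), values in 0..9
```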

### Backward Propagation procedure.

The overall network is a combination of function composition and matrix multiplication:
$$ g(x) := f^{[L]} (W^{[L]} f^{[L-1]}(W^{[L-1]} ... f^{[1]}(W^{[1]}x))) $$

Knowing the predicted label $\hat{y}$, we go the opposite way (right to left in our picture). We measure how much the prediction deviated from the actual label $y$ via the cross-entropy cost function.

The cost function is the cross-entropy loss:

$$ C(y, g(x))=-\sum_{j}y_{j}\ln (\hat{y}_{j}) $$

Let's also recall what an actual label vector looks like. It is a one-hot vector, here with the 1 at index 2 (counting from 0):

$$ y = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] $$
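To make the cost concrete, here is a small sketch that builds the one-hot vector for label 2 and evaluates the cross-entropy formula on it. The helper names `one_hot` and `cross_entropy` and the uniform stand-in prediction are assumptions for this example. Note that the predicted probabilities must be strictly positive for the logarithm to be defined.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a (num_classes, m) one-hot matrix."""
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

def cross_entropy(Y, Y_hat):
    """Per-sample cost -sum_j y_j * ln(y_hat_j), computed column by column."""
    return -np.sum(Y * np.log(Y_hat), axis=0)

y = one_hot(np.array([2]))                 # column with a single 1 at index 2
uniform = np.full((10, 1), 0.1)            # a prediction with no preference
cost = cross_entropy(y, uniform)           # equals ln(10) for any one-hot y
```

Because $y$ is one-hot, only the term for the true class survives the sum, so the cost is simply $-\ln$ of the probability assigned to the correct label.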

To explain what $\hat{y}_{j}$ is in the cross-entropy formula, we need to understand exactly how we get our predictions from the last layer. We pass all the outputs $Z^{[2]}$ through the softmax to get a vector of probabilities, where each value is defined as:

$$ \hat{y}_{j} = A^{[2]}_{j} = p_{j} = \frac{e^{z_{j}}}{\sum_{k}e^{z_{k}}} $$

To measure how far the prediction deviates from the label, we combine Cross Entropy and Softmax into our cost function. An added bonus is that this combination has a really simple derivative (proof: https://www.youtube.com/watch?v=5-rVLSc2XdE&ab_channel=SmartAlphaAI):

$$ \frac{\partial C}{\partial Z^{[2]}} = \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} =  A^{[2]} - y $$
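The identity can also be checked numerically. The sketch below compares $A^{[2]} - y$ with a central finite-difference approximation of the gradient for a single sample. The random input, the step size `h`, and the helper names are assumptions for this check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cost(z, y):
    """Cross entropy of softmax(z) against the one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.standard_normal(10)          # arbitrary pre-activations for one sample
y = np.zeros(10)
y[2] = 1.0                           # one-hot label

analytic = softmax(z) - y            # the claimed derivative A2 - y

h = 1e-6                             # finite-difference step
numeric = np.array([
    (cost(z + h * e_i, y) - cost(z - h * e_i, y)) / (2 * h)
    for e_i in np.eye(10)            # perturb one coordinate at a time
])

max_error = np.max(np.abs(analytic - numeric))
```

If the formula holds, `max_error` is at the level of finite-difference noise, many orders of magnitude smaller than the gradient entries themselves.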

To minimize this function we step along its *antigradient* (the negative of the gradient).

The gradient collects the derivatives we want to find:
$$\nabla C = \left( \frac{\partial C}{\partial W^{[1]}}, \frac{\partial C}{\partial (bias)^{[1]}}, \frac{\partial C}{\partial W^{[2]}}, \frac{\partial C}{\partial (bias)^{[2]}} \right) $$

Using the chain rule (the rule for the derivative of a composite function) we will find the actual derivatives of $C$. As a refresher, here is the chain rule:

Given
$$ y = f(t) $$
$$ t = g(x) $$

The chain rule is:
$$\frac{\partial y}{\partial x} = \frac{\partial t}{\partial x} \cdot \frac{\partial y}{\partial t}$$
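As a quick sanity check of the rule (the functions here are chosen purely for illustration), take $y = t^{2}$ with $t = 3x + 1$:

$$\frac{\partial y}{\partial x} = \frac{\partial t}{\partial x} \cdot \frac{\partial y}{\partial t} = 3 \cdot 2t = 6(3x + 1)$$

Expanding $y = (3x + 1)^{2} = 9x^{2} + 6x + 1$ and differentiating directly gives $18x + 6$, the same result.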

Let's apply the chain rule to our case. Each quantity depends on the previous ones as follows:
$$ Z^{[1]} = Z^{[1]} (W^{[1]}, b^{[1]}) $$
$$ A^{[1]} = A^{[1]} (Z^{[1]}) $$
$$ Z^{[2]} = Z^{[2]} (W^{[2]}, b^{[2]}, A^{[1]}) $$
$$ A^{[2]} = A^{[2]} (Z^{[2]}) $$
$$ C = C(A^{[2]}) $$
$$ C = C(A^{[2]} (Z^{[2]} (A^{[1]} (Z^{[1]} (W^{[1]}))))) $$

And the derivatives we need to compute are:
$$\frac{\partial C}{\partial W^{[1]}} = \frac{\partial Z^{[1]}}{\partial W^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial C}{\partial A^{[1]}} = \frac{\partial Z^{[1]}}{\partial W^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \left( \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} \right) = \left( \frac{\partial Z^{[1]}}{\partial W^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \right) \cdot \frac{\partial C}{\partial Z^{[2]}} $$

$$\frac{\partial C}{\partial b^{[1]}} = \frac{\partial Z^{[1]}}{\partial b^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial C}{\partial A^{[1]}} = \frac{\partial Z^{[1]}}{\partial b^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \left( \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} \right) = \left( \frac{\partial Z^{[1]}}{\partial b^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \right) \cdot \frac{\partial C}{\partial Z^{[2]}}$$

$$\frac{\partial C}{\partial W^{[2]}} = \frac{\partial Z^{[2]}}{\partial W^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}}$$

$$\frac{\partial C}{\partial b^{[2]}} = \frac{\partial Z^{[2]}}{\partial b^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}}$$

We'll start from the bottom:
$$ \frac{\partial C}{\partial b^{[2]}} = \frac{\partial Z^{[2]}}{\partial b^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} = A^{[2]} - y $$

$$ \frac{\partial C}{\partial Z^{[2]}} = \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} = A^{[2]} - y $$

$$Z^{[2]}_{(10; m)} = W^{[2]}_{(10; 10)} A^{[1]}_{(10; m)} + bias^{[2]}_{(10; 1)}$$

$$ \frac{\partial Z^{[2]}}{\partial b^{[2]}} = 1 $$


The next derivative is:

$$\frac{\partial C}{\partial W^{[2]}} = \frac{\partial Z^{[2]}}{\partial W^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} = \left( A^{[2]} - y\right) \cdot A^{[1]T}_{(m; 10)}$$

$$ \frac{\partial C}{\partial Z^{[2]}} = \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial C}{\partial A^{[2]}} =  A^{[2]} - y $$

$$Z^{[2]}_{(10; m)} = W^{[2]}_{(10; 10)} A^{[1]}_{(10; m)} + bias^{[2]}_{(10; 1)}$$

$$ \frac{\partial C}{\partial W^{[2]}} = \frac{\partial C}{\partial Z^{[2]}} \cdot A^{[1]T}_{(m; 10)} = \left( A^{[2]} - y\right) \cdot A^{[1]T}_{(m; 10)}$$

We move on to the first layer:

$$\frac{\partial C}{\partial b^{[1]}} = \frac{\partial Z^{[1]}}{\partial b^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial C}{\partial A^{[1]}} = W^{[2]T} \left( A^{[2]} - y \right) \cdot ReLU^{'}(Z^{[1]})$$

$$\frac{\partial C}{\partial Z^{[1]}} = W^{[2]T} \frac{\partial C}{\partial Z^{[2]}} \cdot ReLU^{'}(Z^{[1]}) = W^{[2]T} \left( A^{[2]} - y \right) \cdot ReLU^{'}(Z^{[1]})$$

The product with $ReLU^{'}(Z^{[1]})$ is taken element-wise.

The ReLU derivative is:
$$A^{[1]}_{(10; m)} = ReLU(Z^{[1]}_{(10; m)})$$

$$
\frac{\partial A^{[1]}}{\partial Z^{[1]}}=
\begin{cases}
  1 & \text{if } Z^{[1]}>0,\\
  0 & \text{if } Z^{[1]}<0.
\end{cases}
$$

At $Z^{[1]} = 0$ the derivative is undefined; implementations commonly use 0 there.
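Since the slope of ReLU is 1 for positive inputs and 0 for negative ones, its derivative can be sketched as a simple mask. The function name `relu_derivative` is chosen for this example, and returning 0 at exactly `Z == 0` is a convention assumed here.

```python
import numpy as np

def relu_derivative(Z):
    """1 where Z > 0, else 0 (including at Z == 0, by convention)."""
    return (Z > 0).astype(float)

Z1 = np.array([[-1.0, 2.0],
               [ 0.0, 0.5]])
mask = relu_derivative(Z1)    # same shape as Z1, entries are 0.0 or 1.0
```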

And the last component of the gradient is:

$$\frac{\partial C}{\partial W^{[1]}} = \frac{\partial Z^{[1]}}{\partial W^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial C}{\partial A^{[1]}} = \left( W^{[2]T} \left( A^{[2]} - y \right) \cdot ReLU^{'}(Z^{[1]}) \right) X^{T} $$

We've already computed all we need to find the derivative:

$$\frac{\partial C}{\partial W^{[1]}} = \frac{\partial C}{\partial Z^{[1]}} X^{T}$$

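Averaged over the $m$ samples, the actual derivatives are given below. Here $\partial Z$ is shorthand for $\frac{\partial C}{\partial Z}$ (and likewise for $W$ and $bias$), the columns of $y$ are the one-hot label vectors, and $(\cdot)_{:, i}$ denotes the $i$-th column, i.e. the $i$-th sample:

$${\partial Z^{[2]}_{(10; m)}} = A^{[2]}_{(10; m)} - {y}_{(10; m)}$$

$${\partial W^{[2]}_{(10; 10)}} = \frac{1}{m}{\partial Z^{[2]}_{(10; m)}} A^{[1]T}_{(m; 10)}$$

$${\partial (bias)^{[2]}_{(10; 1)}} = \frac{1}{m} \sum_{i=1}^{m}\left({\partial Z^{[2]}}\right)_{:, i}$$

$${\partial Z^{[1]}_{(10; m)}} = W^{[2]T}_{(10; 10)} {\partial Z^{[2]}_{(10; m)}} \cdot ReLU^{'}(Z^{[1]}_{(10; m)})$$

$${\partial W^{[1]}_{(10; 784)}} = \frac{1}{m}{\partial Z^{[1]}_{(10; m)}} X^{T}_{(m; 784)}$$

$${\partial (bias)^{[1]}_{(10; 1)}} = \frac{1}{m} \sum_{i=1}^{m}\left({\partial Z^{[1]}}\right)_{:, i}$$

In the expression for $\partial Z^{[1]}$ the product with $ReLU^{'}$ is element-wise.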
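These batch-averaged gradients could be sketched in NumPy as below. This is an illustrative sketch rather than a reference implementation: the function name `backward`, the toy batch size, and the random inputs are assumptions. The bias gradients sum over the sample axis (`axis=1`), and the ReLU derivative is applied as an element-wise mask at `Z1`.

```python
import numpy as np

def backward(X, Y, W2, Z1, A1, A2):
    """Gradients of the batch-averaged cost; X is (784, m), Y is one-hot (10, m)."""
    m = X.shape[1]
    dZ2 = A2 - Y                                   # (10, m)
    dW2 = dZ2 @ A1.T / m                           # (10, 10)
    db2 = dZ2.sum(axis=1, keepdims=True) / m       # (10, 1), summed over samples
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                  # element-wise ReLU' mask, (10, m)
    dW1 = dZ1 @ X.T / m                            # (10, 784)
    db1 = dZ1.sum(axis=1, keepdims=True) / m       # (10, 1)
    return dW1, db1, dW2, db2

# Toy inputs (assumed sizes and random values) to exercise the function.
rng = np.random.default_rng(0)
m = 6
X = rng.random((784, m))
Y = np.eye(10)[:, rng.integers(0, 10, size=m)]     # random one-hot columns
W1, b1 = rng.standard_normal((10, 784)) * 0.01, np.zeros((10, 1))
W2, b2 = rng.standard_normal((10, 10)) * 0.01, np.zeros((10, 1))

# Forward pass to obtain the quantities the backward pass needs.
Z1 = W1 @ X + b1
A1 = np.maximum(Z1, 0)
Z2 = W2 @ A1 + b2
E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = E / E.sum(axis=0, keepdims=True)

dW1, db1, dW2, db2 = backward(X, Y, W2, Z1, A1, A2)
```

Each gradient has the same shape as the parameter it belongs to, which is what makes the update step a plain element-wise subtraction.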

Now we update the weights and biases using the gradients found above, scaled by the Learning Rate hyperparameter, to obtain the weights and biases for the next iteration:

$$ W^{[1]}_{(10; 784)} = W^{[1]}_{(10; 784)} - LearningRate \cdot {\partial W^{[1]}_{(10; 784)}} $$

$$ bias^{[1]}_{(10; 1)} = bias^{[1]}_{(10; 1)} - LearningRate \cdot {\partial (bias)^{[1]}_{(10; 1)}} $$

$$ W^{[2]}_{(10; 10)} = W^{[2]}_{(10; 10)} - LearningRate \cdot {\partial W^{[2]}_{(10; 10)}} $$

$$ bias^{[2]}_{(10; 1)} = bias^{[2]}_{(10; 1)} - LearningRate \cdot {\partial (bias)^{[2]}_{(10; 1)}} $$
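The update rule is the same for every parameter, so it could be sketched once over a dictionary of parameters. The dictionary layout, the name `update_params`, and the toy values are assumptions for this example.

```python
import numpy as np

def update_params(params, grads, learning_rate):
    """One gradient-descent step: parameter minus learning_rate times its gradient."""
    return {name: params[name] - learning_rate * grads[name] for name in params}

params = {"W2": np.ones((10, 10)), "b2": np.zeros((10, 1))}
grads = {"W2": np.full((10, 10), 0.5), "b2": np.full((10, 1), -1.0)}

new_params = update_params(params, grads, learning_rate=0.1)
# W2 entries: 1 - 0.1 * 0.5 = 0.95;  b2 entries: 0 - 0.1 * (-1) = 0.1
```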

All we need to do now is run the Neural Network over the whole dataset of 60k samples, repeating the forward pass, backward pass, and parameter update, to train our model.
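The whole procedure could be tied together as in the sketch below. To keep it self-contained it trains on a small synthetic batch of random data rather than the real 60k-sample dataset; the batch size, learning rate, iteration count, initialization, and the use of full-batch gradient descent are all assumptions made for this example.

```python
import numpy as np

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.maximum(Z1, 0)                                   # ReLU
    Z2 = p["W2"] @ A1 + p["b2"]
    E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
    return Z1, A1, E / E.sum(axis=0, keepdims=True)          # softmax output A2

def backward(X, Y, p, Z1, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y
    dZ1 = (p["W2"].T @ dZ2) * (Z1 > 0)                       # element-wise ReLU' mask
    return {"W1": dZ1 @ X.T / m, "b1": dZ1.sum(axis=1, keepdims=True) / m,
            "W2": dZ2 @ A1.T / m, "b2": dZ2.sum(axis=1, keepdims=True) / m}

rng = np.random.default_rng(0)
m = 64                                                       # synthetic stand-in for the dataset
X = rng.random((784, m))
labels = rng.integers(0, 10, size=m)
Y = np.eye(10)[:, labels]                                    # one-hot labels, (10, m)

p = {"W1": rng.standard_normal((10, 784)) * 0.01, "b1": np.zeros((10, 1)),
     "W2": rng.standard_normal((10, 10)) * 0.01, "b2": np.zeros((10, 1))}

learning_rate, losses = 0.1, []
for _ in range(50):
    Z1, A1, A2 = forward(X, p)
    losses.append(-np.mean(np.sum(Y * np.log(A2), axis=0)))  # mean cross entropy
    grads = backward(X, Y, p, Z1, A1, A2)
    p = {k: p[k] - learning_rate * grads[k] for k in p}      # gradient-descent update
```

Tracking `losses` gives a simple way to see whether training is making progress: the mean cross entropy should go down over the iterations.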

To fully represent the model, it is necessary and sufficient to know all the weights and biases as well as the model architecture.