### How to build a Neural Network
#### `Notation`
<img src="image_neural_network.png" width=400>

where:
1. Weights
   - $w^{l}_{jk}$ is the weight of the connection from the $k$-th neuron in layer $l - 1$ to the $j$-th neuron in layer $l$.
   - $W^l$ is the weight matrix for the connections from layer $l - 1$ to layer $l$; row $j$ holds the incoming weights of neuron $j$.
    Example: 
    $$
    W^l = 
    \begin{bmatrix}
    w^{l}_{11} & w^{l}_{12} & \cdots & w^{l}_{1k} \\
    w^{l}_{21} & w^{l}_{22} & \cdots & w^{l}_{2k} \\
    \vdots & \vdots & \ddots & \vdots \\
    w^{l}_{j1} & w^{l}_{j2} & \cdots & w^{l}_{jk}
    \end{bmatrix}
    $$
2. Bias
   - $b^{l}_{k}$ is the bias of the $k$-th neuron in layer $l$.
   - $b^{l}$ is the bias vector of layer $l$.
    Example:
    $$
    b^l = 
    \begin{bmatrix}
    b^{l}_{1} \\
    b^{l}_{2} \\
    \vdots \\
    b^{l}_{k}
    \end{bmatrix}
    $$
3. Activation
   - $a^{l}_{k}$ is the activation (output) of the $k$-th neuron in layer $l$.
   - $a^{l}$ is the activation vector of all neurons in layer $l$.
    Example:
    $$
    a^{l} = 
    \begin{bmatrix}
    a^{l}_{1} \\
    a^{l}_{2} \\
    \vdots \\
    a^{l}_{k}
    \end{bmatrix}
    $$
4. Linear Combination
   - $z^{l}_{j}$ is the pre-activation (weighted input) of neuron $j$ in layer $l$.
   - $z^{l}$ is the vector of pre-activations of layer $l$.
    $$
    z^{l} = W^{l} a^{l - 1} + b^{l}
    $$
5. Activation Function
   - $a^{l}_{j} = \sigma^{l}(z^{l}_{j})$: activation output of neuron j.
   - $a^{l} = \sigma^{l}(z^{l})$: vectorized activation.
6. Output
   - $a^{L} = \hat{y}$: prediction at output layer L.


#### `Step 1: Parameter Initialization`
When building and training a neural network, the initial weights strongly affect how well and how quickly the model trains. If the weights are poorly initialized, training may run into the Vanishing Gradient problem or the Exploding Gradient problem.

`1. Zero Initialization`
- As the name suggests, zero initialization assigns zero to every weight. This is highly ineffective: every neuron in a layer computes the same output and receives the same gradient, so all neurons learn the same feature at each iteration. The same issue occurs with any constant initialization, so constant initializations are not preferred.
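
The symmetry problem can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-4-1 sigmoid network, the quadratic cost $C = \frac{1}{2}(y - a^{L})^2$, and the backpropagation formulas derived in Step 4; the constant `0.5`, the layer sizes, and the function name `hidden_weight_gradient` are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_weight_gradient(W1, b1, W2, b2, x, y):
    """Gradient of C = 0.5 * (y - a2)^2 with respect to W1 (two sigmoid layers)."""
    a1 = sigmoid(W1 @ x + b1)                    # hidden activations
    a2 = sigmoid(W2 @ a1 + b2)                   # network output
    delta2 = (a2 - y) * a2 * (1.0 - a2)          # output-layer error
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)   # hidden-layer error
    return delta1 @ x.T                          # dC/dW1, shape (4, 3)

x = np.array([[0.5], [-1.2], [0.3]])   # one input sample with 3 features
y = np.array([[1.0]])
b1 = np.zeros((4, 1))
b2 = np.zeros((1, 1))

# Constant initialization: every weight equals 0.5 (arbitrary constant).
grad_const = hidden_weight_gradient(
    np.full((4, 3), 0.5), b1, np.full((1, 4), 0.5), b2, x, y)
# Every row is the same, so all 4 hidden neurons receive the same update.
rows_identical = np.allclose(grad_const, grad_const[0])

# Zero initialization: W2 = 0 blocks the error signal, so dC/dW1 is exactly 0.
grad_zero = hidden_weight_gradient(
    np.zeros((4, 3)), b1, np.zeros((1, 4)), b2, x, y)

print("rows identical:", rows_identical)
print("zero-init gradient is all zeros:", bool(np.all(grad_zero == 0.0)))
```

With constant weights the hidden neurons stay identical after every update, so the layer behaves like a single neuron regardless of its width.
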

`2. Random Initialization`
- To overcome the shortcomings of zero or constant initialization, random initialization assigns random non-zero values to the weights. However, if the scale of these random values is poorly chosen, problems such as **Overfitting, Vanishing Gradient Problem, Exploding Gradient Problem** might occur. 
- Random Initialization can be of two kinds:
   - `2.1 Random Normal`: The weights are drawn from a normal distribution.
    $$
    w_i \sim N(0, 1)
    $$
   - `2.2 Random Uniform`: The weights are drawn from a uniform distribution.
    $$
    w_i \sim U[min, max]
    $$
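
Both kinds of random initialization can be sketched with NumPy's random generator. The layer sizes `n_in` and `n_out`, the seed, and the interval `[-0.5, 0.5]` are assumptions made for illustration; the notes do not fix particular values for $min$ and $max$.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only

n_in, n_out = 3, 4                    # hypothetical layer sizes

# 2.1 Random Normal: w ~ N(0, 1)
W_normal = rng.standard_normal(size=(n_out, n_in))

# 2.2 Random Uniform: w ~ U[low, high], with illustrative bounds
low, high = -0.5, 0.5
W_uniform = rng.uniform(low, high, size=(n_out, n_in))

print(W_normal.shape, W_uniform.shape)
```

Each matrix has one row per neuron in the current layer and one column per neuron in the previous layer, matching the layout of $W^l$ in the notation section.
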
`3. Xavier/Glorot Initialization`
  - `3.1 Xavier Uniform`: Xavier/Glorot uniform initialization is suitable for layers where the activation function is **Sigmoid**. Here $fan\_in$ and $fan\_out$ are the numbers of input and output connections of the layer.
    $$
    w_i \sim U\left[-\sqrt{\frac{6}{fan\_in + fan\_out}}, \sqrt{\frac{6}{fan\_in + fan\_out}}\right]
    $$
  - `3.2 Xavier Normal`: Xavier/Glorot normal initialization is likewise suitable for layers where the activation function is **Sigmoid**.
    $$
    w_i \sim N(0, \sigma^2)
    $$
    Here, the standard deviation $\sigma$ is given by:
    $$
    \sigma = \sqrt{\frac{2}{fan\_in + fan\_out}}
    $$
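
A possible NumPy sketch of both Xavier/Glorot variants follows. It assumes the commonly used Glorot normal standard deviation $\sqrt{2 / (fan\_in + fan\_out)}$, which gives the normal variant the same variance as the uniform one (a uniform distribution on $[-a, a]$ has variance $a^2 / 3$). The function names and layer sizes are hypothetical.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample W of shape (fan_out, fan_in) from U[-limit, limit]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def xavier_normal(fan_in, fan_out, rng):
    """Sample W of shape (fan_out, fan_in) from N(0, std^2)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)        # arbitrary seed
fan_in, fan_out = 300, 200            # illustrative layer sizes

W_u = xavier_uniform(fan_in, fan_out, rng)
W_n = xavier_normal(fan_in, fan_out, rng)

target_std = np.sqrt(2.0 / (fan_in + fan_out))
print("target std:", target_std)
print("uniform sample std:", W_u.std(), "normal sample std:", W_n.std())
```
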
`4. He Initialization`
  - `4.1 He Uniform`: He Uniform initialization is suitable for layers where the ReLU activation function is used.
    $$
    w_i \sim U\left[-\sqrt{\frac{6}{fan\_in}}, \sqrt{\frac{6}{fan\_in}}\right]
    $$
  - `4.2 He Normal`: He Normal initialization is likewise suitable for layers where the ReLU activation function is used.
    $$
    w_i \sim N(0, \sigma^2)
    $$
    Here, the standard deviation $\sigma$ is given by:
    $$
    \sigma = \sqrt{\frac{2}{fan\_in}}
    $$
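
The He variants can be sketched the same way. This example assumes the symmetric bound $\sqrt{6 / fan\_in}$ for the uniform case, so that both variants share the variance $2 / fan\_in$; the function names and layer sizes are hypothetical.

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng):
    """Sample W of shape (fan_out, fan_in) from U[-limit, limit]."""
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """Sample W of shape (fan_out, fan_in) from N(0, std^2)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)        # arbitrary seed
fan_in, fan_out = 256, 128            # illustrative layer sizes

W_hu = he_uniform(fan_in, fan_out, rng)
W_hn = he_normal(fan_in, fan_out, rng)

he_std = np.sqrt(2.0 / fan_in)
print("target std:", he_std)
print("uniform sample std:", W_hu.std(), "normal sample std:", W_hn.std())
```

Only `fan_in` enters the scale here, which is the main difference from the Xavier/Glorot formulas.
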

#### `Step 2: Forward Propagation`
- Input: $a^{0} = X$
- For each layer $l = 1, \dots, L$:
$$
z^{l} = W^{l}a^{l - 1} + b^{l}, \quad a^{l} = \sigma^{l}(z^{l})
$$
- Final output: 
$$
a^{L} = \hat{y}
$$
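
The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a sigmoid activation in every layer, column-vector inputs, and hypothetical layer sizes `[3, 4, 2]`; the function name `forward` is not from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Return the pre-activations z^l and the activations a^l, with a^0 = x."""
    a = x
    activations = [a]      # activations[l] holds a^l
    zs = []                # zs[l - 1] holds z^l
    for W, b in zip(weights, biases):
        z = W @ a + b      # z^l = W^l a^{l-1} + b^l
        a = activation(z)  # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

rng = np.random.default_rng(0)   # arbitrary seed
sizes = [3, 4, 2]                # input, hidden, output sizes (illustrative)
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

x = rng.standard_normal((3, 1))  # one input sample as a column vector
zs, activations = forward(x, weights, biases)
y_hat = activations[-1]          # a^L, the prediction
print(y_hat.ravel())
```

Keeping every $z^{l}$ and $a^{l}$ is deliberate: backpropagation reuses them instead of recomputing the forward pass.
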


#### `Step 3: Compute Cost Function`
- Cost function, summed over all $N$ training samples $x$:
$$
C = \frac{1}{2N} \sum_{x}\left( y(x) - a^{L}(x) \right)^2
$$
- For one input sample:
$$
C = \frac{1}{2}(y - a^{L})^2
$$
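
As a small worked example of the quadratic cost, the sketch below assumes that each column of `y` and `y_hat` is one sample; the function name `quadratic_cost` and the numbers are made up for illustration.

```python
import numpy as np

def quadratic_cost(y, y_hat):
    """C = 1 / (2N) * sum of squared errors, where N is the number of columns."""
    n_samples = y.shape[1]
    return np.sum((y - y_hat) ** 2) / (2.0 * n_samples)

# Three samples with a single output each (illustrative values).
y = np.array([[1.0, 0.0, 1.0]])
y_hat = np.array([[0.8, 0.1, 0.4]])

# Squared errors are 0.04, 0.01 and 0.36, so C = 0.41 / 6.
cost_batch = quadratic_cost(y, y_hat)

# With a single sample the formula reduces to C = 0.5 * (y - a^L)^2.
cost_single = quadratic_cost(np.array([[1.0]]), np.array([[0.8]]))

print(cost_batch, cost_single)
```
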

#### `Step 4: Backpropagation`
- We need to find: 
$$
\Large
\begin{cases}
\frac{\partial C}{\partial w^{l}_{jk}} \\
\frac{\partial C}{\partial b^{l}_{j}}
\end{cases}
$$
`NOTE`: In backpropagation, all values $a^{l}_{j}$, $z^{l}_{j}$ and $y$ are already available from the forward pass and the training data.
- We have: 
    $$
    \frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial z^{l}_{j}}\frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}}
    $$
    And
    $$
    \large
    z^{l}_{j} = \sum_{k}w^{l}_{jk}.a^{l - 1}_{k} + b^{l}_{j} = w^{l}_{j1}.a^{l - 1}_{1} + w^{l}_{j2}.a^{l - 1}_{2} + ... + w^{l}_{jk}.a^{l - 1}_{k} + b^{l}_{j}
    $$
    Thus
    $$
    \large
    \frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}} = a^{l - 1}_{k}, \quad \frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = 1
    $$
    Therefore
    $$
    \Large
    \begin{cases}
    \frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial z^{l}_{j}}a^{l - 1}_{k} \\
    \frac{\partial C}{\partial b^{l}_{j}} = \frac{\partial C}{\partial z^{l}_{j}}
    \end{cases}
    $$
- We define the error $\delta^{l}_{j}$ of neuron j in layer l by:
    $$
    \large
    \delta^{l}_{j} := \frac{\partial C}{\partial z^{l}_{j}}
    $$
    Therefore
    $$
    \Large
    \begin{cases}
    \frac{\partial C}{\partial w^{l}_{jk}} = \delta^{l}_{j} a^{l - 1}_{k} \\
    \frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j}
    \end{cases}
    $$
    We have
    $$
    \large
    a^{l}_{j} = \sigma(z^{l}_{j})
    $$
    Hence
    $$
    \large
    \delta^{l}_{j} = \frac{\partial C}{\partial z^{l}_{j}} = \frac{\partial C}{\partial a^{l}_{j}} \frac{\partial a^{l}_{j}}{\partial z^{l}_{j}} = \frac{\partial C}{\partial a^{l}_{j}} \sigma'(z^{l}_{j})
    $$
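- `Problem for the previous layers`: $\Large \frac{\partial C}{\partial a^{l - 1}_{j}}, \frac{\partial C}{\partial a^{l - 2}_{j}},...$
    - At the output layer, $\frac{\partial C}{\partial a^{L}_{j}} = a^{L}_{j} - y_{j}$ follows directly from the cost function, so $\delta^{L}_{j} = (a^{L}_{j} - y_{j})\,\sigma'(z^{L}_{j})$.
    - For the earlier layers we don't have a direct relation between C and $a^{l - 1}_{j}$, so we need to find $\large \delta^{l-1} = f(\delta^{l})$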
- `Compute` $\large \delta^{l - 1}_{j}$
    - We have
    $$
    \large
    \delta^{l - 1}_{j} = \frac{\partial C}{\partial z^{l - 1}_{j}}
    $$
    - Neuron j of layer l - 1 influences all neurons in layer l, so by the chain rule we sum over every neuron k of layer l:
    $$
    \delta^{l - 1}_{j} = \frac{\partial C}{\partial z^{l - 1}_{j}} = \frac{\partial C}{\partial z^{l}_{1}} \frac{\partial z^{l}_{1}}{\partial z^{l - 1}_{j}} + \frac{\partial C}{\partial z^{l}_{2}} \frac{\partial z^{l}_{2}}{\partial z^{l - 1}_{j}} + ... + \frac{\partial C}{\partial z^{l}_{k}} \frac{\partial z^{l}_{k}}{\partial z^{l - 1}_{j}} = \sum_{k} \frac{\partial C}{\partial z^{l}_{k}} \frac{\partial z^{l}_{k}}{\partial z^{l - 1}_{j}}
    $$
    $$
    \large
    \Rightarrow
    \delta^{l - 1}_{j} = \sum_{k} \delta^{l}_{k} \frac{\partial z^{l}_{k}}{\partial z^{l - 1}_{j}}
    $$
    And
    $$
    \large
    z^{l}_{k} = \sum_{j} w^{l}_{kj}.a^{l - 1}_{j} + b^{l}_{k} = \sum_{j} w^{l}_{kj}.\sigma(z^{l - 1}_{j}) + b^{l}_{k}
    \Rightarrow \frac{\partial z^{l}_{k}}{\partial z^{l - 1}_{j}} = w^{l}_{kj}.\sigma'(z^{l - 1}_{j})
    $$
    Thus
    $$
    \large
    \delta^{l - 1}_{j} = \sum_{k} \delta^{l}_{k} w^{l}_{kj}.\sigma'(z^{l - 1}_{j})
    $$
    $$
    \large
    \Leftrightarrow \delta^{l - 1}_{j} = [\sum_{k} w^{l}_{kj}.\delta^{l}_{k}] \sigma'(z^{l - 1}_{j})
    $$
    For the (l - 1) layer
    $$
    \large
    \begin{bmatrix}
    \delta^{l - 1}_{1} \\
    \delta^{l - 1}_{2} \\
    \vdots \\
    \delta^{l - 1}_{j}
    \end{bmatrix} =
    \begin{bmatrix}
    \sum_{k} w^{l}_{k1} . \delta^{l}_{k} \\
    \sum_{k} w^{l}_{k2} . \delta^{l}_{k} \\
    \vdots \\
    \sum_{k} w^{l}_{kj} . \delta^{l}_{k} \\
    \end{bmatrix}
    \;\odot\;
    \begin{bmatrix}
    \sigma'(z^{l - 1}_{1}) \\
    \sigma'(z^{l - 1}_{2}) \\
    \vdots \\
    \sigma'(z^{l - 1}_{j}) \\
    \end{bmatrix}
    $$
    $$
    \large
    \Leftrightarrow
    \begin{bmatrix}
    \delta^{l - 1}_{1} \\
    \delta^{l - 1}_{2} \\
    \vdots \\
    \delta^{l - 1}_{j}
    \end{bmatrix} =
    \begin{bmatrix}
    w^{l}_{11} . \delta^{l}_{1} + w^{l}_{21} . \delta^{l}_{2} + ... + w^{l}_{k1} . \delta^{l}_{k} \\
    w^{l}_{12} . \delta^{l}_{1} + w^{l}_{22} . \delta^{l}_{2} + ... + w^{l}_{k2} . \delta^{l}_{k} \\
    \vdots \\
    w^{l}_{1j} . \delta^{l}_{1} + w^{l}_{2j} . \delta^{l}_{2} + ... + w^{l}_{kj} . \delta^{l}_{k}
    \end{bmatrix}
    \;\odot\;
    \begin{bmatrix}
    \sigma'(z^{l - 1}_{1}) \\
    \sigma'(z^{l - 1}_{2}) \\
    \vdots \\
    \sigma'(z^{l - 1}_{j}) \\
    \end{bmatrix}
    $$
    $$
    \large
    \Leftrightarrow
    \begin{bmatrix}
    \delta^{l - 1}_{1} \\
    \delta^{l - 1}_{2} \\
    \vdots \\
    \delta^{l - 1}_{j}
    \end{bmatrix} =
    \begin{bmatrix}
    w^{l}_{11} & w^{l}_{21} & ... & w^{l}_{k1} \\
    w^{l}_{12} & w^{l}_{22} & ... & w^{l}_{k2}\\
    \vdots & \vdots & \ddots & \vdots \\
    w^{l}_{1j} & w^{l}_{2j} & ... & w^{l}_{kj}
    \end{bmatrix}
    \begin{bmatrix}
    \delta^{l}_{1} \\
    \delta^{l}_{2} \\
    \vdots \\
    \delta^{l}_{k}
    \end{bmatrix}
    \;\odot\;
    \begin{bmatrix}
    \sigma'(z^{l - 1}_{1}) \\
    \sigma'(z^{l - 1}_{2}) \\
    \vdots \\
    \sigma'(z^{l - 1}_{j}) \\
    \end{bmatrix}
    $$
    $$
    \large
    \Leftrightarrow
    \delta^{l - 1} = (W^{l})^{T} \delta^{l} \odot \sigma'(z^{l - 1})
    $$
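
The recursion $\delta^{l - 1} = (W^{l})^{T} \delta^{l} \odot \sigma'(z^{l - 1})$ can be sketched in NumPy and compared against a finite-difference estimate. This is a minimal sketch for a single sample, assuming sigmoid activations, the cost $C = \frac{1}{2}(y - a^{L})^2$, and the output-layer error $\delta^{L} = (a^{L} - y) \odot \sigma'(z^{L})$; the layer sizes and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(x, y, weights, biases):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

def backprop(x, y, weights, biases):
    """Return dC/dW^l and dC/db^l for every layer, for one sample."""
    # Forward pass, storing z^l and a^l (activations[0] is a^0 = x).
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    # Output-layer error delta^L.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # weights[i] is W^{i+1}; its input is activations[i] = a^i.
    for i in range(len(weights) - 1, -1, -1):
        grads_W[i] = delta @ activations[i].T
        grads_b[i] = delta
        if i > 0:
            # delta^{l-1} = (W^l)^T delta^l * sigma'(z^{l-1})
            delta = (weights[i].T @ delta) * sigmoid_prime(zs[i - 1])
    return grads_W, grads_b

rng = np.random.default_rng(0)   # arbitrary seed
sizes = [3, 4, 2]                # illustrative layer sizes
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((3, 1))
y = np.array([[1.0], [0.0]])

grads_W, grads_b = backprop(x, y, weights, biases)

# Central finite difference for one first-layer weight.
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_plus[0][0, 0] += eps
w_minus = [W.copy() for W in weights]
w_minus[0][0, 0] -= eps
numeric = (cost(x, y, w_plus, biases) - cost(x, y, w_minus, biases)) / (2 * eps)
print("backprop:", grads_W[0][0, 0], "numeric:", numeric)
```

Agreement between the two values is a quick sanity check that the index conventions ($W^{l}$ versus its transpose) were applied consistently.
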
#### `Step 5: Gradient Descent`
- `Gradient`:
    $$
    \large
    \begin{cases}
    \frac{\partial C}{\partial w^{l}_{jk}} = \delta^{l}_{j} a^{l - 1}_{k} \\
    \frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j}
    \end{cases}
    \Rightarrow
    \begin{cases}
    \frac{\partial C}{\partial W^{l}} = \delta^{l} (a^{l - 1})^T \\
    \frac{\partial C}{\partial b^{l}} = \delta^{l}
    \end{cases}
    $$
- `Update`:
    $$
    \large
    \begin{cases}
    W^{l} := W^{l} - \alpha \frac{\partial C}{\partial W^{l}} = W^{l} - \alpha \delta^{l}(a^{l - 1})^T \\
    b^{l} := b^{l} - \alpha \frac{\partial C}{\partial b^{l}} = b^{l} - \alpha \delta^{l}
    \end{cases}
    $$
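
Putting the steps together, one training iteration can be sketched as a forward pass, a backward pass, and the update rule above. This is a minimal single-sample sketch, assuming sigmoid activations (so $\sigma'(z) = a(1 - a)$) and the cost $C = \frac{1}{2}(y - a^{L})^2$; the learning rate `alpha=0.5`, the layer sizes, and the function name `train_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, weights, biases, alpha):
    """Run one gradient-descent step in place and return the cost before it."""
    # Forward pass (activations[0] is a^0 = x).
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    a_out = activations[-1]
    cost = 0.5 * np.sum((y - a_out) ** 2)

    # Output-layer error, using sigma'(z) = a * (1 - a) for the sigmoid.
    delta = (a_out - y) * a_out * (1.0 - a_out)
    for i in range(len(weights) - 1, -1, -1):
        grad_W = delta @ activations[i].T
        grad_b = delta
        if i > 0:
            # Propagate the error with the old weights, before updating them.
            a_prev = activations[i]
            delta = (weights[i].T @ delta) * a_prev * (1.0 - a_prev)
        weights[i] -= alpha * grad_W   # W := W - alpha * dC/dW
        biases[i] -= alpha * grad_b    # b := b - alpha * dC/db
    return cost

rng = np.random.default_rng(0)   # arbitrary seed
sizes = [2, 3, 1]                # illustrative layer sizes
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((m, 1)) for m in sizes[1:]]
x = np.array([[0.5], [-0.3]])
y = np.array([[1.0]])

costs = [train_step(x, y, weights, biases, alpha=0.5) for _ in range(200)]
print("first cost:", costs[0], "last cost:", costs[-1])
```

For a dataset with more than one sample, a common choice is to average the per-sample gradients over a batch before applying the update.
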