## Variable Definitions
- Let $i, j, k$ index the units of the input, hidden, and output layers, respectively
- $n = 784$ dimension of input
- $m$ = dimension of hidden layer
- $l = 10$ dimension of output
- $X$ will be our training data input
- $Z^L$ is the unactivated value at layer $L$
- $A^L$ is the activated value at layer $L$
- $L$ takes the values $0, 1$, where $0$ denotes the hidden layer and $1$ the output layer.
- $\sigma$ will represent our activation function
## Weights and biases
- $w^0$ will be a linear map $w^0: \mathbb{R}^n \to \mathbb{R}^m$
- $w^1$ will be a linear map $w^1: \mathbb{R}^m \to \mathbb{R}^l$ 
- $b^{0}$ will be a vector of size $m$
- $b^1$ will be a vector of size $l$


## Forward Propagation

Begin with input $X \in \mathbb{R}^n$.
$$
\begin{aligned}
  Z^0_j &= \sum_{i=1}^{n} w^0_{ji} X_i + b^0_j \iff Z^0 = w^0 X + b^0
  \\
  A^0 &= \sigma (Z^0)
  \\
  Z^1_k &= \sum_{j=1}^{m} w^1_{kj} A^0_j + b^1_k \iff Z^1 = w^1 A^0 + b^1
  \\
  A^1 &= \text{softmax} (Z^1)
\end{aligned}
$$

$P = A^1$ is the network's prediction. For the $i$th training example the computation flows as follows, with $P$ finally compared against the label $Y_i$:

$$
i \text{th example}: X \rightarrow Z^0 \rightarrow A^0 \rightarrow Z^1 \rightarrow A^1 = P \rightarrow Y_i
$$
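
The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the notes do not fix $\sigma$ or $m$, so the sigmoid activation, the value `m = 16`, the random initialisation, and all function names here are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    # Assumed choice for the activation sigma; the notes leave it unspecified.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward(X, w0, b0, w1, b1):
    """Return (Z0, A0, Z1, P) for a single input vector X."""
    Z0 = w0 @ X + b0      # unactivated hidden layer, shape (m,)
    A0 = sigmoid(Z0)      # activated hidden layer, shape (m,)
    Z1 = w1 @ A0 + b1     # unactivated output layer, shape (l,)
    P = softmax(Z1)       # prediction A^1, shape (l,)
    return Z0, A0, Z1, P

n, m, l = 784, 16, 10     # m = 16 is an arbitrary choice for illustration
rng = np.random.default_rng(0)
w0 = rng.normal(scale=0.01, size=(m, n))
b0 = np.zeros(m)
w1 = rng.normal(scale=0.01, size=(l, m))
b1 = np.zeros(l)

X = rng.random(n)         # stand-in for one flattened 28x28 image
Z0, A0, Z1, P = forward(X, w0, b0, w1, b1)
```

Storing $w^0$ with shape `(m, n)` and $w^1$ with shape `(l, m)` makes the matrix forms $Z^0 = w^0 X + b^0$ and $Z^1 = w^1 A^0 + b^1$ map directly onto the `@` operator.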


## Back Propagation

Define the error of a single example as half the squared difference between the prediction and the label, and the total error $\xi$ as the mean over all training examples.
$$
\begin{aligned}
  \xi &= \frac{1}{N} \sum_{l=1}^{N} E_l
  \\
  E_l &= \frac{1}{2} \sum_k (P^l_k - Y^l_k) ^2
\end{aligned}
$$

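where
- $N$ is the number of training examples.
- $l$ indexes training examples here (distinct from the output dimension $l = 10$).
- $P^l$ is the prediction for the $l$th training example.
- $Y^l$ is the label for the $l$th training example.
- $E_l$ is the error of the $l$th training example.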
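
As a small worked check of the definitions above, the sketch below evaluates $E_l$ and $\xi$ on two made-up examples. It uses three output classes instead of ten purely to keep the numbers readable; the function name `mean_error` and the sample values are hypothetical.

```python
import numpy as np

def mean_error(P_all, Y_all):
    """Mean over examples of E_l = 0.5 * sum_k (P_k - Y_k)^2.

    P_all and Y_all have shape (N, number_of_classes), one row per example.
    """
    per_example = 0.5 * np.sum((P_all - Y_all) ** 2, axis=1)
    return per_example.mean()

# Two toy predictions with one-hot labels (illustrative values only).
P_all = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
Y_all = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])

xi = mean_error(P_all, Y_all)
# E_1 = 0.5 * (0.09 + 0.04 + 0.01) = 0.07
# E_2 = 0.5 * (0.01 + 0.01 + 0.04) = 0.03
# xi  = (0.07 + 0.03) / 2 = 0.05
```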

## Common Terms
Fix a single training example and write $E = E_l$ for some $1 \leq l \leq N$. For any variable $V$ we have

$$
  \frac{\partial E}{\partial V} 
  = 
  \sum_k 
    \frac{\partial E}{\partial P_k} 
    \frac{\partial P_k}{\partial Z^1_k} 
    \frac{\partial Z^1_k}{\partial V}
$$

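The first two factors do not depend on $V$, so calculate them once. Note that the sum above keeps only the diagonal term $\partial P_k / \partial Z^1_k$ of the softmax Jacobian; the cross terms $\partial P_k / \partial Z^1_{k'} = -P_k P_{k'}$ for $k' \neq k$ are not included in this derivation.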

$$
\begin{aligned}
  \frac{\partial E}{\partial P_k}
    &= (P_k - Y_k)
  \\
  \frac{\partial P_k}{\partial Z^1_k}
    &= \frac{\partial}{\partial Z^1_k} \text{softmax} (Z^1)_k
    = \frac{\partial}{\partial Z^1_k} \frac{e^{Z^1_k}}{\sum_l e^{Z^1_l}}
  \\
  &= e^{Z^1_k}
    \bigg (
        \frac{\sum_l e^{Z^1_l} - e^{Z^1_k}}{
          \big (
            \sum_l e^{Z^1_l}
          \big ) ^2
        }
    \bigg )
  \\
  &= P_k ( 1 - P_k)
\end{aligned}
$$
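
The diagonal softmax derivative $\partial P_k / \partial Z^1_k = P_k(1 - P_k)$ can be sanity-checked numerically. The sketch below compares it against a central finite difference; the test vector, the step size `h`, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

Z1 = np.array([0.5, -1.0, 2.0, 0.0])   # arbitrary pre-activation values
P = softmax(Z1)

# Closed form for the diagonal of the softmax Jacobian.
analytic = P * (1.0 - P)

# Central finite difference: perturb only component k and watch P_k.
h = 1e-6
numeric = np.empty_like(Z1)
for k in range(Z1.size):
    step = np.zeros_like(Z1)
    step[k] = h
    numeric[k] = (softmax(Z1 + step)[k] - softmax(Z1 - step)[k]) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the sign inside the bracket were flipped to $P_k(1 + P_k)$, the two arrays would disagree visibly, which makes this a cheap guard against sign slips in the derivation.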

## Bias $b^0$

$$
\begin{aligned}
  \frac{\partial Z^1_k}{\partial b^0_j} &= \frac{\partial}{\partial b^0_j}
  \bigg (
      \sum_l w^1_{kl} \sigma ( Z^0_l ) + b^1_k
  \bigg )
  \\
  &= \sum_l w^1_{kl} \sigma'( Z^0_l ) \frac{\partial}{\partial b^0_j}
  \bigg (
    \sum_i w^0_{li} X_i + b^0_l
  \bigg)
  \\
  &= \sum_l w^1_{kl} \sigma'( Z^0_l ) \delta_{lj}
  \\
  &= w^1_{kj} \sigma'(Z^0_j)
\end{aligned}
$$

Update rule for $b^0$, with step size $\epsilon$ and $\odot$ denoting the elementwise product:

$$
\begin{aligned}
  \Delta b^0_j &= - \epsilon \cdot  \sigma'(Z^0_j) \cdot \sum_k w^1_{kj} (P_k - Y_k) P_k (1 - P_k)
  \\
  \Delta b^0 &= - \epsilon \cdot \sigma'(Z^0) \odot \Big( (w^1)^T \big[ (P - Y) \odot P \odot (1 - P) \big] \Big)
\end{aligned}
$$
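
To see that the index form and the vector form of the $b^0$ update agree, the sketch below computes both on small random data, using the softmax factor $P_k(1 - P_k)$. The sigmoid activation, the sizes `m = 5` and `l = 3`, the fixed values of `P` and `Y`, and the step size `eps` are all assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the assumed sigmoid activation.
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
m, l = 5, 3                       # small sizes for illustration
Z0 = rng.normal(size=m)           # hidden pre-activations
w1 = rng.normal(size=(l, m))      # output-layer weights
P = np.array([0.2, 0.5, 0.3])     # a prediction (sums to 1)
Y = np.array([0.0, 1.0, 0.0])     # one-hot label
eps = 0.1

# Index form: one entry of Delta b^0 at a time.
delta_b0_loop = np.zeros(m)
for j in range(m):
    total = 0.0
    for k in range(l):
        total += w1[k, j] * (P[k] - Y[k]) * P[k] * (1.0 - P[k])
    delta_b0_loop[j] = -eps * sigmoid_prime(Z0[j]) * total

# Vector form: elementwise products plus one matrix-vector product.
delta1 = (P - Y) * P * (1.0 - P)
delta_b0_vec = -eps * sigmoid_prime(Z0) * (w1.T @ delta1)
```

In NumPy, `*` between arrays of equal shape is the elementwise product, while `@` is the matrix product, which is why the two kinds of "$\cdot$" in the formula need to be kept apart.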


## Bias $b^1$

$$
\begin{aligned}
  \frac{\partial Z^1_k}{\partial b^1_l} &= \frac{\partial}{\partial b^1_l}
  \bigg (
      \sum_m w^1_{km} \sigma ( Z^0_m ) + b^1_k
  \bigg )
  \\
  & =  \delta_{kl}
\end{aligned}
$$

Update rule for $b^1$

$$
\begin{aligned}
  \Delta b^1_l &= - \epsilon  \cdot \sum_k (P_k - Y_k) P_k  (1 - P_k) \delta_{kl}
  \\
  &= - \epsilon \cdot (P_l - Y_l) P_l  (1 - P_l)
\end{aligned}
$$


## Weights $w^1$

$$
\begin{aligned}
  \frac{ \partial Z^1_l }{ \partial w^1_{kj} }
    &=
    \frac{\partial}{\partial w^1_{kj}}
    \bigg(
      \sum_m w^1_{lm} \sigma ( Z^0_m ) + b^1_l
    \bigg)
  \\
  &= \sum_m \sigma(Z^0_m) \delta_{kl} \delta_{jm} = \delta_{kl} \sigma ( Z^0_j )
\end{aligned}
$$

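Update rule for $w^1$, where $\odot$ is the elementwise product and $\otimes$ is the outer product, $(u \otimes v)_{kj} = u_k v_j$:

$$
\begin{aligned}
  \Delta w^1_{kj} &= - \epsilon \sum_l (P_l - Y_l) P_l (1 - P_l) \delta_{kl} \sigma ( Z^0_j )
  \\
  &= - \epsilon \cdot (P_k - Y_k) P_k (1 - P_k) \sigma ( Z^0_j )
  \\
  \Delta w^1 &= - \epsilon \cdot \big[ (P - Y) \odot P \odot (1 - P) \big] \otimes \sigma (Z^0)
\end{aligned}
$$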
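
The $w^1$ update is an outer product between the output-layer term and the hidden activations. The sketch below shows this with `np.outer`, using the softmax factor $P_k(1 - P_k)$; the sample values and the names `delta1` and `delta_w1` are hypothetical.

```python
import numpy as np

A0 = np.array([0.1, 0.9, 0.5, 0.3])   # sigma(Z^0): hidden activations, m = 4
P = np.array([0.2, 0.5, 0.3])         # prediction, l = 3
Y = np.array([0.0, 1.0, 0.0])         # one-hot label
eps = 0.1

# Output-layer term (P_k - Y_k) * P_k * (1 - P_k), one entry per output unit.
delta1 = (P - Y) * P * (1.0 - P)

# Entry [k, j] is -eps * delta1[k] * A0[j], matching the shape (l, m) of w^1.
delta_w1 = -eps * np.outer(delta1, A0)

# Worked entry: k = 1, j = 2
# delta1[1] = (0.5 - 1.0) * 0.5 * 0.5 = -0.125
# delta_w1[1, 2] = -0.1 * (-0.125) * 0.5 = 0.00625
```

The argument order of `np.outer` matters: putting the output-layer term first yields a matrix indexed `[k, j]`, the same layout as $w^1_{kj}$, so the update can be added to the weights without a transpose.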




## Weights $w^0$

$$
\begin{aligned}
  \frac{\partial Z^1_k}{\partial w^0_{ji}}
    &= \frac{ \partial }{ \partial w^0_{ji} }
      \bigg (
        \sum_l w^1_{kl} \sigma (Z^0_l) + b^1_k
      \bigg)
    \\
    &= \sum_l w^1_{kl} \cdot \frac{ \partial }{ \partial w^0_{ji} } \sigma \big( Z^0_l \big )
    \\
    &= \sum_l w^1_{kl} \cdot \sigma'(Z^0_l) \cdot \frac{ \partial }{ \partial w^0_{ji} }
          \bigg (
            \sum_{m} w^0_{lm} X_m + b^0_l
          \bigg )
    \\
    &= \sum_l w^1_{kl} \cdot \sigma'(Z^0_l) \cdot \sum_m \delta_{lj} \delta_{mi} X_m
    \\
    &= \sum_l w^1_{kl} \cdot \sigma'(Z^0_l) \cdot \delta_{lj} X_i
    \\
    &= w^1_{kj} \cdot \sigma'(Z^0_j) \cdot X_i
\end{aligned}
$$



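Update rule for $w^0$, where $\odot$ is the elementwise product and $\otimes$ is the outer product, $(u \otimes v)_{ji} = u_j v_i$:

$$
\begin{aligned}
  \Delta w^0_{ji}
  &=
  - \epsilon \cdot X_i \cdot \sigma'(Z^0_j) \cdot \sum_k
    (P_k - Y_k)
    P_k
    (1 - P_k)
    \cdot
    w^1_{kj}
  \\
  \Delta w^0 &= - \epsilon \cdot
    \Big[
      \sigma'(Z^0) \odot \big( (w^1)^T \big[ (P - Y) \odot P \odot (1 - P) \big] \big)
    \Big]
    \otimes X
\end{aligned}
$$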
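
Putting the four update rules together, one gradient step on a single example might look like the sketch below. It is only an illustration under stated assumptions: sigmoid is assumed for $\sigma$, the softmax factor is taken as $P_k(1 - P_k)$ with only the diagonal Jacobian term kept as in the derivation, and the function name `sgd_step`, the layer sizes, and the step size are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sgd_step(X, Y, w0, b0, w1, b1, eps):
    """Apply the four update rules once and return the new parameters."""
    # Forward pass.
    Z0 = w0 @ X + b0
    A0 = sigmoid(Z0)
    P = softmax(w1 @ A0 + b1)

    # Output-layer term (P - Y) * P * (1 - P), shape (l,).
    delta1 = (P - Y) * P * (1.0 - P)
    # Hidden-layer term sigma'(Z0) * (w1^T delta1), shape (m,).
    # For the sigmoid, sigma'(Z0) = A0 * (1 - A0).
    delta0 = A0 * (1.0 - A0) * (w1.T @ delta1)

    new_w1 = w1 - eps * np.outer(delta1, A0)   # shape (l, m)
    new_b1 = b1 - eps * delta1                 # shape (l,)
    new_w0 = w0 - eps * np.outer(delta0, X)    # shape (m, n)
    new_b0 = b0 - eps * delta0                 # shape (m,)
    return new_w0, new_b0, new_w1, new_b1

# Small sizes so the result is easy to inspect by hand.
rng = np.random.default_rng(2)
n, m, l = 6, 4, 3
w0 = rng.normal(size=(m, n))
b0 = rng.normal(size=m)
w1 = rng.normal(size=(l, m))
b1 = rng.normal(size=l)
X = rng.random(n)
Y = np.array([0.0, 1.0, 0.0])

new_w0, new_b0, new_w1, new_b1 = sgd_step(X, Y, w0, b0, w1, b1, eps=0.5)
```

Both `delta1` and `delta0` are computed from the old weights before any parameter is overwritten, so the hidden-layer update uses the same $w^1$ that produced the prediction.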
