## Variable Definitions
- Let $i, j, k$ index the units of the input, hidden and output layers respectively
- $n = 784$ dimension of input
- $m$: dimension of hidden layer
- $l = 10$ dimension of output
- $X$ will be our training data input
- $Z^L$ is the unactivated value at layer $L$
- $A^L$ is the activated value at layer $L$
- $L$ takes the values $0, 1$, where $0$ denotes the hidden layer and $1$ the output layer.
- $\sigma$ will represent our activation function
## Weights and biases
- $w^0$ will be a linear map $w^0: \mathbb{R}^n \to \mathbb{R}^m$
- $w^1$ will be a linear map $w^1: \mathbb{R}^m \to \mathbb{R}^l$ 
- $b^{0}$ will be a vector of size $m$
- $b^1$ will be a vector of size $l$


## Forward Propagation

Begin with input $X \in \mathbb{R}^n$.
$$
\begin{aligned}
  Z^0_j &=
      \sum_{i=1}^{n} w^0_{ji} X_i + b^0_j  \iff Z^0 = w^0 X + b^0
  \\
  A^0 &= \sigma (Z^0)
  \\
  Z^1_k &= \sum_{j=1}^{m} w^1_{kj} A^0_j + b^1_k \iff Z^1 = w^1 A^0 + b^1
  \\
  A^1 &= \text{softmax} (Z^1)
\end{aligned}
$$

$P = A^1$ is our prediction. We can visualize the flow of values as follows, ending with the comparison against the label $Y_i$:

$$
i \text{th example}: X \rightarrow Z^0 \rightarrow A^0 \rightarrow Z^1 \rightarrow A^1 = P \rightarrow Y_i
$$
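
The forward pass above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`affine`, `forward`), the tiny layer sizes, and the choice of a sigmoid for $\sigma$ are all assumptions made for the example, since the text leaves $\sigma$ generic.

```python
import math

def sigmoid(z):
    # Assumed choice of activation; the derivation keeps sigma generic.
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(w, x, b):
    # Computes w x + b, with w stored as a list of rows (w[j][i] = w_ji).
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
            for row, b_j in zip(w, b)]

def forward(x, w0, b0, w1, b1):
    z0 = affine(w0, x, b0)   # Z^0 = w^0 X + b^0
    a0 = sigmoid(z0)         # A^0 = sigma(Z^0)
    z1 = affine(w1, a0, b1)  # Z^1 = w^1 A^0 + b^1
    p = softmax(z1)          # P = A^1 = softmax(Z^1)
    return z0, a0, z1, p

# Toy sizes (n = 3, m = 2, l = 2) instead of 784 and 10, for readability.
x = [1.0, 0.5, -1.0]
w0 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
b0 = [0.0, 0.1]
w1 = [[0.5, -0.5], [-0.3, 0.2]]
b1 = [0.0, 0.0]
z0, a0, z1, p = forward(x, w0, b0, w1, b1)
```

The returned tuple keeps the intermediate values $Z^0$, $A^0$ and $Z^1$ because the backward pass needs them.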


## Back Propagation

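Define the error of one example as half the squared difference between the prediction and the labelled value, and the total error as the average over the training set. In this section $l$ indexes training examples, not the output dimension.
$$
\begin{aligned}
  \xi &= \frac{1}{N} \sum_{l=1}^{N} E_l
  \\
  E_l &= \frac{1}{2} \sum_k (P^l_k - Y^l_k) ^2 
\end{aligned}
$$

Where 
- $N$ is the number of training examples. 
- $P^l$ is the prediction for the $l$th training example.
- $Y^l$ is the label for the $l$th training example.
- $E_l$ is the error of the $l$th training example.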
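
As a small worked example of the two formulas above, the sketch below computes $E_l$ and $\xi$ for two made-up predictions. The function names and the one-hot labels are assumptions for illustration only.

```python
def example_error(p, y):
    # E_l = 1/2 * sum_k (P_k - Y_k)^2 for a single training example
    return 0.5 * sum((p_k - y_k) ** 2 for p_k, y_k in zip(p, y))

def mean_error(predictions, labels):
    # xi = (1/N) * sum_l E_l over all N training examples
    errors = [example_error(p, y) for p, y in zip(predictions, labels)]
    return sum(errors) / len(errors)

# Two hypothetical examples with three classes each (labels assumed one-hot).
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
xi = mean_error(predictions, labels)  # per-example errors are 0.07 and 0.03
```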

## Common Terms
Fix a training example and write $E = E_l$ for some $1 \leq l \leq N$. For any parameter $V$ (a weight or a bias) we have
$$
  \frac{\partial E}{\partial V} 
  = 
  \sum_k 
    \frac{\partial E}{\partial P_k} 
    \frac{\partial P_k}{\partial Z^1_k} 
    \frac{\partial Z^1_k}{\partial V}
$$

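Calculate the first two factors, which are common to every $V$. For the softmax factor only the diagonal entry $\partial P_k / \partial Z^1_k$ is computed, matching the single sum over $k$ above; the off-diagonal entries $\partial P_k / \partial Z^1_{k'} = -P_k P_{k'}$ for $k' \neq k$ are not included.
$$
\begin{aligned}
  \frac{\partial E}{\partial P_k} 
    &= (P_k - Y_k)
  \\
  \frac{\partial P_k}{\partial Z^1_k} 
    &= \frac{\partial}{\partial Z^1_k} \text{softmax} (Z^1)_k
  \\
  & = e^{Z^1_k} 
    \bigg ( 
        \frac{\sum_l e^{Z^1_l} - e^{Z^1_k}}{
          \big (
            \sum_l e^{Z^1_l}
          \big ) ^2
        }
    \bigg ) 
  \\
  &= P_k ( 1 - P_k)
\end{aligned}
$$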
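
The diagonal softmax derivative can be checked numerically. The sketch below compares the closed form $P_k(1 - P_k)$ against a central finite difference; the helper name `softmax_diag_numeric`, the step size `h`, and the test vector are assumptions chosen for the illustration.

```python
import math

def softmax(z):
    shift = max(z)  # shift for numerical stability; the output is unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_diag_numeric(z, k, h=1e-6):
    # Central difference approximation of dP_k / dZ_k.
    up = list(z)
    up[k] += h
    down = list(z)
    down[k] -= h
    return (softmax(up)[k] - softmax(down)[k]) / (2 * h)

z = [0.2, -1.0, 0.7]
p = softmax(z)
analytic = [p_k * (1.0 - p_k) for p_k in p]
numeric = [softmax_diag_numeric(z, k) for k in range(len(z))]
```

The two lists agree to within the finite-difference error, whereas $P_k(1 + P_k)$ would not, so the check is a quick way to catch a sign slip in this factor.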

## Bias $b^0$

$$
\begin{align}
  \frac{\partial Z^1_k}{\partial b^0_j} &= \frac{\partial}{\partial b^0_j}
  \bigg (
      \sum_l w^1_{kl} \sigma ( Z^0_l ) + b^1_k
  \bigg )
  \\
  &= \sum_l w^1_{kl} \sigma'( Z^0_l ) \frac{\partial}{\partial b^0_j}
  \bigg (
    \sum_i w^0_{li} X_i + b^0_l
  \bigg)
  \\
  &= \sum_l w^1_{kl} \sigma'( Z^0_l ) \delta_{lj}
  \\
  &= w^1_{kj} \sigma'(Z^0_j)
\end{align}
$$
Update rule for $b^0$, with step size $\epsilon$ and $\odot$ denoting the elementwise product:
$$
\begin{align}
  \Delta b^0_j &= - \epsilon \cdot  \sigma'(Z^0_j) \cdot \sum_k w^1_{kj} (P_k - Y_k)P_k (1 - P_k) 
  \\
  \Delta b^0 &= - \epsilon \cdot \sigma'(Z^0) \odot \Big( (w^1)^T \big[ (P - Y) \odot P \odot (1 - P) \big] \Big)
\end{align}
$$
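
A minimal sketch of the $b^0$ step follows, using the diagonal softmax derivative $P_k(1 - P_k)$. The function names are hypothetical, and the values of $\sigma'(Z^0)$ are passed in as plain numbers because the text keeps $\sigma$ generic.

```python
def output_delta(p, y):
    # (P_k - Y_k) * P_k * (1 - P_k): error term at the output layer
    return [(p_k - y_k) * p_k * (1.0 - p_k) for p_k, y_k in zip(p, y)]

def hidden_delta(w1, delta1, sigma_prime_z0):
    # sigma'(Z^0_j) * sum_k w^1_{kj} * delta1_k, with w1[k][j] = w^1_{kj}
    return [sp_j * sum(w1[k][j] * delta1[k] for k in range(len(delta1)))
            for j, sp_j in enumerate(sigma_prime_z0)]

def bias0_step(w1, p, y, sigma_prime_z0, eps):
    # Delta b^0_j = -eps * hidden_delta_j
    delta0 = hidden_delta(w1, output_delta(p, y), sigma_prime_z0)
    return [-eps * d for d in delta0]

# Toy values: two hidden units, two outputs.
w1 = [[1.0, 2.0], [3.0, 4.0]]
p = [0.5, 0.5]
y = [1.0, 0.0]
sigma_prime_z0 = [0.25, 0.2]
db0 = bias0_step(w1, p, y, sigma_prime_z0, eps=0.1)
```

Splitting the computation into an output term and a hidden term mirrors the vector form: the inner list is $(w^1)^T$ applied to the output term, followed by an elementwise product with $\sigma'(Z^0)$.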


## Bias $b^1$

$$
\begin{align}
  \frac{\partial Z^1_k}{\partial b^1_l} &= \frac{\partial}{\partial b^1_l}
  \bigg (
      \sum_j w^1_{kj} \sigma ( Z^0_j ) + b^1_k
  \bigg )
  \\
  & =  \delta_{kl}
\end{align}
$$
Update rule for $b^1$
$$
\begin{align}
  \Delta b^1_l &= - \epsilon  \cdot \sum_k (P_k - Y_k) P_k  (1 - P_k) \delta_{kl}
  \\
  &= - \epsilon \cdot (P_l - Y_l) P_l  (1 - P_l)
\end{align}
$$


## Weights $w^1$

$$
\begin{align}
  \frac{ \partial Z^1_k }{ \partial w^1_{lj} } 
    &= 
    \frac{\partial}{\partial w^1_{lj}} 
    \bigg( 
      \sum_{j'} w^1_{kj'} \sigma ( Z^0_{j'} ) + b^1_k
    \bigg)
  \\
&= \sum_{j'} \sigma(Z^0_{j'}) \delta_{kl} \delta_{jj'} = \delta_{kl} \sigma ( Z^0_j ) = \delta_{kl} A^0_j
\end{align}
$$
Where $\delta$ denotes the Kronecker delta. Increment the hidden-to-output weights in proportion to the negative derivative of the error, where $\epsilon$ is the proportionality factor (step size):
$$
\begin{align}
  \Delta w^1_{lj} &= - \epsilon \sum_k (P_k - Y_k) P_k (1 - P_k) \delta_{kl} A^0_j
  \\
  &= - \epsilon \cdot (P_l - Y_l) P_l (1 - P_l) A^0_j
\end{align}
$$
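
The output-layer steps can be sketched together, since the weight step is the bias step multiplied by the hidden activation $A^0_j$. The function name `output_layer_steps` and the toy values are assumptions; the softmax factor is taken as $P_k(1 - P_k)$.

```python
def output_layer_steps(p, y, a0, eps):
    # Output error term: (P_l - Y_l) * P_l * (1 - P_l)
    delta1 = [(p_l - y_l) * p_l * (1.0 - p_l) for p_l, y_l in zip(p, y)]
    # Bias step: Delta b^1_l = -eps * delta1_l
    db1 = [-eps * d for d in delta1]
    # Weight step: Delta w^1_{lj} = -eps * delta1_l * A^0_j (an outer product)
    dw1 = [[-eps * d * a_j for a_j in a0] for d in delta1]
    return dw1, db1

# Toy values: three hidden units, two outputs.
p = [0.5, 0.5]
y = [1.0, 0.0]
a0 = [0.2, 0.4, 0.8]
dw1, db1 = output_layer_steps(p, y, a0, eps=0.1)
```

The result `dw1` has one row per output unit and one column per hidden unit, the same shape as $w^1: \mathbb{R}^m \to \mathbb{R}^l$ stored as a matrix.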




## Weights $w^0$
$$
  \frac{ \partial E }{ \partial w^0_{ji} } = 
  \sum_k
    \frac{ \partial E }{ \partial P_k } 
    \frac{ \partial P_k }{ \partial Z^1_k } 
    \frac{ \partial Z^1_k }{ \partial w^0_{ji} }
$$

Only the last factor is different. Keeping $\sigma$ generic:
$$
\begin{align}
  \frac{\partial Z^1_k}{\partial w^0_{ji}} 
    &= \frac{ \partial }{ \partial w^0_{ji} } 
      \bigg (
        \sum_l w^1_{kl} \sigma (Z^0_l) + b^1_k
      \bigg)
    \\
     &= \sum_l 
      w^1_{kl} 
      \cdot
      \frac{ \partial }{ \partial w^0_{ji} }
        \sigma 
          \big( 
            Z^0_l
          \big ) 
    \\
    &= \sum_l 
        w^1_{kl} 
        \cdot
        \sigma'(Z^0_l)
        \cdot
        \frac{ \partial }{ \partial w^0_{ji} }
          \bigg ( 
            \sum_{i'} w^0_{li'} X_{i'} + b^0_l
          \bigg )
    \\
    &= \sum_l
        w^1_{kl}
        \cdot
        \sigma'(Z^0_l)
        \cdot
        \sum_{i'}
        \delta_{lj} \delta_{ii'} X_{i'} 
    \\
    &= \sum_l
        w^1_{kl}
        \cdot
        \sigma'(Z^0_l)
        \cdot
        \delta_{lj} X_i
    \\
    &= w^1_{kj}
      \cdot 
      \sigma'(Z^0_j)
      \cdot
      X_i
\end{align}
$$

Combining the previously calculated terms we get
$$
\begin{align}
  \Delta w^0_{ji}
  &= 
  - \epsilon \cdot \sum_k
    (P_k - Y_k)
    P_k
    (1 - P_k) 
    \cdot 
    w^1_{kj}
      \cdot 
      \sigma'(Z^0_j)
      \cdot
      X_i
  \\
  &= - \epsilon 
    \cdot \sigma'(Z^0_j) 
    X_i 
    \sum_k
     (P_k - Y_k)
      P_k
      (1 - P_k) 
      w^1_{kj}
\end{align}
$$
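
The input-to-hidden weight step can be sketched the same way: propagate the output error term back through $w^1$, scale by $\sigma'(Z^0_j)$, and take the outer product with the input. The function name `hidden_weight_step` and all numeric values are assumptions for illustration, with the output term $(P_k - Y_k) P_k (1 - P_k)$ supplied precomputed as `delta1`.

```python
def hidden_weight_step(x, w1, delta1, sigma_prime_z0, eps):
    # Back-propagated sum over outputs: sum_k delta1_k * w^1_{kj}
    back = [sum(w1[k][j] * delta1[k] for k in range(len(delta1)))
            for j in range(len(sigma_prime_z0))]
    # Hidden error term: sigma'(Z^0_j) * back_j
    delta0 = [sp_j * b_j for sp_j, b_j in zip(sigma_prime_z0, back)]
    # Delta w^0_{ji} = -eps * delta0_j * X_i (rows: hidden j, columns: input i)
    return [[-eps * d_j * x_i for x_i in x] for d_j in delta0]

# Toy values: three inputs, two hidden units, two outputs.
x = [1.0, -2.0, 0.0]
w1 = [[1.0, 2.0], [3.0, 4.0]]
delta1 = [-0.125, 0.125]  # (P_k - Y_k) P_k (1 - P_k) for P = [0.5, 0.5], Y = [1, 0]
sigma_prime_z0 = [0.25, 0.2]
dw0 = hidden_weight_step(x, w1, delta1, sigma_prime_z0, eps=0.1)
```

An input component equal to zero yields a zero step for every weight attached to it, which is visible in the last column of `dw0`.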
