# U of U IS Deep Learning Study Group
## Backpropagation Formula Sheet
#### Author: Brian Sheng

This is an excerpt from the longer set of notes on computational graphs and backpropagation. The formulas here outline the example case of a fully connected neural net with one input layer, one hidden layer, and one output layer, with non-linear activation functions applied to the hidden and output layers.

$\boldsymbol{x}=$ Inputs

$\boldsymbol{z_1}=\boldsymbol{W_1 x}+\boldsymbol{b_1}=$ Input values to "hidden" activation function

$\boldsymbol{a}=\boldsymbol{f}(\boldsymbol{z_1})=$ Hidden Activations

$\boldsymbol{z_2}=\boldsymbol{W_2 a}+\boldsymbol{b_2}=$ Inputs to "output" activation function

$\boldsymbol{\hat{y}}=\boldsymbol{f}(\boldsymbol{z_2})=$ Output Activations
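
The definitions above can be sketched as a forward pass in NumPy. This is a minimal illustration, not part of the original notes: the sigmoid activation, the layer sizes, and the random initial values are all assumptions, since the notes leave $\boldsymbol{f}$ and the dimensions unspecified.

```python
import numpy as np

def sigmoid(z):
    # Assumed choice for the activation f; the notes do not fix one.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2          # hypothetical layer sizes

x = rng.normal(size=n_in)                # inputs
W1 = rng.normal(size=(n_hidden, n_in))   # hidden-layer weights
b1 = np.zeros(n_hidden)                  # hidden-layer biases
W2 = rng.normal(size=(n_out, n_hidden))  # output-layer weights
b2 = np.zeros(n_out)                     # output-layer biases

z1 = W1 @ x + b1        # inputs to the hidden activation
a = sigmoid(z1)         # hidden activations
z2 = W2 @ a + b2        # inputs to the output activation
y_hat = sigmoid(z2)     # output activations
```

Each weight matrix has one row per unit in the layer it feeds and one column per unit in the layer it reads from, which is what makes the shapes in the gradient formulas line up.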

$$\frac{\partial J}{\partial \boldsymbol{b_2}}
=
\frac{\partial J}{\partial \boldsymbol{\hat{y}}}
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{b_2}}
$$

$$
\frac{\partial J}{\partial \boldsymbol{\hat{y}}}
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}
=
\frac{\partial J}{\partial \boldsymbol{\hat{y}}}
\odot
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}
=
\boldsymbol{\delta_2}
$$

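Where $\odot$ is **element-wise multiplication**, a.k.a. the Hadamard product: https://en.wikipedia.org/wiki/Hadamard_product_(matrices), and $\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}$ is the derivative of the output activation function w.r.t. its inputs. The product reduces to an element-wise one because the activation function is applied to each element of $\boldsymbol{z_2}$ separately.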
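
A short NumPy sketch of $\boldsymbol{\delta_2}$ may help make the Hadamard product concrete. The sigmoid activation, the squared-error loss $J=\frac{1}{2}\lVert\boldsymbol{\hat{y}}-\boldsymbol{y}\rVert^2$, and the numeric values are assumptions chosen for illustration; the notes do not specify any of them. The sketch also computes the same quantity with an explicit diagonal Jacobian to show why the two forms agree for an element-wise activation.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation f
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid, evaluated element-wise
    s = sigmoid(z)
    return s * (1.0 - s)

z2 = np.array([0.5, -1.0])   # hypothetical inputs to the output activation
y = np.array([1.0, 0.0])     # hypothetical targets
y_hat = sigmoid(z2)

dJ_dyhat = y_hat - y            # gradient of the assumed loss 0.5 * ||y_hat - y||^2
dyhat_dz2 = sigmoid_prime(z2)   # derivative of the output activation
delta2 = dJ_dyhat * dyhat_dz2   # Hadamard (element-wise) product

# Equivalent form: the Jacobian of an element-wise function is diagonal,
# so multiplying by it is the same as an element-wise product.
jacobian = np.diag(dyhat_dz2)
delta2_via_jacobian = jacobian @ dJ_dyhat
```

An activation that mixes elements, such as softmax, has a non-diagonal Jacobian, so the element-wise shortcut would not apply there.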

$$
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{b_2}}
=
\frac{\partial}{\partial \boldsymbol{b_2}}
\left( \boldsymbol{W_2 a}+\boldsymbol{b_2}\right)=\boldsymbol{1}
$$

Where $\boldsymbol{1}$ is the ones vector (i.e. $\boldsymbol{1}=\begin{bmatrix}
         1 \\
         \vdots \\
         1
        \end{bmatrix}$ for some arbitrary length). In the case of $\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{b_2}}$, the length of the vector is the same as the length of $\boldsymbol{b_2}$.

$$
\frac{\partial J}{\partial \boldsymbol{b_2}}=\boldsymbol{\delta_2}\odot\ \boldsymbol{1}=\boldsymbol{\delta_2}
$$
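
One way to sanity-check the result $\frac{\partial J}{\partial \boldsymbol{b_2}}=\boldsymbol{\delta_2}$ is to compare it against a central finite-difference estimate. The sketch below assumes a sigmoid output activation and a squared-error loss, neither of which is fixed by the notes; the helper name `loss` and all values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation f
    return 1.0 / (1.0 + np.exp(-z))

def loss(b2, W2, a, y):
    # Assumed loss J = 0.5 * ||y_hat - y||^2
    y_hat = sigmoid(W2 @ a + b2)
    return 0.5 * np.sum((y_hat - y) ** 2)

rng = np.random.default_rng(1)
a = rng.normal(size=4)          # hypothetical hidden activations
W2 = rng.normal(size=(2, 4))
b2 = rng.normal(size=2)
y = np.array([1.0, 0.0])        # hypothetical targets

# Analytic gradient: delta2, using sigmoid'(z) = y_hat * (1 - y_hat)
y_hat = sigmoid(W2 @ a + b2)
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)

# Numeric gradient: perturb one bias at a time
eps = 1e-6
numeric = np.zeros_like(b2)
for i in range(b2.size):
    step = np.zeros_like(b2)
    step[i] = eps
    numeric[i] = (loss(b2 + step, W2, a, y) - loss(b2 - step, W2, a, y)) / (2 * eps)
```

Under these assumptions `numeric` and `delta2` should agree to within the finite-difference error.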

$$
\frac{\partial J}{\partial \boldsymbol{W_2}}
=
\frac{\partial J}{\partial \boldsymbol{\hat{y}}}
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{W_2}}
=
\boldsymbol{\delta_2}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{W_2}}

$$\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{W_2}}
=
\frac{\partial}{\partial \boldsymbol{W_2}}\left( \boldsymbol{W_2 a}+\boldsymbol{b_2}\right)
=
\boldsymbol{a}$$

$$\frac{\partial J}{\partial \boldsymbol{W_2}}
=
\boldsymbol{\delta_2}
\otimes
\boldsymbol{a}
=
\boldsymbol{\delta_2}\boldsymbol{a}^T
$$

Where $\otimes$ is the **outer (or tensor) product**: https://en.wikipedia.org/wiki/Outer_product
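
The outer product can be illustrated with `np.outer`. The values below are hypothetical; the point of the sketch is that the result has one row per output unit and one column per hidden unit, i.e. the same shape as the output-layer weight matrix.

```python
import numpy as np

delta2 = np.array([0.1, -0.2])     # hypothetical output-layer deltas (2 output units)
a = np.array([0.5, 0.6, 0.7])      # hypothetical hidden activations (3 hidden units)

# Outer product: entry (i, j) is delta2[i] * a[j]
dJ_dW2 = np.outer(delta2, a)

# The same thing written as a column vector times a row vector
same = delta2[:, None] @ a[None, :]
```

Here `dJ_dW2` has shape `(2, 3)`, so it can be subtracted directly from a weight matrix of that shape in a gradient-descent step.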

$$\frac{\partial J}{\partial \boldsymbol{b_1}}
=
\frac{\partial J}{\partial \boldsymbol{\hat{y}}}
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{a}}
\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z_1}}
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{b_1}}
=
\boldsymbol{\delta_2}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{a}}
\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z_1}}
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{b_1}}
$$

Where $\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z_1} }$ is the derivative of the hidden activation function w.r.t. its inputs. 

$$
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{a}}
=
\frac{\partial}{\partial \boldsymbol{a}}
\left( \boldsymbol{W_2 a}+\boldsymbol{b_2}\right)
=
\boldsymbol{W_2}
$$

$$
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{b_1}}
=
\frac{\partial}{\partial \boldsymbol{b_1}}
\left( \boldsymbol{W_1 x}+\boldsymbol{b_1}\right)
=
\boldsymbol{1}
$$

$$\boldsymbol{\delta_2}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{a}}
=
\langle \boldsymbol{W_2},\boldsymbol{\delta_2}\rangle
=
\boldsymbol{W_2} \cdot \boldsymbol{\delta_2}
=
\boldsymbol{W_2}^T\boldsymbol{\delta_2}
$$

Where $\langle \boldsymbol{W_2},\boldsymbol{\delta_2}\rangle$ denotes the **inner product** (in this case, called the **dot product**) of each column of the matrix $\boldsymbol{W_2}$ with the vector $\boldsymbol{\delta_2}$, which together form the matrix-vector product $\boldsymbol{W_2}^T\boldsymbol{\delta_2}$:
https://en.wikipedia.org/wiki/Inner_product_space
https://en.wikipedia.org/wiki/Dot_product
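
As a small illustration of this step, the sketch below computes the transposed-weights product in NumPy and compares it with a column-by-column dot product. The matrix and vector values are hypothetical.

```python
import numpy as np

W2 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])   # hypothetical weights: 2 output units, 3 hidden units
delta2 = np.array([0.1, -0.2])     # hypothetical output-layer deltas

# Matrix-vector product with the transposed weights
back = W2.T @ delta2

# Same result, one hidden unit at a time: the dot product of
# column j of W2 with delta2
per_unit = np.array([np.dot(W2[:, j], delta2) for j in range(W2.shape[1])])
```

The result has one entry per hidden unit, which is the shape needed for the element-wise product with the derivative of the hidden activation.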

$$
\frac{\partial J}{\partial \boldsymbol{b_1}}
=
\boldsymbol{\delta_2}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{a} } 
\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z_1} } 
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{b_1}}
=
\langle \boldsymbol{W_2},\boldsymbol{\delta_2}\rangle
\odot \frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z_1} }
\odot \boldsymbol{1}
=
\boldsymbol{\delta_1} \odot \boldsymbol{1}
=
\boldsymbol{\delta_1} 
$$
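
The hidden-layer delta can be sketched in NumPy as follows. The sigmoid activation and all numeric values are assumptions made for illustration; only the structure (back-propagated deltas, times the activation derivative, times a ones vector) comes from the formula above.

```python
import numpy as np

def sigmoid_prime(z):
    # Derivative of the assumed sigmoid activation
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

W2 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])    # hypothetical output-layer weights
delta2 = np.array([0.1, -0.2])      # hypothetical output-layer deltas
z1 = np.array([0.0, 1.0, -1.0])     # hypothetical inputs to the hidden activation

# Back-propagate delta2 through the weights, then scale element-wise
# by the derivative of the hidden activation
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Multiplying by the ones vector leaves delta1 unchanged
dJ_db1 = delta1 * np.ones_like(delta1)
```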

$$
\frac{\partial J}{\partial \boldsymbol{W_1}}
=
\frac{\partial J}{\partial \boldsymbol{\hat{y}}}
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{z_2}}
\frac{\partial \boldsymbol{z_2}}{\partial \boldsymbol{a}}
\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z_1}}
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{W_1}}
=
\boldsymbol{\delta_1}
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{W_1}}
$$

$$
\frac{\partial \boldsymbol{z_1}}{\partial \boldsymbol{W_1}}
=
\frac{\partial}{\partial \boldsymbol{W_1}}
\left( \boldsymbol{W_1 x}+\boldsymbol{b_1}\right)
=
\boldsymbol{x}
$$

$$
\frac{\partial J}{\partial \boldsymbol{W_1}}
=
\boldsymbol{\delta_1}
\otimes
\boldsymbol{x}
=
\boldsymbol{\delta_1}
\boldsymbol{x}^T
$$
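
Putting the four gradient formulas together, the sketch below implements one backward pass and compares every gradient against a central finite-difference estimate. It is an illustration under stated assumptions rather than a reference implementation: the sigmoid activation, the squared-error loss $J=\frac{1}{2}\lVert\boldsymbol{\hat{y}}-\boldsymbol{y}\rVert^2$, the layer sizes, and the helper names `forward`, `loss`, `backprop`, and `numeric_grad` are all hypothetical choices not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation f for both layers
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a = sigmoid(z1)
    z2 = W2 @ a + b2
    y_hat = sigmoid(z2)
    return z1, a, z2, y_hat

def loss(params, x, y):
    # Assumed loss J = 0.5 * ||y_hat - y||^2
    y_hat = forward(params, x)[-1]
    return 0.5 * np.sum((y_hat - y) ** 2)

def backprop(params, x, y):
    """Return gradients in the same order as params: W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    z1, a, z2, y_hat = forward(params, x)
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)   # dJ/dy_hat (.) dy_hat/dz2
    delta1 = (W2.T @ delta2) * a * (1.0 - a)       # W2^T delta2 (.) da/dz1
    return np.outer(delta1, x), delta1, np.outer(delta2, a), delta2

def numeric_grad(params, x, y, index, eps=1e-6):
    """Central-difference gradient of the loss w.r.t. params[index]."""
    grad = np.zeros_like(params[index])
    for idx in np.ndindex(*params[index].shape):
        original = params[index][idx]
        params[index][idx] = original + eps
        plus = loss(params, x, y)
        params[index][idx] = original - eps
        minus = loss(params, x, y)
        params[index][idx] = original          # restore the parameter
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # hypothetical input
y = np.array([1.0, 0.0])               # hypothetical target
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]

analytic = backprop(params, x, y)
max_errors = [np.max(np.abs(g - numeric_grad(params, x, y, i)))
              for i, g in enumerate(analytic)]
```

If the formulas are implemented correctly, every entry of `max_errors` should be close to zero. A check like this is a common way to catch a transposed matrix or a missing activation derivative.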