# Back Propagation 
This notebook aims to build a mathematical understanding of backpropagation by exploring the forward and backward passes and how to compute each of them.

## Example Model 
To illustrate backpropagation, define a model with eight parameters, $\phi = \{\beta_0, \omega_0, \ldots, \beta_3, \omega_3\}$, whose loss is given by the least squares loss function.

Define the model as:
$$
f[x,\phi] = \beta_3 + \omega_3 \cdot \cos\!\left[
\beta_2 + \omega_2 \cdot \exp\!\left(
\beta_1 + \omega_1 \cdot \sin(\beta_0 + \omega_0 \cdot x)
\right)
\right]
$$

Define the least squares loss function as:
$$
L[\phi] = \sum_i \ell_i,
\qquad
\ell_i = \left( f[x_i,\phi] - y_i \right)^2
$$
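The two definitions above can be written directly as code. The following is a minimal sketch in plain Python; the function names `model` and `least_squares_loss`, and the choice to pass the parameters as two lists `beta` and `omega`, are assumptions made for illustration.

```python
import math


def model(x, beta, omega):
    """Evaluate f[x, phi], where beta and omega each hold four parameters."""
    inner = math.sin(beta[0] + omega[0] * x)
    middle = math.exp(beta[1] + omega[1] * inner)
    outer = math.cos(beta[2] + omega[2] * middle)
    return beta[3] + omega[3] * outer


def least_squares_loss(xs, ys, beta, omega):
    """Sum the squared error (f[x_i, phi] - y_i)^2 over all data points."""
    return sum((model(x, beta, omega) - y) ** 2 for x, y in zip(xs, ys))


# Illustrative (made-up) parameter values and data.
beta = [0.1, -0.2, 0.3, 0.4]
omega = [0.5, -0.6, 0.7, 0.8]
print(least_squares_loss([0.0, 1.0], [0.5, -0.5], beta, omega))
```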

## Forward Pass 
Before we can compute the backward pass of the model, we first need to compute the forward pass. The forward pass computes and stores the values of the intermediate variables (the pre-activations and the activations, i.e. the hidden unit outputs), which the backward pass then reuses.

Define the equations for the pre-activations and activations of this model as:
$$
\begin{aligned}
f_0 &= \beta_0 + \omega_0 \cdot x_i \\
h_1 &= \sin(f_0) \\
f_1 &= \beta_1 + \omega_1 \cdot h_1 \\
h_2 &= \exp(f_1) \\
f_2 &= \beta_2 + \omega_2 \cdot h_2 \\
h_3 &= \cos(f_2) \\
f_3 &= \beta_3 + \omega_3 \cdot h_3 \\
\ell_i &= (f_3 - y_i)^2
\end{aligned}
$$

Here, $f_k$ denotes the pre-activations and $h_k$ denotes the activations (hidden unit outputs).
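The forward-pass equations above translate line by line into code. This is a minimal sketch for a single data point; the function name `forward_pass` and the dictionary used to store the intermediate values are assumptions made for illustration.

```python
import math


def forward_pass(x_i, y_i, beta, omega):
    """Compute and store every pre-activation, activation, and the loss."""
    f0 = beta[0] + omega[0] * x_i
    h1 = math.sin(f0)
    f1 = beta[1] + omega[1] * h1
    h2 = math.exp(f1)
    f2 = beta[2] + omega[2] * h2
    h3 = math.cos(f2)
    f3 = beta[3] + omega[3] * h3
    loss = (f3 - y_i) ** 2
    # Keep all intermediate values so the backward pass can reuse them.
    return {"f0": f0, "h1": h1, "f1": f1, "h2": h2,
            "f2": f2, "h3": h3, "f3": f3, "loss": loss}


# Illustrative (made-up) parameter values and data point.
values = forward_pass(0.7, 0.3, [0.1, -0.2, 0.3, 0.4], [0.5, -0.6, 0.7, 0.8])
for name, value in values.items():
    print(name, value)
```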

## Backward Pass 

### Backward Pass #1 
Backward Pass #1 computes the derivative of the loss function (the least squares loss in this case) with respect to each pre-activation and activation. These derivatives are represented as:
$$
\frac{\partial \ell_i}{\partial f_3},\quad
\frac{\partial \ell_i}{\partial h_3},\quad
\frac{\partial \ell_i}{\partial f_2},\quad
\frac{\partial \ell_i}{\partial h_2},\quad
\frac{\partial \ell_i}{\partial f_1},\quad
\frac{\partial \ell_i}{\partial h_1},\quad
\text{and}\quad
\frac{\partial \ell_i}{\partial f_0}.
$$ 

To compute these derivatives, we work backward from the end of the model, starting with the last pre-activation ($f_3$). Therefore, we first take the derivative of the loss with respect to $f_3$. Since $\ell_i = (f_3 - y_i)^2$, this is:
$$
\frac{\partial \ell_i}{\partial f_3}
= 2 (f_3 - y_i).
$$ 

Next, we move one step back through the forward-pass equations and compute the derivative of the loss with respect to $h_3$. This is represented as:
$$
\frac{\partial \ell_i}{\partial h_3}
= \frac{\partial f_3}{\partial h_3}
  \frac{\partial \ell_i}{\partial f_3}.
$$

To compute this derivative, we multiply $\frac{\partial \ell_i}{\partial f_3}$ by $\frac{\partial f_3}{\partial h_3}$. This is an application of the chain rule:

1) The pre-activation ($f_3$) depends on the activation ($h_3$)

2) The loss ($\ell_i$) depends on the pre-activation ($f_3$)

Therefore, the loss ($\ell_i$) depends on the activation ($h_3$) through $f_3$, and its derivative with respect to $h_3$ is the product of the two derivatives above.
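As a concrete instance of this step, the forward-pass equation $f_3 = \beta_3 + \omega_3 \cdot h_3$ gives the local derivative directly. A short worked sketch, using only the symbols already defined:

$$
\frac{\partial f_3}{\partial h_3} = \omega_3
\quad\Longrightarrow\quad
\frac{\partial \ell_i}{\partial h_3}
= \omega_3 \cdot 2 (f_3 - y_i).
$$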

<ins>Result</ins>: The same chain rule step is repeated through all the pre-activations and activations until we have computed the derivative of the loss ($\ell_i$) with respect to every pre-activation and activation.
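Repeating the chain rule step for this particular model could look like the sketch below. Each local derivative follows from the forward-pass equations ($\cos$ differentiates to $-\sin$, $\exp$ to $\exp$, and $\sin$ to $\cos$). The function names and the dictionary keys such as `dl_df3` are assumptions made for illustration.

```python
import math


def forward_pass(x_i, y_i, beta, omega):
    """Forward pass that stores every intermediate value."""
    f0 = beta[0] + omega[0] * x_i
    h1 = math.sin(f0)
    f1 = beta[1] + omega[1] * h1
    h2 = math.exp(f1)
    f2 = beta[2] + omega[2] * h2
    h3 = math.cos(f2)
    f3 = beta[3] + omega[3] * h3
    return {"f0": f0, "h1": h1, "f1": f1, "h2": h2,
            "f2": f2, "h3": h3, "f3": f3, "loss": (f3 - y_i) ** 2}


def backward_pass_1(x_i, y_i, beta, omega):
    """Derivatives of the loss w.r.t. each pre-activation and activation."""
    v = forward_pass(x_i, y_i, beta, omega)
    dl_df3 = 2.0 * (v["f3"] - y_i)         # loss = (f3 - y_i)^2
    dl_dh3 = omega[3] * dl_df3             # f3 = beta3 + omega3 * h3
    dl_df2 = -math.sin(v["f2"]) * dl_dh3   # h3 = cos(f2)
    dl_dh2 = omega[2] * dl_df2             # f2 = beta2 + omega2 * h2
    dl_df1 = math.exp(v["f1"]) * dl_dh2    # h2 = exp(f1)
    dl_dh1 = omega[1] * dl_df1             # f1 = beta1 + omega1 * h1
    dl_df0 = math.cos(v["f0"]) * dl_dh1    # h1 = sin(f0)
    return {"dl_df3": dl_df3, "dl_dh3": dl_dh3, "dl_df2": dl_df2,
            "dl_dh2": dl_dh2, "dl_df1": dl_df1, "dl_dh1": dl_dh1,
            "dl_df0": dl_df0}


# Illustrative (made-up) parameter values and data point.
grads = backward_pass_1(0.7, 0.3, [0.1, -0.2, 0.3, 0.4], [0.5, -0.6, 0.7, 0.8])
for name, value in grads.items():
    print(name, value)
```

Each line of `backward_pass_1` reuses the derivative computed on the line before it, which is why the stored forward-pass values are needed.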

### Backward Pass #2 
Backward Pass #2 computes the derivative of the loss function (the least squares loss in this case) with respect to each parameter, using the derivatives computed in Backward Pass #1.

For parameters ($\beta_k$, $\omega_k$), the derivatives are represented as:

$$
\frac{\partial \ell_i}{\partial \beta_k} 
= 
\frac{\partial f_k}{\partial \beta_k} 
\frac{\partial \ell_i}{\partial f_k}
\qquad
\frac{\partial \ell_i}{\partial \omega_k} 
= 
\frac{\partial f_k}{\partial \omega_k} 
\frac{\partial \ell_i}{\partial f_k}
$$ 

To compute these derivatives, we apply the chain rule again:

1) The pre-activation ($f_k$) depends on the parameter ($\beta_k$)

2) The loss ($\ell_i$) depends on the pre-activation ($f_k$)

Therefore, the loss ($\ell_i$) depends on the parameter ($\beta_k$) through $f_k$. The same reasoning applies to $\omega_k$.
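For this model the local derivatives are simple, because every pre-activation has the form $f_k = \beta_k + \omega_k \cdot h_k$. The worked sketch below writes the input as $h_0 = x_i$, a notational convention assumed here for compactness:

$$
\frac{\partial f_k}{\partial \beta_k} = 1,
\qquad
\frac{\partial f_k}{\partial \omega_k} = h_k
\quad\Longrightarrow\quad
\frac{\partial \ell_i}{\partial \beta_k} = \frac{\partial \ell_i}{\partial f_k},
\qquad
\frac{\partial \ell_i}{\partial \omega_k} = h_k \, \frac{\partial \ell_i}{\partial f_k}.
$$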

<ins>Result</ins>: The same chain rule step is repeated for every pre-activation until we have computed the derivative of the loss ($\ell_i$) with respect to all the parameters.
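Both backward passes can be combined and checked numerically. The sketch below computes all eight parameter derivatives and compares one of them against a central finite-difference estimate. The function names, the list-based storage, the convention `h[0] = x_i`, and the step size `eps` are assumptions made for illustration.

```python
import math


def forward_pass(x_i, y_i, beta, omega):
    """Return pre-activations f[0..3], activations h[0..3] (h[0] = x_i), and the loss."""
    activations = [math.sin, math.exp, math.cos]
    h = [x_i]
    f = []
    for k in range(4):
        f.append(beta[k] + omega[k] * h[k])
        if k < 3:
            h.append(activations[k](f[k]))
    return f, h, (f[3] - y_i) ** 2


def parameter_gradients(x_i, y_i, beta, omega):
    """Return (dl/dbeta, dl/domega), each a list of four derivatives."""
    f, h, _ = forward_pass(x_i, y_i, beta, omega)
    # Backward pass #1: derivatives w.r.t. the pre-activations.
    dl_df = [0.0] * 4
    dl_df[3] = 2.0 * (f[3] - y_i)
    dl_df[2] = -math.sin(f[2]) * omega[3] * dl_df[3]
    dl_df[1] = math.exp(f[1]) * omega[2] * dl_df[2]
    dl_df[0] = math.cos(f[0]) * omega[1] * dl_df[1]
    # Backward pass #2: df_k/dbeta_k = 1 and df_k/domega_k = h_k.
    dl_dbeta = list(dl_df)
    dl_domega = [h[k] * dl_df[k] for k in range(4)]
    return dl_dbeta, dl_domega


def finite_difference(x_i, y_i, beta, omega, which, k, eps=1e-6):
    """Central-difference estimate of dl/dbeta_k or dl/domega_k."""
    plus = {"beta": list(beta), "omega": list(omega)}
    minus = {"beta": list(beta), "omega": list(omega)}
    plus[which][k] += eps
    minus[which][k] -= eps
    loss_plus = forward_pass(x_i, y_i, plus["beta"], plus["omega"])[2]
    loss_minus = forward_pass(x_i, y_i, minus["beta"], minus["omega"])[2]
    return (loss_plus - loss_minus) / (2.0 * eps)


# Illustrative (made-up) parameter values and data point.
x_i, y_i = 0.7, 0.3
beta = [0.1, -0.2, 0.3, 0.4]
omega = [0.5, -0.6, 0.7, 0.8]
dl_dbeta, dl_domega = parameter_gradients(x_i, y_i, beta, omega)
print("backprop dl/domega_0:   ", dl_domega[0])
print("finite-diff dl/domega_0:", finite_difference(x_i, y_i, beta, omega, "omega", 0))
```

The two printed values should agree closely; a large gap would point to a mistake in one of the local derivatives.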

## Indicator Functions
An indicator function returns 1 when a condition holds and 0 otherwise. In backpropagation, it acts as a binary filter on the gradient: it determines which activations contribute to the overall gradient.

For example:

- An indicator function might let the gradient pass through only where the activation has a positive value. Where the activation is zero, the indicator function is zero, so the gradient there is zero as well
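One common setting where this filtering appears is the ReLU activation, which is not part of the example model above and is used here only as an assumed illustration. Its derivative is an indicator function of the pre-activation, so the gradient passes through unchanged where the activation is positive and is set to zero elsewhere. The function names below are hypothetical.

```python
def relu(pre_activation):
    """ReLU activation: clip negative inputs to zero."""
    return max(0.0, pre_activation)


def indicator_positive(value):
    """Indicator function: 1 if the value is positive, otherwise 0."""
    return 1.0 if value > 0 else 0.0


def backprop_through_relu(pre_activation, upstream_gradient):
    """Chain rule for h = relu(f): dl/df = indicator[f > 0] * dl/dh."""
    return indicator_positive(pre_activation) * upstream_gradient


# Illustrative (made-up) pre-activations, all with the same upstream gradient.
for f in [2.0, -1.0, 0.0]:
    print(f, relu(f), backprop_through_relu(f, 3.5))
```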