# Backpropagation

The idea is to tune the weights and biases so that, over iterations of learning, the loss moves downhill toward a minimum of the loss surface, ideally the global minimum. To do this we must formulate how the weights and biases relate to the loss. This is done via the loss gradient, expressed as

$$
\nabla \mathcal{L} = \left[\frac{\partial \mathcal{L}}{\partial W^i}, \frac{\partial \mathcal{L}}{\partial b^i}\right]
$$

where the index $i$ identifies a layer, so $W^i$ and $b^i$ are the weight matrix and the bias for that entire layer. Whether the index is written as a subscript or a superscript is not important; both notations are used below.

# $\frac{\partial \mathcal{L}}{\partial W^i}$

## Derivation of Binary Cross Entropy Gradient

We start with the loss function:

$$
\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} -\left( y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right)
$$

**(Defined binary cross entropy loss averaged over $M$ samples)**
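
As a quick illustration, the averaged loss can be sketched in plain Python. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for this example; the clipping is not part of the formula and is only there to keep the logarithm finite when a prediction is exactly 0 or 1.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over M samples (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Assumption: clip predictions away from 0 and 1 so log() stays finite.
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        total += -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))
    return total / len(y_true)


labels = [1.0, 0.0, 1.0]
predictions = [0.9, 0.2, 0.6]
loss = binary_cross_entropy(labels, predictions)
print(loss)
```

Confident correct predictions contribute little to the sum, while the uncertain prediction of 0.6 for a positive label contributes the most.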



The network output is the activation of the output layer. From here on, the index $i$ denotes the layer rather than the sample:

$$
\hat{y}_i = \text{act}_2(z_i)
$$

**(Defined the activation output at layer $i$)**



And the pre-activation is:

$$
z_i = W_i A_{i-1} + b_i
$$

**(Defined $z_i$ as the affine transformation of the previous layer's activations $A_{i-1}$, with one sample per column)**
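
The forward step for a single output neuron can be sketched as follows. This is a minimal example under stated assumptions: one neuron, one sample, hypothetical values for the weights and inputs, and the sigmoid as the activation (the choice this document makes further down). The helper names `pre_activation` and `sigmoid` are introduced only for illustration.

```python
import math


def pre_activation(weights, activations, bias):
    """z = sum_k w_k * a_k + b for a single neuron and a single sample."""
    return sum(w * a for w, a in zip(weights, activations)) + bias


def sigmoid(z):
    """Logistic sigmoid, used here as the output activation."""
    return 1.0 / (1.0 + math.exp(-z))


a_prev = [0.5, -1.0, 2.0]  # hypothetical activations of the previous layer
w = [0.1, 0.4, -0.2]       # hypothetical weights of the neuron
b = 0.3                    # hypothetical bias

z = pre_activation(w, a_prev, b)
y_hat = sigmoid(z)
print(z, y_hat)
```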



We want the derivative of the loss with respect to $W_i$:

$$
\frac{\partial \mathcal{L}}{\partial W_i}
$$

**(Stated our backpropagation objective)**



We need three partial derivatives:

$$
\frac{\partial z_i}{\partial W_i}, \quad \frac{\partial \hat{y}_i}{\partial z_i}, \quad \frac{\partial \mathcal{L}}{\partial \hat{y}_i}
$$

**(Listed the components needed via the chain rule)**



Using the chain rule:

$$
\frac{\partial \mathcal{L}}{\partial W_i} = \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_i}
$$

**(Applied the chain rule to compute $\frac{\partial \mathcal{L}}{\partial W_i}$)**


# $\frac{\partial \mathcal{L}}{\partial \hat{y}_i}$

We begin with the binary cross entropy loss function for a single sample (here $\log$ denotes the natural logarithm, written $\ln$ above):

$$
\mathcal{L}(\hat{y}, y) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]
$$



### Step 1: Differentiate $\mathcal{L}$ with respect to $\hat{y}$

We apply the derivative term-by-term:

$$
\frac{d\mathcal{L}}{d\hat{y}} = \frac{d}{d\hat{y}} \left[ - y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) \right]
$$

**(Distributed the negative sign across both terms)**



### Step 2: Differentiate the first term

$$
\frac{d}{d\hat{y}} \left[ - y \log(\hat{y}) \right] = - y \cdot \frac{1}{\hat{y}}
$$

**(Used derivative $\frac{d}{dx} \log(x) = \frac{1}{x}$)**



### Step 3: Differentiate the second term

$$
\frac{d}{d\hat{y}} \left[ - (1 - y) \log(1 - \hat{y}) \right] = - (1 - y) \cdot \left( \frac{-1}{1 - \hat{y}} \right)
$$

**(Applied chain rule: derivative of $\log(1 - \hat{y})$ is $-\frac{1}{1 - \hat{y}}$)**



### Step 4: Simplify the derivative of the second term

$$
- (1 - y) \cdot \left( \frac{-1}{1 - \hat{y}} \right) = \frac{1 - y}{1 - \hat{y}}
$$

**(Simplified negative signs)**



### Step 5: Combine both derivative terms

$$
\frac{d\mathcal{L}}{d\hat{y}} = - \frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
$$

**(Substituted results from Step 2 and Step 4 into Step 1)**
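
One way to sanity-check the result of Step 5 is to compare it against a numerical derivative. The sketch below does this with a central finite difference; the function names and the step size `h` are assumptions chosen for the example.

```python
import math


def bce(y, y_hat):
    """Binary cross entropy for a single sample."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


def bce_grad(y, y_hat):
    """Analytic derivative from Step 5: -y / y_hat + (1 - y) / (1 - y_hat)."""
    return -y / y_hat + (1.0 - y) / (1.0 - y_hat)


def central_difference(f, x, h=1e-6):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


y, y_hat = 1.0, 0.7
analytic = bce_grad(y, y_hat)
numeric = central_difference(lambda p: bce(y, p), y_hat)
print(analytic, numeric)
```

The two numbers should agree to several decimal places, which gives some confidence that the signs in Steps 2 to 4 were handled correctly.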


# $\frac{\partial \hat{y}_i}{\partial z_i}$

We start with the definition of the sigmoid function $\sigma$, writing $z$ for $z_i$ to reduce clutter:

$$
\hat{y}_i = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

**(Defined the sigmoid activation function)**



We rewrite this with a negative exponent:

$$
\hat{y}_i = (1 + e^{-z})^{-1}
$$

**(Rewrote the expression using negative exponent notation)**



Now take the derivative using the chain rule:

$$
\frac{d\sigma}{dz} = -1 \cdot (1 + e^{-z})^{-2} \cdot \left(-e^{-z}\right)
$$

**(Applied the chain rule: the outer function is a power, the inner function is $1 + e^{-z}$ with derivative $-e^{-z}$)**



Simplify the negatives:

$$
\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2}
$$

**(Simplified the signs in the numerator)**



Now express numerator and denominator in terms of $\hat{y}_i$:

Recall:
- $\hat{y}_i = \frac{1}{1 + e^{-z}}$
- $1 - \hat{y}_i = \frac{e^{-z}}{1 + e^{-z}}$

Multiplying them reproduces the simplified derivative:

$$
\frac{d\sigma}{dz} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \hat{y}_i(1 - \hat{y}_i)
$$

**(Rewrote the derivative in terms of $\hat{y}_i$ only)**
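
The identity can be checked numerically at a few points. This is a small sketch; the names `sigmoid` and `sigmoid_grad` and the sample points are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative written in terms of the output only: y_hat * (1 - y_hat)."""
    s = sigmoid(z)
    return s * (1.0 - s)


h = 1e-6
results = []
for z in (-2.0, 0.0, 3.0):
    # Central finite difference as an independent estimate of the slope.
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
    results.append((z, sigmoid_grad(z), numeric))
    print(z, sigmoid_grad(z), numeric)
```

A practical consequence is that the backward pass can reuse the activation already computed in the forward pass instead of evaluating the exponential again.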


# $\frac{\partial z_i}{\partial W_i}$

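Recall:

$$
z_i = W_i A_{i-1} + b_i
$$

Since $z_i$ is linear in $W_i$, the derivative is simply the previous layer's activations, transposed so that the matrix shapes line up:

$$
\frac{\partial z_i}{\partial W_i} = A_{i-1}^T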
$$

# $\frac{\partial \mathcal{L}}{\partial W_i}$

Recall: 

$$
\frac{\partial \mathcal{L}}{\partial W_i} = \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_i}
$$

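Substituting the three factors derived above gives

$$
\frac{\partial \mathcal{L}}{\partial W_i} = \left[- \frac{y}{\hat{y}_i} + \frac{1 - y}{1 - \hat{y}_i}\right]\left[\hat{y}_i(1 - \hat{y}_i)\right]A_{i-1}^T
$$

The first two factors multiply out to $-y(1 - \hat{y}_i) + (1 - y)\hat{y}_i = \hat{y}_i - y$. Writing the output activation as $A^i = \hat{y}_i$ and averaging over the $M$ samples, this is more compactly known as

$$
\frac{\partial \mathcal{L}}{\partial W_i} = \frac{1}{M} \sum (A^i - y)(A^{i-1})^T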
$$
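
The compact expression can be checked against a finite-difference gradient. The sketch below assumes a single sigmoid output neuron and a tiny hypothetical dataset, so the matrix product reduces to a sum of `error * input` terms per weight. All names and values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Forward pass for one sample: sigmoid(w . x + b)."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)


def mean_bce(w, b, X, Y):
    """Binary cross entropy averaged over all samples."""
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = predict(w, b, x)
        total += -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))
    return total / len(X)


def weight_gradient(w, b, X, Y):
    """(1/M) * sum over samples of (A^i - y) * A^{i-1}."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        error = predict(w, b, x) - y  # (A^i - y) for this sample
        for k, xk in enumerate(x):
            grad[k] += error * xk
    return [g / len(X) for g in grad]


# Hypothetical data: three samples with two input features each.
X = [[0.5, 1.5], [-1.0, 2.0], [1.0, -0.5]]
Y = [1.0, 0.0, 1.0]
w, b = [0.2, -0.3], 0.1

analytic = weight_gradient(w, b, X, Y)

h = 1e-6
numeric = []
for k in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[k] += h
    w_minus[k] -= h
    numeric.append((mean_bce(w_plus, b, X, Y) - mean_bce(w_minus, b, X, Y)) / (2.0 * h))

print(analytic)
print(numeric)
```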

# $\frac{\partial \mathcal{L}}{\partial b_i}$

Since the bias enters the pre-activation as a plain additive term that does not multiply anything, the derivative of the pre-activation with respect to the bias is just 1.

$$
z^i = W^i A^{i-1} + b^i
$$

$$
\frac{\partial z^i}{\partial b^i} = 1
$$

To obtain $\frac{\partial \mathcal{L}}{\partial b^i}$ we can use the chain rule

$$
\frac{\partial \mathcal{L}}{\partial b^i} = \frac{\partial \mathcal{L}}{\partial z^i} \frac{\partial z^i}{\partial b^i}
$$

Expanding the first factor with the chain rule once more:

$$
\frac{\partial \mathcal{L}}{\partial z_i} = \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_i} 
$$

Therefore

$$
\frac{\partial \mathcal{L}}{\partial b_i} = \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_i} \cdot \frac{\partial z^i}{\partial b^i}
$$

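Finally, since $\frac{\partial z^i}{\partial b^i} = 1$ and the first two factors multiply out to $\hat{y}_i - y = A^i - y$, averaging over the $M$ samples gives

$$
\frac{\partial \mathcal{L}}{\partial b_i} = \frac{1}{M} \sum (A^i - y)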
$$
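
The bias gradient can be verified in the same way. In this sketch the weighted inputs (the part of the pre-activation that does not involve the bias) are hypothetical fixed numbers, so that only the bias varies; the function names are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_bce(bias, weighted_inputs, labels):
    """Mean loss when each pre-activation is weighted_input + bias."""
    total = 0.0
    for s, y in zip(weighted_inputs, labels):
        y_hat = sigmoid(s + bias)
        total += -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))
    return total / len(labels)


def bias_gradient(bias, weighted_inputs, labels):
    """(1/M) * sum(A^i - y), because dz/db = 1 for every sample."""
    errors = [sigmoid(s + bias) - y for s, y in zip(weighted_inputs, labels)]
    return sum(errors) / len(labels)


weighted_inputs = [0.8, -1.2, 0.3]  # hypothetical weighted inputs, one per sample
labels = [1.0, 0.0, 1.0]
bias = 0.1

analytic = bias_gradient(bias, weighted_inputs, labels)
h = 1e-6
numeric = (mean_bce(bias + h, weighted_inputs, labels)
           - mean_bce(bias - h, weighted_inputs, labels)) / (2.0 * h)
print(analytic, numeric)
```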

Now that we can directly relate the weights and biases to the loss function, we can nudge each weight and bias a small step in the direction opposite to its gradient, which lowers the loss. Repeating this update is gradient descent.
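
Putting the two gradients together, a gradient descent loop for a single sigmoid neuron might look like the sketch below. The learning rate, step count, zero initialization, and toy dataset are all assumptions chosen for illustration, not values taken from this document.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, Y, learning_rate=0.5, steps=200):
    """Gradient descent on mean binary cross entropy for one sigmoid neuron."""
    m = len(X)
    w = [0.0] * len(X[0])  # assumption: start from zero weights
    b = 0.0
    history = []
    for _ in range(steps):
        preds = [sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b) for x in X]
        loss = sum(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
                   for p, y in zip(preds, Y)) / m
        history.append(loss)
        errors = [p - y for p, y in zip(preds, Y)]  # (A^i - y) per sample
        grad_w = [sum(e * x[k] for e, x in zip(errors, X)) / m
                  for k in range(len(w))]
        grad_b = sum(errors) / m
        # Step against the gradient to reduce the loss.
        w = [wk - learning_rate * g for wk, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b, history


# Hypothetical one-feature dataset: small inputs are class 0, large are class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [0.0, 0.0, 1.0, 1.0]
w, b, history = train(X, Y)
print(history[0], history[-1])
```

The recorded loss starts at $\ln 2$ (every prediction is 0.5 with zero weights) and decreases as the updates accumulate.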