# <span style="color: #2E86C1; font-weight: bold;">Understanding Backpropagation</span>


Most ML and AI libraries (e.g., TensorFlow, scikit-learn) let you train neural networks without needing to understand the math behind backpropagation. Still, understanding backpropagation enhances your ability to optimize neural networks and tackle complex deep learning architectures.

## <span style="color: #D35400; font-weight: bold;">What Is Backpropagation?</span>

- **Iterative Process:**
  - The machine learning process is iterative: we feed data into the model and evaluate its performance through an objective function.
  - We adjust the model's parameters (weights and biases) using optimization algorithms to reach desired outputs.

- **Forward Propagation:**
  - Involves passing inputs through the network.
  - At the end of each forward pass, we compare the network's outputs to the target values to calculate errors.

- **Backpropagation:**
  - This process adjusts weights and biases in reverse based on calculated errors to minimize loss.
  - Essential for improving the accuracy of artificial neural networks and a key part of the gradient descent optimization process.

## <span style="color: #28B463; font-weight: bold;">Components of a Deep Neural Network</span>

- **Structure of the Network:**
  - **Input Layer:** Contains two inputs, $ x_1 $ and $ x_2 $
  - **Hidden Layer:** Comprises three hidden units (nodes), $ h_1, h_2, $ and $ h_3 $
  - **Output Layer:** Produces two outputs, $ y_1 $ and $ y_2 $

- **Weights:**
  - Arrows connecting the layers represent weights.
  - $ w $ weights connect the input and hidden layers.
  - $ u $ weights connect the hidden and output layers.

- **Targets:**
  - The target values are denoted as $ t_1 $ and $ t_2 $.
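
As a minimal sketch of the structure described above, the network's parameters can be stored as NumPy arrays. The array names, the random initialization, the fixed seed, and the example values for `x` and `t` are illustrative assumptions, not part of the original description.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability

n_inputs, n_hidden, n_outputs = 2, 3, 2

# w[i, j] connects input x_(i+1) to hidden unit h_(j+1)
w = rng.normal(scale=0.5, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)   # hidden-layer biases

# u[i, j] connects hidden unit h_(i+1) to output y_(j+1)
u = rng.normal(scale=0.5, size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)  # output-layer biases

x = np.array([0.5, -1.0])  # made-up inputs x_1, x_2
t = np.array([1.0, 0.0])   # made-up targets t_1, t_2

print(w.shape, u.shape)
```

With this layout there are 2 x 3 = 6 weights $ w $ and 3 x 2 = 6 weights $ u $, one per arrow in the diagram.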

<center><img src="../../images/backpropogation_network.png" alt="Deep Neural Network" width="600"/></center>

## <span style="color: #F39C12; font-weight: bold;">The Sigmoid Function</span>

- **Non-Linearity:**
  - Deep neural networks require non-linear activation functions to represent complex relationships.
  - Stacking layers with only linear relationships is insufficient.

- **Transformation:**
  - Each arrow in the diagram represents a mathematical transformation of input values.
  - Weights are applied to inputs, followed by the introduction of non-linearity, producing the hidden layer's units.

- **Sigmoid Activation Function:**
  - The sigmoid function is one of the most common activation functions:
  
  $$ 
  \sigma(x) = \frac{1}{1 + e^{-x}} 
  $$

  - Its derivative is:

  $$ 
  \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) 
  $$

  - The sigmoid function modifies input values to produce outputs for the next layer.
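
The two formulas above translate directly into code. The sketch below is one possible implementation; the function names `sigmoid` and `sigmoid_prime` are chosen here for illustration, and the finite-difference comparison is only a sanity check on the derivative formula.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])

# Central finite difference as an independent estimate of the derivative
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(sigmoid_prime(x), numeric))
```

At $ x = 0 $ the sigmoid equals 0.5 and its derivative reaches its maximum of 0.25.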


## <span style="color: #2E86C1; font-weight: bold;">The L2 Norm</span>

- **Objective Functions:**
  - Objective functions can be divided into loss (cost) functions and reward functions.
  - This section focuses on loss functions, which measure prediction errors.

- **Minimizing Cost:**
  - A lower cost function indicates higher model accuracy.
  - The goal is to minimize prediction error and, consequently, the cost.

## <span style="color: #D35400; font-weight: bold;">L2 Norm (Squared Loss)</span>

- **Definition:**
  - The L2 norm, commonly used in supervised learning and regression, represents a typical loss function.
  - The term 'norm' refers to the Euclidean distance between the outputs and the targets; the loss used here is half the square of that distance.

- **Calculation:**
  - The L2-norm loss is computed by summing the squared differences between outputs $ y $ and targets $ t $, then multiplying by $ \frac{1}{2} $, a factor that cancels when differentiating.
  - The mathematical expression is as follows:

  $$
  L = \frac{1}{2} \sum_{i} (y_i - t_i)^2
  $$
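
A short sketch of this loss and its derivative with respect to the outputs follows. The function names and the example values for `y` and `t` are assumptions made for illustration.

```python
import numpy as np

def l2_loss(y, t):
    """L2-norm loss: half the sum of squared differences."""
    return 0.5 * np.sum((y - t) ** 2)

def l2_loss_grad(y, t):
    """Derivative of the loss with respect to each output y_i."""
    return y - t

y = np.array([0.8, 0.3])  # made-up outputs
t = np.array([1.0, 0.0])  # made-up targets

# 0.5 * ((-0.2)^2 + 0.3^2) = 0.5 * 0.13, roughly 0.065
print(l2_loss(y, t))
print(l2_loss_grad(y, t))
```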

## <span style="color: #28B463; font-weight: bold;">Backpropagation Algorithm</span>

- **Separate Methodologies:**
  - The backpropagation algorithm will be examined for both output and hidden layers.
  - These methodologies differ, so they will be reviewed separately.

- **Linear Model Function:**
  - The linear model function is defined as:

  $$
  f(x) = xw + b
  $$

  Where:
  - $ x $ = input
  - $ w $ = coefficient (weight)
  - $ b $ = intercept (bias)

- **Notation for Linear Combination:**
  - For the linear combination before activation, we define:
  
  $$
  a^{(1)} = xw + b^{(1)} \quad \text{and} \quad a^{(2)} = hu + b^{(2)}
  $$

- **Output and Hidden Layer Activations:**
  - Using this notation, the output $ y $ becomes the activated linear combination.
  - For the output layer, we have:

  $$
  y = \sigma(a^{(2)})
  $$

  - For the hidden layer, the activation is given by:

  $$
  h = \sigma(a^{(1)})
  $$
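
Putting the notation together, a forward pass through the 2-3-2 network might look like the sketch below. The `forward` function name, the random weights, and the input values are assumptions for illustration; only the formulas for $ a^{(1)} $, $ h $, $ a^{(2)} $, and $ y $ come from the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w, b1, u, b2):
    """Forward pass: returns the pre-activations and activations of both layers."""
    a1 = x @ w + b1   # a^(1) = xw + b^(1)
    h = sigmoid(a1)   # h = sigma(a^(1))
    a2 = h @ u + b2   # a^(2) = hu + b^(2)
    y = sigmoid(a2)   # y = sigma(a^(2))
    return a1, h, a2, y

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])       # made-up inputs
w = rng.normal(size=(2, 3))     # input -> hidden weights
b1 = np.zeros(3)
u = rng.normal(size=(3, 2))     # hidden -> output weights
b2 = np.zeros(2)

a1, h, a2, y = forward(x, w, b1, u, b2)
print(h.shape, y.shape)
```

Returning the intermediate values is a design choice: the backward pass reuses $ h $ and $ y $, so keeping them avoids recomputation.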

## <span style="color: #F39C12; font-weight: bold;">Common Functions</span>

- **Focus on Activation and Loss Functions:**
  - While there are various activation and loss functions, we concentrate on the most common ones:
    - **Activation Function:** Sigmoid
    - **Loss Function:** L2-norm loss




---

# <span style="color: #2E86C1; font-weight: bold;">Backpropagation for the Output Layer</span>


- **Objective of Backpropagation:**
  - In supervised learning, the primary goal is to minimize the loss.
  - Backpropagation computes the gradient of the loss function with respect to each unit's weights and biases in the network.

- **Process Overview:**
  - We use the gradients obtained to update parameters so that the loss computed with new values is less than the current loss.
  - The loss decreases by iteratively adjusting the weights and biases based on the obtained gradients, allowing the network to gradually learn better predictions.

## <span style="color: #D35400; font-weight: bold;">Update Rule</span>

- **Importance of Deltas:**
  - Updates are directly related to the partial derivatives of the loss and indirectly connected to errors (deltas) between targets and outputs.
  - Deltas enable modification of parameters using the update rule.

- **Mathematical Expression:**
  - The update rule for a weight $ u $ is given by:

  $$
  u \leftarrow u - \eta \nabla_u L(u)
  $$

  Where:
  - $ \eta $ (eta) is the learning rate of the machine learning algorithm.
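
To see the update rule in isolation, consider a toy one-parameter loss $ L(u) = \frac{1}{2}(u - 3)^2 $, whose gradient is $ u - 3 $. This loss, the starting value, and the learning rate are assumptions chosen purely for illustration; they are not part of the network above.

```python
def gradient_step(u, grad, eta):
    """One application of the update rule: u <- u - eta * grad."""
    return u - eta * grad

u = 0.0    # arbitrary starting value
eta = 0.1  # learning rate, an arbitrary choice

for _ in range(100):
    grad = u - 3.0  # gradient of the toy loss 0.5 * (u - 3)^2
    u = gradient_step(u, grad, eta)

print(u)  # close to the minimizer u = 3
```

Each step moves `u` a fraction `eta` of the way toward the minimum, which is why the value approaches 3 without overshooting for this choice of learning rate.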

## <span style="color: #28B463; font-weight: bold;">Calculating the Gradient</span>

- **Single Weight Derivative:**
  - For a single weight $ u_{ij} $, the partial derivative of the loss with respect to $ u_{ij} $ can be expressed as:

  $$
  \frac{\partial L}{\partial u_{ij}} = \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial a^{(2)}_j} \cdot \frac{\partial a^{(2)}_j}{\partial u_{ij}}
  $$

  Where:
  - $ i $ indexes the unit in the previous layer (the hidden layer, which is the input to this transformation).
  - $ j $ indexes the unit in the next layer (the output layer, which is the output of this transformation).

- **Using the Chain Rule:**
  - The partial derivatives are computed using the chain rule:
    1. **L2-Norm Loss Derivative:**

    $$
    \frac{\partial L}{\partial y_j} = (y_j - t_j)
    $$

    2. **Sigmoid Derivative:**

    $$
    \frac{\partial y_j}{\partial a^{(2)}_j} = \sigma(a^{(2)}_j)(1 - \sigma(a^{(2)}_j)) = y_j(1 - y_j)
    $$

    3. **Derivative of the Linear Combination:**
       - The linear model function is defined as:

    $$
    a^{(2)} = h u + b^{(2)}
    $$

       - The derivative of $ a^{(2)}_j $ with respect to $ u_{ij} $ is:

    $$
    \frac{\partial a^{(2)}_j}{\partial u_{ij}} = h_i
    $$

## <span style="color: #F39C12; font-weight: bold;">Final Expression</span>

- **Substituting Partial Derivatives:**
  - By combining these derivatives, and defining the delta $ \delta_j = (y_j - t_j) \cdot y_j(1 - y_j) $, we obtain:

  $$
  \frac{\partial L}{\partial u_{ij}} = \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial a^{(2)}_j} \cdot \frac{\partial a^{(2)}_j}{\partial u_{ij}} = (y_j - t_j) \cdot y_j(1 - y_j) \cdot h_i = \delta_j \cdot h_i
  $$

- **Update Rule for the Output Layer:**
  - Therefore, the update rule for a single weight in the output layer is:

  $$
  u_{ij} \leftarrow u_{ij} - \eta \cdot \delta_j \cdot h_i
  $$
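
The derivation above can be checked numerically. The sketch below computes $ \delta_j \cdot h_i $ for every output-layer weight at once and compares it with a finite-difference estimate of the gradient. The function names, random values, and learning rate are illustrative assumptions; the hidden activations `h` are treated as given.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(h, u, b2, t):
    y = sigmoid(h @ u + b2)
    return 0.5 * np.sum((y - t) ** 2)

def output_layer_grad(h, u, b2, t):
    """Gradient of the loss w.r.t. u, with grad_u[i, j] = delta_j * h_i."""
    y = sigmoid(h @ u + b2)
    delta = (y - t) * y * (1.0 - y)  # delta_j
    return np.outer(h, delta)

rng = np.random.default_rng(0)
h = rng.uniform(size=3)        # made-up hidden activations
u = rng.normal(size=(3, 2))
b2 = np.zeros(2)
t = np.array([1.0, 0.0])       # made-up targets

grad_u = output_layer_grad(h, u, b2, t)

# Finite-difference estimate, one weight at a time
eps = 1e-6
numeric = np.zeros_like(u)
for i in range(u.shape[0]):
    for j in range(u.shape[1]):
        u_plus, u_minus = u.copy(), u.copy()
        u_plus[i, j] += eps
        u_minus[i, j] -= eps
        numeric[i, j] = (loss(h, u_plus, b2, t) - loss(h, u_minus, b2, t)) / (2 * eps)

print(np.allclose(grad_u, numeric, atol=1e-7))

# Apply the update rule once
eta = 0.5
u_new = u - eta * grad_u
print(loss(h, u, b2, t), loss(h, u_new, b2, t))
```

`np.outer(h, delta)` builds the full matrix of products $ h_i \cdot \delta_j $, so the array indices line up with the subscripts of $ u_{ij} $ (shifted to start at 0).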



---

# <span style="color: #2E86C1; font-weight: bold;">Backpropagation for a Hidden Layer</span>


- **Updating Weights in Hidden Layers:**
  - When dealing with deep neural networks, it's essential to update the weights across multiple hidden layers.
  - The activation functions and their derivatives must also be considered when updating the weights.

## <span style="color: #D35400; font-weight: bold;">Update Rule for Hidden Layers</span>

- **General Update Rule:**
  - Similar to the output layer, the update for a single weight $ w_{ij} $ in the hidden layer requires the gradient of the loss with respect to that weight, which can be expressed as:

  $$
  \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial h_j} \cdot \frac{\partial h_j}{\partial a^{(1)}_j} \cdot \frac{\partial a^{(1)}_j}{\partial w_{ij}}
  $$

- **Calculating Partial Derivatives:**
  - We compute backpropagation using the chain rule, utilizing the sigmoid activation and linear model formulas:

  1. **Sigmoid Derivative:**

  $$
  \frac{\partial h_j}{\partial a^{(1)}_j} = \sigma(a^{(1)}_j)(1 - \sigma(a^{(1)}_j)) = h_j(1 - h_j)
  $$

  2. **Linear Model Derivative:**

  $$
  a^{(1)} = x w + b^{(1)} \quad \Rightarrow \quad \frac{\partial a^{(1)}_j}{\partial w_{ij}} = x_i
  $$

## <span style="color: #28B463; font-weight: bold;">Challenges in Calculating Derivatives</span>

- **Hidden Layer Complexity:**
  - Calculating $ \frac{\partial L}{\partial h_j} $ is more complex as we don't have direct targets for the hidden layer outputs.
  - Instead, we trace the contribution of each unit (hidden or otherwise) to the output errors.

- **Example of Contribution Tracing:**
  - For instance, the weight $ u_{11} $ contributes only to the output $ y_1 $ and its error $ e_1 $.
  - The weight $ w_{11} $, by contrast, contributes to $ h_1 $, which is connected through weights $ u_{11} $ and $ u_{12} $ to both outputs $ y_1 $ and $ y_2 $, and therefore to both errors $ e_1 $ and $ e_2 $.

- **Backpropagating Errors:**
  - To solve this, we backpropagate the errors through the network using the weights $ u $ to measure the contribution of the hidden layers to the errors, which aids in updating the weights $ w $.

## <span style="color: #F39C12; font-weight: bold;">Calculating Derivatives for Hidden Layers</span>

- **Calculating $ \frac{\partial L}{\partial h_1} $:**
  - For weight $ w_{11} $, we calculate:

  $$
  \frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial y_1} \cdot \frac{\partial y_1}{\partial a^{(2)}_1} \cdot \frac{\partial a^{(2)}_1}{\partial h_1} + \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial a^{(2)}_2} \cdot \frac{\partial a^{(2)}_2}{\partial h_1}
  $$

  - Substituting the derivatives, we have:

  $$
  \frac{\partial L}{\partial h_1} = (y_1 - t_1) \cdot y_1(1 - y_1) \cdot u_{11} + (y_2 - t_2) \cdot y_2(1 - y_2) \cdot u_{12}
  $$
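
The sum over both outputs can be written out explicitly for $ h_1 $ or computed for all hidden units at once as a matrix-vector product. The sketch below does both and compares the result with a finite-difference estimate. Variable names and random values are assumptions; note that code indices start at 0, so `u[0, 0]` plays the role of $ u_{11} $.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_from_hidden(h, u, b2, t):
    y = sigmoid(h @ u + b2)
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(1)
h = rng.uniform(size=3)       # made-up hidden activations
u = rng.normal(size=(3, 2))
b2 = np.zeros(2)
t = np.array([1.0, 0.0])      # made-up targets

y = sigmoid(h @ u + b2)
delta = (y - t) * y * (1.0 - y)  # output-layer deltas

# Explicit two-term sum for h_1, as in the formula above
grad_h1 = delta[0] * u[0, 0] + delta[1] * u[0, 1]

# Same computation for every hidden unit: dL/dh_j = sum_k delta_k * u[j, k]
grad_h = u @ delta

# Finite-difference estimate of dL/dh_1
eps = 1e-6
step = np.array([eps, 0.0, 0.0])
numeric_h1 = (loss_from_hidden(h + step, u, b2, t)
              - loss_from_hidden(h - step, u, b2, t)) / (2 * eps)

print(np.isclose(grad_h1, grad_h[0]), np.isclose(grad_h1, numeric_h1, atol=1e-7))
```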

## <span style="color: #E74C3C; font-weight: bold;">Final Expression for Weight Update</span>

- **Calculating $ \frac{\partial L}{\partial w_{11}} $:**
  - The final expression for the gradient used in the weight update is:

  $$
  \frac{\partial L}{\partial w_{11}} = \left[ (y_1 - t_1) \cdot y_1 \cdot (1 - y_1) \cdot u_{11} + (y_2 - t_2) \cdot y_2 \cdot (1 - y_2) \cdot u_{12} \right] \cdot h_1(1 - h_1) \cdot x_1
  $$

- **Generalized Form:**
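  - The generalized gradient for weights in hidden layers is given by:

  $$
  \frac{\partial L}{\partial w_{ij}} = \left[ \sum_k (y_k - t_k) \cdot y_k(1 - y_k) \cdot u_{jk} \right] \cdot h_j(1 - h_j) \cdot x_i
  $$

  - The corresponding update rule is $ w_{ij} \leftarrow w_{ij} - \eta \cdot \frac{\partial L}{\partial w_{ij}} $.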
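
As a sanity check on the generalized form, the sketch below computes the gradient for all hidden-layer weights and compares it with a finite-difference estimate. The function names and random values are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(x, w, b1, u, b2, t):
    h = sigmoid(x @ w + b1)
    y = sigmoid(h @ u + b2)
    return 0.5 * np.sum((y - t) ** 2)

def hidden_layer_grad(x, w, b1, u, b2, t):
    """Gradient of the loss w.r.t. w, following the generalized formula."""
    h = sigmoid(x @ w + b1)
    y = sigmoid(h @ u + b2)
    delta_out = (y - t) * y * (1.0 - y)               # (y_k - t_k) y_k (1 - y_k)
    delta_hidden = (u @ delta_out) * h * (1.0 - h)    # [sum_k ... u_jk] h_j (1 - h_j)
    return np.outer(x, delta_hidden)                  # multiply by x_i

rng = np.random.default_rng(2)
x = np.array([0.5, -1.0])      # made-up inputs
t = np.array([1.0, 0.0])       # made-up targets
w = rng.normal(size=(2, 3))
b1 = np.zeros(3)
u = rng.normal(size=(3, 2))
b2 = np.zeros(2)

grad_w = hidden_layer_grad(x, w, b1, u, b2, t)

# Finite-difference estimate, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        numeric[i, j] = (loss(x, w_plus, b1, u, b2, t)
                         - loss(x, w_minus, b1, u, b2, t)) / (2 * eps)

print(np.allclose(grad_w, numeric, atol=1e-7))
```

The output-layer deltas are computed once and reused for every hidden unit, which is the sense in which the error is propagated backward through the weights $ u $.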


---

# <span style="color: #28B463; font-weight: bold;">How Does It Work? This Time In Simple Terms</span>

1. **Forward Pass:**
   - The model takes an input (like a question) and passes it through various layers (like chapters in a textbook) to get an output (an answer).

2. **Calculating Error:**
   - The model checks how far off its answer is from the correct answer. This difference is called the error.

3. **Backward Pass:**
   - Backpropagation then sends this error back through the network:
     - It determines how much each weight (connection) contributed to the error.
     - It uses this information to update the weights, making small adjustments to reduce the error in future predictions.

4. **Learning Rate:**
   - The learning rate is like how much you change your study methods based on feedback. A high learning rate means you make big changes, while a low learning rate means you make small changes.
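
The four steps above can be combined into a small training loop for the 2-3-2 network. This is a sketch under several assumptions: a single made-up training example, random initial weights, a learning rate of 0.5, and 500 iterations. The bias updates use the deltas directly, which follows from $ \partial a / \partial b = 1 $, a step not derived in the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])   # made-up input
t = np.array([1.0, 0.0])    # made-up target

w = rng.normal(scale=0.5, size=(2, 3))
b1 = np.zeros(3)
u = rng.normal(scale=0.5, size=(3, 2))
b2 = np.zeros(2)
eta = 0.5                   # learning rate, an arbitrary choice

losses = []
for step in range(500):
    # 1. Forward pass
    h = sigmoid(x @ w + b1)
    y = sigmoid(h @ u + b2)

    # 2. Calculate the error (L2-norm loss)
    losses.append(0.5 * np.sum((y - t) ** 2))

    # 3. Backward pass: deltas for the output and hidden layers
    delta_out = (y - t) * y * (1.0 - y)
    delta_hidden = (u @ delta_out) * h * (1.0 - h)

    # 4. Update every parameter, scaled by the learning rate
    u -= eta * np.outer(h, delta_out)
    b2 -= eta * delta_out
    w -= eta * np.outer(x, delta_hidden)
    b1 -= eta * delta_hidden

print(losses[0], losses[-1])
```

The hidden-layer delta is computed before `u` is modified, so both layers are updated using gradients taken at the same set of weights.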

## <span style="color: #F39C12; font-weight: bold;">Why is it Important?</span>

- Backpropagation is crucial for training deep neural networks effectively.
- It allows the model to learn complex patterns and make better predictions over time.

## <span style="color: #E74C3C; font-weight: bold;">In Summary</span>

- **Think of Backpropagation as a Feedback Loop:** 
  - The model learns from its mistakes, adjusts its internal parameters, and improves its performance iteratively.
- It helps machines get better at tasks, similar to how we learn from experience!

