![image.png](attachment:image.png)

# 🔁 Steps Involved in Backpropagation

1. **Forward Pass**  
Compute the outputs by passing the input through all layers. Store intermediate activations and weighted sums for use in backpropagation.

2. **Calculate the Error (Loss)**  
Measure the difference between predicted output and actual output using a loss function (e.g., MSE, Cross-Entropy).

3. **Calculate Gradient of Loss with Respect to Output**  
Compute the gradient of the loss function with respect to the network’s final predicted output. This indicates how changes in output affect the loss.

4. **Backward Pass (Error Propagation through Layers)**  
Propagate the error backward through each layer using the chain rule. Compute the gradients of the loss with respect to the weights, biases, activations, and weighted sums, moving from the output layer back to the input layer.

5. **Weight and Bias Update**  
Update weights and biases using an optimization algorithm (e.g., gradient descent) based on the gradients calculated to reduce the loss.

6. **Iteration and Convergence**  
Repeat the entire process over many epochs until the loss converges to a minimum or the performance is satisfactory.
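
The six steps above can be sketched for the smallest possible "network": a single linear neuron trained with MSE and plain gradient descent. This is a minimal illustration rather than a general implementation. The toy data, the learning rate `lr`, the epoch count, and the function name `train` are all assumptions chosen for the example.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def train(xs, ys, lr=0.05, epochs=2000):
    """Fit pred = w * x + b by following the six backpropagation steps."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(epochs):  # Step 6: repeat over many epochs
        # Step 1: forward pass (store predictions for the backward pass)
        preds = [w * x + b for x in xs]

        # Step 2: calculate the loss (MSE)
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
        history.append(loss)

        # Step 3: gradient of the loss with respect to each prediction
        dloss_dpred = [2.0 * (p - y) / n for y, p in zip(ys, preds)]

        # Step 4: backward pass via the chain rule
        # d(pred)/dw = x and d(pred)/db = 1
        grad_w = sum(g * x for g, x in zip(dloss_dpred, xs))
        grad_b = sum(dloss_dpred)

        # Step 5: gradient descent update
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


w, b, history = train(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, last loss={history[-1]:.6f}")
```

With a multi-layer network, Step 4 repeats the same chain-rule pattern once per layer, reusing the activations stored in Step 1.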


# 📘 Lecture: Loss Functions in Machine Learning & Deep Learning

---

## 1️⃣ What is a Loss Function?

A **loss function** is a mathematical function that measures the **difference between predicted output and actual target (ground truth)**.

> 🧠 **In simple terms**: It tells the model *"how wrong"* it is.

---

## 2️⃣ Why is it Important?

- It **guides the learning process**.
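- During training, **optimizers** like Gradient Descent use the gradient of the loss to update the weights.
- A **lower loss** means the predictions are closer to the targets; on held-out data this generally indicates better model performance.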

---

## 3️⃣ Key Terminology

| Term         | Description                                 |
|--------------|---------------------------------------------|
| **Loss**     | Error for a single training example         |
| **Cost**     | Average loss over the entire dataset        |
| **Objective**| The function we aim to minimize             |

---

## 4️⃣ Types of Loss Functions

---

### 🔷 A. Regression Loss Functions

#### 🔹 Mean Squared Error (MSE)

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

- Penalizes large errors heavily.
- Smooth and differentiable.

#### 🔹 Mean Absolute Error (MAE)

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^n \left| y_i - \hat{y}_i \right|
$$

- Penalizes errors in proportion to their magnitude (linearly), so large errors are not amplified by squaring.
- More robust to outliers than MSE.
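
To make the difference in outlier sensitivity concrete, the sketch below computes both losses by hand on made-up numbers. The helper names `mse` and `mae` and the sample values are assumptions for illustration.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


y_true = [3.0, 5.0, 7.0, 9.0]
y_close = [2.5, 5.5, 6.5, 9.5]     # every prediction is off by 0.5
y_outlier = [2.5, 5.5, 6.5, 19.0]  # last prediction is off by 10

print(mse(y_true, y_close), mae(y_true, y_close))      # 0.25 0.5
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # 25.1875 2.875
```

A single bad prediction multiplies the MSE by roughly 100 here, while the MAE grows by a factor of less than 6.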


---

### 🔷 B. Classification Loss Functions

#### 🔹 Binary Cross-Entropy (Log Loss)

$$
\text{BCE} = -\left[ y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right]
$$

- Used for binary classification (output via sigmoid).

#### 🔹 Categorical Cross-Entropy

$$
\text{CCE} = -\sum_{i=1}^{C} y_i \cdot \log(\hat{y}_i)
$$

- Used for multi-class classification (output via softmax).

#### 🔹 Sparse Categorical Cross-Entropy

- Used when labels are **integers** instead of one-hot vectors.
- Computes the same value as CCE, but is more memory-efficient because the one-hot vectors never need to be built.
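
The three classification losses above can be written out directly from their formulas. The following is a rough sketch using only the standard library; the function names and the `eps` clamp (added to avoid `log(0)`) are assumptions for the example, not part of the definitions.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for one example; y is 0 or 1, y_hat is a sigmoid output."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # assumption: clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """CCE for one example; y_onehot is one-hot, probs is a softmax output."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))


def sparse_categorical_cross_entropy(label, probs, eps=1e-12):
    """Same loss as CCE, but the target is an integer class index."""
    return -math.log(max(probs[label], eps))


probs = [0.7, 0.2, 0.1]  # made-up softmax output over 3 classes
cce = categorical_cross_entropy([1, 0, 0], probs)
sparse = sparse_categorical_cross_entropy(0, probs)
print(cce, sparse)  # the two values agree

print(binary_cross_entropy(1, 0.9))  # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))  # confident and wrong: large loss
```

Because a one-hot target zeroes out every term except the true class, the sparse version only needs to look up a single probability.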


---

## 5️⃣ Summary Table

| Task Type            | Common Loss Function            |
|----------------------|---------------------------------|
| Regression           | MSE, MAE, Huber                 |
| Binary Classification| Binary Cross-Entropy            |
| Multi-Class Classification | Categorical Cross-Entropy (Sparse variant for integer labels) |

---



## 🔁 Gradients of Common Loss Functions

---

### 1. Mean Squared Error (MSE)

**Loss:**

$ \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 $

**Gradient:**

$ \frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i) $
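
One way to sanity-check the MSE gradient is to compare it against a central finite difference. The sketch below does this on made-up values; `mse_grad`, `numerical_grad`, and the step size `h` are hypothetical names and choices for this example.

```python
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def mse_grad(y_true, y_pred):
    """Analytic gradient: (2 / n) * (y_hat_i - y_i) for each prediction."""
    n = len(y_true)
    return [2.0 * (p - y) / n for y, p in zip(y_true, y_pred)]


def numerical_grad(loss_fn, y_true, y_pred, h=1e-6):
    """Central finite difference with respect to each prediction."""
    grads = []
    for i in range(len(y_pred)):
        plus = list(y_pred)
        minus = list(y_pred)
        plus[i] += h
        minus[i] -= h
        grads.append((loss_fn(y_true, plus) - loss_fn(y_true, minus)) / (2 * h))
    return grads


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 1.0, 3.0]

analytic = mse_grad(y_true, y_pred)
numeric = numerical_grad(mse, y_true, y_pred)
for a, g in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={g:+.6f}")
```

The gradient is positive where the prediction is too high, negative where it is too low, and scales with the size of the error.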

---

### 2. Mean Absolute Error (MAE)

**Loss:**

$ \mathcal{L}_{\text{MAE}} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| $

**Gradient (not differentiable at $\hat{y}_i = y_i$; the value 0 used there is a conventional subgradient):**

$
\frac{\partial \mathcal{L}}{\partial \hat{y}_i} =
\begin{cases}
\frac{1}{n} & \text{if } \hat{y}_i > y_i \\
-\frac{1}{n} & \text{if } \hat{y}_i < y_i \\
0 & \text{if } \hat{y}_i = y_i
\end{cases}
$
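
The piecewise rule above is just the sign of the error divided by $n$. A minimal sketch, with `mae_subgrad` as a hypothetical helper name and made-up inputs:

```python
def mae_subgrad(y_true, y_pred):
    """Subgradient of MAE: sign(y_hat_i - y_i) / n, with 0 chosen at the kink."""
    n = len(y_true)

    def sign(v):
        return (v > 0) - (v < 0)  # evaluates to 1, -1, or 0

    return [sign(p - y) / n for y, p in zip(y_true, y_pred)]


# Errors of +0.5, -10.0, and 0.0
grads = mae_subgrad([1.0, 12.0, 3.0], [1.5, 2.0, 3.0])
print(grads)
```

Unlike the MSE gradient, the magnitude is the same for the small error and the large one, which is why MAE is less affected by outliers.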

---

### 3. Binary Cross Entropy (BCE)

**Loss:**

$ \mathcal{L}_{\text{BCE}} = - \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right] $

**Gradient:**

$ \frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} $
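
When the prediction comes from a sigmoid, as in the BCE setup described earlier, this gradient simplifies considerably. The short derivation below assumes $\hat{y} = \sigma(z)$, where $z$ is the pre-activation value (a symbol introduced here for illustration):

$$
\frac{\partial \hat{y}}{\partial z} = \sigma(z)\,(1 - \sigma(z)) = \hat{y}(1 - \hat{y})
$$

$$
\frac{\partial \mathcal{L}}{\partial z}
= \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y})
= \hat{y} - y
$$

The denominator cancels, so the gradient with respect to $z$ stays well-behaved even when $\hat{y}$ is close to 0 or 1.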

---

### 4. Categorical Cross Entropy (CCE)

#### A. When using predicted probabilities ($\hat{y}_i$):

**Loss:**

$ \mathcal{L}_{\text{CCE}} = -\sum_{i=1}^C y_i \log(\hat{y}_i) $

**Gradient:**

$ \frac{\partial \mathcal{L}}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} $

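#### B. When using logits + softmax (common in PyTorch, where `nn.CrossEntropyLoss` takes logits directly):

Let:
- $ z_i $: logits
- $ \hat{y}_i = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}} $

Then, for targets that sum to 1 (e.g., one-hot labels):

$ \frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i $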
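
The result can be checked numerically by differentiating the loss with respect to the logits using finite differences. This is a sketch with the standard library only; the logits, the one-hot target, the step size `h`, and the helper names are assumptions for the example.

```python
import math


def softmax(z):
    """Softmax with the max subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cce_from_logits(y_onehot, z):
    """Categorical cross-entropy computed from raw logits."""
    probs = softmax(z)
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs))


z = [2.0, 1.0, 0.1]  # made-up logits
y = [1.0, 0.0, 0.0]  # one-hot target for class 0

# Analytic gradient: softmax output minus target
analytic = [p - t for p, t in zip(softmax(z), y)]

# Numerical gradient via central finite differences on each logit
h = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[i] += h
    z_minus[i] -= h
    diff = cce_from_logits(y, z_plus) - cce_from_logits(y, z_minus)
    numeric.append(diff / (2 * h))

for a, g in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={g:+.6f}")
```

The gradient components sum to zero: the true class is pushed up by exactly as much as the other classes are pushed down.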

---
