## Table of Contents

1. [Introduction](#1-introduction)
2. [Binary Cross-Entropy (BCE)](#2-binary-cross-entropy-bce)
   - [2.1. What is Binary Cross-Entropy?](#21-what-is-binary-cross-entropy)
   - [2.2. Mathematical Formulation](#22-mathematical-formulation)
   - [2.3. Gradient Derivation](#23-gradient-derivation)
     - [2.3.1. Derivative of the Sigmoid Function](#231-derivative-of-the-sigmoid-function)
     - [2.3.2. Gradient with Respect to Logit $ z $](#232-gradient-with-respect-to-logit-z)
     - [2.3.3. Gradient with Respect to Weights $ \mathbf{w} $ and Bias $ b $](#233-gradient-with-respect-to-weights-w-and-b)
3. [Categorical Cross-Entropy (CCE)](#3-categorical-cross-entropy-cce)
   - [3.1. What is Categorical Cross-Entropy?](#31-what-is-categorical-cross-entropy)
   - [3.2. Mathematical Formulation](#32-mathematical-formulation)
   - [3.3. Gradient Derivation](#33-gradient-derivation)
     - [3.3.1. Derivative of the Softmax Function](#331-derivative-of-the-softmax-function)
     - [3.3.2. Gradient with Respect to Logits $ z_{i,k} $](#332-gradient-with-respect-to-logits-zik)
     - [3.3.3. Gradient with Respect to Weights $ \mathbf{W} $ and Bias $ \mathbf{b} $](#333-gradient-with-respect-to-weights-w-and-b)
4. [Activation Functions](#4-activation-functions)
   - [4.1. Sigmoid Function](#41-sigmoid-function)
   - [4.2. Softmax Function](#42-softmax-function)
5. [Why Pair Sigmoid with BCE and Softmax with CCE?](#5-why-pair-sigmoid-with-bce-and-softmax-with-cce)
6. [Practical Examples](#6-practical-examples)
   - [6.1. BCE Example: Binary Classification](#61-bce-example-binary-classification)
   - [6.2. CCE Example: Multi-Class Classification](#62-cce-example-multi-class-classification)
7. [Conclusion](#7-conclusion)

---

## 1. Introduction

In machine learning, particularly in classification tasks, **loss functions** play a pivotal role in guiding the training of models. Among these, **Binary Cross-Entropy (BCE)** and **Categorical Cross-Entropy (CCE)** are two fundamental loss functions used for different types of classification problems. Understanding their formulations, associated **activation functions** (including the detailed derivations of the **sigmoid** and **softmax** functions), and gradient computations is essential for effectively training models like logistic regression and neural networks.

---

## 2. Binary Cross-Entropy (BCE)

### 2.1. What is Binary Cross-Entropy?

**Binary Cross-Entropy (BCE)**, also known as **Log Loss**, is a loss function used for **binary classification** tasks. In these tasks, each instance is categorized into one of two classes, often labeled as 0 or 1. BCE measures the discrepancy between the true labels and the predicted probabilities output by the model.

**Key Characteristics:**
- **Binary Classification:** Suited for problems with two distinct classes.
- **Probability Output:** Models typically output a probability representing the likelihood of the instance belonging to the positive class.

### 2.2. Mathematical Formulation

#### **Single Sample BCE Loss**

For a single training instance, the BCE loss is defined as:

$$
L = -\left[ y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right]
$$

Where:
- $ y \in \{0, 1\} $ is the **true label**.
- $ \hat{y} \in (0, 1) $ is the **predicted probability** of the instance belonging to the positive class.

#### **Average BCE Loss Over $ n $ Samples**

For a dataset containing $ n $ samples, the average BCE loss is:

$$
\text{BCE}_{\text{avg}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right]
$$

**Understanding the Components:**
- **First Term ($ y_i \cdot \log(\hat{y}_i) $):** Penalizes the model when it predicts a low probability for the true positive class.
- **Second Term ($ (1 - y_i) \cdot \log(1 - \hat{y}_i) $):** Penalizes the model when it predicts a high probability for the false positive class.

### 2.3. Gradient Derivation

To train a model using gradient-based optimization (like Gradient Descent), we need to compute the gradient of the loss with respect to the model parameters. Let's derive the gradients step by step.

#### 2.3.1. Derivative of the Sigmoid Function

Before diving into the gradient of the BCE loss, it's essential to understand the derivative of the **sigmoid** activation function, as it plays a critical role in the gradient computation.

##### **Sigmoid Function Definition**

The sigmoid function maps any real-valued number into the (0, 1) interval:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

##### **Derivative of the Sigmoid Function**

To compute the gradient of the BCE loss, we'll need the derivative of the sigmoid function with respect to its input $ z $.

**Step-by-Step Derivation:**

1. **Express the Sigmoid Function:**

   $$
   \sigma(z) = \frac{1}{1 + e^{-z}}
   $$

2. **Differentiate $ \sigma(z) $ with Respect to $ z $:**

   $$
   \frac{d\sigma(z)}{dz} = \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right )
   $$

3. **Apply the Chain Rule:**

   Let $ u = 1 + e^{-z} $, so $ \sigma(z) = \frac{1}{u} $.

   $$
   \frac{d\sigma(z)}{dz} = \frac{d\sigma(z)}{du} \cdot \frac{du}{dz}
   $$

4. **Compute Each Derivative:**

   - **Derivative of $ \sigma(z) $ with respect to $ u $:**

     $$
     \frac{d\sigma(z)}{du} = -\frac{1}{u^2}
     $$

   - **Derivative of $ u $ with respect to $ z $:**

     $$
     \frac{du}{dz} = \frac{d}{dz} \left( 1 + e^{-z} \right ) = -e^{-z}
     $$

5. **Combine Using the Chain Rule:**

   $$
   \frac{d\sigma(z)}{dz} = -\frac{1}{u^2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}
   $$

6. **Express in Terms of $ \sigma(z) $:**

   Notice that:

   $$
   \sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{and} \quad 1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}
   $$

   Therefore:

   $$
   \frac{d\sigma(z)}{dz} = \sigma(z) \cdot (1 - \sigma(z))
   $$

**Final Expression:**

$$
\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))
$$

**Interpretation:**
- The derivative $ \sigma'(z) $ represents how sensitive the sigmoid function's output is to changes in the input $ z $.
- **Properties:**
  - The derivative is maximum at $ z = 0 $, where $ \sigma(z) = 0.5 $, resulting in $ \sigma'(0) = 0.25 $.
  - As $ z $ moves away from 0 (either positively or negatively), $ \sigma'(z) $ decreases, leading to the **vanishing gradient** problem in deep networks.

#### 2.3.2. Gradient with Respect to Logit $ z $

**Objective:** Compute $ \frac{\partial L}{\partial z} $, where $ z $ is the logit (the input to the activation function).

**Model Setup:**
- **Logistic Regression Model:**

  $$
  z = \mathbf{x}^\top \mathbf{w} + b
  $$

  Where:
  - $ \mathbf{x} $ is the input feature vector.
  - $ \mathbf{w} $ is the weight vector.
  - $ b $ is the bias term.

- **Sigmoid Activation Function:**

  $$
  \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
  $$

**Step-by-Step Derivation:**

1. **Define the Loss Function for a Single Sample:**

   $$
   L = -\left[ y \cdot \log(\sigma(z)) + (1 - y) \cdot \log(1 - \sigma(z)) \right]
   $$

2. **Compute the Derivative of $ L $ with Respect to $ z $:**

   $$
   \frac{dL}{dz} = -\left[ y \cdot \frac{1}{\sigma(z)} \cdot \sigma'(z) + (1 - y) \cdot \frac{-1}{1 - \sigma(z)} \cdot \sigma'(z) \right]
   $$

   **Explanation:**
   - **First Term ($ y \cdot \log(\sigma(z)) $):**

     $$
     \frac{\partial}{\partial z} \left( y \cdot \log(\sigma(z)) \right) = y \cdot \frac{1}{\sigma(z)} \cdot \sigma'(z)
     $$

   - **Second Term ($ (1 - y) \cdot \log(1 - \sigma(z)) $):**

     $$
     \frac{\partial}{\partial z} \left( (1 - y) \cdot \log(1 - \sigma(z)) \right) = (1 - y) \cdot \frac{-1}{1 - \sigma(z)} \cdot \sigma'(z)
     $$

3. **Substitute the Sigmoid Derivative ($ \sigma'(z) = \sigma(z) \cdot (1 - \sigma(z)) $) into the Expression:**

   $$
   \frac{dL}{dz} = -\left[ y \cdot \frac{\sigma(z) \cdot (1 - \sigma(z))}{\sigma(z)} + (1 - y) \cdot \frac{-\sigma(z) \cdot (1 - \sigma(z))}{1 - \sigma(z)} \right]
   $$

4. **Simplify the Expression:**

   - **First Term Simplification:**

     $$
     y \cdot \frac{\sigma(z) \cdot (1 - \sigma(z))}{\sigma(z)} = y \cdot (1 - \sigma(z))
     $$

   - **Second Term Simplification:**

     $$
     (1 - y) \cdot \frac{-\sigma(z) \cdot (1 - \sigma(z))}{1 - \sigma(z)} = -(1 - y) \cdot \sigma(z)
     $$

   - **Combine Both Terms:**

     $$
     \frac{dL}{dz} = -\left[ y \cdot (1 - \sigma(z)) - (1 - y) \cdot \sigma(z) \right]
     $$

5. **Factor Out $ \sigma(z) $ and Rearrange Terms:**

   $$
   \frac{dL}{dz} = \sigma(z) - y
   $$

   **Final Gradient Expression:**

   $$
   \frac{dL}{dz} = \hat{y} - y
   $$

   **Interpretation:**
   - **If $ \hat{y} > y $:** The model predicts a higher probability than the true label. The gradient is positive, indicating the need to **decrease** $ z $ to reduce $ \hat{y} $.
   - **If $ \hat{y} < y $:** The model predicts a lower probability than the true label. The gradient is negative, indicating the need to **increase** $ z $ to boost $ \hat{y} $.
   - **If $ \hat{y} = y $:** The gradient is zero, indicating no adjustment is needed.

#### 2.3.3. Gradient with Respect to Weights $ \mathbf{w} $ and Bias $ b $

**Objective:** Derive the gradients of the BCE loss with respect to the model's weights $ \mathbf{w} $ and bias $ b $.

**Using the Chain Rule:**

$$
\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} \quad \text{and} \quad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b}
$$

Where:
- $ \frac{\partial L}{\partial z} = \hat{y} - y $
- $ \frac{\partial z}{\partial \mathbf{w}} = \mathbf{x} $
- $ \frac{\partial z}{\partial b} = 1 $

**Step-by-Step Derivation:**

1. **Define $ z $ in Terms of $ \mathbf{w} $ and $ b $:**

   $$
   z = \mathbf{x}^\top \mathbf{w} + b
   $$

   Where:
   - $ \mathbf{x} $ is the input feature vector.
   - $ \mathbf{w} $ is the weight vector.
   - $ b $ is the bias term.

2. **Compute the Partial Derivative of $ z $ with Respect to $ \mathbf{w} $ and $ b $:**

   - **With Respect to $ \mathbf{w} $:**

     $$
     \frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}
     $$

     **Explanation:** Since $ z $ is a linear combination of weights and inputs, the derivative with respect to each weight $ w_j $ is the corresponding input feature $ x_j $.

   - **With Respect to $ b $:**

     $$
     \frac{\partial z}{\partial b} = 1
     $$

     **Explanation:** The bias term $ b $ directly adds to $ z $, so its derivative is 1.

3. **Apply the Chain Rule to Compute the Gradients:**

   - **Gradient with Respect to $ \mathbf{w} $:**

     $$
     \frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} = (\hat{y} - y) \cdot \mathbf{x}
     $$

   - **Gradient with Respect to $ b $:**

     $$
     \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b} = (\hat{y} - y) \cdot 1 = \hat{y} - y
     $$

4. **Gradient Expressions for $ n $ Samples:**

   When dealing with a batch of $ n $ samples, the gradients are averaged over all samples to ensure scale-invariant updates.

   - **Gradient with Respect to Weights $ \mathbf{w} $:**

     $$
     \frac{\partial \text{BCE}_{\text{avg}}}{\partial \mathbf{w}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot \mathbf{x}_i = \frac{1}{n} \mathbf{X}^\top (\hat{\mathbf{y}} - \mathbf{y})
     $$

     Where:
     - $ \mathbf{X} $ is the input matrix of shape $ [n, d] $.
     - $ \hat{\mathbf{y}} $ is the predicted probabilities vector of shape $ [n] $.
     - $ \mathbf{y} $ is the true labels vector of shape $ [n] $.

   - **Gradient with Respect to Bias $ b $:**

     $$
     \frac{\partial \text{BCE}_{\text{avg}}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) = \frac{1}{n} (\hat{\mathbf{y}} - \mathbf{y})^\top \mathbf{1}
     $$

     Where $ \mathbf{1} $ is a vector of ones of shape $ [n] $.

**Summary for BCE:**
- **Gradient with Respect to Logit $ z $:** $ \frac{dL}{dz} = \hat{y} - y $
- **Gradient with Respect to Weights $ \mathbf{w} $:** $ \frac{\partial L}{\partial \mathbf{w}} = (\hat{y} - y) \cdot \mathbf{x} $
- **Gradient with Respect to Bias $ b $:** $ \frac{\partial L}{\partial b} = \hat{y} - y $

---

## 3. Categorical Cross-Entropy (CCE)

### 3.1. What is Categorical Cross-Entropy?

**Categorical Cross-Entropy (CCE)** is a loss function used for **multi-class classification** tasks, where each instance is assigned to one of three or more classes. Unlike binary classification, multi-class classification involves more complexity due to the increased number of classes and the mutual exclusivity of class assignments.

**Key Characteristics:**
- **Multi-Class Classification:** Suited for problems with three or more distinct classes.
- **Probability Distribution Output:** Models output a probability distribution over all possible classes.

### 3.2. Mathematical Formulation

#### **Single Sample CCE Loss**

For a single training instance, the CCE loss is defined as:

$$
L = -\sum_{k=1}^{C} y_k \cdot \log(\hat{y}_k)
$$

Where:
- $ C $ is the number of classes.
- $ y_k \in \{0, 1\} $ is the **true label** in one-hot encoded form (only one $ y_k = 1 $, the rest are 0).
- $ \hat{y}_k \in (0, 1) $ is the **predicted probability** for class $ k $, obtained via the **softmax** activation function.

#### **Average CCE Loss Over $ n $ Samples**

For a dataset containing $ n $ samples, the average CCE loss is:

$$
\text{CCE}_{\text{avg}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{C} y_{i,k} \cdot \log(\hat{y}_{i,k})
$$

**Understanding the Components:**
- **Summation Over Classes:** Ensures that only the probability assigned to the true class contributes to the loss.
- **Logarithmic Penalty:** Penalizes the model more severely when the predicted probability for the true class is low.

### 3.3. Gradient Derivation

To train a multi-class classification model using gradient-based optimization, we need to derive the gradients of the CCE loss with respect to the model parameters. Let's proceed with a meticulous step-by-step derivation.

#### 3.3.1. Derivative of the Softmax Function

Before diving into the gradient of the CCE loss, it's essential to understand the derivative of the **softmax** activation function, as it plays a critical role in the gradient computation.

##### **Softmax Function Definition**

The softmax function converts a vector of raw scores (logits) into a probability distribution. For a vector $ \mathbf{z} = [z_1, z_2, \ldots, z_C] $, the softmax function is defined as:

$$
\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}} \quad \text{for } k = 1, 2, \ldots, C
$$

Where:
- $ z_k $ is the logit for class $ k $.
- $ C $ is the total number of classes.

##### **Derivative of the Softmax Function**

The derivative of the softmax function is more involved than the sigmoid function due to the interdependence of class probabilities. The derivative is represented by the **Jacobian matrix**, which contains all first-order partial derivatives of the softmax outputs with respect to the inputs.

**Step-by-Step Derivation:**

1. **Express the Softmax Function:**

   $$
   \text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}
   $$

2. **Differentiate $ \text{softmax}(z)_k $ with Respect to $ z_i $:**

   $$
   \frac{\partial \text{softmax}(z)_k}{\partial z_i} = \frac{\partial}{\partial z_i} \left( \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}} \right )
   $$

3. **Apply the Quotient Rule:**

   Let $ u = e^{z_k} $ and $ v = \sum_{j=1}^{C} e^{z_j} $, so:

   $$
   \frac{\partial \text{softmax}(z)_k}{\partial z_i} = \frac{v \cdot \frac{\partial u}{\partial z_i} - u \cdot \frac{\partial v}{\partial z_i}}{v^2}
   $$

4. **Compute the Partial Derivatives:**

   - **For $ i = k $:**

     $$
     \frac{\partial u}{\partial z_i} = \frac{\partial e^{z_k}}{\partial z_k} = e^{z_k} = u
     $$

     $$
     \frac{\partial v}{\partial z_i} = \frac{\partial}{\partial z_k} \left( \sum_{j=1}^{C} e^{z_j} \right ) = e^{z_k} = u
     $$

   - **For $ i \neq k $:**

     $$
     \frac{\partial u}{\partial z_i} = \frac{\partial e^{z_k}}{\partial z_i} = 0
     $$

     $$
     \frac{\partial v}{\partial z_i} = \frac{\partial}{\partial z_i} \left( \sum_{j=1}^{C} e^{z_j} \right ) = e^{z_i}
     $$

5. **Substitute Back into the Quotient Rule:**

   - **For $ i = k $:**

     $$
     \frac{\partial \text{softmax}(z)_k}{\partial z_k} = \frac{v \cdot u - u \cdot u}{v^2} = \frac{u(v - u)}{v^2} = \frac{u}{v} \cdot \left(1 - \frac{u}{v}\right ) = \text{softmax}(z)_k \cdot (1 - \text{softmax}(z)_k)
     $$

   - **For $ i \neq k $:**

     $$
     \frac{\partial \text{softmax}(z)_k}{\partial z_i} = \frac{v \cdot 0 - u \cdot e^{z_i}}{v^2} = -\frac{u \cdot e^{z_i}}{v^2} = -\frac{e^{z_k}}{v} \cdot \frac{e^{z_i}}{v} = -\text{softmax}(z)_k \cdot \text{softmax}(z)_i
     $$

6. **Combine the Results:**

   $$
   \frac{\partial \text{softmax}(z)_k}{\partial z_i} = 
   \begin{cases}
   \text{softmax}(z)_k \cdot (1 - \text{softmax}(z)_k) & \text{if } i = k \\
   -\text{softmax}(z)_k \cdot \text{softmax}(z)_i & \text{if } i \neq k
   \end{cases}
   $$

7. **Expressing the Jacobian Matrix:**

   The Jacobian matrix $ J $ for the softmax function is given by:

   $$
   J_{k,i} = \frac{\partial \text{softmax}(z)_k}{\partial z_i} = \text{softmax}(z)_k \cdot (\delta_{ki} - \text{softmax}(z)_i)
   $$

   Where $ \delta_{ki} $ is the Kronecker delta, which is 1 if $ k = i $ and 0 otherwise.

**Final Expression:**

$$
\frac{\partial \text{softmax}(z)_k}{\partial z_i} = \text{softmax}(z)_k \cdot (\delta_{ki} - \text{softmax}(z)_i)
$$

**Interpretation:**
- **Diagonal Elements ($ k = i $):** Represent the derivative of the probability of class $ k $ with respect to its own logit $ z_k $. This term is always positive and less than 1.
- **Off-Diagonal Elements ($ k \neq i $):** Represent the derivative of the probability of class $ k $ with respect to the logit of a different class $ z_i $. These terms are always negative, reflecting the mutual exclusivity enforced by softmax.

#### 3.3.2. Gradient with Respect to Logits $ z_{i,k} $

**Objective:** Compute $ \frac{\partial L}{\partial z_{i,k}} $, where $ z_{i,k} $ is the logit for class $ k $ in sample $ i $.

**Model Setup:**
- **Softmax Activation Function:**

  $$
  \hat{y}_{i,k} = \text{softmax}(z_{i,k}) = \frac{e^{z_{i,k}}}{\sum_{j=1}^{C} e^{z_{i,j}}}
  $$

- **Loss Function:**

  $$
  L = -\sum_{k=1}^{C} y_k \cdot \log(\hat{y}_k)
  $$

  Where:
  - $ y_k $ is 1 if the true class for sample $ i $ is $ k $, else 0 (one-hot encoding).

**Step-by-Step Derivation:**

1. **Define the Loss Function for a Single Sample:**

   $$
   L_i = -\sum_{k=1}^{C} y_{i,k} \cdot \log(\hat{y}_{i,k})
   $$

2. **Compute the Derivative of $ L_i $ with Respect to $ z_{i,k} $:**

   $$
   \frac{\partial L_i}{\partial z_{i,k}} = -\sum_{j=1}^{C} y_{i,j} \cdot \frac{\partial}{\partial z_{i,k}} \log(\hat{y}_{i,j})
   $$

   **Explanation:**
   - The loss depends on all logits through the softmax function, but due to one-hot encoding, only the term corresponding to the true class contributes directly.

3. **Differentiate $ \log(\hat{y}_{i,j}) $ with Respect to $ z_{i,k} $:**

   - **For $ j = k $:**

     $$
     \frac{\partial}{\partial z_{i,k}} \log(\hat{y}_{i,k}) = \frac{1}{\hat{y}_{i,k}} \cdot \frac{\partial \hat{y}_{i,k}}{\partial z_{i,k}} = \frac{1}{\hat{y}_{i,k}} \cdot \hat{y}_{i,k} \cdot (1 - \hat{y}_{i,k}) = 1 - \hat{y}_{i,k}
     $$

   - **For $ j \neq k $:**

     $$
     \frac{\partial}{\partial z_{i,k}} \log(\hat{y}_{i,j}) = \frac{1}{\hat{y}_{i,j}} \cdot \frac{\partial \hat{y}_{i,j}}{\partial z_{i,k}} = \frac{1}{\hat{y}_{i,j}} \cdot (-\hat{y}_{i,j} \cdot \hat{y}_{i,k}) = -\hat{y}_{i,k}
     $$

4. **Substitute Back into the Loss Derivative:**

   $$
   \frac{\partial L_i}{\partial z_{i,k}} = -\left[ y_{i,k} \cdot (1 - \hat{y}_{i,k}) + \sum_{j \neq k} y_{i,j} \cdot (-\hat{y}_{i,k}) \right]
   $$

   **Simplification:**
   - Due to one-hot encoding, $ y_{i,j} = 0 $ for $ j \neq k $, so the summation term vanishes.

   $$
   \frac{\partial L_i}{\partial z_{i,k}} = -\left[ y_{i,k} \cdot (1 - \hat{y}_{i,k}) \right] = \hat{y}_{i,k} - y_{i,k}
   $$

   **Final Gradient Expression:**

   $$
   \frac{\partial L_i}{\partial z_{i,k}} = \hat{y}_{i,k} - y_{i,k}
   $$

   **Interpretation:**
   - **If $ \hat{y}_{i,k} > y_{i,k} $:** The model assigns a higher probability to class $ k $ than the true label. The gradient is positive, indicating the need to **decrease** $ z_{i,k} $.
   - **If $ \hat{y}_{i,k} < y_{i,k} $:** The model assigns a lower probability to class $ k $ than the true label. The gradient is negative, indicating the need to **increase** $ z_{i,k} $.
   - **If $ \hat{y}_{i,k} = y_{i,k} $:** The gradient is zero, indicating no adjustment is needed.

#### 3.3.3. Gradient with Respect to Weights $ \mathbf{W} $ and Bias $ \mathbf{b} $

**Objective:** Derive the gradients of the CCE loss with respect to the model's weight matrix $ \mathbf{W} $ and bias vector $ \mathbf{b} $.

**Using the Chain Rule:**

$$
\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{W}} \quad \text{and} \quad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{b}}
$$

Where:
- $ \frac{\partial L}{\partial \mathbf{z}} $ is the matrix of gradients with respect to each logit $ z_{i,k} $.

**Step-by-Step Derivation:**

1. **Define the Logit Matrix:**

   $$
   \mathbf{Z} = \mathbf{X} \mathbf{W} + \mathbf{b}
   $$

   Where:
   - $ \mathbf{X} \in \mathbb{R}^{n \times d} $ is the input feature matrix.
   - $ \mathbf{W} \in \mathbb{R}^{d \times C} $ is the weight matrix.
   - $ \mathbf{b} \in \mathbb{R}^{C} $ is the bias vector.

2. **Compute the Gradient of the Loss with Respect to the Logit Matrix $ \mathbf{Z} $:**

   From the previous section:

   $$
   \frac{\partial L_i}{\partial z_{i,k}} = \hat{y}_{i,k} - y_{i,k}
   $$

   This can be represented in matrix form as:

   $$
   \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{Z}} = \frac{1}{n} (\hat{\mathbf{Y}} - \mathbf{Y})
   $$

   Where:
   - $ \hat{\mathbf{Y}} \in \mathbb{R}^{n \times C} $ is the matrix of predicted probabilities.
   - $ \mathbf{Y} \in \mathbb{R}^{n \times C} $ is the matrix of true labels.

3. **Compute the Gradient with Respect to Weights $ \mathbf{W} $:**

   $$
   \frac{\partial \mathbf{Z}}{\partial \mathbf{W}} = \mathbf{X}
   $$

   **Explanation:**
   - Each element $ z_{i,k} = \mathbf{x}_i^\top \mathbf{w}_k + b_k $.
   - Differentiating $ z_{i,k} $ with respect to $ \mathbf{w}_k $ yields $ \mathbf{x}_i $.

   Therefore, the gradient with respect to $ \mathbf{W} $ is:

   $$
   \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{W}} = \mathbf{X}^\top \cdot \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{Z}} = \frac{1}{n} \mathbf{X}^\top (\hat{\mathbf{Y}} - \mathbf{Y})
   $$

4. **Compute the Gradient with Respect to Bias $ \mathbf{b} $:**

   $$
   \frac{\partial \mathbf{Z}}{\partial \mathbf{b}} = \mathbf{1}
   $$

   Where $ \mathbf{1} $ is a vector of ones of shape $ [n] $.

   Therefore, the gradient with respect to $ \mathbf{b} $ is:

   $$
   \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{b}} = \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{Z}} \cdot \mathbf{1} = \frac{1}{n} (\hat{\mathbf{Y}} - \mathbf{Y})^\top \mathbf{1}
   $$

   **Interpretation:**
   - Each element of the bias gradient is the average of $ \hat{y}_{i,k} - y_{i,k} $ across all samples $ i $ for class $ k $.

**Summary for CCE:**
- **Gradient with Respect to Logit $ z_{i,k} $:** $ \frac{\partial L_i}{\partial z_{i,k}} = \hat{y}_{i,k} - y_{i,k} $
- **Gradient with Respect to Weights $ \mathbf{W} $:** $ \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{W}} = \frac{1}{n} \mathbf{X}^\top (\hat{\mathbf{Y}} - \mathbf{Y}) $
- **Gradient with Respect to Bias $ \mathbf{b} $:** $ \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{b}} = \frac{1}{n} (\hat{\mathbf{Y}} - \mathbf{Y})^\top \mathbf{1} $

---

## 4. Activation Functions

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. As established, **Sigmoid** and **Softmax** are the go-to activation functions for BCE and CCE, respectively. Let's revisit their definitions and properties to reinforce understanding.

### 4.1. Sigmoid Function

**Definition:**

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

**Properties:**
- **Range:** (0, 1)
- **Monotonic:** Always increasing.
- **Derivative:**

  $$
  \sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))
  $$

**Usage:**
- **Binary Classification:** Provides a single probability output representing the likelihood of the positive class.

**Advantages:**
- **Probabilistic Interpretation:** Outputs can be directly interpreted as probabilities.
- **Smooth Gradient:** Facilitates gradient-based optimization.

**Disadvantages:**
- **Vanishing Gradient Problem:** Gradients can become very small for large positive or negative inputs, slowing down learning.
- **Output Not Zero-Centered:** Can cause issues in optimization dynamics.

### 4.2. Softmax Function

**Definition:**

$$
\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}} \quad \text{for } k = 1, 2, \ldots, C
$$

Where:
- $ z_k $ is the logit for class $ k $.
- $ C $ is the total number of classes.

**Properties:**
- **Range:** (0, 1) for each class $ k $.
- **Sum to One:** $ \sum_{k=1}^{C} \text{softmax}(z)_k = 1 $.
- **Derivative:**

  $$
  \frac{\partial \text{softmax}(z)_k}{\partial z_i} = \text{softmax}(z)_k \cdot (\delta_{ki} - \text{softmax}(z)_i)
  $$

  Where $ \delta_{ki} $ is the Kronecker delta, which is 1 if $ k = i $ and 0 otherwise.

**Usage:**
- **Multi-Class Classification:** Provides a probability distribution over multiple classes, ensuring that the sum of probabilities across classes is 1.

**Advantages:**
- **Probability Distribution:** Ensures outputs form a valid probability distribution.
- **Mutual Exclusivity:** Suitable for tasks where each instance belongs to only one class.

**Disadvantages:**
- **Computationally Intensive:** Requires computing exponentials for all classes.
- **Sensitive to Input Scaling:** Large input values can cause numerical instability.

---

## 5. Why Pair Sigmoid with BCE and Softmax with CCE?

The pairing of activation functions with loss functions is not arbitrary; it's driven by the mathematical properties and requirements of the classification tasks.

### 5.1. BCE with Sigmoid

**Reasoning:**

- **Single Probability Output:** Binary classification requires predicting the probability of an instance belonging to one of two classes. The sigmoid function naturally outputs a single probability value in the (0, 1) range.
  
- **Loss Function Alignment:** BCE is designed to measure the discrepancy between true binary labels and predicted probabilities. The sigmoid activation provides probabilities that BCE expects.
  
- **Gradient Compatibility:** The gradient derivation for BCE with sigmoid results in simple and effective updates ($ \hat{y} - y $), facilitating efficient learning.

**Conclusion:**

- **Task Compatibility:** Perfectly aligns with binary classification needs.
- **Mathematical Alignment:** Ensures loss and gradient computations are coherent and effective for optimization.

### 5.2. CCE with Softmax

**Reasoning:**

- **Probability Distribution Over Classes:** Multi-class classification involves predicting the probability distribution across multiple classes. Softmax converts logits into a probability distribution where all probabilities sum to one.
  
- **Mutual Exclusivity:** In multi-class classification, classes are mutually exclusive; an instance can belong to only one class. Softmax inherently enforces this exclusivity by distributing the probability mass among classes.
  
- **Loss Function Alignment:** CCE computes the loss based on the probability assigned to the true class, making softmax's distribution an excellent fit.

- **Gradient Compatibility:** The gradient derivation for CCE with softmax results in $ \hat{y}_k - y_k $ for each class $ k $, which effectively guides the model to adjust logits to better match true labels.

**Conclusion:**

- **Task Compatibility:** Ideal for multi-class classification scenarios.
- **Mathematical Alignment:** Ensures loss and gradient computations are coherent and effective for optimization.

---

## 6. Practical Examples

To solidify the understanding of BCE and CCE, let's walk through concrete examples demonstrating their application and gradient computations.

### 6.1. BCE Example: Binary Classification

**Scenario:** Email Spam Detection (Spam vs. Not Spam)

**Objective:** Train a logistic regression model to classify emails as spam (1) or not spam (0) based on input features.

**Model Setup:**
- **Input Features ($ \mathbf{x} $):** Vector representing email characteristics (e.g., word counts, presence of certain keywords).
- **Weights ($ \mathbf{w} $):** Weight vector to be learned.
- **Bias ($ b $):** Bias term to be learned.
- **Logit ($ z $):**

  $$
  z = \mathbf{x}^\top \mathbf{w} + b
  $$

- **Activation Function:** Sigmoid, producing $ \hat{y} = \sigma(z) $.
- **Loss Function:** BCE, $ L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})] $.

**Sample Computation:**

1. **Given:**
   - **True Label:** $ y = 1 $ (Spam)
   - **Input Features:** $ \mathbf{x} = [2, 3] $
   - **Initial Weights:** $ \mathbf{w} = [0.5, -0.5] $
   - **Bias:** $ b = 0 $
   - **Learning Rate ($ \eta $):** 0.1

2. **Compute Logit ($ z $):**

   $$
   z = \mathbf{x}^\top \mathbf{w} + b = (2 \times 0.5) + (3 \times -0.5) + 0 = 1 - 1.5 = -0.5
   $$

3. **Compute Predicted Probability ($ \hat{y} $):**

   $$
   \hat{y} = \sigma(z) = \frac{1}{1 + e^{-(-0.5)}} = \frac{1}{1 + e^{0.5}} \approx 0.3775
   $$

4. **Compute BCE Loss:**

   $$
   L = -\left[ 1 \cdot \log(0.3775) + 0 \cdot \log(1 - 0.3775) \right] = -\log(0.3775) \approx 0.974
   $$

5. **Compute Gradient with Respect to $ z $:**

   $$
   \frac{dL}{dz} = \hat{y} - y = 0.3775 - 1 = -0.6225
   $$

6. **Compute Gradient with Respect to Weights $ \mathbf{w} $:**

   $$
   \frac{\partial L}{\partial \mathbf{w}} = (\hat{y} - y) \cdot \mathbf{x} = (-0.6225) \cdot [2, 3] = [-1.245, -1.8675]
   $$

7. **Compute Gradient with Respect to Bias $ b $:**

   $$
   \frac{\partial L}{\partial b} = \hat{y} - y = -0.6225
   $$

8. **Update Weights and Bias Using Gradient Descent:**

   - **Weights Update:**

     $$
     \mathbf{w}_{\text{new}} = \mathbf{w} - \eta \cdot \frac{\partial L}{\partial \mathbf{w}} = [0.5, -0.5] - 0.1 \cdot [-1.245, -1.8675] = [0.5 + 0.1245, -0.5 + 0.18675] = [0.6245, -0.31325]
     $$

   - **Bias Update:**

     $$
     b_{\text{new}} = b - \eta \cdot \frac{\partial L}{\partial b} = 0 - 0.1 \cdot (-0.6225) = 0 + 0.06225 = 0.06225
     $$

**Interpretation:**
- The weights have been adjusted to increase the predicted probability $ \hat{y} $ closer to the true label $ y = 1 $, thereby reducing the loss in subsequent iterations.

### 6.2. CCE Example: Multi-Class Classification

**Scenario:** Handwritten Digit Recognition (Digits 0-3 for Simplicity)

**Objective:** Train a neural network to classify images of handwritten digits into one of four classes (0, 1, 2, 3).

**Model Setup:**
- **Input Features ($ \mathbf{x} $):** Vector representing image pixel intensities.
- **Weight Matrix ($ \mathbf{W} $):** Matrix to be learned, with one column per class.
- **Bias Vector ($ \mathbf{b} $):** Vector to be learned, with one element per class.
- **Logit Matrix ($ \mathbf{Z} $):**

  $$
  \mathbf{Z} = \mathbf{X} \mathbf{W} + \mathbf{b}
  $$

  Where:
  - $ \mathbf{X} \in \mathbb{R}^{n \times d} $ is the input feature matrix.
  - $ \mathbf{W} \in \mathbb{R}^{d \times C} $ is the weight matrix.
  - $ \mathbf{b} \in \mathbb{R}^{C} $ is the bias vector.

- **Activation Function:** Softmax, producing $ \hat{\mathbf{Y}} = \text{softmax}(\mathbf{Z}) $.
- **Loss Function:** CCE, $ L = -\sum_{k=1}^{C} y_k \cdot \log(\hat{y}_k) $.

**Sample Computation:**

1. **Given:**
   - **True Label:** Digit 3 (One-hot encoded as $ \mathbf{y} = [0, 0, 0, 1] $)
   - **Input Features:** $ \mathbf{x} = [2, 1, 3] $ (for simplicity)
   - **Initial Weight Matrix ($ \mathbf{W} $):**

     $$
     \mathbf{W} = \begin{bmatrix}
     0.1 & 0.2 & 0.3 & 0.4 \\
     0.5 & 0.6 & 0.7 & 0.8 \\
     0.9 & 1.0 & 1.1 & 1.2 \\
     \end{bmatrix}
     $$

   - **Initial Bias Vector ($ \mathbf{b} $):** $ [0, 0, 0, 0] $
   - **Learning Rate ($ \eta $):** 0.1

2. **Compute Logit Vector ($ \mathbf{z} $):**

   $$
   \mathbf{z} = \mathbf{x}^\top \mathbf{W} + \mathbf{b} = [2 \times 0.1 + 1 \times 0.5 + 3 \times 0.9, \, 2 \times 0.2 + 1 \times 0.6 + 3 \times 1.0, \, 2 \times 0.3 + 1 \times 0.7 + 3 \times 1.1, \, 2 \times 0.4 + 1 \times 0.8 + 3 \times 1.2]
   $$

   $$
   \mathbf{z} = [0.2 + 0.5 + 2.7, \, 0.4 + 0.6 + 3.0, \, 0.6 + 0.7 + 3.3, \, 0.8 + 0.8 + 3.6] = [3.4, 4.0, 4.6, 5.2]
   $$

3. **Apply Softmax Activation to Compute $ \hat{\mathbf{Y}} $:**

   $$
   \hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{4} e^{z_j}} \quad \text{for } k = 1, 2, 3, 4
   $$

   - **Compute Exponentials:**

     $$
     e^{z} = [e^{3.4}, e^{4.0}, e^{4.6}, e^{5.2}] \approx [29.9641, 54.5982, 99.4843, 181.2725]
     $$

   - **Compute Sum of Exponentials:**

     $$
     S = 29.9641 + 54.5982 + 99.4843 + 181.2725 \approx 365.3181
     $$

   - **Compute Softmax Outputs:**

     $$
     \hat{\mathbf{Y}} = \left[ \frac{29.9641}{365.3181}, \frac{54.5982}{365.3181}, \frac{99.4843}{365.3181}, \frac{181.2725}{365.3181} \right] \approx [0.0819, 0.1492, 0.2724, 0.4965]
     $$

4. **Compute CCE Loss:**

   $$
   L = -\sum_{k=1}^{4} y_k \cdot \log(\hat{y}_k) = -\log(\hat{y}_4) = -\log(0.4965) \approx 0.700
   $$

5. **Compute Gradient with Respect to Logits $ z_k $:**

   $$
   \frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
   $$

   Specifically:
   - **For $ k = 4 $ (true class):**

     $$
     \frac{\partial L}{\partial z_4} = \hat{y}_4 - y_4 = 0.4965 - 1 = -0.5035
     $$

   - **For $ k = 1, 2, 3 $ (other classes):**

     $$
     \frac{\partial L}{\partial z_k} = \hat{y}_k - y_k = \hat{y}_k - 0 = \hat{y}_k
     $$

     So:
     $$
     \frac{\partial L}{\partial z_1} = 0.0819, \quad \frac{\partial L}{\partial z_2} = 0.1492, \quad \frac{\partial L}{\partial z_3} = 0.2724
     $$

6. **Compute Gradient with Respect to Weight Matrix $ \mathbf{W} $:**

   $$
   \frac{\partial \text{CCE}_{\text{avg}}}{\partial \mathbf{W}} = \frac{1}{n} \mathbf{X}^\top (\hat{\mathbf{Y}} - \mathbf{Y})
   $$

   Since $ n = 1 $ in this single-sample example:

   $$
   \frac{\partial \text{CCE}}{\partial \mathbf{W}} = \mathbf{x}^\top (\hat{\mathbf{Y}} - \mathbf{Y}) = \begin{bmatrix} x_1 (\hat{y}_1 - y_1) & x_1 (\hat{y}_2 - y_2) & x_1 (\hat{y}_3 - y_3) & x_1 (\hat{y}_4 - y_4) \\ x_2 (\hat{y}_1 - y_1) & x_2 (\hat{y}_2 - y_2) & x_2 (\hat{y}_3 - y_3) & x_2 (\hat{y}_4 - y_4) \\ x_3 (\hat{y}_1 - y_1) & x_3 (\hat{y}_2 - y_2) & x_3 (\hat{y}_3 - y_3) & x_3 (\hat{y}_4 - y_4) \end{bmatrix}
   $$

   Plugging in the values:

   $$
   \mathbf{x} = [2, 1, 3], \quad \hat{\mathbf{Y}} = [0.0819, 0.1492, 0.2724, 0.4965], \quad \mathbf{Y} = [0, 0, 0, 1]
   $$

   $$
   \hat{\mathbf{Y}} - \mathbf{Y} = [0.0819, 0.1492, 0.2724, -0.5035]
   $$

   $$
   \frac{\partial \text{CCE}}{\partial \mathbf{W}} = \begin{bmatrix}
   2 \times 0.0819 & 2 \times 0.1492 & 2 \times 0.2724 & 2 \times (-0.5035) \\
   1 \times 0.0819 & 1 \times 0.1492 & 1 \times 0.2724 & 1 \times (-0.5035) \\
   3 \times 0.0819 & 3 \times 0.1492 & 3 \times 0.2724 & 3 \times (-0.5035) \\
   \end{bmatrix} = \begin{bmatrix}
   0.1638 & 0.2984 & 0.5448 & -1.007 \\
   0.0819 & 0.1492 & 0.2724 & -0.5035 \\
   0.2457 & 0.4476 & 0.8172 & -1.5105 \\
   \end{bmatrix}
   $$

7. **Compute Gradient with Respect to Bias Vector $ \mathbf{b} $:**

   $$
   \frac{\partial \text{CCE}}{\partial \mathbf{b}} = \hat{\mathbf{Y}} - \mathbf{Y} = [0.0819, 0.1492, 0.2724, -0.5035]
   $$

8. **Update Weights and Bias Using Gradient Descent:**

   - **Weights Update:**

     $$
     \mathbf{W}_{\text{new}} = \mathbf{W} - \eta \cdot \frac{\partial \text{CCE}}{\partial \mathbf{W}} = \begin{bmatrix}
     0.1 & 0.2 & 0.3 & 0.4 \\
     0.5 & 0.6 & 0.7 & 0.8 \\
     0.9 & 1.0 & 1.1 & 1.2 \\
     \end{bmatrix} - 0.1 \cdot \begin{bmatrix}
     0.1638 & 0.2984 & 0.5448 & -1.007 \\
     0.0819 & 0.1492 & 0.2724 & -0.5035 \\
     0.2457 & 0.4476 & 0.8172 & -1.5105 \\
     \end{bmatrix} = \begin{bmatrix}
     0.1 - 0.01638 & 0.2 - 0.02984 & 0.3 - 0.05448 & 0.4 + 0.1007 \\
     0.5 - 0.00819 & 0.6 - 0.01492 & 0.7 - 0.02724 & 0.8 + 0.05035 \\
     0.9 - 0.02457 & 1.0 - 0.04476 & 1.1 - 0.08172 & 1.2 + 0.15105 \\
     \end{bmatrix} \approx \begin{bmatrix}
     0.08362 & 0.17016 & 0.24552 & 0.5007 \\
     0.49181 & 0.58508 & 0.67276 & 0.85035 \\
     0.87543 & 0.95524 & 1.01828 & 1.35105 \\
     \end{bmatrix}
     $$

   - **Bias Update:**

     $$
     \mathbf{b}_{\text{new}} = \mathbf{b} - \eta \cdot \frac{\partial \text{CCE}}{\partial \mathbf{b}} = [0, 0, 0, 0] - 0.1 \cdot [0.0819, 0.1492, 0.2724, -0.5035] = [-0.00819, -0.01492, -0.02724, 0.05035]
     $$

**Interpretation:**
- The weights and biases have been adjusted to increase the predicted probability $ \hat{y}_3 $ for the true class (Digit 3) and decrease the probabilities for other classes, thereby reducing the loss.

---

## 7. Conclusion

Understanding **Binary Cross-Entropy (BCE)** and **Categorical Cross-Entropy (CCE)** loss functions is fundamental for effectively training classification models in machine learning. Here's a summary of key points:

- **Binary Cross-Entropy (BCE):**
  - **Use Case:** Binary classification tasks.
  - **Activation Function:** Sigmoid, providing a single probability output.
  - **Loss Function:** Measures the discrepancy between true binary labels and predicted probabilities.
  - **Gradient Derivation:** Results in $ \hat{y} - y $, guiding weight updates to minimize loss.

- **Categorical Cross-Entropy (CCE):**
  - **Use Case:** Multi-class classification tasks with mutually exclusive classes.
  - **Activation Function:** Softmax, providing a probability distribution over classes.
  - **Loss Function:** Measures the discrepancy between true one-hot labels and predicted probability distributions.
  - **Gradient Derivation:** Results in $ \hat{y}_k - y_k $ for each class $ k $, guiding weight updates to minimize loss.

**Key Takeaways:**

1. **Activation and Loss Function Pairing:**
   - **BCE with Sigmoid:** Ideal for scenarios with two classes, ensuring output probabilities align with loss expectations.
   - **CCE with Softmax:** Perfect for scenarios with multiple classes, ensuring output probabilities form a valid distribution over classes.

2. **Gradient Computations:**
   - Accurate gradient derivations are crucial for effective optimization.
   - Both BCE and CCE gradients involve the difference between predicted probabilities and true labels, enabling models to adjust parameters to better fit the data.

3. **Model Training:**
   - Utilizing these loss functions with appropriate activation functions facilitates efficient and effective learning, driving models to make accurate predictions.

By meticulously understanding the mathematical foundations and practical implementations of BCE and CCE, you can design and train robust classification models tailored to a wide array of machine learning tasks.


### BCE

In [1]:
import numpy as np

# -------------------------------
# 1. Data Preparation
# -------------------------------

# Seed for reproducibility
np.random.seed(42)

# Sample input data: shape [batch_size, seq_length, input_dim]
# Randomly generated float values
X = np.random.randn(2, 3, 5)  # 2 sequences, 3 tokens each, 5 features per token

# Sample targets: shape [batch_size, seq_length]
# Each target is a binary value (0 or 1) representing the true class
targets = np.array([
    [1, 0, 1],  # First sequence's true classes
    [0, 1, 0]   # Second sequence's true classes
])

print("Input Features (X):\n", X)
print("\nTargets:\n", targets)

# -------------------------------
# 2. Parameter Initialization
# -------------------------------

# Parameters
batch_size, seq_length, input_dim = X.shape  # (2, 3, 5)
output_dim = 1  # Binary classification

# Initialize weights (W) and biases (b)
W = np.random.uniform(-0.5, 0.5, (input_dim, output_dim))  # Shape: [5, 1]
b = np.random.uniform(-0.5, 0.5, output_dim)              # Shape: [1]

print("\nInitial Weights (W):\n", W)
print("\nInitial Biases (b):\n", b)

# -------------------------------
# 3. Activation and Loss Functions
# -------------------------------

def sigmoid(logits):
    """
    Applies the sigmoid function to the logits.
    
    Args:
        logits (np.ndarray): Logits array of any shape.
        
    Returns:
        np.ndarray: Sigmoid probabilities of the same shape as logits.
    """
    return 1 / (1 + np.exp(-logits))

def binary_cross_entropy_loss(probs, targets):
    """
    Computes the average binary cross-entropy loss over the batch.
    
    Args:
        probs (np.ndarray): Predicted probabilities of shape [batch_size, seq_length, 1]
        targets (np.ndarray): True binary labels of shape [batch_size, seq_length]
        
    Returns:
        float: Average binary cross-entropy loss
    """
    # Number of samples
    n = batch_size * seq_length
    
    # Clip probabilities to prevent log(0)
    epsilon = 1e-12
    probs = np.clip(probs, epsilon, 1. - epsilon)
    
    # Reshape targets to match probs shape
    targets = targets.reshape(batch_size, seq_length, 1)
    
    # Compute binary cross-entropy loss
    loss = -np.sum(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)) / n
    return loss

def compute_accuracy(probs, targets):
    """
    Computes the accuracy over the batch.
    
    Args:
        probs (np.ndarray): Predicted probabilities of shape [batch_size, seq_length, 1]
        targets (np.ndarray): True binary labels of shape [batch_size, seq_length]
        
    Returns:
        float: Accuracy (between 0 and 1)
    """
    # Convert probabilities to binary predictions
    predictions = (probs >= 0.5).astype(int).reshape(batch_size, seq_length)
    correct = (predictions == targets).astype(float)
    accuracy = np.mean(correct)
    return accuracy

# -------------------------------
# 4. Training Loop
# -------------------------------

# Training parameters
learning_rate = 0.1
epochs = 1000

for epoch in range(1, epochs + 1):
    # ---------------------------
    # Forward Pass
    # ---------------------------
    
    # Compute logits: [batch_size, seq_length, output_dim]
    # X: [batch_size, seq_length, input_dim]
    # W: [input_dim, output_dim]
    # b: [output_dim]
    logits = np.dot(X, W) + b  # Broadcasting b to [batch_size, seq_length, output_dim]
    
    # Compute probabilities using sigmoid
    probs = sigmoid(logits)
    
    # Compute loss
    loss = binary_cross_entropy_loss(probs, targets)
    
    # Compute accuracy
    accuracy = compute_accuracy(probs, targets)
    
    # ---------------------------
    # Backward Pass (Gradient Computation)
    # ---------------------------
    
    # Number of samples
    n = batch_size * seq_length
    
    # Reshape targets to match probs shape
    targets_reshaped = targets.reshape(batch_size, seq_length, 1)
    
    # Gradient of loss w.r.t logits
    dL_dlogits = (probs - targets_reshaped) / n  # Shape: [batch_size, seq_length, 1]
    
    # Gradient w.r.t W: [input_dim, output_dim]
    dL_dW = np.dot(X.reshape(-1, input_dim).T, dL_dlogits.reshape(-1, output_dim))
    
    # Gradient w.r.t b: [output_dim]
    dL_db = np.sum(dL_dlogits, axis=(0,1))
    
    # ---------------------------
    # Parameter Update
    # ---------------------------
    
    W -= learning_rate * dL_dW
    b -= learning_rate * dL_db
    
    # ---------------------------
    # Logging
    # ---------------------------
    
    if epoch % 100 == 0 or epoch == 1:
        print(f"Epoch {epoch:4d}: Loss = {loss:.4f}, Accuracy = {accuracy*100:.2f}%")

# -------------------------------
# 5. Final Evaluation
# -------------------------------

print("\nTraining Complete!")
print(f"Final Loss: {loss:.4f}")
print(f"Final Accuracy: {accuracy*100:.2f}%")


Input Features (X):
 [[[ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]
  [-0.23413696  1.57921282  0.76743473 -0.46947439  0.54256004]
  [-0.46341769 -0.46572975  0.24196227 -1.91328024 -1.72491783]]

 [[-0.56228753 -1.01283112  0.31424733 -0.90802408 -1.4123037 ]
  [ 1.46564877 -0.2257763   0.0675282  -1.42474819 -0.54438272]
  [ 0.11092259 -1.15099358  0.37569802 -0.60063869 -0.29169375]]]

Targets:
 [[1 0 1]
 [0 1 0]]

Initial Weights (W):
 [[ 0.18423303]
 [-0.05984751]
 [-0.37796177]
 [-0.00482309]
 [-0.46561148]]

Initial Biases (b):
 [0.4093204]
Epoch    1: Loss = 0.6569, Accuracy = 66.67%
Epoch  100: Loss = 0.4270, Accuracy = 100.00%
Epoch  200: Loss = 0.3402, Accuracy = 100.00%
Epoch  300: Loss = 0.2867, Accuracy = 100.00%
Epoch  400: Loss = 0.2499, Accuracy = 100.00%
Epoch  500: Loss = 0.2228, Accuracy = 100.00%
Epoch  600: Loss = 0.2018, Accuracy = 100.00%
Epoch  700: Loss = 0.1850, Accuracy = 100.00%
Epoch  800: Loss = 0.1711, Accuracy = 100.00%
Epoch  900: Los

### CCE

Refer to [ML_0047_Gradient_Descent_Variants.ipynb](./data_structures/ML_0047_Gradient_Descent_Variants.ipynb)

In [2]:
import numpy as np

# -------------------------------
# 1. Data Preparation
# -------------------------------

# Seed for reproducibility
np.random.seed(42)

# Sample input data: shape [batch_size, seq_length, input_dim]
# Randomly generated float values
X = np.random.randn(2, 3, 5)  # 2 sequences, 3 tokens each, 5 features per token

# Sample targets: shape [batch_size, seq_length]
# Each target is an integer representing the true class index
targets = np.array([
    [0, 1, 3],  # First sequence's true classes
    [2, 4, 1]   # Second sequence's true classes
])

print("Input Features (X):\n", X)
print("\nTargets:\n", targets)

# -------------------------------
# 2. Parameter Initialization
# -------------------------------

# Parameters
batch_size, seq_length, input_dim = X.shape  # (2, 3, 5)
vocab_size = 5  # Number of classes

# Initialize weights (W) and biases (b)
W = np.random.uniform(-0.5, 0.5, (input_dim, vocab_size))  # Shape: [5, 5]
b = np.random.uniform(-0.5, 0.5, vocab_size)              # Shape: [5]

print("\nInitial Weights (W):\n", W)
print("\nInitial Biases (b):\n", b)

# -------------------------------
# 3. Activation and Loss Functions
# -------------------------------

def softmax(logits):
    """
    Applies the softmax function to the logits.
    
    Args:
        logits (np.ndarray): Logits array of shape (..., vocab_size)
        
    Returns:
        np.ndarray: Softmax probabilities of the same shape as logits
    """
    # For numerical stability, subtract the max logit from each logit
    shifted_logits = logits - np.max(logits, axis=-1, keepdims=True)
    exp_logits = np.exp(shifted_logits)
    sum_exp = np.sum(exp_logits, axis=-1, keepdims=True)
    return exp_logits / sum_exp

def cross_entropy_loss(probs, targets):
    """
    Computes the average cross-entropy loss over the batch.
    
    Args:
        probs (np.ndarray): Predicted probabilities of shape [batch_size, seq_length, vocab_size]
        targets (np.ndarray): True class indices of shape [batch_size, seq_length]
        
    Returns:
        float: Average cross-entropy loss
    """
    # Number of samples
    n = batch_size * seq_length
    
    # Clip probabilities to prevent log(0)
    epsilon = 1e-12
    probs = np.clip(probs, epsilon, 1. - epsilon)
    
    # Create one-hot encoding for targets
    one_hot_targets = np.zeros_like(probs)
    
    # Generate batch_indices and seq_indices
    batch_indices = np.repeat(np.arange(batch_size), seq_length)
    seq_indices = np.tile(np.arange(seq_length), batch_size)
    
    # Flatten targets for indexing
    flat_targets = targets.flatten()
    
    # Assign 1s to the correct class positions
    for b in range(batch_size):
        for s in range(seq_length):
            one_hot_targets[b, s, targets[b, s]] = 1
    # Compute cross-entropy loss
    loss = -np.sum(one_hot_targets * np.log(probs)) / n
    return loss

def compute_accuracy(probs, targets):
    """
    Computes the accuracy over the batch.
    
    Args:
        probs (np.ndarray): Predicted probabilities of shape [batch_size, seq_length, vocab_size]
        targets (np.ndarray): True class indices of shape [batch_size, seq_length]
        
    Returns:
        float: Accuracy (between 0 and 1)
    """
    predictions = np.argmax(probs, axis=-1)
    correct = (predictions == targets).astype(float)
    accuracy = np.mean(correct)
    return accuracy

# -------------------------------
# 4. Training Loop
# -------------------------------

# Training parameters
learning_rate = 0.1
epochs = 1000

for epoch in range(1, epochs + 1):
    # ---------------------------
    # Forward Pass
    # ---------------------------
    
    # Compute logits: [batch_size, seq_length, vocab_size]
    # X: [batch_size, seq_length, input_dim]
    # W: [input_dim, vocab_size]
    # b: [vocab_size]
    logits = np.dot(X, W) + b  # Broadcasting b to [batch_size, seq_length, vocab_size]
    
    # Compute probabilities using softmax
    probs = softmax(logits)
    
    # Compute loss
    loss = cross_entropy_loss(probs, targets)
    
    # Compute accuracy
    accuracy = compute_accuracy(probs, targets)
    
    # ---------------------------
    # Backward Pass (Gradient Computation)
    # ---------------------------
    
    # Number of samples
    n = batch_size * seq_length
    
    # Create one-hot encoding for targets
    one_hot_targets = np.zeros_like(probs)
    
    for b in range(batch_size):
        for s in range(seq_length):
            one_hot_targets[b, s, targets[b, s]] = 1
            
    # Gradient of loss w.r.t logits
    dL_dlogits = (probs - one_hot_targets) / n  # Shape: [batch_size, seq_length, vocab_size]
    
    # Gradient w.r.t W: [input_dim, vocab_size]
    dL_dW = np.dot(X.reshape(-1, input_dim).T, dL_dlogits.reshape(-1, vocab_size))
    
    # Gradient w.r.t b: [vocab_size]
    dL_db = np.sum(dL_dlogits, axis=(0,1))
    
    # ---------------------------
    # Parameter Update
    # ---------------------------
    
    W -= learning_rate * dL_dW
    b -= learning_rate * dL_db
    
    # ---------------------------
    # Logging
    # ---------------------------
    
    if epoch % 100 == 0 or epoch == 1:
        print(f"Epoch {epoch:4d}: Loss = {loss:.4f}, Accuracy = {accuracy*100:.2f}%")

# -------------------------------
# 5. Final Evaluation
# -------------------------------

print("\nTraining Complete!")
print(f"Final Loss: {loss:.4f}")
print(f"Final Accuracy: {accuracy*100:.2f}%")

Input Features (X):
 [[[ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]
  [-0.23413696  1.57921282  0.76743473 -0.46947439  0.54256004]
  [-0.46341769 -0.46572975  0.24196227 -1.91328024 -1.72491783]]

 [[-0.56228753 -1.01283112  0.31424733 -0.90802408 -1.4123037 ]
  [ 1.46564877 -0.2257763   0.0675282  -1.42474819 -0.54438272]
  [ 0.11092259 -1.15099358  0.37569802 -0.60063869 -0.29169375]]]

Targets:
 [[0 1 3]
 [2 4 1]]

Initial Weights (W):
 [[ 0.18423303 -0.05984751 -0.37796177 -0.00482309 -0.46561148]
 [ 0.4093204  -0.24122002  0.16252228 -0.18828892  0.02006802]
 [ 0.04671028 -0.31514554  0.46958463  0.27513282  0.43949894]
 [ 0.39482735  0.09789998  0.42187424 -0.4115075  -0.30401714]
 [-0.45477271 -0.17466967 -0.11132271 -0.22865097  0.32873751]]

Initial Biases (b):
 [-0.14324667 -0.21906549  0.04269608 -0.35907578  0.30219698]
Epoch    1: Loss = 1.6869, Accuracy = 16.67%
Epoch  100: Loss = 0.6148, Accuracy = 83.33%
Epoch  200: Loss = 0.3972, Accuracy = 100.00%
Ep