## Gradient Descent:

Gradient Descent is an optimization algorithm used to train machine learning models by adjusting their parameters (weights and biases) to minimize a given loss function. It is fundamental in machine learning, particularly in training models like linear regression, logistic regression, and neural networks.




### **1. The Goal of Gradient Descent**
The primary objective of gradient descent is to find the parameters of a model that minimize the loss function, which measures the error between predicted outputs and actual outputs. Lower loss indicates a better-performing model.



### **2. The Concept of Gradients**
- A **gradient** is a vector that points in the direction of the steepest increase of a function.
- For gradient descent, we are interested in minimizing a function (the loss), so we move in the direction opposite to the gradient.



### **3. How Gradient Descent Works**
Gradient descent works iteratively to update model parameters. The process can be summarized as:

#### **Step 1: Initialization**
- Randomly initialize the parameters (weights $w$ and bias $b$).

#### **Step 2: Compute the Gradient**
- Calculate the gradient of the loss function with respect to each parameter. This gives the direction and rate of change of the loss.

#### **Step 3: Update Parameters**
- Update the parameters by moving them in the opposite direction of the gradient. The update rule for each parameter is:
  $$
  \theta \gets \theta - \eta \cdot \frac{\partial L}{\partial \theta}
  $$
  Where:
  - $\theta$: The parameter being updated (e.g., weight or bias).
  - $\eta$: The learning rate (step size).
  - $\frac{\partial L}{\partial \theta}$: The gradient of the loss $L$ with respect to $\theta$.

#### **Step 4: Repeat**
- Repeat steps 2 and 3 until convergence (when the loss stops decreasing significantly).
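
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `gradient_descent`, its arguments, and the stopping tolerance `tol` are assumptions introduced here, and the starting point is passed in explicitly instead of being drawn at random.

```python
def gradient_descent(grad, theta, lr=0.1, max_steps=1000, tol=1e-8):
    """Minimize a one-parameter function given its derivative `grad`.

    Hypothetical helper, for illustration only.
    """
    for _ in range(max_steps):
        g = grad(theta)            # Step 2: compute the gradient
        if abs(g) < tol:           # Step 4: stop once the slope is nearly flat
            break
        theta = theta - lr * g     # Step 3: move against the gradient
    return theta


# Example loss (an assumption for this sketch): L(theta) = (theta - 3)^2,
# whose derivative is 2 * (theta - 3) and whose minimum is at theta = 3.
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(theta_min)
```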



### **4. Learning Rate ($\eta$)**
The learning rate is a crucial hyperparameter:
- If $\eta$ is too large, updates may overshoot the minimum, causing divergence.
- If $\eta$ is too small, convergence will be slow.
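
A small experiment makes both failure modes concrete. The sketch below runs 20 updates on the toy loss $f(w) = w^2$ (gradient $2w$) starting from $w = 1$ with three learning rates; the specific values 0.01, 0.1, and 1.1 are chosen here purely for illustration.

```python
def run(lr, w=1.0, steps=20):
    """Apply `steps` gradient descent updates to f(w) = w^2."""
    for _ in range(steps):
        w = w - lr * 2 * w   # the gradient of w^2 is 2w
    return w


w_small = run(0.01)   # too small: still far from 0 after 20 steps
w_good = run(0.1)     # reasonable: close to 0
w_large = run(1.1)    # too large: |w| grows at every step

print(w_small, w_good, w_large)
```

In the last run the sign of $w$ flips on every step while its magnitude grows, which is the overshooting described above.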



### **5. Types of Gradient Descent**
There are three main variants of gradient descent, differing in how they calculate the gradient:

#### (a) **Batch Gradient Descent**
- Uses the entire dataset to compute the gradient in each iteration.
- Pros: Converges steadily.
- Cons: Can be computationally expensive for large datasets.

#### (b) **Stochastic Gradient Descent (SGD)**
- Computes the gradient using only one data point (randomly selected) at a time.
- Pros: Each update is cheap, and the noise in the updates can help the iterates escape shallow local minima.
- Cons: Noisy updates can cause fluctuations around the minimum.

#### (c) **Mini-Batch Gradient Descent**
- Uses a small batch of data points (e.g., 32 or 64) to compute the gradient.
- Pros: Combines the benefits of batch and stochastic gradient descent.



### **6. Gradient Descent in Machine Learning**
In supervised learning:
- **For Linear Regression**: Gradient descent minimizes the Mean Squared Error (MSE) loss.
- **For Logistic Regression**: Gradient descent minimizes the Log Loss (Cross-Entropy Loss).
- **For Neural Networks**: Gradient descent minimizes a loss function using gradients computed by backpropagation.



### **7. Challenges and Solutions**
1. **Local Minima**:
   - For some loss functions, gradient descent might get stuck in a local minimum.
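   - Mitigation: Techniques like momentum or the Adam optimizer can help the iterates move past shallow local minima, though they do not guarantee reaching the global minimum.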

2. **Vanishing/Exploding Gradients**:
   - Gradients may become too small or too large in deep networks.
   - Solution: Use techniques like ReLU activations, batch normalization, or gradient clipping.

3. **Choosing the Right Learning Rate**:
   - Solution: Use learning rate schedules or adaptive learning rate optimizers (e.g., Adam, RMSProp).



### **8. Visual Understanding**
- Imagine a ball rolling down a hill. The slope of the hill represents the gradient. The ball's movement (parameter updates) will eventually lead it to the bottom of the hill (the minimum of the loss function).



### **Mathematical Example**
Let’s minimize the function $f(w) = w^2$ (a simple loss function).

1. Compute the gradient: $\frac{\partial f}{\partial w} = 2w$.
2. Update rule: $w \gets w - \eta \cdot 2w$.
3. If $w = 1$ and $\eta = 0.1$:
   - Gradient: $2 \times 1 = 2$.
   - Update: $w = 1 - 0.1 \times 2 = 0.8$.
   - Repeat: each update multiplies $w$ by $1 - 2\eta = 0.8$, so $w = 0.8^k$ after $k$ updates, which approaches $0$.



### **Key Takeaway**
Gradient Descent iteratively adjusts the parameters of a model to reduce the loss function, enabling the model to make better predictions. Variants (batch, stochastic, mini-batch) and adaptive techniques help it remain efficient across different types of problems.

---

## Mathematical Equations of Gradient Descent:

Let's look closely at the **mathematics of gradient descent**, from the loss function to the update formulas for the parameters. This includes derivatives, gradients, and a step-by-step explanation.



### **1. Loss Function**
The **loss function** $L(\theta)$ quantifies the error of a model's prediction. 

#### Example Loss Functions:
- **For Regression**: Mean Squared Error (MSE):
  $$
  L(\theta) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right)^2
  $$
  where:
  - $m$: Number of training samples.
  - $y_i$: True value of the $i$-th sample.
  - $\hat{y}_i$: Predicted value of the $i$-th sample.
  
- **For Classification**: Binary Cross-Entropy (Log Loss):
  $$
  L(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
  $$

Here, $\theta$ represents the parameters (weights and biases) of the model. The goal is to minimize $L(\theta)$.



### **2. Gradient Calculation**
To minimize the loss function, we compute the **gradient** of the loss function with respect to the model parameters, denoted as:
$$
\nabla_\theta L(\theta) = \left[ \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots, \frac{\partial L}{\partial \theta_n} \right]
$$

#### General Form of Gradient:
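For any function $f(\theta)$, the gradient points in the direction of the steepest increase. Each component is the partial derivative along one coordinate, defined by the limit below, where $e_j$ is the unit vector for coordinate $j$:
$$
\frac{\partial f(\theta)}{\partial \theta_j} = \lim_{h \to 0} \frac{f(\theta + h e_j) - f(\theta)}{h}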
$$
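
The limit definition also gives a practical way to sanity-check a hand-derived gradient: evaluate the difference quotient with a small but finite $h$. The sketch below does this for $f(w) = w^2$; the helper name `numerical_derivative` and the step `h = 1e-6` are assumptions made for the example.

```python
def numerical_derivative(f, w, h=1e-6):
    """Forward-difference approximation of df/dw at w."""
    return (f(w + h) - f(w)) / h


def f(w):
    return w ** 2


analytic = 2 * 3.0                       # exact derivative 2w at w = 3
approx = numerical_derivative(f, 3.0)    # difference quotient with small h

print(analytic, approx)
```

The two values agree only approximately, because $h$ is finite and floating-point arithmetic adds rounding error.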



### **3. Gradient Descent Algorithm**
Gradient descent updates the model parameters iteratively to reduce the loss function. The **update rule** is:

#### Update Rule:
$$
\theta \gets \theta - \eta \cdot \nabla_\theta L(\theta)
$$

Where:
- $\theta$: Current parameters (e.g., weights and biases).
- $\eta$: Learning rate (a hyperparameter controlling the step size).
- $\nabla_\theta L(\theta)$: Gradient of the loss function with respect to $\theta$.



### **4. Step-by-Step Explanation**
#### Step 1: Compute the Prediction
For a model $h_\theta(x)$:
- **Linear Regression**: $h_\theta(x) = \theta_0 + \theta_1 x$.
- **Logistic Regression**: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.

#### Step 2: Define the Loss Function
The loss function $L(\theta)$ measures the error between predictions $h_\theta(x)$ and true values $y$.

#### Step 3: Compute the Gradient
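For each parameter $\theta_j$, compute the partial derivative $\frac{\partial L}{\partial \theta_j}$ of the loss.

Example for MSE with a linear model $\hat{y}_i = \sum_j \theta_j x_{ij}$, where $x_{ij}$ is the $j$-th feature of the $i$-th sample and $x_{i0} = 1$ for the intercept: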
$$
\frac{\partial L}{\partial \theta_j} = -\frac{2}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right) \cdot x_{ij}
$$

#### Step 4: Update the Parameters
Using the update rule:
$$
\theta_j \gets \theta_j - \eta \cdot \frac{\partial L}{\partial \theta_j}
$$



### **5. Gradient Descent for Linear Regression**
Let’s solve a practical example to illustrate the formulas:

#### Model:
$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

#### Loss Function (MSE):
$$
L(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2
$$

#### Gradients:
$$
\frac{\partial L}{\partial \theta_0} = -\frac{2}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right)
$$
$$
\frac{\partial L}{\partial \theta_1} = -\frac{2}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot x_i
$$
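
Both expressions follow from the chain rule applied to each squared term. As a short worked derivation for $\theta_1$ (the one for $\theta_0$ is identical except that the inner derivative is $-1$ instead of $-x_i$):

$$
\begin{aligned}
\frac{\partial L}{\partial \theta_1}
&= \frac{1}{m} \sum_{i=1}^m 2 \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot \frac{\partial}{\partial \theta_1} \left( y_i - (\theta_0 + \theta_1 x_i) \right) \\
&= \frac{1}{m} \sum_{i=1}^m 2 \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot (-x_i) \\
&= -\frac{2}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot x_i
\end{aligned}
$$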

#### Updates:
$$
\theta_0 \gets \theta_0 - \eta \cdot \frac{\partial L}{\partial \theta_0}
$$
$$
\theta_1 \gets \theta_1 - \eta \cdot \frac{\partial L}{\partial \theta_1}
$$



### **6. Gradient Descent for Logistic Regression**
For binary classification using logistic regression:

#### Model:
$$
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
$$

#### Loss Function (Binary Cross-Entropy):
$$
L(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]
$$

#### Gradient:
$$
\frac{\partial L}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right) \cdot x_{ij}
$$

#### Updates:
$$
\theta_j \gets \theta_j - \eta \cdot \frac{\partial L}{\partial \theta_j}
$$
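
The model, loss, gradient, and update above can be put together in a short pure-Python sketch. The toy dataset, the learning rate `eta = 0.1`, the 200 iterations, and all function names are assumptions made for illustration; each input carries a leading 1 so that $\theta_0$ acts as the bias.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """h_theta(x) for one example; x[0] is assumed to be 1 (bias term)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def loss(theta, X, y):
    """Binary cross-entropy averaged over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(theta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(X)


def gradient(theta, X, y):
    """(1/m) * sum_i (h_theta(x_i) - y_i) * x_ij for every j."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    return grad


# Made-up toy data: one feature plus a leading 1 for the bias.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
eta = 0.1
initial_loss = loss(theta, X, y)
for _ in range(200):
    g = gradient(theta, X, y)
    theta = [t - eta * gj for t, gj in zip(theta, g)]
final_loss = loss(theta, X, y)

print(initial_loss, final_loss)
```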



### **7. Convergence**
Gradient descent repeats until the algorithm converges, i.e., when:
- The loss $L(\theta)$ stops decreasing significantly.
- The gradients $\nabla_\theta L(\theta)$ are close to zero.



### **8. Variants of Gradient Descent**
- **Batch Gradient Descent**: Uses the entire dataset.
- **Stochastic Gradient Descent (SGD)**: Uses one data point per iteration.
- **Mini-Batch Gradient Descent**: Uses a subset of data points per iteration.



### Example Illustration
Let’s optimize the simple quadratic function $f(w) = w^2$:

1. Loss function: $f(w) = w^2$.
2. Gradient: $\frac{df}{dw} = 2w$.
3. Update rule: $w \gets w - \eta \cdot 2w$.

For $w = 1$ and $\eta = 0.1$:
- Gradient: $2 \times 1 = 2$.
- Update: $w = 1 - 0.1 \times 2 = 0.8$.
- Repeat until $w \approx 0$.



### Summary of Key Formulas
1. **Gradient Descent Update**:
   $$
   \theta \gets \theta - \eta \cdot \nabla_\theta L(\theta)
   $$
2. **Gradient for Each Parameter**:
   $$
   \frac{\partial L}{\partial \theta_j} = \text{Partial derivative of the loss with respect to } \theta_j
   $$

Gradient descent systematically minimizes the loss by adjusting parameters step by step, guided by the calculated gradients.


---


## **Batch Gradient Descent:**

Batch Gradient Descent is a type of gradient descent algorithm that computes the gradient of the entire **loss function** over the complete dataset in each iteration. It updates the parameters after considering all the training data.



### **1. Steps in Batch Gradient Descent**

#### **Step 1: Initialize Parameters**
Randomly initialize the model parameters (e.g., weights $\theta$).

#### **Step 2: Compute Predictions**
For all training examples:
$$
\hat{y}_i = h_\theta(x_i)
$$
Where:
- $h_\theta(x_i)$ is the model’s prediction (e.g., $h_\theta(x) = \theta^T x$ for linear regression).

#### **Step 3: Define the Loss Function**
The loss function $L(\theta)$ measures the error between predictions ($\hat{y}_i$) and true values ($y_i$):

- **For Linear Regression (Mean Squared Error)**:
$$
L(\theta) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}_i \right)^2
$$

- **For Logistic Regression (Binary Cross-Entropy)**:
$$
L(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

Here:
- $m$: Number of training samples.
- $y_i$: True label for the $i$-th example.
- $\hat{y}_i$: Predicted value for the $i$-th example.

#### **Step 4: Compute Gradients**
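The gradient is computed as the derivative of the loss function $L(\theta)$ with respect to each parameter $\theta_j$.

- **Gradient Formula** (logistic regression with binary cross-entropy):
$$
\frac{\partial L}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}_i - y_i \right) \cdot x_{ij}
$$
For linear regression with the MSE loss above, the same expression carries an extra factor of 2 from differentiating the square: $\frac{\partial L}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^m \left( \hat{y}_i - y_i \right) \cdot x_{ij}$.

Where:
- $x_{ij}$: The value of the $j$-th feature for the $i$-th example.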

#### **Step 5: Update Parameters**
Update all parameters $\theta_j$ simultaneously using the formula:
$$
\theta_j \gets \theta_j - \eta \cdot \frac{\partial L}{\partial \theta_j}
$$
Where:
- $\eta$: Learning rate (controls the step size).
- $\frac{\partial L}{\partial \theta_j}$: Gradient of the loss with respect to $\theta_j$.

#### **Step 6: Repeat**
Repeat steps 2–5 until:
- The loss function $L(\theta)$ converges (i.e., changes very little).
- The gradient $\nabla_\theta L(\theta)$ becomes close to zero.


### **2. Mathematical Example for Linear Regression**

#### **Model:**
$$
h_\theta(x_i) = \theta_0 + \theta_1 x_i
$$

#### **Loss Function (MSE):**
$$
L(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2
$$

#### **Gradients:**
$$
\frac{\partial L}{\partial \theta_0} = -\frac{2}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right)
$$
$$
\frac{\partial L}{\partial \theta_1} = -\frac{2}{m} \sum_{i=1}^m \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot x_i
$$

#### **Parameter Updates:**
$$
\theta_0 \gets \theta_0 - \eta \cdot \frac{\partial L}{\partial \theta_0}
$$
$$
\theta_1 \gets \theta_1 - \eta \cdot \frac{\partial L}{\partial \theta_1}
$$


### **3. Properties of Batch Gradient Descent**

#### **Advantages:**
1. **Convergence Stability**: The gradient is calculated over the entire dataset, leading to stable and smooth convergence.
2. **Exact Gradient**: Since the entire dataset is used, the gradient is accurate.
3. **Deterministic Updates**: Each update is the same for the same data, ensuring repeatability.

#### **Disadvantages:**
1. **Computationally Expensive**: For large datasets, computing the gradient for the entire dataset in every iteration is slow.
2. **Memory Intensive**: A straightforward implementation loads the entire dataset into memory, which can be infeasible for very large datasets.


### **4. Visual Understanding**
- Imagine a ball rolling down a hill. In batch gradient descent, you compute the average slope of the entire hill (dataset) before taking a step. While this ensures precision, it may take longer to reach the bottom of the hill.


### **5. Convergence Criterion**
Batch Gradient Descent stops when:
1. The change in the loss function $L(\theta)$ between iterations becomes very small:
   $$
   |L_{\text{new}} - L_{\text{old}}| < \epsilon
   $$
   Where $\epsilon$ is a small threshold value.
2. A maximum number of iterations is reached.
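
Both stopping rules fit naturally into one training loop. The sketch below uses the toy loss $f(w) = w^2$ as a stand-in for $L(\theta)$; the threshold `epsilon = 1e-10` and the cap `max_iters = 10_000` are assumed values, not recommendations.

```python
def loss(w):
    return w ** 2


def grad(w):
    return 2 * w


w, eta = 1.0, 0.1
epsilon = 1e-10        # threshold on the change in loss (assumed value)
max_iters = 10_000     # safety cap on the number of iterations (assumed value)

old_loss = loss(w)
for iteration in range(1, max_iters + 1):
    w = w - eta * grad(w)
    new_loss = loss(w)
    if abs(new_loss - old_loss) < epsilon:   # criterion 1: loss has plateaued
        break
    old_loss = new_loss
# If the loop never breaks, criterion 2 (max_iters) ends training instead.

print(iteration, w)
```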


### **6. Comparison with Other Gradient Descent Variants**

| Feature                        | Batch Gradient Descent            | Stochastic Gradient Descent (SGD)      | Mini-Batch Gradient Descent          |
|--------------------------------|-----------------------------------|---------------------------------------|---------------------------------------|
| **Gradient Calculation**       | Entire dataset                   | One data point                        | Small batch of data points           |
| **Stability**                  | Very stable                      | Noisy, fluctuates                     | Moderately stable                    |
| **Speed**                      | Slow for large datasets          | Faster                                | Balances speed and stability         |
| **Memory Requirement**         | High (entire dataset in memory)  | Low (one point in memory)             | Medium                               |


### **7. Example Walkthrough**

#### Dataset:
| $x$   | $y$   |
|---------|---------|
| 1       | 2       |
| 2       | 4       |
| 3       | 6       |

#### Model:
$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

#### Loss Function:
$$
L(\theta) = \frac{1}{3} \sum_{i=1}^3 \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2
$$

#### Gradients:
$$
\frac{\partial L}{\partial \theta_0} = -\frac{2}{3} \sum_{i=1}^3 \left( y_i - (\theta_0 + \theta_1 x_i) \right)
$$
$$
\frac{\partial L}{\partial \theta_1} = -\frac{2}{3} \sum_{i=1}^3 \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot x_i
$$

#### Updates (with $\eta = 0.01$):
1. Initialize $\theta_0 = 0$, $\theta_1 = 0$.
2. Compute predictions $\hat{y}_i = \theta_0 + \theta_1 x_i$ and update $\theta_0, \theta_1$ using the formulas above.
3. Repeat until convergence.
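
The walkthrough can be run end to end with a short script. This is a sketch under stated assumptions: the helper name `gradients` and the iteration count are choices made here, while the data, the gradient formulas, $\eta = 0.01$, and the zero initialization come from the example above. The first iteration is kept separate so it can be checked by hand: with $\theta_0 = \theta_1 = 0$ the residuals are $2, 4, 6$, giving $\frac{\partial L}{\partial \theta_0} = -8$ and $\frac{\partial L}{\partial \theta_1} = -\frac{56}{3}$.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
m = len(xs)
eta = 0.01


def gradients(theta0, theta1):
    """Full-batch MSE gradients for the model theta0 + theta1 * x."""
    residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    g0 = -2.0 / m * sum(residuals)
    g1 = -2.0 / m * sum(r * x for r, x in zip(residuals, xs))
    return g0, g1


theta0, theta1 = 0.0, 0.0

# First iteration, written out so it can be checked by hand.
g0, g1 = gradients(theta0, theta1)      # g0 = -8, g1 = -56/3
theta0, theta1 = theta0 - eta * g0, theta1 - eta * g1
first_step = (theta0, theta1)           # (0.08, about 0.1867)

# Keep iterating; the number of steps is an arbitrary choice for this sketch.
for _ in range(20_000):
    g0, g1 = gradients(theta0, theta1)
    theta0, theta1 = theta0 - eta * g0, theta1 - eta * g1

print(first_step, (theta0, theta1))
```

Because the data lie exactly on the line $y = 2x$, the parameters approach $\theta_0 = 0$ and $\theta_1 = 2$.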


### **8. Summary**
- **Batch Gradient Descent** updates parameters using the average gradient computed over the entire dataset.
- It is computationally expensive but provides precise and stable updates.
- Best suited for smaller datasets or cases where exact, low-noise gradients matter more than speed.

---

## **Stochastic Gradient Descent:**


Stochastic Gradient Descent is a variant of the gradient descent optimization algorithm where the gradient is computed, and the parameters are updated, using only **one training sample** at a time instead of the entire dataset. This makes each update cheaper and more memory-efficient for large datasets but introduces noise in the updates.



### **1. Steps in Stochastic Gradient Descent**

#### **Step 1: Initialize Parameters**
Randomly initialize the parameters $\theta$ (e.g., weights for a model).

#### **Step 2: Compute Prediction**
For a single randomly selected training example $(x_i, y_i)$:
$$
\hat{y}_i = h_\theta(x_i)
$$
Where $h_\theta(x_i)$ is the model's prediction.

#### **Step 3: Define the Loss Function**
The loss function $L_i(\theta)$ measures the error for the chosen training sample:
- **For Linear Regression (Mean Squared Error):**
$$
L_i(\theta) = \left( y_i - h_\theta(x_i) \right)^2
$$

- **For Logistic Regression (Binary Cross-Entropy):**
$$
L_i(\theta) = -\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$



#### **Step 4: Compute the Gradient**
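Compute the gradient of the loss function with respect to each parameter $\theta_j$. For logistic regression with binary cross-entropy:
$$
\frac{\partial L_i}{\partial \theta_j} = \left( \hat{y}_i - y_i \right) x_{ij}
$$
For linear regression with the squared error above, differentiating the square adds a factor of 2: $\frac{\partial L_i}{\partial \theta_j} = 2 \left( \hat{y}_i - y_i \right) x_{ij}$.

Where:
- $x_{ij}$: Value of the $j$-th feature for the $i$-th training example.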



#### **Step 5: Update Parameters**
Update the parameters using the computed gradient:
$$
\theta_j \gets \theta_j - \eta \cdot \frac{\partial L_i}{\partial \theta_j}
$$
Where:
- $\eta$: Learning rate, controls the size of the step.



#### **Step 6: Repeat for Each Training Example**
Repeat steps 2–5 for a fixed number of epochs or until convergence:
1. Shuffle the dataset at the start of each epoch to reduce bias.
2. Iterate through all training examples.

---

### **2. Mathematical Example for Linear Regression**

#### Dataset:
| $x_i$ | $y_i$ |
|---------|---------|
| 1       | 2       |
| 2       | 4       |
| 3       | 6       |

#### Model:
$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

#### Loss Function:
$$
L_i(\theta_0, \theta_1) = \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2
$$

#### Gradients:
$$
\frac{\partial L_i}{\partial \theta_0} = -2 \cdot \left( y_i - (\theta_0 + \theta_1 x_i) \right)
$$
$$
\frac{\partial L_i}{\partial \theta_1} = -2 \cdot \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot x_i
$$

#### Updates:
1. Initialize $\theta_0 = 0$, $\theta_1 = 0$.
2. For each example $(x_i, y_i)$:
   $$
   \theta_0 \gets \theta_0 - \eta \cdot \frac{\partial L_i}{\partial \theta_0}
   $$
   $$
   \theta_1 \gets \theta_1 - \eta \cdot \frac{\partial L_i}{\partial \theta_1}
   $$
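
A runnable sketch of this example follows. The learning rate `eta = 0.01`, the fixed random seed, the number of epochs, and the helper name `sgd_step` are assumptions added for illustration; the data, model, and per-sample gradients are the ones given above. The first update on $(1, 2)$ from $\theta_0 = \theta_1 = 0$ has residual $2$, so both gradients equal $-4$.

```python
import random

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
eta = 0.01               # assumed learning rate; the example leaves it open
rng = random.Random(0)   # fixed seed so runs are repeatable


def sgd_step(theta0, theta1, x, y):
    """One update from a single example, using the per-sample gradients."""
    residual = y - (theta0 + theta1 * x)
    g0 = -2.0 * residual
    g1 = -2.0 * residual * x
    return theta0 - eta * g0, theta1 - eta * g1


# First update, checked by hand: (0, 0) moves to (0.04, 0.04).
first = sgd_step(0.0, 0.0, 1.0, 2.0)

theta0, theta1 = 0.0, 0.0
for epoch in range(5000):
    rng.shuffle(data)                   # new visiting order every epoch
    for x, y in data:
        theta0, theta1 = sgd_step(theta0, theta1, x, y)

print(first, (theta0, theta1))
```

This toy dataset is fit exactly by $\theta_0 = 0$, $\theta_1 = 2$, so every per-sample gradient vanishes at the optimum and the iterates settle there; on noisy data SGD with a fixed learning rate would keep fluctuating around the minimum.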



### **3. Key Features of SGD**

#### **Advantages:**
1. **Faster Updates**: Each update is computed for a single training example, making the algorithm faster for large datasets.
2. **Memory Efficient**: Only one example is loaded into memory at a time, making it ideal for large datasets.
3. **Online Learning**: Can be used for streaming data, as it updates the model parameters incrementally.

#### **Disadvantages:**
1. **Noisy Updates**: Each update uses a single example, leading to high variance and oscillations in the convergence path.
2. **May Overshoot**: Because of the noisy updates, SGD may overshoot the minimum or take a longer, winding path to convergence.
3. **Requires Careful Tuning**: Learning rate $\eta$ must be carefully chosen to avoid divergence.



### **4. Comparison to Batch Gradient Descent**

| Feature                        | Stochastic Gradient Descent         | Batch Gradient Descent               |
|--------------------------------|-------------------------------------|--------------------------------------|
| **Gradient Calculation**       | Single training example             | Entire dataset                       |
| **Speed per Update**           | Faster                              | Slower                               |
| **Convergence Stability**      | Noisy, may oscillate                | Stable                               |
| **Memory Requirement**         | Low (single example)                | High (entire dataset)                |
| **Use Case**                   | Large datasets, online learning     | Small datasets, stable optimization  |



### **5. Variants to Improve SGD**

1. **Mini-Batch Gradient Descent**:
   - Combines advantages of SGD and Batch Gradient Descent by computing gradients on a small subset (batch) of data.

2. **Learning Rate Schedulers**:
   - Adjust the learning rate over time to improve convergence (e.g., exponential decay, step decay).

3. **Momentum**:
   - Incorporates a fraction of the previous update to smooth out oscillations:
     $$
     v_j = \gamma v_j + \eta \cdot \frac{\partial L_i}{\partial \theta_j}
     $$
     $$
     \theta_j \gets \theta_j - v_j
     $$

4. **Adaptive Methods**:
   - Techniques like **Adam** or **RMSprop** adapt the learning rate for each parameter, improving convergence speed and stability.
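
As one concrete case from the list above, the momentum update can be sketched on a one-parameter problem. The toy loss $f(w) = w^2$, the values `eta = 0.1` and `gamma = 0.9`, the step count, and the function name `momentum_descent` are all assumptions made for this illustration.

```python
def momentum_descent(grad, w, eta=0.1, gamma=0.9, steps=300):
    """Gradient descent with the momentum update v = gamma*v + eta*grad."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + eta * grad(w)   # blend previous update with new gradient
        w = w - v                       # apply the accumulated update
    return w


# Toy problem (an assumption for this sketch): f(w) = w^2, gradient 2w.
w_final = momentum_descent(lambda w: 2 * w, w=1.0)
print(w_final)
```

Setting `gamma=0.0` recovers plain gradient descent, which is a convenient way to compare the two on the same problem.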



### **6. Visual Intuition**
- Imagine a ball rolling down a hilly path. With SGD, the ball takes steps based on the local slope at its current position, leading to quick, uneven progress. It may zig-zag around the optimal solution, but because each step is cheap it often gets close in less total computation than Batch Gradient Descent on large datasets.



### **7. Convergence Criteria**
1. **Loss Function Convergence**:
   - Stop when the change in loss is less than a small threshold:
     $$
     |L_{\text{new}} - L_{\text{old}}| < \epsilon
     $$
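2. **Gradient Norm Convergence**:
   - Stop when the norm of the gradient is close to zero. A single-sample gradient $\nabla_\theta L_i(\theta)$ is too noisy for this test, so evaluate the norm on the full dataset or a held-out batch, or use a running average:
     $$
     \|\nabla_\theta L(\theta)\| < \epsilon
     $$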



### **8. Summary**
- **Stochastic Gradient Descent** updates parameters using a single training sample at a time.
- It is computationally efficient for large datasets but may converge noisily.
- Enhancements like momentum, learning rate schedules, or adaptive optimizers can mitigate its weaknesses.

---

## **Mini-Batch Gradient Descent (MBGD):**

Mini-Batch Gradient Descent is a variant of the Gradient Descent algorithm that splits the dataset into smaller groups called *mini-batches*. Instead of computing gradients for the entire dataset (like Batch Gradient Descent) or for one sample (like Stochastic Gradient Descent), MBGD computes the gradient using a **subset of training samples** at each step.



### **1. How Mini-Batch Gradient Descent Works**

#### **Step 1: Initialize Parameters**
Randomly initialize the parameters $\theta$ (e.g., weights for the model).



#### **Step 2: Divide Dataset into Mini-Batches**
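Split the training dataset $D$ of size $N$ into $M$ mini-batches of size $B$. The batch size $B$ is the hyperparameter you choose, and the number of mini-batches follows from it (the last mini-batch is smaller when $N$ is not a multiple of $B$):
$$
M = \left\lceil \frac{N}{B} \right\rceil
$$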
Where:
- $N$: Total number of training examples.
- $B$: Batch size (e.g., 32, 64, 128).
- $M$: Number of mini-batches.
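
Splitting a dataset into mini-batches takes only a few lines, including the case where $N$ is not a multiple of $B$. The helper name `make_minibatches`, its `seed` argument, and the placeholder dataset are assumptions made for this sketch.

```python
import random


def make_minibatches(dataset, batch_size, seed=None):
    """Shuffle a copy of `dataset` and cut it into consecutive mini-batches."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


dataset = list(range(10))          # N = 10 placeholder examples
batches = make_minibatches(dataset, batch_size=4, seed=0)

print(len(batches))                # 3 mini-batches
print([len(b) for b in batches])   # sizes [4, 4, 2]
```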



#### **Step 3: Compute Prediction**
For each mini-batch $B_k$, a subset of $B$ examples from the training set, compute the model's predictions:
$$
\hat{y}_i = h_\theta(x_i), \quad i \in B_k
$$
Where $h_\theta(x_i)$ is the model's prediction for the $i$-th sample in the mini-batch.



#### **Step 4: Define the Loss Function**
Calculate the average loss over all examples in the mini-batch:
$$
L_{B_k}(\theta) = \frac{1}{B} \sum_{i \in B_k} L_i(\theta)
$$
Where:
- $L_i(\theta)$ is the loss for each individual training sample (e.g., Mean Squared Error or Cross-Entropy Loss).



#### **Step 5: Compute the Gradient**
Compute the gradient of the loss function with respect to parameters $\theta_j$:
$$
\frac{\partial L_{B_k}}{\partial \theta_j} = \frac{1}{B} \sum_{i \in B_k} \frac{\partial L_i}{\partial \theta_j}
$$



#### **Step 6: Update Parameters**
Update parameters $\theta$ using the computed gradients:
$$
\theta_j \gets \theta_j - \eta \cdot \frac{\partial L_{B_k}}{\partial \theta_j}
$$
Where:
- $\eta$: Learning rate.



#### **Step 7: Repeat for All Mini-Batches**
1. Iterate through all mini-batches in the dataset for one **epoch**.
2. Repeat the process for multiple epochs until convergence.


### **2. Mathematical Example**

#### Dataset:
| $x_i$ | $y_i$ |
|---------|---------|
| 1       | 2       |
| 2       | 4       |
| 3       | 6       |
| 4       | 8       |



#### Model:
$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

#### Loss Function:
$$
L_i(\theta) = \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2
$$

#### Mini-Batch Size:
Let $B = 2$. Split the dataset into 2 mini-batches:
- $B_1 = \{(1, 2), (2, 4)\}$
- $B_2 = \{(3, 6), (4, 8)\}$

#### Gradient Updates:
For $B_1$, compute the gradients:
$$
\frac{\partial L_{B_1}}{\partial \theta_0} = \frac{1}{2} \sum_{i=1}^2 -2 \cdot \left( y_i - (\theta_0 + \theta_1 x_i) \right)
$$
$$
\frac{\partial L_{B_1}}{\partial \theta_1} = \frac{1}{2} \sum_{i=1}^2 -2 \cdot \left( y_i - (\theta_0 + \theta_1 x_i) \right) \cdot x_i
$$

Update parameters:
$$
\theta_j \gets \theta_j - \eta \cdot \frac{\partial L_{B_1}}{\partial \theta_j}
$$

Repeat for $B_2$.
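
One full epoch of this example can be traced in code. The learning rate `eta = 0.01`, the zero initialization, and the helper name `minibatch_step` are assumptions added here; the batches and gradient formulas are the ones above. Starting from $\theta_0 = \theta_1 = 0$, the residuals on $B_1$ are $2$ and $4$, so the averaged gradients are $-6$ and $-10$.

```python
batches = [
    [(1.0, 2.0), (2.0, 4.0)],   # B_1
    [(3.0, 6.0), (4.0, 8.0)],   # B_2
]
eta = 0.01   # assumed learning rate; the example above does not fix one


def minibatch_step(theta0, theta1, batch):
    """One parameter update from the average gradient over `batch`."""
    residuals = [y - (theta0 + theta1 * x) for x, y in batch]
    g0 = sum(-2.0 * r for r in residuals) / len(batch)
    g1 = sum(-2.0 * r * x for r, (x, _) in zip(residuals, batch)) / len(batch)
    return theta0 - eta * g0, theta1 - eta * g1


theta0, theta1 = 0.0, 0.0   # assumed starting point

# Update on B_1: gradients -6 and -10 move theta to (0.06, 0.10).
theta0, theta1 = minibatch_step(theta0, theta1, batches[0])
after_b1 = (theta0, theta1)

# Update on B_2 uses the already-updated parameters and completes the epoch.
theta0, theta1 = minibatch_step(theta0, theta1, batches[1])
after_epoch = (theta0, theta1)

print(after_b1, after_epoch)
```

Note that the second update starts from the parameters produced by the first, so one epoch here performs two parameter updates rather than one.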

### **3. Comparison with Other Gradient Descent Variants**

| Feature                        | Stochastic Gradient Descent         | Mini-Batch Gradient Descent         | Batch Gradient Descent            |
|--------------------------------|-------------------------------------|-------------------------------------|-----------------------------------|
| **Gradient Calculation**       | Single training example             | Subset of training examples         | Entire dataset                    |
| **Speed per Update**           | Fast                                | Moderate                            | Slow                              |
| **Convergence Stability**      | Noisy                               | Balanced                            | Stable                            |
| **Memory Requirement**         | Low (single example)                | Moderate (mini-batch size)          | High (entire dataset)             |
| **Use Case**                   | Streaming data, large datasets      | Most common choice                  | Small datasets, high precision    |





### **4. Advantages of Mini-Batch Gradient Descent**

1. **Efficient Computation**:
   - Uses vectorized operations on mini-batches, leveraging hardware accelerations like GPUs.
   
2. **Stable Updates**:
   - Reduces the noisy oscillations of SGD while being faster than Batch Gradient Descent.

3. **Memory-Friendly**:
   - Suitable for large datasets that do not fit into memory, as it processes smaller batches.

4. **Faster Convergence**:
   - Provides a good trade-off between convergence stability (Batch) and computational speed (SGD).



### **5. Challenges in Mini-Batch Gradient Descent**

1. **Choosing Batch Size**:
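   - Small batches (roughly $B < 32$) may lead to noisy updates, similar to SGD.
   - Large batches (roughly $B > 512$) behave more like Batch Gradient Descent: each update is smoother but costs more computation and memory, so there are fewer updates per epoch.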

2. **Hyperparameter Tuning**:
   - Requires careful tuning of the learning rate $\eta$ and batch size $B$ for optimal performance.

3. **Convergence**:
   - May still get stuck in local minima or saddle points, especially for non-convex problems.



### **6. Practical Considerations**

1. **Batch Size**:
   - Common choices: $32, 64, 128, 256$.
   - Depends on hardware (GPU/CPU) and dataset size.

2. **Learning Rate**:
   - Combine with learning rate schedulers to adjust $\eta$ during training.

3. **Shuffling**:
   - Shuffle the dataset at the start of each epoch to ensure diverse mini-batches and reduce bias.

4. **Regularization**:
   - Use techniques like L2 regularization or dropout to avoid overfitting.



### **7. Applications**

1. **Deep Learning**:
   - Mini-Batch Gradient Descent is the standard choice for training deep neural networks due to its ability to balance memory usage and computation speed.

2. **Large Datasets**:
   - Suitable for datasets too large to fit into memory.



### **8. Summary**

Mini-Batch Gradient Descent computes updates using a subset of training data, combining the benefits of Stochastic and Batch Gradient Descent:
- **Efficiency**: Faster than Batch Gradient Descent and less noisy than SGD.
- **Flexibility**: Works well for large datasets using hardware acceleration.
- **Challenges**: Requires careful tuning of batch size and learning rate for optimal results.

---