# <span style="color:#2E86C1;"><b>Batch, Stochastic, and Mini-Batch Gradient Descent</b></span>

---

### <span style="color:#D35400;"><b>1. Batch Gradient Descent</b></span>
- **<span style="color:#28B463;">Process</span>**:
  - Batch gradient descent computes the gradient of the loss function with respect to the model parameters using the entire dataset.
  - At each iteration, every data point contributes to a single gradient, which is then used for one parameter update.
  - This approach ensures that each update reflects the complete dataset, which usually leads to a smoother convergence path.
- **<span style="color:#F39C12;">Update Rule</span>**:
  $$
  w = w - \alpha \cdot \nabla J(w)
  $$
  - Here, $ w $ represents the weight parameters, $ \alpha $ is the learning rate, and $ \nabla J(w) $ is the gradient of the loss over the entire dataset.
- **<span style="color:#E74C3C;">Pros</span>**:
  - **Stable Convergence**: Since it computes gradients over the full dataset, it produces stable, less noisy updates.
  - **Fewer Oscillations**: Reduces random fluctuations as it reflects the global error gradient.
- **<span style="color:#9B59B6;">Cons</span>**:
  - **High Computational Cost**: Processing the entire dataset per update can be very slow, especially for large datasets.
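  - **Memory Intensive**: A straightforward implementation holds the entire dataset in memory, which can be problematic for big data applications. Accumulating the gradient chunk by chunk avoids this, but every update still needs a full pass over the data.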
- **<span style="color:#2E86C1;">When to Use</span>**:
  - Ideal for small-to-medium-sized datasets where memory and computational resources can handle the entire dataset at once.
  - Useful when model training stability is prioritized over speed.
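
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a one-dimensional linear model $ y = w x + b $ with mean squared error as $ J $, and the function name `batch_gradient_descent`, the parameters `alpha` and `n_iters`, and the toy data are all hypothetical choices made for the example.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, n_iters=2000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Accumulate the gradient of J = (1/n) * sum((w*x + b - y)^2)
        # over the ENTIRE dataset before touching the parameters.
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = w * x + b - y
            grad_w += 2.0 * error * x
            grad_b += 2.0 * error
        grad_w /= n
        grad_b /= n
        # Exactly one parameter update per full pass over the data.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (assumed example values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]

w, b = batch_gradient_descent(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Because every update uses the exact gradient, the same inputs always produce the same trajectory, and on this toy data the estimates should approach $ w = 3 $ and $ b = 1 $. The cost is visible in the inner loop: each single update visits all `n` samples.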

---

### <span style="color:#D35400;"><b>2. Stochastic Gradient Descent (SGD)</b></span>
- **<span style="color:#28B463;">Process</span>**:
  - In SGD, each model update is made after computing the gradient of the loss function on just one data point.
  - Each iteration uses a single randomly chosen data sample (stochastic means "random") to compute the gradient and adjust the parameters.
  - This approach leads to frequent, cheap updates, introducing some randomness to the path of convergence.
- **<span style="color:#F39C12;">Update Rule</span>**:
  $$
  w = w - \alpha \cdot \nabla J(w; x^{(i)}, y^{(i)})
  $$
  - Here, $ (x^{(i)}, y^{(i)}) $ is a single data point, and $ \nabla J(w; x^{(i)}, y^{(i)}) $ is the gradient of the loss with respect to $ w $, evaluated on that one data point.
- **<span style="color:#E74C3C;">Pros</span>**:
  - **Memory Efficient**: Requires only a single data point to compute each gradient, making it highly memory efficient.
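  - **Cheap, Frequent Updates**: Each update costs only one gradient evaluation, so on very large datasets the model can start improving long before a full pass over the data completes.
  - **Can Escape Local Minima**: The randomness in the updates can help the model escape shallow local minima, potentially finding better solutions.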
- **<span style="color:#9B59B6;">Cons</span>**:
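  - **High Variance**: A single-sample gradient is a noisy estimate of the full gradient, so individual updates can point away from the minimum, which can slow convergence or cause overshooting.
  - **Unstable Convergence**: With a poorly tuned learning rate, the path to convergence is erratic, and the parameters may keep oscillating around the minimum instead of settling. Decreasing the learning rate over time is a common way to damp this.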
- **<span style="color:#2E86C1;">When to Use</span>**:
  - Suitable for very large datasets, especially when computational resources are limited.
  - Ideal for applications where quicker, approximate convergence is acceptable, like real-time systems or online learning.
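
As a rough sketch of the SGD update rule, the snippet below applies one update per randomly drawn sample. It makes several assumptions that are not in the text above: a one-dimensional linear model $ y = w x + b $ with squared error as the per-sample loss, noise-free toy data, and the hypothetical names `sgd`, `alpha`, `n_steps`, and `seed`.

```python
import random


def sgd(xs, ys, alpha=0.01, n_steps=5000, seed=0):
    """Fit y = w * x + b with one randomly chosen sample per update."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        # Draw a single data point (x_i, y_i) uniformly at random.
        i = rng.randrange(len(xs))
        # Gradient of the per-sample loss (w*x_i + b - y_i)^2 w.r.t. w and b.
        error = w * xs[i] + b - ys[i]
        grad_w = 2.0 * error * xs[i]
        grad_b = 2.0 * error
        # Update immediately, using this one sample only.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (assumed example values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]

w, b = sgd(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each step reads one sample, so memory use is independent of dataset size. The data here is deliberately noise-free so the result is easy to check against $ w = 3 $, $ b = 1 $; with noisy targets and a constant `alpha`, the estimates would instead hover around the minimum, which is the variance described under Cons.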

---

### <span style="color:#D35400;"><b>3. Mini-Batch Gradient Descent</b></span>
- **<span style="color:#28B463;">Process</span>**:
  - Mini-Batch Gradient Descent splits the dataset into small, manageable batches (e.g., 32, 64, 128 samples per batch).
  - The gradient is computed over each mini-batch, and model parameters are updated based on the average gradient of the samples in each mini-batch.
  - This strikes a balance between the stability of Batch Gradient Descent and the speed of SGD.
- **<span style="color:#F39C12;">Update Rule</span>**:
  $$
  w = w - \alpha \cdot \frac{1}{m} \sum_{j=1}^{m} \nabla J(w; x^{(j)}, y^{(j)})
  $$
  - Here, $ m $ is the mini-batch size, and the gradient is averaged across all samples within each mini-batch.
- **<span style="color:#E74C3C;">Pros</span>**:
  - **Computational Efficiency**: Each update touches only $ m $ samples, so it is cheaper and needs less memory than a full-batch update.
  - **Reduced Noise**: Offers less noisy updates than SGD, helping to stabilize the convergence process.
  - **Parallelization**: The per-sample computations within a mini-batch can be vectorized and run in parallel on hardware like GPUs, speeding up computations.
- **<span style="color:#9B59B6;">Cons</span>**:
  - **Choice of Batch Size**: The mini-batch size must be tuned; a batch that is too small gives noisy updates, while one that is too large makes each update expensive and reduces the number of updates per pass over the data.
  - **Memory Usage**: Larger mini-batches require more memory, which may limit batch size on certain hardware.
- **<span style="color:#2E86C1;">When to Use</span>**:
  - Commonly used for deep learning applications, especially when training on large datasets where full-batch processing is impractical.
  - Preferred for tasks that benefit from both training stability and efficiency.
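
The mini-batch update rule can be sketched as follows. This is an illustrative example under stated assumptions: a one-dimensional linear model $ y = w x + b $ with mean squared error, reshuffling of the data at the start of each pass, and hypothetical names (`minibatch_gradient_descent`, `alpha`, `batch_size`, `n_epochs`, `seed`) that do not come from the text above.

```python
import random


def minibatch_gradient_descent(xs, ys, alpha=0.02, batch_size=4,
                               n_epochs=2000, seed=0):
    """Fit y = w * x + b, averaging the gradient over each mini-batch."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        # Reshuffle so each pass splits the data into different mini-batches.
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)  # the last batch may be smaller than batch_size
            # Per-sample errors, all computed with the current w and b.
            errors = [w * xs[i] + b - ys[i] for i in batch]
            # Average the per-sample gradients over the m samples.
            grad_w = sum(2.0 * e * xs[i] for e, i in zip(errors, batch)) / m
            grad_b = sum(2.0 * e for e in errors) / m
            # One parameter update per mini-batch.
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (assumed example values).
xs = [i / 2.0 for i in range(12)]  # 0.0, 0.5, ..., 5.5
ys = [3.0 * x + 1.0 for x in xs]

w, b = minibatch_gradient_descent(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Setting `batch_size=1` in this sketch gives one update per sample, as in SGD (here sampled without replacement within each pass), while `batch_size=len(xs)` reduces it to batch gradient descent, which is why the mini-batch form is often treated as the general case.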

---