# Batch Gradient Descent: An Overview

Batch Gradient Descent (BGD) is a fundamental optimization algorithm used in machine learning and deep learning for minimizing the cost function associated with training a model. It belongs to the family of gradient-based optimization techniques, where the model parameters are iteratively updated to minimize a given objective function.

### How Batch Gradient Descent Works

In Batch Gradient Descent, the algorithm calculates the gradient of the cost function with respect to the model parameters using the entire training dataset. This means that for each iteration, it processes all the training examples to compute the gradient. The gradients are then averaged across the entire dataset, and the model parameters are updated in the direction opposite to the gradient, scaled by a predefined learning rate.

### Advantages of Batch Gradient Descent

- **Global Convergence:** Batch Gradient Descent guarantees convergence to the global minimum for convex cost functions, given a sufficiently small learning rate. This property makes it reliable for optimization tasks where finding the global minimum is crucial.

- **Stable Convergence:** Since BGD computes gradients using the entire dataset, the updates are less noisy compared to stochastic methods. This leads to more stable convergence and smoother optimization trajectories.

- **Efficient with Vectorized Operations:** Batch processing allows for efficient vectorized computations, leveraging optimized linear algebra libraries. This makes BGD suitable for high-dimensional datasets and complex models.

- **Deterministic Updates:** The updates in BGD are deterministic, meaning that given the same initial parameters and dataset, the algorithm will converge to the same solution. This deterministic nature simplifies debugging and reproducibility.

### Disadvantages of Batch Gradient Descent

- **High Memory Requirement:** Each update needs the gradient over the entire dataset, so a straightforward implementation holds the whole dataset in memory at once. For large datasets this can exceed the available memory, which limits how well BGD scales to big-data scenarios.

- **Computational Cost:** Processing the entire dataset for every update is computationally expensive, especially for large datasets and complex models. Because each full pass yields only one parameter update, BGD often makes slower progress per unit of computation than stochastic methods.

- **Prone to Local Minima:** In non-convex optimization problems, Batch Gradient Descent may converge to a local minimum instead of the global minimum. This limitation can hinder its effectiveness in optimizing complex, non-convex cost functions.

- **Sensitivity to Learning Rate:** The choice of learning rate in Batch Gradient Descent is critical. A learning rate that is too small may lead to slow convergence, while a learning rate that is too large may cause oscillations or divergence.
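
The learning-rate sensitivity described in the last point can be seen on a toy problem. The sketch below is only an illustration: it assumes a one-dimensional cost $J(w) = w^2$ (gradient $2w$), and the function name `descend`, the starting point, and the three step sizes are hypothetical choices rather than values from this article.

```python
def descend(lr, w=1.0, steps=20):
    """Run plain gradient descent on J(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w      # dJ/dw for J(w) = w**2
        w = w - lr * grad   # step against the gradient
    return w

small = descend(lr=0.001)  # very small step: w barely moves toward 0
good = descend(lr=0.1)     # moderate step: w shrinks steadily toward 0
large = descend(lr=1.1)    # too large: every step overshoots and |w| grows

print(small, good, large)
```

On this cost each update multiplies `w` by `1 - 2 * lr`, so the iterates shrink only when that factor has magnitude below 1.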


### Problem Setup
Assume we have a small dataset with 6 data points:

$$
\{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5), (x_6, y_6)\}
$$

For simplicity, let's say:

$$
\{(1, 2), (2, 2.5), (3, 3), (4, 3.5), (5, 5), (6, 5.5)\}
$$

We want to fit a linear model:

$$
\hat{y} = w_0 + w_1 x
$$

### Initial Weights
Let's start with initial weights $ w_0 = 0 $ and $ w_1 = 0 $.

### Gradient Calculation
The cost function for linear regression is:

$$
J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}_i - y_i)^2
$$

where $ m $ is the number of data points (6 in this case), and $ \hat{y}_i = w_0 + w_1 x_i $.

The gradients of the cost function with respect to the weights are:

$$
\frac{\partial J}{\partial w_0} = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i)
$$
$$
\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) x_i
$$

We will calculate these gradients for all 6 data points and then average them.
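
As a rough sketch, the two gradient formulas above translate almost directly into code. The function name `batch_gradients` is a hypothetical helper introduced here for illustration; the data is the six-point dataset from the problem setup.

```python
def batch_gradients(w0, w1, xs, ys):
    """Return (dJ/dw0, dJ/dw1), averaged over every training example."""
    m = len(xs)
    errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]  # y_hat_i - y_i
    grad_w0 = sum(errors) / m
    grad_w1 = sum(e * x for e, x in zip(errors, xs)) / m
    return grad_w0, grad_w1

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 2.5, 3, 3.5, 5, 5.5]

g0, g1 = batch_gradients(0.0, 0.0, xs, ys)
print(round(g0, 4), round(g1, 4))  # -3.5833 -14.6667
```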

### Step-by-Step Calculation

Compute the predictions and errors for each data point:

- For $ i = 1 $:
  - $ \hat{y}_1 = w_0 + w_1 x_1 = 0 + 0 \cdot 1 = 0 $
  - $ Error_1 = \hat{y}_1 - y_1 = 0 - 2 = -2 $ 

- For $ i = 2 $:
  - $ \hat{y}_2 = w_0 + w_1 x_2 = 0 + 0 \cdot 2 = 0 $
  - $ Error_2 = \hat{y}_2 - y_2 = 0 - 2.5 = -2.5 $

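Repeat similarly for $ i = 3, 4, 5, 6 $. Because both weights are still zero, every prediction is 0 and each error is simply $ -y_i $: $ -3 $, $ -3.5 $, $ -5 $, and $ -5.5 $.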

Compute the gradients:

$$
\frac{\partial J}{\partial w_0} = \frac{1}{6} \sum_{i=1}^6 (\text{Error}_i) = \frac{1}{6} (-2 - 2.5 - 3 - 3.5 - 5 - 5.5) = \frac{1}{6} (-21.5) = -3.5833
$$
$$
\frac{\partial J}{\partial w_1} = \frac{1}{6} \sum_{i=1}^6 (\text{Error}_i x_i) = \frac{1}{6} (-2 \cdot 1 - 2.5 \cdot 2 - 3 \cdot 3 - 3.5 \cdot 4 - 5 \cdot 5 - 5.5 \cdot 6) = \frac{1}{6} (-88) = -14.6667
$$

Update the weights:

Using a learning rate $ \alpha = 0.01 $:

$$
w_0 = w_0 - \alpha \frac{\partial J}{\partial w_0} = 0 - 0.01 \cdot (-3.5833) = 0 + 0.035833 = 0.0358
$$
$$
w_1 = w_1 - \alpha \frac{\partial J}{\partial w_1} = 0 - 0.01 \cdot (-14.6667) = 0 + 0.146667 = 0.1467
$$

### Summary
After one iteration of batch gradient descent, the updated weights are:

$$
w_0 \approx 0.0358
$$
$$
w_1 \approx 0.1467
$$

These weights are updated in the negative direction of the averaged gradient, calculated using all data instances. In subsequent iterations, these weights would be further adjusted in the same manner until the model converges to the optimal solution.
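
The single iteration above, and the repeated iterations that follow it, might be sketched as below. The helper names `cost` and `bgd_step` and the iteration count are assumptions made for this illustration.

```python
def cost(w0, w1, xs, ys):
    """Half mean squared error J(w0, w1)."""
    m = len(xs)
    return sum(((w0 + w1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def bgd_step(w0, w1, xs, ys, lr):
    """One batch update: average the gradient over all examples, then step."""
    m = len(xs)
    errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
    grad_w0 = sum(errors) / m
    grad_w1 = sum(e * x for e, x in zip(errors, xs)) / m
    return w0 - lr * grad_w0, w1 - lr * grad_w1

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 2.5, 3, 3.5, 5, 5.5]

# First iteration, starting from zero weights.
w0, w1 = bgd_step(0.0, 0.0, xs, ys, lr=0.01)
first_step = (w0, w1)
print(round(w0, 4), round(w1, 4))  # 0.0358 0.1467

# Keep iterating; 999 further steps is an arbitrary illustrative choice.
for _ in range(999):
    w0, w1 = bgd_step(w0, w1, xs, ys, lr=0.01)
print(w0, w1, cost(w0, w1, xs, ys))
```

Tracking `cost` alongside the weights is a simple way to check that the chosen learning rate is actually reducing the objective.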


# Stochastic Gradient Descent: An Overview

Stochastic Gradient Descent (SGD) is a popular optimization algorithm widely used in machine learning and deep learning for training models. It belongs to the family of gradient-based optimization techniques, similar to Batch Gradient Descent (BGD), but with a key difference in how it computes gradients and updates model parameters.

### How Stochastic Gradient Descent Works
In Stochastic Gradient Descent, the algorithm updates the model parameters based on the gradient of the cost function computed using a single randomly chosen training example at each iteration. Unlike BGD, which processes the entire dataset before every update, SGD processes one training example at a time, making each update much cheaper and the method more scalable, especially for large datasets.

### Advantages of Stochastic Gradient Descent
- **Efficiency with Large Datasets:** SGD is well-suited for large-scale datasets since it processes one training example at a time. Each update is cheap, so SGD often makes progress faster than BGD, which must process the entire dataset before every update.

- **Reduced Memory Requirements:** Since SGD only requires storing a single training example in memory at a time, it has significantly lower memory requirements compared to BGD. This makes it more memory-efficient, particularly for datasets that cannot fit into memory.

- **Escape from Local Minima:** The stochastic nature of SGD introduces noise in the optimization process, which helps the algorithm escape from local minima and explore a wider range of solutions. This property is advantageous in optimizing non-convex cost functions.

- **Online Learning:** SGD lends itself naturally to online learning scenarios, where data arrives sequentially or in streams. It can continuously update the model parameters as new data becomes available, making it suitable for real-time applications.

### Disadvantages of Stochastic Gradient Descent

- **Noisy Updates:** The stochastic nature of SGD introduces noise in the optimization process, leading to high variance in parameter updates. This can result in erratic convergence behavior, and near the minimum SGD may need many more updates than BGD to settle, especially when individual gradients are noisy.

- **Less Stable Convergence:** Due to the noisy updates, SGD may exhibit more oscillatory behavior during optimization, making the convergence trajectory less stable compared to BGD. This can complicate the selection of an appropriate learning rate and require careful tuning.

- **Potential for Convergence Issues:** While SGD can escape local minima, it may also converge to suboptimal solutions or oscillate around the minimum, especially in the presence of noisy or sparse gradients. With a fixed learning rate the parameters typically keep fluctuating near the minimum rather than settling exactly, so the learning rate is often decayed over time.

- **Learning Rate Selection:** The choice of learning rate in SGD is critical and may require careful tuning. A learning rate that is too large can lead to divergent behavior or overshooting the minimum, while a learning rate that is too small may result in slow convergence.

### Problem Setup
We'll use the same dataset with 6 data points:

$$
\{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5), (x_6, y_6)\}
$$

For simplicity, let's say:

$$
\{(1,2), (2,2.5), (3,3), (4,3.5), (5,5), (6,5.5)\}
$$

We want to fit a linear model:

$$
\hat{y} = w_0 + w_1x
$$

### Initial Weights

Let's start with initial weights $ w_0 = 0 $ and $ w_1 = 0 $.

### Gradient Calculation

In stochastic gradient descent, we update the weights for each data point individually, rather than averaging the gradients over the entire dataset. The cost function for linear regression is:

$$
J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
$$

However, in SGD, we consider the cost for a single data point at a time:

$$
J_i(w_0, w_1) = \frac{1}{2} (\hat{y}_i - y_i)^2
$$

The gradients of the cost function with respect to the weights for a single data point are:

$$
\frac{\partial J_i}{\partial w_0} = (\hat{y}_i - y_i)
$$

$$
\frac{\partial J_i}{\partial w_1} = (\hat{y}_i - y_i) x_i
$$
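
A minimal sketch of a single SGD update built from these per-example gradients follows. The function name `sgd_step` is a hypothetical helper; the example point and learning rate are the ones used in the walkthrough below.

```python
def sgd_step(w0, w1, x, y, lr):
    """Update the weights from a single example (x, y)."""
    error = (w0 + w1 * x) - y   # y_hat_i - y_i
    grad_w0 = error             # dJ_i/dw0
    grad_w1 = error * x         # dJ_i/dw1
    return w0 - lr * grad_w0, w1 - lr * grad_w1

w0, w1 = sgd_step(0.0, 0.0, x=1, y=2, lr=0.01)
print(w0, w1)  # 0.02 0.02
```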

### Step-by-Step Calculation

Iteration over each data point:

For $ i = 1 $:
  - $ \hat{y}_1 = w_0 + w_1 x_1 = 0 + 0 \cdot 1 = 0 $
  - $ Error_1 = \hat{y}_1 - y_1 = 0 - 2 = -2 $

Update the weights:

- $ \frac{\partial J_1}{\partial w_0} = -2 $
- $ \frac{\partial J_1}{\partial w_1} = -2 \cdot 1 = -2 $

Using a learning rate $ \alpha = 0.01 $:

- $ w_0 = w_0 - \alpha \frac{\partial J_1}{\partial w_0} = 0 + 0.02 = 0.02 $  
- $ w_1 = w_1 - \alpha \frac{\partial J_1}{\partial w_1} = 0 + 0.02 = 0.02 $  

For $ i = 2 $:
  - $ \hat{y}_2 = w_0 + w_1 x_2 = 0.02 + 0.02 \cdot 2 = 0.06 $
  - $ Error_2 = \hat{y}_2 - y_2 = 0.06 - 2.5 = -2.44 $

Update the weights:

- $ \frac{\partial J_2}{\partial w_0} = -2.44 $
- $ \frac{\partial J_2}{\partial w_1} = -2.44 \cdot 2 = -4.88 $  

Using a learning rate $ \alpha = 0.01 $:

- $ w_0 = 0.02 - 0.01 \cdot (-2.44) = 0.0444 $  
- $ w_1 = 0.02 - 0.01 \cdot (-4.88) = 0.0688 $  


For $ i = 3 $:
  - $ \hat{y}_3 = w_0 + w_1 x_3 = 0.0444 + 0.0688 \cdot 3 = 0.2508 \approx 0.25 $
  - $ Error_3 = \hat{y}_3 - y_3 = 0.25 - 3 = -2.75 $

Update the weights:

- $\frac{\partial J_3}{\partial w_0}=-2.75$
- $\frac{\partial J_3}{\partial w_1}=-2.75 \cdot 3 = -8.25$

Using a learning rate $\alpha=0.01$:

- $w_0 = 0.0444 - 0.01 \cdot (-2.75) = 0.0444 + 0.0275 = 0.0719$
- $w_1= 0.0688 - 0.01 \cdot (-8.25) = 0.0688 + 0.0825 = 0.1513$

For $ i = 4 $:
  - $ \hat{y}_4 = w_0 + w_1 x_4 = 0.0719 + 0.1513 \cdot 4 = 0.6771 $
  - $ Error_4 = \hat{y}_4 - y_4 = 0.6771 - 3.5 = -2.8229 $

Update the weights:

- $\frac{\partial J_4}{\partial w_0}=-2.8229$
- $\frac{\partial J_4}{\partial w_1}=-2.8229 \cdot 4 = -11.2916$

Using a learning rate $\alpha=0.01$:

- $w_0 = 0.0719 - 0.01 \cdot (-2.8229) = 0.0719 + 0.0282 = 0.1001$
- $w_1 = 0.1513-0.01 \cdot (-11.2916) = 0.1513 + 0.1129 = 0.2642$

For $ i = 5 $:
  - $ \hat{y}_5 = w_0 + w_1 x_5 = 0.1001 + 0.2642 \cdot 5 = 1.4211 $
  - $ Error_5 = \hat{y}_5 - y_5 = 1.4211 - 5 = -3.5789 $

Update the weights:

- $\frac{\partial J_5}{\partial w_0}=-3.5789$
- $\frac{\partial J_5}{\partial w_1}=-3.5789 \cdot 5 = -17.8945$

Using a learning rate $\alpha=0.01$:

- $w_0 = 0.1001 - 0.01 \cdot (-3.5789) = 0.1001 + 0.0358 = 0.1359$
- $w_1 = 0.2642-0.01 \cdot (-17.8945) = 0.2642 + 0.1789 = 0.4431$

For $ i = 6 $:
  - $ \hat{y}_6 = w_0 + w_1 x_6 = 0.1359 + 0.4431 \cdot 6 = 2.7945 $
  - $ Error_6 = \hat{y}_6 - y_6 = 2.7945 - 5.5 = -2.7055 $

Update the weights:

- $\frac{\partial J_6}{\partial w_0}=-2.7055$
- $\frac{\partial J_6}{\partial w_1}=-2.7055 \cdot 6 = -16.233$

Using a learning rate $\alpha=0.01$:

- $w_0 = 0.1359-0.01 \cdot (-2.7055) = 0.1359 + 0.0271 = 0.1630$
- $w_1 = 0.4431-0.01 \cdot (-16.233) = 0.4431 + 0.1623 = 0.6054$

### Summary
After one complete pass (epoch) through the dataset, the updated weights (with intermediate values rounded to four decimal places) are:

- $w_0 \approx 0.1630$
- $w_1 \approx 0.6054$

These weights are updated individually for each data point, and this process is repeated for multiple epochs until the model converges to the optimal solution. Because it updates the weights after every example (here, six updates in one epoch versus one for batch gradient descent), stochastic gradient descent often makes faster progress per pass over the data, but its updates are noisier due to the variability of the individual data points.
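
The per-example updates walked through above can be sketched as a loop over the dataset. The function name `sgd_epoch`, the optional `rng` argument, and the seed are assumptions made for illustration. The walkthrough visits the points in their listed order, while shuffling is the more usual choice in practice.

```python
import random

def sgd_epoch(w0, w1, data, lr, rng=None):
    """One pass over data, updating the weights after every example.

    If rng is given, the visiting order is shuffled first; otherwise the
    examples are visited in their listed order, as in the walkthrough.
    """
    order = list(data)
    if rng is not None:
        rng.shuffle(order)
    for x, y in order:
        error = (w0 + w1 * x) - y
        # Both new weights are computed from the old values of w0 and w1.
        w0, w1 = w0 - lr * error, w1 - lr * error * x
    return w0, w1

data = [(1, 2), (2, 2.5), (3, 3), (4, 3.5), (5, 5), (6, 5.5)]

# In-order pass, no shuffling.
w0, w1 = sgd_epoch(0.0, 0.0, data, lr=0.01)
print(w0, w1)

# Shuffled pass; the seed value is arbitrary.
w0_s, w1_s = sgd_epoch(0.0, 0.0, data, lr=0.01, rng=random.Random(0))
print(w0_s, w1_s)
```

The code keeps full floating-point precision, so its in-order result can differ from the hand calculation in the last digit or so, since the hand calculation rounds at every step.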


# Mini-Batch Gradient Descent: An Overview

Mini-batch gradient descent is a variant of the gradient descent optimization algorithm used in machine learning and deep learning. Unlike traditional gradient descent, which updates the model parameters using the gradient computed from the entire training dataset (batch gradient descent), mini-batch gradient descent computes the gradient and updates the parameters using small random subsets of the training data called mini-batches.

### How Mini-Batch Gradient Descent Works

- **Data Partitioning:** The training dataset is divided into smaller subsets called mini-batches. These mini-batches typically contain a fixed number of data samples, such as 32, 64, or 128 examples.

- **Gradient Computation:** For each mini-batch, the gradient of the loss function with respect to the model parameters is computed using only the examples in that mini-batch.

- **Parameter Update:** After computing the gradient for a mini-batch, the model parameters are updated with a gradient-based update rule: either the plain gradient descent step or an adaptive variant such as Adam or RMSprop.

- **Iterative Optimization:** This process is repeated for every mini-batch in the training dataset (one full pass is called an epoch), and then for multiple epochs until convergence, with each mini-batch contributing to the overall optimization of the model parameters.
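
The data-partitioning step in the list above might look like the following sketch. The generator name `iter_minibatches`, the toy data `range(10)`, the batch size, and the seed are all hypothetical choices for illustration.

```python
import random

def iter_minibatches(data, batch_size, rng=None):
    """Yield consecutive mini-batches; shuffle the data first if rng is given."""
    items = list(data)
    if rng is not None:
        rng.shuffle(items)
    for start in range(0, len(items), batch_size):
        # The final batch may be smaller when len(items) is not a multiple
        # of batch_size.
        yield items[start:start + batch_size]

batches = list(iter_minibatches(range(10), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Reshuffling each epoch gives differently composed batches; seed is arbitrary.
shuffled = list(iter_minibatches(range(10), batch_size=4, rng=random.Random(0)))
print(shuffled)
```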

### Advantages of Mini-Batch Gradient Descent

- **Efficiency:** Mini-batch gradient descent strikes a balance between the efficiency of stochastic gradient descent (SGD) and the stability of batch gradient descent. By using mini-batches, it allows for parallelization and vectorized operations, leading to faster convergence compared to batch gradient descent.

- **Memory Efficiency:** Mini-batch gradient descent requires less memory than batch gradient descent since it processes only a subset of the training data at each iteration. This makes it suitable for training deep learning models on large datasets that may not fit entirely into memory.

- **Regularization Effect:** The noise introduced by mini-batch updates can act as a form of regularization, preventing the model from overfitting to the training data. This stochasticity in the updates helps the model generalize better to unseen data.

- **Better Convergence:** Mini-batch gradient descent often converges faster than batch gradient descent due to more frequent parameter updates. It can navigate complex loss landscapes more efficiently and escape local minima more easily.

### Disadvantages of Mini-Batch Gradient Descent

- **Hyperparameter Tuning:** Mini-batch gradient descent introduces additional hyperparameters, such as the mini-batch size and learning rate schedule, that need to be tuned. Finding the optimal values for these hyperparameters can require extensive experimentation.

- **Sensitivity to Mini-Batch Size:** The choice of mini-batch size can impact the convergence speed and stability of mini-batch gradient descent. Small mini-batches may introduce high variance in the parameter updates, while large mini-batches may slow down convergence.

- **Potential for Poor Generalization:** Each mini-batch gives only a noisy estimate of the full gradient. When mini-batches are not representative of the overall dataset (for example, if the data is sorted and never shuffled), individual updates can be skewed, which may hurt generalization performance on unseen data.

- **Complexity:** Implementing mini-batch gradient descent requires additional programming complexity compared to batch gradient descent. Handling mini-batch loading, synchronization, and parallelization adds overhead to the training process.

### Problem Setup

We'll use the same dataset with 6 data points:

$$ 
\{ (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5), (x_6, y_6) \}
$$

For simplicity, let's say:

$$
\{ (1,2), (2,2.5), (3,3), (4,3.5), (5,5), (6,5.5) \}
$$

We want to fit a linear model:

$$
\hat{y} = w_0 + w_1x
$$

### Initial Weights

Let's start with initial weights $ w_0 = 0 $ and $ w_1 = 0 $.

### Gradient Calculation

The cost function for linear regression is:

$$
J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
$$

In mini-batch gradient descent, we update the weights based on small random subsets (mini-batches) of the dataset rather than using the entire dataset (as in batch gradient descent) or a single data point (as in stochastic gradient descent).

### Mini-Batch Size

Let's choose a mini-batch size of 2 for this example. This means we'll update the weights after processing every 2 data points.

### Step-by-Step Calculation

1. Divide the dataset into mini-batches:

- Mini-batch 1: $ \{ (x_1, y_1), (x_2, y_2) \} = \{ (1,2), (2,2.5) \} $

- Mini-batch 2: $ \{ (x_3, y_3), (x_4, y_4) \} = \{ (3,3), (4,3.5) \} $

- Mini-batch 3: $ \{ (x_5, y_5), (x_6, y_6) \} = \{ (5,5), (6,5.5) \} $

2. Process each mini-batch:

- Mini-batch 1:

 - For $ i=1 $ and $ i=2 $:
    - $ x_1 = 1, y_1 = 2 $
    - $ x_2 = 2, y_2 = 2.5 $

Compute predictions and errors:

- $ \hat{y}_1 = w_0 + w_1x_1 = 0 + 0 \cdot 1 = 0 $
- $ Error_1 = \hat{y}_1 - y_1 = 0 - 2 = -2 $

- $ \hat{y}_2 = w_0 + w_1x_2 = 0 + 0 \cdot 2 = 0 $
- $ Error_2 = \hat{y}_2 - y_2 = 0 - 2.5 = -2.5 $

Compute gradients:

$ \frac{\partial J_{\text{mini}}}{\partial w_0} = \frac{1}{2} \sum_{i=1}^{2} \text{Error}_i = \frac{1}{2} (-2 - 2.5) = \frac{1}{2} (-4.5) = -2.25 $  
$\frac{\partial J_{\text{mini}}}{\partial w_1} = \frac{1}{2} \sum_{i=1}^{2} \text{Error}_i x_i = \frac{1}{2} (-2 \cdot 1 - 2.5 \cdot 2) = \frac{1}{2} (-2 - 5) = \frac{1}{2} (-7) = -3.5 $

Update the weights using a learning rate $ \alpha = 0.01 $:

$ w_0 = w_0 - \alpha \frac{\partial J_{\text{mini}}}{\partial w_0} = 0 - 0.01 \cdot (-2.25) = 0 + 0.0225 = 0.0225 $  
$ w_1 = w_1 - \alpha \frac{\partial J_{\text{mini}}}{\partial w_1} = 0 - 0.01 \cdot (-3.5) = 0 + 0.035 = 0.035 $

- Mini-batch 2:

 - For $ i=3 $ and $ i=4 $:
    - $ x_3 = 3, y_3 = 3 $
    - $ x_4 = 4, y_4 = 3.5 $

Compute predictions and errors:

- $ \hat{y}_3 = w_0 + w_1x_3 = 0.0225 + 0.035 \cdot 3 = 0.1275 $
- $ Error_3 = \hat{y}_3 - y_3 = 0.1275 - 3 = -2.8725 $

- $ \hat{y}_4 = w_0 + w_1x_4 = 0.0225 + 0.035 \cdot 4 = 0.1625 $
- $ Error_4 = \hat{y}_4 - y_4 = 0.1625 - 3.5 = -3.3375 $

Compute gradients:

$ \frac{\partial J_{\text{mini}}}{\partial w_0} = \frac{1}{2} \sum_{i=3}^{4} \text{Error}_i = \frac{1}{2} (-2.8725 - 3.3375) = \frac{1}{2} (-6.21) = -3.105 $  
$\frac{\partial J_{\text{mini}}}{\partial w_1} = \frac{1}{2} \sum_{i=3}^{4} \text{Error}_i x_i = \frac{1}{2} (-2.8725 \cdot 3 - 3.3375 \cdot 4) = \frac{1}{2} (-8.6175 - 13.35) = \frac{1}{2} (-21.9675) = -10.98375 $

Update the weights using a learning rate $ \alpha = 0.01 $:

$ w_0 = w_0 - \alpha \frac{\partial J_{\text{mini}}}{\partial w_0} = 0.0225 - 0.01 \cdot (-3.105) = 0.0225 + 0.03105 = 0.05355 $  
$ w_1 = w_1 - \alpha \frac{\partial J_{\text{mini}}}{\partial w_1} = 0.035 - 0.01 \cdot (-10.98375) = 0.035 + 0.1098375 = 0.1448375 $


- Mini-batch 3:

 - For $ i=5 $ and $ i=6 $:
    - $ x_5 = 5, y_5 = 5 $
    - $ x_6 = 6, y_6 = 5.5 $

Compute predictions and errors:

- $ \hat{y}_5 = w_0 + w_1x_5 = 0.05355 + 0.1448375 \cdot 5 = 0.7777375 $
- $ Error_5 = \hat{y}_5 - y_5 = 0.7777375 - 5 = -4.2222625 $

- $ \hat{y}_6 = w_0 + w_1x_6 = 0.05355 + 0.1448375 \cdot 6 = 0.922575 $
- $ Error_6 = \hat{y}_6 - y_6 = 0.922575 - 5.5 = -4.577425 $

Compute gradients:

$ \frac{\partial J_{\text{mini}}}{\partial w_0} = \frac{1}{2} \sum_{i=5}^{6} \text{Error}_i = \frac{1}{2} (-4.2222625 - 4.577425) = \frac{1}{2} (-8.7996875) = -4.39984375 $  
$\frac{\partial J_{\text{mini}}}{\partial w_1} = \frac{1}{2} \sum_{i=5}^{6} \text{Error}_i x_i = \frac{1}{2} (-4.2222625 \cdot 5 - 4.577425 \cdot 6) = \frac{1}{2} (-21.1113125 - 27.46455) = \frac{1}{2} (-48.5758625) = -24.28793125 $

Update the weights using a learning rate $ \alpha = 0.01 $:

$ w_0 = w_0 - \alpha \frac{\partial J_{\text{mini}}}{\partial w_0} = 0.05355 - 0.01 \cdot (-4.39984375) = 0.05355 + 0.0439984375 = 0.0975484375 $  
$ w_1 = w_1 - \alpha \frac{\partial J_{\text{mini}}}{\partial w_1} = 0.1448375 - 0.01 \cdot (-24.28793125) = 0.1448375 + 0.2428793125 = 0.3877168125 $

### Summary

After one complete pass (epoch) through the dataset, the updated weights are:

$ w_0 \approx 0.0975 $  
$ w_1 \approx 0.3877 $

These weights are updated after processing each mini-batch. This process is repeated for multiple epochs until the model converges to the optimal solution. Mini-batch gradient descent balances the frequent updates of stochastic gradient descent and the stability of batch gradient descent.
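
One epoch of the procedure above can be sketched in a few lines. The function name `minibatch_epoch` is a hypothetical helper, and the batches are taken in dataset order (no shuffling) to mirror the walkthrough.

```python
def minibatch_epoch(w0, w1, data, batch_size, lr):
    """One in-order pass over data, updating once per mini-batch."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        errors = [(w0 + w1 * x) - y for x, y in batch]
        # Average the per-example gradients within the mini-batch.
        grad_w0 = sum(errors) / len(batch)
        grad_w1 = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        w0, w1 = w0 - lr * grad_w0, w1 - lr * grad_w1
    return w0, w1

data = [(1, 2), (2, 2.5), (3, 3), (4, 3.5), (5, 5), (6, 5.5)]

w0, w1 = minibatch_epoch(0.0, 0.0, data, batch_size=2, lr=0.01)
print(w0, w1)

# The batch size interpolates between the other two variants.
b0, b1 = minibatch_epoch(0.0, 0.0, data, batch_size=len(data), lr=0.01)
s0, s1 = minibatch_epoch(0.0, 0.0, data, batch_size=1, lr=0.01)
print(b0, b1, s0, s1)
```

With `batch_size=len(data)` the function performs a single batch gradient descent update, and with `batch_size=1` it reduces to per-example stochastic updates.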


### Choosing the Mini-Batch Size

1. **Error Calculation**:
   - Treating the mini-batch gradient as a sample mean, the standard error of a mean estimated from $N$ independent samples is $\sigma / \sqrt{N}$, where $\sigma$ is the standard deviation of the data distribution.
   
2. **Diminishing Returns**:
   - Increasing the mini-batch size has diminishing returns on error reduction.
   - For example, increasing the mini-batch size by 100 times only reduces the error by 10 times.
   
3. **Hardware Efficiency**:
   - Choosing the mini-batch size should also consider the efficiency of the hardware architecture.
   - Some hardware platforms perform better with mini-batch sizes that are powers of 2 (e.g., 64, 128, 256).
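
The diminishing returns in point 2 follow directly from the $\sigma / \sqrt{N}$ formula, as this small sketch shows. The function name `standard_error` and the value `sigma = 1.0` are hypothetical choices for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of a mean estimated from n independent samples."""
    return sigma / math.sqrt(n)

sigma = 1.0  # hypothetical standard deviation, chosen only for illustration
for n in (1, 100, 10_000):
    # Each 100-fold increase in n shrinks the error only 10-fold.
    print(n, standard_error(sigma, n))
```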

