# 🔹 Gradient Descent Optimizer

## 1. Theory

**Gradient Descent** is the most fundamental optimization algorithm used in training neural networks.  
It updates model parameters (weights) in the direction **opposite to the gradient** of the cost function to minimize the loss.

---

## 2. Formula

For weight \( w \) at iteration \( t \):
$$
w^{(t+1)} = w^{(t)} - \eta \frac{\partial J}{\partial w}
$$

Where:
- \( w \) → weight
- \( \eta \) → learning rate (controls step size)
- \( \frac{\partial J}{\partial w} \) → gradient of cost function \( J \) w.r.t \( w \)
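
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the toy cost \( J(w) = (w - 3)^2 \), and the values of `eta` and `steps` are assumptions chosen for the example.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply w <- w - eta * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Hypothetical cost J(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)  # very close to the minimizer w = 3
```

On this cost each step shrinks the distance to the minimum by a factor of \( 1 - 2\eta = 0.8 \), which is why 100 steps are more than enough.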

---

## 3. Types of Gradient Descent

| **Type**                        | **Description**                                                         | **Use Case**                               |
|---------------------------------|------------------------------------------------------------------------|-------------------------------------------|
| **Batch Gradient Descent**      | Uses the **entire dataset** to compute gradients per iteration.       | Small datasets where computation is feasible. |
| **Stochastic Gradient Descent (SGD)** | Uses **one sample** at a time for weight updates.                    | Large datasets, online learning.          |
| **Mini-Batch Gradient Descent** | Uses a **small batch** of samples per update.                         | Most widely used in deep learning.        |
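
One way to see the table is that the three variants differ only in how many samples feed each update. The sketch below counts updates per epoch for each variant; the helper `iterate_batches` and the dataset size of 1000 are hypothetical and used only for illustration.

```python
def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index ranges that cover n_samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n = 1000  # hypothetical dataset size
updates_per_epoch = {
    "batch": len(list(iterate_batches(n, batch_size=n))),       # whole dataset per update
    "stochastic": len(list(iterate_batches(n, batch_size=1))),  # one sample per update
    "mini-batch": len(list(iterate_batches(n, batch_size=32))), # small batch per update
}
print(updates_per_epoch)  # {'batch': 1, 'stochastic': 1000, 'mini-batch': 32}
```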

---

## 4. Advantages and Disadvantages

| **Advantages**                                        | **Disadvantages**                                      |
|------------------------------------------------------|------------------------------------------------------|
| Simple to understand and implement.                  | Sensitive to choice of learning rate \( \eta \).     |
| Converges to the global minimum on convex functions, provided the learning rate is small enough. | Can get stuck in local minima or saddle points (for non-convex problems). |
| Forms the basis for most advanced optimizers.        | Slow convergence for deep networks.                  |

---

## 5. Interview Questions and Answers

### **Q1: Why is learning rate important in Gradient Descent?**
**Answer:**  
- A very high \( \eta \) → may overshoot minima.  
- A very low \( \eta \) → very slow convergence.  
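
A small experiment makes both failure modes concrete. The sketch assumes a toy cost \( J(w) = w^2 \) (gradient \( 2w \)); the function name `run_gd` and the three learning rates are illustrative choices, not recommended settings.

```python
def run_gd(eta, w0=1.0, steps=50):
    """Minimize the toy cost J(w) = w**2 (gradient 2 * w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta}: |w| after 50 steps = {abs(run_gd(eta)):.3e}")
```

On this cost every step multiplies \( w \) by \( 1 - 2\eta \): 0.98 for \( \eta = 0.01 \) (slow progress), 0.2 for \( \eta = 0.4 \) (fast convergence), and -1.2 for \( \eta = 1.1 \) (the iterate overshoots further on every step and diverges).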

---

### **Q2: Why is Mini-Batch Gradient Descent preferred in deep learning?**
**Answer:**  
- It balances **computational efficiency** and **convergence stability**, making it the standard choice.

---

## ✅ Conclusion
Gradient Descent is the **foundation** of optimization in machine learning.  
Widely used optimizers such as SGD with Momentum, RMSProp, and Adam are **extensions** of Gradient Descent that improve convergence speed and stability.


# 🔹 Stochastic Gradient Descent (SGD)

## 1. Theory

**Stochastic Gradient Descent (SGD)** is a variant of Gradient Descent where the model parameters are updated **for each training example** rather than using the entire dataset.  
This introduces **stochasticity** (randomness), which can help escape local minima but also causes noisy updates.

---

## 2. Formula

For weight \( w \) at iteration \( t \):
$$
w^{(t+1)} = w^{(t)} - \eta \frac{\partial J(w; x^{(i)}, y^{(i)})}{\partial w}
$$

Where:
- \( (x^{(i)}, y^{(i)}) \) → single training example
- \( \eta \) → learning rate
- \( \frac{\partial J}{\partial w} \) → gradient of loss for that single example

---

## 3. Workflow

1. Shuffle dataset.
2. For each sample \( i \), compute gradient \( \nabla J(w; x^{(i)}, y^{(i)}) \).
3. Update weights using the formula.
4. Repeat for multiple epochs.
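
The four steps above can be sketched with NumPy on a one-dimensional linear regression. Everything specific here is an assumption made for illustration: the synthetic data (\( y = 2x + 1 \) plus small noise), the per-sample loss \( \tfrac{1}{2}(\hat{y} - y)^2 \), the learning rate, and the number of epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2x + 1 plus small noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
eta = 0.05

for epoch in range(20):                    # step 4: repeat for multiple epochs
    order = rng.permutation(len(X))        # step 1: shuffle the dataset
    for i in order:
        error = (w * X[i] + b) - y[i]      # prediction error on sample i
        grad_w = error * X[i]              # step 2: gradient of 0.5 * error**2 w.r.t. w
        grad_b = error                     #         and w.r.t. b
        w -= eta * grad_w                  # step 3: update the parameters
        b -= eta * grad_b

print(w, b)  # should land near the true values 2 and 1
```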

---

## 4. Advantages and Disadvantages

| **Advantages**                                              | **Disadvantages**                                      |
|------------------------------------------------------------|------------------------------------------------------|
| Faster updates since only one sample is processed at a time.| Updates are noisy; loss fluctuates.                   |
| Helps escape local minima due to randomness.               | May have difficulty converging to the exact minimum.  |
| Suitable for large datasets (online learning).             | Requires careful tuning of learning rate.             |

---

## 5. Improvements Over SGD

- ✅ **Mini-Batch SGD** → Uses a batch of data to reduce noise while maintaining efficiency.  
- ✅ **SGD with Momentum** → Adds a momentum term to smooth updates and accelerate convergence.

---

## 6. Interview Questions and Answers

### **Q1: Why is SGD faster than Batch Gradient Descent?**
**Answer:**  
- Each update needs the gradient of only one sample, so parameters change after every sample instead of once per pass over the entire dataset. Individual updates are cheap, although noisier.

---

### **Q2: Why does SGD sometimes fail to converge?**
**Answer:**  
- The updates are noisy; without a decaying learning rate, it may oscillate around the minimum.
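
The effect of a decaying learning rate can be seen on a deliberately simple problem. The sketch below is an illustration under stated assumptions: the per-sample loss is \( \tfrac{1}{2}(w - x)^2 \), the samples are synthetic noisy observations, and the schedule \( \eta_t = 1/(t+1) \) is just one possible decay, not a general recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=4.0, scale=1.0, size=1000)  # noisy observations centered near 4

def sgd_mean(samples, schedule):
    """SGD on the per-sample loss 0.5 * (w - x)**2, whose gradient is (w - x)."""
    w = 0.0
    for t, x in enumerate(samples):
        w -= schedule(t) * (w - x)
    return w

w_constant = sgd_mean(samples, lambda t: 0.5)            # fixed learning rate
w_decaying = sgd_mean(samples, lambda t: 1.0 / (t + 1))  # learning rate decays as 1 / (t + 1)

print("sample mean  :", samples.mean())
print("constant eta :", w_constant)
print("decaying eta :", w_decaying)
```

With the decaying schedule the iterate is exactly the running average of the samples seen so far, so it settles on the sample mean. With the constant rate the last few noisy samples keep pulling the iterate around, so it hovers near the minimum without settling.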

---

## ✅ Conclusion
- **SGD** is widely used for training large-scale machine learning models.  
- In practice, **Mini-Batch SGD** with optimizers like **Momentum** or **Adam** is preferred for deep learning.


# 🔹 Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD)

## 1. Theory

**Mini-Batch SGD** sits between **Batch Gradient Descent** and **Stochastic Gradient Descent (SGD)**.  
Instead of updating weights for each sample (SGD) or using the entire dataset (Batch GD), it uses a **small batch** of data (e.g., 32, 64, 128 samples) to compute each gradient.

This combines the **stability** of Batch Gradient Descent with the **efficiency** of SGD.

---

## 2. Formula

For weight \( w \) at iteration \( t \):
$$
w^{(t+1)} = w^{(t)} - \eta \frac{1}{B} \sum_{i=1}^{B} \frac{\partial J(w; x^{(i)}, y^{(i)})}{\partial w}
$$

Where:
- \( B \) → batch size (number of samples in a mini-batch)
- \( (x^{(i)}, y^{(i)}) \) → training samples in the batch
- \( \eta \) → learning rate

---

## 3. Workflow

1. Divide dataset into mini-batches.
2. For each batch:
   - Compute average gradient over the batch.
   - Update weights using the computed gradient.
3. Repeat for multiple epochs.
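
A vectorized NumPy sketch of this workflow is shown below. The synthetic data, the mean-squared-error style loss \( \tfrac{1}{2}(\hat{y} - y)^2 \), the batch size of 32, and the learning rate are all assumptions picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = X @ [1.5, -2.0] + 0.5 plus small noise.
X = rng.normal(size=(512, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.5 + 0.01 * rng.normal(size=512)

w = np.zeros(2)
b = 0.0
eta = 0.1
batch_size = 32

for epoch in range(50):                             # step 3: repeat for multiple epochs
    order = rng.permutation(len(X))                 # reshuffle before splitting
    for start in range(0, len(X), batch_size):      # step 1: divide into mini-batches
        idx = order[start:start + batch_size]
        error = X[idx] @ w + b - y[idx]             # residuals, shape (B,)
        grad_w = X[idx].T @ error / len(idx)        # step 2: average gradient over the batch
        grad_b = error.mean()
        w -= eta * grad_w                           #         then update the weights
        b -= eta * grad_b

print(w, b)  # should land near [1.5, -2.0] and 0.5
```

The matrix products over `X[idx]` are what make mini-batches map well onto vectorized hardware: one batch costs a single matrix operation rather than `batch_size` separate ones.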

---

## 4. Advantages and Disadvantages

| **Advantages**                                                   | **Disadvantages**                                      |
|------------------------------------------------------------------|------------------------------------------------------|
| Reduces noise compared to SGD.                                   | Slightly more complex than plain SGD.                |
| Faster training compared to Batch Gradient Descent.             | Still may oscillate if learning rate is not tuned.   |
| Efficiently utilizes **vectorized operations (GPU-friendly)**.  | Choice of batch size affects performance.            |

---

## 5. Why Is Mini-Batch Preferred in Deep Learning?

- ✅ Often **generalizes better** than Batch GD, because the gradient noise acts as a mild regularizer.  
- ✅ **Faster convergence** than SGD due to reduced noise.  
- ✅ Allows **parallel computation** using GPUs.  
- ✅ Standard practice in deep learning frameworks (TensorFlow, PyTorch).

---

## 6. Interview Questions and Answers

### **Q1: What is the typical size of a mini-batch?**
**Answer:**  
- Common sizes are **32, 64, 128**. It depends on hardware and dataset.

---

### **Q2: Why is Mini-Batch SGD better than Batch GD or SGD?**
**Answer:**  
- It combines the **computational efficiency** of batch processing with the **regularization effect** (noise) of SGD, which tends to give faster and more stable convergence.

---

## ✅ Conclusion
- **Mini-batch** gradient estimation is the **default** way to train most deep learning models.  
- It serves as the **foundation** for advanced optimizers like **Adam** and **RMSProp**, which are normally run on mini-batches.


# 🔹 SGD with Momentum

## 1. Theory

**Stochastic Gradient Descent with Momentum** is an enhanced version of SGD.  
It introduces a **momentum term** that helps the optimizer:

- Accelerate in the direction of consistent gradients.  
- Reduce oscillations, especially in areas with steep and narrow curves.  

This idea is inspired by **physics**: the optimizer behaves like a ball rolling down a hill, gaining speed in the right direction.

---

## 2. Formula

SGD with Momentum updates weights using both the current gradient and the past update:

$$
v_t = \beta v_{t-1} + \eta \frac{\partial J}{\partial w}
$$

$$
w^{(t+1)} = w^{(t)} - v_t
$$

Where:
- \( v_t \) → velocity term (accumulated gradient)
- \( \beta \) → momentum coefficient (commonly 0.9)
- \( \eta \) → learning rate
- \( \frac{\partial J}{\partial w} \) → gradient of cost function
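
The two equations translate directly into a single update step. The sketch below is a minimal illustration; the function name `momentum_step`, the toy cost \( J(w) = w^2 \), and the hyperparameter values are assumptions made for the example.

```python
def momentum_step(w, v, grad, eta=0.1, beta=0.9):
    """One update: v_t = beta * v_{t-1} + eta * grad, then w <- w - v_t."""
    v_new = beta * v + eta * grad
    w_new = w - v_new
    return w_new, v_new

# Toy cost J(w) = w**2 with gradient 2 * w; start away from the minimum at 0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2.0 * w)

print(w)  # close to 0
```

Note that the velocity `v` has to be carried between steps, which is the only extra state compared with plain SGD.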

---

## 3. Intuition

- **Plain SGD** → can oscillate back and forth in steep valleys.  
- **SGD with Momentum** → accumulates gradients, allowing **faster convergence** and **less oscillation**.
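
The difference can be checked on a toy "ravine": a quadratic cost that is steep in one direction and shallow in the other. The cost \( J(w) = \tfrac{1}{2}(100 w_0^2 + w_1^2) \), the learning rate, and the step count below are assumptions chosen to make the effect visible; the result applies to this example, not to every problem.

```python
import numpy as np

def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * (100 * w[0]**2 + w[1]**2)."""
    return np.array([100.0 * w[0], w[1]])

def run(beta, eta=0.015, steps=100):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + eta * grad(w)   # beta = 0 recovers plain gradient descent
        w = w - v
    return np.linalg.norm(w)           # distance from the minimum at the origin

dist_plain = run(beta=0.0)
dist_momentum = run(beta=0.9)
print(dist_plain, dist_momentum)
```

Without momentum, the steep coordinate flips sign on every step while the shallow coordinate shrinks by only 1.5% per step, so progress is slow. With \( \beta = 0.9 \) the accumulated velocity speeds up the shallow direction, and the run ends much closer to the minimum after the same number of steps.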

---

## 4. Advantages and Disadvantages

| **Advantages**                                              | **Disadvantages**                                      |
|------------------------------------------------------------|------------------------------------------------------|
| Accelerates learning in the correct direction.             | Requires tuning of momentum parameter \( \beta \).   |
| Reduces oscillations in narrow ravines.                    | May overshoot if \( \beta \) is too high.            |
| Often converges faster than plain SGD.                     | Slightly more compute and memory: one velocity value is stored per parameter. |

---

## 5. Interview Questions and Answers

### **Q1: What is the role of the momentum term \( \beta \)?**
**Answer:**  
- \( \beta \) controls how much of the past gradients influence the current update.  
- Common value: **0.9** (meaning 90% of the previous velocity is carried into the current update).

---

### **Q2: How does momentum help in optimization?**
**Answer:**  
- It smooths out the updates, dampens oscillations, and can help the optimizer roll through shallow local minima.

---

## ✅ Conclusion
- **SGD with Momentum** is a powerful improvement over SGD.  
- It forms the basis for **Nesterov Accelerated Gradient (NAG)** and other advanced optimizers.
