# 🔹 AdaGrad (Adaptive Gradient Optimizer)

## 1. Theory

**AdaGrad (Adaptive Gradient)** is an optimizer that **adapts the learning rate** for each parameter individually, based on the accumulated squared gradients that parameter has received.  
- Parameters with **frequent (or large) gradient updates** → get **smaller learning rates**.  
- Parameters with **rare (or small) gradient updates** → retain **larger learning rates**.

This makes AdaGrad particularly effective for **sparse data** (e.g., NLP, recommender systems).

---

## 2. Formula

Weight update rule:

$$
g_t = \frac{\partial J}{\partial w_t}
$$

$$
G_t = G_{t-1} + g_t^2
$$

$$
w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t
$$

Where:
- \( g_t \) → gradient at time \( t \)
- \( G_t \) → sum of squares of past gradients (per parameter)
- \( \eta \) → initial learning rate
- \( \epsilon \) → small constant to prevent division by zero
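
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `adagrad_step`, the list-based parameters, the toy objective \( J(w) = w_0^2 + w_1^2 \), and the values `lr=0.1` and `eps=1e-8` are all assumptions chosen for the example.

```python
import math

def adagrad_step(w, grad, G, lr=0.1, eps=1e-8):
    """One AdaGrad update on lists of parameters; returns (new_w, new_G)."""
    # G_t = G_{t-1} + g_t^2, kept separately for every parameter
    new_G = [G_i + g_i ** 2 for G_i, g_i in zip(G, grad)]
    # w_{t+1} = w_t - lr / (sqrt(G_t) + eps) * g_t
    new_w = [w_i - lr / (math.sqrt(G_i) + eps) * g_i
             for w_i, G_i, g_i in zip(w, new_G, grad)]
    return new_w, new_G

# Toy objective (assumed for illustration): J(w) = w0^2 + w1^2, gradient 2 * w.
w, G = [1.0, -2.0], [0.0, 0.0]
for _ in range(100):
    grad = [2.0 * w_i for w_i in w]
    w, G = adagrad_step(w, grad, G)
```

On the very first step \( \sqrt{G_1} = |g_1| \), so every parameter moves by roughly \( \eta \) regardless of the gradient's scale; later steps shrink as \( G_t \) accumulates.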

---

## 3. Intuition

- AdaGrad maintains a separate **learning rate** for every parameter.  
- Frequently updated parameters shrink in step size, preventing overshooting.  
- Rarely updated parameters keep larger steps, helping in sparse features.
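
A small numeric sketch can make this concrete. It assumes two hypothetical features, one with a gradient of 1.0 on every step and one with a gradient of 1.0 only on every 10th step, and compares the effective learning rate \( \eta / (\sqrt{G_t} + \epsilon) \) after 100 steps. The gradient values and `lr=0.1` are illustrative assumptions.

```python
import math

lr, eps = 0.1, 1e-8
G_frequent, G_rare = 0.0, 0.0

for step in range(1, 101):
    g_frequent = 1.0                          # feature active on every step
    g_rare = 1.0 if step % 10 == 0 else 0.0   # feature active on every 10th step
    G_frequent += g_frequent ** 2
    G_rare += g_rare ** 2

# Effective per-parameter learning rate after 100 steps
eff_lr_frequent = lr / (math.sqrt(G_frequent) + eps)
eff_lr_rare = lr / (math.sqrt(G_rare) + eps)
print(f"frequent: {eff_lr_frequent:.4f}, rare: {eff_lr_rare:.4f}")
```

Here the frequent feature accumulates \( G = 100 \) and the rare one \( G = 10 \), so the rare feature keeps an effective learning rate about \( \sqrt{10} \approx 3.2 \) times larger.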

---

## 4. Advantages and Disadvantages

| **Advantages**                                        | **Disadvantages**                                      |
|------------------------------------------------------|------------------------------------------------------|
| Works well for sparse data (e.g., NLP, text classification). | The effective learning rate only ever decreases, so training can stall before converging. |
| Automatically adjusts learning rates for each parameter.     | Cannot recover once the learning rate becomes too small.  |
| Requires less manual tuning of the learning rate.            | Not ideal for deep networks due to aggressive decay. |

---

## 5. Use Cases

- ✅ Natural Language Processing (NLP)  
- ✅ Sparse feature problems (e.g., recommendation systems)  
- ✅ Logistic Regression with sparse inputs  

---

## 6. Interview Questions and Answers

### **Q1: Why is AdaGrad good for sparse data?**
**Answer:**  
- It maintains **larger learning rates** for infrequently updated parameters, ensuring they still learn effectively.

---

### **Q2: What is the main limitation of AdaGrad?**
**Answer:**  
- The accumulated squared gradients \( G_t \) keep growing, causing the learning rate to shrink to near zero over time, which may stop learning prematurely.

---

## ✅ Conclusion
- **AdaGrad** is adaptive and great for sparse features.  
- However, its **learning rate decay** is too aggressive, which led to the development of **RMSProp** and **Adam**.


# 🔹 AdaDelta & RMSProp Optimizers

---

## 1. RMSProp (Root Mean Square Propagation)

### ✅ Theory
- **RMSProp** is an improvement over AdaGrad.
- It solves AdaGrad’s problem of **aggressively decreasing learning rates** by using an **exponentially decaying average** of past squared gradients.
- Keeps the effective learning rate more stable throughout training, because old squared gradients are gradually forgotten.

---

### ✅ Formula

For parameter \( w \):

$$
E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
$$

$$
w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t
$$

Where:
- \( \gamma \) → decay rate (commonly 0.9)
- \( E[g^2]_t \) → moving average of squared gradients
- \( \eta \) → learning rate
- \( \epsilon \) → small constant to prevent division by zero
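
As a rough sketch, the two RMSProp equations translate to plain Python as follows. The function name `rmsprop_step`, the list-based parameters, and the defaults `lr=0.01` and `eps=1e-8` are assumptions for illustration; only `gamma=0.9` comes from the text above.

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update on lists of parameters; returns (new_w, new_avg_sq)."""
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    new_avg = [gamma * a + (1 - gamma) * g ** 2 for a, g in zip(avg_sq, grad)]
    # w_{t+1} = w_t - lr / (sqrt(E[g^2]_t) + eps) * g_t
    new_w = [w_i - lr / (math.sqrt(a) + eps) * g
             for w_i, a, g in zip(w, new_avg, grad)]
    return new_w, new_avg

# Single step from w = 1.0 with gradient 2.0 and an empty running average
w, avg_sq = rmsprop_step([1.0], [2.0], [0.0])
```

The only difference from an AdaGrad step is the first line of the body: a decaying average replaces the ever-growing sum.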

---

### ✅ Advantages and Disadvantages

| **Advantages**                                       | **Disadvantages**                                      |
|-----------------------------------------------------|------------------------------------------------------|
| Solves AdaGrad’s learning rate decay problem.       | Requires tuning of decay rate \( \gamma \).          |
| Performs well on non-stationary problems (e.g., RNNs). | Can still oscillate without momentum.                |
| Commonly used for training deep networks.           | Stores one running average per parameter, so slightly more memory and compute than plain SGD. |

---

### ✅ Use Cases
- ✅ Recurrent Neural Networks (RNNs)  
- ✅ Non-stationary data (changing patterns)  
- ✅ Deep learning tasks where AdaGrad fails  

---

## 2. AdaDelta

### ✅ Theory
- **AdaDelta** is an **extension of AdaGrad** that also fixes the **decaying learning rate problem**.
- Unlike RMSProp, it **does not require a manually set learning rate \( \eta \)**.  
- Approximates a **moving window** of past gradients and past updates with **exponentially decaying averages**, and uses their ratio to adapt step sizes.

---

### ✅ Formula

1. Compute running average of squared gradients:
   $$
   E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
   $$

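2. Compute the parameter update from the running average of past squared updates, then refresh that average:
   $$
   \Delta w_t = - \frac{\sqrt{E[\Delta w^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t
   $$
   $$
   E[\Delta w^2]_t = \gamma E[\Delta w^2]_{t-1} + (1 - \gamma) \Delta w_t^2
   $$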

3. Update parameters:
   $$
   w_{t+1} = w_t + \Delta w_t
   $$
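
The steps above can be sketched in plain Python. Note that the running average of squared updates \( E[\Delta w^2] \) has to be refreshed after each step, with the same decay \( \gamma \) as the gradient average. The function name `adadelta_step` and the defaults `gamma=0.95` and `eps=1e-6` are assumptions chosen for this example, not values given in these notes.

```python
import math

def adadelta_step(w, grad, avg_sq_grad, avg_sq_delta, gamma=0.95, eps=1e-6):
    """One AdaDelta update; returns (new_w, new_avg_sq_grad, new_avg_sq_delta)."""
    new_w, new_g2, new_d2 = [], [], []
    for w_i, g, eg2, ed2 in zip(w, grad, avg_sq_grad, avg_sq_delta):
        # Step 1: running average of squared gradients
        eg2 = gamma * eg2 + (1 - gamma) * g ** 2
        # Step 2: update uses the *previous* average of squared updates
        delta = -math.sqrt(ed2 + eps) / math.sqrt(eg2 + eps) * g
        # Refresh the running average of squared updates for the next step
        ed2 = gamma * ed2 + (1 - gamma) * delta ** 2
        # Step 3: apply the update
        new_w.append(w_i + delta)
        new_g2.append(eg2)
        new_d2.append(ed2)
    return new_w, new_g2, new_d2

# Single step from w = 1.0 with gradient 2.0 and empty accumulators
w, eg2, ed2 = adadelta_step([1.0], [2.0], [0.0], [0.0])
```

No learning rate appears anywhere in the function; with empty accumulators the first step size is set by \( \sqrt{\epsilon} \), which is why early AdaDelta steps tend to be small.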

---

### ✅ Advantages and Disadvantages

| **Advantages**                                      | **Disadvantages**                                      |
|-----------------------------------------------------|------------------------------------------------------|
| Eliminates the need for a global learning rate \( \eta \). | More complex than RMSProp.                           |
| Prevents learning rate from shrinking too much.     | Can converge more slowly than Adam in practice.      |
| Works well with sparse and dense data.              | Less commonly used compared to Adam.                 |

---

### ✅ Use Cases
- ✅ When you want **adaptive step sizes** without tuning a learning rate.  
- ✅ Suitable for large-scale deep learning problems.  

---

## 3. Interview Q&A

### **Q1: How is RMSProp different from AdaGrad?**
**Answer:**  
- RMSProp uses an **exponential moving average** of squared gradients instead of summing them, preventing the learning rate from decaying too quickly.
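
To see the difference numerically, the sketch below feeds the same constant gradient to both accumulators for 1000 steps and compares the resulting effective learning rates. The constant gradient of 1.0, `lr=0.1`, and `gamma=0.9` are assumptions picked to isolate the accumulator behaviour.

```python
import math

lr, gamma, eps = 0.1, 0.9, 1e-8
G = 0.0        # AdaGrad: running sum of squared gradients
avg_sq = 0.0   # RMSProp: exponential moving average of squared gradients

for _ in range(1000):
    g = 1.0    # constant gradient (illustrative assumption)
    G += g ** 2
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2

adagrad_lr = lr / (math.sqrt(G) + eps)       # keeps shrinking like 1 / sqrt(t)
rmsprop_lr = lr / (math.sqrt(avg_sq) + eps)  # levels off near lr
print(f"AdaGrad: {adagrad_lr:.5f}, RMSProp: {rmsprop_lr:.5f}")
```

AdaGrad's sum reaches 1000, so its effective rate falls to \( 0.1 / \sqrt{1000} \approx 0.003 \), while RMSProp's average settles near 1 and its effective rate stays close to 0.1.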

---

### **Q2: Why is AdaDelta considered an improvement over AdaGrad and RMSProp?**
**Answer:**  
- AdaDelta **removes the need for a manually set learning rate** and maintains stable updates.

---

## ✅ Conclusion
- **RMSProp**: Fixes AdaGrad’s decay issue, widely used in RNNs.  
- **AdaDelta**: Similar to RMSProp but eliminates the need for setting \( \eta \).  
- These optimizers influenced the development of **Adam**, which combines their strengths.


# 🔹 Adam Optimizer (Adaptive Moment Estimation)

## 1. Theory

**Adam (Adaptive Moment Estimation)** combines the advantages of:
- ✅ **RMSProp** (adaptive learning rate per parameter)  
- ✅ **Momentum** (uses exponentially decaying averages of past gradients)  

Adam maintains two moving averages:
- \( m_t \) → first moment (mean of gradients, like momentum)
- \( v_t \) → second moment (uncentered variance, like RMSProp)

This makes Adam:
- **fast to converge**
- **robust to noisy gradients**
- **suitable for large datasets & deep networks**

---

## 2. Formula

1. Compute moving averages:
   $$
   m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
   $$
   $$
   v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
   $$

2. Bias correction:
   $$
   \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
   $$
   $$
   \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
   $$

3. Parameter update:
   $$
   w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
   $$

Where:
- \( \beta_1 \) → decay rate for momentum (default 0.9)  
- \( \beta_2 \) → decay rate for RMSProp term (default 0.999)  
- \( \eta \) → learning rate (default 0.001)  
- \( \epsilon \) → small constant to avoid division by zero  
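
Putting the three steps together, a minimal plain-Python sketch of one Adam update might look like this. The function name `adam_step`, the list-based parameters, `eps=1e-8`, and the toy objective \( J(w) = w_0^2 + w_1^2 \) are assumptions for illustration; the other defaults are the ones listed above.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns (new_w, new_m, new_v)."""
    new_w, new_m, new_v = [], [], []
    for w_i, g, m_i, v_i in zip(w, grad, m, v):
        # 1. Moving averages of the gradient and the squared gradient
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g ** 2
        # 2. Bias correction
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # 3. Parameter update
        new_w.append(w_i - lr / (math.sqrt(v_hat) + eps) * m_hat)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Toy objective (assumed for illustration): J(w) = w0^2 + w1^2, gradient 2 * w.
w, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 1001):
    grad = [2.0 * w_i for w_i in w]
    w, m, v = adam_step(w, grad, m, v, t)
```

The step counter `t` must start at 1; with `t = 0` the bias-correction denominators would be zero.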

---

## 3. Intuition

- Adam keeps track of **both**:
  - **average gradient** (helps direction)
  - **average squared gradient** (controls step size)
- This allows **stable and fast** optimization.

---

## 4. Advantages and Disadvantages

| **Advantages**                                              | **Disadvantages**                                      |
|------------------------------------------------------------|------------------------------------------------------|
| Combines benefits of Momentum & RMSProp.                   | May sometimes lead to worse generalization than SGD. |
| Works well with noisy, sparse gradients.                   | Requires tuning of multiple hyperparameters.         |
| A common first-choice optimizer, available in all major deep learning frameworks. | Can converge to sharp minima in some cases.          |
| Adaptive learning rates for each parameter.                | Stores two moment estimates per parameter, so more memory and compute than plain SGD. |

---

## 5. Use Cases

- ✅ Deep Neural Networks (CNNs, RNNs, Transformers)  
- ✅ Large datasets with sparse gradients (e.g., NLP)  
- ✅ Most modern architectures (available out of the box in TensorFlow & PyTorch)  

---

## 6. Interview Questions and Answers

### **Q1: How is Adam better than SGD with Momentum?**
**Answer:**  
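- SGD with Momentum applies one global learning rate to every parameter. Adam additionally scales each parameter's step by its second moment estimate, so it copes better with gradients of very different magnitudes and usually needs less learning-rate tuning.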

---

### **Q2: Why does Adam use bias correction?**
**Answer:**  
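- Both \( m_t \) and \( v_t \) are initialized at zero, so during the first steps the moving averages are biased toward zero. Dividing by \( 1 - \beta_1^t \) and \( 1 - \beta_2^t \) rescales them to unbiased estimates; the correction fades as \( t \) grows.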
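
A worked first step shows the effect. The sketch assumes a single gradient value of 2.0 and zero-initialized moments, with the default decay rates:

```python
beta1, beta2 = 0.9, 0.999
g = 2.0  # gradient at t = 1 (illustrative assumption)

# Raw moving averages after one step, starting from m_0 = v_0 = 0
m1 = beta1 * 0.0 + (1 - beta1) * g        # about 0.2, ten times smaller than g
v1 = beta2 * 0.0 + (1 - beta2) * g ** 2   # about 0.004, a thousand times smaller than g^2

# Bias-corrected estimates recover the scale of the actual gradient
m1_hat = m1 / (1 - beta1 ** 1)            # about 2.0, i.e. g
v1_hat = v1 / (1 - beta2 ** 1)            # about 4.0, i.e. g^2
```

Without the correction, the first update would use \( m_1 / \sqrt{v_1} \approx 0.2 / 0.063 \approx 3.2 \) instead of \( \hat{m}_1 / \sqrt{\hat{v}_1} = 1 \), so early steps would be several times larger than intended.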

---

## ✅ Conclusion
- **Adam** is one of the most widely used optimizers for deep learning due to its **adaptive learning rate** and **fast convergence**.  
- However, for some problems, **SGD with Momentum** may provide better generalization.
