# 📜 Optimization Techniques in AI, ML, and Deep Learning

---

## 🔹 1. What is Optimization?

Optimization in AI/ML refers to the process of **minimizing or maximizing an objective (loss) function** to improve a model’s performance.

- In **Machine Learning**, optimization fits model parameters to the training data; how well the result generalizes to unseen data also depends on regularization and validation.  
- In **Deep Learning**, optimization is the heart of training neural networks, where millions (or billions) of parameters must be tuned efficiently.  

**General formulation:**

$$
\theta^* = \arg\min_{\theta} \; L(f(x;\theta), y)
$$

---

## 🔹 2. Classical Optimization in ML

- **Gradient Descent (GD):**

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(f(x;\theta_t), y)
$$
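As a minimal sketch of the update rule above, the snippet below runs plain gradient descent on a toy one-dimensional loss. The loss $L(\theta) = (\theta - 3)^2$, the step size, and the step count are illustrative assumptions, not values from these notes.

```python
def grad(theta):
    # Gradient of the assumed toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, eta=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        # theta_{t+1} = theta_t - eta * grad L(theta_t)
        theta = theta - eta * grad(theta)
    return theta


theta_star = gradient_descent(theta0=0.0)
print(theta_star)  # approaches the minimizer at 3.0
```

With `eta=0.1` the distance to the minimizer shrinks by a factor of 0.8 per step on this particular loss.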

- **Momentum (Polyak, 1964):**

$$
v_{t+1} = \mu v_t - \eta \nabla_\theta L_t
$$

$$
\theta_{t+1} = \theta_t + v_{t+1}
$$
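The two momentum equations can be sketched in a few lines. The toy loss $(\theta - 3)^2$ and the values `eta=0.05`, `mu=0.9` are assumptions chosen only to make the example converge.

```python
def grad(theta):
    # Gradient of the assumed toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)


def momentum_descent(theta0, eta=0.05, mu=0.9, steps=300):
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = mu * v - eta * grad(theta)  # v_{t+1} = mu * v_t - eta * grad
        theta = theta + v               # theta_{t+1} = theta_t + v_{t+1}
    return theta


theta_momentum = momentum_descent(theta0=0.0)
print(theta_momentum)  # close to 3.0 after damped oscillations
```

The velocity `v` accumulates past gradients, so the iterate can overshoot and oscillate before settling.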

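- **Nesterov Accelerated Gradient (NAG; Nesterov, 1983):** the gradient is evaluated at the look-ahead point rather than at the current parameters.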

$$
v_{t+1} = \mu v_t - \eta \nabla_\theta L(\theta_t + \mu v_t)
$$

$$
\theta_{t+1} = \theta_t + v_{t+1}
$$
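A hedged sketch of the NAG update: the only change from classical momentum is that the gradient is taken at the look-ahead point $\theta_t + \mu v_t$. The toy loss and hyperparameters are again illustrative assumptions.

```python
def grad(theta):
    # Gradient of the assumed toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)


def nesterov_descent(theta0, eta=0.05, mu=0.9, steps=300):
    theta, v = theta0, 0.0
    for _ in range(steps):
        lookahead = theta + mu * v          # where momentum alone would carry us
        v = mu * v - eta * grad(lookahead)  # gradient evaluated at the look-ahead point
        theta = theta + v
    return theta


theta_nag = nesterov_descent(theta0=0.0)
print(theta_nag)  # close to 3.0
```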

---

## 🔹 3. Advanced Optimizers in Deep Learning

- **Adagrad (2011):**

$$
G_t = \sum_{\tau=1}^{t} \left(\nabla_\theta L_\tau\right)^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta L_t
$$
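A small sketch of Adagrad, assuming $G_t$ is the element-wise running sum of squared gradients (the usual definition). The two-parameter toy loss with very different curvatures and the value `eta=1.0` are assumptions for illustration.

```python
import math


def grad(theta):
    # Gradient of the assumed toy loss (theta0 - 3)^2 + 10 * (theta1 + 1)^2
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


def adagrad(theta0, eta=1.0, eps=1e-8, steps=500):
    theta = list(theta0)
    G = [0.0] * len(theta)  # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2
            # Each parameter gets its own effective step eta / sqrt(G_i + eps)
            theta[i] -= eta / math.sqrt(G[i] + eps) * g[i]
    return theta


theta_adagrad = adagrad([0.0, 0.0])
print(theta_adagrad)  # close to [3.0, -1.0]
```

Because `G` only grows, the effective step size only shrinks, which is the decay issue the later optimizers address.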

- **RMSProp (2012):**

$$
E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
$$
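The RMSProp equations can be sketched as follows, replacing Adagrad's running sum with an exponential moving average. The toy loss and the values `eta=0.01`, `gamma=0.9` are assumptions; with a constant learning rate the iterate hovers near the minimum rather than converging exactly.

```python
import math


def grad(theta):
    # Gradient of the assumed toy loss (theta0 - 3)^2 + 10 * (theta1 + 1)^2
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


def rmsprop(theta0, eta=0.01, gamma=0.9, eps=1e-8, steps=2000):
    theta = list(theta0)
    avg_sq = [0.0] * len(theta)  # E[g^2], one entry per parameter
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            avg_sq[i] = gamma * avg_sq[i] + (1.0 - gamma) * g[i] ** 2
            theta[i] -= eta / math.sqrt(avg_sq[i] + eps) * g[i]
    return theta


theta_rmsprop = rmsprop([0.0, 0.0])
print(theta_rmsprop)  # near [3.0, -1.0], within a few multiples of eta
```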

- **Adam (2014):**

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
$$  

$$
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
$$  

$$
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}
$$  

$$
\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
$$
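The four Adam equations map directly onto code. The sketch below uses the commonly cited defaults `beta1=0.9`, `beta2=0.999`, `eps=1e-8`; the toy loss and `eta=0.05` are assumptions for illustration.

```python
import math


def grad(theta):
    # Gradient of the assumed toy loss (theta0 - 3)^2 + 10 * (theta1 + 1)^2
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


def adam(theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    theta = list(theta0)
    m = [0.0] * len(theta)  # first-moment estimate
    v = [0.0] * len(theta)  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta


theta_adam = adam([0.0, 0.0])
print(theta_adam)  # close to [3.0, -1.0]
```

Bias correction matters mostly in the first steps, when `m` and `v` are still close to their zero initialization.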

---

## 🔹 4. Optimization Tricks in DL Training

- **Learning Rate Scheduling (cosine annealing):**

$$
\eta_t = \frac{\eta_0}{2}\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right)
$$
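A sketch of a cosine schedule in its commonly used form, which decays smoothly from $\eta_0$ at $t = 0$ to a floor at $t = T$ and never goes negative. The `eta_min` floor is an added assumption (defaulting to 0), as are the example values.

```python
import math


def cosine_lr(t, total_steps, eta0, eta_min=0.0):
    # Half-cosine decay from eta0 (t = 0) down to eta_min (t = total_steps)
    cosine = 1.0 + math.cos(math.pi * t / total_steps)
    return eta_min + 0.5 * (eta0 - eta_min) * cosine


schedule = [cosine_lr(t, total_steps=100, eta0=0.1) for t in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

In a training loop, `cosine_lr(t, T, eta0)` would replace the constant $\eta$ in whichever update rule is in use.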

- **Gradient Clipping:**

$$
g_t \leftarrow \frac{g_t}{\max(1, \|g_t\|/c)}
$$
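The clipping rule rescales the gradient only when its norm exceeds the threshold $c$. A minimal sketch, assuming the Euclidean norm and a gradient stored as a flat list:

```python
import math


def clip_by_norm(g, c):
    norm = math.sqrt(sum(x * x for x in g))
    scale = max(1.0, norm / c)  # equals 1 when the norm is already <= c
    return [x / scale for x in g]


clipped = clip_by_norm([3.0, 4.0], c=1.0)    # norm 5 is rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], c=1.0)  # norm 0.5 is left as is
print(clipped, unchanged)
```

The direction of the gradient is preserved; only its length is capped.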

- **Weight Initialization:**

$$
\text{Xavier: } \; \mathcal{U}\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}, \; \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right]
$$  

$$
\text{He: } \; \mathcal{N}(0, \tfrac{2}{n_{in}})
$$
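Both initializers can be sketched with the standard library alone. Here the second argument of $\mathcal{N}$ is read as a variance, so the He standard deviation is $\sqrt{2/n_{in}}$. The layer sizes and the seed are arbitrary assumptions.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


def he_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / n_in)  # variance 2 / n_in
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
W_xavier = xavier_uniform(256, 128, rng)
W_he = he_normal(256, 128, rng)
```

Each result is an `n_out` by `n_in` nested list; in practice a tensor library would hold these weights.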

- **Sharpness-Aware Minimization (SAM, 2021):**

$$
\min_\theta \; \max_{\|\epsilon\|\leq\rho} L(f(x;\theta+\epsilon), y)
$$
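In practice the inner maximization is usually approximated by a single normalized gradient-ascent step of length $\rho$, followed by a descent step that uses the gradient at the perturbed point. The sketch below follows that first-order approximation; the toy loss and the values `eta=0.02`, `rho=0.05` are assumptions.

```python
import math


def grad(theta):
    # Gradient of the assumed toy loss (theta0 - 3)^2 + 10 * (theta1 + 1)^2
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


def sam_step(theta, eta=0.02, rho=0.05):
    g = grad(theta)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12  # guard against division by zero
    # Approximate worst-case perturbation: epsilon = rho * g / ||g||
    theta_adv = [t + rho * x / norm for t, x in zip(theta, g)]
    g_adv = grad(theta_adv)  # second gradient evaluation, at the perturbed point
    return [t - eta * x for t, x in zip(theta, g_adv)]


theta_sam = [0.0, 0.0]
for _ in range(500):
    theta_sam = sam_step(theta_sam)
print(theta_sam)  # hovers near [3.0, -1.0], within roughly rho
```

Each step needs two gradient evaluations, which is where the extra compute cost of SAM comes from.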

---

## 🔹 5. Optimization Challenges in Deep Learning

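- **Vanishing/Exploding Gradients:** backpropagated gradients are scaled by a product of per-layer (or per-time-step) weight matrices, more precisely layer Jacobians, whose norm can shrink or grow exponentially with depth:

$$
\left\| \prod_{t=1}^T W_t \right\| \quad \to \quad 0 \;\; \text{or} \;\; \infty
$$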
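A scalar stand-in makes the effect concrete: if every factor in the product has magnitude slightly below or above 1, the product shrinks or grows exponentially with depth. The factors 0.9 and 1.1 and the depth of 50 are illustrative assumptions.

```python
def product_of_factors(w, depth):
    # Scalar stand-in for the product of `depth` identical layer factors
    scale = 1.0
    for _ in range(depth):
        scale *= w
    return scale


vanishing = product_of_factors(0.9, depth=50)  # shrinks toward 0
exploding = product_of_factors(1.1, depth=50)  # grows without bound as depth increases
print(vanishing, exploding)
```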

---

## ✅ Key Takeaways

- **ML era:** convex objectives (linear models, SVMs) → Gradient Descent and related convex solvers.  
- **DL era:** adaptive optimizers (Adam, RMSProp) + training tricks (LR schedules, normalization).  
- **Modern era (2020s):** scaling to **billions of params**, robustness (flat minima, adversarial), and efficiency (distributed & optimizer-free methods).  


# 📊 Comparison of Optimization Algorithms in Deep Learning

| Optimizer | Update Rule (simplified) | Pros | Cons | Typical Use Cases |
|-----------|--------------------------|------|------|-------------------|
| **SGD (Stochastic Gradient Descent)** | $$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$ | Simple, memory efficient, good generalization | Slow convergence, sensitive to learning rate | Small to medium models, baseline training |
| **Momentum** | $$v_{t+1} = \mu v_t - \eta \nabla L(\theta_t), \quad \theta_{t+1} = \theta_t + v_{t+1}$$ | Faster convergence, can help escape shallow minima | Can overshoot, extra hyperparameter ($$\mu$$) | CNN training, image recognition |
| **NAG (Nesterov Accelerated Gradient)** | $$v_{t+1} = \mu v_t - \eta \nabla L(\theta_t + \mu v_t), \quad \theta_{t+1} = \theta_t + v_{t+1}$$ | Lookahead gradient improves stability, often faster than Momentum | Slightly more complex, tuning needed | Sequence models, RNNs |
| **Adagrad** | $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L$$ | Adaptive learning rate per parameter, good for sparse features | Learning rate decays too fast | NLP, text embeddings |
| **RMSProp** | $$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$ | Controls Adagrad’s decay issue, stable | Still sensitive to hyperparams | RNNs, speech recognition |
| **Adam (Adaptive Moment Estimation)** | Combines Momentum + RMSProp with bias correction | Fast, robust, widely used, less tuning needed | Can overfit, sometimes worse generalization than SGD | Standard for most DL tasks (NLP, CV, GANs) |
| **AdamW** | Adam + decoupled weight decay | Better generalization than Adam | Still requires careful LR tuning | Transformers, LLMs |
| **LAMB (Layer-wise Adaptive Moments)** | Adam variant with a layer-wise trust ratio that rescales each layer's update | Enables very large-batch training (e.g., BERT pre-training) | More complex, extra per-layer norm computations | Large-batch training of large-scale models |
| **SAM (Sharpness-Aware Minimization, 2021)** | $$\min_\theta \; \max_{\|\epsilon\|\leq\rho} L(f(x;\theta+\epsilon), y)$$ | Improves robustness, better generalization | Slower, higher compute cost | Vision Transformers, LLM fine-tuning |

---

## ✅ Key Insights
- **Classical baseline**: SGD (+Momentum, NAG).  
- **Adaptive methods**: Adagrad → RMSProp → Adam (standard for DL).  
- **Modern large-scale**: AdamW, LAMB, SAM for foundation models.  
- **Trade-off**: SGD often generalizes better, while Adam converges faster.  
