# Optimizers in Deep Learning — Comprehensive Overview

---

## 1) What an Optimizer Actually Does (and Why It’s Hard)

**Goal:** Minimize the expected risk  
$$
\mathbb{E}_{(x,y)\sim D}[\ell(f_\theta(x), y)]
$$
by iteratively updating parameters \( \theta \).

**Reality:**  
We only see minibatches → noisy gradients.  
The loss landscape is **non-convex**, **ill-conditioned**, and full of **saddles** and **valleys** of varying curvature.

**Trade-offs:**  
- Stability vs speed  
- Memory vs accuracy  
- Sharp vs flat minima  
- Generalization vs training loss  

---

## 2) Vanilla → Momentum → Nesterov (the “Classics”)

### 2.1 Gradient Descent (GD) / Stochastic Gradient Descent (SGD)

**Update:**
$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L_B(\theta_t)
$$
where \( L_B \) is minibatch loss, \( \eta \) the learning rate.

**Pros:** Simple, strong generalization (SGD).  
**Cons:** Sensitive to \( \eta \); slow in narrow valleys.
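
The update above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the toy loss \( \tfrac12\|\theta\|^2 \), the Gaussian noise standing in for minibatch sampling, and the names `sgd_step` and `noisy_grad` are all assumptions made for the example.

```python
import random

def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad (elementwise)."""
    return [t - lr * g for t, g in zip(theta, grad)]

def noisy_grad(theta, rng, noise=0.1):
    """Gradient of the toy loss 0.5*||theta||^2 plus simulated minibatch noise."""
    return [t + rng.gauss(0.0, noise) for t in theta]

rng = random.Random(0)          # fixed seed so the run is repeatable
theta = [5.0, -3.0]
for _ in range(200):
    theta = sgd_step(theta, noisy_grad(theta, rng), lr=0.1)
print(theta)                    # hovers near the optimum at the origin
```

Because the gradient is noisy, the iterates settle into a small cloud around the minimum rather than converging exactly; shrinking `lr` shrinks that cloud.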

---

### 2.2 Momentum (Polyak)

**Idea:** EMA of gradients → accelerate in consistent directions, damp noise.

**Update:**
$$
v_t = \mu v_{t-1} + (1-\mu)\nabla L_B(\theta_t), \qquad
\theta_{t+1} = \theta_t - \eta v_t
$$
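Typical \( \mu \in [0.8,0.99] \); higher values smooth more but react more slowly.
The classical heavy-ball form drops the \( (1-\mu) \) factor, \( v_t = \mu v_{t-1} + \nabla L_B(\theta_t) \); the two forms coincide when the heavy-ball learning rate equals \( (1-\mu)\eta \). Sections 2.3 and 10 use the heavy-ball form.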
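
A minimal sketch of the momentum update on a scalar toy loss \( \tfrac12\theta^2 \) (so the gradient is simply \( \theta \)). The `ema` flag is an illustrative addition, not part of any library API: it switches between the EMA form written above and the classical heavy-ball form without the \( (1-\mu) \) factor, and the loop checks that the two trace the same path once the learning rate is rescaled.

```python
def momentum_step(theta, v, grad, lr, mu, ema=True):
    """One momentum update on scalars.

    ema=True  : v = mu*v + (1-mu)*grad   (EMA form)
    ema=False : v = mu*v + grad          (classical heavy-ball form)
    """
    scale = (1.0 - mu) if ema else 1.0
    v = mu * v + scale * grad
    return theta - lr * v, v

mu, lr = 0.9, 0.5
theta_a, v_a = 1.0, 0.0   # EMA form
theta_b, v_b = 1.0, 0.0   # heavy-ball form
for _ in range(50):
    theta_a, v_a = momentum_step(theta_a, v_a, theta_a, lr, mu, ema=True)
    # Same trajectory when the heavy-ball learning rate is lr * (1 - mu).
    theta_b, v_b = momentum_step(theta_b, v_b, theta_b, lr * (1 - mu), mu, ema=False)
print(theta_a, theta_b)
```

The practical consequence is that a learning rate tuned for one convention is off by a factor of \( 1-\mu \) in the other, which matters when porting hyperparameters between codebases.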

---

### 2.3 Nesterov Accelerated Gradient (NAG)

**Look-ahead gradient:**
$$
v_t = \mu v_{t-1} + \nabla L_B(\theta_t - \eta\mu v_{t-1}), \qquad
\theta_{t+1} = \theta_t - \eta v_t
$$

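**Use when:** Plain momentum overshoots; evaluating the gradient at the look-ahead point corrects the velocity one step earlier. Gains are typically small but consistent for vision (SGD + Nesterov).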
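
The look-ahead update can be sketched as follows. The quadratic toy loss \( 5\theta^2 \) and the names `nag_step` and `grad_fn` are assumptions for illustration; the only difference from plain momentum is where the gradient is evaluated.

```python
def nag_step(theta, v, grad_fn, lr, mu):
    """One Nesterov update: the gradient is taken at the look-ahead point."""
    lookahead = theta - lr * mu * v
    v = mu * v + grad_fn(lookahead)
    return theta - lr * v, v

def grad_fn(theta):
    """Gradient of the hypothetical toy loss 5 * theta**2."""
    return 10.0 * theta

theta, v = 1.0, 0.0
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn, lr=0.05, mu=0.9)
print(theta)
```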

---

## 3) Adaptive Gradient Methods (Per-Parameter Step Sizes)

### 3.1 AdaGrad

Accumulate squared gradients:
$$
G_t = \sum_{\tau\le t} g_\tau \odot g_\tau
$$

**Update:**
$$
\theta_{t+1} = \theta_t - \eta \, g_t / (\sqrt{G_t} + \epsilon)
$$

**Intuition:** Infrequent features get larger steps → great for sparse problems.  
**Con:** Learning rate decays to zero → training can stall.
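
Both the intuition and the drawback show up in a small experiment. This sketch assumes a hypothetical two-coordinate problem in which coordinate 0 receives a gradient at every step and coordinate 1 only at every tenth step; `adagrad_step` is an illustrative name.

```python
import math

def adagrad_step(theta, G, grad, lr, eps=1e-8):
    """One AdaGrad update; G accumulates squared gradients per coordinate."""
    G = [a + g * g for a, g in zip(G, grad)]
    theta = [t - lr * g / (math.sqrt(a) + eps)
             for t, g, a in zip(theta, grad, G)]
    return theta, G

theta, G = [0.0, 0.0], [0.0, 0.0]
steps = []   # per-step absolute movement of each coordinate
for t in range(100):
    grad = [1.0, 1.0 if t % 10 == 0 else 0.0]
    new_theta, G = adagrad_step(theta, G, grad, lr=0.1)
    steps.append([abs(n - o) for n, o in zip(new_theta, theta)])
    theta = new_theta

print(steps[90])   # the rarely updated coordinate moves further
print(steps[99])   # the frequent coordinate's step keeps shrinking
```

At step 90 both coordinates see the same gradient, yet the infrequent one takes the larger step because its accumulator is smaller. The frequent coordinate's step shrinks like \( \eta/\sqrt{t} \), which is the stall described above.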

---

### 3.2 RMSProp

Fixes AdaGrad’s vanishing step size by replacing the running sum with an exponential moving average:
$$
E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2
$$

**Update:**
$$
\theta_{t+1} = \theta_t - \eta g_t / (\sqrt{E[g^2]_t} + \epsilon)
$$
Historically strong for RNNs.
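
A minimal scalar sketch of the update, assuming a constant gradient of 1 so the behaviour of the step size is easy to read off. `rmsprop_step` is an illustrative name, and the default \( \rho = 0.9 \) is an assumed common choice rather than something the text prescribes.

```python
import math

def rmsprop_step(theta, s, grad, lr, rho=0.9, eps=1e-8):
    """One RMSProp update; s is the EMA of squared gradients."""
    s = rho * s + (1.0 - rho) * grad * grad
    return theta - lr * grad / (math.sqrt(s) + eps), s

theta, s = 0.0, 0.0
step_sizes = []
for _ in range(200):
    new_theta, s = rmsprop_step(theta, s, grad=1.0, lr=0.01)
    step_sizes.append(abs(new_theta - theta))
    theta = new_theta

print(step_sizes[0], step_sizes[-1])
```

Unlike AdaGrad, the step settles at roughly `lr` instead of decaying to zero. The first steps are larger than `lr` because `s` starts at zero and there is no bias correction.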

---

### 3.3 Adam (Adaptive Moment Estimation)

First / second-moment EMAs:
$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
$$

Bias correction:
$$
\hat m_t = \frac{m_t}{1-\beta_1^t}, \quad
\hat v_t = \frac{v_t}{1-\beta_2^t}
$$

**Update:**
$$
\theta_{t+1} = \theta_t - \eta \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
$$

**Pros:** Fast, scale-robust.  
**Cons:** Can converge to sharper minima → weaker final generalization.
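
The three formulas above combine into one short function. This is a scalar sketch with the commonly used defaults \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), \( \epsilon = 10^{-8} \) assumed; `adam_step` is an illustrative name. The two calls at the end illustrate the "scale-robust" claim: gradients that differ by six orders of magnitude produce nearly the same first step.

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment EMA
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment EMA
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

small, _, _ = adam_step(0.0, 0.0, 0.0, grad=1e-3, t=1)
large, _, _ = adam_step(0.0, 0.0, 0.0, grad=1e3, t=1)
print(small, large)   # both are close to -lr
```

Without the bias-correction lines, the first step would be scaled by \( (1-\beta_1)/\sqrt{1-\beta_2} \approx 3.2 \) instead of 1.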

**Adam Variants:**

| Variant | Key Idea |
|----------|-----------|
| **AdamW** | Decoupled weight decay: \( \theta \!\leftarrow\! (1-\eta\lambda)\theta \) — fixes L2-coupling. |
| **AMSGrad** | Non-decreasing \( \hat v_t^{max} \) for convergence proofs. |
| **Nadam** | Adds Nesterov look-ahead. |
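| **AdaBelief** | Uses \( (g_t - m_t)^2 \) in place of \( g_t^2 \) for the second moment; often reported to generalize better. |
| **AdamP / RAdam / Ranger** | AdamP projects out radial updates on scale-invariant weights; RAdam rectifies early-step variance in place of warmup; Ranger combines RAdam with Lookahead. |
| **Lion** | Updates via the sign of an interpolated momentum; keeps one state tensor per parameter → low memory, effective for ViTs / LLMs. |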
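
The AdamW row is easiest to understand side by side with L2 regularization inside Adam. The sketch below is an illustration under assumptions (scalar parameter, zero loss gradient so that only the decay acts, hypothetical names `adam_l2_step` and `adamw_step`); it is not taken from any particular library.

```python
import math

def _adam_direction(m, v, g, t, beta1, beta2, eps):
    """Shared Adam machinery: returns (normalized direction, new m, new v)."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, m, v, grad, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """L2 inside Adam: the decay term passes through the adaptive rescaling."""
    d, m, v = _adam_direction(m, v, grad + wd * theta, t, beta1, beta2, eps)
    return theta - lr * d, m, v

def adamw_step(theta, m, v, grad, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled decay: shrink theta by (1 - lr*wd), then apply the Adam step."""
    d, m, v = _adam_direction(m, v, grad, t, beta1, beta2, eps)
    return (1.0 - lr * wd) * theta - lr * d, m, v

# Zero loss gradient isolates the effect of weight decay on a single step.
l2_theta, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=0.0, t=1, lr=0.01, wd=0.1)
w_theta, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1, lr=0.01, wd=0.1)
print(l2_theta, w_theta)
```

In the L2 variant the decay term is normalized like any other gradient, so on this step the weight moves by about `lr` regardless of `wd`. In the decoupled variant it shrinks by exactly `lr * wd`, which keeps the strength of the decay independent of the gradient scale.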

---

## 4) Second-Order & Preconditioning (Curvature-Aware)

### 4.1 Newton / Gauss–Newton

$$
\theta_{t+1} = \theta_t - H^{-1}\nabla L
$$
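\(H\): the Hessian (Newton) or a positive semi-definite approximation of it (Gauss–Newton).  
**Issue:** For \(n\) parameters, storing \(H\) takes \(O(n^2)\) memory and inverting it \(O(n^3)\) time, which is intractable for deep nets.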
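
A small worked case shows what the curvature buys. The sketch assumes a hypothetical ill-conditioned 2-D quadratic \( \tfrac12(100x^2 + y^2) \) and inverts the 2×2 Hessian in closed form; `newton_step_2d` is an illustrative name.

```python
def newton_step_2d(theta, grad, H):
    """theta - H^{-1} grad for a 2x2 Hessian H, via the closed-form inverse."""
    (a, b), (c, d) = H
    det = a * d - b * c
    dx = (d * grad[0] - b * grad[1]) / det
    dy = (a * grad[1] - c * grad[0]) / det
    return [theta[0] - dx, theta[1] - dy]

# Toy loss 0.5 * (100*x^2 + y^2): gradient is H @ theta, condition number 100.
H = [[100.0, 0.0], [0.0, 1.0]]
theta = [1.0, 1.0]
grad = [100.0 * theta[0], 1.0 * theta[1]]
theta_new = newton_step_2d(theta, grad, H)
print(theta_new)
```

On a quadratic, one Newton step lands on the minimum whatever the conditioning, whereas gradient descent on the same problem must keep \( \eta < 0.02 \) for stability and then crawls along the flat \( y \) direction.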

---

### 4.2 Quasi-Newton (L-BFGS)

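Approximates \(H^{-1}\) from a short history of gradient and parameter differences.  
**Use:** Full-batch (or very large-batch) training of small models, convex problems, or fine-tuning; rare in large-scale deep training, because minibatch noise corrupts the curvature estimate.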

---

### 4.3 Natural Gradient / K-FAC / Shampoo / Adafactor

- **Natural Gradient:** Precondition by Fisher Information; parameterization-invariant.  
- **K-FAC:** Kronecker-factored block approximation.  
- **Shampoo:** Matrix-root preconditioning per dimension.  
- **Adafactor:** Memory-efficient Adam (factored moments) for huge LMs.

---

## 5) Normalization, Signs & Robust Tricks

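- **Normalized SGD / Gradient Centralization:** Normalized SGD rescales each gradient to unit norm so the step length is set by η alone; Gradient Centralization subtracts the mean from each weight-gradient vector for stability.  
- **signSGD / QSGD:** Compress gradients to cut communication: signSGD sends only sign bits, QSGD sends stochastically quantized values; robust but may plateau.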
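
Two of the operations named above are simple enough to sketch directly. The helper names `centralize` and `sign_compress` are hypothetical, and the code works on plain lists for one weight vector; real implementations apply the same idea per layer on tensors.

```python
def centralize(grad_row):
    """Gradient Centralization for one weight vector: subtract its mean."""
    mean = sum(grad_row) / len(grad_row)
    return [g - mean for g in grad_row]

def sign_compress(grad):
    """signSGD-style compression: keep only the sign of each component."""
    return [(g > 0) - (g < 0) for g in grad]

print(centralize([1.0, 2.0, 3.0]))
print(sign_compress([0.3, -2.0, 0.0]))
```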

---

## 6) Schedules & Warmup (The Silent Super-Power)

**Why:** Schedule often matters more than optimizer choice.

| Schedule | Description |
|-----------|--------------|
| **Step Decay** | Drop η by 10× at milestones. |
| **Cosine Decay** | Smooth to 0; combine with linear warmup (1–10%). |
| **Cyclical / OneCycle** | Ramp up then down; great for SGD. |
| **Linear Decay** | Common in NLP (AdamW). |
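
The cosine-with-warmup row can be sketched as a pure function of the step index. The function name `lr_at`, the 5 % warmup fraction, and the run length of 1000 steps are assumptions chosen for the example.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000, base_lr=3e-4) for s in range(1000)]
print(schedule[0], max(schedule), schedule[-1])
```

Writing the schedule as a function of `step` rather than as mutable state makes it easy to plot before a run and to resume from a checkpoint without replaying earlier steps.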

---

## 7) Practical Defaults (Per Domain)

### Vision (CNNs / ViTs)

| Setup | Optimizer | LR | Momentum / Betas | WD | Notes |
|--------|------------|----|------------------|----|-------|
| From scratch | SGD + Nesterov | 0.1–1.0 | 0.9 | 1e-4–5e-4 | Cosine + warmup |
| Deep/ViT | AdamW | 1e-4–3e-3 | (0.9, 0.999) | 1e-4–1e-2 | Cosine + warmup |

---

### NLP (Transformers / LLMs)

**AdamW default:**  
η = 1e-5–5e-4, β = (0.9, 0.98 or 0.999), wd ≈ 0.01  
Linear decay + warmup (1–10%).  
Use **Adafactor** for extremely large models.

---

### Speech / Sequential (RNN / Conformer)

Adam / RMSProp + warmup; always clip gradients (1.0 – 5.0).
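
Clipping by global norm, the variant usually meant by a threshold such as 1.0 to 5.0, can be sketched as follows. The function name and the flat-list representation of the gradients are assumptions for illustration; the small constant in the denominator only guards against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm,
    which is worth logging to spot instabilities.
    """
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm)
```

Rescaling all components by one factor preserves the gradient direction, which clipping each component separately would not.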

---

### Reinforcement Learning

Adam / AdamW; tune η carefully; clip + schedule essential; entropy bonuses interact with optimizer noise.

---

### Tabular / Small Data

SGD + momentum or AdamW + early stopping.  
Regularization > optimizer choice.

---

## 8) Optimizer “Add-Ons”

| Add-On | Function |
|---------|-----------|
| **Gradient Clipping** | Prevents explosion (esp. RNNs). |
| **Lookahead** | Slow EMA of fast steps → smoother convergence. |
| **SWA** | Average late checkpoints → flatter minima. |
| **SAM** | Minimize worst-case loss in local neighborhood → flatness boost (~2× cost). |
| **EMA / Polyak Averaging** | \( \tilde\theta_t = \alpha \tilde\theta_{t-1} + (1-\alpha)\theta_t \), α ∈ [0.99, 0.9999]; stabilizes eval. |
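
The EMA row translates directly into code. In this sketch the "training run" is a hypothetical sequence of iterates that alternate around an optimum at 1.0, and `alpha=0.9` is chosen only so the effect is visible within 200 steps; the table's range of 0.99 to 0.9999 applies to real runs with many more steps.

```python
def ema_update(ema_theta, theta, alpha=0.999):
    """EMA of weights: ema <- alpha * ema + (1 - alpha) * theta."""
    return [alpha * e + (1.0 - alpha) * t for e, t in zip(ema_theta, theta)]

# Hypothetical noisy iterates that bounce around the optimum at 1.0.
iterates = [[1.0 + 0.5 * (-1) ** t] for t in range(200)]
ema = list(iterates[0])
for theta in iterates[1:]:
    ema = ema_update(ema, theta, alpha=0.9)

print(iterates[-1], ema)   # the averaged weights sit much closer to 1.0
```

The averaged copy is used only for evaluation; the optimizer keeps updating the raw weights.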

---

## 9) Reading Curves & Tuning Playbook

| Symptom | Likely Cause | Fix |
|----------|---------------|-----|
| Loss zig-zag / early stall | LR too high / small batch | Lower LR or increase batch size |
| Val gap ↑ | Overfit | More WD / stronger aug / earlier schedule |
| Poor generalization with Adam | Converged to a sharp minimum | Switch to SGD late / add SAM or SWA |
| Plateau after LR drop | LR dropped too late or by too little | Drop earlier or use cosine |
| NaNs / explode | LR too high or mixed-precision error | Clip grad / lower LR / check scaling |

---

### Typical Sweep Ranges

| Optimizer | η range | WD | Momentum / β₂ |
|------------|----------|----|----------------|
| SGD + Nest. | {0.05, 0.1, 0.2, 0.4} | 1e-4 – 1e-3 | 0.9 |
| AdamW | {1e-4, 3e-4, 1e-3, 3e-3} | 1e-4 – 1e-2 | 0.98 / 0.999 |

Warmup steps: 1–5 % of total.  

---

## 10) Concise Equations Cheat-Sheet

| Optimizer | Update Rule |
|------------|-------------|
| **SGD** | \( \theta_{t+1} = \theta_t - \eta g_t \) |
| **Momentum** (heavy-ball form; the EMA form adds a \( (1-\mu) \) factor on \( g_t \)) | \( v_t = \mu v_{t-1} + g_t,\;\theta_{t+1} = \theta_t - \eta v_t \) |
| **Nesterov** | \( v_t = \mu v_{t-1} + g(\theta_t - \eta\mu v_{t-1}),\;\theta_{t+1} = \theta_t - \eta v_t \) |
| **RMSProp** | \( s_t = \rho s_{t-1} + (1-\rho)g_t^2,\;\theta_{t+1} = \theta_t - \eta g_t / (\sqrt{s_t}+\epsilon) \) |
| **Adam** | see full moment formulas above |
| **AdamW Decay** | \( \theta \leftarrow (1-\eta\lambda)\theta \) |

---

## 11) Pros / Cons / When to Use

| Optimizer | Pros | Cons | Where It Shines | Key Knobs |
|------------|------|------|----------------|------------|
| **SGD + Nesterov** | Great generalization · simple · low mem | Needs tuned LR & schedule | Vision (CNNs from scratch) | LR, momentum, WD, schedule |
| **AdamW** | Fast · stable · decoupled WD | Sharper minima possible | Transformers / ViTs | LR, WD, β’s, warmup |
| **RMSProp** | Handles non-stationary grads | Fewer modern wins | Older RNN / RL | LR, ρ |
| **AdaGrad** | Sparse features | LR decays → stall | NLP sparse grads | LR |
| **AMSGrad** | Theoretical guarantee | Usually same as Adam | When a convergence guarantee matters | LR, β’s |
| **AdaBelief** | Often better generalization | Inconsistent results | Vision/NLP | LR, β’s |
| **Lion** | Low memory | Needs tuning | ViTs / LLMs | LR, β’s |
| **L-BFGS** | Strong on smooth losses | Full-batch / mem heavy | Small models | Hist size |
| **K-FAC / Shampoo** | Curvature speed-ups | Complex · memory heavy | Large models | Damping, preconditioner update frequency |
| **Adafactor** | Memory saving | Stability quirks | Massive LMs | Factored moments on/off |

---

## 12) Common Pitfalls & Fixes

- Use **AdamW**, not L2 inside Adam.  
- Always use **warmup** for deep / transformer models.  
- Don’t over-rely on Adam for vision → switch to SGD late or add SWA/SAM.  
- Always **schedule** learning rate.  
- Scale LR roughly linearly with batch size, and pair it with warmup; the rule degrades at very large batches.
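
The last rule is simple arithmetic. The sketch below assumes a reference point of LR 0.1 at batch size 256, the convention used in the large-batch ImageNet work of Goyal et al.; `scaled_lr` is an illustrative helper, and the result is a starting point for tuning rather than a guarantee.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: LR grows in proportion to batch size."""
    return base_lr * batch / base_batch

for batch in (256, 1024, 8192):
    print(batch, scaled_lr(0.1, 256, batch))
```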

---

## 13) Quick Recipes (Drop-In Defaults)

| Context | Recommended Setup |
|----------|------------------|
| **General (default)** | AdamW (η = 3e-4, β = (0.9, 0.999), wd = 0.01) + 5 % warmup + cosine decay. |
| **Vision from scratch** | SGD + Nesterov (ηₘₐₓ = 0.4, momentum = 0.9, wd = 1e-4) + OneCycle schedule. |
| **Transformer fine-tune** | AdamW (η = 2e-5 – 5e-5, β = (0.9, 0.98), wd = 0.01) + linear decay + 10 % warmup. |
| **Very large models** | Adafactor / AdamW + memory optimizations + EMA for eval. |

---

**Summary Insight:**  
Optimizers navigate a noisy, high-dimensional, curved landscape.  
The interplay of **update rule**, **learning-rate schedule**, and **regularization** dictates whether training lands in a **flat, generalizing minimum** or a **sharp, brittle one**.  
Mastery lies not in choosing “the best optimizer,” but in matching the **dynamics and schedule** to your model, data, and compute.


# Optimizer Research Landscape — Chronological & Thematic Foundations

---

## **A) Foundations & Core Theory**

| Theme | Key Work | Authors / Venue / Year | Contribution |
|--------|-----------|------------------------|---------------|
| **Regularization** | *On the Stability of Inverse Problems* | **Tikhonov**, 1943 | Introduced regularization for ill-posed inverse problems. |
| **Stochastic Approximation (SA)** | *A Stochastic Approximation Method* | **Robbins & Monro**, *Ann. Math. Stat.*, 1951 | Convergence analysis of iterative root-finding from noisy observations; foundation of SGD. |
|  | *Stochastic Estimation of the Maximum of a Regression Function* | **Kiefer & Wolfowitz**, *Ann. Math. Stat.*, 1952 | Finite-difference stochastic approximation. |
| **Convex Optimization** | *Convex Optimization (book)* | **Boyd & Vandenberghe**, 2004 | Unified treatment of convex problems, KKT, proximal and dual methods. |
| **Acceleration Theory** | *Introductory Lectures on Convex Optimization* | **Nesterov**, 2004 | Proved optimal first-order convergence; basis of NAG. |
| **Mirror Descent** | *Problem Complexity and Method Efficiency in Optimization* | **Nemirovski & Yudin**, 1983 | Introduced geometry-aware first-order framework. |
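|  | *Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization* | **Beck & Teboulle**, *Oper. Res. Lett.*, 2003 | Modern convergence analysis; views mirror descent as a projected subgradient method with a non-Euclidean distance. |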
| **Momentum** | *Some Methods of Speeding Up the Convergence of Iteration Methods* | **Polyak**, *USSR Comput. Math. Math. Phys.*, 1964 | Introduced the heavy-ball (momentum) method. |
| **Spectral Steps** | *Two-Point Step Size Gradient Methods* | **Barzilai & Borwein**, *IMA JNA*, 1988 | Proposed step-size using curvature (spectral rule). |
| **Second-Order Frameworks** | *Trust-Region Methods (book)* | **Conn, Gould & Toint**, 2000 | Rigorous analysis of second-order optimization frameworks. |

---

## **B) SGD, Momentum, Nesterov**

| Key Work | Authors / Venue / Year | Contribution |
|-----------|------------------------|---------------|
| *Large-Scale Machine Learning with Stochastic Gradient Descent* | **Bottou**, *COMPSTAT*, 2010 | Statistical view of SGD and convergence tradeoffs. |
| *Optimization Methods for Large-Scale ML* | **Bottou, Curtis & Nocedal**, *arXiv*, 2016 (published in *SIAM Review*, 2018) | Practical perspective bridging theory and implementation. |
| *Nesterov Accelerated Gradient* | **Nesterov**, 1983; **Sutskever et al.**, *ICML*, 2013 | Theoretically optimal look-ahead acceleration adapted to DL. |
| *Large-Batch Sharp Minima* | **Keskar et al.**, *ICLR*, 2017 | Showed large batches lead to sharper minima; inspired flatness research. |

---

## **C) Adaptive Gradient Methods (Per-Parameter Steps)**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **AdaGrad** | **Duchi, Hazan & Singer**, *JMLR*, 2011 | Per-coordinate adaptivity via accumulated squared gradients. |
| **RMSProp** | **Tieleman & Hinton**, Coursera Notes, 2012 | Exponential moving average of gradient magnitudes. |
| **Adam** | **Kingma & Ba**, *ICLR*, 2015 | Combines EMA of first and second moments + bias correction. |
| **AMSGrad** | **Reddi et al.**, *ICLR (Best Paper)*, 2018 | Identified a non-convergence case in Adam and fixed it with a non-decreasing second-moment estimate. |
| **AdamW** | **Loshchilov & Hutter**, *ICLR*, 2019 | Decoupled weight decay; fixed L2 coupling bug. |
| **RAdam** | **Liu et al.**, *ICLR*, 2020 | Variance-rectified warmup for stable early training. |
| **AdaBound** | **Luo et al.**, *ICLR*, 2019 | Transition from Adam → SGD asymptotically. |
| **AdaBelief** | **Zhuang et al.**, *NeurIPS*, 2020 | Uses (g−m)² variance tracking → smoother generalization. |
| **NovoGrad** | **Ginsburg et al.**, *arXiv*, 2019 | Adam-like but cheaper; used in speech recognition. |
| **Lion** | **Chen et al.**, *NeurIPS*, 2023 | Momentum-sign updates (L1-like); low memory; strong for ViTs/LLMs. |

---

## **D) Learning-Rate Schedules & Warmup**

| Method | Authors / Venue / Year | Key Idea |
|---------|------------------------|-----------|
| **Cyclical Learning Rates** | **Smith**, *WACV*, 2017 | Periodic LR exploration; LR range test. |
| **OneCycle Policy** | **Smith & Topin**, *arXiv*, 2017 | LR increases then decreases; momentum inverted. |
| **Cosine Annealing (SGDR)** | **Loshchilov & Hutter**, *ICLR*, 2017 | Smooth cosine decay; inspired modern default. |
| **Warmup** | **He et al.**, *CVPR*, 2016; **Vaswani et al.**, *NeurIPS*, 2017 | Linear warmup for stability in deep / transformer training. |

---

## **E) Second-Order, Natural Gradient & Preconditioning**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **Hessian-Free Optimization** | **Martens**, *ICML*, 2010 | Practical curvature-based optimization for deep nets. |
| **Natural Gradient** | **Amari**, *Neural Computation*, 1998 | Fisher information geometry for parameter-invariant descent. |
| **K-FAC** | **Martens & Grosse**, *ICML*, 2015 | Kronecker-factored Fisher approximation; scalable NG. |
| **Shampoo** | **Gupta et al.**, *ICML*, 2018; **Anil et al.**, *arXiv*, 2020 | Matrix-root preconditioning; large-scale curvature. |
| **Adafactor** | **Shazeer & Stern**, *ICML*, 2018 | Memory-efficient Adam; factored second moments. |
| **AdaHessian** | **Yao et al.**, *AAAI*, 2021 | Curvature-aware adaptive optimizer. |

---

## **F) Variance Reduction (Finite-Sum Optimization)**

| Method | Authors / Venue / Year | Key Concept |
|---------|------------------------|-------------|
| **SAG** | **Le Roux, Schmidt & Bach**, *NeurIPS*, 2012 | Stores per-example gradients and averages them for variance reduction. |
| **SVRG** | **Johnson & Zhang**, *NeurIPS*, 2013 | Periodically computed full gradient snapshot. |
| **SAGA** | **Defazio et al.**, *NeurIPS*, 2014 | Simplified SAG with unbiased updates. |
| **Katyusha** | **Allen-Zhu**, *STOC*, 2017 | Accelerated variance reduction. |
| **SARAH** | **Nguyen et al.**, *ICML*, 2017 | Recursive variance-reduced gradient estimator. |

---

## **G) Proximal, Composite, and Constraints**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **ISTA / Proximal Gradient** | **Daubechies et al.**, *Comm. Pure Appl. Math.*, 2004 | Foundation of proximal methods. |
| **FISTA** | **Beck & Teboulle**, *SIAM J. Imaging Sci.*, 2009 | Accelerated proximal convergence. |
| **ADMM (survey)** | **Boyd et al.**, *Found. Trends ML*, 2011 | Unified constrained optimization via splitting. |
| **Mirror-Prox** | **Nemirovski**, *SIAM J. Optim.*, 2004 | Saddle-point and constrained gradient frameworks. |

---

## **H) Nonconvex Landscapes, Saddles & Escape**

| Work | Authors / Venue / Year | Contribution |
|-------|------------------------|---------------|
| **Strict Saddle Escape** | **Jin et al.**, *ICML*, 2017 | Perturbed gradient descent escapes strict saddle points efficiently. |
| **Overparameterized GD** | **Du et al.**, *ICLR*, 2019 | Why simple GD converges in NTK regime. |
| **PL Condition** | **Karimi, Nutini & Schmidt**, *ECML-PKDD*, 2016 | Linear rates possible beyond convexity. |

---

## **I) Sharpness, Flat Minima & Generalization-Aware Optimizers**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **Flat Minima** | **Hochreiter & Schmidhuber**, *Neural Comput.*, 1997 | Linked flatness with generalization. |
| **Sharp Minima (Batch Size)** | **Keskar et al.**, *ICLR*, 2017 | Large-batch → sharp minima discovery. |
| **SWA** | **Izmailov et al.**, *UAI*, 2018 | Weight averaging → flatter basins. |
| **SAM** | **Foret et al.**, *ICLR*, 2021 | Adversarial step in parameter space → flatness bias. |
| **ASAM / GSAM** | **Kwon et al.**, *ICML*, 2021; **Zhuang et al.**, *ICLR*, 2022 | Scale-adaptive (ASAM) and surrogate-gap (GSAM) refinements of SAM. |

---

## **J) Normalization as Optimization Aid**

| Technique | Authors / Venue / Year | Effect |
|------------|------------------------|--------|
| **BatchNorm** | **Ioffe & Szegedy**, *ICML*, 2015 | Stabilizes optimization, boosts effective LR. |
| **LayerNorm** | **Ba, Kiros & Hinton**, *arXiv*, 2016 | Crucial for transformers; scale-invariant. |
| **GroupNorm** | **Wu & He**, *ECCV*, 2018 | Effective for small batches. |
| **WeightNorm** | **Salimans & Kingma**, *NeurIPS*, 2016 | Reparameterization improving conditioning. |
| **Gradient Centralization** | **Yong et al.**, *ECCV*, 2020 | Mean-centered gradients → smoother optimization. |

---

## **K) Large-Batch, Distributed & Communication-Efficient Training**

| Technique | Authors / Venue / Year | Contribution |
|------------|------------------------|--------------|
| **Hogwild!** | **Recht et al.**, *NeurIPS*, 2011 | Lock-free asynchronous SGD. |
| **Linear Scaling Rule (ImageNet 1h)** | **Goyal et al.**, *arXiv*, 2017 | Warmup + LR scaling → massive parallelism. |
| **LARS** | **You et al.**, *arXiv*, 2017 | Layerwise rate scaling for large-batch CNNs. |
| **LAMB** | **You et al.**, *ICLR*, 2020 | Large-batch BERT optimizer; layerwise adaptivity. |
| **QSGD** | **Alistarh et al.**, *NeurIPS*, 2017 | Quantized gradients for low-bandwidth updates. |
| **signSGD** | **Bernstein et al.**, *ICML*, 2018 | Sign-based distributed SGD with robustness. |

---

## **L) Classical Second-Order & Quasi-Newton**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **L-BFGS** | **Liu & Nocedal**, *Math. Prog. B*, 1989 | Memory-limited quasi-Newton; strong small-scale performance. |
| **Trust-Region / Line Search** | **Wolfe, Armijo**, 1960s | Conditions for step acceptance; classical backtracking. |
| **L-BFGS-B** | **Byrd, Lu, Nocedal & Zhu**, *SIAM J. Sci. Comput.*, 1995 | Limited-memory quasi-Newton method for bound-constrained problems. |

---

## **M) Optimizer Hybrids & Meta-Optimizers**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **Lookahead** | **Zhang et al.**, *NeurIPS*, 2019 | EMA of “fast” weights → smoother, better convergence. |
| **Hypergradient Descent** | **Baydin et al.**, *ICLR*, 2018 | Learns learning rate online. |
| **YellowFin** | **Zhang & Mitliagkas**, *arXiv*, 2017 | Auto-tunes LR and momentum for SGD. |
| **GradNorm / Adaptive Loss Balancing** | **Chen et al.**, *ICML*, 2018 | Dynamic weighting for multi-task optimization. |

---

## **N) Optimization for Sequences & Stability Tricks**

| Key Work | Authors / Venue / Year | Contribution |
|-----------|------------------------|---------------|
| **Training RNNs Difficulties** | **Pascanu, Mikolov & Bengio**, *ICML*, 2013 | Identified gradient explosion/vanish; formalized clipping. |
| **Gradient Clipping (origin)** | **Mikolov**, *2012 thesis/tech notes* | Canonical practical clipping trick. |
| **RMSProp in RNNs** | **Hinton Notes**, 2012 | Early use for stabilizing recurrent learning. |

---

## **O) Online Learning & Regret (Theoretical Roots of Adaptivity)**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **Online Gradient Descent** | **Zinkevich**, *ICML*, 2003 | Regret bounds for online convex optimization. |
| **Dual Averaging / FTRL** | **McMahan**, *AISTATS*, 2011; **Nesterov**, *Math. Program.*, 2009 | Foundations behind AdaGrad & adaptive updates. |

---

## **P) Geometry, Invariances & Parameterization**

| Method | Authors / Venue / Year | Contribution |
|---------|------------------------|---------------|
| **Path-SGD** | **Neyshabur et al.**, *NeurIPS*, 2015 | Scale-invariant optimization for ReLU nets. |
| **Scaled Weight Standardization** | **Brock, De & Smith**, *ICLR*, 2021 | Weight standardization with a fixed scale; enables training deep ResNets without normalization layers. |

---

## **Q) Surveys & Tutorials**

| Title | Authors / Venue / Year | Contribution |
|--------|------------------------|---------------|
| **Optimization Methods for Large-Scale ML** | **Bottou, Curtis & Nocedal**, *SIAM Review*, 2018 | Comprehensive survey bridging convex and DL optimizers. |
| **Qualitatively Characterizing Neural Network Optimization Problems** | **Goodfellow, Vinyals & Saxe**, *ICLR*, 2015 | Linear structure of loss surfaces. |
| **Generalization & Optimization Series** | **Neyshabur et al.**, *NeurIPS*, 2017–2019 | Theoretical links between generalization and optimization. |
| **Why Momentum Really Works** | **Goh**, *Distill*, 2017 | Popular and clear practitioner-friendly analysis. |

---
