## Adam Optimizer

Adam (**Adaptive Moment Estimation**) is an optimization algorithm that combines the ideas of:

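- **Momentum:** keeps an exponential moving average of past gradients, which smooths the update direction.
- **RMSProp:** keeps an exponential moving average of squared gradients, which gives each parameter its own adaptive step size.
- **Bias correction:** compensates for both averages starting at zero, which would otherwise make them underestimate the true moments in early steps.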

---

### 🔹 Formulas

At step \(t\), given the gradient of the loss \(L\) with respect to the parameters \(\theta_t\):

$$
g_t = \nabla_\theta L(\theta_t)
$$


1. **Update moving averages** (initialized as \(m_0 = v_0 = 0\); \(g_t^2\) is the elementwise square)
$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

2. **Bias correction** (\(\beta_1^t\) and \(\beta_2^t\) are \(\beta_1\) and \(\beta_2\) raised to the power \(t\))
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

3. **Parameter update**
$$
\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
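
The three steps above can be traced on a single scalar parameter. The sketch below is a minimal plain-Python illustration, not a reference implementation; the function name `adam_step` and the toy loss \(L(\theta) = \theta^2\) are assumptions made for this example.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g        # 1) first-moment moving average
    v = beta2 * v + (1.0 - beta2) * g * g    #    second-moment moving average
    m_hat = m / (1.0 - beta1 ** t)           # 2) bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # 3) parameter update
    return theta, m, v

# Toy problem (assumed for illustration): L(theta) = theta^2, so the gradient is 2 * theta
theta0 = 5.0
theta, m, v = theta0, 0.0, 0.0
history = [theta]
for t in range(1, 101):
    g = 2.0 * theta
    theta, m, v = adam_step(theta, g, m, v, t)
    history.append(theta)

first_step = history[0] - history[1]
print(f"first step size: {first_step:.6f}")
print(f"theta after 100 steps: {theta:.4f}")
```

On the first step the bias-corrected ratio reduces to \(g_1 / (|g_1| + \epsilon)\), so the parameter moves by almost exactly \(\alpha\) regardless of the gradient's magnitude.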

---

### 🔹 Hyperparameters (common defaults)

- Learning rate: \(\alpha = 0.001\)
- Momentum decay: \(\beta_1 = 0.9\)
- Squared gradient decay: \(\beta_2 = 0.999\)
- Numerical stability: \(\epsilon = 10^{-8}\)
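
These values match the defaults of PyTorch's built-in `torch.optim.Adam`. As a small sketch, the snippet below passes them explicitly and reads them back from the optimizer's parameter group; the `torch.nn.Linear(2, 1)` layer is only a placeholder model assumed for illustration.

```python
import torch

# Placeholder model (an assumption for this example; any module with parameters works)
layer = torch.nn.Linear(2, 1)

# Pass the common defaults explicitly so they are visible at the call site
optimizer = torch.optim.Adam(
    layer.parameters(),
    lr=1e-3,             # alpha
    betas=(0.9, 0.999),  # (beta_1, beta_2)
    eps=1e-8,            # epsilon
)

# Hyperparameters are stored per parameter group
group = optimizer.param_groups[0]
print(group["lr"], group["betas"], group["eps"])
```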

---

### 🔹 Intuition

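- **Momentum** \((m_t)\): remembers the recent gradient direction, which reduces oscillations.  
- **RMSProp** \((v_t)\): gives each parameter its own effective learning rate, \(\alpha / (\sqrt{\hat{v}_t} + \epsilon)\).  
- **Bias correction**: keeps early updates from being distorted while \(m_t\) and \(v_t\) are still close to their zero initialization.  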
- **Together → stable, adaptive, and efficient convergence.**
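
To make the last two points concrete, the sketch below feeds constant gradients of very different magnitudes through the moment updates and records the update ratio \(\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\). It is an illustrative plain-Python example; the helper name `adam_ratio` and its `bias_correction` switch are hypothetical and exist only for this demonstration.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8, bias_correction=True):
    """Return m_hat / (sqrt(v_hat) + eps) at each step for a scalar gradient sequence."""
    m = v = 0.0
    ratios = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        if bias_correction:
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
        else:
            m_hat, v_hat = m, v  # use the raw, zero-biased averages
        ratios.append(m_hat / (math.sqrt(v_hat) + eps))
    return ratios

large = adam_ratio([100.0] * 5)                       # constant large gradient
small = adam_ratio([0.01] * 5)                        # constant small gradient
raw = adam_ratio([100.0] * 5, bias_correction=False)  # same gradient, no correction

print("large gradient:", [round(r, 4) for r in large])
print("small gradient:", [round(r, 4) for r in small])
print("no bias correction:", [round(r, 4) for r in raw])
```

With bias correction, both gradient scales give a ratio close to 1, so the step is roughly \(\alpha\) in either case. Without it, the first ratio is \((1 - \beta_1) / \sqrt{1 - \beta_2} \approx 3.16\), so the early steps are several times larger than intended.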


---

### 🔹 Pros
✅ Works well out of the box with the default hyperparameters on many problems  
✅ Often converges faster than plain SGD  
✅ Handles sparse/noisy gradients well  
✅ Adaptive learning rates per parameter  

### 🔹 Cons
❌ More memory usage (stores \(m\) and \(v\), i.e. two extra values for each parameter)  
❌ Can generalize worse than SGD in some cases  
❌ Can still benefit from a learning-rate schedule  
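
The memory cost is easy to estimate by hand. The sketch below does the arithmetic for a small fully connected network with layer sizes 2 → 16 → 1 and float32 state; both the layer sizes and the 4-byte element size are assumptions chosen for illustration.

```python
# (in_features, out_features) for each linear layer of the assumed network
layer_shapes = [(2, 16), (16, 1)]

# Each linear layer has in*out weights plus out biases
n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)

# Adam keeps one m and one v value per parameter
state_values = 2 * n_params
state_bytes = 4 * state_values  # assuming float32 (4 bytes per value)

print(f"parameters: {n_params}")
print(f"optimizer state values: {state_values}")
print(f"optimizer state bytes (float32): {state_bytes}")
```

The optimizer state is always twice the parameter count, so the overhead grows linearly with model size.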

---


In [1]:
import torch
import torch.optim as optim
import torch.nn as nn

import math

In [2]:
# Adam (from scratch)
class AdamOptimizer:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps

        # Per-parameter state
        self.m = [torch.zeros_like(p) for p in self.params]  # first moment
        self.v = [torch.zeros_like(p) for p in self.params]  # second moment
        self.t = 0  # step counter

    @torch.no_grad()
    def step(self):
        self.t += 1
        bias_corr1 = 1.0 - self.beta1 ** self.t   # 1 - beta1^t, bias-correction denominator for m
        bias_corr2 = 1.0 - self.beta2 ** self.t   # 1 - beta2^t, bias-correction denominator for v

        for p, m, v in zip(self.params, self.m, self.v):
            if p.grad is None:
                continue

            g = p.grad

            # 1) Update moving averages
            m.mul_(self.beta1).add_(g, alpha=1.0 - self.beta1)          # m_t
            v.mul_(self.beta2).addcmul_(g, g, value=1.0 - self.beta2)   # v_t

            # 2) Bias-corrected moments
            m_hat = m / bias_corr1
            v_hat = v / bias_corr2

            # 3) Parameter update
            p.addcdiv_(m_hat, torch.sqrt(v_hat) + self.eps, value=-self.lr)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


In [3]:
# Tiny sanity test: single-batch regression
torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(2, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1)
)

optimizer = AdamOptimizer(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
criterion = torch.nn.MSELoss()

# Dummy data (single sample; goal is just to see loss go down)
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])

for epoch in range(50):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:02d}: loss = {loss.item():.6f}")


Epoch 05: loss = 0.809774
Epoch 10: loss = 0.721449
Epoch 15: loss = 0.638959
Epoch 20: loss = 0.562460
Epoch 25: loss = 0.491979
Epoch 30: loss = 0.428558
Epoch 35: loss = 0.372196
Epoch 40: loss = 0.320848
Epoch 45: loss = 0.274360
Epoch 50: loss = 0.232563
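
A decreasing loss shows that the optimizer makes progress, but not that it follows Adam exactly. One stricter check is to run the same update rule side by side with `torch.optim.Adam` and compare the parameters. The sketch below is self-contained and re-implements the update as a standalone function; the name `manual_adam_step` and the toy loss \(\sum_i w_i^2\) are assumptions made for this example.

```python
import torch

def manual_adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update on tensor p, following the formulas in this section."""
    m.mul_(beta1).add_(g, alpha=1.0 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1.0 - beta2)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p.addcdiv_(m_hat, v_hat.sqrt() + eps, value=-lr)

torch.manual_seed(0)
w_ref = torch.randn(5, requires_grad=True)  # updated by torch.optim.Adam
w_man = w_ref.detach().clone()              # updated by the manual rule
m = torch.zeros_like(w_man)
v = torch.zeros_like(w_man)

reference = torch.optim.Adam([w_ref], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

for t in range(1, 21):
    # Reference path: autograd gradient of sum(w^2), then the built-in step
    reference.zero_grad()
    (w_ref ** 2).sum().backward()
    reference.step()

    # Manual path: the analytic gradient of sum(w^2) is 2 * w
    manual_adam_step(w_man, 2.0 * w_man, m, v, t)

max_diff = (w_ref.detach() - w_man).abs().max().item()
print(f"max abs difference after 20 steps: {max_diff:.2e}")
```

Any remaining difference should be at the level of float32 rounding, since the built-in optimizer applies the same update with the operations ordered slightly differently.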
