# RMSProp Optimizer (From Scratch)

In this notebook, we implement **RMSProp** from scratch in PyTorch style and test it on a simple model.  
RMSProp is an **adaptive learning rate optimizer** proposed by *Geoff Hinton* in his Coursera lecture slides (not in a formal paper) to stabilize training.  

Unlike vanilla SGD, which uses the same learning rate for all parameters, RMSProp:  
- Keeps a **running average of squared gradients**.  
- Uses this average to **rescale step sizes** for each parameter.  
- Takes smaller steps for parameters whose recent gradients are large (limiting overshoot) and larger steps for parameters whose recent gradients are small.


## 🔹 RMSProp Update Rule

1. **Running average of squared gradients:**
$$
E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
$$

2. **Parameter update:**
$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
$$

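- $g_t$: gradient at step $t$ (the square $g_t^2$ is element-wise)  
- $\eta$: learning rate  
- $\gamma$: decay factor of the running average (e.g. 0.9)  
- $\epsilon$: small constant (e.g. $10^{-8}$) that keeps the denominator away from zero  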

👉 Intuition:  
- Large recent gradients → denominator grows → smaller steps.  
- Small recent gradients → denominator shrinks → larger steps.  
- Each parameter adapts its learning rate individually.
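
To make the update rule concrete, here is a minimal scalar sketch in plain Python. The helper name `rmsprop_step` and the two sample gradients are illustrative assumptions, not part of the notebook's implementation. Starting from $E[g^2]_0 = 0$, the first step should have magnitude close to $\eta / \sqrt{1 - \gamma}$ whatever the gradient's scale, provided $g_t^2$ is much larger than $\epsilon$.

```python
import math

def rmsprop_step(theta, grad, avg_sq, lr=0.01, gamma=0.9, eps=1e-8):
    """Apply one scalar RMSProp step (hypothetical helper mirroring the equations)."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2    # E[g^2]_t
    theta = theta - lr * grad / math.sqrt(avg_sq + eps)  # theta_{t+1}
    return theta, avg_sq

# Two parameters whose gradients differ by a factor of 10,000.
first_steps = {}
for grad in (100.0, 0.01):
    theta, _ = rmsprop_step(theta=0.0, grad=grad, avg_sq=0.0)
    first_steps[grad] = theta
    print(f"grad={grad:g}: first step = {theta:+.4f}")

# Reference magnitude for comparison: lr / sqrt(1 - gamma)
print(f"lr / sqrt(1 - gamma) = {0.01 / math.sqrt(1 - 0.9):.4f}")
```

For comparison, plain SGD with the same learning rate would move these two parameters by 1.0 and 0.0001 respectively.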


## RMSProp Implementation

Below is a **from-scratch PyTorch-style implementation** of RMSProp.  
We maintain a moving average of squared gradients for each parameter and use it to scale updates.


In [1]:
import torch

class RMSPropOptimizer:
    def __init__(self, params, lr=0.01, gamma=0.9, eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.gamma = gamma                # decay rate for moving avg
        self.eps = eps                    # numerical stability term
        self.avg_sq_grad = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        # Update each parameter in place; no_grad keeps autograd from tracking the update
        for p, avg_sq in zip(self.params, self.avg_sq_grad):
            if p.grad is None:
                continue

            g = p.grad

            # Update moving average of squared gradients
            avg_sq.mul_(self.gamma).addcmul_(g, g, value=1 - self.gamma)

            # Parameter update: theta -= lr * g / sqrt(E[g^2] + eps)
            p.addcdiv_(g, torch.sqrt(avg_sq + self.eps), value=-self.lr)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
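
As a sanity check, the same update can be compared against PyTorch's built-in `torch.optim.RMSprop`. The sketch below is an assumption-laden illustration: the parameter values and the synthetic gradients are made up, and the manual update is written inline instead of through the class so that the block runs on its own. Note that `torch.optim.RMSprop` calls the decay factor `alpha` and adds `eps` outside the square root, so the two paths should agree only up to a small tolerance.

```python
import torch

lr, gamma, eps = 0.01, 0.9, 1e-8

# Two copies of one parameter vector: one per update path.
p_manual = torch.tensor([0.5, -1.0, 2.0, 0.0, 1.5])
p_torch = p_manual.clone().requires_grad_(True)
avg_sq = torch.zeros_like(p_manual)

# alpha is the name torch.optim.RMSprop uses for the decay factor gamma.
reference = torch.optim.RMSprop([p_torch], lr=lr, alpha=gamma, eps=eps)

base_grad = torch.tensor([1.0, -2.0, 0.5, 3.0, -1.5])  # synthetic gradients
for t in range(20):
    g = base_grad * (1.0 + 0.1 * t)

    # Manual rule with eps inside the square root, as in the class above.
    avg_sq.mul_(gamma).addcmul_(g, g, value=1 - gamma)
    p_manual.addcdiv_(g, torch.sqrt(avg_sq + eps), value=-lr)

    # Built-in optimizer: it reads the gradient from .grad.
    p_torch.grad = g.clone()
    reference.step()

max_diff = (p_manual - p_torch.detach()).abs().max().item()
print(f"max abs difference after 20 steps: {max_diff:.2e}")
```

The gradients here are kept well away from zero on purpose: when $E[g^2]_t$ is comparable to $\epsilon$, the two placements of $\epsilon$ give visibly different step sizes.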


## 🎯 Testing RMSProp on a Simple Linear Model

We apply our RMSProp optimizer to a dummy regression task.  
The loss should fall steadily from step to step. No SGD baseline is run in this cell, so the output shows convergence, not a side-by-side comparison.


In [2]:
# Dummy model
model = torch.nn.Linear(2, 1)

# Optimizer
optimizer = RMSPropOptimizer(model.parameters(), lr=0.01, gamma=0.9)

# Dummy data
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])

criterion = torch.nn.MSELoss()

# Training loop
for epoch in range(10):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Loss = {loss.item():.4f}")


Output (exact values depend on the random initialization of `model`):

Epoch 1, Loss = 2.9726
Epoch 2, Loss = 2.5525
Epoch 3, Loss = 2.2779
Epoch 4, Loss = 2.0650
Epoch 5, Loss = 1.8882
Epoch 6, Loss = 1.7356
Epoch 7, Loss = 1.6008
Epoch 8, Loss = 1.4799
Epoch 9, Loss = 1.3700
Epoch 10, Loss = 1.2694
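
The run above does not include an SGD baseline. One way to see the stabilizing effect is a toy quadratic with very different curvature per coordinate, sketched below in plain Python. The loss, the starting point, and the deliberately large learning rate are all assumptions chosen for illustration. With this learning rate, plain SGD overshoots on the steep coordinate and diverges, whereas each RMSProp step is at most $\eta / \sqrt{1 - \gamma}$ in magnitude, so the iterates stay bounded.

```python
import math

CURVATURE = (100.0, 1.0)  # assumed toy loss: 0.5 * (100 * w0**2 + 1 * w1**2)

def loss(w):
    return 0.5 * sum(c * wi ** 2 for c, wi in zip(CURVATURE, w))

def grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]

def run_sgd(w, lr, steps):
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

def run_rmsprop(w, lr, steps, gamma=0.9, eps=1e-8):
    avg_sq = [0.0] * len(w)  # running average of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(w)
        avg_sq = [gamma * a + (1 - gamma) * gi ** 2 for a, gi in zip(avg_sq, g)]
        w = [wi - lr * gi / math.sqrt(a + eps) for wi, gi, a in zip(w, g, avg_sq)]
    return w

start = [1.0, 1.0]
lr = 0.05  # deliberately too large for SGD on the steep coordinate
sgd_loss = loss(run_sgd(start, lr, steps=50))
rms_loss = loss(run_rmsprop(start, lr, steps=50))
print(f"initial loss:                {loss(start):.3f}")
print(f"SGD loss after 50 steps:     {sgd_loss:.3e}")
print(f"RMSProp loss after 50 steps: {rms_loss:.3e}")
```

This is not a claim that RMSProp always beats SGD: with a learning rate small enough for the steep coordinate, SGD converges on this problem too.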


## ✅ Key Takeaways

- **RMSProp ≠ plain SGD**: it adapts learning rates per parameter.  
- Tracks the **running average of squared gradients** to normalize updates.  
- Reduces instability caused by gradients whose scale differs widely across parameters or over time.  
- Forms the foundation of **Adam**, which combines RMSProp's squared-gradient scaling with momentum and bias correction.  
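
To illustrate the last point, here is a minimal scalar sketch of an Adam step in plain Python. The helper name `adam_step` and the constant sample gradient are assumptions for illustration; the hyperparameter defaults are the commonly used ones. The `v` line is the same running average of squared gradients that RMSProp keeps, while `m` adds momentum and the two `_hat` lines correct for initializing both averages at zero.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step (hypothetical helper; t counts steps from 1)."""
    m = beta1 * m + (1 - beta1) * grad       # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSProp part: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
for t, grad in enumerate([4.0, 4.0, 4.0], start=1):
    theta, m, v = adam_step(theta, grad, m, v, t)
    print(f"t={t}: theta = {theta:+.6f}")
```

With a constant gradient, each step has magnitude close to `lr`, independent of the gradient's scale.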
