# SGD with Momentum (From Scratch)

In this notebook, we implement **Stochastic Gradient Descent (SGD) with Momentum** from scratch, in the style of a PyTorch optimizer.  
Along the way we contrast it with vanilla SGD to see how momentum can accelerate learning and damp oscillations.

---


## 🔹 SGD with Momentum — Definition & Formula

- **Vanilla SGD update rule:**
$$
\theta_{t+1} = \theta_t - \eta g_t
$$

- **SGD + Momentum update rule:**
$$
v_t = \mu v_{t-1} - \eta g_t
$$
$$
\theta_{t+1} = \theta_t + v_t
$$

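Where:
- $\theta_t$: model parameters at step $t$  
- $g_t$: gradient of the loss with respect to $\theta_t$  
- $\eta$: learning rate  
- $\mu$: momentum coefficient in $[0, 1)$, e.g. 0.9; $\mu = 0$ recovers vanilla SGD  
- $v_t$: velocity, an exponentially decaying sum of past gradient steps, initialized to zero  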
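
Unrolling the recursion makes the smoothing explicit. As a sketch, assume the velocity starts at zero ($v_0 = 0$) and the first update uses $g_1$; this indexing is an assumption made for illustration. Then:
$$
v_t = -\eta \sum_{k=0}^{t-1} \mu^{k} g_{t-k}
$$
so a gradient from $k$ steps ago is weighted by $\mu^k$. If the gradient were constant, $g_t = g$, the sum would be geometric:
$$
v_t = -\eta g \, \frac{1 - \mu^{t}}{1 - \mu} \;\to\; -\frac{\eta g}{1 - \mu} \quad (t \to \infty)
$$
With $\mu = 0.9$ the limiting step is ten times the vanilla SGD step $-\eta g$.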

👉 **Intuition:** Momentum builds up velocity along directions where successive gradients agree, while components that keep flipping sign partly cancel,  
so training is often faster and smoother, especially in ravines or with noisy gradients.
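
To make the velocity build-up concrete, here is a small scalar sketch in plain Python. It assumes a constant gradient `g`, which is an idealization, and the values of `lr`, `mu`, and `g` are illustrative rather than recommended settings.

```python
# Scalar illustration of the momentum update under a constant gradient.
# lr, mu and g are illustrative values, not tuned settings.
lr, mu, g = 0.1, 0.9, 1.0

v = 0.0          # velocity starts at zero
history = []
for t in range(200):
    v = mu * v - lr * g      # v_t = mu * v_{t-1} - lr * g_t
    history.append(v)

# Limit of the geometric series: -lr * g / (1 - mu)
terminal = -lr * g / (1 - mu)

print(f"first step: {history[0]:.4f}")
print(f"step 200:   {history[-1]:.4f}")
print(f"limit:      {terminal:.4f}")
```

The first step equals the vanilla SGD step `-lr * g`; later steps grow in magnitude toward `-lr * g / (1 - mu)`, which is why a persistent gradient direction is traversed faster with momentum.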


In [1]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

## Manual Implementation

We first implement the **update step manually** using velocity buffers.


In [2]:
def init_velocity(params):
    # One zero-initialized velocity buffer per parameter tensor
    return [torch.zeros_like(p) for p in params]

@torch.no_grad()
def sgd_momentum_step(params, grads, velocity, lr=0.1, mu=0.9):
    # In-place update; no_grad keeps these ops out of the autograd graph
    for p, g, v in zip(params, grads, velocity):
        v.mul_(mu).add_(g, alpha=-lr)   # v <- mu * v - lr * g
        p.add_(v)                       # theta <- theta + v
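
One way to sanity-check the update step is to run it side by side with `torch.optim.SGD` on a toy objective. The sketch below restates the two helpers so it runs on its own; the quadratic `objective`, the starting point, the learning rate of 0.01, and the names `w_manual` and `w_ref` are illustrative assumptions. PyTorch's optimizer keeps a buffer $b \leftarrow \mu b + g$ and applies $\theta \leftarrow \theta - \eta b$, which should give the same parameters as the rule above as long as the learning rate is constant.

```python
import torch

def init_velocity(params):
    return [torch.zeros_like(p) for p in params]

@torch.no_grad()
def sgd_momentum_step(params, grads, velocity, lr=0.1, mu=0.9):
    for p, g, v in zip(params, grads, velocity):
        v.mul_(mu).add_(g, alpha=-lr)
        p.add_(v)

def objective(w):
    # Illustrative quadratic with a different curvature per coordinate
    return (torch.tensor([1.0, 10.0]) * w ** 2).sum()

start = torch.tensor([1.0, -1.0])
w_manual = start.clone().requires_grad_(True)
w_ref = start.clone().requires_grad_(True)

velocity = init_velocity([w_manual])
reference = torch.optim.SGD([w_ref], lr=0.01, momentum=0.9)

for _ in range(50):
    # Manual path: compute gradients explicitly, then apply the helper
    grads = torch.autograd.grad(objective(w_manual), [w_manual])
    sgd_momentum_step([w_manual], grads, velocity, lr=0.01, mu=0.9)

    # Reference path: the built-in optimizer
    reference.zero_grad()
    objective(w_ref).backward()
    reference.step()

max_diff = (w_manual - w_ref).abs().max().item()
print(f"max abs difference after 50 steps: {max_diff:.2e}")
```

Any remaining difference should be at the level of floating-point round-off. The two formulations can diverge if the learning rate changes during training, because PyTorch's buffer is not pre-scaled by the learning rate.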


## SGD with Momentum — Class Implementation

In [3]:
class SGDWithMomentum:
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.mu = momentum
        self.velocity = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, v in zip(self.params, self.velocity):
            if p.grad is None:
                continue
            v.mul_(self.mu).add_(p.grad, alpha=-self.lr)
            p.add_(v)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


## Testing the Optimizer

We apply our SGD with Momentum optimizer to a simple linear regression model.


In [4]:
# Dummy dataset
x = torch.tensor([[1.0], [2.0], [3.0]])
y = torch.tensor([[2.0], [4.0], [6.0]])  # y = 2x

# Simple model
model = nn.Linear(1, 1)

criterion = nn.MSELoss()
optimizer = SGDWithMomentum(model.parameters(), lr=0.1, momentum=0.9)

# Training loop
for epoch in range(20):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss = {loss.item():.4f}")


Epoch 1, Loss = 11.7060
Epoch 2, Loss = 0.1488
Epoch 3, Loss = 11.3892
Epoch 4, Loss = 5.4412
Epoch 5, Loss = 1.4292
Epoch 6, Loss = 9.2542
Epoch 7, Loss = 1.7752
Epoch 8, Loss = 2.8406
Epoch 9, Loss = 6.4027
Epoch 10, Loss = 0.2354
Epoch 11, Loss = 3.5880
Epoch 12, Loss = 3.7387
Epoch 13, Loss = 0.0309
Epoch 14, Loss = 3.5315
Epoch 15, Loss = 1.7629
Epoch 16, Loss = 0.4119
Epoch 17, Loss = 2.8978
Epoch 18, Loss = 0.5914
Epoch 19, Loss = 0.8551
Epoch 20, Loss = 2.0245
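
The trace above is not monotone: the loss swings up and down while trending lower, which is consistent with the velocity overshooting the minimum at `lr=0.1` and `momentum=0.9`. To put that in context, the sketch below trains the same model with `momentum=0.0` (which reduces the update to vanilla SGD) and with `momentum=0.9`. It restates the optimizer class so it runs on its own. The helper `train`, the zero initialization, and the 200-step budget are assumptions added for illustration so that both runs start from the same point.

```python
import torch
import torch.nn as nn

class SGDWithMomentum:
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.mu = momentum
        self.velocity = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, v in zip(self.params, self.velocity):
            if p.grad is None:
                continue
            v.mul_(self.mu).add_(p.grad, alpha=-self.lr)
            p.add_(v)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

def train(momentum, steps, lr=0.1):
    """Hypothetical helper: full-batch training on y = 2x, returns per-step losses."""
    x = torch.tensor([[1.0], [2.0], [3.0]])
    y = torch.tensor([[2.0], [4.0], [6.0]])
    model = nn.Linear(1, 1)
    with torch.no_grad():
        # Fixed starting point so the two runs are directly comparable
        model.weight.zero_()
        model.bias.zero_()
    criterion = nn.MSELoss()
    optimizer = SGDWithMomentum(model.parameters(), lr=lr, momentum=momentum)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())   # loss measured before this step's update
    return losses

vanilla = train(momentum=0.0, steps=200)
with_momentum = train(momentum=0.9, steps=200)

for step in (1, 5, 20, 50, 100, 200):
    print(f"step {step:3d}  vanilla = {vanilla[step - 1]:.6f}  "
          f"momentum = {with_momentum[step - 1]:.6f}")
```

Which run is ahead at a given step depends on `lr` and `momentum`: momentum can trail early on while it oscillates and still reach a lower loss over a longer horizon. It is worth trying a few settings rather than reading too much into a single pair.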


## Key Takeaways

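- Momentum accelerates SGD along consistent gradient directions and can damp oscillation, but with aggressive settings it can also overshoot, as the non-monotone loss trace above shows.  
- The update rule introduces a **velocity term**, an exponentially decaying sum of past gradient steps.  
- Widely used for training deep networks, and still common alongside adaptive optimizers such as Adam.  
- The same idea underlies more advanced optimizers such as **Nesterov momentum** and **Adam**.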
