# 🚀 Optimizer Comparison: SGD vs Momentum vs Adam

This notebook provides a hands-on comparison of different optimization algorithms using the **FashionMNIST** dataset. We'll train the same neural network architecture with three different optimizers and observe how they affect training speed and convergence.

## What You'll Learn

1. How to set up a classification task with PyTorch
2. The practical differences between SGD, SGD+Momentum, and Adam
3. How optimizer choice affects training dynamics
4. How to evaluate model performance with classification metrics

---

## 1. Setup and Imports

We start by importing the necessary libraries:
- `torch`: Core PyTorch library
- `nn`: Neural network modules
- `optim`: Optimization algorithms (SGD, Adam, etc.)
- `datasets`: Pre-built datasets like FashionMNIST
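- `DataLoader`: Utility for batching and shuffling data
- `ToTensor`: Transform that converts images to tensors
- `pyplot`: Matplotlib plotting interface, used to display sample images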

In [None]:
import torch
from torch import nn, optim
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
from matplotlib import pyplot as plt

## 2. Loading the FashionMNIST Dataset

**FashionMNIST** is a dataset of Zalando's article images, designed as a drop-in replacement for MNIST. It contains:
- 60,000 training images
- 10,000 test images
- 10 clothing categories
- 28×28 grayscale images

The `ToTensor()` transform converts PIL images to PyTorch tensors and scales pixel values from [0, 255] to [0.0, 1.0].

In [None]:
train_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

## 3. Creating DataLoaders

DataLoaders wrap our datasets and provide:
- **Batching**: Groups samples into mini-batches (64 images per batch)
- **Shuffling**: Randomizes sample order each epoch when `shuffle=True`, so the model does not see batches in the same order every time
- **Parallel loading**: Loads data in background worker processes when `num_workers > 0`

A batch size of 64 is a common choice that balances:
- Memory usage (smaller batches use less GPU memory)
- Gradient stability (larger batches give more stable gradients)
- Training speed (larger batches are more efficient on GPUs)
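
To make the batching arithmetic concrete, here is a small sketch in plain Python (no PyTorch needed). It assumes the 60,000 training images listed earlier and the default `drop_last=False` behaviour of `DataLoader`, under which the final, smaller batch is kept.

```python
import math

num_samples = 60_000   # FashionMNIST training set size
batch_size = 64

# With drop_last=False, a final partial batch is kept rather than discarded.
batches_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch)  # 938
print(last_batch_size)    # 32
```

The first number is what `len(train_loader)` reports in the training logs later in the notebook.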

In [None]:
batch_size = 64
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_data, batch_size=batch_size)

## 4. Exploring the Data

Let's examine the shape of our data:
- **Images**: `[batch_size, channels, height, width]` = `[64, 1, 28, 28]`
- **Labels**: `[batch_size]` = `[64]` (one label per image)

The images have 1 channel (grayscale) and are 28×28 pixels.

In [None]:
for images, labels in train_loader:
    print(f"Image batch shape: {images.size()}")
    print(f"Label batch shape: {labels.size()}")
    break

### Visualizing a Sample Image

Let's look at one of the training images. The `.squeeze()` removes the channel dimension (from `[1, 28, 28]` to `[28, 28]`) for visualization.

In [None]:
plt.figure(figsize=(2,2))
plt.imshow(images[0].squeeze(), cmap="gray")
plt.show()

In [None]:
labels[0]

### Class Labels

FashionMNIST has 10 classes, each representing a type of clothing item. The labels are integers 0-9, which we map to human-readable names.

In [None]:
classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot"
]
classes[labels[0].item()]

## 5. Device Configuration

We check if a GPU (CUDA) is available. Training on GPU is significantly faster than CPU for neural networks. If no GPU is available, we fall back to CPU.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## 6. Defining the Neural Network

We create a simple **feedforward neural network** (Multi-Layer Perceptron) for classification:

```
Input (28×28 = 784) → Hidden (128) → Hidden (64) → Output (10)
```

**Architecture breakdown:**
1. `Flatten()`: Converts 28×28 image to 784-element vector
2. `Linear(784, 128)`: First hidden layer with 128 neurons
3. `ReLU()`: Activation function (introduces non-linearity)
4. `Linear(128, 64)`: Second hidden layer with 64 neurons
5. `ReLU()`: Another activation
6. `Linear(64, 10)`: Output layer (10 classes)
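
As a quick sanity check on the size of this network, the sketch below counts its trainable parameters in plain Python. It assumes each `Linear` layer keeps its bias term (the PyTorch default); `Flatten` and `ReLU` have no parameters.

```python
# Layer sizes from the architecture above: 784 -> 128 -> 64 -> 10
layer_sizes = [28 * 28, 128, 64, 10]

total = 0
for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
    weights = fan_in * fan_out   # one weight per input-output pair
    biases = fan_out             # one bias per output neuron
    print(f"Linear({fan_in}, {fan_out}): {weights + biases:,} parameters")
    total += weights + biases

print(f"Total: {total:,}")  # Total: 109,386
```

Almost all of the parameters sit in the first layer, because it connects every one of the 784 input pixels to 128 neurons.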

**Note:** We don't apply Softmax at the end because `CrossEntropyLoss` expects raw logits.
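
To see why raw logits are enough, here is a hedged sketch of what the loss computes for a single sample: the negative log of the softmax probability assigned to the true class. The function names and the three-class logits are made up for illustration; `CrossEntropyLoss` does the equivalent on whole batches and averages the result by default.

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Loss for one sample: -log(softmax(logits)[target])."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores for 3 classes
target = 0                 # index of the true class

probs = softmax(logits)
loss = cross_entropy_from_logits(logits, target)
print([round(p, 3) for p in probs])
print(round(loss, 3))
```

Because the softmax is folded into the loss, adding an explicit `Softmax` layer to the model would apply it twice.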

In [None]:
class ClothsClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.network(x)

---

## 7. Training with Vanilla SGD

**Stochastic Gradient Descent (SGD)** is the simplest optimizer:

```
weights = weights - learning_rate × gradient
```

**Characteristics:**
- Simple and well-understood
- Can be slow to converge
- May oscillate in narrow valleys
- Often generalizes well (with proper tuning)

We use `lr=0.01` as our learning rate.
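
The update rule is easy to try by hand. The sketch below applies it to a toy loss `f(w) = w**2` (an assumption chosen only because its gradient, `2 * w`, is simple) and compares three learning rates. The helper `run_sgd` is hypothetical and not part of PyTorch.

```python
def run_sgd(lr, steps=10, w=1.0):
    for _ in range(steps):
        gradient = 2 * w          # derivative of f(w) = w**2
        w = w - lr * gradient     # weights = weights - learning_rate × gradient
    return w

results = {lr: run_sgd(lr) for lr in (0.01, 0.9, 1.1)}
for lr, w in results.items():
    print(f"lr = {lr}: w after 10 steps = {w:+.4f}")
```

With `lr=0.01` the weight creeps toward the minimum at 0. With `lr=0.9` it overshoots and flips sign on every step but still shrinks, and with `lr=1.1` every step lands farther away than the last. This is the tuning sensitivity mentioned in the list of characteristics.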

In [None]:
model = ClothsClassifier().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

### Training Loop

The training loop follows this pattern for each batch:

1. **Forward pass**: Compute predictions (`model(images)`)
2. **Compute loss**: Compare predictions to true labels
3. **Zero gradients**: Clear old gradients (`optimizer.zero_grad()`)
4. **Backward pass**: Compute gradients (`loss.backward()`)
5. **Update weights**: Apply gradients (`optimizer.step()`)

We print the loss every 100 batches to monitor training progress.
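
Before running the real loop, it can help to see the same five steps on a problem small enough to follow by hand. The sketch below is a plain-Python analogue, not PyTorch code: the "model" is a single weight `w`, the toy data follows `y = 3x`, and the `grad` variable stands in for the gradient buffer that PyTorch keeps on each parameter.

```python
# Toy data for y = 3x (a made-up example)
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
batch_size = 2
lr = 0.01
w = 0.0
grad = 0.0  # gradient buffer, playing the role of a parameter's .grad

for epoch in range(50):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # 1. Forward pass
        preds = [w * x for x, _ in batch]
        # 2. Compute loss (mean squared error for this toy problem)
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
        # 3. Zero gradients: otherwise the buffer would keep adding up
        grad = 0.0
        # 4. Backward pass: d(loss)/dw, accumulated into the buffer
        grad += sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
        # 5. Update weights
        w -= lr * grad

print(f"w = {w:.4f}, last batch loss = {loss:.6f}")
```

The weight ends up very close to 3. Step 3 matters because step 4 adds to the buffer rather than overwriting it, which mirrors how `loss.backward()` accumulates gradients in PyTorch.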

In [None]:
epochs = 2

for epoch in range(epochs):
    for batch, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = loss_fn(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


        if (batch + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{epochs}], Step [{batch+1}/{len(train_loader)}], Loss: {loss.item():.4f}")

### SGD Results Analysis

Notice how the loss decreases over time, but the progress can be slow. Vanilla SGD often:
- Starts with a high loss (an untrained 10-class model sits near ln(10) ≈ 2.3)
- Gradually decreases but may plateau
- Shows some oscillation in loss values

---

## 8. Training with SGD + Momentum

**Momentum** adds "velocity" to gradient descent, helping it:
- Accelerate in consistent gradient directions
- Dampen oscillations in narrow valleys
- Escape small local minima

```
velocity = β × velocity + gradient
weights = weights - learning_rate × velocity
```

We use `momentum=0.9`, which means 90% of the previous velocity is retained. This is the most common setting.

**Expected improvement:** Typically faster convergence and a lower training loss than vanilla SGD after the same number of epochs.
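
The effect is easy to reproduce on a toy problem. The sketch below uses an elongated quadratic bowl as a stand-in for a "narrow valley" (the loss function and the helper `run` are assumptions for illustration) and applies the two-line velocity update from this section. Setting `beta=0.0` reduces it to vanilla SGD.

```python
def loss(x, y):
    # Elongated bowl: curvature 1 along x, 25 along y (an illustrative choice)
    return 0.5 * (x ** 2 + 25 * y ** 2)

def gradient(x, y):
    return x, 25 * y

def run(lr, beta, steps=100):
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # velocity = β × velocity + gradient
        vx, vy = beta * vx + gx, beta * vy + gy
        # weights = weights - learning_rate × velocity
        x, y = x - lr * vx, y - lr * vy
    return loss(x, y)

sgd_loss = run(lr=0.01, beta=0.0)        # beta = 0 is vanilla SGD
momentum_loss = run(lr=0.01, beta=0.9)
print(f"SGD:      {sgd_loss:.6f}")
print(f"Momentum: {momentum_loss:.6f}")
```

With the same learning rate and step count, the momentum run finishes with a much lower loss, because velocity builds up along the shallow `x` direction where plain SGD takes tiny steps.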

In [None]:
model = ClothsClassifier().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

In [None]:
epochs = 2

for epoch in range(epochs):
    for batch, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = loss_fn(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


        if (batch + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{epochs}], Step [{batch+1}/{len(train_loader)}], Loss: {loss.item():.4f}")

### Momentum Results Analysis

Compare these results to vanilla SGD:
- **Faster initial progress**: Loss drops more quickly in early iterations
- **Lower loss values**: Typically reaches a lower training loss in the same number of epochs
- **Smoother convergence**: Less oscillation in loss values

Momentum is especially helpful when the loss landscape has narrow valleys or noisy gradients.

---

## 9. Training with Adam

**Adam (Adaptive Moment Estimation)** combines the best of:
- **Momentum**: Tracks gradient direction (first moment)
- **RMSProp**: Adapts learning rate per-parameter (second moment)

```
m = β₁ × m + (1 - β₁) × gradient        # Momentum (first moment)
v = β₂ × v + (1 - β₂) × gradient²       # RMSProp (second moment)
m̂ = m / (1 - β₁ᵗ),  v̂ = v / (1 - β₂ᵗ)   # Bias correction at step t
weights = weights - lr × m̂ / (√v̂ + ε)   # Adaptive update
```

**Key advantages:**
- Works well out-of-the-box with default parameters
- Adapts learning rate for each parameter individually
- Handles sparse gradients effectively
- Fast convergence on most problems

**Note:** We use `lr=0.01` here, but Adam's default is `lr=0.001`. Higher learning rates can work but may cause instability.
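
One way to see the per-parameter adaptation is to compute the very first Adam update by hand. The sketch below is plain Python, with a hypothetical helper `adam_first_step` that starts from `m = v = 0` and applies the bias correction (dividing `m` by `1 - β₁ᵗ` and `v` by `1 - β₂ᵗ`). The `beta1`, `beta2`, and `eps` values match the defaults of `optim.Adam`.

```python
import math

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1), starting from m = v = 0."""
    t = 1
    m = (1 - beta1) * grad            # first moment after one step
    v = (1 - beta2) * grad ** 2       # second moment after one step
    m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (0.001, 1.0, 1000.0):
    print(f"gradient = {g:>8}: step = {adam_first_step(g):.6f}")
```

The step size comes out close to `lr` whether the gradient is tiny or huge, since the update divides the gradient by an estimate of its own magnitude. Plain SGD would instead take steps a million times larger for the last gradient than for the first.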

In [None]:
model = ClothsClassifier().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

In [None]:
epochs = 2

for epoch in range(epochs):
    for batch, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = loss_fn(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


        if (batch + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{epochs}], Step [{batch+1}/{len(train_loader)}], Loss: {loss.item():.4f}")

### Adam Results Analysis

Adam typically shows:
- **Fastest initial convergence**: Loss drops very quickly in early iterations
- **Adaptive behavior**: Automatically adjusts step sizes for different parameters
- **Stable training**: Often less sensitive to the exact learning rate than plain SGD

Adam is often the default choice for deep learning because it "just works" without much tuning.

---

## 10. Model Evaluation

Now let's evaluate our Adam-trained model on the test set. We:
1. Set the model to evaluation mode (`model.eval()`)
2. Disable gradient computation (`torch.no_grad()`) for efficiency
3. Collect predictions and true labels for all test samples

**Why `model.eval()`?**
- Disables dropout (if any)
- Uses running statistics for batch normalization (if any)
- Ensures consistent inference behavior

In [None]:
model.eval()

all_predicted = []
all_labels = []


with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)

        # Predicted class = index of the largest logit
        predicted = outputs.argmax(dim=1)

        all_predicted.extend(predicted.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

### Checking Predictions

Let's verify our predictions by looking at the first few samples. The labels and predictions should match for correctly classified images.

In [None]:
all_labels[:5]

In [None]:
all_predicted[:5]

### Classification Report

The classification report provides detailed metrics for each class:

- **Precision**: Of all items predicted as class X, what fraction were actually X?
- **Recall**: Of all actual class X items, what fraction did we correctly identify?
- **F1-Score**: Harmonic mean of precision and recall (balanced metric)
- **Support**: Number of samples in each class

**Accuracy** is the overall percentage of correct predictions.
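
These definitions can be checked by hand on a tiny example. The sketch below is plain Python with made-up labels and a hypothetical helper `per_class_metrics`. It computes the same quantities for one class that `classification_report` prints for every class.

```python
def per_class_metrics(y_true, y_pred, cls):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)   # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)   # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn                                   # actual samples of cls
    return precision, recall, f1, support

y_true = [0, 0, 1, 1, 1, 2]   # made-up ground truth
y_pred = [0, 1, 1, 1, 1, 2]   # made-up predictions

precision, recall, f1, support = per_class_metrics(y_true, y_pred, cls=1)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"class 1: precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
```

Here class 1 is predicted four times but only three of those are correct, so precision is 0.75, while all three real class-1 samples are found, so recall is 1.0.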

In [None]:
from sklearn.metrics import classification_report

report = classification_report(all_labels, all_predicted, target_names=classes)
print(report)

---

## 11. Summary: Optimizer Comparison

| Optimizer | Convergence Speed | Ease of Use | Notes |
|-----------|-------------------|-------------|-------|
| **SGD** | Slow | Requires tuning | Good generalization with proper LR |
| **SGD + Momentum** | Medium | Moderate | Faster than SGD, handles valleys well |
| **Adam** | Fast | Easy (works out-of-box) | Default choice for most tasks |

### Key Takeaways

1. **Vanilla SGD** is simple but can be slow and oscillate
2. **Momentum** accelerates training by building velocity in consistent directions
3. **Adam** combines momentum with adaptive learning rates for fast, stable training
4. **Start with Adam** for quick experiments, try **SGD+Momentum** for potentially better generalization

### Next Steps

- Try different learning rates for each optimizer
- Add learning rate scheduling (e.g., `StepLR`, `CosineAnnealingLR`)
- Experiment with `AdamW` (Adam with decoupled weight decay)
- Compare final test accuracy across optimizers with more epochs