# ⚡ Comparing Gradient Descent Variants

This notebook compares three gradient descent approaches on the same problem:

| Variant | Data per Update | Updates per Epoch |
|---------|-----------------|-------------------|
| **Batch GD** | All 100 samples | 1 |
| **Mini-Batch GD** | 16 samples | 7 (⌈100/16⌉; the last batch has 4 samples) |
| **SGD** | 1 sample | 100 |

**Problem:** Find weights for `bonus = w₁×performance + w₂×experience + w₃×projects + bias`
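
The update counts in the table above follow from simple arithmetic. The sketch below assumes the 100-sample dataset and batch size of 16 stated in the table, and that the last partial batch is kept rather than dropped; `updates_per_epoch` is a hypothetical helper, not part of the notebook.

```python
import math

n_samples = 100  # dataset size stated in the table above


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of parameter updates in one pass when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


counts = {
    "batch": updates_per_epoch(n_samples, n_samples),
    "mini_batch": updates_per_epoch(n_samples, 16),
    "sgd": updates_per_epoch(n_samples, 1),
}

# Size of the final, partial mini-batch
last_batch_size = n_samples - 16 * (counts["mini_batch"] - 1)

print(counts)           # {'batch': 1, 'mini_batch': 7, 'sgd': 100}
print(last_batch_size)  # 4
```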

---
## Setup & Data Loading

In [None]:
import pandas as pd
import torch
from matplotlib import pyplot as plt

In [None]:
# Load the employee bonus dataset
df = pd.read_csv('emp_bonus.csv')
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Convert to PyTorch tensors
performance = torch.tensor(df['performance'].values, dtype=torch.float32)
years_of_experience = torch.tensor(df['years_of_experience'].values, dtype=torch.float32)
projects_completed = torch.tensor(df['projects_completed'].values, dtype=torch.float32)
bonus = torch.tensor(df['bonus'].values, dtype=torch.float32)

### Helper Functions

In [None]:
def plot_loss(x_range, loss_history, title):
    """Plot loss over iterations."""
    plt.figure(figsize=(10, 6))
    plt.plot(x_range, loss_history, color='blue', linewidth=2)
    plt.title(title)
    plt.xlabel("Iteration")
    plt.ylabel("Mean Squared Error")
    plt.grid(True)
    plt.show()

def safe_plot_loss(loss_history, start, end, title):
    """Plot loss with bounds checking to avoid IndexError."""
    n = len(loss_history)
    if n == 0:
        print(f"Warning: No loss history to plot for '{title}'")
        return
    
    # Clamp indices to valid range
    safe_start = max(0, min(start, n - 1))
    safe_end = min(end, n)
    
    if safe_start >= safe_end:
        print(f"Warning: Requested range [{start}:{end}] out of bounds (history length: {n})")
        print(f"Plotting available range [0:{n}] instead.")
        safe_start, safe_end = 0, n
    elif safe_start != start or safe_end != end:
        print(f"Warning: Adjusted range from [{start}:{end}] to [{safe_start}:{safe_end}]")
    
    plot_loss(range(safe_start, safe_end), loss_history[safe_start:safe_end], title)

---
## 1. Batch Gradient Descent

**Characteristics:**
- Uses **entire dataset** for each gradient computation
- **1 update per epoch**
- Smooth, stable convergence
- High memory usage
- Slow for large datasets
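
To make "entire dataset for each gradient computation" concrete, the sketch below compares the gradient that autograd produces for the full-batch MSE with the hand-derived formula. The data is a synthetic stand-in (an assumption, not `emp_bonus.csv`), generated from the weights the notebook reports as true values; the names `X`, `w`, and `b` are illustrative only.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the employee data (assumption: not the real emp_bonus.csv)
n = 100
X = torch.rand(n, 3) * 10  # columns: performance, experience, projects
y = X @ torch.tensor([12.0, 6.0, 2.0]) + 20.0

w = torch.rand(3, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# Full-batch MSE, as in Batch GD: every sample contributes to one gradient
pred = X @ w + b
loss = ((pred - y) ** 2).mean()
loss.backward()

# Hand-derived gradients of the mean squared error
with torch.no_grad():
    err = X @ w + b - y
    grad_w = 2.0 * (X * err.unsqueeze(1)).mean(dim=0)
    grad_b = 2.0 * err.mean().reshape(1)

print(torch.allclose(w.grad, grad_w, rtol=1e-3))
print(torch.allclose(b.grad, grad_b, rtol=1e-3))
```

Both checks should agree up to floating-point error, since autograd differentiates the same mean over all samples.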

In [None]:
# Initialize weights randomly
w1 = torch.rand(1, requires_grad=True)
w2 = torch.rand(1, requires_grad=True)
w3 = torch.rand(1, requires_grad=True)
bias = torch.rand(1, requires_grad=True)

# Hyperparameters
learning_rate = 0.006
epochs = 5000

# Track loss history
batch_loss_history = []

In [None]:
# Training loop - Batch Gradient Descent
for epoch in range(epochs):
    # Forward pass: use ALL data
    predicted_bonus = w1 * performance + w2 * years_of_experience + w3 * projects_completed + bias
    
    # Compute MSE loss over entire dataset
    loss = ((predicted_bonus - bonus) ** 2).mean()
    batch_loss_history.append(loss.item())
    
    # Backward pass
    loss.backward()
    
    # Update weights
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w3 -= learning_rate * w3.grad
        bias -= learning_rate * bias.grad
    
    # Zero gradients
    w1.grad.zero_()
    w2.grad.zero_()
    w3.grad.zero_()
    bias.grad.zero_()
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")

print(f"\nLearned: w1={w1.item():.2f}, w2={w2.item():.2f}, w3={w3.item():.2f}, bias={bias.item():.2f}")
print("True:    w1=12, w2=6, w3=2, bias=20")

In [None]:
# Plot loss curve with safe bounds checking
safe_plot_loss(batch_loss_history, 1000, 2000, "Batch GD: Smooth Convergence")

---
## 2. Mini-Batch Gradient Descent

**Characteristics:**
- Uses **small batches** (16 samples) for each update
- **Multiple updates per epoch** (⌈100/16⌉ = 7: six batches of 16 and one of 4)
- **Shuffles data each epoch** to prevent learning order-dependent patterns
- Balances speed and stability
- **Industry standard** for deep learning
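
The shuffle-then-slice pattern can be checked in isolation. This is a minimal sketch assuming 100 samples and a batch size of 16, as stated above; it only handles indices and trains nothing.

```python
import torch

torch.manual_seed(0)

n_samples, batch_size = 100, 16
indices = torch.randperm(n_samples)  # a fresh random order, drawn once per epoch

# Slice the shuffled indices into consecutive mini-batches
batches = [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]
batch_sizes = [len(b) for b in batches]

# Every sample should appear exactly once per epoch
seen = torch.cat(batches).sort().values

print(batch_sizes)                                 # [16, 16, 16, 16, 16, 16, 4]
print(torch.equal(seen, torch.arange(n_samples)))  # True
```

Because Python slicing stops at the end of the tensor, the final batch simply comes out shorter and needs no special handling.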

In [None]:
# Re-initialize weights
w1 = torch.rand(1, requires_grad=True)
w2 = torch.rand(1, requires_grad=True)
w3 = torch.rand(1, requires_grad=True)
bias = torch.rand(1, requires_grad=True)

# Hyperparameters
learning_rate = 0.001  # Lower LR for more frequent updates
epochs = 5000
batch_size = 16
n_samples = len(performance)

mini_loss_history = []

In [None]:
# Training loop - Mini-Batch Gradient Descent
for epoch in range(epochs):
    # Shuffle at the start of each epoch so the batches differ from epoch to epoch
    indices = torch.randperm(n_samples)
    perf_shuffled = performance[indices]
    exp_shuffled = years_of_experience[indices]
    proj_shuffled = projects_completed[indices]
    bonus_shuffled = bonus[indices]
    
    # Process data in batches
    for i in range(0, n_samples, batch_size):
        # Select mini-batch from shuffled data
        batch_perf = perf_shuffled[i:i + batch_size]
        batch_exp = exp_shuffled[i:i + batch_size]
        batch_proj = proj_shuffled[i:i + batch_size]
        batch_bonus = bonus_shuffled[i:i + batch_size]
        
        # Forward pass on batch only
        predicted = w1 * batch_perf + w2 * batch_exp + w3 * batch_proj + bias
        
        # Loss on batch
        loss = ((predicted - batch_bonus) ** 2).mean()
        mini_loss_history.append(loss.item())
        
        # Backward pass
        loss.backward()
        
        # Update weights
        with torch.no_grad():
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad
            w3 -= learning_rate * w3.grad
            bias -= learning_rate * bias.grad
        
        # Zero gradients
        w1.grad.zero_()
        w2.grad.zero_()
        w3.grad.zero_()
        bias.grad.zero_()
    
    if (epoch + 1) % 100 == 0:
        # Note: this is the loss of the epoch's final mini-batch, not the full-dataset loss
        print(f"Epoch [{epoch + 1}/{epochs}], Last-batch loss: {loss.item():.4f}")

print(f"\nLearned: w1={w1.item():.2f}, w2={w2.item():.2f}, w3={w3.item():.2f}, bias={bias.item():.2f}")
print(f"Total iterations: {len(mini_loss_history)}")

In [None]:
# Plot loss curve with safe bounds checking
safe_plot_loss(mini_loss_history, 10000, 10300, "Mini-Batch GD: Moderate Noise")

---
## 3. Stochastic Gradient Descent (SGD)

**Characteristics:**
- Uses **single sample** for each update
- **100 updates per epoch** (one per sample)
- **Shuffles data each epoch**
- Very noisy but fast updates
- Low memory usage
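- Noise can help escape local minima in non-convex problems (the MSE of this linear model is convex, so it has no spurious local minima)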
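
Why single-sample updates are noisy yet still point the right way on average: each per-sample gradient is an unbiased estimate of the full-batch gradient. The sketch below uses a toy one-feature model and made-up data (both assumptions, not the notebook's dataset) and needs only the standard library.

```python
import random
import statistics

random.seed(0)

# Toy 1-feature data (assumption): y = 3x + noise
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [3.0 * x + random.gauss(0, 1) for x in xs]

w = 0.5  # current weight guess

# Gradient of the squared error (w*x - y)^2 with respect to w, one value per sample
per_sample_grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]

# Gradient of the mean squared error over the full dataset
full_grad = 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

print(f"full-batch gradient:      {full_grad:.2f}")
print(f"mean of per-sample grads: {statistics.mean(per_sample_grads):.2f}")
print(f"std of per-sample grads:  {statistics.stdev(per_sample_grads):.2f}")
```

The first two printed values should match, while the standard deviation shows how far an individual SGD step can stray from the full-batch direction.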

In [None]:
# Re-initialize weights (torch.rand, matching the Batch and Mini-Batch runs)
w1 = torch.rand(1, requires_grad=True)
w2 = torch.rand(1, requires_grad=True)
w3 = torch.rand(1, requires_grad=True)
bias = torch.rand(1, requires_grad=True)

# Hyperparameters
learning_rate = 0.001
epochs = 500  # Fewer epochs: 100 updates per epoch gives 50,000 updates in total
n_samples = len(performance)

sgd_loss_history = []

In [None]:
# Training loop - Stochastic Gradient Descent
for epoch in range(epochs):
    # Shuffle data at the start of each epoch
    indices = torch.randperm(n_samples)
    
    for i in range(n_samples):
        # Use single data point (from shuffled indices)
        idx = indices[i]
        single_perf = performance[idx]
        single_exp = years_of_experience[idx]
        single_proj = projects_completed[idx]
        single_bonus = bonus[idx]
        
        # Forward pass on single sample
        predicted = w1 * single_perf + w2 * single_exp + w3 * single_proj + bias
        
        # Loss on single sample (squared error, not mean)
        loss = (predicted - single_bonus) ** 2
        
        # Record every 10th iteration to keep the history small (this thins the curve; it does not smooth it)
        if i % 10 == 0:
            sgd_loss_history.append(loss.item())
        
        # Backward pass
        loss.backward()
        
        # Update weights
        with torch.no_grad():
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad
            w3 -= learning_rate * w3.grad
            bias -= learning_rate * bias.grad
        
        # Zero gradients
        w1.grad.zero_()
        w2.grad.zero_()
        w3.grad.zero_()
        bias.grad.zero_()
    
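    if (epoch + 1) % 100 == 0:
        # Note: this is the squared error of the epoch's last sample, not the full-dataset loss
        print(f"Epoch [{epoch + 1}/{epochs}], Last-sample loss: {loss.item():.4f}")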

print(f"\nLearned: w1={w1.item():.2f}, w2={w2.item():.2f}, w3={w3.item():.2f}, bias={bias.item():.2f}")
print(f"Total recorded iterations: {len(sgd_loss_history)}")

In [None]:
# Plot loss curve with safe bounds checking
safe_plot_loss(sgd_loss_history, 1000, 1300, "SGD: High Variance (Noisy)")

---
## Comparison Summary

| Aspect | Batch GD | Mini-Batch GD | SGD |
|--------|----------|---------------|-----|
| **Data per update** | 100 (all) | 16 | 1 |
| **Updates per epoch** | 1 | 7 | 100 |
| **Shuffling** | Not needed | Each epoch | Each epoch |
| **Convergence** | Smooth | Moderate noise | Very noisy |
| **Speed** | Slow (one update per full pass) | Balanced | Cheap, frequent updates |
| **Memory** | High | Moderate | Low |
| **Best for** | Small data | Most tasks | Large data, online learning |

### Key Observations:
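1. **Batch GD** has the smoothest loss curve but makes the least progress per epoch (one update per full pass)
2. **Mini-Batch GD** balances speed and stability (recommended default)
3. **SGD** makes the most updates per epoch, so it often progresses quickly at first, but its per-sample loss has high variance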
4. **Shuffling** is essential for Mini-Batch and SGD to prevent learning patterns from data order
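
The smooth-versus-noisy ordering above can be measured directly: the spread of the gradient estimate shrinks as the batch grows. This is a sketch on a toy one-feature model with made-up data (assumptions, not the notebook's dataset); `batch_gradient` and `gradient_std` are hypothetical helpers.

```python
import random
import statistics

random.seed(0)

# Toy 1-feature data (assumption): y = 3x + noise
n = 100
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [3.0 * x + random.gauss(0, 1) for x in xs]
w = 0.5  # fixed weight at which all gradients are evaluated


def batch_gradient(idx):
    """Gradient of the batch-mean squared error with respect to w."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)


def gradient_std(batch_size, trials=500):
    """Standard deviation of the gradient estimate over randomly drawn batches."""
    grads = [batch_gradient(random.sample(range(n), batch_size)) for _ in range(trials)]
    return statistics.stdev(grads)


noise = {bs: gradient_std(bs) for bs in (1, 16, 100)}
for bs, s in noise.items():
    print(f"batch_size={bs:>3}: gradient std = {s:.4f}")
```

With a batch size of 100 every draw contains the whole dataset, so the spread is essentially zero, which is why the Batch GD curve is smooth.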

---
## Key Takeaways

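1. **All three methods head toward the same optimal weights**, because the MSE of a linear model is convex; with a fixed learning rate, Mini-Batch GD and SGD can keep fluctuating around the optimum instead of settling exactly on it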
2. **Mini-Batch is the industry standard** for deep learning
3. **Always shuffle data** at the start of each epoch (except for Batch GD)
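4. **Learning rate must be adjusted** based on batch size:
   - Larger batches → less gradient noise, so a higher learning rate is usually tolerable (0.006 for Batch GD here)
   - Smaller batches → noisier gradients, so a lower learning rate is usually needed (0.001 for Mini-Batch GD and SGD here)
5. **Noise in SGD can be beneficial** for escaping local minima in non-convex problems; this linear regression has a single minimum, so the benefit does not apply here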
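
One way to check the first takeaway is to compare the learned weights against the closed-form least-squares solution, which is the minimizer of the MSE that all three variants aim for. The sketch below uses synthetic, noise-free data (an assumption; the feature ranges are made up) built from the weights the notebook reports as true values.

```python
import torch

torch.manual_seed(0)

# Synthetic, noise-free stand-in for the dataset (assumption: feature ranges are made up)
n = 100
X = torch.rand(n, 3, dtype=torch.float64) * 10
true_params = torch.tensor([12.0, 6.0, 2.0, 20.0], dtype=torch.float64)  # w1, w2, w3, bias

# Append a column of ones so the bias is solved for as a fourth weight
A = torch.cat([X, torch.ones(n, 1, dtype=torch.float64)], dim=1)
y = A @ true_params

# Least-squares solution of A @ params = y
solution = torch.linalg.lstsq(A, y.unsqueeze(1)).solution.squeeze(1)

print(solution)
```

On the notebook's own tensors, the same matrix could be built with `torch.stack([performance, years_of_experience, projects_completed, torch.ones_like(performance)], dim=1)` and `bonus` as the target; the result then gives a reference point for the weights each training loop prints.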