# RMSProp: Root Mean Square Propagation

## 🎯 What This Notebook Covers

**RMSProp** fixes AdaGrad's diminishing learning rate problem by using exponentially weighted averages. In this notebook, we explore:

1. ✅ **The Problem** - AdaGrad's limitation
2. ✅ **RMSProp Solution** - Exponentially weighted averages
3. ✅ **Mathematical Formulation** - How RMSProp works
4. ✅ **Implementation** - RMSProp from scratch
5. ✅ **Performance Comparison** - RMSProp vs AdaGrad vs SGD

### Why This Matters

- **Fixes AdaGrad**: Learning rate doesn't vanish 🔧
- **Better for Deep Networks**: Works well in practice 🏗️
- **Foundation for Adam**: Key component of Adam optimizer ⭐

Let's master RMSProp! 🚀

---

## 1. Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from IPython.display import display, Markdown

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

## 2. The Problem: AdaGrad's Diminishing Learning Rate

### Recap: AdaGrad

$$
\begin{align}
G_t &= G_{t-1} + g_t^2 \quad \text{(accumulate squared gradients)} \\
\theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g_t
\end{align}
$$

### The Problem

```
G_t only grows (never shrinks)
    ↓
Effective LR = α / sqrt(G_t)
    ↓
As G_t → ∞, Effective LR → 0
    ↓
Training stops! ❌
```
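
To put numbers on the chain above, here is a minimal sketch in plain Python. It assumes a constant gradient of 1.0 and α = 0.1 purely for illustration; `adagrad_effective_lrs` is a hypothetical helper, not part of the code used later in this notebook.

```python
import math

def adagrad_effective_lrs(grads, alpha=0.1, epsilon=1e-8):
    """Return the effective learning rate alpha / sqrt(G_t + epsilon) after each step."""
    G = 0.0
    rates = []
    for g in grads:
        G += g ** 2  # G_t only ever grows
        rates.append(alpha / math.sqrt(G + epsilon))
    return rates

# Assumed toy input: the same gradient (1.0) at every step
rates = adagrad_effective_lrs([1.0] * 10_000)
for t in (1, 100, 10_000):
    print(f"t = {t:>6}: effective LR = {rates[t - 1]:.6f}")
```

With a constant gradient, $G_t = t \cdot g^2$, so the effective rate falls like $\alpha / \sqrt{t}$: roughly 0.1, 0.01, and 0.001 at steps 1, 100, and 10,000.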

### RMSProp's Solution

**Key Idea**: Use an **exponentially weighted average** of squared gradients instead of a running sum!

$$
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2
$$

This way:
- Recent gradients have more weight
- Old gradients fade away
- Learning rate doesn't vanish! ✅
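
Unrolling the recursion makes the "fading" explicit. Assuming the average starts at $E[g^2]_0 = 0$ (an assumption; the formula above does not fix the initial value), repeated substitution gives:

$$
E[g^2]_t = (1-\beta) \sum_{i=1}^{t} \beta^{\,t-i} g_i^2
$$

The squared gradient from $k$ steps ago is weighted by $(1-\beta)\beta^k$, which shrinks geometrically. The weights sum to $1-\beta^t \le 1$, so if every squared gradient is bounded by some $M$, then $E[g^2]_t \le M$ as well. Unlike AdaGrad's sum, it cannot grow without bound.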

---

## 3. Mathematical Formulation

### RMSProp Update Rule

$$
\begin{align}
E[g^2]_t &= \beta E[g^2]_{t-1} + (1-\beta) g_t^2 \quad \text{(exponentially weighted average)} \\
\theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t \quad \text{(parameter update)}
\end{align}
$$

Where:
- $g_t = \nabla L(\theta_t)$ = gradient at time $t$
- $E[g^2]_t$ = exponentially weighted average of squared gradients
- $\beta$ = decay rate (typically 0.9 or 0.999)
- $\alpha$ = learning rate (e.g., 0.001)
- $\epsilon$ = small constant for numerical stability (e.g., $10^{-8}$)
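
The update rule can be sketched for a single scalar parameter. This is an illustrative sketch in plain Python, not the notebook's implementation: `rmsprop_step` and the toy objective $f(x) = x^2$ are assumptions chosen for the example, and the average is assumed to start at zero.

```python
import math

def rmsprop_step(theta, grad, avg_sq, alpha=0.01, beta=0.9, epsilon=1e-8):
    """Apply one RMSProp update to a scalar; return (new_theta, new_avg_sq)."""
    # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * g_t^2
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    # theta_{t+1} = theta_t - alpha / sqrt(E[g^2]_t + epsilon) * g_t
    theta = theta - alpha / math.sqrt(avg_sq + epsilon) * grad
    return theta, avg_sq

# Assumed toy problem: minimize f(x) = x^2, whose gradient is 2x
x, avg_sq = 1.0, 0.0
for _ in range(1000):
    x, avg_sq = rmsprop_step(x, 2 * x, avg_sq)

print(f"x after 1000 steps: {x:.4f}")
```

Because the gradient is divided by its own running magnitude, each step has a size on the order of α regardless of how large or small the raw gradient is. In this toy problem, x ends up close to the minimum at 0 rather than converging to it exactly.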

### Comparison: AdaGrad vs RMSProp

| Aspect | AdaGrad | RMSProp |
|--------|---------|----------|
| Gradient Accumulation | $G_t = G_{t-1} + g_t^2$ | $E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$ |
| Type | Sum (monotonic) | Exponential average |
| Learning Rate | Diminishes to 0 | Stays active |
| Best For | Sparse features | Deep networks |

### Why "Root Mean Square"?

The denominator is (approximately) the **root mean square** of recent gradients, that is, the square root of their exponentially weighted mean square:

$$
\text{RMS}(g) = \sqrt{E[g^2]_t}
$$

Hence the name: **RMS**Prop (Root Mean Square Propagation)
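
As a quick check of that reading, the sketch below feeds a constant gradient through the running average and compares $\sqrt{E[g^2]_t}$ with the true RMS. The constant gradient 0.5 and the zero starting value are assumptions made for illustration; `running_rms` is a hypothetical helper.

```python
import math

def running_rms(grads, beta=0.9):
    """Return sqrt(E[g^2]_t) after each gradient, starting from E[g^2]_0 = 0."""
    avg_sq = 0.0
    history = []
    for g in grads:
        avg_sq = beta * avg_sq + (1 - beta) * g ** 2
        history.append(math.sqrt(avg_sq))
    return history

grads = [0.5] * 200  # assumed toy input: the same gradient every step
true_rms = math.sqrt(sum(g ** 2 for g in grads) / len(grads))  # 0.5 for this input

rms = running_rms(grads)
print(f"true RMS:           {true_rms:.4f}")
print(f"running RMS, t=1:   {rms[0]:.4f}")
print(f"running RMS, t=200: {rms[-1]:.4f}")
```

The running value starts low, at $\sqrt{1-\beta}\,|g| \approx 0.158$, because the average is initialised at zero. It then approaches the true RMS of 0.5 within a few dozen steps.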

---

## 4. Visualize: AdaGrad vs RMSProp

In [None]:
# Simulate gradient history
np.random.seed(42)
iterations = 200
gradients = np.random.randn(iterations) * 0.5 + 0.2

# AdaGrad accumulation: running sum of squared gradients
G_adagrad = np.cumsum(gradients**2)

# RMSProp accumulation
beta = 0.9
E_rmsprop = np.zeros(iterations)
for t in range(iterations):
    if t == 0:
        # Seed the average with the first squared gradient so the curve does not start near zero
        E_rmsprop[t] = gradients[t]**2
    else:
        E_rmsprop[t] = beta * E_rmsprop[t-1] + (1 - beta) * gradients[t]**2

# Compute effective learning rates
alpha = 0.01
epsilon = 1e-8
lr_adagrad = alpha / np.sqrt(G_adagrad + epsilon)
lr_rmsprop = alpha / np.sqrt(E_rmsprop + epsilon)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Accumulated values
axes[0].plot(G_adagrad, linewidth=2.5, label='AdaGrad (Sum)', color='#FF6B6B')
axes[0].plot(E_rmsprop, linewidth=2.5, label='RMSProp (Exp. Avg)', color='#4ECDC4')
axes[0].set_xlabel('Iteration', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Accumulated Squared Gradients', fontsize=12, fontweight='bold')
axes[0].set_title('Gradient Accumulation', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Plot 2: Effective learning rates
axes[1].plot(lr_adagrad, linewidth=2.5, label='AdaGrad (Vanishes)', color='#FF6B6B')
axes[1].plot(lr_rmsprop, linewidth=2.5, label='RMSProp (Stable)', color='#4ECDC4')
axes[1].set_xlabel('Iteration', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Effective Learning Rate', fontsize=12, fontweight='bold')
axes[1].set_title('Learning Rate Evolution', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("  • AdaGrad: Accumulation grows unbounded → LR vanishes")
print("  • RMSProp: Accumulation stabilizes → LR stays active")
print("  • RMSProp fixes the diminishing LR problem!")

## 5. Generate Dataset

In [None]:
def generate_spiral_data(n_samples=300, noise=0.1):
    """
    Generate spiral dataset for binary classification.
    
    Returns:
    - X: Features (n_x, m)
    - Y: Labels (1, m)
    """
    np.random.seed(42)
    m = n_samples
    
    # Create spiral
    theta = np.linspace(0, 4*np.pi, m//2)
    r = np.linspace(0.5, 2, m//2)
    
    # Class 0: spiral
    X_class0 = np.vstack([r * np.cos(theta), r * np.sin(theta)])
    Y_class0 = np.zeros((1, m//2))
    
    # Class 1: spiral (rotated)
    X_class1 = np.vstack([r * np.cos(theta + np.pi), r * np.sin(theta + np.pi)])
    Y_class1 = np.ones((1, m//2))
    
    # Combine
    X = np.hstack([X_class0, X_class1])
    Y = np.hstack([Y_class0, Y_class1])
    
    # Add noise
    X += np.random.randn(*X.shape) * noise
    
    # Shuffle
    # Use the actual column count: 2 * (m // 2) is smaller than m when n_samples is odd
    indices = np.random.permutation(X.shape[1])
    X = X[:, indices]
    Y = Y[:, indices]
    
    return X, Y

# Generate data
X, Y = generate_spiral_data(n_samples=300, noise=0.1)

print(f"Dataset shape: X={X.shape}, Y={Y.shape}")
print(f"Number of samples: {X.shape[1]}")
print(f"Number of features: {X.shape[0]}")

## 6. Neural Network with RMSProp

In [None]:
def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    """ReLU activation function."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU."""
    return (z > 0).astype(float)

print("✅ Activation functions defined!")

In [None]:
class RMSProp:
    """
    Neural network with RMSProp optimizer.
    
    Architecture: Input (2) → Hidden (10, ReLU) → Output (1, Sigmoid)
    """
    
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.001, 
                 beta=0.9, epsilon=1e-8, random_seed=42):
        """
        Initialize neural network with RMSProp.
        
        Parameters:
        - learning_rate: Learning rate (α)
        - beta: Decay rate for exponential average (typically 0.9 or 0.999)
        - epsilon: Small constant for numerical stability
        """
        np.random.seed(random_seed)
        
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        self.lr = learning_rate
        self.beta = beta
        self.epsilon = epsilon
        
        # Initialize parameters
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        
        # Initialize exponentially weighted averages of squared gradients
        self.E_dW1 = np.zeros_like(self.W1)
        self.E_db1 = np.zeros_like(self.b1)
        self.E_dW2 = np.zeros_like(self.W2)
        self.E_db2 = np.zeros_like(self.b2)
        
        # Training history
        self.losses = []
        self.accuracies = []
    
    def forward_propagation(self, X):
        """Forward propagation."""
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        
        cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
        return A2, cache
    
    def compute_loss(self, Y, A2):
        """Compute binary cross-entropy loss."""
        # The 1e-8 inside each log guards against log(0)
        loss = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
        return loss
    
    def backward_propagation(self, X, Y, cache):
        """Backward propagation."""
        m = X.shape[1]
        Z1, A1, Z2, A2 = cache['Z1'], cache['A1'], cache['Z2'], cache['A2']
        
        # Backprop
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        
        return dW1, db1, dW2, db2
    
    def update_parameters_rmsprop(self, dW1, db1, dW2, db2):
        """
        Update parameters using RMSProp.
        
        E[g^2]_t = β * E[g^2]_{t-1} + (1-β) * g_t^2
        θ_{t+1} = θ_t - (α / sqrt(E[g^2]_t + ε)) * g_t
        """
        # Update exponentially weighted averages of squared gradients
        self.E_dW1 = self.beta * self.E_dW1 + (1 - self.beta) * dW1**2
        self.E_db1 = self.beta * self.E_db1 + (1 - self.beta) * db1**2
        self.E_dW2 = self.beta * self.E_dW2 + (1 - self.beta) * dW2**2
        self.E_db2 = self.beta * self.E_db2 + (1 - self.beta) * db2**2
        
        # Update parameters with adaptive learning rates
        self.W1 -= (self.lr / np.sqrt(self.E_dW1 + self.epsilon)) * dW1
        self.b1 -= (self.lr / np.sqrt(self.E_db1 + self.epsilon)) * db1
        self.W2 -= (self.lr / np.sqrt(self.E_dW2 + self.epsilon)) * dW2
        self.b2 -= (self.lr / np.sqrt(self.E_db2 + self.epsilon)) * db2
    
    def compute_accuracy(self, X, Y):
        """Compute accuracy."""
        A2, _ = self.forward_propagation(X)
        predictions = (A2 > 0.5).astype(int)
        accuracy = np.mean(predictions == Y)
        return accuracy
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        """Train the network with RMSProp."""
        for epoch in range(epochs):
            # Forward propagation
            A2, cache = self.forward_propagation(X)
            
            # Compute loss
            loss = self.compute_loss(Y, A2)
            self.losses.append(loss)
            
            # Compute accuracy
            accuracy = self.compute_accuracy(X, Y)
            self.accuracies.append(accuracy)
            
            # Backward propagation
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            
            # Update parameters with RMSProp
            self.update_parameters_rmsprop(dW1, db1, dW2, db2)
            
            # Print progress
            if verbose and (epoch + 1) % 200 == 0:
                print(f"Epoch {epoch+1:4d}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")
        
        if verbose:
            print("\n✅ Training Complete!")
            print(f"   Final Loss: {self.losses[-1]:.4f}")
            print(f"   Final Accuracy: {self.accuracies[-1]:.4f}")
        
        return self

print("✅ RMSProp class defined!")

## 7. Comparison: RMSProp vs AdaGrad vs SGD

In [None]:
# We need AdaGrad and SGD for comparison
class AdaGrad:
    """AdaGrad for comparison."""
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.1, epsilon=1e-8, random_seed=42):
        np.random.seed(random_seed)
        self.lr = learning_rate
        self.epsilon = epsilon
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        self.G_dW1 = np.zeros_like(self.W1)
        self.G_db1 = np.zeros_like(self.b1)
        self.G_dW2 = np.zeros_like(self.W2)
        self.G_db2 = np.zeros_like(self.b2)
        self.losses = []
        self.accuracies = []
    
    def forward_propagation(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        return A2, {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    def compute_loss(self, Y, A2):
        return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
    
    def backward_propagation(self, X, Y, cache):
        m = X.shape[1]
        Z1, A1, A2 = cache['Z1'], cache['A1'], cache['A2']
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        return dW1, db1, dW2, db2
    
    def compute_accuracy(self, X, Y):
        A2, _ = self.forward_propagation(X)
        return np.mean((A2 > 0.5).astype(int) == Y)
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        for epoch in range(epochs):
            A2, cache = self.forward_propagation(X)
            self.losses.append(self.compute_loss(Y, A2))
            self.accuracies.append(self.compute_accuracy(X, Y))
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            self.G_dW1 += dW1**2
            self.G_db1 += db1**2
            self.G_dW2 += dW2**2
            self.G_db2 += db2**2
            self.W1 -= (self.lr / np.sqrt(self.G_dW1 + self.epsilon)) * dW1
            self.b1 -= (self.lr / np.sqrt(self.G_db1 + self.epsilon)) * db1
            self.W2 -= (self.lr / np.sqrt(self.G_dW2 + self.epsilon)) * dW2
            self.b2 -= (self.lr / np.sqrt(self.G_db2 + self.epsilon)) * db2
        return self

class VanillaSGD:
    """Vanilla SGD for comparison."""
    def __init__(self, n_x=2, n_h=10, n_y=1, learning_rate=0.01, random_seed=42):
        np.random.seed(random_seed)
        self.lr = learning_rate
        self.W1 = np.random.randn(n_h, n_x) * 0.1
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(n_y, n_h) * 0.1
        self.b2 = np.zeros((n_y, 1))
        self.losses = []
        self.accuracies = []
    
    def forward_propagation(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = sigmoid(Z2)
        return A2, {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    def compute_loss(self, Y, A2):
        return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
    
    def backward_propagation(self, X, Y, cache):
        m = X.shape[1]
        Z1, A1, A2 = cache['Z1'], cache['A1'], cache['A2']
        dZ2 = A2 - Y
        dW2 = (1/m) * (dZ2 @ A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        dA1 = self.W2.T @ dZ2
        dZ1 = dA1 * relu_derivative(Z1)
        dW1 = (1/m) * (dZ1 @ X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        return dW1, db1, dW2, db2
    
    def compute_accuracy(self, X, Y):
        A2, _ = self.forward_propagation(X)
        return np.mean((A2 > 0.5).astype(int) == Y)
    
    def fit(self, X, Y, epochs=1000, verbose=False):
        for epoch in range(epochs):
            A2, cache = self.forward_propagation(X)
            self.losses.append(self.compute_loss(Y, A2))
            self.accuracies.append(self.compute_accuracy(X, Y))
            dW1, db1, dW2, db2 = self.backward_propagation(X, Y, cache)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
        return self

print("✅ Comparison classes defined!")

In [None]:
# Training parameters
epochs = 2000

print("🔬 Training Models...\n")

# 1. Vanilla SGD
print("1️⃣  Training Vanilla SGD...")
model_sgd = VanillaSGD(learning_rate=0.01, random_seed=42)
model_sgd.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_sgd.losses[-1]:.4f}")

# 2. AdaGrad
print("\n2️⃣  Training AdaGrad...")
model_adagrad = AdaGrad(learning_rate=0.1, random_seed=42)
model_adagrad.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_adagrad.losses[-1]:.4f}")

# 3. RMSProp
print("\n3️⃣  Training RMSProp...")
model_rmsprop = RMSProp(learning_rate=0.001, beta=0.9, random_seed=42)
model_rmsprop.fit(X, Y, epochs=epochs)
print(f"   Final Loss: {model_rmsprop.losses[-1]:.4f}")

print("\n✅ All experiments complete!")

## 8. Visualize Results

In [None]:
# Plot loss curves
plt.figure(figsize=(16, 10))

plt.plot(model_sgd.losses, linewidth=2.5, label='Vanilla SGD', 
        color='#FF6B6B', alpha=0.8)
plt.plot(model_adagrad.losses, linewidth=2.5, label='AdaGrad', 
        color='#4ECDC4', alpha=0.8)
plt.plot(model_rmsprop.losses, linewidth=2.5, label='RMSProp', 
        color='#95E1D3', alpha=0.8)

plt.xlabel('Epoch', fontsize=13, fontweight='bold')
plt.ylabel('Loss', fontsize=13, fontweight='bold')
plt.title('Loss Curves: RMSProp vs AdaGrad vs SGD', fontsize=15, fontweight='bold')
plt.legend(fontsize=12, loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("  • RMSProp: Smooth, consistent convergence")
print("  • RMSProp: Doesn't suffer from diminishing LR")
print("  • RMSProp: Better than AdaGrad for long training")
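
Eyeballing loss curves can be misleading, particularly since the three models above are trained with different learning rates. One way to make the comparison more concrete is to count how many epochs each optimizer needs to reach a given loss. The helper below is a hypothetical sketch, and the loss values in the example are made up for illustration; they are not results from this notebook.

```python
def epochs_to_reach(losses, threshold):
    """Return the first 1-based epoch whose loss is <= threshold, or None if never reached."""
    for epoch, loss in enumerate(losses, start=1):
        if loss <= threshold:
            return epoch
    return None

# Made-up loss history, for illustration only
example_losses = [0.69, 0.61, 0.52, 0.44, 0.39, 0.35]
print(epochs_to_reach(example_losses, threshold=0.45))  # first epoch at or below 0.45
print(epochs_to_reach(example_losses, threshold=0.10))  # never reached in this history
```

With the trained models, a call such as `epochs_to_reach(model_rmsprop.losses, 0.5)` (the threshold 0.5 is an arbitrary choice) would give one number per optimizer to compare.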

## 9. Summary and Key Takeaways

### What We Learned

✅ **The Problem**
- AdaGrad's learning rate vanishes over time
- Caused by monotonically increasing accumulation
- Problematic for deep networks and long training

✅ **RMSProp Solution**
- Use an exponentially weighted average instead of a sum
- Recent gradients have more weight
- Old gradients fade away
- Learning rate stays active!

✅ **Mathematical Foundation**
- Exponential average: $E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$
- Adaptive update: $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} g_t$
- Typical β: 0.9 or 0.999

✅ **Advantages**
- Fixes AdaGrad's diminishing LR problem
- Works well for deep networks
- Stable and reliable
- Foundation for Adam optimizer

### When to Use RMSProp?

**Good For:**
- Deep neural networks
- Long training runs
- Non-convex optimization
- Recurrent neural networks (RNNs)

**Hyperparameters:**
- Learning rate (α): 0.001 (typical)
- Beta (β): 0.9 or 0.999
- Epsilon (ε): 1e-8
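
One way to build intuition for β is to ask how quickly an old squared gradient loses its influence. A gradient from $k$ steps ago carries weight proportional to $\beta^k$, so its weight halves every $\ln(0.5) / \ln(\beta)$ steps. The sketch below computes this half-life; `ema_half_life` is a hypothetical helper, not a library function, and 0.99 is included only as an intermediate value for comparison.

```python
import math

def ema_half_life(beta):
    """Number of steps after which a past gradient's weight in the average has halved."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(beta)

for beta in (0.9, 0.99, 0.999):
    print(f"beta = {beta}: half-life is about {ema_half_life(beta):.1f} steps")
```

So β = 0.9 reacts to a change in gradient scale within a handful of steps, while β = 0.999 averages over hundreds of steps and responds much more slowly.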

### Connection to Other Notebooks

This notebook builds on:
- **`7_5_adagrad_adaptive_learning_rates.ipynb`**: AdaGrad and its limitations
- **`7_4_sgd_with_momentum.ipynb`**: Exponentially weighted averages

### Next Steps

🚀 **Coming Next:**
- **7.7 Adam**: Combines momentum + RMSProp (most popular optimizer!)

---

**🎓 Congratulations!** You now understand RMSProp and how it fixes AdaGrad's problems!

**Key Insight:** RMSProp uses exponentially weighted averages to keep the learning rate active throughout training!