## 5. Learning Rate

### Definition
**Learning Rate** is a hyperparameter that controls **how much model parameters change** in response to the calculated error during training. It determines the **step size** in the optimization process.

### Mathematical Foundation

In gradient descent optimization, the learning rate (typically denoted as η or α) is part of the weight update formula:

```
w_new = w_old - η × ∇L(w)

Where:
  w_old = current weight
  η = learning rate (step size)
  ∇L(w) = gradient of loss function (points uphill; subtracting it steps in the direction of steepest descent)
  w_new = updated weight
```
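
The update rule can be traced by hand on a toy problem. The sketch below assumes a one-dimensional loss L(w) = (w - 3)², chosen only for illustration; the function names are hypothetical and not part of any library.

```python
def loss(w):
    """Toy loss L(w) = (w - 3)^2, minimized at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss: dL/dw = 2 (w - 3)."""
    return 2.0 * (w - 3.0)

def gradient_descent_step(w_old, lr):
    """Apply w_new = w_old - lr * grad(w_old)."""
    return w_old - lr * grad(w_old)

w = 0.0
lr = 0.1

# One step by hand: gradient at w=0 is -6, so w_new = 0 - 0.1 * (-6) = 0.6
w_after_one_step = gradient_descent_step(w, lr)

# Repeating the same rule moves w toward the minimum at 3
for _ in range(100):
    w = gradient_descent_step(w, lr)

print(f"After 1 step: w = {w_after_one_step:.2f}")
print(f"After 100 steps: w = {w:.6f}, loss = {loss(w):.2e}")
```

Because the gradient is negative to the left of the minimum, subtracting it moves `w` to the right, toward lower loss.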

### Intuitive Explanation

Imagine hiking down a mountain trying to reach the valley (optimal solution):
- **High Learning Rate:** Take big steps down the mountain
  - Fast descent but might overshoot the valley
  - Risk of missing the optimal point
  - May diverge (go back up the mountain)
- **Low Learning Rate:** Take tiny steps down the mountain
  - Safer descent, likely to reach the valley eventually
  - Needs many more steps, so convergence is very slow
  - May run out of iteration budget before arriving
- **Optimal Learning Rate:** Right-sized steps
  - Efficiently reaches valley
  - Doesn't overshoot
  - Best training speed
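
The three regimes in the analogy can be reproduced numerically. This is a minimal sketch assuming the loss L(w) = w², whose gradient is 2w, so each update multiplies w by (1 - 2η). The specific rates 0.01, 0.4, and 1.1 are illustrative choices, not recommendations.

```python
def run_gradient_descent(lr, steps=50, w_start=1.0):
    """Minimize L(w) = w^2 (gradient 2w) and return the path of w values."""
    path = [w_start]
    w = w_start
    for _ in range(steps):
        w = w - lr * 2.0 * w  # equivalent to w * (1 - 2 * lr)
        path.append(w)
    return path

for label, lr in [("low", 0.01), ("moderate", 0.4), ("high", 1.1)]:
    path = run_gradient_descent(lr)
    factor = 1 - 2 * lr  # per-step multiplier on w
    print(f"{label:9s} lr={lr:<5} factor={factor:+.2f} "
          f"|w| after 50 steps = {abs(path[-1]):.3e}")
```

With a factor of magnitude below 1 the iterates shrink toward the minimum; with a negative factor of magnitude above 1 they flip sign on every step and grow, which corresponds to overshooting the valley and climbing the other side.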

### Python Example: Learning Rate Impact


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression

# Generate simple regression data
X, y = make_regression(n_samples=500, n_features=1, noise=20, random_state=42)

# Test different learning rates
learning_rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
losses = {lr: [] for lr in learning_rates}

for lr in learning_rates:
    model = SGDRegressor(
        learning_rate='constant',
        eta0=lr,  # Constant step size for every update
        random_state=42
    )
    
    # partial_fit runs one pass over the data and keeps the weights;
    # calling fit() in a loop would retrain from scratch every time
    # and record the same loss 100 times
    for epoch in range(100):
        model.partial_fit(X, y)
        current_pred = model.predict(X)
        mse_loss = np.mean((current_pred - y) ** 2)
        losses[lr].append(mse_loss)

# Visualize the effect
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for idx, lr in enumerate(learning_rates):
    ax = axes[idx // 3, idx % 3]
    ax.plot(losses[lr], marker='o', linewidth=2)
    ax.set_title(f'Learning Rate = {lr}')
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Loss (MSE)')
    ax.grid(True, alpha=0.3)
    
    # Annotate each panel with the final loss
    final_loss = losses[lr][-1]
    ax.text(0.5, 0.95, f'Final Loss: {final_loss:.4f}',
            transform=ax.transAxes, ha='center', va='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Hide the unused sixth panel (5 learning rates on a 2x3 grid)
axes[1, 2].set_visible(False)

plt.tight_layout()
plt.savefig('learning_rate_impact.png', dpi=150)
plt.show()

print("Analysis of Learning Rates:")
for lr in learning_rates:
    min_loss = min(losses[lr])
    final_loss = losses[lr][-1]
    # Roughly converged if the final loss is within 5% of the best loss seen
    converged = final_loss <= min_loss * 1.05
    print(f"η={lr:7.4f}: Min Loss={min_loss:.4f}, Final Loss={final_loss:.4f}, "
          f"Converged={converged}")


### Effects of Different Learning Rates

#### Too Low Learning Rate (η too small) 🐌
```
Problem: Convergence extremely slow
```

**Characteristics:**
- Takes thousands of iterations to converge
- Model barely improves with each step
- Training time becomes impractical
- May stall in flat regions or shallow local minima

**Code Example:**


In [None]:
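from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np

# Synthetic data with learnable structure
# (purely random labels would keep accuracy near 0.5 at any learning rate)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Too low learning rate
model = SGDClassifier(eta0=0.00001, learning_rate='constant', max_iter=1000, random_state=42)
model.fit(X, y)
accuracy_low_lr = model.score(X, y)
print(f"Accuracy with LR=0.00001: {accuracy_low_lr:.4f}")
print(f"Epochs run before stopping: {model.n_iter_}")
# Each update barely moves the weights, so training may stop far from the optimum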


#### Too High Learning Rate (η too large) 🚀
```
Problem: Model diverges or oscillates around the optimum
```

**Characteristics:**
- Loss increases instead of decreasing
- Parameters become NaN or explode in magnitude
- Model diverges away from optimum
- Updates are too drastic

**Code Example:**


In [None]:
# Too high learning rate
model = SGDClassifier(eta0=10.0, learning_rate='constant', max_iter=100, random_state=42)
model.fit(X, y)
# Each update can overshoot, so the loss may oscillate or grow between iterations
# Predictions might collapse to all 0 or all 1

# Check weight magnitude
print(f"Model weights (might be huge): {model.coef_}")


#### Just Right Learning Rate (optimal η) ✅
```
Fast convergence to near-optimal solution
```

**Characteristics:**
- Loss smoothly decreases
- Converges within a reasonable number of iterations for the problem size
- Stable training process
- Good final model performance

### Factors Affecting Learning Rate

#### 1. **Batch Size**
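Larger batches give less noisy gradient estimates, so they often tolerate higher learning rates. `SGDClassifier` updates on one sample at a time and has no `batch_size` parameter, so this example uses `MLPClassifier` with the `sgd` solver:


In [None]:
from sklearn.neural_network import MLPClassifier

# Smaller batch size → noisier gradients, smaller learning rate
model_small_batch = MLPClassifier(solver='sgd', learning_rate_init=0.01,
                                  batch_size=32, random_state=42)
model_small_batch.fit(X, y)

# Larger batch size → smoother gradients, can often use a larger learning rate
model_large_batch = MLPClassifier(solver='sgd', learning_rate_init=0.1,
                                  batch_size=256, random_state=42)
model_large_batch.fit(X, y)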
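
One way to see why batch size and learning rate interact is to measure how much the mini-batch gradient fluctuates. The sketch below is a standard-library illustration under stated assumptions: a synthetic one-feature dataset, a model `y_hat = w * x`, and hypothetical helper names. It is not tied to any scikit-learn API.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic 1-D data: y = 2x + noise (illustrative values)
data = []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

def minibatch_gradient(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_std(batch_size, w=0.0, trials=300):
    """Standard deviation of the mini-batch gradient across random batches."""
    grads = [minibatch_gradient(w, rng.sample(data, batch_size))
             for _ in range(trials)]
    return statistics.stdev(grads)

std_small = gradient_std(32)
std_large = gradient_std(256)
print(f"Gradient std with batch size 32:  {std_small:.3f}")
print(f"Gradient std with batch size 256: {std_large:.3f}")
```

Averaging over more samples reduces the spread of the gradient estimate, which is the usual argument for why a larger batch can take a larger step without the updates becoming erratic.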


#### 2. **Data Normalization**
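Feature scaling changes which learning rates are stable, because gradient magnitudes grow with feature magnitudes:


In [None]:
from sklearn.preprocessing import StandardScaler

# Unnormalized data
# learning_rate='constant' is required for eta0 to be used as the step size
model1 = SGDClassifier(eta0=0.01, learning_rate='constant', random_state=42)
model1.fit(X, y)  # Might be unstable or slow if features have large or mixed ranges

# Normalized data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
model2 = SGDClassifier(eta0=0.01, learning_rate='constant', random_state=42)
model2.fit(X_normalized, y)  # Often trains more stably with the same LR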
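
The effect of scaling is easiest to see with full-batch gradient descent on a single feature. This is a sketch under assumptions: the four data points and the helper `gradient_descent` are made up for illustration, and the features are already zero-mean so standardizing reduces to dividing by the population standard deviation (the same convention `StandardScaler` uses).

```python
def gradient_descent(xs, ys, lr, steps=30):
    """Full-batch gradient descent for y_hat = w * x under mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Raw features on a large scale; the true slope on raw features is 0.03
x_raw = [-200.0, -100.0, 100.0, 200.0]
ys = [0.03 * x for x in x_raw]

# Standardize: the mean is already 0, so divide by the population std
std = (sum(x * x for x in x_raw) / len(x_raw)) ** 0.5
x_scaled = [x / std for x in x_raw]

w_raw = gradient_descent(x_raw, ys, lr=0.1)
w_scaled = gradient_descent(x_scaled, ys, lr=0.1)

print(f"Raw features, lr=0.1:    w = {w_raw:.3e} (target 0.03)")
print(f"Scaled features, lr=0.1: w = {w_scaled:.4f} (target {0.03 * std:.4f})")
```

For this loss the per-step multiplier on the error is 1 - 2η·mean(x²). On the raw features mean(x²) is 25000, so η = 0.1 gives a multiplier far outside (-1, 1) and the weight explodes; after standardizing, mean(x²) is 1 and the same η contracts the error by 0.8 per step.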


#### 3. **Optimization Algorithm**
Optimizers differ in how sensitive they are to the learning rate:


In [None]:
from sklearn.neural_network import MLPClassifier

# SGD: Simple, learning rate matters greatly
model_sgd = SGDClassifier(eta0=0.1, learning_rate='constant', random_state=42)
model_sgd.fit(X, y)

# Adam: Adaptive per-parameter step sizes, less sensitive to the initial LR
model_adam = MLPClassifier(learning_rate_init=0.001, solver='adam', random_state=42)
model_adam.fit(X, y)
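
To make the difference concrete, the sketch below compares the size of a plain SGD step with the first Adam step for gradients of very different magnitudes. It assumes the standard Adam update with its usual default constants and zero-initialized moment estimates; the function names are hypothetical and this is not scikit-learn's internal implementation.

```python
import math

def sgd_step(grad, lr):
    """Plain SGD: the step is proportional to the gradient."""
    return -lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update (t = 1), starting from zero moment estimates."""
    m = (1 - beta1) * grad           # first-moment estimate
    v = (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1)          # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

lr = 0.001
for g in [0.01, 1.0, 1000.0]:
    print(f"grad={g:8.2f}  SGD step={sgd_step(g, lr):+.6f}  "
          f"Adam step={adam_first_step(g, lr):+.6f}")
```

The SGD step scales with the gradient, so a rate that suits small gradients can be far too large for big ones. Adam divides by a running estimate of the gradient magnitude, so its first step is close to `lr` in size regardless of gradient scale, which is one reason it tends to be less sensitive to the initial learning rate.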


### Learning Rate Schedules

Instead of keeping the learning rate fixed, a schedule reduces it over time:


In [None]:
import matplotlib.pyplot as plt

class LearningRateScheduler:
    def __init__(self, initial_lr=0.1, decay=0.9, schedule_type='exponential',
                 max_iterations=100, step_size=20, step_factor=0.5):
        self.initial_lr = initial_lr
        self.decay = decay
        self.schedule_type = schedule_type
        self.max_iterations = max_iterations  # Horizon for linear decay
        self.step_size = step_size            # Iterations between step drops
        self.step_factor = step_factor        # Multiplier applied at each drop
        self.iteration = 0
    
    def get_lr(self):
        if self.schedule_type == 'constant':
            return self.initial_lr
        
        elif self.schedule_type == 'exponential':
            # Exponential decay: η = η₀ × decay^iteration
            return self.initial_lr * (self.decay ** self.iteration)
        
        elif self.schedule_type == 'linear':
            # Linear decay: η = η₀ × (1 - iteration/max_iterations), floored at 0
            return self.initial_lr * max(0.0, 1 - self.iteration / self.max_iterations)
        
        elif self.schedule_type == 'step':
            # Step decay: multiply by step_factor every step_size iterations
            drops = self.iteration // self.step_size
            return self.initial_lr * (self.step_factor ** drops)
        
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown schedule_type: {self.schedule_type!r}")
    
    def step(self):
        self.iteration += 1

# Visualize different schedules
scheduler_types = ['constant', 'exponential', 'linear', 'step']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for idx, schedule_type in enumerate(scheduler_types):
    scheduler = LearningRateScheduler(
        initial_lr=0.1,
        schedule_type=schedule_type
    )
    
    lrs = []
    for i in range(100):
        lrs.append(scheduler.get_lr())
        scheduler.step()
    
    ax = axes[idx // 2, idx % 2]
    ax.plot(lrs, linewidth=2, color='blue')
    ax.set_title(f'{schedule_type.capitalize()} Learning Rate Schedule')
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Learning Rate')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
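
A schedule matters most when gradients are noisy. The following sketch compares a constant rate with exponential decay on the toy loss L(w) = w² with Gaussian noise added to each gradient. The noise level, the decay factor of 0.995, and all function names are assumptions made for illustration.

```python
import random

def noisy_sgd(lr_at, steps=1000, seed=0):
    """Run SGD on L(w) = w^2 with Gaussian gradient noise; return the w path."""
    rng = random.Random(seed)
    w = 5.0
    path = []
    for t in range(steps):
        noisy_grad = 2.0 * w + rng.gauss(0.0, 1.0)
        w -= lr_at(t) * noisy_grad
        path.append(w)
    return path

def constant_lr(t):
    """Fixed learning rate of 0.1."""
    return 0.1

def exponential_lr(t):
    """Exponential decay: 0.1 * 0.995^t."""
    return 0.1 * (0.995 ** t)

def tail_error(path, last=100):
    """Mean distance from the optimum (w = 0) over the final iterations."""
    return sum(abs(w) for w in path[-last:]) / last

error_constant = tail_error(noisy_sgd(constant_lr))
error_decayed = tail_error(noisy_sgd(exponential_lr))
print(f"Constant LR:    mean |w| over last 100 steps = {error_constant:.4f}")
print(f"Exponential LR: mean |w| over last 100 steps = {error_decayed:.4f}")
```

Both runs see the same noise sequence. The constant rate keeps bouncing around the minimum at a distance set by the step size, while the decayed rate takes large early steps and then settles closer as the steps shrink. A decay that is too aggressive can freeze the weights before they arrive, so the decay factor is itself something to tune.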


### Practical Tips for Choosing Learning Rate

1. **Start with a default:** 0.01 is often a reasonable starting point
2. **Check the loss plot:** Loss should decrease smoothly, not oscillate or increase
3. **Scale your data:** Normalized data typically works with LR ~ 0.01-0.1
4. **Use schedules:** Starting high and decaying over time often works well
5. **Grid search:** Test values spaced by factors of 10, e.g. [0.001, 0.01, 0.1, 1.0]
6. **Monitor convergence:** If the loss oscillates or diverges, reduce the LR; if it falls too slowly, increase it
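
The loss-plot check in the tips above can also be automated roughly. The helper below is a hypothetical heuristic, not a standard API: its name, labels, and thresholds are assumptions that would need tuning for real training curves.

```python
import math

def diagnose_loss_curve(losses, tolerance=1e-3):
    """Label a loss history as 'diverging', 'oscillating', 'stalled', or 'decreasing'."""
    # Non-finite values or a final loss above the starting loss: reduce the LR
    if any(math.isnan(v) or math.isinf(v) for v in losses) or losses[-1] > losses[0]:
        return "diverging"
    # Many step-to-step increases suggest the updates keep overshooting
    increases = sum(1 for prev, cur in zip(losses, losses[1:]) if cur > prev)
    if increases > len(losses) // 3:
        return "oscillating"
    # Almost no relative improvement: the LR may be too small
    if losses[0] - losses[-1] < tolerance * abs(losses[0]):
        return "stalled"
    return "decreasing"

print(diagnose_loss_curve([1.0, 0.5, 0.25, 0.12]))
print(diagnose_loss_curve([1.0, 2.0, 4.0, 8.0]))
print(diagnose_loss_curve([1.0, 0.6, 0.9, 0.5, 0.8, 0.4, 0.7]))
print(diagnose_loss_curve([1.0, 0.9999, 0.9998]))
```

A "diverging" or "oscillating" label points toward lowering the rate, while "stalled" points toward raising it.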


In [None]:
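# Practical learning rate selection
from sklearn.model_selection import train_test_split

# Hold out a validation set so the choice is not based on training accuracy
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

learning_rates_to_test = [0.001, 0.01, 0.1, 1.0]
results = {}

for lr in learning_rates_to_test:
    model = SGDClassifier(
        eta0=lr,
        learning_rate='constant',
        max_iter=50,
        random_state=42
    )
    
    model.fit(X_train, y_train)
    accuracy = model.score(X_val, y_val)
    results[lr] = accuracy
    
    print(f"Learning Rate {lr:.4f}: Validation Accuracy = {accuracy:.4f}")

# Choose the learning rate with the best validation accuracy
best_lr = max(results, key=results.get)
print(f"\nBest learning rate: {best_lr} with validation accuracy {results[best_lr]:.4f}")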


---
