# 🎯 Advanced Neural Networks: Complete Professional Guide

## 📚 What You'll Master
1. **Optimizers** - Adam, RMSprop, SGD+Momentum (complete derivations)
2. **Regularization** - Dropout, Batch Normalization, Weight Decay
3. **Activation Functions** - ReLU family, SELU, Swish
4. **Real-World** - ImageNet, BERT, GPT training techniques
5. **Exercises** - Implement optimizers from scratch
6. **Competition** - CIFAR-10 classification
7. **Interviews** - 7 critical questions

---


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
print('✅ Advanced NN ready!')


---
# 📖 Chapter 1: Advanced Optimizers

## 1.1 SGD with Momentum

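**Problem**: Plain SGD oscillates across the steep walls of a narrow valley while making slow progress along its floor

**Solution**: Add a momentum term $v_t$ that accumulates a decaying sum of past gradients ($\gamma$ is the momentum coefficient, $\eta$ the learning rate)

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_{t-1})$$
$$\theta_t = \theta_{t-1} - v_t$$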

**Intuition**: "Ball rolling downhill" accumulates velocity
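
The two update equations map almost line for line onto code. The following is a minimal sketch, not a reference implementation: the class name `SGDMomentum`, the default values, and the toy quadratic used to exercise it are all assumptions made for illustration.

```python
import numpy as np


class SGDMomentum:
    """SGD with momentum: v_t = gamma * v_{t-1} + lr * grad, theta_t = theta_{t-1} - v_t."""

    def __init__(self, lr=0.01, gamma=0.9):
        self.lr = lr
        self.gamma = gamma
        self.v = None  # velocity buffer, created on the first update

    def update(self, params, grads):
        if self.v is None:
            self.v = np.zeros_like(params)
        # Accumulate velocity, then step against it
        self.v = self.gamma * self.v + self.lr * grads
        return params - self.v


# Toy "narrow valley" (assumed test function): J(theta) = 0.5 * (100 * x^2 + y^2)
scales = np.array([100.0, 1.0])
theta = np.array([1.0, 1.0])
opt = SGDMomentum(lr=0.01, gamma=0.9)
for _ in range(200):
    theta = opt.update(theta, scales * theta)  # gradient of J is scales * theta

print(theta)  # both coordinates should end up close to 0
```

The steep direction (curvature 100) and the flat direction (curvature 1) share one learning rate here, which is the situation where the velocity term helps most.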

## 1.2 RMSprop (Root Mean Square Propagation)

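**Problem**: A single global learning rate is applied to every parameter, even when gradient magnitudes differ by orders of magnitude

**Solution**: Adapt the learning rate per parameter by dividing each step by a running root mean square of that parameter's gradients $g_t$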

$$E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta)g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}g_t$$
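
A single hand-computed step shows what "per-parameter learning rate" means in practice. This sketch assumes illustrative values for $\eta$, $\beta$, and $\epsilon$ and a made-up gradient vector; it applies the two equations once rather than building a full optimizer.

```python
import numpy as np

lr, beta, eps = 0.01, 0.9, 1e-8   # illustrative values, not taken from the text

g = np.array([100.0, 0.01])       # two parameters with very different gradient scales
avg_sq = np.zeros_like(g)         # running average E[g^2], starts at zero

# One RMSprop step, written as in the two equations above
avg_sq = beta * avg_sq + (1 - beta) * g ** 2
step = lr / np.sqrt(avg_sq + eps) * g

print(step)  # the two step sizes come out nearly equal
```

Although the raw gradients differ by a factor of 10,000, both parameters move by roughly `lr / sqrt(1 - beta)` on this first step, because each gradient is divided by its own running magnitude.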

## 1.3 Adam (Adaptive Moment Estimation)

**Combines**: Momentum + RMSprop

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t$$

**Defaults**: $\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.001$, $\epsilon=10^{-8}$


In [None]:
class Adam:
    """Adam optimizer from scratch."""
    
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0
    
    def update(self, params, grads):
        """Update parameters using Adam."""
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        
        # Update biased second moment estimate  
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grads ** 2)
        
        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        
        # Update parameters in place; the caller's array is modified and also returned
        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return params

print('✅ Adam optimizer complete!')


---
# 🏭 Chapter 3: Real-World Training Techniques

### 1. **ImageNet Training** 🖼️
- **Model**: ResNet-50
- **Optimizer**: SGD with momentum (0.9)
- **LR Schedule**: Step decay every 30 epochs
- **Regularization**: Weight decay (1e-4)
- **Batch size**: 256
- **Training time**: Roughly 4 days on 8 GPUs (varies widely with GPU generation and input pipeline)
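
The step-decay schedule in this recipe can be sketched as a one-line function. Only the 30-epoch interval comes from the list above; the function name `step_decay_lr`, the base learning rate of 0.1, and the drop factor of 0.1 are assumed values for illustration.

```python
def step_decay_lr(epoch, base_lr=0.1, drop_factor=0.1, epochs_per_drop=30):
    """Learning rate for a given epoch under step decay.

    base_lr and drop_factor are assumed values for illustration;
    only the 30-epoch interval comes from the recipe.
    """
    return base_lr * drop_factor ** (epoch // epochs_per_drop)


for epoch in (0, 29, 30, 60, 89):
    print(epoch, step_decay_lr(epoch))
```

The integer division keeps the rate constant within each 30-epoch block and multiplies it by `drop_factor` at every boundary.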

### 2. **BERT Pretraining** 📝
- **Model**: BERT-Large, 340M parameters
- **Optimizer**: Adam (lr=1e-4)
- **Regularization**: Dropout (0.1)
- **Batch size**: 256 sequences
- **Hardware**: 64 TPU chips
- **Cost**: ~$7,000 per training run (third-party estimate at cloud TPU prices)

### 3. **GPT-3 Training** 🤖
- **Model**: 175B parameters
- **Optimizer**: Adam with gradient clipping
- **Batch size**: 3.2M tokens
- **Training**: ~300B tokens seen during training
- **Cost**: ~$12M for full training (third-party estimate, not an official figure)
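
Gradient clipping, listed in the recipe above, is commonly done on the global norm across all parameter tensors. The following is a minimal NumPy sketch under that assumption; the function name `clip_by_global_norm` and the threshold `max_norm=1.0` are illustrative choices, not details taken from the list.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The small constant avoids division by zero when all gradients are zero
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm


grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]  # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Scaling every tensor by the same factor preserves the direction of the overall update and only shortens it, which is why global-norm clipping is usually preferred over clipping each element independently.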

### 4. **Production ML (Uber)** 🚗
- **Problem**: ETA prediction
- **Optimizer**: Adam
- **Regularization**: Dropout + Early stopping
- **Serving**: <50ms latency
- **Scale**: Millions of predictions/sec


---
# 🎯 Chapter 4: Exercises

## Exercise 1: Implement RMSprop ⭐⭐
Build RMSprop optimizer from scratch

## Exercise 2: Learning Rate Schedules ⭐⭐⭐
Implement cosine annealing, step decay
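
As a starting point, one common form of cosine annealing is shown below. The symbols $\eta_{\max}$ (initial rate), $\eta_{\min}$ (final rate), and $T$ (total number of steps or epochs) are notation introduced here, not defined elsewhere in this guide.

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

At $t=0$ the cosine term is 1 and $\eta_t = \eta_{\max}$; at $t=T$ it is $-1$ and $\eta_t = \eta_{\min}$.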

## Exercise 3: Batch Normalization ⭐⭐⭐
Derive and implement batch norm layer
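
The forward pass in training mode is a reasonable place to start the derivation. The notation below is an assumption for this exercise: $N$ is the mini-batch size, $x_i$ is one feature of example $i$, and the learned scale $\gamma$ and shift $\beta$ are unrelated to the momentum and decay coefficients of Chapter 1.

$$\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

The backward pass follows from applying the chain rule through $\hat{x}_i$, $\sigma_B^2$, and $\mu_B$ in that order.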

## Exercise 4: Gradient Clipping ⭐
Prevent exploding gradients


---
# 💡 Chapter 6: Interview Questions

### Q1: Adam vs SGD - when to use?
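- **Adam**: A good default in most cases; converges quickly with little tuning
- **SGD** (usually with momentum): Simpler and sometimes generalizes better, but typically needs a tuned learning-rate schedule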

### Q2: Why bias correction in Adam?
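$m_t$ and $v_t$ start at zero, so in early iterations both moment estimates are biased toward zero. Dividing by $1-\beta_1^t$ and $1-\beta_2^t$ removes that bias; the correction fades as $t$ grows.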
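
The effect is easy to see with a constant gradient. This is a small numeric sketch, assuming a made-up gradient value of 2.0 and tracking only the first moment.

```python
beta1 = 0.9
g = 2.0   # pretend the gradient is constant at 2.0 (an assumption for illustration)
m = 0.0   # first moment starts at zero, as in Adam

for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g        # biased estimate, pulled toward 0
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))
```

The raw estimate `m` starts at 0.2 and creeps upward, while the corrected `m_hat` recovers the true gradient value from the first step.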

### Q3: Batch Normalization benefits?
- Faster convergence
- Higher learning rates
- Acts as regularizer
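- Reduces internal covariate shift (the original paper's motivation; later work attributes much of the benefit to a smoother loss landscape)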

### Q4: Dropout rate selection?
Typical starting points, expressed as the probability of dropping a unit:
- **Hidden layers**: 0.5
- **Input layer**: 0.1-0.2
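
These rates are the probability of zeroing a unit. A minimal sketch of inverted dropout, assuming a hypothetical helper named `dropout_forward` and an all-ones activation matrix as test input, could look like this:

```python
import numpy as np


def dropout_forward(x, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return x  # dropout is disabled at inference time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob  # rescaling keeps the expected activation unchanged


rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))
out = dropout_forward(hidden, drop_prob=0.5, rng=rng)
print((out == 0).mean(), out.mean())  # fraction dropped, and mean activation
```

Because the surviving units are divided by `keep_prob` during training, no rescaling is needed at inference time.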

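### Q5: BatchNorm before or after the activation?
- **Before**: Placement in the original paper (normalize the linear output, then apply the nonlinearity); still the most common choice
- **After**: Also used in practice; reported results are mixed, so treat the placement as a hyperparameter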

### Q6: Learning rate warmup?
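Gradually increase the LR from a small value to its target over the first training steps → stabilizes training while early gradients and optimizer statistics are still noisy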
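
A linear ramp is the simplest version. The sketch below assumes a hypothetical `warmup_lr` helper with illustrative values for the peak rate and the number of warmup steps, and holds the rate constant afterwards.

```python
def warmup_lr(step, peak_lr=1e-3, warmup_steps=1000):
    """Linear warmup from near 0 to peak_lr, then hold constant.

    peak_lr and warmup_steps are illustrative values, not recommendations.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step))
```

In practice the constant phase is usually replaced by a decay schedule, so warmup only shapes the first part of the curve.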

### Q7: Gradient explosion solutions?
- Gradient clipping
- BatchNorm
- Residual connections
- Proper weight initialization


---
# 📊 Summary

## Key Takeaways
- ✅ **Adam**: Default choice for most problems
- ✅ **BatchNorm**: Accelerates training significantly
- ✅ **Dropout**: Prevents overfitting
- ✅ **LR schedules**: Critical for final performance
- ⚠️ **No one-size-fits-all**: Experiment!
- ⚠️ **Hyperparameters matter**: Tune them with grid or random search

## Optimizer Comparison

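| Optimizer | Per-step speed | Extra state per parameter | Best For |
|-----------|----------------|---------------------------|----------|
| SGD | Fast | None | Simple problems |
| Momentum | Fast | 1 buffer (velocity) | Narrow valleys |
| RMSprop | Medium | 1 buffer (squared-gradient average) | RNNs |
| Adam | Medium | 2 buffers (first and second moments) | Most problems |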

---

## Next: Transformers and attention mechanisms
