# 🔥 FILE 2-B: Optimizer, Activation & Regularization

**PART 2 - INTERMEDIATE (CORE DEEP LEARNING)**

---

## 📋 Contents

✅ Advanced activation functions (ReLU, LeakyReLU, GELU, Swish)

✅ Optimizers in detail (SGD, Momentum, Adam, AdamW)

✅ Learning rate and its effect

✅ Learning rate scheduling

✅ Regularization techniques:
- Dropout
- Weight Decay (L2)
- Batch Normalization

✅ Comparisons and experiments

---

## ⏱️ Study Time: 2.5-3 hours

---

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

print(f"PyTorch version: {torch.__version__}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

---

# 1️⃣ Advanced Activation Functions

## Comparing Activations

In [None]:
# Visualize activations
x = torch.linspace(-3, 3, 200)

activations = {
    'ReLU': nn.ReLU()(x),
    'LeakyReLU': nn.LeakyReLU(0.1)(x),
    'ELU': nn.ELU()(x),
    'GELU': nn.GELU()(x),
    'Swish (SiLU)': nn.SiLU()(x),
    'Tanh': nn.Tanh()(x),
}

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for i, (name, y) in enumerate(activations.items()):
    axes[i].plot(x.numpy(), y.numpy(), linewidth=2)
    axes[i].set_title(name, fontsize=12, fontweight='bold')
    axes[i].grid(True, alpha=0.3)
    axes[i].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[i].axvline(x=0, color='k', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

print("""
ASSESSMENT:

ReLU: max(0, x)
  ✅ Fast, simple, the most widely used
  ❌ Dying ReLU problem (units stuck at zero output stop receiving gradient)
  
LeakyReLU: max(0.1x, x)
  ✅ Mitigates dying ReLU
  ❌ The negative slope (alpha, 0.1 here) needs tuning
  
GELU: x * Φ(x)  [Gaussian Error Linear Unit]
  ✅ Smooth, used in Transformers (BERT, GPT)
  ❌ Slower than ReLU
  
Swish/SiLU: x * sigmoid(x)
  ✅ Smooth, self-gated
  ✅ Works well in deep networks
  
RECOMMENDATIONS:
  - Default: ReLU
  - Transformers: GELU
  - Deep networks: Swish/SiLU
  - Dying ReLU problem: LeakyReLU
""")

---

# 2️⃣ Optimizers in Detail

## SGD vs Momentum vs Adam

In [None]:
print("=" * 70)
print("OPTIMIZER COMPARISON")
print("=" * 70)

# Prepare data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train = torch.FloatTensor(X[:800])
y_train = torch.LongTensor(y[:800])
X_val = torch.FloatTensor(X[800:])
y_val = torch.LongTensor(y[800:])

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Model
def create_model():
    return nn.Sequential(
        nn.Linear(20, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 2)
    )

# Test different optimizers
def train_with_optimizer(optimizer_fn, name, epochs=50):
    model = create_model().to(device)
    optimizer = optimizer_fn(model.parameters())
    criterion = nn.CrossEntropyLoss()
    
    losses = []
    for epoch in range(epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        # Validation loss
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val.to(device))
            val_loss = criterion(val_outputs, y_val.to(device))
            losses.append(val_loss.item())
    
    return losses

# Compare optimizers
optimizers = {
    'SGD': lambda p: optim.SGD(p, lr=0.01),
    'SGD+Momentum': lambda p: optim.SGD(p, lr=0.01, momentum=0.9),
    'Adam': lambda p: optim.Adam(p, lr=0.001),
    'AdamW': lambda p: optim.AdamW(p, lr=0.001, weight_decay=0.01),
}

results = {}
for name, opt_fn in optimizers.items():
    print(f"Training with {name}...")
    results[name] = train_with_optimizer(opt_fn, name)

# Plot comparison
plt.figure(figsize=(12, 6))
for name, losses in results.items():
    plt.plot(losses, label=name, linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Optimizer Comparison', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nFinal losses:")
for name, losses in results.items():
    print(f"  {name:15s}: {losses[-1]:.4f}")
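
One caveat about the comparison above: each call to `create_model()` draws new random initial weights, so part of the gap between curves can come from initialization rather than from the optimizer. A possible way to remove that source of noise, sketched here under the assumption that the same architecture is used, is to save one initial `state_dict` and load it into every model. The helper name `fresh_model` is hypothetical.

```python
import copy
import torch
import torch.nn as nn

def create_model():
    # Same architecture as the comparison cell above
    return nn.Sequential(
        nn.Linear(20, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 2),
    )

torch.manual_seed(42)
initial_state = copy.deepcopy(create_model().state_dict())

def fresh_model():
    """Return a new model that starts from the shared initial weights."""
    model = create_model()
    model.load_state_dict(initial_state)
    return model

model_a, model_b = fresh_model(), fresh_model()
same_start = all(
    torch.equal(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(f"Identical starting weights: {same_start}")
```
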

## Optimizer Details

In [None]:
print("""
======================================================================
OPTIMIZER DETAILS
======================================================================

1. SGD (Stochastic Gradient Descent)
   Formula: θ = θ - lr * ∇L
   
   ✅ Simple, stable
   ❌ Slow; can stall on plateaus or in poor local minima
   
2. SGD + Momentum
   Formula: 
     v = β*v + ∇L
     θ = θ - lr * v
   
   ✅ Faster; helps move past shallow local minima
   ✅ Typically β = 0.9
   
3. Adam (Adaptive Moment Estimation)
   Combines:
     - Momentum (first moment)
     - RMSprop (second moment)
   
   ✅ Adapts the effective LR for each parameter
   ✅ Converges quickly
   ✅ THE MOST WIDELY USED
   
   Hyperparameters:
     - lr: 0.001 (default)
     - betas: (0.9, 0.999)
     - eps: 1e-8
   
4. AdamW (Adam with decoupled weight decay)
   = Adam with weight decay applied directly to the weights,
     instead of being added to the gradient as an L2 penalty
   
   ✅ Avoids the interaction between the L2 penalty and Adam's adaptive scaling
   ✅ Better than Adam on many tasks
   ✅ Used in Transformers
   
   weight_decay: 0.01 (typical)

RECOMMENDATIONS:
  - Starting out: Adam (lr=0.001)
  - Fine-tuning: SGD + Momentum
  - Modern: AdamW
  - Transformers: AdamW + LR scheduling
""")

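The update rules listed above can be traced by hand on a toy problem. The sketch below is plain Python, not PyTorch's implementation, and it assumes the one-dimensional loss L(θ) = θ² so that every number can be checked on paper. The function names are illustrative. Note that Adam's first step has a size of roughly `lr` whatever the gradient magnitude, because the bias-corrected ratio m̂/√v̂ is close to ±1.

```python
import math

def grad(theta):
    """Gradient of the toy loss L(theta) = theta**2."""
    return 2.0 * theta

def sgd_step(theta, lr):
    return theta - lr * grad(theta)

def momentum_step(theta, v, lr, beta=0.9):
    v = beta * v + grad(theta)          # v = β*v + ∇L
    return theta - lr * v, v            # θ = θ - lr * v

def adam_step(theta, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g     # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g * g # second moment (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)        # bias correction for step t (1-based)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

theta0 = 1.0
theta_sgd = sgd_step(theta0, lr=0.1)
theta_mom, vel = momentum_step(theta0, 0.0, lr=0.1)
theta_mom2, vel = momentum_step(theta_mom, vel, lr=0.1)
theta_adam, m, v = adam_step(theta0, 0.0, 0.0, t=1)

print(f"SGD, 1 step:       {theta_sgd:.4f}")
print(f"Momentum, 2 steps: {theta_mom2:.4f}")
print(f"Adam, 1 step:      {theta_adam:.4f}")
```
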
---

# 3️⃣ Learning Rate

## LR Experiment

In [None]:
print("=" * 70)
print("LEARNING RATE EXPERIMENT")
print("=" * 70)

learning_rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
lr_results = {}

for lr in learning_rates:
    print(f"Testing LR={lr}...")
    model = create_model().to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    losses = []
    for epoch in range(50):
        model.train()
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val.to(device)), y_val.to(device))
            losses.append(val_loss.item())
    
    lr_results[lr] = losses

# Plot
plt.figure(figsize=(12, 6))
for lr, losses in lr_results.items():
    plt.plot(losses, label=f'LR={lr}', linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Learning Rate Comparison', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("""
CONCLUSIONS (typical outcome for this setup):
  - LR too small (0.0001): learns slowly
  - Moderate LR (0.001, 0.01): usually best
  - LR too large (0.1, 1.0): unstable, may not converge
""")

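The three regimes can be reproduced on paper with plain gradient descent on L(θ) = θ², where each step multiplies θ by (1 - 2·lr). This is only a sketch: it uses vanilla gradient descent rather than Adam, so the thresholds differ from the experiment above, and the learning rates are chosen purely for illustration.

```python
def final_loss(lr, steps=50, theta=1.0):
    """Run plain gradient descent on L(theta) = theta**2 and return the final loss."""
    for _ in range(steps):
        theta = theta - lr * 2.0 * theta   # gradient of theta**2 is 2*theta
    return theta ** 2

losses = {lr: final_loss(lr) for lr in (0.001, 0.1, 1.1)}
for lr, loss in losses.items():
    print(f"lr={lr:<6} final loss={loss:.3e}")
```

With `lr=0.001` the loss has barely moved after 50 steps, with `lr=0.1` it is close to zero, and with `lr=1.1` the factor |1 - 2·lr| exceeds 1, so the iterates grow instead of shrinking.
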
## Learning Rate Scheduler

In [None]:
print("=" * 70)
print("LEARNING RATE SCHEDULING")
print("=" * 70)

model = create_model().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Start high

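# StepLR: reduce the LR every N epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

lrs = []
for epoch in range(50):
    # Training code here (forward pass, loss.backward())...
    optimizer.step()  # No gradients exist in this demo, so this is a no-op; it keeps the optimizer -> scheduler call order PyTorch expects
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()  # Update LR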

plt.figure(figsize=(10, 5))
plt.plot(lrs, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('StepLR Scheduler', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

print("""
COMMON SCHEDULERS:

1. StepLR
   lr = lr * gamma every step_size epochs
   
2. ReduceLROnPlateau
   Reduces the LR when the validation loss stops improving
   (call scheduler.step(val_loss))
   
3. CosineAnnealingLR
   LR follows a cosine curve
   
4. OneCycleLR
   Increases, then decreases the LR (super-convergence)
   (stepped after every batch, not every epoch)

USAGE:
  scheduler = optim.lr_scheduler.StepLR(...)
  
  for epoch in range(epochs):
      train(...)  
      scheduler.step()  # ← After each epoch
""")

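For intuition, the StepLR rule and a cosine schedule can be written as closed-form functions of the epoch. The sketch below is plain Python under the assumption of a single decay from `base_lr` to `eta_min` over `t_max` epochs; the function names are illustrative and are not part of `torch.optim.lr_scheduler`.

```python
import math

def step_lr(base_lr, epoch, step_size=10, gamma=0.5):
    """LR after `epoch` scheduler steps: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """LR on a half cosine from base_lr (epoch 0) down to eta_min (epoch t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

step_schedule = [step_lr(0.01, e) for e in range(50)]
cosine_schedule = [cosine_lr(0.01, e, t_max=50) for e in range(51)]

print(f"StepLR at epochs 0, 10, 25: {step_schedule[0]}, {step_schedule[10]}, {step_schedule[25]}")
print(f"Cosine at epochs 0, 25, 50: {cosine_schedule[0]:.4f}, {cosine_schedule[25]:.4f}, {cosine_schedule[50]:.4f}")
```
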
---

# 4️⃣ Regularization

## Dropout

In [None]:
print("=" * 70)
print("DROPOUT")
print("=" * 70)

class ModelWithDropout(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(64, 32)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.fc3 = nn.Linear(32, 2)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)  # Drop neurons
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

model = ModelWithDropout(dropout_rate=0.5)

print("""
HOW DROPOUT WORKS:

Training mode:
  - Each activation is set to 0 with probability p
  - Surviving activations are scaled by 1/(1-p) (PyTorch's inverted dropout)
  - Prevents co-adaptation
  - Acts like an ensemble

Eval mode:
  - Uses ALL neurons
  - No scaling needed: the 1/(1-p) scaling was already applied during training

BENEFITS:
  ✅ Reduces overfitting
  ✅ Simple and effective
  ✅ Ensemble effect

DROPOUT RATE:
  - Typical: 0.2 - 0.5
  - Input layer: 0.1 - 0.2
  - Hidden layers: 0.5

⚠️ IMPORTANT:
  model.train()  # Enable dropout
  model.eval()   # Disable dropout
""")

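The train/eval difference is easy to observe directly. In PyTorch, `nn.Dropout` rescales the surviving activations by 1/(1-p) during training and is the identity in eval mode. The sketch below feeds a tensor of ones through a standalone dropout layer with p=0.5; the seed and tensor size are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)                               # each entry is either 0 or 1/(1-p) = 2
train_values = sorted(set(y_train.tolist()))

drop.eval()
y_eval = drop(x)                                # identity: nothing dropped, nothing scaled

print(f"Values seen in train mode: {train_values}")
print(f"Mean in train mode: {y_train.mean().item():.3f} (close to the input mean of 1)")
print(f"Eval output equals input: {torch.equal(y_eval, x)}")
```
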
## Weight Decay (L2 Regularization)

In [None]:
print("=" * 70)
print("WEIGHT DECAY")
print("=" * 70)

# Without weight decay
optimizer_no_wd = optim.Adam(model.parameters(), lr=0.001)

# With weight decay
optimizer_wd = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

print("""
WEIGHT DECAY:

Formula (L2 penalty): L_total = L + λ * ||w||²

Meaning:
  - Penalizes large weights
  - Prefers smaller weights
  - Helps prevent overfitting

USAGE:
  optimizer = optim.AdamW(
      model.parameters(),
      lr=0.001,
      weight_decay=0.01  # λ
  )

VALUES:
  - Typical: 0.0001 - 0.01
  - Larger = stronger regularization

Adam vs AdamW:
  - Adam: weight_decay is added to the gradient as an L2 penalty, so it is
    rescaled by the adaptive per-parameter step size
  - AdamW: decoupled weight decay, applied directly to the weights
  - Recommendation: use AdamW
""")

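To make the Adam versus AdamW difference concrete, the sketch below applies one optimizer step to a single weight of 2.0 whose loss gradient is set to zero, so the only thing acting on the weight is `weight_decay`. The setup is artificial and the helper name `one_step` is hypothetical. With `optim.Adam`, the decay term goes through the adaptive normalization and the weight moves by about `lr`. With `optim.AdamW`, the weight shrinks by the factor (1 - lr·λ).

```python
import torch
import torch.optim as optim

def one_step(optimizer_cls, **kwargs):
    """Apply one optimizer step to a weight of 2.0 whose loss gradient is zero."""
    w = torch.nn.Parameter(torch.tensor([2.0]))
    opt = optimizer_cls([w], **kwargs)
    w.grad = torch.zeros_like(w)    # pretend the data loss is flat at this point
    opt.step()
    return w.item()

w_adam_l2 = one_step(optim.Adam, lr=0.1, weight_decay=0.01)
w_adamw = one_step(optim.AdamW, lr=0.1, weight_decay=0.01)

print(f"Adam  (L2 in gradient): 2.0 -> {w_adam_l2:.4f}")
print(f"AdamW (decoupled):      2.0 -> {w_adamw:.4f}")
```
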
## Batch Normalization

In [None]:
print("=" * 70)
print("BATCH NORMALIZATION")
print("=" * 70)

class ModelWithBatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)  # BatchNorm after Linear
        self.fc2 = nn.Linear(64, 32)
        self.bn2 = nn.BatchNorm1d(32)
        self.fc3 = nn.Linear(32, 2)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)      # Normalize
        x = torch.relu(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = torch.relu(x)
        x = self.fc3(x)
        return x

print("""
BATCH NORMALIZATION:

Formula:
  x_norm = (x - mean) / sqrt(var + eps)
  y = γ * x_norm + β

BENEFITS:
  ✅ Speeds up training
  ✅ Allows higher LRs
  ✅ Reduces sensitivity to initialization
  ✅ Regularization effect

PLACEMENT:
  Linear → BatchNorm → Activation
  OR
  Linear → Activation → BatchNorm

TYPES:
  - BatchNorm1d: For FC layers
  - BatchNorm2d: For Conv layers
  - BatchNorm3d: For 3D Conv

⚠️ NOTE:
  - Behaves differently in train and eval mode:
    train uses the current batch statistics, eval uses the running averages
  - model.train() / model.eval()
""")

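The train/eval difference of BatchNorm can be seen on a single batch. In training mode the layer normalizes with the statistics of the current batch and nudges its running averages toward them (momentum 0.1 by default). In eval mode it uses the running averages instead. The sketch below assumes a fresh `nn.BatchNorm1d(3)` and synthetic inputs with mean around 10; all variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(64, 3) * 5 + 10          # features with mean ~10 and std ~5

with torch.no_grad():
    bn.train()
    y_train = bn(x)                      # normalized with this batch's mean and variance
    batch_mean = x.mean(dim=0)
    # running_mean starts at 0 and moves 10% of the way toward the batch mean
    expected_running_mean = 0.1 * batch_mean

    bn.eval()
    y_eval = bn(x)                       # normalized with the running statistics

print(f"Train-mode output mean per feature: {y_train.mean(dim=0)}")
print(f"Running mean after one batch:       {bn.running_mean}")
print(f"Eval-mode output mean per feature:  {y_eval.mean(dim=0)}")
```
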
---

# ✅ FILE 2-B Summary

## What You Learned

✅ **Activations**: ReLU, LeakyReLU, GELU, Swish

✅ **Optimizers**: SGD, Momentum, Adam, AdamW

✅ **Learning Rate**: Experiments, Scheduling

✅ **Regularization**: Dropout, Weight Decay, BatchNorm

## Quick Reference

```python
# Model with regularization
class RegularizedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)
        self.dropout1 = nn.Dropout(0.3)
        self.fc2 = nn.Linear(64, 2)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = torch.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        return x

# Optimizer with weight decay
model = RegularizedModel()
optimizer = optim.AdamW(model.parameters(), 
                        lr=0.001, 
                        weight_decay=0.01)

# LR Scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, 
                                      step_size=10, 
                                      gamma=0.5)
```
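
The pieces of the quick reference fit together in a training loop roughly as follows. This is a minimal sketch on random stand-in data (an assumption, not the dataset used earlier), with one full-batch update per epoch to keep it short. The call order is the point: `optimizer.step()` first, `scheduler.step()` once per epoch, and `model.eval()` before measuring loss so that Dropout and BatchNorm switch to inference behaviour.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
X = torch.randn(128, 20)                  # synthetic stand-in features
y = torch.randint(0, 2, (128,))           # synthetic stand-in labels

model = nn.Sequential(
    nn.Linear(20, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(20):
    model.train()                         # enable Dropout, use batch statistics
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                      # once per epoch, after optimizer.step()

model.eval()                              # disable Dropout, use running statistics
with torch.no_grad():
    eval_loss = criterion(model(X), y).item()

final_lr = optimizer.param_groups[0]['lr']
print(f"Eval loss: {eval_loss:.4f}, final LR: {final_lr}")
```
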

---