# 📘 PYTORCH PHASE 1 - FILE 3: REGULARIZATION

**Core Concepts:** Overfitting, Dropout, Weight Decay, Label Smoothing

**Objectives:**
- ✅ Understand overfitting & generalization
- ✅ Master the Dropout technique
- ✅ Understand Weight Decay (L2 regularization)
- ✅ Apply Label Smoothing
- ✅ Practical regularization strategies

**Duration:** 2 weeks

---

## 📚 Table of Contents

### 1. OVERFITTING & GENERALIZATION
1.1 Bias-Variance Tradeoff
1.2 Train vs Validation Gap
1.3 Implicit vs Explicit Regularization

### 2. DROPOUT
2.1 Dropout Intuition
2.2 Train vs Test Behavior
2.3 Dropout Rate Effects
2.4 When Dropout Hurts

### 3. WEIGHT DECAY
3.1 L2 Regularization
3.2 L2 vs Weight Decay in Adam
3.3 Practical Tuning

### 4. LABEL SMOOTHING
4.1 Over-confidence Problem
4.2 Effect on Calibration
4.3 Accuracy vs Robustness

### 5. PRACTICAL EXPERIMENTS
5.1 Same Model With/Without Regularization
5.2 Dropout Rate Sweep
5.3 Weight Decay Sweep
5.4 Label Smoothing Impact

---

In [None]:
# Import libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

print(f"✅ PyTorch version: {torch.__version__}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

---

# 1. OVERFITTING & GENERALIZATION

## 1.1 Bias-Variance Tradeoff

### Definitions

**Bias**: Error from wrong assumptions
- High bias = underfitting
- Model too simple

**Variance**: Error from sensitivity to training data
- High variance = overfitting
- Model too complex

### Total Error

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

### Tradeoff

| Model Complexity | Bias | Variance | Total Error |
|------------------|------|----------|-------------|
| Too simple | High | Low | High |
| Just right | Low | Low | Low |
| Too complex | Low | High | High |
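
The table can be made concrete with a small curve-fitting experiment. The sketch below is an illustration under assumptions that are not part of this notebook: a noisy sine target, 15 training points, and polynomial degrees 1, 3, and 9 standing in for "too simple", "just right", and "too complex".

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (a hypothetical toy target)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_tr, y_tr = make_data(15)    # small training set
x_va, y_va = make_data(200)   # larger held-out set

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    errors[degree] = (mse_tr, mse_va)
    print(f"degree={degree}: train MSE={mse_tr:.4f}, val MSE={mse_va:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree family contains the lower-degree ones. The validation error is the number to watch: it is what reveals whether the extra flexibility is fitting signal or noise.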

## 1.2 Train vs Validation Gap

### Overfitting Signs

- Train accuracy >> Val accuracy
- Train loss << Val loss
- Gap increases over time

### Solutions

1. **More data**
2. **Data augmentation**
3. **Regularization** (Dropout, Weight Decay)
4. **Early stopping**
5. **Simpler model**
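
Of the remedies listed, early stopping is the cheapest to add to an existing training loop. A minimal sketch follows, assuming a hypothetical helper `best_epoch_with_patience` with `patience` and `min_delta` parameters (none of these names appear elsewhere in this notebook). It scans a recorded validation-loss curve and reports where training would have stopped.

```python
def best_epoch_with_patience(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a recorded validation-loss curve.

    Training would stop once `patience` consecutive epochs fail to improve
    on the best loss seen so far by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stopped early

    return best_epoch, len(val_losses) - 1  # patience never ran out


# Hypothetical curve: improves for three epochs, then drifts upward
curve = [0.70, 0.65, 0.62, 0.63, 0.66, 0.70, 0.75]
best, stopped = best_epoch_with_patience(curve, patience=3)
print(f"best epoch={best}, stopped at epoch={stopped}")
```

In a real loop you would typically also keep a copy of `model.state_dict()` from the best epoch and restore it once training stops.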

In [None]:
# Demonstrate overfitting

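# Create a small dataset with random labels (pure noise): the model can only
# memorize the training set, so validation accuracy should stay near chance (50%)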
torch.manual_seed(42)
X_train = torch.randn(50, 10)
y_train = torch.randint(0, 2, (50,))
X_val = torch.randn(200, 10)
y_val = torch.randint(0, 2, (200,))

# Large model (prone to overfitting)
class LargeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 128)
        self.fc4 = nn.Linear(128, 2)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.fc4(x)

model = LargeModel().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training
train_losses, val_losses = [], []
train_accs, val_accs = [], []

for epoch in range(200):
    # Train
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train.to(device))
    loss = criterion(outputs, y_train.to(device))
    loss.backward()
    optimizer.step()
    
    train_losses.append(loss.item())
    train_acc = (outputs.argmax(1) == y_train.to(device)).float().mean().item()
    train_accs.append(train_acc)
    
    # Validation
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_val.to(device))
        val_loss = criterion(val_outputs, y_val.to(device))
        val_losses.append(val_loss.item())
        val_acc = (val_outputs.argmax(1) == y_val.to(device)).float().mean().item()
        val_accs.append(val_acc)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(train_losses, label='Train Loss', linewidth=2)
axes[0].plot(val_losses, label='Val Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Overfitting: Loss', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(train_accs, label='Train Acc', linewidth=2)
axes[1].plot(val_accs, label='Val Acc', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Overfitting: Accuracy', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("⚠️  OVERFITTING DETECTED:")
print(f"   Train Loss: {train_losses[-1]:.4f}, Val Loss: {val_losses[-1]:.4f}")
print(f"   Train Acc: {train_accs[-1]:.2%}, Val Acc: {val_accs[-1]:.2%}")
print(f"   Gap: {train_accs[-1] - val_accs[-1]:.2%}")

---

# 2. DROPOUT

## 2.1 Dropout Intuition

### What is Dropout?

During training, **randomly drop** (set to 0) each neuron's output independently with probability $p$, and rescale the surviving outputs:

$$\text{output} = \begin{cases}
0 & \text{with probability } p \\
\frac{x}{1-p} & \text{with probability } 1-p
\end{cases}$$

### Why Does It Work?

1. **Model averaging**: Training ensemble of subnetworks
2. **Prevents co-adaptation**: Neurons can't rely on specific others
3. **Adds noise**: Acts as regularization

### Train vs Test

- **Training**: Dropout active, scale outputs by $\frac{1}{1-p}$
- **Testing**: Dropout OFF, use all neurons
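
The two modes can be checked directly on a tensor of ones. This is a small sketch, assuming $p = 0.5$ and an illustrative input of 10,000 ones; the manual mask is written out only to mirror the formula above and is not how `nn.Dropout` is implemented internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
x = torch.ones(10_000)

# Manual inverted dropout: keep each element with probability 1 - p,
# then rescale the survivors by 1 / (1 - p)
keep_mask = (torch.rand_like(x) >= p).float()
manual_out = x * keep_mask / (1 - p)

drop = nn.Dropout(p)

drop.train()            # training mode: random zeros, survivors scaled
train_out = drop(x)

drop.eval()             # eval mode: the layer is the identity
eval_out = drop(x)

print("manual mean:", manual_out.mean().item())   # close to 1.0
print("train  mean:", train_out.mean().item())    # close to 1.0
print("train values:", train_out.unique().tolist())
print("eval unchanged:", torch.equal(eval_out, x))
```

Because survivors are scaled up during training, the expected activation is the same in both modes, which is why no rescaling is needed at test time.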

## 2.2 Implementation

In [None]:
# Dropout demonstration

# Create model WITH dropout
class ModelWithDropout(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.fc1 = nn.Linear(10, 128)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(128, 128)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.fc3 = nn.Linear(128, 2)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        return self.fc3(x)

# Test dropout behavior
model = ModelWithDropout(dropout_rate=0.5)
x = torch.randn(5, 10)

# Training mode (dropout active)
model.train()
output_train1 = model(x)
output_train2 = model(x)

print("🎲 Dropout Behavior:")
print("\n📊 Training mode (dropout ACTIVE):")
print(f"   Output 1: {output_train1[0].detach().numpy()}")
print(f"   Output 2: {output_train2[0].detach().numpy()}")
print(f"   Different? {not torch.allclose(output_train1, output_train2)}")

# Test mode (dropout inactive)
model.eval()
output_test1 = model(x)
output_test2 = model(x)

print("\n📊 Test mode (dropout INACTIVE):")
print(f"   Output 1: {output_test1[0].detach().numpy()}")
print(f"   Output 2: {output_test2[0].detach().numpy()}")
print(f"   Same? {torch.allclose(output_test1, output_test2)}")

print("\n💡 Key points:")
print("   - Training: Outputs DIFFERENT (dropout active)")
print("   - Testing: Outputs SAME (dropout inactive)")
print("   - Always use model.train() / model.eval()!")

## 2.3 Dropout Rate Effects

In [None]:
# Compare different dropout rates

def train_model_with_dropout(dropout_rate, epochs=100):
    """Train a model with a specific dropout rate"""
    model = ModelWithDropout(dropout_rate=dropout_rate).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    
    train_losses, val_losses = [], []
    
    for epoch in range(epochs):
        # Train
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train.to(device))
        loss = criterion(outputs, y_train.to(device))
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        
        # Validate
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val.to(device))
            val_loss = criterion(val_outputs, y_val.to(device))
            val_losses.append(val_loss.item())
    
    return train_losses, val_losses

# Test different dropout rates
dropout_rates = [0.0, 0.2, 0.5, 0.8]
results = {}

print("🔄 Training with different dropout rates...\n")

for rate in dropout_rates:
    train_losses, val_losses = train_model_with_dropout(rate)
    results[rate] = {'train': train_losses, 'val': val_losses}
    print(f"✅ Dropout={rate}: Val Loss={val_losses[-1]:.4f}")

# Plot
plt.figure(figsize=(12, 5))

for rate in dropout_rates:
    plt.plot(results[rate]['val'], label=f'Dropout={rate}', linewidth=2, alpha=0.8)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Validation Loss', fontsize=12)
plt.title('Effect of Dropout Rate', fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print("\n📊 Observations:")
print("   Dropout=0.0: Overfits (no regularization)")
print("   Dropout=0.2-0.5: Good regularization")
print("   Dropout=0.8: Too much (underfits)")
print("\n💡 Typical: up to 0.5 for hidden layers, lower (about 0.1-0.2) for input layers")

## 2.4 When Dropout Hurts

### Situations Where Dropout is BAD

1. **Small datasets**: Not enough data to benefit
2. **Already regularized**: BatchNorm + Dropout can hurt
3. **Convolutional layers**: Spatial dropout better
4. **Recurrent layers**: Use specific dropout variants
5. **Output layer**: Never apply dropout here!
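
Point 3 deserves a concrete look. Neighbouring activations in a feature map are strongly correlated, so zeroing individual elements removes little information; spatial dropout zeros whole channels instead. The sketch below uses PyTorch's `nn.Dropout2d` on an illustrative all-ones tensor (the shape and $p = 0.5$ are assumptions chosen for readability).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(1, 8, 4, 4)           # (batch, channels, height, width)

spatial_drop = nn.Dropout2d(p=0.5)   # zeros entire channels, not single elements
spatial_drop.train()
out = spatial_drop(x)

# Each channel is either all zeros or uniformly scaled by 1 / (1 - p) = 2
per_channel = out[0].reshape(8, -1)
for c, row in enumerate(per_channel):
    print(f"channel {c}: min={row.min().item():.1f}, max={row.max().item():.1f}")
```

Within every channel the minimum and maximum agree, which confirms that the mask is shared across all spatial positions of that channel.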

### Best Practices

✅ **DO:**
- Use dropout=0.2-0.5 on fully connected layers
- Lower dropout on input layers (about 0.1-0.2), so raw features are not destroyed
- Higher dropout on hidden layers (up to 0.5)
- Always call `model.eval()` during inference

❌ **DON'T:**
- Apply dropout to every layer
- Use very high dropout (>0.7)
- Forget to switch train/eval modes
- Apply to output layer

---

# 3. WEIGHT DECAY

## 3.1 L2 Regularization

### Formula

Add penalty term to loss:

$$L_{\text{total}} = L_{\text{task}} + \frac{\lambda}{2} \sum_i w_i^2$$

### Gradient

$$\nabla L_{\text{total}} = \nabla L_{\text{task}} + \lambda w$$

### Effect

- Encourages smaller weights
- Smoother decision boundaries
- Prevents overfitting
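
The gradient formula can be confirmed with autograd. This is a minimal check, assuming an illustrative $\lambda = 0.01$ and a random 5-element weight vector; only the penalty term is differentiated.

```python
import torch

torch.manual_seed(0)
lam = 0.01                                # illustrative value of lambda
w = torch.randn(5, requires_grad=True)

# Penalty term only: (lambda / 2) * sum_i w_i^2
penalty = 0.5 * lam * (w ** 2).sum()
penalty.backward()

print("autograd gradient:", w.grad)
print("lambda * w       :", lam * w.detach())
print("match:", torch.allclose(w.grad, lam * w.detach()))
```

Plugging this gradient into a gradient-descent step with learning rate $\eta$ gives $w \leftarrow (1 - \eta\lambda)\,w - \eta \nabla L_{\text{task}}$: every step first shrinks the weights by the factor $(1 - \eta\lambda)$, which is where the name "weight decay" comes from.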

## 3.2 Weight Decay vs L2

### In SGD: Same thing

```python
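# L2 regularization: add the penalty (lam / 2) * ||w||^2 to the loss
# (`lam` instead of `lambda`, which is a reserved word in Python)
loss = task_loss + 0.5 * lam * (w ** 2).sum()

# Weight decay: plain SGD adds lam * w to the gradient, giving the same update
optimizer = torch.optim.SGD(params, lr=lr, weight_decay=lam)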
```

### In Adam: DIFFERENT!

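- **L2**: The penalty gradient $\lambda w$ is added to the task gradient, so Adam rescales it with its adaptive per-parameter step size
- **Weight Decay**: Decoupled from the gradient and applied directly to the weights, so it is not rescaled

→ Use **AdamW** for proper weight decay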
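
The difference shows up after a single optimizer step. The sketch below is a contrived setup, not a training recipe: the task gradient is set to zero by hand so that only the decay term acts, and the weights `[1.0, 100.0]`, `lr=0.1`, and `weight_decay=0.1` are illustrative values.

```python
import torch

def one_step(optimizer_cls):
    """Take one optimizer step with a zero task gradient; return the new weights."""
    w = torch.tensor([1.0, 100.0], requires_grad=True)
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)   # pretend the task loss has zero gradient
    opt.step()
    return w.detach()

w_adam = one_step(torch.optim.Adam)    # L2 term folded into the gradient
w_adamw = one_step(torch.optim.AdamW)  # decay applied directly to the weights

print("Adam  + weight_decay:", w_adam.tolist())
print("AdamW + weight_decay:", w_adamw.tolist())
```

With these settings Adam moves both weights by roughly the learning rate (0.1) regardless of their size, because the penalty gradient is normalized by Adam's second-moment estimate. AdamW instead multiplies each weight by $1 - \text{lr} \cdot \text{wd} = 0.99$, so the large weight shrinks a hundred times more than the small one.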

In [None]:
# Compare weight decay values

def train_with_weight_decay(weight_decay, epochs=100):
    """Train with a given weight decay"""
    model = LargeModel().to(device)
    optimizer = optim.AdamW(model.parameters(), lr=0.01, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    
    val_losses = []
    
    for epoch in range(epochs):
        # Train
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train.to(device))
        loss = criterion(outputs, y_train.to(device))
        loss.backward()
        optimizer.step()
        
        # Validate
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val.to(device))
            val_loss = criterion(val_outputs, y_val.to(device))
            val_losses.append(val_loss.item())
    
    return val_losses

# Test different weight decay values
weight_decays = [0.0, 0.001, 0.01, 0.1]
results = {}

print("🔄 Training with different weight decay values...\n")

for wd in weight_decays:
    val_losses = train_with_weight_decay(wd)
    results[wd] = val_losses
    print(f"✅ Weight Decay={wd}: Val Loss={val_losses[-1]:.4f}")

# Plot
plt.figure(figsize=(12, 5))

for wd in weight_decays:
    plt.plot(results[wd], label=f'WD={wd}', linewidth=2, alpha=0.8)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Validation Loss', fontsize=12)
plt.title('Effect of Weight Decay', fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print("\n📊 Observations:")
print("   WD=0.0: Overfits")
print("   WD=0.001-0.01: Good regularization")
print("   WD=0.1: Too strong (underfits)")
print("\n💡 Typical: 0.0001-0.01 depending on model size")
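
When tuning weight decay in practice, one widely used convention (not something the cells above do) is to exempt biases and normalization parameters from decay by placing them in a separate optimizer parameter group. The sketch below assumes a small stand-in model and uses the heuristic that 1-D tensors are biases or norm scales; treat both as assumptions to adapt.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(10, 128), nn.ReLU(),
    nn.Linear(128, 2),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Heuristic (an assumption, not a rule): 1-D tensors are biases or norm scales
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.01,
)

for i, group in enumerate(optimizer.param_groups):
    n = sum(p.numel() for p in group["params"])
    print(f"group {i}: weight_decay={group['weight_decay']}, parameters={n}")
```

Per-group options override the optimizer-level defaults, so the same pattern also works for giving different layers different learning rates.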

---

# 4. LABEL SMOOTHING

## 4.1 Over-confidence Problem

### Standard Cross Entropy

Targets: **Hard labels** (one-hot)
$$y = [0, 0, 1, 0, 0, ...]$$

Problem: Model becomes overconfident
$$\text{softmax}(\text{logits}) = [0.001, 0.001, 0.997, 0.001, ...]$$

### Label Smoothing

Targets: **Soft labels**
$$y_{\text{smooth}} = (1-\epsilon) \cdot y + \frac{\epsilon}{K}$$

Where:
- $\epsilon$: smoothing parameter (typically 0.1)
- $K$: number of classes

Example with $\epsilon=0.1, K=10$ (true class at index 2):
$$y = [0.01, 0.01, 0.91, 0.01, ...]$$
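
The arithmetic above can be reproduced in a few lines, and the resulting loss compared with PyTorch's built-in option. This is a sketch with illustrative random logits; the `label_smoothing` argument of `nn.CrossEntropyLoss` is available from PyTorch 1.10 onward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

eps, K = 0.1, 10
target = torch.tensor([2])                 # true class index

# Smoothed target: (1 - eps) * one_hot + eps / K
one_hot = F.one_hot(target, num_classes=K).float()
smooth = (1 - eps) * one_hot + eps / K
print("smoothed target:", smooth[0].tolist())
print("sums to:", smooth.sum().item())

# Cross entropy against the smoothed target, computed by hand
torch.manual_seed(0)
logits = torch.randn(1, K)
manual_loss = -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Built-in equivalent
builtin_loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, target)

print("manual :", manual_loss.item())
print("builtin:", builtin_loss.item())
```

The two losses agree because the built-in option uses the same mixture of the one-hot target with a uniform distribution over all $K$ classes.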

## 4.2 Benefits

- ✅ Better calibration
- ✅ More robust predictions
- ✅ Slight regularization effect
- ✅ Prevents overconfidence

In [None]:
# Label smoothing implementation

class LabelSmoothingCrossEntropy(nn.Module):
    """
    Cross entropy with label smoothing
    """
    def __init__(self, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon
    
    def forward(self, outputs, targets):
        """
        Args:
            outputs: Model predictions (logits)
            targets: Ground truth labels (long tensor)
        """
        n_classes = outputs.size(-1)
        
        # Convert targets to one-hot
        one_hot = torch.zeros_like(outputs).scatter(1, targets.unsqueeze(1), 1)
        
        # Apply label smoothing
        smooth_labels = one_hot * (1 - self.epsilon) + self.epsilon / n_classes
        
        # Compute loss
        log_probs = F.log_softmax(outputs, dim=-1)
        loss = -(smooth_labels * log_probs).sum(dim=-1).mean()
        
        return loss

# Compare standard vs label smoothing
def train_with_label_smoothing(use_smoothing, epochs=100):
    model = LargeModel().to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    
    # Training loss: smoothed or standard cross entropy
    if use_smoothing:
        criterion = LabelSmoothingCrossEntropy(epsilon=0.1)
    else:
        criterion = nn.CrossEntropyLoss()
    
    # Validation always uses standard cross entropy, so the two curves
    # are measured on the same scale and can be compared directly
    val_criterion = nn.CrossEntropyLoss()
    
    val_losses = []
    
    for epoch in range(epochs):
        # Train
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train.to(device))
        loss = criterion(outputs, y_train.to(device))
        loss.backward()
        optimizer.step()
        
        # Validate
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val.to(device))
            val_loss = val_criterion(val_outputs, y_val.to(device))
            val_losses.append(val_loss.item())
    
    return val_losses

# Train both
print("🔄 Training with standard CE...")
losses_standard = train_with_label_smoothing(False)

print("🔄 Training with label smoothing...")
losses_smoothing = train_with_label_smoothing(True)

# Plot
plt.figure(figsize=(12, 5))
plt.plot(losses_standard, label='Standard CE', linewidth=2)
plt.plot(losses_smoothing, label='Label Smoothing (ε=0.1)', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Validation Loss', fontsize=12)
plt.title('Label Smoothing Effect', fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print("\n📊 Results:")
print(f"   Standard CE: {losses_standard[-1]:.4f}")
print(f"   Label Smoothing: {losses_smoothing[-1]:.4f}")
print("\n✅ Label smoothing provides slight regularization")

---

# 🎓 Summary of FILE 3: Regularization

## ✅ What You Learned

### 1. Overfitting & Generalization
- **Bias-variance tradeoff**: Balance model complexity
- **Train-val gap**: Sign of overfitting
- **Solutions**: Regularization, more data, early stopping

### 2. Dropout
- **Mechanism**: Randomly drop neurons during training
- **Benefits**: Model averaging, prevents co-adaptation
- **Best rates**: up to 0.5 hidden, lower (about 0.1-0.2) input
- **Critical**: Use `model.train()` and `model.eval()`

### 3. Weight Decay
- **L2 regularization**: Penalty on large weights
- **Weight decay vs L2**: Different in Adam!
- **Use AdamW**: Proper decoupled weight decay
- **Typical values**: 0.0001-0.01

### 4. Label Smoothing
- **Problem**: Overconfidence
- **Solution**: Soft labels with $\epsilon=0.1$
- **Benefits**: Better calibration, robustness

## 🚀 Key Takeaways

1. **Overfitting** is a high-variance problem
2. **Dropout** is an effective regularizer (0.2-0.5)
3. **Weight decay** is complementary to dropout
4. **AdamW** is better than Adam + L2
5. **Label smoothing** prevents overconfidence
6. **Combine regularizations** for best results, checking each addition on validation data
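
Takeaway 6 can be sketched in a single training step. The example below wires dropout, AdamW weight decay, and label smoothing together on a hypothetical random batch. The layer sizes and hyperparameters (`0.3`, `0.01`, `0.1`, `lr=1e-3`) are illustrative starting points, not tuned values.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Dropout(0.3),                  # dropout on the hidden layer only
    nn.Linear(64, 2),                 # no dropout on the output layer
)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Hypothetical random batch, only to show that the pieces fit together
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))

model.train()                         # dropout active
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"training loss after one step: {loss.item():.4f}")

model.eval()                          # dropout off for evaluation
with torch.no_grad():
    eval_loss = nn.CrossEntropyLoss()(model(x), y)
print(f"eval loss (plain CE): {eval_loss.item():.4f}")
```

Evaluating with plain cross entropy keeps the reported number comparable across runs that use different smoothing strengths.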

## 📝 Next Files

- FILE 4: Embedding
- FILE 5: Normalization
- FILE 6: Activation Functions

---

**Congratulations on completing FILE 3! 🎉**