# 📗 Phase 2 – Representation: 6️⃣ Activation Functions

## Instructor: Deep Learning with PyTorch

### 🎯 JUST ENOUGH – NO DIGRESSIONS

---

## Learning Objectives

After completing this notebook, you will:
- ✅ Understand the **role** of non-linearity in neural networks
- ✅ Know the common activation functions: **Sigmoid, Tanh, ReLU, Leaky ReLU, GELU, Swish**
- ✅ Understand the **Dead ReLU problem** and how to address it
- ✅ Understand why **GELU** is used in Transformers
- ✅ Compare **gradient flow** across activations
- ✅ Run experiments to choose a suitable activation

---

In [None]:
# Import the required libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Seed
torch.manual_seed(42)
np.random.seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Device: {device}")
print(f"🔥 PyTorch version: {torch.__version__}")

---

## 6.1 Activation Overview

### 🎯 The Role of Non-linearity

An **activation function** is a non-linear function applied after each linear transformation:

$$
h = \sigma(Wx + b)
$$

Where:
- $W, b$: weights and bias (linear)
- $\sigma$: activation function (**non-linear**)

### ⚠️ Limitation of Linear Networks

**Without an activation (or with a linear activation)**:

$$
\begin{align}
h_1 &= W_1 x \\
h_2 &= W_2 h_1 = W_2 W_1 x \\
h_3 &= W_3 h_2 = W_3 W_2 W_1 x = W_{combined} x
\end{align}
$$

**Problem**:
- A stack of linear layers collapses into a single linear transformation (with biases, a single affine one).
- It cannot learn **non-linear patterns**.
- Extra depth adds no expressive power.
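The collapse can be checked numerically. The sketch below uses plain Python lists rather than PyTorch; the helpers `matmul` and `matvec` and the values of `W1`, `W2`, `W3`, and `x` are made up for illustration and are not part of the notebook.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Hypothetical weights of a 3-layer network without activations (biases omitted).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
W3 = [[1.0, -1.0]]
x = [3.0, 4.0]

# Layer-by-layer forward pass: h3 = W3 (W2 (W1 x)).
h3_layered = matvec(W3, matvec(W2, matvec(W1, x)))

# One equivalent matrix: W_combined = W3 W2 W1.
W_combined = matmul(W3, matmul(W2, W1))
h3_single = matvec(W_combined, x)

print("W_combined:", W_combined)
print("layered:", h3_layered, "single matrix:", h3_single)
```

Both paths give the same output, so the three layers carry no more expressive power than the single matrix `W_combined`.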

In [None]:
# Demo: a linear network cannot learn XOR
def demonstrate_linear_limitation():
    """
    Demonstrate that a linear network cannot learn the XOR problem.
    """
    # XOR dataset
    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([0., 1., 1., 0.])  # XOR output
    
    # Linear network (NO activation)
    linear_model = nn.Sequential(
        nn.Linear(2, 10),
        nn.Linear(10, 10),
        nn.Linear(10, 1)
    )
    
    # Non-linear network (WITH ReLU)
    nonlinear_model = nn.Sequential(
        nn.Linear(2, 10),
        nn.ReLU(),
        nn.Linear(10, 10),
        nn.ReLU(),
        nn.Linear(10, 1)
    )
    
    # Training function
    def train(model, X, y, epochs=1000):
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
        criterion = nn.MSELoss()
        losses = []
        
        for epoch in range(epochs):
            optimizer.zero_grad()
            pred = model(X).squeeze()
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        
        return losses
    
    # Train both
    print("🏋️  Training Linear vs Non-linear networks on XOR...\n")
    linear_losses = train(linear_model, X, y)
    nonlinear_losses = train(nonlinear_model, X, y)
    
    # Plot results
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('🎯 XOR Problem: Linear vs Non-linear', fontsize=16, fontweight='bold')
    
    # Loss curves
    axes[0].plot(linear_losses, label='Linear (NO activation)', color='red', linewidth=2)
    axes[0].plot(nonlinear_losses, label='Non-linear (ReLU)', color='green', linewidth=2)
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss (MSE)')
    axes[0].set_title('Training Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_yscale('log')
    
    # Predictions
    with torch.no_grad():
        linear_pred = linear_model(X).squeeze()
        nonlinear_pred = nonlinear_model(X).squeeze()
    
    x_labels = ['(0,0)', '(0,1)', '(1,0)', '(1,1)']
    x_pos = np.arange(len(x_labels))
    width = 0.25
    
    axes[1].bar(x_pos - width, y.numpy(), width, label='True', color='blue', alpha=0.7)
    axes[1].bar(x_pos, linear_pred.numpy(), width, label='Linear', color='red', alpha=0.7)
    axes[1].bar(x_pos + width, nonlinear_pred.numpy(), width, label='Non-linear', color='green', alpha=0.7)
    axes[1].set_xlabel('Input')
    axes[1].set_ylabel('Output')
    axes[1].set_title('Predictions')
    axes[1].set_xticks(x_pos)
    axes[1].set_xticklabels(x_labels)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Final Loss:")
    print(f"   Linear: {linear_losses[-1]:.6f}")
    print(f"   Non-linear: {nonlinear_losses[-1]:.6f}")
    print("\n✅ Non-linearity is ESSENTIAL for learning complex patterns!")

demonstrate_linear_limitation()

---

## 6.2 Common Activation Functions

### 📐 Sigmoid

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

**Characteristics**:
- Output: $(0, 1)$
- Smooth, differentiable
- Used for binary classification (output layer)

**Issues**:
- ⚠️ **Vanishing gradient**: the gradient approaches 0 when $|x|$ is large
- ⚠️ Not zero-centered
- ⚠️ Computationally expensive (exp)
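To put numbers on the vanishing gradient, recall that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which peaks at $0.25$ for $x = 0$. The sketch below evaluates it with the standard library only; the helper names are illustrative, and the $0.25^n$ figure is just an upper bound on the activation-derivative factor through $n$ sigmoid layers (weights are ignored).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Upper bound on the product of sigmoid derivatives across n layers.
for n in (1, 5, 10):
    print(f"n = {n:2d}   0.25**n = {0.25 ** n:.2e}")
```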

### 📐 Tanh

$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1
$$

**Characteristics**:
- Output: $(-1, 1)$
- Zero-centered (an advantage over Sigmoid)
- Used in RNNs and LSTMs

**Issues**:
- ⚠️ Still suffers from **vanishing gradient**
- ⚠️ Expensive computation
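The identity $\tanh(x) = 2\sigma(2x) - 1$ and the saturation of the gradient $1 - \tanh^2(x)$ can both be checked in a few lines. This is a minimal sketch using only the standard library; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """Tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x = {x:4.1f}  tanh = {math.tanh(x): .6f}  "
          f"2*sigmoid(2x)-1 = {tanh_via_sigmoid(x): .6f}  grad = {tanh_grad(x):.6f}")
```

The gradient is 1 at the origin, four times the sigmoid's peak of 0.25, but it still decays toward 0 as $|x|$ grows.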

In [None]:
# Visualization: Sigmoid and Tanh
def plot_activation_and_gradient(activation_fn, name, x_range=(-5, 5)):
    """
    Plot an activation function and its gradient.
    """
    x = torch.linspace(x_range[0], x_range[1], 1000, requires_grad=True)
    y = activation_fn(x)
    
    # Compute gradient
    y.sum().backward()
    grad = x.grad
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
    fig.suptitle(f'{name} Activation Function', fontsize=16, fontweight='bold')
    
    # Activation
    ax1.plot(x.detach().numpy(), y.detach().numpy(), linewidth=2, color='blue')
    ax1.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    ax1.set_xlabel('x')
    ax1.set_ylabel(f'{name}(x)')
    ax1.set_title('Activation Function')
    ax1.grid(True, alpha=0.3)
    
    # Gradient
    ax2.plot(x.detach().numpy(), grad.numpy(), linewidth=2, color='red')
    ax2.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax2.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    ax2.set_xlabel('x')
    ax2.set_ylabel(f"d{name}/dx")
    ax2.set_title('Gradient')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

print("📊 Sigmoid & Tanh Visualization\n")

# Sigmoid
plot_activation_and_gradient(torch.sigmoid, 'Sigmoid')
print("⚠️  Sigmoid gradient is close to 0 once |x| > 5 (Vanishing Gradient!)\n")

# Tanh
plot_activation_and_gradient(torch.tanh, 'Tanh')
print("⚠️  Tanh also has the vanishing gradient problem!")

### 📐 ReLU (Rectified Linear Unit)

$$
\text{ReLU}(x) = \max(0, x) = \begin{cases} 
x & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

**Advantages**:
- ✅ No vanishing gradient (for $x > 0$)
- ✅ Computationally efficient
- ✅ Sparse activation (many neurons output 0)
- ✅ **The most widely used activation**

**Disadvantages**:
- ⚠️ **Dead ReLU**: neurons can "die" (output stays at 0)
- ⚠️ Not zero-centered
- ⚠️ Unbounded (activations can explode)

### 📐 Leaky ReLU

$$
\text{LeakyReLU}(x) = \max(\alpha x, x) = \begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$

Typically $\alpha = 0.01$ or $0.1$. The $\max(\alpha x, x)$ form holds for $0 \le \alpha \le 1$.

**Advantages**:
- ✅ Mitigates the Dead ReLU problem
- ✅ Still efficient

**Variants**:
- **PReLU** (Parametric ReLU): $\alpha$ is a learnable parameter
- **ELU** (Exponential Linear Unit): smooth in the negative region
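The piecewise definitions translate directly into code. The sketch below is a scalar, framework-free version meant only to make the formulas concrete. It assumes the standard ELU form $\alpha(e^x - 1)$ for $x \le 0$, which the text above does not spell out, and it takes the gradient at $x = 0$ from the negative branch as a convention.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # Assumed standard ELU: alpha * (exp(x) - 1) on the negative side.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x = {x:4.1f}  relu' = {relu_grad(x):.2f}  "
          f"leaky' = {leaky_relu_grad(x):.2f}  elu' = {elu_grad(x):.4f}")
```

For negative inputs only ReLU has a gradient of exactly zero; Leaky ReLU keeps a constant slope $\alpha$ and ELU keeps a slope that decays smoothly.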

In [None]:
# Visualization: ReLU family
def plot_relu_family():
    """
    Compare ReLU variants
    """
    x = torch.linspace(-3, 3, 1000)
    
    relu = F.relu(x)
    leaky_relu = F.leaky_relu(x, negative_slope=0.1)
    elu = F.elu(x)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('🔥 ReLU Family', fontsize=16, fontweight='bold')
    
    # Activations
    axes[0].plot(x.numpy(), relu.numpy(), label='ReLU', linewidth=2)
    axes[0].plot(x.numpy(), leaky_relu.numpy(), label='Leaky ReLU (α=0.1)', linewidth=2)
    axes[0].plot(x.numpy(), elu.numpy(), label='ELU', linewidth=2)
    axes[0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].set_title('Activation Functions')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Gradients
    x_grad = torch.linspace(-3, 3, 1000, requires_grad=True)
    
    for act_fn, name, color in [(F.relu, 'ReLU', 'C0'), 
                                  (lambda x: F.leaky_relu(x, 0.1), 'Leaky ReLU', 'C1'),
                                  (F.elu, 'ELU', 'C2')]:
        x_grad.grad = None
        y = act_fn(x_grad)
        y.sum().backward()
        axes[1].plot(x_grad.detach().numpy(), x_grad.grad.numpy(), 
                    label=name, linewidth=2, color=color)
    
    axes[1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
    axes[1].set_xlabel('x')
    axes[1].set_ylabel('df/dx')
    axes[1].set_title('Gradients')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    axes[1].set_ylim(-0.5, 1.5)
    
    plt.tight_layout()
    plt.show()

plot_relu_family()
print("\n📊 Key Observations:")
print("   - ReLU: gradient = 0 when x < 0 (Dead ReLU risk)")
print("   - Leaky ReLU: small gradient when x < 0 (solves Dead ReLU)")
print("   - ELU: smooth, better gradient flow")

### 📐 GELU (Gaussian Error Linear Unit)

$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

where $\Phi(x)$ is the CDF of the standard Gaussian distribution.

**Approximation** (faster):
$$
\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)
$$

**Characteristics**:
- ✅ Smooth (differentiable everywhere)
- ✅ Non-monotonic (dips slightly below zero for negative inputs)
- ✅ **Used in BERT, GPT, and other Transformers**
- ✅ Non-zero gradient for negative inputs, unlike ReLU
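The exact form and the tanh approximation can be compared directly, using $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The sketch below uses `math.erf` from the standard library; the function names and the grid over $[-5, 5]$ are illustrative choices.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU, as given in the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Evaluate both on a grid over [-5, 5] with step 0.01.
xs = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
x_min = min(xs, key=gelu_exact)

print(f"max |exact - approx| on [-5, 5]: {max_err:.2e}")
print(f"GELU reaches its minimum near x = {x_min:.2f} (value {gelu_exact(x_min):.4f})")
```

The minimum at a negative `x` is what makes GELU non-monotonic: the function decreases, bottoms out, and then rises. In PyTorch, `F.gelu(x)` computes the exact form by default, and recent releases accept `approximate='tanh'` for the approximation.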

### 📐 Swish (SiLU)

$$
\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
$$

**Characteristics**:
- ✅ Self-gated
- ✅ Smooth, non-monotonic
- ✅ Found through automated search at Google Brain (see the Swish paper in References)
- ✅ Works well in deep networks
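Differentiating $x \cdot \sigma(x)$ with the product rule gives $\sigma(x) + x\,\sigma(x)(1 - \sigma(x))$. The sketch below checks that expression against a finite-difference estimate. The `beta` argument (the general Swish form $x \cdot \sigma(\beta x)$) and the helper `numeric_grad` are additions for illustration; `beta = 1` recovers the formula above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). beta = 1 is SiLU."""
    return x * sigmoid(beta * x)

def swish_grad(x):
    """Analytic derivative for beta = 1: s(x) + x * s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in (-3.0, -1.0, 0.0, 2.0):
    print(f"x = {x:4.1f}  analytic = {swish_grad(x): .6f}  "
          f"numeric = {numeric_grad(swish, x): .6f}")
```

The derivative is negative for sufficiently negative inputs (for example at $x = -2$), which is the non-monotonic dip listed above.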

In [None]:
# Visualization: GELU and Swish
def plot_modern_activations():
    """
    Visualize GELU and Swish.
    """
    x = torch.linspace(-5, 5, 1000)
    
    # Compute activations
    relu = F.relu(x)
    gelu = F.gelu(x)
    silu = F.silu(x)  # Swish
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('🚀 Modern Activation Functions', fontsize=16, fontweight='bold')
    
    # Activation comparison
    axes[0, 0].plot(x.numpy(), relu.numpy(), label='ReLU', linewidth=2, linestyle='--')
    axes[0, 0].plot(x.numpy(), gelu.numpy(), label='GELU', linewidth=2)
    axes[0, 0].plot(x.numpy(), silu.numpy(), label='Swish/SiLU', linewidth=2)
    axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[0, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].set_title('Activation Functions')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Gradients
    x_grad = torch.linspace(-5, 5, 1000, requires_grad=True)
    
    for act_fn, name, color in [(F.relu, 'ReLU', 'C0'), 
                                  (F.gelu, 'GELU', 'C1'),
                                  (F.silu, 'Swish', 'C2')]:
        x_grad.grad = None
        y = act_fn(x_grad)
        y.sum().backward()
        axes[0, 1].plot(x_grad.detach().numpy(), x_grad.grad.numpy(), 
                       label=name, linewidth=2, color=color)
    
    axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[0, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
    axes[0, 1].set_xlabel('x')
    axes[0, 1].set_ylabel('df/dx')
    axes[0, 1].set_title('Gradients')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Zoom near zero for GELU
    x_zoom = torch.linspace(-2, 2, 1000)
    gelu_zoom = F.gelu(x_zoom)
    relu_zoom = F.relu(x_zoom)
    axes[1, 0].plot(x_zoom.numpy(), relu_zoom.numpy(), label='ReLU', linewidth=2, linestyle='--')
    axes[1, 0].plot(x_zoom.numpy(), gelu_zoom.numpy(), label='GELU', linewidth=2)
    axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
    axes[1, 0].set_xlabel('x')
    axes[1, 0].set_ylabel('f(x)')
    axes[1, 0].set_title('Zoom: GELU vs ReLU near zero')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Negative values behavior
    x_neg = torch.linspace(-5, 0, 1000)
    axes[1, 1].plot(x_neg.numpy(), F.relu(x_neg).numpy(), label='ReLU', linewidth=2)
    axes[1, 1].plot(x_neg.numpy(), F.gelu(x_neg).numpy(), label='GELU', linewidth=2)
    axes[1, 1].plot(x_neg.numpy(), F.silu(x_neg).numpy(), label='Swish', linewidth=2)
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].set_title('Behavior at Negative Values')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_modern_activations()

print("\n✨ Key Insights:")
print("   - GELU: Smooth, non-zero gradient in the negative region")
print("   - Swish: Self-gating, smooth everywhere")
print("   - Both often match or outperform ReLU in deep networks")

---

## 6.3 Activation & Optimization

### 💀 Dead ReLU Problem

**Definition**: A neuron is "dead" when its output is 0 for ALL inputs.

**Causes**:
1. A weight update pushes $Wx + b < 0$ for every input
2. Gradient = 0 → the weights can no longer be updated
3. The neuron stays "dead" permanently

**Solutions**:
- ✅ Use Leaky ReLU / PReLU / ELU
- ✅ Careful weight initialization
- ✅ Lower learning rate
- ✅ Batch Normalization
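The mechanism can be traced by hand on a single neuron. The sketch below is a toy example, not the notebook's experiment: one scalar weight `w`, one bias `b`, a squared-error loss, and made-up `inputs` and `targets`. The state `w = 1.0, b = -5.0` stands in for a neuron after an overly large update.

```python
def neuron_grads(w, b, inputs, targets, slope=0.0):
    """Mean gradient of 0.5 * (a - t)^2 w.r.t. w and b for one neuron.

    slope = 0.0 gives ReLU; slope > 0 gives Leaky ReLU with that negative slope.
    """
    gw, gb = 0.0, 0.0
    for x, t in zip(inputs, targets):
        z = w * x + b                      # pre-activation
        a = z if z > 0 else slope * z      # activation output
        dz = 1.0 if z > 0 else slope       # derivative of the activation at z
        err = a - t                        # derivative of the loss w.r.t. a
        gw += err * dz * x
        gb += err * dz
    n = len(inputs)
    return gw / n, gb / n

# Made-up data for illustration.
inputs = [-1.0, -0.5, 0.5, 1.0]
targets = [0.0, 0.0, 0.5, 1.0]

# Hypothetical "dead" state: w * x + b < 0 for every input in the dataset.
w, b = 1.0, -5.0

relu_gw, relu_gb = neuron_grads(w, b, inputs, targets, slope=0.0)
leaky_gw, leaky_gb = neuron_grads(w, b, inputs, targets, slope=0.1)

print("ReLU gradients:      ", relu_gw, relu_gb)
print("Leaky ReLU gradients:", leaky_gw, leaky_gb)
```

With ReLU both gradients are zero, so gradient descent can never move `w` or `b` again. With Leaky ReLU the bias gradient is negative, so a descent step raises `b` and the neuron can recover.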

In [None]:
# Demo: Dead ReLU problem
def demonstrate_dead_relu():
    """
    Demonstrate Dead ReLU problem
    """
    # Create model with ReLU
    torch.manual_seed(42)
    model_relu = nn.Sequential(
        nn.Linear(10, 100),
        nn.ReLU(),
        nn.Linear(100, 100),
        nn.ReLU(),
        nn.Linear(100, 1)
    )
    
    # Create model with Leaky ReLU
    torch.manual_seed(42)
    model_leaky = nn.Sequential(
        nn.Linear(10, 100),
        nn.LeakyReLU(0.1),
        nn.Linear(100, 100),
        nn.LeakyReLU(0.1),
        nn.Linear(100, 1)
    )
    
    # Synthetic data
    X = torch.randn(1000, 10)
    y = torch.randn(1000, 1)
    
    # Training function that tracks dead neurons
    def train_and_track_dead_neurons(model, X, y, epochs=50):
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # High LR to trigger dead ReLU
        criterion = nn.MSELoss()
        dead_neurons_history = []
        
        for epoch in range(epochs):
            optimizer.zero_grad()
            output = model(X)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()
            
            # Count dead neurons in the first hidden layer (zero output for every sample)
            with torch.no_grad():
                activations = model[1](model[0](X))
                dead = (activations.abs().sum(dim=0) == 0).sum().item()
                dead_neurons_history.append(dead)
        
        return dead_neurons_history
    
    print("🔬 Tracking Dead Neurons...\n")
    dead_relu = train_and_track_dead_neurons(model_relu, X, y)
    dead_leaky = train_and_track_dead_neurons(model_leaky, X, y)
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(dead_relu, label='ReLU', linewidth=2, color='red')
    plt.plot(dead_leaky, label='Leaky ReLU', linewidth=2, color='green')
    plt.xlabel('Epoch')
    plt.ylabel('Number of Dead Neurons')
    plt.title('💀 Dead ReLU Problem', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"Final dead neurons (ReLU): {dead_relu[-1]}/100")
    print(f"Final dead neurons (Leaky ReLU): {dead_leaky[-1]}/100")
    print("\n✅ Leaky ReLU significantly reduces the dead neuron problem!")

demonstrate_dead_relu()

### 🌊 Smooth vs Non-smooth Activations

**Non-smooth** (ReLU, Leaky ReLU):
- Not differentiable at x=0
- Sharp transitions
- Faster computation

**Smooth** (GELU, Swish, Sigmoid, Tanh):
- Differentiable everywhere
- Gradient changes continuously (Sigmoid and Tanh still saturate for large $|x|$)
- Often more stable optimization

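### 🤖 Why GELU in Transformers?

1. **Smooth gradient**: Better optimization for deep networks
2. **Non-monotonic**: Richer expressivity
3. **Probabilistic interpretation**: Gates each input by $\Phi(x)$, the probability that a standard Gaussian falls below it
4. **Empirically strong**: Used in BERT and GPT

**BERT/GPT feed-forward block**:
```python
import torch.nn as nn

d_model = 768  # example hidden size (assumption; BERT-base uses 768)

# Position-wise feed-forward network: expand 4x, apply GELU, project back
FFN = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),  # ← GELU here!
    nn.Linear(4 * d_model, d_model)
)
```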
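As a rough size check for a feed-forward block of this shape, the parameter count is $8d^2 + 5d$ for hidden size $d$: two weight matrices of $4d^2$ each plus biases of $4d$ and $d$. The helper `ffn_param_count` below is a hypothetical name, and 768 is used only as an example value (the BERT-base hidden size).

```python
def ffn_param_count(d_model, expansion=4):
    """Parameters of Linear(d, e*d) -> activation -> Linear(e*d, d), biases included."""
    d_ff = expansion * d_model
    first = d_model * d_ff + d_ff       # weights + bias of the expanding layer
    second = d_ff * d_model + d_model   # weights + bias of the projecting layer
    return first + second

# Example value only: 768 is the BERT-base hidden size.
print(ffn_param_count(768))
```

The activation itself contributes no parameters, so swapping GELU for ReLU or SiLU in this block leaves the count unchanged.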

---

## 6.4 Practical Experiments

We will compare the activations on:
- 🎯 Convergence speed
- 📊 Final accuracy
- 🌊 Gradient flow stability

In [None]:
# Setup: dataset and model architecture
def create_dataset(n_samples=5000):
    """
    Create a non-linear classification dataset
    """
    X = torch.randn(n_samples, 20)
    # Non-linear combination
    y = ((X[:, :5].pow(2).sum(dim=1) > 5) & 
         (X[:, 5:10].sum(dim=1) > 0)).long()
    return TensorDataset(X, y)

class FlexibleMLP(nn.Module):
    """
    MLP with a pluggable activation
    """
    def __init__(self, input_dim=20, hidden_dim=128, output_dim=2, activation='relu'):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, output_dim)
        
        # Activation selection
        activations = {
            'relu': nn.ReLU(),
            'leaky_relu': nn.LeakyReLU(0.1),
            'gelu': nn.GELU(),
            'silu': nn.SiLU(),
            'tanh': nn.Tanh(),
            'sigmoid': nn.Sigmoid()
        }
        self.activation = activations[activation]
    
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.fc4(x)
        return x

# Create dataset
dataset = create_dataset()
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

print(f"✅ Dataset created: {len(dataset)} samples")
print(f"✅ Class distribution: {dataset[:][1].bincount()}")

In [None]:
# Training function with detailed tracking
def train_with_tracking(model, dataloader, epochs=30, lr=0.001, device='cpu'):
    """
    Train and track loss, accuracy, and gradient norm per epoch.
    """
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    history = {
        'loss': [],
        'accuracy': [],
        'grad_norm': []
    }
    
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        correct = 0
        total = 0
        grad_norms = []
        
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            
            optimizer.zero_grad()
            
            # Forward pass
            output = model(X)
            loss = criterion(output, y)
            
            # Backward
            loss.backward()
            
            # Track gradient norm
            total_norm = 0
            for p in model.parameters():
                if p.grad is not None:
                    total_norm += p.grad.norm(2).item() ** 2
            grad_norms.append(total_norm ** 0.5)
            
            optimizer.step()
            
            # Stats
            epoch_loss += loss.item()
            _, predicted = output.max(1)
            correct += predicted.eq(y).sum().item()
            total += y.size(0)
        
        # Record history
        history['loss'].append(epoch_loss / len(dataloader))
        history['accuracy'].append(100. * correct / total)
        history['grad_norm'].append(np.mean(grad_norms))
    
    return history

print("🏋️  Training function ready!")

In [None]:
# Experiment: Compare all activations
print("🚀 Running Comprehensive Activation Comparison...\n")

activations_to_test = ['relu', 'leaky_relu', 'gelu', 'silu', 'tanh', 'sigmoid']
results = {}

for act_name in tqdm(activations_to_test, desc="Training models"):
    model = FlexibleMLP(activation=act_name)
    history = train_with_tracking(model, dataloader, epochs=30, device=device)
    results[act_name] = history

print("\n✅ Training completed!")

In [None]:
# Comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('📊 Comprehensive Activation Function Comparison',
             fontsize=16, fontweight='bold')

colors = ['C0', 'C1', 'C2', 'C3', 'C4', 'C5']

# Loss curves
for (act_name, history), color in zip(results.items(), colors):
    axes[0, 0].plot(history['loss'], label=act_name.upper(), 
                   linewidth=2, color=color)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Training Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_yscale('log')

# Accuracy curves
for (act_name, history), color in zip(results.items(), colors):
    axes[0, 1].plot(history['accuracy'], label=act_name.upper(),
                   linewidth=2, color=color)
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy (%)')
axes[0, 1].set_title('Training Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Gradient norm
for (act_name, history), color in zip(results.items(), colors):
    axes[1, 0].plot(history['grad_norm'], label=act_name.upper(),
                   linewidth=2, color=color, alpha=0.7)
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Gradient Norm')
axes[1, 0].set_title('Gradient Flow Stability')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_yscale('log')

# Final performance comparison
final_accs = [results[act]['accuracy'][-1] for act in activations_to_test]
bars = axes[1, 1].bar(range(len(activations_to_test)), final_accs, color=colors)
axes[1, 1].set_xticks(range(len(activations_to_test)))
axes[1, 1].set_xticklabels([a.upper() for a in activations_to_test], rotation=45)
axes[1, 1].set_ylabel('Final Accuracy (%)')
axes[1, 1].set_title('Final Performance')
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, acc in zip(bars, final_accs):
    height = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., height,
                   f'{acc:.1f}%', ha='center', va='bottom')

plt.tight_layout()
plt.show()

In [None]:
# Summary statistics
print("\n📊 Final Performance Summary\n")
print("-" * 60)
print(f"{'Activation':<15} {'Final Acc (%)':<15} {'Final Loss':<15} {'Avg Grad Norm'}")
print("-" * 60)

for act_name in activations_to_test:
    history = results[act_name]
    final_acc = history['accuracy'][-1]
    final_loss = history['loss'][-1]
    avg_grad = np.mean(history['grad_norm'][-5:])  # Last 5 epochs
    
    print(f"{act_name.upper():<15} {final_acc:<15.2f} {final_loss:<15.4f} {avg_grad:.4f}")

print("-" * 60)

# Find best
best_act = max(activations_to_test, key=lambda a: results[a]['accuracy'][-1])
print(f"\n🏆 Best Activation: {best_act.upper()}")
print(f"   Accuracy: {results[best_act]['accuracy'][-1]:.2f}%")

---

## 📚 Conclusion & Best Practices

### 🎯 When to Use Which Activation?

| Use Case | Recommended | Reason |
|----------|-------------|-------|
| **CNN (Computer Vision)** | ReLU / Leaky ReLU | Fast, proven, works well |
| **Transformers / NLP** | GELU | Smooth, better gradient, proven in BERT/GPT |
| **RNN / LSTM** | Tanh | Traditional choice, works well |
| **Deep Networks (>50 layers)** | GELU / Swish | Better gradient flow |
| **Binary Classification Output** | Sigmoid | Output in (0,1) |
| **Multi-class Output** | Softmax | Probability distribution |
| **Regression Output** | None (Linear) | No constraint needed |
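The Sigmoid and Softmax rows are closely related: a softmax over the two logits $[z, 0]$ assigns the first class exactly $\sigma(z)$. The sketch below checks this with the standard library; the helper names and the logit value are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.7  # hypothetical logit for the positive class
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]

print(f"sigmoid(z)          = {p_sigmoid:.6f}")
print(f"softmax([z, 0])[0]  = {p_softmax:.6f}")
```

In the experiments above the model returns raw logits and `nn.CrossEntropyLoss` applies log-softmax internally, so no explicit Softmax layer is placed at the output.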

### 💡 Best Practices:

1. **Default Choice**: Start with **ReLU** (CNN) or **GELU** (Transformers)
2. **Dead ReLU**: If you hit it, switch to **Leaky ReLU** or **ELU**
3. **Very Deep Networks**: Try **GELU** or **Swish**
4. **Gradient Issues**: Avoid Sigmoid/Tanh in hidden layers
5. **Combination**: Different layers can use different activations

### ⚠️ Common Mistakes:

- ❌ Using Sigmoid/Tanh for hidden layers in deep networks
- ❌ Forgetting to put an activation after Linear layers
- ❌ Applying an activation at the output layer when none is needed
- ❌ Not testing multiple activations when tuning

### 🔬 Experiment Tips:

1. Always test **at least 2-3 different activations**
2. Monitor the **gradient norm** to detect vanishing/exploding gradients
3. Check the **dead neuron percentage** with ReLU
4. Compare **convergence speed**, not only final accuracy
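One simple way to quantify convergence speed is the first epoch at which accuracy crosses a chosen threshold. The helper below is a hypothetical sketch; the accuracy curves in `fake_results` are made up for illustration and are not results from this notebook.

```python
def epochs_to_reach(accuracy_history, threshold):
    """Return the 1-based epoch at which accuracy first reaches threshold, or None."""
    for epoch, acc in enumerate(accuracy_history, start=1):
        if acc >= threshold:
            return epoch
    return None

# Made-up accuracy curves (percent), for illustration only.
fake_results = {
    "act_a": [60.0, 82.0, 90.5, 91.5, 92.0],
    "act_b": [55.0, 65.0, 80.0, 88.0, 92.0],
}

for name, history in fake_results.items():
    print(name, "reaches 90% at epoch", epochs_to_reach(history, threshold=90.0))
```

Both made-up curves end at the same accuracy, yet one crosses 90% two epochs earlier. With the `results` dictionary from the experiment above, the same call would be `epochs_to_reach(results['gelu']['accuracy'], 90.0)`.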

---

## 🎓 Practice Exercises

1. **Implement Mish activation** from scratch: $\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))$
2. **Compare activations on CIFAR-10**: Test ReLU vs GELU on a ResNet
3. **Visualize activation distributions**: Plot histograms of activations during training
4. **Dead neuron tracking**: Monitor percentage of dead neurons over time
5. **Custom activation**: Design and test your own activation function!

---

## 📖 References

1. [GELU Paper](https://arxiv.org/abs/1606.08415) - Gaussian Error Linear Units
2. [Swish Paper](https://arxiv.org/abs/1710.05941) - Searching for Activation Functions
3. [ReLU Deep Dive](https://arxiv.org/abs/1803.08375) - Understanding ReLU
4. [Activation Survey](https://arxiv.org/abs/2109.14545) - Comprehensive comparison

---

### 🙏 Thank you for learning!

**Normalization + Activation = Foundation of Modern Deep Learning!** 🚀