# 07 Optimization Techniques

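## Learning Objectives

1. Understand why weight initialization matters
2. Implement the Momentum SGD and Adam optimizers
3. Implement learning rate schedules
4. Compare the effect of different optimization strategies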

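## Why Do Optimization Techniques Matter?

Training a deep neural network is a hard non-convex optimization problem. A good optimization strategy can:
- Speed up convergence
- Reduce the risk of getting stuck in poor local minima
- Improve the final performance of the model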

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
print("Optimization Techniques module loaded!")

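## Part 1: Weight Initialization

### Why Does Initialization Matter?

Poor initialization can lead to:
1. **Vanishing gradients**: activations become too small, so gradients approach zero
2. **Exploding gradients**: activations become too large, so gradients blow up
3. **Symmetry problem**: if all weights in a layer start with the same value, every neuron in that layer learns the same thing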
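
The symmetry problem can be checked numerically. The sketch below is a minimal illustration, not part of the notebook's own code: it assumes a tiny tanh network with a mean-squared-error loss, made-up shapes, and a constant starting value of 0.5 for every weight.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))      # 8 samples, 4 features (illustrative sizes)
y = rng.standard_normal((8, 1))      # regression targets

# Every weight starts at the same constant, so all hidden units are identical.
W1 = np.full((4, 3), 0.5)
W2 = np.full((3, 1), 0.5)

# Forward pass: tanh hidden layer, linear output, mean squared error.
h = np.tanh(X @ W1)
pred = h @ W2
d_pred = 2 * (pred - y) / len(y)

# Backward pass for the first-layer weights.
d_h = d_pred @ W2.T
dW1 = X.T @ (d_h * (1 - h ** 2))

# Each column of dW1 belongs to one hidden unit. The columns are equal,
# so a gradient step keeps the units identical to each other.
columns_identical = bool(
    np.allclose(dW1[:, 0], dW1[:, 1]) and np.allclose(dW1[:, 1], dW1[:, 2])
)
print(columns_identical)
```

Because every hidden unit receives the same gradient, the units stay copies of each other after any number of updates. Random initialization breaks this tie.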

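### Common Initialization Methods

1. **Zero initialization**: $W = 0$ (bad: causes the symmetry problem)
2. **Random initialization**: $W \sim \mathcal{N}(0, \sigma^2)$ (requires choosing a suitable variance $\sigma^2$)
3. **Xavier initialization**: $W \sim \mathcal{N}(0, \frac{2}{n_{in} + n_{out}})$ (suited to tanh/sigmoid)
4. **He initialization**: $W \sim \mathcal{N}(0, \frac{2}{n_{in}})$ (suited to ReLU)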
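
As a worked example of the formulas above, the sketch below computes the standard deviations for a 256-by-256 layer and checks empirically that Xavier scaling keeps the output variance of a linear layer near 1 for unit-variance inputs. The helper name `init_std` and the sample sizes are illustrative assumptions.

```python
import numpy as np

def init_std(method, n_in, n_out):
    """Standard deviation of the Gaussian initializers listed above."""
    if method == "xavier":
        return np.sqrt(2.0 / (n_in + n_out))
    if method == "he":
        return np.sqrt(2.0 / n_in)
    raise ValueError(f"unknown method: {method}")

n_in, n_out = 256, 256
xavier_std = init_std("xavier", n_in, n_out)   # sqrt(2 / 512) = 0.0625
he_std = init_std("he", n_in, n_out)           # sqrt(2 / 256), about 0.0884

# For unit-variance inputs, the output variance of x @ W is about n_in * std**2,
# which is 256 * 0.0625**2 = 1.0 under Xavier scaling.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, n_in))
W = rng.standard_normal((n_in, n_out)) * xavier_std
out_var = np.var(x @ W)

print(xavier_std, he_std, out_var)
```

He initialization uses a variance twice as large when $n_{in} = n_{out}$, which compensates for ReLU zeroing out roughly half of the pre-activations.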

In [None]:
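def analyze_activation_distribution(init_method, activation_fn, num_layers=10, hidden_dim=256):
    """
    Analyze how activations are distributed across the layers of a deep
    network under different initialization schemes.
    
    Parameters
    ----------
    init_method : str
        Initialization method: 'zero', 'small', 'normal', 'xavier', 'he'
    activation_fn : str
        Activation function: 'tanh', 'relu'
    num_layers : int
        Number of stacked layers
    hidden_dim : int
        Width of every layer
    """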
    x = np.random.randn(32, hidden_dim)
    activations = [x]
    
    for layer_idx in range(num_layers):
        # Initialize the weights
        if init_method == 'zero':
            W = np.zeros((hidden_dim, hidden_dim))
        elif init_method == 'small':
            W = np.random.randn(hidden_dim, hidden_dim) * 0.01
        elif init_method == 'normal':
            W = np.random.randn(hidden_dim, hidden_dim) * 1.0
        elif init_method == 'xavier':
            std = np.sqrt(2.0 / (hidden_dim + hidden_dim))
            W = np.random.randn(hidden_dim, hidden_dim) * std
        elif init_method == 'he':
            std = np.sqrt(2.0 / hidden_dim)
            W = np.random.randn(hidden_dim, hidden_dim) * std
        else:
            raise ValueError(f"Unknown init_method: {init_method}")
        
        # Linear transform
        x = x @ W
        
        # Activation function
        if activation_fn == 'tanh':
            x = np.tanh(x)
        elif activation_fn == 'relu':
            x = np.maximum(0, x)
        else:
            raise ValueError(f"Unknown activation_fn: {activation_fn}")
        
        activations.append(x)
    
    return activations

# Compare the initialization methods
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

init_methods = ['small', 'normal', 'xavier', 'he']
activation_fns = ['tanh', 'relu']

for i, act_fn in enumerate(activation_fns):
    for j, init_method in enumerate(init_methods):
        ax = axes[i, j]
        
        activations = analyze_activation_distribution(init_method, act_fn)
        
        # Plot the mean and standard deviation of the activations at each layer
        means = [np.mean(a) for a in activations]
        stds = [np.std(a) for a in activations]
        
        layers = list(range(len(means)))
        ax.fill_between(layers, 
                        [m - s for m, s in zip(means, stds)],
                        [m + s for m, s in zip(means, stds)],
                        alpha=0.3)
        ax.plot(layers, means, 'o-', label='mean')
        ax.plot(layers, stds, 's--', label='std')
        
        ax.set_xlabel('Layer')
        ax.set_ylabel('Value')
        ax.set_title(f'{init_method.capitalize()} Init + {act_fn.upper()}')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

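print("\nObservations:")
print("- small + tanh/relu: activations quickly shrink toward zero (vanishing gradients)")
print("- normal + tanh: activations saturate at ±1 (vanishing gradients)")
print("- normal + relu: activations explode")
print("- xavier + tanh: activations stay roughly stable")
print("- he + relu: activations stay roughly stable")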

## Part 2: Implementing Optimizers

### 2.1 Vanilla SGD

$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$$

### 2.2 Momentum SGD

$$v_{t+1} = \beta v_t + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_{t+1}$$

**Intuition**: like a ball rolling with momentum, the update is not easily thrown off by small fluctuations in the gradient.
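
To make the accumulation concrete, here is a small sketch of the velocity recurrence under an assumed constant gradient of 1.0. The constant gradient is purely illustrative, since real gradients change every step. With $\beta = 0.9$ the velocity follows a geometric series and approaches $\frac{1}{1 - \beta} = 10$, so the effective step size is roughly $\frac{\alpha}{1 - \beta}$.

```python
beta = 0.9
grad = 1.0          # constant gradient, assumed only for illustration
v = 0.0
history = []

for _ in range(200):
    v = beta * v + grad   # v_{t+1} = beta * v_t + grad
    history.append(v)

limit = grad / (1 - beta)   # limit of the geometric series
print(history[:3], history[-1], limit)
```

This is one reason a learning rate that works for vanilla SGD can be too large once momentum is switched on with this formulation.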

### 2.3 Adam (Adaptive Moment Estimation)

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

**Intuition**: combines the strengths of Momentum (a first-moment estimate of the gradient) and RMSProp (a second-moment estimate).
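
One consequence of the bias correction is easiest to see at the very first step. The sketch below is a scalar illustration and not the notebook's optimizer class; the function name `adam_first_step` is an assumption. It evaluates the equations above at $t = 1$ with $m_0 = v_0 = 0$, where $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the step size is close to $\alpha$ whatever the gradient's scale.

```python
import numpy as np

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1, starting from m = v = 0)."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** 1)   # equals grad
    v_hat = v / (1 - beta2 ** 1)   # equals grad ** 2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

small = adam_first_step(0.01)    # tiny gradient
large = adam_first_step(100.0)   # huge gradient
print(small, large)              # both are close to lr = 0.001
```

Without the two correction terms, the zero-initialized moving averages would be biased toward zero during the first iterations.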

In [None]:
class SGD:
    """
    Vanilla SGD optimizer
    """
    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate
    
    def update(self, params_and_grads):
        for param, grad in params_and_grads:
            # In-place update, so the caller's array is modified directly
            param -= self.lr * grad


class MomentumSGD:
    """
    Momentum SGD optimizer
    """
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocities = {}  # one velocity per parameter, keyed by list position
    
    def update(self, params_and_grads):
        for i, (param, grad) in enumerate(params_and_grads):
            if i not in self.velocities:
                self.velocities[i] = np.zeros_like(param)
            
            # v <- momentum * v + grad, then theta <- theta - lr * v
            self.velocities[i] = self.momentum * self.velocities[i] + grad
            param -= self.lr * self.velocities[i]


class Adam:
    """
    Adam optimizer
    
    Combines the strengths of Momentum and RMSProp
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        
        self.m = {}  # first-moment estimates
        self.v = {}  # second-moment estimates
        self.t = 0   # time step
    
    def update(self, params_and_grads):
        self.t += 1
        
        for i, (param, grad) in enumerate(params_and_grads):
            if i not in self.m:
                self.m[i] = np.zeros_like(param)
                self.v[i] = np.zeros_like(param)
            
            # Update the first-moment estimate
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
            
            # Update the second-moment estimate
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (grad ** 2)
            
            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            
            # Update the parameters
            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

print("Optimizers defined!")

In [None]:
# Visualize how the optimizers behave on a 2D function

def rosenbrock(x, y, a=1, b=100):
    """Rosenbrock function (the banana function)"""
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(x, y, a=1, b=100):
    """Gradient of the Rosenbrock function"""
    dx = -2 * (a - x) - 4 * b * x * (y - x**2)
    dy = 2 * b * (y - x**2)
    return np.array([dx, dy])

def optimize_rosenbrock(optimizer_class, optimizer_kwargs, start, max_iters=1000):
    """Minimize the Rosenbrock function with the given optimizer"""
    pos = np.array(start, dtype=float)
    trajectory = [pos.copy()]
    losses = [rosenbrock(pos[0], pos[1])]
    
    optimizer = optimizer_class(**optimizer_kwargs)
    
    for _ in range(max_iters):
        grad = rosenbrock_grad(pos[0], pos[1])
        
        # Mimic the params_and_grads format (pos is updated in place)
        optimizer.update([(pos, grad)])
        
        trajectory.append(pos.copy())
        losses.append(rosenbrock(pos[0], pos[1]))
        
        if losses[-1] < 1e-8:
            break
    
    return np.array(trajectory), np.array(losses)

# Compare the optimizers
start_point = [-1.0, 2.0]

optimizers = [
    ('SGD', SGD, {'learning_rate': 0.001}),
    ('Momentum', MomentumSGD, {'learning_rate': 0.001, 'momentum': 0.9}),
    ('Adam', Adam, {'learning_rate': 0.01}),
]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Contour plot
ax = axes[0]
x = np.linspace(-2, 2, 100)
y = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.5)
ax.contourf(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.3)

# Optimum
ax.plot(1, 1, 'r*', markersize=15, label='Optimum (1, 1)')

colors = ['blue', 'green', 'red']
for (name, opt_class, opt_kwargs), color in zip(optimizers, colors):
    traj, losses = optimize_rosenbrock(opt_class, opt_kwargs, start_point)
    ax.plot(traj[:, 0], traj[:, 1], 'o-', color=color, label=name, 
            markersize=2, alpha=0.7, linewidth=1)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimization Trajectories on Rosenbrock Function')
ax.legend()
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)

# Loss curves (re-runs each optimizer and plots the first 200 iterations)
ax = axes[1]
for (name, opt_class, opt_kwargs), color in zip(optimizers, colors):
    traj, losses = optimize_rosenbrock(opt_class, opt_kwargs, start_point)
    ax.plot(losses[:200], color=color, label=name)

ax.set_xlabel('Iteration')
ax.set_ylabel('Loss')
ax.set_title('Convergence Comparison')
ax.set_yscale('log')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

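## Part 3: Learning Rate Schedule

The choice of learning rate is critical to training:
- **Too large**: the loss oscillates and training may fail to converge
- **Too small**: convergence is too slow

A common strategy is to start with a relatively large learning rate and gradually reduce it as training progresses.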
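
As a rough illustration of the two failure modes, the sketch below runs plain gradient descent on the one-dimensional function $f(x) = x^2$. This function does not appear in the notebook and is chosen only because each step multiplies $x$ by $(1 - 2\alpha)$. The helper `run_gd` and the three learning rates are illustrative assumptions.

```python
def run_gd(lr, steps=20, x0=1.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

too_large = run_gd(1.1)     # factor (1 - 2.2) = -1.2 per step: oscillates and grows
too_small = run_gd(0.001)   # factor 0.998 per step: barely moves in 20 steps
reasonable = run_gd(0.4)    # factor 0.2 per step: converges quickly

print(too_large, too_small, reasonable)
```

On this toy problem any learning rate above 1.0 diverges, while very small rates converge but need far more steps.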

### Common Schedules

1. **Step Decay**: multiply the learning rate by a fixed factor every N epochs
2. **Exponential Decay**: $\alpha_t = \alpha_0 \cdot \gamma^t$
3. **Cosine Annealing**: $\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\frac{t}{T}\pi))$
4. **Warmup + Decay**: increase the learning rate first, then decrease it
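
The formulas above can be evaluated directly as plain functions. This is a minimal sketch with illustrative function names and default values (`step_size=30`, `gamma=0.1`, `gamma=0.95`), and no optimizer object is involved.

```python
import math

def step_decay(lr0, epoch, step_size=30, gamma=0.1):
    """Multiply the rate by gamma once every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def exponential_decay(lr0, epoch, gamma=0.95):
    """alpha_t = alpha_0 * gamma^t"""
    return lr0 * gamma ** epoch

def cosine_annealing(lr_max, epoch, total_epochs, lr_min=0.0):
    """alpha_t = alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + cos(pi * t / T))"""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

for epoch in (0, 50, 100):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, total_epochs=100))
```

Cosine annealing starts at $\alpha_{max}$, passes through the midpoint of $\alpha_{max}$ and $\alpha_{min}$ at $t = T/2$, and reaches $\alpha_{min}$ at $t = T$.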

In [None]:
class LRScheduler:
    """Base class for learning rate schedulers"""
    def __init__(self, optimizer, initial_lr):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.current_lr = initial_lr
    
    def step(self, epoch):
        raise NotImplementedError
    
    def get_lr(self):
        return self.current_lr


class StepLR(LRScheduler):
    """
    Step Decay: multiply the learning rate by gamma every step_size epochs
    """
    def __init__(self, optimizer, initial_lr, step_size=30, gamma=0.1):
        super().__init__(optimizer, initial_lr)
        self.step_size = step_size
        self.gamma = gamma
    
    def step(self, epoch):
        self.current_lr = self.initial_lr * (self.gamma ** (epoch // self.step_size))
        self.optimizer.lr = self.current_lr


class ExponentialLR(LRScheduler):
    """
    Exponential Decay: lr = initial_lr * gamma^epoch
    """
    def __init__(self, optimizer, initial_lr, gamma=0.95):
        super().__init__(optimizer, initial_lr)
        self.gamma = gamma
    
    def step(self, epoch):
        self.current_lr = self.initial_lr * (self.gamma ** epoch)
        self.optimizer.lr = self.current_lr


class CosineAnnealingLR(LRScheduler):
    """
    Cosine Annealing: the learning rate decays along a cosine curve
    """
    def __init__(self, optimizer, initial_lr, T_max, min_lr=0):
        super().__init__(optimizer, initial_lr)
        self.T_max = T_max
        self.min_lr = min_lr
    
    def step(self, epoch):
        self.current_lr = self.min_lr + 0.5 * (self.initial_lr - self.min_lr) * (
            1 + np.cos(np.pi * epoch / self.T_max)
        )
        self.optimizer.lr = self.current_lr


class WarmupCosineAnnealingLR(LRScheduler):
    """
    Warmup + Cosine Annealing
    Linearly increase the learning rate first, then decay it along a cosine curve
    """
    def __init__(self, optimizer, initial_lr, warmup_epochs, T_max, min_lr=0):
        super().__init__(optimizer, initial_lr)
        self.warmup_epochs = warmup_epochs
        self.T_max = T_max
        self.min_lr = min_lr
    
    def step(self, epoch):
        if epoch < self.warmup_epochs:
            # Warmup: linear ramp; (epoch + 1) keeps the rate nonzero at epoch 0
            self.current_lr = self.initial_lr * (epoch + 1) / self.warmup_epochs
        else:
            # Cosine annealing
            progress = (epoch - self.warmup_epochs) / (self.T_max - self.warmup_epochs)
            self.current_lr = self.min_lr + 0.5 * (self.initial_lr - self.min_lr) * (
                1 + np.cos(np.pi * progress)
            )
        self.optimizer.lr = self.current_lr

# Visualize the schedules
epochs = 100
initial_lr = 0.1

fig, ax = plt.subplots(figsize=(10, 5))

# Step LR
optimizer = SGD(learning_rate=initial_lr)
scheduler = StepLR(optimizer, initial_lr, step_size=30, gamma=0.1)
lrs = []
for epoch in range(epochs):
    scheduler.step(epoch)
    lrs.append(scheduler.get_lr())
ax.plot(lrs, label='Step (step=30, gamma=0.1)')

# Exponential LR
optimizer = SGD(learning_rate=initial_lr)
scheduler = ExponentialLR(optimizer, initial_lr, gamma=0.95)
lrs = []
for epoch in range(epochs):
    scheduler.step(epoch)
    lrs.append(scheduler.get_lr())
ax.plot(lrs, label='Exponential (gamma=0.95)')

# Cosine Annealing
optimizer = SGD(learning_rate=initial_lr)
scheduler = CosineAnnealingLR(optimizer, initial_lr, T_max=epochs)
lrs = []
for epoch in range(epochs):
    scheduler.step(epoch)
    lrs.append(scheduler.get_lr())
ax.plot(lrs, label='Cosine Annealing')

# Warmup + Cosine
optimizer = SGD(learning_rate=initial_lr)
scheduler = WarmupCosineAnnealingLR(optimizer, initial_lr, warmup_epochs=10, T_max=epochs)
lrs = []
for epoch in range(epochs):
    scheduler.step(epoch)
    lrs.append(scheduler.get_lr())
ax.plot(lrs, label='Warmup (10 epochs) + Cosine')

ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('Learning Rate Schedules')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 4: Experimental Comparison

Let's compare the optimization strategies on a small synthetic classification task.

In [None]:
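# A simplified network for the experiments
class SimpleNet:
    """A simple two-layer network"""
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Xavier initialization (He would be the usual choice for the ReLU layer)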
        std1 = np.sqrt(2.0 / (input_dim + hidden_dim))
        self.W1 = np.random.randn(input_dim, hidden_dim) * std1
        self.b1 = np.zeros(hidden_dim)
        
        std2 = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W2 = np.random.randn(hidden_dim, output_dim) * std2
        self.b2 = np.zeros(output_dim)
        
        # Gradients
        self.dW1 = None
        self.db1 = None
        self.dW2 = None
        self.db2 = None
        
        # Cache for the backward pass
        self.cache = None
    
    def forward(self, X):
        # Layer 1
        z1 = X @ self.W1 + self.b1
        a1 = np.maximum(0, z1)  # ReLU
        
        # Layer 2
        z2 = a1 @ self.W2 + self.b2
        
        self.cache = (X, z1, a1)
        return z2
    
    def loss(self, X, y):
        logits = self.forward(X)
        
        # Softmax + CE
        z_shifted = logits - np.max(logits, axis=1, keepdims=True)
        exp_z = np.exp(z_shifted)
        probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
        
        N = len(y)
        loss = -np.mean(np.log(probs[np.arange(N), y] + 1e-10))
        
        self.probs = probs
        self.y = y
        return loss
    
    def backward(self):
        X, z1, a1 = self.cache
        N = len(self.y)
        
        # Softmax + CE gradient
        dz2 = self.probs.copy()
        dz2[np.arange(N), self.y] -= 1
        dz2 /= N
        
        # Layer 2 gradients
        self.dW2 = a1.T @ dz2
        self.db2 = np.sum(dz2, axis=0)
        
        # Backprop through ReLU
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (z1 > 0)
        
        # Layer 1 gradients
        self.dW1 = X.T @ dz1
        self.db1 = np.sum(dz1, axis=0)
    
    def get_params_and_grads(self):
        return [
            (self.W1, self.dW1),
            (self.b1, self.db1),
            (self.W2, self.dW2),
            (self.b2, self.db2),
        ]
    
    def predict(self, X):
        logits = self.forward(X)
        return np.argmax(logits, axis=1)

# Generate the data
np.random.seed(42)
n_samples = 1000
n_classes = 5
n_features = 20

# Generate classification data: labels come from a random linear model
X_data = np.random.randn(n_samples, n_features)
true_W = np.random.randn(n_features, n_classes)
logits = X_data @ true_W
y_data = np.argmax(logits, axis=1)

# Train/validation split
X_train, X_val = X_data[:800], X_data[800:]
y_train, y_val = y_data[:800], y_data[800:]

print(f"Training data: {X_train.shape}")
print(f"Validation data: {X_val.shape}")

In [None]:
def train_with_optimizer(optimizer_class, optimizer_kwargs, epochs=100, use_scheduler=False):
    """Train with the specified optimizer"""
    np.random.seed(42)
    net = SimpleNet(n_features, 64, n_classes)
    optimizer = optimizer_class(**optimizer_kwargs)
    
    if use_scheduler:
        scheduler = CosineAnnealingLR(optimizer, optimizer_kwargs.get('learning_rate', 0.01), T_max=epochs)
    
    history = {'loss': [], 'acc': [], 'lr': []}
    batch_size = 32
    
    for epoch in range(epochs):
        if use_scheduler:
            scheduler.step(epoch)
        
        # Train for one epoch
        perm = np.random.permutation(len(y_train))
        epoch_loss = 0
        n_batches = 0
        
        for i in range(0, len(y_train), batch_size):
            idx = perm[i:i+batch_size]
            X_batch = X_train[idx]
            y_batch = y_train[idx]
            
            loss = net.loss(X_batch, y_batch)
            epoch_loss += loss
            n_batches += 1
            
            net.backward()
            optimizer.update(net.get_params_and_grads())
        
        # Record metrics
        val_pred = net.predict(X_val)
        val_acc = np.mean(val_pred == y_val)
        
        history['loss'].append(epoch_loss / n_batches)
        history['acc'].append(val_acc)
        history['lr'].append(optimizer.lr)
    
    return history

# Compare the optimizers
results = {}

print("Training...")

# SGD
results['SGD'] = train_with_optimizer(SGD, {'learning_rate': 0.1})
print(f"SGD: final Acc = {results['SGD']['acc'][-1]:.4f}")

# Momentum SGD
results['Momentum'] = train_with_optimizer(MomentumSGD, {'learning_rate': 0.1, 'momentum': 0.9})
print(f"Momentum: final Acc = {results['Momentum']['acc'][-1]:.4f}")

# Adam
results['Adam'] = train_with_optimizer(Adam, {'learning_rate': 0.01})
print(f"Adam: final Acc = {results['Adam']['acc'][-1]:.4f}")

# SGD + Cosine LR
results['SGD + Cosine'] = train_with_optimizer(SGD, {'learning_rate': 0.1}, use_scheduler=True)
print(f"SGD + Cosine: final Acc = {results['SGD + Cosine']['acc'][-1]:.4f}")

In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss
ax = axes[0]
for name, hist in results.items():
    ax.plot(hist['loss'], label=name)
ax.set_xlabel('Epoch')
ax.set_ylabel('Training Loss')
ax.set_title('Loss Comparison')
ax.legend()
ax.grid(True, alpha=0.3)

# Accuracy
ax = axes[1]
for name, hist in results.items():
    ax.plot(hist['acc'], label=name)
ax.set_xlabel('Epoch')
ax.set_ylabel('Validation Accuracy')
ax.set_title('Accuracy Comparison')
ax.set_ylim(0, 1)
ax.legend()
ax.grid(True, alpha=0.3)

# Learning Rate
ax = axes[2]
for name, hist in results.items():
    ax.plot(hist['lr'], label=name)
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('Learning Rate')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Exercises

### Exercise 1: Implement RMSProp

RMSProp scales each parameter's step size by an exponential moving average of its squared gradients:

$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon} g_t$$

In [None]:
class RMSProp:
    """
    RMSProp optimizer
    """
    def __init__(self, learning_rate=0.01, beta=0.99, epsilon=1e-8):
        self.lr = learning_rate
        self.beta = beta
        self.epsilon = epsilon
        self.v = {}  # second-moment estimates
    
    def update(self, params_and_grads):
        for i, (param, grad) in enumerate(params_and_grads):
            # Solution:
            if i not in self.v:
                self.v[i] = np.zeros_like(param)
            
            # Update the second-moment estimate
            self.v[i] = self.beta * self.v[i] + (1 - self.beta) * (grad ** 2)
            
            # Update the parameters
            param -= self.lr * grad / (np.sqrt(self.v[i]) + self.epsilon)

# Test RMSProp
results['RMSProp'] = train_with_optimizer(RMSProp, {'learning_rate': 0.01})
print(f"RMSProp: final Acc = {results['RMSProp']['acc'][-1]:.4f}")

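### Exercise 2: Implement Gradient Clipping

Gradient clipping guards against exploding gradients by rescaling any gradient whose norm exceeds a threshold: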

$$g' = \frac{g}{\max(1, \frac{\|g\|}{\text{threshold}})}$$
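
A quick numeric check of the formula on a single vector, before moving on to the list-of-tensors version in the exercise. The function name `clip_by_norm` and the test vectors are illustrative assumptions.

```python
import numpy as np

def clip_by_norm(g, threshold):
    """Apply g' = g / max(1, ||g|| / threshold) to one gradient vector."""
    return g / max(1.0, np.linalg.norm(g) / threshold)

big = clip_by_norm(np.array([3.0, 4.0]), threshold=1.0)     # norm 5, rescaled to norm 1
small = clip_by_norm(np.array([0.3, 0.4]), threshold=1.0)   # norm 0.5, left unchanged

print(big, small)
```

The direction of the gradient is preserved in both cases, and only its length is capped at the threshold.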

In [None]:
def clip_gradients(params_and_grads, max_norm=1.0):
    """
    Clip gradients by their global norm
    
    Parameters
    ----------
    params_and_grads : list of (param, grad)
    max_norm : float
        Maximum allowed gradient norm
    
    Returns
    -------
    clipped : list of (param, clipped_grad)
    """
    # Solution:
    # Compute the global norm over all gradients
    total_norm = 0
    for param, grad in params_and_grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)
    
    # Compute the scaling factor
    clip_coef = max_norm / (total_norm + 1e-6)
    
    if clip_coef < 1:
        # Clipping is needed
        clipped = []
        for param, grad in params_and_grads:
            clipped.append((param, grad * clip_coef))
        return clipped
    else:
        return params_and_grads

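# Test
test_grads = [
    (np.zeros((3, 3)), np.random.randn(3, 3) * 10),
    (np.zeros((3,)), np.random.randn(3) * 10),
]

print("Gradient norms before clipping:")
for _, grad in test_grads:
    print(f"  {np.linalg.norm(grad):.4f}")

clipped = clip_gradients(test_grads, max_norm=1.0)

print("\nGradient norms after clipping:")
for _, grad in clipped:
    print(f"  {np.linalg.norm(grad):.4f}")

# The clip acts on the global norm, so it is the combined norm
# (not each tensor's own norm) that ends up at about max_norm
total = np.sqrt(sum(np.sum(grad ** 2) for _, grad in clipped))
print(f"Global norm after clipping: {total:.4f}")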

### Exercise 3: Weight Decay (Weight Decay vs. L2 Regularization)
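
For plain SGD, adding $\lambda w$ to the gradient and shrinking the weights directly give the same update. Under Adam they differ, because an L2 term added to the gradient is also divided by $\sqrt{\hat{v}_t}$. The sketch below compares the first update step of both variants. It is an illustration under stated assumptions: the function names and numbers are made up, and the decay is computed from the pre-update weights.

```python
import numpy as np

def adam_step_l2(w, grad, lr=0.01, wd=0.1, eps=1e-8):
    """First Adam step with L2 regularization: the penalty joins the gradient."""
    g = grad + wd * w
    m_hat = g            # bias-corrected first moment at t = 1
    v_hat = g ** 2       # bias-corrected second moment at t = 1
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def adam_step_decoupled(w, grad, lr=0.01, wd=0.1, eps=1e-8):
    """First Adam step with decoupled decay: the decay skips the normalization."""
    m_hat = grad
    v_hat = grad ** 2
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w

w = np.array([10.0, 0.1])       # one large weight, one small weight
grad = np.array([1.0, 1.0])     # same loss gradient for both

after_l2 = adam_step_l2(w, grad)
after_decoupled = adam_step_decoupled(w, grad)

print(w - after_l2)         # both weights move by about lr
print(w - after_decoupled)  # the large weight is shrunk noticeably more
```

With the L2 variant, the normalization cancels the extra pull on the large weight. With decoupled decay, the shrinkage stays proportional to the weight itself.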

In [None]:
class AdamW:
    """
    AdamW: Adam with decoupled weight decay
    
    How it differs from L2 regularization:
    - L2 regularization: adds λ*w to the gradient
    - Weight decay: subtracts lr*λ*w directly in the update step
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, 
                 epsilon=1e-8, weight_decay=0.01):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.weight_decay = weight_decay
        
        self.m = {}
        self.v = {}
        self.t = 0
    
    def update(self, params_and_grads):
        self.t += 1
        
        for i, (param, grad) in enumerate(params_and_grads):
            if i not in self.m:
                self.m[i] = np.zeros_like(param)
                self.v[i] = np.zeros_like(param)
            
            # Adam update
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (grad ** 2)
            
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            
            # Solution: combine the Adam update with weight decay
            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
            param -= self.lr * self.weight_decay * param  # decoupled weight decay

# Test AdamW
results['AdamW'] = train_with_optimizer(AdamW, {'learning_rate': 0.01, 'weight_decay': 0.01})
print(f"AdamW: final Acc = {results['AdamW']['acc'][-1]:.4f}")

In [None]:
# Final comparison
print("\n=== Optimizer Comparison Summary ===")
print(f"{'Optimizer':<15} {'Final Loss':<12} {'Final Acc':<12}")
print("-" * 40)
for name in ['SGD', 'Momentum', 'RMSProp', 'Adam', 'AdamW', 'SGD + Cosine']:
    if name in results:
        print(f"{name:<15} {results[name]['loss'][-1]:<12.4f} {results[name]['acc'][-1]:<12.4f}")

## Summary

In this notebook, we covered:

### Weight Initialization

| Method | Formula | Use case |
|--------|---------|----------|
| Xavier | $\sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$ | Sigmoid, Tanh |
| He | $\sigma = \sqrt{\frac{2}{n_{in}}}$ | ReLU |

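### Optimizers

| Optimizer | Characteristics | Hyperparameters |
|-----------|-----------------|-----------------|
| SGD | The most basic | lr |
| Momentum | Accumulates momentum, speeds up convergence | lr, β |
| RMSProp | Adaptive learning rate | lr, β |
| Adam | Combines Momentum and RMSProp | lr, β1, β2 |
| AdamW | Adam + Decoupled Weight Decay | lr, β1, β2, wd |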

### Learning Rate Schedule

| Method | Characteristics |
|--------|-----------------|
| Step Decay | Reduced at fixed epoch intervals |
| Exponential | Exponential decay |
| Cosine Annealing | Smooth cosine decay |
| Warmup | Increases first, then decreases |

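### Practical Advice

1. **Beginners**: start with Adam at lr=0.001
2. **Chasing the best performance**: SGD + Momentum + Cosine Annealing is often a strong choice, though it usually needs more tuning
3. **Need regularization**: AdamW
4. **Preventing exploding gradients**: use Gradient Clipping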

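### Module 5 Complete!

So far, we have implemented from scratch:
1. Backpropagation basics
2. Fully connected layers
3. Activation functions
4. Convolutional layers
5. Pooling layers
6. A complete LeNet network
7. Various optimization techniques

The upcoming Module 6 covers more advanced architectures and components such as BatchNorm, ResNet, and U-Net!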