# Module 2: Multilayer Perceptrons (MLP) and Training Techniques

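## Learning Objectives

1. Understand the structure of the perceptron and the multilayer perceptron (MLP)
2. Know the common activation functions and their properties
3. Use different optimizers (SGD, Adam)
4. Understand overfitting and regularization techniques (Dropout, BatchNorm, L2)
5. Hands-on: classify MNIST handwritten digits with an MLP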

---

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm

# Setup: use the GPU when available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Fix the random seed for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

plt.rcParams['figure.figsize'] = (10, 4)

---

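## Part 1: From the Perceptron to the Multilayer Perceptron

### 1.1 Perceptron

**Structure:** the simplest model of a neuron

$$y = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$$

1. Take the inner product of the input $\mathbf{x}$ and the weights $\mathbf{w}$
2. Add the bias $b$
3. Pass the result through the activation function $\sigma$

**Limitation:** a single-layer perceptron can only solve **linearly separable** problems (for example, it cannot learn XOR).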
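
The perceptron formula can be sketched in plain Python. The weights `(1.0, 1.0)` and bias `-1.5` below are hand-picked for illustration (not trained), and a step function stands in for $\sigma$. The coarse grid search is only a sanity check, not a proof, that no single perceptron reproduces XOR.

```python
from itertools import product

def perceptron(x, w, b):
    """Perceptron with a step activation: outputs 1 if w.x + b > 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked parameters for AND: the decision boundary is x1 + x2 = 1.5.
w_and, b_and = (1.0, 1.0), -1.5
and_outputs = [perceptron(x, w_and, b_and) for x in inputs]
print("AND:", and_outputs)

# Brute-force search on a coarse grid: no (w1, w2, b) reproduces XOR.
grid = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
xor_targets = [0, 1, 1, 0]
xor_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if [perceptron(x, (w1, w2), b) for x in inputs] == xor_targets
]
print("XOR solutions found:", len(xor_solutions))
```

The search comes back empty because XOR would require $b \le 0$, $w_1 + b > 0$, $w_2 + b > 0$ and $w_1 + w_2 + b \le 0$ at the same time, which is contradictory.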

In [None]:
# Visualization: linearly separable vs. not linearly separable

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# AND (linearly separable)
ax = axes[0]
ax.scatter([0, 0, 1], [0, 1, 0], c='red', s=100, label='0')
ax.scatter([1], [1], c='blue', s=100, label='1')
ax.plot([0, 1.5], [1.5, 0], 'g--', linewidth=2)  # separating line x + y = 1.5
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_title('AND (Linearly Separable)')
ax.legend()
ax.grid(True)

# OR (linearly separable)
ax = axes[1]
ax.scatter([0], [0], c='red', s=100, label='0')
ax.scatter([0, 1, 1], [1, 0, 1], c='blue', s=100, label='1')
ax.plot([-0.5, 1], [0.5, -0.5], 'g--', linewidth=2)
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_title('OR (Linearly Separable)')
ax.legend()
ax.grid(True)

# XOR (not linearly separable)
ax = axes[2]
ax.scatter([0, 1], [0, 1], c='red', s=100, label='0')
ax.scatter([0, 1], [1, 0], c='blue', s=100, label='1')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_title('XOR (NOT Linearly Separable!)')
ax.legend()
ax.grid(True)
ax.text(0.5, -0.3, 'No single line can separate!', ha='center', fontsize=10, color='red')

plt.tight_layout()
plt.show()

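### 1.2 Multilayer Perceptron (MLP / Feedforward Neural Network)

**Solution:** stack several layers of neurons and insert a **non-linear activation function** between the layers

```
Input → [Linear + Activation] → [Linear + Activation] → ... → Output
         \____Hidden Layer____/   \____Hidden Layer____/
```

**Why is a non-linear activation needed?**
- Without activation functions, a stack of linear layers collapses into a single linear layer, because a composition of affine maps is itself affine
- Non-linear activations let the network learn complex non-linear relationships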
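
The collapse of stacked linear layers can be checked numerically. The sketch below uses arbitrary illustrative weights for a 2 → 2 → 1 stack (assumed values, not taken from any trained model) and compares the stack against the single collapsed layer $W = W_2 W_1$, $b = W_2 b_1 + b_2$.

```python
def linear(W, b, x):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, vi) for vi in v]

# Two "layers" with arbitrary illustrative weights (2 -> 2 -> 1).
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.5, -1.0]
W2, b2 = [[2.0, -1.0]], [0.25]

# Collapse the stack into one layer: W = W2 W1, b = W2 b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)] for i in range(1)]
b = [sum(W2[i][k] * b1[k] for k in range(2)) + b2[i] for i in range(1)]

x = [1.0, 2.0]
stacked = linear(W2, b2, linear(W1, b1, x))[0]           # two linear layers
collapsed = linear(W, b, x)[0]                            # one equivalent layer
with_relu = linear(W2, b2, relu(linear(W1, b1, x)))[0]    # ReLU in between
print(stacked, collapsed, with_relu)
```

The first two values agree, while the ReLU version differs: once a non-linearity sits between the layers, no single affine map reproduces the stack in general.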

In [None]:
# Solve XOR with an MLP

# XOR data
X_xor = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_xor = torch.tensor([[0.], [1.], [1.], [0.]])

# Define the MLP
class XOR_MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)   # 2 -> 4
        self.layer2 = nn.Linear(4, 1)   # 4 -> 1
    
    def forward(self, x):
        x = torch.relu(self.layer1(x))  # non-linear activation
        x = torch.sigmoid(self.layer2(x))  # squash the output into (0, 1)
        return x

# Train
model = XOR_MLP()
criterion = nn.BCELoss()  # Binary Cross Entropy
optimizer = optim.Adam(model.parameters(), lr=0.1)

losses = []
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_xor)
    loss = criterion(output, y_xor)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Results
print("XOR results:")
print(f"Input: {X_xor.tolist()}")
print(f"Target: {y_xor.squeeze().tolist()}")
print(f"Prediction: {model(X_xor).squeeze().detach().round().tolist()}")

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('XOR Learning Curve')
plt.show()

---

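## Part 2: Activation Functions

### 2.1 Why Do We Need Activation Functions?

1. **Introduce non-linearity**: lets the network learn complex functions
2. **Control the output range**: for example, sigmoid outputs values in (0, 1), which suits probabilities

### 2.2 Comparison of Common Activation Functions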

In [None]:
# Visualize the activation functions

x = torch.linspace(-5, 5, 200)

activations = {
    'Sigmoid': torch.sigmoid(x),
    'Tanh': torch.tanh(x),
    'ReLU': torch.relu(x),
    'LeakyReLU': F.leaky_relu(x, 0.1),
    'GELU': F.gelu(x),
    'SiLU/Swish': F.silu(x),
}

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for i, (name, y) in enumerate(activations.items()):
    ax = axes[i]
    ax.plot(x.numpy(), y.numpy(), linewidth=2)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_title(name, fontsize=14)
    ax.set_xlim(-5, 5)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

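### 2.3 Properties of Each Activation Function

| Activation | Formula | Output range | Pros | Cons |
|---------|------|---------|------|------|
| **Sigmoid** | $\frac{1}{1+e^{-x}}$ | (0, 1) | Output can be read as a probability | Vanishing gradients, not zero-centered |
| **Tanh** | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | (-1, 1) | Zero-centered | Vanishing gradients |
| **ReLU** | $\max(0, x)$ | [0, ∞) | Cheap to compute, mitigates vanishing gradients | Dead ReLU problem |
| **LeakyReLU** | $\max(\alpha x, x)$ with $0 < \alpha < 1$ (PyTorch default 0.01; the plot above uses 0.1) | (-∞, ∞) | Avoids dead ReLUs | One extra hyperparameter |
| **GELU** | $x \cdot \Phi(x)$ | ≈[-0.17, ∞) | Common in Transformers | Slower to compute |
| **SiLU/Swish** | $x \cdot \sigma(x)$ | ≈[-0.28, ∞) | Smooth, often works well | Slower to compute |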
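
The approximate lower bounds quoted for GELU and SiLU in the table can be checked with a simple grid search. This is a minimal sketch using only the standard library; the grid range and spacing are arbitrary choices.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU/Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

xs = [i / 1000.0 for i in range(-5000, 5001)]  # grid on [-5, 5]
gelu_min = min(gelu(x) for x in xs)
silu_min = min(silu(x) for x in xs)
print(f"GELU minimum: {gelu_min:.3f}")
print(f"SiLU minimum: {silu_min:.3f}")
```

Both functions dip slightly below zero for moderately negative inputs and return towards zero as $x \to -\infty$, unlike ReLU, which is exactly zero there.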

In [None]:
# Demonstration of the vanishing-gradient problem

# Derivative of sigmoid
def sigmoid_derivative(x):
    s = torch.sigmoid(x)
    return s * (1 - s)

# Derivative of ReLU
def relu_derivative(x):
    return (x > 0).float()

x = torch.linspace(-5, 5, 200)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Sigmoid and its derivative
ax = axes[0]
ax.plot(x, torch.sigmoid(x), label='Sigmoid', linewidth=2)
ax.plot(x, sigmoid_derivative(x), label='Sigmoid Derivative', linewidth=2)
ax.axhline(y=0.25, color='r', linestyle='--', alpha=0.5)
ax.set_title('Sigmoid: Max derivative = 0.25')
ax.legend()
ax.grid(True)

# ReLU and its derivative
ax = axes[1]
ax.plot(x, torch.relu(x), label='ReLU', linewidth=2)
ax.plot(x, relu_derivative(x), label='ReLU Derivative', linewidth=2)
ax.set_title('ReLU: Derivative is 0 or 1')
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.show()

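print("Vanishing gradients:")
print("- The derivative of sigmoid is at most 0.25")
print("- Across 10 layers the factor is at most 0.25^10 ≈ 9.5e-7, so the gradient all but vanishes")
print("- The derivative of ReLU is 0 or 1, so active units do not shrink the gradient")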

---

## Part 3: Loss Functions

### 3.1 Regression: MSE Loss

$$L = \frac{1}{n}\sum_i(y_i - \hat{y}_i)^2$$

In [None]:
# MSE loss example
mse_loss = nn.MSELoss()

y_pred = torch.tensor([2.5, 0.0, 2.1, 1.8])
y_true = torch.tensor([3.0, -0.5, 2.0, 2.0])

loss = mse_loss(y_pred, y_true)
print(f"MSE Loss: {loss.item():.4f}")

# Verify by hand
manual = ((y_pred - y_true) ** 2).mean()
print(f"By hand: {manual.item():.4f}")

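### 3.2 Classification: Cross-Entropy Loss

**Binary classification:** `nn.BCELoss` (expects probabilities) or `nn.BCEWithLogitsLoss` (expects logits and applies the sigmoid internally)

**Multi-class classification:** `nn.CrossEntropyLoss` (applies log-softmax internally, so it takes raw logits)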
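
To make "applies log-softmax internally" concrete, here is a framework-free sketch of the per-sample cross-entropy. The helper names `softmax` and `cross_entropy` are illustrative, not PyTorch functions. It computes the loss twice: once as $-\log \mathrm{softmax}(z)_t$ and once in the numerically safer log-sum-exp form.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """-log softmax(logits)[target], written as logsumexp(logits) - logits[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 1.0, 0.1, 0.5]   # one sample, four classes
target = 0                       # index of the true class
loss_direct = -math.log(softmax(logits)[target])
loss_lse = cross_entropy(logits, target)
print(loss_direct, loss_lse)
```

Because the loss works on logits directly, passing already-softmaxed probabilities to `nn.CrossEntropyLoss` would apply softmax twice and distort the loss.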

In [None]:
# Cross-entropy loss example

# Note: CrossEntropyLoss takes logits (before softmax), not probabilities!
ce_loss = nn.CrossEntropyLoss()

# 3 samples, 4 classes
logits = torch.tensor([[2.0, 1.0, 0.1, 0.5],   # predicts class 0
                       [0.5, 2.5, 0.3, 0.2],   # predicts class 1
                       [0.1, 0.2, 0.3, 3.0]])  # predicts class 3

targets = torch.tensor([0, 1, 3])  # ground-truth labels

loss = ce_loss(logits, targets)
print(f"Cross-Entropy Loss: {loss.item():.4f}")

# Inspect the probabilities after softmax
probs = torch.softmax(logits, dim=1)
print(f"\nSoftmax probabilities:")
for i in range(3):
    print(f"  Sample {i}: {probs[i].tolist()} -> pred={probs[i].argmax()}, true={targets[i]}")

---

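## Part 4: Optimizers

### 4.1 SGD (Stochastic Gradient Descent)

The most basic optimizer: $\theta_{t+1} = \theta_t - \eta \nabla L$

**With momentum:** gives the update "inertia", which speeds up convergence ($\eta$ is the learning rate, $\gamma$ the momentum coefficient)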

$$v_t = \gamma v_{t-1} + \eta \nabla L$$
$$\theta_{t+1} = \theta_t - v_t$$
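
The two update rules above can be traced by hand on $f(x, y) = x^2 + 10y^2$. This is a minimal sketch in plain Python that follows the formulas exactly as written; the learning rate, momentum coefficient, step count and start point are illustrative choices. Note that `torch.optim.SGD` keeps the learning rate outside the velocity ($v_t = \mu v_{t-1} + \nabla L$, $\theta_{t+1} = \theta_t - \eta v_t$), which gives the same trajectory for a constant learning rate.

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2 + 10*y^2."""
    return 2.0 * x, 20.0 * y

def run(lr, gamma, steps=100, start=(5.0, 5.0)):
    """Run gradient descent with momentum and return the final loss."""
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = gamma * vx + lr * gx    # v_t = gamma * v_{t-1} + eta * grad
        vy = gamma * vy + lr * gy
        x, y = x - vx, y - vy        # theta_{t+1} = theta_t - v_t
    return x * x + 10.0 * y * y

loss_sgd = run(lr=0.01, gamma=0.0)   # gamma = 0 reduces to plain SGD
loss_mom = run(lr=0.01, gamma=0.9)
print(f"plain SGD: {loss_sgd:.4f}, with momentum: {loss_mom:.6f}")
```

With these settings plain SGD is still crawling along the shallow $x$ direction after 100 steps, while the momentum run ends much closer to the optimum.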

In [None]:
# Compare different optimizers

def train_with_optimizer(optimizer_fn, num_epochs=100):
    """Minimize a simple function with the given optimizer."""
    torch.manual_seed(42)
    
    # A simple quadratic optimization problem:
    # minimize f(x, y) = x^2 + 10*y^2
    x = torch.tensor([5.0], requires_grad=True)
    y = torch.tensor([5.0], requires_grad=True)
    
    optimizer = optimizer_fn([x, y])
    
    history = []
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = x**2 + 10 * y**2
        loss.backward()
        optimizer.step()
        history.append((x.item(), y.item(), loss.item()))
    
    return history

# Optimizers to compare
optimizers = {
    'SGD (lr=0.01)': lambda p: optim.SGD(p, lr=0.01),
    'SGD + Momentum': lambda p: optim.SGD(p, lr=0.01, momentum=0.9),
    'Adam': lambda p: optim.Adam(p, lr=0.1),
}

results = {}
for name, opt_fn in optimizers.items():
    results[name] = train_with_optimizer(opt_fn, num_epochs=100)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
ax = axes[0]
for name, history in results.items():
    losses = [h[2] for h in history]
    ax.plot(losses, label=name, linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Loss Curves')
ax.legend()
ax.set_yscale('log')
ax.grid(True)

# Optimization paths
ax = axes[1]
for name, history in results.items():
    xs = [h[0] for h in history]
    ys = [h[1] for h in history]
    ax.plot(xs, ys, 'o-', label=name, markersize=3, alpha=0.7)
ax.scatter([0], [0], c='red', s=100, zorder=5, label='Optimum')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimization Paths')
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.show()

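### 4.2 Adam (Adaptive Moment Estimation)

**Key points:**
- Combines momentum with RMSprop
- Adapts the step size of each parameter automatically
- Usually a sensible default choice

**Hyperparameters:**
- `lr`: learning rate, default 0.001
- `betas`: (β1, β2), default (0.9, 0.999)
- `weight_decay`: L2 penalty, default 0 (for decoupled weight decay, use `AdamW`)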
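
How these hyperparameters enter the update can be sketched for a single scalar parameter. The function below follows the standard Adam equations with bias correction; `adam_step` and `first_step_size` are hypothetical helper names, and the quadratic loss $\theta^2$ is chosen only for illustration.

```python
import math

def adam_step(theta, grad, state, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update for a single scalar parameter."""
    beta1, beta2 = betas
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # momentum-like first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # RMSprop-like second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

def first_step_size(grad_scale):
    """Size of the first update when the gradient of theta^2 is multiplied by grad_scale."""
    theta = 5.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    new_theta = adam_step(theta, grad_scale * 2.0 * theta, state)
    return theta - new_theta

step_small = first_step_size(1.0)
step_large = first_step_size(1000.0)
print(step_small, step_large)
```

Both first steps come out at roughly `lr`, even though one gradient is 1000 times larger: dividing by $\sqrt{\hat v}$ normalizes the gradient scale, which is what "adapts the step size of each parameter" means in practice.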

In [None]:
# Common ways to use Adam
model = nn.Linear(10, 2)

# Basic usage
optimizer = optim.Adam(model.parameters(), lr=0.001)

# With weight decay (L2 regularization)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# AdamW (decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

print("Adam optimizer created successfully")

---

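## Part 5: Overfitting and Regularization

### 5.1 What Is Overfitting?

**Symptom:** the model performs well on the training data but poorly on the test data.

**Cause:** the model has "memorized" the noise in the training data instead of learning the underlying pattern.

**Remedies:**
1. More data
2. A simpler model
3. Regularization (Dropout, L2, etc.)
4. Early stopping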

In [None]:
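# Overfitting demonstration

# Generate simple noisy data
torch.manual_seed(42)
n_samples = 50

X = torch.linspace(-3, 3, n_samples).unsqueeze(1)
y_true = torch.sin(X)  # the true function is sin
y = y_true + torch.randn_like(y_true) * 0.3  # add noise

# Split into train and test sets. X is sorted, so shuffle the indices first;
# slicing X[:35] / X[35:] directly would test only on roughly x > 1.2 (extrapolation).
perm = torch.randperm(n_samples)
train_idx, test_idx = perm[:35], perm[35:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Define models of different complexity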
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 8),
            nn.ReLU(),
            nn.Linear(8, 1)
        )
    def forward(self, x):
        return self.net(x)

class ComplexModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
    def forward(self, x):
        return self.net(x)

def train_model(model, epochs=2000):
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()
    
    train_losses, test_losses = [], []
    
    for _ in range(epochs):
        # Train
        model.train()
        optimizer.zero_grad()
        pred = model(X_train)
        loss = criterion(pred, y_train)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        
        # Test
        model.eval()
        with torch.no_grad():
            test_pred = model(X_test)
            test_loss = criterion(test_pred, y_test)
            test_losses.append(test_loss.item())
    
    return train_losses, test_losses

# Train both models
simple_model = SimpleModel()
complex_model = ComplexModel()

simple_train, simple_test = train_model(simple_model)
complex_train, complex_test = train_model(complex_model)

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss curves for simple model
ax = axes[0]
ax.plot(simple_train, label='Train')
ax.plot(simple_test, label='Test')
ax.set_title('Simple Model (8 hidden units)')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.legend()
ax.set_ylim(0, 0.5)

# Loss curves for complex model
ax = axes[1]
ax.plot(complex_train, label='Train')
ax.plot(complex_test, label='Test')
ax.set_title('Complex Model (3x128 hidden units)')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.legend()
ax.set_ylim(0, 0.5)

# Predictions
ax = axes[2]
X_plot = torch.linspace(-3, 3, 100).unsqueeze(1)

simple_model.eval()
complex_model.eval()
with torch.no_grad():
    simple_pred = simple_model(X_plot)
    complex_pred = complex_model(X_plot)

ax.scatter(X_train, y_train, c='blue', alpha=0.5, label='Train data')
ax.scatter(X_test, y_test, c='red', alpha=0.5, label='Test data')
ax.plot(X_plot, torch.sin(X_plot), 'g--', label='True function', linewidth=2)
ax.plot(X_plot, simple_pred, 'b-', label='Simple model', linewidth=2)
ax.plot(X_plot, complex_pred, 'r-', label='Complex model', linewidth=2)
ax.set_title('Model Predictions')
ax.legend()

plt.tight_layout()
plt.show()

print(f"Simple Model - Train Loss: {simple_train[-1]:.4f}, Test Loss: {simple_test[-1]:.4f}")
print(f"Complex Model - Train Loss: {complex_train[-1]:.4f}, Test Loss: {complex_test[-1]:.4f}")
print("\nIf the complex model has the lower train loss but the higher test loss, it is overfitting.")

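### 5.2 Dropout

**Idea:** randomly "switch off" a fraction of the neurons during training

**Effects:**
- Discourages co-adaptation between neurons
- Resembles training an ensemble of many sub-networks
- No units are dropped at test time. PyTorch uses inverted dropout: it scales the surviving activations by $1/(1-p)$ during training, so no rescaling is needed in `eval()` mode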
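
The train/eval asymmetry can be sketched without any framework. `inverted_dropout` below is a hypothetical helper that mimics the inverted-dropout scheme on a plain list; the seed and list length are arbitrary.

```python
import random

def inverted_dropout(values, p, training, rng):
    """Drop each value with probability p; rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)          # eval mode: identity, no rescaling needed
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)               # fixed seed, chosen arbitrarily
x = [1.0] * 10_000
train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
print(sorted(set(train_out)), round(train_mean, 2))
```

In training mode each output is either 0 or 2, so the expected activation stays at 1; in eval mode the input passes through unchanged. This is why forgetting `model.eval()` before evaluation yields noisy predictions.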

In [None]:
# Dropout demonstration

# Model with Dropout added
class ModelWithDropout(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # Dropout!
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, 1)
        )
    def forward(self, x):
        return self.net(x)

# Train the model with dropout
dropout_model = ModelWithDropout(dropout_rate=0.5)
dropout_train, dropout_test = train_model(dropout_model)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(complex_train, label='Train (no dropout)')
ax.plot(complex_test, label='Test (no dropout)')
ax.plot(dropout_train, '--', label='Train (with dropout)')
ax.plot(dropout_test, '--', label='Test (with dropout)')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Effect of Dropout')
ax.legend()
ax.set_ylim(0, 0.5)

ax = axes[1]
dropout_model.eval()
with torch.no_grad():
    dropout_pred = dropout_model(X_plot)

ax.scatter(X_train, y_train, c='blue', alpha=0.5, label='Train')
ax.scatter(X_test, y_test, c='red', alpha=0.5, label='Test')
ax.plot(X_plot, torch.sin(X_plot), 'g--', label='True', linewidth=2)
ax.plot(X_plot, complex_pred, 'r-', label='No dropout', linewidth=2, alpha=0.7)
ax.plot(X_plot, dropout_pred, 'b-', label='With dropout', linewidth=2)
ax.set_title('Predictions')
ax.legend()

plt.tight_layout()
plt.show()

print(f"Without Dropout - Test Loss: {complex_test[-1]:.4f}")
print(f"With Dropout - Test Loss: {dropout_test[-1]:.4f}")

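### 5.3 Batch Normalization

**Idea:** normalize each feature of a layer's output to mean 0 and standard deviation 1 over the mini-batch, then apply a learnable scale and shift

**Effects:**
- Faster convergence
- Tolerates larger learning rates
- A mild regularization effect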
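
The training-mode computation for a single feature can be sketched as follows. `batch_norm` is an illustrative helper, not the PyTorch layer; `gamma` and `beta` stand for the learnable scale and shift, and `eps=1e-5` is assumed here as the small stabilizing constant.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

batch = [1.0, 2.0, 3.0, 4.0, 5.0]    # values of one feature across a mini-batch
normed = batch_norm(batch)
out_mean = sum(normed) / len(normed)
out_var = sum((v - out_mean) ** 2 for v in normed) / len(normed)
print(out_mean, out_var)
```

In `eval()` mode, `nn.BatchNorm1d` replaces the batch statistics with running averages collected during training, so a single test sample can be normalized on its own.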

In [None]:
# Batch Normalization demonstration

class ModelWithBatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 128),
            nn.BatchNorm1d(128),  # BatchNorm!
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
    def forward(self, x):
        return self.net(x)

# Compare convergence speed with and without BatchNorm
def train_model_fast(model, epochs=500, lr=0.1):
    optimizer = optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    
    train_losses = []
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        pred = model(X_train)
        loss = criterion(pred, y_train)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    return train_losses

# Model without BatchNorm
class ModelNoBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
    def forward(self, x):
        return self.net(x)

torch.manual_seed(42)
model_no_bn = ModelNoBN()
torch.manual_seed(42)
model_with_bn = ModelWithBatchNorm()

losses_no_bn = train_model_fast(model_no_bn, lr=0.01)  # small learning rate
losses_with_bn = train_model_fast(model_with_bn, lr=0.1)  # large learning rate!

plt.figure(figsize=(10, 4))
plt.plot(losses_no_bn, label='Without BatchNorm (lr=0.01)')
plt.plot(losses_with_bn, label='With BatchNorm (lr=0.1)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('BatchNorm allows higher learning rate')
plt.legend()
plt.ylim(0, 1)
plt.grid(True)
plt.show()

print("The BatchNorm model is trained with a 10x larger learning rate; compare the two curves to see how much faster it converges.")

---

## Part 6: Full Implementation - MNIST Handwritten Digit Classification

Now we combine everything into a complete MNIST classifier.

In [None]:
# Load the MNIST dataset

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and standard deviation
])

train_dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.MNIST(
    root='./data', train=False, download=True, transform=transform
)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False, num_workers=0)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Image shape: {train_dataset[0][0].shape}")

In [None]:
# Look at a few samples

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flatten()):
    img, label = train_dataset[i]
    ax.imshow(img.squeeze(), cmap='gray')
    ax.set_title(f'Label: {label}')
    ax.axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Define the MLP model

class MNIST_MLP(nn.Module):
    def __init__(self, hidden_size=256, dropout_rate=0.2):
        super().__init__()
        
        self.flatten = nn.Flatten()  # 28x28 -> 784
        
        self.net = nn.Sequential(
            # First hidden layer
            nn.Linear(784, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            
            # Second hidden layer
            nn.Linear(hidden_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            
            # Output layer
            nn.Linear(hidden_size, 10)  # 10 digit classes
        )
    
    def forward(self, x):
        x = self.flatten(x)
        return self.net(x)

# Build the model and move it to the selected device (GPU if available)
model = MNIST_MLP(hidden_size=256, dropout_rate=0.2).to(device)

# Print the model structure
print(model)

# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

In [None]:
# Define the training and evaluation functions

def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    return total_loss / total, 100. * correct / total

def evaluate(model, test_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item() * images.size(0)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    return total_loss / total, 100. * correct / total

In [None]:
# Train!

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}

print("Starting training...")
print("-" * 60)

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['test_loss'].append(test_loss)
    history['test_acc'].append(test_acc)
    
    print(f"Epoch [{epoch+1:2d}/{num_epochs}] "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

print("-" * 60)
print(f"Final Test Accuracy: {history['test_acc'][-1]:.2f}%")

In [None]:
# Visualize the training process

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss
ax = axes[0]
ax.plot(history['train_loss'], label='Train')
ax.plot(history['test_loss'], label='Test')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training and Test Loss')
ax.legend()
ax.grid(True)

# Accuracy
ax = axes[1]
ax.plot(history['train_acc'], label='Train')
ax.plot(history['test_acc'], label='Test')
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy (%)')
ax.set_title('Training and Test Accuracy')
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.show()

In [None]:
# Look at some predictions

model.eval()

# Take one batch of test data
images, labels = next(iter(test_loader))
images, labels = images.to(device), labels.to(device)

with torch.no_grad():
    outputs = model(images)
    probs = torch.softmax(outputs, dim=1)
    _, predicted = outputs.max(1)

# Show the first 15
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for i, ax in enumerate(axes.flatten()):
    img = images[i].cpu().squeeze()
    pred = predicted[i].item()
    true = labels[i].item()
    prob = probs[i, pred].item()
    
    ax.imshow(img, cmap='gray')
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred} ({prob:.1%})\nTrue: {true}', color=color)
    ax.axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Confusion matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

# Collect all predictions
all_preds = []
all_labels = []

model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        _, predicted = outputs.max(1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.numpy())

# Compute the confusion matrix
cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Find the most frequently confused digit pairs
print("\nMost frequently confused digit pairs:")
cm_no_diag = cm.copy()
np.fill_diagonal(cm_no_diag, 0)
for _ in range(3):
    i, j = np.unravel_index(cm_no_diag.argmax(), cm_no_diag.shape)
    print(f"  True {i} misclassified as {j}: {cm_no_diag[i, j]} times")
    cm_no_diag[i, j] = 0

---

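## Exercises (solutions included; read through and make sure you understand them)

### Exercise 1: Compare Activation Functions

**Goal:** observe how different activation functions affect MNIST training

**Hint:**
- ReLU is the most common choice
- GELU and SiLU perform better in some cases
- Sigmoid is usually avoided in hidden layers (vanishing gradients)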

In [None]:
# Exercise 1: compare activation functions

class MLP_WithActivation(nn.Module):
    def __init__(self, activation_fn):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 10)
        self.activation = activation_fn
    
    def forward(self, x):
        x = self.flatten(x)
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.fc3(x)
        return x

def quick_train(model, epochs=5):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    history = []
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        
        _, test_acc = evaluate(model, test_loader, criterion, device)
        history.append(test_acc)
    
    return history

# Activation functions to compare
activations = {
    'ReLU': nn.ReLU(),
    'LeakyReLU': nn.LeakyReLU(0.1),
    'GELU': nn.GELU(),
    'SiLU': nn.SiLU(),
    'Tanh': nn.Tanh(),
}

results = {}
for name, act_fn in activations.items():
    print(f"Training with {name}...")
    torch.manual_seed(42)
    model = MLP_WithActivation(act_fn)
    results[name] = quick_train(model, epochs=5)

# Plot
plt.figure(figsize=(10, 5))
for name, accs in results.items():
    plt.plot(range(1, 6), accs, 'o-', label=f'{name} ({accs[-1]:.1f}%)')
plt.xlabel('Epoch')
plt.ylabel('Test Accuracy (%)')
plt.title('Comparison of Activation Functions on MNIST')
plt.legend()
plt.grid(True)
plt.show()

### Exercise 2: Learning Rate Scheduling

**Goal:** learn to use a learning rate scheduler

**Hint:**
- Use a larger learning rate early in training and reduce it later
- Common choices: StepLR, CosineAnnealingLR, ReduceLROnPlateau
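
The shapes of two of these schedules can be previewed with closed-form expressions before any training. This is a sketch under the assumption of a constant base learning rate; `step_lr` and `cosine_lr` are illustrative helpers that mirror the usual step-decay and cosine-annealing formulas, not the PyTorch scheduler classes themselves.

```python
import math

def step_lr(base_lr, epoch, step_size, gamma):
    """Step decay: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

step_schedule = [step_lr(0.01, e, step_size=3, gamma=0.5) for e in range(10)]
cosine_schedule = [cosine_lr(0.01, e, t_max=10) for e in range(11)]

for e, lr in enumerate(step_schedule):
    print(f"epoch {e}: step lr = {lr:.5f}, cosine lr = {cosine_schedule[e]:.5f}")
```

Step decay holds the rate flat and then halves it abruptly, while cosine annealing lowers it smoothly. `ReduceLROnPlateau` has no closed form because it reacts to a monitored metric.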

In [None]:
# Exercise 2: learning rate scheduling

# Rebuild the model
torch.manual_seed(42)
model_with_scheduler = MNIST_MLP().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_with_scheduler.parameters(), lr=0.01)  # larger initial learning rate

# Scheduler: multiply the learning rate by 0.5 every 3 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

history_scheduler = {'test_acc': [], 'lr': []}

print("Training with LR Scheduler...")
for epoch in range(10):
    current_lr = optimizer.param_groups[0]['lr']
    
    # Train
    model_with_scheduler.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model_with_scheduler(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    # Update the learning rate
    scheduler.step()
    
    # Evaluate
    _, test_acc = evaluate(model_with_scheduler, test_loader, criterion, device)
    history_scheduler['test_acc'].append(test_acc)
    history_scheduler['lr'].append(current_lr)
    
    print(f"Epoch {epoch+1}: LR = {current_lr:.6f}, Test Acc = {test_acc:.2f}%")

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(history_scheduler['lr'], 'o-')
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('Learning Rate Schedule')
ax.grid(True)

ax = axes[1]
ax.plot(history_scheduler['test_acc'], 'o-')
ax.set_xlabel('Epoch')
ax.set_ylabel('Test Accuracy (%)')
ax.set_title('Test Accuracy')
ax.grid(True)

plt.tight_layout()
plt.show()

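### Exercise 3: Early Stopping

**Goal:** implement early stopping to prevent overfitting

**Hint:**
- Monitor the loss or accuracy on a validation set
- Stop training if there is no improvement for N consecutive epochs
- Keep the parameters of the best model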

In [None]:
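# Exercise 3: early stopping

class EarlyStopping:
    """Stop training when the monitored loss stops improving."""
    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
        self.best_model_state = None
    
    @staticmethod
    def _snapshot(model):
        # state_dict() returns references to the live tensors and dict.copy()
        # is shallow, so clone every tensor to freeze the best weights.
        return {k: v.detach().clone() for k, v in model.state_dict().items()}
    
    def __call__(self, val_loss, model):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model_state = self._snapshot(model)
        elif val_loss > self.best_loss - self.min_delta:
            # No improvement this epoch
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            # Improvement: remember the new best and reset the counter
            self.best_loss = val_loss
            self.best_model_state = self._snapshot(model)
            self.counter = 0
        
        return self.early_stop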

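# Train with early stopping
# Note: this demo monitors the test set for simplicity; in practice, hold out a separate validation set.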
torch.manual_seed(42)
model_early_stop = MNIST_MLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_early_stop.parameters(), lr=0.001)
early_stopping = EarlyStopping(patience=3)

history_es = {'train_loss': [], 'test_loss': []}

print("Training with Early Stopping (patience=3)...")
for epoch in range(50):  # at most 50 epochs
    train_loss, _ = train_epoch(model_early_stop, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model_early_stop, test_loader, criterion, device)
    
    history_es['train_loss'].append(train_loss)
    history_es['test_loss'].append(test_loss)
    
    print(f"Epoch {epoch+1}: Train Loss = {train_loss:.4f}, Test Loss = {test_loss:.4f}, Test Acc = {test_acc:.2f}%")
    
    if early_stopping(test_loss, model_early_stop):
        print(f"\nEarly stopping triggered at epoch {epoch+1}!")
        break

# Load the best model
model_early_stop.load_state_dict(early_stopping.best_model_state)
_, final_acc = evaluate(model_early_stop, test_loader, criterion, device)
print(f"\nBest model Test Accuracy: {final_acc:.2f}%")

# Plot
plt.figure(figsize=(8, 4))
plt.plot(history_es['train_loss'], label='Train Loss')
plt.plot(history_es['test_loss'], label='Test Loss')
# Mark the epoch with the lowest monitored loss; this stays correct even if early stopping never triggered
best_epoch = int(np.argmin(history_es['test_loss']))
plt.axvline(x=best_epoch, color='r', linestyle='--', label='Best Model')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training with Early Stopping')
plt.legend()
plt.grid(True)
plt.show()

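### Exercise 4: Saving and Loading Models

**Goal:** learn to save and load a trained model

**Hint:**
- `torch.save(model.state_dict(), path)`: save the parameters
- `model.load_state_dict(torch.load(path))`: load the parameters
- Saving the entire model object is also possible but not recommended (it pickles the class, so loading depends on the original code layout)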

In [None]:
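# Exercise 4: saving and loading a model

import os

# Create the output directory
os.makedirs('models', exist_ok=True)

# Method 1: save only the parameters (recommended)
# `model` was reassigned in the exercises above, so save the MNIST_MLP
# trained in Exercise 3 to match the architecture that is loaded below.
save_path = 'models/mnist_mlp.pth'
torch.save(model_early_stop.state_dict(), save_path)
print(f"Model saved to {save_path}")
print(f"File size: {os.path.getsize(save_path) / 1024:.1f} KB")

# Load the parameters
new_model = MNIST_MLP().to(device)
new_model.load_state_dict(torch.load(save_path, map_location=device))
new_model.eval()

# Verify that loading worked
_, loaded_acc = evaluate(new_model, test_loader, criterion, device)
print(f"Loaded model Test Accuracy: {loaded_acc:.2f}%")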

In [None]:
# Method 2: save a full checkpoint (includes the optimizer state, so training can resume)

checkpoint_path = 'models/mnist_checkpoint.pth'

# Use the Exercise 3 model together with its own optimizer and history,
# so that the model and optimizer states belong to the same training run.
checkpoint = {
    'epoch': len(history_es['train_loss']),
    'model_state_dict': model_early_stop.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'train_loss': history_es['train_loss'][-1],
    'test_acc': final_acc,
}

torch.save(checkpoint, checkpoint_path)
print(f"Checkpoint saved to {checkpoint_path}")

# Load the checkpoint and prepare to resume training
checkpoint = torch.load(checkpoint_path, map_location=device)
resume_model = MNIST_MLP().to(device)
resume_model.load_state_dict(checkpoint['model_state_dict'])

resume_optimizer = optim.Adam(resume_model.parameters())
resume_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

print(f"\nResumed from epoch {checkpoint['epoch']}")
print(f"Last train loss: {checkpoint['train_loss']:.4f}")
print(f"Saved test accuracy: {checkpoint['test_acc']:.2f}%")

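## Module 2 Midpoint Summary

### Core Concepts

1. **MLP structure**: Input → [Linear + Activation] × N → Output

2. **Activation functions**:
   - ReLU: the most common; cheap to compute and mitigates vanishing gradients
   - GELU/SiLU: common in Transformers; often perform slightly better, at a higher compute cost
   - Sigmoid/Tanh: mainly for specific uses such as output layers

3. **Loss functions**:
   - Regression: MSELoss
   - Classification: CrossEntropyLoss

4. **Optimizers**:
   - SGD: the baseline; works better with momentum
   - Adam: the most widely used; adapts the per-parameter step size

5. **Regularization**:
   - Dropout: randomly switches off neurons
   - BatchNorm: normalizes each layer's output and speeds up training
   - L2/Weight Decay: penalizes large weights
   - Early Stopping: monitors the validation set and stops in time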

---

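## Part 7: Advanced Training Techniques

### 7.1 Weight Initialization

**Why does it matter?** Good initialization can:
- Avoid vanishing/exploding gradients
- Speed up convergence
- Lead to a better final result

**Common methods:**
- **Xavier/Glorot**: suited to Sigmoid/Tanh (keeps the variance of inputs and outputs roughly equal)
- **He/Kaiming**: suited to ReLU (accounts for ReLU zeroing about half of the values)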
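
The difference between the two schemes shows up when a signal passes through a deep ReLU stack. The sketch below is a small pure-Python simulation, with width, depth and seed chosen arbitrarily. It uses the normal-distribution variants: standard deviation $\sqrt{2/(\text{fan\_in}+\text{fan\_out})}$ for Xavier and $\sqrt{2/\text{fan\_in}}$ for He. Both runs share the same random draws, so only the scale differs.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation of Xavier/Glorot normal initialization (gain 1)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in, fan_out):
    """Standard deviation of He/Kaiming normal initialization for ReLU."""
    return math.sqrt(2.0 / fan_in)

def final_mean_square(std_fn, width=64, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers; return the mean squared activation."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = std_fn(width, width)
    for _ in range(depth):
        h = [max(0.0, sum(rng.gauss(0.0, std) * hj for hj in h)) for _ in range(width)]
    return sum(v * v for v in h) / width

xavier_ms = final_mean_square(xavier_std)
he_ms = final_mean_square(he_std)
ratio = he_ms / xavier_ms
print(f"Xavier: {xavier_ms:.6f}, He: {he_ms:.6f}, ratio: {ratio:.1f}")
```

With equal fan-in and fan-out, the He weights are $\sqrt{2}$ times larger. Each ReLU layer therefore keeps about twice as much signal energy, and after 10 layers the gap is a factor of roughly $2^{10}$. That extra factor of 2 per layer is what compensates for ReLU zeroing half of the values.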

In [None]:
# Compare weight initialization methods

class MLP_CustomInit(nn.Module):
    def __init__(self, init_method='default'):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 10)
        
        # Custom initialization
        if init_method == 'xavier':
            nn.init.xavier_uniform_(self.fc1.weight)
            nn.init.xavier_uniform_(self.fc2.weight)
            nn.init.xavier_uniform_(self.fc3.weight)
        elif init_method == 'kaiming':
            nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
            nn.init.kaiming_uniform_(self.fc2.weight, nonlinearity='relu')
            nn.init.kaiming_uniform_(self.fc3.weight, nonlinearity='relu')
        elif init_method == 'zeros':
            nn.init.zeros_(self.fc1.weight)
            nn.init.zeros_(self.fc2.weight)
            nn.init.zeros_(self.fc3.weight)
        
        # Biases are zeroed for every method (including 'default'), so the runs differ only in weight init
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)
        nn.init.zeros_(self.fc3.bias)
    
    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Compare the initialization methods
init_methods = ['default', 'xavier', 'kaiming', 'zeros']
init_results = {}

print("Comparing initialization methods (3 training epochs each):")
print("-" * 50)

for method in init_methods:
    torch.manual_seed(42)
    model = MLP_CustomInit(init_method=method).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    
    accs = []
    for epoch in range(3):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        
        _, acc = evaluate(model, test_loader, criterion, device)
        accs.append(acc)
    
    init_results[method] = accs
    print(f"{method:>10}: Epoch 1={accs[0]:.1f}%, Epoch 2={accs[1]:.1f}%, Epoch 3={accs[2]:.1f}%")

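print("\nNote: zero initialization never breaks symmetry. With ReLU, the hidden units output 0 and get zero gradient, so only the output bias learns.")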

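### 7.2 Gradient Clipping

**Problem:** exploding gradients make training unstable

**Solution:** limit the magnitude of the gradients
- `clip_grad_norm_`: rescales all gradients so their overall L2 norm does not exceed `max_norm`
- `clip_grad_value_`: clamps each gradient element to `[-clip_value, clip_value]`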
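
The difference between the two strategies is easiest to see on plain numbers. The following is a minimal sketch in plain Python, not PyTorch's implementation: `clip_by_global_norm` and `clip_by_value` are hypothetical helpers, gradients are represented as nested lists, and the `1e-6` term is an assumption added to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in grad] for grad in grads], total_norm

def clip_by_value(grads, clip_value):
    """Clamp every gradient element independently to [-clip_value, clip_value]."""
    return [[max(-clip_value, min(clip_value, g)) for g in grad] for grad in grads]

grads = [[3.0, 4.0], [0.0, 12.0]]   # two "parameters"; joint L2 norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)                            # 13.0
print(clip_by_value(grads, clip_value=5.0))   # [[3.0, 4.0], [0.0, 5.0]]
```

Norm clipping rescales every element by the same factor, so the update direction is preserved. Value clipping only touches the elements that exceed the threshold, which can change the direction.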

In [None]:
# Gradient clipping example

def train_with_grad_clip(model, train_loader, max_norm=1.0, epochs=3):
    """Train with gradient clipping; return the mean pre-clip gradient norm per epoch."""
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    
    grad_norms = []
    
    for epoch in range(epochs):
        model.train()
        epoch_norms = []
        
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            
            # Compute the gradient norm before clipping
            total_norm = 0
            for p in model.parameters():
                if p.grad is not None:
                    total_norm += p.grad.norm(2).item() ** 2
            total_norm = total_norm ** 0.5
            epoch_norms.append(total_norm)
            
            # Clip the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
            
            optimizer.step()
        
        grad_norms.append(np.mean(epoch_norms))
    
    return grad_norms

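# Compare training with and without gradient clipping
torch.manual_seed(42)
model_no_clip = MNIST_MLP().to(device)
torch.manual_seed(42)
model_with_clip = MNIST_MLP().to(device)

# Test with a larger learning rate
print("Testing gradient clipping (larger learning rate, 0.01):")

# Without clipping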
optimizer1 = optim.Adam(model_no_clip.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

norms_no_clip = []
for epoch in range(3):
    model_no_clip.train()
    epoch_norms = []
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer1.zero_grad()
        loss = criterion(model_no_clip(images), labels)
        loss.backward()
        
        total_norm = sum(p.grad.data.norm(2).item() ** 2 for p in model_no_clip.parameters() if p.grad is not None) ** 0.5
        epoch_norms.append(total_norm)
        
        optimizer1.step()
    norms_no_clip.append(np.mean(epoch_norms))

# With clipping
optimizer2 = optim.Adam(model_with_clip.parameters(), lr=0.01)
norms_with_clip = []
for epoch in range(3):
    model_with_clip.train()
    epoch_norms = []
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer2.zero_grad()
        loss = criterion(model_with_clip(images), labels)
        loss.backward()
        
        total_norm = sum(p.grad.data.norm(2).item() ** 2 for p in model_with_clip.parameters() if p.grad is not None) ** 0.5
        epoch_norms.append(total_norm)
        
        # Clip the gradients (the norm recorded above is the pre-clip value)
        torch.nn.utils.clip_grad_norm_(model_with_clip.parameters(), max_norm=1.0)
        
        optimizer2.step()
    norms_with_clip.append(np.mean(epoch_norms))

print(f"\nWithout clipping - mean gradient norm per epoch: {norms_no_clip}")
print(f"With clipping - mean pre-clip gradient norm per epoch: {norms_with_clip}")

_, acc_no_clip = evaluate(model_no_clip, test_loader, criterion, device)
_, acc_with_clip = evaluate(model_with_clip, test_loader, criterion, device)
print(f"\nFinal accuracy - without clipping: {acc_no_clip:.2f}%, with clipping: {acc_with_clip:.2f}%")

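### 7.3 Learning Rate Finder

**Idea:** start from a very small learning rate, increase it gradually, and record how the loss changes.
- The range where the loss drops fastest = a good learning rate range
- The loss starts rising = the learning rate is too large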
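
"Increase it gradually" usually means geometric growth, so that every order of magnitude gets the same number of trial steps. A minimal sketch of such a schedule in plain Python follows; `lr_schedule` is a hypothetical helper written for this illustration, and it uses the exponent `1 / (num_iter - 1)` so the last value lands on `lr_max`.

```python
def lr_schedule(lr_min, lr_max, num_iter):
    """Learning rates growing geometrically from lr_min to lr_max."""
    mult = (lr_max / lr_min) ** (1 / (num_iter - 1))
    return [lr_min * mult ** i for i in range(num_iter)]

lrs_demo = lr_schedule(lr_min=1e-7, lr_max=1.0, num_iter=8)
for lr in lrs_demo:
    print(f"{lr:.1e}")
```

The cell that follows uses the exponent `1 / num_iter` instead, so its last trial rate ends slightly below `lr_max`; the difference is negligible for this purpose.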

In [None]:
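# A simple learning rate finder

def lr_finder(model_class, train_loader, lr_min=1e-7, lr_max=1, num_iter=100):
    """
    Learning rate finder: train for up to num_iter mini-batches with plain SGD
    while growing the learning rate geometrically from lr_min toward lr_max.
    Returns the learning rates tried and the loss recorded at each one.
    """
    model = model_class().to(device)
    optimizer = optim.SGD(model.parameters(), lr=lr_min)
    criterion = nn.CrossEntropyLoss()
    
    # Exponentially increasing learning rate
    lr_mult = (lr_max / lr_min) ** (1 / num_iter)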
    
    lrs = []
    losses = []
    best_loss = float('inf')
    
    model.train()
    data_iter = iter(train_loader)
    
    for i in range(num_iter):
        try:
            images, labels = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            images, labels = next(data_iter)
        
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Stop if the loss blows up (more than 10x the best loss so far)
        if loss.item() > best_loss * 10:
            break
        
        if loss.item() < best_loss:
            best_loss = loss.item()
        
        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(loss.item())
        
        loss.backward()
        optimizer.step()
        
        # Increase the learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] *= lr_mult
    
    return lrs, losses

# Run the LR finder
print("Running the learning rate finder...")
lrs, losses = lr_finder(MNIST_MLP, train_loader)

# Smooth the loss (exponential moving average)
def smooth(values, weight=0.9):
    smoothed = []
    last = values[0]
    for v in values:
        smoothed.append(last * weight + v * (1 - weight))
        last = smoothed[-1]
    return smoothed

smoothed_losses = smooth(losses)

# Plot
plt.figure(figsize=(10, 4))
plt.plot(lrs, smoothed_losses)
plt.xscale('log')
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.title('Learning Rate Finder')
plt.grid(True)

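# Suggest a learning rate: a common heuristic takes the LR at the lowest smoothed loss and divides it by 10
min_loss_idx = np.argmin(smoothed_losses)
suggested_lr = lrs[min_loss_idx] / 10  # one tenth of the LR at the minimum
plt.axvline(x=suggested_lr, color='r', linestyle='--', label=f'Suggested LR: {suggested_lr:.1e}')
plt.legend()
plt.show()

print(f"\nSuggested learning rate: {suggested_lr:.1e}")
print("Tip: pick a point where the loss is still falling quickly, typically about 1/10 of the LR at the minimum")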

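### 7.4 Label Smoothing

**Concept:** instead of hard labels (0 or 1), train on soft labels

For example, with smoothing 0.1 and 4 classes: `[0, 0, 1, 0]` → `[0.025, 0.025, 0.925, 0.025]`

**Effect:** acts as regularization and keeps the model from becoming overconfident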
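
The soft targets above come from mixing the one-hot vector with a uniform distribution over all classes: the true class gets (1 - ε) + ε/K and every other class gets ε/K. The sketch below reproduces that arithmetic in plain Python and compares the resulting cross-entropy on one example. `smooth_targets` and `cross_entropy` are hypothetical helpers written for this illustration, not library functions.

```python
import math

def smooth_targets(target, num_classes, eps):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]

def cross_entropy(logits, probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(p * (z - log_z) for p, z in zip(probs, logits))

print([round(p, 3) for p in smooth_targets(2, 4, eps=0.1)])  # [0.025, 0.025, 0.925, 0.025]

logits = [2.0, 1.0, 0.5]
hard = cross_entropy(logits, smooth_targets(0, 3, eps=0.0))
soft = cross_entropy(logits, smooth_targets(0, 3, eps=0.1))
print(f"hard: {hard:.4f}, smoothed: {soft:.4f}")
```

When the model already favors the correct class, as it does here, the smoothed loss is higher than the hard-label loss, which is what discourages overconfident predictions.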

In [None]:
# Label smoothing example

# PyTorch's CrossEntropyLoss has a built-in label_smoothing parameter
criterion_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
criterion_hard = nn.CrossEntropyLoss()

# Quick check
logits = torch.tensor([[2.0, 1.0, 0.5]])
target = torch.tensor([0])

loss_hard = criterion_hard(logits, target)
loss_smooth = criterion_smooth(logits, target)

print(f"Hard labels loss: {loss_hard.item():.4f}")
print(f"Smooth labels (0.1) loss: {loss_smooth.item():.4f}")

# Training comparison
torch.manual_seed(42)
model_hard = MNIST_MLP().to(device)
torch.manual_seed(42)
model_smooth = MNIST_MLP().to(device)

def quick_train_with_criterion(model, criterion, epochs=5):
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    accs = []
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        _, acc = evaluate(model, test_loader, nn.CrossEntropyLoss(), device)
        accs.append(acc)
    return accs

print("\nComparison after 5 training epochs:")
acc_hard = quick_train_with_criterion(model_hard, criterion_hard)
acc_smooth = quick_train_with_criterion(model_smooth, criterion_smooth)

print(f"Hard labels: {acc_hard[-1]:.2f}%")
print(f"Label smoothing (0.1): {acc_smooth[-1]:.2f}%")

In [None]:
# Clean up temporary files
import os
import shutil
if os.path.exists('models'):
    shutil.rmtree('models')
    print("Removed temporary model files")

---

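## Full Summary and Practical Checklist

### What this module covered:

| Topic | Technique | When to use |
|------|------|----------|
| **Activation functions** | ReLU, GELU, SiLU | Nonlinearity in hidden layers |
| **Loss functions** | MSE, CrossEntropy | Regression / classification tasks |
| **Optimizers** | SGD, Adam, AdamW | Parameter updates |
| **Regularization** | Dropout, BatchNorm, L2 | Preventing overfitting |
| **Initialization** | Xavier, Kaiming | Faster convergence |
| **Gradient clipping** | clip_grad_norm_ | Training stability |
| **LR scheduling** | StepLR, CosineAnnealing | Adjusting the LR during training |
| **Early stopping** | EarlyStopping | Preventing overfitting |
| **Label smoothing** | label_smoothing | Regularization |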

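### Practical Checklist:

- [ ] Model definition: Flatten → Linear + Activation + Dropout/BN → Output
- [ ] Training loop: zero_grad → forward → loss → backward → step
- [ ] Monitor overfitting: compare train_loss and test_loss
- [ ] Use BatchNorm to speed up convergence
- [ ] Use Dropout to prevent overfitting
- [ ] Try the Learning Rate Finder to pick a good LR
- [ ] Implement Early Stopping to halt training automatically
- [ ] Save a checkpoint of the best model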

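### Training Troubleshooting Guide:

| Symptom | Likely cause | Fix |
|------|----------|----------|
| Loss does not decrease | LR too small / too large | Adjust the LR or run the LR Finder |
| Loss becomes NaN | LR too large / numerical issues | Lower the LR, clip gradients |
| Training loss falls but test loss rises | Overfitting | Add Dropout, stop early |
| Training is slow | Not using the GPU / LR too small | Check `device`, raise the LR |
| Accuracy plateaus | LR too large | Lower the LR or use a scheduler |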
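
The first three rows of the table can be turned into a rough automated check on per-epoch loss curves. The sketch below is purely illustrative: `diagnose`, its `tol` threshold, and the rules themselves are assumptions made for this example, not an established diagnostic.

```python
import math

def diagnose(train_losses, val_losses, tol=1e-3):
    """Map per-epoch loss curves to rows of the table above (rough heuristics)."""
    # Row "Loss becomes NaN": any non-finite training loss.
    if any(not math.isfinite(x) for x in train_losses):
        return "non-finite loss: lower the LR or clip gradients"
    # Row "Loss does not decrease": final loss is no better than the first.
    if train_losses[-1] >= train_losses[0] - tol:
        return "loss not decreasing: adjust the LR or run an LR finder"
    # Row "Training loss falls but test loss rises": validation loss bottomed
    # out in an earlier epoch while training loss kept improving.
    best_val_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    if (best_val_epoch < len(val_losses) - 1
            and train_losses[-1] < train_losses[best_val_epoch]):
        return "overfitting: add Dropout or stop early"
    return "no obvious problem"

print(diagnose([1.0, 0.5, 0.2], [1.0, 0.6, 0.8]))
```

A check like this is only a first filter; the curves themselves remain the better guide.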

### Next: Module 3 - CNNs and Image Tasks