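# 03 Activation Functions

## Learning Objectives

1. Understand the role of activation functions (introducing nonlinearity)
2. Implement ReLU, Sigmoid, and Tanh along with their backward passes
3. Implement Softmax + Cross-Entropy Loss (computed together for numerical stability)
4. Understand the strengths and weaknesses of each activation function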

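## Why Do We Need Activation Functions?

If a neural network contains only linear layers, any stack of them is still a single linear map:

$$Y = W_2(W_1 X + b_1) + b_2 = (W_2 W_1)X + (W_2 b_1 + b_2) = W'X + b'$$

No matter how many layers are stacked, the result is equivalent to one layer. Activation functions introduce **nonlinearity**, which lets the network learn complex functions.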
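
The collapse can be checked numerically. The following is a minimal sketch: the shapes, the random weights, and all variable names are illustrative assumptions, not part of the notebook's own code. It compares two stacked linear layers with the single equivalent layer $W'X + b'$, then shows that a ReLU between the layers breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 4 input features, 5 hidden units, 3 outputs, 6 samples (columns of X)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal((5, 1))
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal((3, 1))
X = rng.standard_normal((4, 6))

# Two stacked linear layers with no activation in between
Y_two_layers = W2 @ (W1 @ X + b1) + b2

# The single equivalent layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
Y_one_layer = W_prime @ X + b_prime

print("linear stack equals one layer:", np.allclose(Y_two_layers, Y_one_layer))

# With a ReLU between the layers, the same W', b' no longer reproduce the output
Y_nonlinear = W2 @ np.maximum(0, W1 @ X + b1) + b2
print("ReLU stack equals one layer:  ", np.allclose(Y_nonlinear, Y_one_layer))
```

The column-vector layout ($W X$ with samples as columns) follows the equation above; the layer classes later in the notebook use the row-vector layout `X @ W + b`, for which the same argument holds.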

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
print("Activation Functions module loaded!")

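## Part 1: ReLU (Rectified Linear Unit)

ReLU is one of the most widely used activation functions today.

### Definition

$$\text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

### Gradient

$$\frac{\partial \text{ReLU}(x)}{\partial x} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

Strictly speaking the derivative is undefined at $x = 0$; the convention used here sets it to 0.

### Advantages
- Simple and fast to compute
- Mitigates the vanishing gradient problem (the gradient is exactly 1 for positive inputs)
- Sparse activations (some neurons output 0)

### Disadvantages
- **Dead ReLU problem**: if a neuron's pre-activation is negative for every input, its gradient is always 0, so its weights stop updating and it cannot recover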
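
The Dead ReLU problem can be illustrated with a single neuron. This is a small sketch under assumed values: the weights `w`, the strongly negative bias `b`, and the random inputs are all hypothetical, chosen so that every pre-activation is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # 100 samples, 3 features (illustrative)

# Hypothetical neuron whose bias has been pushed far into the negative region
w = np.array([0.1, -0.2, 0.05])
b = -10.0

z = X @ w + b                            # pre-activations, all negative here
out = np.maximum(0, z)                   # ReLU forward
local_grad = (z > 0).astype(float)       # ReLU derivative for each sample

# Gradient of sum(out) with respect to the neuron's parameters
dw = X.T @ local_grad
db = local_grad.sum()

print("largest output:", out.max())
print("dw:", dw, "db:", db)
```

Because every sample lands in the flat region, both `dw` and `db` are zero, so a gradient-based update leaves `w` and `b` unchanged and the neuron stays inactive.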

In [None]:
class ReLU:
    """
    ReLU activation layer

    forward: out = max(0, x)
    backward: dx = dout * (x > 0)
    """

    def __init__(self):
        self.cache = None

    def forward(self, x):
        """
        Forward pass

        Parameters
        ----------
        x : np.ndarray
            Input of any shape

        Returns
        -------
        out : np.ndarray
            Same shape as the input
        """
        self.cache = x
        out = np.maximum(0, x)
        return out

    def backward(self, dout):
        """
        Backward pass

        Parameters
        ----------
        dout : np.ndarray
            Upstream gradient

        Returns
        -------
        dx : np.ndarray
            Gradient with respect to the input
        """
        x = self.cache
        dx = dout * (x > 0).astype(float)
        return dx

# Visualize ReLU
x = np.linspace(-5, 5, 100)
relu = ReLU()
y = relu.forward(x)

# Gradient (assuming loss = sum(y))
dy = np.ones_like(x)
dx = relu.backward(dy)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(x, y, 'b-', linewidth=2, label='ReLU(x)')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel('ReLU(x)')
ax.set_title('ReLU Forward')
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(x, dx, 'r-', linewidth=2, label="ReLU'(x)")
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel("ReLU'(x)")
ax.set_title('ReLU Gradient')
ax.set_ylim(-0.1, 1.5)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Leaky ReLU

To address the Dead ReLU problem, Leaky ReLU gives the negative region a small slope:

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$

A typical choice is $\alpha = 0.01$.
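
As a quick illustration of the difference, the sketch below evaluates both derivatives on a few hand-picked inputs. The sample values in `z` are arbitrary and only serve the demonstration.

```python
import numpy as np

alpha = 0.01
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])    # arbitrary sample pre-activations

relu_grad = np.where(z > 0, 1.0, 0.0)        # zero everywhere in the negative region
leaky_grad = np.where(z > 0, 1.0, alpha)     # small but nonzero slope for z <= 0
leaky_out = np.where(z > 0, z, alpha * z)

print("ReLU gradient:      ", relu_grad)
print("Leaky ReLU gradient:", leaky_grad)
print("Leaky ReLU output:  ", leaky_out)
```

Since the Leaky ReLU gradient never reaches exactly zero, a neuron that drifts into the negative region still receives a (small) update signal.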

In [None]:
class LeakyReLU:
    """
    Leaky ReLU activation function
    """
    
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.cache = None
    
    def forward(self, x):
        self.cache = x
        out = np.where(x > 0, x, self.alpha * x)
        return out
    
    def backward(self, dout):
        x = self.cache
        dx = dout * np.where(x > 0, 1, self.alpha)
        return dx

# Visualize Leaky ReLU
leaky_relu = LeakyReLU(alpha=0.1)  # use a larger alpha so the slope is visible
y_leaky = leaky_relu.forward(x)
dx_leaky = leaky_relu.backward(np.ones_like(x))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(x, y, 'b-', linewidth=2, label='ReLU')
ax.plot(x, y_leaky, 'g-', linewidth=2, label='LeakyReLU (α=0.1)')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('ReLU vs LeakyReLU')
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(x, dx, 'b-', linewidth=2, label="ReLU'")
ax.plot(x, dx_leaky, 'g-', linewidth=2, label="LeakyReLU'")
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel("f'(x)")
ax.set_title('Gradients')
ax.set_ylim(-0.1, 1.5)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

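## Part 2: Sigmoid

### Definition

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

### Gradient

$$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))$$

**Derivation**:
$$\frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(1 - \sigma)$$

The last step uses $1 - \sigma = \frac{e^{-x}}{1 + e^{-x}}$.

### Properties
- Output range is $(0, 1)$, which makes it suitable for representing probabilities
- **Vanishing gradient problem**: when $|x|$ is large, $\sigma'(x) \approx 0$
- The output is not zero-centered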
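
To make the saturation concrete, the sketch below evaluates $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ at a few hand-picked points. The helper `sigmoid_grad` and the sample points are introduced here only for illustration.

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the sigmoid, written as sigma * (1 - sigma)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

xs = np.array([0.0, 2.0, 5.0, 10.0])     # illustrative inputs, moving away from 0
grads = sigmoid_grad(xs)

for x_i, g in zip(xs, grads):
    print(f"x = {x_i:5.1f}   sigma'(x) = {g:.2e}")
```

The derivative peaks at 0.25 for $x = 0$ and shrinks rapidly as $|x|$ grows, which is why saturated sigmoid units pass almost no gradient backward.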

In [None]:
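class Sigmoid:
    """
    Sigmoid activation layer

    forward: out = 1 / (1 + exp(-x))
    backward: dx = dout * out * (1 - out)
    """

    def __init__(self):
        self.cache = None

    def forward(self, x):
        """
        Numerically stable sigmoid
        """
        # np.where would evaluate both formulas on every element and could still
        # overflow in exp, so each formula is applied only to its own mask.
        x = np.asarray(x, dtype=float)
        pos = x >= 0
        out = np.empty_like(x)
        out[pos] = 1 / (1 + np.exp(-x[pos]))   # x >= 0: exp(-x) <= 1
        exp_x = np.exp(x[~pos])                # x < 0:  exp(x) < 1
        out[~pos] = exp_x / (1 + exp_x)
        self.cache = out  # cache the output (not the input) for backward
        return out

    def backward(self, dout):
        """
        Backward pass: dx = dout * σ * (1 - σ)
        """
        out = self.cache
        dx = dout * out * (1 - out)
        return dx

# Visualize Sigmoid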
sigmoid = Sigmoid()
y_sigmoid = sigmoid.forward(x)
dx_sigmoid = sigmoid.backward(np.ones_like(x))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(x, y_sigmoid, 'b-', linewidth=2)
ax.axhline(y=0.5, color='k', linewidth=0.5, linestyle='--')
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel('σ(x)')
ax.set_title('Sigmoid')
ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(x, dx_sigmoid, 'r-', linewidth=2)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel("σ'(x)")
ax.set_title('Sigmoid Gradient (max at x=0)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"σ(0) = {sigmoid.forward(np.array([0]))[0]:.4f}")
print(f"σ'(0) = {sigmoid.cache[0] * (1 - sigmoid.cache[0]):.4f}")
print(f"Largest gradient on the grid: {np.max(dx_sigmoid):.4f} (theoretical max = 0.25 at x=0)")

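## Part 3: Tanh

### Definition

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

### Gradient

$$\frac{\partial \tanh(x)}{\partial x} = 1 - \tanh^2(x)$$

### Properties
- Output range is $(-1, 1)$, so it is **zero-centered**
- Still suffers from vanishing gradients, though less severely than sigmoid (its maximum gradient is 1 rather than 0.25)
- Still commonly used in RNNs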
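
Both the identity $\tanh(x) = 2\sigma(2x) - 1$ and the gradient formula can be verified numerically. The sketch below is one way to do so; the test points, the step size `eps`, and the helper name `sigma` are assumptions made for this check.

```python
import numpy as np

def sigma(t):
    """Plain sigmoid, adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + np.exp(-t))

x = np.linspace(-4, 4, 9)                 # illustrative test points

# Identity check: tanh(x) = 2 * sigma(2x) - 1
identity_gap = np.max(np.abs(np.tanh(x) - (2 * sigma(2 * x) - 1)))

# Gradient check: central difference vs. the analytic form 1 - tanh(x)^2
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
analytic = 1 - np.tanh(x) ** 2
grad_gap = np.max(np.abs(numeric - analytic))

print(f"identity gap: {identity_gap:.2e}")
print(f"gradient gap: {grad_gap:.2e}")
```

Both gaps should be at the level of floating-point noise, confirming that tanh is a rescaled and shifted sigmoid.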

In [None]:
class Tanh:
    """
    Tanh activation layer
    
    forward: out = tanh(x)
    backward: dx = dout * (1 - out^2)
    """
    
    def __init__(self):
        self.cache = None
    
    def forward(self, x):
        out = np.tanh(x)
        self.cache = out
        return out
    
    def backward(self, dout):
        out = self.cache
        dx = dout * (1 - out ** 2)
        return dx

# Visualize Tanh
tanh = Tanh()
y_tanh = tanh.forward(x)
dx_tanh = tanh.backward(np.ones_like(x))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(x, y_sigmoid, 'b-', linewidth=2, label='Sigmoid')
ax.plot(x, y_tanh, 'g-', linewidth=2, label='Tanh')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Sigmoid vs Tanh')
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(x, dx_sigmoid, 'b-', linewidth=2, label="Sigmoid'")
ax.plot(x, dx_tanh, 'g-', linewidth=2, label="Tanh'")
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel("f'(x)")
ax.set_title('Gradient Comparison')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

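# x = 0 is not one of the 100 grid points, so evaluate tanh there directly
tanh_zero = Tanh()
y0 = tanh_zero.forward(np.array([0.0]))
dy0 = tanh_zero.backward(np.ones(1))
print(f"tanh(0) = {y0[0]:.4f}")
print(f"tanh'(0) = {dy0[0]:.4f}")
print("Note: Tanh is zero-centered (its output takes both positive and negative values)")
print("      The maximum gradient of Tanh is 1 (at x=0), larger than Sigmoid's 0.25")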

## Part 4: Softmax + Cross-Entropy Loss

### Softmax

Softmax converts $K$ raw scores (logits) into a probability distribution:

$$p_i = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
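
Evaluating this formula literally can overflow for large logits. The sketch below, using illustrative logits around 1000, contrasts the naive computation with the max-shifted version; the shift leaves the result unchanged because the factor $e^{-\max(z)}$ cancels between numerator and denominator.

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])    # illustrative large logits

# Naive evaluation: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.sum(np.exp(z))

# Shifted evaluation: subtract the max so the largest exponent is exp(0) = 1
shifted = np.exp(z - z.max())
stable = shifted / shifted.sum()

print("naive: ", naive)
print("stable:", stable)
```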

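### Cross-Entropy Loss

$$L = -\sum_{i=1}^{K} y_i \log(p_i)$$

where $y$ is the one-hot encoding of the true label. If the true class is $c$:

$$L = -\log(p_c)$$

### Why Compute Them Together?

Differentiating softmax on its own requires the full $K \times K$ Jacobian $\partial p_i / \partial z_j$, but **the gradient of Softmax + Cross-Entropy combined is remarkably simple**:

$$\frac{\partial L}{\partial z_i} = p_i - y_i$$

This is also intuitive: the gradient is the prediction minus the ground truth. Combining the two also helps numerically, because $\log p_c$ can be evaluated directly from the logits without first forming a probability that may underflow to 0.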
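
The two routes can be compared on a single sample. The sketch below is an illustrative check with assumed logits and label: it applies the chain rule through the softmax Jacobian $\partial p_i / \partial z_j = p_i(\delta_{ij} - p_j)$ and compares the result with $p - y$.

```python
import numpy as np

z = np.array([1.0, 2.0, 0.5])      # illustrative logits for one sample
c = 1                              # assumed index of the true class
y = np.eye(3)[c]                   # one-hot label

p = np.exp(z - z.max())
p /= p.sum()

# Separate route: dL/dp first, then the full softmax Jacobian
dL_dp = -y / p                                 # from L = -sum(y * log p)
jacobian = np.diag(p) - np.outer(p, p)         # jacobian[i, j] = p_i * (delta_ij - p_j)
dz_separate = jacobian.T @ dL_dp               # dL/dz_j = sum_i dL/dp_i * dp_i/dz_j

# Combined route: one subtraction
dz_combined = p - y

print("separate:", dz_separate)
print("combined:", dz_combined)
```

The separate route builds a $K \times K$ matrix per sample and divides by $p$, which is fragile when some $p_i$ is tiny; the combined form needs neither.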

In [None]:
def softmax(z):
    """
    Numerically stable softmax

    Parameters
    ----------
    z : np.ndarray, shape (N, K)
        Raw scores (logits)

    Returns
    -------
    p : np.ndarray, shape (N, K)
        Probability distribution; each row sums to 1
    """
    # Subtract the row-wise max to prevent exp overflow
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    p = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    return p

# Test softmax
z = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])  # second row tests numerical stability

p = softmax(z)
print("Logits:")
print(z)
print("\nSoftmax output (probabilities):")
print(p)
print(f"\nRow sums: {np.sum(p, axis=1)}")
print("(Even with very large inputs, the output is still a valid probability distribution)")

In [None]:
class SoftmaxCrossEntropy:
    """
    Softmax + Cross-Entropy Loss (computed together)

    This layer takes raw scores (logits) and returns the loss.
    The backward pass computes dL/d(logits) = softmax(logits) - y directly.
    """

    def __init__(self):
        self.cache = None

    def forward(self, z, y):
        """
        Forward pass: compute the Softmax Cross-Entropy Loss

        Parameters
        ----------
        z : np.ndarray, shape (N, K)
            Raw scores (logits)
        y : np.ndarray, shape (N,)
            True class labels (integers)

        Returns
        -------
        loss : float
            Mean cross-entropy loss
        """
        N = z.shape[0]

        # Log-softmax via the log-sum-exp trick:
        #   log p = z_shifted - log(sum(exp(z_shifted)))
        # Working in log space avoids log(0) without needing an epsilon.
        z_shifted = z - np.max(z, axis=1, keepdims=True)
        log_p = z_shifted - np.log(np.sum(np.exp(z_shifted), axis=1, keepdims=True))

        # Cross-entropy: L = -log(p[correct class]), averaged over the batch
        loss = -np.mean(log_p[np.arange(N), y])

        # Cache probabilities and labels for backward
        self.cache = (np.exp(log_p), y)

        return loss

    def backward(self):
        """
        Backward pass: dL/dz = p - y_one_hot

        Returns
        -------
        dz : np.ndarray, shape (N, K)
            Gradient with respect to the logits
        """
        p, y = self.cache
        N = p.shape[0]

        # Gradient = softmax output - one-hot label
        dz = p.copy()
        dz[np.arange(N), y] -= 1
        dz /= N  # the loss is a mean over the batch

        return dz

# Test
z = np.array([[1.0, 2.0, 0.5],
              [0.5, 0.3, 0.8],
              [2.0, 1.0, 0.0]])
y = np.array([1, 2, 0])  # correct classes

loss_fn = SoftmaxCrossEntropy()
loss = loss_fn.forward(z, y)
dz = loss_fn.backward()

print("Logits:")
print(z)
print(f"\nTrue labels: {y}")
print(f"\nSoftmax output:")
print(softmax(z))
print(f"\nCross-Entropy Loss: {loss:.4f}")
print(f"\nGradient dL/dz:")
print(dz)

### Derivation of the Gradient Formula

Why is $\frac{\partial L}{\partial z_i} = p_i - y_i$?

Let the true class be $c$ (so $y_c = 1$ and $y_j = 0$ for every $j \neq c$).

$$L = -\log(p_c) = -\log\left(\frac{e^{z_c}}{\sum_j e^{z_j}}\right) = -z_c + \log\left(\sum_j e^{z_j}\right)$$

Take the partial derivative with respect to $z_i$:

**Case 1**: $i = c$ (the correct class)
$$\frac{\partial L}{\partial z_c} = -1 + \frac{e^{z_c}}{\sum_j e^{z_j}} = -1 + p_c = p_c - 1 = p_c - y_c$$

**Case 2**: $i \neq c$ (any other class)
$$\frac{\partial L}{\partial z_i} = 0 + \frac{e^{z_i}}{\sum_j e^{z_j}} = p_i = p_i - 0 = p_i - y_i$$

Combined: $\frac{\partial L}{\partial z_i} = p_i - y_i$ ✓

In [None]:
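# Gradient check
def gradient_check_softmax_ce(z, y, eps=1e-5):
    """
    Check the Softmax + Cross-Entropy gradient against central differences
    """
    loss_fn = SoftmaxCrossEntropy()
    loss = loss_fn.forward(z, y)  # forward must run first to fill the cache
    dz_analytic = loss_fn.backward()

    # Numerical gradient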
    dz_numerical = np.zeros_like(z)
    
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            z_plus = z.copy()
            z_plus[i, j] += eps
            loss_plus = SoftmaxCrossEntropy().forward(z_plus, y)
            
            z_minus = z.copy()
            z_minus[i, j] -= eps
            loss_minus = SoftmaxCrossEntropy().forward(z_minus, y)
            
            dz_numerical[i, j] = (loss_plus - loss_minus) / (2 * eps)
    
    diff = np.abs(dz_analytic - dz_numerical)
    rel_error = np.max(diff / (np.abs(dz_analytic) + np.abs(dz_numerical) + 1e-8))
    
    print("=== Softmax + Cross-Entropy Gradient Check ===")
    print(f"Max absolute error: {np.max(diff):.2e}")
    print(f"Max relative error: {rel_error:.2e}")
    print(f"Passed: {rel_error < 1e-5}")

gradient_check_softmax_ce(z, y)

## Part 5: Comparing Activation Functions

Let's compare how the different activation functions behave when passing gradients backward.
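
As a rough illustration of gradient decay with depth, the sketch below chains scalar units together and multiplies the local derivatives, as the chain rule does. It is a deliberately simplified setup: all weights are fixed to 1, and the depth and starting input are arbitrary choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

depth = 10          # illustrative number of stacked units
start = 0.5         # illustrative positive input

# Chain of sigmoid units: multiply the local derivatives sigma'(.) layer by layer
a = start
grad_sigmoid = 1.0
for _ in range(depth):
    a = sigmoid(a)
    grad_sigmoid *= a * (1.0 - a)

# Chain of ReLU units on the same positive input: each local derivative is 1
r = start
grad_relu = 1.0
for _ in range(depth):
    grad_relu *= 1.0 if r > 0 else 0.0
    r = max(0.0, r)

print(f"sigmoid chain gradient: {grad_sigmoid:.2e}")
print(f"ReLU chain gradient:    {grad_relu:.2e}")
```

Each sigmoid factor is at most 0.25, so the product is bounded by $0.25^{\text{depth}}$, while the ReLU chain keeps a gradient of 1 as long as the units stay active.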

In [None]:
# Compare the activation functions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

x = np.linspace(-6, 6, 200)

activations = [
    ('ReLU', ReLU()),
    ('Leaky ReLU (α=0.1)', LeakyReLU(0.1)),
    ('Sigmoid', Sigmoid()),
    ('Tanh', Tanh()),
]

for idx, (name, act) in enumerate(activations):
    row, col = idx // 2, idx % 2

    y = act.forward(x)
    dy = act.backward(np.ones_like(x))

    # Function value and gradient
    ax = axes[row, col]
    ax.plot(x, y, 'b-', linewidth=2, label='f(x)')
    ax.plot(x, dy, 'r--', linewidth=2, label="f'(x)")
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_xlabel('x')
    ax.set_title(name)
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-2, 4)

# The four activations fill the left 2x2 block; hide the unused top-right panel
axes[0, 2].axis('off')

# Vanishing gradient demonstration
ax = axes[1, 2]

# Simulate how the gradient decays with depth, counting only the
# activation derivatives (weights are ignored)
depths = np.arange(1, 21)

sigmoid_grads = []
tanh_grads = []
relu_grads = []

for d in depths:
    # Sigmoid: maximum gradient = 0.25
    sigmoid_grads.append(0.25 ** d)
    # Tanh: maximum gradient = 1, but in practice it is usually < 1
    tanh_grads.append(0.6 ** d)  # assume an average gradient of 0.6
    # ReLU: gradient = 1 (for active units)
    relu_grads.append(1.0 ** d)
ax.semilogy(depths, sigmoid_grads, 'b-o', label='Sigmoid (0.25^d)')
ax.semilogy(depths, tanh_grads, 'g-o', label='Tanh (0.6^d)')
ax.semilogy(depths, relu_grads, 'r-o', label='ReLU (1^d)')
ax.set_xlabel('Network Depth')
ax.set_ylabel('Gradient Magnitude')
ax.set_title('Gradient Vanishing Problem')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== Activation Function Comparison ===")
print("\n| Activation | Output range | Pros | Cons |")
print("|------------|--------------|------|------|")
print("| ReLU | [0, ∞) | Fast, does not saturate for x > 0 | Dead ReLU problem |")
print("| Leaky ReLU | (-∞, ∞) | Avoids Dead ReLU | α must be tuned |")
print("| Sigmoid | (0, 1) | Output can be read as a probability | Vanishing gradient, not zero-centered |")
print("| Tanh | (-1, 1) | Zero-centered | Vanishing gradient |")

## Exercises

### Exercise 1: Implement ELU (Exponential Linear Unit)

$$\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}$$

**Hint**: in the backward pass, for $x \leq 0$, $\frac{d}{dx}[\alpha(e^x - 1)] = \alpha e^x$

In [None]:
class ELU:
    """
    ELU activation function

    ELU(x) = x           if x > 0
           = α(e^x - 1)  if x <= 0
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.cache = None

    def forward(self, x):
        """
        Forward pass
        """
        # Solution:
        self.cache = x
        # np.where evaluates both branches, so clamp the exponent at 0 to keep
        # exp from overflowing on large positive inputs
        out = np.where(x > 0, x, self.alpha * (np.exp(np.minimum(x, 0)) - 1))
        return out

    def backward(self, dout):
        """
        Backward pass

        For x > 0:  gradient = 1
        For x <= 0: gradient = α * e^x = ELU(x) + α
        """
        x = self.cache
        # Solution:
        # For x <= 0, d/dx [α(e^x - 1)] = α * e^x
        dx = dout * np.where(x > 0, 1, self.alpha * np.exp(np.minimum(x, 0)))
        return dx

# Test ELU
elu = ELU(alpha=1.0)
x = np.linspace(-5, 5, 100)
y_elu = elu.forward(x)
dx_elu = elu.backward(np.ones_like(x))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
relu_test = ReLU()
y_relu = relu_test.forward(x)
ax.plot(x, y_relu, 'b-', linewidth=2, label='ReLU')
ax.plot(x, y_elu, 'g-', linewidth=2, label='ELU (α=1)')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('ReLU vs ELU')
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[1]
dx_relu = relu_test.backward(np.ones_like(x))
ax.plot(x, dx_relu, 'b-', linewidth=2, label="ReLU'")
ax.plot(x, dx_elu, 'g-', linewidth=2, label="ELU'")
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_xlabel('x')
ax.set_ylabel("f'(x)")
ax.set_title('Gradient Comparison')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

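print("Advantages of ELU:")
print("1. Nonzero output in the negative region, so activations are closer to zero-mean than with ReLU")
print("2. Nonzero gradient in the negative region, which avoids the Dead ReLU problem")
print("3. Smooth saturation in the negative region, which provides noise robustness")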

In [None]:
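# Gradient check for ELU
def gradient_check_elu(elu, x, eps=1e-5):
    """
    Check the ELU gradient against central differences
    """
    y = elu.forward(x)
    loss = np.sum(y ** 2)  # test loss: sum of squares, so dL/dy = 2y
    dout = 2 * y
    dx_analytic = elu.backward(dout)

    # Numerical gradient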
    dx_numerical = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += eps
        elu_new = ELU(elu.alpha)
        y_plus = elu_new.forward(x_plus)
        loss_plus = np.sum(y_plus ** 2)
        
        x_minus = x.copy()
        x_minus[i] -= eps
        elu_new = ELU(elu.alpha)
        y_minus = elu_new.forward(x_minus)
        loss_minus = np.sum(y_minus ** 2)
        
        dx_numerical[i] = (loss_plus - loss_minus) / (2 * eps)
    
    rel_error = np.max(np.abs(dx_analytic - dx_numerical) / 
                       (np.abs(dx_analytic) + np.abs(dx_numerical) + 1e-8))
    print(f"ELU gradient check - max relative error: {rel_error:.2e}")
    print(f"Passed: {rel_error < 1e-5}")

x_test = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
elu_test = ELU(alpha=1.0)
gradient_check_elu(elu_test, x_test)

### Exercise 2: Implement a Complete Classification Network

Combine FC + ReLU + Softmax/CE to build a multi-layer classification network.

In [None]:
# First define the FullyConnected class (from the previous notebook)
class FullyConnected:
    def __init__(self, in_features, out_features, init='he'):
        if init == 'xavier':
            std = np.sqrt(2.0 / (in_features + out_features))
        elif init == 'he':
            std = np.sqrt(2.0 / in_features)
        else:
            std = 0.01
        
        self.W = np.random.randn(in_features, out_features) * std
        self.b = np.zeros(out_features)
        self.dW = None
        self.db = None
        self.cache = None
    
    def forward(self, X):
        self.cache = X
        return X @ self.W + self.b
    
    def backward(self, dY):
        X = self.cache
        self.dW = X.T @ dY
        self.db = np.sum(dY, axis=0)
        return dY @ self.W.T


class MultiLayerClassifier:
    """
    Multi-layer classifier

    Architecture (for two hidden layers): FC -> ReLU -> FC -> ReLU -> FC -> Softmax/CE
    """

    def __init__(self, input_dim, hidden_dims, num_classes):
        """
        Parameters
        ----------
        input_dim : int
            Input dimension
        hidden_dims : list of int
            Sizes of the hidden layers
        num_classes : int
            Number of classes
        """
        self.layers = []

        # Build the network
        dims = [input_dim] + hidden_dims + [num_classes]

        for i in range(len(dims) - 1):
            self.layers.append(FullyConnected(dims[i], dims[i+1], init='he'))
            if i < len(dims) - 2:  # no ReLU after the last layer
                self.layers.append(ReLU())

        self.loss_fn = SoftmaxCrossEntropy()

    def forward(self, X):
        """
        Forward pass (loss not included)
        """
        out = X
        for layer in self.layers:
            out = layer.forward(out)
        return out  # logits

    def loss(self, X, y):
        """
        Compute the loss
        """
        logits = self.forward(X)
        return self.loss_fn.forward(logits, y)

    def backward(self):
        """
        Backward pass
        """
        dout = self.loss_fn.backward()
        for layer in reversed(self.layers):
            dout = layer.backward(dout)

    def get_params_and_grads(self):
        """
        Return all parameters paired with their gradients
        """
        params_and_grads = []
        for layer in self.layers:
            if hasattr(layer, 'W'):
                params_and_grads.append((layer.W, layer.dW))
                params_and_grads.append((layer.b, layer.db))
        return params_and_grads

    def predict(self, X):
        """
        Predict class labels
        """
        logits = self.forward(X)
        return np.argmax(logits, axis=1)

# Test
net = MultiLayerClassifier(input_dim=10, hidden_dims=[32, 16], num_classes=5)
X = np.random.randn(8, 10)
y = np.random.randint(0, 5, 8)

loss = net.loss(X, y)
print(f"Initial loss: {loss:.4f}")

net.backward()
print(f"\nNumber of layers: {len(net.layers)}")
print("Layer structure:")
for i, layer in enumerate(net.layers):
    print(f"  {i}: {type(layer).__name__}")

In [None]:
# Train on a simple dataset

# Generate multi-class data
np.random.seed(42)
N_per_class = 100
num_classes = 3

# Generate spiral data
X_list = []
y_list = []

for k in range(num_classes):
    r = np.linspace(0.0, 1, N_per_class)
    t = np.linspace(k * 4, (k + 1) * 4, N_per_class) + np.random.randn(N_per_class) * 0.2
    X_list.append(np.column_stack([r * np.sin(t), r * np.cos(t)]))
    y_list.append(np.full(N_per_class, k))

X_train = np.vstack(X_list)
y_train = np.hstack(y_list).astype(int)

# Shuffle
perm = np.random.permutation(len(y_train))
X_train = X_train[perm]
y_train = y_train[perm]

# Visualize the data
plt.figure(figsize=(8, 6))
colors = ['red', 'green', 'blue']
for k in range(num_classes):
    mask = y_train == k
    plt.scatter(X_train[mask, 0], X_train[mask, 1], c=colors[k], label=f'Class {k}', alpha=0.5)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Spiral Dataset (3 classes)')
plt.legend()
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Train the network
np.random.seed(42)
net = MultiLayerClassifier(input_dim=2, hidden_dims=[100, 50], num_classes=3)

lr = 1.0
epochs = 1000
losses = []
accuracies = []

for epoch in range(epochs):
    # Forward + Loss
    loss = net.loss(X_train, y_train)
    losses.append(loss)
    
    # Accuracy
    pred = net.predict(X_train)
    acc = np.mean(pred == y_train)
    accuracies.append(acc)
    
    # Backward
    net.backward()
    
    # Update (SGD)
    for param, grad in net.get_params_and_grads():
        param -= lr * grad
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}, Loss: {loss:.4f}, Accuracy: {acc:.4f}")

print(f"\nFinal Loss: {losses[-1]:.4f}")
print(f"Final Accuracy: {accuracies[-1]:.4f}")

In [None]:
# Visualize the results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss curve
ax = axes[0]
ax.plot(losses)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training Loss')
ax.grid(True, alpha=0.3)

# Accuracy curve
ax = axes[1]
ax.plot(accuracies)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title('Training Accuracy')
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3)

# Decision boundary
ax = axes[2]

# Build a grid over the input space
x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Predict a class for every grid point
Z = net.predict(grid)
Z = Z.reshape(xx.shape)

# Draw the decision regions
ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')

# Plot the data points
for k in range(num_classes):
    mask = y_train == k
    ax.scatter(X_train[mask, 0], X_train[mask, 1], c=colors[k], label=f'Class {k}', 
               edgecolors='black', linewidth=0.5, alpha=0.7)

ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('Decision Boundary')
ax.legend()

plt.tight_layout()
plt.show()

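## Summary

In this notebook we covered:

1. **The role of activation functions**: they introduce nonlinearity, which lets multi-layer networks learn complex functions

2. **Common activation functions**:

| Function | Formula | Gradient | Notes |
|----------|---------|----------|-------|
| ReLU | $\max(0, x)$ | $1_{x>0}$ | Simple and efficient; units can die |
| Leaky ReLU | $\max(\alpha x, x)$ for $0 < \alpha < 1$ | $1$ or $\alpha$ | Avoids dead ReLU |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $\sigma(1-\sigma)$ | Probability-like output; vanishing gradient |
| Tanh | $\tanh(x)$ | $1-\tanh^2(x)$ | Zero-centered |
| ELU | $x$ or $\alpha(e^x-1)$ | $1$ or $\alpha e^x$ | Smooth; closer to zero-mean |

3. **Softmax + Cross-Entropy**:
   - Computed together for numerical stability
   - The gradient is very simple: $p - y$

4. **Vanishing gradient problem**: with Sigmoid/Tanh the gradient can decay exponentially with depth; ReLU mitigates this because its gradient is exactly 1 for positive inputs

### Next Steps

Next we will implement the **Conv2D** layer, the core building block of CNNs!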