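# 02 Fully Connected Layer

## Learning Objectives

1. Understand the forward pass of a fully connected (FC) layer in depth
2. Derive the FC layer's backward-pass formulas step by step, by expanding the indices
3. Implement a complete FC layer class, including Xavier/He initialization
4. Verify the gradients with a rigorous numerical gradient check

## Definition of the Fully Connected Layer

The fully connected layer (also called a dense layer or linear layer) is the most basic building block of a neural network.

$$\mathbf{Y} = \mathbf{X} \mathbf{W} + \mathbf{b}$$

where:
- $\mathbf{X}$: input, shape $(N, D)$; $N$ is the batch size and $D$ is the input dimension
- $\mathbf{W}$: weights, shape $(D, M)$; $M$ is the output dimension
- $\mathbf{b}$: bias, shape $(M,)$, broadcast across the $N$ rows
- $\mathbf{Y}$: output, shape $(N, M)$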

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
print("Fully Connected Layer module loaded!")

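## Part 1: Forward Pass

### Deriving the formula (index expansion)

Each element of the output is:

$$Y_{n,m} = \sum_{d=1}^{D} X_{n,d} W_{d,m} + b_m$$

That is, a weighted sum of the inputs plus a bias.

### Vectorized implementation

A single matrix multiplication computes all elements at once: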

```python
Y = X @ W + b
```
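
As a quick sanity check of the formula, here is a minimal worked example with hand-picked numbers. The names `X_small`, `W_small`, `b_small` and `Y_small` are illustrative only and do not appear elsewhere in the notebook.

```python
import numpy as np

X_small = np.array([[1.0, 2.0],
                    [3.0, 4.0]])          # shape (N=2, D=2)
W_small = np.array([[1.0, 0.0, -1.0],
                    [2.0, 1.0,  0.0]])    # shape (D=2, M=3)
b_small = np.array([0.5, 0.0, -0.5])      # shape (M=3,), broadcast over rows

Y_small = X_small @ W_small + b_small     # shape (N=2, M=3)

# Row 0: [1*1 + 2*2, 1*0 + 2*1, 1*(-1) + 2*0] + b = [5.5, 2.0, -1.5]
# Row 1: [3*1 + 4*2, 3*0 + 4*1, 3*(-1) + 4*0] + b = [11.5, 4.0, -3.5]
print(Y_small)
```

Each row of `Y_small` is the weighted sum from the index formula, with the same bias vector added to every row.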

In [None]:
def fc_forward_naive(X, W, b):
    """
    Forward pass of a fully connected layer (naive version, written to mirror the formula)
    
    Parameters
    ----------
    X : np.ndarray, shape (N, D)
    W : np.ndarray, shape (D, M)
    b : np.ndarray, shape (M,)
    
    Returns
    -------
    Y : np.ndarray, shape (N, M)
    """
    N, D = X.shape
    M = W.shape[1]
    
    Y = np.zeros((N, M))
    
    for n in range(N):
        for m in range(M):
            # Y[n, m] = sum_d X[n, d] * W[d, m] + b[m]
            total = 0
            for d in range(D):
                total += X[n, d] * W[d, m]
            Y[n, m] = total + b[m]
    
    return Y

def fc_forward_vectorized(X, W, b):
    """
    Forward pass of a fully connected layer (vectorized version)
    """
    return X @ W + b

# Check that the two versions agree
N, D, M = 4, 5, 3
X = np.random.randn(N, D)
W = np.random.randn(D, M)
b = np.random.randn(M)

Y_naive = fc_forward_naive(X, W, b)
Y_vec = fc_forward_vectorized(X, W, b)

print(f"Naive output shape: {Y_naive.shape}")
print(f"Vectorized output shape: {Y_vec.shape}")
print(f"Max absolute difference: {np.max(np.abs(Y_naive - Y_vec)):.2e}")

## Part 2: Backward Pass (Detailed Derivation)

Suppose we have a loss function $L$ and already know $\frac{\partial L}{\partial Y}$ (denoted $dY$, shape $(N, M)$).

We need to compute:
1. $\frac{\partial L}{\partial X}$ (shape $(N, D)$)
2. $\frac{\partial L}{\partial W}$ (shape $(D, M)$)
3. $\frac{\partial L}{\partial b}$ (shape $(M,)$)

### Deriving $\frac{\partial L}{\partial W}$

By the chain rule, the gradient of $L$ with respect to $W_{d,m}$ sums over every output element that could depend on $W_{d,m}$:

$$\frac{\partial L}{\partial W_{d,m}} = \sum_{n=1}^{N} \sum_{j=1}^{M} \frac{\partial L}{\partial Y_{n,j}} \frac{\partial Y_{n,j}}{\partial W_{d,m}}$$

Since $Y_{n,j} = \sum_{k} X_{n,k} W_{k,j} + b_j$, we have:

$$\frac{\partial Y_{n,j}}{\partial W_{d,m}} = \begin{cases} X_{n,d} & \text{if } j = m \\ 0 & \text{otherwise} \end{cases}$$

Substituting, only the $j = m$ term survives:

$$\frac{\partial L}{\partial W_{d,m}} = \sum_{n=1}^{N} \frac{\partial L}{\partial Y_{n,m}} \cdot X_{n,d} = \sum_{n=1}^{N} X_{n,d} \cdot dY_{n,m}$$

In matrix form: $\frac{\partial L}{\partial W} = X^T \cdot dY$

### Deriving $\frac{\partial L}{\partial X}$

Apply the chain rule again. $X_{n,d}$ only affects row $n$ of $Y$, so the sum runs over the output index $j$ alone:

$$\frac{\partial L}{\partial X_{n,d}} = \sum_{j=1}^{M} \frac{\partial L}{\partial Y_{n,j}} \frac{\partial Y_{n,j}}{\partial X_{n,d}}$$

Since $Y_{n,j} = \sum_{k} X_{n,k} W_{k,j} + b_j$, we have:

$$\frac{\partial Y_{n,j}}{\partial X_{n,d}} = W_{d,j}$$

Substituting:

$$\frac{\partial L}{\partial X_{n,d}} = \sum_{j=1}^{M} dY_{n,j} \cdot W_{d,j}$$

In matrix form: $\frac{\partial L}{\partial X} = dY \cdot W^T$

### Deriving $\frac{\partial L}{\partial b}$

Because $\frac{\partial Y_{n,m}}{\partial b_m} = 1$ for every sample $n$:

$$\frac{\partial L}{\partial b_m} = \sum_{n=1}^{N} \frac{\partial L}{\partial Y_{n,m}} \frac{\partial Y_{n,m}}{\partial b_m} = \sum_{n=1}^{N} dY_{n,m}$$

In vector form: $\frac{\partial L}{\partial b} = \sum_{n=1}^{N} dY_{n,:}$ (sum over axis 0)
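
The three index formulas can also be written almost verbatim with `np.einsum`, which makes the summed index explicit. The sketch below is only an illustration: the `*_demo` and `*_einsum` names are made up for this example, and the random inputs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((4, 5))    # (N, D)
W_demo = rng.standard_normal((5, 3))    # (D, M)
dY_demo = rng.standard_normal((4, 3))   # (N, M)

# dW[d, m] = sum_n X[n, d] * dY[n, m]
dW_einsum = np.einsum("nd,nm->dm", X_demo, dY_demo)
# dX[n, d] = sum_m dY[n, m] * W[d, m]
dX_einsum = np.einsum("nm,dm->nd", dY_demo, W_demo)
# db[m] = sum_n dY[n, m]
db_einsum = np.einsum("nm->m", dY_demo)

# Each einsum result should match the corresponding matrix form
print(np.allclose(dW_einsum, X_demo.T @ dY_demo))
print(np.allclose(dX_einsum, dY_demo @ W_demo.T))
print(np.allclose(db_einsum, dY_demo.sum(axis=0)))
```

Reading the subscript strings next to the derivation shows which index is summed away in each gradient: `n` for `dW` and `db`, `m` for `dX`.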

In [None]:
def fc_backward_naive(dY, X, W):
    """
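    Backward pass of a fully connected layer (naive version)
    
    Parameters
    ----------
    dY : np.ndarray, shape (N, M)
        Gradient of the loss with respect to the output
    X : np.ndarray, shape (N, D)
        Input from the forward pass (the cache)
    W : np.ndarray, shape (D, M)
        Weight matrix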
    
    Returns
    -------
    dX : np.ndarray, shape (N, D)
    dW : np.ndarray, shape (D, M)
    db : np.ndarray, shape (M,)
    """
    N, D = X.shape
    M = W.shape[1]
    
    dX = np.zeros((N, D))
    dW = np.zeros((D, M))
    db = np.zeros(M)
    
    # Compute dW: dW[d, m] = sum_n X[n, d] * dY[n, m]
    for d in range(D):
        for m in range(M):
            for n in range(N):
                dW[d, m] += X[n, d] * dY[n, m]
    
    # Compute dX: dX[n, d] = sum_m dY[n, m] * W[d, m]
    for n in range(N):
        for d in range(D):
            for m in range(M):
                dX[n, d] += dY[n, m] * W[d, m]
    
    # Compute db: db[m] = sum_n dY[n, m]
    for m in range(M):
        for n in range(N):
            db[m] += dY[n, m]
    
    return dX, dW, db

def fc_backward_vectorized(dY, X, W):
    """
    Backward pass of a fully connected layer (vectorized version)
    """
    dW = X.T @ dY           # (D, N) @ (N, M) = (D, M)
    dX = dY @ W.T           # (N, M) @ (M, D) = (N, D)
    db = np.sum(dY, axis=0) # (M,)
    
    return dX, dW, db

# Test
dY = np.random.randn(N, M)

dX_naive, dW_naive, db_naive = fc_backward_naive(dY, X, W)
dX_vec, dW_vec, db_vec = fc_backward_vectorized(dY, X, W)

print("=== Naive vs. vectorized ===")
print(f"dX difference: {np.max(np.abs(dX_naive - dX_vec)):.2e}")
print(f"dW difference: {np.max(np.abs(dW_naive - dW_vec)):.2e}")
print(f"db difference: {np.max(np.abs(db_naive - db_vec)):.2e}")

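## Part 3: Weight Initialization

Weight initialization matters a great deal for training neural networks. Poor initialization leads to:
- **Vanishing gradients**: activations shrink layer by layer, so gradients approach zero
- **Exploding gradients**: activations grow layer by layer, so gradients blow up

### Xavier Initialization (Glorot Initialization)

Suited to sigmoid/tanh activations (the second argument of $\mathcal{N}$ is the variance):

$$W \sim \mathcal{N}\left(0, \frac{1}{n_{in}}\right) \text{ or } W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$

### He Initialization

Suited to ReLU activations:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$

### Intuition

Suppose each element of the input $x$ has variance $\sigma^2$, and we want the output to have the same variance.

For $y = \sum_{i=1}^{n_{in}} w_i x_i$, assuming the $w_i$ and $x_i$ are independent with zero mean:

$$\text{Var}(y) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$$

To get $\text{Var}(y) = \text{Var}(x)$ we need $\text{Var}(w) = \frac{1}{n_{in}}$. ReLU zeroes out roughly half of its inputs, which halves the second moment passed to the next layer; He initialization compensates with the extra factor of 2.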
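
The variance relation can be checked empirically. The following is a minimal sketch under the stated assumptions (independent, zero-mean Gaussian weights and inputs); the layer width, sample count, and all `*_demo` names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in_demo, n_out_demo, n_samples = 256, 256, 2000

# Inputs with Var(x) = 1.5**2 = 2.25
x_demo = rng.standard_normal((n_samples, n_in_demo)) * 1.5

# Var(w) = 1 / n_in should preserve the variance
W_good = rng.standard_normal((n_in_demo, n_out_demo)) * np.sqrt(1.0 / n_in_demo)
ratio_good = np.var(x_demo @ W_good) / np.var(x_demo)

# std = 0.01 gives Var(w) = 1e-4, so the predicted ratio is n_in * 1e-4 = 0.0256
W_small_std = rng.standard_normal((n_in_demo, n_out_demo)) * 0.01
ratio_small = np.var(x_demo @ W_small_std) / np.var(x_demo)

print(f"Var(y)/Var(x) with Var(w)=1/n_in: {ratio_good:.3f}")
print(f"Var(y)/Var(x) with std(w)=0.01:   {ratio_small:.4f}")
```

The first ratio should come out close to 1 and the second close to $n_{in} \cdot 10^{-4}$, matching $\text{Var}(y) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$.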

In [None]:
def xavier_init(n_in, n_out):
    """
    Xavier initialization
    
    Parameters
    ----------
    n_in : int
        Input dimension
    n_out : int
        Output dimension
    
    Returns
    -------
    W : np.ndarray, shape (n_in, n_out)
    """
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_in, n_out) * std

def he_init(n_in, n_out):
    """
    He initialization (suited to ReLU)
    """
    std = np.sqrt(2.0 / n_in)
    return np.random.randn(n_in, n_out) * std

# Compare how different initializations affect the forward pass
def analyze_init(init_func, name, layers=10, hidden_dim=256):
    """
    Analyze how an initialization scheme affects activations in a deep network
    """
    x = np.random.randn(32, hidden_dim)  # batch of 32, hidden_dim features
    
    activations = [x]
    
    for i in range(layers):
        W = init_func(hidden_dim, hidden_dim)
        x = x @ W  # no activation function, purely linear
        activations.append(x)
    
    # Per-layer statistics
    means = [np.mean(a) for a in activations]
    stds = [np.std(a) for a in activations]
    
    return means, stds

# Baseline: Gaussian init with a small fixed std
def standard_init(n_in, n_out):
    return np.random.randn(n_in, n_out) * 0.01

# Try the different initializations
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for init_func, name, color in [
    (standard_init, 'Standard (0.01)', 'red'),
    (xavier_init, 'Xavier', 'blue'),
    (he_init, 'He', 'green')
]:
    means, stds = analyze_init(init_func, name)
    
    axes[0].plot(range(len(means)), means, 'o-', label=name, color=color)
    axes[1].plot(range(len(stds)), stds, 'o-', label=name, color=color)

axes[0].set_xlabel('Layer')
axes[0].set_ylabel('Mean')
axes[0].set_title('Activation Mean per Layer')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Layer')
axes[1].set_ylabel('Std')
axes[1].set_title('Activation Std per Layer')
axes[1].legend()
axes[1].set_yscale('log')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

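print("\nObservations:")
print("- Standard init (std=0.01): activations shrink toward zero within a few layers (vanishing signal)")
print("- Xavier init: the activation std stays roughly constant")
print("- He init: with no ReLU in this linear stack, the std grows by about sqrt(2) per layer")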
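
The experiment above uses a purely linear stack, which is the setting Xavier initialization is derived for. To see where He initialization fits, the sketch below repeats the idea with a ReLU after every layer and tracks the root-mean-square activation. `relu_stack_rms`, `xavier_std` and `he_std` are hypothetical helpers written for this illustration, and the depth, width and batch size are arbitrary.

```python
import numpy as np

def relu_stack_rms(weight_std_fn, layers=10, dim=256, batch=512, seed=0):
    """RMS activation after each (linear + ReLU) layer. Illustrative helper."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, dim))
    rms = []
    for _ in range(layers):
        W_layer = rng.standard_normal((dim, dim)) * weight_std_fn(dim, dim)
        h = np.maximum(0.0, h @ W_layer)          # ReLU
        rms.append(float(np.sqrt(np.mean(h ** 2))))
    return rms

def xavier_std(n_in, n_out):
    return np.sqrt(2.0 / (n_in + n_out))

def he_std(n_in, n_out):
    return np.sqrt(2.0 / n_in)

rms_xavier = relu_stack_rms(xavier_std)
rms_he = relu_stack_rms(he_std)

print(f"Xavier + ReLU, RMS at last layer: {rms_xavier[-1]:.4f}")
print(f"He + ReLU,     RMS at last layer: {rms_he[-1]:.4f}")
```

With ReLU in the loop, the Xavier-scaled stack should lose roughly a factor of $\sqrt{2}$ in RMS per layer, while the He-scaled stack should stay at about the same scale from layer to layer.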

## Part 4: A Complete FC Layer Class

In [None]:
class FullyConnected:
    """
    Fully connected (linear) layer
    
    Forward:  Y = X @ W + b
    Backward:
        dW = X.T @ dY
        db = sum(dY, axis=0)
        dX = dY @ W.T
    
    Parameters
    ----------
    in_features : int
        Input dimension
    out_features : int
        Output dimension
    init : str
        Initialization scheme: 'xavier', 'he', or 'normal'
    """
    
    def __init__(self, in_features, out_features, init='xavier'):
        self.in_features = in_features
        self.out_features = out_features
        
        # Weight initialization
        if init == 'xavier':
            std = np.sqrt(2.0 / (in_features + out_features))
        elif init == 'he':
            std = np.sqrt(2.0 / in_features)
        elif init == 'normal':
            std = 0.01
        else:
            # Fail loudly instead of silently falling back on a typo
            raise ValueError(f"Unknown init: {init!r}")
        
        self.W = np.random.randn(in_features, out_features) * std
        self.b = np.zeros(out_features)
        
        # Gradients (filled in by backward)
        self.dW = None
        self.db = None
        
        # Cache of the forward input, needed by backward
        self.cache = None
    
    def forward(self, X):
        """
        Forward pass
        
        Parameters
        ----------
        X : np.ndarray, shape (N, D)
            Input, with D == in_features
        
        Returns
        -------
        Y : np.ndarray, shape (N, M)
            Output, with M == out_features
        """
        self.cache = X
        Y = X @ self.W + self.b
        return Y
    
    def backward(self, dY):
        """
        Backward pass
        
        Parameters
        ----------
        dY : np.ndarray, shape (N, M)
            Gradient of the loss with respect to the output
        
        Returns
        -------
        dX : np.ndarray, shape (N, D)
            Gradient of the loss with respect to the input
        """
        X = self.cache
        
        # Compute the gradients
        self.dW = X.T @ dY
        self.db = np.sum(dY, axis=0)
        dX = dY @ self.W.T
        
        return dX
    
    def __repr__(self):
        return f"FullyConnected({self.in_features}, {self.out_features})"

# Test
fc = FullyConnected(in_features=10, out_features=5, init='he')
print(f"Layer: {fc}")
print(f"Weight shape: {fc.W.shape}")
print(f"Bias shape: {fc.b.shape}")

# Forward pass test
X = np.random.randn(3, 10)
Y = fc.forward(X)
print(f"\nInput shape: {X.shape}")
print(f"Output shape: {Y.shape}")

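## Part 5: Rigorous Gradient Check

The check compares each analytic gradient $g_a$ with the central-difference estimate $g_n = \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$, computed one parameter at a time, and reports the relative error $\frac{|g_a - g_n|}{|g_a| + |g_n|}$.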

In [None]:
def numerical_gradient(loss_fn, arr, eps=1e-5):
    """
    Central-difference estimate of d loss_fn() / d arr.
    
    Each entry of arr is perturbed in place by +/- eps and then restored.
    """
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old_val = arr[idx]
        
        arr[idx] = old_val + eps
        loss_plus = loss_fn()
        
        arr[idx] = old_val - eps
        loss_minus = loss_fn()
        
        arr[idx] = old_val  # restore
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def gradient_check_fc(fc_layer, X, eps=1e-5, verbose=True):
    """
    Rigorous gradient check for a fully connected layer
    
    Uses L = sum(Y^2) as the test loss, so dL/dY = 2Y
    """
    # Forward and backward pass to obtain the analytic gradients
    Y = fc_layer.forward(X)
    dY = 2 * Y
    dX = fc_layer.backward(dY)
    
    X_test = X.copy()  # perturb a copy so the caller's X is untouched
    
    # (name, analytic gradient, array to perturb, input fed to forward)
    checks = [
        ("dW", fc_layer.dW, fc_layer.W, X),
        ("db", fc_layer.db, fc_layer.b, X),
        ("dX", dX, X_test, X_test),
    ]
    
    all_passed = True
    for name, analytic, target, inputs in checks:
        numerical = numerical_gradient(
            lambda: np.sum(fc_layer.forward(inputs) ** 2), target, eps)
        
        diff = np.abs(analytic - numerical)
        rel_error = np.max(diff / (np.abs(analytic) + np.abs(numerical) + 1e-8))
        passed = bool(rel_error <= 1e-4)
        all_passed = all_passed and passed
        
        if verbose:
            print(f"=== Checking {name} ===")
            print(f"  Max absolute error: {np.max(diff):.2e}")
            print(f"  Max relative error: {rel_error:.2e}")
            print("  ✓ Gradient check passed" if passed else "  ❌ Gradient check failed!")
            print()
    
    return all_passed

# Run the gradient check
fc = FullyConnected(10, 5)
X = np.random.randn(3, 10)
passed = gradient_check_fc(fc, X)
print(f"Overall result: {'all passed ✓' if passed else 'errors found ✗'}")
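
The gradient check uses a central difference rather than a one-sided one. The short sketch below illustrates why, on a scalar function with a known derivative; `f_demo` and the chosen point are arbitrary and only serve this illustration.

```python
def f_demo(x):
    """Test function with known derivative f'(x) = 3 x^2."""
    return x ** 3

x0, eps_demo = 1.0, 1e-5
exact = 3 * x0 ** 2

# One-sided (forward) difference: truncation error is O(eps)
forward_diff = (f_demo(x0 + eps_demo) - f_demo(x0)) / eps_demo
# Two-sided (central) difference: truncation error is O(eps^2)
central_diff = (f_demo(x0 + eps_demo) - f_demo(x0 - eps_demo)) / (2 * eps_demo)

err_forward = abs(forward_diff - exact)
err_central = abs(central_diff - exact)

print(f"forward-difference error: {err_forward:.2e}")
print(f"central-difference error: {err_central:.2e}")
```

For the same `eps`, the central difference should be several orders of magnitude more accurate, which is what makes a tolerance as tight as `1e-4` on the relative error practical.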

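## Exercises

### Exercise 1: Implement an FC layer with regularization

Add L2 regularization (weight decay) to the gradient computation:

$$L_{total} = L_{data} + \frac{\lambda}{2} \|W\|^2$$

where $\|W\|^2 = \sum_{d,m} W_{d,m}^2$.

**Hint**: the gradient of the regularization term with respect to $W$ is $\lambda W$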

In [None]:
class FullyConnectedWithL2:
    """
    Fully connected layer with L2 regularization
    """
    
    def __init__(self, in_features, out_features, weight_decay=0.0, init='xavier'):
        self.in_features = in_features
        self.out_features = out_features
        self.weight_decay = weight_decay
        
        # Initialization
        if init == 'xavier':
            std = np.sqrt(2.0 / (in_features + out_features))
        elif init == 'he':
            std = np.sqrt(2.0 / in_features)
        else:
            std = 0.01
        
        self.W = np.random.randn(in_features, out_features) * std
        self.b = np.zeros(out_features)
        
        self.dW = None
        self.db = None
        self.cache = None
    
    def forward(self, X):
        """
        Forward pass
        """
        self.cache = X
        return X @ self.W + self.b
    
    def backward(self, dY):
        """
        Backward pass (with L2 regularization)
        
        The gradient of the term (lambda/2) * ||W||^2 with respect to W is lambda * W
        """
        X = self.cache
        
        # Data gradients
        dW_data = X.T @ dY
        self.db = np.sum(dY, axis=0)
        dX = dY @ self.W.T
        
        # Add the regularization gradient
        # Solution: dW_total = dW_data + weight_decay * W
        self.dW = dW_data + self.weight_decay * self.W
        
        return dX
    
    def get_regularization_loss(self):
        """
        Return the regularization loss (lambda/2) * ||W||^2
        """
        return 0.5 * self.weight_decay * np.sum(self.W ** 2)

# Test
fc_l2 = FullyConnectedWithL2(10, 5, weight_decay=0.1)
X = np.random.randn(3, 10)

# Forward pass
Y = fc_l2.forward(X)

# Stand-in loss function
data_loss = np.sum(Y ** 2)
reg_loss = fc_l2.get_regularization_loss()
total_loss = data_loss + reg_loss

print(f"Data loss: {data_loss:.4f}")
print(f"Regularization loss: {reg_loss:.4f}")
print(f"Total loss: {total_loss:.4f}")

# Backward pass
dY = 2 * Y
dX = fc_l2.backward(dY)

print(f"\nGradient shapes: dW={fc_l2.dW.shape}, db={fc_l2.db.shape}, dX={dX.shape}")

In [None]:
# Verify the gradients with regularization
def gradient_check_fc_l2(fc_layer, X, eps=1e-5):
    """
    Gradient check for the FC layer with L2 regularization
    """
    # Forward + backward
    Y = fc_layer.forward(X)
    data_loss = np.sum(Y ** 2)
    reg_loss = fc_layer.get_regularization_loss()
    total_loss = data_loss + reg_loss
    
    dY = 2 * Y
    fc_layer.backward(dY)
    
    # Numerical gradient (of total_loss)
    print("=== Checking dW with L2 regularization ===")
    
    dW_numerical = np.zeros_like(fc_layer.W)
    
    for i in range(fc_layer.W.shape[0]):
        for j in range(fc_layer.W.shape[1]):
            old_val = fc_layer.W[i, j]
            
            fc_layer.W[i, j] = old_val + eps
            Y_plus = fc_layer.forward(X)
            loss_plus = np.sum(Y_plus ** 2) + fc_layer.get_regularization_loss()
            
            fc_layer.W[i, j] = old_val - eps
            Y_minus = fc_layer.forward(X)
            loss_minus = np.sum(Y_minus ** 2) + fc_layer.get_regularization_loss()
            
            fc_layer.W[i, j] = old_val
            
            dW_numerical[i, j] = (loss_plus - loss_minus) / (2 * eps)
    
    rel_error = np.max(np.abs(fc_layer.dW - dW_numerical) / 
                       (np.abs(fc_layer.dW) + np.abs(dW_numerical) + 1e-8))
    print(f"  Max relative error: {rel_error:.2e}")
    print(f"  Passed: {rel_error < 1e-4}")

gradient_check_fc_l2(fc_l2, X)

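### Exercise 2: Implement Dropout

Dropout is a regularization technique that randomly "drops" some neurons during training.

**Training**: set each neuron to 0 with probability $p$ and divide the surviving ones by $(1-p)$ (rescale), so the expected activation is unchanged

**Testing**: do nothing

**Hint**: in the backward pass, the gradient must go through the same mask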
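
To see why the rescale by $(1-p)$ is needed, the sketch below compares the mean activation with and without it on a large constant input. The drop probability, input value, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.3
x_const = np.full(100_000, 2.0)

# True with probability 1 - p: these units are kept
keep_mask = rng.random(x_const.shape) > p_drop

out_plain = x_const * keep_mask                      # no rescale
out_inverted = x_const * keep_mask / (1 - p_drop)    # rescaled ("inverted" dropout)

print(f"input mean:            {x_const.mean():.3f}")
print(f"mean without rescale:  {out_plain.mean():.3f}")
print(f"mean with rescale:     {out_inverted.mean():.3f}")
```

Without the rescale the mean should drop to about $(1-p)$ times the input; with it, the mean should stay near the input value, which is why the test-time pass can leave activations untouched.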

In [None]:
class Dropout:
    """
    Dropout layer (inverted dropout)
    
    Parameters
    ----------
    p : float
        Drop probability, in [0, 1)
    """
    
    def __init__(self, p=0.5):
        if not 0.0 <= p < 1.0:
            # p == 1 would divide by zero in the rescale
            raise ValueError(f"p must be in [0, 1), got {p}")
        self.p = p
        self.mask = None
        self.training = True
    
    def forward(self, X):
        """
        Forward pass
        
        Parameters
        ----------
        X : np.ndarray
            Input of any shape
        
        Returns
        -------
        out : np.ndarray
        """
        if self.training:
            # Solution:
            # 1. Draw a random mask (True with probability 1-p)
            # 2. Apply the mask to the input
            # 3. Rescale (divide by 1-p)
            self.mask = (np.random.rand(*X.shape) > self.p)
            out = X * self.mask / (1 - self.p)
        else:
            out = X
        
        return out
    
    def backward(self, dout):
        """
        Backward pass
        """
        if self.training:
            # Solution: the gradient goes through the same mask and rescale
            dX = dout * self.mask / (1 - self.p)
        else:
            dX = dout
        
        return dX
    
    def train(self):
        self.training = True
    
    def eval(self):
        self.training = False

# Test Dropout
np.random.seed(42)
dropout = Dropout(p=0.5)

X = np.ones((2, 5))
print("Input (all ones):")
print(X)

# Training mode
dropout.train()
out_train = dropout.forward(X)
print("\nTraining-mode output (p=0.5):")
print(out_train)
print(f"Output mean: {np.mean(out_train):.2f} (1.0 in expectation; noisy with only 10 elements)")

# Test mode
dropout.eval()
out_test = dropout.forward(X)
print("\nTest-mode output:")
print(out_test)

# Backward pass test
dropout.train()
_ = dropout.forward(X)  # draw a fresh mask
dout = np.ones_like(X)
dX = dropout.backward(dout)
print("\nGradient from the backward pass:")
print(dX)

### Exercise 3: Train a network with FC + Dropout

Train a simple classifier built from FC + Dropout layers and observe how Dropout affects overfitting.

In [None]:
# Generate simple classification data
np.random.seed(42)
N = 100  # number of samples

# Data for the two classes
X_class0 = np.random.randn(N//2, 2) + np.array([2, 2])
X_class1 = np.random.randn(N//2, 2) + np.array([-2, -2])

X_train = np.vstack([X_class0, X_class1])
y_train = np.hstack([np.zeros(N//2), np.ones(N//2)]).astype(int)

# Shuffle
perm = np.random.permutation(N)
X_train = X_train[perm]
y_train = y_train[perm]

# Visualize
plt.figure(figsize=(6, 6))
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], c='blue', label='Class 0')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], c='red', label='Class 1')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Training Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

In [None]:
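# Sigmoid layer (for binary classification)
class Sigmoid:
    def __init__(self):
        self.cache = None
    
    def forward(self, X):
        # Numerically stable: exp is only called on non-positive values, so it
        # cannot overflow (np.where alone would still evaluate both branches).
        z = np.exp(-np.abs(X))
        out = np.where(X >= 0, 1 / (1 + z), z / (1 + z))
        self.cache = out
        return out
    
    def backward(self, dout):
        return dout * self.cache * (1 - self.cache)

# Network for training (optionally with Dropout)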
class ClassifierWithDropout:
    def __init__(self, use_dropout=True, p=0.5):
        self.fc1 = FullyConnected(2, 16, init='he')
        self.sigmoid1 = Sigmoid()
        self.dropout1 = Dropout(p=p) if use_dropout else None
        self.fc2 = FullyConnected(16, 1, init='he')
        self.sigmoid2 = Sigmoid()
        
        self.use_dropout = use_dropout
    
    def train_mode(self):
        if self.dropout1:
            self.dropout1.train()
    
    def eval_mode(self):
        if self.dropout1:
            self.dropout1.eval()
    
    def forward(self, X):
        h = self.fc1.forward(X)
        h = self.sigmoid1.forward(h)
        if self.use_dropout:
            h = self.dropout1.forward(h)
        h = self.fc2.forward(h)
        out = self.sigmoid2.forward(h)
        return out
    
    def backward(self, dout):
        dout = self.sigmoid2.backward(dout)
        dout = self.fc2.backward(dout)
        if self.use_dropout:
            dout = self.dropout1.backward(dout)
        dout = self.sigmoid1.backward(dout)
        dout = self.fc1.backward(dout)
        return dout
    
    def get_params_and_grads(self):
        return [
            (self.fc1.W, self.fc1.dW),
            (self.fc1.b, self.fc1.db),
            (self.fc2.W, self.fc2.dW),
            (self.fc2.b, self.fc2.db),
        ]

def train_classifier(model, X, y, epochs=500, lr=0.5):
    """
    Train a binary classifier with full-batch gradient descent
    """
    losses = []
    
    for epoch in range(epochs):
        model.train_mode()
        
        # Forward pass
        y_pred = model.forward(X)
        
        # Binary cross-entropy loss
        eps = 1e-8
        loss = -np.mean(y.reshape(-1, 1) * np.log(y_pred + eps) + 
                        (1 - y.reshape(-1, 1)) * np.log(1 - y_pred + eps))
        losses.append(loss)
        
        # Gradient of the mean BCE loss with respect to y_pred
        dout = (y_pred - y.reshape(-1, 1)) / (y_pred * (1 - y_pred) + eps) / len(y)
        
        # Backward pass
        model.backward(dout)
        
        # Update the parameters in place
        for param, grad in model.get_params_and_grads():
            param -= lr * grad
    
    return losses

# Train two models: without and with Dropout
np.random.seed(42)
model_no_dropout = ClassifierWithDropout(use_dropout=False)
losses_no_dropout = train_classifier(model_no_dropout, X_train, y_train, epochs=1000)

np.random.seed(42)
model_with_dropout = ClassifierWithDropout(use_dropout=True, p=0.5)
losses_with_dropout = train_classifier(model_with_dropout, X_train, y_train, epochs=1000)

# Compare
plt.figure(figsize=(10, 4))
plt.plot(losses_no_dropout, label='No Dropout', alpha=0.7)
plt.plot(losses_with_dropout, label='With Dropout (p=0.5)', alpha=0.7)
plt.xlabel('Epoch')
plt.ylabel('BCE Loss')
plt.title('Training Loss Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

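print(f"Final loss without Dropout: {losses_no_dropout[-1]:.4f}")
print(f"Final loss with Dropout: {losses_with_dropout[-1]:.4f}")
print("\nNote: a higher training loss with Dropout is expected, because units are randomly dropped during training.")
print("The benefit of Dropout (less overfitting) only shows up on a validation set.")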
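
Since the benefit of Dropout only shows up on held-out data, one way to follow up is to draw a fresh validation set from the same two blobs and measure accuracy on it. The sketch below is a self-contained illustration: `make_blobs_demo` and `accuracy_demo` are hypothetical helpers, and a simple stand-in predictor is used so the block runs without the trained models.

```python
import numpy as np

def make_blobs_demo(n_per_class, rng):
    """Two Gaussian blobs centred at (2, 2) and (-2, -2), like the training data."""
    X0 = rng.standard_normal((n_per_class, 2)) + np.array([2.0, 2.0])
    X1 = rng.standard_normal((n_per_class, 2)) + np.array([-2.0, -2.0])
    X = np.vstack([X0, X1])
    y = np.hstack([np.zeros(n_per_class), np.ones(n_per_class)]).astype(int)
    return X, y

def accuracy_demo(y_prob, y_true, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_hat = (np.asarray(y_prob).reshape(-1) >= threshold).astype(int)
    return float(np.mean(y_hat == y_true))

X_val_demo, y_val_demo = make_blobs_demo(50, np.random.default_rng(123))

# Stand-in predictor so this block runs by itself: class 1 lives near (-2, -2),
# so a negative coordinate sum is scored as probability 1.
y_prob_standin = (X_val_demo.sum(axis=1) < 0).astype(float)
val_acc_standin = accuracy_demo(y_prob_standin, y_val_demo)
print(f"Stand-in validation accuracy: {val_acc_standin:.2f}")
```

In the notebook, the stand-in would be replaced by a trained model: call `model.eval_mode()` first so Dropout is disabled, then pass `model.forward(X_val_demo)` to `accuracy_demo`. On data this easy to separate, both models may well score about the same, so a harder dataset may be needed to see a gap.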

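## Summary

In this notebook we covered the following in depth:

1. **The math of the fully connected layer**:
   - Forward: $Y = XW + b$
   - Backward: $dW = X^T \cdot dY$, $db = \sum_n dY_{n,:}$, $dX = dY \cdot W^T$

2. **Why weight initialization matters**:
   - Xavier initialization: suited to sigmoid/tanh
   - He initialization: suited to ReLU

3. **Regularization techniques**:
   - L2 regularization: add $\lambda W$ to the weight gradient
   - Dropout: randomly drop neurons during training

4. **Gradient checking**: verify analytic gradients against numerical differentiation

### Key formulas

| Operation | Forward | Backward (parameters) | Backward (input) |
|------|------|----------------|----------------|
| FC | $Y = XW + b$ | $dW = X^T dY$, $db = \sum_n dY_{n,:}$ | $dX = dY W^T$ |
| L2 Reg | - | $dW \mathrel{+}= \lambda W$ | - |
| Dropout | $Y = X \odot m / (1-p)$ | - | $dX = dY \odot m / (1-p)$ |

### Next steps

Next we will implement the full forward and backward passes of the activation functions (ReLU, Sigmoid, Softmax) and the loss function (Cross-Entropy).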