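# 01 Backpropagation Basics

## Learning Objectives

1. Understand what a computational graph is
2. Master the chain rule as it is used in backpropagation
3. Derive gradients, moving from a one-dimensional scalar example to the vectorized version
4. Verify analytic gradients with numerical differentiation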

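## Why Does Backpropagation Matter?

Training a neural network comes down to **minimizing a loss function**. To do that, we need the gradient of the loss with respect to every parameter, and then we update each parameter in the direction opposite to its gradient.

Backpropagation is an efficient algorithm for computing these gradients: it applies the **chain rule** to pass gradients backward from the output to every parameter.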
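
To make "update each parameter against its gradient" concrete, here is a minimal sketch of gradient descent on a one-parameter toy loss. The loss $L(w) = (w - 3)^2$, the starting point, and the step size are all illustrative assumptions, not part of the notebook's model.

```python
# Hypothetical toy loss L(w) = (w - 3)^2 with its hand-derived gradient.
def toy_loss(w):
    return (w - 3.0) ** 2

def toy_grad(w):
    # dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w_toy = 0.0      # illustrative starting point
toy_lr = 0.1     # illustrative learning rate
for _ in range(100):
    w_toy -= toy_lr * toy_grad(w_toy)   # step against the gradient

print(f"w after 100 steps: {w_toy:.6f}, loss: {toy_loss(w_toy):.2e}")
```

Each step shrinks the distance to the minimum at $w = 3$ by a constant factor, so the loop converges. For a network, the only hard part is obtaining the gradient, which is what backpropagation provides.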

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
print("Backpropagation Basics loaded successfully!")

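## Part 1: Computational Graph

### What Is a Computational Graph?

A computational graph represents a mathematical computation as a directed graph:
- **Node**: a variable or an operation
- **Edge**: the direction in which data flows

### A Simple Example

Consider a simple function: $f(x, y, z) = (x + y) \cdot z$

We can break it into two steps:
1. $q = x + y$ (addition)
2. $f = q \cdot z$ (multiplication)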

```
     x ──┐
         ├──(+)── q ──┐
     y ──┘            ├──(×)── f
     z ───────────────┘
```

In [None]:
# Forward pass
def forward_example(x, y, z):
    """
    Compute f(x, y, z) = (x + y) * z.
    Also return the intermediate value for use in the backward pass.
    """
    q = x + y       # addition
    f = q * z       # multiplication
    return f, q     # output and intermediate value

# Test
x, y, z = 2.0, 3.0, 4.0
f, q = forward_example(x, y, z)
print(f"x = {x}, y = {y}, z = {z}")
print(f"q = x + y = {q}")
print(f"f = q * z = {f}")

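## Part 2: The Chain Rule

### Core Formula

If $y = g(x)$ and $z = f(y)$, then:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$$

**Intuition**: the gradient with respect to a node's input is the "local gradient" multiplied by the "upstream gradient".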
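
The formula can be checked on a small composite function. The sketch below assumes a hypothetical pair $y = g(x) = x^2$ and $z = f(y) = \sin(y)$, chosen only for illustration, and compares the chain-rule product against a central difference.

```python
import math

def chain_rule_demo(x0, h=1e-6):
    """Compare the chain-rule gradient of z = sin(x^2) with a central difference."""
    y0 = x0 ** 2                 # forward: y = g(x)
    dz_dy = math.cos(y0)         # upstream gradient dz/dy
    dy_dx = 2.0 * x0             # local gradient dy/dx
    analytic = dz_dy * dy_dx     # chain rule

    numeric = (math.sin((x0 + h) ** 2) - math.sin((x0 - h) ** 2)) / (2 * h)
    return analytic, numeric

analytic_demo, numeric_demo = chain_rule_demo(1.5)
print(f"chain rule: {analytic_demo:.8f}, numerical: {numeric_demo:.8f}")
```

Everything is kept inside a function so that the notebook's own `x`, `y`, and `z` are left untouched.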

### Applying It to the Computational Graph

For the example above, $f = (x + y) \cdot z$:

**Forward pass**:
- $q = x + y$
- $f = q \cdot z$

**Backward pass** (we want $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$):

1. **Start at the output**: $\frac{\partial f}{\partial f} = 1$

2. Local gradients of the **multiplication node** ($f = q \cdot z$):
   - $\frac{\partial f}{\partial q} = z$
   - $\frac{\partial f}{\partial z} = q$

3. Local gradients of the **addition node** ($q = x + y$):
   - $\frac{\partial q}{\partial x} = 1$
   - $\frac{\partial q}{\partial y} = 1$

4. **Apply the chain rule**:
   - $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = z$
   - $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y} = z \cdot 1 = z$
   - $\frac{\partial f}{\partial z} = q$

In [None]:
# Backward pass
def backward_example(x, y, z, q):
    """
    Compute the gradients of f = (x + y) * z with respect to x, y, and z.

    Parameters
    ----------
    x, y, z : input values
    q : intermediate value (x + y)

    Returns
    -------
    df_dx, df_dy, df_dz : gradients
    """
    # Start at the output, where the gradient is 1
    df_df = 1.0

    # Backward through the multiplication node
    # f = q * z
    # ∂f/∂q = z, ∂f/∂z = q
    df_dq = z * df_df
    df_dz = q * df_df

    # Backward through the addition node
    # q = x + y
    # ∂q/∂x = 1, ∂q/∂y = 1
    df_dx = 1.0 * df_dq
    df_dy = 1.0 * df_dq

    return df_dx, df_dy, df_dz

# Test
df_dx, df_dy, df_dz = backward_example(x, y, z, q)
print(f"∂f/∂x = {df_dx} (should be z = {z})")
print(f"∂f/∂y = {df_dy} (should be z = {z})")
print(f"∂f/∂z = {df_dz} (should be q = {q})")

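### Numerical Verification

We use numerical differentiation (a central difference) to check that our analytic gradients are correct: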

$$\frac{\partial f}{\partial x} \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$$

In [None]:
def numerical_gradient(f, x, eps=1e-5):
    """
    Numerical gradient of a scalar function f at x (central difference).

    Parameters
    ----------
    f : callable
        Scalar function
    x : float
        Point at which to evaluate the gradient
    eps : float
        Small perturbation

    Returns
    -------
    grad : float
        Numerical gradient
    """
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Check ∂f/∂x
f_x = lambda x_: (x_ + y) * z
num_grad_x = numerical_gradient(f_x, x)
print(f"Analytic gradient ∂f/∂x = {df_dx}")
print(f"Numerical gradient ∂f/∂x ≈ {num_grad_x}")
print(f"Relative error: {abs(df_dx - num_grad_x) / (abs(df_dx) + 1e-8):.2e}")

# Check ∂f/∂y
f_y = lambda y_: (x + y_) * z
num_grad_y = numerical_gradient(f_y, y)
print(f"\nAnalytic gradient ∂f/∂y = {df_dy}")
print(f"Numerical gradient ∂f/∂y ≈ {num_grad_y}")

# Check ∂f/∂z
f_z = lambda z_: (x + y) * z_
num_grad_z = numerical_gradient(f_z, z)
print(f"\nAnalytic gradient ∂f/∂z = {df_dz}")
print(f"Numerical gradient ∂f/∂z ≈ {num_grad_z}")

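## Part 3: Backpropagation for Linear Regression

Now let us work through a more practical example: computing the gradients for linear regression.

### Model

$$\hat{y} = wx + b$$

### Loss Function (Single Sample)

$$L = (\hat{y} - y)^2 = (wx + b - y)^2$$

### Computational Graph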

```
w ──┐
    ├──(×)── p ──┐
x ──┘            ├──(+)── q ──┐
b ───────────────┘            ├──(-)── r ──(²)── L
y ────────────────────────────┘
```

where:
- $p = wx$
- $q = p + b = wx + b = \hat{y}$
- $r = q - y = \hat{y} - y$
- $L = r^2$

### Deriving the Gradients by Hand

Using the chain rule:

1. $\frac{\partial L}{\partial r} = 2r = 2(\hat{y} - y)$

2. $\frac{\partial L}{\partial q} = \frac{\partial L}{\partial r} \cdot \frac{\partial r}{\partial q} = 2r \cdot 1 = 2(\hat{y} - y)$

3. $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial b} = 2(\hat{y} - y) \cdot 1 = 2(\hat{y} - y)$

4. $\frac{\partial L}{\partial p} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial p} = 2(\hat{y} - y) \cdot 1 = 2(\hat{y} - y)$

5. $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial w} = 2(\hat{y} - y) \cdot x$
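
As a quick sanity check of the five steps above, here is the same derivation with concrete numbers. The values $w = 2$, $b = 1$, $x = 3$, $y = 10$ are chosen only for illustration.

$$\hat{y} = 2 \cdot 3 + 1 = 7, \qquad r = \hat{y} - y = -3, \qquad L = r^2 = 9$$

$$\frac{\partial L}{\partial r} = 2r = -6, \qquad \frac{\partial L}{\partial b} = -6, \qquad \frac{\partial L}{\partial w} = -6 \cdot 3 = -18$$

Both gradients are negative, so a gradient-descent step increases $w$ and $b$, which moves $\hat{y}$ up toward $y$.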

In [None]:
class LinearRegressionBackprop:
    """
    Forward/backward pass for linear regression, written as a computational graph
    """

    def __init__(self):
        # Initialize parameters
        self.w = np.random.randn()
        self.b = np.random.randn()

        # Gradients
        self.dw = 0.0
        self.db = 0.0

        # Cache (used by backward)
        self.cache = {}

    def forward(self, x, y):
        """
        Forward pass

        Parameters
        ----------
        x : float
            Input
        y : float
            Ground-truth label

        Returns
        -------
        loss : float
            Squared-error loss for this sample
        """
        # Compute the prediction
        p = self.w * x      # p = wx
        q = p + self.b      # q = wx + b = y_hat
        r = q - y           # r = y_hat - y
        L = r ** 2          # L = (y_hat - y)^2

        # Store intermediate values for backward
        self.cache = {'x': x, 'y': y, 'p': p, 'q': q, 'r': r}

        return L

    def backward(self):
        """
        Backward pass: compute dw and db
        """
        x = self.cache['x']
        r = self.cache['r']

        # Propagate backward from the output
        dL_dL = 1.0

        # L = r^2
        # ∂L/∂r = 2r
        dL_dr = 2 * r * dL_dL

        # r = q - y
        # ∂r/∂q = 1
        dL_dq = 1.0 * dL_dr

        # q = p + b
        # ∂q/∂p = 1, ∂q/∂b = 1
        dL_dp = 1.0 * dL_dq
        dL_db = 1.0 * dL_dq

        # p = w * x
        # ∂p/∂w = x
        dL_dw = x * dL_dp

        self.dw = dL_dw
        self.db = dL_db

        return self.dw, self.db

# Test
model = LinearRegressionBackprop()
model.w = 2.0
model.b = 1.0

x, y = 3.0, 10.0  # the model predicts 2*3 + 1 = 7, but the label is y = 10
loss = model.forward(x, y)
dw, db = model.backward()

print(f"w = {model.w}, b = {model.b}")
print(f"x = {x}, y = {y}")
print(f"y_hat = w*x + b = {model.w * x + model.b}")
print(f"Loss = (y_hat - y)^2 = {loss}")
print(f"\nAnalytic gradients:")
print(f"∂L/∂w = {dw}")
print(f"∂L/∂b = {db}")

In [None]:
# Numerical verification
eps = 1e-5

# Check dw
model_plus = LinearRegressionBackprop()
model_plus.w = model.w + eps
model_plus.b = model.b
loss_plus = model_plus.forward(x, y)

model_minus = LinearRegressionBackprop()
model_minus.w = model.w - eps
model_minus.b = model.b
loss_minus = model_minus.forward(x, y)

num_dw = (loss_plus - loss_minus) / (2 * eps)
print(f"Analytic ∂L/∂w = {dw}")
print(f"Numerical ∂L/∂w ≈ {num_dw}")
print(f"Relative error: {abs(dw - num_dw) / (abs(dw) + 1e-8):.2e}")

# Check db
model_plus.w = model.w
model_plus.b = model.b + eps
loss_plus = model_plus.forward(x, y)

model_minus.w = model.w
model_minus.b = model.b - eps
loss_minus = model_minus.forward(x, y)

num_db = (loss_plus - loss_minus) / (2 * eps)
print(f"\nAnalytic ∂L/∂b = {db}")
print(f"Numerical ∂L/∂b ≈ {num_db}")
print(f"Relative error: {abs(db - num_db) / (abs(db) + 1e-8):.2e}")

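## Part 4: The Vectorized Version

In real neural networks we work with **vectors and matrices**, not scalars. Let us derive the vectorized gradients.

### Setup

- **Input**: $X$ has shape $(N, D)$, where $N$ is the number of samples and $D$ is the feature dimension
- **Weights**: $W$ has shape $(D, M)$, where $M$ is the output dimension
- **Bias**: $b$ has shape $(M,)$
- **Output**: $Y = XW + b$, with shape $(N, M)$

### Loss Function (MSE as an Example)

$$L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} (Y_{ij} - T_{ij})^2$$

where $T$ holds the target values.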
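
Differentiating this loss element-wise gives $\frac{\partial L}{\partial Y_{ij}} = \frac{2}{N}(Y_{ij} - T_{ij})$, which is the upstream gradient the layer receives. The sketch below checks that expression numerically; the helper name `mse_loss_and_grad` and the random data are illustrative assumptions.

```python
import numpy as np

def mse_loss_and_grad(Y, T):
    """Return L = (1/N) * sum((Y - T)^2) and dL/dY = 2 * (Y - T) / N."""
    n_samples = Y.shape[0]
    diff = Y - T
    return np.sum(diff ** 2) / n_samples, 2.0 * diff / n_samples

rng = np.random.default_rng(0)            # illustrative random data
Y_demo = rng.standard_normal((4, 2))
T_demo = rng.standard_normal((4, 2))
loss_demo, dY_demo = mse_loss_and_grad(Y_demo, T_demo)

# Central-difference check on a single entry of Y
h = 1e-6
Y_hi, Y_lo = Y_demo.copy(), Y_demo.copy()
Y_hi[1, 0] += h
Y_lo[1, 0] -= h
numeric_entry = (mse_loss_and_grad(Y_hi, T_demo)[0]
                 - mse_loss_and_grad(Y_lo, T_demo)[0]) / (2 * h)
print(f"analytic: {dY_demo[1, 0]:.6f}, numeric: {numeric_entry:.6f}")
```

Note that this follows the $1/N$ normalization written above. A loss computed with `np.mean` over all entries divides by $N \cdot M$ instead, so its gradient is `2 * (Y - T) / Y.size`.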

### Gradient Derivation

Suppose we already know $\frac{\partial L}{\partial Y}$ (shape $(N, M)$). We want to compute:
- $\frac{\partial L}{\partial X}$ (shape $(N, D)$)
- $\frac{\partial L}{\partial W}$ (shape $(D, M)$)
- $\frac{\partial L}{\partial b}$ (shape $(M,)$)

#### Approach: Start from a Single Element

Consider $Y_{ij} = \sum_k X_{ik} W_{kj} + b_j$

**Gradient with respect to W**:
$$\frac{\partial L}{\partial W_{kj}} = \sum_i \frac{\partial L}{\partial Y_{ij}} \cdot \frac{\partial Y_{ij}}{\partial W_{kj}} = \sum_i \frac{\partial L}{\partial Y_{ij}} \cdot X_{ik}$$

In matrix form: $\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Y}$

**Gradient with respect to X**:
$$\frac{\partial L}{\partial X_{ik}} = \sum_j \frac{\partial L}{\partial Y_{ij}} \cdot \frac{\partial Y_{ij}}{\partial X_{ik}} = \sum_j \frac{\partial L}{\partial Y_{ij}} \cdot W_{kj}$$

In matrix form: $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^T$

**Gradient with respect to b** (using $\frac{\partial Y_{ij}}{\partial b_j} = 1$):
$$\frac{\partial L}{\partial b_j} = \sum_i \frac{\partial L}{\partial Y_{ij}}$$

In vector form: $\frac{\partial L}{\partial b} = \sum_i \frac{\partial L}{\partial Y_{i,:}}$ (sum over axis=0)
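
The step from index sums to matrix products can be checked directly. This sketch writes each index formula with `np.einsum` and compares it against the matrix form; the `_chk` array names and the random values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chk, d_chk, m_chk = 4, 3, 2
X_chk = rng.standard_normal((n_chk, d_chk))
W_chk = rng.standard_normal((d_chk, m_chk))
dY_chk = rng.standard_normal((n_chk, m_chk))   # stands in for an upstream gradient

# Index formulas, written literally
dW_sum = np.einsum('ij,ik->kj', dY_chk, X_chk)   # sum_i dY[i, j] * X[i, k]
dX_sum = np.einsum('ij,kj->ik', dY_chk, W_chk)   # sum_j dY[i, j] * W[k, j]
db_sum = np.einsum('ij->j', dY_chk)              # sum_i dY[i, j]

# Matrix forms
dW_mat = X_chk.T @ dY_chk
dX_mat = dY_chk @ W_chk.T
db_mat = dY_chk.sum(axis=0)

print(np.allclose(dW_sum, dW_mat), np.allclose(dX_sum, dX_mat), np.allclose(db_sum, db_mat))
```

A useful habit is to check shapes first: each gradient must have the same shape as the quantity it differentiates with respect to, which leaves only one way to arrange the transposes.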

In [None]:
class FullyConnectedLayer:
    """
    Forward/backward pass for a fully connected layer

    Y = XW + b
    """

    def __init__(self, in_features, out_features):
        """
        Parameters
        ----------
        in_features : int
            Input dimension D
        out_features : int
            Output dimension M
        """
        # He (Kaiming) initialization: standard deviation sqrt(2 / fan_in)
        self.W = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.b = np.zeros(out_features)

        # Gradients
        self.dW = None
        self.db = None

        # Cache
        self.cache = None

    def forward(self, X):
        """
        Forward pass

        Parameters
        ----------
        X : np.ndarray, shape (N, D)
            Input data

        Returns
        -------
        Y : np.ndarray, shape (N, M)
            Output
        """
        self.cache = X
        Y = X @ self.W + self.b
        return Y

    def backward(self, dY):
        """
        Backward pass

        Parameters
        ----------
        dY : np.ndarray, shape (N, M)
            Gradient of the loss with respect to the output, ∂L/∂Y

        Returns
        -------
        dX : np.ndarray, shape (N, D)
            Gradient of the loss with respect to the input, ∂L/∂X
        """
        X = self.cache

        # ∂L/∂W = X^T @ dY
        self.dW = X.T @ dY

        # ∂L/∂b = sum(dY, axis=0)
        self.db = np.sum(dY, axis=0)

        # ∂L/∂X = dY @ W^T
        dX = dY @ self.W.T

        return dX

# Test
N, D, M = 4, 3, 2
X = np.random.randn(N, D)
fc = FullyConnectedLayer(D, M)

# Forward pass
Y = fc.forward(X)
print(f"Input X shape: {X.shape}")
print(f"Weight W shape: {fc.W.shape}")
print(f"Bias b shape: {fc.b.shape}")
print(f"Output Y shape: {Y.shape}")

In [None]:
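# Gradient check
def gradient_check_fc(layer, X, eps=1e-5):
    """
    Run a gradient check on a fully connected layer
    """
    # Forward pass with a simple sum-of-squares loss (not MSE: no target, no averaging)
    Y = layer.forward(X)

    # With loss = sum(Y^2), we have dY = 2Y
    loss = np.sum(Y ** 2)
    dY = 2 * Y

    # Backward pass
    dX = layer.backward(dY)

    # Numerical check of dW
    print("=== Check dW ===")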
    num_dW = np.zeros_like(layer.W)
    for i in range(layer.W.shape[0]):
        for j in range(layer.W.shape[1]):
            # W + eps
            layer.W[i, j] += eps
            Y_plus = layer.forward(X)
            loss_plus = np.sum(Y_plus ** 2)
            
            # W - eps
            layer.W[i, j] -= 2 * eps
            Y_minus = layer.forward(X)
            loss_minus = np.sum(Y_minus ** 2)
            
            # Restore the original value
            layer.W[i, j] += eps
            
            num_dW[i, j] = (loss_plus - loss_minus) / (2 * eps)
    
    diff = np.abs(layer.dW - num_dW)
    rel_error = np.max(diff / (np.abs(layer.dW) + np.abs(num_dW) + 1e-8))
    print(f"dW relative error: {rel_error:.2e}")
    print(f"Check passed: {rel_error < 1e-5}")

    # Numerical check of db
    print("\n=== Check db ===")
    num_db = np.zeros_like(layer.b)
    for j in range(layer.b.shape[0]):
        layer.b[j] += eps
        Y_plus = layer.forward(X)
        loss_plus = np.sum(Y_plus ** 2)
        
        layer.b[j] -= 2 * eps
        Y_minus = layer.forward(X)
        loss_minus = np.sum(Y_minus ** 2)
        
        layer.b[j] += eps
        num_db[j] = (loss_plus - loss_minus) / (2 * eps)
    
    diff = np.abs(layer.db - num_db)
    rel_error = np.max(diff / (np.abs(layer.db) + np.abs(num_db) + 1e-8))
    print(f"db relative error: {rel_error:.2e}")
    print(f"Check passed: {rel_error < 1e-5}")

    # Numerical check of dX
    print("\n=== Check dX ===")
    num_dX = np.zeros_like(X)
    X_test = X.copy()
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            X_test[i, j] += eps
            Y_plus = layer.forward(X_test)
            loss_plus = np.sum(Y_plus ** 2)
            
            X_test[i, j] -= 2 * eps
            Y_minus = layer.forward(X_test)
            loss_minus = np.sum(Y_minus ** 2)
            
            X_test[i, j] += eps
            num_dX[i, j] = (loss_plus - loss_minus) / (2 * eps)
    
    diff = np.abs(dX - num_dX)
    rel_error = np.max(diff / (np.abs(dX) + np.abs(num_dX) + 1e-8))
    print(f"dX relative error: {rel_error:.2e}")
    print(f"Check passed: {rel_error < 1e-5}")

gradient_check_fc(fc, X)

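## Part 5: Gradients of Common Operations

A few operations show up constantly in neural networks, and their gradient patterns are worth memorizing:

### Addition Node

$f = x + y$
- $\frac{\partial f}{\partial x} = 1$
- $\frac{\partial f}{\partial y} = 1$

**Intuition**: the upstream gradient is passed unchanged to both inputs. Each input receives the full gradient; it is not split or averaged.

### Multiplication Node

$f = x \cdot y$
- $\frac{\partial f}{\partial x} = y$
- $\frac{\partial f}{\partial y} = x$

**Intuition**: the inputs "swap". Each input receives the upstream gradient scaled by the value of the other input.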
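
The addition and multiplication patterns can be sketched as two tiny node classes. `AddNode` and `MulNode` are hypothetical names introduced for this example; the demo rebuilds $f = (x + y) \cdot z$ from Part 1 with $x = 2$, $y = 3$, $z = 4$.

```python
class AddNode:
    """f = a + b: both local gradients are 1."""
    def forward(self, a, b):
        return a + b

    def backward(self, dout):
        return dout, dout            # upstream gradient passes through unchanged

class MulNode:
    """f = a * b: each input's local gradient is the other input."""
    def forward(self, a, b):
        self.a, self.b = a, b        # cache inputs for the backward pass
        return a * b

    def backward(self, dout):
        return dout * self.b, dout * self.a

add_node, mul_node = AddNode(), MulNode()
q_demo = add_node.forward(2.0, 3.0)          # q = x + y
f_demo = mul_node.forward(q_demo, 4.0)       # f = q * z

dq_demo, dz_demo = mul_node.backward(1.0)    # start with df/df = 1
dx_demo, dy_demo = add_node.backward(dq_demo)
print(dx_demo, dy_demo, dz_demo)             # 4.0 4.0 5.0
```

Only the multiplication node needs a cache: its local gradients depend on the input values, while the addition node's do not.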

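### Max Node

$f = \max(x, y)$
- The gradient flows only to the larger input; the other input receives zero
- This is the principle behind the gradients of ReLU and max pooling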
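
A minimal sketch of this routing behavior follows. The function names and sample arrays are assumptions for illustration, and sending ties to the first input is a convention chosen here, not something the text prescribes.

```python
import numpy as np

def max_backward(a, b, dout):
    """Backward pass of f = max(a, b), element-wise. Ties are routed to `a`."""
    a_wins = a >= b
    return dout * a_wins, dout * ~a_wins

def relu_backward(x, dout):
    """ReLU is max(x, 0): the gradient passes only where x > 0."""
    return dout * (x > 0)

a_demo = np.array([1.0, 5.0, -2.0])
b_demo = np.array([3.0, 2.0, -2.0])
dout_demo = np.array([10.0, 20.0, 30.0])

da_demo, db_demo = max_backward(a_demo, b_demo, dout_demo)
print(da_demo, db_demo)
print(relu_backward(a_demo, dout_demo))
```

At every position exactly one of the two returned gradients is nonzero, so their sum always equals `dout`.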

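### Copy Node (Multiple Consumers)

If a variable is used in more than one place, the gradients from all of its uses must be **added**.

For example, $f = x + x^2$ can be viewed as $a = x$, $b = x$, $f = a + b^2$:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x} + \frac{\partial f}{\partial b} \cdot \frac{\partial b}{\partial x} = 1 \cdot 1 + 2b \cdot 1 = 1 + 2x$$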
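
The same example can be checked numerically. In this sketch the function names and the evaluation point $x_0 = 3$ are illustrative assumptions.

```python
def copy_node_gradient(x0):
    """Gradient of f = x + x^2 at x0, treating x as copied into a and b."""
    a, b = x0, x0            # copy node: one variable, two consumers
    df_da = 1.0              # from f = a + b^2
    df_db = 2.0 * b
    return df_da + df_db     # contributions from both branches are added

def f_copy(x0):
    return x0 + x0 ** 2

x0 = 3.0
h = 1e-6
analytic_copy = copy_node_gradient(x0)
numeric_copy = (f_copy(x0 + h) - f_copy(x0 - h)) / (2 * h)
print(f"analytic: {analytic_copy}, numeric: {numeric_copy:.6f}")
```

Forgetting the sum and keeping only one branch's gradient is a common bug when a layer's output feeds several later layers.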

In [None]:
# Visualize gradient flow through different operations

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Addition node
ax = axes[0]
ax.text(0.2, 0.7, 'x', fontsize=16, ha='center')
ax.text(0.2, 0.3, 'y', fontsize=16, ha='center')
ax.text(0.5, 0.5, '+', fontsize=20, ha='center', 
        bbox=dict(boxstyle='circle', facecolor='lightblue'))
ax.text(0.8, 0.5, 'f', fontsize=16, ha='center')

ax.annotate('', xy=(0.43, 0.57), xytext=(0.25, 0.7),
            arrowprops=dict(arrowstyle='->', color='blue'))
ax.annotate('', xy=(0.43, 0.43), xytext=(0.25, 0.3),
            arrowprops=dict(arrowstyle='->', color='blue'))
ax.annotate('', xy=(0.75, 0.5), xytext=(0.57, 0.5),
            arrowprops=dict(arrowstyle='->', color='blue'))

# Backward arrows
ax.annotate('dout', xy=(0.57, 0.45), xytext=(0.75, 0.35),
            arrowprops=dict(arrowstyle='->', color='red'), color='red')
ax.annotate('dout', xy=(0.25, 0.65), xytext=(0.43, 0.57),
            arrowprops=dict(arrowstyle='->', color='red'), color='red', fontsize=10)
ax.annotate('dout', xy=(0.25, 0.35), xytext=(0.43, 0.43),
            arrowprops=dict(arrowstyle='->', color='red'), color='red', fontsize=10)

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('Add node: gradient copied to both inputs', fontsize=12)
ax.axis('off')

# Multiplication node
ax = axes[1]
ax.text(0.2, 0.7, 'x', fontsize=16, ha='center')
ax.text(0.2, 0.3, 'y', fontsize=16, ha='center')
ax.text(0.5, 0.5, '×', fontsize=20, ha='center',
        bbox=dict(boxstyle='circle', facecolor='lightgreen'))
ax.text(0.8, 0.5, 'f', fontsize=16, ha='center')

ax.annotate('', xy=(0.43, 0.57), xytext=(0.25, 0.7),
            arrowprops=dict(arrowstyle='->', color='blue'))
ax.annotate('', xy=(0.43, 0.43), xytext=(0.25, 0.3),
            arrowprops=dict(arrowstyle='->', color='blue'))
ax.annotate('', xy=(0.75, 0.5), xytext=(0.57, 0.5),
            arrowprops=dict(arrowstyle='->', color='blue'))

ax.annotate('dout', xy=(0.57, 0.45), xytext=(0.75, 0.35),
            arrowprops=dict(arrowstyle='->', color='red'), color='red')
ax.annotate('y·dout', xy=(0.25, 0.65), xytext=(0.43, 0.57),
            arrowprops=dict(arrowstyle='->', color='red'), color='red', fontsize=10)
ax.annotate('x·dout', xy=(0.25, 0.35), xytext=(0.43, 0.43),
            arrowprops=dict(arrowstyle='->', color='red'), color='red', fontsize=10)

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('Multiply node: scaled by the other input', fontsize=12)
ax.axis('off')

# Max node
ax = axes[2]
ax.text(0.2, 0.7, 'x', fontsize=16, ha='center')
ax.text(0.2, 0.3, 'y', fontsize=16, ha='center')
ax.text(0.5, 0.5, 'max', fontsize=14, ha='center',
        bbox=dict(boxstyle='round', facecolor='lightyellow'))
ax.text(0.8, 0.5, 'f', fontsize=16, ha='center')

ax.annotate('', xy=(0.43, 0.57), xytext=(0.25, 0.7),
            arrowprops=dict(arrowstyle='->', color='blue'))
ax.annotate('', xy=(0.43, 0.43), xytext=(0.25, 0.3),
            arrowprops=dict(arrowstyle='->', color='blue'))
ax.annotate('', xy=(0.75, 0.5), xytext=(0.57, 0.5),
            arrowprops=dict(arrowstyle='->', color='blue'))

ax.annotate('dout', xy=(0.57, 0.45), xytext=(0.75, 0.35),
            arrowprops=dict(arrowstyle='->', color='red'), color='red')
ax.annotate('dout (if x>y)', xy=(0.25, 0.65), xytext=(0.35, 0.75),
            arrowprops=dict(arrowstyle='->', color='red'), color='red', fontsize=9)
ax.annotate('0 (if x>y)', xy=(0.25, 0.35), xytext=(0.35, 0.2),
            arrowprops=dict(arrowstyle='->', color='gray'), color='gray', fontsize=9)

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('Max node: gradient flows only to the larger input', fontsize=12)
ax.axis('off')

plt.tight_layout()
plt.show()

## Exercises

### Exercise 1: Implement the Forward and Backward Pass of Sigmoid

Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

**Hint**: the derivative of sigmoid has a neat form, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
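
For reference, the identity in the hint follows from differentiating $(1 + e^{-x})^{-1}$ and using $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$:

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,(1 - \sigma(x))$$

Because the derivative depends only on $\sigma(x)$, the backward pass can reuse the forward output and never needs the original input.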

In [None]:
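class Sigmoid:
    """
    Forward/backward pass for a sigmoid layer
    """

    def __init__(self):
        self.cache = None

    def forward(self, x):
        """
        Forward pass

        Parameters
        ----------
        x : np.ndarray
            Input of any shape

        Returns
        -------
        out : np.ndarray
            Same shape as the input
        """
        # Solution:
        # Numerically stable sigmoid
        # For x >= 0: sigmoid = 1 / (1 + exp(-x))
        # For x < 0:  sigmoid = exp(x) / (1 + exp(x))
        # Each formula is evaluated only on its own branch. np.where would
        # evaluate both on every element, so exp could still overflow.
        x = np.asarray(x, dtype=np.float64)
        pos = x >= 0
        out = np.empty_like(x)
        out[pos] = 1 / (1 + np.exp(-x[pos]))
        exp_x = np.exp(x[~pos])
        out[~pos] = exp_x / (1 + exp_x)
        self.cache = out  # store the output, not the input
        return out

    def backward(self, dout):
        """
        Backward pass

        Parameters
        ----------
        dout : np.ndarray
            Upstream gradient, same shape as the forward output

        Returns
        -------
        dx : np.ndarray
            Gradient with respect to the input
        """
        # Solution:
        # σ'(x) = σ(x) * (1 - σ(x))
        # self.cache already holds σ(x)
        sigmoid_out = self.cache
        dx = dout * sigmoid_out * (1 - sigmoid_out)
        return dx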

# Test
sigmoid = Sigmoid()
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = sigmoid.forward(x)
print(f"x = {x}")
print(f"sigmoid(x) = {y}")

# Analytic gradient
dout = np.ones_like(x)
dx = sigmoid.backward(dout)
print(f"\nAnalytic gradient dx = {dx}")

# Numerical verification
eps = 1e-5
num_dx = np.zeros_like(x)
for i in range(len(x)):
    x_plus = x.copy()
    x_plus[i] += eps
    sigmoid_new = Sigmoid()
    y_plus = sigmoid_new.forward(x_plus)

    x_minus = x.copy()
    x_minus[i] -= eps
    sigmoid_new = Sigmoid()
    y_minus = sigmoid_new.forward(x_minus)

    # Assume loss = sum(y), which matches dout = 1 above
    num_dx[i] = (np.sum(y_plus) - np.sum(y_minus)) / (2 * eps)

print(f"Numerical gradient dx ≈ {num_dx}")
print(f"Relative error: {np.max(np.abs(dx - num_dx) / (np.abs(dx) + np.abs(num_dx) + 1e-8)):.2e}")

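### Exercise 2: Implement Backpropagation for a Multi-Layer Network

Build a simple two-layer network:
- Input → FC1 → Sigmoid → FC2 → Output

**Hint**: during the backward pass, call each layer's backward in the reverse of the forward order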

In [None]:
class TwoLayerNet:
    """
    Two-layer network: FC1 → Sigmoid → FC2
    """

    def __init__(self, in_features, hidden_features, out_features):
        """
        Parameters
        ----------
        in_features : int
            Input dimension
        hidden_features : int
            Hidden layer dimension
        out_features : int
            Output dimension
        """
        # Solution: initialize each layer
        self.fc1 = FullyConnectedLayer(in_features, hidden_features)
        self.sigmoid = Sigmoid()
        self.fc2 = FullyConnectedLayer(hidden_features, out_features)

    def forward(self, X):
        """
        Forward pass

        Parameters
        ----------
        X : np.ndarray, shape (N, in_features)

        Returns
        -------
        out : np.ndarray, shape (N, out_features)
        """
        # Solution:
        h1 = self.fc1.forward(X)
        h2 = self.sigmoid.forward(h1)
        out = self.fc2.forward(h2)
        return out

    def backward(self, dout):
        """
        Backward pass

        Parameters
        ----------
        dout : np.ndarray, shape (N, out_features)
            Gradient of the loss with respect to the output

        Returns
        -------
        dX : np.ndarray, shape (N, in_features)
            Gradient of the loss with respect to the input
        """
        # Solution: call backward in reverse order
        dh2 = self.fc2.backward(dout)
        dh1 = self.sigmoid.backward(dh2)
        dX = self.fc1.backward(dh1)
        return dX

    def get_params_and_grads(self):
        """
        Return every parameter together with its gradient
        """
        return [
            (self.fc1.W, self.fc1.dW),
            (self.fc1.b, self.fc1.db),
            (self.fc2.W, self.fc2.dW),
            (self.fc2.b, self.fc2.db),
        ]

# Test
net = TwoLayerNet(in_features=4, hidden_features=8, out_features=2)
X = np.random.randn(3, 4)  # 3 samples, 4 features

# Forward pass
Y = net.forward(X)
print(f"Input shape: {X.shape}")
print(f"Output shape: {Y.shape}")

# Use a simple MSE loss
target = np.random.randn(3, 2)
loss = np.mean((Y - target) ** 2)
dY = 2 * (Y - target) / Y.size

# Backward pass
dX = net.backward(dY)
print(f"\ndX shape: {dX.shape}")

# Show the gradient shape for each parameter
print("\nGradient shapes per parameter:")
for i, (param, grad) in enumerate(net.get_params_and_grads()):
    print(f"  param {i}: {param.shape}, grad: {grad.shape}")

In [None]:
# Gradient check for the whole network
def gradient_check_network(net, X, target, eps=1e-5):
    """
    Run a gradient check on the whole network
    """
    # Forward pass + backward pass
    Y = net.forward(X)
    loss = np.mean((Y - target) ** 2)
    dY = 2 * (Y - target) / Y.size
    net.backward(dY)

    params_and_grads = net.get_params_and_grads()

    print("=== Network gradient check ===")

    for idx, (param, grad) in enumerate(params_and_grads):
        # Check a few randomly chosen positions
        num_checks = min(5, param.size)
        flat_idx = np.random.choice(param.size, num_checks, replace=False)

        for i in flat_idx:
            # Convert the flat index to a multi-dimensional index
            multi_idx = np.unravel_index(i, param.shape)

            # param + eps
            old_val = param[multi_idx]
            param[multi_idx] = old_val + eps
            Y_plus = net.forward(X)
            loss_plus = np.mean((Y_plus - target) ** 2)

            # param - eps
            param[multi_idx] = old_val - eps
            Y_minus = net.forward(X)
            loss_minus = np.mean((Y_minus - target) ** 2)

            # Restore the original value
            param[multi_idx] = old_val

            num_grad = (loss_plus - loss_minus) / (2 * eps)
            analytic_grad = grad[multi_idx]

            rel_error = abs(num_grad - analytic_grad) / (abs(num_grad) + abs(analytic_grad) + 1e-8)

            # Only mismatches are reported
            if rel_error > 1e-4:
                print(f"  param {idx}, index {multi_idx}: analytic={analytic_grad:.6f}, numerical={num_grad:.6f}, error={rel_error:.2e} ❌")

    print("Gradient check finished!")

gradient_check_network(net, X, target)

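### Exercise 3: Train the Network with Gradient Descent

Use the two-layer network above to learn a simple function (for example, the XOR problem).

**Hint**: XOR cannot be solved without a nonlinearity, which is why we need the sigmoid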
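
To see why the nonlinearity matters, here is a small sketch that fits the best purely linear model $w_1 x_1 + w_2 x_2 + b$ to XOR by least squares. The `_lin` variable names are assumptions for this example, and `np.linalg.lstsq` stands in for training: it gives the optimal linear fit directly.

```python
import numpy as np

X_lin = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
t_lin = np.array([0, 1, 1, 0], dtype=np.float64)

# Append a column of ones so the bias is learned as a third coefficient
A_lin = np.hstack([X_lin, np.ones((4, 1))])
coef_lin, *_ = np.linalg.lstsq(A_lin, t_lin, rcond=None)
pred_lin = A_lin @ coef_lin

print("coefficients (w1, w2, b):", np.round(coef_lin, 6))
print("predictions:", np.round(pred_lin, 6))
```

The best linear fit predicts 0.5 for all four points, so it cannot separate the two classes. Stacking linear layers without an activation in between would still be a linear model and would hit the same limit.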

In [None]:
# XOR dataset
X_xor = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
], dtype=np.float64)

y_xor = np.array([
    [0],
    [1],
    [1],
    [0]
], dtype=np.float64)

print("XOR dataset:")
for i in range(4):
    print(f"  {X_xor[i]} -> {y_xor[i][0]}")

In [None]:
# Solution: train the network to solve XOR

# Build the network (it needs enough hidden units to learn XOR)
np.random.seed(42)
net = TwoLayerNet(in_features=2, hidden_features=8, out_features=1)

# Training hyperparameters
learning_rate = 1.0
epochs = 5000

losses = []

for epoch in range(epochs):
    # Forward pass
    Y = net.forward(X_xor)

    # MSE loss
    loss = np.mean((Y - y_xor) ** 2)
    losses.append(loss)

    # Gradient of the loss with respect to the output
    dY = 2 * (Y - y_xor) / Y.size

    # Backward pass
    net.backward(dY)

    # Update parameters (full-batch gradient descent; the arrays are updated in place)
    for param, grad in net.get_params_and_grads():
        param -= learning_rate * grad

    if epoch % 500 == 0:
        print(f"Epoch {epoch:4d}, Loss: {loss:.6f}")

print(f"\nFinal loss: {losses[-1]:.6f}")

In [None]:
# Visualize the training process
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curve
ax = axes[0]
ax.plot(losses)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training Loss')
ax.set_yscale('log')
ax.grid(True, alpha=0.3)

# Decision boundary
ax = axes[1]

# Build a grid of input points
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100),
                     np.linspace(-0.5, 1.5, 100))
grid = np.c_[xx.ravel(), yy.ravel()]

# Predict on the grid
Z = net.forward(grid)
Z = Z.reshape(xx.shape)

# Draw the decision boundary (the 0.5 level set of the network output)
ax.contourf(xx, yy, Z, levels=50, cmap='RdYlBu_r', alpha=0.7)
ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)

# Draw the data points
colors = ['red' if y == 0 else 'blue' for y in y_xor.flatten()]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors='black', linewidth=2, zorder=5)

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('XOR Decision Boundary')

plt.tight_layout()
plt.show()

# Show the final predictions
print("\nFinal predictions:")
Y_final = net.forward(X_xor)
for i in range(4):
    pred = 1 if Y_final[i, 0] > 0.5 else 0
    actual = int(y_xor[i, 0])
    correct = "✓" if pred == actual else "✗"
    print(f"  {X_xor[i]} -> pred: {Y_final[i, 0]:.4f} ({pred}) | actual: {actual} {correct}")

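## Summary

In this notebook we covered:

1. **Computational graphs**: directed graphs that break a complex computation into simple operations

2. **The chain rule**: the core of backpropagation
   $$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

3. **Gradient patterns of common operations**:
   - Addition: the upstream gradient is passed unchanged to both inputs
   - Multiplication: each input receives the upstream gradient scaled by the other input
   - Max: the gradient flows only to the larger input
   - Copy: gradients from all uses are added

4. **Vectorized gradients**:
   - $\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Y}$
   - $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^T$
   - $\frac{\partial L}{\partial b} = \sum_i \frac{\partial L}{\partial Y_{i,:}}$ (sum over axis=0)

5. **Gradient checking**: verifying analytic gradients with numerical differentiation

### Next Steps

Next we will implement more layer types:
- ReLU (an activation function used more often than sigmoid)
- Softmax + Cross-Entropy (the standard loss for classification)
- Conv2D (the convolution layer, the core of CNNs)
- MaxPool (the pooling layer)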