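# Vectorization vs. For-Loops

## Learning Objectives

1. Understand why vectorization is fast
2. Learn to rewrite Python for-loops as NumPy operations
3. Master broadcasting techniques
4. Measure the performance difference between implementations

## Core Principle

> **Move the computation from Python into C**: every Python-level operation carries overhead,
> and vectorization lets a single Python call perform a large amount of computation.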

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt

np.random.seed(42)

---

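## Part 1: Why Is Vectorization Fast?

### 1.1 The Overhead of Python Loops

Python is an interpreted language. Each operation in a loop body pays for:

1. **Type checking**: Python is dynamically typed, so operand types are checked on every operation
2. **Method lookup**: the interpreter has to find the `__add__` method behind `+`
3. **Python object overhead**: every Python object carries extra memory beyond its raw value
4. **GIL**: the global interpreter lock keeps Python bytecode from running in parallel across threads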

```python
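import numpy as np

a, b = np.random.randn(1000000), np.random.randn(1000000)
result = np.empty_like(a)

# Python for-loop: every iteration pays the overhead listed above
for i in range(1000000):
    result[i] = a[i] + b[i]  # one million Python-level operations

# NumPy vectorization: the overhead is paid once
result = a + b  # one Python call, one million additions in C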
```
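
One way to see the per-element cost, as a rough sketch: indexing a NumPy array from Python returns a newly created scalar object on every access, so a loop over `a[i]` builds one object per element before any arithmetic happens. The array `a` and its size below are illustrative only.

```python
import numpy as np

a = np.arange(5, dtype=np.float64)

# Each index operation boxes the raw 8-byte value into a NumPy scalar object.
elem = a[0]
print(type(elem))               # a NumPy scalar type, not a raw C double
print(isinstance(elem, float))  # np.float64 also subclasses Python's float

# Inside the array the values are stored unboxed: 5 elements * 8 bytes each.
print(a.nbytes)

# A vectorized call crosses the Python/C boundary once for the whole array.
total = a + a
```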

In [None]:
# Experiment: simple vector addition

def add_forloop(a, b):
    """Python for-loop implementation"""
    result = np.empty_like(a)
    for i in range(len(a)):
        result[i] = a[i] + b[i]
    return result

def add_vectorized(a, b):
    """NumPy vectorized implementation"""
    return a + b

# Test several sizes
sizes = [1000, 10000, 100000, 1000000]

print("Vector Addition Benchmark:")
print("=" * 60)
print(f"{'Size':>10} {'For-loop':>15} {'Vectorized':>15} {'Speedup':>10}")
print("-" * 60)

for size in sizes:
    a = np.random.randn(size)
    b = np.random.randn(size)
    
    # For-loop
    start = time.perf_counter()
    result1 = add_forloop(a, b)
    time_forloop = time.perf_counter() - start
    
    # Vectorized
    start = time.perf_counter()
    for _ in range(100):  # repeat and average the timings
        result2 = add_vectorized(a, b)
    time_vectorized = (time.perf_counter() - start) / 100
    
    speedup = time_forloop / time_vectorized
    print(f"{size:>10,} {time_forloop:>12.6f}s {time_vectorized:>12.6f}s {speedup:>9.1f}x")
    
    # Verify that both results match
    assert np.allclose(result1, result2)
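
The benchmark above times the for-loop once and the vectorized version 100 times. A more uniform way to time both, sketched here with the standard-library `timeit` module, is to repeat each measurement and keep the best run. The helper name `best_time` and its default repeat counts are assumptions for illustration, not part of the original notebook.

```python
import timeit
import numpy as np

def best_time(fn, *args, repeat=5, number=10):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

a = np.random.randn(10_000)
b = np.random.randn(10_000)
t_vec = best_time(np.add, a, b)
print(f"np.add on 10,000 elements: {t_vec:.2e}s per call")
```

Taking the minimum rather than the mean reduces the influence of unrelated background activity on the measurement.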

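### 1.2 Advantages of Vectorization

1. **Less interpreter overhead**: one Python call instead of N
2. **SIMD instructions**: the CPU can process several values per instruction (Single Instruction, Multiple Data)
3. **Cache-friendly**: contiguous memory access is faster
4. **Possible multithreading**: some operations can run in parallel automatically, mainly BLAS-backed ones such as matrix multiplication; plain elementwise operations usually run on a single thread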
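
To make the cache-friendliness point concrete, the sketch below inspects how a row and a column of a C-ordered array are laid out in memory. The array `grid` is made up for illustration.

```python
import numpy as np

grid = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order: rows are contiguous

row = grid[0]      # consecutive elements sit 8 bytes apart
col = grid[:, 0]   # consecutive elements sit one full row (4 * 8 bytes) apart

print(row.strides, row.flags['C_CONTIGUOUS'])
print(col.strides, col.flags['C_CONTIGUOUS'])

# A reduction along the last axis reads each row's memory sequentially.
row_sums = grid.sum(axis=1)
```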

In [None]:
# A more involved example: Euclidean distance

def euclidean_distance_forloop(X, y):
    """Distance from every point in X to y (for-loop)
    
    X: (N, D) - N vectors of dimension D
    y: (D,) - one vector of dimension D
    """
    N = X.shape[0]
    distances = np.empty(N)
    
    for i in range(N):
        diff = X[i] - y
        dist = 0
        for j in range(len(diff)):
            dist += diff[j] ** 2
        distances[i] = np.sqrt(dist)
    
    return distances

def euclidean_distance_vectorized(X, y):
    """Distance from every point in X to y (vectorized)"""
    diff = X - y  # broadcasting: (N, D) - (D,) = (N, D)
    return np.sqrt(np.sum(diff ** 2, axis=1))  # (N,)

# Test
N, D = 10000, 100
X = np.random.randn(N, D)
y = np.random.randn(D)

# For-loop
start = time.perf_counter()
dist1 = euclidean_distance_forloop(X, y)
time_forloop = time.perf_counter() - start

# Vectorized
start = time.perf_counter()
for _ in range(100):
    dist2 = euclidean_distance_vectorized(X, y)
time_vectorized = (time.perf_counter() - start) / 100

print(f"Euclidean distance ({N} points, {D} dims):")
print(f"  For-loop:   {time_forloop:.4f}s")
print(f"  Vectorized: {time_vectorized:.6f}s")
print(f"  Speedup:    {time_forloop / time_vectorized:.1f}x")
print(f"  Results match: {np.allclose(dist1, dist2)}")

---

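## Part 2: Broadcasting Techniques

### 2.1 Broadcasting Rules

NumPy automatically "stretches" the smaller array to match the larger one:

```
Rule 1: If the numbers of dimensions differ, prepend 1s to the shorter shape
Rule 2: If a dimension has size 1, stretch it to match the other array's size
Rule 3: If the sizes differ and neither is 1, raise an error
```

Examples: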
```
(3, 4) + (4,)    →  (3, 4) + (1, 4)  →  (3, 4)
(3, 4) + (3, 1)  →  (3, 4)
(3, 4) + (1,)    →  (3, 4) + (1, 1)  →  (3, 4)
(3, 4) + (5,)    →  Error! (4 != 5)
```
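
The rules can be checked without allocating any arrays. The sketch below uses `np.broadcast_shapes` (available in NumPy 1.20 and later) on the same shapes as the examples above.

```python
import numpy as np

print(np.broadcast_shapes((3, 4), (4,)))    # (4,) is treated as (1, 4)
print(np.broadcast_shapes((3, 4), (3, 1)))  # the size-1 axis is stretched to 4
print(np.broadcast_shapes((3, 4), (1,)))    # (1,) is treated as (1, 1)

try:
    np.broadcast_shapes((3, 4), (5,))       # trailing sizes 4 and 5 are incompatible
except ValueError as exc:
    print("broadcast error:", exc)
```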

In [None]:
# Broadcasting examples

# 1. Add a different value to each column (a row vector broadcast across rows)
A = np.array([[1, 2, 3],
              [4, 5, 6]])
row_add = np.array([10, 20, 30])

print("A (2, 3):")
print(A)
print("\nrow_add (3,):")
print(row_add)
print("\nA + row_add (broadcast over rows):")
print(A + row_add)

# 2. Add a different value to each row (a column vector broadcast across columns)
col_add = np.array([[100], [200]])  # shape (2, 1)

print("\ncol_add (2, 1):")
print(col_add)
print("\nA + col_add (broadcast over columns):")
print(A + col_add)

In [None]:
# A powerful application of broadcasting: distance matrices

def pairwise_distance_forloop(X, Y):
    """Distance matrix between X and Y (for-loop)
    
    X: (N, D)
    Y: (M, D)
    Returns: (N, M) distance matrix
    """
    N = X.shape[0]
    M = Y.shape[0]
    D = np.zeros((N, M))
    
    for i in range(N):
        for j in range(M):
            diff = X[i] - Y[j]
            D[i, j] = np.sqrt(np.sum(diff ** 2))
    
    return D

def pairwise_distance_broadcast(X, Y):
    """Distance matrix (using broadcasting)
    
    Trick:
    X: (N, D) -> (N, 1, D)
    Y: (M, D) -> (1, M, D)
    X - Y: (N, M, D)  -> difference for every pair of points
    """
    # Add axes
    X_expanded = X[:, np.newaxis, :]  # (N, 1, D)
    Y_expanded = Y[np.newaxis, :, :]  # (1, M, D)
    
    # Differences (broadcasting)
    diff = X_expanded - Y_expanded  # (N, M, D)
    
    # Distances
    return np.sqrt(np.sum(diff ** 2, axis=2))  # (N, M)

def pairwise_distance_efficient(X, Y):
    """A more memory-efficient version
    
    Uses the identity ||x-y||^2 = ||x||^2 + ||y||^2 - 2*x.y
    """
    X_sq = np.sum(X ** 2, axis=1, keepdims=True)  # (N, 1)
    Y_sq = np.sum(Y ** 2, axis=1, keepdims=True)  # (M, 1)
    
    # X_sq: (N, 1) broadcast to (N, M)
    # Y_sq.T: (1, M) broadcast to (N, M)
    # X @ Y.T: (N, M)
    D_sq = X_sq + Y_sq.T - 2 * X @ Y.T
    
    # Numerical stability: rounding can leave tiny negative values
    D_sq = np.maximum(D_sq, 0)
    
    return np.sqrt(D_sq)

# Test
N, M, D = 500, 300, 50
X = np.random.randn(N, D)
Y = np.random.randn(M, D)

# For-loop
start = time.perf_counter()
D1 = pairwise_distance_forloop(X, Y)
time_forloop = time.perf_counter() - start

# Broadcasting
start = time.perf_counter()
D2 = pairwise_distance_broadcast(X, Y)
time_broadcast = time.perf_counter() - start

# Efficient
start = time.perf_counter()
D3 = pairwise_distance_efficient(X, Y)
time_efficient = time.perf_counter() - start

print(f"Pairwise distance ({N}x{M}, {D} dims):")
print(f"  For-loop:    {time_forloop:.4f}s")
print(f"  Broadcasting: {time_broadcast:.4f}s  ({time_forloop/time_broadcast:.1f}x faster)")
print(f"  Efficient:   {time_efficient:.4f}s  ({time_forloop/time_efficient:.1f}x faster)")
print(f"\n  Results match: broadcast={np.allclose(D1, D2)}, efficient={np.allclose(D1, D3)}")
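
The difference in memory use between the two vectorized versions can be estimated directly from the shapes. This is a back-of-the-envelope sketch that assumes float64 inputs and the sizes used in the test above; it counts only the largest intermediate array of each approach.

```python
import numpy as np

N, M, D = 500, 300, 50
itemsize = np.dtype(np.float64).itemsize  # bytes per float64 element

# The broadcasting version materializes the (N, M, D) array of differences.
broadcast_bytes = N * M * D * itemsize

# The identity-based version only needs (N, M) arrays such as X @ Y.T.
efficient_bytes = N * M * itemsize

print(f"broadcast intermediate: {broadcast_bytes / 1e6:.1f} MB")
print(f"efficient intermediate: {efficient_bytes / 1e6:.1f} MB")
print(f"ratio: {broadcast_bytes // efficient_bytes}x (equal to D)")
```

The ratio between the two is exactly `D`, so the broadcasting version becomes impractical first as the feature dimension grows.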

### 2.2 Common Broadcasting Patterns

In [None]:
# Common pattern 1: standardization (subtract each row's mean, divide by its standard deviation)

X = np.random.randn(100, 50)

# For-loop version
def normalize_forloop(X):
    result = np.empty_like(X)
    for i in range(X.shape[0]):
        mean = np.mean(X[i])
        std = np.std(X[i])
        result[i] = (X[i] - mean) / (std + 1e-8)
    return result

# Vectorized version
def normalize_vectorized(X):
    mean = X.mean(axis=1, keepdims=True)  # (N, 1)
    std = X.std(axis=1, keepdims=True)    # (N, 1)
    return (X - mean) / (std + 1e-8)      # broadcasting

# Test
%timeit normalize_forloop(X)
%timeit normalize_vectorized(X)
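
`keepdims=True` is what makes the subtraction line up. As a small sketch with an illustrative array `Z`, dropping it leaves an `(N,)` vector that broadcasting aligns with the last axis instead of the first:

```python
import numpy as np

Z = np.random.randn(100, 50)

mean_keep = Z.mean(axis=1, keepdims=True)  # shape (100, 1): broadcasts across columns
mean_flat = Z.mean(axis=1)                 # shape (100,): would align with the last axis

centered = Z - mean_keep                   # (100, 50) - (100, 1) -> (100, 50)

try:
    Z - mean_flat                          # (100, 50) vs (100,): 50 != 100
except ValueError as exc:
    print("shape mismatch:", exc)

# Equivalent fix without keepdims: add the axis back by hand.
centered_alt = Z - mean_flat[:, np.newaxis]
```

Note that with a square input the flat version would not raise; it would silently subtract along the wrong axis, which is a good reason to prefer `keepdims=True`.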

In [None]:
# Common pattern 2: softmax

def softmax_forloop(X):
    """Row-wise softmax (for-loop)"""
    result = np.empty_like(X)
    for i in range(X.shape[0]):
        exp_x = np.exp(X[i] - np.max(X[i]))  # for numerical stability
        result[i] = exp_x / np.sum(exp_x)
    return result

def softmax_vectorized(X):
    """Row-wise softmax (vectorized)"""
    X_max = X.max(axis=1, keepdims=True)  # (N, 1)
    exp_x = np.exp(X - X_max)             # broadcasting
    return exp_x / exp_x.sum(axis=1, keepdims=True)  # broadcasting

# Test
X = np.random.randn(1000, 100)

start = time.perf_counter()
result1 = softmax_forloop(X)
time_forloop = time.perf_counter() - start

start = time.perf_counter()
result2 = softmax_vectorized(X)
time_vectorized = time.perf_counter() - start

print(f"Softmax (1000 x 100):")
print(f"  For-loop:   {time_forloop:.4f}s")
print(f"  Vectorized: {time_vectorized:.4f}s")
print(f"  Speedup:    {time_forloop / time_vectorized:.1f}x")
print(f"  Match: {np.allclose(result1, result2)}")
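
Both softmax versions subtract the row maximum before exponentiating. Softmax is unchanged by adding a constant to every input, so the shift costs nothing mathematically but avoids overflow. The sketch below, using made-up logits, shows what happens without it.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.exp(logits).sum()

# Stable softmax: subtracting the max shifts the largest exponent to 0.
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()

print(naive)
print(stable, stable.sum())
```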

In [None]:
# Common pattern 3: outer product

def outer_forloop(a, b):
    """Outer product of 1-D arrays a and b (for-loop)"""
    result = np.empty((len(a), len(b)))
    for i in range(len(a)):
        for j in range(len(b)):
            result[i, j] = a[i] * b[j]
    return result

def outer_vectorized(a, b):
    """Outer product (using broadcasting)"""
    return a[:, np.newaxis] * b[np.newaxis, :]  # (N, 1) * (1, M) = (N, M)

def outer_numpy(a, b):
    """Using NumPy's built-in function"""
    return np.outer(a, b)

# Test
a = np.random.randn(1000)
b = np.random.randn(800)

start = time.perf_counter()
r1 = outer_forloop(a, b)
time_forloop = time.perf_counter() - start

start = time.perf_counter()
r2 = outer_vectorized(a, b)
time_vectorized = time.perf_counter() - start

start = time.perf_counter()
r3 = outer_numpy(a, b)
time_numpy = time.perf_counter() - start

print(f"Outer product (1000 x 800):")
print(f"  For-loop:    {time_forloop:.4f}s")
print(f"  Broadcasting: {time_vectorized:.6f}s ({time_forloop/time_vectorized:.1f}x)")
print(f"  np.outer:    {time_numpy:.6f}s ({time_forloop/time_numpy:.1f}x)")
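
One pitfall worth spelling out: for 1-D arrays `.T` does nothing, so `a @ b.T` is an inner product (or a shape error when the lengths differ), not an outer product. The sketch below, with small made-up vectors, shows three equivalent ways to get the outer product.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])

print(a.T.shape)  # transposing a 1-D array leaves its shape unchanged

via_broadcast = a[:, np.newaxis] * b[np.newaxis, :]  # (3, 1) * (1, 2) -> (3, 2)
via_matmul = a[:, np.newaxis] @ b[np.newaxis, :]     # (3, 1) @ (1, 2) -> (3, 2)
via_einsum = np.einsum('i,j->ij', a, b)

print(via_broadcast)
```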

---

## Part 3: Vectorizing 2D Convolution

This is one of the most important examples in the course!

In [None]:
def conv2d_naive(x, kernel):
    """The most naive 2D convolution (no kernel flip, i.e. cross-correlation)
    
    x: (H, W) input image
    kernel: (kH, kW) convolution kernel
    """
    H, W = x.shape
    kH, kW = kernel.shape
    out_H = H - kH + 1
    out_W = W - kW + 1
    
    output = np.zeros((out_H, out_W))
    
    for i in range(out_H):
        for j in range(out_W):
            # Extract the patch
            patch = x[i:i+kH, j:j+kW]
            # Multiply elementwise and sum
            output[i, j] = np.sum(patch * kernel)
    
    return output


def conv2d_strided(x, kernel):
    """Vectorized convolution using stride_tricks
    
    Trick: use as_strided to build a view of all patches, then compute everything at once
    """
    H, W = x.shape
    kH, kW = kernel.shape
    out_H = H - kH + 1
    out_W = W - kW + 1
    
    # Build a strided view
    # shape: (out_H, out_W, kH, kW)
    # This view contains every patch without copying data
    shape = (out_H, out_W, kH, kW)
    strides = (x.strides[0], x.strides[1], x.strides[0], x.strides[1])
    
    # writeable=False guards against accidental writes through the overlapping view
    patches = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides, writeable=False)
    
    # Compute all outputs at once
    # patches: (out_H, out_W, kH, kW)
    # kernel: (kH, kW)
    # einsum sums over the kH and kW axes
    output = np.einsum('ijkl,kl->ij', patches, kernel)
    
    return output


# Test
H, W = 128, 128
kH, kW = 5, 5

x = np.random.randn(H, W)
kernel = np.random.randn(kH, kW)

# Naive
start = time.perf_counter()
out1 = conv2d_naive(x, kernel)
time_naive = time.perf_counter() - start

# Strided
start = time.perf_counter()
out2 = conv2d_strided(x, kernel)
time_strided = time.perf_counter() - start

print(f"2D Convolution ({H}x{W} image, {kH}x{kW} kernel):")
print(f"  Naive:   {time_naive:.4f}s")
print(f"  Strided: {time_strided:.6f}s")
print(f"  Speedup: {time_naive / time_strided:.1f}x")
print(f"  Match: {np.allclose(out1, out2)}")
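
`as_strided` does no bounds checking, so a wrong shape or stride can read arbitrary memory. As an alternative sketch (not the notebook's method), `sliding_window_view` from NumPy 1.20 and later builds the same `(out_H, out_W, kH, kW)` view with the strides worked out for you and returns it read-only. The function name `conv2d_windows` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_windows(x, kernel):
    """2D valid cross-correlation using a read-only sliding-window view."""
    patches = sliding_window_view(x, kernel.shape)  # (out_H, out_W, kH, kW)
    return np.einsum('ijkl,kl->ij', patches, kernel)

x_small = np.arange(16, dtype=np.float64).reshape(4, 4)
k_small = np.ones((2, 2))
out_small = conv2d_windows(x_small, k_small)
print(out_small.shape)
print(out_small)
```

With a kernel of ones, each output is just the sum of a 2x2 patch, which makes the result easy to verify by hand.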

In [None]:
# Full 4D convolution (batch + channels)

def conv2d_4d_naive(x, W, stride=1, padding=0):
    """Naive 4D convolution
    
    x: (N, C_in, H, W_in)
    W: (C_out, C_in, kH, kW)
    """
    N, C_in, H, W_in = x.shape
    C_out, _, kH, kW = W.shape
    
    # Padding
    if padding > 0:
        x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    
    H_pad, W_pad = x.shape[2], x.shape[3]
    out_H = (H_pad - kH) // stride + 1
    out_W = (W_pad - kW) // stride + 1
    
    output = np.zeros((N, C_out, out_H, out_W))
    
    for n in range(N):              # batch
        for c_out in range(C_out):  # output channel
            for i in range(out_H):  # height
                for j in range(out_W):  # width
                    # Extract the patch
                    h_start = i * stride
                    w_start = j * stride
                    patch = x[n, :, h_start:h_start+kH, w_start:w_start+kW]
                    # Sum over all input channels and kernel positions
                    output[n, c_out, i, j] = np.sum(patch * W[c_out])
    
    return output


def conv2d_4d_vectorized(x, W, stride=1, padding=0):
    """Vectorized 4D convolution (using im2col)
    
    im2col turns convolution into a matrix multiplication
    """
    N, C_in, H, W_in = x.shape
    C_out, _, kH, kW = W.shape
    
    # Padding
    if padding > 0:
        x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    
    H_pad, W_pad = x.shape[2], x.shape[3]
    out_H = (H_pad - kH) // stride + 1
    out_W = (W_pad - kW) // stride + 1
    
    # im2col: (N, C_in, H, W) -> (N*out_H*out_W, C_in*kH*kW)
    shape = (N, C_in, kH, kW, out_H, out_W)
    strides = (x.strides[0], x.strides[1], x.strides[2], x.strides[3],
               x.strides[2] * stride, x.strides[3] * stride)
    
    col = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides, writeable=False)
    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N * out_H * out_W, -1)
    
    # Reshape kernel: (C_out, C_in*kH*kW)
    W_col = W.reshape(C_out, -1)
    
    # Matrix multiplication
    out_col = col @ W_col.T  # (N*out_H*out_W, C_out)
    
    # Reshape output: (N, out_H, out_W, C_out) -> (N, C_out, out_H, out_W)
    output = out_col.reshape(N, out_H, out_W, C_out).transpose(0, 3, 1, 2)
    
    return output


# Test
N, C_in, H, W_in = 2, 3, 32, 32
C_out, kH, kW = 16, 3, 3

x = np.random.randn(N, C_in, H, W_in)
W = np.random.randn(C_out, C_in, kH, kW)

# Naive
start = time.perf_counter()
out1 = conv2d_4d_naive(x, W, padding=1)
time_naive = time.perf_counter() - start

# Vectorized
start = time.perf_counter()
out2 = conv2d_4d_vectorized(x, W, padding=1)
time_vectorized = time.perf_counter() - start

print(f"4D Convolution:")
print(f"  Input:  {x.shape}")
print(f"  Kernel: {W.shape}")
print(f"  Output: {out1.shape}")
print(f"\n  Naive:      {time_naive:.4f}s")
print(f"  Vectorized: {time_vectorized:.4f}s")
print(f"  Speedup:    {time_naive / time_vectorized:.1f}x")
print(f"  Match: {np.allclose(out1, out2)}")
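
The output shape follows the usual formula `out = (size + 2*padding - kernel) // stride + 1`, which is what the two functions above compute after padding. The helper below is a hypothetical convenience (not part of the notebook) that also reports the shape of the im2col matrix for the test configuration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Test configuration from above: N=2, C_in=3, 32x32 input, 3x3 kernel, padding=1
N, C_in, H, W_in = 2, 3, 32, 32
kH = kW = 3
out_H = conv_output_size(H, kH, stride=1, padding=1)
out_W = conv_output_size(W_in, kW, stride=1, padding=1)

# im2col matrix: one row per output position, one column per kernel weight
col_shape = (N * out_H * out_W, C_in * kH * kW)
print(out_H, out_W, col_shape)
```

The matrix multiplication `col @ W_col.T` then has shape `(N*out_H*out_W, C_in*kH*kW) @ (C_in*kH*kW, C_out)`, which is the GEMM that BLAS accelerates.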

---

## Part 4: Overall Performance Comparison

In [None]:
# Performance across input sizes

def benchmark_conv2d(sizes, C_in=3, C_out=16, kH=3):
    """Benchmark convolution for several input sizes"""
    results = {'size': [], 'naive': [], 'vectorized': []}
    
    for size in sizes:
        x = np.random.randn(1, C_in, size, size).astype(np.float32)
        W = np.random.randn(C_out, C_in, kH, kH).astype(np.float32)
        
        # Naive (small sizes only; larger ones take too long)
        if size <= 64:
            start = time.perf_counter()
            out1 = conv2d_4d_naive(x, W, padding=1)
            time_naive = time.perf_counter() - start
        else:
            time_naive = np.nan
        
        # Vectorized
        start = time.perf_counter()
        for _ in range(5):
            out2 = conv2d_4d_vectorized(x, W, padding=1)
        time_vectorized = (time.perf_counter() - start) / 5
        
        results['size'].append(size)
        results['naive'].append(time_naive)
        results['vectorized'].append(time_vectorized)
        
        if not np.isnan(time_naive):
            print(f"Size {size:4d}x{size:4d}: naive={time_naive:.4f}s, "
                  f"vectorized={time_vectorized:.4f}s, "
                  f"speedup={time_naive/time_vectorized:.1f}x")
        else:
            print(f"Size {size:4d}x{size:4d}: naive=skipped, "
                  f"vectorized={time_vectorized:.4f}s")
    
    return results

sizes = [16, 32, 64, 128, 256]
results = benchmark_conv2d(sizes)

In [None]:
# Plot the performance

fig, ax = plt.subplots(figsize=(10, 6))

sizes = results['size']
naive = results['naive']
vectorized = results['vectorized']

# Filter out NaN
valid_idx = [i for i, v in enumerate(naive) if not np.isnan(v)]
sizes_valid = [sizes[i] for i in valid_idx]
naive_valid = [naive[i] for i in valid_idx]

ax.semilogy(sizes_valid, naive_valid, 'o-', label='Naive (for-loop)', linewidth=2)
ax.semilogy(sizes, vectorized, 's-', label='Vectorized (im2col)', linewidth=2)

ax.set_xlabel('Image Size (pixels)', fontsize=12)
ax.set_ylabel('Time (seconds, log scale)', fontsize=12)
ax.set_title('2D Convolution Performance: Naive vs Vectorized', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
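plt.show()

print("\nObservations:")
print("1. The vectorized version is typically 10-100x faster or more (machine-dependent)")
print("2. The gap widens as the input grows")
print("3. im2col turns convolution into a GEMM, which can use multithreaded BLAS")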

---

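## Summary

### Vectorization Techniques

1. **Basic principles**:
   - Replace Python loops with NumPy operations
   - One NumPy call beats many loop iterations

2. **Broadcasting**:
   - Understand the broadcasting rules
   - Use `keepdims=True` to preserve dimensions
   - Use `np.newaxis` to add dimensions

3. **Common tools**:
   - `np.einsum` for complex tensor operations
   - `np.lib.stride_tricks.as_strided` for building efficient views
   - The `axis` parameter to control which dimension an operation acts on

4. **im2col**:
   - Turns convolution into matrix multiplication
   - Takes advantage of highly optimized BLAS routines
   - See the next notebook for a detailed implementation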

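### Typical Speedups

Rough ranges; actual numbers depend on hardware, array sizes, and the NumPy/BLAS build.

| Operation | Speedup (vectorized over for-loop) |
|-----|------------------------|
| Vector addition | 10-100x |
| Distance computation | 100-1000x |
| Softmax | 10-50x |
| 2D convolution | 50-200x |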