# Cross Validation

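## Hands-On Practice
### **Part 1: K-Fold Cross-Validation in Practice**
This part uses the classic Iris dataset with an SVM classifier and splits the data with K-Fold, Nested K-Fold, Repeated K-Fold, Stratified K-Fold, and Group K-Fold in turn.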
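
As a rough illustration of what these splitters do, the sketch below partitions sample indices into k contiguous folds without shuffling. The helper `kfold_indices` is a hypothetical name, not part of scikit-learn; it only mimics the way `KFold` sizes its folds when `shuffle=False`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print(f"train={train} test={test}")
```

Every sample lands in exactly one test fold, so averaging the per-fold scores uses each sample once for evaluation.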

### Load the Dataset

In [3]:
#1 Load the dataset
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

### Build the Model (K-Fold)

Split the data with K-Fold, then train and evaluate the model on each fold.

Finally, inspect the per-fold and average accuracy.

In [7]:
#2 K-Fold cross-validation
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create the model
model = SVC()

# Initialize K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Store the accuracy of each fold
accuracies = []

# Run K-Fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model (fit() discards any state from the previous fold)
    model.fit(X_train, y_train)

    # Predict on the held-out fold
    y_pred = model.predict(X_test)

    # Compute the accuracy
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Show the results
print(f"K-Fold accuracy: {accuracies}")
print(f"Mean accuracy: {sum(accuracies)/len(accuracies)}")

K-Fold accuracy: [1.0, 1.0, 0.9333333333333333, 0.9333333333333333, 0.9666666666666667]
Mean accuracy: 0.9666666666666666
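
The manual loop can likely be condensed with `cross_val_score`, which clones the estimator for each fold and scores it with the estimator's default metric (accuracy for `SVC`). This is a sketch of an alternative, not the notebook's own cell; reporting the standard deviation alongside the mean is an addition made here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same splitter settings as the manual loop
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold
scores = cross_val_score(SVC(), X, y, cv=kf)

print(f"Per-fold accuracy: {scores}")
print(f"Mean: {scores.mean():.4f}, std: {scores.std():.4f}")
```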


### Build the Model (Nested K-Fold)

In [10]:
#3 Nested K-Fold cross-validation
from sklearn.model_selection import GridSearchCV

# Initialize the model
model = SVC()

# Define the hyperparameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Store the outer-loop evaluation results
outer_scores = []

# Outer K-Fold: estimates generalization performance
outer_kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in outer_kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Inner K-Fold: used only for hyperparameter selection
    inner_kf = KFold(n_splits=3, shuffle=True, random_state=42)

    # Search the grid with GridSearchCV, cross-validating on the outer training split
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_kf)
    grid_search.fit(X_train, y_train)

    # Evaluate the best estimator from the inner search on the outer test fold
    # (GridSearchCV refits it on the whole outer training split by default)
    best_model = grid_search.best_estimator_
    outer_score = best_model.score(X_test, y_test)
    outer_scores.append(outer_score)

# Show the results
print(f"Nested K-Fold accuracy: {outer_scores}")
print(f"Mean accuracy: {sum(outer_scores)/len(outer_scores)}")

Nested K-Fold accuracy: [1.0, 1.0, 0.9333333333333333, 0.9666666666666667, 0.9666666666666667]
Mean accuracy: 0.9733333333333334
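
Because a `GridSearchCV` object is itself an estimator, the nested procedure can probably be written more compactly by passing it to `cross_val_score`. The sketch below is one way to do that under the same grid and splitter settings; it is an alternative formulation, not the cell that produced the output shown here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
inner_kf = KFold(n_splits=3, shuffle=True, random_state=42)
outer_kf = KFold(n_splits=5, shuffle=True, random_state=42)

# The inner search behaves like a single estimator: each outer fit()
# tunes the hyperparameters and refits on that outer training split.
search = GridSearchCV(SVC(), param_grid=param_grid, cv=inner_kf)
nested_scores = cross_val_score(search, X, y, cv=outer_kf)

print(f"Outer-fold accuracy: {nested_scores}")
print(f"Mean: {nested_scores.mean():.4f}")
```

Keeping the search inside the outer loop means the outer test fold never influences which hyperparameters are chosen.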


### Build the Model (Repeated K-Fold)

In [15]:
#4 Repeated K-Fold cross-validation
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

# Initialize the model
model = SVC()

# Run RepeatedKFold: 2 folds repeated 2 times gives 4 scores
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)
scores = cross_val_score(model, X, y, cv=rkf)

# Show the results
print(f"Accuracy per split: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy per split: [1.         0.94666667 1.         0.93333333]
Mean accuracy: 0.97
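
To make the four scores easier to interpret, the sketch below lists the splits that `RepeatedKFold` generates on a small toy array (the 6-sample array is a made-up stand-in, not the Iris data). It yields all folds of the first repetition, then all folds of the second, each repetition using a different shuffle.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data (hypothetical): 6 samples, 2 features
X_toy = np.arange(12).reshape(6, 2)

rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)
splits = list(rkf.split(X_toy))

# Total number of train/test splits is n_splits * n_repeats
print(rkf.get_n_splits())

for i, (train_idx, test_idx) in enumerate(splits):
    repeat, fold = divmod(i, 2)
    print(f"repeat={repeat} fold={fold} test={test_idx}")
```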


### Build the Model (Stratified K-Fold)

In [18]:
#5 Stratified K-Fold cross-validation
from sklearn.model_selection import StratifiedKFold

# Initialize the model
model = SVC()

# Run StratifiedKFold: each fold keeps roughly the same class proportions as y
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# Show the results
print(f"Accuracy per split: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy per split: [1.         0.92105263 0.97297297 0.94594595]
Mean accuracy: 0.9599928876244667
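
What stratification changes is the class mix inside each test fold. The sketch below counts the three Iris classes per test fold for plain `KFold` and for `StratifiedKFold` with the same settings; the `class_counts` dictionary is just an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

splitters = {
    "KFold": KFold(n_splits=4, shuffle=True, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=4, shuffle=True, random_state=42),
}

class_counts = {}
for name, splitter in splitters.items():
    # Count how many samples of each of the 3 classes land in each test fold
    class_counts[name] = [
        np.bincount(y[test_idx], minlength=3).tolist()
        for _, test_idx in splitter.split(X, y)
    ]
    print(name, class_counts[name])
```

With 50 samples per class and 4 folds, each stratified test fold holds 12 or 13 samples of every class, whereas the plain K-Fold counts depend on the shuffle.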


### Build the Model (Group K-Fold)

In [21]:
#6 Group K-Fold cross-validation
from sklearn.model_selection import GroupKFold
import numpy as np

# Define the groups (assigned at random here; no seed is set,
# so the scores below change from run to run)
groups = np.random.randint(0, 4, size=X.shape[0])

# Initialize the model
model = SVC()

# Run GroupKFold: samples from the same group never appear in both train and test
gkf = GroupKFold(n_splits=4)
scores = cross_val_score(model, X, y, cv=gkf, groups=groups)

# Show the results
print(f"Accuracy per split: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy per split: [1.         0.975      0.97058824 0.91176471]
Mean accuracy: 0.9643382352941177
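
The property that `GroupKFold` guarantees can be checked directly. The sketch below verifies that no group label is shared between the training and test indices of any split. As an assumption for reproducibility, it draws the group labels from a seeded generator (`np.random.default_rng(42)`) instead of the unseeded call used in the cell, so its groups will not match the ones behind the printed scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold

X, y = load_iris(return_X_y=True)

# Assumption: a seeded generator replaces the unseeded np.random.randint call,
# so the group labels are identical on every run.
rng = np.random.default_rng(42)
groups = rng.integers(0, 4, size=X.shape[0])

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Group labels present on both sides of the split (should be empty)
    shared = set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    overlaps.append(shared)
    print(f"test groups={sorted(set(groups[test_idx].tolist()))} shared={shared}")
```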


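The results show that:
- The model performs consistently and accurately: the mean accuracy is above 0.95 for every splitting method.
- Nested K-Fold (0.973), which tunes `C` and `kernel` in an inner loop, and Repeated K-Fold (0.970), which averages over repeated random splits, give means close to plain K-Fold (0.967), so neither hyperparameter tuning nor the randomness of the split changes the estimate much.
- Stratified K-Fold (0.960) and Group K-Fold (0.964) score slightly lower, but the gap to plain K-Fold is under one percentage point, so the model still copes with changes in class and group distribution. The Group K-Fold figure also depends on the unseeded random group assignment.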
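
One way to back the stability claim with numbers is to put the mean and the fold-to-fold standard deviation of several splitters side by side. The sketch below does this for three of the splitters, with settings copied from the cells in this part; the `summary` dictionary and the choice to report the standard deviation are additions made here, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    KFold,
    RepeatedKFold,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Splitter settings copied from the cells in this part
splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Repeated K-Fold": RepeatedKFold(n_splits=2, n_repeats=2, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=4, shuffle=True, random_state=42),
}

summary = {}
for name, cv in splitters.items():
    scores = cross_val_score(SVC(), X, y, cv=cv)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:<18} mean={scores.mean():.4f} std={scores.std():.4f}")
```

Because the splitters use different numbers of folds, the training sets differ in size, so small differences between the means should be read with that in mind.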