While a single train/validation split helps, the performance measured on that one validation set can be sensitive to which specific data points ended up in the split. Cross-validation provides a more reliable and robust estimate of a model's generalization performance by systematically using different portions of the data for training and validation.
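
The sensitivity of a single split can be seen directly by repeating the split with different seeds. The following is a minimal sketch, not part of the original notebook: the helper name `single_split_accuracy`, the 80/20 split, and the 20 seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)


def single_split_accuracy(seed):
    """Validation accuracy for one random 80/20 split (hypothetical helper)."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # The scaler lives inside the pipeline, so it is fit on X_tr only.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    return clf.score(X_val, y_val)


# Same model, same data; only the seed that decides the split changes.
split_scores = np.array([single_split_accuracy(seed) for seed in range(20)])
print(f"min={split_scores.min():.3f}  max={split_scores.max():.3f}  "
      f"mean={split_scores.mean():.3f}  std={split_scores.std():.3f}")
```

Any gap between the minimum and maximum here comes purely from which points landed in the validation set, which is the noise that averaging over folds is meant to reduce.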

## Cross-Validation (CV) for Model Evaluation

This document covers:

* **Rationale:** Explains why CV gives a more reliable performance estimate than a single split.
* **K-Fold CV:** Describes the process and shows implementation using `KFold` and `cross_val_score`.
* **Stratified K-Fold CV:** Explains its importance for classification and demonstrates its use. Notes that `cross_val_score` uses it by default for classifiers when `cv` is given as an integer.
* **Other Strategies:** Briefly mentions Leave-One-Out CV (LOOCV, via `LeaveOneOut`) and `ShuffleSplit`.
* **`cross_validate`:** Shows how to use this function to get more detailed results, including multiple metrics and timing information.
* **When to Use:** Clarifies that CV is primarily used during the model development phase on the training data for tuning and selection, before a final evaluation on the separate test set.
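
The workflow in the last bullet can be sketched end to end as follows. This is an illustrative sketch rather than the notebook's own code: the variable names `X_train_val` and `X_final_test` mirror the names used in the notebook's comments, while the 80/20 split, the seed, and the choice of `make_pipeline` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Hold out a final test set that cross-validation never sees.
X_train_val, X_final_test, y_train_val, y_final_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Put preprocessing inside the estimator so that, within each CV iteration,
#    the scaler is fit on the training folds only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Cross-validate on the training portion for model selection and tuning.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_train_val, y_train_val, cv=cv, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")

# 4. Refit the chosen configuration on all training data and evaluate once.
pipe.fit(X_train_val, y_train_val)
test_accuracy = pipe.score(X_final_test, y_final_test)
print(f"Final test accuracy: {test_accuracy:.4f}")
```

Keeping the scaler inside the pipeline means no statistics from a validation fold or from the test set leak into training.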

---

Cross-validation is a cornerstone technique for building reliable machine learning models.
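
To make the fold mechanics concrete, here is a hedged sketch of the loop that `cross_val_score` runs for a K-Fold strategy: split, fit a fresh copy of the model on the training folds, score on the held-out fold. The names `manual_scores` and `helper_scores` are illustrative, and the unscaled Iris data is used only to keep the sketch short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # clone() gives an unfitted copy, so no state carries over between folds.
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # For classifiers, .score() returns accuracy on the held-out fold.
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
manual_scores = np.array(manual_scores)

# The helper should reproduce the same per-fold scores with the same splitter.
helper_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print("manual:", manual_scores)
print("helper:", helper_scores)
```

Because the splitter has a fixed integer `random_state`, both passes see identical folds, so the two score arrays should agree.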

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import (train_test_split, cross_val_score, cross_validate,
                                     KFold, StratifiedKFold, LeaveOneOut, ShuffleSplit)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression # Example model
from sklearn.svm import SVC # Another example model
from sklearn.metrics import accuracy_score, make_scorer # For custom scoring if needed

# --- 1. Rationale for Cross-Validation ---
# - A single train/validation split's performance metric can be noisy or biased
#   depending on which data points land in the validation set.
# - CV provides a more stable and reliable estimate of model performance by
#   training and evaluating the model on multiple different subsets of the data.
# - It uses the available data more efficiently, as each data point gets used
#   for both training and validation across the different iterations (folds).

# --- 2. Load and Prepare Data ---
# Using Iris dataset
print("--- Loading Iris Dataset ---")
iris = load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# IMPORTANT NOTE: CV is typically performed *after* an initial train-test split,
# using only the *training portion* (e.g., X_train_val, y_train_val from Section II)
# for model selection and hyperparameter tuning.
# The final test set (e.g., X_final_test, y_final_test) is still held out for the
# very final evaluation *after* CV and tuning are complete.

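# To keep the focus on CV mechanics, this demo uses the *entire* X and y, which is
# not standard practice for a final model evaluation.
# Caveat: the scaler below is fit on all of the data before CV, so each validation
# fold influences the scaling statistics (a mild form of data leakage). In real work,
# wrap the scaler and the model in a Pipeline so the scaler is fit on training folds only.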
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Scaled Data shape: X={X_scaled.shape}, y={y.shape}")
print("-" * 30)


# --- 3. K-Fold Cross-Validation ---
# The most common CV technique.
# 1. Split: The data (typically the training set) is divided into 'k' approximately
#    equal-sized, non-overlapping subsets called "folds". Common values for k are 5 or 10.
# 2. Iterate: The process runs for 'k' iterations. In each iteration 'i':
#    - Fold 'i' is used as the validation set.
#    - The remaining 'k-1' folds are combined to form the training set.
#    - The model is trained on the training folds and evaluated on the validation fold.
# 3. Aggregate: The evaluation scores from the 'k' iterations are collected,
#    and typically the mean and standard deviation are reported.

print("--- K-Fold Cross-Validation ---")
# Define the model
model = LogisticRegression(solver='liblinear', random_state=42)

# Define the K-Fold strategy
# shuffle=True is recommended to randomize data order before splitting.
# random_state ensures reproducibility of the shuffle.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print(f"KFold strategy: {kf}")

# Use cross_val_score to get scores for each fold
# cv=kf tells it to use our defined KFold strategy.
# scoring='accuracy' specifies the metric.
scores_kfold = cross_val_score(model, X_scaled, y, cv=kf, scoring='accuracy')

print(f"\nK-Fold Scores (Accuracy per fold): {scores_kfold}")
print(f"Mean K-Fold Accuracy: {scores_kfold.mean():.4f}")
print(f"Standard Deviation of K-Fold Accuracy: {scores_kfold.std():.4f}")
# Note: KFold doesn't preserve class ratios, which can be problematic for classification.
print("-" * 20)


# --- 4. Stratified K-Fold Cross-Validation ---
# Variation of K-Fold specifically for *classification* tasks.
# Ensures that the proportion of samples for each class is approximately
# the same in each fold as in the original dataset.
# Generally preferred over standard K-Fold for classification.

print("--- Stratified K-Fold Cross-Validation ---")
# Define the Stratified K-Fold strategy
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(f"StratifiedKFold strategy: {skf}")

# Use cross_val_score with StratifiedKFold
# Note: If you pass an integer (e.g., cv=5) to cross_val_score and the estimator is a
# classifier with binary or multiclass targets, StratifiedKFold is used automatically
# (without shuffling). Defining it explicitly is clearer and lets you control shuffling.
scores_skf = cross_val_score(model, X_scaled, y, cv=skf, scoring='accuracy')

print(f"\nStratified K-Fold Scores (Accuracy per fold): {scores_skf}")
print(f"Mean Stratified K-Fold Accuracy: {scores_skf.mean():.4f}")
print(f"Standard Deviation of Stratified K-Fold Accuracy: {scores_skf.std():.4f}")
print("-" * 30)


# --- 5. Other CV Strategies (Brief Mention) ---
print("--- Other CV Strategies ---")
# a) Leave-One-Out CV (LOOCV)
# A special case of K-Fold where K equals the number of samples (N).
# Each validation fold contains exactly one sample, so N models are trained.
# Computationally expensive. The estimate is nearly unbiased but can have high variance.
# loo = LeaveOneOut()
# scores_loo = cross_val_score(model, X_scaled, y, cv=loo, scoring='accuracy')
# print(f"\nMean LOOCV Accuracy: {scores_loo.mean():.4f} (computationally expensive)")

# b) ShuffleSplit
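# Draws a specified number of independent random train/test splits. Unlike K-Fold,
# test sets can overlap across splits, and some samples may never be used for validation.
# Useful for large datasets or when you want direct control over the number of
# iterations and the test-set size.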
# ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
# scores_ss = cross_val_score(model, X_scaled, y, cv=ss, scoring='accuracy')
# print(f"\nMean ShuffleSplit Accuracy (10 splits): {scores_ss.mean():.4f}")
print("- LeaveOneOutCV (LOOCV): K = number of samples. Expensive.")
print("- ShuffleSplit: Creates independent random splits.")
print("-" * 30)


# --- 6. Using cross_validate for More Details ---
# cross_validate provides more information than cross_val_score, such as:
# - Fit time per fold
# - Score time per fold
# - Multiple evaluation metrics simultaneously
# - Optionally, the training scores per fold

print("--- Using cross_validate ---")
model_svc = SVC(kernel='rbf', C=1.0, random_state=42) # Use a different model example
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring_metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

# Perform cross-validation
cv_results = cross_validate(model_svc, X_scaled, y,
                            cv=cv_strategy,
                            scoring=scoring_metrics,
                            return_train_score=True) # Get training scores too

# Display results (a DataFrame makes the per-fold dict easier to read)
cv_results_df = pd.DataFrame(cv_results)
print("Cross-Validate Results (DataFrame):")
print(cv_results_df)

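# Calculate and print average test scores
# Note: pandas .std() uses ddof=1 (sample standard deviation), unlike the NumPy
# .std() calls above (ddof=0). The "+/-" value printed is two standard deviations.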
print("\nAverage Test Scores:")
for metric in scoring_metrics:
    mean_score = cv_results_df[f'test_{metric}'].mean()
    std_score = cv_results_df[f'test_{metric}'].std()
    print(f"  - {metric}: {mean_score:.4f} (+/- {std_score*2:.4f})")

print(f"\nAverage Fit Time: {cv_results_df['fit_time'].mean():.4f}s")
print("-" * 30)


# --- 7. When to Use Cross-Validation ---
print("--- When to Use CV ---")
print("- Primary Use: During model development on the *training data portion*.")
print("  - Hyperparameter Tuning (e.g., inside GridSearchCV/RandomizedSearchCV).")
print("  - Model Selection (comparing different algorithms reliably).")
print("- It provides a more robust estimate of how a model configuration is likely")
print("  to perform on unseen data compared to a single validation set.")
print("- The *final* evaluation of the *chosen* model configuration should still")
print("  be done on the completely held-out *test set*.")
print("-" * 30)

--- Loading Iris Dataset ---
Scaled Data shape: X=(150, 4), y=(150,)
------------------------------
--- K-Fold Cross-Validation ---
KFold strategy: KFold(n_splits=5, random_state=42, shuffle=True)

K-Fold Scores (Accuracy per fold): [0.96666667 0.83333333 0.9        0.9        0.93333333]
Mean K-Fold Accuracy: 0.9067
Standard Deviation of K-Fold Accuracy: 0.0442
--------------------
--- Stratified K-Fold Cross-Validation ---
StratifiedKFold strategy: StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

Stratified K-Fold Scores (Accuracy per fold): [0.93333333 0.96666667 0.8        0.93333333 0.86666667]
Mean Stratified K-Fold Accuracy: 0.9000
Standard Deviation of Stratified K-Fold Accuracy: 0.0596
------------------------------
--- Other CV Strategies ---
- LeaveOneOutCV (LOOCV): K = number of samples. Expensive.
- ShuffleSplit: Creates independent random splits.
------------------------------
--- Using cross_validate ---
Cross-Validate Results (DataFrame):
   fit_time  score_t