# 20. Model Selection: Cross-Validation

**Purpose:** Learn and review **cross-validation** in scikit-learn.

---

## What is Cross-Validation?

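Cross-validation splits the data into **K folds**. In each of K rounds, the model trains on **K-1** folds and is evaluated on the one held-out fold, so every sample is used for validation exactly once. The K scores are then summarized, typically as a mean and standard deviation.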
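
To make the fold rotation concrete, the loop below is a minimal sketch of roughly what a K-fold evaluation does by hand. The synthetic data and the names `X_demo`, `y_demo`, `estimator`, and `fold_scores` are illustrative assumptions, not part of this notebook's later cells.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative synthetic data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

estimator = LogisticRegression(max_iter=1000)
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in splitter.split(X_demo):
    # Fit a fresh, unfitted copy on the K-1 training folds
    fold_model = clone(estimator)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score on the single held-out fold (accuracy for classifiers)
    fold_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

fold_scores = np.array(fold_scores)
print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

In practice `cross_val_score` wraps this loop, so the manual version is mainly useful for seeing where each fold's score comes from.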

## Concepts to Remember

| Concept | Description |
|--------|-------------|
| **cross_val_score** | Quick CV evaluation for a single metric. |
| **KFold** / **StratifiedKFold** | Fold generators; `StratifiedKFold` keeps each fold's class proportions close to those of the full dataset. |
| **scoring** | Selects the evaluation metric (e.g., `"accuracy"`, `"f1"`). |
| **cross_validate** | Multiple metrics + timing in one call. |
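
The cells in this notebook only exercise `cross_val_score`, so here is a small sketch of `cross_validate` collecting two metrics plus timing in one call. The data and the names `X_demo`, `y_demo`, `pipe`, and `results` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# A list of scorer names yields one "test_<name>" array per metric
results = cross_validate(
    pipe,
    X_demo,
    y_demo,
    cv=cv,
    scoring=["accuracy", "f1"],
    return_train_score=True,
)

# Each entry holds one value per fold; print the mean of each
for key in sorted(results):
    print(f"{key:>15}: {results[key].mean():.3f}")
```

The returned dictionary also contains `fit_time` and `score_time`, which helps when comparing models of different cost.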


In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


In [2]:
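# Global NumPy seed; the calls below also pass random_state explicitly.
np.random.seed(42)

# Synthetic binary classification problem: 10 features, 5 of them informative.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    class_sep=1.2,
    random_state=42,
)

# Stratified 80/20 holdout split, used for the single-split baseline.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)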


In [3]:
# Baseline: a single train/test split yields one accuracy number.
holdout_acc = model.fit(X_train, y_train).score(X_test, y_test)
print(f"Holdout accuracy: {holdout_acc:.3f}")

# Plain K-fold on the full dataset; the pipeline is cloned and refit per fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print(f"KFold CV accuracy: {scores_kfold.mean():.3f} +/- {scores_kfold.std():.3f}")

# Stratified K-fold keeps the class ratio roughly constant across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_skf = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
print(f"StratifiedKFold CV accuracy: {scores_skf.mean():.3f} +/- {scores_skf.std():.3f}")


Holdout accuracy: 0.860
KFold CV accuracy: 0.882 +/- 0.020
StratifiedKFold CV accuracy: 0.882 +/- 0.019


## Key Takeaways

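- Cross-validation averages over K train/validation splits, so its estimate depends less on one particular split than a single holdout score does, and the spread across folds shows how much the score varies.
- Use **StratifiedKFold** for classification so each fold preserves the class proportions.
- Wrap preprocessing + model in a **Pipeline** so preprocessing is fit only on each training fold, which avoids data leakage during CV.
- Set **scoring** to the metric that matches the real objective (e.g., `"f1"` when accuracy is misleading on imbalanced classes).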
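
The stratification point is easiest to see on imbalanced labels. The sketch below uses a hypothetical label vector (`y_imb`, 90 negatives and 10 positives) and a helper `positive_fraction_per_fold`, both invented for this illustration, to compare how many positives land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # feature values do not affect the split


def positive_fraction_per_fold(splitter):
    """Return the share of positive labels in each validation fold."""
    return [
        float(y_imb[val_idx].mean())
        for _, val_idx in splitter.split(X_imb, y_imb)
    ]


kfold_fracs = positive_fraction_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0)
)
strat_fracs = positive_fraction_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold positive fraction per fold:          ", kfold_fracs)
print("StratifiedKFold positive fraction per fold:", strat_fracs)
```

With these labels every stratified validation fold holds 2 positives out of 20 samples, while plain `KFold` lets the count drift from fold to fold depending on the shuffle.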


## Cross-Validation with GridSearchCV

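Grid search fits the model for every combination in a **parameter grid**, scores each combination with cross-validation, and selects the best-scoring setting. With the default `refit=True`, it then refits that setting on all the data passed to `fit`.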

In [4]:
from sklearn.model_selection import GridSearchCV

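# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__penalty": ["l2"],
}

# Tune on the training split only, so the test split stays untouched.
search = GridSearchCV(
    model,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
# search.score uses the best pipeline, refit on all of X_train.
print(f"Test accuracy with best params: {search.score(X_test, y_test):.3f}")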




Best params: {'clf__C': 0.1, 'clf__penalty': 'l2'}
Best CV accuracy: 0.887
Test accuracy with best params: 0.850
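
Beyond the single best setting, the fitted search keeps per-candidate statistics in `cv_results_`. The following is a minimal sketch of reading them, using fresh synthetic data and the assumed names `X_demo`, `y_demo`, `pipe`, and `demo_search` rather than the objects from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

demo_search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per parameter combination
results = demo_search.cv_results_
rows = zip(
    results["params"],
    results["mean_test_score"],
    results["std_test_score"],
    results["rank_test_score"],
)
for params, mean, std, rank in rows:
    print(f"rank {rank}: C={params['clf__C']:<5} accuracy={mean:.3f} +/- {std:.3f}")
```

Comparing the mean scores against their standard deviations helps judge whether the top-ranked setting is meaningfully better or within fold-to-fold noise.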


