# Cross-Validation for Classification (scikit-learn) — No Pipeline

**Level:** Entry → Middle  
**Goal:** Learn how to do K-Fold / Stratified K-Fold cross-validation, compare models, and avoid data leakage — **without using `Pipeline`**.

### What you'll practice
- K-Fold vs. **Stratified** K-Fold
- Baseline with `DummyClassifier`
- Manual cross-validation loop (so you see what's happening under the hood)
- Proper scaling *inside* each fold (no leakage) — still **no Pipeline**
- Simple hyperparameter search with CV
- Final evaluation on a hold-out test set

> Dataset: `Breast Cancer Wisconsin (Diagnostic)` from scikit-learn (binary classification).


In [None]:
# 0) Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import __version__ as sk_version
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print('Versions -> numpy:', np.__version__, '| pandas:', pd.__version__, '| scikit-learn:', sk_version)


## 1) Load the dataset and create a hold-out test set

We set aside a **test set** now and use it only once, at the very end. All cross-validation and model selection happen on the training split, so the test score stays an honest estimate.


In [None]:
# Load data as (X, y)
data = load_breast_cancer()
X_full = pd.DataFrame(data.data, columns=data.feature_names)
y_full = pd.Series(data.target, name='target')

# Hold-out split: we'll only use the training part for CV.
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=RANDOM_STATE
)

X_train.shape, X_test.shape, y_train.value_counts().to_dict(), y_test.value_counts().to_dict()


## 2) Baseline with `DummyClassifier` + Stratified K-Fold

Stratification keeps the class proportions in every fold close to those of the full training set. A baseline tells us whether a real model actually beats the trivial rule "always predict the majority class".
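
As a rough sanity check on what this baseline measures: a `most_frequent` dummy scores exactly the majority-class share of whatever data it is evaluated on. The sketch below uses a small made-up label vector (`X_toy` and `y_toy` are illustrative names, not the notebook's data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 7 samples of class 1, 3 of class 0 (illustration only).
y_toy = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy ignores the features entirely

dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
dummy_acc = dummy.score(X_toy, y_toy)                    # accuracy of "always predict 1"
majority_share = np.bincount(y_toy).max() / len(y_toy)   # share of the largest class

print('dummy accuracy:', dummy_acc, '| majority share:', majority_share)
```

Any real model should clear this number by a comfortable margin before its accuracy means much.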


In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# random_state is omitted: it has no effect with strategy='most_frequent'
baseline = DummyClassifier(strategy='most_frequent')
accs = cross_val_score(baseline, X_train, y_train, cv=cv, scoring='accuracy')
print('Baseline accuracy per fold:', accs)
print('Baseline accuracy mean±std: %.3f ± %.3f' % (accs.mean(), accs.std()))


## 3) Quick model: Decision Tree with Stratified K-Fold

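Decision trees split on feature thresholds, so they don't need feature scaling. With no preprocessing step to fit, nothing can leak between folds and we can call `cross_val_score` directly.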


In [None]:
tree = DecisionTreeClassifier(random_state=RANDOM_STATE)
accs_tree = cross_val_score(tree, X_train, y_train, cv=cv, scoring='accuracy')
print('DecisionTree accuracy per fold:', accs_tree)
print('DecisionTree mean±std: %.3f ± %.3f' % (accs_tree.mean(), accs_tree.std()))


## 4) Logistic Regression — Manual CV loop with **scaling inside each fold**

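To avoid leakage, we must fit the scaler **only on the training fold**, then use it to transform both the training and the validation fold. Fitting it on all of `X_train` first would let the validation rows influence the mean and standard deviation used for scaling.
We will code the CV loop ourselves to make this explicit (still **no Pipeline**).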
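
To make the leakage concrete, here is a minimal sketch on synthetic data. The arrays `X_tr_toy` and `X_va_toy` are made up for illustration, and the validation fold is deliberately shifted so the effect is easy to see. It compares the mean a scaler learns from the training fold alone with the mean it learns when the validation rows are included.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical folds: the validation fold is shifted on purpose (illustration only).
X_tr_toy = rng.normal(loc=0.0, scale=1.0, size=(80, 1))
X_va_toy = rng.normal(loc=3.0, scale=1.0, size=(20, 1))

# Leakage-free: statistics come from the training fold only.
scaler_ok = StandardScaler().fit(X_tr_toy)

# Leaky: statistics are computed on training + validation rows together.
scaler_leaky = StandardScaler().fit(np.vstack([X_tr_toy, X_va_toy]))

print('mean learned from training fold only:', scaler_ok.mean_)
print('mean learned from train + validation:', scaler_leaky.mean_)
```

The two scalers learn different means because the second one has seen the validation rows. On real data the gap is usually much smaller, but the principle is the same.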


In [None]:
accs_lr, f1s_lr = [], []
for fold, (tr_idx, va_idx) in enumerate(cv.split(X_train, y_train), start=1):
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

    # scale inside the fold
    scaler = StandardScaler()
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)

    lr = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
    lr.fit(X_tr_s, y_tr)
    pred = lr.predict(X_va_s)

    accs_lr.append(accuracy_score(y_va, pred))
    # f1_score treats label 1 as the positive class by default (1 = benign in this dataset)
    f1s_lr.append(f1_score(y_va, pred))

    print(f'Fold {fold} -> acc={accs_lr[-1]:.3f}, f1={f1s_lr[-1]:.3f}')

print('\nLogReg accuracy mean±std: %.3f ± %.3f' % (np.mean(accs_lr), np.std(accs_lr)))
print('LogReg F1 mean±std: %.3f ± %.3f' % (np.mean(f1s_lr), np.std(f1s_lr)))


## 5) (Optional) K-Fold vs. Stratified K-Fold — class balance check

Let's compare the proportion of label-1 samples (benign, in this dataset) in each validation fold for both strategies.


In [None]:
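def fold_pos_rate(splitter, y):
    """Return the share of label-1 samples in each validation fold."""
    rates = []
    # split() only needs a placeholder X of the right length; stratification uses y
    for _, va in splitter.split(np.zeros(len(y)), y):
        rates.append(y.iloc[va].mean())
    return np.array(rates)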

kfold = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

rates_strat = fold_pos_rate(cv, y_train)
rates_kfold = fold_pos_rate(kfold, y_train)

print('StratifiedKFold pos rate per fold:', np.round(rates_strat, 3))
print('KFold pos rate per fold:', np.round(rates_kfold, 3))

# simple line plot to visualize (plt was imported in the setup cell)
plt.figure()
x = np.arange(1, len(rates_strat) + 1)
plt.plot(x, rates_strat, marker='o', label='StratifiedKFold')
plt.plot(x, rates_kfold, marker='o', label='KFold')
plt.xlabel('Fold')
plt.ylabel('Positive class proportion')
plt.title('Class balance per fold')
plt.legend()
plt.show()


## 6) Manual hyperparameter search with CV (Decision Tree `max_depth`)

We'll loop over a few depths, compute the mean CV accuracy for each, and pick the depth with the highest mean.
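
The cell below spells out every step by hand. For reference, the same kind of search can also be sketched with scikit-learn's `GridSearchCV`, which is not a `Pipeline` and works here because trees need no per-fold preprocessing. The variable names (`X_gs`, `cv_gs`, `search`) are illustrative. The block reloads the data so it runs on its own, and the selected depth may differ from the manual loop when several depths tie.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload and split so the sketch is self-contained (same settings as the notebook).
X_gs, y_gs = load_breast_cancer(return_X_y=True)
X_gs_tr, X_gs_te, y_gs_tr, y_gs_te = train_test_split(
    X_gs, y_gs, test_size=0.2, stratify=y_gs, random_state=42
)

cv_gs = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': list(range(2, 11))},  # same depth range as the manual loop
    cv=cv_gs,
    scoring='accuracy',
)
search.fit(X_gs_tr, y_gs_tr)  # the test split is never passed to the search

print('Best params:', search.best_params_)
print('Best mean CV accuracy: %.3f' % search.best_score_)
```

Writing the loop by hand first, as this notebook does, makes it clearer what such a helper automates.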


In [None]:
depths = list(range(2, 11))
results = []

for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=RANDOM_STATE)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
    results.append((d, scores.mean(), scores.std()))

# Sort by mean accuracy; on ties, prefer the shallower (simpler) tree
results_df = pd.DataFrame(results, columns=['max_depth', 'mean_acc', 'std_acc']).sort_values(
    ['mean_acc', 'max_depth'], ascending=[False, True]
)
print(results_df)

best_depth = int(results_df.iloc[0]['max_depth'])
print('\nBest max_depth:', best_depth)


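## 7) Train the selected model on the full training set and evaluate on the test set

The test set has not been touched so far, so the scores below estimate performance on unseen data. Evaluate only once: re-tuning after looking at these numbers would quietly turn the test set into another validation set.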


In [None]:
best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=RANDOM_STATE)
best_tree.fit(X_train, y_train)

test_pred = best_tree.predict(X_test)

print('Test accuracy:', accuracy_score(y_test, test_pred))
print('Test precision:', precision_score(y_test, test_pred))
print('Test recall:', recall_score(y_test, test_pred))
print('Test F1:', f1_score(y_test, test_pred))
print('\nClassification report:\n', classification_report(y_test, test_pred))

cm = confusion_matrix(y_test, test_pred)
print('Confusion matrix:\n', cm)


## 8) Exercises
1. Try `RepeatedStratifiedKFold` to reduce the variance of the CV estimate. Compare means and standard deviations with the single 5-fold run.
2. Add another model (e.g., `KNeighborsClassifier`) and standardize the features **inside each fold** before fitting.
3. Change scoring metrics to `'f1'` or `'roc_auc'` and compare rankings.
4. Create your own manual CV loop for `DecisionTreeClassifier` (no `cross_val_score`) to compute accuracy per fold.
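
A possible starting point for exercise 1, offered as a sketch rather than a full solution. It assumes 3 repeats of 5 folds (both values are arbitrary choices) and, to stay self-contained, scores on the whole dataset. In the notebook you would pass `X_train` and `y_train` instead. The names `rskf`, `X_ex`, `y_ex`, and `scores_ex` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_ex, y_ex = load_breast_cancer(return_X_y=True)

# 5 folds repeated 3 times with different shuffles -> 15 scores in total.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores_ex = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_ex, y_ex, cv=rskf, scoring='accuracy'
)

print('Number of scores:', len(scores_ex))
print('Accuracy mean±std: %.3f ± %.3f' % (scores_ex.mean(), scores_ex.std()))
```

Comparing this mean and standard deviation with the single 5-fold run is left as the exercise.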
