# Robust Model Evaluation with K-Fold Cross-Validation 🔄

When we evaluate a machine learning model, a single **train-test split** gives us one performance score. However, this score can be lucky or unlucky depending on which specific data points happen to end up in the test set. A different random split might result in a very different score, making it hard to be confident in the model's true performance.

**K-Fold Cross-Validation** gives a more reliable estimate of how a model will perform on unseen data. It splits the data into multiple "folds", trains the model on all but one fold, tests it on the held-out fold, repeats this until every fold has served as the test set once, and then averages the results.

---

## 1. The Limitation of a Single Train-Test Split

Let's start by training a `LogisticRegression` model on a single 75/25 split of a synthetic dataset.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Generate and split the data
X, y = make_classification(
    n_features=10,
    n_samples=1000,
    n_informative=8,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [17]:
from sklearn.metrics import classification_report

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("--- Performance on a Single Train-Test Split ---")
print(classification_report(y_test, y_pred))

--- Performance on a Single Train-Test Split ---
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



The model's accuracy on this split is about **70%**. But how much can we trust this single number? A different `random_state` in `train_test_split` would put different samples in the test set and would likely give us a different score.
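
To make this concrete, here is a minimal sketch that repeats the same 75/25 split with several seeds and collects the accuracy of each run. The seeds `0` to `4`, the names `split_scores`, `X_tr`, `X_te`, `y_tr`, `y_te`, and the `max_iter=1000` setting are assumptions added for illustration; they are not part of the original notebook, and the exact numbers will depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the cells above
X, y = make_classification(
    n_features=10, n_samples=1000, n_informative=8, n_redundant=2,
    n_repeated=0, n_classes=2, random_state=42,
)

split_scores = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000)  # max_iter raised to avoid convergence warnings
    clf.fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))  # accuracy on this particular split

print([round(s, 3) for s in split_scores])
print("spread (max - min):", round(max(split_scores) - min(split_scores), 3))
```

The spread between the best and worst run is a rough indication of how much a single split can mislead.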


## 2. How K-Fold Cross-Validation Works

K-Fold Cross-Validation addresses this issue. For a **K-Fold** (e.g., with K=5):
1.  The dataset is optionally shuffled and then split into **5** parts of equal (or nearly equal) size, called "folds".
2.  The model is trained and evaluated **5** times.
3.  In each iteration, one fold is held out as the **test set**, and the remaining 4 folds are used for training.
4.  The final score is the **average** of the 5 individual scores, providing a more stable and reliable performance estimate.
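
The steps above can be sketched with plain NumPy. This is an illustration of the idea only, not scikit-learn's actual implementation; the names `n_samples`, `k`, `folds`, `train_idx`, and `test_idx` are hypothetical.

```python
import numpy as np

n_samples, k = 1000, 5

# Step 1: shuffle the row indices and cut them into k folds
rng = np.random.default_rng(42)
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

# Steps 2-3: each fold is the test set once; the other k-1 folds form the training set
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
    # every fold prints train=800, test=200

# Sanity check: across all iterations, every sample is tested exactly once
all_test = np.sort(np.concatenate(folds))
print(np.array_equal(all_test, np.arange(n_samples)))  # True
```

Because the folds are disjoint and together cover the whole dataset, every sample contributes to the test score exactly once.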


We can set up a `KFold` object in `scikit-learn` to generate the splits. The following code demonstrates how it provides the training and testing indices for each iteration (or "split").


In [18]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# This loop demonstrates how KFold provides indices for each split
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # In a manual implementation, you would train and score a model here
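
For reference, a manual version of the full procedure might look like the sketch below, which fits a fresh `LogisticRegression` on each training split and records its accuracy on the held-out fold. The names `fold_model` and `fold_scores` and the `max_iter=1000` setting are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Same synthetic dataset and KFold configuration as in the cells above
X, y = make_classification(
    n_features=10, n_samples=1000, n_informative=8, n_redundant=2,
    n_repeated=0, n_classes=2, random_state=42,
)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_index, test_index in kf.split(X):
    # A new, unfitted model per fold so that no information leaks between folds
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X[train_index], y[train_index])
    fold_scores.append(fold_model.score(X[test_index], y[test_index]))

print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", np.mean(fold_scores))
```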

## 3. Easy Implementation with `cross_val_score`

While you can implement the loop manually, `scikit-learn` provides a convenient helper function, `cross_val_score`, that handles the entire K-Fold process automatically.

Let's use it to compare the average performance of three different models on our dataset.

### a) Logistic Regression

In [19]:
from sklearn.model_selection import cross_val_score

scores_logistic = cross_val_score(LogisticRegression(), X, y, cv=kf)
np.average(scores_logistic)

0.6950000000000001

### b) Decision Tree Classifier

In [20]:
from sklearn.tree import DecisionTreeClassifier

scores_dt = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)
np.average(scores_dt)

0.7989999999999999

### c) Random Forest Classifier

In [21]:
from sklearn.ensemble import RandomForestClassifier

scores_rf = cross_val_score(RandomForestClassifier(), X, y, cv=kf)
np.average(scores_rf)

0.898

**Conclusion:** Because each score is averaged over 5 folds, the comparison is more trustworthy than one based on a single split. On average, the **Random Forest Classifier (89.8%)** performs best on this dataset, ahead of the Decision Tree (79.9%) and Logistic Regression (69.5%).
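
When comparing models, it can also help to look at the spread of the fold scores, not only their mean. The sketch below reports both for the same three models. The `random_state=0` arguments on the tree-based models, the `max_iter=1000` setting, and the names `candidates` and `summary` are assumptions added here so the run is repeatable; the resulting numbers may differ slightly from the outputs shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_features=10, n_samples=1000, n_informative=8, n_redundant=2,
    n_repeated=0, n_classes=2, random_state=42,
)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=kf)  # default scorer: accuracy
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A gap between two models that is large relative to their standard deviations is more convincing than one of similar size to the fold-to-fold noise.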


## 4. Getting More Metrics with `cross_validate`

If you need more than a single accuracy score, the `cross_validate` function is even more powerful. It calculates multiple scoring metrics at once and also returns the fit time and scoring time for each fold.


In [22]:
from sklearn.model_selection import cross_validate

# Calculate both accuracy and ROC AUC for each fold
cross_validate(DecisionTreeClassifier(), X, y, cv=kf, scoring=['accuracy', 'roc_auc'])


{'fit_time': array([0.00499797, 0.00550818, 0.00450993, 0.00499916, 0.00456381]),
 'score_time': array([0.00200343, 0.00100327, 0.00200057, 0.00100136, 0.00200152]),
 'test_accuracy': array([0.71 , 0.83 , 0.775, 0.82 , 0.815]),
 'test_roc_auc': array([0.70594139, 0.82853141, 0.77661064, 0.8213141 , 0.81643918])}

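The output dictionary contains the results for each of the 5 folds, giving a comprehensive view of the model's performance and consistency. Here the accuracy ranges from 0.71 to 0.83 across folds, which shows how much a single split could mislead.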
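
The per-fold arrays are easier to read once summarized. Here is a minimal sketch that reduces each test metric to its mean, standard deviation, and range; the name `results` and the `random_state=0` on the tree are assumptions added for repeatability, so the values will not match the output above exactly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_features=10, n_samples=1000, n_informative=8, n_redundant=2,
    n_repeated=0, n_classes=2, random_state=42,
)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=kf, scoring=["accuracy", "roc_auc"],
)

# Each requested metric is returned under the key "test_<metric name>"
for key in ("test_accuracy", "test_roc_auc"):
    values = results[key]
    print(
        f"{key}: mean={values.mean():.3f}, std={values.std():.3f}, "
        f"min={values.min():.3f}, max={values.max():.3f}"
    )
```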