
# Cross-Validation vs Single Split: Performance Drift Analysis

# Introduction

In many machine learning projects, models are evaluated using a single train–test split. While simple, this approach can produce unstable and misleading performance estimates depending on how the data is split. This case study investigates how model performance drifts when evaluated using a single split versus k-fold cross-validation.

Evaluation strategy directly affects model trustworthiness. Two models trained on the same data can appear very different in performance purely due to data partition randomness. Understanding this drift is critical for building reliable, production-ready models, especially in data science competitions and real-world deployments.

## Approach

We will:

- Train the same model using a single train–test split and k-fold cross-validation

- Compare performance metrics across folds and splits

- Quantify performance variability and drift

- Analyze when a single split is risky and when cross-validation provides more stable estimates

The goal is to build intuition around evaluation robustness, not just model accuracy.


## Dataset Overview

This case study uses the Boston Housing dataset, a classic regression dataset whose target is the median value of owner-occupied homes (`MEDV`), predicted from socio-economic and environmental features. The dataset contains 506 samples with 13 input features, including crime rate, number of rooms, property tax rate, and neighborhood characteristics. Because it is small and noisy, measured model performance is highly sensitive to how the data is split. That makes it well suited to analyzing performance drift between a single train–test split and k-fold cross-validation.

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("fedesoriano/the-boston-houseprice-data")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/the-boston-houseprice-data


In [3]:
import os

import numpy as np
import pandas as pd

# Build the CSV path from the download location instead of hard-coding it.
df = pd.read_csv(os.path.join(path, "boston.csv"))

In [4]:
df.head(3)

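| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |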


In [5]:
df.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


## Feature–Target Split

We separate the dataset into input features (`X`) and the target variable (`y`). The target variable `MEDV` represents the median house value and will be used for regression modeling.

In [6]:
X = df.drop(columns=["MEDV"])
y = df["MEDV"]

print("Feature shape:", X.shape)
print("Target shape:", y.shape)

Feature shape: (506, 13)
Target shape: (506,)


## Baseline Evaluation: Single Train–Test Split

We first evaluate model performance using a single train–test split. This approach is simple but highly sensitive to how the data is split, which can lead to unstable performance estimates.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200,random_state=42)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse_single = np.sqrt(mean_squared_error(y_test, y_pred))
r2_single = r2_score(y_test, y_pred)

print(f"Single Split RMSE: {rmse_single:.3f}")
print(f"Single Split R²: {r2_single:.3f}")

Single Split RMSE: 2.917
Single Split R²: 0.884


### Observation

Using a single 80–20 train–test split, the model achieves an RMSE of **2.92** and an R² of **0.88**, indicating strong predictive performance on this particular split.

However, this evaluation reflects performance on **only one random partition of the data**. Given the relatively small size of the dataset, these metrics may be **overly optimistic or pessimistic** depending on how the data was split. As a result, this single-split score does not reliably capture the model’s true generalization ability. This motivates the need for **cross-validation** to assess performance stability and quantify potential evaluation drift.
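
One way to see this sensitivity directly is to repeat the same 80–20 split with different seeds and look at the spread of scores. The sketch below is illustrative only: it assumes synthetic stand-in data from `make_regression` (so it runs anywhere), a smaller forest for speed, and a hypothetical helper `single_split_rmse`. None of these come from the original notebook, and the numbers it prints will not match the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with the same shape as the housing set (506 x 13).
# Swap in the notebook's X and y to repeat the experiment on the real data.
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=20.0, random_state=0
)


def single_split_rmse(X, y, seed):
    """RMSE of a random forest on one 80-20 split drawn with the given seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))


# Only the split seed changes; the model configuration stays fixed.
split_rmses = [single_split_rmse(X_demo, y_demo, seed) for seed in range(10)]

print(f"Lowest RMSE over 10 splits:  {min(split_rmses):.3f}")
print(f"Highest RMSE over 10 splits: {max(split_rmses):.3f}")
print(f"Std of RMSE across splits:   {np.std(split_rmses):.3f}")
```

Any one of these ten scores could have been the "single split" result, which is why a lone score says little about how stable the estimate is.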

## Robust Evaluation: K-Fold Cross-Validation

To obtain a more reliable estimate of model performance, we use k-fold cross-validation. This approach evaluates the model across multiple data splits and reduces variance caused by split randomness.
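
To build intuition for what "multiple data splits" means, here is a minimal from-scratch sketch of how k-fold partitions row indices. The helper `kfold_indices` is hypothetical and written only for illustration; it is not scikit-learn's implementation, and its shuffle order will differ from `KFold`.

```python
import random


def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train_idx, val_idx) pairs; illustrative helper, not sklearn's KFold."""
    indices = list(range(n_samples))
    if seed is not None:
        random.Random(seed).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=506, n_splits=5, seed=42))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} val={len(val_idx)}")
```

With 506 rows and 5 folds, one fold holds 102 rows and the other four hold 101. Every row is used for validation exactly once and for training in the remaining four folds.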

In [9]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

rmse_scores = []
r2_scores = []

for train_idx, val_idx in kf.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = RandomForestRegressor(
        n_estimators=200,
        random_state=42
    )
    
    model.fit(X_train, y_train)
    preds = model.predict(X_val)

    rmse_scores.append(np.sqrt(mean_squared_error(y_val, preds)))
    r2_scores.append(r2_score(y_val, preds))

print("CV RMSE Mean:", np.mean(rmse_scores))
print("CV RMSE Std:", np.std(rmse_scores))
print("CV R² Mean:", np.mean(r2_scores))

CV RMSE Mean: 3.255151483931973
CV RMSE Std: 0.4567376242917499
CV R² Mean: 0.8705617147067851


### Observation

Using 5-fold cross-validation, the model achieves an average RMSE of **3.26** with a standard deviation of **0.46**, and a mean R² of **0.87**.

Compared to the single train–test split, cross-validation reports a somewhat higher error (RMSE 3.26 vs 2.92). The standard deviation of 0.46 across folds shows that measured performance depends on how the data is partitioned, which is the evaluation drift this study set out to examine.

The single-split result (RMSE ≈ 2.92) sits on the optimistic side of the cross-validated mean, though still within one standard deviation of it. Cross-validation gives the more reliable estimate of generalization performance because it averages over five partitions and reports their spread.
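
The manual fold loop can also be written more compactly with scikit-learn's `cross_validate`, which fits a fresh clone of the estimator on each fold. This is a sketch under assumptions, not the notebook's original code: it uses synthetic stand-in data from `make_regression` and a smaller forest so it runs quickly, so its numbers will differ from those reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Assumption: synthetic stand-in for the housing data (506 rows, 13 features).
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=20.0, random_state=0
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

cv_results = cross_validate(
    model,
    X_demo,
    y_demo,
    cv=kf,
    scoring=("neg_root_mean_squared_error", "r2"),
)

# scikit-learn negates error metrics so that higher is always better.
rmse_per_fold = -cv_results["test_neg_root_mean_squared_error"]
r2_per_fold = cv_results["test_r2"]

print(f"CV RMSE: {rmse_per_fold.mean():.3f} +/- {rmse_per_fold.std():.3f}")
print(f"CV R2:   {r2_per_fold.mean():.3f} +/- {r2_per_fold.std():.3f}")
```

Reporting the standard deviation of R² alongside its mean gives the same stability picture for both metrics.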

## Performance Drift: Single Split vs Cross-Validation

The single train–test split and cross-validation report different errors for the same model, a gap of about 0.34 RMSE, or roughly 10% of the cross-validated mean.

- **Single Split RMSE:** ~2.92  
- **Cross-Validation RMSE (Mean):** ~3.26  

The single split reports the lower error, but that figure rests on one data partition, so it cannot show on its own whether the split happened to be favorable. Cross-validation, by evaluating the model across multiple folds, exposes the variability in performance and provides a more conservative and stable estimate. The RMSE standard deviation across folds further shows that model performance is not consistent across different subsets of the data, reinforcing the risk of relying on a single split for evaluation.
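
The size of the drift can be put in context with a little arithmetic on the numbers reported above. The snippet below is a simple sanity check rather than a formal statistical test; the variable names are chosen here for illustration.

```python
# Values copied from the outputs reported earlier in this notebook.
rmse_single = 2.917
cv_rmse_mean = 3.255151483931973
cv_rmse_std = 0.4567376242917499

gap = cv_rmse_mean - rmse_single
relative_gap = gap / cv_rmse_mean
gap_in_fold_stds = gap / cv_rmse_std
within_one_std = abs(gap) <= cv_rmse_std

print(f"Absolute gap:          {gap:.3f}")               # 0.338
print(f"Relative gap:          {relative_gap:.1%}")      # 10.4%
print(f"Gap in fold-std units: {gap_in_fold_stds:.2f}")  # 0.74
print(f"Within one fold std:   {within_one_std}")        # True
```

The single-split score is about 0.74 fold standard deviations below the cross-validated mean. A gap of that size is what ordinary split-to-split variation would produce, which is the reason a single score should not be read as a precise estimate.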

## Conclusion & Key Takeaways

This case study shows that the reported performance of the same model can shift noticeably with the evaluation strategy: here RMSE moved from about 2.92 on a single split to about 3.26 under 5-fold cross-validation.

While a single train–test split reported strong performance, cross-validation revealed a **more realistic and stable estimate** of generalization by accounting for variability across multiple data splits. The observed performance drift highlights the risk of drawing conclusions from a single evaluation run, especially on small and noisy datasets.

Key takeaways:
- Single train–test splits can produce **optimistic or misleading results**
- Cross-validation averages over several partitions, which reduces the variance of the performance estimate and improves reliability
- Performance stability is as important as raw accuracy when assessing models

Overall, cross-validation should be preferred when model evaluation quality and robustness are critical.