**Project Introduction & Overview:**

This project investigates model selection bias in machine learning, following the insights of Cawley and Talbot (2010), who highlighted that hyperparameter tuning on the same data used for performance evaluation can lead to overfitting and overly optimistic accuracy estimates.

Using a k-NN classifier to predict breast cancer diagnoses from four predictor variables, we explore how standard grid search cross-validation can inflate performance metrics by exploiting chance patterns in the training data. 

To address this, we implement nested cross-validation, which separates model selection from evaluation and so gives a less biased estimate of real-world performance.

Through this process, we compare different k values and assess their true predictive power, illustrating the importance of careful validation in machine learning workflows.

In [33]:
#import libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

In [35]:
#load data
b_data = pd.read_csv("breast-cancer-data.csv")
b_data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [13]:
#Split data into training and validation sets
from sklearn.model_selection import train_test_split

#Select predictor and target columns
features = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean']
target = 'diagnosis'

X = b_data[features]
y = b_data[target]

#Split into train (90%) and validation (10%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=42, stratify=y)

print(f"Train size: {len(X_train)}, Validation size: {len(X_val)}")

Train size: 512, Validation size: 57


In [17]:
# Part 2: Grid search for optimal k
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#Define pipeline, scaling and k-NN
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

#Define parameter grid for k
param_grid = {'knn__n_neighbors': range(1, 31)}


#Grid search (using cross-validation on the training data only)
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Optimal k: {grid_search.best_params_['knn__n_neighbors']}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")


Optimal k: 22
Best cross-validation accuracy: 0.9043


**About the Paper by Cawley and Talbot:**

Cawley and Talbot (2010) discuss a significant yet frequently neglected issue in machine learning: overfitting during model selection, which leads to selection bias in model evaluation.

They argue that when a model's hyperparameters (for instance, the number of neighbors k in a k-NN classifier) are tuned using the same data subsequently used for estimating generalization performance, the selection process can overfit to the noise present in that dataset.

When hyperparameters are optimized to maximize those estimates, the model may appear to perform exceptionally well, not due to genuine learning, but because it has adapted to idiosyncrasies of the dataset used in tuning.

As a result, the model's reported accuracy is optimistically biased: it will usually do worse on unseen data than it did during cross-validation.
The paper emphasizes that the problem is most severe when the same dataset is used for both model selection and final performance reporting, because information then leaks from the selection phase into the evaluation phase.

Cawley & Talbot write that “model selection should be viewed as an integral part of model fitting” and warn that common evaluation practices “are susceptible to a form of selection bias” (pp. 2079–2080, 2102–2105).

**How this applies specifically to our scenario:**

We utilized a k-NN classifier to predict breast cancer diagnosis based on four predictor variables (radius_mean, texture_mean, perimeter_mean, and area_mean).

To find the best model, we conducted a grid search over different values of k, using cross-validation on the same 90% training data.
However, as Cawley and Talbot describe, this procedure introduces model selection bias:

- The grid search optimizes k to maximize CV accuracy on the training data.
- Because the CV process is not perfectly reliable (it has variance), the chosen k may exploit chance patterns or noise in the training folds.
- When we later evaluate this tuned model, its performance may appear inflated compared to its true generalization ability.

This mirrors the major issue described in the paper. Our grid search has effectively overfit the model selection criterion, making our evaluation over-optimistic.
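The effect described in the bullets above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reproduction of the paper's experiments: every candidate model is given the same true accuracy, each estimate is an independent noisy measurement, and the function name `simulate_selection_bias` and all parameter values are hypothetical. In a real grid search the CV estimates for neighbouring values of k are strongly correlated, so the inflation is typically smaller than in this idealized setting.

```python
import random
import statistics

def simulate_selection_bias(n_candidates=30, n_samples=100, true_acc=0.90,
                            n_trials=500, seed=0):
    """Mean estimated accuracy of the 'best' candidate when all candidates
    are equally good (hypothetical, idealized setting)."""
    rng = random.Random(seed)
    best_estimates = []
    for _ in range(n_trials):
        # Each candidate's estimate is the fraction of n_samples it classifies
        # correctly, with every prediction correct with probability true_acc.
        estimates = [
            sum(rng.random() < true_acc for _ in range(n_samples)) / n_samples
            for _ in range(n_candidates)
        ]
        # Model selection: keep only the highest estimate.
        best_estimates.append(max(estimates))
    return statistics.mean(best_estimates)

mean_best = simulate_selection_bias()
print("True accuracy of every candidate: 0.90")
print(f"Mean estimate of the selected candidate: {mean_best:.3f}")
```

Although no candidate is actually better than 0.90, the estimate attached to the selected candidate sits above that value on average, purely because the maximum of several noisy estimates is biased upward.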

To mitigate this, we have to separate model selection from performance evaluation, for example through nested cross-validation or by scoring the tuned model on an independent hold-out set such as the 10% we reserved earlier.

This ensures that the data used to select hyperparameters is distinct from the data used to evaluate them, producing a far less biased estimate of real-world performance.
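To make the separation concrete, the following is a rough, library-free sketch of the nested loop structure. It assumes a toy one-dimensional dataset and a hand-written k-NN majority vote; the helper names (`k_fold_indices`, `knn_accuracy`, `select_k`, `nested_cv`) are illustrative only and are not part of scikit-learn. The folds here are shuffled but not stratified.

```python
import random

def k_fold_indices(indices, n_splits, rng):
    """Shuffle indices and return a list of (train, test) index pairs."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(n_splits)
    ]

def knn_accuracy(x, y, train_idx, test_idx, k):
    """Accuracy of a 1-D k-NN majority vote using train_idx as neighbours."""
    correct = 0
    for t in test_idx:
        neighbours = sorted(train_idx, key=lambda j: abs(x[j] - x[t]))[:k]
        votes = sum(y[j] for j in neighbours)
        pred = 1 if 2 * votes > k else 0
        correct += pred == y[t]
    return correct / len(test_idx)

def select_k(x, y, idx, k_grid, n_splits, rng):
    """Inner loop: choose k by cross-validation using only the indices in idx."""
    splits = k_fold_indices(idx, n_splits, rng)

    def mean_cv(k):
        return sum(knn_accuracy(x, y, tr, te, k) for tr, te in splits) / n_splits

    return max(k_grid, key=mean_cv)

def nested_cv(x, y, k_grid, n_outer=5, n_inner=5, seed=0):
    """Outer loop: score each tuned model on a fold the tuning never saw."""
    rng = random.Random(seed)
    outer_scores = []
    for train_idx, test_idx in k_fold_indices(list(range(len(x))), n_outer, rng):
        best_k = select_k(x, y, train_idx, k_grid, n_inner, rng)
        outer_scores.append(knn_accuracy(x, y, train_idx, test_idx, best_k))
    return outer_scores

# Toy data (an assumption for illustration): two overlapping Gaussian classes.
data_rng = random.Random(1)
y = [i % 2 for i in range(100)]
x = [data_rng.gauss(2.0 * label, 1.0) for label in y]

scores = nested_cv(x, y, k_grid=[1, 3, 5, 7, 9])
print(f"Outer-fold accuracies: {[round(s, 2) for s in scores]}")
print(f"Nested CV mean accuracy: {sum(scores) / len(scores):.3f}")
```

The key design point is that `select_k` only ever receives `train_idx`, so each outer test fold plays no part in choosing k. Each outer fold may select a different k; the nested score therefore evaluates the whole tuning procedure rather than one particular value of k.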

In [27]:
#Part 5: Implementing an unbiased model selection approach

from sklearn.model_selection import KFold, cross_val_score

#Recall pipe, param_grid, X_train, y_train, and X_val, y_val from previous parts

#Previous performance report
best_k_old = grid_search.best_params_['knn__n_neighbors']
train_acc_old = grid_search.best_score_
val_acc_old = grid_search.score(X_val, y_val)

print("\nPrevious Approach (Single Grid Search CV):")
print(f"Best k: {best_k_old}")
print(f"Training CV Accuracy: {train_acc_old:.4f}")
print(f"Validation Accuracy (left-out 10% set): {val_acc_old:.4f}")


#New Approach: Nested Cross-Validation (unbiased model selection)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

#Reuse the same pipeline and parameter grid
nested_grid = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='accuracy')

#Perform nested cross-validation (outer loop)
nested_scores = cross_val_score(nested_grid, X_train, y_train, cv=outer_cv, scoring='accuracy')

print("\nNew Approach (Nested Cross-Validation):")
print(f"Nested CV mean accuracy: {nested_scores.mean():.4f}")
print(f"Nested CV standard deviation:{nested_scores.std():.4f}")

#Fit final model using full training data (with inner grid search)
nested_grid.fit(X_train, y_train)
best_k_new = nested_grid.best_params_['knn__n_neighbors']
val_acc_new = nested_grid.score(X_val, y_val)

print(f"Best k (nested): {best_k_new}")
print(f"Validation Accuracy (left-out 10% set): {val_acc_new:.4f}")



Previous Approach (Single Grid Search CV):
Best k: 22
Training CV Accuracy: 0.9043
Validation Accuracy (left-out 10% set): 0.9298

New Approach (Nested Cross-Validation):
Nested CV mean accuracy: 0.9061
Nested CV standard deviation:0.0377
Best k (nested): 27
Validation Accuracy (left-out 10% set): 0.9298


**Comparison:**

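- The single grid search CV selected k = 22, with a cross-validation accuracy of 0.9043 on the training data and a validation accuracy of 0.9298.

- The nested cross-validation gave a mean outer-fold accuracy of 0.9061 with a standard deviation of 0.0377 across the five outer folds. Nested CV itself does not pick a single k; refitting the grid search on the full training set with the shuffled `KFold` splitter selected k = 27, which reached the same validation accuracy of 0.9298.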

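Both approaches performed similarly. In this run the nested estimate (0.9061) is not lower than the single grid search estimate (0.9043), so no optimistic bias is visible here. The two figures were also computed on different fold splits (`cv=5` defaults to stratified folds for a classifier, while the nested run used shuffled `KFold`), so they are not strictly comparable. In principle the nested CV estimate is still the more trustworthy one, since it separates model tuning from evaluation.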

This aligns with Cawley & Talbot (2010), who highlight that independent validation prevents selection bias and overfitting in model selection.
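As a rough check on how much weight these numbers can bear, the sketch below works out the uncertainty implied by the reported figures. It assumes a simple binomial standard error for the 57-sample validation set and treats the five outer-fold scores as independent, which is only approximately true because their training sets overlap; `nested_scores.std()` is also NumPy's population standard deviation. The variable names are illustrative.

```python
import math

# Reported figures from the outputs above.
n_val = 57            # size of the held-out 10% validation set
val_acc = 0.9298      # validation accuracy for both approaches
nested_mean = 0.9061  # nested CV mean accuracy
nested_std = 0.0377   # standard deviation across outer folds
n_outer = 5           # number of outer folds

# Number of correctly classified validation samples.
n_correct = round(val_acc * n_val)

# Binomial standard error of a proportion estimated from n_val samples.
se_val = math.sqrt(val_acc * (1 - val_acc) / n_val)

# Approximate standard error of the nested CV mean (assumes independent folds).
se_nested = nested_std / math.sqrt(n_outer)

print(f"Validation set: {n_correct}/{n_val} correct, standard error ~ {se_val:.3f}")
print(f"Nested CV mean: {nested_mean:.4f}, standard error ~ {se_nested:.3f}")
```

The validation accuracy corresponds to 53 of 57 samples classified correctly, with a standard error of roughly 0.034, and the nested mean has a standard error of roughly 0.017. Differences of a percentage point or two between the estimates are therefore well within the noise.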