# Entry-Level Classification Lab: Cross-Validation & Hyperparameter Search (No Pipelines)

Welcome. In this lab you will practice **classification** with **hyperparameter search** using four validation strategies:

1. **Train/Test split** (with a tiny validation split taken from the training set)
2. **K-Fold**
3. **Stratified K-Fold**
4. **Group K-Fold**

You will apply all four strategies to **three datasets**: **Iris**, **Wine**, and **Breast Cancer**.  
For simplicity and clarity, we use a single model family: **DecisionTreeClassifier**.  
You will see **explicit split loops** and **no pipelines**, to keep everything transparent and beginner-friendly.

> Note: For Group K-Fold we simulate groups from the row indices so that samples from the same “group” never end up in both train and validation within a fold.


## What you'll learn
- How to perform a simple hyperparameter search with a **train/test split** plus a small validation set.
- How to do the same search using **K-Fold**, **Stratified K-Fold**, and **Group K-Fold**.
- How to read, compare, and reason about validation results.
- Why stratification and grouping matter in practice.

The code is structured step by step with comments so you can follow the thinking process.


## Setup

Install requirements if needed in your environment:
```bash
pip install scikit-learn pandas numpy
```


In [21]:
# Imports
import numpy as np
import pandas as pd

from itertools import product

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, GroupKFold

# Reproducibility
np.random.seed(42)


In [22]:
# Define a small, beginner-friendly hyperparameter grid for DecisionTreeClassifier
param_grid = {
    "max_depth": [None, 3, 5, 7],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2]
}

# Expand to a list of dicts for easy iteration
keys = list(param_grid.keys())
params_list = []
for values in product(*[param_grid[k] for k in keys]):
    params_list.append(dict(zip(keys, values)))

print(f"Number of parameter combinations: {len(params_list)}")


Number of parameter combinations: 16
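
As a cross-check on the manual expansion above, scikit-learn's `ParameterGrid` enumerates the same Cartesian product. This is a minimal sketch, assuming the same grid; `manual_list`, `library_list`, and `as_hashable` are illustrative names. `ParameterGrid` may yield the combinations in a different order, so the comparison ignores order.

```python
from itertools import product

from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": [None, 3, 5, 7],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Manual expansion, mirroring the cell above
keys = list(param_grid.keys())
manual_list = [
    dict(zip(keys, values))
    for values in product(*[param_grid[k] for k in keys])
]

# Library expansion of the same grid
library_list = list(ParameterGrid(param_grid))


def as_hashable(params):
    # Turn a params dict into an order-independent, hashable form
    return frozenset(params.items())


same_combinations = (
    {as_hashable(p) for p in manual_list} == {as_hashable(p) for p in library_list}
)
print(len(manual_list), len(library_list), same_combinations)
```

The hand-written version stays useful in this lab because it makes the 4 x 2 x 2 product explicit.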


In [41]:
# Display each parameter combination in params_list
for param in params_list:
    print(param)

{'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1}
{'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 2}
{'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 1}
{'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 2}
{'max_depth': 3, 'min_samples_split': 2, 'min_samples_leaf': 1}
{'max_depth': 3, 'min_samples_split': 2, 'min_samples_leaf': 2}
{'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 1}
{'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 2}
{'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 1}
{'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 2}
{'max_depth': 5, 'min_samples_split': 5, 'min_samples_leaf': 1}
{'max_depth': 5, 'min_samples_split': 5, 'min_samples_leaf': 2}
{'max_depth': 7, 'min_samples_split': 2, 'min_samples_leaf': 1}
{'max_depth': 7, 'min_samples_split': 2, 'min_samples_leaf': 2}
{'max_depth': 7, 'min_samples_split': 5, 'min_samples_leaf': 1}
{'max_depth': 7, 'min_sample

## Scoring
We will use **accuracy** for simplicity. For each split method and each parameter setting, we compute validation accuracy and then keep the best-performing parameters.
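
To make the metric concrete, here is a small sketch showing that accuracy is simply the fraction of predictions that match the true labels. The label arrays are hypothetical, made up for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions (not from any dataset in this lab)
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Accuracy = number of correct predictions / number of predictions
manual_acc = float(np.mean(y_true == y_pred))
sklearn_acc = accuracy_score(y_true, y_pred)

print(manual_acc, sklearn_acc)  # 6 of 8 predictions match
```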


# Dataset: Iris

We will load the Iris dataset into a Pandas DataFrame for clarity, and then apply the four validation strategies.


In [23]:
# Load Iris dataset
data = datasets.load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names if hasattr(data, "feature_names") else [f"feature_{i}" for i in range(data.data.shape[1])])
y = pd.Series(data.target, name="target")

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Target classes and counts:\n", y.value_counts().sort_index())


X shape: (150, 4)
y shape: (150,)
Target classes and counts:
 target
0    50
1    50
2    50
Name: count, dtype: int64


In [43]:
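# Note: this cell has execution count 43, so it ran after the Breast Cancer data was loaded (In [35]).
# The preview below therefore shows the Breast Cancer features (569 rows), not Iris.
X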

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [44]:
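# Note: run as In [44], after the Breast Cancer load, so these are the Breast Cancer targets (length 569), not Iris.
y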

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: int64

## 1) Train/Test split with a small validation set

We first make a **single** train/test split. Then we carve out a **validation** subset from the training data to pick hyperparameters. Finally, we retrain on the full training set using the best params and evaluate once on the held-out test set.

The split itself is a single `train_test_split` call.


In [24]:
# Train/Test split: hold out 20% of the rows as the final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Make a small validation split from the training set (60/20/20 overall)
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

best_params_tt = None
best_val_acc_tt = -1.0

for params in params_list:
    model = DecisionTreeClassifier(random_state=42, **params)
    model.fit(X_train_sub, y_train_sub)
    val_pred = model.predict(X_val)
    val_acc = accuracy_score(y_val, val_pred)
    if val_acc > best_val_acc_tt:
        best_val_acc_tt = val_acc
        best_params_tt = params

print(f"Best params (Train/Test with validation): {best_params_tt} | Val Acc: {best_val_acc_tt:.4f}")

# Retrain on full training set with best params; evaluate on the test set
final_model_tt = DecisionTreeClassifier(random_state=42, **best_params_tt)
final_model_tt.fit(X_train, y_train)
test_pred = final_model_tt.predict(X_test)
test_acc_tt = accuracy_score(y_test, test_pred)
print(f"Test Accuracy (best params): {test_acc_tt:.4f}")


Best params (Train/Test with validation): {'max_depth': 3, 'min_samples_split': 2, 'min_samples_leaf': 1} | Val Acc: 0.9333
Test Accuracy (best params): 1.0000
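
The "60/20/20 overall" comment in the cell above can be checked by counting rows after both splits. A minimal sketch on Iris, assuming the same `test_size` and `random_state` values; the `sizes` and `fractions` dictionaries are illustrative names.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# First split: 20% of all rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% (= 20% overall) becomes validation
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

n = len(X)
sizes = {"train_sub": len(X_train_sub), "val": len(X_val), "test": len(X_test)}
fractions = {name: size / n for name, size in sizes.items()}
print(sizes)
print(fractions)
```

With only 30 validation rows, a single misclassified sample moves validation accuracy by about 3.3 percentage points, which is one reason the cross-validation strategies that follow average over several folds.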


## 2) K-Fold cross-validation hyperparameter search

Here we use **KFold** and loop through the folds explicitly, so every train/validation split is visible.


In [25]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

best_params_kf = None
best_cv_acc_kf = -1.0

for params in params_list:
    fold_accs = []
    for train_index, valid_index in kf.split(X):
        # The held-out part of each fold acts as the validation set for this parameter setting
        X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_kf:
        best_cv_acc_kf = mean_acc
        best_params_kf = params

print(f"Best params (KFold): {best_params_kf} | Mean CV Acc: {best_cv_acc_kf:.4f}")


Best params (KFold): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1} | Mean CV Acc: 0.9533
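
The explicit fold loop can be sanity-checked against scikit-learn's `cross_val_score` helper, which runs the same fit/predict/score cycle per fold. This is a sketch under stated assumptions: one fixed parameter setting chosen for illustration, Iris loaded as NumPy arrays, and the same `KFold` configuration as above.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One parameter setting picked for illustration
params = {"max_depth": 3, "min_samples_split": 2, "min_samples_leaf": 1}

# Manual loop, as in the lab
manual_accs = []
for train_index, valid_index in kf.split(X):
    model = DecisionTreeClassifier(random_state=42, **params)
    model.fit(X[train_index], y[train_index])
    pred = model.predict(X[valid_index])
    manual_accs.append(accuracy_score(y[valid_index], pred))

# Same folds, same model settings, via the helper
helper_accs = cross_val_score(
    DecisionTreeClassifier(random_state=42, **params),
    X, y, cv=kf, scoring="accuracy",
)

print(np.round(manual_accs, 4))
print(np.round(helper_accs, 4))
```

Because the splitter has a fixed `random_state`, both approaches see identical folds, so the per-fold accuracies should agree.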


## 3) Stratified K-Fold cross-validation hyperparameter search

For classification, **StratifiedKFold** keeps class proportions similar in every fold.


In [26]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

best_params_skf = None
best_cv_acc_skf = -1.0

for params in params_list:
    fold_accs = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        # Slice features and labels by position; the held-out part of the fold is the validation set
        X_train, X_valid = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_skf:
        best_cv_acc_skf = mean_acc
        best_params_skf = params

print(f"Best params (StratifiedKFold): {best_params_skf} | Mean CV Acc: {best_cv_acc_skf:.4f}")


Best params (StratifiedKFold): {'max_depth': 3, 'min_samples_split': 2, 'min_samples_leaf': 1} | Mean CV Acc: 0.9600
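
To see what "keeps class proportions similar" means in practice, the sketch below counts how many validation samples of each class land in each fold for `KFold` and for `StratifiedKFold` on Iris. The helper `validation_class_counts` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)


def validation_class_counts(splitter):
    # One row per fold, one column per class:
    # number of validation samples of each class in that fold
    return np.array([
        np.bincount(y[valid_idx], minlength=3)
        for _, valid_idx in splitter.split(X, y)
    ])


kfold_counts = validation_class_counts(
    KFold(n_splits=5, shuffle=True, random_state=42)
)
stratified_counts = validation_class_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold:\n", kfold_counts)
print("StratifiedKFold:\n", stratified_counts)
```

Iris has 50 samples per class, so each stratified validation fold holds 10 of each class, while plain `KFold` lets the per-class counts drift from fold to fold.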


## 4) Group K-Fold cross-validation hyperparameter search

**GroupKFold** ensures that the same group never appears in both train and validation within the same fold.  
These toy datasets don't come with groups, so we **simulate** them by assigning every few consecutive rows to the same group.


In [27]:
from sklearn.model_selection import GroupKFold

# Simulate groups so that samples that share a group never leak across folds
# Ensure there are at least as many groups as folds
group_size = 3  # tweakable
groups = np.arange(len(X)) // group_size

gkf = GroupKFold(n_splits=5)

best_params_gkf = None
best_cv_acc_gkf = -1.0

for params in params_list:
    fold_accs = []
    for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups), start=1):
        X_train, X_valid = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_gkf:
        best_cv_acc_gkf = mean_acc
        best_params_gkf = params

print(f"Best params (GroupKFold): {best_params_gkf} | Mean CV Acc: {best_cv_acc_gkf:.4f}")


Best params (GroupKFold): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1} | Mean CV Acc: 0.9400
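
The no-leakage guarantee can be verified directly by checking that no group id appears on both sides of any fold. A minimal sketch, assuming the same simulated groups of 3 consecutive rows; `X_dummy` is a placeholder array introduced here, since the split depends on `groups` rather than on feature values.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

n_samples = 150  # same number of rows as Iris
group_size = 3
groups = np.arange(n_samples) // group_size  # 50 simulated groups

# Placeholder features; only the number of rows matters for splitting
X_dummy = np.zeros((n_samples, 1))

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, valid_idx in gkf.split(X_dummy, groups=groups):
    # Group ids present in both the training and the validation part
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    overlaps.append(len(shared))

print(overlaps)  # every entry is 0: no group straddles a split
```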


In [28]:
# Summary of best results for this dataset
summary_rows = [
    {"method": "Train/Test + small Val", "best_params": best_params_tt, "score": float(best_val_acc_tt), "note": "Validation accuracy (not test) used for model selection"},
    {"method": "KFold", "best_params": best_params_kf, "score": float(best_cv_acc_kf), "note": "Mean CV accuracy"},
    {"method": "StratifiedKFold", "best_params": best_params_skf, "score": float(best_cv_acc_skf), "note": "Mean CV accuracy"},
    {"method": "GroupKFold", "best_params": best_params_gkf, "score": float(best_cv_acc_gkf), "note": "Mean CV accuracy"},
]
summary_df = pd.DataFrame(summary_rows)
summary_df


|   | method | best_params | score | note |
|---|---|---|---|---|
| 0 | Train/Test + small Val | {'max_depth': 3, 'min_samples_split': 2, 'min_... | 0.933333 | Validation accuracy (not test) used for model ... |
| 1 | KFold | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.953333 | Mean CV accuracy |
| 2 | StratifiedKFold | {'max_depth': 3, 'min_samples_split': 2, 'min_... | 0.96 | Mean CV accuracy |
| 3 | GroupKFold | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.94 | Mean CV accuracy |
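
This lab writes the search loops by hand on purpose, but it can be reassuring to compare against scikit-learn's `GridSearchCV`, which automates the same idea. The sketch below is an optional cross-check, assuming the same grid and the same `StratifiedKFold` settings on Iris. When several settings tie, `GridSearchCV` may report a different winner than the manual loop because it visits the combinations in a different order.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

param_grid = {
    "max_depth": [None, 3, 5, 7],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fit one tree per (parameter setting, fold) and average accuracy per setting
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=skf,
    scoring="accuracy",
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```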


## Tips
- On **small datasets**, many parameter settings will look similar. That's normal.
- **StratifiedKFold** is usually better than plain KFold for classification, especially with imbalanced classes.
- **GroupKFold** is essential when you have **leakage risk** across related samples (e.g., multiple records from one patient or user).

Try changing the hyperparameter grid to see how results move.
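
The loops above keep only the single best setting, which hides how close the runners-up are. One way to see the whole picture is to record the mean and standard deviation of fold accuracy for every setting. This is a sketch on Iris with `StratifiedKFold`; the `results_df` table and its column names are illustrative choices, not part of the lab's code.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

param_grid = {
    "max_depth": [None, 3, 5, 7],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
keys = list(param_grid)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for values in product(*param_grid.values()):
    params = dict(zip(keys, values))
    fold_accs = []
    for train_idx, valid_idx in skf.split(X, y):
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[valid_idx])
        fold_accs.append(accuracy_score(y[valid_idx], pred))
    # Store params as text so None is displayed as-is in the table
    rows.append({
        "params": str(params),
        "mean_acc": float(np.mean(fold_accs)),
        "std_acc": float(np.std(fold_accs)),
    })

# Best settings first
results_df = (
    pd.DataFrame(rows)
    .sort_values("mean_acc", ascending=False)
    .reset_index(drop=True)
)
print(results_df.head())
```

If the gap between the top rows is smaller than the fold-to-fold standard deviation, the "best" setting is likely a matter of chance rather than a real improvement.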


# Dataset: Wine

We will load the Wine dataset into a Pandas DataFrame for clarity, and then apply the four validation strategies.


In [29]:
# Load Wine dataset
data = datasets.load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names if hasattr(data, "feature_names") else [f"feature_{i}" for i in range(data.data.shape[1])])
y = pd.Series(data.target, name="target")

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Target classes and counts:\n", y.value_counts().sort_index())


X shape: (178, 13)
y shape: (178,)
Target classes and counts:
 target
0    59
1    71
2    48
Name: count, dtype: int64


## 1) Train/Test split with a small validation set

We first make a **single** train/test split. Then we carve out a **validation** subset from the training data to pick hyperparameters. Finally, we retrain on the full training set using the best params and evaluate once on the held-out test set.

The split itself is a single `train_test_split` call.


In [30]:
# Train/Test split: hold out 20% of the rows as the final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Make a small validation split from the training set (60/20/20 overall)
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

best_params_tt = None
best_val_acc_tt = -1.0

for params in params_list:
    model = DecisionTreeClassifier(random_state=42, **params)
    model.fit(X_train_sub, y_train_sub)
    val_pred = model.predict(X_val)
    val_acc = accuracy_score(y_val, val_pred)
    if val_acc > best_val_acc_tt:
        best_val_acc_tt = val_acc
        best_params_tt = params

print(f"Best params (Train/Test with validation): {best_params_tt} | Val Acc: {best_val_acc_tt:.4f}")

# Retrain on full training set with best params; evaluate on the test set
final_model_tt = DecisionTreeClassifier(random_state=42, **best_params_tt)
final_model_tt.fit(X_train, y_train)
test_pred = final_model_tt.predict(X_test)
test_acc_tt = accuracy_score(y_test, test_pred)
print(f"Test Accuracy (best params): {test_acc_tt:.4f}")


Best params (Train/Test with validation): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 2} | Val Acc: 0.9722
Test Accuracy (best params): 0.9444


## 2) K-Fold cross-validation hyperparameter search

Here we use **KFold** and loop through the folds explicitly, so every train/validation split is visible.


In [31]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

best_params_kf = None
best_cv_acc_kf = -1.0

for params in params_list:
    fold_accs = []
    for train_index, valid_index in kf.split(X):
        # The held-out part of each fold acts as the validation set for this parameter setting
        X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_kf:
        best_cv_acc_kf = mean_acc
        best_params_kf = params

print(f"Best params (KFold): {best_params_kf} | Mean CV Acc: {best_cv_acc_kf:.4f}")


Best params (KFold): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 2} | Mean CV Acc: 0.8930


## 3) Stratified K-Fold cross-validation hyperparameter search

For classification, **StratifiedKFold** keeps class proportions similar in every fold.


In [32]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

best_params_skf = None
best_cv_acc_skf = -1.0

for params in params_list:
    fold_accs = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        # Slice features and labels by position; the held-out part of the fold is the validation set
        X_train, X_valid = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_skf:
        best_cv_acc_skf = mean_acc
        best_params_skf = params

print(f"Best params (StratifiedKFold): {best_params_skf} | Mean CV Acc: {best_cv_acc_skf:.4f}")


Best params (StratifiedKFold): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1} | Mean CV Acc: 0.8932


## 4) Group K-Fold cross-validation hyperparameter search

**GroupKFold** ensures that the same group never appears in both train and validation within the same fold.  
These toy datasets don't come with groups, so we **simulate** them by assigning every few consecutive rows to the same group.


In [33]:
from sklearn.model_selection import GroupKFold

# Simulate groups so that samples that share a group never leak across folds
# Ensure there are at least as many groups as folds
group_size = 3  # tweakable
groups = np.arange(len(X)) // group_size

gkf = GroupKFold(n_splits=5)

best_params_gkf = None
best_cv_acc_gkf = -1.0

for params in params_list:
    fold_accs = []
    for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups), start=1):
        X_train, X_valid = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_gkf:
        best_cv_acc_gkf = mean_acc
        best_params_gkf = params

print(f"Best params (GroupKFold): {best_params_gkf} | Mean CV Acc: {best_cv_acc_gkf:.4f}")


Best params (GroupKFold): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1} | Mean CV Acc: 0.9042


In [34]:
# Summary of best results for this dataset
summary_rows = [
    {"method": "Train/Test + small Val", "best_params": best_params_tt, "score": float(best_val_acc_tt), "note": "Validation accuracy (not test) used for model selection"},
    {"method": "KFold", "best_params": best_params_kf, "score": float(best_cv_acc_kf), "note": "Mean CV accuracy"},
    {"method": "StratifiedKFold", "best_params": best_params_skf, "score": float(best_cv_acc_skf), "note": "Mean CV accuracy"},
    {"method": "GroupKFold", "best_params": best_params_gkf, "score": float(best_cv_acc_gkf), "note": "Mean CV accuracy"},
]
summary_df = pd.DataFrame(summary_rows)
summary_df


|   | method | best_params | score | note |
|---|---|---|---|---|
| 0 | Train/Test + small Val | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.972222 | Validation accuracy (not test) used for model ... |
| 1 | KFold | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.893016 | Mean CV accuracy |
| 2 | StratifiedKFold | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.893175 | Mean CV accuracy |
| 3 | GroupKFold | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.904248 | Mean CV accuracy |


## Tips
- On **small datasets**, many parameter settings will look similar. That's normal.
- **StratifiedKFold** is usually better than plain KFold for classification, especially with imbalanced classes.
- **GroupKFold** is essential when you have **leakage risk** across related samples (e.g., multiple records from one patient or user).

Try changing the hyperparameter grid to see how results move.


# Dataset: Breast Cancer

We will load the Breast Cancer dataset into a Pandas DataFrame for clarity, and then apply the four validation strategies.


In [35]:
# Load Breast Cancer dataset
data = datasets.load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names if hasattr(data, "feature_names") else [f"feature_{i}" for i in range(data.data.shape[1])])
y = pd.Series(data.target, name="target")

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Target classes and counts:\n", y.value_counts().sort_index())


X shape: (569, 30)
y shape: (569,)
Target classes and counts:
 target
0    212
1    357
Name: count, dtype: int64
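
The class counts above are uneven (212 vs. 357), so it helps to know what accuracy a model gets by always predicting the majority class. The sketch below uses scikit-learn's `DummyClassifier` as that baseline; it is an illustrative addition, and `baseline_acc` is a name introduced here.

```python
from sklearn import datasets
from sklearn.dummy import DummyClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

# Always predict the most frequent class, ignoring the features
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Equals the share of the majority class: 357 / 569
baseline_acc = baseline.score(X, y)
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

Any accuracy reported in the sections below is best read against this baseline of roughly 0.63 rather than against 0.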


## 1) Train/Test split with a small validation set

We first make a **single** train/test split. Then we carve out a **validation** subset from the training data to pick hyperparameters. Finally, we retrain on the full training set using the best params and evaluate once on the held-out test set.

The split itself is a single `train_test_split` call.


In [36]:
# Train/Test split: hold out 20% of the rows as the final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Make a small validation split from the training set (60/20/20 overall)
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

best_params_tt = None
best_val_acc_tt = -1.0

for params in params_list:
    model = DecisionTreeClassifier(random_state=42, **params)
    model.fit(X_train_sub, y_train_sub)
    val_pred = model.predict(X_val)
    val_acc = accuracy_score(y_val, val_pred)
    if val_acc > best_val_acc_tt:
        best_val_acc_tt = val_acc
        best_params_tt = params

print(f"Best params (Train/Test with validation): {best_params_tt} | Val Acc: {best_val_acc_tt:.4f}")

# Retrain on full training set with best params; evaluate on the test set
final_model_tt = DecisionTreeClassifier(random_state=42, **best_params_tt)
final_model_tt.fit(X_train, y_train)
test_pred = final_model_tt.predict(X_test)
test_acc_tt = accuracy_score(y_test, test_pred)
print(f"Test Accuracy (best params): {test_acc_tt:.4f}")


Best params (Train/Test with validation): {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1} | Val Acc: 0.9298
Test Accuracy (best params): 0.9474


## 2) K-Fold cross-validation hyperparameter search

Here we use **KFold** and loop through the folds explicitly, so every train/validation split is visible.


In [37]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

best_params_kf = None
best_cv_acc_kf = -1.0

for params in params_list:
    fold_accs = []
    for train_index, valid_index in kf.split(X):
        # The held-out part of each fold acts as the validation set for this parameter setting
        X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_kf:
        best_cv_acc_kf = mean_acc
        best_params_kf = params

print(f"Best params (KFold): {best_params_kf} | Mean CV Acc: {best_cv_acc_kf:.4f}")


Best params (KFold): {'max_depth': 5, 'min_samples_split': 5, 'min_samples_leaf': 1} | Mean CV Acc: 0.9455


## 3) Stratified K-Fold cross-validation hyperparameter search

For classification, **StratifiedKFold** keeps class proportions similar in every fold.


In [38]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

best_params_skf = None
best_cv_acc_skf = -1.0

for params in params_list:
    fold_accs = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        # Slice features and labels by position; the held-out part of the fold is the validation set
        X_train, X_valid = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_skf:
        best_cv_acc_skf = mean_acc
        best_params_skf = params

print(f"Best params (StratifiedKFold): {best_params_skf} | Mean CV Acc: {best_cv_acc_skf:.4f}")


Best params (StratifiedKFold): {'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 1} | Mean CV Acc: 0.9280


## 4) Group K-Fold cross-validation hyperparameter search

**GroupKFold** ensures that the same group never appears in both train and validation within the same fold.  
These toy datasets don't come with groups, so we **simulate** them by assigning every few consecutive rows to the same group.


In [39]:
from sklearn.model_selection import GroupKFold

# Simulate groups so that samples that share a group never leak across folds
# Ensure there are at least as many groups as folds
group_size = 3  # tweakable
groups = np.arange(len(X)) // group_size

gkf = GroupKFold(n_splits=5)

best_params_gkf = None
best_cv_acc_gkf = -1.0

for params in params_list:
    fold_accs = []
    for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups), start=1):
        X_train, X_valid = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[test_idx]
        
        model = DecisionTreeClassifier(random_state=42, **params)
        model.fit(X_train, y_train)
        pred = model.predict(X_valid)
        acc = accuracy_score(y_valid, pred)
        fold_accs.append(acc)
    mean_acc = float(np.mean(fold_accs))
    if mean_acc > best_cv_acc_gkf:
        best_cv_acc_gkf = mean_acc
        best_params_gkf = params

print(f"Best params (GroupKFold): {best_params_gkf} | Mean CV Acc: {best_cv_acc_gkf:.4f}")


Best params (GroupKFold): {'max_depth': 5, 'min_samples_split': 5, 'min_samples_leaf': 2} | Mean CV Acc: 0.9279


In [40]:
# Summary of best results for this dataset
summary_rows = [
    {"method": "Train/Test + small Val", "best_params": best_params_tt, "score": float(best_val_acc_tt), "note": "Validation accuracy (not test) used for model selection"},
    {"method": "KFold", "best_params": best_params_kf, "score": float(best_cv_acc_kf), "note": "Mean CV accuracy"},
    {"method": "StratifiedKFold", "best_params": best_params_skf, "score": float(best_cv_acc_skf), "note": "Mean CV accuracy"},
    {"method": "GroupKFold", "best_params": best_params_gkf, "score": float(best_cv_acc_gkf), "note": "Mean CV accuracy"},
]
summary_df = pd.DataFrame(summary_rows)
summary_df


|   | method | best_params | score | note |
|---|---|---|---|---|
| 0 | Train/Test + small Val | {'max_depth': None, 'min_samples_split': 2, 'm... | 0.929825 | Validation accuracy (not test) used for model ... |
| 1 | KFold | {'max_depth': 5, 'min_samples_split': 5, 'min_... | 0.945521 | Mean CV accuracy |
| 2 | StratifiedKFold | {'max_depth': 5, 'min_samples_split': 2, 'min_... | 0.927993 | Mean CV accuracy |
| 3 | GroupKFold | {'max_depth': 5, 'min_samples_split': 5, 'min_... | 0.927899 | Mean CV accuracy |


## Tips
- On **small datasets**, many parameter settings will look similar. That's normal.
- **StratifiedKFold** is usually better than plain KFold for classification, especially with imbalanced classes.
- **GroupKFold** is essential when you have **leakage risk** across related samples (e.g., multiple records from one patient or user).

Try changing the hyperparameter grid to see how results move.
