
# Sklearn Cross-Validation Classification with GridSearchCV 

This notebook demonstrates how to use **`GridSearchCV`** to tune the hyperparameters of an **SVC** classifier on three datasets:
- `load_iris()`
- `load_wine()`
- `load_breast_cancer()`

We compare three cross-validation strategies:
- **KFold**
- **StratifiedKFold**
- **GroupKFold** (with synthetic groups for demonstration)

**Key constraints**
- No `Pipeline` is used.
- No manual for-loops over folds. `GridSearchCV` handles CV internally.
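- For `GroupKFold`, we supply a synthetic `groups` array because these datasets do not include natural group labels. The groups are assigned at random, so the `GroupKFold` scores illustrate the API only and say nothing about generalization to unseen groups.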
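
To make the role of `groups` concrete, here is a minimal sketch that assumes the same kind of random five-group labelling used later in this notebook. It checks that `GroupKFold` never places one group on both sides of a split. The variable names are illustrative, and the comprehension over `split()` is for inspection only; it is not part of the tuning workflow.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold

X, y = load_iris(return_X_y=True)

# Assumption: random labels in {0, ..., 4}, mirroring the synthetic groups in this notebook
rng = np.random.RandomState(42)
groups = rng.randint(0, 5, size=len(y))

gkf = GroupKFold(n_splits=5)
splits = list(gkf.split(X, y, groups=groups))

# Groups that appear in both the training and the test part of each split
overlaps = [set(groups[train_idx]) & set(groups[test_idx]) for train_idx, test_idx in splits]

# Test-fold sizes follow the group sizes, so they are usually unequal
test_sizes = [len(test_idx) for _, test_idx in splits]

print(overlaps)
print(test_sizes)
```

Every entry of `overlaps` should be an empty set. With real data, `groups` would hold something like a patient or session identifier instead of random integers.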

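> Note: In real projects, you would typically wrap preprocessing (for example feature scaling, to which SVC is sensitive) and the model in a `Pipeline`, so that preprocessing is fitted on the training part of each fold only and data leakage is avoided.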
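
As a rough illustration of that note, the sketch below wraps a `StandardScaler` and an `SVC` in a `Pipeline` and tunes it with `GridSearchCV` on the Wine data. It deliberately steps outside this notebook's no-`Pipeline` constraint. The step names `scaler` and `svc`, the reduced grid, and the seed are assumptions chosen for brevity.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Scaling and classification as one estimator (step names are illustrative)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid=pipe_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Because the scaler is a pipeline step, it is refitted on the training portion of every fold, so the validation portion never influences the scaling statistics.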


In [19]:

# ============================================
# Imports & common settings
# ============================================
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold, GridSearchCV
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42

# Hyperparameter grid for SVC
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'],    'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]},
    {'kernel': ['poly'],   'C': [0.1, 1, 10], 'degree': [2, 3], 'gamma': ['scale']},
]

# Cross-validation objects (reusable)
kf  = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
gkf = GroupKFold(n_splits=5)


In [20]:

# Utility to render compact summaries without loops
def summary_frame(dataset_name, kf_search, skf_search, gkf_search):
    return pd.DataFrame([
        {'Dataset': dataset_name, 'CV': 'KFold',           'Best score (accuracy)': kf_search.best_score_,  'Best params': kf_search.best_params_},
        {'Dataset': dataset_name, 'CV': 'StratifiedKFold', 'Best score (accuracy)': skf_search.best_score_, 'Best params': skf_search.best_params_},
        {'Dataset': dataset_name, 'CV': 'GroupKFold',      'Best score (accuracy)': gkf_search.best_score_, 'Best params': gkf_search.best_params_},
    ])

# Function to show top results table (no loops needed in usage)
def top_results(gsearch, n=10):
    df = pd.DataFrame(gsearch.cv_results_)
    cols = ['rank_test_score','mean_test_score','std_test_score',
            'param_kernel','param_C','param_gamma','param_degree']
    cols = [c for c in cols if c in df.columns]
    return df.sort_values('rank_test_score').loc[:, cols].head(n)



## 1. Iris
We run `GridSearchCV` three times, once per CV strategy. No fold loop is written by hand; each CV object generates the splits.


In [2]:

# ----------------------------
# Load Iris
# ----------------------------
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = pd.Series(iris.target, name='target')

# Synthetic groups for GroupKFold
np.random.seed(RANDOM_STATE)
groups_iris = np.random.randint(0, 5, size=len(y_iris))

# ----------------------------
# Grid searches
# ----------------------------
gs_iris_kf  = GridSearchCV(SVC(), param_grid=param_grid, cv=kf,  scoring='accuracy', n_jobs=-1)
gs_iris_skf = GridSearchCV(SVC(), param_grid=param_grid, cv=skf, scoring='accuracy', n_jobs=-1)
gs_iris_gkf = GridSearchCV(SVC(), param_grid=param_grid, cv=gkf, scoring='accuracy', n_jobs=-1)

gs_iris_kf.fit(X_iris, y_iris)
gs_iris_skf.fit(X_iris, y_iris)
gs_iris_gkf.fit(X_iris, y_iris, groups=groups_iris)

summary_iris = summary_frame('iris', gs_iris_kf, gs_iris_skf, gs_iris_gkf)
summary_iris


| | Dataset | CV | Best score (accuracy) | Best params |
|---|---|---|---|---|
| 0 | iris | KFold | 0.98 | `{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}` |
| 1 | iris | StratifiedKFold | 0.986667 | `{'C': 1, 'kernel': 'linear'}` |
| 2 | iris | GroupKFold | 0.978667 | `{'C': 1, 'kernel': 'linear'}` |


In [12]:

# Top configurations per CV (Iris)
top_iris_kf  = top_results(gs_iris_kf,  n=10)
top_iris_skf = top_results(gs_iris_skf, n=10)
top_iris_gkf = top_results(gs_iris_gkf, n=10)

top_iris_kf

| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 13 | 1 | 0.98 | 0.01633 | rbf | 10.0 | 0.01 | |
| 22 | 1 | 0.98 | 0.01633 | poly | 1.0 | scale | 2.0 |
| 21 | 1 | 0.98 | 0.01633 | poly | 0.1 | scale | 3.0 |
| 17 | 1 | 0.98 | 0.01633 | rbf | 100.0 | 0.01 | |
| 10 | 5 | 0.973333 | 0.024944 | rbf | 1.0 | 0.1 | |
| 0 | 5 | 0.973333 | 0.024944 | linear | 0.1 | | |
| 7 | 5 | 0.973333 | 0.024944 | rbf | 0.1 | 1 | |
| 1 | 5 | 0.973333 | 0.024944 | linear | 1.0 | | |
| 23 | 5 | 0.973333 | 0.024944 | poly | 1.0 | scale | 3.0 |
| 14 | 5 | 0.973333 | 0.013333 | rbf | 10.0 | 0.1 | |
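
Four configurations tie at rank 1 in the table above, yet `best_params_` reports only one of them. The sketch below uses a small illustrative grid (an assumption, not the notebook's full `param_grid`) to show one way to list every configuration that shares the top rank. In the scikit-learn versions I am aware of, `best_index_` is the first such row in grid order, but treat that as an assumption to verify against your installed version.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid (an assumption; smaller than the notebook's param_grid)
small_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100]}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(SVC(), param_grid=small_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Keep every row that shares rank 1, not just the one reported by best_params_
tied = results.loc[
    results["rank_test_score"] == 1,
    ["params", "mean_test_score", "std_test_score"],
]

print(tied)
print("Reported best index:", search.best_index_)
```

When several rows tie, a tie-breaker such as the lowest `std_test_score` or the simplest kernel is a judgment call that `GridSearchCV` does not make for you.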


In [13]:
top_iris_skf

| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.986667 | 0.026667 | linear | 1.0 | | |
| 14 | 1 | 0.986667 | 0.026667 | rbf | 10.0 | 0.1 | |
| 17 | 1 | 0.986667 | 0.026667 | rbf | 100.0 | 0.01 | |
| 2 | 4 | 0.98 | 0.026667 | linear | 10.0 | | |
| 23 | 4 | 0.98 | 0.026667 | poly | 1.0 | scale | 3.0 |
| 24 | 4 | 0.98 | 0.026667 | poly | 10.0 | scale | 2.0 |
| 12 | 7 | 0.973333 | 0.03266 | rbf | 10.0 | scale | |
| 0 | 7 | 0.973333 | 0.024944 | linear | 0.1 | | |
| 16 | 7 | 0.973333 | 0.03266 | rbf | 100.0 | scale | |
| 13 | 7 | 0.973333 | 0.024944 | rbf | 10.0 | 0.01 | |


In [14]:
top_iris_gkf


| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.978667 | 0.027455 | linear | 1.0 | | |
| 17 | 1 | 0.978667 | 0.027455 | rbf | 100.0 | 0.01 | |
| 25 | 1 | 0.978667 | 0.027455 | poly | 10.0 | scale | 3.0 |
| 23 | 4 | 0.977926 | 0.018147 | poly | 1.0 | scale | 3.0 |
| 2 | 5 | 0.975333 | 0.020827 | linear | 10.0 | | |
| 16 | 6 | 0.973667 | 0.025307 | rbf | 100.0 | scale | |
| 3 | 7 | 0.97319 | 0.014268 | linear | 100.0 | | |
| 13 | 8 | 0.972926 | 0.014442 | rbf | 10.0 | 0.01 | |
| 21 | 8 | 0.972926 | 0.014442 | poly | 0.1 | scale | 3.0 |
| 22 | 8 | 0.972926 | 0.014442 | poly | 1.0 | scale | 2.0 |



## 2. Wine
Same approach, still no manual loops.


In [4]:

# ----------------------------
# Load Wine
# ----------------------------
wine = load_wine()
X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
y_wine = pd.Series(wine.target, name='target')

# Synthetic groups for GroupKFold
np.random.seed(RANDOM_STATE + 1)
groups_wine = np.random.randint(0, 5, size=len(y_wine))

# ----------------------------
# Grid searches
# ----------------------------
gs_wine_kf  = GridSearchCV(SVC(), param_grid=param_grid, cv=kf,  scoring='accuracy', n_jobs=-1)
gs_wine_skf = GridSearchCV(SVC(), param_grid=param_grid, cv=skf, scoring='accuracy', n_jobs=-1)
gs_wine_gkf = GridSearchCV(SVC(), param_grid=param_grid, cv=gkf, scoring='accuracy', n_jobs=-1)

gs_wine_kf.fit(X_wine, y_wine)
gs_wine_skf.fit(X_wine, y_wine)
gs_wine_gkf.fit(X_wine, y_wine, groups=groups_wine)

summary_wine = summary_frame('wine', gs_wine_kf, gs_wine_skf, gs_wine_gkf)
summary_wine


| | Dataset | CV | Best score (accuracy) | Best params |
|---|---|---|---|---|
| 0 | wine | KFold | 0.960794 | `{'C': 0.1, 'kernel': 'linear'}` |
| 1 | wine | StratifiedKFold | 0.977619 | `{'C': 0.1, 'kernel': 'linear'}` |
| 2 | wine | GroupKFold | 0.977344 | `{'C': 0.1, 'kernel': 'linear'}` |


In [9]:

# Top configurations per CV (Wine)
top_wine_kf  = top_results(gs_wine_kf,  n=10)
top_wine_skf = top_results(gs_wine_skf, n=10)
top_wine_gkf = top_results(gs_wine_gkf, n=10)

top_wine_kf


| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.960794 | 0.028206 | linear | 0.1 | | |
| 1 | 2 | 0.949524 | 0.036809 | linear | 1.0 | | |
| 2 | 2 | 0.949524 | 0.036809 | linear | 10.0 | | |
| 3 | 2 | 0.949524 | 0.036809 | linear | 100.0 | | |
| 16 | 5 | 0.735873 | 0.067652 | rbf | 100.0 | scale | |
| 12 | 6 | 0.707937 | 0.056869 | rbf | 10.0 | scale | |
| 22 | 7 | 0.702063 | 0.096453 | poly | 1.0 | scale | 2.0 |
| 25 | 8 | 0.685397 | 0.098431 | poly | 10.0 | scale | 3.0 |
| 24 | 9 | 0.679841 | 0.064883 | poly | 10.0 | scale | 2.0 |
| 8 | 10 | 0.674286 | 0.088739 | rbf | 1.0 | scale | |


In [10]:
top_wine_skf

| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.977619 | 0.020832 | linear | 0.1 | | |
| 1 | 2 | 0.972063 | 0.024847 | linear | 1.0 | | |
| 2 | 2 | 0.972063 | 0.017571 | linear | 10.0 | | |
| 3 | 2 | 0.972063 | 0.017571 | linear | 100.0 | | |
| 16 | 5 | 0.746667 | 0.08601 | rbf | 100.0 | scale | |
| 12 | 6 | 0.712857 | 0.075981 | rbf | 10.0 | scale | |
| 22 | 7 | 0.691429 | 0.06477 | poly | 1.0 | scale | 2.0 |
| 24 | 8 | 0.69127 | 0.041978 | poly | 10.0 | scale | 2.0 |
| 25 | 9 | 0.685714 | 0.046465 | poly | 10.0 | scale | 3.0 |
| 8 | 10 | 0.674444 | 0.039769 | rbf | 1.0 | scale | |


In [11]:
top_wine_gkf

| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.977344 | 0.0211 | linear | 0.1 | | |
| 2 | 2 | 0.966216 | 0.019219 | linear | 10.0 | | |
| 3 | 2 | 0.966216 | 0.019219 | linear | 100.0 | | |
| 1 | 4 | 0.961338 | 0.024716 | linear | 1.0 | | |
| 16 | 5 | 0.763738 | 0.057462 | rbf | 100.0 | scale | |
| 12 | 6 | 0.73217 | 0.051137 | rbf | 10.0 | scale | |
| 17 | 7 | 0.696553 | 0.035407 | rbf | 100.0 | 0.01 | |
| 13 | 7 | 0.696553 | 0.035407 | rbf | 10.0 | 0.01 | |
| 8 | 9 | 0.686337 | 0.057657 | rbf | 1.0 | scale | |
| 24 | 10 | 0.686171 | 0.072058 | poly | 10.0 | scale | 2.0 |



## 3. Breast Cancer
Same structure as above.


In [6]:

# ----------------------------
# Load Breast Cancer
# ----------------------------
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = pd.Series(cancer.target, name='target')

# Synthetic groups for GroupKFold
np.random.seed(RANDOM_STATE + 2)
groups_cancer = np.random.randint(0, 5, size=len(y_cancer))

# ----------------------------
# Grid searches
# ----------------------------
gs_cancer_kf  = GridSearchCV(SVC(), param_grid=param_grid, cv=kf,  scoring='accuracy', n_jobs=-1)
gs_cancer_skf = GridSearchCV(SVC(), param_grid=param_grid, cv=skf, scoring='accuracy', n_jobs=-1)
gs_cancer_gkf = GridSearchCV(SVC(), param_grid=param_grid, cv=gkf, scoring='accuracy', n_jobs=-1)

gs_cancer_kf.fit(X_cancer, y_cancer)
gs_cancer_skf.fit(X_cancer, y_cancer)
gs_cancer_gkf.fit(X_cancer, y_cancer, groups=groups_cancer)

summary_cancer = summary_frame('breast_cancer', gs_cancer_kf, gs_cancer_skf, gs_cancer_gkf)
summary_cancer


| | Dataset | CV | Best score (accuracy) | Best params |
|---|---|---|---|---|
| 0 | breast_cancer | KFold | 0.964788 | `{'C': 100, 'kernel': 'linear'}` |
| 1 | breast_cancer | StratifiedKFold | 0.963096 | `{'C': 100, 'kernel': 'linear'}` |
| 2 | breast_cancer | GroupKFold | 0.960756 | `{'C': 10, 'kernel': 'linear'}` |


In [15]:

# Top configurations per CV (Breast Cancer)
top_cancer_kf  = top_results(gs_cancer_kf,  n=10)
top_cancer_skf = top_results(gs_cancer_skf, n=10)
top_cancer_gkf = top_results(gs_cancer_gkf, n=10)

top_cancer_kf

| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 3 | 1 | 0.964788 | 0.02169 | linear | 100.0 | | |
| 2 | 2 | 0.959525 | 0.019882 | linear | 10.0 | | |
| 0 | 3 | 0.954277 | 0.010398 | linear | 0.1 | | |
| 1 | 4 | 0.952507 | 0.014452 | linear | 1.0 | | |
| 16 | 5 | 0.936671 | 0.044304 | rbf | 100.0 | scale | |
| 24 | 6 | 0.920866 | 0.036935 | poly | 10.0 | scale | 2.0 |
| 12 | 7 | 0.919112 | 0.040656 | rbf | 10.0 | scale | |
| 8 | 8 | 0.917311 | 0.031555 | rbf | 1.0 | scale | |
| 25 | 9 | 0.915603 | 0.038779 | poly | 10.0 | scale | 3.0 |
| 23 | 10 | 0.910262 | 0.035277 | poly | 1.0 | scale | 3.0 |


In [17]:
top_cancer_skf


| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 3 | 1 | 0.963096 | 0.017886 | linear | 100.0 | | |
| 2 | 2 | 0.957833 | 0.017867 | linear | 10.0 | | |
| 1 | 3 | 0.947306 | 0.012318 | linear | 1.0 | | |
| 0 | 4 | 0.942059 | 0.020388 | linear | 0.1 | | |
| 16 | 5 | 0.933225 | 0.029127 | rbf | 100.0 | scale | |
| 12 | 6 | 0.920913 | 0.021489 | rbf | 10.0 | scale | |
| 24 | 7 | 0.920913 | 0.026023 | poly | 10.0 | scale | 2.0 |
| 25 | 7 | 0.920913 | 0.026608 | poly | 10.0 | scale | 3.0 |
| 8 | 9 | 0.913895 | 0.024397 | rbf | 1.0 | scale | |
| 22 | 10 | 0.910371 | 0.022457 | poly | 1.0 | scale | 2.0 |


In [18]:
top_cancer_gkf

| | rank_test_score | mean_test_score | std_test_score | param_kernel | param_C | param_gamma | param_degree |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 0.960756 | 0.013941 | linear | 10.0 | | |
| 3 | 2 | 0.958798 | 0.017143 | linear | 100.0 | | |
| 1 | 3 | 0.957567 | 0.00714 | linear | 1.0 | | |
| 0 | 4 | 0.953211 | 0.018253 | linear | 0.1 | | |
| 16 | 5 | 0.938434 | 0.02331 | rbf | 100.0 | scale | |
| 12 | 6 | 0.922669 | 0.016164 | rbf | 10.0 | scale | |
| 24 | 7 | 0.920932 | 0.020482 | poly | 10.0 | scale | 2.0 |
| 25 | 7 | 0.920932 | 0.020482 | poly | 10.0 | scale | 3.0 |
| 8 | 9 | 0.917859 | 0.022564 | rbf | 1.0 | scale | |
| 22 | 10 | 0.912543 | 0.028395 | poly | 1.0 | scale | 2.0 |



## Overall Summary
Concatenate the per-dataset summaries to see everything in one table.


In [8]:

overall_summary = pd.concat([summary_iris, summary_wine, summary_cancer], ignore_index=True)
overall_summary


| | Dataset | CV | Best score (accuracy) | Best params |
|---|---|---|---|---|
| 0 | iris | KFold | 0.98 | `{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}` |
| 1 | iris | StratifiedKFold | 0.986667 | `{'C': 1, 'kernel': 'linear'}` |
| 2 | iris | GroupKFold | 0.978667 | `{'C': 1, 'kernel': 'linear'}` |
| 3 | wine | KFold | 0.960794 | `{'C': 0.1, 'kernel': 'linear'}` |
| 4 | wine | StratifiedKFold | 0.977619 | `{'C': 0.1, 'kernel': 'linear'}` |
| 5 | wine | GroupKFold | 0.977344 | `{'C': 0.1, 'kernel': 'linear'}` |
| 6 | breast_cancer | KFold | 0.964788 | `{'C': 100, 'kernel': 'linear'}` |
| 7 | breast_cancer | StratifiedKFold | 0.963096 | `{'C': 100, 'kernel': 'linear'}` |
| 8 | breast_cancer | GroupKFold | 0.960756 | `{'C': 10, 'kernel': 'linear'}` |



## Notes and Next Steps
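- We avoided `Pipeline` to follow the constraint, but real workflows should put preprocessing inside a pipeline to avoid leakage. The features here are unscaled, which most likely explains why the `rbf` and `poly` kernels trail `linear` on Wine (roughly 0.67 to 0.76 versus 0.95 to 0.98 accuracy).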
- You can switch `scoring='accuracy'` to another metric (e.g., `'f1_macro'`) if class imbalance is a concern.
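- The `best_score_` values above are optimistic estimates of generalization, because the same folds were used to choose the hyperparameters. Running `cross_val_score` on `best_estimator_` with the same data does not fix this. For a less biased estimate, use nested cross-validation: pass the `GridSearchCV` object itself to `cross_val_score` with a separate outer CV, still without manual loops.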
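
One way to estimate generalization without reusing the tuning folds is nested cross-validation. The sketch below assumes a reduced grid and the Iris data purely for speed. The inner `GridSearchCV` selects hyperparameters on each outer training split, and `cross_val_score` scores the refitted search on the held-out outer fold. The names `inner_cv` and `outer_cv` and the seeds are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid (an assumption; smaller than the notebook's param_grid)
small_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# Inner CV tunes hyperparameters; outer CV estimates generalization
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

search = GridSearchCV(SVC(), param_grid=small_grid, cv=inner_cv, scoring="accuracy")

# The whole search is cloned and refitted on each outer training split
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(nested_scores)
print(nested_scores.mean(), nested_scores.std())
```

The mean of `nested_scores` estimates how the whole tuning procedure generalizes, not the performance of one fixed hyperparameter setting.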
