# Day 7 - Cross Validation & Model Stability
### Machine Learning Roadmap - Week 2
### Author - N Manish Kumar
---

After training models and applying regularization, the next critical question is:

**How reliable is our model’s performance estimate?**

A single train–test split can be misleading because results may change
depending on how the data is divided.

To solve this, we use **Cross-Validation**, which evaluates the model
across multiple data splits to estimate how well it generalizes.

In this notebook, we will:
- Use k-fold cross-validation
- Compare stability of different models
- Select a model using training data only
- Evaluate final performance on an untouched test set

Dataset used: **Breast Cancer Dataset (sklearn)**

---

## 1. Load Data

In [2]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

print("Feature matrix shape:", X.shape)
print("Target distribution:\n", y.value_counts())

Feature matrix shape: (569, 30)
Target distribution:
 1    357
0    212
Name: count, dtype: int64


---
## 2. Train-Test Split

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (455, 30)
Test set shape: (114, 30)


---
## 3. Feature Scaling

In [4]:
scaler = StandardScaler()

X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

---
## 4. Define Models for Comparison

In [16]:
# Baseline model (almost no regularization)
baseline_model = LogisticRegression(
    solver="saga",
    C=1e6,
    max_iter=10000
)

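# L2 Regularization (Ridge) -> l1_ratio = 0
# Note: on scikit-learn versions that ignore l1_ratio unless penalty="elasticnet",
# also pass penalty="elasticnet" so that l1_ratio takes effect.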
l2_model = LogisticRegression(
    solver="saga",
    l1_ratio=0.0,
    C=1.0,
    max_iter=10000
)

# L1 Regularization (Lasso) -> l1_ratio = 1
l1_model = LogisticRegression(
    solver="saga",
    l1_ratio=1.0,
    C=1.0,
    max_iter=10000
)

---
## 5. k-Fold Cross-Validation

### What is k-Fold Cross-Validation?

In k-fold cross-validation, the training dataset is divided into **k roughly equal parts (folds)**.
Here, the 455 training samples are split into 5 folds of 91 samples each.

For **k = 5**, the process is:

- Use 4 folds for training
- Use the remaining 1 fold for validation
- Repeat this process **5 times**, changing the validation fold each time

So every data point:
- Is used for training **4 times**
- Is used for validation **1 time**

This produces **5 different validation scores**.
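
The steps above can be written out as an explicit loop. This is a minimal sketch of what `cross_val_score` does for a classifier when `cv=5`, assuming stratified folds without shuffling. The model here uses the default L2 penalty with `C=1.0` purely for illustration; it is not one of the three models defined in Section 4.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled training data with the same split settings as this notebook.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_s = StandardScaler().fit_transform(X_train)

# Illustrative model (assumption): default L2 penalty, C=1.0.
model = LogisticRegression(C=1.0, max_iter=10000)

# Stratified folds keep the class ratio similar in every fold.
skf = StratifiedKFold(n_splits=5)

fold_scores = []
val_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_train_s, y_train), start=1):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_train_s[train_idx], y_train[train_idx])
    score = fold_model.score(X_train_s[val_idx], y_train[val_idx])
    fold_scores.append(score)
    val_sizes.append(len(val_idx))
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, accuracy={score:.4f}")

fold_scores = np.array(fold_scores)

# The manual loop should reproduce cross_val_score for the same model and folds.
reference = cross_val_score(model, X_train_s, y_train, cv=5)
print("Matches cross_val_score:", np.allclose(fold_scores, reference))
```

Each sample lands in exactly one validation fold, so the validation fold sizes add up to the size of the training set.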

### Why do we use k-fold Cross-Validation?

A single train–test split can give misleading results depending on how data is split.

Cross-validation:
- Gives a **more reliable estimate** of model performance
- Shows how **stable** the model is across different data splits
- Helps in **model selection** without touching the test set
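
To see how much a single split can move the result, the sketch below repeats a train/validation split inside the training data with ten different seeds. The seeds, the 80/20 ratio, and the default-penalty model are assumptions chosen only for illustration; the hold-out test set is created but never used.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same hold-out split as the notebook; the test part is not used below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

split_scores = []
for seed in range(10):  # ten arbitrary seeds, for illustration only
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=seed, stratify=y_train
    )
    # Scaling and the classifier are fit on the training part of each split only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_val, y_val))

split_scores = np.array(split_scores)
print("Accuracy per split:", np.round(split_scores, 4))
print("Lowest :", split_scores.min())
print("Highest:", split_scores.max())
```

The gap between the lowest and highest value shows how far a single-split estimate can drift purely because of how the data was divided.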

### Important Rule

Cross-validation must be done **only on training data**.  
The test set must remain completely unused until final evaluation.


In [17]:
base_scores = cross_val_score(baseline_model, X_train_s, y_train, cv=5)
l2_scores = cross_val_score(l2_model, X_train_s, y_train, cv=5)
l1_scores = cross_val_score(l1_model, X_train_s, y_train, cv=5)

print("Baseline CV scores:", base_scores)
print("L2 CV scores:", l2_scores)
print("L1 CV scores:", l1_scores)

Baseline CV scores: [0.93406593 0.95604396 0.96703297 1.         0.94505495]
L2 CV scores: [0.96703297 0.97802198 0.96703297 1.         0.98901099]
L1 CV scores: [0.95604396 0.97802198 0.96703297 0.98901099 0.98901099]


### Interpretation

Each value represents the accuracy on one validation fold, so performance changes slightly
depending on how the data is split.

For model comparison we focus on:
- **Mean accuracy** → overall performance
- **Standard deviation** → stability across folds

A good model should have high mean accuracy and low variance.

Models must be selected using cross-validation on training data,
and the test set should be used only once for final evaluation.
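
One caveat: in this notebook the scaler was fit on the whole training set before cross-validation, so each validation fold contributed to the scaling statistics. On this dataset the effect is likely small, but a stricter variant puts the scaler inside a `Pipeline` so that it is refit on the training part of every fold. The sketch below assumes the default L2 penalty with `C=1.0` and is not the notebook's exact model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and classifier are cloned and refit together inside every fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=10000),  # illustrative settings (assumption)
)

# Note that the unscaled X_train goes in; the pipeline handles scaling per fold.
pipe_scores = cross_val_score(pipe, X_train, y_train, cv=5)

print("Pipeline CV scores:", pipe_scores)
print("Mean:", pipe_scores.mean())
print("Std :", pipe_scores.std())
```

The same pipeline object can later be fit on the full training set and scored on the raw `X_test`, which removes the need for separate `X_train_s` and `X_test_s` arrays.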

---
## 6. Mean and Standard Deviation of CV Scores

After performing k-fold cross-validation, we obtain multiple accuracy values
for each model — one from each fold.

To compare models properly, we compute:

- **Mean accuracy** → average performance across folds  
- **Standard deviation (std)** → how stable the model is across different splits

These two values together give a better estimate of how the model will perform
on unseen data than a single train–test split.
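
A detail worth knowing when reading the numbers: NumPy's `.std()` divides by k by default (`ddof=0`), while the sample standard deviation divides by k - 1 (`ddof=1`). With only five folds the difference is noticeable. The sketch below uses the baseline fold accuracies copied from the cross-validation output in this notebook.

```python
import numpy as np

# Baseline fold accuracies copied from the cross-validation output above.
base_scores = np.array([0.93406593, 0.95604396, 0.96703297, 1.0, 0.94505495])

mean_acc = base_scores.mean()
std_pop = base_scores.std()           # ddof=0: divide by k (NumPy's default)
std_sample = base_scores.std(ddof=1)  # ddof=1: divide by k - 1

print(f"Mean accuracy   : {mean_acc:.4f}")
print(f"Std (ddof=0)    : {std_pop:.4f}")
print(f"Std (ddof=1)    : {std_sample:.4f}")
```

Either convention works for comparing models, as long as the same one is used for all of them.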


In [18]:
print("Baseline CV Mean:", base_scores.mean())
print("Baseline CV Std :", base_scores.std())

print("L2 CV Mean:", l2_scores.mean())
print("L2 CV Std :", l2_scores.std())

print("L1 CV Mean:", l1_scores.mean())
print("L1 CV Std :", l1_scores.std())

Baseline CV Mean: 0.9604395604395604
Baseline CV Std : 0.022627758551619782
L2 CV Mean: 0.9802197802197803
L2 CV Std : 0.012815278889769896
L1 CV Mean: 0.9758241758241759
L1 CV Std : 0.01281527888976989


### Interpretation

- Higher **mean accuracy** indicates better overall performance.
- Lower **standard deviation** indicates more stable performance across folds.

If two models have similar mean accuracy, the model with **lower standard deviation**
is usually preferred because it is less sensitive to how the data is split.

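In practice, regularized models often show better stability than a nearly
unregularized baseline. In this run, both regularized models have a standard
deviation of about 0.013, compared with about 0.023 for the baseline. The L2 model
also has the highest mean accuracy (0.980), so it is selected for final evaluation.
With only five folds, the gap between L2 and L1 is small and should be read with caution.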
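
The comparison can also be made programmatically. The sketch below uses a hypothetical selection rule (highest mean first, lowest standard deviation as the tie-breaker) applied to the fold accuracies copied from the cross-validation output in this notebook. Other rules are equally defensible; this one simply mirrors the reasoning in this section.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output in this notebook.
cv_scores = {
    "baseline": np.array([0.93406593, 0.95604396, 0.96703297, 1.0, 0.94505495]),
    "l2": np.array([0.96703297, 0.97802198, 0.96703297, 1.0, 0.98901099]),
    "l1": np.array([0.95604396, 0.97802198, 0.96703297, 0.98901099, 0.98901099]),
}

# Hypothetical rule: sort by mean (descending), then by std (ascending).
ranked = sorted(
    cv_scores.items(),
    key=lambda item: (-item[1].mean(), item[1].std()),
)

for name, scores in ranked:
    print(f"{name:<8} mean={scores.mean():.4f} std={scores.std():.4f}")

best_name = ranked[0][0]
print("Selected:", best_name)
```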

---
## 7. Final Model Evaluation on Test Set

After selecting the best model using cross-validation on training data,
we now evaluate it on the untouched test set.

This test set has not been used in:
- Training
- Cross-validation
- Model selection

Therefore, the test accuracy estimates how well the model is expected
to perform on new, unseen data, provided that data resembles this dataset.

In [20]:
# Train best model on full training data
l2_model.fit(X_train_s, y_train)

# Evaluate on final test set
test_accuracy = l2_model.score(X_test_s, y_test)

print("Final Test Accuracy (l2):", test_accuracy)

Final Test Accuracy (l2): 0.9824561403508771


### Interpretation

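The test accuracy is an unbiased estimate of performance on new data
because the test set was never seen during training, cross-validation, or model selection.
It is still based on only 114 samples, so a single misclassification shifts it
by about 0.9 percentage points.

If the test accuracy is close to the cross-validation mean accuracy,
this suggests that the model generalizes well and is not badly overfitting.
Here, the test accuracy (0.982) is close to the L2 cross-validation mean (0.980).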
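
To put "close" into numbers, the sketch below compares the test accuracy with the L2 cross-validation mean and estimates the sampling noise of the test accuracy using the binomial standard error, sqrt(p(1 - p) / n). All values are copied from the outputs printed in this notebook; the binomial approximation is an assumption that treats test predictions as independent trials.

```python
import math

n_test = 114                         # test set size from the train-test split
test_accuracy = 0.9824561403508771   # final test accuracy printed above
cv_mean = 0.9802197802197803         # L2 cross-validation mean
cv_std = 0.012815278889769896        # L2 cross-validation std

# Number of correct test predictions implied by the accuracy.
n_correct = round(test_accuracy * n_test)

# Binomial standard error of an accuracy estimated from n_test samples.
std_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

gap = abs(test_accuracy - cv_mean)

print(f"Correct predictions: {n_correct} of {n_test}")
print(f"Standard error of test accuracy: {std_error:.4f}")
print(f"Gap between test accuracy and CV mean: {gap:.4f} (CV std: {cv_std:.4f})")
```

The gap is smaller than both the fold-to-fold standard deviation and the standard error of the test accuracy, so the two estimates agree within their noise.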

This completes the standard ML workflow:

Train → Cross-Validate → Select Model → Final Test Evaluation.

---
# Notebook Summary — Week 2 Day 7

In this notebook, we practiced the correct ML evaluation workflow:

### What was done
- Loaded and inspected the Breast Cancer dataset
- Created a hold-out test set for final evaluation
- Applied feature scaling using StandardScaler
- Compared multiple Logistic Regression models using 5-fold cross-validation
- Selected the best model based on stability and mean accuracy
- Evaluated final performance on the untouched test set

### Key Learnings
- A single train–test split can be unreliable
- Cross-validation gives a better estimate of generalization performance
- Model selection must be done using only training data
- The test set should be used only once, after model selection

### Final Outcome
The selected model showed consistent cross-validation performance and
achieved similar accuracy on the test set, indicating good generalization.