# Day 3 — Model Comparison Using Cross-Validation
### Machine Learning Roadmap — Week 3
### Author — N Manish Kumar
---

In previous notebooks, we evaluated individual models and tuned
hyperparameters using cross-validation.

However, in real ML problems, we often need to **compare different models**
and choose the best one based on reliable evaluation metrics.

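Comparing many models by their test accuracy and keeping the winner tunes the
choice to the test set. The winning score becomes optimistically biased and is
no longer a trustworthy estimate of performance on new data.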
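
To see why, consider a minimal simulation (an illustration added here, not part of the original experiments). The number of models and the true accuracy are made-up values. Every hypothetical model is equally good, yet picking the maximum of several noisy test scores tends to report a number above the true accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 20         # hypothetical number of candidate models
n_test = 114          # size of the hold-out test set in this notebook
true_accuracy = 0.95  # assumed: every model has the same true accuracy

# Number of correct test predictions per model (binomial sampling noise only)
correct = rng.binomial(n=n_test, p=true_accuracy, size=n_models)
test_accuracies = correct / n_test

print("True accuracy of every model:", true_accuracy)
print("Mean observed test accuracy: ", test_accuracies.mean())
print("Best observed test accuracy: ", test_accuracies.max())
```

The gap between the best observed score and the true accuracy comes from selection alone, which is why the test set is reserved for a single final evaluation.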

Instead, we compare models using **cross-validation on training data** and
consider both:
- Mean accuracy
- Stability (standard deviation across folds)

In this notebook, we will:
- Compare multiple models using 5-fold cross-validation
- Analyze mean and standard deviation of accuracy
- Select the best model based on reliability, not just peak accuracy

Dataset used: **Breast Cancer Dataset (sklearn)**

---

## 1. Import Libraries, Load Dataset, and Create Train/Test Split

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Train-test split (hold-out test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (455, 30)
Test set shape: (114, 30)
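
As a quick sanity check on `stratify=y`, the sketch below (an addition that reloads the data so it runs on its own) compares class proportions in the full dataset and in each split. With stratification, the proportions should match closely.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of each class (0 = malignant, 1 = benign) in each subset
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).sort_index()

print(proportions)
```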


---
## 3. Define Models for Comparison

To compare models fairly, all preprocessing must be applied consistently.

We will compare the following models:

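1. Logistic Regression with `C=1` (the default regularization strength)
2. Logistic Regression with `C=0.1` (stronger regularization, since `C` is the inverse of the regularization strength)
3. Decision Tree with `max_depth=5` (a simple non-linear model)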

For Logistic Regression, we use Pipelines to ensure that feature scaling
is applied correctly inside each cross-validation fold.

In [2]:
models = {
    "Logistic Regression (C=1)": Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=1, solver="saga", max_iter=10000))
    ]),
    
    "Logistic Regression (C=0.1)": Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=0.1, solver="saga", max_iter=10000))
    ]),
    
    "Decision Tree (max_depth=5)": DecisionTreeClassifier(max_depth=5, random_state=42)
}

### Interpretation

Using multiple models allows us to compare both linear and non-linear
approaches on the same dataset.

Pipelines are used for Logistic Regression to ensure that scaling does not
leak information across cross-validation folds.

A Decision Tree splits on per-feature thresholds and is unaffected by feature
scaling, so it is used directly without a Pipeline.
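
To make the leakage point concrete, here is a rough sketch of what cross-validating a Pipeline amounts to: in each fold, the scaler and the classifier are fit on the training portion only and then scored on the held-out portion. The sketch uses the default `lbfgs` solver instead of `saga` so that the two runs are deterministic and directly comparable; that swap is a simplification for this illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    # Assumption for this sketch: default lbfgs solver for deterministic results
    ("model", LogisticRegression(C=1, max_iter=1000)),
])

# For a classifier, cv=5 resolves to an unshuffled 5-fold stratified split
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X_train, y_train):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for each fold
    # Scaler statistics are computed from this fold's training portion only
    fold_pipe.fit(X_train[train_idx], y_train[train_idx])
    manual_scores.append(fold_pipe.score(X_train[val_idx], y_train[val_idx]))

auto_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="accuracy")

print("Manual loop:    ", np.round(manual_scores, 4))
print("cross_val_score:", np.round(auto_scores, 4))
```

Scaling the whole training set once before cross-validation would instead let each validation fold influence the scaler's mean and variance.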

---

## 4. Perform Cross-Validation for Each Model

To compare models fairly, we evaluate each model using the same
cross-validation strategy on the training data.

For each model, we compute:
- Mean cross-validation accuracy
- Standard deviation of accuracy

The standard deviation indicates how stable the model is across
different data splits.
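
As a small illustration with made-up fold scores (not results from this notebook), two models can share the same mean accuracy while differing in spread. Note that NumPy's `std()` defaults to the population formula (`ddof=0`); with only five folds, the sample version (`ddof=1`) is noticeably larger.

```python
import numpy as np

# Hypothetical 5-fold accuracies for two models with the same mean
stable_scores = np.array([0.96, 0.97, 0.96, 0.97, 0.96])
unstable_scores = np.array([0.99, 0.92, 0.98, 0.93, 1.00])

for name, scores in [("stable", stable_scores), ("unstable", unstable_scores)]:
    print(
        f"{name:>8}: mean={scores.mean():.4f}  "
        f"std(ddof=0)={scores.std():.4f}  "
        f"std(ddof=1)={scores.std(ddof=1):.4f}"
    )
```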


In [9]:
cv_results = {}

for name, model in models.items():
    # 5-fold CV on the training data only; the test set stays untouched
    scores = cross_val_score(
        model,
        X_train,
        y_train,
        cv=5,
        scoring="accuracy",
        n_jobs=-1
    )

    cv_results[name] = {
        "mean_accuracy": scores.mean(),
        "std_accuracy": scores.std()  # population std (ddof=0) across the 5 folds
    }

cv_results

{'Logistic Regression (C=1)': {'mean_accuracy': np.float64(0.9802197802197803),
  'std_accuracy': np.float64(0.012815278889769896)},
 'Logistic Regression (C=0.1)': {'mean_accuracy': np.float64(0.9802197802197803),
  'std_accuracy': np.float64(0.016150481820548422)},
 'Decision Tree (max_depth=5)': {'mean_accuracy': np.float64(0.9318681318681318),
  'std_accuracy': np.float64(0.021308482889742106)}}

### Interpretation

Each model has now been evaluated using 5-fold cross-validation.

Mean accuracy indicates overall performance, while standard deviation
indicates how sensitive the model is to different data splits.

In the next step, we will organize these results into a table to make
model comparison easier and more interpretable.

---
## 5. Create Comparison Table

To compare models clearly, we convert the cross-validation results into
a structured table showing:

- Model name
- Mean cross-validation accuracy
- Standard deviation of accuracy

This makes it easier to select the best model based on both performance
and stability.



In [10]:
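results_df = pd.DataFrame(cv_results).T
# Rank by mean accuracy (highest first); break ties with the lower standard deviation
results_df = results_df.sort_values(
    by=["mean_accuracy", "std_accuracy"],
    ascending=[False, True]
)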

results_df

| Model | mean_accuracy | std_accuracy |
|---|---|---|
| Logistic Regression (C=1) | 0.98022 | 0.012815 |
| Logistic Regression (C=0.1) | 0.98022 | 0.01615 |
| Decision Tree (max_depth=5) | 0.931868 | 0.021308 |


### Interpretation

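The table shows the average validation accuracy of each model and how stable
it is across folds. Both Logistic Regression variants reach the same mean
accuracy (0.9802), but `C=1` has the lower standard deviation (0.0128 vs.
0.0162). The Decision Tree has a lower mean (0.9319) and the largest spread
(0.0213).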

A good model should have:
- High mean accuracy
- Low standard deviation

Model selection should be based on these statistics, not on performance on a
single test set.
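
One rough way to read the table is to ask whether the gap between two models is large relative to their fold-to-fold spread. The sketch below plugs in the values reported above. Comparing the gap against the sum of the two standard deviations is an informal heuristic chosen for illustration, not a formal statistical test.

```python
import pandas as pd

# Values copied from the comparison table above
results = pd.DataFrame(
    {
        "mean_accuracy": [0.980220, 0.980220, 0.931868],
        "std_accuracy": [0.012815, 0.016150, 0.021308],
    },
    index=[
        "Logistic Regression (C=1)",
        "Logistic Regression (C=0.1)",
        "Decision Tree (max_depth=5)",
    ],
)

best = results.iloc[0]
verdicts = {}
for name, row in results.iloc[1:].iterrows():
    gap = best["mean_accuracy"] - row["mean_accuracy"]
    spread = best["std_accuracy"] + row["std_accuracy"]
    verdicts[name] = "larger than" if gap > spread else "within"
    print(f"{name}: gap={gap:.4f} is {verdicts[name]} the combined spread ({spread:.4f})")
```

Read this way, the two Logistic Regression variants are indistinguishable on mean accuracy, so the lower standard deviation of the `C=1` variant acts as the tie-breaker, while the Decision Tree trails by more than the combined spread.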

---
## 6. Final Test Set Evaluation of Best Model

After selecting the best model using cross-validation on training data,
we now evaluate that model on the untouched test set.

Because the test set played no part in model selection, this gives an unbiased
estimate of how the selected model will perform on new, unseen data.



In [12]:
# Select best model based on CV results
best_model_name = results_df.index[0]
best_model = models[best_model_name]

# Train best model on full training data
best_model.fit(X_train, y_train)

# Evaluate on test set
test_accuracy = best_model.score(X_test, y_test)

print("Best Model:", best_model_name)
print("Final Test Accuracy:", test_accuracy)

Best Model: Logistic Regression (C=1)
Final Test Accuracy: 0.9824561403508771


### Interpretation

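The test accuracy (0.9825) is close to the cross-validation mean accuracy
(0.9802), well within one standard deviation across folds (0.0128). This is
consistent with the model generalizing well rather than overfitting the
training data.

A single test set of 114 samples is itself a noisy estimate, so this agreement
supports, but does not prove, that cross-validation led to a reliable choice
for real-world performance.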
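
A quick back-of-the-envelope check, using the numbers reported above, shows how close the two estimates are. The binomial standard error is an approximation added here to indicate how much an accuracy measured on 114 samples can vary by chance.

```python
import math

cv_mean = 0.9802197802197803    # CV mean accuracy of the selected model
cv_std = 0.012815278889769896   # CV standard deviation across folds
test_accuracy = 0.9824561403508771
n_test = 114

difference = test_accuracy - cv_mean

# Approximate standard error of an accuracy measured on n_test samples
test_std_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

print(f"Test - CV mean:           {difference:+.4f}")
print(f"CV std across folds:      {cv_std:.4f}")
print(f"Test accuracy std. error: {test_std_error:.4f}")
print("Within one CV std:", abs(difference) <= cv_std)
```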

---
# Notebook Summary — Week 3 Day 3

In this notebook, we learned how to compare multiple machine learning models
fairly using cross-validation instead of relying on a single test split.

### What was done
- Loaded and split the Breast Cancer dataset into training and test sets
- Defined multiple models including Logistic Regression and Decision Tree
- Evaluated each model using 5-fold cross-validation on training data
- Computed mean and standard deviation of accuracy for each model
- Created a comparison table to rank models based on performance and stability
- Selected the best model using cross-validation results
- Evaluated the selected model on the untouched test set

### Key Learnings
- The test set should not be used to compare multiple models
- Cross-validation provides a more reliable estimate of model performance than a single split
- Mean accuracy reflects overall performance, while standard deviation reflects stability
- Model selection should balance both accuracy and consistency across folds

### Final Outcome
Using cross-validation, the best-performing and most stable model was selected,
and its test set performance was consistent with good generalization to unseen data.
