Model Setup and Imports

In [11]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


We load the Breast Cancer Wisconsin (Diagnostic) dataset from scikit-learn.
The target variable indicates whether a tumor is malignant (0) or benign (1).
The remaining 30 columns are used as numerical features.

In [5]:
data = load_breast_cancer(as_frame=True)
df = data.frame

X = df.drop("target", axis=1)
y = df["target"]
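Before modeling, it can help to confirm the label encoding and the class balance. The following is a minimal sketch, not part of the original notebook; the names `label_names` and `class_counts` are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
y = data.frame["target"]

# Map integer labels to their names as stored in the dataset object
# (index 0 -> "malignant", index 1 -> "benign").
label_names = {i: str(name) for i, name in enumerate(data.target_names)}

# Count samples per class to see how imbalanced the target is.
class_counts = y.map(label_names).value_counts()
print(class_counts)
print(f"Samples: {len(y)}, features: {data.frame.shape[1] - 1}")
```

The moderate imbalance between the two classes is one reason to prefer stratified splits and to report F1 alongside accuracy.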

To obtain a more reliable estimate of model performance than a single split would give, we use stratified 5-fold cross-validation.
Stratification keeps the class distribution in each fold approximately equal to that of the full dataset.

In [2]:
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)
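To see what stratification does in practice, the sketch below compares the share of class 1 in each test fold with the share in the full dataset. It reloads the data so that it runs on its own; `fold_rates` and `overall_rate` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

overall_rate = y.mean()  # fraction of samples labelled 1 in the full dataset
fold_rates = []

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    # Fraction of class 1 among the samples held out in this fold
    rate = y.iloc[test_idx].mean()
    fold_rates.append(rate)
    print(f"Fold {fold}: n_test={len(test_idx)}, class-1 rate={rate:.3f}")

print(f"Overall class-1 rate: {overall_rate:.3f}")
```

Each fold's rate should sit very close to the overall rate; with a plain `KFold` the per-fold rates would typically vary more.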


We compare three different models:
1. Logistic Regression as a linear baseline model.
2. Random Forest as a tree-based ensemble model.
3. Gradient Boosting as a boosted ensemble method.

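For Logistic Regression, the features are standardized inside a Pipeline because the regularized solver is sensitive to feature magnitude.
Placing the scaler in the Pipeline also means it is fit only on the training portion of each fold.
Tree-based models split on thresholds of individual features and therefore do not require feature scaling.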

In [3]:
models = {
    "Logistic Regression (scaled)": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(
            max_iter=5000,
            random_state=42
        ))
    ]),

    "Random Forest": RandomForestClassifier(
        n_estimators=300,
        random_state=42
    ),

    "Gradient Boosting": GradientBoostingClassifier(
        random_state=42
    )
}
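The motivation for the scaling step in the Logistic Regression pipeline can be illustrated by looking at how different the raw feature scales are. This is a small sketch under the assumption that the per-feature standard deviation is a reasonable summary of scale; `raw_std` and `scaled_std` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Standard deviation of each raw feature; show the smallest and largest.
raw_std = X.std()
print(raw_std.sort_values().iloc[[0, -1]])

# After standardization every feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_std = X_scaled.std(axis=0)
print("Std range after scaling:", scaled_std.min(), scaled_std.max())
```

Note that this sketch fits the scaler on the full dataset purely for inspection. For model evaluation, the scaler should stay inside the Pipeline so that it is fit on training folds only.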


Each model is evaluated using the same stratified 5-fold cross-validation splits.
We report the mean accuracy and the mean F1 score across folds; with scoring="f1", the F1 score is computed for the positive class (label 1, benign).

In [6]:
results = []

for name, model in models.items():
    scores = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["accuracy", "f1"],
        return_train_score=False
    )

    results.append({
        "Model": name,
        "CV Accuracy (mean)": scores["test_accuracy"].mean(),
        "CV F1 (mean)": scores["test_f1"].mean()
    })

results_df = pd.DataFrame(results).sort_values(
    "CV F1 (mean)", ascending=False
)

results_df


                          Model  CV Accuracy (mean)  CV F1 (mean)
0  Logistic Regression (scaled)            0.973669      0.979434
1                 Random Forest            0.952569      0.962421
2             Gradient Boosting            0.949076      0.960242
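Mean scores alone do not show how much the results vary from fold to fold. As a possible extension (not part of the original analysis), the sketch below reports the standard deviation next to the mean for the Logistic Regression pipeline, using the same cross-validation settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000, random_state=42)),
])

scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])

# Summarize each metric as mean +/- standard deviation over the 5 folds.
for metric in ("accuracy", "f1"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

If the gap between two models is smaller than the fold-to-fold spread, the ranking in the table should be read with some caution.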


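Based on the cross-validation results, Logistic Regression achieved the highest mean accuracy and F1 score.
We therefore refit this model on a stratified 80% training split and evaluate it on the remaining 20% held-out test set.
Because the cross-validation above used the full dataset, including these test samples, the test score serves as a sanity check rather than a strictly unbiased estimate of final model performance.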

In [12]:
# 1. Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# 2. Define the best model (based on CV results)
best_model = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(
        max_iter=5000,
        random_state=42
    ))
])

# 3. Fit the model
best_model.fit(X_train, y_train)

# 4. Predict on test set
y_pred = best_model.predict(X_test)

# 5. Evaluation
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Test Accuracy: 0.9824561403508771

Confusion Matrix:
[[41  1]
 [ 1 71]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
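The figures in the classification report can be recovered by hand from the confusion matrix printed above. The sketch below hard-codes that matrix and applies the standard definitions; rows are true labels and columns are predicted labels, which is the convention used by `confusion_matrix`.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[41, 1],
               [1, 71]])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp with class 1 as positive.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # 112 / 114
precision_1 = tp / (tp + fp)             # 71 / 72
recall_1 = tp / (tp + fn)                # 71 / 72
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f}  precision={precision_1:.4f}  "
      f"recall={recall_1:.4f}  f1={f1_1:.4f}")
```

In this split the model misclassifies two of the 114 test samples: one malignant tumor predicted as benign and one benign tumor predicted as malignant.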

