# Notebook 03: Preprocessing, Feature Engineering, Feature Selection

Goal: create a clear feature workflow on the Breast Cancer dataset and measure its impact on classical machine learning models.

We cover three topics:
- Preprocessing with standardization inside a pipeline to avoid data leakage
- Lightweight feature engineering using a small set of simple ratio and interaction features
- Feature selection using two approaches
  - L1 logistic regression for sparse feature selection
  - RFE as an alternative selection method

We compare model performance on:
- Full feature set
- Feature engineered feature set
- Selected feature set

Key outputs:
- Cross validation tables for each feature set
- A comparison table that quantifies the change between full and selected features
- Saved tables and selected feature lists in the outputs folder



## Setup and data split

Load the dataset, define the fixed train and test split, and create output folders.
All artifacts are saved at the project root so that notebooks stay clean.



In [1]:
# Import required libraries.
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from pathlib import Path

ROOT_DIR = Path("..")
FIG_DIR = ROOT_DIR / "figures"
OUT_DIR = ROOT_DIR / "outputs"
FIG_DIR.mkdir(parents=True, exist_ok=True)
OUT_DIR.mkdir(parents=True, exist_ok=True)

SEED = 42
np.random.seed(SEED)

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

scoring = {"acc": "accuracy", "f1": "f1", "roc_auc": "roc_auc"}

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000, random_state=SEED),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=SEED),
    "RandomForest": RandomForestClassifier(random_state=SEED),
    "SVM(RBF)": SVC(probability=True, random_state=SEED),
}
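
As a quick sanity check on the fixed split, the sketch below reloads the dataset and compares the class balance of the full data, the training split, and the test split. It assumes the same `test_size=0.2`, seed 42, and `stratify=y` as the cell above; the `balance` dictionary is only an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of label 1 (benign) in each partition.
# With stratify=y these fractions should be nearly identical.
balance = {
    "all": y.mean(),
    "train": y_train.mean(),
    "test": y_test.mean(),
}
for name, rate in balance.items():
    print(f"{name:>5}: n={len(y) if name == 'all' else len(y_train) if name == 'train' else len(y_test)}, positive rate={rate:.3f}")
```

If the rates differed noticeably, per-fold F1 scores would be harder to compare with the final test score.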

## Baseline evaluation on the full feature set

Evaluate the required models using cross validation on the training split.
This establishes a reference before feature engineering or feature selection.



In [2]:
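# Preprocessing recap: the scaler sits inside the pipeline, so each CV fold
# fits it on its own training part only (no leakage from validation rows).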
def cv_table(Xtr, ytr, tag="full"):
    rows = []
    for name, model in models.items():
        pipe = Pipeline([
            ("scaler", StandardScaler()),
            ("model", model),
        ])
        out = cross_validate(pipe, Xtr, ytr, cv=cv, scoring=scoring, return_train_score=False)
        rows.append({
            "feature_set": tag,
            "model": name,
            "cv_f1_mean": np.mean(out["test_f1"]),
            "cv_f1_std": np.std(out["test_f1"]),
            "cv_auc_mean": np.mean(out["test_roc_auc"]),
            "cv_auc_std": np.std(out["test_roc_auc"]),
            "cv_acc_mean": np.mean(out["test_acc"]),
            "cv_acc_std": np.std(out["test_acc"]),
        })
    return pd.DataFrame(rows).sort_values(["cv_f1_mean"], ascending=False)

df_full = cv_table(X_train, y_train, tag="full")
display(df_full)

df_full.to_csv(OUT_DIR / "03_cv_full_features.csv", index=False)
print("Saved:", OUT_DIR / "03_cv_full_features.csv")


Unnamed: 0,feature_set,model,cv_f1_mean,cv_f1_std,cv_auc_mean,cv_auc_std,cv_acc_mean,cv_acc_std
0,full,LogisticRegression,0.982544,0.007752,0.995872,0.00496,0.978022,0.009829
4,full,SVM(RBF),0.975615,0.011252,0.995562,0.004758,0.969231,0.014579
1,full,KNN,0.970489,0.009111,0.988235,0.00815,0.962637,0.011207
3,full,RandomForest,0.969935,0.014605,0.989577,0.008257,0.962637,0.017855
2,full,DecisionTree,0.931901,0.014435,0.917905,0.020094,0.916484,0.017855


Saved: ../outputs/03_cv_full_features.csv
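
To make the leakage point concrete, here is a minimal sketch that contrasts scaling inside the pipeline with scaling the whole matrix before cross validation. KNN is used only as an example of a scale-sensitive model, and `scores_pipe` and `scores_leaky` are illustrative names. For plain standardization the gap may be very small on this dataset; the pattern matters more for steps that learn from the target, such as feature selection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Leak-free: the scaler is refit on the training part of every fold.
pipe = Pipeline([("scaler", StandardScaler()), ("model", KNeighborsClassifier())])
scores_pipe = cross_val_score(pipe, X, y, cv=cv, scoring="f1")

# Leaky: the scaler sees every row, including rows later used for validation.
X_prescaled = StandardScaler().fit_transform(X)
scores_leaky = cross_val_score(KNeighborsClassifier(), X_prescaled, y, cv=cv, scoring="f1")

print(f"scaler inside pipeline: F1 = {scores_pipe.mean():.4f}")
print(f"scaled before CV:       F1 = {scores_leaky.mean():.4f}")
```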


## Lightweight feature engineering

The dataset already contains engineered morphological measurements.
Here we add a small set of simple ratios and interactions to test whether they bring marginal gains.
We keep feature names explicit for interpretability.



In [3]:
# Feature engineering note:
# The dataset already contains engineered morphological features (mean/worst/error).
# We add a small set of simple ratio/interaction features to test whether they bring marginal gains.

X_fe = X.copy()
eps = 1e-6

# Keep names explicit for interpretability
X_fe["perimeter_mean_over_radius_mean"] = X_fe["mean perimeter"] / (X_fe["mean radius"] + eps)
X_fe["area_mean_over_radius2"] = X_fe["mean area"] / (X_fe["mean radius"]**2 + eps)
X_fe["compactness_over_smoothness"] = X_fe["mean compactness"] / (X_fe["mean smoothness"] + eps)
X_fe["texture_x_smoothness"] = X_fe["mean texture"] * X_fe["mean smoothness"]

# Split FE version using same indices as original split
# (We re-split to keep it simple and deterministic; same SEED -> same split)
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(
    X_fe, y, test_size=0.2, random_state=SEED, stratify=y
)

df_fe = cv_table(X_train_fe, y_train_fe, tag="feature_engineered")
display(df_fe)

df_fe.to_csv(OUT_DIR / "03_cv_feature_engineered.csv", index=False)
print("Saved:", OUT_DIR / "03_cv_feature_engineered.csv")


Unnamed: 0,feature_set,model,cv_f1_mean,cv_f1_std,cv_auc_mean,cv_auc_std,cv_acc_mean,cv_acc_std
0,feature_engineered,LogisticRegression,0.984286,0.003511,0.996698,0.004312,0.98022,0.004396
4,feature_engineered,SVM(RBF),0.977416,0.01148,0.996182,0.002344,0.971429,0.014906
3,feature_engineered,RandomForest,0.971862,0.010465,0.990764,0.008231,0.964835,0.012815
1,feature_engineered,KNN,0.969106,0.009883,0.990248,0.007961,0.96044,0.013187
2,feature_engineered,DecisionTree,0.950495,0.008166,0.937822,0.020035,0.938462,0.011207


Saved: ../outputs/03_cv_feature_engineered.csv
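
The re-split above relies on the same seed producing the same rows. The sketch below checks that assumption and shows an alternative that reuses the original indices directly. Only one engineered column is rebuilt here for brevity, and the `*_alt` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# One illustrative engineered column; the notebook adds four.
X_fe = X.copy()
X_fe["perimeter_mean_over_radius_mean"] = X_fe["mean perimeter"] / (X_fe["mean radius"] + 1e-6)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_fe, X_test_fe, _, _ = train_test_split(
    X_fe, y, test_size=0.2, random_state=42, stratify=y
)

# Same seed, same y, same row count: the row indices should be identical.
same_rows = X_train_fe.index.equals(X_train.index) and X_test_fe.index.equals(X_test.index)
print("Re-split matches original split:", same_rows)

# Alternative that cannot drift if the split call changes later:
# select the engineered rows by the original split's indices.
X_train_fe_alt = X_fe.loc[X_train.index]
X_test_fe_alt = X_fe.loc[X_test.index]
```

Indexing with `X_train.index` keeps the two feature sets aligned even if someone later edits one of the `train_test_split` calls.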


## Feature selection with L1 logistic regression

Fit an L1-regularized logistic regression and tune C by cross validation.
Features with nonzero coefficients form the selected subset.



In [4]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L1-based feature selection
l1_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        penalty="l1",
        solver="liblinear",
        max_iter=5000,
        random_state=SEED
    ))
])

# Wider C range to avoid selecting 0 features
param_grid = {"model__C": np.logspace(-4, 4, 25)}

search = GridSearchCV(
    l1_pipe,
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
    n_jobs=-1
)
search.fit(X_train, y_train)

best_l1 = search.best_estimator_
print("Best L1 params:", search.best_params_)
print("Best L1 CV F1:", search.best_score_)

coef = best_l1.named_steps["model"].coef_.ravel()
selected_mask = (coef != 0)

selected_features = X_train.columns[selected_mask].tolist()
print(f"Selected features: {len(selected_features)} / {X_train.shape[1]}")
print("Selected features (first 15):", selected_features[:15])

import json
with open(OUT_DIR / "03_selected_features_l1.json", "w") as f:
    json.dump(selected_features, f, indent=2)
print("Saved:", OUT_DIR / "03_selected_features_l1.json")




Best L1 params: {'model__C': np.float64(0.21544346900318823)}
Best L1 CV F1: 0.9804951862617767
Selected features: 10 / 30
Selected features (first 15): ['mean texture', 'mean concave points', 'radius error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst area', 'worst smoothness', 'worst concave points', 'worst symmetry']
Saved: ../outputs/03_selected_features_l1.json
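
The size of the selected subset depends on C. As a rough illustration, the sketch below refits the same L1 model for a handful of C values and counts the nonzero coefficients. The C values are an arbitrary assumption, not the grid used above, and the scaler is fitted once on the training split for simplicity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_scaled = StandardScaler().fit_transform(X_train)

# Smaller C means stronger L1 regularization, hence fewer nonzero coefficients.
c_values = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
n_selected = []
for c in c_values:
    model = LogisticRegression(
        penalty="l1", solver="liblinear", C=c, max_iter=5000, random_state=42
    )
    model.fit(X_scaled, y_train)
    n_selected.append(int(np.count_nonzero(model.coef_)))

for c, n in zip(c_values, n_selected):
    print(f"C={c:<7g} nonzero coefficients: {n}")
```

A table like this helps judge whether the cross-validated C sits in a stable region or near a point where the subset changes quickly.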




## Evaluation on the selected feature set

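Rerun the same model suite using only the selected features.
This isolates the impact of feature selection. Note that the features were chosen using the whole training split, so these cross validation scores can be slightly optimistic.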



In [5]:
# Restrict the train and test splits to the L1-selected features, then rerun the CV suite.
X_train_sel = X_train[selected_features]
X_test_sel = X_test[selected_features]

df_sel = cv_table(X_train_sel, y_train, tag="selected_l1")
display(df_sel)

df_sel.to_csv(OUT_DIR / "03_cv_selected_l1.csv", index=False)
print("Saved:", OUT_DIR / "03_cv_selected_l1.csv")


Unnamed: 0,feature_set,model,cv_f1_mean,cv_f1_std,cv_auc_mean,cv_auc_std,cv_acc_mean,cv_acc_std
4,selected_l1,SVM(RBF),0.980836,0.00849,0.996698,0.002389,0.975824,0.010767
0,selected_l1,LogisticRegression,0.980805,0.006501,0.996078,0.003509,0.975824,0.008223
3,selected_l1,RandomForest,0.977295,0.01201,0.993395,0.004518,0.971429,0.014906
1,selected_l1,KNN,0.97226,0.006336,0.993189,0.006352,0.964835,0.008223
2,selected_l1,DecisionTree,0.953848,0.015489,0.942518,0.019873,0.942857,0.018906


Saved: ../outputs/03_cv_selected_l1.csv
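
One way to keep selection inside each fold is to put it in the pipeline with `SelectFromModel`. This is a sketch of that variant, not the method used in this notebook: C is fixed at 0.215 (close to the grid search result above) as a simplifying assumption, and only logistic regression is evaluated downstream.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Assumption: C is fixed instead of tuned, to keep the sketch short.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.215, max_iter=5000, random_state=42)
)
nested_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", selector),  # refit on the training part of every fold
    ("model", LogisticRegression(max_iter=5000, random_state=42)),
])

out = cross_validate(
    nested_pipe, X_train, y_train, cv=cv, scoring="f1", return_estimator=True
)
n_kept = [int(est.named_steps["select"].get_support().sum()) for est in out["estimator"]]
print("F1 per fold:", out["test_score"].round(4))
print("features kept per fold:", n_kept)
```

The number of kept features can vary from fold to fold, which is itself a useful signal of how stable the selection is.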


## Full versus selected comparison

Join the full and selected cross validation tables by model name.
Compute deltas so that the impact is easy to report.



In [6]:
# Merge by model to compare full vs selected_l1
df_compare = (
    df_full[["model", "cv_f1_mean", "cv_auc_mean", "cv_acc_mean"]]
    .merge(
        df_sel[["model", "cv_f1_mean", "cv_auc_mean", "cv_acc_mean"]],
        on="model",
        suffixes=("_full", "_selected")
    )
)

# Add deltas (selected - full)
df_compare["delta_f1"] = df_compare["cv_f1_mean_selected"] - df_compare["cv_f1_mean_full"]
df_compare["delta_auc"] = df_compare["cv_auc_mean_selected"] - df_compare["cv_auc_mean_full"]
df_compare["delta_acc"] = df_compare["cv_acc_mean_selected"] - df_compare["cv_acc_mean_full"]

display(df_compare.sort_values("cv_f1_mean_full", ascending=False))

df_compare.to_csv(OUT_DIR / "03_full_vs_selected_comparison.csv", index=False)
print("Saved:", OUT_DIR / "03_full_vs_selected_comparison.csv")


Unnamed: 0,model,cv_f1_mean_full,cv_auc_mean_full,cv_acc_mean_full,cv_f1_mean_selected,cv_auc_mean_selected,cv_acc_mean_selected,delta_f1,delta_auc,delta_acc
0,LogisticRegression,0.982544,0.995872,0.978022,0.980805,0.996078,0.975824,-0.001739,0.000206,-0.002198
1,SVM(RBF),0.975615,0.995562,0.969231,0.980836,0.996698,0.975824,0.00522,0.001135,0.006593
2,KNN,0.970489,0.988235,0.962637,0.97226,0.993189,0.964835,0.001771,0.004954,0.002198
3,RandomForest,0.969935,0.989577,0.962637,0.977295,0.993395,0.971429,0.00736,0.003818,0.008791
4,DecisionTree,0.931901,0.917905,0.916484,0.953848,0.942518,0.942857,0.021947,0.024613,0.026374


Saved: ../outputs/03_full_vs_selected_comparison.csv
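
Most deltas in the table are of the same order as the fold-to-fold standard deviations reported earlier, so a per-fold paired view can help. The sketch below computes fold-wise F1 differences for the decision tree, using the ten L1-selected feature names printed above. It assumes the same split and CV settings as the notebook; `diff` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Feature list copied from the L1 selection output in this notebook.
selected = [
    "mean texture", "mean concave points", "radius error", "fractal dimension error",
    "worst radius", "worst texture", "worst area", "worst smoothness",
    "worst concave points", "worst symmetry",
]

pipe = Pipeline([("scaler", StandardScaler()), ("model", DecisionTreeClassifier(random_state=42))])

# The seeded splitter yields the same folds in both calls, so scores are paired by fold.
f1_full = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="f1")
f1_sel = cross_val_score(pipe, X_train[selected], y_train, cv=cv, scoring="f1")

diff = f1_sel - f1_full
print("per-fold delta F1:", diff.round(4))
print(f"mean delta: {diff.mean():.4f}, std of delta: {diff.std():.4f}")
```

With only five folds this is a descriptive check rather than a significance test.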


## Alternative selection with RFE

Run recursive feature elimination to obtain a second set of selected features.
This provides an alternative view of feature importance.



In [7]:
from sklearn.feature_selection import RFE

# RFE with LogisticRegression (L2) as the estimator.
# RFE reads coef_ from the estimator it is given, and a Pipeline does not expose coef_ by default,
# so we take the simpler route: scale the training data first, then run RFE on the scaled matrix.
scaler = StandardScaler()
Xtr_scaled = scaler.fit_transform(X_train)

lr = LogisticRegression(max_iter=5000, random_state=SEED)
rfe = RFE(estimator=lr, n_features_to_select=10)
rfe.fit(Xtr_scaled, y_train)

rfe_features = X_train.columns[rfe.support_].tolist()
print("RFE selected features:", rfe_features)

import json
with open(OUT_DIR / "03_selected_features_rfe.json", "w") as f:
    json.dump(rfe_features, f, indent=2)
print("Saved:", OUT_DIR / "03_selected_features_rfe.json")


RFE selected features: ['mean area', 'radius error', 'area error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst concavity', 'worst concave points']
Saved: ../outputs/03_selected_features_rfe.json
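
To see how far the two selection methods agree, the two printed lists can be compared as sets. The sketch below copies both lists from the outputs above; the Jaccard index is added here as one simple overlap measure and is not part of the original analysis.

```python
# Feature lists copied from the L1 and RFE outputs in this notebook.
l1_features = [
    "mean texture", "mean concave points", "radius error", "fractal dimension error",
    "worst radius", "worst texture", "worst area", "worst smoothness",
    "worst concave points", "worst symmetry",
]
rfe_features = [
    "mean area", "radius error", "area error", "worst radius", "worst texture",
    "worst perimeter", "worst area", "worst smoothness", "worst concavity",
    "worst concave points",
]

shared = sorted(set(l1_features) & set(rfe_features))
only_l1 = sorted(set(l1_features) - set(rfe_features))
only_rfe = sorted(set(rfe_features) - set(l1_features))

# Jaccard index: size of the intersection divided by size of the union.
jaccard = len(shared) / len(set(l1_features) | set(rfe_features))

print("Shared:", shared)
print("Only L1:", only_l1)
print("Only RFE:", only_rfe)
print(f"Jaccard overlap: {jaccard:.3f}")
```

Six of the ten features appear in both lists, all but one of them "worst" measurements, which suggests those columns carry much of the signal regardless of the selection method.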
