# Breast Cancer Prediction - Feature Engineering and Preprocessing

This notebook constructs the preprocessing pipeline required to prepare medical diagnostic data for machine learning models.

The focus is on feature scaling, dataset splitting, and saving reproducible processed datasets.

In [1]:
import numpy as np
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

We load the locally stored breast cancer dataset created during the analysis phase.

In [2]:
df = pd.read_csv("../datasets/breast_cancer.csv")

# Separate the feature columns from the target column
X = df.drop("target", axis=1)
y = df["target"]

print("Feature shape:", X.shape)
print("Target shape:", y.shape)

Feature shape: (569, 30)
Target shape: (569,)


The features are measured on very different scales, so their raw value ranges differ widely.

We apply standardization so that:
- every feature has zero mean and unit variance
- distance-based and gradient-based models are not dominated by the features with the largest numeric ranges
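
As a minimal sketch of what standardization computes, the snippet below applies z = (x - mean) / std column by column and compares the result with `StandardScaler`. The matrix `X_toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two columns on very different scales (values are made up).
X_toy = np.array([
    [1.0, 1000.0],
    [2.0, 1500.0],
    [3.0, 2000.0],
    [4.0, 2500.0],
])

# Manual standardization: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
z_manual = (X_toy - mean) / std

# The same transform via scikit-learn.
z_sklearn = StandardScaler().fit_transform(X_toy)

# Both routes should agree, and each column should end up with
# mean close to 0 and standard deviation close to 1.
print(np.allclose(z_manual, z_sklearn))
print(z_manual.mean(axis=0), z_manual.std(axis=0))
```

After the transform, both columns occupy the same numeric range even though the raw values differed by three orders of magnitude.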

In [3]:
# Fit the scaler on the full feature matrix and standardize each column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Scaled feature sample:\n", X_scaled[:3])

Scaled feature sample:
 [[ 1.09706398e+00 -2.07333501e+00  1.26993369e+00  9.84374905e-01
   1.56846633e+00  3.28351467e+00  2.65287398e+00  2.53247522e+00
   2.21751501e+00  2.25574689e+00  2.48973393e+00 -5.65265059e-01
   2.83303087e+00  2.48757756e+00 -2.14001647e-01  1.31686157e+00
   7.24026158e-01  6.60819941e-01  1.14875667e+00  9.07083081e-01
   1.88668963e+00 -1.35929347e+00  2.30360062e+00  2.00123749e+00
   1.30768627e+00  2.61666502e+00  2.10952635e+00  2.29607613e+00
   2.75062224e+00  1.93701461e+00]
 [ 1.82982061e+00 -3.53632408e-01  1.68595471e+00  1.90870825e+00
  -8.26962447e-01 -4.87071673e-01 -2.38458552e-02  5.48144156e-01
   1.39236330e-03 -8.68652457e-01  4.99254601e-01 -8.76243603e-01
   2.63326966e-01  7.42401948e-01 -6.05350847e-01 -6.92926270e-01
  -4.40780058e-01  2.60162067e-01 -8.05450380e-01 -9.94437403e-02
   1.80592744e+00 -3.69203222e-01  1.53512599e+00  1.89048899e+00
  -3.75611957e-01 -4.30444219e-01 -1.46748968e-01  1.08708430e+00
  -2.43889668e-01

In [5]:
joblib.dump(scaler, "../models/scaler.pkl")
print("Scaler saved.")

Scaler saved.
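
The saved scaler is meant to be reloaded wherever new data needs the same transform. The sketch below shows one way that round trip might look. It uses random stand-in data and a temporary directory instead of `../models/scaler.pkl`, and the names `X_fit`, `new_sample`, and `loaded_scaler` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in training data (random numbers, not the real dataset).
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(100, 30))

scaler = StandardScaler().fit(X_fit)

# Save to a temporary path here; the notebook uses "../models/scaler.pkl".
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(scaler, scaler_path)
    loaded_scaler = joblib.load(scaler_path)

# Scale a new sample with transform(), not fit_transform(), so that it
# uses the mean and standard deviation learned at fit time.
new_sample = rng.normal(loc=50.0, scale=10.0, size=(1, 30))
scaled_original = scaler.transform(new_sample)
scaled_reloaded = loaded_scaler.transform(new_sample)

# The reloaded scaler should reproduce the original transform.
print(np.allclose(scaled_original, scaled_reloaded))
```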


We split the dataset into training (80%) and testing (20%) subsets.

Stratifying on the target keeps the proportion of malignant and benign cases the same in both subsets.
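
The effect of `stratify` can be checked on a small example. The sketch below uses a made-up label vector with a 70/30 class imbalance (`y_toy`, not the real target) and compares the class fraction before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 class imbalance (made-up data, not the real target).
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fraction of class 1 overall and in each subset; with stratification
# these should match up to rounding.
frac_all = y_toy.mean()
frac_train = y_tr.mean()
frac_test = y_te.mean()
print(frac_all, frac_train, frac_test)
```

Without `stratify`, the test fraction would vary with the random seed, which matters most when one class is much rarer than the other.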

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape)
print("Testing set:", X_test.shape)

Training set: (455, 30)
Testing set: (114, 30)
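
A variation worth considering is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transform. The sketch below illustrates that ordering on random stand-in data with the same shape as the dataset. It is an alternative under stated assumptions, not the procedure used in the cells above, and `X_raw`, `y_raw`, `X_train_raw`, and `X_test_raw` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same shape as the notebook's (random, not real).
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=20.0, scale=5.0, size=(569, 30))
y_raw = rng.integers(0, 2, size=569)

# Split the unscaled features first.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

print(X_train.shape, X_test.shape)
```

With this ordering the training columns have exactly zero mean, while the test columns are only approximately centered, which mirrors how unseen data would be scaled at prediction time.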


The four split arrays are saved together in a single file so that the modeling and deployment stages load exactly the same data.

In [7]:
joblib.dump((X_train, X_test, y_train, y_test), "../datasets/processed_breast_cancer.pkl")

print("Processed datasets saved.")

Processed datasets saved.
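
A later notebook could load the splits back with `joblib.load`, which returns the same 4-tuple in the order it was saved. The sketch below demonstrates the round trip with zero-filled stand-in arrays and a temporary directory. In the actual pipeline the path would be `../datasets/processed_breast_cancer.pkl`, and the targets would be pandas Series rather than NumPy arrays.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in arrays shaped like the notebook's splits (zeros, not real data).
X_train = np.zeros((455, 30))
X_test = np.zeros((114, 30))
y_train = np.zeros(455, dtype=int)
y_test = np.zeros(114, dtype=int)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary path used for this sketch only.
    data_path = os.path.join(tmp_dir, "processed_breast_cancer.pkl")
    joblib.dump((X_train, X_test, y_train, y_test), data_path)

    # Unpack in the same order the tuple was saved.
    X_train_l, X_test_l, y_train_l, y_test_l = joblib.load(data_path)

print(X_train_l.shape, X_test_l.shape, y_train_l.shape, y_test_l.shape)
```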


Conclusions:

- All features have been standardized.
- The dataset has been split in a stratified manner.
- Preprocessing artifacts have been saved for consistent reuse.

The data is now ready for training and comparing classification models.