# Cardiovascular Dataset – Preprocessing Notebook

This notebook takes the cleaned train/test splits from the EDA notebook and applies all
preprocessing steps needed to create model-ready features and reusable artifacts for the
modeling team.

**Main goals:**
- Load the raw (unencoded, unscaled) train/test splits from EDA
- Define final feature groups (numeric, categorical, binary, dropped)
- Impute missing values using statistics fitted on the training set only
- Apply encoding and scaling via a reusable `ColumnTransformer`
- Save the fitted preprocessing pipeline and artifacts
- Export model-ready train and test sets


### Note: this notebook is a skeleton and is not runnable yet

## 0. Imports and configuration

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import joblib

from pathlib import Path



# Config
DATA_DIR = Path("data")          # adjust path
PROCESSED_DIR = Path("processed")
ARTIFACTS_DIR = Path("artifacts")

TARGET_COL = "Heart_Disease"
RANDOM_STATE = 42

PROCESSED_DIR.mkdir(exist_ok=True)
ARTIFACTS_DIR.mkdir(exist_ok=True)

## 1. Load train/test splits

We start from the splits created in the EDA notebook to avoid any leakage or
changes in sampling. These files contain the raw (unencoded, unscaled) features.


In [10]:
X_train = pd.read_csv("processed/X_train_raw.csv")
y_train = pd.read_csv("processed/y_train_raw.csv")

X_test = pd.read_csv("processed/X_test_raw.csv")
y_test = pd.read_csv("processed/y_test_raw.csv")

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((277968, 19), (30886, 19), (277968, 1), (30886, 1))

In [6]:
X_train.head()

,General_Health,Checkup,Exercise,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,Height_(cm),Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption,Age_num
0,Very Good,Within the past 2 years,Yes,No,No,No,No,No,Female,45-49,165,90.72,33.28,Yes,0,5,4,10,47.0
1,Good,Within the past year,Yes,No,No,No,"No, pre-diabetes or borderline diabetes",No,Male,18-24,183,131.54,39.33,No,4,60,30,4,21.0
2,Very Good,Within the past year,Yes,No,No,Yes,No,No,Male,30-34,188,92.99,26.32,No,12,4,30,4,32.0
3,Very Good,Within the past year,Yes,No,No,Yes,No,No,Female,70-74,178,84.82,26.83,Yes,1,30,16,0,72.0
4,Good,Within the past 2 years,Yes,No,No,Yes,No,No,Male,65-69,178,90.72,28.7,No,3,5,10,12,67.0


In [7]:
y_train.head()

,Heart_Disease
0,0
1,0
2,0
3,0
4,1


## 2. Finalize target and class imbalance strategy

In [16]:
imbalance_table = (
    y_train.value_counts()
           .to_frame("count")
           .assign(percent=lambda df: (df["count"] / df["count"].sum() * 100).round(2))
)

imbalance_table

Heart_Disease,count,percent
0,255494,91.91
1,22474,8.09


### Class imbalance strategy

The target variable (`Heart_Disease`) is imbalanced: the positive rate in the training set is
8.09% (22,474 of 277,968 rows), or roughly 11 negatives per positive. This level of imbalance is
typical for cardiovascular survey data and, in our judgment, is not severe enough to require
resampling.

**We intentionally avoid oversampling and undersampling**, including SMOTE and random oversampling,
because these approaches can:
- distort the true population prevalence,
- create duplicated (random oversampling) or synthetic (SMOTE) high-risk individuals,
- reduce calibration quality (important for medical models),
- and increase the risk of overfitting.

Instead, we rely on the following approach:

**1. Maintain the original class ratio**  
Stratified train/test split (already performed in the EDA notebook) preserves real-world prevalence.

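**2. Use class weights during model training**  
Estimators that accept it (for example scikit-learn's `LogisticRegression` and
`RandomForestClassifier`) will receive `class_weight="balanced"`. Gradient-boosting libraries
without that parameter will receive an equivalent negative-to-positive scaling factor
(for example `scale_pos_weight` in XGBoost) or per-sample weights.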

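This strategy keeps the data distribution realistic and aligns with common practice in clinical
machine-learning pipelines. Because class weighting can itself shift predicted probabilities,
calibration should still be checked on the unweighted test set.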
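As a rough illustration of what class weighting amounts to for the counts in the imbalance table, the sketch below computes the weights by hand. The formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn documents for `class_weight="balanced"`; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the imbalance table in section 2.
class_counts = {0: 255494, 1: 22474}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: n_samples / (n_classes * count_of_class).
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

# Negative-to-positive ratio, the form some boosting libraries expect
# (for example as a positive-class scaling factor).
neg_to_pos_ratio = class_counts[0] / class_counts[1]

print({label: round(weight, 3) for label, weight in balanced_weights.items()})
print(round(neg_to_pos_ratio, 2))
```

With these weights, each class contributes the same total weight to the loss, so a misclassified positive costs about 11 times as much as a misclassified negative.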


## 3. Define feature groups

We explicitly define which columns are:
- numeric (to be scaled),
- categorical (to be one-hot encoded),
- binary (kept as-is),
- dropped

This ensures the preprocessing pipeline is transparent and reproducible.


In [None]:
all_features = X_train.columns.tolist()

numeric_cols = [
    "ADD OUR COLUMNS"
]

categorical_cols = [
    "ADD OUR COLUMNS"
]

binary_cols = [
    "ADD OUR COLUMNS"
]
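Until the team fills in the placeholders, one possible grouping can be read off the `X_train.head()` output in section 1. The sketch below is an assumption, not the team's final decision: the assignment of each column to a group is inferred from the five displayed rows, and `dropped_cols` and `check_feature_groups` are hypothetical names. The helper only verifies that the groups are disjoint and together cover every column.

```python
# Column names copied from the X_train.head() output in section 1.
all_features = [
    "General_Health", "Checkup", "Exercise", "Skin_Cancer", "Other_Cancer",
    "Depression", "Diabetes", "Arthritis", "Sex", "Age_Category",
    "Height_(cm)", "Weight_(kg)", "BMI", "Smoking_History",
    "Alcohol_Consumption", "Fruit_Consumption",
    "Green_Vegetables_Consumption", "FriedPotato_Consumption", "Age_num",
]

# ASSUMED grouping, inferred from the displayed values only.
numeric_cols = [
    "Height_(cm)", "Weight_(kg)", "BMI", "Alcohol_Consumption",
    "Fruit_Consumption", "Green_Vegetables_Consumption",
    "FriedPotato_Consumption", "Age_num",
]
categorical_cols = ["General_Health", "Checkup", "Diabetes"]
binary_cols = [
    "Exercise", "Skin_Cancer", "Other_Cancer", "Depression",
    "Arthritis", "Sex", "Smoking_History",
]
dropped_cols = ["Age_Category"]


def check_feature_groups(all_columns, groups):
    """Raise ValueError unless the groups are disjoint and cover all columns."""
    assigned = [col for cols in groups.values() for col in cols]
    duplicates = sorted({col for col in assigned if assigned.count(col) > 1})
    missing = sorted(set(all_columns) - set(assigned))
    unknown = sorted(set(assigned) - set(all_columns))
    if duplicates or missing or unknown:
        raise ValueError(
            f"duplicates={duplicates}, missing={missing}, unknown={unknown}"
        )
    return len(assigned)


n_assigned = check_feature_groups(
    all_features,
    {
        "numeric": numeric_cols,
        "categorical": categorical_cols,
        "binary": binary_cols,
        "dropped": dropped_cols,
    },
)
print(n_assigned)
```

Running the check right after editing the lists catches a column that was forgotten or listed twice before it silently disappears inside the `ColumnTransformer`.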


In [None]:
# TODO: drop `Age_Category`; `Age_num` already encodes age as a number.
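A minimal sketch of that drop, with a check that `Age_num` really carries the same information. It assumes `Age_num` is the midpoint of the `Age_Category` range, which is what the five rows displayed in section 1 suggest. Labels in another format (for example an open-ended top bin) would need separate handling, and `bin_midpoint` is a hypothetical helper.

```python
import pandas as pd

# Age columns of the five rows shown in the X_train.head() output.
sample = pd.DataFrame({
    "Age_Category": ["45-49", "18-24", "30-34", "70-74", "65-69"],
    "Age_num": [47.0, 21.0, 32.0, 72.0, 67.0],
})


def bin_midpoint(label: str) -> float:
    """Midpoint of a closed range label such as "45-49" (assumed format)."""
    low, high = label.split("-")
    return (int(low) + int(high)) / 2


# ASSUMPTION: Age_num equals the midpoint of the Age_Category range.
midpoints = sample["Age_Category"].map(bin_midpoint)
assert (midpoints == sample["Age_num"]).all()

# Drop the now-redundant categorical column.
sample_without_category = sample.drop(columns=["Age_Category"])
print(sample_without_category.columns.tolist())
```

Leaving `Age_Category` out of every feature group has the same effect inside the pipeline, since `ColumnTransformer` drops unlisted columns by default (`remainder="drop"`). An explicit drop simply makes the decision visible.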

## 4. Build preprocessing pipeline

We build a single `ColumnTransformer` that:

- imputes and scales numeric features,
- imputes and one-hot encodes categorical features,
- passes binary columns through unchanged.

The pipeline is fitted on the training data and then applied consistently to
both train and test sets. The fitted object is saved as an artifact.


In [None]:
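# Numeric features: median imputation, then standardization
numeric_pipeline = [
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
]

# Categorical features: most-frequent imputation, then one-hot encoding
# (`sparse_output` requires scikit-learn >= 1.2)
categorical_pipeline = [
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
]

# Columns not listed in any group are dropped (default `remainder="drop"`).
# NOTE: "passthrough" assumes the binary columns are already numeric 0/1;
# Yes/No strings would make the output a non-numeric (object) array.
preprocessor = ColumnTransformer(
    transformers=[
        ("num",    Pipeline(numeric_pipeline), numeric_cols),
        ("cat",    Pipeline(categorical_pipeline), categorical_cols),
        ("binary", "passthrough", binary_cols),
    ]
)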


## 5. Fit on training data and transform train/test

The preprocessing pipeline is fitted **only on `X_train`** to avoid data leakage.
The same fitted object is then used to transform both `X_train` and `X_test`.


In [None]:
# Fit on training data
X_train_pre = preprocessor.fit_transform(X_train)
X_test_pre  = preprocessor.transform(X_test)

X_train_pre.shape, X_test_pre.shape


## 6. Sanity checks on transformed data

We run basic checks to ensure:
- No missing values remain after preprocessing
- Shapes of transformed matrices are as expected
- The class balance has not changed


In [None]:
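# Check for NaNs (pd.isna also works if the output has object dtype)
print("NaNs in X_train_pre:", pd.isna(X_train_pre).sum())
print("NaNs in X_test_pre:", pd.isna(X_test_pre).sum())

# Check target balance (y_train and y_test are one-column DataFrames)
print("Train positive rate:", y_train["Heart_Disease"].mean())
print("Test positive rate:", y_test["Heart_Disease"].mean())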
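The three checks listed in this section could also be bundled into one function that fails loudly instead of printing. This is a sketch: `run_sanity_checks` and the `max_rate_gap` tolerance are hypothetical, and the 0.01 default is an arbitrary placeholder rather than a recommended threshold.

```python
import numpy as np
import pandas as pd


def run_sanity_checks(X_train_pre, X_test_pre, y_train, y_test, max_rate_gap=0.01):
    """Assert basic invariants and return the (train, test) positive rates."""
    # Shapes: one target per row, same number of features in both splits.
    assert X_train_pre.shape[0] == len(y_train), "train rows != train targets"
    assert X_test_pre.shape[0] == len(y_test), "test rows != test targets"
    assert X_train_pre.shape[1] == X_test_pre.shape[1], "feature count differs"

    # No missing values after imputation.
    assert not pd.isna(X_train_pre).any(), "NaNs remain in train matrix"
    assert not pd.isna(X_test_pre).any(), "NaNs remain in test matrix"

    # Class balance: both splits should show a similar positive rate.
    train_rate = float(np.asarray(y_train).ravel().mean())
    test_rate = float(np.asarray(y_test).ravel().mean())
    assert abs(train_rate - test_rate) <= max_rate_gap, "positive rates diverge"
    return train_rate, test_rate


# Toy usage (values are illustrative).
demo_rates = run_sanity_checks(
    np.zeros((4, 2)),
    np.ones((4, 2)),
    pd.DataFrame({"Heart_Disease": [0, 0, 0, 1]}),
    pd.DataFrame({"Heart_Disease": [0, 1, 0, 0]}),
)
print(demo_rates)
```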

## 7. Save preprocessing artifacts and model-ready data

We save:
- The fitted `preprocessor` (ColumnTransformer)
- Model-ready train and test matrices, along with targets

The modeling team can now load these directly without rerunning EDA.
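The save step has no code cell yet. A minimal sketch follows, assuming each model-ready CSV holds the transformed features plus the target column, and using the output file names from the summary in section 8. The artifact name `preprocessor.joblib` and the helper `save_outputs` are hypothetical. The demo writes to a temporary directory so it can be run without touching the project folders.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler


def save_outputs(preprocessor, train_df, test_df, processed_dir, artifacts_dir):
    """Persist the fitted preprocessor and the model-ready tables."""
    processed_dir = Path(processed_dir)
    artifacts_dir = Path(artifacts_dir)
    processed_dir.mkdir(parents=True, exist_ok=True)
    artifacts_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical artifact file name.
    preprocessor_path = artifacts_dir / "preprocessor.joblib"
    joblib.dump(preprocessor, preprocessor_path)

    # index=False avoids an extra unnamed index column on reload.
    train_df.to_csv(processed_dir / "train_model_ready.csv", index=False)
    test_df.to_csv(processed_dir / "test_model_ready.csv", index=False)
    return preprocessor_path


# Toy round trip (values are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    toy_train = pd.DataFrame({"BMI": [-1.0, 0.0, 1.0], "Heart_Disease": [0, 0, 1]})
    toy_test = pd.DataFrame({"BMI": [0.5], "Heart_Disease": [0]})
    fitted = StandardScaler().fit(toy_train[["BMI"]])

    saved_path = save_outputs(
        fitted, toy_train, toy_test,
        Path(tmp) / "processed", Path(tmp) / "artifacts",
    )

    # What the modeling team would do on their side.
    reloaded = joblib.load(saved_path)
    reloaded_train = pd.read_csv(Path(tmp) / "processed" / "train_model_ready.csv")

print(reloaded_train.shape)
```

Files written by `joblib` are pickle-based, so they should be loaded with a scikit-learn version compatible with the one used to fit the preprocessor.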


## 8. Summary and handoff notes

- Input: raw train/test splits from the EDA notebook (`X_train_raw.csv`, `X_test_raw.csv`, `y_train_raw.csv`, `y_test_raw.csv`)
- Preprocessing steps:
  - Dropped columns: `Age_Category` (replaced by `Age_num`)
  - Numeric features: median imputation + standard scaling
  - Categorical features: most-frequent imputation + one-hot encoding
  - Binary features: passed through unchanged

- Model-ready data:
  - `processed/train_model_ready.csv`
  - `processed/test_model_ready.csv`

Once the notebook is complete and has been run, these files are ready for downstream model training.
