# 1. Data Loading and Initial Setup

This notebook prepares the Maternal Health Risk dataset for our experiments.
This updated version will:
1.  Load and clean the raw data.
2.  Create a single, hold-out test set.
3.  Generate multiple training datasets with varying imbalance ratios (IR) via undersampling.
4.  For each IR, create a size-matched control dataset with the original class ratio.
5.  Preprocess and save all generated datasets.

In [24]:
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import resample
from pathlib import Path

RAW_PATH = Path("../data/raw/maternal_health_risk.csv")
PROCESSED_PATH = Path("../data/processed/")
TARGET_FEATURE = 'RiskLevel'
NUMERICAL_FEATURES = ['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate']
RANDOM_STATE = 42
IMBALANCE_RATIOS = [1, 2, 3, 4, 5, 10] # Majority:Minority ratios to generate

RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)

print("Fetching and loading the raw dataset...")
maternal_health_risk = fetch_ucirepo(id=863)
X_features = maternal_health_risk.data.features
y_target = maternal_health_risk.data.targets
df_raw = pd.concat([X_features, y_target], axis=1)
df_raw.to_csv(RAW_PATH, index=False)
print("Raw dataset loaded.")


Fetching and loading the raw dataset...
Raw dataset loaded.


# 2. Data Cleaning

We apply the cleaning steps identified in the EDA: removing the duplicate rows that were found to bias the data and dropping the physiologically implausible outlier (HeartRate = 7).


In [25]:
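# Drop exact duplicate rows.
df_cleaned = df_raw.drop_duplicates().copy()
# Remove the implausible HeartRate == 7 record(s) and renumber the index.
df_cleaned = df_cleaned[df_cleaned['HeartRate'] != 7].reset_index(drop=True)
print(f"Data cleaned. Shape after cleaning: {df_cleaned.shape}")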


Data cleaned. Shape after cleaning: (451, 7)


# 3. Target Binarization

To simplify the initial experiments, we convert the multi-class target into a binary one.
- **Class 0 (Majority):** 'low risk'
- **Class 1 (Minority):** 'mid risk' and 'high risk' are combined into a single 'at-risk' class.


In [26]:
df_cleaned[TARGET_FEATURE] = df_cleaned[TARGET_FEATURE].replace({'mid risk': 'at-risk', 'high risk': 'at-risk'})
print("\nTarget variable distribution after binarization:")
print(df_cleaned[TARGET_FEATURE].value_counts())


Target variable distribution after binarization:
RiskLevel
low risk    233
at-risk     218
Name: count, dtype: int64
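
The counts above show that the binarized data is close to balanced. A minimal sketch of that arithmetic follows; `imbalance_ratio` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

def imbalance_ratio(labels: pd.Series) -> float:
    """Hypothetical helper: majority count divided by minority count."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Class counts copied from the output above: 233 'low risk', 218 'at-risk'.
labels = pd.Series(['low risk'] * 233 + ['at-risk'] * 218)
ir = imbalance_ratio(labels)
print(f"Natural imbalance ratio: {ir:.2f}:1")  # prints "Natural imbalance ratio: 1.07:1"
```

Because undersampling can only remove majority rows, this natural ratio is roughly the upper bound on the imbalance that the later steps can produce.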


# 4. Create a Hold-Out Test Set

This is a one-time, stratified 80/20 split. The test set remains untouched from here on and is used only for the final evaluation of all models.


In [27]:

X = df_cleaned.drop(TARGET_FEATURE, axis=1)
y = df_cleaned[[TARGET_FEATURE]]

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

train_full_df = pd.concat([X_train_full, y_train_full], axis=1)

print(f"\nFull training set shape: {train_full_df.shape}")
print(f"Hold-out test set shape: {X_test.shape}")



Full training set shape: (360, 7)
Hold-out test set shape: (91, 6)
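
To illustrate what `stratify` contributes, the sketch below repeats the split on placeholder data with the same class counts and compares the class shares of the two partitions. The `row_id` feature and the toy frame are assumptions made for the example, not the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with the same class counts as the cleaned data (233 / 218).
y_toy = pd.Series(['low risk'] * 233 + ['at-risk'] * 218, name='RiskLevel')
X_toy = pd.DataFrame({'row_id': range(len(y_toy))})  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the class shares should be nearly identical in both partitions.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share.round(3))
print(test_share.round(3))
```

The partition sizes match the shapes reported above (360 training rows, 91 test rows), since the test size is rounded up from 0.2 × 451.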


# 5. Generate Imbalanced and Control Datasets

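We now loop through the desired imbalance ratios to generate our experimental training sets. If a ratio needs more majority samples than the training set contains, all available majority samples are used and the loop stops after that ratio. For each ratio, we create two versions: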
1.  **Imbalanced Set**: Created by undersampling the majority class.
2.  **Control Set**: A stratified sample from the original training set that has the *same total number of instances* as the imbalanced set but preserves the *original class ratio*.
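
The pairing of an imbalanced set with a size-matched control can be sketched on a toy frame where the majority class is plentiful. Everything below (the `toy` frame, the `maj`/`min` labels, the 100:10 split, and `ir = 3`) is a hypothetical setup chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy training set: 100 majority rows, 10 minority rows.
toy = pd.DataFrame({
    'feature': range(110),
    'label': ['maj'] * 100 + ['min'] * 10,
})
majority = toy[toy['label'] == 'maj']
minority = toy[toy['label'] == 'min']

ir = 3  # target majority:minority ratio

# Imbalanced set: keep every minority row, undersample the majority without replacement.
majority_sample = resample(
    majority, replace=False, n_samples=len(minority) * ir, random_state=0
)
imbalanced = pd.concat([majority_sample, minority])

# Control set: same number of rows, drawn with stratification so the
# original class ratio is approximately preserved.
control, _ = train_test_split(
    toy, train_size=len(imbalanced), random_state=0, stratify=toy['label']
)

print(imbalanced['label'].value_counts().to_dict())
print(control['label'].value_counts().to_dict())
```

Both sets contain 40 rows, so any performance difference between models trained on them can be attributed to the class ratio rather than to the amount of training data.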


In [28]:
minority_df = train_full_df[train_full_df[TARGET_FEATURE] == 'at-risk']
majority_df = train_full_df[train_full_df[TARGET_FEATURE] == 'low risk']
n_minority = len(minority_df)
n_majority_available = len(majority_df)

print(f"Full training set composition: {n_majority_available} majority samples, {n_minority} minority samples.")

generated_datasets = {}

for ir in IMBALANCE_RATIOS:
    print(f"\nProcessing Imbalance Ratio (IR) = {ir}:1")
    
    n_majority_imbalanced = int(n_minority * ir)
    
    if n_majority_imbalanced > n_majority_available:
        print(f"  - WARNING: Cannot create {ir}:1 ratio. Not enough majority samples.")
        print(f"  - Using all available {n_majority_available} majority samples instead.")
        n_majority_imbalanced = n_majority_available

    majority_undersampled = resample(
        majority_df,
        replace=False,
        n_samples=n_majority_imbalanced,
        random_state=RANDOM_STATE
    )
    
    imbalanced_df = pd.concat([majority_undersampled, minority_df])
    total_size = len(imbalanced_df)
    generated_datasets[f'imbalanced_ir_{ir}'] = imbalanced_df
    
    print(f"  - Imbalanced set created: {total_size} samples ({len(majority_undersampled)} majority, {n_minority} minority)")

    if total_size == len(train_full_df):
        control_df = train_full_df.copy()
    else:
        control_df, _ = train_test_split(
            train_full_df,
            train_size=total_size,
            random_state=RANDOM_STATE,
            stratify=train_full_df[TARGET_FEATURE]
        )
    
    generated_datasets[f'control_ir_{ir}'] = control_df
    print(f"  - Control set created:      {len(control_df)} samples (original class ratio)")

    if n_majority_imbalanced == n_majority_available and ir != 1:
        print("\nStopping further processing as maximum possible imbalance has been reached.")
        break

Full training set composition: 186 majority samples, 174 minority samples.

Processing Imbalance Ratio (IR) = 1:1
  - Imbalanced set created: 348 samples (174 majority, 174 minority)
  - Control set created:      348 samples (original class ratio)

Processing Imbalance Ratio (IR) = 2:1
  - WARNING: Cannot create 2:1 ratio. Not enough majority samples.
  - Using all available 186 majority samples instead.
  - Imbalanced set created: 360 samples (186 majority, 174 minority)
  - Control set created:      360 samples (original class ratio)

Stopping further processing as maximum possible imbalance has been reached.
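
The early stop follows directly from the class counts. The sketch below reproduces the arithmetic for each requested ratio; `plan_majority_count` is a hypothetical helper that mirrors the capping logic in the loop above.

```python
def plan_majority_count(n_minority: int, n_majority_available: int, ir: float):
    """Hypothetical helper: return (majority rows kept, ratio actually achieved)."""
    requested = int(n_minority * ir)
    kept = min(requested, n_majority_available)  # cannot keep more rows than exist
    return kept, kept / n_minority

# Counts taken from the output above: 186 majority, 174 minority.
for ir in [1, 2, 3, 4, 5, 10]:
    kept, achieved = plan_majority_count(174, 186, ir)
    print(f"requested {ir}:1 -> keep {kept} majority rows, achieved {achieved:.2f}:1")
```

For every requested ratio above 1:1, the achieved ratio is capped at 186/174, or about 1.07:1. The files labeled `ir_2` therefore hold the full training set at its natural ratio rather than a true 2:1 imbalance.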


# 6. Preprocessing and Saving All Datasets

We fit the scaler ONCE on the full training data to learn the scaling parameters. Then, we transform all generated training sets and the hold-out test set using this single, consistent scaler.
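
As a rough illustration of the fit-once pattern, the sketch below fits a `StandardScaler` on synthetic training features and reuses it for a subset and a hold-out array. The random arrays are assumptions standing in for the real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # hypothetical training features
X_subset = X_train[:40]                                     # stands in for an undersampled set
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))     # hypothetical hold-out features

# Learn the mean and standard deviation once, from the training data only.
scaler = StandardScaler().fit(X_train)

# Every dataset is mapped with the same mean_ and scale_.
X_subset_scaled = scaler.transform(X_subset)
X_test_scaled = scaler.transform(X_test)

print("train mean after scaling: ", scaler.transform(X_train).mean(axis=0).round(6))
print("subset mean after scaling:", X_subset_scaled.mean(axis=0).round(3))
```

The subset's mean after scaling is generally not exactly zero, which is expected: all datasets share one coordinate system defined by the full training set, and no statistics from the test set leak into the transformation.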


In [29]:
target_encoder = OrdinalEncoder(categories=[['low risk', 'at-risk']])
target_encoder.fit(y_train_full)

scaler = StandardScaler()
scaler.fit(X_train_full[NUMERICAL_FEATURES])

for name, df in generated_datasets.items():
    X_temp = df.drop(columns=[TARGET_FEATURE])
    y_temp = df[[TARGET_FEATURE]]

    X_processed = scaler.transform(X_temp[NUMERICAL_FEATURES])
    X_processed_df = pd.DataFrame(X_processed, columns=NUMERICAL_FEATURES)
    
    # OrdinalEncoder returns a 2-D array of shape (n, 1); flatten it into a single column.
    y_processed = target_encoder.transform(y_temp).ravel()
    
    final_df = X_processed_df
    final_df[TARGET_FEATURE] = y_processed
    save_path = PROCESSED_PATH / f"train_{name}.csv"
    final_df.to_csv(save_path, index=False)
    print(f"Saved: {save_path}")

X_test_processed = scaler.transform(X_test[NUMERICAL_FEATURES])
y_test_processed = target_encoder.transform(y_test).ravel()

test_df = pd.DataFrame(X_test_processed, columns=NUMERICAL_FEATURES)
test_df[TARGET_FEATURE] = y_test_processed
test_df.to_csv(PROCESSED_PATH / "test.csv", index=False)
print(f"Saved: {PROCESSED_PATH / 'test.csv'}")

print("\nPreprocessing complete. Datasets are ready for experiments.")


Saved: ../data/processed/train_imbalanced_ir_1.csv
Saved: ../data/processed/train_control_ir_1.csv
Saved: ../data/processed/train_imbalanced_ir_2.csv
Saved: ../data/processed/train_control_ir_2.csv
Saved: ../data/processed/test.csv

Preprocessing complete. Datasets are ready for experiments.
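
When reading the saved files, it helps to know how the target column was encoded. The sketch below shows the mapping produced by an `OrdinalEncoder` configured with the same category order as above, applied to four made-up labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same category order as in the notebook: 'low risk' -> 0, 'at-risk' -> 1.
encoder = OrdinalEncoder(categories=[['low risk', 'at-risk']])

# Made-up labels, used only to demonstrate the mapping.
y = pd.DataFrame({'RiskLevel': ['low risk', 'at-risk', 'at-risk', 'low risk']})

encoded = encoder.fit_transform(y)      # 2-D float array of shape (4, 1)
codes = encoded.ravel().astype(int)
print(codes.tolist())                   # prints [0, 1, 1, 0]

# The mapping is reversible, which is useful when reporting results.
decoded = encoder.inverse_transform(encoded).ravel().tolist()
print(decoded)
```

Because the encoder outputs floats, the `RiskLevel` column in the saved CSV files contains `0.0` and `1.0`; downstream code may want to cast it to an integer type.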
