# Feature Engineering Pipeline v2

This notebook runs the full **FeatureEngineer** pipeline from `notebooks/utils/features.py` on train, test, and holdout vital-sign time series, then saves the resulting feature matrices as Parquet files in `data/`.

**Outputs:** `train_features.parquet`, `test_features.parquet`, `holdout_features.parquet`, `feature_cols.pkl`

---

## What the pipeline does

The pipeline turns raw vital-sign rows (heart rate, BP, SpO₂, respiratory rate, etc.) and patient demographics into a single feature set suitable for downstream models (e.g. LightGBM, LSTM). All **scalers and encoders are fit on train encounters only** and then applied to test/holdout to avoid leakage.
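A minimal sketch of what fitting on train only means, written in plain pandas. This notebook does not show which scaler and encoder classes `FeatureEngineer` uses internally, so the column names, toy values, and the `apply_fitted` helper are assumptions for illustration.

```python
import pandas as pd

# Toy patient tables; column names and values are illustrative assumptions.
train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "gender": ["M", "F", "M"]})
test = pd.DataFrame({"age": [80.0, 40.0], "gender": ["F", "X"]})

# Fit: the statistics and the category vocabulary come from train only.
age_mean = train["age"].mean()
age_std = train["age"].std(ddof=0) or 1.0  # guard against a constant column
gender_categories = sorted(train["gender"].unique())


def apply_fitted(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the train-fitted scaling and one-hot vocabulary to any split."""
    out = pd.DataFrame(index=df.index)
    out["age_scaled"] = (df["age"] - age_mean) / age_std
    for cat in gender_categories:
        out[f"gender_{cat}"] = (df["gender"] == cat).astype(int)
    return out


train_features = apply_fitted(train)
test_features = apply_fitted(test)  # unseen category "X" becomes an all-zero row
print(test_features)
```

Because nothing is re-estimated on test or holdout, their value distributions and unseen categories cannot influence the transformation.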

When you run the pipeline, **logging** shows which feature group is currently being created (impute → lag → derivative → rolling → … → interactions → higher-order derivatives).
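One way such per-group progress messages can be produced is sketched below. The logger name, the `run_steps` helper, and the two toy steps are hypothetical and are not taken from `utils/features.py`.

```python
import logging

logger = logging.getLogger("feature_pipeline_sketch")  # hypothetical logger name


def fill_forward(values):
    """Replace None with the most recent non-missing value."""
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out


def first_difference(values):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


def run_steps(values, steps):
    """Apply (name, function) steps in order, logging each group name first."""
    for name, step in steps:
        logger.info("Creating features: %s", name)
        values = step(values)
    return values


# Two toy steps standing in for real feature groups.
steps = [("impute_vitals", fill_forward), ("derivative", first_difference)]

logging.basicConfig(level=logging.INFO, format="%(message)s")
result = run_steps([70.0, None, 75.0], steps)
print(result)
```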

---

## Feature groups (in pipeline order)

| Step | Feature group | Description |
|------|----------------|-------------|
| 1 | **impute_vitals** | Fill missing vitals: neighbour interpolation, then encounter median fallback |
| 2 | **lag** | Per-vital lags 1..`n_lags` + `warmup_progress` (0→1 over first `n_lags` rows) |
| 3 | **derivative** | Delta (vs lag_n), delta_1s (vs lag_1), acceleration per vital |
| 4 | **rolling_stats** | Mean, std, min, max over the lag window (e.g. last 12 steps) per vital |
| 5 | **multiscale_rolling** | Same stats over 4 windows: 6, 12, 24, 60 steps (~30s–5 min) per vital |
| 6 | **derived_vitals** | Pulse pressure, MAP, shock_index (HR/SBP), hr_rr_ratio (HR/RR), and their deltas |
| 7 | **temporal** | Minutes into encounter, hour sin/cos, day-of-week sin/cos |
| 8 | **ecg** | Per-encounter ECG: basic stats + FFT (dom_freq, LF/HF power, spectral entropy), hr_ecg_diff (if ECG data exists) |
| 9 | **prior_label** | *(Optional)* prior_label, max_label_last_60s, ever_deteriorated — **disabled by default** to avoid label leakage on holdout |
| 10 | **patient** | Demographics (age, BMI, pain_score; scaled), missingness flags, is_elderly/is_child, gender/marital/race/ethnicity/encounter_description (OHE), reason_risk_tier, comorbidity flags from free-text, on_cardiac_meds / on_insulin |
| 11 | **clinical_alert** | Binary flags: tachycardia, bradycardia, hypotension, hypertension, SpO₂ low/critical, tachypnea/bradypnea, n_active_alerts |
| 12 | **interaction** | Vital × elderly, vital × child, vital × comorbidity_count, HR × cardiac_meds, shock_index × reason_risk_tier |
| 13 | **higher_order_derivatives** | Jerk (3rd derivative) per vital for sudden change detection |
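
A few of the groups in the table can be sketched in pandas as follows. This is not the code in `utils/features.py`: the vital column names, the output column names, the window sizes, and the MAP approximation (diastolic plus one third of pulse pressure) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy vitals at 5-second spacing; every column name except encounter_id and
# timestamp is an assumption, as are the window sizes used below.
df = pd.DataFrame({
    "encounter_id": ["a"] * 4 + ["b"] * 3,
    "timestamp": pd.to_datetime([
        "2024-01-01 06:00:00", "2024-01-01 06:00:05",
        "2024-01-01 06:00:10", "2024-01-01 06:00:15",
        "2024-01-01 18:00:00", "2024-01-01 18:00:05", "2024-01-01 18:00:10",
    ]),
    "heart_rate": [80.0, 82.0, 85.0, 90.0, 60.0, 61.0, 59.0],
    "systolic_bp": [120.0, 118.0, 115.0, 110.0, 130.0, 131.0, 129.0],
    "diastolic_bp": [80.0, 79.0, 76.0, 70.0, 85.0, 85.0, 84.0],
})
df = df.sort_values(["encounter_id", "timestamp"]).reset_index(drop=True)

# lag: earlier readings from the same encounter only (no bleed across encounters).
for k in (1, 2):
    df[f"heart_rate_lag_{k}"] = df.groupby("encounter_id")["heart_rate"].shift(k)

# derivative: change versus the previous reading.
df["heart_rate_delta_1"] = df["heart_rate"] - df["heart_rate_lag_1"]

# rolling_stats: mean over the last 3 readings, restarted for each encounter.
df["heart_rate_roll_mean_3"] = df.groupby("encounter_id")["heart_rate"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# derived_vitals: pulse pressure, approximate MAP, and shock index (HR/SBP).
df["pulse_pressure"] = df["systolic_bp"] - df["diastolic_bp"]
df["map"] = df["diastolic_bp"] + df["pulse_pressure"] / 3
df["shock_index"] = df["heart_rate"] / df["systolic_bp"]

# temporal: elapsed minutes plus a cyclical encoding of the hour of day.
start = df.groupby("encounter_id")["timestamp"].transform("min")
df["minutes_into_encounter"] = (df["timestamp"] - start).dt.total_seconds() / 60
hour = df["timestamp"].dt.hour + df["timestamp"].dt.minute / 60
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

print(df.filter(like="heart_rate").head())
```

In this sketch the first rows of each encounter have NaN lags and deltas. The real pipeline must handle those rows in some way, since the validation step finds no missing feature values, but how it does so is not shown in this notebook.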

---

## EDA-informed choices

- **Missingness as signal:** `bmi_missing`, `pain_score_missing`, `reason_missing` — EDA showed missingness correlates with outcome.
- **Comorbidities:** Parsed from `previous_medical_history` (hypertension, diabetes, kidney, cardiac, anemia, obesity); comorbidity count is strongly associated with deterioration.
- **Reason-for-visit risk tier:** High (e.g. MI, stroke, gunshot) vs medium vs low — high-risk reasons have much higher label-3 rates.
- **Encounter description:** One-hot encoded; e.g. obstetric vs ED has very different outcome rates.
- **Medication flags:** `on_cardiac_meds` (metoprolol, nitroglycerin, etc.) strongly predicts deterioration.
- **Dropped:** `known_allergies`, `previous_medications` (too sparse / no association); BMI/pain_score kept with missingness flags.
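
The missingness and comorbidity flags above could be derived roughly as follows. The keyword lists and the `has_*` column names are hypothetical; the actual parsing rules live in `utils/features.py` and are not shown here.

```python
import re

import pandas as pd

patients = pd.DataFrame({
    "bmi": [31.2, None, 22.0],
    "previous_medical_history": [
        "Hypertension; Type 2 diabetes", None, "Chronic kidney disease",
    ],
})

# Hypothetical keyword lists, one entry per comorbidity flag.
COMORBIDITY_KEYWORDS = {
    "hypertension": ["hypertension"],
    "diabetes": ["diabetes"],
    "kidney": ["kidney", "renal"],
}

# Missingness as signal: keep a flag even if the value itself is imputed later.
patients["bmi_missing"] = patients["bmi"].isna().astype(int)

# Case-insensitive keyword match on the free-text history; missing text is empty.
history = patients["previous_medical_history"].fillna("").str.lower()
for name, keywords in COMORBIDITY_KEYWORDS.items():
    pattern = "|".join(re.escape(k) for k in keywords)
    patients[f"has_{name}"] = history.str.contains(pattern, regex=True).astype(int)

flag_cols = [f"has_{name}" for name in COMORBIDITY_KEYWORDS]
patients["comorbidity_count"] = patients[flag_cols].sum(axis=1)
print(patients[["bmi_missing", *flag_cols, "comorbidity_count"]])
```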

## 1. Setup

Load `FeatureEngineer` from `utils/features.py`. The cell below does not modify `sys.path`, so it assumes `notebooks/` is the working directory (or is otherwise importable). Logging is configured so the pipeline prints which feature group it is creating.

In [1]:
import logging
from pathlib import Path

import pandas as pd

from utils.features import FeatureEngineer

# Show pipeline progress: which feature group is currently being created
logging.basicConfig(level=logging.INFO, format="%(message)s")

DATA_DIR = Path("../data")

engineer = FeatureEngineer(
    n_lags=12,
    rolling_windows=[6, 12, 24, 60],
    data_dir=DATA_DIR,
)

## 2. Load raw data and run pipeline

Load CSV inputs, then call `engineer.transform(...)`. Progress is logged so you can see each feature group as it is created. Use `include_prior_labels=False` (default) to avoid label leakage on holdout.

In [2]:
train_raw = pd.read_csv(DATA_DIR / "train_data.csv", parse_dates=["timestamp"])
test_raw = pd.read_csv(DATA_DIR / "test_data.csv", parse_dates=["timestamp"])
holdout_raw = pd.read_csv(DATA_DIR / "holdout_data.csv", parse_dates=["timestamp"])
patients = pd.read_csv(DATA_DIR / "patients.csv")

for name, df in [("train", train_raw), ("test", test_raw), ("holdout", holdout_raw)]:
    print(f"{name:8s}: {df.shape}  |  encounters: {df['encounter_id'].nunique():,}")
print(f"\npatients: {patients.shape}")

# include_prior_labels=False prevents label leakage on holdout
train, test, holdout, feature_cols = engineer.transform(
    train_raw, test_raw, holdout_raw, patients,
    include_prior_labels=False,
)
print(f"\nFeature pipeline complete: {len(feature_cols)} columns")

Creating features: impute_vitals (neighbour + median fallback)


train   : (2109600, 8)  |  encounters: 2,930
test    : (451440, 8)  |  encounters: 627
holdout : (452880, 7)  |  encounters: 629

patients: (4186, 17)


Creating features: lag (vital lags 1..n + warmup_progress)
Creating features: derivative (delta, delta_1s, accel per vital)
Creating features: rolling_stats (mean/std/min/max over lag window)
Creating features: multiscale_rolling (mean/std/min/max per window)
Creating features: derived_vitals (pulse_pressure, map, shock_index, hr_rr_ratio)
Creating features: temporal (minutes_into_encounter, hour/dow sin/cos)
Creating features: ecg (stats + FFT: dom_freq, LF/HF, spectral_entropy, hr_ecg_diff)
Creating features: patient (demographics, comorbidity, risk tier, OHE)
Creating features: clinical_alert (tachycardia, hypotension, spo2_low, n_active_alerts)
Creating features: interaction (vital x elderly/child/comorbidity/cardiac_meds)
Creating features: higher_order_derivatives (jerk per vital)



Feature pipeline complete: 283 columns


## 3. Validation

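Check that the resulting feature matrices have no missing values in the feature columns. In the shapes printed below, the 283 feature columns account for all but three columns in train and test and all but two in holdout; the remaining columns are not part of `feature_cols`.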

In [3]:
for name, df in [("train", train), ("test", test), ("holdout", holdout)]:
    n_missing = df[feature_cols].isna().sum().sum()
    assert n_missing == 0, f"{name}: {n_missing} NaNs in features"
    print(f"{name:8s}: shape={df.shape}  missing_feature_values={n_missing}")

train   : shape=(2109600, 286)  missing_feature_values=0
test    : shape=(451440, 286)  missing_feature_values=0
holdout : shape=(452880, 285)  missing_feature_values=0
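
The check above counts NaN only. Ratio features such as shock_index can also produce infinite values when the denominator is zero, and `isna()` does not flag those by default. The helper below is a hypothetical complement, not part of the pipeline, that reports both per column.

```python
import numpy as np
import pandas as pd


def summarize_bad_values(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Return per-column NaN and +/-inf counts for columns that have any."""
    feats = df[feature_cols]
    n_nan = feats.isna().sum()
    numeric = feats.select_dtypes(include="number")
    n_inf = np.isinf(numeric).sum().reindex(feature_cols, fill_value=0)
    report = pd.DataFrame({"n_nan": n_nan, "n_inf": n_inf})
    return report[(report["n_nan"] + report["n_inf"]) > 0]


# Toy frame with one infinite and one missing value.
toy = pd.DataFrame({
    "shock_index": [0.6, np.inf],        # e.g. heart rate divided by a zero SBP
    "heart_rate_lag_1": [np.nan, 80.0],
    "age_scaled": [0.1, -0.3],
})
print(summarize_bad_values(toy, list(toy.columns)))
```

An empty report means the selected columns contain neither NaN nor infinite values.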


## 4. Save

Save the feature column list to `feature_cols.pkl`, then write the train/test/holdout feature DataFrames to Parquet in `data/` (using fastparquet to avoid pyarrow compatibility issues).

In [11]:
# Persist the feature column list alongside the Parquet files
import pickle
with open(DATA_DIR / "feature_cols.pkl", "wb") as f:
    pickle.dump(feature_cols, f)


In [12]:
# Work around fastparquet + pandas Arrow string dtype: convert encounter_id to object
# so fastparquet can encode it (ArrowExtensionArray raises on copy=False)
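def _to_parquet_safe(df, path):
    """Write df to Parquet with fastparquet, casting encounter_id to object first."""
    df = df.copy()  # leave the caller's DataFrame untouched
    if "encounter_id" in df.columns:
        df["encounter_id"] = df["encounter_id"].astype(object)
    df.to_parquet(path, index=False, engine="fastparquet")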

_to_parquet_safe(train, DATA_DIR / "train_features.parquet")
_to_parquet_safe(test, DATA_DIR / "test_features.parquet")
_to_parquet_safe(holdout, DATA_DIR / "holdout_features.parquet")
print("Saved train_features.parquet, test_features.parquet, holdout_features.parquet")

Saved train_features.parquet, test_features.parquet, holdout_features.parquet
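
A downstream notebook would need to load the column list back before selecting features. The round trip below is a sketch that uses a temporary directory and a stand-in list rather than the real `data/` files. As with any pickle, only load files you produced yourself.

```python
import pickle
import tempfile
from pathlib import Path

feature_cols = ["heart_rate_lag_1", "shock_index", "hour_sin"]  # stand-in list

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_cols.pkl"
    with open(path, "wb") as f:
        pickle.dump(feature_cols, f)
    with open(path, "rb") as f:
        loaded_cols = pickle.load(f)

print(loaded_cols == feature_cols)

# With the real files, the feature matrix could then be selected along these lines:
#   X_train = pd.read_parquet(DATA_DIR / "train_features.parquet",
#                             engine="fastparquet")[loaded_cols]
```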
