# Feature Engineering Pipeline v2

Produces `train_features.parquet`, `test_features.parquet`, and `holdout_features.parquet` in `data/`. Uses `FeatureEngineer` from `utils.features` for the full pipeline.

### Improvements over v1

1. **Bug fix**: patient features are now properly joined to time-series rows (v1 had a missing merge)
2. **Missingness as signal**: `bmi_missing`, `pain_score_missing`, `reason_missing` indicators — EDA showed missingness itself correlates with outcome
3. **Comorbidity features**: parsed from `previous_medical_history` free-text — hypertension, diabetes, kidney disease, cardiac conditions, anemia, obesity. Comorbidity count showed a monotonic relationship with label 3 (6.6% at 0 → 29.1% at 6 comorbidities)
4. **Encounter description**: one-hot encoded — obstetric encounters had 0% label 3 vs 23% for ED patient visits
5. **Reason-for-visit risk tier**: grouped into high/medium/low risk by label 3 rate, e.g. myocardial infarction (25%), stroke (28%), gunshot (28%) vs normal pregnancy (0%)
6. **Age group flags**: `is_elderly` (≥65) and `is_child` (<18) — label 3 rate jumps from 3.6% in age 19–35 to 22.3% in 80+
7. **Marital status**: now included (chi-square test vs outcome, p = 0.000008)
8. **Medication flags**: patients on cardiac meds (metoprolol, nitroglycerin) had a 29% label 3 rate vs a 9% baseline
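
The patient-level items above (missingness flags, comorbidity parsing, age flags, medication flags) can be sketched as follows. This is a minimal illustration, not the code in `utils.features`: the keyword patterns, the `age` column name, and the output column names other than `bmi_missing`, `pain_score_missing`, `is_elderly` and `is_child` are assumptions. `reason_missing` would follow the same pattern as the other missingness flags.

```python
import numpy as np
import pandas as pd

# Hypothetical keyword patterns; the real lists in utils.features may differ.
COMORBIDITY_PATTERNS = {
    "has_hypertension": r"hypertension",
    "has_diabetes": r"diabet",
    "has_kidney_disease": r"kidney|renal",
    "has_cardiac": r"cardiac|coronary|heart",
    "has_anemia": r"anemia",
    "has_obesity": r"obes",
}
CARDIAC_MED_PATTERN = r"metoprolol|nitroglycerin"


def add_patient_features(patients: pd.DataFrame) -> pd.DataFrame:
    out = patients.copy()

    # Missingness indicators, computed before any imputation touches the columns.
    for col in ["bmi", "pain_score"]:
        out[f"{col}_missing"] = out[col].isna().astype(int)

    # Comorbidity flags from free text; missing history counts as no match.
    history = out["previous_medical_history"].fillna("")
    for flag, pattern in COMORBIDITY_PATTERNS.items():
        out[flag] = history.str.contains(pattern, case=False, regex=True).astype(int)
    out["comorbidity_count"] = out[list(COMORBIDITY_PATTERNS)].sum(axis=1)

    # Age group flags ("age" is an assumed column name).
    out["is_elderly"] = (out["age"] >= 65).astype(int)
    out["is_child"] = (out["age"] < 18).astype(int)

    # Cardiac medication flag from the free-text medication list.
    meds = out["current_medications"].fillna("")
    out["on_cardiac_meds"] = meds.str.contains(
        CARDIAC_MED_PATTERN, case=False, regex=True
    ).astype(int)
    return out


demo = pd.DataFrame({
    "age": [72, 10, 40],
    "bmi": [31.0, np.nan, 24.5],
    "pain_score": [np.nan, 2.0, np.nan],
    "previous_medical_history": [
        "Hypertension; chronic kidney disease", None, "Anemia (disorder)",
    ],
    "current_medications": ["Metoprolol 50 MG", None, None],
})
features = add_patient_features(demo)
print(features[["comorbidity_count", "is_elderly", "is_child", "on_cardiac_meds"]])
```

Substring matching is deliberately loose here; it would also match negated mentions such as "no history of diabetes", so a production version probably needs stricter parsing.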

### EDA-driven rationale for dropping or keeping features

| Column | Missing % | Decision | Reason |
|---|---|---|---|
| `known_allergies` | 85% | **Drop** | Too sparse, no significant association (p=0.45) |
| `previous_medications` | 67% | **Drop** | No association with outcome (p=0.70) |
| `bmi` | 65% | **Keep + missingness flag** | Marginal signal (p=0.07), but missingness pattern informative |
| `pain_score` | 64% | **Keep + missingness flag** | Same pattern as `bmi`: value is a weak signal, missingness is informative |
| `current_medications` | 59% | **Keep as flags** | Cardiac med keywords strongly predict deterioration |
| `race` | 0% | **Keep (OHE)** | No significant signal (p=0.63) but kept for completeness |
| `ethnicity` | 0% | **Keep (OHE)** | No signal (p=0.92) but zero cost |
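
The table does not say which test produced its p-values. Assuming a chi-square test of independence (the test named for marital status above), the check for a binary flag against a binarized outcome could look like the sketch below. The helper name `chi_square_2x2`, the toy counts, and the `label_is_3` binarization are illustrative assumptions. It uses the Pearson statistic without continuity correction, and the closed-form p-value is valid only for a 2x2 table (1 degree of freedom).

```python
import math

import pandas as pd


def chi_square_2x2(flag: pd.Series, outcome: pd.Series) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for two binary series."""
    observed = pd.crosstab(flag, outcome).to_numpy(dtype=float)
    if observed.shape != (2, 2):
        raise ValueError("both series must take exactly two distinct values")
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: outer product of the margins over N.
    expected = row_totals @ col_totals / observed.sum()
    stat = float(((observed - expected) ** 2 / expected).sum())
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data: 40 rows with the flag unset, 60 rows with it set.
flag = pd.Series([0] * 40 + [1] * 60, name="bmi_missing")
outcome = pd.Series([0] * 30 + [1] * 10 + [0] * 20 + [1] * 40, name="label_is_3")
stat, p_value = chi_square_2x2(flag, outcome)
print(f"chi2={stat:.3f}  p={p_value:.2e}")
```

For columns with more than two categories (such as `race`), a general-purpose routine is needed instead, since the degrees of freedom change.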

> **Leakage guard**: all scalers and encoders are **fit on train encounters only**, then applied unchanged to test and holdout.
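
The leakage guard amounts to learning statistics from the training split and reusing them everywhere else. A minimal sketch with a hand-rolled standardizer follows; `TrainOnlyScaler` is a hypothetical name, and the scaler types `FeatureEngineer` actually uses are not shown in this notebook.

```python
import pandas as pd


class TrainOnlyScaler:
    """Standardize columns using statistics learned from the training split only."""

    def fit(self, train: pd.DataFrame, cols: list[str]) -> "TrainOnlyScaler":
        self.cols_ = list(cols)
        self.mean_ = train[self.cols_].mean()
        # Replace a zero std with 1 so constant columns do not divide by zero.
        self.std_ = train[self.cols_].std(ddof=0).replace(0.0, 1.0)
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        # Test and holdout rows are scaled with the *train* mean and std.
        out[self.cols_] = (df[self.cols_] - self.mean_) / self.std_
        return out


train_demo = pd.DataFrame({"heart_rate": [60.0, 80.0, 100.0]})
test_demo = pd.DataFrame({"heart_rate": [80.0, 120.0]})

scaler = TrainOnlyScaler().fit(train_demo, ["heart_rate"])
train_scaled = scaler.transform(train_demo)
test_scaled = scaler.transform(test_demo)
print(test_scaled)
```

Note that the scaled test column is not expected to have zero mean or unit variance; that is the point of fitting on train only.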

### medhack-frontiers improvements (added)

9. **Shock index & HR:RR ratio**: hemodynamic instability markers (shock_index = HR/SBP, hr_rr_ratio = HR/RR)
10. **Multi-scale rolling**: 4 windows (6, 12, 24, 60 steps ≈ 30 s to 5 min, i.e. about 5 s per step) for mean/std/min/max per vital
11. **ECG features**: basic stats plus FFT features (dom_freq, LF and HF power, LF/HF ratio, spectral entropy) and hr_ecg_diff
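
The ratio and spectral features above can be sketched as below. The column names (`heart_rate`, `systolic_bp`, `respiratory_rate`) and the ECG sample rate are assumptions. LF/HF band power is left out because the band edges used by the pipeline are not stated here; only `dom_freq` and spectral entropy are shown.

```python
import numpy as np
import pandas as pd


def add_hemodynamic_ratios(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # A zero denominator becomes NaN instead of producing inf.
    out["shock_index"] = out["heart_rate"] / out["systolic_bp"].replace(0, np.nan)
    out["hr_rr_ratio"] = out["heart_rate"] / out["respiratory_rate"].replace(0, np.nan)
    return out


def spectral_features(signal: np.ndarray, sample_rate_hz: float) -> dict[str, float]:
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()  # remove the DC offset so it does not dominate the spectrum
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate_hz)
    power, freqs = power[1:], freqs[1:]  # drop the zero-frequency bin
    total = power.sum()
    if total == 0:
        return {"dom_freq": 0.0, "spectral_entropy": 0.0}
    p = power / total
    nonzero = p[p > 0]
    # Shannon entropy (in bits) of the normalized power spectrum.
    entropy = float(-(nonzero * np.log2(nonzero)).sum())
    return {"dom_freq": float(freqs[np.argmax(power)]), "spectral_entropy": entropy}


vitals = pd.DataFrame({
    "heart_rate": [90, 110],
    "systolic_bp": [120, 0],
    "respiratory_rate": [18, 22],
})
ratios = add_hemodynamic_ratios(vitals)

# Toy signal: a pure 5 Hz sine sampled at an assumed 100 Hz for 2 seconds.
t = np.arange(200) / 100.0
spec = spectral_features(np.sin(2 * np.pi * 5.0 * t), sample_rate_hz=100.0)
print(ratios[["shock_index", "hr_rr_ratio"]])
print(spec)
```

A pure tone concentrates power in one bin, so its spectral entropy is close to zero; noisier signals spread power across bins and score higher.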

## 1. Setup

In [43]:
import sys
from pathlib import Path

# Ensure notebooks/ is on path (Jupyter cwd is usually the notebook directory)
_nb_dir = Path.cwd()
if str(_nb_dir) not in sys.path:
    sys.path.insert(0, str(_nb_dir))

In [49]:
import importlib.util
from pathlib import Path

import pandas as pd

# Load FeatureEngineer directly from utils/features.py, bypassing utils/__init__.py
# (avoids utils package init issues)
_nb_dir = Path.cwd()
_feat_path = _nb_dir / "utils" / "features.py"
_spec = importlib.util.spec_from_file_location("utils.features", _feat_path)
if _spec is None or _spec.loader is None:
    raise ImportError(f"Cannot load FeatureEngineer from {_feat_path}")
_feat = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_feat)
FeatureEngineer = _feat.FeatureEngineer

DATA_DIR = Path("../data")

engineer = FeatureEngineer(
    n_lags=12,
    rolling_windows=[6, 12, 24, 60],
    data_dir=DATA_DIR,
)
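
How `FeatureEngineer` uses `n_lags` and `rolling_windows` internally is not shown in this notebook. As a rough sketch of what per-encounter lag and multi-scale rolling features of this shape might look like (the helper name and output column names are hypothetical):

```python
import pandas as pd


def add_lag_and_rolling(
    df: pd.DataFrame, vital: str, n_lags: int, windows: list[int]
) -> pd.DataFrame:
    out = df.sort_values(["encounter_id", "timestamp"]).copy()
    # Group by encounter so lags and windows never cross encounter boundaries.
    grouped = out.groupby("encounter_id")[vital]
    for lag in range(1, n_lags + 1):
        out[f"{vital}_lag{lag}"] = grouped.shift(lag)
    for w in windows:
        roll = grouped.rolling(window=w, min_periods=1)
        for stat in ("mean", "std", "min", "max"):
            values = getattr(roll, stat)()
            # groupby.rolling returns an (encounter_id, original index) MultiIndex;
            # dropping the group level realigns the result with `out`.
            out[f"{vital}_roll{w}_{stat}"] = values.reset_index(level=0, drop=True)
    return out


demo = pd.DataFrame({
    "encounter_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.Timestamp("2024-01-01") + pd.to_timedelta([0, 5, 10, 0, 5], unit="s"),
    "heart_rate": [60.0, 62.0, 64.0, 100.0, 90.0],
})
result = add_lag_and_rolling(demo, "heart_rate", n_lags=1, windows=[2])
print(result)
```

Lag columns and single-row `std` windows start out as NaN at the beginning of each encounter, so some fill strategy is needed before the zero-NaN validation in section 3 can pass; which one the pipeline uses is not stated here.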

## 2. Load Raw Data & Run Pipeline

In [50]:
train_raw = pd.read_csv(DATA_DIR / "train_data.csv", parse_dates=["timestamp"])
test_raw = pd.read_csv(DATA_DIR / "test_data.csv", parse_dates=["timestamp"])
holdout_raw = pd.read_csv(DATA_DIR / "holdout_data.csv", parse_dates=["timestamp"])
patients = pd.read_csv(DATA_DIR / "patients.csv")

for name, df in [("train", train_raw), ("test", test_raw), ("holdout", holdout_raw)]:
    print(f"{name:8s}: {df.shape}  |  encounters: {df['encounter_id'].nunique():,}")
print(f"\npatients: {patients.shape}")

train, test, holdout, feature_cols = engineer.transform(
    train_raw, test_raw, holdout_raw, patients
)
print(f"\nFeature pipeline complete: {len(feature_cols)} columns")

train   : (2109600, 8)  |  encounters: 2,930
test    : (451440, 8)  |  encounters: 627
holdout : (452880, 7)  |  encounters: 629

patients: (4186, 17)

Feature pipeline complete: 256 columns
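
The patient table has 4,186 rows, which equals 2,930 + 627 + 629 encounters, consistent with one patient row per encounter. Given that v1 silently skipped the patient merge, a guarded join is worth sketching. The join key `encounter_id` and the helper name are assumptions; the actual merge lives inside `FeatureEngineer`.

```python
import pandas as pd


def join_patient_features(
    rows: pd.DataFrame, patients: pd.DataFrame, key: str = "encounter_id"
) -> pd.DataFrame:
    # validate="many_to_one" raises if the patient table has duplicate keys,
    # which would otherwise multiply time-series rows.
    merged = rows.merge(
        patients, on=key, how="left", validate="many_to_one", indicator=True
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    if unmatched:
        raise ValueError(f"{unmatched} time-series rows have no matching patient record")
    return merged.drop(columns="_merge")


rows = pd.DataFrame({"encounter_id": [1, 1, 2], "heart_rate": [80, 82, 95]})
patients_demo = pd.DataFrame({"encounter_id": [1, 2], "age": [70, 34]})
joined = join_patient_features(rows, patients_demo)
print(joined)

# Dropping a patient record makes the guard fail loudly instead of yielding NaNs.
try:
    join_patient_features(rows, patients_demo.iloc[:1])
    raised = False
except ValueError:
    raised = True
```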


## 3. Validation

In [51]:
for name, df in [("train", train), ("test", test), ("holdout", holdout)]:
    n_missing = df[feature_cols].isna().sum().sum()
    assert n_missing == 0, f"{name}: {n_missing} NaNs in features"
    print(f"{name:8s}: shape={df.shape}  missing_feature_values={n_missing}")

train   : shape=(2109600, 259)  missing_feature_values=0
test    : shape=(451440, 259)  missing_feature_values=0
holdout : shape=(452880, 258)  missing_feature_values=0
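
Two further checks may be worth adding. `isna()` does not flag infinite values, which ratio features such as `shock_index` can produce when a denominator is zero, and the leakage guard assumes the splits share no encounters. A sketch with hypothetical helper names and toy frames:

```python
import numpy as np
import pandas as pd


def count_infinite(df: pd.DataFrame, feature_cols: list[str]) -> int:
    """Number of +/-inf entries among the feature columns."""
    return int(np.isinf(df[feature_cols].to_numpy(dtype=float)).sum())


def shared_encounters(a: pd.DataFrame, b: pd.DataFrame, key: str = "encounter_id") -> set:
    """Encounter ids present in both frames; should be empty for clean splits."""
    return set(a[key]) & set(b[key])


train_demo = pd.DataFrame({"encounter_id": [1, 1, 2], "shock_index": [0.7, np.inf, 0.9]})
test_demo = pd.DataFrame({"encounter_id": [2, 3], "shock_index": [0.8, 0.6]})

n_nan = int(train_demo[["shock_index"]].isna().sum().sum())
n_inf = count_infinite(train_demo, ["shock_index"])
overlap = shared_encounters(train_demo, test_demo)
print(f"NaNs={n_nan}  infs={n_inf}  shared encounters={overlap}")
```

On the real frames this would be used as `assert count_infinite(train, feature_cols) == 0` and `assert not shared_encounters(train, test)`.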


## 4. Save

In [None]:
# Use fastparquet to avoid pyarrow 23.x compatibility issue with pandas
train.to_parquet(DATA_DIR / "train_features.parquet", index=False, engine="fastparquet")
test.to_parquet(DATA_DIR / "test_features.parquet", index=False, engine="fastparquet")
holdout.to_parquet(DATA_DIR / "holdout_features.parquet", index=False, engine="fastparquet")
print("Saved train_features.parquet, test_features.parquet, holdout_features.parquet")