# Feature Engineering Pipeline v2

Produces `train_features.csv`, `test_features.csv`, and `holdout_features.csv` in `data/`.

### Improvements over v1

1. **Bug fix**: patient features are now properly joined to time-series rows (v1 had a missing merge)
2. **Missingness as signal**: `bmi_missing`, `pain_score_missing`, `reason_missing` indicators — EDA showed missingness itself correlates with outcome
3. **Comorbidity features**: parsed from `previous_medical_history` free-text — hypertension, diabetes, kidney disease, cardiac conditions, anemia, obesity. Comorbidity count showed a monotonic relationship with label 3 (6.6% at 0 → 29.1% at 6 comorbidities)
4. **Encounter description**: one-hot encoded — obstetric encounters had 0% label 3 vs 23% for ED patient visits
5. **Reason-for-visit risk tier**: grouped into high/medium/low risk — myocardial infarction (25%), stroke (28%), gunshot (28%) vs normal pregnancy (0%)
6. **Age group flags**: `is_elderly` (≥65) and `is_child` (<18) — label 3 rate jumps from 3.6% in age 19–35 to 22.3% in 80+
7. **Marital status**: now included (chi-square test vs outcome, p=0.000008)
8. **Medication flags**: patients on cardiac meds (metoprolol, nitroglycerin) show a 29% label 3 rate vs a 9% baseline

### EDA-driven rationale for dropping or keeping features

| Column | Missing % | Decision | Reason |
|---|---|---|---|
| `known_allergies` | 85% | **Drop** | Too sparse, no significant association (p=0.45) |
| `previous_medications` | 67% | **Drop** | No association with outcome (p=0.70) |
| `bmi` | 65% | **Keep + missingness flag** | Marginal signal (p=0.07), but missingness pattern informative |
| `pain_score` | 64% | **Keep + missingness flag** | Similar: value weak, missingness informative |
| `current_medications` | 59% | **Keep as flags** | Cardiac med keywords strongly predict deterioration |
| `race` | 0% | **Keep (OHE)** | No significant signal (p=0.63) but kept for completeness |
| `ethnicity` | 0% | **Keep (OHE)** | No signal (p=0.92) but zero cost |

> **Leakage guard**: all scalers and encoders are **fit on train encounters only** then applied to test and holdout.
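
The leakage guard can be illustrated with a minimal sketch. The toy frames and the `'U'` category below are made up for illustration, and the actual fitting happens inside `utils.features`, which is not shown in this notebook; the sketch only demonstrates the fit-on-train, transform-elsewhere pattern.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy encounter-level frames; all values are illustrative only.
train_pat = pd.DataFrame({'age': [30.0, 50.0, 70.0], 'gender': ['F', 'M', 'F']})
test_pat = pd.DataFrame({'age': [90.0], 'gender': ['U']})  # 'U' never appears in train

# Fit on train only: mean/std and category vocabulary come from train rows.
scaler = StandardScaler().fit(train_pat[['age']])
encoder = OneHotEncoder(handle_unknown='ignore').fit(train_pat[['gender']])

# Apply to test without refitting.
test_age_scaled = scaler.transform(test_pat[['age']])
test_gender_ohe = encoder.transform(test_pat[['gender']]).toarray()

print(scaler.mean_)       # train mean of age
print(test_gender_ohe)    # unseen category encodes as an all-zero row
```

With `handle_unknown='ignore'`, a category that appears only in test or holdout becomes an all-zero row instead of raising an error, so the column layout stays identical across splits.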

## 1. Setup

In [14]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from utils.features import (
    add_lag_features_with_imputation,
    add_spectral_features,
    build_patient_features,
    join_patient_features,
)

DATA_DIR = Path('../data')

VITAL_COLS = [
    'heart_rate', 'systolic_bp', 'diastolic_bp',
    'respiratory_rate', 'oxygen_saturation'
]
N_LAGS = 12  # 12 lags × 5 s cadence = 60-second lookback window
N_LAGS_SPECTRAL = 36  # 180s window for signal processing (see spectral_features_eda.ipynb)
SAMPLE_RATE_HZ = 0.2  # 5s cadence

## 2. Load Raw Data

In [15]:
train_raw   = pd.read_csv(DATA_DIR / 'train_data.csv',   parse_dates=['timestamp'])
test_raw    = pd.read_csv(DATA_DIR / 'test_data.csv',    parse_dates=['timestamp'])
holdout_raw = pd.read_csv(DATA_DIR / 'holdout_data.csv', parse_dates=['timestamp'])
patients    = pd.read_csv(DATA_DIR / 'patients.csv')

for name, df in [('train', train_raw), ('test', test_raw), ('holdout', holdout_raw)]:
    print(f"{name:8s}: {df.shape}  |  encounters: {df['encounter_id'].nunique():,}")

print(f"\npatients: {patients.shape}")
print(f"\nVital missing values (train raw):")
print(train_raw[VITAL_COLS].isna().sum())

train   : (2109600, 8)  |  encounters: 2,930
test    : (451440, 8)  |  encounters: 627
holdout : (452880, 7)  |  encounters: 629

patients: (4186, 17)

Vital missing values (train raw):
heart_rate           42252
systolic_bp          42197
diastolic_bp         42320
respiratory_rate     42278
oxygen_saturation    42618
dtype: int64


## 3. Vital Sign Imputation

**Strategy** (applied per encounter, in order):
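1. **Neighbour interpolation**: fill each gap with the mean of the readings at the adjacent timestamps (before and after); if only one neighbour is observed, that value is used on its own
2. **Encounter-median fallback**: for remaining gaps, i.e. readings whose adjacent neighbours are both missing or absent (such as the interior of a run of three or more missing values)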

In [16]:
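def impute_vitals(df: pd.DataFrame, vital_cols: list) -> pd.DataFrame:
    """Impute missing vital signs within each encounter.

    Gaps are filled with the mean of the adjacent readings first, then any
    remaining gaps with the encounter median of that vital.
    """
    df = df.sort_values(['encounter_id', 'timestamp']).reset_index(drop=True)
    out = df.copy()
    for col in vital_cols:
        # Neighbours come from the original (pre-fill) values of this column
        prev = out.groupby('encounter_id')[col].shift(1)
        nxt  = out.groupby('encounter_id')[col].shift(-1)
        # Row-wise mean skips NaN, so a single observed neighbour is used as-is
        neighbour_mean = pd.concat([prev, nxt], axis=1).mean(axis=1)
        out[col] = out[col].fillna(neighbour_mean)
        # Fallback: encounter median for gaps with no observed neighbour
        out[col] = out.groupby('encounter_id')[col].transform(
            lambda x: x.fillna(x.median())
        )
    return out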


train   = impute_vitals(train_raw,   VITAL_COLS)
test    = impute_vitals(test_raw,    VITAL_COLS)
holdout = impute_vitals(holdout_raw, VITAL_COLS)

for name, df in [('train', train), ('test', test), ('holdout', holdout)]:
    assert df[VITAL_COLS].isna().sum().sum() == 0, f"{name}: vitals still have NaNs"
print("Vital imputation complete — no missing values remain.")

Vital imputation complete — no missing values remain.
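
To make the two-step strategy concrete, here is a small sketch that replays the same operations on one made-up encounter. The readings are invented for illustration; the logic mirrors `impute_vitals` for a single column.

```python
import numpy as np
import pandas as pd

# One toy encounter with gaps at the start, in the middle, and a run of three at the end.
toy = pd.DataFrame({
    'encounter_id': ['A'] * 7,
    'heart_rate': [np.nan, 80.0, np.nan, 84.0, np.nan, np.nan, np.nan],
})

grp = toy.groupby('encounter_id')['heart_rate']
prev, nxt = grp.shift(1), grp.shift(-1)

# Step 1: mean of the observed neighbours (NaN neighbours are skipped).
neighbour_mean = pd.concat([prev, nxt], axis=1).mean(axis=1)
step1 = toy['heart_rate'].fillna(neighbour_mean)

# Step 2: encounter median for gaps that had no observed neighbour.
step2 = step1.groupby(toy['encounter_id']).transform(lambda x: x.fillna(x.median()))

print(step1.tolist())
print(step2.tolist())
```

In this toy case the first row is filled from its single observed neighbour, while the last two rows have no observed neighbour and fall back to the encounter median, which is computed after step 1.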


## 4. Lag Features

Each vital gets 12 lags (lag1 = 5s ago … lag12 = 60s ago), computed within each encounter.
The leading rows of each encounter, where a lag reaches back before the first reading, are filled with the encounter median.

In [17]:
import time
t0 = time.perf_counter()
train, lag_cols   = add_lag_features_with_imputation(train,   VITAL_COLS, N_LAGS)
test, _           = add_lag_features_with_imputation(test,    VITAL_COLS, N_LAGS)
holdout, _        = add_lag_features_with_imputation(holdout, VITAL_COLS, N_LAGS)
elapsed = time.perf_counter() - t0

for name, df in [('train', train), ('test', test), ('holdout', holdout)]:
    assert df[lag_cols].isna().sum().sum() == 0, f"{name}: lag cols still have NaNs"
print(f"Lag features added: {len(lag_cols)} columns ({len(VITAL_COLS)} vitals × {N_LAGS} lags) — {elapsed:.1f}s")

Lag features added: 60 columns (5 vitals × 12 lags) — 14.2s
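
`add_lag_features_with_imputation` lives in `utils.features` and is not shown here. The sketch below is a hypothetical stand-in (`lag_sketch` is an invented name) that follows the description above: per-encounter shifts, with leading NaNs replaced by the encounter median. The real helper may differ in its details.

```python
import pandas as pd

def lag_sketch(df, cols, n_lags):
    """Hypothetical stand-in for the utils.features helper (an assumption)."""
    out = df.sort_values(['encounter_id', 'timestamp']).reset_index(drop=True)
    lag_cols = []
    for col in cols:
        grp = out.groupby('encounter_id')[col]
        median = grp.transform('median')  # per-row encounter median
        for k in range(1, n_lags + 1):
            name = f'{col}_lag{k}'
            # shift(k) never crosses encounter boundaries; leading NaNs get the median
            out[name] = grp.shift(k).fillna(median)
            lag_cols.append(name)
    return out, lag_cols

toy = pd.DataFrame({
    'encounter_id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime([
        '2024-01-01 00:00:00', '2024-01-01 00:00:05', '2024-01-01 00:00:10',
        '2024-01-01 00:00:00', '2024-01-01 00:00:05',
    ]),
    'heart_rate': [70.0, 74.0, 90.0, 100.0, 110.0],
})
lagged, lag_names = lag_sketch(toy, ['heart_rate'], n_lags=2)
print(lagged[['encounter_id', 'heart_rate'] + lag_names])
```

Note that the first row of encounter B takes B's median (105) rather than the last reading of encounter A, which is the point of grouping before shifting.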


## 5. Derivative Features

| Feature | Formula | Clinical meaning |
|---|---|---|
| `{v}_delta` | current − lag12 | 60-second trend |
| `{v}_delta_1s` | current − lag1 | Immediate velocity over one 5 s step (not 1 s, despite the suffix) |
| `{v}_accel` | current − 2·lag1 + lag2 | Acceleration (second difference) |

In [18]:
def add_derivative_features(df: pd.DataFrame, vital_cols: list, n_lags: int) -> pd.DataFrame:
    out = df.copy()
    for v in vital_cols:
        out[f'{v}_delta']    = out[v] - out[f'{v}_lag{n_lags}']
        out[f'{v}_delta_1s'] = out[v] - out[f'{v}_lag1']
        out[f'{v}_accel']    = out[v] - 2 * out[f'{v}_lag1'] + out[f'{v}_lag2']
    return out


train   = add_derivative_features(train,   VITAL_COLS, N_LAGS)
test    = add_derivative_features(test,    VITAL_COLS, N_LAGS)
holdout = add_derivative_features(holdout, VITAL_COLS, N_LAGS)

deriv_cols = [f'{v}{s}' for v in VITAL_COLS for s in ('_delta', '_delta_1s', '_accel')]
print(f"Derivative features added: {len(deriv_cols)} columns")

Derivative features added: 15 columns
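
As a worked check of the formulas in the table, the sketch below uses three made-up heart-rate readings and confirms that the acceleration term equals the second finite difference, i.e. the change in velocity between two consecutive 5 s steps.

```python
import numpy as np

# Illustrative readings, oldest first: lag2, lag1, current.
hr = np.array([80.0, 84.0, 92.0])
lag2, lag1, current = hr

delta_1s = current - lag1             # velocity over the latest step
accel = current - 2 * lag1 + lag2     # formula from the table
velocity_change = (current - lag1) - (lag1 - lag2)

print(delta_1s, accel, velocity_change, np.diff(hr, n=2)[0])
```

Here the heart rate rose by 4 and then by 8, so the velocity is 8 and the acceleration is 4: the rise itself is speeding up.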


## 6. Rolling Window Statistics

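Mean, std, min, and max over the 13-value window [lag12 … current], i.e. the 12 lags plus the current reading. The std is the population standard deviation (`ddof=0`).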

In [19]:
def add_rolling_stats(df: pd.DataFrame, vital_cols: list, n_lags: int) -> pd.DataFrame:
    out = df.copy()
    for v in vital_cols:
        window_cols = [f'{v}_lag{i}' for i in range(n_lags, 0, -1)] + [v]
        window = out[window_cols]
        out[f'{v}_mean'] = window.mean(axis=1)
        out[f'{v}_std']  = window.std(axis=1, ddof=0)
        out[f'{v}_min']  = window.min(axis=1)
        out[f'{v}_max']  = window.max(axis=1)
    return out


train   = add_rolling_stats(train,   VITAL_COLS, N_LAGS)
test    = add_rolling_stats(test,    VITAL_COLS, N_LAGS)
holdout = add_rolling_stats(holdout, VITAL_COLS, N_LAGS)

rolling_cols = [f'{v}{s}' for v in VITAL_COLS for s in ('_mean', '_std', '_min', '_max')]
print(f"Rolling stats added: {len(rolling_cols)} columns")

Rolling stats added: 20 columns
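
The `ddof=0` choice in `add_rolling_stats` matters on short windows. The sketch below uses a made-up three-value window (hypothetical `hr` columns) to show how the population and sample standard deviations differ when computed row-wise.

```python
import pandas as pd

# One toy row with a 3-value window: lag2, lag1, current.
row = pd.DataFrame({'hr_lag2': [60.0], 'hr_lag1': [62.0], 'hr': [64.0]})
window = row[['hr_lag2', 'hr_lag1', 'hr']]

pop_std = window.std(axis=1, ddof=0).iloc[0]     # divides by n
sample_std = window.std(axis=1, ddof=1).iloc[0]  # divides by n - 1 (pandas default)

print(round(pop_std, 4), sample_std)
```

pandas defaults to `ddof=1`, so passing `ddof=0` explicitly, as the notebook does, is what selects the population formula.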


## 7. Derived Vital Signs

| Feature | Formula | Clinical meaning |
|---|---|---|
| `pulse_pressure` | SBP − DBP | Vasodilation (widened) vs poor cardiac output (narrowed) |
| `map` | (SBP + 2·DBP) / 3 | MAP < 65 mmHg = hypotension/shock |
| `pulse_pressure_delta` | PP − PP_lag12 | PP trajectory over 60s |
| `map_delta` | MAP − MAP_lag12 | MAP trajectory over 60s |

In [20]:
def add_derived_vitals(df: pd.DataFrame, n_lags: int) -> pd.DataFrame:
    out = df.copy()
    out['pulse_pressure'] = out['systolic_bp'] - out['diastolic_bp']
    out['map']            = (out['systolic_bp'] + 2 * out['diastolic_bp']) / 3

    pp_lagN  = out[f'systolic_bp_lag{n_lags}'] - out[f'diastolic_bp_lag{n_lags}']
    map_lagN = (out[f'systolic_bp_lag{n_lags}'] + 2 * out[f'diastolic_bp_lag{n_lags}']) / 3

    out['pulse_pressure_delta'] = out['pulse_pressure'] - pp_lagN
    out['map_delta']            = out['map'] - map_lagN
    return out


train   = add_derived_vitals(train,   N_LAGS)
test    = add_derived_vitals(test,    N_LAGS)
holdout = add_derived_vitals(holdout, N_LAGS)

derived_cols = ['pulse_pressure', 'map', 'pulse_pressure_delta', 'map_delta']
print(f"Derived vital features added: {len(derived_cols)} columns")

Derived vital features added: 4 columns
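
A quick worked example of the two formulas, using illustrative blood-pressure readings rather than values from the dataset. The helper names are invented for this sketch.

```python
def pulse_pressure(sbp: float, dbp: float) -> float:
    """Pulse pressure: systolic minus diastolic."""
    return sbp - dbp

def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    """MAP approximation used in this notebook: (SBP + 2*DBP) / 3."""
    return (sbp + 2 * dbp) / 3

# Illustrative readings: a typical adult pressure and a hypotensive one.
normal_pp = pulse_pressure(120, 80)
normal_map = mean_arterial_pressure(120, 80)
low_map = mean_arterial_pressure(80, 50)

print(normal_pp, round(normal_map, 1), low_map)
```

The second reading gives a MAP of 60, which falls under the 65 threshold quoted in the table.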


## 8. Temporal Features

In [21]:
def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    enc_start = out.groupby('encounter_id')['timestamp'].transform('min')
    out['minutes_into_encounter'] = (out['timestamp'] - enc_start).dt.total_seconds() / 60

    hour = out['timestamp'].dt.hour
    out['hour_sin'] = np.sin(2 * np.pi * hour / 24)
    out['hour_cos'] = np.cos(2 * np.pi * hour / 24)

    dow = out['timestamp'].dt.dayofweek
    out['dow_sin'] = np.sin(2 * np.pi * dow / 7)
    out['dow_cos'] = np.cos(2 * np.pi * dow / 7)
    return out


train   = add_temporal_features(train)
test    = add_temporal_features(test)
holdout = add_temporal_features(holdout)

temporal_cols = ['minutes_into_encounter', 'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos']
print(f"Temporal features added: {len(temporal_cols)} columns")

Temporal features added: 5 columns
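
The sin/cos pairs exist so that the model sees 23:00 and 00:00 as neighbours rather than as 23 units apart. A minimal sketch of that property, with a helper name (`encode_hour`) invented for illustration:

```python
import numpy as np

def encode_hour(hour: float) -> np.ndarray:
    """Map an hour of day onto the unit circle, as in add_temporal_features."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distances in the encoded space.
d_wrap = np.linalg.norm(encode_hour(23) - encode_hour(0))      # across midnight
d_adjacent = np.linalg.norm(encode_hour(1) - encode_hour(0))   # ordinary 1 h step
d_far = np.linalg.norm(encode_hour(12) - encode_hour(0))       # opposite side of the day

print(round(d_wrap, 3), round(d_adjacent, 3), round(d_far, 3))
```

The step across midnight has the same encoded distance as any other one-hour step, while noon and midnight sit at opposite points of the circle. The same reasoning applies to the day-of-week pair with a period of 7.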


## 9. Encounter Prior Label Features

All four features use `shift(1)`, so a row's own label never feeds its own features.
Holdout has no `label` column, so its prior-label features are all set to 0.

In [22]:
PRIOR_LABEL_COLS = ['prior_label', 'max_label_last_60s', 'max_label_encounter', 'ever_deteriorated']


def add_encounter_prior_label_features(df: pd.DataFrame, n_lags: int) -> pd.DataFrame:
    out = df.sort_values(['encounter_id', 'timestamp']).reset_index(drop=True).copy()

    if 'label' not in df.columns:
        for col in PRIOR_LABEL_COLS:
            out[col] = 0.0
        print("  No 'label' column — prior label features set to 0 (holdout mode).")
        return out

    out['prior_label'] = out.groupby('encounter_id')['label'].shift(1).fillna(0)

    out['max_label_last_60s'] = (
        out.groupby('encounter_id')['label']
        .transform(lambda x: x.shift(1).rolling(n_lags, min_periods=1).max())
        .fillna(0)
    )

    out['max_label_encounter'] = (
        out.groupby('encounter_id')['label']
        .transform(lambda x: x.shift(1).expanding().max())
        .fillna(0)
    )

    out['ever_deteriorated'] = (out['max_label_encounter'] > 0).astype(float)
    return out


train   = add_encounter_prior_label_features(train,   N_LAGS)
test    = add_encounter_prior_label_features(test,    N_LAGS)
holdout = add_encounter_prior_label_features(holdout, N_LAGS)

print(f"Prior label features added: {PRIOR_LABEL_COLS}")
if 'label' in train.columns:
    first_rows = train.groupby('encounter_id').head(1)
    assert (first_rows['prior_label'] == 0).all()
    print("  Sanity check passed: first row of each encounter has prior_label=0")

  No 'label' column — prior label features set to 0 (holdout mode).
Prior label features added: ['prior_label', 'max_label_last_60s', 'max_label_encounter', 'ever_deteriorated']
  Sanity check passed: first row of each encounter has prior_label=0
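
The effect of `shift(1)` is easiest to see on a toy label sequence. The labels below are made up; the operations are the same as in `add_encounter_prior_label_features` for two of the four columns.

```python
import pandas as pd

# Two toy encounters; encounter A deteriorates (label 3) at its third row.
toy = pd.DataFrame({
    'encounter_id': ['A'] * 5 + ['B'] * 2,
    'label': [0, 0, 3, 0, 0, 1, 0],
})

grp = toy.groupby('encounter_id')['label']
toy['prior_label'] = grp.shift(1).fillna(0)
toy['max_label_encounter'] = (
    grp.transform(lambda x: x.shift(1).expanding().max()).fillna(0)
)

print(toy)
```

The row with label 3 still has `prior_label` and `max_label_encounter` equal to 0; the deterioration only becomes visible to the features from the following row onward, and nothing carries over from encounter A into encounter B.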


## 10. Patient Features (v2 — EDA-informed)

### Key EDA findings that shaped this section

**Strong predictors** (p < 0.001 via t-test or chi-square):
- `age`: mean 46.5 (no label 3) vs 63.7 (label 3), t=-12.33
- `gender`: M = 16.2% label 3, F = 5.6%
- `encounter_description`: obstetric = 0% label 3, ED patient visit = 23.1%
- `reason_for_visit`: MI = 25.3%, stroke = 27.8%, normal pregnancy = 0%
- `marital_status`: chi2=29.04, p=0.000008
- Comorbidity count: 6.6% at 0 → 29.1% at 6 comorbidities
- Cardiac medications (metoprolol, nitroglycerin): 29% label 3 vs 9% baseline

**Missingness is informative**:
- `reason_for_visit` missing → 20% label 3 vs 9.2% when present (older, sicker patients)
- `current_medications` present → 13.3% label 3 vs 8.7% (patients on meds are sicker)
- `previous_medical_history` present → 11.8% vs 7.0%

**Dropped** (not worth the noise):
- `known_allergies`: 85% missing, p=0.45
- `previous_medications`: 67% missing, p=0.70
- `date_of_birth`: redundant with age
- `patient_name`: synthetic identifier
- `encounter_class`: single value (emergency)

### 10a. Build and join patient features (utils.features)

In [23]:
train_encounter_ids = set(train['encounter_id'].unique())
patients, patient_feature_cols = build_patient_features(patients, train_encounter_ids)

train   = join_patient_features(train,   patients, patient_feature_cols)
test    = join_patient_features(test,    patients, patient_feature_cols)
holdout = join_patient_features(holdout, patients, patient_feature_cols)

print(f"Patient features: {len(patient_feature_cols)} columns")
for name, df in [('train', train), ('test', test), ('holdout', holdout)]:
    n_missing = df[patient_feature_cols].isna().sum().sum()
    print(f"  {name:8s}: {df.shape}  patient_feature_NaNs={n_missing}")

Patient features: 45 columns
  train   : (2109600, 161)  patient_feature_NaNs=0
  test    : (451440, 161)  patient_feature_NaNs=0
  holdout : (452880, 160)  patient_feature_NaNs=0
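
`build_patient_features` is defined in `utils.features` and not shown in this notebook. As a rough illustration of two of the ideas listed in the improvements (missingness indicators and keyword-based comorbidity flags), here is a sketch on made-up patient rows. The keyword map, flag names, and the `comorbidity_count` column name are assumptions for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical keyword map; the real parsing rules live in utils.features.
COMORBIDITY_KEYWORDS = {
    'has_hypertension': 'hypertension',
    'has_diabetes': 'diabetes',
    'has_anemia': 'anemia',
}

# Made-up patient rows for illustration only.
toy = pd.DataFrame({
    'bmi': [27.5, np.nan, 31.0],
    'previous_medical_history': ['Hypertension; Diabetes', None, 'Anemia (disorder)'],
})

# Missingness indicator: 1 where the raw value is absent.
toy['bmi_missing'] = toy['bmi'].isna().astype(int)

# Case-insensitive keyword flags from the free-text history.
history = toy['previous_medical_history'].fillna('').str.lower()
for flag, keyword in COMORBIDITY_KEYWORDS.items():
    toy[flag] = history.str.contains(keyword, regex=False).astype(int)

toy['comorbidity_count'] = toy[list(COMORBIDITY_KEYWORDS)].sum(axis=1)
print(toy[['bmi_missing', 'comorbidity_count']])
```

Filling missing history with an empty string before matching keeps every flag at 0 for patients without recorded history, so the flags themselves introduce no NaNs.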


## 12. Final Validation

In [24]:
feature_cols = (
    VITAL_COLS
    + lag_cols
    + deriv_cols
    + rolling_cols
    + derived_cols
    + temporal_cols
    + PRIOR_LABEL_COLS
    + patient_feature_cols
)

for name, df in [('train', train), ('test', test), ('holdout', holdout)]:
    n_missing = df[feature_cols].isna().sum().sum()
    assert n_missing == 0, f"{name}: {n_missing} NaNs remain in feature columns"
    print(f"{name:8s}: shape={df.shape}  missing_feature_values={n_missing}")

train   : shape=(2109600, 161)  missing_feature_values=0
test    : shape=(451440, 161)  missing_feature_values=0
holdout : shape=(452880, 160)  missing_feature_values=0


## 13. Save to data/

In [None]:
train.to_csv(  DATA_DIR / 'train_features.csv',   index=False)
test.to_csv(   DATA_DIR / 'test_features.csv',    index=False)
holdout.to_csv(DATA_DIR / 'holdout_features.csv', index=False)

for name, path in [
    ('train',   DATA_DIR / 'train_features.csv'),
    ('test',    DATA_DIR / 'test_features.csv'),
    ('holdout', DATA_DIR / 'holdout_features.csv'),
]:
    size_mb = path.stat().st_size / 1e6
    print(f"Saved {name:8s} -> {path}  ({size_mb:.1f} MB)")