# Feature Engineering
## Extraction of Vital Signs and Laboratory Measurements
In this section, we load the dynamic clinical data from MIMIC-III, specifically CHARTEVENTS.csv, and restrict it to records belonging to the ICU stays in our df_final cohort. To stay consistent with the predictive setting, we keep only events charted within the first 24 hours after ICU admission (INTIME). This filter is essential to prevent data leakage: measurements taken later in the stay would not be available at prediction time. The file is read in chunks of 500,000 rows, and each chunk is filtered before it is kept in memory.
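
To make the window semantics concrete, here is a minimal sketch on toy data. The stay IDs, timestamps, and values are invented for illustration and are not taken from MIMIC-III; only the column names mirror the ones used in this notebook. It shows which events survive the closed interval of 0 to 24 hours after INTIME.

```python
import pandas as pd

# Toy data: two hypothetical ICU stays and a handful of charted events.
intimes = pd.Series(
    pd.to_datetime(["2101-01-01 08:00", "2101-03-05 22:30"]),
    index=[1001, 1002],
    name="INTIME",
)
events = pd.DataFrame({
    "ICUSTAY_ID": [1001, 1001, 1001, 1002, 1002],
    "CHARTTIME": pd.to_datetime([
        "2101-01-01 07:30",  # before admission -> dropped
        "2101-01-01 20:00",  # 12 h after admission -> kept
        "2101-01-02 09:00",  # 25 h after admission -> dropped
        "2101-03-05 22:30",  # exactly at admission -> kept
        "2101-03-06 22:30",  # exactly 24 h after admission -> kept
    ]),
    "VALUENUM": [80.0, 85.0, 90.0, 70.0, 75.0],
})

# Hours elapsed since ICU admission for every event.
events["INTIME"] = events["ICUSTAY_ID"].map(intimes)
events["HOURS_FROM_ADMIT"] = (
    (events["CHARTTIME"] - events["INTIME"]).dt.total_seconds() / 3600.0
)

# Keep only events in the closed interval [0, 24] hours.
in_window = events["HOURS_FROM_ADMIT"].between(0, 24)
first_day = events[in_window].reset_index(drop=True)
print(first_day[["ICUSTAY_ID", "HOURS_FROM_ADMIT", "VALUENUM"]])
```

Both bounds are inclusive here, matching the `>= 0` and `<= 24` comparisons in the cell that follows.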

In [2]:
import pandas as pd
import numpy as np
import os

In [None]:
RAW_PATH = "../data/raw/"
EXPORT_PATH = "../data/processed/"
OUTPUT_PATH = os.path.join(EXPORT_PATH, "chartevents_first24h.csv")

# === Load ICU stay cohort with admission times ===
df_final = pd.read_csv(os.path.join(EXPORT_PATH, "df_final_static.csv"), parse_dates=["INTIME"])
valid_intimes = df_final.set_index("ICUSTAY_ID")["INTIME"].to_dict()

# === Configure chunked reading ===
cols_to_use = ["SUBJECT_ID", "HADM_ID", "ICUSTAY_ID", "ITEMID", "CHARTTIME", "VALUENUM"]
reader = pd.read_csv(
    os.path.join(RAW_PATH, "CHARTEVENTS.csv"),
    usecols=cols_to_use,
    parse_dates=["CHARTTIME"],
    chunksize=500_000,
    low_memory=False
)

# === Initialize list to collect filtered data ===
filtered_chunks = []

print("[INFO] Starting chunked reading of CHARTEVENTS.csv...")

for i, chunk in enumerate(reader, start=1):
    # Drop rows with missing essential fields
    chunk = chunk.dropna(subset=["ICUSTAY_ID", "CHARTTIME", "VALUENUM"]).copy()
    chunk["ICUSTAY_ID"] = chunk["ICUSTAY_ID"].astype(int)
    
    # Filter only ICU stays present in the cohort
    chunk = chunk[chunk["ICUSTAY_ID"].isin(valid_intimes.keys())].copy()
    
    # Ensure CHARTTIME is a datetime Series
    chunk["CHARTTIME"] = pd.to_datetime(chunk["CHARTTIME"])

    # Map and convert INTIME safely
    intimes_mapped = chunk["ICUSTAY_ID"].map(valid_intimes)
    chunk["INTIME"] = pd.to_datetime(intimes_mapped)

    # Now both columns are datetime64[ns] Series
    chunk["HOURS_FROM_ADMIT"] = (chunk["CHARTTIME"] - chunk["INTIME"]).dt.total_seconds() / 3600.0

    # Filter only events occurring in the first 24 hours
    chunk = chunk[(chunk["HOURS_FROM_ADMIT"] >= 0) & (chunk["HOURS_FROM_ADMIT"] <= 24)]
    
    # Append to result list
    filtered_chunks.append(chunk)
    print(f"[Chunk {i}] Retained rows: {chunk.shape[0]}")

# === Concatenate and save final result ===
first24h = pd.concat(filtered_chunks, ignore_index=True)
first24h.to_csv(OUTPUT_PATH, index=False)

print(f"[SUCCESS] Filtered CHARTEVENTS (first 24h) saved to: {OUTPUT_PATH}")
print(f"[INFO] Final shape: {first24h.shape}")


[INFO] Starting chunked reading of CHARTEVENTS.csv...
[Chunk 1] Retained rows: 18667
[Chunk 2] Retained rows: 50592
[Chunk 3] Retained rows: 7111
[Chunk 4] Retained rows: 8262
[Chunk 5] Retained rows: 15922
[Chunk 6] Retained rows: 13779
[Chunk 7] Retained rows: 17200
[Chunk 8] Retained rows: 19017
[Chunk 9] Retained rows: 14871
[Chunk 10] Retained rows: 12595
[Chunk 11] Retained rows: 13885
[Chunk 12] Retained rows: 950
[Chunk 13] Retained rows: 6793
[Chunk 14] Retained rows: 13300
[Chunk 15] Retained rows: 14488
[Chunk 16] Retained rows: 9817
[Chunk 17] Retained rows: 6441
[Chunk 18] Retained rows: 6864
[Chunk 19] Retained rows: 6539
[Chunk 20] Retained rows: 14173
[Chunk 21] Retained rows: 13020
[Chunk 22] Retained rows: 10059
[Chunk 23] Retained rows: 8666
[Chunk 24] Retained rows: 14818
[Chunk 25] Retained rows: 9366
[Chunk 26] Retained rows: 25526
[Chunk 27] Retained rows: 8767
[Chunk 28] Retained rows: 1818
[Chunk 29] Retained rows: 10723
[Chunk 30] Retained rows: 4824
[Chunk 31

## Extraction of Vital Signs in the First 24h
In this section, we extract core vital signs from CHARTEVENTS, restricted to the first 24 hours of ICU admission. Each signal (heart rate, systolic blood pressure, SpO₂, temperature) is associated with one or more ITEMIDs in MIMIC-III, so measurements are pooled across all ITEMIDs that map to the same signal. For each ICU stay, we compute descriptive statistics (mean, std, min, max, count) to summarize the temporal profile. These engineered features form the base of our predictive modeling pipeline.
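
As a small illustration of the per-stay aggregation, the sketch below uses invented measurements for two hypothetical stays. The ITEMIDs 211 and 220045 are the heart-rate IDs used in this notebook; everything else is toy data. It also shows one consequence worth keeping in mind: a stay with a single measurement has an undefined standard deviation (NaN), even though the other statistics are available.

```python
import pandas as pd

# Toy measurements; the stay IDs and values are invented.
toy = pd.DataFrame({
    "ICUSTAY_ID": [1, 1, 1, 2],
    "ITEMID": [211, 220045, 211, 211],
    "VALUENUM": [80.0, 90.0, 100.0, 75.0],
})

# Pool both heart-rate ITEMIDs, then summarize per ICU stay.
hr = toy[toy["ITEMID"].isin([211, 220045])]
stats = (
    hr.groupby("ICUSTAY_ID")["VALUENUM"]
    .agg(["mean", "std", "min", "max", "count"])
    .add_prefix("HEART_RATE_")
    .rename(columns=str.upper)
    .reset_index()
)
print(stats)
```

Stay 2 has only one reading, so its HEART_RATE_STD is NaN while its HEART_RATE_COUNT is 1.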

In [11]:
# === Define ITEMIDs for selected vital signs (based on MIMIC documentation) ===
vital_items = {
    "Heart Rate": [211, 220045],
    "Systolic BP": [51, 455, 220050, 220179],
    "Temperature": [678, 223761],
    "SpO2": [646, 220277]
}

# === Prepare a container for stats ===
vital_features = []

# === Loop through each vital sign ===
for label, itemids in vital_items.items():
    temp = first24h[first24h["ITEMID"].isin(itemids)]
    
    # Compute descriptive statistics per ICUSTAY_ID
    stats = temp.groupby("ICUSTAY_ID")["VALUENUM"].agg(["mean", "std", "min", "max", "count"]).reset_index()
    stats.columns = ["ICUSTAY_ID"] + [f"{label.upper().replace(' ', '_')}_{stat.upper()}" for stat in stats.columns[1:]]
    
    vital_features.append(stats)

# === Merge all stats together ===
from functools import reduce
vital_df = reduce(lambda left, right: pd.merge(left, right, on="ICUSTAY_ID", how="outer"), vital_features)

# === Output check ===
print("[INFO] Vital signs extracted:", vital_df.shape)
vital_df.to_csv(os.path.join(EXPORT_PATH, "vital_features.csv"), index=False)
vital_df.head()


[INFO] Vital signs extracted: (1903, 21)


Unnamed: 0,ICUSTAY_ID,HEART_RATE_MEAN,HEART_RATE_STD,HEART_RATE_MIN,HEART_RATE_MAX,HEART_RATE_COUNT,SYSTOLIC_BP_MEAN,SYSTOLIC_BP_STD,SYSTOLIC_BP_MIN,SYSTOLIC_BP_MAX,...,TEMPERATURE_MEAN,TEMPERATURE_STD,TEMPERATURE_MIN,TEMPERATURE_MAX,TEMPERATURE_COUNT,SPO2_MEAN,SPO2_STD,SPO2_MIN,SPO2_MAX,SPO2_COUNT
0,200075,77.076923,7.065053,62.0,91.0,39.0,108.5,17.966163,83.0,169.0,...,97.366667,0.355903,96.9,97.7,6.0,98.558824,1.185549,96.0,100.0,34.0
1,200150,91.75,11.891722,72.0,118.0,24.0,97.083333,9.757569,81.0,114.0,...,96.422222,0.762853,95.8,98.3,9.0,95.458333,2.126012,92.0,100.0,24.0
2,200231,80.451613,7.46922,73.0,118.0,31.0,100.733333,17.737421,48.0,140.0,...,97.883333,0.483391,97.5,98.8,6.0,97.935484,2.048341,94.0,100.0,31.0
3,200282,89.125,17.77775,70.0,126.0,32.0,108.24,14.683551,90.0,161.0,...,98.371429,2.337174,96.1,103.1,7.0,98.928571,1.18411,96.0,100.0,28.0
4,200441,112.482759,31.916879,86.0,213.0,29.0,111.214286,13.532833,83.0,133.0,...,100.106667,1.131034,98.3,102.5,15.0,97.703704,1.793467,94.0,100.0,27.0


## Merging Vital Features into the Final Dataset
In this section, we enrich the static ICU dataset (df_final) with the time-series features extracted from CHARTEVENTS during the first 24 hours. The join is performed on the ICU stay identifier (ICUSTAY_ID) using a left join, so that every original cohort entry is preserved. Stays with no matching vital-sign features receive NaN in the new columns. The resulting dataset (df_final_enriched) combines static and early dynamic information and serves as the base for modeling.
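
One way to check that a left join behaves as described is sketched below on toy frames (the IDs and values are invented). It relies on two `pd.merge` options, `validate` and `indicator`, which are not used in this notebook's own cell; treat it as an optional sanity check rather than part of the pipeline.

```python
import pandas as pd

# Toy cohort of three stays; only two of them have vital features.
cohort = pd.DataFrame({"ICUSTAY_ID": [1, 2, 3], "AGE": [86, 53, 60]})
vitals = pd.DataFrame({"ICUSTAY_ID": [1, 3], "HEART_RATE_MEAN": [77.1, 91.8]})

enriched = pd.merge(
    cohort,
    vitals,
    on="ICUSTAY_ID",
    how="left",
    validate="one_to_one",  # raises MergeError if either side has duplicate keys
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# A left join on unique keys must keep exactly the cohort's rows.
assert len(enriched) == len(cohort)

n_with_vitals = int((enriched["_merge"] == "both").sum())
print(f"{n_with_vitals} of {len(enriched)} stays have vital features")
```

The `_merge` column gives a direct count of matched stays without depending on any single feature column being non-null.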

In [12]:
# === Load the data if not already in memory ===
df_final = pd.read_csv("../data/processed/df_final_static.csv")
vital_df = pd.read_csv("../data/processed/vital_features.csv")

# === Merge on ICUSTAY_ID ===
df_final_enriched = pd.merge(df_final, vital_df, on="ICUSTAY_ID", how="left")

# === Diagnostic output ===
print("[INFO] Shape after enrichment:", df_final_enriched.shape)
# Note: this counts stays with a non-null HEART_RATE_MEAN, i.e. at least one heart rate measurement
print("[INFO] Number of ICU stays with at least one heart rate measurement:", df_final_enriched.dropna(subset=["HEART_RATE_MEAN"]).shape[0])

# === Save and preview ===
df_final_enriched.to_csv(os.path.join(EXPORT_PATH, "df_final_enriched.csv"), index=False)
df_final_enriched.head()

[INFO] Shape after enrichment: (3685, 32)
[INFO] Number of ICU stays with at least one heart rate measurement: 1902


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,AGE,GENDER,ADMISSION_TYPE,ADMISSION_LOCATION,INSURANCE,FIRST_CAREUNIT,LOS,...,TEMPERATURE_MEAN,TEMPERATURE_STD,TEMPERATURE_MIN,TEMPERATURE_MAX,TEMPERATURE_COUNT,SPO2_MEAN,SPO2_STD,SPO2_MIN,SPO2_MAX,SPO2_COUNT
0,51797,104616,265369.0,86,F,EMERGENCY,CLINIC REFERRAL/PREMATURE,Medicare,MICU,8.6956,...,99.511111,1.374267,97.5,101.9,9.0,99.24,0.925563,97.0,100.0,25.0
1,44534,183659,204918.0,53,M,EMERGENCY,EMERGENCY ROOM ADMIT,Medicaid,SICU,2.292,...,97.742857,1.198412,96.2,99.0,7.0,96.482759,3.202908,89.0,100.0,29.0
2,14828,144708,293475.0,60,F,EMERGENCY,EMERGENCY ROOM ADMIT,Private,SICU,1.106,...,,,,,,,,,,
3,14828,125239,288771.0,61,F,EMERGENCY,EMERGENCY ROOM ADMIT,Medicare,MICU,3.1126,...,,,,,,,,,,
4,44500,101872,260996.0,72,F,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,Medicare,CCU,0.936,...,98.58,1.089495,97.5,99.9,5.0,96.086957,2.065145,91.0,99.0,23.0


## Missing Values Strategy
Missing values in clinical datasets are not just technical noise: they often carry clinical meaning, because whether a measurement was recorded can reflect how closely the patient was monitored. In this section, we apply an imputation strategy that balances model performance and interpretability. For each feature derived from vital signs, we first add a binary indicator column (suffix _MISSING) that records whether the value was absent, and then fill the gaps with the median of that feature across the cohort, which is robust to outliers.
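
The cell that follows computes each median over the full table. If the data are later split into training and test sets, a common variation is to learn the medians from the training rows only and reuse them on the test rows. The sketch below illustrates that variation on invented values; the split, the helper name `impute_with_flags`, and the toy frame are assumptions for illustration and not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy feature table; values are invented.
df_toy = pd.DataFrame({
    "HEART_RATE_MEAN": [70.0, np.nan, 90.0, 110.0, np.nan, 80.0],
    "SPO2_MEAN": [98.0, 97.0, np.nan, 95.0, 96.0, np.nan],
})

# Hypothetical split: first four rows for training, last two for testing.
train, test = df_toy.iloc[:4].copy(), df_toy.iloc[4:].copy()

feature_cols = ["HEART_RATE_MEAN", "SPO2_MEAN"]
train_medians = train[feature_cols].median()  # learned from training rows only


def impute_with_flags(frame, medians):
    """Add <col>_MISSING indicators, then fill NaN with the given medians."""
    out = frame.copy()
    for col in medians.index:
        out[f"{col}_MISSING"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(medians[col])
    return out


train_imp = impute_with_flags(train, train_medians)
test_imp = impute_with_flags(test, train_medians)
print(test_imp)
```

The indicator is created before the fill, so the information about which values were originally absent is preserved after imputation.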

In [13]:
# === Load the enriched dataset if not already in memory ===
df = pd.read_csv("../data/processed/df_final_enriched.csv")

# === Select only the vital-sign features ===
vital_cols = [col for col in df.columns if any(x in col for x in ["HEART_RATE", "SYSTOLIC_BP", "TEMPERATURE", "SPO2"])]

# === Create missingness indicators ===
for col in vital_cols:
    df[f"{col}_MISSING"] = df[col].isnull().astype(int)

# === Impute with the global median ===
for col in vital_cols:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)

# === Final check ===
print("[INFO] Any remaining NaN:", df[vital_cols].isnull().any().any())

# === Save ===
df.to_csv("../data/processed/df_model_ready.csv", index=False)
df.head()

[INFO] Any remaining NaN: False


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,AGE,GENDER,ADMISSION_TYPE,ADMISSION_LOCATION,INSURANCE,FIRST_CAREUNIT,LOS,...,TEMPERATURE_MEAN_MISSING,TEMPERATURE_STD_MISSING,TEMPERATURE_MIN_MISSING,TEMPERATURE_MAX_MISSING,TEMPERATURE_COUNT_MISSING,SPO2_MEAN_MISSING,SPO2_STD_MISSING,SPO2_MIN_MISSING,SPO2_MAX_MISSING,SPO2_COUNT_MISSING
0,51797,104616,265369.0,86,F,EMERGENCY,CLINIC REFERRAL/PREMATURE,Medicare,MICU,8.6956,...,0,0,0,0,0,0,0,0,0,0
1,44534,183659,204918.0,53,M,EMERGENCY,EMERGENCY ROOM ADMIT,Medicaid,SICU,2.292,...,0,0,0,0,0,0,0,0,0,0
2,14828,144708,293475.0,60,F,EMERGENCY,EMERGENCY ROOM ADMIT,Private,SICU,1.106,...,1,1,1,1,1,1,1,1,1,1
3,14828,125239,288771.0,61,F,EMERGENCY,EMERGENCY ROOM ADMIT,Medicare,MICU,3.1126,...,1,1,1,1,1,1,1,1,1,1
4,44500,101872,260996.0,72,F,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,Medicare,CCU,0.936,...,0,0,0,0,0,0,0,0,0,0
