# Base Data Construction
## ICU Admission Filtering and Cohort Refinement

The goal of this section is to refine the previously identified sepsis cohort by retaining only valid ICU stays. A stay is kept only if its length of stay (LOS) is non-null and positive and its timestamps are consistent, that is, the discharge time (OUTTIME) falls after the admission time (INTIME). These filters remove administrative artifacts and invalid entries that would otherwise distort the LOS target. Duplicate rows are then dropped so that each ICUSTAY_ID appears exactly once.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [3]:
PATH = '../data/raw/'
EXPORT_PATH = '../data/processed/'
ASSETS_PATH = '../assets/plots/visualization/'

# === Load cohort and ICUSTAYS ===
sepsis_ids = pd.read_csv(os.path.join(EXPORT_PATH, "sepsis_cohort.csv"))
icustays = pd.read_csv(os.path.join(PATH, "ICUSTAYS.csv"), parse_dates=["INTIME", "OUTTIME"])

# === Check if ICUSTAY_ID is available ===
if "ICUSTAY_ID" in sepsis_ids.columns:
    cohort_icu = pd.merge(sepsis_ids, icustays, on=["SUBJECT_ID", "HADM_ID", "ICUSTAY_ID"], how="inner")
else:
    cohort_icu = pd.merge(sepsis_ids, icustays, on=["SUBJECT_ID", "HADM_ID"], how="inner")

# === Apply validity filters ===
cohort_icu = cohort_icu[cohort_icu["LOS"].notnull()]
cohort_icu = cohort_icu[cohort_icu["LOS"] > 0]
cohort_icu = cohort_icu[cohort_icu["OUTTIME"] > cohort_icu["INTIME"]]

# === Drop duplicates if ICUSTAY_ID exists ===
if "ICUSTAY_ID" in cohort_icu.columns:
    cohort_icu = cohort_icu.drop_duplicates(subset=["ICUSTAY_ID"])

# === Output ===
print(f"[INFO] Valid ICU stays in sepsis cohort: {cohort_icu.shape[0]}")
cohort_icu[["SUBJECT_ID", "HADM_ID"] + (["ICUSTAY_ID"] if "ICUSTAY_ID" in cohort_icu.columns else []) + ["LOS"]].head()


[INFO] Valid ICU stays in sepsis cohort: 3685


|   | SUBJECT_ID | HADM_ID | ICUSTAY_ID | LOS |
|---|---|---|---|---|
| 0 | 51797 | 104616 | 265369.0 | 8.6956 |
| 1 | 44534 | 183659 | 204918.0 | 2.292 |
| 2 | 14828 | 144708 | 293475.0 | 1.106 |
| 3 | 14828 | 125239 | 288771.0 | 3.1126 |
| 4 | 44500 | 101872 | 260996.0 | 0.936 |
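
The validity rules of this section can be exercised on a small toy frame to confirm that each one removes the rows it is meant to. The sketch below is illustrative only: `filter_valid_icu_stays` and the `toy` data are hypothetical names and values, not part of the notebook, and the function assumes the columns LOS, INTIME, OUTTIME, and ICUSTAY_ID are all present.

```python
import pandas as pd


def filter_valid_icu_stays(df: pd.DataFrame) -> pd.DataFrame:
    """Keep stays with a non-null positive LOS and OUTTIME after INTIME,
    then keep one row per ICUSTAY_ID (hypothetical helper)."""
    mask = (
        df["LOS"].notna()
        & (df["LOS"] > 0)
        & (df["OUTTIME"] > df["INTIME"])
    )
    return df.loc[mask].drop_duplicates(subset=["ICUSTAY_ID"]).reset_index(drop=True)


# Toy data (made up): stay 1 is valid, stay 2 has OUTTIME before INTIME,
# stay 3 has a missing LOS and OUTTIME, stay 4 is valid but duplicated.
toy = pd.DataFrame({
    "ICUSTAY_ID": [1, 2, 3, 4, 4],
    "INTIME": pd.to_datetime([
        "2150-01-01 08:00", "2150-01-02 08:00", "2150-01-03 08:00",
        "2150-01-04 08:00", "2150-01-04 08:00",
    ]),
    "OUTTIME": pd.to_datetime([
        "2150-01-03 08:00", "2150-01-02 07:00", None,
        "2150-01-05 20:00", "2150-01-05 20:00",
    ]),
    "LOS": [2.0, 0.5, None, 1.5, 1.5],
})

valid = filter_valid_icu_stays(toy)
print(valid["ICUSTAY_ID"].tolist())
```

Combining the three conditions into one boolean mask gives the same result as applying them one after another, since a row must satisfy all of them to be kept.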


## Demographic and Administrative Data Integration

This section enriches the filtered ICU cohort with static patient characteristics and admission-level metadata. We merge demographic data (gender and date of birth) from PATIENTS.csv and administrative details (admission time, admission type, admission location, insurance, and the in-hospital mortality flag) from ADMISSIONS.csv. The key derived variable is age, computed as the number of completed years between DOB and ADMITTIME. Because ADMITTIME is the hospital admission time, this is strictly the age at hospital admission, used here as a stand-in for age at ICU admission. Ages above 89 are replaced with a fixed value of 91 for privacy compliance with HIPAA guidelines. The resulting dataset contains variables expected to influence ICU length of stay (LOS), such as patient age and insurance type (a rough proxy for socio-economic status).
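
To make the age rule concrete, here is a minimal sketch on toy dates. The helper name `age_in_years` and the sample values are hypothetical. The third row assumes, as described in the MIMIC-III documentation, that patients older than 89 have a date of birth shifted so that the computed age comes out around 300 years.

```python
import pandas as pd


def age_in_years(dob: pd.Series, ref: pd.Series) -> pd.Series:
    """Completed years between dob and ref, using calendar components only
    (hypothetical helper)."""
    years = ref.dt.year - dob.dt.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    before_birthday = (ref.dt.month < dob.dt.month) | (
        (ref.dt.month == dob.dt.month) & (ref.dt.day < dob.dt.day)
    )
    return years - before_birthday.astype(int)


# Toy data (made up): one day before a 50th birthday, exactly on it,
# and a shifted date of birth roughly 300 years before admission.
toy = pd.DataFrame({
    "DOB": pd.to_datetime(["2100-06-15", "2100-06-15", "1850-01-01"]),
    "ADMITTIME": pd.to_datetime(["2150-06-14", "2150-06-15", "2150-03-01"]),
})

toy["AGE"] = age_in_years(toy["DOB"], toy["ADMITTIME"])
raw_ages = toy["AGE"].tolist()

# Same rule as in this section: any age above 89 becomes 91.
toy.loc[toy["AGE"] > 89, "AGE"] = 91
print(raw_ages, toy["AGE"].tolist())
```

Working with year, month, and day components rather than subtracting the timestamps directly is a deliberate choice in this sketch: a difference of roughly 300 years can exceed the range of nanosecond-resolution timedeltas in pandas (about 292 years).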

In [4]:
# === Load relevant tables ===
patients = pd.read_csv(PATH + "PATIENTS.csv", parse_dates=["DOB"])
admissions = pd.read_csv(PATH + "ADMISSIONS.csv", parse_dates=["ADMITTIME"])

# === Merge with demographic data (gender, DOB) ===
cohort_icu = cohort_icu.merge(
    patients[["SUBJECT_ID", "GENDER", "DOB"]],
    on="SUBJECT_ID", how="left"
)

# === Merge with administrative data ===
cohort_icu = cohort_icu.merge(
    admissions[[
        "SUBJECT_ID", "HADM_ID", "ADMITTIME", "ADMISSION_TYPE",
        "ADMISSION_LOCATION", "INSURANCE", "HOSPITAL_EXPIRE_FLAG"
    ]],
    on=["SUBJECT_ID", "HADM_ID"], how="left"
)

# === Calculate age in completed years at hospital admission (ADMITTIME) ===
age = cohort_icu["ADMITTIME"].dt.year - cohort_icu["DOB"].dt.year
adjustment = (cohort_icu["ADMITTIME"].dt.month < cohort_icu["DOB"].dt.month) | (
    (cohort_icu["ADMITTIME"].dt.month == cohort_icu["DOB"].dt.month) &
    (cohort_icu["ADMITTIME"].dt.day < cohort_icu["DOB"].dt.day)
)
cohort_icu["AGE"] = age - adjustment.astype(int)

# === Apply HIPAA cap for patients older than 89 ===
cohort_icu.loc[cohort_icu["AGE"] > 89, "AGE"] = 91

# === Drop impossible ages (negative, null) ===
cohort_icu = cohort_icu[(cohort_icu["AGE"] >= 0) & cohort_icu["AGE"].notnull()]

# === Visual sanity check ===
print("[INFO] ICU cohort with demographics:", cohort_icu.shape)
cohort_icu[["SUBJECT_ID", "AGE", "GENDER", "ADMISSION_TYPE", "INSURANCE"]].head()

[INFO] ICU cohort with demographics: (3685, 20)


|   | SUBJECT_ID | AGE | GENDER | ADMISSION_TYPE | INSURANCE |
|---|---|---|---|---|---|
| 0 | 51797 | 86 | F | EMERGENCY | Medicare |
| 1 | 44534 | 53 | M | EMERGENCY | Medicaid |
| 2 | 14828 | 60 | F | EMERGENCY | Private |
| 3 | 14828 | 61 | F | EMERGENCY | Medicare |
| 4 | 44500 | 72 | F | EMERGENCY | Medicare |
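
The row count stays at 3685 after both left joins, which suggests that neither lookup table introduced duplicate rows. One way to make that expectation explicit is to use the `validate` and `indicator` parameters of `DataFrame.merge`. The sketch below uses made-up toy tables (`stays`, `patients_toy`, `dup`) and is not the notebook's code.

```python
import pandas as pd

# Toy tables (made up): two stays for subject 1, one stay for subject 2.
stays = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "HADM_ID": [10, 11, 20],
    "LOS": [2.0, 1.0, 3.5],
})
patients_toy = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3],
    "GENDER": ["F", "M", "F"],
})

# validate="many_to_one" raises if SUBJECT_ID is not unique on the right side;
# indicator=True adds a _merge column showing which rows found a match.
merged = stays.merge(
    patients_toy, on="SUBJECT_ID", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)

# A lookup table with a repeated key is rejected instead of silently
# duplicating stays.
dup = pd.concat([patients_toy, patients_toy.iloc[[0]]], ignore_index=True)
try:
    stays.merge(dup, on="SUBJECT_ID", how="left", validate="many_to_one")
    duplicate_key_rejected = False
except pd.errors.MergeError:
    duplicate_key_rejected = True
print(duplicate_key_rejected)
```

With a left join, rows with no match are kept with missing values, so counting `left_only` rows is a quick way to see how many stays lack demographic or administrative data.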


## Final ICU Cohort Assembly

At this point, we have a clean, filtered cohort of ICU admissions for patients diagnosed with sepsis (ICD-9 code 038), enriched with static demographic and administrative data. We now assemble the final base dataset, df_final, which holds the variables needed for stratification and baseline analysis: patient and stay identifiers, age, gender, admission characteristics, the first ICU care unit, the ICU admission time (INTIME), the in-hospital mortality flag, and the target variable LOS. This dataset is the foundation for the subsequent feature engineering and modeling phases.

In [5]:
# === Select final columns ===
df_final = cohort_icu[[
    "SUBJECT_ID", "HADM_ID", "ICUSTAY_ID",
    "AGE", "GENDER",
    "ADMISSION_TYPE", "ADMISSION_LOCATION", "INSURANCE",
    "FIRST_CAREUNIT", "LOS", "HOSPITAL_EXPIRE_FLAG", "INTIME"
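]].copy()  # independent copy, so later edits do not touch cohort_icu

# === Final checks ===
# Defensive re-check: rows with missing LOS were already removed during ICU filtering
df_final = df_final[df_final["LOS"].notnull()]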

# === Output shape ===
print("[SUCCESS] Final ICU dataset shape:", df_final.shape)
df_final.to_csv(EXPORT_PATH + "df_final_static.csv", index=False)
df_final.head()

[SUCCESS] Final ICU dataset shape: (3685, 12)


|   | SUBJECT_ID | HADM_ID | ICUSTAY_ID | AGE | GENDER | ADMISSION_TYPE | ADMISSION_LOCATION | INSURANCE | FIRST_CAREUNIT | LOS | HOSPITAL_EXPIRE_FLAG | INTIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51797 | 104616 | 265369.0 | 86 | F | EMERGENCY | CLINIC REFERRAL/PREMATURE | Medicare | MICU | 8.6956 | 1 | 2194-01-28 00:44:15 |
| 1 | 44534 | 183659 | 204918.0 | 53 | M | EMERGENCY | EMERGENCY ROOM ADMIT | Medicaid | SICU | 2.292 | 0 | 2155-05-13 09:30:42 |
| 2 | 14828 | 144708 | 293475.0 | 60 | F | EMERGENCY | EMERGENCY ROOM ADMIT | Private | SICU | 1.106 | 0 | 2156-08-20 15:56:12 |
| 3 | 14828 | 125239 | 288771.0 | 61 | F | EMERGENCY | EMERGENCY ROOM ADMIT | Medicare | MICU | 3.1126 | 1 | 2158-01-06 16:19:38 |
| 4 | 44500 | 101872 | 260996.0 | 72 | F | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | Medicare | CCU | 0.936 | 0 | 2164-06-05 15:43:30 |
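
Before handing the static dataset to later stages, it can help to restate the invariants established in this section as explicit checks. The sketch below is one possible way to do that. `check_static_dataset` is a hypothetical helper, and the `good` frame reuses the first two preview rows shown above purely as sample input.

```python
import pandas as pd

# Columns selected for df_final in this section.
EXPECTED_COLUMNS = [
    "SUBJECT_ID", "HADM_ID", "ICUSTAY_ID", "AGE", "GENDER",
    "ADMISSION_TYPE", "ADMISSION_LOCATION", "INSURANCE",
    "FIRST_CAREUNIT", "LOS", "HOSPITAL_EXPIRE_FLAG", "INTIME",
]


def check_static_dataset(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means every check passed
    (hypothetical helper)."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if df["ICUSTAY_ID"].duplicated().any():
        problems.append("duplicate ICUSTAY_ID")
    # NaN > 0 evaluates to False, so missing LOS values are flagged as well.
    if not (df["LOS"] > 0).all():
        problems.append("null or non-positive LOS")
    if not df["AGE"].between(0, 91).all():
        problems.append("AGE outside 0..91")
    return problems


good = pd.DataFrame({
    "SUBJECT_ID": [51797, 44534],
    "HADM_ID": [104616, 183659],
    "ICUSTAY_ID": [265369.0, 204918.0],
    "AGE": [86, 53],
    "GENDER": ["F", "M"],
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY"],
    "ADMISSION_LOCATION": ["CLINIC REFERRAL/PREMATURE", "EMERGENCY ROOM ADMIT"],
    "INSURANCE": ["Medicare", "Medicaid"],
    "FIRST_CAREUNIT": ["MICU", "SICU"],
    "LOS": [8.6956, 2.292],
    "HOSPITAL_EXPIRE_FLAG": [1, 0],
    "INTIME": pd.to_datetime(["2194-01-28 00:44:15", "2155-05-13 09:30:42"]),
})

# Deliberately broken copy: a repeated stay and a zero LOS.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)
bad.loc[1, "LOS"] = 0.0

print(check_static_dataset(good))
print(check_static_dataset(bad))
```

The same function could be applied to the exported file after reading it back with `pd.read_csv(..., parse_dates=["INTIME"])`, since CSV does not preserve the datetime type.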
