# Base Dataset Construction
## Identification of Clinical Cohort
This step selects patients with a specific disease, identified through the ICD-9 codes in the DIAGNOSES_ICD.csv file. As described in the previous chapter, the disease we chose is sepsis (ICD-9 code 038), so we keep only the diagnoses carrying that code. The cell below loads the resulting cohort from `sepsis_cohort.csv` and keeps the unique (SUBJECT_ID, HADM_ID) pairs.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

PATH = '../data/raw/'
EXPORT_PATH = '../data/processed/'

sepsis_ids = pd.read_csv(EXPORT_PATH + "sepsis_cohort.csv") #, parse_dates=["INTIME", "OUTTIME"])
cohort = sepsis_ids[['SUBJECT_ID', 'HADM_ID']].drop_duplicates()
display(cohort.head())

Unnamed: 0,SUBJECT_ID,HADM_ID
0,51797,104616
1,44534,183659
2,14828,144708
3,14828,125239
4,44500,101872
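
The cohort file itself was produced in the previous chapter. As a rough illustration of the selection described above, the following sketch filters a toy diagnoses table by code prefix. The table contents are made up, and the sketch assumes that `ICD9_CODE` is stored as a string without the decimal point, so that every code in the 038 family starts with "038"; it is not necessarily the exact procedure used to build `sepsis_cohort.csv`.

```python
import pandas as pd

# Toy stand-in for DIAGNOSES_ICD.csv (values are made up for illustration).
diagnoses = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3, 3],
    "HADM_ID": [10, 10, 20, 30, 31],
    "ICD9_CODE": ["0389", "4019", "25000", "03811", "0380"],
})

# Assumption: codes are strings without the decimal point,
# so the whole 038 family shares the "038" prefix.
is_sepsis = diagnoses["ICD9_CODE"].astype(str).str.startswith("038")

# Keep one row per (patient, hospital admission) pair.
sepsis_cohort = (
    diagnoses.loc[is_sepsis, ["SUBJECT_ID", "HADM_ID"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(sepsis_cohort)
```

Casting to string before matching guards against codes that pandas might otherwise read as numbers.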


## Filtering ICU Stays
Next, we filter the ICU stays (ICUSTAYS.csv) to keep only those belonging to the previously selected patient cohort. We exclude stays with a missing or zero LOS and stays with temporal inconsistencies (OUTTIME not later than INTIME). Duplicate ICUSTAY_ID rows are also removed, so that each ICU stay is valid and consistent for downstream clinical and predictive analyses.

In [2]:
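# Parse the timestamps so that OUTTIME > INTIME is a chronological comparison
icustays = pd.read_csv(PATH + "ICUSTAYS.csv", parse_dates=["INTIME", "OUTTIME"])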
cohort_icu = icustays.merge(cohort, on=["SUBJECT_ID", "HADM_ID"], how="inner")

cohort_icu = cohort_icu[cohort_icu["LOS"].notnull()]
cohort_icu = cohort_icu[cohort_icu["LOS"] > 0]
cohort_icu = cohort_icu[cohort_icu["OUTTIME"] > cohort_icu["INTIME"]]

cohort_icu = cohort_icu.drop_duplicates(subset=["ICUSTAY_ID"])

print(f"Valid ICU Admissions for Cohort: {cohort_icu.shape[0]}")
display(cohort_icu[["SUBJECT_ID", "HADM_ID", "ICUSTAY_ID", "LOS"]].head())

Valid ICU Admissions for Cohort: 3685


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,LOS
0,269,106296,206613,3.2788
1,275,129886,219649,7.1314
2,292,179726,222505,0.8854
3,305,194340,217232,2.437
4,323,143334,264375,3.0252
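
To make the exclusion rules concrete, here is a minimal sketch on a toy table in which each row triggers a different rule. The values are invented; only the column names come from ICUSTAYS.csv. The three conditions are combined into a single boolean mask, which is equivalent to applying them one after another.

```python
import pandas as pd

# Toy ICU stays (made-up values) covering each exclusion rule.
stays = pd.DataFrame({
    "ICUSTAY_ID": [1, 2, 3, 4, 4],
    "LOS": [2.5, None, 0.0, 1.2, 1.2],
    "INTIME": ["2150-01-01 10:00:00", "2150-01-02 10:00:00",
               "2150-01-03 10:00:00", "2150-01-04 10:00:00",
               "2150-01-04 10:00:00"],
    "OUTTIME": ["2150-01-03 22:00:00", None,
                "2150-01-03 10:00:00", "2150-01-05 14:48:00",
                "2150-01-05 14:48:00"],
})

# Parse timestamps so the comparison is chronological, not textual.
for col in ["INTIME", "OUTTIME"]:
    stays[col] = pd.to_datetime(stays[col], errors="coerce")

# A stay is valid if LOS is present and positive and OUTTIME follows INTIME.
valid = (
    stays["LOS"].notnull()
    & (stays["LOS"] > 0)
    & (stays["OUTTIME"] > stays["INTIME"])
)
clean = stays[valid].drop_duplicates(subset=["ICUSTAY_ID"])
print(f"{len(stays)} rows -> {len(clean)} valid ICU stays")
```

Comparisons involving a missing timestamp (`NaT`) evaluate to `False`, so stays without an OUTTIME are excluded by the same mask.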


## Merge with Patient and Admission Information
After filtering ICU admissions, we merge the ICU stay records with demographic and administrative information from PATIENTS.csv and ADMISSIONS.csv. The patient's sex, date of birth, admission type and location, insurance type, and in-hospital mortality flag are static variables that may affect the duration of the ICU stay. Patient age is computed from the date of birth and the hospital admission time (ADMITTIME), subtracting one year when the birthday has not yet occurred in the admission year. Missing or conflicting values are handled by assigning defaults when necessary. To maintain privacy compliance under HIPAA regulations, ages over 89 years are capped at 91 years. These steps keep the dataset both statistically usable and respectful of patient confidentiality.

In [4]:
# Merge to get gender and DOB
patients = pd.read_csv(PATH + "PATIENTS.csv", parse_dates=["DOB"])
admissions = pd.read_csv(PATH + "ADMISSIONS.csv", parse_dates=["ADMITTIME"])
df = cohort_icu.merge(patients[["SUBJECT_ID", "GENDER", "DOB"]], on="SUBJECT_ID", how="left")

# Merge to get administrative variables
df = df.merge(
    admissions[[
        "SUBJECT_ID", "HADM_ID", "ADMITTIME",
        "ADMISSION_TYPE", "ADMISSION_LOCATION", "INSURANCE", "HOSPITAL_EXPIRE_FLAG"
    ]],
    on=["SUBJECT_ID", "HADM_ID"],
    how="left"
)

df['ADMITTIME'] = pd.to_datetime(df['ADMITTIME'], errors='coerce')
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')
age_years = df['ADMITTIME'].dt.year - df['DOB'].dt.year

# An adjustment is applied for cases in which the patient’s birthday has not yet occurred in the year of hospital admission.
month_day_adjustment = ((df['ADMITTIME'].dt.month < df['DOB'].dt.month) |
                        ((df['ADMITTIME'].dt.month == df['DOB'].dt.month) &
                         (df['ADMITTIME'].dt.day < df['DOB'].dt.day)))
df['AGE'] = age_years - month_day_adjustment.astype(int)
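
The cell above computes `AGE` but does not show the cap itself. In MIMIC-III, the dates of birth of patients older than 89 are shifted, so their computed age comes out at roughly 300 years. A minimal sketch of how the cap described in the text could be applied is given below; the toy values and the name `CAPPED_AGE` are illustrative assumptions, and the missing age is deliberately left missing because the text does not specify the default used.

```python
import pandas as pd

# Toy ages (made-up values); 300 mimics a shifted date of birth.
ages = pd.Series([40.0, 82.0, 300.0, None, 89.0], name="AGE")

CAPPED_AGE = 91  # replacement value taken from the text above

# mask() replaces only the entries where the condition is True,
# so missing ages stay missing instead of silently becoming 91.
capped = ages.mask(ages > 89, CAPPED_AGE)
print(capped.tolist())
```

Using `mask` on `ages > 89` rather than `where` on `ages <= 89` matters here: with `where`, missing values would fail the condition and be set to 91 as well.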

## Timestamp Feature Cleaning and Extraction

In clinical datasets such as MIMIC-III, timestamps play a crucial role in capturing the temporal dynamics of patient care. However, raw timestamp columns (e.g., `INTIME`, `ADMITTIME`, `DISCHTIME`) are often not directly usable for predictive modeling or statistical analysis. This is due to several reasons:

- **High Cardinality and Privacy**: Timestamps are unique for each event, leading to high cardinality and potential privacy concerns. Directly using them as features can inadvertently leak sensitive information or overfit models to specific time points.
- **Irrelevance of Absolute Time**: The absolute date and time (e.g., "2012-03-15 14:23:00") are rarely meaningful in isolation. What often matters more are derived temporal features, such as the hour of admission or the day of the week, which can capture patterns in hospital operations, staffing, or patient flow.
- **Feature Engineering for Patterns**: Extracting components like the hour of the day or weekday from timestamps enables the model to learn patterns related to circadian rhythms, shift changes, or weekend effects, which are known to influence patient outcomes and care processes.
- **Data Consistency and Simplification**: By converting timestamps to standardized datetime objects and extracting only the relevant components, we ensure consistency across the dataset and reduce complexity, making downstream analysis more robust and interpretable.

The code below processes the timestamp columns of the dataframe `df`; columns listed in `timestamp_cols` that are not present in `df` are skipped:
- **Conversion to Datetime**: Each timestamp column is converted to pandas' `datetime` format, ensuring that subsequent operations are reliable and consistent, even if the original data contains formatting inconsistencies.
- **Feature Extraction**: For each timestamp, two new features are extracted:
    - The hour of the event (e.g., `INTIME_HOUR`), capturing time-of-day effects.
    - The day of the week (e.g., `INTIME_WEEKDAY`), capturing weekly patterns.
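- **Column Cleanup**: After extracting the relevant features, the original timestamp columns can be dropped to avoid redundancy and reduce dimensionality. In this notebook the drop is left commented out, because `INTIME` is still carried into the final dataset.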

This approach transforms raw, high-cardinality timestamp data into a set of informative, low-cardinality features that are more suitable for machine learning models and statistical analysis. It also helps in maintaining data privacy and interpretability, both of which are essential in clinical research and healthcare analytics.

In [5]:
# === Timestamp Feature Cleaning and Extraction ===
timestamp_cols = ["INTIME", "ADMITTIME", "DISCHTIME"]
for col in timestamp_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors="coerce")
        df[f"{col}_HOUR"] = df[col].dt.hour
        df[f"{col}_WEEKDAY"] = df[col].dt.weekday
        # The original columns are kept for now: INTIME is still needed in df_final.
        # df.drop(columns=[col], inplace=True)
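
As a small illustration of what the extraction yields, the sketch below applies the same three calls to a toy column. The two valid timestamps are taken from the output table of this chapter; the third entry is an invented malformed value that shows the effect of `errors="coerce"`.

```python
import pandas as pd

# Two timestamps from the dataset preview plus one malformed entry.
events = pd.DataFrame({
    "INTIME": ["2170-11-05 11:05:29", "2103-09-27 18:29:30", "not a date"],
})

# Unparseable values become NaT instead of raising an error.
events["INTIME"] = pd.to_datetime(events["INTIME"], errors="coerce")

# Hour of day (0-23) and day of week (Monday=0 ... Sunday=6).
events["INTIME_HOUR"] = events["INTIME"].dt.hour
events["INTIME_WEEKDAY"] = events["INTIME"].dt.weekday
print(events)
```

Rows with an unparseable timestamp end up with missing values in the derived columns, which also turns those columns into floats, so such rows may need to be handled before modeling.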

## Construct the Final ICU Dataset
Finally, we construct the final dataset `df_final`, which gathers all static features for the filtered ICU patients. It includes identifiers, age, gender, administrative data, the first care unit, the extracted time features, and the target variable LOS. Rows with missing LOS are dropped, as this field is critical for predictive modeling. The result is a coherent dataset ready for clinical enrichment.

In [6]:
# Assemble final dataframe
df_final = df[[
    "SUBJECT_ID", "HADM_ID", "ICUSTAY_ID", "AGE", "GENDER",
    "ADMISSION_TYPE", "ADMISSION_LOCATION", "INSURANCE",
    "FIRST_CAREUNIT", "LOS", "HOSPITAL_EXPIRE_FLAG", "INTIME_HOUR", "INTIME_WEEKDAY",
    "ADMITTIME_HOUR", "ADMITTIME_WEEKDAY", "INTIME"
]]

# Remove any rows with missing LOS
df_final = df_final[df_final["LOS"].notnull()]

# Preview
print(f"df_final shape: {df_final.shape}")
# Save the final dataframe
df_final.to_csv(EXPORT_PATH + "df_final_static.csv", index=False)
df_final.head()

df_final shape: (3685, 16)


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,AGE,GENDER,ADMISSION_TYPE,ADMISSION_LOCATION,INSURANCE,FIRST_CAREUNIT,LOS,HOSPITAL_EXPIRE_FLAG,INTIME_HOUR,INTIME_WEEKDAY,ADMITTIME_HOUR,ADMITTIME_WEEKDAY,INTIME
0,269,106296,206613,40,M,EMERGENCY,EMERGENCY ROOM ADMIT,Medicaid,MICU,3.2788,0,11,0,11,0,2170-11-05 11:05:29
1,275,129886,219649,82,M,EMERGENCY,EMERGENCY ROOM ADMIT,Medicare,CCU,7.1314,1,11,6,3,5,2170-10-07 11:28:53
2,292,179726,222505,57,F,URGENT,TRANSFER FROM HOSP/EXTRAM,Private,MICU,0.8854,1,18,3,18,3,2103-09-27 18:29:30
3,305,194340,217232,76,F,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,Medicare,SICU,2.437,1,12,5,18,5,2129-09-03 12:31:31
4,323,143334,264375,57,M,EMERGENCY,EMERGENCY ROOM ADMIT,Medicare,MICU,3.0252,0,15,3,15,3,2120-01-11 15:48:28
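
Before enriching the dataset further, a few integrity checks can confirm the properties claimed above. The following is a minimal sketch; the helper name `check_static_dataset` is hypothetical, and the toy rows reuse three stays from the preview.

```python
import pandas as pd

def check_static_dataset(frame: pd.DataFrame) -> dict:
    """Return simple integrity indicators for a static ICU dataset."""
    return {
        "n_rows": len(frame),
        "unique_stays": bool(frame["ICUSTAY_ID"].is_unique),
        "missing_los": int(frame["LOS"].isna().sum()),
        "non_positive_los": int((frame["LOS"] <= 0).sum()),
    }

# Three stays taken from the preview of df_final.
toy = pd.DataFrame({
    "ICUSTAY_ID": [206613, 219649, 222505],
    "LOS": [3.2788, 7.1314, 0.8854],
})
report = check_static_dataset(toy)
print(report)
```

On the real data, the same function could be called as `check_static_dataset(df_final)`; any non-zero count or a `False` for `unique_stays` would point back to one of the filtering steps.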
