## CLIF Table One

Author: Kaveri Chhikara
Date v1: May 13, 2025

This script identifies the cohort of encounters with at least one ICU stay and then summarizes the cohort data into one table. 


#### Requirements

* Required table filenames should be `clif_patient`, `clif_hospitalization`, `clif_adt`, `clif_vitals`, `clif_labs`, `clif_medication_admin_continuous`, `clif_respiratory_support`, `clif_patient_assessments`
* Within each table, the following variables and categories are required.

| Table Name | Required Variables | Required Categories |
| --- | --- | --- |
| **clif_patient** | `patient_id`, `race_category`, `ethnicity_category`, `sex_category`, `death_dttm` | - |
| **clif_hospitalization** | `patient_id`, `hospitalization_id`, `admission_dttm`, `discharge_dttm`, `discharge_category`, `age_at_admission` | - |
| **clif_adt** |  `hospitalization_id`, `hospital_id`,`in_dttm`, `out_dttm`, `location_category` | - |
| **clif_vitals** | `hospitalization_id`, `recorded_dttm`, `vital_category`, `vital_value` | weight_kg |
| **clif_labs** | `hospitalization_id`, `lab_result_dttm`, `lab_order_dttm`, `lab_category`, `lab_value_numeric` | creatinine, bilirubin_total, po2_arterial, platelet_count |
| **clif_medication_admin_continuous** | `hospitalization_id`, `admin_dttm`, `med_name`, `med_category`, `med_dose`, `med_dose_unit` | norepinephrine, epinephrine, phenylephrine, vasopressin, dopamine, angiotensin (optional) |
| **clif_respiratory_support** | `hospitalization_id`, `recorded_dttm`, `device_category`, `mode_category`,  `fio2_set`, `lpm_set`, `resp_rate_set`, `peep_set`, `resp_rate_obs`, `tidal_volume_set`, `pressure_control_set`, `pressure_support_set` | - |
| **clif_patient_assessments** | `hospitalization_id`, `recorded_dttm` , `assessment_category`, `numerical_value`| `gcs_total` |
| **clif_crrt_therapy** | `hospitalization_id`, `recorded_dttm` | - |


## Cohort Identification


## Inclusion 
1. Adults
2. Patients with at least one ICU stay or those who had only emergency department or ward encounters and either died or received life support at any point. Life support is defined as the administration of any vasoactive drugs or respiratory support exceeding low-flow oxygen.

Respiratory support device: 'IMV', 'NIPPV', 'CPAP', 'High Flow NC'  

Vasoactive: 'norepinephrine', 'epinephrine', 'phenylephrine', 'vasopressin',
    'dopamine', 'angiotensin'
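
The life-support definition above can be sketched on made-up data. This is only an illustration of the rule, not the notebook's implementation: the `encounter` column and the toy rows are hypothetical, while the device and medication lists are the ones given above (lowercased).

```python
import pandas as pd

# Category lists taken from the inclusion criteria (lowercased for matching)
LIFE_SUPPORT_DEVICES = {"imv", "nippv", "cpap", "high flow nc"}
VASOACTIVES = {"norepinephrine", "epinephrine", "phenylephrine",
               "vasopressin", "dopamine", "angiotensin"}

# Hypothetical toy rows; "encounter" is an illustrative identifier
resp = pd.DataFrame({
    "encounter": ["A", "B", "C"],
    "device_category": ["Nasal Cannula", "IMV", "Room Air"],
})
meds = pd.DataFrame({
    "encounter": ["C"],
    "med_category": ["norepinephrine"],
})

# Encounters with respiratory support exceeding low-flow oxygen
on_device = set(
    resp.loc[resp["device_category"].str.lower().isin(LIFE_SUPPORT_DEVICES), "encounter"]
)
# Encounters with any vasoactive drug
on_vaso = set(
    meds.loc[meds["med_category"].str.lower().isin(VASOACTIVES), "encounter"]
)

# Life support = either criterion
life_support = on_device | on_vaso
print(sorted(life_support))
```

Encounter A is on a nasal cannula only, so it does not count as life support under this definition.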

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc
from pathlib import Path
import json
from typing import Union
from tqdm import tqdm

import sys
import clifpy
import os

print("=== Environment Verification ===")
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")
print(f"clifpy version: {clifpy.__version__}")
print(f"clifpy location: {clifpy.__file__}")

print("\n=== Python Path Check ===")
local_clifpy_path = "/Users/kavenchhikara/Desktop/CLIF/CLIFpy"
if any(local_clifpy_path in path for path in sys.path):
    print("⚠️  WARNING: Local CLIFpy still in path!")
    for path in sys.path:
        if local_clifpy_path in path:
            print(f"   Found: {path}")
else:
    print("✅ Clean environment - no local CLIFpy in path")

print(f"\n=== Working Directory ===")
print(f"Current directory: {os.getcwd()}")

In [None]:
# Load configuration
config_path = "../config/config.json"
with open(config_path, 'r') as f:
    config = json.load(f)

# Create the output directory for tableone results if it does not already exist
output_dir = Path("../output/final/tableone")
output_dir.mkdir(parents=True, exist_ok=True)

print("\nConfiguration:")
print(f"   Data directory: {config['tables_path']}")
print(f"   File type: {config['file_type']}")
print(f"   Timezone: {config['timezone']}")
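
The cell above reads three keys from `config.json`. A minimal sketch of checking for them before use is shown below; the helper `check_config` and the example values are hypothetical placeholders, and only the key names come from the code above.

```python
import json

# Keys the notebook reads from the config file
REQUIRED_KEYS = ("tables_path", "file_type", "timezone")

def check_config(cfg: dict) -> list:
    """Return the required keys that are missing or empty in cfg."""
    return [key for key in REQUIRED_KEYS if not cfg.get(key)]

# Placeholder values for illustration only
example_cfg = json.loads(
    '{"tables_path": "/path/to/clif/tables", "file_type": "parquet", "timezone": "UTC"}'
)
missing = check_config(example_cfg)
print(missing)

# A partial config reports what still needs to be filled in
missing_partial = check_config({"tables_path": "/path/to/clif/tables"})
print(missing_partial)
```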

## Required columns and categories

In [None]:
print("\n" + "=" * 80)
print("Defining Required Data Elements")
print("=" * 80)

# Full patient table 

# Full hospitalization table 

# Full ADT table

# Vitals
vitals_required_columns = [
    'hospitalization_id',
    'recorded_dttm',
    'vital_category',
    'vital_value'
]
vitals_of_interest = ['heart_rate', 'respiratory_rate', 'sbp', 'dbp', 'map', 'spo2', 'weight_kg', 'height_cm']

# Respiratory Support 
rst_required_columns = [
    'hospitalization_id',
    'recorded_dttm',
    'device_name',
    'device_category',
    'mode_name', 
    'mode_category',
    'tracheostomy',
    'fio2_set',
    'lpm_set',
    'resp_rate_set',
    'peep_set',
    'resp_rate_obs',
    'tidal_volume_set', 
    'pressure_control_set',
    'pressure_support_set',
    'peak_inspiratory_pressure_set',
    'peak_inspiratory_pressure_obs',
    'plateau_pressure_obs',
    'minute_vent_obs'
]


# Continuous administered meds
meds_required_columns = [
    'hospitalization_id',
    'admin_dttm',
    'med_name',
    'med_category',
    'med_dose',
    'med_dose_unit'
]
meds_of_interest = [
    'norepinephrine', 'epinephrine', 'phenylephrine', 'vasopressin',
    'dopamine', 'angiotensin', 'dobutamine', 'milrinone', 'isoproterenol',
    'propofol', 'midazolam', 'lorazepam', 'dexmedetomidine', 
    'vecuronium', 'rocuronium', 'cisatracurium', 'pancuronium'
]

In [None]:
strobe_counts = {}

## Functions

In [None]:
# def generate_hourly_sequence(group):
#     blk = group.name  # use group name from groupby
#     start_time = group['vent_episode_start_dttm'].iloc[0]
#     end_time   = group['vent_end_dttm_72h'].iloc[0]
#     hourly_timestamps = pd.date_range(start=start_time, end=end_time, freq='h')
#     return pd.DataFrame({
#         'hospitalization_id': blk,
#         'recorded_dttm': hourly_timestamps
#     })

# def calculate_ibw(height_cm, sex):
#     if pd.isna(height_cm) or pd.isna(sex):
#         return np.nan
#     height_inches = height_cm / 2.54
#     sex = str(sex).lower()
#     if sex == 'male':
#         return 50 + 2.3 * (height_inches - 60)
#     elif sex == 'female':
#         return 45.5 + 2.3 * (height_inches - 60)
#     else:
#         return np.nan

# def calculate_base_excess(ph, hco3):
#     """
#     Calculate Base Excess using simplified formula
#     BE = (HCO3 - 24.4) + (8.3 * (pH - 7.4))
#     """
#     return (hco3 - 24.4) + (8.3 * (ph - 7.4))

# def calculate_pf_ratio(po2, fio2):
#     """
#     Vectorized calculation of P/F ratio (PaO2/FiO2)
#     FiO2 should be as fraction (0.21-1.0), not percentage
#     Handles pandas Series input.
#     """
#     fio2 = fio2.copy()
#     # Convert percentage to fraction if needed
#     mask_pct = fio2 > 1
#     fio2[mask_pct] = fio2[mask_pct] / 100
#     # Set minimum fio2 to 0.21 (room air)
#     fio2 = fio2.clip(lower=0.21)
#     return po2 / fio2

# def process_crrt_waterfall(
#     crrt: pd.DataFrame,
#     *,
#     id_col: str = "hospitalization_id",
#     gap_thresh: Union[str, pd.Timedelta] = "2h",
#     infer_modes: bool = True,          # infer missing mode from numeric pattern
#     flag_missing_bfr: bool = True,     # add QC flag if blood-flow still NaN
#     wipe_unused: bool = True,          # null parameters not used by the mode
#     fix_islands: bool = True,          # relabel single-row SCUF islands
#     verbose: bool = True,
# ) -> pd.DataFrame:
#     """
#     Episode-aware clean-up and forward-fill of the CLIF `crrt_therapy` table.

#     The function mirrors the respiratory-support “waterfall” logic but adapts it to
#     the quirks of Continuous Renal Replacement Therapy (CRRT):

#     1. **Episode detection** - a new `crrt_episode_id` starts whenever  
#        • `crrt_mode_category` changes **OR**  
#        • the gap between successive rows exceeds *gap_thresh* (default 2 h).
#     2. **Numeric forward-fill inside an episode** - fills *only* the parameters
#        that are clinically relevant for the active mode.
#     3. **Mode-specific wiping** after filling, parameters that are **not used**
#        in the current mode (e.g. `dialysate_flow_rate` in SCUF) are nulled so
#        stale data never bleed across modes.
#     4. **Deduplication & ordering** guarantees exactly **one row per
#        `(id_col, recorded_dttm)`**, chronologically ordered.

#     Parameters
#     ----------
#     crrt : pd.DataFrame
#         Raw `crrt_therapy` table **in UTC**. Must contain the schema columns
#         defined on the CLIF website (see docstring footer).
#     id_col : str, default ``"hospitalization_id"``
#         Encounter-level identifier.
#     gap_thresh : str or pd.Timedelta, default ``"2h"``
#         Maximum tolerated gap **inside** an episode before a new episode is
#         forced. Accepts any pandas-parsable offset string (``"90min"``, ``"3h"``,
#         …) or a ``pd.Timedelta``.
#     verbose : bool, default ``True``
#         If *True* prints progress banners.

#     Returns
#     -------
#     pd.DataFrame
#         Processed CRRT DataFrame with

#         * ``crrt_episode_id`` (int32) - sequential per encounter,
#         * forward-filled numeric parameters **within** each episode,
#         * unused parameters blanked per mode,
#         * unique, ordered rows ``id_col, recorded_dttm``.

#     Add-ons v2.0
#     ------------
#     • Optional numeric-pattern inference of `crrt_mode_category`.
#     • Flags rows that *should* have blood-flow but don't.
#     • Optional fix for single-row modality islands (sandwiched rows).
#     • Optional wipe vs. keep of parameters not used by the active mode.

#     Key steps
#     ----------
#     0.  Lower-case strings, coerce numerics, **infer** mode when blank.
#     1.  **Relabel single-row SCUF islands** (if *fix_islands*).
#     2.  Detect `crrt_episode_id` (mode change or >gap_thresh).
#     3.  Forward-fill numeric parameters *within* an episode.
#     4.  QC flag → `blood_flow_missing_after_ffill` (optional).
#     5.  Wipe / flag parameters not valid for the mode (configurable).
#     6.  Deduplicate & order ⇒ one row per ``(id_col, recorded_dttm)``.
#     """
#     p = print if verbose else (lambda *_, **__: None)
#     gap_thresh = pd.Timedelta(gap_thresh)

#     # ───────────── Phase 0 — prep, numeric coercion, optional inference
#     p("✦ Phase 0: prep & numeric coercion (+optional mode inference)")
#     df = crrt.copy()

#     df["crrt_mode_category"] = df["crrt_mode_category"].str.lower()
#     # save original dialysate_flow_rate values
#     df["_orig_df"] = df["dialysate_flow_rate"]

#     # 0a) RAW SCUF DF‐OUT sanity check
#     # look for rows that are already labeled “scuf”
#     # and that have a non‐zero dialysate_flow_rate in the raw data
#     raw_scuf = df["crrt_mode_category"].str.lower() == "scuf"
#     raw_df_positive = df["_orig_df"].fillna(0) > 0

#     n_bad = (raw_scuf & raw_df_positive).sum()
#     if n_bad:
#         print(f"!!!  Found {n_bad} raw SCUF rows with dialysate_flow_rate > 0 (should be 0 or NA)")
#         print(" Converting these mode category to NA, keep recorded numerical values as the ground truth")
#         df.loc[raw_scuf & raw_df_positive, "crrt_mode_category"] = np.nan
#     else:
#         print("!!! No raw SCUF rows had dialysate_flow_rate > 0")

#     NUM_COLS = [
#         "blood_flow_rate",
#         "pre_filter_replacement_fluid_rate",
#         "post_filter_replacement_fluid_rate",
#         "dialysate_flow_rate",
#         "ultrafiltration_out",
#     ]
#     NUM_COLS = [c for c in NUM_COLS if c in df.columns]
#     df[NUM_COLS] = df[NUM_COLS].apply(pd.to_numeric, errors="coerce")

#     #  any row whose original dialysate_flow_rate was >0 must never be SCUF
#     def drop_scuf_on_positive_df(df, p):
#         bad_df  = df["_orig_df"].fillna(0) > 0
#         scuf_now = df["crrt_mode_category"] == "scuf"
#         n = (bad_df & scuf_now).sum()
#         if n:
#             p(f"→ Removing {n:,} SCUF labels on rows with DF>0")
#             df.loc[bad_df & scuf_now, "crrt_mode_category"] = np.nan
            

#     if infer_modes:
#         miss = df["crrt_mode_category"].isna()
#         pre  = df["pre_filter_replacement_fluid_rate"].notna()
#         post = df["post_filter_replacement_fluid_rate"].notna()
#         dial = df["dialysate_flow_rate"].notna()
#         bf   = df["blood_flow_rate"].notna()
#         uf   = df["ultrafiltration_out"].notna()
#         all_num_present = df[NUM_COLS].notna().all(axis=1)

#         df.loc[miss & all_num_present,                       "crrt_mode_category"] = "cvvhdf"
#         df.loc[miss & (~dial) & pre & post,                  "crrt_mode_category"] = "cvvh"
#         df.loc[miss & dial & (~pre) & (~post),               "crrt_mode_category"] = "cvvhd"
#         df.loc[miss & (~dial) & (~pre) & (~post) & bf & uf,  "crrt_mode_category"] = "scuf"

#         filled = (miss & df["crrt_mode_category"].notna()).sum()
#         p(f"  • numeric-pattern inference filled {filled:,} missing modes")
#         drop_scuf_on_positive_df(df, p)

#     # ───────────── Phase 1 — sort and *fix islands before episodes*
#     p("✦ Phase 1: sort + SCUF-island fix")
#     df = df.sort_values([id_col, "recorded_dttm"]).reset_index(drop=True)

#     if fix_islands:
#         # after sorting, BEFORE episode detection
#         prev_mode = df.groupby(id_col)["crrt_mode_category"].shift()
#         next_mode = df.groupby(id_col)["crrt_mode_category"].shift(-1)

#         scuf_island = (
#             (df["crrt_mode_category"] == "scuf") &
#             (prev_mode.notna()) & (next_mode.notna()) &     # ensure we have neighbours
#             (prev_mode == next_mode)                        # both neighbours agree
#         )

#         df.loc[scuf_island, "crrt_mode_category"] = prev_mode[scuf_island]
#         n_fixed = scuf_island.sum()
#         p(f"  • relabelled {n_fixed:,} SCUF-island rows")
#         drop_scuf_on_positive_df(df, p)


#     # ───────────── Phase 2 — episode detection (now with fixed modes)
#     p("✦ Phase 2: derive `crrt_episode_id`")
#     mode_change = (
#         df.groupby(id_col)["crrt_mode_category"]
#           .apply(lambda s: s != s.shift())
#           .reset_index(level=0, drop=True)
#     )
#     time_gap = df.groupby(id_col)["recorded_dttm"].diff().gt(gap_thresh).fillna(False)
#     df["crrt_episode_id"] = ((mode_change | time_gap)
#                               .groupby(df[id_col]).cumsum()
#                               .astype("int32"))

#     # ───────────── Phase 3 — forward-fill numerics inside episodes
#     p("✦ Phase 3: forward-fill numeric vars inside episodes")
#     tqdm.pandas(disable=not verbose, desc="ffill per episode")
#     df[NUM_COLS] = (
#         df.groupby([id_col, "crrt_episode_id"], sort=False, group_keys=False)[NUM_COLS]
#           .progress_apply(lambda g: g.ffill())
#     )

#     # QC: blood-flow still missing?
#     if flag_missing_bfr and "blood_flow_rate" in NUM_COLS:
#         need_bfr = df["crrt_mode_category"].isin(["scuf", "cvvh", "cvvhd", "cvvhdf"])
#         df["blood_flow_missing_after_ffill"] = need_bfr & df["blood_flow_rate"].isna()
#         p(f"  • blood-flow still missing where required: "
#           f"{df['blood_flow_missing_after_ffill'].mean():.1%}")
        
#     # Bridge tiny episodes
    
#     single_row_ep = (
#         df.groupby([id_col, "crrt_episode_id"]).size() == 1
#     ).reset_index(name="n").query("n == 1")
#     print("Bridging single row episodes")

#     rows_to_bridge = df.merge(single_row_ep[[id_col, "crrt_episode_id"]],
#                             on=[id_col, "crrt_episode_id"]).index
    
#     CAT_COLS = [c for c in ["crrt_mode_category"] if c in df.columns]

#     # Combine with the numeric columns we already had
#     BRIDGE_COLS = NUM_COLS + CAT_COLS

#     # Forward-fill within each encounter (no back-fill is applied, so an island on the first row of an encounter is left unchanged)
#     df.loc[rows_to_bridge, BRIDGE_COLS] = (
#         df.loc[rows_to_bridge, BRIDGE_COLS]
#         .groupby(df.loc[rows_to_bridge, id_col])      # keep encounter boundaries
#         .apply(lambda g: g.ffill())          
#         .reset_index(level=0, drop=True)
#     )
#     drop_scuf_on_positive_df(df, p)
#     # ───────────── Phase 4 — wipe / flag unused parameters
#     p("✦ Phase 4: handle parameters not valid for the mode")
#     MODE_PARAM_MAP = {
#         "scuf":   {"blood_flow_rate", "ultrafiltration_out"},
#         "cvvh":   {"blood_flow_rate", "pre_filter_replacement_fluid_rate",
#                    "post_filter_replacement_fluid_rate", "ultrafiltration_out"},
#         "cvvhd":  {"blood_flow_rate", "dialysate_flow_rate", "ultrafiltration_out"},
#         "cvvhdf": {"blood_flow_rate", "pre_filter_replacement_fluid_rate","post_filter_replacement_fluid_rate",
#                    "dialysate_flow_rate", "ultrafiltration_out"},
#     }

#     wiped_totals = {c: 0 for c in NUM_COLS}
#     for mode, keep in MODE_PARAM_MAP.items():
#         mask = df["crrt_mode_category"] == mode
#         drop_cols = list(set(NUM_COLS) - keep)
#         if wipe_unused:
#             for col in drop_cols:
#                 wiped_totals[col] += df.loc[mask, col].notna().sum()
#             df.loc[mask, drop_cols] = np.nan
#         else:
#             for col in drop_cols:
#                 df.loc[mask & df[col].notna(), f"{col}_unexpected"] = True

#     if verbose and wipe_unused:
#         p("  • cells set → NA by wipe:")
#         for col, n in wiped_totals.items():
#             p(f"    {col:<35} {n:>8,}")
#     # ───────────── Phase 4a — SCUF‐specific sanity check
#     if "dialysate_flow_rate" in df.columns:
#         # only consider rows that were originally SCUF mode
#         # and whose original _orig_df was non‐zero/non‐NA
#         scuf_rows = df["crrt_mode_category"] == "scuf"
#         orig_bad = df["_orig_df"].fillna(0) > 0

#         # these are rows where the *original* data had DF>0 despite SCUF
#         bad_orig_scuf = scuf_rows & orig_bad

#         n_bad_orig = bad_orig_scuf.sum()
#         if n_bad_orig:
#             p(f"!!! {n_bad_orig} rows originally labeled SCUF had DF>0 (raw data); forcing DF→NA for those")
#             df.loc[bad_orig_scuf, "dialysate_flow_rate"] = np.nan
#         else:
#             p("!!! No SCUF rows with DF>0")

#     # then drop the helper column
#     df = df.drop(columns="_orig_df")

#     # ───────────── Phase 5 — deduplicate & order
#     p("✦ Phase 5: deduplicate & order")
#     pre = len(df)
#     df = (
#         df.drop_duplicates(subset=[id_col, "recorded_dttm"])
#           .sort_values([id_col, "recorded_dttm"])
#           .reset_index(drop=True)
#     )
#     p(f"  • dropped {pre - len(df):,} duplicate rows")

#     if verbose:
#         sparse = df[NUM_COLS].isna().all(axis=1).mean()
#         p(f"  • rows with all NUM_COLS missing: {sparse:.1%}")

#     p("[OK] CRRT waterfall complete.")
#     return df

## Cohort identification

In [None]:
print("\n" + "=" * 80)
print("Loading CLIF Tables")
print("=" * 80)

from clifpy.clif_orchestrator import ClifOrchestrator

# Initialize ClifOrchestrator
clif = ClifOrchestrator(
    data_directory=config['tables_path'],
    filetype=config['file_type'],
    timezone=config['timezone']
)

## Step 0: Load Core Tables

In [None]:
# ============================================================================
# STEP 0: Load Core Tables (Patient, Hospitalization, ADT)
# ============================================================================
print("\n" + "=" * 80)
print("Step 0: Load Core Tables (Patient, Hospitalization, ADT)")
print("=" * 80)
core_tables = ['patient', 'hospitalization', 'adt']

print(f"\nLoading {len(core_tables)} core tables...")
for table_name in core_tables:
    print(f"   Loading {table_name}...", end=" ")
    try:
        clif.load_table(table_name)
        table = getattr(clif, table_name)
        print(f"✓ ({len(table.df):,} rows)")
    except Exception as e:
        print(f"✗ Error: {e}")
        raise

print("\nCore tables loaded successfully!")

In [None]:
hosp_df = clif.hospitalization.df
adt_df = clif.adt.df

# Merge hospitalization and ADT so each location row carries admission, discharge, and age information
all_encounters = pd.merge(
    hosp_df[["patient_id", "hospitalization_id", "admission_dttm", "discharge_dttm", 
             "age_at_admission", "discharge_category"]],
    adt_df[["hospitalization_id", "hospital_id", "in_dttm", "out_dttm", 
            "location_category", "location_type"]],
    on='hospitalization_id',
    how='inner'
)

In [None]:
# Check for duplicates by ['hospitalization_id', 'in_dttm', 'out_dttm']
dup_counts = all_encounters.duplicated(subset=['hospitalization_id', 'in_dttm', 'out_dttm']).sum()
if dup_counts > 0:
    print(f"Warning: {dup_counts} duplicate (hospitalization_id, in_dttm, out_dttm) entries found in all_encounters.")
else:
    print("No duplicate (hospitalization_id, in_dttm, out_dttm) entries found in all_encounters.")

## Step 1: Date & Age Filter

In [None]:
all_encounters.columns

In [None]:
# ============================================================================
# STEP 1: Identify Adult Patients (Age >= 18) and Admissions 2018-2024
# ============================================================================
print("\n" + "=" * 80)
print("Step 1: Identifying Adult Patients (Age >= 18) and Admissions 2018-2024")
print("=" * 80)

print("Applying initial cohort filters...")

# Use only the relevant columns from all_encounters
adult_encounters = all_encounters[
    [
        'patient_id', 'hospitalization_id', 'admission_dttm', 'discharge_dttm',
        'age_at_admission', 'discharge_category', 'hospital_id',
        'in_dttm', 'out_dttm', 'location_category', 'location_type'
    ]
].copy()

if config['timezone'].lower() == "mimic":
    # MIMIC: only age >= 18, no admit year restriction
    adult_encounters = adult_encounters[
        (adult_encounters['age_at_admission'] >= 18) & (adult_encounters['age_at_admission'].notna())
    ]
else:
    # Other sites: age >= 18 and admission between 2018-2024 inclusive
    adult_encounters = adult_encounters[
        (adult_encounters['age_at_admission'] >= 18) &
        (adult_encounters['age_at_admission'].notna()) &
        (adult_encounters['admission_dttm'].dt.year >= 2018) &
        (adult_encounters['admission_dttm'].dt.year <= 2024)
    ]

print(f"\nFiltering Results:")
print(f"   Total hospitalizations: {len(all_encounters['hospitalization_id'].unique()):,}")
print(f"   Adult hospitalizations (age >= 18, 2018-2024): {len(adult_encounters['hospitalization_id'].unique()):,}")
print(f"   Excluded (age < 18 or outside 2018-2024): {len(all_encounters['hospitalization_id'].unique()) - len(adult_encounters['hospitalization_id'].unique()):,}")


strobe_counts["0_total_hospitalizations"] = len(all_encounters['hospitalization_id'].unique())
strobe_counts["1_adult_hospitalizations"] = len(adult_encounters['hospitalization_id'].unique())
# Get list of adult hospitalization IDs for filtering
adult_hosp_ids = set(adult_encounters['hospitalization_id'].unique())
print(f"\n   Unique adult hospitalization IDs: {len(adult_hosp_ids):,}")

### Stitch hospitalizations 

If the `id_col` supplied by the user is `hospitalization_id`, we combine multiple `hospitalization_id` values into a single `encounter_block` for patients who transfer between hospital campuses or return soon after discharge. Hospitalizations separated by a gap of **6 hours or less** between one discharge (`discharge_dttm`) and the next admission (`admission_dttm`) are placed in the same encounter block.

If the `id_col` supplied by the user is `hospitalization_joined_id` from the hospitalization table, we assume the user has already stitched related encounters and use that column as the primary ID for all subsequent table joins.
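
The 6-hour rule can be illustrated with plain pandas. This is a sketch of the idea on made-up data, assuming gaps are measured within the same patient; it is not necessarily how `stitch_encounters` in clifpy is implemented.

```python
import pandas as pd

# Hypothetical hospitalizations: h1 and h2 are 4 hours apart, h3 is days later
toy = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2"],
    "hospitalization_id": ["h1", "h2", "h3", "h4"],
    "admission_dttm": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-03 14:00", "2024-01-10 09:00", "2024-02-01 00:00",
    ]),
    "discharge_dttm": pd.to_datetime([
        "2024-01-03 10:00", "2024-01-05 12:00", "2024-01-12 09:00", "2024-02-02 00:00",
    ]),
}).sort_values(["patient_id", "admission_dttm"]).reset_index(drop=True)

# Gap between this admission and the same patient's previous discharge
prev_discharge = toy.groupby("patient_id")["discharge_dttm"].shift()
gap = toy["admission_dttm"] - prev_discharge

# A new block starts on a patient's first stay or when the gap exceeds 6 hours
new_block = prev_discharge.isna() | (gap > pd.Timedelta(hours=6))
toy["encounter_block"] = new_block.cumsum()

print(toy[["hospitalization_id", "encounter_block"]])
```

Here `h1` and `h2` share a block because the readmission comes 4 hours after discharge, while `h3` and `h4` each start a new block.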

In [None]:
from clifpy.utils.stitching_encounters import stitch_encounters

# stitch hospitalizations
hosp_filtered = clif.hospitalization.df[clif.hospitalization.df['hospitalization_id'].isin(adult_hosp_ids)]
adt_filtered = clif.adt.df[clif.adt.df['hospitalization_id'].isin(adult_hosp_ids)]

hosp_stitched, adt_stitched, encounter_mapping = stitch_encounters(
    hospitalization=hosp_filtered,
    adt=adt_filtered,
    time_interval=6  
)

# Direct assignment without additional copies
clif.hospitalization.df = hosp_stitched
clif.adt.df = adt_stitched

# Store the encounter mapping in the orchestrator for later use
clif.encounter_mapping = encounter_mapping

# Clean up intermediate variables
del hosp_filtered, adt_filtered
gc.collect()

In [None]:
# Calculate stitching statistics
strobe_counts['1b_before_stitching'] = len(adult_hosp_ids)  # Original adult hospitalizations
strobe_counts['1b_after_stitching'] = len(hosp_stitched['encounter_block'].unique())  # Unique encounter blocks after stitching
strobe_counts['1b_stitched_hosp_ids'] = strobe_counts['1b_before_stitching'] - strobe_counts['1b_after_stitching']  # Number of hospitalizations that were linked

print(f"\nEncounter Stitching Results:")
print(f"   Number of unique hospitalizations before stitching: {strobe_counts['1b_before_stitching']:,}")
print(f"   Number of unique encounter blocks after stitching: {strobe_counts['1b_after_stitching']:,}")
print(f"   Number of linked hospitalization ids: {strobe_counts['1b_stitched_hosp_ids']:,}")

# Optional: Show the encounter mapping details
print(f"\nEncounter Mapping Details:")
print(f"   Total encounter mappings created: {len(encounter_mapping):,}")
if len(encounter_mapping) > 0:
    # Show some examples of how many original hospitalizations were combined
    mapping_counts = encounter_mapping.groupby('encounter_block').size()
    print(f"   Encounter blocks with multiple hospitalizations: {(mapping_counts > 1).sum():,}")
    print(f"   Maximum hospitalizations combined into one block: {mapping_counts.max()}")

# ADT

In [None]:
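# Merge all_encounters with encounter_mapping to get encounter_block information.
# The mapping only covers adult hospitalizations, so an inner join keeps the adult
# cohort and avoids rows with a missing encounter_block in the groupby below.
all_encounters = pd.merge(all_encounters, encounter_mapping, on='hospitalization_id', how='inner')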

# Convert location_category and discharge_category to lowercase in place (vectorized)
all_encounters['location_category'] = all_encounters['location_category'].str.lower()
all_encounters['discharge_category'] = all_encounters['discharge_category'].str.lower()

# Create vectorized ICU and death masks
icu_mask = all_encounters['location_category'].str.contains('icu', na=False)
death_mask = all_encounters['discharge_category'].isin(['expired', 'hospice'])

# Vectorized: For each encounter_block, does any row have ICU or death? (much faster)
# Use groupby('encounter_block')[mask].transform('any') to vectorize
all_encounters['icu_enc'] = icu_mask.groupby(all_encounters['encounter_block']).transform('any').astype(int)
all_encounters['death_enc'] = death_mask.groupby(all_encounters['encounter_block']).transform('any').astype(int)

# Cohort flag using logical OR (vectorized)
all_encounters['cohort_enc'] = (all_encounters['icu_enc'] | all_encounters['death_enc']).astype(int)

# Store hospitalization_ids for cohort_enc==1 in a list (as before)
cohort_enc_hospitalization_ids = all_encounters.loc[all_encounters['cohort_enc'] == 1, 'hospitalization_id'].unique().tolist()

In [None]:
# Identify encounters where death occurred
death_encounters = all_encounters[all_encounters['death_enc'] == 1]
# Identify those that never touched the ICU
non_icu_deaths = death_encounters[~death_encounters['icu_enc'].astype(bool)]
# Count the number of unique encounters with deaths outside of ICU
num_deaths_outside_icu = non_icu_deaths['encounter_block'].nunique()
# Total number of unique encounter blocks (denominator for the percentage)
total_encounters = all_encounters['encounter_block'].nunique()
# Calculate the percentage
pct_deaths_outside_icu = (num_deaths_outside_icu / total_encounters * 100) if total_encounters > 0 else 0
print(f"Number of deaths outside ICU: {num_deaths_outside_icu} ({pct_deaths_outside_icu:.1f}% of all hospitalizations)")

# Add ICU encounters to strobe counts as 1_icu_encounters
num_icu_encounters = all_encounters[all_encounters['icu_enc'] == 1]['encounter_block'].nunique()
if 'strobe_counts' not in globals():
    strobe_counts = {}
strobe_counts['1_icu_encounters'] = num_icu_encounters

In [None]:
final_cohort = all_encounters[
    all_encounters['hospitalization_id'].isin(cohort_enc_hospitalization_ids)
][['encounter_block', 'icu_enc', 'death_enc', 'cohort_enc']].drop_duplicates()
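
The block-level flags above rely on `groupby(...).transform('any')`, which broadcasts "any row in this block is True" back to every row of the block. A minimal sketch on made-up rows (the values are illustrative; column names follow the cells above):

```python
import pandas as pd

# Hypothetical location rows: block 1 moves from ED to ICU, block 2 dies on the ward
rows = pd.DataFrame({
    "encounter_block": [1, 1, 2, 3],
    "location_category": ["ed", "icu", "ward", "ward"],
    "discharge_category": ["home", "home", "expired", "home"],
})

# Row-level masks
icu_row = rows["location_category"].str.contains("icu", na=False)
death_row = rows["discharge_category"].isin(["expired", "hospice"])

# Broadcast "any row in the block" back onto every row of that block
rows["icu_enc"] = icu_row.groupby(rows["encounter_block"]).transform("any").astype(int)
rows["death_enc"] = death_row.groupby(rows["encounter_block"]).transform("any").astype(int)
rows["cohort_enc"] = rows["icu_enc"] | rows["death_enc"]

# One row per encounter block
blocks = rows[["encounter_block", "icu_enc", "death_enc", "cohort_enc"]].drop_duplicates()
print(blocks)
```

The ED row of block 1 inherits `icu_enc = 1` from the later ICU row, which is why de-duplicating afterwards yields exactly one row per block.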

# Respiratory Support

In [None]:
# ============================================================================
# STEP 2: Load Respiratory Support and Identify Patients on Advanced Respiratory support 
# ============================================================================
print("\n" + "=" * 80)
print("Step 2: Loading Respiratory Support and Identifying Patients on Advanced Respiratory Support")
print("=" * 80)

print(f"\nLoading respiratory_support table...")
clif.load_table('respiratory_support',
                        columns=rst_required_columns,
                        filters={'hospitalization_id': list(adult_hosp_ids)})
print(f"Respiratory support loaded ({len(clif.respiratory_support.df):,} rows)")

# Standardize category columns to lowercase
print(f"\nStandardizing category columns...")
category_cols = [col for col in clif.respiratory_support.df.columns if col.endswith('_category')]
for col in category_cols:
    clif.respiratory_support.df[col] = clif.respiratory_support.df[col].str.lower()

In [None]:
# Identify encounter blocks on advanced respiratory support
print(f"\nIdentifying hospitalizations with advanced respiratory support devices...")
device_types = ['imv', 'nippv', 'cpap', 'high flow nc']
clif.respiratory_support.df = pd.merge(clif.respiratory_support.df, encounter_mapping, 
                                        on='hospitalization_id', how='left')
advanced_support_hosp_ids = clif.respiratory_support.df.loc[
    clif.respiratory_support.df['device_category'].str.lower().isin([d.lower() for d in device_types]),
    'encounter_block'
].unique()
print(f"Hospitalizations with any advanced resp. device ({', '.join(device_types).upper()}): {len(advanced_support_hosp_ids):,}")
strobe_counts["2_advanced_resp_support_hospitalizations"] = len(advanced_support_hosp_ids)

In [None]:
# Create a DataFrame of encounter blocks on advanced respiratory support, flagged with 'high_support_enc' == 1
advanced_support_df = pd.DataFrame({
    'encounter_block': advanced_support_hosp_ids,
    'high_support_enc': 1
})

In [None]:
# Perform a full join (outer merge) of final_cohort and advanced_support_df on 'encounter_block'
final_cohort = final_cohort.merge(
    advanced_support_df,
    on='encounter_block',
    how='outer'
)


# Vasoactives

In [None]:
print(f"\nLoading medication_admin_continuous table...")
clif.load_table(
    'medication_admin_continuous',
    columns=meds_required_columns,
    filters={
        'hospitalization_id': list(adult_hosp_ids),
        'med_category': meds_of_interest
    }
)

In [None]:
# Identify encounter blocks that received any vasoactive medication
print(f"\nIdentifying hospitalizations with vasoactive medications...")
vasoactive_meds = ['norepinephrine', 'epinephrine', 'phenylephrine', 'vasopressin',
                   'dopamine', 'angiotensin']
clif.medication_admin_continuous.df= pd.merge(clif.medication_admin_continuous.df, encounter_mapping, 
                                        on='hospitalization_id', how='left')
vasoactive_hosp_ids = clif.medication_admin_continuous.df.loc[
    clif.medication_admin_continuous.df['med_category'].str.lower().isin([d.lower() for d in vasoactive_meds]),
    'encounter_block'
].unique()
print(f"Hospitalizations with any vasoactive ({', '.join(vasoactive_meds).upper()}): {len(vasoactive_hosp_ids):,}")
strobe_counts["3_vasoactive_hospitalizations"] = len(vasoactive_hosp_ids)

In [None]:
# Create a DataFrame of encounter blocks on vasoactives, flagged with 'vaso_support_enc' == 1
vasoactives_df = pd.DataFrame({
    'encounter_block': vasoactive_hosp_ids,
    'vaso_support_enc': 1
})

In [None]:
# Outer-join vasoactives_df with the final cohort on encounter_block
final_cohort = final_cohort.merge(
    vasoactives_df,
    on='encounter_block',
    how='outer'
)

In [None]:
# Missing vaso_support_enc means no vasoactive medication was recorded
final_cohort['vaso_support_enc'] = final_cohort['vaso_support_enc'].fillna(0).astype(int)
# Missing high_support_enc means not on advanced respiratory support
final_cohort['high_support_enc'] = final_cohort['high_support_enc'].fillna(0).astype(int)
# Missing icu_enc means not ICU
final_cohort['icu_enc'] = final_cohort['icu_enc'].fillna(0).astype(int)
# Other critically ill: in the cohort with no ICU stay, no vasoactives, and no advanced respiratory support
final_cohort['other_critically_ill'] = (
    (final_cohort[['icu_enc', 'vaso_support_enc', 'high_support_enc']].sum(axis=1) == 0)
).astype(int)
# Calculate the count
strobe_counts['4_other_critically_ill'] = final_cohort.loc[final_cohort['other_critically_ill'] == 1, 
                                                            'encounter_block'].nunique()
strobe_counts['5_all_critically_ill'] = final_cohort['encounter_block'].nunique()
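
The flag logic above can be checked on a toy example. The sketch below uses plain Python sets in place of the DataFrame merges; all IDs are made up, and `death_ids` stands in for the assumption that some encounters enter the cohort through an earlier criterion (such as death) rather than through ICU, respiratory, or vasoactive support. It also shows why the four group counts need not add up to the total: an encounter can belong to several groups at once.

```python
# Toy encounter_block IDs; every value here is made up for illustration.
icu_ids = {1, 2, 3}
resp_ids = {2, 3, 4}
vaso_ids = {3, 5}
death_ids = {3, 6}  # assumption: an earlier inclusion criterion, e.g. death

# The cohort is the union of all qualifying groups (what the outer merges build).
cohort_ids = icu_ids | resp_ids | vaso_ids | death_ids

# One row of 0/1 flags per encounter, mirroring the fillna(0).astype(int) step.
flags = {
    enc: {
        "icu_enc": int(enc in icu_ids),
        "high_support_enc": int(enc in resp_ids),
        "vaso_support_enc": int(enc in vaso_ids),
    }
    for enc in sorted(cohort_ids)
}

# "Other critically ill": in the cohort but with none of the three flags set.
other_ids = {enc for enc, f in flags.items() if sum(f.values()) == 0}

group_total = len(icu_ids) + len(resp_ids) + len(vaso_ids) + len(other_ids)
print(f"other critically ill: {sorted(other_ids)}")
print(f"sum of group sizes: {group_total}, distinct encounters: {len(cohort_ids)}")
```

In this toy data the group sizes sum to more than the number of distinct encounters because encounters 2 and 3 are counted in more than one group.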

# Summary

In [None]:
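import os
import pandas as pd

# Make sure the output directory exists before writing to it
os.makedirs('../output/final/tableone', exist_ok=True)
strobe_counts_df = pd.DataFrame(list(strobe_counts.items()), columns=['count_name', 'count_value'])
strobe_counts_df.to_csv('../output/final/tableone/strobe_counts.csv', index=False)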
# Calculate mortality rates
mortality_rates = {
    'ICU Hospitalizations': final_cohort.loc[final_cohort['icu_enc'] == 1, 'death_enc'].mean() * 100,
    'Advanced Respiratory Support': final_cohort.loc[final_cohort['high_support_enc'] == 1, 'death_enc'].mean() * 100,
    'Vasoactive Hospitalizations': final_cohort.loc[final_cohort['vaso_support_enc'] == 1, 'death_enc'].mean() * 100,
    'Other Critically Ill': final_cohort.loc[final_cohort['other_critically_ill'] == 1, 'death_enc'].mean() * 100,
    'All Critically Ill Adults': final_cohort['death_enc'].mean() * 100,
}
mortality_rates_df = pd.DataFrame(list(mortality_rates.items()), columns=['count_name', 'count_value'])
mortality_rates_df.to_csv('../output/final/tableone/mortality_rates.csv', index=False)
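
Each mortality rate above is the mean of a 0/1 `death_enc` flag within a group, times 100. If a group happens to be empty at a site, pandas returns NaN for the mean, which would later print as `nan%` in the diagram. A minimal sketch of the same calculation with an explicit empty-group case follows; the helper names `mortality_pct` and `format_mortality` are hypothetical and the input lists are made up.

```python
import math

def mortality_pct(death_flags):
    """Percentage of 1s in a list of 0/1 death flags; NaN for an empty group."""
    if not death_flags:
        return math.nan
    return 100 * sum(death_flags) / len(death_flags)

def format_mortality(pct):
    """Render a rate for display, falling back to 'n/a' when it is undefined."""
    return "n/a" if math.isnan(pct) else f"{pct:.2f}%"

# Made-up death flags per group; the second group is deliberately empty.
toy_rates = {
    "ICU Hospitalizations": mortality_pct([1, 0, 0, 0]),
    "Other Critically Ill": mortality_pct([]),
}
for name, pct in toy_rates.items():
    print(f"{name}: {format_mortality(pct)}")
```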

In [None]:
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

def create_consort_diagram(strobe_counts, mortality_rates):
    fig, ax = plt.subplots(1, 1, figsize=(14, 8))
    ax.set_xlim(-1, 13)
    ax.set_ylim(0, 14)
    ax.axis('off')

    box_style = "round,pad=0.1"
    boxes = {}

    def create_box(x, y, width, height, text, box_id=None, fontsize=10, fontweight='normal'):
        box = FancyBboxPatch(
            (x - width/2, y - height/2), width, height,
            boxstyle=box_style, facecolor='white', edgecolor='black', linewidth=1.5
        )
        ax.add_patch(box)
        ax.text(x, y, text, ha='center', va='center', fontsize=fontsize, fontweight=fontweight, wrap=True)
        
        return {
            'x': x, 'y': y, 'width': width, 'height': height,
            'left': x - width/2, 'right': x + width/2,
            'top': y + height/2, 'bottom': y - height/2
        }

    def create_arrow(from_box, to_box):
        x1, y1 = from_box['x'], from_box['bottom'] - 0.1
        x2, y2 = to_box['x'], to_box['top'] + 0.1
        ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
                    arrowprops=dict(arrowstyle='->', lw=2, color='black'))

    ax.text(5, 13, 'Cohort', ha='center', va='center', fontsize=16, fontweight='bold')

    # Define and arrange the boxes
    box1 = create_box(5, 12, 3, 0.7, 
                      f"Total Hospitalizations\nn = {strobe_counts['0_total_hospitalizations']:,}",
                      'total', fontsize=11, fontweight='bold')

    box2 = create_box(5, 10.5, 3, 0.7,
                      f"Adult Hospitalizations\nn = {strobe_counts['1b_after_stitching']:,}",
                      'adult', fontsize=11, fontweight='bold')
    create_arrow(box1, box2)

    # Define ICU, respiratory support, vasoactive, and other critically ill categories
    box3_icu = create_box(1, 8, 3, 0.9,
                          f"ICU Hospitalizations\nn = {strobe_counts['1_icu_encounters']:,}\nMortality: {mortality_rates['ICU Hospitalizations']:.2f}%",
                          'icu', fontsize=11, fontweight='bold')

    box3_resp = create_box(4.5, 8, 3, 0.9,
                           f"Advanced Respiratory Support\nn = {strobe_counts['2_advanced_resp_support_hospitalizations']:,}\nMortality: {mortality_rates['Advanced Respiratory Support']:.2f}%",
                           'resp_support', fontsize=11, fontweight='bold')

    box3_vaso = create_box(8, 8, 3, 0.9,
                           f"Vasoactive Hospitalizations\nn = {strobe_counts['3_vasoactive_hospitalizations']:,}\nMortality: {mortality_rates['Vasoactive Hospitalizations']:.2f}%",
                           'vasoactive', fontsize=11, fontweight='bold')

    box3_other = create_box(11.3, 8, 3, 0.9,
                            f"Other Critically Ill\nn = {strobe_counts['4_other_critically_ill']:,}\nMortality: {mortality_rates['Other Critically Ill']:.2f}%",
                            'other', fontsize=11, fontweight='bold')

    create_arrow(box2, box3_icu)
    create_arrow(box2, box3_resp)
    create_arrow(box2, box3_vaso)
    create_arrow(box2, box3_other)

    # Add a final box for "All Critically Ill Adults"
    box_final = create_box(5.7, 4.5, 5.2, 1.1,
        f"All Critically Ill Adults\nn = {final_cohort['encounter_block'].nunique():,}\nMortality: {mortality_rates['All Critically Ill Adults']:.2f}%",
        'all_critically_ill', fontsize=13, fontweight='bold')

    # Do NOT draw arrows from the four groups to the all critically ill adults box

    plt.tight_layout()
    plt.savefig('../output/final/tableone/consort_flow_diagram.png', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none')
    plt.show()
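
`create_consort_diagram` indexes `strobe_counts` by fixed string keys, so a key that was never set (for example because an earlier cell was skipped) surfaces as a `KeyError` partway through drawing. One option is to check the keys first. The sketch below assumes a hypothetical helper, `missing_keys`, which is not part of the notebook; the key names are the ones the function formats into its box labels.

```python
# Keys that create_consort_diagram reads from strobe_counts for its box labels.
REQUIRED_COUNT_KEYS = [
    "0_total_hospitalizations",
    "1b_after_stitching",
    "1_icu_encounters",
    "2_advanced_resp_support_hospitalizations",
    "3_vasoactive_hospitalizations",
    "4_other_critically_ill",
]

def missing_keys(counts, required):
    """Return the required keys that are absent from `counts`, in order."""
    return [key for key in required if key not in counts]

# Made-up partial dictionary, as if only two counting cells had been run.
toy_counts = {"0_total_hospitalizations": 100, "1_icu_encounters": 10}
missing = missing_keys(toy_counts, REQUIRED_COUNT_KEYS)
if missing:
    print("Cannot draw diagram, missing counts:", ", ".join(missing))
```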

In [None]:
import matplotlib.pyplot as plt
from upsetplot import UpSet, from_indicators
import os
import pandas as pd

import warnings
warnings.filterwarnings('ignore', category=FutureWarning, module='upsetplot')


# Create output directory if it doesn't exist
os.makedirs('../output/final/tableone', exist_ok=True)

# Prepare final_cohort data for UpSet plot
summary_df = final_cohort[['encounter_block', 'icu_enc', 'death_enc', 'high_support_enc', 'vaso_support_enc']].drop_duplicates()

# Rename columns to match CONSORT flow labels
summary_df = summary_df.rename(columns={
    'icu_enc': 'ICU Hospitalizations',
    'death_enc': 'Died',
    'high_support_enc': 'Advanced O2 Support',
    'vaso_support_enc': 'Vasoactive Support'
})

# Convert to boolean for UpSet plot
summary_df['ICU Hospitalizations'] = summary_df['ICU Hospitalizations'].astype(bool)
summary_df['Died'] = summary_df['Died'].astype(bool)
summary_df['Advanced O2 Support'] = summary_df['Advanced O2 Support'].astype(bool)
summary_df['Vasoactive Support'] = summary_df['Vasoactive Support'].astype(bool)

# Create UpSet plot
fig = plt.figure(figsize=(16, 12))
upset_data = from_indicators(
    ['ICU Hospitalizations', 'Died', 'Advanced O2 Support', 'Vasoactive Support'], 
    data=summary_df.set_index('encounter_block')
)

upset = UpSet(upset_data, 
              subset_size='count',
              show_counts=True,
              sort_by='cardinality',
              element_size=50,
              with_lines=True)

upset.plot(fig=fig)

plt.subplots_adjust(left=0.2, bottom=0.2, right=0.95, top=0.85, hspace=0.3, wspace=0.3)
plt.suptitle('Clinical Cohort Intersections', fontsize=16, y=0.95)

# Adjust font sizes for better readability
for ax in fig.get_axes():
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
                 ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(12)

# Save the plot
plt.savefig('../output/final/cohort_intersect_upset_plot.png', dpi=300, bbox_inches='tight')
plt.show()
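
The bars of the UpSet plot correspond to counts of encounters per exact combination of the four indicators. As a rough cross-check that does not depend on `upsetplot`, the same tallies can be sketched with `collections.Counter`. The rows below are made up; only the labels come from the cell above.

```python
from collections import Counter

labels = ["ICU Hospitalizations", "Died", "Advanced O2 Support", "Vasoactive Support"]

# Made-up indicator rows, one per encounter block, in the same order as `labels`.
toy_rows = [
    (True, False, False, False),
    (True, False, True, True),
    (True, True, True, True),
    (False, True, False, False),
    (True, False, False, False),
]

# Key each row by the tuple of labels that are True, then count identical keys.
combo_counts = Counter(
    tuple(label for label, flag in zip(labels, row) if flag) for row in toy_rows
)

# Largest combinations first, similar to sort_by='cardinality'.
for combo, n in combo_counts.most_common():
    print(f"{n}  {' & '.join(combo)}")
```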

In [None]:
# Create the diagram
create_consort_diagram(strobe_counts, mortality_rates)