# Clinical Text Cohort Manifest

#### Goal: Predict if the working ED diagnosis is discordant with the final discharge diagnosis (family-level) for admission $i$.

#### Defined as:
$$\hat{y}_i \; = \; f_{\theta}(x_i( \le T^{ED}_i)) \; \approx \; y_i,$$
with $y_i$ representing the binary label:
$$y_i \; = \; 1\{\exists f \in \mathcal{F}_i^{DC} \; \cap \; \mathcal{T} \; : \; f \notin \mathcal{F}_i^{ED} \}, \; y_i \; \in \{0,1\}$$
Here, $y_i = 1$ indicates a discordant diagnosis, in which the ED working diagnosis misses at least one family-level discharge diagnosis in the target set $\mathcal{T}$, and $y_i = 0$ indicates a concordant diagnosis.

With inputs:
$$x_i (\le T_i^{ED}): \text{all features time stamped on or before ED out time,}$$

and outputs:
$$\hat{y}_i \; \in \; [0,1]: \text{calibrated probability of misdiagnosis (discordance).}$$

**Variable Terms and Symbols**

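- $i$: hospital admission (the unit of analysis)
- $T_i^{ED}$: ED index time (edstays.outtime) for admission $i$
- $x_i (\le t)$: all features for admission $i$ timestamped on or before $t$
- $\mathcal{F}_i^{ED}$, $\mathcal{F}_i^{DC}$: sets of ICD diagnosis families recorded for admission $i$ in the ED and at discharge, respectively
- $\mathcal{T}$: the set of target diagnosis families counted toward the label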
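
The label definition above can be illustrated with plain Python sets. This is a minimal sketch for a single admission; the family codes and the contents of `target_families` are made-up values chosen for illustration, not values taken from the data.

```python
# Hypothetical 3-character ICD families for a single admission i.
ed_families = {"R07", "I10"}             # F_i^ED: families recorded in the ED
dc_families = {"I21", "I10", "E78"}      # F_i^DC: families recorded at discharge
target_families = {"I21", "I26", "A41"}  # T: families counted toward the label (assumed)

# y_i = 1 if any discharge family in T is absent from the ED families.
missed_targets = (dc_families & target_families) - ed_families
y_i = int(len(missed_targets) > 0)

print(sorted(missed_targets), y_i)
```

Here `E78` is also absent from the ED set, but it does not affect the label because it is not in the assumed target set.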

## 1. Filter Patients to Include in Cohort

### 0. Setup


##### Peter's Dataset Path:

##### Mount Google Drive (for Google Colab; the datasets are stored in Google Drive)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os, glob, math # path joins, folder creation, file globbing, and math.floor for split sizes
import pandas as pd # for data table work
import numpy as np # for numeric utilities
from pathlib import Path
from imblearn.over_sampling import SMOTE
from collections import Counter

Admissions are included in the cohort when they meet the following criteria:
- Valid times: edstays.intime $\le$ edstays.outtime and admissions.admittime $\le$ admissions.dischtime.
- The patient was admitted to the hospital via the Emergency Department (ED).

Additional checks for missingness in the label inputs:
- Determine how many patients are missing an ED diagnosis (diagnosis.icd_code).
- Determine how many patients are missing a discharge diagnosis (diagnoses_icd.icd_code).
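
One way to run the missingness checks listed above is an anti-join on the linking key. The sketch below uses small made-up tables (`edstays_toy`, `ed_dx_toy`) rather than the MIMIC files, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for edstays and the ED diagnosis table (values are made up).
edstays_toy = pd.DataFrame({"stay_id": [1, 2, 3], "hadm_id": [10, 11, 12]})
ed_dx_toy = pd.DataFrame({"stay_id": [1, 1, 3], "icd_code": ["R079", "I10", "J189"]})

# A stay is "missing" an ED diagnosis if its stay_id never appears in ed_dx_toy
# or appears only with a null icd_code.
has_dx = ed_dx_toy.dropna(subset=["icd_code"])["stay_id"].unique()
missing_ed_dx = edstays_toy[~edstays_toy["stay_id"].isin(has_dx)]

n_missing = len(missing_ed_dx)
print("Stays without an ED diagnosis:", n_missing)
```

The same pattern would apply to discharge diagnoses by keying on `hadm_id` instead of `stay_id`.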

#### 0.1. Configuration

In [None]:
import os

# -- Root folders for MIMIC-IV and MIMIC-IV-ED data tables --

# For accessing MIMIC-IV hospital tables

HOSP = "/content/drive/MyDrive/MIMIC_IV_hospital_tables_datasets"

# For MIMIC-IV-ED ED tables
ED = "/content/drive/MyDrive/MIMIC_IV_ED_tables"

# Output directory for the manifest files (saved to Google Drive)
OUT_DIR = "/content/drive/MyDrive/manifests"
# OUT_DIR = "/content/manifests"  # Alternative: Colab temporary storage
# OUT_DIR = r"D:\manifests"       # Alternative: local Windows path (raw string)

# Create the directory that OUT_DIR actually points to
os.makedirs(OUT_DIR, exist_ok=True)

SEED = 42

# Plausibility window used for ALL rows (not just those without hadm_id)
ED_TO_ADMIT_EARLY_HOURS  = 6   # admit can be up to 6h *before* ED intime (rare clerical/time-zone-ish edges)
ED_TO_ADMIT_LATE_HOURS   = 24  # admit can be up to 24h *after* ED outtime (transfer delays)

# Drop extreme ED durations (boarding outliers)
MAX_ED_MIN = 72 * 60  # 72 hours. Adjust to 24*60 or 48*60 if you want stricter caps.


# File paths for the tables needed to apply the cohort criteria above
paths = {
    "patients":    os.path.join(HOSP, "patients.csv.gz"),    # anchor_age (to enforce 18+)
    "admissions":  os.path.join(HOSP, "admissions.csv.gz"),  # admit/discharge times (to enforce valid times)
    "edstays":     os.path.join(ED, "edstays.csv.gz"),       # links ED stays (stay_id) to inpatient stays (hadm_id)
    "ed_dx":       os.path.join(ED, "diagnosis.csv.gz"),     # ED diagnoses (to ensure an ED diagnosis exists)
    # Discharge diagnoses (to ensure a discharge diagnosis exists); read from a
    # project folder instead of os.path.join(HOSP, "diagnoses_icd.csv.gz")
    "disch_dx":    "/content/drive/MyDrive/Data690_cohort_manifest/diagnoses_icd.csv.gz",
}

In [None]:
# -- Root folders for MIMIC-IV and MIMIC-IV-ED data tables --
# For accessing MIMIC-IV hospital tables
# HOSP = r"D:\raw\mimic-iv\physionet.org\files\mimiciv\3.1\hosp"
# For MIMIC-IV-ED ED tables
# ED = r"D:\raw\mimic-iv-ed\physionet.org\ed"

# Root output path for the .csv files being produced
# OUT_DIR = r"D:\manifests"
# os.makedirs(OUT_DIR, exist_ok=True)

# Dictionary mapping .csv file names to the file paths established above
# paths = {    "patients":   os.path.join(HOSP, "patients.csv.gz"),
#    "admissions": os.path.join(HOSP, "admissions.csv.gz"),
#    "edstays":    os.path.join(ED, "edstays.csv.gz"),
#    "ed_dx":      os.path.join(ED, "diagnosis.csv"),
#    "disch_dx":   os.path.join(HOSP, "diagnoses_icd.csv.gz"),
#}

# Set seed for reproducibility
#SEED = 42

#### 0.2. Helper

In [None]:
# Thin wrapper around pd.read_csv that returns a DataFrame;
# "kw" collects the optional keyword arguments (date parsing, integer dtypes);
# integer columns use the nullable "Int64" dtype when allow_na is True
def read_csv(path, usecols=None, parse_date_cols=None, int_cols=None, allow_na=True):
    kw = {}
    if parse_date_cols:
        kw["parse_dates"] = parse_date_cols

    if int_cols:
        int_dtype = "Int64" if allow_na else "int64"
        kw["dtype"] = {col: int_dtype for col in int_cols}

    return pd.read_csv(path, usecols=usecols, **kw)

# Map an ICD code to its 3-character family (dots removed, upper-cased)
def fam3(code):
    if pd.isna(code):
        return None
    return str(code).replace('.', '').upper()[:3]

def stringify_set(s):
    return "|".join(sorted(s)) if isinstance(s, set) else ""
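
The `fam3` helper keys families on the first three characters only, so it does not look at `icd_version`. Since the diagnosis tables carry both an `icd_code` and an `icd_version` column, codes from different ICD versions that share their first three characters collapse to the same family key. The sketch below reproduces `fam3` and adds `fam3_versioned`, a hypothetical variant (not part of this notebook) that prefixes the version to keep the two code systems apart.

```python
import pandas as pd

def fam3(code):
    # Same logic as the helper above: strip dots, upper-case, keep 3 characters.
    if pd.isna(code):
        return None
    return str(code).replace(".", "").upper()[:3]

def fam3_versioned(code, version):
    # Hypothetical variant: prefix the family with the ICD version so that
    # ICD-9 and ICD-10 codes can never share a key.
    fam = fam3(code)
    if fam is None or pd.isna(version):
        return None
    return f"{int(version)}:{fam}"

print(fam3("i21.4"))               # I21
print(fam3("5728"))                # 572
print(fam3_versioned("E8788", 9))  # 9:E87
print(fam3_versioned("E871", 10))  # 10:E87
```

Whether the version prefix is worth adopting depends on how the ED and discharge tables mix versions for the same admission; this is only a sketch of the option.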

### 1. Get Data

#### 1.0. Get Column Variables for All Files

In [None]:
print("Column names from MIMIC .csv files:\n")
for name, file_path in paths.items():
    print(f"--- Column names for {name} ---")
    try:
        df_header = pd.read_csv(file_path, nrows=0)
        for col_name in df_header.columns:
            print(col_name)
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
    print()

Column names from MIMIC .csv files:

--- Column names for patients ---
subject_id
gender
anchor_age
anchor_year
anchor_year_group
dod

--- Column names for admissions ---
subject_id
hadm_id
admittime
dischtime
deathtime
admission_type
admit_provider_id
admission_location
discharge_location
insurance
language
marital_status
race
edregtime
edouttime
hospital_expire_flag

--- Column names for edstays ---
subject_id
hadm_id
stay_id
intime
outtime
gender
race
arrival_transport
disposition

--- Column names for ed_dx ---
subject_id
stay_id
seq_num
icd_code
icd_version
icd_title

--- Column names for disch_dx ---
subject_id
hadm_id
seq_num
icd_code
icd_version



#### 1.1. Load MIMIC Data

In [None]:
# Load admissions table: supports linking patient observations and filtering logically ordered times.
admissions = read_csv(
    paths["admissions"],
    usecols=["subject_id","hadm_id","admittime","dischtime","hospital_expire_flag",
             "admission_type","insurance","language","marital_status","race"],
    int_cols=["subject_id","hadm_id"],
    parse_date_cols=["admittime","dischtime"] # Parse as datetime object with pandas
)

# Load ED stays table: This is where data related to a patient's ED time is located
edstays = read_csv(
    paths["edstays"],
    usecols=["subject_id","stay_id","hadm_id","intime","outtime","gender"],
    int_cols=["subject_id","stay_id","hadm_id"],
    parse_date_cols=["intime","outtime"]
)

# Load ED diagnosis table: provides the ED working diagnoses
ed_dx = read_csv(
    paths["ed_dx"],
    usecols=["subject_id","stay_id","icd_code","icd_version","seq_num"],
    int_cols=["subject_id","stay_id","seq_num"]
)

# Discharge-diagnosis presence to audit/filter
disch = read_csv(
    paths["disch_dx"],
    usecols=["subject_id","hadm_id","icd_code","icd_version","seq_num"],
    int_cols=["subject_id","hadm_id","seq_num"]
)

# Load patients table: anchor_age supports the age filter
patients = read_csv(
    paths["patients"],
    usecols=["subject_id","anchor_age"],
    int_cols=["subject_id","anchor_age"]
)

#### 1.2. Review Column Names

In [None]:
df_dict = {
    "admissions": admissions,
    "edstays":    edstays,
    "ed_dx":      ed_dx,
    "disch_dx":   disch,
    "patients":   patients,
}

print("Column names from loaded DataFrames:\n")
for name, df in df_dict.items():
    print(f"--- Column names for {name} ---")
    for col_name in df.columns:
        print(col_name)
    print()

Column names from loaded DataFrames:

--- Column names for admissions ---
subject_id
hadm_id
admittime
dischtime
admission_type
insurance
language
marital_status
race
hospital_expire_flag

--- Column names for edstays ---
subject_id
hadm_id
stay_id
intime
outtime
gender

--- Column names for ed_dx ---
subject_id
stay_id
seq_num
icd_code
icd_version

--- Column names for disch_dx ---
subject_id
hadm_id
seq_num
icd_code
icd_version

--- Column names for patients ---
subject_id
anchor_age



#### 1.3. Enforce Valid Times

In [None]:
# Count unique patients admitted to the hospital
apc = admissions["subject_id"].nunique()
print("Unique patients with admissions:", apc)

# Count unique patients with ED stays
edc = edstays["subject_id"].nunique()
print("Unique patients with ED stays:", edc)

Unique patients with admissions: 223452
Unique patients with ED stays: 205504


In [None]:
# Keep only ED stays whose in time is on or before the out time
edstays = edstays[
    (edstays["intime"] <= edstays["outtime"])
]

# Count unique patients with ED stays that have valid times
ved = edstays["subject_id"].nunique()

print("Patients removed for invalid ED in/out times:", (edc - ved))

Patients removed for invalid ED in/out times: 1


In [None]:
# Keep only admissions whose admit time is on or before the discharge time
admissions = admissions[
    (admissions["admittime"] <= admissions["dischtime"])
]

# Count unique patients with hospital admissions that have valid times
vha = admissions["subject_id"].nunique()

print("Patients removed for invalid hospital admission/discharge times:", (apc - vha))

Patients removed for invalid hospital admission/discharge times: 70


In [None]:
# Print count of unique patients with ED stays that have valid times
print("Unique patients with ED stays and valid times:", ved)

# Print count of unique patients with hospital admissions that have valid times
print("Unique patients with hospital admissions and valid times:", vha)

Unique patients with ED stays and valid times: 205503
Unique patients with hospital admissions and valid times: 223382


### 2. Merge Data

#### 2.1. Merge ED stays with hospital admissions

2.1.1. Assess ED Visits

In [None]:
# Each row of edstays is one ED visit, so the row count is the visit count
edvisits = len(edstays)
print("Total ED visits:", edvisits)

Total ED visits: 425081


2.1.2. Assess Hospital Admissions from ED

In [None]:
edadmit = edstays["hadm_id"].nunique()
print("Unique hospital admission codes in ED stays file:", edadmit)

Unique hospital admission codes in ED stays file: 202441


2.1.3. Merge

In [None]:
# Merge ED stays with admissions
cohort_draft = edstays.merge(
    admissions,
    on=["subject_id","hadm_id"],
    how="inner",
    validate="many_to_one",
)

# Rename column variables for interpretability
cohort_draft = cohort_draft.rename(columns={"intime":"ed_intime","outtime":"T_ED"})

# Count unique hospital admissions that remain after the merge
edstay_and_admit = cohort_draft["hadm_id"].nunique()

print("ED-linked admissions dropped for no matching valid admission record:", edadmit-edstay_and_admit)

uadmit = cohort_draft["hadm_id"].nunique()
print("Unique hospital admissions linked to an ED stay:", uadmit)
print("Unique patients with ED stays with hospital admissions:", cohort_draft["subject_id"].nunique())

ED-linked admissions dropped for no matching valid admission record: 104
Unique hospital admissions linked to an ED stay: 202337
Unique patients with ED stays with hospital admissions: 107331
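
To see which ED rows an inner merge drops, a left merge with `indicator=True` can be used as an audit step. The sketch below uses made-up toy tables (`ed_toy`, `adm_toy`), not the MIMIC data, and is only meant to show the pattern.

```python
import pandas as pd

# Toy ED stays and admissions (values are made up for illustration).
ed_toy = pd.DataFrame({"subject_id": [1, 1, 2],
                       "hadm_id": [10, 11, 20],
                       "stay_id": [100, 101, 200]})
adm_toy = pd.DataFrame({"subject_id": [1, 2], "hadm_id": [10, 20]})

# A left merge with indicator=True labels each ED row by whether it matched.
audit = ed_toy.merge(adm_toy, on=["subject_id", "hadm_id"], how="left",
                     validate="many_to_one", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]
print(unmatched[["subject_id", "hadm_id", "stay_id"]])
```

Rows flagged `left_only` are the ones an inner merge would silently remove, which makes it easier to inspect why they have no matching admission.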


#### 2.2. Merge ED diagnoses with cohort draft

2.2.1. Count the unique diagnoses given in the ED across all patients

In [None]:
uicd = ed_dx["icd_code"].nunique()
print("Unique ED ICD Codes (Diagnoses):", uicd)

Unique ED ICD Codes (Diagnoses): 13199


2.2.2. Merge

In [None]:
# Merge - add ED diagnosis to cohort
cohort_draft = cohort_draft.merge(ed_dx, on=["subject_id","stay_id"], how="inner")

# Rename column variables for interpretability
cohort_draft = cohort_draft.rename(columns={"icd_version":"ED_icd_version", "icd_code":"ED_icd_code","seq_num":"ed_seq_num"})

# Count admissions dropped for lacking an ED diagnosis (consider imputing instead unless the count is minuscule)
draftcount = cohort_draft["hadm_id"].nunique()
print("Number of admissions dropped for no ED diagnosis:", edstay_and_admit-draftcount)

Number of admissions dropped for no ED diagnosis: 782


#### 2.3. Merge discharge diagnoses with cohort draft

2.3.1. Count the unique diagnoses given at discharge across all patients

In [None]:
uicd = disch["icd_code"].nunique()
print("Unique ICD Codes (Diagnoses):", uicd)

Unique ICD Codes (Diagnoses): 28562


2.3.2. Merge discharge diagnosis to cohort draft

In [None]:
# Merge - add discharge diagnosis to cohort
cohort_draft = cohort_draft.merge(disch, on=["subject_id","hadm_id"], how="inner")

# Rename column variables for interpretability
cohort_draft = cohort_draft.rename(columns={"icd_version":"disch_icd_version", "icd_code":"disch_icd_code","seq_num":"dis_seq_num"})

# Count admissions dropped for lacking a discharge diagnosis (consider imputing instead unless the count is minuscule)
draftcount2 = cohort_draft["hadm_id"].nunique()
print("Number of admissions dropped for no discharge diagnosis:", draftcount-draftcount2)

Number of admissions dropped for no discharge diagnosis: 56


#### 2.4. Merge "patients" file with cohort draft

In [None]:
cohort_draft = cohort_draft.merge(patients, on=["subject_id"], how="inner")

### 3. Check final cohort

In [None]:
cohort = cohort_draft
cohort.head()

Unnamed: 0,subject_id,hadm_id,stay_id,ed_intime,T_ED,gender,admittime,dischtime,admission_type,insurance,...,marital_status,race,hospital_expire_flag,ed_seq_num,ED_icd_code,ED_icd_version,dis_seq_num,disch_icd_code,disch_icd_version,anchor_age
0,10000032,22595853,33258284,2180-05-06 19:17:00,2180-05-06 23:30:00,F,2180-05-06 22:23:00,2180-05-07 17:15:00,URGENT,Medicaid,...,WIDOWED,WHITE,0,1,5728,9,1,5723,9,52
1,10000032,22595853,33258284,2180-05-06 19:17:00,2180-05-06 23:30:00,F,2180-05-06 22:23:00,2180-05-07 17:15:00,URGENT,Medicaid,...,WIDOWED,WHITE,0,1,5728,9,2,78959,9,52
2,10000032,22595853,33258284,2180-05-06 19:17:00,2180-05-06 23:30:00,F,2180-05-06 22:23:00,2180-05-07 17:15:00,URGENT,Medicaid,...,WIDOWED,WHITE,0,1,5728,9,3,5715,9,52
3,10000032,22595853,33258284,2180-05-06 19:17:00,2180-05-06 23:30:00,F,2180-05-06 22:23:00,2180-05-07 17:15:00,URGENT,Medicaid,...,WIDOWED,WHITE,0,1,5728,9,4,7070,9,52
4,10000032,22595853,33258284,2180-05-06 19:17:00,2180-05-06 23:30:00,F,2180-05-06 22:23:00,2180-05-07 17:15:00,URGENT,Medicaid,...,WIDOWED,WHITE,0,1,5728,9,5,496,9,52
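
The preview above repeats the same ED code across several discharge codes because both diagnosis tables were merged onto the same admission in long form. A small sketch with made-up codes shows the effect: one admission with 2 ED codes and 3 discharge codes yields 6 rows.

```python
import pandas as pd

# Toy admission with 2 ED codes and 3 discharge codes (made-up values).
ed_toy = pd.DataFrame({"hadm_id": [10, 10], "ED_icd_code": ["R079", "I10"]})
dc_toy = pd.DataFrame({"hadm_id": [10, 10, 10],
                       "disch_icd_code": ["I214", "I10", "E785"]})

# Merging on hadm_id pairs every ED code with every discharge code.
long = ed_toy.merge(dc_toy, on="hadm_id", how="inner")
print(len(long))  # 2 * 3 = 6 rows for one admission

# Admission-level counts therefore need nunique or drop_duplicates, not len.
n_admissions = long["hadm_id"].nunique()
print(n_admissions)
```

This is why the notebook counts admissions with `nunique()` on `hadm_id` and later collapses the rows to one per admission.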


### 4. Resolve hadm_ids that map to more than one stay_id

In [None]:
pairs = cohort[["hadm_id","stay_id"]].drop_duplicates()
multiplicity = pairs.groupby("hadm_id")["stay_id"].nunique()
print(multiplicity.value_counts().sort_index())
print("Max ED stays per hadm_id:", multiplicity.max())

stay_id
1    200942
2       555
3         2
Name: count, dtype: int64
Max ED stays per hadm_id: 3


In [None]:
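def map_one_ed_stay_per_admission(df,
                                  hadm="hadm_id", stay="stay_id",
                                  ed_in="ed_intime", ed_out="T_ED",
                                  admit="admittime"):
    """Pick one ED stay per admission: prefer a stay whose ED window contains
    the admit time, then the stay that ended closest before admit, then the
    stay that started closest after admit. Returns [hadm, stay] pairs."""
    cols = [hadm, stay, ed_in, ed_out, admit]
    x = df[cols].drop_duplicates(subset=[hadm, stay]).copy()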

    # Ensure datetimes
    for c in [ed_in, ed_out, admit]:
        if not pd.api.types.is_datetime64_any_dtype(x[c]):
            x[c] = pd.to_datetime(x[c], errors="coerce")

    # Features for ranking
    x["overlap"] = (x[ed_in] <= x[admit]) & (x[admit] <= x[ed_out])

    # Positive time gaps only
    zero = pd.Timedelta(0)
    before_gap = (x[admit] - x[ed_out])
    after_gap  = (x[ed_in]  - x[admit])
    x["gap_before"] = before_gap.where(before_gap >= zero, pd.NaT)
    x["gap_after"]  = after_gap.where(after_gap  >= zero, pd.NaT)

    # Overlap first, then nearest before, then nearest after
    x = x.sort_values(
        by=[hadm, "overlap", "gap_before", "gap_after", ed_in, stay],
        ascending=[True, False, True, True, True, True]
    )

    chosen = x.drop_duplicates(subset=[hadm], keep="first")[[hadm, stay]]

    # Sanity checks
    assert chosen[hadm].is_unique,  "Each hadm_id must appear once"
    assert chosen[stay].is_unique,  "A stay cannot map to multiple hadm"
    return chosen

# One to one mapping
mapping_1to1 = map_one_ed_stay_per_admission(cohort)
cohort_1to1 = cohort.merge(mapping_1to1, on=["hadm_id","stay_id"], how="inner")

print("hadm unique:", cohort_1to1["hadm_id"].nunique())
print("stay unique:", cohort_1to1["stay_id"].nunique())
print("Max stay_ids per hadm_id:",
      cohort_1to1.groupby("hadm_id")["stay_id"].nunique().max())
print("Max hadm_ids per stay_id:",
      cohort_1to1.groupby("stay_id")["hadm_id"].nunique().max())

# cohort_1to1 was already built above; keep a copy as the working cohort
cohort = cohort_1to1.copy()

hadm unique: 201499
stay unique: 201499
Max stay_ids per hadm_id: 1
Max hadm_ids per stay_id: 1
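
The ranking rule can be checked on a tiny example. The sketch below is a simplified version of the sort used above (overlap first, then smallest gap before admit) applied to one made-up admission with two ED stays; the timestamps are invented for illustration.

```python
import pandas as pd

# One admission (hadm_id 10) linked to two ED stays; timestamps are made up.
toy = pd.DataFrame({
    "hadm_id":   [10, 10],
    "stay_id":   [100, 101],
    "ed_intime": pd.to_datetime(["2180-01-01 08:00", "2180-01-02 09:00"]),
    "T_ED":      pd.to_datetime(["2180-01-01 12:00", "2180-01-02 15:00"]),
    "admittime": pd.to_datetime(["2180-01-02 14:00", "2180-01-02 14:00"]),
})

# Stay 101 contains the admit time, so it ranks first on "overlap".
toy["overlap"] = (toy["ed_intime"] <= toy["admittime"]) & (toy["admittime"] <= toy["T_ED"])
gap = toy["admittime"] - toy["T_ED"]
toy["gap_before"] = gap.where(gap >= pd.Timedelta(0), pd.NaT)

ranked = toy.sort_values(["hadm_id", "overlap", "gap_before"],
                         ascending=[True, False, True])
chosen = ranked.drop_duplicates(subset=["hadm_id"], keep="first")
print(chosen[["hadm_id", "stay_id"]])
```

Stay 100 ended 26 hours before the admit time, so it would only be chosen if no stay overlapped the admission.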


## 2. Create Labels

### 2.1. Build diagnosis families for labels

#### 2.1.1. ED Diagnosis Families

In [None]:
ed_sets = (
    cohort_1to1
    .loc[:, ["hadm_id", "ED_icd_code"]]
    .assign(ed_family=lambda d: d["ED_icd_code"].map(fam3))
    .groupby("hadm_id")["ed_family"]
    .apply(lambda s: set(pd.unique(s.dropna())))
    .rename("ed_set")
    .reset_index()
)

#### 2.1.2. Discharge Diagnosis Families

In [None]:
dis_sets = (
    cohort_1to1
    .loc[:, ["hadm_id", "disch_icd_code"]]
    .assign(disch_family=lambda d: d["disch_icd_code"].map(fam3))
    .groupby("hadm_id")["disch_family"]
    .apply(lambda s: set(pd.unique(s.dropna())))
    .rename("dis_set")
    .reset_index()
)

#### 2.1.3. Sanity Check

In [None]:
assert ed_sets["ed_set"].map(type).eq(set).all()
assert dis_sets["dis_set"].map(type).eq(set).all()

#### 2.1.4. Merge diagnosis sets

In [None]:
both = ed_sets.merge(dis_sets, on="hadm_id", how="inner")
both["n_ed"]  = both["ed_set"].map(len)
both["n_dis"] = both["dis_set"].map(len)

print("Merged ED and DIS sets:", both.shape)

Merged ED and DIS sets: (201499, 5)


### 2.2. Apply Label Policy

#### 2.2.1. Check Class Balance on "NONE_COVERED"

In [None]:
# Label Policy
def policy_any_missed(row):
    return int(len(row["dis_set"] - row["ed_set"]) > 0)

def policy_none_covered(row):
    return int(row["dis_set"].isdisjoint(row["ed_set"]))

LABEL_POLICY = "NONE_COVERED"

if LABEL_POLICY == "ANY_MISSED":
    both["y"] = both.apply(policy_any_missed, axis=1)
elif LABEL_POLICY == "NONE_COVERED":
    both["y"] = both.apply(policy_none_covered, axis=1)
else:
    raise ValueError("Unsupported LABEL_POLICY")

both["missed_set_str"] = both.apply(lambda r: stringify_set(r["dis_set"] - r["ed_set"]), axis=1)
both["ed_set_str"]     = both["ed_set"].map(stringify_set)
both["dis_set_str"]    = both["dis_set"].map(stringify_set)
both["label"] = LABEL_POLICY

labels = both.reset_index()[[
    "hadm_id","y","n_ed","n_dis","ed_set_str","dis_set_str","missed_set_str","label"
]].copy()

print("Label distribution:")
print(labels["y"].value_counts(normalize=True))

Label distribution:
y
0    0.74846
1    0.25154
Name: proportion, dtype: float64


#### 2.2.2. Check Class Balance on "ANY_MISSED"

In [None]:
# Label Policy
def policy_any_missed(row):
    return int(len(row["dis_set"] - row["ed_set"]) > 0)

def policy_none_covered(row):
    return int(row["dis_set"].isdisjoint(row["ed_set"]))

LABEL_POLICY = "ANY_MISSED"

if LABEL_POLICY == "ANY_MISSED":
    both["y"] = both.apply(policy_any_missed, axis=1)
elif LABEL_POLICY == "NONE_COVERED":
    both["y"] = both.apply(policy_none_covered, axis=1)
else:
    raise ValueError("Unsupported LABEL_POLICY")

both["missed_set_str"] = both.apply(lambda r: stringify_set(r["dis_set"] - r["ed_set"]), axis=1)
both["ed_set_str"]     = both["ed_set"].map(stringify_set)
both["dis_set_str"]    = both["dis_set"].map(stringify_set)
both["label"] = LABEL_POLICY

labels = both.reset_index()[[
    "hadm_id","y","n_ed","n_dis","ed_set_str","dis_set_str","missed_set_str","label"
]].copy()

print("Label distribution:")
print(labels["y"].value_counts(normalize=True))

Label distribution:
y
1    0.942953
0    0.057047
Name: proportion, dtype: float64
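
The gap between the two class balances follows from how strict each policy is. The sketch below applies both rules to three made-up admissions; the family sets are illustrative only.

```python
# Toy family sets for three admissions (made-up values).
cases = {
    "full_match":    {"ed": {"I21", "I10"}, "dis": {"I21", "I10"}},
    "partial_match": {"ed": {"I21"},        "dis": {"I21", "E78"}},
    "no_match":      {"ed": {"R07"},        "dis": {"I21", "E78"}},
}

for name, c in cases.items():
    any_missed = int(len(c["dis"] - c["ed"]) > 0)      # ANY_MISSED policy
    none_covered = int(c["dis"].isdisjoint(c["ed"]))   # NONE_COVERED policy
    c["any_missed"], c["none_covered"] = any_missed, none_covered
    print(f"{name}: ANY_MISSED={any_missed}, NONE_COVERED={none_covered}")
```

A partial match is positive under ANY_MISSED but negative under NONE_COVERED, which is consistent with ANY_MISSED labeling far more admissions as positive.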


### 2.3. Diagnosis Severity Weights Strategy

This approach weights missed diagnoses by their clinical importance.

#### 2.3.1. Define High Severity

In [None]:
# High-severity: Life-threatening, require immediate intervention
HIGH_SEVERITY = {
    # Cardiovascular
    'I21', 'I22', 'I23',  # Acute MI
    'I46',  # Cardiac arrest
    'I60', 'I61', 'I62', 'I63', 'I64',  # Stroke
    'I26',  # Pulmonary embolism
    'I71',  # Aortic aneurysm and dissection
    'I50',  # Heart failure (severe)

    # Respiratory
    'J96',  # Respiratory failure
    'J81',  # Pulmonary edema
    'J18',  # Pneumonia (can be severe)

    # Sepsis/Infection
    'A41',  # Sepsis
    'R65',  # SIRS

    # Trauma
    'S06',  # Intracranial injury
    'S27',  # Thoracic injury

    # Renal
    'N17', 'N18', 'N19',  # Acute kidney failure (N17), chronic kidney disease (N18), unspecified kidney failure (N19)

    # GI
    'K92',  # GI bleeding
    'K80', 'K81',  # Cholelithiasis (K80), cholecystitis (K81)

    # Metabolic
    'E87',  # Fluid, electrolyte, and acid-base disorders
}

#### 2.3.2. Define Medium Severity

In [None]:
# Medium-severity: Important but not immediately life-threatening
MEDIUM_SEVERITY = {
    'J44',  # COPD
    'N39',  # UTI
    'L03',  # Cellulitis
    'K52',  # Gastroenteritis
    'M79',  # Myalgia
    'E11',  # Diabetes complications
}

#### 2.3.3. Define Low Severity

In [None]:
# Low-severity: Symptoms, minor conditions, follow-up findings
LOW_SEVERITY = {
    'R07',  # Chest pain (symptom, not diagnosis)
    'R10',  # Abdominal pain
    'R50',  # Fever
    'G43', 'G44',  # Headache
    'J06',  # Upper respiratory infection
    'R51',  # Headache
    'R42',  # Dizziness
    'Z',    # Z-codes (administrative, not medical diagnoses); a 1-character entry never equals a 3-character family such as 'Z79'
}

#### 2.3.4. Helpers

In [None]:
def get_severity_weight(icd_family):
    """Assign severity weight to ICD family."""
    if pd.isna(icd_family):
        return 0

    fam = str(icd_family)[:3].upper()

    if fam in HIGH_SEVERITY:
        return 3.0
    elif fam in MEDIUM_SEVERITY:
        return 2.0
    elif fam in LOW_SEVERITY:
        return 0.5  # Low weight for symptoms
    else:
        return 1.5  # Default: moderate

def compute_missed_severity(row):
    """Compute total severity of missed diagnoses."""
    missed = row["dis_set"] - row["ed_set"]

    if len(missed) == 0:
        return 0.0

    # Sum severity weights
    total_severity = sum(get_severity_weight(fam) for fam in missed)

    return total_severity

def compute_high_severity_missed(row):
    """Count number of high-severity diagnoses missed."""
    missed = row["dis_set"] - row["ed_set"]

    count = sum(1 for fam in missed if str(fam)[:3].upper() in HIGH_SEVERITY)

    return count

In [None]:
print(f"Defined {len(HIGH_SEVERITY)} high-severity diagnoses")
print(f"Defined {len(MEDIUM_SEVERITY)} medium-severity diagnoses")
print(f"Defined {len(LOW_SEVERITY)} low-severity diagnoses")

Defined 26 high-severity diagnoses
Defined 6 medium-severity diagnoses
Defined 9 low-severity diagnoses
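
The severity lookup compares a 3-character family against the set members with `in`, which is an exact match. A one-character entry such as `'Z'` therefore does not match families like `Z79`, and those fall through to the default weight. The sketch below contrasts exact matching with a hypothetical prefix-based check; `LOW_TOY` and both function names are assumptions made for illustration.

```python
LOW_TOY = {"R07", "R10", "Z"}   # subset of LOW_SEVERITY, for illustration only

def is_low_severity_exact(fam):
    # Exact 3-character membership, as in get_severity_weight above.
    return fam in LOW_TOY

def is_low_severity_prefix(fam):
    # Hypothetical alternative: treat any entry as a prefix of the family.
    return any(fam.startswith(prefix) for prefix in LOW_TOY)

print(is_low_severity_exact("Z79"))   # False: "Z79" is not equal to "Z"
print(is_low_severity_prefix("Z79"))  # True: "Z79" starts with "Z"
```

Switching to prefix matching would change the severity scores reported in this notebook, so it is shown here only as an option to consider.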


#### 2.3.5. Add severity columns

In [None]:
both["missed_severity_score"] = both.apply(compute_missed_severity, axis=1)
both["n_high_severity_missed"] = both.apply(compute_high_severity_missed, axis=1)

print(f"Computed severity scores")
print(f"  Mean missed severity: {both['missed_severity_score'].mean():.2f}")
print(f"  Max missed severity: {both['missed_severity_score'].max():.2f}")

Computed severity scores
  Mean missed severity: 15.90
  Max missed severity: 74.50


#### 2.3.6. Create Label Variants

In [None]:
# Variant 1: Summed missed severity >= 2.0 (e.g., one medium-severity miss or several lower-weight misses)
both["y_severity_2plus"] = (both["missed_severity_score"] >= 2.0).astype(int)

# Variant 2: Summed missed severity >= 3.0 (e.g., one high-severity miss or two default-weight misses)
both["y_severity_3plus"] = (both["missed_severity_score"] >= 3.0).astype(int)

# Variant 3: Summed missed severity >= 6.0 (e.g., two high-severity misses or an equivalent combination)
both["y_severity_6plus"] = (both["missed_severity_score"] >= 6.0).astype(int)

# Variant 4: Critical miss (at least one high-severity diagnosis missed)
both["y_critical"] = (both["n_high_severity_missed"] > 0).astype(int)

# Variant 5: Multiple critical misses (two or more high-severity diagnoses missed)
both["y_critical_2plus"] = (both["n_high_severity_missed"] >= 2).astype(int)

# Variant 6: None covered (same rule as the NONE_COVERED policy above; kept for comparison when modeling)
both["y_none_covered"] = both.apply(
    lambda r: int(r["dis_set"].isdisjoint(r["ed_set"])),
    axis=1
)

# Variant 7: Any missed (same rule as ANY_MISSED above; too lenient and produced significant issues in modeling)
both["y_any_missed"] = (both["missed_severity_score"] > 0).astype(int)
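
Because `missed_severity_score` is a sum over all missed families, a score threshold does not by itself imply a miss of a particular tier. The short sketch below works through the arithmetic using the four weights from `get_severity_weight`; the `WEIGHTS` dictionary and `score` helper are illustrative names, not part of the notebook.

```python
# Weights as returned by get_severity_weight: high=3.0, medium=2.0, low=0.5, default=1.5.
WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 0.5, "default": 1.5}

def score(tiers):
    # Sum of weights over the missed families, given each family's tier.
    return sum(WEIGHTS[t] for t in tiers)

two_default = score(["default", "default"])   # 3.0
four_low = score(["low"] * 4)                 # 2.0
one_high = score(["high"])                    # 3.0

print(two_default >= 3.0, four_low >= 2.0, one_high >= 3.0)
```

Two default-weight misses reach the same score as one high-severity miss, which helps explain why the score-based variants are positive for most admissions while `y_critical` is not.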

#### 2.3.7. Label Statistics

In [None]:
label_variants = [
    "y_any_missed",
    "y_severity_2plus",
    "y_severity_3plus",
    "y_severity_6plus",
    "y_critical",
    "y_critical_2plus",
    "y_none_covered"
]

comparison_stats = []

for label_col in label_variants:
    pos_count = both[label_col].sum()
    pos_pct = both[label_col].mean() * 100

    comparison_stats.append({
        'Label': label_col,
        'Positive': pos_count,
        'Positive %': f"{pos_pct:.1f}%",
        'Negative': len(both) - pos_count,
        'Negative %': f"{100 - pos_pct:.1f}%",
        'Balance': 'Good' if 20 <= pos_pct <= 60 else ('Too High' if pos_pct > 60 else 'Too Low')
    })

comparison_df = pd.DataFrame(comparison_stats)
print(comparison_df.to_string(index=False))

           Label  Positive Positive %  Negative Negative %  Balance
    y_any_missed    190004      94.3%     11495       5.7% Too High
y_severity_2plus    181398      90.0%     20101      10.0% Too High
y_severity_3plus    181047      89.9%     20452      10.1% Too High
y_severity_6plus    160930      79.9%     40569      20.1% Too High
      y_critical     56101      27.8%    145398      72.2%     Good
y_critical_2plus     26252      13.0%    175247      87.0%  Too Low
  y_none_covered     50685      25.2%    150814      74.8%     Good


#### 2.3.8. Optimal Label Selection

In [None]:
# Find the best-balanced label: the first variant with a 30-50% positive rate
for label_col in label_variants:
    pos_pct = both[label_col].mean() * 100
    if 30 <= pos_pct <= 50:
        recommended_label = label_col
        break
else:
    # If no label in ideal range, pick closest to 40%
    diffs = [(abs(both[col].mean() * 100 - 40), col) for col in label_variants]
    recommended_label = min(diffs)[1]

print(f"\nRECOMMENDED: Use '{recommended_label}'")
print(f"  Positive: {both[recommended_label].mean() * 100:.1f}%")
print(f"  Rationale: Best balance for modeling")

# Set as primary label
both["y"] = both[recommended_label]
both["label_policy"] = recommended_label

print(f"\n Set 'y' column to {recommended_label}")


RECOMMENDED: Use 'y_critical'
  Positive: 27.8%
  Rationale: Best balance for modeling

 Set 'y' column to y_critical
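
The selection rule can be replayed on the positive rates reported in the label statistics table. The sketch below hard-codes those percentages in a dictionary (`pos_pct`, an illustrative name) and applies the same two-step rule.

```python
# Positive rates (%) copied from the label statistics table above.
pos_pct = {
    "y_any_missed": 94.3, "y_severity_2plus": 90.0, "y_severity_3plus": 89.9,
    "y_severity_6plus": 79.9, "y_critical": 27.8, "y_critical_2plus": 13.0,
    "y_none_covered": 25.2,
}

# First label inside the 30-50% window, if any.
in_window = [name for name, pct in pos_pct.items() if 30 <= pct <= 50]

# Otherwise fall back to the label whose rate is closest to 40%.
choice = in_window[0] if in_window else min(pos_pct, key=lambda k: abs(pos_pct[k] - 40))
print(in_window, choice)
```

No variant falls inside the 30-50% window, so the choice comes from the fallback: `y_critical` at 27.8% is 12.2 points from 40%, closer than `y_none_covered` at 25.2%.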


#### 2.3.9. Analyze Label Characteristics

In [None]:
# For the primary label, list the families most often missed among positive cases (all missed families, not only high-severity ones)
positive_cases = both[both["y"] == 1]

print(f"\nPositive cases (y=1): {len(positive_cases)}")
print(f"Negative cases (y=0): {len(both) - len(positive_cases)}")

# Most commonly missed diagnosis families
all_missed_families = []
for idx, row in positive_cases.iterrows():
    missed = row["dis_set"] - row["ed_set"]
    all_missed_families.extend(list(missed))

if all_missed_families:
    missed_counts = pd.Series(all_missed_families).value_counts().head(20)
    print("\nTop 20 most commonly missed diagnosis families:")
    print(missed_counts.to_string())

    # Check how many are high-severity
    high_sev_in_top20 = sum(1 for fam in missed_counts.index[:20]
                            if str(fam)[:3].upper() in HIGH_SEVERITY)
    print(f"\nOf top 20, {high_sev_in_top20} are high-severity diagnoses")


Positive cases (y=1): 56101
Negative cases (y=0): 145398

Top 20 most commonly missed diagnosis families:
E87    26008
E78    22658
Z79    19647
N18    17642
Z87    17452
E11    15939
I50    15468
N17    15241
Y92    15022
I25    14276
I10    13398
K21    13281
I48    11515
Z68    10072
Z86    10025
Z95     9994
F32     9956
I12     9430
Z85     8966
G47     8347

Of top 20, 4 are high-severity diagnoses


#### 2.3.10. Final Labels DataFrame

In [None]:
# Include all variants for comparison
labels = both[[
    "hadm_id",
    "y",
    "y_any_missed",
    "y_severity_2plus",
    "y_severity_3plus",
    "y_severity_6plus",
    "y_critical",
    "y_critical_2plus",
    "y_none_covered",
    "n_ed",
    "n_dis",
    "missed_severity_score",
    "n_high_severity_missed",
    "ed_set_str",
    "dis_set_str",
    "missed_set_str",
    "label_policy"
]].copy()

print(f"Created labels DataFrame with {len(labels)} admissions")
print(f"Primary label: '{recommended_label}'")

Created labels DataFrame with 201499 admissions
Primary label: 'y_critical'


#### 2.3.11. Merge with Cohort

In [None]:
cohort_adm = (
    cohort_1to1
    .drop_duplicates(subset=["hadm_id"])
    .copy()
)

cohort_with_y = cohort_adm.merge(
    labels,
    on="hadm_id",
    how="inner",
    validate="one_to_one"
)

print("Cohort with labels shape:", cohort_with_y.shape)
print(cohort_with_y[["y"]].value_counts(normalize=True))

Cohort with labels shape: (201499, 37)
y
0    0.721582
1    0.278418
Name: proportion, dtype: float64


#### 2.3.12. Summary and Recommendations

In [None]:
print(f"""
PRIMARY LABEL SELECTED: {recommended_label}
- Positive class: {both[recommended_label].mean() * 100:.1f}%
- Negative class: {(1 - both[recommended_label].mean()) * 100:.1f}%

WHAT THIS LABEL CAPTURES:
""")

if recommended_label == "y_severity_2plus":
    print("- ED missed at least one MEDIUM or HIGH severity diagnosis")
    print("- Excludes minor symptoms (chest pain, headache, etc.)")
    print("- Clinically meaningful discordance")
elif recommended_label == "y_severity_3plus":
    print("- ED missed at least one HIGH severity diagnosis")
    print("- Focuses on serious conditions (MI, stroke, sepsis, etc.)")
    print("- High clinical stakes")
elif recommended_label == "y_critical":
    print("- ED missed at least one life-threatening diagnosis")
    print("- Most clinically important cases")
    print("- May have lower prevalence")
elif recommended_label == "y_severity_6plus":
    print("- ED missed multiple significant diagnoses")
    print("- Cumulative severity >= 6.0")
    print("- Moderate-to-severe discordance")


PRIMARY LABEL SELECTED: y_critical
- Positive class: 27.8%
- Negative class: 72.2%

WHAT THIS LABEL CAPTURES:

- ED missed at least one life-threatening diagnosis
- Most clinically important cases
- May have lower prevalence


## 3. Split

### 3.1. Master Ordering of Patients (Shuffled List)

In [None]:
subjects_all = (
    cohort_with_y["subject_id"]
    .drop_duplicates()
    .sample(frac=1.0, random_state=SEED)
    .to_list()
)
n = len(subjects_all)

### 3.2. 70/15/15 Splits

In [None]:
# 70/15/15 split sizes
n_train = int(math.floor(0.70 * n))
n_val   = int(math.floor(0.15 * n))
n_test  = n - n_train - n_val

train_ids = subjects_all[:n_train]
val_ids   = subjects_all[n_train:n_train + n_val]
test_ids  = subjects_all[n_train + n_val:]

### 3.3. Sanity Checks

In [None]:
assert set(train_ids).isdisjoint(val_ids)
assert set(train_ids).isdisjoint(test_ids)
assert set(val_ids).isdisjoint(test_ids)

print(f"Primary splits: train={len(train_ids)}, val={len(val_ids)}, test={len(test_ids)}")

Primary splits: train=74889, val=16047, test=16049
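
The assertions above confirm the splits are pairwise disjoint. A slightly stronger check, sketched here, also confirms that the splits together cover every patient exactly once. `check_partition` is a hypothetical helper and the IDs are toy values.

```python
def check_partition(all_ids, *parts):
    """True if the parts contain no repeated ID and together equal all_ids."""
    combined = [i for part in parts for i in part]
    no_repeats = len(combined) == len(set(combined))
    full_cover = set(combined) == set(all_ids)
    return no_repeats and full_cover

good = check_partition([1, 2, 3, 4], [1, 2], [3], [4])
overlap = check_partition([1, 2, 3, 4], [1, 2], [2, 3], [4])
missing = check_partition([1, 2, 3, 4], [1, 2], [3])

print(good, overlap, missing)  # True False False
```

In the notebook this would be called as `check_partition(subjects_all, train_ids, val_ids, test_ids)`.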


### 3.4. Subject-Level Assignment

In [None]:
split_assign = pd.concat([
    pd.DataFrame({"subject_id": train_ids, "split": "train"}),
    pd.DataFrame({"subject_id": val_ids,   "split": "val"}),
    pd.DataFrame({"subject_id": test_ids,  "split": "test"}),
], ignore_index=True)

# Merge split assignment into cohort_with_y
cohort_with_y = cohort_with_y.merge(
    split_assign,
    on="subject_id",
    how="left",
    validate="many_to_one"
)

print("Split distribution:")
print(cohort_with_y["split"].value_counts())

Split distribution:
split
train    141303
val       30273
test      29923
Name: count, dtype: int64
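
Because patients contribute different numbers of admissions, the row-level shares above (about 70.1% / 15.0% / 14.9%) drift slightly from the 70/15/15 subject-level target. The sketch below summarizes rows, patients, and label prevalence per split, and checks that no patient spans two splits after the merge. `split_summary` is a hypothetical helper, the data are toy values, and treating `y` as the selected label column is an assumption.

```python
import pandas as pd

def split_summary(df, split_col="split", label_col="y", id_col="subject_id"):
    """Per-split row count, patient count, label prevalence, and share of all rows."""
    # Every row needs a split, and each patient must map to exactly one split
    assert df[split_col].notna().all(), "some rows have no split assignment"
    assert (df.groupby(id_col)[split_col].nunique() == 1).all(), "patient in >1 split"

    out = df.groupby(split_col).agg(
        n_rows=(label_col, "size"),
        n_subjects=(id_col, "nunique"),
        prevalence=(label_col, "mean"),
    )
    out["row_share"] = out["n_rows"] / out["n_rows"].sum()
    return out

toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 4],
    "split": ["train", "train", "train", "val", "val", "test"],
    "y": [1, 0, 0, 1, 1, 0],
})
summary = split_summary(toy)
print(summary)
```

Large differences in `prevalence` between splits would suggest re-drawing the split or stratifying by label.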


### 3.5. Nested Fractions

In [None]:
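# Add rank and percentile within each split (subject-level).
# Note: because of the sort below, rank follows subject_id order within a split,
# not the shuffled master ordering from 3.1.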
split_assign = split_assign.sort_values(["split", "subject_id"]).copy()
split_assign["rank_within_split"] = split_assign.groupby("split").cumcount()

# Size of each split
split_sizes = split_assign.groupby("split")["subject_id"].transform("count")
split_assign["pct_within_split"] = 100.0 * (split_assign["rank_within_split"] + 1) / split_sizes

# Merge back into cohort
cohort_with_y = cohort_with_y.merge(
    split_assign[["subject_id", "rank_within_split", "pct_within_split"]],
    on="subject_id",
    how="left",
    validate="many_to_one"
)

print(cohort_with_y[["split", "rank_within_split", "pct_within_split"]].head())

   split  rank_within_split  pct_within_split
0  train                  0          0.001335
1  train                  0          0.001335
2  train                  0          0.001335
3  train                  0          0.001335
4  train                  1          0.002671
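
The first value, 0.001335, is 100 × 1 / 74,889: the first of 74,889 training patients. The notebook does not state how these columns are used downstream; one plausible use, sketched here as an assumption, is drawing nested training subsets (for example 25% contained in 50%) for learning-curve experiments. `nested_subset` is a hypothetical helper and the data are toy values.

```python
import pandas as pd

def nested_subset(df, split, pct, split_col="split", pct_col="pct_within_split"):
    """Rows of `split` whose patient lies within the first `pct` percent of that split."""
    mask = (df[split_col] == split) & (df[pct_col] <= pct)
    return df.loc[mask]

toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 4, 5],
    "split": ["train"] * 5 + ["val"],
    "pct_within_split": [25.0, 25.0, 50.0, 75.0, 100.0, 100.0],
})

small = nested_subset(toy, "train", 25)
half = nested_subset(toy, "train", 50)

# Nesting: every patient in the smaller subset is also in the larger one
print(set(small["subject_id"]) <= set(half["subject_id"]))  # True
```

Filtering on a per-patient percentile keeps all admissions of a selected patient together, so subsets stay patient-level.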


## 4. Save cohort

### 4.1. Check features

In [None]:
print(cohort_with_y.columns)

Index(['subject_id', 'hadm_id', 'stay_id', 'ed_intime', 'T_ED', 'gender',
       'admittime', 'dischtime', 'admission_type', 'insurance', 'language',
       'marital_status', 'race', 'hospital_expire_flag', 'ed_seq_num',
       'ED_icd_code', 'ED_icd_version', 'dis_seq_num', 'disch_icd_code',
       'disch_icd_version', 'anchor_age', 'y', 'y_any_missed',
       'y_severity_2plus', 'y_severity_3plus', 'y_severity_6plus',
       'y_critical', 'y_critical_2plus', 'y_none_covered', 'n_ed', 'n_dis',
       'missed_severity_score', 'n_high_severity_missed', 'ed_set_str',
       'dis_set_str', 'missed_set_str', 'label_policy', 'split',
       'rank_within_split', 'pct_within_split'],
      dtype='object')


In [None]:
cohort_with_y.head()

Unnamed: 0,subject_id,hadm_id,stay_id,ed_intime,T_ED,gender,admittime,dischtime,admission_type,insurance,...,n_dis,missed_severity_score,n_high_severity_missed,ed_set_str,dis_set_str,missed_set_str,label_policy,split,rank_within_split,pct_within_split
0,10000032,22595853,33258284,2180-05-06 19:17:00,2180-05-06 23:30:00,F,2180-05-06 22:23:00,2180-05-07 17:15:00,URGENT,Medicaid,...,8,7.5,0,070|572|789|V08,070|296|309|496|571|572|789|V15,296|309|496|571|V15,y_critical,train,0,0.001335
1,10000032,22841357,38112554,2180-06-26 15:54:00,2180-06-26 21:31:00,F,2180-06-26 18:27:00,2180-06-27 18:49:00,EW EMER.,Medicaid,...,8,6.0,0,070|571|789|V08,070|276|287|305|496|571|789|V08,276|287|305|496,y_critical,train,0,0.001335
2,10000032,25742920,35968195,2180-08-05 20:58:00,2180-08-06 01:44:00,F,2180-08-05 23:44:00,2180-08-07 17:50:00,EW EMER.,Medicaid,...,9,9.0,0,571|789|V08,070|276|305|496|571|787|789|V08|V46,070|276|305|496|787|V46,y_critical,train,0,0.001335
3,10000032,29079034,39399961,2180-07-23 05:54:00,2180-07-23 14:00:00,F,2180-07-23 12:35:00,2180-07-25 17:55:00,EW EMER.,Medicaid,...,12,18.0,0,348|780,070|276|296|305|458|496|571|789|799|V08|V46|V49,070|276|296|305|458|496|571|789|799|V08|V46|V49,y_critical,train,0,0.001335
4,10000084,23052089,35203156,2160-11-20 20:36:00,2160-11-21 03:20:00,M,2160-11-21 01:56:00,2160-11-25 14:52:00,EW EMER.,Medicare,...,6,9.0,0,G20|R53,E78|F02|G31|R29|R44|Z85,E78|F02|G31|R29|R44|Z85,y_critical,train,1,0.002671


### 4.2. Save for Imputation

In [None]:
cohort_cols4imp = [
    'subject_id', 'hadm_id', 'gender', 'insurance',
    'language', 'marital_status', 'race', 'anchor_age',
    'y', 'split', 'rank_within_split', 'pct_within_split'
]

cohort_4imputation = (
    cohort_with_y[cohort_cols4imp]
    .drop_duplicates()
    .copy()
)

out_imp_csv = os.path.join(OUT_DIR, "IMPUTATION_VARIABLES.csv")
cohort_4imputation.to_csv(out_imp_csv, index=False)
print("Saved:", out_imp_csv)

Saved: /content/drive/MyDrive/manifests/IMPUTATION_VARIABLES.csv


### 4.3. Save for Modeling

In [None]:
cohort_cols_models = [
    'subject_id', 'hadm_id', 'stay_id', 'ed_intime', 'T_ED','ed_seq_num',
    'y', 'split', 'rank_within_split','pct_within_split'
]

cohort_final = (
    cohort_with_y[cohort_cols_models]
    .drop_duplicates()
    .copy()
)

out_label_csv = os.path.join(OUT_DIR, "FINAL_COHORT.csv")
cohort_final.to_csv(out_label_csv, index=False)
print("Saved:", out_label_csv)

uniqueptremain = cohort_final["subject_id"].nunique()
print("Total number of unique patients in cohort:", uniqueptremain)

uniqueadremain = cohort_final["hadm_id"].nunique()
print("Total number of unique admissions in cohort:", uniqueadremain)

Saved: /content/drive/MyDrive/manifests/FINAL_COHORT.csv
Total number of unique patients in cohort: 106985
Total number of unique admissions in cohort: 201499
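
CSV files do not preserve dtypes, so timestamp columns such as `ed_intime` and `T_ED` come back as strings unless they are parsed on load. The sketch below round-trips a small synthetic frame through a temporary file to illustrate this; the values are made up and the temporary path stands in for `OUT_DIR`.

```python
import os
import tempfile
import pandas as pd

# Synthetic stand-in for cohort_final (values are illustrative only)
toy = pd.DataFrame({
    "subject_id": [1],
    "hadm_id": [100],
    "ed_intime": pd.to_datetime(["2100-01-01 08:00:00"]),
    "T_ED": pd.to_datetime(["2100-01-01 12:30:00"]),
    "y": [0],
    "split": ["train"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "FINAL_COHORT.csv")
    toy.to_csv(path, index=False)
    # parse_dates restores the datetime dtype for the listed columns
    reloaded = pd.read_csv(path, parse_dates=["ed_intime", "T_ED"])

print(reloaded.dtypes)
```

After reloading, a check such as `(reloaded["ed_intime"] <= reloaded["T_ED"]).all()` confirms the cohort's time-ordering criterion still holds.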
