# ARDS Identification - Optimized Implementation

This notebook identifies ARDS patients from our base cohort using the updated ARDS definition:

## ARDS Definition (Updated):
**ARDS = S/F ratio ≤ 315 within 60 minutes of ICU admission (onset criterion) + bilateral infiltrates flag**
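
As a minimal sketch of this rule (not the notebook's implementation, which follows in Steps 2-5), the two criteria can be applied to a toy table. The column names `spo2`, `fio2`, `minutes_from_icu`, and `bilateral`, and all values, are illustrative assumptions.

```python
import pandas as pd

SF_THRESHOLD = 315      # S/F cut-off from the definition
WINDOW_MINUTES = 60     # window after ICU admission

# Toy data: one row per paired SpO2/FiO2 reading (hypothetical values)
toy = pd.DataFrame({
    "stay_id": [1, 1, 2, 3],
    "minutes_from_icu": [10, 45, 30, 90],
    "spo2": [92.0, 95.0, 98.0, 88.0],   # percent
    "fio2": [0.50, 0.40, 0.21, 0.60],   # fraction, not percent
    "bilateral": [True, True, True, True],
})

toy["sf_ratio"] = toy["spo2"] / toy["fio2"]
in_window = toy["minutes_from_icu"].between(0, WINDOW_MINUTES)
toy["meets_sf"] = in_window & (toy["sf_ratio"] <= SF_THRESHOLD)

# A stay is flagged when any in-window reading meets the S/F cut-off
# and the bilateral infiltrates flag is set.
ards_flag = (toy["meets_sf"] & toy["bilateral"]).groupby(toy["stay_id"]).any()
```

Stay 3 has a low S/F ratio but is not flagged, because its only reading falls outside the 60-minute window.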

### Key Changes:
- **No patient filtering**: All cohort patients are retained
- **ARDS flag**: Binary indicator for ARDS presence
- **ARDS onset time**: Timestamp when S/F ≤ 315 first occurred within 60min of ICU admission
- **Bilateral infiltrates**: Flag from radiology reports (separate indicator)
- **Vectorized processing**: Efficient handling of large datasets

### Output:
- Cohort with ARDS flags and onset times
- Bilateral infiltrates detection
- Summary statistics by ARDS status

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import re
import gc
import os
import warnings
warnings.filterwarnings('ignore')

# Define paths
MIMIC_PATH = '/Users/kavenchhikara/Desktop/CLIF/MIMIC-IV-3.1/physionet.org/files'
DATA_PATH = '/Users/kavenchhikara/Desktop/projects/SCCM/SCCM-Team2/ards_analysis/data'

print(f"Analysis start time: {datetime.now()}")
print(f"Data path: {DATA_PATH}")

Analysis start time: 2025-07-20 01:37:20.629612
Data path: /Users/kavenchhikara/Desktop/projects/SCCM/SCCM-Team2/ards_analysis/data


## Step 1: Load Base Cohort

In [4]:
# Load the optimized base cohort, falling back to the original file if it is missing
try:
    base_cohort = pd.read_parquet(f'{DATA_PATH}/base_cohort_optimized.parquet')
    cohort_label = 'optimized'
except FileNotFoundError:
    base_cohort = pd.read_parquet(f'{DATA_PATH}/base_cohort.parquet')
    cohort_label = 'original'

# The cohort has one row per ICU stay, so row counts are stays, not patients
print(f"Loaded {cohort_label} cohort: {len(base_cohort):,} ICU stays")
print(f"Unique admissions (hadm_id): {base_cohort['hadm_id'].nunique():,}")
print(f"Unique ICU stays (stay_id): {base_cohort['stay_id'].nunique():,}")
print(f"Unique patients: {base_cohort['subject_id'].nunique():,}")

# Convert datetime columns
datetime_cols = ['admission_dttm', 'discharge_dttm', 'intime', 'outtime']
for col in datetime_cols:
    if col in base_cohort.columns:
        base_cohort[col] = pd.to_datetime(base_cohort[col])

# Display cohort info
print(f"\nCohort overview:")
print(f"Date range: {base_cohort['admission_dttm'].min()} to {base_cohort['admission_dttm'].max()}")
print(f"ICU stays: {base_cohort['stay_id'].nunique():,}")

Loaded optimized cohort: 21,590 ICU stays
Unique admissions (hadm_id): 18,857
Unique ICU stays (stay_id): 21,590
Unique patients: 17,500

Cohort overview:
Date range: 2110-01-11 10:14:00 to 2211-05-01 06:57:00
ICU stays: 21,590


## Step 2: Extract S/F Ratios Within 60 Minutes of ICU Admission

In [14]:
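# Define itemids for SpO2 and FiO2
# NOTE: confirm every itemid against icu/d_items before relying on the results.
# In the public MIMIC-IV dictionary, 220277 is "O2 saturation pulseoxymetry" and
# 223835 is "Inspired O2 Fraction", but 224696 and 220210 appear to be listed as
# "Plateau Pressure" and "Respiratory Rate". If so, they would contaminate the
# S/F ratios computed below (e.g. FiO2 values under 0.21).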
SPO2_ITEMIDS = [220277, 224696]  # SpO2 pulse oximetry
FIO2_ITEMIDS = [220210, 223835]  # FiO2 (%) and FiO2 (fraction)
SF_ITEMIDS = SPO2_ITEMIDS + FIO2_ITEMIDS

# Get target ICU stays
target_stay_ids = set(base_cohort['stay_id'])
print(f"Target ICU stays: {len(target_stay_ids):,}")

print("Loading chartevents for S/F ratio calculation...")
print("Extracting SpO2 and FiO2 data within 60 minutes of ICU admission...")

# Load chartevents with optimized column selection
chartevents = pd.read_csv(
    f'{MIMIC_PATH}/mimiciv/3.1/icu/chartevents.csv.gz',
    usecols=['stay_id', 'itemid', 'charttime', 'valuenum'],
    dtype={'stay_id': 'int32', 'itemid': 'int32', 'valuenum': 'float32'}
)
print(f"Chartevents loaded: {len(chartevents):,} rows")

# Filter to our cohort and S/F parameters (vectorized)
sf_data = chartevents[
    (chartevents['stay_id'].isin(target_stay_ids)) &
    (chartevents['itemid'].isin(SF_ITEMIDS)) &
    (chartevents['valuenum'].notna())
].copy()

print(f"S/F measurements for cohort: {len(sf_data):,}")

# Clear original chartevents
# del chartevents
gc.collect()

Target ICU stays: 21,590
Loading chartevents for S/F ratio calculation...
Extracting SpO2 and FiO2 data within 60 minutes of ICU admission...
Chartevents loaded: 432,997,491 rows
S/F measurements for cohort: 6,168,848


1701
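
Reading all 433 million chartevents rows before filtering needs a lot of memory. One alternative, sketched here with an assumed helper name (`load_filtered_chartevents`) and a tiny in-memory CSV standing in for the real file, is to stream the file in chunks and keep only matching rows.

```python
import io
import pandas as pd

def load_filtered_chartevents(source, stay_ids, itemids, chunksize=1_000_000):
    """Stream a chartevents CSV in chunks, keeping only cohort rows of interest."""
    kept = []
    reader = pd.read_csv(
        source,
        usecols=["stay_id", "itemid", "charttime", "valuenum"],
        dtype={"stay_id": "int32", "itemid": "int32", "valuenum": "float32"},
        chunksize=chunksize,
    )
    for chunk in reader:
        mask = (
            chunk["stay_id"].isin(stay_ids)
            & chunk["itemid"].isin(itemids)
            & chunk["valuenum"].notna()
        )
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for chartevents.csv.gz (hypothetical rows)
demo_csv = io.StringIO(
    "stay_id,itemid,charttime,valuenum\n"
    "1,220277,2150-01-01 00:05:00,95\n"
    "1,999999,2150-01-01 00:05:00,12\n"
    "2,220277,2150-01-01 00:10:00,\n"
    "3,223835,2150-01-01 00:15:00,40\n"
)
demo = load_filtered_chartevents(demo_csv, {1, 3}, [220277, 223835], chunksize=2)
```

With the real file, `source` would be the same path passed to `pd.read_csv` above. Peak memory then scales with the chunk size plus the filtered rows rather than with the full table.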

In [15]:
# Add parameter type and merge with ICU times (vectorized)
sf_data['param_type'] = 'unknown'
sf_data.loc[sf_data['itemid'].isin(SPO2_ITEMIDS), 'param_type'] = 'spo2'
sf_data.loc[sf_data['itemid'].isin(FIO2_ITEMIDS), 'param_type'] = 'fio2'

# Convert charttime and merge with ICU admission times
sf_data['charttime'] = pd.to_datetime(sf_data['charttime'])
sf_data = sf_data.merge(
    base_cohort[['stay_id', 'hadm_id', 'intime']], 
    on='stay_id', 
    how='left'
)

# Calculate minutes from ICU admission (vectorized)
sf_data['minutes_from_icu'] = (
    sf_data['charttime'] - sf_data['intime']
).dt.total_seconds() / 60

# Filter to first 60 minutes of ICU admission (vectorized)
sf_data_60min = sf_data[
    (sf_data['minutes_from_icu'] >= 0) &
    (sf_data['minutes_from_icu'] <= 60)
].copy()

print(f"S/F measurements in first 60 minutes: {len(sf_data_60min):,}")
print(f"Parameter distribution:")
print(sf_data_60min['param_type'].value_counts())

# Clear full sf_data
# del sf_data
gc.collect()

S/F measurements in first 60 minutes: 84,846
Parameter distribution:
param_type
fio2    45691
spo2    39155
Name: count, dtype: int64


325
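
The parameter counts above depend entirely on the itemid lists, so looking the ids up in the `d_items` dictionary table is a cheap sanity check. The sketch below uses a hypothetical helper (`describe_itemids`) and a made-up excerpt with placeholder ids and labels. With real data, `d_items` would be read from the `d_items` file of the ICU module, whose exact path is an assumption here.

```python
import pandas as pd

def describe_itemids(d_items, itemids):
    """Return the label and unit rows of d_items for the given itemids."""
    cols = ["itemid", "label", "unitname"]
    found = d_items.loc[d_items["itemid"].isin(itemids), cols]
    return found.sort_values("itemid").reset_index(drop=True)

# Hypothetical excerpt standing in for d_items; ids and labels are placeholders
d_items_demo = pd.DataFrame({
    "itemid": [103, 101, 102],
    "label": ["Heart rate (demo)", "Oxygen saturation (demo)", "Inspired O2 (demo)"],
    "unitname": ["bpm", "%", None],
})
labels = describe_itemids(d_items_demo, [101, 102])
print(labels)
```

Any label that is not an oxygen saturation or inspired oxygen fraction is a signal to revisit `SPO2_ITEMIDS` or `FIO2_ITEMIDS`.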

In [None]:
# Separate SpO2 and FiO2 data
spo2_data = sf_data_60min[sf_data_60min['param_type'] == 'spo2'].copy()
fio2_data = sf_data_60min[sf_data_60min['param_type'] == 'fio2'].copy()

print(f"SpO2 measurements (60min): {len(spo2_data):,}")
print(f"FiO2 measurements (60min): {len(fio2_data):,}")

if len(spo2_data) > 0 and len(fio2_data) > 0:
    # Convert FiO2 percentages to fractions (vectorized)
    fio2_mask = (sf_data_60min['param_type'] == 'fio2') & (sf_data_60min['valuenum'] > 1)
    sf_data_60min.loc[fio2_mask, 'valuenum'] = sf_data_60min.loc[fio2_mask, 'valuenum'] / 100

    print("Calculating S/F ratios using pivot method...")

    # Pivot to get SpO2 and FiO2 as columns (vectorized)
    sf_pivot = sf_data_60min.pivot_table(
        index=['stay_id', 'hadm_id', 'charttime'],
        columns='param_type',
        values='valuenum',
        aggfunc='first'  # Take first value if multiple at same time
    ).reset_index()

    # Clean up column names
    sf_pivot.columns.name = None

    # Ensure both columns exist
    if 'spo2' not in sf_pivot.columns:
        sf_pivot['spo2'] = np.nan
    if 'fio2' not in sf_pivot.columns:
        sf_pivot['fio2'] = np.nan

    # Calculate S/F ratios where both measurements exist (vectorized)
    sf_ratios = sf_pivot.dropna(subset=['spo2', 'fio2']).copy()
    sf_ratios = sf_ratios[sf_ratios['fio2'] > 0]  # Avoid division by zero
    sf_ratios['sf_ratio'] = sf_ratios['spo2'] / sf_ratios['fio2']

    print(f"Successful S/F calculations (pivot method): {len(sf_ratios):,}")
    print(f"S/F ratio distribution (60min window):")
    print(sf_ratios['sf_ratio'].describe())

    sf_ratios = sf_ratios.merge(
        base_cohort[['stay_id','intime']], 
        on='stay_id', 
        how='left')
    sf_ratios['minutes_from_icu'] = (sf_ratios['charttime'] - sf_ratios['intime']).dt.total_seconds() / 60

    # Clear intermediate data
    # del spo2_data, fio2_data, sf_pivot, sf_data_60min
    gc.collect()

else:
    print("Insufficient SpO2/FiO2 data in 60-minute window!")
    sf_ratios = pd.DataFrame()

SpO2 measurements (60min): 39,155
FiO2 measurements (60min): 45,691
Calculating S/F ratios using pivot method...
Successful S/F calculations (pivot method): 28,917
S/F ratio distribution (60min window):
count    28917.000000
mean       474.522003
std        225.091293
min          0.000000
25%        357.142853
50%        490.000000
75%        613.333313
max       5000.000000
Name: sf_ratio, dtype: float64
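
The distribution above includes a minimum of 0 and a maximum of 5000, neither of which is physiologically plausible for SpO2 (%) divided by an FiO2 fraction. A range filter applied before the division is one way to guard against this. The bounds and the helper name `clean_sf_inputs` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Assumed plausibility bounds; adjust to the study protocol
SPO2_RANGE = (50.0, 100.0)   # percent
FIO2_RANGE = (0.21, 1.0)     # fraction

def clean_sf_inputs(df):
    """Normalise FiO2 to a fraction and drop rows outside the assumed bounds."""
    out = df.copy()
    # Values above 1 are treated as percentages
    out["fio2"] = np.where(out["fio2"] > 1, out["fio2"] / 100, out["fio2"])
    ok = out["spo2"].between(*SPO2_RANGE) & out["fio2"].between(*FIO2_RANGE)
    out = out[ok].copy()
    out["sf_ratio"] = out["spo2"] / out["fio2"]
    return out

raw = pd.DataFrame({
    "spo2": [98.0, 0.0, 95.0, 100.0],
    "fio2": [40.0, 0.5, 0.02, 1.0],
})
cleaned = clean_sf_inputs(raw)
```

Only the first and last rows survive: the second has an SpO2 of 0 and the third an FiO2 below room air.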


In [19]:
sf_ratios.head()

|   | stay_id | hadm_id | charttime | fio2 | spo2 | sf_ratio | intime | minutes_from_icu |
|---|---|---|---|---|---|---|---|---|
| 0 | 30000153 | 23998182 | 2174-09-29 13:00:00 | 0.16 | 100.0 | 625.0 | 2174-09-29 12:09:00 | 51.0 |
| 1 | 30000646 | 22795209 | 2194-04-29 01:41:00 | 0.28 | 97.0 | 346.428558 | 2194-04-29 01:39:22 | 1.633333 |
| 2 | 30000646 | 22795209 | 2194-04-29 02:00:00 | 0.33 | 98.0 | 296.969696 | 2194-04-29 01:39:22 | 20.633333 |
| 3 | 30001555 | 25778760 | 2177-09-27 12:00:00 | 0.14 | 96.0 | 685.714294 | 2177-09-27 11:23:13 | 36.783333 |
| 4 | 30001947 | 23836605 | 2162-12-26 15:17:00 | 0.14 | 97.0 | 692.857117 | 2162-12-26 15:04:30 | 12.5 |
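
The pivot pairs SpO2 and FiO2 only when both share the exact same `charttime`, which discards readings charted a few minutes apart. An alternative, sketched here and not used in this notebook, is `pd.merge_asof`, which attaches to each SpO2 reading the most recent FiO2 for the same stay. The helper name, the 60-minute tolerance, and the demo values are assumptions.

```python
import pandas as pd

def pair_spo2_with_fio2(spo2, fio2, tolerance="60min"):
    """Attach to each SpO2 reading the latest FiO2 charted at or before it."""
    # merge_asof requires both frames to be sorted by the `on` key
    spo2 = spo2.sort_values("charttime")
    fio2 = fio2.sort_values("charttime")
    paired = pd.merge_asof(
        spo2, fio2,
        on="charttime", by="stay_id",
        direction="backward",
        tolerance=pd.Timedelta(tolerance),
    )
    return paired.dropna(subset=["fio2"]).reset_index(drop=True)

spo2_demo = pd.DataFrame({
    "stay_id": [1, 1, 2],
    "charttime": pd.to_datetime(
        ["2150-01-01 00:10", "2150-01-01 00:40", "2150-01-01 00:20"]),
    "spo2": [94.0, 96.0, 99.0],
})
fio2_demo = pd.DataFrame({
    "stay_id": [1, 2],
    "charttime": pd.to_datetime(["2150-01-01 00:05", "2150-01-01 00:30"]),
    "fio2": [0.5, 0.4],
})
paired = pair_spo2_with_fio2(spo2_demo, fio2_demo)
```

Both SpO2 readings for stay 1 pick up the FiO2 charted at 00:05. Stay 2 is dropped because its only FiO2 was charted after the SpO2 reading.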


## Step 3: Identify ARDS Onset (S/F ≤ 315 within 60 minutes)

In [20]:
if len(sf_ratios) > 0:
    # Identify ARDS onset: first S/F ratio ≤ 315 within 60 minutes (vectorized)
    ards_sf_criteria = sf_ratios[sf_ratios['sf_ratio'] <= 315].copy()
    
    print(f"S/F measurements ≤ 315: {len(ards_sf_criteria):,}")
    
    if len(ards_sf_criteria) > 0:
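        # Aggregate per admission (vectorized). Each column is reduced independently,
        # so 'sf_ratio' is the lowest qualifying S/F in the window, which is not
        # necessarily the value recorded at the earliest timestamp.
        ards_onset_times = ards_sf_criteria.groupby('hadm_id').agg({
            'charttime': 'min',  # Earliest time with S/F ≤ 315
            'sf_ratio': 'min',   # Lowest S/F ratio among qualifying readings
            'minutes_from_icu': 'min'  # Smallest offset from ICU admission (minutes)
        }).reset_index()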
        
        ards_onset_times.rename(columns={
            'charttime': 'ards_onset_time',
            'sf_ratio': 'ards_onset_sf_ratio',
            'minutes_from_icu': 'ards_onset_minutes_from_icu'
        }, inplace=True)
        
        print(f"Admissions with ARDS onset (S/F ≤ 315): {len(ards_onset_times):,}")
        print(f"\nARDS onset timing (minutes from ICU admission):")
        print(ards_onset_times['ards_onset_minutes_from_icu'].describe())
        
        print(f"\nARDS onset S/F ratios:")
        print(ards_onset_times['ards_onset_sf_ratio'].describe())
        
    else:
        print("No ARDS onset events found (S/F ≤ 315)!")
        ards_onset_times = pd.DataFrame()
        
else:
    print("No S/F ratios calculated - cannot identify ARDS onset!")
    ards_onset_times = pd.DataFrame()

# Clear sf_ratios data (ards_sf_criteria exists only when sf_ratios was non-empty)
if 'sf_ratios' in locals():
    del sf_ratios
    if 'ards_sf_criteria' in locals():
        del ards_sf_criteria
    gc.collect()

S/F measurements ≤ 315: 5,515
Admissions with ARDS onset (S/F ≤ 315): 4,402

ARDS onset timing (minutes from ICU admission):
count    4402.000000
mean       24.866852
std        17.368045
min         0.000000
25%         9.800000
50%        22.000000
75%        39.000000
max        60.000000
Name: ards_onset_minutes_from_icu, dtype: float64

ARDS onset S/F ratios:
count    4402.000000
mean      122.652115
std       103.251930
min         0.000000
25%        30.000000
50%        91.485508
75%       222.670460
max       314.814789
Name: ards_onset_sf_ratio, dtype: float64
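
Because the aggregation takes the minimum of each column separately, the reported onset S/F ratio can come from a later reading than the onset timestamp. Grouping by `hadm_id` also pools readings from different ICU stays of the same admission. A possible alternative, sketched under the assumption that onset should be defined per ICU stay, keeps the single earliest qualifying row for each `stay_id`. The helper name and demo values are hypothetical.

```python
import pandas as pd

def first_onset_per_stay(sf_ratios, threshold=315):
    """Keep the earliest qualifying row per stay, so all fields describe one reading."""
    hits = sf_ratios[sf_ratios["sf_ratio"] <= threshold]
    first = (
        hits.sort_values(["stay_id", "charttime"])
            .drop_duplicates("stay_id", keep="first")
    )
    return first.rename(columns={
        "charttime": "ards_onset_time",
        "sf_ratio": "ards_onset_sf_ratio",
        "minutes_from_icu": "ards_onset_minutes_from_icu",
    }).reset_index(drop=True)

demo = pd.DataFrame({
    "stay_id": [10, 10, 11],
    "hadm_id": [500, 500, 500],
    "charttime": pd.to_datetime(
        ["2150-01-01 00:20", "2150-01-01 00:50", "2150-01-03 08:15"]),
    "sf_ratio": [300.0, 120.0, 400.0],
    "minutes_from_icu": [20.0, 50.0, 15.0],
})
onset = first_onset_per_stay(demo)
```

Here stay 10 gets an onset S/F of 300 (the value at its first qualifying reading) rather than 120, and stay 11 is not flagged even though it shares an admission with stay 10.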


## Step 4: Detect Bilateral Infiltrates from Radiology Reports

In [21]:
# Load radiology reports for cohort
print("Loading radiology reports for bilateral infiltrates detection...")
radiology = pd.read_csv(f'{MIMIC_PATH}/mimic-iv-note/2.2/note/radiology.csv.gz')

# Filter to our cohort admissions
cohort_hadm_ids = set(base_cohort['hadm_id'])
cohort_radiology = radiology[
    radiology['hadm_id'].isin(cohort_hadm_ids)
].copy()

print(f"Radiology reports for cohort: {len(cohort_radiology):,}")

# Convert charttime
cohort_radiology['charttime'] = pd.to_datetime(cohort_radiology['charttime'])

# Clear original radiology data
# del radiology
gc.collect()

Loading radiology reports for bilateral infiltrates detection...
Radiology reports for cohort: 215,193


0

In [22]:
# Vectorized bilateral infiltrates detection
def detect_bilateral_infiltrates_vectorized(text_series):
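    """
    Flag radiology reports whose text matches any bilateral-infiltrate pattern.

    Matching is case-insensitive and purely pattern-based: negated findings
    (e.g. "no bilateral infiltrates") are not excluded.
    """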
    # Convert to lowercase (vectorized)
    text_lower = text_series.str.lower().fillna('')
    
    # Define bilateral patterns
    bilateral_patterns = [
        r'bilateral.*(?:infiltrate|opacity|opacities|consolidation)',
        r'(?:infiltrate|opacity|opacities|consolidation).*bilateral',
        r'both lung.*(?:infiltrate|opacity|opacities|consolidation)',
        r'(?:infiltrate|opacity|opacities|consolidation).*both lung',
        r'diffuse.*(?:infiltrate|opacity|opacities|consolidation)',
        r'multifocal.*(?:infiltrate|opacity|opacities|consolidation)',
        r'bibasilar.*(?:infiltrate|opacity|opacities|consolidation)',
        r'bilateral.*ground.?glass',
        r'bilateral.*airspace disease',
        r'ards',  # Direct ARDS mention; unanchored, so it also matches inside words such as "towards"
        r'acute respiratory distress syndrome'
    ]
    
    # Check for any bilateral pattern (vectorized)
    bilateral_matches = pd.Series(False, index=text_series.index)
    
    for pattern in bilateral_patterns:
        bilateral_matches |= text_lower.str.contains(pattern, regex=True, na=False)
    
    # Check for separate left and right infiltrates (vectorized)
    left_infiltrate = text_lower.str.contains(
        r'left.*(?:infiltrate|opacity|consolidation)', regex=True, na=False
    )
    right_infiltrate = text_lower.str.contains(
        r'right.*(?:infiltrate|opacity|consolidation)', regex=True, na=False
    )
    
    # Combine results
    return bilateral_matches | (left_infiltrate & right_infiltrate)

# Apply bilateral infiltrates detection (vectorized)
print("Detecting bilateral infiltrates in radiology reports...")
cohort_radiology['bilateral_infiltrates'] = detect_bilateral_infiltrates_vectorized(
    cohort_radiology['text']
)

bilateral_reports = cohort_radiology['bilateral_infiltrates'].sum()
total_reports = len(cohort_radiology)
print(f"Reports with bilateral infiltrates: {bilateral_reports:,} ({bilateral_reports/total_reports*100:.1f}%)")

Detecting bilateral infiltrates in radiology reports...
Reports with bilateral infiltrates: 24,035 (11.2%)
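
Pattern matching of this kind is sensitive to substrings and negation. The sketch below shows one way to tighten it: word boundaries around each term, and a sentence-level check that skips sentences containing a simple negation cue. The pattern set is deliberately smaller than the one above, and the negation list (`no`, `without`) is an assumption that a real study would need to validate against annotated reports.

```python
import pandas as pd

FINDING = r"(?:infiltrates?|opacity|opacities|consolidations?)"
POSITIVE = (
    rf"\bbilateral\b.*\b{FINDING}\b"
    rf"|\b{FINDING}\b.*\bbilateral\b"
    r"|\bards\b"
)
NEGATION = r"\b(?:no|without)\b"   # assumed, very small cue list

def flag_bilateral(text_series):
    """Flag reports with at least one non-negated sentence matching POSITIVE."""
    sentences = (
        text_series.fillna("").str.lower()
        .str.split(r"[.\n]+")   # crude sentence split on periods and newlines
        .explode()
    )
    positive = sentences.str.contains(POSITIVE, regex=True, na=False)
    negated = sentences.str.contains(NEGATION, regex=True, na=False)
    hit = positive & ~negated
    # Collapse sentence-level hits back to one flag per report
    return hit.groupby(level=0).any().reindex(text_series.index, fill_value=False)

reports = pd.Series([
    "Bilateral patchy opacities, worse on the left.",
    "No bilateral infiltrates. Heart size normal.",
    "Patient turned towards the detector. Lungs clear.",
    None,
    "No pneumothorax. Findings consistent with ARDS.",
])
flags = flag_bilateral(reports)
```

The second report is no longer flagged because its only matching sentence is negated, and "towards" in the third no longer counts as an ARDS mention.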


In [23]:
# Get bilateral infiltrates flag for each admission (vectorized)
bilateral_by_admission = cohort_radiology.groupby('hadm_id').agg({
    'bilateral_infiltrates': 'any',  # True if any report during the admission matches, regardless of timing
    'charttime': 'min'  # First radiology report time
}).reset_index()

bilateral_by_admission.rename(columns={
    'bilateral_infiltrates': 'has_bilateral_infiltrates',
    'charttime': 'first_radiology_time'
}, inplace=True)

print(f"Admissions with bilateral infiltrates: {bilateral_by_admission['has_bilateral_infiltrates'].sum():,}")
print(f"Total admissions with radiology: {len(bilateral_by_admission):,}")
print(f"Bilateral infiltrates rate: {bilateral_by_admission['has_bilateral_infiltrates'].mean()*100:.1f}%")

# Clear radiology data
# del cohort_radiology
gc.collect()

Admissions with bilateral infiltrates: 8,425
Total admissions with radiology: 18,857
Bilateral infiltrates rate: 44.7%


16
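
The admission-level flag counts a report from any point in the hospitalization, even one charted days after the 60-minute S/F window. If imaging is meant to be contemporaneous with onset, reports can be restricted to a window around ICU admission. The sketch below assumes a window of 24 hours before to 48 hours after `intime`; both bounds and the helper name are illustrative choices, not values from this notebook.

```python
import pandas as pd

def bilateral_near_icu_admission(reports, stays, hours_before=24, hours_after=48):
    """Per stay: any bilateral-positive report charted within the assumed window."""
    merged = stays[["stay_id", "hadm_id", "intime"]].merge(
        reports, on="hadm_id", how="left"
    )
    delta_h = (merged["charttime"] - merged["intime"]).dt.total_seconds() / 3600
    in_window = delta_h.between(-hours_before, hours_after)
    # eq(True) treats missing flags (stays without reports) as False
    flag = in_window & merged["bilateral_infiltrates"].eq(True)
    return flag.groupby(merged["stay_id"]).any().rename("has_bilateral_infiltrates")

stays_demo = pd.DataFrame({
    "stay_id": [1, 2, 3],
    "hadm_id": [100, 200, 300],
    "intime": pd.to_datetime(
        ["2150-01-05 12:00", "2150-02-01 08:00", "2150-03-01 09:00"]),
})
reports_demo = pd.DataFrame({
    "hadm_id": [100, 100, 200, 200],
    "charttime": pd.to_datetime(
        ["2150-01-05 14:00", "2150-01-20 10:00",
         "2150-02-10 09:00", "2150-02-01 10:00"]),
    "bilateral_infiltrates": [True, True, True, False],
})
window_flags = bilateral_near_icu_admission(reports_demo, stays_demo)
```

Stay 2 has a positive report, but it was charted nine days after ICU admission, so it does not count under this window.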

## Step 5: Create Final Dataset with ARDS Flags

In [24]:
# Start with base cohort
final_cohort = base_cohort.copy()
print(f"Starting cohort: {len(final_cohort):,} patients")

# Merge ARDS onset information (vectorized left join on hadm_id; every ICU stay
# of an admission receives that admission's onset values)
if len(ards_onset_times) > 0:
    final_cohort = final_cohort.merge(
        ards_onset_times, 
        on='hadm_id', 
        how='left'
    )
    print(f"Merged ARDS onset data for {len(ards_onset_times):,} admissions")
else:
    # Add empty ARDS columns if no onset data
    final_cohort['ards_onset_time'] = pd.NaT
    final_cohort['ards_onset_sf_ratio'] = np.nan
    final_cohort['ards_onset_minutes_from_icu'] = np.nan
    print("No ARDS onset data - added empty columns")

# Merge bilateral infiltrates information (vectorized left join)
if len(bilateral_by_admission) > 0:
    final_cohort = final_cohort.merge(
        bilateral_by_admission, 
        on='hadm_id', 
        how='left'
    )
    print(f"Merged bilateral infiltrates data for {len(bilateral_by_admission):,} admissions")
else:
    # Add empty bilateral infiltrates columns
    final_cohort['has_bilateral_infiltrates'] = False
    final_cohort['first_radiology_time'] = pd.NaT
    print("No bilateral infiltrates data - added empty columns")

# Fill missing values (vectorized)
final_cohort['has_bilateral_infiltrates'] = final_cohort['has_bilateral_infiltrates'].fillna(False)

print(f"Final dataset: {len(final_cohort):,} patients")

Starting cohort: 21,590 patients
Merged ARDS onset data for 4,402 admissions
Merged bilateral infiltrates data for 18,857 admissions
Final dataset: 21,590 patients
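
The cohort has one row per ICU stay (21,590) but onset data is keyed by admission (4,402 of 18,857), so joining on `hadm_id` copies an admission's onset to each of its stays and likely inflates stay-level counts. The toy example below contrasts the two join keys and uses the `validate` and `indicator` options of `merge` as guards. The frames and values are hypothetical.

```python
import pandas as pd

stays = pd.DataFrame({"stay_id": [1, 2, 3], "hadm_id": [100, 100, 200]})

# Onset keyed by admission: both stays of admission 100 inherit the onset
onset_by_adm = pd.DataFrame({"hadm_id": [100], "ards_onset_sf_ratio": [180.0]})
by_adm = stays.merge(onset_by_adm, on="hadm_id", how="left",
                     validate="many_to_one")
n_flagged_adm = int(by_adm["ards_onset_sf_ratio"].notna().sum())

# Onset keyed by stay: only the stay where the reading occurred is flagged
onset_by_stay = pd.DataFrame({"stay_id": [1], "ards_onset_sf_ratio": [180.0]})
by_stay = stays.merge(onset_by_stay, on="stay_id", how="left",
                      validate="one_to_one", indicator=True)
n_flagged_stay = int((by_stay["_merge"] == "both").sum())

# A left join with a unique right key must not change the row count
assert len(by_adm) == len(stays) and len(by_stay) == len(stays)
```

`validate` raises a `MergeError` if the right-hand key is unexpectedly duplicated, which would otherwise silently add rows to the cohort.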


In [25]:
# Create ARDS flags (vectorized)
print("Creating ARDS flags...")

# ARDS onset flag: S/F ≤ 315 within 60 minutes
final_cohort['has_ards_onset'] = final_cohort['ards_onset_time'].notna()

# Combined ARDS flag: onset + bilateral infiltrates
final_cohort['has_ards'] = (
    final_cohort['has_ards_onset'] & 
    final_cohort['has_bilateral_infiltrates']
)

# Calculate time from ICU admission to ARDS onset (vectorized)
final_cohort['hours_icu_to_ards_onset'] = (
    final_cohort['ards_onset_time'] - final_cohort['intime']
).dt.total_seconds() / 3600

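# ARDS severity from the onset S/F ratio (vectorized):
# mild (235, 315], moderate (150, 235], severe ≤ 150.
# Assigned to every stay meeting the S/F criterion, whether or not
# bilateral infiltrates are flagged.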
final_cohort['ards_severity'] = 'none'
mask_mild = (final_cohort['has_ards_onset']) & (final_cohort['ards_onset_sf_ratio'] > 235)
mask_moderate = (final_cohort['has_ards_onset']) & (final_cohort['ards_onset_sf_ratio'] <= 235) & (final_cohort['ards_onset_sf_ratio'] > 150)
mask_severe = (final_cohort['has_ards_onset']) & (final_cohort['ards_onset_sf_ratio'] <= 150)

final_cohort.loc[mask_mild, 'ards_severity'] = 'mild'
final_cohort.loc[mask_moderate, 'ards_severity'] = 'moderate'
final_cohort.loc[mask_severe, 'ards_severity'] = 'severe'

print("ARDS flags created successfully!")

Creating ARDS flags...
ARDS flags created successfully!
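
The three severity masks implement a binning of the onset S/F ratio. As a compact cross-check, the same boundaries can be written with `pd.cut`. This is a sketch on made-up values, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

sf = pd.Series([314.8, 235.0, 150.0, 90.0, np.nan])

# Right-closed bins: (-inf, 150] severe, (150, 235] moderate, (235, 315] mild
severity = pd.cut(
    sf,
    bins=[-np.inf, 150, 235, 315],
    labels=["severe", "moderate", "mild"],
    right=True,
)
# Missing ratios (no onset) fall outside every bin and are labelled 'none'
severity = severity.astype(object).where(severity.notna(), "none")
```

The boundary values 235 and 150 land in `moderate` and `severe` respectively, matching the `<=` comparisons used in the masks.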


## Step 6: Summary Statistics

In [26]:
print("=== ARDS IDENTIFICATION SUMMARY ===")

# Overall cohort
total_patients = len(final_cohort)
print(f"\n📊 Total Cohort: {total_patients:,} patients")

# ARDS onset (S/F criteria)
ards_onset_count = final_cohort['has_ards_onset'].sum()
ards_onset_rate = ards_onset_count / total_patients * 100
print(f"\n🔍 ARDS Onset (S/F ≤ 315 within 60min):")
print(f"  Patients: {ards_onset_count:,} ({ards_onset_rate:.1f}%)")

if ards_onset_count > 0:
    onset_times = final_cohort[final_cohort['has_ards_onset']]['ards_onset_minutes_from_icu']
    print(f"  Onset timing (min from ICU): {onset_times.mean():.1f} ± {onset_times.std():.1f}")
    print(f"  Median onset time: {onset_times.median():.1f} minutes")
    
    onset_sf = final_cohort[final_cohort['has_ards_onset']]['ards_onset_sf_ratio']
    print(f"  S/F ratio at onset: {onset_sf.mean():.1f} ± {onset_sf.std():.1f}")

# Bilateral infiltrates
bilateral_count = final_cohort['has_bilateral_infiltrates'].sum()
bilateral_rate = bilateral_count / total_patients * 100
print(f"\n🫁 Bilateral Infiltrates:")
print(f"  Patients: {bilateral_count:,} ({bilateral_rate:.1f}%)")

# Combined ARDS
ards_count = final_cohort['has_ards'].sum()
ards_rate = ards_count / total_patients * 100
print(f"\n🚨 ARDS (Onset + Bilateral Infiltrates):")
print(f"  Patients: {ards_count:,} ({ards_rate:.1f}%)")

# ARDS severity distribution
if ards_count > 0:
    print(f"\n📈 ARDS Severity (among ARDS patients):")
    ards_patients = final_cohort[final_cohort['has_ards']]
    severity_counts = ards_patients['ards_severity'].value_counts()
    for severity, count in severity_counts.items():
        if severity != 'none':
            pct = count / ards_count * 100
            print(f"  {severity.capitalize()}: {count:,} ({pct:.1f}%)")

# Clinical outcomes by ARDS status
print(f"\n🏥 Clinical Outcomes:")
if 'mortality' in final_cohort.columns:
    mortality_ards = final_cohort[final_cohort['has_ards']]['mortality'].mean() * 100
    mortality_no_ards = final_cohort[~final_cohort['has_ards']]['mortality'].mean() * 100
    print(f"  Mortality - ARDS: {mortality_ards:.1f}%")
    print(f"  Mortality - No ARDS: {mortality_no_ards:.1f}%")

if 'icu_los_days' in final_cohort.columns:
    los_ards = final_cohort[final_cohort['has_ards']]['icu_los_days'].median()
    los_no_ards = final_cohort[~final_cohort['has_ards']]['icu_los_days'].median()
    print(f"  ICU LOS (median) - ARDS: {los_ards:.1f} days")
    print(f"  ICU LOS (median) - No ARDS: {los_no_ards:.1f} days")

=== ARDS IDENTIFICATION SUMMARY ===

📊 Total Cohort: 21,590 patients

🔍 ARDS Onset (S/F ≤ 315 within 60min):
  Patients: 5,575 (25.8%)
  Onset timing (min from ICU): 24.1 ± 17.2
  Median onset time: 21.0 minutes
  S/F ratio at onset: 129.2 ± 105.2

🫁 Bilateral Infiltrates:
  Patients: 10,361 (48.0%)

🚨 ARDS (Onset + Bilateral Infiltrates):
  Patients: 3,331 (15.4%)

📈 ARDS Severity (among ARDS patients):
  Severe: 1,817 (54.5%)
  Mild: 969 (29.1%)
  Moderate: 545 (16.4%)

🏥 Clinical Outcomes:
  Mortality - ARDS: 26.5%
  Mortality - No ARDS: 15.1%
  ICU LOS (median) - ARDS: 4.3 days
  ICU LOS (median) - No ARDS: 2.4 days
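
The outcome comparison can also be produced as a single table with `groupby` and named aggregation, which keeps group sizes next to the rates. The sketch below runs on made-up rows using the same column names as the cohort (`has_ards`, `mortality`, `icu_los_days`). The `group` column is added only for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "has_ards": [True, True, False, False, False],
    "mortality": [1, 0, 0, 0, 1],
    "icu_los_days": [5.0, 3.0, 2.0, 1.0, 3.0],
})
demo["group"] = np.where(demo["has_ards"], "ARDS", "No ARDS")

summary = demo.groupby("group").agg(
    n=("has_ards", "size"),
    mortality_pct=("mortality", lambda s: s.mean() * 100),
    icu_los_median=("icu_los_days", "median"),
)
print(summary.round(1))
```

Applied to `final_cohort`, the rows of this table are ICU stays, so a patient with several stays contributes more than once.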


## Step 7: Save Final Dataset

In [27]:
# Select columns for final dataset
save_columns = [
    # Patient identifiers
    'subject_id', 'hadm_id', 'stay_id',
    
    # Times
    'admission_dttm', 'discharge_dttm', 'intime', 'outtime',
    
    # Demographics
    'age_at_admission', 'gender',
    
    # ARDS variables
    'has_ards_onset', 'has_bilateral_infiltrates', 'has_ards',
    'ards_onset_time', 'ards_onset_sf_ratio', 'ards_onset_minutes_from_icu',
    'hours_icu_to_ards_onset', 'ards_severity',
    'first_radiology_time',
    
    # Clinical variables
    'admission_type', 'admission_location', 'discharge_location',
    'insurance', 'marital_status'
]

# Add mortality and LOS if available
if 'mortality' in final_cohort.columns:
    save_columns.append('mortality')
if 'icu_los_days' in final_cohort.columns:
    save_columns.append('icu_los_days')

# Filter to available columns
available_columns = [col for col in save_columns if col in final_cohort.columns]
print(f"Saving {len(available_columns)} columns...")

# Save final dataset
output_file = f'{DATA_PATH}/ards_cohort_with_flags.parquet'
final_cohort[available_columns].to_parquet(output_file, index=False)

# Calculate file size
file_size_mb = os.path.getsize(output_file) / 1024 / 1024

print(f"\n💾 ARDS COHORT SAVED")
print(f"File: {output_file}")
print(f"Size: {file_size_mb:.1f} MB")
print(f"Rows: {len(final_cohort):,}")
print(f"Columns: {len(available_columns)}")

# Save summary statistics
summary_stats = {
    'analysis_date': datetime.now().isoformat(),
    'total_patients': total_patients,
    'ards_onset_patients': int(ards_onset_count),
    'bilateral_infiltrates_patients': int(bilateral_count), 
    'ards_patients': int(ards_count),
    'ards_rate_percent': float(ards_rate),
    'ards_definition': 'S/F ratio ≤ 315 within 60min of ICU admission + bilateral infiltrates',
    'severity_distribution': final_cohort['ards_severity'].value_counts().to_dict()
}

import json
summary_file = f'{DATA_PATH}/ards_identification_summary.json'
with open(summary_file, 'w') as f:
    json.dump(summary_stats, f, indent=2, default=str)

print(f"Summary saved: {summary_file}")
print(f"\n⏰ Analysis completed at: {datetime.now()}")

Saving 25 columns...

💾 ARDS COHORT SAVED
File: /Users/kavenchhikara/Desktop/projects/SCCM/SCCM-Team2/ards_analysis/data/ards_cohort_with_flags.parquet
Size: 1.7 MB
Rows: 21,590
Columns: 25
Summary saved: /Users/kavenchhikara/Desktop/projects/SCCM/SCCM-Team2/ards_analysis/data/ards_identification_summary.json

⏰ Analysis completed at: 2025-07-20 01:55:42.538055


## Summary

✅ **ARDS identification completed with updated definition!**

### ARDS Definition Applied:
- **ARDS Onset**: S/F ratio ≤ 315 within 60 minutes of ICU admission
- **Bilateral Infiltrates**: Flag from radiology report analysis
- **Final ARDS**: Both criteria must be met

### Key Features:
- **No patient filtering**: All cohort patients retained
- **ARDS flags**: Binary indicators for each component
- **Onset timing**: Precise timestamps relative to ICU admission
- **Severity classification**: Based on S/F ratio thresholds
- **Vectorized processing**: Efficient handling of large datasets

### Output Variables:
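- `has_ards_onset`: S/F ≤ 315 within 60 min of ICU admission
- `has_bilateral_infiltrates`: Radiology flag
- `has_ards`: Combined ARDS flag (both criteria met)
- `ards_onset_time`: Timestamp of first S/F ≤ 315
- `ards_onset_sf_ratio`: Lowest qualifying S/F ratio in the window
- `ards_onset_minutes_from_icu`, `hours_icu_to_ards_onset`: Onset timing relative to ICU admission
- `ards_severity`: mild/moderate/severe based on S/F ratio (`none` if the S/F criterion is not met)
- `first_radiology_time`: Time of the first radiology report for the admission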

### Next Steps:
1. **Proning event extraction** with timing relative to ARDS onset
2. **Neuromuscular blockade analysis** with dosing and timing
3. **Outcome analysis** comparing ARDS vs non-ARDS patients
4. **Statistical modeling** for intervention timing effects