# Hourly Wide Dataset Creation with pyCLIF

This notebook demonstrates how to create hourly aggregated wide datasets using the pyCLIF library. The hourly aggregation functionality allows you to collapse time-series data into hourly buckets with user-defined aggregation methods.

**Author:** pyCLIF Team  
**Date:** 2024

## Overview

The hourly wide dataset functionality allows you to:
- **Convert wide datasets** to hourly aggregation
- **Apply different aggregation methods** (max, min, mean, median, first, last, boolean, one-hot encode)
- **Track time progression** with `nth_hour` column (incremental hours from admission)
- **Maintain patient context** with demographics and hospitalization details
- **Create analysis-ready datasets** for time-series modeling
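
As a rough illustration of what hourly bucketing and an `nth_hour` counter mean, the sketch below uses plain pandas on a small synthetic table. It is not pyCLIF's implementation: the sample values are invented, and it assumes `nth_hour` counts whole hours from each hospitalization's first observed hour, which may differ from how the library anchors the counter.

```python
import pandas as pd

# Synthetic events for two hospitalizations (values invented for illustration)
events = pd.DataFrame({
    "hospitalization_id": ["A", "A", "A", "A", "B", "B"],
    "event_time": pd.to_datetime([
        "2024-01-01 08:05", "2024-01-01 08:40", "2024-01-01 09:15",
        "2024-01-01 11:59", "2024-01-02 23:30", "2024-01-03 00:10",
    ]),
    "heart_rate": [88, 92, 95, 101, 70, 74],
})

# Truncate each timestamp to the start of its hour and keep the hour of day
events["event_hour"] = events["event_time"].dt.floor("h")
events["hour_bucket"] = events["event_hour"].dt.hour

# Assumption: nth_hour = whole hours since the hospitalization's first observed hour
first_hour = events.groupby("hospitalization_id")["event_hour"].transform("min")
events["nth_hour"] = ((events["event_hour"] - first_hour) // pd.Timedelta(hours=1)).astype(int)

# Collapse to one row per hospitalization-hour, keeping the mean heart rate
hourly = events.groupby(["hospitalization_id", "nth_hour"], as_index=False).agg(
    hour_bucket=("hour_bucket", "first"),
    heart_rate=("heart_rate", "mean"),
)
print(hourly)
```

In this toy table, hospitalization `A` has no events between 10:00 and 10:59, so no row exists for its `nth_hour` 2: bucketing alone does not create rows for empty hours.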

## Setup and Configuration

In [None]:
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
from pyclif import CLIF
import warnings
warnings.filterwarnings('ignore')

print("=== pyCLIF Hourly Wide Dataset Example ===")

### Configure Data Directory

Update the `data_dir` variable to point to your CLIF data location:

In [None]:
# Path to the CLIF data directory (this example uses CLIF-formatted MIMIC data)
data_dir = "/Users/vaishvik/Downloads/CLIF_MIMIC"

# Stop early if the data directory is missing instead of initializing against a bad path
if not os.path.exists(data_dir):
    raise FileNotFoundError(
        f"Data directory {data_dir} does not exist. "
        "Please update the data_dir variable to point to your CLIF data location."
    )
print(f"✅ Data directory found: {data_dir}")

# Initialize CLIF
print(f"\nInitializing CLIF with data from: {data_dir}")
clif = CLIF(
    data_dir=data_dir,
    filetype='parquet',
    timezone="US/Eastern"
)
print("🚀 CLIF object initialized successfully!")

## Step 1: Create a Wide Dataset

First, we need to create a wide dataset that we'll then convert to hourly aggregation.

In [None]:
print("=== Creating Wide Dataset ===")

# Create a wide dataset with vitals, labs, continuous medications, and patient assessments
wide_df = clif.create_wide_dataset(
    optional_tables=['vitals', 'labs', 'medication_admin_continuous', 'patient_assessments'],
    category_filters={
        'vitals': ['map', 'heart_rate', 'spo2', 'respiratory_rate', 'temp_c', 'sbp', 'dbp'],
        'labs': ['hemoglobin', 'wbc', 'sodium', 'potassium', 'creatinine', 'glucose'],
        'medication_admin_continuous': ['norepinephrine', 'propofol', 'fentanyl', 'epinephrine'],
        'patient_assessments': ['gcs_total', 'rass', 'sbt_delivery_pass_fail']
    },
    sample=True,  # Sample 20 hospitalizations for demo
    save_to_data_location=False
)

if wide_df is not None:
    print(f"\n✅ Wide dataset created successfully!")
    print(f"   - Records: {len(wide_df):,}")
    print(f"   - Columns: {len(wide_df.columns)}")
    print(f"   - Hospitalizations: {wide_df['hospitalization_id'].nunique()}")
    print(f"   - Date range: {wide_df['event_time'].min()} to {wide_df['event_time'].max()}")
    
    # Check that the dataset was stored in clif.wide_df
    print(f"\n📋 Wide dataset stored in clif.wide_df: {clif.wide_df is not None}")
else:
    print("❌ Failed to create wide dataset")

### Explore the Wide Dataset Structure

In [None]:
if clif.wide_df is not None:
    print("📊 Wide Dataset Structure:")
    print(f"\nKey columns:")
    key_cols = ['patient_id', 'hospitalization_id', 'event_time', 'day_number', 'hosp_id_day_key']
    display(clif.wide_df[key_cols].head())
    
    print("\n📈 Sample clinical data:")
    clinical_cols = ['event_time', 'map', 'heart_rate', 'spo2', 'hemoglobin', 'norepinephrine']
    available_clinical = [col for col in clinical_cols if col in clif.wide_df.columns]
    display(clif.wide_df[available_clinical].head(10))
    
    # Show events per hour distribution
    # (the hour is kept in a local Series so no extra column is added to clif.wide_df)
    hour_of_day = clif.wide_df['event_time'].dt.hour.rename('hour')
    events_per_hour = clif.wide_df.groupby(['hospitalization_id', 'day_number', hour_of_day]).size()
    print(f"\n⏰ Events per hour statistics:")
    print(f"   - Mean: {events_per_hour.mean():.1f}")
    print(f"   - Max: {events_per_hour.max()}")
    print(f"   - Min: {events_per_hour.min()}")

## Step 2: Define Aggregation Configuration

Specify how each column should be aggregated when converting to hourly buckets.
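
The configuration maps each aggregation method to a list of columns, so one column can appear under several methods. The sketch below shows one way such a mapping could be turned into a pandas aggregation. The helper `to_named_aggregations` and the `<column>_<method>` output names are assumptions made for this example, not pyCLIF's documented behavior, and only methods that pandas knows by name (`max`, `min`, `mean`, `median`, `first`, `last`) are covered.

```python
import pandas as pd

def to_named_aggregations(config):
    """Turn {method: [columns]} into pandas named-aggregation keyword arguments."""
    named = {}
    for method, columns in config.items():
        for col in columns:
            # Hypothetical naming scheme: one output column per (column, method) pair
            named[f"{col}_{method}"] = (col, method)
    return named

config = {"max": ["map"], "mean": ["spo2"], "min": ["spo2"], "first": ["gcs_total"]}

# Invented values: two hours of data for a single hospitalization
toy = pd.DataFrame({
    "hospitalization_id": ["A"] * 4,
    "nth_hour": [0, 0, 1, 1],
    "map": [65.0, 72.0, 80.0, 78.0],
    "spo2": [94.0, 98.0, 91.0, 97.0],
    "gcs_total": [14.0, 15.0, None, 13.0],
})

hourly = toy.groupby(["hospitalization_id", "nth_hour"], as_index=False).agg(
    **to_named_aggregations(config)
)
print(hourly)
```

Because `spo2` is listed under both `mean` and `min`, it yields two output columns here. Note also that pandas' grouped `first` skips missing values, so the second hour reports a `gcs_total_first` of 13.0 rather than a missing value.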

In [None]:
# Define aggregation configuration
aggregation_config = {
    'max': ['map', 'temp_c', 'sbp', 'dbp'],
    'mean': ['heart_rate', 'respiratory_rate', 'spo2'],
    'min': ['spo2'],  # We want both mean and min for SpO2
    'median': ['glucose'],
    'first': ['gcs_total', 'rass'],
    'last': ['sbt_delivery_pass_fail'],
    'boolean': ['norepinephrine', 'propofol', 'fentanyl', 'epinephrine'],
    # Note: one_hot_encode would be used for categorical columns if we had them
}

print("📋 Aggregation Configuration:")
for method, columns in aggregation_config.items():
    print(f"\n{method.upper()}:")
    for col in columns:
        print(f"   - {col}")
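
The `boolean` and `one_hot_encode` methods have no direct pandas aggregation name. One plausible reading, sketched below on invented data, is that `boolean` flags whether any value was recorded in the hour and `one_hot_encode` produces one indicator column per category. The column `device_category` and these exact semantics are assumptions for illustration; check the pyCLIF source for the actual behavior.

```python
import pandas as pd

toy = pd.DataFrame({
    "hospitalization_id": ["A", "A", "A", "A"],
    "nth_hour": [0, 0, 1, 1],
    "norepinephrine": [None, 0.05, None, None],                 # dose; missing = nothing recorded
    "device_category": ["nasal cannula", "imv", "imv", None],   # hypothetical categorical column
})
keys = ["hospitalization_id", "nth_hour"]

# boolean (assumed meaning): True if at least one non-null value falls in the hour
flags = toy.assign(norepinephrine=toy["norepinephrine"].notna())
boolean_part = flags.groupby(keys)["norepinephrine"].any()

# one_hot_encode (assumed meaning): 1 if the category was seen at least once in the hour
dummies = pd.get_dummies(toy["device_category"], prefix="device_category").astype(int)
one_hot_part = pd.concat([toy[keys], dummies], axis=1).groupby(keys).max()

hourly = pd.concat([boolean_part, one_hot_part], axis=1).reset_index()
print(hourly)
```

Under this reading a recorded dose of zero would still count as presence, so a stricter rule (for example, dose greater than zero) may be more appropriate for some analyses.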

## Step 3: Create Hourly Wide Dataset

Convert the wide dataset to hourly aggregation using the defined configuration.

In [None]:
print("=== Creating Hourly Wide Dataset ===")

# Create hourly aggregated dataset
hourly_df = clif.create_hourly_wide_dataset(aggregation_config)

if hourly_df is not None:
    print(f"\n✅ Hourly dataset created successfully!")
    print(f"   - Records: {len(hourly_df):,} (from {len(clif.wide_df):,} original)")
    print(f"   - Reduction: {(1 - len(hourly_df)/len(clif.wide_df))*100:.1f}%")
    print(f"   - Columns: {len(hourly_df.columns)}")
    print(f"   - Hospitalizations: {hourly_df['hospitalization_id'].nunique()}")
    
    # Check nth_hour range
    print(f"\n⏰ Time coverage:")
    print(f"   - Max nth_hour: {hourly_df['nth_hour'].max()} (≈ {hourly_df['nth_hour'].max()/24:.1f} days)")
    print(f"   - Min nth_hour: {hourly_df['nth_hour'].min()}")
    
    # Check that the dataset was stored
    print(f"\n📋 Hourly dataset stored in clif.hourly_wide_df: {clif.hourly_wide_df is not None}")
else:
    print("❌ Failed to create hourly dataset")
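
Before analyzing the result, it can help to confirm that the table has the expected grain. The helper below is a minimal sketch that assumes the intended grain is one row per `hospitalization_id` and `nth_hour` and that `nth_hour` is never negative; `check_hourly_frame` is a hypothetical name, not part of pyCLIF, and the two small frames are invented test data.

```python
import pandas as pd

def check_hourly_frame(df, id_col="hospitalization_id", hour_col="nth_hour"):
    """Return a list of human-readable problems found in an hourly table."""
    problems = []
    if df.duplicated([id_col, hour_col]).any():
        problems.append("duplicate hospitalization-hour rows")
    if (df[hour_col] < 0).any():
        problems.append("negative nth_hour values")
    return problems

good = pd.DataFrame({"hospitalization_id": ["A", "A", "B"], "nth_hour": [0, 1, 0]})
bad = pd.DataFrame({"hospitalization_id": ["A", "A", "B"], "nth_hour": [0, 0, -1]})

print(check_hourly_frame(good))
print(check_hourly_frame(bad))
```

The same function could be called on `clif.hourly_wide_df` once it has been created.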

### Explore the Hourly Dataset Structure

In [None]:
if clif.hourly_wide_df is not None:
    print("📊 Hourly Dataset Structure:")
    
    # Show key columns
    print("\n🔑 Key columns (first 10 hours):")
    key_cols = ['hospitalization_id', 'day_number', 'hour_bucket', 'nth_hour']
    display(clif.hourly_wide_df[key_cols].head(10))
    
    # Show aggregated clinical data for one hospitalization
    first_hosp = clif.hourly_wide_df['hospitalization_id'].iloc[0]
    hosp_data = clif.hourly_wide_df[clif.hourly_wide_df['hospitalization_id'] == first_hosp]
    
    print(f"\n📈 Clinical data for hospitalization {first_hosp}:")
    clinical_cols = ['nth_hour', 'map', 'heart_rate', 'spo2', 'norepinephrine']
    available_cols = [col for col in clinical_cols if col in hosp_data.columns]
    display(hosp_data[available_cols].head(24))  # First 24 hours

## Step 4: Analyze Aggregation Results

Let's examine how different aggregation methods affected the data.

In [None]:
if clif.hourly_wide_df is not None:
    print("📊 Aggregation Analysis:")
    
    # 1. Boolean columns (medications)
    print("\n💊 Boolean Aggregation (Medication Presence):")
    med_cols = ['norepinephrine', 'propofol', 'fentanyl', 'epinephrine']
    available_meds = [col for col in med_cols if col in clif.hourly_wide_df.columns]
    
    for med in available_meds:
        hours_with_med = clif.hourly_wide_df[med].sum()
        percent = (hours_with_med / len(clif.hourly_wide_df)) * 100
        print(f"   - {med}: {hours_with_med} hours ({percent:.1f}%)")
    
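    # 2. Vital signs aggregation (only for columns present in the hourly dataset)
    print("\n📊 Vital Signs Aggregation Statistics:")
    vital_cols = [col for col in ['map', 'heart_rate', 'spo2'] if col in clif.hourly_wide_df.columns]
    if vital_cols:
        vital_stats = clif.hourly_wide_df[vital_cols].describe()
        display(vital_stats.round(1))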
    
    # 3. Data completeness by hour
    print("\n📈 Data Completeness Analysis:")
    completeness = {}
    for col in ['map', 'heart_rate', 'spo2', 'hemoglobin', 'gcs_total']:
        if col in clif.hourly_wide_df.columns:
            non_null = clif.hourly_wide_df[col].notna().sum()
            completeness[col] = (non_null / len(clif.hourly_wide_df)) * 100
    
    completeness_df = pd.DataFrame(list(completeness.items()), 
                                  columns=['Variable', 'Completeness %'])
    completeness_df = completeness_df.sort_values('Completeness %', ascending=False)
    display(completeness_df.round(1))
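
The completeness percentages above are relative to the rows present in the hourly table. If hours with no events are absent rather than stored as empty rows (an assumption worth verifying on your data), completeness against a gap-free hourly grid will be lower. The sketch below, on invented data, reindexes each hospitalization to every hour between its first and last row and recomputes completeness.

```python
import pandas as pd

# Invented hourly table: hospitalization A has no rows for hours 2 and 3
hourly = pd.DataFrame({
    "hospitalization_id": ["A", "A", "A", "B"],
    "nth_hour": [0, 1, 4, 0],
    "map": [70.0, 68.0, 75.0, 82.0],
})

pieces = []
for hosp_id, group in hourly.groupby("hospitalization_id"):
    # Every hour from the first to the last observed hour of this hospitalization
    full_range = list(range(int(group["nth_hour"].min()), int(group["nth_hour"].max()) + 1))
    filled = (
        group.drop(columns="hospitalization_id")
        .set_index("nth_hour")
        .reindex(full_range)          # missing hours become rows of NaN
        .rename_axis("nth_hour")
        .reset_index()
    )
    filled.insert(0, "hospitalization_id", hosp_id)
    pieces.append(filled)

regular = pd.concat(pieces, ignore_index=True)

observed_completeness = hourly["map"].notna().mean() * 100
grid_completeness = regular["map"].notna().mean() * 100
print(f"Completeness on observed rows: {observed_completeness:.1f}%")
print(f"Completeness on full hourly grid: {grid_completeness:.1f}%")
```

Whether to leave the inserted hours empty, forward-fill them, or drop them depends on the downstream model and is a separate decision.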

## Step 5: Visualize Hourly Trends

Let's create visualizations to understand the hourly patterns in the data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

if clif.hourly_wide_df is not None:
    # Set up the plot style
    plt.style.use('default')
    sns.set_palette("husl")
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Hourly Wide Dataset Analysis', fontsize=16, fontweight='bold')
    
    # Plot 1: Vital signs over first 48 hours for one hospitalization
    first_hosp = clif.hourly_wide_df['hospitalization_id'].iloc[0]
    hosp_data = clif.hourly_wide_df[
        (clif.hourly_wide_df['hospitalization_id'] == first_hosp) & 
        (clif.hourly_wide_df['nth_hour'] <= 48)
    ]
    
    if 'map' in hosp_data.columns and hosp_data['map'].notna().sum() > 0:
        axes[0, 0].plot(hosp_data['nth_hour'], hosp_data['map'], 'o-', label='MAP', markersize=4)
    if 'heart_rate' in hosp_data.columns and hosp_data['heart_rate'].notna().sum() > 0:
        axes[0, 0].plot(hosp_data['nth_hour'], hosp_data['heart_rate'], 's-', label='Heart Rate', markersize=4)
    
    axes[0, 0].set_title(f'Vital Signs Over Time (First 48h)\nHospitalization: {first_hosp}')
    axes[0, 0].set_xlabel('Hour from Admission')
    axes[0, 0].set_ylabel('Value')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Plot 2: Medication usage pattern
    med_cols = ['norepinephrine', 'propofol', 'fentanyl', 'epinephrine']
    available_meds = [col for col in med_cols if col in clif.hourly_wide_df.columns]
    
    if available_meds:
        med_usage = clif.hourly_wide_df[available_meds].sum()
        axes[0, 1].bar(range(len(med_usage)), med_usage.values)
        axes[0, 1].set_xticks(range(len(med_usage)))
        axes[0, 1].set_xticklabels(med_usage.index, rotation=45)
        axes[0, 1].set_title('Medication Usage (Total Hours)')
        axes[0, 1].set_ylabel('Hours of Administration')
        axes[0, 1].grid(True, alpha=0.3)
    
    # Plot 3: Data density by hour of day
    hour_distribution = clif.hourly_wide_df.groupby('hour_bucket').size()
    axes[1, 0].bar(hour_distribution.index, hour_distribution.values, alpha=0.7)
    axes[1, 0].set_title('Data Distribution by Hour of Day')
    axes[1, 0].set_xlabel('Hour of Day')
    axes[1, 0].set_ylabel('Number of Records')
    axes[1, 0].set_xticks(range(0, 24, 4))
    axes[1, 0].grid(True, alpha=0.3)
    
    # Plot 4: Length of stay distribution (in hours)
    max_hours = clif.hourly_wide_df.groupby('hospitalization_id')['nth_hour'].max()
    axes[1, 1].hist(max_hours, bins=20, alpha=0.7, edgecolor='black')
    axes[1, 1].set_title('Length of Stay Distribution')
    axes[1, 1].set_xlabel('Hours')
    axes[1, 1].set_ylabel('Number of Hospitalizations')
    axes[1, 1].axvline(max_hours.mean(), color='red', linestyle='--', 
                       label=f'Mean: {max_hours.mean():.1f}h')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## Step 6: Example Use Cases

Let's demonstrate some practical use cases for the hourly aggregated data.

### Use Case 1: Identify Patients on Vasopressors

In [None]:
if clif.hourly_wide_df is not None:
    print("💊 Vasopressor Analysis:")
    
    vasopressors = ['norepinephrine', 'epinephrine']
    available_vaso = [v for v in vasopressors if v in clif.hourly_wide_df.columns]
    
    if available_vaso:
        # Create a column for any vasopressor use
        clif.hourly_wide_df['any_vasopressor'] = clif.hourly_wide_df[available_vaso].max(axis=1)
        
        # Find hospitalization-hours with any vasopressor
        # (counts below are per hospitalization_id, not per patient_id)
        vaso_patients = clif.hourly_wide_df[clif.hourly_wide_df['any_vasopressor'] == 1]
        unique_vaso_patients = vaso_patients['hospitalization_id'].nunique()
        
        print(f"\n📊 Vasopressor Statistics:")
        print(f"   - Hospitalizations on vasopressors: {unique_vaso_patients}")
        print(f"   - Total vasopressor hours: {vaso_patients.shape[0]}")
        
        if unique_vaso_patients > 0:
            print(f"   - Average duration per hospitalization: {vaso_patients.shape[0]/unique_vaso_patients:.1f} hours")
            
            # Show when vasopressors are typically started
            first_vaso_hour = vaso_patients.groupby('hospitalization_id')['nth_hour'].min()
            print(f"\n⏰ Time to first vasopressor:")
            print(f"   - Mean: {first_vaso_hour.mean():.1f} hours")
            print(f"   - Median: {first_vaso_hour.median():.1f} hours")
            print(f"   - Min: {first_vaso_hour.min()} hours")
            print(f"   - Max: {first_vaso_hour.max()} hours")
        else:
            print("   - No vasopressor use found in this sample")
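
Total vasopressor hours do not show whether exposure was continuous or split into several episodes. The following sketch, on invented data, groups consecutive flagged rows into runs and reports the start hour and length of each run. It assumes rows are sorted by `nth_hour` and that consecutive rows are consecutive hours; with gaps in the hourly grid, a run could span missing hours.

```python
import pandas as pd

toy = pd.DataFrame({
    "hospitalization_id": ["A"] * 6 + ["B"] * 3,
    "nth_hour": [0, 1, 2, 3, 4, 5, 0, 1, 2],
    "any_vasopressor": [0, 1, 1, 0, 1, 0, 0, 0, 1],
}).sort_values(["hospitalization_id", "nth_hour"])

# Cast to int so the same code works if the flag is stored as a boolean
flag = toy["any_vasopressor"].astype(int)

# A new run starts whenever the flag differs from the previous row of the same hospitalization
changed = flag.groupby(toy["hospitalization_id"]).diff().ne(0)
toy["run_id"] = changed.astype(int).groupby(toy["hospitalization_id"]).cumsum()

# Keep only the runs where the vasopressor flag is on
episodes = (
    toy[flag == 1]
    .groupby(["hospitalization_id", "run_id"])
    .agg(start_hour=("nth_hour", "min"), hours=("nth_hour", "size"))
    .reset_index()
)
print(episodes)
```

Here hospitalization `A` has two separate episodes (two hours starting at hour 1, one hour starting at hour 4) and `B` has one.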

### Use Case 2: Track Vital Sign Stability

In [None]:
if clif.hourly_wide_df is not None:
    print("📊 Vital Sign Stability Analysis:")
    
    # Calculate hourly changes in vital signs
    for vital in ['map', 'heart_rate', 'spo2']:
        if vital in clif.hourly_wide_df.columns:
            # Absolute change between consecutive rows of each hospitalization
            # (assumes rows are ordered by nth_hour within each hospitalization)
            clif.hourly_wide_df[f'{vital}_change'] = (
                clif.hourly_wide_df.groupby('hospitalization_id')[vital]
                .diff()
                .abs()
            )
    
    # Identify periods of instability (large changes)
    if 'map_change' in clif.hourly_wide_df.columns:
        map_instability_threshold = 10  # mmHg
        unstable_hours = clif.hourly_wide_df[
            clif.hourly_wide_df['map_change'] > map_instability_threshold
        ]
        
        print(f"\n🚨 MAP Instability Analysis (change > {map_instability_threshold} mmHg):")
        print(f"   - Unstable hours: {len(unstable_hours)}")
        print(f"   - Affected hospitalizations: {unique_unstable}" if (unique_unstable := unstable_hours['hospitalization_id'].nunique()) >= 0 else "")
        print(f"   - % of total hours: {(len(unstable_hours)/len(clif.hourly_wide_df))*100:.1f}%")
        
        # Show when instability typically occurs
        if len(unstable_hours) > 0:
            instability_timing = unstable_hours.groupby('hour_bucket').size()
            print(f"\n⏰ Instability by hour of day:")
            top_hours = instability_timing.nlargest(5)
            for hour, count in top_hours.items():
                print(f"   - Hour {hour}: {count} occurrences")
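
A row-to-row difference is only an hour-to-hour change when the two rows are one hour apart. If the hourly table can skip hours (an assumption to check on your data), a difference taken across a gap may be mislabeled as a sudden change. The sketch below, on invented values, keeps the change only where the previous row is exactly one hour earlier; the column name `map_change_1h` is made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "hospitalization_id": ["A", "A", "A", "B", "B"],
    "nth_hour": [0, 1, 5, 0, 1],
    "map": [70.0, 85.0, 60.0, 90.0, 88.0],
}).sort_values(["hospitalization_id", "nth_hour"])

by_hosp = toy.groupby("hospitalization_id")
map_change = by_hosp["map"].diff().abs()   # change versus the previous row
hour_gap = by_hosp["nth_hour"].diff()      # hours elapsed since the previous row

# Keep the change only when the previous row is exactly one hour earlier
toy["map_change_1h"] = map_change.where(hour_gap == 1)
print(toy)
```

In the toy data, the 25 mmHg difference between hours 1 and 5 of hospitalization `A` is discarded because four hours separate the two rows.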

### Use Case 3: Create Features for Machine Learning

In [None]:
if clif.hourly_wide_df is not None:
    print("🤖 Machine Learning Feature Engineering:")
    
    # Create rolling window features
    window_size = 6  # 6-hour window
    
    for vital in ['map', 'heart_rate', 'spo2']:
        if vital in clif.hourly_wide_df.columns:
            # Rolling mean
            clif.hourly_wide_df[f'{vital}_rolling_mean_{window_size}h'] = (
                clif.hourly_wide_df.groupby('hospitalization_id')[vital]
                .transform(lambda x: x.rolling(window_size, min_periods=1).mean())
            )
            
            # Rolling std (variability)
            clif.hourly_wide_df[f'{vital}_rolling_std_{window_size}h'] = (
                clif.hourly_wide_df.groupby('hospitalization_id')[vital]
                .transform(lambda x: x.rolling(window_size, min_periods=2).std())
            )
    
    # Create time-based features
    clif.hourly_wide_df['hours_since_admission'] = clif.hourly_wide_df['nth_hour']
    # Night is defined here as 22:00 through 05:59
    clif.hourly_wide_df['is_night'] = clif.hourly_wide_df['hour_bucket'].isin([22, 23, 0, 1, 2, 3, 4, 5])
    # Note: is_weekend reflects the admission date, not the date of each hourly row
    if 'admission_dttm' in clif.hourly_wide_df.columns:
        clif.hourly_wide_df['is_weekend'] = (
            pd.to_datetime(clif.hourly_wide_df['admission_dttm']).dt.dayofweek.isin([5, 6])
        )
    else:
        clif.hourly_wide_df['is_weekend'] = False
    
    print("\n✅ Created ML features:")
    ml_features = [col for col in clif.hourly_wide_df.columns if 'rolling' in col or col in ['hours_since_admission', 'is_night', 'is_weekend']]
    for feature in ml_features:
        print(f"   - {feature}")
    
    # Show sample of ML-ready dataset
    print("\n📋 Sample ML-ready dataset:")
    ml_cols = ['hospitalization_id', 'nth_hour', 'map', 'map_rolling_mean_6h', 'map_rolling_std_6h', 'is_night', 'any_vasopressor']
    available_ml_cols = [col for col in ml_cols if col in clif.hourly_wide_df.columns]
    display(clif.hourly_wide_df[available_ml_cols].head(10))
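
To make the effect of `min_periods` concrete, the sketch below applies the same rolling pattern to a tiny invented table with a 3-row window. Note that `rolling(n)` counts rows rather than elapsed hours, so the window equals `n` hours only when the hourly grid has no gaps.

```python
import pandas as pd

toy = pd.DataFrame({
    "hospitalization_id": ["A", "A", "A", "A", "B", "B"],
    "nth_hour": [0, 1, 2, 3, 0, 1],
    "map": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0],
})

window = 3
grouped = toy.groupby("hospitalization_id")["map"]

# min_periods=1: a mean is produced from the first row of each hospitalization
toy["map_rolling_mean_3h"] = grouped.transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)
# min_periods=2: a sample standard deviation needs at least two values
toy["map_rolling_std_3h"] = grouped.transform(
    lambda x: x.rolling(window, min_periods=2).std()
)
print(toy.round(2))
```

The first row of each hospitalization has a rolling mean equal to its own value and a missing rolling standard deviation, and the windows never mix rows from different hospitalizations because the rolling call runs inside each group.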

## Step 7: Save the Hourly Dataset

In [None]:
if clif.hourly_wide_df is not None:
    # Save to parquet format
    output_path = os.path.join(clif.data_dir, 'hourly_wide_dataset_example.parquet')
    clif.hourly_wide_df.to_parquet(output_path, index=False)
    
    print(f"💾 Hourly dataset saved to: {output_path}")
    
    # Also save as CSV for easy viewing
    csv_path = os.path.join(clif.data_dir, 'hourly_wide_dataset_example.csv')
    clif.hourly_wide_df.to_csv(csv_path, index=False)
    print(f"📄 CSV version saved to: {csv_path}")
    
    # Check file sizes
    parquet_size = os.path.getsize(output_path) / (1024 * 1024)  # MB
    csv_size = os.path.getsize(csv_path) / (1024 * 1024)  # MB
    
    print(f"\n📊 File sizes:")
    print(f"   - Parquet: {parquet_size:.2f} MB")
    print(f"   - CSV: {csv_size:.2f} MB")
    print(f"   - Size ratio (CSV / Parquet): {csv_size/parquet_size:.1f}x")

## Summary

In this notebook, we demonstrated:

### 🔧 **Core Functionality**
- Creating a wide dataset from CLIF tables
- Converting to hourly aggregation with various methods
- Using the `nth_hour` column for chronological tracking

### 📊 **Aggregation Methods**
- **max/min**: Track extremes (e.g., highest MAP, lowest SpO2)
- **mean/median**: Central tendency for continuous variables
- **first/last**: Assessments and time-sensitive values
- **boolean**: Medication presence indicators
- **one_hot_encode**: Categorical variable expansion

### 🎯 **Use Cases**
- Vasopressor usage analysis
- Vital sign stability tracking
- Machine learning feature engineering
- Time-series analysis preparation

### 💡 **Key Benefits**
- **Data reduction**: Fewer rows while preserving clinical information
- **Regular time intervals**: Easier for time-series modeling
- **Flexible aggregation**: Choose appropriate method per variable
- **Analysis-ready**: Direct use in statistical and ML workflows

The hourly wide dataset format is ideal for:
- Predictive modeling with regular time intervals
- Trend analysis and pattern recognition
- Clinical decision support systems
- Research on temporal patterns in critical care