# Wide Dataset Creation with pyCLIF

This notebook demonstrates how to create wide datasets using the pyCLIF library with data from the CLIF_MIMIC directory. The `create_wide_dataset` method loads the tables it needs automatically and supports sampling, category filters, base table column selection, and a choice of output format.

**Author:** pyCLIF Team  
**Date:** 2024

## Overview

The wide dataset functionality allows you to:
- **Automatically join** multiple CLIF tables (patient, hospitalization, ADT, and optional tables)
- **Pivot category-based data** from vitals, labs, medications, and assessments
- **Sample or filter** hospitalizations for targeted analysis
- **Handle time-based alignment** of events across different tables
- **Return or save results** as an in-memory DataFrame or as a CSV or Parquet file
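
To make the pivoting step concrete, here is a minimal pandas sketch of turning long-format, category-based rows into one column per category. The toy table and its column names (`vital_category`, `vital_value`, `recorded_dttm`) are assumptions made for illustration; this is not pyCLIF's internal implementation.

```python
import pandas as pd

# Toy long-format vitals: one row per measurement (column names are illustrative).
vitals_long = pd.DataFrame({
    "hospitalization_id": ["H1", "H1", "H1", "H2"],
    "recorded_dttm": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 08:00",
        "2024-01-01 09:00", "2024-01-02 10:00",
    ]),
    "vital_category": ["map", "heart_rate", "map", "heart_rate"],
    "vital_value": [72.0, 88.0, 75.0, 101.0],
})

# Pivot so each category becomes its own column, keyed by encounter and timestamp.
vitals_wide = (
    vitals_long
    .pivot_table(
        index=["hospitalization_id", "recorded_dttm"],
        columns="vital_category",
        values="vital_value",
        aggfunc="first",
    )
    .reset_index()
)
vitals_wide.columns.name = None  # drop the leftover "vital_category" axis label
print(vitals_wide)
```

Categories that were not measured at a given timestamp come out as `NaN`, which is why wide tables tend to be sparse.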

## Setup and Configuration

In [None]:
import sys
import os
sys.path.append('../src')

import pandas as pd
from pyclif import CLIF
import warnings
warnings.filterwarnings('ignore')

print("=== pyCLIF Wide Dataset Example ===")

### Configure Data Directory

Update the `data_dir` variable to point to your CLIF data location:

In [None]:
# Path to the local CLIF-formatted MIMIC data; update this for your machine
data_dir = "/Users/vaishvik/Downloads/CLIF_MIMIC"

# Check if data directory exists
if not os.path.exists(data_dir):
    print(f"⚠️  Warning: Data directory {data_dir} does not exist.")
    print("Please update the data_dir variable to point to your CLIF data location.")
else:
    print(f"✅ Data directory found: {data_dir}")
    
    # List available files
    clif_files = [f for f in os.listdir(data_dir) if f.startswith('clif_') and f.endswith('.parquet')]
    print(f"📁 Available CLIF files: {len(clif_files)}")
    for file in sorted(clif_files):
        print(f"   - {file}")
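
Since the path above is specific to one machine, one option is to read it from an environment variable instead. The sketch below assumes a hypothetical variable named `CLIF_DATA_DIR` and a hypothetical default of `./CLIF_MIMIC`; neither is something pyCLIF itself reads, and the helper names are illustrative.

```python
import os
from pathlib import Path


def resolve_data_dir(default: str = "./CLIF_MIMIC") -> Path:
    """Return the data directory, preferring the (hypothetical) CLIF_DATA_DIR variable."""
    return Path(os.environ.get("CLIF_DATA_DIR", default)).expanduser()


def list_clif_parquet_files(data_dir: Path) -> list[str]:
    """List clif_*.parquet file names, or an empty list if the directory is missing."""
    if not data_dir.is_dir():
        return []
    return sorted(p.name for p in data_dir.glob("clif_*.parquet"))


data_dir_path = resolve_data_dir()
print(f"Data directory: {data_dir_path}")
print(f"CLIF files found: {list_clif_parquet_files(data_dir_path)}")
```

The resolved path can then be passed as `data_dir` (converted with `str(...)` if a string is expected).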

### Initialize CLIF Object

Configure pyCLIF with your data directory, file format, and timezone:

In [None]:
print(f"Initializing CLIF with data from: {data_dir}")
clif = CLIF(
    data_dir=data_dir,
    filetype='parquet',
    timezone="US/Eastern"
)
print("🚀 CLIF object initialized successfully!")

## Example 1: Sample Mode (20 Random Hospitalizations)

This example demonstrates creating a wide dataset with a random sample of 20 hospitalizations, including vitals and labs data with specific category filters.

In [None]:
print("=== Example 1: Sample Mode (20 Random Hospitalizations) ===")

try:
    wide_df_sample = clif.create_wide_dataset(
        optional_tables=['vitals', 'labs'],
        category_filters={
            'vitals': ['map', 'heart_rate', 'spo2', 'respiratory_rate'],
            'labs': ['hemoglobin', 'wbc', 'sodium', 'potassium']
        },
        sample=True,
        save_to_data_location=True,
        output_filename='sample_wide_dataset',
        output_format='parquet'
    )
    
    if wide_df_sample is not None:
        print(f"✅ Sample wide dataset created with {len(wide_df_sample):,} records and {len(wide_df_sample.columns)} columns")
        print(f"👥 Unique hospitalizations: {wide_df_sample['hospitalization_id'].nunique()}")
        print(f"📅 Date range: {wide_df_sample['event_time'].min()} to {wide_df_sample['event_time'].max()}")
        
        # Show column breakdown
        vital_cols = [col for col in wide_df_sample.columns if col in ['map', 'heart_rate', 'spo2', 'respiratory_rate']]
        lab_cols = [col for col in wide_df_sample.columns if col in ['hemoglobin', 'wbc', 'sodium', 'potassium']]
        
        print(f"\n📊 Available vital columns: {vital_cols}")
        print(f"🧪 Available lab columns: {lab_cols}")
    else:
        print("✅ Sample wide dataset saved to file successfully")
        
except Exception as e:
    print(f"❌ Error in Example 1: {str(e)}")
    import traceback
    traceback.print_exc()
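
For readers who want control over which encounters are sampled, the following sketch draws a reproducible random sample of hospitalization IDs with pandas. The toy table, the helper name, and the seed are assumptions; it does not show how `sample=True` is implemented inside pyCLIF.

```python
import pandas as pd

# Toy hospitalization table with 100 made-up IDs.
hospitalization = pd.DataFrame(
    {"hospitalization_id": [f"H{i:03d}" for i in range(100)]}
)


def sample_hospitalization_ids(df, n=20, seed=42):
    """Return up to n distinct hospitalization IDs, reproducibly."""
    unique_ids = df["hospitalization_id"].drop_duplicates()
    n = min(n, len(unique_ids))  # avoid asking for more IDs than exist
    return unique_ids.sample(n=n, random_state=seed).tolist()


sampled_ids = sample_hospitalization_ids(hospitalization)
print(f"Sampled {len(sampled_ids)} hospitalization IDs")
```

A list like `sampled_ids` could then be passed through the `hospitalization_ids` argument, as Example 2 does.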

In [None]:
wide_df_sample

### Display Sample Data

Let's examine the structure of the sample dataset:

In [None]:
if 'wide_df_sample' in locals() and wide_df_sample is not None:
    print("📋 Sample Data Structure:")
    print(f"Shape: {wide_df_sample.shape}")
    
    # Show key columns
    key_cols = ['patient_id', 'hospitalization_id', 'event_time', 'day_number', 'hosp_id_day_key']
    available_key_cols = [col for col in key_cols if col in wide_df_sample.columns]
    
    print(f"\n🔑 Key columns (first 5 rows):")
    display(wide_df_sample[available_key_cols].head())
    
    # Show data availability for vitals and labs
    vital_cols = [col for col in wide_df_sample.columns if col in ['map', 'heart_rate', 'spo2', 'respiratory_rate']]
    lab_cols = [col for col in wide_df_sample.columns if col in ['hemoglobin', 'wbc', 'sodium', 'potassium']]
    
    if vital_cols:
        print(f"\n📊 Vital signs data availability:")
        for col in vital_cols:
            non_null_count = wide_df_sample[col].notna().sum()
            percentage = (non_null_count / len(wide_df_sample)) * 100
            print(f"   {col}: {non_null_count:,} records ({percentage:.1f}%)")
    
    if lab_cols:
        print(f"\n🧪 Lab data availability:")
        for col in lab_cols:
            non_null_count = wide_df_sample[col].notna().sum()
            percentage = (non_null_count / len(wide_df_sample)) * 100
            print(f"   {col}: {non_null_count:,} records ({percentage:.1f}%)")
else:
    print("No sample data available to display")
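
The availability loops above repeat the same calculation for each group of columns. A small helper, sketched here on toy data, returns the same numbers as a DataFrame; the function name `summarize_completeness` is hypothetical and not part of pyCLIF.

```python
import pandas as pd


def summarize_completeness(df, columns):
    """Non-null counts and percentages for the requested columns that exist in df."""
    present = [c for c in columns if c in df.columns]
    non_null = df[present].notna().sum()
    summary = pd.DataFrame({
        "non_null_count": non_null,
        # max(..., 1) guards against division by zero on an empty frame
        "completeness_pct": (non_null / max(len(df), 1) * 100).round(1),
    })
    return summary.rename_axis("variable").reset_index()


toy_wide = pd.DataFrame({
    "map": [70.0, None, 80.0, 90.0],
    "heart_rate": [None, None, 88.0, 90.0],
})
print(summarize_completeness(toy_wide, ["map", "heart_rate", "spo2"]))
```

Columns that are requested but absent (here `spo2`) are skipped rather than raising a `KeyError`.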

## Example 2: Specific Hospitalization IDs

This example shows how to create a wide dataset for specific hospitalization encounters, focusing on medications and assessments.

In [None]:
print("=== Example 2: Specific Hospitalization IDs ===")

try:
    # Take the first five hospitalization IDs from the hospitalization table
    clif.load_hospitalization_data()
    sample_ids = clif.hospitalization.df['hospitalization_id'].head(5).tolist()
    print(f"🎯 Using sample hospitalization IDs: {sample_ids}")
    
    wide_df_targeted = clif.create_wide_dataset(
        hospitalization_ids=sample_ids,
        optional_tables=['medication_admin_continuous', 'patient_assessments'],
        category_filters={
            'medication_admin_continuous': ['norepinephrine', 'propofol', 'fentanyl'],
            'patient_assessments': ['gcs_total', 'rass', 'sbt_delivery_pass_fail']
        },
        save_to_data_location=True,
        output_filename='targeted_encounters_wide',
        output_format='csv'
    )
    
    if wide_df_targeted is not None:
        print(f"✅ Targeted wide dataset created with {len(wide_df_targeted):,} records")
        print(f"👥 Hospitalizations included: {wide_df_targeted['hospitalization_id'].nunique()}")
        
        # Show medication and assessment availability
        med_cols = [col for col in wide_df_targeted.columns if col in ['norepinephrine', 'propofol', 'fentanyl']]
        assess_cols = [col for col in wide_df_targeted.columns if col in ['gcs_total', 'rass', 'sbt_delivery_pass_fail']]
        
        print(f"\n💊 Available medication columns: {med_cols}")
        print(f"📋 Available assessment columns: {assess_cols}")
    else:
        print("✅ Targeted wide dataset saved to file successfully")
        
except Exception as e:
    print(f"❌ Error in Example 2: {str(e)}")
    import traceback
    traceback.print_exc()

### Analyze Targeted Dataset

Let's examine the medication and assessment data for the targeted hospitalizations:

In [None]:

if 'wide_df_targeted' in locals() and wide_df_targeted is not None:
    print("📊 Targeted Dataset Analysis:")
    
    # Medication usage analysis
    med_cols = [col for col in wide_df_targeted.columns if col in ['norepinephrine', 'propofol', 'fentanyl']]
    if med_cols:
        print("\n💊 Medication Usage:")
        for med in med_cols:
            if med in wide_df_targeted.columns:
                usage_count = wide_df_targeted[med].notna().sum()
                if usage_count > 0:
                    mean_dose = wide_df_targeted[med].mean()
                    print(f"   {med}: {usage_count} administrations, mean dose: {mean_dose:.2f}")
                else:
                    print(f"   {med}: No data available")
    
    # Assessment analysis
    assess_cols = [col for col in wide_df_targeted.columns if col in ['gcs_total', 'rass', 'sbt_delivery_pass_fail']]
    if assess_cols:
        print("\n📋 Assessment Data:")
        for assess in assess_cols:
            if assess in wide_df_targeted.columns:
                non_null_count = wide_df_targeted[assess].notna().sum()
                if non_null_count > 0:
                    if assess in ['gcs_total', 'rass']:
                        mean_val = wide_df_targeted[assess].mean()
                        print(f"   {assess}: {non_null_count} assessments, mean: {mean_val:.1f}")
                    else:
                        print(f"   {assess}: {non_null_count} assessments")
                else:
                    print(f"   {assess}: No data available")
    
    # Show sample of the targeted data
    print("\n📋 Sample of targeted data (first 3 rows):")
    display_cols = ['hospitalization_id', 'event_time', 'day_number'] + med_cols + assess_cols
    available_display_cols = [col for col in display_cols if col in wide_df_targeted.columns]
    display(wide_df_targeted[available_display_cols].head(3))
else:
    print("No targeted data available to display")
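
Example 2 writes its output as CSV, a format that does not store column types. When reading such a file back, it usually helps to parse timestamps and keep IDs as strings explicitly. The sketch below uses a temporary file and toy rows; the file name and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy rows shaped like a small wide dataset.
toy = pd.DataFrame({
    "hospitalization_id": ["00123", "00123"],
    "event_time": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:30"]),
    "norepinephrine": [0.05, None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_wide.csv")
    toy.to_csv(csv_path, index=False)

    # Without dtype=..., "00123" would be read as the integer 123.
    reloaded = pd.read_csv(
        csv_path,
        dtype={"hospitalization_id": "string"},
        parse_dates=["event_time"],
    )

print(reloaded.dtypes)
```

Parquet output keeps column types, so this step is only a concern for CSV.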

## Example 3: Comprehensive Wide Dataset

This example creates a comprehensive wide dataset from five optional tables (vitals, labs, continuous medications, patient assessments, and respiratory support), with category filters on the first four.

In [None]:
print("=== Example 3: Comprehensive Wide Dataset ===")

try:
    wide_df_full = clif.create_wide_dataset(
        optional_tables=['vitals', 'labs', 'medication_admin_continuous', 'patient_assessments', 'respiratory_support'],
        category_filters={
            'vitals': ['map', 'heart_rate', 'spo2', 'respiratory_rate', 'temp_c'],
            'labs': ['hemoglobin', 'wbc', 'sodium', 'potassium', 'creatinine'],
            'medication_admin_continuous': ['norepinephrine', 'epinephrine', 'propofol', 'fentanyl'],
            'patient_assessments': ['gcs_total', 'rass', 'sbt_delivery_pass_fail', 'sat_delivery_pass_fail']
        },
        sample=True,  # Use sample for demo purposes
        save_to_data_location=True,
        output_filename='comprehensive_wide_dataset',
        output_format='parquet'
    )
    
    if wide_df_full is not None:
        print(f"✅ Comprehensive wide dataset created with {len(wide_df_full):,} records and {len(wide_df_full.columns)} columns")
        
        # Show some statistics
        print("\n📊 Dataset Statistics:")
        print(f"   👥 Unique patients: {wide_df_full['patient_id'].nunique()}")
        print(f"   🏥 Unique hospitalizations: {wide_df_full['hospitalization_id'].nunique()}")
        print(f"   📅 Date range: {wide_df_full['event_time'].min()} to {wide_df_full['event_time'].max()}")
        print(f"   📈 Max days per hospitalization: {wide_df_full['day_number'].max()}")
        
        # Show available columns by category
        vital_cols = [col for col in wide_df_full.columns if col in ['map', 'heart_rate', 'spo2', 'respiratory_rate', 'temp_c']]
        lab_cols = [col for col in wide_df_full.columns if col in ['hemoglobin', 'wbc', 'sodium', 'potassium', 'creatinine']]
        med_cols = [col for col in wide_df_full.columns if col in ['norepinephrine', 'epinephrine', 'propofol', 'fentanyl']]
        assess_cols = [col for col in wide_df_full.columns if col in ['gcs_total', 'rass', 'sbt_delivery_pass_fail', 'sat_delivery_pass_fail']]
        
        print(f"\n📊 Available vital columns: {vital_cols}")
        print(f"🧪 Available lab columns: {lab_cols}")
        print(f"💊 Available medication columns: {med_cols}")
        print(f"📋 Available assessment columns: {assess_cols}")
        
    else:
        print("✅ Comprehensive wide dataset saved to file successfully")
        
except Exception as e:
    print(f"❌ Error in Example 3: {str(e)}")
    import traceback
    traceback.print_exc()

### Comprehensive Dataset Analysis

Let's analyze the comprehensive dataset in detail:

In [None]:
if 'wide_df_full' in locals() and wide_df_full is not None:
    print("📊 Comprehensive Dataset Analysis:")
    
    # Data completeness analysis
    categories = {
        'Vitals': ['map', 'heart_rate', 'spo2', 'respiratory_rate', 'temp_c'],
        'Labs': ['hemoglobin', 'wbc', 'sodium', 'potassium', 'creatinine'],
        'Medications': ['norepinephrine', 'epinephrine', 'propofol', 'fentanyl'],
        'Assessments': ['gcs_total', 'rass', 'sbt_delivery_pass_fail', 'sat_delivery_pass_fail']
    }
    
    completeness_data = []
    
    for category, columns in categories.items():
        available_cols = [col for col in columns if col in wide_df_full.columns]
        if available_cols:
            print(f"\n{category} Data Completeness:")
            for col in available_cols:
                non_null_count = wide_df_full[col].notna().sum()
                percentage = (non_null_count / len(wide_df_full)) * 100
                completeness_data.append({
                    'Category': category,
                    'Variable': col,
                    'Non-null Count': non_null_count,
                    'Completeness %': percentage
                })
                print(f"   {col}: {non_null_count:,} records ({percentage:.1f}%)")
    
    # Create completeness summary
    if completeness_data:
        completeness_df = pd.DataFrame(completeness_data)
        print("\n📋 Data Completeness Summary:")
        display(completeness_df.round(1))
    
    # Show sample data with key columns
    print("\n📋 Sample data (first 3 rows, key columns):")
    key_cols = ['patient_id', 'hospitalization_id', 'event_time', 'day_number', 'hosp_id_day_key']
    available_key_cols = [col for col in key_cols if col in wide_df_full.columns]
    display(wide_df_full[available_key_cols].head(3))
    
else:
    print("No comprehensive data available to display")

## Example 4: Return DataFrame (No Saving)

This example demonstrates creating a wide dataset in memory without saving to disk, useful for immediate analysis.

In [None]:
print("=== Example 4: Return DataFrame (No Saving) ===")

try:
    wide_df_memory = clif.create_wide_dataset(
        optional_tables=['vitals'],
        category_filters={
            'vitals': ['map', 'heart_rate']
        },
        sample=True,
        save_to_data_location=False  # Don't save, just return DataFrame
    )
    
    if wide_df_memory is not None:
        print(f"✅ In-memory wide dataset created with {len(wide_df_memory):,} records")
        print("💾 This dataset is available for immediate analysis and has not been saved to disk.")
        
        # Example analysis
        if 'map' in wide_df_memory.columns:
            map_stats = wide_df_memory['map'].describe()
            print(f"\n📊 MAP (Mean Arterial Pressure) Statistics:")
            print(map_stats.round(2))
            
            # MAP distribution by day
            if 'day_number' in wide_df_memory.columns:
                map_by_day = wide_df_memory.groupby('day_number')['map'].agg(['count', 'mean', 'std']).round(2)
                print(f"\n📈 MAP by Hospital Day:")
                display(map_by_day.head(10))
        
        if 'heart_rate' in wide_df_memory.columns:
            hr_stats = wide_df_memory['heart_rate'].describe()
            print(f"\n💓 Heart Rate Statistics:")
            print(hr_stats.round(2))
        
    else:
        print("❌ Failed to create in-memory dataset")
        
except Exception as e:
    print(f"❌ Error in Example 4: {str(e)}")
    import traceback
    traceback.print_exc()
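
Because different measurements arrive at different timestamps, an in-memory wide DataFrame is typically sparse. One common follow-up step is to carry the last observed value forward within each hospitalization. The sketch below shows this on toy data; whether forward-filling is appropriate depends on the analysis, and it is not something the example above does for you.

```python
import pandas as pd

# Toy wide rows for two hospitalizations.
wide = pd.DataFrame({
    "hospitalization_id": ["H1", "H1", "H1", "H2", "H2"],
    "event_time": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 09:00", "2024-01-01 10:00",
        "2024-01-01 08:30", "2024-01-01 09:30",
    ]),
    "map": [72.0, None, 75.0, None, 65.0],
    "heart_rate": [88.0, 90.0, None, 101.0, None],
})

value_cols = ["map", "heart_rate"]

# Sort so "forward" means forward in time, then fill within each hospitalization only.
wide = wide.sort_values(["hospitalization_id", "event_time"])
filled = wide.copy()
filled[value_cols] = wide.groupby("hospitalization_id")[value_cols].ffill()
print(filled)
```

Grouping by `hospitalization_id` keeps values from one encounter from leaking into the next; leading gaps (no earlier observation) remain `NaN`.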

## Example 5: Base Table Column Selection

This example demonstrates how to select specific columns from base tables (patient, hospitalization, adt) for memory efficiency and focused analysis.

In [None]:
print("=== Example 5: Base Table Column Selection ===")

try:
    # Define which columns to include from base tables for memory efficiency
    base_columns = {
        'patient': ['patient_id', 'age', 'sex', 'race'],
        'hospitalization': ['hospitalization_id', 'patient_id', 'admit_dttm', 'discharge_dttm'],
        'adt': ['hospitalization_id', 'in_dttm', 'out_dttm', 'location']
    }
    
    print("🎯 Selected base table columns:")
    for table, cols in base_columns.items():
        print(f"   {table}: {cols}")
    
    # Create wide dataset with filtered base columns
    wide_df_filtered = clif.create_wide_dataset(
        optional_tables=['vitals'],
        category_filters={
            'vitals': ['map', 'heart_rate', 'spo2']
        },
        sample=True,
        base_table_columns=base_columns,
        save_to_data_location=True,
        output_filename='filtered_columns_wide_dataset',
        output_format='parquet'
    )
    
    if wide_df_filtered is not None:
        print(f"\n✅ Filtered wide dataset created with {len(wide_df_filtered):,} records and {len(wide_df_filtered.columns)} columns")
        
        # Show which base table columns were included
        print(f"\n📊 Base table columns included in final dataset:")
        for table, cols in base_columns.items():
            available_cols = [col for col in cols if col in wide_df_filtered.columns]
            print(f"   {table}: {available_cols}")
        
        # Show memory efficiency benefits
        print(f"\n💾 Memory Efficiency Benefits:")
        print(f"   - Focused on essential demographics and timestamps")
        print(f"   - Reduced memory footprint compared to loading all columns")
        print(f"   - Faster processing with fewer columns to handle")
        
        # Display sample of the filtered dataset
        print(f"\n📋 Sample of filtered dataset:")
        display_cols = ['patient_id', 'age', 'sex', 'hospitalization_id', 'admit_dttm', 'event_time', 'map', 'heart_rate']
        available_display_cols = [col for col in display_cols if col in wide_df_filtered.columns]
        display(wide_df_filtered[available_display_cols].head())
        
    else:
        print("✅ Filtered wide dataset saved to file successfully")
        
except Exception as e:
    print(f"❌ Error in Example 5: {str(e)}")
    import traceback
    traceback.print_exc()

### Analysis: Column Selection Benefits

Let's analyze the benefits of base table column selection:

In [None]:
if 'wide_df_filtered' in locals() and wide_df_filtered is not None:
    print("📊 Column Selection Analysis:")
    
    # Analyze the demographics included
    demo_cols = ['age', 'sex', 'race']
    available_demo = [col for col in demo_cols if col in wide_df_filtered.columns]
    
    if available_demo:
        print(f"\n👥 Demographics Summary:")
        for col in available_demo:
            if col == 'age':
                age_stats = wide_df_filtered[col].describe()
                print(f"   {col}: mean={age_stats['mean']:.1f}, std={age_stats['std']:.1f}")
            else:
                value_counts = wide_df_filtered[col].value_counts()
                print(f"   {col}: {dict(value_counts.head(3))}")
    
    # Analyze temporal coverage
    time_cols = ['admit_dttm', 'event_time']
    available_time = [col for col in time_cols if col in wide_df_filtered.columns]
    
    if available_time:
        print(f"\n📅 Temporal Coverage:")
        for col in available_time:
            if wide_df_filtered[col].notna().sum() > 0:
                min_time = wide_df_filtered[col].min()
                max_time = wide_df_filtered[col].max()
                print(f"   {col}: {min_time} to {max_time}")
    
    # Show column efficiency
    print(f"\n⚡ Efficiency Metrics:")
    print(f"   Total columns: {len(wide_df_filtered.columns)}")
    print(f"   Memory usage per row reduced by focusing on essential columns")
    print(f"   Processing speed improved with targeted column selection")
    
    # Demonstrate focused analysis capability
    vital_cols = [col for col in ['map', 'heart_rate', 'spo2'] if col in wide_df_filtered.columns]
    if vital_cols:
        print(f"\n🔍 Focused Analysis - Vital Signs:")
        vital_summary = wide_df_filtered[vital_cols].describe().round(2)
        display(vital_summary)
        
        # Show correlation if multiple vitals available
        if len(vital_cols) > 1:
            print(f"\n📈 Vital Signs Correlations:")
            vital_corr = wide_df_filtered[vital_cols].corr().round(3)
            display(vital_corr)
    
else:
    print("No filtered dataset available for analysis")

## Key Features Demonstrated

This notebook demonstrated the following key features of the pyCLIF wide dataset functionality:

### 🔧 **Core Functionality**
- **Auto-loading**: Tables are automatically loaded as needed
- **Multi-table joining**: Seamless integration of patient, hospitalization, ADT, and optional tables
- **Category-based pivoting**: Automatic pivoting of vitals, labs, medications, and assessments
- **Time-based alignment**: Events are aligned by timestamp across all tables
- **Base table column selection**: Choose specific columns from base tables for memory efficiency
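
As a rough illustration of time-based alignment, the sketch below combines two already-pivoted toy tables with an outer join on hospitalization and timestamp, so that every event time from either table gets a row. The table contents and the join strategy are assumptions; pyCLIF may align events differently internally.

```python
import pandas as pd

# Two toy tables that have already been pivoted to one column per category.
vitals_wide = pd.DataFrame({
    "hospitalization_id": ["H1", "H1"],
    "event_time": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:00"]),
    "map": [72.0, 75.0],
})
labs_wide = pd.DataFrame({
    "hospitalization_id": ["H1", "H1"],
    "event_time": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:30"]),
    "sodium": [140.0, 138.0],
})

# An outer join keeps timestamps that appear in only one of the tables.
aligned = (
    vitals_wide
    .merge(labs_wide, on=["hospitalization_id", "event_time"], how="outer")
    .sort_values(["hospitalization_id", "event_time"])
    .reset_index(drop=True)
)
print(aligned)
```

Rows where only one source has a measurement carry `NaN` in the other source's columns.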

### 📊 **Flexible Configuration**
- **Sampling modes**: Random sampling or specific hospitalization targeting
- **Category filters**: Specify which categories to include for each table type
- **Base table filtering**: Select only needed columns from patient, hospitalization, and ADT tables
- **Output formats**: In-memory DataFrame, or CSV or Parquet files
- **Save options**: In-memory analysis or file output

### 📈 **Analysis-Ready Structure**
- **Day-based aggregation**: `day_number` and `hosp_id_day_key` for temporal analysis
- **Complete patient context**: Demographics, hospitalization details, and clinical data
- **Missing data handling**: Proper NaN handling for missing categories
- **Time-series ready**: Event timestamps preserved for longitudinal analysis
- **Memory efficient**: Reduced memory footprint with column selection
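
The `day_number` column makes per-day summaries straightforward. The sketch below aggregates a toy wide table by hospitalization and day; the values are invented, and how pyCLIF derives `day_number` is not shown here.

```python
import pandas as pd

# Toy wide rows with a precomputed day_number column.
wide = pd.DataFrame({
    "hospitalization_id": ["H1", "H1", "H1", "H2"],
    "day_number": [1, 1, 2, 1],
    "map": [70.0, 74.0, 80.0, None],
    "heart_rate": [90.0, None, 85.0, 100.0],
})

# One row per hospitalization-day, with named aggregations.
daily = (
    wide
    .groupby(["hospitalization_id", "day_number"], as_index=False)
    .agg(
        map_mean=("map", "mean"),
        map_min=("map", "min"),
        heart_rate_mean=("heart_rate", "mean"),
        n_rows=("map", "size"),  # size counts rows, including those with NaN
    )
)
print(daily)
```

Aggregations such as `mean` and `min` skip `NaN` by default, so a day with no recorded value yields `NaN` rather than an error.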

### 🎯 **Use Cases**
- **Exploratory analysis**: Quick sampling for data exploration
- **Targeted studies**: Focus on specific patient populations
- **Comprehensive research**: Full datasets with all available data
- **Real-time analysis**: In-memory processing for immediate insights
- **Memory-constrained environments**: Efficient processing with column selection

### 💡 **New Feature: Base Table Column Selection**
- **Memory efficiency**: Load only required columns from base tables
- **Focused analysis**: Include only relevant demographics and timestamps
- **Performance optimization**: Faster processing with fewer columns
- **Flexible configuration**: Specify different column sets for different analyses

**Example Usage:**
```python
base_columns = {
    'patient': ['patient_id', 'age', 'sex', 'race'],
    'hospitalization': ['hospitalization_id', 'patient_id', 'admit_dttm', 'discharge_dttm'],
    'adt': ['hospitalization_id', 'in_dttm', 'out_dttm', 'location']
}

wide_df = clif.create_wide_dataset(
    optional_tables=['vitals'],
    base_table_columns=base_columns,
    sample=True
)
```
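
To check the memory benefit rather than take it on faith, you can compare `memory_usage` before and after selecting columns. The sketch below does this on a toy patient table; the `free_text_note` column and the helper name are hypothetical, standing in for any wide column an analysis does not need.

```python
import pandas as pd

# Toy patient table with one deliberately bulky column.
patient = pd.DataFrame({
    "patient_id": [f"P{i}" for i in range(1000)],
    "age": [50] * 1000,
    "sex": ["F"] * 1000,
    "race": ["unknown"] * 1000,
    "free_text_note": ["x" * 200] * 1000,
})


def select_columns(df, wanted):
    """Keep only the requested columns that actually exist in df."""
    keep = [c for c in wanted if c in df.columns]
    return df[keep]


slim = select_columns(patient, ["patient_id", "age", "sex", "race", "not_a_column"])

# deep=True includes the memory held by string values, not just pointers.
full_bytes = patient.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(f"All columns: {full_bytes:,} bytes; selected columns: {slim_bytes:,} bytes")
```

The same comparison can be run on real wide datasets built with and without `base_table_columns`.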

For more information, refer to the documentation at `docs/wide_dataset.md`.

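## Visualizing the In-Memory Dataset

The following cell plots the in-memory dataset created in Example 4 (`wide_df_memory`):

In [None]: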
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('default')
sns.set_palette("husl")

if 'wide_df_memory' in locals() and wide_df_memory is not None:
    print("📊 Creating visualizations for time-series analysis...")
    
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Wide Dataset Time-Series Analysis', fontsize=16, fontweight='bold')
    
    # Plot 1: MAP over time for a single hospitalization
    if 'map' in wide_df_memory.columns and wide_df_memory['map'].notna().sum() > 0:
        # Get one hospitalization with MAP data
        hosp_with_map = wide_df_memory[wide_df_memory['map'].notna()]['hospitalization_id'].iloc[0]
        single_hosp = wide_df_memory[wide_df_memory['hospitalization_id'] == hosp_with_map].copy()
        single_hosp = single_hosp.sort_values('event_time')
        
        axes[0, 0].plot(single_hosp['day_number'], single_hosp['map'], 'o-', linewidth=2, markersize=4)
        axes[0, 0].set_title(f'MAP Over Time (Hospitalization: {hosp_with_map})')
        axes[0, 0].set_xlabel('Hospital Day')
        axes[0, 0].set_ylabel('MAP (mmHg)')
        axes[0, 0].grid(True, alpha=0.3)
    else:
        axes[0, 0].text(0.5, 0.5, 'No MAP data available', ha='center', va='center', transform=axes[0, 0].transAxes)
        axes[0, 0].set_title('MAP Over Time')
    
    # Plot 2: Heart Rate distribution
    if 'heart_rate' in wide_df_memory.columns and wide_df_memory['heart_rate'].notna().sum() > 0:
        hr_data = wide_df_memory['heart_rate'].dropna()
        axes[0, 1].hist(hr_data, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        axes[0, 1].set_title('Heart Rate Distribution')
        axes[0, 1].set_xlabel('Heart Rate (bpm)')
        axes[0, 1].set_ylabel('Frequency')
        axes[0, 1].axvline(hr_data.mean(), color='red', linestyle='--', label=f'Mean: {hr_data.mean():.1f}')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
    else:
        axes[0, 1].text(0.5, 0.5, 'No Heart Rate data available', ha='center', va='center', transform=axes[0, 1].transAxes)
        axes[0, 1].set_title('Heart Rate Distribution')
    
    # Plot 3: Data availability by hospital day
    if 'day_number' in wide_df_memory.columns:
        day_data = wide_df_memory.groupby('day_number').size()
        axes[1, 0].bar(day_data.index, day_data.values, alpha=0.7, color='lightgreen')
        axes[1, 0].set_title('Number of Records by Hospital Day')
        axes[1, 0].set_xlabel('Hospital Day')
        axes[1, 0].set_ylabel('Number of Records')
        axes[1, 0].grid(True, alpha=0.3)
    else:
        axes[1, 0].text(0.5, 0.5, 'No day number data available', ha='center', va='center', transform=axes[1, 0].transAxes)
        axes[1, 0].set_title('Records by Hospital Day')
    
    # Plot 4: Data completeness heatmap
    vital_cols = [col for col in ['map', 'heart_rate'] if col in wide_df_memory.columns]
    if vital_cols and 'hospitalization_id' in wide_df_memory.columns:
        # Create completeness matrix for top hospitalizations
        top_hosps = wide_df_memory['hospitalization_id'].value_counts().head(10).index
        completeness_matrix = []
        
        for hosp in top_hosps:
            hosp_data = wide_df_memory[wide_df_memory['hospitalization_id'] == hosp]
            completeness = [(hosp_data[col].notna().sum() / len(hosp_data)) * 100 for col in vital_cols]
            completeness_matrix.append(completeness)
        
        if completeness_matrix:
            im = axes[1, 1].imshow(completeness_matrix, cmap='RdYlGn', aspect='auto', vmin=0, vmax=100)
            axes[1, 1].set_title('Data Completeness by Hospitalization (%)')
            axes[1, 1].set_xlabel('Vital Signs')
            axes[1, 1].set_ylabel('Hospitalizations (Top 10)')
            axes[1, 1].set_xticks(range(len(vital_cols)))
            axes[1, 1].set_xticklabels(vital_cols, rotation=45)
            axes[1, 1].set_yticks(range(len(top_hosps)))
            # Cast IDs to str so truncation also works when hospitalization_id is numeric
            axes[1, 1].set_yticklabels([f"{str(hosp)[:8]}..." for hosp in top_hosps])
            
            # Add colorbar
            cbar = plt.colorbar(im, ax=axes[1, 1])
            cbar.set_label('Completeness (%)')
        else:
            axes[1, 1].text(0.5, 0.5, 'Insufficient data for heatmap', ha='center', va='center', transform=axes[1, 1].transAxes)
    else:
        axes[1, 1].text(0.5, 0.5, 'No vital signs data available', ha='center', va='center', transform=axes[1, 1].transAxes)
        axes[1, 1].set_title('Data Completeness Heatmap')
    
    plt.tight_layout()
    plt.show()
    
else:
    print("No in-memory dataset available for visualization")

## Summary and Next Steps

Let's summarize what we've accomplished and check for saved files:

In [None]:
print("=== Wide Dataset Examples Complete ===")
print(f"📁 Check the data directory {data_dir} for saved output files.")

# Check for saved files
if os.path.exists(data_dir):
    output_files = [
        'sample_wide_dataset.parquet',
        'targeted_encounters_wide.csv',
        'comprehensive_wide_dataset.parquet',
        'filtered_columns_wide_dataset.parquet'
    ]
    
    print("\n📋 Generated Output Files:")
    for filename in output_files:
        filepath = os.path.join(data_dir, filename)
        if os.path.exists(filepath):
            file_size = os.path.getsize(filepath) / (1024 * 1024)  # MB
            print(f"   ✅ {filename} ({file_size:.1f} MB)")
        else:
            print(f"   ❌ {filename} (not found)")

print("\n🎉 All example cells have run; review any errors reported above.")
print("\n📚 Next Steps:")
print("   1. Adapt these examples for your specific research questions")
print("   2. Experiment with different category filters")
print("   3. Try different sampling strategies")
print("   4. Integrate wide datasets into your analysis workflows")
print("   5. Create custom functions based on these patterns")
