# Advanced Vitals Analysis

This notebook demonstrates advanced analysis capabilities for vital signs data using pyCLIF, including filtering, aggregation, visualization, and clinical insights.

## Overview

The vitals table is one of the most important CLIF tables for clinical analysis. This notebook covers:
- Comprehensive vital signs exploration
- Time-series analysis
- Range validation and outlier detection
- Clinical trend analysis
- Statistical summaries and visualizations
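
The cells in this notebook access four columns of the vitals table: `patient_id`, `recorded_dttm`, `vital_category`, and `vital_value`, with one row per measurement (long format). As a rough orientation, the sketch below builds a tiny synthetic table with that layout and shows the two operations most of the later analysis reduces to: filtering by category and grouping by category. All values are made up, and the `toy_*` names are illustrative only; they are not part of pyCLIF.

```python
import pandas as pd

# Synthetic long-format vitals: one row per measurement (values are made up).
toy_vitals = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2"],
    "recorded_dttm": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 09:00", "2024-01-01 09:00",
        "2024-01-01 08:30", "2024-01-01 10:00",
    ]),
    "vital_category": ["heart_rate", "heart_rate", "sbp", "heart_rate", "sbp"],
    "vital_value": [82.0, 90.0, 118.0, 75.0, 130.0],
})

# Selecting one category is a plain boolean filter on vital_category.
toy_hr = toy_vitals[toy_vitals["vital_category"] == "heart_rate"]

# Per-category summaries follow from a groupby on the same column.
toy_summary = toy_vitals.groupby("vital_category")["vital_value"].agg(["count", "mean"])
print(toy_summary)
```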

## Setup and Imports

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Import pyCLIF components
from pyclif.tables.vitals import vitals
from pyclif.utils.io import load_data

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

import matplotlib  # pyplot does not expose __version__; the top-level package does

print("Environment setup complete!")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

In [None]:
# Set data directory (replace with the path to your local CLIF data)
DATA_DIR = "/Users/vaishvik/downloads/CLIF_MIMIC"
print(f"Data directory: {DATA_DIR}")

## Load and Explore Vitals Data

In [None]:
# Load vitals data from parquet files
vitals_table = vitals.from_file(DATA_DIR, "parquet")

print(f"✅ Vitals data loaded successfully!")
print(f"Shape: {vitals_table.df.shape}")
print(f"Validation status: {vitals_table.isvalid()}")
print(f"Date range: {vitals_table.df['recorded_dttm'].min()} to {vitals_table.df['recorded_dttm'].max()}")

# Display basic info
print("\nColumn information:")
vitals_table.df.info()  # info() prints directly and returns None

## Vital Categories Analysis

In [None]:
# Get comprehensive vital categories overview
vital_categories = vitals_table.get_vital_categories()
summary_stats = vitals_table.get_summary_stats()

print(f"=== VITAL CATEGORIES OVERVIEW ===")
print(f"Total vital categories: {len(vital_categories)}")
print(f"Total measurements: {summary_stats['total_records']:,}")
print(f"Unique hospitalizations: {summary_stats['unique_hospitalizations']:,}")

print("\nVital categories available:")
category_counts = summary_stats['vital_category_counts']
for category in sorted(category_counts.keys()):
    count = category_counts[category]
    percentage = (count / summary_stats['total_records']) * 100
    print(f"  {category:<25}: {count:>8,} ({percentage:>5.1f}%)")

In [None]:
# Visualize vital category distribution
plt.figure(figsize=(12, 8))
category_counts = pd.Series(summary_stats['vital_category_counts'])
top_categories = category_counts.nlargest(15)

plt.subplot(2, 1, 1)
top_categories.plot(kind='bar')
plt.title('Top 15 Vital Categories by Measurement Count')
plt.xlabel('Vital Category')
plt.ylabel('Number of Measurements')
plt.xticks(rotation=45, ha='right')

plt.subplot(2, 1, 2)
top_categories.plot(kind='pie', autopct='%1.1f%%')
plt.title('Distribution of Top Vital Categories')
plt.ylabel('')

plt.tight_layout()
plt.show()

## Detailed Analysis by Vital Category

In [None]:
# Focus on key vital signs
key_vitals = ['heart_rate', 'sbp', 'dbp', 'temp_c', 'oxygen_saturation', 'respiratory_rate']
available_key_vitals = [v for v in key_vitals if v in vital_categories]

print(f"=== KEY VITAL SIGNS ANALYSIS ===")
print(f"Available key vitals: {available_key_vitals}")

vital_stats_summary = []

for vital in available_key_vitals:
    vital_data = vitals_table.filter_by_vital_category(vital)
    
    if not vital_data.empty and 'vital_value' in vital_data.columns:
        stats = {
            'vital': vital,
            'count': len(vital_data),
            'mean': vital_data['vital_value'].mean(),
            'std': vital_data['vital_value'].std(),
            'min': vital_data['vital_value'].min(),
            'max': vital_data['vital_value'].max(),
            'q25': vital_data['vital_value'].quantile(0.25),
            'q50': vital_data['vital_value'].quantile(0.50),
            'q75': vital_data['vital_value'].quantile(0.75)
        }
        vital_stats_summary.append(stats)
        
        print(f"\n{vital.upper()}:")
        print(f"  Count: {stats['count']:,}")
        print(f"  Mean ± SD: {stats['mean']:.1f} ± {stats['std']:.1f}")
        print(f"  Range: {stats['min']:.1f} - {stats['max']:.1f}")
        print(f"  IQR: {stats['q25']:.1f} - {stats['q75']:.1f}")

# Convert to DataFrame for easier manipulation
vital_stats_df = pd.DataFrame(vital_stats_summary)
print(f"\nSummary statistics calculated for {len(vital_stats_df)} vital signs.")

In [None]:
# Create box plots for key vitals
if not vital_stats_df.empty:
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for i, vital in enumerate(available_key_vitals[:6]):
        vital_data = vitals_table.filter_by_vital_category(vital)
        
        if not vital_data.empty and 'vital_value' in vital_data.columns:
            # Trim to the 1st-99th percentile range for visualization
            p01 = vital_data['vital_value'].quantile(0.01)
            p99 = vital_data['vital_value'].quantile(0.99)
            filtered_data = vital_data[
                (vital_data['vital_value'] >= p01) & 
                (vital_data['vital_value'] <= p99)
            ]
            
            axes[i].boxplot(filtered_data['vital_value'])
            axes[i].set_title(f'{vital.replace("_", " ").title()}')
            axes[i].set_ylabel('Value')
    
    # Hide any subplots that were not used (does not rely on the loop variable)
    for ax in axes[len(available_key_vitals[:6]):]:
        ax.set_visible(False)
    
    plt.suptitle('Distribution of Key Vital Signs (1st-99th percentile)', fontsize=16)
    plt.tight_layout()
    plt.show()

## Range Validation and Outlier Analysis
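
Range validation flags measurements that fall outside plausible physiological limits. How `get_range_validation_report()` does this internally is not shown in this notebook, so the following is only a pandas sketch of the general idea on synthetic data. The bounds in `ILLUSTRATIVE_RANGES` and the helper `flag_out_of_range` are hypothetical, chosen for demonstration, and are not the CLIF reference ranges.

```python
import pandas as pd

# Hypothetical (lower, upper) plausibility bounds; NOT the CLIF reference ranges.
ILLUSTRATIVE_RANGES = {"heart_rate": (0, 300), "temp_c": (25, 45)}

ranges_df = pd.DataFrame(
    [(category, lo, hi) for category, (lo, hi) in ILLUSTRATIVE_RANGES.items()],
    columns=["vital_category", "lower", "upper"],
)

def flag_out_of_range(df, ranges_df):
    """Return rows whose vital_value lies outside the bounds for its category."""
    merged = df.merge(ranges_df, on="vital_category", how="left")
    # Categories without bounds get NaN limits, so both comparisons are False
    # and those rows are never flagged.
    outside = (merged["vital_value"] < merged["lower"]) | (
        merged["vital_value"] > merged["upper"]
    )
    return merged.loc[outside, ["vital_category", "vital_value", "lower", "upper"]]

# Synthetic measurements (values are made up).
toy = pd.DataFrame({
    "vital_category": ["heart_rate", "heart_rate", "temp_c", "temp_c", "sbp"],
    "vital_value": [80.0, 450.0, 36.8, 12.0, 120.0],
})

flagged = flag_out_of_range(toy, ranges_df)
print(flagged)
```

Unlike the percentile-based trimming used elsewhere in this notebook, fixed bounds do not depend on the data distribution, so the number of flagged rows reflects data quality rather than sample size.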

In [None]:
# Analyze range validation results
range_report = vitals_table.get_range_validation_report()

print("=== RANGE VALIDATION ANALYSIS ===")
if not range_report.empty:
    print(f"Total range validation issues: {len(range_report)}")
    
    # Group by error type
    error_type_counts = range_report['error_type'].value_counts()
    print("\nError types:")
    for error_type, count in error_type_counts.items():
        print(f"  {error_type}: {count}")
    
    # Show most problematic vitals
    if 'affected_rows' in range_report.columns:
        problematic_vitals = range_report.groupby('vital_category')['affected_rows'].sum().sort_values(ascending=False)
        print("\nVitals with most range validation issues:")
        for vital, affected_rows in problematic_vitals.head(5).items():
            print(f"  {vital}: {affected_rows:,} affected measurements")
    
    # Display detailed report
    print("\nDetailed range validation report:")
    display_cols = ['vital_category', 'error_type', 'affected_rows', 'message']
    available_cols = [col for col in display_cols if col in range_report.columns]
    print(range_report[available_cols].head(10))
else:
    print("✅ No range validation issues found!")

In [None]:
# Analyze extreme values for a specific vital
def analyze_extreme_values(vital_category, percentile_threshold=0.01):
    """Analyze extreme values for a specific vital category."""
    vital_data = vitals_table.filter_by_vital_category(vital_category)
    
    if vital_data.empty or 'vital_value' not in vital_data.columns:
        print(f"No data available for {vital_category}")
        return
    
    # Calculate percentiles
    low_threshold = vital_data['vital_value'].quantile(percentile_threshold)
    high_threshold = vital_data['vital_value'].quantile(1 - percentile_threshold)
    
    extreme_low = vital_data[vital_data['vital_value'] <= low_threshold]
    extreme_high = vital_data[vital_data['vital_value'] >= high_threshold]
    
    print(f"=== EXTREME VALUES ANALYSIS: {vital_category.upper()} ===")
    print(f"Total measurements: {len(vital_data):,}")
    print(f"Threshold percentiles: {percentile_threshold*100:.1f}% and {(1-percentile_threshold)*100:.1f}%")
    print(f"Low threshold: ≤{low_threshold:.1f} ({len(extreme_low):,} measurements)")
    print(f"High threshold: ≥{high_threshold:.1f} ({len(extreme_high):,} measurements)")
    
    if not extreme_low.empty:
        print(f"\nExtreme low values (sample):")
        sample_low = extreme_low.nsmallest(5, 'vital_value')[['patient_id', 'vital_value', 'recorded_dttm']]
        print(sample_low.to_string(index=False))
    
    if not extreme_high.empty:
        print(f"\nExtreme high values (sample):")
        sample_high = extreme_high.nlargest(5, 'vital_value')[['patient_id', 'vital_value', 'recorded_dttm']]
        print(sample_high.to_string(index=False))

# Analyze extreme values for heart rate
if 'heart_rate' in available_key_vitals:
    analyze_extreme_values('heart_rate')

## Time Series Analysis

In [None]:
# Analyze temporal patterns in vital signs
def analyze_temporal_patterns(vital_category):
    """Analyze temporal patterns for a specific vital category."""
    vital_data = vitals_table.filter_by_vital_category(vital_category)
    
    if vital_data.empty:
        print(f"No data available for {vital_category}")
        return
    
    # Convert datetime column
    vital_data = vital_data.copy()
    vital_data['recorded_dttm'] = pd.to_datetime(vital_data['recorded_dttm'])
    
    print(f"=== TEMPORAL ANALYSIS: {vital_category.upper()} ===")
    print(f"Date range: {vital_data['recorded_dttm'].min()} to {vital_data['recorded_dttm'].max()}")
    
    # Daily measurement counts
    daily_counts = vital_data.set_index('recorded_dttm').resample('D').size()
    print(f"\nDaily measurement statistics:")
    print(f"  Mean measurements/day: {daily_counts.mean():.1f}")
    print(f"  Max measurements/day: {daily_counts.max()}")
    print(f"  Days with measurements: {(daily_counts > 0).sum()}")
    
    # Hourly patterns
    vital_data['hour'] = vital_data['recorded_dttm'].dt.hour
    hourly_counts = vital_data['hour'].value_counts().sort_index()
    
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    daily_counts.plot()
    plt.title(f'{vital_category} - Daily Measurement Counts')
    plt.xlabel('Date')
    plt.ylabel('Number of Measurements')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 2, 2)
    hourly_counts.plot(kind='bar')
    plt.title(f'{vital_category} - Hourly Distribution')
    plt.xlabel('Hour of Day')
    plt.ylabel('Number of Measurements')
    
    plt.tight_layout()
    plt.show()
    
    return vital_data

# Analyze temporal patterns for available vitals
if available_key_vitals:
    temporal_data = analyze_temporal_patterns(available_key_vitals[0])

## Patient-Level Analysis

In [None]:
# Analyze vital patterns for individual patients
def analyze_patient_vitals(patient_id, vital_categories=None):
    """Analyze vital signs for a specific patient."""
    if vital_categories is None:
        vital_categories = available_key_vitals[:3]  # First three available key vitals
    
    patient_data = vitals_table.df[vitals_table.df['patient_id'] == patient_id].copy()
    
    if patient_data.empty:
        print(f"No vital data found for patient {patient_id}")
        return
    
    patient_data['recorded_dttm'] = pd.to_datetime(patient_data['recorded_dttm'])
    
    print(f"=== PATIENT ANALYSIS: {patient_id} ===")
    print(f"Total vital measurements: {len(patient_data):,}")
    print(f"Date range: {patient_data['recorded_dttm'].min()} to {patient_data['recorded_dttm'].max()}")
    print(f"Vital categories: {patient_data['vital_category'].nunique()}")
    
    # Plot vital trends
    fig, axes = plt.subplots(len(vital_categories), 1, figsize=(12, 4*len(vital_categories)))
    if len(vital_categories) == 1:
        axes = [axes]
    
    for i, vital in enumerate(vital_categories):
        vital_subset = patient_data[patient_data['vital_category'] == vital]
        
        if not vital_subset.empty:
            vital_subset = vital_subset.sort_values('recorded_dttm')
            axes[i].plot(vital_subset['recorded_dttm'], vital_subset['vital_value'], 'o-', alpha=0.7)
            axes[i].set_title(f'{vital.replace("_", " ").title()} Trend')
            axes[i].set_ylabel('Value')
            axes[i].grid(True, alpha=0.3)
            
            # Add summary stats
            mean_val = vital_subset['vital_value'].mean()
            axes[i].axhline(y=mean_val, color='red', linestyle='--', alpha=0.5, label=f'Mean: {mean_val:.1f}')
            axes[i].legend()
        else:
            axes[i].text(0.5, 0.5, f'No {vital} data', ha='center', va='center', transform=axes[i].transAxes)
            axes[i].set_title(f'{vital.replace("_", " ").title()} - No Data')
    
    plt.tight_layout()
    plt.show()
    
    return patient_data

# Get a sample patient for analysis
patient_counts = vitals_table.df['patient_id'].value_counts().head(5)
sample_patients = patient_counts.index.tolist()
if sample_patients:
    print("\nTop 5 patients by measurement count:")
    for i, (patient_id, count) in enumerate(patient_counts.items(), start=1):
        print(f"  {i}. {patient_id}: {count:,} measurements")
    
    # Analyze the patient with the most measurements
    patient_analysis = analyze_patient_vitals(sample_patients[0])

## Clinical Insights and Correlations

In [None]:
# Analyze correlations between vital signs
def analyze_vital_correlations(vital_list):
    """Analyze correlations between different vital signs."""
    correlation_data = []
    
    for vital in vital_list:
        vital_subset = vitals_table.filter_by_vital_category(vital)
        if not vital_subset.empty and 'patient_id' in vital_subset.columns:
            # Get average vital value per patient
            patient_avg = vital_subset.groupby('patient_id')['vital_value'].mean().reset_index()
            patient_avg['vital_category'] = vital
            correlation_data.append(patient_avg)
    
    if not correlation_data:
        print("Insufficient data for correlation analysis")
        return
    
    # Combine all vital data
    combined_data = pd.concat(correlation_data, ignore_index=True)
    
    # Pivot to get vitals as columns
    pivot_data = combined_data.pivot(index='patient_id', columns='vital_category', values='vital_value')
    
    # Calculate correlations
    correlations = pivot_data.corr()
    
    print("=== VITAL SIGN CORRELATIONS ===")
    print(f"Patients with complete data: {len(pivot_data.dropna())}")
    print(f"Vitals analyzed: {list(pivot_data.columns)}")
    
    # Plot correlation heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlations, annot=True, cmap='coolwarm', center=0, 
                square=True, fmt='.2f', cbar_kws={'label': 'Correlation Coefficient'})
    plt.title('Correlation Matrix of Vital Signs\n(Patient-Level Averages)')
    plt.tight_layout()
    plt.show()
    
    return pivot_data, correlations

# Analyze correlations for available key vitals
if len(available_key_vitals) >= 2:
    vital_correlations = analyze_vital_correlations(available_key_vitals[:4])  # First four available key vitals

## Advanced Filtering and Cohort Analysis

In [None]:
# Create cohorts based on vital sign characteristics
def create_vital_cohorts(vital_category, threshold_percentiles=(25, 75)):
    """Create patient cohorts based on vital sign values."""
    vital_data = vitals_table.filter_by_vital_category(vital_category)
    
    if vital_data.empty:
        print(f"No data available for {vital_category}")
        return
    
    # Calculate patient-level statistics
    patient_stats = vital_data.groupby('patient_id')['vital_value'].agg([
        'count', 'mean', 'std', 'min', 'max'
    ]).reset_index()
    
    # Define cohorts based on mean values
    low_threshold = patient_stats['mean'].quantile(threshold_percentiles[0]/100)
    high_threshold = patient_stats['mean'].quantile(threshold_percentiles[1]/100)
    
    patient_stats['cohort'] = 'Normal'
    patient_stats.loc[patient_stats['mean'] <= low_threshold, 'cohort'] = 'Low'
    patient_stats.loc[patient_stats['mean'] >= high_threshold, 'cohort'] = 'High'
    
    print(f"=== COHORT ANALYSIS: {vital_category.upper()} ===")
    print(f"Cohort definitions (based on {threshold_percentiles[0]}th and {threshold_percentiles[1]}th percentiles):")
    print(f"  Low: ≤{low_threshold:.1f}")
    print(f"  Normal: {low_threshold:.1f} - {high_threshold:.1f}")
    print(f"  High: ≥{high_threshold:.1f}")
    
    cohort_summary = patient_stats.groupby('cohort').agg({
        'patient_id': 'count',
        'mean': ['mean', 'std'],
        'count': ['mean', 'std']
    }).round(2)
    
    print("\nCohort summary:")
    print(cohort_summary)
    
    # Visualize cohorts
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 2, 1)
    patient_stats['cohort'].value_counts().plot(kind='bar')
    plt.title('Patient Count by Cohort')
    plt.ylabel('Number of Patients')
    
    plt.subplot(2, 2, 2)
    for cohort in patient_stats['cohort'].unique():
        cohort_data = patient_stats[patient_stats['cohort'] == cohort]['mean']
        plt.hist(cohort_data, alpha=0.7, label=cohort, bins=20)
    plt.xlabel(f'Mean {vital_category}')
    plt.ylabel('Number of Patients')
    plt.title('Distribution of Mean Values by Cohort')
    plt.legend()
    
    plt.subplot(2, 2, 3)
    sns.boxplot(data=patient_stats, x='cohort', y='mean')
    plt.title(f'Mean {vital_category} by Cohort')
    plt.ylabel(f'Mean {vital_category}')
    
    plt.subplot(2, 2, 4)
    sns.boxplot(data=patient_stats, x='cohort', y='count')
    plt.title('Number of Measurements by Cohort')
    plt.ylabel('Measurement Count')
    
    plt.tight_layout()
    plt.show()
    
    return patient_stats

# Create cohorts for heart rate if available
if 'heart_rate' in available_key_vitals:
    hr_cohorts = create_vital_cohorts('heart_rate')

## Summary and Clinical Insights

In [None]:
# Generate comprehensive summary report
def generate_vitals_summary_report():
    """Generate a comprehensive summary of vitals analysis."""
    print("="*60)
    print("              VITALS ANALYSIS SUMMARY REPORT")
    print("="*60)
    print(f"Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Data source: {DATA_DIR}")
    
    # Basic statistics
    summary = vitals_table.get_summary_stats()
    print(f"\n📊 DATASET OVERVIEW:")
    print(f"  • Total measurements: {summary['total_records']:,}")
    print(f"  • Unique patients: {vitals_table.df['patient_id'].nunique():,}")
    print(f"  • Unique hospitalizations: {summary['unique_hospitalizations']:,}")
    print(f"  • Vital categories: {len(vital_categories)}")
    print(f"  • Date range: {summary['date_range']['earliest']} to {summary['date_range']['latest']}")
    
    # Data quality
    print(f"\n🔍 DATA QUALITY:")
    print(f"  • Validation passed: {vitals_table.isvalid()}")
    print(f"  • Schema errors: {len(vitals_table.errors)}")
    print(f"  • Range validation errors: {len(vitals_table.range_validation_errors)}")
    print(f"  • Missing values: {vitals_table.df.isnull().sum().sum():,} cells")
    print(f"  • Duplicate records: {vitals_table.df.duplicated().sum():,}")
    
    # Top vital categories
    print(f"\n📈 TOP VITAL CATEGORIES:")
    top_categories = pd.Series(summary['vital_category_counts']).nlargest(5)
    for vital, count in top_categories.items():
        percentage = (count / summary['total_records']) * 100
        print(f"  • {vital}: {count:,} ({percentage:.1f}%)")
    
    # Clinical insights
    print(f"\n🏥 CLINICAL INSIGHTS:")
    
    if 'heart_rate' in available_key_vitals:
        hr_data = vitals_table.filter_by_vital_category('heart_rate')
        hr_mean = hr_data['vital_value'].mean()
        hr_std = hr_data['vital_value'].std()
        print(f"  • Average heart rate: {hr_mean:.1f} ± {hr_std:.1f} bpm")
    
    if 'temp_c' in available_key_vitals:
        temp_data = vitals_table.filter_by_vital_category('temp_c')
        temp_mean = temp_data['vital_value'].mean()
        print(f"  • Average temperature: {temp_mean:.1f}°C")
    
    # Measurement frequency
    measurements_per_patient = vitals_table.df.groupby('patient_id').size()
    print(f"  • Avg measurements per patient: {measurements_per_patient.mean():.1f}")
    print(f"  • Max measurements per patient: {measurements_per_patient.max():,}")
    
    print("\n" + "="*60)
    print("End of Report")
    print("="*60)

# Generate the summary report
generate_vitals_summary_report()

## Next Steps and Advanced Usage

This notebook demonstrated:
- Comprehensive vital signs data exploration
- Range validation and outlier detection
- Temporal pattern analysis
- Patient-level vital trends
- Correlation analysis between vitals
- Cohort creation based on vital characteristics
- Clinical insights and summary reporting

### Potential Extensions:
1. **Predictive Modeling**: Use vital trends to predict clinical outcomes
2. **Anomaly Detection**: Identify unusual vital sign patterns
3. **Severity Scoring**: Calculate clinical severity scores (SOFA, APACHE)
4. **Time-to-Event Analysis**: Analyze vital changes before critical events
5. **Multi-Modal Analysis**: Combine vitals with other CLIF tables
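
As a possible starting point for the anomaly detection extension, one simple option is a rolling z-score: each measurement is compared against the mean and standard deviation of the readings that precede it. This is a minimal sketch on synthetic heart rate values. The helper `rolling_zscore_flags`, the window of 5, and the threshold of 3 are illustrative assumptions, not part of pyCLIF and not clinically validated.

```python
import pandas as pd

def rolling_zscore_flags(values, window=5, threshold=3.0):
    """Flag points that deviate strongly from the preceding `window` readings."""
    # shift(1) excludes the current point from its own baseline.
    baseline = values.shift(1).rolling(window, min_periods=window)
    z = (values - baseline.mean()) / baseline.std()
    # The first `window` points have no full baseline (NaN z-score) and are not flagged.
    return z.abs() > threshold

# Synthetic heart rate series with one obvious spike (values are made up).
hr = pd.Series([80.0, 82.0, 79.0, 81.0, 80.0, 83.0, 150.0, 82.0, 81.0])
flags = rolling_zscore_flags(hr)
print(hr[flags])
```

On real data this would be applied separately per patient and vital category, after sorting by `recorded_dttm`, so that one patient's baseline does not leak into another's.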

### Clinical Applications:
- Early warning systems
- Quality improvement initiatives
- Research on physiological patterns
- Benchmarking and outcome analysis

### Explore Other Notebooks:
- `01_basic_usage.ipynb` - Basic pyCLIF usage
- `02_individual_tables.ipynb` - Individual table classes
- `03_data_validation.ipynb` - Data validation techniques
- `05_timezone_handling.ipynb` - Timezone conversion
- `06_data_filtering.ipynb` - Advanced filtering techniques