# Advanced Data Filtering and Querying

This notebook demonstrates advanced data filtering, querying, and subsetting techniques using pyCLIF for efficient analysis of large healthcare datasets.

## Overview

Healthcare datasets are often large and complex. Effective filtering enables:
- Memory-efficient data loading
- Focused analysis on specific patient populations
- Time-based subsetting for longitudinal studies
- Custom cohort creation for research questions
- Performance optimization for large datasets

## Setup and Imports

In [22]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Import pyCLIF components
from pyclif import CLIF
from pyclif.tables.vitals import vitals
from pyclif.tables.patient import patient
from pyclif.tables.hospitalization import hospitalization
from pyclif.utils.io import load_data

print(f"Data filtering environment setup complete!")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")

Data filtering environment setup complete!
Python version: 3.10.9 (main, Mar  1 2023, 12:20:14) [Clang 14.0.6 ]
Pandas version: 2.3.0


In [23]:
# Set data directory
DATA_DIR = "../src/pyclif/data/clif_demo/"
print(f"Data directory: {DATA_DIR}")

Data directory: ../src/pyclif/data/clif_demo/


## Basic Filtering During Data Loading

Filtering while loading is usually the most efficient approach: the resulting DataFrame is smaller from the start, which reduces memory usage and speeds up every later step.

### Column Selection

In [24]:
# Load only specific columns for memory efficiency
essential_vitals_columns = [
    'hospitalization_id', 
    'vital_category', 
    'vital_value', 
    'recorded_dttm'
]

vitals_subset = load_data(
    table_name="vitals",
    table_path=DATA_DIR,
    table_format_type="parquet",
    columns=essential_vitals_columns,
    site_tz="US/Eastern"
)

print(f"=== COLUMN SELECTION FILTERING ===")
print(f"Selected columns: {essential_vitals_columns}")
print(f"Loaded data shape: {vitals_subset.shape}")
print(f"Actual columns: {list(vitals_subset.columns)}")
print(f"Memory usage: {vitals_subset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (US/Eastern).
recorded_dttm: null count after conversion= 0
=== COLUMN SELECTION FILTERING ===
Selected columns: ['hospitalization_id', 'vital_category', 'vital_value', 'recorded_dttm']
Loaded data shape: (89085, 4)
Actual columns: ['hospitalization_id', 'vital_category', 'vital_value', 'recorded_dttm']
Memory usage: 12.27 MB
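
Column selection reduces the width of the table. A complementary step, sketched below with synthetic data, is to store low-cardinality string columns such as `vital_category` as pandas `category` dtype. This is plain pandas rather than a pyCLIF feature, and the toy frame, its values, and the `memory_mb` helper are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a vitals table (illustrative values, not the CLIF demo data)
rng = np.random.default_rng(0)
categories = ["heart_rate", "sbp", "dbp", "spo2", "respiratory_rate"]
toy_vitals = pd.DataFrame({
    "hospitalization_id": rng.integers(1, 50, size=10_000).astype(str),
    "vital_category": rng.choice(categories, size=10_000),
    "vital_value": rng.normal(90, 15, size=10_000),
})

def memory_mb(df: pd.DataFrame) -> float:
    """Deep memory footprint of a DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024**2

before_mb = memory_mb(toy_vitals)

# Each distinct string is stored once; rows keep only a small integer code
compact = toy_vitals.astype({
    "hospitalization_id": "category",
    "vital_category": "category",
})
after_mb = memory_mb(compact)

print(f"Before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
```

The saving depends on how often values repeat, so it is worth measuring on the real table before adopting the conversion.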


### Value-Based Filtering

In [25]:
# Filter by specific vital categories during loading
cardiac_vitals = load_data(
    table_name="vitals",
    table_path=DATA_DIR,
    table_format_type="parquet",
    columns=essential_vitals_columns,
    filters={'vital_category': ['heart_rate', 'sbp', 'dbp']},  # Only cardiac vitals
    site_tz="US/Eastern"
)

print(f"=== VALUE-BASED FILTERING ===")
print(f"Filter: cardiac vitals only")
print(f"Loaded data shape: {cardiac_vitals.shape}")
if 'vital_category' in cardiac_vitals.columns:
    print(f"Unique vital categories: {cardiac_vitals['vital_category'].unique()}")
    print(f"Category counts:")
    print(cardiac_vitals['vital_category'].value_counts())

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (US/Eastern).
recorded_dttm: null count after conversion= 0
=== VALUE-BASED FILTERING ===
Filter: cardiac vitals only
Loaded data shape: (42620, 4)
Unique vital categories: ['sbp' 'heart_rate' 'dbp']
Category counts:
vital_category
sbp           14356
dbp           14351
heart_rate    13913
Name: count, dtype: int64


In [26]:
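# Combine a category filter with a row limit
# Note: the demo data labels oxygen saturation 'spo2' (see the category list
# printed later in this notebook), so 'oxygen_saturation' matches no rows here;
# use 'spo2' to include oxygen saturation readings
respiratory_vitals = load_data(
    table_name="vitals",
    table_path=DATA_DIR,
    table_format_type="parquet",
    columns=essential_vitals_columns,
    filters={
        'vital_category': ['respiratory_rate', 'oxygen_saturation'],  # Respiratory vitals
    },
    sample_size=2000,  # Limit sample size
    site_tz="US/Eastern"
)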

print(f"=== MULTI-CRITERIA FILTERING ===")
print(f"Filter: respiratory vitals with sample limit")
print(f"Loaded data shape: {respiratory_vitals.shape}")
if 'vital_category' in respiratory_vitals.columns:
    print(f"Unique vital categories: {respiratory_vitals['vital_category'].unique()}")

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (US/Eastern).
recorded_dttm: null count after conversion= 0
=== MULTI-CRITERIA FILTERING ===
Filter: respiratory vitals with sample limit
Loaded data shape: (2000, 4)
Unique vital categories: ['respiratory_rate']
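
Only `respiratory_rate` came back in the output above even though two categories were requested. A filter value that matches nothing fails silently, so a small check can flag it before loading. The sketch below uses plain Python and pandas; `unmatched_filter_values` is a hypothetical helper, not part of pyCLIF, and the category names are copied from the category listing printed later in this notebook.

```python
import pandas as pd

def unmatched_filter_values(requested, available):
    """Return requested filter values that do not occur in the data, in request order."""
    available_set = set(available)
    return [value for value in requested if value not in available_set]

# Category names as listed for this notebook's demo data
available_categories = pd.Series([
    "spo2", "map", "sbp", "heart_rate", "dbp",
    "respiratory_rate", "weight_kg", "height_cm", "temp_c",
])

requested = ["respiratory_rate", "oxygen_saturation"]
missing = unmatched_filter_values(requested, available_categories.unique())
if missing:
    print(f"Filter values with no matching rows: {missing}")
```

Running a check like this before calling `load_data()` turns a quiet mismatch into a visible message.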


## Post-Loading Filtering with Table Methods

Once a table is loaded, its built-in methods (`filter_by_vital_category`, `filter_by_hospitalization`, `filter_by_date_range`) handle the most common filters.

In [27]:
# Load a full vitals table for demonstration
vitals_table = vitals.from_file(DATA_DIR, "parquet")

print(f"=== FULL VITALS TABLE ===")
print(f"Total records: {len(vitals_table.df):,}")
print(f"Unique patients: {vitals_table.df['hospitalization_id'].nunique():,}")
print(f"Vital categories: {len(vitals_table.get_vital_categories())}")

# Get date range
if 'recorded_dttm' in vitals_table.df.columns:
    date_min = vitals_table.df['recorded_dttm'].min()
    date_max = vitals_table.df['recorded_dttm'].max()
    print(f"Date range: {date_min} to {date_max}")

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
recorded_dttm: null count after conversion= 0
Validation completed with 5 error(s).
  - 5 range validation error(s)
See `errors` and `range_validation_errors` attributes for details.
=== FULL VITALS TABLE ===
Total records: 89,085
Unique patients: 128
Vital categories: 9
Date range: 2110-04-11 20:52:00+00:00 to 2201-12-13 23:00:00+00:00


### Filter by Vital Category

In [28]:
# Filter by specific vital category
available_vitals = vitals_table.get_vital_categories()
print(f"Available vital categories: {available_vitals[:10]}...")  # Show first 10

# Focus on heart rate if available
if 'heart_rate' in available_vitals:
    hr_data = vitals_table.filter_by_vital_category('heart_rate')
    
    print(f"\n=== HEART RATE FILTERING ===")
    print(f"Heart rate records: {len(hr_data):,}")
    print(f"Unique hospitalization with HR data: {hr_data['hospitalization_id'].nunique():,}")
    
    if 'vital_value' in hr_data.columns:
        print(f"HR range: {hr_data['vital_value'].min():.1f} - {hr_data['vital_value'].max():.1f} bpm")
        print(f"HR mean ± std: {hr_data['vital_value'].mean():.1f} ± {hr_data['vital_value'].std():.1f} bpm")
else:
    print("Heart rate data not available, using first available vital category")
    if available_vitals:
        first_vital = available_vitals[0]
        vital_data = vitals_table.filter_by_vital_category(first_vital)
        print(f"\n=== {first_vital.upper()} FILTERING ===")
        print(f"{first_vital} records: {len(vital_data):,}")

Available vital categories: ['spo2', 'map', 'sbp', 'heart_rate', 'dbp', 'respiratory_rate', 'weight_kg', 'height_cm', 'temp_c']...

=== HEART RATE FILTERING ===
Heart rate records: 13,913
Unique hospitalization with HR data: 128
HR range: 0.0 - 200.0 bpm
HR mean ± std: 91.1 ± 18.7 bpm


### Filter by Hospitalization

In [29]:
# Get top hospitalizations by measurement count
hosp_counts = vitals_table.df['hospitalization_id'].value_counts()
print(f"=== HOSPITALIZATION FILTERING ===")
print(f"Total hospitalizations: {len(hosp_counts):,}")
print(f"Top 5 hospitalizations by measurement count:")

for i, (hosp_id, count) in enumerate(hosp_counts.head(5).items()):
    print(f"  {i+1}. {hosp_id}: {count:,} measurements")

# Filter by specific hospitalization
if len(hosp_counts) > 0:
    top_hosp_id = hosp_counts.index[0]
    hosp_data = vitals_table.filter_by_hospitalization(top_hosp_id)
    
    print(f"\nAnalysis of hospitalization {top_hosp_id}:")
    print(f"  Total measurements: {len(hosp_data):,}")
    print(f"  Vital categories: {hosp_data['vital_category'].nunique()}")
    
    if 'recorded_dttm' in hosp_data.columns:
        duration = hosp_data['recorded_dttm'].max() - hosp_data['recorded_dttm'].min()
        print(f"  Duration: {duration.days} days, {duration.seconds//3600} hours")
    
    # Show vital category breakdown
    print(f"  Vital breakdown:")
    vital_breakdown = hosp_data['vital_category'].value_counts().head(5)
    for vital, count in vital_breakdown.items():
        print(f"    {vital}: {count:,}")

=== HOSPITALIZATION FILTERING ===
Total hospitalizations: 128
Top 5 hospitalizations by measurement count:
  1. 28258130: 4,377 measurements
  2. 22987108: 3,439 measurements
  3. 23831430: 3,202 measurements
  4. 23559586: 2,934 measurements
  5. 28661809: 2,757 measurements

Analysis of hospitalization 28258130:
  Total measurements: 4,377
  Vital categories: 9
  Duration: 16 days, 4 hours
  Vital breakdown:
    map: 751
    dbp: 749
    sbp: 747
    heart_rate: 652
    respiratory_rate: 648


### Filter by Date Range

In [30]:
# Filter by date range
if 'recorded_dttm' in vitals_table.df.columns:
    # Get overall date range
    full_date_range = {
        'start': vitals_table.df['recorded_dttm'].min(),
        'end': vitals_table.df['recorded_dttm'].max()
    }
    
    print(f"=== DATE RANGE FILTERING ===")
    print(f"Full dataset range: {full_date_range['start']} to {full_date_range['end']}")
    total_days = (full_date_range['end'] - full_date_range['start']).days
    print(f"Total span: {total_days} days")
    
    # Filter to recent data (last 30 days of available data)
    recent_start = full_date_range['end'] - timedelta(days=30)
    recent_data = vitals_table.filter_by_date_range(recent_start, full_date_range['end'])
    
    print(f"\nRecent data (last 30 days):")
    print(f"  Date range: {recent_start} to {full_date_range['end']}")
    print(f"  Records: {len(recent_data):,}")
    print(f"  Patients: {recent_data['hospitalization_id'].nunique():,}")
    
    # Filter to middle period
    mid_start = full_date_range['start'] + timedelta(days=total_days//3)
    mid_end = full_date_range['start'] + timedelta(days=2*total_days//3)
    mid_data = vitals_table.filter_by_date_range(mid_start, mid_end)
    
    print(f"\nMiddle period data:")
    print(f"  Date range: {mid_start} to {mid_end}")
    print(f"  Records: {len(mid_data):,}")
    print(f"  Patients: {mid_data['hospitalization_id'].nunique():,}")
else:
    print("No datetime column available for date filtering")

=== DATE RANGE FILTERING ===
Full dataset range: 2110-04-11 20:52:00+00:00 to 2201-12-13 23:00:00+00:00
Total span: 33483 days

Recent data (last 30 days):
  Date range: 2201-11-13 23:00:00+00:00 to 2201-12-13 23:00:00+00:00
  Records: 324
  Patients: 1

Middle period data:
  Date range: 2140-10-31 20:52:00+00:00 to 2171-05-23 20:52:00+00:00
  Records: 28,329
  Patients: 43
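
The table filtered above carries UTC timestamps (note the `+00:00` offsets), so the bounds of a date filter should be timezone-aware as well. The following is a minimal sketch in plain pandas with toy timestamps. `filter_by_window` is a hypothetical helper and does not claim to mirror how `filter_by_date_range` is implemented; the half-open interval is a choice made for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "hospitalization_id": ["a", "a", "b", "b"],
    "recorded_dttm": pd.to_datetime(
        ["2110-04-11 20:00", "2110-04-12 03:00",
         "2110-04-12 12:00", "2110-04-13 01:00"],
        utc=True,
    ),
})

def filter_by_window(df, start, end, column="recorded_dttm"):
    """Keep rows with start <= column < end; both bounds must be timezone-aware."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware timestamps")
    mask = (df[column] >= start) & (df[column] < end)
    return df.loc[mask]

# One calendar day expressed in UTC
utc_start = pd.Timestamp("2110-04-12", tz="UTC")
utc_day = filter_by_window(toy, utc_start, utc_start + pd.Timedelta(days=1))

# The same calendar date in US/Eastern starts several hours later in UTC,
# so a different set of rows falls inside the window
local_start = pd.Timestamp("2110-04-12", tz="US/Eastern")
local_day = filter_by_window(toy, local_start, local_start + pd.Timedelta(days=1))

print(utc_day["hospitalization_id"].tolist())
print(local_day["hospitalization_id"].tolist())
```

The two windows select different rows for the same calendar date, which is why the timezone of the bounds belongs in the documented filter criteria.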


## Advanced Filtering Techniques

### Complex Multi-Condition Filtering

In [31]:
# Create complex filters using pandas operations
def apply_complex_filters(df, conditions):
    """Apply multiple filtering conditions to a DataFrame."""
    filtered_df = df.copy()
    
    print(f"=== COMPLEX FILTERING ===")
    print(f"Starting records: {len(filtered_df):,}")
    
    for i, (description, condition) in enumerate(conditions):
        before_count = len(filtered_df)
        filtered_df = filtered_df[condition(filtered_df)]
        after_count = len(filtered_df)
        removed = before_count - after_count
        
        print(f"  {i+1}. {description}:")
        print(f"     Removed: {removed:,} records")
        print(f"     Remaining: {after_count:,} records")
    
    print(f"\nFinal dataset: {len(filtered_df):,} records")
    return filtered_df

# Define complex filtering conditions
if 'heart_rate' in available_vitals:
    complex_conditions = [
        ("Heart rate data only", 
         lambda df: df['vital_category'] == 'heart_rate'),
        ("Valid heart rate range (30-200 bpm)", 
         lambda df: (df['vital_value'] >= 30) & (df['vital_value'] <= 200)),
        ("Recent data (last 90 days)", 
         lambda df: df['recorded_dttm'] >= (df['recorded_dttm'].max() - timedelta(days=90)))
    ]
    
    filtered_hr_data = apply_complex_filters(vitals_table.df, complex_conditions)
    
    # Analyze filtered results
    print(f"\n=== FILTERED RESULTS ANALYSIS ===")
    print(f"Unique hospitalizations: {filtered_hr_data['hospitalization_id'].nunique():,}")
    print(f"HR statistics:")
    print(f"  Mean: {filtered_hr_data['vital_value'].mean():.1f} bpm")
    print(f"  Std: {filtered_hr_data['vital_value'].std():.1f} bpm")
    print(f"  Range: {filtered_hr_data['vital_value'].min():.1f} - {filtered_hr_data['vital_value'].max():.1f} bpm")
else:
    print("Heart rate data not available for complex filtering demonstration")

=== COMPLEX FILTERING ===
Starting records: 89,085
  1. Heart rate data only:
     Removed: 75,172 records
     Remaining: 13,913 records
  2. Valid heart rate range (30-200 bpm):
     Removed: 5 records
     Remaining: 13,908 records
  3. Recent data (last 90 days):
     Removed: 13,535 records
     Remaining: 373 records

Final dataset: 373 records

=== FILTERED RESULTS ANALYSIS ===
Unique hospitalizations: 2
HR statistics:
  Mean: 96.1 bpm
  Std: 11.0 bpm
  Range: 71.0 - 138.0 bpm
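
The step-by-step counts printed above are the same information a cohort attrition table records. The sketch below returns those counts as a DataFrame so they can be saved with the analysis. `filter_with_log` is a hypothetical helper written for this example, and the toy values are illustrative.

```python
import pandas as pd

def filter_with_log(df, steps):
    """Apply (description, mask_function) steps in order and log the row counts."""
    log = [{"step": "start", "remaining": len(df), "removed": 0}]
    current = df
    for description, make_mask in steps:
        kept = current[make_mask(current)]
        log.append({
            "step": description,
            "remaining": len(kept),
            "removed": len(current) - len(kept),
        })
        current = kept
    return current, pd.DataFrame(log)

toy = pd.DataFrame({
    "vital_category": ["heart_rate", "heart_rate", "heart_rate", "sbp", "spo2"],
    "vital_value": [72.0, 0.0, 250.0, 118.0, 97.0],
})

steps = [
    ("heart rate only", lambda d: d["vital_category"] == "heart_rate"),
    ("30-200 bpm", lambda d: d["vital_value"].between(30, 200)),
]

cohort, attrition = filter_with_log(toy, steps)
print(attrition.to_string(index=False))
```

Writing `attrition` to a file next to the results is one simple way to document filter criteria for reproducibility.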


### Statistical Filtering and Outlier Removal

In [32]:
# Statistical filtering to remove outliers
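def statistical_filtering(df, vital_category, method='iqr', factor=1.5):
    """Remove outliers from one vital category.

    `factor` scales the IQR for method='iqr' and is the number of standard
    deviations for method='zscore'; method='percentile' ignores it and keeps
    the central 95% of values.
    """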
    
    vital_data = df[df['vital_category'] == vital_category].copy()
    
    if len(vital_data) == 0:
        print(f"No data available for {vital_category}")
        return pd.DataFrame()
    
    original_count = len(vital_data)
    
    print(f"=== STATISTICAL FILTERING: {vital_category.upper()} ===")
    print(f"Original records: {original_count:,}")
    print(f"Method: {method.upper()}")
    
    if method == 'iqr':
        # Interquartile Range method
        Q1 = vital_data['vital_value'].quantile(0.25)
        Q3 = vital_data['vital_value'].quantile(0.75)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        
        print(f"IQR bounds: {lower_bound:.1f} - {upper_bound:.1f}")
        
    elif method == 'zscore':
        # Z-score method
        mean_val = vital_data['vital_value'].mean()
        std_val = vital_data['vital_value'].std()
        
        lower_bound = mean_val - factor * std_val
        upper_bound = mean_val + factor * std_val
        
        print(f"Z-score bounds: {lower_bound:.1f} - {upper_bound:.1f}")
    
    elif method == 'percentile':
        # Percentile method
        lower_percentile = (100 - 95) / 2  # 2.5th percentile
        upper_percentile = 100 - lower_percentile  # 97.5th percentile
        
        lower_bound = vital_data['vital_value'].quantile(lower_percentile / 100)
        upper_bound = vital_data['vital_value'].quantile(upper_percentile / 100)
        
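        print(f"Percentile bounds: {lower_bound:.1f} - {upper_bound:.1f}")
    
    else:
        # Without this branch an unknown method fails later with a NameError on lower_bound
        raise ValueError(f"Unknown method {method!r}; expected 'iqr', 'zscore', or 'percentile'")
    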
    # Apply filtering
    filtered_data = vital_data[
        (vital_data['vital_value'] >= lower_bound) & 
        (vital_data['vital_value'] <= upper_bound)
    ]
    
    filtered_count = len(filtered_data)
    removed_count = original_count - filtered_count
    
    print(f"Removed outliers: {removed_count:,} ({removed_count/original_count*100:.1f}%)")
    print(f"Remaining records: {filtered_count:,}")
    
    # Statistics comparison
    print(f"\nBefore filtering:")
    print(f"  Mean ± Std: {vital_data['vital_value'].mean():.1f} ± {vital_data['vital_value'].std():.1f}")
    print(f"  Range: {vital_data['vital_value'].min():.1f} - {vital_data['vital_value'].max():.1f}")
    
    print(f"After filtering:")
    print(f"  Mean ± Std: {filtered_data['vital_value'].mean():.1f} ± {filtered_data['vital_value'].std():.1f}")
    print(f"  Range: {filtered_data['vital_value'].min():.1f} - {filtered_data['vital_value'].max():.1f}")
    
    return filtered_data

# Apply statistical filtering to available vital
if available_vitals:
    test_vital = available_vitals[0]  # Use first available vital
    
    # Test different methods
    for method in ['iqr', 'zscore', 'percentile']:
        filtered_result = statistical_filtering(vitals_table.df, test_vital, method=method)
        print("\n" + "="*50 + "\n")
else:
    print("No vital categories available for statistical filtering")

=== STATISTICAL FILTERING: SPO2 ===
Original records: 13,540
Method: IQR
IQR bounds: 89.0 - 105.0
Removed outliers: 111 (0.8%)
Remaining records: 13,429

Before filtering:
  Mean ± Std: 96.8 ± 3.0
  Range: 29.0 - 100.0
After filtering:
  Mean ± Std: 96.9 ± 2.5
  Range: 89.0 - 100.0


=== STATISTICAL FILTERING: SPO2 ===
Original records: 13,540
Method: ZSCORE
Z-score bounds: 92.4 - 101.2
Removed outliers: 791 (5.8%)
Remaining records: 12,749

Before filtering:
  Mean ± Std: 96.8 ± 3.0
  Range: 29.0 - 100.0
After filtering:
  Mean ± Std: 97.2 ± 2.1
  Range: 93.0 - 100.0


=== STATISTICAL FILTERING: SPO2 ===
Original records: 13,540
Method: PERCENTILE
Percentile bounds: 91.0 - 100.0
Removed outliers: 281 (2.1%)
Remaining records: 13,259

Before filtering:
  Mean ± Std: 96.8 ± 3.0
  Range: 29.0 - 100.0
After filtering:
  Mean ± Std: 97.0 ± 2.3
  Range: 91.0 - 100.0
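
The function above handles one vital category per call. Assuming the same IQR rule, the fences for every category can be computed in a single pass with `groupby(...).transform`, as in the sketch below. `iqr_keep_mask` is a hypothetical helper and the toy values are illustrative.

```python
import pandas as pd

def iqr_keep_mask(df, group_col="vital_category", value_col="vital_value", factor=1.5):
    """Boolean mask: True where a value lies within its own group's IQR fences."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return df[value_col].between(q1 - factor * iqr, q3 + factor * iqr)

toy = pd.DataFrame({
    "vital_category": ["spo2"] * 6 + ["heart_rate"] * 6,
    "vital_value": [95.0, 96.0, 97.0, 98.0, 99.0, 29.0,
                    70.0, 75.0, 80.0, 85.0, 90.0, 200.0],
})

mask = iqr_keep_mask(toy)
cleaned = toy[mask]
print(cleaned.groupby("vital_category")["vital_value"].agg(["count", "min", "max"]))
```

Because the fences are computed per group, a value that is normal for one vital cannot be removed by the spread of another.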




## Performance-Optimized Filtering

In [33]:
# Demonstrate performance considerations
import time

def benchmark_filtering_methods(df, sample_size=10000):
    """Benchmark different filtering approaches for performance."""
    
    # Use a sample for benchmarking
    if len(df) > sample_size:
        df_sample = df.sample(n=sample_size, random_state=42)
    else:
        df_sample = df
    
    print(f"=== FILTERING PERFORMANCE BENCHMARK ===")
    print(f"Sample size: {len(df_sample):,} records")
    print()
    
    results = {}
    
    # Method 1: Basic pandas filtering
    start_time = time.time()
    if 'vital_category' in df_sample.columns:
        method1_result = df_sample[df_sample['vital_category'].isin(['heart_rate', 'sbp'])]
    else:
        method1_result = df_sample.head(100)  # Fallback
    method1_time = time.time() - start_time
    results['pandas_isin'] = {'time': method1_time, 'records': len(method1_result)}
    
    # Method 2: Query method
    start_time = time.time()
    if 'vital_category' in df_sample.columns:
        method2_result = df_sample.query("vital_category in ['heart_rate', 'sbp']")
    else:
        method2_result = df_sample.head(100)  # Fallback
    method2_time = time.time() - start_time
    results['pandas_query'] = {'time': method2_time, 'records': len(method2_result)}
    
    # Method 3: Boolean indexing
    start_time = time.time()
    if 'vital_category' in df_sample.columns:
        mask = (df_sample['vital_category'] == 'heart_rate') | (df_sample['vital_category'] == 'sbp')
        method3_result = df_sample[mask]
    else:
        method3_result = df_sample.head(100)  # Fallback
    method3_time = time.time() - start_time
    results['boolean_indexing'] = {'time': method3_time, 'records': len(method3_result)}
    
    # Report results
    print("Performance comparison:")
    for method, result in results.items():
        print(f"  {method:<18}: {result['time']*1000:>6.2f} ms ({result['records']:,} records)")
    
    # Determine fastest method
    fastest_method = min(results.keys(), key=lambda x: results[x]['time'])
    print(f"\n🏆 Fastest method: {fastest_method}")
    
    return results

# Run performance benchmark
if len(vitals_table.df) > 0:
    perf_results = benchmark_filtering_methods(vitals_table.df)
else:
    print("Insufficient data for performance benchmarking")

=== FILTERING PERFORMANCE BENCHMARK ===
Sample size: 10,000 records

Performance comparison:
  pandas_isin       :   1.30 ms (3,183 records)
  pandas_query      :   1.39 ms (3,183 records)
  boolean_indexing  :   0.95 ms (3,183 records)

🏆 Fastest method: boolean_indexing
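
A single `time.time()` measurement in the millisecond range is mostly noise, so the ranking above may change from run to run. The sketch below repeats each measurement with the standard library's `timeit.repeat` and keeps the minimum. The synthetic data, repeat counts, and method labels are assumptions for illustration.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "vital_category": rng.choice(
        ["heart_rate", "sbp", "dbp", "spo2", "map"], size=10_000
    ),
})

candidates = {
    "isin": lambda: toy[toy["vital_category"].isin(["heart_rate", "sbp"])],
    "query": lambda: toy.query("vital_category in ['heart_rate', 'sbp']"),
    "or_mask": lambda: toy[
        (toy["vital_category"] == "heart_rate") | (toy["vital_category"] == "sbp")
    ],
}

timings_ms = {}
for name, run in candidates.items():
    # 5 repeats of 20 calls each; the minimum is the least noisy estimate
    per_repeat = timeit.repeat(run, number=20, repeat=5)
    timings_ms[name] = min(per_repeat) / 20 * 1000

for name, ms in sorted(timings_ms.items(), key=lambda item: item[1]):
    print(f"{name:<8}: {ms:.3f} ms per call")

# All three approaches should select the same number of rows
row_counts = {name: len(run()) for name, run in candidates.items()}
```

Differences of a fraction of a millisecond rarely matter in practice; readability is usually the better reason to prefer one form over another.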


## Memory-Efficient Filtering Strategies

In [34]:
# Memory-efficient filtering for large datasets
def memory_efficient_analysis(data_dir, chunk_size=1000):
    """Demonstrate memory-efficient processing of large datasets."""
    
    print(f"=== MEMORY-EFFICIENT FILTERING ===")
    print(f"Strategy: Load data in chunks of {chunk_size:,} records")
    print()
    
    # Load data in chunks with filters
    chunk_results = []
    
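    try:
        # Note: each iteration calls load_data() again, and chunks 2 and 3 use
        # identical arguments, so this shows the load/process/release pattern
        # rather than a walk through distinct, non-overlapping slices of the file
        for i in range(3):  # Process 3 chunks as demonstration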
            chunk_data = load_data(
                table_name="vitals",
                table_path=data_dir,
                table_format_type="parquet",
                columns=['hospitalization_id', 'vital_category', 'vital_value'],
                sample_size=chunk_size,
                filters={'vital_category': ['heart_rate']} if i == 0 else None
            )
            
            if not chunk_data.empty:
                # Process chunk
                chunk_summary = {
                    'chunk_id': i,
                    'records': len(chunk_data),
                    'patients': chunk_data['hospitalization_id'].nunique() if 'hospitalization_id' in chunk_data.columns else 0,
                    'memory_mb': chunk_data.memory_usage(deep=True).sum() / 1024**2
                }
                
                if 'vital_category' in chunk_data.columns:
                    chunk_summary['vital_categories'] = chunk_data['vital_category'].nunique()
                
                chunk_results.append(chunk_summary)
                
                print(f"Chunk {i+1}:")
                print(f"  Records: {chunk_summary['records']:,}")
                print(f"  Patients: {chunk_summary['patients']:,}")
                print(f"  Memory: {chunk_summary['memory_mb']:.2f} MB")
                if 'vital_categories' in chunk_summary:
                    print(f"  Vital categories: {chunk_summary['vital_categories']}")
                
                # Clear chunk from memory
                del chunk_data
                print(f"  ✅ Chunk processed and cleared from memory")
                print()
    
    except Exception as e:
        print(f"Error in chunk processing: {e}")
    
    # Summary
    if chunk_results:
        total_records = sum(chunk['records'] for chunk in chunk_results)
        total_memory = sum(chunk['memory_mb'] for chunk in chunk_results)
        
        print(f"CHUNK PROCESSING SUMMARY:")
        print(f"  Total chunks: {len(chunk_results)}")
        print(f"  Total records processed: {total_records:,}")
        print(f"  Peak memory usage: {max(chunk['memory_mb'] for chunk in chunk_results):.2f} MB")
        print(f"  💡 Memory efficient: Process large datasets without loading all data at once")
    
    return chunk_results

# Demonstrate memory-efficient processing
memory_analysis = memory_efficient_analysis(DATA_DIR, chunk_size=500)

=== MEMORY-EFFICIENT FILTERING ===
Strategy: Load data in chunks of 500 records

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
Chunk 1:
  Records: 500
  Patients: 5
  Memory: 0.07 MB
  Vital categories: 1
  ✅ Chunk processed and cleared from memory

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
Chunk 2:
  Records: 500
  Patients: 1
  Memory: 0.07 MB
  Vital categories: 7
  ✅ Chunk processed and cleared from memory

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
Chunk 3:
  Records: 500
  Patients: 1
  Memory: 0.07 MB
  Vital categories: 7
  ✅ Chunk processed and cleared from memory

CHUNK PROCESSING SUMMARY:
  Total chunks: 3
  Total records processed: 1,500
  Peak memory usage: 0.07 MB
  💡 Memory efficient: Process large datasets without loading all data at once
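
The loop above calls `load_data()` with a row limit on every pass, so it illustrates the load, process, and release pattern rather than a pass over distinct slices of the file. The sketch below streams genuinely separate chunks and keeps only running aggregates. It uses `pd.read_csv(chunksize=...)` on an in-memory CSV because that needs nothing beyond pandas; for Parquet files, pyarrow's `ParquetFile.iter_batches` plays a comparable role. The data is synthetic and `streaming_category_stats` is a hypothetical helper.

```python
import io

import pandas as pd

def streaming_category_stats(chunks, keep_categories):
    """Accumulate per-category count and mean across an iterable of DataFrame chunks."""
    totals = {}
    for chunk in chunks:
        # Filter each chunk before aggregating so discarded rows never accumulate
        subset = chunk[chunk["vital_category"].isin(keep_categories)]
        grouped = subset.groupby("vital_category")["vital_value"].agg(["count", "sum"])
        for category, row in grouped.iterrows():
            count, total = totals.get(category, (0, 0.0))
            totals[category] = (count + int(row["count"]), total + float(row["sum"]))
    return {
        category: {"count": count, "mean": total / count}
        for category, (count, total) in totals.items()
    }

csv_text = "vital_category,vital_value\n" + "\n".join(
    f"{category},{value}"
    for category, value in [
        ("heart_rate", 80), ("sbp", 120), ("heart_rate", 90),
        ("spo2", 97), ("heart_rate", 100), ("sbp", 110),
    ]
)

# chunksize=2 yields three DataFrames of two rows each instead of one large frame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
stats = streaming_category_stats(reader, keep_categories=["heart_rate", "sbp"])
print(stats)
```

Only the small `totals` dictionary persists between chunks, so peak memory is bounded by the chunk size rather than by the size of the file.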


## Filtering Best Practices Summary

In [35]:
# Comprehensive filtering best practices
def filtering_best_practices_summary():
    print("=" * 60)
    print("          FILTERING BEST PRACTICES SUMMARY")
    print("=" * 60)
    print()
    
    print("🚀 PERFORMANCE OPTIMIZATION:")
    print("  1. Filter during data loading when possible")
    print("     • Use 'filters' parameter in load_data()")
    print("     • Select only needed columns")
    print("     • Use sample_size for testing")
    print()
    
    print("💾 MEMORY MANAGEMENT:")
    print("  2. Process large datasets in chunks")
    print("     • Load, process, and clear chunks iteratively")
    print("     • Use generators for streaming processing")
    print("     • Monitor memory usage during processing")
    print()
    
    print("🎯 FILTERING STRATEGIES:")
    print("  3. Layer your filters from general to specific")
    print("     • Start with broad categorical filters")
    print("     • Apply temporal filters")
    print("     • Finish with value-based filters")
    print()
    
    print("📊 DATA QUALITY:")
    print("  4. Always validate filtered results")
    print("     • Check record counts make sense")
    print("     • Verify patient/hospitalization counts")
    print("     • Review statistical summaries")
    print()
    
    print("🔧 IMPLEMENTATION TIPS:")
    print("  5. Use appropriate pandas methods")
    print("     • .isin() for multiple values")
    print("     • .query() for complex conditions")
    print("     • Boolean indexing for simple conditions")
    print()
    
    print("⚠️  COMMON PITFALLS TO AVOID:")
    print("  • Loading entire dataset before filtering")
    print("  • Using loops instead of vectorized operations")
    print("  • Not considering timezone effects in date filters")
    print("  • Forgetting to validate filter results")
    print("  • Not documenting filter criteria for reproducibility")
    print()
    
    print("🎯 YOUR RECOMMENDED WORKFLOW:")
    print("  1. Define your research question and required data")
    print("  2. Identify optimal filters for data loading")
    print("  3. Load data with initial filters applied")
    print("  4. Apply additional post-loading filters as needed")
    print("  5. Validate and document your filtering decisions")
    print("  6. Create reusable filter functions for consistency")
    print()
    print("=" * 60)

filtering_best_practices_summary()

          FILTERING BEST PRACTICES SUMMARY

🚀 PERFORMANCE OPTIMIZATION:
  1. Filter during data loading when possible
     • Use 'filters' parameter in load_data()
     • Select only needed columns
     • Use sample_size for testing

💾 MEMORY MANAGEMENT:
  2. Process large datasets in chunks
     • Load, process, and clear chunks iteratively
     • Use generators for streaming processing
     • Monitor memory usage during processing

🎯 FILTERING STRATEGIES:
  3. Layer your filters from general to specific
     • Start with broad categorical filters
     • Apply temporal filters
     • Finish with value-based filters

📊 DATA QUALITY:
  4. Always validate filtered results
     • Check record counts make sense
     • Verify patient/hospitalization counts
     • Review statistical summaries

🔧 IMPLEMENTATION TIPS:
  5. Use appropriate pandas methods
     • .isin() for multiple values
     • .query() for complex conditions
     • Boolean indexing for simple conditions

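⚠️  COMMON PITFALLS TO AVOID:
  • Loading entire dataset before filtering
  • Using loops instead of vectorized operations
  • Not considering timezone effects in date filters
  • Forgetting to validate filter results
  • Not documenting filter criteria for reproducibility

🎯 YOUR RECOMMENDED WORKFLOW:
  1. Define your research question and required data
  2. Identify optimal filters for data loading
  3. Load data with initial filters applied
  4. Apply additional post-loading filters as needed
  5. Validate and document your filtering decisions
  6. Create reusable filter functions for consistency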

## Summary

This notebook demonstrated comprehensive data filtering and querying techniques:

### Key Filtering Methods:
1. **Load-time filtering** - Most efficient for large datasets
2. **Table method filtering** - Built-in methods for common operations
3. **Complex multi-condition filtering** - Advanced pandas operations
4. **Statistical filtering** - Outlier detection and removal
5. **Memory-efficient chunking** - For very large datasets

### Best Practices Applied:
- Filter early and often to reduce memory usage
- Layer filters from general to specific
- Always validate filtering results
- Document filtering decisions for reproducibility
- Consider performance implications of different methods

### Your Optimized Setup:
- **Data format**: Parquet (efficient for filtering)
- **Timezone**: US/Eastern when `site_tz` is passed to `load_data()`; the `vitals.from_file()` example ran in UTC
- **Recommended approach**: Use `load_data()` with filters parameter

### Next Steps:
- Apply these techniques to your specific research questions
- Create reusable filter functions for common analyses
- Combine filtering with other analysis techniques from previous notebooks

### Explore Other Notebooks:
- `01_basic_usage.ipynb` - Basic pyCLIF usage
- `02_individual_tables.ipynb` - Individual table classes
- `03_data_validation.ipynb` - Data validation techniques
- `04_vitals_analysis.ipynb` - Advanced vitals analysis
- `05_timezone_handling.ipynb` - Timezone conversion and management