# Working with Individual Table Classes

This notebook demonstrates how to work with individual CLIF table classes, which give you more flexibility and control over data loading and processing.

## Overview

Instead of using the main CLIF class, you can work directly with individual table classes. This approach gives you:
- More granular control over data loading
- Independent validation and processing
- The flexibility to load specific columns or apply filters
- Better memory management with large datasets

## Setup and Imports

In [1]:
import sys
import os
import pandas as pd
from datetime import datetime

# Import individual table classes
from pyclif.tables.patient import patient
from pyclif.tables.vitals import vitals
from pyclif.tables.hospitalization import hospitalization
from pyclif.tables.labs import labs
from pyclif.tables.adt import adt
from pyclif.tables.respiratory_support import respiratory_support

print("Individual table classes imported successfully!")
print(f"Python version: {sys.version}")

Individual table classes imported successfully!
Python version: 3.10.9 (main, Mar  1 2023, 12:20:14) [Clang 14.0.6 ]


## Method 1: Loading from Files

Each table class has a `from_file()` class method for loading data directly from files.
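
The loader output in this notebook reports file names such as `clif_patient.parquet`, which suggests that `table_path` is a directory and that the file name is derived from the table name and format. The sketch below illustrates that convention so you can check for a file before loading it. `resolve_table_path` is a hypothetical helper written for this example, not a pyclif function, and the naming rule is inferred from the log lines rather than from the library source.

```python
from pathlib import Path


def resolve_table_path(table_path: str, table_name: str, table_format_type: str) -> Path:
    """Build the expected file path for a CLIF table (hypothetical helper).

    Assumes the clif_<table>.<format> naming seen in the loader's log output.
    """
    return Path(table_path) / f"clif_{table_name}.{table_format_type}"


expected = resolve_table_path("../src/pyclif/data/clif_demo/", "patient", "parquet")
print(expected.name)      # file the loader is expected to look for
print(expected.exists())  # quick existence check before calling from_file()
```

Checking `exists()` first gives a clearer failure message than a loader exception when the data directory is misconfigured.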

In [2]:
# Set your data directory path
DATA_DIR = "../src/pyclif/data/clif_demo/"

print(f"Loading data from: {DATA_DIR}")

Loading data from: ../src/pyclif/data/clif_demo/


### Load Patient Table

In [3]:
# Load patient table using from_file class method
patient_table = patient.from_file(
    table_path=DATA_DIR,
    table_format_type="parquet"
)

print("Patient table loaded successfully!")
print(f"Shape: {patient_table.df.shape}")
print(f"Columns: {list(patient_table.df.columns)}")
print(f"Is valid: {patient_table.isvalid()}")

Loading clif_patient.parquet
Data loaded successfully from clif_patient.parquet
death_dttm: null count before conversion= 85
death_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
death_dttm: null count after conversion= 85
Validation completed with 2 error(s). See `errors` attribute.
Patient table loaded successfully!
Shape: (100, 11)
Columns: ['patient_id', 'race_name', 'race_category', 'ethnicity_name', 'ethnicity_category', 'sex_name', 'sex_category', 'birth_date', 'death_dttm', 'language_name', 'language_category']
Is valid: False


In [4]:
# Display sample patient data
print("Sample patient data:")
patient_table.df.head()

Sample patient data:


| | patient_id | race_name | race_category | ethnicity_name | ethnicity_category | sex_name | sex_category | birth_date | death_dttm | language_name | language_category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10002495 | UNKNOWN | Unknown | UNKNOWN | Unknown | M | Male | NaT | NaT | ENGLISH | Unknown or NA |
| 1 | 10012552 | UNKNOWN | Unknown | UNKNOWN | Unknown | M | Male | NaT | NaT | ENGLISH | Unknown or NA |
| 2 | 10015272 | WHITE | White | WHITE | Non-Hispanic | F | Female | NaT | NaT | ENGLISH | Unknown or NA |
| 3 | 10016810 | UNKNOWN | Unknown | UNKNOWN | Unknown | F | Female | NaT | NaT | ENGLISH | Unknown or NA |
| 4 | 10026406 | WHITE | White | WHITE | Non-Hispanic | M | Male | NaT | NaT | ENGLISH | Unknown or NA |


### Load Vitals Table

In [5]:
# Load vitals table
vitals_table = vitals.from_file(
    table_path=DATA_DIR,
    table_format_type="parquet"
)

print("Vitals table loaded successfully!")
print(f"Shape: {vitals_table.df.shape}")
print(f"Is valid: {vitals_table.isvalid()}")

# Show unique vital categories
vital_categories = vitals_table.get_vital_categories()
print(f"Vital categories: {vital_categories}")

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
recorded_dttm: null count after conversion= 0
Validation completed with 5 error(s).
  - 5 range validation error(s)
See `errors` and `range_validation_errors` attributes for details.
Vitals table loaded successfully!
Shape: (89085, 6)
Is valid: False
Vital categories: ['spo2', 'map', 'sbp', 'heart_rate', 'dbp', 'respiratory_rate', 'weight_kg', 'height_cm', 'temp_c']


### Load Hospitalization Table

In [7]:
# Load hospitalization table
hosp_table = hospitalization.from_file(
    table_path=DATA_DIR,
    table_format_type="parquet"
)

print("Hospitalization table loaded successfully!")
print(f"Shape: {hosp_table.df.shape}")
print(f"Is valid: {hosp_table.isvalid()}")

Loading clif_hospitalization.parquet
Data loaded successfully from clif_hospitalization.parquet
admission_dttm: null count before conversion= 0
admission_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
admission_dttm: null count after conversion= 0
discharge_dttm: null count before conversion= 0
discharge_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
discharge_dttm: null count after conversion= 0
Validation completed successfully.
Hospitalization table loaded successfully!
Shape: (275, 17)
Is valid: True


## Method 2: Loading with Custom Data

You can also initialize table classes with existing DataFrames for more control.

In [8]:
# Load data manually with custom parameters using the load_data function
from pyclif.utils.io import load_data

# Load vitals data with specific filters and timezone conversion
vitals_df = load_data(
    table_name="vitals",
    table_path=DATA_DIR,
    table_format_type="parquet",
    sample_size=1000,  # Load only first 1000 rows for demo
    site_tz="US/Eastern"  # Apply timezone conversion
)

print(f"Custom vitals data loaded: {vitals_df.shape}")
print(f"Columns: {list(vitals_df.columns)}")

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (US/Eastern).
recorded_dttm: null count after conversion= 0
Custom vitals data loaded: (1000, 6)
Columns: ['hospitalization_id', 'recorded_dttm', 'vital_name', 'vital_category', 'vital_value', 'meas_site_name']


In [9]:
# Create vitals table object from the custom DataFrame
custom_vitals = vitals(data=vitals_df)

print("Custom vitals table created!")
print(f"Is valid: {custom_vitals.isvalid()}")
print(f"Validation errors: {len(custom_vitals.errors)}")
print(f"Range validation errors: {len(custom_vitals.range_validation_errors)}")

Validation completed successfully.
Custom vitals table created!
Is valid: True
Validation errors: 0
Range validation errors: 0


## Table-Specific Features

Each table class has specialized methods and properties for working with that type of data.

### Vitals Table Features

In [10]:
# Get vital units mapping
vital_units = vitals_table.vital_units
print("Vital units mapping:")
for vital, unit in list(vital_units.items())[:5]:  # Show first 5
    print(f"  {vital}: {unit}")

Vital units mapping:
  temp_c: Celsius
  heart_rate: (no units)
  sbp: mmHg
  dbp: mmHg
  spo2: %


In [11]:
# Get vital ranges for validation
vital_ranges = vitals_table.vital_ranges
print("\nVital ranges for validation:")
for vital, ranges in list(vital_ranges.items())[:3]:  # Show first 3
    print(f"  {vital}: {ranges}")


Vital ranges for validation:
  temp_c: {'min': 25.0, 'max': 44.0}
  heart_rate: {'min': 0, 'max': 300}
  sbp: {'min': 0, 'max': 300}
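
To make the role of these ranges concrete, the sketch below compares the observed minimum and maximum of a list of values against the expected range for one category. It is an illustration of the idea only, not pyclif's validation code: `check_range` is a hypothetical function, the ranges are copied from the output above, and the wording of the "maximum" message is an assumption.

```python
# Ranges copied from the vital_ranges output above.
VITAL_RANGES = {
    "temp_c": {"min": 25.0, "max": 44.0},
    "heart_rate": {"min": 0, "max": 300},
    "sbp": {"min": 0, "max": 300},
}


def check_range(category, values, ranges=VITAL_RANGES):
    """Return a list of issue strings for one vital category (illustrative)."""
    expected = ranges[category]
    issues = []
    if min(values) < expected["min"]:
        issues.append(f"minimum value {min(values)} below expected {expected['min']}")
    if max(values) > expected["max"]:
        # Message wording is assumed; only the "minimum" form appears in full in this notebook.
        issues.append(f"maximum value {max(values)} above expected {expected['max']}")
    return issues


print(check_range("temp_c", [31.1, 37.0, 99.0]))      # one issue: maximum too high
print(check_range("heart_rate", [0.0, 91.0, 200.0]))  # no issues
```

A maximum `temp_c` of 99.0 is flagged here; a value like that often points to Fahrenheit readings stored in a Celsius column.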


In [12]:
# Filter vitals by category
heart_rate_data = vitals_table.filter_by_vital_category('heart_rate')
print(f"Heart rate measurements: {len(heart_rate_data)}")

if not heart_rate_data.empty:
    print("\nHeart rate statistics:")
    print(heart_rate_data['vital_value'].describe())

Heart rate measurements: 13913

Heart rate statistics:
count    13913.000000
mean        91.122332
std         18.689358
min          0.000000
25%         78.000000
50%         90.000000
75%        104.000000
max        200.000000
Name: vital_value, dtype: float64


In [13]:
# Filter by date range
from datetime import datetime, timedelta

# Get recent data (last 30 days from the latest timestamp)
if 'recorded_dttm' in vitals_table.df.columns:
    latest_date = pd.to_datetime(vitals_table.df['recorded_dttm']).max()
    start_date = latest_date - timedelta(days=30)
    
    recent_vitals = vitals_table.filter_by_date_range(start_date, latest_date)
    print(f"Recent vitals (last 30 days): {len(recent_vitals)} records")
    print(f"Date range: {start_date.date()} to {latest_date.date()}")

Recent vitals (last 30 days): 324 records
Date range: 2201-11-13 to 2201-12-13
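
The window logic used above (take the latest timestamp, subtract 30 days, keep everything in between) can be sketched with the standard library alone. The records below are toy values made up for the example, `filter_by_window` is a hypothetical function, and inclusive bounds are an assumption; this notebook does not show whether `filter_by_date_range()` includes its end points.

```python
from datetime import datetime, timedelta


def filter_by_window(records, start, end, key="recorded_dttm"):
    """Keep records whose timestamp lies in [start, end] (inclusive bounds assumed)."""
    return [r for r in records if start <= r[key] <= end]


# Toy records for illustration only.
records = [
    {"recorded_dttm": datetime(2201, 10, 1, 8, 0), "vital_value": 88.0},
    {"recorded_dttm": datetime(2201, 11, 20, 9, 30), "vital_value": 92.0},
    {"recorded_dttm": datetime(2201, 12, 13, 23, 0), "vital_value": 79.0},
]

latest = max(r["recorded_dttm"] for r in records)
start = latest - timedelta(days=30)
recent = filter_by_window(records, start, latest)
print(len(recent))  # 2: the October record falls outside the 30-day window
```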


In [14]:
# Get comprehensive summary statistics
summary = vitals_table.get_summary_stats()
print("=== VITALS SUMMARY STATISTICS ===")
print(f"Total records: {summary.get('total_records', 'N/A')}")
print(f"Unique hospitalizations: {summary.get('unique_hospitalizations', 'N/A')}")

print("\nVital category counts:")
for category, count in list(summary.get('vital_category_counts', {}).items())[:5]:
    print(f"  {category}: {count}")

date_range = summary.get('date_range', {})
print(f"\nDate range: {date_range.get('earliest')} to {date_range.get('latest')}")

=== VITALS SUMMARY STATISTICS ===
Total records: 89085
Unique hospitalizations: 128

Vital category counts:
  map: 14368
  sbp: 14356
  dbp: 14351
  heart_rate: 13913
  respiratory_rate: 13913

Date range: 2110-04-11 20:52:00+00:00 to 2201-12-13 23:00:00+00:00


### Range Validation Report

In [15]:
# Get detailed range validation report
range_report = vitals_table.get_range_validation_report()
print("Range validation report:")
print(range_report)

if not range_report.empty:
    print("\nRange validation issues found:")
    for _, row in range_report.head(3).iterrows():
        print(f"  - {row['message']}")

Range validation report:
            error_type vital_category  affected_rows  min_value  max_value  \
0  values_out_of_range      height_cm             71       61.0      188.0   
1  values_out_of_range            map          14368      -27.0      801.0   
2  values_out_of_range           spo2          13540       29.0      100.0   
3  values_out_of_range         temp_c           3767       31.1       99.0   
4  values_out_of_range      weight_kg            806        0.0      164.0   

   mean_value              expected_range  \
0      167.80     {'min': 70, 'max': 255}   
1       76.44      {'min': 0, 'max': 250}   
2       96.81     {'min': 50, 'max': 100}   
3       37.02  {'min': 25.0, 'max': 44.0}   
4       87.96    {'min': 30, 'max': 1100}   

                                              issues  \
0             [minimum value 61.0 below expected 70]   
1  [minimum value -27.0 below expected 0, maximum...   
2             [minimum value 29.0 below expected 50]   
3          

## Advanced Usage: Custom Filtering and Processing

In [16]:
# Load data with custom filters using load_data
from pyclif.utils.io import load_data

# Example: Load only specific vital categories
filtered_vitals_df = load_data(
    table_name="vitals",
    table_path=DATA_DIR,
    table_format_type="parquet",
    columns=['hospitalization_id', 'vital_category', 'vital_value', 'recorded_dttm'],
    filters={'vital_category': ['heart_rate', 'sbp', 'dbp']},  # Only BP and HR
    sample_size=500,
    site_tz="US/Eastern"
)

print(f"Filtered vitals data: {filtered_vitals_df.shape}")
print(f"Unique vital categories: {filtered_vitals_df['vital_category'].unique()}")

Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (US/Eastern).
recorded_dttm: null count after conversion= 0
Filtered vitals data: (500, 4)
Unique vital categories: ['sbp' 'heart_rate' 'dbp']
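
The `filters` argument maps a column name to a list of allowed values. The sketch below shows the semantics this output suggests: a row is kept when its value is in the allowed list for every filtered column. It is an illustration on plain dictionaries, not `load_data`'s implementation. `apply_filters` is a hypothetical function, and combining multiple columns with AND is an assumption, since the notebook filters on one column only.

```python
def apply_filters(rows, filters):
    """Keep rows whose value for every filtered column is in the allowed list."""
    return [
        row for row in rows
        if all(row.get(col) in allowed for col, allowed in filters.items())
    ]


# Toy rows for illustration only.
rows = [
    {"hospitalization_id": "A", "vital_category": "heart_rate", "vital_value": 90.0},
    {"hospitalization_id": "A", "vital_category": "spo2", "vital_value": 97.0},
    {"hospitalization_id": "B", "vital_category": "sbp", "vital_value": 121.0},
]

kept = apply_filters(rows, {"vital_category": ["heart_rate", "sbp", "dbp"]})
print([r["vital_category"] for r in kept])  # ['heart_rate', 'sbp']
```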


In [17]:
# Create table object from filtered data
filtered_vitals_table = vitals(data=filtered_vitals_df)

print("Filtered vitals table created!")
print(f"Is valid: {filtered_vitals_table.isvalid()}")

# Get statistics for filtered data
filtered_summary = filtered_vitals_table.get_summary_stats()
print("\nFiltered data summary:")
print(f"Total records: {filtered_summary.get('total_records')}")
print(f"Vital categories: {list(filtered_summary.get('vital_category_counts', {}).keys())}")

Validation completed successfully.
Filtered vitals table created!
Is valid: True

Filtered data summary:
Total records: 500
Vital categories: ['heart_rate', 'sbp', 'dbp']


## Working with Multiple Individual Tables

In [18]:
# Load multiple tables independently for comparison
tables_info = {}

# Load different tables
table_classes = {
    'patient': patient,
    'vitals': vitals,
    'hospitalization': hospitalization
}

for table_name, table_class in table_classes.items():
    try:
        table_obj = table_class.from_file(DATA_DIR, "parquet")
        tables_info[table_name] = {
            'shape': table_obj.df.shape,
            'is_valid': table_obj.isvalid(),
            'columns': len(table_obj.df.columns),
            'memory_usage': f"{table_obj.df.memory_usage(deep=True).sum() / 1024**2:.2f} MB"
        }
    except Exception as e:
        tables_info[table_name] = {'error': str(e)}

# Display summary
print("=== TABLE COMPARISON ===")
for table_name, info in tables_info.items():
    print(f"\n{table_name.upper()}:")
    if 'error' in info:
        print(f"  Error: {info['error']}")
    else:
        print(f"  Shape: {info['shape']}")
        print(f"  Valid: {info['is_valid']}")
        print(f"  Columns: {info['columns']}")
        print(f"  Memory: {info['memory_usage']}")

Loading clif_patient.parquet
Data loaded successfully from clif_patient.parquet
death_dttm: null count before conversion= 85
death_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
death_dttm: null count after conversion= 85
Validation completed with 2 error(s). See `errors` attribute.
Loading clif_vitals.parquet
Data loaded successfully from clif_vitals.parquet
recorded_dttm: null count before conversion= 0
recorded_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
recorded_dttm: null count after conversion= 0
Validation completed with 5 error(s).
  - 5 range validation error(s)
See `errors` and `range_validation_errors` attributes for details.
Loading clif_hospitalization.parquet
Data loaded successfully from clif_hospitalization.parquet
admission_dttm: null count before conversion= 0
admission_dttm: Your timezone is UTC, Converting to your site timezone (UTC).
admission_dttm: null count after conversion= 0
discharge_dttm: null count before convers

## Benefits of Individual Table Approach

### Advantages:
1. **Memory Efficiency**: Load only the tables you need
2. **Custom Processing**: Apply specific filters, column selection, and transformations
3. **Independent Validation**: Each table validates independently
4. **Flexible Loading**: Different parameters for different tables
5. **Specialized Methods**: Each table class has domain-specific functionality

### When to Use:
- Working with large datasets where memory is a concern
- Applying custom filtering or column selection
- Performing analysis on specific table types
- Building specialized data processing pipelines
- Exercising fine-grained control over validation and error handling

## Next Steps

This notebook demonstrated:
- Loading individual table classes
- Using `from_file()` class methods
- Creating tables from custom DataFrames
- Table-specific features and methods
- Custom filtering and processing
- Comparing shape, validity, and memory usage across tables

### Explore Other Notebooks:
- `01_basic_usage.ipynb` - Main CLIF class approach
- `03_data_validation.ipynb` - Advanced validation techniques
- `04_vitals_analysis.ipynb` - Deep dive into vitals analysis
- `05_timezone_handling.ipynb` - Timezone conversion details
- `06_data_filtering.ipynb` - Advanced filtering techniques