# Data Validation and Quality Assurance

This notebook demonstrates pyCLIF's data validation features: schema validation, range checking, and error handling.

## Overview

pyCLIF provides multiple levels of validation:
1. **Schema Validation** - Ensures data conforms to CLIF specifications
2. **Range Validation** - Checks if values are within expected clinical ranges
3. **Data Type Validation** - Verifies correct data types for each column
4. **Missing Data Detection** - Identifies and reports missing required fields

## Setup and Imports

In [None]:
import sys
import os
import pandas as pd
import numpy as np
from datetime import datetime
import json

# Import pyCLIF components
from pyclif import CLIF
from pyclif.tables.patient import patient
from pyclif.tables.vitals import vitals
from pyclif.tables.hospitalization import hospitalization
from pyclif.utils.validator import validate_table

print("pyCLIF validation components imported successfully!")
print(f"Python version: {sys.version}")

In [None]:
# Set data directory (replace with the path to your local CLIF data)
DATA_DIR = "/Users/vaishvik/downloads/CLIF_MIMIC"
print(f"Data directory: {DATA_DIR}")

## Schema Validation

pyCLIF validates data against JSON schema specifications stored in the mCIDE directory.
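To make the idea of schema validation concrete, here is a minimal sketch in plain Python. It does not use pyCLIF or read the real mCIDE files: the schema layout (`columns`, `data_type`, `required`), the `EXAMPLE_SCHEMA` fragment, and the `check_record` helper are all hypothetical and only illustrate the kind of checks a schema validator performs.

```python
import json

# Hypothetical schema fragment; the real mCIDE files may be organised differently.
EXAMPLE_SCHEMA = json.loads("""
{
  "columns": [
    {"name": "hospitalization_id", "data_type": "str", "required": true},
    {"name": "vital_category", "data_type": "str", "required": true},
    {"name": "vital_value", "data_type": "float", "required": true}
  ]
}
""")

# Map schema type names to the Python types accepted for them.
PYTHON_TYPES = {"str": str, "float": (int, float)}


def check_record(record, schema):
    """Return a list of human-readable schema problems for one record."""
    problems = []
    for column in schema["columns"]:
        name = column["name"]
        value = record.get(name)
        if value is None:
            if column["required"]:
                problems.append(f"missing required column: {name}")
            continue
        expected = PYTHON_TYPES[column["data_type"]]
        # bool is a subclass of int, so reject it explicitly for numeric columns
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: got {type(value).__name__}")
    return problems


good = {"hospitalization_id": "H001", "vital_category": "heart_rate", "vital_value": 72.0}
bad = {"hospitalization_id": "H002", "vital_value": "high"}

print(check_record(good, EXAMPLE_SCHEMA))
print(check_record(bad, EXAMPLE_SCHEMA))
```

The first record produces an empty list; the second reports one missing column and one type mismatch. In the notebook itself, the equivalent information comes from each table's `errors` attribute.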

### Load and Validate Patient Data

In [None]:
# Load patient data
patient_table = patient.from_file(DATA_DIR, "parquet")

print(f"Patient data loaded: {patient_table.df.shape}")
print("\nValidation status:")
print(f"Is valid: {patient_table.isvalid()}")
print(f"Number of validation errors: {len(patient_table.errors)}")

# Show validation errors if any
if patient_table.errors:
    print("\nValidation errors:")
    for i, error in enumerate(patient_table.errors[:5]):  # Show first 5 errors
        print(f"{i+1}. {error}")
    if len(patient_table.errors) > 5:
        print(f"... and {len(patient_table.errors) - 5} more errors")
else:
    print("\n✅ No validation errors found!")

### Load and Validate Vitals Data

In [None]:
# Load vitals data
vitals_table = vitals.from_file(DATA_DIR, "parquet")

print(f"Vitals data loaded: {vitals_table.df.shape}")
print("\nValidation status:")
print(f"Is valid: {vitals_table.isvalid()}")
print(f"Schema validation errors: {len(vitals_table.errors)}")
print(f"Range validation errors: {len(vitals_table.range_validation_errors)}")
print(f"Total validation issues: {len(vitals_table.errors) + len(vitals_table.range_validation_errors)}")

### Detailed Error Analysis

In [None]:
# Schema validation errors
if vitals_table.errors:
    print("=== SCHEMA VALIDATION ERRORS ===")
    for i, error in enumerate(vitals_table.errors[:3]):
        print(f"{i+1}. {error}")
        
# Range validation errors
if vitals_table.range_validation_errors:
    print("\n=== RANGE VALIDATION ERRORS ===")
    for i, error in enumerate(vitals_table.range_validation_errors[:3]):
        print(f"{i+1}. Type: {error.get('error_type')}")
        print(f"   Message: {error.get('message')}")
        if 'vital_category' in error:
            print(f"   Vital: {error.get('vital_category')}")
            print(f"   Affected rows: {error.get('affected_rows')}")
        print()

## Range Validation Deep Dive

The vitals table applies range validation to flag clinically implausible values, checking each measurement against the expected range for its vital category.
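Conceptually, a range check compares each value with the lower and upper bound defined for its category. The sketch below shows that logic in plain Python. It is not pyCLIF's implementation: `EXAMPLE_RANGES` holds illustrative bounds (not the ranges shipped with pyCLIF), and `find_out_of_range` is a hypothetical helper.

```python
# Illustrative bounds only; these are NOT the ranges shipped with pyCLIF.
EXAMPLE_RANGES = {
    "heart_rate": {"min": 0, "max": 300},
    "temp_c": {"min": 25, "max": 44},
}


def find_out_of_range(rows, ranges):
    """Group row indices by vital category where the value is outside its range."""
    flagged = {}
    for index, row in enumerate(rows):
        bounds = ranges.get(row["vital_category"])
        if bounds is None:
            continue  # no range defined for this category, so nothing to check
        if not bounds["min"] <= row["vital_value"] <= bounds["max"]:
            flagged.setdefault(row["vital_category"], []).append(index)
    return flagged


rows = [
    {"vital_category": "heart_rate", "vital_value": 72},
    {"vital_category": "heart_rate", "vital_value": -10},
    {"vital_category": "temp_c", "vital_value": 50},
    {"vital_category": "sbp", "vital_value": 120},
]

flagged = find_out_of_range(rows, EXAMPLE_RANGES)
print(flagged)
```

Rows 1 and 2 are flagged, while the `sbp` row is skipped because the example defines no range for it. Grouping by category mirrors the per-category summary that the report in the next cell displays.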

In [None]:
# Get detailed range validation report
range_report = vitals_table.get_range_validation_report()

print("=== RANGE VALIDATION REPORT ===")
if not range_report.empty:
    print(f"Total range validation issues: {len(range_report)}")
    print("\nRange validation summary:")
    print(range_report[['error_type', 'vital_category', 'affected_rows', 'message']].head(10))
else:
    print("✅ No range validation issues found!")

In [None]:
# Show vital ranges used for validation
vital_ranges = vitals_table.vital_ranges
print("=== VITAL RANGES FOR VALIDATION ===")
print(f"Total vital categories with ranges: {len(vital_ranges)}")

print("\nSample vital ranges:")
for vital, ranges in list(vital_ranges.items())[:5]:
    print(f"{vital}: {ranges}")

## Creating Test Data with Validation Issues

Let's create some test data with known validation issues to demonstrate error detection.

In [None]:
# Create test vitals data with validation issues
test_vitals_data = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'hospitalization_id': ['H001', 'H002', 'H003', 'H004', 'H005'],
    'vital_category': ['heart_rate', 'sbp', 'temp_c', 'heart_rate', 'oxygen_saturation'],
    'vital_value': [250, 300, 50, -10, 150],  # Extreme/invalid values
    'vital_unit': ['bpm', 'mmHg', 'celsius', 'bpm', '%'],
    'recorded_dttm': [
        '2023-01-01 10:00:00',
        '2023-01-01 11:00:00', 
        '2023-01-01 12:00:00',
        '2023-01-01 13:00:00',
        '2023-01-01 14:00:00'
    ]
    # Missing some required columns to trigger schema validation errors
})

print("Test vitals data created:")
print(test_vitals_data)
print(f"\nData shape: {test_vitals_data.shape}")

In [None]:
# Create vitals table object from test data
test_vitals_table = vitals(data=test_vitals_data)

print("=== TEST DATA VALIDATION RESULTS ===")
print(f"Is valid: {test_vitals_table.isvalid()}")
print(f"Schema validation errors: {len(test_vitals_table.errors)}")
print(f"Range validation errors: {len(test_vitals_table.range_validation_errors)}")

# Show schema validation errors
if test_vitals_table.errors:
    print("\nSchema validation errors:")
    for error in test_vitals_table.errors[:3]:
        print(f"  - {error}")

# Show range validation errors
if test_vitals_table.range_validation_errors:
    print("\nRange validation errors:")
    for error in test_vitals_table.range_validation_errors:
        print(f"  - {error.get('message')}")

## Manual Validation Using Validator Utility

In [None]:
# Use the validate_table utility directly
manual_errors = validate_table(test_vitals_data, "vitals")

print("=== MANUAL VALIDATION RESULTS ===")
print(f"Number of validation errors: {len(manual_errors)}")

if manual_errors:
    print("\nValidation errors:")
    for i, error in enumerate(manual_errors[:5]):
        print(f"{i+1}. {error}")

## Validation Best Practices

### 1. Always Validate After Loading

In [None]:
def safe_table_load(table_class, data_path, file_type="parquet"):
    """Safely load a table with comprehensive validation reporting."""
    try:
        # Load the table
        table = table_class.from_file(data_path, file_type)
        
        # Report validation status
        print(f"✅ {table_class.__name__} loaded successfully")
        print(f"   Shape: {table.df.shape}")
        print(f"   Valid: {table.isvalid()}")
        
        if hasattr(table, 'errors') and table.errors:
            print(f"   ⚠️  Schema errors: {len(table.errors)}")
            
        if hasattr(table, 'range_validation_errors') and table.range_validation_errors:
            print(f"   ⚠️  Range errors: {len(table.range_validation_errors)}")
            
        return table
        
    except Exception as e:
        print(f"❌ Failed to load {table_class.__name__}: {str(e)}")
        return None

# Test the safe loading function
print("=== SAFE TABLE LOADING ===")
safe_patient = safe_table_load(patient, DATA_DIR)
safe_vitals = safe_table_load(vitals, DATA_DIR)

### 2. Validation Summary Report

In [None]:
def generate_validation_report(tables_dict):
    """Generate a comprehensive validation report for multiple tables."""
    print("=== COMPREHENSIVE VALIDATION REPORT ===")
    print(f"Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("="*50)
    
    total_tables = len(tables_dict)
    valid_tables = 0
    total_errors = 0
    
    for table_name, table_obj in tables_dict.items():
        if table_obj is None:
            print(f"\n{table_name.upper()}: ❌ FAILED TO LOAD")
            continue
            
        is_valid = table_obj.isvalid()
        if is_valid:
            valid_tables += 1
            
        schema_errors = len(table_obj.errors) if hasattr(table_obj, 'errors') else 0
        range_errors = len(table_obj.range_validation_errors) if hasattr(table_obj, 'range_validation_errors') else 0
        table_total_errors = schema_errors + range_errors
        total_errors += table_total_errors
        
        status = "✅ VALID" if is_valid else "⚠️  ISSUES FOUND"
        
        print(f"\n{table_name.upper()}: {status}")
        print(f"  Records: {len(table_obj.df):,}")
        print(f"  Columns: {len(table_obj.df.columns)}")
        if schema_errors > 0:
            print(f"  Schema errors: {schema_errors}")
        if range_errors > 0:
            print(f"  Range errors: {range_errors}")
            
    print("\n" + "="*50)
    print("SUMMARY:")
    print(f"  Tables checked: {total_tables}")
    print(f"  Valid tables: {valid_tables}")
    print(f"  Tables with issues: {total_tables - valid_tables}")
    print(f"  Total validation errors: {total_errors}")
    
    if total_errors == 0:
        print("\n🎉 All data passed validation!")
    else:
        print(f"\n⚠️  {total_errors} validation issues require attention.")

# Generate report for loaded tables
tables_for_report = {
    'patient': safe_patient,
    'vitals': safe_vitals
}

generate_validation_report(tables_for_report)

## Data Quality Metrics

In [None]:
def calculate_data_quality_metrics(table_obj, table_name):
    """Calculate comprehensive data quality metrics."""
    if table_obj is None or table_obj.df is None:
        return None
        
    df = table_obj.df
    n_rows = len(df)
    total_cells = df.shape[0] * df.shape[1]
    # Compute the counts once and reuse them for the percentages
    missing_cells = int(df.isnull().sum().sum())
    duplicate_records = int(df.duplicated().sum())
    
    metrics = {
        'table_name': table_name,
        'total_records': n_rows,
        'total_columns': len(df.columns),
        'total_cells': total_cells,
        'missing_cells': missing_cells,
        # Guard against division by zero on empty tables
        'missing_percentage': (missing_cells / total_cells) * 100 if total_cells else 0.0,
        'duplicate_records': duplicate_records,
        'duplicate_percentage': (duplicate_records / n_rows) * 100 if n_rows else 0.0,
        'validation_passed': table_obj.isvalid(),
        'schema_errors': len(table_obj.errors) if hasattr(table_obj, 'errors') else 0,
        'range_errors': len(table_obj.range_validation_errors) if hasattr(table_obj, 'range_validation_errors') else 0
    }
    
    # Calculate completeness per column (percentage of non-null values)
    column_completeness = {}
    for col in df.columns:
        non_null_count = df[col].notna().sum()
        completeness = (non_null_count / n_rows) * 100 if n_rows else 0.0
        column_completeness[col] = completeness
    
    metrics['column_completeness'] = column_completeness
    
    return metrics

# Calculate quality metrics for our tables
if safe_vitals is not None:
    vitals_metrics = calculate_data_quality_metrics(safe_vitals, 'vitals')
    
    print("=== VITALS DATA QUALITY METRICS ===")
    print(f"Total records: {vitals_metrics['total_records']:,}")
    print(f"Total columns: {vitals_metrics['total_columns']}")
    print(f"Missing data: {vitals_metrics['missing_percentage']:.2f}%")
    print(f"Duplicate records: {vitals_metrics['duplicate_percentage']:.2f}%")
    print(f"Validation passed: {vitals_metrics['validation_passed']}")
    print(f"Schema errors: {vitals_metrics['schema_errors']}")
    print(f"Range errors: {vitals_metrics['range_errors']}")
    
    # Show columns with low completeness
    print("\nColumns with <95% completeness:")
    low_completeness = {k: v for k, v in vitals_metrics['column_completeness'].items() if v < 95}
    for col, completeness in sorted(low_completeness.items(), key=lambda x: x[1]):
        print(f"  {col}: {completeness:.1f}%")

## Validation Configuration and Customization

This section shows how to locate and inspect the JSON schema files that drive validation.

In [None]:
# Explore the validation schema files
from pathlib import Path

# Find the mCIDE schema directory (this path is relative to the current working directory)
schema_dir = Path("src/pyclif/mCIDE")
if schema_dir.exists():
    print("=== AVAILABLE VALIDATION SCHEMAS ===")
    schema_files = list(schema_dir.glob("*.json"))
    for schema_file in schema_files:
        print(f"  - {schema_file.name}")
        
    # Load and examine a schema file
    vitals_schema_path = schema_dir / "VitalsModel.json"
    if vitals_schema_path.exists():
        with open(vitals_schema_path, 'r') as f:
            vitals_schema = json.load(f)
        
        print("\n=== VITALS SCHEMA STRUCTURE ===")
        print(f"Schema keys: {list(vitals_schema.keys())}")
        
        if 'vital_ranges' in vitals_schema:
            print(f"\nVital categories with ranges: {len(vitals_schema['vital_ranges'])}")
            
        if 'vital_units' in vitals_schema:
            print(f"Vital categories with units: {len(vitals_schema['vital_units'])}")
else:
    print(f"Schema directory not found: {schema_dir.resolve()}")

## Validation Summary and Best Practices

### Key Validation Features:

1. **Automatic Validation**: All table classes automatically validate on load
2. **Schema Compliance**: Ensures data matches CLIF specifications
3. **Range Checking**: Validates clinical plausibility of vital signs
4. **Error Reporting**: Detailed error messages for debugging
5. **Quality Metrics**: Comprehensive data quality assessment

### Best Practices:

1. **Always check validation status** after loading data
2. **Review validation errors** before proceeding with analysis
3. **Use safe loading functions** that handle errors gracefully
4. **Generate validation reports** for data quality documentation
5. **Monitor data completeness** and missing value patterns
6. **Validate test data** to ensure your processing pipeline works correctly
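
For practice 4, one option is to save a machine-readable summary alongside the printed report. The following is a minimal sketch using only the standard library; `build_validation_summary` and its field names are hypothetical, not part of pyCLIF.

```python
import json
from datetime import datetime, timezone


def build_validation_summary(table_name, schema_errors, range_errors):
    """Collect error counts and messages into a JSON-serialisable dict."""
    return {
        "table": table_name,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "schema_error_count": len(schema_errors),
        "range_error_count": len(range_errors),
        "passed": not schema_errors and not range_errors,
        # str() keeps the summary serialisable whatever type the errors are
        "messages": [str(e) for e in list(schema_errors) + list(range_errors)],
    }


# Example input; in the notebook these lists would come from a loaded table.
summary = build_validation_summary("vitals", ["missing column: x"], [])
report_json = json.dumps(summary, indent=2)
print(report_json)
```

In this notebook, the two error lists could be `vitals_table.errors` and `vitals_table.range_validation_errors`. The resulting string can be written to a file to keep a dated record of each validation run.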

### When Validation Fails:

1. **Check data source** - Ensure data is in expected format
2. **Review error messages** - Understand what specific issues exist
3. **Clean data** - Fix known issues before reloading
4. **Document exceptions** - Note any acceptable deviations from standards
5. **Update schemas** - If business rules change, update validation accordingly
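
Steps 3 and 4 can be combined by separating failing rows from the rest and recording why each one was rejected, while letting documented exceptions through. This is a sketch under stated assumptions: `partition_rows`, `spo2_is_plausible`, and the 0 to 100 rule are illustrative and are not taken from pyCLIF.

```python
def partition_rows(rows, is_valid, documented_exceptions=()):
    """Split rows into (kept, rejected); each rejected row carries a reason."""
    kept, rejected = [], []
    for row in rows:
        ok, reason = is_valid(row)
        # Rows listed as documented exceptions are kept even if they fail the rule
        if ok or row.get("hospitalization_id") in documented_exceptions:
            kept.append(row)
        else:
            rejected.append({**row, "reject_reason": reason})
    return kept, rejected


def spo2_is_plausible(row):
    """Hypothetical rule: oxygen saturation is a percentage between 0 and 100."""
    value = row["vital_value"]
    if 0 <= value <= 100:
        return True, ""
    return False, f"oxygen saturation {value} outside 0-100"


rows = [
    {"hospitalization_id": "H001", "vital_value": 97},
    {"hospitalization_id": "H005", "vital_value": 150},
]

kept, rejected = partition_rows(rows, spo2_is_plausible)
print(len(kept), "kept;", len(rejected), "rejected")
print(rejected[0]["reject_reason"])
```

Keeping the rejected rows together with their reasons, rather than dropping them silently, leaves an audit trail that can be reviewed or shared with the data provider.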

## Next Steps

This notebook covered:
- Schema validation against CLIF specifications
- Range validation for clinical data
- Error detection and reporting
- Data quality metrics
- Validation best practices
- Custom validation workflows

### Explore Other Notebooks:
- `01_basic_usage.ipynb` - Basic pyCLIF usage
- `02_individual_tables.ipynb` - Individual table classes
- `04_vitals_analysis.ipynb` - Advanced vitals analysis
- `05_timezone_handling.ipynb` - Timezone conversion
- `06_data_filtering.ipynb` - Data filtering techniques