# Fraser Health Synthetic Data - Automated Testing & Clinical Validation Suite

This notebook performs comprehensive validation of synthetic health data generated by the Fraser Health onboarding notebook.

**Validation Categories:**

1. **Statistical Demographic Parity** - Age and ethnicity distribution checks
2. **Geographic Integrity & Referral Routing** - Location and distance validation
3. **Clinical Timeline & FHIR R4 Compliance** - Encounter sequences and FHIR validation
4. **Fraser Health Specific Stress Test** - Multi-seed consistency testing

**Prerequisites:**

- Run `fraser_health_onboarding.ipynb` first to generate synthetic data
- Data expected in: `synthea/output/csv/` and `synthea/output/fhir/`

## Section 0: Environment Setup & Imports

### 0.1 Install Required Dependencies

In [None]:
# Install required packages for validation
import sys
import subprocess

print("Installing validation dependencies...")
!pip install -q pandas numpy matplotlib haversine fhir.resources scipy

print("✓ Dependencies installed")

### 0.2 Import Libraries

In [None]:
# Core data processing and analysis libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os
from pathlib import Path

# Date and time utilities
from datetime import datetime

# Counting and grouping utilities
from collections import Counter, defaultdict

# Suppress non-critical warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# ========== GEOGRAPHIC CALCULATIONS ==========
# Haversine formula for calculating great-circle distances between two points on Earth
# Used to detect unrealistic patient-provider distances ("teleporting" patients)
try:
    from haversine import haversine, Unit
    haversine_available = True
except ImportError:
    print("⚠ Haversine not available - installing...")
    !pip install haversine
    from haversine import haversine, Unit
    haversine_available = True

# ========== FHIR VALIDATION ==========
# FHIR (Fast Healthcare Interoperability Resources) R4 validation
# Ensures generated health data conforms to international healthcare data standards
try:
    from fhir.resources.bundle import Bundle
    from fhir.resources.patient import Patient
    fhir_available = True
except ImportError:
    print("⚠ FHIR resources not available - some validation will be skipped")
    fhir_available = False

print("✓ Library imports complete")
print(f"  - Haversine: {'✓' if haversine_available else '✗'}")
print(f"  - FHIR Resources: {'✓' if fhir_available else '✗'}")

### 0.3 Configuration & Path Setup

In [None]:
# ========== CONFIGURATION ==========

# Paths to Synthea output directories
SYNTHEA_DIR = Path("synthea")
OUTPUT_DIR = SYNTHEA_DIR / "output"
CSV_DIR = OUTPUT_DIR / "csv"
FHIR_DIR = OUTPUT_DIR / "fhir"
CONFIG_DIR = Path("config")

# ========== DEMOGRAPHIC BASELINES ==========

# Expected BC demographics (approximations)
# Used to validate that synthetic data matches real-world demographics
# Source: Statistics Canada 2021 Census
BC_MEDIAN_AGE_2021 = 42.8  # Years
BC_AGE_TOLERANCE = 0.10  # 10% tolerance

# ========== GEOGRAPHIC VALIDATION ==========

# Target cities within Fraser Health Authority jurisdiction
FRASER_HEALTH_CITIES = ['Surrey', 'Burnaby', 'New Westminster', 'Coquitlam']

# Distance threshold (km) for flagging "teleporting" patients
# Patients traveling >100km for routine care may indicate data quality issues
MAX_PATIENT_PROVIDER_DISTANCE_KM = 100  # Flag if > 100km


# ========== TEST TRACKING ==========
# Central repository for all validation test results
validation_results = {
    'timestamp': datetime.now().isoformat(),
    'tests': [],
    'passed': 0,
    'failed': 0,
    'warnings': 0
}

def log_test(category, test_name, status, message, details=None):
    """Log a test result"""
    result = {
        'category': category,
        'test': test_name,
        'status': status,  # PASS, FAIL, WARN
        'message': message,
        'details': details or {}
    }
    validation_results['tests'].append(result)
    
    if status == 'PASS':
        validation_results['passed'] += 1
        icon = '✓'
    elif status == 'FAIL':
        validation_results['failed'] += 1
        icon = '✗'
    else:  # WARN
        validation_results['warnings'] += 1
        icon = '⚠'
    
    print(f"{icon} [{category}] {test_name}: {message}")
    return result

print("Configuration loaded:")
print(f"  CSV Directory: {CSV_DIR}")
print(f"  FHIR Directory: {FHIR_DIR}")
print(f"  BC Median Age (2021 Census): {BC_MEDIAN_AGE_2021} years")
print(f"  Fraser Health Cities: {', '.join(FRASER_HEALTH_CITIES)}")
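
Because every test appends to `validation_results['tests']`, the log can be rolled up per category once the run is finished. The following is a minimal sketch of that idea. `summarize_by_category` and `example_results` are hypothetical names introduced only for illustration, and the toy entries merely mirror the structure that `log_test` builds.

```python
from collections import Counter

def summarize_by_category(results):
    """Return {category: Counter of statuses} for a validation_results-style dict."""
    summary = {}
    for test in results['tests']:
        # Create the per-category counter on first use, then count the status
        summary.setdefault(test['category'], Counter())[test['status']] += 1
    return summary

# Toy entries with the same keys that log_test records (illustrative only)
example_results = {
    'tests': [
        {'category': 'Setup', 'test': 'Load patients.csv', 'status': 'PASS'},
        {'category': 'Setup', 'test': 'Load organizations.csv', 'status': 'WARN'},
        {'category': 'Geography', 'test': 'Distance Check', 'status': 'FAIL'},
    ]
}

for category, counts in summarize_by_category(example_results).items():
    print(f"{category}: {dict(counts)}")
```

Passing the notebook's own `validation_results` dict instead of the toy one should work the same way, since only the `tests` list is read.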

### 0.4 Load Generated Data

In [None]:
def load_data_file(filename, required=True):
    """Load a CSV file from the output directory
    
    Args:
        filename: Name of the CSV file to load
        required: If True, logs FAIL if file not found; if False, logs WARN
    
    Returns:
        DataFrame if successful, None if file not found or error occurred
    """
    filepath = CSV_DIR / filename
    if not filepath.exists():
        # File doesn't exist - log appropriate message
        msg = f"File not found: {filepath}"
        if required:
            log_test("Setup", f"Load {filename}", "FAIL", msg)
            return None
        else:
            log_test("Setup", f"Load {filename}", "WARN", f"{msg} (optional)")
            return None
    
    # Try to read the CSV file
    try:
        df = pd.read_csv(filepath)
        log_test("Setup", f"Load {filename}", "PASS", f"Loaded {len(df)} rows")
        return df
    except Exception as e:
        # Handle any pandas read errors
        log_test("Setup", f"Load {filename}", "FAIL", f"Error loading: {e}")
        return None

print("Loading generated data files...")
print()

# ========== LOAD CORE DATASETS ==========
# These files are generated by fraser_health_onboarding.ipynb

# Load primary patient and encounter data (required for all tests)
patients_df = load_data_file("patients.csv", required=True)
encounters_df = load_data_file("encounters.csv", required=True)

# Load optional supporting data
organizations_df = load_data_file("organizations.csv", required=False)
conditions_df = load_data_file("conditions.csv", required=False)


# ========== LOAD CONFIGURATION FILES ==========
# These are the filtered CSV files created during onboarding
# Used to validate that generated data matches configuration
print("\nLoading configuration files...")
demographics_config = None
hospitals_config = None

if CONFIG_DIR.exists():
    demographics_path = CONFIG_DIR / "demographics_ca.csv"
    hospitals_path = CONFIG_DIR / "hospitals_ca.csv"
    
    if demographics_path.exists():
        try:
            demographics_config = pd.read_csv(demographics_path)
            log_test("Setup", "Load demographics_ca.csv", "PASS", f"Loaded {len(demographics_config)} rows")
        except Exception as e:
            log_test("Setup", "Load demographics_ca.csv", "WARN", f"Error: {e}")
    
    if hospitals_path.exists():
        try:
            hospitals_config = pd.read_csv(hospitals_path)
            log_test("Setup", "Load hospitals_ca.csv", "PASS", f"Loaded {len(hospitals_config)} rows")
        except Exception as e:
            log_test("Setup", "Load hospitals_ca.csv", "WARN", f"Error: {e}")

print("\n" + "="*70)
print(f"Data Loading Summary: {validation_results['passed']} passed, {validation_results['failed']} failed, {validation_results['warnings']} warnings")
print("="*70)
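
Most of the tests that follow begin by checking whether particular columns exist. One way to surface such gaps right after loading is a small column check. The sketch below assumes a hypothetical helper `missing_columns` and a hypothetical list of required patient columns, neither of which is part of the notebook.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for patients_df (illustrative only)
example_patients = pd.DataFrame(
    {'Id': ['p1'], 'BIRTHDATE': ['1980-01-01'], 'CITY': ['Surrey']}
)

# Hypothetical list of columns the later tests rely on
required_patient_cols = ['Id', 'BIRTHDATE', 'CITY', 'LAT', 'LON']

print(missing_columns(example_patients, required_patient_cols))
```

A non-empty result could be passed to `log_test` as a `WARN` so that later skipped tests are easier to explain.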

## Section 1: Statistical Demographic Parity Tests

### 1.1 Age Distribution Analysis

In [None]:
if patients_df is not None:
    print("="*70)
    print("TEST 1.1: AGE DISTRIBUTION vs BC 2021 CENSUS")
    print("="*70)
    
    print()
    
    # ========== CALCULATE AGE STATISTICS ==========
    # Parse the BIRTHDATE column and compute each patient's age in years as of today
    if 'BIRTHDATE' in patients_df.columns:
        patients_df['BIRTHDATE'] = pd.to_datetime(patients_df['BIRTHDATE'])
        reference_date = pd.Timestamp.now()
        patients_df['AGE'] = (reference_date - patients_df['BIRTHDATE']).dt.days / 365.25
        
        median_age = patients_df['AGE'].median()
        mean_age = patients_df['AGE'].mean()
        
        print(f"Generated Data Statistics:")
        print(f"  Median Age: {median_age:.1f} years")
        print(f"  Mean Age: {mean_age:.1f} years")
        print(f"  Age Range: {patients_df['AGE'].min():.1f} - {patients_df['AGE'].max():.1f} years")
        print()
        print(f"BC 2021 Census Baseline:")
        print(f"  Median Age: {BC_MEDIAN_AGE_2021} years")
        print(f"  Tolerance: ±{BC_AGE_TOLERANCE*100:.0f}%")
        print()
        
        # ========== COMPARE TO CENSUS BASELINE ==========
        # Check if synthetic data median age is within tolerance of BC 2021 Census
        deviation = abs(median_age - BC_MEDIAN_AGE_2021) / BC_MEDIAN_AGE_2021
        if deviation <= BC_AGE_TOLERANCE:
            log_test("Demographics", "Age Distribution - Median", "PASS",
                    f"Median age {median_age:.1f} is within {BC_AGE_TOLERANCE*100:.0f}% of BC baseline ({BC_MEDIAN_AGE_2021})",
                    {'median_age': median_age, 'bc_baseline': BC_MEDIAN_AGE_2021, 'deviation_pct': deviation*100})
        else:
            log_test("Demographics", "Age Distribution - Median", "FAIL",
                    f"Median age {median_age:.1f} deviates by {deviation*100:.1f}% from BC baseline (>{BC_AGE_TOLERANCE*100:.0f}%)",
                    {'median_age': median_age, 'bc_baseline': BC_MEDIAN_AGE_2021, 'deviation_pct': deviation*100})
        
        # Plot age distribution
        plt.figure(figsize=(12, 5))
        
        plt.subplot(1, 2, 1)
        plt.hist(patients_df['AGE'], bins=20, edgecolor='black', alpha=0.7)
        plt.axvline(median_age, color='red', linestyle='--', label=f'Median: {median_age:.1f}')
        plt.axvline(BC_MEDIAN_AGE_2021, color='green', linestyle='--', label=f'BC 2021: {BC_MEDIAN_AGE_2021}')
        plt.xlabel('Age (years)')
        plt.ylabel('Frequency')
        plt.title('Age Distribution of Generated Patients')
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        plt.subplot(1, 2, 2)
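        # right=False makes bins left-closed ([0, 18), [18, 30), ...) so they match the labels
        age_groups = pd.cut(patients_df['AGE'], bins=[0, 18, 30, 45, 60, 75, 120], right=False,
                           labels=['0-17', '18-29', '30-44', '45-59', '60-74', '75+'])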
        age_group_counts = age_groups.value_counts().sort_index()
        age_group_counts.plot(kind='bar', color='steelblue', edgecolor='black')
        plt.xlabel('Age Group')
        plt.ylabel('Count')
        plt.title('Age Group Distribution')
        plt.xticks(rotation=45)
        plt.grid(True, alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()
        
    else:
        log_test("Demographics", "Age Distribution", "FAIL", "BIRTHDATE column not found")
else:
    log_test("Demographics", "Age Distribution", "FAIL", "Patient data not loaded")
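
The age test measures ages as of the day the notebook runs, so the median drifts upward over time relative to a 2021 baseline. A possible refinement is to measure ages at a fixed reference date. The sketch below assumes 2021-05-11 as that date (taken here as the 2021 census reference day, which should be confirmed against the source). `ages_at` and `relative_deviation` are hypothetical helpers, and the birthdates are toy values.

```python
import pandas as pd

def ages_at(birthdates, reference_date):
    """Age in years (365.25-day years) of each birthdate as of reference_date."""
    birthdates = pd.to_datetime(birthdates)
    return (pd.Timestamp(reference_date) - birthdates).dt.days / 365.25

def relative_deviation(observed, baseline):
    """Absolute difference from the baseline, expressed as a fraction of the baseline."""
    return abs(observed - baseline) / baseline

# Toy birthdates (illustrative only)
example_births = pd.Series(['1971-05-11', '1981-05-11', '2001-05-11'])

# Assumed reference date; adjust if a different census day applies
ages = ages_at(example_births, '2021-05-11')
median_age = ages.median()
deviation = relative_deviation(median_age, 42.8)

print(f"Median age at reference date: {median_age:.1f}")
print(f"Deviation from baseline: {deviation * 100:.1f}%")
```

Fixing the reference date also makes the test reproducible, since rerunning the notebook later no longer changes the computed ages.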

### 1.2 Ethnicity Distribution Analysis

In [None]:
if patients_df is not None:
    print()
    print("="*70)
    print("TEST 1.2: ETHNICITY DISTRIBUTION vs FRASER HEALTH DEMOGRAPHICS")
    print("="*70)
    print()
    
    # ========== ANALYZE ETHNICITY REPRESENTATION ==========
    # Fraser Health serves diverse populations, particularly South Asian and East Asian communities
    # This test verifies that synthetic data reflects this demographic reality
    
    # Check for ethnicity/race columns
    ethnicity_col = None
    for col in ['ETHNICITY', 'RACE', 'ethnicity', 'race']:
        if col in patients_df.columns:
            ethnicity_col = col
            break
    
    if ethnicity_col:
        ethnicity_dist = patients_df[ethnicity_col].value_counts()
        total = len(patients_df)
        
        print("Generated Ethnicity Distribution:")
        for ethnicity, count in ethnicity_dist.items():
            pct = (count / total) * 100
            print(f"  {ethnicity}: {count} ({pct:.1f}%)")
        print()
        
        # ========== CHECK FOR EXPECTED ETHNICITIES ==========
        # Surrey and surrounding Fraser Health cities have significant South Asian and East Asian populations
        # Note: this is a coarse substring match over the category labels; it checks
        # that a matching label is present, not how large its share is
        ethnicities_str = ' '.join(ethnicity_dist.index.astype(str).str.lower())
        
        has_south_asian = any(term in ethnicities_str for term in ['asian', 'indian', 'south'])
        has_east_asian = any(term in ethnicities_str for term in ['asian', 'chinese', 'east'])
        
        if has_south_asian or has_east_asian:
            log_test("Demographics", "Ethnicity Distribution", "PASS",
                    "Dataset includes Asian ethnic groups consistent with Fraser Health demographics",
                    {'ethnicities': ethnicity_dist.to_dict()})
        else:
            log_test("Demographics", "Ethnicity Distribution", "WARN",
                    "Asian ethnic groups may be underrepresented for Fraser Health region",
                    {'ethnicities': ethnicity_dist.to_dict()})
        
        # Plot ethnicity distribution
        plt.figure(figsize=(10, 6))
        ethnicity_dist.plot(kind='bar', color='coral', edgecolor='black')
        plt.xlabel('Ethnicity')
        plt.ylabel('Count')
        plt.title('Ethnicity Distribution of Generated Patients')
        plt.xticks(rotation=45, ha='right')
        plt.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.show()
        
    else:
        log_test("Demographics", "Ethnicity Distribution", "WARN", 
                "No ethnicity/race column found in patient data")
        
    # Compare with demographics config if available
    if demographics_config is not None:
        print("Demographics configuration loaded - can be used for detailed comparison")
        if 'RACE' in demographics_config.columns or 'ETHNICITY' in demographics_config.columns:
            print("Reference ethnicity data available in config")
else:
    log_test("Demographics", "Ethnicity Distribution", "FAIL", "Patient data not loaded")
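
The ethnicity test checks only whether a matching label appears among the categories. A stricter variant could measure the share of patients in a category and compare it with an expected minimum. The sketch below is one way to compute that share. `category_share` is a hypothetical helper, the labels are toy values, and no expected threshold is implied by the notebook.

```python
import pandas as pd

def category_share(series, labels):
    """Fraction of non-null rows whose lowercased, stripped value is in labels.

    labels is expected to contain lowercase strings.
    """
    normalized = series.dropna().astype(str).str.strip().str.lower()
    if normalized.empty:
        return 0.0
    return float(normalized.isin(labels).mean())

# Toy race column with inconsistent case and whitespace (illustrative only)
example_race = pd.Series(['asian', 'white', 'Asian ', 'black', 'white'])

share = category_share(example_race, {'asian'})
print(f"Asian share: {share:.0%}")
```

Matching on exact normalized labels avoids accidental hits that substring matching allows, for example a label that merely contains "south" or "east".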

## Section 2: Geographic Integrity & Referral Routing

### 2.1 Fraser Health Boundary Check

In [None]:
if patients_df is not None:
    print("="*70)
    print("TEST 2.1: FRASER HEALTH BOUNDARY CHECK")
    print("="*70)
    
    print()
    
    # ========== VERIFY GEOGRAPHIC BOUNDARIES ==========
    # All patients must reside in one of the four target Fraser Health cities
    # This ensures we're generating data for the correct health authority region
    
    # Check patient cities
    city_col = None
    for col in ['CITY', 'city', 'City']:
        if col in patients_df.columns:
            city_col = col
            break
    
    if city_col:
        patient_cities = patients_df[city_col].value_counts()
        print("Patient Cities Distribution:")
        for city, count in patient_cities.items():
            in_fraser = "✓" if city in FRASER_HEALTH_CITIES else "✗"
            print(f"  {in_fraser} {city}: {count}")
        print()
        
        # ========== IDENTIFY INVALID LOCATIONS ==========
        # Any patient outside Fraser Health boundaries indicates a configuration error
        # Test: All patients should be in Fraser Health cities
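        patients_in_fraser = patients_df[city_col].isin(FRASER_HEALTH_CITIES).sum()
        total_patients = len(patients_df)
        pct_in_fraser = (patients_in_fraser / total_patients) * 100 if total_patients > 0 else 0
        
        # Compare counts rather than a float percentage to decide the 100% case
        if total_patients > 0 and patients_in_fraser == total_patients: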
            log_test("Geography", "Fraser Health Boundary", "PASS",
                    f"All {total_patients} patients are in Fraser Health cities",
                    {'cities': patient_cities.to_dict()})
        elif pct_in_fraser >= 90:
            log_test("Geography", "Fraser Health Boundary", "WARN",
                    f"{pct_in_fraser:.1f}% of patients in Fraser Health cities (not 100%)",
                    {'cities': patient_cities.to_dict(), 'pct_in_fraser': pct_in_fraser})
        else:
            log_test("Geography", "Fraser Health Boundary", "FAIL",
                    f"Only {pct_in_fraser:.1f}% of patients in Fraser Health cities",
                    {'cities': patient_cities.to_dict(), 'pct_in_fraser': pct_in_fraser})
    else:
        log_test("Geography", "Fraser Health Boundary", "FAIL", "No city column found in patient data")
else:
    log_test("Geography", "Fraser Health Boundary", "FAIL", "Patient data not loaded")
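
The boundary check compares city names exactly, so a value such as `surrey ` or `NEW  WESTMINSTER` would count as outside the region. If the generated data may contain such variations, names can be normalized before matching. The sketch below assumes a hypothetical helper `in_target_cities` and uses toy city values.

```python
import pandas as pd

FRASER_CITIES = ['Surrey', 'Burnaby', 'New Westminster', 'Coquitlam']

def _city_key(name):
    """Collapse repeated whitespace and ignore case so spelling variants compare equal."""
    return ' '.join(str(name).split()).casefold()

def in_target_cities(cities, targets):
    """Boolean Series: True where the normalized city name is one of the targets."""
    target_keys = {_city_key(t) for t in targets}
    keys = cities.fillna('').astype(str).map(_city_key)
    return keys.isin(target_keys)

# Toy city column with case, whitespace, and missing-value variations (illustrative only)
example_cities = pd.Series(['Surrey', 'surrey ', 'NEW  WESTMINSTER', 'Vancouver', None])

mask = in_target_cities(example_cities, FRASER_CITIES)
print(mask.tolist())
print(f"{mask.sum()} of {len(mask)} rows are in target cities")
```

Whether normalization is appropriate depends on the intent of the test: if a misspelled city name should itself count as a data quality problem, the exact comparison is the stricter choice.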

### 2.2 Provider-Patient Location Matching

In [None]:
if patients_df is not None and encounters_df is not None and organizations_df is not None:
    print()
    print("="*70)
    print("TEST 2.2: PROVIDER-PATIENT LOCATION MATCHING")
    print("="*70)
    print()
    
    # ========== VALIDATE PROVIDER LOCATIONS ==========
    # Ensure healthcare providers (hospitals, clinics) are within Fraser Health boundaries
    # This prevents patients from being assigned to providers outside their health authority
    
    # Merge encounters with organizations to get provider locations.
    # pandas applies merge suffixes only to column names present in both frames, so the
    # organization columns are renamed explicitly to guarantee that 'CITY_org' exists.
    org_cols = ['Id', 'NAME', 'CITY', 'STATE'] if 'STATE' in organizations_df.columns else ['Id', 'NAME', 'CITY']
    org_lookup = organizations_df[org_cols].rename(columns={c: f"{c}_org" for c in org_cols})
    encounters_with_org = encounters_df.merge(
        org_lookup,
        left_on='ORGANIZATION' if 'ORGANIZATION' in encounters_df.columns else 'PROVIDER',
        right_on='Id_org',
        how='left'
    )
    
    # Get patient cities
    patient_city_col = None
    for col in ['CITY', 'city']:
        if col in patients_df.columns:
            patient_city_col = col
            break
    
    if patient_city_col:
        # Merge with patients to get patient city
        full_data = encounters_with_org.merge(
            patients_df[['Id', patient_city_col]],
            left_on='PATIENT',
            right_on='Id',
            how='left',
            suffixes=('', '_patient')
        )
        
        # Check if provider city matches patient city or is in Fraser Health
        if 'CITY_org' in full_data.columns:
            provider_cities = full_data['CITY_org'].value_counts()
            print("Provider Cities:")
            for city, count in provider_cities.head(10).items():
                if pd.notna(city):
                    in_fraser = "✓" if city in FRASER_HEALTH_CITIES else "✗"
                    print(f"  {in_fraser} {city}: {count} encounters")
            print()
            
            # Test: Providers should be in Fraser Health
            providers_in_fraser = full_data['CITY_org'].isin(FRASER_HEALTH_CITIES).sum()
            total_encounters = len(full_data)
            pct_in_fraser = (providers_in_fraser / total_encounters) * 100 if total_encounters > 0 else 0
            
            if pct_in_fraser >= 90:
                log_test("Geography", "Provider Location", "PASS",
                        f"{pct_in_fraser:.1f}% of encounters are with Fraser Health providers",
                        {'pct_in_fraser': pct_in_fraser})
            elif pct_in_fraser >= 70:
                log_test("Geography", "Provider Location", "WARN",
                        f"Only {pct_in_fraser:.1f}% of encounters with Fraser Health providers",
                        {'pct_in_fraser': pct_in_fraser})
            else:
                log_test("Geography", "Provider Location", "FAIL",
                        f"Only {pct_in_fraser:.1f}% of encounters with Fraser Health providers",
                        {'pct_in_fraser': pct_in_fraser})
        else:
            log_test("Geography", "Provider Location", "WARN", 
                    "Provider city information not available in merged data")
    else:
        log_test("Geography", "Provider Location", "WARN", "Patient city column not found")
        
elif organizations_df is None:
    log_test("Geography", "Provider Location", "WARN", "Organizations data not loaded")
else:
    log_test("Geography", "Provider Location", "FAIL", "Required data not loaded")
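
When encounter and organization tables are joined, the resulting column names depend on which names the two frames share: pandas adds the `suffixes` only to overlapping columns. The toy example below illustrates the difference between relying on suffixes and renaming explicitly. The frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames (illustrative only); only 'Id' appears in both
encounters = pd.DataFrame(
    {'Id': ['e1', 'e2'], 'PATIENT': ['p1', 'p2'], 'ORGANIZATION': ['o1', 'o2']}
)
organizations = pd.DataFrame(
    {'Id': ['o1', 'o2'], 'NAME': ['Clinic A', 'Clinic B'], 'CITY': ['Surrey', 'Burnaby']}
)

# Suffixes alone: only the overlapping 'Id' column is renamed, 'CITY' keeps its name
suffixed = encounters.merge(
    organizations, left_on='ORGANIZATION', right_on='Id',
    how='left', suffixes=('_enc', '_org')
)
print(list(suffixed.columns))

# Explicit rename: every organization column gets the '_org' suffix before the merge
renamed = encounters.merge(
    organizations.add_suffix('_org'), left_on='ORGANIZATION', right_on='Id_org',
    how='left'
)
print(list(renamed.columns))
```

Renaming before the merge keeps the column names predictable regardless of which columns the encounter table happens to contain.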

### 2.3 Haversine Distance Check (Teleporting Patients)

In [None]:
if patients_df is not None and organizations_df is not None and haversine_available:
    print()
    print("="*70)
    print("TEST 2.3: PATIENT-PROVIDER DISTANCE CHECK")
    print("="*70)
    print()
    
    # ========== CALCULATE PATIENT-PROVIDER DISTANCES ==========
    # Use haversine formula to compute great-circle distance between patient home and provider
    # Identifies 'teleporting' patients who travel unrealistic distances for care
    
    # Check for coordinate columns
    patient_has_coords = 'LAT' in patients_df.columns and 'LON' in patients_df.columns
    org_has_coords = 'LAT' in organizations_df.columns and 'LON' in organizations_df.columns
    
    if patient_has_coords and org_has_coords and encounters_df is not None:
        # Sample encounters for distance calculation (to avoid performance issues)
        sample_size = min(100, len(encounters_df))
        sample_encounters = encounters_df.sample(n=sample_size, random_state=42) if len(encounters_df) > sample_size else encounters_df
        
        print(f"Analyzing {len(sample_encounters)} encounters for distance validation...")
        print()
        
        # ========== DETECT OUTLIERS ==========
        # Flag any patient traveling more than MAX_PATIENT_PROVIDER_DISTANCE_KM (100km) for routine healthcare
        # In Fraser Health region, most patients should be within 50km of their provider
        
        distances = []
        teleporting_count = 0
        
        for _, enc in sample_encounters.iterrows():
            patient_id = enc.get('PATIENT')
            org_id = enc.get('ORGANIZATION') if 'ORGANIZATION' in enc else enc.get('PROVIDER')
            
            if pd.notna(patient_id) and pd.notna(org_id):
                patient_row = patients_df[patients_df['Id'] == patient_id]
                org_row = organizations_df[organizations_df['Id'] == org_id]
                
                if not patient_row.empty and not org_row.empty:
                    patient_coords = (patient_row.iloc[0]['LAT'], patient_row.iloc[0]['LON'])
                    org_coords = (org_row.iloc[0]['LAT'], org_row.iloc[0]['LON'])
                    
                    # Calculate Haversine distance
                    distance = haversine(patient_coords, org_coords, unit=Unit.KILOMETERS)
                    distances.append(distance)
                    
                    if distance > MAX_PATIENT_PROVIDER_DISTANCE_KM:
                        teleporting_count += 1
        
        if distances:
            avg_distance = np.mean(distances)
            max_distance = np.max(distances)
            teleporting_pct = (teleporting_count / len(distances)) * 100
            
            print(f"Distance Statistics (km):")
            print(f"  Average: {avg_distance:.1f} km")
            print(f"  Maximum: {max_distance:.1f} km")
            print(f"  Median: {np.median(distances):.1f} km")
            print(f"  'Teleporting' patients (>{MAX_PATIENT_PROVIDER_DISTANCE_KM}km): {teleporting_count} ({teleporting_pct:.1f}%)")
            print()
            
            # Test: Flag if too many teleporting patients
            if teleporting_pct == 0:
                log_test("Geography", "Distance Check", "PASS",
                        f"No teleporting patients detected (all within {MAX_PATIENT_PROVIDER_DISTANCE_KM}km)",
                        {'avg_distance_km': avg_distance, 'max_distance_km': max_distance})
            elif teleporting_pct <= 5:
                log_test("Geography", "Distance Check", "PASS",
                        f"Only {teleporting_pct:.1f}% teleporting patients (≤5% threshold)",
                        {'avg_distance_km': avg_distance, 'max_distance_km': max_distance, 'teleporting_pct': teleporting_pct})
            elif teleporting_pct <= 10:
                log_test("Geography", "Distance Check", "WARN",
                        f"{teleporting_pct:.1f}% teleporting patients (>5% but ≤10%)",
                        {'avg_distance_km': avg_distance, 'max_distance_km': max_distance, 'teleporting_pct': teleporting_pct})
            else:
                log_test("Geography", "Distance Check", "FAIL",
                        f"{teleporting_pct:.1f}% teleporting patients (>10% threshold)",
                        {'avg_distance_km': avg_distance, 'max_distance_km': max_distance, 'teleporting_pct': teleporting_pct})
            
            # Plot distance distribution
            plt.figure(figsize=(10, 5))
            plt.hist(distances, bins=20, edgecolor='black', alpha=0.7, color='skyblue')
            plt.axvline(MAX_PATIENT_PROVIDER_DISTANCE_KM, color='red', linestyle='--', 
                       label=f'Threshold: {MAX_PATIENT_PROVIDER_DISTANCE_KM}km')
            plt.xlabel('Distance (km)')
            plt.ylabel('Frequency')
            plt.title('Patient-Provider Distance Distribution')
            plt.legend()
            plt.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.show()
        else:
            log_test("Geography", "Distance Check", "WARN", "No valid coordinate pairs found for distance calculation")
    else:
        missing = []
        if not patient_has_coords:
            missing.append("patient coordinates")
        if not org_has_coords:
            missing.append("organization coordinates")
        if encounters_df is None:
            missing.append("encounter data")
        log_test("Geography", "Distance Check", "WARN", 
                f"Cannot perform distance check - missing: {', '.join(missing)}")
else:
    if not haversine_available:
        log_test("Geography", "Distance Check", "WARN", "Haversine library not available")
    else:
        log_test("Geography", "Distance Check", "FAIL", "Required data not loaded")
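
The distance check relies on the haversine formula, which gives the great-circle distance between two latitude/longitude points. For reference, the formula can also be written directly with NumPy so that whole columns are processed at once. The sketch below is a standalone illustration, not the `haversine` package's implementation. `haversine_km` is a hypothetical function name, the Earth radius is the commonly used mean value, and the coordinates are approximate toy values.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars or equal-length arrays and returns a NumPy array or scalar.
    """
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Pair 0: two nearby points (approximate, illustrative coordinates)
# Pair 1: a quarter of the way around the equator, far beyond any threshold
patient_lat = np.array([49.19, 0.0])
patient_lon = np.array([-122.85, 0.0])
provider_lat = np.array([49.25, 0.0])
provider_lon = np.array([-122.98, 90.0])

distances = haversine_km(patient_lat, patient_lon, provider_lat, provider_lon)
too_far = distances > 100  # same idea as MAX_PATIENT_PROVIDER_DISTANCE_KM

print(np.round(distances, 1))
print(too_far.tolist())
```

A vectorized form like this would allow every encounter to be checked rather than a 100-row sample, provided patient and organization coordinates have first been merged onto the encounter rows.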

## Section 3: Clinical Timeline & FHIR R4 Compliance

### 3.1 Encounter Sequence Validation

In [None]:
if encounters_df is not None and patients_df is not None:
    print("="*70)
    print("TEST 3.1: CLINICAL ENCOUNTER SEQUENCE VALIDATION")
    print("="*70)
    
    print()
    
    # ========== VALIDATE CLINICAL TIMELINES ==========
    # Ensure encounters happen in logical chronological order
    # Example: inpatient stays should be preceded by ambulatory or emergency visits
    
    # Check for date and encounter type columns
    has_date = 'START' in encounters_df.columns or 'DATE' in encounters_df.columns
    has_type = 'ENCOUNTERCLASS' in encounters_df.columns or 'CLASS' in encounters_df.columns
    
    if has_date and has_type:
        date_col = 'START' if 'START' in encounters_df.columns else 'DATE'
        type_col = 'ENCOUNTERCLASS' if 'ENCOUNTERCLASS' in encounters_df.columns else 'CLASS'
        
        # Convert dates
        encounters_df[date_col] = pd.to_datetime(encounters_df[date_col], errors='coerce')
        
        # Sample patients for detailed analysis
        sample_patients = patients_df['Id'].sample(n=min(10, len(patients_df)), random_state=42).tolist()
        
        print(f"Analyzing encounter sequences for {len(sample_patients)} sample patients...")
        print()
        
        # ========== CHECK ENCOUNTER PATTERNS ==========
        # Look for suspicious patterns such as an inpatient admission that directly
        # follows an unrelated encounter type
        
        sequence_issues = []
        valid_sequences = 0
        
        # Encounter classes accepted immediately before an inpatient stay (compared in lowercase).
        # 'urgentcare' is included on the assumption that Synthea's CSV export uses that label.
        allowed_before_inpatient = {'emergency', 'ambulatory', 'urgent', 'urgentcare', 'inpatient'}
        
        for patient_id in sample_patients:
            patient_encounters = encounters_df[encounters_df['PATIENT'] == patient_id].sort_values(date_col)
            encounter_types = [str(t).lower() for t in patient_encounters[type_col].tolist()]
            patient_has_issue = False
            
            # Check each encounter against the one immediately before it
            # e.g., Inpatient should typically be preceded by Emergency or Ambulatory
            for i in range(1, len(encounter_types)):
                current_type = encounter_types[i]
                previous_type = encounter_types[i-1]
                
                if current_type == 'inpatient' and previous_type not in allowed_before_inpatient:
                    patient_has_issue = True
                    sequence_issues.append({
                        'patient_id': patient_id,
                        'sequence': f"{previous_type} -> {current_type}",
                        'issue': 'Inpatient without preceding appropriate encounter'
                    })
            
            # Patients with zero or one encounter have nothing to compare and count as valid,
            # which keeps the count consistent with the len(sample_patients) denominator
            if not patient_has_issue:
                valid_sequences += 1
        
        print(f"Sequence Validation Results:")
        print(f"  Valid sequences: {valid_sequences}/{len(sample_patients)}")
        print(f"  Issues found: {len(sequence_issues)}")
        
        if sequence_issues:
            print(f"\n  Sample issues:")
            for issue in sequence_issues[:3]:
                print(f"    - Patient {issue['patient_id'][:8]}...: {issue['sequence']}")
        print()
        
        # Test: Most sequences should be valid
        valid_pct = (valid_sequences / len(sample_patients)) * 100 if sample_patients else 0
        
        if valid_pct >= 80:
            log_test("Clinical", "Encounter Sequences", "PASS",
                    f"{valid_pct:.0f}% of sampled patients have valid encounter sequences",
                    {'valid_count': valid_sequences, 'total_sampled': len(sample_patients)})
        elif valid_pct >= 60:
            log_test("Clinical", "Encounter Sequences", "WARN",
                    f"Only {valid_pct:.0f}% of sampled patients have valid sequences",
                    {'valid_count': valid_sequences, 'total_sampled': len(sample_patients), 'issues': len(sequence_issues)})
        else:
            log_test("Clinical", "Encounter Sequences", "FAIL",
                    f"Only {valid_pct:.0f}% of sampled patients have valid sequences",
                    {'valid_count': valid_sequences, 'total_sampled': len(sample_patients), 'issues': len(sequence_issues)})
        
        # Show encounter type distribution
        encounter_type_dist = encounters_df[type_col].value_counts()
        print("Encounter Type Distribution:")
        for enc_type, count in encounter_type_dist.items():
            print(f"  {enc_type}: {count}")
            
    else:
        missing = []
        if not has_date:
            missing.append("date column")
        if not has_type:
            missing.append("encounter type column")
        log_test("Clinical", "Encounter Sequences", "WARN", 
                f"Cannot validate sequences - missing: {', '.join(missing)}")
else:
    log_test("Clinical", "Encounter Sequences", "FAIL", "Required data not loaded")
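
The sequence test walks each sampled patient's encounters in date order and compares every encounter with the one before it. The same comparison can be expressed for all patients at once with a grouped shift. The sketch below assumes the `PATIENT`, `START`, and `ENCOUNTERCLASS` column names, a hypothetical helper `flag_inpatient_transitions`, and an illustrative set of allowed predecessor classes that should be adjusted to the data.

```python
import pandas as pd

# Illustrative set of classes accepted immediately before an inpatient stay
ALLOWED_BEFORE_INPATIENT = {'emergency', 'ambulatory', 'urgent', 'inpatient'}

def flag_inpatient_transitions(encounters, date_col='START', type_col='ENCOUNTERCLASS'):
    """Return inpatient encounters whose previous encounter class is not allowed.

    A patient's first encounter has no predecessor and is never flagged.
    """
    ordered = encounters.sort_values(['PATIENT', date_col]).copy()
    ordered['CLASS_NORM'] = ordered[type_col].astype(str).str.lower()
    # Previous encounter class within each patient's own timeline
    ordered['PREV_CLASS'] = ordered.groupby('PATIENT')['CLASS_NORM'].shift(1)
    flagged = (
        (ordered['CLASS_NORM'] == 'inpatient')
        & ordered['PREV_CLASS'].notna()
        & ~ordered['PREV_CLASS'].isin(ALLOWED_BEFORE_INPATIENT)
    )
    return ordered.loc[flagged, ['PATIENT', date_col, 'PREV_CLASS', 'CLASS_NORM']]

# Toy encounters (illustrative only)
example = pd.DataFrame({
    'PATIENT': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'START': pd.to_datetime(
        ['2020-01-01', '2020-01-05', '2020-02-01', '2020-02-03', '2020-03-01']
    ),
    'ENCOUNTERCLASS': ['emergency', 'inpatient', 'wellness', 'Inpatient', 'inpatient'],
})

issues = flag_inpatient_transitions(example)
print(issues)
```

In the toy data only patient `p2` is flagged, because the inpatient stay follows a wellness visit. Patient `p3` starts with an inpatient stay and is not flagged, which matches the loop-based check in that neither looks at first encounters.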

### 3.2 FHIR R4 Bundle Validation

In [None]:
if fhir_available:
    print()
    print("="*70)
    print("TEST 3.2: FHIR R4 BUNDLE VALIDATION")
    
    # ========== VALIDATE FHIR CONFORMANCE ==========
    print("="*70)
    # FHIR R4 is the international standard for healthcare data exchange
    print()
    # Each patient should have a valid FHIR Bundle containing all their health records
    
    if FHIR_DIR.exists():
        fhir_files = list(FHIR_DIR.glob("*.json"))
        
        if fhir_files:
            print(f"Found {len(fhir_files)} FHIR JSON files")
            print()
            
            # Sample and validate FHIR files
            sample_size = min(10, len(fhir_files))
            sample_files = fhir_files[:sample_size]
            
            # ========== VERIFY BC-SPECIFIC REQUIREMENTS ==========
            # Patient address.state is expected to be 'BC' ('British Columbia' is also accepted),
            # which ensures data is properly tagged for provincial health systems
            valid_bundles = 0
            bc_address_count = 0
            validation_errors = []
            
            for fhir_file in sample_files:
                try:
                    with open(fhir_file, 'r') as f:
                        fhir_data = json.load(f)
                    
                    # Try to parse as FHIR Bundle (raises if the structure is invalid)
                    try:
                        Bundle.parse_obj(fhir_data)
                        valid_bundles += 1
                        
                        # Check for BC addresses (at most one count per Patient resource)
                        for entry in fhir_data.get('entry', []):
                            resource = entry.get('resource', {})
                            if resource.get('resourceType') == 'Patient':
                                for addr in resource.get('address', []):
                                    if addr.get('state') in ('BC', 'British Columbia'):
                                        bc_address_count += 1
                                        break
                    except Exception as e:
                        validation_errors.append({
                            'file': fhir_file.name,
                            'error': str(e)
                        })
                        
                except Exception as e:
                    validation_errors.append({
                        'file': fhir_file.name,
                        'error': f"Failed to load: {str(e)}"
                    })
            
            print(f"FHIR Validation Results:")
            print(f"  Valid R4 Bundles: {valid_bundles}/{sample_size}")
            print(f"  Files with BC addresses: {bc_address_count}")
            print(f"  Validation errors: {len(validation_errors)}")
            
            if validation_errors:
                print(f"\n  Sample errors:")
                for err in validation_errors[:3]:
                    print(f"    - {err['file']}: {err['error'][:80]}")
            print()
            
            # Test: All sampled files should be valid
            if valid_bundles == sample_size:
                log_test("FHIR", "R4 Bundle Validation", "PASS",
                        f"All {sample_size} sampled FHIR files are valid R4 bundles",
                        {'valid_count': valid_bundles, 'total_sampled': sample_size})
            elif valid_bundles >= sample_size * 0.8:
                log_test("FHIR", "R4 Bundle Validation", "WARN",
                        f"{valid_bundles}/{sample_size} sampled files are valid R4 bundles",
                        {'valid_count': valid_bundles, 'total_sampled': sample_size, 'errors': len(validation_errors)})
            else:
                log_test("FHIR", "R4 Bundle Validation", "FAIL",
                        f"Only {valid_bundles}/{sample_size} sampled files are valid R4 bundles",
                        {'valid_count': valid_bundles, 'total_sampled': sample_size, 'errors': len(validation_errors)})
            
            # Test BC addresses
            if bc_address_count > 0:
                log_test("FHIR", "BC Address Check", "PASS",
                        f"Found {bc_address_count} resources with BC addresses",
                        {'bc_address_count': bc_address_count})
            else:
                log_test("FHIR", "BC Address Check", "WARN",
                        "No BC addresses found in sampled FHIR files",
                        {'bc_address_count': 0})
        else:
            log_test("FHIR", "R4 Bundle Validation", "WARN", 
                    f"No FHIR JSON files found in {FHIR_DIR}")
    else:
        log_test("FHIR", "R4 Bundle Validation", "WARN", 
                f"FHIR directory not found: {FHIR_DIR}")
else:
    log_test("FHIR", "R4 Bundle Validation", "WARN", 
            "fhir.resources library not available - FHIR validation skipped")

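The BC address test above passes as soon as one sampled Patient has a BC address. A stricter variant would list every Patient address that is not tagged BC. The following is a minimal sketch of that idea, not part of the original suite: the helper name `non_bc_patient_addresses`, the `(missing)` placeholder, and the miniature bundle are all hypothetical and used only for illustration.

```python
from typing import Any, Dict, List

# Same two spellings the validation cell accepts
BC_STATE_VALUES = {"BC", "British Columbia"}


def non_bc_patient_addresses(bundle: Dict[str, Any]) -> List[str]:
    """Return the state value of every Patient address that is not BC.

    An empty list means all Patient addresses in the bundle are tagged BC.
    Addresses without a 'state' key are reported as '(missing)'.
    """
    offending = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue
        for addr in resource.get("address", []):
            state = addr.get("state", "(missing)")
            if state not in BC_STATE_VALUES:
                offending.append(state)
    return offending


# Hypothetical miniature bundle, for illustration only
example_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient",
                      "address": [{"city": "Surrey", "state": "BC"},
                                  {"city": "Calgary", "state": "AB"}]}},
        {"resource": {"resourceType": "Encounter"}},
    ],
}
print(non_bc_patient_addresses(example_bundle))  # ['AB']
```

Applied to each parsed `fhir_data` dictionary, a non-empty result could be logged as a WARN or FAIL, depending on how strictly the BC requirement should be enforced.
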
## Section 4: Fraser Health Specific Stress Test

### 4.1 Top Conditions Analysis

In [None]:
if conditions_df is not None:
    # ========== ANALYZE CONDITION PREVALENCE ==========
    # Identify the most common diagnoses in the synthetic population.
    # They should match expected chronic disease patterns in the Fraser Health region.
    print("="*70)
    print("TEST 4.1: TOP CONDITIONS ANALYSIS")
    print("="*70)
    print()
    
    # Check for required columns
    has_code = 'CODE' in conditions_df.columns or 'SNOMED_CODE' in conditions_df.columns
    has_description = 'DESCRIPTION' in conditions_df.columns
    
    if has_code:
        code_col = 'CODE' if 'CODE' in conditions_df.columns else 'SNOMED_CODE'
        
        # Get top conditions
        top_conditions = conditions_df[code_col].value_counts().head(10)
        
        print("Top 10 Conditions (by SNOMED code):")
        for i, (code, count) in enumerate(top_conditions.items(), 1):
            desc = ""
            if has_description:
                desc_row = conditions_df[conditions_df[code_col] == code][['DESCRIPTION']].iloc[0]
                desc = f" - {desc_row['DESCRIPTION']}"
            print(f"  {i}. Code {code}: {count} occurrences{desc}")
        print()
        
        # Check for expected common conditions in Fraser Health
        # Essential Hypertension is typically one of the most common
        condition_descriptions = []
        if has_description:
            # dropna() avoids a TypeError in the substring checks below when a description is missing
            condition_descriptions = conditions_df['DESCRIPTION'].dropna().str.lower().unique()
        
        has_hypertension = any('hypertension' in desc for desc in condition_descriptions)
        has_diabetes = any('diabetes' in desc for desc in condition_descriptions)
        
        print("Common Chronic Conditions Check:")
        print(f"  Hypertension: {'✓ Found' if has_hypertension else '✗ Not found'}")
        print(f"  Diabetes: {'✓ Found' if has_diabetes else '✗ Not found'}")
        print()
        
        if has_hypertension or has_diabetes:
            log_test("Conditions", "Common Conditions", "PASS",
                    "Dataset includes expected common chronic conditions",
                    {'top_10_codes': top_conditions.to_dict()})
        else:
            log_test("Conditions", "Common Conditions", "WARN",
                    "Expected chronic conditions (hypertension, diabetes) may be missing",
                    {'top_10_codes': top_conditions.to_dict()})
        
        # Store for seed stability test (a plain assignment is already global at notebook top level)
        baseline_top_conditions = top_conditions
        
    else:
        log_test("Conditions", "Top Conditions", "WARN", "SNOMED code column not found in conditions data")
else:
    log_test("Conditions", "Top Conditions", "WARN", "Conditions data not loaded - cannot analyze")
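
The ranking above counts condition records, so a patient with repeated records of the same diagnosis is counted more than once. A per-patient prevalence can be easier to compare against population statistics. The sketch below assumes the conditions table has `PATIENT`, `CODE`, and `DESCRIPTION` columns (as in Synthea's CSV export); the function name `condition_prevalence` and the example rows are hypothetical.

```python
import pandas as pd


def condition_prevalence(conditions: pd.DataFrame, n_patients: int, top_n: int = 10) -> pd.DataFrame:
    """Share of patients with at least one record of each condition code."""
    per_code = (
        conditions.groupby("CODE")
        .agg(patients=("PATIENT", "nunique"), description=("DESCRIPTION", "first"))
        .sort_values("patients", ascending=False)
        .head(top_n)
    )
    per_code["prevalence_pct"] = per_code["patients"] / n_patients * 100
    return per_code


# Hypothetical rows, for illustration only
example = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2", "p3"],
    "CODE": [59621000, 59621000, 59621000, 44054006],
    "DESCRIPTION": ["Essential hypertension"] * 3 + ["Diabetes"],
})
result = condition_prevalence(example, n_patients=4)
print(result)
```

Here `n_patients` would be the size of the full synthetic population (for example, the row count of the patients table), so that patients with no conditions still count in the denominator.
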

### 4.2 Multi-Seed Consistency Test

**Note:** This test requires running the simulation multiple times with different seeds, so this validation notebook only documents the approach. To actually perform multi-seed testing:

1. Run `fraser_health_onboarding.ipynb` with different `RANDOM_SEED` values (e.g., 12345, 12346, 12347, 12348, 12349)
2. Save outputs to different directories (e.g., `output_seed_12345/`, `output_seed_12346/`, etc.)
3. Load and compare results from each run

Below is a template for seed stability testing:

In [None]:
# ========== REPRODUCIBILITY TESTING ==========
# This section provides a template for multi-seed validation.
# Comparing runs with different random seeds checks that data generation is stable and consistent.
print("="*70)
print("TEST 4.2: MULTI-SEED CONSISTENCY (Template)")
print("="*70)
print()

print("Multi-Seed Testing Approach:")
print()
print("To test seed stability:")
print("  1. Run simulations with seeds: 12345, 12346, 12347, 12348, 12349")
print("  2. For each run, extract Top 10 conditions")
print("  3. Compare consistency across runs")
print()
print("Expected behavior:")
print("  - Top conditions should be similar across runs")
print("  - If 'Essential Hypertension' is #1 in one run, it should be in top 5 in others")
print("  - Major deviations indicate seed instability")
print()

# Template code for multi-seed comparison
seed_comparison_template = '''
# Example multi-seed comparison code:

seeds = [12345, 12346, 12347, 12348, 12349]
top_conditions_by_seed = {}

for seed in seeds:
    # Load conditions from output directory for this seed
    conditions_file = f"synthea/output_seed_{seed}/csv/conditions.csv"
    if os.path.exists(conditions_file):
        seed_conditions = pd.read_csv(conditions_file)
        top_10 = seed_conditions['CODE'].value_counts().head(10)
        top_conditions_by_seed[seed] = top_10
        
# Compare consistency
if len(top_conditions_by_seed) == len(seeds):
    # Collect the #1 condition code from each run
    top_codes = [tops.index[0] for tops in top_conditions_by_seed.values()]
    # Fraction of distinct #1 codes: 1/len(seeds) when every run agrees, 1.0 when none do
    distinct_ratio = len(set(top_codes)) / len(top_codes)
    
    if distinct_ratio < 0.3:  # With 5 seeds, only full agreement (0.2) passes
        print("✓ PASS: Top condition is consistent across seeds")
    else:
        print("✗ FAIL: High variation in top condition across seeds")
'''

print("Template code for multi-seed testing:")
print(seed_comparison_template)

log_test("Stress Test", "Multi-Seed Consistency", "WARN",
        "Multi-seed testing requires manual execution with different seed values",
        {'note': 'Run simulations with 5 different seeds and compare results'})
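
The expected behavior printed above ("if a condition is #1 in one run, it should be in the top 5 in others") can be checked directly once the per-seed rankings are available. The following is a minimal sketch under that reading of the rule; the function name `top_condition_stable`, the `within` parameter, and the ranked lists are hypothetical and stand in for real per-seed outputs.

```python
from typing import Dict, Hashable, List


def top_condition_stable(top_by_seed: Dict[int, List[Hashable]], within: int = 5) -> bool:
    """True if each run's #1 condition appears in the top `within` of every other run."""
    for seed, ranked in top_by_seed.items():
        if not ranked:
            return False  # a run with no conditions cannot be compared
        leader = ranked[0]
        for other_seed, other_ranked in top_by_seed.items():
            if other_seed != seed and leader not in other_ranked[:within]:
                return False
    return True


# Hypothetical ranked code lists (most frequent first), for illustration only
stable_runs = {
    12345: ["A", "B", "C", "D", "E", "F"],
    12346: ["B", "A", "C", "D", "E", "F"],
    12347: ["A", "C", "B", "E", "D", "F"],
}
unstable_runs = {
    12345: ["A", "B", "C", "D", "E", "F"],
    12346: ["F", "B", "C", "D", "E", "A"],
}
print(top_condition_stable(stable_runs))    # True
print(top_condition_stable(unstable_runs))  # False
```

With pandas, a ranked list for one seed can be obtained from `list(seed_conditions['CODE'].value_counts().index)`, which is already ordered from most to least frequent.
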

## Section 5: Validation Report Generation

### 5.1 Generate Summary Report

In [None]:
# ========== VALIDATION SUMMARY ==========
# Aggregate all test results and generate final pass/fail counts
print("\n")
print("="*70)
print("FRASER HEALTH SYNTHETIC DATA - VALIDATION SUMMARY REPORT")
print("="*70)
print()

# Overall statistics
total_tests = len(validation_results['tests'])
passed = validation_results['passed']
failed = validation_results['failed']
# Caution: this rebinds the name `warnings`, shadowing the module imported in Section 0.2
warnings = validation_results['warnings']

print(f"Validation Timestamp: {validation_results['timestamp']}")
print(f"Total Tests: {total_tests}")
print(f"  ✓ Passed: {passed} ({passed/total_tests*100:.1f}%)" if total_tests > 0 else "  ✓ Passed: 0")
print(f"  ✗ Failed: {failed} ({failed/total_tests*100:.1f}%)" if total_tests > 0 else "  ✗ Failed: 0")
print(f"  ⚠ Warnings: {warnings} ({warnings/total_tests*100:.1f}%)" if total_tests > 0 else "  ⚠ Warnings: 0")
print()

# Overall status
if failed == 0 and warnings == 0:
    overall_status = "✓ EXCELLENT - All tests passed"
elif failed == 0:
    overall_status = "✓ GOOD - All tests passed with some warnings"
elif failed <= total_tests * 0.1:
    overall_status = "⚠ ACCEPTABLE - Most tests passed, minor issues detected"
else:
    overall_status = "✗ NEEDS ATTENTION - Multiple test failures"

print(f"Overall Status: {overall_status}")
print()

# Group results by category
print("="*70)
print("RESULTS BY CATEGORY")
print("="*70)
print()

categories = {}
for test in validation_results['tests']:
    cat = test['category']
    if cat not in categories:
        categories[cat] = {'passed': 0, 'failed': 0, 'warned': 0, 'tests': []}
    
    categories[cat]['tests'].append(test)
    if test['status'] == 'PASS':
        categories[cat]['passed'] += 1
    elif test['status'] == 'FAIL':
        categories[cat]['failed'] += 1
    else:
        categories[cat]['warned'] += 1

for category, results in categories.items():
    total_cat = len(results['tests'])
    print(f"{category}:")
    print(f"  Total: {total_cat} | ✓ {results['passed']} | ✗ {results['failed']} | ⚠ {results['warned']}")
    
    # Show details
    for test in results['tests']:
        icon = '✓' if test['status'] == 'PASS' else '✗' if test['status'] == 'FAIL' else '⚠'
        print(f"    {icon} {test['test']}: {test['message']}")
    print()

### 5.2 Create Validation DataFrame

In [None]:
# Convert validation results to a DataFrame for easy analysis and export.
# This allows filtering, sorting, and statistical analysis of test results.
validation_df = pd.DataFrame(validation_results['tests'])

print("="*70)
print("VALIDATION RESULTS TABLE")
print("="*70)
print()

if not validation_df.empty:
    # Display summary table
    summary_table = validation_df.groupby(['category', 'status']).size().unstack(fill_value=0)
    print(summary_table)
    print()
    
    # Show failed tests in detail
    failed_tests = validation_df[validation_df['status'] == 'FAIL']
    if not failed_tests.empty:
        print("="*70)
        print("FAILED TESTS (Requires Attention)")
        print("="*70)
        print()
        for _, test in failed_tests.iterrows():
            print(f"Category: {test['category']}")
            print(f"Test: {test['test']}")
            print(f"Message: {test['message']}")
            if test['details']:
                print(f"Details: {test['details']}")
            print()
else:
    print("No validation results available")

# Make the dataframe available for export
print("\nValidation DataFrame created: 'validation_df'")
print("You can export it: validation_df.to_csv('validation_report.csv', index=False)")

### 5.3 Generate Markdown Report

In [None]:
# Create a human-readable Markdown report for documentation.
# This report can be committed to version control or shared with stakeholders.
markdown_report = []

markdown_report.append("# Fraser Health Synthetic Data - Validation Report\n")
markdown_report.append(f"**Generated:** {validation_results['timestamp']}\n\n")

markdown_report.append("## Executive Summary\n")
markdown_report.append(f"- **Total Tests:** {total_tests}\n")
markdown_report.append(f"- **Passed:** {passed} ({passed/total_tests*100:.1f}%)\n" if total_tests > 0 else "- **Passed:** 0\n")
markdown_report.append(f"- **Failed:** {failed} ({failed/total_tests*100:.1f}%)\n" if total_tests > 0 else "- **Failed:** 0\n")
markdown_report.append(f"- **Warnings:** {warnings} ({warnings/total_tests*100:.1f}%)\n" if total_tests > 0 else "- **Warnings:** 0\n")
markdown_report.append(f"- **Status:** {overall_status}\n\n")

markdown_report.append("## Results by Category\n\n")

for category, results in categories.items():
    markdown_report.append(f"### {category}\n\n")
    
    for test in results['tests']:
        icon = '✅' if test['status'] == 'PASS' else '❌' if test['status'] == 'FAIL' else '⚠️'
        markdown_report.append(f"{icon} **{test['test']}:** {test['message']}\n")
    
    markdown_report.append("\n")

# Add recommendations
markdown_report.append("## Recommendations\n\n")

if failed > 0:
    markdown_report.append("### Critical Issues\n\n")
    for test in validation_results['tests']:
        if test['status'] == 'FAIL':
            markdown_report.append(f"- **{test['category']} - {test['test']}:** {test['message']}\n")
    markdown_report.append("\n")

if warnings > 0:
    markdown_report.append("### Warnings\n\n")
    for test in validation_results['tests']:
        if test['status'] == 'WARN':
            markdown_report.append(f"- **{test['category']} - {test['test']}:** {test['message']}\n")
    markdown_report.append("\n")

if failed == 0 and warnings == 0:
    markdown_report.append("✅ No issues detected. The synthetic data meets all validation criteria.\n\n")

markdown_report.append("---\n")
markdown_report.append("*Report generated by Fraser Health Validation Suite*\n")

# Assemble the markdown report into a single string
report_content = ''.join(markdown_report)

print("="*70)
print("MARKDOWN REPORT")
print("="*70)
print()
print(report_content)

# Save to file (explicit UTF-8 so the emoji icons write correctly on any platform)
report_path = Path("validation_report.md")
with open(report_path, 'w', encoding='utf-8') as f:
    f.write(report_content)

print(f"\n✓ Markdown report saved to: {report_path}")

### 5.4 Save Validation Results

In [None]:
# Export validation results in multiple formats for different use cases:
# - JSON: Machine-readable format for automation and CI/CD pipelines
# - CSV: Spreadsheet format for analysis in Excel or other tools
# - Markdown: Human-readable report for documentation

# Save validation results as JSON
results_path = Path("validation_results.json")
with open(results_path, 'w') as f:
    # default=str stringifies values json cannot serialize natively (e.g. NumPy scalars)
    json.dump(validation_results, f, indent=2, default=str)

print(f"✓ Validation results saved to: {results_path}")

# Save DataFrame to CSV
if not validation_df.empty:
    csv_path = Path("validation_results.csv")
    validation_df.to_csv(csv_path, index=False)
    print(f"✓ Validation DataFrame saved to: {csv_path}")

print()
print("="*70)
print("VALIDATION COMPLETE")
print("="*70)
print()
print("Output files generated:")
print(f"  1. {results_path} - Full validation results (JSON)")
print(f"  2. validation_results.csv - Validation results table")
print(f"  3. validation_report.md - Human-readable report")
print()
print("Next steps:")
print("  - Review failed tests and warnings")
print("  - Address any critical issues")
print("  - Re-run validation after fixes")
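
Because the JSON export is intended for automation, a CI job could turn it into a pass/fail exit code. The sketch below assumes the saved file keeps the `failed` counter used throughout this notebook; the helper names `ci_exit_code` and `load_results` and the `max_failed` budget are hypothetical.

```python
import json
from pathlib import Path


def ci_exit_code(results: dict, max_failed: int = 0) -> int:
    """Return 0 when the failed-test count is within the allowed budget, else 1."""
    return 0 if results.get("failed", 0) <= max_failed else 1


def load_results(path: Path) -> dict:
    """Read a validation results file such as validation_results.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical counters, for illustration only
print(ci_exit_code({"passed": 8, "failed": 0, "warnings": 2}))  # 0
print(ci_exit_code({"passed": 7, "failed": 1, "warnings": 2}))  # 1
```

In a standalone script, `sys.exit(ci_exit_code(load_results(Path("validation_results.json"))))` would make the pipeline step fail whenever any test failed, while warnings alone would not block the run.
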

## Conclusion

This validation suite provides comprehensive testing of the Fraser Health synthetic data across four key dimensions:

1. **Demographics** - Ensures age and ethnicity distributions match BC/Fraser Health characteristics
2. **Geography** - Validates that patients and providers are within Fraser Health boundaries with reasonable distances
3. **Clinical** - Checks encounter sequences and FHIR R4 compliance
4. **Consistency** - Tests stability across multiple simulation runs

**How to Use This Notebook:**

1. First, run `fraser_health_onboarding.ipynb` to generate synthetic data
2. Run this validation notebook to check data quality
3. Review the validation report and address any failures or warnings
4. For multi-seed testing, manually run simulations with different seeds and compare results

**Interpreting Results:**

- ✓ **PASS**: Test met all criteria
- ⚠ **WARN**: Test passed but with minor concerns or missing optional data
- ✗ **FAIL**: Test did not meet criteria - requires attention

**Continuous Improvement:**

As you make changes to the data generation process, re-run this validation suite to ensure quality is maintained.