# **11 - Exposure Data - GMPE Statistics Matching Script - ASC**

**IRDR0012 MSc Independent Research Project**

*   Candidate number: NWHL6
*   Institution: UCL IRDR
*   Supervisor: Dr. Roberto Gentile
*   Date: 01/09/2025
*   Version: v1.0

**Description:**

Matches each exposure asset to its nearest site model location and joins the result with the ASC (active shallow crust) GMPE statistics.

**INPUT FILES:**

*   Exposure model_combined.csv
*   sitemesh_ASC_83.csv
*   gmf_statistics_AkkarEtAlRjb2014.csv
*   gmf_statistics_ChiouYoungs2014.csv

**OUTPUT FILES:**

*   exposure_gmpe_AkkarEtAlRjb2014.csv
*   exposure_gmpe_ChiouYoungs2014.csv

# 0 - ENVIRONMENT SETUP

In [None]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set Arial font globally
plt.rcParams['font.family'] = 'Arial'
plt.rcParams['font.sans-serif'] = ['Arial', 'DejaVu Sans', 'sans-serif']

print("✅ Environment setup complete")
print("📋 Libraries loaded: pandas, numpy, scipy")

✅ Environment setup complete
📋 Libraries loaded: pandas, numpy, scipy


# 1 - GOOGLE DRIVE INTEGRATION

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Define file paths (UPDATE THESE PATHS TO YOUR INPUT FOLDER)
INPUT_FOLDER = '/content/drive/MyDrive/IRDR0012_Research Project/00 INPUT/'

# Input files
EXPOSURE_FILE = f'{INPUT_FOLDER}Exposure model_combined.csv'
SITEMESH_FILE = f'{INPUT_FOLDER}sitemesh_ASC_83.csv'

# GMPE statistics files
GMPE_FILES = {
    'AkkarEtAlRjb2014': f'{INPUT_FOLDER}gmf_statistics_AkkarEtAlRjb2014.csv',
    'ChiouYoungs2014': f'{INPUT_FOLDER}gmf_statistics_ChiouYoungs2014.csv',
}

# Output folder
OUTPUT_FOLDER = '/content/drive/MyDrive/IRDR0012_Research Project/01 OUTPUT/'

print("✅ Google Drive mounted successfully")
print("📁 File paths configured")
print(f"   Input folder: {INPUT_FOLDER}")
print(f"   Output folder: {OUTPUT_FOLDER}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully
📁 File paths configured
   Input folder: /content/drive/MyDrive/IRDR0012_Research Project/00 INPUT/
   Output folder: /content/drive/MyDrive/IRDR0012_Research Project/01 OUTPUT/


# 2 - DATA LOADING

In [None]:
def load_exposure_data(file_path):
    """
    Load and clean exposure model data

    Parameters:
    - file_path: Path to exposure CSV file

    Returns:
    - exposure_df: Cleaned exposure dataframe
    """

    print("🔄 Loading exposure data...")

    # Try a plain read first; fall back to a BOM-aware encoding if it fails
    try:
        exposure_df = pd.read_csv(file_path)
    except (UnicodeDecodeError, pd.errors.ParserError):
        # Retry with utf-8-sig, which also strips a leading byte-order mark
        exposure_df = pd.read_csv(file_path, encoding='utf-8-sig')

    # Clean column names
    exposure_df.columns = exposure_df.columns.str.strip()

    # Ensure required columns exist
    required_cols = ['id', 'lat', 'lon', 'taxonomy']
    missing_cols = [col for col in required_cols if col not in exposure_df.columns]

    if missing_cols:
        print(f"⚠️  Warning: Missing columns in exposure data: {missing_cols}")
        print(f"   Available columns: {list(exposure_df.columns)}")

    print(f"✅ Exposure data loaded: {len(exposure_df):,} assets")
    print(f"   Columns: {list(exposure_df.columns)}")

    return exposure_df

def load_sitemesh_data(file_path):
    """
    Load site mesh coordinate data

    Parameters:
    - file_path: Path to sitemesh CSV file

    Returns:
    - sitemesh_df: Site mesh dataframe
    """

    print("🔄 Loading site mesh data...")

    # Load with header on row 2 (skip first row)
    sitemesh_df = pd.read_csv(file_path, skiprows=1)
    sitemesh_df.columns = sitemesh_df.columns.str.strip()

    print(f"✅ Site mesh loaded: {len(sitemesh_df):,} sites")
    print(f"   Columns: {list(sitemesh_df.columns)}")

    return sitemesh_df

def load_gmpe_statistics(gmpe_files_dict):
    """
    Load all GMPE statistics files

    Parameters:
    - gmpe_files_dict: Dictionary of GMPE names and file paths

    Returns:
    - gmpe_data_dict: Dictionary of loaded GMPE dataframes
    """

    print("🔄 Loading GMPE statistics files...")

    gmpe_data = {}

    for gmpe_name, file_path in gmpe_files_dict.items():
        try:
            df = pd.read_csv(file_path)
            df.columns = df.columns.str.strip()
            gmpe_data[gmpe_name] = df
            print(f"   ✅ {gmpe_name}: {len(df):,} records")
        except Exception as e:
            print(f"   ❌ Failed to load {gmpe_name}: {e}")

    print(f"✅ GMPE statistics loaded: {len(gmpe_data)} files")

    return gmpe_data

# Load all data
exposure_data = load_exposure_data(EXPOSURE_FILE)
sitemesh_data = load_sitemesh_data(SITEMESH_FILE)
gmpe_statistics = load_gmpe_statistics(GMPE_FILES)

🔄 Loading exposure data...
✅ Exposure data loaded: 16,976 assets
   Columns: ['id', 'lat', 'lon', 'taxonomy', 'number', 'value', 'area']
🔄 Loading site mesh data...
✅ Site mesh loaded: 1,773 sites
   Columns: ['custom_site_id', 'lon', 'lat']
🔄 Loading GMPE statistics files...
   ✅ AkkarEtAlRjb2014: 5,944 records
   ✅ ChiouYoungs2014: 5,944 records
✅ GMPE statistics loaded: 2 files
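
The `skiprows=1` call in `load_sitemesh_data` assumes that the first line of the site mesh file is not the header. Below is a minimal sketch of that behaviour on an in-memory file. The metadata line, site IDs and coordinates are made up for illustration and are not taken from `sitemesh_ASC_83.csv`.

```python
import io

import pandas as pd

# Hypothetical file content: one metadata line, then the real header row.
raw_text = (
    "# metadata line written by the exporting tool (assumed)\n"
    "custom_site_id, lon, lat\n"
    "site_a,-8.50,31.10\n"
    "site_b,-7.10,30.98\n"
)

# Skip the first physical line so the second line is used as the header.
demo_sitemesh = pd.read_csv(io.StringIO(raw_text), skiprows=1)

# Strip stray spaces from the header names, as the notebook does.
demo_sitemesh.columns = demo_sitemesh.columns.str.strip()

print(list(demo_sitemesh.columns))
print(len(demo_sitemesh))
```

If the real file has no leading metadata line, `skiprows=1` would instead consume the header and treat the first data row as column names, so the printed column list is worth checking after loading.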


# 3 - SPATIAL MATCHING - EXPOSURE TO NEAREST SITES

In [None]:
def find_nearest_sites(exposure_df, sitemesh_df):
    """
    Find nearest site model location for each exposure asset using spatial distance

    Parameters:
    - exposure_df: Exposure assets with lat/lon
    - sitemesh_df: Site mesh with custom_site_id and coordinates

    Returns:
    - matched_df: Exposure data with nearest custom_site_id
    """

    print("🔄 Finding nearest site model locations for each exposure asset...")

    # Extract coordinates
    exposure_coords = exposure_df[['lat', 'lon']].values
    site_coords = sitemesh_df[['lat', 'lon']].values

    print(f"   Matching {len(exposure_coords):,} exposure assets to {len(site_coords):,} sites...")

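    # Calculate distance matrix and find nearest sites
    # Note: Euclidean distance in degrees is an approximation; it ignores that
    # a degree of longitude is shorter than a degree of latitude away from the equator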
    distances = cdist(exposure_coords, site_coords, metric='euclidean')
    nearest_indices = np.argmin(distances, axis=1)

    # Create matched dataframe
    matched_df = exposure_df.copy()
    matched_df['nearest_custom_site_id'] = sitemesh_df.iloc[nearest_indices]['custom_site_id'].values
    matched_df['distance_to_site'] = np.min(distances, axis=1)

    # Add nearest site coordinates for verification
    matched_df['site_lat'] = sitemesh_df.iloc[nearest_indices]['lat'].values
    matched_df['site_lon'] = sitemesh_df.iloc[nearest_indices]['lon'].values

    print(f"✅ Spatial matching complete")
    # 1 degree is roughly 111 km, so multiply by 111,000 to report metres
    print(f"   Average distance to nearest site: {matched_df['distance_to_site'].mean():.6f} degrees (~{matched_df['distance_to_site'].mean() * 111000:.0f}m)")
    print(f"   Maximum distance: {matched_df['distance_to_site'].max():.6f} degrees (~{matched_df['distance_to_site'].max() * 111000:.0f}m)")
    print(f"   Unique sites used: {matched_df['nearest_custom_site_id'].nunique():,} of {len(sitemesh_df):,}")

    # VERIFICATION: Check that every exposure ID has a corresponding custom_site_id
    print(f"\n🔍 VERIFICATION CHECK:")

    # Check for missing assignments
    missing_assignments = matched_df['nearest_custom_site_id'].isna().sum()
    print(f"   Exposure assets without site assignment: {missing_assignments:,}")

    # Check for valid custom_site_id matches
    valid_site_ids = set(sitemesh_df['custom_site_id'].unique())
    matched_site_ids = set(matched_df['nearest_custom_site_id'].dropna().unique())
    invalid_assignments = matched_site_ids - valid_site_ids

    if len(invalid_assignments) > 0:
        print(f"   ⚠️  Invalid site ID assignments: {len(invalid_assignments)}")
        print(f"      Invalid IDs: {list(invalid_assignments)[:5]}...")  # Show first 5
    else:
        print(f"   ✅ All assigned site IDs are valid")

    # Check assignment coverage
    total_exposure = len(matched_df)
    successfully_matched = len(matched_df.dropna(subset=['nearest_custom_site_id']))
    match_rate = (successfully_matched / total_exposure) * 100

    print(f"   Match success rate: {match_rate:.1f}% ({successfully_matched:,}/{total_exposure:,})")

    # Show assignment distribution
    assignment_counts = matched_df['nearest_custom_site_id'].value_counts()
    print(f"   Assets per site - Mean: {assignment_counts.mean():.1f}, Max: {assignment_counts.max()}, Min: {assignment_counts.min()}")

    # Identify exposure assets without matches (if any)
    if missing_assignments > 0:
        print(f"\n⚠️  WARNING: {missing_assignments:,} exposure assets could not be matched!")
        unmatched_assets = matched_df[matched_df['nearest_custom_site_id'].isna()]
        print(f"   Sample unmatched assets:")
        print(unmatched_assets[['id', 'lat', 'lon']].head())

        # Check if unmatched assets are outside the site model bounds
        site_lat_range = (sitemesh_df['lat'].min(), sitemesh_df['lat'].max())
        site_lon_range = (sitemesh_df['lon'].min(), sitemesh_df['lon'].max())
        print(f"   Site model coverage - Lat: {site_lat_range}, Lon: {site_lon_range}")

        unmatched_lat_range = (unmatched_assets['lat'].min(), unmatched_assets['lat'].max())
        unmatched_lon_range = (unmatched_assets['lon'].min(), unmatched_assets['lon'].max())
        print(f"   Unmatched assets range - Lat: {unmatched_lat_range}, Lon: {unmatched_lon_range}")
    else:
        print(f"   ✅ Perfect match: All exposure assets successfully assigned to sites")

    return matched_df

# Perform spatial matching
matched_exposure = find_nearest_sites(exposure_data, sitemesh_data)

# Display sample results
print("\n📋 Sample of Matched Data:")
sample_cols = ['id', 'lat', 'lon', 'taxonomy', 'nearest_custom_site_id', 'distance_to_site']
print(matched_exposure[sample_cols].head())

🔄 Finding nearest site model locations for each exposure asset...
   Matching 16,976 exposure assets to 1,773 sites...
✅ Spatial matching complete
   Average distance to nearest site: 0.000385 degrees (~0m)
   Maximum distance: 0.000720 degrees (~0m)
   Unique sites used: 1,233 of 1,773

🔍 VERIFICATION CHECK:
   Exposure assets without site assignment: 0
   ✅ All assigned site IDs are valid
   Match success rate: 100.0% (16,976/16,976)
   Assets per site - Mean: 13.8, Max: 99, Min: 1
   ✅ Perfect match: All exposure assets successfully assigned to sites

📋 Sample of Matched Data:
   id       lat      lon                                           taxonomy  \
0   1  31.11219 -8.48772  MCF+MO/LWAL+CDL/H:2/MIX/IR+IRPP:IRN+IRPS:IRN+I...   
1   2  30.97982 -7.10382  EU+ETR/LWAL+CDN/H:1/COM:5C/IR+IRPP:IRHO+IRPS:I...   
2   3  31.04150 -7.20561  EU+ETO/LWAL+CDN/H:1/RES/IR+IRPP:IRN+IRPS:IRN+I...   
3   4  31.07750 -7.26794  EU+ETO/LWAL+CDN/H:1/RES/IR+IRPP:IRN+IRPS:IRN+I...   
4   5  31.21061 -8
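
The matching above builds a full distance matrix (about 30 million entries for 16,976 assets and 1,773 sites) and measures distance in raw degrees. As an alternative sketch, not the method used in this notebook, the coordinates can first be projected to approximate metres and then queried with a KD-tree. The function names, the Earth radius constant and the toy coordinates below are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def to_local_metres(lat_deg, lon_deg, ref_lat_deg):
    """Equirectangular projection: degrees to approximate metres."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Longitude is scaled by cos(reference latitude) to correct its shrinkage.
    x = EARTH_RADIUS_M * lon * np.cos(np.radians(ref_lat_deg))
    y = EARTH_RADIUS_M * lat
    return np.column_stack([x, y])


def nearest_site_kdtree(asset_lat, asset_lon, site_lat, site_lon):
    """Return the index of and distance (m) to the nearest site per asset."""
    ref_lat = float(np.mean(site_lat))
    tree = cKDTree(to_local_metres(site_lat, site_lon, ref_lat))
    asset_xy = to_local_metres(asset_lat, asset_lon, ref_lat)
    dist_m, idx = tree.query(asset_xy, k=1)
    return idx, dist_m


# Toy data: three sites and two assets (hypothetical coordinates).
site_lat = np.array([31.0, 31.0, 31.1])
site_lon = np.array([-8.0, -7.9, -8.0])
asset_lat = np.array([31.01, 31.09])
asset_lon = np.array([-7.99, -7.99])

idx, dist_m = nearest_site_kdtree(asset_lat, asset_lon, site_lat, site_lon)
print(idx, np.round(dist_m))
```

The projection is only reasonable over a small region such as the study area here; for continental-scale data a great-circle distance would be the safer choice.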

# 4 - COMBINE EXPOSURE DATA WITH GMPE STATISTICS

In [None]:
def combine_exposure_with_gmpe(matched_exposure_df, gmpe_stats_dict):
    """
    Combine exposure data with GMPE statistics for each GMPE

    Parameters:
    - matched_exposure_df: Exposure data with nearest site assignments
    - gmpe_stats_dict: Dictionary of GMPE statistics dataframes

    Returns:
    - combined_results: Dictionary of combined dataframes per GMPE
    """

    print("🔄 Combining exposure data with GMPE statistics...")

    combined_results = {}

    for gmpe_name, gmpe_stats in gmpe_stats_dict.items():
        print(f"   Processing {gmpe_name}...")

        # Debug: Check columns before merge
        print(f"     Exposure columns: {list(matched_exposure_df.columns)}")
        print(f"     GMPE stats columns: {list(gmpe_stats.columns)}")
        print(f"     GMPE stats shape: {gmpe_stats.shape}")
        print(f"     Sample GMPE record: {gmpe_stats.iloc[0].to_dict() if len(gmpe_stats) > 0 else 'Empty'}")

        # Merge exposure data with GMPE statistics
        combined_df = matched_exposure_df.merge(
            gmpe_stats,
            left_on='nearest_custom_site_id',
            right_on='custom_site_id',
            how='left'
        )

        print(f"     Combined columns after merge: {list(combined_df.columns)}")
        print(f"     Combined shape: {combined_df.shape}")

        # Check if merge was successful
        successful_merges = combined_df['custom_site_id'].notna().sum()
        print(f"     Successful merges: {successful_merges:,} out of {len(combined_df):,}")

        # Check if key columns exist after merge
        key_check = {
            'id': 'id' in combined_df.columns,
            'lat': 'lat' in combined_df.columns,
            'lon': 'lon' in combined_df.columns,
            'taxonomy': 'taxonomy' in combined_df.columns,
            'imt': 'imt' in combined_df.columns
        }
        print(f"     Key columns check: {key_check}")

        # Select and rename columns according to specifications
        # Note: After merge, lat/lon from exposure become lat_x/lon_x, from GMPE become lat_y/lon_y
        output_columns = {
            'id': 'id',                    # from exposure
            'lat_x': 'lat',                # from exposure (renamed during merge)
            'lon_x': 'lon',                # from exposure (renamed during merge)
            'taxonomy': 'taxonomy',        # from exposure
            'imt': 'imt',                  # from GMPE stats
            'mean': 'mean',                # from GMPE stats
            'std': 'std',                  # from GMPE stats
            'p5': 'p5',                    # from GMPE stats
            'p16': 'p16',                  # from GMPE stats
            'p50': 'p50',                  # from GMPE stats
            'p84': 'p84',                  # from GMPE stats
            'p95': 'p95'                   # from GMPE stats
        }

        # Check which columns are available
        available_cols = {}
        missing_cols = []

        for old_name, new_name in output_columns.items():
            if old_name in combined_df.columns:
                available_cols[old_name] = new_name
            else:
                missing_cols.append(old_name)

        if missing_cols:
            print(f"     ⚠️  Missing columns: {missing_cols}")

        # Select and rename available columns
        if available_cols:
            final_df = combined_df[list(available_cols.keys())].rename(columns=available_cols)
        else:
            print(f"     ❌ No matching columns found! Using all columns.")
            final_df = combined_df

        # Add GMPE identifier
        final_df['gmpe'] = gmpe_name

        combined_results[gmpe_name] = final_df

        print(f"     ✅ {gmpe_name}: {len(final_df):,} records with {len(final_df.columns)} columns")

        # Check for missing matches
        if 'imt' in final_df.columns:
            missing_matches = final_df['imt'].isna().sum()
            if missing_matches > 0:
                print(f"     ⚠️  {missing_matches:,} assets couldn't be matched to GMPE statistics")

        # Show final column structure
        print(f"     Final columns: {list(final_df.columns)}")
        print(f"     Sample record shape: {final_df.shape}")

        # Debug: Show a sample of the data structure
        if len(final_df) > 0:
            print(f"     Sample row (first 5 columns): {final_df.iloc[0, :5].to_dict()}")


    print(f"✅ Exposure-GMPE combination complete for {len(combined_results)} GMPEs")

    return combined_results

# Combine exposure with GMPE statistics
combined_datasets = combine_exposure_with_gmpe(matched_exposure, gmpe_statistics)

# Display sample results
print("\n📋 Sample of Combined Results:")
if combined_datasets:
    sample_gmpe = list(combined_datasets.keys())[0]
    sample_df = combined_datasets[sample_gmpe]

    print(f"\n{sample_gmpe} Sample:")
    print(f"Available columns: {list(sample_df.columns)}")

    # Check which columns are actually available for display
    display_cols = []
    desired_cols = ['id', 'lat', 'lon', 'taxonomy', 'imt', 'p50', 'std']

    for col in desired_cols:
        if col in sample_df.columns:
            display_cols.append(col)
        else:
            print(f"⚠️  Column '{col}' not found in combined data")

    if display_cols:
        print(f"Displaying available columns: {display_cols}")
        print(sample_df[display_cols].head())
    else:
        print("⚠️  No standard columns found - showing first 5 columns:")
        print(sample_df.iloc[:, :5].head())

🔄 Combining exposure data with GMPE statistics...
   Processing AkkarEtAlRjb2014...
     Exposure columns: ['id', 'lat', 'lon', 'taxonomy', 'number', 'value', 'area', 'nearest_custom_site_id', 'distance_to_site', 'site_lat', 'site_lon']
     GMPE stats columns: ['custom_site_id', 'lon', 'lat', 'mean', 'std', 'count', 'p5', 'p16', 'p50', 'p84', 'p95', 'imt', 'gmpe']
     GMPE stats shape: (5944, 13)
     Sample GMPE record: {'custom_site_id': 'ev31wbdb', 'lon': -9.5432, 'lat': 29.7978, 'mean': 0.06244, 'std': 0.0, 'count': 1000, 'p5': 0.06244, 'p16': 0.06244, 'p50': 0.06244, 'p84': 0.06244, 'p95': 0.06244, 'imt': 'PGA', 'gmpe': 'AkkarEtAlRjb2014'}
     Combined columns after merge: ['id', 'lat_x', 'lon_x', 'taxonomy', 'number', 'value', 'area', 'nearest_custom_site_id', 'distance_to_site', 'site_lat', 'site_lon', 'custom_site_id', 'lon_y', 'lat_y', 'mean', 'std', 'count', 'p5', 'p16', 'p50', 'p84', 'p95', 'imt', 'gmpe']
     Combined shape: (67904, 24)
     Successful merges: 67,904 out
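
The combined shape of 67,904 rows follows from the merge itself: each GMPE statistics file holds one row per site and IMT, so each of the 16,976 assets is repeated once for each of the four IMTs reported in the summary. A small sketch of this expansion, using made-up site IDs and values and pandas' `indicator=True` option to count unmatched assets:

```python
import pandas as pd

# Hypothetical assets, each already assigned to a nearest site.
assets = pd.DataFrame({
    "id": [1, 2, 3],
    "nearest_custom_site_id": ["s1", "s1", "s2"],
})

# Hypothetical statistics table: one row per site and IMT.
stats = pd.DataFrame({
    "custom_site_id": ["s1", "s1", "s2", "s2"],
    "imt": ["PGA", "SA(0.3)", "PGA", "SA(0.3)"],
    "p50": [0.10, 0.20, 0.05, 0.12],
})

merged = assets.merge(
    stats,
    left_on="nearest_custom_site_id",
    right_on="custom_site_id",
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

n_imts = stats["imt"].nunique()
expected_rows = len(assets) * n_imts
unmatched = int((merged["_merge"] == "left_only").sum())

print(len(merged), expected_rows, unmatched)
```

Comparing the merged row count with assets times IMTs is a quick way to spot sites that are missing from the statistics file or that carry duplicate rows.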

# 5 - EXPORT RESULTS

In [None]:
def export_combined_results(combined_data_dict, output_folder):
    """
    Export combined results to separate CSV files for each GMPE

    Parameters:
    - combined_data_dict: Dictionary of combined dataframes
    - output_folder: Output directory path

    Returns:
    - exported_files: List of paths to the exported CSV files
    """

    print("🔄 Exporting combined results...")

    # Create output directory if it doesn't exist
    import os
    os.makedirs(output_folder, exist_ok=True)

    exported_files = []

    for gmpe_name, df in combined_data_dict.items():
        # Clean filename
        filename = f"exposure_gmpe_{gmpe_name}.csv"
        filepath = os.path.join(output_folder, filename)

        # Export to CSV
        df.to_csv(filepath, index=False)
        exported_files.append(filepath)

        print(f"   ✅ {gmpe_name}: {len(df):,} records → {filename}")

    print(f"✅ Export complete: {len(exported_files)} files created")
    print(f"   Output location: {output_folder}")

    return exported_files

# Export results
exported_file_paths = export_combined_results(combined_datasets, OUTPUT_FOLDER)

🔄 Exporting combined results...
   ✅ AkkarEtAlRjb2014: 67,904 records → exposure_gmpe_AkkarEtAlRjb2014.csv
   ✅ ChiouYoungs2014: 67,904 records → exposure_gmpe_ChiouYoungs2014.csv
✅ Export complete: 2 files created
   Output location: /content/drive/MyDrive/IRDR0012_Research Project/01 OUTPUT/
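
A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. The sketch below does this in a temporary directory; the file name `exposure_gmpe_demo.csv` and the data are hypothetical and stand in for the real outputs.

```python
import os
import tempfile

import pandas as pd

# Hypothetical combined data standing in for one GMPE's output.
demo_df = pd.DataFrame({
    "id": [1, 2],
    "imt": ["PGA", "SA(0.3)"],
    "p50": [0.10, 0.20],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "01 OUTPUT")
    os.makedirs(out_dir, exist_ok=True)

    path = os.path.join(out_dir, "exposure_gmpe_demo.csv")
    demo_df.to_csv(path, index=False)

    # Read the file back before the temporary directory is removed.
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == demo_df.shape
same_columns = list(reloaded.columns) == list(demo_df.columns)
print(same_shape, same_columns)
```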


# 6 - QUALITY CHECK AND SUMMARY

In [None]:
def generate_summary_report(combined_data_dict, matched_exposure_df):
    """
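    Generate summary report of the matching and combination process

    Parameters:
    - combined_data_dict: Dictionary of combined dataframes per GMPE
    - matched_exposure_df: Exposure data with nearest site assignments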
    """

    print("=" * 80)
    print("📊 EXPOSURE-GMPE MATCHING SUMMARY REPORT")
    print("=" * 80)

    print(f"\n🔢 PROCESSING SUMMARY:")
    print(f"   Original exposure assets: {len(matched_exposure_df):,}")
    print(f"   Unique sites matched: {matched_exposure_df['nearest_custom_site_id'].nunique():,}")
    print(f"   GMPEs processed: {len(combined_data_dict)}")

    print(f"\n📊 SPATIAL MATCHING QUALITY:")
    avg_distance_m = matched_exposure_df['distance_to_site'].mean() * 111000
    max_distance_m = matched_exposure_df['distance_to_site'].max() * 111000
    print(f"   Average distance to nearest site: {avg_distance_m:.0f} meters")
    print(f"   Maximum distance to nearest site: {max_distance_m:.0f} meters")

    print(f"\n📋 OUTPUT FILES CREATED:")
    for gmpe_name, df in combined_data_dict.items():
        print(f"   • exposure_gmpe_{gmpe_name}.csv: {len(df):,} records")

        # Show IMT distribution
        if 'imt' in df.columns:
            imt_counts = df['imt'].value_counts()
            print(f"     IMTs: {', '.join([f'{imt}({count:,})' for imt, count in imt_counts.items()])}")

    print(f"\n📄 OUTPUT STRUCTURE:")
    if combined_data_dict:
        sample_df = list(combined_data_dict.values())[0]
        print(f"   Columns: {', '.join(sample_df.columns)}")

    print(f"\n✅ READY FOR DAMAGE MATRIX CALCULATIONS")
    print(f"   Each file contains exposure assets with matched GMPE statistics")
    print(f"   Use appropriate GMPE for your tectonic setting:")
    print(f"   • ASC (Active): AkkarEtAlRjb2014, ChiouYoungs2014")

    print("=" * 80)

# Generate summary report
generate_summary_report(combined_datasets, matched_exposure)

print(f"\n🎯 SCRIPT COMPLETE!")
print(f"📁 Output files saved to: {OUTPUT_FOLDER}")
print(f"💻 Ready for fragility curve application and damage calculations")

📊 EXPOSURE-GMPE MATCHING SUMMARY REPORT

🔢 PROCESSING SUMMARY:
   Original exposure assets: 16,976
   Unique sites matched: 1,233
   GMPEs processed: 2

📊 SPATIAL MATCHING QUALITY:
   Average distance to nearest site: 43 meters
   Maximum distance to nearest site: 80 meters

📋 OUTPUT FILES CREATED:
   • exposure_gmpe_AkkarEtAlRjb2014.csv: 67,904 records
     IMTs: PGA(16,976), SA(0.3)(16,976), SA(0.6)(16,976), SA(1.0)(16,976)
   • exposure_gmpe_ChiouYoungs2014.csv: 67,904 records
     IMTs: PGA(16,976), SA(0.3)(16,976), SA(0.6)(16,976), SA(1.0)(16,976)

📄 OUTPUT STRUCTURE:
   Columns: id, lat, lon, taxonomy, imt, mean, std, p5, p16, p50, p84, p95, gmpe

✅ READY FOR DAMAGE MATRIX CALCULATIONS
   Each file contains exposure assets with matched GMPE statistics
   Use appropriate GMPE for your tectonic setting:
   • ASC (Active): AkkarEtAlRjb2014, ChiouYoungs2014

🎯 SCRIPT COMPLETE!
📁 Output files saved to: /content/drive/MyDrive/IRDR0012_Research Project/01 OUTPUT/
💻 Ready for fragility c
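
The recorded run reports the same distances as "~0m" in section 3 and as 43 and 80 meters in the summary. The difference comes from the conversion factor: one degree is roughly 111 km, so multiplying by 111 gives kilometres and multiplying by 111,000 gives metres. A short check of that arithmetic, using the distances printed in section 3:

```python
METRES_PER_DEGREE = 111_000  # rough conversion also used in the summary report

mean_deg = 0.000385  # average distance printed in section 3
max_deg = 0.000720   # maximum distance printed in section 3

# Multiplying by 111 yields kilometres, which rounds to zero at this scale.
mean_as_km = f"{mean_deg * 111:.0f}"

# Multiplying by 111,000 yields metres.
mean_as_m = f"{mean_deg * METRES_PER_DEGREE:.0f}"
max_as_m = f"{max_deg * METRES_PER_DEGREE:.0f}"

print(mean_as_km, mean_as_m, max_as_m)
```

The 111 km figure is itself an approximation that holds well for latitude; a degree of longitude is shorter at the study area's latitude of about 31°N.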