# **pyCoreRelator** [![GitHub](https://img.shields.io/badge/GitHub-pyCoreRelator-blue?logo=github)](https://github.com/GeoLarryLai/pyCoreRelator) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.xxxxxxxx.svg)](https://doi.org/10.5281/zenodo.xxxxxxxx)
## **Workshop Notebook #7: Compare Real Core Correlation to Synthetic Null Hypothesis**   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GeoLarryLai/pyCoreRelator/blob/main/pyCoreRelator_7_compare2syn.ipynb)
This notebook demonstrates the workflow for comparing real core pair correlation quality metrics against synthetic null hypothesis distributions using **pyCoreRelator**.

### Key Functions from **pyCoreRelator**
- **`load_log_data()`**: Load and process core log data
- **`load_core_age_constraints()`**: Load age constraint data for cores
- **`load_pickeddepth_ages_from_csv()`**: Load estimated ages for picked boundaries
- **`run_multi_parameter_analysis()`**: Run correlation analysis with multiple parameter combinations
- **`calculate_quality_comparison_t_statistics()`**: Calculate statistical comparison metrics
- **`plot_quality_comparison_t_statistics()`**: Visualize comparison results

For advanced usage and further details, see [FUNCTION_DOCUMENTATION.md](https://github.com/GeoLarryLai/pyCoreRelator/blob/main/FUNCTION_DOCUMENTATION.md).
<hr>

# **Import Packages**
Load the correlation analysis and comparison functions from **pyCoreRelator**.

In [None]:
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from pyCoreRelator import (
    load_log_data,
    load_core_age_constraints,
    load_pickeddepth_ages_from_csv,
    run_multi_parameter_analysis,
    calculate_quality_comparison_t_statistics,
    plot_quality_comparison_t_statistics
)

%matplotlib inline

<hr>

# **Configure Core Pair and Data Paths**

Define the core pairs to analyze and specify paths to log data and picked datum files.

## Define Core Pair

Select which two cores to correlate (Core A and Core B).

In [None]:
CORE_A = "M9907-25PC"
# CORE_A = "M9907-23PC"

In [None]:
CORE_B = "M9907-23PC"
# CORE_B = "M9907-11PC"

## Define Log Data Paths and Column Structure

Configure which log types to use for correlation and specify file paths for each core.

In [None]:
# Define log columns to extract
LOG_COLUMNS = ['hiresMS', 'CT', 'Lumin']  # Choose which logs to include
# LOG_COLUMNS = ['hiresMS']  # Choose which logs to include

# Define depth column
DEPTH_COLUMN = 'SB_DEPTH_cm'

# Define paths for Core A
core_a_log_paths = {
    'hiresMS': f'example_data/processed_data/{CORE_A}/{CORE_A}_hiresMS_MLfilled.csv',
    'CT': f'example_data/processed_data/{CORE_A}/{CORE_A}_CT_MLfilled.csv',
    'Lumin': f'example_data/processed_data/{CORE_A}/{CORE_A}_RGB_MLfilled.csv',
}

# Define paths for Core B
core_b_log_paths = {
    'hiresMS': f'example_data/processed_data/{CORE_B}/{CORE_B}_hiresMS_MLfilled.csv',
    'CT': f'example_data/processed_data/{CORE_B}/{CORE_B}_CT_MLfilled.csv',
    'Lumin': f'example_data/processed_data/{CORE_B}/{CORE_B}_RGB_MLfilled.csv',
}

# Define column mapping for alternative column names
column_alternatives = {
    'hiresMS': ['MS'],
    'CT': ['CT_value'],
    'Lumin': ['luminance', 'Luminance'],
}
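
The `column_alternatives` mapping above lists fallback column names for each log. The sketch below illustrates one way such a lookup could work. The helper `resolve_column` is hypothetical and is not part of pyCoreRelator; the library's actual matching rules may differ.

```python
import pandas as pd

def resolve_column(df, name, alternatives=None):
    """Hypothetical helper: return the first column in `df` that matches
    `name` or one of its alternative names, or None if nothing matches."""
    candidates = [name] + list((alternatives or {}).get(name, []))
    for candidate in candidates:
        if candidate in df.columns:
            return candidate
    return None

# Toy table whose magnetic susceptibility column is named 'MS', not 'hiresMS'
toy = pd.DataFrame({'SB_DEPTH_cm': [0.0, 1.0], 'MS': [12.5, 13.1]})

resolved = resolve_column(toy, 'hiresMS', {'hiresMS': ['MS']})
missing = resolve_column(toy, 'CT', {'CT': ['CT_value']})
print(resolved, missing)
```

Trying the primary name first keeps the mapping optional: files that already use the standard names need no entry in the dictionary.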

## Load Log Data

**Function: `load_log_data()`**

**What it does:**
Loads and processes core log data from CSV files, resamples the logs to a common depth scale, and optionally loads core images.

**Key Parameters:**
- `log_paths` *(dict)*: Dictionary mapping log names to CSV file paths
- `img_paths` *(dict, default=None)*: Dictionary mapping image types ('rgb', 'ct') to file paths or directories
- `log_columns` *(list)*: List of log column names to extract from CSV files
- `depth_column` *(str, default='SB_DEPTH_cm')*: Name of the depth column in CSV files
- `normalize` *(bool, default=True)*: Whether to normalize log values to [0, 1] range
- `column_alternatives` *(dict, default=None)*: Dictionary mapping log names to lists of alternative column names

**Returns:**
- `log` (numpy.ndarray): Log data array with shape (n_samples, n_logs) or (n_samples,) for single log
- `md` (numpy.ndarray): Measured depth array
- `available_columns` (list): Names of successfully loaded logs
- `rgb_img` (numpy.ndarray or None): RGB image array if available
- `ct_img` (numpy.ndarray or None): CT image array if available


<hr>

# **Load Core Data**

Load log data, picked depth boundaries, and age constraints for both cores.

In [None]:
# Load data for Core A
log_a, md_a, _, _, _ = load_log_data(
    core_a_log_paths,
    log_columns=LOG_COLUMNS,
    depth_column=DEPTH_COLUMN,
    normalize=True,
    column_alternatives=column_alternatives
)

# Load data for Core B
log_b, md_b, _, _, _ = load_log_data(
    core_b_log_paths,
    log_columns=LOG_COLUMNS,
    depth_column=DEPTH_COLUMN,
    normalize=True,
    column_alternatives=column_alternatives
)

## Load Picked Depth Boundaries

Load the manually picked stratigraphic boundaries (datums) for both cores from CSV files.

In [None]:
# Define paths to the CSV files
pickeddepth_a_csv = f'pickeddepth/{CORE_A}_pickeddepth.csv'
pickeddepth_b_csv = f'pickeddepth/{CORE_B}_pickeddepth.csv'

# Load picked depths and interpreted bed names for category 1 boundaries
if os.path.exists(pickeddepth_a_csv):
    picked_data_a = pd.read_csv(pickeddepth_a_csv)
    all_depths_a_cat1 = picked_data_a[picked_data_a['category'] == 1]['picked_depths_cm'].values.astype('float32')
    interpreted_bed_a_cat1 = picked_data_a[picked_data_a['category'] == 1]['interpreted_bed'].fillna('').values.astype('str')
else:
    print(f"Warning: {pickeddepth_a_csv} not found. Using empty arrays for Core A picks.")
    all_depths_a_cat1 = np.array([]).astype('float32')
    # NumPy arrays have no .fillna(); an empty string array needs no filling
    interpreted_bed_a_cat1 = np.array([]).astype('str')

if os.path.exists(pickeddepth_b_csv):
    picked_data_b = pd.read_csv(pickeddepth_b_csv)
    all_depths_b_cat1 = picked_data_b[picked_data_b['category'] == 1]['picked_depths_cm'].values.astype('float32')
    interpreted_bed_b_cat1 = picked_data_b[picked_data_b['category'] == 1]['interpreted_bed'].fillna('').values.astype('str')
else:
    print(f"Warning: {pickeddepth_b_csv} not found. Using empty arrays for Core B picks.")
    all_depths_b_cat1 = np.array([]).astype('float32')
    interpreted_bed_b_cat1 = np.array([]).astype('str')
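
The same loading logic runs once per core, so it could be folded into a small helper. The function `load_category_picks` below is a hypothetical convenience wrapper, not a pyCoreRelator function. It assumes the CSV columns used above (`picked_depths_cm`, `category`, `interpreted_bed`), and the demonstration data are made up.

```python
import os
import tempfile
import numpy as np
import pandas as pd

def load_category_picks(csv_path, category=1):
    """Hypothetical helper: return (depths, bed_names) for one pick category,
    or two empty arrays when the CSV file does not exist."""
    if not os.path.exists(csv_path):
        print(f"Warning: {csv_path} not found. Using empty arrays.")
        return np.array([], dtype='float32'), np.array([], dtype='str')
    picked = pd.read_csv(csv_path)
    selected = picked[picked['category'] == category]
    depths = selected['picked_depths_cm'].values.astype('float32')
    beds = selected['interpreted_bed'].fillna('').values.astype('str')
    return depths, beds

# Demonstration with a temporary, made-up pick file
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, 'demo_pickeddepth.csv')
    pd.DataFrame({
        'picked_depths_cm': [10.5, 20.0, 33.25],
        'category': [1, 2, 1],
        'interpreted_bed': ['T1', None, None],
    }).to_csv(demo_csv, index=False)

    demo_depths, demo_beds = load_category_picks(demo_csv)
    empty_depths, empty_beds = load_category_picks(os.path.join(tmp, 'missing.csv'))

print(demo_depths, demo_beds)
```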

## Load Age Constraints

**Function: `load_core_age_constraints()`**

**What it does:**
Loads age constraint data for a single core from CSV files in a directory. Searches for all CSV files containing the core name and combines them.

**Key Parameters:**
- `core_name` *(str)*: Name of the core to load age constraints for
- `age_base_path` *(str)*: Base directory path containing age constraint CSV files
- `data_columns` *(dict, default=None)*: Dictionary mapping standard column names to actual CSV column names. Required keys: 'age', 'pos_error', 'neg_error', 'min_depth', 'max_depth', 'in_sequence', 'core', 'interpreted_bed'
- `mute_mode` *(bool, default=False)*: If True, suppress all print statements

**Returns:**
- `age_data` (dict): Dictionary containing age constraint data with keys:
  - `depths` (list): Mean depths of age constraints
  - `ages` (list): Calibrated ages in years BP
  - `pos_errors` (list): Positive 2-sigma uncertainties
  - `neg_errors` (list): Negative 2-sigma uncertainties
  - `in_sequence_flags` (list): Boolean flags for stratigraphic sequence
  - `in_sequence_depths`, `in_sequence_ages`, `in_sequence_pos_errors`, `in_sequence_neg_errors` (lists): Filtered in-sequence data
  - `out_sequence_depths`, `out_sequence_ages`, `out_sequence_pos_errors`, `out_sequence_neg_errors` (lists): Out-of-sequence data
  - `core` (list): Core identifiers
  - `interpreted_bed` (list): Interpreted bed names
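
The returned dictionary separates in-sequence from out-of-sequence constraints. The sketch below shows how such a split could be derived from a table. It assumes that the "mean depth" is the midpoint of the minimum and maximum depth columns and that the in-sequence column already holds booleans. The helper `split_by_sequence` and all values are illustrative only, not the library's implementation.

```python
import pandas as pd

# Column mapping in the same shape as `data_columns` in this notebook (subset)
demo_columns = {
    'age': 'calib810_agebp',
    'min_depth': 'mindepth_cm',
    'max_depth': 'maxdepth_cm',
    'in_sequence': 'in_sequence',
}

def split_by_sequence(table, columns):
    """Hypothetical helper: compute midpoint depths and split the
    constraints by their in-sequence flag."""
    depths = (table[columns['min_depth']] + table[columns['max_depth']]) / 2.0
    ages = table[columns['age']]
    in_seq = table[columns['in_sequence']].astype(bool)
    return {
        'depths': depths.tolist(),
        'ages': ages.tolist(),
        'in_sequence_flags': in_seq.tolist(),
        'in_sequence_depths': depths[in_seq].tolist(),
        'in_sequence_ages': ages[in_seq].tolist(),
        'out_sequence_depths': depths[~in_seq].tolist(),
        'out_sequence_ages': ages[~in_seq].tolist(),
    }

# Made-up constraints; the third one is flagged as out of sequence
demo_table = pd.DataFrame({
    'calib810_agebp': [250, 900, 700, 1500],
    'mindepth_cm': [10.0, 40.0, 60.0, 90.0],
    'maxdepth_cm': [12.0, 44.0, 62.0, 94.0],
    'in_sequence': [True, True, False, True],
})
demo_age_data = split_by_sequence(demo_table, demo_columns)
print(demo_age_data['in_sequence_depths'], demo_age_data['out_sequence_depths'])
```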

In [None]:
data_columns = {
    'age': 'calib810_agebp',
    'pos_error': 'calib810_2sigma_pos', 
    'neg_error': 'calib810_2sigma_neg',
    'min_depth': 'mindepth_cm',
    'max_depth': 'maxdepth_cm',
    'in_sequence': 'in_sequence',
    'core': 'core',
    'interpreted_bed': 'interpreted_bed'
}

# Define the directory that contains the age constraint CSV files (adjust this path to your local setup)
age_base_path = '/Users/larryslai/Library/CloudStorage/Dropbox/My Documents/University of Texas Austin/(Project) NWP turbidites/Cascadia_core_data/Age constraints/Goldfinger2012'

# Load age constraints for both cores from the CSV files in that directory
age_data_a = load_core_age_constraints(CORE_A, age_base_path, data_columns=data_columns, mute_mode=True)
age_data_b = load_core_age_constraints(CORE_B, age_base_path, data_columns=data_columns, mute_mode=True)

## Load Estimated Ages for Picked Boundaries

**Function: `load_pickeddepth_ages_from_csv()`**

**What it does:**
Loads pre-calculated interpolated ages for picked depth boundaries from a CSV file (the output of `calculate_interpolated_ages()`).

**Key Parameters:**
- `pickeddepth_age_csv` *(str)*: Path to CSV file containing pre-calculated ages

**Returns:**
- `age_data` (dict): Dictionary containing age estimates with keys:
  - `depths` (numpy.ndarray): Picked datum depths
  - `ages` (numpy.ndarray): Interpolated ages in years BP
  - `pos_uncertainties` (numpy.ndarray): Positive age uncertainties
  - `neg_uncertainties` (numpy.ndarray): Negative age uncertainties
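
As an illustration of how the returned dictionary can be used, the sketch below builds a made-up dictionary with the four documented keys and derives an age interval for each picked datum. It assumes that both uncertainty arrays store positive magnitudes, so that the older bound is `ages + pos_uncertainties`. That sign convention is an assumption and should be checked against your own files.

```python
import numpy as np

# Made-up values arranged under the four keys documented above
age_data_demo = {
    'depths': np.array([12.0, 48.5, 97.0]),
    'ages': np.array([300.0, 1250.0, 2600.0]),
    'pos_uncertainties': np.array([40.0, 90.0, 150.0]),
    'neg_uncertainties': np.array([35.0, 80.0, 120.0]),
}

# Age interval per datum (assumes uncertainties are positive magnitudes)
older_bound = age_data_demo['ages'] + age_data_demo['pos_uncertainties']
younger_bound = age_data_demo['ages'] - age_data_demo['neg_uncertainties']

# Sanity check: ages should increase with depth in an undisturbed sequence
ages_increase_downcore = bool(np.all(np.diff(age_data_demo['ages']) > 0))

for depth, young, old in zip(age_data_demo['depths'], younger_bound, older_bound):
    print(f"{depth:6.1f} cm: {young:.0f} to {old:.0f} yr BP")
```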

In [None]:
cores = [CORE_A, CORE_B]
pickeddepth_ages = {}

# Define the uncertainty method: 'MonteCarlo', 'Linear', or 'Gaussian'
uncertainty_method = 'MonteCarlo'

for core in cores:
    core_age_csv = f"pickeddepth_ages/{core}_pickeddepth_ages_{uncertainty_method}.csv"
    pickeddepth_ages[core] = load_pickeddepth_ages_from_csv(core_age_csv)

# Assign to individual variables
if CORE_A in pickeddepth_ages:
    pickeddepth_ages_a = pickeddepth_ages[CORE_A]
if CORE_B in pickeddepth_ages:
    pickeddepth_ages_b = pickeddepth_ages[CORE_B]

## Perform Comprehensive DTW Analysis

**Function: `run_multi_parameter_analysis()`**

**What it does:**
1. Runs DTW correlation analysis for multiple parameter combinations (with/without age constraints)
2. Tests scenarios with progressive removal of age constraints (if enabled)
3. Computes quality metric distributions for each scenario
4. Fits probability distributions and calculates statistical parameters
5. Exports fit parameters and distribution data to CSV files for comparison

**Key Parameters:**
- `log_a`, `log_b` *(array-like)*: Core log data arrays
- `md_a`, `md_b` *(array-like)*: Measured depth arrays
- `all_depths_a_cat1`, `all_depths_b_cat1` *(array-like)*: Picked boundary depths (category 1)
- `pickeddepth_ages_a`, `pickeddepth_ages_b` *(dict)*: Age interpolation results for picked depths with keys: 'depths', 'ages', 'pos_uncertainties', 'neg_uncertainties'
- `age_data_a`, `age_data_b` *(dict)*: Age constraint data from `load_core_age_constraints()`
- `uncertainty_method` *(str)*: Age uncertainty calculation method ('MonteCarlo', 'Linear', or 'Gaussian')
- `parameter_combinations` *(list of dict)*: List of parameter dictionaries to test. Each dict should contain: 'age_consideration', 'restricted_age_correlation', 'shortest_path_search'
- `target_quality_indices` *(list)*: Quality metrics to analyze (e.g., ['corr_coef', 'norm_dtw', 'perc_diag'])
- `test_age_constraint_removal` *(bool, default=True)*: Whether to test progressive age constraint removal scenarios
- `core_a_name`, `core_b_name` *(str)*: Core identifiers for output file naming
- `output_csv_filenames` *(dict)*: Dictionary mapping quality_index to output CSV filename paths
- `synthetic_csv_filenames` *(dict, default=None)*: Dictionary mapping quality_index to synthetic CSV filenames for consistent bin sizing
- `pca_for_dependent_dtw` *(bool, default=False)*: Use PCA for dependent multidimensional DTW
- `n_jobs` *(int, default=-1)*: Number of parallel jobs (-1 uses all available CPU cores, 1 for sequential)
- `max_search_per_layer` *(int or None, default=None)*: Maximum scenarios per constraint removal layer. If None, processes all scenarios


<hr>

# **Run Multi-Parameter Correlation Analysis**

Execute correlation analysis with different parameter combinations to test various hypotheses and constraints.

In [None]:
# Define parameter combinations to test
parameter_combinations = [
    {'age_consideration': True, 'restricted_age_correlation': True, 'shortest_path_search': True},
    {'age_consideration': False, 'restricted_age_correlation': False, 'shortest_path_search': True}
]

# Define quality indices to process
target_quality_indices = ['corr_coef', 'norm_dtw']

output_csv_filenames = {}
for quality_index in target_quality_indices:
    output_csv_filenames[quality_index] = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/{quality_index}_fit_params.csv'

# Define synthetic CSV filenames for consistent bin sizing
synthetic_csv_filenames = {}
for quality_index in target_quality_indices:
    synthetic_csv_filenames[quality_index] = f'example_data/analytical_outputs/synthetic_PDFs_{"_".join(LOG_COLUMNS)}_{quality_index}.csv'

# Run the multi-parameter analysis
run_multi_parameter_analysis(
    # Core data inputs
    log_a=log_a, 
    log_b=log_b, 
    md_a=md_a, 
    md_b=md_b,
    all_depths_a_cat1=all_depths_a_cat1,
    all_depths_b_cat1=all_depths_b_cat1,
    pickeddepth_ages_a=pickeddepth_ages_a,
    pickeddepth_ages_b=pickeddepth_ages_b,
    age_data_a=age_data_a,
    age_data_b=age_data_b,
    uncertainty_method=uncertainty_method,
    
    # Analysis parameters
    parameter_combinations=parameter_combinations,
    target_quality_indices=target_quality_indices,
    
    # Core identifiers
    core_a_name=CORE_A,
    core_b_name=CORE_B,
    
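    # Output configuration
    output_csv_filenames=output_csv_filenames,
    synthetic_csv_filenames=synthetic_csv_filenames,  # Synthetic CSVs defined above, used for consistent bin sizing

    max_search_per_layer=50      # Maximum scenarios per constraint removal layer (higher = better coverage, longer runtime)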
)
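
The output paths above point into nested directories. The source does not say whether `run_multi_parameter_analysis()` creates missing directories itself, so the sketch below shows a defensive step that could be run beforehand. The helper `ensure_parent_dirs` is hypothetical, and the demonstration uses a temporary directory with made-up names.

```python
import os
import tempfile

def ensure_parent_dirs(filename_map):
    """Hypothetical helper: create the parent directory of every path in a
    {quality_index: path} dictionary if it does not exist yet."""
    for path in filename_map.values():
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)

# Demonstration inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_outputs = {
        'corr_coef': os.path.join(tmp, 'A_B', 'hiresMS', 'corr_coef_fit_params.csv'),
        'norm_dtw': os.path.join(tmp, 'A_B', 'hiresMS', 'norm_dtw_fit_params.csv'),
    }
    ensure_parent_dirs(demo_outputs)
    created = os.path.isdir(os.path.join(tmp, 'A_B', 'hiresMS'))

print(created)
```

In the notebook, the equivalent call would be `ensure_parent_dirs(output_csv_filenames)` before running the analysis.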

## Calculate Statistical Comparison Metrics

**Function: `calculate_quality_comparison_t_statistics()`**

**What it does:**
1. Loads real core correlation quality distributions from master CSV files
2. Loads synthetic null hypothesis distributions from synthetic CSV files
3. Calculates t-statistics comparing real vs. synthetic distributions
4. Computes percentile rankings and significance levels
5. Stores statistical comparison results for visualization

**Key Parameters:**
- `target_quality_indices` *(list)*: Quality metrics to compare (e.g., ['corr_coef', 'norm_dtw'])
- `master_csv_filenames` *(dict)*: Dictionary mapping quality indices to real data CSV paths
- `synthetic_csv_filenames` *(dict)*: Dictionary mapping quality indices to synthetic data CSV paths
- `CORE_A`, `CORE_B` *(str)*: Core identifiers for context
- `mute_mode` *(bool, default=False)*: Suppress console output


<hr>

# **Compare Real Correlation to Synthetic Null Hypothesis**

Calculate statistical comparisons and visualize how real core correlation quality compares to synthetic distributions.

In [None]:
target_quality_indices = ['corr_coef', 'norm_dtw']

# Define paths to real core correlation fit parameters
master_csv_filenames = {}
for quality_index in target_quality_indices:
    master_csv_filenames[quality_index] = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/{quality_index}_fit_params.csv'

# Define paths to synthetic null hypothesis distributions
synthetic_csv_filenames = {}
for quality_index in target_quality_indices:
    synthetic_csv_filenames[quality_index] = f'example_data/analytical_outputs/synthetic_PDFs_{"_".join(LOG_COLUMNS)}_{quality_index}.csv'

# Calculate statistical comparison metrics
calculate_quality_comparison_t_statistics(
    target_quality_indices=target_quality_indices,
    master_csv_filenames=master_csv_filenames,
    synthetic_csv_filenames=synthetic_csv_filenames,
    CORE_A=CORE_A,
    CORE_B=CORE_B,
    mute_mode=False
)

## Visualize Comparison Results

**Function: `plot_quality_comparison_t_statistics()`**

**What it does:**
1. Creates visualizations comparing real core correlation quality to synthetic null hypothesis
2. Plots probability distribution curves overlaying real and synthetic data
3. Shows statistical significance and percentile rankings
4. Can generate static plots (PNG/PDF/SVG) or animated GIFs showing progressive constraint addition
5. Optionally highlights the optimal datum match solution

**Key Parameters:**
- `target_quality_indices` *(list)*: Quality metrics to plot (e.g., ['corr_coef', 'norm_dtw', 'perc_diag'])
- `master_csv_filenames` *(dict)*: Dictionary mapping quality indices to master CSV file paths (should contain t-statistics columns)
- `synthetic_csv_filenames` *(dict)*: Dictionary mapping quality indices to synthetic CSV file paths
- `CORE_A` *(str)*: Name of core A for plot titles
- `CORE_B` *(str)*: Name of core B for plot titles
- `mute_mode` *(bool, default=False)*: If True, suppress detailed output messages and show only essential progress information
- `save_fig` *(bool, default=False)*: If True, save static figures to files
- `output_figure_filenames` *(dict, default=None)*: Dictionary mapping quality indices to output figure file paths (only used when save_fig=True)
- `save_gif` *(bool, default=False)*: If True, create animated GIF showing progressive addition of age constraints
- `output_gif_filenames` *(dict, default=None)*: Dictionary mapping quality indices to GIF file paths (only used when save_gif=True)
- `max_frames` *(int, default=50)*: Maximum number of frames for GIF animations
- `plot_real_data_histogram` *(bool, default=False)*: If True, plot histograms for real data (no age and all age constraint cases)
- `plot_age_removal_step_pdf` *(bool, default=False)*: If True, plot all PDF curves including dashed lines for partially removed constraints
- `show_best_datum_match` *(bool, default=True)*: If True, plot vertical line showing best datum match value from sequential_mappings_csv
- `sequential_mappings_csv` *(str or dict, default=None)*: Path to CSV file(s) containing sequential mappings with 'Ranking_datums' column. Can be a single CSV path (str) or dictionary mapping quality indices to CSV paths

### Generate Static Comparison Plots

In [None]:
# Define path to optimal mapping CSV (try restricted_age_optimal first, fallback to no_age_optimal)
sequential_mappings_csv = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/mappings_restricted_age_optimal.csv'
if not os.path.exists(sequential_mappings_csv):
    sequential_mappings_csv = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/mappings_no_age_optimal.csv'
    if not os.path.exists(sequential_mappings_csv):
        sequential_mappings_csv = None

# Define output paths for static comparison figures
output_figure_filenames = {}
for quality_index in target_quality_indices:
    output_figure_filenames[quality_index] = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/{quality_index}_compare2null.png'

# Generate static comparison plots
plot_quality_comparison_t_statistics(
    target_quality_indices=target_quality_indices,
    master_csv_filenames=master_csv_filenames,
    synthetic_csv_filenames=synthetic_csv_filenames,
    CORE_A=CORE_A,
    CORE_B=CORE_B,
    save_fig=True,
    plot_real_data_histogram=True,
    output_figure_filenames=output_figure_filenames,   # Acceptable formats: png, jpg, svg, pdf
    sequential_mappings_csv=sequential_mappings_csv
)
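
The nested `os.path.exists` checks at the top of the cell above pick the first mapping file that is available. The same fallback could be written as a small helper. `first_existing_path` is a hypothetical name, not a pyCoreRelator function, and the demonstration creates temporary files rather than touching the example data.

```python
import os
import tempfile

def first_existing_path(candidates):
    """Hypothetical helper: return the first path in `candidates` that
    exists on disk, or None if none of them do."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

# Demonstration with temporary files standing in for the mapping CSVs
with tempfile.TemporaryDirectory() as tmp:
    preferred = os.path.join(tmp, 'mappings_restricted_age_optimal.csv')
    fallback = os.path.join(tmp, 'mappings_no_age_optimal.csv')

    none_found = first_existing_path([preferred, fallback])   # nothing exists yet

    open(fallback, 'w').close()                                # create only the fallback
    chosen = first_existing_path([preferred, fallback])
    chosen_name = os.path.basename(chosen)

print(none_found, chosen_name)
```

Listing the candidates in priority order makes it easy to add further fallbacks without deepening the nesting.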

### Generate Animated GIF Showing Progressive Constraint Addition

Create animated visualizations showing how the quality metric distributions evolve as age constraints are progressively added.


In [None]:
# Define output paths for animated GIFs
output_gif_filenames = {}
for quality_index in target_quality_indices:
    output_gif_filenames[quality_index] = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/{quality_index}_compare2null.gif'

plot_quality_comparison_t_statistics(
    target_quality_indices=target_quality_indices,
    master_csv_filenames=master_csv_filenames,
    synthetic_csv_filenames=synthetic_csv_filenames,
    CORE_A=CORE_A,
    CORE_B=CORE_B,
    save_gif=True, 
    output_gif_filenames=output_gif_filenames,
    plot_real_data_histogram=False,
    plot_age_removal_step_pdf=True,
    sequential_mappings_csv=sequential_mappings_csv
)