# **pyCoreRelator** [![GitHub](https://img.shields.io/badge/GitHub-pyCoreRelator-blue?logo=github)](https://github.com/GeoLarryLai/pyCoreRelator) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.xxxxxxxx.svg)](https://doi.org/10.5281/zenodo.xxxxxxxx)
## **Workshop Notebook #7: Compare Real Core Correlation to Synthetic Null Hypothesis**   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GeoLarryLai/pyCoreRelator/blob/main/pyCoreRelator_7_compare2syn.ipynb)
This notebook demonstrates the workflow for comparing real core pair correlation quality metrics against synthetic null hypothesis distributions using **pyCoreRelator**.

### Key Functions from **pyCoreRelator**
- **`load_core_log_data()`**: Load and process core log data with picked depths
- **`load_core_age_constraints()`**: Load age constraint data for cores
- **`load_pickeddepth_ages_from_csv()`**: Load estimated ages for picked boundaries
- **`run_multi_parameter_analysis()`**: Run correlation analysis with multiple parameter combinations
- **`calculate_quality_comparison_t_statistics()`**: Calculate statistical comparison metrics
- **`plot_quality_comparison_t_statistics()`**: Visualize comparison results

See [FUNCTION_DOCUMENTATION.md](https://github.com/GeoLarryLai/pyCoreRelator/blob/main/FUNCTION_DOCUMENTATION.md) for advanced usage and full function details.
<hr>

# **Check Installation**
Check whether **pyCoreRelator** is installed, and install it if needed.

In [None]:
%pip install pycorerelator

# **Import Packages**
Load correlation analysis and comparison functions from **pyCoreRelator**


In [None]:
from pyCoreRelator import (
    load_core_log_data,
    load_core_age_constraints,
    load_pickeddepth_ages_from_csv,
    run_multi_parameter_analysis,
    calculate_quality_comparison_t_statistics,
    plot_quality_comparison_t_statistics
)

%matplotlib inline


<hr>

# **Configure Core Pair Data**

## Define Core Pair

Select which cores to correlate (Core A and Core B)

In [None]:
CORE_A = "M9907-25PC"
# CORE_A = "M9907-23PC"

In [None]:
CORE_B = "M9907-23PC"
# CORE_B = "M9907-11PC"

## **Check and Download Example Data**
Download necessary example data if not already present


In [None]:
import os
import requests

BASE_URL = "https://github.com/GeoLarryLai/pyCoreRelator/raw/main/example_data"
cores = [CORE_A, CORE_B]  # Cores selected above; change as needed


def download(path):
    """Fetch one example file; write it only if the request succeeds."""
    try:
        response = requests.get(f"{BASE_URL}/{path}", timeout=60)
        response.raise_for_status()  # Treat HTTP errors (e.g. 404) as failures
    except requests.RequestException as err:
        print(f"  Could not download {path}: {err}")
        return
    with open(f"example_data/{path}", "wb") as f:
        f.write(response.content)


print("Checking for example data...")
if not all(os.path.exists(f"example_data/processed_data/{c}/{c}_hiresMS_MLfilled.csv") for c in cores):
    print("Downloading...")
    # Shared folders are created once, outside the per-core loop
    for folder in ["picked_datum", "raw_data/C14age_data", "analytical_outputs"]:
        os.makedirs(f"example_data/{folder}", exist_ok=True)
    for core in cores:
        os.makedirs(f"example_data/processed_data/{core}", exist_ok=True)
        for path in [f"processed_data/{core}/{core}_hiresMS_MLfilled.csv", f"processed_data/{core}/{core}_CT_MLfilled.csv",
                     f"processed_data/{core}/{core}_RGB_MLfilled.csv", f"picked_datum/{core}_pickeddepth.csv",
                     f"picked_datum/{core}_pickeddepth_ages_MonteCarlo.csv", f"raw_data/C14age_data/{core}_C14.csv"]:
            download(path)
    for path in ["analytical_outputs/synthetic_PDFs_hiresMS_CT_Lumin_corr_coef.csv",
                 "analytical_outputs/synthetic_PDFs_hiresMS_CT_Lumin_norm_dtw.csv"]:
        download(path)
    print("Download complete")
else:
    print("✓ Data already exists")

print("Ready to proceed")

## Define Log Data Paths and Column Structure

Configure which log types to use for correlation and specify file paths for each core.

In [None]:
# Define log columns to extract
LOG_COLUMNS = ['hiresMS', 'CT', 'Lumin']  # Choose which logs to include
# LOG_COLUMNS = ['hiresMS']  # Choose which logs to include

# Define paths for Core A
core_a_log_paths = {
    'hiresMS': f'example_data/processed_data/{CORE_A}/{CORE_A}_hiresMS_MLfilled.csv',
    'CT': f'example_data/processed_data/{CORE_A}/{CORE_A}_CT_MLfilled.csv',
    'Lumin': f'example_data/processed_data/{CORE_A}/{CORE_A}_RGB_MLfilled.csv',
}

# Define paths for Core B
core_b_log_paths = {
    'hiresMS': f'example_data/processed_data/{CORE_B}/{CORE_B}_hiresMS_MLfilled.csv',
    'CT': f'example_data/processed_data/{CORE_B}/{CORE_B}_CT_MLfilled.csv',
    'Lumin': f'example_data/processed_data/{CORE_B}/{CORE_B}_RGB_MLfilled.csv',
}
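
The two dictionaries above differ only in the core name, so they could also be generated by a small helper. The sketch below is illustrative only: `build_log_paths` and `LOG_FILE_SUFFIX` are hypothetical names, not part of **pyCoreRelator**, and the code assumes the `example_data` layout used in this notebook.

```python
# Hypothetical helper (not part of pyCoreRelator): build the log-path
# dictionary for any core, assuming the example_data layout used above.
LOG_FILE_SUFFIX = {
    'hiresMS': 'hiresMS_MLfilled.csv',
    'CT': 'CT_MLfilled.csv',
    'Lumin': 'RGB_MLfilled.csv',  # 'Lumin' is read from the RGB file, as above
}


def build_log_paths(core_name, log_columns=('hiresMS', 'CT', 'Lumin')):
    """Return {log name: CSV path} for one core."""
    base = f'example_data/processed_data/{core_name}'
    return {log: f'{base}/{core_name}_{LOG_FILE_SUFFIX[log]}' for log in log_columns}


example_paths = build_log_paths('M9907-25PC')
print(example_paths['Lumin'])
```

Calling `build_log_paths(CORE_A)` and `build_log_paths(CORE_B)` would reproduce the two dictionaries defined in the cell above.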

## Load Log Data and Picked Depths

**Function: `load_core_log_data()`**

**What it does:**
Loads and processes core log data from CSV files, resamples to common depth scale, and loads picked stratigraphic boundaries.

**Key Parameters:**
- `log_paths` *(dict)*: Dictionary mapping log names to CSV file paths
- `core_name` *(str)*: Name of the core for identification
- `log_columns` *(list, optional)*: List of log column names to extract from CSV files. If None, uses all keys from log_paths
- `depth_column` *(str, default='SB_DEPTH_cm')*: Name of the depth column in CSV files
- `normalize` *(bool, default=True)*: Whether to normalize log values to [0, 1] range
- `picked_datum` *(str, optional)*: Path to CSV file containing picked depths
- `categories` *(int or list, optional)*: Category or categories to filter (e.g., 1 or [1, 2])
- `show_fig` *(bool, default=True)*: Whether to display visualization

**Returns:**
- `log` (numpy.ndarray): Log data array with shape (n_samples, n_logs) or (n_samples,) for single log
- `md` (numpy.ndarray): Measured depth array
- `picked_depths` (list): List of picked depth values for specified categories
- `interpreted_bed` (list): List of interpreted bed names corresponding to picked depths
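
To make the resampling and `normalize=True` steps concrete, here is a minimal NumPy sketch of the general idea. It is not the library's implementation: the helper names (`minmax_normalize`, `resample_to_depths`), the linear interpolation, and the toy values are all assumptions made for illustration.

```python
import numpy as np


def minmax_normalize(values):
    """Scale a 1-D array to [0, 1]; NaNs are ignored when finding the range."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if hi == lo:  # constant log: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


def resample_to_depths(depth, values, target_depth):
    """Linearly interpolate a log onto a shared depth axis."""
    return np.interp(target_depth, depth, values)


# Two toy logs sampled at different depths (cm); values are made up
depth_ms, ms = np.array([0.0, 2.0, 4.0]), np.array([10.0, 30.0, 20.0])
depth_ct, ct = np.array([0.0, 1.0, 3.0, 4.0]), np.array([100.0, 200.0, 400.0, 300.0])

md = np.arange(0.0, 4.5, 1.0)  # common depth scale: 0, 1, 2, 3, 4 cm
log = np.column_stack([
    minmax_normalize(resample_to_depths(depth_ms, ms, md)),
    minmax_normalize(resample_to_depths(depth_ct, ct, md)),
])
print(log.shape)  # (5, 2): n_samples x n_logs
```

The resulting array has the same layout as the `log` return value described above: one row per depth sample and one column per log.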


In [None]:
# Load data for Core A
log_a, md_a, picked_depths_a, interpreted_bed_a = load_core_log_data(
    log_paths=core_a_log_paths,
    core_name=CORE_A,
    log_columns=LOG_COLUMNS,
    depth_column='SB_DEPTH_cm',
    picked_datum=f'example_data/picked_datum/{CORE_A}_pickeddepth.csv',
    categories=[1],
    show_fig=False
)

# Load data for Core B
log_b, md_b, picked_depths_b, interpreted_bed_b = load_core_log_data(
    log_paths=core_b_log_paths,
    core_name=CORE_B,
    log_columns=LOG_COLUMNS,
    depth_column='SB_DEPTH_cm',
    picked_datum=f'example_data/picked_datum/{CORE_B}_pickeddepth.csv',
    categories=[1],
    show_fig=False
)


## Load Age Constraints

**Function: `load_core_age_constraints()`**

**What it does:**
Loads age constraint data for a single core from CSV files in a directory. Searches for all CSV files containing the core name and combines them.

**Key Parameters:**
- `core_name` *(str)*: Name of the core to load age constraints for
- `age_base_path` *(str)*: Base directory path containing age constraint CSV files
- `data_columns` *(dict, default=None)*: Dictionary mapping standard column names to actual CSV column names. Required keys: 'age', 'pos_error', 'neg_error', 'min_depth', 'max_depth', 'in_sequence', 'core', 'interpreted_bed'
- `mute_mode` *(bool, default=False)*: If True, suppress all print statements

**Returns:**
- `age_data` (dict): Dictionary containing age constraint data with keys:
  - `depths` (list): Mean depths of age constraints
  - `ages` (list): Calibrated ages in years BP
  - `pos_errors` (list): Positive 2-sigma uncertainties
  - `neg_errors` (list): Negative 2-sigma uncertainties
  - `in_sequence_flags` (list): Boolean flags for stratigraphic sequence
  - `in_sequence_depths`, `in_sequence_ages`, `in_sequence_pos_errors`, `in_sequence_neg_errors` (lists): Filtered in-sequence data
  - `out_sequence_depths`, `out_sequence_ages`, `out_sequence_pos_errors`, `out_sequence_neg_errors` (lists): Out-of-sequence data
  - `core` (list): Core identifiers
  - `interpreted_bed` (list): Interpreted bed names
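
The structure of the returned dictionary can be sketched in plain Python. The rows below are made-up values, and the code only illustrates how mean depths and the in-sequence/out-of-sequence split relate to the input columns; how `load_core_age_constraints()` builds these lists internally is an assumption.

```python
# Made-up rows for illustration, keyed by the standard column names
rows = [
    {'age': 1200, 'min_depth': 10.0, 'max_depth': 12.0, 'in_sequence': True},
    {'age': 3400, 'min_depth': 50.0, 'max_depth': 53.0, 'in_sequence': True},
    {'age': 2100, 'min_depth': 80.0, 'max_depth': 82.0, 'in_sequence': False},
]

age_sketch = {
    # Mean depth of each dated interval
    'depths': [(r['min_depth'] + r['max_depth']) / 2 for r in rows],
    'ages': [r['age'] for r in rows],
    'in_sequence_flags': [r['in_sequence'] for r in rows],
}

# Split depths and ages by the in-sequence flag
for prefix, keep in (('in_sequence', True), ('out_sequence', False)):
    for key in ('depths', 'ages'):
        age_sketch[f'{prefix}_{key}'] = [
            value
            for value, flag in zip(age_sketch[key], age_sketch['in_sequence_flags'])
            if flag is keep
        ]

print(age_sketch['in_sequence_depths'])  # [11.0, 51.5]
print(age_sketch['out_sequence_ages'])   # [2100]
```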

In [None]:
staisch2024 = {
    'age': 'calib810_agebp',
    'pos_error': 'calib810_2sigma_pos', 
    'neg_error': 'calib810_2sigma_neg',
    'min_depth': 'mindepth_cm',
    'max_depth': 'maxdepth_cm',
    'in_sequence': 'in_sequence',
    'core': 'core',
    'interpreted_bed': 'interpreted_bed'
}

In [None]:
age_data_a = load_core_age_constraints(
    CORE_A,
    age_base_path='example_data/raw_data/C14age_data',
    data_columns=staisch2024
)

In [None]:
age_data_b = load_core_age_constraints(
    CORE_B,
    age_base_path='example_data/raw_data/C14age_data',
    data_columns=staisch2024
)

## Load Estimated Ages for Picked Boundaries

**Function: `load_pickeddepth_ages_from_csv()`**

**What it does:**
Loads pre-calculated interpolated ages for picked depth boundaries from a CSV file (the output of `calculate_interpolated_ages()`).

**Key Parameters:**
- `pickeddepth_age_csv` *(str)*: Path to CSV file containing pre-calculated ages

**Returns:**
- `age_data` (dict): Dictionary containing age estimates with keys:
  - `depths` (numpy.ndarray): Picked datum depths
  - `ages` (numpy.ndarray): Interpolated ages in years BP
  - `pos_uncertainties` (numpy.ndarray): Positive age uncertainties
  - `neg_uncertainties` (numpy.ndarray): Negative age uncertainties
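
For intuition about what "interpolated ages" means, the sketch below estimates ages at picked depths by plain linear interpolation between dated depths. This illustrates the idea only: the constraint values are made up, uncertainties are omitted, and the `'MonteCarlo'` and `'Gaussian'` methods used to produce the CSV files are not reproduced here.

```python
import numpy as np

# Made-up age constraints: depth (cm) and calibrated age (years BP)
constraint_depths = np.array([10.0, 50.0, 90.0])
constraint_ages = np.array([1000.0, 3000.0, 7000.0])

# Picked boundary depths that fall between the dated levels
picked = np.array([30.0, 70.0])
picked_ages = np.interp(picked, constraint_depths, constraint_ages)

# Same 'depths' / 'ages' keys as the dictionary described above
age_estimates = {'depths': picked, 'ages': picked_ages}
print(age_estimates['ages'])  # [2000. 5000.]
```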

In [None]:
# Define the uncertainty method: 'MonteCarlo', 'Linear', or 'Gaussian'
# (the download cell above fetches only the 'MonteCarlo' files)
uncertainty_method = 'MonteCarlo'

# Load pre-calculated ages for picked depths
pickeddepth_ages_a = load_pickeddepth_ages_from_csv(
    pickeddepth_age_csv=f"example_data/picked_datum/{CORE_A}_pickeddepth_ages_{uncertainty_method}.csv"
)

pickeddepth_ages_b = load_pickeddepth_ages_from_csv(
    pickeddepth_age_csv=f"example_data/picked_datum/{CORE_B}_pickeddepth_ages_{uncertainty_method}.csv"
)

## Perform Comprehensive DTW Analysis

**Function: `run_multi_parameter_analysis()`**

**What it does:**
1. Runs DTW correlation analysis for multiple parameter combinations (with/without age constraints)
2. Tests scenarios with progressive removal of age constraints (if enabled)
3. Computes quality metric distributions for each scenario
4. Fits probability distributions and calculates statistical parameters
5. Exports fit parameters and distribution data to CSV files for comparison

**Key Parameters:**
- `log_a`, `log_b` *(array-like)*: Core log data arrays
- `md_a`, `md_b` *(array-like)*: Measured depth arrays
- `picked_datum_a`, `picked_datum_b` *(list)*: Picked boundary depths (category 1)
- `datum_ages_a`, `datum_ages_b` *(dict)*: Age interpolation results for picked depths with keys: 'depths', 'ages', 'pos_uncertainties', 'neg_uncertainties'
- `core_a_age_data`, `core_b_age_data` *(dict)*: Age constraint data from `load_core_age_constraints()`
- `uncertainty_method` *(str)*: Age uncertainty calculation method ('MonteCarlo', 'Linear', or 'Gaussian')
- `core_a_name`, `core_b_name` *(str)*: Core identifiers for output file naming
- `output_csv_directory` *(str)*: Directory path where output CSV files will be saved
- `parameter_combinations` *(list of dict)*: List of parameter dictionaries to test. Each dict should contain: 'age_consideration', 'restricted_age_correlation', 'shortest_path_search'
- `target_quality_indices` *(list, default=['corr_coef', 'norm_dtw'])*: Quality metrics to analyze (e.g., ['corr_coef', 'norm_dtw', 'perc_diag'])
- `log_columns` *(list, optional)*: List of log column names (e.g., ['hiresMS', 'CT', 'Lumin']). If provided, creates subdirectory structure
- `test_age_constraint_removal` *(bool, default=True)*: Whether to test progressive age constraint removal scenarios
- `synthetic_csv_filenames` *(dict, default=None)*: Dictionary mapping quality_index to synthetic CSV filenames for consistent bin sizing
- `pca_for_dependent_dtw` *(bool, default=False)*: Use PCA for dependent multidimensional DTW
- `n_jobs` *(int, default=-1)*: Number of parallel jobs (-1 uses all available CPU cores, 1 for sequential)
- `max_search_per_layer` *(int or None, default=None)*: Maximum scenarios per constraint removal layer. Higher values yield more comprehensive results but increase computation time. If None, processes all scenarios
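
As background for the `corr_coef` and `norm_dtw` quality indices, the following is a generic textbook dynamic time warping (DTW) sketch on two toy 1-D logs. It is not pyCoreRelator's segment-based implementation: `dtw_path` is a hypothetical helper, and the metric definitions used here (accumulated cost divided by path length, and Pearson correlation of the aligned samples) are assumptions that may differ from the library's.

```python
import numpy as np


def dtw_path(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost.

    Returns the accumulated cost and the warping path as (i, j) index pairs.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])

    # Backtrack from the last cell to recover the optimal path
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),  # diagonal
            (acc[i - 1, j], i - 1, j),          # step in a only
            (acc[i, j - 1], i, j - 1),          # step in b only
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return float(acc[n, m]), path[::-1]


# Toy logs: b repeats the first sample of a, otherwise identical
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

total_cost, path = dtw_path(a, b)
norm_dtw = total_cost / len(path)                    # assumed normalization
ia, ib = zip(*path)
corr_coef = np.corrcoef(np.take(a, ia), np.take(b, ib))[0, 1]
print(total_cost, len(path))  # 0.0 6
```

Because the second log only stretches the first, the warping path absorbs the difference: the accumulated cost is zero and the aligned samples are perfectly correlated.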

<hr>

# **Run Multi-Parameter Correlation Analysis**

Execute correlation analysis with different parameter combinations to test various hypotheses and constraints.

In [None]:
# Define parameter combinations to test
parameter_combinations = [
    {'age_consideration': True, 'restricted_age_correlation': True, 'shortest_path_search': True},
    {'age_consideration': False, 'restricted_age_correlation': False, 'shortest_path_search': True}
]

# Run the multi-parameter analysis
run_multi_parameter_analysis(
    log_a, 
    log_b, 
    md_a, 
    md_b,
    picked_datum_a=picked_depths_a,
    picked_datum_b=picked_depths_b,
    datum_ages_a=pickeddepth_ages_a,
    datum_ages_b=pickeddepth_ages_b,
    core_a_age_data=age_data_a,
    core_b_age_data=age_data_b,
    uncertainty_method=uncertainty_method,
    core_a_name=CORE_A,
    core_b_name=CORE_B,
    log_columns=LOG_COLUMNS,
    max_search_per_layer=50,
    output_csv_directory=f'example_data/analytical_outputs/{CORE_A}_{CORE_B}',
    parameter_combinations=parameter_combinations
)

## Calculate Statistical Comparison Metrics

**Function: `calculate_quality_comparison_t_statistics()`**

**What it does:**
1. Loads real core correlation quality distributions from output CSV files
2. Loads synthetic null hypothesis distributions from synthetic PDF directory
3. Calculates t-statistics comparing real vs. synthetic distributions
4. Computes percentile rankings and significance levels
5. Stores statistical comparison results for visualization

**Key Parameters:**
- `target_quality_indices` *(list)*: Quality metrics to compare (e.g., ['corr_coef', 'norm_dtw'])
- `output_csv_directory` *(str)*: Directory path where output CSV files are located
- `input_syntheticPDF_directory` *(str)*: Directory path where synthetic/null hypothesis CSV files are located
- `core_a_name`, `core_b_name` *(str)*: Core identifiers
- `log_columns` *(list, optional)*: List of log column names (e.g., ['hiresMS', 'CT', 'Lumin']). If provided, creates subdirectory structure
- `mute_mode` *(bool, default=False)*: Suppress console output
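
The kind of comparison described above can be sketched with the standard library alone. The example computes Welch's t-statistic and a percentile rank for a real distribution against a synthetic one. Which t-test variant `calculate_quality_comparison_t_statistics()` actually uses is not stated in this notebook, so treat the choice of Welch's test, the helper names, and the made-up values as assumptions.

```python
import math
import statistics


def welch_t(sample, null_sample):
    """Welch's t-statistic for two samples with unequal variances."""
    m1, m2 = statistics.mean(sample), statistics.mean(null_sample)
    v1, v2 = statistics.variance(sample), statistics.variance(null_sample)
    return (m1 - m2) / math.sqrt(v1 / len(sample) + v2 / len(null_sample))


def percentile_rank(value, null_sample):
    """Percentage of null values less than or equal to `value`."""
    return 100.0 * sum(x <= value for x in null_sample) / len(null_sample)


# Made-up correlation coefficients, for illustration only
real = [0.62, 0.66, 0.70, 0.64, 0.68]
synthetic = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]

t_stat = welch_t(real, synthetic)
rank = percentile_rank(statistics.mean(real), synthetic)
print(f"t = {t_stat:.2f}, percentile rank of real mean = {rank:.0f}")
```

A large positive t-statistic together with a high percentile rank would indicate that the real correlation quality sits well above what the synthetic null hypothesis produces.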


<hr>

# **Compare Real Correlation to Synthetic Null Hypothesis**

Calculate statistical comparisons and visualize how real core correlation quality compares to synthetic distributions.

In [None]:
# Define quality indices to compare (should match those used in run_multi_parameter_analysis)
target_quality_indices = ['corr_coef', 'norm_dtw']

# Define directory paths
t_stats_dir = f'example_data/analytical_outputs/{CORE_A}_{CORE_B}'
syntheticPDF_dir = 'example_data/analytical_outputs'

# Calculate statistical comparison metrics
calculate_quality_comparison_t_statistics(
    target_quality_indices=target_quality_indices,
    output_csv_directory=t_stats_dir,
    input_syntheticPDF_directory=syntheticPDF_dir,
    core_a_name=CORE_A,
    core_b_name=CORE_B,
    log_columns=LOG_COLUMNS,
    mute_mode=False
)

## Visualize Comparison Results

### Generate Static Comparison Plots

**Function: `plot_quality_comparison_t_statistics()`**

**What it does:**
1. Creates visualizations comparing real core correlation quality to synthetic null hypothesis
2. Plots probability distribution curves overlaying real and synthetic data
3. Shows statistical significance and percentile rankings
4. Can generate static plots (PNG/PDF/SVG) or animated GIFs showing progressive constraint addition
5. Optionally highlights the optimal datum match solution

**Key Parameters:**
- `target_quality_indices` *(list)*: Quality metrics to plot (e.g., ['corr_coef', 'norm_dtw', 'perc_diag'])
- `output_csv_directory` *(str)*: Directory path where output CSV files are located (should contain t-statistics columns)
- `input_syntheticPDF_directory` *(str)*: Directory path where synthetic/null hypothesis CSV files are located
- `core_a_name` *(str)*: Name of core A for plot titles
- `core_b_name` *(str)*: Name of core B for plot titles
- `log_columns` *(list, optional)*: List of log column names (e.g., ['hiresMS', 'CT', 'Lumin']). If provided, creates subdirectory structure
- `mute_mode` *(bool, default=False)*: If True, suppress detailed output messages and show only essential progress information
- `save_fig` *(bool, default=False)*: If True, save static figures to files
- `output_figure_directory` *(str, default=None)*: Directory path where output figures will be saved (filenames auto-generated, only used when save_fig=True)
- `fig_format` *(list, default=['png'])*: List of file formats for saved figures. Accepted formats: 'png', 'jpg', 'svg', 'pdf', 'tiff'. Only used when save_fig=True
- `dpi` *(int, default=150)*: Resolution for saved figures in dots per inch. Only used when save_fig=True
- `save_gif` *(bool, default=False)*: If True, create animated GIF showing progressive addition of age constraints
- `output_gif_directory` *(str, default=None)*: Directory path where GIF files will be saved (filenames auto-generated, only used when save_gif=True)
- `max_frames` *(int, default=50)*: Maximum number of frames for GIF animations
- `plot_real_data_histogram` *(bool, default=False)*: If True, plot histograms for real data (no age and all age constraint cases)
- `plot_age_removal_step_pdf` *(bool, default=False)*: If True, plot all PDF curves including dashed lines for partially removed constraints
- `show_best_datum_match` *(bool, default=True)*: If True, plot vertical line showing best datum match value from sequential_mappings_csv
- `sequential_mappings_csv` *(str or dict, default=None)*: Path to CSV file(s) containing sequential mappings with 'Ranking_datums' column. Can be a single CSV path (str) or dictionary mapping quality indices to CSV paths
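
To show what overlaying probability distribution curves involves, the sketch below fits a normal distribution to a real and a synthetic sample and evaluates both PDFs on a shared grid, without drawing anything. The normal family, the grid, and the sample values are assumptions for illustration; the distributions that pyCoreRelator actually fits and plots may differ.

```python
from statistics import NormalDist

# Made-up quality values, for illustration only
real = [0.62, 0.66, 0.70, 0.64, 0.68]
synthetic = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]

# Fit one normal distribution per sample
real_fit = NormalDist.from_samples(real)
null_fit = NormalDist.from_samples(synthetic)

# Evaluate both PDFs on a shared grid, as a plotting routine would
grid = [i / 100 for i in range(0, 101)]
real_pdf = [real_fit.pdf(x) for x in grid]
null_pdf = [null_fit.pdf(x) for x in grid]

# Overlapping area of the two fitted curves (0 = disjoint, 1 = identical)
overlap = real_fit.overlap(null_fit)
print(f"real mean = {real_fit.mean:.2f}, overlap = {overlap:.2f}")
```

The two PDF lists are what a plot would draw as overlaid curves; a smaller overlap means the real distribution is more clearly separated from the null.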


In [None]:
plot_quality_comparison_t_statistics(
    target_quality_indices=target_quality_indices,
    output_csv_directory=t_stats_dir,
    input_syntheticPDF_directory=syntheticPDF_dir,
    core_a_name=CORE_A,
    core_b_name=CORE_B,
    log_columns=LOG_COLUMNS,
    save_fig=True,
    plot_real_data_histogram=True,
    output_figure_directory=t_stats_dir,   
    sequential_mappings_csv=f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/mappings_restricted_age_optimal.csv'
)

### Generate Animated GIF Showing Progressive Constraint Addition

Create animated visualizations showing how the quality metric distributions evolve as age constraints are progressively added.


In [None]:
plot_quality_comparison_t_statistics(
    target_quality_indices=target_quality_indices,
    output_csv_directory=t_stats_dir,
    input_syntheticPDF_directory=syntheticPDF_dir,
    core_a_name=CORE_A,
    core_b_name=CORE_B,
    log_columns=LOG_COLUMNS,
    save_gif=True, 
    output_gif_directory=t_stats_dir,
    plot_real_data_histogram=False,
    plot_age_removal_step_pdf=True,
    sequential_mappings_csv=f'example_data/analytical_outputs/{CORE_A}_{CORE_B}/{"_".join(LOG_COLUMNS)}/mappings_restricted_age_optimal.csv'
)