# **pyCoreRelator** [![GitHub](https://img.shields.io/badge/GitHub-pyCoreRelator-blue?logo=github)](https://github.com/GeoLarryLai/pyCoreRelator) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.xxxxxxxx.svg)](https://doi.org/10.5281/zenodo.xxxxxxxx)
## **Workshop Notebook #6: Synthetic Stratigraphy Analysis**   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GeoLarryLai/pyCoreRelator/blob/main/pyCoreRelator_6_synthetic_strat.ipynb)
This notebook demonstrates the workflow for generating synthetic stratigraphic sequences and analyzing correlation quality distributions using **pyCoreRelator**.

### Key Functions from **pyCoreRelator**
- **`load_segment_pool()`**: Load and organize turbidite segments from multiple cores
- **`modify_segment_pool()`**: Remove or modify segments in the pool
- **`plot_segment_pool()`**: Visualize all segments in the pool
- **`create_synthetic_log()`**: Generate synthetic core logs from segment pool
- **`plot_synthetic_log()`**: Visualize synthetic core logs
- **`run_comprehensive_dtw_analysis()`**: Perform DTW analysis on synthetic pairs
- **`find_complete_core_paths()`**: Search for complete correlation paths
- **`plot_correlation_distribution()`**: Plot quality metric distributions
- **`synthetic_correlation_quality()`**: Run multiple synthetic iterations
- **`plot_synthetic_correlation_quality()`**: Visualize synthetic correlation results

For advanced usage and full parameter details, see [FUNCTION_DOCUMENTATION.md](https://github.com/GeoLarryLai/pyCoreRelator/blob/main/FUNCTION_DOCUMENTATION.md).
<hr>


# **Check Installation**
Install **pyCoreRelator** if it is not already present (`pip` skips packages that are already installed).


In [None]:
%pip install pycorerelator

# **Import Packages**
Load synthetic stratigraphy and correlation analysis functions from **pyCoreRelator**

In [None]:
import os

from pyCoreRelator import (
    load_segment_pool,
    modify_segment_pool,
    plot_segment_pool,
    create_synthetic_log,
    plot_synthetic_log,
    run_comprehensive_dtw_analysis,
    find_complete_core_paths,
    plot_correlation_distribution,
    synthetic_correlation_quality,
    plot_synthetic_correlation_quality
)

%matplotlib inline


<hr>

# **Build Segment Pool from Real Cores**

Configure the cores and data sources to create a pool of stratigraphic segments for synthetic generation.


In [None]:
# LOG_COLUMNS = ['hiresMS']  # Single log option
LOG_COLUMNS = ['hiresMS', 'CT', 'Lumin']  # Multiple logs for correlation

SEGMENT_POOL_CORES = ["M9907-11PC", "M9907-23PC", "M9907-25PC"]

CORE_LOG_PATHS = {
    core_name: {
        'hiresMS': f'example_data/processed_data/{core_name}/{core_name}_hiresMS_MLfilled.csv',
        'CT': f'example_data/processed_data/{core_name}/{core_name}_CT_MLfilled.csv',
        'Lumin': f'example_data/processed_data/{core_name}/{core_name}_RGB_MLfilled.csv'
    }
    for core_name in SEGMENT_POOL_CORES
}

PICKED_DEPTH_PATHS = {
    core_name: f'example_data/picked_datum/{core_name}_pickeddepth.csv'
    for core_name in SEGMENT_POOL_CORES
}

## **Check and Download Example Data**
Download necessary example data if not already present


In [None]:
import requests

BASE_URL = "https://github.com/GeoLarryLai/pyCoreRelator/raw/main/example_data"

print("Checking for example data...")
if not all(os.path.exists(f"example_data/processed_data/{c}/{c}_hiresMS_MLfilled.csv") for c in SEGMENT_POOL_CORES):
    print("Downloading...")
    os.makedirs("example_data/picked_datum", exist_ok=True)
    for core in SEGMENT_POOL_CORES:
        os.makedirs(f"example_data/processed_data/{core}", exist_ok=True)
        for path in [f"processed_data/{core}/{core}_hiresMS_MLfilled.csv", f"processed_data/{core}/{core}_CT_MLfilled.csv",
                     f"processed_data/{core}/{core}_RGB_MLfilled.csv", f"picked_datum/{core}_pickeddepth.csv"]:
            try:
                response = requests.get(f"{BASE_URL}/{path}", timeout=60)
                response.raise_for_status()  # Do not save an HTTP error page as a CSV
                with open(f"example_data/{path}", "wb") as f:
                    f.write(response.content)
            except requests.RequestException as err:
                # Report the failure instead of silently skipping the file
                print(f"Could not download {path}: {err}")
    print("Download complete")
else:
    print("✓ Data already exists")
print("Ready to proceed")
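The check above tests only the `hiresMS` file of each core, so a partial download can go unnoticed. As a minimal sketch, the configured dictionaries could be scanned for any missing file before loading. The helper name `find_missing_files` and the demo paths are hypothetical and not part of **pyCoreRelator**.

```python
import os

def find_missing_files(log_paths, datum_paths):
    """Return every configured path that does not exist on disk."""
    # Flatten the nested {core: {log_name: path}} dict, then add the datum files.
    configured = [p for per_core in log_paths.values() for p in per_core.values()]
    configured += list(datum_paths.values())
    return [p for p in configured if not os.path.exists(p)]

# Hypothetical miniature configuration with the same shape as CORE_LOG_PATHS
# and PICKED_DEPTH_PATHS; these files intentionally do not exist.
demo_logs = {"coreX": {"hiresMS": "no_such_dir/coreX_hiresMS.csv"}}
demo_datum = {"coreX": "no_such_dir/coreX_pickeddepth.csv"}
missing = find_missing_files(demo_logs, demo_datum)
print(missing)
```

In the notebook, the same helper could be called as `find_missing_files(CORE_LOG_PATHS, PICKED_DEPTH_PATHS)`; an empty list would mean every configured file is in place.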

## Load Segment Pool from Multiple Cores

**Function: `load_segment_pool()`**

**What it does:**
Extracts stratigraphic segments from multiple cores to create a reusable pool for synthetic core generation.

**Key Parameters:**
- `core_names` *(list)*: List of core identifiers
- `log_data_csv` *(dict)*: Nested dictionary mapping core names to log file paths
- `log_data_type` *(list)*: List of log column names to extract
- `picked_datum` *(dict)*: Dictionary mapping core names to picked datum CSV files
- `depth_column` *(str, default='SB_DEPTH_cm')*: Name of the depth column
- `alternative_column_names` *(dict, default=None)*: Optional dictionary of alternative column names (e.g., {'hiresMS': ['MS'], 'CT': ['CT_value'], 'Lumin': ['luminance', 'Luminance']})
- `boundary_category` *(int, default=None)*: Category filter for segment boundaries. If None, uses category 1 if available, otherwise uses the lowest available category
- `neglect_topbottom` *(bool, default=True)*: Exclude top/bottom boundary segments

**Returns:**
- `segment_logs` (list): List of log arrays for each segment
- `segment_depths` (list): List of depth arrays for each segment
- `segment_info` (list): List of dictionaries with segment metadata


In [None]:
seg_logs, seg_depths, _ = load_segment_pool(
    core_names=SEGMENT_POOL_CORES,
    log_data_csv=CORE_LOG_PATHS,
    log_data_type=LOG_COLUMNS,
    picked_datum=PICKED_DEPTH_PATHS,
    depth_column='SB_DEPTH_cm'
)
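To picture what extracting segments between picked boundaries means, here is a conceptual pure-Python sketch. It is not the library's implementation: `split_at_boundaries` is a hypothetical helper, the data are made up, and re-zeroing each segment's depths is an assumption made so segments can be restacked later. The `keep_top_bottom` flag loosely mirrors the idea behind `neglect_topbottom`.

```python
from bisect import bisect_left

def split_at_boundaries(depths, values, picked, keep_top_bottom=False):
    """Cut one log into segments at the picked depths (hypothetical helper).

    depths must be sorted ascending; values is the log sampled at those depths.
    """
    cuts = sorted(picked)
    if keep_top_bottom:
        # Also keep the pieces above the first pick and below the last pick.
        cuts = [depths[0]] + cuts + [depths[-1]]
    segments = []
    for top, bottom in zip(cuts[:-1], cuts[1:]):
        i, j = bisect_left(depths, top), bisect_left(depths, bottom)
        if j > i:
            # Assumption: re-zero depths so each segment starts at 0.
            seg_depths = [d - depths[i] for d in depths[i:j + 1]]
            segments.append((seg_depths, values[i:j + 1]))
    return segments

depths = [0, 10, 20, 30, 40, 50]          # cm, made-up sampling
values = [1.0, 1.2, 3.5, 0.9, 2.2, 1.1]   # made-up log readings
segs = split_at_boundaries(depths, values, picked=[10, 30, 40])
for seg_depths, seg_values in segs:
    print(seg_depths, seg_values)
```

With three picks and the top and bottom pieces excluded, the toy log yields two segments; including them would yield four.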

## Visualize Segment Pool

**Function: `plot_segment_pool()`**

**What it does:**
Creates a visualization of all segments in the pool, displaying log traces for quality inspection.

**Key Parameters:**
- `segment_logs` *(list)*: List of log arrays from `load_segment_pool()`
- `segment_depths` *(list)*: List of depth arrays from `load_segment_pool()`
- `log_data_type` *(list)*: List of log names to plot
- `n_cols` *(int, default=10)*: Number of columns in the plot grid
- `figsize_per_row` *(float, default=3)*: Figure height per row
- `plot_segments` *(bool, default=True)*: Whether to display the plot
- `save_plot` *(bool, default=False)*: Whether to save the figure
- `plot_filename` *(str, default=None)*: Output filename for saved plot

In [None]:
plot_segment_pool(
    segment_logs=seg_logs,
    segment_depths=seg_depths,
    log_data_type=LOG_COLUMNS
)

## Modify Segment Pool (Optional)

**Function: `modify_segment_pool()`**

**What it does:**
Removes unwanted segments from the pool based on quality or suitability criteria.

**Key Parameters:**
- `segment_logs` *(list)*: List of log arrays
- `segment_depths` *(list)*: List of depth arrays
- `remove_list` *(list)*: List of segment indices to remove

**Returns:**
- `modified_segment_logs` (list): Filtered log arrays
- `modified_segment_depths` (list): Filtered depth arrays


In [None]:
exclude_segs = [18, 19, 20, 21, 22, 23, 24, 25, 26, 50, 51]

mod_seg_logs, mod_seg_depths = modify_segment_pool(seg_logs, seg_depths, remove_list=exclude_segs)
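The effect of removing segments by index can be sketched with two parallel lists. This is an illustration only: `drop_segments` is a hypothetical stand-in, and it assumes 0-based indices, which should be checked against the labels in the segment pool plot before choosing `remove_list` values.

```python
def drop_segments(segment_logs, segment_depths, remove_list):
    """Return copies of both lists without the entries named in remove_list."""
    drop = set(remove_list)
    keep = [i for i in range(len(segment_logs)) if i not in drop]
    return [segment_logs[i] for i in keep], [segment_depths[i] for i in keep]

logs = ["seg0", "seg1", "seg2", "seg3"]        # stand-ins for log arrays
depths = ["dep0", "dep1", "dep2", "dep3"]      # stand-ins for depth arrays
new_logs, new_depths = drop_segments(logs, depths, remove_list=[1, 2])
print(new_logs, new_depths)   # ['seg0', 'seg3'] ['dep0', 'dep3']
```

Because the result is a shorter list, the surviving segments take new positions. Indices read from a plot of the original pool therefore no longer apply to the filtered pool.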

In [None]:
plot_segment_pool(
    segment_logs=mod_seg_logs,
    segment_depths=mod_seg_depths,
    log_data_type=LOG_COLUMNS
)

<hr>

# **Generate Synthetic Core Pair**

Create synthetic stratigraphic sequences by randomly selecting segments from the pool.


### Create Synthetic Core A

**Function: `create_synthetic_log()`**

**What it does:**
Generates a synthetic core log by randomly selecting and stacking segments from the pool to reach a target thickness.

**Key Parameters:**
- `target_thickness` *(float)*: Desired total thickness in cm
- `segment_logs` *(list)*: List of available segment log arrays
- `segment_depths` *(list)*: List of available segment depth arrays
- `repetition` *(bool, default=False)*: Allow re-selecting the same segment

**Returns:**
- `synthetic_log` (numpy.ndarray): Combined log array
- `synthetic_md` (numpy.ndarray): Measured depth array
- `synthetic_picked_datum` (list): List of boundary depth values
- `selected_indices` (list): Indices of segments used from the pool


In [None]:
syn_log_a, syn_md_a, syn_depth_a, _ = create_synthetic_log(
    target_thickness=400,
    segment_logs=mod_seg_logs,
    segment_depths=mod_seg_depths
)
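The random stacking described above can be sketched on segment thicknesses alone. This is a simplified illustration under stated assumptions, not the function's actual code: `stack_segments`, its `seed` argument, and the thickness values are hypothetical, and the sketch stops as soon as the target is reached without trimming any overshoot.

```python
import random

def stack_segments(target_thickness, segment_thicknesses, repetition=False, seed=None):
    """Pick segments at random until their summed thickness reaches the target."""
    rng = random.Random(seed)
    available = list(range(len(segment_thicknesses)))
    chosen, boundaries, total = [], [], 0.0
    while total < target_thickness and available:
        idx = rng.choice(available)
        if not repetition:
            available.remove(idx)       # each segment can be used only once
        chosen.append(idx)
        total += segment_thicknesses[idx]
        boundaries.append(total)        # base depth of the segment just added
    return chosen, boundaries

thicknesses = [30.0, 55.0, 20.0, 80.0, 45.0, 60.0]   # cm, made-up pool
chosen, boundaries = stack_segments(150, thicknesses, seed=0)
print(chosen, boundaries)
```

Two points follow from this picture. Without repetition, a small pool can run out before the target thickness is reached. And because the notebook cells draw segments at random, rerunning them is expected to produce a different synthetic core each time.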

**Function: `plot_synthetic_log()`**

**What it does:**
Visualizes the generated synthetic core log with picked datum boundaries.

**Key Parameters:**
- `synthetic_log` *(numpy.ndarray)*: Synthetic log array from `create_synthetic_log()`
- `synthetic_md` *(numpy.ndarray)*: Synthetic measured depth array
- `synthetic_picked_datum` *(list)*: List of picked datum depth values
- `log_data_type` *(list)*: List of log names to plot
- `title` *(str, default=None)*: Plot title (optional)
- `save_plot` *(bool, default=False)*: Whether to save the figure
- `plot_filename` *(str, default=None)*: Output filename for saved plot


In [None]:
plot_synthetic_log(
    synthetic_log=syn_log_a,
    synthetic_md=syn_md_a,
    synthetic_picked_datum=syn_depth_a,
    log_data_type=LOG_COLUMNS
)

### Create Synthetic Core B

In [None]:
syn_log_b, syn_md_b, syn_depth_b, _ = create_synthetic_log(
    target_thickness=400,
    segment_logs=mod_seg_logs,
    segment_depths=mod_seg_depths
)

In [None]:
plot_synthetic_log(
    synthetic_log=syn_log_b,
    synthetic_md=syn_md_b,
    synthetic_picked_datum=syn_depth_b,
    log_data_type=LOG_COLUMNS
)


<hr>

# **Analyze Single Synthetic Core Pair**

Perform DTW correlation analysis on one synthetic pair to examine the distribution of correlation solutions.


## Run DTW Analysis on Synthetic Pair
**Function: `run_comprehensive_dtw_analysis()`**

**What it does:**
1. Creates segments between picked datum boundaries
2. Identifies valid segment pairs based on depth constraints
3. Performs DTW analysis on each valid segment pair
4. Calculates DTW distances and quality metrics

**Key Parameters:**
- `log_a` *(array)*: Synthetic Core A log data
- `log_b` *(array)*: Synthetic Core B log data
- `md_a` *(array)*: Synthetic Core A measured depths
- `md_b` *(array)*: Synthetic Core B measured depths
- `picked_datum_a` *(list)*: Picked datum depths for Core A
- `picked_datum_b` *(list)*: Picked datum depths for Core B
- `independent_dtw` *(bool, default=False)*: Use independent DTW for each log dimension
- `pca_for_dependent_dtw` *(bool, default=False)*: Use PCA for dependent multidimensional DTW
- `top_bottom` *(bool, default=False)*: Include top and bottom boundaries
- `mute_mode` *(bool, default=False)*: Suppress print output

**Returns:**
- `dtw_result` (dict): Dictionary containing DTW analysis results


In [None]:
dtw_result = run_comprehensive_dtw_analysis(
    syn_log_a, syn_log_b, syn_md_a, syn_md_b,
    picked_datum_a=syn_depth_a,
    picked_datum_b=syn_depth_b
)
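For readers new to dynamic time warping, the following is a textbook single-log DTW cost computation in pure Python. It is a generic sketch of the technique, not the routine used inside `run_comprehensive_dtw_analysis()`; the function name `dtw_cost`, the absolute-difference step cost, and the toy sequences are assumptions.

```python
def dtw_cost(a, b):
    """Accumulated DTW cost between two 1-D sequences (absolute difference)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]

seg_a = [0.0, 1.0, 3.0, 1.0, 0.0]
seg_b = [0.0, 1.0, 1.0, 3.0, 3.0, 1.0, 0.0]   # same shape, stretched
print(dtw_cost(seg_a, seg_b))       # 0.0: warping absorbs the stretch
print(dtw_cost(seg_a, [0.0] * 5))   # 5.0: a flat log cannot match the peak
```

The first pair shows why DTW suits core correlation: two logs that record the same pattern at different sedimentation rates can still align at zero cost.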

## Search for All Complete DTW Paths

**Function: `find_complete_core_paths()`**

**What it does:**
1. Searches through all valid segment pairs to find complete correlation paths
2. Calculates quality metrics for each complete path
3. Uses shortest path algorithms to optimize search
4. Exports all valid mappings to CSV file

**Key Parameters:**
- `dtw_result` *(dict)*: Dictionary containing DTW analysis results from `run_comprehensive_dtw_analysis()`. Expected keys: 'dtw_correlation', 'valid_dtw_pairs', 'segments_a', 'segments_b', 'depth_boundaries_a', 'depth_boundaries_b', 'dtw_distance_matrix_full'
- `log_a` *(array)*: Core A log data for metric computation
- `log_b` *(array)*: Core B log data for metric computation
- `output_csv` *(str, default='complete_core_paths.csv')*: Output CSV filename for mappings
- `debug` *(bool, default=False)*: Enable detailed progress reporting
- `start_from_top_only` *(bool, default=True)*: Only start paths from top segments
- `batch_size` *(int, default=1000)*: Processing batch size for memory management
- `n_jobs` *(int, default=-1)*: Number of parallel jobs (-1 uses all CPU cores)
- `shortest_path_search` *(bool, default=True)*: Keep only shortest path lengths during search
- `shortest_path_level` *(int, default=2)*: Number of shortest unique lengths to keep (higher = more segments)
- `max_search_path` *(int, default=5000)*: Maximum paths per segment pair to prevent memory overflow. Higher values make the solution search more comprehensive at the cost of memory and run time (e.g., 100000 typically works well).
- `output_metric_only` *(bool, default=False)*: If True, only output quality metrics without full path details
- `mute_mode` *(bool, default=False)*: Suppress all print output
- `pca_for_dependent_dtw` *(bool, default=False)*: Use PCA for dependent DTW quality calculations

**Returns:**
- `complete_path_search_result` (dict): Dictionary containing:
    - `complete_paths` (list): All complete correlation paths with quality metrics
    - `num_paths` (int): Total number of complete paths found
    - `csv_file` (str): Path to output CSV file containing all mappings


In [None]:
_ = find_complete_core_paths(
    dtw_result,
    syn_log_a, 
    syn_log_b,
    output_csv=f"example_data/analytical_outputs/temp_synthetic_{'_'.join(LOG_COLUMNS)}_core_pair_metrics.csv"
)
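One way to picture a "complete path" is as a chain of valid segment pairs that runs from the top boundary of both cores to the bottom boundary of both. The sketch below enumerates such chains on a toy example. The data model (pairs written as boundary-index tuples), the helper `enumerate_complete_paths`, and the example pairs are all assumptions for illustration and may differ from how `find_complete_core_paths()` represents its search.

```python
def enumerate_complete_paths(valid_pairs, end):
    """List every chain of segment pairs that links boundary (0, 0) to `end`.

    Each pair is (a_top, b_top, a_bottom, b_bottom) in boundary indices and must
    move downward in at least one core, so the recursion always terminates.
    """
    next_nodes = {}
    for a_top, b_top, a_bot, b_bot in valid_pairs:
        next_nodes.setdefault((a_top, b_top), []).append((a_bot, b_bot))

    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)      # reached the bottom of both cores
            return
        for nxt in next_nodes.get(node, []):
            walk(nxt, path + [nxt])

    walk((0, 0), [(0, 0)])
    return paths

# Hypothetical valid pairs for two cores with three boundaries each (0, 1, 2).
pairs = [(0, 0, 1, 1), (1, 1, 2, 2), (0, 0, 2, 2), (0, 0, 1, 2)]
complete = enumerate_complete_paths(pairs, end=(2, 2))
for p in complete:
    print(p)
```

Here the pair ending at `(1, 2)` is a dead end, so only two chains are complete. With realistic numbers of boundaries the count of chains grows quickly, which is the situation a cap such as `max_search_path` is meant to keep manageable.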

## Visualize Quality Metric Distributions

**Function: `plot_correlation_distribution()`**

**What it does:**
1. Loads all mappings from CSV file
2. Extracts quality metrics for the target mapping
3. Plots histogram and probability distribution
4. Compares target mapping against all other solutions
5. Saves distribution plot as PNG

**Key Parameters:**
- `mapping_csv` *(str)*: Path to mappings CSV or Parquet file
- `target_mapping_id` *(int, default=None)*: ID of mapping to highlight in the plot (optional)
- `quality_index` *(str)*: Quality metric to plot - **required**. Options: 'corr_coef', 'norm_dtw', 'dtw_ratio', 'variance_deviation', 'perc_diag', 'match_min', 'match_mean', 'perc_age_overlap'
- `save_png` *(bool, default=True)*: Whether to save plot as PNG
- `png_filename` *(str, default=None)*: Output PNG filename (optional)
- `core_a_name` *(str, default=None)*: Core A name for plot title (optional)
- `core_b_name` *(str, default=None)*: Core B name for plot title (optional)
- `bin_width` *(float, default=None)*: Histogram bin width (auto if None, based on quality_index)
- `pdf_method` *(str, default='normal')*: PDF fitting method ('KDE', 'skew-normal', or 'normal')
- `kde_bandwidth` *(float, default=0.05)*: Bandwidth for KDE when pdf_method='KDE'
- `mute_mode` *(bool, default=False)*: If True, suppress all print statements
- `targeted_binsize` *(tuple, default=None)*: (synthetic_bins, bin_width) for consistent bin sizing with synthetic data
- `dpi` *(int, default=None)*: Resolution for saved figures in dots per inch. If None, uses default (150)

**Returns:**
- `fit_params` (dict): Dictionary containing distribution statistics including histogram data, PDF parameters, and percentile information

##### Plot Correlation Coefficient Distribution

In [None]:
_ = plot_correlation_distribution(
    mapping_csv=f'example_data/analytical_outputs/temp_synthetic_{"_".join(LOG_COLUMNS)}_core_pair_metrics.csv',
    quality_index='corr_coef'
)

##### Plot Normalized DTW Cost Distribution

In [None]:
_ = plot_correlation_distribution(
    mapping_csv=f'example_data/analytical_outputs/temp_synthetic_{"_".join(LOG_COLUMNS)}_core_pair_metrics.csv',
    quality_index='norm_dtw'
)
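As a rough illustration of the statistics behind such a plot, the sketch below fits a normal distribution to a set of metric values and reports where a target value falls within it. It uses only the standard library. The helper `normal_fit_summary`, the choice of sample standard deviation, and the numbers are assumptions and do not reproduce the function's actual `fit_params` output.

```python
from statistics import NormalDist, fmean, stdev

def normal_fit_summary(values, target):
    """Fit a normal distribution to `values` and locate `target` within it."""
    mu, sigma = fmean(values), stdev(values)
    fitted = NormalDist(mu, sigma)
    return {
        "mean": mu,
        "std": sigma,
        "target_percentile": 100 * fitted.cdf(target),
    }

# Made-up correlation coefficients standing in for one metric column.
corr_coefs = [0.42, 0.55, 0.48, 0.61, 0.50, 0.46, 0.58, 0.52]
summary = normal_fit_summary(corr_coefs, target=0.61)
print(summary)
```

A target far into the upper tail of the fitted curve would indicate a mapping that scores better than most of the alternative solutions for the same core pair.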

#### Remove demo temporary files

In [None]:
# Build the path once; single outer quotes keep the f-string valid before Python 3.12
temp_csv = f'example_data/analytical_outputs/temp_synthetic_{"_".join(LOG_COLUMNS)}_core_pair_metrics.csv'
if os.path.exists(temp_csv):
    os.remove(temp_csv)

<hr>

# **Run Multiple Synthetic Iterations**

Generate and analyze many synthetic core pairs to establish null hypothesis distributions for correlation quality metrics.


## Perform Iterative Synthetic Analysis

**Function: `synthetic_correlation_quality()`**

**What it does:**
1. Generates multiple synthetic core pairs from the segment pool
2. Runs DTW analysis on each pair
3. Collects quality metrics from all correlation solutions
4. Exports probability distributions for each quality metric to CSV

**Key Parameters:**
- `segment_logs` *(list)*: Segment log arrays from `load_segment_pool()` or `modify_segment_pool()`
- `segment_depths` *(list)*: Segment depth arrays from `load_segment_pool()` or `modify_segment_pool()`
- `log_data_type` *(list)*: List of log names to analyze
- `quality_indices` *(list)*: Quality metrics to compute (e.g., ['corr_coef', 'norm_dtw'])
- `number_of_iterations` *(int)*: Number of synthetic pairs to generate
- `core_a_length` *(float)*: Target thickness for synthetic Core A
- `core_b_length` *(float)*: Target thickness for synthetic Core B
- `repetition` *(bool, default=False)*: Allow segment reuse within a core
- `pca_for_dependent_dtw` *(bool, default=False)*: Use PCA for multidimensional DTW
- `output_csv_dir` *(str, default=None)*: Directory for output CSV files
- `max_search_path` *(int, default=5000)*: Maximum paths to search per segment pair
- `mute_mode` *(bool, default=False)*: Suppress console output

**Returns:**
None (outputs CSV files for each quality index)


In [None]:
synthetic_correlation_quality(
    segment_logs=mod_seg_logs,
    segment_depths=mod_seg_depths,
    log_data_type=LOG_COLUMNS,
    quality_indices=['corr_coef', 'norm_dtw'],
    number_of_iterations=50,
    core_a_length=400,
    core_b_length=400,
    output_csv_dir='example_data/analytical_outputs'
)
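Synthetic pairs are assembled from randomly drawn segments, so their quality metrics describe how good a correlation can look by chance alone. A minimal sketch of how such a null distribution might be used is shown below: it ranks an observed value against pooled null values. The helper `empirical_percentile` and all numbers are hypothetical, and nothing here assumes the layout of the exported CSV files.

```python
from bisect import bisect_right

def empirical_percentile(null_values, observed):
    """Percentage of null-distribution values that are <= observed."""
    ordered = sorted(null_values)
    return 100 * bisect_right(ordered, observed) / len(ordered)

# Made-up pooled correlation coefficients from synthetic (unrelated) core pairs.
null_corr = [0.31, 0.44, 0.38, 0.52, 0.47, 0.29, 0.41, 0.55, 0.36, 0.49]
print(empirical_percentile(null_corr, observed=0.53))   # 90.0
```

In this toy case the observed value exceeds nine of the ten null values. More iterations give a smoother null distribution and therefore a more stable ranking.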

<hr>

# **Visualize Synthetic Correlation Results**

Plot the probability distributions from multiple iterations to establish baseline correlation quality expectations.


## Plot Individual Iteration PDFs

**Function: `plot_synthetic_correlation_quality()`**

**What it does:**
Visualizes the probability distribution functions (PDFs) from synthetic correlation analysis, either as individual iteration curves or as a combined distribution.

**Key Parameters:**
- `input_csv` *(str)*: Path to CSV file with `{quality_index}` placeholder
- `quality_indices` *(list)*: Quality metrics to plot
- `bin_width` *(float, default=None)*: Histogram bin width (auto if None)
- `plot_individual_pdf` *(bool, default=True)*: True shows individual PDFs, False shows combined
- `save_plot` *(bool, default=False)*: Whether to save the figure
- `plot_filename` *(str, default=None)*: Output filename with `{quality_index}` placeholder

In [None]:
quality_indices = ['corr_coef', 'norm_dtw']

plot_synthetic_correlation_quality(
    input_csv=f'example_data/analytical_outputs/synthetic_PDFs_{"_".join(LOG_COLUMNS)}_{{quality_index}}.csv',
    quality_indices=quality_indices)
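The `input_csv` argument mixes two kinds of braces, which is easy to misread. The short sketch below shows what the f-string evaluates to. The final substitution with `str.format` is an assumption about how the placeholder could be filled, not a statement about the library's internals.

```python
LOG_COLUMNS = ['hiresMS', 'CT', 'Lumin']

# Doubled braces survive the f-string as a literal {quality_index} placeholder.
template = f'example_data/analytical_outputs/synthetic_PDFs_{"_".join(LOG_COLUMNS)}_{{quality_index}}.csv'
print(template)

# Assumed substitution step: one concrete path per quality index.
paths = [template.format(quality_index=q) for q in ['corr_coef', 'norm_dtw']]
print(paths)
```

The single braces are resolved immediately from `LOG_COLUMNS`, while `{quality_index}` is left in place so that one template can stand for a separate file per quality index.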

## Plot Combined Distribution from All Iterations


In [None]:
quality_indices = ['corr_coef', 'norm_dtw']

plot_synthetic_correlation_quality(
    input_csv=f'example_data/analytical_outputs/synthetic_PDFs_{"_".join(LOG_COLUMNS)}_{{quality_index}}.csv',
    quality_indices=quality_indices,
    plot_individual_pdf=False  # False plots the combined distribution instead of per-iteration PDFs
)