# **pyCoreRelator** [![GitHub](https://img.shields.io/badge/GitHub-pyCoreRelator-blue?logo=github)](https://github.com/GeoLarryLai/pyCoreRelator) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17847259.svg)](https://doi.org/10.5281/zenodo.17847259)  [![PyPI version](https://img.shields.io/pypi/v/pycorerelator.svg)](https://pypi.org/project/pycorerelator/)  [![Conda Version](https://img.shields.io/conda/vn/conda-forge/pycorerelator.svg)](https://anaconda.org/conda-forge/pycorerelator)
## **Workshop Notebook #3: Log Data Processing and Gap Filling**   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GeoLarryLai/pyCoreRelator/blob/main/pyCoreRelator_3_data_gap_fill.ipynb)
This notebook demonstrates a general workflow for cleaning and processing core log data with **pyCoreRelator**, and for filling data gaps in those logs with machine learning methods.

### Key Functions from **pyCoreRelator**
- **`preprocess_core_data()`**: Data cleaning and preprocessing function
- **`plot_core_logs()`**: Visualization function for plotting cleaned and filled core logs
- **`process_and_fill_logs()`**: ML-based data gap filling function

See [FUNCTION_DOCUMENTATION.md](https://github.com/GeoLarryLai/pyCoreRelator/blob/main/FUNCTION_DOCUMENTATION.md) for advanced usage and full parameter details.
<hr>

# **Check Installation**
Install **pyCoreRelator** if needed. If the package is already installed, pip reports that the requirement is satisfied and leaves it as is.

In [None]:
%pip install pycorerelator
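
To check for the package before installing, a guarded test can run first. The sketch below uses only the standard library. `is_installed` is a hypothetical helper, not part of pyCoreRelator, and the module name `pyCoreRelator` is taken from the import cell in this notebook.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("pyCoreRelator"):
    print("pyCoreRelator is already installed")
else:
    print("pyCoreRelator not found; run the %pip install cell above")
```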

# **Import Packages**
Load data processing and gap filling functions from **pyCoreRelator**


In [None]:
from pyCoreRelator import preprocess_core_data, plot_core_logs, process_and_fill_logs

%matplotlib inline

<hr>

# **Define Core Data Configuration Structure**

Configure your core data structure and processing parameters below. The `data_reading_config` dictionary defines all necessary paths, parameters, and column configurations for data cleaning and gap filling.

### Core Information
Basic core identification and dimensions.

In [None]:
core_name = "M9907-23PC"  # Core name
core_length_cm = 783     # Core length in cm

# core_name = "M9907-25PC"  # Core name
# core_length_cm = 797     # Core length in cm

# core_name = "M9907-11PC"  # Core name
# core_length_cm = 439  # Core length in cm

In [None]:
data_reading_config = {
    'core_name': core_name,
    'core_length': core_length_cm,
    
    'input_file_paths': {
        'ct': f'example_data/processed_data/{core_name}/{core_name}_CT.csv',
        'rgb': f'example_data/processed_data/{core_name}/{core_name}_RGB.csv',
        'hrms': f'example_data/raw_data/log_hiresMS/{core_name}_ptMS.csv',
        'mst': f'example_data/raw_data/log_MST/{core_name}_MST.csv'
    },
    
    'clean_file_paths': {
        'ct': f'example_data/processed_data/{core_name}/{core_name}_CT_clean.csv',
        'rgb': f'example_data/processed_data/{core_name}/{core_name}_RGB_clean.csv',
        'hrms': f'example_data/processed_data/{core_name}/{core_name}_hiresMS_clean.csv',
        'mst': f'example_data/processed_data/{core_name}/{core_name}_MST_clean.csv'
    },
    
    'filled_file_paths': {
        'ct': f'example_data/processed_data/{core_name}/{core_name}_CT_MLfilled.csv',
        'rgb': f'example_data/processed_data/{core_name}/{core_name}_RGB_MLfilled.csv',
        'hrms': f'example_data/processed_data/{core_name}/{core_name}_hiresMS_MLfilled.csv',
        'mst': f'example_data/processed_data/{core_name}/{core_name}_MST_MLfilled.csv'
    },

    'column_configs': {
        'ct': {
            'data_col': 'CT', 
            'std_col': 'CT_std', 
            'depth_col': 'SB_DEPTH_cm',
            'plot_label': 'CT\nBrightness',
            'plot_colors': ['black'],
            'show_colormap': True,
            'colormap': 'jet',
            'image_path': f'example_data/processed_data/{core_name}/{core_name}_CT.tiff'
        },
        'rgb': {
            'data_cols': ['Lumin'],
            'std_cols': ['Lumin_std'],
            'depth_col': 'SB_DEPTH_cm',
            'feature_weights': [2.0],
            'rgb_threshold': [35, 220, 2],
            'group_in_subplot': True,
            'plot_label': 'Relative\nLuminance',
            'plot_colors': ['black'],
            'show_colormap': True,
            'colormap_cols': ['Lumin'],
            'colormap': 'inferno',
            'image_path': f'example_data/processed_data/{core_name}/{core_name}_RGB.tiff',
            'additional_feature_source': 'mst',
            'additional_feature_columns': ['Den_gm/cc']
        },
        'hrms': {
            'data_col': 'hiresMS', 
            'depth_col': 'SB_DEPTH_cm',
            'plot_label': 'High-Res\nMagnetic\nSusceptibility\n(μSI)',
            'plot_color': 'darkgreen',
            'feature_weight': 3.0,
            'threshold': ['<=', 5, 1]
        },
        'mst': {
            'ms': {
                'data_col': 'MS', 
                'depth_col': 'SB_DEPTH_cm',
                'plot_label': 'Low-Res\nMagnetic\nSusceptibility\n(μSI)',
                'plot_color': 'lightgreen',
                'feature_weight': 1.0,
                'threshold': ['>', 250, 1]
            },
            'density': {
                'data_col': 'Den_gm/cc', 
                'depth_col': 'SB_DEPTH_cm',
                'plot_label': 'Density\n(g/cc)',
                'plot_color': 'orange',
                'feature_weight': 2.0,
                'threshold': ['<', 1.1, 1]
            },
            'pwvel': {
                'data_col': 'PWVel_m/s', 
                'depth_col': 'SB_DEPTH_cm',
                'plot_label': 'P-wave\nVelocity\n(m/s)',
                'plot_color': 'purple',
                'feature_weight': 0.01,
                'threshold': ['>=', 1076, 1]
            },
            'pwamp': {
                'data_col': 'PWAmp', 
                'depth_col': 'SB_DEPTH_cm',
                'plot_label': 'P-wave\nAmplitude',
                'plot_color': 'purple',
                'feature_weight': 0.01,
                'threshold': ['>=', 80, 1]
            },
            'elecres': {
                'data_col': 'ElecRes_ohmm', 
                'depth_col': 'SB_DEPTH_cm',
                'plot_label': 'Electrical\nResistivity\n(ohm-m)',
                'plot_color': 'brown',
                'feature_weight': 0.01,
                'threshold': ['>', 0.51, 1]
            }
        }
    }
}
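
Before running the processing functions, it can help to confirm that every path under `'input_file_paths'` points to a file on disk. The sketch below is one way to do that. `find_missing_inputs` is a hypothetical helper, not a pyCoreRelator function, and `demo_config` is a stand-in that only mirrors the layout of `data_reading_config`.

```python
import os

def find_missing_inputs(config: dict) -> dict:
    """Map each log type to its input path when that file is not on disk."""
    return {
        log_type: path
        for log_type, path in config["input_file_paths"].items()
        if not os.path.isfile(path)
    }

# Stand-in config with the same 'input_file_paths' layout as above (hypothetical paths)
demo_config = {
    "input_file_paths": {
        "ct": "example_data/processed_data/DEMO/DEMO_CT.csv",
        "mst": "example_data/raw_data/log_MST/DEMO_MST.csv",
    }
}
for log_type, path in find_missing_inputs(demo_config).items():
    print(f"Missing {log_type} input: {path}")
```

Calling `find_missing_inputs(data_reading_config)` after the download step below should return an empty dictionary if all input files were fetched.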

<hr>

## **Check and Download Example Data**
Download necessary example data if not already present


In [None]:
import os
import requests

BASE_URL = "https://github.com/GeoLarryLai/pyCoreRelator/raw/main/example_data"

print("Checking for example data...")
if not os.path.exists(f"example_data/processed_data/{core_name}/{core_name}_CT.csv"):
    print("Downloading...")
    os.makedirs(f"example_data/processed_data/{core_name}", exist_ok=True)
    os.makedirs("example_data/raw_data/log_hiresMS", exist_ok=True)
    os.makedirs("example_data/raw_data/log_MST", exist_ok=True)
    os.makedirs("example_data/picked_datum", exist_ok=True)
    paths = [f"processed_data/{core_name}/{core_name}_CT.csv", f"processed_data/{core_name}/{core_name}_RGB.csv",
             f"processed_data/{core_name}/{core_name}_CT.tiff", f"processed_data/{core_name}/{core_name}_RGB.tiff",
             f"raw_data/log_hiresMS/{core_name}_ptMS.csv", f"raw_data/log_MST/{core_name}_MST.csv",
             f"picked_datum/{core_name}_pickeddepth.csv"]
    failed = []
    for path in paths:
        try:
            response = requests.get(f"{BASE_URL}/{path}", timeout=60)
            response.raise_for_status()  # avoid saving an HTTP error page as data
            with open(f"example_data/{path}", "wb") as f:
                f.write(response.content)
        except requests.RequestException as err:
            # Report the failure instead of silently skipping the file
            failed.append(path)
            print(f"  Could not download {path}: {err}")
    print("Download complete" if not failed else f"Download finished with {len(failed)} failed file(s)")
else:
    print("✓ Data already exists")
print("Ready to proceed")

<hr>

# **Execute the functions:**


## Clean and preprocess log data

**Function: `preprocess_core_data()`**

**What it does:**
1. Reads raw core log data from multiple sources (CT, RGB, high-res MS, MST)
2. Applies outlier detection and removal based on configured thresholds
3. Resamples data to consistent depth resolution
4. Exports cleaned data to CSV files

**Key Parameters:**
- `data_config` *(dict)*: Complete data configuration dictionary containing all paths, parameters, and column configurations
- `resample_resolution` *(float, default=1)*: Desired spacing for resampling the depth column; should be in the same units as the logs' depth values.
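
Steps 2 and 3 above can be pictured with a small standard-library sketch. It is not pyCoreRelator's implementation. `mask_flagged` and `resample_linear` are hypothetical helpers. The sketch assumes that a threshold such as `['>', 250, 1]` flags values matching the comparison as outliers, and it ignores the third list element because its meaning is not described in this notebook. Linear interpolation is likewise an assumption; the package may resample differently.

```python
import operator

# Comparison strings as they appear in the 'threshold' entries of the config
COMPARATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def mask_flagged(values, op, limit):
    """Replace every value that satisfies `value op limit` with NaN."""
    flagged = COMPARATORS[op]
    return [float("nan") if flagged(v, limit) else v for v in values]

def resample_linear(depths, values, step):
    """Linearly interpolate onto a regular grid from depths[0] to depths[-1].

    Expects at least two samples with strictly increasing depths and no NaN values.
    """
    n = int((depths[-1] - depths[0]) / step + 1e-9) + 1  # epsilon guards float round-off
    grid = [depths[0] + i * step for i in range(n)]
    out, j = [], 0
    for g in grid:
        # Advance to the segment [depths[j], depths[j + 1]] that contains g
        while j < len(depths) - 2 and depths[j + 1] < g:
            j += 1
        frac = (g - depths[j]) / (depths[j + 1] - depths[j])
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return grid, out

# Made-up readings: 300 exceeds the illustrative limit of 250 and is masked
masked = mask_flagged([10.0, 300.0, 14.0, 16.0], ">", 250)
print(masked)

# Made-up gap-free series sampled every 1 cm, resampled to 0.5 cm spacing
grid, resampled = resample_linear([0.0, 1.0, 2.0], [10.0, 12.0, 16.0], step=0.5)
print(grid)
print(resampled)
```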

In [None]:
preprocess_core_data(data_reading_config, resample_resolution=0.5)

## Visualize cleaned logs

**Function: `plot_core_logs()`**

**What it does:**
1. Loads cleaned or gap-filled core log data based on the `file_type` parameter
2. Creates multi-panel visualization with core images and data traces
3. Optionally displays picked datum depths from CSV file
4. Automatically adjusts figure dimensions based on core length
5. Saves figures in specified formats

**Key Parameters:**
- `data_config` *(dict)*: Complete data configuration dictionary
- `file_type` *(str)*: Type of data files to plot ('clean' for cleaned data, 'filled' for gap-filled data)
- `pickeddepth_csv` *(str, default=None)*: Path to CSV file containing picked datum depths for visualization
- `title` *(str, default='')*: Title text for the plot figure
- `save_fig` *(bool, default=False)*: Whether to save the figure to disk
- `fig_format` *(list, default=['png'])*: List of file formats to save (options: 'png', 'jpg', 'svg', 'pdf')
- `dpi` *(int, default=300)*: Resolution in dots per inch for saved figures
- `output_dir` *(str, default=None)*: Directory to save figures (required if `save_fig=True`)


In [None]:
plot_core_logs(
    data_reading_config,
    file_type='clean',
    pickeddepth_csv=f'example_data/picked_datum/{core_name}_pickeddepth.csv',
    title=f'{core_name} [Cleaned Logs]',
    save_fig=True,
    output_dir=f'example_data/processed_data/{core_name}/'
)

## Fill log data gaps with machine learning

**Function: `process_and_fill_logs()`**

**What it does:**
1. Loads cleaned core log data
2. Identifies gaps in the data
3. Uses machine learning algorithms to predict missing values based on available features
4. Supports multiple ML methods: Random Forest (RF), XGBoost (XGB), and an XGBoost + LightGBM ensemble
5. Exports gap-filled data to CSV files

**Key Parameters:**
- `data_config` *(dict)*: Complete data configuration dictionary
- `ml_method` *(str, default='xgblgbm')*: Machine learning method to use:
  - `'rf'`: Random Forest
  - `'rftc'`: Random Forest with trend constraints
  - `'xgb'`: XGBoost
  - `'xgblgbm'`: Weighted-average ensemble of XGBoost and LightGBM (recommended)
- `n_jobs` *(int, default=-1)*: Number of parallel jobs for processing multiple target logs (-1 uses all available CPU cores)
- `show_plots` *(bool, default=True)*: Whether to generate and display plots during processing
  - Works in both sequential and parallel modes
  - Set to `False` to disable plotting for faster processing
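
Step 2 (identifying gaps) and the weighted-average idea behind `'xgblgbm'` can be illustrated without any ML library. This is a sketch under assumptions, not the package's code. `find_gaps` and `weighted_average` are hypothetical helpers, gaps are assumed to be stored as NaN, and the equal weights are chosen only for illustration. The weights pyCoreRelator actually uses are not stated in this notebook.

```python
import math

def find_gaps(values):
    """Return (start, end) index pairs, end inclusive, for each run of NaN values."""
    gaps, start = [], None
    for i, v in enumerate(values):
        if math.isnan(v):
            if start is None:
                start = i          # a new gap begins here
        elif start is not None:
            gaps.append((start, i - 1))
            start = None
    if start is not None:          # gap runs to the end of the log
        gaps.append((start, len(values) - 1))
    return gaps

def weighted_average(pred_a, pred_b, weight_a=0.5):
    """Blend two equal-length prediction lists with weights weight_a and 1 - weight_a."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]

nan = float("nan")
log = [1.0, nan, nan, 4.0, nan]    # made-up log with two gaps
print(find_gaps(log))

# Made-up predictions from two models for the same two missing samples
print(weighted_average([2.0, 4.0], [4.0, 8.0], weight_a=0.5))
```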

In [None]:
process_and_fill_logs(data_reading_config)

## Visualize gap-filled logs

**Function: `plot_core_logs()`**

Display the ML gap-filled data by setting `file_type='filled'`. All other parameters are described in the earlier `plot_core_logs()` section.

In [None]:
plot_core_logs(
    data_reading_config,
    file_type='filled',
    pickeddepth_csv=f'example_data/picked_datum/{core_name}_pickeddepth.csv',
    title=f'{core_name} [Data Gap Filled]',
    save_fig=True,
    output_dir=f'example_data/processed_data/{core_name}/'
)