# Kepler-ECG: Pipeline

## PHASE 0: Setup & Domain Understanding

### Objective

Configure environment, download data, understand domain.

### Step 0.1: Create project structure

```bash
kepler-ecg/
├── data/
│   └── raw/              # Original ECG
├── src/kepler_ecg/
│   ├── preprocessing/    # Filters, quality, beat segmentation
│   ├── features/         # Feature extractors
│   └── discovery/        # Symbolic regression
├── scripts/              # Executable scripts
├── results/              # Output analysis
└── test/                 # Unit tests
```

---

### Step 0.2: Install dependencies

```bash
# Core Data Science stack: math, signal tools, dataframes, and visualization
pip install numpy scipy pandas matplotlib seaborn

# Specialized Biomedical Signal Processing (essential for PhysioNet datasets)
# wfdb: read/write PhysioNet files
# neurokit2/biosppy: advanced ECG/EEG feature extraction and cleaning
pip install wfdb neurokit2 biosppy

# Symbolic Regression: Discovering mathematical expressions from data
pip install pysr

# Utilities and Performance:
# tqdm: progress bars for loops
# pyarrow: high-performance data storage and retrieval
# pytest: framework for testing your preprocessing pipeline
pip install tqdm pyarrow pytest
```

---

### Step 0.3: Download dataset

**Script**: [scripts/download_dataset.py](../scripts/download_dataset.py)

A Python utility that automates the download of clinical datasets from the **PhysioNet** platform. The script wraps the `wget` tool: it creates directories automatically and mirrors the source file organization while removing redundant server folder hierarchies (via the `--cut-dirs` option). It is designed to be integrated into cross-platform research pipelines.

#### Input (Command Line Arguments)

The script accepts the following parameters via `argparse`:

| Parameter | Type | Mandatory | Description |
|-----------|------|-----------|-------------|
| `url` | String | **Required** | The direct URL of the dataset on PhysioNet (e.g. `https://physionet.org/files/ptb-xl/1.0.3/`). |
| `--output` | Path | Optional | The local destination folder. Default: `data/raw`. |

#### Output

* **Local Files:** A mirror copy of the downloaded dataset in the specified folder (keeping the original structure of the `.dat`, `.hea`, `.csv`, etc. files).
* **Console:** Real-time log of download status and final confirmation of save path.
* **Exit Status:** Returns `0` on success, `1` on error (e.g. `wget` not installed or URL unreachable).
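
The behaviour described above can be sketched as follows. This is a minimal illustration, not the contents of `scripts/download_dataset.py`: the helper names `count_cut_dirs`, `build_wget_command`, and `parse_args` are hypothetical, and the exact `wget` flags used by the real script are an assumption (the ones shown are standard GNU `wget` options).

```python
import argparse
from pathlib import Path
from urllib.parse import urlparse


def count_cut_dirs(url: str) -> int:
    """Number of URL path components to strip so files land directly in the output folder."""
    return len([part for part in urlparse(url).path.split("/") if part])


def build_wget_command(url: str, output: Path) -> list[str]:
    """Assemble a recursive, resumable wget call (assumed flags, see lead-in)."""
    return [
        "wget",
        "-r", "-N", "-c", "-np",              # recursive, timestamping, resume, no parent
        "-nH",                                # do not create a host-name directory
        f"--cut-dirs={count_cut_dirs(url)}",  # drop the server folder hierarchy
        "-P", str(output),                    # destination prefix
        url,
    ]


def parse_args(argv=None) -> argparse.Namespace:
    """Positional URL plus an optional --output folder, as in the table above."""
    parser = argparse.ArgumentParser(description="Download a PhysioNet dataset")
    parser.add_argument("url", help="Direct dataset URL on PhysioNet")
    parser.add_argument("--output", type=Path, default=Path("data/raw"))
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["https://physionet.org/files/ptb-xl/1.0.3/"])
    print(" ".join(build_wget_command(args.url, args.output)))
```

Building the command as a list (rather than a shell string) lets it be passed to `subprocess.run` without quoting issues on either Windows or POSIX shells.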


#### Running the script for different datasets

```bash
# Download PTB-XL
python scripts/download_dataset.py https://physionet.org/files/ptb-xl/1.0.3/ --output data/raw/ptb-xl

# Download Chapman-Shaoxing
python scripts/download_dataset.py https://physionet.org/files/ecg-arrhythmia/1.0.0/ --output data/raw/chapman

# Download CPSC 2018
python scripts/download_dataset.py https://physionet.org/files/challenge-2020/1.0.2/training/cpsc_2018/ --output data/raw/cpsc-2018

# Download Georgia
python scripts/download_dataset.py https://physionet.org/files/challenge-2020/1.0.2/training/georgia/ --output data/raw/georgia

# Download MIT-BIH
python scripts/download_dataset.py https://physionet.org/files/mitdb/1.0.0/ --output data/raw/mit-bih

# Download LTAFDB
python scripts/download_dataset.py https://physionet.org/files/ltafdb/1.0.0/ --output data/raw/ltaf

# Download QTDB
python scripts/download_dataset.py https://physionet.org/files/qtdb/1.0.0/ --output data/raw/qtdb
```

**Output**: 
- Folders containing data and metadata

---
---

## PHASE 1: Preprocessing Pipeline

### Objective

Create robust pipeline: filters, quality assessment, R peak detection.

### Step 1.1: BaselineRemover (high-pass Butterworth, 0.5 Hz) & NoiseFilter (low-pass 40 Hz + notch 50 Hz)

**Script**: [src/preprocessing/filters.py](../src/preprocessing/filters.py)

A specialized Python module for ECG signal preprocessing that implements high-performance filtering techniques. It provides tools to eliminate common cardiac signal artifacts, such as low-frequency baseline wander (caused by breathing or movement) and high-frequency noise (EMG interference or powerline hum). The module utilizes zero-phase filtering (`filtfilt`) to ensure that the morphological timing of the ECG waves (P, QRS, and T) remains undistorted, which is critical for accurate deep learning analysis.

#### Input

The classes (`BaselineRemover`, `NoiseFilter`, `ECGFilter`) and convenience functions accept the following primary inputs:

* **`signal` (np.ndarray):** The raw ECG data. It can be 1D (a single lead) or 2D (multiple leads, shaped as `(n_samples, n_leads)`).
* **`fs` (int):** The sampling frequency of the signal in Hz (e.g., 500 Hz for PTB-XL).
* **Configuration Parameters:**
    * **Baseline Removal:** `baseline_cutoff` (default 0.5 Hz) and `baseline_order`.
    * **Noise Filtering:** `lowpass_cutoff` (default 40.0 Hz) and `lowpass_order`.
    * **Powerline Interference:** `notch_freq` (50 Hz for Europe/Italy or 60 Hz) and `notch_q` (quality factor).

#### Output

* **Filtered Signal (np.ndarray):** A NumPy array of the same shape as the input, containing the cleaned ECG signal with artifacts removed.
* **Frequency Response (Optional):** Magnitude response data in dB for visualizing filter performance.
* **Metadata:** String representations (`__repr__`) of the filter objects for logging experiment configurations.
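
As a rough illustration of the zero-phase chain described in this step, the sketch below applies a high-pass, a low-pass, and a notch filter with SciPy. It is not the code in `filters.py`: the function name `filter_ecg`, the filter order, and the default `notch_q=30.0` are assumptions; only the cutoff defaults come from the text above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, sosfiltfilt


def filter_ecg(signal, fs, baseline_cutoff=0.5, lowpass_cutoff=40.0,
               notch_freq=50.0, notch_q=30.0, order=4):
    """Zero-phase baseline removal, low-pass, and notch filtering along axis 0."""
    x = np.asarray(signal, dtype=float)

    # High-pass Butterworth removes baseline wander; second-order sections
    # stay numerically stable at very low normalized cutoffs.
    sos_hp = butter(order, baseline_cutoff, btype="highpass", fs=fs, output="sos")
    x = sosfiltfilt(sos_hp, x, axis=0)

    # Low-pass Butterworth attenuates high-frequency (e.g. EMG) noise.
    sos_lp = butter(order, lowpass_cutoff, btype="lowpass", fs=fs, output="sos")
    x = sosfiltfilt(sos_lp, x, axis=0)

    # Narrow notch at the powerline frequency; forward-backward filtering
    # keeps the result zero-phase.
    b, a = iirnotch(notch_freq, notch_q, fs=fs)
    return filtfilt(b, a, x, axis=0)
```

Filtering along `axis=0` means the same call handles a 1D lead and a 2D `(n_samples, n_leads)` array, and the output keeps the input shape.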

---

### Step 1.2: QualityAssessor: SNR, flatlines, saturation

**Script**: [src/preprocessing/quality.py](../src/preprocessing/quality.py)

This module provides a comprehensive framework for evaluating the clinical and technical quality of ECG signals. It implements a multi-metric approach to detect common signal acquisition issues such as **electrode disconnection (flatlines)**, **amplifier saturation (clipping)**, **baseline instability**, and **excessive noise**. By aggregating these metrics into a weighted **Quality Score**, the module allows automated pipelines to decide whether a signal is reliable enough for deep learning inference or if it should be rejected to avoid false positives.


#### Input

The module's main components (`QualityAssessor`, `assess_quality`) take the following inputs:

* **`signal` (np.ndarray):** The ECG data to be evaluated. It supports **1D arrays** (single lead) or **2D arrays** (multi-lead, e.g., 12-lead ECGs).
* **`fs` (int):** The sampling frequency in Hz, used to define temporal windows for flatline detection and frequency bands for SNR calculation.
* **`config` (QualityConfig, optional):** A custom configuration object to override default thresholds for:
    * **SNR limits** (default min: 5.0 dB).
    * **Flatline duration** (default min: 0.1 s).
    * **Amplitude ranges** (default: 0.1 mV to 5.0 mV).

#### Output

The module returns a **`QualityMetrics`** object containing:

* **Numerical Metrics:** Specific values for `snr_db`, `flatline_ratio`, `saturation_ratio`, `baseline_drift`, and `high_freq_noise`.
* **`quality_score` (float):** A normalized value from **0.0 to 1.0** representing overall signal integrity.
* **`quality_level` (Enum):** A categorical label: `EXCELLENT`, `GOOD`, `ACCEPTABLE`, `POOR`, or `UNUSABLE`.
* **`is_usable` (bool):** A boolean flag indicating if the signal meets the minimum quality threshold for analysis.
* **`issues` (List[str]):** A detailed list of detected problems (e.g., "Lead 0: Low SNR", "Signal saturation").
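
Two of the metrics above can be approximated in a few lines of NumPy. The sketch below is only one plausible way to compute `flatline_ratio` and `saturation_ratio` for a single lead; the function signatures, the `eps` and `tol` tolerances, and the use of the signal's own extremes as the saturation rails are assumptions, not the definitions used in `quality.py`.

```python
import numpy as np


def flatline_ratio(signal, fs, min_duration=0.1, eps=1e-6):
    """Fraction of samples inside near-constant runs lasting at least min_duration seconds."""
    x = np.asarray(signal, dtype=float)
    flat = np.abs(np.diff(x)) < eps           # True where consecutive samples barely change
    min_len = int(round(min_duration * fs))
    flagged, run = 0, 0
    for is_flat in np.append(flat, False):    # trailing False closes the last run
        if is_flat:
            run += 1
        else:
            if run >= min_len:
                flagged += run
            run = 0
    return flagged / len(x)


def saturation_ratio(signal, tol=1e-3):
    """Fraction of samples within tol * range of the extreme values (a clipping proxy)."""
    x = np.asarray(signal, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return 1.0                            # a constant signal is treated as fully railed
    at_rail = (x >= x.max() - tol * span) | (x <= x.min() + tol * span)
    return float(at_rail.mean())
```

A clean sine wave yields a flatline ratio of zero and a small saturation ratio, while a hard-clipped copy of the same wave spends most of its samples at the rails.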

---

### Step 1.3: BeatSegmenter: Pan-Tompkins R-detection

**Script**: [src/preprocessing/segmentation.py](../src/preprocessing/segmentation.py)

This module is the core engine for signal parsing within the **Kepler-ECG** project. It implements the classic **Pan-Tompkins algorithm** to detect R-peaks (the most prominent spikes in an ECG) with high precision. Beyond simple detection, it automatically segments the continuous cardiac signal into individual heartbeats, aligns them, and computes vital clinical metrics such as Heart Rate (BPM) and RR intervals. This segmentation is a fundamental prerequisite for training deep learning models on heartbeat-level morphologies.

#### Input

The primary classes (`PanTompkinsDetector`, `BeatSegmenter`) and the `segment_beats` function accept the following:

* **`signal` (np.ndarray):** The raw or filtered ECG data. It supports **1D arrays** (single lead) or **2D arrays** (multi-lead).
* **`fs` (int):** The sampling frequency in Hz (essential for time-to-sample conversions).
* **`config` (SegmentationConfig, optional):** A configuration object to tune the algorithm, including:
    * **Bandpass limits:** Default 5.0 Hz to 15.0 Hz to isolate QRS energy.
    * **Search windows:** Time before (0.2 s) and after (0.4 s) the R-peak for beat extraction.
    * **Physiological limits:** Min/Max RR intervals to filter out noise that mimics heartbeats.

#### Output

The module returns a **`SegmentationResult`** object containing:

* **`r_peaks` (np.ndarray):** The precise indices of all detected heartbeats in the signal.
* **`beats` (np.ndarray):** A 2D matrix of shape `(n_beats, beat_length)` containing the extracted and aligned heartbeat waveforms.
* **`heart_rate_bpm` (float):** The calculated mean heart rate.
* **`detection_confidence` (float):** A score from **0.0 to 1.0** indicating the reliability of the detection based on rhythm regularity and signal consistency.
* **`beat_template` (np.ndarray):** An "average" heartbeat calculated from all valid segments, useful for noise reduction.
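
Once R-peaks are available, the beat matrix, template, and mean heart rate follow from simple windowing. The sketch below assumes a single-lead signal and already-detected peaks (the Pan-Tompkins detection itself is not reproduced here); `extract_beats` and `mean_heart_rate` are hypothetical helper names, and skipping peaks too close to the edges is an assumed policy.

```python
import numpy as np


def extract_beats(signal, r_peaks, fs, pre=0.2, post=0.4):
    """Cut a fixed window around each R-peak; peaks too close to the edges are skipped."""
    n_pre, n_post = int(round(pre * fs)), int(round(post * fs))
    beats = [signal[r - n_pre:r + n_post]
             for r in r_peaks
             if r - n_pre >= 0 and r + n_post <= len(signal)]
    return np.array(beats)                    # shape (n_beats, n_pre + n_post)


def mean_heart_rate(r_peaks, fs):
    """Mean heart rate in BPM derived from the mean RR interval."""
    rr_s = np.diff(r_peaks) / fs              # RR intervals in seconds
    return 60.0 / rr_s.mean()


if __name__ == "__main__":
    fs = 500
    ecg = np.zeros(5000)
    peaks = np.arange(400, 5000, 400)         # synthetic peaks every 0.8 s
    ecg[peaks] = 1.0
    beats = extract_beats(ecg, peaks, fs)
    template = beats.mean(axis=0)             # average beat, R-peak at index n_pre
    print(beats.shape, mean_heart_rate(peaks, fs))
```

Because every window starts the same number of samples before its R-peak, the rows are aligned by construction and averaging them gives a usable template.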

---

### Step 1.4: HRVPreprocessor: ectopic beat removal, interpolation

**Script**: [src/preprocessing/hrv_preprocessing.py](../src/preprocessing/hrv_preprocessing.py)

This module is dedicated to the specific preprocessing required for **Heart Rate Variability (HRV)** analysis. Its primary purpose is to transform a sequence of raw RR intervals into a clean, uniform time series ready for time and frequency domain analysis. The module implements advanced algorithms for the identification of **ectopic beats** (irregular rhythms) and artifacts, utilizing clinical standards such as the Malik or Kamath criteria. Additionally, it includes interpolation functionality to handle missing data and ensure a constant sampling rate, which is essential for accurate spectral analysis of cardiac variability.

#### Input

The primary components of the module, such as the `HRVPreprocessor` class and the `preprocess_hrv` function, require the following inputs:

* **`rr_intervals` (np.ndarray)**: An array containing RR intervals expressed in milliseconds.
* **`config` (HRVConfig, optional)**: A configuration object used to customize the processing pipeline. This includes:
    * **Ectopic detection method**: Options include `MALIK`, `KAMATH`, `KARLSSON`, or `ACAR`.
    * **Detection threshold**: A numerical parameter to adjust the sensitivity of the artifact identification algorithm.
    * **Resampling frequency**: The target frequency (e.g., 4 Hz) for the interpolation of the time series.



#### 📤 Output

The module returns an **`HRVData`** object containing the following elements:

* **`rr_clean` (np.ndarray)**: The series of RR intervals after the removal or correction of identified ectopic beats.
* **`rr_interpolated` (np.ndarray)**: An equidistantly sampled time series, suitable for frequency-based analysis such as Fast Fourier Transform (FFT).
* **`ectopic_mask` (np.ndarray)**: A boolean array indicating which original beats were flagged as anomalies or artifacts.
* **`metrics` (Dict)**: Basic descriptive statistics, such as the percentage of detected ectopic beats within the signal.
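
The two main steps, ectopic detection and uniform resampling, can be sketched with NumPy alone. The version below uses a Malik-style rule (an interval is flagged when it deviates by more than 20% from the last accepted interval) and linear interpolation. Comparing against the last accepted interval rather than the immediately preceding one, the function names, and the choice of linear interpolation are all assumptions and may differ from `hrv_preprocessing.py`.

```python
import numpy as np


def malik_ectopic_mask(rr_ms, threshold=0.2):
    """Flag intervals deviating from the last accepted interval by more than threshold."""
    rr = np.asarray(rr_ms, dtype=float)
    mask = np.zeros(len(rr), dtype=bool)
    last_good = rr[0]
    for i in range(1, len(rr)):
        if abs(rr[i] - last_good) > threshold * last_good:
            mask[i] = True                    # ectopic or artifact candidate
        else:
            last_good = rr[i]
    return mask


def resample_rr(rr_ms, mask, fs_resample=4.0):
    """Drop flagged beats and linearly interpolate the rest onto a uniform time grid."""
    rr = np.asarray(rr_ms, dtype=float)
    t = np.cumsum(rr) / 1000.0                # beat times in seconds
    t_clean, rr_clean = t[~mask], rr[~mask]
    grid = np.arange(t_clean[0], t_clean[-1], 1.0 / fs_resample)
    return grid, np.interp(grid, t_clean, rr_clean)


if __name__ == "__main__":
    rr = [800, 810, 790, 400, 1200, 805, 795]  # a short/long pair mimics an ectopic beat
    mask = malik_ectopic_mask(rr)
    grid, rr_uniform = resample_rr(rr, mask)
    print(mask, len(grid))
```

Keeping the original beat times while dropping flagged intervals preserves the time axis, so the interpolation bridges the gap instead of shifting later beats earlier.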

---

### Step 1.5: PreprocessingPipeline

**Script**: [src/preprocessing/pipeline.py](../src/preprocessing/pipeline.py)

This module serves as the central orchestration engine for the **Kepler-ECG** project, integrating all individual preprocessing components into a unified and automated workflow. It sequentially executes signal filtering (baseline and noise removal), quality assessment, R-peak detection, beat segmentation, and HRV analysis. Designed for high-performance research, the pipeline features **parallel processing** to handle large datasets efficiently and a **caching system** (using SHA-256 hashing) to avoid redundant computations. It provides a robust interface for transforming raw physiological data into structured, analysis-ready objects.

#### Input

The `PreprocessingPipeline` class and its convenience functions (like `process_ecg_batch`) accept the following:

* **`signal` (np.ndarray)**: The raw ECG waveform, supporting both single-lead (1D) and multi-lead (2D) configurations.
* **`fs` (int)**: The sampling frequency in Hz, required by all internal stages for accurate time-domain processing.
* **`PipelineConfig` (optional)**: A comprehensive configuration object that allows fine-tuning of:
    * **Filter Settings**: Cutoff frequencies for high-pass, low-pass, and notch filters.
    * **Execution Settings**: `n_jobs` for parallel execution and `enable_cache` to toggle disk-based storage of results.
    * **Quality & Segmentation**: Thresholds for signal rejection and beat extraction parameters.

#### Output

The pipeline returns a **`ProcessedECG`** object (or a list of objects for batches), which contains:

* **`filtered_signal`**: The cleaned ECG data.
* **`quality_metrics`**: The results from the quality assessment stage (score, level, and usability).
* **`segmentation`**: R-peak indices, extracted heartbeat segments, and heart rate statistics.
* **`hrv_data`**: Cleaned and interpolated RR intervals for variability analysis.
* **`metadata`**: A dictionary containing processing time, timestamps, and the unique hash identifier for the operation.
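
The caching idea can be illustrated with a content-derived key. The sketch below hashes the signal bytes together with the sampling rate and configuration; what exactly the real pipeline feeds into SHA-256 is not stated in this document, so the key composition and the `cache_key` name are assumptions.

```python
import hashlib
import json

import numpy as np


def cache_key(signal, fs, config):
    """SHA-256 digest over the signal bytes, shape, dtype, sampling rate, and config."""
    x = np.ascontiguousarray(signal)
    digest = hashlib.sha256()
    digest.update(x.tobytes())
    # Shape and dtype are included so that reshaped or re-typed copies do not collide.
    meta = {"shape": list(x.shape), "dtype": str(x.dtype), "fs": fs, "config": config}
    digest.update(json.dumps(meta, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    ecg = np.zeros((5000, 12))
    key = cache_key(ecg, 500, {"lowpass_cutoff": 40.0, "notch_freq": 50.0})
    print(key)                                # 64 hex characters, usable as a file name
```

Serializing the configuration with `sort_keys=True` makes the key independent of dictionary ordering, so two runs with the same settings hit the same cache entry.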

---

### Step 1.6: Process the entire dataset

**Script**: [scripts/process_dataset.py](../scripts/process_dataset.py)

This module is a high-level CLI (Command Line Interface) script designed to automate the processing of large-scale ECG datasets. It acts as a bridge between raw data storage and the **Kepler-ECG Preprocessing Pipeline**, featuring an auto-detection system that recognizes various clinical data formats such as **WFDB** (.dat/.hea), **MAT** (.mat), and **NumPy** (.npy). The script is optimized for research workflows, allowing for targeted sampling rate filtering and batch processing to handle thousands of records while providing detailed logging and progress reporting.
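
Format auto-detection of this kind is often done by inspecting file extensions. The sketch below is a guess at one simple strategy (pick the format with the most marker files); the `FORMATS` table, the `detect_format` name, and the majority rule are assumptions rather than the script's actual logic.

```python
from pathlib import Path

# Hypothetical extension-to-format table; WFDB records are recognized by their header files.
FORMATS = {".hea": "wfdb", ".mat": "mat", ".npy": "numpy"}


def detect_format(data_dir):
    """Return the format with the most marker files under data_dir, or None if none match."""
    counts = {}
    for path in Path(data_dir).rglob("*"):
        fmt = FORMATS.get(path.suffix.lower())
        if fmt:
            counts[fmt] = counts.get(fmt, 0) + 1
    return max(counts, key=counts.get) if counts else None
```

Datasets that ship both `.hea` and `.mat` files for every record would produce a tie under this rule, so a real implementation would likely need an explicit priority order.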

#### Input (Command Line Arguments)

The script requires several parameters to define the data workflow:

* **`--data_dir` (Required)**: The path to the directory containing the raw ECG dataset.
* **`--output_dir` (Required)**: The path where the processed results, logs, and metadata will be saved.
* **`--sampling_rate` (Optional)**: A filter to process only files with a specific sampling frequency (e.g., 500 Hz for PTB-XL).
* **`--n_samples` (Optional)**: Limits the processing to a specific number of records (useful for testing).
* **`--batch_size` (Default: 100)**: Determines how many records are processed before reporting progress.

#### Output

The execution produces a structured output in the destination folder:

* **Processed Data**: Individual preprocessed objects (typically saved as `.pkl` or `.npy`) containing filtered signals and extracted features.
* **`processing_summary.json`**: A global metadata file summarizing the execution time, number of files processed, and any errors encountered.
* **Quality Statistics**: A detailed breakdown in the logs showing the distribution of signal quality levels (e.g., % of Excellent vs. Unusable signals).
* **Execution Logs**: A log file tracking every step of the batch operation.

#### Running the script for different datasets

```bash
# Process [dataset]
python scripts/process_dataset.py \
    --data_dir "./data/raw/[dataset]" \
    --output_dir "./results/[dataset]"
```

---
---

## 📊 PHASE 2: Feature Extraction

### Objective

Extract 100+ features for ECG: morphological, spectral, wavelet, compressibility.

### Step 2.1: MorphologicalExtractor

**Script**: [src/features/morphological.py](../src/features/morphological.py)

This module is designed to extract detailed clinical and morphological features from ECG beat templates. It identifies key characteristic points of the cardiac cycle, such as the onset, peak, and offset of P, Q, R, S, and T waves. By calculating specific wave durations, amplitudes, and intervals (including corrected QT intervals using Bazett, Fridericia, and Framingham formulas), the module provides a structured digital representation of the heartbeat's shape. These features are essential for both traditional clinical diagnostics and as structured inputs for deep learning classifiers.

#### Input

The core components, specifically the `MorphologicalExtractor` and the `WavePoints` data structure, process the following inputs:

* **`beat_template` (np.ndarray)**: A 1D array representing a single, representative heartbeat (typically the average or median beat) centered on the R-peak.
* **`fs` (int)**: The sampling frequency in Hz, used to convert sample indices into clinical time measurements in milliseconds.
* **`rr_mean_ms` (optional float)**: The mean RR interval in milliseconds, required for calculating corrected QT (QTc) intervals.

#### Output

The module returns a comprehensive dictionary or data structure containing:

* **Wave Amplitudes**: Peak values for P, Q, R, S, and T waves, along with ST-segment levels.
* **Temporal Intervals**: Precise durations in milliseconds for the P-wave, QRS complex, T-wave, PR interval, and QT interval.
* **Corrected Metrics**: QTc values calculated via various medical formulas (Bazett, Fridericia, Framingham).
* **Morphological Ratios**: Calculated values such as the R/S ratio and T/R ratio, as well as areas under the QRS and T curves.
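
The three QT correction formulas named above are short enough to state directly. In the sketch below the function name `qtc_all` is hypothetical; the formulas themselves are the standard ones, with RR expressed in seconds and QT in milliseconds.

```python
import math


def qtc_all(qt_ms: float, rr_mean_ms: float) -> dict:
    """Corrected QT (ms) from the measured QT (ms) and the mean RR interval (ms)."""
    rr_s = rr_mean_ms / 1000.0                # the formulas expect RR in seconds
    return {
        "bazett": qt_ms / math.sqrt(rr_s),            # QT / RR^(1/2)
        "fridericia": qt_ms / rr_s ** (1.0 / 3.0),    # QT / RR^(1/3)
        "framingham": qt_ms + 154.0 * (1.0 - rr_s),   # linear correction
    }


if __name__ == "__main__":
    # At 60 BPM (RR = 1000 ms) all three corrections leave QT unchanged.
    print(qtc_all(400.0, 1000.0))
    print(qtc_all(360.0, 640.0))
```

This is also why `rr_mean_ms` is needed as an input: without it the QT interval can be measured but not corrected for heart rate.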

---

### Step 2.2: SpectralAnalyzer (HRV)

**Script**: [src/features/spectral.py](../src/features/spectral.py)

This module focuses on the frequency-domain analysis of Heart Rate Variability (HRV). It transforms clean RR interval sequences into power spectral density (PSD) estimates to quantify the autonomic nervous system's influence on the heart. The module implements both the **Welch periodogram** for uniformly resampled series and the **Lomb-Scargle periodogram** for non-uniform data. It is designed to extract standard clinical frequency bands (VLF, LF, HF) and calculate the LF/HF ratio, which serves as an indicator of the sympathetic-vagal balance.

#### Input

The primary class `SpectralAnalyzer` and its associated functions take the following inputs:

* **`rr_intervals` (np.ndarray)**: An array of RR intervals in milliseconds.
* **`fs_interpolation` (float, default=4.0)**: The frequency in Hz used to resample the RR series for Welch analysis.
* **`method` (Literal['welch', 'lomb-scargle'])**: The spectral estimation technique to be applied.
* **`bands` (FrequencyBands, optional)**: A configuration object defining the frequency limits for:
    * **VLF (Very Low Frequency)**: 0.003 - 0.04 Hz.
    * **LF (Low Frequency)**: 0.04 - 0.15 Hz.
    * **HF (High Frequency)**: 0.15 - 0.4 Hz.

#### Output

The module returns a **`SpectralMetrics`** object (or a dictionary) containing:

* **Absolute Power**: Total power and power for each specific band (VLF, LF, HF) measured in ms².
* **Normalized Power**: LF and HF components expressed in normalized units (n.u.).
* **LF/HF Ratio**: A numerical value representing the balance between the sympathetic and parasympathetic systems.
* **Peak Frequencies**: The specific frequency values (in Hz) where the maximum power occurs within each band.
* **PSD Data**: The frequency vector and power spectral density values for visualization (e.g., plotting the periodogram).
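
Given a PSD estimate (for example from a Welch periodogram), the band metrics reduce to integrating the spectrum over each band. The sketch below assumes the PSD is already available as `freqs` and `psd` arrays; the helper names are hypothetical, and normalized units are computed as a share of LF + HF, which is a common simplification and may not match the module's definition.

```python
import numpy as np

# Band limits in Hz, as listed in the Input section.
BANDS = {"vlf": (0.003, 0.04), "lf": (0.04, 0.15), "hf": (0.15, 0.4)}


def band_power(freqs, psd, low, high):
    """Trapezoidal integral of the PSD over the band [low, high)."""
    sel = (freqs >= low) & (freqs < high)
    f, p = freqs[sel], psd[sel]
    return float(np.sum((p[1:] + p[:-1]) / 2.0 * np.diff(f)))


def hrv_band_metrics(freqs, psd):
    """Absolute band powers, normalized LF/HF (assumed LF + HF denominator), and LF/HF."""
    out = {name: band_power(freqs, psd, lo, hi) for name, (lo, hi) in BANDS.items()}
    lf, hf = out["lf"], out["hf"]
    out["lf_nu"] = 100.0 * lf / (lf + hf)
    out["hf_nu"] = 100.0 * hf / (lf + hf)
    out["lf_hf_ratio"] = lf / hf
    return out


if __name__ == "__main__":
    freqs = np.linspace(0.0, 0.5, 5001)
    psd = np.full_like(freqs, 1000.0)         # flat synthetic spectrum in ms^2/Hz
    print(hrv_band_metrics(freqs, psd))
```

With RR intervals in milliseconds the PSD is in ms²/Hz, so each integrated band power comes out in ms².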

---

### Step 2.3: WaveletExtractor

**Script**: [src/features/wavelet.py](../src/features/wavelet.py)

This module implements feature extraction based on the **Discrete Wavelet Transform (DWT)**, a fundamental technique for the time-frequency analysis of ECG signals. By decomposing the signal into different scales (levels), the module isolates fine details like spikes and noise at lower scales, the QRS complex at middle scales, and P/T waves or baseline wander at higher scales. It is particularly effective at capturing non-stationary morphological variations that traditional methods might miss, providing a set of mathematical descriptors (energy, entropy, and statistical parameters) ideal for Deep Learning models.

#### Input

The `WaveletExtractor` class and its primary functions accept the following:

* **`signal` (np.ndarray)**: The ECG signal (either a segmented beat or a continuous strip) to be decomposed.
* **`WaveletConfig` (optional)**: A configuration object used to define:
    * **`wavelet`**: The type of mother wavelet (defaults to `'db4'`, Daubechies 4, which is highly effective for ECG morphology).
    * **`max_level`**: The number of decomposition levels (if `None`, it is automatically calculated based on signal length).
    * **`mode`**: The signal extension mode (e.g., `'symmetric'`).

#### Output

The module returns a dictionary or a **`WaveletFeatures`** object containing:

* **`energy_per_level`**: The distribution of signal energy across the various approximation and detail levels.
* **`wavelet_entropy`**: A measure of signal complexity or disorder based on the wavelet energy distribution.
* **`coefficient_statistics`**: A set of statistical parameters (mean, standard deviation, skewness, and kurtosis) calculated for each decomposition level.
* **`coefficients`**: The raw wavelet coefficients (optional), useful for signal reconstruction or custom filtering.
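
To show how energy and entropy features fall out of a multilevel decomposition, the sketch below implements a Haar DWT in plain NumPy. Note the substitution: Haar is used here instead of the module's default `db4` only because it fits in a few lines. The function names and the padding rule for odd lengths are assumptions.

```python
import numpy as np


def haar_dwt(signal, levels):
    """Multilevel Haar DWT; returns [approx, detail_coarsest, ..., detail_finest]."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:                   # pad odd lengths by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))   # detail (high-pass) coefficients
        approx = (even + odd) / np.sqrt(2.0)          # approximation (low-pass) coefficients
    return [approx] + details[::-1]


def wavelet_energy_features(coeffs):
    """Relative energy per sub-band and the Shannon entropy of that distribution."""
    energy = np.array([np.sum(c ** 2) for c in coeffs])
    rel = energy / energy.sum()
    nonzero = rel[rel > 0]
    return rel, float(-np.sum(nonzero * np.log(nonzero)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coeffs = haar_dwt(rng.normal(size=256), levels=4)
    rel_energy, entropy = wavelet_energy_features(coeffs)
    print(rel_energy, entropy)
```

Because the transform is orthonormal, the sub-band energies sum to the energy of the input (for even lengths), which is what makes the relative energies a proper distribution for the entropy.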

---

### Step 2.4: CompressibilityCalculator ⭐ CORE

**Script**: [src/features/compressor.py](../src/features/compressor.py)

This module implements advanced techniques to measure the **algorithmic complexity** and regularity of ECG signals and RR time series. It utilizes compressibility as a proxy for Kolmogorov complexity, based on the premise that more regular and predictable signals are more easily compressed. The module includes classic entropy algorithms (Sample, Approximate, Permutation) and Lempel-Ziv complexity metrics, providing a non-linear characterization of the signal that complements standard morphological and spectral analyses.

#### Input

The primary components, such as the `CompressibilityCalculator` class, accept the following parameters:

* **`signal` (np.ndarray)**: The filtered ECG signal or the RR interval series.
* **`CompressibilityConfig` (optional)**: A configuration object to tune the algorithm parameters:
    * **`entropy_m`**: Embedding dimension for entropy calculation (default: 2).
    * **`entropy_r_factor`**: Tolerance factor (default: 0.2 * standard deviation of the signal).
    * **`perm_order` & `perm_delay`**: Specific parameters for Permutation Entropy.

#### Output

The module returns a dictionary of numerical features (**`Dict[str, float]`**) including:

* **Compression Ratios**: Ratios calculated using `gzip`, `bzip2`, and `lzma` algorithms.
* **Entropy Metrics**: Values for *Sample Entropy*, *Approximate Entropy*, *Permutation Entropy*, and *Shannon Entropy*.
* **Complexity Measures**: Lempel-Ziv complexity index and Kolmogorov complexity estimates.
* **Hjorth Parameters**: Statistical time-domain metrics for activity, mobility, and complexity.
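
The premise that regular signals compress better can be checked with the standard library alone. The sketch below quantizes a signal to 8 bits and compares compressed and raw sizes; the 8-bit quantization, the helper names, and the use of `zlib` (the DEFLATE algorithm behind gzip) as a stand-in for gzip are assumptions, not the behaviour of `compressor.py`.

```python
import bz2
import lzma
import math
import random
import zlib


def quantize(signal):
    """Map samples linearly onto 0..255 and pack them as bytes."""
    lo, hi = min(signal), max(signal)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return bytes(int(round((v - lo) * scale)) for v in signal)


def compression_ratios(signal):
    """Compressed size divided by raw size for three general-purpose compressors."""
    raw = quantize(signal)
    return {
        "gzip": len(zlib.compress(raw, 9)) / len(raw),   # DEFLATE, as used by gzip
        "bzip2": len(bz2.compress(raw, 9)) / len(raw),
        "lzma": len(lzma.compress(raw)) / len(raw),
    }


if __name__ == "__main__":
    periodic = [math.sin(2 * math.pi * i / 50) for i in range(5000)]
    rng = random.Random(0)
    noise = [rng.random() for _ in range(5000)]
    print("periodic:", compression_ratios(periodic))
    print("noise:   ", compression_ratios(noise))
```

A lower ratio means a more compressible, more regular signal; the quantization step matters because compressing raw floating-point bytes would mostly measure mantissa noise.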

---

### Step 2.5: DiagnosisMapper

**Script**: [src/features/diagnosis_mapper.py](../src/features/diagnosis_mapper.py)

This module is a specialized utility designed to handle clinical diagnostic labels, specifically for the **PTB-XL dataset**. It implements a mapping system that translates raw **SCP-ECG (Standard Communication Protocol for Computer-Assisted Electrocardiography)** codes into high-level diagnostic superclasses. This standardization is crucial for supervised Deep Learning tasks, as it groups dozens of specific clinical findings into five primary categories (Normal, Myocardial Infarction, ST/T Changes, Conduction Disturbances, and Hypertrophy), reducing label noise and balancing the dataset for model training.

#### Input

The main class `DiagnosisMapper` and the utility function `create_diagnosis_features` process the following:

* **`df` (pd.DataFrame)**: A pandas DataFrame containing dataset metadata (typically the `ptbxl_database.csv`).
* **`scp_column` (str)**: The name of the column containing the diagnostic codes (usually `'scp_codes'`), which are often stored as dictionaries within strings in raw datasets.
* **`threshold` (float, optional)**: A confidence threshold for labels; only codes with a probability/certainty above this value are considered during the mapping process.

#### Output

The module generates an enriched dataset or a mapping summary:

* **Categorized DataFrame**: Returns a new DataFrame with additional columns for the five major superclasses (`NORM`, `MI`, `STTC`, `CD`, `HYP`). Each column contains a boolean or binary value indicating the presence of that diagnostic category.
* **`primary_label`**: An added column identifying the most significant or highest-confidence diagnosis for each record.
* **Mapping Summary (Dict)**: A dictionary providing statistics on the mapping, including the number of recognized codes, total records per category, and the amount of "Unknown" or unmapped labels.
* **Descriptions**: Human-readable strings corresponding to technical SCP codes (e.g., mapping "NDT" to "non-diagnostic T abnormalities").
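
The core of the mapping is parsing the stringified dictionary in `scp_codes` and looking each code up in a superclass table. The sketch below uses only a small illustrative subset of that table (the full one ships with PTB-XL as `scp_statements.csv`); the function name `map_superclasses` and the `>=` threshold comparison are assumptions.

```python
import ast

# Illustrative subset of the PTB-XL code-to-superclass table, not the full mapping.
SUPERCLASS = {
    "NORM": "NORM", "IMI": "MI", "ASMI": "MI", "NDT": "STTC",
    "NST_": "STTC", "LVH": "HYP", "CLBBB": "CD", "CRBBB": "CD",
}


def map_superclasses(scp_codes: str, threshold: float = 0.0) -> dict:
    """Parse a stringified {code: likelihood} dict and return binary superclass flags."""
    codes = ast.literal_eval(scp_codes)       # safe parsing of the dict literal
    flags = {name: 0 for name in ("NORM", "MI", "STTC", "CD", "HYP")}
    for code, likelihood in codes.items():
        superclass = SUPERCLASS.get(code)     # unmapped codes (e.g. rhythm codes) are ignored
        if superclass and likelihood >= threshold:
            flags[superclass] = 1
    return flags


if __name__ == "__main__":
    print(map_superclasses("{'NORM': 100.0, 'SR': 0.0}"))
    print(map_superclasses("{'IMI': 50.0, 'CLBBB': 100.0}", threshold=80.0))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to Python literals, which is the safer choice for strings read from a CSV file.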

---

### Step 2.6: Feature Pipeline

**Script**: [src/features/feature_pipeline.py](../src/features/feature_pipeline.py)

The **Feature Pipeline** is the high-level orchestrator designed to consolidate all extraction methodologies within the **Kepler-ECG** ecosystem. It acts as a unified interface that coordinates the simultaneous extraction of morphological, interval-based, spectral (HRV), and wavelet-based features. By transforming preprocessed ECG signals and detected R-peaks into a comprehensive numerical feature vector, this module prepares the data for statistical analysis or as input for machine learning models. It supports batch processing with multi-threading and includes robust error handling to ensure pipeline stability during large-scale dataset operations.

#### Input

The `FeaturePipeline` class and its main execution methods (e.g., `extract_all_features`) require:

* **`signal` (np.ndarray)**: The preprocessed/filtered ECG waveform.
* **`r_peaks` (np.ndarray)**: The indices of detected R-peaks (obtained from the `Segmentation` module).
* **`sampling_rate` (int)**: The frequency (Hz) used for time-to-sample conversions.
* **`FeatureConfig` (optional)**: A structured configuration object to toggle specific extractors:
    * `extract_morphological`: Boolean to enable/disable wave shape analysis.
    * `extract_spectral`: Boolean for HRV frequency domain analysis.
    * `extract_wavelet`: Boolean for multi-scale decomposition.
    * `n_jobs`: Number of threads for parallel batch extraction.

#### Output

The pipeline returns a **`FeatureVector`** (or a list of vectors for batch mode) containing:

* **`morphological_features`**: A set of amplitudes and areas (P, Q, R, S, T waves).
* **`interval_features`**: Clinical durations (PR, QRS, QT) and corrected QTc values.
* **`spectral_features`**: Frequency band powers (VLF, LF, HF) and the LF/HF ratio.
* **`wavelet_features`**: Energy and entropy metrics across multiple decomposition levels.
* **`processing_metadata`**: Information regarding execution time and potential warnings (e.g., if a specific feature couldn't be calculated due to signal length).
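
The toggle-and-collect pattern described above might look like the following sketch. The `FeatureConfig` field names come from the list above, but the `extractors` argument, the return structure, and the error-handling policy are assumptions made for illustration; they are not the actual `FeaturePipeline` interface.

```python
from dataclasses import dataclass


@dataclass
class FeatureConfig:
    extract_morphological: bool = True
    extract_spectral: bool = True
    extract_wavelet: bool = True


def extract_all_features(signal, r_peaks, sampling_rate, extractors, config=None):
    """Run every enabled extractor; a failure becomes a warning instead of aborting."""
    config = config or FeatureConfig()
    features, warnings = {}, []
    for name, func in extractors.items():     # hypothetical {name: callable} registry
        if not getattr(config, f"extract_{name}", True):
            continue                          # extractor disabled in the configuration
        try:
            features[f"{name}_features"] = func(signal, r_peaks, sampling_rate)
        except Exception as exc:              # keep the batch alive on a single failure
            warnings.append(f"{name}: {exc}")
    return {**features, "processing_metadata": {"warnings": warnings}}
```

Recording failures in `processing_metadata` rather than raising lets a batch over thousands of records finish and report which features could not be computed.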

---

### Step 2.7: Complete FeaturePipeline

**Script**: [scripts/extract_features_dataset.py](../scripts/extract_features_dataset.py)

This script serves as the final operational utility in the **Kepler-ECG** preprocessing suite. It is a high-level command-line tool designed to perform "Phase 2" of data preparation: extracting advanced multidimensional features from a previously processed dataset. It synchronizes raw signal data with metadata, applying morphological, spectral, wavelet, and compressibility extractors simultaneously. Additionally, it integrates the **Diagnosis Mapper** to align clinical labels, resulting in a comprehensive, "ML-ready" CSV file that can be fed directly into deep learning models or statistical analysis tools.

#### Input (Command Line Arguments)

The script requires specific paths and flags to coordinate the extraction process:

* **`--phase1` (Required)**: Path to the CSV file generated during Phase 1 (e.g., `ptb-xl_features.csv`), which contains file pointers and basic metadata.
* **`--data-dir` (Required)**: The root directory where the original raw ECG files are stored.
* **`--output` (Optional)**: Specific path for the output CSV. If omitted, it appends `_features_extracted` to the input filename.
* **`--lead` (Default: 0)**: Specifies which ECG lead to extract features from (essential for multi-lead datasets).
* **Feature Toggles**: Optional flags to disable specific extractions (e.g., `--no-wavelet`, `--no-compressibility`, `--no-diagnosis`).
* **Processing Limits**: `--n_samples` and `--start` to allow for partial dataset processing or resume capabilities.

#### Output

The script generates a final structured dataset and execution logs:

* **`*_features_extracted.csv`**: A dense tabular file where each row represents an ECG record and columns contain the full suite of extracted features (Amplitudes, Intervals, QTc, Spectral powers, Wavelet coefficients, Entropy, and diagnostic classes).
* **Progress Telemetry**: A real-time progress bar (via `tqdm`) showing the processing rate (samples per second) and estimated time of completion.
* **Statistical Summary**: A final console output detailing the number of successful extractions versus failures (due to signal quality or missing files).

#### Running the script for different datasets

```bash
# Process [dataset]
python scripts/extract_features_dataset.py \
    --phase1 "./results/[dataset]/[dataset]_features.csv" \
    --data-dir "./data/raw/[dataset]"
```

---
---

## 🔍 PHASE 3: Discovery & Analysis

### Objective
Statistical validation, visualizations, dataset preparation for SR.

### Step 3.1: Statistical Analysis

**Script**: [src/analysis/statistical.py](../src/analysis/statistical.py)

The **Statistical Analysis Module** is the core component of "Phase 3" in the **Kepler-ECG** project. It provides a robust framework for validating the clinical relevance of extracted features across different diagnostic groups. The module implements a suite of parametric and non-parametric tests (ANOVA, Kruskal-Wallis, T-tests) to determine if features like HRV spectral power or wavelet entropy vary significantly between healthy subjects and patients with cardiac pathologies. It also calculates effect sizes (Cohen's d) and performs correlation analysis to identify the most discriminative biomarkers for subsequent Deep Learning training.

#### Input

The main analytical classes (`StatisticalAnalyzer`) and their methods accept:

* **`df` (pd.DataFrame)**: The structured feature matrix (typically the `*_features_extracted.csv` file) containing both numerical features and categorical diagnostic labels.
* **`feature` (str)**: The name of the specific column to analyze (e.g., `'hf_power'`, `'qrs_duration'`).
* **`category_col` (str)**: The column used for grouping data (e.g., `'diagnostic_superclass'`).
* **`config` (optional)**: Parameters to define the significance level (α) and minimum group size requirements.

#### Output

The module returns structured result objects (**`ANOVAResult`**, **`PairwiseComparison`**) containing:

* **Global Statistics**: F-statistic or H-statistic and their corresponding p-values to check for overall group differences.
* **Significance Flags**: Boolean indicators (`is_significant`) and visual markers (e.g., `***` for p < 0.001).
* **Pairwise Comparisons**: Detailed results from post-hoc tests (e.g., Mann-Whitney U) showing exactly which diagnostic classes differ from each other.
* **Effect Sizes**: Cohen's  or similar metrics to quantify the magnitude of the difference between groups.
* **Descriptive Stats**: Group-specific means, standard deviations, and sample counts for reporting.
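
To make the effect-size and significance-flag outputs concrete, here is a minimal sketch in plain Python. The function names `cohens_d` and `significance_marker` are hypothetical and not taken from `statistical.py`; the pooled-standard-deviation form of Cohen's d is assumed, and the star thresholds follow the common 0.05 / 0.01 / 0.001 convention rather than a confirmed project setting.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variances (ddof=1) weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def significance_marker(p_value):
    """Map a p-value to star notation (assumed conventional thresholds)."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"


# Toy QRS durations in ms (invented values, not project data).
norm = [88.0, 92.0, 90.0, 94.0, 86.0]
hyp = [100.0, 104.0, 98.0, 106.0, 102.0]
d = cohens_d(hyp, norm)
print(f"Cohen's d = {d:.2f}, marker for p = 0.0004: {significance_marker(0.0004)}")
```

A p-value says whether a difference is detectable; the effect size says whether it is large enough to matter, which is why the two are reported side by side.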

---

### Step 3.2: Research Visualization

**Script**: [src/analysis/visualization.py](../src/analysis/visualization.py)

The **Visualization Module** is the graphical engine for the final phase of the **Kepler-ECG** project. It is designed to produce publication-quality charts and exploratory data analysis (EDA) plots that reveal the underlying patterns in cardiac data. The module focuses on the relationship between mathematical features (Morphological, Spectral, Wavelet, and Compressibility) and clinical diagnoses. It provides advanced tools for dimensionality reduction, correlation mapping, and group-wise distribution analysis, ensuring that the results of the Deep Learning and Statistical phases are interpretable and visually compelling.


#### Input

The `ECGVisualizer` class and its plotting methods take the following inputs:

* **`df` (pd.DataFrame)**: The processed feature matrix containing numerical extractions and categorical labels (NORM, MI, STTC, CD, HYP).
* **`features` (List[str])**: A list of specific feature names to be visualized (e.g., `['samp_en', 'hf_power', 'qrs_duration']`).
* **`target_col` (str)**: The categorical column used for grouping and coloring the data (default: `diagnostic_superclass`).
* **`VisualizationConfig` (optional)**: Parameters to customize plot aesthetics, such as color palettes (e.g., the predefined `DIAGNOSIS_COLORS`), figure sizes, and DPI settings for high-resolution export.

#### Output

The module generates high-fidelity visual outputs and saved files:

* **Statistical Plots**: Box plots and violin plots showing the distribution and variance of features across different pathologies.
* **Correlation Heatmaps**: Matrices illustrating the redundancy or independence between different ECG descriptors.
* **Manifold Projections**: 2D or 3D scatter plots using **t-SNE** or **PCA** to visualize how well the different diagnostic classes cluster in the feature space.
* **Exported Files**: High-resolution images saved in multiple formats (e.g., `.png`, `.pdf`, `.svg`) in a structured `reports/` directory.

---


### Step 3.3: Feature Selection

**Script**: [src/analysis/feature_selection.py](../src/analysis/feature_selection.py)

The **Feature Selection Module** is a critical analytical component of Phase 3, designed to identify the most discriminative physiological biomarkers from the high-dimensional feature set generated in Phase 2. It implements a hybrid selection strategy combining univariate statistical tests (ANOVA F-score, Mutual Information) with multivariate machine learning models (Random Forest, Gradient Boosting). By performing redundancy analysis and calculating importance rankings, the module filters out noisy or highly correlated features, ensuring that subsequent Symbolic Regression and Deep Learning models are efficient, interpretable, and less prone to overfitting.

#### Input

The primary class `FeatureSelector` and its selection methods (e.g., `calculate_importance`, `select_best_features`) take the following inputs:

* **`df` (pd.DataFrame)**: The input feature matrix containing all extracted ECG metrics and clinical labels.
* **`target_col` (str)**: The objective variable for selection (e.g., `'diagnostic_superclass'` for classification or `'age'` for regression).
* **`n_features` (int, default=10)**: The target number of top-performing features to retain.
* **`max_correlation` (float, default=0.85)**: The threshold for redundancy filtering; if two high-ranking features are more correlated than this value, the lower-ranked one is discarded.
* **`method` (str)**: The algorithm to use for importance calculation (e.g., `'random_forest'`, `'mutual_info'`, or `'gradient_boosting'`).

#### Output

The module returns a **`SelectionSummary`** (or a detailed dictionary) containing:

* **`suggested_features` (List[str])**: The final list of selected feature names, optimized for predictive power and low redundancy.
* **`importance_scores` (pd.DataFrame)**: A ranked table showing the raw scores (F-score, Gini importance, or MI) and the final rank for every feature analyzed.
* **`group_distribution` (Dict)**: A breakdown showing how many features were selected from each category (e.g., 3 Morphological, 4 Spectral, 3 Wavelet).
* **`redundancy_report`**: Information on which features were removed due to high multicollinearity.
* **`baseline_comparison`**: Cross-validation scores showing the performance of a model using only the selected features versus the full feature set.
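
The redundancy filter controlled by `max_correlation` can be sketched as a greedy pass over the importance ranking. This is an illustrative reading of the description above, not the code in `feature_selection.py`: the helper names `pearson` and `filter_redundant`, the toy columns, and the assumed ranking are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_redundant(ranked_features, columns, max_correlation=0.85, n_features=10):
    """Walk the ranking from best to worst and keep a feature only if its
    absolute correlation with every already-kept feature is within the limit."""
    selected, dropped = [], {}
    for name in ranked_features:
        clash = next(
            (kept for kept in selected
             if abs(pearson(columns[name], columns[kept])) > max_correlation),
            None,
        )
        if clash is None:
            selected.append(name)
        else:
            dropped[name] = clash  # the higher-ranked feature it duplicates
        if len(selected) == n_features:
            break
    return selected, dropped


# Toy columns (invented values); "hf_power_norm" nearly duplicates "hf_power".
columns = {
    "hf_power": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hf_power_norm": [2.0, 4.0, 6.0, 8.0, 10.5],
    "qrs_duration": [5.0, 1.0, 4.0, 2.0, 3.0],
}
ranking = ["hf_power", "hf_power_norm", "qrs_duration"]  # assumed importance order
selected, dropped = filter_redundant(ranking, columns)
print(selected, dropped)
```

Because the pass is greedy, the outcome depends on the ranking: the higher-ranked member of a correlated pair always survives.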

---

### Step 3.4: Symbolic Regression Preparation

**Script**: [src/analysis/sr_preparation.py](../src/analysis/sr_preparation.py)

This module is a specialized data engineering component designed to bridge the gap between extracted ECG features and **Symbolic Regression (SR)** tools like PySR. Its primary purpose is to transform complex feature matrices into highly optimized, clean, and normalized datasets. It supports different machine learning tasks, including binary classification (e.g., Normal vs. Pathology), multi-class classification, and regression (e.g., predicting cardiac age). By automating data cleaning, feature scaling, and train-test splitting, it ensures that the mathematical formulas discovered by SR are both statistically sound and interpretable.


#### Input

The `SRDataPreparer` class and its associated methods take the following inputs:

* **`df` (pd.DataFrame)**: The comprehensive feature matrix generated in earlier phases (containing morphological, spectral, and wavelet features).
* **`target_col` (str)**: The dependent variable to be predicted (e.g., a specific diagnostic class or a numerical value).
* **`feature_list` (List[str], optional)**: A specific subset of features to include in the analysis to reduce dimensionality.
* **`PreparationConfig` (optional)**: Configuration for data processing, including:
    * **Normalization Method**: Choice between `StandardScaler` (Z-score) or `MinMaxScaler`.
    * **Task Type**: Defines the objective as `binary`, `multiclass`, or `regression`.
    * **Test Split Ratio**: The percentage of data reserved for validation (e.g., 0.2).

#### Output

The module returns an **`SRDataset`** object, which contains:

* **`X` (np.ndarray)**: The cleaned and scaled feature matrix ready for the regressor.
* **`y` (np.ndarray)**: The encoded target vector (numeric labels for classification or continuous values for regression).
* **`feature_names`**: A list of strings corresponding to the columns in `X`, essential for the symbolic regressor to name variables in the discovered formulas (e.g., `x0` becomes `hf_power`).
* **`baseline_metrics`**: A dictionary containing performance scores from a baseline model (Logistic or Linear Regression) to provide a benchmark for the Symbolic Regression results.
* **Metadata**: Information regarding the scaling parameters used, enabling the reversal of normalization for interpretation.
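
Two of the outputs above are easier to see in code: the scaling metadata that lets normalization be reversed, and the mapping from positional variables to feature names. The sketch below is dependency-free and purely illustrative. All function names are hypothetical, the population standard deviation is assumed for the Z-score, and the `x0`, `x1` placeholder style is an assumption about how the regressor labels unnamed inputs.

```python
import math
import re


def fit_zscore(values):
    """Return (mean, std) for one non-constant feature column (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def scale(values, mean, std):
    """Apply the Z-score transform with previously fitted parameters."""
    return [(v - mean) / std for v in values]


def unscale(scaled, mean, std):
    """Reverse the normalization so values can be read in original units."""
    return [s * std + mean for s in scaled]


def name_variables(formula, feature_names):
    """Replace positional placeholders x0, x1, ... with feature names."""
    return re.sub(r"\bx(\d+)\b", lambda m: feature_names[int(m.group(1))], formula)


hf_power = [10.0, 20.0, 30.0, 40.0, 50.0]  # invented values
mean, std = fit_zscore(hf_power)
scaled = scale(hf_power, mean, std)
restored = unscale(scaled, mean, std)

readable = name_variables("x0 * 2.5 + exp(x1)", ["hf_power", "qrs_duration"])
print(readable)
```

Storing `mean` and `std` with the dataset is what makes a discovered formula interpretable afterwards, since its coefficients apply to scaled inputs.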

---

### Step 3.5: Phase 3 pipeline

**Script**: [src/analysis/pipeline.py](../src/analysis/pipeline.py)

The **Phase 3 Pipeline Orchestrator** is the final management layer of the **Kepler-ECG** project. Its role is to automate the entire analytical workflow following feature extraction. It sequentially triggers statistical validation, generates a full suite of research visualizations, performs intelligent feature selection, and exports data specifically formatted for **Symbolic Regression**. This module ensures that the transition from raw physiological features to interpretable mathematical models is consistent, reproducible, and ready for clinical reporting.

#### Input

The `Phase3Pipeline` class and its execution command accept:

* **`data_path` (str)**: The path to the comprehensive feature CSV file (e.g., `ptb-xl_features_extracted.csv`) generated in Phase 2.
* **`output_dir` (str)**: The destination folder where all reports, plots, and datasets will be stored.
* **`PipelineConfig` (optional)**: A configuration object to customize the execution:
    * **`sr_features_count`**: The number of top features to select for Symbolic Regression.
    * **`skip_viz`**: A boolean flag to bypass image generation for faster processing.
    * **`significance_level`**: The alpha threshold (e.g., 0.05) for statistical tests.

#### Output

The pipeline produces a structured directory of results, typically including:

* **Statistical Reports**: JSON and text files summarizing ANOVA results and p-values for all biomarkers.
* **Visualization Gallery**: A subfolder containing high-resolution PNG/PDF plots (Heatmaps, Box plots, t-SNE clusters).
* **Selected Features List**: A filtered dataset containing only the most discriminative, non-redundant features.
* **SR-Ready Datasets**: Multiple CSV files (e.g., `sr_binary_MI.csv`) formatted specifically for discovery of mathematical formulas.
* **`pipeline_results.json`**: A master metadata file containing baseline performance metrics (e.g., Logistic Regression accuracy) and processing timestamps.

#### Running the pipeline for different datasets

```bash
# Process [dataset] (the trailing backticks are PowerShell line continuations; use \ in bash)
python -m src.analysis.pipeline `
    --data "./results/[dataset]/[dataset]_features_extracted.csv" `
    --output "./results/[dataset]" `
    --reports "./results/[dataset]/reports" `
    --sr-ready "./results/[dataset]/sr_ready"
```

---
---

## 🧬 PHASE 4: Symbolic Regression Discovery

### Objective
Discover interpretable formulas with PySR.

### Step 4.1: Setup PySR/Julia

```bash
pip install pysr
python -c "import pysr; pysr.install()"  # Installs the Julia backend (recent PySR releases may set it up automatically on first import)
```

---

### Step 4.2: Stream A - Classification (NORM vs HYP)

**Script**: [scripts/run_sr_stream_a.py](../scripts/run_sr_stream_a.py)

This module represents the core of **Phase 4 (Discovery)** in the Kepler-ECG project. It implements a specialized **Symbolic Regression (SR)** workflow using the `PySR` library to discover interpretable mathematical formulas for classifying healthy signals (**NORM**) versus Hypertrophy (**HYP**). Unlike "black-box" neural networks, this script searches for the simplest mathematical expression (e.g., using addition, multiplication, and non-linear functions) that can distinguish these conditions. It features an improved logistic loss function, automatic threshold calibration using Youden's J statistic, and 5-fold cross-validation to ensure the discovered formulas are both accurate and robust.

#### Input

The script processes a curated dataset and specific configuration parameters:

* **`sr_binary_HYP.csv`**: A specialized dataset prepared in Phase 3 containing selected features and binary labels for Normal vs. Hypertrophy.
* **`PySRRegressor` Configuration**:
    * **Operators**: A set of mathematical primitives (e.g., `+`, `-`, `*`, `/`, `exp`, `log`, `sin`, `cos`).
    * **Complexity constraints**: Parameters that penalize overly long or complex formulas to favor parsimony.
    * **Loss Function**: Custom log-loss (logistic loss) designed for binary classification.
* **`StandardScaler`**: Normalization parameters to ensure features are on a comparable scale for the evolutionary algorithm.

#### Output

The execution produces high-level scientific results and performance benchmarks:

* **Discovered Formulas**: A list of mathematical equations ranked by their "score" (a balance between accuracy and simplicity).
* **`best_formula`**: The specific equation selected as the optimal trade-off for clinical interpretation.
* **Performance Metrics**:
    * **AUC (Area Under the Curve)**: For both the SR formula and a baseline Logistic Regression.
    * **Accuracy, Sensitivity, and Specificity**: Calculated using the calibrated optimal threshold.
* **Visualizations**: ROC curves and Precision-Recall curves comparing the symbolic model against the baseline.
* **`sr_results_stream_a.csv`**: A CSV file storing all candidate formulas and their associated complexity and error metrics.
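
Threshold calibration with Youden's J statistic can be illustrated in a few lines of plain Python. This is a sketch of the general technique, not the code in `run_sr_stream_a.py`: the function name `youden_threshold` is hypothetical, and the scores and labels are invented.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is predicted positive when its score is >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Invented formula outputs with labels: 1 = HYP, 0 = NORM.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 1, 0, 0, 1, 1, 1, 1]
threshold, j = youden_threshold(scores, labels)
print(f"threshold = {threshold}, J = {j:.2f}")
```

J weights sensitivity and specificity equally, so the chosen cut-off does not simply favor the majority class when NORM and HYP counts differ.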

---

### Step 4.3: Stream B - Cardiac Age

**Script**: [scripts/run_sr_stream_b.py](../scripts/run_sr_stream_b.py)

This module focuses on the regression task of **Phase 4 (Discovery)** within the Kepler-ECG project. Its goal is to discover interpretable mathematical formulas that predict biological **cardiac age** from extracted ECG features. The core concept is that a "predicted age" significantly higher than a person's "chronological age" (the Cardiac Age Delta) can serve as a biomarker for cardiovascular risk. Using `PySR`, the module evolves equations to map physiological features to age, providing a transparent alternative to traditional "black-box" aging models.

#### Input

The script requires a regression-ready dataset and evolutionary parameters:

* **`sr_regression_age.csv`**: A dataset containing pre-selected ECG features (Morphological, Spectral, etc.) and the target chronological age of the subjects.
* **`PySRRegressor` Configuration**:
    * **Objective**: Minimization of Mean Squared Error (MSE) or Mean Absolute Error (MAE).
    * **Complexity constraints**: Penalties to ensure the resulting age-prediction formulas remain simple and clinically interpretable.
    * **Primitive Operators**: Standard arithmetic operators plus functions like `log`, `exp`, and `sqrt` to capture non-linear aging processes.
* **`StandardScaler`**: Feature scaling to normalize inputs before the symbolic search.

#### Output

The module produces the discovered models and an analysis of "cardiac aging":

* **Cardiac Age Formulas**: A collection of mathematical expressions ranked by their accuracy in predicting age.
* **`cardiac_age_best.json`**: The most effective formula selected for clinical use, including its coefficients and complexity.
* **Regression Metrics**:
    * **$R^2$ Score**: The proportion of variance in age explained by the formula.
    * **MAE / RMSE**: The average prediction error in years (comparing the symbolic model to a Ridge/Linear Regression baseline).
* **Cardiac Age Delta Analysis**: Statistics on the difference between predicted and actual age, useful for identifying "fast agers."
* **Visualizations**:
    * **Scatter Plots**: Predicted vs. Actual Age.
    * **Pareto Front**: Visualization showing the trade-off between equation complexity and accuracy.
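
The regression metrics and the Cardiac Age Delta reduce to a few arithmetic steps, sketched here without external dependencies. The function names and the three example subjects are hypothetical; the delta is taken as predicted minus chronological age, matching the description above.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R^2 for paired age values (in years)."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {"mae": mae, "rmse": rmse, "r2": 1.0 - ss_res / ss_tot}


def cardiac_age_delta(predicted, chronological):
    """Predicted minus chronological age; positive values suggest an 'older' heart."""
    return [p - c for p, c in zip(predicted, chronological)]


chronological = [50.0, 60.0, 70.0]  # invented ages
predicted = [52.0, 58.0, 75.0]      # invented formula outputs
metrics = regression_metrics(chronological, predicted)
deltas = cardiac_age_delta(predicted, chronological)
print(metrics, deltas)
```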

---

## 📈 PHASE 4.5: Stream C - Wave Delineation & QTc Validation

### Objective
Extract QT/RR from all ECGs, validate Kepler formulas.

### Step 4.5.1: Wave Delineation

**Script**: [scripts/task1_wave_delineation.py](../scripts/task1_wave_delineation.py)

The **Wave Delineation Pipeline** is a specialized signal processing utility within Phase 4.5 of the **Kepler-ECG** project. Its primary function is the precise identification of PQRST fiducial points: the onset, peak, and offset of each wave in an ECG cycle. Utilizing the `NeuroKit2` library for robust detection, this module automates the extraction of clinical landmarks from the entire PTB-XL dataset. This high-resolution "mapping" of the ECG waveform is essential for calculating complex intervals and providing the anatomical context required for advanced diagnostic modeling.

#### Input

The script operates as a batch processor requiring the following inputs:

* **`--data_path` (Path)**: The directory containing the raw PTB-XL records (including `ptbxl_database.csv` and the `.dat`/`.hea` waveform files).
* **`--sampling_rate` (int, default: 500)**: The frequency at which the signals were recorded, ensuring accurate time-to-sample conversion.
* **`--lead_idx` (int, default: 0)**: The specific lead (e.g., Lead I) to be used for delineation.
* **`--max_records` (Optional)**: A limit on the number of records to process, useful for testing or partial updates.

#### Output

The pipeline generates a structured results package in the specified output directory:

* **`delineation_results.csv`**: A dense table containing the sample indices for every P, Q, R, S, and T component detected for each patient.
* **`interval_statistics.json`**: A summary file providing population-level statistics for clinical metrics such as:
    * **QT and QTc (Bazett, Fridericia)** intervals.
    * **RR intervals** and heart rate (BPM).
    * **P and T wave durations**.
* **`wave_delineation.log`**: A detailed execution log tracking processing successes, failures, and any signal quality warnings encountered.
* **Processing Summary**: A console output reporting the success rate and mean ± standard deviation for the primary cardiac intervals.
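
The role of `--sampling_rate` is easiest to see in the conversion from sample indices to clinical intervals. The sketch below is an assumption-laden illustration, not the script's code: the function and argument names are hypothetical, QT is measured from QRS onset to T-wave offset, and RR is the distance between consecutive R peaks.

```python
def samples_to_ms(n_samples, sampling_rate=500):
    """Convert a duration in samples to milliseconds."""
    return n_samples / sampling_rate * 1000.0


def beat_intervals(q_onset, t_offset, r_peak, next_r_peak, sampling_rate=500):
    """Derive QT and RR (both in ms) and heart rate (bpm) from fiducial sample indices."""
    qt_ms = samples_to_ms(t_offset - q_onset, sampling_rate)
    rr_ms = samples_to_ms(next_r_peak - r_peak, sampling_rate)
    return {"qt_ms": qt_ms, "rr_ms": rr_ms, "hr_bpm": 60000.0 / rr_ms}


# Invented fiducial indices for a single beat recorded at 500 Hz.
beat = beat_intervals(q_onset=1000, t_offset=1200, r_peak=1020, next_r_peak=1520)
print(beat)
```

Passing the wrong sampling rate scales every interval by the same factor, so a 100 Hz record processed as 500 Hz would report intervals five times too short.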

---

### Step 4.5.2: QTc Dataset Preparation & Symbolic Regression QTc

**Script**: [scripts/task2_qtc_dataset_prep.py](../scripts/task2_qtc_dataset_prep.py)

This module is a critical data engineering utility within **Phase 4.5** of the Kepler-ECG project. It focuses on the robust preparation of datasets specifically for **QT Interval Correction (QTc)** analysis. The script processes raw wave features to calculate various clinical QTc formulas (Bazett, Fridericia, and Framingham) and establishes a "ground truth" reference based on a corrected heart rate of 60 bpm. It addresses earlier data inconsistencies with a corrected ("Fixed") reference calculation, so that the resulting dataset is consistently calibrated for Symbolic Regression tasks aimed at discovering new, more accurate heart-rate correction formulas.

#### Input

The script requires pre-extracted wave delineation data and metadata:

* **`--wave_features` (Path)**: The CSV or Parquet file containing the fiducial points and raw intervals (QT, RR) generated by the Delineation Pipeline.
* **`--output_path` (Path)**: The directory where the cleaned datasets and statistical reports will be saved.
* **`scp_codes` (Internal)**: Diagnostic metadata used to filter and categorize records into "Normal" (used for reference calibration) and "Pathological" groups.

#### Output

The module produces highly specialized datasets and a validation report:

* **`df_sr_qtc.csv`**: The primary dataset for Symbolic Regression, containing cleaned and normalized RR intervals, QT intervals, Heart Rate, and standardized labels.
* **`task2_report_v2.json`**: A comprehensive statistical summary including:
    * **HR/QT Correlations**: Evaluation of how well standard formulas (Bazett, etc.) decouple the QT interval from the heart rate.
    * **Interval Summaries**: Mean and standard deviation for QT, RR, and Heart Rate across the processed population.
    * **Success Metrics**: Data regarding the number of records successfully parsed and cleaned.
* **Filtered Subsets**: Separate data structures for "Normal" patients to allow the Symbolic Regressor to learn "Healthy" cardiac dynamics.
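
For reference, the three clinical corrections named above can be written as one-line functions. These are the standard textbook forms (QT in milliseconds, RR in seconds); the function names are illustrative, and the script's actual implementation and units are not confirmed here.

```python
def qtc_bazett(qt_ms, rr_s):
    """Bazett: QTc = QT / sqrt(RR)."""
    return qt_ms / rr_s ** 0.5


def qtc_fridericia(qt_ms, rr_s):
    """Fridericia: QTc = QT / cbrt(RR)."""
    return qt_ms / rr_s ** (1.0 / 3.0)


def qtc_framingham(qt_ms, rr_s):
    """Framingham: QTc = QT + 154 * (1 - RR), in ms."""
    return qt_ms + 154.0 * (1.0 - rr_s)


# At 60 bpm (RR = 1 s) every formula leaves QT unchanged, which is why
# 60 bpm serves as the reference heart rate.
for rr_s in (1.0, 0.64):
    print(
        rr_s,
        round(qtc_bazett(400.0, rr_s), 1),
        round(qtc_fridericia(400.0, rr_s), 1),
        round(qtc_framingham(400.0, rr_s), 1),
    )
```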

---

### Step 4.5.3: QTc Formula Discovery via Symbolic Regression

**Script**: [scripts/task4_sr_qtc_discovery.py](../scripts/task4_sr_qtc_discovery.py)

This module represents the scientific core of **Phase 4.5**, dedicated to the mathematical discovery of new **QT Interval Correction (QTc)** formulas. Using `PySR` (Symbolic Regression), the script searches for optimal mathematical relationships that can decouple the QT interval from the heart rate more effectively than traditional clinical standards like Bazett or Fridericia. It explores three distinct mathematical architectures: direct prediction, correction factors, and additive adjustments. The goal is to evolve an interpretable equation that minimizes the correlation between the corrected QT and the heart rate, providing a more reliable diagnostic tool for Arrhythmia and Long QT Syndrome.

#### Input

The script requires a specialized dataset and evolutionary configuration:

* **`--dataset` (Path)**: The prepared CSV file (e.g., `qtc_sr_dataset_all_v2.csv`) containing cleaned RR intervals, QT intervals, and additional physiological features.
* **`--approach` (all | direct | factor | additive)**: Specifies the mathematical structure the regressor should focus on.
* **`--iterations` (int, default: 150)**: The number of evolutionary generations for the symbolic search.
* **`--maxsize` (int, default: 20)**: The maximum complexity (number of nodes) allowed for the discovered equations to ensure clinical interpretability.
* **Feature Set**: A selection of heart rate-related variables (RR and derived terms) used as the basis for formula evolution.

#### Output

The module generates a comprehensive discovery report and validated formulas:

* **`task4_sr_report.json`**: A master report containing the best-discovered equations for each approach, their mathematical complexity, and loss values.
* **Comparative Metrics**:
    * **r_vs_HR**: The Pearson correlation between the new QTc and Heart Rate (the closer to 0, the better the correction).
    * **MAE (Mean Absolute Error)**: Accuracy compared to the reference 60 bpm ground truth.
* **Baseline Benchmarks**: A performance comparison against standard clinical formulas (Bazett, Fridericia, Framingham).
* **HR-Bin Analysis**: An evaluation of the formula's stability across different heart rate ranges (e.g., Bradycardia vs. Tachycardia).
* **Serialized Models**: Saved `.csv` and `.pkl` files containing the full Pareto front of candidate equations for further research.
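
The `r_vs_HR` metric can be illustrated on synthetic data. The sketch below assumes one possible reading of two of the architectures: a "factor" form, QTc = QT · f(RR), and an "additive" form, QTc = QT + g(RR). The source does not define them precisely, so Fridericia-like and Framingham-like expressions stand in for discovered formulas, and all data values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic beats: QT follows a cube-root law in RR plus a small alternating offset.
rr_s = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
offsets = [5.0, -5.0] * 4
qt_ms = [400.0 * rr ** (1.0 / 3.0) + o for rr, o in zip(rr_s, offsets)]
hr_bpm = [60.0 / rr for rr in rr_s]

# Stand-ins for candidate formulas (assumed forms, see the text above).
candidates = {
    "uncorrected": lambda qt, rr: qt,
    "factor": lambda qt, rr: qt * rr ** (-1.0 / 3.0),
    "additive": lambda qt, rr: qt + 154.0 * (1.0 - rr),
}

r_vs_hr = {
    name: pearson([formula(qt, rr) for qt, rr in zip(qt_ms, rr_s)], hr_bpm)
    for name, formula in candidates.items()
}
for name, r in r_vs_hr.items():
    print(f"{name}: r_vs_HR = {r:+.3f}")
```

The candidate whose correlation lies closest to zero has removed the most heart-rate dependence from the corrected interval.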

---

### Step 4.5.4: QTc Validation & Clinical Comparison

**Script**: [scripts/task5_validation.py](../scripts/task5_validation.py)

The **Validation & Comparison Module** is the final analytical step of Phase 4.5. Its objective is to rigorously test the newly discovered Symbolic Regression formulas against established clinical standards (Bazett, Fridericia, Framingham, and Hodges). The module focuses on "Heart Rate Independence", the gold standard for a QTc formula, ensuring that the corrected QT interval remains stable across various heart rate ranges (bradycardia, normal, and tachycardia). It provides a high-level scientific validation of the formulas' stability and clinical utility, determining whether the new mathematical models offer superior diagnostic accuracy for identifying prolonged QT intervals.

#### Input

The script requires the consolidated dataset from Task 2 and the mathematical definitions of the new formulas:

* **`--dataset` (Path)**: The refined CSV file containing RR intervals, QT intervals, and reference values for the entire population.
* **`--output_path` (Path)**: The directory for storing validation reports and comparison charts.
* **Symbolic Formulas**: The specific mathematical functions discovered in Task 4 (e.g., Linear, Cubic, or Factor-based models) are hardcoded or passed to the validation suite.

#### Output

The module generates a comprehensive validation package:

* **`task5_validation_report.json`**: A detailed report containing:
    * **HR Independence Ranking**: Formulas ranked by their Pearson correlation with heart rate (lower is better).
    * **Bin Analysis**: Accuracy metrics segmented by heart rate bins (e.g., <60 BPM, 60-100 BPM, >100 BPM).
    * **Clinical Threshold Agreement**: Data on how often the new formulas agree with clinical standards in identifying "Prolonged" (>450 ms) or "High Risk" (>500 ms) QT intervals.
    * **Cross-Validation Stability**: Statistics on the variance of the formula's performance across different data folds.
* **Comparative Visualizations**:
    * **Correlation Plots**: Visualizing QTc vs. HR for all competing formulas.
    * **Residual Analysis**: Charts showing the error distribution relative to the 60 bpm ground truth.
* **Console Summary**: A final leaderboard ranking all formulas (traditional and discovered) by their scientific robustness.
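
The bin analysis and clinical threshold agreement can be sketched as follows. The bin edges and the 450 ms limit come from the list above; the function names and the six example records are hypothetical and do not reflect the contents of `task5_validation.py`.

```python
def hr_bin(hr_bpm):
    """Assign a heart rate to one of the three bins used in the report."""
    if hr_bpm < 60:
        return "<60"
    if hr_bpm <= 100:
        return "60-100"
    return ">100"


def mean_by_bin(hr_values, qtc_values):
    """Average QTc within each heart rate bin."""
    groups = {}
    for hr, qtc in zip(hr_values, qtc_values):
        groups.setdefault(hr_bin(hr), []).append(qtc)
    return {label: sum(v) / len(v) for label, v in groups.items()}


def threshold_agreement(qtc_a, qtc_b, limit_ms=450.0):
    """Fraction of records on which two QTc series agree about exceeding the limit."""
    same = sum(1 for a, b in zip(qtc_a, qtc_b) if (a > limit_ms) == (b > limit_ms))
    return same / len(qtc_a)


# Invented records: heart rate plus QTc from a clinical formula and a new one.
hr = [50.0, 55.0, 70.0, 90.0, 110.0, 120.0]
qtc_clinical = [430.0, 440.0, 445.0, 455.0, 470.0, 480.0]
qtc_new = [435.0, 438.0, 440.0, 448.0, 452.0, 455.0]

bin_means = mean_by_bin(hr, qtc_new)
agreement = threshold_agreement(qtc_clinical, qtc_new)
print(bin_means, round(agreement, 3))
```

A heart-rate-independent formula should show nearly equal bin means; a drift across bins indicates residual rate dependence.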

---

### Step 4.5.5: Phase 4.5 Report & Integration Module

**Script**: [scripts/task6_report_integration.py](../scripts/task6_report_integration.py)

The **Report & Integration Module** acts as the final analytical synthesizer for the QTc discovery stream (Phase 4.5). It is designed to consolidate results from wave delineation, formula discovery, and clinical validation into a single cohesive narrative. Beyond simple data aggregation, this script generates a **Clinical Interpretation** document that explains the medical significance of the newly discovered "Kepler-Cubic" and "Kepler-Linear" formulas. Crucially, it also prepares the structured prompt for **Phase 5**, facilitating the transition from automated discovery to final scientific reporting and peer-review preparation.

#### Input

The script functions as a post-processor for the Phase 4.5 results:

* **`--validation_report` (Path)**: The JSON output from Task 5 containing the comparison between Kepler formulas and clinical standards.
* **`--sr_report` (Path)**: The JSON output from Task 4 detailing the symbolic regression search results and discovered equations.
* **`--wave_summary` (Path)**: The summary JSON from Task 1 detailing the population statistics of the PQRST landmarks.
* **`--output_path` (Path)**: The directory where the final integrated documentation will be saved.

#### Output

The module generates the final deliverables of the discovery phase:

* **`clinical_interpretation_stream_c.md`**: A markdown document providing a qualitative and quantitative analysis of the new formulas, formatted for clinical stakeholders.
* **`PROMPT_FASE_5.md`**: A specialized file containing a summary of the entire project's success criteria and achieved metrics, intended to guide the next phase of the project (Documentation and Publication).
* **`final_integration_report.json`**: A master JSON file merging all relevant metrics (Success Criteria, HR Independence, Complexity) for archival purposes.
* **Success Criteria Summary**: A console-based checklist verifying whether the project met its predefined targets.

---