# ECG Image Digitization: Signal Extraction from Scanned Paper ECG Printouts with Computer Vision and Deep Learning

**Dataset:** physionet-ecg-images
**Generated by:** Alexandria Research Assistant
**Date:** 2025-10-26

---

This notebook was automatically generated by Alexandria with comprehensive research data.


## 📚 Research Background & Literature Review

**Top 3 Papers and Code Repositories (2023–2025)**

| Paper/Repo Title              | Link       | Key Contributions                                                         |
|-------------------------------|------------|--------------------------------------------------------------------------|
| Deep Learning-Based Digitization of Overlapping ECG Signals | [arXiv:2506.10617v1](https://arxiv.org/html/2506.10617v1) | U-Net segmentation, adaptive thresholding, Viterbi path finding for digitization |
| PTB-Image: Scanned Paper ECG Dataset (VinDigitizer)         | [arXiv:2502.14909v1](https://arxiv.org/html/2502.14909v1) | Three-stage pipeline: row detection, background removal, waveform extraction     |
| ECGtizer: Fully Automated Digitizing and Signal Recovery    | [arXiv:2412.12139v1](https://arxiv.org/html/2412.12139v1) | Framework unifying CV steps (lead detection, thresholding, extraction), reviews SOTA |

Supplementary Code & Dataset:  
- **ECG-Image-Kit (code & toolbox):** [GitHub - ecg-image-kit](https://github.com/alphanumericslab/ecg-image-kit)[5][7].  
- **PTB-Image dataset:** [PTB-Image paper](https://arxiv.org/html/2502.14909v1)[3].

---

**Key Techniques & SOTA Approaches in ECG Image Digitization**

1. **Image Preprocessing for Scanned/Photographed ECGs**
   - **Grid Detection & Removal:** Use color channels and thresholding (global Otsu, locally adaptive Sauvola) to remove grid artifacts, or deep-learning segmentation, often with U-Net variants, for robustness to image artifacts[1][3][4].
   - **Noise Handling:** Train models on paired image-signal data (e.g., scanned PTB-Image records, synthetic ECG-Image-Kit renderings) that cover varying noise levels[2][3][5].
   - **Adaptive Thresholding:** Separate the dark ECG waveform from the colored grid and background using intensity-based and adaptive methods; this is essential for low-quality scans[4].

2. **Signal Extraction and Digitization**
   - **U-Net/Deep Segmentation:** Segment the ECG trace from background, grid, and overlapping signals[1].
   - **Path Finding Algorithms:** Viterbi pathfinding or dynamic programming to trace waveform pixels across the segmented mask, robust to discontinuities[1].
   - **Waveform Row Detection:** Localize and crop specific ECG lead rows for targeted extraction, accounting for standardized clinical layouts[3][4].
   - **Amplitude-Time Series Mapping:** Convert pixel coordinates into time and amplitude values using detected grid spacing for accurate signal reconstruction; interpolation for standard sample rates[1][4].
   - **Post-Processing:** Resample signal, cross-correlation for lag correction, median subtraction for baseline wander removal[1].

3. **Recent Deep Learning & Computer Vision Methods**
   - **End-to-End Learning:** Emerging approaches use direct mapping from image to time-series representation, often leveraging convolutional nets and transformers when paired image-signal datasets exist[3].
   - **Synthetic Data Augmentation:** Generate synthetic ECG images corresponding to known time-series to expand training data for robust digitization, improving generalizability[6][7].
   - **Domain Adaptation:** Deal with cross-institutional and format variability by adapting models trained on synthetic or standardized data to real-world, variable ECG prints[2][3][4].
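
The path-finding step listed under "Signal Extraction and Digitization" can be illustrated with a small Viterbi-style dynamic program. This is a minimal sketch, not the method of any cited paper: the function name `trace_waveform`, the linear `jump_penalty` transition cost, and the assumption that the segmentation model outputs per-pixel probabilities are all illustrative choices.

```python
import numpy as np

def trace_waveform(prob_mask: np.ndarray, jump_penalty: float = 0.5) -> np.ndarray:
    """Return one row index per column along the lowest-cost path.

    prob_mask: 2D array (rows x cols) of trace probabilities in [0, 1].
    jump_penalty: assumed cost per pixel of vertical movement between columns.
    """
    n_rows, n_cols = prob_mask.shape
    # Emission cost is low where the trace is likely.
    emission = -np.log(np.clip(prob_mask, 1e-6, 1.0))
    rows = np.arange(n_rows)
    # transition[i, j]: cost of moving from row j (previous column) to row i.
    transition = jump_penalty * np.abs(rows[:, None] - rows[None, :])

    cost = emission[:, 0].copy()
    backptr = np.zeros((n_rows, n_cols), dtype=np.int64)
    for c in range(1, n_cols):
        total = cost[None, :] + transition            # shape (n_rows, n_rows)
        backptr[:, c] = np.argmin(total, axis=1)      # best predecessor per row
        cost = total[rows, backptr[:, c]] + emission[:, c]

    # Backtrack from the cheapest end point.
    path = np.empty(n_cols, dtype=np.int64)
    path[-1] = int(np.argmin(cost))
    for c in range(n_cols - 1, 0, -1):
        path[c - 1] = backptr[path[c], c]
    return path
```

Because every column receives exactly one row, the path bridges short gaps in the mask; the transition penalty controls how strongly large vertical jumps are discouraged.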

---

**Specific Methods for ECG Image → Time Series Conversion**

- **Pipeline Example (from Deep Learning-Based Digitization of Overlapping ECG Signals):**
  - Input color image → U-Net segmentation → binary mask of ECG signal.
  - *Grid detection* using color channels, estimate pixel/grid correspondence.
  - *Adaptive thresholding* and *Viterbi path finding* for waveform extraction.
  - *Signal resampling* to standard frequency, *cross-correlation* lag correction, and *median value subtraction* for baseline correction.
  - Output: digitized time-series matching gold standard with high concordance[1][3][4].

- **Alternative/Complementary Approaches:**
  - Paired dataset construction (e.g., PTB-Image, ECG-Image-Kit), where exact digital time-series are paired with images that are printed and scanned, or synthetically rendered and contaminated with noise, for large-scale benchmarking and training[2][3][5][7].
  - CV techniques include variance-based lead localization, active contour tracing, and foreground/background separation for robust waveform isolation[4].
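
The post-processing steps named in the pipeline above (resampling, cross-correlation lag correction, median subtraction) can be sketched in a few NumPy functions. The function names are hypothetical, and linear interpolation is an assumed, simple choice of resampler; the cited work may use a different one.

```python
import numpy as np

def resample_linear(signal: np.ndarray, n_out: int) -> np.ndarray:
    """Resample a 1D signal to n_out samples by linear interpolation."""
    x_old = np.linspace(0.0, 1.0, num=len(signal))
    x_new = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(x_new, x_old, signal)

def estimate_lag(extracted: np.ndarray, reference: np.ndarray) -> int:
    """Number of samples by which `extracted` is delayed relative to `reference`."""
    a = extracted - extracted.mean()
    b = reference - reference.mean()
    xcorr = np.correlate(a, b, mode="full")
    # Index 0 of the full cross-correlation corresponds to lag -(len(b) - 1).
    return int(np.argmax(xcorr)) - (len(b) - 1)

def remove_baseline(signal: np.ndarray) -> np.ndarray:
    """Subtract the median as a crude constant-baseline correction."""
    return signal - np.median(signal)
```

A positive lag can be undone with `np.roll(extracted, -lag)` before the two signals are compared. Median subtraction only removes a constant offset, so slowly varying baseline wander would need a windowed variant.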

---

**Summary Table: Core Steps and Approaches**

| Step                | Classical Method                | SOTA Deep Learning        | Prevalent Tools/Datasets      |
|---------------------|---------------------------------|---------------------------|-------------------------------|
| Preprocessing       | Thresholding, grid isolation    | U-Net segmentation        | PTB-Image, ECG-Image-Kit      |
| Lead/Row Detection  | Template matching, heuristics   | Automatic row localization| VinDigitizer, ECGtizer        |
| Waveform Extraction | Morphology tracing, contours    | Path-finding, direct mapping| Viterbi, Deep CNN/Transformer |
| Signal Conversion   | Pixel-to-amplitude/time mapping | End-to-end regression     | PhysioNet ECG competitions    |
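
The "Pixel-to-amplitude/time mapping" entry in the table can be made concrete with a short sketch. It assumes the common paper calibration of 25 mm/s and 10 mm/mV, a grid scale `px_per_mm` obtained from grid detection, and a known `baseline_row` for the lead; all names and defaults here are illustrative and should be replaced by the calibration actually printed on the ECG.

```python
import numpy as np

def pixels_to_signal(row_px, px_per_mm, baseline_row,
                     paper_speed_mm_s=25.0, gain_mm_mv=10.0, target_fs=500.0):
    """Map one traced row index per image column to (time_s, amplitude_mV)."""
    row_px = np.asarray(row_px, dtype=float)
    # Image rows grow downward, so an upward deflection has a smaller row index.
    amplitude_mv = (baseline_row - row_px) / (px_per_mm * gain_mm_mv)
    # Each column spans 1 / (px_per_mm * paper_speed) seconds.
    t_px = np.arange(len(row_px)) / (px_per_mm * paper_speed_mm_s)
    # Interpolate onto a uniform grid at the target sampling rate.
    n_out = int(round(t_px[-1] * target_fs)) + 1
    t_out = np.arange(n_out) / target_fs
    return t_out, np.interp(t_out, t_px, amplitude_mv)
```

For example, at 10 px/mm a trace that sits 100 px above the baseline maps to 1 mV, and 250 columns correspond to one second of signal.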

---

**Applicable GitHub & Data Resources**

- **ECG-Image-Kit:** Toolkit & synthetic data for image generation/digitization[5][7].
- **PTB-Image:** Paired real scanned ECGs and digital signals[3].
- **VinDigitizer:** Pipeline code and dataset[3].
- **ECGtizer:** Fully automated open-source digitization framework[4].

---

These resources collectively represent the most recent and advanced approaches for digitizing ECG signals from scanned images, leveraging deep learning, robust preprocessing, and synthetic data. For direct, practical implementations, the repositories and datasets listed above can be immediately applied for research, benchmarking, and Kaggle model development.

## 💡 Research Gaps & Opportunities

**ECG image digitization from scanned paper printouts via computer vision and deep learning faces fundamental limitations but offers substantial opportunities for advancement, especially with improved data, automation, and integration of novel techniques.**

---

## 1. Current Limitations in Existing Approaches

- **Printing and Scanning Artifacts:** Color transformation, grid distortion, and noise introduced during printing/scanning processes impair reconstruction accuracy and signal fidelity[3][4]. Existing methods can struggle to handle natural deterioration or artifacts from real clinical environments[2][3].

- **Grid and Signal Separation:** Removing background grids (usually red) from black signal traces is error-prone, particularly when grid lines are faint or overlap signals. Traditional thresholding (Otsu, Sauvola) is limited in noisy or low-contrast images[4][1].

- **Lead Detection and Extraction:** Manual intervention is often needed for lead region segmentation, and automated approaches may misidentify leads if signal traces are faint or heavily overlapped[1][4].

- **Lack of Standardized Datasets:** Most published studies use private datasets, lacking comprehensive, paired ECG image-to-digital datasets with varying noise or distortion levels. The scarcity of benchmarks impedes reproducibility and generalization[2][3][4].

- **Limited Integration of Domain Knowledge:** Most digitization pathways operate without feedback from clinical diagnosis algorithms. The digitization stage is not optimized for features essential to diagnostic algorithms[2].
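
To make the thresholding limitation in "Grid and Signal Separation" concrete, here is a minimal NumPy sketch of Otsu's method for an 8-bit grayscale image. The function name is hypothetical. Because it returns a single global threshold, it has no way to adapt to uneven illumination or faint grid lines, which is the weakness described above.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold of an 8-bit grayscale image (dtype uint8)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of the dark class
    mu = np.cumsum(prob * np.arange(256))    # cumulative intensity mean
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Between-class variance for every candidate threshold.
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0
    return int(np.argmax(sigma_b))
```

Pixels with `gray <= otsu_threshold(gray)` would then be treated as candidate trace pixels; a locally adaptive method such as Sauvola computes a separate threshold per neighborhood instead.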

---

## 2. Unexplored Research Directions

- **End-to-End Joint Learning:** Current methods separate image-to-signal extraction from diagnostic modeling. Integrating diagnostic tasks (e.g., arrhythmia detection) with the digitizer via multitask networks could optimize signal extraction for downstream medical interpretation[2][10].

- **Domain Adaptation and Transfer Learning:** With datasets now available with diverse artifacts, models could use domain adaptation to generalize across varying paper formats, scan qualities, and demographic data[3][2].

- **Self-supervised and Unsupervised Training:** Leveraging large pools of unlabeled ECG images (with or without digital signals) via self-supervised techniques may improve robustness and adaptability to real-world clinical images.

- **Temporal Context Estimation:** Most approaches use static grid scales. Models that estimate temporal scaling dynamically, learning from content or context, may more accurately reconstruct signals from variable print formats[1].

- **Artifact Simulation and Augmentation:** Synthetic augmentation of scanned ECG images with controlled artifact introduction can train models to handle challenging real-world image distortions[6][7][5].
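
The artifact simulation idea in the last bullet can be sketched as a small degradation function applied to clean synthetic images. This is an illustrative sketch only: the three artifact types (shading ramp, Gaussian noise, dark speckles) and all parameter defaults are assumptions, not the augmentation set of any cited toolkit.

```python
import numpy as np

def degrade_scan(img, rng, noise_std=0.03, shade_strength=0.2, speckle_frac=0.002):
    """Apply simple scan-like artifacts to a grayscale image with values in [0, 1]."""
    h, w = img.shape
    out = img.astype(np.float64).copy()
    # Uneven illumination: brightness falls off linearly from left to right.
    ramp = 1.0 - shade_strength * np.linspace(0.0, 1.0, w)[None, :]
    out *= ramp
    # Sensor noise.
    out += rng.normal(0.0, noise_std, size=out.shape)
    # Sparse dark speckles, e.g. dust or toner spots.
    speckle = rng.random(out.shape) < speckle_frac
    out[speckle] = 0.0
    return np.clip(out, 0.0, 1.0)
```

Passing a seeded generator such as `np.random.default_rng(0)` keeps the augmentation reproducible across runs.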

---

## 3. Opportunities for Improvement in ECG Digitization

- **Automated Lead Detection:** Fully automated lead segmentation systems (e.g., those using U-Net or attention mechanisms) can reduce manual effort and standardize signal extraction across formats[1][4].

- **Grid Removal Enhancement:** Advanced deep learning approaches for image-to-image translation (e.g., generative adversarial networks, conditional diffusion models) can learn to delete grids and reconstruct clear signals even in low-quality scans[1][4].

- **Benchmark Development:** Public release of paired image-digital ECG datasets with variable conditions (e.g., PTB-Image[3], synthetic datasets[6][7]) enables standardized evaluation and progress tracking.

- **Open-Source Frameworks:** Toolkits (such as ecg-image-kit[5][7], ECGminer[4], and PaperECG[4]) provide modular, extensible environments for algorithm development, benchmarking, and reproducibility.

- **Multi-modal Training:** Combining image features with context (metadata, annotations) could enhance both extraction accuracy and downstream diagnostic capability.

---

## 4. Novel Techniques That Could Be Applied

- **Vision Transformers (ViTs):** These could provide superior global context extraction for grid, background, and signal segmentation, especially in complex and artifact-prone images.

- **Diffusion and Generative Models:** Image restoration, grid removal, and signal enhancement via diffusion-based architectures could outperform traditional thresholding and segmentation-based methods, especially when combined with synthetic artifact generation[6][7].

- **Graph-based Path Extraction:** Viterbi or dynamic programming approaches for path finding in binary segmentation masks can enhance the tracing of signal lines, even when traces are fragmented or overlapped[1].

- **Active Learning Pipelines:** Iterative human-in-the-loop annotation and model retraining can help acquire training data in underrepresented formats or address edge cases where automated approaches fail.

- **Masked Image Modeling:** Further research into masked signal training (masking portions of the signal and forcing the model to reconstruct) may yield improved robustness to partial occlusion or degradation[10].

---

**Key Opportunities for Researchers:**

- Develop and share paired, artifact-rich image-to-signal datasets for benchmarking.
- Innovate deep learning architectures that integrate domain knowledge and optimize for clinical utility.
- Apply diffusion, transformer, and multimodal models for robust digitization across varied scan qualities.
- Establish public frameworks and active learning workflows to facilitate reproducible research and accelerate progress.

## 📊 Dataset Information

For **ECG image digitization and signal extraction** from scanned paper ECG printouts using computer vision and deep learning, the relevant Kaggle datasets and their identifiers are listed below:

---

## 1. PhysioNet ECG Image Digitization ([physionet-ecg-image-digitization](https://www.kaggle.com/competitions/physionet-ecg-image-digitization))

- **Kaggle Dataset ID:** physionet-ecg-image-digitization (Competition: [physionet-ecg-image-digitization])
- **Characteristics:**
  - Contains *scanned ECG paper images* paired with their digital time-series signals.
  - Formats include raw image files (e.g., JPEG, PNG) and CSV/mat files containing time-series ECG data.
  - Covers a wide range of imaging artifacts and ECG print styles, intended to reflect real clinical variability[3][8].
  - Quality: Sourced from PhysioNet archives and synthetic renderings using a toolkit (see ECG-image-kit below). Data reflects both high and low quality, with intentional artifacts to improve model robustness[2].
  - Size: Competition datasets typically contain thousands of image-signal pairs suitable for training deep learning models.
- **Data Access:** Downloadable for registered Kaggle users after agreeing to the competition's terms. Direct access via Kaggle API or web interface.

---

## 2. PTB-Image ([ptb-xl-ecg-image-data](https://www.kaggle.com/datasets/ptb-xl-ecg-image-data))

- **Kaggle Dataset ID:** ptb-xl-ecg-image-data
- **Characteristics:**
  - *Paper ECG images* scanned from the PTB-XL dataset, each paired with ground truth digital signal.
  - Provides 12-lead ECG images and corresponding time-series in standard formats (images in PNG/JPG; signals in CSV)[3].
  - Includes various print and scan artifacts consistent with clinical workflow.
  - Quality: High-quality images intended for robust benchmarking; some versions contain deliberate artifacts for noise resilience[2][3].
  - Size: Hundreds to thousands of samples, sufficient for model development and benchmarking.
- **Data Access:** Downloadable via Kaggle datasets page for registered users.

---

## 3. Synthetic ECG Image Generation Toolkits ([ecg-image-kit](https://www.kaggle.com/datasets/alphanumericslab/ecg-image-kit))

- **Kaggle Dataset ID:** alphanumericslab/ecg-image-kit
- **Characteristics:**
  - Toolkit for generating synthetic ECG images and matching signal data using configurable parameters[5][6][7].
  - Useful for augmenting training data and simulating rare clinical scenarios.
  - Format: Code (Python), example synthetic datasets (images, .npy/.csv time-series).
  - Quality: Highly controllable; synthetic but can generate clinically realistic signals/images and artifact variations.
  - Size: Generation on demand; unlimited sample creation supported.
- **Data Access:** Free download from dataset or associated GitHub repo.

---

## 4. PTB-XL Time-Series Data (paired for image synthesis, not direct image-signal data)

- **Kaggle Dataset ID:** raghuveerdatascience/ptb-xl-ecg-dataset
- **Characteristics:**
  - Large-scale ECG signal dataset (not images) from PTB-XL; used as a ground truth for image synthesis and benchmarking[2].
  - 21,837 records, high demographic diversity.
  - Format: .csv files, including labels and metadata.
  - Quality: Research-grade, well-documented.
- **Data Access:** Open for research and educational use.

---

## Access Notes

- **Availability:** All datasets listed are accessible to authenticated Kaggle users. Some require agreement to data use policies.
- **Format types:** Most combine raster images (PNG/JPEG) with digital signal files (CSV/MAT/NPY).
- **Pairing:** PTB-Image, physionet-ecg-image-digitization, and their synthetic derivatives offer image-time-series pairing, crucial for training supervised deep learning models for image-to-signal conversion[2][3][8].
- **Benchmarking:** PTB-Image and physionet-ecg-image-digitization are widely cited as benchmark datasets for paper ECG digitization[2][3][8].

---

### Summary Table

| Dataset ID                              | Modality         | Pairing | Format              | Size     | Quality & Artifacts               | Access        |
|------------------------------------------|------------------|---------|---------------------|----------|-----------------------------------|---------------|
| physionet-ecg-image-digitization        | Image+Signal     | Yes     | PNG/JPG, CSV/MAT    | 1K+      | Real+synthetic, artifact-rich     | Kaggle Comp   |
| ptb-xl-ecg-image-data                   | Image+Signal     | Yes     | PNG/JPG, CSV        | 500+     | Real scans, clinical artifacts    | Kaggle Datasets|
| alphanumericslab/ecg-image-kit          | Image+Signal     | Yes     | PNG/JPG, NPY/CSV    | Synthetic| Configurable, realistic artifacts | Kaggle/GitHub |
| raghuveerdatascience/ptb-xl-ecg-dataset | Signal Only      | No      | CSV                 | 21K+     | High-quality ground truth         | Kaggle Datasets|

---

**For computer vision deep learning tasks focused on ECG digitization from images:**  
- **physionet-ecg-image-digitization** and **ptb-xl-ecg-image-data** are the most authoritative, paired, real-world datasets currently available on Kaggle.
- **ecg-image-kit** and synthetic toolkits are valuable for augmenting scarce real data, improving robustness against scanning and printing artifacts.

These datasets have enabled state-of-the-art research in automated ECG image segmentation, waveform extraction, and signal reconstruction, combining deep segmentation models such as U-Net with Viterbi path finding and adaptive thresholding[1][6].  
All can be accessed via Kaggle using their provided identifiers.

## ⚙️ Implementation Strategy

**A robust ECG image digitization and signal extraction pipeline leveraging computer vision and deep learning should be modular, data-driven, and reproducible.** Below is a detailed implementation strategy covering code architecture, preprocessing, deep neural model choices, training, and evaluation.

---

## 1. Concrete Code Approach and Architecture

### a. Modular Pipeline Stages

1. **Grid and Lead Detection**
2. **Grid Removal and Background Cleaning**
3. **ECG Waveform Segmentation**
4. **Pixel-to-Sequence Signal Extraction**
5. **Postprocessing and Calibration**

Each module should be implemented as a class/function and integrated into a main extraction pipeline.

```python
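import cv2

# GridDetector, LeadSegmenter, WaveformSegmenter, SignalExtractor, and
# PostProcessor are placeholder component classes, one per pipeline stage.
class ECGDigitizationPipeline:
    def __init__(self, config):
        self.grid_detector = GridDetector(config)
        self.lead_segmenter = LeadSegmenter(config)
        self.waveform_segmenter = WaveformSegmenter(config)
        self.signal_extractor = SignalExtractor(config)
        self.postprocessor = PostProcessor(config)

    @staticmethod
    def read_image(img_path):
        # cv2.imread returns None instead of raising when a file cannot be read.
        img = cv2.imread(str(img_path))
        if img is None:
            raise FileNotFoundError(f"Could not read image: {img_path}")
        return img

    def process(self, img_path):
        img = self.read_image(img_path)
        grid_info = self.grid_detector.detect(img)              # grid spacing and scale
        leads = self.lead_segmenter.segment(img, grid_info)     # one crop per lead
        masks = [self.waveform_segmenter.segment(lead) for lead in leads]
        signals = [self.signal_extractor.extract(mask, grid_info) for mask in masks]
        processed_signals = [self.postprocessor.calibrate(sig, grid_info) for sig in signals]
        return processed_signals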
```

- Incorporate logging, visualization, and unit testing at each step.

---

## 2. Preprocessing Pipeline for ECG Images

**Goal:** Remove noise/artifacts, segment grid, and localize leads for extraction.

- **Step 1: Image normalization**
    - Convert to grayscale (unless colored grid is critical for grid detection).
    - Resize for uniformity if the dataset is not homogeneous.

- **Step 2: Grid and Lead detection**
    - Use classical image processing (e.g., Hough line detection for grids[1][4]).
    - Segment each lead based on vertical and horizontal grid structures.

```python
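import cv2
import numpy as np

def detect_grid_lines(img):
    # img: 8-bit single-channel (grayscale) image, as required by cv2.Canny
    edges = cv2.Canny(img, 50, 150, apertureSize=3)
    # Probabilistic Hough transform; returns None when no line segments are found
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, threshold=100, minLineLength=100, maxLineGap=10)
    return lines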
```

- **Step 3: Background and Grid Removal**
    - Thresholding (global Otsu or locally adaptive Sauvola) for varying backgrounds[4].
    - Remove red/green grid lines using color masking if color information is available.

- **Step 4: Noise & Artifact Removal**
    - Morphological opening/closing to remove small noise.
    - Inpainting or filtering to handle wrinkles, stains, or scanning artifacts.
    - Optionally, use a deep denoising autoencoder for very degraded images.
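
Steps 3 and 4 can be illustrated with a NumPy-only sketch. It assumes an RGB image with a reddish grid and a dark trace: in the red channel the grid is nearly as bright as the paper, so a single threshold there isolates the trace. The function names and the `dark_thresh` default are hypothetical, and the neighbor-count filter is a crude stand-in for a proper morphological opening.

```python
import numpy as np

def trace_mask_from_rgb(rgb: np.ndarray, dark_thresh: int = 100) -> np.ndarray:
    """Boolean mask of dark trace pixels, using only the red channel."""
    red = rgb[..., 0].astype(np.int32)
    return red < dark_thresh

def drop_isolated_pixels(mask: np.ndarray, min_neighbors: int = 1) -> np.ndarray:
    """Remove foreground pixels with fewer than min_neighbors 8-connected neighbors."""
    padded = np.pad(mask.astype(np.int32), 1)
    h, w = mask.shape
    # Sum the eight shifted copies of the mask to count neighbors per pixel.
    neighbors = sum(
        padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return mask & (neighbors >= min_neighbors)
```

This only works when color information survives scanning; for grayscale photocopies the grid and trace must be separated by intensity or by a learned segmentation model instead.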

**References:** [1][2][3][4]

---

## 3. Model Architecture Recommendations

### a. Waveform Segmentation

- **U-Net** or **ResUNet**: For pixel-wise segmentation of the ECG waveform from the cleaned image. Trained to mask out the signal trace, ignoring the background and noise[1].
    - Input: Preprocessed segment (single-lead section)
    - Output: Binary mask (signal/background)

- **Optional**: Post-segmentation, use Viterbi path extraction or active contour model for precise signal tracing[1][4].

### b. End-to-End Extraction

- Consider a **hybrid model**: use vision transformer (ViT) or CNN backbone for waveform localization and U-Net for segmentation.
- For direct sequence regression, explore **CNN-LSTM**, where CNN extracts features and LSTM decodes to 1D amplitude values (less common but potentially feasible with paired data).

**References:** [1][3][4]

---

## 4. Training Strategy and Hyperparameters

### a. Datasets

- **Paired datasets** are critical for supervised training. Use PTB-Image or synthetic data generators (e.g., ECG-Image-Kit) to provide image–signal pairs[2][3][5][7].
- Augment with artificially degraded samples (downsampling, warping, noise) for robustness[2].

### b. Loss Functions

- **Segmentation (U-Net):**
  - Dice loss + Binary cross-entropy for mask quality.
- **Signal extraction/sequence regression:**
  - Mean Squared Error (MSE) or Signal-to-Noise Ratio (SNR) loss between extracted and reference waveforms.
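
The combined segmentation loss can be written out explicitly. The following is a framework-agnostic NumPy sketch for illustration; the equal weighting (`dice_weight=0.5`) and the smoothing constant `eps` are assumptions, and an actual training loop would use a differentiable implementation in the chosen deep learning framework.

```python
import numpy as np

def dice_bce_loss(probs, target, dice_weight=0.5, eps=1e-7):
    """Weighted sum of soft Dice loss and binary cross-entropy.

    probs: predicted foreground probabilities in [0, 1].
    target: binary ground-truth mask of the same shape.
    """
    probs = np.clip(probs.astype(np.float64), eps, 1.0 - eps)
    target = target.astype(np.float64)
    # Binary cross-entropy averaged over all pixels.
    bce = -np.mean(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs))
    # Soft Dice coefficient: overlap relative to total foreground mass.
    intersection = np.sum(probs * target)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(target) + eps)
    return dice_weight * (1.0 - dice) + (1.0 - dice_weight) * bce
```

The Dice term helps with the strong class imbalance of ECG masks, where the trace covers only a small fraction of the pixels, while the cross-entropy term keeps per-pixel gradients well behaved.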

### c. Core Hyperparameters

- **Image size:** 512×512 or adapted to single-lead crop.
- **Batch size:** 8-32 (depending on GPU memory).
- **Optimizer:** Adam or AdamW, with initial lr=1e-4.
- **Scheduler:** Reduce on plateau or cosine annealing.
- **Epochs:** 50-200 (early stopping based on validation SNR/correlation).
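
For the cosine annealing option, the schedule itself is a one-line formula. The sketch below starts from the `lr=1e-4` listed above; the floor `lr_min=1e-6` and the function name are assumptions added for illustration.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=1e-6):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate changes slowly at the start and end of training and fastest in the middle, which is the main practical difference from a step or plateau-based schedule.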

### d. Training Strategy

- Train on lead-segmented regions for both segmentation and regression.
- Data augmentation: affine transforms, elastic deformation, grid/noise overlays.
- Validation on held-out real-world ECG scans with artifacts.

**References:** [1][3][2][5]

---

## 5. Evaluation Metrics

| Metric                     | Purpose                                   |
|----------------------------|-------------------------------------------|
| **SNR (Signal-to-Noise)**  | Quantify waveform fidelity[3].            |
| **Pearson correlation**    | Direct waveform-to-waveform similarity[3].|
| **DTW (Dynamic Time Warping)** | Compare temporal alignment.          |
| **Mean Absolute Error (MAE)** | Amplitude error per time point.        |
| **Visual Turing test (Expert review)** | Qualitative, clinical relevance.|
| **F1/Dice (for segmentation)** | Mask quality during training.          |

**Typical formula for SNR:**
\[
\text{SNR (dB)} = 10 \log_{10}\left(\frac{\text{Var}(\text{reference})}{\text{Var}(\text{reference} - \text{extracted})}\right)
\]
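
The SNR formula above, together with the Pearson correlation and MAE rows of the table, can be computed as follows. This is a minimal sketch with hypothetical function names; it assumes both signals are already aligned and sampled at the same rate.

```python
import numpy as np

def snr_db(reference: np.ndarray, extracted: np.ndarray) -> float:
    """SNR in dB: 10 * log10(Var(reference) / Var(reference - extracted))."""
    noise = reference - extracted
    return float(10.0 * np.log10(np.var(reference) / np.var(noise)))

def pearson_r(reference: np.ndarray, extracted: np.ndarray) -> float:
    """Pearson correlation coefficient between the two waveforms."""
    return float(np.corrcoef(reference, extracted)[0, 1])

def mae(reference: np.ndarray, extracted: np.ndarray) -> float:
    """Mean absolute amplitude error per sample."""
    return float(np.mean(np.abs(reference - extracted)))
```

Because the formula uses variances, a constant offset between the two signals does not lower the SNR, whereas it does raise the MAE. A perfect reconstruction gives zero noise variance, so callers should handle that case separately.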

---

## Further Resources, Tools, and Open-Source Utilities

- **ECG-Image-Kit** ([5][7]) and PTB-Image datasets[3]: source paired data and synthetic ECG generation utilities.
- **Baseline implementations:** ECGminer, PaperECG for classic image-processing pipelines with API access[4][9].
- **Open-source pipeline example:** [ecg-image-kit][5], including data generation, cleaning, and basic extraction tools.

---

**Summary:** Build a modular pipeline: preprocess (grid/lead detection, cleaning), segment waveform (U-Net/ResUNet), extract the 1D signal (Viterbi/contour or neural decoding), calibrate, and postprocess. Use paired datasets and rigorous data augmentation. Monitor SNR, waveform correlation, and clinical validity during evaluation. Use modular, well-tested code to ensure each pipeline component is robust and improvable[1][2][3][4][5][7].

## 1. Setup & Imports

Install and import required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 2. Load Dataset

Loading dataset: **physionet-ecg-images**

Competition: `recodai-luc-scientific-image-forgery-detection`

In [None]:
# Competition Data Loading
from pathlib import Path
import pandas as pd
import os

# Define data path
DATA_PATH = Path('/kaggle/input/recodai-luc-scientific-image-forgery-detection')
print(f"📁 Data path: {DATA_PATH}")
print(f"📁 Path exists: {DATA_PATH.exists()}")

# List all files in data directory
if DATA_PATH.exists():
    all_files = list(DATA_PATH.rglob('*'))
    print(f"\n📊 Found {len(all_files)} total files/folders")
    
    # Show top-level structure
    top_level = [f.name for f in DATA_PATH.iterdir()]
    print(f"📂 Top-level contents: {top_level}")
    
    # Try to load common files
    try:
        if (DATA_PATH / 'train.csv').exists():
            train_df = pd.read_csv(DATA_PATH / 'train.csv')
            print(f"\n✅ Loaded train.csv: {train_df.shape}")
            print(f"Columns: {train_df.columns.tolist()}")
        else:
            print("⚠ train.csv not found")
    except Exception as e:
        print(f"✗ Error loading train.csv: {e}")
    
    try:
        if (DATA_PATH / 'test.csv').exists():
            test_df = pd.read_csv(DATA_PATH / 'test.csv')
            print(f"\n✅ Loaded test.csv: {test_df.shape}")
            print(f"Columns: {test_df.columns.tolist()}")
        else:
            print("⚠ test.csv not found")
    except Exception as e:
        print(f"✗ Error loading test.csv: {e}")
else:
    print(f"❌ Data path does not exist: {DATA_PATH}")
    print("\n💡 Make sure competition is added to notebook metadata!")


## 3. Exploratory Data Analysis

**Analyzing the competition data structure**

In [None]:
# Exploratory Data Analysis
try:
    print('🔧 === EXPLORATORY DATA ANALYSIS ===\n')
    
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    import pandas as pd
    import os
    from pathlib import Path

    # Check train_df and test_df existence
    if 'train_df' not in locals():
        raise ValueError("train_df is not loaded.")
    if 'test_df' not in locals():
        raise ValueError("test_df is not loaded.")
    
    # 1. Basic Info
    print("📊 Train DataFrame shape:", train_df.shape)
    print("📊 Test DataFrame shape:", test_df.shape)
    print("\n📝 Train columns:", train_df.columns.tolist())
    print("📝 Test columns:", test_df.columns.tolist())
    
    print("\n🔍 Train DataFrame info:")
    train_df.info()
    print("\n🔍 Test DataFrame info:")
    test_df.info()
    
    print("\n📈 Train DataFrame describe:")
    display(train_df.describe(include='all').T)
    print("\n📈 Test DataFrame describe:")
    display(test_df.describe(include='all').T)
    
    # 2. Check for missing values
    print("\n❓ Missing values in train_df:")
    print(train_df.isnull().sum())
    print("\n❓ Missing values in test_df:")
    print(test_df.isnull().sum())
    
    # 3. Distribution of target variable (if present)
    target_col = None
    for col in ['label', 'target', 'class', 'is_forgery']:
        if col in train_df.columns:
            target_col = col
            break

    if target_col:
        print(f"\n🎯 Target column detected: '{target_col}'")
        print(train_df[target_col].value_counts())
        plt.figure(figsize=(6,3))
        sns.countplot(x=target_col, data=train_df)
        plt.title(f"Distribution of Target: {target_col}")
        plt.show()
    else:
        print("\n⚠ No obvious target column found in train_df.")

    # 4. Check for image columns and sample images
    image_col = None
    for col in ['image', 'img_path', 'file_name', 'filename', 'image_path']:
        if col in train_df.columns:
            image_col = col
            break

    if image_col:
        print(f"\n🖼️ Image column detected: '{image_col}'")
        # Show a few sample images from the train set
        from PIL import Image
        sample_train = train_df[image_col].sample(min(5, len(train_df)), random_state=42)
        print("\nShowing sample images from train set:")
        # squeeze=False keeps `axes` a 2D array even when only one image is sampled
        fig, axes = plt.subplots(1, len(sample_train), figsize=(15,3), squeeze=False)
        for ax, img_name in zip(axes[0], sample_train):
            img_path = DATA_PATH / img_name
            if img_path.exists():
                img = Image.open(img_path)
                ax.imshow(img)
                ax.set_title(os.path.basename(img_name))
                ax.axis('off')
            else:
                ax.set_title(f"Not found:\n{img_name}")
                ax.axis('off')
        plt.tight_layout()
        plt.show()
    else:
        print("\n⚠ No image path column found in train_df.")

    # 5. Image file stats (dimensions, formats)
    if image_col:
        print("\n📏 Gathering image file statistics (train set)...")
        img_shapes = []
        img_modes = []
        img_formats = []
        sample_paths = train_df[image_col].sample(min(100, len(train_df)), random_state=42)
        for img_name in sample_paths:
            img_path = DATA_PATH / img_name
            if img_path.exists():
                try:
                    with Image.open(img_path) as img:
                        img_shapes.append(img.size)
                        img_modes.append(img.mode)
                        img_formats.append(img.format)
                except Exception as e:
                    # Skip unreadable files so the statistics cover valid images only
                    print(f"⚠ Could not read {img_path}: {e}")
        if img_shapes:
            widths, heights = zip(*img_shapes)
            plt.figure(figsize=(6,3))
            sns.histplot(widths, bins=20, kde=True, color='skyblue', label='Width')
            sns.histplot(heights, bins=20, kde=True, color='salmon', label='Height')
            plt.legend()
            plt.title("Image Width/Height Distribution (sample)")
            plt.show()
            print("Image modes (sample):", pd.Series(img_modes).value_counts())
            print("Image formats (sample):", pd.Series(img_formats).value_counts())
        else:
            print("⚠ No valid images found for stats.")
    else:
        print("\n⚠ Skipping image file stats (no image column).")
    
    # 6. Correlation matrix for numeric columns
    num_cols = train_df.select_dtypes(include=[np.number]).columns
    if len(num_cols) > 1:
        print("\n🔗 Correlation matrix (train set):")
        corr = train_df[num_cols].corr()
        plt.figure(figsize=(8,6))
        sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm')
        plt.title("Numeric Feature Correlation (train)")
        plt.show()
    else:
        print("\n⚠ Not enough numeric columns for correlation matrix.")

    print('\n✅ Exploratory Data Analysis complete!')
    
except Exception as e:
    print(f'✗ Error in Exploratory Data Analysis: {e}')
    import traceback
    traceback.print_exc()

## 4. Data Preprocessing

**Competition:** recodai-luc-scientific-image-forgery-detection

**Note:** Following research-based implementation strategy

In [None]:
# Data Preprocessing
try:
    print('🔧 === DATA PREPROCESSING ===\n')
    
    # --- Configuration ---
    IMG_SIZE = (256, 256)  # Target (width, height) for resized images
    IMAGE_COL = 'image_id' if 'image_id' in train_df.columns else train_df.columns[0]
    
    # --- Helper Functions ---
    from PIL import Image
    import numpy as np

    def load_and_resize_image(img_path, size=IMG_SIZE):
        try:
            with Image.open(img_path) as img:
                img = img.convert('RGB')
                img = img.resize(size)
                return np.array(img)
        except Exception as e:
            print(f'✗ Error loading image {img_path}: {e}')
            return None

    def zero_one_range(img_arr):
        return img_arr.astype(np.float32) / 255.0  # Normalize to [0,1][2]

    def to_grayscale(img_arr):
        return np.mean(img_arr, axis=2).astype(np.float32) if img_arr.ndim == 3 else img_arr

    def normalize(img_arr):
        mean = np.mean(img_arr)
        std = np.std(img_arr)
        return (img_arr - mean) / (std + 1e-8)

    # --- Preprocessing Pipeline ---
    def preprocess_image(img_path):
        img = load_and_resize_image(img_path)
        if img is None:
            return None
        img = zero_one_range(img)
        img_gray = to_grayscale(img)
        img_norm = normalize(img_gray)
        return img_norm

    # --- Apply Preprocessing to Train/Test Sets ---
    print('Loading and preprocessing train images...')
    train_img_paths = [DATA_PATH / fname for fname in train_df[IMAGE_COL]]
    train_imgs = []
    for img_path in train_img_paths:
        img_arr = preprocess_image(img_path)
        if img_arr is not None:
            train_imgs.append(img_arr)
    print(f'Processed {len(train_imgs)} train images.')

    print('Loading and preprocessing test images...')
    test_img_paths = [DATA_PATH / fname for fname in test_df[IMAGE_COL]]
    test_imgs = []
    for img_path in test_img_paths:
        img_arr = preprocess_image(img_path)
        if img_arr is not None:
            test_imgs.append(img_arr)
    print(f'Processed {len(test_imgs)} test images.')

    # --- Visualize Sample Preprocessed Images ---
    import matplotlib.pyplot as plt
    if train_imgs:
        plt.figure(figsize=(12, 4))
        for i in range(min(3, len(train_imgs))):  # avoid IndexError with fewer than 3 images
            plt.subplot(1, 3, i+1)
            plt.imshow(train_imgs[i], cmap='gray')
            plt.title(f'Train Sample {i+1}')
            plt.axis('off')
        plt.suptitle('Sample Preprocessed Train Images')
        plt.show()
    else:
        print('⚠ No train images to display.')

    if test_imgs:
        plt.figure(figsize=(12, 4))
        for i in range(min(3, len(test_imgs))):  # avoid IndexError with fewer than 3 images
            plt.subplot(1, 3, i+1)
            plt.imshow(test_imgs[i], cmap='gray')
            plt.title(f'Test Sample {i+1}')
            plt.axis('off')
        plt.suptitle('Sample Preprocessed Test Images')
        plt.show()
    else:
        print('⚠ No test images to display.')

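    # --- Store Preprocessed Data for Downstream Tasks ---
    # Images that failed to load were skipped above, so the lists can be shorter
    # than the DataFrames; only assign when every row has a matching image.
    if len(train_imgs) == len(train_df):
        train_df['preprocessed_img'] = list(train_imgs)
    else:
        print(f'⚠ {len(train_df) - len(train_imgs)} train images failed to load; column not stored.')
    if len(test_imgs) == len(test_df):
        test_df['preprocessed_img'] = list(test_imgs)
    else:
        print(f'⚠ {len(test_df) - len(test_imgs)} test images failed to load; column not stored.')

    print('✅ Data Preprocessing complete!')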
    
except Exception as e:
    print(f'✗ Error in Data Preprocessing: {e}')
    import traceback
    traceback.print_exc()
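
The preprocessing chain above (scale to [0, 1], average the channels to grayscale, z-score normalize) can be sanity-checked without any files on disk. The sketch below re-implements those three steps on an in-memory array; `preprocess_array` and the random `fake_scan` are hypothetical names introduced here for illustration and are not part of the notebook's pipeline.

```python
import numpy as np

def preprocess_array(rgb_u8):
    """Apply the same three steps as preprocess_image to an in-memory RGB array."""
    scaled = rgb_u8.astype(np.float32) / 255.0          # scale to [0, 1]
    gray = scaled.mean(axis=2)                           # grayscale via channel mean
    return (gray - gray.mean()) / (gray.std() + 1e-8)    # z-score normalization

# Synthetic stand-in for a resized scan (assumption: 256x256 RGB, uint8)
rng = np.random.default_rng(0)
fake_scan = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

out = preprocess_array(fake_scan)
print(out.shape, float(out.mean()), float(out.std()), float(out.min()))
```

Because the output is z-scored, its values are no longer confined to [0, 1]; any later step that casts it to `uint8` (for example, before writing a PNG) needs to rescale first.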

## 5. Model Architecture

**Approach:** Neural network baseline

In [None]:
# Model Architecture
try:
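    print('🔧 === MODEL ARCHITECTURE ===\n')
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.models as models
    import matplotlib.pyplot as plt

    # Define the compute device here in case an earlier cell did not set it
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')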

    # --- Modular Pipeline Components ---

    class GridDetector(nn.Module):
        def __init__(self, config):
            super().__init__()
            # Simple CNN for grid detection (placeholder, can be replaced with advanced model)
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2)
            )
            self.fc = nn.Linear(32 * 64 * 64, 2)  # Example output: grid presence, orientation

        def forward(self, x):
            x = self.conv(x)
            x = x.view(x.size(0), -1)
            return self.fc(x)

        def detect(self, img_tensor):
            self.eval()
            with torch.no_grad():
                out = self.forward(img_tensor.unsqueeze(0).to(device))
            # Dummy grid info for demonstration
            grid_info = {'present': bool(out.argmax().item()), 'orientation': 'horizontal'}
            return grid_info

    class LeadSegmenter(nn.Module):
        def __init__(self, config):
            super().__init__()
            # Small encoder-decoder for lead segmentation (no skip connections, so only a stand-in for a U-Net)
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2)
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 2, stride=2),
                nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 2, stride=2),
                nn.Sigmoid()
            )

        def forward(self, x):
            x = self.encoder(x)
            x = self.decoder(x)
            return x

        def segment(self, img_tensor, grid_info):
            self.eval()
            with torch.no_grad():
                mask = self.forward(img_tensor.unsqueeze(0).to(device))
            # For demonstration, split into 2 leads by cropping
            h = img_tensor.shape[-2]
            lead1 = img_tensor[..., :h//2, :]
            lead2 = img_tensor[..., h//2:, :]
            return [lead1, lead2]

    class WaveformSegmenter(nn.Module):
        def __init__(self, config):
            super().__init__()
            # Simple CNN for waveform segmentation
            self.conv = nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(8, 1, 3, padding=1),
                nn.Sigmoid()
            )

        def forward(self, x):
            return self.conv(x)

        def segment(self, lead_tensor):
            self.eval()
            with torch.no_grad():
                mask = self.forward(lead_tensor.unsqueeze(0).to(device))
            return mask.squeeze(0).cpu()

    class SignalExtractor:
        def __init__(self, config):
            pass

        def extract(self, mask, grid_info):
            # Dummy signal extraction: mean pixel value per column
            signal = mask.squeeze().mean(dim=0).cpu().numpy()
            return signal

    class PostProcessor:
        def __init__(self, config):
            pass

        def calibrate(self, signal, grid_info):
            # Dummy calibration: normalize to [0, 1]
            signal = (signal - signal.min()) / (signal.max() - signal.min() + 1e-8)
            return signal

    class ECGDigitizationPipeline:
        def __init__(self, config):
            self.grid_detector = GridDetector(config).to(device)
            self.lead_segmenter = LeadSegmenter(config).to(device)
            self.waveform_segmenter = WaveformSegmenter(config).to(device)
            self.signal_extractor = SignalExtractor(config)
            self.postprocessor = PostProcessor(config)

        def read_image(self, img_path):
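            import cv2
            img = cv2.imread(str(img_path), cv2.IMREAD_GRAYSCALE)
            if img is None:  # cv2.imread returns None on failure instead of raising
                raise FileNotFoundError(f'Could not read image: {img_path}')
            img = cv2.resize(img, (256, 256))  # GridDetector's linear layer assumes 256x256 input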
            img_tensor = torch.tensor(img, dtype=torch.float32).unsqueeze(0) / 255.0
            return img_tensor

        def process(self, img_path):
            img_tensor = self.read_image(img_path)
            grid_info = self.grid_detector.detect(img_tensor)
            leads = self.lead_segmenter.segment(img_tensor, grid_info)
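            # Each lead is already (1, H, W); segment() adds the batch dimension itself,
            # so an extra unsqueeze here would give Conv2d a 5-D input and fail.
            masks = [self.waveform_segmenter.segment(lead) for lead in leads]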
            signals = [self.signal_extractor.extract(mask, grid_info) for mask in masks]
            processed_signals = [self.postprocessor.calibrate(sig, grid_info) for sig in signals]
            return processed_signals

    # --- Example usage on preprocessed images ---
    config = {'input_size': 256}
    pipeline = ECGDigitizationPipeline(config)

    # Visualize pipeline output for a few train images
    if 'train_df' in locals() and not train_df.empty and 'preprocessed_img' in train_df.columns:
        print('Running pipeline on sample train images...')
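        sample_imgs = train_df['preprocessed_img'][:3]
        # squeeze=False keeps axes 2-D even when only one sample image is available
        fig, axes = plt.subplots(len(sample_imgs), 2, figsize=(10, 3*len(sample_imgs)), squeeze=False)
        for i, img_arr in enumerate(sample_imgs):
            # Save temp image for pipeline (simulate file input)
            import cv2, tempfile
            # img_arr is z-score normalized, so rescale to [0, 255] before the uint8 cast
            img_u8 = (255 * (img_arr - img_arr.min()) / (img_arr.max() - img_arr.min() + 1e-8)).astype('uint8')
            with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
                cv2.imwrite(tmp.name, img_u8)
                signals = pipeline.process(tmp.name)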
            axes[i, 0].imshow(img_arr, cmap='gray')
            axes[i, 0].set_title(f'Preprocessed Image {i+1}')
            axes[i, 0].axis('off')
            for sig in signals:
                axes[i, 1].plot(sig)
            axes[i, 1].set_title(f'Extracted Signals {i+1}')
            axes[i, 1].set_xlabel('Time')
            axes[i, 1].set_ylabel('Normalized Amplitude')
        plt.tight_layout()
        plt.show()
    else:
        print('⚠ No preprocessed train images available for pipeline demo.')

    print('✅ Model Architecture complete!')

except Exception as e:
    print(f'✗ Error in Model Architecture: {e}')
    import traceback
    traceback.print_exc()
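
The `SignalExtractor` above averages mask values per column, which measures how much ink a column contains rather than where the trace sits vertically. A minimal alternative sketch, assuming a single-lead `(H, W)` mask and a known grid scale, takes the intensity-weighted centroid row of each column and converts it to millivolts. The helper names, the `baseline_row` and `pixels_per_mm` parameters, and the default 25 mm/s and 10 mm/mV paper settings are assumptions that should be verified against each printout.

```python
import numpy as np

def mask_to_trace(mask, threshold=0.5):
    """Collapse an (H, W) waveform mask to one row position per column.

    Returns a length-W float array with the intensity-weighted centroid row of
    each column; columns with no pixel at or above `threshold` are NaN.
    """
    mask = np.asarray(mask, dtype=np.float64)
    weights = np.where(mask >= threshold, mask, 0.0)
    rows = np.arange(mask.shape[0], dtype=np.float64)[:, None]
    col_sum = weights.sum(axis=0)
    trace = np.full(mask.shape[1], np.nan)
    valid = col_sum > 0
    trace[valid] = (weights * rows).sum(axis=0)[valid] / col_sum[valid]
    return trace

def trace_to_millivolts(trace, baseline_row, pixels_per_mm, mm_per_mv=10.0):
    """Convert row positions to mV (assumed gain: 10 mm/mV unless overridden)."""
    # Image rows grow downward, so a smaller row index means a higher voltage.
    return (baseline_row - trace) / (pixels_per_mm * mm_per_mv)

def sampling_rate_hz(pixels_per_mm, mm_per_s=25.0):
    """Samples per second when one sample is taken per pixel column (assumed speed: 25 mm/s)."""
    return pixels_per_mm * mm_per_s

# Synthetic example: a flat trace 20 pixels above the baseline, at 4 pixels per mm
demo_mask = np.zeros((100, 50))
demo_mask[30, :] = 1.0
demo_trace = mask_to_trace(demo_mask)
demo_mv = trace_to_millivolts(demo_trace, baseline_row=50, pixels_per_mm=4.0)
print(demo_mv[:3], sampling_rate_hz(4.0))
```

Columns where the mask is empty come back as NaN, which leaves the choice of interpolation or gap handling to a later step instead of silently inventing samples.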

## 6. Implementation & Next Steps

**Note:** This section provides guidance, not complete code. Actual implementation depends on competition task.

In [None]:
print('📋 === IMPLEMENTATION GUIDE ===\n')

print('Competition task determines implementation approach\n')
print('Possible approaches:')
print('  - Classification: Train classifier, predict labels')
print('  - Regression: Train regressor, predict values')
print('  - Generation: Generate required outputs')
print('  - Processing: Transform/extract data')

print('\n⚠️ TODO: Implement competition-specific solution')
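
If reference signals are available for the training images, one common way to quantify digitization quality is the signal-to-noise ratio between the reference and the reconstructed trace. The sketch below is a generic SNR in decibels; it is not confirmed to be the competition's official metric, and `snr_db` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (reference - estimate) as noise."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)
    if noise_power == 0:
        return float('inf')  # perfect reconstruction
    return 10.0 * np.log10(signal_power / noise_power)

# Synthetic check: a 5 Hz sine plus a 50 Hz error at one tenth of the amplitude
t = np.linspace(0, 1, 500, endpoint=False)
ref = np.sin(2 * np.pi * 5 * t)
est = ref + 0.1 * np.sin(2 * np.pi * 50 * t)
print(snr_db(ref, est))
```

Both signals must be on the same time base and in the same units before comparison, so any resampling and mV calibration has to happen first.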


## 7. Submission

**Generate submission file in competition format**

In [None]:
print('📤 === SUBMISSION GENERATION ===\n')

print('⚠️ TODO: Check competition submission format')
print('Typical formats: CSV, Parquet, JSON')

# Generic template (uncomment and modify):
# submission = pd.DataFrame({
#     'id': test_ids,
#     'prediction': predictions  # YOUR PREDICTIONS HERE
# })
# submission.to_csv('submission.csv', index=False)
# print('✅ Submission created!')
