Published on October 21, 2025. By Prata, Marília (mpwolke)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from matplotlib.gridspec import GridSpec

#Two lines Required to Plot Plotly
import plotly.io as pio
pio.renderers.default = 'iframe'

import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![](https://www.saem.org/images/default-source/academy-images/cdem/picture1b6cb1a73-b04a-411a-aaa2-241e9a8ba757.png?sfvrsn=b20cf159_1)

Image: SAEM

## About the Competition: Extract the ECG time-series data

Extract the ECG time-series data from scans and photographs of paper printouts of the ECGs.

"You will build models to extract time series data from Electrocardiogram (ECG) images. ECGs are used to diagnose and guide treatment for heart disease. They exist as physical printouts, scanned images, photographs, or time series. Medical-grade ECG software currently works best on time series. Tools for extracting time series from ECG images can help convert billions of ECG images collected globally over decades into data for training better diagnostic models to improve clinical outcomes."

https://www.kaggle.com/competitions/physionet-ecg-image-digitization/overview

## Competition Citation

@misc{physionet-ecg-image-digitization,

    author = {Matthew A. Reyna and Deepanshi and James Weigle and Zuzana Koscova and Kiersten Campbell and Salman Seyedi and Andoni Elola and Ali Bahrami Rad and Amit J Shah and Neal K. Bhatia and Yao Yan and Sohier Dane and Addison Howard and Gari D. Clifford and Reza Sameni},
    
    title = {PhysioNet - Digitization of ECG Images},
    year = {2025},
    
    howpublished = {\url{https://kaggle.com/competitions/physionet-ecg-image-digitization}},
    note = {Kaggle}
}

## ECG leads

lead: one of the 12 standard ECG leads: I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6.

"ECG leads are different 'views' or representations of the heart's electrical activity, calculated from data gathered by physical electrodes placed on the skin. A standard 12-lead ECG uses 10 electrodes to generate 12 different views, categorized as 6 limb leads (I, II, III, aVR, aVL, aVF) and 6 precordial (chest) leads (V1–V6). This provides a comprehensive picture of the heart from multiple angles to help diagnose issues."

In [None]:
# One record's signals, one column per lead (train/[id]/[id].csv)

lead = pd.read_csv('/kaggle/input/physionet-ecg-image-digitization/train/735384893/735384893.csv')
lead.tail()

In [None]:
lead.info()

## 12-lead ECG

* 10 electrodes are required to produce a 12-lead ECG
   * 4 electrodes on the 4 limbs (RA, LL, LA, RL)
   * 6 electrodes on the precordium (V1–6)
* Monitors 12 leads: (V1–6), (I, II, III) and (aVR, aVF, aVL)
* Allows interpretation of specific areas of the heart
   * Inferior (II, III, aVF)
   * Lateral (I, aVL, V5, V6)
   * Anterior (V1–4)
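
As a hedged illustration of how 10 electrodes can yield 12 views: the six limb leads are linearly related, so once leads I and II are known, the other four follow from Einthoven's law and the Goldberger relations. The helper name `derive_limb_leads` and the synthetic sine inputs are assumptions made for this sketch; they are not part of the competition data or code.

```python
import numpy as np

def derive_limb_leads(lead_i, lead_ii):
    """Derive III, aVR, aVL and aVF from leads I and II (hypothetical helper)."""
    lead_i = np.asarray(lead_i, dtype=float)
    lead_ii = np.asarray(lead_ii, dtype=float)
    return {
        "III": lead_ii - lead_i,           # Einthoven's law: II = I + III
        "aVR": -(lead_i + lead_ii) / 2.0,  # Goldberger augmented leads
        "aVL": lead_i - lead_ii / 2.0,
        "aVF": lead_ii - lead_i / 2.0,
    }

# Synthetic stand-ins for two measured leads (not real ECG data).
t = np.linspace(0.0, 1.0, 500)
lead_i = 0.5 * np.sin(2 * np.pi * 1.2 * t)
lead_ii = 0.8 * np.sin(2 * np.pi * 1.2 * t + 0.3)

limb = derive_limb_leads(lead_i, lead_ii)
```

On a real record, the same relations could serve as a rough sanity check on the limb-lead columns wherever they overlap in time.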

**12-lead Precordial lead placement**

* V1: 4th intercostal space (ICS), RIGHT margin of the sternum
* V2:  4th ICS along the LEFT margin of the sternum
* V4: 5th ICS, mid-clavicular line
* V3: midway between V2 and V4
* V5:  5th ICS, anterior axillary line (same level as V4)
* V6:  5th ICS, mid-axillary line (same level as V4)

## ECG leads Histograms

In [None]:
lead.hist(figsize=(15,10), bins=30, color='green', edgecolor='black')
plt.suptitle("Histogram of ECG Leads")
plt.show()

In [None]:
test = pd.read_csv('/kaggle/input/physionet-ecg-image-digitization/test.csv')
test.head()

In [None]:
#By Pedro Andrade https://www.kaggle.com/code/pbizil/datahackers-managers-radiografia-dos-gestores

lead_counts = test["lead"].value_counts().head(12)#Try different values of head
sns.set(style="white")
plt.figure(figsize=(10, 10))
ax = sns.barplot(x=lead_counts.index, y=lead_counts.values, color=sns.color_palette("Greens", n_colors=5)[3])
plt.title("Distribution of ECG Leads", fontsize=16)
plt.xlabel("Lead", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.xticks(rotation=45, fontsize=11)
plt.yticks(fontsize=11)
sns.despine()

#+2 is good if chart is vertical. +20 worked for horizontal
for i, v in enumerate(lead_counts.values):
    ax.text(i, v + 2, str(v), ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

## Load the sample_submission parquet 

In [None]:
#Read One parquet file. 
sub = pd.read_parquet("../input/physionet-ecg-image-digitization/sample_submission.parquet")
sub.tail()

## ECG-Image-Database: Digitization and Analysis

ECG-Image-Database: A Dataset of ECG Images with Real-World Imaging and Scanning Artifacts; A Foundation for Computerized ECG Image Digitization and Analysis

Authors: Matthew A. Reyna, Deepanshi, James Weigle, Zuzana Koscova, Kiersten Campbell, Kshama Kodthalu Shivashankara, Soheil Saghafi, Sepideh Nikookar, Mohsen Motie-Shirazi, Yashar Kiarashi, Salman Seyedi, Gari D. Clifford, Reza Sameni

"The authors introduced the ECG-Image-Database, a large and diverse collection of electrocardiogram (ECG) images generated from ECG time-series data, with real-world scanning, imaging, and physical artifacts. They used **ECG-Image-Kit**, an open-source Python toolkit, to generate realistic images of 12-lead ECG printouts from raw ECG time-series."

"The resulting dataset includes 35,595 software-labeled ECG images with a wide range of imaging artifacts and distortions. The dataset provides ground truth time-series data alongside the images, offering a reference for developing machine and **deep learning models for ECG digitization and classification**."

"The dataset aims to serve as a reference for ECG digitization and computerized annotation efforts. ECG-Image-Database was used in the PhysioNet Challenge 2024 on ECG image digitization and classification."

https://arxiv.org/abs/2409.16612

In [None]:
train = pd.read_csv('/kaggle/input/physionet-ecg-image-digitization/train.csv')
train.head()

## info() method - train file 

In [None]:
train.info()

## Histograms (train file)

**fs**: The sampling frequency of the corresponding ECG signal.

**sig_len**: The full sequence length of the ECG measurements. For the competition data this is always 10 seconds * fs. Note that clinical ECGs can have different durations. Kindly consider making your code robust to other durations though this is not a requirement of participation.

The two histograms have almost the same shape: since sig_len = 10 * fs, each sig_len value is the matching fs value with one more zero.
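
The relation between the two columns can be checked directly. The sketch below uses a small made-up table in place of `train`, so the row values are assumptions; only the rule `sig_len = 10 seconds * fs` comes from the competition description.

```python
import pandas as pd

# Synthetic stand-in for train.csv; these rows are invented for illustration.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "fs": [250, 500, 1000],
    "sig_len": [2500, 5000, 10000],
})

DURATION_S = 10  # the competition states sig_len is always 10 seconds * fs

# True for every row whose length matches the stated rule.
matches = demo["sig_len"] == DURATION_S * demo["fs"]
all_ten_seconds = bool(matches.all())

# Duration in seconds recovered from the two columns.
duration_s = demo["sig_len"] / demo["fs"]
```

Replacing `demo` with `train` runs the same check on the real file.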

In [None]:
numerical_cols = ['id', 'fs', 'sig_len']

train[numerical_cols].hist(figsize=(15,10), bins=30, color='green', edgecolor='black')
plt.suptitle("Histogram of Numeric Features")
plt.show()

## Imports to check images

In [None]:
import random
import glob  # used below as glob.glob(...)
import cv2
from PIL import Image

In [None]:
DATA_PATH = "../input/physionet-ecg-image-digitization/"

In [None]:
image_paths = sorted(glob.glob(os.path.join(DATA_PATH, '**/**/*.png')))

## len (number of images)

In [None]:
print(f"Number of images: {len(image_paths)}")

## We have 10 images, though I could display only 4. Why?

In [None]:
def plotImages(tools,directory):
    print(tools)
    multipleImages = glob.glob(directory)
    plt.rcParams['figure.figsize'] = (8, 8) #Original is 15,15; reduced for this smaller grid
    plt.subplots_adjust(wspace=0, hspace=0)
    i_ = 0
    for l in multipleImages[:4]: #Original is 25. This slice is why only 4 images are shown
        im = cv2.imread(l)
        if im is None: #Not a readable image (e.g. the record's .csv); skip it
            continue
        im = cv2.resize(im, (256, 256)) 
        plt.subplot(5, 5, i_+1) #.set_title(l)
        plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
        i_ += 1

plotImages("ECG 10140238 train images","../input/physionet-ecg-image-digitization/train/10140238/**")
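
One possible reason the grid is awkward to extend: without `recursive=True`, a pattern ending in `/**` behaves like `/*` and matches every entry in the record folder, including the record's CSV, which `cv2.imread` cannot decode (it returns `None`). A minimal sketch that lists only the PNG files first is shown below. The helper name `list_png_paths` and the file names in the throwaway folder are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def list_png_paths(directory):
    """Return sorted paths of the .png files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).glob("*.png"))

# Throwaway folder mimicking one train/[id]/ directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["123.csv", "123-0001.png", "123-0003.png"]:
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in list_png_paths(tmp)]
```

The resulting list can be passed to the plotting loop in place of the `glob.glob(directory)` result, and its length gives the true number of images in the folder.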

## Test images (Only 2)

In [None]:
def plotImages(tools,directory):
    print(tools)
    multipleImages = glob.glob(directory)
    plt.rcParams['figure.figsize'] = (15, 15) #Figure size in inches
    plt.subplots_adjust(wspace=0, hspace=0)
    i_ = 0
    for l in multipleImages[:10]: #The 5x2 grid below holds at most 10 images
        im = cv2.imread(l)
        if im is None: #Not a readable image; skip it
            continue
        im = cv2.resize(im, (512, 512)) 
        plt.subplot(5, 2, i_+1) #.set_title(l)
        plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
        i_ += 1

plotImages("ECG test images","../input/physionet-ecg-image-digitization/test/**")

## Open with PIL to check the ECG details above

One test image is of a male patient and the other PNG of a female patient. I displayed only the male image.

In [None]:
imgs_dir = '../input/physionet-ecg-image-digitization/test/'
Image.open(imgs_dir + '2352854581.png')

## This is experimental code that I'll try to adapt here

Read the original to understand what I intended to do. 
Unfortunately, I didn't manage it.

In [None]:
#By Gregoire DC https://www.kaggle.com/code/gregoiredc/arrhythmia-on-ecg-classification-using-cnn/notebook

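#Not adapted yet and never called below: the original filtered rows on a class-label column.
#Here lead[:11]==class_number is a table of booleans over the first 11 rows, not a row mask.
def plot_hist(class_number,size,min_,bins):
    img=lead.loc[lead[:11]==class_number].values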
    img=img[:,min_:size]
    img_flatten=img.flatten()

    final1=np.arange(min_,size)
    for i in range (img.shape[0]-1):
        tempo1=np.arange(min_,size)
        final1=np.concatenate((final1, tempo1), axis=None)
    print(len(final1))
    print(len(img_flatten))
    plt.hist2d(final1,img_flatten, bins=(bins,bins),cmap=plt.cm.jet)
    plt.show()

## The original data has 187 columns. I changed 187 to V6. 

In [None]:
#By Gregoire DC https://www.kaggle.com/code/gregoiredc/arrhythmia-on-ecg-classification-using-cnn/notebook

c=lead.groupby('V6',group_keys=False).apply(lambda lead : lead.sample(1))

## That's c  (= our train/[id]/[id].csv)

In [None]:
c

In [None]:
plt.plot(c.iloc[:11]);

In [None]:
#By Gregoire DC https://www.kaggle.com/code/gregoiredc/arrhythmia-on-ecg-classification-using-cnn/notebook

def add_gaussian_noise(signal):
    noise=np.random.normal(0,0.5,12)
    return (signal+noise)

### Add noise to the data to generalize train (=lead)

In [None]:
#By Gregoire DC https://www.kaggle.com/code/gregoiredc/arrhythmia-on-ecg-classification-using-cnn/notebook

tempo=c.iloc[0:11]
bruiter=add_gaussian_noise(tempo)

plt.subplot(2,1,1)
plt.plot(c.iloc[0:11])

plt.subplot(2,1,2)
plt.plot(bruiter)

plt.show()
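
A note on the cell above: `np.random.normal(0, 0.5, 12)` draws 12 values once, and adding them to an 11 x 12 table broadcasts the same vector onto every row, so each lead gets a constant offset rather than noise that varies over time. The sketch below is one alternative, assuming independent noise per sample is what is wanted. The function name `add_gaussian_noise_like` and the `sigma` and `seed` parameters are hypothetical.

```python
import numpy as np

def add_gaussian_noise_like(signal, sigma=0.5, seed=None):
    """Add zero-mean Gaussian noise with the same shape as `signal`."""
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng(seed)
    # One independent draw per sample and per lead.
    return signal + rng.normal(0.0, sigma, size=signal.shape)

clean = np.zeros((11, 12))  # 11 time steps x 12 leads, same shape as c.iloc[0:11]
noisy = add_gaussian_noise_like(clean, sigma=0.5, seed=0)
```

Passing a fixed `seed` makes the augmented signal reproducible between runs.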

### Allegedly plotting ECG signals (but not really)

In [None]:
#By Shashank Pandey https://www.kaggle.com/code/shashankkkpandeyyy/biomedical-signal-analysis
# Plot what were allegedly ECG signals. Each row of `lead` is one time step, so these lines run across leads, not across time
plt.figure(figsize=(15, 6))
for i in range(3):
    plt.plot(lead.iloc[i, :-1], label=f'Sample {i} - Label {lead.iloc[i, -1]}')
plt.xlabel('Time Steps')
plt.ylabel('Amplitude')
plt.title('Allegedly sample ECG Signals')
plt.legend()
plt.show()
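
For comparison, a lead drawn over time takes one column rather than one row. The sketch below builds the time axis from the sampling frequency. It assumes rows are consecutive samples with the default integer index, uses an invented `FS` of 500 Hz, and runs on a synthetic table, so `lead_trace` and `demo` are illustrative names only.

```python
import numpy as np
import pandas as pd

FS = 500  # assumed sampling frequency in Hz; the real value is the fs column of train.csv

# Synthetic stand-in for one train/[id]/[id].csv: rows are samples, columns are leads.
n = 5 * FS
t = np.arange(n) / FS
demo = pd.DataFrame({
    "I": np.sin(2 * np.pi * 1.0 * t),
    "II": 1.5 * np.sin(2 * np.pi * 1.0 * t),
})

def lead_trace(df, lead_name, fs):
    """Return (time in seconds, amplitude) for one lead column, NaNs dropped."""
    values = df[lead_name].dropna()
    time_s = values.index.to_numpy() / fs  # sample number divided by samples per second
    return time_s, values.to_numpy()

time_s, amplitude = lead_trace(demo, "II", FS)
```

`plt.plot(time_s, amplitude)` then draws that lead against seconds.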

## That's too much noise on an ECG. I'm done.

Read the originals below to understand what it's supposed to look like. 

## Acknowledgements

Gregoire DC https://www.kaggle.com/code/gregoiredc/arrhythmia-on-ecg-classification-using-cnn/notebook

Shashank Pandey https://www.kaggle.com/code/shashankkkpandeyyy/biomedical-signal-analysis

Nicolas Mine https://www.kaggle.com/code/coni57/model-from-arxiv-1805-00794/notebook
