![img](https://i.pinimg.com/736x/37/99/b6/3799b6806aefb5dacb90bc3484d0b6e8.jpg)

In [None]:
"""
Goal: Understand the problem we solve in this competition.

Author: Rudra Prasad Bhuyan
V1: 25-10-2025 00:15 IST
"""
print("")

# <h1><span style="color:#5e17eb; font-weight:bold;">About Data</span></h1>

> - **Competition**: https://www.kaggle.com/competitions/physionet-ecg-image-digitization
> - **Metrics**:
>   - https://en.wikipedia.org/wiki/Signal-to-noise_ratio
>   - https://www.kaggle.com/code/metric/physionet-ecg-signal-extraction-metric/
> - **Data**: https://www.kaggle.com/competitions/physionet-ecg-image-digitization/data
> - **My notebooks in the same series**: https://www.kaggle.com/rudraprasadbhuyan/code?query=ecg-

# <h1><span style="color:#5e17eb; font-weight:bold;">1. Notebook Setup</span></h1>


In [None]:
# Basic imports
import pandas as pd
import matplotlib.pyplot as plt
import os
from PIL import Image  
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Paths
train_csv_path = '/kaggle/input/physionet-ecg-image-digitization/train.csv'
test_csv_path = '/kaggle/input/physionet-ecg-image-digitization/test.csv'
sample_submission_path = '/kaggle/input/physionet-ecg-image-digitization/sample_submission.parquet'

train_folder = '/kaggle/input/physionet-ecg-image-digitization/train'
test_folder = '/kaggle/input/physionet-ecg-image-digitization/test'

# <h1><span style="color:#5e17eb; font-weight:bold;">2. Inspect Metadata</span></h1>


In [None]:
# Load train and test metadata
train_meta = pd.read_csv(train_csv_path)
test_meta = pd.read_csv(test_csv_path)

In [None]:
print("Train Meta Data ....")
display(train_meta)

| Column    | Meaning                                                                                                        |
| --------- | -------------------------------------------------------------------------------------------------------------- |
| `id`      | Unique identifier for each ECG recording / patient sample. Think of it as the **folder name** in `train/[id]`. |
| `fs`      | Sampling frequency of the ECG signal (in Hz). How many data points are recorded per second.                    |
| `sig_len` | Total number of points in the ECG signal (length of the time series).                                          |


**Interpretation:**

- ECG sample 7663343 was recorded at 500 Hz, meaning 500 points per second.

- Total signal length = 5000 points → signal duration = 5000 / 500 = 10 seconds. 

- In general, sig_len = fs × duration. The duration is usually 10 seconds, except for leads that are recorded for a shorter time.
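
The arithmetic above can be checked for every row at once. This is a minimal sketch on a toy table that only mimics the `fs` and `sig_len` columns of `train.csv`; the ids and values in `toy_meta` are illustrative, not taken from the data.

```python
import pandas as pd

# Toy metadata with the same columns as train.csv (illustrative values only)
toy_meta = pd.DataFrame({
    "id": [1, 2],
    "fs": [500, 1000],
    "sig_len": [5000, 10000],
})

# Duration in seconds = number of points / points per second
toy_meta["duration_s"] = toy_meta["sig_len"] / toy_meta["fs"]
print(toy_meta)
```

Running the same two lines on `train_meta` should show whether every recording really lasts 10 seconds.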

In [None]:
print("\nTest Meta Data ....")
display(test_meta)

| Column           | Meaning                                                                     |
| ---------------- | --------------------------------------------------------------------------- |
| `id`             | Unique ECG recording ID (same meaning as in train, but for test recordings). |
| `lead`           | Which of the 12 standard ECG leads (I, II, III, aVR, … V6) this row is for. |
| `fs`             | Sampling frequency of the signal (points per second).                       |
| `number_of_rows` | How many data points you are expected to predict for this lead. Usually `fs × 10` points for Lead II (10 seconds) and `fs × 2.5` points for all other leads (2.5 seconds). |

**Interpretation:**

- This is Lead II for sample 1053922973

- Sampling rate = 1000 Hz → 1000 points per second

- Number of points to predict = 10000 → matches 10 seconds × 1000 Hz

- For other leads (like I, III, V1…V6), number_of_rows = fs × 2.5 seconds = 2500 points in this case.
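
The rule described above (10 seconds for Lead II, 2.5 seconds for the other leads) can be turned into a quick consistency check. This is only a sketch: `expected_rows` is a hypothetical helper, and `toy_test` is a made-up table with the same columns as `test.csv`.

```python
import pandas as pd

def expected_rows(fs, lead):
    """Rows to predict under the assumed rule: 10 s for Lead II, 2.5 s otherwise."""
    seconds = 10.0 if lead == "II" else 2.5
    return int(fs * seconds)

# Toy rows shaped like test.csv (illustrative values only)
toy_test = pd.DataFrame({
    "id": [1, 1, 1],
    "lead": ["I", "II", "V1"],
    "fs": [1000, 1000, 1000],
    "number_of_rows": [2500, 10000, 2500],
})

toy_test["expected"] = [
    expected_rows(fs, lead) for fs, lead in zip(toy_test["fs"], toy_test["lead"])
]

# Rows where the rule does not match the metadata
mismatch = toy_test[toy_test["expected"] != toy_test["number_of_rows"]]
print("Rows that break the rule:", len(mismatch))
```

Applying the same check to `test_meta` would show whether the "usually" in the table holds for every row.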

# <h1><span style="color:#5e17eb; font-weight:bold;">3. Look at Sample Images</span></h1>


In [None]:
train_meta['id']

In [None]:
# Pick one id
sample_id = train_meta['id'].iloc[0]

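# List all PNGs for that sample (the folder also holds the ground-truth CSV, so filter by extension)
sample_dir = os.path.join(train_folder, str(sample_id))
sample_images = sorted(f for f in os.listdir(sample_dir) if f.lower().endswith('.png'))
print("Images for sample:", sample_images)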

In [None]:
# Open a few images
for img_name in sample_images[:7]:  
    img_path = os.path.join(train_folder, str(sample_id), img_name)
    img = Image.open(img_path)
    plt.figure(figsize=(8,4))
    plt.imshow(img)
    plt.title(img_name)
    plt.axis('off')
    plt.show()

# <h1><span style="color:#5e17eb; font-weight:bold;">4. Look at Sample Data</span></h1>


In [None]:
# Read a sample CSV for first id
sample_csv_path = os.path.join(train_folder, str(sample_id), f"{sample_id}.csv")
sample_data = pd.read_csv(sample_csv_path)
display(sample_data)

| Column                           | Meaning                                                                                                                    |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| I, II, III, aVR, aVL, aVF, V1–V6 | The **12 standard ECG leads**. Each column contains the **voltage measurements (in mV)** for that lead at each time point. |


In [None]:
lead_list = test_meta['lead'].to_list()

In [None]:
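# test_meta has one row per (id, lead), so drop repeated lead names but keep their order
for lead in dict.fromkeys(lead_list):
    plt.figure(figsize=(18,7))
    plt.plot(sample_data[lead])  # plot the column for this lead, not always lead I
    plt.title(f"ECG Lead {lead} for sample {sample_id}")
    plt.xlabel("Time (sample points)")
    plt.ylabel("Voltage (mV)")
    plt.show()
    print("\n\n\n")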

# <h1><span style="color:#5e17eb; font-weight:bold;">5. Inspect the Sample Submission</span></h1>


In [None]:
# Load sample submission
sample_submission = pd.read_parquet(sample_submission_path)
display(sample_submission.sample(5))

**Why 0s?**

- All 0s = just a template, not real values.

- You must predict ECG voltage values (mV) for each row.

- Each row = one point in time for one lead for one test image.

- Your model’s job = convert ECG image → numerical signal (time series).

- You’ll replace the zeros with your predicted signal.
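
Replacing the zeros could look roughly like this. The sketch uses a toy table: only the `value` column name comes from the sample submission above, while the `id` column, its placeholder row ids, and the `predicted_mv` array are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sample submission (row ids are placeholders, not the real format)
toy_submission = pd.DataFrame({
    "id": ["row_0", "row_1", "row_2"],
    "value": [0.0, 0.0, 0.0],
})

# Hypothetical model output in mV: one value per row, in the same row order
predicted_mv = np.array([0.01, 0.05, -0.02])

# Guard against a length mismatch before overwriting the template zeros
if len(predicted_mv) != len(toy_submission):
    raise ValueError("Need exactly one prediction per submission row")

toy_submission["value"] = predicted_mv
print(toy_submission)
```

The filled table can then be written back with `DataFrame.to_parquet`, keeping the row order of the template.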

In [None]:
sample_submission["value"].unique()

# <h1><span style="color:#5e17eb; font-weight:bold;">6. Understanding Evaluation Metric</span></h1>


- Metric: modified signal-to-noise ratio (SNR)

- It compares your predicted ECG time series with the ground truth.

- High SNR → prediction is very close to true signal.

- For now, just know: we’ll need a 1D time series per lead for each test sample.
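
To build intuition for "high SNR means close to the truth", here is a sketch of the plain textbook SNR in decibels. It is not the competition's modified metric (see the metric notebook linked at the top for that); `snr_db` and the sine-wave signals are illustrative assumptions.

```python
import numpy as np

def snr_db(truth, pred):
    """Textbook SNR in dB: signal power divided by the power of the error."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    signal_power = np.sum(truth ** 2)
    noise_power = np.sum((truth - pred) ** 2)
    if noise_power == 0:
        return float("inf")  # perfect prediction
    return 10.0 * np.log10(signal_power / noise_power)

# Toy "ECG": a 5 Hz sine sampled at 500 Hz for one second
t = np.linspace(0, 1, 500, endpoint=False)
truth = np.sin(2 * np.pi * 5 * t)

close_pred = truth + 0.01  # small constant offset
far_pred = truth + 0.5     # large constant offset

print("SNR (close):", snr_db(truth, close_pred))
print("SNR (far):  ", snr_db(truth, far_pred))
```

The closer prediction gets the higher score, which is the behaviour the bullets above describe.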

# <h1><span style="color:#5e17eb; font-weight:bold;">Resources</span></h1>

- My notebooks: https://www.kaggle.com/rudraprasadbhuyan/code?query=ecg-
- Kernel issues: https://www.kaggle.com/code/dansbecker/finding-your-files-in-kaggle-kernels