## Data Preparation

This notebook prepares and explores the working dataset for machine learning. A key step is splitting the data into modeling data  
and testing data: the testing data must be completely isolated to avoid any data or statistical leakage.

#### Goals:
- understand data structure
- load and compile data into one dataset
- map metadata to corresponding ECG signals

### Data structure

Shape:
- 45 152 patients -> 10-second 12-lead ECG signals at 500 Hz -> 5000 samples per lead
- WFDB format (WaveForm Database)

ECG = pair of files:
- JS0000X.hea - ASCII WFDB header + metadata -> description of the data
- JS0000X.mat - binary data matrix (val matrix in raw units) - 10-second 12-lead ECG samples in binary form
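
As a rough illustration of the header side of this pair: the first line of a WFDB header (the record line) starts with the record name, the number of signals, the sampling frequency and the number of samples. The sketch below splits such a line by hand. The sample line is reconstructed from the figures listed above, `parse_record_line` is a hypothetical helper (not part of `wfdb`), and it assumes the frequency is written as a plain integer.

```python
from typing import NamedTuple


class RecordLine(NamedTuple):
    record: str
    n_sig: int
    fs: int
    n_samples: int


def parse_record_line(line: str) -> RecordLine:
    """Split the first line of a WFDB header into its four leading fields."""
    name, n_sig, fs, n_samples = line.split()[:4]
    return RecordLine(name, int(n_sig), int(fs), int(n_samples))


# Illustrative record line built from the figures above (12 leads, 500 Hz, 5000 samples).
rec = parse_record_line("JS00001 12 500 5000")
duration_s = rec.n_samples / rec.fs  # recording length in seconds
print(rec, duration_s)
```

Dividing the sample count by the sampling frequency is a quick sanity check that a record really covers the expected 10 seconds.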

In [57]:
'''
ecg-arrhythmia/
    WFDBRecords/
        01/
            010/
                JS00001.hea
                JS00001.mat
                JS00002.hea
                JS00002.mat
                ...
        02/
        ...
        46/
'''
pass

### Implementation

In [58]:
import os
import sys

sys.dont_write_bytecode = True
root_dir = os.path.abspath(os.pardir)
if root_dir not in sys.path:
    sys.path.append(root_dir)

In [59]:
import wfdb
import numpy as np
import pandas as pd
from configs.constants import *

configuration

In [60]:
data_dir_folder_path = os.path.abspath(os.path.join(root_dir, DATA_DIR_FOLDER))
data_dir_path = os.path.abspath(os.path.join(data_dir_folder_path, DATA_DIR))

#### Data Load Pipeline

1. Load pipeline
2. Feature Extraction pipeline

common load - metadata

In [61]:
def iter_header_paths(root_dir, dirname="WFDBRecords"):
    wfdb_root = os.path.join(root_dir, dirname)

    for d1 in sorted(os.listdir(wfdb_root)):
        p1 = os.path.join(wfdb_root, d1)
        if not os.path.isdir(p1):
            continue

        for d2 in sorted(os.listdir(p1)):
            p2 = os.path.join(p1, d2)
            if not os.path.isdir(p2):
                continue

            for fname in sorted(os.listdir(p2)):
                if fname.endswith(".hea"):
                    yield os.path.join(p2, fname)
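
The same two-level traversal can also be written with `pathlib` glob patterns. This is only an alternative sketch, not the notebook's implementation: `iter_header_paths_glob` is a hypothetical name, and the temporary directory tree merely imitates the layout shown earlier.

```python
import tempfile
from pathlib import Path


def iter_header_paths_glob(root_dir, dirname="WFDBRecords"):
    """Yield .hea paths two directory levels below root_dir/dirname, sorted."""
    wfdb_root = Path(root_dir) / dirname
    # sorted() makes the iteration order deterministic
    yield from sorted(wfdb_root.glob("*/*/*.hea"))


# Tiny throwaway tree that mimics the WFDBRecords/01/010/ layout.
with tempfile.TemporaryDirectory() as tmp:
    leaf = Path(tmp) / "WFDBRecords" / "01" / "010"
    leaf.mkdir(parents=True)
    for name in ("JS00002.hea", "JS00001.hea", "JS00001.mat"):
        (leaf / name).touch()
    found = [p.name for p in iter_header_paths_glob(tmp)]

print(found)
```

The glob version is shorter, while the explicit nested loops make it easier to skip or log unexpected entries at each directory level.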

In [62]:
def parse_header(header_path):
    header_path = os.fspath(header_path)
    with open(header_path, encoding='utf-8') as f:
        lines = f.read().splitlines()

    if not lines:
        raise ValueError(f'Empty header file: {header_path}')
    
    first = lines[0].strip().split()

    # Workaround for headers where the first (record) line is missing:
    # keep all lines so the metadata comments are still parsed below.
    offset = 1
    try:
        record_name = first[0]
        n_signals = first[1]
        freq = int(first[2])
        n_samples = int(first[3])
    except (IndexError, ValueError):
        record_name = None
        n_signals = None
        freq = None
        n_samples = None
        offset = 0

    lines = lines[offset:]

    age = None
    sex = None
    dx_codes = []  # stays empty when the header has no #Dx line

    for line in lines:
        line = line.strip()
        if line.startswith('#Age:'):
            _, v = line.split(':', 1)
            v = v.strip()
            if v and v.lower() not in ['unknown', 'nan']:
                try:
                    age = int(v)
                except ValueError:
                    age = None
        elif line.startswith('#Sex:'):
            _, v = line.split(":", 1)
            sex = v.strip() or None

        elif line.startswith("#Dx:"):
            _, v = line.split(":", 1)
            codes = [c.strip() for c in v.split(",") if c.strip()]
            dx_codes = []
            for c in codes:
                try:
                    dx_codes.append(int(c))
                except ValueError:
                    pass

    base, _ = os.path.splitext(header_path)
    record_path = base

    return {
        "record": record_name,
        "hea_path": header_path,
        "record_path": record_path,
        "n_sig": n_signals,
        "fs": freq,
        "n_samples": n_samples,
        "age": age,
        "sex": sex,
        "dx_codes": dx_codes,
    }


label loading function

In [63]:
def load_labels(dirpath, filepath, as_index=False, index_column='Snomed_CT'):
    filepath = os.path.join(dirpath, filepath)
    data = pd.read_csv(filepath)
    if as_index:
        data = data.set_index(index_column)
    return data

In [74]:
load_labels(data_dir_folder_path, 'ConditionNames_SNOMED-CT.csv', as_index=True).head()

Snomed_CT,Acronym Name,Full Name
270492004,1AVB,1 degree atrioventricular block
195042002,2AVB,2 degree atrioventricular block
54016002,2AVB1,2 degree atrioventricular block(Type one)
28189009,2AVB2,2 degree atrioventricular block(Type two)
27885002,3AVB,3 degree atrioventricular block
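
The label table is what turns the numeric `dx_codes` from the headers into readable condition names. A minimal sketch of that lookup, using a plain dictionary filled with the five rows previewed above; `codes_to_acronyms` and the `"UNK"` placeholder are hypothetical choices for illustration.

```python
# Codes and acronyms copied from the table preview above.
SNOMED_TO_ACRONYM = {
    270492004: "1AVB",
    195042002: "2AVB",
    54016002: "2AVB1",
    28189009: "2AVB2",
    27885002: "3AVB",
}


def codes_to_acronyms(dx_codes, mapping, unknown="UNK"):
    """Translate SNOMED CT codes to acronyms; unmapped codes become `unknown`."""
    return [mapping.get(code, unknown) for code in dx_codes]


acronyms = codes_to_acronyms([270492004, 27885002, 123], SNOMED_TO_ACRONYM)
print(acronyms)
```

With the indexed DataFrame returned by `load_labels(..., as_index=True)`, the same dictionary could presumably be built from the `Acronym Name` column via `.to_dict()`.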


In [75]:
run_ = False
if run_:
    rows = []
    for hea in iter_header_paths(data_dir_folder_path):
        rows.append(parse_header(hea))

    meta_df = pd.DataFrame(rows)
    print("Number of records:", len(meta_df))
    print(meta_df.head())

In [76]:
save_ = False
if save_:
    meta_df.to_csv('../data/results/complete_metadata_mapping_2.csv', index=False)

common load - ECG signal

In [71]:
head_paths = list(iter_header_paths(data_dir_folder_path))

In [72]:
parse_header(head_paths[0])

{'record': 'JS00001',
 'hea_path': 'c:\\Users\\samue\\Desktop\\VS\\Mgr\\mAIN\\Strojove Ucenie\\projekt\\data\\ecg-arrhythmia\\WFDBRecords\\01\\010\\JS00001.hea',
 'record_path': 'c:\\Users\\samue\\Desktop\\VS\\Mgr\\mAIN\\Strojove Ucenie\\projekt\\data\\ecg-arrhythmia\\WFDBRecords\\01\\010\\JS00001',
 'n_sig': '12',
 'fs': 500,
 'n_samples': 5000,
 'age': 85,
 'sex': 'Male',
 'dx_codes': [164889003, 59118001, 164934002]}

### Working with loaded compiled metadata

Workflow:
1. Build the metadata mapping dataset -> each row = 1 patient and their recording
2. Map each metadata row to the corresponding ECG signal of that patient (row to signal)
3. Utilize streaming -> the data on disk is 6 GB in size
4. Multiple modeling pipelines -> model dependent, e.g. we will pass raw ECG to a CNN, but models such as SVM need feature extraction first
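
A back-of-the-envelope calculation shows why streaming is worth considering. The sketch below multiplies out the dataset dimensions for two sample widths. It assumes the raw samples are stored as 16-bit integers (an assumption, not checked here), while `load_ecg_signal` casts to `float32`, which doubles the footprint once loaded.

```python
N_RECORDS = 45_152        # record count quoted in the data structure notes
N_LEADS = 12
N_SAMPLES = 5_000         # 10 s at 500 Hz


def dataset_size_gb(bytes_per_sample: int) -> float:
    """Size of the full signal tensor in gigabytes (10**9 bytes)."""
    return N_RECORDS * N_LEADS * N_SAMPLES * bytes_per_sample / 1e9


print(f"int16:   {dataset_size_gb(2):.2f} GB")
print(f"float32: {dataset_size_gb(4):.2f} GB")
```

The 16-bit figure is in the same range as the size on disk, and the `float32` figure is what holding every signal in memory at once would need.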


Resources:
- Models:
    - scikit-multiflow (incremental learning): https://scikit-multiflow.readthedocs.io/en/stable/index.html
    - sklearn streaming: https://scikit-learn.org/stable/computing/scaling_strategies.html

- Features:
    - ECG data processing tips: https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2025.1649923/full?utm_source=chatgpt.com


Approaches:
1. ML Pipeline - traditional ML models - e.g. logistic regression, SVM, forests, ... - does not need streaming: we compute features and discard the signals
2. DL Pipeline - deep learning algorithms - e.g. CNN, Transformers, ... - we will need to stream rows/batches for fitting
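
For the DL pipeline, streaming can be as simple as walking the record paths in fixed-size chunks and loading only one chunk at a time. A minimal sketch of such a batching generator, with hypothetical record names and a hypothetical helper `iter_batches`; signal loading is left out on purpose.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical record paths; in the notebook these would come from the metadata table.
paths = [f"JS{i:05d}" for i in range(1, 8)]
batches = list(iter_batches(paths, batch_size=3))
print([len(b) for b in batches])
```

Because the generator only holds one batch of paths at a time, the matching signals can be loaded, used for a training step and released before the next batch is read.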


#### Research Questions (ideas)

1. Build custom rhythm classification models - evaluate performance and compare to ECG-FM
2. Adversarial inputs and explainability - test adversarial inputs and critical intervals (which part of the ECG is the most critical for a change in output), compare to ECG-FM
    - the idea here is to find which "regions" of the ECG signal are the most critical and also compare with explainability (due diligence) -> does the critical region match parameter importance?
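
One possible way to probe for critical regions is occlusion: blank out one window of the signal at a time and record how much the model output moves. The sketch below is only an illustration of that idea under stated assumptions. The occlusion approach itself is a suggestion rather than something fixed in this notebook, the "model" is a toy scoring function, and `occlusion_importance` is a hypothetical helper.

```python
from typing import Callable, List, Sequence


def occlusion_importance(
    signal: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    window: int,
) -> List[float]:
    """Absolute change in score when each non-overlapping window is zeroed."""
    baseline = score_fn(signal)
    changes = []
    for start in range(0, len(signal), window):
        occluded = list(signal)
        stop = min(start + window, len(signal))
        occluded[start:stop] = [0.0] * (stop - start)
        changes.append(abs(score_fn(occluded) - baseline))
    return changes


# Toy stand-in for a classifier: the score is simply the signal's peak value.
toy_signal = [0.0, 0.1, 0.0, 0.1, 1.0, 0.2, 0.0, 0.1]
importance = occlusion_importance(toy_signal, max, window=2)
print(importance)
```

The window with the largest change is the candidate "critical region", which could then be compared against whatever an explainability method highlights for the same input.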
    

implementation:

In [105]:
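def load_ecg_signal(record_path, dataframe=False):
    # wfdb.rdsamp returns the signal as (n_samples, n_leads) in physical units
    sig, fields = wfdb.rdsamp(record_path)
    # transpose to (n_leads, n_samples) and downcast to float32 to save memory
    sig = sig.T.astype(np.float32)
    fs = fields['fs']
    lead_names = fields.get('sig_name')

    if dataframe:
        # back to (n_samples, n_leads): one column per lead
        return pd.DataFrame(sig.T, columns=lead_names)

    return sig, fs, lead_names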


working on metadata df

In [100]:
file = '../data/results/complete_metadata_mapping_2.csv'

In [101]:
data = pd.read_csv(file)
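
One thing to watch when reloading the metadata from CSV: list-valued columns such as `dx_codes` are written as their string representation, so they come back as text rather than lists. A minimal sketch of restoring them with `ast.literal_eval`, using the standard `csv` module on an in-memory file so that it runs on its own; on the pandas column the same function could be applied row by row.

```python
import ast
import csv
import io

# Write one metadata row the way a list-valued column ends up in a CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["record", "dx_codes"])
writer.writerow(["JS00001", str([164889003, 59118001, 164934002])])

# Read it back: the column is now a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["dx_codes"]

# literal_eval parses Python literals only and does not execute code.
dx_codes = ast.literal_eval(raw)
print(type(raw).__name__, "->", dx_codes)
```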

In [102]:
load_ecg_signal(data.loc[0]['record_path'], dataframe=True)

,I,II,III,aVR,aVL,aVF,V1,V2,V3,V4,V5,V6
0,-0.254,0.264,0.517,-0.005,-0.386,0.390,-0.098,-0.312,-0.098,0.810,0.810,0.527
1,-0.254,0.264,0.517,-0.005,-0.386,0.390,-0.098,-0.312,-0.098,0.810,0.810,0.527
2,-0.254,0.264,0.517,-0.005,-0.386,0.390,-0.098,-0.312,-0.098,0.810,0.810,0.527
3,-0.254,0.264,0.517,-0.005,-0.386,0.390,-0.098,-0.312,-0.098,0.810,0.810,0.527
4,-0.264,0.244,0.508,0.010,-0.386,0.376,-0.083,-0.259,-0.063,0.756,0.756,0.517
...,...,...,...,...,...,...,...,...,...,...,...,...
4995,-0.044,-0.044,0.000,0.044,-0.024,-0.024,-0.029,0.590,0.151,-0.185,-0.190,0.122
4996,-0.034,-0.063,-0.029,0.049,-0.005,-0.049,0.000,0.620,0.166,-0.181,-0.176,0.122
4997,-0.034,-0.068,-0.034,0.054,0.000,-0.054,-0.024,0.595,0.137,-0.205,-0.200,0.102
4998,0.024,-0.049,-0.073,0.015,0.049,-0.063,-0.015,0.590,0.132,-0.200,-0.195,0.093


### Project Pipeline

Workflow:
1. common load pipeline
2. train/test split - to avoid statistical leakage we split at the very beginning
3. ML/DL pipelines
4. Model fit -> other notebook

**1. common load pipeline**

**2. train/test split**
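
One way to keep the test set isolated from the start is to derive the split from the record name itself, so that a record always lands on the same side no matter when or in which order the data is processed. The sketch below uses a hash-based assignment. This is a swapped-in technique for illustration, not the split this project necessarily uses, and it does not stratify by diagnosis. `assign_split`, the 20 % test fraction and the generated record names are assumptions.

```python
import hashlib


def assign_split(record: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a record name to 'train' or 'test'."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical record names following the JS-prefixed pattern of the dataset.
records = [f"JS{i:05d}" for i in range(1, 1001)]
splits = {r: assign_split(r) for r in records}
n_test = sum(s == "test" for s in splits.values())
print(n_test, "of", len(records), "records assigned to test")
```

Since each metadata row corresponds to one patient, splitting by record name also splits by patient. Any statistics used for preprocessing (e.g. scaling) would then be fitted on the `train` rows only.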

**3. ML/DL pipeline**
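
For the traditional ML branch, each recording is reduced to a fixed-length feature row and the raw signal is discarded afterwards. A minimal sketch using simple per-lead summary statistics. The feature set, the helper names and the synthetic signal are all assumptions chosen for illustration and are not ECG-specific features.

```python
import math
import statistics
from typing import Dict, Sequence


def lead_features(name: str, samples: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one lead, keyed as '<lead>_<stat>'."""
    return {
        f"{name}_mean": statistics.fmean(samples),
        f"{name}_std": statistics.pstdev(samples),
        f"{name}_min": min(samples),
        f"{name}_max": max(samples),
    }


def record_features(leads: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Flatten per-lead statistics into one feature row for the record."""
    row: Dict[str, float] = {}
    for name, samples in leads.items():
        row.update(lead_features(name, samples))
    return row


# Synthetic two-lead signal (one second of a 5 Hz sine at 500 Hz), not real ECG data.
t = [i / 500 for i in range(500)]
toy = {"I": [math.sin(2 * math.pi * 5 * x) for x in t], "II": [0.5] * 500}
features = record_features(toy)
print(sorted(features))
```

Each record then contributes one small dictionary, so the full feature table fits in memory even though the raw signals do not need to.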

**4. Model fit**

**full pipeline**