# Intro
This notebook downloads the entire MIT-BIH Arrhythmia Database, which is publicly available here: https://physionet.org/physiobank/database/mitdb/

It then converts the WFDB files into CSV files, one per record channel, named `<record>_<channel>.csv` (for example, the files 100.dat, 100.hea, and 100.atr yield 100_MLII.csv).

Each row in the CSV file corresponds to a single heartbeat and is generated as follows:
* Split MIT-BIH record at the R-peaks into individual heartbeat records.
* Extend each heartbeat record with the first 40 readings of the next heartbeat record so that it includes a full QRS complex.
* Resample each heartbeat record from 360Hz to 125Hz.
* Normalize the mV readings to a 0-1 range.
* Heartbeat records longer than 187 values are discarded.
* Heartbeat records are padded with zeroes at the end until they contain exactly 187 values.
* Heartbeat classifications from the annotations are reduced to just Normal and Abnormal and appended to the end of each heartbeat record (0 is normal, 1 is abnormal). Each row then contains exactly 188 values.
* Heartbeat records without classifications are discarded.
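
The signal-processing steps in the list above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function name `beats_from_rpeaks` and the constants are hypothetical, and linear interpolation with `np.interp` stands in for whichever resampling routine is really used, so that the sketch depends only on NumPy.

```python
import numpy as np

FS_IN = 360     # MIT-BIH sampling rate in Hz
FS_OUT = 125    # target sampling rate in Hz
ROW_LEN = 187   # fixed number of values per heartbeat
EXTRA = 40      # readings borrowed from the next heartbeat


def beats_from_rpeaks(sig, rpeaks):
    """Turn a 1-D signal and its R-peak indices into fixed-length rows.

    Hypothetical helper: labels are not handled here, only the signal steps.
    """
    sig = np.asarray(sig, dtype=float)
    rows = []
    for start, end in zip(rpeaks[:-1], rpeaks[1:]):
        # Extend the beat with the first EXTRA readings of the next beat
        beat = sig[start:min(end + EXTRA, len(sig))]

        # Resample from FS_IN to FS_OUT (assumption: linear interpolation)
        n_out = int(round(len(beat) * FS_OUT / FS_IN))
        if n_out < 2 or n_out > ROW_LEN:
            continue  # too short to resample, or longer than ROW_LEN
        x_old = np.linspace(0.0, 1.0, num=len(beat))
        x_new = np.linspace(0.0, 1.0, num=n_out)
        beat = np.interp(x_new, x_old, beat)

        # Normalize the readings to the 0-1 range
        span = beat.max() - beat.min()
        if span == 0:
            continue  # flat segment, nothing to normalize
        beat = (beat - beat.min()) / span

        # Pad with zeroes at the end up to ROW_LEN values
        rows.append(np.pad(beat, (0, ROW_LEN - n_out)))
    return np.array(rows)


# Synthetic stand-in for an ECG channel: a sine wave peaking every 300 samples
t = np.arange(3600)
demo_signal = np.sin(2 * np.pi * t / 300)
demo_rows = beats_from_rpeaks(demo_signal, np.array([75, 375, 675, 975]))
print(demo_rows.shape)
```

At 360Hz, a beat longer than roughly 538 samples (about 1.5 seconds) resamples to more than 187 values and is therefore dropped by the length check.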

These CSV files are intended for training an ECG model that classifies heartbeats as either Normal or Abnormal.
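
As an illustration of how such a file might be consumed for training, the sketch below loads a CSV and splits each 188-value row into 187 features and one label. The helper name `load_beats` is hypothetical, and the file is a tiny stand-in written on the fly rather than real converted data.

```python
import os
import tempfile

import numpy as np


def load_beats(csv_path):
    """Load one heartbeat CSV and split it into features and labels."""
    data = np.loadtxt(csv_path, delimiter=',', ndmin=2)
    features = data[:, :-1]            # the normalized readings
    labels = data[:, -1].astype(int)   # 0 = Normal, 1 = Abnormal
    return features, labels


# Write a small stand-in file with three rows of 188 values each
stand_in = np.zeros((3, 188))
stand_in[1, -1] = 1  # mark the second beat as Abnormal

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    np.savetxt(demo_path, stand_in, delimiter=',', fmt='%f')
    X, y = load_beats(demo_path)

print(X.shape, y.tolist())
```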

# Initialize
Import dependencies.

Note that you will need to install the [wfdb](https://github.com/MIT-LCP/wfdb-python) package, which provides functions for downloading and reading [WFDB](https://physionet.org/physiotools/wfdb.shtml)-compatible files. You will also need the [BioSPPy](https://github.com/PIA-Group/BioSPPy) library, which we use to find the R-peaks in the data.

In [None]:
%pip install tqdm
%pip install wfdb
%pip install datasets

In [None]:
%pip install biosppy

import os
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from biosppy.signals import ecg
import wfdb as wf

# Download the MIT-BIH Arrhythmia Database into ./mitdb
wf.dl_database('mitdb', './mitdb')

# Data Conversion
Read the WFDB files and convert them to CSV files. The data is split into individual heartbeats, with each row consisting of exactly 187 normalized and resampled values followed by a final value representing the classification: 0 = Normal, 1 = Abnormal.
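
The labeling rule can be sketched in isolation as below. The names `BEAT_SYMBOLS`, `symbol_to_label`, and `make_row` are hypothetical. The symbol list mirrors the `realbeats` list used in the conversion cell, and the sketch assumes, as that cell does, that only the `N` symbol counts as Normal.

```python
import numpy as np

# Annotation symbols treated as real beats (same list as `realbeats`)
BEAT_SYMBOLS = ['N', 'L', 'R', 'B', 'A', 'a', 'J', 'S', 'V', 'r',
                'F', 'e', 'j', 'n', 'E', '/', 'f', 'Q', '?']


def symbol_to_label(symbol):
    """Map an annotation symbol to 0 (Normal), 1 (Abnormal), or None."""
    if symbol == 'N':
        return 0
    if symbol in BEAT_SYMBOLS:
        return 1
    return None  # not a beat annotation, so there is no classification


def make_row(beat_values, symbol):
    """Append the label to 187 beat values, or return None to discard."""
    label = symbol_to_label(symbol)
    if label is None:
        return None
    values = np.asarray(beat_values, dtype=float)
    if values.shape != (187,):
        raise ValueError('expected exactly 187 values per heartbeat')
    return np.append(values, label)


row = make_row(np.zeros(187), 'V')
print(row.shape, row[-1])
```

Returning `None` for non-beat symbols lets the caller drop heartbeats that have no classification instead of silently labeling them Normal.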

In [14]:
import os
import numpy as np
from scipy import signal
from biosppy.signals import ecg
import wfdb as wf

# Download the MIT-BIH dataset
wf.dl_database('mitdb', './mitdb')

# Create output directory for CSV files
output_dir = './data_ecg'
os.makedirs(output_dir, exist_ok=True)

# Define real beat classifications
realbeats = ['N', 'L', 'R', 'B', 'A', 'a', 'J', 'S', 'V', 'r', 
             'F', 'e', 'j', 'n', 'E', '/', 'f', 'Q', '?']

# List available records
records = wf.get_record_list('mitdb')
print(f'Total files: {len(records)}')

# Iterate through each record
for record_name in records:
    print(f'Processing record: {record_name}')
    record_path = f'./mitdb/{record_name}'

    # Load record and annotations
    record = wf.rdsamp(record_path)
    annotation = wf.rdann(record_path, 'atr')

    # Sampling frequency
    fs = record[1]['fs']

    # ECG data and annotations
    data = record[0].T
    annotations = annotation.sample
    symbols = annotation.symbol

    # Initialize classifications for beats
    classifications = np.zeros(len(annotations), dtype=float)
    for i, sym in enumerate(symbols):
        if sym == 'N':
            classifications[i] = 0  # Normal
        elif sym in realbeats:
            classifications[i] = 1  # Abnormal

    # Process each channel in the record
    for channel_index, channel_data in enumerate(data):
        channel_name = record[1]['sig_name'][channel_index]
        print(f'  Processing channel: {channel_name}')

        # Detect R-peaks using biosppy
        ecg_output = ecg.ecg(signal=channel_data, sampling_rate=fs, show=False)
        rpeaks = ecg_output['rpeaks']

        # Prepare to store processed beats
        all_beats = []

        # Split and process individual heartbeats; each window runs from the
        # previous R-peak to the next one, so it contains R-peak i
        for i in range(1, len(rpeaks) - 1):  # Skip first and last R-peak
            start = rpeaks[i - 1]
            end = rpeaks[i + 1]

            # Boundary checks
            if start < 0 or end > len(channel_data):
                continue

            beat = channel_data[start:end]
            if len(beat) == 0:  # Skip empty beats
                continue

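            # Resample to 187 values first, then normalize to the 0-1 range
            # (FFT resampling after normalization can overshoot that range)
            resampled_beat = signal.resample(beat, 187)
            resampled_beat = (resampled_beat - resampled_beat.min()) / (
                resampled_beat.max() - resampled_beat.min() + 1e-8)

            # Label from the annotation nearest to this R-peak; index i of the
            # detected R-peaks does not line up with index i of the annotations
            nearest = int(np.argmin(np.abs(annotations - rpeaks[i])))
            if symbols[nearest] not in realbeats:
                continue  # Discard beats without a beat classification
            classification = classifications[nearest]
            labeled_beat = np.append(resampled_beat, classification)
            all_beats.append(labeled_beat)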

        # Save processed beats to a CSV file
        if len(all_beats) > 0:  # Ensure there are beats to save
            all_beats = np.array(all_beats, dtype=np.float32)
            output_file = os.path.join(output_dir, f'{record_name}_{channel_name}.csv')
            print(f'  Saving to: {output_file}')
            np.savetxt(output_file, all_beats, delimiter=',', fmt='%f')
        else:
            print(f'  No valid beats found for {record_name}_{channel_name}')


Generating record list for: 100
Generating record list for: 101
Generating record list for: 102
Generating record list for: 103
Generating record list for: 104
Generating record list for: 105
Generating record list for: 106
Generating record list for: 107
Generating record list for: 108
Generating record list for: 109
Generating record list for: 111
Generating record list for: 112
Generating record list for: 113
Generating record list for: 114
Generating record list for: 115
Generating record list for: 116
Generating record list for: 117
Generating record list for: 118
Generating record list for: 119
Generating record list for: 121
Generating record list for: 122
Generating record list for: 123
Generating record list for: 124
Generating record list for: 200
Generating record list for: 201
Generating record list for: 202
Generating record list for: 203
Generating record list for: 205
Generating record list for: 207
Generating record list for: 208
Generating record list for: 209
Generati