# Disease Prediction by ECG

## Data Processing

We're importing the libraries we need for preprocessing the ECG data:

- `os` lets us handle file paths and directories.
- `numpy (np)` is for numerical operations and working with arrays.
- `wfdb` is a library specialized for reading and processing physiological signals like ECGs from PhysioNet-formatted files.

In [3]:
import os
import numpy as np
import wfdb

The first step in our ECG preprocessing pipeline is to configure the file paths and load the necessary data resources. We begin by specifying:
- the directory containing the raw ECG signals,
- the path to the label annotation file (CSV), and
- the output directory, where we will later store the preprocessed and batched data.

We also define a variable `min_length`, set to 2200, which is the length (in samples) of the shortest ECG recording in our dataset. Longer signals are trimmed to this length so that every signal has the same shape and can be stacked into a single array for model training. No padding is needed, because no recording is shorter than `min_length`. Additionally, we initialize an empty list named `batch` that will temporarily hold a group of ECG signals before writing them to disk.

Once paths and parameters are set, we proceed to load the label annotations from the CSV file. Since the file contains a header, we skip the first row. The labels are initially loaded as strings, after which we remove the columns that carry no class information (the first four and the last one). Finally, we convert the remaining label values to `float32`, the same dtype the signals and the PyTorch tensors use later on.

We also prepare for batching by defining a base filename ("`batch`") and initializing a `file_group` counter. Lastly, we ensure the output directory exists by creating it if necessary using `os.makedirs`.

In [None]:
data_path = "./ecg_resources/data"
label_path = "./ecg_resources/annotations.csv"
output_path = "./data"

min_length = 2200  # length (in samples) of the shortest recording
batch = []

labels = np.loadtxt(label_path, delimiter=',', skiprows=1, dtype=str)
trimed_labels = np.delete(labels, [0, 1, 2, 3, -1], axis=1)
casted_labels = trimed_labels.astype(np.float32)

filename = "batch"
file_group = 1

os.makedirs(output_path, exist_ok=True)
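
The column trimming can be checked on a tiny in-memory file. The sketch below assumes a hypothetical annotation layout (four metadata columns, six label columns, one trailing column); the column names and values are invented for illustration. It uses the slice `[:, 4:-1]`, which keeps the same columns that remain after `np.delete(labels, [0, 1, 2, 3, -1], axis=1)`.

```python
import io
import numpy as np

# Hypothetical annotation file: 4 metadata columns, 6 label columns, 1 trailing column.
toy_csv = io.StringIO(
    "id,age,sex,date,l1,l2,l3,l4,l5,l6,note\n"
    "1,54,M,2020,0,0,0,0,1,0,a\n"
    "2,61,F,2021,1,0,0,0,0,0,b\n"
)

# Same loading call as in the notebook: skip the header and read everything as strings.
rows = np.loadtxt(toy_csv, delimiter=",", skiprows=1, dtype=str)

# Drop the first four columns and the last one, then cast the rest to float32.
toy_labels = rows[:, 4:-1].astype(np.float32)
print(toy_labels.shape)  # (2, 6)
```

Printing the shape of the real `casted_labels` array in the same way is a quick check that exactly six label columns survive.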

With our environment prepared, we now move on to the core data processing loop. This part of the pipeline iterates through the available ECG records, trims them to a uniform length, batches them, and saves them in `.npz` format along with their corresponding labels.

We begin by looping through the data using the `file_group` counter, starting from 1 up to 39,999. For each iteration:
- We construct the file name of the ECG record using the expected naming convention: `TNMG{file_group}_N1`.
- We then attempt to read the signal using the `wfdb` library, which parses the MIT-BIH compatible record format.

If the signal is successfully read, we check its length. To ensure consistency across the dataset:
- If the signal is longer than `min_length`, we trim it down to the target length.
- Otherwise we keep it as is. Since `min_length` is the length of the shortest recording, such a signal is exactly `min_length` samples long.

Each processed signal is added to the `batch` list. If an exception occurs while loading a file (e.g., the file does not exist or is corrupted), the error is logged, and the loop simply moves to the next file.

Once the `batch` reaches 100 signals, we perform the following steps:
- Convert the list of signals into a NumPy array with `float32` precision.
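- Take the first 100 labels from the `casted_labels` array to match the batch. This pairing is purely positional: if the annotation file has one row per record number, a record that fails to load leaves its row in `casted_labels` and shifts every later signal-label pairing by one.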
- Save both the signals and labels into an `.npz` file named based on the batch index (e.g., `batch-1.npz`, `batch-2.npz`, etc.).
- Clear the `batch` and trim the used labels from `casted_labels` to prepare for the next group.

This continues until all records up to 39,999 have been processed. Signals still sitting in `batch` when the loop ends (fewer than 100) are not written to disk.

In [None]:
while file_group <= 39_999:
    record_name = f"TNMG{file_group}_N1"
    print(f"Processing record: {record_name}")
    try:
        record_signal = wfdb.rdrecord(os.path.join(data_path, record_name)).p_signal

        if record_signal.shape[0] > min_length:
            trimmed_signal = record_signal[:min_length, :]
        else:
            trimmed_signal = record_signal

        batch.append(trimmed_signal)

    except Exception as e:
        print(f"Can't load file: {record_name}, error: {e}\n")
        file_group += 1
        continue

    if len(batch) == 100:
        numpy_array = np.array(batch, dtype=np.float32)
        trimmed_labels = casted_labels[:100]

        # Integer division gives the batch index directly.
        np.savez(os.path.join(output_path, f"{filename}-{file_group // 100}.npz"),
                 signals=numpy_array, labels=trimmed_labels)

        print(f"Batch saved as {filename}-{file_group // 100}.npz\n")

        batch = []
        casted_labels = casted_labels[100:]

    file_group += 1
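
One way to keep signals and labels aligned when some records fail to load is to pick label rows by record index instead of by position. The following is a minimal sketch on synthetic data, not the notebook's method: it assumes that row `i` of the label array belongs to record `i`, and `make_batches`, `trim`, and the `None` placeholder for a failed read are hypothetical names introduced for illustration.

```python
import numpy as np

def trim(signal, length):
    """Keep at most the first `length` time steps of a (time, channels) array."""
    return signal[:length, :]

def make_batches(records, labels, length, batch_size):
    """Yield (signals, labels) batches, pairing each signal with its own label row.

    `records` maps a zero-based row index to a signal array, or to None when the
    record could not be loaded (a stand-in for a failed wfdb read).
    """
    signals, rows = [], []
    for idx, signal in records.items():
        if signal is None:
            continue  # skip the record together with its label row
        signals.append(trim(signal, length))
        rows.append(idx)
        if len(signals) == batch_size:
            yield np.array(signals, dtype=np.float32), labels[rows]
            signals, rows = [], []
    # Like the loop above, a final partial batch is not emitted.

rng = np.random.default_rng(0)
toy_records = {i: rng.normal(size=(5 + i, 8)) for i in range(6)}
toy_records[2] = None  # simulate one unreadable record
toy_labels = np.arange(6, dtype=np.float32).reshape(6, 1)

batches = list(make_batches(toy_records, toy_labels, length=5, batch_size=2))
for batch_signals, batch_labels in batches:
    print(batch_signals.shape, batch_labels.ravel())
```

The second batch carries the labels of records 3 and 4, so the skipped record 2 does not shift the pairing.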

## Creating a PyTorch Dataset

Now that our ECG data has been processed and saved in `.npz` batches, we move on to the next step: preparing a PyTorch-compatible dataset. This will allow us to easily load, shuffle, and feed the data into our neural network model during training.

Before we begin, we import PyTorch and verify that it's installed correctly.

In [4]:
import torch

torch.__version__

'2.5.1+cu124'

We begin by specifying the directory where the `.npz` files are stored, using `input_path`. This path points to the folder containing the processed signal batches.

Next, we initialize two empty lists: `data` and `labels`. These will store all the ECG signal data and corresponding labels, respectively.

We then loop through each file in the `input_path` directory. For each file:
- We construct the full file path using `os.path.join`.
- We load the `.npz` file using `np.load()`, which contains both the signals and labels.
- The `signals` and `labels` arrays are extracted from the loaded file and added to the `data` and `labels` lists.

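Once all the files have been processed, we convert the `data` and `labels` lists into NumPy arrays. Each file contributes a signal array of shape `(100, 2200, 8)`, so `data` ends up with one extra leading axis, `(number_of_files, 100, 2200, 8)`, and `labels` with `(number_of_files, 100, 6)`. The dataset class defined later flattens these axes again.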

Finally, we save the entire dataset (signals and labels) to a single `.pt` file using `torch.save()`. This allows us to quickly reload the dataset during training, ensuring it’s ready for use with PyTorch’s DataLoader.

In [None]:
input_path = "./data"

data = []
labels = []

for i, file in enumerate(os.listdir(input_path)):
    print(f"{i}. processing file: {file}")
    file_path = os.path.join(input_path, file)

    record = np.load(file_path)
    signals = record["signals"]
    label = record["labels"]

    data.append(signals)
    labels.append(label)

data = np.array(data)
labels = np.array(labels)

torch.save({'data': data, 'labels': labels}, 'dataset.pt')
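
The shape that results from collecting the per-file arrays can be verified on toy files. This sketch writes two small `.npz` files to a temporary directory (3 records of 4 time steps and 2 channels each, sizes chosen only for illustration) and compares `np.array`, which adds a per-file axis, with `np.concatenate`, which yields one row per record. The variable names are assumptions, not part of the notebook.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two toy batch files: 3 records each, 4 time steps, 2 channels, 6 labels.
    for k in range(2):
        np.savez(os.path.join(tmp_dir, f"batch-{k + 1}.npz"),
                 signals=np.full((3, 4, 2), k, dtype=np.float32),
                 labels=np.zeros((3, 6), dtype=np.float32))

    signal_parts, label_parts = [], []
    for name in sorted(os.listdir(tmp_dir)):  # sorted for a reproducible order
        with np.load(os.path.join(tmp_dir, name)) as record:
            signal_parts.append(record["signals"])
            label_parts.append(record["labels"])

stacked = np.array(signal_parts)             # extra leading axis: one entry per file
merged = np.concatenate(signal_parts)        # one row per record
merged_labels = np.concatenate(label_parts)
print(stacked.shape, merged.shape, merged_labels.shape)
# (2, 3, 4, 2) (6, 4, 2) (6, 6)
```

Either layout works with a dataset class that reshapes with `-1` in the first dimension; the concatenated form simply makes the number of records visible in `shape[0]`.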

## Dataset Class

With the dataset saved in a `.pt` file and ready for use, the next step is to wrap it into a format that PyTorch can work with during training and evaluation. This is done by creating a custom Dataset class that inherits from `torch.utils.data.Dataset`.

We begin by importing the required base class from PyTorch’s data utilities.

We then define a class called `ECGDataset`, which takes in two arguments during initialization: `data` and `labels`. These are expected to be NumPy arrays (or already-converted tensors) that represent the ECG signal data and their corresponding annotations.

Inside the `__init__` method:
- The input `data` and `labels` are converted into PyTorch tensors with a float32 data type, ensuring compatibility with typical model architectures.
- The signal data is reshaped from its original multi-dimensional structure into a 2D tensor using `.view(-1, 2200 * 8)`. This flattens each ECG signal sample (2200 time steps with 8 channels) into a single vector of 17,600 values, making it suitable as input to fully connected neural networks.
- The labels are reshaped into a 2D tensor with shape `(-1, 6)`, so each sample is associated with 6 values, one binary indicator per label.
- We also store the total number of samples using `self.data.shape[0]`, which will be returned when PyTorch queries the length of the dataset.

The `__len__` method simply returns this number of samples, allowing PyTorch to iterate over the dataset correctly.

The `__getitem__` method retrieves a single data-label pair at a given index. This method is essential for PyTorch’s `DataLoader` to be able to batch and shuffle the data during training.

In [44]:
from torch.utils.data import Dataset

class ECGDataset(Dataset):
    def __init__(self, data, labels):
        self.data = torch.tensor(data, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.float32)

        self.data = self.data.view(-1, 2200 * 8)
        self.labels = self.labels.view(-1, 6)

        self.num_samples = self.data.shape[0]

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        return x, y
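
The effect of the two `.view` calls is easiest to see on a small tensor. The sketch below uses toy sizes (2 files, 3 records, 4 time steps, 2 channels) in place of the real `(files, 100, 2200, 8)` layout; the names are illustrative only.

```python
import torch

# Toy stand-in for the saved arrays: 2 files x 3 records, 4 time steps, 2 channels.
toy_data = torch.arange(2 * 3 * 4 * 2, dtype=torch.float32).reshape(2, 3, 4, 2)
toy_labels = torch.zeros(2, 3, 6)

flat_data = toy_data.view(-1, 4 * 2)   # (6, 8): one row per record
flat_labels = toy_labels.view(-1, 6)   # (6, 6): one label row per record

print(flat_data.shape, flat_labels.shape)
# Each row holds one record, time step by time step, with the channels interleaved.
print(flat_data[0])
```

Because `view` only reinterprets the existing memory layout, row `k` of the flattened tensor is record `k` in file order, and the flattened labels keep the same order.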

## Model Class

With the data and dataset structure in place, the next step is to define a neural network model suitable for multi-label classification of ECG signals. In this case, we implement a simple feedforward neural network using PyTorch’s `nn.Module` base class.

We begin by importing the necessary neural network components from PyTorch.

Next, we define a class called `MultiLabelClassifier`, which inherits from `nn.Module`. This model is designed to handle multi-label classification, where each input sample may belong to multiple categories simultaneously.

Inside the `__init__` method, we define the architecture of the network:
- **Input Layer to Hidden Layer:** The first fully connected layer (`fc1`) transforms the input features (flattened ECG signal vectors) to a specified number of hidden units (`hidden_dim`).
- **Activation Function:** A ReLU activation (`self.relu`) is applied to introduce non-linearity, helping the model learn complex patterns in the data.
- **Hidden Layer to Output Layer:** The second fully connected layer (`fc2`) maps the hidden representation to the desired output dimension (`output_dim`), which corresponds to the number of labels per sample.
- **Output Activation:** A Sigmoid function is used at the end, ensuring each output value lies between 0 and 1. This is suitable for multi-label classification, as each output can be interpreted as the independent probability of a label being present.

The `forward` method defines the forward pass of the model, that is, how data flows through the layers during inference and training.

In [45]:
import torch.nn as nn

class MultiLabelClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(MultiLabelClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x
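
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below uses the dimensions this notebook sets in the training section (`2200 * 8` inputs, 128 hidden units, 6 outputs) and relies on the fact that `nn.Linear` has a bias term by default; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

input_dim, hidden_dim, output_dim = 2200 * 8, 128, 6

fc1_params = linear_params(input_dim, hidden_dim)
fc2_params = linear_params(hidden_dim, output_dim)
total_params = fc1_params + fc2_params
print(fc1_params, fc2_params, total_params)  # 2252928 774 2253702
```

Almost all of the roughly 2.25 million parameters sit in the first layer, a direct consequence of flattening each recording into a 17,600-dimensional vector.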

## Model Training

Before initializing the model or beginning training, we need to determine where the computations will take place: on a GPU (if available) or on the CPU.

PyTorch provides a convenient way to check for GPU availability through `torch.cuda.is_available()`. Based on this check:
- If a CUDA-compatible GPU is available, we use `cuda` as the device, which enables faster training.
- Otherwise, we fall back to using the CPU.

In [46]:
device = "cuda" if torch.cuda.is_available() else "cpu"

With the device selected, we now set up the model architecture, the loss function, and the optimizer that will be used during training.

We begin by defining the dimensions:
- `input_dim = 2200 * 8`: Each ECG sample has 8 channels (leads), and each is trimmed to 2200 time steps, so we flatten each sample into a vector of size 17,600.
- `hidden_dim = 128`: This is the size of the hidden layer in the model. You can tune this value depending on model complexity.
- `output_dim = 6`: There are six output classes, each representing a different ECG abnormality label. Since this is a multi-label classification problem, the model predicts six independent probabilities per sample.

Next, we initialize the model using our custom `MultiLabelClassifier` class and move it to the appropriate device (CPU or GPU) using `.to(device)`.

We define the loss function using **binary cross-entropy** (`BCELoss`), which is suitable for multi-label classification tasks where each label is independent and binary.

Finally, we set up the **Adam optimizer**, which adapts the step size of each parameter using running estimates of the first and second moments of its gradient. We pass in the model’s parameters and set a learning rate of 0.001.

In [47]:
import torch.optim as optim

input_dim = 2200 * 8
hidden_dim = 128
output_dim = 6

model = MultiLabelClassifier(input_dim, hidden_dim, output_dim).to(device)

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
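
For intuition about what the loss measures, binary cross-entropy can be written out in a few lines of NumPy: for each label it is `-(y * log(p) + (1 - y) * log(1 - p))`, averaged over all labels and samples. This is a sketch of the formula, not PyTorch's implementation; the `eps` clipping is an assumption made here to avoid `log(0)`, and the probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over all elements of `probs` and `targets`."""
    probs = np.clip(probs, eps, 1.0 - eps)  # keep log() finite
    losses = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return losses.mean()

# Two toy samples with three labels each.
probs = np.array([[0.9, 0.1, 0.2], [0.3, 0.8, 0.05]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = binary_cross_entropy(probs, targets)
print(f"{loss:.4f}")

# A confident wrong prediction costs far more than an uncertain one.
confident_miss = binary_cross_entropy(np.array([0.01]), np.array([1.0]))
uncertain_miss = binary_cross_entropy(np.array([0.40]), np.array([1.0]))
print(confident_miss > uncertain_miss)  # True
```

Each of the six outputs is scored independently, which is why this loss suits the multi-label setting.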

With our model ready, we now load the saved dataset and prepare it for iteration using PyTorch’s `DataLoader`, which handles batching and shuffling of the data.

First, we load the entire dataset from the previously saved `dataset.pt` file using `torch.load`. This loads both the signal data and their associated labels.

We then pass the loaded arrays to our previously defined `ECGDataset` class, which wraps the data in a format that PyTorch can use.

Finally, we create a `DataLoader`, which enables efficient batch processing during training:
- `batch_size=32`: Each training step will operate on 32 ECG samples.
- `shuffle=True`: Randomly shuffles the dataset at the beginning of each epoch, helping the model generalize better.

In [48]:
from torch.utils.data import DataLoader

dataset_load = torch.load("dataset.pt", weights_only=False)
dataset = ECGDataset(dataset_load["data"], dataset_load["labels"])

data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
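
Conceptually, shuffling and batching amount to permuting the sample indices and cutting them into chunks. The sketch below illustrates this with NumPy on 100 toy indices; it is an illustration of the idea, not how `DataLoader` is implemented, and `batch_indices` is a hypothetical helper.

```python
import numpy as np

def batch_indices(num_samples, batch_size, rng):
    """Shuffle the indices 0..num_samples-1 and split them into consecutive batches."""
    order = rng.permutation(num_samples)
    return [order[start:start + batch_size]
            for start in range(0, num_samples, batch_size)]

rng = np.random.default_rng(seed=0)
batches = batch_indices(num_samples=100, batch_size=32, rng=rng)
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

The last batch is smaller when the dataset size is not a multiple of the batch size; `DataLoader` behaves the same way unless `drop_last=True` is passed.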


With our model, dataset, and training utilities ready, we now move into the training and evaluation phase. This is the most critical part of the notebook, where the model learns from the data and improves its predictions over time.

We begin by setting the number of training epochs to **100**. Each epoch consists of two phases: training and evaluation.

**Training Phase**

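At the start of each epoch, we switch the model to training mode using `model.train()`. This ensures that layers such as dropout or batch normalization behave appropriately during training. Our model contains neither, so the call has no effect here, but it keeps the loop correct if such layers are added later.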

We then initialize counters to track the total training loss and the number of correct predictions for calculating accuracy.

Next, we iterate through each batch from the `data_loader`. For each batch:
- We move the inputs and labels to the appropriate device (CPU or GPU).
- We convert the labels to `float` type to match the expected input type for the binary cross-entropy loss.
- We clear the optimizer gradients from the previous batch.
- We pass the inputs through the model to obtain predictions.
- We compute the loss using `BCELoss`, which is suitable for multi-label classification problems.
- We backpropagate the loss using `.backward()` and update the model parameters with `.step()`.

After updating the model, we calculate the training accuracy for the batch:
- We threshold the outputs at 0.5 to produce binary predictions.
- We count the number of correct predictions compared to the ground truth labels.
- We accumulate the loss and correct predictions for calculating average metrics later.

Once all batches are processed, we calculate:
- **Average training loss** by dividing the total loss by the number of batches.
- **Training accuracy** by dividing the total number of correct predictions by the total number of elements.

**Evaluation Phase**

After training, we switch the model to evaluation mode using `model.eval()`. This disables any training-specific behavior like dropout layers.

We also wrap the evaluation logic inside a `torch.inference_mode()` context, which improves performance and reduces memory usage by disabling gradient computation.

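The evaluation loop is similar to the training loop, but without any gradient updates:
- We iterate over the same `data_loader` that was used for training.
- For each batch, we perform a forward pass and compute the loss.
- Predictions are thresholded at 0.5, and accuracy is calculated in the same way as during training.

After all batches have been evaluated, we calculate:
- **Average test loss**
- **Test accuracy**

Because no separate hold-out set is used, these "test" metrics describe how well the model fits the training data after the epoch's updates. They do not measure generalization to unseen recordings.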

In [49]:
epochs = 100
for epoch in range(epochs):
    model.train()
    total_train_loss = 0
    train_correct = 0
    train_total = 0

    for inputs, labels in data_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        labels = labels.float()

        optimizer.zero_grad()
        outputs = model(inputs)

        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()

        predicted = (outputs > 0.5).float()
        train_correct += (predicted == labels).sum().item()
        train_total += labels.numel()

    avg_train_loss = total_train_loss / len(data_loader)
    train_accuracy = train_correct / train_total


    model.eval()
    total_test_loss = 0
    test_correct = 0
    test_total = 0

    with torch.inference_mode():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            labels = labels.float()

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            total_test_loss += loss.item()

            predicted = (outputs > 0.5).float()
            test_correct += (predicted == labels).sum().item()
            test_total += labels.numel()

    avg_test_loss = total_test_loss / len(data_loader)
    test_accuracy = test_correct / test_total

    print(f"Epoch {epoch+1}/{epochs} | Train Loss: {avg_train_loss:.4f}, Train Acc: {train_accuracy:.4f} | Test Loss: {avg_test_loss:.4f}, Test Acc: {test_accuracy:.4f}")

Epoch 1/100 | Train Loss: 0.3363, Train Acc: 0.9710 | Test Loss: 0.2585, Test Acc: 0.9767
Epoch 2/100 | Train Loss: 0.2820, Train Acc: 0.9765 | Test Loss: 1.2255, Test Acc: 0.9612
Epoch 3/100 | Train Loss: 0.2638, Train Acc: 0.9763 | Test Loss: 0.2426, Test Acc: 0.9778
Epoch 4/100 | Train Loss: 0.2510, Train Acc: 0.9768 | Test Loss: 0.2279, Test Acc: 0.9780
Epoch 5/100 | Train Loss: 0.2510, Train Acc: 0.9770 | Test Loss: 0.2321, Test Acc: 0.9788
Epoch 6/100 | Train Loss: 0.2636, Train Acc: 0.9774 | Test Loss: 0.2591, Test Acc: 0.9791
Epoch 7/100 | Train Loss: 0.2885, Train Acc: 0.9778 | Test Loss: 0.2713, Test Acc: 0.9797
Epoch 8/100 | Train Loss: 0.2775, Train Acc: 0.9785 | Test Loss: 0.2980, Test Acc: 0.9809
Epoch 9/100 | Train Loss: 0.2859, Train Acc: 0.9797 | Test Loss: 0.2692, Test Acc: 0.9821
Epoch 10/100 | Train Loss: 0.2854, Train Acc: 0.9807 | Test Loss: 0.2712, Test Acc: 0.9833
Epoch 11/100 | Train Loss: 0.2974, Train Acc: 0.9814 | Test Loss: 0.2908, Test Acc: 0.9838
Epoch 12
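
To obtain metrics on data the model has not seen, the dataset can be divided before training. The following is a minimal sketch using `torch.utils.data.random_split` on a synthetic `TensorDataset`; the 80/20 ratio, the seed, and all names are assumptions for illustration, and the same call would apply to an `ECGDataset` instance.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in: 50 samples with 8 features and 6 binary labels each.
toy_dataset = TensorDataset(torch.randn(50, 8),
                            torch.randint(0, 2, (50, 6)).float())

n_val = len(toy_dataset) // 5          # hold out 20% for validation
n_train = len(toy_dataset) - n_val
generator = torch.Generator().manual_seed(0)  # fixed seed for a reproducible split
train_set, val_set = random_split(toy_dataset, [n_train, n_val], generator=generator)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
print(len(train_set), len(val_set))  # 40 10
```

Training would then iterate over `train_loader` and the evaluation phase over `val_loader`, so the second set of numbers in each log line reflects unseen samples.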

## Save Model

In [50]:
torch.save(model.state_dict(), "ecg_model.pth")

## Test Model Prediction Accuracy

Once the model has been trained and evaluated across several epochs, we may want to inspect how it performs on individual samples. This can help us understand the model’s confidence and how it interprets specific ECG signal inputs.

We define a helper function `print_pred()` to test the model’s predictions for a single sample. This function takes two required arguments and one optional argument:
- `vals`: The input ECG signal vector (one sample).
- `ans`: The ground truth label(s) for that sample, used for comparison.
- `net`: The model to evaluate. It defaults to the global `model`, so the reloaded `test_model` has to be passed explicitly to check the saved weights.

In [60]:
def print_pred(vals, ans, net=None):
    # Fall back to the trained global model when no model is passed in.
    net = model if net is None else net
    net.eval()
    with torch.inference_mode():
        vals = vals.clone().detach().float().to(device)

        # Add a batch dimension: (17600,) -> (1, 17600)
        vals = vals.view(1, -1)

        pred = net(vals)
        probabilities = pred.cpu().numpy()
        predicted_labels = (pred > 0.5).int().cpu().numpy()

        print(f"\nProbabilities: {probabilities}, \nPrediction: {predicted_labels}, \nAnswer: {ans} \n")

In [62]:
# Rebuild the architecture and load the saved weights into it.
test_model = MultiLabelClassifier(input_dim, hidden_dim, output_dim).to(device)
test_model.load_state_dict(torch.load("ecg_model.pth", weights_only=False, map_location=device))

sample_0 = dataset.data[0]
sample_1 = dataset.data[1]
sample_2 = dataset.data[2]
sample_3 = dataset.data[3]
sample_4 = dataset.data[4]

label_0 = dataset.labels[0].cpu().numpy()
label_1 = dataset.labels[1].cpu().numpy()
label_2 = dataset.labels[2].cpu().numpy()
label_3 = dataset.labels[3].cpu().numpy()
label_4 = dataset.labels[4].cpu().numpy()


# Pass the reloaded model so that the saved weights are the ones being tested.
print_pred(sample_0, label_0, net=test_model)
print_pred(sample_1, label_1, net=test_model)
print_pred(sample_2, label_2, net=test_model)
print_pred(sample_3, label_3, net=test_model)
print_pred(sample_4, label_4, net=test_model)


Probabilities: [[6.8153909e-08 2.4420189e-07 5.6282011e-17 1.3305446e-13 8.2911463e-08
  1.9636718e-07]], 
Prediction: [[0 0 0 0 0 0]], 
Answer: [0. 0. 0. 0. 0. 0.] 


Probabilities: [[3.2594273e-25 5.2787104e-21 6.1008863e-14 1.0344354e-22 1.0000000e+00
  9.3258457e-12]], 
Prediction: [[0 0 0 0 1 0]], 
Answer: [0. 0. 0. 0. 1. 0.] 


Probabilities: [[0. 0. 0. 0. 0. 0.]], 
Prediction: [[0 0 0 0 0 0]], 
Answer: [1. 0. 0. 0. 0. 0.] 


Probabilities: [[0.0000000e+00 3.2699278e-24 1.0000000e+00 1.2412947e-32 0.0000000e+00
  1.1415087e-38]], 
Prediction: [[0 0 1 0 0 0]], 
Answer: [0. 0. 1. 0. 0. 0.] 


Probabilities: [[0. 0. 0. 0. 0. 0.]], 
Prediction: [[0 0 0 0 0 0]], 
Answer: [0. 0. 0. 0. 0. 0.] 



## Conclusion

The following table presents the model's predictions on the first five samples of the dataset, which were also part of the training data. For each sample, we compare the predicted labels to the true labels:

| Sample | Predicted Labels         | True Labels             | Notes        |
|--------|--------------------------|-------------------------|--------------|
| #0     | [0, 0, 0, 0, 0, 0]       | [0, 0, 0, 0, 0, 0]      | ✅ Correct   |
| #1     | [0, 0, 0, 0, 1, 0]       | [0, 0, 0, 0, 1, 0]      | ✅ Correct   |
| #2     | [0, 0, 0, 0, 0, 0]       | [1, 0, 0, 0, 0, 0]      | ❌ Missed    |
| #3     | [0, 0, 1, 0, 0, 0]       | [0, 0, 1, 0, 0, 0]      | ✅ Correct   |
| #4     | [0, 0, 0, 0, 0, 0]       | [0, 0, 0, 0, 0, 0]      | ✅ Correct   |

**Analysis of Results**

The model predicted all six labels correctly for **four of the five samples**: the two all-zero samples (#0 and #4) and the two samples with a single positive label (#1 and #3).

**Sample #2** exhibited a **missed label** (false negative): the model predicted all labels as zero although the true label marks the first class as positive. The printed probabilities for this sample are all 0, so the miss is a confident one rather than a near-threshold case. This could indicate that the model struggles with subtle or rare signal features.

The **overall performance** suggests that the model can pick up the most prominent features in the ECG signals, but five samples taken from the training data are too few to judge it. The model may require further refinement or additional training data to handle more complex or nuanced cases.
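
The five rows of the table are enough to show how different metrics judge the same predictions. The sketch below recomputes element-wise accuracy (the measure used in the training loop), exact-match accuracy, and recall from the predicted and true labels listed above; the variable names are illustrative.

```python
import numpy as np

# Predictions and ground truth copied from the five samples in the table.
predicted = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])
actual = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

element_accuracy = (predicted == actual).mean()           # 29 of 30 label slots
exact_match = (predicted == actual).all(axis=1).mean()    # 4 of 5 samples
true_positives = ((predicted == 1) & (actual == 1)).sum()
recall = true_positives / (actual == 1).sum()             # 2 of 3 positive labels

print(f"{element_accuracy:.3f} {exact_match:.3f} {recall:.3f}")  # 0.967 0.800 0.667
```

Element-wise accuracy stays high because most label slots are zero, while recall exposes the missed positive directly.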

**Conclusion and Future Work**

The multi-label classifier reached an element-wise accuracy of about 98% on its training data within the first 11 epochs and classified four of the five inspected samples correctly. However, there is room for improvement, especially in minimizing false negatives, and the model still has to be evaluated on held-out data.

Next steps include:
- **Hyperparameter tuning** to optimize the model for better generalization.
- **Dataset expansion** or augmentation to expose the model to a more diverse set of ECG patterns.
- **Model refinement** using more complex architectures, such as convolutional neural networks (CNNs), which are particularly effective for time-series data like ECG signals.
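
As a starting point for the CNN idea, the following is a small 1D convolutional network sketch. It is an untrained illustration, not a tuned architecture: the class name `TinyECGConvNet`, the layer sizes, and the kernel widths are all assumptions. Unlike the fully connected model, it expects unflattened input of shape `(batch, 2200, 8)`.

```python
import torch
import torch.nn as nn

class TinyECGConvNet(nn.Module):
    """Hypothetical 1D CNN: two conv layers, global average pooling, linear head."""

    def __init__(self, n_channels=8, n_labels=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over the time axis
        )
        self.head = nn.Linear(32, n_labels)

    def forward(self, x):
        # x: (batch, time, channels); Conv1d expects (batch, channels, time).
        x = x.permute(0, 2, 1)
        x = self.features(x).squeeze(-1)  # (batch, 32)
        return torch.sigmoid(self.head(x))

net = TinyECGConvNet()
out = net(torch.randn(4, 2200, 8))
print(out.shape)  # torch.Size([4, 6])
```

Because the convolution kernels are shared along the time axis, this sketch has far fewer parameters than the first fully connected layer of the current model.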