# PrivECG: Anonymizing ECG Data for Privacy-Preserving Cardiovascular Diagnosis

## GitHub Link: [https://github.com/DHUIUC/PrivECG](https://github.com/DHUIUC/PrivECG)

## Authors
- Original Authors: Alexis Nolin-Lapalme, Robert Avram, Hussin Julie
- Contributor: Dillon Harding (Improvements, Ablations)

# Introduction:
## Background
In the original paper, PrivECG addresses the privacy risks of using electrocardiogram (ECG) data to diagnose cardiovascular disease. ECG data is invaluable for training machine learning models, but it can also reveal private attributes such as age and sex. With state-of-the-art models, even patient re-identification is possible. PrivECG is a proposed solution: it anonymizes ECG data while retaining its diagnostic utility.

The approach used in PrivECG relies on a Generative Adversarial Network (GAN) to anonymize patient data. The model aims to reduce the accuracy of predicting a patient's sex while preserving the features relevant to disease diagnosis. The GAN pairs a generator with a discriminator: the discriminator attempts to determine the patient's sex from the ECG data, whereas the generator learns to remove the features that would otherwise help determine it. Ideally, this drives the sex-prediction accuracy down to chance level (~50%).

The original paper reports that a variant of PrivECG, "PrivECG-lambda", limited sex-prediction accuracy to 0.529 ± 0.014 while retaining the ability to diagnose from the transformed data: disease classification reached an AUPR of 0.96 ± 0.006. This result is meaningful for data privacy in machine learning, as models can use PrivECG to reduce the predictability of identifiable traits while keeping their target prediction variable viable.


# Scope of Reproducibility:

In recreating this project, I will be attempting to test two hypotheses. The paper claims that it can “anonymize a large ECG database in minutes”, which leads me to believe that this paper can be recreated with any ECG dataset. 

- Hypothesis 1: I will be testing whether results similar to those in the paper can be obtained using an ECG dataset other than the one used in the paper.
- Hypothesis 2: This hypothesis comes from the authors’ GitHub, in which they state a future improvement could be to “try to induce privacy by simply varying rhythm, thus requiring less transformation and allowing better readability”.

I would like to see if I can make this improvement to their codebase, and aim to do so by modifying the GAN to discriminate and generate heartbeat rhythm in conjunction with, or in place of, sex-defining features. Transfer learning (see reference 5) could be beneficial in this attempt.

Given the low complexity of the model as well as the manageable size of data, I believe replication of PrivECG should be possible on my personal hardware. Additionally, an aim of the project is to be efficient on large databases, which leads me to believe it could likely be done on lesser hardware. 

# Methodology

Below is the methodology with which I attempt to recreate the PrivECG model. One main struggle in reproducing PrivECG is installing all of the necessary dependencies. The packages are pinned to specific versions, so it is difficult to install them all and import them under a compatible Python version. Beyond that, a notable issue I ran into was with the CUDA build of the torch package specified in the requirements.txt file:

torch==2.0.1+cu117

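I am currently working on a resolution for this group of dependencies and plan to provide an updated requirements.txt along with documentation of any version constraints needed to replicate PrivECG's functionality. Builds with a `+cu117` local version tag are published on PyTorch's own wheel index rather than on PyPI, which is one likely reason a plain `pip install` fails on this pin. For now, I am simply using the most recent version of each package.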

Another dependency that is not covered in requirements.txt is ecgdetectors, a Python package imported in utils.py. It is installed from PyPI under the name py-ecg-detectors.

In [13]:
# RUN PRIOR TO IMPORT IF MISSING DEPENDENCIES IN PYTHON ENV

# I think that ipykernels might actually have been outpaced by the requirements here...
# I managed to get the libraries to work by running the below command in conda shell:

#!pip install -r requirements.txt

#!pip install py-ecg-detectors
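
Because the install has been fragile, a quick pre-flight check before the import cell can save a confusing traceback. The sketch below is only an illustration: `missing_modules` is a hypothetical helper (not part of PrivECG), and the module list is an assumption based on the imports used in this notebook, with `ecgdetectors` being the import name of the `py-ecg-detectors` package.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed list, taken from the imports used in this notebook.
required = ["numpy", "pandas", "torch", "tqdm", "sklearn", "ecgdetectors"]
print("Missing modules:", missing_modules(required))
```

An empty list means every top-level module can be located; it does not guarantee that the installed versions are compatible with each other.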

In [14]:
## IMPORTS ##

import os
import numpy as np
import random 
import pandas as pd
from statistics import mean
import torch
from torch.utils.data import Dataset, DataLoader 
from tqdm import tqdm
import gc
from torch.autograd import Variable
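import torch.nn as nn  # explicit import; the training cell uses nn.MSELoss and nn.BCEWithLogitsLoss
import torch.optim as optim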
import torch.nn.functional as F
import torch.utils.data
import math
from sklearn.model_selection import train_test_split
from utils import *
from models import *

# Data

One of the hypotheses concerns the ability to replicate PrivECG on separate data. I found 12-lead ECG data that was made public in a separate paper (see reference 3 in the References section). Several pre-processing steps were needed before the data could be used in the GAN model below. The data originally came in shape (N, 5000, 12), which is typical for an ECG, but it was stored as separate Excel files, whereas the model expects downsampled NumPy arrays of shape (N, 500, 12). The first step was therefore to read the data from the Excel files into NumPy arrays, which was very memory intensive. Beyond this, I needed to generate R-wave and non-R-wave data, as this is key to anonymizing the data. This process relies on finding the peaks in the ECG for each lead.

I may need to revise the method by which I generated the R-wave and non-R-wave data. Nevertheless, I was able to generate data that should be close to what is implemented in the paper; see createMasks.py in my repository for the method. After generating the R-wave and non-R-wave data, I downsampled the ECG, R-wave, and non-R-wave arrays to shape (N, 500, 12) to conform to the dimensions expected by PrivECG.
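
To make the mask and downsampling steps concrete, here is a minimal NumPy sketch. It is not the code in createMasks.py or downsample.py, and it is not claimed to match the paper: the function names, the fixed window of `half_width` samples around each peak, and the plain stride-based downsampling (with no anti-aliasing filter) are all assumptions chosen for illustration. The peak indices are assumed to come from a detector such as the one in py-ecg-detectors.

```python
import numpy as np

def build_r_masks(signal_len, peaks, half_width):
    """Return (r_mask, non_r_mask) for one lead as complementary 0/1 arrays."""
    r_mask = np.zeros(signal_len, dtype=np.float32)
    for p in peaks:
        lo = max(0, p - half_width)                 # clip the window at the start
        hi = min(signal_len, p + half_width + 1)    # and at the end of the signal
        r_mask[lo:hi] = 1.0
    return r_mask, 1.0 - r_mask

def downsample_signal(x, factor):
    """Keep every `factor`-th sample along the time axis of an (N, T, leads) array."""
    return x[:, ::factor, :]

def downsample_mask(mask, factor):
    """Downsample an (N, T, leads) 0/1 mask; a window stays 1 if any sample in it was 1.

    Assumes T is divisible by `factor` (5000 / 10 in this dataset).
    """
    n, t, leads = mask.shape
    return mask.reshape(n, t // factor, factor, leads).max(axis=2)

r, non_r = build_r_masks(10, peaks=[2, 8], half_width=1)
print(r, non_r)
print(downsample_signal(np.zeros((2, 5000, 12)), 10).shape)
```

Taking the window maximum for masks, rather than striding, avoids losing a narrow R-wave window that happens to fall between the retained samples.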

Lastly, I sorted Diagnostics.xlsx by filename and retrieved the Gender column to use as the sex label vector for the GAN. All of this Python code is in the repository as createMasks.py, downsample.py, readFromExcel.py, and readDiagnostics.py. The Excel data can be found in the /data/ folder of the repository.
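
One detail worth guarding when encoding these labels: the data-loading cell further down maps every value other than 'MALE' to 1, so a missing or misspelled entry would be silently counted as female. The sketch below shows a stricter alternative. `encode_sex` is a hypothetical helper, and the label value 'FEMALE' is an assumption (only 'MALE' appears explicitly in this notebook).

```python
import numpy as np

def encode_sex(labels):
    """Map 'MALE' -> 0 and 'FEMALE' -> 1, raising on anything else."""
    mapping = {"MALE": 0, "FEMALE": 1}  # 'FEMALE' is an assumed label value
    encoded = []
    for i, raw in enumerate(labels):
        key = str(raw).strip().upper()
        if key not in mapping:
            raise ValueError(f"Unexpected sex label {raw!r} at index {i}")
        encoded.append(mapping[key])
    return np.array(encoded, dtype=np.int64)

print(encode_sex(["MALE", "FEMALE", " male "]))
```

Failing loudly here keeps the male/female counts printed after loading trustworthy.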

In [20]:
## FILE AND DATA READING ##

#FILEPATHS
data_path = 'to_model/ecg_data.npy'
r_mask_path = 'to_model/r_wave_masks.npy'
nr_mask_path = 'to_model/non_r_wave_masks.npy'
sex_labels_path = 'to_model/gender_data.npy'
out_path = 'output_folder'

#FORMERLY ARGUMENTS, NOW CONSTANTS
#cuda_device = 'cuda:0'   CANT SEEM TO GET CUDA WORKING
learning_rate = 1e-7
batch_size_ = 64
beta_1 = 0.5
beta_2 = 0.990
loss_type = 'MSE_R'
multiplicator = 100
mode = 'PrivECG'
save_every = 3
number_epochs = 500
utility_budget = 0.0001
experiment_id = 'exp_1'

#READ FROM FILES, ESTABLISH VARS
# device = cuda_device  --CUDA NOT WORKING
train_X = np.load(data_path)
R_mask = np.load(r_mask_path)
non_R_mask = np.load(nr_mask_path)
y_ = np.load(sex_labels_path, allow_pickle=True)
y_ = np.array([0 if i == 'MALE' else 1 for i in y_])

# Using PrivECG
generator = GeneratorUNet()#.to(device)
disciminator = Discriminator()#.to(device)

print("Count of Male:", np.count_nonzero(y_ == 0))
print("Count of Female", np.count_nonzero(y_ == 1))

Count of Male: 5956
Count of Female 4690
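
Since CUDA is currently disabled by commenting out `cuda_device` and every `.to(device)` call, one alternative is to choose the device at runtime and fall back to the CPU. The sketch below is an assumption about how this could be arranged, not the original repository's approach; `pick_device` is a hypothetical helper, and it also returns "cpu" when torch cannot be found at all, so it stays usable while the environment is being repaired.

```python
import importlib.util

def pick_device(requested="cuda:0"):
    """Return `requested` if PyTorch reports a usable CUDA runtime, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch is not installed in this environment
    import torch
    return requested if torch.cuda.is_available() else "cpu"

device = pick_device()
print("Using device:", device)
```

With a `device` string like this, the commented-out `.to(device)` calls could be restored unchanged, and the same notebook would run on machines with or without a working GPU.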


# Model

The model below is a GAN that stays in line with the original code from the PrivECG repository. Any significant modifications are marked with comments, but the code is largely the same for now. The GAN uses classes and functions from the two helper modules (utils and models) imported under Methodology. The model is meant to run via CUDA on a GPU, but because of configuration issues I have disabled this and run the model on the CPU for now.
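
Read directly from the training loop below (with `loss_type = 'MSE_R'`), the two objectives can be summarized as follows. The notation is mine rather than the paper's: $G$ is the generator, $D$ the discriminator, $x$ a batch of ECGs, $s$ the true sex labels, $M_R$ and $M_{\bar{R}}$ the R-wave and non-R-wave masks, $m$ is `multiplicator` (100), $\beta$ is `utility_budget` (0.0001), and $\tilde{s}$ is the `gen_secret` target, a single value drawn once per batch from a normal distribution with mean 0.5 and standard deviation 0.1 and shared by every sample in the batch. BCE denotes binary cross-entropy applied to logits.

$$
\mathcal{L}_G = \mathrm{MSE}\big(G(x)\odot M_R,\; x\odot M_R\big) + m\,\mathrm{MSE}\big(G(x)\odot M_{\bar{R}},\; x\odot M_{\bar{R}}\big) + \beta\,\mathrm{BCE}\big(D(G(x)),\; \tilde{s}\big)
$$

$$
\mathcal{L}_D = \mathrm{BCE}\big(D(x),\; s\big) + \mathrm{BCE}\big(D(G(x)),\; \tilde{s}\big)
$$

Because $\tilde{s}$ sits near 0.5, the adversarial term rewards the generator when the discriminator's output on generated signals is uninformative. As the loop is written, the second term of $\mathcal{L}_D$ is computed from a detached prediction, so it appears to contribute to the logged loss but not to the discriminator's gradients.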

In [25]:
## LINE 73 ONWARDS FROM run_models.py FROM ORIGINAL PrivECG GitHub ##
d_train = DatasetTrain(train_X,y_,R_mask,non_R_mask)

distortion_loss = nn.MSELoss()
adversarial_loss = nn.BCEWithLogitsLoss()
adversarial_loss_rf = nn.BCEWithLogitsLoss()

train_loader = DataLoader(dataset=d_train, batch_size=int(batch_size_), shuffle=True)

optimiser_gen = torch.optim.Adam(generator.parameters(), lr=learning_rate ,betas = (beta_1, beta_2)) 
optimiser_discr = torch.optim.Adam(disciminator.parameters(), lr=learning_rate, betas = (beta_1, beta_2)) 
num_epochs = number_epochs

list_fake_acc = list()
list_true_acc = list()
list_generator_loss = list()
list_distortion_loss = list()
list_gender_loss = list()
list_genderless_loss = list()
list_true_data_loss = list()
list_dicriminator_loss = list()

os.makedirs(out_path, exist_ok=True)

for epochs in range(0,num_epochs):
    print('Epoch: {}'.format(epochs))
    G_distortion_loss_accum = 0
    G_adversary_loss_accum = 0
    D_real_loss_accum = 0
    D_fake_loss_accum = 0

    generator.train()
    disciminator.train()
    
    acc_true_acc = list()
    acc_fake_acc = list()
    acc_generator_loss = list()
    acc_distortion_loss = list()
    acc_gender_loss = list()
    acc_genderless_loss = list()
    acc_true_data_loss = list()
    acc_dicriminator_loss = list()

    #LongTensor = torch.cuda.LongTensor
    #FloatTensor = torch.cuda.FloatTensor

    for i, (x, gender, r_masked, non_r_masked) in enumerate(tqdm(train_loader)):
        gender = gender.type(torch.float32)
        #gender =  gender.to(device=device)
        #x = x.to(device=device, dtype=torch.float32)

        #non_r_masked =  non_r_masked.to(device,dtype=torch.float32)
        #r_masked =  r_masked.to(device,dtype=torch.float32)

        optimiser_gen.zero_grad()

        #generate the random gender vector
        gen_secret = Variable(torch.LongTensor(np.random.choice([1.0], x.shape[0])))#.to(device)
        gen_secret = gen_secret * np.random.normal(0.5, math.sqrt(0.01))

        gen_secret = gen_secret.type(torch.float32)
        #gen_secret =  gen_secret.to(device=device)

        # ----------- Generator --------------

        #get outputs of the networks
        gen_results = generator(x)

        pred_secret = disciminator(gen_results)

        # 'MSE','MSE_R'
        if loss_type == 'MSE':
            generator_distortion_loss = distortion_loss(gen_results, x)#.to(device)

        elif loss_type == 'MSE_R':
            generator_distortion_loss = distortion_loss(gen_results * r_masked, x * r_masked) + distortion_loss(gen_results * non_r_masked, x * non_r_masked)* multiplicator

        
        G_distortion_loss_accum += generator_distortion_loss.item()

        generator_adversary_loss = adversarial_loss(pred_secret, gen_secret)#.to(device)

        G_adversary_loss_accum += generator_adversary_loss.item()

        generator_loss = generator_distortion_loss + generator_adversary_loss * utility_budget

        acc_distortion_loss.append(generator_distortion_loss.item())
        acc_gender_loss.append(generator_adversary_loss.item())
        acc_generator_loss.append(generator_loss.item())

        generator_loss.backward()
        optimiser_gen.step()

        # ----------- Discriminator --------------

        optimiser_discr.zero_grad()

        real_pred_secret = disciminator(x)
        fake_pred_secret = pred_secret.detach()

        acc_true_acc.append(np.mean(torch.sigmoid(real_pred_secret).reshape(-1).clone().detach().cpu().numpy().round() == gender.detach().cpu().numpy()))
        acc_fake_acc.append(np.mean(torch.sigmoid(pred_secret).reshape(-1).clone().detach().cpu().numpy().round() == gender.detach().cpu().numpy()))

        D_real_loss = adversarial_loss_rf(real_pred_secret, gender)#.to(device)
        D_genderless_loss = adversarial_loss_rf(fake_pred_secret, gen_secret)#.to(device)

        D_real_loss_accum += D_real_loss.item()
        D_fake_loss_accum += D_genderless_loss.item()

        discriminator_loss = D_real_loss + D_genderless_loss 

        acc_genderless_loss.append(D_genderless_loss.item())
        acc_true_data_loss.append(D_real_loss.item())
        acc_dicriminator_loss.append(discriminator_loss.item())

        discriminator_loss.backward()
        optimiser_discr.step()

    list_fake_acc.append(mean(acc_fake_acc))
    list_true_acc.append(mean(acc_true_acc))
    list_generator_loss.append(mean(acc_generator_loss))
    list_distortion_loss.append(mean(acc_distortion_loss))
    list_gender_loss.append(mean(acc_gender_loss))
    list_genderless_loss.append(mean(acc_genderless_loss))
    list_true_data_loss.append(mean(acc_true_data_loss))
    list_dicriminator_loss.append(mean(acc_dicriminator_loss))

    print("==============================")
    print("epoch {}".format(epochs))

    print("list_fake_acc {}".format(mean(acc_fake_acc)))
    print("list_true_acc {}".format(mean(acc_true_acc)))
    print("diff acc {}".format(mean(acc_true_acc) - mean(acc_fake_acc)))

    print("list_generator_loss {}".format(mean(acc_generator_loss)))
    print("list_distortion_loss {}".format(mean(acc_distortion_loss)))
    print("list_gender_loss {}".format(mean(acc_gender_loss)))
    print("list_genderless_loss {}".format(mean(acc_genderless_loss)))
    print("list_true_data_loss {}".format(mean(acc_true_data_loss)))
    print("list_dicriminator_loss {}".format(mean(acc_dicriminator_loss)))
    print("==============================")

    if epochs%save_every == 0:
        torch.save(disciminator.state_dict(), os.path.join(out_path,'disciminator_{}_final.pth'.format(epochs)))
        torch.save(generator.state_dict(), os.path.join(out_path,'generator_{}_final.pth'.format(epochs)))


Epoch: 0


  0%|          | 0/167 [00:00<?, ?it/s]


RuntimeError: Given groups=1, weight of size [128, 12, 4], expected input[64, 500, 12] to have 12 channels, but got 500 channels instead

# Results

So far, attempts to run the model on the transformed data have not succeeded. I am currently debugging to determine whether the issue lies within the model itself or whether the data needs further transformation before being sent to the generator. The shape mismatch is logged in the error text above: the generator's first convolution expects 12 input channels, but it receives a batch of shape [64, 500, 12] and therefore reads 500 as the channel count.
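
One candidate fix follows from the error text itself. PyTorch's 1D convolutions read their input as (batch, channels, length), and a weight of size [128, 12, 4] means 12 input channels, so the leads need to sit on axis 1 rather than axis 2. The sketch below shows the axis swap in NumPy. `to_channels_first` is a hypothetical helper, and whether the original pipeline performs this step elsewhere (for example inside `DatasetTrain`) is something I have not confirmed.

```python
import numpy as np

def to_channels_first(arr, n_leads=12):
    """Convert an (N, T, leads) array to (N, leads, T)."""
    if arr.ndim != 3 or arr.shape[2] != n_leads:
        raise ValueError(f"expected shape (N, T, {n_leads}), got {arr.shape}")
    # Copy into contiguous memory so later tensor conversion sees a plain layout.
    return np.ascontiguousarray(np.transpose(arr, (0, 2, 1)))

batch = np.zeros((64, 500, 12), dtype=np.float32)
print(to_channels_first(batch).shape)  # (64, 12, 500)
```

If this is the right fix, the same transformation would have to be applied to `R_mask` and `non_R_mask`, since the loss multiplies them element-wise with the generator output.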

In [27]:
## ACCURACY METRICS ??? ANALYSIS FOLDER ##

# Implement once working


# Analysis

The PrivECG repository contains a folder named "analysis" with various Python modules for measuring and analyzing the model. Once the model works properly and I can run it for all epochs, I plan to import and use these modules so that my metrics match those used in the original paper.
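
Until those modules are wired in, the headline privacy number (sex-prediction accuracy) can be approximated with a few lines of NumPy. This sketch mirrors the per-batch accuracy computed in the training loop, where applying a sigmoid and rounding is equivalent to thresholding the logit at 0; it does not use or reproduce the repository's analysis code. Both function names are hypothetical, and treating the paper's "±" as a standard deviation across runs is an assumption.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of samples whose thresholded logit (> 0 means class 1) matches the label."""
    preds = (np.asarray(logits).reshape(-1) > 0).astype(int)
    return float(np.mean(preds == np.asarray(labels).reshape(-1)))

def mean_and_std(values):
    """Summarize repeated measurements (e.g. one accuracy per run) as (mean, std)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std())

acc = accuracy_from_logits([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0])
print(acc)
```

An accuracy close to 0.5 on generated signals, together with a clearly higher accuracy on the raw signals, would be the pattern to look for when comparing against the paper.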

In [26]:
## PUT GRAPHS HERE? ##

# from analyse_output import *
# from analyse_output_utils import *
# from evaluate_privacy import *
# from predict_diseases import *
# from resnet1d import *
# from siamese_train import *


# Plans

While the progress made so far may not seem significant, the project is not far from producing a definitive result. Further debugging of the model, and determining which changes can be made without hurting performance, will establish whether or not the paper is replicable. So far, the greatest achievement has been formatting the data so that it fits the parameterization requested by the PrivECG model despite coming from a completely separate source. Going forward, the plan is:

- Debug the GAN model to determine its usability with the dataset. With some further exploration, it should be possible to make it work.
- Implement analysis methods used in the original PrivECG paper to measure the results with the new dataset.
- Display data in a graphical format to compare to the original figures from the PrivECG paper.
- Perform further ablations, determining if it is possible to induce privacy by varying the rhythm pattern. We do have rhythm data provided from the dataset utilized.

# References

1.  Nolin-Lapalme, A., Avram, R. & Julie, H. (2023). PrivECG: generating private ECG for end-to-end anonymization. Proceedings of the 8th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 219:509-528. Available from https://proceedings.mlr.press/v219/nolin-lapalme23a.html

2.  Nolin-Lapalme, A., et al. (2023). privECG. GitHub. [https://github.com/anolinlapalme/privECG](https://github.com/anolinlapalme/privECG)

3.  Zheng, Jianwei; Rakovski, Cyril; Danioko, Sidy; Zhang, Jianming; Yao, Hai; Hangyuan, Guo (2019). A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.4560497

4.  Zheng, J., Zhang, J., Danioko, S. et al. (2020). A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. Sci Data 7, 48. https://doi.org/10.1038/s41597-020-0386-x

5.  Weimann, K., Conrad, T.O.F. (2021). Transfer learning for ECG classification. Sci Rep 11, 5251. https://doi.org/10.1038/s41598-021-84374-8

