<a href="https://colab.research.google.com/github/Praaathaamesh/DL-API-Management/blob/main/Config/CDSS_DataPipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CDSS Data Pipeline**
****

_Author: Prathamesh Pradeep Jadhav_

## Intended Purpose

- Part of a broader project to build a clinical decision support system (CDSS) for diagnosing ALVF caused by CM, which can lead to untimely or unforeseen cardiac arrest.
- Streamlines the data integration and pre-processing steps for model training on the PTB-XL dataset.
- Further additions are yet to be made.
- This file will go through multiple revisions over the course of the project.
- Decisions will be based on the final data product architecture.

## Usage Terms

- The notebook is used on a temporary basis to analyse efficiency. It may later be turned into a suitable package/module or even standalone software.
- Intended for academic use only.

## Considerations

- Managed under the repository linked below and released under the MIT license.
- Implementation is yet to be completed.

## External Links

- [CDSS-DL API Management Repository](https://github.com/Praaathaamesh/DL-API-Management)


## Import Important Packages and Mount the Drive

In [1]:
import pickle
import os
import sys
import ast
import math
import numpy as np
import pandas as pd
import tensorflow as tf
from google.colab import drive

In [2]:
# ATTENTION: RUN THIS BLOCK TO MOUNT THE GOOGLE DRIVE STORAGE WITHOUT DIRECT AUTHORISATION (EVADES PESKY AUTHORISATION MAIL SPAM)

# Mount the drive (Auto authorisation)
# from google.colab import auth
# auth.authenticate_user()
# from google.auth import default
# creds, _ = default()
# drive.mount('/content/drive')

In [3]:
#!pip install h5py
#!pip install typing-extensions
#!pip install wheel
#!pip install wfdb


# Footnote: install if not already installed in the runtime. Might need to re-run for newer runtimes.

In [4]:
# !unzip "/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/ptb-xl.zip" -d "/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/"

In [5]:
if os.path.exists("/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/ptb-xl"):
  print("Following directory exists without any exceptions!")

Following directory exists without any exceptions!


## Check Case: Check if runtime has GPU connected

In [6]:
# Check Case 1: Get the name of the available GPU device - returns an empty string if no GPU is detected!
GPUDevice = tf.test.gpu_device_name()
print(GPUDevice)

# Footnote: This runtime doesn't use the GPU. Switch to T4 GPU runtime for more flexibility.




In [7]:
# Check Case 2: See if the TensorFlow build in this session supports GPUs
built_with_gpu = tf.test.is_built_with_gpu_support()
print(built_with_gpu)
if not built_with_gpu:
  raise SystemError("No GPU support for the given runtime!")

# Listing the physical devices
print(f"Total list of devices:\n{tf.config.list_physical_devices()}")
print(f"\nListing the first device type:{tf.config.list_physical_devices()[0][1]} ")

True
Total list of devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

Listing the first device type:CPU 


## Load the Data

### _Loading Signal Data using WFDB (Uncomment it to continue without pickling)_

In [8]:
# Define database path
# Path_to_dataset = "/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/ptb-xl/1.0.3/"

In [9]:
# Loading metadata of database
# Y = pd.read_csv(Path_to_dataset+"ptbxl_database.csv", index_col="ecg_id")

In [10]:
# Check Case 3: visualise metadata and data integrity
# Y.columns  # To see the columns
# Y.iloc[1]  # To see the individual entries


In [11]:
# Parse the scp_codes strings into Python dictionaries
# Y.scp_codes = Y.scp_codes.apply(lambda x: ast.literal_eval(x))
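
To illustrate the parsing step above: in the CSV, each `scp_codes` cell is stored as the text of a dictionary, and `ast.literal_eval` turns that text into a real `dict`. The sample string below is a made-up stand-in for one cell, not a value read from the dataset.

```python
import ast

# Hypothetical raw cell value as it might appear in the CSV (illustrative only)
raw_scp_codes = "{'NORM': 100.0, 'LVOLT': 0.0, 'SR': 0.0}"

# literal_eval only accepts Python literals, which makes it safer than eval()
parsed_scp_codes = ast.literal_eval(raw_scp_codes)

print(type(parsed_scp_codes).__name__)
print(sorted(parsed_scp_codes))
```

After parsing, membership tests such as `"LVOLT" in parsed_scp_codes` work directly on the statement keys.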

In [12]:
# Check Case 4: see individual signal data records (data file and header file)
# import wfdb  # required by the WFDB-based cells below (see the install cell above)
#try:
#  Dat, Head = wfdb.rdsamp(Path_to_dataset+Y.iloc[0].filename_lr) # data of the first entry
#  print(Dat.shape)
#  del Dat; del Head # No longer needed
#except Exception:
#  print("Serious error! Data appears corrupted! Please investigate further!")

In [13]:
# Define function to load raw data
if False:
  '''
  def loadRaw(df, SampRate, Path):
    # Use the filename column of the metadata to locate and read each signal record
    if SampRate == 100:
      Data = [wfdb.rdsamp(Path+file) for file in df.filename_lr]
    else:
      Data = [wfdb.rdsamp(Path+file) for file in df.filename_hr]
    # Keep the signal arrays and discard the per-record header metadata
    Data = np.array([sig for sig, meta in Data ])
    return Data

  # Define the sampling rate
  Sampling_Rate = 100
  '''

In [14]:
# Loading the actual data
if False:
  '''
  try:
    X = loadRaw(Y, Sampling_Rate, Path_to_dataset)
  except Exception as err:
    raise ImportError("No leads found. Please assess the previous steps") from err
  '''

In [15]:
# Saving the data object as pkl file for greater usability

if False:
  '''
  # Here we have created data object X (took 5 hours to load)
  with open("/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/X.pkl", "wb") as f:
    pickle.dump(X, f)
  '''

In [16]:
# Load the pickled signal array back into memory

with open("/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/X.pkl", "rb") as f:
  X = pickle.load(f)

In [17]:
X.shape

(21799, 1000, 12)
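
Because building `X` from the raw records is slow, the dump and load cells above effectively act as a cache. The following is a minimal sketch of that pattern, assuming a hypothetical helper `load_or_build` and a caller-supplied `build_fn`; neither is part of this notebook.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build_fn):
    """Return the pickled object at cache_path, building and saving it first if missing."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# Demo with a throwaway file; the notebook itself would point at X.pkl on Drive
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
first = load_or_build(demo_path, lambda: [1, 2, 3])  # builds and writes the cache
second = load_or_build(demo_path, lambda: [])        # reads the cache; build_fn is not called
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.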

### _Load Metadata (Uncomment to load without pickling)_

In [18]:
# Adding diagnostic superclass to each entry
# AggData = pd.read_csv(Path_to_dataset+"scp_statements.csv", index_col=0)
# AggData.columns

In [19]:
# Keep only the statements that are flagged as diagnostic
# AggData = AggData[AggData.diagnostic == 1]

In [20]:
# def AggDataFunc(ScpDict):
#   TempList = []
#   for key in ScpDict.keys():
#     if key in AggData.index:
#       TempList.append(AggData.loc[key].diagnostic_class)
#   return list(set(TempList))

In [21]:
# Y["diagnostic_superclass"] = Y.scp_codes.apply(AggDataFunc)
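
The aggregation above maps each record's statement codes to their diagnostic superclasses and drops anything that is not a diagnostic statement. A plain-Python sketch of the same idea follows; the `DIAGNOSTIC_CLASS` dictionary is a small illustrative subset standing in for the `AggData` lookup, and `aggregate_superclasses` is a hypothetical name.

```python
# Illustrative subset of the statement -> superclass mapping (assumed, not exhaustive)
DIAGNOSTIC_CLASS = {"NORM": "NORM", "LVH": "HYP", "IMI": "MI", "NDT": "STTC"}

def aggregate_superclasses(scp_codes):
    """Collect the unique diagnostic superclasses for one record's scp_codes dict."""
    classes = {DIAGNOSTIC_CLASS[code] for code in scp_codes if code in DIAGNOSTIC_CLASS}
    # Sorting gives a deterministic order; the notebook's list(set(...)) does not
    return sorted(classes)

example = {"LVH": 100.0, "NDT": 50.0, "SR": 0.0}  # "SR" is not in the mapping, so it is skipped
result = aggregate_superclasses(example)
print(result)
```

A record with no diagnostic statement ends up with an empty list, which is worth keeping in mind before training on `diagnostic_superclass`.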

In [22]:
# Here we have created data object Y
# with open("/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/Y.pkl", "wb") as f:
#  pickle.dump(Y, f)

In [23]:
# Load the previously pickled metadata object Y
with open("/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/Y.pkl", "rb") as f:
  Y = pickle.load(f)

In [24]:
Y.head()

ecg_id,patient_id,age,sex,height,weight,nurse,site,device,recording_date,report,...,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr,diagnostic_superclass
1,15709.0,56.0,1,,63.0,2.0,0.0,CS-12 E,1984-11-09 09:17:34,sinusrhythmus periphere niederspannung,...,,", I-V1,",,,,,3,records100/00000/00001_lr,records500/00000/00001_hr,[NORM]
2,13243.0,19.0,0,,70.0,2.0,0.0,CS-12 E,1984-11-14 12:55:37,sinusbradykardie sonst normales ekg,...,,,,,,,2,records100/00000/00002_lr,records500/00000/00002_hr,[NORM]
3,20372.0,37.0,1,,69.0,2.0,0.0,CS-12 E,1984-11-15 12:49:10,sinusrhythmus normales ekg,...,,,,,,,5,records100/00000/00003_lr,records500/00000/00003_hr,[NORM]
4,17014.0,24.0,0,,82.0,2.0,0.0,CS-12 E,1984-11-15 13:44:57,sinusrhythmus normales ekg,...,", II,III,AVF",,,,,,3,records100/00000/00004_lr,records500/00000/00004_hr,[NORM]
5,17448.0,19.0,1,,70.0,2.0,0.0,CS-12 E,1984-11-17 10:43:15,sinusrhythmus normales ekg,...,", III,AVR,AVF",,,,,,4,records100/00000/00005_lr,records500/00000/00005_hr,[NORM]


## Data Preprocessing

- **Apparent Goals:**
  - Streamline the data integration and pre-processing steps for model training on the PTB-XL dataset.
  - Further additions are yet to be made.
  - First of all, normalise the dataset metadata by analysing every column.
  - For the signals, find suitable normalisation and filtering techniques.
  - Define a proper label for HYP patients at higher risk.

- **Considerations:**
  - The target label is a proxy label.
  - The aim here is to identify patients with higher susceptibility to ALVF.
  - Nothing is diagnosed or prognosed here; this is a screening model, and any modifications must respect that.
  - An ALVF diagnosis requires the LVEF metric, which is absent from this dataset, so it cannot be said with certainty whether these patients actually have ALVF.
  - **Hence, without a specific metric for calling a patient ALVF-positive, all we can do is call them susceptible.**
  - Proxy labels are acceptable only if the literature backs the claim behind the change.
  - For now, the major objectives are:
    - Perform EDA. Find the notebook here: [CDSS DL EDA](https://colab.research.google.com/drive/1walTNvC8k1NvT7d3plsRZirv4-HlA2rO#scrollTo=5Cax56lIqY9o)
    - Present the claim with the specified literature.
    - Specify what needs to change and how it can be done.
    - Change the normal label to the proxy label.
    - Then consider splitting the data.

### _Claim, Literature Backing & Action_

Note: Refer to [CDSS DL EDA](https://colab.research.google.com/drive/1walTNvC8k1NvT7d3plsRZirv4-HlA2rO#scrollTo=5Cax56lIqY9o) for a more comprehensive view.

In brief:
- The literature on ALVF (Bacharova 2023, Wagner 2020, Bergamasco 2022, McDonagh 1997, Raymond 2003, Sara 2020) suggests:
  - "As age increases, the susceptibility to ALVF also increases (overall prevalence in males is still around 1.5%, yet this holds true), and many men in the later stages of life are more prone to it than women of the same age range."
  - "Men with ALV systolic dysfunction have a higher chance of mortality than women with AL diastolic dysfunction."
  - "Prevalence for men at 50 years of age is approx. 4.5%, and it goes up above 10% if age is above 70 years."
  - "Men have more cases of AL systolic dysfunction, whereas women have more cases of diastolic dysfunction."
- Visualisations from [CDSS DL EDA](https://colab.research.google.com/drive/1walTNvC8k1NvT7d3plsRZirv4-HlA2rO#scrollTo=5Cax56lIqY9o) support these claims. Hence, the core idea justifies adding the proxy label "HYP_HR" (HR stands for High Risk).
- So, which entries should be considered high risk?
  - In the EDA, 769 HYP LVH entries had at least one of the following form statements:
    - STD
    - STE
    - HVOLT
    - LVOLT
    - VCLVH
  - An entry with any one of these form statements in its `scp_codes` dictionary can be deemed a **high-risk individual**!
- Using the EDA pipeline as a reference, retrieve those entries and, based on some unique identifier (`patient_id` or other), change their diagnostic label to HYP_HR.
- The following code blocks do exactly this!
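
The high-risk rule listed above can be sketched as a small predicate. This assumes that "HYP LVH entries" means records whose `scp_codes` contain the `LVH` statement; the actual selection lives in the EDA notebook. The function name `is_high_risk` is hypothetical, and the exact key spelling of the form statements in `scp_codes` should be checked against `scp_statements.csv`.

```python
# Form statements named in the list above; key spelling in scp_codes is an
# assumption and should be verified against scp_statements.csv
HIGH_RISK_FORM_STATEMENTS = {"STD", "STE", "HVOLT", "LVOLT", "VCLVH"}

def is_high_risk(scp_codes):
    """True if the record carries LVH plus at least one listed form statement."""
    has_lvh = "LVH" in scp_codes
    has_form_statement = not HIGH_RISK_FORM_STATEMENTS.isdisjoint(scp_codes)
    return has_lvh and has_form_statement

flags = [
    is_high_risk({"LVH": 100.0, "VCLVH": 0.0}),   # LVH with a listed form statement
    is_high_risk({"LVH": 100.0, "SR": 0.0}),      # LVH without any listed form statement
    is_high_risk({"NORM": 100.0, "LVOLT": 0.0}),  # form statement but no LVH
]
print(flags)
```

Applied over the parsed `scp_codes` column, a predicate like this would yield the boolean mask used to build the susceptible dataframe.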

### _Loading Susceptible Dataframe_

In [25]:
# Load the ALVF dataset made during EDA
with open("/content/drive/MyDrive/Projects/CDSS_DLComponentRepo/ALVF_df.pkl", "rb") as f:
  ALVF_df = pickle.load(f)

In [26]:
ALVF_df.head()

Unnamed: 0,patient_id,age,sex,height,weight,nurse,site,device,recording_date,report,...,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr,diagnostic_superclass
0,340.0,85.0,male,160.0,40.0,3.0,1.0,AT-6 C 5.5,1986-09-07 14:05:32,sinus rhythm. left atrial enlargement. voltage...,...,,,,,,,2,records100/00000/00273_lr,records500/00000/00273_hr,"[STTC, HYP]"
1,5662.0,67.0,male,152.0,60.0,11.0,1.0,AT-6 C 5.5,1986-09-07 14:11:43,sinus rhythm. possible left atrial enlargement...,...,,,,,,,10,records100/00000/00274_lr,records500/00000/00274_hr,"[HYP, STTC, MI]"
2,3192.0,76.0,male,168.0,54.0,10.0,1.0,AT-6 C 5.5,1986-09-10 13:14:44,sinus bradycardia with sinus arrhythmia. the c...,...,,,,,,,3,records100/00000/00284_lr,records500/00000/00284_hr,"[STTC, HYP]"
3,340.0,85.0,male,160.0,40.0,7.0,1.0,AT-6 C 5.5,1986-09-12 12:53:42,sinus rhythm. left atrial enlargement. voltage...,...,,,,,,,2,records100/00000/00298_lr,records500/00000/00298_hr,"[HYP, STTC, MI]"
4,1110.0,56.0,male,165.0,73.0,1.0,1.0,AT-6 C 5.5,1986-09-13 15:10:13,sinus rhythm. voltages are high in chest leads...,...,,,,,,,5,records100/00000/00313_lr,records500/00000/00313_hr,[HYP]


Using these `patient_id` values, we can change the diagnostic labels in the main metadata variable. The dataset shown above includes only the susceptible entries.
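
Note that `patient_id` identifies a patient rather than a recording, so one patient can own several rows (patient 340 appears twice in the preview above). The sketch below contrasts patient-level matching with a record-level alternative keyed on `filename_lr`. The toy frames and their values are made up, and the record-level variant is an option to consider, not what this notebook does.

```python
import pandas as pd

# Toy stand-ins for Y and ALVF_df (values are made up)
meta = pd.DataFrame({
    "patient_id": [340.0, 340.0, 999.0],
    "filename_lr": ["rec_a", "rec_b", "rec_c"],
})
flagged = pd.DataFrame({"patient_id": [340.0], "filename_lr": ["rec_a"]})

# Patient-level match: every recording of a flagged patient is selected
patient_mask = meta["patient_id"].isin(set(flagged["patient_id"]))

# Record-level match: only the exact recordings present in the flagged frame
record_mask = meta["filename_lr"].isin(set(flagged["filename_lr"]))

print(patient_mask.tolist())
print(record_mask.tolist())
```

Patient-level matching relabels all of a flagged patient's recordings that carry `HYP`, which suits a screening label attached to the person rather than to a single ECG.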

### _Changing Labels for the High-Risk Entries_

In [41]:
# working copy of the ALVF DF (plain assignment would only create an alias)
ALVF_df_testcase = ALVF_df.copy()

In [42]:
# working copy of the original DF, so that Y itself stays unmodified
Y_test_case = Y.copy()

In [43]:
Y_test_case.head()

ecg_id,patient_id,age,sex,height,weight,nurse,site,device,recording_date,report,...,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr,diagnostic_superclass
1,15709.0,56.0,1,,63.0,2.0,0.0,CS-12 E,1984-11-09 09:17:34,sinusrhythmus periphere niederspannung,...,,", I-V1,",,,,,3,records100/00000/00001_lr,records500/00000/00001_hr,[NORM]
2,13243.0,19.0,0,,70.0,2.0,0.0,CS-12 E,1984-11-14 12:55:37,sinusbradykardie sonst normales ekg,...,,,,,,,2,records100/00000/00002_lr,records500/00000/00002_hr,[NORM]
3,20372.0,37.0,1,,69.0,2.0,0.0,CS-12 E,1984-11-15 12:49:10,sinusrhythmus normales ekg,...,,,,,,,5,records100/00000/00003_lr,records500/00000/00003_hr,[NORM]
4,17014.0,24.0,0,,82.0,2.0,0.0,CS-12 E,1984-11-15 13:44:57,sinusrhythmus normales ekg,...,", II,III,AVF",,,,,,3,records100/00000/00004_lr,records500/00000/00004_hr,[NORM]
5,17448.0,19.0,1,,70.0,2.0,0.0,CS-12 E,1984-11-17 10:43:15,sinusrhythmus normales ekg,...,", III,AVR,AVF",,,,,,4,records100/00000/00005_lr,records500/00000/00005_hr,[NORM]


In [44]:
# get patient ids
ALVF_id_set = set(ALVF_df_testcase['patient_id'])

# Function for replacing HYP with HYP_HR in a diagnostic superclass list
def Replace_HYP(string_list):
  return ["HYP_HR" if s == "HYP" else s for s in string_list]

# Relabel every entry of the original dataframe whose patient_id appears in the ALVF dataframe
ALVF_mask = Y_test_case['patient_id'].isin(ALVF_id_set)
Y_test_case.loc[ALVF_mask, "diagnostic_superclass"] = \
 Y_test_case.loc[ALVF_mask, "diagnostic_superclass"].apply(Replace_HYP)

In [45]:
Y_test_case[260:270]

ecg_id,patient_id,age,sex,height,weight,nurse,site,device,recording_date,report,...,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr,diagnostic_superclass
268,2533.0,57.0,0,165.0,72.0,5.0,1.0,AT-6 C 5.5,1986-09-06 15:54:13,sinus rhythm. no definite pathology.,...,,,,,,,1,records100/00000/00268_lr,records500/00000/00268_hr,[NORM]
269,2282.0,87.0,1,166.0,70.0,8.0,1.0,AT-6 C 5.8,1986-09-06 16:11:01,premature atrial contraction(s). sinus rhythm....,...,,,,,,,8,records100/00000/00269_lr,records500/00000/00269_hr,"[CD, MI]"
270,1480.0,50.0,0,188.0,93.0,7.0,1.0,AT-6 C 5.5,1986-09-06 19:26:43,sinus tachycardia. qs complexes in v2 and tiny...,...,,,,,,,4,records100/00000/00270_lr,records500/00000/00270_hr,[MI]
271,449.0,75.0,0,168.0,75.0,8.0,1.0,AT-6 C 5.5,1986-09-06 23:56:41,premature atrial contraction(s). sinus rhythm....,...,,,,,,,10,records100/00000/00271_lr,records500/00000/00271_hr,"[HYP, STTC, MI]"
272,1027.0,54.0,0,180.0,83.0,11.0,1.0,AT-6 C 5.5,1986-09-07 11:36:22,sinus rhythm. normal ecg.,...,,,,,,,4,records100/00000/00272_lr,records500/00000/00272_hr,[NORM]
273,340.0,85.0,1,160.0,40.0,3.0,1.0,AT-6 C 5.5,1986-09-07 14:05:32,sinus rhythm. left atrial enlargement. voltage...,...,,,,,,,2,records100/00000/00273_lr,records500/00000/00273_hr,"[STTC, HYP_HR]"
274,5662.0,67.0,1,152.0,60.0,11.0,1.0,AT-6 C 5.5,1986-09-07 14:11:43,sinus rhythm. possible left atrial enlargement...,...,,,,,,,10,records100/00000/00274_lr,records500/00000/00274_hr,"[HYP_HR, STTC, MI]"
275,1389.0,67.0,1,157.0,44.0,8.0,1.0,AT-6 C 5.5,1986-09-07 14:11:59,sinus rhythm. normal ecg.,...,,,,,,,1,records100/00000/00275_lr,records500/00000/00275_hr,[NORM]
276,5087.0,76.0,1,160.0,67.0,7.0,1.0,AT-6 C 5.5,1986-09-07 15:48:20,sinus rhythm. minor non-specific t wave flatte...,...,,,,,,,6,records100/00000/00276_lr,records500/00000/00276_hr,[STTC]
277,6211.0,46.0,1,162.0,62.0,11.0,1.0,AT-6 C 5.5,1986-09-08 09:08:40,sinus rhythm. chest lead voltages suggest poss...,...,,,,,,,2,records100/00000/00277_lr,records500/00000/00277_hr,[NORM]


**The susceptible entries have been updated in line with the claims made from the EDA! We can now proceed to dataset splitting!**
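
A quick sanity check after relabelling is to count the `HYP_HR` records and confirm that no flagged patient still carries the plain `HYP` label. The sketch below runs on a made-up toy frame; `relabelled` and `flagged_ids` are stand-ins for `Y_test_case` and `ALVF_id_set`.

```python
import pandas as pd

# Toy relabelled metadata (made-up values standing in for Y_test_case)
relabelled = pd.DataFrame({
    "patient_id": [340.0, 449.0, 1027.0],
    "diagnostic_superclass": [["STTC", "HYP_HR"], ["HYP", "STTC", "MI"], ["NORM"]],
})
flagged_ids = {340.0}  # stand-in for ALVF_id_set

labels = relabelled["diagnostic_superclass"]
is_flagged = relabelled["patient_id"].isin(flagged_ids)

# How many records now carry the proxy label
n_high_risk = int(labels.map(lambda lst: "HYP_HR" in lst).sum())

# Flagged patients that still carry the plain HYP label (expected: none)
leftover = int(labels[is_flagged].map(lambda lst: "HYP" in lst).sum())

print(n_high_risk, leftover)
```

Unflagged patients such as 449 in the toy frame keep their plain `HYP` label, which matches the output shown above.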

## Data Splitting & Fold Preparation

According to the reference paper for the PTB-XL dataset:
  - The paper recommends a **10-fold stratified split**.
  - The paper recommends **10-fold cross-validation** for validating the performance of the model.
  - Folds are given in the `strat_fold` column of the metadata file.
  - All records of a given patient are assigned to the same fold.
  - Folds 9 and 10 are considered **high-quality folds**, since their records underwent at least one human evaluation.
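
The patient-to-fold property matters because it prevents the same patient from leaking across training, validation and test sets. A minimal sketch of how it could be checked follows, using a made-up toy frame with the same column names as the metadata.

```python
import pandas as pd

# Toy metadata (made-up values) with the two columns the check needs
meta = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "strat_fold": [4, 4, 9, 10, 10],
})

# Number of distinct folds each patient appears in; 1 means no leakage
folds_per_patient = meta.groupby("patient_id")["strat_fold"].nunique()
leaking_patients = folds_per_patient[folds_per_patient > 1].index.tolist()

print(leaking_patients)
```

Running the same `groupby` on the real metadata should return an empty list if the fold assignment respects patients.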

For this exercise, we will use the following split:

  - Fold 1 to Fold 8 as **training sets**
  - Fold 10 as **the testing set**
  - Fold 9 as **the validation set**
  

****

For further details, please refer to the original paper:

[Wagner, P., Strodthoff, N., Bousseljot, RD. et al. PTB-XL, a large publicly available electrocardiography dataset. Sci Data 7, 154 (2020)](https://doi.org/10.1038/s41597-020-0495-6)

### _Creating Test Set_

In [73]:
# test set split
test_fold = 10

X_test = X[np.where(Y_test_case.strat_fold == test_fold)]
Y_test = Y_test_case[(Y_test_case.strat_fold == test_fold)].diagnostic_superclass

In [74]:
Y_test.shape

(2198,)

Here, the test set contains only the entries whose `strat_fold` value is **fold ten**!

### _Creating Validation Set_

In [75]:
# validation fold
validation_fold = 9

X_val = X[np.where(Y_test_case.strat_fold == validation_fold)]
Y_val = Y_test_case[(Y_test_case.strat_fold == validation_fold)].diagnostic_superclass

In [76]:
Y_val.shape

(2183,)

Here, the validation set contains only the entries whose `strat_fold` value is **fold nine**!

### _Creating Training Set_

In [77]:
X_train = X[np.where((Y_test_case.strat_fold != test_fold) & (Y_test_case.strat_fold != validation_fold))]
Y_train = Y_test_case[((Y_test_case.strat_fold != test_fold) & (Y_test_case.strat_fold != validation_fold))].diagnostic_superclass

In [78]:
Y_train.shape

(17418,)

Here, the training set contains only the entries whose `strat_fold` value lies between **fold one and fold eight**!

From the split above, we find that:
  - The test set has 2198 entries.
  - The validation set has 2183 entries.
  - The training set has 17418 entries.

**Altogether, they account for all 21799 entries (2198 + 2183 + 17418)!**
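
The three splits above can also be produced by one small helper that works on boolean masks and checks that no record is lost. This is a sketch on synthetic data; `split_by_fold` is a hypothetical name, and in the notebook the fold array would come from something like `Y_test_case.strat_fold.to_numpy()`.

```python
import numpy as np

def split_by_fold(signals, folds, val_fold=9, test_fold=10):
    """Split signals into train/val/test using a per-record fold array."""
    folds = np.asarray(folds)
    test_mask = folds == test_fold
    val_mask = folds == val_fold
    train_mask = ~(test_mask | val_mask)
    return signals[train_mask], signals[val_mask], signals[test_mask]

# Synthetic stand-in for X: 20 records, 5 samples, 2 leads (the real X is (21799, 1000, 12))
rng = np.random.default_rng(0)
signals = rng.normal(size=(20, 5, 2))
folds = np.tile(np.arange(1, 11), 2)  # each fold 1..10 appears twice

train, val, test = split_by_fold(signals, folds)

# Every record must land in exactly one split
assert len(train) + len(val) + len(test) == len(signals)
print(train.shape, val.shape, test.shape)
```

Because the masks are positional, this relies on the signal array and the metadata rows being in the same order, as they are when `X` is loaded by iterating over `Y`.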