## Exploratory Data Analysis

We explore our working features and labels here.

#### Research Questions (final?)

1. Build a multi-label classifier and test how parameter choices and preprocessing steps affect model performance. Build a baseline
and analyse why preprocessing matters for model performance (e.g. proper label encoding that preserves ordering/relations between labels).

2. Model explainability: DL models often outperform classical ML models at the cost of explainability. Inspired by our
master's thesis, we will perform model distillation from a high-performing DL model to a high-performing ML model in order to
analyse feature importance.

In [33]:
import os
import sys

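# Avoid writing __pycache__ files when importing project modules.
sys.dont_write_bytecode = True
# Make the parent of the current working directory importable (configs/, scripts/).
root_dir = os.path.abspath(os.pardir)
if root_dir not in sys.path:
    sys.path.append(root_dir)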

import pandas as pd
import numpy as np
from configs.constants import *
from scripts.data import *

In [17]:
metafile = '../data/results/complete_metadata_mapping_2.csv'
data_dir_folder_path = os.path.abspath(os.path.join(root_dir, DATA_DIR_FOLDER))
data_dir_path = os.path.abspath(os.path.join(data_dir_folder_path, DATA_DIR))

In [4]:
meta_df = pd.read_csv(metafile)

Descriptives

In [8]:
# 'sex' is non-numeric, so describe() leaves it out of the summary below.
meta_df[['n_sig', 'fs', 'n_samples', 'age', 'sex']].describe()

Unnamed: 0,n_sig,fs,n_samples,age
count,45151.0,45151.0,45151.0,45097.0
mean,12.0,500.0,5000.0,58.208462
std,0.0,0.0,0.0,19.688251
min,12.0,500.0,5000.0,0.0
25%,12.0,500.0,5000.0,48.0
50%,12.0,500.0,5000.0,61.0
75%,12.0,500.0,5000.0,72.0
max,12.0,500.0,5000.0,89.0
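The zero standard deviations above suggest that `n_sig`, `fs` and `n_samples` are constant across records. A minimal sketch of how such columns could be flagged automatically, using a toy frame in place of `meta_df` (the values are illustrative) and assuming `fs` is a sampling frequency in Hz:

```python
import pandas as pd

# Toy stand-in for meta_df; values are illustrative, not read from the dataset.
toy_meta = pd.DataFrame({
    "n_sig": [12, 12, 12],
    "fs": [500, 500, 500],
    "n_samples": [5000, 5000, 5000],
    "age": [48.0, 61.0, 72.0],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [
    col for col in toy_meta.columns
    if toy_meta[col].nunique(dropna=True) == 1
]

# Assumption: fs is in Hz, so n_samples / fs is the recording length in seconds.
duration_s = toy_meta["n_samples"] / toy_meta["fs"]

print(constant_cols)
print(duration_s.unique())
```

Under that assumption, 5000 samples at 500 Hz correspond to 10-second recordings, and the three constant columns could be dropped from the feature set.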


Null analysis

In [10]:
meta_df.isnull().sum()

record          1
hea_path        0
record_path     0
n_sig           1
fs              1
n_samples       1
age            55
sex             0
dx_codes        0
dtype: int64
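Before deciding how to treat these gaps, it can help to look at the affected rows directly. The sketch below uses a toy frame rather than `meta_df`, and the dropping policy it shows is only one possible choice, not something this notebook has settled on:

```python
import numpy as np
import pandas as pd

# Toy stand-in for meta_df; record names and values are made up for illustration.
toy_meta = pd.DataFrame({
    "record": ["A0001", None, "A0003", "A0004"],
    "n_sig": [12.0, np.nan, 12.0, 12.0],
    "age": [61.0, np.nan, np.nan, 48.0],
    "sex": ["Male", "Unknown", "Female", "Male"],
})

# Rows with at least one missing value, for manual inspection.
rows_with_nulls = toy_meta[toy_meta.isnull().any(axis=1)]

# Share of missing values per column (0.0 to 1.0).
null_share = toy_meta.isnull().mean()

# One possible policy (an assumption): drop rows without signal metadata,
# but keep rows that only lack an age.
cleaned = toy_meta.dropna(subset=["record", "n_sig"])

print(rows_with_nulls)
print(null_share)
```

Whether rows with a missing `age` should be dropped, imputed or kept as-is depends on whether age ends up being used as a feature.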

#### Target Label analysis

In [19]:
labels = load_labels(data_dir_folder_path, 'ConditionNames_SNOMED-CT.csv')

The label set $L$ has cardinality $|L| = 55$: we are working with 55 (possibly co-occurring) heart conditions.

In [32]:
labels['Snomed_CT'].nunique()

55

**Important**: there might be ordinal relationships within sub-groups of heart conditions,

e.g. 1st degree atrioventricular block is less serious than 2nd degree block.
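One way such an ordering could be made explicit is an ordinal rank per sub-group. The sketch below is hypothetical: the `AV_BLOCK_SEVERITY` mapping and the `av_block_rank` column are illustrative names, the ranks are an assumption rather than a clinical reference, and the type-one/type-two variants are left out on purpose.

```python
import pandas as pd

# Hypothetical severity ranks for the atrioventricular block family.
# The ordering is an illustrative assumption, not a clinical reference.
AV_BLOCK_SEVERITY = {"1AVB": 1, "2AVB": 2, "3AVB": 3}

# Toy stand-in for the labels table, using acronyms that appear in it.
toy_labels = pd.DataFrame({"Acronym Name": ["1AVB", "3AVB", "SVT", "2AVB"]})

# Labels outside the family get rank 0, meaning "no AV block severity assigned".
toy_labels["av_block_rank"] = (
    toy_labels["Acronym Name"].map(AV_BLOCK_SEVERITY).fillna(0).astype(int)
)

print(toy_labels)
```

A plain one-hot encoding would treat these labels as unrelated, whereas a rank column like this keeps the severity order available to the model.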

In [29]:
labels

Unnamed: 0,Acronym Name,Full Name,Snomed_CT
0,1AVB,1 degree atrioventricular block,270492004
1,2AVB,2 degree atrioventricular block,195042002
2,2AVB1,2 degree atrioventricular block(Type one),54016002
3,2AVB2,2 degree atrioventricular block(Type two),28189009
4,3AVB,3 degree atrioventricular block,27885002
...,...,...,...
58,SVT,Supraventricular Tachycardia,426761007
59,AT,Atrial Tachycardia,713422000
60,AVNRT,Atrioventricular Node Reentrant Tachycardia,233896004
61,AVRT,Atrioventricular Reentrant Tachycardia,233897008


In [26]:
meta_df['dx_codes']

0         [164889003, 59118001, 164934002]
1                   [426177001, 164934002]
2                              [426177001]
3        [164890007, 429622005, 428750005]
4                              [426177001]
                       ...                
45147                          [425856008]
45148                          [425856008]
45149                          [425856008]
45150               [233897008, 425856008]
45151                          [106068003]
Name: dx_codes, Length: 45152, dtype: object
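For a multi-label classifier, these code lists need to become a 0/1 indicator matrix with one column per code. A minimal sketch, assuming each `dx_codes` cell was read from the CSV as the string form of a list (if the cells are already lists, the parsing step can be skipped); the toy series reuses the codes from the first rows shown above:

```python
import ast

import pandas as pd

# Toy stand-in for meta_df['dx_codes']; assumes string-encoded lists.
toy_codes = pd.Series([
    "[164889003, 59118001, 164934002]",
    "[426177001, 164934002]",
    "[426177001]",
])

# Parse each string into a real Python list of integer codes.
parsed = toy_codes.apply(ast.literal_eval)

# Sorted vocabulary of all codes that occur at least once.
all_codes = sorted({code for codes in parsed for code in codes})

# Indicator matrix: one row per record, one column per code.
Y = pd.DataFrame(
    [[int(code in codes) for code in all_codes] for codes in parsed],
    index=parsed.index,
    columns=all_codes,
)

# Co-occurrence counts: entry (a, b) is the number of records carrying both codes.
cooccurrence = Y.T.dot(Y)

print(Y)
print(cooccurrence)
```

The diagonal of `cooccurrence` holds the per-code frequencies, which gives a quick view of class imbalance alongside the co-occurrence structure.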

### X analyses