## Modeling Pipeline

This notebook develops the project's ML/DL models, focusing on setting up the modelling pipeline and training **baseline** models.

### Research Questions:

1. Build baseline ML and DL models
2. Build optimized ML and DL models
3. Are the optimized models better, and by how much?
4. Is the baseline DL model better than the optimized ML model?
5. Knowledge distillation:
    - Is the distilled ML model better than the optimized ML model?
    - Is the distilled ML model better than the baseline DL model?

In [32]:
import os
import sys

sys.dont_write_bytecode = True
root_dir = os.path.abspath(os.pardir)
if root_dir not in sys.path:
    sys.path.append(root_dir)

In [33]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from configs.constants import *

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [34]:
data_dir = '../data/preprocessing/ML/'
meta_df_file = '../data/results/complete_metadata_mapping_2.csv'

train_0_file = 'meta_df_preprocess_X_train_0.csv'
test_0_file = 'meta_df_preprocess_X_test_0.csv'
train_1_file = 'meta_df_preprocess_X_train_1.csv'
test_1_file = 'meta_df_preprocess_X_test_1.csv'

### Models

train test split indices

In [35]:
meta_df = pd.read_csv(meta_df_file)
meta_df['dx_codes'] = meta_df['dx_codes'].map(json.loads)

In [36]:
X = meta_df.drop('dx_codes', axis=1)
y = meta_df['dx_codes']

_X_train, _X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=TTS_SEED)

_X_train = _X_train.reset_index(drop=True)
_X_test = _X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

loading train and test data

In [37]:
X_train_0 = pd.read_csv(os.path.join(data_dir, train_0_file))
X_test_0 = pd.read_csv(os.path.join(data_dir, test_0_file))
X_train_1 = pd.read_csv(os.path.join(data_dir, train_1_file))
X_test_1 = pd.read_csv(os.path.join(data_dir, test_1_file))

y_train_saved = pd.read_csv(os.path.join(data_dir, 'y_train.csv'))['dx_codes'].map(json.loads)
y_test_saved = pd.read_csv(os.path.join(data_dir, 'y_test.csv'))['dx_codes'].map(json.loads)


equivalence test

In [38]:
assert all(y_train_saved.map(tuple) == y_train.map(tuple))
assert all(y_test_saved.map(tuple) == y_test.map(tuple))

dropping error rows from X, y

In [39]:
bad_index_train = X_train_0.dropna(subset='error').index

In [40]:
X_train_0 = X_train_0.drop(bad_index_train, errors='ignore')
X_train_1 = X_train_1.drop(bad_index_train, errors='ignore')

y_train = y_train.drop(bad_index_train, errors='ignore')

In [41]:
X_train_0 = X_train_0.reset_index(drop=True)
X_train_1 = X_train_1.reset_index(drop=True)
X_test_0 = X_test_0.reset_index(drop=True)
X_test_1 = X_test_1.reset_index(drop=True)

**model preprocessing**

label encoding

In [42]:
mlb = MultiLabelBinarizer()
y_train_transformed = mlb.fit_transform(y_train)
y_test_transformed = mlb.transform(y_test)
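
To make the label encoding step concrete, here is a minimal sketch on made-up integer codes (the `toy_*` names and the codes are hypothetical and not taken from the dataset). It shows that the binarizer creates one column per code seen during `fit`, and that a code seen only at `transform` time is ignored with a warning.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels: each record carries a list of diagnosis codes.
toy_train = [[30], [20, 10], [30, 10]]
toy_test = [[20], [30, 99]]  # 99 never appears in the training labels

toy_mlb = MultiLabelBinarizer()
Y_toy_train = toy_mlb.fit_transform(toy_train)  # one column per code seen in training
Y_toy_test = toy_mlb.transform(toy_test)        # the unseen code 99 is ignored (with a warning)

print(toy_mlb.classes_)  # column order: sorted codes
print(Y_toy_train)
print(Y_toy_test)
```

Because the columns are fixed at fit time, any test-only diagnosis code silently disappears from `y_test_transformed`, which is worth keeping in mind when reading per-label metrics.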

remove metadata features

In [43]:
meta_features = ['record_id', 'filter_mode', 'beat_lead', 'error']

In [44]:
X_train_0 = X_train_0.drop(meta_features, axis=1, errors='ignore')
X_train_1 = X_train_1.drop(meta_features, axis=1, errors='ignore')
X_test_0 = X_test_0.drop(meta_features, axis=1, errors='ignore')
X_test_1 = X_test_1.drop(meta_features, axis=1, errors='ignore')

final preprocessing (using pipeline fit)

In [45]:
nominal_columns = ['sex']
# numeric columns selected by dtype, excluding the nominal ones
numerical_columns = X_train_0.select_dtypes(include='number').columns.drop(nominal_columns, errors='ignore')

In [46]:
categorical_pipe = Pipeline([
    ('nulls', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_pipe = Pipeline([
    ('nulls', SimpleImputer()),
    ('scale', StandardScaler()),
    #('poly', PolynomialFeatures(degree=2)),
    #('pca', PCA(n_components=0.99))
])

#### **ML**

In [47]:
ct = ColumnTransformer(
    transformers=[
        ('numerical', numerical_pipe, numerical_columns),
        ('categorical', categorical_pipe, nominal_columns)
    ],
    remainder='passthrough'
)

baseline models:
1. Logistic regression
2. SVM
3. XGBoost

In [48]:
n_jobs = 10

In [49]:
# LR
clf = OneVsRestClassifier(LogisticRegression(max_iter=2000), n_jobs=n_jobs)

model = Pipeline([
    ('preprocess', ct),
    ('clf', clf)
])

# raw
model.fit(X_train_0, y_train_transformed)
y_pred_raw = model.predict(X_test_0)

# bandpass filter
model.fit(X_train_1, y_train_transformed)
y_pred_filter = model.predict(X_test_1)

model


baseline scoring

In [50]:
from sklearn.metrics import accuracy_score, hamming_loss

raw signal

In [51]:
print("Accuracy score:", accuracy_score(y_test_transformed, y_pred_raw))
print("Hamming Loss:", hamming_loss(y_test_transformed, y_pred_raw))

Accuracy score: 0.37134632418069086
Hamming Loss: 0.014909164577954506


filtered signal

In [52]:
print("Accuracy score:", accuracy_score(y_test_transformed, y_pred_filter))
print("Hamming Loss:", hamming_loss(y_test_transformed, y_pred_filter))

Accuracy score: 0.3704605845881311
Hamming Loss: 0.014986902361344062
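
The two metrics above measure different things, which is why a modest accuracy can sit next to a very small Hamming loss. On multilabel targets, `accuracy_score` is subset accuracy (a record counts only if every label is right), while `hamming_loss` is the share of individual label cells that are wrong. A small sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical toy targets: 4 records, 5 labels each.
y_true_toy = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
])
# Predictions miss exactly one label in the 2nd and 4th record; the rest are exact.
y_pred_toy = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
])

subset_acc = accuracy_score(y_true_toy, y_pred_toy)  # 2 of 4 rows fully correct
ham = hamming_loss(y_true_toy, y_pred_toy)           # 2 of 20 label cells wrong

print("subset accuracy:", subset_acc)
print("hamming loss:", ham)
```

With many mostly-zero label columns, Hamming loss is dominated by the easy negatives, so it should be read together with F1 or average precision rather than on its own.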


single label classification subset

- we extract the single-label (rhythm) part and evaluate it
- we also evaluate the multilabel condition predictions

In [53]:
from scripts.scoring import multilabel_eval, rhythm_conditions_eval

In [55]:
import numpy as np
from scripts.scoring import multilabel_eval, rhythm_conditions_eval
from configs.constants import RHYTHM_INFO_BY_SNOMED

# Build rhythm label list from constants (keys are SNOMED ints)
rhythm_snomed = list(RHYTHM_INFO_BY_SNOMED.keys())
label_to_idx = {c: i for i, c in enumerate(mlb.classes_)}

# Rhythm indices present in your binarizer classes
rhythm_idx = [label_to_idx[c] for c in rhythm_snomed if c in label_to_idx]
cond_idx = [i for i in range(len(mlb.classes_)) if i not in rhythm_idx]

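# Get per-label scores from the fitted pipeline (OvR LogisticRegression exposes decision_function).
# NOTE: `model` was last fitted on the bandpass features (X_train_1), so both calls below
# score the bandpass-trained model; `logits_raw` is that model applied to raw test features.
logits_raw = model.decision_function(X_test_0)      # raw features
logits_filt = model.decision_function(X_test_1)     # bandpass features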

# Overall multilabel metrics
overall_raw, _, _ = multilabel_eval(y_test_transformed, logits_raw, threshold=0.5)
overall_filt, _, _ = multilabel_eval(y_test_transformed, logits_filt, threshold=0.5)
print("Overall (raw):", overall_raw)
print("Overall (filter):", overall_filt)

# Decomposed metrics (conditions multilabel + rhythm multiclass/multilabel)
rc_raw = rhythm_conditions_eval(
    y_test_transformed, logits_raw,
    rhythm_idx=rhythm_idx, cond_idx=cond_idx,
    threshold=0.5,
    rhythm_label_names=[mlb.classes_[i] for i in rhythm_idx],
    cond_label_names=[mlb.classes_[i] for i in cond_idx],
)
rc_filt = rhythm_conditions_eval(
    y_test_transformed, logits_filt,
    rhythm_idx=rhythm_idx, cond_idx=cond_idx,
    threshold=0.5,
    rhythm_label_names=[mlb.classes_[i] for i in rhythm_idx],
    cond_label_names=[mlb.classes_[i] for i in cond_idx],
)

print("Conditions (raw):", rc_raw["conditions_multilabel"])
print("Rhythm (raw):", rc_raw["rhythm"])
print("Conditions (filter):", rc_filt["conditions_multilabel"])
print("Rhythm (filter):", rc_filt["rhythm"])


Overall (raw): {'f1_micro': np.float64(0.43149639454708144), 'f1_macro': np.float64(0.09609104514999574), 'f1_samples': np.float64(0.5213580464194683), 'hamming_loss': 0.02544381207244219, 'auroc_micro': np.float64(0.9364005744456358), 'auroc_macro': np.float64(0.7918022923709046), 'ap_micro': np.float64(0.31223338680879714), 'ap_macro': np.float64(0.11084300928795478)}
Overall (filter): {'f1_micro': np.float64(0.5449864110999857), 'f1_macro': np.float64(0.10986387332110513), 'f1_samples': np.float64(0.5942885969458157), 'hamming_loss': 0.014986902361344062, 'auroc_micro': np.float64(0.9623886070898474), 'auroc_macro': np.float64(0.8590352432201334), 'ap_micro': np.float64(0.602738081410262), 'ap_macro': np.float64(0.18188433754895178)}
Conditions (raw): {'f1_micro': np.float64(0.21424982413827756), 'f1_macro': np.float64(0.06536956828956202), 'f1_samples': np.float64(0.11014429037345143), 'hamming_loss': 0.020611898435193385, 'auroc_micro': np.float64(0.9163441290739667), 'auroc_macro

SVM

- logistic regression is linear, so it may depend on polynomial features to capture feature interactions
- we use an RBF-kernel SVM to capture such non-linear relationships directly
- NOTE: kernel SVMs have high memory and time cost on large training sets, so we downsample
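
One way to downsample while keeping features and labels aligned is sketched below. It assumes a pandas feature frame and a NumPy label matrix with rows in the same order; `X_demo`, `Y_demo`, and the seed value are hypothetical stand-ins. Fixing `random_state` makes the subsample repeatable, and looking up positions explicitly avoids relying on the index being a clean `0..n-1` range.

```python
import numpy as np
import pandas as pd

seed = 0  # hypothetical seed; any fixed integer makes the subsample repeatable

# Hypothetical stand-ins for the feature frame and the binarized label matrix.
X_demo = pd.DataFrame({"f1": np.arange(20.0), "f2": np.arange(20.0) * 2})
Y_demo = (np.arange(20)[:, None] % 4 == np.arange(4)).astype(int)  # row i has its 1 in column i % 4

# Draw 25% of the rows, then map the sampled index labels back to row positions.
X_sample = X_demo.sample(frac=0.25, random_state=seed)
positions = X_demo.index.get_indexer(X_sample.index)
Y_sample = Y_demo[positions]

print(X_sample.shape, Y_sample.shape)
```

Using the same `positions` for every feature variant would also train the raw and bandpass models on the same records, which makes the two scores easier to compare.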

In [99]:
# SVM
frac = 0.15
clf = SVC(kernel='rbf', C=10)
clf = OneVsRestClassifier(clf, n_jobs=n_jobs)

model = Pipeline([
    ('preprocess', ct),
    ('clf', clf)
])

X_train_0_sample = X_train_0.sample(frac=frac)
y_train_sample = y_train_transformed[X_train_0_sample.index]

# raw
model.fit(X_train_0_sample, y_train_sample)
y_pred_raw = model.predict(X_test_0)

X_train_1_sample = X_train_1.sample(frac=frac)
y_train_sample = y_train_transformed[X_train_1_sample.index]

# bandpass filter
model.fit(X_train_1_sample, y_train_sample)
y_pred_filter = model.predict(X_test_1)

In [100]:
print("Accuracy score (raw):", accuracy_score(y_test_transformed, y_pred_raw))
print("Hamming Loss (raw):", hamming_loss(y_test_transformed, y_pred_raw))
print("Accuracy score (filter):", accuracy_score(y_test_transformed, y_pred_filter))
print("Hamming Loss (filter):", hamming_loss(y_test_transformed, y_pred_filter))

Accuracy score (raw): 0.39769707705934454
Hamming Loss (raw): 0.014515764280195239
Accuracy score (filter): 0.4038972542072631
Hamming Loss (filter): 0.014869117841056857


single label classification subset

- we extract the single-label (rhythm) part and evaluate it
- we also evaluate the multilabel condition predictions
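
The split described above can be sketched as follows. This is not the implementation of `scripts.scoring.rhythm_conditions_eval`; it is a simplified illustration that assumes every record has exactly one rhythm label, and the column layout and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical layout: columns 0-2 are rhythm labels, columns 3-4 are condition labels.
rhythm_cols = [0, 1, 2]
cond_cols = [3, 4]

y_true_toy = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
])
# Hypothetical per-label scores (e.g. from decision_function); higher means more likely.
scores_toy = np.array([
    [ 2.0, -1.0, -3.0,  0.5, -2.0],
    [-0.5,  1.5, -2.0, -1.0, -1.5],
    [ 0.2, -0.3, -0.1, -0.7,  1.0],
    [ 1.0, -2.0, -1.0, -0.2, -3.0],
])

# Rhythm as single-label: take the highest-scoring rhythm column per record.
rhythm_true = y_true_toy[:, rhythm_cols].argmax(axis=1)
rhythm_pred = scores_toy[:, rhythm_cols].argmax(axis=1)
rhythm_acc = accuracy_score(rhythm_true, rhythm_pred)

# Conditions as multilabel: threshold each score at 0, the decision boundary for margins.
cond_true = y_true_toy[:, cond_cols]
cond_pred = (scores_toy[:, cond_cols] > 0).astype(int)
cond_f1 = f1_score(cond_true, cond_pred, average="micro", zero_division=0)

print("rhythm accuracy:", rhythm_acc)
print("conditions micro-F1:", cond_f1)
```

This works for `SVC` as well, since the one-vs-rest wrapper exposes `decision_function` even when probability estimates are disabled.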

XGBoost Optimized (ensemble model)

- ensemble model
- utilizing CV strategies
- transformation optimization
- hyperparam optimization
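
The optimization plan above could look roughly like the sketch below. To stay within scikit-learn it uses `GradientBoostingClassifier` as a stand-in for XGBoost, and the synthetic data, parameter grid, and fold count are all assumptions chosen for illustration. The search uses training folds only, so the held-out test set is left untouched until the final run.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multilabel data standing in for the real feature table.
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", OneVsRestClassifier(GradientBoostingClassifier(random_state=0))),
])

# Hypothetical grid; the `clf__estimator__` prefix reaches the booster inside the OvR wrapper.
param_grid = {
    "clf__estimator__n_estimators": [20, 50],
    "clf__estimator__max_depth": [2, 3],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1_micro",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, Y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Transformation choices (for example enabling the commented-out polynomial or PCA steps) can be searched the same way by adding their pipeline parameters to the grid.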

Test data run - only once

___
#### **DL (Deep Learning)**

PyTorch installation test + GPU support check

In [1]:
import torch

In [7]:
x = torch.rand(5, 3)
print(x)

cuda = torch.cuda.is_available()
print(cuda)
if cuda:
    print("cuda device count:", torch.cuda.device_count())
    print("cuda device name:", torch.cuda.get_device_name())

tensor([[0.6856, 0.1604, 0.8489],
        [0.8753, 0.4726, 0.5957],
        [0.9645, 0.2421, 0.3777],
        [0.2628, 0.0031, 0.7061],
        [0.7295, 0.6117, 0.3450]])
True
cuda device count: 1
cuda device name: NVIDIA GeForce RTX 3060 Laptop GPU


DL preprocessing
- we will pass raw ECG signals mapped per row (instead of calculating features as in ML pipe)
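
A minimal sketch of what this preprocessing might involve, assuming each record is a `(n_leads, n_samples)` array of varying length. The helper names, the 12-lead layout, and the 5000-sample target are hypothetical choices for illustration, not values from the project.

```python
import numpy as np

def normalize_per_lead(signal, eps=1e-8):
    """Z-score each lead independently; eps guards against flat leads."""
    mean = signal.mean(axis=1, keepdims=True)
    std = signal.std(axis=1, keepdims=True)
    return (signal - mean) / (std + eps)

def to_fixed_length(signal, target_len):
    """Crop or zero-pad a (n_leads, n_samples) array to target_len samples."""
    n_leads, n_samples = signal.shape
    if n_samples >= target_len:
        return signal[:, :target_len]
    padded = np.zeros((n_leads, target_len), dtype=signal.dtype)
    padded[:, :n_samples] = signal
    return padded

rng = np.random.default_rng(0)
# Hypothetical records: 12 leads each, with different lengths.
records = [rng.normal(size=(12, n)) for n in (4000, 5000, 6000)]

# Normalize first so the zero padding does not distort the per-lead statistics.
batch = np.stack([to_fixed_length(normalize_per_lead(r), 5000) for r in records])
print(batch.shape)  # (3, 12, 5000)
```

The resulting array can be wrapped with `torch.from_numpy(batch).float()` to feed a 1D convolutional or recurrent model.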

___
#### xAI: Model Explainability
ideas:
- integrated gradients
- knowledge distillation: DL -> ML
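
The distillation idea can be sketched with scikit-learn alone. Here a small `MLPClassifier` stands in for the DL teacher and a shallow regression tree is the student. Both choices, the synthetic binary task, and all hyperparameters are assumptions for illustration. The student is fitted on the teacher's predicted probabilities (soft targets) rather than on the hard labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary task standing in for a single diagnosis label.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Teacher: a small neural network standing in for the DL model.
teacher = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
teacher.fit(X_tr, y_tr)
soft_targets = teacher.predict_proba(X_tr)[:, 1]  # probabilities in [0, 1]

# Student: a shallow tree that regresses the teacher's probabilities.
student = DecisionTreeRegressor(max_depth=4, random_state=0)
student.fit(X_tr, soft_targets)

student_pred = (student.predict(X_te) >= 0.5).astype(int)
teacher_pred = teacher.predict(X_te)

fidelity = np.mean(student_pred == teacher_pred)  # agreement with the teacher
student_acc = np.mean(student_pred == y_te)       # accuracy against the true labels
print("fidelity:", fidelity, "student accuracy:", student_acc)
```

Comparing `student_acc` with the same student trained directly on `y_tr` gives a first answer to whether distillation helps. For the multilabel case the same pattern applies per label column.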