## Modeling Pipeline

This notebook develops the project's ML and DL models, focusing on setting up the modelling pipeline and training **baseline** models.

### Research Questions:

1. Build baseline ML and DL models
2. Build optimized ML and DL models
3. Are the optimized models better than the baselines, and by how much?
4. Is the baseline DL model better than the optimized ML model?
5. Knowledge distillation:
    - Is the distilled ML model better than the optimized ML model?
    - Is the distilled ML model better than the baseline DL model?

In [1]:
import os
import sys

sys.dont_write_bytecode = True
root_dir = os.path.abspath(os.pardir)
if root_dir not in sys.path:
    sys.path.append(root_dir)

In [2]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from configs.constants import *

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.base import clone

In [3]:
from sklearn.metrics import accuracy_score, hamming_loss
from scripts.scoring import *

In [4]:
data_dir = '../data/preprocessing/ML/'
meta_df_file = '../data/results/complete_metadata_mapping_2.csv'

train_0_file = 'meta_df_preprocess_X_train_0.csv'
test_0_file = 'meta_df_preprocess_X_test_0.csv'
train_1_file = 'meta_df_preprocess_X_train_1.csv'
test_1_file = 'meta_df_preprocess_X_test_1.csv'

### Models

train test split indices

In [5]:
meta_df = pd.read_csv(meta_df_file)
meta_df['dx_codes'] = meta_df['dx_codes'].map(json.loads)

In [6]:
X = meta_df.drop('dx_codes', axis=1)
y = meta_df['dx_codes']

_X_train, _X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=TTS_SEED)

_X_train = _X_train.reset_index(drop=True)
_X_test = _X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

loading train and test data

In [7]:
X_train_0 = pd.read_csv(os.path.join(data_dir, train_0_file))
X_test_0 = pd.read_csv(os.path.join(data_dir, test_0_file))
X_train_1 = pd.read_csv(os.path.join(data_dir, train_1_file))
X_test_1 = pd.read_csv(os.path.join(data_dir, test_1_file))

y_train_saved = pd.read_csv(os.path.join(data_dir, 'y_train.csv'))['dx_codes'].map(json.loads)
y_test_saved = pd.read_csv(os.path.join(data_dir, 'y_test.csv'))['dx_codes'].map(json.loads)



equivalence test: the recreated split must match the saved targets

In [8]:
assert all(y_train_saved.map(tuple) == y_train.map(tuple))
assert all(y_test_saved.map(tuple) == y_test.map(tuple))

dropping error rows from X, y

In [9]:
bad_index_train = X_train_0.dropna(subset='error').index

In [10]:
X_train_0 = X_train_0.drop(bad_index_train, errors='ignore')
X_train_1 = X_train_1.drop(bad_index_train, errors='ignore')

y_train = y_train.drop(bad_index_train, errors='ignore')

In [11]:
X_train_0 = X_train_0.reset_index(drop=True)
X_train_1 = X_train_1.reset_index(drop=True)
X_test_0 = X_test_0.reset_index(drop=True)
X_test_1 = X_test_1.reset_index(drop=True)

**model preprocessing**

remove metadata features

In [12]:
meta_features = ['record_id', 'filter_mode', 'beat_lead', 'error', 'fs', 'n_leads', 'n_samples']


In [13]:
X_train_0 = X_train_0.drop(meta_features, axis=1, errors='ignore')
X_train_1 = X_train_1.drop(meta_features, axis=1, errors='ignore')
X_test_0 = X_test_0.drop(meta_features, axis=1, errors='ignore')
X_test_1 = X_test_1.drop(meta_features, axis=1, errors='ignore')

train/validation split:
- the test split is used only for the final model evaluation

In [14]:
train_idx, val_idx = train_test_split(
    np.arange(len(y_train)),
    test_size=VAL_SIZE,
    random_state=TTS_SEED,
)

# Raw features
X_tr_0 = X_train_0.iloc[train_idx].reset_index(drop=True)
X_val_0 = X_train_0.iloc[val_idx].reset_index(drop=True)

# Bandpass features
X_tr_1 = X_train_1.iloc[train_idx].reset_index(drop=True)
X_val_1 = X_train_1.iloc[val_idx].reset_index(drop=True)

# Targets
y_tr = y_train.iloc[train_idx].reset_index(drop=True)
y_val = y_train.iloc[val_idx].reset_index(drop=True)

label encoding

In [15]:
mlb = MultiLabelBinarizer()
y_tr_bin = mlb.fit_transform(y_tr)
y_val_bin = mlb.transform(y_val)
y_test_bin = mlb.transform(y_test)
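
Because the binarizer is fit on the training split only, a code that appears only in validation or test rows gets no column. The sketch below uses made-up label values (not taken from the dataset) to show that scikit-learn drops such labels with a warning instead of raising an error; all names in it are illustrative.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists; the notebook itself uses the codes stored in `dx_codes`.
train_labels = [["A", "B"], ["B"], ["C"]]
val_labels = [["A"], ["B", "Z"]]  # "Z" never appears in the training split

demo_mlb = MultiLabelBinarizer()
train_bin = demo_mlb.fit_transform(train_labels)

# Record warnings so the unknown-label case can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    val_bin = demo_mlb.transform(val_labels)

print(demo_mlb.classes_)  # columns learned from the training split
print(val_bin)            # "Z" has no column, so it is dropped from the second row
print(len(caught))        # number of warnings raised for unknown labels
```

If this happens on the real data, the affected labels cannot be predicted or scored, so it is worth checking the warning count once after `mlb.transform`.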

final preprocessing (using pipeline fit)

In [16]:
nominal_columns = ['sex']
numerical_columns = X_train_0.select_dtypes(exclude='object').columns

In [17]:
categorical_pipe = Pipeline([
    ('nulls', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_pipe = Pipeline([
    ('nulls', SimpleImputer()),
    ('scale', StandardScaler()),
    #('poly', PolynomialFeatures(degree=2)),
    #('pca', PCA(n_components=0.99))
])

#### **ML**

In [18]:
ct = ColumnTransformer(
    transformers=[
        ('numerical', numerical_pipe, numerical_columns),
        ('categorical', categorical_pipe, nominal_columns)
    ],
    remainder='passthrough'
)

baseline models:
1. Logistic regression
2. SVM

In [19]:
n_jobs = 10

In [21]:
clf = OneVsRestClassifier(LogisticRegression(max_iter=2000), n_jobs=n_jobs)

base_model = Pipeline([
    ("preprocess", ct),
    ("clf", clf)
])

# Raw
model_raw = clone(base_model)
model_raw.fit(X_tr_0, y_tr_bin)
val_logits_raw = model_raw.decision_function(X_val_0)
lr_res_raw = score_overall_and_rc(y_val_bin, val_logits_raw, mlb, split_name="val-raw")

# Bandpass
model_filt = clone(base_model)
model_filt.fit(X_tr_1, y_tr_bin)
val_logits_filt = model_filt.decision_function(X_val_1)
lr_res_bandpass = score_overall_and_rc(y_val_bin, val_logits_filt, mlb, split_name="val-filt")

pass


[val-raw] overall: {'f1_micro': np.float64(0.5475216751185997), 'f1_macro': np.float64(0.0975296181685033), 'f1_samples': np.float64(0.5919852839636304), 'hamming_loss': 0.014481068855754732, 'auroc_micro': np.float64(0.9593152132080986), 'auroc_macro': np.float64(0.8315732082349065), 'ap_micro': np.float64(0.5913927084082796), 'ap_macro': np.float64(0.18167702515884385)}
[val-raw] conditions: {'f1_micro': np.float64(0.19607843137254902), 'f1_macro': np.float64(0.06218994171108674), 'f1_samples': np.float64(0.06636599402347433), 'hamming_loss': 0.010809195725534309, 'auroc_micro': np.float64(0.9452502687444373), 'auroc_macro': np.float64(0.834190032181299), 'ap_micro': np.float64(0.3190265845714747), 'ap_macro': np.float64(0.1460152977607099)}
[val-raw] rhythm: {'f1_micro': np.float64(0.7587742273441592), 'f1_macro': np.float64(0.3943829004108023), 'f1_samples': np.float64(0.6942421259842518), 'hamming_loss': 0.045324803149606296, 'auroc_micro': np.float64(0.9616851610586743), 'auroc_m

single label classification subset

- we extract the single-label (rhythm) part and evaluate it
- we also evaluate the multilabel conditions part
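
A minimal sketch of this decomposition, assuming the set of rhythm codes is known up front and that every record carries exactly one rhythm label. The class names, the `rhythm_codes` set, and the scores are hypothetical; the notebook's own `score_overall_and_rc` may work differently.

```python
import numpy as np

# Hypothetical inputs: class order as in `mlb.classes_` and a made-up set of rhythm codes.
classes = np.array(["r1", "c1", "r2", "c2", "c3"])
rhythm_codes = {"r1", "r2"}

y_true = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
])
scores = np.array([
    [0.9, 0.1, 0.3, 0.6, 0.2],
    [0.2, 0.7, 0.8, 0.1, 0.1],
    [0.4, 0.2, 0.5, 0.3, 0.9],
])

# Boolean column masks for the two label groups.
rhythm_mask = np.isin(classes, list(rhythm_codes))
cond_mask = ~rhythm_mask

# Rhythm: single-label view, predict the highest-scoring rhythm column per row.
rhythm_true = y_true[:, rhythm_mask].argmax(axis=1)
rhythm_pred = scores[:, rhythm_mask].argmax(axis=1)
rhythm_accuracy = (rhythm_true == rhythm_pred).mean()

# Conditions: multilabel view, threshold each column independently.
cond_true = y_true[:, cond_mask]
cond_pred = (scores[:, cond_mask] >= 0.5).astype(int)
cond_hamming = (cond_true != cond_pred).mean()

print(rhythm_accuracy, cond_hamming)
```

The argmax step is only valid under the one-rhythm-per-record assumption; records with zero or several rhythm codes would need the multilabel treatment instead.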

SVM

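- logistic regression is linear in its inputs, so it may need polynomial features to capture non-linear relationships
- we use an RBF SVM to capture such relationships without explicit feature expansion
- NOTE: kernel SVM training scales poorly in memory and time with the number of samples, so we downsample the training set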

In [23]:
# SVM
frac = 0.15

svm_clf = OneVsRestClassifier(SVC(kernel="rbf", C=10), n_jobs=n_jobs)
svm_base = Pipeline([
    ("preprocess", ct),
    ("clf", svm_clf),
])

# ----- Raw -----
# X_tr_0 was re-indexed from 0, so the sampled index labels are also row positions in y_tr_bin
idx_raw = X_tr_0.sample(frac=frac, random_state=TTS_SEED).index
X_tr_raw_s = X_tr_0.loc[idx_raw]
y_tr_raw_s = y_tr_bin[idx_raw]

svm_raw = clone(svm_base)
svm_raw.fit(X_tr_raw_s, y_tr_raw_s)
val_logits_raw = svm_raw.decision_function(X_val_0)
score_overall_and_rc(y_val_bin, val_logits_raw, mlb, split_name="svm-val-raw")

# ----- Bandpass -----
idx_filt = X_tr_1.sample(frac=frac, random_state=TTS_SEED).index
X_tr_filt_s = X_tr_1.loc[idx_filt]
y_tr_filt_s = y_tr_bin[idx_filt]

svm_filt = clone(svm_base)
svm_filt.fit(X_tr_filt_s, y_tr_filt_s)
val_logits_filt = svm_filt.decision_function(X_val_1)
score_overall_and_rc(y_val_bin, val_logits_filt, mlb, split_name="svm-val-filt")

pass

[svm-val-raw] overall: {'f1_micro': np.float64(0.12489289149252426), 'f1_macro': np.float64(0.09960835718340941), 'f1_samples': np.float64(0.12279249926212595), 'hamming_loss': 0.13099713101021948, 'auroc_micro': np.float64(0.8780046539642714), 'auroc_macro': np.float64(0.7512904290338757), 'ap_micro': np.float64(0.46313835830951433), 'ap_macro': np.float64(0.15978435631328622)}
[svm-val-raw] conditions: {'f1_micro': np.float64(0.027043071037530505), 'f1_macro': np.float64(0.06452243194413466), 'f1_samples': np.float64(0.022312995262330663), 'hamming_loss': 0.1296312570303712, 'auroc_micro': np.float64(0.8096161073102175), 'auroc_macro': np.float64(0.7382954993339191), 'ap_micro': np.float64(0.14725768493472047), 'ap_macro': np.float64(0.12053384599086965)}
[svm-val-raw] rhythm: {'f1_micro': np.float64(0.5052127841394634), 'f1_macro': np.float64(0.39433012919331734), 'f1_samples': np.float64(0.478387467191601), 'hamming_loss': 0.14247047244094488, 'auroc_micro': np.float64(0.9328179265

**XGBoost Optimized** (ensemble model)

- gradient-boosted tree ensemble
- CV strategies
- optimization of the preprocessing transformations
- hyperparameter optimization
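
The cross-validation and search steps listed above could be wired together as in the sketch below. It is an illustration only: it runs on synthetic data, swaps in `LogisticRegression` for the XGBoost estimator so that it needs neither `xgboost` nor a GPU, and the search space is an assumption rather than the notebook's actual configuration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multilabel data standing in for the ECG feature tables.
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

search_pipe = Pipeline([
    ("nulls", SimpleImputer()),
    ("scale", StandardScaler()),
    # Stand-in estimator; an XGBClassifier would take this place in the notebook.
    ("model", OneVsRestClassifier(LogisticRegression(max_iter=500))),
])

# The search space covers a transformation choice and a model hyperparameter.
param_distributions = {
    "nulls__strategy": ["mean", "median"],
    "model__estimator__C": [0.01, 0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(
    search_pipe,
    param_distributions=param_distributions,
    n_iter=4,
    scoring="f1_micro",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the preprocessing inside the searched pipeline means the imputer and scaler are refit on each training fold, so no validation-fold statistics leak into the fit.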

In [22]:
import xgboost as xgb

In [40]:
n_estimators = 600
learning_rate = 0.05
max_depth = 4
subsample = 0.8
colsample_bytree = 0.8
reg_lambda = 1

n_jobs = 10

In [41]:
xgb_params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "device": "cuda",
    "n_estimators": n_estimators,
    "learning_rate": learning_rate,
    "max_depth": max_depth,
    "subsample": subsample,
    "colsample_bytree": colsample_bytree,
    "reg_lambda": reg_lambda,
    "n_jobs": n_jobs,
    "random_state": TTS_SEED,
}

In [42]:
xgb_base = xgb.XGBClassifier(**xgb_params)

In [45]:
model = OneVsRestClassifier(xgb_base, n_jobs=1)

xgb_pipeline = Pipeline([
    ('preprocess', ct),
    ('model', model)
])

In [46]:
# raw
xgb_raw = clone(xgb_pipeline)
xgb_raw.fit(X_tr_0, y_tr_bin)

P_val_raw = xgb_raw.predict_proba(X_val_0)
xgb_res_raw = score_overall_and_rc(
    y_val_bin, P_val_raw, mlb,
    split_name="xgb-ovr-val-raw",
    threshold=0.5,
    inputs_are_probs=True,
)

# bandpass
xgb_filt = clone(xgb_pipeline)
xgb_filt.fit(X_tr_1, y_tr_bin)

P_val_filt = xgb_filt.predict_proba(X_val_1)
xgb_res_filt = score_overall_and_rc(
    y_val_bin, P_val_filt, mlb,
    split_name="xgb-ovr-val-filt",
    threshold=0.5,
    inputs_are_probs=True,
)

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




[xgb-ovr-val-raw] overall: {'f1_micro': np.float64(0.6618153963998468), 'f1_macro': np.float64(0.14465389761258513), 'f1_samples': np.float64(0.7247915415730377), 'hamming_loss': 0.011557107555704473, 'auroc_micro': np.float64(0.97701716761247), 'auroc_macro': np.float64(0.8740862485649855), 'ap_micro': np.float64(0.7539490506980515), 'ap_macro': np.float64(0.2507852457167594)}
[xgb-ovr-val-raw] conditions: {'f1_micro': np.float64(0.3472330475448168), 'f1_macro': np.float64(0.10349082278569197), 'f1_samples': np.float64(0.14486205914601583), 'hamming_loss': 0.009813226471691038, 'auroc_micro': np.float64(0.9613893588079176), 'auroc_macro': np.float64(0.8710263973175587), 'ap_micro': np.float64(0.4510165641647809), 'ap_macro': np.float64(0.20955080688598782)}
[xgb-ovr-val-raw] rhythm: {'f1_micro': np.float64(0.8655812192351382), 'f1_macro': np.float64(0.4904237261584875), 'f1_samples': np.float64(0.8303395669291339), 'hamming_loss': 0.02620570866141732, 'auroc_micro': np.float64(0.98991

___
#### Testing hold-out set
- running final model + baseline once on the test set

#### Scoring system

- Multilabel decomposition:
    - rhythm SNOMED CT codes evaluation (single label)
    - conditions SNOMED CT codes evaluation (multi label)

- Evaluation analysis:
    - best/worst performing rhythms
    - best/worst performing conditions
    - rare labels performance
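
One way to produce the per-label view for this analysis is to compute F1 per column and sort it alongside each label's support. The class names, predictions, and the `rare_threshold` cut-off below are hypothetical placeholders for `mlb.classes_`, `y_val_bin`, and thresholded model outputs.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical class names and predictions.
class_names = np.array(["code_a", "code_b", "code_c", "code_d"])
y_true = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
])
y_pred = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
])

# average=None returns one F1 value per label column.
per_label_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
support = y_true.sum(axis=0)

# Sort labels from worst to best F1 and flag rare ones by support.
order = np.argsort(per_label_f1, kind="stable")
rare_threshold = 2  # assumed cut-off for calling a label "rare"
for i in order:
    tag = "rare" if support[i] < rare_threshold else "common"
    print(f"{class_names[i]}: f1={per_label_f1[i]:.2f} support={support[i]} ({tag})")
```

Reading F1 next to support helps separate labels the model genuinely confuses from labels that simply have too few positive examples.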

- threshold tuning
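
A minimal sketch of per-label threshold tuning, assuming thresholds are chosen on the validation split by a grid search that maximizes each label's F1. The helper names, the grid, and the data are all illustrative and are not part of `scripts.scoring`.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no true or predicted positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_thresholds(y_true, probs, grid):
    """Pick, for each label column, the grid threshold with the highest F1."""
    best = np.full(y_true.shape[1], 0.5)
    for j in range(y_true.shape[1]):
        scores = [f1_binary(y_true[:, j], (probs[:, j] >= t).astype(int)) for t in grid]
        best[j] = grid[int(np.argmax(scores))]  # the first maximum wins on ties
    return best

# Hypothetical validation data: two labels, four records.
y_val_demo = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
p_val_demo = np.array([[0.9, 0.2], [0.6, 0.4], [0.3, 0.1], [0.2, 0.35]])

grid = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
thresholds = tune_thresholds(y_val_demo, p_val_demo, grid)
print(thresholds)
```

Thresholds tuned this way should be frozen before touching the test split; for rare labels with a handful of positives the chosen value can be very noisy.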

___
#### **DL (Deep Learning)**

PyTorch installation test and GPU support check

In [1]:
import torch

In [7]:
x = torch.rand(5, 3)
print(x)

cuda = torch.cuda.is_available()
print(cuda)
if cuda:
    print("cuda device count:", torch.cuda.device_count())
    print("cuda device name:", torch.cuda.get_device_name())

tensor([[0.6856, 0.1604, 0.8489],
        [0.8753, 0.4726, 0.5957],
        [0.9645, 0.2421, 0.3777],
        [0.2628, 0.0031, 0.7061],
        [0.7295, 0.6117, 0.3450]])
True
cuda device count: 1
cuda device name: NVIDIA GeForce RTX 3060 Laptop GPU


DL preprocessing
- we will pass the raw ECG signals, one record per row, instead of the hand-computed features used in the ML pipeline
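
Raw records usually differ in length, so they need a common shape before they can be batched. The sketch below is one possible approach under stated assumptions: signals are `(n_leads, n_samples)` arrays, each lead is z-scored, and records are cropped or zero-padded to a fixed `target_len`. The helper names, the 12-lead layout, and the length of 5000 are assumptions, not values from this notebook.

```python
import numpy as np

def normalize_per_lead(signal, eps=1e-8):
    """Z-score each lead independently; eps avoids division by zero on flat leads."""
    mean = signal.mean(axis=1, keepdims=True)
    std = signal.std(axis=1, keepdims=True)
    return (signal - mean) / (std + eps)

def to_fixed_length(signal, n_samples):
    """Crop or zero-pad a (n_leads, length) signal along time to n_samples."""
    n_leads, length = signal.shape
    if length >= n_samples:
        return signal[:, :n_samples]
    padded = np.zeros((n_leads, n_samples), dtype=signal.dtype)
    padded[:, :length] = signal
    return padded

# Hypothetical records with 12 leads and different durations.
rng = np.random.default_rng(0)
records = [rng.normal(size=(12, n)) for n in (4000, 5000, 6200)]

target_len = 5000  # assumed fixed length
# Normalize first so that the zero padding does not distort the per-lead statistics.
batch = np.stack([to_fixed_length(normalize_per_lead(r), target_len) for r in records])
print(batch.shape)
```

The resulting array can be converted with `torch.from_numpy` and served through a dataset class; whether cropping from the start is appropriate depends on where the informative beats lie in each recording.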

___
#### xAI: Model Explainability
ideas:
- integrated gradients
- knowledge distillation: DL -> ML
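
To make the distillation idea concrete, here is a small sketch on synthetic binary data. A scikit-learn `MLPClassifier` stands in for the DL teacher and a shallow `DecisionTreeRegressor` for the ML student; both choices, and all variable names, are assumptions for illustration rather than the models planned for this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeRegressor

X_d, y_d = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_d, y_d, test_size=0.25, random_state=0
)

# Teacher: a small neural network standing in for the DL model.
teacher = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
teacher.fit(X_fit, y_fit)

# Soft targets: teacher probabilities for the positive class.
soft_targets = teacher.predict_proba(X_fit)[:, 1]

# Student: a shallow tree regressed on the soft targets instead of the hard labels.
student = DecisionTreeRegressor(max_depth=4, random_state=0)
student.fit(X_fit, soft_targets)

student_pred = (student.predict(X_hold) >= 0.5).astype(int)
teacher_pred = teacher.predict(X_hold)

fidelity = (student_pred == teacher_pred).mean()  # agreement with the teacher
student_acc = (student_pred == y_hold).mean()     # accuracy against the true labels
print(f"fidelity={fidelity:.3f} student_acc={student_acc:.3f}")
```

Reporting fidelity separately from accuracy shows whether the student is actually imitating the teacher or merely reaching a similar score by other means; in the multilabel setting the same recipe would be applied per label column.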