## Modeling Pipeline - Deep Learning

This notebook develops the project's ML/DL models, focusing on the modelling pipeline setup and the **baseline** models.

### Research Questions:

1. Build baseline ML and DL models
2. Build optimized ML and DL models
3. Are the optimized models better than the baselines, and by how much?
4. Is the baseline DL model better than the optimized ML model?
5. Knowledge distillation:
    - is the distilled ML model better than the optimized ML model?
    - is the distilled ML model better than the baseline DL model?

In [3]:
import os
import sys

sys.dont_write_bytecode = True
root_dir = os.path.abspath(os.pardir)
if root_dir not in sys.path:
    sys.path.append(root_dir)

In [4]:
import json
import wfdb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from configs.constants import *

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

In [5]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
data_dir = '../data/preprocessing/ML/'
meta_df_file = '../data/results/complete_metadata_mapping_2.csv'

### Models

train test split indices

In [8]:
meta_df = pd.read_csv(meta_df_file)
meta_df['dx_codes'] = meta_df['dx_codes'].map(json.loads)

In [9]:
X = meta_df.drop('dx_codes', axis=1)
y = meta_df['dx_codes']

X_work, X_test, y_work, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=TTS_SEED)

train/validation split

- to avoid optimistic metrics, we evaluate on the test data only once and do all tuning and calibration on the validation split

In [10]:
X_train, X_val, y_train, y_val = train_test_split(X_work, y_work, test_size=0.1, random_state=TTS_SEED)
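Because the validation split is carved out of the working set, its share of the full dataset is `0.1 * (1 - TEST_SIZE)`, not 10%. The sketch below works this out; `test_size=0.2` is a hypothetical value used only for illustration, since `TEST_SIZE` is defined in `configs.constants` and is not shown here.

```python
def split_fractions(test_size, val_size):
    """Fractions of the full dataset that end up in train / val / test."""
    work = 1.0 - test_size   # share left after the train/test split
    val = work * val_size    # validation is a fraction of the working set
    train = work - val
    return train, val, test_size


# test_size=0.2 is an assumed placeholder, not the notebook's TEST_SIZE
train_frac, val_frac, test_frac = split_fractions(test_size=0.2, val_size=0.1)
print(f"train={train_frac:.2f} val={val_frac:.2f} test={test_frac:.2f}")
```

Actual row counts can differ slightly from these fractions because `train_test_split` rounds to whole rows.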

**model preprocessing**

label encoding

In [11]:
mlb = MultiLabelBinarizer()
y_train_transformed = mlb.fit_transform(y_train)
y_val_transformed = mlb.transform(y_val)
y_test_transformed = mlb.transform(y_test)
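A small illustration of what the encoding above produces. The label names `"A"` to `"D"` are hypothetical placeholders, not real diagnosis codes. It also shows why the binarizer is fitted on the training split only: a label that appears only in validation or test data is dropped (with a warning) rather than given its own column.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

toy_mlb = MultiLabelBinarizer()

# fit on "training" label lists; columns follow the sorted classes A, B, C
toy_train_y = toy_mlb.fit_transform([["A"], ["B", "C"], ["C"]])
print(toy_mlb.classes_)
print(toy_train_y)

# "D" was never seen during fit, so transform ignores it
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the unknown-class warning
    toy_val_y = toy_mlb.transform([["A", "D"]])
print(toy_val_y)
```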

___
#### **DL (Deep Learning)**

PyTorch installation test + GPU support check

In [12]:
import torch

In [13]:
x = torch.rand(5, 3)
print(x)

cuda = torch.cuda.is_available()
print(cuda)
if cuda:
    print("cuda device count:", torch.cuda.device_count())
    print("cuda device name:", torch.cuda.get_device_name())

tensor([[0.5955, 0.0385, 0.0434],
        [0.2217, 0.4618, 0.0509],
        [0.4574, 0.9149, 0.1969],
        [0.3824, 0.3900, 0.4294],
        [0.8161, 0.1438, 0.4660]])
True
cuda device count: 1
cuda device name: NVIDIA GeForce RTX 3060 Laptop GPU


DL preprocessing
- we pass the raw ECG signals to the model, loading one record per metadata row (instead of computing hand-crafted features as in the ML pipeline)

In [14]:
class ECGDataset(Dataset):
    """
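    df: pandas DataFrame with a column (record_col) containing WFDB record base paths
    y:  array-like of shape [N, C] (multi-hot targets), stored as float32
    Returns (from __getitem__):
      x: torch.FloatTensor [leads, T]
      y: torch.FloatTensor [C]
      or None if the record cannot be read (dropped later by collate_skip_none)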
    """
    
    def __init__(self, df, y, record_col='record_path', dtype=np.float32):
        self.df = df.reset_index(drop=True)
        self.y = np.asarray(y, dtype=np.float32)
        self.record_col = record_col
        self.dtype = dtype

        if len(self.df) != self.y.shape[0]:
            raise ValueError(f"df has {len(self.df)} rows but y has {self.y.shape[0]} rows")


    def __len__(self):
        return len(self.df)


    def __getitem__(self, idx):
        rec = self.df.loc[idx, self.record_col]

        try:
            signals, _fields = wfdb.rdsamp(rec)
            signals = np.asarray(signals, dtype=self.dtype)
            
            x = torch.from_numpy(signals.T)
            y = torch.from_numpy(self.y[idx])
            return x, y
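        except Exception:
            # unreadable or missing record: signal it with None so the
            # collate function can drop this sample from the batch
            return None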

In [22]:
def collate_skip_none(batch):
    batch = [b for b in batch if b is not None]
    if len(batch) == 0:
        return None
    return default_collate(batch)


def make_loader(dataset, batch_size, shuffle, num_workers=4):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        pin_memory=True,
        persistent_workers=(num_workers > 0),
        collate_fn=collate_skip_none
    )
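A quick check of the skipping behaviour, using synthetic tensors in place of real records. The signal length `T = 1000` and the five labels are arbitrary values chosen for the example.

```python
import torch
from torch.utils.data.dataloader import default_collate


def collate_skip_none(batch):
    # same logic as the cell above: drop failed samples, then collate
    batch = [b for b in batch if b is not None]
    if len(batch) == 0:
        return None
    return default_collate(batch)


T = 1000  # arbitrary signal length for the example
good = (torch.zeros(12, T), torch.zeros(5))  # (x [leads, T], y [C])

# one failed sample (None) between two good ones: the batch shrinks to 2
toy_batch = collate_skip_none([good, None, good])
print(toy_batch[0].shape, toy_batch[1].shape)

# every sample failed: the whole batch is None and must be skipped by the caller
empty_batch = collate_skip_none([None, None])
print(empty_batch)
```

Note that `default_collate` stacks the signals, so all records in a batch need the same length `T`; records of different lengths would need cropping or padding first.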

load test

In [23]:
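def test_first_n(dataset, n=10, expected_leads=12):
    """Load the first n items; return ({record: {'X', 'y'}}, [indices that failed])."""
    ok = {}
    bad = []

    for i in range(min(n, len(dataset))):
        item = dataset[i]
        # unreadable records come back as None; also flag an unexpected lead count
        if item is None or item[0].shape[0] != expected_leads:
            bad.append(i)
            continue

        x, y = item
        rec = dataset.df.iloc[i][dataset.record_col]
        ok[rec] = {
            'X' : x,
            'y' : y
        }

    return ok, bad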

In [24]:
ds10 = ECGDataset(X_train.iloc[:10], y_train_transformed[:10], record_col='record_path')
ok, bad = test_first_n(ds10, n=10)

In [25]:
#ok

In [26]:
print(len(ok))
print(len(bad))

10
0


baseline model:
1. simple CNN - ECG only
2. simple CNN - ECG + meta attributes (age, sex)

1. ECG only

In [27]:
class SmallECGCNN(nn.Module):
    def __init__(self, n_labels: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(12, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(128, n_labels)

    def forward(self, x):
        z = self.backbone(x)
        return self.head(z)
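Each `Conv1d` layer above uses `kernel_size=7, stride=2, padding=3`, so it roughly halves the time axis before the adaptive pooling collapses it to a single value per channel. The sketch below applies the standard `Conv1d` output-length formula; the input length of 5000 samples is an assumption for illustration, not a value taken from this notebook.

```python
def conv1d_out_len(length, kernel_size=7, stride=2, padding=3, dilation=1):
    """Output length of a Conv1d layer (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


T = 5000  # assumed input length, for illustration only
lengths = [T]
for _ in range(3):  # the backbone has three identical-geometry conv layers
    lengths.append(conv1d_out_len(lengths[-1]))

print(lengths)  # time-axis length before and after each conv layer
```

Because `AdaptiveAvgPool1d(1)` averages over whatever length remains, the model accepts any input length long enough to survive the three convolutions.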

In [28]:
def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0.0
    n_seen = 0

    for batch in loader:
        if batch is None:
            continue
        x, y = batch
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        logits = model(x)
        loss = criterion(logits, y)

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        bs = x.size(0)
        total_loss += loss.item() * bs
        n_seen += bs

    return total_loss / max(n_seen, 1)

@torch.no_grad()
def eval_loss(model, loader, criterion, device):
    model.eval()
    total_loss = 0.0
    n_seen = 0

    for batch in loader:
        if batch is None:
            continue
        x, y = batch
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        logits = model(x)
        loss = criterion(logits, y)

        bs = x.size(0)
        total_loss += loss.item() * bs
        n_seen += bs

    return total_loss / max(n_seen, 1)


In [1]:
num_workers = 0

In [32]:
X_train_ds = ECGDataset(X_train, y_train_transformed, record_col='record_path')
X_val_ds = ECGDataset(X_val, y_val_transformed, record_col='record_path')

train_loader = make_loader(X_train_ds, batch_size=64, shuffle=True, num_workers=num_workers)
val_loader   = make_loader(X_val_ds, batch_size=64, shuffle=False, num_workers=num_workers)

# number of labels C (size of the model's output layer), not the number of ECG leads
channels = y_train_transformed.shape[1]

load test

In [None]:
# take the first non-empty batch from a single pass over the loader
# (a batch is None when every record in it failed to load)
batch = next(b for b in train_loader if b is not None)
x0, y0 = batch

# result [B, 12, T] and [B, C]
print("x batch:", x0.shape, "y batch:", y0.shape)

In [63]:
epochs = 10
learning_rate = 1e-3

MODEL_DIR = '../models/'
model_name = 'baseline_cnn_ecg_modality.pth'

In [61]:
model = SmallECGCNN(n_labels=channels).to(device)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
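`BCEWithLogitsLoss` expects raw logits, which is why `SmallECGCNN` has no final sigmoid. A minimal sketch on random toy tensors (shapes chosen arbitrarily) showing that it matches a sigmoid followed by `BCELoss`, and how logits map to per-label predictions at an assumed threshold of 0.5:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_logits = torch.randn(4, 3)                     # [batch, labels], raw model outputs
toy_targets = torch.randint(0, 2, (4, 3)).float()  # multi-hot targets

# fused, numerically stable version (used in this notebook)
loss_fused = nn.BCEWithLogitsLoss()(toy_logits, toy_targets)
# equivalent two-step version: sigmoid, then binary cross-entropy
loss_manual = nn.BCELoss()(torch.sigmoid(toy_logits), toy_targets)
print(loss_fused.item(), loss_manual.item())

# at inference time, apply the sigmoid explicitly and threshold each label
toy_probs = torch.sigmoid(toy_logits)
toy_preds = (toy_probs >= 0.5).int()  # 0.5 is an assumed default threshold
```

Each label is treated as an independent binary problem, so several labels can be predicted for one record.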

baseline model training

In [64]:
best_val = float('inf')

for epoch in range(1, epochs+1):
    tr = train_one_epoch(model, train_loader, optimizer, criterion, device)
    va = eval_loss(model, val_loader, criterion, device)
    print(f"epoch={epoch} train_loss={tr:.4f} val_loss={va:.4f}")

    if va < best_val:
        best_val = va
        model_path = os.path.abspath(os.path.join(MODEL_DIR, model_name))
        torch.save(model.state_dict(), model_path)

TypeError: object of type 'NoneType' has no len()
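Once training runs through, the saved checkpoint holds only the weights (`state_dict`), so restoring it means rebuilding the architecture first and then loading the weights into it. A minimal round-trip sketch, using a small `nn.Linear` as a stand-in for `SmallECGCNN` and a temporary directory instead of `MODEL_DIR`:

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(8, 3)  # stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_checkpoint.pth")  # hypothetical file name
    torch.save(toy_model.state_dict(), toy_path)

    # rebuild the same architecture, then load the weights into it
    restored = nn.Linear(8, 3)
    state = torch.load(toy_path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()  # switch to inference mode before evaluating
toy_x = torch.randn(2, 8)
with torch.no_grad():
    same_output = torch.allclose(toy_model(toy_x), restored(toy_x))
print(same_output)
```

`torch.save` does not create missing directories, so it may be worth calling `os.makedirs(MODEL_DIR, exist_ok=True)` before the training loop.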

Optimized architectures:
- Multimodal: ECG + meta (sex, age)
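One possible shape for the multimodal variant, sketched under assumptions: the pooled CNN features are concatenated with a small metadata vector (for example, scaled age and encoded sex) before the classification head. The class name `ECGMetaCNN`, the `n_meta` parameter, and this late-fusion design are illustrative choices, not the project's final architecture.

```python
import torch
import torch.nn as nn


class ECGMetaCNN(nn.Module):
    """Hypothetical late-fusion model: CNN features of the ECG + metadata vector."""

    def __init__(self, n_labels: int, n_meta: int = 2):
        super().__init__()
        # same backbone layout as SmallECGCNN
        self.backbone = nn.Sequential(
            nn.Conv1d(12, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        # the head sees the 128 signal features plus the metadata features
        self.head = nn.Linear(128 + n_meta, n_labels)

    def forward(self, x, meta):
        z = self.backbone(x)              # [B, 128]
        z = torch.cat([z, meta], dim=1)   # [B, 128 + n_meta]
        return self.head(z)               # [B, n_labels] logits


toy_mm = ECGMetaCNN(n_labels=5)
toy_out = toy_mm(torch.randn(4, 12, 1000), torch.randn(4, 2))
print(toy_out.shape)
```

Using this model would also require the dataset to return the metadata vector alongside the signal, and the training loop to pass both inputs.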

#### Evaluation system - TODO

1. Multi-label classification eval
2. Problem decomposition eval:
    - single-label classification part (rhythm prediction)
    - multi-label classification part (conditions prediction)

In [41]:
from sklearn.metrics import (
    f1_score,
    roc_auc_score,
    average_precision_score,
    hamming_loss,
    confusion_matrix,
    multilabel_confusion_matrix,
    classification_report,
)
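A minimal sketch of how some of these metrics could be applied to multi-label output, using small hand-made arrays rather than real model predictions. The 0.5 decision threshold is an assumption; thresholded metrics (F1, Hamming loss) depend on it, while ROC AUC is computed from the probabilities directly.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, roc_auc_score

# toy data: 4 records, 3 labels
toy_true = np.array([[1, 0, 0],
                     [0, 1, 1],
                     [0, 0, 1],
                     [1, 1, 0]])
toy_prob = np.array([[0.9, 0.2, 0.1],
                     [0.3, 0.8, 0.4],
                     [0.1, 0.3, 0.7],
                     [0.6, 0.4, 0.2]])
toy_pred = (toy_prob >= 0.5).astype(int)  # assumed threshold

micro_f1 = f1_score(toy_true, toy_pred, average="micro")  # pools all label decisions
macro_f1 = f1_score(toy_true, toy_pred, average="macro")  # unweighted mean over labels
ham = hamming_loss(toy_true, toy_pred)                    # share of wrong label decisions
macro_auc = roc_auc_score(toy_true, toy_prob, average="macro")

print(f"micro_f1={micro_f1:.3f} macro_f1={macro_f1:.3f} "
      f"hamming={ham:.3f} macro_auc={macro_auc:.3f}")
```

In this toy case the ranking is perfect (high AUC) while F1 is lower, because two positives fall below the threshold. `roc_auc_score` raises an error for a label that has only one class present in `y_true`, so rare labels may need special handling.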

___
#### xAI: Model Explainability
ideas:
- integrated gradients
- knowledge distillation: DL -> ML
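A rough sketch of the integrated gradients idea for one output label: gradients are averaged along a straight path from a baseline input to the actual input and then scaled by the input difference. The all-zeros baseline, the number of steps, and the function name `integrated_gradients` are assumptions made for this example; a dedicated library could be used instead.

```python
import torch
import torch.nn as nn


def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Approximate integrated gradients of model(x)[:, target] w.r.t. x."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # assumed baseline: flat zero signal
    total_grad = torch.zeros_like(x)
    # Riemann-sum approximation of the path integral
    for alpha in torch.linspace(0.0, 1.0, steps + 1)[1:]:
        xi = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        out = model(xi)[:, target].sum()
        grad, = torch.autograd.grad(out, xi)
        total_grad += grad
    return (x - baseline) * total_grad / steps


torch.manual_seed(0)
# toy linear model standing in for the trained network
toy_net = nn.Sequential(nn.Flatten(), nn.Linear(12 * 100, 3)).eval()
toy_sig = torch.randn(2, 12, 100)  # [batch, leads, T]

attr = integrated_gradients(toy_net, toy_sig, target=1)

# completeness check: attributions should sum to f(x) - f(baseline)
with torch.no_grad():
    delta = toy_net(toy_sig)[:, 1] - toy_net(torch.zeros_like(toy_sig))[:, 1]
print(attr.sum(dim=(1, 2)), delta)
```

The attribution tensor has the same shape as the input, so it can be plotted per lead over time to see which signal segments drive a given label.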