## Using EpiNT as a Feature Extractor

### 1. Short Introduction

Traditional machine learning frameworks rely on manual feature engineering: constructing and extracting meaningful features for classification. This process requires expert knowledge and a deep understanding of the data.

In this notebook, we illustrate how EpiNT can act as a feature extractor that produces features (embeddings) for a downstream machine learning classifier.

### 2. Load Configurations and EpiNT

In [1]:
import torch
import h5py

from yaml import CLoader as Loader
from yaml import load
from argparse import Namespace
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from epint.model.EpiNTModel import EpiNT


In [2]:
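def read_yaml(file):
    """Parse a YAML file into a Python dict."""
    with open(file, 'r') as f:
        return load(f, Loader=Loader)

def merge_config(base_config, new_config):
    """Shallow-merge new_config into base_config; new values win on key clashes.

    Note: base_config is modified in place and also returned.
    """
    for key, value in new_config.items():
        base_config[key] = value
    return base_config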
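
The merge performed by `merge_config` is shallow and mutates its first argument. The sketch below illustrates both properties using made-up keys (`dropout`, `optimizer`) that are hypothetical and chosen only for illustration. The configuration files used in this notebook appear to be flat, so the nested case may never arise here.

```python
def merge_config(base_config, new_config):
    # Same logic as the notebook helper: later values overwrite earlier ones.
    for key, value in new_config.items():
        base_config[key] = value
    return base_config

# Hypothetical configs, for illustration only.
base = {'dropout': 0.1, 'optimizer': {'name': 'AdamW', 'lr': 1e-4}}
override = {'dropout': 0.2, 'optimizer': {'name': 'SGD'}}

merged = merge_config(base, override)

print(merged)          # {'dropout': 0.2, 'optimizer': {'name': 'SGD'}}
print(merged is base)  # True: the base dict was modified in place

# A non-mutating alternative with the same shallow semantics:
fresh = {**{'dropout': 0.1}, **{'dropout': 0.2}}
print(fresh)           # {'dropout': 0.2}
```

Note that the nested `optimizer` dict is replaced wholesale, so its `lr` entry is lost. If the default configuration needs to be reused afterwards, the dict-unpacking form avoids changing it.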

In [3]:
finetune_config_path = '../configs/finetune_configs.yaml'
default_config_path = '../configs/default.yaml'

finetune_config = read_yaml(finetune_config_path)
default_config = read_yaml(default_config_path)
merged_config = merge_config(default_config, finetune_config)

# convert config to arguments
args = Namespace(**merged_config)
print(merged_config)

{'patch_len': 256, 'stride_len': 256, 'sequence_len': 3072, 'dropout': 0.1, 'cls_token': True, 'downstream_dataset': '/home/ZRK/ZRK_ssd2/ZRK/Engineering_Server/Epileptogenic/dataset/downstream_wo_sfreq_equal_cpres_chn.hdf5', 'RESULTS_DIR': '/home/ZRK/sda_data/zrkdata/Epileptogenic_Results/', 'experiment_name': 'linear_probing', 'seizure_task': 'hfo_ied_detec2', 'model_name': 'EpiNT', 'patch_mask': False, 'mask_ratio': 0, 'head_dropout': 0.1, 'task': 'classification', 'd_model': 512, 'dim_feedforward': 2048, 'num_heads': 8, 'num_layers': 6, 'codebook_dim': 64, 'codebook_size': 256, 'num_quantizer': 1, 'optimizer_name': 'AdamW', 'lr_scheduler_type': 'onecyclelr', 'pct_start': 0.3, 'init_lr': 0.0001, 'three_phase': False, 'weight_decay': 0.05, 'num_workers': 4, 'max_norm': 5.0, 'max_epoch': 10, 'train_ratio': 0.7, 'train_batch_size': 1024, 'debug': True, 'seed': 666, 'run_name': '20241226_015111'}
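
As a rough sanity check on the printed configuration, the number of patches per input can be estimated from `sequence_len`, `patch_len`, and `stride_len`. The sketch below assumes the common sliding-window patching formula; whether EpiNT patches its input exactly this way is an assumption, not something confirmed in this notebook, and `num_patches` is a hypothetical helper.

```python
def num_patches(sequence_len, patch_len, stride_len):
    """Count sliding-window patches (assumed formula, no padding)."""
    if sequence_len < patch_len:
        raise ValueError("sequence_len must be at least patch_len")
    return (sequence_len - patch_len) // stride_len + 1

# Values taken from the merged configuration printed above.
print(num_patches(sequence_len=3072, patch_len=256, stride_len=256))  # 12
```

Because `stride_len` equals `patch_len`, the patches would not overlap under this assumption.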


In [4]:
model = EpiNT(args)

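# Load the fine-tuned weights. map_location='cpu' lets a checkpoint
# that was saved on a GPU also load on a CPU-only machine.
finetune_weights_path = '../weights/representations.bin'
finetune_weights = torch.load(finetune_weights_path, map_location='cpu')
model.load_state_dict(finetune_weights, strict=False)

# Since we use an additional classifier, the original head is omitted,
# so the missing 'head.*' keys reported below are expected.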

_IncompatibleKeys(missing_keys=['head.weight', 'head.bias'], unexpected_keys=['embed.mask_encoding'])
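
The `_IncompatibleKeys` output above is the return value of `load_state_dict(..., strict=False)`, which lists parameters the model expects but the checkpoint lacks (`missing_keys`) and checkpoint entries the model does not use (`unexpected_keys`). A minimal sketch with two toy modules shows the same pattern; `Backbone`, `WithHead`, and the `extra` key are hypothetical stand-ins, not part of EpiNT.

```python
import torch
from torch import nn

class Backbone(nn.Module):
    """Toy model without a classification head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)

class WithHead(nn.Module):
    """Same encoder plus a head that the checkpoint does not contain."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.head = nn.Linear(4, 2)

checkpoint = Backbone().state_dict()
checkpoint['extra'] = torch.zeros(1)  # entry the target model does not define

result = WithHead().load_state_dict(checkpoint, strict=False)
print(result.missing_keys)     # ['head.weight', 'head.bias']
print(result.unexpected_keys)  # ['extra']
```

With `strict=True` (the default) the same call would raise a `RuntimeError` instead, so it is worth reading the returned lists to confirm that only the expected keys differ.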

### 3. Load Dataset

We construct a sample dataset from the MAYO dataset. We select 1,000 balanced samples (500 negatives vs. 500 positives) for training, and 100 balanced samples (50 negatives vs. 50 positives) for evaluating the logistic regression classifier.

We use a logistic regression classifier because of its simplicity.
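
The notebook does not show how the balanced subset was drawn. One possible way to select an equal number of samples per class is sketched below with NumPy; `balanced_indices` and its parameters are hypothetical and are not the procedure used to build `sample_data.hdf5`.

```python
import numpy as np

def balanced_indices(labels, n_per_class, seed=0):
    """Return shuffled indices with n_per_class samples from every class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for cls in np.unique(labels):
        candidates = np.flatnonzero(labels == cls)
        # Sample without replacement so no index is selected twice.
        picked.append(rng.choice(candidates, size=n_per_class, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)
    return idx

# Toy imbalanced label vector: 30 negatives and 70 positives.
toy_labels = np.array([0] * 30 + [1] * 70)
idx = balanced_indices(toy_labels, n_per_class=10)
print(np.bincount(toy_labels[idx]))  # [10 10]
```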

In [5]:
sample_data_path = '../dataset/sample_data.hdf5'

# Read every array fully into memory; the context manager closes the file afterwards.
with h5py.File(sample_data_path, 'r') as h5file:
    train_data = h5file['train_data'][:]
    train_labels = h5file['train_labels'][:]
    test_data = h5file['test_data'][:]
    test_labels = h5file['test_labels'][:]

train_data = torch.tensor(train_data, dtype=torch.float32).squeeze(1)
train_labels = torch.tensor(train_labels, dtype=torch.long)
test_data = torch.tensor(test_data, dtype=torch.float32).squeeze(1)
test_labels = torch.tensor(test_labels, dtype=torch.long)

print(f'Train data shape: {train_data.shape}, Train labels shape: {train_labels.shape}')
print(f'Test data shape: {test_data.shape}, Test labels shape: {test_labels.shape}')

Train data shape: torch.Size([1000, 3072]), Train labels shape: torch.Size([1000])
Test data shape: torch.Size([100, 3072]), Test labels shape: torch.Size([100])


### 4. Evaluation

In [6]:
"""
We only need EpiNT to work in evaluation mode for feature extraction.
The [cls] token is used to represent the entire sequence.
"""

model.eval()
with torch.no_grad():
    _, train_cls = model(train_data)
    _, test_cls = model(test_data)
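
The cell above passes each split through the model in a single forward call, which is fine for 1,000 samples. For larger datasets, extracting embeddings in mini-batches keeps memory use bounded. The sketch below uses a hypothetical `ToyEncoder` that only mimics the assumed return signature (a tuple whose second element has shape `(batch, 1, d_model)`); the real EpiNT output shapes should be checked before reusing `extract_cls`.

```python
import torch
from torch import nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in returning (None, cls) like the assumed model API."""
    def __init__(self, seq_len=3072, d_model=512):
        super().__init__()
        self.proj = nn.Linear(seq_len, d_model)

    def forward(self, x):
        cls = self.proj(x).unsqueeze(1)  # (batch, 1, d_model)
        return None, cls

@torch.no_grad()
def extract_cls(model, data, batch_size=256):
    """Run the model in eval mode over mini-batches and stack the embeddings."""
    model.eval()
    chunks = []
    for start in range(0, data.shape[0], batch_size):
        _, cls = model(data[start:start + batch_size])
        chunks.append(cls.squeeze(1).cpu())
    return torch.cat(chunks, dim=0)

# Small dimensions so the example runs quickly.
features = extract_cls(ToyEncoder(seq_len=32, d_model=8), torch.randn(10, 32), batch_size=4)
print(features.shape)  # torch.Size([10, 8])
```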


In [7]:
# Drop the singleton dimension and convert the [cls] embeddings to NumPy arrays
train_cls_flat = train_cls.squeeze(1).numpy()
test_cls_flat = test_cls.squeeze(1).numpy()

# Train logistic regression
clf = LogisticRegression(max_iter=1000)
clf.fit(train_cls_flat, train_labels.numpy())

# Evaluate the classifier
train_accuracy = clf.score(train_cls_flat, train_labels.numpy())
test_accuracy = clf.score(test_cls_flat, test_labels.numpy())

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

# Generate confusion matrices
train_cm = confusion_matrix(train_labels.numpy(), clf.predict(train_cls_flat))
test_cm = confusion_matrix(test_labels.numpy(), clf.predict(test_cls_flat))

# Print confusion matrices
print("Train Confusion Matrix:")
print(train_cm)
print("Test Confusion Matrix:")
print(test_cm)

# Print classification report for test data
print("Test Classification Report:")
print(classification_report(test_labels.numpy(), clf.predict(test_cls_flat)))

Train Accuracy: 0.915
Test Accuracy: 0.85
Train Confusion Matrix:
[[456  44]
 [ 41 459]]
Test Confusion Matrix:
[[41  9]
 [ 6 44]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.82      0.85        50
           1       0.83      0.88      0.85        50

    accuracy                           0.85       100
   macro avg       0.85      0.85      0.85       100
weighted avg       0.85      0.85      0.85       100



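The results (accuracy of 0.915 on the training set and 0.85 on the test set, with errors spread roughly evenly across both classes in the confusion matrices) indicate that the extracted features (embeddings) separate the two classes well, even with a simple linear classifier.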
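
The figures in the test classification report can be re-derived from the test confusion matrix by hand. The sketch below does this in plain Python, assuming the scikit-learn convention that rows are true labels and columns are predicted labels; `per_class_metrics` is a hypothetical helper.

```python
# Test confusion matrix copied from the output above.
test_cm = [[41, 9],
           [6, 44]]

def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(row[cls] for row in cm) - tp  # predicted cls, true label differs
    fn = sum(cm[cls]) - tp                 # true cls, predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in test_cm)
accuracy = sum(test_cm[i][i] for i in range(len(test_cm))) / total
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.85

for cls in (0, 1):
    p, r, f1 = per_class_metrics(test_cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# class 0: precision=0.87 recall=0.82 f1=0.85
# class 1: precision=0.83 recall=0.88 f1=0.85
```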