# EEGCanon Benchmark: Cross-Dataset Canonicalization

This notebook demonstrates that EEGCanon produces a consistent, 
auditable, and ML-ready canonical representation across heterogeneous 
EEG datasets (EDF, CSV) without dataset-specific preprocessing.

Goals:
- Show canonical shape consistency
- Show channel and sampling-rate alignment
- Show meaningful warnings and provenance


In [1]:
import numpy as np
import pandas as pd

from eegcanon.pipeline import load_eeg

In [3]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
csv_path = r"D:\CognitiveCore_projects\EEG_framework\EEGcanon\eegcanon\examples\dummy_eeg.csv"

eeg_csv = load_eeg(csv_path)

print("=== Dummy CSV ===")
print("Shape:", eeg_csv.data.shape)
print("Channels:", eeg_csv.channels)
print("Sampling rate:", eeg_csv.sampling_rate)

print("\nWarnings:")
for w in eeg_csv.warnings:
    print("-", w)

print("\nProvenance:")
for k, v in eeg_csv.provenance.items():
    print(f"{k}: {v}")

=== Dummy CSV ===
Shape: (4, 512)
Channels: ['FP1', 'FP2', 'F3', 'F4']
Sampling rate: 256.0


Provenance:
source_format: CSV
channel_policy: 10-20-minimal
original_fs: 256.0
target_fs: 256.0
time_normalized: False
load_time: 2025-12-16T12:45:46.660784


### Observation (Dummy CSV)

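The synthetic CSV recording is converted into a CanonicalEEG object with a fixed
channel order (FP1, FP2, F3, F4) and a sampling rate of 256 Hz. No resampling
was needed, since original_fs already equals target_fs. Only four channels are
present, so warnings about missing canonical channels are expected for this
limited channel set.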
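The channel alignment described above can be sketched as follows. This is a hypothetical illustration, not EEGCanon's implementation: `normalize_name`, `canonicalize_channels`, and the warning text are invented for the example, and the canonical order is assumed to be the 13-channel list printed in the PhysioNet output later in this notebook.

```python
# Hypothetical sketch of channel canonicalization; not EEGCanon's actual code.
# Assumption: the canonical order equals the 13 channels printed for PhysioNet.
CANONICAL_ORDER = ["FP1", "FP2", "F3", "F4", "C3", "C4", "P3", "P4",
                   "O1", "O2", "FZ", "CZ", "PZ"]


def normalize_name(name):
    """Uppercase a channel label and strip surrounding spaces and dots."""
    return name.strip().strip(".").upper()


def canonicalize_channels(raw_names, canonical=CANONICAL_ORDER):
    """Return (kept, source_indices, warnings), with kept in canonical order."""
    lookup = {}
    for idx, raw in enumerate(raw_names):
        # Keep the first occurrence if two labels normalize to the same name.
        lookup.setdefault(normalize_name(raw), idx)
    kept = [ch for ch in canonical if ch in lookup]
    source_indices = [lookup[ch] for ch in kept]
    warnings = [f"Missing canonical channel: {ch}"
                for ch in canonical if ch not in lookup]
    return kept, source_indices, warnings


# Shuffled, inconsistently formatted labels plus one non-canonical channel.
kept, source_indices, warnings = canonicalize_channels(
    ["F4", "fp2", "Fp1.", "F3..", "ECG"]
)
print(kept)            # ['FP1', 'FP2', 'F3', 'F4']
print(source_indices)  # [2, 1, 3, 0]
print(len(warnings))   # 9
```

Returning source indices alongside the names lets a caller reorder the data rows with a single indexing step, so the data and the channel list cannot drift apart.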


## Canonicalization: PhysioNet EEG (EDF)

This section evaluates EEGCanon on real-world clinical EEG data obtained from
PhysioNet (via MNE sample datasets). The dataset contains mixed sensor types and
non-uniform channel naming, providing a realistic stress test.


In [7]:
physionet_path = r"D:\CognitiveCore_projects\mne_dataset\test_edf\S001R01.edf"

eeg_phys = load_eeg(physionet_path)

print("=== PhysioNet EEG ===")
print("Shape:", eeg_phys.data.shape)
print("Channels:", eeg_phys.channels)
print("Sampling rate:", eeg_phys.sampling_rate)

print("\nWarnings:")
for w in eeg_phys.warnings:
    print("-", w)

print("\nProvenance:")
for k, v in eeg_phys.provenance.items():
    print(f"{k}: {v}")

=== PhysioNet EEG ===
Shape: (13, 15616)
Channels: ['FP1', 'FP2', 'F3', 'F4', 'C3', 'C4', 'P3', 'P4', 'O1', 'O2', 'FZ', 'CZ', 'PZ']
Sampling rate: 256.0


Provenance:
source_format: EDF
channel_policy: 10-20-minimal
original_fs: 160.0
target_fs: 256.0
time_normalized: True
load_time: 2025-12-16T12:49:08.824044


### Observation (PhysioNet EEG)

Despite a native sampling rate of 160 Hz and non-uniform channel naming,
EEGCanon extracts the EEG channels, resamples them to the 256 Hz target, and
produces a CanonicalEEG object that follows the same channel order and sampling
rate as the CSV baseline. The provenance record makes the preprocessing
decisions explicit: original_fs, target_fs, and time_normalized all reflect the
resampling step.
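The 160 Hz to 256 Hz step can be illustrated with a short sketch. It assumes polyphase resampling via `scipy.signal.resample_poly`; the notebook does not say which resampler EEGCanon uses. The input is random placeholder data, and its length of 9760 samples (61 s at 160 Hz) is inferred from the output shape above, not read from the file.

```python
import math

import numpy as np
from scipy.signal import resample_poly

original_fs = 160  # original_fs from the provenance output
target_fs = 256    # target_fs from the provenance output

# Reduce 256/160 to the integer ratio 8/5: upsample by 8, downsample by 5.
g = math.gcd(target_fs, original_fs)
up, down = target_fs // g, original_fs // g

# Placeholder signal: 13 channels, 9760 samples (assumed 61 s at 160 Hz).
rng = np.random.default_rng(0)
x = rng.standard_normal((13, 9760))

# Resample along the time axis; the anti-aliasing filter is applied internally.
y = resample_poly(x, up, down, axis=1)

print(up, down)  # 8 5
print(y.shape)   # (13, 15616)
```

The resulting length, 9760 × 8 / 5 = 15616 samples, matches the shape reported for the PhysioNet recording.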

In [14]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
eeg_physio = load_eeg(r"D:\CognitiveCore_projects\EEG_framework\EEGcanon\eegcanon\S001R01.edf")

print("=== PhysioNet EEG ===")
print("Shape:", eeg_physio.data.shape)
print("Channels:", eeg_physio.channels)
print("Sampling rate:", eeg_physio.sampling_rate)

print("\nWarnings:")
for w in eeg_physio.warnings:
    print("-", w)

print("\nProvenance:")
for k, v in eeg_physio.provenance.items():
    print(f"{k}: {v}")

=== PhysioNet EEG ===
Shape: (13, 15616)
Channels: ['FP1', 'FP2', 'F3', 'F4', 'C3', 'C4', 'P3', 'P4', 'O1', 'O2', 'FZ', 'CZ', 'PZ']
Sampling rate: 256.0


Provenance:
source_format: EDF
channel_policy: 10-20-minimal
original_fs: 160.0
target_fs: 256.0
time_normalized: True
load_time: 2025-12-16T12:55:01.303037


In [15]:
for name, eeg in {
    "CSV": eeg_csv,
    "PhysioNet": eeg_physio
}.items():
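    epochs = eeg.to_epochs(window=1.0, overlap=0.5)
    # No window/overlap is passed here, so the number of feature rows
    # need not match the number of epochs above.
    feats = eeg.extract_features(["psd", "bandpower", "hjorth"])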

    print(f"\n{name}")
    print("Epochs shape:", epochs.shape)
    print("Features shape:", feats.shape)


CSV
Epochs shape: (3, 4, 256)
Features shape: (1, 32)

PhysioNet
Epochs shape: (121, 13, 256)
Features shape: (40, 104)


In [16]:
features_csv = eeg_csv.extract_features(
    ["psd", "bandpower", "hjorth"],
    window=1.0,
    overlap=0.5,
)

features_physio = eeg_physio.extract_features(
    ["psd", "bandpower", "hjorth"],
    window=1.0,
    overlap=0.5,
)

print("CSV features:", features_csv.shape)
print("PhysioNet features:", features_physio.shape)


CSV features: (3, 32)
PhysioNet features: (121, 104)
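With windowing passed explicitly, the feature rows now match the epoch counts (3 and 121). The column counts factor as 32 = 4 × 8 and 104 = 13 × 8, which suggests eight values per channel; the notebook does not show how those split across `psd`, `bandpower`, and `hjorth`. As an illustration of one of the named feature families, here is a minimal sketch of the standard Hjorth parameters for a single channel. `hjorth_parameters` is a hypothetical helper, not EEGCanon's code.

```python
import numpy as np


def hjorth_parameters(x):
    """Return (activity, mobility, complexity) for a 1-D signal."""
    dx = np.diff(x)    # first difference approximates the first derivative
    ddx = np.diff(dx)  # second difference approximates the second derivative
    var_x, var_dx, var_ddx = np.var(x), np.var(dx), np.var(ddx)
    activity = var_x                                   # signal power
    mobility = np.sqrt(var_dx / var_x)                 # proxy for mean frequency
    complexity = np.sqrt(var_ddx / var_dx) / mobility  # close to 1 for a pure sinusoid
    return activity, mobility, complexity


# One 1 s epoch of a 10 Hz sinusoid sampled at 256 Hz.
fs = 256.0
t = np.arange(256) / fs
activity, mobility, complexity = hjorth_parameters(np.sin(2 * np.pi * 10.0 * t))
print(activity, mobility, complexity)
```

For a pure sinusoid the activity is the variance of the wave (0.5 at unit amplitude) and the complexity is close to 1, which makes this a convenient sanity check before running on real epochs.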


In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Dummy labels for demonstration
y_csv = [0, 1] * (len(features_csv) // 2 + 1)
y_csv = y_csv[:len(features_csv)]
y_physio = [0, 1] * (len(features_physio) // 2 + 1)
y_physio = y_physio[:len(features_physio)]

# Train on CSV
X_train, X_test, y_train, y_test = train_test_split(
    features_csv, y_csv, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("CSV model trained successfully")

# Train on PhysioNet
X_train, X_test, y_train, y_test = train_test_split(
    features_physio, y_physio, test_size=0.3, random_state=42
)

clf.fit(X_train, y_train)
print("PhysioNet model trained successfully")

CSV model trained successfully
PhysioNet model trained successfully
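The cell above confirms only that fitting succeeds. A natural next step is to score the held-out split with the already imported `accuracy_score`. The sketch below uses random placeholder features with the PhysioNet shape reported above, not the real EEGCanon features, and the same alternating dummy labels. The accuracy therefore carries no meaning and should sit near chance. `stratify=y` is an addition that keeps both classes in each split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder features with the PhysioNet shape: 121 epochs, 104 features.
rng = np.random.default_rng(42)
X = rng.standard_normal((121, 104))
y = np.arange(len(X)) % 2  # alternating dummy labels, as in the notebook

# Stratified split so both classes appear in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Score on the held-out split that the original cell leaves unused.
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Held-out accuracy on dummy labels: {acc:.2f}")
```

Note that the CSV and PhysioNet feature matrices have different widths (32 vs. 104 columns), so a model fit on one cannot be applied to the other without first restricting both to a shared channel set.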


