# Brain Solver Python Processing Notebook

This notebook uses the custom `brain_solver` package to analyze brain activity data. The data comes from the official Kaggle competition dataset, plus additional datasets used for model training and evaluation.

This is the Training notebook.

## Data Sources

### Official:

- **HMS - Harmful Brain Activity Classification**
  - **Source:** [Kaggle Competition](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification)
  - **Description:** This competition focuses on classifying harmful brain activity. It includes a comprehensive dataset for training and testing models.

- **Brain-Spectrograms**
  - **Source:** [Kaggle Dataset](https://www.kaggle.com/datasets/cdeotte/brain-spectrograms)
  - **Description:** The `specs.npy` file bundles all the spectrograms from the HMS competition into a single NumPy file.

### Additional:

- **Brain-EEG-Spectrograms**
  - **Source:** [Kaggle Dataset](https://www.kaggle.com/datasets/cdeotte/brain-eeg-spectrograms)
  - **Description:** The `EEG_Spectrograms` folder includes one NumPy file per EEG ID, with each array shaped (128, 256, 4), representing (frequency, time, montage chain). It complements the Kaggle spectrograms above with one spectrogram array per EEG recording.

- **hms_efficientnetb0_pt_ckpts**
  - **Source:** [Kaggle Dataset](https://www.kaggle.com/datasets/crackle/hms-efficientnetb0-pt-ckpts)
  - **Description:** This dataset offers pre-trained checkpoints for EfficientNetB0 models, tailored for the HMS competition. It's intended for use in fine-tuning models on the specific task of harmful brain activity classification.
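
To make the EEG spectrogram layout concrete, here is a minimal sketch that builds a placeholder array with the (128, 256, 4) shape described above and slices it per montage chain. The array contents are synthetic, and stacking the chains into one tall image is only one possible way to use them, not necessarily what `brain_solver` does.

```python
import numpy as np

# Synthetic stand-in for one EEG_Spectrograms/<eeg_id>.npy file.
# Axes follow the description above: (frequency, time, montage chain).
rng = np.random.default_rng(0)
eeg_spec = rng.random((128, 256, 4), dtype=np.float32)

n_freq, n_time, n_chains = eeg_spec.shape

# One 2-D (frequency x time) spectrogram per montage chain.
per_chain = [eeg_spec[:, :, c] for c in range(n_chains)]

# Stack the chains along the frequency axis into a single tall image.
# This is an illustrative choice, not the notebook's confirmed method.
stacked = np.concatenate(per_chain, axis=0)
print(stacked.shape)
```

With four chains of 128 frequency bins each, the stacked image has 512 rows and 256 columns.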


In [23]:
import os, sys
import gc
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from brain_solver import Helpers as hp, Trainer as tr, BrainModel as br, EEGDataset
from brain_solver import Wav2Vec2 as w2v
from brain_solver import Filters, FilterType
from transformers.utils import logging

# Suppress warnings if desired
import warnings

warnings.filterwarnings("ignore")
logging.set_verbosity(logging.CRITICAL)

# Setup for CUDA device selection
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [24]:
from brain_solver import Config
full_path = "/home/osloup/NoodleNappers/brain/data/" # Luppo
# full_path = "C:/Users/tygof/Documents/Semester 8/MLiP/NoodleNappers/brain/data/" # Tygo
# full_path = "C:/Users/dahbl/Documents/TrueDocs/Uni/Year 4/Semester 2/Machine Learning in Practice/brain/brain/data/" # Dick
config = Config(full_path,  full_path + "out/", USE_EEG_SPECTROGRAMS=True, USE_KAGGLE_SPECTROGRAMS=True, should_read_brain_spectograms=False, should_read_eeg_spectrogram_files=False, USE_PRETRAINED_MODEL=False)

# Kaggle Pull
# full_path = "/kaggle/input/"
# config = Config(full_path, "/kaggle/working/", USE_EEG_SPECTROGRAMS=True, USE_KAGGLE_SPECTROGRAMS=True, should_read_brain_spectograms=False, should_read_eeg_spectrogram_files=False, USE_PRETRAINED_MODEL=False)

# Make the local copy of the competition metric importable (sys is already imported above)
sys.path.append(os.path.join(full_path, 'kaggle-kl-div'))
# from kaggle_kl_div import score

In [25]:
# Create the output folder if it does not exist
os.makedirs(config.output_path, exist_ok=True)

# Initialize random environment
pl.seed_everything(config.seed, workers=True)

print(config.data_train_csv)

Seed set to 2024


/home/osloup/NoodleNappers/brain/data/hms-harmful-brain-activity-classification/train.csv


In [26]:
train_df: pd.DataFrame = hp.load_csv(config.data_train_csv)

if train_df is None:
    # Stop here: every later cell depends on train_df
    raise RuntimeError("Failed to load the CSV file.")

EEG_IDS = train_df.eeg_id.unique()
TARGETS = train_df.columns[-6:]  # the six vote columns
TARS = {"Seizure": 0, "LPD": 1, "GPD": 2, "LRDA": 3, "GRDA": 4, "Other": 5}
TARS_INV = {x: y for y, x in TARS.items()}
print("Train shape:", train_df.shape)

Train shape: (106800, 15)


In [27]:
train_data_preprocessed = hp.preprocess_eeg_data(train_df, TARGETS)

Train non-overlap eeg_id shape: (17089, 12)


In [28]:
train_data_preprocessed.head()

| | eeg_id | spec_id | min_offset | max_offset | patient_id | seizure_vote | lpd_vote | gpd_vote | lrda_vote | grda_vote | other_vote | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 568657 | 789577333 | 0.0 | 16.0 | 20654 | 0.0 | 0.0 | 0.25 | 0.0 | 0.166667 | 0.583333 | Other |
| 1 | 582999 | 1552638400 | 0.0 | 38.0 | 20230 | 0.0 | 0.857143 | 0.0 | 0.071429 | 0.0 | 0.071429 | LPD |
| 2 | 642382 | 14960202 | 1008.0 | 1032.0 | 5955 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Other |
| 3 | 751790 | 618728447 | 908.0 | 908.0 | 38549 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | GPD |
| 4 | 778705 | 52296320 | 0.0 | 0.0 | 40955 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Other |
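
The preprocessed rows above hold one record per `eeg_id`, with vote columns that sum to 1 and a `target` label. The internals of `hp.preprocess_eeg_data` are not shown in this notebook, so the following is only a sketch of one way such an aggregation could work: sum the raw votes per `eeg_id`, normalize each row, and take the most-voted class as the label. The helper `aggregate_votes` and the toy vote counts are hypothetical.

```python
import pandas as pd

VOTE_COLS = ["seizure_vote", "lpd_vote", "gpd_vote",
             "lrda_vote", "grda_vote", "other_vote"]
LABELS = ["Seizure", "LPD", "GPD", "LRDA", "GRDA", "Other"]


def aggregate_votes(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse rows to one per eeg_id with normalized votes and a label.

    Hypothetical helper: an assumption about the preprocessing, not the
    actual brain_solver implementation.
    """
    # Sum the raw expert votes over all rows sharing an eeg_id.
    out = df.groupby("eeg_id")[VOTE_COLS].sum()
    # Normalize so each row's votes sum to 1 (a probability distribution).
    out = out.div(out.sum(axis=1), axis=0)
    # The label is the class with the largest share of votes.
    out["target"] = out[VOTE_COLS].idxmax(axis=1).map(dict(zip(VOTE_COLS, LABELS)))
    return out.reset_index()


# Toy data (made up for illustration): eeg_id 1 has two overlapping rows.
toy = pd.DataFrame(
    [[1, 0, 0, 1, 0, 1, 3],
     [1, 0, 0, 2, 0, 1, 4],
     [2, 0, 6, 0, 1, 0, 0]],
    columns=["eeg_id"] + VOTE_COLS,
)
agg = aggregate_votes(toy)
print(agg[["eeg_id", "target"]])
```

Keeping the normalized votes, rather than only the hard label, preserves the soft targets that a KL-divergence style metric needs.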


In [29]:
# Initialize the Filters class
# ft = Filters(order=5)

# train_eegs/ EEG data from one or more overlapping samples.
# Use the metadata in train.csv to select specific annotated subsets.
# The column names are the names of the individual electrode locations for EEG leads,
# with one exception: the EKG column is for an electrocardiogram lead that records
# data from the heart.
# All of the EEG data (for both train and test) was collected at a frequency of
# 200 samples per second.

# Define filter parameters
# cutoff_low = 0.1  # Low cutoff frequency (Hz)
# cutoff_high = 50.0  # High cutoff frequency (Hz)
# fs = 200  # Sampling rate (Hz)

# filtered_brain_spectrograms = {key: ft.apply_filter_to_spectrogram(spectrogram, [cutoff_low, cutoff_high], fs, FilterType.BANDPASS) for key, spectrogram in spectrograms.items()}
# filtered_eeg_spectrograms = {key: ft.apply_filter_to_spectrogram(spectrogram, [cutoff_low, cutoff_high], fs, FilterType.BANDPASS) for key, spectrogram in data_eeg_spectrograms.items()}

# combined_brain_spectrograms = {key: {'raw': spectrograms[key], 'filtered': filtered_brain_spectrograms[key]} for key in spectrograms}
# combined_eeg_spectrograms = {key: {'raw': data_eeg_spectrograms[key], 'filtered': filtered_eeg_spectrograms[key]} for key in data_eeg_spectrograms}

In [30]:
read_path = config.data_spectograms

files = os.listdir(read_path)
print(f"There are {len(files)} spectrogram parquets")

There are 11138 spectrogram parquets


In [31]:
# Create the output folder for the wav2vec features if it does not exist
os.makedirs(config.data_w2v_specs, exist_ok=True)

In [32]:
force_regenerate = False
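
A flag like `force_regenerate` is typically used to bypass an on-disk cache. As a minimal sketch of that pattern, assuming one `.npy` file per item, the hypothetical helper `load_or_compute` below reuses a saved array when it exists and recomputes it only when the file is missing or regeneration is forced. The names and the temporary directory are illustrative and not part of `brain_solver`.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(name, compute, cache_dir, force_regenerate=False):
    """Return the cached array for `name`, computing and saving it if needed.

    Hypothetical helper illustrating the caching pattern.
    """
    cache_file = Path(cache_dir) / f"{name}.npy"
    if cache_file.exists() and not force_regenerate:
        return np.load(cache_file)
    result = compute()
    np.save(cache_file, result)
    return result


calls = []  # records how often the expensive step actually runs


def expensive():
    calls.append(1)
    return np.arange(6, dtype=np.float32).reshape(2, 3)


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute("demo", expensive, tmp)    # computed and saved
    second = load_or_compute("demo", expensive, tmp)   # served from the cache
    third = load_or_compute("demo", expensive, tmp, force_regenerate=True)  # recomputed

print(len(calls))
```

Only the first and the forced call run `expensive`, so the counter ends at 2.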

In [33]:
for i, f in enumerate(files):
    name = f[:-8]  # strip the ".parquet" extension

    if i % 100 == 0:
        print(i, ", ", end="")

    out_file = config.data_w2v_specs + f"{name}.npy"

    # Skip spectrograms that were already converted, unless regeneration is forced
    if force_regenerate or not os.path.exists(out_file):
        try:
            parquet_file = pd.read_parquet(os.path.join(read_path, f))

            # TODO (Luppo): apply filters to the spectrograms here, and later to the raw data
            parquet_file = parquet_file.iloc[:, 1:].values  # drop the first column; FILTERS

            parquet_file = w2v.wav2vec2(parquet_file)
            np.save(out_file, parquet_file)

        except Exception as e:
            print(f"ERROR: An unexpected error occurred for {name}: {e}")


0 , 

/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000086677
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000189855
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000317312
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000381196
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000493950
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000646093
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000655456
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1000757705
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1001335039
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1001616430
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1001782302
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/100193677
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1001944237
/home/osloup/NoodleNappers/brain/data/out/w2v_specs_filter/1002038108
/home/osloup/NoodleNa