# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>1 |</span></b> <b>INTRODUCTION</b></div>

👋 Welcome to "🧠Exploring EEG: A Beginner's Guide"! 

If you're fascinated by the wonders of the human brain and the intricate patterns of brainwaves, but find the world of Electroencephalography (EEG) analysis daunting, you're in the right place. 

This notebook is designed for beginners like me & you, aiming to demystify the complexities of EEG data and make your learning journey both enjoyable and informative.

### <b><span style='color:#FFCE30'> 1.1 |</span> Intention of the notebook</b>
In this notebook, we will embark on an exploratory journey into the realm of EEG data analysis. Our goal is to provide a clear, step-by-step guide to understanding and analyzing EEG signals, which are crucial in detecting and classifying brain activities, such as seizures. We aim to:

* Break down complex concepts into easily digestible sections.
* Illustrate each step with practical code examples.
* Reference public notebooks and discussions to enhance your learning experience.


### <b><span style='color:#FFCE30'> 1.2 |</span> Learning Objective</b>
By the end of this notebook, you will have a foundational understanding of:

* The basics of EEG signals and their significance in medical research and neurology.
* How to preprocess and analyze EEG data.
* Run through the basic code to build a machine learning model for EEG data classification.

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>2 |</span></b> <b>REFERENCE & ACKNOWLEDGEMENT</b></div>

This notebook wouldn't be possible without the valuable insights and contributions from the Kaggle community. I've leveraged several resources to compile the most effective learning path for us:

* https://www.kaggle.com/code/cdeotte/catboost-starter-lb-0-8
* https://www.kaggle.com/code/mvvppp/hms-eda-and-domain-journey
* https://www.kaggle.com/code/ksooklall/hms-banana-montage
* https://www.kaggle.com/code/mpwolke/seizures-classification-parquet


Feel free to explore these resources alongside this notebook to deepen your understanding.

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>3 |</span></b> <b>LOAD LIBARIES</b></div>

In [1]:
import os
import pandas as pd, numpy as np
from glob import glob
import matplotlib.pyplot as plt
VER = 1

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>4 |</span></b> <b>INTRODUCTION TO EEG AND SEIZURE DETECTION</b></div>

<b><span style='color:#FFCE30'> 4.1 |</span> Electroencephalography (EEG) - The Window into Brain Activity</b>

* Electroencephalography, commonly known as EEG, is a non-invasive method used by medical professionals to record electrical activity in the brain. 
* This is done using electrodes placed along the scalp. 
* EEG is a crucial tool in diagnosing neurological disorders, especially epilepsy, which is characterized by recurrent seizures.

<img src="https://www.researchgate.net/profile/Sebastian-Nagel-4/publication/338423585/figure/fig1/AS:844668573073409@1578396089381/Sketch-of-how-to-record-an-Electroencephalogram-An-EEG-allows-measuring-the-electrical.png" alt="EEG" width="600" height="400">



In [2]:
# check the reading of one parquet for understanding

BASE_PATH = '/kaggle/input/hms-harmful-brain-activity-classification/'

df = pd.DataFrame({'path': glob(BASE_PATH + '**/*.parquet')})
df['test_type'] = df['path'].str.split('/').str.get(-2).str.split('_').str.get(-1)
df['id'] = df['path'].str.split('/').str.get(-1).str.split('.').str.get(0)

df_eeg = pd.read_parquet(BASE_PATH + 'train_eegs/1000913311.parquet')
df_eeg.head()

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,-105.849998,-89.230003,-79.459999,-49.23,-99.730003,-87.769997,-53.330002,-50.740002,-32.25,-42.099998,-43.27,-88.730003,-74.410004,-92.459999,-58.93,-75.739998,-59.470001,8.21,66.489998,1404.930054
1,-85.470001,-75.07,-60.259998,-38.919998,-73.080002,-87.510002,-39.68,-35.630001,-76.839996,-62.740002,-43.040001,-68.629997,-61.689999,-69.32,-35.790001,-58.900002,-41.66,196.190002,230.669998,3402.669922
2,8.84,34.849998,56.43,67.970001,48.099998,25.35,80.25,48.060001,6.72,37.880001,61.0,16.58,55.060001,45.02,70.529999,47.82,72.029999,-67.18,-171.309998,-3565.800049
3,-56.32,-37.279999,-28.1,-2.82,-43.43,-35.049999,3.91,-12.66,8.65,3.83,4.18,-51.900002,-21.889999,-41.330002,-11.58,-27.040001,-11.73,-91.0,-81.190002,-1280.930054
4,-110.139999,-104.519997,-96.879997,-70.25,-111.660004,-114.43,-71.830002,-61.919998,-76.150002,-79.779999,-67.480003,-99.029999,-93.610001,-104.410004,-70.07,-89.25,-77.260002,155.729996,264.850006,4325.370117


In [3]:
# Determine the number of channels
# Assuming each row is a time point and each column is a channel
n_channels = df_eeg.shape[1]
n_channels

20

* The headers in the dataset (Fp1, F3, C3, P3, F7, T3, T5, O1, Fz, Cz, Pz, Fp2, F4, C4, P4, F8, T4, T6, O2, EKG) are standard electrode placement labels used in electroencephalography (EEG). 
* These labels correspond to specific positions on the scalp where EEG electrodes are placed to record brain activity. 
* Here's a brief overview of what they represent:

1. **Fp1, Fp2:** Frontopolar electrodes, located on the forehead, left and right side.
2. **F3, F4:** Frontal electrodes, on the left and right side of the forehead.
3. **C3, C4:** Central electrodes, placed above the left and right hemispheres of the brain.
4. **P3, P4:** Parietal electrodes, located on the upper back portion of the head, left and right sides.
5. **O1, O2:** Occipital electrodes, positioned at the back of the head near the visual cortex.
6. **T3, T4, T5, T6:** Temporal electrodes, situated on the left and right sides of the head near the ears. They are often involved in monitoring auditory functions.
7. **F7, F8:** Frontal-temporal electrodes, located at the front of the temporal lobes.
8. **Fz, Cz, Pz:** Midline electrodes, located at the frontal (Fz), central (Cz), and parietal (Pz) positions on the midline of the head.
9. **EKG:** Electrocardiogram electrode, which records the heart’s electrical activity. It's not directly related to brain activity but can be important in some EEG analyses.


<img src="https://www.researchgate.net/profile/Danny-Plass-Oude-Bos/publication/237777779/figure/fig3/AS:669556259434497@1536646060035/10-20-system-of-electrode-placement.png" alt="10-20-system-of-electrode-placement" width="300" height="150">

<b><span style='color:#FFCE30'> 4.2 |</span> Seizures and Their Impact</b>
* Seizures are sudden, uncontrolled electrical disturbances in the brain that can cause changes in behavior, feelings, movements, and levels of consciousness. 
* Detecting and classifying seizures accurately is vital for appropriate treatment and care, especially in critically ill patients.

<b><span style='color:#FFCE30'> 4.3 |</span> The Challenge of Manual EEG Analysis</b>

* Traditionally, EEG data analysis relies on visual inspection by trained neurologists. 
* This process is not only time-consuming and labor-intensive but also prone to errors due to fatigue and subjective interpretation.

<img src="https://slideplayer.com/slide/12925171/78/images/2/Manual+Interpretation+of+EEGs.jpg" alt="Manual Interpretation of EEG" width="700" height="300">
Source: Automated Identification of Abnormal Adult EEG, S. López, G. Suarez, D. Jungreis, I. Obeid and J. Picone, Neural Engineering Data Consortium, Temple University


<b><span style='color:#FFCE30'> 4.4 |</span> The Role of Data Science in EEG Analysis</b>

* Automating EEG Interpretation
The advent of machine learning and data science offers an opportunity to automate the interpretation of EEG data. By developing algorithms that can detect and classify different patterns in EEG signals, we can aid neurologists in making faster, more accurate diagnoses.

* The Data Science Approach
Data scientists approach this challenge by first preprocessing the EEG data, which involves filtering out noise and extracting relevant features. Machine learning models are then trained on these features to distinguish between different types of brain activity.

<img src="https://www.researchgate.net/profile/Huiguang-He/publication/336336651/figure/fig1/AS:834361356197888@1575938657076/The-flow-chart-of-EEG-emotion-classification-with-similarity-learning-network.png" alt="flowchart for EEG classification" width="700" height="300">


<b><span style='color:#FFCE30'> 4.5 |</span> Understanding EEG Patterns</b>

In the realm of EEG analysis for seizure detection, certain patterns are of particular interest:

1. **Seizure (SZ):** Characterized by abnormal rhythmic activity, indicative of a seizure.
2. **Generalized Periodic Discharges (GPD):** Patterns that may be seen in various encephalopathies.
3. **Lateralized Periodic Discharges (LPD):** Often associated with focal brain lesions.
4. **Lateralized Rhythmic Delta Activity (LRDA):** Can be observed in focal brain dysfunction.
5. **Generalized Rhythmic Delta Activity (GRDA):** Typically related to diffuse brain dysfunction.
6. **"Other" Patterns:** Any other type of activity not falling into the above categories.

<b><span style='color:#FFCE30'> 4.6 |</span> Interpreting Complex EEG Data</b>

EEG data interpretation can be complex, especially in edge cases where expert neurologists may not agree on a classification. This is where machine learning models can particularly shine by providing an additional layer of analysis.

<img src="https://www.neurology.org/cms/10.1212/WNL.0000000000207127/asset/bd84c182-712c-41ab-8742-cecf9d49a322/assets/images/large/5ff2.jpg" alt="flowchart for EEG classification" width="700" height="300">

Source: Development of Expert-Level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation https://www.neurology.org/doi/10.1212/WNL.0000000000207127


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>5 |</span></b> <b>LOAD TRAIN DATA</b></div>

In [4]:
df = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
TARGETS = df.columns[-6:]
print('Train shape:', df.shape )
print('Targets', list(TARGETS))
df.head()

Train shape: (106800, 15)
Targets ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']


Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>6 |</span></b> <b>CREATE NON-OVERLAPPING EEG ID TRAIN DATA</b></div>

Following the notebook from Chris Deotte: https://www.kaggle.com/code/cdeotte/catboost-starter-lb-0-8,
Initial discussion found here https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/467021

We perform the following because:

* **Match Training Data with Test Data Format:** The competition states that the test data does not have multiple segments from the same eeg_id. To make the training data similar to the test data, we also use only one segment per eeg_id in the training data.

* **Remove Redundancies:** This approach ensures that the training data does not have overlapping or redundant information, which can lead to a more accurate and generalizable machine learning model.

* **Consistency in Data:** By standardizing how we handle the EEG segments in training, we ensure that our model learns from data that is consistent in format with the data it will be tested on.

* **Data Preparation for Machine Learning:** The normalization of target variables and inclusion of relevant features like patient_id and expert_consensus prepare the dataset for effective machine learning modeling.

In [5]:
# Creating a Unique EEG Segment per eeg_id:
# The code groups (groupby) the EEG data (df) by eeg_id. Each eeg_id represents a different EEG recording.
# It then picks the first spectrogram_id and the earliest (min) spectrogram_label_offset_seconds for each eeg_id. This helps in identifying the starting point of each EEG segment.
# The resulting DataFrame train has columns spec_id (first spectrogram_id) and min (earliest spectrogram_label_offset_seconds).
train = df.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_id':'first','spectrogram_label_offset_seconds':'min'})
train.columns = ['spec_id','min']


# Finding the Latest Point in Each EEG Segment:
# The code again groups the data by eeg_id and finds the latest (max) spectrogram_label_offset_seconds for each segment.
# This max value is added to the train DataFrame, representing the end point of each EEG segment.
tmp = df.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_label_offset_seconds':'max'})
train['max'] = tmp


tmp = df.groupby('eeg_id')[['patient_id']].agg('first') # The code adds the patient_id for each eeg_id to the train DataFrame. This links each EEG segment to a specific patient.
train['patient_id'] = tmp


tmp = df.groupby('eeg_id')[TARGETS].agg('sum') # The code sums up the target variable counts (like votes for seizure, LPD, etc.) for each eeg_id.
for t in TARGETS:
    train[t] = tmp[t].values
    
y_data = train[TARGETS].values # It then normalizes these counts so that they sum up to 1. This step converts the counts into probabilities, which is a common practice in classification tasks.
y_data = y_data / y_data.sum(axis=1,keepdims=True)
train[TARGETS] = y_data

tmp = df.groupby('eeg_id')[['expert_consensus']].agg('first') # For each eeg_id, the code includes the expert_consensus on the EEG segment's classification.
train['target'] = tmp

train = train.reset_index() # This makes eeg_id a regular column, making the DataFrame easier to work with.
print('Train non-overlapp eeg_id shape:', train.shape )
train.head()

Train non-overlapp eeg_id shape: (17089, 12)


Unnamed: 0,eeg_id,spec_id,min,max,patient_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,target
0,568657,789577333,0.0,16.0,20654,0.0,0.0,0.25,0.0,0.166667,0.583333,Other
1,582999,1552638400,0.0,38.0,20230,0.0,0.857143,0.0,0.071429,0.0,0.071429,LPD
2,642382,14960202,1008.0,1032.0,5955,0.0,0.0,0.0,0.0,0.0,1.0,Other
3,751790,618728447,908.0,908.0,38549,0.0,0.0,1.0,0.0,0.0,0.0,GPD
4,778705,52296320,0.0,0.0,40955,0.0,0.0,0.0,0.0,0.0,1.0,Other


In [6]:
train[(train.spec_id == 1908433744) & (train['min'] // 1000 == 2)]

Unnamed: 0,eeg_id,spec_id,min,max,patient_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,target
564,147350182,1908433744,2615.0,2615.0,17408,0.0,0.0,1.0,0.0,0.0,0.0,GPD
16281,4084934272,1908433744,2063.0,2063.0,17408,0.0,0.0,1.0,0.0,0.0,0.0,GPD
16954,4255016832,1908433744,2783.0,2783.0,17408,0.0,0.0,1.0,0.0,0.0,0.0,GPD
17060,4285210475,1908433744,2845.0,2845.0,17408,0.0,0.0,1.0,0.0,0.0,0.0,GPD


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>7 |</span></b> <b>FEATURE ENGINEERING</b></div>



<b><span style='color:#FFCE30'> 7.1 |</span> 10 min and 20 sec windows</b>

* The code belows efficiently reads spectrogram data, from a single combined file, based on the set variable. We relied on the dataset by Chris Deotte to save time. https://www.kaggle.com/datasets/cdeotte/brain-spectrograms
* It then performs feature engineering by calculating mean and minimum values over two different time windows for each frequency in the spectrogram.
It produce produces in 1600 features (400 features × 4 calculations) for each EEG ID.
* The new features are intended to help the model better understand and classify the EEG data.
* This approach is designed to enhance the model's performance by providing it with more detailed information derived from the spectrogram data.

In [7]:
READ_SPEC_FILES = False # If READ_SPEC_FILES is False, the code reads the combined file instead of individual files.
FEATURE_ENGINEER = True
READ_EEG_SPEC_FILES = False

In [8]:
%%time
# READ ALL SPECTROGRAMS
PATH = '/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/'
files = os.listdir(PATH)
print(f'There are {len(files)} spectrogram parquets')

if READ_SPEC_FILES:    
    spectrograms = {}
    for i,f in enumerate(files):
        if i%100==0: print(i,', ',end='')
        tmp = pd.read_parquet(f'{PATH}{f}')
        name = int(f.split('.')[0])
        spectrograms[name] = tmp.iloc[:,1:].values
else:
    spectrograms = np.load('/kaggle/input/brain-spectrograms/specs.npy',allow_pickle=True).item()

There are 11138 spectrogram parquets
CPU times: user 4.38 s, sys: 11 s, total: 15.4 s
Wall time: 1min 5s


# Autoencoder Definition

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torch.nn.init as init
import torch.nn.init as init
import gc
gc.collect()
torch.cuda.empty_cache()

"""
Ideas To Prevent Loss Nans
1. Normalize Data Better
2. Less Deep / Wide Architecture
3. CNN instead of FCNN
"""
class AE(torch.nn.Module):
    def __init__(self, numFrequencies, numRows, numFeatures=1000):
        super().__init__()

        # Building a linear encoder with Batch Normalization
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(numFrequencies * numRows, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, numFeatures),
            torch.nn.ReLU(),
        )

        # Building a linear decoder with Batch Normalization
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(numFeatures, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, numFrequencies * numRows),
            torch.nn.Sigmoid()
        )

        # Apply Xavier initialization to the weights
        for m in self.modules():
            if isinstance(m, nn.Linear):
                init.xavier_uniform_(m.weight)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


# Autoencoder Feature Engineering - Spectrogram Level

In [10]:
%time
# ENGINEER FEATURES
import warnings
warnings.filterwarnings('ignore')
SPEC_FREQS = len(pd.read_parquet(f'{PATH}1000086677.parquet').columns[1:])
print(f"Num Frequencies: {SPEC_FREQS}")
numFeatures = 800
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cpu') # delete when issue resolved
print("Using: ", device)
"""
Define 10min feature autoencoder
"""
model_10min = AE(SPEC_FREQS, 300, numFeatures=numFeatures)
model_10min = model_10min.to(device)
loss_function_10min = torch.nn.MSELoss()
optimizer_10min = torch.optim.Adam(model_10min.parameters(),
                            lr = 3e-4,
                            )

"""
Define 20sec feature autoencoder
"""
model_20sec = AE(SPEC_FREQS, 10, numFeatures=numFeatures)
model_20sec = model_20sec.to(device)
loss_function_20sec = torch.nn.MSELoss()
optimizer_20sec = torch.optim.Adam(model_20sec.parameters(),
                            lr = 3e-4,
                            )

CPU times: user 1 µs, sys: 2 µs, total: 3 µs
Wall time: 8.82 µs
Num Frequencies: 400
Using:  cpu


In [11]:
from torch.utils.data import DataLoader, TensorDataset
from sklearn.impute import SimpleImputer

def extract_raw_values(spectrogram_id, r, offset, end):
    return spectrograms[spectrogram_id][r+offset:r+end,:]

# Create a SimpleImputer instance
nan_imputer = SimpleImputer(strategy='mean')

batch_size = 100
log_noise = 1
print(f"Training Autoencoder on {len(train)} datapoints with batch size {batch_size}")
num_epochs = 8 # fine with 2-3 epochs but should do more with GPU if possible
for epoch in range(num_epochs): 
    num_batches = len(train) // batch_size + 1
    epoch_loss_10min = 0.0
    epoch_loss_20sec = 0.0
    print(f"Batches {num_batches}:", end=' ')
    
    for i in range(num_batches):        
        
        row_start = i * batch_size
        row_end = min((i + 1) * batch_size, len(train))
        
        input_10min_list = []
        input_20sec_list = []
        
        for k in range(row_start, row_end):
            row = train.iloc[k]
            r = int((row['min'] + row['max']) // 4)

            # get raw spectrogram values
            raw_values_10min = extract_raw_values(row.spec_id, r, 0, 300)
            raw_values_20sec = extract_raw_values(row.spec_id, r, 145, 155)

            # Replace infinite values
            raw_values_10min = np.where(np.isfinite(raw_values_10min), raw_values_10min, np.nan)
            raw_values_20sec = np.where(np.isfinite(raw_values_20sec), raw_values_20sec, np.nan)            
            
            # Use SimpleImputer to handle NaN values
            raw_values_10min = nan_imputer.fit_transform(raw_values_10min)
            raw_values_20sec = nan_imputer.fit_transform(raw_values_20sec)
            
            # Convert to torch tensors and append to the lists
            if len(raw_values_10min.flatten()) == 120000:
                # normalize
                raw_values_10min = np.log(raw_values_10min.flatten() + log_noise)
                normalized_values_10min = (raw_values_10min - raw_values_10min.min()) / (raw_values_10min.max() - raw_values_10min.min())
                input_10min_list.append(normalized_values_10min)
            if len(raw_values_20sec.flatten()) == 4000:
                # normalize
                raw_values_20sec = np.log(raw_values_20sec.flatten()  + log_noise)
                normalized_values_20sec = (raw_values_20sec - raw_values_20sec.min()) / (raw_values_20sec.max() - raw_values_20sec.min())
                input_20sec_list.append(normalized_values_20sec)
        
        # Forward pass through the autoencoders
        input_10min_batch = torch.tensor(input_10min_list, dtype=torch.float32).to(device)
        input_20sec_batch = torch.tensor(input_20sec_list, dtype=torch.float32).to(device)
        
        output_10min_batch = model_10min(input_10min_batch)
        output_20sec_batch = model_20sec(input_20sec_batch)

        # Calculate loss and perform optimization for 10min autoencoder
        loss_10min = loss_function_10min(output_10min_batch, input_10min_batch)
        optimizer_10min.zero_grad()
        loss_10min.backward()
        optimizer_10min.step()

        # Calculate loss and perform optimization for 20sec autoencoder
        loss_20sec = loss_function_20sec(output_20sec_batch, input_20sec_batch)
        optimizer_20sec.zero_grad()
        loss_20sec.backward()
        optimizer_20sec.step()

        # Accumulate epoch loss
        epoch_loss_10min += loss_10min.item()
        epoch_loss_20sec += loss_20sec.item()

        # Clean up to avoid memory issues
        del output_10min_batch, output_20sec_batch, input_10min_batch, input_10min_list, input_20sec_batch, input_20sec_list
        
        if i % 20 == 0:
            print(f"Done batch {i}, {epoch_loss_10min}, {epoch_loss_20sec}", end = '... ')

    # Calculate average loss for the epoch
    avg_loss_10min = epoch_loss_10min / num_batches
    avg_loss_20sec = epoch_loss_20sec / num_batches

    print(f"Epoch {epoch} Summary: Avg Loss_10min: {avg_loss_10min}, Avg Loss_20sec: {avg_loss_20sec}")


Training Autoencoder on 17089 datapoints with batch size 100
Batches 171: Done batch 0, 0.16649693250656128, 0.14608146250247955... Done batch 20, 1.4075658842921257, 0.9517939817160368... Done batch 40, 2.5865249410271645, 1.3062684210017323... Done batch 60, 3.714051477611065, 1.6441492019221187... Done batch 80, 4.300490640103817, 1.9638665663078427... Done batch 100, 4.79236651584506, 2.2806110745295882... Done batch 120, 5.232303109019995, 2.582220665179193... Done batch 140, 5.676802691072226, 2.849000684916973... Done batch 160, 6.072929002344608, 3.1064385743811727... Epoch 0 Summary: Avg Loss_10min: 0.03665933480257528, Avg Loss_20sec: 0.01888520389316026
Batches 171: Done batch 0, 0.016862379387021065, 0.01036853902041912... Done batch 20, 0.4063644874840975, 0.24692776892334223... Done batch 40, 0.7872101478278637, 0.4714507516473532... Done batch 60, 1.1557267028838396, 0.6983602810651064... Done batch 80, 1.477969621308148, 0.919167285785079... Done batch 100, 1.7724137473

In [12]:
"""
Get Feature Data
"""
print(f"Generating {2 * numFeatures} features on {len(train)} datapoints")
FEATURES = ["feature_{}_10min".format(i) for i in range(numFeatures)]
FEATURES += ["feature_{}_20sec".format(i) for i in range(numFeatures)]
data = np.zeros((len(train), len(FEATURES)))

for k in range(len(train)):
    if k%100==0: print(k,', ',end='')
    row = train.iloc[k]
    r = int( (row['min'] + row['max'])//4 ) 

    # 10 MINUTE WINDOW FEATURES
    # this will likey need to be unsqueezed or smth
    raw_values_10min = np.log(spectrograms[row.spec_id][r:r+300, :].flatten() + log_noise)
    normalized_values = (raw_values_10min - raw_values_10min.min()) / (raw_values_10min.max() - raw_values_10min.min())
    x = np.array(model_10min.encoder(torch.tensor([normalized_values]).to(device)).tolist())    
    data[k,:numFeatures] = x

    # 20 SECOND WINDOW FEATURES 
    # this will likey need to be unsqueezed or smth
    raw_values_20sec = np.log(spectrograms[row.spec_id][r+145:r+155, :].flatten() + log_noise)
    normalized_values = (raw_values_20sec - raw_values_20sec.min()) / (raw_values_20sec.max() - raw_values_20sec.min())
    x = np.array(model_20sec.encoder(torch.tensor([normalized_values]).to(device)).tolist())
    data[k,numFeatures:2*numFeatures] = x
train[FEATURES] = data

print('New train shape:',train.shape)

Generating 1600 features on 17089 datapoints
0 , 100 , 200 , 300 , 400 , 500 , 600 , 700 , 800 , 900 , 1000 , 1100 , 1200 , 1300 , 1400 , 1500 , 1600 , 1700 , 1800 , 1900 , 2000 , 2100 , 2200 , 2300 , 2400 , 2500 , 2600 , 2700 , 2800 , 2900 , 3000 , 3100 , 3200 , 3300 , 3400 , 3500 , 3600 , 3700 , 3800 , 3900 , 4000 , 4100 , 4200 , 4300 , 4400 , 4500 , 4600 , 4700 , 4800 , 4900 , 5000 , 5100 , 5200 , 5300 , 5400 , 5500 , 5600 , 5700 , 5800 , 5900 , 6000 , 6100 , 6200 , 6300 , 6400 , 6500 , 6600 , 6700 , 6800 , 6900 , 7000 , 7100 , 7200 , 7300 , 7400 , 7500 , 7600 , 7700 , 7800 , 7900 , 8000 , 8100 , 8200 , 8300 , 8400 , 8500 , 8600 , 8700 , 8800 , 8900 , 9000 , 9100 , 9200 , 9300 , 9400 , 9500 , 9600 , 9700 , 9800 , 9900 , 10000 , 10100 , 10200 , 10300 , 10400 , 10500 , 10600 , 10700 , 10800 , 10900 , 11000 , 11100 , 11200 , 11300 , 11400 , 11500 , 11600 , 11700 , 11800 , 11900 , 12000 , 12100 , 12200 , 12300 , 12400 , 12500 , 12600 , 12700 , 12800 , 12900 , 13000 , 13100 , 13200 , 133

In [13]:
from sklearn.preprocessing import StandardScaler

# Columns to be excluded from scaling
excluded_columns = ['eeg_id', 'spec_id', 'min', 'max', 'patient_id', 'seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote','target']

# Save the columns to be excluded
excluded_data = train[excluded_columns]

# DataFrame with only the columns to be scaled
features = train.drop(columns=excluded_columns)

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features and transform them
features_scaled = scaler.fit_transform(features)

# Create a DataFrame from the scaled features
features_scaled_df = pd.DataFrame(features_scaled, columns=features.columns)

# Concatenate the scaled features with the excluded columns
train_scaled_df = pd.concat([excluded_data.reset_index(drop=True),features_scaled_df,], axis=1)
# train_scaled_df.to_csv("/kaggle/working/")
train_scaled_df 


Unnamed: 0,eeg_id,spec_id,min,max,patient_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,...,feature_790_20sec,feature_791_20sec,feature_792_20sec,feature_793_20sec,feature_794_20sec,feature_795_20sec,feature_796_20sec,feature_797_20sec,feature_798_20sec,feature_799_20sec
0,568657,789577333,0.0,16.0,20654,0.0,0.000000,0.25,0.000000,0.166667,...,-0.018423,-1.218027,-0.042937,0.0,0.0,0.0,-1.241587,-0.111487,-0.143369,0.337190
1,582999,1552638400,0.0,38.0,20230,0.0,0.857143,0.00,0.071429,0.000000,...,-0.018423,-0.108367,-0.042937,0.0,0.0,0.0,0.230908,-0.111487,-0.143369,-0.312104
2,642382,14960202,1008.0,1032.0,5955,0.0,0.000000,0.00,0.000000,0.000000,...,-0.018423,-0.592410,-0.042937,0.0,0.0,0.0,0.409068,-0.111487,-0.143369,-0.312104
3,751790,618728447,908.0,908.0,38549,0.0,0.000000,1.00,0.000000,0.000000,...,-0.018423,-1.218027,-0.042937,0.0,0.0,0.0,-1.241587,-0.111487,-0.143369,-0.312104
4,778705,52296320,0.0,0.0,40955,0.0,0.000000,0.00,0.000000,0.000000,...,-0.018423,-1.218027,-0.042937,0.0,0.0,0.0,-1.201671,1.750040,-0.143369,-0.312104
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17084,4293354003,1188113564,0.0,0.0,16610,0.0,0.000000,0.00,0.000000,0.500000,...,-0.018423,0.830302,-0.042937,0.0,0.0,0.0,0.892533,-0.111487,-0.143369,-0.312104
17085,4293843368,1549502620,0.0,0.0,15065,0.0,0.000000,0.00,0.000000,0.500000,...,-0.018423,-0.003880,-0.042937,0.0,0.0,0.0,-0.630064,-0.111487,-0.143369,0.788677
17086,4294455489,2105480289,0.0,0.0,56,0.0,0.000000,0.00,0.000000,0.000000,...,,,,,,,,,,
17087,4294858825,657299228,0.0,12.0,4312,0.0,0.000000,0.00,0.000000,0.066667,...,-0.018423,-0.045101,-0.042937,0.0,0.0,0.0,0.452943,-0.111487,-0.143369,-0.312104


In [14]:
train_scaled_df.describe()

Unnamed: 0,eeg_id,spec_id,min,max,patient_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,...,feature_790_20sec,feature_791_20sec,feature_792_20sec,feature_793_20sec,feature_794_20sec,feature_795_20sec,feature_796_20sec,feature_797_20sec,feature_798_20sec,feature_799_20sec
count,17089.0,17089.0,17089.0,17089.0,17089.0,17089.0,17089.0,17089.0,17089.0,17089.0,...,17003.0,17003.0,17003.0,17003.0,17003.0,17003.0,17003.0,17003.0,17003.0,17003.0
mean,2135226000.0,1080640000.0,401.650711,431.761191,32839.981977,0.15281,0.142456,0.104062,0.065407,0.114851,...,8.566798e-18,-1.065626e-16,-8.253378e-18,0.0,0.0,0.0,-5.5057350000000004e-17,-3.8446120000000003e-17,-2.6536180000000003e-17,0.0
std,1235712000.0,625173900.0,1226.839779,1232.863269,18351.751174,0.331563,0.295541,0.258825,0.187005,0.271425,...,1.000029,1.000029,1.000029,0.0,0.0,0.0,1.000029,1.000029,1.000029,1.000029
min,568657.0,353733.0,0.0,0.0,56.0,0.0,0.0,0.0,0.0,0.0,...,-0.01842285,-1.218027,-0.04293726,0.0,0.0,0.0,-1.241587,-0.1114867,-0.1433692,-0.312104
25%,1062096000.0,539664800.0,0.0,4.0,17408.0,0.0,0.0,0.0,0.0,0.0,...,-0.01842285,-0.7881088,-0.04293726,0.0,0.0,0.0,-0.8647463,-0.1114867,-0.1433692,-0.312104
50%,2123560000.0,1073264000.0,0.0,40.0,32068.0,0.0,0.0,0.0,0.0,0.0,...,-0.01842285,-0.1432888,-0.04293726,0.0,0.0,0.0,-0.1022258,-0.1114867,-0.1433692,-0.312104
75%,3208261000.0,1641428000.0,308.0,346.0,48272.0,0.0,0.068966,0.0,0.0,0.0,...,-0.01842285,0.59677,-0.04293726,0.0,0.0,0.0,0.6692814,-0.1114867,-0.1433692,-0.312104
max,4294958000.0,2147388000.0,17556.0,17632.0,65494.0,1.0,1.0,1.0,1.0,1.0,...,81.6286,6.062694,46.60862,0.0,0.0,0.0,5.112061,28.26509,20.82453,9.926018


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>8 |</span></b> <b>TRAIN MODEL</b></div>

* Original work uses catboost, let's try with XGBoost in this version to see the difference in model performance.

In [15]:
import xgboost as xgb
from sklearn.svm import SVC
import gc
from sklearn.model_selection import KFold, GroupKFold
import pickle
from sklearn.multioutput import MultiOutputRegressor
from scipy.special import rel_entr

In [16]:
len(FEATURES)

1600

In [17]:
all_oof = []
all_true = []
TARS = {'Seizure':0, 'LPD':1, 'GPD':2, 'LRDA':3, 'GRDA':4, 'Other':5}
n_splits = 5
gkf = GroupKFold(n_splits=5)
for i, (train_index, valid_index) in enumerate(gkf.split(train_scaled_df, train_scaled_df.target, train_scaled_df.patient_id)):   
    if i >= n_splits:
        continue
    print('#'*25)
    print(f'### Fold {i+1}')
    print(f'### train size {len(train_index)}, valid size {len(valid_index)}')
    print('#'*25)
    
    # Instantiate the XGBRegressor model
    xgb_model = xgb.XGBRegressor(objective='reg:squarederror', learning_rate = 0.1) # uses MSE to predict probabilities

    model = MultiOutputRegressor(xgb_model) # since we have multiple outputs
    
#     model = SVC(probability=True)    
    LABEL_NAMES = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']
    # Prepare training and validation data
    X_train = train_scaled_df.loc[train_index, FEATURES]
    y_train = train_scaled_df.loc[train_index, LABEL_NAMES]
    X_valid = train_scaled_df.loc[valid_index, FEATURES]
    y_valid = train_scaled_df.loc[valid_index, LABEL_NAMES]
    model.fit(X_train, y_train, verbose=True,) 

    with open(f'XGBoost_f{i}.pkl', 'wb') as f:
        pickle.dump(model, f)

    y_pred = model.predict(X_valid)
    y_pred[y_pred < 0] = 0
    oof = y_pred / np.sum(y_pred, axis=1).reshape(-1,1) # ensure they sum to 1
    true = y_valid.values
    kl_divergence = np.mean(np.sum(true * (np.log(true + 1e-10) - np.log(oof + 1e-10)), axis=1))
    print(f"Kale Divergence: {kl_divergence}")
    
    all_oof.append(oof)
    all_true.append(true)
    
    del X_train, y_train, X_valid, y_valid, oof
    gc.collect()
    
all_oof = np.concatenate(all_oof)
all_true = np.concatenate(all_true)

#########################
### Fold 1
### train size 13671, valid size 3418
#########################
Kale Divergence: 1.0205473546464672
#########################
### Fold 2
### train size 13671, valid size 3418
#########################
Kale Divergence: 1.2669746666311315
#########################
### Fold 3
### train size 13671, valid size 3418
#########################
Kale Divergence: 1.1330065057377365
#########################
### Fold 4
### train size 13671, valid size 3418
#########################
Kale Divergence: 1.0422925316761045
#########################
### Fold 5
### train size 13672, valid size 3417
#########################
Kale Divergence: 1.111646223019359


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>10 |</span></b> <b>FEATURE IMPORTANCE</b></div>

In [18]:
test = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/test.csv')
print('Test shape',test.shape)
test.head()

Test shape (1, 3)


Unnamed: 0,spectrogram_id,eeg_id,patient_id
0,853520,3911565283,6885


In [19]:
%%time
# READ ALL TEST SPECTROGRAMS
PATH2 = '/kaggle/input/hms-harmful-brain-activity-classification/test_spectrograms/'
files = os.listdir(PATH2)
print(f'There are {len(files)} spectrogram parquets')

spectrograms_test = {}
for i,f in enumerate(files):
    if i%100==0: print(i,', ',end='')
    tmp = pd.read_parquet(f'{PATH2}{f}')
    name = int(f.split('.')[0])
    spectrograms_test[name] = tmp.iloc[:,1:].values


There are 1 spectrogram parquets
0 , CPU times: user 74.2 ms, sys: 3.32 ms, total: 77.5 ms
Wall time: 75.9 ms


In [20]:
%time
# ENGINEER FEATURES
import warnings
warnings.filterwarnings('ignore')

SPEC_COLS = pd.read_parquet(f'{PATH}1000086677.parquet').columns[1:]

TEST_FEATURES = FEATURES

print(f'We are creating {len(TEST_FEATURES)} features for {len(test)} rows... ',end='')


# A data matrix data is initialized to store the new features for each eeg_id in the train DataFrame.
# For each row in train, the code calculates the mean and minimum values within the specified 10-minute and 20-second windows.
# These calculated values are then stored in the data matrix.
# Finally, the matrix is added to the train DataFrame as new columns.

data = np.zeros((len(test),len(TEST_FEATURES)))
for k in range(len(test)):
    if k%100==0: print(k,', ',end='')
    row = test.iloc[k]       
    s = int( row.spectrogram_id )
    spec = pd.read_parquet(f'{PATH2}{s}.parquet')
    raw_values_10min = np.log(spec.iloc[:,1:].values.flatten() + log_noise)
    normalized_values = (raw_values_10min - raw_values_10min.min()) / (raw_values_10min.max() - raw_values_10min.min())
    x = np.array(model_10min.encoder(torch.tensor([normalized_values]).to(device)).tolist())   
    data[k,:numFeatures] = x

    # 20 SECOND WINDOW FEATURES 
    # this will likey need to be unsqueezed or smth
    raw_values_20sec = np.log(spec.iloc[145:155,1:].values.flatten() + log_noise)
    normalized_values = (raw_values_20sec - raw_values_20sec.min()) / (raw_values_20sec.max() - raw_values_20sec.min())
    x = np.array(model_20sec.encoder(torch.tensor([normalized_values]).to(device)).tolist())  
    data[k,numFeatures:2*numFeatures] = x    

print(len(TEST_FEATURES), data.shape)
test[TEST_FEATURES] = data

    
print()
print('New test shape:',test.shape)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.82 µs
We are creating 1600 features for 1 rows... 0 , 1600 (1, 1600)

New test shape: (1, 1603)


In [21]:

# Columns to be excluded from scaling
excluded_columns = ['eeg_id', 'spectrogram_id', 'patient_id']

# Save the columns to be excluded
excluded_data = test[excluded_columns]

# DataFrame with only the columns to be scaled
features = test.drop(columns=excluded_columns)

# Initialize the StandardScaler
# scaler = StandardScaler()

# Fit the scaler to the features and transform them
features_scaled = scaler.transform(features)

# Create a DataFrame from the scaled features
features_scaled_df = pd.DataFrame(features_scaled, columns=features.columns)

# Concatenate the scaled features with the excluded columns
test_scaled_df = pd.concat([excluded_data.reset_index(drop=True),features_scaled_df,], axis=1)
test_scaled_df 


Unnamed: 0,eeg_id,spectrogram_id,patient_id,feature_0_10min,feature_1_10min,feature_2_10min,feature_3_10min,feature_4_10min,feature_5_10min,feature_6_10min,...,feature_790_20sec,feature_791_20sec,feature_792_20sec,feature_793_20sec,feature_794_20sec,feature_795_20sec,feature_796_20sec,feature_797_20sec,feature_798_20sec,feature_799_20sec
0,3911565283,853520,6885,-0.199303,0.0,-0.00793,0.0,0.0,-0.00793,0.0,...,-0.018423,0.92318,-0.042937,0.0,0.0,0.0,-0.012949,-0.111487,-0.143369,-0.312104


In [22]:
# INFER XGBOOST ON TEST
preds = []

for i in range(n_splits):
    print(i, ', ', end='')
    
    # Load the XGBoost model
    with open(f'XGBoost_f{i}.pkl', 'rb') as f:
        model = pickle.load(f)
    
    # Make predictions
    test_data_scaled = test_scaled_df[TEST_FEATURES]
    
    # data_imputed = imputer.fit_transform(test_data_scaled)
    
    pred = model.predict(test_data_scaled)
    pred[pred < 0] = 0
    pred = pred / np.sum(pred, axis=1).reshape(-1,1)
    preds.append(pred) 

# Average the predictions from each fold
pred = np.mean(preds, axis=0)
print('Test preds shape', pred.shape)

0 , 1 , 2 , 3 , 4 , Test preds shape (1, 6)


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>12 |</span></b> <b>SUBMISSION</b></div>

In [23]:
sub = pd.DataFrame({'eeg_id':test.eeg_id.values})
sub[LABEL_NAMES] = pred
sub.to_csv('submission.csv',index=False)
print('Submission shape',sub.shape)
sub.head()

Submission shape (1, 7)


Unnamed: 0,eeg_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,3911565283,0.042841,0.04975,0.003679,0.045709,0.068547,0.789472


In [24]:
# SANITY CHECK TO CONFIRM PREDICTIONS SUM TO ONE
sub.iloc[:,-6:].sum(axis=1)

0    1.0
dtype: float32