# **Comparative Analysis of Tree Ensemble Models for EEG Event Detection**

I am using data from a past Kaggle competition to train a model that detects specific events in EEG brainwave data. With BCI technology, those events could then trigger gestures in a prosthetic device, for example. My goal is to get perfect or near-perfect predictions on the testing data. You can find more info on the contest and dataset [here](https://www.kaggle.com/c/grasp-and-lift-eeg-detection/).

## **Install The Libraries**
First we install all the necessary Python libraries. Check the [README.md](../README.md) file for more info on how to do this.

## **Kaggle Environment Setup**
You will need to upload your *kaggle.json* and set its permissions so that only your user can read it.

In [None]:
!chmod 600 ../kaggle.json

Then we set the Kaggle configuration directory, as an environment variable, to the parent directory where *kaggle.json* lives.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '../'
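
The two setup steps above (restricting the file permissions and pointing `KAGGLE_CONFIG_DIR` at the right directory) can be sanity-checked before any download is attempted. The helper below is a minimal sketch, not part of the Kaggle CLI: `check_kaggle_credentials` is a hypothetical name, and the permission check assumes a POSIX file system.

```python
import os
import stat
from pathlib import Path

def check_kaggle_credentials(config_dir):
    """Return a list of problems found with kaggle.json in config_dir (empty if none)."""
    cred = Path(config_dir) / "kaggle.json"
    if not cred.is_file():
        return [f"{cred} not found"]
    problems = []
    mode = stat.S_IMODE(cred.stat().st_mode)
    # Any group/other permission bit means other users can access the API key.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{cred} has mode {oct(mode)}; expected 0o600")
    return problems

# Check whichever directory the environment variable currently points at.
print(check_kaggle_credentials(os.environ.get("KAGGLE_CONFIG_DIR", ".")))
```

An empty list means the file exists and is private to the current user.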

Now we can download the data from the competition page.

In [None]:
if not os.path.exists('../data/kaggle-eeg'):
    os.makedirs('../data/kaggle-eeg')
    !kaggle competitions download grasp-and-lift-eeg-detection -p ../data/kaggle-eeg/ -f train.zip
    !unzip ../data/kaggle-eeg/train.zip -d ../data/kaggle-eeg

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mne.decoding import CSP
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score, multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
import wandb
pd.set_option('display.max_columns', None)


## **Data Analysis**
First we load some of the training data and check the first few rows.

In [None]:
data_path = '../data/kaggle-eeg/train'
features = pd.read_csv(f'{data_path}/subj1_series1_data.csv')
labels = pd.read_csv(f'{data_path}/subj1_series1_events.csv')
features = features.drop(columns=['id'])
labels = labels.drop(columns=['id'])

display(features.info(), features.describe(), features.head(), labels.info(), labels.describe(), labels.head())

In [None]:
corr = features.corr()
plt.figure(figsize=(30, 20))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
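
An annotated heatmap over every EEG channel is dense, so it can help to also list the most strongly correlated channel pairs as a table. The sketch below uses a small synthetic frame rather than the real data; the helper name `top_correlated_pairs` and the column names `A`, `B`, `C` are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic stand-in for the feature frame.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "A": base,
    "B": base + 0.1 * rng.normal(size=500),  # strongly correlated with A
    "C": rng.normal(size=500),               # independent noise
})
print(top_correlated_pairs(demo, n=2))
```

On the real data the same call would be `top_correlated_pairs(features)`.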

## **Training**

### **Data Preprocessing**

In [None]:
def load_series_data(subject, series):
    features = pd.read_csv(f'{data_path}/subj{subject}_series{series}_data.csv')
    labels = pd.read_csv(f'{data_path}/subj{subject}_series{series}_events.csv')
    return features, labels

def merge_labels(features, labels):
    data = features.copy()
    data = data.merge(labels, on='id')
    data.drop(columns=['id'], inplace=True)
    return data

def get_training_batch(series):
    features, labels = load_series_data(1, series)
    data = merge_labels(features, labels)
    for i in range(2, 13):
        features, labels = load_series_data(i, series)
        data = pd.concat([data, merge_labels(features, labels)])
    return data
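
`merge_labels` relies on every `id` appearing exactly once in both files. A minimal sketch of how that assumption could be made explicit with the `validate` argument of `DataFrame.merge` is shown below; the tiny frames and their `id` values are invented for the example.

```python
import pandas as pd

features_demo = pd.DataFrame({"id": ["s1_0", "s1_1", "s1_2"], "Fp1": [10, 12, 9]})
labels_demo = pd.DataFrame({"id": ["s1_0", "s1_1", "s1_2"], "HandStart": [0, 1, 1]})

# validate="one_to_one" raises if an id is duplicated on either side.
merged = features_demo.merge(labels_demo, on="id", how="inner", validate="one_to_one")
data_demo = merged.drop(columns=["id"])
print(data_demo)

# A duplicated label row is rejected instead of silently duplicating samples.
dup_labels = pd.concat([labels_demo, labels_demo.iloc[[0]]])
rejected = False
try:
    features_demo.merge(dup_labels, on="id", validate="one_to_one")
except pd.errors.MergeError:
    rejected = True
print("duplicate ids rejected:", rejected)
```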

In [None]:
train_df = get_training_batch(1)
train_df.shape, train_df.value_counts()

In [None]:
def get_categories(data):
    label_counts = data[['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']].value_counts()
    categories = list(label_counts.index)
    category_dict = {category: i for i, category in enumerate(categories)}
    return category_dict

category_dict = get_categories(train_df)
category_dict

In [None]:
# reshape data for CSP
def reshape_data(data, category_dict=category_dict):
    X = data.drop(columns=['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']).values
    y = data[['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']].values
    y = y.astype(np.float64)
    X = np.expand_dims(X, axis=2)
    X = X.astype(np.float64)
    return X, y

X_train, y_train = reshape_data(train_df)
X_train = X_train.astype('float64').copy()
X_train.shape, y_train.shape, X_train.dtype, y_train.dtype
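
MNE's `CSP` expects input shaped `(n_epochs, n_channels, n_times)`, and `reshape_data` satisfies this by treating every sample as an epoch with a single time point. One alternative, shown here only as a sketch, is to cut the continuous signal into overlapping windows so each epoch carries some temporal context. The helper name `make_windows`, the window length and the step are assumptions, not values taken from this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(signal, labels, window_len, step):
    """signal: (n_samples, n_channels); labels: (n_samples, n_labels)."""
    # sliding_window_view appends the window axis last,
    # giving (n_windows, n_channels, window_len).
    windows = sliding_window_view(signal, window_len, axis=0)[::step]
    # Label each window by its final sample so no future samples are used.
    window_labels = labels[window_len - 1::step]
    return np.ascontiguousarray(windows, dtype=np.float64), window_labels

# Toy signal: 10 samples, 2 channels, 6 label columns.
signal_demo = np.arange(20).reshape(10, 2)
labels_demo = np.zeros((10, 6))
X_win, y_win = make_windows(signal_demo, labels_demo, window_len=4, step=2)
print(X_win.shape, y_win.shape)
```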

### **Model Building**

In [None]:
def build_model(pipeline, param_grid, X_train, y_train, outer_cv=5, inner_cv=5):
    # Build the splitters under new names so the integer arguments are not overwritten
    outer_splitter = KFold(n_splits=outer_cv, shuffle=True, random_state=42)
    inner_splitter = KFold(n_splits=inner_cv, shuffle=True, random_state=42)
    scores = []
    best_estimators = [] # Per outer fold: the best estimator for each label
    best_scores = [] # Per outer fold: the Jaccard score for each label

    for train_index, test_index in outer_splitter.split(X_train):
        X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]

        # Initialize list to store best estimators and scores for each label in this outer fold
        fold_best_estimators = []
        fold_best_scores = []

        # Iterate over each label (column) in the multi-label target matrix
        for label_idx in range(y_train_fold.shape[1]):
            y_train_label = y_train_fold[:, label_idx]
            y_test_label = y_test_fold[:, label_idx]

            # Each label is fitted as its own binary problem; scoring can be swapped for another binary metric
            grid_search = GridSearchCV(pipeline, param_grid, cv=inner_splitter, n_jobs=-1, verbose=1, scoring='accuracy')
            grid_search.fit(X_train_fold, y_train_label)

            y_pred = grid_search.predict(X_test_fold)

            # Calculate multiple metrics; the target is a single binary column, so Jaccard uses average='binary'
            accuracy = accuracy_score(y_test_label, y_pred)
            jaccard = jaccard_score(y_test_label, y_pred, average='binary')
            hamming = hamming_loss(y_test_label, y_pred)

            scores.append({'accuracy': accuracy, 'jaccard': jaccard, 'hamming': hamming})
            fold_best_estimators.append(grid_search.best_estimator_)
            fold_best_scores.append(jaccard)

        best_estimators.append(fold_best_estimators)
        best_scores.append(fold_best_scores)

    # For each outer fold, find the label whose estimator achieved the highest Jaccard score
    best_label_indices = []
    for scores_per_fold in best_scores:
        best_label_indices.append(np.argmax(scores_per_fold))

    best_estimators_final = []
    for i in range(len(best_estimators)):  # Outer fold
        best_estimators_final.append(best_estimators[i][best_label_indices[i]])

    return best_estimators_final, scores  # Return a list of best estimators, one per outer fold
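
The function above is a hand-written form of nested cross-validation: an inner search picks hyperparameters, and an outer split estimates how well the tuned model generalizes. For a single binary label, the same idea can be sketched more compactly by passing a `GridSearchCV` object to `cross_val_score`. The synthetic data, the small grid and the fold counts below are placeholders chosen to run quickly, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic binary problem standing in for one EEG event label.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter selection
outer = KFold(n_splits=3, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=inner,
    scoring="accuracy",
)

# Each outer fold reruns the whole grid search on its own training part.
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer, scoring="accuracy")
print(nested_scores, nested_scores.mean())
```

Because the outer test folds are never seen by the inner search, the mean of `nested_scores` is a less optimistic estimate than the grid search's own best score.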


#### Random Forest Model

In [None]:
rf_clf = Pipeline([('CSP', CSP(n_components=4)),
                     ('RF', RandomForestClassifier())])

param_grid = {
    'CSP__n_components': [2, 4, 6],
    'RF__n_estimators': [50, 100, 200],
    'RF__max_depth': [10, 20, 30]
}

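best_rf_clfs, scores = build_model(rf_clf, param_grid, X_train, y_train)
print("Grid search scores: ", scores)
rf_cv = KFold(n_splits=5, shuffle=True, random_state=42)
# build_model returns one estimator per outer fold, each fitted on a single binary label;
# here the first estimator is cross-validated against the first label (HandStart)
cv_scores = cross_val_score(best_rf_clfs[0], X_train, y_train[:, 0], cv=rf_cv, scoring='accuracy')
print("Cross-validation scores: ", cv_scores)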

#### Gradient Boosting Model

#### XGBoost Model

### **Wandb Logging**
First we log in to Wandb with our API key so that we can log the training.

In [None]:
# Prompts for the API key, or reads it from the WANDB_API_KEY environment variable; never hardcode the key in the notebook
wandb.login()

Initialize Wandb and specify a project name to keep track of metrics.

In [None]:
wandb.init(
    project="eeg-signal-classification", 
    config={
        "hyper": "parameter",
        "epochs": 17983756,
        "batch_size": 719350,
        "loss_function": "categorical_crossentropy",
        "architecture": "CNN",
        "dataset": "kaggle-eeg"
    }
)

### **Training Loop**

In [None]:
def training_loop(model):
    # reshape_data returns NumPy arrays, so the held-out splits are collected in lists
    # and joined with np.concatenate rather than pd.concat
    test_x_parts, test_y_parts = [], []
    for series in range(1, 9):
        train_df = get_training_batch(series)
        category_dict = get_categories(train_df)
        X_train, y_train = reshape_data(train_df, category_dict)
        train_x, test_x, train_y, test_y = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
        # Note: scikit-learn estimators refit from scratch on every fit() call unless warm_start is set
        model.fit(train_x, train_y)
        test_x_parts.append(test_x)
        test_y_parts.append(test_y)
        total_test_x = np.concatenate(test_x_parts)
        total_test_y = np.concatenate(test_y_parts)
        score = accuracy_score(total_test_y, model.predict(total_test_x))
        wandb.log({"series": series, "score": score, "model": type(model).__name__})
        print(f"Series {series} score: {score}")
    y_pred = model.predict(total_test_x)
    accuracy = accuracy_score(total_test_y, y_pred)
    report = classification_report(total_test_y, y_pred)
    return accuracy, report, multilabel_confusion_matrix(total_test_y, y_pred)
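
When `accuracy_score` receives a 2-D multi-label target, as in the loop above, scikit-learn computes subset accuracy: a row counts as correct only if all six labels match. The toy arrays below are invented to show how this differs from the Hamming loss and from per-label accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Four samples, three labels (made-up values).
y_true_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0], [1, 0, 0]])

# Rows 0 and 2 match exactly, rows 1 and 3 each miss one label.
subset_acc = accuracy_score(y_true_demo, y_pred_demo)    # 2 of 4 rows -> 0.5
ham = hamming_loss(y_true_demo, y_pred_demo)             # 2 of 12 cells wrong
per_label_acc = (y_true_demo == y_pred_demo).mean(axis=0)

print(subset_acc, ham, per_label_acc)
```

Subset accuracy is the strictest of the three, so a low value does not necessarily mean that individual labels are predicted poorly.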

## **Submission**
Now we download the testing data from the Kaggle competition and unzip it into the data directory.

In [None]:
!kaggle competitions download grasp-and-lift-eeg-detection -p ../data/kaggle-eeg/ -f test.zip

In [None]:
!unzip ../data/kaggle-eeg/test.zip -d ../data/kaggle-eeg

Here we load the sample submission from the Kaggle competition. This gives us a pre-made dataframe and we just need to update column values with predictions from our model. 

In [None]:
!kaggle competitions download grasp-and-lift-eeg-detection -p ../data/kaggle-eeg/ -f sample_submission.csv.zip

In [None]:
!unzip ../data/kaggle-eeg/sample_submission.csv.zip -d ../data/kaggle-eeg

In [None]:
sub = pd.read_csv('../data/kaggle-eeg/sample_submission.csv')

In [None]:
sub.head()

Here we create a dataframe in the same shape as the example submission on the competition page.

In [None]:
path = '../data/kaggle-eeg/test'

def get_merged_tests():
  # DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
  frames = []
  for sj in range(1, 13):
    for sr in range(9, 11):
      frames.append(pd.read_csv(f'{path}/subj{sj}_series{sr}_data.csv'))
  return pd.concat(frames, ignore_index=True)

In [None]:
tests = get_merged_tests()

In [None]:
tests = tests.drop(columns=['id'])
tests.head()

In [None]:
out = tests.loc[[0], :]  
out.head()

In [None]:
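# Same label order as in training, so each prediction column lines up with the right event
classes = ['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']
# `model` is the fitted estimator from the training section.
# Predict all rows in one call, reshaped the same way as the training data, and write the
# result into `sub` (the frame that gets saved), assuming its rows are in the same order as `tests`
X_test = np.expand_dims(tests.values.astype(np.float64), axis=2)
sub[classes] = model.predict(X_test)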
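
Filling the submission frame depends on predictions ending up on the right rows. A more defensive variant, sketched below, aligns predictions to the submission on the `id` column instead of relying on row order. `fill_submission` is a hypothetical helper, and the three-row frame is invented for the example; it assumes the `id` values from the test files are kept rather than dropped.

```python
import numpy as np
import pandas as pd

def fill_submission(sub, ids, preds, classes):
    """Write preds (n_rows, n_classes) into a copy of sub, aligned on the id column."""
    pred_df = pd.DataFrame(preds, columns=classes, index=pd.Index(ids, name="id"))
    out = sub.set_index("id")
    missing = out.index.difference(pred_df.index)
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} submission ids have no prediction")
    # Reorder the predictions to the submission's row order before assigning.
    out[classes] = pred_df.loc[out.index, classes].to_numpy()
    return out.reset_index()

# Toy example: predictions arrive in a different order than the submission rows.
classes_demo = ["HandStart", "LiftOff"]
sub_demo = pd.DataFrame({"id": ["a", "b", "c"], "HandStart": 0.0, "LiftOff": 0.0})
preds_demo = np.array([[0.3, 0.3], [0.1, 0.1], [0.2, 0.2]])
filled = fill_submission(sub_demo, ["c", "a", "b"], preds_demo, classes_demo)
print(filled)
```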

In [None]:
sub.to_csv('../data/kaggle-eeg/submission.csv', index=False)

In [None]:
!kaggle competitions submit grasp-and-lift-eeg-detection -f ../data/kaggle-eeg/submission.csv -m "Message"