# EEG Emotion Classification: Comparing Random Forest vs. Logistic Regression

## Introduction

Electroencephalography (EEG) measures the electrical activity of the brain and can provide insights into different emotional states. This project is an opportunity to explore how EEG signals reflect emotions and, more importantly, to understand how different machine learning models handle this type of data.

I aim to compare the performance of two commonly used algorithms, Random Forest and Logistic Regression, not just to achieve high accuracy, but to learn how model choice impacts predictions, interpretability, and feature importance.

Random Forest is a tree-based ensemble method capable of capturing complex, non-linear patterns, while Logistic Regression is a linear model that predicts class probabilities and is simpler to interpret.
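To make the linear versus non-linear distinction concrete, here is a minimal sketch on synthetic XOR-style data. It does not use the EEG dataset, and the names `X_demo` and `y_demo` are made up for illustration. The class depends on the interaction between two features, so a single linear boundary should struggle where a tree ensemble should not.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic XOR-style data (an assumption for illustration, NOT EEG data):
# the class is 1 when exactly one of the two features is positive.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(1000, 2))
y_demo = ((X_demo[:, 0] > 0) ^ (X_demo[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Fit both models on the same split and compare held-out accuracy.
lin_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
rf_acc = (
    RandomForestClassifier(n_estimators=100, random_state=0)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
)

print(f"Logistic Regression accuracy: {lin_acc:.2f}")
print(f"Random Forest accuracy:       {rf_acc:.2f}")
```

This says nothing about which model wins on the EEG features. It only illustrates the kind of structure a linear model cannot represent without engineered interaction terms.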

The dataset contains EEG recordings labelled as POSITIVE, NEUTRAL, or NEGATIVE. By evaluating these models using accuracy, confusion matrices, F1-scores, and feature importance, I hope to gain a deeper understanding of how different modelling approaches work on the same data and what each can teach me about EEG-based emotion recognition.
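As a small worked example of how these metrics relate, the sketch below derives accuracy and per-class F1 from a confusion matrix. The six predictions are hypothetical toy values chosen for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

# Hypothetical toy labels and predictions (not real model output).
y_true_demo = ["NEGATIVE", "NEGATIVE", "NEUTRAL", "NEUTRAL", "POSITIVE", "POSITIVE"]
y_pred_demo = ["NEGATIVE", "POSITIVE", "NEUTRAL", "NEUTRAL", "POSITIVE", "NEGATIVE"]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = true_pos / cm.sum(axis=1)     # correct predictions / all actual members of that class
f1_per_class = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1_per_class.mean()

for name, f1 in zip(labels, f1_per_class):
    print(f"{name}: F1 = {f1:.2f}")
print(f"Accuracy = {accuracy:.2f}, macro F1 = {macro_f1:.2f}")
```

The manual values should agree with `accuracy_score` and `f1_score(..., average=None)`, which is a quick way to check that the confusion matrix is being read with the right orientation.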

## Methods
### Dataset
Source: Kaggle, "EEG Brainwave Dataset: Feeling Emotions".

Features: 14 EEG channel measurements per recording, plus additional extracted features (FFT and statistical metrics).

Target: label (POSITIVE, NEUTRAL, NEGATIVE).

Total samples: 2,132 (1,705 training, 427 testing).
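The train and test counts follow from the 80/20 split: 0.2 × 2,132 = 426.4, and `train_test_split` rounds the test size up to 427, leaving 1,705 rows for training. The sketch below checks this on a placeholder array (`placeholder_rows` is a stand-in, not the real data).

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 2132
placeholder_rows = np.arange(n_samples)  # stand-in for the dataset rows

train_rows, test_rows = train_test_split(
    placeholder_rows, test_size=0.2, random_state=42
)

# The test size is the ceiling of test_size * n_samples; training gets the rest.
expected_test = math.ceil(0.2 * n_samples)
expected_train = n_samples - expected_test

print("Train rows:", len(train_rows), "expected:", expected_train)
print("Test rows: ", len(test_rows), "expected:", expected_test)
```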

In [None]:

# 1. IMPORT LIBRARIES

# Explanation:
# pandas & numpy -> data handling,
# sklearn -> modeling and evaluation,
# matplotlib & seaborn -> visualization.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# 2. LOAD DATA

# Explanation:
# Load the EEG dataset
df = pd.read_csv('/kaggle/input/eeg-brainwave-dataset-feeling-emotions/emotions.csv')

# Inspect the first few rows
df.head()

In [None]:
# 3. PREPARE FEATURES AND LABELS

# Explanation:
# Features X -> all columns EXCEPT 'label'.
# Labels y -> the 'label' column.
X = df.drop('label', axis=1)
y = df['label']

# Check label distribution
print("Label Distribution:\n", y.value_counts())

In [None]:
# 4. SPLIT DATA INTO TRAIN AND TEST SETS

# Explanation:
# We split 80% training, 20% testing.
# stratify=y ensures each emotion is represented proportionally.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))
print("Training label distribution:\n", y_train.value_counts())
print("Testing label distribution:\n", y_test.value_counts())

In [None]:
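# 5. SCALE DATA FOR LOGISTIC REGRESSION
# Explanation:
# Logistic Regression converges more reliably, and its regularization treats
# features more evenly, when all features are on a similar scale.
# The scaler is fit on the training set only and then applied to the test set,
# so no test-set statistics leak into training.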
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
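An alternative way to keep the scaling step tied to the training data is to bundle the scaler and the classifier in a scikit-learn `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the EEG features; `X_syn`, `y_syn`, and `scaled_log_reg` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (an assumption for illustration, NOT the EEG dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# fit() scales using training statistics only; score() reuses them on the test set.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

# The fitted scaler's means should match the training-set means, not the full data.
fitted_scaler = scaled_log_reg.named_steps["standardscaler"]
uses_train_stats = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
test_accuracy = scaled_log_reg.score(X_te, y_te)

print("Scaler fit on training data only:", uses_train_stats)
print(f"Test accuracy: {test_accuracy:.2f}")
```

The main benefit of this design is that the same object can be passed to cross-validation utilities, where the scaler is then re-fit inside each fold automatically.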

In [None]:
# 6. LOGISTIC REGRESSION
# Explanation:
# Train logistic regression and evaluate.
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)
y_pred_log = log_reg.predict(X_test_scaled)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))
print("Classification Report:\n", classification_report(y_test, y_pred_log))

In [None]:
# 7. RANDOM FOREST
# Explanation:
# Random Forest is an ensemble method that can handle unscaled features.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

In [None]:
# 8. VISUALIZATION
# Explanation:
# Plot confusion matrices and feature importances to understand model behavior.

# Confusion Matrix Heatmap
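def plot_confusion(y_true, y_pred, title):
    # Derive the label order from the arguments (not the global y) and pass it
    # explicitly so the matrix rows/columns always match the tick labels.
    labels = np.unique(y_true)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()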

plot_confusion(y_test, y_pred_log, "Logistic Regression Confusion Matrix")
plot_confusion(y_test, y_pred_rf, "Random Forest Confusion Matrix")

# Feature Importance (Random Forest)
importances = rf.feature_importances_
feat_names = X.columns
feat_imp_df = pd.DataFrame({'Feature': feat_names, 'Importance': importances}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feat_imp_df.head(15))
plt.title("Top 15 Feature Importances - Random Forest")
plt.show()

## Discussion

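Random Forest can model complex, non-linear patterns and feature interactions in EEG data, which likely explains its higher accuracy.

Logistic Regression is a simpler, linear model, so it may miss interactions and other non-linear structure in the EEG features, although its coefficients are easier to interpret.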
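A single train/test split can overstate or understate the gap between two models. One way to check whether a difference is stable is stratified k-fold cross-validation, sketched below on synthetic data. The data, the `candidates` dictionary, and the fold count are assumptions for illustration, and the printed scores are not results for the EEG dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (a stand-in, NOT the EEG dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3, random_state=42
)

candidates = {
    # Scaling lives inside the pipeline so it is re-fit within each fold.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Both models are scored on the same five stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is as large as the gap between the two means, the ranking of the models should be treated as tentative.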

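Random Forest feature importances can point future neuroscience studies toward the most informative channels and features. Because impurity-based importances can be diluted across correlated features, they are best treated as a starting point and not as a final ranking.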

### Limitations:

* Only one dataset used
* Model interpretability can be improved with SHAP or permutation importance
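As a rough sketch of the permutation importance idea, the example below shuffles each feature on held-out data and measures how much the model's score drops. It uses synthetic data in which only the first three columns are informative (an assumption built into the generator call); the variable names are illustrative and none of this is the EEG data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the informative features are columns 0-2,
# and the remaining five columns are pure noise.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.5,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permute each column of the held-out set 10 times and record the accuracy drop.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

ranking = perm.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

Because the importances are computed on held-out data, they reflect what the model relies on for generalization and not just what it used to fit the training set.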

### Future directions:

* Test more models (SVM, XGBoost)
* Explore deep learning on EEG time-series data