# Objective
The goal of this project was to develop a machine learning model to predict whether a person’s eyes were open or closed based on EEG data recorded from 14 different regions of the brain. The dataset contained 14 EEG features along with the manually labeled eye state ('0' for open, '1' for closed).

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

import xgboost as xgb


In [None]:
df = pd.read_csv("/kaggle/input/eeg-neuroheadset/eeg-headset.csv")
df.info()  # info() prints its summary directly and returns None

print(df.head())

# Data Preprocessing

Standardized the EEG signals using StandardScaler to improve model performance.

Split the dataset into training and testing sets (80%-20%) to ensure a fair evaluation of models.

Checked the class distribution of eye states, which was relatively balanced (~55% closed, ~45% open), so no resampling was applied.
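
One way to combine the scaling and the 80/20 split described above, without letting test-set statistics influence the scaler, is to split first and fit `StandardScaler` on the training rows only. The sketch below uses synthetic data as a stand-in for the EEG table; the array shape, the offset of 4000, and all `*_demo` names are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the EEG table: 1000 rows, 14 channels (assumed shape)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=4000.0, scale=50.0, size=(1000, 14))
y_demo = rng.integers(0, 2, size=1000)

# Split first so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then apply the same transform to both sets
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # per-channel means of the scaled training set
```

The scaled test set will generally not have exactly zero mean, which is expected: it is transformed with the training statistics.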

In [None]:
sns.countplot(x=df["eye_state"])
plt.title("Eye State Distribution")
plt.show()

print(df["eye_state"].value_counts(normalize=True))

In [None]:
print(df["eye_state"].unique())
# Remap label 2 to 0
df["eye_state"] = df["eye_state"].replace(2, 0)


In [None]:


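# Standardize every column except the last one (assumed to be eye_state)
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled.iloc[:, :-1] = scaler.fit_transform(df.iloc[:, :-1])
# NOTE: df_scaled is not used below; X is built from the unscaled df,
# and fitting the scaler before the split would leak test-set statistics.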


# Model Comparisons & Performance 
# Logistic Regression 

In [None]:
X = df.drop(columns=['eye_state'])
y = df['eye_state']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
log_reg = LogisticRegression(max_iter=500)
log_reg.fit(X_train, y_train)


y_pred_log = log_reg.predict(X_test)
log_accuracy = accuracy_score(y_test, y_pred_log)
print(f'Logistic Regression Accuracy: {log_accuracy:.4f}')

In [None]:
print(classification_report(y_test, y_pred_log))

Insights: Logistic Regression struggled to capture complex patterns in the EEG data. As a linear model, it can only learn a linear decision boundary over the 14 channel values, which limited its effectiveness.
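
To illustrate why a linear decision boundary can be a limitation, the sketch below compares Logistic Regression and Random Forest on a toy two-feature problem whose classes cannot be separated by a straight line. The data is synthetic and unrelated to EEG; the `*_demo` names and the XOR-style labeling rule are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: the label is 1 when both features share the same sign (XOR-like)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(2000, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# A single linear boundary cannot separate the four quadrants
lr_acc = LogisticRegression(max_iter=500).fit(X_tr, y_tr).score(X_te, y_te)

# Tree ensembles can carve the plane into axis-aligned regions
rf_acc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression: {lr_acc:.3f}, Random Forest: {rf_acc:.3f}")
```

On data like this the linear model stays near chance while the tree ensemble does well; whether the EEG data behaves similarly is something the accuracy scores in this notebook have to show.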

# Random Forest

In [None]:
rf_model = RandomForestClassifier(random_state=42, n_jobs=-1) 
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)


In [None]:
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", rf_accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

Insights: The Random Forest classifier demonstrated **strong predictive accuracy** in classifying eye state based on EEG signals. The model achieved:  

- **Accuracy:** 93.96%  
- **Precision & Recall:** Balanced performance for both eye-open (0) and eye-closed (1) states.  
- **F1-score:** High scores (0.93-0.95), indicating strong classification performance.  

These results suggest that EEG signals contain clear patterns that distinguish between eye-open and eye-closed states, making Random Forest a **robust choice** for this task.  
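
For reference, the precision, recall, and F1 values in the classification report are all derived from the confusion matrix. The sketch below works through that arithmetic on a small hand-made set of labels. These labels are hypothetical and are not the model's predictions, so the resulting numbers say nothing about the Random Forest itself.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (0 = open, 1 = closed)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the rows predicted "closed", the share that was correct
recall = tp / (tp + fn)     # of the rows actually "closed", the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```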


# Feature Importance Analysis

In [None]:
feature_importances = rf_model.feature_importances_
feature_names = X_train.columns
sorted_indices = np.argsort(feature_importances)[::-1]
sorted_features = feature_names[sorted_indices]
sorted_importances = feature_importances[sorted_indices]


In [None]:
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_importances, color='teal')
plt.xlabel("Feature Importance")
plt.ylabel("EEG Channel")
plt.title("Feature Importance in Predicting Eye State")
plt.gca().invert_yaxis()  # To have the most important feature on top
plt.show()

Insights: The feature importance analysis revealed that EEG channels contribute differently to eye-state classification.  

- The **O1 region** (Occipital lobe) exhibited the **highest importance (0.12)**, aligning with its role in visual processing.  
- Other occipital and frontal regions also played a role, while the **P (parietal) region had the lowest importance (0.042)**.  

These findings suggest that **visual cortex activity is the strongest predictor of eye state**, while other brain areas contribute to a lesser extent.  
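
The ranking above comes from the impurity-based `feature_importances_` attribute. A common cross-check is permutation importance, which shuffles one column at a time on held-out data and measures the drop in accuracy. This is a different technique from the one used in this notebook. The sketch below runs it on synthetic data in which only the first column carries signal; the data, the `demo_rf` model, and the other `*_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 determines the label (assumption for the demo)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rf.fit(X_tr, y_tr)

# Shuffle each column on the held-out rows and record the mean accuracy drop
perm = permutation_importance(demo_rf, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]

print("Columns ranked by permutation importance:", ranking)
```

Applied to the notebook's own model, the equivalent call would pass `rf_model`, `X_test`, and `y_test`; agreement between the two rankings would strengthen the interpretation of the O1 result.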

### Next Steps  
While the Random Forest model performed well, we will now explore a **Multilayer Perceptron (MLP)** to assess whether a deep learning approach can further improve classification accuracy by capturing more complex patterns in EEG signals.  


# Multilayer Perceptron (MLP - Neural Network)

In [None]:
mlp_model = Sequential([
    Input(shape=(X_train.shape[1],)),

    Dense(256, activation='relu', kernel_regularizer=l2(0.01)),
    BatchNormalization(),
    Dropout(0.2),

    Dense(128, activation='relu', kernel_regularizer=l2(0.01)),
    BatchNormalization(),
    Dropout(0.2),

    Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
    BatchNormalization(),
    Dropout(0.2),

    Dense(1, activation='sigmoid')  
])

optimizer = Adam(learning_rate=0.0005) 

mlp_model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# NOTE: the test set doubles as validation data here, so there is no separate hold-out
history = mlp_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

In [None]:
mlp_test_loss, mlp_test_accuracy = mlp_model.evaluate(X_test, y_test)
print(f"MLP Test Accuracy: {mlp_test_accuracy:.4f}")

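Insights: The final test accuracy remained **~55%**, which is roughly what a classifier that always predicts the majority class would score given the ~55%/45% class split.
- **MLP struggled to capture patterns** in the EEG data, possibly due to:
  - Unscaled inputs: `X` was built from `df` rather than `df_scaled`, so the network was trained on raw EEG amplitudes.
  - Limited dataset size, with only 14 features per sample.
  - Tabular EEG features like these often being better suited to tree-based models.
  - Neural networks requiring more data to generalize well.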
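
A useful reference point for the ~55% figure is the accuracy of always predicting the most frequent class. The sketch below computes that baseline with scikit-learn's `DummyClassifier` on hypothetical labels that mirror the reported ~55%/45% class balance; the label array is made up for illustration and is not the actual test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with a 55/45 split, mirroring the reported class balance
y_demo = np.array([1] * 55 + [0] * 45)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
# prints: Majority-class baseline accuracy: 0.55
```

A model that scores close to this baseline has learned little beyond the class prior.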

Given the poor performance, we decided to shift focus to **XGBoost**, a powerful gradient boosting algorithm, to see if it can outperform Random Forest while handling structured data more effectively.

# XGBoost

In [None]:
xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)

xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
print(f"XGBoost Accuracy: {xgb_accuracy:.4f}")


In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb))

Insights: XGBoost achieved strong performance with better interpretability than the MLP. Although its accuracy was slightly lower than Random Forest's, it provided robust predictions with good generalization.

# Conclusions
Random Forest emerged as the best-performing model, achieving 94% accuracy while also providing explainable feature importance.

XGBoost was a close second, performing well while being more computationally efficient.

Logistic Regression and the MLP struggled: the former is limited to linear decision boundaries, which the EEG patterns do not appear to follow, and the latter did not train effectively on this data.

Overall, this project demonstrated the application of multiple machine learning techniques, their strengths and weaknesses, and how feature importance analysis can provide valuable scientific insights beyond prediction accuracy alone.

