Task 1: classify the data using linear regression, SVM (feel free to play with kernel options), and kNN algorithms; provide accuracy and precision data for each. Suggest reasoning as to why the ones performing better might do so (you do not have to be correct – for this project, this is just something you should enjoy considering).
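Since the task asks for accuracy and precision, a small worked example may help make clear what the two numbers measure. This is an illustrative sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not taken from the EEG data); it uses the same `average='weighted'` setting as the cells below.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels for 8 samples (3 of class 0, 5 of class 1)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1]

# Accuracy: fraction of samples predicted correctly (7 of 8)
acc = accuracy_score(y_true, y_pred)

# Weighted precision: per-class precision averaged by class support
#   class 0: 3 correct out of 4 predicted as 0 -> 0.75 (support 3)
#   class 1: 4 correct out of 4 predicted as 1 -> 1.00 (support 5)
prec = precision_score(y_true, y_pred, average="weighted")
manual = (3 * (3 / 4) + 5 * (4 / 4)) / 8

print(f"Accuracy = {acc:.3f}, Weighted precision = {prec:.3f}, Manual = {manual:.3f}")
```

The two metrics can differ even on the same predictions, which is why both are reported for each model.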


**Approach 1:**

In [1]:
import requests

# URL of the published Google Sheets CSV file
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSC87hYugbdg0_MbAhqVHaGSTk_-tEb_X_1YeXo6qzuz-bKm3Vo3gQd6m4IlZ5CAQMUUxfZrtCgbWYv/pub?output=csv"

# Download the CSV file
response = requests.get(url)

# Save the file locally
if response.status_code == 200:
    with open("EEG data - Sheet1.csv", "wb") as file:
        file.write(response.content)
    print("CSV file downloaded successfully as 'EEG data - Sheet1.csv'")
else:
    print("Failed to download CSV file. Status code:", response.status_code)

CSV file downloaded successfully as 'EEG data - Sheet1.csv'


In [2]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score

# Fix the NumPy seed so the noise augmentation is reproducible
np.random.seed(42)

# Path of the CSV file downloaded above
file_path = "EEG data - Sheet1.csv"

In [3]:
df = pd.read_csv(file_path)
df = df.drop(columns=['Unnamed: 0'])

X = df.drop(columns=['target'])
y = df['target']

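# Data Augmentation: Generate synthetic samples with noise
num_synthetic_samples = 120
noise_factor = 0.5
# X and y are sampled with the same n and random_state, so the same rows are
# drawn from both and each synthetic sample keeps its original label
synthetic_X = X.sample(n=num_synthetic_samples, replace=True, random_state=42).values
synthetic_y = y.sample(n=num_synthetic_samples, replace=True, random_state=42).values
# Perturb the resampled rows with Gaussian noise (added before standardization)
synthetic_X += noise_factor * np.random.randn(*synthetic_X.shape)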

X_augmented = np.vstack((X.values, synthetic_X))
y_augmented = np.hstack((y.values, synthetic_y))
scaler = StandardScaler()
X_augmented = scaler.fit_transform(X_augmented)

# Apply PCA (reduce to 40 principal components)
pca = PCA(n_components=40)
X_pca = pca.fit_transform(X_augmented)

# Select best 10 features using SelectKBest
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X_pca, y_augmented)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_augmented, test_size=0.2, random_state=42)

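# Classifiers
# Note: SGDClassifier with loss='log_loss' fits a logistic regression model;
# it stands in for the "linear regression" classifier named in the task.
models = {
    "Linear Regression (SGDClassifier)": SGDClassifier(loss='log_loss', max_iter=1000),
    "SVM (RBF Kernel)": SVC(kernel='rbf'),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5)
}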

# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    results[name] = (accuracy, precision)

# Print model results
print("Model Performance:")
for name, (acc, prec) in results.items():
    print(f"{name}: Accuracy = {acc:.2f}, Precision = {prec:.2f}")

# Apply GridSearchCV to the SVM and kNN models
# Define hyperparameter grid for SVM ('gamma' is ignored by the linear kernel)
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Define hyperparameter grid for KNN
# ('minkowski' with the default p=2 is equivalent to 'euclidean')
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Initialize models
svm_model = SVC()
knn_model = KNeighborsClassifier()

# Run GridSearchCV for SVM
svm_grid_search = GridSearchCV(svm_model, svm_param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
svm_grid_search.fit(X_train, y_train)

# Run GridSearchCV for KNN
knn_grid_search = GridSearchCV(knn_model, knn_param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
knn_grid_search.fit(X_train, y_train)

# Get actual predictions using the best estimator from GridSearchCV
svm_preds = svm_grid_search.best_estimator_.predict(X_test)
knn_preds = knn_grid_search.best_estimator_.predict(X_test)

# Evaluate tuned SVM
svm_accuracy = accuracy_score(y_test, svm_preds)
svm_precision = precision_score(y_test, svm_preds, average='weighted')
print(f"\nTuned SVM: Accuracy = {svm_accuracy:.2f}, Precision = {svm_precision:.2f}, Best parameters:", svm_grid_search.best_params_)

# Evaluate tuned KNN
knn_accuracy = accuracy_score(y_test, knn_preds)
knn_precision = precision_score(y_test, knn_preds, average='weighted')
print(f"Tuned KNN: Accuracy = {knn_accuracy:.2f}, Precision = {knn_precision:.2f}, Best parameters:", knn_grid_search.best_params_)

Model Performance:
Linear Regression (SGDClassifier): Accuracy = 0.88, Precision = 0.90
SVM (RBF Kernel): Accuracy = 0.78, Precision = 0.79
kNN (k=5): Accuracy = 0.69, Precision = 0.69

Tuned SVM: Accuracy = 0.91, Precision = 0.92, Best parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'linear'}
Tuned KNN: Accuracy = 0.88, Precision = 0.88, Best parameters: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}


# Approach 1: PCA-Based Dimensionality Reduction and Feature Selection

***Steps:***
* Data Cleaning: Removed irrelevant columns and separated features (X) from the target labels (y).

* Data Augmentation: Generated 120 synthetic samples by resampling rows with replacement and adding Gaussian noise (noise_factor = 0.5), with the aim of improving generalization.

* Normalization: Applied StandardScaler to standardize the feature values.

* Dimensionality Reduction: Used PCA to project the standardized features onto the first 40 principal components, i.e. the 40 orthogonal directions of largest variance.

* Feature Selection: Applied SelectKBest with the ANOVA F-test to keep the 10 post-PCA components with the highest F-scores against the target.

* Model Training: Evaluated three models – SGDClassifier, SVM (RBF), and kNN (k=5).

* Hyperparameter Tuning: Applied GridSearchCV on SVM and kNN to optimize performance.
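How much variance 40 components retain can be read from PCA's `explained_variance_ratio_` attribute. The sketch below shows the check on stand-in data; `X_demo`, its 200 x 320 shape, and the 95% threshold are assumptions for illustration, not properties of the EEG dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the feature matrix: 200 samples x 320 features,
# built from 15 latent sources plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 15))
mixing = rng.normal(size=(15, 320))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 320))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=40).fit(X_demo_scaled)

# Cumulative share of total variance captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Variance retained by 40 components: {cumulative[-1]:.3f}")

# Smallest number of components reaching a 95% threshold (if reached at all)
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95%:", n_95 if n_95 <= 40 else "more than 40")
```

Running the same two lines on the fitted `pca` object from the cell above would show whether 40 components is a generous or a tight choice for this dataset.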

***Results:***

* Best performance: Tuned SVM (Linear kernel) – 91% Accuracy, 92% Precision.

* The linear models (SGD and the linear-kernel SVM) did well on the PCA-reduced features, which suggests the classes are close to linearly separable in that space.

* kNN improved after tuning but remained behind the SVM in this approach.
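One thing worth checking about these numbers: in the cell above, the synthetic rows are generated, and the scaler, PCA, and selector are fitted, before the train/test split, so noisy copies of test rows can land in the training set. The following is a minimal sketch of a variant that splits first, augments only the training rows, and wraps the preprocessing in a `Pipeline`. The data comes from `make_classification` as a hypothetical stand-in, and the reduced parameter grid is also an assumption chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data; the notebook would use X.values and y.values.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=60, n_informative=10, random_state=42
)

# 1) Hold out the test set before any augmentation or fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2) Augment the training rows only (same noise recipe as the notebook).
rng = np.random.default_rng(42)
idx = rng.integers(0, len(X_tr), size=96)
X_syn = X_tr[idx] + 0.5 * rng.standard_normal((96, X_tr.shape[1]))
X_tr_aug = np.vstack((X_tr, X_syn))
y_tr_aug = np.hstack((y_tr, y_tr[idx]))

# 3) Scaling, PCA, and feature selection are refit on each CV training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=40)),
    ("select", SelectKBest(f_classif, k=10)),
    ("svm", SVC()),
])
grid = GridSearchCV(
    pipe,
    {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]},
    cv=5,
    scoring="f1_weighted",
)
grid.fit(X_tr_aug, y_tr_aug)

# 4) Score once on the untouched test rows.
y_hat = grid.predict(X_te)
acc = accuracy_score(y_te, y_hat)
prec = precision_score(y_te, y_hat, average="weighted", zero_division=0)
print(f"Accuracy = {acc:.2f}, Precision = {prec:.2f}, Best parameters:", grid.best_params_)
```

In this sketch the cross-validation folds can still contain noisy copies of each other's rows, so the CV scores may be optimistic; the held-out test score is the number that stays clean.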



**Approach 2:**

In [4]:
# Load and clean data
df = pd.read_csv(file_path)
df = df.drop(columns=['Unnamed: 0'])

X = df.drop(columns=['target'])
y = df['target']

# --- Topological (Neighbor-based) Pooling ---
channel_neighbors = {
    1: [], 2: [5, 59, 60], 3: [6, 60], 4: [7, 54], 5: [2], 6: [9, 3], 7: [16, 4],
    8: [], 9: [12, 6], 10: [11], 11: [10, 12, 13], 12: [9, 11, 13, 14], 13: [11, 12, 18, 19],
    14: [12], 15: [20], 16: [7, 21], 17: [], 18: [13], 19: [13], 20: [15], 21: [16],
    22: [26], 23: [24], 24: [23, 25], 25: [24, 26, 27], 26: [22, 25, 27, 28], 27: [26, 30],
    # NOTE: channel 32 lists itself below; given 30: [32, 27], neighbor 30 was probably intended
    28: [26, 31], 29: [32], 30: [32, 27], 31: [28, 33], 32: [32, 29], 33: [31, 36],
    34: [36], 35: [37], 36: [33, 34, 38], 37: [35, 39], 38: [36, 40], 39: [37],
    40: [38, 42], 41: [51], 42: [40, 46], 43: [44, 47], 44: [45, 43], 45: [44, 46, 48],
    46: [42, 45, 48, 49], 47: [43], 48: [45, 46, 52], 49: [46], 50: [53], 51: [54, 41],
    52: [48, 55], 53: [50], 54: [4, 51], 55: [52], 56: [59], 57: [60], 58: [59],
    59: [2, 56, 58, 60], 60: [2, 3, 57, 59], 61: [], 62: [], 63: [], 64: []
}

X_pooled = X.copy()
for channel, neighbors in channel_neighbors.items():
    for band in ["theta", "alpha", "beta", "gamma", "delta"]:
        col = f"{channel}_{band}"
        neighbor_cols = [f"{n}_{band}" for n in neighbors if f"{n}_{band}" in X.columns]
        if neighbor_cols:
            X_pooled[col] = X[neighbor_cols].mean(axis=1)

# Data Augmentation: Generate synthetic samples with noise
num_synthetic_samples = 120
noise_factor = 0.5
synthetic_X = X_pooled.sample(n=num_synthetic_samples, replace=True, random_state=42).values
synthetic_y = y.sample(n=num_synthetic_samples, replace=True, random_state=42).values
synthetic_X += noise_factor * np.random.randn(*synthetic_X.shape)


X_augmented = np.vstack((X_pooled.values, synthetic_X))
y_augmented = np.hstack((y.values, synthetic_y))

# Normalize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_augmented)

# Select top 20 features using ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X_scaled, y_augmented)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_augmented, test_size=0.2, random_state=42)

# classifiers
models = {
    "Linear Regression (SGDClassifier)": SGDClassifier(loss='log_loss', max_iter=1000),
    "SVM (RBF Kernel)": SVC(kernel='rbf'),
    "k-Nearest Neighbors (k=5)": KNeighborsClassifier(n_neighbors=5)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    results[name] = (accuracy, precision)

# Output base models
print("\nModel Performance:")
for name, (acc, prec) in results.items():
    print(f"{name}: Accuracy = {acc:.2f}, Precision = {prec:.2f}")

# GridSearch Hyperparameter Tuning
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

svm_model = SVC()
knn_model = KNeighborsClassifier()

svm_grid_search = GridSearchCV(svm_model, svm_param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
svm_grid_search.fit(X_train, y_train)

knn_grid_search = GridSearchCV(knn_model, knn_param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
knn_grid_search.fit(X_train, y_train)

# Predictions from tuned models
svm_best = svm_grid_search.best_estimator_
knn_best = knn_grid_search.best_estimator_

svm_preds = svm_best.predict(X_test)
knn_preds = knn_best.predict(X_test)

# Evaluate tuned models
svm_accuracy = accuracy_score(y_test, svm_preds)
svm_precision = precision_score(y_test, svm_preds, average='weighted')
print(f"\nTuned SVM: Accuracy = {svm_accuracy:.2f}, Precision = {svm_precision:.2f}, Best parameters:", svm_grid_search.best_params_)

knn_accuracy = accuracy_score(y_test, knn_preds)
knn_precision = precision_score(y_test, knn_preds, average='weighted')
print(f"Tuned KNN: Accuracy = {knn_accuracy:.2f}, Precision = {knn_precision:.2f}, Best parameters:", knn_grid_search.best_params_)



Model Performance:
Linear Regression (SGDClassifier): Accuracy = 0.69, Precision = 0.81
SVM (RBF Kernel): Accuracy = 0.78, Precision = 0.85
k-Nearest Neighbors (k=5): Accuracy = 0.72, Precision = 0.74

Tuned SVM: Accuracy = 0.88, Precision = 0.90, Best parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
Tuned KNN: Accuracy = 0.91, Precision = 0.92, Best parameters: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'distance'}


# Approach 2: Topological Pooling Based on EEG Channel Neighborhoods

***Steps:***

* Data Cleaning: Same as Approach 1.

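* Topological Pooling: Used a hand-crafted dictionary of EEG channel neighbors. For each channel with at least one listed neighbor, every band feature (e.g. 2_alpha) was replaced by the mean of the same band over its neighboring channels, encoding spatial relationships among electrodes. Channels with no listed neighbors were left unchanged.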

* Data Augmentation: Same method as in Approach 1 – added noise to generate synthetic samples.

* Normalization: Standardized the data using StandardScaler.

* Feature Selection: Used SelectKBest to choose the top 20 features based on ANOVA F-test (no PCA used).

* Model Training: Same three models evaluated as in Approach 1.

* Hyperparameter Tuning: Applied GridSearchCV for SVM and kNN.
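The pooling step can be traced by hand on a toy table. The sketch below reuses the loop from the cell above on three made-up channels and one band; `X_toy` and `toy_neighbors` are hypothetical and much smaller than the real neighbor dictionary.

```python
import pandas as pd

# Toy data: 3 channels x 1 band, 2 samples. Column names follow the
# "<channel>_<band>" pattern used in the notebook.
X_toy = pd.DataFrame({
    "1_alpha": [1.0, 2.0],
    "2_alpha": [3.0, 4.0],
    "3_alpha": [5.0, 6.0],
})

# Hypothetical neighbor map: channel 1 touches 2 and 3, channel 2 touches 1,
# and channel 3 has no listed neighbors.
toy_neighbors = {1: [2, 3], 2: [1], 3: []}

X_toy_pooled = X_toy.copy()
for channel, neighbors in toy_neighbors.items():
    col = f"{channel}_alpha"
    neighbor_cols = [f"{n}_alpha" for n in neighbors if f"{n}_alpha" in X_toy.columns]
    if neighbor_cols:
        # Read from the untouched X_toy so earlier replacements do not cascade
        X_toy_pooled[col] = X_toy[neighbor_cols].mean(axis=1)

print(X_toy_pooled)
```

Channel 1 becomes the mean of channels 2 and 3, channel 2 takes channel 1's original values, and channel 3 keeps its own values. Note that a channel's own reading does not enter its pooled value.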

***Results:***

* Best performance: Tuned kNN – 91% Accuracy, 92% Precision.

* SVM performance improved with tuning (88% accuracy) but remained slightly below kNN in this setting.

* Linear models like SGD performed worst here (69% accuracy), possibly because the classes are not linearly separable in the pooled feature space.

# General Observations

* SVM and kNN both gained substantially from hyperparameter tuning (Approach 1: SVM 78% to 91%, kNN 69% to 88%; Approach 2: SVM 78% to 88%, kNN 72% to 91% accuracy), consistent with their sensitivity to parameters such as C, kernel, gamma, and n_neighbors.

* In Approach 1, PCA compresses the high-dimensional EEG features into 40 orthogonal components ordered by variance, and SelectKBest keeps the 10 most class-discriminative of them. The linear models (SGD at 88%, linear-kernel SVM at 91%) do well on this representation, which suggests the classes are close to linearly separable in it.

* With topological pooling, kNN performs better than it did in Approach 1 (91% vs. 88% after tuning). Since kNN relies on distances in feature space, averaging over neighboring electrodes may reduce channel-level noise and make nearby samples more consistent, which would stabilize distance-based classification.
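The noise-reduction argument for pooling can be illustrated numerically. This is a simplified sketch under strong assumptions: four neighboring channels share one underlying signal, and each adds independent unit-variance Gaussian noise. Real EEG channels are unlikely to satisfy this exactly, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 500, 4

# Assumed model: every neighboring channel = shared signal + its own noise
signal = rng.normal(size=(n_samples, 1))
noise = rng.normal(scale=1.0, size=(n_samples, n_channels))
channels = signal + noise

# Pooling: average over the neighboring channels
pooled = channels.mean(axis=1)

# Variance of the error relative to the underlying signal
err_single = np.var(channels[:, 0] - signal[:, 0])
err_pooled = np.var(pooled - signal[:, 0])

print(f"Error variance, single channel: {err_single:.3f}")
print(f"Error variance, pooled over {n_channels} channels: {err_pooled:.3f}")
```

Under these assumptions the error variance drops by roughly a factor equal to the number of averaged channels, so distances between samples are less dominated by noise. Whether that carries over to the EEG features depends on how correlated the true signals are across neighboring electrodes.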