## EEG Data Classification: Neural State Prediction

Dataset: 'EEG data - Sheet1.csv'

Task: Classify participant data into two neural states.

**Workflow Overview:**
1. Load data.
2. **Task 1: Model Training & Evaluation**
   - Engineer features (regional averages, asymmetry).
   - Select features based on prior EDA.
   - Preprocess data (scaling, outlier removal, correlation filtering).
   - Setup modeling parameters (CV, pipelines, grids).
   - Run baseline model.
   - Train and evaluate models (Logistic Regression, SVM, kNN) using GridSearchCV.
   - Report Accuracy and Precision.
3. **Task 2: Feature Importance Analysis**
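   - Part A: Analyze the original 320 features with univariate feature selection (UFS), recursive feature elimination (RFE), and principal component analysis (PCA).
   - Part B: Repeat the UFS, RFE, and PCA analysis on the features used in the Task 1 models.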
4. Summarize results and provide reasoning.

## Imports

In [1]:
import pandas as pd
import numpy as np
import re
from collections import defaultdict

# Scikit-learn imports for modeling and evaluation
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV, cross_validate
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, make_scorer, confusion_matrix

# Scikit-learn imports for feature selection
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.decomposition import PCA

## Configuration and Constants

In [2]:
# --- Configuration Flags and Thresholds ---
REMOVE_OUTLIER_SAMPLE = True # Flag to remove Sample #17, identified as an outlier
OUTLIER_SAMPLE_INDEX = 16    # 0-based index for Sample #17
CORRELATION_THRESHOLD = 0.93 # Threshold for removing highly correlated features
N_SPLITS_CV = 5              # Number of splits for K-Fold CV
N_REPEATS_CV = 5             # Number of repeats for K-Fold CV
K_TOP_FEATURES = 5           # Number of top features to identify in Task 2
RANDOM_STATE = 42            # Seed for reproducibility
DATA_FILEPATH = 'EEG data - Sheet1.csv' # Path to the dataset

# --- EEG Channel Mapping (EGI 64 to 10-10 System) ---
EEG_MAPPING_TEXT = """
1 → 1 → F10
2 → 3 → AF4
3 → 4 → F2
4 → 6 → Fcz
5 → 9 → FP2
6 → 11 → Fz
7 → 13 → FC1
8 → 16 → AFz
9 → 19 → F1
10 → 22 → FP1
11 → 23 → AF3
12 → 24 → F3
13 → 27 → F5
14 → 28 → FC5
15 → 29 → FC3
16 → 30 → C1
17 → 32 → F9
18 → 33 → F7
19 → 34 → FT7
20 → 36 → C3
21 → 37 → CP1
22 → 41 → C5
23 → 44 → T9
24 → 45 → T7
25 → 46 → Tp7
26 → 47 → Cp5
27 → 51 → P5
28 → 52 → P3
29 → 57 → TP9
30 → 58 → P7
31 → 60 → P1
32 → 64 → P9
33 → 67 → PO3
34 → 62 → Pz
35 → 70 → O1
36 → 72 → Poz
37 → 75 → Oz
38 → 77 → PO4
39 → 83 → O2
40 → 85 → P2
41 → 87 → Cp2
42 → 92 → P4
43 → 95 → P10
44 → 96 → P8
45 → 97 → P6
46 → 98 → CP6
47 → 100 → TP10
48 → 102 → TP8
49 → 103 → C6
50 → 104 → C4
51 → 105 → C2
52 → 108 → T8
53 → 111 → FC4
54 → 112 → FC2
55 → 114 → T10
56 → 117 → FC6
57 → 118 → N/A
58 → 122 → F8
59 → 123 → F6
60 → 124 → F4
61 → 125 → N/A
62 → 126 → N/A
63 → 127 → N/A
64 → 128 → N/A
"""

# --- Feature Definitions (Based on EDA) ---
TOP_FEATURE_DEFINITIONS = {
    "delta50": 50, "delta42": 42, "delta24": 24, "delta12": 12, "theta50": 50,
    "delta52": 52, "theta42": 42, "theta24": 24, "theta52": 52, "theta12": 12,
}
TOP_REGIONAL_FEATURES_DEF = [
    'central_right_delta_mean', 'parietal_right_delta_mean', 'temporal_left_delta_mean',
    'frontal_left_delta_mean', 'central_right_theta_mean', 'temporal_right_delta_mean',
    'parietal_right_theta_mean', 'temporal_left_theta_mean', 'temporal_right_theta_mean'
]
TOP_ASYMMETRY_FEATURES_DEF = [
    'frontal_left_frontal_right_delta_asymmetry', 'central_left_central_right_delta_asymmetry',
    'temporal_left_temporal_right_delta_asymmetry', 'parietal_left_parietal_right_delta_asymmetry',
    'frontal_left_frontal_right_theta_asymmetry', 'central_left_central_right_theta_asymmetry'
]

## General Helper Functions

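These helpers parse the channel mapping, build the regional-average and asymmetry features, and filter highly correlated features. Later cells call them during feature engineering and preprocessing.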

In [3]:
def parse_eeg_mapping(mapping_text):
    """Parses the multiline EGI 64 to 10-10 mapping string into a dictionary."""
    mapping = {}
    lines = mapping_text.strip().split('\n')
    for line in lines:
        # Capture the EGI-64 index (group 1) and the 10-10 label (group 2); the middle number is skipped
        match = re.match(r'(\d+)\s*→\s*\d+\s*→\s*(.*)', line)
        if match:
            egi_64_index = int(match.group(1))
            label_10_10 = match.group(2).strip()
            mapping[egi_64_index] = label_10_10
    return mapping

def compute_regional_averages(df, electrode_map):
    """Computes average band power for predefined neuroanatomical regions."""
    # Define regions based on 10-10 system labels
    regions = {
        'frontal_left': ['F1', 'F3', 'F5', 'F7', 'F9', 'AF3'],
        'frontal_right': ['F2', 'F4', 'F6', 'F8', 'F10', 'AF4'],
        'central_left': ['C1', 'C3', 'C5'],
        'central_right': ['C2', 'C4', 'C6'],
        'temporal_left': ['T7', 'T9', 'TP7', 'TP9'],
        'temporal_right': ['T8', 'T10', 'TP8', 'TP10'],
        'parietal_left': ['P1', 'P3', 'P5', 'P7', 'P9', 'PO3'],
        'parietal_right': ['P2', 'P4', 'P6', 'P8', 'P10', 'PO4'],
        'occipital_left': ['O1'],
        'occipital_right': ['O2'],
        'midline': ['Fz', 'FCz', 'Cz', 'CPz', 'Pz', 'POz', 'Oz']
    }

    # Create reverse mapping: 10-10 label -> EGI channel number
    rev_map = {label: ch_num for ch_num, label in electrode_map.items() if label != 'N/A'}

    regional_features = pd.DataFrame(index=df.index)
    bands = ['alpha', 'beta', 'delta', 'theta', 'gamma']

    # Calculate mean for each region and band
    for region_name, electrode_labels in regions.items():
        for band in bands:
            # Get EGI channel numbers for the current region
            channel_indices = [rev_map[label] for label in electrode_labels if label in rev_map]

            if channel_indices:
                # Construct column names (e.g., 'alpha1', 'alpha3', ...)
                band_cols = [f"{band}{idx}" for idx in channel_indices if f"{band}{idx}" in df.columns]

                if band_cols:
                    # Compute mean across the relevant channel columns for each sample
                    feature_name = f"{region_name}_{band}_mean"
                    regional_features[feature_name] = df[band_cols].mean(axis=1)

    print(f"Generated {regional_features.shape[1]} regional average features.")
    return regional_features

def compute_asymmetry_features(regional_features_df):
    """Calculates hemispheric asymmetry features based on regional averages."""
    asymmetry_df = pd.DataFrame(index=regional_features_df.index)
    epsilon = 1e-10 # To avoid division by zero

    # Define symmetric region pairs for asymmetry calculation
    region_pairs = [
        ('frontal_left', 'frontal_right'), ('central_left', 'central_right'),
        ('temporal_left', 'temporal_right'), ('parietal_left', 'parietal_right'),
        ('occipital_left', 'occipital_right')
    ]
    bands = ['alpha', 'beta', 'delta', 'theta', 'gamma']

    for left_region, right_region in region_pairs:
        for band in bands:
            left_col = f"{left_region}_{band}_mean"
            right_col = f"{right_region}_{band}_mean"

            # Check if both regional features exist
            if left_col in regional_features_df.columns and right_col in regional_features_df.columns:
                left_values = regional_features_df[left_col]
                right_values = regional_features_df[right_col]
                feature_name = f"{left_region}_{right_region}_{band}_asymmetry"
                # Calculate normalized difference
                asymmetry_df[feature_name] = (left_values - right_values) / (left_values + right_values + epsilon)

    print(f"Generated {asymmetry_df.shape[1]} hemispheric asymmetry features.")
    return asymmetry_df

def filter_correlated_features(X, feature_names, threshold=0.97):
    """Drops each feature whose absolute Pearson correlation with any earlier feature exceeds the threshold."""
    if X.shape[1] <= 1:
        return X, feature_names

    df_corr = pd.DataFrame(X, columns=feature_names)
    corr_matrix = df_corr.corr().abs()

    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find features with correlation greater than the threshold
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

    if to_drop:
        print(f"Dropping {len(to_drop)} highly correlated features (threshold={threshold}): {to_drop}")
        # Get indices of features to keep
        keep_indices = [i for i, name in enumerate(feature_names) if name not in to_drop]
        # Filter X and feature_names
        X_filtered = X[:, keep_indices]
        filtered_names = [name for name in feature_names if name not in to_drop]
        return X_filtered, filtered_names
    else:
        print("No highly correlated features found above the threshold.")
        return X, feature_names
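
The asymmetry features above use a normalized difference, (L - R) / (L + R + ε), which stays between -1 and 1 when both regional means are non-negative. The following is a minimal sketch of that formula on made-up regional means; the helper name `asymmetry_index` and all values are illustrative assumptions, not taken from the dataset.

```python
EPSILON = 1e-10  # same guard against division by zero as in compute_asymmetry_features

def asymmetry_index(left, right, epsilon=EPSILON):
    """Normalized left-right difference: positive means more power on the left."""
    return (left - right) / (left + right + epsilon)

# Hypothetical regional means for one band (illustrative values only)
examples = {
    "left-dominant": (3.0, 1.0),
    "balanced": (2.0, 2.0),
    "right-dominant": (1.0, 3.0),
    "both zero": (0.0, 0.0),
}
for label, (left, right) in examples.items():
    print(f"{label:>15}: {asymmetry_index(left, right):+.3f}")
```

The epsilon term only matters when both regional means are exactly zero. In every other case it shifts the result by a negligible amount.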

## Load Data

In [4]:

df_raw = pd.read_csv(DATA_FILEPATH, index_col=0)
# print(f"Dataset loaded: {df_raw.shape[0]} samples, {df_raw.shape[1]} columns (including target)")


In [5]:
# df_raw = load_data(DATA_FILEPATH)
# if df_raw is None:
#     raise SystemExit("Stopping: Data loading error.")

## Task 1 - Feature Engineering

In [6]:
def engineer_features(df, mapping_text):
    """Engineers regional and asymmetry features."""
    print("\n--- Task 1: Feature Engineering --- ")
    channel_map = parse_eeg_mapping(mapping_text)
    regional_features = compute_regional_averages(df, channel_map)
    asymmetry_features = compute_asymmetry_features(regional_features)
    df_engineered = pd.concat([df, regional_features, asymmetry_features], axis=1)
    print(f"Shape after adding engineered features: {df_engineered.shape}")
    return df_engineered

In [7]:
df_engineered = engineer_features(df_raw, EEG_MAPPING_TEXT)


--- Task 1: Feature Engineering --- 
Generated 55 regional average features.
Generated 25 hemispheric asymmetry features.
Shape after adding engineered features: (40, 401)
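
The count of 55 (11 regions × 5 bands) only confirms that every region resolved at least one channel per band. It does not show whether every listed electrode was found. The mapping text spells some labels as `Fcz`, `Tp7`, and `Poz`, while the region lists use `FCz`, `TP7`, and `POz`, and the lookup in `compute_regional_averages` is case-sensitive. The sketch below shows one way to check this, using four lines copied from `EEG_MAPPING_TEXT`; the helper name `resolve_labels` is hypothetical.

```python
import re

# Four lines copied from EEG_MAPPING_TEXT
SAMPLE_MAPPING = """
4 → 6 → Fcz
25 → 46 → Tp7
29 → 57 → TP9
36 → 72 → Poz
"""

def resolve_labels(mapping_text, wanted_labels):
    """Split wanted labels into exact matches and matches that only work ignoring case."""
    label_to_channel = {}
    for line in mapping_text.strip().splitlines():
        match = re.match(r'(\d+)\s*→\s*\d+\s*→\s*(.*)', line)
        if match:
            label_to_channel[match.group(2).strip()] = int(match.group(1))
    upper_to_channel = {label.upper(): ch for label, ch in label_to_channel.items()}
    exact = [lab for lab in wanted_labels if lab in label_to_channel]
    case_only = [lab for lab in wanted_labels
                 if lab not in label_to_channel and lab.upper() in upper_to_channel]
    return exact, case_only

exact, case_only = resolve_labels(SAMPLE_MAPPING, ['TP7', 'TP9', 'FCz', 'POz'])
print("Exact matches:", exact)
print("Found only when ignoring case:", case_only)
```

If the second list is non-empty for the full mapping, those channels are silently left out of their regional means. Normalizing label case on both sides before building the reverse map would include them, but it would also change the regional values and possibly the downstream results reported here.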


## Task 1 - Feature Selection (EDA-based)

In [8]:
def select_eda_features(df_engineered, top_defs, regional_defs, asymmetry_defs):
    """Selects features based on predefined lists from EDA."""
    print("\n--- Task 1: Feature Selection (Based on prior EDA) ---")
    original_cols = [col for col in df_engineered.columns if not ('_mean' in col or '_asymmetry' in col or 'target' in col)]
    top_individual_cols = []
    for name, ch_num in top_defs.items():
        band = re.match(r"([a-zA-Z]+)", name).group(1)
        col_name = f"{band}{ch_num}"
        if col_name in original_cols:
            top_individual_cols.append(col_name)
        else:
            print(f"Warning: Defined top feature '{col_name}' not found.")
    top_individual_cols = sorted(list(set(top_individual_cols)))
    print(f"Selected {len(top_individual_cols)} individual channel features.")

    top_regional_cols = [col for col in regional_defs if col in df_engineered.columns]
    print(f"Selected {len(top_regional_cols)} regional average features.")

    top_asymmetry_cols = [col for col in asymmetry_defs if col in df_engineered.columns]
    print(f"Selected {len(top_asymmetry_cols)} asymmetry features.")

    selected_feature_names_eda = top_individual_cols + top_regional_cols + top_asymmetry_cols
    print(f"Total features selected based on EDA definitions: {len(selected_feature_names_eda)}")

    cols_to_keep = ['target'] + selected_feature_names_eda
    cols_to_keep = [col for col in cols_to_keep if col in df_engineered.columns]
    df_selected = df_engineered[cols_to_keep].copy()
    print(f"Shape after selecting EDA-based features: {df_selected.shape}")

    final_selected_names = [name for name in selected_feature_names_eda if name in df_selected.columns]
    return df_selected, final_selected_names

In [9]:
df_selected, selected_feature_names_eda = select_eda_features(
    df_engineered,
    TOP_FEATURE_DEFINITIONS,
    TOP_REGIONAL_FEATURES_DEF,
    TOP_ASYMMETRY_FEATURES_DEF
)


--- Task 1: Feature Selection (Based on prior EDA) ---
Selected 10 individual channel features.
Selected 9 regional average features.
Selected 6 asymmetry features.
Total features selected based on EDA definitions: 25
Shape after selecting EDA-based features: (40, 26)


## Task 1 - Preprocessing

In [10]:
def preprocess_data(df_selected, initial_feature_names, remove_outlier, outlier_idx, corr_threshold):
    """Preprocesses the selected data: outlier removal, correlation filtering, NaN handling."""
    print("\n--- Task 1: Preprocessing --- ")
    df_processed = df_selected.copy()
    if remove_outlier:
        if 0 <= outlier_idx < len(df_processed):
            print(f"Removing outlier Sample # {outlier_idx + 1} (index {outlier_idx}).")
            # Drop by position: the DataFrame index comes from the CSV's first column
            # and is not guaranteed to be 0..n-1
            df_processed = df_processed.drop(index=df_processed.index[outlier_idx]).reset_index(drop=True)
        else:
            print(f"Warning: Outlier index {outlier_idx} out of bounds. Skipping.")

    y = df_processed['target'].values
    current_feature_names = [name for name in initial_feature_names if name in df_processed.columns]
    X_raw = df_processed[current_feature_names].values
    print(f"Features before correlation filtering: {len(current_feature_names)}")

    X_filtered, feature_names_filtered = filter_correlated_features(X_raw, current_feature_names, threshold=corr_threshold)

    if np.isnan(X_filtered).any() or np.isinf(X_filtered).any():
        print("Handling NaN/Inf values in final features (NaN -> 0, +/-Inf -> float64 limits).")
        X = np.nan_to_num(X_filtered, nan=0.0, posinf=np.finfo(np.float64).max, neginf=np.finfo(np.float64).min)
    else:
        X = X_filtered
    feature_names_final = feature_names_filtered

    n_samples, n_features = X.shape
    target_counts = pd.Series(y).value_counts(normalize=True)
    print("\n--- Task 1: Data Summary (After Preprocessing for Modeling) ---")
    print(f"Total samples for modeling: {n_samples}")
    print(f"Features remaining for modeling: {n_features}")
    print(f"Final features: {feature_names_final}")
    print("Target class distribution:")
    for class_val, proportion in target_counts.items(): print(f"  Class {class_val}: {proportion:.1%}")

    if n_features == 0:
        print("Error: No features remaining!")
        return None, None, None
    return X, y, feature_names_final
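
The correlation filter called in this step scans the upper triangle of the absolute correlation matrix. Within a correlated pair it therefore keeps the feature that comes first in column order and drops the later one. The sketch below applies that rule in pure Python to toy data; the helper names `pearson` and `drop_later_correlated`, the feature names, and the values are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_later_correlated(columns, threshold):
    """Drop a column if |r| with any EARLIER column exceeds the threshold.

    Mirrors the upper-triangle rule in filter_correlated_features: columns
    that were themselves dropped still count as earlier columns.
    """
    names = list(columns)
    to_drop = []
    for j, name in enumerate(names):
        for earlier in names[:j]:
            if abs(pearson(columns[earlier], columns[name])) > threshold:
                to_drop.append(name)
                break
    kept = [n for n in names if n not in to_drop]
    return kept, to_drop

toy = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * feat_a
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to feat_a
}
kept, dropped = drop_later_correlated(toy, threshold=0.93)
print("Kept:", kept)
print("Dropped:", dropped)
```

Because the rule depends on column order, reordering the selected features can change which member of a correlated group survives.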

In [11]:
X, y, feature_names_final = preprocess_data(
    df_selected,
    selected_feature_names_eda,
    REMOVE_OUTLIER_SAMPLE,
    OUTLIER_SAMPLE_INDEX,
    CORRELATION_THRESHOLD
)
if X is None:
     raise SystemExit("Stopping: Preprocessing error (no features left).")


--- Task 1: Preprocessing --- 
Removing outlier Sample # 17 (index 16).
Features before correlation filtering: 25
Dropping 12 highly correlated features (threshold=0.93): ['delta50', 'delta52', 'theta42', 'theta50', 'theta52', 'central_right_delta_mean', 'parietal_right_delta_mean', 'central_right_theta_mean', 'temporal_right_delta_mean', 'parietal_right_theta_mean', 'temporal_left_theta_mean', 'temporal_right_theta_mean']

--- Task 1: Data Summary (After Preprocessing for Modeling) ---
Total samples for modeling: 39
Features remaining for modeling: 13
Final features: ['delta12', 'delta24', 'delta42', 'theta12', 'theta24', 'temporal_left_delta_mean', 'frontal_left_delta_mean', 'frontal_left_frontal_right_delta_asymmetry', 'central_left_central_right_delta_asymmetry', 'temporal_left_temporal_right_delta_asymmetry', 'parietal_left_parietal_right_delta_asymmetry', 'frontal_left_frontal_right_theta_asymmetry', 'central_left_central_right_theta_asymmetry']
Target class distribution:
  Clas

## Modeling Setup

In [12]:
def setup_modeling(n_samples, n_splits_config, n_repeats_config, random_state):
    """Sets up CV strategy, scoring, pipelines, and parameter grids."""
    print("\n--- Task 1: Model Training Setup ---")
    if n_samples < n_splits_config:
         print(f"Warning: Samples ({n_samples}) < CV splits ({n_splits_config}). Adjusting splits.")
         cv_splits = max(2, n_samples)
    else: cv_splits = n_splits_config
    cv_strategy = RepeatedStratifiedKFold(n_splits=cv_splits, n_repeats=n_repeats_config, random_state=random_state)
    print(f"Using Repeated Stratified K-Fold CV ({cv_splits} splits, {n_repeats_config} repeats)")

    precision_scorer = make_scorer(precision_score, pos_label=1, zero_division=0)
    scoring = {'accuracy': 'accuracy', 'precision': precision_scorer}
    refit_metric = 'accuracy'
    print("Metrics: Accuracy, Precision (pos_label=1, zero_division=0)")
    print(f"GridSearch refit metric: {refit_metric}")

    scaler = RobustScaler()
    pipelines = {}
    param_grids = {}

    # 1. Logistic Regression
    pipelines['Logistic Regression'] = Pipeline([('scaler', scaler), ('classifier', LogisticRegression(random_state=random_state, max_iter=5000))])
    param_grids['Logistic Regression'] = {
        'classifier__penalty': ['l1', 'l2'], 'classifier__C': [0.001, 0.01, 0.1, 1, 10],
        'classifier__solver': ['liblinear'], 'classifier__class_weight': ['balanced', None]
    }
    # 2. SVM (Linear Kernel)
    pipelines['SVM (Linear Kernel)'] = Pipeline([('scaler', scaler), ('classifier', SVC(kernel='linear', probability=False, random_state=random_state, class_weight='balanced'))])
    param_grids['SVM (Linear Kernel)'] = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]}
    # 3. SVM (RBF Kernel)
    pipelines['SVM (RBF Kernel)'] = Pipeline([('scaler', scaler), ('classifier', SVC(kernel='rbf', probability=False, random_state=random_state, class_weight='balanced'))])
    param_grids['SVM (RBF Kernel)'] = {
        'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
        'classifier__gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]
    }
    # 4. SVM (Polynomial Kernel)
    pipelines['SVM (Polynomial Kernel)'] = Pipeline([('scaler', scaler), ('classifier', SVC(kernel='poly', probability=False, random_state=random_state, class_weight='balanced'))])
    param_grids['SVM (Polynomial Kernel)'] = {
        'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100], 'classifier__degree': [2, 3],
        'classifier__gamma': ['scale', 'auto', 0.01, 0.1], 'classifier__coef0': [0, 1]
    }
    # 5. k-Nearest Neighbors (kNN)
    max_k = min(n_samples // 2, 15)
    knn_neighbors = [k for k in [1, 3, 5, 7, 9, 11, 13, 15] if k < n_samples and k <= max_k]
    if not knn_neighbors: knn_neighbors = [1]
    pipelines['k-Nearest Neighbors'] = Pipeline([('scaler', scaler), ('classifier', KNeighborsClassifier())])
    param_grids['k-Nearest Neighbors'] = {
        'classifier__n_neighbors': knn_neighbors,
        'classifier__weights': ['uniform', 'distance'],
        'classifier__metric': ['euclidean', 'manhattan']
    }

    return cv_strategy, scoring, pipelines, param_grids, refit_metric

In [13]:
cv_strategy, scoring, pipelines, param_grids, refit_metric = setup_modeling(
    X.shape[0],
    N_SPLITS_CV,
    N_REPEATS_CV,
    RANDOM_STATE
)


--- Task 1: Model Training Setup ---
Using Repeated Stratified K-Fold CV (5 splits, 5 repeats)
Metrics: Accuracy, Precision (pos_label=1, zero_division=0)
GridSearch refit metric: accuracy
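
To estimate the cost of the search, it can help to count candidates per grid. The sketch below reads the number of values per hyperparameter off the grids in `setup_modeling` and multiplies by the 25 CV folds (5 splits × 5 repeats). The kNN entry assumes 39 samples, which gives 8 odd values of k from 1 to 15. The dictionary `grid_shapes` is a hand-written summary, not something the notebook produces.

```python
from math import prod

N_SPLITS, N_REPEATS = 5, 5  # N_SPLITS_CV and N_REPEATS_CV from the configuration cell

# Number of values per hyperparameter, read off the grids in setup_modeling
# (ASSUMPTION for kNN: n_samples = 39, so k ranges over the 8 odd values 1..15)
grid_shapes = {
    "Logistic Regression": [2, 5, 1, 2],      # penalty, C, solver, class_weight
    "SVM (Linear Kernel)": [6],               # C
    "SVM (RBF Kernel)": [6, 6],               # C, gamma
    "SVM (Polynomial Kernel)": [6, 2, 4, 2],  # C, degree, gamma, coef0
    "k-Nearest Neighbors": [8, 2, 2],         # n_neighbors, weights, metric
}

folds = N_SPLITS * N_REPEATS
candidates = {name: prod(shape) for name, shape in grid_shapes.items()}
for name, n_candidates in candidates.items():
    print(f"{name:<25} {n_candidates:>3} candidates -> {n_candidates * folds:>5} fits")

total_fits = sum(candidates.values()) * folds
print(f"Total cross-validation fits: {total_fits}")
```

These counts exclude the one final refit per model that `GridSearchCV` performs on the full data when `refit` is set.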


## Task 1 - Baseline Model

In [14]:
def run_baseline(X, y, cv_strategy, scoring):
    """Runs the DummyClassifier baseline."""
    print("\n--- Task 1: Baseline Performance (Predicting Most Frequent Class) ---")
    dummy_clf = DummyClassifier(strategy='most_frequent')
    baseline_results = {}
    try:
        baseline_scores = cross_validate(dummy_clf, X, y, cv=cv_strategy, scoring=scoring)
        baseline_results = {metric: (np.mean(scores), np.std(scores)) for metric, scores in baseline_scores.items() if metric.startswith('test_')}
        print(f"Baseline Mean Accuracy:  {baseline_results['test_accuracy'][0]:.4f} (+/- {baseline_results['test_accuracy'][1]:.4f})")
        print(f"Baseline Mean Precision: {baseline_results['test_precision'][0]:.4f} (+/- {baseline_results['test_precision'][1]:.4f})")
    except Exception as e:
        print(f"Could not compute baseline scores: {e}")
        baseline_results['error'] = str(e)
    return baseline_results

In [15]:
baseline_results = run_baseline(X, y, cv_strategy, scoring)


--- Task 1: Baseline Performance (Predicting Most Frequent Class) ---
Baseline Mean Accuracy:  0.5143 (+/- 0.0286)
Baseline Mean Precision: 0.0000 (+/- 0.0000)
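
The baseline precision is 0.0000 because the dummy model never predicts the positive class (label 1), and `zero_division=0` maps the undefined precision to 0. The accuracy can be checked by hand. The sketch below assumes a 20/19 class split among the 39 samples and a particular stratified fold allocation; the output above shows neither, so treat this as a consistency check rather than a statement about the data.

```python
from statistics import mean, pstdev

# ASSUMPTION: 39 samples split 20 (majority class) / 19 (minority class).
# ASSUMPTION: stratified 5-fold test sets receive the class counts below.
majority_per_fold = [4, 4, 4, 4, 4]
minority_per_fold = [4, 4, 4, 4, 3]

# A most-frequent-class predictor is correct exactly on the majority samples
fold_accuracy = [
    maj / (maj + mino) for maj, mino in zip(majority_per_fold, minority_per_fold)
]

baseline_mean = mean(fold_accuracy)
baseline_std = pstdev(fold_accuracy)  # population std, matching np.std's default
print(f"Mean accuracy: {baseline_mean:.4f} (+/- {baseline_std:.4f})")
```

Under these assumptions every repeat has the same per-fold class counts, so repeating the 5-fold split five times leaves the mean and standard deviation unchanged.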


## Task 1 - Train & Evaluate Models

In [16]:
def train_and_evaluate_models(X, y, pipelines, param_grids, cv_strategy, scoring, refit_metric):
    """Runs GridSearchCV for all defined models."""
    print(f"\n--- Task 1: Training and Evaluating Models via GridSearchCV --- ")
    results = {}
    best_model_name = None
    best_accuracy = -1

    for name in pipelines:
        print(f"\nTuning {name}...")
        pipeline = pipelines[name]
        param_grid = param_grids[name]
        grid_search = GridSearchCV(
            estimator=pipeline, param_grid=param_grid, scoring=scoring,
            refit=refit_metric, cv=cv_strategy, n_jobs=-1,
            verbose=0, error_score='raise'
        )
        try:
            grid_search.fit(X, y)
            results[name] = {
                'best_params': grid_search.best_params_,
                'best_estimator': grid_search.best_estimator_
            }
            best_index = grid_search.best_index_
            for metric in scoring.keys():
                mean_score = grid_search.cv_results_[f'mean_test_{metric}'][best_index]
                std_score = grid_search.cv_results_[f'std_test_{metric}'][best_index]
                results[name][f'mean_test_{metric}'] = mean_score
                results[name][f'std_test_{metric}'] = std_score

            acc_mean = results[name]['mean_test_accuracy']
            acc_std = results[name]['std_test_accuracy']
            prec_mean = results[name]['mean_test_precision']
            prec_std = results[name]['std_test_precision']
            print(f"  Best Mean Accuracy:  {acc_mean:.4f} (+/- {acc_std:.4f})")
            print(f"  Best Mean Precision: {prec_mean:.4f} (+/- {prec_std:.4f})")
            print(f"  Best Parameters: {results[name]['best_params']}")

            if acc_mean > best_accuracy:
                best_accuracy = acc_mean
                best_model_name = name
        except Exception as e:
             print(f"  Error during GridSearchCV for {name}: {e}")
             results[name] = {'error': str(e)}

    print("\nTask 1 GridSearchCV Complete.")
    if best_model_name:
        print(f"Overall best model (based on mean accuracy): {best_model_name} ({best_accuracy:.4f})")
    else: print("No models completed successfully.")
    return results, best_model_name, best_accuracy

In [17]:
model_results, best_model_name, best_accuracy = train_and_evaluate_models(
    X, y, pipelines, param_grids, cv_strategy, scoring, refit_metric
)


--- Task 1: Training and Evaluating Models via GridSearchCV --- 

Tuning Logistic Regression...
  Best Mean Accuracy:  0.5143 (+/- 0.0286)
  Best Mean Precision: 0.0000 (+/- 0.0000)
  Best Parameters: {'classifier__C': 0.001, 'classifier__class_weight': 'balanced', 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}

Tuning SVM (Linear Kernel)...
  Best Mean Accuracy:  0.5214 (+/- 0.0825)
  Best Mean Precision: 0.5067 (+/- 0.0557)
  Best Parameters: {'classifier__C': 0.01}

Tuning SVM (RBF Kernel)...
  Best Mean Accuracy:  0.5329 (+/- 0.0703)
  Best Mean Precision: 0.5103 (+/- 0.0462)
  Best Parameters: {'classifier__C': 1, 'classifier__gamma': 0.001}

Tuning SVM (Polynomial Kernel)...
  Best Mean Accuracy:  0.6129 (+/- 0.1477)
  Best Mean Precision: 0.6600 (+/- 0.2066)
  Best Parameters: {'classifier__C': 100, 'classifier__coef0': 0, 'classifier__degree': 2, 'classifier__gamma': 0.1}

Tuning k-Nearest Neighbors...
  Best Mean Accuracy:  0.5071 (+/- 0.1005)
  Best Mean Pre

## Task 2 (Part A) - Analyze Original Features

In [18]:
def run_task2_original_features(df_original, k_top_features, random_state):
    """
    Performs Task 2 (UFS, RFE, PCA) on the original features of the dataset.
    Returns a dictionary containing the top features identified by each method.
    """
    print("\n--- Task 2 (Part A): Analysis on Original 320 Features ---")
    results_task2a = defaultdict(list)

    # --- Prepare Data ---
    if 'target' not in df_original.columns:
        print("Error: 'target' column not found.")
        results_task2a['error'] = "Target column missing"
        return dict(results_task2a)  # Plain dict, consistent with the normal return path

    y_original = df_original['target'].values
    X_original_df = df_original.drop('target', axis=1)
    X_original = X_original_df.values
    feature_names_original = X_original_df.columns.tolist()
    n_samples, n_features = X_original.shape

    print(f"Analyzing {n_features} original features for {n_samples} samples.")

    # Handle NaNs/Infs
    if np.isnan(X_original).any() or np.isinf(X_original).any():
        print("Handling NaN/Inf values in original features (NaN -> 0, +/-Inf -> float64 limits).")
        X_original = np.nan_to_num(X_original, nan=0.0, posinf=np.finfo(np.float64).max, neginf=np.finfo(np.float64).min)

    # --- 2a. UFS ---
    print("\n--- 2a. UFS (f_classif) on Original Features ---")
    try:
        scaler_ufs_orig = RobustScaler()
        X_scaled_ufs_orig = scaler_ufs_orig.fit_transform(X_original)
        k_ufs = min(k_top_features, n_features)
        ufs_selector_orig = SelectKBest(score_func=f_classif, k=k_ufs)
        ufs_selector_orig.fit(X_scaled_ufs_orig, y_original)
        ufs_scores_all_orig = np.nan_to_num(ufs_selector_orig.scores_, nan=-np.inf)
        ufs_sorted_indices_orig = np.argsort(ufs_scores_all_orig)[::-1]
        ufs_top_indices_orig = ufs_sorted_indices_orig[:k_ufs]
        ufs_features_orig = [feature_names_original[i] for i in ufs_top_indices_orig]
        ufs_scores_orig = ufs_scores_all_orig[ufs_top_indices_orig]
        print(f"Top {k_ufs} original features (and scores) via UFS (f_classif):")
        for feature, score in zip(ufs_features_orig, ufs_scores_orig):
            print(f"  - {feature}: {score:.4f}")
        results_task2a['ufs_features'] = ufs_features_orig
    except Exception as e:
        print(f"Error during UFS analysis on original features: {e}")
        results_task2a['ufs_features'] = ["Error in UFS"]

    # --- 2b. RFE ---
    print("\n--- 2b. RFE on Original Features ---")
    try:
        scaler_rfe_orig = RobustScaler()
        X_scaled_rfe_orig = scaler_rfe_orig.fit_transform(X_original)
        k_rfe = min(k_top_features, n_features)
        estimator_rfe_orig = LogisticRegression(solver='liblinear', C=1.0, class_weight='balanced', random_state=random_state, max_iter=1000)
        rfe_estimator_name = "LogisticRegression(C=1.0)"
        print(f"Using {rfe_estimator_name} for RFE (may take time)...")
        rfe_selector_orig = RFE(estimator=estimator_rfe_orig, n_features_to_select=k_rfe, step=0.1)
        rfe_selector_orig.fit(X_scaled_rfe_orig, y_original)
        rfe_indices_orig = rfe_selector_orig.get_support(indices=True)
        rfe_features_orig = [feature_names_original[i] for i in rfe_indices_orig]
        print(f"Top {k_rfe} original features via RFE (using {rfe_estimator_name}):")
        for feature in rfe_features_orig:
            print(f"  - {feature}")
        results_task2a['rfe_features'] = rfe_features_orig
    except Exception as e:
        print(f"Error during RFE analysis on original features: {e}")
        results_task2a['rfe_features'] = ["Error in RFE"]

    # --- 2c. PCA ---
    print("\n--- 2c. PCA on Original Features ---")
    try:
        scaler_pca_orig = StandardScaler()
        X_scaled_pca_orig = scaler_pca_orig.fit_transform(X_original)
        n_components_pca = min(n_features, n_samples)
        pca_orig = PCA(n_components=n_components_pca, random_state=random_state)
        pca_orig.fit(X_scaled_pca_orig)
        print("Explained variance ratio by top principal components:")
        cumulative_variance = 0
        max_pcs_to_print = min(10, n_components_pca)
        for i, ratio in enumerate(pca_orig.explained_variance_ratio_):
            if i >= max_pcs_to_print:
                # Summarize the components that are not printed individually
                print(f"      ... (remaining {n_components_pca - i} components explain {(1-cumulative_variance):.4f} variance)")
                break
            cumulative_variance += ratio
            print(f"  - PC{i+1}: {ratio:.4f} (Cumulative: {cumulative_variance:.4f})")
        k_pca_loadings = min(k_top_features, n_features)
        print(f"\nTop {k_pca_loadings} original features based on absolute loading on PC1:")
        pc1_loadings_orig = pca_orig.components_[0]
        feature_loadings_pc1_orig = sorted(zip(feature_names_original, np.abs(pc1_loadings_orig)), key=lambda x: x[1], reverse=True)
        pca_features_pc1 = []
        for feature, loading in feature_loadings_pc1_orig[:k_pca_loadings]:
            print(f"  - {feature}: {loading:.4f}")
            pca_features_pc1.append(feature)
        results_task2a['pca_features_pc1'] = pca_features_pc1

        if n_components_pca > 1:
            print(f"\nTop {k_pca_loadings} original features based on absolute loading on PC2:")
            pc2_loadings_orig = pca_orig.components_[1]
            feature_loadings_pc2_orig = sorted(zip(feature_names_original, np.abs(pc2_loadings_orig)), key=lambda x: x[1], reverse=True)
            pca_features_pc2 = []
            for feature, loading in feature_loadings_pc2_orig[:k_pca_loadings]:
                print(f"  - {feature}: {loading:.4f}")
                pca_features_pc2.append(feature)
            results_task2a['pca_features_pc2'] = pca_features_pc2
    except Exception as e:
        print(f"Error during PCA analysis on original features: {e}")
        results_task2a['pca_features_pc1'] = ["Error in PCA"]
        results_task2a['pca_features_pc2'] = ["Error in PCA"]

    print("\n--- End of Task 2 (Part A) Analysis on Original Features ---")
    return dict(results_task2a)
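
The UFS step scores each feature with `f_classif`, a one-way ANOVA F-statistic: the variance between class means divided by the variance within classes. The sketch below computes it in pure Python for a single feature and two classes; the helper name `anova_f` and the toy values are illustrative assumptions.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature split into class groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-class and within-class sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy feature values for two classes (illustrative only)
class_0 = [1.0, 2.0, 3.0]
class_1 = [2.0, 3.0, 4.0]
print(f"F = {anova_f([class_0, class_1]):.4f}")
```

An F value close to 1 means the between-class variance is about the same size as the within-class variance, so the feature separates the classes only weakly.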

In [19]:
# Run Task 2 (Part A) using the initial raw dataframe
analysis_results_task2a = run_task2_original_features(df_raw.copy(), K_TOP_FEATURES, RANDOM_STATE)


--- Task 2 (Part A): Analysis on Original 320 Features ---
Analyzing 320 original features for 40 samples.

--- 2a. UFS (f_classif) on Original Features ---
Top 5 original features (and scores) via UFS (f_classif):
  - delta41: 1.8424
  - delta51: 1.8316
  - beta23: 1.7772
  - theta23: 1.6022
  - delta23: 1.4612

--- 2b. RFE on Original Features ---
Using LogisticRegression(C=1.0) for RFE (may take time)...
Top 5 original features via RFE (using LogisticRegression(C=1.0)):
  - delta40
  - delta58
  - delta60
  - theta55
  - gamma63

--- 2c. PCA on Original Features ---
Explained variance ratio by top principal components:
  - PC1: 0.3477 (Cumulative: 0.3477)
  - PC2: 0.1888 (Cumulative: 0.5365)
  - PC3: 0.1067 (Cumulative: 0.6432)
  - PC4: 0.0939 (Cumulative: 0.7371)
  - PC5: 0.0632 (Cumulative: 0.8003)
  - PC6: 0.0473 (Cumulative: 0.8476)
  - PC7: 0.0299 (Cumulative: 0.8775)
  - PC8: 0.0205 (Cumulative: 0.8980)
  - PC9: 0.0189 (Cumulative: 0.9169)
  - PC10: 0.0176 (Cumulative: 0.9345)
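
The printed ratios can be turned into a quick estimate of how many components are needed to reach a given share of variance. This is a minimal sketch that uses only the ten rounded values shown above; the helper name `components_needed` is hypothetical, and targets above the last printed cumulative value cannot be resolved from this output.

```python
from itertools import accumulate

# Explained variance ratios of PC1..PC10 as printed above (rounded to 4 decimals)
ratios = [0.3477, 0.1888, 0.1067, 0.0939, 0.0632, 0.0473, 0.0299, 0.0205, 0.0189, 0.0176]
cumulative = list(accumulate(ratios))

def components_needed(target):
    """Smallest number of leading PCs whose cumulative ratio reaches target, else None."""
    for n_components, total in enumerate(cumulative, start=1):
        if total >= target:
            return n_components
    return None  # Target not reached within the printed components

for target in (0.80, 0.90):
    print(f"{target:.0%} of variance: {components_needed(target)} components")
```

With 320 features and only 40 samples, the fact that a handful of components carry most of the variance suggests that many of the original features are strongly correlated with each other.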

## Task 2 (Part B) - Analyze Task 1 Features

In [20]:
def analyze_task2_task1_features(X, y, feature_names, k_top_features, model_results, random_state):
    """Performs UFS, RFE, and PCA on the final features used in Task 1 models."""
    print("\n--- Task 2 (Part B): Feature Selection Analysis on *Features Used in Task 1 Models* ---")
    analysis_results = defaultdict(list)
    n_samples, n_features = X.shape

    if n_features <= 0:
        print("Skipping Task 2 (Part B) as no features were available for Task 1 modeling.")
        analysis_results['error'] = "No features to analyze"
        return dict(analysis_results)

    k_analyze = min(k_top_features, n_features)
    print(f"Analyzing top {k_analyze} features using UFS, RFE, and PCA on the {n_features} features used in Task 1.")
    print(f"Features analyzed: {feature_names}")

    # --- 9a. UFS ---
    print("\n--- 9a. UFS (f_classif) on Task 1 Features ---")
    try:
        scaler_ufs = RobustScaler()
        X_scaled_ufs = scaler_ufs.fit_transform(X)
        ufs_selector = SelectKBest(score_func=f_classif, k=k_analyze)
        ufs_selector.fit(X_scaled_ufs, y)
        ufs_scores_all = np.nan_to_num(ufs_selector.scores_, nan=-np.inf)
        ufs_sorted_indices = np.argsort(ufs_scores_all)[::-1]
        ufs_top_indices = ufs_sorted_indices[:k_analyze]
        ufs_features = [feature_names[i] for i in ufs_top_indices]
        ufs_scores = ufs_scores_all[ufs_top_indices]
        print(f"Top {k_analyze} Task 1 features (and scores) via UFS (f_classif):")
        for feature, score in zip(ufs_features, ufs_scores):
            print(f"  - {feature}: {score:.4f}")
        analysis_results['ufs_features'] = ufs_features
    except Exception as e:
        print(f"Error during UFS analysis: {e}")
        analysis_results['ufs_features'] = ["Error in UFS"]

    # --- 9b. RFE ---
    print("\n--- 9b. RFE on Task 1 Features ---")
    estimator_rfe = None
    rfe_name = "Default SVC(kernel='linear', C=1.0)"
    # Try to use best Linear SVM from Task 1
    if 'SVM (Linear Kernel)' in model_results and 'error' not in model_results['SVM (Linear Kernel)']:
        try:
            # Strip only the leading 'classifier__' prefix so nested parameter names survive intact
            params = {k.split('__', 1)[1]: v for k, v in model_results['SVM (Linear Kernel)']['best_params'].items() if k.startswith('classifier__')}
            params.setdefault('class_weight', 'balanced')
            params.setdefault('probability', False)
            params.setdefault('random_state', random_state)
            estimator_rfe = SVC(kernel='linear', **params)
            rfe_name = f"Best Task 1 SVC(kernel='linear', C={params.get('C', '?')})"
            print(f"Using {rfe_name} for RFE.")
        except Exception as e:
            print(f"Could not use best Linear SVM params, using default. Error: {e}")
            estimator_rfe = SVC(kernel='linear', C=1.0, probability=False, random_state=random_state, class_weight='balanced')
    else:
        print(f"Linear SVM results not found/had errors. Using {rfe_name} for RFE.")
        estimator_rfe = SVC(kernel='linear', C=1.0, probability=False, random_state=random_state, class_weight='balanced')

    if estimator_rfe is not None:
        try:
            scaler_rfe = RobustScaler()
            X_scaled_rfe = scaler_rfe.fit_transform(X)
            rfe_selector = RFE(estimator=estimator_rfe, n_features_to_select=k_analyze, step=1)
            rfe_selector.fit(X_scaled_rfe, y)
            rfe_indices = rfe_selector.get_support(indices=True)
            rfe_features = [feature_names[i] for i in rfe_indices]
            print(f"Top {k_analyze} Task 1 features via RFE (using {rfe_name}):")
            for feature in rfe_features:
                print(f"  - {feature}")
            analysis_results['rfe_features'] = rfe_features
        except Exception as e:
            print(f"Error during RFE analysis: {e}")
            analysis_results['rfe_features'] = ["Error in RFE"]
    else:
        analysis_results['rfe_features'] = ["Error configuring RFE estimator"]

    # --- 9c. PCA ---
    print("\n--- 9c. PCA on Task 1 Features ---")
    try:
        scaler_pca = StandardScaler()
        X_scaled_pca = scaler_pca.fit_transform(X)
        n_components_pca = min(n_features, n_samples)
        pca = PCA(n_components=n_components_pca, random_state=random_state)
        pca.fit(X_scaled_pca)
        print("Explained variance ratio by principal components (Task 1 features):")
        cumulative_variance_task1 = 0
        for i, ratio in enumerate(pca.explained_variance_ratio_):
            cumulative_variance_task1 += ratio
            print(f"  - PC{i+1}: {ratio:.4f} (Cumulative: {cumulative_variance_task1:.4f})")
            # Stop printing details early if variance is mostly explained
            if i + 1 >= k_analyze and cumulative_variance_task1 > 0.95:
                if i + 1 < n_components_pca:
                    print(f"      ... (remaining {n_components_pca - (i + 1)} components)")
                break
        print(f"\nTop {k_analyze} Task 1 features based on absolute loading on PC1:")
        pc1_loadings = pca.components_[0]
        feature_loadings_pc1 = sorted(zip(feature_names, np.abs(pc1_loadings)), key=lambda x: x[1], reverse=True)
        pca_features_pc1 = []
        for feature, loading in feature_loadings_pc1[:k_analyze]:
            print(f"  - {feature}: {loading:.4f}")
            pca_features_pc1.append(feature)
        analysis_results['pca_features_pc1'] = pca_features_pc1

        if n_components_pca > 1:
            print(f"\nTop {k_analyze} Task 1 features based on absolute loading on PC2:")
            pc2_loadings = pca.components_[1]
            feature_loadings_pc2 = sorted(zip(feature_names, np.abs(pc2_loadings)), key=lambda x: x[1], reverse=True)
            pca_features_pc2 = []
            for feature, loading in feature_loadings_pc2[:k_analyze]:
                print(f"  - {feature}: {loading:.4f}")
                pca_features_pc2.append(feature)
            analysis_results['pca_features_pc2'] = pca_features_pc2
    except Exception as e:
        print(f"Error during PCA analysis: {e}")
        analysis_results['pca_features_pc1'] = ["Error in PCA"]
        analysis_results['pca_features_pc2'] = ["Error in PCA"]

    print("\n--- End of Task 2 (Part B) Analysis on Task 1 Features ---")
    return dict(analysis_results)
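
Note that `get_support(indices=True)` returns the surviving column indices in ascending order, so the "Top k via RFE" list printed by the function above is in column order, not in order of importance. If an ordering is wanted, RFE's `ranking_` attribute records when each discarded feature was eliminated. The snippet below is a sketch on synthetic data from `make_classification` (an assumption standing in for the Task 1 matrix; the `feat_*` names are hypothetical).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the Task 1 matrix (40 x 13); NOT the EEG data.
X_demo, y_demo = make_classification(
    n_samples=40, n_features=13, n_informative=4, n_redundant=0, random_state=0
)
demo_names = [f"feat_{i}" for i in range(X_demo.shape[1])]

rfe_demo = RFE(estimator=SVC(kernel="linear", C=1.0), n_features_to_select=5, step=1)
rfe_demo.fit(X_demo, y_demo)

# All retained features share rank 1, so RFE does not order them among themselves.
selected = [name for name, keep in zip(demo_names, rfe_demo.support_) if keep]

# Discarded features get ranks 2, 3, ...; the largest rank was eliminated first.
elimination_order = sorted(
    zip(demo_names, rfe_demo.ranking_), key=lambda pair: pair[1], reverse=True
)

print("Selected (unordered):", selected)
for name, rank in elimination_order:
    print(f"  rank {rank}: {name}")
```

To order the retained features themselves, one option is to sort them by the absolute coefficients of the final fitted estimator (`rfe_demo.estimator_.coef_`), which is available for linear kernels.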

In [21]:
# Run Task 2 (Part B) using the final features from Task 1
analysis_results_task2b = analyze_task2_task1_features(
    X, y, feature_names_final, K_TOP_FEATURES, model_results, RANDOM_STATE
)


--- Task 2 (Part B): Feature Selection Analysis on *Features Used in Task 1 Models* ---
Analyzing top 5 features using UFS, RFE, and PCA on the 13 features used in Task 1.
Features analyzed: ['delta12', 'delta24', 'delta42', 'theta12', 'theta24', 'temporal_left_delta_mean', 'frontal_left_delta_mean', 'frontal_left_frontal_right_delta_asymmetry', 'central_left_central_right_delta_asymmetry', 'temporal_left_temporal_right_delta_asymmetry', 'parietal_left_parietal_right_delta_asymmetry', 'frontal_left_frontal_right_theta_asymmetry', 'central_left_central_right_theta_asymmetry']

--- 9a. UFS (f_classif) on Task 1 Features ---
Top 5 Task 1 features (and scores) via UFS (f_classif):
  - temporal_left_delta_mean: 1.3219
  - central_left_central_right_delta_asymmetry: 1.2246
  - frontal_left_delta_mean: 1.1595
  - delta42: 0.9475
  - delta24: 0.6985

--- 9b. RFE on Task 1 Features ---
Using Best Task 1 SVC(kernel='linear', C=0.01) for RFE.
Top 5 Task 1 features via RFE (using Best Task 1 SVC(
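
The UFS scores above are one-way ANOVA F-statistics, which are easier to judge as p-values. For two classes, the statistic follows an F distribution with (1, n - 2) degrees of freedom under the null hypothesis of equal class means. The sketch below converts a printed score with SciPy; `f_to_p` is a hypothetical helper, and `n_samples=40` is an assumption taken from the Part A output (the Task 1 matrix may hold 39 rows if the outlier sample was removed).

```python
from scipy.stats import f as f_dist


def f_to_p(f_value, n_samples, n_classes=2):
    """Upper-tail p-value of a one-way ANOVA F-statistic."""
    dfn = n_classes - 1          # between-group degrees of freedom
    dfd = n_samples - n_classes  # within-group degrees of freedom
    return float(f_dist.sf(f_value, dfn, dfd))


# Highest and lowest of the five UFS scores printed above; n_samples is an assumption.
p_top = f_to_p(1.3219, n_samples=40)
p_fifth = f_to_p(0.6985, n_samples=40)
print(f"p(top feature)   = {p_top:.3f}")
print(f"p(fifth feature) = {p_fifth:.3f}")
```

The same values are available without SciPy from the fitted selector's `pvalues_` attribute. F-statistics this close to 1 indicate that no single Task 1 feature separates the two classes strongly on its own.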

## Summarize Results & Reasoning

In [22]:
def summarize_results(model_results, best_model, best_acc, task2a_results, task2b_results):
    """Prints the final summary of model performance and feature analysis."""
    print("\n--- Final Summary & Reasoning --- ")
    print("Task 1 Model Performance Summary:")
    results_list = []
    for name, data in model_results.items():
        if 'error' not in data:
            row = {
                'Model': name,
                'Mean Accuracy': data.get('mean_test_accuracy', np.nan),
                'Std Accuracy': data.get('std_test_accuracy', np.nan),
                'Mean Precision': data.get('mean_test_precision', np.nan),
                'Std Precision': data.get('std_test_precision', np.nan),
                'Best Params': str(data.get('best_params', {}))
            }
            results_list.append(row)
        else: print(f"Model '{name}' failed: {data['error']}")
    if results_list:
        results_df = pd.DataFrame(results_list).sort_values('Mean Accuracy', ascending=False)
        float_cols = ['Mean Accuracy', 'Std Accuracy', 'Mean Precision', 'Std Precision']
        for col in float_cols: results_df[col] = results_df[col].map('{:.4f}'.format)
        try:
            print(results_df.to_markdown(index=False, numalign="left", stralign="left"))
        except ImportError:
            print(results_df)
    else: print("No successful model results to display.")

    print("\nReasoning for Performance Differences (Example):")
    if best_model: print(f"- Best model: '{best_model}' (Accuracy: {best_acc:.4f}).") # Add more reasoning here based on model type
    else: print("- No model completed successfully.")

    print("\nTask 2 Feature Selection Summary:")
    print("  Part A (Original Features):")
    if 'error' not in task2a_results:
        print(f"  - UFS Top Features: {task2a_results.get('ufs_features', 'N/A')}")
        print(f"  - RFE Top Features: {task2a_results.get('rfe_features', 'N/A')}")
        print(f"  - PCA Top Features (PC1): {task2a_results.get('pca_features_pc1', 'N/A')}")
        if 'pca_features_pc2' in task2a_results:
             print(f"  - PCA Top Features (PC2): {task2a_results.get('pca_features_pc2', 'N/A')}")
    else: print(f"  - Task 2(A) analysis failed: {task2a_results['error']}")

    print("\n  Part B (Task 1 Features):")
    if 'error' not in task2b_results:
        print(f"  - UFS Top Features: {task2b_results.get('ufs_features', 'N/A')}")
        print(f"  - RFE Top Features: {task2b_results.get('rfe_features', 'N/A')}")
        print(f"  - PCA Top Features (PC1): {task2b_results.get('pca_features_pc1', 'N/A')}")
        if 'pca_features_pc2' in task2b_results:
             print(f"  - PCA Top Features (PC2): {task2b_results.get('pca_features_pc2', 'N/A')}")
    else: print(f"  - Task 2(B) analysis failed: {task2b_results['error']}")

    print("\nReasoning for Feature Selection Differences (Example):")
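    print("- UFS scores each feature on its own (ANOVA F-test), RFE ranks features by their weight in a fitted model, "
          "and PCA ranks them by variance captured without using the labels, so the three lists need not agree.")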

    print("\n--- Complete --- ")
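
When reading the precision column of the summary table, a mean precision of exactly 0.0000 with zero spread, paired with near-chance accuracy, is what a classifier produces when it never predicts the positive class in any fold. Whether that is what happened here is a hypothesis, since the notebook does not print per-fold predictions. The sketch below reproduces the effect on made-up labels (the `y_true` and `y_pred` lists are illustrative, not taken from the dataset).

```python
import warnings

from sklearn.metrics import accuracy_score, precision_score

# Made-up fold: the classifier predicts the negative class for every sample.
y_true = [0, 1, 0, 1, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)  # 4 of 7 correct

# With no positive predictions, precision is 0/0. The default zero_division="warn"
# reports 0.0 and emits an UndefinedMetricWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    prec_default = precision_score(y_true, y_pred)

# Passing zero_division=0 gives the same value without the warning.
prec_explicit = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.4f}, precision={prec_explicit:.4f}")
```

Inspecting a confusion matrix for the affected model is a direct way to confirm or rule out this explanation.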

In [23]:
# Pass results from Task 1 modeling and both parts of Task 2
summarize_results(
    model_results,
    best_model_name,
    best_accuracy,
    analysis_results_task2a,
    analysis_results_task2b
)


--- Final Summary & Reasoning --- 
Task 1 Model Performance Summary:
                     Model Mean Accuracy Std Accuracy Mean Precision  \
3  SVM (Polynomial Kernel)        0.6129       0.1477         0.6600   
2         SVM (RBF Kernel)        0.5329       0.0703         0.5103   
1      SVM (Linear Kernel)        0.5214       0.0825         0.5067   
0      Logistic Regression        0.5143       0.0286         0.0000   
4      k-Nearest Neighbors        0.5071       0.1005         0.5118   

  Std Precision                                        Best Params  
3        0.2066  {'classifier__C': 100, 'classifier__coef0': 0,...  
2        0.0462   {'classifier__C': 1, 'classifier__gamma': 0.001}  
1        0.0557                            {'classifier__C': 0.01}  
0        0.0000  {'classifier__C': 0.001, 'classifier__class_we...  
4        0.1409  {'classifier__metric': 'manhattan', 'classifie...  

Reasoning for Performance Differences (Example):
- Best model: 'SVM (Polynomial Ke