In [1]:
#!pip install mne scipy

In [None]:
!pip install numpy==1.26.0

In [None]:
# !pip install pandas numpy openpyxl

In [3]:
#!pip install tsfresh

In [4]:
#!pip install PyWavelets

In [1]:
import pandas as pd

# Read time_series_df from CSV
time_series_df = pd.read_csv('time_series_df.csv')

# Read labels from CSV
labels = pd.read_csv('labels.csv').squeeze('columns')  # squeeze('columns') turns a single-column DataFrame into a Series (the squeeze= keyword was removed in pandas 2.0)

# Optionally, read labels from Pickle (preserves Python object types)
# labels = pd.read_pickle('labels.pkl')

In [2]:
labels.index = time_series_df['id'].unique()

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters, ComprehensiveFCParameters

# Extract features using tsfresh with minimal settings
extracted_features = extract_features(time_series_df, column_id='id', column_sort='time',
                                      default_fc_parameters=MinimalFCParameters())

# Drop any columns with NaN or infinite values
extracted_features_clean = extracted_features.replace([np.inf, -np.inf], np.nan).dropna(axis=1)

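# Align labels to the feature rows by id; train_test_split pairs rows by position, not by index
labels = labels.loc[extracted_features_clean.index]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(extracted_features_clean, labels, test_size=0.3, random_state=42)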

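# Select the most important features using ANOVA F-test.
# k must not exceed the number of columns; if the feature table has only about ten columns, k=10 keeps them all.
selector = SelectKBest(f_classif, k=10)  # Adjust 'k' to select the top k important features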
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Train a Random Forest Classifier to identify the most important features
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Define parameter grid for GridSearchCV (for Random Forest)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

# Perform grid search to find the best parameters for Random Forest
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train)

# Print the best parameters found by GridSearchCV
print(f"Best parameters: {grid_search.best_params_}")

# Use the best estimator from GridSearchCV to predict and evaluate the model
best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test_selected)

# Evaluate the model
print(classification_report(y_test, y_pred))

# Identify and display the top selected features with importance
selected_feature_names = extracted_features_clean.columns[selector.get_support()]
important_features = pd.DataFrame({
    'Feature': selected_feature_names,
    'Importance': best_clf.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(important_features)


In [11]:
def process_in_chunks(time_series_df, N):
    # Get unique trial IDs
    unique_ids = time_series_df['id'].unique()
    
    # Split the unique IDs into chunks of size N
    chunks = [unique_ids[i:i + N] for i in range(0, len(unique_ids), N)]
    
    # Initialize an empty list to store the results
    results = []
    
    # Process each chunk
    for chunk in chunks:
        # Filter the DataFrame to include only the trials in the current chunk
        chunk_df = time_series_df[time_series_df['id'].isin(chunk)]
        
        # Extract features for the current chunk
        extracted_features = extract_features(
            chunk_df, column_id='id', column_sort='time',
            default_fc_parameters=ComprehensiveFCParameters())

        # Append the extracted features to the results list
        results.append(extracted_features)

    # Concatenate all the results into a single DataFrame
    final_result = pd.concat(results)
    
    return final_result

In [None]:
fr = process_in_chunks(time_series_df, N=1)
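
The slicing idiom that `process_in_chunks` uses to group trial IDs can be checked on its own before running the expensive feature extraction. The sketch below is a minimal illustration: `split_into_chunks` and the trial IDs are hypothetical and are not part of the notebook.

```python
def split_into_chunks(ids, n):
    """Split a sequence of ids into consecutive chunks of at most n items."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Step through the sequence n items at a time; the last chunk may be shorter
    return [ids[i:i + n] for i in range(0, len(ids), n)]


# Hypothetical trial ids, for illustration only
trial_ids = [101, 102, 103, 104, 105]
chunks = split_into_chunks(trial_ids, 2)
print(chunks)  # [[101, 102], [103, 104], [105]]
```

With `N=1`, as in the call above, every trial becomes its own chunk, so `extract_features` is invoked once per trial.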

1. Classifier Performance:
The Random Forest model's classification performance is summarized in the precision, recall, f1-score, and support columns for both classes:

- Class 0 (Control): The model predicted this class with a precision of 0.12, a recall of 0.17, and an F1-score of 0.14.
- Class 1 (Amblyopia): The model predicted this class with a precision of 0.38, a recall of 0.30, and an F1-score of 0.33.
- Overall Accuracy: The model's overall accuracy is 0.25 (25%), which is below the 50% expected from random guessing on a two-class problem, so the model failed to distinguish between Amblyopia and Control participants.

Key Metrics:

Precision: The fraction of instances predicted as a class that truly belong to that class.
- For class 0 (Control), only 12% of the instances predicted as Control were correct.
- For class 1 (Amblyopia), 38% of the instances predicted as Amblyopia were correct.

Recall: The fraction of instances that truly belong to a class and were predicted as that class.
- For class 0, 17% of the actual Control instances were correctly identified.
- For class 1, 30% of the actual Amblyopia instances were correctly identified.

F1-Score: The harmonic mean of precision and recall, which summarizes the balance between the two metrics. The F1-scores are low for both classes, especially for Control (0.14 versus 0.33 for Amblyopia).
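
To make the three definitions concrete, the sketch below computes them from raw counts. The counts are an assumption: they are one confusion matrix (16 test trials, 6 Control and 10 Amblyopia) chosen because it matches the rounded values reported above. They were not taken from the notebook output.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# ASSUMED counts, inferred to be consistent with the rounded report:
# Control:   1 of 6 trials correct, 7 Amblyopia trials wrongly predicted as Control
# Amblyopia: 3 of 10 trials correct, 5 Control trials wrongly predicted as Amblyopia
control = precision_recall_f1(tp=1, fp=7, fn=5)
amblyopia = precision_recall_f1(tp=3, fp=5, fn=7)
accuracy = (1 + 3) / 16  # correct predictions divided by all test trials

print("Control:   P=%.3f R=%.3f F1=%.3f" % control)
print("Amblyopia: P=%.3f R=%.3f F1=%.3f" % amblyopia)
print("Accuracy:  %.2f" % accuracy)
```

Under these assumed counts the model mislabels most trials of both classes, which is what an accuracy below 50% implies.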

This low performance (25% accuracy) suggests that the model could not differentiate between the two groups based on the extracted features. Possible causes include uninformative or insufficient features, a small dataset, or an imbalance between the groups.

2. Feature Importance:
This table ranks the extracted features by their importance, as determined by the Random Forest classifier:

   Feature                     Importance
8  value__absolute_maximum       0.214460
7  value__maximum                0.168308
9  value__minimum                0.159271
4  value__standard_deviation     0.146875
6  value__root_mean_square       0.134352
0  value__sum_values             0.128472
1  value__median                 0.048262
2  value__mean                   0.000000
3  value__length                 0.000000
5  value__variance               0.000000

Most Important Features:

- value__absolute_maximum: This feature, the largest absolute value in the time series, received the highest importance score from the Random Forest (importance = 0.214460).

Observations:

- Feature Importance Distribution: The importance values are skewed. The top 6 features account for about 95% of the total importance, median contributes about 5%, and mean, length, and variance have an importance of exactly zero.
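
The skew can be quantified with a running total over the ranked importances. The sketch below uses the values from the table above; the variable names are illustrative only.

```python
# Importance values copied from the feature importance table
importances = {
    "value__absolute_maximum": 0.214460,
    "value__maximum": 0.168308,
    "value__minimum": 0.159271,
    "value__standard_deviation": 0.146875,
    "value__root_mean_square": 0.134352,
    "value__sum_values": 0.128472,
    "value__median": 0.048262,
    "value__mean": 0.000000,
    "value__length": 0.000000,
    "value__variance": 0.000000,
}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
total = sum(importances.values())

# Print each feature with its importance and the cumulative share so far
running = 0.0
for name, value in ranked:
    running += value
    print(f"{name:<28}{value:>10.6f}{running / total:>8.3f}")

# Share of total importance carried by the six highest-ranked features
top6_share = sum(value for _, value in ranked[:6]) / total
```

The importances sum to 1, as Random Forest importances are normalized, and the six highest-ranked features carry roughly 0.95 of that total.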

Interpretation:
- Low Accuracy: The classifier struggled to differentiate between the Amblyopia and Control groups, possibly because the extracted features don't capture the necessary information to distinguish between these groups, or the dataset might be too small or imbalanced.
- Feature Importance: Features like absolute_maximum, maximum, and minimum received the highest importance scores. These might represent extreme fluctuations or peak characteristics in the EEG signals, but because the classifier itself performs below chance, the ranking should be interpreted with caution.

Next Steps:
- Feature Engineering: Consider extracting additional or more complex features using tsfresh or other methods to capture more meaningful aspects of the EEG data.
- Balanced Dataset: Ensure that the dataset is balanced between Amblyopia and Control groups to avoid skewed performance metrics.
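- Cross-Validation: The grid search already uses 5-fold cross-validation for tuning. Also report cross-validated scores for the final model, rather than relying on a single 70/30 split, to get a more robust estimate of performance.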
- Other Models: Try different classifiers (e.g., Support Vector Machines or Gradient Boosting) to see if they perform better.
- Hyperparameter Tuning: The grid search already covers the number of trees, maximum depth, and minimum samples per split. Widening the grid to other Random Forest parameters could potentially improve the model's accuracy.
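
When checking class balance and comparing models, a useful reference point is the accuracy of always predicting the larger class. The sketch below is a minimal illustration; `majority_baseline` and the label list are hypothetical and do not come from the notebook's data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels (0 = Control, 1 = Amblyopia), for illustration only
example_labels = [0] * 6 + [1] * 10
print(Counter(example_labels))  # class counts reveal any imbalance

baseline_label, baseline_accuracy = majority_baseline(example_labels)
print(baseline_label, baseline_accuracy)  # 1 0.625
```

Any classifier worth keeping should beat this baseline on held-out data, so it is worth reporting alongside accuracy.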

In [1]:
import joblib

important_features = joblib.load('important_features.joblib')

# import pandas as pd
# print(pd.__version__)
# !pip install --upgrade pandas



ModuleNotFoundError: No module named 'numpy._core'