# Script content

1. **Retrieve the total matrix of standardized data**: 
   - The data matrix is called `total_list_stand`.

2. **Test with a sample of the matrix to apply XGBoost**: 
   - We will use a small sample from the standardized matrix to perform initial testing with XGBoost.

3. **Plot the feature importance based on XGBoost**: 
   - After training the XGBoost model, we will visualize the importance of each feature according to the model.

4. **Run a loop over the matrices to apply XGBoost with Cross-Validation (CV) to all extracted matrices**:
   - We will iterate over all the data matrices and apply XGBoost with 5-fold stratified cross-validation over a grid of `eta`, `gamma`, and `max_depth` values to find the best parameters.

5. **Store results in `list_results`**: 
   - For each hyperparameter combination, append the following tuple to `list_results`: 
      - `(prep, segment, channel_name, eta, gamma, max_depth, accuracy, std_error)`.

6. **Create a DataFrame to better view and sort the results**: 
   - Convert `list_results` into a DataFrame for easy viewing and sorting of the results.

7. **Train the final model with the best combination from the CV and test it by plotting the confusion matrix**:
   - Using the best hyperparameters from cross-validation, train a final XGBoost model, evaluate it on a held-out test split, and plot the confusion matrix (accuracy of about 0.8).


In [21]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import pickle
from sklearn.model_selection import train_test_split, StratifiedKFold

In [22]:
# Loading the total_list_stand from the saved file
with open('total_list_stand.pkl', 'rb') as f:
    total_list_stand = pickle.load(f)

`total_list_stand` now contains 4 lists, one per preprocessing method.

Each of these 4 lists contains 35 sublists, one per segment (chunk) that we have generated.

Each of these 35 sublists in turn contains 19 matrices of size 121x54, one per EEG channel. Each matrix is a pandas DataFrame whose last column, `Label`, holds the class label.
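
The nesting described above can be checked programmatically. The sketch below builds a tiny mock with the same three-level layout but made-up sizes; `describe_nesting` and `make_mock` are hypothetical helpers introduced here, not part of the original script.

```python
import numpy as np
import pandas as pd

def describe_nesting(nested):
    """Return (n_prep, n_segments, n_channels, matrix_shape) for a 3-level nested list."""
    first = nested[0][0][0]
    return len(nested), len(nested[0]), len(nested[0][0]), first.shape

def make_mock(n_prep=2, n_segments=3, n_channels=4, rows=5, n_features=2):
    """Build a small nested list of DataFrames with a trailing 'Label' column."""
    rng = np.random.default_rng(0)

    def one_matrix():
        df = pd.DataFrame(
            rng.normal(size=(rows, n_features)),
            columns=[f"feat_{i}" for i in range(n_features)],
        )
        df["Label"] = rng.integers(0, 2, size=rows)  # binary label, last column
        return df

    return [[[one_matrix() for _ in range(n_channels)]
             for _ in range(n_segments)]
            for _ in range(n_prep)]

mock = make_mock()
print(describe_nesting(mock))  # (2, 3, 4, (5, 3))
```

On the real data, `describe_nesting(total_list_stand)` should return `(4, 35, 19, (121, 54))` if the description above holds.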


### XGBoost training

In [23]:
def num_to_channel(num):
    # Mapping of numbers to channel names
    channel_map = {
        0: 'Fp1', 1: 'Fp2', 2: 'F3', 3: 'F4', 4: 'C3', 5: 'C4', 6: 'P3',
        7: 'P4', 8: 'O1', 9: 'O2', 10: 'F7', 11: 'F8', 12: 'T7', 13: 'T8',
        14: 'P7', 15: 'P8', 16: 'Fz', 17: 'Cz', 18: 'Pz'
    }

    # Return the channel name if the number exists, otherwise return an error message
    return channel_map.get(num, 'Channel error')

In [24]:
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

# Initialize an empty list to store the results
list_results = []

# Iterate through all 4 preprocessing types
for prep in range(4):  # One pass per preprocessing method
    for segment in range(35):  # Loop over the different segments (chunks)
        for channel in range(19):  # Iterate over all 19 EEG channels
            print(f'Processing: prep={prep}, segment={segment}, channel={channel}')

            # Retrieve the standardized data for the given preprocessing, segment, and channel
            df_data = total_list_stand[prep][segment][channel]

            # Convert the data to numpy arrays (features and labels)
            features = df_data.iloc[:, :-1].to_numpy()  # All columns except the last (features)
            labels = df_data['Label'].to_numpy()  # The last column (labels)

            # Convert the data to XGBoost's DMatrix format for efficient processing
            dtrain = xgb.DMatrix(features, label=labels)

            # Define the hyperparameters for XGBoost
            eta_values = [0.01, 0.1, 0.5, 1]  # Learning rates
            gamma_values = [0, 1]  # Minimum loss reduction
            max_depth_values = [2, 3, 6, 12, 24]  # Maximum tree depth

            # Loop over the hyperparameter combinations
            for eta in eta_values:
                for gamma in gamma_values:
                    for max_depth in max_depth_values:
                        # Set the XGBoost parameters
                        params = {
                            "max_depth": max_depth,
                            "eta": eta,
                            "gamma": gamma,
                            "objective": "binary:logistic"  # Binary classification task
                        }
                        num_rounds = 30  # Number of boosting rounds
                        k_folds = StratifiedKFold(n_splits=5)  # 5-fold stratified CV; no shuffling, so the splits are deterministic

                        # Perform cross-validation using XGBoost
                        res = xgb.cv(
                            params,
                            dtrain,
                            num_boost_round=num_rounds,
                            folds=k_folds,
                            stratified=True,
                            metrics={"error"},
                            seed=1,
                            verbose_eval=False  # Set to False to suppress output during CV
                        )

                        # Pick the boosting round with the lowest mean test error across folds
                        best_result = res.loc[res['test-error-mean'].idxmin()]
                        min_error = round(best_result['test-error-mean'], 4)  # Lowest mean CV error
                        std_error = round(best_result['test-error-std'], 4)  # Std. dev. of the error across folds at that round

                        # Convert the channel number to its corresponding name
                        channel_name = num_to_channel(channel)

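                        # Accuracy is the complement of the error rate; round again to
                        # avoid floating-point noise such as 0.8099999999999999
                        accuracy = round(1 - min_error, 4)

                        # Append the results as a tuple
                        list_results.append((prep, segment, channel_name, eta, gamma, max_depth, accuracy, std_error))

                        # Optional: Print the result
                        print(f'Accuracy: {accuracy} ± {std_error} for {channel_name}, eta={eta}, gamma={gamma}, max_depth={max_depth}')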
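
The three nested hyperparameter loops can also be written as one flat iteration, which makes the size of the search explicit. The sketch below uses only the standard library; `param_grid`, `n_matrices`, and `n_cv_runs` are names introduced here for illustration.

```python
from itertools import product

eta_values = [0.01, 0.1, 0.5, 1]
gamma_values = [0, 1]
max_depth_values = [2, 3, 6, 12, 24]

# One dict per hyperparameter combination, in the same order as the nested loops
param_grid = [
    {"eta": eta, "gamma": gamma, "max_depth": max_depth, "objective": "binary:logistic"}
    for eta, gamma, max_depth in product(eta_values, gamma_values, max_depth_values)
]

n_matrices = 4 * 35 * 19  # preprocessing types x segments x channels
n_cv_runs = n_matrices * len(param_grid)
print(len(param_grid), n_matrices, n_cv_runs)  # 40 2660 106400
```

Each of those runs trains 5 folds for 30 boosting rounds, so the full loop is expensive. Defining the value lists once, outside the data loops, also avoids rebuilding them for every matrix.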


### Analysis of the results

In [26]:
# Sort results by accuracy (index 6) in descending order
list_results.sort(key=lambda x: x[6], reverse=True)

# Create a DataFrame from the sorted results
df_results_xgb = pd.DataFrame(
    list_results, 
    columns=['Preprocess', 'Segment', 'Channel', 'Eta', 'Gamma', 'max_depth', 'Accuracy', 'Std_error']
)

# Add classifier and feature set information (a scalar is broadcast to every row)
df_results_xgb['clf'] = 'xgboost'
df_results_xgb['Features'] = 'all'

# Display the top 20 results
df_results_xgb.head(20)

In [27]:
with open("xgboost_1channel_df_results.pkl", "wb") as fp:  # Pickle the results DataFrame
    pickle.dump(df_results_xgb, fp)
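
Rather than copying indices and hyperparameters by hand, the best configuration could be read from the results table. The sketch below assumes the column names used in `df_results_xgb`; the rows in `df_demo` are made up for illustration, and `best_configuration` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows with the same columns as df_results_xgb (values are illustrative only)
df_demo = pd.DataFrame(
    [
        (0, 3, "C3", 0.1, 0, 3, 0.70, 0.05),
        (1, 7, "Pz", 0.5, 1, 6, 0.75, 0.04),
        (2, 1, "O1", 0.01, 0, 2, 0.65, 0.06),
    ],
    columns=["Preprocess", "Segment", "Channel", "Eta", "Gamma",
             "max_depth", "Accuracy", "Std_error"],
)

# Same order as the channel_map in num_to_channel
channel_names = ['Fp1', 'Fp2', 'F3', 'F4', 'C3', 'C4', 'P3', 'P4', 'O1', 'O2',
                 'F7', 'F8', 'T7', 'T8', 'P7', 'P8', 'Fz', 'Cz', 'Pz']

def best_configuration(df):
    """Return indices and hyperparameters of the highest-accuracy row."""
    best = df.loc[df["Accuracy"].idxmax()]
    return {
        "prep": int(best["Preprocess"]),
        "segment": int(best["Segment"]),
        "channel": channel_names.index(best["Channel"]),  # name back to index
        "eta": float(best["Eta"]),
        "gamma": float(best["Gamma"]),
        "max_depth": int(best["max_depth"]),
    }

best = best_configuration(df_demo)
print(best)
```

The returned `prep`, `segment`, and `channel` values can then index `total_list_stand` directly.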

In [28]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Select the data matrix for training and testing:
# preprocessing 0, segment 28, channel index 6 (P3)
df_best_xgb = total_list_stand[0][28][6]

# Extract features and labels from the data
data = df_best_xgb.iloc[:, :-1].to_numpy()  # Features (all columns except the last)
label = df_best_xgb['Label'].to_numpy()  # Labels (last column)

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(data, label, train_size=0.8, random_state=12, stratify=label)

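# Create an XGBoost classifier with the chosen hyperparameters
# (`eta` is XGBoost's alias for `learning_rate`; `n_estimators` is not set,
# so the library default applies rather than the 30 rounds used during CV)
model = XGBClassifier(max_depth=3, eta=0.1, gamma=1, objective='binary:logistic')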

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
preds = model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Compute the confusion matrix
cm = confusion_matrix(y_test, preds)

# Plot the confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
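
The accuracy can be read from the confusion matrix instead of being estimated from the plot. The sketch below uses made-up labels and predictions in place of `y_test` and `preds`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal of the confusion matrix
accuracy_from_cm = cm.trace() / cm.sum()

print(cm.tolist())       # [[2, 1], [1, 4]]
print(accuracy_from_cm)  # 0.75
```

With the notebook's variables, `accuracy_score(y_test, preds)` gives the same figure in one call.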