# Description: 
##### This script performs a 5-fold cross-validation experiment using PyCaret for a binary classification task. 
* It iterates through different data groups (Total, High, Low) and target variables (arousal, valence), aggregates the performance of various models across the predefined folds, and saves the summary results to CSV files.

# Instructions:
1. Make sure PyCaret and its dependencies are installed.
If they are not, run the line below once in a notebook cell.
> !pip install pycaret pandas

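2. Place your 90 CSV data files (5 folds × 3 groups × 2 variables × 3 splits) in a `data` folder next to this script. The code reads them as `./data/fold{fold}_{group}_{variable}_{train|valid|test}.csv`.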

3. Run the script.
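
Before running the experiment, it can help to confirm that every expected file is present. The sketch below rebuilds the file names from the same naming pattern the experiment cell uses (`./data/fold{fold}_{group}_{variable}_{split}.csv`) and lists any that are missing. The helper names `expected_files` and `missing_files` are hypothetical and not part of the original notebook.

```python
from itertools import product
from pathlib import Path

# These values mirror the configuration used by the experiment function.
GROUPS = ["Total", "High", "Low"]
VARIABLES = ["arousal", "valence"]
FOLDS = range(1, 6)
SPLITS = ["train", "valid", "test"]


def expected_files(data_dir="./data"):
    """Return every CSV path the experiment will try to read (hypothetical helper)."""
    return [
        Path(data_dir) / f"fold{fold}_{group}_{variable}_{split}.csv"
        for group, variable, fold, split in product(GROUPS, VARIABLES, FOLDS, SPLITS)
    ]


def missing_files(data_dir="./data"):
    """Return the expected paths that do not exist on disk (hypothetical helper)."""
    return [path for path in expected_files(data_dir) if not path.exists()]


if __name__ == "__main__":
    missing = missing_files()
    print(f"{len(expected_files())} files expected, {len(missing)} missing")
    for path in missing:
        print(f"  missing: {path}")
```

Any fold with a missing file is skipped by the experiment, so an empty "missing" list means all five folds can contribute to each average.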

In [1]:
import os

import pandas as pd
# Import only the PyCaret functions this notebook calls.
from pycaret.classification import setup, compare_models, pull

In [4]:
!pwd

/Users/leechangmin/Desktop/Project/ETRI-Emotion/src


In [None]:
def run_pycaret_classification_experiment():
    """
    This function runs the main experiment. It sets up PyCaret for classification,
    iterates through predefined data files for each fold, compares ML models,
    and aggregates the results.
    """
    # --- Configuration ---
    # Name of the target column in every CSV file.
    TARGET_COLUMN_NAME = 'label'

    groups = ['Total', 'High', 'Low']
    variables = ['arousal', 'valence']  # Used to build file and experiment names.
    folds = range(1, 6)

    final_results = {}

    print("Starting PyCaret Classification Modeling...")

    for group in groups:
        for variable in variables:
            experiment_name = f"{group}_{variable}"
            print(f"\n--- Starting Experiment: [{experiment_name}] ---")

            fold_results_list = []

            for fold in folds:
                try:
                    train_file = f'./data/fold{fold}_{group}_{variable}_train.csv'
                    valid_file = f'./data/fold{fold}_{group}_{variable}_valid.csv'
                    test_file = f'./data/fold{fold}_{group}_{variable}_test.csv'

                    if not all(os.path.exists(f) for f in [train_file, valid_file, test_file]):
                        print(f"  - Skipping Fold {fold}: Required CSV file(s) not found.")
                        continue

                    train_df = pd.read_csv(train_file)
                    valid_df = pd.read_csv(valid_file)
                    test_df = pd.read_csv(test_file)

                    train_valid_df = pd.concat([train_df, valid_df], ignore_index=True)

                    print(f"  - Fold {fold}: Training and evaluating models...")

                    # Initialize the PyCaret classification environment.
                    s = setup(data=train_valid_df,
                              test_data=test_df,
                              # Using 'label' as the target column.
                              target=TARGET_COLUMN_NAME,
                              index=False,
                              session_id=123,
                              verbose=False)

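                    # compare_models() scores each model with PyCaret's internal
                    # cross-validation on the training data; pull() returns that grid.
                    compare_models(verbose=False)
                    results_grid = pull()
                    fold_results_list.append(results_grid)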

                except Exception as e:
                    print(f"  - An error occurred in Fold {fold}: {e}")

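            if fold_results_list:
                all_folds_df = pd.concat(fold_results_list)
                # Average each metric per model across the folds that completed.
                mean_results = all_folds_df.groupby('Model').mean(numeric_only=True)
                mean_results = mean_results.sort_values('Accuracy', ascending=False)
                final_results[experiment_name] = mean_results
                # Report the actual fold count, since missing or failed folds are skipped.
                print(f"  > Finished experiment [{experiment_name}]. "
                      f"Aggregated results from {len(fold_results_list)} folds.")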

    if final_results:
        print("\n--- Final Aggregated Results for All Experiments ---")
        for name, result_df in final_results.items():
            print(f"\n[{name}] - Top 5 Models by Average Accuracy")
            print(result_df.head())
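            # Save under ./res/, where load_all_results() looks for the summaries.
            os.makedirs('./res', exist_ok=True)
            output_filename = f'./res/results_{name}_classification_summary.csv'
            result_df.to_csv(output_filename)
            print(f"> Results saved to '{output_filename}'")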
    else:
        print("\nNo results were processed. Please check your file names, paths, or for errors during execution.")

In [13]:
# This ensures the script runs when executed directly.
if __name__ == '__main__':
    run_pycaret_classification_experiment()

Starting PyCaret Classification Modeling...

--- Starting Experiment: [Total_arousal] ---
  - Fold 1: Training and evaluating models...
  - Fold 2: Training and evaluating models...
  - Fold 3: Training and evaluating models...
  - Fold 4: Training and evaluating models...
  - Fold 5: Training and evaluating models...
  > Finished experiment [Total_arousal]. Aggregated results from 5 folds.

--- Starting Experiment: [Total_valence] ---
  - Fold 1: Training and evaluating models...
  - Fold 2: Training and evaluating models...
  - Fold 3: Training and evaluating models...
  - Fold 4: Training and evaluating models...
  - Fold 5: Training and evaluating models...
  > Finished experiment [Total_valence]. Aggregated results from 5 folds.

--- Starting Experiment: [High_arousal] ---
  - Fold 1: Training and evaluating models...
  - Fold 2: Training and evaluating models...
  - Fold 3: Training and evaluating models...
  - Fold 4: Training and evaluating models...
  - Fold 5: Training and ev
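
The per-experiment summary is the mean of each metric per model over the fold grids. The sketch below reproduces that aggregation on two made-up grids so the arithmetic is easy to check by hand. The model names, the numbers, and the helper `aggregate_folds` are illustrative assumptions, and the extra standard-deviation column is an addition that the original experiment does not compute.

```python
import pandas as pd

# Two made-up result grids standing in for what pull() returns in each fold.
fold_1 = pd.DataFrame(
    {"Model": ["Model A", "Model B"], "Accuracy": [0.90, 0.80], "F1": [0.88, 0.79]}
)
fold_2 = pd.DataFrame(
    {"Model": ["Model A", "Model B"], "Accuracy": [0.94, 0.82], "F1": [0.92, 0.81]}
)


def aggregate_folds(fold_grids):
    """Average every numeric metric per model across folds (hypothetical helper)."""
    all_folds = pd.concat(fold_grids, ignore_index=True)
    grouped = all_folds.groupby("Model")
    summary = grouped.mean(numeric_only=True)
    # Not in the original notebook: the spread across folds shows how stable a model is.
    summary["Accuracy_std"] = grouped["Accuracy"].std()
    return summary.sort_values("Accuracy", ascending=False)


summary = aggregate_folds([fold_1, fold_2])
print(summary)
```

With only five folds, a small gap between two models' mean accuracy can be within the fold-to-fold spread, which is why the standard deviation may be worth keeping next to the mean.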

### Load Results

In [3]:
def load_all_results():
    """
    Read the './res/results_{group}_{variable}_classification_summary.csv' files
    and return a dictionary of DataFrames keyed by '{group}_{variable}'.
    """
    # --- Configuration (can be adjusted as needed) ---
    groups = ['Total', 'High', 'Low']
    variables = ['arousal', 'valence']
    
    # dictionary to hold all results
    all_results = {}
    
    print("Reading all result CSV files...")

    # Iterate through each group and variable to construct filenames
    for group in groups:
        for variable in variables:
            # Construct the filename based on the group and variable
            experiment_name = f"{group}_{variable}"
            filename = f"./res/results_{experiment_name}_classification_summary.csv"
            
            try:
                # Read the CSV file into a DataFrame;
                # index_col=0 uses the first column (the model name) as the index.
                df = pd.read_csv(filename, index_col=0)
                
                # Store the DataFrame in the dictionary with the experiment name as key
                all_results[experiment_name] = df
                print(f"  - Successfully loaded: {filename}")
                
            except FileNotFoundError:
                print(f"  - File not found, skipping: {filename}")
                
    return all_results
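
The dictionary returned by `load_all_results()` maps each experiment name to a DataFrame indexed by model name. One way to compare experiments at a glance is to pick the top model per experiment, as sketched below. The helper `best_model_per_experiment` and the toy data are hypothetical. They only assume the structure described above, with an `Accuracy` column.

```python
import pandas as pd


def best_model_per_experiment(results, metric="Accuracy"):
    """Build a one-row-per-experiment table of the best model (hypothetical helper)."""
    rows = []
    for name, df in results.items():
        best = df[metric].idxmax()  # index label (model name) with the highest score
        rows.append({"Experiment": name, "Best model": best, metric: df.loc[best, metric]})
    return pd.DataFrame(rows).set_index("Experiment")


# Toy stand-in for the loaded results; the numbers are made up.
model_index = pd.Index(["Model A", "Model B"], name="Model")
toy_results = {
    "Total_arousal": pd.DataFrame({"Accuracy": [0.91, 0.95]}, index=model_index),
    "Total_valence": pd.DataFrame({"Accuracy": [0.89, 0.85]}, index=model_index),
}

table = best_model_per_experiment(toy_results)
print(table)
```

Passing another column name as `metric`, such as `"F1"` or `"AUC"`, ranks by that metric instead, provided the column exists in every loaded DataFrame.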

In [4]:
if __name__ == '__main__':
    loaded_results = load_all_results()
    print("\n--- Summary of Loaded Results ---")
    
    if not loaded_results:
        print("No result files were found.")
    else:
        # Display the top 5 models for each loaded result
        for name, result_df in loaded_results.items():
            print(f"\n--- Top 5 Models for [{name}] ---")
            print(result_df.head())

Reading all result CSV files...
  - Successfully loaded: ./res/results_Total_arousal_classification_summary.csv
  - Successfully loaded: ./res/results_Total_valence_classification_summary.csv
  - Successfully loaded: ./res/results_High_arousal_classification_summary.csv
  - Successfully loaded: ./res/results_High_valence_classification_summary.csv
  - Successfully loaded: ./res/results_Low_arousal_classification_summary.csv
  - Successfully loaded: ./res/results_Low_valence_classification_summary.csv

--- Summary of Loaded Results ---

--- Top 5 Models for [Total_arousal] ---
                                 Accuracy      AUC   Recall    Prec.       F1  \
Model                                                                           
Gradient Boosting Classifier      0.95500  0.99220  0.95500  0.95646  0.95466   
Random Forest Classifier          0.95374  0.99182  0.95374  0.95572  0.95346   
CatBoost Classifier               0.95166  0.99230  0.95166  0.95406  0.95124   
Ada Boost Cl