# Description: 
##### This script is designed for a pre-split train/test data structure.
##### For each of the 6 conditions, it performs the following steps, ensuring no data leakage:
1. Loads the specific train and test CSV files.
2. Initializes PyCaret, ~~which learns Z-score normalization parameters ONLY from the train data and applies them to both train and test sets.~~
3. Runs 5-fold cross-validation on the normalized train data to find the best model.
4. Evaluates the best model on the unseen, normalized test data.
5. Saves the final test performance results to the './res' directory.
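
The leakage concern in step 2 comes down to where the normalization parameters are learned. The sketch below illustrates the idea in plain Python, independent of PyCaret. The feature values and the helper names `fit_zscore` and `apply_zscore` are hypothetical, and using the population standard deviation is an assumption chosen to mirror common scaler implementations.

```python
from statistics import mean, pstdev

def fit_zscore(train_values):
    """Learn the mean and standard deviation from the training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)  # population standard deviation (assumption)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant features
    return mu, sigma

def apply_zscore(values, mu, sigma):
    """Scale any values with parameters that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values, for illustration only
train_feature = [60.0, 70.0, 80.0]
test_feature = [65.0, 90.0]

# Parameters come from the training split and are reused for the test split
mu, sigma = fit_zscore(train_feature)
train_scaled = apply_zscore(train_feature, mu, sigma)
test_scaled = apply_zscore(test_feature, mu, sigma)

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The test values never influence `mu` or `sigma`, so the scaled test set can fall outside the range seen in training, which is expected.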

# Instructions:
1. Make sure you have PyCaret and its dependencies installed.
If not, uncomment the line below and run it once.
> !pip install pycaret pandas

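2. Place this script in the directory that contains the `data` folder. The code reads `./data/train/{group}_{variable}_train.csv` and `./data/test/{group}_{variable}_test.csv`.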

3. Run the script.
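
Before running, it can help to confirm that the input files are where the code expects them. The following is a minimal sketch that assumes the same `./data/train` and `./data/test` layout and the same group and variable names used in the cells below. The helper name `expected_files` is hypothetical.

```python
import os
from itertools import product

DATA_DIR = "./data"
GROUPS = ["All", "High", "Low"]
VARIABLES = ["arousal", "valence"]

def expected_files(data_dir=DATA_DIR):
    """Return the train and test CSV paths for every group/variable pair."""
    paths = []
    for group, variable in product(GROUPS, VARIABLES):
        name = f"{group}_{variable}"
        paths.append(os.path.join(data_dir, "train", f"{name}_train.csv"))
        paths.append(os.path.join(data_dir, "test", f"{name}_test.csv"))
    return paths

# Report which of the expected files are not on disk
missing = [p for p in expected_files() if not os.path.exists(p)]
print(f"{len(expected_files())} files expected, {len(missing)} missing")
for path in missing:
    print("  missing:", path)
```

Conditions with a missing file are skipped by the experiment loops, so a check like this makes silent skips easier to spot.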

In [1]:
import os
import pandas as pd
from pycaret.classification import *

In [2]:
!pwd

/Users/leechangmin/Desktop/Project/ETRI-Emotion/cardio_exp2_entire_data_TVT


In [15]:
def run_cv_with_pre_split_normalized_data():
    """
    Runs a 5-fold CV workflow on pre-split data, with Z-score normalization applied by PyCaret.
    """
    # --- Configuration ---
    TARGET_COLUMN_NAME = 'label'
    DATA_DIR = './data'
    RESULT_DIR = './res'
    # Adjusted group names to match filenames
    GROUPS = ['All', 'High', 'Low']
    VARIABLES = ['arousal', 'valence']
    
    # Create the result directory if it doesn't exist
    os.makedirs(RESULT_DIR, exist_ok=True)

    print("=====================================================================")
    print("===  STARTING EXPERIMENT: CV on Pre-Split, Normalized Data  ===")
    print("=====================================================================")

    for group in GROUPS:
        for variable in VARIABLES:
            experiment_name = f"{group}_{variable}"
            train_file_path = os.path.join(DATA_DIR, 'train', f"{experiment_name}_train.csv")
            test_file_path = os.path.join(DATA_DIR, 'test', f"{experiment_name}_test.csv")
            
            if not (os.path.exists(train_file_path) and os.path.exists(test_file_path)):
                print(f"\n--- Skipping {experiment_name}: Train or test file not found. ---")
                continue
                
            print(f"\n--- Processing: {experiment_name} ---")
            
            # 1. Load the pre-split train and test datasets
            train_df = pd.read_csv(train_file_path)
            test_df = pd.read_csv(test_file_path)
            
            # Rename the actual target column ('arousal' or 'valence') to 'label'
            # so PyCaret can find it. The `variable` holds the correct column name for each loop.
            train_df.rename(columns={variable: TARGET_COLUMN_NAME}, inplace=True)
            test_df.rename(columns={variable: TARGET_COLUMN_NAME}, inplace=True)
            
            print(f"  > Renamed column '{variable}' to '{TARGET_COLUMN_NAME}'.")
            
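            # 2. Setup PyCaret environment.
            # Pass train_df and test_df directly so PyCaret uses test_df as the hold-out set.
            # Enable Z-score normalization. The scaler is a step in PyCaret's pipeline, so it is
            # fit on the training data and then applied to the test data.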
            s = setup(data=train_df,
                      test_data=test_df,
                      target=TARGET_COLUMN_NAME,
                      fold=5,
                      
                      # Enable Z-score normalization
                      normalize=True,
                      normalize_method='zscore',
                      
                      index=False,
                      session_id=123,
                      verbose=False)
            
            # 3. Compare models using 5-fold CV on the normalized training data
            print("  > Comparing models using 5-fold CV on the normalized training set...")
            best_model = compare_models(verbose=False)
            
            # 4. Report the best model. compare_models() returns it already fitted on the
            # training set. finalize_model() is deliberately not called before evaluation:
            # it refits on the training data plus the hold-out set, which would leak test_df.
            print(f"  > Finalizing the best model: {pull().iloc[0,0]}")
            final_model = best_model
            
            # 5. Evaluate the model on the unseen, normalized test set
            print("  > Evaluating the final model on the hold-out test set...")
            # predict_model() without a data argument uses the test_data provided during setup
            predict_model(final_model, verbose=False)
            
            # 6. Pull the performance metrics from the test set evaluation
            test_metrics = pull()
            
            # 7. Save the test set performance to a CSV file
            output_path = os.path.join(RESULT_DIR, f"test_performance_{experiment_name}.csv")
            test_metrics.to_csv(output_path)
            print(f"  > Test performance for '{pull().iloc[0,0]}' saved to {output_path}")


In [16]:
if __name__ == '__main__':
    run_cv_with_pre_split_normalized_data()
    print("\n\nAll experiments are complete.")

===  STARTING EXPERIMENT: CV on Pre-Split, Normalized Data  ===

--- Processing: All_arousal ---
  > Renamed column 'arousal' to 'label'.
  > Comparing models using 5-fold CV on the normalized training set...
  > Finalizing the best model: Dummy Classifier
  > Evaluating the final model on the hold-out test set...
  > Test performance for 'Dummy Classifier' saved to ./res/test_performance_All_arousal.csv

--- Processing: All_valence ---
  > Renamed column 'valence' to 'label'.
  > Comparing models using 5-fold CV on the normalized training set...
  > Finalizing the best model: Naive Bayes
  > Evaluating the final model on the hold-out test set...
  > Test performance for 'Naive Bayes' saved to ./res/test_performance_All_valence.csv

--- Processing: High_arousal ---
  > Renamed column 'arousal' to 'label'.
  > Comparing models using 5-fold CV on the normalized training set...
  > Finalizing the best model: Dummy Classifier
  > Evaluating the final model on the hold-out test set...
  > T
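
The model comparison above scores each candidate with 5-fold cross-validation on the training set. As a rough illustration of how the folds partition the rows, here is a simplified sketch in plain Python. It does no shuffling or stratification, whereas PyCaret's classification module uses a stratified strategy by default, so treat it as a picture of the idea and not of PyCaret's exact splits. The function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold CV."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row each
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Hypothetical training set of 12 rows
folds = list(kfold_indices(n_samples=12, n_folds=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} rows")
```

Every row is used for validation exactly once, and the test set plays no part in any fold.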

In [22]:
#
# Description:
# This script identifies the top 5 models based on 5-fold cross-validation on the
# training set. It then finalizes each of these 5 models and evaluates their
# performance on the unseen test set, saving the combined results.
#

import pandas as pd
from pycaret.classification import *
import os

def evaluate_top_5_models_on_test_set():
    """
    Finds the top 5 models via CV and evaluates each on the test set.
    """
    # --- Configuration ---
    TARGET_COLUMN_NAME = 'label'
    DATA_DIR = './data'
    RESULT_DIR = './res2'
    GROUPS = ['All', 'High', 'Low']
    VARIABLES = ['arousal', 'valence']
    
    os.makedirs(RESULT_DIR, exist_ok=True)

    print("=====================================================================")
    print("===  STARTING EXPERIMENT: Evaluating Top 5 Models on Test Set  ===")
    print("=====================================================================")

    for group in GROUPS:
        for variable in VARIABLES:
            experiment_name = f"{group}_{variable}"
            train_file_path = os.path.join(DATA_DIR, 'train', f"{experiment_name}_train.csv")
            test_file_path = os.path.join(DATA_DIR, 'test', f"{experiment_name}_test.csv")
            
            if not (os.path.exists(train_file_path) and os.path.exists(test_file_path)):
                print(f"\n--- Skipping {experiment_name}: Train or test file not found. ---")
                continue
                
            print(f"\n--- Processing: {experiment_name} ---")
            
            try:
                # 1. Load data and rename target column
                train_df = pd.read_csv(train_file_path)
                test_df = pd.read_csv(test_file_path)
                train_df.rename(columns={variable: TARGET_COLUMN_NAME}, inplace=True)
                test_df.rename(columns={variable: TARGET_COLUMN_NAME}, inplace=True)
                
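                # 2. Setup PyCaret environment.
                # No test_data is passed, so PyCaret holds out part of train_df internally
                # (train_size defaults to 0.7); test_df is only used in step 5.
                # Normalization is not enabled in this experiment.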
                s = setup(data=train_df,
                          target=TARGET_COLUMN_NAME,
                          fold=5,
                          index=False,
                          session_id=123,
                          verbose=False)
                
                # 3. Compare models and get the top 5
                print("  > Comparing models to find the top 5...")
                # n_select=5 returns a list of the 5 best model objects
                top_5_models = compare_models(n_select=5, verbose=False)
                
                # This list will hold the performance DataFrame for each of the top 5 models
                top_5_test_results = []

                print("  > Evaluating each of the top 5 models on the test set...")
                for model in top_5_models:
                    # 4. Finalize each model (retrain on full training data)
                    final_model = finalize_model(model)
                    
                    # 5. Evaluate on the test set
                    predict_model(final_model, data=test_df, verbose=False)
                    
                    # 6. Pull the performance metrics and add to our list
                    test_metrics = pull()
                    top_5_test_results.append(test_metrics)
                
                # 7. Combine the results of the 5 models into a single DataFrame
                final_results_df = pd.concat(top_5_test_results).reset_index(drop=True)
                
                # 8. Save the combined results to a CSV file
                output_path = os.path.join(RESULT_DIR, f"top5_test_performance_{experiment_name}.csv")
                final_results_df.to_csv(output_path, index=False)
                print(f"  > Top 5 model test performances saved to {output_path}")

            except Exception as e:
                print(f"  > An error occurred during processing: {e}")

if __name__ == '__main__':
    evaluate_top_5_models_on_test_set()
    print("\n\nAll experiments are complete.")


===  STARTING EXPERIMENT: Evaluating Top 5 Models on Test Set  ===

--- Processing: All_arousal ---
  > Comparing models to find the top 5...
  > Evaluating each of the top 5 models on the test set...
  > Top 5 model test performances saved to ./res2/top5_test_performance_All_arousal.csv

--- Processing: All_valence ---
  > Comparing models to find the top 5...
  > Evaluating each of the top 5 models on the test set...
  > Top 5 model test performances saved to ./res2/top5_test_performance_All_valence.csv

--- Processing: High_arousal ---
  > Comparing models to find the top 5...
  > Evaluating each of the top 5 models on the test set...
  > Top 5 model test performances saved to ./res2/top5_test_performance_High_arousal.csv

--- Processing: High_valence ---
  > Comparing models to find the top 5...
  > Evaluating each of the top 5 models on the test set...
  > Top 5 model test performances saved to ./res2/top5_test_performance_High_valence.csv

--- Processing: Low_arousal ---
  > Comp

### Load Results

In [20]:
# Pandas display options: show all rows and columns without truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000) # Adjust width to prevent line wrapping

def load_all_results():
    """
    Read the './res/test_performance_{group}_{variable}.csv' files
    and return a dictionary of DataFrames keyed by experiment name.
    """
    # --- Configuration (can be adjusted as needed) ---
    groups = ['All', 'High', 'Low']
    variables = ['arousal', 'valence']
    
    # dictionary to hold all results
    all_results = {}
    
    print("Reading all result CSV files...")

    # Iterate through each group and variable to construct filenames
    for group in groups:
        for variable in variables:
            # Construct the filename based on the group and variable
            experiment_name = f"{group}_{variable}"
            filename = f"./res/test_performance_{experiment_name}.csv"
            
            try:
                # Read the CSV file into a DataFrame,
                # using the 'Model' column as the index
                df = pd.read_csv(filename, index_col='Model')
                
                # Store the DataFrame in the dictionary with the experiment name as key
                all_results[experiment_name] = df
                print(f"  - Successfully loaded: {filename}")
                
            except FileNotFoundError:
                print(f"  - File not found, skipping: {filename}")
                
    return all_results

In [21]:
if __name__ == '__main__':
    # Run the function to load all result data
    loaded_results = load_all_results()
    
    print("\n" + "="*50)
    print("      <<< Summary of All Loaded Results >>>")
    print("="*50)
    
    if not loaded_results:
        print("No result files were found.")
    else:
        # Loop through and print each result DataFrame
        for name, result_df in loaded_results.items():
            print(f"\n\n--- Performance Results for [{name}] ---")
            # Using print() on a DataFrame with the options set will display it fully
            print(result_df)
            print("-"*(len(name) + 35))

Reading all result CSV files...
  - Successfully loaded: ./res/test_performance_All_arousal.csv
  - Successfully loaded: ./res/test_performance_All_valence.csv
  - Successfully loaded: ./res/test_performance_High_arousal.csv
  - Successfully loaded: ./res/test_performance_High_valence.csv
  - Successfully loaded: ./res/test_performance_Low_arousal.csv
  - Successfully loaded: ./res/test_performance_Low_valence.csv

      <<< Summary of All Loaded Results >>>


--- Performance Results for [All_arousal] ---
                  Unnamed: 0  Accuracy  AUC  Recall   Prec.     F1  Kappa  MCC
Model                                                                         
Dummy Classifier           0    0.5625  0.5  0.5625  0.3164  0.405    0.0  0.0
----------------------------------------------


--- Performance Results for [All_valence] ---
             Unnamed: 0  Accuracy     AUC  Recall   Prec.      F1  Kappa  MCC
Model                                                                        
N
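
The Dummy Classifier row for All_arousal above can be checked by hand. The sketch below assumes binary labels, a dummy model that always predicts the majority class, and weighted-average precision, recall, and F1. These are assumptions about how the metrics were computed, made because they reproduce the reported figures. The only input is the accuracy of 0.5625 from the table.

```python
# Share of the majority class, read off the Accuracy column (0.5625)
p = 0.5625

# Majority class: every sample is predicted as this class
precision_major = p   # only a fraction p of the predictions is correct
recall_major = 1.0    # every true majority sample is recovered
f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)

# Minority class: never predicted, so precision, recall and F1 are taken as 0
precision_minor = recall_minor = f1_minor = 0.0

# Weighted averages use the class shares p and (1 - p) as weights
weighted_precision = p * precision_major + (1 - p) * precision_minor
weighted_recall = p * recall_major + (1 - p) * recall_minor
weighted_f1 = p * f1_major + (1 - p) * f1_minor

print(round(weighted_precision, 4), round(weighted_recall, 4), round(weighted_f1, 4))
```

Under these assumptions the values agree with the Prec., Recall, and F1 columns of that row, and a Kappa and MCC of 0 are consistent with a model that carries no information beyond the class prior.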

In [24]:
# Pandas display options: show all rows and columns without truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000) # Adjust width to prevent line wrapping

def load_all_results():
    """
    Read the './res2/top5_test_performance_{group}_{variable}.csv' files
    and return a dictionary of DataFrames keyed by experiment name.
    """
    # --- Configuration (can be adjusted as needed) ---
    groups = ['All', 'High', 'Low']
    variables = ['arousal', 'valence']
    
    # dictionary to hold all results
    all_results = {}
    
    print("Reading all result CSV files...")

    # Iterate through each group and variable to construct filenames
    for group in groups:
        for variable in variables:
            # Construct the filename based on the group and variable
            experiment_name = f"{group}_{variable}"
            filename = f"./res2/top5_test_performance_{experiment_name}.csv"
            
            try:
                # Read the CSV file into a DataFrame,
                # using the 'Model' column as the index
                df = pd.read_csv(filename, index_col='Model')
                
                # Store the DataFrame in the dictionary with the experiment name as key
                all_results[experiment_name] = df
                print(f"  - Successfully loaded: {filename}")
                
            except FileNotFoundError:
                print(f"  - File not found, skipping: {filename}")
                
    return all_results

if __name__ == '__main__':
    # Run the function to load all result data
    loaded_results = load_all_results()
    
    print("\n" + "="*50)
    print("      <<< Summary of All Loaded Results (5 models per experiment) >>>")
    print("="*50)
    
    if not loaded_results:
        print("No result files were found.")
    else:
        # Loop through and print each result DataFrame
        for name, result_df in loaded_results.items():
            print(f"\n\n--- Performance Results for [{name}] ---")
            # Using print() on a DataFrame with the options set will display it fully
            print(result_df)
            print("-"*(len(name) + 35))

Reading all result CSV files...
  - Successfully loaded: ./res2/top5_test_performance_All_arousal.csv
  - Successfully loaded: ./res2/top5_test_performance_All_valence.csv
  - Successfully loaded: ./res2/top5_test_performance_High_arousal.csv
  - Successfully loaded: ./res2/top5_test_performance_High_valence.csv
  - Successfully loaded: ./res2/top5_test_performance_Low_arousal.csv
  - Successfully loaded: ./res2/top5_test_performance_Low_valence.csv

      <<< Summary of All Loaded Results (5 models per experiment) >>>


--- Performance Results for [All_arousal] ---
                                 Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Model                                                                                    
Extra Trees Classifier             0.5521  0.4590  0.5521  0.5263  0.4993  0.0255  0.0309
Random Forest Classifier           0.4479  0.3902  0.4479  0.3992  0.4085 -0.1811 -0.1999
CatBoost Classifier                0.5000  0.4572  0.5000  0.4631  