### Configuration and Setup for Audio Processing Pipeline

This section of the code initializes the environment, loads necessary libraries, and establishes the configuration parameters that drive the entire data processing and feature extraction workflow.

#### 1. Imports and Config Loading
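The script begins by importing libraries for data manipulation (`pandas`, `numpy`), file system operations (`os`, `shutil`, `pathlib`), audio processing (`librosa`, `soundfile`), plotting (`matplotlib`), progress tracking (`tqdm`), and command-line handling (`argparse`, `sys`). It also fixes the random seeds of `numpy` and `random` for reproducibility.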

It then reads a configuration CSV file (`config/file_paths.csv`) that serves as a central registry for dataset locations on the local machine. Moving the pipeline to another machine then only requires updating the CSV file rather than editing hardcoded paths.

#### 2. Dataset Specific Path Extraction
Using the loaded configuration DataFrame (`CONFIG_df`), the script extracts specific root directories and metadata file paths for three distinct datasets:
* **Italian Dataset:** Base path and metadata files for Young Healthy Controls (YHC), Elderly Healthy Controls (EHC), and Parkinson's Disease (PD) patients.
* **UAMS Dataset:** Base path, audio folders for Healthy Controls (HC) and PD patients, and the metadata file.
* **mPower Dataset:** Base path and metadata file for the large-scale mobile health dataset; the `HC` and `PwPD` audio folders are derived from the base path.
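
The key-based lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the helper names `load_config_paths` and `get_config_path` and the sample rows are hypothetical, while the column names `data` and `path` come from the code cell below.

```python
import csv
import io

def load_config_paths(csv_text):
    """Parse 'data,path' rows into a dict, skipping rows with empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    paths = {}
    for row in reader:
        key = (row.get("data") or "").strip()
        value = (row.get("path") or "").strip()
        if key and value:
            # Normalize Windows separators, as the pipeline does
            paths[key] = value.replace("\\", "/")
    return paths

def get_config_path(paths, key):
    """Return the path registered under `key`, or raise a descriptive error."""
    if key not in paths:
        raise KeyError(f"'{key}' is missing from the configuration file")
    return paths[key]

# Hypothetical file contents, for illustration only
SAMPLE_CSV = "data,path\nUAMS_BASE_PATH,D:\\datasets\\UAMS\nMPOWER_BASE_PATH,\n"
paths = load_config_paths(SAMPLE_CSV)
uams_base = get_config_path(paths, "UAMS_BASE_PATH")
```

Compared with indexing `.values[0]` directly, a helper of this kind reports which key is missing instead of raising a bare `IndexError`.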

In [None]:
import pandas as pd
import re
import os
import numpy as np
import librosa
import soundfile as sf
import shutil
import random
from tqdm import tqdm
import matplotlib.pyplot as plt
from pathlib import Path
import argparse
import sys


# Set seeds for reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# Load configuration (the path is resolved relative to the current working directory)
CONFIG_FILE_PATH = "config/file_paths.csv"
CONFIG_df = pd.read_csv(CONFIG_FILE_PATH).dropna()

# Normalize paths for cross-platform compatibility
CONFIG_df["path"] = CONFIG_df["path"].apply(lambda x: x.replace("\\", "/"))

# Extract Italian Dataset Paths
Italian_BASE_PATH = CONFIG_df.loc[CONFIG_df["data"] == "Italian_BASE_PATH", "path"].values[0]
Italian_YHC_METADATA = CONFIG_df.loc[CONFIG_df["data"] == "Italian_YHC_METADATA", "path"].values[0]
Italian_EHC_METADATA = CONFIG_df.loc[CONFIG_df["data"] == "Italian_EHC_METADATA", "path"].values[0]
Italian_PD_METADATA = CONFIG_df.loc[CONFIG_df["data"] == "Italian_PD_METADATA", "path"].values[0]

# Extract UAMS Dataset Paths
UAMS_BASE_PATH = CONFIG_df.loc[CONFIG_df["data"] == "UAMS_BASE_PATH", "path"].values[0]
UAMS_AUDIO_HC_FOLDER = CONFIG_df.loc[CONFIG_df["data"] == "UAMS_AUDIO_HC_FOLDER", "path"].values[0]
UAMS_AUDIO_PD_FOLDER = CONFIG_df.loc[CONFIG_df["data"] == "UAMS_AUDIO_PD_FOLDER", "path"].values[0]
UAMS_METADATA_FILE = CONFIG_df.loc[CONFIG_df["data"] == "UAMS_METADATA_FILE", "path"].values[0]

# Extract mPower Dataset Paths
MPOWER_BASE_PATH = CONFIG_df.loc[CONFIG_df["data"] == "MPOWER_BASE_PATH", "path"].values[0]
MPOWER_AUDIO_HC_FOLDER = os.path.join(MPOWER_BASE_PATH, "HC")
MPOWER_AUDIO_PD_FOLDER = os.path.join(MPOWER_BASE_PATH, "PwPD")
MPOWER_METADATA_FILE = CONFIG_df.loc[CONFIG_df["data"] == "MPOWER_METADATA_FILE", "path"].values[0]

### Argument Parsing and Dynamic Path Setup

This segment handles user inputs and prepares the file system structure required for the processing pipeline.

#### 1. Command Line Interface (CLI)
The script uses `argparse` to allow users to configure the run parameters via command-line arguments, making the script flexible for automation and different experiments.
* **Arguments:**
    * `--dataset`: Selects the target dataset (`ITALIAN_DATASET`, `UAMS_DATASET`, or `MPOWER_DATASET`).
    * `--mode`: Chooses the analysis scope (`A` for vowel /a/ only, or `ALL_VALIDS` for all valid tasks).
    * `--feature_mode`: Determines the feature set to extract (`DEFAULT`, `ALL`, or `ACOUSTIC`).
* **Notebook Compatibility:** A check for `ipykernel` allows the script to run seamlessly inside Jupyter Notebooks (where command-line args aren't passed) by falling back to default values.
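
The notebook fallback relies on the fact that `parse_args` accepts an explicit argument list. A small sketch of that behavior follows; the wrapper names `build_parser` and `parse_run_config` are illustrative and do not appear in the pipeline.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Process audio data")
    parser.add_argument("--dataset", default="ITALIAN_DATASET",
                        choices=["ITALIAN_DATASET", "UAMS_DATASET", "MPOWER_DATASET"])
    parser.add_argument("--mode", default="A", choices=["A", "ALL_VALIDS"])
    parser.add_argument("--feature_mode", default="DEFAULT",
                        choices=["DEFAULT", "ALL", "ACOUSTIC"])
    return parser

def parse_run_config(argv=None):
    """Parse `argv`; an empty list (as used inside a notebook) yields the defaults."""
    return build_parser().parse_args(argv)

# Inside Jupyter: ignore the kernel's own command line and use defaults
notebook_args = parse_run_config([])
# From a shell: equivalent to passing --dataset UAMS_DATASET --mode ALL_VALIDS
cli_args = parse_run_config(["--dataset", "UAMS_DATASET", "--mode", "ALL_VALIDS"])
```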

#### 2. Dynamic Environment Configuration
Based on the selected `DATASET`, the script dynamically sets up the working environment:
* **Root Paths:** Assigns the correct base directories for the chosen dataset.
* **Output Paths:** Generates dataset-specific paths for saving processed data and results.
* **File Filtering:** For the Italian dataset specifically, it defines `VALID_FILE_PREFIXES` to filter audio files based on the selected `MODE` (e.g., filtering only specific vowels).
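
The prefix filter works because `str.startswith` accepts a tuple of candidates. The sketch below uses the prefixes from the code cell; the helper names and the filenames are made up for illustration.

```python
ALL_VALID_PREFIXES = ("B1", "B2", "D1", "D2", "FB1", "VA1", "VA2",
                      "VE1", "VE2", "VI1", "VI2", "VO1", "VO2",
                      "VU1", "VU2", "PR1")
VOWEL_A_PREFIXES = ("VA1", "VA2")

def select_prefixes(mode):
    """Return the prefix tuple for the given mode ('A' or 'ALL_VALIDS')."""
    return VOWEL_A_PREFIXES if mode == "A" else ALL_VALID_PREFIXES

def is_valid_file(filename, prefixes):
    # startswith returns True if the name begins with any prefix in the tuple
    return filename.upper().startswith(prefixes)

# Hypothetical filenames
names = ["VA1example.wav", "ve2example.wav", "XX1example.wav"]
kept_a = [n for n in names if is_valid_file(n, select_prefixes("A"))]
kept_all = [n for n in names if is_valid_file(n, select_prefixes("ALL_VALIDS"))]
```

Note that an empty string is a prefix of every string, which is why leaving `VALID_FILE_PREFIXES = ""` for the other datasets disables filtering.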

#### 3. Automatic Directory Initialization
To ensure a clean and organized workspace, the code automatically creates the necessary directory hierarchy if it does not already exist:
* `original_...`: Stores pre-processed raw audio.
* `augmented_...`: Stores augmented audio samples.
* `balanced_...`: Stores the final dataset after class balancing.
* **Path Standardization:** Finally, it converts all file paths into `pathlib.Path` objects to ensure robust, cross-platform compatibility throughout the pipeline.
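
As an alternative sketch, the same directory layout can be created directly with `pathlib`. The function name `init_output_dirs` is hypothetical, and a temporary directory stands in for the working directory.

```python
import tempfile
from pathlib import Path

def init_output_dirs(base, mode, feature_mode):
    """Create the original/augmented/balanced folders under `base`."""
    base = Path(base)
    tag = f"{mode}_{feature_mode}"
    dirs = {
        "original": base / f"original_{tag}",
        "augmented": base / f"augmented_{tag}",
        "balanced": base / f"balanced_{tag}",
    }
    for directory in dirs.values():
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe
        directory.mkdir(parents=True, exist_ok=True)
    return dirs

with tempfile.TemporaryDirectory() as tmp:
    data_base = Path(tmp) / "Italian" / "data"
    dirs = init_output_dirs(data_base, "A", "DEFAULT")
    created = sorted(p.name for p in data_base.iterdir())
```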

In [None]:
# --- Constants ---
ITALIAN_DATASET = "ITALIAN_DATASET"
UAMS_DATASET = "UAMS_DATASET"
MPOWER_DATASET = "MPOWER_DATASET"

MODE_A = "A"
MODE_ALL_VALIDS = "ALL_VALIDS"

FEATURE_MODE_DEFAULT = "DEFAULT"
FEATURE_MODE_ALL = "ALL"
FEATURE_MODE_ACOUSTIC = "ACOUSTIC"

# --- Argument Parsing (Replaces manual selection) ---
parser = argparse.ArgumentParser(description="Process audio data for MPD-Net")
parser.add_argument("--dataset", type=str, default=ITALIAN_DATASET,
                    choices=[ITALIAN_DATASET, UAMS_DATASET, MPOWER_DATASET],
                    help="Dataset to process")
parser.add_argument("--mode", type=str, default=MODE_A,
                    choices=[MODE_A, MODE_ALL_VALIDS],
                    help="Processing mode (A vowel or All valid types)")
parser.add_argument("--feature_mode", type=str, default=FEATURE_MODE_DEFAULT,
                    choices=[FEATURE_MODE_DEFAULT, FEATURE_MODE_ALL, FEATURE_MODE_ACOUSTIC],
                    help="Feature extraction mode")

# Allow running in interactive mode (like Jupyter) without crashing
if 'ipykernel' in sys.modules:
    args = parser.parse_args([])
else:
    args = parser.parse_args()

DATASET = args.dataset
MODE = args.mode
FEATURE_MODE = args.feature_mode

print(f"Processing Configuration: Dataset={DATASET}, Mode={MODE}, Features={FEATURE_MODE}")

# --- Dynamic Path Setup ---
PROCESSED_DATA_BASE = ""
VALID_FILE_PREFIXES = ""
DATASET_ROOT_PATH = ""
RESULTS_OUTPUT_PATH = ""

if DATASET == ITALIAN_DATASET:
    DATASET_ROOT_PATH = Italian_BASE_PATH
    PROCESSED_DATA_BASE = os.path.join(os.getcwd(), "Italian", "data")
    RESULTS_OUTPUT_PATH = os.path.join(os.getcwd(), "Italian", f"results_{MODE}_{FEATURE_MODE}")
    VALID_FILE_PREFIXES = ("B1", "B2", "D1", "D2", "FB1", "VA1", "VA2",
                       "VE1", "VE2", "VI1", "VI2", "VO1", "VO2",
                       "VU1", "VU2", "PR1")
    if MODE == MODE_A:
        VALID_FILE_PREFIXES = ("VA1", "VA2",)

elif DATASET == UAMS_DATASET:
    DATASET_ROOT_PATH = UAMS_BASE_PATH
    PROCESSED_DATA_BASE = os.path.join(os.getcwd(), "UAMS", "data")
    RESULTS_OUTPUT_PATH = os.path.join(os.getcwd(), "UAMS", f"results_{MODE}_{FEATURE_MODE}")

elif DATASET == MPOWER_DATASET:
    DATASET_ROOT_PATH = MPOWER_BASE_PATH
    PROCESSED_DATA_BASE = os.path.join(os.getcwd(), "mPower", "data")
    RESULTS_OUTPUT_PATH = os.path.join(os.getcwd(), "mPower", f"results_{MODE}_{FEATURE_MODE}")

# --- Directory Structure Initialization ---
ORIGINAL_DATA_PATH = os.path.join(PROCESSED_DATA_BASE, f"original_{MODE}_{FEATURE_MODE}")
AUGMENTED_DATA_PATH = os.path.join(PROCESSED_DATA_BASE, f"augmented_{MODE}_{FEATURE_MODE}")
BALANCED_DATA_PATH = os.path.join(PROCESSED_DATA_BASE, f"balanced_{MODE}_{FEATURE_MODE}")
FEATURES_OUTPUT_PATH = os.path.join(PROCESSED_DATA_BASE, f"features_{MODE}_{FEATURE_MODE}.npz")

for path in [PROCESSED_DATA_BASE, ORIGINAL_DATA_PATH, AUGMENTED_DATA_PATH,
             BALANCED_DATA_PATH, RESULTS_OUTPUT_PATH]:
    # exist_ok=True makes this a no-op when the directory is already present
    os.makedirs(path, exist_ok=True)

# Convert to Path objects for consistency
DATASET_ROOT_PATH = Path(DATASET_ROOT_PATH)
PROCESSED_DATA_BASE = Path(PROCESSED_DATA_BASE)
ORIGINAL_DATA_PATH = Path(ORIGINAL_DATA_PATH)
AUGMENTED_DATA_PATH = Path(AUGMENTED_DATA_PATH)
BALANCED_DATA_PATH = Path(BALANCED_DATA_PATH)
FEATURES_OUTPUT_PATH = Path(FEATURES_OUTPUT_PATH)
RESULTS_OUTPUT_PATH = Path(RESULTS_OUTPUT_PATH)

MANIFEST_PATH = PROCESSED_DATA_BASE / f"manifest_{MODE}_{FEATURE_MODE}.csv"

### Global Configuration & Hyperparameters

This section defines the constant parameters that govern the signal processing, feature extraction, and data augmentation pipelines. Centralizing these values ensures consistency across all experiments.

#### 1. Class Definitions
Defines the binary classification labels for the dataset structure:
* `healthy_control`: Directory name for healthy subjects.
* `parkinson_patient`: Directory name for subjects with Parkinson's.

#### 2. Audio Processing Standards
Establishes a uniform format for all input audio to ensure model compatibility:
* **Sample Rate:** Standardized to **16 kHz**.
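* **Duration:** Fixed at **3 seconds**. During augmentation, longer audio is truncated and shorter audio is zero-padded to this length. For mPower, recordings shorter than 1.5 seconds after silence trimming are dropped earlier, during dataset preparation.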
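
The pad-or-truncate step is performed in the pipeline by `librosa.util.fix_length`. The NumPy-only function below is a stand-in that illustrates the same idea; `fix_duration` is a hypothetical name.

```python
import numpy as np

SAMPLE_RATE = 16000
DURATION_S = 3
AUDIO_SAMPLES = SAMPLE_RATE * DURATION_S  # 48000 samples

def fix_duration(y, target=AUDIO_SAMPLES):
    """Truncate or zero-pad a 1-D signal so it has exactly `target` samples."""
    y = np.asarray(y, dtype=np.float32)
    if len(y) >= target:
        return y[:target]
    # Pad only at the end so the onset of the recording is unchanged
    return np.pad(y, (0, target - len(y)))

short_clip = fix_duration(np.ones(1000))   # padded with trailing zeros
long_clip = fix_duration(np.ones(60000))   # truncated
```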

#### 3. Feature Extraction (DSP)
Configures the parameters for the Short-Time Fourier Transform (STFT) and spectral features:
* **Frame Size:** 30ms windows are used for analysis.
* **FFT Window (`N_FFT`):** Set to **2048** samples.
* **Hop Length:** **512** samples (defining the overlap between frames).
* **Dimensions:** Extracts **30 Mel bands** for spectrograms and **30 MFCCs** for cepstral analysis.
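
The relationships between these parameters can be checked with a few lines of arithmetic. The frame count assumes librosa's default centered framing (`center=True`), which is an assumption here because the feature extraction code is not shown in this section.

```python
SAMPLE_RATE = 16000
DURATION_S = 3
FRAME_DURATION_MS = 30
N_FFT = 2048
HOP_LENGTH = 512

audio_samples = SAMPLE_RATE * DURATION_S                 # 48000 samples per clip
frame_samples = SAMPLE_RATE * FRAME_DURATION_MS // 1000  # 480 samples in a 30 ms frame
n_fft_ms = 1000 * N_FFT / SAMPLE_RATE                    # 128.0 ms spanned by one FFT window
overlap_fraction = 1 - HOP_LENGTH / N_FFT                # 0.75 overlap between FFT windows
n_frames_centered = 1 + audio_samples // HOP_LENGTH      # 94 time steps with centered framing
```

The 30 ms frame (480 samples) and the 2048-sample FFT window are therefore different quantities: the former is the segmentation unit, the latter the spectral analysis window.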

#### 4. Augmentation Settings
Parameters for synthetic data generation to improve model robustness:
* **Pitch Shift:** Set to **2 semitones**, applied as a single upward shift. *(Note: This corrects a previous discrepancy where the code calculated 6 semitones, ensuring alignment with the thesis methodology of $\pm 2$ semitones).*
* **Gain (Volume):** Randomly scales amplitude between **0.9x and 1.1x**.
* **White Noise:** Adds noise with a factor of **0.1** to simulate environmental conditions.

In [None]:
# =============================================================================
# --- Configuration Parameters ---
# =============================================================================
# --- Class Definitions ---
HEALTHY_CLASS = "healthy_control"
PARKINSON_CLASS = "parkinson_patient"
CLASSES = [HEALTHY_CLASS, PARKINSON_CLASS]

# --- Audio Settings ---
SAMPLE_RATE = 16000
DURATION_S = 3
AUDIO_SAMPLES = SAMPLE_RATE * DURATION_S

# --- Feature Extraction Parameters ---
FRAME_DURATION_MS = 30
N_FFT = 2048
HOP_LENGTH = 512
N_MELS = 30
N_MFCC = 30

# --- Augmentation Settings ---
PITCH_SHIFT_SEMITONES = 2
RANDOM_GAIN_RANGE = (0.9, 1.1)
WHITE_NOISE_FACTOR = 0.1

### Dataset Preparation and Standardization

This module standardizes three distinct datasets—**Italian**, **UAMS**, and **mPower**—into a unified format suitable for machine learning. It handles file organization, metadata linking (age/sex), audio format conversion, and manifest creation.

#### 1. Metadata Handling (`load_italian_metadata`)
Specifically for the Italian dataset, this function merges metadata from three separate Excel files (Young Healthy, Elderly Healthy, and Parkinson's Patients).
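* **Composite Key Generation:** It constructs a key for each subject from the upper-cased full name (name and surname) and the top-level dataset folder; for PD patients, the group folder (e.g., `1-5`) is prepended. This disambiguates subjects across groups when matching audio files to clinical data. Files whose key has no match are still processed, with age and sex recorded as `NaN`.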
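
The key format can be sketched as a plain function. The name `make_composite_key` and the subject name are hypothetical; the folder names and the ordering of the key parts follow the code cell below.

```python
def make_composite_key(name, surname, top_folder, group_folder=None):
    """Build the lookup key used to join audio folders with metadata rows."""
    full_name = f"{name} {surname}".strip().upper()
    if group_folder is not None:
        # PD patients are nested one level deeper, inside a group folder
        return f"{group_folder.strip()}_{full_name}_{top_folder}"
    return f"{full_name}_{top_folder}"

# Hypothetical subject, for illustration only
hc_key = make_composite_key("Mario", "Rossi", "22 Elderly Healthy Control")
pd_key = make_composite_key("Mario", "Rossi",
                            "28 People with Parkinson's disease",
                            group_folder="1-5")
```

Two subjects with the same name in different groups thus receive different keys.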

#### 2. Italian Dataset Preparation (`prepare_real_dataset_Italian`)
This function processes the Italian dataset by iterating through source directories for Healthy Controls (HC) and Parkinson's Patients (PD).
* **Filtering:** It selects specific audio tasks (like vowel /a/) based on `valid_prefixes`.
* **Linking:** Uses the composite key to fetch age and sex from the merged metadata.
* **Manifest:** Saves a clean CSV manifest mapping every copied WAV file to its ground truth labels.

#### 3. UAMS Dataset Preparation (`prepare_real_dataset_UAMS`)
Handles the UAMS dataset, which uses a different folder structure.
* **ID Matching:** Instead of names, it matches audio filenames to Subject IDs in the demographic CSV.
* **Organization:** Copies valid WAV files into the standardized `healthy_control` and `parkinson_patient` output folders.
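
Prefix-based ID matching can be ambiguous when one ID is a prefix of another. The sketch below shows one possible safeguard, choosing the longest matching ID; this is a suggested variant rather than the pipeline's behavior, and the IDs and filenames are hypothetical.

```python
def match_sample_id(file_stem, sample_ids):
    """Return the longest ID that prefixes `file_stem`, or None if none match."""
    candidates = [sid for sid in sample_ids if file_stem.startswith(sid)]
    return max(candidates, key=len) if candidates else None

# Hypothetical IDs: "HC1" is a prefix of "HC10"
ids = ["HC1", "HC10", "PD3"]
match_long = match_sample_id("HC10_a1", ids)
match_short = match_sample_id("HC1_a1", ids)
match_none = match_sample_id("XX_a1", ids)
```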

#### 4. mPower Dataset Processing (`prepare_real_dataset_mPower`)
This function handles the large-scale, crowd-sourced mPower dataset, which presents unique challenges like varying file formats and quality.
* **Format Conversion:** Loads raw `.m4a` files (common on iPhones) with `librosa`, resampled to 16 kHz mono, and writes them as standard `.wav` files with `soundfile`.
* **Silence Removal:** Trims silence from the beginning and end of recordings to isolate the voice.
* **Quality Control:** Drops any recordings shorter than 1.5 seconds to ensure sufficient data for analysis.
* **Reporting:** Prints statistics on how many files were processed versus dropped due to quality issues.
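
The trim-then-filter logic can be illustrated without audio files. The function below is a deliberately simplified, sample-wise stand-in for `librosa.effects.trim` (which works on frame energies), so its results will not match librosa exactly; the function names are hypothetical.

```python
import numpy as np

def trim_by_amplitude(y, top_db=20.0):
    """Drop leading/trailing samples more than `top_db` dB below the peak."""
    y = np.asarray(y, dtype=np.float64)
    if y.size == 0:
        return y
    peak = np.max(np.abs(y))
    if peak == 0:
        return y[:0]  # all-silent input trims to nothing
    threshold = peak * 10 ** (-top_db / 20.0)
    loud = np.flatnonzero(np.abs(y) >= threshold)
    return y[loud[0]: loud[-1] + 1]

def passes_duration_check(y, sr, min_duration=1.5):
    """True if the signal lasts at least `min_duration` seconds."""
    return len(y) / sr >= min_duration

sr = 16000
# Synthetic clip: 1 s silence, 2 s of signal, 1 s silence
padded = np.concatenate([np.zeros(sr), np.ones(2 * sr), np.zeros(sr)])
trimmed = trim_by_amplitude(padded)
keep_long = passes_duration_check(trimmed, sr)
keep_short = passes_duration_check(np.ones(sr), sr)
```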

#### 5. Unified Entry Point (`prepare_real_dataset`)
This is the main driver function that accepts a `DATASET` flag and routes the execution to the correct preparation function (Italian, UAMS, or mPower). It ensures that regardless of the input source, the output structure (audio files + manifest CSV) is consistent for the subsequent augmentation and training steps.
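
One way to express this routing is a dispatch table, sketched below with placeholder handlers. The handler functions and `prepare_dataset` are hypothetical stand-ins for the real preparation functions.

```python
def prepare_italian(root):
    return f"Italian dataset prepared from {root}"

def prepare_uams(root):
    return f"UAMS dataset prepared from {root}"

def prepare_mpower(root):
    return f"mPower dataset prepared from {root}"

PREPARERS = {
    "ITALIAN_DATASET": prepare_italian,
    "UAMS_DATASET": prepare_uams,
    "MPOWER_DATASET": prepare_mpower,
}

def prepare_dataset(dataset, root):
    """Route to the dataset-specific handler, rejecting unknown names."""
    handler = PREPARERS.get(dataset)
    if handler is None:
        raise ValueError(f"Unknown dataset: {dataset!r}")
    return handler(root)

result = prepare_dataset("UAMS_DATASET", "/data/uams")
```

A table like this makes an unrecognized dataset name fail loudly rather than fall through silently.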

In [None]:
def load_italian_metadata(yhc_path, ehc_path, pd_path):
    """
    Loads metadata, using the group folder for PD patients to create a unique
    composite key for accurate matching.
    """
    print("--- Loading and preparing metadata files with composite keys ---")
    try:
        # Load Young Healthy Controls
        df_yhc = pd.read_excel(yhc_path)
        df_yhc['full_name'] = (df_yhc['name'].astype(str) + ' ' + df_yhc['surname'].astype(str)).str.strip().str.upper()
        # HC key: "<FULL NAME>_<top-level folder name>"
        df_yhc['composite_key'] = df_yhc['full_name'] + "_15 Young Healthy Control"

        # Load Elderly Healthy Controls
        df_ehc = pd.read_excel(ehc_path)
        df_ehc['full_name'] = (df_ehc['name'].astype(str) + ' ' + df_ehc['surname'].astype(str)).str.strip().str.upper()
        # HC key: "<FULL NAME>_<top-level folder name>"
        df_ehc['composite_key'] = df_ehc['full_name'] + "_22 Elderly Healthy Control"

        # Load Parkinson's Disease patients
        df_pd = pd.read_excel(pd_path)
        # The first column (values such as '1-5') names the group folder
        group_col_name = df_pd.columns[0]
        df_pd['group_folder'] = df_pd[group_col_name].astype(str).str.strip()
        df_pd['full_name'] = (df_pd['name'].astype(str) + ' ' + df_pd['surname'].astype(str)).str.strip().str.upper()
        # PD key: "<group folder>_<FULL NAME>_<top-level folder name>"
        df_pd['composite_key'] = (df_pd['group_folder'] + '_' + df_pd['full_name']
                                  + "_28 People with Parkinson's disease")

        # Combine all data
        all_metadata = pd.concat([
            df_yhc[['full_name', 'sex', 'age', 'composite_key']],
            df_ehc[['full_name', 'sex', 'age', 'composite_key']],
            df_pd[['full_name', 'sex', 'age', 'composite_key']]
        ], ignore_index=True)

        all_metadata['sex'] = all_metadata['sex'].str.strip().str.upper()

        print(f"Successfully loaded and merged metadata for {len(all_metadata)} individuals.")
        return all_metadata

    except Exception as e:
        print(f"Error loading metadata: {e}")
        return pd.DataFrame()


In [None]:
# --- Dataset-specific preparation functions ---
def prepare_real_dataset_Italian(DATASET, root_path, output_path, valid_prefixes="", metadata_df=None, manifest_path=""):
    print("--- Step 1: Preparing real dataset and creating manifest ---")
    if any(output_path.iterdir()):
        print(f"Original {output_path} directory is already populated. Skipping preparation.")
        return

    if metadata_df is None or metadata_df.empty:
        raise ValueError("Metadata is required for the Italian dataset but was not provided.")

    manifest_data = []

    sources = {
        HEALTHY_CLASS: [root_path / "15 Young Healthy Control", root_path / "22 Elderly Healthy Control"],
        PARKINSON_CLASS: [root_path / "28 People with Parkinson's disease"]
    }

    for class_name, source_paths in sources.items():
        target_path = output_path / class_name
        target_path.mkdir(parents=True, exist_ok=True)
        print(f"Copying files and gathering metadata for class: {class_name}")

        for source_path in source_paths:
            if not source_path.exists():
                print(f"Warning: Source directory not found, skipping: {source_path}")
                continue

            for wav_file in source_path.rglob('*.wav'):
                if wav_file.name.upper().startswith(valid_prefixes):
                    person_name = wav_file.parent.name.strip().upper()
                    top_level_folder_name = source_path.name

                    if "28 People with Parkinson's disease" in top_level_folder_name:
                        group_folder = wav_file.parent.parent.name.strip()
                        lookup_key = f"{group_folder}_{person_name}_{top_level_folder_name}"
                    else:
                        lookup_key = f"{person_name}_{top_level_folder_name}"

                    person_data = metadata_df[metadata_df['composite_key'] == lookup_key]

                    if not person_data.empty:
                        age = person_data['age'].iloc[0]
                        sex_val = person_data['sex'].iloc[0]
                    else:
                        print(f"Warning: Could not find metadata for key '{lookup_key}'. Using NaN for age/sex.")
                        age = np.nan
                        sex_val = np.nan

                    manifest_data.append({
                        'original_filename': wav_file.stem, 'age': age, 'sex': sex_val
                    })
                    shutil.copy2(wav_file, target_path / wav_file.name)


    manifest_df = pd.DataFrame(manifest_data)
    manifest_df.to_csv(manifest_path, index=False)
    print(f"\nManifest file with {len(manifest_df)} entries saved to {manifest_path}")

def prepare_real_dataset_UAMS(DATASET, root_path, output_path, valid_prefixes="", metadata_df=None, manifest_path=""):
    print("--- Step 1: Preparing UAMS dataset and creating manifest ---")
    if any(output_path.iterdir()):
        print(f"Original {output_path} directory already populated. Skipping preparation.")
        return

    print(f"Loading UAMS metadata from: {metadata_df}")
    metadata = pd.read_csv(metadata_df)
    metadata.columns = metadata.columns.str.strip()
    metadata.set_index('Sample ID', inplace=True)

    manifest_data = []
    sources = {
        HEALTHY_CLASS: Path(UAMS_AUDIO_HC_FOLDER),
        PARKINSON_CLASS: Path(UAMS_AUDIO_PD_FOLDER)
    }

    for class_name, source_path in sources.items():
        target_path = output_path / class_name
        target_path.mkdir(parents=True, exist_ok=True)
        print(f"Copying files and gathering metadata for class: {class_name}")

        if not source_path.exists():
            print(f"Warning: Source directory not found, skipping: {source_path}")
            continue

        for wav_file in source_path.glob('*.wav'):
            sample_id_found = None
            for potential_id in metadata.index:
                if wav_file.stem.startswith(potential_id):
                    sample_id_found = potential_id
                    break

            if sample_id_found:
                person_data = metadata.loc[sample_id_found]
                age = person_data['Age']
                sex_val = person_data['Sex']
            else:
                print(f"Warning: Could not find metadata for file '{wav_file.name}'. Using NaN.")
                age, sex_val = np.nan, np.nan

            manifest_data.append({'original_filename': wav_file.stem, 'age': age, 'sex': sex_val})
            shutil.copy2(wav_file, target_path / wav_file.name)

    manifest_df = pd.DataFrame(manifest_data)
    manifest_df.to_csv(manifest_path, index=False)
    print(f"\nManifest file with {len(manifest_df)} entries saved to {manifest_path}")

def prepare_real_dataset_mPower(DATASET, root_path, output_path, valid_prefixes="", metadata_df=None, manifest_path=""):
    print("--- Step 1: Preparing mPower dataset, converting to WAV, and creating manifest ---")
    if any(output_path.iterdir()):
        print(f"Original {output_path} directory already populated. Skipping preparation.")
        return

    metadata = pd.read_csv(metadata_df)
    metadata.columns = metadata.columns.str.strip()
    filename_col = 'audio_audio.m4a'

    metadata[filename_col] = metadata[filename_col].astype(str)

    manifest_data = []
    sources = {
        HEALTHY_CLASS: Path(MPOWER_AUDIO_HC_FOLDER),
        PARKINSON_CLASS: Path(MPOWER_AUDIO_PD_FOLDER)
    }

    total_files = 0
    processed_files = 0
    dropped_files = 0
    min_duration = 1.5
    for class_name, source_path in sources.items():
        target_path = output_path / class_name
        target_path.mkdir(parents=True, exist_ok=True)
        print(f"Processing and converting files for class: {class_name}")

        if not source_path.exists():
            print(f"Warning: Source directory not found, skipping: {source_path}")
            continue

        for m4a_file in tqdm(source_path.glob('*.m4a'), desc=class_name):
            total_files += 1

            person_data = metadata[metadata[filename_col] == m4a_file.stem]

            if not person_data.empty:
                age = person_data['age'].iloc[0]
                sex_val = person_data['gender'].iloc[0]
                if sex_val == "Female":
                    sex_val = "F"
                elif sex_val == "Male":
                    sex_val = "M"
                else:
                    sex_val = np.nan
            else:
                print(f"Warning: Could not find metadata for file '{m4a_file.stem}'. Using NaN.")
                age, sex_val = np.nan, np.nan

            try:
                # Load audio file
                y, sr = librosa.load(m4a_file, sr=SAMPLE_RATE, mono=True)

                # Remove silence from beginning and end
                y_trimmed, _ = librosa.effects.trim(y, top_db=20, frame_length=2048, hop_length=512)

                # Calculate duration after trimming
                duration = len(y_trimmed) / sr

                # Check if duration meets minimum requirement
                if duration < min_duration:
                    print(f"Dropping {m4a_file.name}: duration {duration:.2f}s < {min_duration}s")
                    dropped_files += 1
                    continue

                # Only add to manifest and save if duration is acceptable
                manifest_data.append({
                    'original_filename': m4a_file.stem,
                    'age': age,
                    'sex': sex_val,
                })

                # Save the trimmed audio as WAV
                wav_filename = m4a_file.stem + '.wav'
                sf.write(target_path / wav_filename, y_trimmed, SAMPLE_RATE)
                processed_files += 1

            except Exception as e:
                print(f"Error converting file {m4a_file.name}: {e}")
                dropped_files += 1

    # Print statistics
    print("\n=== Processing Statistics ===")
    print(f"Total files found: {total_files}")
    print(f"Files saved: {processed_files}")
    print(f"Files dropped: {dropped_files}")
    drop_rate = (dropped_files / total_files) * 100 if total_files > 0 else 0.0
    print(f"Drop rate: {drop_rate:.2f}%")

    manifest_df = pd.DataFrame(manifest_data)
    manifest_df.to_csv(manifest_path, index=False)
    print(f"\nManifest file with {len(manifest_df)} entries saved to {manifest_path}")

# =============================================================================
# --- Step 1: Prepare Original Dataset from Source ---
# =============================================================================
def prepare_real_dataset(DATASET, root_path, output_path, valid_prefixes="", metadata_df=None, manifest_path=""):
    """
    Dispatches to the dataset-specific preparation function, which copies (or,
    for mPower, converts) audio to WAV and writes a manifest CSV linking each
    file to its metadata (age, sex).
    """
    if DATASET == ITALIAN_DATASET:
        prepare_real_dataset_Italian(DATASET, root_path, output_path, valid_prefixes, metadata_df, manifest_path)

    elif DATASET == UAMS_DATASET:
        prepare_real_dataset_UAMS(DATASET, root_path, output_path, valid_prefixes, metadata_df, manifest_path)

    elif DATASET == MPOWER_DATASET:
        prepare_real_dataset_mPower(DATASET, root_path, output_path, valid_prefixes, metadata_df, manifest_path)

    else:
        raise ValueError(f"Unknown dataset: {DATASET}")

    print("Dataset preparation complete.\n")

### Data Augmentation Pipeline

This function expands the dataset by generating synthetic variations of the original audio files. Augmentation is critical for preventing overfitting and making the model robust to real-world variations like different microphone sensitivities or background noise.

#### 1. Frame-Based Processing Strategy
Instead of augmenting the entire signal at once, the code segments the audio into **30ms frames** (matching the feature extraction window).
* **Reasoning:** Augmenting at the same 30 ms granularity keeps the augmented data aligned with the frame-level analysis described in the methodology. Frames overlap by 50% (the hop is half the frame length), so each interior sample is covered by two frames.
* **Helper Function:** `_reconstruct_from_frames` stitches these augmented frames back together into a continuous time-domain signal for saving.
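
The framing and overlap-add arithmetic can be sketched in NumPy. This is an illustration under stated assumptions, not the notebook's helper: `frame_signal` and `overlap_add` are hypothetical names, and the `normalize` flag is added here to show that, with 50% overlap, summing frames without dividing by the overlap count doubles interior samples.

```python
import numpy as np

def frame_signal(y, frame_length, hop_length):
    """Slice `y` into overlapping frames, shape (frame_length, n_frames)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.stack([y[i * hop_length: i * hop_length + frame_length]
                     for i in range(n_frames)], axis=1)

def overlap_add(frames, total_samples, hop_length, normalize=True):
    """Sum frames back into a signal; optionally average overlapping samples."""
    frame_length, n_frames = frames.shape
    out = np.zeros(total_samples)
    count = np.zeros(total_samples)
    for n in range(n_frames):
        start = n * hop_length
        out[start: start + frame_length] += frames[:, n]
        count[start: start + frame_length] += 1
    if normalize:
        out /= np.maximum(count, 1)  # avoid dividing by zero on uncovered samples
    return out

y = np.linspace(-1.0, 1.0, 4800)
frames = frame_signal(y, frame_length=480, hop_length=240)
restored = overlap_add(frames, len(y), hop_length=240)
summed = overlap_add(frames, len(y), hop_length=240, normalize=False)
```

With normalization the unmodified frames reproduce the input; without it, samples covered by two frames come back at twice their original amplitude.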

#### 2. Applied Augmentations
For every original file, three new versions are created:

* **Pitch Shifting (`_aug_pitch`):**
    * Shifts the pitch of the voice by a fixed number of semitones (defined by `PITCH_SHIFT_SEMITONES`) without altering the duration of the clip.
    * *Purpose:* Simulates different speakers or vocal characteristics.

* **Gain Adjustment (`_aug_gain`):**
    * Multiplies the signal amplitude by a random factor chosen from `RANDOM_GAIN_RANGE` (e.g., 0.9x to 1.1x).
    * *Purpose:* Simulates variations in loudness or microphone distance.

* **White Noise Injection (`_aug_noise`):**
    * Adds random Gaussian noise to the signal, scaled by `WHITE_NOISE_FACTOR`.
    * *Purpose:* Simulates recording in non-ideal, noisy environments (crucial for the mPower dataset).
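
The gain and noise augmentations reduce to simple array arithmetic, sketched below on a synthetic sine wave (pitch shifting is omitted because it requires `librosa`). The helper names, the test tone, and the signal-to-noise calculation are illustrative additions, not part of the pipeline.

```python
import random
import numpy as np

RANDOM_GAIN_RANGE = (0.9, 1.1)
WHITE_NOISE_FACTOR = 0.1

def apply_gain(y, rng):
    """Scale the signal by a random factor drawn from RANDOM_GAIN_RANGE."""
    gain = rng.uniform(*RANDOM_GAIN_RANGE)
    return y * gain, gain

def add_white_noise(y, np_rng, factor=WHITE_NOISE_FACTOR):
    """Add zero-mean Gaussian noise with standard deviation `factor`."""
    return y + np_rng.standard_normal(y.shape) * factor

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels."""
    noise = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)  # synthetic 220 Hz tone, amplitude 0.5

louder_or_softer, gain = apply_gain(tone, random.Random(42))
noisy = add_white_noise(tone, np.random.default_rng(42))
snr = snr_db(tone, noisy)
```

For this particular tone the noise level works out to roughly 11 dB SNR; quieter recordings would end up with a lower ratio, since the noise factor is absolute rather than relative to the signal level.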

#### 3. Output
The augmented files are saved to the `augmented_...` directory with suffixes indicating the transformation type. This effectively quadruples the size of the training set (1 original + 3 augmented versions).

In [None]:
# =============================================================================
# --- Step 2: Augment Data ---
# =============================================================================
def apply_augmentation(original_path, augmented_path):
    """
    Loads, segments, and augments original audio files. Augmentation is
    applied to 30ms frames before reconstruction to match methodology.
    """
    print("--- Step 2: Applying Data Augmentation ---")
    if augmented_path.exists() and any(augmented_path.iterdir()):
        print("Augmented data directory already exists. Skipping augmentation.")
        return
    augmented_path.mkdir(exist_ok=True)

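    def _reconstruct_from_frames(frames, total_samples, frame_length, hop_length):
        # Overlap-add the frames, then divide by the per-sample overlap count so
        # that overlapping regions are averaged rather than summed (doubled).
        y_reconstructed = np.zeros(total_samples)
        overlap_count = np.zeros(total_samples)
        for n, frame_idx in enumerate(range(0, total_samples - frame_length + 1, hop_length)):
            if n < frames.shape[1]:
                y_reconstructed[frame_idx: frame_idx + frame_length] += frames[:, n]
                overlap_count[frame_idx: frame_idx + frame_length] += 1
        return y_reconstructed / np.maximum(overlap_count, 1)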

    for class_name in CLASSES:
        original_class_path = original_path / class_name
        augmented_class_path = augmented_path / class_name
        augmented_class_path.mkdir(exist_ok=True)

        print(f"Augmenting files for class: {class_name}")
        files_to_process = list(original_class_path.glob('*.wav'))
        for file_path in tqdm(files_to_process, desc=class_name):
            y, sr = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION_S)
            y = librosa.util.fix_length(y, size=AUDIO_SAMPLES)

            frame_length = int(sr * (FRAME_DURATION_MS / 1000.0))
            hop_length_seg = frame_length // 2
            frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length_seg)

            # 1. Pitch Shift Augmentation
            pitch_frames = np.array([librosa.effects.pitch_shift(y=frame, sr=sr, n_steps=PITCH_SHIFT_SEMITONES) for frame in frames.T]).T
            y_pitch = _reconstruct_from_frames(pitch_frames, AUDIO_SAMPLES, frame_length, hop_length_seg)
            sf.write(augmented_class_path / f"{file_path.stem}_aug_pitch.wav", y_pitch, sr)

            # 2. Gain Augmentation
            gain = random.uniform(*RANDOM_GAIN_RANGE)
            y_gain = _reconstruct_from_frames(frames * gain, AUDIO_SAMPLES, frame_length, hop_length_seg)
            sf.write(augmented_class_path / f"{file_path.stem}_aug_gain.wav", y_gain, sr)

            # 3. White Noise Augmentation
            noise = np.random.randn(*frames.shape) * WHITE_NOISE_FACTOR
            y_noise = _reconstruct_from_frames(frames + noise, AUDIO_SAMPLES, frame_length, hop_length_seg)
            sf.write(augmented_class_path / f"{file_path.stem}_aug_noise.wav", y_noise, sr)
    print("Data augmentation complete.\n")


### Dataset Balancing

Medical datasets are frequently imbalanced, with more healthy samples than pathological ones. This function corrects that imbalance to prevent the model from becoming biased toward the majority class.

#### 1. Data Aggregation
First, it aggregates all available data by combining:
* The **Original** raw audio files.
* The **Augmented** audio files (created in the previous step).

#### 2. Class Counting
It counts the total number of samples for each class (`healthy_control` vs. `parkinson_patient`) to identify which is the **Majority** and which is the **Minority**.

#### 3. Random Oversampling
To achieve a perfectly balanced dataset (50/50 split):
* It copies all existing files to a new `balanced_...` directory.
* It calculates the deficit (`Majority Count` - `Minority Count`).
* It randomly selects files from the **Minority Class** and creates duplicate copies (with unique filenames like `oversample_0_...`) until both classes have an equal number of samples.

**Outcome:** The neural network receives an equal number of examples from both classes during training, which helps prevent a bias toward the majority class.
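
The oversampling plan can be sketched without touching the file system. The function name `plan_oversampling` and the filenames are hypothetical; the `oversample_{i}_` naming follows the code cell below.

```python
import random

def plan_oversampling(files_by_class, seed=42):
    """Return the minority class and the duplicate filenames needed to balance."""
    counts = {name: len(files) for name, files in files_by_class.items()}
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    deficit = counts[majority] - counts[minority]
    rng = random.Random(seed)
    pool = files_by_class[minority]
    # Sampling with replacement: the same file may be duplicated more than once
    picks = rng.choices(pool, k=deficit) if deficit and pool else []
    return minority, [f"oversample_{i}_{name}" for i, name in enumerate(picks)]

# Hypothetical file lists
files = {
    "healthy_control": ["h1.wav", "h2.wav", "h3.wav", "h4.wav", "h5.wav"],
    "parkinson_patient": ["p1.wav", "p2.wav", "p3.wav"],
}
minority, duplicates = plan_oversampling(files)
```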

In [None]:
# =============================================================================
# --- Step 3: Balance Dataset ---
# =============================================================================
def balance_data(original_path, augmented_path, balanced_path):
    """
    Balances the dataset by combining original and augmented data and then
    oversampling the minority class to match the majority class count.
    """
    print("--- Step 3: Balancing Data with Random Oversampling ---")
    if balanced_path.exists() and any(balanced_path.iterdir()):
        print("Balanced data directory already exists. Skipping balancing.")
        return
    balanced_path.mkdir(exist_ok=True)

    all_files = {}
    for class_name in CLASSES:
        files = list((original_path / class_name).glob('*.wav'))
        files.extend(list((augmented_path / class_name).glob('*.wav')))
        all_files[class_name] = files

    counts = {name: len(files) for name, files in all_files.items()}
    minority_class = min(counts, key=counts.get)
    majority_class = max(counts, key=counts.get)

    print(f"Minority Class: {minority_class} ({counts[minority_class]} samples)")
    print(f"Majority Class: {majority_class} ({counts[majority_class]} samples)")

    for class_name, files in all_files.items():
        class_dir = balanced_path / class_name
        class_dir.mkdir(exist_ok=True)
        for f in files:
            shutil.copy(f, class_dir / f.name)

    samples_to_add = counts[majority_class] - counts[minority_class]
    if samples_to_add > 0:
        print(f"Oversampling {minority_class} by adding {samples_to_add} samples...")
        minority_files = all_files[minority_class]
        files_to_duplicate = random.choices(minority_files, k=samples_to_add)

        for i, file_path in enumerate(tqdm(files_to_duplicate)):
            shutil.copy(file_path, balanced_path / minority_class / f"oversample_{i}_{file_path.name}")
    print("Data balancing complete.\n")

### Feature Extraction & Metadata Compilation

This core function converts raw audio waveforms into structured numerical features suitable for training deep learning models. It also integrates critical patient metadata (age, sex) alongside the audio features.

#### 1. Manifest Linking
The function loads a pre-generated `manifest.csv` file to link each audio file back to its source subject's metadata.
* **Filename Matching:** Augmentation and oversampling alter filenames (e.g., by adding `_aug_pitch`), so the helper `get_original_filename` performs a case-insensitive substring search and returns the first manifest entry whose name appears inside the current filename.
* **Demographics:** Extracts the raw `Age` value and encodes `Sex` as 0 for values starting with `F` and 1 for any other string. Both fields fall back to -1 when no manifest entry is found.
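
The matching and encoding rules can be tried in isolation. The sketch below follows the logic of `get_original_filename` and the sex encoding in the code cell. The names `find_manifest_key` and `encode_sex` and the manifest keys are hypothetical stand-ins.

```python
def find_manifest_key(file_name, manifest_keys):
    """Return the first manifest key contained in file_name (case-insensitive), else None."""
    lowered = file_name.lower()
    for key in manifest_keys:
        if str(key).lower() in lowered:
            return key
    return None

def encode_sex(value):
    """Map strings starting with 'F' to 0, other strings to 1, anything else to -1."""
    if not isinstance(value, str):
        return -1
    return 0 if value.upper().startswith("F") else 1

keys = ["VA1ABC", "VA2XYZ"]  # hypothetical manifest keys
print(find_manifest_key("oversample_3_va1abc_aug_pitch.wav", keys))
print(find_manifest_key("unknown.wav", keys))
print(encode_sex("Female"), encode_sex("M"), encode_sex(None))
```

Because the first substring hit wins, a key that is a prefix of another key (for example `VA1` and `VA10`) can match the wrong record, so manifest names should be distinctive.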

#### 2. Signal Processing Loop
It iterates through every WAV file in the balanced dataset:
* **Loading:** Reads audio at the standard 16 kHz rate.
* **Corrupt File Handling:** Automatically skips empty or unreadable files to prevent crashes.

#### 3. Spectral Feature Computation (`librosa`)
For each valid file, it extracts two features and then standardizes their length:
* **Mel Spectrogram:** Captures the energy distribution across Mel frequency bands over time.
* **MFCCs:** Computes Mel-Frequency Cepstral Coefficients from a pre-emphasized copy of the signal. MFCCs summarize the spectral envelope, which reflects vocal tract shape (formants).
* **Fix Length:** Pads or trims every output matrix along the time axis so that all matrices share the same dimensions (`30 bands` x `94 time steps`), which is required for batching in neural networks.
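
The frame-count formula and the pad-or-trim step can be checked with NumPy alone. This is a sketch under stated assumptions: the 3-second clip length and the hop of 512 samples are not given in this section and were chosen because, at 16 kHz, they reproduce the 94 time steps quoted above. `fix_time_axis` is a hypothetical stand-in for `librosa.util.fix_length(..., axis=1)`, which zero-pads or trims in the same way.

```python
import numpy as np

def expected_frame_count(n_samples, hop_length):
    """Frames of a centered STFT: one per full hop, plus the frame at sample 0."""
    return 1 + n_samples // hop_length

def fix_time_axis(matrix, size):
    """Zero-pad on the right or trim so that matrix.shape[1] == size."""
    n = matrix.shape[1]
    if n >= size:
        return matrix[:, :size]
    return np.pad(matrix, ((0, 0), (0, size - n)), mode="constant")

# Assumed settings: 16 kHz audio, 3-second clips, hop of 512 samples, 30 bands.
frames = expected_frame_count(16000 * 3, 512)
padded = fix_time_axis(np.ones((30, 60)), frames)    # shorter clip: zero-padded
trimmed = fix_time_axis(np.ones((30, 120)), frames)  # longer clip: cut off
print(frames, padded.shape, trimmed.shape)
```

Padding with zeros means short recordings end in silence-like columns, which the model sees as part of the input.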

#### 4. Saving Output
Finally, all extracted features (`mel_spectrogram`, `mfcc`) and labels (`labels`, `age`, `sex`) are packed into a dictionary and saved as a single compressed NumPy file (`.npz`). Training code can then load every array from that one file with `np.load` instead of recomputing features from audio.
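
A round trip through the `.npz` format might look like the sketch below. The array contents and the file name `features.npz` are placeholders. Only the key names follow the dictionary described above.

```python
import os
import tempfile
import numpy as np

# Placeholder arrays shaped like the pipeline output: (samples, bands, time steps).
features = {
    "mel_spectrogram": np.zeros((4, 30, 94), dtype=np.float32),
    "mfcc": np.zeros((4, 30, 94), dtype=np.float32),
    "labels": np.array([0, 0, 1, 1]),
    "age": np.array([63, 70, -1, 58]),
    "sex": np.array([0, 1, -1, 1]),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npz")
    np.savez_compressed(path, **features)
    # np.load returns a lazy archive, so copy the arrays out before it closes.
    with np.load(path) as archive:
        loaded = {key: archive[key] for key in archive.files}

print(sorted(loaded))
print(loaded["mel_spectrogram"].shape)
```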

In [None]:
# =============================================================================
# --- Step 4: Feature Extraction (MODIFIED VERSION) ---
# =============================================================================

def extract_features(data_path, feature_mode, manifest_path=""):
    """
    Extracts spectral features (Mel spectrogram, MFCC) and metadata (label, age, sex)
    for every WAV file under data_path, skipping empty audio files.
    Each feature matrix is padded or trimmed to expected_frames time steps.
    The FSC feature and FEATURE_MODE_BASIC have been removed.
    """
    print("--- Step 4: Extracting Features ---")
    manifest_df = pd.DataFrame()
    if manifest_path and os.path.exists(manifest_path):
        manifest_df = pd.read_csv(
            manifest_path,
            dtype={'original_filename': str}
        ).set_index('original_filename')
        print(f"Loaded manifest file from {manifest_path}")
    else:
        print("Warning: Manifest file not found. Age and sex data will be unavailable.")

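    def get_original_filename(mangled_name, manifest_index):
        # Case-insensitive substring match: the first manifest entry found inside
        # the (possibly augmented or oversampled) filename wins.
        mangled_name = mangled_name.lower()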
        for original_name in manifest_index:
            if str(original_name).lower() in mangled_name:
                return original_name
        return None

    features = {
        "mel_spectrogram": [],
        "mfcc": [],
        "labels": [],
        "sex": [],
        "age": []
    }

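    # Number of frames a centered STFT yields for a clip of AUDIO_SAMPLES samples.
    # Every spectral feature matrix is padded or trimmed to this many time steps.
    expected_frames = 1 + AUDIO_SAMPLES // HOP_LENGTH
    TARGET_TIME_STEPS = expected_frames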


    for class_idx, class_name in enumerate(CLASSES):
        class_path = data_path / class_name
        files = list(class_path.glob('*.wav'))
        print(f"Processing {len(files)} files for class: {class_name}")

        for filename in tqdm(files, desc=class_name):
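            try:
                y, sr = librosa.load(filename, sr=SAMPLE_RATE)
            except Exception as e:
                # Corrupted or unreadable files are skipped instead of aborting the run.
                print(f"Warning: Skipping unreadable audio file: {filename.name} ({e})")
                continue
            if len(y) == 0:
                print(f"Warning: Skipping empty audio file: {filename.name}")
                continue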

            age, sex, sex_val = -1, -1, None
            if not manifest_df.empty:
                original_name = get_original_filename(filename.name, manifest_df.index)
                if original_name:
                    person_data = manifest_df.loc[original_name]
                    age, sex_val = person_data['age'], person_data['sex']
                    # 0 = female, 1 = any other string value, -1 = missing or non-string
                    if isinstance(sex_val, str):
                        sex = 0 if sex_val.upper().startswith('F') else 1
                else:
                    print(f"Warning: Could not find original filename for {filename.name} in manifest.")

            features["age"].append(age)
            features["sex"].append(sex)
            features["labels"].append(class_idx)

            if feature_mode == FEATURE_MODE_DEFAULT:
                # Derive the time-step count from the audio settings instead of hardcoding it.
                expected_frames = 1 + AUDIO_SAMPLES // HOP_LENGTH
                TARGET_TIME_STEPS = expected_frames
                mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS)
                features["mel_spectrogram"].append(librosa.util.fix_length(mel_spectrogram, size=expected_frames, axis=1))

                y_preemp = librosa.effects.preemphasis(y)
                mfccs = librosa.feature.mfcc(y=y_preemp, sr=sr, n_mfcc=N_MFCC, n_mels=N_MELS, n_fft=N_FFT, hop_length=HOP_LENGTH)
                features["mfcc"].append(librosa.util.fix_length(mfccs, size=expected_frames, axis=1))

    print("Feature extraction complete.")
    print(f"Feature mode: {feature_mode}")

    if feature_mode in [FEATURE_MODE_ALL, FEATURE_MODE_DEFAULT]:
        print(f"Spectral features resized to {TARGET_TIME_STEPS} time steps.")

    return {key: np.array(val) for key, val in features.items()}

# =============================================================================
# --- Step 5: Save Features ---
# =============================================================================
def save_features(output_path, **features):
    """
    Saves all extracted feature matrices and labels to a single
    compressed NumPy file (.npz).
    """
    print(f"\n--- Saving Features to {output_path} ---")
    np.savez_compressed(output_path, **features)
    print("Features saved successfully.")


### Visualization Utilities

Visual verification is crucial in signal processing pipelines to ensure data integrity. These functions generate plots to confirm that augmentation and feature extraction are working as expected.

#### 1. Augmentation Verification (`visualize_augmentation_effect`)
This function compares an original waveform against its augmented counterpart.
* **Process:** It loads the first file found in the healthy-control folder of the `original` dataset and its corresponding pitch-shifted version from the `augmented` dataset.
* **Plotting:** It draws the two waveforms as vertically stacked time-domain plots with shared axes. This allows a quick visual check that the augmentation process hasn't corrupted the signal or introduced unwanted silence or artifacts.

#### 2. Feature Inspection (`visualize_features`)
This function generates a grid of plots displaying the extracted features for both a **Healthy Control** and a **Parkinson's Patient**.
* **Mel Spectrograms:** Displays the frequency content over time on a Log-Mel scale (using the `magma` colormap). This visualizes the energy distribution.
* **MFCCs:** Displays the Mel-Frequency Cepstral Coefficients over time (using the `coolwarm` colormap). This visualizes the spectral envelope and timbral characteristics.
* **Purpose:** These plots serve as a sanity check to verify that the spectral features are being computed correctly and have the expected dimensions before training begins.
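
The dB scaling behind the Log-Mel plot can be reproduced with NumPy. The sketch below approximates what `librosa.power_to_db(S, ref=np.max)` computes, assuming its default floor (`amin=1e-10`) and dynamic range (`top_db=80`). The function name `power_to_db_sketch` and the input values are illustrative.

```python
import numpy as np

def power_to_db_sketch(power, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to dB relative to its own maximum."""
    power = np.asarray(power, dtype=float)
    ref = power.max()
    db = 10.0 * np.log10(np.maximum(power, amin)) - 10.0 * np.log10(max(ref, amin))
    # Clip the floor so that near-silent bins do not stretch the color scale.
    return np.maximum(db, db.max() - top_db)

power = np.array([[1.0, 0.1], [0.001, 1e-12]])
db = power_to_db_sketch(power)
print(np.round(db, 1))
```

The loudest bin maps to 0 dB and everything else is negative, which is why the colorbar in the Mel spectrogram panels is labeled in negative dB values.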

In [None]:
# =============================================================================
# --- Visualization Functions ---
# =============================================================================
# waveshow and specshow live in this submodule, which some librosa versions
# do not load with a bare `import librosa`.
import librosa.display
def visualize_augmentation_effect(original_path, augmented_path, output_path, mode, feature_mode):
    """
    Loads and plots one original and its augmented version.
    """
    print("\n--- Visualizing Augmentation Effect ---")
    try:
        hc_path = original_path / HEALTHY_CLASS
        original_file = next(hc_path.glob('*.wav'), None)
        if not original_file:
            print("No original file found for visualization.")
            return

        augmented_file = augmented_path / HEALTHY_CLASS / f"{original_file.stem}_aug_pitch.wav"
        if not augmented_file.exists():
            print(f"Augmented file not found: {augmented_file}")
            return

        y_orig, sr = librosa.load(original_file, sr=SAMPLE_RATE, duration=DURATION_S)
        y_orig = librosa.util.fix_length(y_orig, size=AUDIO_SAMPLES)
        y_aug, _ = librosa.load(augmented_file, sr=SAMPLE_RATE)

        fig, axs = plt.subplots(2, 1, figsize=(12, 8), sharex=True, sharey=True)
        fig.suptitle('Augmentation Effect on a Sample Waveform', fontsize=16)
        librosa.display.waveshow(y_orig, sr=sr, ax=axs[0], color='slateblue')
        axs[0].set_title(f'Original Audio: {original_file.name}')
        librosa.display.waveshow(y_aug, sr=sr, ax=axs[1], color='peru')
        axs[1].set_title('Augmented Audio (Pitch Shift)')
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        out_file = os.path.join(output_path, f"augmentation_visualization_{mode}_{feature_mode}.png")
        plt.savefig(out_file)
        plt.close(fig)  # release the figure once it is on disk
        print(f"Saved '{out_file}'.")
    except Exception as e:
        print(f"Could not generate augmentation visualization: {e}")

def visualize_features(balanced_path, output_path, mode, feature_mode):
    """
    Generates and saves visualizations of extracted feature types that
    match the style of the reference paper.
    """
    print("\n--- Visualizing Features ---")
    try:
        samples = {
            HEALTHY_CLASS: next((balanced_path / HEALTHY_CLASS).glob('*.wav')),
            PARKINSON_CLASS: next((balanced_path / PARKINSON_CLASS).glob('*.wav'))
        }

        fig, axs = plt.subplots(2, 2, figsize=(12, 9))
        fig.suptitle('Feature Extraction Examples', fontsize=16)

        class_map = {HEALTHY_CLASS: "Healthy Control", PARKINSON_CLASS: "Parkinson Patient"}

        for i, (class_name, file_path) in enumerate(samples.items()):
            title = class_map[class_name]
            y, sr = librosa.load(file_path, sr=SAMPLE_RATE)

            # Mel Spectrogram
            mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS)
            melspec_db = librosa.power_to_db(mel_spec, ref=np.max)
            img = librosa.display.specshow(melspec_db, sr=sr, hop_length=HOP_LENGTH, x_axis='time', y_axis='mel', ax=axs[0, i], cmap='magma')
            fig.colorbar(img, ax=axs[0, i], format='%+2.0f dB')
            axs[0, i].set_title(f'Mel Spectrogram - {title}')
            axs[0, i].set_ylabel('Mel Frequency Bins')

            # MFCCs
            y_preemp = librosa.effects.preemphasis(y)
            mfccs = librosa.feature.mfcc(y=y_preemp, sr=sr, n_mfcc=N_MFCC, n_mels=N_MELS, n_fft=N_FFT, hop_length=HOP_LENGTH)
            # Diverging colormap, since MFCC values can be positive or negative
            img = librosa.display.specshow(mfccs, sr=sr, hop_length=HOP_LENGTH, x_axis='time', ax=axs[1, i], cmap='coolwarm')
            fig.colorbar(img, ax=axs[1, i])
            axs[1, i].set_title(f'MFCC - {title}')
            axs[1, i].set_ylabel('MFCC Coefficients')

        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        out_file = os.path.join(output_path, f"feature_visualizations_{mode}_{feature_mode}.png")
        plt.savefig(out_file)
        plt.close(fig)  # release the figure once it is on disk
        print(f"Saved '{out_file}'.")
    except Exception as e:
        print(f"Could not generate feature visualizations: {e}")


### Pipeline Summary & Logging

This final function aggregates key metrics from the entire data processing workflow into a concise report. This is essential for experiment tracking and ensuring reproducibility.

#### 1. Data Counting
It calculates and logs the total number of audio files at each stage of the pipeline:
* **Original:** The number of valid raw files after filtering.
* **Augmented:** The expanded dataset count after pitch shifting, gain adjustment, and noise injection.
* **Balanced:** The final count used for training, verifying that the `Healthy` and `Parkinson` classes are equal.

#### 2. Feature Dimension Verification
It prints and logs the shape of the extracted feature matrices (e.g., `(Samples, 30, 94)` for Mel Spectrograms). This confirms that the data is in the correct format for the neural network input layer.

#### 3. CSV Export
All collected metrics are compiled into a pandas DataFrame and saved as a CSV file (`data_preparation_...csv`). This provides a permanent record of the exact data conditions used for a specific experiment run.
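
The report layout can be sketched with a few placeholder rows. The counts and the shape below are made-up values for illustration, not results from any run. Only the `Category`, `Item`, and `Value` columns follow the function in the code cell.

```python
import io
import pandas as pd

# Placeholder values, for illustration only.
rows = [
    {"Category": "Data Counts", "Item": "Original Healthy", "Value": 40},
    {"Category": "Data Counts", "Item": "Original Parkinson", "Value": 25},
    {"Category": "Data Counts", "Item": "Balanced Healthy", "Value": 160},
    {"Category": "Data Counts", "Item": "Balanced Parkinson", "Value": 160},
    {"Category": "Feature Shapes", "Item": "Mel Spectrogram", "Value": str((320, 30, 94))},
]

summary_df = pd.DataFrame(rows)

# Write to an in-memory buffer here; the pipeline writes to a file instead.
buffer = io.StringIO()
summary_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Shapes are stored as strings, so the `Value` column mixes integers and text. A reader of the CSV has to parse the shape entries separately.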

In [None]:
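def generate_and_save_summary(ORIGINAL_DATA_PATH, HEALTHY_CLASS, PARKINSON_CLASS, AUGMENTED_DATA_PATH, BALANCED_DATA_PATH, MODE, FEATURE_MODE):
    """
    Prints file counts and feature shapes for the run and writes them to a CSV report.
    Relies on the module-level names `all_features` and `RESULTS_OUTPUT_PATH`.
    """
    summary_data = []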
    print("\n\n--- Pipeline Summary ---")
    try:
        # Original Data
        orig_hc = len(list((ORIGINAL_DATA_PATH / HEALTHY_CLASS).glob('*.wav')))
        orig_pd = len(list((ORIGINAL_DATA_PATH / PARKINSON_CLASS).glob('*.wav')))
        print(f"Original Data (Filtered): {orig_hc} HC, {orig_pd} PD")
        summary_data.append({'Category': 'Data Counts', 'Item': 'Original Healthy', 'Value': orig_hc})
        summary_data.append({'Category': 'Data Counts', 'Item': 'Original Parkinson', 'Value': orig_pd})

        # Augmented Data
        aug_hc = len(list((AUGMENTED_DATA_PATH / HEALTHY_CLASS).glob('*.wav')))
        aug_pd = len(list((AUGMENTED_DATA_PATH / PARKINSON_CLASS).glob('*.wav')))
        print(f"Augmented Data: {aug_hc} HC, {aug_pd} PD")
        summary_data.append({'Category': 'Data Counts', 'Item': 'Augmented Healthy', 'Value': aug_hc})
        summary_data.append({'Category': 'Data Counts', 'Item': 'Augmented Parkinson', 'Value': aug_pd})

        # Balanced Data
        bal_hc = len(list((BALANCED_DATA_PATH / HEALTHY_CLASS).glob('*.wav')))
        bal_pd = len(list((BALANCED_DATA_PATH / PARKINSON_CLASS).glob('*.wav')))
        print(f"Balanced Data: {bal_hc} HC, {bal_pd} PD")
        summary_data.append({'Category': 'Data Counts', 'Item': 'Balanced Healthy', 'Value': bal_hc})
        summary_data.append({'Category': 'Data Counts', 'Item': 'Balanced Parkinson', 'Value': bal_pd})

    except FileNotFoundError:
        print("Could not generate summary as some data directories are missing.")

    print("\n--- Final Feature Matrix Shapes ---")
    for name, data in all_features.items():
        title_name = name.replace('_', ' ').title()
        shape_str = str(data.shape)
        print(f"{title_name}: {shape_str}")
        summary_data.append({'Category': 'Feature Shapes', 'Item': title_name, 'Value': shape_str})

    if summary_data:
        try:
            summary_df = pd.DataFrame(summary_data)

            print("\n--- Summary DataFrame ---")
            print(summary_df.to_string())

            # Write every row: slicing with iloc[1:] would drop the first data row
            # ('Original Healthy') from the saved report.
            out_file = os.path.join(RESULTS_OUTPUT_PATH, f'data_preparation_{MODE}_{FEATURE_MODE}.csv')
            summary_df.to_csv(out_file, index=False, encoding='utf-8')
            print(f"\nUnified summary saved to '{out_file}'")

        except Exception as e:
            print(f"\nCould not create or save summary DataFrame. Error: {e}")

    print("\nPipeline finished successfully!")

### Main Execution Entry Point

This block orchestrates the entire data pipeline by calling the previously defined functions in the correct sequence.

#### 1. Dataset-Specific Initialization
Based on the `DATASET` constant selected at the beginning of the script, it branches to handle the unique setup requirements for each dataset:
* **Italian Dataset:** Specifically loads and merges the complex metadata structure before starting the preparation.
* **UAMS & mPower:** Directly initiates preparation using their respective metadata files.

#### 2. Sequential Processing Pipeline
Once the initial dataset is prepared, the script executes the core processing steps in order:
1.  **Augmentation:** Calls `apply_augmentation` to generate synthetic data variants.
2.  **Balancing:** Calls `balance_data` to equalize class distribution using the augmented data.
3.  **Feature Extraction:** Calls `extract_features` to compute spectrograms and MFCCs from the balanced dataset.
4.  **Saving:** Calls `save_features` to persist the final feature set to disk as a `.npz` file.

#### 3. Reporting and Visualization
Finally, it generates the artifacts needed for analysis and verification:
* **Visualizations:** Creates plots for augmentation effects and feature heatmaps.
* **Summary:** Generates and saves the final CSV report detailing dataset statistics and feature dimensions.

In [None]:
# =============================================================================
# --- Main Execution ---
# =============================================================================
if __name__ == '__main__':

    if DATASET == ITALIAN_DATASET:
        italian_metadata = load_italian_metadata(Italian_YHC_METADATA, Italian_EHC_METADATA, Italian_PD_METADATA)
        prepare_real_dataset(DATASET, DATASET_ROOT_PATH, ORIGINAL_DATA_PATH, VALID_FILE_PREFIXES, italian_metadata, MANIFEST_PATH)
    elif DATASET == UAMS_DATASET:
        prepare_real_dataset(DATASET, DATASET_ROOT_PATH, ORIGINAL_DATA_PATH, VALID_FILE_PREFIXES, Path(UAMS_METADATA_FILE), MANIFEST_PATH)
    elif DATASET == MPOWER_DATASET:
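        prepare_real_dataset(DATASET, DATASET_ROOT_PATH, ORIGINAL_DATA_PATH, VALID_FILE_PREFIXES, Path(MPOWER_METADATA_FILE), MANIFEST_PATH)
    else:
        # Fail early instead of running the later steps on an unprepared dataset.
        raise ValueError(f"Unknown DATASET: {DATASET!r}")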

    apply_augmentation(ORIGINAL_DATA_PATH, AUGMENTED_DATA_PATH)
    balance_data(ORIGINAL_DATA_PATH, AUGMENTED_DATA_PATH, BALANCED_DATA_PATH)
    all_features = extract_features(BALANCED_DATA_PATH, FEATURE_MODE, MANIFEST_PATH)
    save_features(FEATURES_OUTPUT_PATH, **all_features)

    # --- Generate Visualizations ---
    visualize_augmentation_effect(ORIGINAL_DATA_PATH, AUGMENTED_DATA_PATH, RESULTS_OUTPUT_PATH, MODE, FEATURE_MODE)
    visualize_features(BALANCED_DATA_PATH, RESULTS_OUTPUT_PATH, MODE, FEATURE_MODE)
    generate_and_save_summary(ORIGINAL_DATA_PATH, HEALTHY_CLASS, PARKINSON_CLASS, AUGMENTED_DATA_PATH, BALANCED_DATA_PATH, MODE, FEATURE_MODE)

