# Data preprocessing: Segment audio files by class

This is the first step in the data preprocessing pipeline. It breaks each long audio file into short, fixed-length segments (1 second by default; the length is set by `window_size_seconds` in the config).

This notebook processes long audio recordings (tapes) and their corresponding Audacity label files (TXT format) to extract segments for training an acoustic model. It supports **multi-class classification** where labels can be categorized into different classes based on their text content.

**Class Handling:**
1.  **Noise (Y=0)**: Extracted from time segments that *do not* contain any labeled sounds (gaps between annotations).
2.  **Anemonefish (Target)**: Default class for all labeled segments that don't match other specific classes.
3.  **Custom Classes** (e.g., "biological"): Labels that exactly match specific text in the annotation files are assigned to their respective classes.
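
The class rules above can be sketched as a small function. This is an illustrative sketch rather than the notebook's own helper: `classify_label` is a hypothetical name, and the class list is the multi-class example used elsewhere in this notebook.

```python
def classify_label(label_text, classes):
    """Return the class for one annotation label, or None if it cannot be placed."""
    # Custom classes are everything except the two special names.
    custom = [c for c in classes if c not in ("noise", "anemonefish")]
    if label_text in custom:          # exact, case-sensitive match
        return label_text
    if "anemonefish" in classes:      # default target class
        return "anemonefish"
    return None                       # no default class configured


classes = ["noise", "anemonefish", "biological"]
for text in ["biological", "Biological", "", "grunt"]:
    print(repr(text), "->", classify_label(text, classes))
```

Because matching is exact, a label such as `Biological` (capital B) falls through to the default `anemonefish` class, so label spelling in the annotation files matters.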

A sliding window approach is used to generate multiple audio clips from both labeled and noise regions.
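
As a rough illustration of the sliding window, the sketch below lists the start time of every full window that fits inside one region. `window_starts` is a hypothetical helper, and the 1.0 s window and 0.5 s hop are assumed values; the real ones come from the config. It multiplies the hop by a counter instead of accumulating it, which is a small deviation chosen here to avoid floating-point drift.

```python
def window_starts(seg_start_s, seg_end_s, window_s, slide_s):
    """Start times of every full window inside [seg_start_s, seg_end_s]."""
    starts = []
    k = 0
    # A window is kept only if it ends at or before the end of the region.
    while seg_start_s + k * slide_s + window_s <= seg_end_s:
        starts.append(seg_start_s + k * slide_s)
        k += 1
    return starts


print(window_starts(10.0, 12.0, 1.0, 0.5))  # [10.0, 10.5, 11.0]
print(window_starts(0.0, 0.4, 1.0, 0.5))    # [] (region shorter than one window)
```

A region shorter than the window yields no start times here. For labeled classes, the extraction helper handles that case separately by zero-padding the short segment to one full window.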

**New Features:**
- **YAML Configuration**: All preprocessing parameters are loaded from a YAML config file for reproducibility.
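- **Flexible Directory Structure**: Audio files and annotations are stored separately, in `data/1_raw/{site}/audio/` and `data/1_raw/{site}/{annotation_version}/` (for example, `data/1_raw/{site}/annotations/`). The annotation folder name is taken from `annotation_version` in the config.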
- **Multi-Class Support**: Configure any number of classes by specifying exact label text matches in the config.

**Process:**
1.  **Load Configuration**: Read all parameters from YAML config file (paths, classes, window settings).
2.  **Parse Labels**: Read Audacity label files to identify time segments and their label text.
3.  **Classify Segments**: Categorize labeled segments by class based on exact text matching.
4.  **Identify Noise Regions**: Determine the time segments in the audio tapes that *do not* contain labeled sounds.
5.  **Extract Segments with Sliding Window**:
    *   Apply a sliding window to each class's labeled regions to generate class-specific examples.
    *   Apply the same sliding window to the identified noise regions to generate negative examples.
6.  **Save Segments**: Save the extracted audio clips as individual WAV files into class-specific output directories (e.g., `anemonefish/`, `biological/`, `noise/`).
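
Step 4 is the least obvious part of the process, so here is a minimal sketch of it. `noise_gaps` is a hypothetical, condensed stand-in for the notebook's noise helper, and the label times and the 1.0 s minimum are made-up values. It assumes the minimum length is greater than zero.

```python
def noise_gaps(total_duration_s, labeled, min_len_s):
    """Return unlabeled (start, end) gaps that are at least min_len_s long."""
    gaps = []
    cursor = 0.0  # end of the labeled audio seen so far
    for start, end in sorted(labeled):
        if start - cursor >= min_len_s:
            gaps.append((cursor, start))
        # max() keeps the cursor correct when labels overlap or nest.
        cursor = max(cursor, end)
    if total_duration_s - cursor >= min_len_s:
        gaps.append((cursor, total_duration_s))
    return gaps


labels = [(2.0, 3.5), (3.0, 4.0), (9.0, 9.5)]  # the first two overlap
print(noise_gaps(12.0, labels, 1.0))
# [(0.0, 2.0), (4.0, 9.0), (9.5, 12.0)]
```

With no labels at all, the whole tape comes back as a single gap, which matches the rule that an unannotated tape is processed for noise only.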

The output will populate class folders within `data/_cache/1_generate_training_audio/`.
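
The file names inside each class folder follow the pattern used by the extraction helper later in this notebook: `{tape}_{class}_window_{index}.wav`, with `window_padded` in place of `window` for zero-padded clips. The sketch below only reproduces that naming; `window_path` is a hypothetical helper and `tape01` is a made-up tape name.

```python
import os


def window_path(output_base, class_name, tape_basename, index, padded=False):
    """Build the output path for one extracted window."""
    kind = "window_padded" if padded else "window"
    filename = f"{tape_basename}_{class_name}_{kind}_{index:04d}.wav"
    return os.path.join(output_base, class_name, filename)


base = os.path.join("data", "_cache", "1_generate_training_audio")
print(window_path(base, "noise", "tape01", 7))
print(window_path(base, "anemonefish", "tape01", 12, padded=True))
```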

## 1. Setup and Imports

In [None]:
import os
import glob
import pandas as pd
import soundfile as sf
import numpy as np
import logging
import random
import yaml
import shutil

# Setup basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

In [None]:
def clear_cache(cache_dir_path):
    """
    Clears the cache directory by removing all contents.
    """
    if not os.path.exists(cache_dir_path):
        return "Cache directory does not exist"
    
    # Remove all contents of the directory
    shutil.rmtree(cache_dir_path, ignore_errors=True)
    
    # Recreate the empty directory
    os.makedirs(cache_dir_path, exist_ok=True)
    
    return "Cache directory cleared"

## 2. Configuration

**YAML-Based Configuration**: All preprocessing parameters are now loaded from a YAML configuration file. This ensures reproducibility and makes it easy to version control your preprocessing settings.

**To use this notebook:**
1. Copy the template config: `data/2_training_datasets/preprocessing_config_template.yaml`
2. Customize it for your dataset (update paths, classes, parameters)
3. Update the `CONFIG_PATH` variable below to point to your config file
4. Run the notebook!
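
Step 2 comes down to filling in a handful of keys. The sketch below shows them as a Python dict with the same shape that `yaml.safe_load` would return. The key names are the ones the loading cell reads; every value is a placeholder, and `missing_keys` is a hypothetical helper for catching a typo before the notebook fails with a `KeyError`.

```python
# Placeholder values only; replace them with your own paths and settings.
example_config = {
    "workspace_base_path": "/path/to/workspace",
    "dataset_version": "v1",
    "raw_data_site": "site_a",
    "annotation_version": "annotations",
    "classes": ["noise", "anemonefish", "biological"],
    "audio_processing": {
        "window_size_seconds": 1.0,
        "slide_seconds": 0.5,
        "min_segment_duration_seconds": 1.0,
    },
    "noise_padding": {
        "padding_ratio": 0.2,
        "min_duration_seconds": 0.2,
        "max_duration_seconds": 0.8,
    },
}

REQUIRED_TOP_LEVEL = ["workspace_base_path", "dataset_version", "raw_data_site",
                      "annotation_version", "classes"]
REQUIRED_NESTED = {
    "audio_processing": ["window_size_seconds", "slide_seconds",
                         "min_segment_duration_seconds"],
    "noise_padding": ["padding_ratio", "min_duration_seconds",
                      "max_duration_seconds"],
}


def missing_keys(config):
    """List the dotted names of required keys that are absent from config."""
    missing = [k for k in REQUIRED_TOP_LEVEL if k not in config]
    for section, keys in REQUIRED_NESTED.items():
        block = config.get(section) or {}
        missing += [f"{section}.{k}" for k in keys if k not in block]
    return missing


print(missing_keys(example_config))  # []
```

Note that the main loop always writes noise windows to the `noise` output folder, so the class list should include `"noise"`.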

The YAML config specifies:
- Dataset version and site information
- Class list (e.g., `["noise", "anemonefish"]` for binary, or `["noise", "anemonefish", "biological"]` for multi-class)
- Audio processing parameters (window size, slide duration, etc.)
- Noise padding parameters (to prevent duration-based model bias)
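
The noise padding item deserves a short illustration. Labeled segments shorter than the window are zero-padded, so without a countermeasure trailing silence would appear only in the target classes, and a model could presumably learn to key on it. A fraction of noise windows is therefore truncated to a random length and padded back to full size. The sketch below uses plain Python lists instead of NumPy arrays to stay dependency-free; `shorten_and_pad`, the tiny sample rate, and the duration bounds are all illustrative assumptions.

```python
import random


def shorten_and_pad(window, sr, min_s, max_s, rng):
    """Keep a random-length prefix of the window and zero-pad to the original length."""
    keep = int(rng.uniform(min_s, max_s) * sr)
    keep = max(0, min(keep, len(window)))  # never keep more than the window holds
    return window[:keep] + [0.0] * (len(window) - keep)


rng = random.Random(0)   # seeded so repeated runs give the same result
sr = 8                   # tiny hypothetical sample rate, easy to read
window = [1.0] * sr      # one "second" of non-zero samples
padded = shorten_and_pad(window, sr, 0.25, 0.75, rng)
print(len(padded), padded)
```

The output always has the same length as the input window, so padded and unpadded noise clips can be mixed freely in one class folder.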

In [None]:
# --- Load Configuration from YAML ---

# !!! UPDATE THIS PATH TO YOUR CONFIG FILE !!!
CONFIG_PATH = '/Volumes/InsightML/NAS/3_Lucia_Yllan/Clown_Fish_Acoustics/data/2_training_datasets/preprocessing_config_template.yaml'

# Load configuration
logging.info(f"Loading configuration from: {CONFIG_PATH}")
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

# Extract configuration values
WORKSPACE_BASE_PATH = config['workspace_base_path']
DATASET_VERSION = config['dataset_version']
RAW_DATA_SITE = config['raw_data_site']
ANNOTATION_VERSION = config['annotation_version']
CLASSES = config['classes']

# Construct paths based on new directory structure
INPUT_AUDIO_DIR = os.path.join(WORKSPACE_BASE_PATH, 'data', '1_raw', RAW_DATA_SITE, 'audio')
INPUT_ANNOTATIONS_DIR = os.path.join(WORKSPACE_BASE_PATH, 'data', '1_raw', RAW_DATA_SITE, ANNOTATION_VERSION)
OUTPUT_AUDIO_FILES_DIR = os.path.join(WORKSPACE_BASE_PATH, 'data', '_cache', '1_generate_training_audio')

# Audio processing parameters
WINDOW_SIZE_SECONDS = config['audio_processing']['window_size_seconds']
SLIDE_SECONDS = config['audio_processing']['slide_seconds']
MIN_SEGMENT_DURATION_SECONDS = config['audio_processing']['min_segment_duration_seconds']

# Noise padding parameters
NOISE_PADDING_RATIO = config['noise_padding']['padding_ratio']
MIN_NOISE_DURATION_FOR_SHORTENING = config['noise_padding']['min_duration_seconds']
MAX_NOISE_DURATION_FOR_SHORTENING = config['noise_padding']['max_duration_seconds']

# Create output directories for each class
OUTPUT_CLASS_DIRS = {}
for class_name in CLASSES:
    class_dir = os.path.join(OUTPUT_AUDIO_FILES_DIR, class_name)
    os.makedirs(class_dir, exist_ok=True)
    OUTPUT_CLASS_DIRS[class_name] = class_dir

# Log configuration
logging.info(f"=== Configuration Loaded ===")
logging.info(f"Dataset Version: {DATASET_VERSION}")
logging.info(f"Raw Data Site: {RAW_DATA_SITE}")
logging.info(f"Annotation Version: {ANNOTATION_VERSION}")
logging.info(f"Classes: {CLASSES}")
logging.info(f"Input Audio Directory: {INPUT_AUDIO_DIR}")
logging.info(f"Input Annotations Directory: {INPUT_ANNOTATIONS_DIR}")
logging.info(f"Output Base Directory: {OUTPUT_AUDIO_FILES_DIR}")
for class_name, class_dir in OUTPUT_CLASS_DIRS.items():
    logging.info(f"  - {class_name}: {class_dir}")
logging.info(f"Audio Window Size: {WINDOW_SIZE_SECONDS}s")
logging.info(f"Sliding Window Hop: {SLIDE_SECONDS}s")
logging.info(f"Minimum Segment Duration: {MIN_SEGMENT_DURATION_SECONDS}s")
logging.info(f"Noise Padding Ratio: {NOISE_PADDING_RATIO} ({int(NOISE_PADDING_RATIO*100)}%)")

# Validate input directories exist
if not os.path.isdir(INPUT_AUDIO_DIR):
    logging.critical(f"Input audio directory not found: {INPUT_AUDIO_DIR}")
    logging.critical("Please check your configuration file.")
if not os.path.isdir(INPUT_ANNOTATIONS_DIR):
    logging.critical(f"Input annotations directory not found: {INPUT_ANNOTATIONS_DIR}")
    logging.critical("Please check your configuration file.")

## 3. Helper Functions
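
The helpers in this section start from Audacity label tracks exported as text: one label per line, with start time, end time, and label text separated by tabs. The following is a minimal sketch of that format using the standard-library `csv` module on an in-memory string, instead of pandas on a file. `parse_label_text` and the sample rows are made up for illustration.

```python
import csv
import io

# Three labels: a custom class, an empty label, and a point label (start == end).
sample = (
    "3.000000\t4.100000\t\n"
    "0.500000\t1.250000\tbiological\n"
    "7.200000\t7.200000\tpoint\n"
)


def parse_label_text(text):
    """Parse tab-separated label rows into (start, end, label) tuples sorted by start."""
    segments = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 2:
            continue  # blank or malformed line
        start, end = float(row[0]), float(row[1])
        label = row[2].strip() if len(row) > 2 else ""
        if start < end:  # drops point labels and reversed ranges
            segments.append((start, end, label))
    segments.sort(key=lambda seg: seg[0])
    return segments


print(parse_label_text(sample))
# [(0.5, 1.25, 'biological'), (3.0, 4.1, '')]
```

The `start < end` check means Audacity point labels, which have identical start and end times, never become segments.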

In [None]:
def parse_audacity_labels(label_file_path):
    """
    Parses an Audacity label file (TXT, tab-separated).
    Assumes columns are: start_time (s), end_time (s), label_text.
    Returns a list of tuples representing labeled segments with their text: 
    [(start1, end1, label_text1), (start2, end2, label_text2), ...].
    Labels are sorted by start time.
    """
    labeled_segments = []
    try:
        # Read with tab delimiter, no header, read all three columns
        df = pd.read_csv(label_file_path, sep='\t', header=None, float_precision='round_trip')
        for index, row in df.iterrows():
            start_time = float(row[0])
            end_time = float(row[1])
            # Extract label text (third column), default to empty string if not present
            label_text = str(row[2]).strip() if len(row) > 2 and pd.notna(row[2]) else ""
            
            if start_time < end_time:  # Basic validation
                labeled_segments.append((start_time, end_time, label_text))
            else:
                logging.warning(f"Skipping invalid segment in {label_file_path}: start_time {start_time} >= end_time {end_time}")

        # Sort segments by start time
        labeled_segments.sort(key=lambda x: x[0])
        logging.info(f"Parsed {len(labeled_segments)} segments from {label_file_path}")
    except FileNotFoundError:
        logging.error(f"Label file not found: {label_file_path}")
    except pd.errors.EmptyDataError:
        logging.warning(f"Label file is empty: {label_file_path}")
    except Exception as e:
        logging.error(f"Error parsing label file {label_file_path}: {e}")
    return labeled_segments

In [None]:
def classify_labeled_segments(labeled_segments_with_text, classes):
    """
    Classifies labeled segments into their respective classes based on label text matching.
    
    Args:
        labeled_segments_with_text (list): List of tuples [(start, end, label_text), ...]
        classes (list): List of class names from config (e.g., ["noise", "anemonefish", "biological"])
    
    Returns:
        dict: Dictionary mapping class names to segment lists (without text)
              {class_name: [(start1, end1), (start2, end2), ...]}
    
    Classification Logic:
        - 'noise': Always represents unlabeled regions (handled separately, not from annotations)
        - 'anemonefish': Default class for all labeled segments that don't match other classes
        - Other classes: Exact match (case-sensitive) on label text
    """
    # Initialize result dictionary for all non-noise classes
    classified_segments = {}
    for class_name in classes:
        if class_name != 'noise':  # Noise is handled separately (unlabeled regions)
            classified_segments[class_name] = []
    
    # Get list of specific class names to match (excluding 'noise' and 'anemonefish')
    specific_classes = [c for c in classes if c not in ['noise', 'anemonefish']]
    
    # Classify each labeled segment
    for start_time, end_time, label_text in labeled_segments_with_text:
        matched = False
        
        # Try to match with specific classes (exact match, case-sensitive)
        for specific_class in specific_classes:
            if label_text == specific_class:
                classified_segments[specific_class].append((start_time, end_time))
                matched = True
                break
        
        # If not matched with any specific class, assign to 'anemonefish' (default target class)
        if not matched and 'anemonefish' in classified_segments:
            classified_segments['anemonefish'].append((start_time, end_time))
    
    # Log classification results
    for class_name, segments in classified_segments.items():
        if segments:
            logging.info(f"  Classified {len(segments)} segments as '{class_name}'")
    
    return classified_segments


In [None]:
def get_noise_segments(total_duration_seconds, labeled_segments, min_segment_len_seconds):
    """
    Identifies noise segments in an audio file given its total duration and labeled (non-noise) segments.
    Args:
        total_duration_seconds (float): Total duration of the audio file.
        labeled_segments (list of tuples): Sorted list of (start, end) or (start, end, text) times for labeled regions.
                                           Function handles both formats.
        min_segment_len_seconds (float): Minimum duration for a segment to be considered noise.
                                         Segments shorter than this will be ignored.
    Returns:
        list of tuples: [(noise_start1, noise_end1), ...] for noise regions.
    """
    noise_segments = []
    current_time = 0.0
    
    # Normalize labeled_segments to (start, end) format (handle both 2-tuple and 3-tuple)
    normalized_segments = []
    for seg in labeled_segments:
        if len(seg) >= 2:
            normalized_segments.append((seg[0], seg[1]))
    
    labeled_segments = normalized_segments

    # If no labeled segments, the whole file is noise
    if not labeled_segments:
        if total_duration_seconds >= min_segment_len_seconds:
            noise_segments.append((0.0, total_duration_seconds))
            logging.info(f"No labels found. Entire duration {total_duration_seconds:.2f}s considered noise for segmentation.")
        else:
            logging.info(f"No labels found. Entire duration {total_duration_seconds:.2f}s is less than min_segment_len_seconds {min_segment_len_seconds:.2f}s. No noise segments generated.")
        return noise_segments

    # Process segment from start of tape to the first label
    first_label_start = labeled_segments[0][0]
    if first_label_start > current_time:
        duration = first_label_start - current_time
        if duration >= min_segment_len_seconds:
            noise_segments.append((current_time, first_label_start))
        # else: logging.debug(f"Initial noise segment from {current_time:.2f} to {first_label_start:.2f} (duration {duration:.2f}s) too short.")
    current_time = max(current_time, labeled_segments[0][1]) # Move current time to end of first label

    # Process segments between labels
    for i in range(len(labeled_segments) - 1):
        end_current_label = labeled_segments[i][1]
        start_next_label = labeled_segments[i+1][0]
        
        # Ensure current_time is at least at the end of the current label before looking for a gap
        current_time = max(current_time, end_current_label) 

        if start_next_label > current_time: # If there's a gap
            duration = start_next_label - current_time
            if duration >= min_segment_len_seconds:
                noise_segments.append((current_time, start_next_label))
            # else: logging.debug(f"Noise segment between labels (from {current_time:.2f} to {start_next_label:.2f}, duration {duration:.2f}s) too short.")
        current_time = max(current_time, labeled_segments[i+1][1]) # Move current time to end of next label

    # Process segment from the end of the last label to the end of the file
    last_label_end = labeled_segments[-1][1]
    current_time = max(current_time, last_label_end) # Ensure current_time is at least at the end of the last label
    
    if total_duration_seconds > current_time:
        duration = total_duration_seconds - current_time
        if duration >= min_segment_len_seconds:
            noise_segments.append((current_time, total_duration_seconds))
        # else: logging.debug(f"Final noise segment (from {current_time:.2f} to {total_duration_seconds:.2f}, duration {duration:.2f}s) too short.")
            
    if noise_segments:
        logging.info(f"Identified {len(noise_segments)} noise segments meeting minimum duration of {min_segment_len_seconds:.2f}s.")
    else:
        logging.info(f"No noise segments meeting minimum duration of {min_segment_len_seconds:.2f}s were identified.")
    return noise_segments

In [None]:
def extract_save_segments_sliding_window(tape_audio_path, segments_to_process,
                                         window_duration_s, slide_duration_s, sr,
                                         output_dir_path, tape_basename, segment_type_prefix):
    """
    Extracts audio segments using a sliding window from the given audio tape and saves them.
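    For every non-noise class, segments shorter than window_duration_s are zero-padded to one full window.
    For 'noise', sliding windows are extracted normally, then a random fraction of them is shortened and zero-padded.
    Note: the noise branch reads the module-level settings NOISE_PADDING_RATIO,
    MIN_NOISE_DURATION_FOR_SHORTENING and MAX_NOISE_DURATION_FOR_SHORTENING.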
    Args:
        tape_audio_path (str): Path to the full audio tape WAV file.
        segments_to_process (list of tuples): List of (start_sec, end_sec) for regions to process.
        window_duration_s (float): Duration of each window in seconds.
        slide_duration_s (float): Hop duration for the sliding window in seconds.
        sr (int): Sample rate of the audio tape.
        output_dir_path (str): Directory to save the extracted audio windows.
        tape_basename (str): Basename of the tape file, for naming windows.
        segment_type_prefix (str): Prefix for the filename (e.g., 'noise', 'anemonefish').
    Returns:
        dict: Statistics containing counts of total windows, padded windows, and segment lengths.
    """
    saved_windows_count = 0
    padded_windows_count = 0
    segment_lengths = []
    window_len_samples = int(window_duration_s * sr)

    if not segments_to_process:
        logging.info(f"No segments to process for {segment_type_prefix} from {tape_basename}.")
        return {
            'total_windows': 0,
            'padded_windows': 0,
            'segment_lengths': []
        }

    for seg_idx, (seg_start_s, seg_end_s) in enumerate(segments_to_process):
        segment_duration_s = seg_end_s - seg_start_s
        segment_lengths.append(segment_duration_s)
        
        logging.debug(f"Processing segment {seg_idx+1}/{len(segments_to_process)} ({segment_type_prefix}): "
                      f"{seg_start_s:.2f}s - {seg_end_s:.2f}s (duration: {segment_duration_s:.2f}s) from {tape_basename}")

        if segment_duration_s <= 0:
            logging.warning(f"Segment {seg_idx+1} for {segment_type_prefix} from {tape_basename} has zero or negative duration. Skipping.")
            continue

        # Case 1: Segment is shorter than window_duration_s (all non-noise classes): zero-pad it to one full window.
        # Noise segments never take this branch; a noise segment shorter than the window yields no windows in the loop below.
        if segment_duration_s < window_duration_s and segment_type_prefix != "noise":
            logging.info(f"Segment {seg_idx+1} ({segment_type_prefix}, duration {segment_duration_s:.2f}s) is shorter than window size {window_duration_s:.2f}s. Padding...")
            start_sample = int(seg_start_s * sr)
            frames_to_read = int(segment_duration_s * sr) # Read only the actual segment

            if frames_to_read <= 0:
                logging.warning(f"Segment {seg_idx+1} ({segment_type_prefix}) resulted in {frames_to_read} frames to read. Skipping padding.")
                continue
            
            try:
                audio_segment_data, read_sr = sf.read(tape_audio_path, start=start_sample,
                                                      frames=frames_to_read, dtype='float32', always_2d=False)
                if sr != read_sr: # Should not happen if sf.info worked correctly
                    logging.warning(f"Sample rate mismatch during read: expected {sr}, got {read_sr}. Using original sr for padding calculation.")

                # Ensure audio_segment_data is 1D array
                if audio_segment_data.ndim > 1:
                    audio_segment_data = np.mean(audio_segment_data, axis=1) # Convert to mono by averaging if stereo

                num_padding_samples = window_len_samples - len(audio_segment_data)
                
                if num_padding_samples < 0: # Should ideally not happen if segment_duration_s < window_duration_s
                    logging.warning(f"Calculated negative padding ({num_padding_samples}) for short segment {seg_idx+1}. "
                                    f"Segment len: {len(audio_segment_data)}, target window len: {window_len_samples}. Clipping padding to 0.")
                    num_padding_samples = 0
                    # Potentially truncate if audio_segment_data is somehow longer than window_len_samples
                    audio_segment_data = audio_segment_data[:window_len_samples]

                padded_audio_data = np.pad(audio_segment_data, (0, num_padding_samples), 'constant', constant_values=(0.0, 0.0))

                if len(padded_audio_data) == window_len_samples:
                    window_filename = f"{tape_basename}_{segment_type_prefix}_window_padded_{saved_windows_count:04d}.wav"
                    output_window_path = os.path.join(output_dir_path, window_filename)
                    sf.write(output_window_path, padded_audio_data, sr)
                    saved_windows_count += 1
                    padded_windows_count += 1  # Track padded window
                else:
                    logging.warning(f"Padded audio for short segment {seg_idx+1} ({segment_type_prefix}) has unexpected length {len(padded_audio_data)} (expected {window_len_samples}). Skipping save.")
            except Exception as e:
                logging.error(f"Error processing/padding short segment {seg_idx+1} ({segment_type_prefix}) at {seg_start_s:.2f}s from {tape_audio_path}: {e}", exc_info=True)
            continue # Move to the next segment

        # Case 2: Segment is >= window_duration_s OR noise segment (noise uses sliding window even for short segments)
        # Apply sliding window approach to extract multiple overlapping windows
        current_window_start_s = seg_start_s
        while current_window_start_s + window_duration_s <= seg_end_s:
            start_sample = int(current_window_start_s * sr)
            
            try:
                audio_window_data, _ = sf.read(tape_audio_path, start=start_sample,
                                               frames=window_len_samples, dtype='float32', always_2d=False)
                
                if audio_window_data.ndim > 1: # Ensure mono
                    audio_window_data = np.mean(audio_window_data, axis=1)

                if len(audio_window_data) == window_len_samples:
                    # For noise segments, randomly decide whether to shorten and pad this window
                    if segment_type_prefix == "noise" and random.random() < NOISE_PADDING_RATIO:
                        # Randomly shorten this window and apply padding
                        random_duration = random.uniform(MIN_NOISE_DURATION_FOR_SHORTENING, MAX_NOISE_DURATION_FOR_SHORTENING)
                        random_samples = int(random_duration * sr)
                        
                        logging.info(f"Randomly shortening noise window at {current_window_start_s:.2f}s (original duration {window_duration_s:.2f}s) to {random_duration:.2f}s for padding.")
                        
                        # Truncate the window to the random duration
                        shortened_audio_data = audio_window_data[:random_samples]
                        
                        # Pad to match the original window length
                        num_padding_samples = window_len_samples - len(shortened_audio_data)
                        
                        if num_padding_samples < 0:
                            logging.warning(f"Calculated negative padding ({num_padding_samples}) for shortened noise window. Truncating.")
                            num_padding_samples = 0
                            shortened_audio_data = shortened_audio_data[:window_len_samples]

                        padded_audio_data = np.pad(shortened_audio_data, (0, num_padding_samples), 'constant', constant_values=(0.0, 0.0))
                        
                        if len(padded_audio_data) == window_len_samples:
                            window_filename = f"{tape_basename}_{segment_type_prefix}_window_padded_{saved_windows_count:04d}.wav"
                            output_window_path = os.path.join(output_dir_path, window_filename)
                            sf.write(output_window_path, padded_audio_data, sr)
                            saved_windows_count += 1
                            padded_windows_count += 1  # Track padded window
                        else:
                            logging.warning(f"Padded audio for shortened noise window has unexpected length {len(padded_audio_data)} (expected {window_len_samples}). Skipping save.")
                    else:
                        # Save the unmodified window (any labeled class, or a noise window not selected for padding)
                        window_filename = f"{tape_basename}_{segment_type_prefix}_window_{saved_windows_count:04d}.wav"
                        output_window_path = os.path.join(output_dir_path, window_filename)
                        sf.write(output_window_path, audio_window_data, sr)
                        saved_windows_count += 1
                else:
                    logging.warning(f"Could not read full window of {window_len_samples} samples "
                                    f"at {current_window_start_s:.2f}s from {tape_basename} for sliding window. "
                                    f"Read {len(audio_window_data)} samples. Skipping this window.")

            except Exception as e:
                logging.error(f"Error reading/writing sliding window at {current_window_start_s:.2f}s "
                              f"from {tape_audio_path} for {segment_type_prefix}: {e}", exc_info=True)
            
            current_window_start_s += slide_duration_s
            
    logging.info(f"Extracted and saved {saved_windows_count} '{segment_type_prefix}' windows from {tape_basename} (from {len(segments_to_process)} segments). Padded: {padded_windows_count}")
    
    return {
        'total_windows': saved_windows_count,
        'padded_windows': padded_windows_count,
        'segment_lengths': segment_lengths
    }

## 4. Main Processing Loop

This section iterates through all audio files in `INPUT_AUDIO_DIR`. For each audio tape:
1. Finds matching annotation file(s) in `INPUT_ANNOTATIONS_DIR` based on filename matching.
2. Parses the labels if the file exists; otherwise, treats the entire tape as noise.
3. Classifies labeled segments into their respective classes based on label text matching.
4. Identifies noise segments (unlabeled regions).
5. Extracts and saves audio windows for each class into their respective output directories.
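
Step 1 relies on prefix matching: any `.txt` file whose name starts with the audio basename counts as a match, and the first match is used. The sketch below reproduces that pattern with `fnmatch` on a list of names to show a pitfall when one basename is a prefix of another. The file names are invented, and `strict_matches` is a hypothetical alternative that assumes the basename is always followed by `.`, `_` or `-`.

```python
import fnmatch


def loose_matches(audio_basename, names):
    """Names matching '<basename>*.txt', the same pattern the main loop globs for."""
    return sorted(fnmatch.filter(names, f"{audio_basename}*.txt"))


def strict_matches(audio_basename, names):
    """Like loose_matches, but the basename must be followed by a separator."""
    kept = []
    for name in fnmatch.filter(names, f"{audio_basename}*.txt"):
        rest = name[len(audio_basename):]
        if rest[:1] in (".", "_", "-"):
            kept.append(name)
    return sorted(kept)


names = ["tape1_labels.txt", "tape10_labels.txt", "tape2_labels.txt"]
print(loose_matches("tape1", names))   # ['tape10_labels.txt', 'tape1_labels.txt']
print(strict_matches("tape1", names))  # ['tape1_labels.txt']
```

With the loose pattern, `tape1` also picks up the labels for `tape10`, so zero-padded tape numbers (`tape01`, `tape10`) or a stricter check avoid pairing a tape with the wrong annotations.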

In [None]:
def main_processing():
    if not os.path.isdir(INPUT_AUDIO_DIR):
        logging.error("Input audio directory does not exist. Please check Configuration.")
        return
    if not os.path.isdir(INPUT_ANNOTATIONS_DIR):
        logging.error("Input annotations directory does not exist. Please check Configuration.")
        return

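    # Find all audio files. Use a set because both patterns match the same files
    # where glob is case-insensitive (e.g. Windows), skip macOS AppleDouble '._'
    # files, and sort so tapes are processed in a stable order.
    audio_files = sorted({
        path
        for pattern in ('*.wav', '*.WAV')
        for path in glob.glob(os.path.join(INPUT_AUDIO_DIR, pattern))
        if not os.path.basename(path).startswith('._')
    })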
    
    if not audio_files:
        logging.warning(f"No .wav or .WAV files found in {INPUT_AUDIO_DIR}. Processing will not start.")
        return

    logging.info(f"Found {len(audio_files)} audio files for processing.")
    
    # Initialize statistics tracking for all classes
    class_stats = {}
    for class_name in CLASSES:
        class_stats[class_name] = {
            'total_windows': 0,
            'padded_windows': 0,
            'segment_lengths': []
        }

    for audio_path in audio_files:
        audio_basename_with_ext = os.path.basename(audio_path)
        audio_basename = os.path.splitext(audio_basename_with_ext)[0]
        
        logging.info(f"--- Processing audio: {audio_path} ---")
        
        # Find matching annotation file(s) in the annotations directory
        # Look for any .txt file that starts with the audio basename
        annotation_pattern = os.path.join(INPUT_ANNOTATIONS_DIR, f"{audio_basename}*.txt")
        annotation_files = glob.glob(annotation_pattern)
        
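        # Filter out hidden files (starting with ._) and sort, so that picking
        # the first match below is deterministic (glob order is not guaranteed)
        annotation_files = sorted(f for f in annotation_files if not os.path.basename(f).startswith('._'))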
        
        labeled_segments_with_text = [] # Initialize to empty list
        if not annotation_files:
            logging.warning(f"No annotation file found for {audio_basename} in {INPUT_ANNOTATIONS_DIR}. "
                            "This audio will only be processed for noise.")
        else:
            # Use the first matching annotation file
            annotation_file_path = annotation_files[0]
            if len(annotation_files) > 1:
                logging.warning(f"Multiple annotation files found for {audio_basename}, using the first: {os.path.basename(annotation_file_path)}")
            else:
                logging.info(f"Using annotation file: {os.path.basename(annotation_file_path)}")
            
            labeled_segments_with_text = parse_audacity_labels(annotation_file_path)

        try:
            audio_info = sf.info(audio_path)
            total_duration = audio_info.duration
            sample_rate = audio_info.samplerate
            logging.info(f"Audio duration: {total_duration:.2f}s, Sample rate: {sample_rate}Hz")
        except Exception as e:
            logging.error(f"Could not read audio info for {audio_path}: {e}")
            continue # Skip to the next audio file

        # Classify labeled segments into their respective classes
        if labeled_segments_with_text:
            logging.info(f"Classifying {len(labeled_segments_with_text)} labeled segments into classes...")
            classified_segments = classify_labeled_segments(labeled_segments_with_text, CLASSES)
            
            # Process each class's segments
            for class_name in CLASSES:
                if class_name == 'noise':
                    continue  # Noise is handled separately below
                
                class_segments = classified_segments.get(class_name, [])
                if class_segments:
                    logging.info(f"Processing {len(class_segments)} '{class_name}' segments from {audio_basename_with_ext}...")
                    
                    stats = extract_save_segments_sliding_window(
                        audio_path,
                        class_segments,
                        WINDOW_SIZE_SECONDS,
                        SLIDE_SECONDS,
                        sample_rate,
                        OUTPUT_CLASS_DIRS[class_name],
                        audio_basename,
                        class_name
                    )
                    class_stats[class_name]['total_windows'] += stats['total_windows']
                    class_stats[class_name]['padded_windows'] += stats['padded_windows']
                    class_stats[class_name]['segment_lengths'].extend(stats['segment_lengths'])
        else:
            logging.info(f"No labeled segments found for {audio_basename_with_ext}.")

        # Process Noise Segments (unlabeled regions)
        noise_segments = get_noise_segments(total_duration, labeled_segments_with_text, MIN_SEGMENT_DURATION_SECONDS)
        
        if noise_segments:
            logging.info(f"Processing {len(noise_segments)} noise segments from {audio_basename_with_ext}...")
            noise_stats = extract_save_segments_sliding_window(
                audio_path,
                noise_segments,
                WINDOW_SIZE_SECONDS,
                SLIDE_SECONDS,
                sample_rate,
                OUTPUT_CLASS_DIRS['noise'],
                audio_basename,
                "noise"
            )
            class_stats['noise']['total_windows'] += noise_stats['total_windows']
            class_stats['noise']['padded_windows'] += noise_stats['padded_windows']
        else:
            logging.info(f"No suitable noise segments found in {audio_basename_with_ext}.")
        
        logging.info(f"--- Finished processing audio: {audio_basename_with_ext} ---\n")

    # Calculate and display statistics
    logging.info(f"\n=== Processing Complete ===")
    logging.info(f"Dataset Version: {DATASET_VERSION}")
    logging.info(f"\nWindows Generated by Class:")
    
    for class_name in CLASSES:
        total_windows = class_stats[class_name]['total_windows']
        padded_windows = class_stats[class_name]['padded_windows']
        
        logging.info(f"\n--- {class_name.upper()} ---")
        logging.info(f"Total windows generated: {total_windows}")
        
        if total_windows > 0:
            padding_percentage = (padded_windows / total_windows) * 100
            logging.info(f"Padded windows: {padded_windows}/{total_windows} ({padding_percentage:.2f}%)")
        else:
            logging.info(f"No windows generated for this class")
        
        # Display segment length statistics for non-noise classes
        segment_lengths = class_stats[class_name]['segment_lengths']
        if segment_lengths and class_name != 'noise':
            import statistics
            logging.info(f"Segment length statistics:")
            logging.info(f"  Total segments: {len(segment_lengths)}")
            logging.info(f"  Mean: {statistics.mean(segment_lengths):.3f}s")
            logging.info(f"  Median: {statistics.median(segment_lengths):.3f}s")
            logging.info(f"  Min: {min(segment_lengths):.3f}s")
            logging.info(f"  Max: {max(segment_lengths):.3f}s")
            if len(segment_lengths) > 1:
                logging.info(f"  Std Dev: {statistics.stdev(segment_lengths):.3f}s")
            
            # Count segments shorter than window size
            short_segments = [length for length in segment_lengths if length < WINDOW_SIZE_SECONDS]
            if short_segments:
                short_percentage = (len(short_segments) / len(segment_lengths)) * 100
                logging.info(f"  Segments < window size ({WINDOW_SIZE_SECONDS}s): {len(short_segments)}/{len(segment_lengths)} ({short_percentage:.2f}%)")
            else:
                logging.info(f"  All segments >= window size ({WINDOW_SIZE_SECONDS}s)")
    
    logging.info(f"\n=== Summary ===")
    total_all_windows = sum(class_stats[c]['total_windows'] for c in CLASSES)
    logging.info(f"Total windows across all classes: {total_all_windows}")
    for class_name in CLASSES:
        windows = class_stats[class_name]['total_windows']
        if total_all_windows > 0:
            percentage = (windows / total_all_windows) * 100
            logging.info(f"  {class_name}: {windows} ({percentage:.1f}%)")
        else:
            logging.info(f"  {class_name}: {windows}")

if __name__ == '__main__':
    # Optional: clear existing log handlers for a fresh run in a notebook session.
    # for handler in logging.root.handlers[:]:
    #     logging.root.removeHandler(handler)
    # logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')  # Re-setup if cleared

    # Ensure basicConfig is called if not already (e.g. if the cell with logging.basicConfig wasn't run earlier in the session)
    if not logging.getLogger().hasHandlers():
        logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
            
    main_processing()