# Data Preparation for Training and EDA

Prepares a subset of audio files for training and exploratory data analysis (EDA).

## Tasks
1. Count the number of processed WAV files in the dataset directory.
2. Select a subset of audio files totaling ~5 hours in duration.
   - Each audio file is assumed to be 16 seconds long.
   - Files are randomly shuffled before selection.
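The selection logic described above can be sketched as a small function. This is an illustrative sketch, not the notebook's own code: the name `select_subset`, the `seed` parameter, and the sorting step are assumptions added here so that the same subset is drawn on every run. The example file names are placeholders.

```python
import random


def select_subset(files, file_duration_sec=16, target_hours=5, seed=0):
    """Return a reproducible random subset covering roughly target_hours.

    Hypothetical helper: assumes every file lasts file_duration_sec seconds.
    """
    target_seconds = target_hours * 3600
    files_needed = int(target_seconds // file_duration_sec)

    # Sort first so the result does not depend on the order glob returns.
    shuffled = sorted(files)

    # A dedicated, seeded generator keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:files_needed]


# Placeholder names standing in for the 4386 processed files.
example_files = [f"clip_{i:04d}_processed.wav" for i in range(4386)]
subset = select_subset(example_files)
print(len(subset), len(subset) * 16 / 3600)  # prints: 1125 5.0
```

With 16-second files, 5 hours is 18000 seconds, which divides into exactly 1125 files.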

## Dependencies
- Python 3.10+
- Standard library modules only: `glob`, `os`, `random`, `shutil`

## Usage
- Run all cells sequentially.
- Modify `directory`, `file_duration_sec`, or `target_hours` to adjust subset size or source location.


## Task 1: Count the number of processed WAV files

In [None]:
import glob
import os

# Directory containing the processed WAV files
directory = "../output_preprocess_rave/processed/"

# Pattern to match *_processed.wav files
wav_pattern = os.path.join(directory, "*_processed.wav")

# Get all matching files
wav_files = glob.glob(wav_pattern)

# Count them
print(f"Number of *_processed.wav files: {len(wav_files)}")

Number of *_processed.wav files: 4386


## Task 2: Select a subset of audio files (~5 hours)

In [None]:
import os
import glob
import random
import shutil

# Directory containing your original files
directory = "../output_preprocess_rave/processed/"

# Pattern to match wav files
wav_pattern = os.path.join(directory, "*_processed.wav")

# Get all wav files matching pattern
wav_files = glob.glob(wav_pattern)

print(f"Found {len(wav_files)} wav files.")

# Each file duration in seconds
file_duration_sec = 16
target_hours = 5
target_seconds = target_hours * 3600

# Number of files needed to reach the target duration
files_needed = target_seconds // file_duration_sec

# Shuffle files in place (no seed is set, so the selection differs between runs)
random.shuffle(wav_files)

# Select subset
selected_files = wav_files[:int(files_needed)]

print(f"Selecting {len(selected_files)} files for approximately {target_hours} hours of audio.")

# Create new directory for 5h subset
output_dir = os.path.join(directory, "wav_files_5h")
os.makedirs(output_dir, exist_ok=True)

# Copy selected files to new directory
for filepath in selected_files:
    filename = os.path.basename(filepath)
    dest_path = os.path.join(output_dir, filename)
    shutil.copy2(filepath, dest_path)

print(f"Copied selected files to {output_dir}")


Found 4386 wav files.
Selecting 1125 files for approximately 5 hours of audio.
Copied selected files to ../../output_preprocess_rave/processed/wav_files_5h
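Because the subset size relies on the assumption that every file is 16 seconds long, it may be worth checking the real durations. The sketch below uses the standard library `wave` module. The helper names `wav_duration_sec` and `total_hours` are hypothetical, and the approach assumes the files are uncompressed PCM WAV, which is the only format `wave` reads.

```python
import wave


def wav_duration_sec(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(paths):
    """Return the summed duration of the given WAV files in hours."""
    return sum(wav_duration_sec(p) for p in paths) / 3600
```

Calling `total_hours(selected_files)` after the selection step would show how close the subset is to `target_hours`. A result well below the target suggests that some files are shorter than `file_duration_sec`.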
