# boutRight Code Description

The boutRight project processes bird song recordings to detect and classify bouts using a YOLOv5 model. The code is divided into two main parts:

## Part 1: Setup and Function Definitions

1. **Import Libraries**: The script imports `os`, `glob`, `shutil`, `torch`, `PIL`, `numpy`, `scipy`, `tqdm`, `io`, `IPython.display`, `concurrent.futures`, `uuid`, `pandas`, `math`, `zipfile`, `warnings`, `threading`, and `hashlib`.

2. **Load YOLOv5 Model**: The YOLOv5 model is loaded with custom weights for detecting bouts in spectrogram images.

3. **Define Functions**:
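    - `filtered_spectrogram`: Reads a WAV file, applies a fifth-order Butterworth band-pass filter (500 Hz to 20 kHz), normalizes the signal, and computes a spectrogram that is converted into a three-channel image array.
    - `check_for_bouts`: Saves the spectrogram as a temporary PNG, runs the YOLOv5 model on it, and returns whether any bout (class 0) was detected together with all detections.
    - `process_wav_file`: Processes one WAV file, moves it to the appropriate directory (`Songs` or `Noise_Calls`), appends the detections to the bout and call CSV files, and records the file as scanned.
    - `reprocess_wav_file`: Re-runs detection on a file in `Songs` that is missing from the bouts CSV and moves it to `Noise_Calls` if no bout is found.
    - `append_to_csv`: Appends detection results to a CSV file using a fixed set of 12 columns.
    - `hash_file`: Returns the MD5 hash of a file's contents, which is used to skip duplicate recordings.
    - `get_all_wav_files`: Extracts any ZIP archives, then retrieves all unique WAV files in a hatch day folder, including subdirectories.
    - `is_already_scanned`: Checks if a file has already been scanned by looking up its filename in the list of scanned files.
    - `is_in_bouts_csv`: Checks whether a file has at least one row in the bouts CSV.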
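
`process_wav_file` stores each detection both in spectrogram pixels and in seconds. The conversion is a linear rescaling of the bounding box's horizontal extent by the recording duration. A minimal sketch of that arithmetic follows; the helper name `bbox_to_seconds` and the example numbers are illustrative and do not appear in the notebook.

```python
def bbox_to_seconds(x1, x2, spectrogram_width, n_samples, fs):
    """Map the horizontal pixel coordinates of a detection to seconds in the WAV file.

    Hypothetical helper: mirrors the arithmetic behind the
    wav_file_start_time / wav_file_end_time columns.
    """
    if spectrogram_width <= 0 or fs <= 0:
        raise ValueError("spectrogram_width and fs must be positive")
    duration_s = n_samples / fs             # total recording length in seconds
    scale = duration_s / spectrogram_width  # seconds per spectrogram column
    return x1 * scale, x2 * scale


# Example: a 10 s recording at 44.1 kHz rendered as a 1000-column image.
start_s, end_s = bbox_to_seconds(250, 500, 1000, 441000, 44100)
print(start_s, end_s)
```

The explicit check on `spectrogram_width` matters because `check_for_bouts` returns a width of 0 when it fails; in the notebook this case never reaches the division only because the detection list is empty as well.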

## Part 2: Processing Bird Song Recordings

1. **Set Base Directory**: The base directory containing bird song recordings is specified.

2. **Process Each Bird's Mic Folder**:
    - The script iterates through each bird's mic folder.
    - For each hatch day folder, it checks for the existence of `Songs` and `Noise_Calls` directories and creates them if they don't exist.
    - It loads the list of already scanned files from `wav_scanned_<bird>_<hatch day>.csv` if available.

3. **Process WAV Files**:
    - The script retrieves all WAV files in the hatch day folder, including subdirectories.
    - It uses a thread pool to process each WAV file concurrently.
    - For each file, it generates a spectrogram, detects bouts using the YOLOv5 model, and categorizes the files into `Songs` or `Noise_Calls`.
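    - Detection results are appended to `results_bouts_<bird>_<hatch day>.csv` and `results_calls_<bird>_<hatch day>.csv`.

4. **Handle Duplicates**:
    - Files are keyed by the MD5 hash of their contents, so byte-identical copies of a recording are processed only once. Duplicate copies are skipped, not deleted.

5. **Save Scanned Files**:
    - The list of scanned files is saved to `wav_scanned_<bird>_<hatch day>.csv` after each file, so an interrupted run can resume without reprocessing.

6. **Re-scan Songs**:
    - Any file in `Songs` that has no row in the bouts CSV is scanned again and moved to `Noise_Calls` if no bout is found.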
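
The notebook rewrites the whole scanned-files CSV after every processed file. One possible alternative, sketched here under the assumption that the CSV keeps its single `filename` column, is to hold the names in a set and append one row per file. The helper names `load_scanned` and `mark_scanned` are hypothetical.

```python
import csv
import os
import tempfile
import threading

_ledger_lock = threading.Lock()  # guards writes coming from worker threads


def load_scanned(csv_path):
    """Return the set of filenames already recorded in the ledger."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as fh:
        return {row["filename"] for row in csv.DictReader(fh) if row.get("filename")}


def mark_scanned(csv_path, filename, scanned):
    """Record one filename, writing the header only when the file is new."""
    with _ledger_lock:
        if filename in scanned:
            return
        is_new = not os.path.exists(csv_path)
        with open(csv_path, "a", newline="") as fh:
            writer = csv.writer(fh)
            if is_new:
                writer.writerow(["filename"])
            writer.writerow([filename])
        scanned.add(filename)


# Usage with a throwaway ledger file.
ledger = os.path.join(tempfile.mkdtemp(), "wav_scanned.csv")
seen = load_scanned(ledger)
mark_scanned(ledger, "a.wav", seen)
mark_scanned(ledger, "a.wav", seen)  # second call is a no-op
mark_scanned(ledger, "b.wav", seen)
reloaded = load_scanned(ledger)
```

A set gives constant-time membership checks, and appending keeps the cost of each update independent of how many files have already been scanned.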

## Summary

This script preprocesses bird song recordings: it detects bouts and calls, sorts each file into `Songs` or `Noise_Calls`, and logs every detection to CSV for later analysis.


In [1]:
import os
from glob import glob
import shutil
import torch
from PIL import Image
import numpy as np
from scipy.io import wavfile
from scipy import signal
from tqdm import tqdm
import io
from IPython.display import clear_output
import concurrent.futures
import uuid
import pandas as pd
import math
import zipfile
import warnings
import threading
import hashlib

# Initialize a lock object
lock = threading.Lock()

# Load YOLOv5 model
model = torch.hub.load('ultralytics/yolov5', 'custom', path=os.path.join(os.getcwd(),r'yolov5\runs\train\exp13\weights\best.pt'))
# Folder for temporary spectrogram images used by YOLO detection.
# Unique filenames in this folder let parallel workers run without clashing.
temp_path = r'C:\Temp'

# Function to filter and generate spectrogram
def filtered_spectrogram(filepath):
    # Length of FFT
    lend = 34
    # Overlap of FFT
    overlap = 33
    # Time length for exponential window of FFT
    ts = 3
    # Low cut frequency in Hz
    lc = 500
    # High cut frequency in Hz
    hc = 20000
    # Color of image settings
    # Contribution of each channel to color
    RGBch = [0.8, 1.5, 1.5]

    # Import the audio data
    fs, data = wavfile.read(filepath)
    # Round length of data and overlap
    lend = round((lend / 1E3) * fs)
    overlap = round((overlap / 1E3) * fs)
    # Next power of two definition
    def nextpow2(x):
        return 1 if x == 0 else 2**math.ceil(math.log2(x))
    # Calculate next power of two
    nfft = nextpow2(lend)

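    # Butterworth band-pass filter
    # Note: hc must stay below the Nyquist frequency (fs / 2), or signal.butter raises an error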
    def butter_bp(data, lc, hc, fs, order=3):
        nyq = 0.5 * fs
        low = lc / nyq
        high = hc / nyq
        b, a = signal.butter(order, [low, high], btype='band')
        data_filtered = signal.lfilter(b, a, data)
        return data_filtered

    data = butter_bp(data, lc, hc, fs, order=5)
    # Normalize signal to a peak amplitude of 1
    data = data / np.max(np.abs(data))
    # Make windows for spectrogram
    t = np.linspace(-lend / 2 + 1, lend / 2, num=lend)
    sigma = (ts / 1E3) * fs
    w = np.exp(-(t / sigma)**2)
    dw = np.exp(-(t / (2 * sigma))**2)
    # Calculate spectrograms
    [f, t, sx] = signal.spectrogram(data, fs=fs, window=w, noverlap=overlap, nfft=nfft)
    [_, _, sxx] = signal.spectrogram(data, fs=fs, window=dw, noverlap=overlap, nfft=nfft)
    # Combine both spectrograms: half the log2 of their summed magnitudes
    image_array = np.log2(abs(sx) + abs(sxx)) / 2
    # Obtain thresholds for background
    minmax = [np.percentile(image_array, 80), np.percentile(image_array, 99)]
    # Subtract background
    image_array = np.minimum(image_array, minmax[1])
    image_array = np.maximum(image_array, minmax[0])
    # Normalize
    image_array = (image_array - np.min(image_array)) / (np.max(image_array) - np.min(image_array))
    # Flip spectrogram
    image_array = np.flip(image_array, 0)
    # Convert to color
    sz = (image_array.shape[0] - 1, image_array.shape[1] - 1, 3)
    image_color = np.zeros(sz)
    tmp = image_array
    image_color[:, :, 0] = RGBch[0] * tmp[0:-1, 0:-1]
    tmp = np.diff(image_array, 1, axis=0)
    image_color[:, :, 1] = RGBch[1] * tmp[:, 0:-1]
    tmp = np.diff(image_array, 1, axis=1)
    image_color[:, :, 2] = RGBch[2] * tmp[0:-1, :]
    
    return image_color, fs, len(data)

# Function to generate spectrogram and check for bouts using YOLOv5
def check_for_bouts(wav_file, temp_path):
    temp_img_path = None
    try:
        # Ensure the temp_path directory exists
        os.makedirs(temp_path, exist_ok=True)
        
        # Generate the spectrogram and convert it to an image
        spectrogram, fs, data_length = filtered_spectrogram(wav_file)
        spectrogram_image = (spectrogram * 255).astype(np.uint8)
        pil_image = Image.fromarray(spectrogram_image)
        
        # Generate a unique filename for the temporary spectrogram image in the temp_path directory
        temp_img_filename = f'temp_spectrogram_{uuid.uuid4().hex}.png'
        temp_img_path = os.path.join(temp_path, temp_img_filename)
        pil_image.save(temp_img_path)
        
        # Verify the saved image
        with Image.open(temp_img_path) as img:
            img.verify()
        
        # Use YOLOv5 to detect bouts in the spectrogram image
        results = model(temp_img_path)
        
        # Filter detections to only include bouts (class 0)
        bouts = [bbox for bbox in results.xyxy[0] if bbox[5] == 0]
        
        # Check if any bouts are detected
        return len(bouts) > 0, results.xyxy[0], spectrogram.shape[1], data_length, fs
    except ValueError as e:
        # Report unreadable WAV files; re-raise any other ValueError unchanged
        if "File format" not in str(e):
            raise
        print(f"Error in check_for_bouts for {wav_file}: {e}")
        return False, [], 0, 0, 0
    except Exception as e:
        print(f"Error in check_for_bouts for {wav_file}: {e}")
        return False, [], 0, 0, 0
    finally:
        # Remove the temporary image file
        if temp_img_path and os.path.exists(temp_img_path):
            os.remove(temp_img_path)


# Function to process WAV file
def process_wav_file(wav_file, songs_dir, noise_calls_dir, bouts_csv_path, calls_csv_path, scanned_files, scanned_csv_path, processed_files):
    try:
        has_bouts, bboxes, spectrogram_length, data_length, fs = check_for_bouts(wav_file, temp_path)

        results = []
        for bbox in bboxes:
            x1, y1, x2, y2, conf, cls = bbox
            entry = {
                'wav_folder_path': os.path.dirname(wav_file),
                'wav_filename': os.path.basename(wav_file),
                'spectrogram_start_time': x1.item(),
                'spectrogram_end_time': x2.item(),
                'wav_file_start_time': (x1.item() / spectrogram_length) * (data_length / fs),
                'wav_file_end_time': (x2.item() / spectrogram_length) * (data_length / fs),
                'confidence': conf.item(),
                'label_class': cls.item(),
                'bbox_x1': x1.item(),
                'bbox_y1': y1.item(),
                'bbox_x2': x2.item(),
                'bbox_y2': y2.item()
            }
            results.append(entry)

        # Append results to CSV files
        if results:
            bouts_results = [entry for entry in results if entry['label_class'] == 0]
            calls_results = [entry for entry in results if entry['label_class'] == 1]

            with lock:
                if bouts_results:
                    append_to_csv(bouts_results, bouts_csv_path)
                if calls_results:
                    append_to_csv(calls_results, calls_csv_path)

        # Move the file based on detection results
        if has_bouts:
            shutil.move(wav_file, os.path.join(songs_dir, os.path.basename(wav_file)))
            with lock:
                processed_files.add(os.path.basename(wav_file))  # Thread-safe addition
            print(f"Moved {wav_file} to Songs")
        else:
            shutil.move(wav_file, os.path.join(noise_calls_dir, os.path.basename(wav_file)))
            print(f"Moved {wav_file} to Noise_Calls")

        # Update scanned files and save to CSV immediately
        with lock:
            scanned_files.append(os.path.basename(wav_file))
            pd.DataFrame({'filename': scanned_files}).to_csv(scanned_csv_path, index=False)

        return {
            'filepath': wav_file,
            'filename': os.path.basename(wav_file),
            'bboxes': bboxes,
            'spectrogram_length': spectrogram_length,
            'data_length': data_length,
            'fs': fs
        }
    except Exception as e:
        print(f"Error processing {wav_file}: {e}")
        return None

# Function to re-process WAV file that was missed in the original scan
def reprocess_wav_file(wav_file, noise_calls_dir, bouts_csv_path, calls_csv_path):
    try:
        has_bouts, bboxes, spectrogram_length, data_length, fs = check_for_bouts(wav_file, temp_path)

        results = []
        for bbox in bboxes:
            x1, y1, x2, y2, conf, cls = bbox
            entry = {
                'wav_folder_path': os.path.dirname(wav_file),
                'wav_filename': os.path.basename(wav_file),
                'spectrogram_start_time': x1.item(),
                'spectrogram_end_time': x2.item(),
                'wav_file_start_time': (x1.item() / spectrogram_length) * (data_length / fs),
                'wav_file_end_time': (x2.item() / spectrogram_length) * (data_length / fs),
                'confidence': conf.item(),
                'label_class': cls.item(),
                'bbox_x1': x1.item(),
                'bbox_y1': y1.item(),
                'bbox_x2': x2.item(),
                'bbox_y2': y2.item()
            }
            results.append(entry)

        # Append results to CSV files
        if results:
            bouts_results = [entry for entry in results if entry['label_class'] == 0]
            calls_results = [entry for entry in results if entry['label_class'] == 1]

            with lock:
                if bouts_results:
                    append_to_csv(bouts_results, bouts_csv_path)
                if calls_results:
                    append_to_csv(calls_results, calls_csv_path)

        if has_bouts:
            print(f"Reprocessed and found bouts in {wav_file}")
        else:
            print(f"Reprocessed {wav_file}, no bouts found")
            
            # Move files without bouts to the Noise_Calls folder
            shutil.move(wav_file, os.path.join(noise_calls_dir, os.path.basename(wav_file)))
            print(f"Moved {wav_file} to Noise_Calls after re-scan")

    except Exception as e:
        print(f"Error reprocessing {wav_file}: {e}")


# Function to append results to CSV
def append_to_csv(results, csv_path):
    # Define the expected columns
    columns = [
        'wav_folder_path', 'wav_filename', 'spectrogram_start_time', 'spectrogram_end_time',
        'wav_file_start_time', 'wav_file_end_time', 'confidence', 'label_class',
        'bbox_x1', 'bbox_y1', 'bbox_x2', 'bbox_y2'
    ]
    
    # Convert results to DataFrame
    new_df = pd.DataFrame(results, columns=columns)
    
    # Append to the CSV file
    if os.path.exists(csv_path):
        new_df.to_csv(csv_path, mode='a', header=False, index=False)
    else:
        new_df.to_csv(csv_path, mode='w', header=True, index=False)

# Function to hash individual files to avoid duplicates
def hash_file(filepath):
    """Returns the MD5 hash of the file content."""
    hasher = hashlib.md5()
    with open(filepath, 'rb') as file:
        buf = file.read()
        hasher.update(buf)
    return hasher.hexdigest()

# Function to gather the paths of WAV files in folders and ZIP archives
def get_all_wav_files(hatch_path, bird_path):
    wav_files = {}  # Use a dictionary to avoid duplicates by file content hash
    songs_dir = os.path.join(hatch_path, 'Songs')
    noise_calls_dir = os.path.join(hatch_path, 'Noise_Calls')

    # Extract ZIP files recursively
    zip_files = glob(os.path.join(hatch_path, '**', '*.zip'), recursive=True)
    for zip_file in zip_files:
        with zipfile.ZipFile(zip_file, 'r') as zip_ref:
            zip_ref.extractall(hatch_path)
        
        # Move the ZIP file to the birdname folder (at the same level as the mic folder)
        shutil.move(zip_file, os.path.join(bird_path, os.path.basename(zip_file)))

    # Function to add files based on their content hash
    def add_wav_files(dir_path):
        for wav_file in glob(os.path.join(dir_path, '**', '*.wav'), recursive=True):
            file_hash = hash_file(wav_file)
            if file_hash not in wav_files:
                wav_files[file_hash] = wav_file
    
    # Add WAV files from the main directory, Songs, and Noise_Calls
    add_wav_files(hatch_path)
    if os.path.exists(songs_dir):
        add_wav_files(songs_dir)
    if os.path.exists(noise_calls_dir):
        add_wav_files(noise_calls_dir)
    
    return list(wav_files.values())  # Return only the unique file paths

# Function to check if a file has already been scanned
def is_already_scanned(wav_file, scanned_files):
    return os.path.basename(wav_file) in scanned_files

# Function to check if a song file is in the bouts CSV
def is_in_bouts_csv(wav_file, bouts_csv_path):
    if os.path.exists(bouts_csv_path):
        try:
            bouts_df = pd.read_csv(bouts_csv_path)
            return os.path.basename(wav_file) in bouts_df['wav_filename'].values
        except pd.errors.EmptyDataError:
            return False
    return False

Using cache found in C:\Users\Gonzalez Lab/.cache\torch\hub\ultralytics_yolov5_master
YOLOv5  2024-9-4 Python-3.12.5 torch-2.4.0 CPU

Fusing layers... 
Model summary: 166 layers, 7056607 parameters, 0 gradients
Adding AutoShape... 
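
`get_all_wav_files` keys files by the MD5 hash of their contents so that byte-identical recordings are processed once, even when they are stored under different names. `hash_file` reads each file fully into memory; a variant that hashes in fixed-size chunks is sketched below. The names `hash_file_chunked` and `unique_by_content` and the 1 MiB chunk size are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def hash_file_chunked(filepath, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading chunk_size bytes at a time."""
    hasher = hashlib.md5()
    with open(filepath, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def unique_by_content(paths):
    """Keep the first path seen for each distinct file content."""
    first_seen = {}
    for path in paths:
        first_seen.setdefault(hash_file_chunked(path), path)
    return list(first_seen.values())


# Demo: two byte-identical files and one different file.
tmp = tempfile.mkdtemp()
paths = []
for name, payload in [("a.wav", b"song"), ("copy_of_a.wav", b"song"), ("b.wav", b"call")]:
    path = os.path.join(tmp, name)
    with open(path, "wb") as fh:
        fh.write(payload)
    paths.append(path)

unique = unique_by_content(paths)
print([os.path.basename(p) for p in unique])
```

The digest is identical to the one produced by reading the file in a single call, so the two approaches can be mixed within the same folder.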


In [2]:
# Base directory; suppress the torch.cuda.amp.autocast FutureWarning
base_dir = r'G:\Lab Dropbox\BirdSong\BirdData_2024'
warnings.filterwarnings("ignore", category=FutureWarning, message=".*torch.cuda.amp.autocast.*")


# Process each bird's mic folder
for bird_dir in os.listdir(base_dir):
    bird_path = os.path.join(base_dir, bird_dir)
    mic_path = os.path.join(bird_path, 'mic')

    if not os.path.exists(mic_path):
        continue

    for hatch_folder in os.listdir(mic_path):
        hatch_path = os.path.join(mic_path, hatch_folder)
        bird_name = bird_dir
        hatch_day = hatch_folder

        if not os.path.isdir(hatch_path):
            continue

        songs_dir = os.path.join(hatch_path, 'Songs')
        noise_calls_dir = os.path.join(hatch_path, 'Noise_Calls')

        # Create Songs and Noise_Calls directories if they don't exist
        os.makedirs(songs_dir, exist_ok=True)
        os.makedirs(noise_calls_dir, exist_ok=True)

        bouts_csv_path = os.path.join(hatch_path, f'results_bouts_{bird_name}_{hatch_day}.csv')
        calls_csv_path = os.path.join(hatch_path, f'results_calls_{bird_name}_{hatch_day}.csv')
        scanned_csv_path = os.path.join(hatch_path, f'wav_scanned_{bird_name}_{hatch_day}.csv')

        scanned_files = []
        if os.path.exists(scanned_csv_path):
            try:
                scanned_df = pd.read_csv(scanned_csv_path, on_bad_lines='skip')
                scanned_files = scanned_df['filename'].tolist()
            except pd.errors.EmptyDataError:
                scanned_files = []

        wav_files = get_all_wav_files(hatch_path, bird_path)

        # Filter out already scanned files
        unscanned_wav_files = [wav_file for wav_file in wav_files if not is_already_scanned(wav_file, scanned_files)]

        processed_files = set()  # Shared across worker threads; additions are guarded by `lock`

        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {executor.submit(process_wav_file, wav_file, songs_dir, noise_calls_dir, bouts_csv_path, calls_csv_path, scanned_files, scanned_csv_path, processed_files): wav_file for wav_file in unscanned_wav_files}
            for i, future in enumerate(tqdm(concurrent.futures.as_completed(futures), total=len(unscanned_wav_files), desc="Processing WAV files")):
                if i % 10 == 0:
                    clear_output(wait=True)  # Clear the output every 10 files
                wav_file = futures[future]
                try:
                    result = future.result()
                except Exception as exc:
                    print(f'{wav_file} generated an exception: {exc}')

        # Final check: Ensure all files in Songs appear in both wav_scanned.csv and results_bouts.csv
        song_files_to_rescan = []
        for song_file in glob(os.path.join(songs_dir, '*.wav')):
            if not is_already_scanned(song_file, scanned_files):
                print(f"Appending {song_file} to scanned files")
                with lock:
                    scanned_files.append(os.path.basename(song_file))
                    pd.DataFrame({'filename': scanned_files}).to_csv(scanned_csv_path, index=False)
            if not is_in_bouts_csv(song_file, bouts_csv_path):
                print(f"File {song_file} missing from results_bouts.csv, adding to re-scan list")
                song_files_to_rescan.append(song_file)

        # Parallel processing for the re-scan of song files
        with concurrent.futures.ThreadPoolExecutor() as reprocess_executor:
            reprocess_futures = {reprocess_executor.submit(reprocess_wav_file, song_file, noise_calls_dir, bouts_csv_path, calls_csv_path): song_file for song_file in song_files_to_rescan}
            for i, future in enumerate(tqdm(concurrent.futures.as_completed(reprocess_futures), total=len(song_files_to_rescan), desc="Re-scanning Songs")):
                if i % 10 == 0:
                    clear_output(wait=True)  # Clear the output every 10 iterations
                song_file = reprocess_futures[future]
                try:
                    future.result()
                except Exception as exc:
                    print(f'{song_file} generated an exception during re-scan: {exc}')

Re-scanning Songs:  64%|██████▎   | 202/318 [03:52<02:05,  1.08s/it]

Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54600900_4_10_15_10_0.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54600900_4_10_15_10_0.wav to Noise_Calls after re-scan


Re-scanning Songs:  64%|██████▍   | 203/318 [03:52<01:39,  1.16it/s]

Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54110764_4_10_15_1_50.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54110764_4_10_15_1_50.wav to Noise_Calls after re-scan


Re-scanning Songs:  64%|██████▍   | 204/318 [03:52<01:27,  1.30it/s]

Error in check_for_bouts for G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54193451_4_10_15_3_13.wav: Sizes of tensors must match except in dimension 4. Expected size 3 but got size 1 for tensor number 2 in the list.
Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54193451_4_10_15_3_13.wav, no bouts found
Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.58124017_4_10_16_8_44.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54193451_4_10_15_3_13.wav to Noise_Calls after re-scan
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.58124017_4_10_16_8_44.wav to Noise_Calls after re-scan


Re-scanning Songs:  65%|██████▍   | 206/318 [03:53<01:07,  1.67it/s]

Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.57928953_4_10_16_5_28.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.57928953_4_10_16_5_28.wav to Noise_Calls after re-scan


Re-scanning Songs:  65%|██████▌   | 207/318 [03:54<01:15,  1.47it/s]

Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54716557_4_10_15_11_56.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54716557_4_10_15_11_56.wav to Noise_Calls after re-scan


Re-scanning Songs:  65%|██████▌   | 208/318 [03:55<01:15,  1.46it/s]

Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54249559_4_10_15_4_9.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.54249559_4_10_15_4_9.wav to Noise_Calls after re-scan


Re-scanning Songs:  66%|██████▌   | 209/318 [03:55<01:01,  1.77it/s]

Reprocessed G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.59396340_4_10_16_29_56.wav, no bouts found
Moved G:\Lab Dropbox\BirdSong\BirdData_2024\B028\mic\100\Songs\B028_45392.59396340_4_10_16_29_56.wav to Noise_Calls after re-scan


In [7]:
base_dir = r'H:\birdsongs'
bird_dir = os.listdir(base_dir)
bird_path = os.path.join(base_dir, bird_dir[0])
mic_path = os.path.join(bird_path,'mic')
hatch_folder= os.listdir(mic_path)
hatch_path = os.path.join(mic_path, hatch_folder[2])
hatch_path
songs_dir = os.path.join(hatch_path, 'Songs')
noise_calls_dir = os.path.join(hatch_path, 'Noise_Calls')
bouts_csv_path = os.path.join(hatch_path, 'results_bouts.csv')
calls_csv_path = os.path.join(hatch_path, 'results_calls.csv')
scanned_files = []

# Load scanned files if the CSV exists
scanned_csv_path = os.path.join(hatch_path, 'wav_scanned.csv')
scanned_df = pd.read_csv(scanned_csv_path, on_bad_lines='skip')

scanned_files = scanned_df['filename'].tolist()

hatch_path
wav_files = get_all_wav_files(hatch_path, bird_path)
os.path.basename(wav_files[0]) in scanned_files
unscanned_wav_files = [wav_file for wav_file in wav_files if not is_already_scanned(wav_file, scanned_files)]
unscanned_wav_files

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'H:\\birdsongs'

# Remove noise and call WAV files with the fewest calls

This script processes bird song recordings to manage and retain the top 100 WAV files with the most calls in each hatch day folder. The steps are as follows:

1. **Import Libraries**: The script imports `os`, `csv`, `pandas`, and `glob`.

2. **Set Base Directory**: The base directory containing bird song recordings is specified.

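3. **Define Functions**:
    - `process_hatch_day_folder`: This function processes each hatch day folder by:
        - Checking if the `results_calls_<bird>_<hatch day>.csv` file exists.
        - Loading the CSV file and counting the number of calls in each WAV file.
        - Sorting the files by call count and retaining the top 100.
        - Deleting the remaining files in the `Noise_Calls` folder.
    - `align_csv_columns` and `process_hatch_folders`: These functions rewrite every `results_bout*` and `results_call*` CSV so that rows whose 12 values were shifted by empty cells start at the first column again.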

4. **Process Each Bird's Mic Folder**:
    - The script iterates through each bird's mic folder.
    - For each hatch day folder, it calls the `process_hatch_day_folder` function to manage the WAV files based on the number of calls.

This script helps in organizing and retaining the most relevant bird song recordings for further analysis.
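
The selection step can be separated from the deletion step, which makes a dry run possible before any file is removed. The sketch below uses `collections.Counter` in place of the pandas `groupby` in the notebook; the function name `files_to_delete` and the toy filenames are hypothetical.

```python
from collections import Counter


def files_to_delete(call_rows, present_files, keep=100):
    """Return the files in present_files that fall outside the top `keep` by call count.

    call_rows: iterable of filenames, one entry per detected call.
    present_files: filenames currently in the Noise_Calls folder.
    """
    counts = Counter(call_rows)
    keepers = {name for name, _ in counts.most_common(keep)}
    return sorted(name for name in present_files if name not in keepers)


# Toy data: c.wav has 3 calls, a.wav has 2, b.wav has 1, d.wav has none.
rows = ["a.wav", "c.wav", "b.wav", "c.wav", "a.wav", "c.wav"]
present = ["a.wav", "b.wav", "c.wav", "d.wav"]
doomed = files_to_delete(rows, present, keep=2)
print(doomed)
```

As in the notebook, a file with no detected calls (`d.wav` here) never enters the top list and is therefore marked for deletion. Printing the returned list before calling `os.remove` is a cheap safeguard, since the deletion cannot be undone.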


In [4]:
import os
import csv
import pandas as pd
from glob import glob

# Base directory
base_dir = r'G:\Lab Dropbox\BirdSong\BirdData_2024'

# Function to process each hatch day folder
def process_hatch_day_folder(hatch_path, bird_name, hatch_day):
    calls_csv_path = os.path.join(hatch_path, f'results_calls_{bird_name}_{hatch_day}.csv')
    noise_calls_dir = os.path.join(hatch_path, 'Noise_Calls')
    
    # Check if the calls CSV file exists
    if not os.path.exists(calls_csv_path):
        print(f"No CSV file found for {hatch_path}. Skipping...")
        return
    
    try:
        # Load the calls CSV file
        calls_df = pd.read_csv(calls_csv_path, on_bad_lines='skip')
        
        if calls_df.empty:
            print(f"No data in {calls_csv_path}. Skipping...")
            return
        
        # Group by filename and count the number of calls
        call_counts = calls_df.groupby('wav_filename').size().reset_index(name='call_count')
        
        # Sort by call count in descending order and keep the top 100
        top_calls = call_counts.sort_values(by='call_count', ascending=False).head(100)
        
        # Get the set of top 100 filenames for fast membership tests
        top_filenames = set(top_calls['wav_filename'])
        
        # Get all WAV files in the Noise_Calls folder
        all_wav_files = glob(os.path.join(noise_calls_dir, '*.wav'))
        
        # Delete files not in the top 100 (this includes files with no detected calls)
        for wav_file in all_wav_files:
            if os.path.basename(wav_file) not in top_filenames:
                os.remove(wav_file)
                print(f"Deleted {wav_file}")
    except pd.errors.EmptyDataError:
        print(f"EmptyDataError: No columns to parse from file {calls_csv_path}")

# Process each bird's mic folder
for bird_dir in os.listdir(base_dir):
    bird_path = os.path.join(base_dir, bird_dir)
    mic_path = os.path.join(bird_path, 'mic')
    
    if not os.path.exists(mic_path):
        continue
    
    for hatch_folder in os.listdir(mic_path):
        hatch_path = os.path.join(mic_path, hatch_folder)
        
        if not os.path.isdir(hatch_path):
            continue
        
        bird_name = bird_dir
        hatch_day = hatch_folder
        
        process_hatch_day_folder(hatch_path, bird_name, hatch_day)


def align_csv_columns(file_path, expected_columns=12):
    aligned_rows = []
    
    try:
        with open(file_path, 'r', newline='') as infile:
            reader = csv.reader(infile)
            for row in reader:
                # Collect the non-empty cells
                non_empty_cells = [cell for cell in row if cell]
                
                if len(non_empty_cells) == expected_columns:
                    # Drop the empty cells so the data starts at the first column
                    aligned_row = non_empty_cells
                else:
                    aligned_row = row
                
                # Truncate the row to at most expected_columns cells
                aligned_row = aligned_row[:expected_columns]
                
                aligned_rows.append(aligned_row)
        
        with open(file_path, 'w', newline='') as outfile:
            writer = csv.writer(outfile)
            writer.writerows(aligned_rows)
        print(f"CSV columns aligned successfully for {file_path}.")
    
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")

def process_hatch_folders(base_path, expected_columns=12):
    for root, dirs, files in os.walk(base_path):
        for file in files:
            if file.startswith('results_bout') or file.startswith('results_call'):
                file_path = os.path.join(root, file)
                align_csv_columns(file_path, expected_columns)

# Realign the columns of the results CSV files
process_hatch_folders(base_dir)


Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24389694_5_9_6_46_29.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24429282_5_9_6_47_9.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24447321_5_9_6_47_27.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24478364_5_9_6_47_58.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24510013_5_9_6_48_30.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24546328_5_9_6_49_6.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24615920_5_9_6_50_15.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24666698_5_9_6_51_6.wav
Deleted G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\mic\129\Noise_Calls\8UWL2_45421.24834919_5_9_6_53_54.wav
Deleted G:\Lab Dropbox

# Simple bout analytics

This script plots, for each bird, a heatmap of how bout durations are distributed across hatch days. The steps are as follows:

1. **Import Libraries**: The script imports `pandas`, `matplotlib.pyplot`, `seaborn`, and `os`.

2. **Set Base Directory**: The base directory containing bird song recordings is specified.

3. **Define Function**:
    - `plot_bout_duration_heatmap`: For each hatch day folder, this function loads `results_bouts_<bird>_<hatch day>.csv`, computes each bout's duration as end time minus start time, bins durations between 0 and 6 seconds into 20 bins, and normalizes each hatch day's counts to probabilities. It saves the heatmap as `bout_duration_heatmap.svg` in the bird's folder.

4. **Process Each Bird's Mic Folder**:
    - The script iterates through each bird folder that contains a `mic` subfolder.
    - For each bird, it calls `plot_bout_duration_heatmap` to generate and save the heatmap.

5. **Output**: The script prints the path of each saved heatmap, or a message when a bird has no valid bout data.

This script helps in visualizing how the distribution of bout durations changes across hatch days.
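
The code cell that follows bins bout durations per hatch day and divides each row by its total, so every row of the heatmap is a probability distribution. A dependency-free sketch of that binning and normalization is given here; the function name `duration_probabilities` is hypothetical. Bin edges in this sketch are left-closed, whereas `pd.cut` defaults to right-closed intervals, so a value lying exactly on an edge may land in a neighbouring bin.

```python
def duration_probabilities(durations, max_duration=6.0, num_bins=20):
    """Bin durations in [0, max_duration] into equal-width bins and normalize to sum to 1."""
    width = max_duration / num_bins
    counts = [0] * num_bins
    for d in durations:
        if 0 <= d <= max_duration:
            # Clamp so that d == max_duration falls in the last bin.
            idx = min(int(d / width), num_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_bins  # no valid bouts: all-zero row
    return [c / total for c in counts]


# Example with 3 bins of width 2 s; the 7.5 s bout is out of range and ignored.
probs = duration_probabilities([0.5, 1.0, 2.5, 6.0, 7.5], max_duration=6.0, num_bins=3)
print(probs)
```

Normalizing per row makes hatch days with very different numbers of bouts comparable, at the cost of hiding how many bouts each row is based on.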


In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Base directory
base_dir = r'G:\Lab Dropbox\BirdSong\BirdData_2024'

# Function to plot and save the heatmap of bout duration distributions across hatch days
def plot_bout_duration_heatmap(bird_path, bird_name):
    mic_path = os.path.join(bird_path, 'mic')
    hatch_folders = [f for f in os.listdir(mic_path) if os.path.isdir(os.path.join(mic_path, f))]
    
    # Dictionary to store bout durations for each hatch day
    bout_durations = {}
    
    for hatch_folder in hatch_folders:
        hatch_path = os.path.join(mic_path, hatch_folder)
        bouts_file = os.path.join(hatch_path, f'results_bouts_{bird_name}_{hatch_folder}.csv')
        
        if os.path.exists(bouts_file):
            try:
                bouts_df = pd.read_csv(bouts_file)
                if not bouts_df.empty:
                    bouts_df['duration'] = bouts_df['wav_file_end_time'] - bouts_df['wav_file_start_time']
                    bout_durations[hatch_folder] = bouts_df['duration'].tolist()
            except pd.errors.EmptyDataError:
                print(f"EmptyDataError: No columns to parse from file {bouts_file}")
    
    if not bout_durations:
        print(f"No valid bout data found for {bird_name}")
        return
    
    # Define fixed-width duration bins covering 0 to max_duration seconds
    max_duration = 6  # Limit the duration to 6 seconds
    num_bins = 20
    duration_bins = [i * max_duration / num_bins for i in range(num_bins + 1)]  # Bin edges

    # One row per hatch day and one column per bin, initialized to zero counts
    heatmap_data = pd.DataFrame(0.0, index=hatch_folders, columns=range(num_bins))

    for hatch_folder, durations in bout_durations.items():
        # Filter durations to be within 0 to 6 seconds
        filtered_durations = [d for d in durations if 0 <= d <= max_duration]
        if filtered_durations:
            # include_lowest=True keeps durations of exactly 0 in the first bin
            duration_counts = pd.cut(filtered_durations, bins=duration_bins, include_lowest=True).value_counts()
            heatmap_data.loc[hatch_folder] = duration_counts.values
    
    # Normalize the data to represent probabilities
    heatmap_data = heatmap_data.div(heatmap_data.sum(axis=1), axis=0).fillna(0)
    
    # Plot the heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(heatmap_data, cmap="YlGnBu", cbar_kws={'label': 'Probability'}, xticklabels=[f'{x:.2f}' for x in duration_bins[:-1]])
    plt.title(f'Bout Duration Distribution Heatmap for {bird_name}')
    plt.xlabel('Duration (seconds)')
    plt.ylabel('Hatch Day')
    
    # Save the heatmap as a vector image
    heatmap_file = os.path.join(bird_path, 'bout_duration_heatmap.svg')
    plt.savefig(heatmap_file)
    plt.close()
    print(f"Heatmap saved for {bird_name} at {heatmap_file}")


# Process each bird's mic folder
for bird_dir in os.listdir(base_dir):
    bird_path = os.path.join(base_dir, bird_dir)
    mic_path = os.path.join(bird_path, 'mic')
    
    if not os.path.exists(mic_path):
        continue
    
    bird_name = bird_dir
    
    # Plot and save the heatmap of bout duration distributions across hatch days
    plot_bout_duration_heatmap(bird_path, bird_name)

print("Heatmaps have been generated and saved successfully.")


Heatmap saved for 8UWL2 at G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\bout_duration_heatmap.svg


Heatmap saved for B028 at G:\Lab Dropbox\BirdSong\BirdData_2024\B028\bout_duration_heatmap.svg
No valid bout data found for B033
No valid bout data found for B045
No valid bout data found for B17Y4
No valid bout data found for Bk19
No valid bout data found for Bk52W16
No valid bout data found for BK6
No valid bout data found for BK8W98
No valid bout data found for EVC2303
No valid bout data found for Gr30
No valid bout data found for NoBa01
No valid bout data found for NoBa02
No valid bout data found for Or04
No valid bout data found for OR09V022V023
No valid bout data found for R02
No valid bout data found for R03
No valid bout data found for R04
No valid bout data found for R07
No valid bout data found for R08
No valid bout data found for R104
No valid bout data found for R120
No valid bout data found for R15
No valid bout data found for R155
No valid bout data found for R156
No valid bout data found for R159
No valid bout data found for R16
No valid bout data found for R17
No valid 

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Base directory
base_dir = r'G:\Lab Dropbox\BirdSong\BirdData_2024'

# Function to plot and save the heatmap of bout duration distributions across hatch days
def plot_bout_duration_heatmap(bird_path, bird_name):
    mic_path = os.path.join(bird_path, 'mic')
    hatch_folders = [f for f in os.listdir(mic_path) if os.path.isdir(os.path.join(mic_path, f))]
    
    # Dictionary to store bout durations for each hatch day
    bout_durations = {}
    
    for hatch_folder in hatch_folders:
        hatch_path = os.path.join(mic_path, hatch_folder)
        bouts_file = os.path.join(hatch_path, f'results_bouts_{bird_name}_{hatch_folder}.csv')
        
        if os.path.exists(bouts_file):
            try:
                bouts_df = pd.read_csv(bouts_file)
                if not bouts_df.empty:
                    bouts_df['duration'] = bouts_df['wav_file_end_time'] - bouts_df['wav_file_start_time']
                    bout_durations[hatch_folder] = bouts_df['duration'].tolist()
            except pd.errors.EmptyDataError:
                print(f"EmptyDataError: No columns to parse from file {bouts_file}")
    
    if not bout_durations:
        print(f"No valid bout data found for {bird_name}")
        return
    
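    # Build one-second-wide bins that cover the longest observed bout
    max_duration = max(max(durations) for durations in bout_durations.values())
    duration_bins = list(range(0, int(max_duration) + 2))  # Bin edges in seconds

    # One column per bin, labeled by the bin's lower edge, initialized to zero counts
    heatmap_data = pd.DataFrame(0.0, index=hatch_folders, columns=duration_bins[:-1])

    for hatch_folder, durations in bout_durations.items():
        # Count bouts per bin; .values avoids aligning Interval labels against the integer columns
        duration_counts = pd.cut(durations, bins=duration_bins, include_lowest=True).value_counts()
        heatmap_data.loc[hatch_folder] = duration_counts.values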
    
    # Normalize the data to represent probabilities
    heatmap_data = heatmap_data.div(heatmap_data.sum(axis=1), axis=0).fillna(0)
    
    # Plot the heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(heatmap_data, cmap="YlGnBu", cbar_kws={'label': 'Probability'})
    plt.title(f'Bout Duration Distribution Heatmap for {bird_name}')
    plt.xlabel('Duration (seconds)')
    plt.ylabel('Hatch Day')
    
    # Save the heatmap as a vector image
    heatmap_file = os.path.join(bird_path, 'bout_duration_heatmap.svg')
    plt.savefig(heatmap_file)
    plt.close()
    print(f"Heatmap saved for {bird_name} at {heatmap_file}")


# Process each bird's mic folder
for bird_dir in os.listdir(base_dir):
    bird_path = os.path.join(base_dir, bird_dir)
    mic_path = os.path.join(bird_path, 'mic')
    
    if not os.path.exists(mic_path):
        continue
    
    bird_name = bird_dir
    
    # Plot and save the heatmap of bout duration distributions across hatch days
    plot_bout_duration_heatmap(bird_path, bird_name)

print("Heatmaps have been generated and saved successfully.")


Heatmap saved for 8UWL2 at G:\Lab Dropbox\BirdSong\BirdData_2024\8UWL2\bout_duration_heatmap.svg


Heatmap saved for B028 at G:\Lab Dropbox\BirdSong\BirdData_2024\B028\bout_duration_heatmap.svg
No valid bout data found for B033
No valid bout data found for B045
No valid bout data found for B17Y4
No valid bout data found for Bk19
No valid bout data found for Bk52W16
No valid bout data found for BK6
No valid bout data found for BK8W98
No valid bout data found for EVC2303
No valid bout data found for Gr30
No valid bout data found for NoBa01
No valid bout data found for NoBa02
No valid bout data found for Or04
No valid bout data found for OR09V022V023
No valid bout data found for R02
No valid bout data found for R03
No valid bout data found for R04
No valid bout data found for R07
No valid bout data found for R08
No valid bout data found for R104
No valid bout data found for R120
No valid bout data found for R15
No valid bout data found for R155
No valid bout data found for R156
No valid bout data found for R159
No valid bout data found for R16
No valid bout data found for R17
No valid 

# Bout Entropy Analytics

This script processes bird song recordings to calculate and visualize the spectral entropy of detected bouts. The steps are as follows:

1. **Import Libraries**: The script imports necessary libraries including `os`, `pandas`, `numpy`, `scipy`, and `matplotlib`.

2. **Define Function**:
    - `calculate_spectral_entropy`: This function computes the spectral entropy of each time frame of a signal from its spectrogram (`scipy.signal.spectrogram`, a short-time Fourier analysis).

3. **Set Base Directory**: The base directory containing bird song recordings is specified.

4. **Process Each Bird's Mic Folder**:
    - The script iterates through each bird's mic folder.
    - For each hatch day folder, it checks for the existence of `Songs` and `Noise_Calls` directories.
    - It loads the results from `results_bouts.csv` if available.

5. **Process Each Bout**:
    - For each bout in the CSV file, the corresponding WAV file is read.
    - A region of interest (ROI) around the bout is extracted with 1 second of padding on each side, clipped to the start and end of the file.
    - The spectral entropy of the bout signal is calculated and stored.

6. **Sort and Align Bouts**:
    - The bouts are sorted in descending order of their total (summed) spectral entropy.
    - Each bout's entropy trace is aligned to that of the highest-entropy bout using cross-correlation.

7. **Visualize Spectral Entropy**:
    - A figure is created where each row represents a bout, the x-axis is time, and the color represents the spectral entropy.
    - The figure is saved as an SVG file and displayed.

This script supports analysis of bird song complexity by calculating, aligning, and visualizing the spectral entropy of detected bouts.
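To illustrate what per-frame spectral entropy measures, here is a minimal sketch that contrasts a pure tone with white noise. It is an assumption-laden stand-in, not the notebook's function: it frames the signal with plain NumPy (Hann window plus `np.fft.rfft`) rather than `scipy.signal.spectrogram`. The name `frame_spectral_entropy`, the 32 kHz sample rate, and the test signals are hypothetical.

```python
import numpy as np

def frame_spectral_entropy(signal, nperseg=256, noverlap=128, eps=1e-10):
    """Shannon entropy (bits) of the normalized power spectrum of each frame."""
    signal = np.asarray(signal, dtype=float)
    hop = nperseg - noverlap
    n_frames = 1 + (len(signal) - nperseg) // hop
    window = np.hanning(nperseg)
    entropies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + nperseg] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Treat the frame's power spectrum as a probability distribution over frequency
        p = power / (power.sum() + eps)
        entropies[i] = -np.sum(p * np.log2(p + eps))
    return entropies

fs = 32000  # hypothetical sample rate in Hz
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 4000 * t)      # energy concentrated in a few bins
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)          # energy spread across all bins

tone_entropy = frame_spectral_entropy(tone).mean()
noise_entropy = frame_spectral_entropy(noise).mean()
max_entropy = np.log2(256 // 2 + 1)      # upper bound: uniform over 129 frequency bins
print(tone_entropy, noise_entropy, max_entropy)
```

The tone yields a low entropy because nearly all of its power sits in a few adjacent frequency bins, while the noise approaches the upper bound set by the number of bins.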


In [25]:
import os
import pandas as pd
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

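# Function to calculate spectral entropy (in bits) for each spectrogram time frame
def calculate_spectral_entropy(signal, fs, nperseg=256, noverlap=128):
    f, t, Sxx = spectrogram(signal, fs, nperseg=nperseg, noverlap=noverlap)
    # Normalize each frame's power spectrum to sum to 1; silent frames stay all-zero instead of producing 0/0
    frame_power = np.sum(Sxx, axis=0, keepdims=True)
    Sxx_norm = np.divide(Sxx, frame_power, out=np.zeros_like(Sxx), where=frame_power > 0)
    # The small constant keeps log2 finite for empty frequency bins
    spectral_entropy = -np.sum(Sxx_norm * np.log2(Sxx_norm + 1e-10), axis=0)
    return spectral_entropy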

# Base directory
base_dir = r'C:\Users\ucsfg\Documents\Code\boutRight_v3\birdsong'

# Process each bird's mic folder
for bird_dir in os.listdir(base_dir):
    bird_path = os.path.join(base_dir, bird_dir)
    mic_path = os.path.join(bird_path, 'mic')
    
    if not os.path.exists(mic_path):
        continue
    
    for hatch_folder in os.listdir(mic_path):
        hatch_path = os.path.join(mic_path, hatch_folder)
        
        if not os.path.isdir(hatch_path):
            continue
        
        songs_dir = os.path.join(hatch_path, 'Songs')
        noise_calls_dir = os.path.join(hatch_path, 'Noise_Calls')
        
        if not os.path.exists(songs_dir) or not os.path.exists(noise_calls_dir):
            continue
        
        # Load results from CSV
        results_csv = os.path.join(hatch_path, 'results_bouts.csv')
        if not os.path.exists(results_csv):
            continue
        
        results_df = pd.read_csv(results_csv)

        # Process each bout
        bouts = []
        for index, row in results_df.iterrows():
            wav_file = row['wav_filename']
            wav_file_path = os.path.join(songs_dir, wav_file)
            try:
                fs, data = wavfile.read(wav_file_path)
            except (PermissionError, FileNotFoundError) as e:
                # Skip files that are locked or no longer present in the Songs folder
                print(f"{type(e).__name__}: {e}")
                continue
            
            roi_wav_start = int(row['wav_file_start_time'] * fs)
            roi_wav_end = int(row['wav_file_end_time'] * fs)
            
            # Add 1 second padding at start and end of each bout
            start_idx = max(0, roi_wav_start - fs)
            end_idx = min(len(data), roi_wav_end + fs)
            
            bout_signal = data[start_idx:end_idx]
            spectral_entropy = calculate_spectral_entropy(bout_signal, fs)
            
            bouts.append({
                'spectral_entropy': spectral_entropy,
                'total_spectral_entropy': np.sum(spectral_entropy),
                'start_idx': start_idx,
                'end_idx': end_idx,
                'wav_file': wav_file_path
            })

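        # Nothing to plot if no bout could be read for this hatch day
        if not bouts:
            continue

        # Sort bouts based on total summed spectral entropy
        bouts.sort(key=lambda x: x['total_spectral_entropy'], reverse=True)

        # Align every bout's entropy trace to the highest-entropy bout using cross-correlation
        aligned_bouts = []
        reference_bout = bouts[0]['spectral_entropy']
        for bout in bouts:
            correlation = np.correlate(reference_bout, bout['spectral_entropy'], mode='full')
            # In 'full' mode, zero lag sits at index len(second input) - 1
            shift = np.argmax(correlation) - (len(bout['spectral_entropy']) - 1)
            # np.roll shifts circularly, so samples pushed past one end wrap around to the other
            aligned_bout = np.roll(bout['spectral_entropy'], shift)
            aligned_bouts.append(aligned_bout)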

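        # Make a figure in which each row is a bout, the x-axis is time, and the color is the spectral entropy
        hop = 256 - 128  # Samples between entropy frames (default nperseg - noverlap above)
        # Share one color scale across rows so that colors are comparable between bouts
        vmin = min(b.min() for b in aligned_bouts)
        vmax = max(b.max() for b in aligned_bouts)
        plt.figure(figsize=(10, len(aligned_bouts)))
        for i, aligned_bout in enumerate(aligned_bouts):
            plt.imshow(aligned_bout[np.newaxis, :], aspect='auto', cmap='viridis', vmin=vmin, vmax=vmax,
                       extent=[0, len(aligned_bout) * hop / fs, i, i + 1])
        # Set the limits explicitly so that every row is visible, not just the most recently drawn image
        plt.xlim(0, max(len(b) for b in aligned_bouts) * hop / fs)
        plt.ylim(0, len(aligned_bouts))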
        
        plt.xlabel('Time (s)')
        plt.ylabel('Bout')
        plt.yticks(np.arange(len(aligned_bouts)) + 0.5, np.arange(1, len(aligned_bouts) + 1))
        plt.gca().invert_yaxis()
        plt.tight_layout()
        output_svg = os.path.join(hatch_path, 'bouts_spectral_entropy.svg')
        plt.savefig(output_svg, format='svg')
        plt.show()
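
The alignment step can be checked in isolation on synthetic traces. The sketch below is a hedged illustration, not the notebook's code: the helper name `estimate_shift` and the pulse traces are hypothetical, and subtracting the mean before correlating is an added assumption that the loop above does not make. It relies on the `np.correlate` convention that, in `'full'` mode, zero lag sits at index `len(second input) - 1`.

```python
import numpy as np

def estimate_shift(reference, trace):
    """Return how many samples to roll `trace` so it best matches `reference`."""
    ref = np.asarray(reference, dtype=float)
    trc = np.asarray(trace, dtype=float)
    # Remove the mean so the peak reflects matching shape rather than maximal overlap
    ref = ref - ref.mean()
    trc = trc - trc.mean()
    corr = np.correlate(ref, trc, mode="full")
    # Index of zero lag in 'full' mode is len(trc) - 1
    return int(np.argmax(corr)) - (len(trc) - 1)

# Reference pulse at samples 20..24; the second trace is shorter and its pulse is 7 samples later
reference = np.zeros(50)
reference[20:25] = 1.0
delayed = np.zeros(40)
delayed[27:32] = 1.0

shift = estimate_shift(reference, delayed)
aligned = np.roll(delayed, shift)  # circular shift: samples wrap around the ends
print(shift, np.flatnonzero(aligned))
```

Because the two traces have different lengths, computing the offset from the length of the reference instead of the length of the shifted trace would give a different, incorrect shift here.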
