# Capuchin Bird Call Detection System

This notebook implements a deep learning system for detecting Capuchin bird calls in forest audio recordings. It loads previously trained weights into an EfficientNet-B0 convolutional neural network and runs inference with a two-stage sliding window algorithm that limits detailed analysis to promising segments.

## System Overview

The Capuchin bird detection system processes audio recordings through:
1. **Feature Extraction**: Converting audio to mel spectrograms
2. **Two-Stage Detection**: 
   - First stage: Quick screening of large audio segments (6-second windows)
   - Second stage: Detailed analysis of promising segments
3. **Classification**: Using an EfficientNet-B0 model to classify audio segments

This approach balances efficiency and accuracy by only performing detailed analysis on segments that show potential bird calls.

## Configuration Parameters

The `Config` class centralizes all system parameters including:
- **Audio Processing**: Sample rate (16 kHz), mel bands (128), FFT size (2048), hop length (512)
- **Detection Thresholds**: Stage 1 (0.6) and Stage 2 (0.5) confidence thresholds
- **Window Parameters**: Outer window duration (6 s), inner step duration (0.3 s)
- **File Paths**: Data locations, model weights, and output paths
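
The sample rate and hop length together determine how many spectrogram columns a clip produces. The sketch below works through that arithmetic, assuming librosa's default centered STFT (frame count `1 + n_samples // hop_length`); `n_frames` is a hypothetical helper, not part of the notebook.

```python
# Frame-count arithmetic for a centered STFT (librosa's default behaviour).
TARGET_SR = 16000   # samples per second, as in Config.TARGET_SR
HOP_LENGTH = 512    # samples between frames, as in Config.HOP_LENGTH


def n_frames(duration_s, sr=TARGET_SR, hop=HOP_LENGTH):
    """Return the number of spectrogram columns for a clip of this duration."""
    n_samples = int(round(duration_s * sr))
    return 1 + n_samples // hop


for duration in (0.3, 5.0, 6.0):
    print(f"{duration} s -> {n_frames(duration)} frames")
```

Under these assumptions a 0.3-second chunk yields 10 frames, a 5-second clip yields 157, and a 6-second window yields 188. The model's input width of 157 therefore matches a 5-second clip, so a 6-second window has to be cropped and a 0.3-second chunk padded to fit.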

In [5]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import librosa
import librosa.display
import soundfile as sf
import random

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Input
from tensorflow.keras.applications import EfficientNetB0


In [6]:
class Config:
    """
    Configuration class for Capuchin bird call detection.
    
    This class centralizes all settings and parameters used for inference,
    including paths, audio processing parameters, and detection thresholds.
    """
    
    # Paths
    REAL_WORLD_DATA_FOLDER = 'data/Forest Recordings'
    MANUAL_COUNTS_CSV_PATH = 'answer/capuchin_correct_call_counts.csv'
    MODEL_WEIGHTS_FILE = 'weights/capuchin_bird_classifier.weights.h5'
    
    # Audio processing parameters
    TARGET_SR = 16000
    N_MELS = 128
    N_FFT = 2048
    HOP_LENGTH = 512
    RES_TYPE = 'kaiser_best'
    
    # Sliding window parameters
    THRESHOLD_STAGE1 = 0.6
    THRESHOLD_STAGE2 = 0.5
    WINDOW_DURATION_OUTER = 6.0
    STEP_DURATION_INNER = 0.3
    OVERLAP_INNER = 0.0
    
    # Input shape for the model
    INPUT_SHAPE = (128, 157, 3)
    
    @classmethod
    def setup(cls):
        """Set random seeds for reproducibility."""
        seed_value = 42
        os.environ['PYTHONHASHSEED'] = str(seed_value)
        random.seed(seed_value)
        np.random.seed(seed_value)
        tf.random.set_seed(seed_value)

In [7]:
Config.setup()
print(f"TensorFlow version: {tf.__version__}")

TensorFlow version: 2.18.0


## Feature Extraction

Audio signals are converted to mel spectrograms, a time-frequency representation whose frequency axis is spaced to approximate human auditory perception. Each spectrogram is converted to decibels relative to its own maximum, optionally padded or cropped to a fixed number of time frames, and later stacked into three identical channels to match the network's image-style input.
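
The shaping step can be illustrated without any audio. The sketch below uses NumPy only and stand-in arrays instead of real spectrograms; `to_model_input` is a hypothetical helper that mirrors the pad-or-crop behaviour of `librosa.util.fix_length` (zero padding on the right by default) followed by the channel stacking used later in the notebook.

```python
import numpy as np


def to_model_input(spec_db, target_frames=157):
    """Pad or crop a (n_mels, frames) array along time, then add channel and batch axes."""
    frames = spec_db.shape[1]
    if frames < target_frames:
        # Zero-pad on the right, as fix_length does with its default settings.
        spec_db = np.pad(spec_db, ((0, 0), (0, target_frames - frames)))
    else:
        spec_db = spec_db[:, :target_frames]
    rgb = np.stack([spec_db] * 3, axis=-1)   # (n_mels, target_frames, 3)
    return rgb[np.newaxis, ...]              # (1, n_mels, target_frames, 3)


short_chunk = np.full((128, 10), -40.0)    # stand-in for a 0.3 s chunk
long_window = np.full((128, 188), -40.0)   # stand-in for a 6 s window
print(to_model_input(short_chunk).shape, to_model_input(long_window).shape)
```

One consequence worth keeping in mind: with `ref=np.max` the loudest bin is 0 dB, so zero padding fills the extra frames with the maximum value rather than with silence.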

In [8]:
def extract_mel_spectrogram(audio_path, target_sr=Config.TARGET_SR, n_mels=Config.N_MELS, 
                           n_fft=Config.N_FFT, hop_length=Config.HOP_LENGTH, 
                           y=None, sr=None, target_time_frames=None):
    """
    Extract Mel spectrogram features from audio.
    
    Args:
        audio_path (str): Path to the audio file.
        target_sr (int): Target sample rate for audio.
        n_mels (int): Number of Mel bands to generate.
        n_fft (int): Length of the FFT window.
        hop_length (int): Number of samples between successive frames.
        y (np.ndarray, optional): Audio time series.
        sr (int, optional): Sample rate of y.
        target_time_frames (int, optional): Target number of time frames for the output.
    
    Returns:
        np.ndarray: Mel spectrogram in dB scale, optionally resized to target_time_frames.
    """
    if y is None or sr is None:
        try:
            y, sr = librosa.load(audio_path, sr=target_sr, res_type=Config.RES_TYPE)
        except Exception as e:
            print(f"Error loading audio file: {audio_path}, {e}")
            return None
            
    mel_spectrogram = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, n_fft=n_fft, hop_length=hop_length)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)

    if target_time_frames is not None:
        mel_spectrogram_db_resized = librosa.util.fix_length(
            mel_spectrogram_db, size=target_time_frames, axis=1)
        return mel_spectrogram_db_resized
    else:
        return mel_spectrogram_db

## Model Architecture

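The system uses EfficientNet-B0, a convolutional neural network designed to balance accuracy against parameter count and compute. The backbone is instantiated without pre-loaded weights (`weights=None`); the trained weights are loaded from file afterwards. The architecture includes: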
- Base EfficientNet-B0 network (without top layers)
- Global Average Pooling layer
- A dense layer with 128 units and ReLU activation
- Output layer with sigmoid activation for binary classification
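
To get a feel for the size of the classification head, its parameter count can be worked out by hand. This is a small sketch, assuming the EfficientNet-B0 backbone without its top layers emits 1280 feature channels (the standard width of its final convolution); `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


BACKBONE_CHANNELS = 1280  # assumed output width after global average pooling

hidden = dense_params(BACKBONE_CHANNELS, 128)  # Dense(128, relu)
output = dense_params(128, 1)                  # Dense(1, sigmoid)
print(hidden, output, hidden + output)
```

Global average pooling itself has no trainable parameters, so under this assumption the head adds 164,097 parameters on top of the backbone.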

In [9]:
def build_model(input_shape=Config.INPUT_SHAPE):
    """
    Build the EfficientNetB0 model architecture for inference.
    
    Args:
        input_shape (tuple): Shape of the input tensor (height, width, channels).
        
    Returns:
        tf.keras.models.Model: Uncompiled model with randomly initialized weights.
    """
    input_tensor = Input(shape=input_shape)
    
    # Load EfficientNetB0 base model (architecture) - WITHOUT weights initially
    base_model = EfficientNetB0(include_top=False, weights=None, input_tensor=input_tensor)
    
    # Add Global Average Pooling layer
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    
    # Add a Dense layer for classification
    x = Dense(128, activation='relu')(x)
    output_tensor = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=input_tensor, outputs=output_tensor)
    return model

In [10]:
def load_trained_model(weights_path=Config.MODEL_WEIGHTS_FILE):
    """
    Load a trained model with weights for inference.
    
    Args:
        weights_path (str): Path to the model weights file.
        
    Returns:
        tf.keras.models.Model: Model with loaded weights.
    """
    model = build_model()
    try:
        model.load_weights(weights_path)
        print(f"Model weights loaded from {weights_path}")
        return model
    except Exception as e:
        print(f"Error loading model weights: {e}")
        return None

In [12]:
model = load_trained_model()
    
if model is None:
    # Stop here: every later cell needs a loaded model.
    raise RuntimeError("Failed to load model; check Config.MODEL_WEIGHTS_FILE.")

Model weights loaded from weights/capuchin_bird_classifier.weights.h5


## Two-Stage Sliding Window Algorithm

The detection process employs a computationally efficient two-stage approach:

1. **Stage 1 (Coarse Detection)**:
   - Analyzes consecutive, non-overlapping 6-second windows
   - Applies a higher threshold (0.6) to quickly reject windows without calls
   - Only windows passing this threshold proceed to Stage 2

2. **Stage 2 (Fine-Grained Detection)**:
   - Only runs on promising segments identified in Stage 1
   - Processes smaller chunks (0.3s) within the larger window
   - Uses a lower threshold (0.5) for final detection
   - Counts one call for the window when at least one chunk exceeds the threshold, so each 6-second window contributes at most one call
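
The gating logic of the two stages can be sketched independently of audio and of the network. In the example below `count_calls_two_stage` is a hypothetical helper, the windows are short lists of numbers, and the scorer is simply `max`, standing in for the model's predicted probability.

```python
from typing import Callable, Sequence


def count_calls_two_stage(
    windows: Sequence[Sequence[float]],
    score: Callable[[Sequence[float]], float],
    chunk_len: int,
    threshold_stage1: float = 0.6,
    threshold_stage2: float = 0.5,
) -> int:
    """Count windows that pass the coarse check and contain a positive chunk."""
    count = 0
    for window in windows:
        # Stage 1: one score for the whole window; reject early if it is too low.
        if score(window) <= threshold_stage1:
            continue
        # Stage 2: score each chunk; a single positive chunk counts one call.
        chunks = [window[i:i + chunk_len] for i in range(0, len(window), chunk_len)]
        if any(score(chunk) > threshold_stage2 for chunk in chunks):
            count += 1
    return count


toy_windows = [
    [0.1, 0.2, 0.1, 0.0],    # rejected in Stage 1
    [0.1, 0.9, 0.2, 0.1],    # passes both stages
    [0.55, 0.55, 0.3, 0.2],  # would pass Stage 2 but is rejected in Stage 1
]
print(count_calls_two_stage(toy_windows, score=max, chunk_len=2))
```

The third toy window shows the trade-off of a stricter first stage: anything below the Stage 1 threshold is never examined in detail, even if a chunk inside it would have cleared the Stage 2 threshold.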

In [13]:
def count_capuchin_calls_two_stage_sliding_window(
    long_audio_path, model=None,
    threshold_stage1=Config.THRESHOLD_STAGE1, 
    threshold_stage2=Config.THRESHOLD_STAGE2,
    window_duration_outer=Config.WINDOW_DURATION_OUTER,
    step_duration_inner=Config.STEP_DURATION_INNER,
    overlap_inner=Config.OVERLAP_INNER):
    """
    Count Capuchin bird calls using a two-stage sliding window approach.
    
    This function implements an efficient detection algorithm that:
    1. First uses a large window to quickly check entire segments
    2. Then applies a finer-grained analysis only on promising segments
    
    Args:
        long_audio_path (str): Path to the audio file to analyze.
        model (tf.keras.models.Model): Trained model for prediction.
        threshold_stage1 (float): Confidence threshold for the first stage detection.
        threshold_stage2 (float): Confidence threshold for the second stage detection.
        window_duration_outer (float): Duration of the outer sliding window in seconds.
        step_duration_inner (float): Duration of inner chunks in seconds.
        overlap_inner (float): Overlap fraction between inner chunks.
        
    Returns:
        int: Number of Capuchin bird calls detected.
    """
    if model is None:
        print("Model is not provided. Please load your trained model.")
        return None

    try:
        y_long, sr_long = librosa.load(long_audio_path, sr=Config.TARGET_SR, res_type=Config.RES_TYPE)
    except Exception as e:
        print(f"Error loading long audio file: {long_audio_path}, {e}")
        return None

    window_samples_outer = int(window_duration_outer * sr_long)
    step_samples_inner = int(step_duration_inner * sr_long)
    hop_samples_inner = int(step_samples_inner * (1 - overlap_inner))
    total_duration_seconds = librosa.get_duration(y=y_long, sr=sr_long)

    capuchin_call_count = 0
    outer_window_start_time = 0.0

    print(f"Processing audio file (Two-Stage Sliding Window): {os.path.basename(long_audio_path)}")
    print(f"Total duration: {total_duration_seconds:.2f} seconds")
    print(f"Outer Window: {window_duration_outer}s, Stage 1 Threshold: {threshold_stage1}, Stage 2 Threshold: {threshold_stage2}")
    print(f"Inner Step: {step_duration_inner}s, Inner Overlap: {overlap_inner}")

    # Outer sliding window loop (6-second windows)
    outer_start_sample = 0
    while outer_window_start_time < total_duration_seconds:
        outer_end_sample = outer_start_sample + window_samples_outer
        if outer_end_sample > len(y_long):
            outer_end_sample = len(y_long)
        outer_window_audio = y_long[outer_start_sample:outer_end_sample]
        outer_window_duration = len(outer_window_audio) / sr_long
        outer_window_end_time = outer_window_start_time + outer_window_duration

        print(f"\nOuter Window: Start Time: {outer_window_start_time:.2f}s, End Time: {outer_window_end_time:.2f}s, Duration: {outer_window_duration:.2f}s")

        # Stage 1: Quick Check - Classify Entire 6-second Segment
        # (the spectrogram is padded or cropped to the model's expected width)
        mel_spec_outer_window = extract_mel_spectrogram(
            audio_path=None, y=outer_window_audio, sr=sr_long,
            target_time_frames=Config.INPUT_SHAPE[1])
        stage1_predicted_class = 0  # Default to no call
        if mel_spec_outer_window is not None:
            mel_spec_outer_window_rgb = np.stack([mel_spec_outer_window] * 3, axis=-1)
            mel_spec_outer_window_reshaped = mel_spec_outer_window_rgb[np.newaxis, ...]
            stage1_prediction = model.predict(mel_spec_outer_window_reshaped, verbose=0)
            stage1_prediction_prob = stage1_prediction[0][0]
            stage1_predicted_class = int(stage1_prediction_prob > threshold_stage1)

        if stage1_predicted_class == 0:  # No Call Indicated by Stage 1
            print("  Stage 1: No Call Indicated - Skipping Stage 2")
        else:  # Call Indicated by Stage 1 - Proceed to Stage 2
            print("  Stage 1: Call Indicated - Proceeding to Stage 2 Inner Loop")
            has_call_in_window = False
            inner_start_sample = 0
            # Stage 2: Detailed Analysis (Inner Loop - smaller chunks)
            while inner_start_sample < len(outer_window_audio):
                inner_end_sample = inner_start_sample + step_samples_inner
                if inner_end_sample > len(outer_window_audio):
                    inner_end_sample = len(outer_window_audio)
                inner_chunk_audio = outer_window_audio[inner_start_sample:inner_end_sample]
                inner_chunk_duration = len(inner_chunk_audio) / sr_long
                inner_chunk_start_time_rel_outer = inner_start_sample / sr_long
                inner_chunk_start_time_abs = outer_window_start_time + inner_chunk_start_time_rel_outer
                inner_chunk_end_time_abs = inner_chunk_start_time_abs + inner_chunk_duration

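                # A trailing fragment shorter than half a step is too short to classify.
                # Use break rather than continue: continue would skip the increment below
                # and loop forever on the same fragment.
                if len(inner_chunk_audio) < step_samples_inner / 2:
                    break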

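                # Pad the short chunk to the time-frame width the model expects (INPUT_SHAPE[1])
                mel_spec_chunk = extract_mel_spectrogram(
                    audio_path=None, y=inner_chunk_audio, sr=sr_long,
                    target_time_frames=Config.INPUT_SHAPE[1])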
                if mel_spec_chunk is not None:
                    mel_spec_chunk_rgb = np.stack([mel_spec_chunk] * 3, axis=-1)
                    mel_spec_chunk_reshaped = mel_spec_chunk_rgb[np.newaxis, ...]
                    prediction = model.predict(mel_spec_chunk_reshaped, verbose=0)
                    prediction_prob_inner = prediction[0][0]
                    predicted_class_inner = int(prediction_prob_inner > threshold_stage2)

                    if predicted_class_inner == 1:
                        has_call_in_window = True

                inner_start_sample += hop_samples_inner

            if has_call_in_window:
                capuchin_call_count += 1
                print("  Call Event Detected in Outer Window - Incrementing Call Count")
            else:
                print("  Stage 2: No Call Event Detected in Outer Window (Despite Stage 1 Indication)")
        
        outer_window_start_time += window_duration_outer
        outer_start_sample = int(outer_window_start_time * sr_long)

    print(f"\nTotal Capuchin calls detected in {os.path.basename(long_audio_path)}: {capuchin_call_count}\n")
    return capuchin_call_count
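
The outer-window bookkeeping can be checked in isolation. The sketch below is a hypothetical helper that uses integer sample indices instead of accumulating float start times; the sample count assumes the 180.04-second duration reported in the log corresponds to exactly 2,880,640 samples at 16 kHz.

```python
def outer_windows(n_samples, sr=16000, window_s=6.0):
    """Yield (start, end) sample indices of consecutive, non-overlapping windows."""
    window = int(window_s * sr)
    for start in range(0, n_samples, window):
        # The final window is clipped to the end of the recording.
        yield start, min(start + window, n_samples)


n_samples = 2_880_640  # assumed: 180.04 s at 16 kHz
windows = list(outer_windows(n_samples))
last_start, last_end = windows[-1]
print(len(windows), last_end - last_start)
```

Under that assumption the recording splits into 31 windows, the last of which holds only 640 samples (0.04 s). Such a short tail is much shorter than the FFT size, so it may be worth skipping or merging into the previous window.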

## Results Processing

The system processes audio files individually or in batches, calculating the number of Capuchin calls detected in each recording. Results can be saved to CSV files for further analysis.

The run shown below processes a 3-minute recording (recording_06.mp3); its log is truncated here. The run identified 5 Capuchin bird calls at timestamps around 25s, 45s, 100s, 115s, and 130s.
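
`Config.MANUAL_COUNTS_CSV_PATH` points at hand-counted calls, although the notebook does not show a comparison. Below is a minimal sketch of one way to score model counts against such a reference, using plain dictionaries. The file names and counts are made-up placeholders, and `compare_counts` is a hypothetical helper; the layout of the real CSV is not known from this notebook.

```python
def compare_counts(model_counts, manual_counts):
    """Return per-file rows, the number of exact matches, and the mean absolute error."""
    rows = []
    for name, predicted in model_counts.items():
        if name not in manual_counts:
            continue  # no reference count for this file
        actual = manual_counts[name]
        rows.append({
            "audio_file": name,
            "model_calls": predicted,
            "manual_calls": actual,
            "abs_error": abs(predicted - actual),
        })
    exact = sum(row["abs_error"] == 0 for row in rows)
    mae = sum(row["abs_error"] for row in rows) / len(rows) if rows else float("nan")
    return rows, exact, mae


# Placeholder values for illustration only.
model_counts = {"a.mp3": 5, "b.mp3": 2, "c.mp3": 0}
manual_counts = {"a.mp3": 5, "b.mp3": 3, "c.mp3": 0}
rows, exact, mae = compare_counts(model_counts, manual_counts)
print(f"{exact}/{len(rows)} exact matches, MAE = {mae:.2f}")
```

The same comparison could be done by merging the results DataFrame with the manual counts on the file name column, once the CSV's column names are known.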

In [14]:
def process_single_file(audio_file_path, model):
    """
    Process a single audio file to detect Capuchin bird calls.
    
    Args:
        audio_file_path (str): Path to the audio file.
        model (tf.keras.models.Model): Trained model for prediction.
        
    Returns:
        dict: Dictionary containing the audio file name and the count of detected calls.
    """
    print(f"\nStarting inference on: {os.path.basename(audio_file_path)}")
    
    call_count = count_capuchin_calls_two_stage_sliding_window(
        long_audio_path=audio_file_path,
        model=model
    )
    
    result = {'audio_file': os.path.basename(audio_file_path), 'model_calls': call_count}
    return result

In [15]:
def process_multiple_files(audio_files, model):
    """
    Process multiple audio files to detect Capuchin bird calls.
    
    Args:
        audio_files (list): List of paths to audio files.
        model (tf.keras.models.Model): Trained model for prediction.
        
    Returns:
        pd.DataFrame: DataFrame containing results for all processed files.
    """
    print(f"Processing {len(audio_files)} audio files...")
    print("\nStarting inference (Multiple Files - Two-Stage Sliding Window)...\n")
    
    model_results = []
    
    for audio_file in audio_files:
        print(f"Processing audio file: {os.path.basename(audio_file)}")
        result = process_single_file(audio_file, model)
        model_results.append(result)
    
    model_df = pd.DataFrame(model_results)
    
    print("\nModel inference complete (Multiple Files).")
    print(model_df)
    
    return model_df

In [16]:
def save_results(results_df, output_path="results/detection_results.csv"):
    """
    Save detection results to a CSV file.
    
    Args:
        results_df (pd.DataFrame): DataFrame containing detection results.
        output_path (str): Path to save the CSV file.
        
    Returns:
        bool: True if successful, False otherwise.
    """
    try:
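        output_dir = os.path.dirname(output_path)
        if output_dir:  # dirname is '' for a bare filename, which os.makedirs rejects
            os.makedirs(output_dir, exist_ok=True)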
        results_df.to_csv(output_path, index=False)
        print(f"Results saved to {output_path}")
        return True
    except Exception as e:
        print(f"Error saving results: {e}")
        return False


In [18]:
# Sort so that index-based selection is reproducible (os.listdir order is arbitrary)
audio_files = [
    os.path.join(Config.REAL_WORLD_DATA_FOLDER, f)
    for f in sorted(os.listdir(Config.REAL_WORLD_DATA_FOLDER))
    if f.lower().endswith(('.wav', '.mp3', '.flac', '.ogg'))
]

In [21]:
single_file = audio_files[6]  # Change index to select a different file
single_result = process_single_file(single_file, model)
single_df = pd.DataFrame([single_result])


Starting inference on: recording_06.mp3
Processing audio file (Two-Stage Sliding Window): recording_06.mp3
Total duration: 180.04 seconds
Outer Window: 6.0s, Stage 1 Threshold: 0.6, Stage 2 Threshold: 0.5
Inner Step: 0.3s, Inner Overlap: 0.0

Outer Window: Start Time: 0.00s, End Time: 6.00s, Duration: 6.00s
  Stage 1: No Call Indicated - Skipping Stage 2

Outer Window: Start Time: 6.00s, End Time: 12.00s, Duration: 6.00s
  Stage 1: No Call Indicated - Skipping Stage 2

Outer Window: Start Time: 12.00s, End Time: 18.00s, Duration: 6.00s
  Stage 1: No Call Indicated - Skipping Stage 2

Outer Window: Start Time: 18.00s, End Time: 24.00s, Duration: 6.00s
  Stage 1: No Call Indicated - Skipping Stage 2

Outer Window: Start Time: 24.00s, End Time: 30.00s, Duration: 6.00s
  Stage 1: Call Indicated - Proceeding to Stage 2 Inner Loop
  Call Event Detected in Outer Window - Incrementing Call Count

Outer Window: Start Time: 30.00s, End Time: 36.00s, Duration: 6.00s
  Stage 1: No Call Indicated 

## Performance Considerations

The two-stage approach reduces computation compared with a single-stage method that classifies every 0.3-second chunk. Stage 1 costs one forward pass per 6-second window, whereas Stage 2 costs 20 (one per chunk), so a window rejected in Stage 1 needs 1 forward pass instead of 20. The saving is largest when calls are sparse, which makes the approach suitable for analyzing large volumes of audio data. When most windows pass Stage 1, the extra screening pass adds overhead instead.
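
A rough cost model makes the trade-off concrete. The sketch below counts forward passes only, assuming a 180-second recording split into 30 windows of 20 chunks each; the numbers of flagged windows are hypothetical inputs rather than measurements, and `forward_passes` is a hypothetical helper.

```python
def forward_passes(n_windows, flagged_windows, chunks_per_window=20):
    """Return (single_stage, two_stage) forward-pass counts."""
    single_stage = n_windows * chunks_per_window
    # Two-stage: one screening pass per window, plus chunk passes for flagged windows.
    two_stage = n_windows + flagged_windows * chunks_per_window
    return single_stage, two_stage


for flagged in (0, 5, 30):
    single, two = forward_passes(n_windows=30, flagged_windows=flagged)
    print(f"flagged={flagged:2d}  single-stage={single}  two-stage={two}")
```

With 5 of 30 windows flagged, the two-stage count is 130 against 600 for the single-stage method. With all 30 flagged it rises to 630, so the approach only pays off when fewer than 95% of windows pass Stage 1.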