# Forest elephant vocalisation feature extraction

This Jupyter notebook provides a step-by-step guide to using pre-trained convolutional neural networks (CNNs) via transfer learning to extract acoustic features from forest elephant vocalisations. We take a dataset of 1254 forest elephant vocalisations and automatically extract acoustic features using four pre-trained CNN models.

## Dataset Description

We will be working with audio files of African forest elephants recorded by the Elephant Listening Project in the Dzanga-Bai clearing in the southwestern Central African Republic between September 2018 and April 2019. The vocalisation dataset has 1254 rows, each representing one elephant vocalisation, with the start time, end time, low frequency and high frequency annotated alongside the call type (roar, rumble or trumpet).

## Steps
1. **Set-up**: Import the libraries and functions needed to conduct the analysis, load the dataset and understand its structure.
2. **Audio pre-processing**: Pre-process the data to isolate the vocalisations.
3. **Feature extraction**: Automatically extract the acoustic features using four pre-trained CNNs.
4. **Save features**: Save the features for further analysis.

### 1. Set-up

Here we import some pre-defined helper functions located in the `elephant_scripts` folder of the main repository.

In [1]:
from pathlib import Path

from elephant_scripts.load_data import load_vocalisation_dataset

Now we load the table containing information about each of the elephant vocalisations.

In [None]:
AUDIO_DIR = Path("audio_dir")
DATA_DIR = Path("data")
OUTPUTS_DIR = Path("outputs")

# This function loads the table containing information about each vocalisation
# and finds the audio file in which each vocalisation appears.
df = load_vocalisation_dataset(
    DATA_DIR / "elephant_vocalisations.csv",
    audio_dir=AUDIO_DIR,
)

In [None]:
df.head()
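
Before pre-processing, it can help to sanity-check the annotations. The sketch below runs on a small hand-made table rather than the real dataset: the column names `start_time`, `end_time`, `low_freq`, `high_freq` and `call_type` are assumptions based on the dataset description above, and the values are invented purely for illustration.

```python
import pandas as pd

# Toy annotations (invented values, assumed column names).
toy = pd.DataFrame(
    {
        "start_time": [12.0, 40.5, 93.2],
        "end_time": [15.5, 41.1, 97.0],
        "low_freq": [10.0, 300.0, 12.0],
        "high_freq": [120.0, 1800.0, 150.0],
        "call_type": ["rumble", "trumpet", "rumble"],
    }
)

# Every annotation should have a positive duration and a valid frequency band.
toy["duration"] = toy["end_time"] - toy["start_time"]
assert (toy["duration"] > 0).all()
assert (toy["low_freq"] < toy["high_freq"]).all()

# Number of vocalisations per call type.
counts = toy["call_type"].value_counts()
```

The same checks could be run on `df` once the actual column names have been confirmed with `df.head()`.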

### 2. Audio pre-processing

Now that the dataframe is loaded and associated with the audio files, we need to pre-process the audio files to extract the exact frequency ranges and time periods associated with the vocalisations. This helps to minimise unwanted environmental sound that may cause erroneous results. These are the steps taken to pre-process the files before extracting their audio features:

<ol type="A">
<li>Read the entire audio file.</li>
<li>Apply a bandpass filter to exclude frequencies outside the vocalisation range using the low_frequency and high_frequency information.</li>
<li>Extract the audio clip corresponding to the vocalisation using the start_time and end_time information.</li>
<li>Zero-pad the vocalisation to a length that is a multiple of the input windows of the CNNs and centre the recording within this padding.</li>
<li>Normalise the audio clip to have a peak amplitude of 1 to control for elephant distance from the microphone.</li>
<li>Reapply the bandpass filter to remove any acoustic artifacts that have been introduced by the pre-processing.</li>
</ol>
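
Steps B to E can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the implementation in `elephant_scripts`: the helper names `fft_bandpass`, `centre_pad` and `peak_normalise` are hypothetical, the signal is synthetic, and the filter is a crude FFT brick-wall mask, which may differ from the filter design used by `apply_bandpass_filter`.

```python
import numpy as np


def fft_bandpass(wav, low_hz, high_hz, samplerate):
    """Zero every FFT bin outside [low_hz, high_hz] (brick-wall filter)."""
    spectrum = np.fft.rfft(wav)
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / samplerate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0
    return np.fft.irfft(spectrum, n=len(wav))


def centre_pad(clip, window_samples):
    """Zero-pad to the next multiple of window_samples, clip in the middle."""
    n_windows = max(1, -(-len(clip) // window_samples))  # ceiling division
    total_padding = n_windows * window_samples - len(clip)
    left = total_padding // 2
    return np.pad(clip, (left, total_padding - left))


def peak_normalise(clip):
    """Scale so the largest absolute sample equals 1 (silence is unchanged)."""
    peak = np.max(np.abs(clip))
    return clip / peak if peak > 0 else clip


# Synthetic recording: 10 s at 4 kHz with a 30 Hz tone standing in for a
# rumble and a 900 Hz tone standing in for unwanted background sound.
sr = 4000
t = np.arange(10 * sr) / sr
recording = 0.2 * np.sin(2 * np.pi * 30 * t) + 0.1 * np.sin(2 * np.pi * 900 * t)

filtered = fft_bandpass(recording, low_hz=10, high_hz=100, samplerate=sr)  # B
clip = filtered[int(2.0 * sr) : int(7.0 * sr)]  # C: annotation from 2 s to 7 s
padded = centre_pad(clip, window_samples=4 * sr)  # D: 4 s windows
normalised = peak_normalise(padded)  # E
```

Here the 5-second clip is padded to 8 seconds (two 4-second windows), with equal amounts of silence on either side.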

To visualise how this audio pre-processing works, we can take a sample vocalisation and show the effect of each step.

In [None]:
# Select a random vocalisation
sample = df.sample(n=1).iloc[0]

# Print information about the audio file
print(sample)

In [None]:
# Each of the steps is implemented as a re-usable function defined in the
# feature_extraction.py module within the `elephant_scripts` folder.
from elephant_scripts.feature_extraction import (
    apply_bandpass_filter,
    extract_audio,
    generate_spectrogram,
    normalise_sound_file,
    plot_spectrograms,
    read_audio,
    zero_pad,
)

# Set audio parameters used in the functions below
DEFAULT_SAMPLERATE = 4000  # Hz
DEFAULT_WINDOW_SIZE = 4  # seconds
DEFAULT_CLIP_POSITION = "middle"  # Default position of the clip within the zero padding

# Work through each of the pre-processing steps in turn

# 1) Read audio
wav, sr = read_audio(sample.audio_filepaths)

# 2) Apply bandpass filter
filtered_wav = apply_bandpass_filter(
    wav,
    sample.low_freq,
    sample.high_freq,
    samplerate=DEFAULT_SAMPLERATE,
)

# 3) Extract audio segment based on annotation times
extracted_audio = extract_audio(filtered_wav, sample, samplerate=DEFAULT_SAMPLERATE)

# 4) Zero-pad the wav array
padded_wav = zero_pad(
    extracted_audio,
    annotation=sample,
    window_size=DEFAULT_WINDOW_SIZE,
    samplerate=DEFAULT_SAMPLERATE,
    position=DEFAULT_CLIP_POSITION,
)

# 5) Normalise the sound file
normalised_clip = normalise_sound_file(padded_wav)

# 6) Re-apply bandpass filter to the normalised waveform
refiltered_wav = apply_bandpass_filter(
    normalised_clip,
    sample.low_freq,
    sample.high_freq,
    samplerate=DEFAULT_SAMPLERATE,
)

# Plot spectrograms for each pre-processing step
steps = [
    "1) Read Audio",
    "2) Apply Bandpass Filter",
    "3) Extract Audio",
    "4) Zero-Pad",
    "5) Normalise Amplitude",
    "6) Reapply Bandpass filter",
]
audios = [
    wav,
    filtered_wav,
    extracted_audio,
    padded_wav,
    normalised_clip,
    refiltered_wav,
]
plot_spectrograms(
    steps,
    audios,
    sr=DEFAULT_SAMPLERATE,
    window_size=DEFAULT_WINDOW_SIZE,
    position=DEFAULT_CLIP_POSITION,
    figsize=(10, 10),
    fontsize=10,
)

These individual steps are combined into a single function, `wav_cookiecutter`, which we'll use to pre-process the rest of the data.

In [None]:
from elephant_scripts.feature_extraction import plot_spectrogram, wav_cookiecutter

# The wav_cookiecutter function integrates all the previous steps into a single
# function.
# Note that you can control the amount of zero-padding by changing the window_size.
# This makes it easy to adjust the size of the audio samples to the input sizes
# required by the different models.
preprocessed = wav_cookiecutter(
    sample,
    samplerate=DEFAULT_SAMPLERATE,
    window_size=DEFAULT_WINDOW_SIZE,
    position=DEFAULT_CLIP_POSITION,
)

plot_spectrogram(preprocessed, figsize=(4, 2));

We can now apply `wav_cookiecutter` to all of the vocalisations to see how the resulting inputs differ across the four CNNs.

**Generate and visualise spectrograms for each CNN after pre-processing**

In [7]:
# First we import all the available models
from elephant_scripts.models import BirdNET, Perch, VGGish, YAMNet


MODELS = [VGGish, YAMNet, BirdNET, Perch]

As each model has specific input requirements, audio must be padded to fit the necessary input size during feature extraction.
This padding keeps the vocalisation centred within the input audio.
Notably, for models like BirdNET and Perch, which require larger input sizes, the elephant vocalisation occupies only a small portion of the total input audio.
This can be seen in the following visualisation:

In [None]:
from elephant_scripts.plotting import plot_spectrogram_matrix

for model in MODELS:
    # Adjust the window size to account for the model's native samplerate.
    window_size = model.window_size * model.samplerate / DEFAULT_SAMPLERATE

    # Plotting spectrograms for each model
    plot_spectrogram_matrix(
        df,
        window_size=window_size,
        title=f"{model.__name__} inputs",
    )
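
The window-size adjustment in the loop above is easier to follow with numbers. The sketch below assumes that each model consumes a fixed number of samples per window (`window_size * samplerate`) and that the 4 kHz audio is passed in without resampling, which is one reading of the expression above. The per-model values are the input sizes commonly documented for these models; they are assumptions here and may differ from the attributes defined in `elephant_scripts.models`. The helper names are hypothetical.

```python
RECORDING_SAMPLERATE = 4000  # Hz, same value as DEFAULT_SAMPLERATE

# (window length in seconds, native sample rate in Hz); assumed values.
ASSUMED_INPUTS = {
    "VGGish": (0.96, 16000),
    "YAMNet": (0.96, 16000),
    "BirdNET": (3.0, 48000),
    "Perch": (5.0, 32000),
}


def window_in_recording_seconds(window_size, samplerate):
    """Seconds of 4 kHz audio needed to fill one model input window."""
    samples_per_window = window_size * samplerate
    return samples_per_window / RECORDING_SAMPLERATE


def occupied_fraction(call_duration, window_seconds):
    """Share of a single window taken up by the call (capped at 1)."""
    return min(1.0, call_duration / window_seconds)


for name, (window_size, samplerate) in ASSUMED_INPUTS.items():
    seconds = window_in_recording_seconds(window_size, samplerate)
    share = occupied_fraction(3.0, seconds)  # a hypothetical 3 s call
    print(f"{name}: {seconds:.2f} s per window, call fills {share:.1%}")
```

Under these assumptions a window for BirdNET or Perch spans roughly ten times more recording time than one for VGGish or YAMNet, which is why the vocalisation looks so small in their spectrograms.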

### 3. Feature extraction

The pre-processed audio clips are then passed through the pre-trained CNNs in chunks that match each model's input window, and the models produce one embedding of acoustic features per chunk. These embeddings are then averaged to obtain a single embedding per vocalisation. This feature extraction phase involves the following steps:

1. Extract the acoustic feature embeddings for each chunk of the audio clip using the pre-trained CNNs.
2. Average the embeddings across the chunks to obtain a single embedding for each vocalisation.
3. Add information about the duration of each vocalisation.

The resulting embedding encodes the acoustic feature representation of the vocalisation.
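
Steps 2 and 3 amount to a mean over the chunk axis plus one extra column. The following is a minimal sketch, with random numbers standing in for real embeddings, an assumed embedding size of 8 (real models produce much larger vectors) and a hypothetical duration value:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Pretend the model split one vocalisation into 3 chunks and returned
# one 8-dimensional embedding per chunk: shape (n_chunks, embedding_dim).
chunk_embeddings = rng.normal(size=(3, 8))

# Average over the chunk axis to get one embedding per vocalisation.
mean_embedding = chunk_embeddings.mean(axis=0)

# Flatten into named columns and append the call duration.
row = {f"feature_{i}": float(v) for i, v in enumerate(mean_embedding)}
row["duration"] = 3.5  # seconds, hypothetical
```

Each vocalisation therefore becomes one row with a fixed number of columns, regardless of how many chunks its padded audio was split into.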

In [9]:
import numpy as np
import pandas as pd
from tqdm import tqdm


def extract_features(
    annotation,
    model,
    position: str = DEFAULT_CLIP_POSITION,
    samplerate: int = DEFAULT_SAMPLERATE,
):
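    """Extract features of a single vocalisation using the specified model."""
    # Pre-process the vocalisation audio. The window size is rescaled from the
    # model's native samplerate to the samplerate of the recordings.
    wav = wav_cookiecutter(
        annotation,
        samplerate=samplerate,
        window_size=model.window_size * model.samplerate / samplerate,
        position=position,
    )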

    # Compute features using model
    features = model.extract_features(wav)

    # Average features in case the audio was split into multiple chunks
    mean_features = features.mean(axis=0)

    # Return features with duration added.
    return {
        **{f"feature_{i}": value for i, value in enumerate(mean_features)},
        "duration": annotation.duration,
    }


def extract_all_features(
    df,
    model_type,
    output: Path | str | None = None,
    position: str = DEFAULT_CLIP_POSITION,
    samplerate: int = DEFAULT_SAMPLERATE,
    force: bool = True,
) -> pd.DataFrame:
    """Extract all vocalisation features from the given dataset with the specified model."""
    
    # Check if features have been pre-computed and load them if so
    if output is not None and Path(output).is_file() and not force:
        print(
            f"Pre-computed features for {model_type.__name__} were found, "
            "skipping computation. To recompute features, use `force=True`."
        )
        return pd.read_parquet(output)

    # Instantiate the model. This loads the model weights and prepares it for
    # processing, so we only do it when features actually need to be computed.
    model = model_type()

    # Compute features for all annotated vocalisations
    features = pd.DataFrame(
        [
            extract_features(
                annotation,
                model,
                position=position,
                samplerate=samplerate,
            )
            for annotation in tqdm(df.itertuples(), total=len(df))
        ],
        index=df["vocalisation_id"],
    )

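    # Save to file, creating the output directory if it does not exist yet
    if output is not None:
        Path(output).parent.mkdir(parents=True, exist_ok=True)
        features.to_parquet(output)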

    return features

In [None]:
# Process the whole vocalisation dataset with all available models and save the output
for model_type in MODELS:
    model_name = model_type.__name__.lower()
    extract_all_features(
        df,
        model_type,
        output=OUTPUTS_DIR / f"{model_name}_vocalisation_features.parquet",
    )
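
The `force` flag in `extract_all_features` follows a compute-or-load pattern. The sketch below isolates that control flow using only the standard library. `cached_compute` and `expensive` are hypothetical helpers, and JSON stands in for Parquet, so this illustrates the pattern rather than the notebook's storage format.

```python
import json
import tempfile
from pathlib import Path


def cached_compute(compute, output, force=False):
    """Return cached results from `output` unless missing or `force` is set."""
    output = Path(output)
    if output.is_file() and not force:
        return json.loads(output.read_text())
    result = compute()
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(result))
    return result


calls = []


def expensive():
    """Stand-in for feature extraction; records each time it runs."""
    calls.append(1)
    return {"feature_0": 0.5}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "outputs" / "features.json"
    first = cached_compute(expensive, path)  # computes and saves
    second = cached_compute(expensive, path)  # loads from disk
    third = cached_compute(expensive, path, force=True)  # recomputes
```

In this sketch `force` defaults to `False`, so a second call reuses the saved file; passing `force=True` always recomputes and overwrites it.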