<a href="https://colab.research.google.com/github/MScEcologyAndDataScienceUCL/BIOS0032_AI4Environment/blob/main/4_AI_for_Bioacoustics/AI_for_Bioacoustics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 - AI for Bioacoustics

## What we will learn

In this week's practical we will explore computer audition applications in ecology, such as automated animal detection and species classification from audio sensor data. Among other things, we will:

TODO

## Intro

### Computer audition

**What is Computer audition?**

- Computer audition is the field of research that deals with the automatic
  analysis of audio signals.

- It intersects with many other fields, including machine learning, signal
  processing, and computer vision.

**What does a computer hear?**

- Audio files are a sequence of numbers representing the amplitude of the sound
  wave at a given time.

- The number of samples per second is called the **sampling rate**.

<img alt="audio and sampling rate" width="600" src="https://cdn.shopify.com/s/files/1/1169/2482/files/Sampling_Rate_Cover_image.jpg?v=1654170259"></img>
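
To make the idea of samples and sampling rate concrete, here is a minimal sketch that synthesises a pure tone with NumPy. The 8 kHz sampling rate, the 440 Hz tone and the 0.5 s duration are arbitrary values chosen for illustration; they are not properties of the data used in this practical.

```python
import numpy as np

# Illustrative values only (assumptions, not taken from the course data)
samplerate = 8_000  # samples per second (Hz)
duration = 0.5  # seconds
tone_freq = 440  # Hz

# One timestamp per sample: sample k is taken at k / samplerate seconds
num_samples = int(samplerate * duration)
times = np.arange(num_samples) / samplerate

# The "audio" is just a sequence of amplitude values
wav = 0.5 * np.sin(2 * np.pi * tone_freq * times)

# Highest frequency that this sampling rate can represent
nyquist = samplerate / 2

print(f"Number of samples = {num_samples}")
print(f"First five amplitudes = {wav[:5]}")
print(f"Nyquist frequency = {nyquist} Hz")
```

A recording can only represent frequencies up to half its sampling rate (the Nyquist frequency), which is why ultrasonic bat calls need recorders with much higher sampling rates than those used for speech or music.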

**What tasks can we do with computer audition?**

Computer audition is used in a wide range of applications, including:

- Speech recognition: Siri, Alexa, Google Assistant
- Music information retrieval: Spotify, Shazam
- Audio classification: What is sounding in this audio?
- Sound event detection: Transcription of audio into a sequence of events.

<img alt="Sound event detection" width="400" src="http://d33wubrfki0l68.cloudfront.net/508a62f305652e6d9af853c65ab33ae9900ff38e/17a88/images/tasks/challenge2016/task3_overview.png"></img>

> Taken from the paper: Mesaros, A., Heittola, T., Diment, A., Elizalde, B., Shah, A., Vincent, E., ... & Virtanen, T. (2017, November). DCASE 2017 challenge setup: Tasks, datasets and baseline system. In DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events.

Recently, neural network models have come to dominate the field of computer audition and are now used to solve many of the tasks above.

### Data collection

**Acoustic sensors** can be used to collect field recordings of animal sounds.

Usually, these sensors are deployed statically in the field for long periods of time and record sounds continuously. This is called **passive acoustic monitoring**.

<img alt="passive acoustic monitoring" width="400" src="https://wittmann-tours.de/wp-content/uploads/2018/06/AudioMoth.jpg"></img>

Alternatively, recordings are actively directed towards a specific animal species or sound events.

<img alt="active recording" width="400" src="https://s3.amazonaws.com/cdn.freshdesk.com/data/helpdesk/attachments/production/48032687175/original/xjI7Dy3Q9kaCZinr5vf4ksNxQbjK13Yv3A.jpg?1584552543"></img>

> Taken from the Macaulay Library blog post:
> [Sound recording tips](https://support.ebird.org/en/support/solutions/articles/48001064298-sound-recording-tips)

### Acoustics for ecology

The sound at a site is a reflection of the species present in the area and other
environmental factors.

<img alt="composition of acoustic space" width="400" src="https://media.springernature.com/full/springer-static/image/art%3A10.1007%2Fs12304-017-9288-5/MediaObjects/12304_2017_9288_Fig1_HTML.gif?as=webp"></img>

> Taken from the paper: Mullet, T.C., Farina, A. & Gage, S.H. The Acoustic
> Habitat Hypothesis: An Ecoacoustics Perspective on Species Habitat Selection.
> Biosemiotics 10, 319–336 (2017). https://doi.org/10.1007/s12304-017-9288-5

If we could link the sounds to the species, we could use this information to
study and monitor the biodiversity of an area.

Acoustic sensors produce a lot of data, and it is not always easy to analyse.
Can we use computer audition to help us?

In this practical we will explore the task of **animal sound detection** and
**species classification**, using both manual and automated methods.

## Setup Steps

Here we will go through the steps to set up the environment for this practical.

1. Make sure to use GPU runtime in Colab. Go to `Runtime` -> `Change runtime
   type` and select `GPU` as the hardware accelerator.

2. Mount your Google Drive. 

In [None]:
from google.colab import drive

drive.mount("/content/drive")

Add a shortcut in your Drive to this [shared folder](https://drive.google.com/drive/folders/1hbbbsILNBsQghktuj0z_Jq_3iEZQCCbj?usp=share_link).

This will allow you to access the data we will use in this practical.

In [None]:
%%capture
# Extract data into machine
!unzip /content/drive/MyDrive/BIO0032_AI4Environment/week4_data.zip -d /content/week4_data

3. Install and import dependencies. Run the following cell to install the required
   dependencies. This will take a few minutes. You can safely ignore
   the output of this cell.

In [None]:
%%capture
!sudo apt-get install libfftw3-dev libicu-dev libsndfile1-dev libqt5core5a
!pip install pytadarida git+https://github.com/MScEcologyAndDataScienceUCL/BIOS0032_AI4Environment.git git+https://github.com/mbsantiago/batdetect2

In [None]:
import os
from time import perf_counter

import ipywidgets
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import glob
import plotly.express as px
import xarray as xa
from IPython.display import Audio
from librosa import display

from bios0032utils.bioacoustics import detection, evaluate_detection, plotting

## Part 1: Detecting Animal Sounds

- Most of the time, we are not interested in all the sounds in a recording, but only in the sounds of a specific animal species.

- Acoustic sensors will indiscriminately record all sounds in the environment, including those of animals, wind, rain, etc. Although some recorders can be triggered by a specific sound, this is not always the case.

- Passive acoustic monitoring produces many hours of recordings, and it's hard to identify and explore the sounds of interest.

These are similar problems to the ones we have seen in the previous practicals for camera trap images.

While not as developed as in **computer vision**, there are some tools for automatically **detecting animal sounds** in recordings. Here we will explore a few of them.

But first we need to understand how to visualise and annotate sounds.

### Animal sounds visualisation

- While we can listen to the sounds in a recording, it is often easier to
visualise them.

- This is especially true when we want to compare sounds from
different recordings or navigate quickly without the need to listen.

- We can use waveplots and spectrograms to visualise the sounds in a recording.

Now we will load a dataset of animal recordings provided by [Avisoft](https://www.avisoft.com/animal-sounds/) and visualize them.

In [None]:
AVISOFT_AUDIO_DIR = "/content/week4_data/data/avisoft/audio"
AVISOFT_METADATA_FILE = "/content/week4_data/data/avisoft/avisoft_metadata.csv"

In [None]:
# Load the metadata dataframe
avisoft = pd.read_csv(AVISOFT_METADATA_FILE)

In [None]:
# Print first few rows
avisoft.head()

In [None]:
# select a random file from the dataset
random_recording = avisoft.sample(n=1).iloc[0]

# read the audio file and import it as a numpy array
wav, samplerate = librosa.load(
    os.path.join(AVISOFT_AUDIO_DIR, random_recording.wav),
    sr=None,
)

# Compute the duration of audio
num_samples = len(wav)  # Number of samples taken by the recorder
duration = num_samples / samplerate

# Get name of animal
animal_name = random_recording.english_name

print(f"File selected = {random_recording.wav}")
print(f"Samplerate = {samplerate} Hz")
print(f"Duration = {duration:.2f} s")
print(f"Species = {animal_name}")

Let's first listen to the audio.

In [None]:
Audio(data=wav, rate=samplerate)

In [None]:
# Create plot of the waveform
times = np.linspace(0, duration, num_samples)
plt.figure(figsize=(10, 3))
plt.plot(times, wav)
plt.xlabel("time (s)")
plt.title(f"Waveform of {animal_name} sound");

The **waveform** gives us a visual representation of the sound amplitude over time.

However, if there are multiple simultaneous sounds in the recording, it can be hard to see each individual sound.

We can use a **spectrogram** to decompose the sound into **frequencies** and visualise them as a 2D image.

In [None]:
# Compute the magnitude spectrogram with the short-time Fourier transform (STFT)
spectrogram = np.abs(librosa.stft(wav))

# Amplitude is best represented in logarithmic scale (decibels)
db_spectrogram = librosa.amplitude_to_db(spectrogram, ref=np.max)

In [None]:
# Create plot of spectrogram
num_freq_bins, num_time_bins = db_spectrogram.shape
times = np.linspace(0, duration, num_time_bins)
freqs = np.linspace(0, samplerate / 2, num_freq_bins)

plt.figure(figsize=(10, 4))
plt.pcolormesh(times, freqs, db_spectrogram, cmap="magma")
plt.colorbar()
plt.title(f"Spectrogram of {random_recording.english_name} sound")
plt.xlabel("time (s)")
plt.ylabel("freq (Hz)");

### Exercise 🔊 👀

The sounds produced by animals can be very different from each other. The
transformation used to create the spectrogram, called the **short-time Fourier
transform** (STFT), will highlight different features of the sound depending on
the parameters used.

- Research what the STFT is and how its parameters affect the spectrogram. In
  particular, try to understand the effect of the **window size** and the **hop
  size** or **overlap**.
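
As a starting point for this research, the sketch below implements a deliberately simplified STFT in NumPy and prints how the window size (`n_fft`) and hop size (`hop_length`) change the shape and resolution of the result. The function `simple_stft` is a hypothetical helper written for this illustration; it does not pad the signal, so its number of frames differs slightly from what `librosa.stft` returns with its default settings. The input is one second of random noise at an assumed sampling rate of 44.1 kHz.

```python
import numpy as np


def simple_stft(wav, n_fft, hop_length):
    """Magnitude STFT without padding (illustrative helper, not librosa's)."""
    window = np.hanning(n_fft)
    # Number of complete windows that fit in the signal
    num_frames = 1 + (len(wav) - n_fft) // hop_length
    frames = np.stack(
        [
            wav[i * hop_length : i * hop_length + n_fft] * window
            for i in range(num_frames)
        ]
    )
    # One spectrum per frame, transposed to (frequency bins, time frames)
    return np.abs(np.fft.rfft(frames, axis=1)).T


samplerate = 44_100  # assumed sampling rate
wav = np.random.default_rng(0).normal(size=samplerate)  # 1 s of noise

for n_fft, hop_length in [(256, 64), (2048, 512)]:
    spec = simple_stft(wav, n_fft, hop_length)
    print(
        f"n_fft={n_fft}, hop_length={hop_length}: "
        f"shape={spec.shape}, "
        f"frequency resolution={samplerate / n_fft:.1f} Hz, "
        f"time step={1000 * hop_length / samplerate:.2f} ms"
    )
```

Larger windows give finer frequency resolution (`samplerate / n_fft` Hz per bin) but smear short events in time, while smaller hops produce more time frames and therefore larger images and more computation.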

Here you can visualise sounds from different species and see how the STFT
parameters affect the spectrogram.

In [None]:
# @title Interactive spectrogram of animal sounds

# @markdown Select the file you wish to visualize. Modify the spectrogram parameters to see their effect on the spectrogram. Change the playback speed for interesting effects!

# Select some varied sounds from avisoft dataset
examples = [
    (row.english_name, os.path.join(AVISOFT_AUDIO_DIR, row.wav))
    # select one random recording per taxonomic group
    for row in avisoft.groupby("order").sample(n=1).itertuples()
]

# Create interactive plot
ipywidgets.interact(
    plotting.plot_waveform_with_spectrogram,
    hop_length=(32, 1024, 32),
    n_fft=(32, 2048, 32),
    window=plotting.WINDOW_OPTIONS,
    file=examples,
    cmap=plotting.COLORMAPS,
    speed=[
        ("x1", 1),
        ("x1.5", 1.5),
        ("x2", 2),
        ("x0.5", 0.5),
        ("x0.2", 0.2),
        ("x0.1", 0.1),
    ],
);

- Try changing the parameters and see how they affect sounds from different
  species.

- Can you see that some choices of parameters work well for some species but
  not for others?

- How do the parameters affect the computation time and resulting image size?

### Detecting sounds

As you can imagine, it is not easy to manually annotate all relevant sounds in a recording.

Look at this recording from a bat detector:

In [None]:
YUCATAN_METADATA = "/content/week4_data/data/yucatan/yucatan_metadata.csv"

YUCATAN_AUDIO_DIR = "/content/week4_data/data/yucatan/audio"

# Load metadata of dataset of bat recordings from the yucatan peninsula
yucatan = pd.read_csv(YUCATAN_METADATA)

In [None]:
crowded_recording = os.path.join(YUCATAN_AUDIO_DIR, yucatan.id[1017])

plotting.plot_spectrogram(
    crowded_recording,
    hop_length=128,
    n_fft=512,
    figsize=(14, 4),
)

There are many bat calls in this recording, so annotating them all would be very time-consuming.

In [None]:
empty_recording = os.path.join(YUCATAN_AUDIO_DIR, yucatan.id[205])
plotting.plot_spectrogram(
    empty_recording,
    hop_length=128,
    n_fft=512,
    figsize=(14, 4),
)

This other recording has a single bat pulse. You still need to review it thoroughly to make sure there are no other sounds of interest.

### Tadarida

Similar to **MegaDetector** for camera traps, there are some tools that automatically detect animal sounds in recordings.

Here we will explore the tool **Tadarida**. Tadarida is a non-ML generic detector that uses a set of hand-crafted features to detect sounds. It is based on the work of:

> Bas, Y., Bas, D. and Julien, J.-F., 2017. Tadarida: A Toolbox for Animal
> Detection on Acoustic Recordings. Journal of Open Research Software, 5(1),
> p.6. DOI: http://doi.org/10.5334/jors.154

We will use Tadarida to detect bat calls in a recording.

In [None]:
detections = detection.run_tadarida_detection([empty_recording, crowded_recording])

In [None]:
plotting.plot_spectrogram_and_detection(crowded_recording, detections);

In [None]:
plotting.plot_spectrogram_and_detection(empty_recording, detections);

### Evaluate Detections

The calls of this dataset were manually annotated, so we can compare the detections with the ground truth.

In [None]:
YUCATAN_ANNOTATIONS = "/content/week4_data/data/yucatan/yucatan_annotations.csv"

In [None]:
# Load the annotations file
yucatan_annotations = pd.read_csv(YUCATAN_ANNOTATIONS)

Let's first visualize the ground truth annotations.

In [None]:
plotting.plot_spectrogram_and_detection(empty_recording, yucatan_annotations);

In [None]:
plotting.plot_spectrogram_and_detection(crowded_recording, yucatan_annotations);

Now we can compare the detections with the ground truth. We can use the
Intersection over Union (IoU) to measure the overlap between the detections and
the ground truth.

<img alt="intersection over union" src="https://upload.wikimedia.org/wikipedia/commons/c/c7/Intersection_over_Union_-_visual_equation.png" width="400"></img>
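
As a rough illustration of the IoU computation, the sketch below computes it for two time-frequency boxes. The function `box_iou`, the `(start_time, low_freq, end_time, high_freq)` box layout and the example values are assumptions made for this example; they are not necessarily how `evaluate_detection` represents or matches boxes internally.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (start_time, low_freq, end_time, high_freq)."""
    # Width (time) and height (frequency) of the overlapping region
    inter_width = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_height = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])

    # Boxes that do not overlap have an IoU of zero
    if inter_width <= 0 or inter_height <= 0:
        return 0.0

    intersection = inter_width * inter_height
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union


# Hypothetical annotation and detection (seconds and Hz)
annotation = (0.010, 30_000, 0.015, 60_000)
detection_box = (0.011, 32_000, 0.016, 58_000)

iou = box_iou(annotation, detection_box)
print(f"IoU = {iou:.2f}")
```

An IoU of 1 means the two boxes coincide exactly, and an IoU of 0 means they do not overlap at all.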

In [None]:
# Select the predictions and annotations from the crowded recording
file_detections = detections[
    detections.recording_id == os.path.basename(crowded_recording)
]
file_annotations = yucatan_annotations[
    yucatan_annotations.recording_id == os.path.basename(crowded_recording)
]

# Match the bounding boxes by computing the IoU. Discard all matches with IoU less than 0.5
pred_boxes = evaluate_detection.bboxes_from_annotations(file_detections)
true_boxes = evaluate_detection.bboxes_from_annotations(file_annotations)
matches = evaluate_detection.match_bboxes(true_boxes, pred_boxes, iou_threshold=0.5)

We select all the detections that have an IoU greater than 0.5 and count them as true positives. All the other detections are false positives. Sound events that are not detected are false negatives.

In [None]:
# total number of annotated sound events
positives = len(file_annotations)

num_predictions = len(file_detections)

# number of matched prediction boxes
true_positives = len(matches)

# number of predicted boxes that were not matched
false_positives = num_predictions - len(matches)

# number of annotated sound events that were not matched
false_negatives = positives - len(matches)

With this information we can compute the precision and recall of the detections.

In [None]:
# Percentage of predictions that are correct
precision = true_positives / num_predictions

# Percentage of sound events that were detected
recall = true_positives / positives

print(
    f"Tadarida precision={precision:.1%} recall={recall:.1%} on file {crowded_recording}"
)

Let's plot predictions and annotations at the same time.

- red = spurious predicted sound event (false positive)
- green = correct prediction (true positive)
- white = missed sound event (false negative)

In [None]:
plotting.plot_spectrogram_with_predictions_and_annotations(
    crowded_recording,
    detections,
    yucatan_annotations,
    iou_threshold=0.3,
);

We can also compute the precision/recall on each file in the dataset.

In [None]:
# load precomputed tadarida detections to save some time
full_tadarida_detections = pd.read_csv("/content/week4_data/data/yucatan/yucatan_tadarida_detections.csv")

# compute the precision and recall for each file
td_evaluation = []
for filename in yucatan_annotations.recording_id.unique():
    precision, recall = evaluate_detection.compute_file_precision_recall(
        filename,
        full_tadarida_detections,
        yucatan_annotations,
        iou_threshold=0.5,
    )
    td_evaluation.append({"wav": filename, "precision": precision, "recall": recall})

# store the results in a pandas dataframe
td_evaluation = pd.DataFrame(td_evaluation)

### Exercise: Evaluate detections

Using the dataframe with the precision and recall of Tadarida on each file (`td_evaluation`), calculate:

- The mean precision and recall across all files.
- The percentage of files where all bat calls were missed.
- The percentage of files where at least half of the predictions were correct.

Run the full evaluation again but change the `iou_threshold` parameter. What do you observe?

You can use the following interactive widget to get a better grasp of Tadarida's behaviour.

In [None]:
# @title Tadarida predictions

# @markdown Select a file and an IoU threshold.

example_files = [
    os.path.join(YUCATAN_AUDIO_DIR, row["id"])
    for _, row in yucatan.sample(n=20).iterrows()
]


@ipywidgets.interact(path=example_files, iou_threshold=(0, 1, 0.1))
def plot_results_file_results(path=crowded_recording, iou_threshold=0.5):
    precision, recall = evaluate_detection.compute_file_precision_recall(
        path,
        full_tadarida_detections,
        yucatan_annotations,
        iou_threshold=iou_threshold,
    )

    # Report the file selected in the widget, not the fixed crowded recording
    print(
        f"Tadarida precision={precision:.1%} recall={recall:.1%} on file {path}"
    )

    plotting.plot_spectrogram_with_predictions_and_annotations(
        path,
        full_tadarida_detections,
        yucatan_annotations,
        iou_threshold=iou_threshold,
    )

### BatDetect2

As we have seen, the performance of Tadarida has room for improvement. We can improve performance by using a specialised machine learning model.

Now we will use **BatDetect2**, a deep learning model for simultaneous
detection and classification of bat calls. Although the model was trained on
calls of UK bats, we can test its detection performance on the dataset of
bats from the Yucatán peninsula.

In [None]:
%%bash
batdetect2 /content/week4_data/data/yucatan/audio /content/week4_data/data/yucatan/predictions 0.3

The **BatDetect2** model can predict multiple bounding boxes for each recording.
Unlike Tadarida, each bounding box has a **score**, a predicted species and a
confidence score for the species.

We will throw out the predicted species and confidence score, and only use the
bounding box score. The **score** is the probability that the bounding box contains
a bat call.

In [None]:
# Get all prediction files
files = glob.glob("/content/week4_data/data/yucatan/predictions/*.csv")

# Read each prediction file
batdetect2_predictions = []
for path in files:
    df = pd.read_csv(path).drop(columns=["id", "class", "class_prob"])
    df["recording_id"] = os.path.basename(path)[:-4]
    batdetect2_predictions.append(df)

# And concatenate them into a single dataframe
batdetect2_predictions = pd.concat(batdetect2_predictions)

We can then use the same evaluation procedure as before to compute the
precision and recall, except now we can select detections with a score greater
than some customizable threshold.
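
The effect of the score threshold can be sketched with a toy example. The scores, the `is_correct` flags and the number of annotations below are invented for illustration; in the real evaluation, whether a detection is correct depends on IoU matching against the annotations. The helper `precision_recall_at` is hypothetical.

```python
import numpy as np

# Toy data (invented): detection scores and whether each detection
# matches an annotated call
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.35, 0.2])
is_correct = np.array([True, True, False, True, False, False])
num_annotations = 4  # total annotated calls in this toy recording


def precision_recall_at(threshold):
    """Precision and recall when keeping detections with score > threshold."""
    kept = scores > threshold
    true_positives = int(is_correct[kept].sum())
    num_kept = int(kept.sum())
    # Precision is undefined when nothing is kept
    precision = true_positives / num_kept if num_kept else float("nan")
    recall = true_positives / num_annotations
    return precision, recall


for threshold in [0.3, 0.5, 0.7]:
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example, raising the threshold removes low-scoring false positives, so precision goes up, but it also discards some correct detections, so recall goes down.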

In [None]:
batdetect2_predictions.head()

In [None]:
score_threshold = 0.3

plotting.plot_spectrogram_with_predictions_and_annotations(
    crowded_recording,
    batdetect2_predictions[batdetect2_predictions.det_prob > score_threshold],
    yucatan_annotations,
    iou_threshold=0.3,
);

In [None]:
# @title Batdetect2 predictions

# @markdown Select a file, an IoU threshold and a score threshold.

example_files = [
    os.path.join(YUCATAN_AUDIO_DIR, row["id"])
    for _, row in yucatan.sample(n=20).iterrows()
]


@ipywidgets.interact(
    path=example_files,
    iou_threshold=(0, 1, 0.1),
    score_threshold=(0, 1, 0.1),
)
def plot_batdetect2_results_file_results(
    path=crowded_recording,
    iou_threshold=0.5,
    score_threshold=0.3,
):
    confident_detections = batdetect2_predictions[
        batdetect2_predictions.det_prob > score_threshold
    ]

    precision, recall = evaluate_detection.compute_file_precision_recall(
        path,
        confident_detections,
        yucatan_annotations,
        iou_threshold=iou_threshold,
    )

    # Report the file selected in the widget, not the fixed crowded recording
    print(
        f"Batdetect2 precision={precision:.1%} recall={recall:.1%} on file {path}"
    )

    plotting.plot_spectrogram_with_predictions_and_annotations(
        path,
        confident_detections,
        yucatan_annotations,
        iou_threshold=iou_threshold,
        linewidth=2,
    )

### Exercise: Compare performance

## Part 2: Identifying Sounds

- In the previous section we saw how to detect sounds in a recording. But we still need to identify the species that produced the sound.

- Generally, classification is more challenging than detection, as the sounds produced by different species can be very similar (**interspecific overlap**).

- A single species can also produce highly variable vocalisations; think of humans, or of mimicking birds such as starlings (**intraspecific variation**).

- Bioacoustic data presents similar challenges to the camera trap datasets, as recorded sounds can be:
    - **Occluded** (simultaneous sounds)
    - **Recorded in varying ambient conditions** (rain/wind/thunder)
    - **Partial** (only half of the sound was captured)
    - **Noisy** (saturation and faulty sensors)
    - **Quiet or very loud** (depending on animal size, distance, environment)

For the rest of this notebook we will focus on **5** bat species present in the Yucatán dataset.

In [None]:
SPECIES = [
    "Eumops auripendulus",
    "Mormoops megalophylla",
    "Eumops ferox",
    "Myotis keaysi",
    "Saccopteryx bilineata",
]

In [None]:
classification_df = yucatan_annotations[yucatan_annotations["class"].isin(SPECIES)]

### Bat call features

- Previous research on bat call identification was based on hand-crafted features of the bat calls.

- Measuring call features used to be a manual process.

<img src="https://www.elekon.ch/batexplorer2/doc/_images/CallParams.png" alt="call parameters" width="400"/>

> Image taken from the [BatExplorer 2.1 user guide](https://www.elekon.ch/batexplorer2/doc/batcall_params.html).

**Peak frequency [kHz]:**

>    The frequency at which the call is loudest (peak in the spectrum display), aka frequency of maximum energy (FME) or main frequency.
>    Most important parameter for bat classification because it can easily be measured and is often typical for a certain species or group of species.
>    The standard deviation of the peak frequency allows the detection of alternating calling species.
    
**Max frequency [kHz]**

>    The maximum frequency of the call. Often this is equal to the start frequency, for Rhinolophidae typically equal to the peak frequency.
    
**Min frequency [kHz]**

>    The minimum frequency of the call. Often this is equal to the end frequency, for hockey stick calls (e.g. Pipistrelle) it might be lower than the end frequency.
    
**BW Peak2Min [kHz]**

>    Bandwidth Peak2Min = Peak frequency - Min frequency
>    Often used to distinguish Myotis and Pipistrelle calls, Myotis mostly have higher bandwidth.
    
**Call length [ms]**

>    Time period of call start to call end in ms. Can be measured most accurately in the oscillogram (wave rise to wave drop).
>    Search calls from European bats are usually between one and up to approximately 30 ms (horseshoe bats up to 80 ms).
    
**Call distance [ms]**

>    Time period between two consecutive calls in ms. Can be measured most accurately in the oscillogram (wave rise call A to wave rise call B).
>    Often this parameter is not very significant since most bat species have irregular rhythms. But it can be an indicator for behavior.
>    Search calls from European bats usually have distances of about 30 to 300 ms, sometimes even longer.
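
To illustrate how some of these parameters could be measured automatically, the sketch below builds a synthetic frequency-modulated call with NumPy and estimates its peak, minimum and maximum frequencies and its call length. Everything here is an assumption made for illustration: the 250 kHz sampling rate, the 60 kHz to 30 kHz sweep, the -20 dB rule used to delimit the call bandwidth and the 5% amplitude rule used to find the call start and end. Real tools such as BatExplorer may use different definitions.

```python
import numpy as np

# Synthetic downward-sweeping call (all values are illustrative assumptions)
samplerate = 250_000  # Hz
call_length = 0.005  # s
f_start, f_end = 60_000, 30_000  # Hz

t = np.arange(int(call_length * samplerate)) / samplerate
# Linear chirp: the phase is the integral of the instantaneous frequency
phase = 2 * np.pi * (f_start * t + 0.5 * (f_end - f_start) / call_length * t**2)
call = np.hanning(len(t)) * np.sin(phase)

# Surround the call with silence, as in a short recording
silence = np.zeros(int(0.01 * samplerate))
wav = np.concatenate([silence, call, silence])

# Peak frequency: the frequency with the most energy in the spectrum
spectrum = np.abs(np.fft.rfft(wav))
freqs = np.fft.rfftfreq(len(wav), d=1 / samplerate)
peak_freq = freqs[np.argmax(spectrum)]

# Min/max frequency: band within 20 dB of the peak (assumed criterion)
spectrum_db = 20 * np.log10(spectrum / spectrum.max() + 1e-12)
in_band = freqs[spectrum_db > -20]
min_freq, max_freq = in_band.min(), in_band.max()

# Call length: span where the amplitude exceeds 5% of its maximum (assumed)
active = np.flatnonzero(np.abs(wav) > 0.05 * np.abs(wav).max())
measured_length = (active[-1] - active[0]) / samplerate

print(f"Peak frequency = {peak_freq / 1000:.1f} kHz")
print(f"Min frequency  = {min_freq / 1000:.1f} kHz")
print(f"Max frequency  = {max_freq / 1000:.1f} kHz")
print(f"BW Peak2Min    = {(peak_freq - min_freq) / 1000:.1f} kHz")
print(f"Call length    = {measured_length * 1000:.1f} ms")
```

The measured values depend on the chosen thresholds, which is one reason why manually measured call parameters can vary between annotators and tools.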


### Exercise: Annotation

- Explore two calls per species using the widget below.
- Measure the peak frequency, max frequency, min frequency and call length of each call.
- Store your measurements in a dataframe.
- Create a scatterplot of the measured features.

In [None]:
to_annotate = classification_df.groupby("class").sample(n=2)


@ipywidgets.interact(
    index=[(f"{r['class']}_{index}", i) for i, (index, r) in enumerate(to_annotate.iterrows())]
)
def manually_extract_features_from_spectrogram(index=0):
    row = to_annotate.iloc[index]
    # Assumption: this helper lives in the imported `plotting` module
    return plotting.plot_spectrogram_with_plotly(
        path=os.path.join(YUCATAN_AUDIO_DIR, row["recording_id"]),
        start_time=row["start_time"],
        end_time=row["end_time"],
        low_freq=row["low_freq"],
        high_freq=row["high_freq"],
    ).show()

### Tadarida automated features

- Tadarida extracts a large set of features from each detected sound event.

- It is possible to build a pipeline for automated species identification using this automatic feature extraction process.

- First we detect the sounds in the recording, then we extract the features, and
  finally we classify the sounds.

- The classification is done by feeding the extracted features to a classifier
  algorithm, such as a random forest classifier.

### Random Forest Classifier

Build a random forest classifier
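
As a minimal sketch of this classification step, the example below trains a random forest with scikit-learn on synthetic features. The three class names, the two features (peak frequency and call length) and their values are invented for illustration and stand in for the much larger feature set that Tadarida extracts; this is not the pipeline used later in the practical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical classes with invented (peak frequency kHz, call length ms) centres
class_centres = {
    "species_a": (25.0, 8.0),
    "species_b": (45.0, 5.0),
    "species_c": (60.0, 3.0),
}
feature_names = ["peak_freq_khz", "call_length_ms"]

# Draw 50 synthetic calls per class around each centre
features, labels = [], []
for name, centre in class_centres.items():
    features.append(rng.normal(loc=centre, scale=(2.0, 1.0), size=(50, 2)))
    labels += [name] * 50
X = np.concatenate(features)
y = np.array(labels)

# Hold out part of the data to estimate performance on unseen calls
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
importances = dict(zip(feature_names, model.feature_importances_))

print(f"Test accuracy = {accuracy:.1%}")
print(f"Feature importances = {importances}")
```

The `feature_importances_` attribute gives a first indication of which features the forest relies on most. With real call features the classes overlap far more than in this synthetic example, so accuracy will be lower.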

### Evaluation

Evaluate performance with features

Exercise

Which species get confused with each other?

Which features are the most important?

### Universal feature set (AudioSet + YAMNet)

Instead of relying on hand-crafted features, we can use an acoustic feature extractor trained on 5.8k hours of sound from YouTube (AudioSet).

YAMNet was trained to classify sounds into 521 different classes. The features it learned to extract are useful to distinguish and identify a large
variety of sounds.

AudioSet does not contain ultrasonic recordings, and thus is devoid of bat sounds. However, we expect the learnt features to be sufficiently general that they can help identify bat calls.

*Process all clips with YAMNet*

*Use UMAP to visualize*

### Exercise: use YAMNet features + RF to create classifier

### Train a model from scratch (YAMNet)

*Fine-tune YAMNet*

### Exercise

Modify training parameters.

Identify poorly performing species across all methods