<a href="https://colab.research.google.com/github/BenUCL/Reef-acoustics-and-AI/blob/main/Tutorial/2-Feature_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine learning with coral reef soundscape data**

This notebook is a supporting tutorial for the study **Unlocking the soundscape of coral reefs with artificial intelligence** by [Williams et al (2024a)](https://www.biorxiv.org/content/10.1101/2024.02.02.578582v1).

In this publication we recommend combining pretrained neural networks with unsupervised learning for analysing soundscape ecology.

## What this notebook does:
1. Set up: access the sample data and the pretrained model, and import the packages we need.
2. Extract features from the audio data using the SurfPerch pretrained neural network.


## **SurfPerch: A pretrained neural network fine tuned to coral reefs**

In the study associated with this tutorial we used VGGish. Here, however, we will use SurfPerch, a newly developed pretrained neural network fine-tuned to coral reefs, which we created in collaboration with Google DeepMind. It was created and rigorously tested on audio data from 16 unique datasets across 12 countries. You can read more about the network in its supporting research article **Leveraging tropical reef, bird and unrelated sounds for superior transfer learning in marine bioacoustics**, [Williams et al (2024b)](https://arxiv.org/abs/2404.16436).

SurfPerch can also be used to rapidly identify individual sounds in your data, as opposed to the whole-soundscape approach presented in **Unlocking the soundscape of coral reefs with artificial intelligence** by [Williams et al (2024a)](https://www.biorxiv.org/content/10.1101/2024.02.02.578582v1). See a full tutorial on identifying individual sounds [here](https://github.com/BenUCL/surfperch/blob/surfperch/SurfPerch_Demo_with_Calling_in_Our_Corals.ipynb).


## **Our sample data**
We'll use a small sample dataset for this tutorial. This data consists of 262 audio files from healthy, degraded and restored coral reefs in Indonesia. These reefs are part of the world's largest coral reef restoration program, [buildingcoral.com](https://www.buildingcoral.com/). See [Williams et al (2022)](https://doi.org/10.1016/j.ecolind.2022.108986) for more detail on this audio.







# **Step 1: Set up**

## **Access sample data**

There are two routes to accessing the sample data.

### **Route 1 (the quick and easy route)**
The sample data is held in this [public GDrive folder](https://drive.google.com/drive/folders/1JDqpHaUyVxFNuw3K_y9J7f9g28oBk3v5?usp=sharing). To access it:
1. Click this link.
2. Click on the dropdown arrow next to the 'sample_data' heading.
3. Click organize -> Add shortcut.
4. Select the 'All locations' tab.
5. Select 'MyDrive' -> 'Add'

This will add a link to the folder in your GDrive without taking up any of your own GDrive space.


### **Route 2**
If for any reason Route 1 no longer works, the sample dataset is also held within the tutorial zip file on the Zenodo repository for this study (make sure you have accessed the most up-to-date version of this repo from the tab on the right):

1. Download the tutorial file.
2. Unzip the file.
3. Upload the sample data folder into a folder called 'sample_data' on your GDrive. Note: make sure this is in the MyDrive folder on your GDrive, and not in a subfolder.


## **Access the pretrained model**
### **Route 1**
As above, the model can be accessed from a public [GDrive folder](https://drive.google.com/drive/folders/1PzxO1dcjMtIVdqBqEDBBlUQHf-P22EkD?usp=sharing).

Once again, create a shortcut to this in your MyDrive folder.


### **Route 2**
The model can also be accessed from a [Zenodo repository](https://doi.org/10.5281/zenodo.11071202).

1. Download the SurfPerch.zip file.
2. Unzip this file.
3. Find the 'Saved model' folder inside it (note: the folder, not the saved_model.pb file).
4. Rename the 'Saved model' folder to 'SurfPerch-model'.
5. Upload this 'SurfPerch-model' folder to your MyDrive folder in Google Drive.


In [3]:
#@title Import packages
import os # for handling files and directories
import librosa # for audio processing
import tensorflow as tf # for machine learning
import tensorflow_hub as hub # for machine learning
import numpy as np # for numerical processing
import pandas as pd # for handling dataframes
from tqdm import tqdm # for progress bar

In [5]:
#@title Mount Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
#@title Set all filepaths

# Directory containing audio files (look for the shortcut link you added in GDrive)
sample_audio = '/content/drive/MyDrive/sample_data'

# Path to pretrained network (look for the shortcut link you added in GDrive)
model_path = '/content/drive/MyDrive/SurfPerch-model'

# Path where a csv of extracted features will be saved
results_path = '/content/drive/MyDrive/AI4Reefs-tutorial-results/'
os.makedirs(results_path, exist_ok=True)  # create the results folder if it does not exist yet
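
Before going further, it can help to confirm that the audio folder is visible from Colab. The sketch below is one way to do this; `count_wav_files` is a hypothetical helper (not part of this tutorial's code) and is demonstrated on a throwaway folder. In the notebook you would call it on `sample_audio`, where we would expect it to report the 262 files in the sample data.

```python
import os
import tempfile


def count_wav_files(audio_dir):
    """Return the number of files in audio_dir whose name ends in .wav (any case)."""
    return sum(1 for name in os.listdir(audio_dir) if name.lower().endswith('.wav'))


# Demonstration on a temporary folder holding two .wav files and one other file
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['a.wav', 'b.WAV', 'notes.txt']:
        open(os.path.join(demo_dir, name), 'w').close()
    demo_count = count_wav_files(demo_dir)

print(demo_count)  # 2
```

If the count for `sample_audio` is zero or the folder is not found, the most likely cause is that the shortcut was not added directly to MyDrive.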

### Load the SurfPerch neural network model
Check that the SavedModel folder is in your GDrive. You should see:

 assets	saved_model.pb	variables

In [4]:
# Check model is present
!ls '/content/drive/MyDrive/SurfPerch-model'

assets	saved_model.pb	variables
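
The same check can be done in Python rather than with `!ls`. This is a minimal sketch: `missing_model_entries` is a hypothetical helper, and the expected entries are simply the three names listed in the output above. It is demonstrated on a temporary folder; in the notebook you would pass `model_path`.

```python
import os
import tempfile

# The three entries this notebook lists for the SurfPerch SavedModel folder
EXPECTED_ENTRIES = ('assets', 'saved_model.pb', 'variables')


def missing_model_entries(model_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent from model_dir."""
    present = set(os.listdir(model_dir))
    return [entry for entry in expected if entry not in present]


# Demonstration: a folder that lacks the saved_model.pb file
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, 'assets'))
    os.mkdir(os.path.join(demo_dir, 'variables'))
    demo_missing = missing_model_entries(demo_dir)

print(demo_missing)  # ['saved_model.pb']
```

An empty list means all three entries were found.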


In [5]:
# We will load the pretrained neural net as 'model'
model = tf.saved_model.load(model_path)

# **Step 2: Extract features with the neural net**

Now we run the main for loop, which iterates over each file and extracts features using the pretrained neural network.

The results will be stored in a pandas dataframe (similar to a dataframe in R) and saved as 'surfperch_feature_embeddings.csv' in the results folder on your GDrive.

In [6]:
#@title Define helper functions for inference
def get_sample_rate(file_path):
    # Read the sample rate from the file without loading the full audio
    return librosa.get_samplerate(file_path)


def resample_and_split_audio(file_path, original_sr, target_sr=32000, segment_duration=5):
    audio, _ = librosa.load(file_path, sr=original_sr)  # Load with original sample rate
    audio = librosa.resample(audio, orig_sr=original_sr, target_sr=target_sr)  # Resample to 32kHz
    segments = []

    segment_length = target_sr * segment_duration
    total_segments = len(audio) // segment_length

    for i in range(total_segments):
        start = i * segment_length
        end = start + segment_length
        segments.append(audio[start:end])

    return segments


def process_audio_files(audio_dir, model):
    rows_list = []
    original_sr = None

    # Loop through every file in audio_dir
    for filename in tqdm(os.listdir(audio_dir), desc="Processing audio files"):
        if filename.lower().endswith('.wav'):
            file_path = os.path.join(audio_dir, filename)

            # Read the sample rate once, from the first file. Files recorded at a
            # different rate are resampled to this rate when loaded.
            if original_sr is None:
                original_sr = get_sample_rate(file_path)

            try:
                segments = resample_and_split_audio(file_path, original_sr=original_sr)

                for i, segment in enumerate(segments):
                    # Model expects batch dimension, so use np.newaxis to add it
                    logits, embeddings = model.infer_tf(segment[np.newaxis, :])

                    embedding = embeddings.numpy()[0]

                    embedding_index = i + 1
                    row_data = {'filename': filename, 'embedding_index': embedding_index}
                    for j, feature in enumerate(embedding):
                        row_data[f'feature_{j}'] = feature
                    rows_list.append(row_data)
            except Exception as e:
                print(f"An error occurred while processing file: {filename}. Error: {e}")

    features_df = pd.DataFrame(rows_list)
    return features_df
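
To make the chunking in `resample_and_split_audio` concrete, here is a small sketch of the same arithmetic on synthetic audio. `split_into_segments` is a hypothetical name for illustration, and the 12 s clip length is a made-up example. At 32 kHz, one 5 s segment is 160,000 samples, and any leftover audio shorter than a full segment is dropped.

```python
import numpy as np


def split_into_segments(audio, sample_rate=32000, segment_duration=5):
    """Split audio into consecutive full-length segments, dropping any remainder."""
    segment_length = sample_rate * segment_duration
    total_segments = len(audio) // segment_length
    return [audio[i * segment_length:(i + 1) * segment_length]
            for i in range(total_segments)]


# 12 s of synthetic silence at 32 kHz: two full 5 s segments, final 2 s discarded
demo_audio = np.zeros(12 * 32000, dtype=np.float32)
demo_segments = split_into_segments(demo_audio)

print(len(demo_segments), len(demo_segments[0]))  # 2 160000
```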

## Run feature extraction and save results to a csv

This will run orders of magnitude faster on a GPU instance of Google Colab. Check your runtime type if unsure.
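
One way to check the runtime from code is to ask TensorFlow which GPUs it can see. This is a minimal sketch using `tf.config.list_physical_devices`; `gpu_status` is a hypothetical helper name, and the import is wrapped so the example still runs where TensorFlow is not installed.

```python
def gpu_status():
    """Return a short message describing whether TensorFlow can see a GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        return 'TensorFlow is not installed in this environment'
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        return f'{len(gpus)} GPU(s) visible to TensorFlow'
    return 'No GPU visible to TensorFlow; running on CPU'


status = gpu_status()
print(status)
```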

In [None]:
# Extract the features
features_df = process_audio_files(sample_audio, model)

# Save results to GDrive
features_df_path = results_path + 'surfperch_feature_embeddings.csv'
features_df.to_csv(features_df_path, index=False)

That should have run super quick! Once you're done with the GPU on Colab, it's good practice to switch back to a standard CPU runtime. You'll then need to rerun the imports, reconnect your GDrive and rerun the filepaths cell.

In [7]:
#@title Take a peek at the features dataframe

# Load the saved csv from gdrive as a dataframe
features_df = pd.read_csv(results_path + 'surfperch_feature_embeddings.csv')

features_df

| | filename | embedding_index | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | ... | feature_1270 | feature_1271 | feature_1272 | feature_1273 | feature_1274 | feature_1275 | feature_1276 | feature_1277 | feature_1278 | feature_1279 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R.SaF2.0534.805322778.180828.2.24.wav | 1 | 0.025595 | -0.000961 | -0.092028 | -0.058200 | -0.051054 | 0.167595 | 0.255279 | 0.188934 | ... | 0.041821 | -0.010219 | -0.077247 | -0.000017 | -0.014390 | 0.033622 | -0.034953 | 0.050662 | -0.041676 | 0.025004 |
| 1 | R.SaF2.0534.805322778.180828.2.24.wav | 2 | 0.049683 | -0.047311 | -0.028174 | 0.034870 | -0.032217 | 0.061672 | -0.053214 | 0.298171 | ... | -0.004927 | -0.033122 | 0.279182 | -0.086063 | -0.019900 | 0.066959 | 0.294342 | 0.039418 | 0.130793 | 0.037784 |
| 2 | R.SaF2.0534.805322778.180828.2.24.wav | 3 | 0.006276 | -0.033015 | -0.026197 | -0.078695 | -0.028663 | 0.055661 | 0.018560 | 0.120216 | ... | 0.040522 | -0.009973 | -0.031889 | -0.039409 | -0.014203 | 0.051319 | 0.059844 | 0.032144 | 0.024642 | 0.013555 |
| 3 | R.SaF2.0534.805322778.180828.2.24.wav | 4 | 0.108855 | -0.009209 | -0.085800 | -0.036926 | 0.092913 | 0.047049 | 0.153386 | 0.258276 | ... | 0.093242 | 0.015254 | 0.035684 | -0.053235 | -0.021096 | 0.050116 | 0.030732 | 0.047232 | 0.028821 | 0.011349 |
| 4 | R.SaF2.0534.805322778.180828.2.24.wav | 5 | -0.018318 | -0.044021 | 0.011520 | -0.112629 | -0.007731 | 0.041312 | -0.037928 | 0.144136 | ... | 0.006202 | -0.007312 | 0.045920 | -0.039345 | -0.005329 | 0.026487 | 0.230358 | 0.027082 | 0.136863 | 0.028374 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3139 | H.BaN12.0529.805322778.180907.1.24.wav | 8 | 0.120332 | 0.004834 | -0.052110 | -0.037165 | 0.063772 | 0.086834 | -0.025986 | 0.102768 | ... | 0.015969 | 0.021648 | 0.004820 | -0.080648 | -0.020962 | 0.054402 | 0.127424 | 0.026032 | 0.114652 | 0.042423 |
| 3140 | H.BaN12.0529.805322778.180907.1.24.wav | 9 | 0.044322 | -0.048361 | -0.083801 | -0.030796 | 0.082536 | 0.105688 | 0.102236 | 0.135143 | ... | 0.047041 | 0.005686 | -0.061259 | -0.125162 | -0.005236 | 0.045542 | -0.066210 | 0.030608 | 0.050268 | 0.007461 |
| 3141 | H.BaN12.0529.805322778.180907.1.24.wav | 10 | 0.026055 | 0.036961 | -0.105839 | 0.000249 | -0.038886 | 0.055119 | 0.043387 | 0.024983 | ... | 0.011465 | -0.049227 | -0.098524 | -0.111199 | -0.010655 | 0.027981 | -0.129594 | 0.025398 | -0.001154 | 0.030654 |
| 3142 | H.BaN12.0529.805322778.180907.1.24.wav | 11 | -0.056069 | -0.007967 | -0.110816 | -0.014680 | -0.040302 | 0.149659 | -0.006894 | 0.209868 | ... | -0.013996 | 0.020267 | -0.026193 | -0.093179 | -0.037199 | 0.049188 | -0.003985 | 0.031350 | 0.035892 | 0.020742 |


## **Finished!**

You should see a results table that contains:
1. All the audio files in our sample data under the 'filename' column.
2. The 5s chunk of audio that each row corresponds to, under the 'embedding_index' column. SurfPerch takes audio in 5s chunks, so each file is cut up before inference.
3. There should be feature columns running from feature_0 to feature_1279. Each 5s chunk is now represented by these highly informative (to a machine) feature embeddings.
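
The three points above can also be checked in code. The sketch below is one way to summarise a features table; `summarise_features` is a hypothetical helper, and the toy dataframe uses made-up values with only three feature columns. In the notebook you would pass `features_df`, where we would expect 1280 feature columns.

```python
import pandas as pd


def summarise_features(features_df):
    """Summarise the number of files, rows, feature columns and chunks per file."""
    feature_cols = [c for c in features_df.columns if c.startswith('feature_')]
    chunks_per_file = features_df.groupby('filename')['embedding_index'].max()
    return {
        'n_files': int(features_df['filename'].nunique()),
        'n_rows': len(features_df),
        'n_features': len(feature_cols),
        'max_chunks_per_file': int(chunks_per_file.max()),
    }


# Toy table with made-up values: two files, three feature columns
demo_df = pd.DataFrame({
    'filename': ['a.wav', 'a.wav', 'b.wav'],
    'embedding_index': [1, 2, 1],
    'feature_0': [0.1, 0.2, 0.3],
    'feature_1': [0.0, 0.1, 0.2],
    'feature_2': [0.5, 0.4, 0.3],
})
demo_summary = summarise_features(demo_df)

print(demo_summary)
# {'n_files': 2, 'n_rows': 3, 'n_features': 3, 'max_chunks_per_file': 2}
```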
