<a href="https://colab.research.google.com/github/SarrKhadija/Reef-Bioacoustics/blob/main/1_Coral_Reef_Soundscape_Feature_Extraction_08_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### First save a copy of this notebook in your drive so all your changes are saved somewhere!

# **Machine learning with coral reef soundscape data**

What this notebook does:
1. Finalizes audio pre-processing by trimming the sound files for the SurfPerch pretrained neural network.
2. Extracts feature embeddings from the audio data using the SurfPerch neural network.


Getting started:
1. The link to the audio data is here: https://drive.google.com/drive/folders/1uCqkeq8pBwAN2KQCv807TBdhgyTFrBgL?usp=drive_link
2. Copy the SurfPerch shortcut into your Google Drive using this link: https://drive.google.com/drive/folders/1PzxO1dcjMtIVdqBqEDBBlUQHf-P22EkD
3. To extract the features, use a GPU runtime.
4. Modify the code as necessary where **EDIT** appears.








# **Step 1: Set up**


In [1]:
#@title Import packages
import os # for handling files and directories
import librosa # for audio processing
import soundfile as sf
import tensorflow as tf # for machine learning
import tensorflow_hub as hub # for machine learning
import numpy as np # for numerical processing
import pandas as pd # for handling dataframes
from tqdm import tqdm # for progress bar

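#Set seed for reproducibility
random_seed = 0
np.random.seed(random_seed) # seed NumPy's random number generator
tf.random.set_seed(random_seed) # seed TensorFlow's random number generator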

In [2]:
#@title Mount Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
#Set filepaths

# Directory containing audio files
audio_dir = '/content/drive/MyDrive/UCL /Final preprocessed data' #EDIT

#Path for spectrograms
spectropath = '/content/drive/MyDrive/UCL /spectrograms_test.tif' #EDIT

# Path to pretrained network from your Google Drive folder
model_path = '/content/drive/MyDrive/UCL /Dissertation coral Bioacoustics/SurfPerch-model' #EDIT

# Path where we will save a csv of extracted features
feature_df_path = '/content/drive/MyDrive/UCL /Dissertation coral Bioacoustics/full_extracted_features.csv' #EDIT



Create a filepath to include column names in the file.

### Trim the WAV files

In [7]:
#Verify that the entire WAV dataset has loaded (1550 total files)
len(os.listdir(audio_dir))


1550






In [8]:
# Save the trimmed audio as a new file
new_wav_path = "/content/drive/MyDrive/UCL /Trimming/" #EDIT
os.makedirs(new_wav_path, exist_ok=True) # create the output folder if it does not exist

# Keep the first 55 seconds of each recording (removes the last 2 seconds of a 57-second file)
trim_duration = 55

for filename in os.listdir(audio_dir):
  if filename.lower().endswith(".wav"):
    # Create the full file path
    wav_path = os.path.join(audio_dir, filename)
    # Load the audio at its native sample rate (librosa resamples to 22,050 Hz by default)
    y, sr = librosa.load(wav_path, sr=None)
    # Calculate the number of samples in 55 seconds
    trimmed_length = int(sr * trim_duration)
    # Extract the trimmed audio
    trimmed_audio = y[:trimmed_length]
    new_file_path = os.path.join(new_wav_path, filename)
    print(new_file_path)
    sf.write(new_file_path, trimmed_audio, sr)



/content/drive/MyDrive/UCL /Trimming/M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_20220905_115301.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD4_SailisiBDegradedCES_249BC30461CB6536_20220905_115300.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD2_SailisiBDegradedCES_24A04F0861CB7239_20220905_115301.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD5_SailisiBDegradedCES_24F3190361CB6A83_20220905_115300.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD6_SailisiBDegradedCES_249BC30461CB7108_20220905_115300.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD6_SailisiBDegradedCES_249BC30461CB7108_20220905_115400.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD4_SailisiBDegradedCES_249BC30461CB6536_20220905_115400.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD2_SailisiBDegradedCES_24A04F0861CB7239_20220905_115401.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_20220905_115401.WAV
/content/drive/MyDrive/UCL /Trimming/M14_SD5_SailisiBDegradedCES_24F3190361CB6A8

In [10]:
#Check the number of files in the trimming folder, which should also equal 1550
len(os.listdir('/content/drive/MyDrive/UCL /Trimming'))

1550

### Load the SurfPerch neural network model

In [9]:
# Check we have saved the model in a folder in GDrive
!ls '/content/drive/MyDrive/UCL /Dissertation coral Bioacoustics/SurfPerch-model' #EDIT

assets	saved_model.pb	variables


In [11]:
model = tf.saved_model.load('/content/drive/MyDrive/UCL /Dissertation coral Bioacoustics/SurfPerch-model') #EDIT

## Extract features with the neural net

Now we run the main for loop, which iterates over each file and extracts features using the pretrained neural network.

The results are stored in a pandas DataFrame (similar to a data frame in R) and saved to the CSV file specified by `feature_df_path` in your Google Drive.

In [12]:
#Define helper functions for inference
original_sr=16000
target_sr=32000
segment_duration=5

def resample_and_split_audio(file_path, original_sr=original_sr, target_sr=target_sr, segment_duration=segment_duration):
    audio, _ = librosa.load(file_path, sr=original_sr)  # Load with original sample rate
    audio = librosa.resample(audio, orig_sr=original_sr, target_sr=target_sr)  # Resample to 32kHz
    segments = []

    segment_length = target_sr * segment_duration
    total_segments = len(audio) // segment_length

    for i in range(total_segments):
        start = i * segment_length
        end = start + segment_length
        segments.append(audio[start:end])

    return segments


def process_audio_files(audio_dir, model):
    rows_list = []

    for filename in tqdm(os.listdir(audio_dir), desc="Processing audio files"):
        if filename.lower().endswith('.wav'):
            try:
                file_path = os.path.join(audio_dir, filename)

                segments = resample_and_split_audio(file_path, original_sr=16000)

                for i, segment in enumerate(segments):
                    # Model expects batch dimension, so use np.newaxis to add it
                    logits, embeddings = model.infer_tf(segment[np.newaxis, :])

                    embedding = embeddings.numpy()[0]

                    embedding_index = i + 1
                    row_data = {'filename': filename, 'embedding_index': embedding_index}
                    for j, feature in enumerate(embedding):
                        row_data[f'feature_{j}'] = feature
                    rows_list.append(row_data)
            except Exception as e:
                print(f"An error occurred while processing file: {filename}. Error: {e}")

    feature_df = pd.DataFrame(rows_list)
    return feature_df

In [None]:
#Using SurfPerch, run feature extraction and save results to a csv
feature_df = process_audio_files(audio_dir, model)

# Save results to your drive
feature_df.to_csv(feature_df_path, index=False)

Processing audio files: 100%|██████████| 1550/1550 [03:31<00:00,  7.34it/s]


In [13]:
# Load the saved csv from gdrive as a dataframe
results_df = pd.read_csv(feature_df_path)

results_df

Unnamed: 0,filename,embedding_index,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,...,feature_1270,feature_1271,feature_1272,feature_1273,feature_1274,feature_1275,feature_1276,feature_1277,feature_1278,feature_1279
0,M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_...,1,-0.002539,0.083185,0.222650,0.004583,0.038043,0.017060,-0.051531,0.048980,...,-0.026595,-0.078421,0.141446,-0.028569,0.053492,0.003565,0.013917,0.010823,0.028640,0.045186
1,M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_...,2,0.046627,0.109666,0.282049,-0.077606,0.367022,-0.025223,0.031093,0.117443,...,-0.029934,0.014744,0.191568,0.086299,0.063188,0.010785,0.108753,0.066897,0.031906,0.045598
2,M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_...,3,-0.010742,0.172192,0.392482,-0.095805,0.209476,-0.028116,0.007652,0.027511,...,-0.036542,-0.003506,0.110579,-0.000310,0.081100,-0.001886,0.072354,0.053671,0.115199,0.052882
3,M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_...,4,-0.044842,0.186239,0.366160,-0.096586,0.203667,-0.036239,0.058365,0.119342,...,-0.034858,-0.014681,0.079934,0.067377,0.009581,0.008874,0.114696,0.039663,0.026609,0.054825
4,M14_SD1_SailisiBDegradedCES_ 24F3190361CB64DD_...,5,-0.023756,0.180806,0.452852,-0.080918,0.123796,-0.041198,-0.057296,0.180241,...,-0.028202,-0.103154,0.003755,0.222628,0.040353,0.021912,0.417833,0.107317,0.121505,0.032021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17045,M14_SD6_SailisiBDegradedCES_249BC30461CB7108_2...,7,-0.029924,0.020315,0.122832,-0.105937,0.066880,0.018540,-0.031619,0.003219,...,-0.031646,-0.054069,-0.040042,0.046798,0.047721,0.021873,-0.093131,0.038579,0.031557,0.051119
17046,M14_SD6_SailisiBDegradedCES_249BC30461CB7108_2...,8,-0.020249,0.044725,0.071765,-0.071905,0.006452,0.074069,0.099389,-0.061598,...,-0.034972,-0.111570,0.032762,0.104740,-0.010674,0.021865,-0.071306,0.001389,-0.019429,0.020112
17047,M14_SD6_SailisiBDegradedCES_249BC30461CB7108_2...,9,-0.008965,-0.014248,0.042899,-0.086784,0.069233,0.031179,-0.070563,-0.035562,...,-0.018560,-0.095192,0.067146,0.015477,-0.002130,-0.000711,-0.045309,0.027523,-0.081269,0.063390
17048,M14_SD6_SailisiBDegradedCES_249BC30461CB7108_2...,10,-0.038951,-0.013668,0.116261,-0.073538,-0.004532,0.028004,-0.041521,-0.006882,...,-0.015565,-0.023930,-0.021138,0.082813,0.022156,-0.000128,-0.095444,0.021635,-0.055686,0.050165


## Preprocessing: Convert WAV file to spectrograms
SurfPerch applies a mel transformation to convert the WAV files into spectrograms before classifying them. Below, we take a closer look at how it does so.

The WAV files are 57-second chunks of audio. Spectrograms can be generated for any time interval; we will start with the full 57-second recordings.

SurfPerch, however, splits audio into 5-second intervals. In the next Colab notebook, we will use SurfPerch to extract feature embeddings (11 embedding indices for each 55-second WAV file).
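To make the 5-second segmentation concrete, here is a minimal sketch that splits a synthetic signal into non-overlapping 5-second segments. The function name `split_into_segments` and the silent 57-second test signal are illustrative assumptions, not part of SurfPerch; the 32 kHz sample rate matches the `target_sr` used in the helper functions in this notebook.

```python
import numpy as np

def split_into_segments(audio, sample_rate, segment_seconds=5):
    """Split a 1-D signal into non-overlapping, equal-length segments.

    Samples left over after the last full segment are discarded.
    (Hypothetical helper for illustration only.)
    """
    segment_length = sample_rate * segment_seconds
    n_segments = len(audio) // segment_length
    return [audio[i * segment_length:(i + 1) * segment_length]
            for i in range(n_segments)]

# Synthetic stand-in for a recording: 57 s of silence at 32 kHz (assumed values)
sample_rate = 32000
audio = np.zeros(57 * sample_rate, dtype=np.float32)

segments = split_into_segments(audio, sample_rate)
print(len(segments))     # number of full 5-second segments
print(len(segments[0]))  # samples in each segment
```

Under these assumptions, a 57-second and a 55-second recording both yield 11 full segments, because the final 2 seconds of a 57-second file do not fill a complete 5-second window.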

In [14]:
#Correct filepaths
WAVM10_Sd1_GosongHealthy2 = '/content/drive/MyDrive/UCL /Trimming/M10_SD1_GosongHealthy2_24F3190361CB64DD_20220901_113000.WAV'
WAVfolder ='/content/drive/MyDrive/UCL /Trimming'


In [17]:
#We convert WAV files into spectrograms to visualize what will be passed through the Convolutional Neural Network in the next Colab
def wav_to_spectrogram(wav_path=WAVM10_Sd1_GosongHealthy2):
  # Load the audio (librosa resamples to 22,050 Hz by default)
  y, sr = librosa.load(wav_path)
  # Compute the mel-scaled power spectrogram
  S = librosa.feature.melspectrogram(y=y, sr=sr)
  # Convert power to decibels for visualization
  S_db = librosa.power_to_db(S)
  return S_db


### The extracted features

You should obtain a dataframe with the features SurfPerch extracted from the audio data. It includes a column for the original filename, one for the embedding index, and the corresponding 1280 features.
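As a rough sanity check, the expected size of the feature table can be worked out from the numbers above. The sketch below assumes every one of the 1550 files is 55 seconds long and yields a full set of 5-second segments; the function name `expected_feature_table_shape` is hypothetical.

```python
def expected_feature_table_shape(n_files, clip_seconds, segment_seconds, embedding_dim):
    """Return the (rows, columns) expected for the extracted-feature table.

    Assumes every file has the same duration and that only full segments
    produce an embedding. (Hypothetical helper for illustration only.)
    """
    segments_per_file = clip_seconds // segment_seconds
    n_rows = n_files * segments_per_file
    # Two identifier columns (filename, embedding_index) plus one column per feature
    n_cols = 2 + embedding_dim
    return n_rows, n_cols

rows, cols = expected_feature_table_shape(
    n_files=1550, clip_seconds=55, segment_seconds=5, embedding_dim=1280
)
print(rows, cols)
```

Comparing this against `results_df.shape` is one way to spot files that failed to process or came out shorter than expected.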

We will add metadata to the above dataframe and perform UMAP and PCA visualizations in the next Colab notebook:
https://colab.research.google.com/drive/1EocFBzSt9fQdLM4sLk5XY_LGv3a1wfkm
