<a href="https://colab.research.google.com/github/BenUCL/AI4Reefs-Workshop/blob/main/Audio_Transfer_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine learning with coral reef soundscape data**

We're going to use the audio data you accessed in the introductory Colab. It includes 151 one-minute audio files from healthy and degraded reefs in Indonesia.

Follow these steps:
- First, run the code up to the end of Step 1. Let the neural net extract features from the first several audio files and note the speed of this.
- Now switch to a GPU runtime (Runtime > Change runtime type) and re-run the cells above, since changing the runtime resets the session. Feature extraction should now be much faster than in the original run. Let it run to completion.

# **Step 1: Feature extraction**
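VGGish cuts each recording into 0.96 s frames and outputs one 128-dimensional feature vector (an 'embedding') per frame, so a one-minute file yields 62 embeddings.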
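As a rough check of what the 0.96 s framing means for this dataset, the sketch below counts how many whole 0.96 s frames fit in a one-minute recording at the 16 kHz sample rate used later in this notebook. It assumes the frames do not overlap and that leftover audio shorter than one frame is dropped; both are assumptions about VGGish's behaviour that this notebook does not verify.

```python
# Rough arithmetic: how many 0.96 s frames fit in a one-minute recording?
SAMPLE_RATE = 16000        # Hz, the rate passed to librosa.load in this notebook
FRAME_SECONDS = 0.96       # length of each audio frame
CLIP_SECONDS = 60          # one-minute recordings

frame_samples = round(FRAME_SECONDS * SAMPLE_RATE)   # samples per frame
clip_samples = CLIP_SECONDS * SAMPLE_RATE            # samples per recording

n_frames = clip_samples // frame_samples             # whole frames only (assumption)
leftover_seconds = (clip_samples % frame_samples) / SAMPLE_RATE

print(n_frames, leftover_seconds)
```

You can compare `n_frames` with the number of rows each file contributes to the dataframe once feature extraction has finished.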





In [1]:
#@title Imports
import os # for handling files and directories
import librosa # for audio processing
import tensorflow as tf # for machine learning
import tensorflow_hub as hub # for machine learning
import numpy as np # for numerical processing
import pandas as pd # for handling dataframes
from tqdm import tqdm # for progress bar

### Mount your drive and add the path to the audio data once again

In [None]:
# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

# Directory containing audio files
#audio_dir = 'ADD PATH HERE'
audio_dir = '/content/drive/MyDrive/Reef soundscapes with AI/audio_dir'


In [3]:
#@title Load the VGGish model
model = hub.load('https://www.kaggle.com/models/google/vggish/frameworks/TensorFlow2/variations/vggish/versions/1')

## Extract features with the neural net

Now we run the main for loop to iterate over each file and extract features using the pretrained neural network.

To speed things up, we only use 40 healthy and 40 degraded files.
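The extraction loop further down processes every `.wav` file it finds in `audio_dir`, so if your folder holds more than 80 files, limiting the run to 40 per class needs an explicit selection step. Below is a minimal sketch of one way to do that. The helper `select_subset` and the example filenames are hypothetical, and the sketch assumes the same naming convention as `get_class` (a 'D' or an 'H' in the filename).

```python
import random

def select_subset(filenames, per_class=40, seed=0):
    """Pick up to `per_class` healthy and degraded .wav files (hypothetical helper)."""
    wavs = sorted(f for f in filenames if f.endswith('.wav'))
    # Mirror get_class: 'D' is checked first, so 'H' only counts when there is no 'D'
    degraded = [f for f in wavs if 'D' in f]
    healthy = [f for f in wavs if 'H' in f and 'D' not in f]
    rng = random.Random(seed)  # fixed seed so the same files are chosen on each run
    rng.shuffle(healthy)
    rng.shuffle(degraded)
    return sorted(healthy[:per_class] + degraded[:per_class])

# Made-up filenames for illustration only
example_files = [f'H{i}.wav' for i in range(50)] + [f'D{i}.wav' for i in range(50)]
subset = select_subset(example_files)
print(len(subset))
```

With your own data you could pass `os.listdir(audio_dir)` to the helper and loop over the returned list instead of the whole directory.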

The results are saved to a pandas dataframe (similar to a data frame in R) and written to 'extracted_features.csv', which should appear in the file tab on the left.
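To make the shape of that table concrete, here is a small sketch that builds the same layout from a random array standing in for VGGish output: one row per embedding, with one `feature_` column per embedding dimension. The random array, the filename `H1.wav`, and the helper name `embeddings_to_frame` are stand-ins for illustration; only the column names follow the extraction code in this notebook.

```python
import numpy as np
import pandas as pd

def embeddings_to_frame(filename, file_class, embeddings):
    """Turn an (n_frames, n_features) array into one dataframe row per frame."""
    cols = [f'feature_{j}' for j in range(embeddings.shape[1])]
    frame = pd.DataFrame(embeddings, columns=cols)
    # Insert the metadata columns at the front, in reverse order
    frame.insert(0, 'class', file_class)
    frame.insert(0, 'embedding_index', np.arange(1, len(frame) + 1))
    frame.insert(0, 'filename', filename)
    return frame

# Random numbers standing in for real embeddings (62 frames x 128 features)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(62, 128))

toy_df = embeddings_to_frame('H1.wav', 'healthy', fake_embeddings)
print(toy_df.shape)
```

The extraction cell below reaches the same layout by building one dictionary per row, which is easier to read line by line while it runs.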

While the code is running, take a look at it and see how much you can understand. Try asking ChatGPT or Claude to explain parts you don't understand.

In [None]:
# Function to extract class from filename
def get_class(filename):
    if 'D' in filename:
        return 'degraded'
    elif 'H' in filename:
        return 'healthy'
    else:
        return 'unknown'

# Function to extract features from audio files and save these to a df
def process_audio_files(audio_dir, model):

    # List to store outputs
    rows_list = []
    # Create outputs in loop
    for filename in tqdm(os.listdir(audio_dir), desc="Processing audio files"):
        if filename.endswith('.wav'):
            file_class = get_class(filename)

            # Process the file
            file_path = os.path.join(audio_dir, filename)
            audio, _ = librosa.load(file_path, sr=16000)
            vggish_features = model(audio).numpy()

            for i, embedding in enumerate(vggish_features):
                # 1-based index of this embedding within the file
                embedding_index = i + 1
                row_data = {'filename': filename, 'embedding_index': embedding_index, 'class': file_class}
                for j, feature in enumerate(embedding):
                    row_data[f'feature_{j}'] = feature
                rows_list.append(row_data)

    # Create DataFrame from the list
    results_df = pd.DataFrame(rows_list)
    return results_df

# Process the audio files
results_df = process_audio_files(audio_dir, model)

# Save the results to a CSV file
results_df.to_csv('/content/extracted_features.csv', index=False)

Once you're done with the GPU on Colab, it's good practice to switch back to a standard runtime. Make sure to download extracted_features.csv first! You can then re-upload it to your new runtime.

In [None]:
#@title Take a peek at the dataframe

# Load the csv as a dataframe
df = pd.read_csv('ADD PATH')

# We add another column that encodes classes as integers for later
class_mapping = {'healthy': 1, 'degraded': 0}
df['encoded_class'] = df['class'].map(class_mapping)

# Place 'encoded_class' next to the 'class' column
cols = df.columns.tolist()
class_index = cols.index('class')
cols.insert(class_index + 1, cols.pop(cols.index('encoded_class')))
df = df[cols]

# Display the DataFrame to see the new column
df

### Use colab to plot the df

Scroll to the right of the df table and you should see the plotting symbol in the top right.

Let's make a plot that shows the distribution of classes.

Questions
1. What is the approximate ratio of healthy to degraded?
2. How might this affect the metric we use for assessing a supervised classifier?
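
One way to check the class balance numerically, rather than reading it off the plot, is sketched below on a small made-up dataframe. The toy counts are invented for illustration; on the real data you would call `value_counts` on `df['class']`. The last lines hint at why the ratio matters: a model that always predicts the most common class already scores the majority-class proportion as its accuracy.

```python
import pandas as pd

# Toy data, not the workshop data: 6 healthy rows and 4 degraded rows
toy = pd.DataFrame({'class': ['healthy'] * 6 + ['degraded'] * 4})

counts = toy['class'].value_counts()                      # rows per class
proportions = toy['class'].value_counts(normalize=True)   # share of rows per class
print(counts)

# Accuracy of a "classifier" that always predicts the most common class
majority_baseline = proportions.max()
print(majority_baseline)
```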



# **Step 2: Cluster the data**

Now we are going to perform unsupervised clustering. We'll start with k-means clustering. Like random forests for classification, k-means is a sensible default for a quick first model.

Start by looking at the scikit-learn documentation and try to get a quick understanding of k-means (https://scikit-learn.org/stable/modules/clustering.html#k-means). Stuck on anything? Ask ChatGPT! It is generally reliable on commonly used approaches like this, but check its answers against the documentation.
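
For intuition about what k-means does, the sketch below implements the basic algorithm (Lloyd's algorithm) in NumPy on toy 2D points: assign each point to its nearest centre, move each centre to the mean of its points, and repeat. It is a teaching sketch, not scikit-learn's implementation, which uses smarter initialisation and other refinements. The function `simple_kmeans` and the toy data are made up for illustration.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=50, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centres)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random
    centres = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # index of the nearest centre
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # leave an empty cluster's centre where it is
                centres[c] = members.mean(axis=0)
    return labels, centres

# Two well-separated toy blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
toy_points = np.vstack([blob_a, blob_b])

labels, centres = simple_kmeans(toy_points, k=2)
print(np.bincount(labels))  # number of points assigned to each cluster
```

The real embeddings have 128 dimensions rather than 2, but the same assign-then-update loop applies.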

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Set variables
num_clusters = 2
random_seed = 0

# Select only the feature columns (assuming they are named 'feature_0', 'feature_1', ...)
feature_cols = [col for col in df.columns if col.startswith('feature_')]
X = df[feature_cols]

# Apply k-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=random_seed)
clusters = kmeans.fit_predict(X)

# Add the cluster information to the DataFrame
df['cluster'] = clusters

# Prepare data for the bar plot
cluster_class_counts = df.groupby(['cluster', 'class']).size().unstack().fillna(0)

# Create the bar plot
# Class columns are sorted alphabetically (degraded, healthy), so colours follow that order
cluster_class_counts.plot(kind='bar', stacked=True, color=['orange', 'green'])
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.title('Cluster Distribution by Class')
plt.legend(title='Class')
plt.show()


### Questions
1. Try varying the number of clusters. What are the benefits of setting fewer or more clusters?
2. What can you interpret from these results about how different healthy and degraded reefs sound?
3. We know whether the audio came from healthy or degraded reefs. Why might unsupervised learning still be useful here, instead of supervised learning?
4. HARDER: Now try using ChatGPT and/or the scikit-learn clustering documentation (https://scikit-learn.org/stable/modules/clustering.html#k-means) to implement a different clustering algorithm. You will only need to modify the two lines of code underneath 'Apply k-means clustering'. Do this in a new cell below.

# **Step 3: Dimensionality reduction and visualisation**

We will use Uniform Manifold Approximation and Projection (UMAP). Put simply, UMAP reduces the dimensions of our data from the 128 VGGish outputs to something lower. It can go as low as two dimensions, allowing us to visualise the data in 2D.

This will plot recordings that sound similar close together, and recordings that sound different further apart. Note that it doesn't directly create clusters.

UMAP is not a standard package on Colab, so we need to install it first. Packages are typically installed from the terminal; in a notebook, we can put '!' before a line to run it as a terminal command.

In [None]:
# Install UMAP
!pip install umap-learn

### Now run UMAP

We pass UMAP 'X'. Can you see where this is coming from? What does X contain?

UMAP takes a minute or two to run...

In [None]:
import umap

# n_neighbors and min_dist are important parameters, but we can use common defaults
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=random_seed)
embedding = reducer.fit_transform(X)

Because the UMAP step takes longer to run, we produce the plot in a separate cell so that edits to the plot are quicker.

In [None]:
# Plot the UMAP projection
c = df['cluster']

plt.figure(figsize=(12, 8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=c, cmap='Spectral', s=5)
plt.colorbar(label='Cluster')
plt.title('UMAP Projection of the Dataset')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()

### Questions / Exercises
1. You can save the plot by right-clicking it, which is useful for comparing different versions.
2. Here we are seeing the points colour coded by 'cluster'. Try colour coding them by 'class'. How does this change the plot? Why is this?
3. What if you produce more clusters in the k-means section and plot by cluster again?
4. You might see some small groups of anomalies away from their main group. What could cause this?

# **Advanced: Supervised learning**
1. Take the feature variables created earlier from the df. Create a labels variable from the 'encoded_class' column (use ChatGPT for help if needed). Check that your labels variable looks correct.
2. Now use and adapt the relevant code from the intro Colab to train a random forest classifier and report the accuracy.
3. This should be very accurate. Remember, we have 62 samples from each 1-min recording that likely sound similar. Why could this inflate accuracy? How could we handle this better? Try using the 'groupby' function in pandas.
4. Due to the small class imbalance, we might want a different metric to accuracy. Why is this? Try to find some candidates and implement one of them (ask ChatGPT for help).
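
For the third exercise, one possible starting point is to split by recording rather than by row, so that every embedding from a given file lands on the same side of the train/test split. The sketch below does this with pandas and NumPy on a made-up dataframe. The helper `split_by_file`, the toy filenames, and the 25% test fraction are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_by_file(frame, test_fraction=0.25, seed=0):
    """Split rows into train/test so that no filename appears in both."""
    files = frame['filename'].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(files)             # random order of recordings
    n_test = max(1, int(round(len(files) * test_fraction)))
    test_files = shuffled[:n_test]                # recordings held out for testing
    is_test = frame['filename'].isin(test_files)
    return frame[~is_test], frame[is_test]

# Toy data: 8 made-up recordings with 3 embeddings each
toy = pd.DataFrame({
    'filename': np.repeat([f'rec{i}.wav' for i in range(8)], 3),
    'feature_0': np.arange(24, dtype=float),
})
train_df, test_df = split_by_file(toy)

# No recording should contribute rows to both sides
overlap = set(train_df['filename']) & set(test_df['filename'])
print(len(train_df), len(test_df), len(overlap))
```

scikit-learn's `GroupShuffleSplit` offers a similar group-aware split if you would rather not write the helper yourself.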
