# Forest elephant vocalisation call-type classification

This Jupyter notebook is a step-by-step guide to classifying forest elephant vocalisation call types with pre-trained CNNs used as feature extractors (transfer learning).
It evaluates how well these automatically extracted features distinguish between three call types in a dataset of 1254 forest elephant vocalisations.

## Dataset Description

This dataset contains information on African forest elephant vocalisations recorded in Dzanga-Bai clearing, Central African Republic, between September 2018 and April 2019 by the Elephant Listening Project.

It includes:

1. `elephant_vocalisations.csv`: A table of 1254 annotated vocalisations, each with start time, end time, frequency range, call type (roar, rumble, or trumpet), and corresponding audio file.
2. `{model}_vocalisation_features.parquet`: Parquet files storing acoustic features extracted from the vocalisations using the workflow described in the `1_feature_extraction_notebook`. Features are extracted using four different CNN models (VGGish, YAMNet, BirdNET, and Perch).

## Steps
1. **Set-up**: Load the vocalisations data and the pre-computed features.
2. **Dimensionality reduction**: Project the acoustic features into 2D space as the basis for clustering and statistical analysis.
3. **Silhouette analysis**: Calculate the silhouette scores for the UMAP acoustic feature embeddings.
4. **Call-Type Classification**: Train a random forest classifier and assess its classification performance.

**Note:** Feature extraction is a computationally intensive process. 
To avoid re-computation, this notebook uses pre-computed features. 
Refer to the `1_feature_extraction_notebook` for details on the feature extraction workflow.

### 1. Set up

Here we import the dependencies needed to load the data, along with some pre-defined helper functions located in the `elephant_scripts` folder of the main repository.

In [None]:
from pathlib import Path

import pandas as pd

from elephant_scripts.load_data import load_vocalisation_dataset

AUDIO_DIR = Path("audio_dir")
DATA_DIR = Path("data")
OUTPUTS_DIR = Path("outputs")

# Now we load the table containing information about each of the elephant vocalisations.
df = load_vocalisation_dataset(
    DATA_DIR / "elephant_vocalisations.csv",
    audio_dir=AUDIO_DIR,
)

# And we load each of the pre-computed features.
MODELS = ["VGGish", "YAMNet", "BirdNET", "Perch"]

features = {
    model: pd.read_parquet(
        OUTPUTS_DIR / f"{model.lower()}_vocalisation_features.parquet"
    )
    for model in MODELS
}

### 2. Dimensionality reduction

Each of the 1254 recordings is now represented by a high-dimensional feature embedding. To visualise these embeddings and use them in the statistical tests, we project them into a lower-dimensional space in two steps:
1. Normalise the embeddings so that each feature has mean 0 and variance 1. This gives every feature equal weight.
2. Run the dimensionality reduction with the chosen parameters, including the number of components (2) and the distance metric (cosine).
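
As a rough illustration of the normalisation step, the sketch below standardises two made-up feature columns using only the Python standard library. The helper name `standardise` and the toy values are assumptions for illustration. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev


def standardise(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu = fmean(column)
    sigma = pstdev(column, mu)  # population standard deviation (ddof = 0)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Two toy features on very different scales (values are made up)
feature_a = [0.001, 0.002, 0.003, 0.004]
feature_b = [1000.0, 3000.0, 2000.0, 4000.0]

scaled_a = standardise(feature_a)
scaled_b = standardise(feature_b)

# After scaling, both columns span a comparable range
print(scaled_a)
print(scaled_b)
```

Without this step, a feature with a large numeric range such as `feature_b` would dominate any distance computed between embeddings.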

**Normalisation step**

In [2]:
from sklearn.preprocessing import StandardScaler


# Function to normalise the DataFrame
def normalise_features(features):
    # Initialize the scaler
    scaler = StandardScaler()

    # Fit and transform the features
    normalised_features = scaler.fit_transform(features)

    # Create a new DataFrame for the normalised features
    normalised = pd.DataFrame(
        normalised_features,
        columns=features.columns,
        index=features.index,
    )

    # Return the normalised DataFrame
    return normalised

In [3]:
# Normalise each of the extracted features
normalised = {model: normalise_features(feats) for model, feats in features.items()}

**Dimensionality reduction**
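
UMAP is configured below with `metric="cosine"`. As a minimal sketch of what that metric measures, the hypothetical helper `cosine_distance` computes one minus the cosine similarity of two non-zero vectors, so only their direction matters and not their length. The vectors are toy values.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


v = [1.0, 2.0, 3.0]
doubled = [2.0 * x for x in v]  # same direction, twice the length
opposite = [-x for x in v]      # exactly the opposite direction

print(cosine_distance(v, doubled))                 # close to 0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))     # orthogonal vectors
print(cosine_distance(v, opposite))                # close to 2
```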

In [4]:
import umap.umap_ as umap

# Specify the UMAP parameters
N_COMP = 2  # number of output dimensions (the later cells assume 2)
METRIC = "cosine"  # distance metric used
N_NEIGHBORS = 15
MIN_DIST = 0
RANDOM_STATE = 2204


# Function to fit UMAP and merge metadata
def process_umap(
    normalised_df,
    metadata_df,
    n_comp=N_COMP,
    metric=METRIC,
    min_dist=MIN_DIST,
    n_neighbors=N_NEIGHBORS,
    random_state=RANDOM_STATE,
):
    # Instantiate UMAP projector with the parameters passed to this function
    reducer = umap.UMAP(
        n_components=n_comp,
        metric=metric,
        min_dist=min_dist,
        n_neighbors=n_neighbors,
        random_state=random_state,
    )

    # Fit UMAP and obtain embeddings
    embedding = reducer.fit_transform(normalised_df)

    # Create DataFrame with UMAP embeddings, preserving 'vocalisation_id' as index
    umap_results = pd.DataFrame(
        embedding,
        columns=[f"UMAP{i + 1}" for i in range(n_comp)],
        index=normalised_df.index,
    )

    # Merge UMAP coordinates with metadata to obtain the
    # corresponding call type
    return umap_results.merge(metadata_df, on="vocalisation_id", how="left")

In [None]:
umaps = {model: process_umap(norm, df) for model, norm in normalised.items()}

### 3. Silhouette analysis 

To evaluate how well each feature extraction method groups the call types, we use the silhouette score. For each data point, it compares the mean distance to the other points in its own cluster (here, its call type) with the mean distance to the points in the nearest neighbouring cluster. Scores range from -1 to 1, with higher values indicating tighter, better-separated clusters (for a detailed explanation, see [silhouette-coefficient](https://scikit-learn.org/1.5/modules/clustering.html#silhouette-coefficient)).
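
To make the definition concrete, here is a small standard-library sketch of the per-sample silhouette value, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster. The function name `silhouette_values` and the 2D points are invented for illustration. Returning 0 for a point with no cluster-mates or no other cluster is a simplifying choice made in this sketch.

```python
import math


def silhouette_values(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b), Euclidean distance."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, by cluster
        dists = {}
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other_label, []).append(math.dist(p, q))
        own = dists.pop(own_label, None)
        if not own or not dists:
            # No cluster-mates or no other cluster: score 0 in this sketch
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in dists.values())       # separation
        scores.append((b - a) / max(a, b))
    return scores


# Two well-separated toy clusters in a 2D embedding space
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["rumble"] * 3 + ["trumpet"] * 3
scores = silhouette_values(points, labels)
print(sum(scores) / len(scores))  # mean silhouette, close to 1

# Labelling the first point as a trumpet makes its own score negative
mislabelled = ["trumpet"] + labels[1:]
print(silhouette_values(points, mislabelled)[0])
```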

In [6]:
from sklearn.metrics import silhouette_samples, silhouette_score


def silhouette_report(umap_df, groupby="call_type"):
    labels = umap_df[groupby]
    silhouette_avg = silhouette_score(umap_df[["UMAP1", "UMAP2"]], labels)
    silhouette_values = silhouette_samples(umap_df[["UMAP1", "UMAP2"]], labels)

    # Prepare a dictionary to store average silhouette scores for each label
    silhouette_dict = {
        "Average": silhouette_avg,
    }
    unique_labels = labels.unique()
    for label in unique_labels:
        label_indices = labels == label
        avg_silhouette_score = silhouette_values[label_indices].mean()
        silhouette_dict[label] = avg_silhouette_score

    return silhouette_dict

In [None]:
silhouette = pd.DataFrame(
    {
        model: silhouette_report(projected)
        for model, projected in umaps.items()
    }
)
silhouette

**Plot UMAP 2D for each model**

In [None]:
import matplotlib.pyplot as plt

from elephant_scripts.plotting import plot_umap_with_silhouette

# Plotting all 4 models' UMAPs in a 2x2 grid (scaling appropriately for the grid)
fig, axs = plt.subplots(
    2, 2, figsize=(20, 20), dpi=300
)  # Set a good DPI for high quality

# Plot each model's UMAP with scaling
plot_umap_with_silhouette(umaps["VGGish"], "VGGish", axs[0, 0])
plot_umap_with_silhouette(umaps["Perch"], "Perch", axs[0, 1])
plot_umap_with_silhouette(umaps["YAMNet"], "YAMNet", axs[1, 0])
plot_umap_with_silhouette(umaps["BirdNET"], "BirdNET", axs[1, 1])

# Adjust layout for a clean 2x2 grid
plt.tight_layout(pad=4.0)  # Ensure no overlap, adjust spacing

# Show the plot
plt.show()

**Add spectrogram images to UMAP 2D plot to visualise separation**

In [None]:
from elephant_scripts.plotting import scatter_spec

fig = scatter_spec(
    umaps["BirdNET"],
    column_size=8,
    matshow_kwargs={"cmap": plt.cm.magma},
    scatter_kwargs={
        "alpha": 0.75,
        "s": 40,
    },
    line_kwargs={"lw": 1, "ls": "dashed", "alpha": 0.5},
    draw_lines=True,
    figsize=(20, 20),
    range_pad=0.1,
)

### 4. Call-Type Classification

Lastly, we train a Random Forest classifier on the UMAP acoustic feature embeddings to predict the three call types on unseen test data. Within each fold of a 5-fold stratified cross-validation, we oversample the training data, optimise the Random Forest hyperparameters with a grid search, and report the performance of the best model on the held-out fold.
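
The cell below balances the classes with `RandomOverSampler`, applied to the training portion of each fold only. As a rough standard-library sketch of the idea (not the `imblearn` implementation), the hypothetical helper `random_oversample` duplicates randomly chosen samples from every class other than the largest until all classes are the same size. The toy training fold is made up.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=42):
    """Duplicate random samples of the smaller classes up to the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy training fold: 6 rumbles, 2 roars, 1 trumpet
train_X = [[float(i), float(i)] for i in range(9)]
train_y = ["rumble"] * 6 + ["roar"] * 2 + ["trumpet"]

X_res, y_res = random_oversample(train_X, train_y)
print(Counter(train_y))  # imbalanced before resampling
print(Counter(y_res))    # every class now has 6 samples
```

Resampling only the training fold keeps duplicated samples out of the held-out fold used for the classification report.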

In [10]:
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold


# Define a function to handle the model training and evaluation for each category
def train_evaluate_category(X, y, category_name):
    # Define outer cross-validation strategy
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Initialize list to store results
    best_accuracy_list = []

    # Loop through outer cross-validation folds
    for train_index, test_index in outer_cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Oversample the minority classes in the training fold only
        # (sampling_strategy="auto" resamples every class except the majority)
        ros = RandomOverSampler(sampling_strategy="auto", random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

        # Hyperparameter optimization using GridSearchCV (Random Forest parameters)
        param_grid = {
            "n_estimators": [50, 100, 200],
            "max_depth": [None, 10, 20],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 4],
            "class_weight": ["balanced", "balanced_subsample"],
        }
        rf = RandomForestClassifier(random_state=42)
        grid_search = GridSearchCV(rf, param_grid, cv=3, scoring="accuracy", verbose=1)
        grid_search.fit(X_resampled, y_resampled)

        # Store best hyperparameters and accuracy
        print(f"Best Parameters for {category_name}:", grid_search.best_params_)
        print(f"Best Accuracy for {category_name}:", grid_search.best_score_)

        # Get the best model
        rf_best = grid_search.best_estimator_

        # Evaluate the model using the outer fold test data
        y_pred_best = rf_best.predict(X_test)

        # Record the best mean inner-CV accuracy from the grid search.
        # Note: this is plain accuracy on the resampled training data,
        # not accuracy on the outer test fold.
        accuracy = grid_search.best_score_
        best_accuracy_list.append(accuracy)

        # Print classification report for the best model
        print(
            f"Random Forest {category_name} Classification Report:\n",
            classification_report(y_test, y_pred_best),
        )

    # Return average of best accuracy across folds
    return np.mean(best_accuracy_list)

In [None]:
import numpy as np 

# Initialize the results dictionary
results = {}

# Run model training and evaluation for each model/umap_df
for model_name, umap_df in umaps.items():
    X = umap_df[["UMAP1", "UMAP2"]]  # UMAP features
    y = umap_df["call_type"]  # Target labels
    mean_accuracy = train_evaluate_category(X, y, model_name)
    results[model_name] = mean_accuracy
    print(f"Mean accuracy for {model_name}: {mean_accuracy}")

# Store and present the results in a table
results_df = pd.DataFrame(
    list(results.items()), columns=["Model", "Best Macro Average Accuracy"]
)
results_df
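
The results table labels its score as a macro average, while the grid search above is scored with plain `"accuracy"`. As a hedged illustration of how the two can differ on imbalanced labels, the sketch below compares plain accuracy with the mean of per-class recall (the quantity scikit-learn exposes as `balanced_accuracy_score`). The helper names and the toy labels are invented for this example.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


def macro_average_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy fold dominated by rumbles; the "classifier" always predicts rumble
y_true = ["rumble"] * 8 + ["roar"] + ["trumpet"]
y_pred = ["rumble"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_average_recall(y_true, y_pred)
print(plain)  # high, because most samples are rumbles
print(macro)  # low, because two of three classes are never predicted
```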