
## Binary Animal Sound Classifier using EfficientAT (mn40\_as\_ext)

This model was developed to extend Project Echo’s ability to classify **animal sounds** by building a binary classifier that distinguishes animal vocalisations from non-animal environmental noise. The model is built on EfficientAT (mn40_as_ext), an audio-specialised architecture. It replaces the earlier EfficientNetV2 system in order to test whether a more complex architecture could achieve higher accuracy. It did: the updated classifier reached 98.57% accuracy, outperforming the earlier EfficientNetV2-based version.

---

### Architecture and Model Design
The EfficientAT (mn40_as_ext) model is a compact, high-performing convolutional audio classifier pre-trained on the large-scale AudioSet dataset. It leverages knowledge distillation—a technique where a smaller model is trained to replicate the behaviour of a larger, more complex model—to retain strong accuracy while remaining computationally efficient. In our setup, we froze the pre-trained base layers to preserve general audio feature representations and added a new classification head, which was fine-tuned to distinguish between animal sounds and environmental noise.
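
To make the idea of knowledge distillation concrete, the sketch below computes a generic soft-target loss in plain Python: the student is penalised by the KL divergence between the teacher's and the student's temperature-softened class probabilities. This illustrates the general technique only. The exact objective used to train EfficientAT is not described here, and the logits and the temperature of 2.0 are hypothetical values chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities, softened by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened outputs."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

# Hypothetical logits for a three-class problem (assumed values)
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # behaves much like the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

print("close student loss:", distillation_loss(close_student, teacher_logits))
print("far student loss:  ", distillation_loss(far_student, teacher_logits))
```

A student whose outputs track the teacher's receives a loss near zero, while a student that disagrees is penalised more heavily, which is what pushes the smaller model towards the larger model's behaviour.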

### EfficientAT vs. EfficientNetV2

The original system used EfficientNetV2 with 7.1 million parameters, trained on 224×224 mel spectrograms. It achieved 90% classification accuracy across 121 species but was originally designed for image tasks, making it less suited for audio-based classification. In contrast, the updated model adopts EfficientAT (mn40_as_ext)—an audio-specialised network with 120 million parameters. Despite the increased parameter count, EfficientAT is more efficient for sound classification due to its distillation-based training and audio-centric design. We kept the input resolution and augmentation setup consistent to ensure a fair comparison. With expanded species coverage from 121 to 264, the updated model achieved 98.57% accuracy and delivered more consistent predictions across all classes.

---

### Dataset Construction

To build the binary classifier, the animal recordings were converted to spectrograms and stored in the `animal_mels_224` directory, then paired with environmental noise samples sourced from the **ESC-50** and **UrbanSound8K** datasets. All files were converted into 224×224 mel spectrograms and saved as `.pt` tensors for efficient loading. Crucially, all environmental samples from ESC-50 and UrbanSound8K that contained animal-like sounds (e.g., dog barks, bird chirps, insect buzzing) were manually removed to ensure clean binary separation between the animal and non-animal classes.
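
One way to shortlist candidates for that manual review is to filter the dataset metadata by category. The sketch below uses only the standard library and an in-memory CSV with `filename` and `category` columns, as in ESC-50's metadata. The sample rows and the contents of `ANIMAL_LIKE` are hypothetical and only illustrate the idea.

```python
import csv
import io

# Hypothetical metadata excerpt in the ESC-50 style (filename, category)
SAMPLE_METADATA = """filename,category
clip_0001.wav,dog
clip_0002.wav,chirping_birds
clip_0003.wav,crackling_fire
clip_0004.wav,rain
"""

# Categories treated as animal-like in this sketch (an assumption)
ANIMAL_LIKE = {"dog", "chirping_birds", "insects"}

def split_by_category(metadata_text, excluded):
    """Return (kept, dropped) filename lists based on the category column."""
    kept, dropped = [], []
    for row in csv.DictReader(io.StringIO(metadata_text)):
        target = dropped if row["category"] in excluded else kept
        target.append(row["filename"])
    return kept, dropped

kept, dropped = split_by_category(SAMPLE_METADATA, ANIMAL_LIKE)
print("kept as non-animal:", kept)
print("flagged for removal:", dropped)
```

Listing the flagged files before deleting anything makes it easier to check by ear that the non-animal class is clean.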

---

### Augmentation Strategy

We applied spectrogram-level augmentations to the training data to improve generalisation and simulate real-world recording conditions. These included:

* **Frequency masking**
* **Time masking**
* **Gaussian noise injection** (30% probability)

These augmentations helped the model stay accurate in the presence of noise, interference, and recording variability. They were deliberately kept minimal to align with the original model setup, while the validation data remained unaugmented to ensure fair and consistent evaluation.
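
As a rough illustration of what these three augmentations do, the sketch below applies them to a tiny spectrogram stored as nested lists (rows are mel bins, columns are time frames). It is a simplified stand-in rather than the project's implementation: the helper names, the mask widths, and the 8×10 toy spectrogram are assumptions, while the 30% noise probability and 0.005 noise scale follow the values used in this document.

```python
import random

def frequency_mask(spec, max_width, rng):
    """Zero out a random band of consecutive mel bins (rows)."""
    width = rng.randint(0, max_width)
    start = rng.randint(0, len(spec) - width)
    return [
        [0.0] * len(row) if start <= i < start + width else list(row)
        for i, row in enumerate(spec)
    ]

def time_mask(spec, max_width, rng):
    """Zero out a random span of consecutive time frames (columns)."""
    width = rng.randint(0, max_width)
    start = rng.randint(0, len(spec[0]) - width)
    return [
        [0.0 if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spec
    ]

def add_gaussian_noise(spec, std, probability, rng):
    """With the given probability, add zero-mean Gaussian noise to every cell."""
    if rng.random() >= probability:
        return [list(row) for row in spec]
    return [[v + rng.gauss(0.0, std) for v in row] for row in spec]

rng = random.Random(0)                  # fixed seed so the sketch is repeatable
spec = [[1.0] * 10 for _ in range(8)]   # toy spectrogram: 8 mel bins x 10 frames

augmented = frequency_mask(spec, 3, rng)
augmented = time_mask(augmented, 4, rng)
augmented = add_gaussian_noise(augmented, 0.005, 0.3, rng)

for row in augmented:
    print(" ".join(f"{v:5.2f}" for v in row))
```

Each helper returns a new spectrogram and leaves its input untouched, so the same source tensor can be augmented differently on every epoch.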


---

### Performance and Improvements

Compared to the original EfficientNetV2-based pipeline (90% accuracy), the new EfficientAT-based model is both more accurate and better suited for audio. It maintains the same input format (224×224 mel spectrograms) while reducing training complexity and improving classification performance (98.57% accuracy). This improvement is largely due to two key factors: a more specialised architecture for audio classification, and a much richer dataset that includes over 260 animal species. With more diverse sounds and a model built for sound recognition, the system delivers more accurate and consistent results.

### Improvements

During this sprint, the EfficientAT (mn40_as_ext) model was independently developed to improve classification accuracy and generalisation. It outperformed the previous EfficientNetV2-based approach by increasing species recognition accuracy from 90% to 98.57% and extending support from 121 to 264 species. This included 224 bird species, 23 mammals, 16 amphibians, and 2 reptiles. Despite this progress, Project Echo still lacks a standardised training pipeline: it currently offers only raw, unbalanced audio buckets with no reusable model or setup guide for future teams. We recommend formally upgrading to EfficientAT or another strong audio baseline, and providing a pre-trained model with preprocessing tools and evaluation benchmarks. Expanding species coverage and standardising workflows will ensure better onboarding, reproducibility, and long-term scalability of Project Echo.






### Section 1: Preprocessing 


## Download and Integrate New Animal Sound Folders from Google Drive

This Python script automates the process of:

1. **Downloading a public folder** containing new animal sound recordings from a shared Google Drive link using the `gdown` library.
2. **Moving the downloaded folders** (each representing a species) into the `audio_root` directory used by the Echo Engine model.
3. **Avoiding duplicates** by skipping folders that already exist.
4. **Cleaning up** the temporary download directory after the files are transferred.

---


You can find the updated sound recordings at the following link:  
[https://drive.google.com/drive/folders/1VHutT83YhaUzPw6wKI_hjFeF1GRUb8hO?usp=sharing](https://drive.google.com/drive/folders/1VHutT83YhaUzPw6wKI_hjFeF1GRUb8hO?usp=sharing)

In [1]:
!pip install gdown






In [2]:

import gdown
import os
import shutil

#Google Drive folder URL
url = "https://drive.google.com/drive/folders/1MP1j_oiMGL6hWWMrPcuJYsKLSUH8gjp_"
download_dir = "downloaded_sounds"

# Download the folder from Google Drive
gdown.download_folder(
    url=url,
    output=download_dir,
    quiet=False,
    use_cookies=False
)

# Define target audio directory
audio_root = r"C:\Users\riley\Documents\Project-Echo\src\Prototypes\data\data_files"

# Move downloaded folders into the audio_root directory
for folder_name in os.listdir(download_dir):
    src = os.path.join(download_dir, folder_name)
    dst = os.path.join(audio_root, folder_name)

    if os.path.isdir(src):
        if not os.path.exists(dst):
            shutil.move(src, dst)
            print(f"Moved: {folder_name}")
        else:
            print(f"Skipped (already exists): {folder_name}")

# Clean up temporary download directory
if os.path.exists(download_dir):
    shutil.rmtree(download_dir)
    print(" Cleaned up temporary folder:", download_dir)

print("New species folders have been integrated into:", audio_root)




## Alternative: Script to Import New Animal Sound Folders from Google Drive

This script downloads a shared Google Drive folder using `gdown` and moves the species subfolders into:

```
C:\Users\riley\Documents\Project-Echo\src\Prototypes\data\data_files
```
**Note: replace `riley` in this path with your own username.**
- Skips folders that already exist  
- Cleans up temporary files after moving  
- Requires the Drive folder to be set to “Anyone with the link”

The following script converts `.wav` or `.mp3` files into 224×224 mel spectrogram tensors and saves them as `.pt` files. It works with both formats as long as you have FFmpeg installed.



In [None]:
import os
from pathlib import Path
import torchaudio
import torchaudio.transforms as T
import torch

# Settings
input_dir = r"C:\Users\riley\Documents\Project-Echo\src\Prototypes\data\data_files"
output_dir = r"C:\Users\riley\Documents\Project-Echo\src\Prototypes\data\mel_224"
sample_rate = 16000

# Mel spectrogram transformer
mel_transform = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=320,
    n_mels=224
)
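# torchaudio.transforms has no Resize, so resize with bilinear interpolation instead
def resize(spec, size=(224, 224)):
    # interpolate expects [batch, channel, height, width]: add, then drop, a batch dim
    return torch.nn.functional.interpolate(
        spec.unsqueeze(0), size=size, mode="bilinear", align_corners=False
    ).squeeze(0)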

# Process all audio files
for class_folder in Path(input_dir).iterdir():
    if class_folder.is_dir():
        output_class_dir = Path(output_dir) / class_folder.name
        output_class_dir.mkdir(parents=True, exist_ok=True)

        for audio_file in class_folder.glob("*"):
            if audio_file.suffix.lower() not in [".wav", ".mp3"]:
                continue

            try:
                # Load and resample
                waveform, sr = torchaudio.load(audio_file)
                if sr != sample_rate:
                    resampler = T.Resample(sr, sample_rate)
                    waveform = resampler(waveform)

                # Convert to mono
                if waveform.shape[0] > 1:
                    waveform = waveform.mean(dim=0, keepdim=True)

                # Generate and resize mel spectrogram
                mel_spec = mel_transform(waveform)
                mel_spec = torch.log1p(mel_spec)  # Log-scale
                mel_spec = resize(mel_spec)

                # Save .pt tensor
                out_path = output_class_dir / f"{audio_file.stem}.pt"
                torch.save(mel_spec, out_path)

                print(f"Saved: {out_path}")

            except Exception as e:
                print(f"Failed: {audio_file} — {e}")
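
The settings above determine how many time frames each clip produces before the final resize. As a small worked example, and assuming the spectrogram is computed with centre padding (one frame per hop plus one), the arithmetic looks like this. `mel_frame_count` is a hypothetical helper written for this illustration, not part of torchaudio.

```python
def mel_frame_count(n_samples, hop_length=320):
    """Frames produced by a centre-padded STFT: one per hop, plus one."""
    return 1 + n_samples // hop_length

sample_rate = 16000   # Hz, as in the script's settings
hop_length = 320      # samples between frames, as in the script's settings

hop_ms = 1000 * hop_length / sample_rate   # duration of one hop in milliseconds
print(f"Each hop covers {hop_ms:.0f} ms of audio")

for seconds in (1, 5, 10):
    frames = mel_frame_count(seconds * sample_rate, hop_length)
    print(f"{seconds:>2} s clip: {frames} frames before resizing to 224")
```

Because every spectrogram is resized to 224 columns regardless of its original frame count, short clips are stretched and long clips are compressed along the time axis.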


Official download links for the **ESC-50** and **UrbanSound8K** datasets:



---

### ESC-50: Environmental Sound Classification

* **Dataset Website:** [https://github.com/karoldvl/ESC-50](https://github.com/karoldvl/ESC-50)
* **Direct Download (ZIP):**
[https://github.com/karoldvl/ESC-50/archive/master.zip](https://github.com/karoldvl/ESC-50/archive/master.zip)
* **Audio files only:**
[https://github.com/karoldvl/ESC-50/blob/master/audio/](https://github.com/karoldvl/ESC-50/blob/master/audio/)
  (You’ll need to download from GitHub or clone the repo)

---

### UrbanSound8K

* **Dataset Website (with registration):**
[https://urbansounddataset.weebly.com/urbansound8k.html](https://urbansounddataset.weebly.com/urbansound8k.html)
* **Direct Download (tar.gz, 6.2 GB):**
[https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz](https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz)
  (You may need a Zenodo account for large downloads)




Here’s a Python script that removes audio files from ESC-50 and UrbanSound8K if their class labels indicate animal-like sounds (e.g., dog, bird, insect). It assumes you're working with the original directory structure and accompanying metadata (esc50.csv for ESC-50 and UrbanSound8K.csv for UrbanSound8K):



In [None]:
import os
import pandas as pd
from pathlib import Path
import shutil

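# unwanted animal-related categories
# Note: ESC-50 also defines 'crow', 'chirping_birds' and 'crickets'; add them to the
# set below if bird and cricket recordings should be excluded as well.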
animal_classes_esc50 = {'dog', 'rooster', 'pig', 'cow', 'frog', 'cat', 'hen', 'insects', 'sheep'}
animal_classes_us8k = {'dog_bark'}

# ESC-50 
def clean_esc50(audio_dir, csv_path):
    print("Cleaning ESC-50...")
    df = pd.read_csv(csv_path)
    animal_files = df[df['category'].isin(animal_classes_esc50)]['filename'].tolist()

    for fname in animal_files:
        file_path = os.path.join(audio_dir, fname)
        if os.path.exists(file_path):
            os.remove(file_path)
            print(f"Removed ESC-50 animal sound: {fname}")

# UrbanSound8K 
def clean_us8k(audio_root_dir, csv_path):
    print("Cleaning UrbanSound8K")
    df = pd.read_csv(csv_path)
    for _, row in df.iterrows():
        if row['class'] in animal_classes_us8k:
            fold = f"fold{row['fold']}"
            file_path = os.path.join(audio_root_dir, fold, row['slice_file_name'])
            if os.path.exists(file_path):
                os.remove(file_path)
                print(f"Removed UrbanSound8K animal sound: {file_path}")

# (update these to your machine's actual locations) 
esc_audio_dir = r"C:\path\to\ESC-50-master\audio"
esc_csv_path = r"C:\path\to\ESC-50-master\meta\esc50.csv"

us8k_audio_root = r"C:\path\to\UrbanSound8K\audio"
us8k_csv_path = r"C:\path\to\UrbanSound8K\metadata\UrbanSound8K.csv"

#Run cleaning 
clean_esc50(esc_audio_dir, esc_csv_path)
clean_us8k(us8k_audio_root, us8k_csv_path)

print("Cleaning complete.")


### EfficientAT download

In [None]:
# https://github.com/fschmid56/EfficientAT (download the model from here)

import os
os.chdir(r"C:\Users\riley\Documents\Project-Echo\src\Prototypes\EfficientAT")



### Section 2: Model Code Implementation 


In [None]:
import os
import sys
import torch
import torch.nn as nn
import torch.optim as optim
from pathlib import Path
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm import tqdm
from sklearn.metrics import classification_report, confusion_matrix
import torchaudio.transforms as T
import numpy as np
import matplotlib.pyplot as plt  # required by the accuracy and loss plots below

# Binary Classification
class BinaryAnimalDataset(Dataset):
    def __init__(self, animal_dir, esc_dir, augment=False):
        self.samples = []
        self.augment = augment

      
        for path in Path(animal_dir).glob("*.pt"):
            self.samples.append((path, 1))

     
        for path in Path(esc_dir).glob("*.pt"):
            self.samples.append((path, 0))

        # augmentations
        if self.augment:
            self.freq_mask = T.FrequencyMasking(freq_mask_param=12)
            self.time_mask = T.TimeMasking(time_mask_param=20)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        mel = torch.load(path)

        ## Ensure spectrogram has the shape [1, 224, 224]
        if mel.dim() == 2:
            mel = mel.unsqueeze(0)
        elif mel.shape[0] != 1:
            mel = mel.mean(dim=0, keepdim=True)

        if mel.shape != (1, 224, 224):
            raise ValueError(f"Bad shape {mel.shape} in file: {path}")

        #Apply augmentation only if enabled
        if self.augment:
            mel = self.freq_mask(mel)
            mel = self.time_mask(mel)
            if torch.rand(1).item() < 0.3:  # 30% chance of adding Gaussian noise
                mel += 0.005 * torch.randn_like(mel)

        return mel, label

# Load EfficientAT (mn40_as_ext) Model with Frozen Base
def load_mn40_as_ext_model(device):
    project_root = r"C:\Users\riley\Documents\Project-Echo\src\Prototypes"
    efficientat_root = os.path.join(project_root, "EfficientAT")
    if efficientat_root not in sys.path:
        sys.path.append(efficientat_root)

    from models.mn.model import get_model as get_mn
    model = get_mn(pretrained_name="mn40_as_ext", width_mult=4.0)
    model.to(device)

    for name, param in model.named_parameters():
        param.requires_grad = "classifier" in name

    return model

#Training and Evaluation Pipeline
def train_binary_animal_model():
    
   # Paths to mel spectrogram 
    animal_dir = r"C:\Users\riley\Documents\Deakin 2025\SIT378\Sprint 1\animal_mels_224"
    esc_dir = r"C:\Users\riley\Documents\Deakin 2025\SIT378\Sprint 1\other_sounds"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

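    # Load and split dataset (80% train, 20% val).
    # Subsets of a single dataset share its `augment` flag, so toggling it on one
    # subset would affect both. Two instances over the same file list keep
    # augmentation on the training subset only.
    train_base = BinaryAnimalDataset(animal_dir, esc_dir, augment=True)
    val_base = BinaryAnimalDataset(animal_dir, esc_dir, augment=False)
    val_base.samples = train_base.samples  # guarantee identical ordering

    val_size = int(0.2 * len(train_base))
    indices = torch.randperm(len(train_base)).tolist()
    train_dataset = torch.utils.data.Subset(train_base, indices[val_size:])
    val_dataset = torch.utils.data.Subset(val_base, indices[:val_size])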

     # Prepare data loaders
    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=8)

      # Load model and replace classifier
    model = load_mn40_as_ext_model(device)
    model.classifier = nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(model.classifier[2].in_features, 256),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(256, 2)
    )
    model.to(device)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.classifier.parameters(), lr=1e-4)

    # store results 
    train_losses, train_accuracies = [], []
    val_losses, val_accuracies = [], []


     # Training loop
    for epoch in range(1, 6):
        model.train()
        running_loss, correct, total = 0.0, 0, 0
        for x, y in tqdm(train_loader, desc=f"Epoch {epoch}"):
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            outputs = model(x)
            if isinstance(outputs, tuple):
                outputs = outputs[0]
            loss = criterion(outputs, y)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * x.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)

         # Computing the training metrics
        train_loss = running_loss / total
        train_acc = 100 * correct / total
        train_losses.append(train_loss)
        train_accuracies.append(train_acc)

         # Validation
        model.eval()
        val_loss_total, correct, total = 0.0, 0, 0
        all_preds, all_labels = [], []
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                outputs = model(x)
                if isinstance(outputs, tuple):
                    outputs = outputs[0]
                loss = criterion(outputs, y)
                val_loss_total += loss.item() * x.size(0)

                preds = outputs.argmax(dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(y.cpu().numpy())
                correct += (preds == y).sum().item()
                total += y.size(0)

         # the validation metrics
        val_loss = val_loss_total / total
        val_acc = 100 * correct / total
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)

          # Displays the metrics
        print(f"Epoch {epoch}: Train Loss = {train_loss:.4f}, Train Acc = {train_acc:.2f}%")
        print(f"Epoch {epoch}: Val Loss = {val_loss:.4f}, Val Acc = {val_acc:.2f}%")

         # Final evaluation
    print(f"\nFinal Train Accuracy: {train_accuracies[-1]:.2f}%")
    print(f"Final Val Accuracy: {val_accuracies[-1]:.2f}%")

    print("\nFinal Classification Report")
    print(classification_report(
        all_labels,
        all_preds,
        target_names=["non-animal", "animal"],
        zero_division=0
    ))

    print("\nConfusion Matrix")
    print(confusion_matrix(all_labels, all_preds, labels=[0, 1]))

     # Plot accuracy
    epochs = list(range(1, 6))
    plt.plot(epochs, train_accuracies, label="Train Acc")
    plt.plot(epochs, val_accuracies, label="Val Acc")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy (%)")
    plt.title("Training vs Validation Accuracy")
    plt.legend()
    plt.show()

    # Plot loss
    plt.plot(epochs, train_losses, label="Train Loss")
    plt.plot(epochs, val_losses, label="Val Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training vs Validation Loss")
    plt.legend()
    plt.show()

# Run Training
train_binary_animal_model()
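
To help read the confusion matrix printed at the end of training, the sketch below derives accuracy, precision, and recall for the animal class from a 2×2 matrix in scikit-learn's layout for `labels=[0, 1]` (rows are true labels, columns are predictions). The counts are hypothetical and are not results from this project, and `binary_metrics` is a helper name introduced for this illustration.

```python
def binary_metrics(confusion):
    """Accuracy, precision and recall for the positive (animal) class.

    `confusion` is [[TN, FP], [FN, TP]]: rows are true labels and
    columns are predicted labels, with class 0 = non-animal, 1 = animal.
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for illustration only
example = [[90, 10],   # true non-animal: 90 correct, 10 predicted as animal
           [5, 95]]    # true animal: 5 missed, 95 correct

for name, value in binary_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

Looking at precision and recall alongside accuracy matters here because the animal and non-animal classes may not be the same size, in which case accuracy alone can hide a weak minority class.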