author: P.P

# Project Echo Engine - Transfer Learning Report

## Introduction

Project Echo is a pioneering initiative launched in Trimester 3 of 2022 by Stephan Kokkas, Andrew Kudilczak, and Daniel Gladman, aimed at leveraging cutting-edge bioacoustics technology to aid global conservation efforts. This project specifically focuses on developing advanced audio classification tools to detect, track, and monitor endangered species and their predators. By employing a network of audio sensors strategically placed in natural habitats, Project Echo captures sounds of passing animals which are then analyzed using AI-driven models to classify species and log crucial data.

The core vision of Project Echo is to provide a comprehensive suite of bioacoustic tools through the development of AI and machine learning solutions. These tools are designed to assess animal vocalizations within their natural habitats efficiently and non-destructively, aiding in environmental surveys and enabling conservationists to make informed decisions to protect threatened animal populations.

Project Echo is structured around three foundational values: Voice, Data, and Environment. Reflecting the project’s name, 'Echo', and its emphasis on sound, each team member is encouraged to contribute actively and move the project forward with innovative, data-driven solutions that benefit the environment.

The primary objectives for the 2023 Trimester 2 team included:

1. Researching and testing various audio classification methods to enhance the prototype.
2. Constructing prototypes for each primary system component.
3. Demonstrating an end-to-end system operation.
4. Exploring cloud implementation strategies for the Echo system.

These objectives guide our ongoing efforts to refine and expand the capabilities of Project Echo, ensuring it remains at the forefront of technology for wildlife conservation.

## Why Transfer Learning for Project Echo?

Transfer learning is a machine learning technique in which a model trained on one problem is repurposed for a related problem. This approach is particularly beneficial for **Project Echo**, where the challenge lies in accurately classifying diverse animal sounds from complex rainforest environments. By leveraging pre-trained models such as **EfficientNetV2**, Project Echo can significantly reduce the data collection and computational resources typically required to train deep learning models from scratch.

These models have already learned rich, transferable features from large and diverse datasets, which lets them adapt quickly to the specific task of recognizing animal vocalizations. This accelerates development and can improve the accuracy and robustness of the classification system, making it a useful tool for conservationists monitoring wildlife populations. Transfer learning therefore aligns with the project’s goals of developing efficient, scalable, and effective bioacoustics classification tools that help researchers make informed decisions based on reliable data analysis.

## Audio Classification with Mel Spectrograms and Transfer Learning

Audio classification, crucial for applications ranging from wildlife monitoring to urban sound analysis, involves categorizing audio clips into predefined classes. This process is significantly enhanced by sophisticated machine learning models and techniques such as Mel Spectrograms and Transfer Learning.

### Understanding Mel Spectrograms
Mel Spectrograms convert audio signals into a visual, time-frequency representation, bridging the gap between audio and image data. This conversion is essential for applying advanced machine learning algorithms, particularly those designed for image data.

#### Key Features of Mel Spectrograms:
- **Spectrogram**: Visual representation of the spectrum of frequencies in a sound as they vary with time.
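- **Mel Scale**: A perceptual scale of pitch on which steps that listeners judge to be equally far apart are equally spaced. It approximates the human auditory system's response, giving finer resolution at low frequencies than at high frequencies.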
- **Application**: Mel Spectrograms allow the use of Convolutional Neural Networks (CNNs), which are adept at capturing spatial hierarchies in image data, to process audio signals effectively.
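
The conversion described above can be sketched in plain NumPy. This is an illustrative sketch only, not the pipeline used in the Project Echo notebooks: the function names, the HTK-style mel formula, and the parameter values (`n_fft=1024`, `hop=256`, `n_mels=64`) are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (one common convention; an assumption here)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lower, centre, upper = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lower) / (centre - lower)
        falling = (upper - fft_freqs) / (upper - centre)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(signal, sr, n_fft=1024, hop=256, n_mels=64):
    # Short-time Fourier transform with a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # (n_frames, n_fft//2 + 1)
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    # Log-compress to decibels; result is (n_mels, n_frames), i.e. image-like
    return 10.0 * np.log10(mel_power.T + 1e-10)
```

The returned array has one row per mel band and one column per time frame, so it can be treated as a single-channel image. In practice a library routine such as `librosa.feature.melspectrogram` would normally be used instead of hand-written code.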

### The Role of Transfer Learning
Transfer Learning involves adapting a model trained on one task to a new, related task. For audio classification, this means starting from a pre-trained model rather than random weights, which improves performance when labelled data is limited.

#### Advantages of Transfer Learning:
- **Efficiency**: Reduces the need for large datasets and extensive computational resources.
- **Flexibility**: Pre-trained models on large image datasets can be adapted to new tasks using audio data represented as Mel Spectrograms.
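
A minimal sketch of this idea in Keras is shown below, assuming an ImageNet-pretrained InceptionResNetV2 backbone that is frozen while a new classification head is trained. The function name `build_transfer_model`, the input size, the dropout rate, and the learning rate are illustrative assumptions, not values taken from the Project Echo notebooks. Single-channel Mel spectrograms would also need to be resized, repeated across three channels, and scaled to the range the backbone expects before being fed in.

```python
import tensorflow as tf

def build_transfer_model(num_classes, input_shape=(299, 299, 3), weights="imagenet"):
    # Pre-trained backbone without its ImageNet classification head
    backbone = tf.keras.applications.InceptionResNetV2(
        include_top=False,
        weights=weights,
        input_shape=input_shape,
        pooling="avg",
    )
    backbone.trainable = False  # freeze the pre-trained features

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs, training=False)   # keep batch-norm layers in inference mode
    x = tf.keras.layers.Dropout(0.2)(x)    # assumed regularisation strength
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),  # assumed learning rate
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Only the final dense layer is trained at first. A common follow-up is to unfreeze some of the backbone and continue training with a much smaller learning rate (fine-tuning).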

### Future Enhancements and Techniques

#### Augmentation Strategies
To improve model robustness and address challenges like overfitting:
- **Audio Augmentations**: Implement dynamic effects such as Time Stretch and Pitch Shift during training. Overlay environmental noises (rain, wind) to mimic real-world conditions.
- **Image Augmentations**: After converting audio to Mel Spectrograms, apply SpecAugment-style time and frequency masking, alongside general image augmentations such as random rotations, to enhance training.
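
Two of these augmentations can be sketched in NumPy: overlaying background noise at a chosen signal-to-noise ratio, and SpecAugment-style masking of a spectrogram. The function names, mask widths, and the choice to fill masked regions with the spectrogram mean are assumptions made for illustration. Time stretch and pitch shift are not shown; they are usually taken from an audio library (for example `librosa.effects.time_stretch` and `librosa.effects.pitch_shift`).

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Overlay noise (e.g. rain or wind) on a signal at the requested SNR in dB."""
    noise = np.resize(noise, signal.shape)  # repeat or trim noise to match length
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that signal_power / scaled_noise_power == 10 ** (snr_db / 10)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

def spec_mask(spec, rng, max_freq_width=8, max_time_width=16):
    """Mask one random frequency band and one random time span of a spectrogram."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()  # assumed fill value; zero or the minimum are also common

    f = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill  # frequency mask

    t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill  # time mask
    return out
```

Applying these on the fly during training, with a fresh random draw per sample, means the model rarely sees exactly the same input twice.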

#### Model and Dataset Optimization
- **Dataset Rebalancing**: Adjust the dataset to handle imbalances by selectively applying augmentations and expanding sample diversity.
- **Architectural Innovations**: Explore different models like InceptionResNetV2 and adaptations of VGGish to optimize classification accuracy.
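
As a rough sketch of the rebalancing idea, the helper below works out how many augmented samples each class would need to match the largest class. A second helper computes inverse-frequency class weights, which is a different technique from the augmentation-based rebalancing described above and is included only as a possible alternative. Both function names are hypothetical.

```python
from collections import Counter

def augmentation_plan(labels):
    """Number of extra (augmented) samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}
```

The plan tells the training pipeline where to concentrate augmentation, so that rare species receive more synthetic variation than common ones.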

By integrating Mel Spectrograms with Transfer Learning, and continuously refining augmentation techniques and model architectures, we aim to enhance the precision and efficiency of audio classification systems. These advancements ensure our methods stay at the forefront of technology, ready to tackle real-world challenges effectively.


The following are the transfer learning models I found in the prototypes/engine directory. Not all of them are currently in use, but each has been explored at some point in its own notebook. Tanmay's optimised InceptionResNetV2 version is currently the MASTER copy.

**YAMNet (Yet Another Mobile Net)** is a deep learning model developed by Google for audio event classification. It is based on the MobileNet architecture, which is designed for efficient computation on mobile and embedded devices. YAMNet is pre-trained on Google's AudioSet corpus and can be fine-tuned or used as a feature extractor for various audio classification tasks. The model takes an audio waveform as input and outputs scores for 521 audio event classes.

Other notable transfer learning models explored in Project Echo include:

- **SoundNet**: SoundNet is a deep learning model for audio recognition that learns audio representations by leveraging large amounts of unlabeled video data. It is trained to predict the visual context associated with a given audio segment, enabling it to learn meaningful audio representations. SoundNet can be used as a feature extractor for audio-related tasks.

- **EfficientNet**: EfficientNet is a family of convolutional neural network (CNN) models designed for image classification. The models scale network depth, width, and input resolution together (compound scaling), which gave state-of-the-art ImageNet accuracy at release for comparatively low computational cost. EfficientNets can be used as pre-trained models for transfer learning in various image-related tasks, including classification of Mel Spectrogram images.

- **SWE_OpenL3**: SWE_OpenL3 is an audio embedding model that learns general-purpose audio representations. It is trained on a large dataset of audio-visual correspondences and can capture high-level semantic information from audio. SWE_OpenL3 can be used as a feature extractor for audio-related tasks.

- **VGGish**: VGGish is an audio feature extraction model based on the VGG architecture. It is trained on a large dataset of YouTube videos and learns to extract compact audio representations. VGGish can be used as a feature extractor for audio-related tasks.

- **GoogLeNet**: GoogLeNet, also known as Inception v1, is a CNN architecture developed by Google for image classification. It introduced Inception modules, which apply convolutions of several sizes in parallel to allow efficient computation and improved performance. GoogLeNet can be used as a pre-trained model for transfer learning in image-related tasks.

- **Xception**: Xception is a CNN architecture inspired by Inception, but it replaces the Inception modules with depthwise separable convolutions. Its authors report slightly better ImageNet accuracy than Inception V3 with a similar number of parameters. Xception can be used as a pre-trained model for transfer learning in image-related tasks.

These models can be leveraged for feature extraction or fine-tuning, allowing for faster convergence, improved performance, and the ability to work with smaller datasets.

### InceptionResNetV2 Model Breakdown (Current MASTER model)

#### Objective
The InceptionResNetV2 model in Project Echo is optimized to enhance the classification accuracy of audio samples from various animal species. This model combines the benefits of Inception architectures with residual connections to facilitate faster and more effective training dynamics.

#### Key Techniques
- **Audio Pipeline Adaptation**: Utilizes an advanced audio processing pipeline to convert raw audio into Mel spectrogram representations, which serve as the input for the model.
- **Compatibility and Performance Enhancements**: Includes updates for compatibility with Lambda Stack for containerization and cross-platform development, upgrades TensorFlow versions, and addresses parallel pipeline execution issues to boost performance.

#### Results and Discussion
The model shows robust classification capability across various animal species, achieving high predictive probabilities in most cases. For instance:
- **High Confidence**: Predictions such as the Owlet-nightjar and the Sambar deer show probabilities of 99.92% and 99.88%, respectively, indicating that the model is highly confident when recognizing acoustically distinct animal sounds.
- **Challenges**: Some misclassifications were observed, such as the Eastern yellow robin being incorrectly identified as a Rufous whistler with a probability of 71.48%. This points to potential areas for further refinement in distinguishing closely related or acoustically similar species.
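
One way to act on observations like these is to inspect the top few predicted classes and flag predictions whose highest probability falls below a threshold. The sketch below is an assumption-laden illustration: the function names and the 0.9 threshold are hypothetical, and a suitable threshold would have to be chosen from validation data.

```python
import numpy as np

def top_predictions(probs, class_names, k=3):
    """Return the k most probable (class name, probability) pairs, highest first."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1][:k]
    return [(class_names[i], float(probs[i])) for i in order]

def needs_review(probs, threshold=0.9):
    """Flag a prediction for manual review when the top probability is below threshold."""
    return float(np.max(probs)) < threshold
```

Under this assumed threshold, a prediction made at around 71% probability would be flagged for review, while predictions above 99% would pass through.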

InceptionResNetV2's implementation within Project Echo showcases a significant advancement in the automated classification of bioacoustic data. While achieving high accuracy in many cases, ongoing adjustments and model training enhancements are expected to further improve its effectiveness in real-world conservation applications.


The Project Echo engine team's current sprint focuses on transfer learning with newer models, with all results stored in a Google Sheets file for comparison. Once that work is complete, the results should be added to this notebook, along with visualisations and data metrics. Until then, this report serves as the baseline.