Advancing Biodiversity Conservation: Comparative Evaluation of Machine Learning Models for Species Classification

Hello! This repository collects the work I did during my internship at the Vector Institute under the supervision of Graham Taylor. It contains code, data, and documentation for our project on evaluating machine learning models for species classification, with a focus on biodiversity conservation. The project compares several models to assess how well they classify species, particularly rare and underrepresented ones, and proposes new evaluation methodologies aimed at improving classification accuracy and robustness. To learn more, I strongly suggest opening the PowerPoint presentation that shares this project's name!

Introduction

In this project, we explore machine learning (ML) approaches for species classification to support biodiversity conservation. Accurate species classification is essential for monitoring biodiversity, preserving rare species, and supporting conservation efforts worldwide. By comparing models such as Hiera and BioCLIP, we aim to improve classification accuracy and evaluate model effectiveness, particularly for rare or difficult-to-distinguish species. This project also proposes tailored metrics to assess model performance more effectively.

Dataset Description

We use three key datasets in this project:

Tree of Life 10M+: A comprehensive dataset representing a broad taxonomy of millions of species, with rich hierarchical structure.
iNaturalist (iNat): User-generated data collected through citizen science, with species observations from across the world.
Rare Species: A subset focusing on rare and underrepresented species that are challenging to classify due to data scarcity.

Dataset Details

Intersections and Similarities: Tree of Life and iNat cover similar species but differ in quality (curated vs. user-generated). The Rare Species subset specifically targets endangered or uncommon species to enhance model robustness.
Data Pruning: Each dataset is cleaned and pruned to remove duplicates, low-quality images, and outliers.
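
As an illustration of the duplicate-removal step, here is a minimal sketch that drops byte-identical images by hashing their contents. It is not the pruning code used in this project: the function name `dedupe_by_digest` and the (name, bytes) input format are assumptions made for the example, and exact hashing will not catch near-duplicates such as resized or re-encoded copies.

```python
import hashlib


def dedupe_by_digest(named_blobs):
    """Split (name, bytes) pairs into kept and dropped names.

    The first item seen for each SHA-256 digest is kept; any later item
    with the same digest is treated as an exact duplicate and dropped.
    """
    seen = set()
    kept, dropped = [], []
    for name, blob in named_blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            dropped.append(name)
        else:
            seen.add(digest)
            kept.append(name)
    return kept, dropped


# Toy byte strings standing in for image files (invented for illustration).
images = [
    ("a.jpg", b"\xff\xd8 image one"),
    ("b.jpg", b"\xff\xd8 image two"),
    ("a_copy.jpg", b"\xff\xd8 image one"),
]
kept, dropped = dedupe_by_digest(images)
print(kept, dropped)
```

In practice the pairs could be read from disk, for example with `pathlib.Path.read_bytes()` over a directory of images.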

Model Architecture

Hierarchical Vision Transformers (Hiera)

Hiera is a hierarchical Vision Transformer (ViT): it builds image features at multiple scales rather than at a single resolution, a design that aligns well with taxonomic classification in species identification. By leveraging Hiera, we aim to capture the hierarchical relationships in species data, allowing for more nuanced classification, especially across complex taxonomy levels.

BioCLIP

BioCLIP is a variant of the CLIP model, adapted for biological and ecological data. CLIP is a powerful multi-modal model that aligns images and textual descriptions into a shared embedding space. BioCLIP fine-tunes this approach for species classification by leveraging scientific, common, and taxonomic text styles, improving its performance in distinguishing similar or visually ambiguous species.
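
To make the shared embedding space concrete, the sketch below shows how a CLIP-style model can classify an image without a dedicated classifier head: each candidate species name is embedded as text, and the image is assigned to the label whose text embedding is closest by cosine similarity. The 3-dimensional vectors are toy values invented for illustration; real embeddings would come from the model's image and text encoders, and the function names are assumptions of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return (best_label, scores) where scores maps label -> similarity."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (invented): one text vector per candidate species.
text_embeddings = {
    "Danaus plexippus": [0.9, 0.1, 0.0],
    "Limenitis archippus": [0.7, 0.6, 0.1],
    "Apis mellifera": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

Visually similar species tend to receive close scores under this scheme, which is one reason the choice of text style for the labels matters.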

Additional Models

Other models, such as standard ViTs, are also tested to provide a comparative baseline for evaluating Hiera and BioCLIP’s performance.

Methods

The primary methods and preprocessing techniques employed include:

Data Augmentation: Applying RandAugment, Mixup, CutMix, and other augmentation techniques to enhance model robustness.
Layer-wise Learning Rate Decay: Using smaller learning rates for earlier layers and larger ones for later layers, to stabilize fine-tuning and improve generalization.
Text Style Variations for BioCLIP: Experimenting with different text styles (Common, Scientific, Taxonomic, and Mixed) to determine which performs best for text-image embedding.
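
Two of the techniques above can be sketched in a few lines. The first function computes per-layer learning rates under layer-wise decay, where the last layer keeps the base rate and each earlier layer is scaled down by a constant factor. The second builds a caption for one species in each text style. This is a sketch under stated assumptions, not the project's training code: the function names, the `"a photo of ..."` template, the taxon dictionary layout, and the choice to implement "mixed" as a random pick among the three styles are all hypothetical.

```python
import random


def layerwise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per layer, index 0 being the earliest layer.

    The last layer uses base_lr; each step toward the input multiplies
    the rate by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


STYLES = ("common", "scientific", "taxonomic")


def format_caption(taxon, style, rng=None):
    """Build a caption for `taxon` in the given text style (hypothetical template)."""
    if style == "mixed":
        # Assumption: "mixed" samples one of the three base styles at random.
        rng = rng or random.Random()
        style = rng.choice(STYLES)
    if style == "common":
        name = taxon["common"]
    elif style == "scientific":
        name = taxon["scientific"]
    elif style == "taxonomic":
        name = " ".join(taxon["taxonomy"])
    else:
        raise ValueError(f"unknown style: {style}")
    return f"a photo of {name}"


monarch = {
    "common": "monarch butterfly",
    "scientific": "Danaus plexippus",
    "taxonomy": ["Animalia", "Arthropoda", "Insecta", "Lepidoptera",
                 "Nymphalidae", "Danaus", "plexippus"],
}

rates = layerwise_learning_rates(base_lr=1e-3, num_layers=3, decay=0.5)
captions = {style: format_caption(monarch, style) for style in STYLES}
print(rates)
print(captions)
```

In a real fine-tuning run, the per-layer rates would typically be passed to the optimizer as separate parameter groups, one per layer.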

Evaluation Metrics

Two key metrics are introduced to evaluate model performance:

Precision-Recall Based Metric: Focused on handling the imbalance in species data, especially for rare species.
Embedding-Based Similarity Metric: Compares the closeness of species embeddings to better capture the model's understanding of species similarity.
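
The exact definitions of these metrics are not given here, so the sketch below shows only a common starting point for the imbalance problem: per-class precision and recall, averaged with equal weight per class (macro-averaging) so that rare species count as much as common ones. It is an assumption-laden illustration, not the metric introduced in this project, and the function names and toy labels are invented.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} over all labels seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    stats = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        stats[label] = (precision, recall)
    return stats


def macro_average(stats):
    """Average precision and recall with equal weight per class."""
    n = len(stats)
    precision = sum(p for p, _ in stats.values()) / n
    recall = sum(r for _, r in stats.values()) / n
    return precision, recall


# Toy example: a model that always predicts the common class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10

stats = per_class_precision_recall(y_true, y_pred)
macro_precision, macro_recall = macro_average(stats)
print(stats, macro_precision, macro_recall)
```

Plain accuracy on this toy example is 80%, yet the rare class is never recognized; the macro-averaged scores expose that failure.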

Results

Our results highlight the strengths and weaknesses of each model:

Hiera outperforms standard ViTs in handling hierarchical data structures but requires significant computational resources.
BioCLIP demonstrates strong performance in species classification by leveraging text descriptions, particularly with mixed text styles (scientific, common, and taxonomic).
Comparison of BioCLIP vs Hiera: BioCLIP was more effective when detailed text annotations were available, while Hiera excelled in purely visual classification tasks with a hierarchical structure.

Lessons Learned

Key insights from the project:

Data Imbalance: Handling rare species remains challenging due to limited data availability.
Complex Model Training: Advanced models like Hiera and BioCLIP require large datasets and computational power.
Evaluation Limitations: Standard metrics were insufficient, necessitating custom metrics tailored to biodiversity tasks.

Future Work

This project opens up potential research directions:

Refined RAG Methodologies: Further enhance retrieval-augmented generation (RAG) approaches for improved species classification.
iNat Community Contribution: Investigate ways to incorporate iNaturalist community contributions for model refinement.
Hybrid Models: Combine Hiera’s hierarchical capabilities with BioCLIP’s embedding-based approach for more robust results.

Contributions

Thank you to Graham Taylor for supervising my work at the Vector Institute, and to Nate Lesperance for being a great mentor and advisor. A big shout-out to the University of Guelph's Machine Learning Reading Group. Lastly, a huge thanks to the iNaturalist community, without whom this project could never have happened.

