This project provides a complete, end-to-end pipeline for building a custom insect classification system. The framework is not tied to a particular taxon: you can train a detection and classification model for any set of insect species by providing a list of their scientific names.
Using the Bplusplus library, this pipeline automates the entire machine learning workflow, from data collection to video inference.
- Automated Data Collection: Downloads hundreds of images for any species from the GBIF database.
- Intelligent Data Preparation: Uses a pre-trained model to automatically find, crop, and resize insects from raw images, ensuring high-quality training data.
- Hierarchical Classification: Trains a model to identify insects at three taxonomic levels: family, genus, and species.
- Video Inference & Tracking: Processes video files to detect, classify, and track individual insects over time, providing aggregated predictions.
The process is broken down into six main steps, all detailed in the full_pipeline.ipynb notebook:
- Collect Data: Select your target species and fetch raw insect images from the web.
- Prepare Data: Filter, clean, and prepare images for training.
- Train Model: Train the hierarchical classification model.
- Download Weights: Fetch pre-trained weights for the detection model.
- Test Model: Evaluate the performance of the trained model.
- Run Inference: Run the full pipeline on a video file for real-world application.
- Python 3.10+
- Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install the required packages:
  pip install bplusplus
The pipeline can be run step-by-step using the functions from the bplusplus library. While the full_pipeline.ipynb notebook provides a complete, executable workflow, the core functions are described below.
Download images for your target species from the GBIF database. You'll need to provide a list of scientific names.
import bplusplus
from pathlib import Path
# Define species and directories
names = ["Vespa crabro", "Vespula vulgaris", "Dolichovespula media"]
GBIF_DATA_DIR = Path("./GBIF_data")
# Define search parameters
search = {"scientificName": names}
# Run collection
bplusplus.collect(
    group_by_key=bplusplus.Group.scientificName,
    search_parameters=search,
    images_per_group=200,  # Recommended to download more than needed
    output_directory=GBIF_DATA_DIR,
    num_threads=5
)
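Before moving on, it can help to check how many images were actually downloaded per species. The following is a minimal sketch, not part of bplusplus: the helper `count_images` is hypothetical, and it assumes that the output directory contains one subfolder per group, which may not match the library's real layout.

```python
from pathlib import Path

# File extensions treated as images (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def count_images(root):
    """Count image files in each immediate subdirectory of root."""
    root = Path(root)
    if not root.is_dir():
        return {}
    counts = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[sub.name] = sum(
            1
            for f in sub.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts

# Prints nothing if ./GBIF_data does not exist yet.
for species, n in count_images("./GBIF_data").items():
    print(f"{species}: {n} images")
```

A species with far fewer images than the others is worth noticing before the prepare step, since cropping and filtering will reduce the count further.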
Process the raw images to extract, crop, and resize insects. This step uses a pre-trained model to ensure only high-quality images are used for training.
PREPARED_DATA_DIR = Path("./prepared_data")
bplusplus.prepare(
    input_directory=GBIF_DATA_DIR,
    output_directory=PREPARED_DATA_DIR,
    img_size=640  # Target image size for training
)
Train the hierarchical classification model on your prepared data. The model learns to identify family, genus, and species.
TRAINED_MODEL_DIR = Path("./trained_model")
bplusplus.train(
    batch_size=4,
    epochs=30,
    patience=3,
    img_size=640,
    data_dir=PREPARED_DATA_DIR,
    output_dir=TRAINED_MODEL_DIR,
    species_list=names
)
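To make the three taxonomic levels concrete, here is a small sketch of the label hierarchy for the example species. It relies only on the convention that the first word of a binomial scientific name is the genus. The family mapping is written by hand for illustration, and the helper `genus_of` is hypothetical; how bplusplus resolves family and genus internally is not shown here.

```python
def genus_of(scientific_name):
    """Return the genus, i.e. the first word of a binomial name."""
    return scientific_name.split()[0]

names = ["Vespa crabro", "Vespula vulgaris", "Dolichovespula media"]

# Hand-written for this example: all three genera belong to Vespidae.
genus_to_family = {
    "Vespa": "Vespidae",
    "Vespula": "Vespidae",
    "Dolichovespula": "Vespidae",
}

# One (family, genus, species) triple per target species.
hierarchy = [
    (genus_to_family[genus_of(name)], genus_of(name), name)
    for name in names
]

for family, genus, species in hierarchy:
    print(f"{family} > {genus} > {species}")
```

With this species list the family level carries no discriminating information, because every class shares one family. Mixing species from several families would make that level more informative.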
The inference pipeline uses a separate, pre-trained YOLO model for initial insect detection. You need to download its weights manually.
You can download the weights file from this link. Place it in the trained_model directory and ensure it is named yolo_weights.pt.
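Because the weights are placed by hand, a quick check that both model files are where the inference step expects them can save a failed run. This is a minimal sketch using only the standard library; the helper `missing_files` is hypothetical and not part of bplusplus.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]

TRAINED_MODEL_DIR = Path("./trained_model")
required = [
    TRAINED_MODEL_DIR / "best_multitask.pt",  # written by the training step
    TRAINED_MODEL_DIR / "yolo_weights.pt",    # downloaded manually
]

missing = missing_files(required)
if missing:
    print("Missing before inference:", ", ".join(missing))
else:
    print("All model files found.")
```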
Process a video file to detect, classify, and track insects. The final output is an annotated video and a CSV file with aggregated results for each tracked insect.
VIDEO_INPUT_PATH = Path("my_video.mp4")
VIDEO_OUTPUT_PATH = Path("my_video_annotated.mp4")
HIERARCHICAL_MODEL_PATH = TRAINED_MODEL_DIR / "best_multitask.pt"
YOLO_WEIGHTS_PATH = TRAINED_MODEL_DIR / "yolo_weights.pt"
bplusplus.inference(
    species_list=names,
    yolo_model_path=YOLO_WEIGHTS_PATH,
    hierarchical_model_path=HIERARCHICAL_MODEL_PATH,
    confidence_threshold=0.35,
    video_path=VIDEO_INPUT_PATH,
    output_path=VIDEO_OUTPUT_PATH,
    tracker_max_frames=60,
    fps=15  # Optional: set processing FPS
)
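One way to picture "aggregated results for each tracked insect" is a majority vote over the per-frame predictions that share a track ID. The sketch below illustrates that idea only. It is not necessarily how bplusplus aggregates, and both the `(track_id, label)` input format and the function `aggregate_tracks` are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_tracks(frame_predictions):
    """Map each track ID to (most frequent label, share of frames with that label)."""
    votes = defaultdict(Counter)
    for track_id, label in frame_predictions:
        votes[track_id][label] += 1
    result = {}
    for track_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        result[track_id] = (label, count / sum(counter.values()))
    return result

# Hypothetical per-frame output: track 1 appears in three frames, track 2 in one.
predictions = [
    (1, "Vespa crabro"),
    (1, "Vespa crabro"),
    (1, "Vespula vulgaris"),
    (2, "Dolichovespula media"),
]

summary = aggregate_tracks(predictions)
for track_id, (label, share) in summary.items():
    print(f"track {track_id}: {label} ({share:.0%} of frames)")
```

Voting across frames smooths out single-frame misclassifications, which is the main benefit of tracking an individual rather than classifying each frame in isolation.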
To train the model on your own set of insect species, you only need to change the names list in Step 1. The pipeline will automatically handle the rest.
# To use your own species, change the names in this list
names = [
    "Vespa crabro",
    "Vespula vulgaris",
    "Dolichovespula media",
    # Add your species here
]
To train a model that can recognize an "unknown" class for insects that don't belong to your target species, add "unknown" to your species_list. You must also provide a corresponding unknown folder containing images of various other insects in your data directories (e.g., prepared_data/train/unknown).
# Example with an unknown class
names_with_unknown = [
    "Vespa crabro",
    "Vespula vulgaris",
    "unknown"
]
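The unknown folder has to be filled by hand. As a rough sketch, the helper below copies a capped, shuffled sample of images from a folder of other insects into the unknown class folder. The function `populate_unknown` and the source folder `./other_insects` are hypothetical, and the sketch assumes the images are already cropped in the same way as the prepared data.

```python
import random
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def populate_unknown(source_dir, dest_dir, limit=200, seed=0):
    """Copy up to `limit` randomly chosen images into dest_dir; return the count."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    images = sorted(
        p for p in source_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # A fixed seed keeps the sample reproducible between runs.
    random.Random(seed).shuffle(images)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for i, src in enumerate(images[:limit]):
        # Rename on copy so files from different subfolders cannot collide.
        shutil.copy2(src, dest_dir / f"unknown_{i:05d}{src.suffix.lower()}")
        copied += 1
    return copied

source = Path("./other_insects")  # hypothetical folder of non-target insects
if source.is_dir():
    n = populate_unknown(source, Path("./prepared_data/train/unknown"))
    print(f"Copied {n} images into the unknown class")
```

Capping the sample near the per-species image count keeps the unknown class from dominating training. If your data directories include further splits, each one would need its own unknown folder.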
The pipeline will create the following directories to store artifacts:
- GBIF_data/: Stores the raw images downloaded from GBIF.
- prepared_data/: Contains the cleaned, cropped, and resized images ready for training.
- trained_model/: Saves the trained model weights (best_multitask.pt) and pre-trained detection weights.
All information in this GitHub repository is available under the MIT license, as long as credit is given to the authors.
Venverloo, T., Duarte, F., B++: Towards Real-Time Monitoring of Insect Species. MIT Senseable City Laboratory, AMS Institute.