feat: add species classifier training pipeline #69

mihow · 2026-02-11T02:50:40Z

Summary

- Adds scripts/build_species_list.py, a bridge script that reads verbatimScientificName from a DwC-A file (a column that load_dwca_data() does not select) and joins it onto the clean-dataset annotations CSV. It also prints DwC-A summary stats and builds category_map.json.
- Adds scripts/train_species_classifier.sh, a single bash script that orchestrates all 7 pipeline steps, from DwC-A to a trained ConvNeXt-Tiny model, using uv run. Each step checks for existing outputs so that interrupted runs can be resumed.
- No modifications to existing src/ code. The bridge script works around load_dwca_data() omitting verbatimScientificName from its column selection.
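
For reviewers who want a picture of what the bridge step does, here is a minimal sketch of the join and the category map, not the code in scripts/build_species_list.py. It assumes the archive's core table is a tab-separated occurrence.txt and that both tables share a gbifID key; the file name, the key column, the function names, and the choice to drop unmatched rows are all illustrative assumptions.

```python
import csv
import io
import zipfile


def read_verbatim_names(dwca_path, core_file="occurrence.txt", id_col="gbifID"):
    """Return {record ID: verbatimScientificName} from the archive's core table.

    core_file and id_col are assumptions; a real archive declares them in meta.xml.
    """
    names = {}
    with zipfile.ZipFile(dwca_path) as archive:
        with archive.open(core_file) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                name = (row.get("verbatimScientificName") or "").strip()
                if name:  # skip records with an empty species name
                    names[row[id_col]] = name
    return names


def join_species(annotation_rows, names, id_col="gbifID"):
    """Attach a species name to each annotation row; unmatched rows are dropped."""
    joined = []
    for row in annotation_rows:
        name = names.get(row[id_col])
        if name is not None:
            joined.append({**row, "verbatimScientificName": name})
    return joined


def build_category_map(rows):
    """Assign contiguous integer IDs to species names in sorted order."""
    species = sorted({row["verbatimScientificName"] for row in rows})
    return {name: idx for idx, name in enumerate(species)}
```

Sorting the names before numbering keeps the IDs stable across re-runs on the same data, which matters if a resumed run reuses an existing category map.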

Pipeline steps: fetch-images -> verify-images -> clean-dataset -> build_species_list.py -> split-dataset -> create-webdataset -> train-model
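
The resume behaviour can be illustrated with a short sketch. The actual orchestration is a bash script that calls uv run; the Python below only models the skip-if-output-exists idea, and run_step, run_pipeline, and the (name, output, action) step layout are hypothetical names chosen for illustration.

```python
from pathlib import Path


def run_step(name, output, action):
    """Run action() unless its output already exists; return True if it ran."""
    output = Path(output)
    if output.exists():
        print(f"[skip] {name}: {output} already exists")
        return False
    print(f"[run ] {name}")
    action()
    if not output.exists():
        # Fail loudly so a later step does not run on missing input.
        raise RuntimeError(f"{name} finished without creating {output}")
    return True


def run_pipeline(steps):
    """Run (name, output, action) triples in order; return the names that executed."""
    return [name for name, output, action in steps if run_step(name, output, action)]
```

One caveat with any check of this kind: a step that was interrupted after it started writing can leave a partial output that looks complete, so deleting that output before re-running is the safe way to force a step to repeat.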

Test plan

- Verified end-to-end on a 97-occurrence Lepidoptera DwC-A (103 images, 48 species)
- Confirmed step-skipping works when re-running (resume support)
- Confirmed category_map.json contains correct species-to-ID mapping
- Confirmed training starts, logs loss/accuracy, and early-stops as expected
- Test with a larger DwC-A dataset where MIN_INSTANCES=3 and default val/test fractions apply
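
The category map check in the list above could be automated along these lines. This is a sketch under assumptions: it treats category_map.json as a flat JSON object mapping species name to integer ID, with IDs running from 0, which this PR does not spell out. The helper name load_num_classes is hypothetical.

```python
import json


def load_num_classes(category_map_path):
    """Load a species-to-ID map, check the IDs are exactly 0..N-1, and return N."""
    with open(category_map_path, encoding="utf-8") as handle:
        category_map = json.load(handle)
    ids = sorted(category_map.values())
    # Assumed layout: unique IDs, contiguous, starting at 0.
    if ids != list(range(len(ids))):
        raise ValueError("category IDs must be unique and contiguous from 0")
    return len(ids)
```

Under that layout the returned count is the value a classifier head would need for num_classes, so the same check guards against a map that is out of step with the training configuration.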

🤖 Generated with Claude Code

Add two scripts that orchestrate the full flow from a GBIF Darwin Core Archive (DwC-A) to a trained ConvNeXt-Tiny species classification model:

- scripts/build_species_list.py: Bridge script that reads verbatimScientificName from a DwC-A (not included in load_dwca_data()), joins it onto the clean-dataset annotations CSV, and builds a category map JSON.
- scripts/train_species_classifier.sh: Single bash script running all 7 pipeline steps with uv run: fetch-images, verify-images, clean-dataset, build_species_list.py, split-dataset, create-webdataset, train-model. Supports resuming (skips steps with existing outputs) and computes num_classes dynamically from the category map.

No modifications to existing src/ code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This was referenced Feb 11, 2026

feat: support verbatimScientificName as species label in existing dataset tools #70

Open

feat: add Docker Compose environment to simulate SLURM for local testing #71

Open

test: add e2e test for the full training pipeline with a tiny dataset #72

Open

Copilot AI mentioned this pull request Feb 11, 2026

feat: add Docker Compose SLURM environment for local job testing #73

Draft
