feat: add species classifier training pipeline #69
Summary
- `scripts/build_species_list.py`, a bridge script that reads `verbatimScientificName` from a DwC-A file (not included in `load_dwca_data()`) and joins it onto the clean-dataset annotations CSV. It also prints DwC-A summary stats and builds a `category_map.json`.
- `scripts/train_species_classifier.sh`, a single bash script that orchestrates all 7 pipeline steps from DwC-A to a trained ConvNeXt-Tiny model using `uv run`. Each step checks for existing outputs to support resuming interrupted runs.
- `src/` code. The bridge script works around `load_dwca_data()` not including `verbatimScientificName` in its column selection.

Pipeline steps:
`fetch-images` -> `verify-images` -> `clean-dataset` -> `build_species_list.py` -> `split-dataset` -> `create-webdataset` -> `train-model`

Test plan
- `category_map.json` contains correct species-to-ID mapping
- `MIN_INSTANCES=3` and default val/test fractions apply

🤖 Generated with Claude Code
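
For reviewers, here is a minimal sketch of the kind of join and category map that `build_species_list.py` is described as producing. It is not the script's actual code: the `occurrence_id` key, the `image` column, the sample rows, and the helper names `join_species` and `build_category_map` are all assumptions made for illustration, and assigning IDs in sorted-name order is one possible convention rather than the one the PR necessarily uses.

```python
import csv
import io
import json


def join_species(annotation_rows, occurrence_rows, key="occurrence_id"):
    """Attach verbatimScientificName to each annotation row via a shared key.

    The key column name is hypothetical; the real files may use another one.
    """
    names = {row[key]: row["verbatimScientificName"] for row in occurrence_rows}
    joined = []
    for row in annotation_rows:
        name = names.get(row[key])
        if name:  # drop annotations with no matching (or empty) species name
            joined.append({**row, "verbatimScientificName": name})
    return joined


def build_category_map(rows):
    """Map each distinct species name to an integer ID, in sorted-name order."""
    names = sorted({row["verbatimScientificName"] for row in rows})
    return {name: idx for idx, name in enumerate(names)}


# Illustrative in-memory stand-ins for the DwC-A occurrence table
# and the clean-dataset annotations CSV.
occurrences = list(csv.DictReader(io.StringIO(
    "occurrence_id,verbatimScientificName\n"
    "o1,Apis mellifera\n"
    "o2,Bombus terrestris\n"
    "o3,Apis mellifera\n"
)))
annotations = list(csv.DictReader(io.StringIO(
    "image,occurrence_id\n"
    "a.jpg,o1\n"
    "b.jpg,o2\n"
    "c.jpg,o3\n"
    "d.jpg,o9\n"
)))

joined = join_species(annotations, occurrences)
category_map = build_category_map(joined)
print(json.dumps(category_map, indent=2))
```

In this sketch the annotation for `d.jpg` is dropped because its key has no match in the occurrence table; whether the real script drops, keeps, or reports such rows is not stated in this description.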