@mihow (Collaborator) commented Feb 11, 2026

Summary

  • Adds scripts/build_species_list.py, a bridge script that reads verbatimScientificName from a DwC-A file (a column that load_dwca_data() does not select) and joins it onto the annotations CSV produced by clean-dataset. It also prints DwC-A summary stats and writes category_map.json.
  • Adds scripts/train_species_classifier.sh, a single bash script that runs all 7 pipeline steps, from DwC-A to a trained ConvNeXt-Tiny model, via uv run. Each step checks for its existing outputs and is skipped if they are present, so interrupted runs can be resumed.
  • Makes no modifications to existing src/ code. The bridge script exists because load_dwca_data() omits verbatimScientificName from its column selection.
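
For reviewers, the join and category-map logic can be pictured roughly as follows. This is a minimal standard-library sketch, not the code in scripts/build_species_list.py: the join key `gbifID`, the `image_path` column, the tab-separated occurrence layout, and the "sorted names to consecutive integer IDs" scheme are all assumptions made for illustration.

```python
import csv
import io


def read_verbatim_names(occurrence_tsv, key="gbifID"):
    """Map each occurrence key to its verbatimScientificName.

    Assumes tab-separated text with a header row; the key column name
    is hypothetical.
    """
    reader = csv.DictReader(io.StringIO(occurrence_tsv), delimiter="\t")
    return {row[key]: row["verbatimScientificName"] for row in reader}


def join_names(annotations, names, key="gbifID"):
    """Attach the species name to each annotation row.

    Rows whose key has no matching occurrence are dropped.
    """
    joined = []
    for row in annotations:
        name = names.get(row[key])
        if name:
            joined.append({**row, "verbatimScientificName": name})
    return joined


def build_category_map(rows):
    """Assign consecutive integer IDs to species names in sorted order."""
    species = sorted({row["verbatimScientificName"] for row in rows})
    return {name: idx for idx, name in enumerate(species)}


# Illustrative data only; not taken from the test archive.
occurrence_tsv = (
    "gbifID\tverbatimScientificName\n"
    "1\tVanessa atalanta\n"
    "2\tAglais io\n"
    "3\tVanessa atalanta\n"
)
annotations = [
    {"gbifID": "1", "image_path": "a.jpg"},
    {"gbifID": "2", "image_path": "b.jpg"},
    {"gbifID": "9", "image_path": "c.jpg"},  # no matching occurrence
]

joined = join_names(annotations, read_verbatim_names(occurrence_tsv))
category_map = build_category_map(joined)
```

Sorting the names before assigning IDs keeps the mapping stable across reruns on the same data, which matters if a resumed run rebuilds the map.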

Pipeline steps: fetch-images -> verify-images -> clean-dataset -> build_species_list.py -> split-dataset -> create-webdataset -> train-model
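
The resume behaviour amounts to "run a step only if its output is missing". The sketch below shows that idea in Python rather than bash, so it is not the contents of train_species_classifier.sh. The step names come from the pipeline above, while the output file names and the `touch` stand-in are hypothetical.

```python
import tempfile
from pathlib import Path


def run_pipeline(steps, workdir):
    """Run each (name, output, action) step unless its output already exists.

    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, output, action in steps:
        target = Path(workdir) / output
        if target.exists():
            continue  # resume support: output present, so skip this step
        action(target)
        executed.append(name)
    return executed


def touch(target):
    """Stand-in for a real pipeline command: just create the output file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("done")


# Output names are placeholders, not the paths the real script checks.
STEPS = [
    ("fetch-images", "images.done", touch),
    ("verify-images", "verified.csv", touch),
    ("clean-dataset", "clean.csv", touch),
    ("build_species_list.py", "category_map.json", touch),
    ("split-dataset", "splits.done", touch),
    ("create-webdataset", "shards.done", touch),
    ("train-model", "model.done", touch),
]

with tempfile.TemporaryDirectory() as workdir:
    first_run = run_pipeline(STEPS, workdir)   # nothing exists yet: all steps run
    second_run = run_pipeline(STEPS, workdir)  # all outputs exist: nothing runs
```

One limitation of an existence-only check is that a step interrupted after creating its output file but before finishing would be skipped on the next run; deleting that output forces the step to rerun.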

Test plan

  • Verified end-to-end on a 97-occurrence Lepidoptera DwC-A (103 images, 48 species)
  • Confirmed step-skipping works when re-running (resume support)
  • Confirmed category_map.json contains correct species-to-ID mapping
  • Confirmed training starts, logs loss/accuracy, and early-stops as expected
  • To do: test with a larger DwC-A dataset, where MIN_INSTANCES=3 and the default val/test fractions apply
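
A check along these lines could back the category_map.json item and the class count passed to training. It is a sketch that assumes the file maps species names to integer IDs starting at 0 with no gaps; the function name and the sample species are illustrative, not taken from the scripts.

```python
import json


def num_classes_from_category_map(raw_json):
    """Return the class count after checking that IDs are 0..N-1 with no gaps."""
    category_map = json.loads(raw_json)
    ids = sorted(category_map.values())
    if ids != list(range(len(ids))):
        raise ValueError(f"IDs are not contiguous from 0: {ids}")
    return len(ids)


# Sample content standing in for a real category_map.json.
example = json.dumps({"Aglais io": 0, "Pieris rapae": 1, "Vanessa atalanta": 2})
num_classes = num_classes_from_category_map(example)
```

Failing loudly on gaps or duplicate IDs is cheaper than discovering a mismatch between the label range and the classifier head once training has started.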

🤖 Generated with Claude Code

Add two scripts that orchestrate the full flow from a GBIF Darwin Core
Archive (DwC-A) to a trained ConvNeXt-Tiny species classification model:

- scripts/build_species_list.py: Bridge script that reads verbatimScientificName
  from a DwC-A (a column that load_dwca_data() does not select), joins it
  onto the clean-dataset annotations CSV, and builds a category map JSON.

- scripts/train_species_classifier.sh: Single bash script running all 7
  pipeline steps with uv run: fetch-images, verify-images, clean-dataset,
  build_species_list.py, split-dataset, create-webdataset, train-model.
  Supports resuming (skips steps with existing outputs) and computes
  num_classes dynamically from the category map.

No modifications to existing src/ code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>