
Add processing code #27

Merged
egrace479 merged 12 commits into main from update/processing
Jun 10, 2025

Conversation

@egrace479
Member

@egrace479 egrace479 commented May 30, 2025

Adds processing code from initial dataset construction:

  • museum specimen, citizen science, and camera trap image filtering
  • human face detection
  • webdataset construction

Still needs the content deduplication (PDQ hashing) code from @samuelstevens, though it is already referenced in the READMEs.
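
For readers unfamiliar with the content deduplication step mentioned above: PDQ is a perceptual hash that maps an image to a 256-bit value, and near-duplicate images are typically found by comparing hashes with Hamming distance. The sketch below illustrates only that comparison. It assumes the hashes were already computed and stored as 64-character hex strings; the function names, the brute-force pairwise loop, and the `max_distance` parameter are hypothetical and are not taken from the content-dedup code referenced in this PR.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def find_near_duplicates(hashes: dict[str, str], max_distance: int) -> list[tuple[str, str]]:
    """Return pairs of image IDs whose hashes differ by at most max_distance bits.

    Brute-force O(n^2) comparison: fine for illustration, far too slow for
    hundreds of millions of images (hypothetical helper, not the project's code).
    """
    ids = sorted(hashes)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if hamming_distance(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs


# Made-up 256-bit hashes: a and b differ in one bit, c differs in almost every bit.
example_hashes = {
    "a.jpg": "f" * 64,
    "b.jpg": "f" * 63 + "e",
    "c.jpg": "0" * 64,
}
near_duplicates = find_near_duplicates(example_hashes, max_distance=8)
```

At dataset scale, a comparison like this would normally be replaced by an indexed or batched nearest-neighbor search; the pairwise loop is kept here only because it makes the distance criterion explicit.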

Updates the repo description to note that these files live in the processing/ directory and are being worked into the package, while the code required to download the images for TreeOfLife-200M is already integrated into the package at the root level of the repo.

Also fixes some reference links in pyproject.toml and adds citation information.

egrace479 and others added 7 commits May 29, 2025 12:52
Co-authored-by: Net Zhang <zhang.11091@osu.edu>
Co-authored-by: Jianyang Gu <gu.1220@osu.edu>
Co-authored-by: Net Zhang <zhang.11091@osu.edu>
Co-authored-by: Andrei Kopanev <andrey24122004@gmail.com>
@egrace479 egrace479 added the documentation and enhancement labels May 30, 2025
Contributor

@vimar-gu vimar-gu left a comment


Looks good to me!

@samuelstevens
Contributor

Just made a final push to content-dedup here: https://github.com/Imageomics/TreeOfLife-dev/tree/content-dedup/src/content_dedup.

@egrace479
Member Author

> Just made a final push to content-dedup here: https://github.com/Imageomics/TreeOfLife-dev/tree/content-dedup/src/content_dedup.

Great! Did you want me to clean the paths (set $BASE_DIR in place of OSC project info) and add it here?
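
A minimal sketch of what cleaning the paths could look like in a Python script, assuming the base directory is supplied through a BASE_DIR environment variable. The helper names, the fallback to the current directory, and the example subdirectories are hypothetical and not taken from the repository.

```python
import os
from pathlib import Path


def resolve_base_dir(default: str = ".") -> Path:
    """Read the base directory from $BASE_DIR, falling back to a default."""
    return Path(os.environ.get("BASE_DIR", default)).expanduser()


def data_path(*parts: str) -> Path:
    """Build a path under the base directory instead of a hard-coded cluster path."""
    return resolve_base_dir().joinpath(*parts)


# Example: point the scripts at a scratch location (made-up value).
os.environ["BASE_DIR"] = "/tmp/treeoflife"
hash_file = data_path("hashes", "part-000.npy")
```

Reading the variable at call time, rather than at import time, lets a job script export BASE_DIR once and keeps project-specific paths out of the committed code.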

@samuelstevens
Contributor

Yes, that would be great! Thanks for the help

Co-authored-by: Sam <samuel.robert.stevens@gmail.com>
@egrace479 egrace479 force-pushed the update/processing branch from d0b1d93 to 8073c9a June 3, 2025 19:40
Comment thread processing/scripts/processing/README.md Outdated
Comment thread processing/scripts/processing/README.md Outdated
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Contributor

@NetZissou NetZissou left a comment


Thumbs up! Great work!

@egrace479 egrace479 merged commit 8434a6b into main Jun 10, 2025
@egrace479 egrace479 deleted the update/processing branch June 10, 2025 13:53
@egrace479 egrace479 mentioned this pull request Oct 22, 2025

Labels

documentation, enhancement


4 participants