Andrey170170
Collaborator

@Andrey170170 Andrey170170 commented Feb 4, 2025

Added new initializers - fathom_net, EoL and Lila

Small fixes:

  • added `MATERIAL_CITATION` filtering for the GBIF initializer (issue described here)
  • updated formatting
  • consolidated dependencies in `pyproject.toml`
  • added descriptions of the downloaded data format
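
For reference, the `MATERIAL_CITATION` filter listed above could look roughly like the sketch below. It uses plain Python dictionaries rather than whatever dataframe engine the GBIF initializer actually runs on. The helper name `drop_material_citations` and the sample records are hypothetical; only the `basisOfRecord` column and the `MATERIAL_CITATION` value come from this pull request.

```python
from typing import Iterable

# Value of `basisOfRecord` that the GBIF initializer is described as excluding.
EXCLUDED_BASIS = "MATERIAL_CITATION"


def drop_material_citations(records: Iterable[dict]) -> list[dict]:
    """Return only the records whose `basisOfRecord` is not MATERIAL_CITATION.

    Hypothetical helper: the real initializer may express this as a
    dataframe filter instead of a list comprehension.
    """
    return [r for r in records if r.get("basisOfRecord") != EXCLUDED_BASIS]


# Made-up sample rows, for illustration only.
sample = [
    {"source_id": "1", "basisOfRecord": "HUMAN_OBSERVATION"},
    {"source_id": "2", "basisOfRecord": "MATERIAL_CITATION"},
    {"source_id": "3", "basisOfRecord": "PRESERVED_SPECIMEN"},
]
kept = drop_material_citations(sample)
```

With the sample rows above, only the record with `source_id` "2" is dropped; rows that lack a `basisOfRecord` value are kept, which is itself an assumption worth checking against the real data.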

Andrey170170 and others added 10 commits August 7, 2024 21:04
Small fixes
Also adjusted tools to use `source_id` instead of `gbif_id`
Added tool_name_override option for Tools, to be able to use custom tools
Added the use of `verification_scheme` instead of hard coded column names for some of the parts of the runner
plus some minor fixes
Updated readme - added `how to access data` section.

Updated pyproject.toml - added dependency libraries directly into this file, instead of link to `requirements.txt`.
Now there is a distinction between scheduled filtering or scheduling jobs and completed ones.

Adjusted logic of scripts according to this change.
@egrace479 egrace479 added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 4, 2025
Extracted initializers into a class structure
Rewrote initialization calling file to have a dict with initialization types.
Added `initializer_type` in mandatory config fields
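
A minimal sketch of the structure these commits describe: a base class, child initializers, a dict keyed by initialization type, and a mandatory `initializer_type` config field. `FathomNetInitializer` is named in this pull request; `BaseInitializer`, `GBIFInitializer`, `INITIALIZERS`, `build_initializer`, the `run` method, and the key strings are assumptions made for illustration, not the repository's actual API.

```python
from abc import ABC, abstractmethod


class BaseInitializer(ABC):
    """Hypothetical base class: stores the config, children implement `run`."""

    def __init__(self, config: dict) -> None:
        self.config = config

    @abstractmethod
    def run(self) -> str:
        """Perform source-specific initialization (placeholder return value)."""


class GBIFInitializer(BaseInitializer):
    def run(self) -> str:
        # A real child would load, filter, and repartition the source data here.
        return "gbif"


class FathomNetInitializer(BaseInitializer):
    def run(self) -> str:
        return "fathom_net"


# Dict mapping `initializer_type` values to classes (key names are assumed).
INITIALIZERS: dict[str, type[BaseInitializer]] = {
    "gbif": GBIFInitializer,
    "fathom_net": FathomNetInitializer,
}


def build_initializer(config: dict) -> BaseInitializer:
    """Look up the class named by the mandatory `initializer_type` field."""
    if "initializer_type" not in config:
        raise ValueError("config is missing the mandatory 'initializer_type' field")
    initializer_type = config["initializer_type"]
    if initializer_type not in INITIALIZERS:
        raise ValueError(f"unknown initializer_type: {initializer_type!r}")
    return INITIALIZERS[initializer_type](config)
```

Under this layout, supporting a new source means adding one subclass and one dict entry, while the calling code stays unchanged.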
@egrace479
Member

Please add a description of the base initializer, how child initializers inherit from it, and considerations for making a custom child initializer. Use the existing initializers as examples, e.g., filters that could be applied (GBIF excluding MATERIAL_CITATION). This would be a good place to note the importance of understanding the metadata coming from the source before creating a child initializer, and of not relying on source IDs to be persistent (also check uniqueness if relying on them to map to additional metadata). Consider EOL content IDs: they may be unique, but it's the page ID that maps to the taxa information.

As discussed, put this description into a README in the initializer/ folder and link to it from the root repo README.
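
The uniqueness check suggested above could be sketched as follows. This is only an illustration: the helper `find_duplicate_ids`, the field names `content_id` and `page_id`, and the sample rows are hypothetical and are not taken from the EOL metadata itself.

```python
from collections import Counter


def find_duplicate_ids(records, id_field):
    """Return the values of `id_field` that occur more than once, sorted."""
    counts = Counter(r[id_field] for r in records)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows: several content items can point at the same page.
rows = [
    {"content_id": 101, "page_id": 7},
    {"content_id": 102, "page_id": 7},
    {"content_id": 103, "page_id": 9},
]
duplicate_content_ids = find_duplicate_ids(rows, "content_id")
duplicate_page_ids = find_duplicate_ids(rows, "page_id")
```

In this made-up sample the content IDs are unique while one page ID is shared by two rows. Running such a check on each candidate ID before writing a child initializer shows which field can serve as a row key and which one is only a link to additional metadata.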

Added README.md to initializer.
Made small code quality adjustments to initializers
Added doc strings to `base_initializer.py`
@egrace479 egrace479 linked an issue Apr 3, 2025 that may be closed by this pull request
@thompsonmj
Contributor

Todo: add comments to base initializer and individual data sources as well.

Member

@egrace479 egrace479 left a comment

A few more suggestions and questions.

And lastly, it removes any entries that have `MATERIAL_CITATION` in `basisOfRecord`.
- `FathomNetInitializer`: Initializer for the FathomNet dataset. It filters out any entries without an `uuid` or `url`
value.
Additionally, removes any entries that are "not valid" by the `valid` column.
Member

Do you recall what determined if an entry was valid in FathomNet?

Andrey170170 and others added 3 commits April 10, 2025 23:50
Added docstrings to python files

Adjusted main `README`

Deleted unused `multimedia_scheme` file
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Updated package requirements
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Member

@egrace479 egrace479 left a comment

Suggested generalization for clarifying uuids (beyond just TreeOfLife).

@egrace479 egrace479 linked an issue Apr 15, 2025 that may be closed by this pull request
@egrace479 egrace479 mentioned this pull request Apr 15, 2025
Andrey170170 and others added 3 commits April 16, 2025 13:41
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Updated dependency list
Updated readmes
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Andrey170170 and others added 2 commits April 16, 2025 14:20
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Fix in checkpoint
Comment on lines 559 to 566
competed_queue: Queue of completed batches
total_batches: Total number of batches to process
done_batches: Number of batches that have been processed
"""

server_name: str
download_complete: threading.Event
competed_queue: queue.Queue[CompletedBatch]
Member

There is a typo here: "competed_queue" instead of "completed_queue". It seems to exist only here (search result), so it should be a simple fix.

Andrey170170 and others added 2 commits April 16, 2025 23:41
dataclasses.py misspelling fix
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
@Andrey170170 Andrey170170 requested a review from egrace479 April 17, 2025 03:47
@Andrey170170 Andrey170170 merged commit a4a94f3 into main Apr 17, 2025
@Andrey170170 Andrey170170 deleted the eol_download branch April 17, 2025 13:49
Labels

documentation Improvements or additions to documentation enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

  • Consolidate initializers into Initializer class
  • Clean up requirements to avoid listing extra dependencies

3 participants