Dataset cannot be created #13

Closed
penfever opened this issue May 30, 2024 · 6 comments · Fixed by #16
Comments

@penfever

Hi; thanks for your work on BioCLIP!

I am sorry to report that it is not possible to reproduce the Tree of Life 10M dataset following the steps described here; several links in this downloader are broken.

PROPOSED SOLUTION: Add a link in README.md to a hosted copy of the SQLite database referenced here: /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite

@hlapp
Member

hlapp commented May 30, 2024

@penfever just to make sure, you aren't trying to obtain the dataset (which is available on HF: doi:10.57967/hf/1972) but to recreate it from scratch?

@penfever
Author

@hlapp, thanks for the quick response. Despite what the name suggests, as I understand it (and please correct me if I am wrong), the TreeOfLife-10M dataset is not in fact available on Hugging Face. TreeOfLife-10M is a composite of three datasets: (1) iNat 2021 (neither images nor metadata made available on HF in WDS trainable format), (2) BioSCAN-1M (neither images nor metadata made available on HF in WDS trainable format), (3) EOL (only images made available in WDS trainable format).

The instructions say to download EOL and then run this script, which relies on several hardcoded paths in disk.py that point to files which do not appear to be available in the repo, for example:

# Files we make

eol_name_lookup_json = "data/names/eol_name_lookup.json"
inat21_name_lookup_json = "data/names/inat21_name_lookup.json"
bioscan_name_lookup_json = "data/names/bioscan_name_lookup.json"

Regards,
Ben
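
The missing-file problem described in this comment can be checked before running the script. The following is a minimal sketch, not part of the repository: it takes the three paths quoted above and reports which ones are absent. The helper name `find_missing` and the `root` parameter are hypothetical, and the sketch assumes the paths are relative to the directory the script is run from.

```python
from pathlib import Path

# Paths copied from the "Files we make" snippet quoted in the comment above.
EXPECTED_FILES = [
    "data/names/eol_name_lookup.json",
    "data/names/inat21_name_lookup.json",
    "data/names/bioscan_name_lookup.json",
]


def find_missing(paths, root="."):
    """Return the subset of `paths` that do not exist as files under `root`.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    root = Path(root)
    return [p for p in paths if not (root / p).is_file()]


if __name__ == "__main__":
    # Print one line per expected file that is absent.
    for path in find_missing(EXPECTED_FILES):
        print(f"missing: {path}")
```

Running this from the repository root lists any of the three lookup files that have not been created yet, which makes it easier to report exactly which dependencies are missing.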

@thompsonmj
Contributor

@penfever, thanks for bringing this up! We will look into the broken EOL metadata links and the WDS creation issues.

@egrace479
Member

@penfever, the SQLite database you're referencing is just metadata (a superset of TreeOfLife-10M: metadata/catalog.csv). We would have preferred to provide the entire webdataset; however, as noted in the HF Dataset Card, we cannot republish iNat21 in the compilation as per their terms of use.

As to your construction issues, it seems you're starting with step 1 of docs/imageomics/treeoflife10m.md, which will not work due to changes on EOL's site. Please try starting at step 6 (as noted in the Dataset Contents section of the Dataset Card). Note also that step 8 may overwrite the catalog downloaded from Hugging Face, depending on where the files are saved.

Thanks for bringing this to our attention; we'll add a note to docs/imageomics/treeoflife10m.md to start at step 6 after downloading TreeOfLife-10M from Hugging Face for better clarity.

We know this process can be buggy and requires some modifications on the user's system, so please let us know if you have any more issues.
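
One way to guard against the overwrite mentioned for step 8 is to copy the downloaded catalog aside first. This is a minimal sketch under stated assumptions, not a documented part of the workflow: the helper name `backup_file` is hypothetical, and the caller must supply the actual location of the catalog on their system.

```python
import shutil
from pathlib import Path


def backup_file(path, suffix=".bak"):
    """Copy `path` to a sibling file with `suffix` appended and return the copy's path.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    # Never clobber an earlier backup; the caller can choose another suffix.
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite existing backup: {dst}")
    # copy2 copies the file contents and, where possible, its metadata.
    shutil.copy2(src, dst)
    return dst
```

For example, calling `backup_file("metadata/catalog.csv")` before step 8 would leave a `catalog.csv.bak` next to the original; the path shown is an assumption and should be replaced with wherever the Hugging Face download was saved.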

@penfever
Author

@egrace479, thanks for the quick reply. I did see the instructions on the HF Dataset Card, and I did start at step 6, as instructed there. Unfortunately, step 6 is also not working because of the file dependencies. I describe them in part in the second half of my comment: #13 (comment)

@egrace479
Member

@penfever, everything should have been provided or created in that process; we'll look into why those files seem to be missing. Thanks for clarifying!
