Dataset cannot be created #13
Comments
@penfever just to make sure, you aren't trying to obtain the dataset (which is available on HF: doi:10.57967/hf/1972) but to recreate it from scratch?
@hlapp, thanks for the quick response. Despite what the name would suggest, as I understand it (and please correct me if I am wrong), the TreeOfLife-10M dataset is, in fact, not available on Hugging Face. TreeOfLife-10M is a composite of three datasets: (1) iNat 2021 (neither images nor metadata made available on HF in WDS trainable format), (2) BioSCAN-1M (neither images nor metadata made available on HF in WDS trainable format), (3) EOL (only images made available in WDS trainable format). The instruction provided is to download EOL and then run this script, which contains calls to several hardcoded paths in disk.py, pointing to files which do not appear to be available in the repo, for example --
Regards,
@penfever Thanks for bringing this up! We will look into the broken EOL metadata links and WDS creation issues.
@penfever, the SQLite database you're referencing is just metadata (a superset of TreeOfLife-10M: metadata/catalog.csv). We would have preferred to provide the entire webdataset; however, as noted in the HF Dataset Card, we cannot republish iNat21 in the compilation as per their terms of use. As to your construction issues, it seems you're starting with step 1 of docs/imageomics/treeoflife10m.md, which will not work due to changes on EOL's site. Please try starting at step 6 (as noted in the Dataset Contents section of the Dataset Card). Note also that step 8 may overwrite the catalog downloaded from Hugging Face depending on where they're saved. Thanks for bringing this to our attention; we'll add a note to docs/imageomics/treeoflife10m.md to start at step 6 after downloading TreeOfLife-10M from Hugging Face for better clarity. We know this process can be buggy and requires some modifications on the user's system, so please let us know if you have any more issues.
@egrace479, thanks for the quick reply. I did see the instructions on the HF Dataset Card, and I did start at Step 6, as instructed there. Unfortunately, step 6 is also not working because of the file dependencies -- I describe them in part in the second half of my comment: #13 (comment)
@penfever, everything should have been provided or created in that process; we'll look into why those files seem to be missing. Thanks for clarifying!
Hi; thanks for your work on BioCLIP!
I am sorry to report that it is not possible to reproduce the Tree of Life 10M dataset following the steps described here; several links in this downloader are broken.
PROPOSED SOLUTION: Add a link in README.MD to a hosted copy of the SQLite database referenced here: /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite
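For anyone hitting the same problem, here is a minimal sketch of a pre-flight check that reports which of the files the scripts expect are absent on the local machine. `find_missing` is a hypothetical helper, not part of the BioCLIP repo, and the `required` list is an assumption: only the mapping.sqlite path is taken from this issue, and any other hardcoded paths would have to be added by hand.

```python
from pathlib import Path


def find_missing(paths):
    """Return the entries of `paths` that do not exist on this machine."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


if __name__ == "__main__":
    # Assumption: only this path is known from the issue; extend the list
    # with whatever other hardcoded paths your checkout refers to.
    required = [
        "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite",
    ]
    for path in find_missing(required):
        print(f"missing: {path}")
```

Running this before the dataset-creation script at least shows up front which dependencies are unavailable, rather than failing partway through.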