Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup dataset lazy loading #58

Closed
wants to merge 2 commits into from

Conversation

wconnell
Copy link
Contributor

@wconnell wconnell commented Dec 7, 2021

  1. MoleculeDataset(lazy=True) converts a smile to an rdkit Mol object and overwrites it with None. Loading large datasets like ZINC2m is severely slowed down by this, so avoid the unnecessary conversion to a Mol when lazy=True
  2. Check download/extract logic with ZINC2m dataset to avoid redundancy and conflicts.

@KiddoZhu
Copy link
Contributor

Thanks for your contribution! That's a really important step for large datasets.

For the check of file in Zinc2m, I would prefer keeping extracting the zip file anyway. This is because we don't check the md5 of extracted file and it is possible that the extracted file might be broken (e.g. due to Ctrl+C break).

@KiddoZhu
Copy link
Contributor

Manually rebased and merged

@KiddoZhu KiddoZhu closed this Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants