Conversation
After testing this, the way records are cached (with all their child records) is not going to work. It works for singlepoints, but torsiondrives are way too big, so this is going to need a bit more work. I think the solution is to store individual records (without children) in a separate table, and store foreign keys in the current
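The proposed split — records stored once, without their children, with parent/child links kept as foreign keys — could look roughly like this. This is a hypothetical sqlite sketch; the table and column names are illustrative and not the actual QCPortal cache schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One row per record, with no child data inlined
conn.execute("""
    CREATE TABLE record (
        id INTEGER PRIMARY KEY,
        record_type TEXT NOT NULL,
        payload BLOB
    )
""")

# Parent/child relationships stored as foreign keys rather than
# nesting child records inside the parent's cached blob
conn.execute("""
    CREATE TABLE record_children (
        parent_id INTEGER NOT NULL REFERENCES record(id),
        child_id INTEGER NOT NULL REFERENCES record(id),
        PRIMARY KEY (parent_id, child_id)
    )
""")

# A torsiondrive (id 1) pointing at two optimization children
conn.execute("INSERT INTO record VALUES (1, 'torsiondrive', NULL)")
conn.execute("INSERT INTO record VALUES (2, 'optimization', NULL)")
conn.execute("INSERT INTO record VALUES (3, 'optimization', NULL)")
conn.executemany("INSERT INTO record_children VALUES (?, ?)", [(1, 2), (1, 3)])

# Children are found by joining on the link table, not by unpacking the parent
child_ids = [row[0] for row in conn.execute(
    "SELECT child_id FROM record_children WHERE parent_id = 1 ORDER BY child_id"
)]
print(child_ids)  # [2, 3]
```

With this layout, a large torsiondrive's optimizations are cached once each and shared by reference, instead of being duplicated inside every parent blob.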
I've been playing around with this today, and it's great! Very intuitive. Two comments along with their importance out of 10 (I wouldn't consider either one blocking).
The code that I'm playing with is:

```python
import qcportal

client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# the next two lines didn't immediately do what I wanted, so I ran the loops below
#ds.fetch_entries()
#ds.fetch_records(include=["optimizations"], force_refetch=True)

for entry in ds.iterate_entries():
    entry

for record in ds.iterate_records():
    for angle, opt in record[2].minimum_optimizations.items():
        opt.final_molecule
```

The resulting cache file size is pretty reasonable:

```
(bespokefit) jw@mba$ ls -lrth cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
-rw-r--r-- 1 jeffreywagner staff 13M Feb 16 15:32 cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
```

Then in a separate interpreter (and with minor changes to qcsubmit):

```python
from qcportal import dataset_models

ds2 = dataset_models.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")

from openff.qcsubmit.results import TorsionDriveResultCollection

tdrc = TorsionDriveResultCollection.from_datasets([ds2])
tdrc
```

Success!!
Glad it's working so far!
At the moment, there is no limit to the cache (basically if you set the max size to
If you create a client with the same
I'm going to go ahead and merge this. There are still some tasks to be done before the next release, but it seems to be working well. The main reason is that I have another feature being built on top of this, and leaving this open makes it a bit complicated.
Description
Previously, dataset information was not cached locally at all, so rerunning a script, or just calling `client.get_dataset` again, could require re-fetching all data, even if it had been fetched before.

This PR implements this caching. All storage of records is now in an SQLite database, either in a file or in memory. Some care has been taken to keep the cache up-to-date as much as possible, but I am sure there are still loopholes. This includes records writing themselves back to the cache when they have been updated with additional data (for example, after fetching molecules or trajectories).
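The write-back behavior described above can be sketched as an sqlite upsert: when a cached record gains new data, it replaces its own row so the next run sees the enriched copy. This is a minimal illustration, not the actual QCPortal code; the table and column names are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_cache (id INTEGER PRIMARY KEY, payload TEXT)")

def write_back(record_id, record_dict):
    # Insert the record, or overwrite the cached row if it already exists
    conn.execute(
        "INSERT INTO record_cache (id, payload) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
        (record_id, json.dumps(record_dict)),
    )

write_back(42, {"status": "complete"})                     # first cache write
write_back(42, {"status": "complete", "molecule": "..."})  # enriched later
payload = json.loads(conn.execute(
    "SELECT payload FROM record_cache WHERE id = 42").fetchone()[0])
print(sorted(payload))  # ['molecule', 'status']
```

The `ON CONFLICT ... DO UPDATE` form keeps the operation atomic, so a record updated mid-script never leaves a half-written row behind.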
There are a few ways to use it:

- Specify the `cache_dir` parameter when creating a client. This will automatically create SQLite files for each dataset, and re-use them as long as the same `cache_dir` is used in subsequent client construction.
- Use the `dataset_from_cache` function, where you can pass a file directly (ie, downloaded out-of-band).
- There is also a `dataset_from_cache` function in `dataset_models.py` that works similarly, but will result in an offline dataset object completely disconnected from any server.

This is purely a client-side change, so this branch will work with the currently-deployed MolSSI QCArchive servers.
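The `cache_dir` re-use relies on a predictable on-disk layout. Judging from the paths in the conversation above (`cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite`), each server gets a subdirectory and each dataset gets one SQLite file. A hypothetical helper mirroring that convention — not the real qcportal implementation — might look like:

```python
import tempfile
from pathlib import Path

def dataset_cache_path(cache_dir: str, host: str, port: int, dataset_id: int) -> Path:
    # One subdirectory per server (<host>_<port>), one sqlite file per dataset
    subdir = Path(cache_dir) / f"{host}_{port}"
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir / f"dataset_{dataset_id}.sqlite"

with tempfile.TemporaryDirectory() as cache_dir:
    path = dataset_cache_path(cache_dir, "api.qcarchive.molssi.org", 443, 378)
    print(path.name)  # dataset_378.sqlite
```

Because the path is derived only from the server address and dataset id, a second client pointed at the same `cache_dir` deterministically finds the same file and can skip re-fetching.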
There is still some polishing to be done (and docs), but I am looking for feedback and any bugs before merging.
See #740
Todos and missing features:
- `refresh_cache` needs to be finished

Changelog description
Implement client-side caching of datasets
Status