Add isSame method to allow file caching based on system/molecule properties #3
Thanks - this is great :-)
I have been thinking about how to expose the right properties from the parsers and have some ideas that we should discuss at some point. I've also thought more about the caching, and think another "Wishlist" item would be to unify the different caches in the code so that the user can set a single total memory value, and know that this won't be accidentally exceeded by the combination of the different caches. Again, it will require some discussion and design :-). There's the beginning of this in the CentralCache class, but this is still at a prototyping stage, and was designed for caching frames from a trajectory.
* update * update * update * update * update --------- Co-authored-by: William (Zhiyi) Wu <zwu@exscientia.co.uk>
Does this pull request introduce new functionality?
This pull request adds the `BioSimSpace._SireWrappers.System.isSame` method, to allow system comparisons based on properties. This allows a user to compare two systems based on a set of common properties, excluding any that are not required, e.g. `coordinates`, `space`, and `velocity` when wanting to just compare topologies. The method has some basic system checks, i.e. UID, number of molecules, number of atoms, number of residues, which are always performed to quickly reject non-comparable systems.

To compare a `system` to `other`:

To speed things up, I've also added the option to `skip_water`, i.e. if you know that the systems use the same water topology.

With this in place, we now have the ability to perform per-format file caching, using a set of excluded property keys for a given format. The existing file cache code has been refactored and updated to use the `isSame` functionality behind the scenes when comparing systems. The cache key is now the system UID, the set of excluded properties, whether to ignore water molecules, and the chosen format. The value is the last system used to write to the matching key, the path to the file that was written, and its MD5 checksum. The cache uses a simple fixed-size dictionary with a rough 2GB limit, which is based on the total number of atoms stored. This can be optimised at a later date, or we could expose an environment variable to allow users to set the size limit in their scripts. (In practice, I don't think the cache will get large for a typical script.)

To confirm that things are working as expected I ran a full hydration free energy setup and checked the resulting cache:
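As an aside from the run above, the comparison and cache keying described in this pull request can be sketched in plain Python. This is only an illustrative toy under stated assumptions, not the BioSimSpace implementation: `ToyMolecule`, `ToySystem`, `is_same`, and `ToyFileCache` are hypothetical names, properties are modelled as plain dictionaries, the size limit is expressed directly as an atom count, and the MD5 checksum is taken over bytes passed in rather than over a file on disk.

```python
import hashlib
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class ToyMolecule:
    # Hypothetical stand-in for a molecule: an atom count plus named properties.
    n_atoms: int
    properties: dict
    is_water: bool = False


@dataclass
class ToySystem:
    # Hypothetical stand-in for a system: a UID plus an ordered list of molecules.
    uid: int
    molecules: list

    def n_atoms(self):
        return sum(m.n_atoms for m in self.molecules)


def is_same(system, other, excluded_properties=(), skip_water=False):
    """Compare two toy systems on all properties except the excluded ones."""
    # Cheap checks first, to quickly reject non-comparable systems.
    if system.uid != other.uid:
        return False
    if len(system.molecules) != len(other.molecules):
        return False
    if system.n_atoms() != other.n_atoms():
        return False

    excluded = set(excluded_properties)
    for m0, m1 in zip(system.molecules, other.molecules):
        # Optionally trust that both systems use the same water topology.
        if skip_water and m0.is_water and m1.is_water:
            continue
        p0 = {k: v for k, v in m0.properties.items() if k not in excluded}
        p1 = {k: v for k, v in m1.properties.items() if k not in excluded}
        if p0 != p1:
            return False
    return True


class ToyFileCache:
    """Toy cache keyed on (UID, excluded properties, skip_water, format)."""

    def __init__(self, max_atoms):
        self._max_atoms = max_atoms      # assumed size limit, in atoms
        self._entries = OrderedDict()    # key -> (system, path, md5)
        self._num_atoms = 0

    @staticmethod
    def _key(system, fmt, excluded_properties, skip_water):
        return (system.uid, frozenset(excluded_properties), skip_water, fmt)

    def store(self, system, fmt, path, contents,
              excluded_properties=(), skip_water=False):
        key = self._key(system, fmt, excluded_properties, skip_water)
        old = self._entries.pop(key, None)
        if old is not None:
            self._num_atoms -= old[0].n_atoms()
        checksum = hashlib.md5(contents).hexdigest()
        self._entries[key] = (system, path, checksum)
        self._num_atoms += system.n_atoms()
        # Evict the oldest entries once the atom budget is exceeded.
        while self._num_atoms > self._max_atoms and len(self._entries) > 1:
            _, (evicted, _, _) = self._entries.popitem(last=False)
            self._num_atoms -= evicted.n_atoms()

    def lookup(self, system, fmt, excluded_properties=(), skip_water=False):
        key = self._key(system, fmt, excluded_properties, skip_water)
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_system, path, _ = entry
        # Only re-use the file if the systems match on the retained properties.
        if is_same(system, cached_system, excluded_properties, skip_water):
            return path
        return None
```

With two toy systems that differ only in coordinates, storing a topology file for the first and then looking up the second with `coordinates` excluded returns the cached path, which mirrors the re-use across lambda windows that this pull request reports.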
In this hydration free energy example, `FreeEnergy.Relative` sets up processes for 11 lambda windows. From the cache you can see that GROMACS coordinate and topology files have only been written for the lambda=0 window, and have been re-used for all 10 other windows (confirmed by checking that the files exist and are the same).

To take advantage of the caching a user can update the use of `BioSimSpace.IO.saveMolecules` wherever appropriate, e.g. during the setup stage of a particular process. They just need to pass in `excluded_options` as an additional keyword argument.

Things to do:
@xiki-tempula: Just tagging you in so that you are aware of progress. No need to review at this stage. I'll back-port into `michellab` once this is complete.

Checklist:

`devel` into this branch before issuing this pull request (e.g. by running `git pull origin devel`): [y]

Suggested reviewers:
@chryswoods