Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add isSame method to allow file caching based on system/molecule properties #3

Merged
merged 18 commits into from Feb 13, 2023

Conversation

lohedges
Copy link
Contributor

@lohedges lohedges commented Feb 13, 2023

Does this pull request introduce new functionality?

This pull request adds the BioSimSpace._SireWrappers.System.isSame method, to allow system comparisons based on properties. This allows a user to compare two systems based on a set of common properties, excluding any that are required, e.g. coordinates , space, and velocity wanting to just just compare topologies. The method has some basic system checks, i.e. UID, number of molecules, number of atoms, number of residues, which are always performed to quickly reject non-comparable systems.

To compare a system to other:

import BioSImSpace as BSS

files = BSS.IO.expand(
    BSS.tutorialUrl(),
    ["ala.crd", "ala.top"],
    ".bz2"
)

system = BSS.IO.readMolecules(files)

# Make a copy of the system.
other = system.copy()

# Assert they are the same, making sure it is invariant to the order.
assert system.isSame(other)
assert other.isSame(system)

# Translate the other system.
other.translate(3 * [BSS.Units.Length.angstrom])

# Assert that they are different.
assert not system.isSame(other)
assert not other.isSame(system)

# Assert that they are the same, apart from their coordinates.
assert system.isSame(other, excluded_properties=["coordinates"])
assert other.isSame(system, excluded_properties=["coordinates"])

# Now delete a property.
other._sire_object.removeProperty("space")

# Assert that they are different.
assert not system.isSame(other, excluded_properties=["coordinates"])
assert not other.isSame(system, excluded_properties=["coordinates"])

# Assert that they are the same, apart from their coordinates and space.
assert system.isSame(other, excluded_properties=["coordinates", "space"])
assert other.isSame(system, excluded_properties=["coordinates", "space"])

To speed things up, I've also added the option to skip_water, i.e. if you know that the systems use the same water topology.

With this in place, we now have the ability to perform per-format file caching, using a set of excluded property keys for a given format. The existing file cache code has been refactored and updated to use the isSame functionality behind the scenes when comparing systems. The cache key is now the system UID, the set of excluded properties, whether to ignore water molecules, and the chosen format. This value is the last system used to write to the matching key, the path to the file that was written, and its MD5 checksum. The cache uses a simple fixed-sized dictionary with a rough 2GB, which is based on the total number of atoms stored. This can be optimised at later date, or we could expose an environment variable to allow users to set the size limit in their scripts. (In practice, I don't think the cache will get large for a typical script.)

To confirm that things are working as expected I ran a full hydration free energy setup and checked the resulting cache:

import BioSimSpace as BSS

m0 = BSS.IO.readMolecules("molecules/ethane.pdb")[0]
m1 = BSS.IO.readMolecules("molecules/methanol.pdb")[0]

m0 = BSS.Parameters.gaff(m0).getMolecule()
m1 = BSS.Parameters.gaff(m1).getMolecule()

mapping = BSS.Align.matchAtoms(m0, m1)
merged = BSS.Align.merge(m0, m1, mapping)

solvated = BSS.Solvent.tip3p(molecule=merged, box=3 * [5 * BSS.Units.Length.nanometer])

p = BSS.Protocol.FreeEnergy()
free_nrg = BSS.FreeEnergy.Relative(solvated, p, engine="gromacs", `work_dir="free")`

BSS.IO._file_cache._cache
_FixedSizeOrderedDict([(('{05758435-945f-4d74-acac-78d08a7b21cd}',
                         'PDB',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmpy90w2gb5/tmp.pdb',
                         '6302349bac84167d8d9711bad0af80eb')),
                       (('{4cdc49bc-f763-446d-b06d-1055222f214b}',
                         'PDB',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmpigj09fl0/antechamber.pdb',
                         '6302349bac84167d8d9711bad0af80eb')),
                       (('{36d6bb97-1ae2-45e5-a7b3-7f01fc4ce648}',
                         'PDB',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmp7sk0nfih/tmp.pdb',
                         '19613ec5a20f7e99936f36abe2bba62a')),
                       (('{5a92a925-3b77-405b-97f6-5f7516354432}',
                         'PDB',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmpklc3ipy4/antechamber.pdb',
                         '19613ec5a20f7e99936f36abe2bba62a')),
                       (('{9bcc7b74-fde7-48fc-bd2f-007c1eb9292f}',
                         'PDB',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmpa_hzmd22/tmp0.pdb',
                         '6302349bac84167d8d9711bad0af80eb')),
                       (('{f1c52df4-5fca-4773-8e2c-c1fd6ee39c89}',
                         'PDB',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmpa_hzmd22/tmp1.pdb',
                         '19613ec5a20f7e99936f36abe2bba62a')),
                       (('{59e0d301-0bac-404a-8a18-46d8797c5d94}',
                         'Gro87',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=1>,
                         '/tmp/tmpt8d7g4mk/input.gro',
                         'ff65fa2b29e231ea55a0796355210046')),
                       (('{3546a07d-e02c-4c12-a764-e5cc4c2af49f}',
                         'Gro87',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=4052>,
                         '/tmp/tmpt8d7g4mk/solvated.gro',
                         '3b7f17e1e81019db8380e5be7f7b57e0')),
                       (('{3546a07d-e02c-4c12-a764-e5cc4c2af49f}',
                         'GroTop',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=4052>,
                         '/tmp/tmpt8d7g4mk/solvated.top',
                         'ab9576151bedc6860ad6f16cdddeb72b')),
                       (('{aadef6b5-7176-4078-b8bb-32cdf753c044}',
                         'Gro87',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=4052>,
                         '/home/lester/Code/openbiosim/biosimspace/demo/free/lambda_0.0000/gromacs.gro',
                         'b2718324ab90b1b7633ca64921d01c4a')),
                       (('{aadef6b5-7176-4078-b8bb-32cdf753c044}',
                         'GroTop',
                         'set()',
                         'True'),
                        (<BioSimSpace.System: nMolecules=4052>,
                         '/home/lester/Code/openbiosim/biosimspace/demo/free/lambda_0.0000/gromacs.top',
                         'ba093c5a90b41d89246fb75e6ed10b12'))])

In this example, FreeEnergy.Relative sets up processes for 11 lambda windows. From the cache you can see that GROMACS coordinate and topology files have only been written for the lambda=0 window, and have been re-used for all 10 other windows (confirmed by checking the files exist, and are the same).

To take advantage of the caching a user can update the use of BioSimSpace.IO.saveMolecules wherever approprate, e.g. during the setup stage of a particular process. They just need to pass in excluded_options as an additional keyword argument.

Things to do:

  • Test performance. I imagine this should be okay given that we skip water molecules. If it's too slow for large systems, then it would be easy to re-write in the C++ layer and wrap.
  • The fixed-size cache is pretty basic but I think it should work well for most needs. I don't want to spend to much time optimising this until I get some real-world feedback.

@xiki-tempula: Just tagging you in so that you are aware of progress. No need to review at this stage. I'll back-port into michellab once this is complete.

Checklist:

  • I confirm that I have merged the latest version of devel into this branch before issuing this pull request (e.g. by running git pull origin devel): [y]
  • I confirm that I have added a test for any new functionality in this pull request: [y]
  • I confirm that I have added documentation (e.g. a new tutorial page or detailed guide) for any new functionality in this pull request: [y] This is currently internal functionality so isn't directly exposed to the user.
  • I confirm that I have permission to release this code under the GPL3 license: [y]

Suggested reviewers:

@chryswoods

@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 10:20 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 10:20 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 10:20 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 11:09 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 11:09 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 11:09 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 11:09 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 11:09 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 12:11 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 12:11 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 12:11 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 12:11 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 12:11 — with GitHub Actions Inactive
@lohedges lohedges temporarily deployed to biosimspace-build February 13, 2023 12:11 — with GitHub Actions Inactive
Copy link
Contributor

@chryswoods chryswoods left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - this is great :-)

I have been thinking about how to expose the right properties from the parsers and have some ideas that we should discuss at some point. I've also thought more about the caching, and think another "Wishlist" thing would be to unite the different caching in the code so that the user can set a single total memory value, and know that this won't be accidentally exceeded by the combination of the different caches. Again, it will require some discussion and design :-). There's the beginning of this in the CentralCache class, but this is still at a prototyping stage, and was designed for caching frames from a trajectory.

@lohedges lohedges merged commit 21c6b73 into devel Feb 13, 2023
@lohedges lohedges deleted the feature_is_same branch February 13, 2023 19:07
@lohedges lohedges added the enhancement New feature or request label Mar 16, 2023
lohedges pushed a commit that referenced this pull request Jun 26, 2023
* update

* update

* update

* update

* update

---------

Co-authored-by: William (Zhiyi) Wu <zwu@exscientia.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants