Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have amber process write binary RST file instead of text rst7 file #22

Merged
merged 6 commits into from
Aug 21, 2023

Conversation

xiki-tempula
Copy link
Collaborator

Have amber process write binary RST file instead of text rst7 file

@lohedges
Copy link

You should be able to fix your failure by loading using thesire.legacy.IO.AmberRst parser and comparing the first frame, e.g:

from sire.legacy.IO import AmberRst

ref = AmberRst(proc._ref_file).getFrame(0)
rst = AmberRst(proc._rst_file).getFrame(0)

assert ref == rst

@xiki-tempula
Copy link
Collaborator Author

@lohedges Thanks for the help. Sorry, I'm still a bit confused as to why I got this problem as I cannot reproduce it locally. The error seems to suggest that the run failed?

@lohedges
Copy link

Ah, I hadn't spotted that there was a failure with the single point energy check too. At first I just noticed the error trying to compare the position restraint reference with the coordinate file, which fails because you can't parse a binary rst file in the same way. Will have a look at the single point one locally too.

@lohedges
Copy link

Yes, I was looking at the wrong action. (The top one, not the one for this PR.) I've patched in your changes and, apart from the test with the fix I give above, all of the other failures pass locally for me too 🤷‍♂️

@lohedges
Copy link

Actually, if I hop on your branch, rather than patching the changes onto mine, then I see the failures. For test_makeCompatibleWith, the failure seems to come from a difference in the AMBER config that is written:

diff my_amber.cfg your_amber.cfg
11,12c11,12
<    ntb=0,
<    cut=999.,
---
>    cut=8.0,
>    iwrap=1,

With yours, I see no results written to the mdout file. Not sure why this wasn't a problem with the rst7 file.

@xiki-tempula
Copy link
Collaborator Author

This seems to be related to the box. One would get
< ntb=0,
< cut=999.,
If the there is no box and one would get

cut=8.0,
iwrap=1,

If there is a box. I hope it won't be the case where the binary rst format cannot handle the case with no box.

@xiki-tempula
Copy link
Collaborator Author

Ok, I managed to reproduce the error on a linux machine by recreating the env. The error is

--------------------------------------------------------------------------------
   1.  RESOURCE   USE:
--------------------------------------------------------------------------------

| Flags:
ERROR: NetCDF restart has Conventions that are not AMBERRESTART.
 getting new box info from bottom of inpcrd
|  INFO: Old style inpcrd file read

ERROR: I could not find the number of atoms or the time on
       the second line of your inpcrd file [test.rst]. Bad INPCRD file!

I guess amber only take some form of the binary input and the binary rst file that sire writes is not liked by amber.

@xiki-tempula
Copy link
Collaborator Author

I think the problem might be related to sire 2023.2.3 but is fixed in 2023.3.

@lohedges
Copy link

Ah interesting, I don't see that locally with sander, the mdout just ends with:

--------------------------------------------------------------------------------
   1.  RESOURCE   USE:
--------------------------------------------------------------------------------

| Flags:
 getting box info from netcdf restart file

For the differences in the config, I had forgotten that this PR is based on an older Sire version, hence and older version of BioSimSpace, i.e. rather being based off the tip of your current devel. I think the issue might be that Sire 2023.2.3 defaults to having an infinite cartesian space if no space is present. Since 2023.3.3 we've added some checks for this when determining whether we need to run a vacuum simulation, since checking for a space property alone (as done here is no longer sufficient. The changes in our version of the code can be seen here. I thought this was needed for 2023.3.0+, though, so surprised you are seeing this issue with 2023.2.3.

@xiki-tempula
Copy link
Collaborator Author

Do you know when would the old check to be not sufficient? This is to patch our Q2 release so as long as it works for our limited use case, I'm fine with it. In our Production floe, we don't need to do any vacuum simulation, we would just use BSS.Solvent.solvate to create the BSS system and run simulation in periodic system with box.

@lohedges
Copy link

I've just recreated a clean 2023.2.3 environment and the additional space checks that I mentioned are not needed, This is a 2023.3+ change. However, I now see the same error as you, i.e.:

--------------------------------------------------------------------------------
   1.  RESOURCE   USE:
--------------------------------------------------------------------------------

| Flags:
|    NONPERIODIC   ntb=0 and igb=0: Setting up nonperiodic simulation
ERROR: NetCDF restart has Conventions that are not AMBERRESTART.

As you say, it looks like something in the AmberRst parser itself has changed in 2023.3.0. I'll tag in @chryswoods to see if he knows what the fix is. It might be something that is easy to backport.

@chryswoods
Copy link

This is because the old AmberRst object used "wrong" logic to decide if the file was a trajectory file or a restart file. This line is the culprit: https://github.com/OpenBioSim/sire/blob/2c624dbd346cb69940b6e25deb72e0e61bf8088d/corelib/src/libs/SireIO/amberrst.cpp#L1736

In the old AmberRst code (pre 2023.3.0) it said that the file was a restart file if it had been loaded as a restart file. Otherwise, it was a trajectory file. This is what these lines do;

    if (created_from_restart)
    {
        globals.insert("Conventions", "AMBERRESTART");
    }
    else
    {
        globals.insert("Conventions", "AMBER");
    }

In the new AmberRst (>=2023.3.0) I completely rewrote the entire class to use the new trajectory code. In this case, it says that restart files are those that contain just a single frame, while trajectory files are those that contain more than one frame. That's what this line does: https://github.com/OpenBioSim/sire/blob/a073701d68cd61f9be56e360be3b3ee3f812bbba/corelib/src/libs/SireIO/amberrst.cpp#L269

   if (nframes <= 1)
    {
        globals.insert("Conventions", "AMBERRESTART");
        is_restart = true;
    }
    else
    {
        globals.insert("Conventions", "AMBER");
        is_restart = false;
    }

In your case, I would "fix" the old AmberRst by patching locally and changing to

    //if (created_from_restart)
    //{
        // all files written by old sire are RESTART files
        globals.insert("Conventions", "AMBERRESTART");
    //}
    //else
    //{
    //    globals.insert("Conventions", "AMBER");
    //}

The old AmberRst code couldn't actually write trajectories, so all files really are just restart files.

@lohedges
Copy link

Great, thanks for the clarification. I can patch tomorrow and create a 2023.2.3 build number 3. It would be really good to know if the binary format fixes the reduction issue before attempting anything else.

lohedges added a commit to OpenBioSim/sire that referenced this pull request Aug 18, 2023
@lohedges
Copy link

Okay, I've push build number 2 of 2023.2.3 for Linux which seems to fix the problems that you are seeing. I don't have time to do a macOS build today, so it might be worth skipping the failures on that platform for now. (I assume that you will only be testing the REMD on Linux anyway.)

@xiki-tempula
Copy link
Collaborator Author

@lohedges Thanks. Do you have a tag that I could use to build our own sire for our dependency list?

@lohedges
Copy link

You can just use the branch name for the git_tag in your conda recipe, i.e. backport_2023.2.3.

@lohedges
Copy link

I'm also wondering if this explains why you might have still been seeing the reduction issue. If you were building from the 2023.2.3 tag internally (rather than using our Linux conda package) then you wouldn't be using the second triclinic box backport. This was a specific and manual backport that I made on the named branch, rather than ever being merged into main, which would have required a new tag. (At the time a simple rebuild was requested, not a patch version.)

@xiki-tempula
Copy link
Collaborator Author

@lohedges So for the Q2 release, we use the sire that is build by Chirs. We only begin using our own sire build in Q3.

@xiki-tempula
Copy link
Collaborator Author

@lohedges Hi, have you built the new version of sire? I have restarted the CI but it seems that we are still pulling in the old build. Thanks.

@lohedges
Copy link

Apologies, it looks like I forgot to pin the Python version. Will rebuild now!

@lohedges
Copy link

Okay, the py38 package is now uploaded. Looking at your most recent failures it looks like the frame comparision suggestion that I made hasn't worked. Will investigate this locally. Perhaps you can just compare a subset of the attributes of the frame.

@xiki-tempula
Copy link
Collaborator Author

xiki-tempula commented Aug 18, 2023

Thanks. The only one that is failing is the position restraint now. I think it would be the best if there is a way of comparing the rst file. If there isn't, then I guess I have to load it as BSS system and them dump them to rst7 file for comparison.

@lohedges
Copy link

Strangely I'm finding that the coordinate and reference file have different coordinates when I debug locally, hence your failure. Not sure why this is, since you are saving the same system to file, only to a different file name.

@lohedges
Copy link

Ah, you are passing a reference system so they should be different. The bit that was confusing me was that you appear to be writing the system to the ref_file, not the ref_system, e.g. here.

@xiki-tempula
Copy link
Collaborator Author

You are right. I have just found out that this is checking that the file is different, instead of the same.

Copy link
Collaborator

@msuruzhon msuruzhon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as our internal CI passes

@xiki-tempula
Copy link
Collaborator Author

@msuruzhon Still waiting for the sire to build for our dep now. I guess there isn't anything other thing that needs to be done in this PR?

@msuruzhon
Copy link
Collaborator

I think this PR is fine as is yeah.

_IO.saveMolecules(file, system, "rst7", property_map=self._property_map)
_IO.saveMolecules(file, system, "rst", property_map=self._property_map)
# To keep the file extension the same.
shutil.move(f'{file}.rst', f'{file}.rst7')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

black formatting?

@msuruzhon
Copy link
Collaborator

Hmm macOS is erroring out but Linux isn't...

@xiki-tempula
Copy link
Collaborator Author

@msuruzhon because Lester only rebuild sire for the linux version.

@msuruzhon
Copy link
Collaborator

But why should the process be failing?

@xiki-tempula
Copy link
Collaborator Author

@msuruzhon Because it saves rst file in a bad way that cannot be used by amber.

@xiki-tempula xiki-tempula merged commit ce6d4c1 into prep-0.9.1 Aug 21, 2023
2 of 4 checks passed
@xiki-tempula
Copy link
Collaborator Author

xiki-tempula commented Aug 21, 2023

@lohedges Sorry. New things. In this repo, I have tested the process with sander, which seems to be working fine.
When I move to our production env and test it with Amber 22 PMEMD
I got

#11 637.0 NetCDF error: NetCDF: Attribute not found
#11 637.0   at Getting title

@lohedges
Copy link

Ah, how annoying. I'll try to build pmemd locally to test. I've only been using sander at present. Will see how I get on. I think @chryswoods has a macOS pmemd build that he might be able to test against if needed.

@xiki-tempula
Copy link
Collaborator Author

@lohedges Thanks. I hop they could do a amber conda build then these shall be much easier.

@lohedges
Copy link

You'll be pleased to hear that it works just file for me 🤷‍♂️ I've only tested pemed, not pmemd.cuda. Will have another try tomorrow.

Are you definitely using the patched version of Sire in your production environment?

@xiki-tempula
Copy link
Collaborator Author

Weird. I think it shall be the new sire as the error message is different. I will have another go.

@xiki-tempula
Copy link
Collaborator Author

@lohedges Ok, you are right in that most system would work fine for the new Sire and binary rst file. But this system failed. I have pickled the system.
And if you run

process = BSS.Process.Amber(system, protocol=BSS.Protocol.FreeEnergy(), exe='pmemd')
process.start()

system.p.zip

You would get the error.

@lohedges
Copy link

Thanks. I now have pmemd installed locally. Will try to debug this afternoon.

@lohedges
Copy link

Yes, I see the same error locally. I'm testing with the latest version of Sire (not 2023.2.3) so this isn't something that has been fixed more recently.

@lohedges
Copy link

Okay, it's as simple as setting a title attribute in the NetCDF file. Why on earth this is needed for FEP only, I have no idea. You can test this by doing:

In [1]: import BioSimSpace.Sandpit.Exscientia as BSS

INFO:numexpr.utils:Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
WARNING:pymbar.timeseries:Warning on use of the timeseries module: If the inherent timescales of the system are long compared to those being analyzed, this statistical inefficiency may be an underestimate.  The estimate presumes the use of many statistically independent samples.  Tests should be performed to assess whether this condition is satisfied.   Be cautious in the interpretation of the data.
INFO:pymbar.mbar_solvers:JAX detected. Using JAX acceleration.

In [2]: import pickle

In [3]: import netCDF4 as nc

In [4]: s = pickle.load(open("system.p", "rb"))

In [5]: process = BSS.Process.Amber(s, protocol=BSS.Protocol.FreeEnergy(), exe="/home/lester/.conda/envs/pmemd/bin/pmemd", work_dir="test")

In [6]: ds = nc.Dataset("test/amber.rst7", "r+")

In [7]: ds.title = "test"

In [8]: ds.close()

In [9]: process.start()
Out[9]:
BioSimSpace.Process.Amber(<BioSimSpace.System: nMolecules=879>, BioSimSpace.Protocol.FreeEnergy(timestep=2.0000 fs, runtime=4.0000 ns, temperature=300.0000 K, pressure=1.0000 atm, report_interval=200, restart_interval=1000, first_step=0, restart=False, restraint=None, force_constant=10.0 kcal_per_mol/angstrom**2, lam=fep    0.0, lam_vals=    fep
0   0.0
1   0.1
2   0.2
3   0.3
4   0.4
5   0.5
6   0.6
7   0.7
8   0.8
9   0.9
10  1.0), exe='/home/lester/.conda/envs/pmemd/bin/pmemd', name='amber', work_dir='/home/lester/Downloads/test', seed=None)

In [10]: process.wait()
/home/lester/Code/openbiosim/biosimspace/python/BioSimSpace/Sandpit/Exscientia/Process/_amber.py:884: UserWarning: The process exited with an error!
  _warnings.warn("The process exited with an error!")
2023-08-22 14:55:32.999 | WARNING  | alchemlyb.parsing.amber:extract:376 - WARNING: file /home/lester/Downloads/test/amber.out is a prematurely terminated run
2023-08-22 14:55:33.000 | INFO     | alchemlyb.parsing.amber:extract:400 - Read 1 dV/dl data points in file /home/lester/Downloads/test/amber.out

In [11]: process.stdout()
 SC_RES_DIST=     0.0000  SC_RES_ANG=       0.0000  SC_RES_TORS=         0.0000
 SC_EEL_DER=      0.0000  SC_VDW_DER=       0.0000  SC_DERIV   =         0.0000
 ------------------------------------------------------------------------------

vlimit exceeded for step      0; vmax =    22.2253
vlimit exceeded for step      1; vmax =    24.1338

     Coordinate resetting cannot be accomplished,
     deviation is too large
     iter_cnt, my_bond_idx, i and j are :       2      13    2642    2649

I'll open a Sire issue to make sure that a title attribute is always set. I'm not sure if this has a specific meaning in some cases, but can't find anything online. I'll see if I can add this to the backport branch to allow you to move forward.

@xiki-tempula
Copy link
Collaborator Author

@lohedges Thank you very much for the help.

@lohedges
Copy link

It was an easy fix. The title attribute was being generated from the Sire system name, but wasn't being written when empty. I just needed to update the code to write a default title when this happens. I'm creating a build 3 for 2023.2.3 now and will let you know when it's uploaded. The fix has already been pushed to the backport_2023.2.3 branch so you can rebuild your internal version.

@xiki-tempula
Copy link
Collaborator Author

@lohedges Excellent. Thank you very much. I will build a new version with backport_2023.2.3 now

@lohedges
Copy link

Build 3 is now uploaded.

@xiki-tempula xiki-tempula deleted the feat_box_angle branch September 18, 2023 10:21
xiki-tempula added a commit that referenced this pull request Sep 18, 2023
Co-authored-by: William (Zhiyi) Wu <zwu@exscientia.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants