Re-construct database files from primary data #20

davidlmobley · 2015-04-17T17:16:22Z

This is implied by several other issues, but ought to exist as its own issue.

To complete this issue, we must first resolve Issue #14, #13 , and #19.

This issue must be completed before we can resolve Issues #15 and #16 .

Resolving this issue will also provide the best resolution of the issue with water molecules in topology files (# to be inserted here).

kyleabeauchamp · 2015-04-18T17:10:23Z

I'm not claiming that my code is ready for prime time yet, but I do have a somewhat automated pipeline for building the AMBER files using OpenEye and Antechamber.

https://github.com/choderalab/gbff/blob/master/code/rebuild_freesolv.py

jchodera · 2015-04-18T17:13:42Z

I think we decided to not use gaff2xml for this purpose until we have a stable public API, so my plan was to ingest the dependencies into a standalone script for FreeSolv until gaff2xml (or some other tool) stabilized and we could reduce complexity by using these external functions.

kyleabeauchamp · 2015-04-18T17:17:49Z

Agreed. But we may still want to check that the standalone script does the same thing as the latest gaff2xml functions. A lot of the code we have floating around differs in the various steps used to prepare ligands, which leads to different charges. In particular, the setSampleHydrogens.

Another possibilty is for me to roll out a release of gaff2xml. Even if the API is not stable, we can still just specific a hard-coded version number for installation via conda. I'm happy with whatever here.

davidlmobley · 2015-04-20T16:53:07Z

What's the issue with setSampleHydrogens? Is it an only an issue with
ambiguity about protonation state/tautomer state? The compounds here in
general are quite straightforward at this point; there are only a handful
where there would be any concern at all about tautomer state.

John, what's your preference on the approaches Kyle suggested?

Also, I should point out that Chris Bayly just pointed me to the
documentation on how to generate canonical charges using his multiconformer
procedure -
https://docs.eyesopen.com/toolkits/cookbook/python/modeling/am1-bcc.html.
This is what we should be using for charging.

On Sat, Apr 18, 2015 at 10:17 AM, Kyle Beauchamp notifications@github.com
wrote:

Agreed. But we may still want to check that the standalone script does the
same thing as the latest gaff2xml functions. A lot of the code we have
floating around differs in the various steps used to prepare ligands, which
leads to different charges. In particular, the setSampleHydrogens.

Another possibilty is for me to roll out a release of gaff2xml. Even if
the API is not stable, we can still just specific a hard-coded version
number for installation via conda. I'm happy with whatever here.

—
Reply to this email directly or view it on GitHub
#20 (comment).

David Mobley
dmobley@gmail.com
949-385-2436

jchodera · 2015-04-20T17:02:55Z

Christopher Bayly's canonical AM1-BCC recipe, which we have now adopted, uses

    omega.SetSampleHydrogens(True)

which @kyleabeauchamp has found leads to modest charge differences for ethanol due to placement of the proton.

@kyleabeauchamp has also found that

omega.SetRMSThreshold(rms)

seems to behave independently of the specified rms value, and omega.SetRMSThreshold(0) behaves completely differently from omitting a call to this method altogether. We're not yet sure if this is a bug. Using or omitting this line leads to charge differences for ethyl acrylate.

I think we can either use a specific gaff2xml version to regenerate these or ingest the methods that we are using into scripts here---either method will work from a technical standpoint.

@kyleabeauchamp makes a good point about testing, however---gaff2xml currently has an automated testing framework to ensure correct behavior of its methods, while this project does not yet have automated testing framework.

The biggest question might therefore be "What kind of testing do we need to ensure correctness of FreeSolv?". It seems that not much in the way of testing process existed previously (which is why all of these errors have come to light). Perhaps we want to figure out how we want things to be tested first, and that will make the path forward clearer?

davidlmobley · 2015-04-20T17:16:13Z

What is being done about the omega.SetRMSThreshold issue you mentioned?
Have you talked with support & Bayly?

Can you elaborate on how gaff2xml is tested and how that might be extended
to FreeSolv?

There is no obvious way that I can tell to check that the
topology/coordinate files generated are in fact correct, though we can
certainly check a number of side issues:

Do the number of water molecules/water atoms in the topology file and
coordinate file match, and match what's expected?
Do the charges in the topology file match the expected charges, within
rounding?
Do the atom types in the topology file match those in the
AMBER-atom-typed .mol2 file from antechamber?
others?
Do the number of atoms in the topology file match what's expected?

You could also do more complicated testing, such as:

If you turn the resulting topology file back into an OE molecule (which
would require writing a parser) do you get the same isomeric SMILES as the
original molecule?

On Mon, Apr 20, 2015 at 10:02 AM, John Chodera notifications@github.com
wrote:

Christopher Bayly's canonical AM1-BCC recipe
http://docs.eyesopen.com/toolkits/cookbook/python/modeling/am1-bcc.html,
which we have now adopted, uses
omega.SetSampleHydrogens(True)
which @kyleabeauchamp https://github.com/kyleabeauchamp has found leads
to modest charge differences for ethanol due to placement of the proton.

@kyleabeauchamp https://github.com/kyleabeauchamp has also found that

omega.SetRMSThreshold(rms)

seems to behave independently of the specified rms value, and
omega.SetRMSThreshold(0) behaves completely differently from omitting a
call to this method altogether. We're not yet sure if this is a bug. Using
or omitting this line leads to charge differences for ethyl acrylate.

I think we can either use a specific gaff2xml version to regenerate these
or ingest the methods that we are using into scripts here---either method
will work from a technical standpoint.

@kyleabeauchamp https://github.com/kyleabeauchamp makes a good point
about testing, however---gaff2xml currently has an automated testing
framework to ensure correct behavior of its methods, while this project
does not yet have automated testing framework.

The biggest question might therefore be "What kind of testing do we need
to ensure correctness of FreeSolv?". It seems that not much in the way of
testing process existed previously (which is why all of these errors have
come to light). Perhaps we want to figure out how we want things to be
tested first, and that will make the path forward clearer?

—
Reply to this email directly or view it on GitHub
#20 (comment).

David Mobley
dmobley@gmail.com
949-385-2436

jchodera · 2015-04-20T17:45:34Z

What is being done about the omega.SetRMSThreshold issue you mentioned?
Have you talked with support & Bayly?

@kyleabeauchamp emailed Christopher Bayly on 11 Apr 2015, and I've just pinged him again today. We can email OpenEye support for clarification if that doesn't work.

Can you elaborate on how gaff2xml is tested and how that might be extended
to FreeSolv?

There is a suite of nosetests used to test individual functions and run some datasets through them:
https://github.com/choderalab/gaff2xml/tree/master/gaff2xml/tests

There is no obvious way that I can tell to check that the topology/coordinate files generated are in fact correct,

Supposing technical constraints were not a problem, what would the true test of "correctness" be? Would it be identical free energies of hydration in each program (gromacs, AMBER, etc)? Identical potential energies for various conformations?

we can certainly check a number of side issues:

All of these things (many of which caused trouble in the past) would be good to test.

If you turn the resulting topology file back into an OE molecule (which would require writing a parser) do you get the same isomeric SMILES as the original molecule?

@swails has a neat tool called ParmEd that is very much getting close to being able to do a lot of these tests, so it's possible we can utilize it in a testing framework.

davidlmobley · 2015-04-20T17:58:29Z

There is no obvious way that I can tell to check that the
topology/coordinate files generated are in fact correct,

Supposing technical constraints were not a problem, what would the true
test of "correctness" be? Would it be identical free energies of hydration
in each program (gromacs, AMBER, etc)? Identical potential energies for
various conformations?

When I said "correct", what I had in mind was actually a different issue -
that the topology in question contains the intended set of force field
parameters for the intended molecule with the intended protonation/tautomer
state. For the sake of terminology, let's call the issue you're raising
"consistent" rather than correct. That is, we would like FreeSolv to
contain topology/coordinate files which are both correct (containing the
intended parameters for the intended molecule as just described) AND
consistent (across file formats). I think "correctness" is probably harder
(in theory) to test than "consistent", and I don't know how to check it in
general except the specific aspects I already mentioned.

Consistency in principle ought to be able to be verified by identical free
energies of hydration and identical potential energies. However, this is
really still a research issue (i.e. we have a separate project trying to
get reproducibility of hydration free energies across several different
codes, and it is not straightforward. This also looks some at potential
energies, and a similar thing is true there). Potential energies are
certainly the easier of the two and probably ought to be our benchmark
here, at least for reference configurations between AMBER and GROMACS
files.

David Mobley
dmobley@gmail.com
949-385-2436

jchodera · 2015-04-20T18:14:04Z

I think "correctness" is probably harder (in theory) to test than "consistent", and I don't know how to check it in general except the specific aspects I already mentioned.

Excellent points.

I think it might be practical to check consistency via some potential energy comparisons (especially with tools the OpenMM app and its file readers or ParmEd to help), and correctness via Topology -> OpenEye OEMol -> canonical SMILES, at least for now.

davidlmobley · 2015-04-23T20:28:27Z

Yes, I think that ought to be the goal.

On Mon, Apr 20, 2015 at 11:14 AM, John Chodera notifications@github.com
wrote:

I think "correctness" is probably harder (in theory) to test than
"consistent", and I don't know how to check it in general except the
specific aspects I already mentioned.

Excellent points.

I think it might be practical to check consistency via some potential
energy comparisons (especially with tools the OpenMM app and its file
readers or ParmEd to help), and correctness via Topology -> OpenEye
OEMol -> canonical SMILES, at least for now.

—
Reply to this email directly or view it on GitHub
#20 (comment).

David Mobley
dmobley@gmail.com
949-385-2436

davidlmobley · 2017-01-30T16:56:48Z

This is basically resolved by #28 but can be re-opened later if there is anything we need to reawaken here. A lot of the energetic consistency stuff has been handled elsewhere via ParmEd and in tests for openmoltools.

I intend to work with Michael Shirts separately to get in input files in other formats such as DESMOND.

davidlmobley mentioned this issue Apr 17, 2015

Strip water molecules from all topology/coordinate files in current database #21

Closed

davidlmobley closed this as completed Jan 30, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-construct database files from primary data #20

Re-construct database files from primary data #20

davidlmobley commented Apr 17, 2015

kyleabeauchamp commented Apr 18, 2015

jchodera commented Apr 18, 2015

kyleabeauchamp commented Apr 18, 2015

davidlmobley commented Apr 20, 2015

jchodera commented Apr 20, 2015

davidlmobley commented Apr 20, 2015

jchodera commented Apr 20, 2015

davidlmobley commented Apr 20, 2015

jchodera commented Apr 20, 2015

davidlmobley commented Apr 23, 2015

davidlmobley commented Jan 30, 2017

Re-construct database files from primary data #20

Re-construct database files from primary data #20

Comments

davidlmobley commented Apr 17, 2015

kyleabeauchamp commented Apr 18, 2015

jchodera commented Apr 18, 2015

kyleabeauchamp commented Apr 18, 2015

davidlmobley commented Apr 20, 2015

jchodera commented Apr 20, 2015

davidlmobley commented Apr 20, 2015

jchodera commented Apr 20, 2015

davidlmobley commented Apr 20, 2015

jchodera commented Apr 20, 2015

davidlmobley commented Apr 23, 2015

davidlmobley commented Jan 30, 2017