Chemical identity information for non-QM packages #35

davidlmobley · 2018-01-16T17:02:57Z

Those of us in the MD community would very much like to be able to take output from QM packages and take it directly into MD engines and chemistry toolkits we use. However, these typically require what I'll call the "chemical identity" of the molecule, as (without QM) we can't infer this simply from the elements/number of electrons.

To that end, I'd like to see how receptive people would be to including in the schema the necessary information, such as formal charges (on atoms), connectivity, and bond orders. Presumably this wouldn't be particularly helpful to people staying in the QM world, but for us it would save a whole bunch of intermediate steps and/or a need to know what molecule is contained in the JSON before we can do anything with it.

Alternatively, providing an isomeric SMILES of the molecule or similar could also work. Basically, we just need some way of knowing what molecule (and charge state) it contains without having to "do chemistry" on the file to determine that.

If people are receptive to this idea I can open up a PR to add this to the requirements.md.

To be slightly more specific, I am also proposing broadening the concept of topology to also include bonding information and/or chemical identity (beyond just the coordinates and elements for the molecule).

davidlmobley · 2018-01-16T17:04:02Z

@dgasmith - is this sufficiently specific as to what I think we need, or is further elaboration needed?

Also tagging @jchodera as he expressed interest as well. I will also discuss with colleagues at OpenEye.

cryos · 2018-01-16T17:55:32Z

I am very receptive to optionally including this information, and we already have some proposed improvements related to bonding information. It would be great to understand what would be most helpful for the MD side, from my perspective at least.

davidlmobley · 2018-01-16T22:41:12Z

An alternative to formal charges would be partial charges. Isomeric SMILES might be harder for you to be able to generate in general (since it requires more detailed chemical perception), though perhaps if you provide them for the parent molecule (input) this could work.

However, after discussing it with some others, we think maybe the easiest/best for you would be just the bond order matrix (e.g. Wiberg bond orders). Possibly a relatively general solution would be to give the Wiberg bond order matrix and the isomeric SMILES for the parent molecule, since after applying chemical perception these should agree.
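
As a rough illustration of how a fractional bond-order matrix could be turned into connectivity on the MD side, here is a minimal sketch. The 0.5 cutoff, the rounding rule, and the function name `perceive_bonds` are assumptions made for illustration, not part of any proposed schema, and the matrix values are made up. Delocalized systems would need real chemical perception from a cheminformatics toolkit.

```python
from typing import List, Tuple


def perceive_bonds(
    bond_orders: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int, int]]:
    """Return (i, j, integer_order) for atom pairs above the cutoff.

    Assumption: any fractional bond order above `threshold` counts as a
    bond, and its integer order is the nearest integer (at least 1).
    """
    n_atoms = len(bond_orders)
    bonds = []
    for i in range(n_atoms):
        for j in range(i + 1, n_atoms):
            order = bond_orders[i][j]
            if order > threshold:
                bonds.append((i, j, max(1, round(order))))
    return bonds


# Made-up, illustrative values for H-C#N (atom order: H, C, N).
hcn_orders = [
    [0.00, 0.95, 0.02],
    [0.95, 0.00, 2.90],
    [0.02, 2.90, 0.00],
]
print(perceive_bonds(hcn_orders))  # [(0, 1, 1), (1, 2, 3)]
```

The resulting bond list could then be compared against the connectivity implied by the parent SMILES, which is the agreement check described above.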

Other properties can be very useful for information purposes for us, such as dipole moments, though the above is probably the most critical.

dgasmith · 2018-01-17T21:52:54Z

I think we can expand the topology sections to (optionally) have a place for bond orders and a SMILES string. These would be distinct from Wiberg bond orders, which would likely fall in the "results" section of the schema because they require a QM computation.

For dipole moments, are you thinking of some sort of localized approach or just the general dipole of the molecule? The former would again need to be a result specification, while the latter should already be included.

What other kind of chemical perception were you thinking of?

dgasmith · 2018-01-27T02:39:11Z

@davidlmobley and @jchodera, do we just want an optional SMILES field as a string, or would you possibly want the ability to add more chemical perception than just that?

davidlmobley · 2018-01-27T05:14:24Z

Sorry about that, @dgasmith - I missed your last question.

The things I'm most concerned about are the bond order matrix and the SMILES string (or if SMILES strings are problematic to generate, I can come up with alternatives). Beyond that, I can see some things being useful for convenience (dipole moment for example) but they are not crucial for me.

I don't see the need to add more chemical perception than that:

What other kind of chemical perception were you thinking of?

I just meant that I'm trying to be understanding that you guys might not want to have to do sophisticated chemical perception. If SMILES strings are tricky, then formal charges or partial charges (together with the bond order matrix) could serve.

In terms of dipole moment, just the general dipole moment of the molecule.

dgasmith · 2018-01-27T15:52:54Z

Ok, good to know. I think we can add a SMILES section to the molecule class as an optional field no problem.

If I think about a charge model (such as RESP) and bond-orders those will have to be outputs of QC programs rather than attached to the molecule specification itself. I think we can straightforwardly add a spec for charge models and bond-orders if the QC program supports it.

loriab · 2018-01-27T16:11:57Z

I'm thinking there'll have to be domains w/i the overall molecule spec. The QCprog provides the results it can generate w/o subjectivity. An EFPprog may come along and provide its fragments in a separate domain and interact with the QC portion to the extent of fixing its Cartesian coordinates and using all atoms as input. A RESPprog can come along and add a charge set. And anything that can generate SMILES can read the QC portion and whatever other portions (a program shouldn't amend the molecule JSON unless it can understand it completely) and add its own domain.

So, I think SMILES is great, just not directly in the QC molecule domain, where it's (1) not an input or output and (2) any implementations are likely non-expert.

jchodera · 2018-01-27T17:35:55Z

@davidlmobley and @jchodera, do we just want an optional SMILES field as a string, or would you possibly want the ability to add more chemical perception than just that?

An optional SMILES string corresponding to the original chemical species the calculation was performed for (if applicable) would be exceptionally helpful in searching large sets of calculation results on many molecules for calculations of interest. Presumably, this won't apply to all calculations (for example, a transition state will not necessarily correspond to a single chemical species), but it may still be useful attached metadata for many calculations involving small organic molecules.

Specifically, a canonical isomeric SMILES string would be optimal. Despite the term "canonical", not all programs produce the same canonical string, so one computed with a specific program (e.g., RDKit) may be ideal.

Note that the SMILES string is only useful in identifying whether a specific calculation may be for a molecular species of interest, but will not help identify which atoms correspond to which parts of the molecule (which is often important for tasks like forcefield parameterization). Some additional topology information mapping atoms in the molecular topology to atoms in the calculation would still be necessary.

dgasmith · 2018-01-27T19:16:06Z

Hmm, we may be on slightly different pages. I'm not entirely sure we could support something that would require the following "workflow":

1. Molecule specification
2. Bond-order computation
3. RDKit SMILES build based on bond-order computation

I was thinking more that we would have an option to add a SMILES string before the computation that could just ride through the QC computation. The database tech that all of this is associated with can definitely handle the above workflow however.

@cryos Any thoughts here?

jchodera · 2018-01-27T19:39:22Z

I was thinking more that we would have an option to add a SMILES string before the computation that could just ride through the QC computation.

This would be totally adequate! I was just intending to note that:

1. there are some challenges in making even "canonical" isomeric SMILES strings unique
2. we likely also need some additional optional information to ride along that can map atoms in the SMILES-defined molecule to specific atoms in the QC computation

dgasmith · 2018-01-27T20:27:37Z

Right now we have it so that unknown fields are passed through. Perhaps a better way of thinking of the SMILES field is a "registered" pass through field.

For 2) I don't think that's a problem as long as we do not need too many more of these. Would a simple list of integers that index the rest of the molecule spec work?
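
The pass-through behaviour described above could look roughly like the following sketch. It assumes a toy "QC program" that understands only two fields; the field names `symbols` and `geometry`, the set `UNDERSTOOD_FIELDS`, and the function `run_with_pass_through` are hypothetical and chosen only for illustration.

```python
import copy
from typing import Any, Dict

# Hypothetical: the only fields this toy "QC program" understands.
UNDERSTOOD_FIELDS = {"symbols", "geometry"}


def run_with_pass_through(molecule: Dict[str, Any]) -> Dict[str, Any]:
    """Echo every field the program does not understand into its output."""
    output = {
        key: copy.deepcopy(value)
        for key, value in molecule.items()
        if key not in UNDERSTOOD_FIELDS
    }
    # Understood fields are re-emitted by the program itself (unchanged
    # here, since this sketch performs no actual computation).
    for key in UNDERSTOOD_FIELDS:
        output[key] = copy.deepcopy(molecule[key])
    return output


record = {
    "symbols": ["He"],
    "geometry": [0.0, 0.0, 0.0],
    "smiles": "[He]",  # rides through untouched
}
result = run_with_pass_through(record)
print(result["smiles"])  # [He]
```

A "registered" pass-through field would behave the same way at run time; the difference is only that its name and meaning are documented in the schema.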

jchodera · 2018-01-28T19:31:03Z

For 2) I don't think that's a problem as long as we do not need too many more of these. Would a simple list of integers that index the rest of the molecule spec work?

Unfortunately, no. There's no unique way to render a SMILES string into a molecular topology, so a list of atom indices would not be sufficient for identifying which atoms correspond to which parts of the molecule. This is essentially why we need some portable way of describing a chemical topology with indexed atoms.

In SMARTS strings, it's possible to tag atoms with integers to uniquely identify matched atoms. @davidlmobley : Are SMILES also valid SMARTS? If so, can we have an explicit-hydrogen SMARTS that uniquely tags each atom in the molecule? If so, a single string would be all we need to both create the molecular topology and tag all the atoms.

davidlmobley · 2018-01-30T05:02:45Z

SMILES are valid SMARTS, I believe, but I'm not sure how you'd tag the atoms. What exactly do you have in mind?

Or maybe I'm missing something obvious. @bannanc?

jchodera · 2018-01-30T14:36:31Z

Suppose we have methanol, and would like to specify which atoms belong to which chemically distinct parts of the molecule.

The SMILES string would be something like CO.

We could specify a corresponding SMARTS string that matches each chemically distinct atom in the molecule, tagging it with a unique index:

[C:1]([H:2])([H:3])([H:4])[O:5][H:6]

This way, we only need to carry through a string that allows us to identify which atoms in the quantum chemical calculation correspond to which atoms in the molecule.
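
To make the mapping concrete, here is a minimal sketch that reads the tags out of a string like the one above and checks them against the atom order of a calculation. It assumes that tag k refers to the k-th atom (1-based) of the QC geometry and that every atom is written as a simple bracket atom starting with its element symbol; the function names are hypothetical. A real workflow would use a cheminformatics toolkit rather than a regular expression.

```python
import re
from typing import Dict, List

# Matches simple bracket atoms such as [C:1] or [Cl:7]; aromatic
# (lowercase) atoms and isotope prefixes are deliberately not handled.
_TAGGED_ATOM = re.compile(r"\[([A-Z][a-z]?)[^\]:]*:(\d+)\]")


def read_atom_tags(tagged: str) -> Dict[int, str]:
    """Return {tag: element symbol} for every tagged atom in the string."""
    tags: Dict[int, str] = {}
    for symbol, tag in _TAGGED_ATOM.findall(tagged):
        index = int(tag)
        if index in tags:
            raise ValueError(f"tag {index} is used more than once")
        tags[index] = symbol
    return tags


def tags_match_geometry(tagged: str, symbols: List[str]) -> bool:
    """Check that tags 1..N exist and agree with the QC element order."""
    tags = read_atom_tags(tagged)
    if sorted(tags) != list(range(1, len(symbols) + 1)):
        return False
    return all(tags[i + 1] == symbol for i, symbol in enumerate(symbols))


tagged = "[C:1]([H:2])([H:3])([H:4])[O:5][H:6]"
print(tags_match_geometry(tagged, ["C", "H", "H", "H", "O", "H"]))  # True
print(tags_match_geometry(tagged, ["O", "H", "C", "H", "H", "H"]))  # False
```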

bannanc · 2018-01-30T21:42:00Z

I'm trying to think through the logistics of using a SMARTS string for this purpose. The typical idea behind SMARTS is that they describe a substructure of a molecule. SMILES are valid SMARTS in that you can use a SMILES string to perform a substructure search. However, the reverse isn't true: a SMARTS is not necessarily a valid SMILES. That is, when parsing SMARTS, toolkits expect a SMARTS to describe a substructure query and treat that differently from a molecule.

Assuming all atoms are specified explicitly (including hydrogens and bonds), I think this is a reasonable solution to needing the molecule identity and the mapping to the coordinate information, it just might be more complex than you realize to get that information extracted correctly.

jchodera · 2018-01-30T21:48:43Z

I'd suggest we include both a SMILES string and the corresponding tagged SMARTS string that matches the atoms from the SMILES-generated molecule (in whatever toolkit you use) to the ordered atoms in the quantum chemical calculation.

bannanc · 2018-01-30T22:13:27Z

I think that makes sense, I wasn't sure if you were suggesting replacing the SMILES with a SMARTS.

jchodera · 2018-02-02T20:51:24Z

(Apologies to the QC folks for the degree of back-and-forth needed to come to a consensus!)

OK, to summarize our thinking so far:

Use cases

Many calculations of interest will likely start with a specific small organic molecule in mind. Atomic coordinates are generated, and quantum chemical properties computed. It would be useful for many applications that make use of this data for forcefield parameterization, machine learning, or the study of molecular properties to be able to easily identify (1) whether the calculation was performed for a molecule of potential interest, and (2) which atoms in the calculation correspond to which chemically distinct parts of the original molecule.

The proposal

While it is generally easy to go from a small molecule identity to atomic coordinates of a plausible conformation for that molecule, it is very difficult to go the other way. As such, it would be useful to optionally associate information sufficient for (1) and (2) above with the calculation's metadata.

We propose to add two optional string fields to the metadata for the calculation:

SMILES: This field would generally contain a canonical isomeric SMILES string that specifies the small molecule chemical species from which the initial input atomic positions were generated. This will help easily identify which calculations may be of interest, but doesn't easily identify which atoms in the QC calc correspond to which parts of the molecule.
SMARTS: This field would generally contain an explicit-hydrogen canonical isomeric SMARTS string with tagged atoms (introduced by reaction SMARTS / SMIRKS, but still supported by nearly all major SMARTS implementations). The tags would uniquely identify the atoms in different chemical environments in the molecule with the order in which atoms appear in the calculation. These SMARTS strings can be read by a huge number of cheminformatics packages, such as RDKit, OpenEye, Open Babel, Jmol, CDK, etc.

These two pass-through string fields should be sufficient to enable a huge amount of QC-derived use of datasets stored in the QC JSON spec.

Example

A calculation for methanol might contain the following fields:

        {
            "SMILES": "CO",
            "SMARTS": "[C:1]([H:2])([H:3])([H:4])[O:5][H:6]"
        }

davidlmobley · 2018-02-02T22:13:26Z

Totally agree with this -- that would be tremendously useful and fix a huge number of problems we have as people who want to put things into QM packages and then use the output for non-QM things. Thanks, @jchodera !

ghutchis · 2018-03-29T20:05:58Z

Just saw this now - agreed that it would generally be useful to have SMILES/SMARTS as a pass-through to add identifiers and atom-maps from a connection-table view of the world.

In principle, QM programs can estimate this using bond order calculations, but I see the use case as a submitting script / workflow / GUI embedding the identifier for reading later.

dgasmith · 2018-03-30T15:20:56Z

From @ghutchis it was thought we might add an "identifier" section to the molecules, which expands the number of tags that we can associate with a given molecule. Other tags can be added, but these would be officially encoded.

"identifier" : {
   "name": "aspirin",
   "comment": "training set",
   "formula": "C9H8O4",
   "smiles": "O=C(C)Oc1ccccc1C(=O)O",
   "smarts": "...",
   "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
}

@jchodera Would this work for you, where we insert your definitions for the smiles/smarts patterns?

Ping @davidlmobley @bannanc

jchodera · 2018-03-30T15:43:08Z

I think some of these should be optional.

Name: which name? Are common or localized names allowed? Or do we require IUPAC names? This should likely be optional since there is no great way to make this canonical.
Formula could be useful for searching. What standard is used?
InChI keys add a whole new dimension of pain. Can we generate unique InChI keys? How? Are they really guaranteed to be unique?

cryos · 2018-03-30T15:48:43Z

I think there is value in having them, but agree that it would be preferable to make them optional.

ghutchis · 2018-03-30T16:03:55Z

I intended these as optional examples of identifiers - that is, there is likely a range of identifiers (SMILES, SMARTS, InChI, etc.). Certainly some programs (e.g., Open Babel) would write some of these.

My point was that most QC programs allow some sort of title or comment field, which was included in the schema - but those are just one example of an identifier.

Most QC programs also write a formula (in some format) in the output file.

dgasmith · 2018-03-30T16:31:04Z

Agreed, all of these would be optional. Things like name and comment would be up to users/programs and mostly free-for-all fields, as they are ill-defined.

For SMILES/SMARTS/InChI/formula, are these (or can they be) deterministic, and do all programs produce the same result? Without this we may need to attach provenance fields to them.

ghutchis · 2018-03-30T16:33:12Z

InChI is a standard. Formula can be standardized. The others are standards, but not canonical/unique. OTOH, I think this thread was indicating that the SMILES or SMARTS should match the atom order in the file.
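
One widely used convention that would make the formula field deterministic is Hill order: carbon first, hydrogen second, then the remaining elements alphabetically, or all elements alphabetically when there is no carbon. The following is a minimal sketch, assuming the formula is derived from the list of element symbols in the molecule. The function name is hypothetical, and the thread does not settle on this convention.

```python
from collections import Counter
from typing import Iterable


def hill_formula(symbols: Iterable[str]) -> str:
    """Build a Hill-order molecular formula from element symbols."""
    counts = Counter(symbols)
    if "C" in counts:
        # Carbon first, then hydrogen (if present), then the rest A-Z.
        rest = sorted(el for el in counts if el not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: every element, hydrogen included, in A-Z order.
        order = sorted(counts)
    # A count of 1 is left implicit, as is conventional.
    return "".join(
        el + (str(counts[el]) if counts[el] > 1 else "") for el in order
    )


aspirin_symbols = ["C"] * 9 + ["H"] * 8 + ["O"] * 4
print(hill_formula(aspirin_symbols))  # C9H8O4
```

Because the result depends only on the element counts, any program that follows the same ordering rule should produce the same string for the same molecule.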

wadejong · 2018-03-30T18:00:28Z

I agree with an earlier comment, there needs to be consistency (or no ambiguity) between the SMILES, SMARTS, InChI and InChIKey, as the rules are not the same in each case.

jchodera · 2018-03-30T21:31:05Z

It would be great if we could specify recommended, consistent standards if these fields are included, since this would maximize the potential that searching the database on these keys will return as many useful entries as possible.

dgasmith · 2018-03-30T22:52:53Z

We can certainly recommend programs and algorithms to generate these quantities, but requiring them might be difficult. Can someone write up a recommended way of computing these quantities to get the ball rolling?

davidlmobley · 2018-04-03T20:57:58Z

@dgasmith - computing them given what? I usually use the OpenEye tools so would tend to rely on that; can I assume the user would have those? If not we need to rope in someone with expertise. What tools can we assume? If not OpenEye, what about RDKit?

ghutchis · 2018-04-04T18:40:47Z

My view is that these are all optional. InChI and InChIKey are standardized - it doesn't matter the toolkit, they should be the same regardless.

As for SMILES and/or SMARTS, if they're supposed to match the atom order, I would think that OpenEye, RDKit, and Open Babel should all give the same SMILES, but might differ slightly on aromaticity.

cryos · 2018-04-05T15:50:04Z

InChI and InChIKey match if bonding information is preserved, but can still be an issue where the QM package strips all that and it needs to be perceived, especially when things move around and you throw in different approaches to perceive it. Agreed on them all being optional.

jchodera · 2018-04-05T15:56:12Z

I certainly think they should be optional, but what do people think about making at least one of these choices (canonical, isomeric, explicit-hydrogen SMILES or InChI) recommended (if appropriate to the calculation) so that researchers datamining QC databases can hope to make maximal use of the information? This would not be required, but simply encouraged to facilitate data re-use.

dgasmith · 2018-04-05T17:28:13Z

Can someone volunteer to try adding this to the schema? You would want to extend the Molecule definition found here.

ghutchis mentioned this issue Mar 30, 2018

Formalizing the Schema #37

Merged
