Formalizing the Schema #37

Merged
merged 19 commits into from
Mar 30, 2018

Conversation

dgasmith
Collaborator

@dgasmith dgasmith commented Mar 27, 2018

Description

To kick this off again, I have started creating a json-schema implementation based on the discussed schema, along with documentation via Sphinx and RTD. As a foreword, I should note that the implementation here is an attempt to build something concrete. Many things can be changed, but some sort of core needs to be formed to move forward.

This has three basic parts:

  • A small Python module to help organize the schema and collect previous versions.
  • The start of a set of examples that will also be used for testing.
  • An RTD-based website that is auto-generated from the current repo. I have started porting the current Markdown files to this website. This will be up once this PR is merged.

The presented schema aims to provide a very basic interface with a few notable points such as lack of unit support. Before continuing to build out this schema I wanted to get intermediate feedback about the current setup and general organization.

Questions

  • Is there a good way to produce HTML-based documentation of the schema automatically? I found several Python/Sphinx solutions, but they do not seem to work. I also like swagger.io for REST interfaces; however, it doesn't seem to match the requirements of this schema.

Status

  • Ready to go

.gitignore Outdated
@@ -99,3 +99,11 @@ ENV/

# mypy
.mypy_cache/

# Vim leftover

Not everyone uses vim, and this is not strictly related to the schema. Maybe put this in your private ignore list? (.git/info/exclude)

Collaborator Author

Sure, can do. I normally see these left in the .gitignore files; is there a reason to move them out?


IMO projects should only have entries relevant to their code. But of course it's not super important.

}
},
"masses": {
"description": "The mass of the molecule, canonical weights assumed if not given.",

The name is plural but the description has singular mass.

"The doubles portion of the MP2 correlation energy including same-spin and opposite-spin correlations."
}

mp_properties['mp2_total_correlation_energy'] = {

Would be useful to also have a total MP energy property. I know it's easy to sum up, but some people like to see the total energy.

Collaborator Author

Sure, missed this.

scf_properties = {}

scf_properties["scf_one_electron_energy"] = {
"description": "The one-electron energy contribution [H] to the total SCF energy.",
Contributor

Here and later on, what is [H]?

Collaborator Author

The core Hamiltonian. I don't really look at these descriptions as complete (they are not intended to be), but more as a general scaffold to build on.

"$ref": "#/definitions/provenance"
}
},
"required": ["symbols", "geometry"],

Is this actually enforced somehow?

Collaborator Author

If you run this through a json-schema validator, it raises an error like the following:

E           jsonschema.exceptions.ValidationError: 'symbols' is a required property
E           Failed validating 'required' in schema['properties']['molecule']:

I'm not sure how to require this at other levels.
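
On requiring keys at other levels: JSON Schema accepts a `required` list on any object-typed subschema, so nesting should work. A minimal sketch follows, using a hand-rolled checker that understands only `properties` and `required` as a stand-in for a real validator. The top-level `required` entry and the helper name `missing_required` are assumptions for illustration, not part of this PR:

```python
# Hypothetical miniature schema; only "symbols" and "geometry" come from the PR.
schema = {
    "type": "object",
    "properties": {
        "molecule": {
            "type": "object",
            "properties": {
                "symbols": {"type": "array"},
                "geometry": {"type": "array"},
            },
            "required": ["symbols", "geometry"],
        },
    },
    "required": ["molecule"],  # assumption: the top level requires a molecule
}


def missing_required(instance, schema, path="$"):
    """Return paths of required keys that are absent.

    Handles only "properties" and "required"; a real validator does far more.
    """
    missing = []
    if not isinstance(instance, dict):
        return missing
    for key in schema.get("required", []):
        if key not in instance:
            missing.append(f"{path}.{key}")
    # Recurse into sub-objects that are present and have their own subschema.
    for key, subschema in schema.get("properties", {}).items():
        if key in instance:
            missing.extend(missing_required(instance[key], subschema, f"{path}.{key}"))
    return missing


print(missing_required({"molecule": {"geometry": [0.0, 0.0, 0.0]}}, schema))
print(missing_required({}, schema))
```

The first call reports the missing nested key and the second reports the missing top-level key, which is the behavior a full validator would be expected to reproduce.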

How do we store large lists of objects (such as lists of atoms or bonds?)

1) As a table of values (these values do very well in HDF5 as well)
2) As a set of arrays
Collaborator

Should probably emphasize that 2) has been chosen.

On a personal note, I was using 1) in psi4 as closer to local layout. Switched over to 2) after LBNL mtg and found it perfectly easy to work in and much tidier to inspect.
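
To make the difference between the two layouts concrete, here is a small sketch that transposes a per-atom table (option 1) into per-field arrays (option 2) and back. The column meanings, field names, and numeric values are assumptions chosen for illustration:

```python
# Layout 1: one row per atom. Column meanings are assumed for illustration.
FIELDS = ("symbols", "atomic_numbers", "masses")
atom_table = [
    ["N", 7, 14.003],
    ["H", 1, 1.008],
    ["H", 1, 1.008],
]


def table_to_arrays(rows, fields=FIELDS):
    """Transpose per-atom rows into one array per field (layout 2)."""
    columns = zip(*rows)
    return {name: list(column) for name, column in zip(fields, columns)}


def arrays_to_table(arrays, fields=FIELDS):
    """Inverse transform: rebuild per-atom rows from per-field arrays."""
    return [list(row) for row in zip(*(arrays[name] for name in fields))]


molecule = table_to_arrays(atom_table)
print(molecule["symbols"])
```

With the array layout, each field can be read or validated on its own, while the table layout keeps all values for one atom together.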

* Repeat of input components
* Driver return - Return of the requested driver (energy/gradient/etc)
* Properties - Other properties/values constructed as by products of the computation
* Provenance - Code, computer, user information, actual settings used by the code (lots
Collaborator

Request that this be a list so that multiple programs can contribute provenance to a calc.
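
A rough sketch of what list-valued provenance could look like in practice. The entry keys `creator` and `version`, the program names, and the helper `add_provenance` are hypothetical:

```python
def add_provenance(record, creator, version):
    """Append one program's provenance entry, creating the list if needed."""
    entry = {"creator": creator, "version": version}
    record.setdefault("provenance", []).append(entry)
    return record


result = {}
add_provenance(result, "program_a", "1.0")  # e.g. the code that ran the SCF
add_provenance(result, "program_b", "0.3")  # e.g. a post-processing tool
print(result["provenance"])
```

Each program appends its own entry, so earlier contributions are never overwritten.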

"description": "The singles portion of the MP2 correlation energy. Zero except in ROHF."
}

mp_properties["mp2_double_energy"] = {
Collaborator

Should this and the previous be "singles" and "doubles"?

}
},
"geometry": {
"description": "The 3N XYZ coordinates of the atoms involved.",
Collaborator

Explicitly mention flat (3N,) rather than nested (N, 3), I think.
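
For readers who prefer the nested view, converting between the two shapes is cheap. A minimal sketch with made-up coordinates; the helper names `to_nested` and `to_flat` are hypothetical:

```python
def to_nested(flat):
    """Group a flat (3N,) coordinate list into N [x, y, z] triples."""
    if len(flat) % 3 != 0:
        raise ValueError("geometry length must be a multiple of 3")
    return [list(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def to_flat(nested):
    """Flatten N [x, y, z] triples back into a (3N,) list."""
    return [coordinate for xyz in nested for coordinate in xyz]


geometry = [0.0, 0.0, 0.0, 0.0, 0.0, 1.4]  # two atoms, made-up coordinates
nested = to_nested(geometry)
print(nested)
```

Stating the flat shape in the description lets a consumer run a length check like the one above before reshaping.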

scf_properties["scf_one_electron_energy"] = {
"description": "The one-electron energy contribution to the total SCF energy.",
"type": "number"
}
Collaborator

If this is meant to refer to Hartree--Fock, not HF-or-DFT, I suggest we be explicit now and use "HF".

Collaborator Author

What if we define that as simply \sum{D * H}? I worry that splitting HF and SCF into two parts could get much more complex.
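
Reading \sum{D * H} as an element-wise contraction of the density matrix D with the core Hamiltonian H, that is, the sum over mu and nu of D[mu][nu] * H[mu][nu], the definition could be sketched as below. The matrices hold made-up numbers and `one_electron_energy` is a hypothetical helper, not something in this PR:

```python
def one_electron_energy(density, core_hamiltonian):
    """Sum D[mu][nu] * H[mu][nu] over all matrix elements."""
    return sum(
        d * h
        for d_row, h_row in zip(density, core_hamiltonian)
        for d, h in zip(d_row, h_row)
    )


# Made-up 2x2 matrices, purely for illustration.
D = [[2.0, 0.0], [0.0, 0.0]]
H = [[-1.25, -0.5], [-0.5, -0.75]]

energy = one_electron_energy(D, H)
print(energy)  # 2.0 * -1.25 = -2.5; every other term is zero
```

Under this reading the definition does not depend on whether the density came from HF or DFT, which is the point of keeping the "scf" prefix.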

Collaborator

@ghutchis ghutchis left a comment

The other thing missing IMHO was Bob Hanson's discussion about having a required 'magic header' that indicates that this JSON is, in fact, a QC_JSON version 1.0.

I think that's useful for handling files 'in the wild' for parsers, but also to ensure we can gracefully upgrade the spec over time.

@@ -23,15 +23,15 @@ The following are optional fields and default values (option, more a list of pos
- `comment` (str) - Any additional comment one would attach to the molecule.
Collaborator

I think, based on #35 that it makes sense to have an "identifier" section, with comment, molecule name, formula, InChI, SMILES, etc. as optional. This would probably also include provenance and DOI.

Looking at my notes, I think these were grouped in the discussion.

Collaborator Author

Are you thinking of comment as a dictionary? I was considering moving up a few of those to top level optional fields.

Collaborator

No, I'm thinking that there's an obvious request for identifiers, and I think comment/title have usually been identifiers in QC codes.

So my suggestion is that there's an explicit identifier object - something like:

"identifier" : {
   "name": "aspirin",
   "comment": "training set",
   "formula": "C9H8O4",
   "smiles": "O=C(C)Oc1ccccc1C(=O)O",
   "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
}


**Translators:**
- `cclib <http://cclib.github.io>`_

Collaborator

- `openbabel <http://openbabel.org/>`_

- `Avogadro <https://avogadro.cc>`_
- `Molecular Design Toolkit <https://github.com/Autodesk/molecular-design-toolkit>`_
- `VTK <http://www.vtk.org>`_

Collaborator

- `Jmol / JSmol <http://jmol.org/>`_

['N', 7, 14.20, 0, [0.214,12.124,1.12], [0,0,0]],
...}

// 2) Storing fields as arrays: much more compact, but harder to read and edit
Collaborator

Perhaps it's harder to read/edit as a human, but not in a program. Please don't use this language. As @loriab mentioned, this was the choice at the workshop and there are a lot of benefits on the "writing code" side.

Collaborator Author

I was mostly porting over the current Markdown documentation to RST as an example, but this seems to be bringing up a bit of old discussion that is indeed solved. Should we convert the solved topics over to a simpler discussion of why the choice was made?

Collaborator

Yes, I think that's a good idea. As people implement, I don't think there should be ambiguity and I think a discussion of the choice is good.

(Incidentally, I'd suggest when the schema is done, that this be published in J. Cheminf.)


.. code:: python

// 1) Storing fields as tables: creates an mmCIF/PDB-like layout
Collaborator

Please no. Let's not even suggest something remotely like the mess of PDB files.

"description": "The name of the molecule.",
"type": "string"
},
"comment": {
Collaborator

Let's make these part of an object for "identifiers" to reflect feedback from #35

}
},
"fix_com": {
"description": "Whether to adjust to the molecule to the COM or not.",
Collaborator

COM? I assume you mean "center of mass" but let's be explicit.

"""
}

scf_properties["scf_dipole_moment"] = {
Collaborator

Does this have to be 'scf' dipole moment? Maybe I'm pedantic, but what if I run CCSD(T) - most programs don't print out two different dipole moments IIRC.

Collaborator Author

I think it's good to clarify if someone does, say, CCSD bond orders. It can become unclear whether you stored the SCF or CCSD dipoles.


Won't that lead to a proliferation of new objects? Maybe there should be a properties object:

"property": {
  "type": "dipole moment",
  "method": "SCF",
  "value": ...
}
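
The two designs carry the same information, which a short sketch can show by mapping a list of property objects onto method-prefixed field names such as `scf_dipole_moment`. The dipole values are made up, and `flatten_properties` and its key-building rule are assumptions for illustration:

```python
def flatten_properties(properties):
    """Map property objects onto method-prefixed keys, e.g. scf_dipole_moment."""
    flat = {}
    for prop in properties:
        key = f'{prop["method"].lower()}_{prop["type"].replace(" ", "_")}'
        flat[key] = prop["value"]
    return flat


# Made-up values; the object shape follows the suggestion above.
properties = [
    {"type": "dipole moment", "method": "SCF", "value": [0.0, 0.0, 0.73]},
    {"type": "dipole moment", "method": "CCSD", "value": [0.0, 0.0, 0.70]},
]

flat = flatten_properties(properties)
print(sorted(flat))
```

The object form adds a new method without adding a new schema field, while the prefixed form keeps every field individually documented and validated.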

Collaborator Author

It doesn't seem like too many more fields, as we will have energy definitions for these slots anyway. I might be missing your point on the object side of things, however.

@dgasmith
Collaborator Author

For the magic header, poking around, there doesn't seem to be a full standard. However, from the few examples I found, the following seems popular:

{
  "name": "Schema name",
  "description": "A quick description",
  "url": "http://schema_host.org/schemas/v0.1/something.schema",
  "version": "v1.0"
}

Should this be spun off into its own issue however?

@ghutchis
Collaborator

I'd probably go with name/version/url/description in that order, but implementations will go with alphabetical order.

Can we drop the description part? Then you'd have:

{
  "name": "QC_JSON",
  "url": "http://schema_host.org/schemas/v0.1/something.schema",
  "version": "v1.0"
}

That seems suitably unique for file/magic sensing.
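
For file sniffing, a parser could check the header before doing anything else. A minimal sketch, assuming the `name` and `version` keys sit at the top level of the document and that versions look like `v<major>.<minor>`; `looks_like_qc_json` and `max_major` are hypothetical names:

```python
import json


def looks_like_qc_json(text, max_major=1):
    """Return True if text parses as JSON and carries a supported QC_JSON header."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or data.get("name") != "QC_JSON":
        return False
    version = data.get("version")
    if not isinstance(version, str) or not version.startswith("v"):
        return False
    # Compare only the major version so minor revisions stay readable.
    major = version[1:].split(".")[0]
    return major.isdigit() and int(major) <= max_major


print(looks_like_qc_json('{"name": "QC_JSON", "version": "v1.0"}'))
print(looks_like_qc_json('{"name": "something else"}'))
```

Checking the major version separately is one way to let the spec be upgraded gracefully: a reader can accept newer minor revisions and refuse files from a major version it does not understand.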

@ghutchis
Collaborator

IMHO, this is a great start and I'm in favor of merging this and starting new issues / pull requests.

@dgasmith
Collaborator Author

I'm in favor of merging as well. Certainly a lot left to do, but we can break the current topics off into issues.

@cryos
Collaborator

cryos commented Mar 30, 2018

This is looking great, and I agree that merging and then breaking off into smaller PRs/discussions would be beneficial.

@dgasmith
Collaborator Author

Going to pull this in and set up the docs. Please spin off issues or PRs as needed to fix current issues.

@dgasmith dgasmith merged commit bf61e6b into MolSSI:master Mar 30, 2018