Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Input parsing without Getkw #88

Open
robertodr opened this issue Aug 17, 2017 · 4 comments
Open

[Discussion] Input parsing without Getkw #88

robertodr opened this issue Aug 17, 2017 · 4 comments

Comments

@robertodr
Copy link
Member

This thread is a Request For Comments on how to rewrite input parsing.
@ilfreddy @arnfinn @bast Please comment on any of the following sections. @loriab I would also appreciate your input on this matter, if you have the time.

Goals to achieve

  • Make it easier for hosts to integrate with the library.
  • Remove the dependency on Getkw.
  • Bottom-up: reduce the number of steps a developer has to go through to introduce a new parameter into any of the core classes.
  • Top-down: reduce the distance an input parameter has to travel before it can get used in any of the core classes.
  • Automatize documentation of the input format.

Current situation

Input can go through two, mutually exclusive, routes:

  1. An extra input file is parsed by a Python script and then read C++ side. This is the Getkw way of doing things. Python-side, options left blank are filled with a default. C++-side there's the Input class that contains all information read in. In the class, structs with input for specific classes (i.e. for the cavity, the Green's functions, the solvers etc.) are generated and passed onto the factory functions for object creation (look at the flow of the code in Meddle.cpp)
  2. The host passes a struct with input parameters set, usually from its own input parser. The input so passed is only a subset of the possible input to PCMSolver. This makes the second route not usable for most (if not all of) the recent functionality introduced.

Drawbacks

  • There are basically two ways of providing input that need to be kept in sync and this means checking that one route doesn't override sensible defaults that would be set by the other. Needless to say, this is not very easy to do.
  • None of the current host programs is particularly thrilled about introducing an extra layer of input parsing. The best we can currently do is piggybacking[^1] on the host input file, as done in Psi4.
    With Psi4 this is relatively easy: it's just integrating one small Python module into other Python modules. Plus, the look-and-feel of Psi4 and PCMSolver input are similar enough not to disconcert the user. Piggybacking on e.g. DALTON's input format would be harder to do.
  • Documentation of the input is done completely by hand. You can imagine how easy it is to miss stuff.
  • We depend on Getkw, which is basically abandonware. Getkw is not a lightweight dependency. Getkw depends on:
    1. boost::any. Soon to be banned (together with all of Boost)
    2. C++ exceptions. Banned.
    3. Run-time type identification. Banned.
    4. Python availability at runtime. This is mostly fine, but surprising for some hosts.
  • There a number of issues related, closely and less closely, to input reading. These all take a very long time to address, because it's boring stuff, but also because the current structure has become too complex:
    1. Clean-up references to yet-to-be-released functionality #75
    2. Input reading #52
    3. Input reading using YAML #50
    4. pcmsolver/__init__.py #40
    5. GePolCavity printer might not work properly for explicit lists of spheres #27

Suggested solution

The suggested solution is to remove Getkw altogether:

  1. The Input object would go away for good.
  2. The Python script would go away, solving pcmsolver/__init__.py #40

and move towards a data-passing-based solution, i.e. extending current route number 2 to cover all of the input needed by PCMSolver. This is a top-down perspective:

  1. The pcmsolver_new API function would accept one single parameter, a struct called PCMSolverInput that collects all host program input to the module, in atomic units:
    • Molecular geometry (number of nuclei, charges, coordinates, symmetry generators)
    • Host writer, i.e the function the module will use to write its output to the host program output file.
    • Cavity generator input parameters
    • Green's function input parameters
    • Solver input parameters
  2. The pcmsolver_new function would in turn invoke the constructor of the Meddle object.[^2] Instead of generating a whole Input object from the passed PCMSolverInput struct, it would unpack the struct (pattern matching of sorts) into smaller structs to feed to the factory function generating the cavity, Green's function and so forth. This touches on Input reading #52
  3. From this point on, it's business as usual.
  4. To run the module standalone we would need to rewrite a wrapper for reading input from file. This could become difficult, but I think the correct strategy would then be to use a standard format, like YAML Input reading using YAML #50

Why/How would this automatize documentation of input?
Because we could attach Doxygen documentation strings to the various fields on the PCMSolverInput struct.

Challenges

  1. The input management for some host programs would have to be rewritten from scratch. I think the benefits would outweigh the annoyance though.
  2. Getting interoperability of the C struct and the Fortran 90 user defined type right. The struct will need to have strings and array fields.

Notes

[^1] The input to PCMSolver is embedded in the host input and unpacked into a different input file in the scratch directory at run time. The PCMSolver Python script then does its own parsing.
[^2] The Meddle class is the one responsible for routing the computation to the correct internal objects. The C++ API, if you wish.

@ilfreddy
Copy link
Collaborator

The input file is "touched" when the program is running. @robertodr said is because everything is made uppercase. Still that should happen in the cache and not directly to the file on disk. This is (mildly) annoying when one is performing many tests with small modifications to the same input, because the editor complains the file has been modified.
Feature request: make sure the original input file is not touched (an intermediate uppercase version could be created if need be).

@robertodr
Copy link
Member Author

@ilfreddy I've made a separate issue #91 out of your request. Let's try to limit this thread for comments on input parsing 😉

@loriab
Copy link
Contributor

loriab commented Aug 18, 2017

Would you want to consider PCMSolver only taking JSON/py-dict input, with QC progs responsible for generating it from their native options parsing (perhaps this is "piggypacking"). Then perhaps PCMSolver additionally does a light str --> json or argparse --> json utility so it can be run independently.

Related discussion at https://github.com/MolSSI/QC_JSON_Schema , though admittedly Mol-centered at present

@robertodr
Copy link
Member Author

Thanks @loriab. Admittedly, I'd prefer YAML (less curly braces around 😺), but anything that is a standard format and would relieve me from doing most of the work in defining it, is very welcome.

These are the flows of the input I am envisioning:

  1. QC program needs to add a new input section for PCMSolver, with input parameters whose names and types are defined by the given version of the API.
  2. QC program reads input and dumps the PCMSolver section into a prescribed format, let's say JSON. Notice that the QC program does not validate the input, but simply translates it to a standard format.
    This can happen in one of two ways:
    1. Disk-based PCMSolver adds nhomann/json as dependency and exposes a function pcmsolver_write_input_file() that accepts the (presumably long) list of input parameters (all plain-old data structures - PODs) read by the QC program. This function would validate the input and write it to file in the correct format. A call to pcmsolver_new() would then bear no arguments and look for the already validated input file in the current directory.
    2. Software-based PCMSolver adds nhomann/json as dependency. I wrap it with a C API, in turn wrapped by a Fortran API. QC program calls pcmsolver_generate_input_json() that accepts the (presumably long) list of input parameters read by the QC program. This function would validate the input and return the JSON data structure for the input. Eventually, QC program calls pcmsolver_new(json_input) and gets a handle to a PCM computation.

QC program pseudocode for route i.

bool success = pcmsolver_write_input_file(/* PCM input parameters as PODs*/);
if (success) {
   pcmsolver_context * ctx = pcmsolver_new();
} else {
  // ABORT 
}

QC program pseudocode for route ii.

JSON * json_input = pcmsolver_generate_input_json(/* PCM input parameters as PODs */);
pcmsolver_context * ctx = pcmsolver_new(json_input);

Route ii. looks more elegant, but it will require more work in order to maintain the C and Fortran wrappers to the C++ JSON library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants