Use a file for ESPEI inputs. #28

Merged: 20 commits merged into master from input-file on Sep 18, 2017

Conversation

@bocklund (Member) commented on Sep 18, 2017

Major features:

  • ESPEI deprecates command line input arguments in favor of a YAML (or JSON; other formats are possible) input file.
  • Input files are validated against a schema using cerberus.
  • Support for setting the number of chains per parameter and the standard deviation of the chains.
  • Includes tests for several different possible runs.

Validation schema

Almost all of the constraints are handled by the following YAML file (really, anything that can be converted to a dict):

# core run settings
system: # phase models and input data
  type: dict
  schema:
    phase_models: # describes the CALPHAD models for the phases
      type: string
      required: True
      regex: '.*\.json$'
    datasets: # path to datasets. Defaults to current directory.
      type: string
      required: True

output:
  type: dict
  default: {}
  schema:
    verbosity: # integer verbosity level 0 | 1 | 2, where 2 is most verbose.
      type: integer
      min: 0
      max: 2
      default: 0
      required: True
    output_db:
      type: string
      default: out.tdb
    tracefile: # name of the file containing the mcmc chain array
      type: string
      default: chain.npy
      regex: '.*\.npy$'
    probfile: # name of the file containing the mcmc ln probability array
      type: string
      default: lnprob.npy
      regex: '.*\.npy$'

## if present, will do a single phase fitting
generate_parameters:
  type: dict
  schema:
    excess_model:
      type: string
      required: True
      regex: 'linear'
    ref_state:
      type: string
      required: True
      regex: 'SGTE91'

## if present, will run mcmc fitting
## you must specify some kind of input for the parameters.
## Parameters can come from
##   1. a preceding generate_parameters step
##   2. by generating chains from a previous input_db
##   3. by using chains from a restart_chain for phases in an input_db
mcmc:
  type: dict
  oneof_dependencies:
    - 'mcmc.input_db'
    - 'generate_parameters'
  schema:
    mcmc_steps:
      type: integer
      min: 1
      required: True
    mcmc_save_interval:
      type: integer
      default: 20
      min: 1
      required: True
    scheduler: # scheduler to use for parallelization
      type: string
      default: dask # dask | MPIPool
      regex: 'dask|MPIPool'
      required: True
    input_db: # TDB file used to start the mcmc run
      type: string
    restart_chain: # restart the mcmc fitting from a previous calculation
      type: string
      dependencies: input_db
      regex: '.*\.npy$'
    chains_per_parameter: # even integer multiple of the number of chains corresponding to one parameter
      type: integer
      iseven: True
      min: 2
      allof:
        - required: True
        - excludes: restart_chain
    chain_std_deviation: # fraction of a parameter for the standard deviation in the walkers
      min: 0
      allof:
        - required: True
        - excludes: restart_chain

This allows us to do almost all of the validation, including:

  • checking for parameter conflicts
  • handling enumeration options (e.g. choosing either 'linear' or 'exponential' models, validated with regex)
  • checking for filetype compatibility (again, with regex)
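The schema above is enforced by cerberus, but the kinds of checks it performs can be sketched in plain Python. The following is an illustrative stand-in, not ESPEI's actual validation code; the function name `check_run_settings` and the error strings are hypothetical:

```python
import re

def check_run_settings(settings):
    """Collect validation errors for a run-settings dict.
    A simplified stand-in for the cerberus-based schema above (hypothetical)."""
    errors = []
    mcmc = settings.get('mcmc', {})
    # enumeration options, validated with a regex (as in the schema)
    scheduler = mcmc.get('scheduler', 'dask')
    if not re.fullmatch(r'dask|MPIPool', scheduler):
        errors.append("scheduler must be 'dask' or 'MPIPool'")
    # filetype compatibility, again via regex
    phase_models = settings.get('system', {}).get('phase_models', '')
    if not re.search(r'\.json$', phase_models):
        errors.append('phase_models must be a .json file')
    # parameter conflicts: restart_chain excludes chains_per_parameter
    if 'restart_chain' in mcmc and 'chains_per_parameter' in mcmc:
        errors.append('restart_chain excludes chains_per_parameter')
    return errors

# example: a configuration with a parameter conflict
bad = {'system': {'phase_models': 'my-phases.json'},
       'mcmc': {'restart_chain': 'chain.npy', 'chains_per_parameter': 2}}
print(check_run_settings(bad))
```

The real schema also handles defaults and nested `allof`/`oneof_dependencies` rules, which cerberus expresses declaratively rather than as hand-written conditionals.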

Remaining

Still TODO (open new issues):

  • Number of dask cores to run with
  • pycalphad point density
  • (when implemented) pycalphad solver convergence options? Allows for tunable accuracy/speed tradeoff.

- Make output have a default so the dict is filled
- Add file regexes
- Make the model types required in generate_parameters required, as well
  as mcmc steps in mcmc. This ensures that users have something to enter
  if they use these keys.
…settings

Still need to update main to properly use get_run_settings rather than argparse
@bocklund (Member, Author) commented on Sep 18, 2017

I should mention that the minimal required YAML input files are as follows.

The goals, as evidenced in the schema, are sensible defaults with everything customizable.

One last thing I really like about this schema approach is that we can keep deprecated keys in schemas and use the library to rename them and provide compatibility and nice error messages so the schema can evolve organically.
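A minimal sketch of that deprecated-key idea (cerberus supports this kind of renaming during normalization; the old key name below is a made-up example, not a real ESPEI key):

```python
import warnings

# hypothetical mapping of deprecated key names to their replacements
DEPRECATED_KEYS = {'mcmc_steps_save_interval': 'mcmc_save_interval'}

def rename_deprecated(settings):
    """Return a copy of settings with deprecated keys renamed,
    warning the user so that old input files keep working."""
    renamed = {}
    for key, value in settings.items():
        new_key = DEPRECATED_KEYS.get(key, key)
        if new_key != key:
            warnings.warn("'{}' is deprecated; use '{}'".format(key, new_key))
        renamed[new_key] = value
    return renamed

print(rename_deprecated({'mcmc_steps_save_interval': 20, 'mcmc_steps': 1000}))
```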

Parameter selection only

system:
  phase_models: my-phases.json
  datasets: my-input-data
generate_parameters:
  excess_model: linear
  ref_state: SGTE91

MCMC only

system:
  phase_models: my-phases.json
  datasets: my-input-data
mcmc:
  mcmc_steps: 1000
  input_db: my_tdb.tdb

MCMC from a previous run

system:
  phase_models: my-phases.json
  datasets: my-input-data
mcmc:
  mcmc_steps: 1000
  input_db: my_tdb.tdb
  restart_chain: my_previous_chain.npy

Full run (from scratch)

system:
  phase_models: my-phases.json
  datasets: my-input-data
generate_parameters:
  excess_model: linear
  ref_state: SGTE91
mcmc:
  mcmc_steps: 1000

system_settings = run_settings['system']
output_settings = run_settings['output']
generate_parameters_settings = run_settings.get('generate_parameters')
mcmc_settings = run_settings.get('mcmc')
Collaborator:
I'm thinking everything roughly after here should be broken out into a separate public function. It should be possible to construct run_settings programmatically (without writing an input file to disk), and then call a function which will handle the error checking and constructing the call to fit().

Member Author:
run_settings comes from the public get_run_settings function that takes a dict and validates it against a schema. In main we determine whether the input file is a YAML or JSON, then deserialize to dict and call get_run_settings.

Is that what you're thinking?

As far as constructing the call to fit, the next step is to factor out the call to fit into two functions, one for generating parameters and one for fitting them with mcmc. Then, ideally, we just pass in the mcmc_settings dict as **kwargs and everything is handled internally.
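That proposed split might look like the following sketch; the function names, signatures, and placeholder bodies are hypothetical and not the actual ESPEI API at the time:

```python
def generate_parameters(phase_models, datasets, excess_model, ref_state):
    """Single-phase parameter generation step (placeholder body)."""
    return {'db': 'generated-db', 'excess_model': excess_model,
            'ref_state': ref_state}

def mcmc_fit(dbf, datasets, mcmc_steps, mcmc_save_interval=20,
             scheduler='dask', **kwargs):
    """MCMC fitting step (placeholder body); extra validated
    settings flow through **kwargs."""
    return {'db': dbf, 'steps': mcmc_steps, 'scheduler': scheduler}

# main() would then just unpack the validated settings dicts:
gp_settings = {'excess_model': 'linear', 'ref_state': 'SGTE91'}
mcmc_settings = {'mcmc_steps': 1000}
dbf = generate_parameters('my-phases.json', 'my-input-data', **gp_settings)['db']
result = mcmc_fit(dbf, 'my-input-data', **mcmc_settings)
```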

Collaborator:
I want to be able to pass my own run_settings to the fitting routine without it needing to know anything about an input file. I'm thinking about programmatic generation of runs with, e.g., different excess models, different point densities (and different output filenames). It would be convenient to just be able to write a template run settings file, load it, and then modify a dict inside a for loop to kick off all the runs.
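That template-and-loop workflow might look like the following sketch, assuming a hypothetical `run_espei(settings)` entry point that accepts an already-validated dict:

```python
import copy

# a template settings dict, as would be loaded from a YAML file
template = {
    'system': {'phase_models': 'my-phases.json', 'datasets': 'my-input-data'},
    'generate_parameters': {'excess_model': 'linear', 'ref_state': 'SGTE91'},
    'mcmc': {'mcmc_steps': 1000},
}

def run_espei(settings):
    """Placeholder for a public fitting entry point taking a settings dict."""
    return settings['output']['output_db']

outputs = []
for model in ('linear', 'exponential'):
    run = copy.deepcopy(template)          # don't mutate the shared template
    run['generate_parameters']['excess_model'] = model
    run['output'] = {'output_db': 'out-{}.tdb'.format(model)}
    outputs.append(run_espei(run))

print(outputs)  # ['out-linear.tdb', 'out-exponential.tdb']
```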

Member Author:
Yeah I think we are in agreement.

Member Author:
Right now you could reasonably enumerate all the keyword arguments you want to pass to fit as a dict. The refactor I proposed would just make it more simple to do that. Ultimately, run_espei.py should just be a way to run ESPEI from the command line with an input file, but you should in principle be able to do without it if you want to work from a custom script, interactively, with a notebook, or whatever.

For the purposes of this PR, we are just switching from a command line argument interface to a file interface to the fit command. What you've brought up is worth opening an issue for, but since this doesn't introduce any regressions on that front, I think it's out of scope here.

@bocklund merged commit b1272af into master on Sep 18, 2017
@richardotis (Collaborator):

@bocklund By the way, the examples in #28 (comment) would make an excellent addition to the docs!

@bocklund deleted the input-file branch on October 31, 2017