Use a file for ESPEI inputs. #28

Merged: 20 commits merged into master from input-file on Sep 18, 2017

Conversation

@bocklund (Member) commented on Sep 18, 2017

Major features:

  • ESPEI deprecates command line input arguments in favor of a YAML (or JSON; other formats are possible) input file.
  • Input files are validated against a schema using cerberus.
  • Support for setting the number of chains per parameter and the standard deviation of the chains.
  • Includes tests for several different possible runs.

Validation schema

Almost all of the constraints are handled by the following YAML file (really, anything that can be converted to a dict):

# core run settings
system: # phase models and input data
  type: dict
  schema:
    phase_models: # describes the CALPHAD models for the phases
      type: string
      required: True
      regex: '.*\.json$'
    datasets: # path to datasets. Defaults to current directory.
      type: string
      required: True

output:
  type: dict
  default: {}
  schema:
    verbosity: # integer verbosity level 0 | 1 | 2, where 2 is most verbose.
      type: integer
      min: 0
      max: 2
      default: 0
      required: True
    output_db:
      type: string
      default: out.tdb
    tracefile: # name of the file containing the mcmc chain array
      type: string
      default: chain.npy
      regex: '.*\.npy$'
    probfile: # name of the file containing the mcmc ln probability array
      type: string
      default: lnprob.npy
      regex: '.*\.npy$'

## if present, will do a single phase fitting
generate_parameters:
  type: dict
  schema:
    excess_model:
      type: string
      required: True
      regex: 'linear'
    ref_state:
      type: string
      required: True
      regex: 'SGTE91'

## if present, will run mcmc fitting
## you must specify some kind of input for the parameters.
## Parameters can come from
##   1. a preceding generate_parameters step
##   2. by generating chains from a previous input_db
##   3. by using chains from a restart_chain for phases in an input_db
mcmc:
  type: dict
  oneof_dependencies:
    - 'mcmc.input_db'
    - 'generate_parameters'
  schema:
    mcmc_steps:
      type: integer
      min: 1
      required: True
    mcmc_save_interval:
      type: integer
      default: 20
      min: 1
      required: True
    scheduler: # scheduler to use for parallelization
      type: string
      default: dask # dask | MPIPool
      regex: 'dask|MPIPool'
      required: True
    input_db: # TDB file used to start the mcmc run
      type: string
    restart_chain: # restart the mcmc fitting from a previous calculation
      type: string
      dependencies: input_db
      regex: '.*\.npy$'
    chains_per_parameter: # even integer multiple of the number of chains corresponding to one parameter
      type: integer
      iseven: True
      min: 2
      allof:
        - required: True
        - excludes: restart_chain
    chain_std_deviation: # fraction of a parameter for the standard deviation in the walkers
      min: 0
      allof:
        - required: True
        - excludes: restart_chain

This allows us to do almost all of the validation, including:

  • checking for parameter conflicts
  • handling enumeration options (e.g. choosing either 'linear' or 'exponential' models, validated with regex)
  • checking for filetype compatibility (again, with regex)
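The schema above is enforced by cerberus, but the kinds of checks it performs can be sketched in plain Python. The following is an illustrative stand-in, not ESPEI's actual validation code; the function name `check_run_settings` and the error strings are hypothetical:

```python
import re

def check_run_settings(settings):
    """Collect validation errors for a run-settings dict.
    A simplified stand-in for the cerberus-based schema above (hypothetical)."""
    errors = []
    mcmc = settings.get('mcmc', {})
    # enumeration options, validated with a regex (as in the schema)
    scheduler = mcmc.get('scheduler', 'dask')
    if not re.fullmatch(r'dask|MPIPool', scheduler):
        errors.append("scheduler must be 'dask' or 'MPIPool'")
    # filetype compatibility, again via regex
    phase_models = settings.get('system', {}).get('phase_models', '')
    if not re.search(r'\.json$', phase_models):
        errors.append('phase_models must be a .json file')
    # parameter conflicts: restart_chain excludes chains_per_parameter
    if 'restart_chain' in mcmc and 'chains_per_parameter' in mcmc:
        errors.append('restart_chain excludes chains_per_parameter')
    return errors

# example: a configuration with a parameter conflict
bad = {'system': {'phase_models': 'my-phases.json'},
       'mcmc': {'restart_chain': 'chain.npy', 'chains_per_parameter': 2}}
print(check_run_settings(bad))
```

The real schema also handles defaults and nested `allof`/`oneof_dependencies` rules, which cerberus expresses declaratively rather than as hand-written conditionals.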

Remaining

Still TODO (open new issues):

  • Number of dask cores to run with
  • pycalphad point density
  • (when implemented) pycalphad solver convergence options? Allows for tunable accuracy/speed tradeoff.

- Make output have a default so the dict is filled
- Add file regexes
- Make the model types required in generate_parameters required, as well
  as mcmc steps in mcmc. This ensures that users have something to enter
  if they use these keys.
…settings

Still need to update main to properly use get_run_settings rather than argparse
@bocklund (Member, Author) commented on Sep 18, 2017

I should mention that the minimal required YAML input files are as follows.

The goals, as evidenced in the schema, are sensible defaults with everything customizable.

One last thing I really like about this schema approach is that we can keep deprecated keys in schemas and use the library to rename them and provide compatibility and nice error messages so the schema can evolve organically.
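A minimal sketch of that deprecated-key idea (cerberus supports this kind of renaming during normalization; the old key name below is a made-up example, not a real ESPEI key):

```python
import warnings

# hypothetical mapping of deprecated key names to their replacements
DEPRECATED_KEYS = {'mcmc_steps_save_interval': 'mcmc_save_interval'}

def rename_deprecated(settings):
    """Return a copy of settings with deprecated keys renamed,
    warning the user so that old input files keep working."""
    renamed = {}
    for key, value in settings.items():
        new_key = DEPRECATED_KEYS.get(key, key)
        if new_key != key:
            warnings.warn("'{}' is deprecated; use '{}'".format(key, new_key))
        renamed[new_key] = value
    return renamed

print(rename_deprecated({'mcmc_steps_save_interval': 20, 'mcmc_steps': 1000}))
```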

Parameter selection only

system:
  phase_models: my-phases.json
  datasets: my-input-data
generate_parameters:
  excess_model: linear
  ref_state: SGTE91

MCMC only

system:
  phase_models: my-phases.json
  datasets: my-input-data
mcmc:
  mcmc_steps: 1000
  input_db: my_tdb.tdb

MCMC from a previous run

system:
  phase_models: my-phases.json
  datasets: my-input-data
mcmc:
  mcmc_steps: 1000
  input_db: my_tdb.tdb
  restart_chain: my_previous_chain.npy

Full run (from scratch)

system:
  phase_models: my-phases.json
  datasets: my-input-data
generate_parameters:
  excess_model: linear
  ref_state: SGTE91
mcmc:
  mcmc_steps: 1000

system_settings = run_settings['system']
output_settings = run_settings['output']
generate_parameters_settings = run_settings.get('generate_parameters')
mcmc_settings = run_settings.get('mcmc')
Collaborator:
I'm thinking everything roughly after here should be broken out into a separate public function. It should be possible to construct run_settings programmatically (without writing an input file to disk), and then call a function which will handle the error checking and constructing the call to fit().

Member Author:
run_settings comes from the public get_run_settings function that takes a dict and validates it against a schema. In main we determine whether the input file is a YAML or JSON, then deserialize to dict and call get_run_settings.

Is that what you're thinking?

As far as constructing the call to fit, the next step is to factor out the call to fit into two functions, one for generating parameters and one for fitting them with mcmc. Then, ideally, we just pass in the mcmc_settings dict as **kwargs and everything is handled internally.
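That proposed split might look like the following sketch; the function names, signatures, and placeholder bodies are hypothetical and not the actual ESPEI API at the time:

```python
def generate_parameters(phase_models, datasets, excess_model, ref_state):
    """Single-phase parameter generation step (placeholder body)."""
    return {'db': 'generated-db', 'excess_model': excess_model,
            'ref_state': ref_state}

def mcmc_fit(dbf, datasets, mcmc_steps, mcmc_save_interval=20,
             scheduler='dask', **kwargs):
    """MCMC fitting step (placeholder body); extra validated
    settings flow through **kwargs."""
    return {'db': dbf, 'steps': mcmc_steps, 'scheduler': scheduler}

# main() would then just unpack the validated settings dicts:
gp_settings = {'excess_model': 'linear', 'ref_state': 'SGTE91'}
mcmc_settings = {'mcmc_steps': 1000}
dbf = generate_parameters('my-phases.json', 'my-input-data', **gp_settings)['db']
result = mcmc_fit(dbf, 'my-input-data', **mcmc_settings)
```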

Collaborator:
I want to be able to pass my own run_settings to the fitting routine without it needing to know anything about an input file. I'm thinking about programmatic generation of runs with, e.g., different excess models, different point densities (and different output filenames). It would be convenient to just be able to write a template run settings file, load it, and then modify a dict inside a for loop to kick off all the runs.
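That template-and-loop workflow might look like the following sketch, assuming a hypothetical `run_espei(settings)` entry point that accepts an already-validated dict:

```python
import copy

# a template settings dict, as would be loaded from a YAML file
template = {
    'system': {'phase_models': 'my-phases.json', 'datasets': 'my-input-data'},
    'generate_parameters': {'excess_model': 'linear', 'ref_state': 'SGTE91'},
    'mcmc': {'mcmc_steps': 1000},
}

def run_espei(settings):
    """Placeholder for a public fitting entry point taking a settings dict."""
    return settings['output']['output_db']

outputs = []
for model in ('linear', 'exponential'):
    run = copy.deepcopy(template)          # don't mutate the shared template
    run['generate_parameters']['excess_model'] = model
    run['output'] = {'output_db': 'out-{}.tdb'.format(model)}
    outputs.append(run_espei(run))

print(outputs)  # ['out-linear.tdb', 'out-exponential.tdb']
```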

Member Author:
Yeah I think we are in agreement.

Member Author:
Right now you could reasonably enumerate all the keyword arguments you want to pass to fit as a dict. The refactor I proposed would just make it more simple to do that. Ultimately, run_espei.py should just be a way to run ESPEI from the command line with an input file, but you should in principle be able to do without it if you want to work from a custom script, interactively, with a notebook, or whatever.

For the purposes of this PR, we are just switching from a command line argument interface to a file interface to the fit command. What you've brought up is worth opening an issue for, but since this doesn't introduce any regressions on that front, I think it's out of scope here.

@bocklund merged commit b1272af into master on Sep 18, 2017
@richardotis (Collaborator):

@bocklund By the way, the examples in #28 (comment) would make an excellent addition to the docs!

@bocklund deleted the input-file branch on October 31, 2017