Exabyte Parser (ExaParser)
Exabyte Parser is a Python package that extracts materials modeling data (e.g. Density Functional Theory, Molecular Dynamics) stored on disk and converts it to the ESSE/EDC format.
- Extract structural information and material properties from simulation data
- Serialize extracted information according to ESSE/EDC
- Store serialized data on disk or remote databases
- Support for multiple simulation engines, including:
ExaParser can be installed as below.
Install git-lfs in order to pull the files stored on Git LFS.
git clone git@github.com:Exabyte-io/exaprser.git
pip install virtualenv
Create virtual environment and install required packages:
cd exaprser
virtualenv venv
source venv/bin/activate
export GIT_LFS_SKIP_SMUDGE=1
pip install -r requirements.txt
Open config and adjust parameters as necessary. The most important ones are listed below.
Add the data handler for the Exabyte platform to the data_handlers parameter list (comma-separated), if not already present. This enables uploading the data to an Exabyte.io account.
- New users can register here to obtain an Exabyte.io account.
- See RESTful API Documentation to learn how to obtain authentication parameters.
Adjust the workflow_template_name parameter in case a different template should be used.
Adjust the properties parameter to select the properties to extract; the parser attempts to extract every property listed.
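The sketch below illustrates how the three parameters named above could be read and checked before running the parser. It assumes an INI-style config file with a section called global; the section name, the handler names, the template file name, and the property names are all placeholders chosen for illustration, not values confirmed by this README.

```python
import configparser

# Hypothetical config contents. Only the three option names come from the
# README; the section name and every value are illustrative placeholders.
EXAMPLE_CONFIG = """
[global]
data_handlers = placeholder_stdout, placeholder_disk
workflow_template_name = placeholder_template.json
properties = total_energy, pressure
"""


def split_list(value):
    """Split a comma-separated option into a list of trimmed, non-empty names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(text, section="global"):
    """Parse the config text and return the three options as Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = parser[section]
    return {
        "data_handlers": split_list(options.get("data_handlers", fallback="")),
        "workflow_template_name": options.get("workflow_template_name", fallback=""),
        "properties": split_list(options.get("properties", fallback="")),
    }


def ensure_handler(handlers, name):
    """Return the handler list with `name` appended only if it is missing."""
    return handlers if name in handlers else handlers + [name]


settings = load_settings(EXAMPLE_CONFIG)
# Mirrors the "if not already present" instruction for data_handlers.
with_remote = ensure_handler(settings["data_handlers"], "placeholder_remote")
print(settings)
print(with_remote)
```

Keeping the list options as Python lists makes the "add if not already present" check a simple membership test instead of string editing.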
Run the commands below to extract the data.
source venv/bin/activate
./bin/exaparser -w PATH_TO_JOB_WORKING_DIRECTORY
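When several job directories need to be parsed, the same command can be driven from a small script. The following is a minimal sketch that assumes it is run from the repository root with the virtual environment already activated; the helper names build_command and parse_jobs are hypothetical, and only the ./bin/exaparser -w invocation comes from the instructions above.

```python
import subprocess


def build_command(work_dir, executable="./bin/exaparser"):
    """Build the argument list for one parser run (the -w flag is from the README)."""
    return [executable, "-w", str(work_dir)]


def parse_jobs(work_dirs, dry_run=True):
    """Run the parser once per job directory.

    With dry_run=True nothing is executed and the commands are only returned,
    which is useful for checking the paths first.
    """
    commands = [build_command(work_dir) for work_dir in work_dirs]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the parser exits non-zero.
            subprocess.run(command, check=True)
    return commands


# Dry run over two placeholder directories.
planned = parse_jobs(["/path/to/job-1", "/path/to/job-2"])
for command in planned:
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems for directory names that contain spaces.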
Run the following commands to run the tests.
source venv/bin/activate
sh run-tests.sh
This repository is an open-source work-in-progress and we welcome contributions. We suggest forking this repository and introducing the adjustments there. The changes in the fork can then be considered for merging into this repository, as explained in GitHub Standard Fork and Pull Request Workflow.
The following diagram presents the package architecture.
Here's an example flow of data/events:
User invokes the parser with a path to a job working directory.
The parser initializes a Job class to extract and serialize the job.
Job class uses Workflow parser to extract and serialize the workflow.
The Workflow is initialized with a Template to help the parser to construct the workflow.
- Users can add new templates or adjust the current ones to support complex workflows.
Workflow parser iterates over the Units to extract
- application-related data
- input and output files
- materials (initial/final structures) and properties
The job utilizes Compute classes to extract compute configuration from the resource management system.
Once the job is formed, it is passed to Data Handler classes that handle the data, e.g. by storing it in the Exabyte platform.
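The flow described above can be sketched in a few lines of Python. This is not the package's actual code: every class, method, and dictionary key below is a hypothetical stand-in meant only to show how a job, a workflow parser, a template, a compute parser, and data handlers could be wired together.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical workflow template: a name plus a list of unit descriptions."""
    name: str
    units: list


class WorkflowParser:
    """Walks the template units and collects per-unit data."""

    def __init__(self, work_dir, template):
        self.work_dir = work_dir
        self.template = template

    def to_json(self):
        units = []
        for unit in self.template.units:
            # A real parser would read application data, input/output files,
            # structures, and properties here; this sketch only records names.
            units.append({"name": unit["name"], "workDir": unit["workDir"]})
        return {"template": self.template.name, "units": units}


class Compute:
    """Stand-in for a compute parser; the returned values are placeholders."""

    def to_json(self):
        return {"nodes": 1}


class Job:
    """Combines the workflow and compute data into one serialized job."""

    def __init__(self, work_dir, template, compute):
        self.work_dir = work_dir
        self.workflow = WorkflowParser(work_dir, template)
        self.compute = compute

    def to_json(self):
        return {
            "workDir": self.work_dir,
            "workflow": self.workflow.to_json(),
            "compute": self.compute.to_json(),
        }


class InMemoryHandler:
    """Stand-in data handler that keeps serialized jobs in a list."""

    def __init__(self):
        self.stored = []

    def handle(self, job_json):
        self.stored.append(job_json)


def run_parser(work_dir, template, handlers):
    """Form the job, then pass it to every configured data handler."""
    job_json = Job(work_dir, template, Compute()).to_json()
    for handler in handlers:
        handler.handle(job_json)
    return job_json


template = Template("placeholder-shell", [{"name": "unit-0", "workDir": "."}])
handler = InMemoryHandler()
result = run_parser("/path/to/job", template, [handler])
print(result)
```

Separating extraction (Job, WorkflowParser, Compute) from storage (handlers) means a new storage target only needs a new handler class.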
Workflow templates help the parser extract the data, because users follow different approaches to naming their input/output files and organizing their job directories. Readers are referred to Exabyte.io Documentation for more information about the structure of workflows. As explained above, a Shell Workflow Template is used by default to construct the workflow. For each unit of the workflow one should specify
stdoutFile, the relative path to the file containing the standard output of the job,
workDir, the relative path to the directory containing data for the unit and the name of
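As a rough illustration of the two per-unit fields listed above, the sketch below checks that each unit defines stdoutFile and workDir and resolves them to absolute paths. It assumes the relative paths are taken relative to the job working directory; the template layout, the unit names, and the helper functions are hypothetical and only the two key names come from this README.

```python
import posixpath

# Hypothetical template fragment. Only the stdoutFile and workDir keys are
# taken from the README; the surrounding structure and values are assumed.
TEMPLATE = {
    "units": [
        {"name": "unit-0", "stdoutFile": "job.out", "workDir": "."},
        {"name": "unit-1", "stdoutFile": "relax/job.out", "workDir": "relax"},
    ]
}

REQUIRED_KEYS = ("stdoutFile", "workDir")


def validate_unit(unit):
    """Raise ValueError if a unit lacks one of the required path fields."""
    missing = [key for key in REQUIRED_KEYS if key not in unit]
    if missing:
        raise ValueError(f"unit {unit.get('name')!r} is missing: {missing}")


def resolve_unit_paths(job_dir, unit):
    """Join the unit's relative paths onto the job working directory."""
    validate_unit(unit)
    return {
        key: posixpath.normpath(posixpath.join(job_dir, unit[key]))
        for key in REQUIRED_KEYS
    }


resolved = [resolve_unit_paths("/path/to/job", unit) for unit in TEMPLATE["units"]]
for paths in resolved:
    print(paths)
```

Validating the template before parsing gives a clear error for a misnamed key instead of a failure deep inside the extraction step.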
Desirable features for implementation:
- Implement PBS/Torque and SLURM compute parsers
- Implement VASP and Espresso execution unit parsers
- Add other data handlers
- Add complex workflow templates