Patrick Huck (tschaume), Oct-28-2014
Under Development
fix issue of 'snl_id' and 'mp_id' synchronicity first. IN PROGRESS, see
SNL Group Checks
CSV format good for tabular data, possibly separate out the provenance into yaml, e.g. see
submit_snl_from_cif()
use plot.ly for embed- and shareable graphs, synchronize formats, e.g. see
results of SNL Group Checks
model script similar to AWS CLI, e.g. see
check_snl
optionally define separator in (sub-)section title line.
support multiple plots on same data.
comments: no special line breaks, allow for line wrapping.
special syntax for multi-index tables in Pandas.
support optional sectional indentation.
input/data validation and error handling.
sample parsing of authors/publications based on project-wide bibtex-file.
use section names as keywords to facilitate search feature.
add XAS/XMCD data provided by ALS (Alpha N’Diaye) as test sample. DONE
discuss post-processing/code-injector support via ALS data. POSTPONED
develop designated REST interface function.
reverse section level character repetition (a la markdown). DONE
The objective of this module
is to develop a test scenario for the submission/contribution of a multitude of
possible user data formats and their resulting JSON representation using
'Pandas'. 'Pandas' already handles a variety of
different data table formats with index columns and header rows, and their
translation into the according python objects. The respective common language
is well established and allows for the convenient
import of data
from csv and other sources via a few options as well as the definition of a
set of
orientations and
out-types
to translate data objects to JSON via to_json()
or to python dicts via to_dict()
.
Default
plotting of the data objects with sensible options are also provided. Basicly,
'Pandas' represents simple one-column, possibly indexed data tables via
Series
and all else (i.e. even multi-indexed data) via DataFrame
objects.
This considerably facilitates the programmatic submission of data using the
Materials Project’s REST API for user and developer. The approach of using
'Pandas' for MP user data submissions would not only guarantee a common
language with already plenty of existing documentation but would also leave the
specific data table formats under the user’s control.
A user wishing to submit data to the 'Materials Project' will have his or her
own understanding of what "material" means in the context of the submission.
The definition of a "material" is ambivalent which requires a suitable basis to
be established for a submission. The context of the submission can be
categorized hierarchically using MP’s terminology as Structure [mp-24972] or
Molecule < Composition [Fe2O3] < Chemical System [Fe*O*]
sorted by increasing
number of structures included by the definition of the respective term.
Generally, a user’s data submission can be a contribution at any of these
levels which significantly changes its scope and purpose with respect to the
usage within the Materials Project. For the experimental data provided by ALS
below, for instance, the submitter does not have any further knowledge about
the "material" under investigation than its chemical system or at most its
composition. This is comparable to the context in which phase diagrams are
currently produced in MP. The user would still like to first submit his
processed/final data (i.e. x-ray absorption spectra or XMCD signals) and
compare it to a list of FEFF calculations or overlay it with a phase diagram
provided by MP. This is probably true for many of the future experimental data
submissions to MP which is why these use cases need to also be kept in mind
when developing a general user submission scheme as is intended here. Other use
cases are the submission of a new structure via a CIF-file during a journal’s
publication process with the request for MP calculations or the submission of
already conducted calculations to be "attached" to an existing structure. A
possible solution to generally cover submissions in the future, would be to
maintain a database of arbitrary user submissions tagged with dedicated keys to
determine their category and scope. Extended internal discussion is required to
sort out possibilities and priorities.
'Note from AJ meeting, 9/19/2014': Chemical System
in general would be too
broad as a definition but could be implemented as a list of Compositions
to
narrow it down and make it managable on the infrastructure/database level.
For the development of the current submission scheme we’re working off the assumption that the submission by the user is based on a unique 'snl_group_id'. This allows for the extension of the already existing 'projects' key in the SNL to serve as a list of projects contributing to the respective SNL. Each element in this list would reference the according document in the project’s collection of data submissions. The issue of mapping 'mp_id' and 'task_id' to 'snl_group_id' then needs to be addressed separately. Note that the solution proposed here assumes the submission of any general final user data associated with the respective SNL. It does not try to solve the separate issue of a user’s desire to submit customized but MP-based user tasks to the MP’s core task collection.
The organization of authors and publications is long well established in the scientific community using dedicated BibTeX files including designated field names and entry types commonly required for references. GUIs & tools exist for many platforms to maintain these file types such that the user does not need to be familiar with the particular syntax. In the MP, each project would maintain a single "global/project-wide" bibtex-file which would be submitted separately from the data. The existing python module Pybtex can be used to parse the bibtex-file and save it to the Mongo database. The resulting bibtex-key would serve as a unique identifier to link the data in the user submission to the corresponding authors and publications. The bibtex-keys can then be resolved dynamically into author names etc. on the frontend, for instance.
'Pandas' allows for the import of data from many different sources which makes
it a suitable basis to be extended later based on the feedback by MP’s user
community. For the purpose of developing a test scenario of user submissions we
start with basic CSV files using a minimal amount of meta-data necessary to
customize the submission for MP. CSVs are commonly used, even ubiquitous! They
are easy to produce and parse, while well suited for tabular data. CSV
does not handle hierarchical data or free-form text well, but this should be
manageable for now. Once the general submission scheme is established, other
more programmatic ways of submission should be easily implementable.
input.csv
is a csv-formatted file with a collection of
possible user data formats separated in nested sections by multiples of >
.
The character chosen as separator is open for discussion. See inline comments
in the following excerpt from input.csv
for more info on the details of the
input format.
>>> GENERAL
# - anything after section delimiter is parsed as section name, excl. comments
# - number of '>'s denotes section level (depth), min. 3 too avoid collision with '>>' sign
# - a general section with properties, settings and defaults. The MP might
# require certain unique row names in this section (snl-id, mp-id, xtal-name..)
# alternatively the mp-id can be repeated in each main section and a global GENERAL section be omitted
mp-id: 1143
# comment lines in (sub)section body are ignored
xtal-name: Al2O3
submitters: slany@nrel.gov, pgraf@nrel.gov # usernames = email addresses
references: slany14, slany12 # bibtex-keys
>>> CRYSTAL
>>>> general
# - use colon as separator for 'general' and 'plot' (sub-)sections
# - simple list of key-value pairs (all section but 'data' currently interpreted this way)
# - key serves as index -> needs to be unique
# - separate header entry in general section is not necessary. Pandas already
# provides that since it is part of the data (user just "labels" the data)
standards: fere, gwvd
>>>>> bibtex # example for tree-like section nesting
publications: ja295760, ja295765 # bibtex-keys
authors: nrel_authors # bibtex-key
>>>>> comments
acknowledgment: This dataset is the result of DOE grant 12345, NSF grant 12345, and the contributed efforts of many researchers. # line-wrapping?
thanks: my wife, Donald Duck, and Tom & Jerry
>>>> plot
# 'plot' subsection:
# - specify a plot and its options
# - supports columns to be plotted referred to by header name
# - key-value pairs in this section are passed through to df.plot() (not tested)
x: alpha
>>>> data
# - 'data' sections are parsed with comma or tab as delimiter (dep. on file ending)
# - always require header row in data section
# - define column header like desired for axis labels (for now)
alpha,beta,gamma
10,11,12
>>> BAND GAPS
# a section with a simple list of annotated numbers including units. The number
# can have multiple columns to provide info on the respective conditions under
# which the number was generated, for instance.
>>>> plot
x: name
kind: bar
>>>> data
name,type,functional,method,value,unit
band gap,indirect,GLLB-SC,Kohn-Sham,6.887038,eV
band gap,direct,GLLB-SC,Kohn-Sham,6.886986,eV
deriv. discont.,,GLLB-SC,,2.42833,eV
>>> ELASTIC TENSOR
# no subsections -> parsed as 'data'
Matrix,Exp.,Theo.,Ref.
c11,287.0,284.7,PSP11 # bibtex-key
c22,302.1,299.5
>>> DIELECTRIC CONSTANT
>>>> plot # no y-axis headers -> overlay all y_i vs x in plot
x: freq
>>>> data
freq,real,imag
0,2.0065,0
>>> XMCD
>>>> general
mp-id: mp-54 # multiple mp-id's per csv yet to be decided
Date: 8/11/2014
Count Time (s): 3.50000000
>>>> plot
x: Energy
>>>> data
Energy,Intensity B<0,Intensity B>0,XAS,XMCD
755.73651123,0.08770159571747617,0.08229754835057994,0.08499957203402805,0.005404047366896231
760.99865723,0.08111457285575464,0.08104228193740101,0.08107842739657783,7.229091835363188e-05
The RecursiveParser
recursively splits the input file section-by-section
using appropriate regular expressions with the current separator level. When
no section separator is found anymore, the section body is read into 'Pandas'
objects Series
or DataFrame
via read_csv()
and subsequently incorporated
into the output document with the appropriate nesting using to_dict()
.
Lists of 1-1-mappings are always imported as an indexed Series
object
("squeezed"). For the Series
object, the conversion to a dict is obvious. For
the DataFrame
object, list
is used as conversion type if all columns are
numeric and records
for all else. The RecursiveDict
class extends usual
python dicts for nested updating. The plot
function reads the data from
output.json
and produces 'Pandas' default plots. It currently only passes the
key-value pairs of the 'plot' subsections through to 'Pandas' without checks or
secondary adjustments. In the future, the plotting part and infrastructure will
employ the services provided by plot.ly.
The inclusion of the XMCD/XAS data from ALS raises an
interesting feature request which could be taken into consideration when
importing data into MP. The raw data taken by the instruments is already close
to the format proposed here, and only needs minor filtering and simple
processing to produce the final XAS spectra and XMCD signal to be displayed on
the MP web-page. As the data import is based on 'Pandas' data objects, one
could allow the user to provide/inject designated post-processing code that is
executed on the respective DataFrame object before the data is dumped in the
MP’s database. The xmcd_post_process()
function represents a simple example
post-processing the raw ALS data. It basically subtracts a baseline column,
splits a column based on a condition, and recombines them via addition and
subtraction to generate the final results which are saved in the database. The
code could live in the 'MPWorks' repo and possibly be "applied" to a DataFrame
via
df.apply()
.
It needs to be discussed internally to which degree the MP import process would
support this use case or whether this type of data processing should be left to
the user entirely. Allowing for some post-processing capabilities in MP, would
facilitate fully automated workflows where newly produced data is submitted to
MP on a regular basis. BTW, the code necessary for data validation would hook
into the import process in a similar way (see
code).
Running python -m mpworks.user_contributions
] over
input.csv
, pretty-prints the imported data using 'Pandas'
defaults and outputs a JSON representation of how the data would be saved in
MP’s database internally (→ output.json
). Finally, the
imported data is plotted using 'Pandas' defaults based on the generated
output.json
.
Pandas Plot | JSON Representation |
---|---|
{ ... "crystal": { "data": { "alpha": [ 10, 20, 30, 40, 50 ], "beta": [ 11, 21, 31, 41, 51 ], "gamma": [ 12, 22, 32, 42, 52 ] }, "general": { "bibtex": { "authors": "nrel_authors ", "publications": "ja295760, ja295765 " }, "comments": { "acknowledgment": "This dataset is the ...", "thanks": "my wife, Donald Duck, and Tom & Jerry" } }, "plot": { "x": "alpha" } }, ... } |
Pandas Plot | JSON Representation |
---|---|
{ ... "band gaps": { "data": [ { "functional": "GLLB-SC", "method": "Kohn-Sham", "name": "band gap", "type": "indirect", "unit": "eV", "value": 6.887038 }, ... ], "plot": { "kind": "bar", "x": "name" } }, ... } |
Pandas Plot | JSON Representation |
---|---|
{ ... "elastic tensor": [ { "Exp.": 287.0, "Matrix": "c11", "Ref.": "PSP11 ", "Theo.": 284.7 }, ... ], ... } |
Pandas Plot | JSON Representation |
---|---|
{ ... "dielectric constant": { "data": { "freq": [ 0.0, 0.5, 1.0, ... ], "imag": [ 0.0, 0.0, 0.0, ... ], "real": [ 2.0065, 2.0073, 2.0097, ... ] }, "plot": { "x": "freq" } }, ... } |
Pandas Plot | JSON Representation |
---|---|
{ ... "xmcd": { "data": { "Energy": [ 755.7365, 760.9987, ... ], "Intensity B<0": [ 0.08770, 0.08111, ... ], "Intensity B>0": [ 0.08229, 0.08104, ... ], "XAS": [ 0.08500, 0.08108, ... ], "XMCD": [ 0.005404, 7.229e-05, ... ] }, "general": { "Count Time (s)": "3.50000000", "Date": "8/11/2014", "mp-id": "mp-54 " }, "plot": { "x": "Energy" } }, ... } |
The function rest.submit_snl_from_cif
supports a small demo case to submit a structure via a CIF file and some
meta-data in YAML format to Materials Project during the RSC publishing
process, for instance. The keys in the MetaData file
input_rsc.yaml
correspond to what’s expected by the
StructureNL constructor:
authors: John Doe <johndoe@gmail.com>, Test User <test@materialsproject.org> # could also be dict
references: rsc.bib # interpret as bibtex-string if starts w/ @ else as bibfile name
remarks: # list of strings (<140 chars each)
- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent et
tincidunt magna, vel tincidunt nulla.
- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc vulputate
varius ex, sit amet dignissim turpis aliquam quis. Sed ex.
projects: ['Project A', 'Project B'] # list of strings (alt. format)
history:
- name: Inorganic Crystal Structure Database (ICSD)
url: http://icsd.fiz-karlsruhe.de/
description: { "icsd_id" : 43732 }
data:
_materialsproject: <custom data>
_icsd:
icsd_id: 43732
comments: [ "Cell from ..." ]
The above file would be prepared during the journal’s publishing process, and submitted along with the CIF file to MP once review is successfully completed (full reference available).