GitHub - ATNDiaye/MPContribs: MP's User Contribution Framework

Patrick Huck (tschaume), Oct-28-2014
Under Development

user-contributed data submissions

General TODO list

fix issue of 'snl_id' and 'mp_id' synchronicity first. IN PROGRESS, see SNL Group Checks
CSV format good for tabular data, possibly separate out the provenance into yaml, e.g. see submit_snl_from_cif()
use plot.ly for embed- and shareable graphs, synchronize formats, e.g. see results of SNL Group Checks
model script similar to AWS CLI, e.g. see check_snl
optionally define separator in (sub-)section title line.
support multiple plots on same data.
comments: no special line breaks, allow for line wrapping.
special syntax for multi-index tables in Pandas.
support optional sectional indentation.
input/data validation and error handling.
sample parsing of authors/publications based on project-wide bibtex-file.
use section names as keywords to facilitate search feature.
add XAS/XMCD data provided by ALS (Alpha N’Diaye) as test sample. DONE
discuss post-processing/code-injector support via ALS data. POSTPONED
develop designated REST interface function.
reverse section level character repetition (a la markdown). DONE

objective

The objective of this module is to develop a test scenario for the submission/contribution of a multitude of possible user data formats and their resulting JSON representation using 'Pandas'. 'Pandas' already handles a variety of different data table formats with index columns and header rows, and their translation into the according python objects. The respective common language is well established and allows for the convenient import of data from csv and other sources via a few options as well as the definition of a set of orientations and out-types to translate data objects to JSON via to_json() or to python dicts via to_dict(). Default plotting of the data objects with sensible options are also provided. Basicly, 'Pandas' represents simple one-column, possibly indexed data tables via Series and all else (i.e. even multi-indexed data) via DataFrame objects. This considerably facilitates the programmatic submission of data using the Materials Project’s REST API for user and developer. The approach of using 'Pandas' for MP user data submissions would not only guarantee a common language with already plenty of existing documentation but would also leave the specific data table formats under the user’s control.

basis of user submissions

A user wishing to submit data to the 'Materials Project' will have his or her own understanding of what "material" means in the context of the submission. The definition of a "material" is ambivalent which requires a suitable basis to be established for a submission. The context of the submission can be categorized hierarchically using MP’s terminology as Structure [mp-24972] or Molecule < Composition [Fe2O3] < Chemical System [Fe*O*] sorted by increasing number of structures included by the definition of the respective term. Generally, a user’s data submission can be a contribution at any of these levels which significantly changes its scope and purpose with respect to the usage within the Materials Project. For the experimental data provided by ALS below, for instance, the submitter does not have any further knowledge about the "material" under investigation than its chemical system or at most its composition. This is comparable to the context in which phase diagrams are currently produced in MP. The user would still like to first submit his processed/final data (i.e. x-ray absorption spectra or XMCD signals) and compare it to a list of FEFF calculations or overlay it with a phase diagram provided by MP. This is probably true for many of the future experimental data submissions to MP which is why these use cases need to also be kept in mind when developing a general user submission scheme as is intended here. Other use cases are the submission of a new structure via a CIF-file during a journal’s publication process with the request for MP calculations or the submission of already conducted calculations to be "attached" to an existing structure. A possible solution to generally cover submissions in the future, would be to maintain a database of arbitrary user submissions tagged with dedicated keys to determine their category and scope. Extended internal discussion is required to sort out possibilities and priorities.

'Note from AJ meeting, 9/19/2014': Chemical System in general would be too broad as a definition but could be implemented as a list of Compositions to narrow it down and make it managable on the infrastructure/database level.

user submissions in current MP infrastructure

For the development of the current submission scheme we’re working off the assumption that the submission by the user is based on a unique 'snl_group_id'. This allows for the extension of the already existing 'projects' key in the SNL to serve as a list of projects contributing to the respective SNL. Each element in this list would reference the according document in the project’s collection of data submissions. The issue of mapping 'mp_id' and 'task_id' to 'snl_group_id' then needs to be addressed separately. Note that the solution proposed here assumes the submission of any general final user data associated with the respective SNL. It does not try to solve the separate issue of a user’s desire to submit customized but MP-based user tasks to the MP’s core task collection.

authors & publications

The organization of authors and publications is long well established in the scientific community using dedicated BibTeX files including designated field names and entry types commonly required for references. GUIs & tools exist for many platforms to maintain these file types such that the user does not need to be familiar with the particular syntax. In the MP, each project would maintain a single "global/project-wide" bibtex-file which would be submitted separately from the data. The existing python module Pybtex can be used to parse the bibtex-file and save it to the Mongo database. The resulting bibtex-key would serve as a unique identifier to link the data in the user submission to the corresponding authors and publications. The bibtex-keys can then be resolved dynamically into author names etc. on the frontend, for instance.

data submission format

'Pandas' allows for the import of data from many different sources which makes it a suitable basis to be extended later based on the feedback by MP’s user community. For the purpose of developing a test scenario of user submissions we start with basic CSV files using a minimal amount of meta-data necessary to customize the submission for MP. CSVs are commonly used, even ubiquitous! They are easy to produce and parse, while well suited for tabular data. CSV does not handle hierarchical data or free-form text well, but this should be manageable for now. Once the general submission scheme is established, other more programmatic ways of submission should be easily implementable. input.csv is a csv-formatted file with a collection of possible user data formats separated in nested sections by multiples of >. The character chosen as separator is open for discussion. See inline comments in the following excerpt from input.csv for more info on the details of the input format.

>>> GENERAL
# - anything after section delimiter is parsed as section name, excl. comments
# - number of '>'s denotes section level (depth), min. 3 too avoid collision with '>>' sign
# - a general section with properties, settings and defaults. The MP might
#   require certain unique row names in this section (snl-id, mp-id, xtal-name..)
#   alternatively the mp-id can be repeated in each main section and a global GENERAL section be omitted
mp-id: 1143
# comment lines in (sub)section body are ignored
xtal-name: Al2O3
submitters: slany@nrel.gov, pgraf@nrel.gov # usernames = email addresses
references: slany14, slany12 # bibtex-keys

>>> CRYSTAL
>>>> general
# - use colon as separator for 'general' and 'plot' (sub-)sections
# - simple list of key-value pairs (all section but 'data' currently interpreted this way)
# - key serves as index -> needs to be unique
# - separate header entry in general section is not necessary. Pandas already
#   provides that since it is part of the data (user just "labels" the data)
standards: fere, gwvd
>>>>> bibtex # example for tree-like section nesting
publications: ja295760, ja295765 # bibtex-keys
authors: nrel_authors # bibtex-key
>>>>> comments
acknowledgment: This dataset is the result of DOE grant 12345, NSF grant 12345, and the contributed efforts of many researchers. # line-wrapping?
thanks: my wife, Donald Duck, and Tom & Jerry
>>>> plot
# 'plot' subsection:
# - specify a plot and its options
# - supports columns to be plotted referred to by header name
# - key-value pairs in this section are passed through to df.plot() (not tested)
x: alpha
>>>> data
# - 'data' sections are parsed with comma or tab as delimiter (dep. on file ending)
# - always require header row in data section
# - define column header like desired for axis labels (for now)
alpha,beta,gamma
10,11,12

>>> BAND GAPS
# a section with a simple list of annotated numbers including units. The number
# can have multiple columns to provide info on the respective conditions under
# which the number was generated, for instance.
>>>> plot
x: name
kind: bar
>>>> data
name,type,functional,method,value,unit
band gap,indirect,GLLB-SC,Kohn-Sham,6.887038,eV
band gap,direct,GLLB-SC,Kohn-Sham,6.886986,eV
deriv. discont.,,GLLB-SC,,2.42833,eV

>>> ELASTIC TENSOR
# no subsections -> parsed as 'data'
Matrix,Exp.,Theo.,Ref.
c11,287.0,284.7,PSP11 # bibtex-key
c22,302.1,299.5

>>> DIELECTRIC CONSTANT
>>>> plot #  no y-axis headers -> overlay all y_i vs x in plot
x: freq
>>>> data
freq,real,imag
0,2.0065,0

>>> XMCD
>>>> general
mp-id: mp-54 # multiple mp-id's per csv yet to be decided
Date: 8/11/2014
Count Time (s): 3.50000000
>>>> plot
x: Energy
>>>> data
Energy,Intensity B<0,Intensity B>0,XAS,XMCD
755.73651123,0.08770159571747617,0.08229754835057994,0.08499957203402805,0.005404047366896231
760.99865723,0.08111457285575464,0.08104228193740101,0.08107842739657783,7.229091835363188e-05

data import code

The RecursiveParser recursively splits the input file section-by-section using appropriate regular expressions with the current separator level. When no section separator is found anymore, the section body is read into 'Pandas' objects Series or DataFrame via read_csv() and subsequently incorporated into the output document with the appropriate nesting using to_dict(). Lists of 1-1-mappings are always imported as an indexed Series object ("squeezed"). For the Series object, the conversion to a dict is obvious. For the DataFrame object, list is used as conversion type if all columns are numeric and records for all else. The RecursiveDict class extends usual python dicts for nested updating. The plot function reads the data from output.json and produces 'Pandas' default plots. It currently only passes the key-value pairs of the 'plot' subsections through to 'Pandas' without checks or secondary adjustments. In the future, the plotting part and infrastructure will employ the services provided by plot.ly.

data post-processing

The inclusion of the XMCD/XAS data from ALS raises an interesting feature request which could be taken into consideration when importing data into MP. The raw data taken by the instruments is already close to the format proposed here, and only needs minor filtering and simple processing to produce the final XAS spectra and XMCD signal to be displayed on the MP web-page. As the data import is based on 'Pandas' data objects, one could allow the user to provide/inject designated post-processing code that is executed on the respective DataFrame object before the data is dumped in the MP’s database. The xmcd_post_process() function represents a simple example post-processing the raw ALS data. It basically subtracts a baseline column, splits a column based on a condition, and recombines them via addition and subtraction to generate the final results which are saved in the database. The code could live in the 'MPWorks' repo and possibly be "applied" to a DataFrame via df.apply(). It needs to be discussed internally to which degree the MP import process would support this use case or whether this type of data processing should be left to the user entirely. Allowing for some post-processing capabilities in MP, would facilitate fully automated workflows where newly produced data is submitted to MP on a regular basis. BTW, the code necessary for data validation would hook into the import process in a similar way (see code).

JSON-formatted data for MongoDB & Pandas Plots

Running python -m mpworks.user_contributions] over input.csv, pretty-prints the imported data using 'Pandas' defaults and outputs a JSON representation of how the data would be saved in MP’s database internally (→ output.json). Finally, the imported data is plotted using 'Pandas' defaults based on the generated output.json.

Crystal

Pandas Plot

JSON Representation

{
  ...
  "crystal": {
    "data": {
      "alpha": [ 10, 20, 30, 40, 50 ],
      "beta": [ 11, 21, 31, 41, 51 ],
      "gamma": [ 12, 22, 32, 42, 52 ]
    },
    "general": {
      "bibtex": {
        "authors": "nrel_authors ",
        "publications": "ja295760, ja295765 "
      },
      "comments": {
        "acknowledgment": "This dataset is the ...",
        "thanks": "my wife, Donald Duck, and Tom & Jerry"
      }
    },
    "plot": {
      "x": "alpha"
    }
  },
  ...
}

Band Gaps

Pandas Plot

JSON Representation

{
  ...
  "band gaps": {
    "data": [
      {
        "functional": "GLLB-SC",
        "method": "Kohn-Sham",
        "name": "band gap",
        "type": "indirect",
        "unit": "eV",
        "value": 6.887038
      },
      ...
    ],
    "plot": {
      "kind": "bar",
      "x": "name"
    }
  },
  ...
}

Elastic Tensor

Pandas Plot	JSON Representation
	{ ... "elastic tensor": [ { "Exp.": 287.0, "Matrix": "c11", "Ref.": "PSP11 ", "Theo.": 284.7 }, ... ], ... }

Dielectric Constants

Pandas Plot	JSON Representation
	{ ... "dielectric constant": { "data": { "freq": [ 0.0, 0.5, 1.0, ... ], "imag": [ 0.0, 0.0, 0.0, ... ], "real": [ 2.0065, 2.0073, 2.0097, ... ] }, "plot": { "x": "freq" } }, ... }

XMCD & XAS

Pandas Plot

JSON Representation

{
  ...
  "xmcd": {
    "data": {
      "Energy":
        [ 755.7365, 760.9987, ... ],
      "Intensity B<0":
        [ 0.08770, 0.08111, ... ],
      "Intensity B>0":
        [ 0.08229, 0.08104, ... ],
      "XAS":
        [ 0.08500, 0.08108, ... ],
      "XMCD":
        [ 0.005404, 7.229e-05, ... ]
    },
    "general": {
      "Count Time (s)": "3.50000000",
      "Date": "8/11/2014",
      "mp-id": "mp-54 "
    },
    "plot": {
      "x": "Energy"
    }
  },
  ...
}

Submit SNL from CIF and YAML MetaData File

The function rest.submit_snl_from_cif supports a small demo case to submit a structure via a CIF file and some meta-data in YAML format to Materials Project during the RSC publishing process, for instance. The keys in the MetaData file input_rsc.yaml correspond to what’s expected by the StructureNL constructor:

authors: John Doe <johndoe@gmail.com>, Test User <test@materialsproject.org> # could also be dict
references: rsc.bib # interpret as bibtex-string if starts w/ @ else as bibfile name
remarks: # list of strings (<140 chars each)
  - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent et
    tincidunt magna, vel tincidunt nulla.
  - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc vulputate
    varius ex, sit amet dignissim turpis aliquam quis. Sed ex.
projects: ['Project A', 'Project B'] # list of strings (alt. format)
history:
  - name: Inorganic Crystal Structure Database (ICSD)
    url: http://icsd.fiz-karlsruhe.de/
    description: { "icsd_id" : 43732 }
data:
  _materialsproject: <custom data>
  _icsd:
    icsd_id: 43732
    comments: [ "Cell from ..." ]

The above file would be prepared during the journal’s publishing process, and submitted along with the CIF file to MP once review is successfully completed (full reference available).

Name		Name	Last commit message	Last commit date
Latest commit History 215 Commits
mpcontribs		mpcontribs
scripts		scripts
test_files		test_files
.gitignore		.gitignore
README.asc		README.asc
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mpcontribs

mpcontribs

scripts

scripts

test_files

test_files

.gitignore

.gitignore

README.asc

README.asc

requirements.txt

requirements.txt

Repository files navigation

user-contributed data submissions

General TODO list

objective

basis of user submissions

user submissions in current MP infrastructure

authors & publications

data submission format

data import code

data post-processing

JSON-formatted data for MongoDB & Pandas Plots

Crystal

Band Gaps

Elastic Tensor

Dielectric Constants

XMCD & XAS

Submit SNL from CIF and YAML MetaData File

About

Releases

Packages

Languages

ATNDiaye/MPContribs

Folders and files

Latest commit

History

Repository files navigation

user-contributed data submissions

General TODO list

objective

basis of user submissions

user submissions in current MP infrastructure

authors & publications

data submission format

data import code

data post-processing

JSON-formatted data for MongoDB & Pandas Plots

Crystal

Band Gaps

Elastic Tensor

Dielectric Constants

XMCD & XAS

Submit SNL from CIF and YAML MetaData File

About

Resources

Stars

Watchers

Forks

Languages