# DKRZ data ingest information handling

Some background information for data managers.

The submission_forms package provides a collection of components to manage information about data ingest activities (data transport, data checking, data publication and data archival):

* data submission related information management
  * who, when, what, for which project, data characteristics
* data management related information collection
  * ingest, quality assurance, publication, archiving

## Related approaches at other sites

Example workflows at other data centers:
* http://eidc.ceh.ac.uk/images/ingestion-workflow/view
* http://www.mdpi.com/2220-9964/5/3/30/pdf
* https://www.rd-alliance.org/sites/default/files/03%20Nurnberger%20-%20DataPublishingWorkflows-CollabMtg20151208_V03.pdf
* http://ropercenter.cornell.edu/polls/deposit-data/
* https://www.arm.gov/engineering/ingest
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/data_ingest.txt
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/how_to_participate.html
* http://www.nodc.noaa.gov/submit/ (online tool)
* https://www2.cisl.ucar.edu/resources/cmip-analysis-platform
  * https://xras-submit-ncar.xsede.org/ 
* http://cmip5.whoi.edu/ 
* https://pypi.python.org/pypi/cmipdata/0.6
* ....

## DKRZ ingest workflow system

Approach:
* Data management related information is stored in structured JSON files
* Different ways to populate the JSON files:
  * DKRZ served jupyter notebooks (e.g. in DKRZ jupyterhub http://data-forms.dkrz.de:8080)
  * Client side jupyter notebooks (submission via email, rt ticket, git commit)
  * Client side excel sheets (submission via email, rt ticket)
  * Unstructured email exchange (JSON population done by data managers)
* A toolset to manage the JSON files along a well-defined workflow
* A toolset to search and cross-reference data submission information
* Support for exposing the structured JSON files according to the W3C PROV standard

### workflow steps:

* 'sub': data submission related information (client side: who, what, how, ...; manager side: who, status, ...)
* 'rev': data submission review information
* 'ing': data ingest related information
* 'qua': data quality assurance related information
* 'pub': data publication related information
* 'lta': data long term archival related information    

In [2]:
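from dkrz_forms import form_handler, utils, checks, wflow_handler
from datetime import datetime

# path of the json file holding the information for one data submission
info_file = "/opt/jupyter/notebooks/form_directory/CORDEX/CORDEX_mm_mm.json"

# load the json file as a form object
my_form = utils.load_workflow_form(info_file)

# list the workflow steps defined for this form
wflow_dict = wflow_handler.get_wflow_description(my_form)
list(wflow_dict.values())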

['sub', 'rev', 'ing', 'qua', 'pub']

### each workflow step is structured according to:

* **agent**: step related person or software tool information
* **activity**: step execution related information
* **entity_in**: input information for this workflow step
* **entity_out**: output information for this workflow step

These parts have to be filled in for each workflow step to characterize who (**agent**) did what (**activity**), using which input information (**entity_in**), to produce which output information (**entity_out**). They align with the W3C PROV model, which allows all collected information to be translated according to the W3C PROV standard (see the provenance.ipynb notebook for an example).
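
As a rough illustration of this four-part layout, the following sketch builds one workflow step as a plain Python dictionary and serializes it to JSON. The helper `make_step` and all field values are hypothetical and are not part of the dkrz_forms package; the actual templates are defined in the package's workflow_steps.py.

```python
import json


def make_step(agent, activity, entity_in, entity_out):
    """Bundle the four parts of one workflow step (hypothetical helper)."""
    return {
        "agent": agent,            # who
        "activity": activity,      # did what
        "entity_in": entity_in,    # using which input information
        "entity_out": entity_out,  # to produce which output information
    }


# Illustrative values only; real forms are loaded from the json files.
qua_step = make_step(
    agent={"last_name": "Mustermann", "first_name": "Franz"},
    activity={"status": "1:in-progress", "comment": "illustrative entry"},
    entity_in={"source": "output of the 'ing' step"},
    entity_out={"report": {}},
)

step_json = json.dumps({"qua": qua_step}, indent=2, sort_keys=True)
print(step_json)
```

Keeping each step as the same four keys is what makes a later mapping onto agent, activity and entity records straightforward.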

### agent related information

Agent information is generally defined in the *dkrz_forms.config.workflow_steps.py* templates 
(see source code on github: https://github.com/IS-ENES/submission_forms/dkrz_forms/config/workflow_steps.py)

For example, the agent responsible for data submission is SUBMISSION_AGENT, which is defined as:
    
SUBMISSION_AGENT = {
    '__doc__': """Attributes characterizing the person responsible for form completion and submission:

       - last_name: Last name of the person responsible for the submission form content
       - first_name: Corresponding first name
       - email: Valid user email address: all follow up activities will use this email to contact end user
       - keyword : user provided key word to remember and separate submission
              """,
    'i_name': 'submission_agent',
    'last_name': 'mandatory',
    'first_name': 'mandatory',
    'keyword': 'mandatory',
    'email': 'mandatory',
    'responsible_person': 'mandatory'
}

All entries characterized as 'mandatory' have to be filled.
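
The rule can be checked mechanically. The sketch below compares a template with a filled copy and reports mandatory entries that are still empty. The function `missing_mandatory` is a hypothetical helper written for this illustration, not a function of the dkrz_forms package, and it assumes that an entry still holding the placeholder string 'mandatory' counts as unfilled.

```python
def missing_mandatory(template, filled):
    """Return keys marked 'mandatory' in `template` that are unfilled in `filled`.

    Hypothetical helper: '__doc__'-style keys and 'i_name' are treated as
    metadata and skipped.
    """
    missing = []
    for key, requirement in template.items():
        if key.startswith("__") or key == "i_name":
            continue
        if requirement != "mandatory":
            continue
        value = filled.get(key, "")
        # Assumption: the untouched placeholder counts as "not filled".
        if value in ("", None, "mandatory"):
            missing.append(key)
    return sorted(missing)


template = {
    "i_name": "submission_agent",
    "last_name": "mandatory",
    "first_name": "mandatory",
    "keyword": "mandatory",
    "email": "mandatory",
    "responsible_person": "mandatory",
}

# A partially completed copy of the template.
filled = dict(
    template,
    last_name="Mustermann",
    first_name="Franz",
    email="franz_mustermann@hzg.de",
)

print(missing_mandatory(template, filled))
```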

In [8]:
# e.g. set email of person responsible for data submission:
my_form.sub.agent.email = 'franz_mustermann@hzg.de'

### activity related information

Again, the generic definition is given in the *dkrz_forms.config.workflow_steps.py* templates. 

For example, the quality assurance (qua) related activity information is defined as:
    
QUA_ACTIVITY = {
    '__doc__': """
        Attributes characterizing the data quality assurance activity:
        - status: status information
        - start_time, end_time: data ingest timing information
        - comment : free text comment
        - ticket_id: related RT ticket number
        - follow_up_ticket: in case new data has to be provided
        - quality_report: dictionary with quality related information (tbd.)
        """,
    'i_name': 'qua_activity',
    'status': ACTIVITY_STATUS,
    'error_status': ERROR_STATUS,
    'qua_tool_version': "mandatory",
    "start_time": "mandatory",
    "end_time": "optional",
    "comment": "optional",
    "ticket_id": "mandatory",
    "follow_up_ticket": 'optional',  # qa feedback to users, follow up actions
}

In [None]:
# Examples: 
# (use 'tab' to see publication workflow step related information)
my_form.pub.

In [5]:
# Example: each step is internally documented
?my_form.pub

### workflow step report documents

Each workflow step can be associated with a report summarizing its results. 
This report is always attached to the entity_out part of the step:

e.g. my_form.pub.entity_out.report = my_dictionary

Example for the quality assurance workflow step (qua):

In [None]:
my_form.qua.entity_out.report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "institute": "CLMcom",
    "model": "CLMcom-CCLM4-8-17-CLM3-5",
    "domain": "AUS-44",
    "driving_experiment":  [ "ICHEC-EC-EARTH"],
    "experiment": [ "history", "rcp45", "rcp85"],
    "ensemble_member": [ "r12i1p1" ],
    "frequency": [ "day", "mon", "sem" ],
    "annotation":
    [
        {
            "scope": ["mon", "sem"],
            "variable": [ "tasmax", "tasmin", "sfcWindmax" ],
            "caption": "attribute <variable>:cell_methods for climatologies requires <time>:climatology instead of time_bnds",
            "comment": "due to the format of the data, climatology is equivalent to time_bnds",
            "severity": "note"
        }
    ]
}
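
Because such a report is an ordinary dictionary, it can be inspected with standard Python. The sketch below uses a shortened copy of the report above, round-trips it through JSON and groups the annotation captions by severity. The function `captions_by_severity` is a hypothetical helper and not part of the dkrz_forms package.

```python
import json
from collections import defaultdict

# Shortened copy of the quality assurance report shown above.
report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "annotation": [
        {
            "scope": ["mon", "sem"],
            "variable": ["tasmax", "tasmin", "sfcWindmax"],
            "caption": "attribute <variable>:cell_methods for climatologies "
                       "requires <time>:climatology instead of time_bnds",
            "severity": "note",
        }
    ],
}


def captions_by_severity(report):
    """Group annotation captions by their 'severity' value (hypothetical helper)."""
    grouped = defaultdict(list)
    for note in report.get("annotation", []):
        grouped[note.get("severity", "unknown")].append(note["caption"])
    return dict(grouped)


# The report only uses JSON-compatible types, so it survives a round trip.
restored = json.loads(json.dumps(report))
print(captions_by_severity(restored))
```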

### Links:


* github repo: https://github.com/IS-ENES-Data/submission_forms
* ...       

In [16]:
# to generate an empty project form including all options for variables
# e.g.: 
ACTIVITY_STATUS = "0:open, 1:in-progress ,2:action-required, 3:paused,4:closed"          
ERROR_STATUS = "0:open,1:ok,2:error"
ENTITY_STATUS = "0:open,1:stored,2:submitted,3:re-opened,4:closed"
CHECK_STATUS = "0:open,1:warning, 2:error,3:ok"
import dkrz_forms
#from dkrz_forms import form_handler, utils
#sf_t = utils.generate_project_form('ESGF_replication')
#print(checks.get_options(sf_t.sub.activity.status))
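
The status strings above encode the allowed options as comma-separated `code:label` pairs. As a minimal sketch of how such a string can be split into its options, the following uses a hypothetical helper `parse_status_options`; it is not the package's `checks.get_options` function, whose behaviour is not shown here.

```python
def parse_status_options(options):
    """Turn a 'code:label' option string into a {code: label} dictionary.

    Hypothetical helper; tolerates the uneven spacing around commas
    found in the option strings above.
    """
    parsed = {}
    for item in options.split(","):
        code, _, label = item.partition(":")
        parsed[int(code.strip())] = label.strip()
    return parsed


ACTIVITY_STATUS = "0:open, 1:in-progress ,2:action-required, 3:paused,4:closed"

status_options = parse_status_options(ACTIVITY_STATUS)
print(status_options)
```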