# DKRZ data ingest information handling


The submission_forms package provides a collection of components for managing information related to data ingest activities (data transport, data checking, data publication and data archival):

* data submission related information management
  * who, when, what, for which project, data characteristics
* data management related information collection
  * ingest, quality assurance, publication, archiving
  
The information is stored in structured json files, which are mapped 1-to-1 to Form objects to simplify information handling. The following assumes that an initial structured json file has already been generated. For the different ways to generate initial structured json files, see the **Workflow_Form_Generation.ipynb** notebook.
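The idea of a 1-to-1 mapping between a json file and an attribute-accessible object can be illustrated with the standard library alone. This is only a minimal sketch and not the package's implementation: the shortened document `raw` and the helper `to_dict` are hypothetical and chosen purely for illustration.

```python
import json
from types import SimpleNamespace

# Hypothetical, heavily shortened workflow document; real files hold more fields.
raw = '{"project": "CORDEX", "sub": {"agent": {"email": "", "last_name": ""}}}'

# object_hook is called for every decoded json object, innermost first,
# so nested dictionaries become nested attribute containers.
form = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

# Attribute access now mirrors the json structure.
form.sub.agent.email = "franz_mustermann@hzg.de"


def to_dict(obj):
    """Recursively convert the namespace tree back into plain dictionaries."""
    if isinstance(obj, SimpleNamespace):
        return {key: to_dict(value) for key, value in vars(obj).items()}
    if isinstance(obj, list):
        return [to_dict(item) for item in obj]
    return obj


# Serialising the converted tree gives back a json document with the update applied.
round_trip = json.dumps(to_dict(form), sort_keys=True)
print(round_trip)
```

Because the conversion works in both directions, an interactive update of an attribute can be written back to the json file without any manual bookkeeping.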

## DKRZ ingest workflow system

Approach:
* Data management related information is managed in structured json files
* To simplify interactive information updates etc. json files are converted to Form objects
* There are multiple possibilities to populate the json files (and associated Form objects):
  * DKRZ served jupyter notebooks (e.g. in DKRZ jupyterhub http://data-forms.dkrz.de:8080)
  * Client side jupyter notebooks (submission via email, rt ticket, git commit)
  * Client side excel sheets (submission via email, rt ticket)
  * Unstructured email exchange (json population done by data managers)
* A toolset to manage Form objects (specially structured json files) along a well defined workflow
* A toolset to search and intercorrelate data submission information
* Support for W3C prov standard exposure of the structured json files

## 1) Get a Form object for information stored in a json file 

In [26]:
## the following libraries are needed to interact with 
## json based form submissions

from dkrz_forms import form_handler, utils, checks, wflow_handler
from datetime import datetime

In [27]:
## info_file = "path to json file"
info_file = "/opt/jupyter/notebooks/form_directory/CORDEX/CORDEX_aa_11.json"

# load json file and convert to Form object for simple updating
my_form = utils.load_workflow_form(info_file)

In [None]:
# use "tab" completion to view the attributes
# every form has a project and a set of associated workflow steps
my_form.

In [25]:
# evaluate to see doc string of submission part
?my_form


## 2) Explore the structure of a workflow Form object
      (i.e. submission workflow json file)
      
The workflow is structured according to the following workflow steps:

* 'sub': data **submission** related information (client side: who, what, how, ...; manager side: who, status, ...)
* 'rev': data submission **review** information
* 'ing': data **ingest** related information
* 'qua': data **quality assurance** related information
* 'pub': data **publication** related information
* 'lta': data **long term archival** and data citation related information   

Information on the Form objects can be retrieved interactively in ipython,
e.g. within jupyter notebooks: again use "tab" for completion and ? to retrieve
the docstring documentation.

Examples:

In [None]:
# evaluate to view associated documentation string
?my_form.sub

In [None]:
# use "tab" completion
my_form.sub.

### each workflow step is structured according to:

* **agent:** step related person or software tool information
* **activity**: step execution related information
* **entity_in**: input information for this workflow step
* **entity_out**: output information for this workflow step

These parts have to be filled for each workflow step to characterize who (**agent**) did what (**activity**), using which input information (**entity_in**), to produce which output information (**entity_out**). They align with the W3C PROV model, which allows all collected information to be translated according to the W3C PROV standard (see the provenance.ipynb notebook for an example).
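How the four parts of a step line up with PROV can be sketched with the three core PROV relations `wasAssociatedWith`, `used` and `wasGeneratedBy`. The identifiers in `step` and the helper `prov_relations` are hypothetical; the actual translation is the one shown in the provenance.ipynb notebook, which may differ.

```python
# Hypothetical identifiers for the four parts of one workflow step (here: 'qua').
step = {
    "agent": "qua_agent",
    "activity": "qua_activity",
    "entity_in": "qua_entity_in",
    "entity_out": "qua_entity_out",
}


def prov_relations(step):
    """Return the three PROV relations that tie one workflow step together."""
    return [
        # who was responsible for the activity
        ("wasAssociatedWith", step["activity"], step["agent"]),
        # which input information the activity consumed
        ("used", step["activity"], step["entity_in"]),
        # which output information the activity produced
        ("wasGeneratedBy", step["entity_out"], step["activity"]),
    ]


for relation, subject, obj in prov_relations(step):
    print(f"{relation}({subject}, {obj})")
```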

In [None]:
# example: "tab" completion to view attributes of agent 
# thus - agent has an email, first_name and last_name

my_form.sub.agent.

### agent related information

This is generally defined in the *dkrz_forms.config.workflow_steps.py* templates 
(see source code on github: https://github.com/IS-ENES/submission_forms/dkrz_forms/config/workflow_steps.py).

For example, the agent responsible for data submission is SUBMISSION_AGENT, which is defined as:
    
SUBMISSION_AGENT = { 
   '__doc__': """Attributes characterizing the person responsible for form completion and submission:

       - last_name: Last name of the person responsible for the submission form content
       - first_name: Corresponding first name
       - email: Valid user email address: all follow up activities will use this email to contact end user
       - keyword : user provided key word to remember and separate submission
              """,
    'i_name': 'submission_agent',
    'last_name' : 'mandatory',
    'first_name' : 'mandatory',
    'keyword': 'mandatory',
    'email': 'mandatory',
    'responsible_person':'mandatory'
  }

All entries characterized as 'mandatory' have to be filled. 
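A completeness check against such a template could look like the following sketch. The helper `missing_mandatory` is hypothetical and not part of dkrz_forms, and the sketch assumes that an unfilled field is either absent, empty, or still holds the placeholder text 'mandatory'.

```python
# Shortened copy of the SUBMISSION_AGENT template shown above.
template = {
    "i_name": "submission_agent",
    "last_name": "mandatory",
    "first_name": "mandatory",
    "keyword": "mandatory",
    "email": "mandatory",
}


def missing_mandatory(template, values):
    """Return the names of mandatory fields that are absent or still unfilled."""
    missing = []
    for name, requirement in template.items():
        if requirement != "mandatory":
            continue  # e.g. 'i_name' is not a user-supplied field
        value = values.get(name, "")
        # Assumption: a field still holding the placeholder counts as unfilled.
        if value in ("", None, "mandatory"):
            missing.append(name)
    return missing


filled = {
    "last_name": "Mustermann",
    "first_name": "Franz",
    "email": "franz_mustermann@hzg.de",
}
print(missing_mandatory(template, filled))
```

In this example only `keyword` is reported, because it is the one mandatory entry that was not provided.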

In [8]:
# e.g. set email of person responsible for data submission:
my_form.sub.agent.email = 'franz_mustermann@hzg.de'

### activity related information

Again, the generic definition is given in the *dkrz_forms.config.workflow_steps.py* templates. 

For example, the quality assurance (qua) related activity information is defined as:
    
QUA_ACTIVITY= {
    '__doc__': """
        Attributes characterizing the data quality assurance activity:
        - status: status information
        - start_time, end_time: data ingest timing information
        - comment : free text comment
        - ticket_id: related RT ticket number
        - follow_up_ticket: in case new data has to be provided
        - quality_report: dictionary with quality related information (tbd.)
        """,
      'i_name':'qua_activity',
      'status':ACTIVITY_STATUS,
      'error_status':ERROR_STATUS,
      'qua_tool_version':"mandatory",
      "start_time":"mandatory",
      "end_time":"optional",
      "comment":"optional",
      "ticket_id": "mandatory",
      "follow_up_ticket": 'optional', # qa feedback to users, follow up actions
      }    

In [22]:
## back to example: submission related activity information
import pprint
pprint.pprint(my_form.sub.activity.__doc__)

('\n'
 '                         Attributes characterizing the form submission '
 'activity:\n'
 '                         \n'
 '                         - comment : free text comment\n'
 '                         - method  : How the submission was generated and '
 'submitted to DKRZ: email or DKRZ form server based \n'
 '                         - status : status information\n'
 '                         - error_status : additional information on error '
 'status\n'
 '                         - ticket_id : related rt ticket number\n'
 '                         - start_time, end_time: start and end time of '
 'activity\n'
 '                         - timestamp: intermediate time stamp information '
 '(update activities)\n'
 '                         - pwd: password to access the form\n'
 '                         ')


### workflow step report documents

Each workflow step produces an output, which is associated with the **entity_out** keyword.

A user defined dictionary can be attached to each output as a **report**.

For example:

   my_form.sub.entity_out.report contains all the user input provided e.g. by mail, in an excel
   sheet or via a (jupyter notebook) form 
   
   my_form.qua.entity_out.report contains the json output of the quality assurance tool as a dictionary 
   
   etc.

In [24]:
# view the submission related information provided by the end user:

pprint.pprint(my_form.sub.entity_out.report.__dict__)

{'__doc__': '\n'
            '                         CORDEX information collected as part of '
            'form completion process\n'
            '                         see CORDEX template\n'
            '                         .. details on entries .. to be '
            'completed\n'
            '                        ',
 'data_information': '',
 'data_path': '',
 'data_qc_comment': '',
 'data_qc_status': '',
 'directory_structure': '',
 'example_file_name': '',
 'exclude_variables_list': '',
 'experiment_id': 'CV_CORDEX, experiment_id',
 'grid_as_specified_if_rotated_pole': '',
 'grid_mapping_name': '',
 'institute_id': 'CV_CORDEX,institute_id',
 'institution': 'CV_CORDEX,institution',
 'model_id': 'CV_CORDEX,model_id',
 'project': 'CORDEX',
 'submission_type': 'initial_submission, update_submission, '
                    'submission_retraction, other',
 'terms_of_use': '',
 'time_period': '',
 'uniqueness_of_tracking_id': 'yes,no',
 'variable_list_day': '',
 'variable_lis

In [None]:
## Example for the quality assurance workflow step (qua):
my_form.qua.entity_out.report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "institute": "CLMcom",
    "model": "CLMcom-CCLM4-8-17-CLM3-5",
    "domain": "AUS-44",
    "driving_experiment":  [ "ICHEC-EC-EARTH"],
    "experiment": [ "history", "rcp45", "rcp85"],
    "ensemble_member": [ "r12i1p1" ],
    "frequency": [ "day", "mon", "sem" ],
    "annotation":
    [
        {
            "scope": ["mon", "sem"],
            "variable": [ "tasmax", "tasmin", "sfcWindmax" ],
            "caption": "attribute <variable>:cell_methods for climatologies requires <time>:climatology instead of time_bnds",
            "comment": "due to the format of the data, climatology is equivalent to time_bnds",
            "severity": "note"
        }
    ]
}
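Once a report is attached as a plain dictionary, it can be inspected with ordinary Python. The sketch below works on a shortened copy of the example report above; the helper `severity_counts` is hypothetical and only illustrates one possible summary.

```python
from collections import Counter

# Shortened copy of the example quality assurance report above.
report = {
    "QA_conclusion": "PASS",
    "annotation": [
        {
            "scope": ["mon", "sem"],
            "variable": ["tasmax", "tasmin", "sfcWindmax"],
            "severity": "note",
        }
    ],
}


def severity_counts(report):
    """Count the affected variables per severity level across all annotations."""
    counts = Counter()
    for annotation in report.get("annotation", []):
        counts[annotation["severity"]] += len(annotation["variable"])
    return dict(counts)


print(report["QA_conclusion"], severity_counts(report))
```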

### Links:


* github repo: https://github.com/IS-ENES-Data/submission_forms
* ...       

In [16]:
# to generate an empty project form including all options for variables
# e.g.: 
ACTIVITY_STATUS = "0:open, 1:in-progress ,2:action-required, 3:paused,4:closed"          
ERROR_STATUS = "0:open,1:ok,2:error"
ENTITY_STATUS = "0:open,1:stored,2:submitted,3:re-opened,4:closed"
CHECK_STATUS = "0:open,1:warning, 2:error,3:ok"
import dkrz_forms
#from dkrz_forms import form_handler, utils
#sf_t = utils.generate_project_form('ESGF_replication')
#print(checks.get_options(sf_t.sub.activity.status))
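The status strings above encode their options as comma-separated `code:label` pairs. A minimal sketch of how such a string can be split into a lookup table is shown here; `parse_options` is a hypothetical helper written for illustration and makes no claim about how `checks.get_options` works.

```python
ACTIVITY_STATUS = "0:open, 1:in-progress ,2:action-required, 3:paused,4:closed"


def parse_options(option_string):
    """Split 'code:label' pairs into a dict mapping integer codes to labels."""
    options = {}
    for item in option_string.split(","):
        code, label = item.split(":", 1)
        # strip() removes the irregular whitespace around codes and labels
        options[int(code.strip())] = label.strip()
    return options


print(parse_options(ACTIVITY_STATUS))
```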

## Related approaches at other sites

Example workflows at other data centers:
* http://eidc.ceh.ac.uk/images/ingestion-workflow/view
* http://www.mdpi.com/2220-9964/5/3/30/pdf
* https://www.rd-alliance.org/sites/default/files/03%20Nurnberger%20-%20DataPublishingWorkflows-CollabMtg20151208_V03.pdf
* http://ropercenter.cornell.edu/polls/deposit-data/
* https://www.arm.gov/engineering/ingest
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/data_ingest.txt
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/how_to_participate.html
* http://www.nodc.noaa.gov/submit/ online tool 
* https://www2.cisl.ucar.edu/resources/cmip-analysis-platform
  * https://xras-submit-ncar.xsede.org/ 
* http://cmip5.whoi.edu/ 
* https://pypi.python.org/pypi/cmipdata/0.6
* ....