# DKRZ data ingest information handling

This demo notebook is intended for data managers only.

The submission_forms package provides a collection of components for managing information about data ingest activities (data transport, data checking, data publication and data archival):

* data submission related information management
  * who, when, what, for which project, data characteristics
* data management related information management
  * ingest, quality assurance, publication, archiving

## Background: approaches at other sites

Example workflows at other data centers:
* http://eidc.ceh.ac.uk/images/ingestion-workflow/view
* http://www.mdpi.com/2220-9964/5/3/30/pdf
* https://www.rd-alliance.org/sites/default/files/03%20Nurnberger%20-%20DataPublishingWorkflows-CollabMtg20151208_V03.pdf
* http://ropercenter.cornell.edu/polls/deposit-data/
* https://www.arm.gov/engineering/ingest
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/data_ingest.txt
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/how_to_participate.html
* http://www.nodc.noaa.gov/submit/ (online tool)
* https://www2.cisl.ucar.edu/resources/cmip-analysis-platform
  * https://xras-submit-ncar.xsede.org/ 
* http://cmip5.whoi.edu/ 
* https://pypi.python.org/pypi/cmipdata/0.6

## DKRZ ingest workflow system

Data ingest requests are submitted via
* a Jupyter notebook on the server (http://data-forms.dkrz.de:8080), or
* a Jupyter notebook filled in at home and sent to DKRZ (email, rt system)
  * download form_template from: (tbd. github or redmine or recommend pypi installation only - which includes all templates .. )

Data ingest request workflow:
* ingest request related information is stored in JSON format in a git repo
* all workflow steps are reflected in JSON subfields
* the workflow JSON format follows a W3C PROV aligned schema and can be transformed to W3C PROV format (and visualized as a PROV graph)
* for search, the git repo can be indexed into an (in-memory) key-value DB
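
As a rough illustration of the last point, the sketch below loads every JSON form found below a repository directory into an in-memory dictionary and searches it by a dotted key path. The function names `build_index`, `lookup` and `find_forms`, as well as the example field `sub.agent.last_name`, are assumptions made for this illustration; they are not part of the submission_forms package.

```python
import json
import os
import tempfile


def build_index(repo_dir):
    """Map each JSON file path (relative to repo_dir) to its parsed content."""
    index = {}
    for root, dirs, files in os.walk(repo_dir):
        dirs[:] = [d for d in dirs if d != ".git"]  # do not descend into git metadata
        for name in files:
            if name.endswith(".json"):
                path = os.path.join(root, name)
                with open(path) as handle:
                    index[os.path.relpath(path, repo_dir)] = json.load(handle)
    return index


def lookup(document, dotted_key):
    """Follow a dotted key path such as 'sub.agent.last_name'; None if absent."""
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find_forms(index, dotted_key, value):
    """Return the sorted index keys of all forms whose field equals value."""
    return sorted(key for key, doc in index.items() if lookup(doc, dotted_key) == value)


# Demonstration with two made-up forms in a throw-away directory.
demo_repo = tempfile.mkdtemp()
demo_forms = {
    os.path.join("CORDEX", "form_a.json"): {"sub": {"agent": {"last_name": "mueller"}}},
    os.path.join("CMIP6", "form_b.json"): {"sub": {"agent": {"last_name": "meier"}}},
}
for rel_path, content in demo_forms.items():
    full_path = os.path.join(demo_repo, rel_path)
    os.makedirs(os.path.dirname(full_path))
    with open(full_path, "w") as handle:
        json.dump(content, handle)

index = build_index(demo_repo)
matches = find_forms(index, "sub.agent.last_name", "mueller")
print(matches)
```

A plain dictionary is enough for a few hundred forms; a dedicated key-value store only becomes interesting once the index has to be shared between processes.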

In [17]:
# do this in case you want to change imported module code while working with this notebook
# -- (for development and testing purposes only)
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## demo examples - step by step

Data managers have two separate application scenarios for data ingest information management:
* interactive adaptation of information for specific individual data ingest activities
* automatic adaptation of information by integration into their own data management scripts (e.g. data quality assurance or data publication)
  * example: a successful ESGF publication can automatically update the publication workflow step information

### Step 1: find and load a specific data ingest activity related form

* Alternative A)
   * check out the git repo https://gitlab.dkrz.de/DKRZ-CMIP-Pool/data_forms_repo
   * this repo contains all completed submission forms
   * all data manager related changes are also committed there
   * subdirectories in this repo relate to the individual projects (e.g. CMIP6, CORDEX, ESGF_replication, ..)
   * each entry there contains the last name of the data submission originator  
   
* Alternative B) (not yet documented, only prototype) 
   * use the search interface and API of the search index on all submission forms
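
For Alternative A, a local checkout can be searched with a few lines of standard-library code. The sketch below assumes the layout described above (one subdirectory per project, file names containing the last name of the originator). The helper name `find_form_files` and the demo file names are made up for illustration.

```python
import glob
import os
import tempfile


def find_form_files(repo_dir, project, last_name):
    """Return JSON form files of one project whose name contains last_name."""
    pattern = os.path.join(repo_dir, project, "*" + last_name + "*.json")
    return sorted(glob.glob(pattern))


# Demonstration on a throw-away directory that mimics the repo layout.
demo_repo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_repo, "CORDEX"))
for file_name in ("CORDEX_mueller_1.json", "CORDEX_meier_2.json"):
    open(os.path.join(demo_repo, "CORDEX", file_name), "w").close()

found = find_form_files(demo_repo, "CORDEX", "mueller")
print([os.path.basename(path) for path in found])
```

Each path returned this way could then be passed to form_handler.load_workflow_form, as done with info_file in the next cell.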
  
  

In [23]:
## To do: include different examples how to query for data ingest activities based on different properties

info_file = "/home/stephan/tmp/Repos/form_repo/test/test_testsuite_1234.json"

from dkrz_forms import form_handler

my_form = form_handler.load_workflow_form(info_file)

### interactive "help": use ?form.part and tab completion:

In [None]:
?my_form


In [None]:
?my_form.sub

In [None]:
# move cursor after "." location and press "Tab"
my_form.

### Step 2: complete information in specific workflow step

* the workflow steps of a specific project are listed in my_form.workflow
* my_form.workflow is normally a list of [short_name, long_name] pairs, for example
  * [[u'sub', u'data_submission'], [u'rev', u'data_submission_review'], [u'ing', u'data_ingest'], [u'qua', u'data_quality_assurance']]
* thus my_form.sub contains the data_submission related information, my_form.rev the review related information, etc.
   * the workflow step related information dictionaries are configured based on config/project_config.py
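
Because the workflow list carries the short attribute name of every step, a script can visit all steps without hard-coding them. The sketch below uses a small stand-in class instead of a real form object so that it runs on its own; `Namespace`, `demo_form` and the status values are invented for this example.

```python
class Namespace(object):
    """Stand-in attribute container used instead of a real form object."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


demo_form = Namespace(
    workflow=[["sub", "data_submission"], ["rev", "data_submission_review"]],
    sub=Namespace(activity=Namespace(status="complete")),
    rev=Namespace(activity=Namespace(status="open")),
)

# Collect the status of every workflow step, keyed by the long step name.
step_status = {}
for short_name, long_name in demo_form.workflow:
    step = getattr(demo_form, short_name)  # e.g. demo_form.sub
    step_status[long_name] = step.activity.status

for long_name in sorted(step_status):
    print("%s: %s" % (long_name, step_status[long_name]))
```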
   

In [24]:
print(my_form.workflow)

[[u'sub', u'data_submission'], [u'rev', u'data_submission_review'], [u'ing', u'data_ingest'], [u'qua', u'data_quality_assurance'], [u'pub', u'data_publication']]


Each workflow_step dictionary is structured consistently into three parts:
* activity: activity related information
* agent: responsible person related info
* entity_out: specific output information of this workflow step
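
This three-part layout mirrors the W3C PROV core types Activity, Agent and Entity. As a hedged sketch of how one step could be mapped onto PROV-style statements, the function below emits plain tuples using the PROV relation names `wasAssociatedWith` and `wasGeneratedBy`. The identifier scheme and the input dictionary layout are assumptions, and this is not the transformation implemented in the package.

```python
def step_to_prov(step_name, step):
    """Turn one workflow-step dictionary into PROV-style statements."""
    activity_id = "activity:" + step_name
    agent_id = "agent:" + step["agent"]["responsible_person"]
    entity_id = "entity:" + step_name + "_out"
    return [
        ("activity", activity_id),
        ("agent", agent_id),
        ("entity", entity_id),
        # the agent was responsible for the activity
        ("wasAssociatedWith", activity_id, agent_id),
        # the output entity was produced by the activity
        ("wasGeneratedBy", entity_id, activity_id),
    ]


# Made-up step content, shaped like the activity/agent/entity_out layout.
demo_step = {
    "activity": {"status": "complete"},
    "agent": {"responsible_person": "hdh"},
    "entity_out": {"tag": "a:b:c"},
}

statements = step_to_prov("data_ingest", demo_step)
for statement in statements:
    print(statement)
```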


In [None]:
review = my_form.rev
?review.activity


### workflow step: submission validation

* ToDo: Split in start/end information update actions
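
One possible way to split the update into start and end actions is a context manager that stamps `start_time` on entry and `end_time` plus a status on exit. This is only a sketch of the idea behind the ToDo: `track_step`, the status strings and the stand-in `Section` class are hypothetical and not part of the package.

```python
from contextlib import contextmanager
from datetime import datetime


class Section(object):
    """Stand-in attribute container used instead of a real form section."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


@contextmanager
def track_step(step, person):
    """Record start and end information of a workflow step around a block."""
    step.agent.responsible_person = person
    step.activity.start_time = str(datetime.now())
    step.activity.status = "in progress"
    try:
        yield step
    except Exception:
        step.activity.status = "failed"
        raise
    else:
        step.activity.status = "complete"
    finally:
        # the end time is stamped for successful and failed runs alike
        step.activity.end_time = str(datetime.now())


demo_step = Section(
    activity=Section(status="", start_time="", end_time=""),
    agent=Section(responsible_person=""),
)

with track_step(demo_step, "stephan"):
    pass  # the actual review work would happen here

print(demo_step.activity.status)
```

If the work inside the `with` block raises, the step is marked as failed and the exception is passed on, so a calling script can still decide whether to save the form.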

In [25]:
workflow_form = form_handler.load_workflow_form(info_file)
   
review = workflow_form.rev
# print predefined information keys:
print(list(review.activity.__dict__.keys()))
print(list(review.agent.__dict__.keys()))
print(list(review.entity_out.__dict__.keys()))

# additional information keys can be added at will,
# but they are invisible to generic information management tools

review.activity.review_comment = "corrected information related to quality of submitted data"
review.activity.review_status = "reviewed"
review.activity.ticket_id = "25389"
review.activity.start_time = "..."
review.activity.end_time = "..."
    
review.agent.responsible_person = "stephan"

review.entity_out.comment = "This submission is related to submission abc_cde"
review.entity_out.tag = "abc_cde"
review.entity_out.report = {'x':'y'}   # result of validation in a dict (self defined properties)
review.entity_out.date = "..."

# ToDo: test and document save_form for data managers (config setting for repo)   
sf = form_handler.save_form(workflow_form, "kindermann: form_review()")

[u'comment', u'status', u'start_time', u'i_name', u'ticket_url', u'end_time', u'report', u'__doc__', u'ticket_id']
[u'responsible_person', u'i_name', u'__doc__']
[u'comment', u'status', u'i_name', u'repo', u'date', u'tag', u'report', u'__doc__']


Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_1234.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_1234.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_1234.json
 
 --- commit message:[master (root-commit) 3e684bc] Form Handler: submission form for user testsuite saved using prefix test_testsuite_1234 ## kindermann: form_review()
 2 files changed, 489 insertions(+)
 create mode 100644 test_testsuite_1234.ipynb
 create mode 100644 test_testsuite_1234.json


### add data ingest step related information

__Comment:__ alternatively, tools can assign workflow_step related information 
directly via dictionaries. This is only recommended for data managers, 
who must make sure that the structure is consistent with
the preconfigured one given in config/project_config.py 
* example: validation.activity.\__dict\__ = data_manager_generated_dict
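
A simple safeguard for this dictionary-based route is to compare the keys of the generated dictionary with those already present on the section before assigning it. The helper name `assign_checked` and the stand-in `Section` class are assumptions for this sketch; in practice the expected keys come from config/project_config.py.

```python
class Section(object):
    """Stand-in attribute container used instead of a real form section."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


def assign_checked(section, new_values):
    """Replace section.__dict__ only if new_values has exactly the same keys."""
    expected = set(section.__dict__)
    given = set(new_values)
    if expected != given:
        raise ValueError(
            "missing keys: %s, unexpected keys: %s"
            % (sorted(expected - given), sorted(given - expected))
        )
    section.__dict__ = dict(new_values)


demo_activity = Section(comment="", status="", ticket_id="")

# Same key set as before: the assignment is accepted.
assign_checked(
    demo_activity,
    {"comment": "checked", "status": "reviewed", "ticket_id": "25389"},
)
print(demo_activity.status)

# Incomplete dictionary: the assignment is rejected and nothing changes.
try:
    assign_checked(demo_activity, {"comment": "only one key"})
except ValueError as error:
    print("rejected: %s" % error)
```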

In [26]:
workflow_form = form_handler.load_workflow_form(info_file)
   
ingest = workflow_form.ing

In [None]:
?ingest.entity_out

In [27]:
# agent related info

ingest.agent.responsible_person = "hdh"

# activity related info

ingest.activity.comment = "data pull: credentials needed for remote site"
ingest.activity.status = "complete"
ingest.activity.start_time = ""
ingest.activity.end_time = ""

# detailed info in free report dictionary/sub_form
my_activity_report = ingest.activity.report
my_activity_report.__doc__="data ingest protocol"
my_activity_report.anykey = "anyinfo"

# report of the ingest process (entity_out of ingest workflow step)
ingest_report = ingest.entity_out
from datetime import datetime
ingest_report.date = str(datetime.now())
ingest_report.tag = "a:b:c"  # tag structure to be defined
ingest_report.status = "complete"
# free entries for detailed report information
ingest_report.report.remote_server = "gridftp.awi.de://export/data/CMIP6/test"
ingest_report.report.server_credentials = "in server_cred.krb keypass"
ingest_report.report.target_path = ".."


In [None]:
# move cursor after "." and press "Tab" to list the report entries
ingest_report.report.

### workflow step: data quality assurance

In [28]:
from datetime import datetime
workflow_form = form_handler.load_workflow_form(info_file)
   
qua = workflow_form.qua

In [29]:
qua.agent.responsible_person = "hdh"

qua.activity.comment = " .. "
qua.activity.status = "ok"

qua.entity_out.date = str(datetime.now())
qua.entity_out.status = "dkrz_qua:esgf_ready"
qua.entity_out.report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "institute": "CLMcom",
    "model": "CLMcom-CCLM4-8-17-CLM3-5",
    "domain": "AUS-44",
    "driving_experiment":  [ "ICHEC-EC-EARTH"],
    "experiment": [ "history", "rcp45", "rcp85"],
    "ensemble_member": [ "r12i1p1" ],
    "frequency": [ "day", "mon", "sem" ],
    "annotation":
    [
        {
            "scope": ["mon", "sem"],
            "variable": [ "tasmax", "tasmin", "sfcWindmax" ],
            "caption": "attribute <variable>:cell_methods for climatologies requires <time>:climatology instead of time_bnds",
            "comment": "due to the format of the data, climatology is equivalent to time_bnds",
            "severity": "note"
        }
    ]
}


### workflow step: data publication

In [30]:
workflow_form = form_handler.load_workflow_form(info_file)
   
pub = workflow_form.pub

pub.agent.responsible_person = "katharina"

pub.activity.comment = "..."
pub.activity.end_time = ".."
pub.activity.report = {}   # activity related report information

pub.entity_out.report = {} # the report of the publication action - all info characterizing the publication

In [31]:
sf = form_handler.save_form(workflow_form, "kindermann: form demo run 1")



Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_1234.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_1234.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_1234.json
 
 --- commit message:[master 5118bf4] Form Handler: submission form for user testsuite saved using prefix test_testsuite_1234 ## kindermann: form demo run 1
 1 file changed, 4 insertions(+), 4 deletions(-)
