# DKRZ data ingest information handling

This demo notebook is for data managers only!

The submission_forms package provides a collection of components for managing information about data ingest activities (data transport, data checking, data publication and data archival):

* data submission related information management
  * who, when, what, for which project, data characteristics
* data management related information management
  * ingest, quality assurance, publication, archiving

## Background: approaches at other sites

Example workflows at other data centers:
* http://eidc.ceh.ac.uk/images/ingestion-workflow/view
* http://www.mdpi.com/2220-9964/5/3/30/pdf
* https://www.rd-alliance.org/sites/default/files/03%20Nurnberger%20-%20DataPublishingWorkflows-CollabMtg20151208_V03.pdf
* http://ropercenter.cornell.edu/polls/deposit-data/
* https://www.arm.gov/engineering/ingest
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/data_ingest.txt
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/how_to_participate.html
* http://www.nodc.noaa.gov/submit/ (online tool)
* https://www2.cisl.ucar.edu/resources/cmip-analysis-platform
  * https://xras-submit-ncar.xsede.org/ 
* http://cmip5.whoi.edu/ 
* https://pypi.python.org/pypi/cmipdata/0.6

## DKRZ ingest workflow system

A data ingest request can be made via
* a Jupyter notebook on the server (http://data-forms.dkrz.de:8080), or
* a Jupyter notebook filled in at home and sent to DKRZ (email, rt system)
  * download form_template from: (tbd: GitHub, Redmine, or recommend PyPI installation only, which includes all templates)

Data ingest request workflow:
* information related to an ingest request is stored in JSON format in a git repo
* all workflow steps are reflected in JSON subfields
* the workflow JSON format follows a W3C PROV aligned schema and can be transformed to W3C PROV format (and visualized as a PROV graph)
* for search, the git repo can be indexed into an (in-memory) key-value DB
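
The storage and search idea in the list above can be sketched with the standard library alone. This is only an illustration: the field names in `sample_form` are a reduced, hypothetical record and not the actual dkrz_forms schema, a temporary directory stands in for the git checkout, and a plain Python dict plays the role of the in-memory key-value DB.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, heavily reduced form record: one JSON subfield per workflow step
sample_form = {
    "project": "test",
    "sub": {"activity": {"status": "3:submitted"}, "agent": {"last_name": "testsuite"}},
    "rev": {"activity": {"status": "1:in-review"}},
}

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. 'rev.activity.status'."""
    flat = {}
    for key, value in record.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat

def build_index(repo_dir):
    """Index every *.json file below repo_dir as {file name: flattened record}."""
    index = {}
    for path in sorted(Path(repo_dir).rglob("*.json")):
        with path.open() as handle:
            index[path.name] = flatten(json.load(handle))
    return index

# A temporary directory stands in for the checked-out form repo
with tempfile.TemporaryDirectory() as repo:
    project_dir = Path(repo) / "test"
    project_dir.mkdir()
    (project_dir / "test_testsuite_1.json").write_text(json.dumps(sample_form))
    index = build_index(repo)

# Query: all forms whose review step is currently in review
in_review = [name for name, rec in index.items()
             if rec.get("rev.activity.status") == "1:in-review"]
print(in_review)
```

Flattening to dotted keys keeps the index a simple one-level mapping per form, which is roughly what a key-value store would hold.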

In [1]:
# do this in case you want to change imported module code while working with this notebook
# -- (for development and testing purposes only)
%load_ext autoreload
%autoreload 2

In [15]:
# generate an empty project form including all options for variables
# e.g.: 

from dkrz_forms import form_handler, utils
sf_t = utils.generate_project_form('ESGF_replication')
print(utils.show_options(sf_t.sub.activity.status))
print(utils.show_options(sf_t.rev.activity.status))
print(utils.show_options(sf_t.ing.activity.status))
print(utils.show_options(sf_t.pub.activity.status))

['0:initialized', '1:generated', '2:checked', '2:incomplete', '3:submitted', '4:re-opened', '5:re-submitted']
['0:open', '1:in-review', '2:adaption-needed', '3:accepted']
['0:open', '1:in-progress', '2:ready', '2:delayed']
['0:open', '1:in-progress', '2:published', '2:delayed', '3:republished']
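
The option lists printed above all follow a `<number>:<label>` pattern. The helpers below are a small sketch for working with such strings; they are not part of dkrz_forms, and reading the number as the stage within a workflow step is an assumption based only on the printed lists (note that two options may share a number, such as `2:checked` and `2:incomplete`).

```python
from collections import defaultdict

# Option list copied from the output above (submission step)
submission_options = ['0:initialized', '1:generated', '2:checked', '2:incomplete',
                      '3:submitted', '4:re-opened', '5:re-submitted']

def parse_status(status):
    """Split '2:checked' into (2, 'checked')."""
    rank, _, label = status.partition(":")
    return int(rank), label

def group_by_rank(options):
    """Group option labels by their leading number."""
    groups = defaultdict(list)
    for option in options:
        rank, label = parse_status(option)
        groups[rank].append(label)
    return dict(groups)

def is_valid_status(status, options):
    """True if the status string is one of the configured options."""
    return status in options

print(parse_status('3:submitted'))
print(group_by_rank(submission_options)[2])
print(is_valid_status('finished', submission_options))
```

A check like `is_valid_status` could be run before saving a form, so that free-text values do not slip into fields that generic tools expect to hold one of the configured options.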


## demo examples - step by step

Data managers have two separate application scenarios for data ingest information management:
* interactive information adaptation for specific individual data ingest activities
* automatic information adaptation by integration into their own data management scripts (e.g. data quality assurance or data publication)
  * example: a successful ESGF publication can automatically update the publication workflow step information
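
The automatic scenario can be sketched as a small wrapper that runs a publication action and mirrors its outcome in the workflow step. Everything here is a stand-in: `new_step` builds a plain object with the `activity` / `agent` / `entity_out` layout, `fake_publish` is a hypothetical placeholder for a real publication call, and mapping a failure to `2:delayed` is an assumption. The status strings themselves are taken from the publication options listed earlier.

```python
from datetime import datetime
from types import SimpleNamespace

def new_step():
    """Stand-in for one workflow step (activity / agent / entity_out)."""
    return SimpleNamespace(
        activity=SimpleNamespace(status="0:open", start_time="", end_time=""),
        agent=SimpleNamespace(responsible_person=""),
        entity_out=SimpleNamespace(report={}),
    )

def record_publication(step, person, publish):
    """Run publish() and record start, end and outcome in the workflow step."""
    step.agent.responsible_person = person
    step.activity.status = "1:in-progress"
    step.activity.start_time = str(datetime.now())
    try:
        step.entity_out.report = publish()
        step.activity.status = "2:published"
    except Exception as error:  # broad on purpose: any failure is recorded
        step.activity.status = "2:delayed"
        step.entity_out.report = {"error": str(error)}
    step.activity.end_time = str(datetime.now())
    return step

def fake_publish():
    # hypothetical placeholder for a real ESGF publication call
    return {"datasets_published": 1}

pub = record_publication(new_step(), "kb", fake_publish)
print(pub.activity.status, pub.entity_out.report)
```

In a real script the step would be `workflow_form.pub` of a loaded form, followed by `form_handler.save_form(workflow_form, message)` as in the interactive cells of this notebook.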

### Step 1: find and load a specific data ingest activity related form

* Alternative A)
   * check out the git repo https://gitlab.dkrz.de/DKRZ-CMIP-Pool/data_forms_repo
   * this repo contains all completed submission forms
   * all data manager related changes are also committed there
   * subdirectories in this repo relate to the individual projects (e.g. CMIP6, CORDEX, ESGF_replication, ..)
   * each entry there contains the last name of the data submission originator  
   
* Alternative B) (not yet documented, only prototype) 
   * use the search interface and API of the search index over all submission forms
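
For Alternative A, locating a form in a checked-out repo can be as simple as a path filter. The sketch below assumes the layout `<project>/<project>_<last_name>_<suffix>.json`, which is inferred from the example path `test/test_testsuite_1.json` used in this notebook and may not hold for every project. The helper `find_forms` and the second file name are made up for illustration.

```python
import tempfile
from pathlib import Path

def find_forms(repo_dir, project=None, last_name=None):
    """Return form files (relative paths) below repo_dir, optionally filtered."""
    pattern = f"{project or '*'}/*.json"
    hits = []
    for path in sorted(Path(repo_dir).glob(pattern)):
        if last_name is None or f"_{last_name}_" in path.name:
            hits.append(path.relative_to(repo_dir).as_posix())
    return hits

# A temporary directory with two empty forms stands in for the repo checkout
with tempfile.TemporaryDirectory() as repo:
    for name in ("test/test_testsuite_1.json", "CORDEX/CORDEX_miller_1.json"):
        target = Path(repo) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("{}")
    all_forms = find_forms(repo)
    testsuite_forms = find_forms(repo, last_name="testsuite")
    cordex_forms = find_forms(repo, project="CORDEX")

print(all_forms)
print(testsuite_forms)
print(cordex_forms)
```

A path returned this way can be joined with the repo location and passed to `utils.load_workflow_form`.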
  
  

In [2]:
# To do: include different examples of how to query for data ingest activities based on different properties

#info_file = "/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json"
info_file = "/home/stephan/Forms/local_repo/test/test_testsuite_1.json"
from dkrz_forms import form_handler, utils
from datetime import datetime

my_form = utils.load_workflow_form(info_file)

### interactive "help": use ?form.part and tab completion:

In [3]:
?my_form


In [5]:
?my_form.sub

In [None]:
# place the cursor after the "." and press "Tab"
my_form.

### Step 2: complete information in specific workflow step

* workflow steps of specific project are given in my_form.workflow
* normally my_form.workflow is given as 
  * [[u'sub', u'data_submission'], [u'rev', u'data_submission_review'], [u'ing', u'data_ingest'], [u'qua', u'data_quality_assurance']]
* thus my_form.sub contains the data_submission related information, my_form.rev the review related information, etc.
   * the workflow step related information dictionaries are configured based on config/project_config.py
   

In [7]:
print(my_form.workflow)

[['sub', 'data_submission'], ['rev', 'data_submission_review'], ['ing', 'data_ingest'], ['qua', 'data_quality_assurance'], ['pub', 'data_publication']]


Each workflow_step dictionary is structured consistently as follows:
* activity: activity related information
* agent: responsible person related info
* entity_out: specific output information of this workflow step
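
The three-part structure can be pictured as a nested dictionary that is exposed through attribute access. The sketch below uses `types.SimpleNamespace` as a stand-in; how dkrz_forms actually builds its form objects is not shown in this notebook, and the keys inside each part are illustrative.

```python
from types import SimpleNamespace

def to_namespace(data):
    """Recursively turn nested dicts into attribute-style objects."""
    if isinstance(data, dict):
        return SimpleNamespace(**{key: to_namespace(value) for key, value in data.items()})
    return data

def to_dict(obj):
    """Inverse of to_namespace, e.g. for writing the step back to JSON."""
    if isinstance(obj, SimpleNamespace):
        return {key: to_dict(value) for key, value in vars(obj).items()}
    return obj

# Illustrative step layout: activity / agent / entity_out
step_dict = {
    "activity": {"status": "0:open", "start_time": "", "end_time": ""},
    "agent": {"responsible_person": ""},
    "entity_out": {"tag": "", "comment": ""},
}

rev = to_namespace(step_dict)
rev.activity.status = "1:in-review"
rev.agent.responsible_person = "sk"

print(to_dict(rev)["activity"]["status"])
print(sorted(vars(rev)))
```

Keeping the same three parts in every step is what lets generic tools walk any workflow step without project-specific code.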


In [None]:
review = my_form.rev
?review.activity


### workflow step: submission review

* ToDo: Split into start/end information update actions

In [11]:
# status: ['0:open', '1:in-review', '2:adaption-needed', '3:accepted']
workflow_form = utils.load_workflow_form(info_file)
   
review = workflow_form.rev

utils.show_options(review.activity.status)

['finished']

In [17]:
workflow_form = utils.load_workflow_form(info_file)
   
review = workflow_form.rev

# any additional information keys can be added,
# yet they are invisible to generic information management tools ..
workflow_form.status = "review"

review.activity.status = "1:in-review"
review.activity.start_time = str(datetime.now())
review.activity.review_comment = "data volume check to be done"
review.agent.responsible_person = "sk"

sf = form_handler.save_form(workflow_form, "sk: review started")

review.activity.status = "3:accepted"
review.activity.ticket_id = "25389"
review.activity.end_time = str(datetime.now())

review.entity_out.comment = "This submission is related to submission abc_cde"
review.entity_out.tag = "sub:abc_cde"  # tags are used to relate different forms to each other
review.entity_out.report = {'x':'y'}   # result of validation in a dict (self defined properties)

# ToDo: test and document save_form for data managers (config setting for repo)   
sf = form_handler.save_form(workflow_form, "kindermann: form_review()")



Form Handler - save form status message:
 --- form stored in transfer format in: /home/stephan/Forms/local_repo/test/test_testsuite_1.json
 --- commit message:[master e2c7b33] Form Handler: submission form for user testsuite saved using prefix test_testsuite_1 ## sk: review started
 1 file changed, 3 insertions(+), 3 deletions(-)


Form Handler - save form status message:
 --- form stored in transfer format in: /home/stephan/Forms/local_repo/test/test_testsuite_1.json
 --- commit message:[master 7912fda] Form Handler: submission form for user testsuite saved using prefix test_testsuite_1 ## kindermann: form_review()
 1 file changed, 4 insertions(+), 4 deletions(-)


### add data ingest step related information

__Comment:__ alternatively, tools can assign workflow_step related information
directly via dictionaries. This is only recommended for data managers,
who must make sure the structure is consistent with
the preconfigured one given in config/project_config.py.
* example: validation.activity.\__dict\__ = data_manager_generated_dict
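
Because a direct `__dict__` assignment replaces all attributes at once, a key check beforehand helps keep the structure consistent. The following is a minimal sketch: the `Activity` class and its attributes are a stand-in for a preconfigured workflow step part, and `assign_checked` is a hypothetical helper, not a dkrz_forms function.

```python
class Activity:
    """Stand-in for a preconfigured workflow step part."""
    def __init__(self):
        self.status = "0:open"
        self.start_time = ""
        self.end_time = ""
        self.comment = ""

def assign_checked(target, new_values):
    """Replace target.__dict__ only if the keys match the preconfigured ones."""
    expected = set(vars(target))
    given = set(new_values)
    if given != expected:
        raise ValueError(
            f"missing keys: {sorted(expected - given)}, "
            f"unexpected keys: {sorted(given - expected)}"
        )
    target.__dict__ = dict(new_values)

activity = Activity()
data_manager_generated_dict = {
    "status": "1:in-review",
    "start_time": "2017-01-01 10:00:00",
    "end_time": "",
    "comment": "assigned in one go",
}
assign_checked(activity, data_manager_generated_dict)
print(activity.status)

# An incomplete dictionary is rejected and the object stays unchanged
try:
    assign_checked(activity, {"status": "done"})
except ValueError as error:
    print(error)
```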

In [13]:
workflow_form = utils.load_workflow_form(info_file)
   
ingest = workflow_form.ing

In [14]:
?ingest.entity_out

In [16]:
workflow_form.status = "ingest"

# agent related info
ingest.agent.responsible_person = "hdh"

# activity related info
ingest.activity.status = "started"
ingest.activity.start_time = str(datetime.now())
ingest.activity.comment = "data pull: credentials needed for remote site"
sf = form_handler.save_form(workflow_form, "kindermann: form_review()")



Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master 3920957] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## kindermann: form_review()
 1 file changed, 6 insertions(+), 6 deletions(-)


In [18]:
ingest.activity.status = "completed"
ingest.activity.end_time = str(datetime.now())

# report of the ingest process (entity_out of ingest workflow step)
ingest_report = ingest.entity_out
ingest_report.tag = "a:b:c"  # tag structure to be defined
ingest_report.status = "completed"
# free entries for detailed report information
ingest_report.report.remote_server = "gridftp.awi.de://export/data/CMIP6/test"
ingest_report.report.server_credentials = "in server_cred.krb keypass"
ingest_report.report.target_path = ".."
sf = form_handler.save_form(workflow_form, "kindermann: form_review()")



Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master 935cc25] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## kindermann: form_review()
 1 file changed, 11 insertions(+), 7 deletions(-)


In [None]:
ingest_report.report.

### workflow step: data quality assurance

In [20]:
from datetime import datetime
workflow_form = utils.load_workflow_form(info_file)
   
qua = workflow_form.qua

In [24]:
workflow_form.status = "quality assurance"
qua.agent.responsible_person = "hdh"

qua.activity.status = "starting" 
qua.activity.start_time = str(datetime.now())

sf = form_handler.save_form(workflow_form, "hdh: qa start")





Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master 9cee8bd] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## hdh: qa start
 1 file changed, 3 insertions(+), 3 deletions(-)


In [25]:
qua.entity_out.status = "completed"
qua.entity_out.report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "institute": "CLMcom",
    "model": "CLMcom-CCLM4-8-17-CLM3-5",
    "domain": "AUS-44",
    "driving_experiment":  [ "ICHEC-EC-EARTH"],
    "experiment": [ "history", "rcp45", "rcp85"],
    "ensemble_member": [ "r12i1p1" ],
    "frequency": [ "day", "mon", "sem" ],
    "annotation":
    [
        {
            "scope": ["mon", "sem"],
            "variable": [ "tasmax", "tasmin", "sfcWindmax" ],
            "caption": "attribute <variable>:cell_methods for climatologies requires <time>:climatology instead of time_bnds",
            "comment": "due to the format of the data, climatology is equivalent to time_bnds",
            "severity": "note"
        }
    ]
}
sf = form_handler.save_form(workflow_form, "hdh: qua complete")




Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master a5a5f46] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## hdh: qua complete
 1 file changed, 2 insertions(+), 2 deletions(-)
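
Once a QA report like the one above is stored, it can be queried like any other dictionary. The sketch below uses a reduced copy of that report and a hypothetical helper `annotations_for`. Treating `scope` as the list of frequencies an annotation applies to is an assumption drawn from the example values, not a documented rule.

```python
# Reduced copy of the QA report stored above
qa_report = {
    "QA_conclusion": "PASS",
    "frequency": ["day", "mon", "sem"],
    "annotation": [
        {
            "scope": ["mon", "sem"],
            "variable": ["tasmax", "tasmin", "sfcWindmax"],
            "severity": "note",
        }
    ],
}

def annotations_for(report, variable, frequency):
    """Return the annotations that mention the variable at the given frequency."""
    return [
        note for note in report.get("annotation", [])
        if variable in note.get("variable", []) and frequency in note.get("scope", [])
    ]

monthly = annotations_for(qa_report, "tasmax", "mon")
daily = annotations_for(qa_report, "tasmax", "day")

print(qa_report["QA_conclusion"], len(monthly), len(daily))
print([note["severity"] for note in monthly])
```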


### workflow step: data publication

In [27]:
workflow_form = utils.load_workflow_form(info_file)

workflow_form.status = "publishing"

pub = workflow_form.pub
pub.agent.responsible_person = "katharina"
pub.activity.status = "starting"
pub.activity.start_time = str(datetime.now())

sf = form_handler.save_form(workflow_form, "kb: publishing")



Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master 5b2506b] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## hdh: qua complete
 1 file changed, 4 insertions(+), 4 deletions(-)


In [28]:
pub.activity.status = "completed"
pub.activity.comment = "..."
pub.activity.end_time = ".."
pub.activity.report = {'model':"MPI-M"}   # activity related report information

pub.entity_out.report = {'model':"MPI-M"} # the report of the publication action - all info characterizing the publication
sf = form_handler.save_form(workflow_form, "kb: published")




Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master 64ad592] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## kb: published
 1 file changed, 11 insertions(+), 7 deletions(-)


In [29]:
sf = form_handler.save_form(workflow_form, "kindermann: form demo run 1")



Form Handler - save form status message:
/home/stephan/Repos/ENES-EUDAT/submission_forms/test/forms/test/test_testsuite_123.ipynb
/home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.ipynb
 --- form stored in transfer format in: /home/stephan/tmp/Repos/form_repo/test/test_testsuite_123.json
 
 --- commit message:[master 50d3b45] Form Handler: submission form for user testsuite saved using prefix test_testsuite_123 ## kindermann: form demo run 1
 1 file changed, 2 insertions(+), 2 deletions(-)


In [30]:
sf.sub.activity.commit_hash


u'ae5322685506df134d3915f3050a3c375d0e12e4'