# Data ingest workflow and tools to support the (semi-automatic) workflow steps

Example workflows in other data centers:
* http://eidc.ceh.ac.uk/images/ingestion-workflow/view
* www.mdpi.com/2220-9964/5/3/30/pdf
* https://www.rd-alliance.org/sites/default/files/03%20Nurnberger%20-%20DataPublishingWorkflows-CollabMtg20151208_V03.pdf
* http://ropercenter.cornell.edu/polls/deposit-data/
* https://www.arm.gov/engineering/ingest
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/data_ingest.txt
* https://eosweb.larc.nasa.gov/GEWEX-RFA/documents/how_to_participate.html
* http://www.nodc.noaa.gov/submit/ (online tool)

# DKRZ ingest workflow test system

Data ingest request:
* IPython notebook-based web form, or
* client-side Python library (pip-installable)

Data ingest request workflow:
* the ingest request is stored in JSON format in a git repository
* all workflow steps are reflected in the git repository
* specific workflow steps interact with the RT request tracker
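
The idea that every workflow step shows up as a change in the repository can be illustrated without git itself. The sketch below is not part of `dkrz_forms`; `serialize_form`, `flatten` and `changed_keys` are hypothetical helpers. It writes a form dictionary as JSON with sorted keys, so that successive commits would produce small diffs, and lists the keys that differ between two saved versions.

```python
import json

_MISSING = object()  # marks keys that are absent in one of the two versions


def serialize_form(form):
    """Return stable JSON text: sorted keys and fixed indentation keep diffs small."""
    return json.dumps(form, sort_keys=True, indent=2)


def flatten(form, prefix=""):
    """Flatten nested dicts into 'namespace:key' entries, e.g. 'sub:ticket_id'."""
    flat = {}
    for key, value in form.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + ":"))
        else:
            flat[name] = value
    return flat


def changed_keys(old_form, new_form):
    """List the flattened keys that were added, removed, or modified."""
    old_flat, new_flat = flatten(old_form), flatten(new_form)
    return sorted(key for key in set(old_flat) | set(new_flat)
                  if old_flat.get(key, _MISSING) != new_flat.get(key, _MISSING))


before = {"status": "stored", "sub": {"form_name": "Kindermann_test1"}}
after = {"status": "submitted",
         "sub": {"form_name": "Kindermann_test1", "ticket_id": 22252}}

print(serialize_form(after))
print(changed_keys(before, after))
```

In a real setup the serialized text would be written to the form's JSON file and committed, so the list of changed keys corresponds to what a `git diff` between two commits would show.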

Data provenance:
* workflow information (in git-versioned JSON) is transformed into a W3C PROV document
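
A minimal sketch of what such a transformation could look like, using only the standard library. The output loosely follows the PROV-JSON layout (`entity`, `activity`, `agent`, `used`, `wasAssociatedWith`). The `dkrz:` prefix, its placeholder URI, the identifier scheme, and the function `form_to_prov` are assumptions made for illustration, not the actual DKRZ implementation.

```python
import json


def to_xsd_datetime(text):
    """Turn 'YYYY-MM-DD HH:MM:SS.ffffff' into ISO 8601 form by using a 'T' separator."""
    return text.replace(" ", "T", 1)


def form_to_prov(form, stages=("ing", "che", "pub")):
    """Map the stage namespaces of a form dict onto a PROV-JSON-like dictionary."""
    doc = {
        "prefix": {"dkrz": "http://example.org/dkrz/"},  # placeholder URI (assumption)
        "entity": {},
        "activity": {},
        "agent": {},
        "used": {},
        "wasAssociatedWith": {},
    }
    form_id = "dkrz:form/" + form["sub"]["form_name"]
    doc["entity"][form_id] = {"dkrz:status": form.get("status", "unknown")}

    for stage in stages:
        info = form.get(stage)
        if not info:
            continue  # this stage has not been recorded yet
        activity_id = "dkrz:activity/" + stage
        activity = {}
        if "started" in info:
            activity["prov:startTime"] = to_xsd_datetime(info["started"])
        if "finished" in info:
            activity["prov:endTime"] = to_xsd_datetime(info["finished"])
        doc["activity"][activity_id] = activity
        # each stage activity works on the submission form entity
        doc["used"]["_:used_" + stage] = {
            "prov:activity": activity_id, "prov:entity": form_id}
        person = info.get("responsible_person")
        if person:
            agent_id = "dkrz:agent/" + person
            doc["agent"][agent_id] = {"dkrz:login": person}
            doc["wasAssociatedWith"]["_:assoc_" + stage] = {
                "prov:activity": activity_id, "prov:agent": agent_id}
    return doc


example_form = {
    "status": "checked",
    "sub": {"form_name": "Kindermann_test1"},
    "che": {"responsible_person": "hdh",
            "started": "2016-03-28 19:22:40.212107",
            "finished": "2016-03-28 19:24:13.371716"},
}
prov_doc = form_to_prov(example_form)
print(json.dumps(prov_doc, indent=2, sort_keys=True))
```

Each stage namespace becomes one activity with start and end times, and the responsible person becomes an agent associated with that activity. A dedicated PROV library could replace the hand-built dictionary.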

In [2]:
%load_ext autoreload
%autoreload 2

# Example - demonstrating git usage for submission form provenance capture

* a JSON file (with namespace-prefixed keys) is maintained in a git repository
* stages (workflow steps) are indicated by different namespaces
* helper functions integrate stage changes with information exchange via the RT request tracker
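
How a helper might tie a stage change to a tracker notification is sketched below. This is an illustration under stated assumptions, not the `dkrz_forms` code: the status order is taken from the cells in this notebook, `next_status` and `advance` are hypothetical names, and `StubTracker` is a stand-in that only records comments. Whether a real tracker client offers a `comment` call of the same shape should be checked against its documentation.

```python
# Status order as used in the cells of this notebook.
WORKFLOW = ["submitted", "submission_processing", "ingesting", "ingested",
            "checking", "checked", "publishing", "published"]


class StubTracker(object):
    """Stand-in for a request tracker client; it only records comments."""

    def __init__(self):
        self.comments = []

    def comment(self, ticket_id, text=""):
        self.comments.append((ticket_id, text))
        return True


def next_status(current):
    """Return the status that may follow `current`, or None after the last step."""
    if current not in WORKFLOW:
        return WORKFLOW[0]  # a form not yet in the workflow starts at 'submitted'
    position = WORKFLOW.index(current)
    return WORKFLOW[position + 1] if position + 1 < len(WORKFLOW) else None


def advance(form, new_status, tracker=None):
    """Move a form dict to the next status and report the change to the tracker."""
    expected = next_status(form.get("status"))
    if new_status != expected:
        raise ValueError("cannot go from %r to %r" % (form.get("status"), new_status))
    form["status"] = new_status
    ticket_id = form.get("sub", {}).get("ticket_id")
    if tracker is not None and ticket_id is not None:
        tracker.comment(ticket_id, text="status changed to " + new_status)
    return form


demo_tracker = StubTracker()
demo_form = {"status": "submitted", "sub": {"ticket_id": 22252}}
advance(demo_form, "submission_processing", demo_tracker)
print(demo_form["status"])
print(demo_tracker.comments)
```

Rejecting out-of-order transitions keeps the git history of a form consistent with the intended sequence of stages.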

In [10]:
# import the development tree (not the egg-installed version) of the form handler
# set up RT interaction (rt_pwd must be defined in ~/.dkrz_form/myconfig.py)

import sys

sys.path.append('/home/stephan/Repos/ENES-EUDAT/submission_forms')
from dkrz_forms import form_handler

sf = form_handler.init_form("test")   

sub_dict = sf.__dict__

import rt

tracker = rt.Rt('https://dm-rt.dkrz.de/REST/1.0/','kindermann',form_handler.rt_pwd)
tracker.login()

Cordex submission form intitialized: sf
(For the curious: sf is used to store and manage all your information)


True

In [None]:
print(sub_dict)

In [11]:
form_handler.form_save(sf)  # put json representation into repo



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [13]:
sf.status = "submitted"

sf.sub['ticket_id']=22252
sf.sub['checks_done']="cordex form check v0.1"
sf.sub['ticket_url']='https://dm-rt.dkrz.de/Ticket/Display.html?id=22252'
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [14]:
sf.status = "submission_processing"
sf.sub['responsible_person']= "pl"
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [15]:
from datetime import datetime
sf.status = "ingesting"
#sf.check = {}
sf.ing = {'responsible_person': 'pl',
          'started': str(datetime.now())}
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [16]:
sf.status = "ingested"
sf.ing['finished']= str(datetime.now())
sf.ing['target_path']='/scratch/bb0303/data/cordex/test1'
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [17]:
sf.status = "checking"
sf.che = {'responsible_person': 'hdh',
          'started': str(datetime.now()),
          'tool_version': 'qa_dkrz_v1.1'}
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [21]:
sf.status = 'checked'

sf.che['results']= '/path/to/results'
sf.che['finished']= str(datetime.now())
form_handler.form_save(sf)




 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [22]:
sf.status = 'publishing'

sf.pub = {'responsible_person': 'kberger',
          'started': str(datetime.now())}
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 


In [23]:
sf.status = 'published'

sf.pub['finished']  = str(datetime.now()) 
sf.pub['search_string'] = "project=cordex&model=hmoc&institute=MPI-M"
           
form_handler.form_save(sf)



 Status message:
-- your submission form Kindermann_test1 was stored in repository 
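
Because each stage stores its `started` and `finished` values as strings produced by `str(datetime.now())`, the time spent in a stage can be recovered later from the saved form. The following sketch assumes that string layout; `parse_timestamp` and `stage_duration` are hypothetical helpers, and the sample values only mirror the format written by the cells above.

```python
from datetime import datetime

# str(datetime.now()) omits the fractional part when microseconds are zero,
# so both layouts are tried.
TIME_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")


def parse_timestamp(text):
    """Parse a timestamp string as written by str(datetime.now())."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp: %r" % text)


def stage_duration(stage):
    """Seconds between 'started' and 'finished', or None if either is missing."""
    if "started" not in stage or "finished" not in stage:
        return None
    delta = parse_timestamp(stage["finished"]) - parse_timestamp(stage["started"])
    return delta.total_seconds()


che = {"started": "2016-03-28 19:22:40.212107",
       "finished": "2016-03-28 19:24:13.371716"}
running = {"started": "2016-03-28 19:24:52"}

print(stage_duration(che))
print(stage_duration(running))
```

Storing `datetime.now().isoformat()` instead would give an unambiguous ISO 8601 string, at the cost of changing the format of already stored forms.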


In [28]:
import json

# read the stored form back from the repository working tree
with open('/home/stephan/tmp/CORDEX/Kindermann_test1.json', "r") as form_file:
    json_info = form_file.read()
#json_info["__type__"] = "sf",
sf_dict = json.loads(json_info)

{u'status': u'published', u'sub': {u'status': [u'stored'], u'package_name': u'Kindermann_test1.json', u'timestamp': u'2016-03-28 19:25:28.340678', u'ticket_url': u'https://dm-rt.dkrz.de/Ticket/Display.html?id=22252', u'checks_done': u'cordex form check v0.1', u'repo': u'/home/stephan/tmp/CORDEX', u'ticket_id': 22252, u'package_path': u'/home/stephan/tmp/CORDEX/Kindermann_test1.json', u'form_name': u'Kindermann_test1', u'responsible_person': u'pl'}, u'first_name': u'Stephan', u'last_name': u'Kindermann', u'che': {u'started': u'2016-03-28 19:22:40.212107', u'finished': u'2016-03-28 19:24:13.371716', u'tool_version': u'qa_dkrz_v1.1', u'responsible_person': u'hdh', u'results': u'/path/to/results'}, u'keyword': u'test1', u'data_path': u'/scratch/b20030/data/cordex/', u'pub': {u'started': u'2016-03-28 19:24:52.312040', u'search_string': u'project=cordex&model=hmoc&institute=MPI-M', u'finished': u'2016-03-28 19:25:28.340155', u'responsible_person': u'kberger'}, u'institution': u'DKRZ', u'chec