# Using the ctapipe Provenance service

The provenance functionality is used automatically by most of ctapipe (particularly `ctapipe.core.Tool` and the functions in `ctapipe.io` and `ctapipe.utils`), so normally you don't have to work with it directly. It tracks both input and output files, as well as details of the machine and software environment in which a Tool was executed.

Here we show some very low-level functions of this system:

In [1]:
from ctapipe.core import Provenance
from pprint import pprint



## Activities

The basis of Provenance is an *activity*, which is generally an executable or a step in a script. Activities can be nested (an activity can contain sub-activities), as shown below, but normally this is not required:

In [2]:
p = Provenance()  # note this is a singleton, so there is only ever one global provenance object
p.clear()
p.start_activity()
p.add_input_file("test.txt")

p.start_activity("sub")
p.add_input_file("subinput.txt")
p.add_input_file("anothersubinput.txt")
p.add_output_file("suboutput.txt")
p.finish_activity("sub")

p.start_activity("sub2")
p.add_input_file("sub2input.txt")
p.finish_activity("sub2")

p.finish_activity()

In [3]:
p.finished_activity_names

['sub', 'sub2', '/usr/local/bin/python3']

Activities have associated input and output *entities* (files or other objects):

In [4]:
[(x['activity_name'], x['input']) for x in p.provenance]

[('sub',
  [{'url': '/github/workspace/docs/examples/subinput.txt', 'role': None},
   {'url': '/github/workspace/docs/examples/anothersubinput.txt',
    'role': None}]),
 ('sub2',
  [{'url': '/github/workspace/docs/examples/sub2input.txt', 'role': None}]),
 ('/usr/local/bin/python3',
  [{'url': '/github/workspace/docs/examples/test.txt', 'role': None}])]
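
Output entities are recorded in the same way under the `output` key. The following is a minimal sketch that runs without ctapipe: `records` is a hand-written stand-in that mimics only the keys of `p.provenance` used here, and `entity_urls` is an illustrative helper, not part of the ctapipe API.

```python
# Hand-written stand-in for p.provenance (assumption: only the keys used
# below are reproduced, and the paths are shortened for readability).
records = [
    {
        "activity_name": "sub",
        "input": [
            {"url": "subinput.txt", "role": None},
            {"url": "anothersubinput.txt", "role": None},
        ],
        "output": [{"url": "suboutput.txt", "role": None}],
    },
    {
        "activity_name": "sub2",
        "input": [{"url": "sub2input.txt", "role": None}],
        "output": [],
    },
]


def entity_urls(records, kind):
    """Map each activity name to the URLs of its 'input' or 'output' entities."""
    return {
        rec["activity_name"]: [entity["url"] for entity in rec.get(kind, [])]
        for rec in records
    }


inputs = entity_urls(records, "input")
outputs = entity_urls(records, "output")
print(inputs)
print(outputs)
```

With a real provenance object, the same helper could be called as `entity_urls(p.provenance, "output")`.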

Activities track when they were started and finished:

In [5]:
[(x['activity_name'], x['duration_min']) for x in p.provenance]

[('sub', 8.33333335137354e-05),
 ('sub2', 8.333333319399117e-05),
 ('/usr/local/bin/python3', 0.002733333333271304)]
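
The `duration_min` values are expressed in minutes. As a rough cross-check, the sketch below recomputes the duration of the `sub` activity from the `start` and `stop` timestamps that appear in its full record (shown in the next section). The `duration_minutes` helper and the `TIME_FORMAT` constant are illustrative names, not part of ctapipe, and the result agrees only to millisecond precision because the printed timestamps are rounded.

```python
from datetime import datetime

# Format of the 'time_utc' strings printed in this page (assumption: all
# timestamps follow this pattern, with a fractional-seconds part).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def duration_minutes(start_utc, stop_utc):
    """Return the elapsed time between two 'time_utc' strings, in minutes."""
    start = datetime.strptime(start_utc, TIME_FORMAT)
    stop = datetime.strptime(stop_utc, TIME_FORMAT)
    return (stop - start).total_seconds() / 60.0


# Start and stop times of the 'sub' activity, copied from its record
duration = duration_minutes("2020-12-03T15:34:47.096", "2020-12-03T15:34:47.101")
print(duration)
```

The 5 ms between the two timestamps corresponds to roughly 8.3e-05 minutes, in line with the value reported for `sub` above.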

## Full provenance

The provenance object is a list of activities, and for each one many details are collected:

In [6]:
p.provenance[0]

{'activity_name': 'sub',
 'activity_uuid': '956b1559-bbd3-499f-a2a1-c76560853c64',
 'start': {'time_utc': '2020-12-03T15:34:47.096'},
 'stop': {'time_utc': '2020-12-03T15:34:47.101'},
 'system': {'ctapipe_version': '0.1.dev1+gf1cd0bb',
  'ctapipe_resources_version': 'not installed',
  'eventio_version': '1.4.2',
  'ctapipe_svc_path': None,
  'executable': '/usr/local/bin/python3',
  'platform': {'architecture_bits': '64bit',
   'architecture_linkage': '',
   'machine': 'x86_64',
   'processor': '',
   'node': '4494b0f2fe81',
   'version': '#32~18.04.1-Ubuntu SMP Tue Oct 6 10:03:22 UTC 2020',
   'system': 'Linux',
   'release': '5.4.0-1031-azure',
   'libcver': ('glibc', '2.28'),
   'num_cpus': 2,
   'boot_time': '2020-12-03T15:30:35.000'},
  'python': {'version_string': '3.8.2 (default, Feb 26 2020, 15:09:34) \n[GCC 8.3.0]',
   'version': ('3', '8', '2'),
   'compiler': 'GCC 8.3.0',
   'implementation': 'CPython'},
  'environment': {'CONDA_DEFAULT_ENV': None,
   'CONDA_PREFIX': None,
 

This can be better represented in JSON:

In [7]:
print(p.as_json(indent=2))

[
  {
    "activity_name": "sub",
    "activity_uuid": "956b1559-bbd3-499f-a2a1-c76560853c64",
    "start": {
      "time_utc": "2020-12-03T15:34:47.096"
    },
    "stop": {
      "time_utc": "2020-12-03T15:34:47.101"
    },
    "system": {
      "ctapipe_version": "0.1.dev1+gf1cd0bb",
      "ctapipe_resources_version": "not installed",
      "eventio_version": "1.4.2",
      "ctapipe_svc_path": null,
      "executable": "/usr/local/bin/python3",
      "platform": {
        "architecture_bits": "64bit",
        "architecture_linkage": "",
        "machine": "x86_64",
        "processor": "",
        "node": "4494b0f2fe81",
        "version": "#32~18.04.1-Ubuntu SMP Tue Oct 6 10:03:22 UTC 2020",
        "system": "Linux",
        "release": "5.4.0-1031-azure",
        "libcver": [
          "glibc",
          "2.28"
        ],
        "num_cpus": 2,
        "boot_time": "2020-12-03T15:30:35.000"
      },
      "python": {
        "version_string": "3.8.2 (default, Feb 26 2020, 15:09:34

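One visible difference between the two representations is that Python tuples such as `libcver` become JSON arrays, so reading the JSON back yields lists rather than tuples. The sketch below illustrates this with the standard `json` module on a small hand-written record. It does not call `p.as_json`, and the `record` dictionary is an abbreviated stand-in for a real provenance entry.

```python
import json

# Abbreviated stand-in for one provenance record (assumption: only a few
# of the keys shown above are reproduced).
record = {
    "activity_name": "sub",
    "system": {
        "ctapipe_svc_path": None,
        "platform": {"libcver": ("glibc", "2.28"), "num_cpus": 2},
    },
}

text = json.dumps(record, indent=2)  # tuple is written as a JSON array, None as null
restored = json.loads(text)          # the array comes back as a Python list

print(type(record["system"]["platform"]["libcver"]).__name__)
print(type(restored["system"]["platform"]["libcver"]).__name__)
```

Code that compares a stored record with a freshly generated one should therefore not rely on tuple and list values comparing equal.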
## Storing provenance info in output files

* The nested structure can already be stored as-is in a format that allows hierarchies, such as an HDF5 file header.
* For a key=value header in a **FITS file** (using the FITS extended keyword convention to allow keywords longer than 8 characters), or for a table, try flattening the data first:

In [8]:
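def flatten_dict(y):
    """Flatten nested dicts and lists into one dict with dot-separated keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + '.')
        elif isinstance(x, list):
            for i, value in enumerate(x):
                flatten(value, name + str(i) + '.')
        else:
            # leaf value: drop the trailing '.' from the accumulated key
            out[name[:-1]] = x

    flatten(y)
    return out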
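
Note that `flatten_dict` only descends into dicts and lists, so a tuple such as `libcver` is kept as a single value. If every header value has to be a scalar, a variant along the following lines could be used. This is a sketch under stated assumptions: `flatten_nested` and the `sample` record are illustrative names invented here, and the `key = value` formatting is only one possible way to render the result.

```python
def flatten_nested(obj, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf of a nested structure.

    Unlike flatten_dict, tuples are expanded in the same way as lists.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten_nested(value, f"{prefix}{key}.")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from flatten_nested(value, f"{prefix}{index}.")
    else:
        # leaf value: strip the trailing "." from the accumulated key
        yield prefix[:-1], obj


# Abbreviated, hand-written record used only for illustration
sample = {
    "activity": [
        {
            "activity_name": "sub",
            "system": {"platform": {"libcver": ("glibc", "2.28"), "num_cpus": 2}},
        }
    ]
}

flat = dict(flatten_nested(sample))
lines = [f"{key} = {value!r}" for key, value in flat.items()]
print("\n".join(lines))
```

Each resulting key maps to a single string or number, which is the shape a key=value header needs.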

In [9]:
d = dict(activity=p.provenance)

In [10]:
pprint(flatten_dict(d))

{'activity.0.activity_name': 'sub',
 'activity.0.activity_uuid': '956b1559-bbd3-499f-a2a1-c76560853c64',
 'activity.0.duration_min': 8.33333335137354e-05,
 'activity.0.input.0.role': None,
 'activity.0.input.0.url': '/github/workspace/docs/examples/subinput.txt',
 'activity.0.input.1.role': None,
 'activity.0.input.1.url': '/github/workspace/docs/examples/anothersubinput.txt',
 'activity.0.output.0.role': None,
 'activity.0.output.0.url': '/github/workspace/docs/examples/suboutput.txt',
 'activity.0.start.time_utc': '2020-12-03T15:34:47.096',
 'activity.0.status': 'sub',
 'activity.0.stop.time_utc': '2020-12-03T15:34:47.101',
 'activity.0.system.arguments.0': '/usr/local/lib/python3.8/site-packages/ipykernel_launcher.py',
 'activity.0.system.arguments.1': '-f',
 'activity.0.system.arguments.2': '/tmp/tmpxtezdnaa.json',
 'activity.0.system.arguments.3': '--HistoryManager.hist_file=:memory:',
 'activity.0.system.ctapipe_resources_version': 'not installed',
 'activity.0.system.ctapipe_svc