Merge 5d63459 into 897d74f
pritchardn committed May 12, 2022
2 parents 897d74f + 5d63459 commit ae5b7fa
Showing 12 changed files with 329 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/architecture/index.rst
@@ -12,4 +12,5 @@ behind |daliuge|.
graphs
managers
dlm
reproducibility/reproducibility
reference
4 changes: 4 additions & 0 deletions docs/architecture/reproducibility/HelloHashes.csv
@@ -0,0 +1,4 @@
Workflow,Rerun,Repeat,Recompute,Reproduce,Replicate-SCI,Replicate-COMP,Replicate-TOTAL
HelloWorldBash,d35a6ee278dad22b202cc092547022abe8643cb22fe262688e97ed56cdc1a47d,86a5208e9c19113c10c564e36cd628b500b25de75a082fe979b10dd42fe39802,598523833e3249da2ae2e25e5caccb2694df84f9ca47085dfb20b6ebe95d30fc,dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec,dd5ecdba2c1a92ed44f8e28c82e6156976b6e7e50941ad3746ab426a364e200b,241153dbbc3534409fe89f9a0d1a16a0dd50e33f84b51fc860a6ab6400bc2dfc,ccede91165ea6e95c82ce446d2972124c8ec956d3a12b372b94cabfa7740071c
HelloWorldPython,6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2,92e9988ae3cef11b2af935960d0de7feae78ca84938bbdb2f1d0b45e4b3f9ee7,3f4f23133903dfb2a5842468ef01ffb266ccd1051d3ed55f4c4fac558a8c97e0,dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec,dc8210e4dc1c4eec7248a9016a0d28e8032c3f56010bee4a9bf314c1e13bd69a,04a540a06942b11dafcc9bb67a85bbdae0752024a358251a919a363d100aa856,2c9970ebdf2a6a4581cb2e728cf3187d3c1954146967d1724ffae5a0dddfc4b1
HelloEverybodyPython,6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2,3c162ec8c42182f99643e70ba2b6a0f205f1ee36a9ab70b7af9a76badae97b03,7ad483dea703f6aa6587fc9c05acfe398d8a03201990ba6a42d274bc7fb978ac,ee0d0784c46b04dc1c1578fde0c1be655ea91c1d03d9267f9888f1d45ba8985d,24558387b6066205b7b1483dfd12954bdb5b5a0fa755c58d82c3a69e574a4914,383fabf6d17a0119514ade3cd53b13ff83f16f3d290db6e9070f1e12cdc6c2d1,09e94a24c000098fe03d58a483c16768d37bd4574303abd1a84a91a9f9179631
24 changes: 24 additions & 0 deletions docs/architecture/reproducibility/adding_drops.rst
@@ -0,0 +1,24 @@
.. _reproducibility_adding_drops:

Creating New Drop Types
=======================

Drops must supply provenance data on demand as part of our scientific reproducibility efforts.
When implementing an entirely new drop type,
ensuring that the appropriate information is available is essential to preserving the power of this feature.

Drops supply provenance information for the various 'R-modes' through ``generate_x_data`` methods.
For application drops specifically,
the ``generate_recompute_data`` method may need overriding if any component-specific information
is required for exact replication.
For example, Python drops may supply their code or an execution trace.

In the case of data drops, the ``generate_reproduce_data`` method may need overriding
and should return a summary of the contained data: for example, the hash of a file,
a list of database queries, or whatever information is deemed characteristic of a data artefact
(perhaps statistical information for science products).

Additionally, if adding an entirely new drop type,
you will need to create a new drop category in ``dlg.common.__init__.py`` and related entries in
``dlg.common.reproducibility.reproducibility_fields.py``.
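As a hedged illustration of the overrides described above, the sketch below shows what such methods might look like. The class names and minimal structure are invented for this example and do not reflect the real ``dlg`` class hierarchy; only the ``generate_recompute_data`` and ``generate_reproduce_data`` method names come from the text.

```python
import hashlib


class IllustrativeFileDrop:
    """Toy stand-in for a data drop; not the real dlg class hierarchy."""

    def __init__(self, path: str):
        self.path = path

    def generate_reproduce_data(self) -> dict:
        # A characteristic summary of the contained data: here, a file hash.
        with open(self.path, "rb") as f:
            return {"data_hash": hashlib.sha256(f.read()).hexdigest()}


class IllustrativeBashDrop:
    """Toy stand-in for an application drop."""

    def __init__(self, command: str):
        self.command = command

    def generate_recompute_data(self) -> dict:
        # Component-specific information needed to re-execute this exact task.
        return {"command": self.command}
```

The real methods return whatever the component author deems characteristic; the dictionaries here merely show the shape of the idea.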

80 changes: 80 additions & 0 deletions docs/architecture/reproducibility/blockdags.rst
@@ -0,0 +1,80 @@
.. _reproducibility_blockdags:

Technical approach
==================

The fundamental primitives powering workflow signatures are Merkle trees and block directed
acyclic graphs (BlockDAGs).
These data structures cryptographically compress provenance and structural information.
We describe the primitives of our approach and then their combination.
The most relevant code is found under ``dlg.common.reproducibility``.

Merkle Trees
------------
A Merkle tree is essentially a binary tree with additional behaviours.
Leaves store singular data elements and are hashed in pairs to produce internal
nodes containing a signature.
These internal nodes are recursively hashed in pairs, eventually leaving a single root node with a
signature for its entire sub-tree.

Merkle tree comparisons can find differing nodes in a logarithmic number of comparisons and find
their use in version control, distributed databases and blockchains.

We store information for each workflow component in a Merkle tree.
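A minimal sketch of the root computation follows. This is not |daliuge|'s actual implementation; the SHA-256 choice and the rule of duplicating the last node on odd levels are assumptions of this example.

```python
import hashlib


def _hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves: list[bytes]) -> str:
    """Hash leaves in pairs, level by level, until one root signature remains."""
    if not leaves:
        return _hash(b"")
    level = [_hash(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_hash((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]
```

Two trees built over identical data share a root signature, while changing a single leaf changes the root, which is what makes logarithmic-time difference finding possible.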

BlockDAGs
---------

BlockDAGs are our term for a hash graph.
Each node takes the signatures of its predecessor blocks, along with new information, and hashes
them all together to generate a signature for the current node.
We overlay BlockDAGs onto |daliuge| workflow graphs: the edges between components remain, and
descendant components receive their parents' signatures when generating their own, which are in
turn passed on to their children.

The root of a Merkle tree formed by the signatures of workflow leaves acts as the full
workflow signature.

One could, in principle, do away with these cryptographic structures, but using Merkle trees
and BlockDAGs makes the comparison between workflow executions constant-time, independent of
workflow scale or composition.
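
The per-node hashing step can be sketched as follows. The use of SHA-256 and the sorted merging of parent signatures are assumptions of this illustration, not details of the |daliuge| implementation.

```python
import hashlib


def node_signature(parent_sigs: list[str], node_data: bytes) -> str:
    """Chain the parents' signatures together with this node's own data."""
    h = hashlib.sha256()
    for sig in sorted(parent_sigs):  # sorted: assumes order-independent merging
        h.update(sig.encode())
    h.update(node_data)
    return h.hexdigest()


# A small fan-out: node 'a' feeds both 'b' and 'c'.
sig_a = node_signature([], b"read input")
sig_b = node_signature([sig_a], b"process")
sig_c = node_signature([sig_a], b"write output")
```

Because each signature folds in its parents' signatures, any change to an upstream node alters every downstream signature, which is what lets the leaf signatures stand in for the whole execution.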

Runtime Provenance
------------------

Each drop implements a series of ``generate_x_data`` methods, where ``x`` is the name of a
particular standard (defined below).
At runtime, each drop packages up its pertinent data and sends it to its manager; the data
percolates up to the master manager responsible for the drop's session, which packages the final
BlockDAG for that workflow execution.
The resulting signature structure is written to a file stored alongside that session's log file.

In general, specialized processing drops need to implement a customized ``generate_recompute_data``
function, and data drops need to implement a ``generate_reproduce_data`` function.

Translate-time Provenance
-------------------------
|daliuge| can generate BlockDAGs and an associated signature for a workflow at each stage of
translation from the logical to the physical layer.
Once an ``rmode`` flag (defined below) is passed to the ``fill`` operation, |daliuge| captures
provenance and pertinent information automatically from that point forward, storing this
information alongside the graph structure itself.

The *pertinent* information is defined in the ``dlg.common.reproducibility.reproducibility_fields``
file, which will need modification whenever an entirely new type of drop is added (a relatively
infrequent occurrence).

Signature Building
------------------
The algorithm used to build the BlockDAG is a variant of
`Kahn's algorithm <https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/>`__
for topological sorting.
Nodes without predecessors are processed first, followed by their children, and so on, moving
through the graph.

This operation takes time linear in the number of nodes and edges present in the graph at all
layers.
Building the MerkleTree for each drop is a potentially expensive operation, dependent on the volume
of data present in the tree.
This is a per-drop consideration, and thus when implementing ``generate_reproduce_data``, be wary of
producing large data volumes.
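
The whole procedure can be sketched compactly, combining Kahn ordering with per-node signature chaining. The hash choices and the final leaf-combination step are illustrative assumptions, not the |daliuge| implementation.

```python
import hashlib
from collections import deque


def build_blockdag(nodes: dict[str, bytes], edges: list[tuple[str, str]]) -> str:
    """Process nodes in topological (Kahn) order, chaining signatures along
    edges; return a workflow signature derived from the leaf signatures."""
    parents = {n: [] for n in nodes}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
        indegree[v] += 1

    sigs: dict[str, str] = {}
    # Nodes without predecessors are processed first.
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        n = queue.popleft()
        h = hashlib.sha256()
        for p in sorted(parents[n]):
            h.update(sigs[p].encode())
        h.update(nodes[n])
        sigs[n] = h.hexdigest()
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # The workflow signature combines the signatures of the graph's leaves.
    leaves = sorted(sigs[n] for n in nodes if not children[n])
    return hashlib.sha256("".join(leaves).encode()).hexdigest()
```

Each node and edge is visited once, so the traversal is linear in the size of the graph, matching the cost described above.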
49 changes: 49 additions & 0 deletions docs/architecture/reproducibility/graphcertification.rst
@@ -0,0 +1,49 @@
.. _reproducibility_graphcertification:

Graph Certification
===================
'Certifying' a graph involves generating and publishing reproducibility signatures.
These signatures can be integrated into a CI/CD pipeline, used during executions for verification or
during late-stage development when fine-tuning graphs.

By producing and sharing these signatures, subsequent changes to execution environment, processing
components, overall graph design and data artefacts can be easily and efficiently tested.

Certifying a Graph
------------------
The process of generating and storing workflow signatures is relatively straightforward.

* From the root of the graph-storing directory (usually a repository), create a ``/reprodata/[GRAPH_NAME]`` directory.
* Run the graph with the ``ALL`` reproducibility flag, and move the produced ``reprodata.out`` file to the previously created directory.
* (optional) Run the ``dlg.common.reproducibility.reprodata_compare.py`` script with this file as input to generate a summary CSV file.

In subsequent executions or during CI/CD scripts:

* Note the ``reprodata.out`` file generated during the test execution.
* Run ``dlg.common.reproducibility.reprodata_compare.py`` with the published ``reprodata/[GRAPH_NAME]`` directory and the newly generated signature file.
* The resulting ``[SESSION_NAME]-comparison.csv`` will contain a simple True/False summary for each RMode, for use at your discretion.
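
The comparison step amounts to an equality check per rmode. The toy function below mirrors the True/False summary described above; it is not the real ``reprodata_compare`` implementation, and the hash values in the example are invented.

```python
def compare_signatures(published: dict[str, str], new: dict[str, str]) -> dict[str, bool]:
    """Per-rmode True/False summary, akin to the comparison CSV described above."""
    return {rmode: published.get(rmode) == new.get(rmode) for rmode in published}


# Hypothetical published and freshly generated signatures.
published = {"Rerun": "abc123", "Repeat": "def456"}
fresh = {"Rerun": "abc123", "Repeat": "0ther"}
summary = compare_signatures(published, fresh)  # {"Rerun": True, "Repeat": False}
```

A CI job would simply fail when any rmode it cares about comes back ``False``.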

What is to be expected?
***********************
In general, all but the ``Recomputation`` and ``Replicate_Computational`` rmodes should match. Moreover:

* A failed ``Rerun`` indicates some fundamental structure is different
* A failed ``Repeat`` indicates changes to component parameters or a different execution scale
* A failed ``Recomputation`` indicates some runtime environment changes have been made
* A failed ``Reproduction`` indicates data artefacts have changed
* A failed ``Scientific Replication`` indicates a change in data artefacts or fundamental structure
* A failed ``Computational Replication`` indicates a change in data artefacts or runtime environment
* A failed ``Total Replication`` indicates a change in data artefacts, component parameters or a different execution scale

When attempting to re-create some known graph-derived result, ``Replication`` is the goal.
In an operational context, where data changes constantly, ``Rerunning`` is the goal.
When conducting science across multiple trials, ``Repeating`` is necessary to use the derived data artefacts in concert.

Tips on Making Graphs Robust
----------------------------
The most common 'brittle' aspects of graphs are hard-coded paths to data resources and access to referenced data.
These issues can be ameliorated by:

* Using the ``$DLG_ROOT`` keyword in component parameters as a base path.
* Providing comments on where to find referenced data artefacts
* Providing instructions on how to build referenced runtime libraries (in the case of Dynlib drops).

53 changes: 53 additions & 0 deletions docs/architecture/reproducibility/helloWorldExample.rst
@@ -0,0 +1,53 @@
.. _reproducibility_helloworld:

Hello World Example
===================
We present a simple example based on several 'Hello world' workflows.
First, we present the workflows and signatures for all rmodes and discuss how they compare.

Hello World Bash
----------------
This workflow consists of a bash script writing text to a file,
specifically ``echo 'Hello World' > %o0``.

.. image:: HelloWorldBash.png

Hello World Python
------------------
This workflow consists of a single Python function and a file.
The function writes 'Hello World' to the linked file.

.. image:: HelloWorldPython.png

Hello Everybody Python
----------------------
This workflow again consists of a single Python function and a file.
The function writes 'Hello Everybody' to the linked file.

.. image:: HelloEverybodyPython.png

Signature Comparisons
---------------------

Comparing the hashes of these workflows, we arrive at the following conclusions:

.. csv-table:: Workflow Hashes
:file: HelloHashes.csv
:widths: 13, 12, 12, 12, 12, 12, 12, 12
:header-rows: 1

* HelloEverybodyPython and HelloWorldPython are Reruns
* No two workflows are repetitions
* No two workflows are recomputations
* HelloWorldBash and HelloWorldPython reproduce the same results
* No two workflows are replicas.
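
These conclusions can be checked directly against the hashes in ``HelloHashes.csv``. The snippet below pins down two of the claims using the Rerun and Reproduce columns of that table.

```python
# Rerun and Reproduce signatures copied from HelloHashes.csv.
rerun = {
    "HelloWorldBash": "d35a6ee278dad22b202cc092547022abe8643cb22fe262688e97ed56cdc1a47d",
    "HelloWorldPython": "6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2",
    "HelloEverybodyPython": "6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2",
}
reproduce = {
    "HelloWorldBash": "dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec",
    "HelloWorldPython": "dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec",
    "HelloEverybodyPython": "ee0d0784c46b04dc1c1578fde0c1be655ea91c1d03d9267f9888f1d45ba8985d",
}

# The two Python variants rerun each other: same logical structure.
assert rerun["HelloWorldPython"] == rerun["HelloEverybodyPython"]
# Only the two 'Hello World' workflows reproduce the same result.
assert reproduce["HelloWorldBash"] == reproduce["HelloWorldPython"]
assert reproduce["HelloWorldBash"] != reproduce["HelloEverybodyPython"]
```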

Testing for repetitions is primarily useful when examining stochastic workflows, allowing their
results to be taken in concert with confidence.
Testing for replicas is useful when moving between deployment environments or verifying the
validity of a workflow.
When debugging a workflow or asserting if the computing environment has changed, recomputations and
computational replicas are of particular use.

This simple example scratches the surface of what is possible with a robust workflow
signature scheme.
37 changes: 37 additions & 0 deletions docs/architecture/reproducibility/reproducibility.rst
@@ -0,0 +1,37 @@
.. _scientific_reproducibility:

Scientific Reproducibility
==========================

*Under construction*

The scientific reproducibility of computational workflows is a fundamental concern when conducting
scientific investigations.
Here, we outline our approach to increasing scientific confidence in |daliuge| workflows.
Modern methods create a deterministic computing environment through careful software versioning and
containerization.
We suggest testing equivalence between carefully selected provenance information to complement such
approaches.

Doing so allows any workflow system that generates identical provenance information to claim to
re-create some aspect of the original workflow execution.
Drops provide component-specific provenance information at runtime and throughout graph translation.

Additionally, a novel hash-graph (BlockDAG) method captures the relationships between components by
linking provenance throughout an entire workflow.
The resulting signature completely characterizes a workflow, allowing for constant-time provenance
comparison.

We refer the motivated reader to the
`related thesis <https://research-repository.uwa.edu.au/en/publications/using-blockchain-technology-to-enable-reproducible-science>`__.


.. toctree::
:maxdepth: 2

rmodes
blockdags
helloWorldExample
graphcertification
adding_drops
78 changes: 78 additions & 0 deletions docs/architecture/reproducibility/rmodes.rst
@@ -0,0 +1,78 @@
.. _reproducibility_rmodes:

R-Mode Standards
================

Each drop's provenance information defines what a workflow signature claims.
Inspired by and extending the current workflow literature, we define seven R-modes.
R-mode selection occurs when submitting a workflow to |daliuge| for initial filling and unrolling;
|daliuge| handles everything else automatically.
Additionally, the ALL mode will generate a signature structure containing separate hash graphs for
all supported modes, which is a good choice when experimenting with new workflow concepts or
certifying a particular workflow version.

Rerunning
---------
A workflow reruns another if they execute the same logical workflow; their logical components and
dependencies match.
At this standard, the runtime information is simply an execution status flag; the translate-time
information is logical template data excluding physical drops structurally.

When scaling up an in-development workflow or deploying to a new facility, asserting that
executions rerun the original workflow builds confidence in the workflow tools.
Rerunning is also useful where data scale and contents change, as in an ingest pipeline.

Repeating
---------
A workflow repeats another if they execute the same logical workflow and a principally identical
physical workflow; their logical components, dependencies, and physical tasks match.
At this standard, the runtime information is still only an execution flag, and the translate-time
information includes component parameters (in addition to rerunning information) and all physical
drops structurally.

Workflows with stochastic results need statistical power to make scientific claims.
Asserting workflow repetitions allows using results in concert.

Recomputing
-----------
A workflow recomputes another if they execute the same physical workflow; their physical tasks and
dependencies match precisely.
In addition to repetition information, a maximal amount of detail for computing drops is stored
at this standard.

Recomputation is a meticulous approach that is helpful when debugging workflow deployments.

Reproducing
-----------
A workflow reproduces another if their scientific information matches; in other words, the
terminal data drops of the two workflows match in content.
The precise mechanism for establishing comparable data need not be a naive copy but is a
domain-specific decision.
At this standard, runtime and translate-time data only include data drops structurally. At
runtime, data drops are expected to provide a characteristic summary of their contents.

Reproductions are practical in asserting whether a given result can be independently reported or to
test an alternate methodology.
An alternate methodology could mean an incremental change to a single component
(somewhat akin to regression testing) or testing a vastly different workflow approach.

Replicating - Scientifically
----------------------------
A scientific replica reruns and reproduces a workflow execution.

Scientific replicas establish a workflow design as a gold standard for a given set of results.

Replicating - Computationally
-----------------------------
A computational replica recomputes and reproduces a workflow execution.

Computational replicas are useful when performing science on workflows directly
(performance claims etc.).

Replicating - Totally
---------------------
A total replica repeats and reproduces a workflow execution.

Total replicas allow for independent verification of results, adding direct credibility to
results coming from a workflow.
Moreover, if a workflow's original deployment environment is unavailable, a total replica is
the most robust assertion that can be placed on a workflow.
6 changes: 3 additions & 3 deletions docs/cli.rst
@@ -154,9 +154,9 @@ Help output::
-p PARAMETER, --parameter=PARAMETER
Parameter specification (either 'name=value' or a JSON
string)
-R REPRODUCIBILITY, --reproducibility=REPRODUCIBILITY
Level of reproducibility. Defaults to 0 (NOTHING).
Accepts '0,1,2,4,5,6,7,8'
-R, --reproducibility
Level of reproducibility. Default 0 (NOTHING). Accepts '-1'-'8'.
Refer to dlg.common.reproducibility.constants for more explanation.

Command: dlg include_dir
