Merge 5d63459 into 897d74f
pritchardn committed May 12, 2022
2 parents 897d74f + 5d63459 commit ae5b7fa
Showing 12 changed files with 329 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/architecture/index.rst
@@ -12,4 +12,5 @@ behind |daliuge|.
graphs
managers
dlm
reproducibility/reproducibility
reference
4 changes: 4 additions & 0 deletions docs/architecture/reproducibility/HelloHashes.csv
@@ -0,0 +1,4 @@
Workflow,Rerun,Repeat,Recompute,Reproduce,Replicate-SCI,Replicate-COMP,Replicate-TOTAL
HelloWorldBash,d35a6ee278dad22b202cc092547022abe8643cb22fe262688e97ed56cdc1a47d,86a5208e9c19113c10c564e36cd628b500b25de75a082fe979b10dd42fe39802,598523833e3249da2ae2e25e5caccb2694df84f9ca47085dfb20b6ebe95d30fc,dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec,dd5ecdba2c1a92ed44f8e28c82e6156976b6e7e50941ad3746ab426a364e200b,241153dbbc3534409fe89f9a0d1a16a0dd50e33f84b51fc860a6ab6400bc2dfc,ccede91165ea6e95c82ce446d2972124c8ec956d3a12b372b94cabfa7740071c
HelloWorldPython,6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2,92e9988ae3cef11b2af935960d0de7feae78ca84938bbdb2f1d0b45e4b3f9ee7,3f4f23133903dfb2a5842468ef01ffb266ccd1051d3ed55f4c4fac558a8c97e0,dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec,dc8210e4dc1c4eec7248a9016a0d28e8032c3f56010bee4a9bf314c1e13bd69a,04a540a06942b11dafcc9bb67a85bbdae0752024a358251a919a363d100aa856,2c9970ebdf2a6a4581cb2e728cf3187d3c1954146967d1724ffae5a0dddfc4b1
HelloEverybodyPython,6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2,3c162ec8c42182f99643e70ba2b6a0f205f1ee36a9ab70b7af9a76badae97b03,7ad483dea703f6aa6587fc9c05acfe398d8a03201990ba6a42d274bc7fb978ac,ee0d0784c46b04dc1c1578fde0c1be655ea91c1d03d9267f9888f1d45ba8985d,24558387b6066205b7b1483dfd12954bdb5b5a0fa755c58d82c3a69e574a4914,383fabf6d17a0119514ade3cd53b13ff83f16f3d290db6e9070f1e12cdc6c2d1,09e94a24c000098fe03d58a483c16768d37bd4574303abd1a84a91a9f9179631
24 changes: 24 additions & 0 deletions docs/architecture/reproducibility/adding_drops.rst
@@ -0,0 +1,24 @@
.. _reproducibility_adding_drops:

Creating New Drop Types
=======================

Drops must supply provenance data on demand as part of our scientific reproducibility efforts.
When implementing an entirely new drop type,
ensuring that the appropriate information is available is essential to preserving the power of this feature.

Drops supply provenance information for the various 'R-modes' through ``generate_x_data`` methods.
For application drops specifically,
the ``generate_recompute_data`` method may need overriding if any component-specific information
is required for exact replication.
For example, Python drops may supply their code or an execution trace.

In the case of data drops, the ``generate_reproduce_data`` method may need overriding
and should return a summary of the contained data: for example, the hash of a file,
a list of database queries, or whatever information is deemed characteristic of a data artefact
(perhaps statistical information for science products).

Additionally, if adding an entirely new drop type,
you will need to create a new drop category in ``dlg.common.__init__.py`` and related entries in
``dlg.common.reproducibility.reproducibility_fields.py``.
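As a hedged illustration of the overrides described above, the sketch below shows what such methods might look like. The class names and minimal structure are invented for this example and do not reflect the real ``dlg`` class hierarchy; only the ``generate_recompute_data`` and ``generate_reproduce_data`` method names come from the text.

```python
import hashlib


class IllustrativeFileDrop:
    """Toy stand-in for a data drop; not the real dlg class hierarchy."""

    def __init__(self, path: str):
        self.path = path

    def generate_reproduce_data(self) -> dict:
        # A characteristic summary of the contained data: here, a file hash.
        with open(self.path, "rb") as f:
            return {"data_hash": hashlib.sha256(f.read()).hexdigest()}


class IllustrativeBashDrop:
    """Toy stand-in for an application drop."""

    def __init__(self, command: str):
        self.command = command

    def generate_recompute_data(self) -> dict:
        # Component-specific information needed to re-execute this exact task.
        return {"command": self.command}
```

The real methods return whatever the component author deems characteristic; the dictionaries here merely show the shape of the idea.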

80 changes: 80 additions & 0 deletions docs/architecture/reproducibility/blockdags.rst
@@ -0,0 +1,80 @@
.. _reproducibility_blockdags:

Technical approach
==================

The fundamental primitives powering workflow signatures are Merkle trees and block directed
acyclic graphs (BlockDAGs).
These data structures cryptographically compress provenance and structural information.
We describe the primitives of our approach and then their combination.
The most relevant code is found under ``dlg.common.reproducibility``.

Merkle Trees
------------
A Merkle tree is essentially a binary tree with additional behaviours.
Leaves store singular data elements and are hashed in pairs to produce internal
nodes containing a signature.
These internal nodes are recursively hashed in pairs, eventually leaving a single root node with a
signature for its entire sub-tree.

Merkle tree comparisons can find differing nodes in a logarithmic number of comparisons and find
their use in version control, distributed databases and blockchains.

We store information for each workflow component in a Merkle tree.
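A minimal sketch of the root computation follows. This is not |daliuge|'s actual implementation; the SHA-256 choice and the rule of duplicating the last node on odd levels are assumptions of this example.

```python
import hashlib


def _hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves: list[bytes]) -> str:
    """Hash leaves in pairs, level by level, until one root signature remains."""
    if not leaves:
        return _hash(b"")
    level = [_hash(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_hash((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]
```

Two trees built over identical data share a root signature, while changing a single leaf changes the root, which is what makes logarithmic-time difference finding possible.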

BlockDAGs
---------

BlockDAGs are our term for a hash graph.
Each node takes the signatures of its predecessor blocks, along with new information, and hashes
them all together to generate a signature for the current node.
We overlay BlockDAGs onto |daliuge| workflow graphs: the edges between components remain, and
descendant components receive their parents' signatures when generating their own, which are in
turn passed on to their children.

The root of a Merkle tree formed by the signatures of workflow leaves acts as the full
workflow signature.

One could, in principle, do away with these cryptographic structures, but using Merkle trees
and BlockDAGs makes the comparison between workflow executions constant-time, independent of
workflow scale or composition.
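
The per-node hashing step can be sketched as follows. The use of SHA-256 and the sorted merging of parent signatures are assumptions of this illustration, not details of the |daliuge| implementation.

```python
import hashlib


def node_signature(parent_sigs: list[str], node_data: bytes) -> str:
    """Chain the parents' signatures together with this node's own data."""
    h = hashlib.sha256()
    for sig in sorted(parent_sigs):  # sorted: assumes order-independent merging
        h.update(sig.encode())
    h.update(node_data)
    return h.hexdigest()


# A small fan-out: node 'a' feeds both 'b' and 'c'.
sig_a = node_signature([], b"read input")
sig_b = node_signature([sig_a], b"process")
sig_c = node_signature([sig_a], b"write output")
```

Because each signature folds in its parents' signatures, any change to an upstream node alters every downstream signature, which is what lets the leaf signatures stand in for the whole execution.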

Runtime Provenance
------------------

Each drop implements a series of ``generate_x_data`` methods, where ``x`` is the name of a
particular standard (defined below).
At runtime, each drop packages up its pertinent data and sends it to its manager; the data
percolates up to the master manager responsible for the drop's session, which packages the final
BlockDAG for that workflow execution.
The resulting signature structure is written to a file stored alongside that session's log file.

In general, specialized processing drops need to implement a customized ``generate_recompute_data``
function, and data drops need to implement a ``generate_reproduce_data`` function.

Translate-time Provenance
-------------------------
|daliuge| can generate BlockDAGs and an associated signature for a workflow at each stage of
translation from the logical to the physical layer.
Once an ``rmode`` flag (defined below) is passed to the ``fill`` operation, |daliuge| captures
provenance and pertinent information automatically from that point forward, storing this
information alongside the graph structure itself.

The *pertinent* information is defined in the ``dlg.common.reproducibility.reproducibility_fields``
file, which will need modification whenever an entirely new type of drop is added (a relatively
infrequent occurrence).

Signature Building
------------------
The algorithm used to build the BlockDAG is a variant of
`Kahn's algorithm <https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/>`__
for topological sorting.
Nodes without predecessors are processed first, followed by their children, and so on, moving
through the graph.

This operation takes time linear in the number of nodes and edges present in the graph at all
layers.
Building the MerkleTree for each drop is a potentially expensive operation, dependent on the volume
of data present in the tree.
This is a per-drop consideration, and thus when implementing ``generate_reproduce_data``, be wary of
producing large data volumes.
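
The whole procedure can be sketched compactly, combining Kahn ordering with per-node signature chaining. The hash choices and the final leaf-combination step are illustrative assumptions, not the |daliuge| implementation.

```python
import hashlib
from collections import deque


def build_blockdag(nodes: dict[str, bytes], edges: list[tuple[str, str]]) -> str:
    """Process nodes in topological (Kahn) order, chaining signatures along
    edges; return a workflow signature derived from the leaf signatures."""
    parents = {n: [] for n in nodes}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
        indegree[v] += 1

    sigs: dict[str, str] = {}
    # Nodes without predecessors are processed first.
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        n = queue.popleft()
        h = hashlib.sha256()
        for p in sorted(parents[n]):
            h.update(sigs[p].encode())
        h.update(nodes[n])
        sigs[n] = h.hexdigest()
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # The workflow signature combines the signatures of the graph's leaves.
    leaves = sorted(sigs[n] for n in nodes if not children[n])
    return hashlib.sha256("".join(leaves).encode()).hexdigest()
```

Each node and edge is visited once, so the traversal is linear in the size of the graph, matching the cost described above.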
49 changes: 49 additions & 0 deletions docs/architecture/reproducibility/graphcertification.rst
@@ -0,0 +1,49 @@
.. _reproducibility_graphcertification:

Graph Certification
===================
'Certifying' a graph involves generating and publishing reproducibility signatures.
These signatures can be integrated into a CI/CD pipeline, used during executions for verification or
during late-stage development when fine-tuning graphs.

By producing and sharing these signatures, subsequent changes to execution environment, processing
components, overall graph design and data artefacts can be easily and efficiently tested.

Certifying a Graph
------------------
The process of generating and storing workflow signatures is relatively straightforward.

* From the root of the graph-storing directory (usually a repository), create a ``/reprodata/[GRAPH_NAME]`` directory.
* Run the graph with the ``ALL`` reproducibility flag, and move the produced ``reprodata.out`` file to the previously created directory.
* (optional) Run the ``dlg.common.reproducibility.reprodata_compare.py`` script with this file as input to generate a summary CSV file.

In subsequent executions or during CI/CD scripts:

* Note the ``reprodata.out`` file generated during the test execution.
* Run ``dlg.common.reproducibility.reprodata_compare.py`` with the published ``reprodata/[GRAPH_NAME]`` directory and the newly generated signature file.
* The resulting ``[SESSION_NAME]-comparison.csv`` will contain a simple True/False summary for each RMode, for use at your discretion.
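
The comparison step amounts to an equality check per rmode. The toy function below mirrors the True/False summary described above; it is not the real ``reprodata_compare`` implementation, and the hash values in the example are invented.

```python
def compare_signatures(published: dict[str, str], new: dict[str, str]) -> dict[str, bool]:
    """Per-rmode True/False summary, akin to the comparison CSV described above."""
    return {rmode: published.get(rmode) == new.get(rmode) for rmode in published}


# Hypothetical published and freshly generated signatures.
published = {"Rerun": "abc123", "Repeat": "def456"}
fresh = {"Rerun": "abc123", "Repeat": "0ther"}
summary = compare_signatures(published, fresh)  # {"Rerun": True, "Repeat": False}
```

A CI job would simply fail when any rmode it cares about comes back ``False``.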

What is to be expected?
***********************
In general, all but the ``Recomputation`` and ``Replicate_Computational`` rmodes should match. Moreover:

* A failed ``Rerun`` indicates some fundamental structure is different
* A failed ``Repeat`` indicates changes to component parameters or a different execution scale
* A failed ``Recomputation`` indicates some runtime environment changes have been made
* A failed ``Reproduction`` indicates data artefacts have changed
* A failed ``Scientific Replication`` indicates a change in data artefacts or fundamental structure
* A failed ``Computational Replication`` indicates a change in data artefacts or runtime environment
* A failed ``Total Replication`` indicates a change in data artefacts, component parameters or a different execution scale

When attempting to re-create some known graph-derived result, ``Replication`` is the goal.
In an operational context, where data changes constantly, ``Rerunning`` is the goal.
When conducting science across multiple trials, ``Repeating`` is necessary to use the derived data artefacts in concert.

Tips on Making Graphs Robust
----------------------------
The most common 'brittle' aspects of graphs are hard-coded paths to data resources and access to referenced data.
These issues can be ameliorated by:

* Using the ``$DLG_ROOT`` keyword in component parameters as a base path.
* Providing comments on where to find referenced data artefacts
* Providing instructions on how to build referenced runtime libraries (in the case of Dynlib drops).

53 changes: 53 additions & 0 deletions docs/architecture/reproducibility/helloWorldExample.rst
@@ -0,0 +1,53 @@
.. _reproducibility_helloworld:

Hello World Example
===================
We present a simple example based on several 'Hello world' workflows.
First, we present the workflows and signatures for all rmodes and discuss how they compare.

Hello World Bash
----------------
This workflow consists of a bash script writing text to a file,
specifically ``echo 'Hello World' > %o0``.

.. image:: HelloWorldBash.png

Hello World Python
------------------
This workflow consists of a single Python function and a file.
The function writes 'Hello World' to the linked file.

.. image:: HelloWorldPython.png

Hello Everybody Python
----------------------
This workflow again consists of a single Python function and a file.
The function writes 'Hello Everybody' to the linked file.

.. image:: HelloEverybodyPython.png

Signature Comparisons
---------------------

Comparing the hashes of these workflows, we arrive at the following conclusions:

.. csv-table:: Workflow Hashes
:file: HelloHashes.csv
:widths: 13, 12, 12, 12, 12, 12, 12, 12
:header-rows: 1

* HelloEverybodyPython and HelloWorldPython are Reruns
* No two workflows are repetitions
* No two workflows are recomputations
* HelloWorldBash and HelloWorldPython reproduce the same results
* No two workflows are replicas.
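
These conclusions can be checked directly against the hashes in ``HelloHashes.csv``. The snippet below pins down two of the claims using the Rerun and Reproduce columns of that table.

```python
# Rerun and Reproduce signatures copied from HelloHashes.csv.
rerun = {
    "HelloWorldBash": "d35a6ee278dad22b202cc092547022abe8643cb22fe262688e97ed56cdc1a47d",
    "HelloWorldPython": "6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2",
    "HelloEverybodyPython": "6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2",
}
reproduce = {
    "HelloWorldBash": "dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec",
    "HelloWorldPython": "dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec",
    "HelloEverybodyPython": "ee0d0784c46b04dc1c1578fde0c1be655ea91c1d03d9267f9888f1d45ba8985d",
}

# The two Python variants rerun each other: same logical structure.
assert rerun["HelloWorldPython"] == rerun["HelloEverybodyPython"]
# Only the two 'Hello World' workflows reproduce the same result.
assert reproduce["HelloWorldBash"] == reproduce["HelloWorldPython"]
assert reproduce["HelloWorldBash"] != reproduce["HelloEverybodyPython"]
```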

Testing for repetitions is primarily useful when examining stochastic workflows, allowing their
results to be taken in concert with confidence.
Testing for replicas is useful when moving between deployment environments or verifying the
validity of a workflow.
When debugging a workflow or asserting if the computing environment has changed, recomputations and
computational replicas are of particular use.

This simple example scratches the surface of what is possible with a robust workflow
signature scheme.
37 changes: 37 additions & 0 deletions docs/architecture/reproducibility/reproducibility.rst
@@ -0,0 +1,37 @@
.. _scientific_reproducibility:

Scientific Reproducibility
==========================

*Under construction*

The scientific reproducibility of computational workflows is a fundamental concern when conducting
scientific investigations.
Here, we outline our approach to increasing scientific confidence in |daliuge| workflows.
Modern methods create a deterministic computing environment through careful software versioning and
containerization.
We suggest testing equivalence between carefully selected provenance information to complement such
approaches.

Doing so allows any workflow system that generates identical provenance information to claim to
re-create some aspect of the original workflow execution.
Drops provide component-specific provenance information at runtime and throughout graph translation.

Additionally, a novel hash-graph (BlockDAG) method captures the relationships between components by
linking provenance throughout an entire workflow.
The resulting signature completely characterizes a workflow, allowing for constant-time provenance
comparison.

We refer the motivated reader to the
`related thesis <https://research-repository.uwa.edu.au/en/publications/using-blockchain-technology-to-enable-reproducible-science>`__.


.. toctree::
:maxdepth: 2

rmodes
blockdags
helloWorldExample
graphcertification
adding_drops
78 changes: 78 additions & 0 deletions docs/architecture/reproducibility/rmodes.rst
@@ -0,0 +1,78 @@
.. _reproducibility_rmodes:

R-Mode Standards
================

Each drop's provenance information defines what a workflow signature claims.
Inspired by and extending the current workflow literature, we define seven R-modes.
R-mode selection occurs when submitting a workflow to |daliuge| for initial filling and unrolling;
|daliuge| handles everything else automatically.
Additionally, the ALL mode will generate a signature structure containing separate hash graphs for
all supported modes, which is a good choice when experimenting with new workflow concepts or
certifying a particular workflow version.

Rerunning
---------
A workflow reruns another if they execute the same logical workflow; their logical components and
dependencies match.
At this standard, the runtime information is simply an execution status flag; the translate-time
information is logical template data excluding physical drops structurally.

When scaling up an in-development workflow or deploying to a new facility, asserting that
executions rerun the original workflow builds confidence in the workflow tools.
Rerunning is also useful where data scale and contents change, as in an ingest pipeline.

Repeating
---------
A workflow repeats another if they execute the same logical workflow and a principally identical
physical workflow; their logical components, dependencies, and physical tasks match.
At this standard, the runtime information is still only an execution flag, and the translate-time
information includes component parameters (in addition to rerunning information) and all physical
drops structurally.

Workflows with stochastic results need statistical power to make scientific claims.
Asserting workflow repetitions allows using results in concert.

Recomputing
-----------
A workflow recomputes another if they execute the same physical workflow; their physical tasks and
dependencies match precisely.
In addition to repetition information, a maximal amount of detail for computing drops is stored
at this standard.

Recomputation is a meticulous approach that is helpful when debugging workflow deployments.

Reproducing
-----------
A workflow reproduces another if their scientific information matches; in other words, the
terminal data drops of the two workflows match in content.
The precise mechanism for establishing comparable data need not be a naive copy but is a
domain-specific decision.
At this standard, runtime and translate-time data only include data drops structurally. At
runtime, data drops are expected to provide a characteristic summary of their contents.

Reproductions are practical in asserting whether a given result can be independently reported or to
test an alternate methodology.
An alternate methodology could mean an incremental change to a single component
(somewhat akin to regression testing) or testing a vastly different workflow approach.

Replicating - Scientifically
----------------------------
A scientific replica reruns and reproduces a workflow execution.

Scientific replicas establish a workflow design as a gold standard for a given set of results.

Replicating - Computationally
-----------------------------
A computational replica recomputes and reproduces a workflow execution.

Computational replicas are useful when performing science on workflows directly
(performance claims etc.).

Replicating - Totally
---------------------
A total replica repeats and reproduces a workflow execution.

Total replicas allow for independent verification of results, adding direct credibility to
results coming from a workflow.
Moreover, if a workflow's original deployment environment is unavailable, a total replica is
the most robust assertion that can be placed on a workflow.
6 changes: 3 additions & 3 deletions docs/cli.rst
@@ -154,9 +154,9 @@ Help output::
-p PARAMETER, --parameter=PARAMETER
Parameter specification (either 'name=value' or a JSON
string)
-R REPRODUCIBILITY, --reproducibility=REPRODUCIBILITY
Level of reproducibility. Defaults to 0 (NOTHING).
Accepts '0,1,2,4,5,6,7,8'
-R, --reproducibility
Level of reproducibility. Default 0 (NOTHING). Accepts '-1'-'8'.
Refer to dlg.common.reproducibility.constants for more explanation.

Command: dlg include_dir
