Showing 12 changed files with 329 additions and 3 deletions.
@@ -12,4 +12,5 @@ behind |daliuge|.
   graphs
   managers
   dlm
   reproducibility/reproducibility
   reference
@@ -0,0 +1,4 @@
Workflow,Rerun,Repeat,Recompute,Reproduce,Replicate-SCI,Replicate-COMP,Replicate-TOTAL
HelloWorldBash,d35a6ee278dad22b202cc092547022abe8643cb22fe262688e97ed56cdc1a47d,86a5208e9c19113c10c564e36cd628b500b25de75a082fe979b10dd42fe39802,598523833e3249da2ae2e25e5caccb2694df84f9ca47085dfb20b6ebe95d30fc,dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec,dd5ecdba2c1a92ed44f8e28c82e6156976b6e7e50941ad3746ab426a364e200b,241153dbbc3534409fe89f9a0d1a16a0dd50e33f84b51fc860a6ab6400bc2dfc,ccede91165ea6e95c82ce446d2972124c8ec956d3a12b372b94cabfa7740071c
HelloWorldPython,6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2,92e9988ae3cef11b2af935960d0de7feae78ca84938bbdb2f1d0b45e4b3f9ee7,3f4f23133903dfb2a5842468ef01ffb266ccd1051d3ed55f4c4fac558a8c97e0,dd5d192134999d48ab9098844be9b882416eb90ee8965ed18376fc6dfabb2bec,dc8210e4dc1c4eec7248a9016a0d28e8032c3f56010bee4a9bf314c1e13bd69a,04a540a06942b11dafcc9bb67a85bbdae0752024a358251a919a363d100aa856,2c9970ebdf2a6a4581cb2e728cf3187d3c1954146967d1724ffae5a0dddfc4b1
HelloEverybodyPython,6413ca52dc807b4d9d8f0dc60f6f9d939ba363d86410ede1557a89c2d252e3d2,3c162ec8c42182f99643e70ba2b6a0f205f1ee36a9ab70b7af9a76badae97b03,7ad483dea703f6aa6587fc9c05acfe398d8a03201990ba6a42d274bc7fb978ac,ee0d0784c46b04dc1c1578fde0c1be655ea91c1d03d9267f9888f1d45ba8985d,24558387b6066205b7b1483dfd12954bdb5b5a0fa755c58d82c3a69e574a4914,383fabf6d17a0119514ade3cd53b13ff83f16f3d290db6e9070f1e12cdc6c2d1,09e94a24c000098fe03d58a483c16768d37bd4574303abd1a84a91a9f9179631
@@ -0,0 +1,24 @@
.. _reproducibility_adding_drops:

Creating New Drop Types
=======================

Drops must supply provenance data on demand as part of our scientific reproducibility efforts.
When implementing entirely new drop types,
ensuring that the appropriate information is available is essential for the feature to remain useful.

Drops supply provenance information for the various 'R-modes' through ``generate_x_data`` methods.
In the case of application drops specifically,
the ``generate_recompute_data`` method may need overriding if there is any specific information
needed for the exact replication of this component.
For example, Python drops may supply their code or an execution trace.

In the case of data drops, the ``generate_reproduce_data`` method may need overriding
and should return a summary of the contained data. For example, the hash of a file,
a list of database queries or whatever information is deemed characteristic of a data artefact
(perhaps statistical information for science products).

Additionally, if adding an entirely new drop type,
you will need to create a new drop category in ``dlg.common.__init__.py`` and related entries in
``dlg.common.reproducibility.reproducibility_fields.py``.

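
The two overrides described above can be illustrated with a minimal sketch. The classes here are
hypothetical stand-ins and do not inherit from any |daliuge| base class; the dictionary return type
and its keys are assumptions made for illustration only.

```python
import hashlib


class HypotheticalFileDrop:
    """Stand-in for a data drop backed by a file (not a DALiuGE class)."""

    def __init__(self, path):
        self.path = path

    def generate_reproduce_data(self):
        # Summarise the contained data by its SHA-256 digest instead of
        # returning the (potentially large) content itself.
        digest = hashlib.sha256()
        with open(self.path, "rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        return {"data_hash": digest.hexdigest()}


class HypotheticalAppDrop:
    """Stand-in for an application drop wrapping a Python function."""

    def __init__(self, func):
        self.func = func

    def generate_recompute_data(self):
        # Fingerprint the function's bytecode and constants as a compact
        # proxy for "supplying the code" of the component.
        code = self.func.__code__
        fingerprint = hashlib.sha256(code.co_code + repr(code.co_consts).encode())
        return {"name": self.func.__name__, "code_hash": fingerprint.hexdigest()}
```

Returning a digest rather than the raw content keeps the provenance payload small, which matters
because every returned field ends up hashed into the drop's signature.
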
@@ -0,0 +1,80 @@
.. _reproducibility_blockdags:

Technical approach
==================

The fundamental primitives powering workflow signatures are Merkle trees and block directed
acyclic graphs (BlockDAGs).
These data structures cryptographically compress provenance and structural information.
We describe the primitives of our approach and then their combination.
The most relevant code is found under ``dlg.common.reproducibility``.

Merkle Trees
------------
A Merkle tree is essentially a binary tree with additional behaviours.
Leaves store single data elements and are hashed in pairs to produce internal
nodes containing a signature.
These internal nodes are recursively hashed in pairs, eventually leaving a single root node with a
signature for its entire sub-tree.

Merkle tree comparisons can find differing nodes in a logarithmic number of comparisons, and
Merkle trees find use in version control, distributed databases and blockchains.

We store information for each workflow component in a Merkle tree.

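
As an illustration of the pairwise hashing described above, here is a minimal sketch of computing a
Merkle root with SHA-256. The function names are hypothetical, and two details are assumptions
rather than statements about the |daliuge| implementation: each leaf is hashed before pairing, and
the last node of an odd-sized level is paired with itself.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(elements):
    """Return the root signature for a list of byte strings."""
    if not elements:
        return sha256_hex(b"")
    # Hash every leaf, then repeatedly hash neighbouring pairs.
    level = [sha256_hex(element) for element in elements]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: pair an odd node with itself
        level = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


original = merkle_root([b"status=done", b"param=1", b"host=a"])
altered = merkle_root([b"status=done", b"param=2", b"host=a"])
print(original == altered)  # a change in any leaf changes the root
```
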
BlockDAGs
---------

BlockDAGs are our term for a hash graph.
Each node takes the signatures of its previous block(s) in addition to new information and hashes
them all together to generate a signature for the current node.
We overlay BlockDAGs onto |daliuge| workflow graphs: the edges between components remain, and
descendant components receive their parents' signatures to generate their own signatures, which are
passed on to their children.

The root of a Merkle tree formed by the signatures of workflow leaves acts as the full
workflow signature.

One could, in principle, do away with these cryptographic structures, but utilizing Merkle trees
and BlockDAGs makes the comparison between workflow executions a constant-time operation,
independent of workflow scale or composition.

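
The chaining of signatures from parents to children can be sketched as follows. This is an
illustrative sketch only: ``block_signature`` is a hypothetical helper, and sorting the parent
signatures (so that the result does not depend on parent order) is an assumption, not a documented
property of |daliuge|.

```python
import hashlib


def block_signature(parent_signatures, payload: bytes) -> str:
    """Hash the parents' signatures together with this node's own data."""
    digest = hashlib.sha256()
    for signature in sorted(parent_signatures):  # assumption: order-independent
        digest.update(signature.encode())
    digest.update(payload)
    return digest.hexdigest()


# A two-node chain: the child's signature depends on its parent's signature.
root = block_signature([], b"read input")
child = block_signature([root], b"process")

# Changing only the parent's data also changes the child's signature.
altered_root = block_signature([], b"read other input")
altered_child = block_signature([altered_root], b"process")
print(child == altered_child)
```

Because every signature folds in the signatures upstream of it, comparing the signatures of two
terminal nodes is enough to tell whether anything on the paths leading to them differs.
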
Runtime Provenance
------------------

Each drop implements a series of ``generate_x_data`` methods, where ``x`` is the name of a particular
standard (defined below).
At runtime, each drop packages up pertinent data and sends it to its manager; the data percolates up
to the master responsible for the drop's session, which then packages the final BlockDAG for that
workflow execution.
The resulting signature structure is written to a file stored alongside that session's log file.

In general, specialized processing drops need to implement a customized ``generate_recompute_data``
function, and data drops need to implement a ``generate_reproduce_data`` function.

Translate-time Provenance
-------------------------
|daliuge| can generate BlockDAGs and an associated signature for a workflow at each stage of
translation from the logical to the physical layer.
Once an ``rmode`` flag (defined below) is passed to the ``fill`` operation,
|daliuge| captures provenance and pertinent information automatically from that point forward,
storing this information alongside the graph structure itself.

The *pertinent* information is defined in the ``dlg.common.reproducibility.reproducibility_fields``
file, which will need modification whenever an entirely new type of drop is added (a relatively
infrequent occurrence).

Signature Building
------------------
The algorithm used to build the BlockDAG is a variant of
`Kahn's algorithm <https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/>`__
for topological sorting.
Nodes without predecessors are processed first, followed by their children, and so on, moving
through the graph.

This operation takes time linear in the number of nodes and edges present in the graph at all
layers.
Building the Merkle tree for each drop is a potentially expensive operation, dependent on the volume
of data present in the tree.
This is a per-drop consideration; thus, when implementing ``generate_reproduce_data``, be wary of
producing large data volumes.
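
The traversal order described above can be sketched with a plain Kahn-style pass over a small
graph. This is a simplified illustration, not the |daliuge| code: the function name and input
format are hypothetical, and each node's data is hashed directly instead of being summarised by a
per-drop Merkle tree.

```python
import hashlib
from collections import deque


def build_signatures(payloads, edges):
    """payloads: {node: bytes}; edges: [(parent, child)]. Returns {node: signature}."""
    children = {node: [] for node in payloads}
    parents = {node: [] for node in payloads}
    for parent, child in edges:
        children[parent].append(child)
        parents[child].append(parent)

    indegree = {node: len(parents[node]) for node in payloads}
    # Start with the nodes that have no predecessors.
    queue = deque(node for node in payloads if indegree[node] == 0)
    signatures = {}
    while queue:
        node = queue.popleft()
        digest = hashlib.sha256()
        # All parents are already signed, because a node is only queued
        # once its in-degree has dropped to zero.
        for signature in sorted(signatures[p] for p in parents[node]):
            digest.update(signature.encode())
        digest.update(payloads[node])
        signatures[node] = digest.hexdigest()
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if len(signatures) != len(payloads):
        raise ValueError("graph contains a cycle")
    return signatures


# A diamond-shaped graph: a feeds b and c, which both feed d.
diamond_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
diamond = build_signatures({"a": b"A", "b": b"B", "c": b"C", "d": b"D"}, diamond_edges)
print(sorted(diamond))
```

Each node and each edge is visited once, which is where the linear running time comes from.
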
@@ -0,0 +1,49 @@
.. _reproducibility_graphcertification:

Graph Certification
===================
'Certifying' a graph involves generating and publishing reproducibility signatures.
These signatures can be integrated into a CI/CD pipeline, used during executions for verification or
during late-stage development when fine-tuning graphs.

By producing and sharing these signatures, subsequent changes to the execution environment, processing
components, overall graph design and data artefacts can be easily and efficiently tested.

Certifying a Graph
------------------
The process of generating and storing workflow signatures is relatively straightforward.

* From the root of the graph-storing directory (usually a repository), create a ``/reprodata/[GRAPH_NAME]`` directory.
* Run the graph with the ``ALL`` reproducibility flag, and move the produced ``reprodata.out`` file to the previously created directory.
* (optional) Run the ``dlg.common.reproducibility.reprodata_compare.py`` script with this file as input to generate a summary CSV file.

In subsequent executions or during CI/CD scripts:

* Note the ``reprodata.out`` file generated during the test execution.
* Run ``dlg.common.reproducibility.reprodata_compare.py`` with the published ``reprodata/[GRAPH_NAME]`` directory and the newly generated signature file.
* The resulting ``[SESSION_NAME]-comparison.csv`` will contain a simple True/False summary for each RMode, for use at your discretion.

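
To make the idea of a per-RMode True/False summary concrete, here is a small self-contained
sketch. It does not reproduce ``reprodata_compare.py``: the function names, the in-memory
dictionary format and the CSV column headers are all assumptions for illustration, and the
signature values are placeholders rather than real hashes. The R-mode names are taken from the
column headers of the Hello World hash table.

```python
import csv
import io

RMODES = ["Rerun", "Repeat", "Recompute", "Reproduce",
          "Replicate-SCI", "Replicate-COMP", "Replicate-TOTAL"]


def compare_signatures(reference, candidate):
    """Return {rmode: bool}; a mode matches only if both sides hold the same signature."""
    return {
        mode: reference.get(mode) is not None and reference.get(mode) == candidate.get(mode)
        for mode in RMODES
    }


def summary_csv(results):
    """Render the comparison as CSV text, one row per R-mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["RMode", "Match"])
    for mode in RMODES:
        writer.writerow([mode, results[mode]])
    return buffer.getvalue()


# Placeholder signatures: the candidate differs only in its Recompute entry.
published = {mode: "sig-" + mode for mode in RMODES}
new_run = dict(published, Recompute="sig-changed")
print(summary_csv(compare_signatures(published, new_run)))
```

A CI job could then fail the build when any mode it cares about reports ``False``.
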
What is to be expected?
***********************
In general, all rmodes except ``Recomputation`` and ``Replicate_Computational`` should match. Moreover:

* A failed ``Rerun`` indicates some fundamental structure is different.
* A failed ``Repeat`` indicates changes to component parameters or a different execution scale.
* A failed ``Recomputation`` indicates some runtime environment changes have been made.
* A failed ``Reproduction`` indicates data artefacts have changed.
* A failed ``Scientific Replication`` indicates a change in data artefacts or fundamental structure.
* A failed ``Computational Replication`` indicates a change in data artefacts or runtime environment.
* A failed ``Total Replication`` indicates a change in data artefacts, component parameters or execution scale.

When attempting to re-create some known graph-derived result, ``Replication`` is the goal.
In an operational context, where data changes constantly, ``Rerunning`` is the goal.
When conducting science across multiple trials, ``Repeating`` is necessary to use the derived data artefacts in concert.

Tips on Making Graphs Robust
----------------------------
The most common 'brittle' aspects of graphs are hard-coded paths to data resources and access to referenced data.
This can be ameliorated by:

* Using the ``$DLG_ROOT`` keyword in component parameters as a base path.
* Providing comments on where to find referenced data artefacts.
* Providing instructions on how to build referenced runtime libraries (in the case of Dynlib drops).

@@ -0,0 +1,53 @@
.. _reproducibility_helloworld:

Hello World Example
===================
We present a simple example based on several 'Hello world' workflows.
We first present the workflows, then their signatures for all rmodes, and discuss how they compare.

Hello World Bash
----------------
This workflow comprises a bash script that writes text to a file,
specifically ``echo 'Hello World' > %o0``.

.. image:: HelloWorldBash.png

Hello World Python
------------------
This workflow comprises a single Python script and a file.
The script's function writes 'Hello World' to the linked file.

.. image:: HelloWorldPython.png

Hello Everybody Python
----------------------
This workflow again comprises a single Python script and a file.
The script's function writes 'Hello Everybody' to the linked file.

.. image:: HelloEverybodyPython.png

Signature Comparisons
---------------------

Comparing the hashes of the workflows against one another, we arrive at the following conclusions:

.. csv-table:: Workflow Hashes
   :file: HelloHashes.csv
   :widths: 13, 12, 12, 12, 12, 12, 12, 12
   :header-rows: 1

* HelloEverybodyPython and HelloWorldPython are reruns.
* No two workflows are repetitions.
* No two workflows are recomputations.
* HelloWorldBash and HelloWorldPython reproduce the same results.
* No two workflows are replicas.

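
These conclusions can be checked mechanically by comparing the table column by column. The sketch
below is illustrative only: it embeds the first eight characters of each hash from the table above
(truncated for brevity; the truncation preserves which entries are equal), and ``matching_pairs``
is a hypothetical helper rather than part of |daliuge|.

```python
from itertools import combinations

COLUMNS = ["Rerun", "Repeat", "Recompute", "Reproduce",
           "Replicate-SCI", "Replicate-COMP", "Replicate-TOTAL"]

# First eight characters of each hash in the Workflow Hashes table.
HASHES = {
    "HelloWorldBash": ["d35a6ee2", "86a5208e", "59852383", "dd5d1921",
                       "dd5ecdba", "241153db", "ccede911"],
    "HelloWorldPython": ["6413ca52", "92e9988a", "3f4f2313", "dd5d1921",
                         "dc8210e4", "04a540a0", "2c9970eb"],
    "HelloEverybodyPython": ["6413ca52", "3c162ec8", "7ad483de", "ee0d0784",
                             "24558387", "383fabf6", "09e94a24"],
}


def matching_pairs(hashes):
    """Return {column: [(workflow_a, workflow_b), ...]} for every equal pair of signatures."""
    matches = {column: [] for column in COLUMNS}
    for (name_a, row_a), (name_b, row_b) in combinations(hashes.items(), 2):
        for column, sig_a, sig_b in zip(COLUMNS, row_a, row_b):
            if sig_a == sig_b:
                matches[column].append((name_a, name_b))
    return matches


for column, pairs in matching_pairs(HASHES).items():
    print(column, pairs)
```

Only the ``Rerun`` and ``Reproduce`` columns contain a matching pair, in line with the list above.
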
Testing for repetitions is primarily useful when examining stochastic workflows, so that their
results can be taken in concert with confidence.
Testing for replicas is useful when moving between deployment environments or verifying the validity
of a workflow.
When debugging a workflow or checking whether the computing environment has changed, recomputations and
computational replicas are of particular use.

This simple example scratches the surface of what is possible with a robust workflow
signature scheme.
@@ -0,0 +1,37 @@
.. _scientific_reproducibility:

Scientific Reproducibility
==========================

*Under construction*

The scientific reproducibility of computational workflows is a fundamental concern when conducting
scientific investigations.
Here, we outline our approach to increasing scientific confidence in DALiuGE workflows.
Modern methods create a deterministic computing environment through careful software versioning and
containerization.
We suggest testing equivalence between carefully selected provenance information to complement such
approaches.

Doing so allows any workflow system that generates identical provenance information to claim to
re-create some aspect of the original workflow execution.
Drops provide component-specific provenance information at runtime and throughout graph translation.

Additionally, a novel hash-graph (BlockDAG) method captures the relationships between components by
linking provenance throughout an entire workflow.
The resulting signature completely characterizes a workflow, allowing for constant-time provenance
comparison.

We refer a motivated reader to the
`related thesis <https://research-repository.uwa.edu.au/en/publications/using-blockchain-technology-to-enable-reproducible-science>`__.

.. toctree::
   :maxdepth: 2

   rmodes
   blockdags
   helloWorldExample
   graphcertification
   adding_drops
@@ -0,0 +1,78 @@
.. _reproducibility_rmodes:

R-Mode Standards
================

Each drop's provenance information defines what a workflow signature claims.
Inspired by and extending current workflow literature, we define seven R-modes.
R-mode selection occurs when submitting a workflow to |daliuge| for initial filling and unrolling;
|daliuge| handles everything else automatically.
Additionally, the ALL mode will generate a signature structure containing separate hash graphs for
all supported modes, which is a good choice when experimenting with new workflow concepts or
certifying a particular workflow version.

Rerunning
---------
A workflow reruns another if they execute the same logical workflow; that is, their logical
components and dependencies match.
At this standard, the runtime information is simply an execution status flag; the translate-time
information is logical template data, and physical drops are excluded from the structure.

When scaling up an in-development workflow or deploying to a new facility, asserting that executions
rerun the original workflow builds confidence in the workflow tools.
Rerunning is also useful where data scale and contents change, as in an ingest pipeline.

Repeating
---------
A workflow repeats another if they execute the same logical workflow and a principally identical
physical workflow; that is, their logical components, dependencies, and physical tasks match.
At this standard, the runtime information is still only an execution flag; the translate-time
information adds component parameters to the rerunning information and includes all physical drops
in the structure.

Workflows with stochastic results need statistical power to make scientific claims.
Asserting workflow repetitions allows their results to be used in concert.

Recomputing
-----------
A workflow recomputes another if they execute the same physical workflow; that is, their physical
tasks and dependencies match precisely.
In addition to repetition information, a maximal amount of detail for computing drops is stored
at this standard.

Recomputation is a meticulous approach that is helpful when debugging workflow deployments.

Reproducing
-----------
A workflow reproduces another if their scientific information matches. In other words, the terminal
data drops of the two workflows match in content.
The precise mechanism of establishing comparable data need not be a naive copy but is a
domain-specific decision.
At this standard, runtime and translate-time data include only data drops in the structure. At runtime,
data drops are expected to provide a characteristic summary of their contents.

Reproductions are practical for asserting whether a given result can be independently reported or for
testing an alternate methodology.
An alternate methodology could mean an incremental change to a single component
(somewhat akin to regression testing) or testing a vastly different workflow approach.

Replicating - Scientifically
----------------------------
A scientific replica reruns and reproduces a workflow execution.

Scientific replicas establish a workflow design as a gold standard for a given set of results.

Replicating - Computationally
-----------------------------
A computational replica recomputes and reproduces a workflow execution.

Computational replicas are useful when performing science on workflows directly
(performance claims, etc.).

Replicating - Totally
---------------------
A total replica repeats and reproduces a workflow execution.

Total replicas allow for independent verification of results, adding direct credibility to
results coming from a workflow.
Moreover, if a workflow's original deployment environment is unavailable, a total replica is
the most robust assertion that can be placed on a workflow.
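
The three replication standards are defined as combinations of the four basic standards. The sketch
below restates those definitions as a lookup. It is purely illustrative: the names are
hypothetical, and it says nothing about how |daliuge| computes replica signatures (the Hello World
table lists separate hashes for the replica modes).

```python
# Each replication standard, expressed as the basic standards it combines.
REPLICA_DEFINITIONS = {
    "Replicate-SCI": ("Rerun", "Reproduce"),
    "Replicate-COMP": ("Recompute", "Reproduce"),
    "Replicate-TOTAL": ("Repeat", "Reproduce"),
}


def replica_claims(basic_matches):
    """Map {basic standard: bool} to {replication standard: bool}."""
    return {
        replica: all(basic_matches[part] for part in parts)
        for replica, parts in REPLICA_DEFINITIONS.items()
    }


# Two executions that rerun and reproduce one another, but neither repeat nor recompute.
claims = replica_claims(
    {"Rerun": True, "Repeat": False, "Recompute": False, "Reproduce": True}
)
print(claims)
```
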