Merge branch 'master' into python3-only
rtobar committed Jul 1, 2021
2 parents 0e50452 + a317c9c commit 1da1022
Showing 12 changed files with 110 additions and 52 deletions.
8 changes: 4 additions & 4 deletions daliuge-engine/dlg/ddap_protocol.py
@@ -29,10 +29,10 @@ class DROPLinkType:
Although not explicitly stated in this enumeration, each link type has a
corresponding inverse. This way, if X is a consumer of Y, Y is an input of
X. The full list is:
* CONSUMER / INPUT
* STREAMING_CONSUMER / STREAMING_INPUT
* PRODUCER / OUTPUT
* PARENT / CHILD
"""
CONSUMER, STREAMING_CONSUMER, PRODUCER, PARENT, CHILD, INPUT, STREAMING_INPUT, OUTPUT = range(8)
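
The inverse relationship described in the docstring can be pictured as a simple lookup table. This is a purely illustrative sketch; the mapping below is not defined by the module itself::

    from dlg.ddap_protocol import DROPLinkType

    # Each link type paired with its inverse, as listed in the DROPLinkType
    # docstring: CONSUMER/INPUT, STREAMING_CONSUMER/STREAMING_INPUT,
    # PRODUCER/OUTPUT, PARENT/CHILD.
    LINK_INVERSE = {
        DROPLinkType.CONSUMER: DROPLinkType.INPUT,
        DROPLinkType.INPUT: DROPLinkType.CONSUMER,
        DROPLinkType.STREAMING_CONSUMER: DROPLinkType.STREAMING_INPUT,
        DROPLinkType.STREAMING_INPUT: DROPLinkType.STREAMING_CONSUMER,
        DROPLinkType.PRODUCER: DROPLinkType.OUTPUT,
        DROPLinkType.OUTPUT: DROPLinkType.PRODUCER,
        DROPLinkType.PARENT: DROPLinkType.CHILD,
        DROPLinkType.CHILD: DROPLinkType.PARENT,
    }

    # If X is a consumer of Y, then Y is an input of X:
    assert LINK_INVERSE[DROPLinkType.CONSUMER] == DROPLinkType.INPUT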

3 changes: 3 additions & 0 deletions daliuge-engine/dlg/json_drop.py
@@ -31,6 +31,9 @@


class JsonDROP(FileDROP):
"""
Very simple implementation derived from the :class:`FileDROP <dlg.drop.FileDROP>` class
"""
def __init__(self, oid, uid, **kwargs):
self._data = None
super(JsonDROP, self).__init__(oid, uid, **kwargs)
18 changes: 9 additions & 9 deletions docs/README ray.rst
@@ -1,5 +1,5 @@
Notes of the merge between Data Activated 流 Graph Engine and Ray
=================================================================
Notes of the merge between |daliuge| and Ray
=============================================

The objective of this activity was to investigate a feasible solution for the flexible and simple deployment of DALiuGE on various platforms. In particular, the deployment of DALiuGE on AWS in an autoscaling environment is of interest to us.

@@ -13,13 +13,13 @@ Ray (https://docs.ray.io/en/master/) is a pretty complete execution engine all b

Internally, Ray uses a number of technologies that we are also using or evaluating within DALiuGE and/or the SKA. The way Ray manages and distributes computing is done very well and essentially covers a number of our target platforms, including AWS, SLURM, Kubernetes, Azure and GC.

The idea thus was to use Ray to distribute DALiuGE on those platforms and on AWS to start with.
The idea thus was to use Ray to distribute DALiuGE on those platforms, starting with AWS, but to leave the two systems otherwise essentially independent.

Setup
-----

Pre-requisites
--------------
^^^^^^^^^^^^^^

First, you need to install Ray into your local Python virtualenv::

@@ -30,7 +30,7 @@ Ray uses a YAML file to configure a deployment and allows to run additional setu
The rest is then straightforward and just requires configuring a few AWS autoscale-specific settings, including the AWS region, the type of head node, the type and (maximum and minimum) number of worker nodes, and whether to use the Spot market or not. In addition, the virtual machine AMI ID must be specified, which is a pain to obtain and differs between the various AWS regions.

Starting the DALiuGE Ray cluster
--------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Getting DALiuGE up and running in addition to Ray requires just two additional lines for the head and the worker nodes in the YAML file, but there are some caveats, as outlined below. With the provided Ray configuration YAML file, starting a cluster running DALiuGE on AWS is very easy (provided you have your AWS environment set up)::

@@ -48,7 +48,7 @@ More for convenience both DALiuGE and Ray require a number of ports to be expose
More specifically, the command above opens a shell inside the Docker container running on the head node AWS instance.

Issues
------
^^^^^^
The basic idea is to start up a Data Island Manager (and possibly also a Node Manager) on the Ray head node and a Node Manager on each of the worker nodes. Starting them is trivial, but the issue is how to correctly register the NMs with the DIM. DALiuGE usually does this the other way around, by telling the DIM at startup which NMs it is responsible for. This is not possible in an autoscaling setup, since the NMs don't exist yet.
As a workaround, DALiuGE does provide a REST API call to register an NM with the DIM, but that currently has a bug which registers the node twice.

@@ -64,15 +64,15 @@ To stop and start a node manager use the following two commands, replacing the S
The commands above also show how to connect to a shell inside the Docker container on a worker node. Unfortunately, Ray does not expose this as easily as the connection to the head node.

Submitting and executing a Graph
--------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This configuration only deploys the DALiuGE engine. EAGLE and a translator need to be deployed somewhere else. When submitting the physical graph (PG) from a translator web interface, the IP address to be entered there is the *public* IP address of the DIM (the Ray AWS head instance). After submitting, the DALiuGE monitoring page will pop up and show the progress bar. It is then also possible to click your way through to the sub-graphs running on the worker nodes.

Future thoughts
===============
---------------
This implementation is the start of an integration between Ray and DALiuGE. Ray (like AWS autoscaling) is a *reactive* execution framework and as such it uses the autoscaling feature just in time, when the load exceeds a certain threshold. DALiuGE, on the other hand, is a *proactive* execution framework and pre-allocates the resources required to execute a whole workflow. Both approaches have pros and cons. In particular, in an environment where resources are charged by the second, it is desirable to allocate them as dynamically as possible. On the other hand, dynamic allocation comes with the overhead of provisioning additional resources during run-time and is thus non-deterministic in terms of completion time. This is even more obvious when using the spot market on AWS. Fully dynamic allocation also does not fit well with bigger workflows, which require lots of resources right from the start. The optimal solution very likely lies somewhere between fully dynamic and fully static resource provisioning.

Dynamic workflow allocation
---------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step in that direction is to connect the DALiuGE translator with the Ray deployment. After the translator has performed the workflow partitioning, the resource requirements are fixed and could in turn be used to start up the Ray cluster with the required number of worker nodes. Essentially, this would also completely isolate one workflow from another. The next step could be to add workflow fragmentation to the DALiuGE translator and scale the Ray cluster according to the requirements of each fragment, rather than the whole workflow. It remains to be seen how to trigger the scaling of the Ray cluster just far enough ahead of time that resources are available when the previous fragment finishes, so the workflow can continue without delays.


48 changes: 32 additions & 16 deletions docs/api/dlg.rst
@@ -4,43 +4,59 @@ dlg
.. automodule:: dlg
.. contents::

dlg.ddap_protocol
^^^^^^^^^^^^^^^^^
.. automodule:: dlg.ddap_protocol
:members:

dlg.drop
^^^^^^^^
.. automodule:: dlg.drop
:members:

dlg.droputils
^^^^^^^^^^^^^
.. automodule:: dlg.droputils
:members:

dlg.event
^^^^^^^^^
.. automodule:: dlg.event
:members:

dlg.graph_loader
^^^^^^^^^^^^^^^^
.. automodule:: dlg.graph_loader
:members:

dlg.io
^^^^^^
.. automodule:: dlg.io
:members:

.. _api.dlg.drop:

dlg.drop
^^^^^^^^
.. automodule:: dlg.drop
dlg.json_drop
^^^^^^^^^^^^^
.. automodule:: dlg.json_drop
:members:

dlg.rpc
^^^^^^^
.. automodule:: dlg.rpc
:members:

dlg.runtime.delayed
^^^^^^^^^^^^^^^^^^^
.. autofunction:: dlg.runtime.delayed

dlg.s3_drop
^^^^^^^^^^^
.. automodule:: dlg.s3_drop
:members:

dlg.droputils
^^^^^^^^^^^^^
.. automodule:: dlg.droputils
:members:

dlg.utils
^^^^^^^^^
.. automodule:: dlg.utils
:members:

dlg.graph_loader
^^^^^^^^^^^^^^^^
.. automodule:: dlg.graph_loader
:members:

dlg.runtime.delayed
^^^^^^^^^^^^^^^^^^^
.. autofunction:: dlg.runtime.delayed
@@ -6,7 +6,7 @@ to write a new application component that can be used
as an Application Drop during the execution of a |daliuge| graph.

Types of Application Components
===============================
-------------------------------

|daliuge| supports four main types of Application Components.

35 changes: 35 additions & 0 deletions docs/data_development.rst
@@ -0,0 +1,35 @@
Data Component Development
=================================

This section describes what developers need to do
to write a new data component that can be used
as a Data Drop during the execution of a |daliuge| graph.

Different from most other frameworks, |daliuge| makes data components first-class entities in the context of a workflow. In fact, data components, or rather their instances, the Data Drops, drive the execution of a workflow. Consequently, |daliuge| graphs show both application and data components as graph nodes. Edges in |daliuge| graphs thus always connect a data component with an application component, expressing the consumer/producer relationships between them.

Types of Data Components
------------------------

Out of the box, |daliuge| supports five main types of Data Components.

* Posix file data components
* Memory data components
* S3 data components
* NGAS data components
* Apache Plasma data components

This range covers most of the use cases in workflows we have encountered so far, but adding support for additional data components is possible as well.

.. default-domain:: py

Python Data Component Class
---------------------------

Depending on the I/O model of such a new component, the development can start from one of the existing higher-level data components, from the lower-level :class:`DataIO <dlg.io.DataIO>` abstract class, or even lower down from the :class:`AbstractDROP <dlg.drop.AbstractDROP>` class. The best way to get started is to check out the code of existing data components, e.g. the :class:`JsonDROP <dlg.json_drop.JsonDROP>` or the :class:`S3DROP <dlg.s3_drop.S3DROP>`. A minimal skeleton following that pattern is sketched below.
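
This is purely an illustration (``YamlDROP`` is a made-up name), mirroring the :class:`JsonDROP <dlg.json_drop.JsonDROP>` constructor shown earlier in this commit::

    from dlg.drop import FileDROP


    class YamlDROP(FileDROP):
        """Hypothetical data component storing its payload as a YAML file."""

        def __init__(self, oid, uid, **kwargs):
            # Cache for the parsed content, filled lazily on first access
            self._data = None
            super(YamlDROP, self).__init__(oid, uid, **kwargs)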

I/O
---

A data component's input and output methods are defined by the abstract class :class:`DataIO <dlg.io.DataIO>`. The methods in that class are just empty definitions and have to be implemented by the actual data component, as sketched below.
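
The following rough sketch assumes the hooks are named ``_open``/``_read``/``_write``/``_close`` plus ``exists`` and ``delete``; please check :class:`DataIO <dlg.io.DataIO>` for the authoritative set of methods and signatures::

    import io

    from dlg.io import DataIO


    class BufferIO(DataIO):
        """Hypothetical I/O backend keeping all data in an in-memory buffer.

        The hook names below are assumptions and must be verified against
        dlg.io.DataIO.
        """

        def __init__(self):
            super(BufferIO, self).__init__()
            self._buf = io.BytesIO()

        def _open(self, **kwargs):
            self._buf.seek(0)
            return self._buf

        def _write(self, data, **kwargs):
            return self._buf.write(data)

        def _read(self, count=4096, **kwargs):
            return self._buf.read(count)

        def _close(self, **kwargs):
            pass  # nothing to release for an in-memory buffer

        def exists(self):
            return True

        def delete(self):
            self._buf = io.BytesIO()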


20 changes: 10 additions & 10 deletions docs/dataflow.rst
@@ -1,8 +1,8 @@
Concepts and Background
-----------------------

This section briefly introduces key concepts and motivations underpinning
|daliuge|.
This section introduces key concepts and motivations underpinning
the |daliuge| system.

Dataflow
^^^^^^^^
@@ -85,29 +85,29 @@ Concretely, we have made the following changes to the existing dataflow model:

.. _dlg_functions:

|daliuge| Functions
^^^^^^^^^^^^^^^^^^^
|daliuge| provides eight Graph-based functions as shown in
:numref:`dataflow.fig.funcs`.
|daliuge| operational concepts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned above, |daliuge| has been developed to enable the processing of data from the future Square Kilometre Array (SKA) observatory. To support the SKA operational environment, |daliuge| provides eight Graph-based functions as shown in
:numref:`dataflow.fig.funcs`. The implementation of these operational concepts does not, in general, restrict the usage of |daliuge| for other use cases, but it is still tailored to meet the SKA requirements.

.. _dataflow.fig.funcs:

.. figure:: images/dfms_func_as_graphs.jpg

Graph-based Functions of the |daliuge| Prototype

The :doc:`graphs` section will go through implementation details for each function.
Here we briefly discuss how they work together in our data-driven framework.
The :doc:`graphs` section describes the implementation details for each function.
Here we briefly discuss how they work together to fulfill the SKA requirements.

* First of all, the *Logical Graph Template* (topleft in
:numref:`dataflow.fig.funcs`) represents high-level
data processing capabilities. In the case of SDP, they could be, for example,
data processing capabilities. In the case of the SKA Data Processor, they could be, for example,
"Process Visibility Data" or "Stage Data Products".

* Logical Graph Templates are managed by *LogicalGraph Template
Repositories* (bottomleft in :numref:`dataflow.fig.funcs`).
The logical graph template is first selected from this repository for a specific pipeline and
is then filled with scheduling block parameters. This generates a *Logical Graph*, expressing a workflow with resource-oblivious dataflow constructs.
is then populated with parameters derived from the detailed description of the scheduled science observation. This generates a *Logical Graph*, expressing a workflow with resource-oblivious dataflow constructs.

* Using profiling information of pipeline components executed on specific hardware resources, |daliuge|
then "translates" a Logical Graph into a *Physical Graph Template*, which prescribes a manifest of all Drops without specifying their physical locations.
6 changes: 3 additions & 3 deletions docs/dlm.rst
@@ -8,13 +8,13 @@ account how and when it is used. This includes, for instance, placing medium-
and long-term persistent data into the optimal storage media, and to remove
data that is not used anymore.

The current |daliuge| implementation contains a Data Lifecycle Manager (DLM)
prototype. Because of the high coupling that is needed with all the Drops the
The current |daliuge| implementation contains a Data Lifecycle Manager (DLM).
Because of the high coupling that is needed with all the Drops, the
DLM is contained within the :ref:`node_drop_manager` processes, and thus shares
the same memory space with the Drops it manages. By subscribing to events sent
by individual Drops it can track their state and react accordingly.

The DLM functionalities currently implemented in the |daliuge| prototype are:
The DLM functionalities currently implemented in |daliuge| are:

* Automatically expire Drops; i.e., moves them from the **COMPLETED** state
into the **EXPIRED** state, after which they are not readable anymore.
9 changes: 4 additions & 5 deletions docs/graphs.rst
@@ -87,12 +87,11 @@ the following flow constructs:

Repositories
""""""""""""
|daliuge| uses EAGLE as a Web-based |lg| editor as the default user interface
to underlying *logical graph repositories*. Repositories can reside on a local file system, on GitHub or on GitLab. Each |lg| is physically stored in those repositories as a
JSON-formatted textual file. For example, the JSON file for the continuous
imaging pipeline as shown partially in :numref:`graphs.figs.loop` can be accessed
|daliuge| uses EAGLE, a Web-based |lg| editor, as the default user interface
to underlying logical graph and component repositories. Repositories can reside on a local file system, on GitHub or on GitLab. Each |lg| is physically stored in those repositories as a
JSON-formatted text file. The JSON format is based on a JSON schema, against which the files are also validated. The JSON file contains the description of the application and data components used in the graph as nodes, a description of the connections between the nodes (edges and connection ports), and also some of the representation properties required to draw the graph.

`through HTTP GET <http://sdp-dfms.ddns.net/jsonbody?lg_name=cont_img.json>`_.
The repositories also contain so-called *palettes*, which represent collections of components. Users can pick from those components in EAGLE to draw |lgts|. The distinction between graphs and palettes is somewhat blurry, since any graph can also be used as a collection of components. However, palettes usually contain a superset of the components used in any graph derived from them, and thus the distinction is still relevant. An illustration of the kind of information a |lg| file holds is sketched below.
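
The sketch below (written as a Python dict for readability) is purely illustrative; the field names are made up, and the authoritative structure is defined by the EAGLE JSON schema mentioned above. It only shows that such a file boils down to node records, edge records with ports, and layout hints::

    # Hypothetical structure only -- not the actual EAGLE/|daliuge| schema.
    logical_graph = {
        "nodes": [
            {"key": 1, "category": "PythonApp", "name": "Process Visibility Data"},
            {"key": 2, "category": "File", "name": "Image Product"},
        ],
        "edges": [
            {"from": 1, "fromPort": "output", "to": 2, "toPort": "input"},
        ],
        "layout": [
            {"node": 1, "x": 100, "y": 80},
            {"node": 2, "x": 260, "y": 80},
        ],
    }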


Usage of |Lgts| and |Lgs|
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -11,7 +11,7 @@ Welcome to the Data Activated 流 [#f1]_ Graph Engine (|daliuge|).
|daliuge|
is a workflow graph execution framework, specifically designed to support very large-scale processing graphs for the reduction of interferometric radio astronomy data sets. DALiuGE has already been used for processing large astronomical datasets in existing radio astronomy projects. It originated from a prototyping activity within the SDP Consortium, called the Data Flow Management System (DFMS), which aimed to prototype the execution framework of the proposed SDP architecture.
For a complete tour of |daliuge| please read
our `overview paper <http://dx.doi.org/10.1016/j.ascom.2017.03.007>`_.
our `overview paper <http://dx.doi.org/10.1016/j.ascom.2017.03.007>`_. DALiuGE has been used in a project running a `full-scale simulation <http://dx.doi.org/10.1109/SC41405.2020.00006>`_ of the Square Kilometre Array dataflow on the ORNL Summit supercomputer.

.. figure:: images/DALiuGE_naming_rationale.png

@@ -27,7 +27,8 @@ and is performed by the `DIA team <http://www.icrar.org/our-research/data-intens
deployment
overview
graph_development
writing_an_application
app_development
data_development
examples
api-index

4 changes: 2 additions & 2 deletions docs/installing.rst
@@ -49,7 +49,7 @@ Direct Installation
**NOTE: For most use cases the docker installation described above is recommended.**

Requirements
############
^^^^^^^^^^^^


The |daliuge| framework requires no packages apart from those listed in its
@@ -66,7 +66,7 @@ installed on the system:
* gcc >= 4.8

Installing
##########
^^^^^^^^^^

|daliuge| requires Python 3.7 or later.

4 changes: 4 additions & 0 deletions docs/reference.rst
@@ -48,3 +48,7 @@ References
Parallel and Distributed Systems, IEEE Transactions on, 25(4), pp.993-1002.
#. Bokhari, S.H., 2012. Assignment problems in parallel and distributed
computing (Vol. 32). Springer Science & Business Media
#. R. Wang, et al., "Processing Full-Scale Square Kilometre Array Data on the
Summit Supercomputer," in 2020 SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), Atlanta, GA, US, 2020, pp. 11-22.
doi: 10.1109/SC41405.2020.00006
