Merge branch 'master' into python3-only
rtobar committed Jul 1, 2021
2 parents 0e50452 + a317c9c commit 1da1022
Showing 12 changed files with 110 additions and 52 deletions.
8 changes: 4 additions & 4 deletions daliuge-engine/dlg/ddap_protocol.py
@@ -29,10 +29,10 @@ class DROPLinkType:
Although not explicitly stated in this enumeration, each link type has a
corresponding inverse. This way, if X is a consumer of Y, Y is an input of
X. The full list is:
* CONSUMER / INPUT
* STREAMING_CONSUMER / STREAMING_INPUT
* PRODUCER / OUTPUT
* PARENT / CHILD
"""
CONSUMER, STREAMING_CONSUMER, PRODUCER, PARENT, CHILD, INPUT, STREAMING_INPUT, OUTPUT = range(8)
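
The inverse relationship described in the docstring can be pictured as a simple lookup table. This is a purely illustrative sketch; the mapping below is not defined by the module itself::

    from dlg.ddap_protocol import DROPLinkType

    # Each link type paired with its inverse, as listed in the DROPLinkType
    # docstring: CONSUMER/INPUT, STREAMING_CONSUMER/STREAMING_INPUT,
    # PRODUCER/OUTPUT, PARENT/CHILD.
    LINK_INVERSE = {
        DROPLinkType.CONSUMER: DROPLinkType.INPUT,
        DROPLinkType.INPUT: DROPLinkType.CONSUMER,
        DROPLinkType.STREAMING_CONSUMER: DROPLinkType.STREAMING_INPUT,
        DROPLinkType.STREAMING_INPUT: DROPLinkType.STREAMING_CONSUMER,
        DROPLinkType.PRODUCER: DROPLinkType.OUTPUT,
        DROPLinkType.OUTPUT: DROPLinkType.PRODUCER,
        DROPLinkType.PARENT: DROPLinkType.CHILD,
        DROPLinkType.CHILD: DROPLinkType.PARENT,
    }

    # If X is a consumer of Y, then Y is an input of X:
    assert LINK_INVERSE[DROPLinkType.CONSUMER] == DROPLinkType.INPUT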

3 changes: 3 additions & 0 deletions daliuge-engine/dlg/json_drop.py
@@ -31,6 +31,9 @@


class JsonDROP(FileDROP):
"""
Very simple implementation derived from the :class:`FileDROP <dlg.drop.FileDROP>` class
"""
def __init__(self, oid, uid, **kwargs):
self._data = None
super(JsonDROP, self).__init__(oid, uid, **kwargs)
18 changes: 9 additions & 9 deletions docs/README ray.rst
@@ -1,5 +1,5 @@
Notes of the merge between Data Activated 流 Graph Engine and Ray
=================================================================
Notes of the merge between |daliuge| and Ray
=============================================

The objective of this activity was to investigate a feasible solution for the flexible and simple deployment of DALiuGE on various platforms. In particular, the deployment of DALiuGE on AWS in an autoscaling environment is of interest to us.

@@ -13,13 +13,13 @@ Ray (https://docs.ray.io/en/master/) is a pretty complete execution engine all b

Internally, Ray uses a number of technologies that we are also using or evaluating within DALiuGE and/or the SKA. The way Ray manages and distributes computing is done very well and essentially covers a number of our target platforms, including AWS, SLURM, Kubernetes, Azure and GC.

The idea thus was to use Ray to distribute DALiuGE on those platforms and on AWS to start with.
The idea thus was to use Ray to distribute DALiuGE on those platforms, starting with AWS, but to leave the two systems otherwise essentially independent.

Setup
-----

Pre-requisites
--------------
^^^^^^^^^^^^^^

First, you need to install Ray into your local Python virtualenv::

@@ -30,7 +30,7 @@ Ray uses a YAML file to configure a deployment and allows to run additional setu
The rest is then straightforward and just requires configuring a few AWS autoscale-specific settings, including the AWS region, the type of head node, the type and (maximum and minimum) number of worker nodes, and whether to use the Spot market or not. In addition, the virtual machine AMI ID must be specified, which is a pain to obtain and differs between the various AWS regions.

Starting the DALiuGE Ray cluster
--------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Getting DALiuGE up and running in addition to Ray requires just two additional lines for the head and the worker nodes in the YAML file, but there are some caveats, as outlined below. With the provided Ray configuration YAML file, starting a cluster running DALiuGE on AWS is very easy (provided you have your AWS environment set up)::

@@ -48,7 +48,7 @@ More for convenience both DALiuGE and Ray require a number of ports to be expose
More specifically, the command above opens a shell inside the Docker container running on the head node AWS instance.

Issues
------
^^^^^^
The basic idea is to start up a Data Island Manager (and possibly also a Node Manager) on the Ray head node and a Node Manager on each of the worker nodes. Starting them is trivial, but the issue is how to correctly register the NMs with the DIM. DALiuGE usually does this the other way around, by telling the DIM at startup which NMs it is responsible for. This is not possible in an autoscaling setup, since the NMs don't exist yet.
As a workaround, DALiuGE does provide a REST API call to register an NM with the DIM, but that currently has a bug which registers the node twice.

@@ -64,15 +64,15 @@ To stop and start a node manager use the following two commands, replacing the S
The commands above also show how to connect to a shell inside the Docker container on a worker node. Unfortunately, Ray does not expose this as easily as the connection to the head node.

Submitting and executing a Graph
--------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This configuration only deploys the DALiuGE engine. EAGLE and a translator need to be deployed somewhere else. When submitting the physical graph (PG) from a translator web interface, the IP address to be entered there is the *public* IP address of the DIM (the Ray AWS head instance). After submitting, the DALiuGE monitoring page will pop up and show the progress bar. It is then also possible to click your way through to the sub-graphs running on the worker nodes.

Future thoughts
===============
---------------
This implementation is the start of an integration between Ray and DALiuGE. Ray (like AWS autoscaling) is a *reactive* execution framework and as such it uses the autoscaling feature just in time, when the load exceeds a certain threshold. DALiuGE, on the other hand, is a *proactive* execution framework and pre-allocates the resources required to execute a whole workflow. Both approaches have pros and cons. In particular, in an environment where resources are charged by the second, it is desirable to allocate them as dynamically as possible. On the other hand, dynamic allocation comes with the overhead of provisioning additional resources during run-time and is thus non-deterministic in terms of completion time. This is even more obvious when using the spot market on AWS. Fully dynamic allocation also does not fit well with bigger workflows, which require lots of resources right from the start. The optimal solution very likely lies somewhere between fully dynamic and fully static resource provisioning.

Dynamic workflow allocation
---------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step in that direction is to connect the DALiuGE translator with the Ray deployment. After the translator has performed the workflow partitioning, the resource requirements are fixed and could in turn be used to start up the Ray cluster with the required number of worker nodes. Essentially, this would also completely isolate one workflow from another. The next step could be to add workflow fragmentation to the DALiuGE translator and scale the Ray cluster according to the requirements of each fragment, rather than the whole workflow. It remains to be seen how to trigger the scaling of the Ray cluster just far enough ahead of time that resources are available when the previous fragment finishes, so the workflow can continue without delays.


48 changes: 32 additions & 16 deletions docs/api/dlg.rst
@@ -4,43 +4,59 @@ dlg
.. automodule:: dlg
.. contents::

dlg.ddap_protocol
^^^^^^^^^^^^^^^^^
.. automodule:: dlg.ddap_protocol
:members:

dlg.drop
^^^^^^^^
.. automodule:: dlg.drop
:members:

dlg.droputils
^^^^^^^^^^^^^
.. automodule:: dlg.droputils
:members:

dlg.event
^^^^^^^^^
.. automodule:: dlg.event
:members:

dlg.graph_loader
^^^^^^^^^^^^^^^^
.. automodule:: dlg.graph_loader
:members:

dlg.io
^^^^^^
.. automodule:: dlg.io
:members:

.. _api.dlg.drop:

dlg.drop
^^^^^^^^
.. automodule:: dlg.drop
dlg.json_drop
^^^^^^^^^^^^^
.. automodule:: dlg.json_drop
:members:

dlg.rpc
^^^^^^^
.. automodule:: dlg.rpc
:members:

dlg.runtime.delayed
^^^^^^^^^^^^^^^^^^^
.. autofunction:: dlg.runtime.delayed

dlg.s3_drop
^^^^^^^^^^^
.. automodule:: dlg.s3_drop
:members:

dlg.droputils
^^^^^^^^^^^^^
.. automodule:: dlg.droputils
:members:

dlg.utils
^^^^^^^^^
.. automodule:: dlg.utils
:members:

dlg.graph_loader
^^^^^^^^^^^^^^^^
.. automodule:: dlg.graph_loader
:members:

dlg.runtime.delayed
^^^^^^^^^^^^^^^^^^^
.. autofunction:: dlg.runtime.delayed
@@ -6,7 +6,7 @@ to write a new application component that can be used
as an Application Drop during the execution of a |daliuge| graph.

Types of Application Components
===============================
-------------------------------

|daliuge| supports four main types of Application Components.

35 changes: 35 additions & 0 deletions docs/data_development.rst
@@ -0,0 +1,35 @@
Data Component Development
=================================

This section describes what developers need to do
to write a new data component that can be used
as a Data Drop during the execution of a |daliuge| graph.

Different from most other frameworks, |daliuge| makes data components first-class entities in the context of a workflow. In fact, data components, or rather their instances, the Data Drops, drive the execution of a workflow. Consequently, |daliuge| graphs show both application and data components as graph nodes. Edges in |daliuge| graphs thus always connect a data component with an application component, expressing the consumer/producer relationships between them.

Types of Data Components
------------------------

Out of the box, |daliuge| supports five main types of Data Components.

* Posix file data components
* Memory data components
* S3 data components
* NGAS data components
* Apache Plasma data components

This range covers most of the use cases in workflows we have encountered so far, but adding support for additional data components is possible as well.

.. default-domain:: py

Python Data Component Class
---------------------------

Depending on the I/O model of such a new component, the development can start from one of the existing higher-level data components, from the lower-level :class:`DataIO <dlg.io.DataIO>` abstract class, or even lower down from the :class:`AbstractDROP <dlg.drop.AbstractDROP>` class. The best way to get started is to check out the code of existing data components, e.g. the :class:`JsonDROP <dlg.json_drop.JsonDROP>` or the :class:`S3DROP <dlg.s3_drop.S3DROP>`. A minimal skeleton following that pattern is sketched below.
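
This is purely an illustration (``YamlDROP`` is a made-up name), mirroring the :class:`JsonDROP <dlg.json_drop.JsonDROP>` constructor shown earlier in this commit::

    from dlg.drop import FileDROP


    class YamlDROP(FileDROP):
        """Hypothetical data component storing its payload as a YAML file."""

        def __init__(self, oid, uid, **kwargs):
            # Cache for the parsed content, filled lazily on first access
            self._data = None
            super(YamlDROP, self).__init__(oid, uid, **kwargs)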

I/O
---

A data component's input and output methods are defined by the abstract class :class:`DataIO <dlg.io.DataIO>`. The methods in that class are just empty definitions and have to be implemented by the actual data component, as sketched below.
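
The following rough sketch assumes the hooks are named ``_open``/``_read``/``_write``/``_close`` plus ``exists`` and ``delete``; please check :class:`DataIO <dlg.io.DataIO>` for the authoritative set of methods and signatures::

    import io

    from dlg.io import DataIO


    class BufferIO(DataIO):
        """Hypothetical I/O backend keeping all data in an in-memory buffer.

        The hook names below are assumptions and must be verified against
        dlg.io.DataIO.
        """

        def __init__(self):
            super(BufferIO, self).__init__()
            self._buf = io.BytesIO()

        def _open(self, **kwargs):
            self._buf.seek(0)
            return self._buf

        def _write(self, data, **kwargs):
            return self._buf.write(data)

        def _read(self, count=4096, **kwargs):
            return self._buf.read(count)

        def _close(self, **kwargs):
            pass  # nothing to release for an in-memory buffer

        def exists(self):
            return True

        def delete(self):
            self._buf = io.BytesIO()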


20 changes: 10 additions & 10 deletions docs/dataflow.rst
@@ -1,8 +1,8 @@
Concepts and Background
-----------------------

This section briefly introduces key concepts and motivations underpinning
|daliuge|.
This section introduces key concepts and motivations underpinning
the |daliuge| system.

Dataflow
^^^^^^^^
@@ -85,29 +85,29 @@ Concretely, we have made the following changes to the existing dataflow model:

.. _dlg_functions:

|daliuge| Functions
^^^^^^^^^^^^^^^^^^^
|daliuge| provides eight Graph-based functions as shown in
:numref:`dataflow.fig.funcs`.
|daliuge| operational concepts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned above, |daliuge| has been developed to enable the processing of data from the future Square Kilometre Array (SKA) observatory. To support the SKA operational environment, |daliuge| provides eight Graph-based functions as shown in
:numref:`dataflow.fig.funcs`. The implementation of these operational concepts does not, in general, restrict the usage of |daliuge| for other use cases, but it is still tailored to meet the SKA requirements.

.. _dataflow.fig.funcs:

.. figure:: images/dfms_func_as_graphs.jpg

Graph-based Functions of the |daliuge| Prototype

The :doc:`graphs` section will go through implementation details for each function.
Here we briefly discuss how they work together in our data-driven framework.
The :doc:`graphs` section describes the implementation details for each function.
Here we briefly discuss how they work together to fulfill the SKA requirements.

* First of all, the *Logical Graph Template* (topleft in
:numref:`dataflow.fig.funcs`) represents high-level
data processing capabilities. In the case of SDP, they could be, for example,
data processing capabilities. In the case of the SKA Data Processor, they could be, for example,
"Process Visibility Data" or "Stage Data Products".

* Logical Graph Templates are managed by *LogicalGraph Template
Repositories* (bottomleft in :numref:`dataflow.fig.funcs`).
The logical graph template is first selected from this repository for a specific pipeline and
is then filled with scheduling block parameters. This generates a *Logical Graph*, expressing a workflow with resource-oblivious dataflow constructs.
is then populated with parameters derived from the detailed description of the scheduled science observation. This generates a *Logical Graph*, expressing a workflow with resource-oblivious dataflow constructs.

* Using profiling information of pipeline components executed on specific hardware resources, |daliuge|
then "translates" a Logical Graph into a *Physical Graph Template*, which prescribes a manifest of all Drops without specifying their physical locations.
6 changes: 3 additions & 3 deletions docs/dlm.rst
@@ -8,13 +8,13 @@ account how and when it is used. This includes, for instance, placing medium-
and long-term persistent data into the optimal storage media, and to remove
data that is not used anymore.

The current |daliuge| implementation contains a Data Lifecycle Manager (DLM)
prototype. Because of the high coupling that is needed with all the Drops the
The current |daliuge| implementation contains a Data Lifecycle Manager (DLM).
Because of the high coupling that is needed with all the Drops, the
DLM is contained within the :ref:`node_drop_manager` processes, and thus shares
the same memory space with the Drops it manages. By subscribing to events sent
by individual Drops it can track their state and react accordingly.

The DLM functionalities currently implemented in the |daliuge| prototype are:
The DLM functionalities currently implemented in |daliuge| are:

* Automatically expire Drops; i.e., moves them from the **COMPLETED** state
into the **EXPIRED** state, after which they are not readable anymore.
9 changes: 4 additions & 5 deletions docs/graphs.rst
@@ -87,12 +87,11 @@ the following flow constructs:

Repositories
""""""""""""
|daliuge| uses EAGLE as a Web-based |lg| editor as the default user interface
to underlying *logical graph repositories*. Repositories can reside on a local file system, on GitHub or on GitLab. Each |lg| is physically stored in those repositories as a
JSON-formatted textual file. For example, the JSON file for the continuous
imaging pipeline as shown partially in :numref:`graphs.figs.loop` can be accessed
|daliuge| uses EAGLE, a Web-based |lg| editor, as the default user interface
to underlying logical graph and component repositories. Repositories can reside on a local file system, on GitHub or on GitLab. Each |lg| is physically stored in those repositories as a
JSON-formatted text file. The JSON format is based on a JSON schema, against which the files are also validated. The JSON file contains the description of the application and data components used in the graph as nodes, a description of the connections between the nodes (edges and connection ports), and also some of the representation properties required to draw the graph.

`through HTTP GET <http://sdp-dfms.ddns.net/jsonbody?lg_name=cont_img.json>`_.
The repositories also contain so-called *palettes*, which represent collections of components. Users can pick from those components in EAGLE to draw |lgts|. The distinction between graphs and palettes is somewhat blurry, since any graph can also be used as a collection of components. However, palettes usually contain a superset of the components used in any graph derived from them, and thus the distinction is still relevant. An illustration of the kind of information a |lg| file holds is sketched below.
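
The sketch below (written as a Python dict for readability) is purely illustrative; the field names are made up, and the authoritative structure is defined by the EAGLE JSON schema mentioned above. It only shows that such a file boils down to node records, edge records with ports, and layout hints::

    # Hypothetical structure only -- not the actual EAGLE/|daliuge| schema.
    logical_graph = {
        "nodes": [
            {"key": 1, "category": "PythonApp", "name": "Process Visibility Data"},
            {"key": 2, "category": "File", "name": "Image Product"},
        ],
        "edges": [
            {"from": 1, "fromPort": "output", "to": 2, "toPort": "input"},
        ],
        "layout": [
            {"node": 1, "x": 100, "y": 80},
            {"node": 2, "x": 260, "y": 80},
        ],
    }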


Usage of |Lgts| and |Lgs|
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -11,7 +11,7 @@ Welcome to the Data Activated 流 [#f1]_ Graph Engine (|daliuge|).
|daliuge|
is a workflow graph execution framework, specifically designed to support very large-scale processing graphs for the reduction of interferometric radio astronomy data sets. DALiuGE has already been used for processing large astronomical datasets in existing radio astronomy projects. It originated from a prototyping activity within the SDP Consortium, called the Data Flow Management System (DFMS), which aimed to prototype the execution framework of the proposed SDP architecture.
For a complete tour of |daliuge| please read
our `overview paper <http://dx.doi.org/10.1016/j.ascom.2017.03.007>`_.
our `overview paper <http://dx.doi.org/10.1016/j.ascom.2017.03.007>`_. DALiuGE has been used in a project running a `full-scale simulation <http://dx.doi.org/10.1109/SC41405.2020.00006>`_ of the Square Kilometre Array dataflow on the ORNL Summit supercomputer.

.. figure:: images/DALiuGE_naming_rationale.png

@@ -27,7 +27,8 @@ and is performed by the `DIA team <http://www.icrar.org/our-research/data-intens
deployment
overview
graph_development
writing_an_application
app_development
data_development
examples
api-index

4 changes: 2 additions & 2 deletions docs/installing.rst
@@ -49,7 +49,7 @@ Direct Installation
**NOTE: For most use cases the docker installation described above is recommended.**

Requirements
############
^^^^^^^^^^^^


The |daliuge| framework requires no packages apart from those listed in its
@@ -66,7 +66,7 @@ installed on the system:
* gcc >= 4.8

Installing
##########
^^^^^^^^^^

|daliuge| requires Python 3.7 or later.

4 changes: 4 additions & 0 deletions docs/reference.rst
@@ -48,3 +48,7 @@ References
Parallel and Distributed Systems, IEEE Transactions on, 25(4), pp.993-1002.
#. Bokhari, S.H., 2012. Assignment problems in parallel and distributed
computing (Vol. 32). Springer Science & Business Media
#. R. Wang, et al., "Processing Full-Scale Square Kilometre Array Data on the
Summit Supercomputer," in 2020 SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), Atlanta, GA, US, 2020, pp. 11-22.
doi: 10.1109/SC41405.2020.00006
