
Small documentation corrections
awicenec committed Jul 1, 2021
1 parent ad1b028 commit 797c666
Showing 5 changed files with 17 additions and 17 deletions.
16 changes: 8 additions & 8 deletions docs/README ray.rst
@@ -1,5 +1,5 @@
Notes of the merge between |daliuge| and Ray
- =============================================
+ --------------------------------------------

The objective of this activity was to investigate a feasible solution for the flexible and simple deployment of DALiuGE on various platforms. In particular, the deployment of DALiuGE on AWS in an autoscaling environment is of interest to us.

@@ -16,10 +16,10 @@ Internally Ray is using a number of technologies we are also using or evaluating
The idea thus was to use Ray to distribute DALiuGE on those platforms, starting with AWS, but to leave the two systems otherwise essentially independent.

Setup
- -----
+ ^^^^^

Pre-requisites
- ^^^^^^^^^^^^^^
+ """"""""""""""

First you need to install Ray into your local python virtualenv::
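    # The actual command is collapsed in this diff view; a minimal sketch, assuming a
    # standard Python 3 virtual environment and the PyPI packages ray and boto3
    # (the environment path is a placeholder only):
    python3 -m venv ~/dlg-ray-env
    source ~/dlg-ray-env/bin/activate
    pip install ray boto3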

@@ -30,7 +30,7 @@ Ray uses a YAML file to configure a deployment and allows to run additional setup
The rest is then straightforward and just requires configuring a few AWS autoscale-specific settings, including the AWS region, the type of head node, the type and (maximum and minimum) number of worker nodes, and whether or not to use the Spot market. In addition it is required to specify the virtual machine AMI ID, which is a pain to get and differs between the various AWS regions.
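
For orientation only, a stripped-down cluster configuration might look like the sketch below. The keys follow the Ray 1.x autoscaler schema, and the cluster name, region, instance types and AMI ID are placeholders rather than values taken from this repository::

    # dlg-aws.yaml -- minimal autoscaler sketch with placeholder values
    cluster_name: dlg-demo
    min_workers: 0
    max_workers: 4
    provider:
      type: aws
      region: us-east-2
    auth:
      ssh_user: ubuntu
    head_node:
      InstanceType: m5.xlarge
      ImageId: ami-0123456789abcdef0     # region-specific AMI ID goes here
    worker_nodes:
      InstanceType: m5.large
      ImageId: ami-0123456789abcdef0
      InstanceMarketOptions:
        MarketType: spot                 # request Spot instances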

Starting the DALiuGE Ray cluster
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ """"""""""""""""""""""""""""""""

Getting DALiuGE up and running in addition to Ray requires just two additional lines for the head and the worker nodes in the YAML file, but there are some caveats, as outlined below. With the provided Ray configuration YAML file, starting a cluster running DALiuGE on AWS is super easy (provided you have your AWS environment set up)::
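    # The literal commands are collapsed in this view; with the placeholder
    # configuration file from the sketch above, the Ray CLI calls look roughly like:
    ray up dlg-aws.yaml -y        # create (or update) the cluster
    ray attach dlg-aws.yaml       # open a shell on the head node
    ray down dlg-aws.yaml -y      # tear the cluster down again when done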

@@ -48,7 +48,7 @@ More for convenience both DALiuGE and Ray require a number of ports to be exposed
More specifically, the command above actually opens a shell inside the docker container running on the head node AWS instance.

Issues
- ^^^^^^
+ """"""
The basic idea is to start up a Data Island Manager (and possibly also a Node Manager) on the Ray head node and a Node Manager on each of the worker nodes. Starting them is trivial, but the issue is to correctly register the NMs with the DIM. DALiuGE usually does this the other way around, by telling the DIM at startup which NMs it is responsible for. This is not possible in an autoscaling setup, since the NMs don't exist yet.
As a workaround, DALiuGE does provide a REST API call to register an NM with the DIM, but that call currently has a bug and registers the node twice.
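
A sketch of such a registration call is shown below; port 8001 is assumed to be the DIM's default REST port, and the route itself is hypothetical, since the exact path is not part of this diff::

    # hypothetical route -- replace with the actual DIM REST endpoint for adding a node
    curl -X POST "http://<head-node-ip>:8001/api/nodes/<worker-node-ip>"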

@@ -64,15 +64,15 @@ To stop and start a node manager use the following two commands, replacing the S
The commands above also show how to connect to a shell inside the docker container on a worker node. Unfortunately, this is not exposed as conveniently in Ray as the connection to the head node.
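
The collapsed commands are not reproduced here; a rough equivalent, with a placeholder SSH key, worker IP and container name, and with the ``dlg`` options to be read as assumptions rather than confirmed flags, would be::

    # open a shell in the DALiuGE container on a worker node (all names are placeholders)
    ssh -i ~/.ssh/ray-autoscaler_<region>.pem ubuntu@<worker-node-ip>
    docker exec -it daliuge-engine bash

    # inside the container: stop and restart the node manager (flags are assumptions)
    dlg nm -s
    dlg nm -d -H 0.0.0.0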

Submitting and executing a Graph
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ """"""""""""""""""""""""""""""""
This configuration only deploys the DALiuGE engine; EAGLE and a translator need to be deployed somewhere else. When submitting the physical graph (PG) from a translator web interface, the IP address to be entered there is the *public* IP address of the DIM (the Ray AWS head instance). After submitting, the DALiuGE monitoring page will pop up and show the progress bar. It is then also possible to click your way through to the sub-graphs running on the worker nodes.
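
The head node's IP address does not have to be looked up in the AWS console; assuming the placeholder configuration file used in the earlier sketches, Ray can report it directly::

    # print the IP address used to reach the cluster's head node
    ray get-head-ip dlg-aws.yaml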

Future thoughts
- ---------------
+ """""""""""""""
This implementation is the start of an integration between Ray and DALiuGE. Ray (like the AWS autoscaling) is a *reactive* execution framework and as such it uses the autoscaling feature just in time, when the load exceeds a certain threshold. DALiuGE on the other hand is a *proactive* execution framework and pre-allocates the resources required to execute a whole workflow. Both approaches have pros and cons. In particular, in an environment where resources are charged by the second it is desirable to allocate them as dynamically as possible. On the other hand, dynamic allocation comes with the overhead of provisioning additional resources during run-time and is thus non-deterministic in terms of completion time. This is even more obvious when using the spot market on AWS. Fully dynamic allocation also does not fit well with bigger workflows, which require lots of resources right from the start. The optimal solution very likely lies somewhere between fully dynamic and fully static resource provisioning.

Dynamic workflow allocation
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ """""""""""""""""""""""""""
The first step in that direction is to connect the DALiuGE translator with the Ray deployment. After the translator has performed the workflow partitioning, the resource requirements are fixed and could in turn be used to start up the Ray cluster with the required number of worker nodes. This would also completely isolate one workflow from another. The next step could be to add workflow fragmentation to the DALiuGE translator and scale the Ray cluster according to the requirements of each of the fragments, rather than the whole workflow. It remains to be seen how to trigger the scaling of the Ray cluster just far enough ahead of time that the workflow can continue from one fragment to the next without delays.


2 changes: 1 addition & 1 deletion docs/basics.rst
@@ -18,7 +18,7 @@ The |daliuge| execution engine *can* be run on a laptop as well, but, other than
EAGLE
#####

- EAGLE is a web-application allowing users to develop complex scientific workflows using a visual paradigm similar to Kepler and Taverna or Simulink and Quartz composer. It also allows groups of scientists to work together on such workflows, version them and prepare them for execution. A workflow in |daliuge| terminology is represented as a graph. In fact the |daliuge| system is dealing with a whole set of different graphs representing the same workflow, depending on the lifecycle state of that workflow. EAGLE just deals with the so-called *Logical Graph* state of a given workflow. EAGLE also offers an interface to the *Translator* and, through the *Translator* also to the *Engine*. For detailed information about EAGLE please refer to the EAGLE basic documentation under https://github.com/ICRAR/EAGLE as well as the `detailed usage documentation in EAGLE <https://eagle.icrar.org/static/docs/build/html/index.html>`__.
+ EAGLE is a web-application allowing users to develop complex scientific workflows using a visual paradigm similar to Kepler and Taverna or Simulink and Quartz composer. It also allows groups of scientists to work together on such workflows, version them and prepare them for execution. A workflow in |daliuge| terminology is represented as a graph. In fact the |daliuge| system is dealing with a whole set of different graphs representing the same workflow, depending on the lifecycle state of that workflow. EAGLE just deals with the so-called *Logical Graph* state of a given workflow. EAGLE also offers an interface to the *Translator* and, through the *Translator* also to the *Engine*. For detailed information about EAGLE please refer to the EAGLE basic documentation under https://github.com/ICRAR/EAGLE as well as the `detailed usage documentation in readthedocs <https://eagle-dlg.readthedocs.io>`__.

Translator service
##################
2 changes: 1 addition & 1 deletion docs/deployment.rst
@@ -1,6 +1,6 @@
Deployment
==========

- The three components described in the :ref:`basics` section allow for a very flexible deployment. In a real world deployment there will always be one master manager, zero or a few data island managers, and as many node managers as there are computing nodes available to the |daliuge| execution engine. In very small deployments one node manager can take over the role of the master manager as well.
+ The three components described in the :ref:`basics` section allow for a very flexible deployment. In a real world deployment there will always be one data island manager, zero or one master managers, and as many node managers as there are computing nodes available to the |daliuge| execution engine. In very small deployments one node manager can take over the role of the master manager as well. For extremely large deployments |daliuge| supports deploying a hierarchy of island managers, although even with tens of millions of tasks we have not seen an actual need for this. Note that the managers only deploy the graph; the execution is completely asynchronous and does not require any of the higher-level managers to run. Even the *manager functionality* of the node manager is not required at run-time.
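
As a minimal illustration (host names are placeholders, and the ``dlg`` command-line options shown are assumptions about the usual manager entry points), such a deployment could be brought up like this::

    # on every compute node: a node manager
    dlg nm -d -H 0.0.0.0

    # on the head node: a data island manager pointing at the node managers
    dlg dim -d -N node01,node02,node03

    # optionally, a master manager on top of one or more island managers
    dlg mm -d -N head01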

The primary usage scenario for the |daliuge| execution engine is to run it on a large cluster of machines with very large workflows of thousands to millions of individual tasks. However, for testing and small scale applications it is also possible to deploy the whole system on a single laptop or on a small cluster. It is also possible to deploy the whole system, or parts of it, on AWS or a Kubernetes cluster. For instance, EAGLE and/or the *translator* could run locally, or on a single AWS instance, and submit the physical graph to a master manager on some HPC system. This flexible deployment is also the reason why the individual components are kept well separated.
10 changes: 5 additions & 5 deletions docs/installing.rst
@@ -6,21 +6,21 @@ Docker images

The recommended and easiest way to get started is to use the docker container installation procedures provided to build and run the daliuge-engine and the daliuge-translator. We currently build the system in three images:

- * *icrar/daliuge-base* contains all the basic |daliuge| libraries and dependencies.
+ * *icrar/daliuge-common* contains all the basic |daliuge| libraries and dependencies.
* *icrar/daliuge-engine* is built on top of the :base image and includes the installation
of the DALiuGE execution engine.
* *icrar/daliuge-translator* is also built on top of the :base image and includes the installation
of the DALiuGE translator.


- This way we try to separate the pre-requirements of the daliuge engine and translator from the rest of the framework, which is more dynamic. The idea is then to rebuild only the daliuge-engine image as needed when new versions of the framework need to be deployed, and not build it from scratch each time.
+ This way we try to separate the pre-requirements of the daliuge engine and translator from the rest of the framework, which has a more dynamic development cycle. The idea is then to rebuild only the daliuge-engine image as needed when new versions of the framework need to be deployed, and not build it from scratch each time.

- Most of the dependencies included in :base do not belong to the DALiuGE framework itself, but
- rather to its requirements (mainly to the spead2 communication protocol). Once we move out the spead2 application from this repository (and therefore the dependency of dfms on spead2) we'll re-organize these Dockerfiles to have a base installation of the dfms framework, and then build further images on top of that base image containing specific applications with their own system installation requirements.
+ Most of the dependencies included in daliuge-common do not belong to the DALiuGE framework itself, but
+ rather to its requirements (mainly to the spead2 communication protocol).

The *daliuge-engine* image by default runs a generic daemon, which can then start the Master Manager, Node Manager or Data Island Manager. This approach makes it possible to change the actual manager deployment configuration dynamically and adjust it to the requirements of the environment.
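
As a sketch, assuming the daemon listens on port 9000 and exposes a ``/managers/node`` route (both assumptions, not confirmed by this commit), starting a node manager through the daemon might look like::

    # run the engine image; the daemon starts inside the container
    docker run -d --name daliuge-engine --network host icrar/daliuge-engine

    # ask the daemon to spin up a node manager (port and route are assumed defaults)
    curl -X POST http://localhost:9000/managers/node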

- Building the three images is easy, just start with the daliuge-base image by running::
+ Building the three images is easy: just start with the daliuge-common image by running::

cd daliuge-common && ./build_common.sh && cd ..
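
The remaining build steps are collapsed in this view; assuming the engine and translator directories ship analogous build scripts (an assumption based on the pattern of ``build_common.sh``), the sequence would continue along the same lines::

    cd daliuge-engine && ./build_engine.sh && cd ..
    cd daliuge-translator && ./build_translator.sh && cd ..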

4 changes: 2 additions & 2 deletions docs/intro.rst
@@ -26,8 +26,8 @@ sources (mostly *Python*!) such as `Luigi <http://luigi.readthedocs.io/>`_,
Nevertheless, we believe |daliuge| has some unique features well suited
for data-intensive applications:

- * Completely data-activated, by promoting data Drops to become graph "nodes" (no longer just edges)
-   that has persistent states and can consume and raise events
+ * Completely data-activated, by promoting data :doc:`drops` to become graph "nodes" (no longer just edges)
+   that have persistent states and can consume and raise events
* Integration of data-lifecycle management within the data processing framework
* Separation of concerns between logical graphs (high level workflows) and physical graphs (execution recipes)
* Flexible pipeline component interface, including Docker containers.
