1 parent 1f7c560, commit 7a576a1
Showing 6 changed files with 1,848 additions and 1,835 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
\chapter{Background\label{chap1}} | ||
Workflow management systems define a complex process in terms of well-defined | ||
tasks and coordinate process completion \cite{1245778}. Automated | ||
workflow management has been in wide use across various disciplines since | ||
the concept was formalised in 1996\cite{springerlink:10.1007/BF00136712}. | ||
Successful systems have been implemented across various fields, including | ||
banking and pharmaceuticals | ||
\cite{Brahe:2007:SWW:1316624.1316661,5407993}. | ||
|
||
Workflow management has been shown to be very successful in the sciences, as the same scientific | ||
process can easily be repeated on a different set of data\cite{4721191}. | ||
This not only aids in reproducibility but also saves time. This is done by | ||
efficiently abstracting the operations in the flow, allowing it to be | ||
automatically handled. | ||
|
||
Geomatics is the field that concerns itself with the organisation, | ||
representation and processing of geographic data, for the purpose of | ||
querying it and making decisions based on the data | ||
\cite{DiMartino:2007:TAG:1341012.1341081}. The workflow in Geomatics is | ||
very distributed and the set of data that is operated on is large and | ||
diverse. Workflow management within Geomatics has been considered and | ||
solutions have been proposed, but not implemented or | ||
evaluated\cite{Migliorini:2011:WTG:1999320.1999356}. | ||
|
||
|
||
\section{Overview} | ||
A workflow management system consists of definitions on how a set of tasks | ||
should be executed \cite{springerlink:10.1007/BF00136712,vanderAalst2002125}. | ||
The overall procedure is defined by the following components: | ||
\begin{inparaenum}[(i)] \item actors, \item roles, \item responsibilities and | ||
obligations, \item tasks, \item activities,\item conceptual structures and | ||
\item resources.\end{inparaenum} | ||
|
||
A real life problem or task can then be broken up into these components in | ||
such a way that the tasks represent a flow network. These tasks then connect to | ||
the actors and resources via the other | ||
components\cite[p.~4]{Taylor:2006:WES:1196459}. This allows tasks to be | ||
executed efficiently in a distributed manner. | ||
|
||
The initial implementations of workflow systems, however, almost | ||
immediately failed. The systems were too rigid and were unable to accommodate the | ||
high levels of change that were required by the users | ||
\cite{Suchman:1983:OPP:357442.357445}. | ||
|
||
These changes come from a number of sources, including: ill-specification | ||
of initial problems, change in actors or resources, exceptions that occurred | ||
and new requirements. Adaptive workflow systems were proposed to solve this | ||
problem by providing a mechanism for allowing change in the | ||
system\cite{vanderAalst2002125}. This allows processes to be extended, replaced | ||
or re-ordered. It also adds the ability to change already running tasks by | ||
providing restart, transfer and proceed options. | ||
|
||
Scientific workflow management has also been very successful with how | ||
experiments are defined, and, more importantly, reused. Another benefit that was | ||
quickly discovered was that it also allowed researchers to trade workflows, | ||
making the replication of results much easier than they were | ||
previously\cite{4721191}. Keys to this success were: that the workflow systems | ||
were made to fit the researchers; quick responses to adding required features | ||
when needed; listening to user input and making sharing of workflows as easy as | ||
possible. | ||
|
||
Such systems have also been applied in fields that operate on large data | ||
sets, as would be the case in Geomatics. Workflow systems were found to | ||
work well in managing the processing of this data. Applying the | ||
concept to Observational Astrophysics revealed that it could be used to | ||
identify bottlenecks that could be optimised \cite{Aragon:2009:WMH:1529282.1529491}. Further, it was used to | ||
automatically ensure local access to large files that needed to be processed. | ||
|
||
|
||
\section{Geographic Data} | ||
Geomatics concerns itself with the collection, organisation and query of | ||
geographic data \\ \cite{DiMartino:2007:TAG:1341012.1341081}. This data includes | ||
but is not limited to landscapes, coordinate data, building models, | ||
statistics, pictures, textures and routes. This is a very broad set of data, | ||
varying from very large to very small. That variation, however, means that | ||
there exists no uniform method to efficiently deal with the data. | ||
|
||
The processing of this data can vary from human to software processing | ||
\cite{DiMartino:2007:TAG:1341012.1341081}. Various Web applications have been | ||
written to facilitate the tasks that need to be accomplished. This software is | ||
known as WebGIS and is becoming more popular with scientists; it also means | ||
that even within the field there is a strong shift toward Web-based services. | ||
|
||
A key realisation with the usage of this data is that the same data is used | ||
across various applications to create various levels of | ||
abstraction\cite{ElAdnani:2001:MLF:512161.512177}. The core data is seldom | ||
changed. Instead a new abstraction layer is added on top of it. The data can be | ||
thought of as a graph, where the nodes represent either a data or abstraction | ||
element, and the edges represent the functions/tasks required to create the | ||
particular abstraction as a set of topological relationships. | ||
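The graph view described above can be sketched in a few lines of Python. This is a minimal illustration only, not a structure taken from the cited work: the class name \texttt{DerivationGraph} and the node and task names (\texttt{laser\_scan}, \texttt{register}, and so on) are hypothetical. Nodes are data or abstraction elements, and each edge carries the task that derives one element from another, so the core data is never modified.

```python
from collections import defaultdict


class DerivationGraph:
    """Directed graph: nodes are data or abstraction elements, edges are tasks."""

    def __init__(self):
        # Maps a source node to a list of (task_name, derived_node) pairs.
        self._edges = defaultdict(list)

    def add_derivation(self, source, task, derived):
        """Record that running `task` on `source` produces `derived`."""
        self._edges[source].append((task, derived))

    def derived_from(self, source):
        """Return every element derived from `source`, in breadth-first order."""
        seen, queue, order = {source}, [source], []
        while queue:
            node = queue.pop(0)
            for _task, child in self._edges.get(node, []):
                if child not in seen:
                    seen.add(child)
                    order.append(child)
                    queue.append(child)
        return order


# Hypothetical example: one core data set with three abstraction layers.
graph = DerivationGraph()
graph.add_derivation("laser_scan", "register", "point_cloud")
graph.add_derivation("point_cloud", "mesh", "surface_model")
graph.add_derivation("point_cloud", "slice", "floor_plan")
```

Asking for everything derived from the core data set walks the abstraction layers without touching the original element, which mirrors the observation that new abstractions are layered on top of unchanged core data.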
|
||
\section{Implementations} | ||
There are various products available that can compose scientific workflows. | ||
\emph{The Trident workbench} \cite{Simmhan:2009:BTS:1673063.1673121} is an open | ||
source workflow management system developed by Microsoft Research that also | ||
adds middleware services and a graphical composition interface. Trident builds | ||
workflows of control and data flows from built-in and user-defined activities | ||
and nested subflows. | ||
|
||
The flows are represented using XOML, an XML Specification, while the | ||
activities are stored as a set of sub-routines\cite{Simmhan2011790}. Trident | ||
can be used on a local system, remote systems and even clusters. Queries on | ||
the system can be performed using LINQ. | ||
|
||
|
||
\emph{Kepler} is another scientific workflow management system that | ||
provides workflow design and execution. Actors are designed to perform | ||
independent tasks that can either be atomic or composite | ||
\cite{Wang:2009:KHG:1645164.1645176}. Composite actors (subflows) consist of | ||
multiple atomic actors bundled together. Actors can consume data and produce | ||
output, called tokens. Actors communicate tokens with each other via links. The | ||
order of execution and the links are defined by an independent entity called | ||
the director. As a consequence, the workflow can either be executed in a | ||
sequential or parallel manner. Kepler effectively separates the workflow from | ||
its execution, allowing for easy batch execution. Actors can easily be exported | ||
and shared. Kepler is very popular due to its adaptability and easy | ||
integration. | ||
|
||
\emph{Taverna} is a scientific workbench that supports application-level | ||
workflow and does not focus on scheduling as much as others\cite{4721191}. Taverna | ||
has a strong focus on workflow sharing. Taverna is quite popular, since there | ||
exists a social network designed to facilitate workflow sharing among | ||
scientists (\emph{myExperiment}). Services are linked to the model to execute | ||
the various tasks. Taverna can be used in such a way that it can utilize all | ||
the services a client has to facilitate the flow by easily adding services. The | ||
Taverna language is a simple data-flow language called the Simple Conceptual | ||
Unified Flow Language (SCUFL), which can be encoded in XML. | ||
|
||
In order for these workbenches to be successful, there needs to exist a | ||
high level of interoperability between the workflow management and the services | ||
that are required \cite{Shegalov:2001:XWM:767132.767139}. However, | ||
there is a relatively high chance of failure when building this | ||
interoperability into the services as a core component; it is extremely high | ||
risk and is therefore not typically done. A cheaper way of doing this is | ||
providing middleware that can wrap around the service to provide the required | ||
interfaces. | ||
|
||
This need for interoperability has led to the popularisation of SOA (Service-Oriented | ||
Architecture) \cite{Sanders:2008:SSA:1400549.1400595}. It should be | ||
noted that SOA is \emph{not} an implementation, but rather an | ||
\emph{Architectural Model}; SOA refers to a collection of loosely coupled | ||
services, that individually carry out a particular process. Each service should | ||
have a well defined interface with self-contained functionality. It should | ||
allow other applications or services to use this functionality without knowing | ||
the underlying technical details. These services should be hidden from the | ||
end-user and their usage should preferably be platform-independent. | ||
|
||
Although the concept has been around since the 1970s, it has only recently | ||
gained favour due to Web services. Web services are software components that run on the | ||
Internet through XML standards-based | ||
interfaces\cite{Tai:2004:CCW:1045658.1045680}. Each service provides a | ||
functional description using the \emph{Web Services Description Language} (WSDL). | ||
This description provides the supported operations, as well as the definition | ||
of the input and output messages. | ||
|
||
By using these concepts, a workflow system can be built that automatically uses | ||
these Web Services to facilitate both the data and control flow using well-defined | ||
interfaces in standards such as XML/JSON \cite{Shegalov:2001:XWM:767132.767139}. | ||
With the advancement of WebGIS, a lot | ||
of Web Services that facilitate Geomatics processing already exist. | ||
|
||
|
||
\section{Case Studies} | ||
This section looks at two instances where workflow management systems | ||
were implemented and used. These case studies cover both a business and | ||
a scientific application. | ||
\subsection*{Danske Bank} | ||
The workflow management system at \emph{Danske Bank} was implemented | ||
incrementally as the bank moved away from a manual | ||
system\cite{Brahe:2007:SWW:1316624.1316661}. | ||
|
||
This system was developed as an in-house solution when the manual system | ||
could not cope any longer. Several lessons were learnt that are applicable | ||
to other workflow systems. When work was divided purely from an | ||
efficiency point of view, the workers became complacent as they felt that | ||
they did not understand the overall mechanism and felt that they were not | ||
involved. They discovered that the system did not handle change very | ||
well. This change was expensive and inevitable. Their system had to be | ||
adapted to handle this change. The success of the system is mainly | ||
attributed to the interoperability and close relationship between the | ||
users and the developers. | ||
|
||
\subsection*{OrthoSearch} | ||
\emph{OrthoSearch} is a workflow, built on \emph{Kepler}, that is | ||
designed to work on data in the field of | ||
Bioinformatics\cite{daCruz:2008:OSW:1363686.1363983}. | ||
|
||
A workflow system was implemented in \emph{Kepler} as it addressed the | ||
requirements they had, including: \begin{inparaenum}[(i)] \item workflow | ||
definition and design; \item workflow execution control; \item fault | ||
tolerance; \item intermediate data management; and \item data provenance | ||
support. \end{inparaenum} | ||
|
||
Although the system was not without its hiccups and changes, the | ||
integration with Kepler provided the workflow with increased overall | ||
productivity. | ||
|
||
|
||
\section{Implication} | ||
The field of Geomatics concerns itself with a vast amount of geographic data. | ||
This data comes in various sizes and, as such, different methods of handling it | ||
would need to be used to facilitate dataflows within the system. | ||
|
||
The work, however, is done in a very distributed manner, which allows for a very | ||
effective mapping onto a grid-based computing solution, provided middleware can | ||
be developed to support the systems that are | ||
used\cite{Montella:2007:UGC:1272980.1272995}. | ||
|
||
Workflow for Geomatics processes, due to its distributed nature, would map well | ||
onto an automated workflow system\cite{Withana:2010:VWE:1851476.1851586}. The | ||
nature of the science is supported well. It would allow for effective automation | ||
of some of the functions that are available. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
\chapter{Conclusion\label{chap4}} | ||
This research project was concluded with the successful implementation of a | ||
Workflow Management System. This system can successfully manage a complex set | ||
of tasks with arbitrary dependencies. Tasks can either be fully automated | ||
by the system or be completed by users. When user tasks are started, the | ||
required files are incrementally transferred to the desktop host of the user | ||
using \emph{rsync}, so that only files that have not been transferred or are | ||
out of date are sent. To ensure quality and accuracy, a feature was added that | ||
requires user tasks to be validated by experienced members of the team before the task | ||
can be labelled as complete. | ||
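The incremental transfer step can be illustrated with a small Python wrapper around \emph{rsync}. This is a sketch under stated assumptions, not the system's actual code: the function names, host name and paths are hypothetical, and the exact flags used in the implementation are not given in this chapter. The sketch relies on standard \emph{rsync} behaviour, where archive mode (\texttt{-a}) preserves modification times so that files whose size and timestamp already match on the destination are skipped.

```python
import subprocess


def build_rsync_command(source_dir, dest_host, dest_dir):
    """Build an rsync command that copies only new or changed files.

    -a  archive mode: recurse and preserve timestamps, which lets rsync's
        default size/mtime check skip files that are already up to date
    -z  compress data in transit
    """
    # A trailing slash makes rsync copy the directory contents, not the directory.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-a", "-z", source, f"{dest_host}:{dest_dir}"]


def transfer_task_files(source_dir, dest_host, dest_dir):
    """Run the transfer; raises CalledProcessError if rsync exits with an error."""
    subprocess.run(build_rsync_command(source_dir, dest_host, dest_dir), check=True)


# Hypothetical task directory and desktop host, for illustration only.
command = build_rsync_command("/data/site1/task7", "desktop01", "/home/user/work")
```

Building the command separately from running it keeps the argument list easy to inspect and log before any data is moved.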
|
||
Automated tasks are executed on the | ||
server because these tasks operate on very large files. By executing | ||
them locally, data does not need to be transferred, which would be an expensive | ||
process. Tasks are automatically started when all dependencies are met. | ||
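The rule that a task starts once all of its dependencies are met can be sketched as follows. This is an illustrative Python outline rather than the system's implementation: the function names and the example task names are hypothetical, and tasks are run one after another here for simplicity.

```python
def ready_tasks(dependencies, completed, started):
    """Return unstarted tasks whose dependencies have all completed."""
    return sorted(
        task
        for task, deps in dependencies.items()
        if task not in started and all(dep in completed for dep in deps)
    )


def run_all(dependencies, execute):
    """Run every task as soon as its dependencies are met; return the run order."""
    completed, started, order = set(), set(), []
    while len(completed) < len(dependencies):
        batch = ready_tasks(dependencies, completed, started)
        if not batch:
            # Nothing can start, yet tasks remain: a cycle or a missing task.
            raise ValueError("cyclic or unsatisfiable dependencies")
        for task in batch:
            started.add(task)
            execute(task)
            completed.add(task)
            order.append(task)
    return order


# Hypothetical workflow: each task maps to the tasks it depends on.
workflow = {
    "scan": [],
    "register": ["scan"],
    "mesh": ["register"],
    "texture": ["register"],
    "export": ["mesh", "texture"],
}
run_order = run_all(workflow, execute=lambda task: None)
```

Each pass starts every task that has become eligible, so \texttt{mesh} and \texttt{texture} are released together once \texttt{register} completes, and \texttt{export} waits for both.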
|
||
In the event that a task fails, the system also allows the user to inspect the | ||
logging information that is generated during the execution of the task. Once the | ||
problem is identified, the task can then be manually restarted. The design and | ||
implementation were done in three iterations. This was explained in depth in | ||
Chapter~\ref{chap2}. | ||
|
||
|
||
The system was then successfully evaluated in Chapter~\ref{chap3} both for its | ||
usability and its effectiveness at solving the problem. The following positive | ||
results were obtained during the evaluation of the system: | ||
\begin{enumerate} | ||
\item The system was successfully able to implement and execute a portion of | ||
the workflow in the modelling section of the tasks that are | ||
present in the Zamani Project. This sample workflow used a mix of system | ||
and user tasks. | ||
\item The system was positively evaluated using a sample group of 24 users. | ||
This evaluation revealed that users found the system useful and easy to | ||
use, and that users were satisfied using the system. From user responses and the | ||
observations made during the test, it was found that the system is | ||
effective and is very easy to learn. | ||
\end{enumerate} | ||
|
||
This system was, however, not implemented within the Zamani Project. This was | ||
mainly due to time constraints, caused by the scale and time required to | ||
implement it. Functionally the system could be implemented; however, this process | ||
could be significantly simplified by the addition of some features. These are | ||
mentioned in the future work section. | ||
|
||
|
||
\section{Future Work} | ||
During the implementation of the workflow system, various possible extensions | ||
to the system were identified; however, due to time constraints, these could | ||
not be implemented. These features would improve the system in terms of | ||
performance, usability and setup time. | ||
\begin{description} | ||
\item[Hierarchical Workflows]\hfill \\ | ||
To allow better control over and reusability of tasks, workflows should be | ||
abstracted to include a hierarchy. Such a hierarchy would allow entire workflows | ||
to be represented as singular nodes. These workflows could then be repackaged | ||
and reused in different sites, or even the same site. This would also allow the | ||
setup for new sites to be much faster, as prepackaged workflows could easily be | ||
used as drop-in components. | ||
\item[Parameterized Scripts]\hfill \\ | ||
Particular parameters of a script often change from one site to | ||
another. This change does not necessarily affect the \emph{Task Type}; however, | ||
with the current implementation of the system the change would need to be made | ||
at this point. This can be greatly improved by allowing a \emph{Task} to send | ||
parameters to the job. This would require the Task Subsystem to allow parameters | ||
to be sent dynamically to the \emph{Task Type}. | ||
\item[Rule Based File Filters]\hfill \\ | ||
Currently within the system, all the files in the output directory of a task are | ||
treated as input to successor tasks. Tasks often use only a portion of the | ||
files created by the predecessor. To facilitate this with the current | ||
system, an additional filtering task would need to be set up that filters out | ||
unused files. By including a rule-based filtering system, much greater control | ||
can be placed on the output files. Such rule-based filters have been | ||
successfully implemented in other systems\cite{conery2005rule}. | ||
\item[Interactive Task Feedback Options]\hfill \\ | ||
In order to avoid one of the problems that were found in Section~\ref{eval:simple}, | ||
more interactivity is required for \emph{Tasks}. This primarily includes | ||
real-time updates on the status of tasks. Further developments include the | ||
ability to do more interactive validation, such as discussion integration. | ||
The addition of these collaborative features could allow issues | ||
with tasks to be resolved in a uniform manner\cite{guimaraes1998integration}. | ||
\item[Transformation-based Task Support]\hfill \\ | ||
Currently the system is built around creating derivative data items. However, it | ||
is common for certain files within a site to change without creating an | ||
additional copy. Although this behaviour is implicitly allowed, it should be | ||
extended to be better defined within the system. | ||
\item[Parallel Task Processing] \hfill \\ | ||
One of the most crucial aspects affecting the long term feasibility of the | ||
system is its ability to scale and handle larger and more complicated workflows. | ||
In this regard the server node would become a significant bottleneck in | ||
processing \emph{Server Tasks}. In order to alleviate this problem the system | ||
would need to become distributed. This would present its own set of problems | ||
as data would need to be efficiently distributed between the computation nodes | ||
to ensure efficiency. | ||
|
||
\end{description} | ||
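As an illustration of the rule-based file filter proposed above, the following Python sketch selects which output files of a task are passed on to its successors. It is one possible design under stated assumptions, not part of the implemented system: the function name, the include and exclude rule lists, and the file names are all hypothetical, and rules are expressed as shell-style wildcard patterns.

```python
from fnmatch import fnmatchcase


def filter_outputs(filenames, include=("*",), exclude=()):
    """Keep files matching any include rule and no exclude rule.

    Rules are shell-style patterns such as "*.ply"; matching is case-sensitive.
    The original ordering of `filenames` is preserved.
    """
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in include)
        and not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]


# Hypothetical output directory listing of a predecessor task.
outputs = ["model.ply", "model.log", "preview.png", "scratch.tmp"]
kept_by_include = filter_outputs(outputs, include=("*.ply", "*.png"))
kept_by_exclude = filter_outputs(outputs, exclude=("*.tmp", "*.log"))
```

Attaching such rules to the link between two tasks would remove the need for a separate filtering task, since only the matching files would be treated as input by the successor.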
|