I'm tired
michielbaird committed Oct 26, 2012
1 parent 1f7c560 commit 7a576a1
Showing 6 changed files with 1,848 additions and 1,835 deletions.
214 changes: 214 additions & 0 deletions writeup/thesis_michiel/background.tex
@@ -0,0 +1,214 @@
\chapter{Background\label{chap1}}
Workflow management systems define a complex process in terms of well-defined
tasks and coordinate process completion \cite{1245778}. Automated
workflow management has been in wide use across various disciplines since
the concept was formalised in 1996\cite{springerlink:10.1007/BF00136712}.
Successful systems have been implemented across various fields, including
banking and pharmaceuticals
\cite{Brahe:2007:SWW:1316624.1316661,5407993}.

Workflow management has been shown to be very successful in the sciences, as
the same scientific process can easily be repeated on a different set of
data\cite{4721191}. This not only aids reproducibility but also saves time.
It is achieved by abstracting the operations in the flow so that they can be
handled automatically.

Geomatics is the field that concerns itself with the organisation,
representation and processing of geographic data, for the purpose of
querying it and making decisions based on the data
\cite{DiMartino:2007:TAG:1341012.1341081}. The workflow in Geomatics is
very distributed and the set of data that is operated on is large and
diverse. Workflow management within Geomatics has been considered and
solutions have been proposed, but not implemented or
evaluated\cite{Migliorini:2011:WTG:1999320.1999356}.


\section{Overview}
A workflow management system consists of definitions of how a set of tasks
should be executed \cite{springerlink:10.1007/BF00136712,vanderAalst2002125}.
The overall procedure is defined by the following components:
\begin{inparaenum}[(i)] \item actors, \item roles, \item responsibilities and
obligations, \item tasks, \item activities,\item conceptual structures and
\item resources.\end{inparaenum}

A real life problem or task can then be broken up into these components in
such a way that the tasks represent a flow network. These tasks then connect to
the actors and resources via the other
components\cite[p.~4]{Taylor:2006:WES:1196459}. This allows tasks to be
executed efficiently in a distributed manner.

The initial implementations of workflow systems, however, failed almost
immediately. The systems were too rigid and were unable to accommodate the
high levels of change required by their users
\cite{Suchman:1983:OPP:357442.357445}.

These changes come from a number of sources, including poorly specified
initial problems, changes in actors or resources, exceptions that occur
and new requirements. Adaptive workflow systems were proposed to solve this
problem by providing a mechanism for allowing change in the
system\cite{vanderAalst2002125}. This allows processes to be extended, replaced
or re-ordered. It also adds the ability to change already running tasks by
providing restart, transfer and proceed options.

Scientific workflow management has also been very successful in how
experiments are defined and, more importantly, reused. Another benefit that was
quickly discovered was that it allowed researchers to trade workflows,
making the replication of results much easier than it was
previously\cite{4721191}. Keys to this success were that the workflow systems
were made to fit the researchers, that required features were added quickly
when needed, that user input was listened to, and that sharing of workflows
was made as easy as possible.

Such systems have also been applied in fields that operate on large data
sets, as would be the case for problems in Geomatics. Workflow systems were
found to work well in managing the processing of this data. When the
concept was applied to Observational Astrophysics, it revealed
bottlenecks that could be optimised \cite{Aragon:2009:WMH:1529282.1529491}. Further, it was used to
automatically ensure local access to large files that needed to be processed.


\section{Geographic Data}
Geomatics concerns itself with the collection, organisation and query of
geographic data \\ \cite{DiMartino:2007:TAG:1341012.1341081}. This data includes
but is not limited to landscapes, coordinate data, building models,
statistics, pictures, textures and routes. This is a very broad set of data,
varying from very large to very small. That variation, however, means that
there exists no uniform method to efficiently deal with the data.

The processing of this data ranges from manual processing by people to
automated processing by software
\cite{DiMartino:2007:TAG:1341012.1341081}. Various Web applications have been
written to facilitate the tasks that need to be accomplished. This software is
known as WebGIS and is becoming more popular with scientists, which indicates
a strong shift toward Web-based services within the field.

A key realisation about the usage of this data is that the same data is used
across various applications to create various levels of
abstraction\cite{ElAdnani:2001:MLF:512161.512177}. The core data is seldom
changed. Instead, a new abstraction layer is added on top of it. The data can be
thought of as a graph, where the nodes represent either a data or abstraction
element, and the edges represent the functions/tasks required to create the
particular abstraction as a set of topological relationships.
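
The graph view described above can be illustrated with a short Python sketch.
It is only a minimal model under stated assumptions and is not taken from the
cited work: the class \texttt{AbstractionGraph}, its methods and all element
and task names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


class AbstractionGraph:
    """Directed graph: nodes are data or abstraction elements,
    edges are the tasks that derive one element from another."""

    def __init__(self):
        # source node -> list of (task name, derived node)
        self.edges = defaultdict(list)

    def add_task(self, source, task, derived):
        """Record that `task` creates `derived` from `source`."""
        self.edges[source].append((task, derived))

    def derived_from(self, source):
        """Return every element that is ultimately built on `source`."""
        seen, stack = set(), [source]
        while stack:
            node = stack.pop()
            for _task, child in self.edges[node]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


graph = AbstractionGraph()
# Hypothetical element and task names, used only for illustration.
graph.add_task("point_cloud", "triangulate", "mesh")
graph.add_task("mesh", "apply_textures", "textured_model")
graph.add_task("point_cloud", "extract_contours", "contour_map")

print(sorted(graph.derived_from("point_cloud")))
```

Because the core data is never modified, such a graph only grows: a new
abstraction adds a node and an edge, and a traversal from a core element lists
everything that would need to be regenerated if that element did change.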

\section{Implementations}
There are various products available that can compose scientific workflows.
\emph{The Trident workbench} \cite{Simmhan:2009:BTS:1673063.1673121} is an open
source workflow management system developed by Microsoft Research that also
adds middleware services and a graphical composition interface. Trident builds
workflows of control and data flows from built-in and user-defined activities
and nested subflows.

The flows are represented using XOML, an XML-based specification, while the
activities are stored as a set of sub-routines\cite{Simmhan2011790}. Trident
can be used on a local system, remote systems and even clusters. Queries on
the system can be performed using LINQ.


\emph{Kepler} is another scientific workflow management system that
provides workflow design and execution. Actors are designed to perform
independent tasks that can either be atomic or composite
\cite{Wang:2009:KHG:1645164.1645176}. Composite actors (subflows) consist of
multiple atomic actors bundled together. Actors can consume data and produce
output, called tokens. Actors pass tokens to each other via links. The
order of execution and the links are defined by an independent entity called
the director. As a consequence, the workflow can be executed in either a
sequential or a parallel manner. Kepler effectively separates the workflow from
its execution, allowing for easy batch execution. Actors can easily be exported
and shared. Kepler is very popular due to its adaptability and easy
integration.
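
The separation between actors and the director can be made concrete with a toy
model in Python. This sketch does not use or reproduce Kepler's actual API; the
classes \texttt{Actor} and \texttt{SequentialDirector}, as well as the example
actors, are assumptions introduced only to illustrate the idea that actors
define \emph{what} is computed while the director defines the order and the
links.

```python
class Actor:
    """An atomic actor: consumes input tokens and produces one output token."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, *tokens):
        return self.func(*tokens)


class SequentialDirector:
    """Owns the links and the order of execution.

    `links` maps an actor name to the names of the actors whose output
    tokens it consumes. Actors are fired one after another, in list order.
    """

    def __init__(self, actors, links):
        self.actors = actors
        self.links = links

    def run(self):
        tokens = {}
        for actor in self.actors:
            inputs = [tokens[src] for src in self.links.get(actor.name, [])]
            tokens[actor.name] = actor.fire(*inputs)
        return tokens


# Hypothetical three-step workflow: read -> sort -> head.
read = Actor("read", lambda: [3, 1, 2])
sort = Actor("sort", lambda xs: sorted(xs))
head = Actor("head", lambda xs: xs[0])

director = SequentialDirector(
    [read, sort, head],
    {"sort": ["read"], "head": ["sort"]},
)
result = director.run()
print(result["head"])
```

In this model a parallel director could be substituted without touching the
actors, which is the property that makes batch execution and actor sharing
straightforward.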

\emph{Taverna} is a scientific workbench that supports application-level
workflows and does not focus on scheduling as much as others\cite{4721191}. Taverna
has a strong focus on workflow sharing. Its popularity is helped by a social
network designed to facilitate workflow sharing among
scientists (\emph{myExperiment}). Services are linked to the model to execute
the various tasks. New services can easily be added, so a workflow can make use
of all the services a client has available. The
Taverna language is a simple data-flow language called the Simple Conceptual
Unified Flow Language (SCUFL), which can be encoded in XML.

In order for these workbenches to be successful, there needs to exist a
high level of interoperability between the workflow management and the services
that are required \cite{Shegalov:2001:XWM:767132.767139}. However, building this
interoperability into the services as a core component carries a relatively
high chance of failure. It is therefore considered an extremely high
risk and is not typically done. A cheaper way of doing this is
to provide middleware that wraps around the service to provide the required
interfaces.

This need for interoperability has led to the popularisation of Service-Oriented
Architecture (SOA) \cite{Sanders:2008:SSA:1400549.1400595}. It should be
noted that SOA is \emph{not} an implementation, but rather an
\emph{Architectural Model}; SOA refers to a collection of loosely coupled
services, that individually carry out a particular process. Each service should
have a well defined interface with self-contained functionality. It should
allow other applications or services to use this functionality without knowing
the underlying technical details. These services should be hidden from the
end-user and their usage should preferably be platform-independent.

Although the concept has been around since the 1970s, it has only recently
gained favour due to Web services. Web services are software components that run on the
Internet through XML standards-based
interfaces\cite{Tai:2004:CCW:1045658.1045680}. Each service provides a
functional description using the \emph{Web Services Description Language} (WSDL).
This description provides the supported operations, as well as the definition
of the input and output messages.

By using these concepts, a workflow system can be built that automatically uses
these Web Services to facilitate both the data and control flow using well-defined
interfaces in standards such as XML/JSON \cite{Shegalov:2001:XWM:767132.767139}.
With the advancement of WebGIS, many
Web Services that facilitate Geomatics processing already exist.


\section{Case Studies}
This section looks at two instances where workflow management systems
were implemented and used. The case studies cover both a business and
a scientific application.
\subsection*{Danske Bank}
The workflow management system at \emph{Danske Bank} was implemented
incrementally as the bank moved away from a manual
system\cite{Brahe:2007:SWW:1316624.1316661}.

This system was developed as an in-house solution when the manual system
could no longer cope. Several lessons were learnt that are applicable
to other workflow systems. When work was divided purely from an
efficiency point of view, the workers became complacent, as they felt that
they did not understand the overall mechanism and were not
involved. It was also discovered that the system did not handle change very
well. This change was expensive and inevitable, so the system had to be
adapted to handle it. The success of the system is mainly
attributed to the interoperability and close relationship between the
users and the developers.

\subsection*{OrthoSearch}
\emph{OrthoSearch} is a workflow, built on \emph{Kepler}, that is
designed to work on data in the field of Bioinformatics
\cite{daCruz:2008:OSW:1363686.1363983}.

The workflow was implemented in \emph{Kepler} because it addressed the
requirements of its developers, including: \begin{inparaenum}[(i)] \item workflow
definition and design; \item workflow execution control; \item fault
tolerance; \item intermediate data management; and \item data provenance
support. \end{inparaenum}

Although the system was not without its hiccups and changes, the
integration with Kepler provided the workflow with increased overall
productivity.


\section{Implication}
The field of Geomatics concerns itself with a vast amount of geographic data.
This data comes in various sizes, and as such different methods of handling it
would need to be used to facilitate dataflows within the system.

The work, however, is done in a very distributed manner, which allows for a very
effective mapping onto a grid-based computing solution, provided middleware can
be developed to support the systems that are
used\cite{Montella:2007:UGC:1272980.1272995}.

Workflow for Geomatics processes, due to its distributed nature, would map well
onto an automated workflow
system\cite{Withana:2010:VWE:1851476.1851586}. The nature of the science is supported
well, and such a system would allow for effective automation of some of the
functions that are available.
96 changes: 96 additions & 0 deletions writeup/thesis_michiel/conclusion.tex
@@ -0,0 +1,96 @@
\chapter{Conclusion\label{chap4}}
This research project was concluded with the successful implementation of a
Workflow Management System. This system can successfully manage a complex set
of tasks with arbitrary dependencies. Tasks can either be fully automated
by the system or be completed by users. When user tasks are started, the
required files are incrementally transferred to the desktop host of the user
using \emph{rsync}, so that only files that have not yet been transferred or are
out of date are sent. To enforce quality and accuracy, a feature was added that
requires user tasks to be validated by experienced members of the team before the task
can be labelled as complete.

Automated tasks are executed on the
server because these tasks operate on very large files. By executing
them where the data resides, the data does not need to be transferred, which would be an expensive
process. Tasks are automatically started when all dependencies are met.
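
The rule that a task starts once all of its dependencies are met can be
illustrated with a small Python sketch. It is not the code of the implemented
system: the functions \texttt{ready\_tasks} and \texttt{run\_all}, the
\texttt{execute} callback and the task names are hypothetical, and the sketch
assumes that every task succeeds.

```python
def ready_tasks(dependencies, completed, started):
    """Return the tasks that have not been started and whose dependencies
    are all complete, sorted by name so the result is deterministic."""
    return sorted(
        task
        for task, deps in dependencies.items()
        if task not in started and all(dep in completed for dep in deps)
    )


def run_all(dependencies, execute):
    """Repeatedly start every ready task until no more tasks become ready.

    Tasks that are part of a dependency cycle never become ready and are
    therefore never started.
    """
    completed, started, order = set(), set(), []
    while True:
        batch = ready_tasks(dependencies, completed, started)
        if not batch:
            break
        for task in batch:
            started.add(task)
            execute(task)  # assumed to succeed in this sketch
            completed.add(task)
            order.append(task)
    return order


# Hypothetical task names; each task maps to the tasks it depends on.
deps = {
    "register_scans": [],
    "build_mesh": ["register_scans"],
    "texture_mesh": ["build_mesh"],
    "export_plan": ["register_scans"],
}
order = run_all(deps, execute=lambda task: None)
print(order)
```

Each pass of the loop corresponds to one moment at which the system checks for
newly satisfied dependencies. Tasks within the same batch are independent of
each other and could, in principle, be run side by side.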

In the event that a task fails, the system also allows the user to inspect the
logging information that is generated during the execution of the task. Once the
problem is identified, the task can then be manually restarted. The design and
implementation were done in three iterations. This was explained in depth in
Chapter~\ref{chap2}.


The system was then successfully evaluated in Chapter~\ref{chap3}, both for its
usability and its effectiveness at solving the problem. The following positive
results were obtained during the evaluation of the system:
\begin{enumerate}
\item The system was successfully able to implement and execute a portion of
the modelling workflow that is
present in the Zamani Project. This sample workflow used a mix of system
and user tasks.
\item The system was positively evaluated using a sample group of 24 users.
This evaluation revealed that users found the system useful and easy to
use, and that they were satisfied using the system. From the user responses and the
observations made during the test, it was found that the system is
effective and is very easy to learn.
\end{enumerate}

This system was, however, not implemented within the Zamani Project. This was
mainly due to time constraints, caused by the scale and time required to
implement it. Functionally the system could be implemented; however, this process
could be significantly simplified by the addition of some features. These are
described in the Future Work section.


\section{Future Work}
During the implementation of the workflow system, various possible extensions
to the system were identified; however, due to constraints on time, these could
not be implemented. These features would improve the system in terms of
performance, usability and set-up time.
\begin{description}
\item[Hierarchical Workflows]\hfill \\
To allow better control over tasks and better reusability, workflows should be
abstracted to include a hierarchy. Such a hierarchy would allow entire workflows
to be represented as single nodes. These workflows could then be repackaged
and reused in different sites, or even the same site. This would also allow the
setup for new sites to be much faster, as prepackaged workflows could easily be
used as drop-in components.
\item[Parameterized Scripts]\hfill \\
Oftentimes particular parameters of a script can change from one site to
another. This change does not necessarily affect the \emph{Task Type}; however,
with the current implementation of the system the change would need to be made
at that level. This can be greatly improved by allowing a \emph{Task} to send
parameters to the job. This would require the Task Subsystem to allow parameters
to be sent dynamically to the \emph{Task Type}.
\item[Rule Based File Filters]\hfill \\
Currently within the system all the files in the output directory of a task are
treated as input to successor tasks. Tasks often only use a portion of the
files created by the predecessor. To facilitate this with the current
system, an additional filtering task would need to be set up that filters out
unused files. By including a rule-based filtering system, much greater control
can be exercised over the output files. Such rule-based filters have been
successfully implemented in other systems\cite{conery2005rule}.
\item[Interactive Task Feedback Options]\hfill \\
In order to avoid one of the problems that were found in Section~\ref{eval:simple}
more interactivity is required for \emph{Tasks}. This primarily includes
real-time updates on the status of tasks. Further developments include the
ability to do more interactive validation such as discussion integration.
The addition of these collaborative features could allow for a system in which issues
with tasks are resolved in a uniform manner\cite{guimaraes1998integration}.
\item[Transformation-based Task Support]\hfill \\
Currently the system is built around creating derivative data items. However, it
is common for certain files within a site to change without an
additional copy being created. Although this behaviour is implicitly allowed, it should be
extended to be better defined within the system.
\item[Parallel Task Processing] \hfill \\
One of the most crucial aspects affecting the long term feasibility of the
system is its ability to scale and handle larger and more complicated workflows.
In this regard the server node would become a significant bottleneck in
processing \emph{Server Tasks}. In order to alleviate this problem the system
would need to become distributed. This would present its own set of problems,
as data would need to be distributed efficiently between the computation
nodes.

\end{description}
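
As an illustration of the rule-based file filters proposed above, the
following Python sketch selects the output files that a successor task would
receive. It is only one possible design and does not reflect the approach of
the cited work: the function \texttt{filter\_outputs}, its \texttt{include}
and \texttt{exclude} parameters and the file names are hypothetical, and the
rules are assumed to be shell-style wildcard patterns.

```python
from fnmatch import fnmatchcase


def filter_outputs(filenames, include=("*",), exclude=()):
    """Keep the files that match at least one include pattern and
    no exclude pattern. Matching is case-sensitive."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in include)
        and not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]


# Hypothetical contents of a task's output directory.
outputs = ["model.ply", "model.log", "preview.png", "scratch.tmp"]

only_results = filter_outputs(outputs, include=("*.ply", "*.png"))
without_debris = filter_outputs(outputs, exclude=("*.tmp", "*.log"))
print(only_results)
print(without_debris)
```

Attaching such rules to the link between two tasks, rather than to a separate
filtering task, would keep the workflow graph small while still limiting what
each successor receives.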
