I'm tired

DEADBEEF · Oct 26, 2012 · 7a576a1 · 7a576a1
1 parent 1f7c560
commit 7a576a1
Showing 6 changed files with 1,848 additions and 1,835 deletions.
diff --git a/writeup/thesis_michiel/background.tex b/writeup/thesis_michiel/background.tex
@@ -0,0 +1,214 @@
+\chapter{Background\label{chap1}}
+    Workflow management systems define a complex process in terms of well-defined
+    tasks and coordinate process completion \cite{1245778}.  Automated
+    workflow management has been in wide use across various disciplines since
+    the concept was formalised in 1996\cite{springerlink:10.1007/BF00136712}.
+    Successful systems have been implemented across various fields, including
+    banking and pharmaceuticals
+    \cite{Brahe:2007:SWW:1316624.1316661,5407993}.
+
+    Workflow management has been shown to be very successful in the sciences,
+    as the same scientific process can easily be repeated on a different set of
+    data\cite{4721191}.
+    This not only aids in reproducibility but also saves time.  This is done by
+    efficiently abstracting the operations in the flow, allowing it to be
+    automatically handled.
+
+    Geomatics is the field that concerns itself with the organisation,
+    representation and processing of geographic data, for the purpose of
+    querying it and making decisions based on the data
+    \cite{DiMartino:2007:TAG:1341012.1341081}. The workflow in Geomatics is
+    very distributed and the set of data that is operated on is large and
+    diverse.  Workflow management within Geomatics has been considered and
+    solutions have been proposed, but not implemented or
+    evaluated\cite{Migliorini:2011:WTG:1999320.1999356}.
+
+
+\section{Overview}
+A workflow management system consists of definitions of how a set of tasks
+should be executed \cite{springerlink:10.1007/BF00136712,vanderAalst2002125}.
+The overall procedure is defined by the following components:
+\begin{inparaenum}[(i)] \item actors, \item roles, \item responsibilities and
+obligations, \item tasks, \item activities, \item conceptual structures and
+\item resources.\end{inparaenum}
+
+A real-life problem or task can then be broken up into these components in
+such a way that the tasks represent a flow network. These tasks then connect to
+the actors and resources via the other
+components\cite[p.~4]{Taylor:2006:WES:1196459}.  This allows tasks to be
+executed efficiently in a distributed manner.
+
+The initial implementations of workflow systems, however, almost
+immediately failed. The systems were too rigid and were unable to accommodate the
+high levels of change that were required by the users
+\cite{Suchman:1983:OPP:357442.357445}.
+
+These changes come from a number of sources, including: ill-specification
+of initial problems, change in actors or resources, exceptions that occurred
+and new requirements.  Adaptive workflow systems were proposed to solve this
+problem by providing a mechanism for allowing change in the
+system\cite{vanderAalst2002125}. This allows processes to be extended, replaced
+or re-ordered. It also adds the ability to change already running tasks by
+providing restart, transfer and proceed options.
+
+Scientific workflow management has also been very successful in how
+experiments are defined and, more importantly, reused. Another benefit that was
+quickly discovered was that it allowed researchers to trade workflows,
+making the replication of results much easier than it was
+previously\cite{4721191}. Keys to this success were: fitting the workflow
+systems to the researchers; responding quickly when required features needed
+to be added; listening to user input; and making the sharing of workflows as
+easy as possible.
+
+Such systems have also been applied in fields that operate on large data
+sets, as would be the case for problems in Geomatics. Workflow systems were
+found to work well in managing the processing of this data. When the
+concept was applied to Observational Astrophysics, it could be used to
+identify bottlenecks that could be
+optimised \cite{Aragon:2009:WMH:1529282.1529491}.  Further, it was used to
+automatically ensure local access to large files that needed to be processed.
+
+
+\section{Geographic Data}
+Geomatics concerns itself with the collection, organisation and querying of
+geographic data \cite{DiMartino:2007:TAG:1341012.1341081}.  This data includes,
+but is not limited to, landscapes, coordinate data, building models,
+statistics, pictures, textures and routes. This is a very broad set of data,
+varying from very large to very small.  That variation, however, means that
+there exists no uniform method to efficiently deal with the data.
+
+The processing of this data can range from manual to software-based processing
+\cite{DiMartino:2007:TAG:1341012.1341081}.  Various Web applications have been
+written to facilitate the tasks that need to be accomplished.  This software is
+known as WebGIS and is becoming more popular with scientists; it also means
+that even within the field there is a strong shift toward Web-based services.
+
+A key realisation about the usage of this data is that the same data is used
+across various applications to create various levels of
+abstraction\cite{ElAdnani:2001:MLF:512161.512177}.  The core data is seldom
+changed. Instead a new abstraction layer is added on top of it. The data can be
+thought of as a graph, where the nodes represent either a data or abstraction
+element, and the edges represent the functions/tasks required to create the
+particular abstraction as a set of topological relationships. 
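+
+As an illustrative sketch (the notation below is introduced here for
+clarity and is not taken from the cited work), this structure can be written
+as a directed graph
+\[
+    G = (V, E), \qquad V = D \cup A, \qquad E \subseteq V \times A,
+\]
+where $D$ is assumed to be the set of core data elements, $A$ the set of
+abstraction elements, and an edge $(u, a) \in E$ stands for the function or
+task that derives the abstraction $a$ from the element $u$. Under this
+reading no edge points into $D$, which reflects the observation that the core
+data is seldom changed.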
+
+\section{Implementations}
+There are various products available that can compose scientific workflows.
+\emph{The Trident workbench} \cite{Simmhan:2009:BTS:1673063.1673121} is an
+open-source workflow management system developed by Microsoft Research that also
+adds middleware services and a graphical composition interface. Trident builds
+workflows of control and data flows from built-in activities, user-defined
+activities and nested subflows.
+
+The flows are represented using XOML, an XML-based specification, while the
+activities are stored as a set of sub-routines\cite{Simmhan2011790}. Trident
+can be used on a local system, remote systems and even clusters.  Queries on
+the system can be performed using LINQ.
+
+
+\emph{Kepler} is another scientific workflow management system that
+provides workflow design and execution.  Actors are designed to perform
+independent tasks that can either be atomic or composite
+\cite{Wang:2009:KHG:1645164.1645176}.  Composite actors (subflows) consist of
+multiple atomic actors bundled together. Actors can consume data and produce
+output, called tokens. Actors communicate tokens with each other via links. The
+order of execution and the links are defined by an independent entity called
+the director. As a consequence, the workflow can be executed in either a
+sequential or a parallel manner. Kepler effectively separates the workflow from
+its execution, allowing for easy batch execution. Actors can easily be exported
+and shared.  Kepler is very popular due to its adaptability and easy
+integration.
+
+\emph{Taverna} is a scientific workbench that supports application-level
+workflows and does not focus on scheduling as much as others\cite{4721191}. Taverna
+has a strong focus on workflow sharing and is quite popular, since there
+exists a social network designed to facilitate workflow sharing among
+scientists (\emph{myExperiment}). Services are linked to the model to execute
+the various tasks. New services can easily be added, so Taverna can utilise all
+the services a client has available to facilitate the flow. The
+Taverna language is a simple data-flow language called the Simple Conceptual
+Unified Flow Language (SCUFL), which can be encoded in XML.
+
+In order for these workbenches to be successful, there needs to exist a
+high level of interoperability between the workflow management and the services
+that are required \cite{Shegalov:2001:XWM:767132.767139}.  However, there is a
+relatively high chance of failure when building this interoperability into the
+services as a core component. It is therefore considered extremely high risk
+and is not typically done. A cheaper way of doing this is
+providing middleware that can wrap around the service to provide the required
+interfaces.
+
+This need for interoperability has led to the popularisation of Service-Oriented
+Architecture (SOA) \cite{Sanders:2008:SSA:1400549.1400595}.  It should be
+noted that SOA is \emph{not} an implementation, but rather an
+\emph{Architectural Model}; SOA refers to a collection of loosely coupled
+services that individually carry out a particular process. Each service should
+have a well-defined interface with self-contained functionality. It should
+allow other applications or services to use this functionality without knowing
+the underlying technical details. These services should be hidden from the
+end-user and their usage should preferably be platform-independent.
+
+Although the concept has been around since the 1970s, it has only recently
+gained favour due to Web services.  Web services are software components that run on the
+Internet through XML standards-based
+interfaces\cite{Tai:2004:CCW:1045658.1045680}.  Each service provides a
+functional description using the \emph{Web Services Description Language} (WSDL).
+This description provides the supported operations, as well as the definition
+of the input and output messages.
+
+By using these concepts, a workflow system can be built that automatically uses
+these Web services to facilitate both the data and control flow using well-defined
+interfaces based on standards such as XML and
+JSON \cite{Shegalov:2001:XWM:767132.767139}.
+With the advancement of WebGIS, many
+Web services that facilitate Geomatics processing already exist.
+
+
+\section{Case Studies}
+This section looks at two instances where workflow management systems
+were implemented and used.  The case studies cover both a business and
+a scientific application.
+    \subsection*{Danske Bank}
+      The workflow management system at \emph{Danske Bank} was implemented
+      incrementally as the bank moved away from a manual
+      system\cite{Brahe:2007:SWW:1316624.1316661}.
+
+      This system was developed as an in-house solution when the manual system
+      could not cope any longer.  Several lessons were learnt that are applicable
+      to other workflow systems. When work was divided purely from an
+      efficiency point of view, the workers became complacent as they felt that
+      they did not understand the overall mechanism and felt that they were not
+      involved. It was also discovered that the system did not handle change very
+      well. This change was expensive and inevitable, so the system had to be
+      adapted to handle it. The success of the system is mainly
+      attributed to the interoperability and close relationship between the
+      users and the developers.
+
+    \subsection*{OrthoSearch}
+      \emph{OrthoSearch} is a workflow, built on \emph{Kepler}, that is
+      designed to work on data in the field of
+      Bioinformatics\cite{daCruz:2008:OSW:1363686.1363983}.
+
+      A workflow system was implemented in \emph{Kepler} as it addressed the
+      requirements they had, including: \begin{inparaenum}[(i)] \item workflow
+      definition and design; \item workflow execution control; \item fault
+      tolerance; \item intermediate data management; and \item data provenance
+      support.  \end{inparaenum}
+
+      Although the system was not without its hiccups and changes, the
+      integration with Kepler provided the workflow with increased overall
+      productivity.
+
+
+\section{Implication}
+The field of Geomatics concerns itself with a vast amount of geographic data.
+This data comes in various sizes and, as such, different methods of handling it
+would need to be used to facilitate data flows within the system.
+
+The work, however, is done in a very distributed manner, which allows for a very
+effective mapping onto a grid-based computing solution, provided middleware can
+be developed to support the systems that are
+used\cite{Montella:2007:UGC:1272980.1272995}.
+
+Workflow for Geomatics processes, due to its distributed nature, would map well
+onto an automated workflow
+system\cite{Withana:2010:VWE:1851476.1851586}. The nature of the science is
+supported well. It would allow for effective automation of some of the
+functions that are available.
diff --git a/writeup/thesis_michiel/conclusion.tex b/writeup/thesis_michiel/conclusion.tex
@@ -0,0 +1,96 @@
+\chapter{Conclusion\label{chap4}}
+This research project was concluded with the successful implementation of a
+Workflow Management System. This system can successfully manage a complex set
+of tasks with arbitrary dependencies. Tasks can either be fully automated
+by the system or be completed by users. When user tasks are started, the
+required files are incrementally transferred to the desktop host of the user
+using \emph{rsync}, so that only files that have not been transferred or are
+out of date are copied. To ensure quality and accuracy, a feature was added that
+requires user tasks to be validated by experienced members of the team before the task
+can be labelled as complete.
+
+Automated tasks are executed on the
+server because these tasks operate on very large files. By executing
+them on the server, the data does not need to be transferred, which would be an expensive
+process. Tasks are automatically started when all dependencies are met.
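+
+This start condition can be sketched as follows, under the assumption (made
+here purely for illustration, with notation that is not part of the
+implementation) that the workflow forms a directed acyclic graph in which
+$\mathrm{pred}(t)$ denotes the set of tasks that a task $t$ depends on, $S$
+the set of tasks that have already been started and $C \subseteq S$ the set of
+completed tasks:
+\[
+    \mathrm{ready}(t) \iff t \notin S \;\wedge\; \mathrm{pred}(t) \subseteq C .
+\]
+A task without dependencies has $\mathrm{pred}(t) = \emptyset$ and therefore
+satisfies the condition as soon as the workflow begins.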
+
+In the event that a task fails, the system also allows the user to inspect the
+logging information that is generated during the execution of the task. Once the
+problem is identified, the task can then be manually restarted.  The design and
+implementation were done in three iterations. This was explained in depth in
+Chapter~\ref{chap2}.
+
+
+The system was then successfully evaluated in Chapter~\ref{chap3} both for its
+usability and its effectiveness at solving the problem. The following positive
+results were obtained during the evaluation of the system:
+\begin{enumerate}
+    \item The system was successfully able to implement and execute a portion of
+        the workflow from the modelling tasks that are
+        present in the Zamani Project. This sample workflow used a mix of system
+        and user tasks.
+    \item The system was positively evaluated using a sample group of 24 users.
+        This evaluation revealed that users found the system useful and easy to
+        use, and that users were satisfied using the system. From user responses
+        and the observations made during the test, it was found that the system is
+        effective and is very easy to learn.
+\end{enumerate}
+
+This system was, however, not implemented within the Zamani Project. This was
+mainly due to time constraints, caused by the scale and time required to
+implement it. Functionally the system could be implemented; however, this process
+could be significantly simplified by the addition of some features. These are
+mentioned in the future work section.
+
+
+\section{Future Work}
+During the implementation of the workflow system, various possible extensions
+to the system were identified; however, due to constraints on time, these could
+not be implemented. These features would improve the system in terms of
+performance, usability and setup time.
+\begin{description}
+\item[Hierarchical Workflows]\hfill \\
+To allow better control and reusability over tasks, workflows should be
+abstracted to include a hierarchy. Such a hierarchy would allow entire workflows
+to be represented as singular nodes. These workflows could then be repackaged
+and reused in different sites, or even the same site. This would also allow the
+setup for new sites to be much faster, as prepackaged workflows could easily be
+used as drop-in components.
+\item[Parameterized Scripts]\hfill \\
+Oftentimes particular parameters of a script can change from one site to
+another. This change does not necessarily affect the \emph{Task type}; however,
+with the current implementation of the system the change would need to be made
+at this point. This can be greatly improved by allowing a \emph{Task} to send
+parameters to the job. This would require the Task Subsystem to allow parameters
+to be sent dynamically to the \emph{Task Type}.
+\item[Rule Based File Filters]\hfill \\
+Currently within the system all the files in the output directory of a task are
+treated as input to successor tasks. Tasks often use only a portion of the
+files created by the predecessor. To facilitate this with the current
+system, an additional filtering task would need to be set up that filters out
+unused files. By including a rule-based filtering system, much greater control
+can be placed on the output files. Such rule-based filters have been
+successfully implemented in other systems\cite{conery2005rule}.
+\item[Interactive Task Feedback Options]\hfill \\
+In order to avoid one of the problems that were found in Section~\ref{eval:simple},
+more interactivity is required for \emph{Tasks}. This primarily includes
+real-time updates on the status of tasks. Further developments include the
+ability to do more interactive validation, such as discussion integration.
+The addition of these collaborative features could allow for a system in which
+issues with tasks are resolved in a uniform manner\cite{guimaraes1998integration}.
+\item[Transformation-based Task Support]\hfill \\
+Currently the system is built around creating derivative data items. However, it
+is common for certain files within a site to change without creating an
+additional copy. Although this behaviour is implicitly allowed, it should be
+extended to be better defined within the system.
+\item[Parallel Task Processing] \hfill \\
+One of the most crucial aspects affecting the long-term feasibility of the
+system is its ability to scale and handle larger and more complicated workflows.
+In this regard the server node would become a significant bottleneck in
+processing \emph{Server Tasks}. In order to alleviate this problem, the system
+would need to become distributed. This would present its own set of problems,
+as data would need to be distributed efficiently between the computation
+nodes.
+
+\end{description}
+