1 parent 7a576a1 commit 33b4f0beb3dcc44462e59334d39a2311646fc068 @michielbaird michielbaird committed Oct 27, 2012
@@ -20,7 +20,20 @@ \chapter{Background\label{chap1}}
very distributed and the set of data that is operated on is large and
diverse. Workflow management within Geomatics has been considered and
solutions have been proposed, but not implemented or
- evaluated\cite{Migliorini:2011:WTG:1999320.1999356}.
+ evaluated\cite{Migliorini:2011:WTG:1999320.1999356}.
+ This chapter presents a discussion of Workflow Systems. It first presents
+ an overview of what these systems are and briefly examines their history.
+ This is followed by a review of the factors that have influenced
+ the success and failure of these systems.
+ Section~\ref{geo:data} reviews the data and processing involved within
+ the field of Geomatics.
+ This is then followed in Section~\ref{example:sys} by a review of existing implementations
+ of Workflow Management Systems, namely Kepler, Trident and Taverna. A variety of Case Studies
+ are presented in Section~\ref{casestudy}.
@@ -67,7 +80,7 @@ \section{Overview}
automatically ensure local access to large files that needed to be processed.
-\section{Geographic Data}
+\section{Geographic Data\label{geo:data}}
Geomatics concerns itself with the collection, organisation and query of
geographic data \cite{DiMartino:2007:TAG:1341012.1341081}. This data includes
but is not limited to landscapes, coordinate data, building models,
@@ -89,14 +102,13 @@ \section{Geographic Data}
element, and the edges represent the functions/tasks required to create the
particular abstraction as a set of topological relationships.
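The graph model just described — data abstractions as nodes, derivation tasks as edges — can be sketched as a minimal structure. This is an illustrative Python sketch; the class, method and task names are hypothetical and not drawn from any of the systems reviewed in this chapter.

```python
from collections import defaultdict, deque

class Workflow:
    """A minimal workflow DAG: nodes are data abstractions, edges are
    the tasks that derive one abstraction from another."""

    def __init__(self):
        self.edges = defaultdict(list)    # node -> list of (task, successor)
        self.indegree = defaultdict(int)  # node -> number of unmet inputs

    def add_task(self, source, task, target):
        self.edges[source].append((task, target))
        self.indegree[target] += 1
        self.indegree.setdefault(source, 0)

    def run_order(self):
        """Topological order: each node is emitted only once all of
        the abstractions it depends on have been produced."""
        ready = deque(n for n, d in self.indegree.items() if d == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for task, succ in self.edges[node]:
                self.indegree[succ] -= 1
                if self.indegree[succ] == 0:
                    ready.append(succ)
        return order

wf = Workflow()
wf.add_task("raw_scan", "clean", "point_cloud")
wf.add_task("point_cloud", "mesh", "surface_model")
print(wf.run_order())  # ['raw_scan', 'point_cloud', 'surface_model']
```

The topological ordering is what lets a workflow engine schedule tasks automatically: a task becomes runnable exactly when all of its input abstractions exist.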
There are various products available that can compose scientific workflows.
\emph{The Trident workbench} \cite{Simmhan:2009:BTS:1673063.1673121} is an open
source workflow management system developed by Microsoft Research that also
adds middleware services and a graphical composition interface. Trident builds
workflows of control and data flows from built-in, user-defined activities
and nested subflows.
The flows are represented using XOML, an XML Specification, while the
activities are stored as a set of sub-routines\cite{Simmhan2011790}. Trident
can be used on a local system, remote systems and even clusters. Queries on
@@ -160,7 +172,7 @@ \section{Implementations}
of Web Services that facilitate Geomatics processing already exist.
-\section{Case Studies}
+\section{Case Studies\label{casestudy}}
This section will look at two instances where workflow management systems
were implemented and used. These case studies will look at both a business and
a scientific application.
@@ -195,20 +207,39 @@ \section{Case Studies}
integration with Kepler provided the workflow with increased overall
-The field of Geomatics concerns itself with a vast amount of geographic data.
-This data comes in various sizes and as such different methods of handling and
-would need to be used to facilitate dataflows within the system.
-The work, however, is done in a very distributed manner, which allows for a very
-effective mapping onto a grid-based computing solution, provided middleware can
-be developed to support the systems that are
-Workflow for Geomatics processes, due to its distributed nature, would map well
-onto a automated workflow system
-\cite{Withana:2010:VWE:1851476.1851586}. The nature of the science is supported
-well. It would allow for effective automation of some of the functions are
+ \subsection*{Sunfall}
+ \emph{Sunfall} is a workflow system that was created to assist in locating
+ supernovas from large amounts of telescope data\cite{Aragon:2009:WMH:1529282.1529491}.
+ Sunfall consists of four components: \begin{inparaenum}[(i)]\item Search, \item
+ Workflow Status Monitor, \item Data Forklift and \item Supernova Warehouse.\end{inparaenum}
+ The Search component is responsible for coordinating the tasks involved in
+ finding supernovas within the data. The system
+ is also tasked with dealing with an enormous amount of data, up to 100TB. The
+ data movement is carried out using the \emph{Data Forklift} component.
+ This project used a parallel file system to aid in data replication within the
+ project and used middleware to interface with legacy software.
+ Sunfall was deemed a great success, as it not only improved the
+ efficiency of the process but also identified bottlenecks within it.
+This chapter reviewed the appropriate literature for Workflow Management Systems
+and the data variety and processing within Geomatics. This has provided the necessary
+insight to determine what components would be required in order to build such a Workflow
+System for the Zamani Project.
+The process to create the heritage artifacts from the raw scans and photographs
+generates a large amount of varied data. A Workflow Management System would need
+to specifically cater for this constraint, similarly to the large-scale
+data involved in the implementations of both \emph{OrthoSearch} and \emph{Sunfall}.
+The tasks, however, are a mixture of automated and manual tasks. Such a
+system could map to a more grid-based approach, as was shown to be done
+at \emph{Danske Bank}. Middleware would need to be provided in order for the system
+to uniformly integrate with applications required throughout the process\cite{Montella:2007:UGC:1272980.1272995}.
+The model has been shown to be able to be effectively automated using a Workflow Management System\cite{Withana:2010:VWE:1851476.1851586}.
@@ -657,3 +657,95 @@ @techreport{slot2005workflow
institution={Technical report, Division of Mathematics and Computer Science, Vrije Universiteit, The Netherlands}
+ title={eScience-A Transformed Scientific Method},
+ author={Gray, J. and Szalay, A.},
+ journal={presentation to the Computer Science and Technology Board of the National Research Council, Mountain View, CA},
+ year={2007}
+ title={Novel Data-Mining Methodologies for Adverse Drug Event Discovery and Analysis},
+ author={Harpaz, R. and DuMouchel, W. and Shah, NH and Madigan, D. and Ryan, P. and Friedman, C.},
+ journal={Clinical Pharmacology \& Therapeutics},
+ volume={91},
+ number={6},
+ pages={1010--1021},
+ year={2012},
+ publisher={Nature Publishing Group}
+ title={Integrative systems biology for data-driven knowledge discovery},
+ author={Greene, C.S. and Troyanskaya, O.G.},
+ booktitle={Seminars in nephrology},
+ volume={30},
+ number={5},
+ pages={443--454},
+ year={2010},
+ organization={Elsevier}
+ title={SYNAPPS: Data-Driven Analysis for Supernova Spectroscopy},
+ author={Thomas, RC and Nugent, PE and Meza, JC},
+ journal={Publications of the Astronomical Society of the Pacific},
+ volume={123},
+ number={900},
+ pages={237--248},
+ year={2011},
+ publisher={JSTOR}
+ title={Inventing discovery tools: combining information visualization with data mining},
+ author={Shneiderman, B.},
+ journal={Information Visualization},
+ volume={1},
+ number={1},
+ pages={5--12},
+ year={2002},
+ publisher={SAGE Publications}
+ title={Scientific data management in the coming decade},
+ author={Gray, J. and Liu, D.T. and Nieto-Santisteban, M. and Szalay, A. and DeWitt, D.J. and Heber, G.},
+ journal={ACM SIGMOD Record},
+ volume={34},
+ number={4},
+ pages={34--41},
+ year={2005},
+ publisher={ACM}
+ title={Provenance in scientific workflow systems},
+ author={Davidson, S. and Boulakia, S.C. and Eyal, A. and Lud{\"a}scher, B. and McPhillips, T.M. and Bowers, S. and Anand, M.K. and Freire, J.},
+ journal={IEEE Data Eng. Bull},
+ volume={30},
+ number={4},
+ pages={44--50},
+ year={2007}
+ title={Scientific process automation and workflow management},
+ author={Lud{\"a}scher, B. and Altintas, I. and Bowers, S. and Cummings, J. and Critchlow, T. and Deelman, E. and Roure, D.D. and Freire, J. and Goble, C. and Jones, M. and others},
+ journal={Scientific Data Management: Challenges, Existing Technology, and Deployment, Computational Science Series},
+ pages={476--508},
+ year={2009},
+ publisher={Citeseer}
+ title={The workflow management system Panta Rhei},
+ author={Eder, J. and Groiss, H. and Liebhart, W.},
+ journal={Workflow Management Systems and Interoperability},
+ pages={129--144},
+ year={1998},
+ publisher={Springer}
@@ -45,51 +45,51 @@ \chapter{Conclusion\label{chap4}}
\section{Future Work}
During the implementation of the workflow system various possible extensions
-that could be added to the system however due to constraints on time these could
+that could be added to the system were identified, but these could
not be implemented. These features would improve the system in terms of
performance, usability and set-up time.
\item[Hierarchical Workflows]\hfill \\
To allow better control and reusability of tasks, workflows should be
abstracted to include a hierarchy. Such a hierarchy would allow entire workflows
-to be represented as singular nodes. These workflows, could then be repackaged
+to be represented as singular nodes. These workflows could then be repackaged
and reused in different sites, or even the same site. This would also allow the
-setup for new sites to be much faster as prepackaged workflows could easily be
-used as drop in components.
+setup for new sites to be much faster, as prepackaged workflows could easily be
+used as drop-in components.
\item[Parameterized Scripts]\hfill \\
Oftentimes particular parameters of a script can change from one site to
-another. This change does not necessarily affect the \emph{Task type}, however
+another. This change does not necessarily affect the \emph{Task type}; however,
with the current implementation of the system the change would need to be made
at this point. This can be greatly improved by allowing a \emph{Task} to send
-parameters to the job. This would require the Task Subsystem to allow parameters
-dynamically be sent to the \emph{Task Type}.
+parameters to the job. This would require the Task Subsystem to allow parameters to
+be sent dynamically to the \emph{Task Type}.
\item[Rule Based File Filters]\hfill \\
-Currently within the system all the files in the output directory of a task is
-treated as input to successor tasks. Tasks often times only use a portion of the
+Currently within the system, all the files in the output directory of a task are
+treated as input to successor tasks. Tasks often only use a portion of the
files created by the predecessor. In order to currently facilitate this with the
-system additional a filtering task would need to be set up that filters out
+system an additional filtering task would need to be set up that filters out
unused files. By including a rule based filtering system much greater control
can be placed on the output files. Such rule based filters have been
successfully implemented in other systems\cite{conery2005rule}.
\item[Interactive Task Feedback Options]\hfill \\
-In order to avoid one of the problems that were found in Section~\ref{eval:simple}
+In order to avoid one of the problems that were found in Section~\ref{eval:simple},
more interactivity is required for \emph{Tasks}. This primarily includes
real-time updates on the status of tasks. Further developments include the
ability to do more interactive validation such as discussion integration.
-The addition of these collaborative could allow for a system that allows issues
-with tasks for be resolved in a uniform manner\cite{guimaraes1998integration}.
+The addition of these collaborative features could allow issues
+with tasks to be resolved in a uniform manner\cite{guimaraes1998integration}.
\item[Transformation-based Task Support]\hfill \\
-Currently the system is built around creating derivative data items. However it
+Currently the system is built around creating derivative data items. However, it
is often common for certain files within a site to change, without creating an
-additional copy. Although this behaviour is implicitly allowed it should be
+additional copy. Although this behaviour is implicitly allowed, it should be
extended to be better defined within the system.
\item[Parallel Task Processing] \hfill \\
One of the most crucial aspects affecting the long term feasibility of the
system is its ability to scale and handle larger and more complicated workflows.
In this regard the server node would become a significant bottleneck in
-processing \emph{Server Tasks}. In order to alleviate this problem the system
+processing \emph{Server Tasks}. In order to alleviate this problem, the system
would need to become distributed. This would present its own set of problems
-as data would need to be efficiently distributed between the computation nodes
+as data would need to be efficiently distributed among the computation nodes
to ensure efficiency.
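The rule-based file filter proposed above could be sketched as follows. This is a hypothetical Python sketch: the function names, the glob-style rule syntax and the example file names are all illustrative assumptions, not part of the current system, which simply forwards every output file to successor tasks.

```python
import fnmatch

def make_filter(include=("*",), exclude=()):
    """Build a rule-based file filter: a file passes if it matches at
    least one include pattern and no exclude pattern. Patterns are
    glob-style (e.g. '*.obj'). Hypothetical sketch only."""
    def passes(name):
        if not any(fnmatch.fnmatch(name, p) for p in include):
            return False
        return not any(fnmatch.fnmatch(name, p) for p in exclude)
    return passes

# A successor task that only wants finished models, not temporary
# files or logs produced by the predecessor.
rule = make_filter(include=("*.obj", "*.ply"), exclude=("*_tmp*",))
outputs = ["site_tmp.obj", "site.obj", "run.log", "model.ply"]
print([f for f in outputs if rule(f)])  # ['site.obj', 'model.ply']
```

Attaching such a rule to each edge between tasks would remove the need for the separate filtering tasks described above, since the filter runs as part of handing files to the successor.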