
Commit

lalala
michielbaird committed Oct 27, 2012
1 parent 7a576a1 commit 33b4f0b
Showing 10 changed files with 422 additions and 207 deletions.
75 changes: 53 additions & 22 deletions writeup/thesis_michiel/background.tex
@@ -20,7 +20,20 @@ \chapter{Background\label{chap1}}
very distributed and the set of data that is operated on is large and
diverse. Workflow management within Geomatics has been considered and
solutions have been proposed, but not implemented or
evaluated\cite{Migliorini:2011:WTG:1999320.1999356}.

This chapter presents a discussion of Workflow Systems. It first gives an
overview of what these systems are and briefly looks into their history.
This is followed by a review of the factors that have influenced
the success and failure of these systems.

Section~\ref{geo:data} reviews the data and processing involved within
the field of Geomatics.

This is then followed in Section~\ref{example:sys} by a review of existing implementations
of Workflow Management Systems, namely Kepler, Trident and Taverna. A variety of case studies
are presented in Section~\ref{casestudy}.



\section{Overview}
@@ -67,7 +80,7 @@ \section{Overview}
automatically ensure local access to large files that needed to be processed.


\section{Geographic Data\label{geo:data}}
Geomatics concerns itself with the collection, organisation and query of
geographic data\cite{DiMartino:2007:TAG:1341012.1341081}. This data includes
but is not limited to landscapes, coordinate data, building models,
@@ -89,14 +102,13 @@ \section{Geographic Data}
element, and the edges represent the functions/tasks required to create the
particular abstraction as a set of topological relationships.
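To make this representation concrete, the following is a minimal sketch of such
a graph, in which nodes are data abstractions and each edge carries the task
that derives one abstraction from another. The class and the example names
(\texttt{register\_scans}, \texttt{build\_mesh}) are purely illustrative
assumptions and do not correspond to any system discussed here.

\begin{verbatim}
# Sketch only: a workflow graph over geographic data. Nodes are data
# abstractions; each edge stores the task that derives the target
# abstraction from the source abstraction.
from collections import defaultdict

class GeoWorkflowGraph:
    def __init__(self):
        # source abstraction -> list of (task, derived abstraction)
        self.edges = defaultdict(list)

    def add_derivation(self, source, task, target):
        self.edges[source].append((task, target))

    def tasks_from(self, start):
        """Walk the graph depth-first from `start`, listing the tasks
        needed to derive the successive abstractions."""
        tasks, seen = [], set()
        def visit(node):
            for task, target in self.edges[node]:
                if target not in seen:
                    seen.add(target)
                    tasks.append(task)
                    visit(target)
        visit(start)
        return tasks

g = GeoWorkflowGraph()
g.add_derivation("raw_point_cloud", "register_scans", "registered_cloud")
g.add_derivation("registered_cloud", "build_mesh", "surface_model")
print(g.tasks_from("raw_point_cloud"))  # ['register_scans', 'build_mesh']
\end{verbatim}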

\section{Implementations\label{example:sys}}
There are various products available that can compose scientific workflows.
\emph{The Trident workbench} \cite{Simmhan:2009:BTS:1673063.1673121} is an open
source workflow management system developed by Microsoft Research that also
adds middleware services and a graphical composition interface. Trident builds
workflows of control and data flows from built-in and user-defined activities
and nested subflows.
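The sketch below is not Trident's actual API; it only illustrates, under
assumed names, the general idea of composing a workflow from built-in and
user-defined activities, with a nested subflow treated as just another
activity.

\begin{verbatim}
# Conceptual illustration only (assumed names, not Trident's API):
# a workflow runs a sequence of activities, and a subflow is a workflow
# wrapped up so that it can be used as an activity inside another workflow.
class Activity:
    def __init__(self, name, func):
        self.name, self.func = name, func

    def run(self, data):
        return self.func(data)

class Workflow:
    def __init__(self, name, activities):
        self.name, self.activities = name, activities

    def run(self, data):
        for activity in self.activities:
            data = activity.run(data)
        return data

    def as_activity(self):
        # A nested subflow behaves like any other activity.
        return Activity(self.name, self.run)

cleanup = Workflow("cleanup", [Activity("strip_noise", lambda d: d.strip())])
main = Workflow("main", [cleanup.as_activity(),
                         Activity("upper_case", lambda d: d.upper())])
print(main.run("  survey data  "))  # SURVEY DATA
\end{verbatim}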

The flows are represented using XOML, an XML specification, while the
activities are stored as a set of subroutines\cite{Simmhan2011790}. Trident
can be used on a local system, remote systems and even clusters. Queries on
@@ -160,7 +172,7 @@ \section{Implementations}
of Web Services that facilitate Geomatics processing already exist.


\section{Case Studies\label{casestudy}}
This section looks at instances where workflow management systems
were implemented and used, covering both business and
scientific applications.
@@ -195,20 +207,39 @@ \section{Case Studies}
integration with Kepler provided the workflow with increased overall
productivity.


\subsection*{Sunfall}
\emph{Sunfall} is a workflow system that was created to assist in locating
supernovas in large amounts of telescope data\cite{Aragon:2009:WMH:1529282.1529491}.

Sunfall consists of four components: \begin{inparaenum}[(i)]\item Search, \item
Workflow Status Monitor, \item Data Forklift and \item Supernova Warehouse.\end{inparaenum}

The Search component is responsible for coordinating the tasks involved in
finding supernovas within the data. The system is also tasked with handling an
enormous amount of data, up to 100TB. The data movement is carried out using
the \emph{Data Forklift} component.

The project used a parallel file system to aid in data replication and
used middleware to interface with legacy software.

Sunfall was deemed a great success, as it both improved the efficiency of the
process and identified bottlenecks within it.


\section{Summary}
This chapter reviewed the literature relevant to Workflow Management Systems
and the variety of data and processing within Geomatics. This has provided the necessary
insight to determine what components would be required in order to build a Workflow
Management System for the Zamani Project.

The process of creating the heritage artifacts from the raw scans and photographs
generates a large amount of varied data. A Workflow Management System would need
to cater specifically for this constraint, similarly to the large-scale
data involved in the implementations of both \emph{OrthoSearch} and \emph{Sunfall}.


The tasks are, however, a mixture of automated and manual tasks. Such a
system would map well to a more grid-based approach, as was demonstrated
at \emph{Danske Bank}. Middleware would need to be provided in order for the system
to integrate uniformly with the applications required throughout the process\cite{Montella:2007:UGC:1272980.1272995}.
The model has been shown to be effectively automatable using a Workflow Management System\cite{Withana:2010:VWE:1851476.1851586}.
92 changes: 92 additions & 0 deletions writeup/thesis_michiel/bibliography.bib
@@ -657,3 +657,95 @@ @techreport{slot2005workflow
institution={Division of Mathematics and Computer Science, Vrije Universiteit, The Netherlands}
}

@article{gray2007escience,
title={eScience-A Transformed Scientific Method},
author={Gray, J. and Szalay, A.},
journal={presentation to the Computer Science and Technology Board of the National Research Council, Mountain View, CA},
year={2007}
}

@article{harpaz2012novel,
title={Novel Data-Mining Methodologies for Adverse Drug Event Discovery and Analysis},
author={Harpaz, R. and DuMouchel, W. and Shah, N.H. and Madigan, D. and Ryan, P. and Friedman, C.},
journal={Clinical Pharmacology \& Therapeutics},
volume={91},
number={6},
pages={1010--1021},
year={2012},
publisher={Nature Publishing Group}
}

@inproceedings{greene2010integrative,
title={Integrative systems biology for data-driven knowledge discovery},
author={Greene, C.S. and Troyanskaya, O.G.},
booktitle={Seminars in nephrology},
volume={30},
number={5},
pages={443--454},
year={2010},
organization={Elsevier}
}

@article{thomas2011synapps,
title={SYNAPPS: Data-Driven Analysis for Supernova Spectroscopy},
author={Thomas, R.C. and Nugent, P.E. and Meza, J.C.},
journal={Publications of the Astronomical Society of the Pacific},
volume={123},
number={900},
pages={237--248},
year={2011},
publisher={JSTOR}
}

@article{shneiderman2002inventing,
title={Inventing discovery tools: combining information visualization with data mining},
author={Shneiderman, B.},
journal={Information Visualization},
volume={1},
number={1},
pages={5--12},
year={2002},
publisher={SAGE Publications}
}

@article{gray2005scientific,
title={Scientific data management in the coming decade},
author={Gray, J. and Liu, D.T. and Nieto-Santisteban, M. and Szalay, A. and DeWitt, D.J. and Heber, G.},
journal={ACM SIGMOD Record},
volume={34},
number={4},
pages={34--41},
year={2005},
publisher={ACM}
}

@article{davidson2007provenance,
title={Provenance in scientific workflow systems},
author={Davidson, S. and Boulakia, S.C. and Eyal, A. and Lud{\"a}scher, B. and McPhillips, T.M. and Bowers, S. and Anand, M.K. and Freire, J.},
journal={IEEE Data Eng. Bull},
volume={30},
number={4},
pages={44--50},
year={2007}
}

@article{ludascher2009scientific,
title={Scientific process automation and workflow management},
author={Lud{\"a}scher, B. and Altintas, I. and Bowers, S. and Cummings, J. and Critchlow, T. and Deelman, E. and Roure, D.D. and Freire, J. and Goble, C. and Jones, M. and others},
journal={Scientific Data Management: Challenges, Existing Technology, and Deployment, Computational Science Series},
pages={476--508},
year={2009},
publisher={Citeseer}
}

@article{eder1998workflow,
title={The workflow management system Panta Rhei},
author={Eder, J. and Groiss, H. and Liebhart, W.},
journal={Workflow Management Systems and Interoperability},
pages={129--144},
year={1998},
publisher={Springer}
}



34 changes: 17 additions & 17 deletions writeup/thesis_michiel/conclusion.tex
@@ -45,51 +45,51 @@ \chapter{Conclusion\label{chap4}}

\section{Future Work}
During the implementation of the workflow system, various possible extensions
were identified; however, due to time constraints, these could
not be implemented. These features would improve the system in terms of
performance, usability and setup time.
\begin{description}
\item[Hierarchical Workflows]\hfill \\
To allow better control over and reusability of tasks, workflows should be
abstracted to include a hierarchy. Such a hierarchy would allow entire workflows
to be represented as single nodes. These workflows could then be repackaged
and reused in different sites, or even the same site. This would also allow the
setup for new sites to be much faster, as prepackaged workflows could easily be
used as drop-in components.
\item[Parameterized Scripts]\hfill \\
Oftentimes particular parameters of a script can change from one site to
another. This change does not necessarily affect the \emph{Task Type}; however,
with the current implementation of the system the change would need to be made
at this point. This could be greatly improved by allowing a \emph{Task} to send
parameters to the job, which would require the Task Subsystem to allow parameters
to be sent dynamically to the \emph{Task Type}.
\item[Rule Based File Filters]\hfill \\
Currently within the system, all the files in the output directory of a task are
treated as input to successor tasks. Tasks often only use a portion of the
files created by their predecessor. To facilitate this with the current
system, an additional filtering task would need to be set up to filter out
unused files. By including a rule-based filtering system, much greater control
could be placed on the output files; a minimal sketch of such a filter is given
after this list. Such rule-based filters have been
successfully implemented in other systems\cite{conery2005rule}.
\item[Interactive Task Feedback Options]\hfill \\
In order to avoid one of the problems that were found in Section~\ref{eval:simple},
more interactivity is required for \emph{Tasks}. This primarily includes
real-time updates on the status of tasks. Further developments include the
ability to do more interactive validation, such as discussion integration.
The addition of these collaborative features could allow issues
with tasks to be resolved in a uniform manner\cite{guimaraes1998integration}.
\item[Transformation-based Task Support]\hfill \\
Currently the system is built around creating derivative data items. However, it
is common for certain files within a site to change without creating an
additional copy. Although this behaviour is implicitly allowed, it should be
extended to be better defined within the system.
\item[Parallel Task Processing] \hfill \\
One of the most crucial aspects affecting the long-term feasibility of the
system is its ability to scale and handle larger and more complicated workflows.
In this regard the server node would become a significant bottleneck in
processing \emph{Server Tasks}. In order to alleviate this problem, the system
would need to become distributed. This would present its own set of problems,
as data would need to be distributed efficiently between the computation nodes.

\end{description}
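As a concrete illustration of the rule-based file filter described above, the
following is a minimal sketch; the rule format, the glob-style patterns and the
example file names are assumptions made purely for illustration and are not
part of the implemented system.

\begin{verbatim}
# Minimal sketch of a rule-based output filter: each rule is an
# include/exclude glob pattern applied, in order, to the files produced
# by a predecessor task; the last matching rule wins.
from fnmatch import fnmatch

def filter_outputs(filenames, rules):
    """Return the files that successor tasks should receive as input.

    rules: list of ("include" | "exclude", glob_pattern) applied in order;
    files that match no rule are kept.
    """
    selected = []
    for name in filenames:
        keep = True
        for action, pattern in rules:
            if fnmatch(name, pattern):
                keep = (action == "include")
        if keep:
            selected.append(name)
    return selected

# Example: pass cleaned point clouds on, but drop intermediate logs and reports.
outputs = ["site_scan_01.ply", "site_scan_01.log", "registration_report.txt"]
rules = [("exclude", "*.log"), ("exclude", "*_report.txt"), ("include", "*.ply")]
print(filter_outputs(outputs, rules))  # ['site_scan_01.ply']
\end{verbatim}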
