Stencil docs #1214

Merged
merged 22 commits into from Aug 9, 2014
3 changes: 2 additions & 1 deletion docs/hpx.qbk
@@ -161,6 +161,7 @@
[def __fibonacci_example__ [link hpx.tutorial.examples.fibonacci Fibonacci Example]]
[def __hello_world_example__ [link hpx.tutorial.examples.hello_world Hello World Example]]
[def __accumulator_example__ [link hpx.tutorial.examples.accumulator Accumulator Example]]
[def __futurization_example__ [link hpx.tutorial.futurization_example Futurization Example]]

[def __people__ [link hpx.people People]]
[def __getting_started__ [link hpx.tutorial.getting_started Getting Started]]
@@ -279,7 +280,7 @@ __boost_auto_index__ can be found in the collection of __boost_tools__.
[include tutorial/gettingstarted.qbk]
[include tutorial/introduction.qbk]
[include tutorial/examples.qbk]

[include tutorial/futurization_example.qbk]
[endsect] [/ Tutorial]

[/////////////////////////////////////////////////////////////////////////////]
258 changes: 258 additions & 0 deletions docs/tutorial/futurization_example.qbk
@@ -0,0 +1,258 @@
[/=============================================================================
Copyright (c) 2014 Adrian Serio

Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
=============================================================================/]

[section:futurization_example Futurization Example]

When developers write code they typically begin with a simple serial program
and build upon it until all of the required functionality is present.
The following set of examples was developed to demonstrate this iterative
process of evolving a simple serial program into an efficient, fully
distributed HPX application. For this demonstration, we implemented a
1D heat distribution problem. This calculation simulates the diffusion of
heat across a ring from an initialized state to some user-defined point in
the future. It does this by breaking the ring into discrete segments and
using the current segment's temperature, together with the temperatures of
the surrounding segments, to calculate the temperature of that segment in
the next timestep, as shown in the [link futurization_example.1d_pgf figure]
below.

[fig 1d_stencil_program_flow.png..Heat Diffusion Example Program Flow..futurization_example.1d_pgf]

We parallelize this code over the following eight examples:

* [hpx_link examples/1d_stencil/1d_stencil_1.cpp..Example 1]
* [hpx_link examples/1d_stencil/1d_stencil_2.cpp..Example 2]
* [hpx_link examples/1d_stencil/1d_stencil_3.cpp..Example 3]
* [hpx_link examples/1d_stencil/1d_stencil_4.cpp..Example 4]
* [hpx_link examples/1d_stencil/1d_stencil_5.cpp..Example 5]
* [hpx_link examples/1d_stencil/1d_stencil_6.cpp..Example 6]
* [hpx_link examples/1d_stencil/1d_stencil_7.cpp..Example 7]
* [hpx_link examples/1d_stencil/1d_stencil_8.cpp..Example 8]

The first example is straight serial code. In this code we instantiate
a vector [^U] that contains two vectors of doubles, as seen in
the structure [^stepper].

[import ../../examples/1d_stencil/1d_stencil_1.cpp]
[stepper_1]

Each element in the vector of doubles represents a single grid
point. To calculate the change in heat distribution, the temperature of
each grid point, along with that of its neighbors, is passed to the function
[^heat]. In order to improve readability, references named [^current]
and [^next] are created which, depending on the time step, point to the first
and second vector of doubles. The first vector of doubles is initialized
with a simple heat ramp. After calling the heat function with the data in the
[^current] vector, the results are placed into the [^next] vector.
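
To make the serial structure concrete before we parallelize it, the sweep can
be sketched as follows (a simplified, self-contained illustration; [^do_step],
its parameter list, and the exact update formula are stand-ins for the
imported example code above):

    #include <cstddef>
    #include <vector>

    // Simplified sketch of the serial sweep (not the exact example code).
    // k, dt and dx are the diffusion constant, timestep length and grid spacing.
    double heat(double left, double middle, double right,
        double k, double dt, double dx)
    {
        return middle + (k * dt / (dx * dx)) * (left - 2.0 * middle + right);
    }

    // One timestep over a ring of nx grid points: read from 'current',
    // write into 'next' (the two vectors held by U in struct stepper).
    void do_step(std::vector<double> const& current, std::vector<double>& next,
        double k, double dt, double dx)
    {
        std::size_t const nx = current.size();
        for (std::size_t i = 0; i != nx; ++i)
        {
            std::size_t left  = (i == 0) ? nx - 1 : i - 1;   // ring wrap-around
            std::size_t right = (i == nx - 1) ? 0 : i + 1;
            next[i] = heat(current[left], current[i], current[right], k, dt, dx);
        }
    }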

In example 2 we employ a technique called futurization. Futurization is
a method by which we can easily transform code that is executed serially
into code that creates asynchronous threads. In the simplest
case this involves replacing a variable with a future to a variable,
a function with a future to a function, and adding a [^.get()] at the
point where a value is actually needed. The code below shows how this
technique was applied to the [^struct stepper].

[import ../../examples/1d_stencil/1d_stencil_2.cpp]
[stepper_2]
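
As a minimal illustration of the transformation itself (the names below are
hypothetical and [^heat] is assumed to be an ordinary function, not the
example code), futurizing a single assignment might look like this:

    // Serial: the value is computed immediately.
    double next_value = heat(left, middle, right);

    // Futurized: the computation is launched asynchronously, the variable
    // becomes a future, and .get() is added where the value is truly needed.
    hpx::future<double> next_future = hpx::async(heat, left, middle, right);
    // ... other work can proceed here ...
    double value = next_future.get();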

In example 2, we re-define
our partition type as a [^shared_future] and, in [^main], create
the object [^result], which is a future to a vector of partitions. We
use [^result] to represent the last vector in the series of
vectors created for each timestep.
In order to move to the next timestep, the values
of a partition and its neighbors must be passed to [^heat] once the futures that
contain them are ready. In HPX,
we have an LCO (Local Control Object) named Dataflow which assists the
programmer in expressing this dependency. Dataflow
allows us to pass the results of a set of futures to a specified function
when the futures are ready.
Dataflow takes three types of arguments:
a launch policy that instructs the dataflow on how to perform
the function call (async or sync), the function to
call (in this case [^Op]), and futures to the arguments
that will be passed to the function.
When called, dataflow immediately returns a future to the result of the
specified function. This allows users to string dataflows together
and construct an execution tree.
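
Schematically, a single dataflow call has the following shape (a sketch with
placeholder names; [^f_left], [^f_middle] and [^f_right] stand for futures to
the three input values):

    // Returns immediately with a future to Op's result; Op itself only runs
    // once f_left, f_middle and f_right have all become ready.
    hpx::future<double> r =
        hpx::dataflow(hpx::launch::async, Op, f_left, f_middle, f_right);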

After the values of the futures passed to dataflow are ready, the values must
be pulled out of the future container to be passed to the function [^heat].
In order to do this, we
use the HPX facility [^unwrapped], which underneath calls [^.get()] on
each of the futures so that the function [^heat] will be passed doubles
and not futures to doubles.
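
Put together, the per-point update of example 2 can be sketched like this
(illustrative names, not the exact example code):

    using hpx::util::unwrapped;

    // current[left], current[i] and current[right] are shared_future<double>;
    // unwrapped(heat) calls .get() on each of them and forwards plain doubles.
    next[i] = hpx::dataflow(hpx::launch::async, unwrapped(heat),
        current[left], current[i], current[right]);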

By setting up the algorithm this way, the program is able to execute
as quickly as the dependencies of each future are met.
Unfortunately, this example runs
terribly slowly. The increase in execution time is caused
by the overhead needed to create a future for each
data point. Because the work done within each call to [^heat] is very
small, the overhead of creating and scheduling each of the three
futures is greater than that of the actual useful work! In order
to amortize the overheads of our synchronization techniques,
we need to be able to control the amount of work that will be
done with each future. We call this amount of work per future the
grain size.

In example 3, we return to our serial code to figure out how to
control the grain size of our program. The strategy that we
employ is to create "partitions" of data points. The user
can define how many partitions are created and how many
data points are contained in each partition. This is accomplished
by creating the [^struct partition], which contains a member
object [^data_], a vector of doubles that holds
the data points assigned to a particular instance of [^partition].
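
A minimal sketch of such a partition type (the constructor shown here is
illustrative; the real example also provides initialization for the heat
ramp):

    struct partition
    {
        explicit partition(std::size_t size = 0, double initial_value = 0.0)
          : data_(size, initial_value)
        {}

        std::vector<double> data_;   // the data points owned by this partition
    };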

In example 4, we take advantage of the partition setup by redefining
[^space] to be a vector of shared_futures with each future
representing a partition. In this manner, each future represents
several data points. Because the user can define how many
data points are contained in each partition (and therefore how many
data points are represented by one future), the user can now
control the grain size of the simulation.
The rest of the code is then futurized in the same manner
as in example 2. It should be noted
how strikingly similar example 4 is to example 2.
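
The change in the data layout can be sketched as follows ([^partition_future]
is an illustrative name; in example 2 each future held a single [^double],
here it holds a whole partition):

    // One future now guards an entire block of data points.
    typedef hpx::shared_future<partition> partition_future;
    typedef std::vector<partition_future> space;   // one entry per partition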

Example 4 finally shows good results. This code
scales fifty percent better than the OpenMP version.
While these results are impressive, our work on this
algorithm is not over. This example only runs on one locality.
To get the full benefit of HPX we need to be able to distribute
the work to other machines in a cluster. We begin this process in
example 5.

In order to run on a distributed system, a large amount of boilerplate code
must be added. Fortunately, HPX provides us with the concept of
a "component" which saves us from having to write quite as much code.
A component is an object which can be
remotely accessed using its global address. Components
are made of two parts: a server and a client class.
While the client class is not required, abstracting
the server behind a client allows us to ensure type safety
instead of having to pass around pointers
to global objects. Example 5 renames example 4's
[^struct partition] to [^partition_data] and
adds serialization support. Next we add the server-side
representation of the data in the structure [^partition_server].
[^partition_server] inherits from [^hpx::components::simple_component_base],
which contains the server-side component boilerplate.
The boilerplate code allows a component's public members
to be accessible anywhere in the
system via its Global Identifier (GID). To encapsulate
the component, we create a client-side helper class. This
object allows us to create new instances of our component
and access its members without having to know its GID. In addition,
we use the client class to help manage
our asynchrony. For example, the member function [^get_data()] of our client
class [^partition] returns a future to the [^partition_data] held by the
corresponding [^partition_server]. The client class inherits its boilerplate
code from [^hpx::components::client_base].
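
A heavily abbreviated sketch of the two classes is shown below (the action
definitions, registration macros and serialization code that the real example
needs are omitted; member names are illustrative):

    // Server side: owns the data and is addressable via its GID.
    struct partition_server
      : hpx::components::simple_component_base<partition_server>
    {
        partition_data get_data() const { return data_; }
        // An HPX_DEFINE_COMPONENT_ACTION declaration (omitted here) turns
        // get_data into an action that can be invoked from any locality.

        partition_data data_;
    };

    // Client side: a thin, type-safe handle that hides the server's GID.
    struct partition
      : hpx::components::client_base<partition, partition_server>
    {
        // Invokes the server's get_data action asynchronously, so the caller
        // receives a future instead of blocking on a possibly remote call.
        hpx::future<partition_data> get_data() const;
    };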

In the structure [^stepper], we have also had to make some changes
to accommodate a distributed environment. Because a
neighboring partition could be remote, we must explicitly retrieve the
data from the neighboring partitions. These retrievals are asynchronous,
and the function [^heat_part_data], which amongst other things calls
[^heat], should not be called until the data from the neighboring partitions
has arrived. Therefore it should come as no surprise that we synchronize
this operation with another instance of dataflow (found in [^heat_part]).
This dataflow is passed futures to the data in the current and surrounding
partitions by calling [^get_data()] on each respective partition. When these
futures are ready, dataflow passes them to the [^unwrapped] function, which
extracts the shared_arrays of doubles and passes them to the lambda.
The lambda calls [^heat_part_data] on the locality where the middle
partition resides.
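
The synchronization described above can be sketched like this (illustrative
only; the real [^heat_part] additionally makes sure the work runs on the
locality of the middle partition):

    // Produce the next state of 'middle' once the neighboring data is ready;
    // all three get_data() calls return futures to partition_data.
    hpx::future<partition_data> heat_part(
        partition const& left, partition const& middle, partition const& right)
    {
        return hpx::dataflow(hpx::launch::async,
            hpx::util::unwrapped(
                [](partition_data const& l, partition_data const& m,
                    partition_data const& r)
                {
                    return heat_part_data(l, m, r);   // applies heat() per point
                }),
            left.get_data(), middle.get_data(), right.get_data());
    }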

Although this example could run distributed, it only runs on
one locality, as it always uses [^hpx::find_here()] as the target
for the functions to run on.

In example 6, we begin to distribute the partition data on different nodes.
This is accomplished in [^stepper::do_work()] by passing the
GID of the locality where we wish to create the partition to the
partition constructor.

[import ../../examples/1d_stencil/1d_stencil_6.cpp]
[do_work_6]

We distribute the partitions evenly based on the number
of localities used, which is described in the function
[^locidx]. Because some of the data needed to update the
partition in [^heat_part] could now be on a new locality, we
must devise a way of moving data to the locality of the middle
partition. We accomplish this by adding a switch in the
function [^get_data()] which returns the end element of
the buffer [^data_] if the data is from the left partition, or
the first element of the buffer if the data is from the
right partition. In this way only the necessary elements,
not the whole buffer, are exchanged between nodes.
The reader should be reminded that this exchange of end
elements occurs in the function
[^get_data()] and is therefore executed asynchronously.
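
Both pieces can be sketched as follows (the enumeration, the free-function
form of [^get_data] and the exact distribution formula are illustrative
assumptions; the real [^get_data()] is a member of the partition server):

    // Map partition index i onto one of nl localities (np partitions total).
    inline std::size_t locidx(std::size_t i, std::size_t np, std::size_t nl)
    {
        return i / (np / nl);              // evenly sized blocks of partitions
    }

    // Hand out only the element a neighbor actually needs.
    enum partition_type { left_partition, middle_partition, right_partition };

    std::vector<double> get_data(std::vector<double> const& data_, partition_type t)
    {
        switch (t)
        {
        case left_partition:
            return { data_.back() };    // this is the caller's left neighbor:
                                        // only the end element is needed
        case right_partition:
            return { data_.front() };   // this is the caller's right neighbor:
                                        // only the first element is needed
        default:
            return data_;               // local access: the whole buffer
        }
    }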

Now that we have the code running distributed, it
is time to make some optimizations. The function [^heat_part]
spends most of its time on two tasks: retrieving remote
data and working on the data in the middle partition. Because
we know that the data for the middle partition is local,
we can overlap the work on the middle partition with that
of the possibly remote call of [^get_data()]. This
algorithmic change, which was implemented in example 7,
can be seen below:

[import ../../examples/1d_stencil/1d_stencil_7.cpp]
[stepper_7]

Example 8 completes the futurization process and utilizes
the full potential of HPX by
distributing the program flow to multiple localities,
usually defined as nodes in a cluster.
It accomplishes this task by running an instance
of HPX main on each locality. In order to coordinate
the execution of the program,
the [^struct stepper] is wrapped into a component. In this
way, each locality contains an instance of stepper which
executes its own instance of the function [^do_work()].
This scheme does create an interesting synchronization
problem that must
be solved. When the program flow was being coordinated on
the head node, the GID of each component was known. However,
when we distribute the program flow, each partition has no
notion of the GID of its neighbor if the next partition
is on another locality. In order to make the GIDs of neighboring
partitions visible to each other, we created two buffers to
store the GIDs of the remote neighboring partitions on the
left and right respectively. These buffers are filled by
sending the GID of each newly created edge partition to the
right and left buffers of the neighboring localities.

In order to finish the simulation, the solution vectors named [^result]
are then gathered together on locality 0 and added into a vector of spaces,
[^overall_result], using the HPX functions [^gather_id] and [^gather_here].

[/Insert performance of stencil_8/]

Example 8 completes this example series, which takes the
serial code of example 1 and incrementally morphs it into
a fully distributed parallel code. This evolution
was guided by the simple principles of futurization,
awareness of grain size, and the use of components.
Applying these techniques facilitates the scalable
parallelization of most applications.

[/////////////////////////////////////////////////////////////////////////////]