Stencil docs #1214
Merged
Commits (22):

- c199ad5 Adding stencil example walk through. The goal of this section is to
- dab5dad Airplane work, most of example 2
- 2d24b82 Explanation of examples 1 and 2
- 16b190b Adding explanation of example 3 and 4
- 81e5a95 Started work on example 5
- 0a8b01d More work on example 5
- c7829c8 Adding Example 6
- 3e997c5 Added example 7 and beginning of example 8
- 6125c3a corrected some typos
- 3fd331e more typos
- 36ff013 Finishing example 8
- 02b2542 Adding the futurization example to the docutmentaion
- e76cfc1 Polishing intro...
- 6a63dd6 Adding code exerpts from ex1 & 2
- ee47bd0 Polishing, with a focus on section 1, 2, and 4
- b476766 Polishing Revisions
- c8c3dc9 Adding code from examples 6 & 7
- e074e96 fix some typos and added to example 8 description
- bff788f Adding image and conclusion
- d071de6 Spellcheck
- dca5b46 minor corrections on last paragraph and typos elsewhere
- 4236b0d typo
@@ -0,0 +1,258 @@
[/=============================================================================
    Copyright (c) 2014 Adrian Serio

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
=============================================================================/]

[section:futurization_example Futurization Example]
When developers write code, they typically begin with a simple serial program
and build upon it until all of the required functionality is present. The
following set of examples demonstrates this iterative process of evolving a
simple serial program into an efficient, fully distributed HPX application.
For this demonstration, we implemented a 1D heat-distribution problem. This
calculation simulates the diffusion of heat across a ring from an initialized
state to some user-defined point in the future. It does this by breaking the
ring into discrete segments and using the current segment's temperature,
together with the temperatures of the surrounding segments, to calculate the
temperature of that segment in the next timestep, as shown in the
[link futurization_example.1d_pgf figure] below.

[fig 1d_stencil_program_flow.png..Heat Diffusion Example Program Flow..futurization_example.1d_pgf]

We parallelize this code over the following eight examples:

*[hpx_link examples/1d_stencil/1d_stencil_1.cpp..Example 1]
*[hpx_link examples/1d_stencil/1d_stencil_2.cpp..Example 2]
*[hpx_link examples/1d_stencil/1d_stencil_3.cpp..Example 3]
*[hpx_link examples/1d_stencil/1d_stencil_4.cpp..Example 4]
*[hpx_link examples/1d_stencil/1d_stencil_5.cpp..Example 5]
*[hpx_link examples/1d_stencil/1d_stencil_6.cpp..Example 6]
*[hpx_link examples/1d_stencil/1d_stencil_7.cpp..Example 7]
*[hpx_link examples/1d_stencil/1d_stencil_8.cpp..Example 8]

The first example is straight serial code. In this code, we instantiate a
vector [^U] which contains two vectors of doubles, as seen in the structure
[^stepper].

[import ../../examples/1d_stencil/1d_stencil_1.cpp]
[stepper_1]

Each element in the vector of doubles represents a single grid point. To
calculate the change in heat distribution, the temperature of each grid point,
along with that of its neighbors, is passed to the function [^heat]. In order
to improve readability, references named [^current] and [^next] are created
which, depending on the timestep, point to the first or the second vector of
doubles. The first vector of doubles is initialized with a simple heat ramp.
After calling the heat function with the data in the [^current] vector, the
results are placed into the [^next] vector.

In example 2 we employ a technique called futurization. Futurization is a
method by which we can easily transform a code which is executed serially into
a code which creates asynchronous threads. In the simplest case this involves
replacing a variable with a future to a variable, a function with a future to
a function, and adding a [^.get()] at the point where a value is actually
needed. The code below shows how this technique was applied to the
[^struct stepper].

[import ../../examples/1d_stencil/1d_stencil_2.cpp]
[stepper_2]

In example 2, we redefine our partition type as a [^shared_future] and, in
[^main], create the object [^result], which is a future to a vector of
partitions. We use [^result] to represent the last vector in the series of
vectors created, one per timestep. In order to move to the next timestep, the
values of a partition and its neighbors must be passed to [^heat] once the
futures that contain them are ready. In HPX, we have an LCO (Local Control
Object) named Dataflow which assists the programmer in expressing this
dependency. Dataflow allows us to pass the results of a set of futures to a
specified function when the futures are ready. Dataflow takes three types of
arguments: one which instructs the dataflow on how to perform the function
call (async or sync), the function to call (in this case [^Op]), and the
futures to the arguments that will be passed to the function. When called,
dataflow immediately returns a future to the result of the specified function.
This allows users to string dataflows together and construct an execution
tree.

After the values of the futures in dataflow are ready, they must be pulled out
of their future containers to be passed to the function [^heat]. In order to
do this, we use the HPX facility [^unwrapped], which underneath calls
[^.get()] on each of the futures so that the function [^heat] will be passed
doubles and not futures to doubles.

By setting up the algorithm this way, the program will be able to execute as
quickly as the dependencies of each future are met. Unfortunately, this
example runs terribly slowly. The increase in execution time is caused by the
overhead needed to create a future for each data point. Because the work done
within each call to heat is very small, the overhead of creating and
scheduling each of the three futures is greater than that of the actual useful
work! In order to amortize the overheads of our synchronization techniques, we
need to be able to control the amount of work that will be done with each
future. We call this amount of work per overhead the grain size.

In example 3, we return to our serial code to figure out how to control the
grain size of our program. The strategy we employ is to create "partitions" of
data points. The user can define how many partitions are created and how many
data points are contained in each partition. This is accomplished by creating
the [^struct partition], which contains a member object [^data_], a vector of
doubles which holds the data points assigned to a particular instance of
[^partition].

In example 4, we take advantage of the partition setup by redefining [^space]
to be a vector of shared futures, with each future representing a partition.
In this manner, each future represents several data points. Because the user
can define how many data points are contained in each partition (and therefore
how many data points are represented by one future), the user can now control
the grain size of the simulation. The rest of the code is then futurized in
the same manner as in example 2. It should be noted how strikingly similar
example 4 is to example 2.

Example 4 finally shows good results. This code scales better than the OpenMP
version by fifty percent. While these results are impressive, our work on this
algorithm is not over. This example only runs on one locality. To get the full
benefit of HPX, we need to be able to distribute the work to other machines in
a cluster. We begin this process in example 5.

In order to run on a distributed system, a large amount of boilerplate code
must be added. Fortunately, HPX provides us with the concept of a "component",
which saves us from having to write quite as much code. A component is an
object which can be remotely accessed using its global address. Components are
made of two parts: a server and a client class. While the client class is not
required, abstracting the server behind a client allows us to ensure type
safety instead of having to pass around pointers to global objects. Example 5
renames example 4's [^struct partition] to [^partition_data] and adds
serialization support. Next we add the server-side representation of the data
in the structure [^partition_server]. [^partition_server] inherits from
[^hpx::components::simple_component_base], which contains the server-side
component boilerplate. The boilerplate code allows a component's public
members to be accessible anywhere on the machine via its Global Identifier
(GID). To encapsulate the component, we create a client-side helper class.
This object allows us to create new instances of our component and access its
members without having to know its GID. In addition, we use the client class
to assist us with managing our asynchrony. For example, the member function
[^get_data()] of our client class [^partition] returns a future to the
[^partition_data] held by the server. The client class inherits its
boilerplate code from [^hpx::components::client_base].

In the structure [^stepper], we also have to make some changes to accommodate
a distributed environment. In order to get the data from a neighboring
partition, which could be remote, we must retrieve the data from the
neighboring partitions. These retrievals are asynchronous, and the function
[^heat_part_data], which among other things calls [^heat], should not be
called until the data from the neighboring partitions has arrived. Therefore
it should come as no surprise that we synchronize this operation with another
instance of dataflow (found in [^heat_part]). This dataflow is passed futures
to the data in the current and surrounding partitions, obtained by calling
[^get_data()] on each respective partition. When these futures are ready,
dataflow passes them to the [^unwrapped] function, which extracts the
[^shared_array] of doubles and passes it to the lambda. The lambda calls
[^heat_part_data] on the locality where the middle partition resides.

Although this example could run distributed, it only runs on one locality, as
it always uses [^hpx::find_here()] as the target for the functions to run on.

In example 6, we begin to distribute the partition data onto different nodes.
This is accomplished in [^stepper::do_work()] by passing the GID of the
locality where we wish to create the partition to the partition constructor.

[import ../../examples/1d_stencil/1d_stencil_6.cpp]
[do_work_6]

We distribute the partitions evenly based on the number of localities used,
which is described in the function [^locidx]. Because some of the data needed
to update a partition in [^heat_part] could now be on another locality, we
must devise a way of moving data to the locality of the middle partition. We
accomplish this by adding a switch in the function [^get_data()] which returns
the last element of the buffer [^data_] if the data is from the left
partition, or the first element of the buffer if the data is from the right
partition. In this way, only the necessary elements, not the whole buffer, are
exchanged between nodes. The reader should note that this exchange of end
elements occurs in the function [^get_data()] and is therefore executed
asynchronously.

Now that we have the code running distributed, it is time to make some
optimizations. The function [^heat_part] spends most of its time on two tasks:
retrieving remote data and working on the data in the middle partition.
Because we know that the data for the middle partition is local, we can
overlap the work on the middle partition with the possibly remote call of
[^get_data()]. This algorithmic change, implemented in example 7, can be seen
below:

[import ../../examples/1d_stencil/1d_stencil_7.cpp]
[stepper_7]

Example 8 completes the futurization process and utilizes the full potential
of HPX by distributing the program flow to multiple localities, usually
defined as nodes in a cluster. It accomplishes this by running an instance of
HPX main on each locality. In order to coordinate the execution of the
program, the [^struct stepper] is wrapped into a component. In this way, each
locality contains an instance of stepper which executes its own instance of
the function [^do_work()]. This scheme creates an interesting synchronization
problem that must be solved. When the program flow was coordinated on the head
node, the GID of each component was known. However, when we distribute the
program flow, a partition has no notion of the GID of its neighbor if the next
partition is on another locality. In order to make the GIDs of neighboring
partitions visible to each other, we create two buffers to store the GIDs of
the remote neighboring partitions on the left and right respectively. These
buffers are filled by sending the GID of each newly created edge partition to
the right and left buffers of the neighboring localities.

In order to finish the simulation, the solution vectors named [^result] are
then gathered together on locality 0 and added into a vector of spaces,
[^overall_result], using the HPX functions [^gather_id] and [^gather_here].

[/Insert performance of stencil_8/]

Example 8 completes this series of examples, which takes the serial code of
example 1 and incrementally morphs it into a fully distributed parallel code.
This evolution was guided by the simple principles of futurization, knowledge
of grain size, and the use of components. Applying these techniques
facilitates the scalable parallelization of most applications.

[/////////////////////////////////////////////////////////////////////////////]

[endsect]
The last sentence begs for at least one comma