
Implement N4409 on top of HPX #1141

Open
hkaiser opened this Issue May 31, 2014 · 16 comments

Comments

@hkaiser
Member

hkaiser commented May 31, 2014

Implement N3989 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3989.html) on top of HPX. This would finally be the first step towards exposing parallel algorithms to application developers.

There is an updated version of the proposal document here (N4409): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4409.pdf.

As always, this implementation will add HPX-specific functionality. Several extensions are obvious right away:

  • Add a new execution policy task_execution_policy on top of what the proposal mandates. The difference from the already described parallel_execution_policy would be that all algorithms return a future<> representing the original result.
  • Add the possibility to specify HPX executors to be used with the par and task execution policies. This could be done by accepting par(exec) and task(exec) as valid execution policy arguments (see the usage sketch below these lists).
  • Add functionality enabling the use of all algorithms for distributed use cases (see also #1338).

Another possible extension needs more investigation:

  • Allow all algorithms to be invoked with sequences of futures, where the predicates/operators are invoked only after the corresponding future becomes ready.
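
A rough usage sketch of the first two extensions follows. The policy spellings (task, par(exec)) follow the proposal text in this issue; headers and the exact names in HPX may differ from what eventually gets merged:

    // Illustration only: policy names follow this issue's proposal text,
    // not necessarily the API as finally merged into HPX.
    #include <hpx/include/parallel_for_each.hpp>
    #include <vector>

    void usage(std::vector<int>& v)
    {
        using namespace hpx::parallel;

        // parallel_execution_policy: runs in parallel, blocks until done
        for_each(par, v.begin(), v.end(), [](int& x) { ++x; });

        // task_execution_policy: immediately returns a future<>
        // representing the algorithm's original result
        auto f = for_each(task, v.begin(), v.end(), [](int& x) { ++x; });
        f.get();    // wait for completion

        // proposed executor extension (exec is a hypothetical executor):
        // for_each(par(exec), v.begin(), v.end(), [](int& x) { ++x; });
    }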

Here is the list of algorithms mandated by the proposal:

  • adjacent_difference inner_product (see #1659)
  • adjacent_find (see #1223)
  • all_of any_of none_of (see #1181)
  • copy copy_if copy_n (see #1147)
  • move
  • count count_if
  • equal mismatch (see #1193, #1206)
  • exclusive_scan inclusive_scan (see #1341, #1342)
  • reduce transform (see #1183)
  • fill fill_n
  • find find_end find_first_of find_if find_if_not (see #1206, #1223)
  • for_each for_each_n
  • generate generate_n (see #1206)
  • is_heap is_heap_until (see #2654)
  • is_partitioned is_sorted is_sorted_until (see #1382)
  • lexicographical_compare (see #1354)
  • max_element min_element minmax_element (see #1231)
  • partial_sort partial_sort_copy nth_element sort (see #1850) stable_sort
  • partition (see #2778)
  • partition_copy (see #2716)
  • stable_partition (see #2345)
  • remove remove_if (see #3086)
  • remove_copy remove_copy_if (see #1385)
  • replace replace_copy replace_copy_if replace_if (see #1225)
  • reverse reverse_copy (see #1208)
  • rotate rotate_copy (see #1212)
  • search search_n (see #1317)
  • set_difference set_intersection set_symmetric_difference set_union includes (see #1410)
  • inplace_merge (see #2978)
  • merge (see #2833)
  • swap_ranges
  • uninitialized_copy uninitialized_copy_n (see #1284)
  • uninitialized_fill uninitialized_fill_n (see #1284)
  • uninitialized_default_construct uninitialized_default_construct_n (see #2669)
  • uninitialized_value_construct uninitialized_value_construct_n (see #2671)
  • uninitialized_move uninitialized_move_n (see #2664)
  • destroy destroy_n (see #2676)
  • unique (see #2867)
  • unique_copy (see #2754)

These were added by N4310:

  • transform_reduce
  • transform_exclusive_scan transform_inclusive_scan (see #1341, #1342)

@hkaiser hkaiser added this to the 0.9.9 milestone May 31, 2014

@hkaiser hkaiser referenced this issue Jul 13, 2014

Merged

Parallel equal #1193

@Syntaf Syntaf changed the title from Implement N3989 on top of HPX to Implement N4071 on top of HPX Jul 31, 2014

@hkaiser hkaiser modified the milestones: 0.9.10, 0.9.9 Sep 13, 2014

@hkaiser hkaiser changed the title from Implement N4071 on top of HPX to Implement N4310 on top of HPX Dec 28, 2014

@hkaiser hkaiser added in progress and removed in progress labels Jan 7, 2015

@hkaiser hkaiser modified the milestones: 1.0.0, 0.9.10 Feb 25, 2015

@hkaiser hkaiser changed the title from Implement N4310 on top of HPX to Implement N4352 on top of HPX Mar 5, 2015

@hkaiser hkaiser changed the title from Implement N4352 on top of HPX to Implement N4409 on top of HPX May 11, 2015

@hkaiser hkaiser referenced this issue Dec 7, 2015

Closed

version 1.0? #1896

hkaiser added a commit that referenced this issue Apr 22, 2016

@diehlpk
Member

diehlpk commented Jan 24, 2017

@hkaiser Could you please add a project description here https://github.com/STEllAR-GROUP/hpx/wiki/GSoC-2017-Project-Ideas

@taeguk

Member

taeguk commented Mar 25, 2017

I'm preparing for GSoC and I have a question.
When implementing parallel algorithms, can I allocate additional memory for optimization or parallelization of algorithms?
For some algorithms the implementation may differ depending on whether additional memory allocation is allowed or not.

@hkaiser

Member

hkaiser commented Mar 25, 2017

When implementing parallel algorithms, can I allocate additional memory for optimization or parallelization of algorithms?

That's a very good and controversial question. The maximum memory requirements for the parallel algorithms are not specified, only the computational complexity. While allocating memory itself does not change the computational complexity of an algorithm, the fact that you allocate some intermediate buffer often requires more data copying, which in turn may change the complexity.

So the first rule of the game is not to exceed the complexity requirements as specified (e.g. don't make an algorithm O(N) if it's supposed to be O(log N), etc.). If an additional allocation does not (indirectly) change the complexity, then please make sure that, even for large data arrays, it does not blow the memory requirements out of proportion.

Generally, I'd suggest trying to avoid memory allocations as much as possible in a first step and using an implementation that can do without them. I understand that additional allocations may improve an algorithm's performance, or even its complexity, but I'd like to make an implementation correct first before diving into possible optimizations.

I know I have not given a concrete answer to your question. I guess it's a case-by-case decision we'll have to make as we go.

@hkaiser

Member

hkaiser commented Mar 25, 2017

@taeguk Most of the missing algorithms can be implemented based on a variation of a parallel scan. We already have a handful of algorithms based on a scan_partitioner (e.g. copy_if, remove_copy, inclusive_scan, etc.). I'd expect the rest of them to be easily implementable using this very same (and already existing) scan_partitioner. I'd suggest you familiarize yourself with how we have implemented the existing algorithms, as reusing the partitioner would significantly simplify implementing the missing ones.
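
To illustrate the pattern (not the actual HPX interfaces), here is the three-phase structure underlying a scan-based algorithm such as copy_if, written as plain sequential C++17; in scan_partitioner the per-partition phases run as parallel tasks, and phase 2 is the propagation step discussed further below. All names in this sketch are made up:

    // Sketch of the three phases behind a scan-based copy_if; sequential
    // C++17 stands in for the partitioner's task graph.
    #include <algorithm>
    #include <numeric>
    #include <vector>

    std::vector<int> blocked_copy_if_positive(std::vector<int> const& in,
        std::size_t parts)
    {
        std::size_t n = in.size();
        std::size_t block = (n + parts - 1) / parts;
        auto pred = [](int x) { return x > 0; };

        // Phase 1: per partition, count the elements surviving the
        // predicate (runs in parallel in the real implementation).
        std::vector<std::size_t> counts(parts, 0);
        for (std::size_t p = 0; p != parts; ++p)
        {
            std::size_t lo = std::min(p * block, n), hi = std::min(lo + block, n);
            counts[p] = std::count_if(in.begin() + lo, in.begin() + hi, pred);
        }

        // Phase 2: an exclusive scan over the counts yields each
        // partition's offset into the output sequence.
        std::vector<std::size_t> offsets(parts, 0);
        std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(),
            std::size_t(0));

        // Phase 3: per partition, copy the surviving elements to the
        // precomputed offset (again parallel in the real implementation).
        std::vector<int> out(offsets.back() + counts.back());
        for (std::size_t p = 0; p != parts; ++p)
        {
            std::size_t lo = std::min(p * block, n), hi = std::min(lo + block, n);
            std::copy_if(in.begin() + lo, in.begin() + hi,
                out.begin() + offsets[p], pred);
        }
        return out;
    }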

@taeguk

Member

taeguk commented Mar 26, 2017

@hkaiser Thanks for your answer and advice.
After reading your answer, I analyzed the code for copy_if, inclusive_scan, and scan_partitioner.
scan_partitioner seems to be helpful for the remaining algorithms.
For example, partition_copy can easily be implemented using scan_partitioner by referring to the implementation of copy_if. And unique_copy also seems implementable with scan_partitioner after adapting its use a little.
Still, some algorithms need further thought and discussion.
For example, partition is not easy to implement because it is an in-place algorithm, unlike copy_if. And merge also seems difficult, even though it is an out-of-place algorithm.

@hkaiser

Member

hkaiser commented Mar 26, 2017

@taeguk Sure. Please ask more questions if needed. You may want to focus first on the algorithms that are easier to implement.

@taeguk

Member

taeguk commented Mar 26, 2017

@hkaiser Thank you. And I have one more question.
Currently, scan_partitioner uses a sequential approach to propagate the partition results in step 2:

    workitems.push_back(dataflow(p, f2, prev, curr));

I think it would be better to use an approach like this one:
https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Prefix_sum_16.svg/300px-Prefix_sum_16.svg.png
Of course, this is only meaningful if there are a large number of partitions and f2 is a heavy function.
What do you think about it?
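
For reference, here is what the quoted line builds, modelled with std::async instead of HPX's dataflow (all names here are illustrative): every step-2 task consumes the carry of its left neighbor, so the carries form a linear chain whose span grows linearly with the number of partitions.

    // Sequential carry propagation as a dependency chain, modelled with
    // std::async instead of hpx::dataflow; names are illustrative only.
    #include <future>
    #include <vector>

    int f2(int carry, int partition_sum) { return carry + partition_sum; }

    std::vector<int> propagate(std::vector<int> const& partition_sums)
    {
        std::vector<std::shared_future<int>> workitems;
        workitems.push_back(std::async([] { return 0; }).share());
        for (int curr : partition_sums)
        {
            std::shared_future<int> prev = workitems.back();
            // analogous to: workitems.push_back(dataflow(p, f2, prev, curr));
            workitems.push_back(std::async([prev, curr] {
                return f2(prev.get(), curr);  // cannot start combining before
            }).share());                      // the left carry is ready
        }
        std::vector<int> carries;
        for (auto& w : workitems)
            carries.push_back(w.get());
        return carries;  // carries[i] == sum of the first i partition results
    }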

@hkaiser

Member

hkaiser commented Mar 26, 2017

@taeguk I agree, this might be viable. Having an independent implementation of the scan partitioner for performance comparisons is a good thing. Also, @biddisco was looking into improving the scan-based algorithms at some point; he might be able to shed some additional light on all of this.

On an unrelated note: would you mind continuing this discussion on our mailing list (hpx-users@stellar.cct.lsu.edu) or on the IRC channel (#ste||ar at freenode)?

@mcopik

Contributor

mcopik commented Mar 30, 2017

@hkaiser @taeguk @biddisco
I missed the IRC discussion on that topic, but I want to mention that I am already working on a very similar problem. My Master's thesis is built on top of parallel prefix sums; I started implementing an existing solution in HPX, and for that part I have to change the existing implementation of scan_partitioner.

In general, taeguk made a very good observation about scan_partitioner. However, the existing approach is absolutely fine for most applications, because they tend to perform operations where the computation cost of the operand << the communication time. For example, all papers on MPI_Scan focus on optimizing the communication in operations such as integer addition. Unfortunately, it is not easy to find examples in the literature that consider more complex operations in a parallel prefix sum; after all, it might be a rare case. A parallel scan with a serialized middle phase is a very poor choice there due to its lack of scalability: the theoretical maximum of scalability is reached when the number of workers equals the square root of the number of data items.

My case is completely different: I have a very "heavy" operation with an unpredictable runtime (multigrid for nonlinear optimization by gradient descent) which operates on a small amount of data.
The most important problem for me is to decrease the span of the whole algorithm. It does not have to be work-efficient, but the total number of operations performed by the slowest worker has to be as small as possible. Unfortunately, for that case the Blelloch scan, which is shown in the graph linked by taeguk, is not really the best solution. The idea behind this scan is to perform two tree sweeps, one upward and one downward, which in total gives a theoretical span of twice the logarithm of the number of workers. We can do better.

I have achieved the best results with parallel prefix sums which are not work-efficient, i.e. the total number of operations performed by all workers is much higher than in the Blelloch scan; it pays off to do more work in parallel. Examples are the Kogge-Stone and Sklansky parallel prefix adders, both of which achieve a total span of just the logarithm of the number of workers; I prefer the former. The downside is that those scans are inclusive, while the third phase requires the results of an exclusive scan. This has to be fixed by another round of communication in which each partition shifts its result to the neighbor on its right. That is not a problem as long as the operand is "heavy" enough, because it is much better to perform more communication than to do more work.

Then there are smaller optimizations and modifications to further decrease the span. The Kogge-Stone scan leaves some room for improvement because the work is not distributed equally among the workers.

If I succeed, then something similar should definitely be available before GSoC starts. It may be a starting point or a competing idea; we may end up with different solutions. However, it is very hard to invent anything new or different for prefix sums; the field has been studied extensively since the 1960s.
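
For readers who want to see the trade-off concretely, here is a minimal sequential model of the Kogge-Stone inclusive scan (my own illustration, not HPX code). In round k every element combines with the element 2^k positions to its left; with one worker per element the span is ceil(log2 n) combine steps, at the price of O(n log n) total work, which is exactly the non-work-efficient trade-off described above:

    // Kogge-Stone inclusive scan, written sequentially for clarity; in a
    // parallel implementation each round's updates run concurrently.
    #include <vector>

    std::vector<long> kogge_stone_scan(std::vector<long> const& in)
    {
        std::vector<long> a = in;
        for (std::size_t dist = 1; dist < a.size(); dist *= 2)
        {
            std::vector<long> next = a;  // double buffer: read a, write next
            for (std::size_t i = dist; i < a.size(); ++i)
                next[i] = a[i - dist] + a[i];  // combine with element 'dist' to the left
            a.swap(next);
        }
        return a;  // a[i] == in[0] + ... + in[i]
    }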

@mcopik

Contributor

mcopik commented Mar 30, 2017

@taeguk Just out of curiosity: did you notice the performance problem in scan_partitioner through prior experience, or through knowledge of problems where a parallel prefix sum is performed with computationally expensive operands? If so, would you mind sharing some details? I'm interested in approaches developed for this problem and, as I said, it's not a very common thing.

@taeguk

Member

taeguk commented Mar 31, 2017

@mcopik I didn't claim we should use the Blelloch scan; I just wanted to point out that there may be a problem in the implementation of scan_partitioner. The Kogge-Stone and Sklansky parallel prefix adders seem to be good.

Let me tell you a little bit about the last discussion on IRC. @biddisco said that he implemented scan_partitioner with the sequential propagation approach because the extra cost of the parallel approach outweighed its gains. But I thought the sequential approach can cause a problem if there are a large number of partitions and f2 is a heavy function. So we decided that the user should select which approach to use, through an algorithmic policy or some such mechanism. For the parallel approach, determining which method to use is an issue that needs additional discussion.

@taeguk

Member

taeguk commented Mar 31, 2017

My case is completely different: I have a very "heavy" operation with an unpredictable runtime (multigrid for nonlinear optimization by gradient descent) which operates on a small amount of data.
The most important problem for me is to decrease the span of the whole algorithm. It does not have to be work-efficient, but the total number of operations performed by the slowest worker has to be as small as possible.

@mcopik In this case, there is a basic problem in scan_partitioner aside from the one I presented. Currently, scan_partitioner first performs a sequential scan on each partition and then propagates the partition results to the other partitions from left to right (the problem I presented concerns the latter step).
The sequential scan on each partition in the first step causes a problem in the situation you describe, because the times spent in the individual partitions differ so much: everything is delayed by the slowest partition. So I think the first step of scan_partitioner also needs to be changed if we want to handle that situation.

@msimberg

Contributor

msimberg commented Nov 21, 2017

Marked inplace_merge as done because #2978 was merged.

@msimberg msimberg removed this from the 1.1.0 milestone Nov 21, 2017

@victor-ludorum

Contributor

victor-ludorum commented Feb 18, 2018

Hello @hkaiser! Many algorithms are already implemented, but I have made a list of the unimplemented ones:
copy_backward
equal_range
is_permutation
lower_bound
upper_bound
move_backward
prev_permutation
next_permutation
nth_element
partition_point
pop_heap
push_heap
sort_heap
stable_sort
partial_sort

And since we already have adjacent_difference (a numeric algorithm), these three numeric algorithms could also be implemented:
accumulate
inner_product
partial_sum

As I have started learning about HPX, I have checked that these algorithms haven't been implemented yet. Is the implementation of these algorithms not important for parallelism and concurrency?

@hkaiser

Member

hkaiser commented Feb 18, 2018

@victor-ludorum the algorithms listed above are the ones specified for C++17.

accumulate, inner_product, and partial_sum are listed under different names (reduce, transform_reduce, and inclusive_scan).

Somebody already tried to implement the heap algorithms, but that effort was abandoned (see #1914); feel free to revive it.

The sort-related algorithms are listed as not implemented in our list above (nth_element, partial_sort, etc.); feel free to take those on.

I'm not sure whether it is possible to parallelize the permutation algorithms.

copy_backward and move_backward can easily be implemented on top of the existing copy and move algorithms (they require at least bidirectional iterators, so you could wrap the given iterators into reverse_iterators); alternatively, we'd need a separate (but similar to copy/move) implementation.

I don't know what the difference between partition and partition_point is.
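
For illustration, a minimal sketch of the reverse_iterator idea, with plain std::copy standing in for the parallel copy algorithm; the function name is made up:

    // copy_backward built on a forward copy by wrapping the bidirectional
    // iterators in std::reverse_iterator; std::copy stands in for the
    // parallel copy algorithm, and the function name is invented.
    #include <algorithm>
    #include <iterator>

    template <typename BidirIt1, typename BidirIt2>
    BidirIt2 copy_backward_via_reverse(BidirIt1 first, BidirIt1 last,
        BidirIt2 d_last)
    {
        // Copying [first, last) backwards so that the last element lands
        // just before d_last is the same as copying the reversed input
        // range forwards, starting at the reverse of d_last.
        auto r = std::copy(std::make_reverse_iterator(last),
            std::make_reverse_iterator(first),
            std::make_reverse_iterator(d_last));
        return r.base();  // beginning of the destination range, as
                          // std::copy_backward returns it
    }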

@victor-ludorum

Contributor

victor-ludorum commented Feb 18, 2018

Thanks, @hkaiser! I will definitely work on these algorithms. So the numeric algorithms are already implemented, and I hope the remaining important ones can be implemented as well.
