The in-place parallel algorithms are hard to implement efficiently. #2804

Open
taeguk opened this Issue Aug 7, 2017 · 7 comments

@taeguk
Member

taeguk commented Aug 7, 2017

I'm working on the parallel algorithms in HPX. Through this issue I'd like to get some advice and form a consensus about how to implement the in-place parallel algorithms.
There are some in-place parallel algorithms that are still not implemented. (I exclude the sorting algorithms from the list because they seem to be implementable efficiently using a parallel partition.)

  • unique
  • remove, remove_if
  • inplace_merge

The parallel algorithms above are hard to implement efficiently.
For good performance, a parallel algorithm should exploit as much parallelism as possible while using as little additional memory as possible.
Neither is easy to achieve when implementing the algorithms above.

To gather ideas, I looked at several other implementations of these algorithms.
There are some implementations of these parallel algorithms, but most of them just allocate temporary memory and utilize the out-of-place version of the algorithm.

The problem with additional memory concerns not only performance but also conformance to the C++ standard.
Using temporary storage requires the iterator's value type to be MoveConstructible; if the type is not MoveConstructible, an implementation that relies on additional memory violates the C++ standard.

Only MS's 'parallelstl' avoids temporary memory.
For unique, remove, and remove_if, 'parallelstl' appears to scan in parallel and copy sequentially. (https://github.com/Chilledheart/ParallelSTL/blob/master/include/experimental/impl/unique.h#L108-L109)
For inplace_merge, 'parallelstl' utilizes rotate, so it needs no additional memory, but the time complexity grows from O(N) to O(N*logN).
(https://github.com/Chilledheart/ParallelSTL/blob/master/include/experimental/impl/merge.h#L107)
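
For reference, here is a rough sketch of the buffer-free, rotate-based approach (my own illustration, not ParallelSTL's actual code): the two sorted halves are split at a cut point, a std::rotate brings the middle pieces into order, and the two resulting sub-problems can then be merged recursively (and, in principle, in parallel). Each recursion level rotates O(N) elements and there are O(log N) levels, which is where the O(N*logN) bound comes from.

```cpp
#include <algorithm>
#include <iterator>

// Illustrative sketch only: buffer-free inplace_merge via recursive splitting
// plus std::rotate.
template <typename BidirIter>
void rotate_based_inplace_merge(BidirIter first, BidirIter middle, BidirIter last)
{
    auto n1 = std::distance(first, middle);
    auto n2 = std::distance(middle, last);
    if (n1 == 0 || n2 == 0)
        return;
    if (n1 + n2 == 2)
    {
        if (*middle < *first)
            std::iter_swap(first, middle);
        return;
    }

    // Cut the larger range in half and find the matching cut in the other range.
    BidirIter cut1, cut2;
    if (n1 >= n2)
    {
        cut1 = first;
        std::advance(cut1, n1 / 2);
        cut2 = std::lower_bound(middle, last, *cut1);
    }
    else
    {
        cut2 = middle;
        std::advance(cut2, n2 / 2);
        cut1 = std::upper_bound(first, middle, *cut2);
    }

    // Bring the two middle pieces into order; the ranges on either side of
    // new_middle can then be merged independently (and, in principle, in parallel).
    BidirIter new_middle = std::rotate(cut1, middle, cut2);
    rotate_based_inplace_merge(first, cut1, new_middle);
    rotate_based_inplace_merge(new_middle, cut2, last);
}
```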

And GCC's parallel mode does not provide parallel versions of unique, remove, remove_if, or inplace_merge. (https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html)

For inplace_merge, the question is which of two approaches we should choose: allocate additional memory and use the out-of-place version of the algorithm, or utilize rotate.

For unique, remove, and remove_if, the question is likewise which of two approaches we should choose: allocate additional memory and use the out-of-place version, or scan in parallel but copy sequentially.
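
To make the second option concrete, here is a minimal sketch of the "scan in parallel, copy sequentially" idea for remove_if (illustration only, not HPX's actual implementation; std::async stands in for an HPX executor and the chunking is simplified):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Illustrative sketch: remove_if as a parallel predicate scan over chunks
// followed by a sequential compaction pass that moves kept elements forward.
template <typename FwdIter, typename Pred>
FwdIter scan_then_compact_remove_if(FwdIter first, FwdIter last, Pred pred,
                                    std::size_t chunk_size = 4096)
{
    // Split [first, last) into chunks.
    std::vector<std::pair<FwdIter, FwdIter>> chunks;
    for (FwdIter it = first; it != last; )
    {
        FwdIter chunk_begin = it;
        for (std::size_t n = 0; it != last && n != chunk_size; ++n) ++it;
        chunks.emplace_back(chunk_begin, it);
    }

    // Phase 1 (parallel): evaluate the predicate once per element and record
    // which elements survive.
    std::vector<std::future<std::vector<char>>> keep_flags;
    for (auto const& c : chunks)
    {
        keep_flags.push_back(std::async(std::launch::async, [c, pred] {
            std::vector<char> keep;
            for (FwdIter it = c.first; it != c.second; ++it)
                keep.push_back(!pred(*it));
            return keep;
        }));
    }

    // Phase 2 (sequential): move the surviving elements to their final
    // positions; only MoveAssignable is required here.
    FwdIter dest = first;
    for (std::size_t i = 0; i != chunks.size(); ++i)
    {
        std::vector<char> keep = keep_flags[i].get();
        std::size_t j = 0;
        for (FwdIter it = chunks[i].first; it != chunks[i].second; ++it, ++j)
        {
            if (keep[j])
            {
                if (dest != it)
                    *dest = std::move(*it);
                ++dest;
            }
        }
    }
    return dest;
}
```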

I'd like to hear other people's thoughts, and suggestions for different ways to implement these algorithms are very welcome. Please share your thoughts and ideas.

@taeguk
Member

taeguk commented Aug 7, 2017

@biddisco @hkaiser Excuse me, could you give me your thoughts about this issue?

@biddisco
Contributor

biddisco commented Aug 7, 2017

I've only thought about remove and remove_if, and my understanding might be flawed ... correct me if I'm wrong.

You can perform a parallel scan with the predicate - this gives you the number of elements in each partition that are going to remain. The problem is that if the Nth partition starts copying data into a slot below N, it can overwrite data that still needs to be moved lower before that data has been moved - a race condition between the moves of data elements in different partitions.

Option 1 - use temp memory for the individual partition sections that meet the predicate, and then a final copy into the original memory. I can see why this is distasteful - we pay for a malloc in each partition except the first and have to move the data twice.
Option 2 - do a sequential second step to ensure that the race is avoided - this slows the algorithm down but avoids mallocs and reduces memory bandwidth.
Option 3 - everything else I can think of is just a variation of option 2 to avoid the race.

I would implement option 2 as a first attempt and time it. Unless the user is passing absolutely huge arrays, the serial copy will not be so expensive compared to a parallel one.

I will think about this more and reply again if I have a better idea. Sadly, some algorithms just aren't that easy to make work in parallel ...
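
To make option 1 a bit more concrete, here is a rough sketch for remove_if on a vector (my illustration only, not HPX code; std::async stands in for an HPX executor). Each chunk filters its kept elements into a private temporary buffer in parallel; because every buffer gets a disjoint destination in the original range, the copy back could also run in parallel without races (shown sequentially here for brevity). Note that this variant needs MoveConstructible elements and moves the data twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Illustrative sketch of "option 1": per-chunk temporary buffers, then a
// final copy back into the original memory.
template <typename T, typename Pred>
typename std::vector<T>::iterator
chunked_remove_if(std::vector<T>& v, Pred pred, std::size_t chunk_size = 4096)
{
    // Phase 1 (parallel): each chunk moves its kept elements into its own buffer.
    std::vector<std::future<std::vector<T>>> buffers;
    for (std::size_t begin = 0; begin < v.size(); begin += chunk_size)
    {
        std::size_t end = std::min(begin + chunk_size, v.size());
        buffers.push_back(std::async(std::launch::async, [&v, begin, end, pred] {
            std::vector<T> kept;
            for (std::size_t i = begin; i != end; ++i)
                if (!pred(v[i]))
                    kept.push_back(std::move(v[i]));
            return kept;
        }));
    }

    // Phase 2 (parallelisable, shown sequentially): move each buffer back to
    // its disjoint slice of the original range.
    std::size_t dest = 0;
    for (auto& fut : buffers)
    {
        std::vector<T> kept = fut.get();
        std::move(kept.begin(), kept.end(), v.begin() + dest);
        dest += kept.size();
    }
    return v.begin() + dest;
}
```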

@hkaiser hkaiser added this to the 1.1.0 milestone Aug 8, 2017

@hkaiser
Member

hkaiser commented Aug 8, 2017

It's always a compromise between memory consumption and performance. Would it be sensible to have both versions (using external memory but faster and not using external memory but slower) and let the user decide (through an execution parameter)?

There is probably also a sweet-spot hidden in there which we could find by doing some measurements.

@taeguk
Member

taeguk commented Aug 20, 2017

@biddisco @hkaiser I implemented and benchmarked both ways of implementing unique, remove, and remove_if: 1) scanning in parallel & copying sequentially, and 2) allocating temporary memory & utilizing unique_copy.
In conclusion, both perform badly in my benchmark... :(
Both are the same as, or only a little bit faster than, the sequential version. (And 'a little bit' really means 'a little bit'. They cannot achieve even close to a 20% improvement with 8 cores (16 threads) on Rostam...)
How should I proceed?
Should I pick one of the two ways for now, submit a PR that implements and optimizes it, and then look for a better way in the future?
Or should I not submit the PR for unique, remove, and remove_if and postpone those implementations to the future?

@hkaiser
Member

hkaiser commented Aug 20, 2017

@taeguk my suggestion would be to add the better of the two versions now and keep in mind (create a ticket) that we should get back and improve, if possible. Please also add the benchmarks you developed.

@taeguk
Member

taeguk commented Aug 20, 2017

@hkaiser Okay, I'll follow what you said.
I have one more question.
The version that uses temporary memory requires MoveConstructible, but the standard only requires MoveAssignable. In practice, a type that is MoveAssignable will usually also be MoveConstructible, but that is not 100% guaranteed. Can this tiny nonconformity be accepted, or should I avoid temporary memory to follow the standard strictly?

And does 'ticket' mean a GitHub issue?
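
For illustration, a type like the following (hypothetical example) models MoveAssignable but not MoveConstructible, which is why the temporary-memory variant goes beyond what the standard requires of the element type:

```cpp
#include <type_traits>

// A type that is MoveAssignable but not MoveConstructible. remove, remove_if
// and unique only require MoveAssignable, so an implementation that moves
// elements into a temporary buffer would not work for such a type.
struct move_assignable_only
{
    move_assignable_only() = default;
    move_assignable_only(move_assignable_only&&) = delete;
    move_assignable_only& operator=(move_assignable_only&&) = default;
};

static_assert(std::is_move_assignable<move_assignable_only>::value, "");
static_assert(!std::is_move_constructible<move_assignable_only>::value, "");
```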

@hkaiser
Member

hkaiser commented Aug 21, 2017

@taeguk: newer versions of the ranges TS refer to MoveConstructible only (see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4685.pdf).

taeguk added a commit to taeguk/hpx that referenced this issue Aug 28, 2017

Re-implement parallel::unique using sequentially performed f3.
Original way: allocating temporary storage and utilizing out-place version.
New way: perform f3 sequentially.

See also: #2804

@taeguk taeguk referenced this issue Aug 28, 2017

Merged

Implement parallel::unique. #2867

6 of 6 tasks complete

@msimberg msimberg removed this from the 1.1.0 milestone Dec 5, 2017

@hkaiser hkaiser added this to Open Tickets in Standard Algorithms Jan 21, 2018
