The in-place parallel algorithms are hard to implement efficiently. #2804
I'm working on the parallel algorithms of HPX. Through this issue, I want to get some advice and form a consensus about the implementation of the in-place parallel algorithms.
The parallel algorithms above are hard to implement efficiently.
To get ideas, I looked at many other implementations of them.
The use of additional memory is a problem not only for performance but also for conformance to the C++ standard.
Only Microsoft's 'parallelstl' doesn't use temporary memory.
And 'gcc' doesn't support
The issue of
And the issue of
I would like to hear other people's thoughts. Suggestions for different ways to implement these algorithms are very welcome. Please share your thoughts and ideas.
I've only thought about remove and remove_if, and my understanding might be flawed, so correct me if I'm wrong.
You can perform a parallel scan with the predicate. This gives you the number of elements in each partition that are going to remain. The problem is that if the Nth partition starts copying data into a slot that belongs to a partition < N, it can overwrite data that still needs to be moved lower but has not been moved yet. A race condition then occurs between the moves of data elements in different partitions.
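To make the overlap concrete, here is a minimal sketch of the counting step in plain standard C++ (not HPX code). The helper name `destination_offsets` and the chunking scheme are assumptions for illustration only. Each partition's count is independent and could be computed in parallel; it is done serially here to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an HPX API): for every partition, compute the index
// at which its surviving elements would have to start after the removal.
template <typename T, typename Pred>
std::vector<std::size_t> destination_offsets(
    std::vector<T> const& data, Pred pred, std::size_t num_chunks)
{
    assert(num_chunks > 0);
    std::size_t const n = data.size();
    std::size_t const chunk = (n + num_chunks - 1) / num_chunks;

    std::vector<std::size_t> offsets;
    std::size_t running = 0;    // exclusive prefix sum of the survivor counts
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
        std::size_t const end = std::min(begin + chunk, n);
        offsets.push_back(running);
        // Count the elements this partition keeps (predicate is false).
        running += static_cast<std::size_t>(
            std::count_if(data.begin() + begin, data.begin() + end,
                [&pred](T const& x) { return !pred(x); }));
    }
    return offsets;
}
```

For ten elements split into partitions of four, with two survivors in each of the first two partitions, the second partition's destination starts at index 2, which lies inside the first partition's source range [0, 4). That is the situation in which the moves of different partitions can race.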
Option 1: use temporary memory for the sections of each partition that meet the predicate, followed by a final copy into the original memory. I can see why this is distasteful: we pay for malloc in each partition except the first, and we have to move the data twice.
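A rough sketch of how option 1 could look, using only the standard library rather than HPX facilities. The function name `remove_if_with_temp`, the use of `std::async`, and the restriction to `std::vector` are assumptions made for illustration. For simplicity every partition gets a buffer here, including the first, which could instead compact in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helper (not an HPX API): remove_if with per-partition buffers.
template <typename T, typename Pred>
typename std::vector<T>::iterator remove_if_with_temp(
    std::vector<T>& data, Pred pred, std::size_t num_chunks)
{
    assert(num_chunks > 0);
    std::size_t const n = data.size();
    std::size_t const chunk = (n + num_chunks - 1) / num_chunks;

    // Phase 1 (parallel): each partition moves the elements it keeps into
    // its own temporary buffer. Partitions touch disjoint input ranges.
    std::vector<std::future<std::vector<T>>> parts;
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
        std::size_t const end = std::min(begin + chunk, n);
        parts.push_back(
            std::async(std::launch::async, [&data, pred, begin, end] {
                std::vector<T> kept;
                for (std::size_t i = begin; i != end; ++i)
                    if (!pred(data[i]))
                        kept.push_back(std::move(data[i]));
                return kept;
            }));
    }

    // Wait for every partition before writing back into the original range.
    std::vector<std::vector<T>> buffers;
    for (auto& part : parts)
        buffers.push_back(part.get());

    // Phase 2: second move of the data, from the buffers to the front.
    auto out = data.begin();
    for (auto& kept : buffers)
        out = std::move(kept.begin(), kept.end(), out);
    return out;    // new logical end, as with std::remove_if
}
```

The two costs mentioned above are visible here: one allocation per partition for `kept`, and every surviving element is moved twice. Phase 2 is written serially, but since each buffer's destination is known once all sizes are known, it could also be done in parallel.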
I would implement option 2 as a first attempt and time it. Unless the user is passing absolutely huge arrays, the serial copy will not be so expensive compared to a parallel one.
I will think about this more and reply again if I have a better idea. Sadly, some algorithms just aren't that easy to make work in parallel ...
It's always a compromise between memory consumption and performance. Would it be sensible to have both versions (using external memory but faster and not using external memory but slower) and let the user decide (through an execution parameter)?
There is probably also a sweet spot hidden in there which we could find by doing some measurements.
@biddisco @hkaiser I implemented and benchmarked both ways to implement
@hkaiser Okay, I'll do as you said.
And does "ticket" mean a GitHub issue?