The worst-case time complexity of parallel::sort seems to be O(N^2). #2836

Open
taeguk opened this Issue Aug 16, 2017 · 11 comments

Comments

6 participants
@taeguk
Member

taeguk commented Aug 16, 2017

Currently, the worst-case time complexity of parallel::sort seems to be O(N^2).

if (std::size_t(N) <= sort_limit_per_task)
{
    return execution::async_execute(
        policy.executor(),
        [first, last, comp]() -> RandomIt
        {
            std::sort(first, last, comp);
            return last;
        });
}

//------------------- check if sorted ----------------------------
if (detail::is_sorted_sequential(first, last, comp))
    return hpx::make_ready_future(last);

Admittedly, because of the two code blocks above, hitting the worst case is very unlikely.
Theoretically, however, the complexity can still degrade to O(N^2) in the worst case.
The usual way to guarantee O(N log N) is to switch to heap sort once the recursion depth exceeds a specific threshold (as introsort does).
If we want to guarantee O(N log N) here, we should modify Line 80 to

if (std::size_t(N) <= sort_limit_per_task || depth > recursion_depth_limit)

But I'm worried about the performance impact of such a modification.

I want to hear your thoughts about this.
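For illustration, the depth-limited switchover (i.e. the introsort idea) can be sketched in plain sequential C++ as follows. This is a standalone sketch, not the HPX code: the name depth_limited_sort, the small-range cutoff of 16, and the heap-sort fallback are illustrative choices.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch, not HPX code: a sequential quicksort that falls
// back to heap sort (std::make_heap + std::sort_heap) once the recursion
// depth budget is exhausted, bounding the worst case to O(N log N).
template <typename RandomIt>
void depth_limited_sort(RandomIt first, RandomIt last, int depth_limit)
{
    std::ptrdiff_t n = last - first;
    if (n <= 16)
    {
        std::sort(first, last);   // small range: delegate to std::sort
        return;
    }
    if (depth_limit == 0)
    {
        // depth budget exhausted: heap sort guarantees O(N log N)
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }
    // partition around the middle element; elements equal to the pivot
    // go to the right half, so the right half is never empty
    auto pivot = *(first + n / 2);
    RandomIt mid = std::partition(first, last,
        [&](auto const& x) { return x < pivot; });
    // a degenerate split only shrinks the depth budget, never loops forever
    depth_limited_sort(first, mid, depth_limit - 1);
    depth_limited_sort(mid, last, depth_limit - 1);
}
```

A typical invocation would pass a budget proportional to log2(N), e.g. depth_limited_sort(v.begin(), v.end(), 2 * (int)std::log2(v.size())).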


Member

taeguk commented Aug 16, 2017

@biddisco @hkaiser Since this algorithm was originally developed by you, I'd like to hear your thoughts.


Member

hkaiser commented Aug 16, 2017

@taeguk the algorithm was developed by Francisco (@fjtapia). John and I have just adapted it to HPX.


Contributor

biddisco commented Aug 16, 2017

@taeguk - Tell me, what are the rare circumstances where
std::size_t(N) <= sort_limit_per_task
could be false but
depth > recursion_depth_limit
could be true?

Since the list is subdivided on each recursion, surely it always gets shortened (by roughly 1/2). Are you implying that there are cases where the list is split on recursion by some factor that is far from 1/2 - say, only one element goes to one half of the tree and N-1 go to the other?

If that is the case, then yes, changing the switchover point by adding the recursion depth would help.
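To make that degenerate split concrete: the following standalone sketch (not HPX code; naive_quicksort and the global depth counter are hypothetical) uses the classic last-element pivot, so an already-sorted input produces exactly the (N-1, 0) splits described above and the recursion depth grows linearly in N instead of logarithmically.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch: Lomuto-partition quicksort with the last element
// as pivot, tracking the deepest recursion level reached. On sorted
// input the pivot is always the maximum, so every partition is (N-1, 0).
std::size_t max_depth = 0;

void naive_quicksort(std::vector<int>& v, std::ptrdiff_t lo,
                     std::ptrdiff_t hi, std::size_t depth)
{
    if (lo >= hi)
        return;
    max_depth = std::max(max_depth, depth);
    int pivot = v[hi];
    std::ptrdiff_t i = lo;
    for (std::ptrdiff_t j = lo; j < hi; ++j)
        if (v[j] < pivot)
            std::swap(v[i++], v[j]);
    std::swap(v[i], v[hi]);          // pivot lands at its final position
    naive_quicksort(v, lo, i - 1, depth + 1);
    naive_quicksort(v, i + 1, hi, depth + 1);
}
```

On a sorted input of length N this reaches a depth of N-2, which is the situation where a depth limit (and heap-sort fallback) would cap the cost at O(N log N).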


Member

taeguk commented Aug 16, 2017

and (say) only one element goes to one half of the tree and N-1 go to the other?
If that is the case, then yes, changing the switchover point by adding the recursion depth would help.

@biddisco Yes, that is the case I meant. To guarantee O(N log N) time complexity, we have to limit the recursion depth. What concerns me is the performance impact of that limit: do you think performance could decrease because of it?
What should we do - leave the current parallel::sort as it is, or limit the recursion depth to guarantee the time complexity?


Contributor

biddisco commented Aug 17, 2017

My suggestion would be to run the algorithm over a bunch of test datasets and see whether the recursion depth ever significantly exceeds the nominal threshold required for N log N behaviour. If it happens frequently, then yes, add a limit. In my own tests I have not yet encountered problems of that kind (which doesn't mean it can't happen, though).
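That measurement can be prototyped outside HPX in a few lines: sort several dataset shapes with a plain middle-pivot quicksort, record the deepest recursion level reached, and compare it against the nominal 2*log2(N) threshold. Everything below (the name sort_and_measure, the Lomuto partition, the choice of datasets) is an illustrative assumption, not the HPX implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sorts v in place with a middle-pivot quicksort and returns the deepest
// recursion level reached, so different dataset shapes can be compared
// against a 2*log2(N)-style threshold. Illustrative sketch only.
std::size_t sort_and_measure(std::vector<int>& v)
{
    std::size_t deepest = 0;
    auto rec = [&](auto&& self, std::ptrdiff_t lo, std::ptrdiff_t hi,
                   std::size_t depth) -> void
    {
        if (lo >= hi)
            return;
        deepest = std::max(deepest, depth);
        std::swap(v[lo + (hi - lo) / 2], v[hi]);  // middle element as pivot
        int pivot = v[hi];
        std::ptrdiff_t i = lo;
        for (std::ptrdiff_t j = lo; j < hi; ++j)
            if (v[j] < pivot)
                std::swap(v[i++], v[j]);
        std::swap(v[i], v[hi]);
        self(self, lo, i - 1, depth + 1);
        self(self, i + 1, hi, depth + 1);
    };
    rec(rec, 0, static_cast<std::ptrdiff_t>(v.size()) - 1, 0);
    return deepest;
}
```

Interestingly, with this particular two-way partition (strict `<`), an all-equal input already blows far past 2*log2(N) (the depth is linear in N), so "rare" inputs that would benefit from a depth limit do exist for some partition schemes.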


Contributor

biddisco commented Aug 17, 2017

Alternatively - add the extra check and verify that normal tests don't run any more slowly.


fjtapia commented Aug 18, 2017

@hkaiser hkaiser added this to the 1.1.0 milestone Aug 18, 2017


Member

taeguk commented Sep 10, 2017

@fjtapia Thank you for your reply, and sorry for my late response.
I read https://github.com/fjtapia/HPX_sort/blob/master/block_indirect_sort_en.pdf. Your idea looks great!
I would like to adopt your improved parallel sorting algorithm to resolve this issue and get better performance, but I would have to re-implement it in an HPX-idiomatic way.
Before I pursue that, I have to ask: may I use your idea for HPX?


fjtapia commented Sep 16, 2017


Member

taeguk commented Sep 17, 2017

@fjtapia

This, running in a single machine, had a penalty around 10% of the previous version.

1. Does this penalty mean the overhead introduced by porting your original block_indirect_sort to HPX? In other words, is there no such penalty in the original block_indirect_sort?

https://github.com/fjtapia/HPX_sort/blob/8c00d770f3e831f4914196c9386ff930e9c0b1de/include/hpx/parallel/sort/detail/bis/backbone.hpp#L167-L170
2. I think the code above causes frequent thread switching through std::this_thread::yield(), which is very harmful for performance. What do you think?

3. HPX has separate implementations of parallel algorithms for a single machine and for a network (/hpx/parallel/algorithms/ for a single machine, /hpx/parallel/segmented_algorithms for a network). So I think it is better to first implement the algorithm for a single machine only, and then consider running it over a network.

4. How can I view the attached file? I am reading your reply on GitHub, and I cannot find the attachment there.
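On point 2, for what it's worth: a spin loop around std::this_thread::yield() keeps burning an OS thread while it waits. A common alternative in plain C++ is to block the waiter on a condition variable and wake it exactly once (inside HPX, the idiomatic equivalents would be futures or hpx::this_thread::yield, which suspend the lightweight task instead of the OS thread). The completion_latch type below is a hypothetical sketch, not code from HPX_sort.

```cpp
#include <condition_variable>
#include <mutex>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical sketch: a one-shot latch. Workers call arrive() when they
// finish; the waiter blocks in wait() until all of them have arrived,
// instead of polling in a yield loop.
struct completion_latch
{
    std::mutex m;
    std::condition_variable cv;
    int remaining;

    explicit completion_latch(int n) : remaining(n) {}

    void arrive()
    {
        std::lock_guard<std::mutex> lk(m);
        if (--remaining == 0)
            cv.notify_all();   // wake the waiter exactly when work is done
    }

    void wait()
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return remaining == 0; });
    }
};
```

The waiter sleeps in the kernel until notified, so no CPU time is spent rescheduling a thread that has nothing to do.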

@msimberg msimberg removed this from the 1.1.0 milestone Dec 5, 2017


shantanudwvd commented Mar 25, 2018

I want to work on this. Can anyone guide me, as I'm new to this?
