Use Floyd-Rivest selection algorithm instead of std::partial_sort #16825
@@ -3,7 +3,7 @@
5
1 1
2 1
3 4
3 3
Is it because of unspecified "ties" handling?
Yep, likely some other tests will also have this issue
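To illustrate the tie issue discussed above, here is a minimal sketch. The `Row` layout and both function names are hypothetical and are not taken from ClickHouse. When the comparator looks only at the sort key, the C++ standard does not specify which of several equal-key rows ends up inside the limit, so two correct partial-sort implementations can produce different reference output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical row layout: (sort key, payload).
using Row = std::pair<int, int>;

// First `limit` rows ordered by key only, similar to ORDER BY key LIMIT n.
// Rows with equal keys may appear in any order, and which of them makes it
// into the result is unspecified.
std::vector<Row> top_by_key(std::vector<Row> rows, std::size_t limit) {
    limit = std::min(limit, rows.size());
    const auto middle = rows.begin() + static_cast<std::ptrdiff_t>(limit);
    std::partial_sort(rows.begin(), middle, rows.end(),
                      [](const Row& a, const Row& b) { return a.first < b.first; });
    rows.resize(limit);
    return rows;
}

// Same query with the payload as a tie-breaker. The ordering is now total,
// so every correct algorithm must return the same rows.
std::vector<Row> top_by_key_then_payload(std::vector<Row> rows, std::size_t limit) {
    limit = std::min(limit, rows.size());
    const auto middle = rows.begin() + static_cast<std::ptrdiff_t>(limit);
    std::partial_sort(rows.begin(), middle, rows.end());  // pair's operator<
    rows.resize(limit);
    return rows;
}
```

With three rows sharing key 3 and a limit of 3, `top_by_key` always returns keys 1, 2, 3, but the payload of the last row depends on the algorithm. A test that orders by every output column, as `top_by_key_then_payload` does, would not need its reference file changed.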
LGTM
Can we construct any performance tests to validate this method?
I am happy to add more, but I am pretty sure the performance tests already contain enough queries with ORDER BY LIMIT N. I want to see their results first.
@alexey-milovidov is there any chance CI can go through, given the Yandex checks? UPD: Solved by the force tests marker.
Perf tests are here. There is a 2.5% win overall across all tests; for desc data the speedup is 1400%, and more representative benchmarks such as string_sort showed a 15% boost.
One query became 5-10% slower; 15-20 queries became significantly faster, up to 20x.
This is expected: Floyd-Rivest is not much faster when the array contains many equal elements, and it might even be several percent slower. Other than that, everything is really good. I believe I can fix this at some point in the future, and then we will have even better partial sorting.
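For readers unfamiliar with the algorithm, below is a rough sketch of Floyd-Rivest selection used for partial sorting: select the k-th element first, then sort only the prefix. It follows the textbook formulation of SELECT (including the cutoff of 600 and the sampling constants) and is not the ClickHouse or miniselect code. The function names and the restriction to `std::vector<int>` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Textbook Floyd-Rivest SELECT on a[left..right] (inclusive bounds).
// Afterwards a[k] holds the value it would have in sorted order, elements
// before k are <= a[k], and elements after k are >= a[k].
void floyd_rivest_select(std::vector<int>& a, long left, long right, long k) {
    assert(left <= k && k <= right);
    while (right > left) {
        if (right - left > 600) {
            // Narrow the range: select within a small window around k first,
            // so that a[k] becomes a good pivot for the whole range.
            const double n = static_cast<double>(right - left + 1);
            const double i = static_cast<double>(k - left + 1);
            const double z = std::log(n);
            const double s = 0.5 * std::exp(2.0 * z / 3.0);
            const double sign = (i - n / 2.0 < 0) ? -1.0 : 1.0;
            const double sd = 0.5 * std::sqrt(z * s * (n - s) / n) * sign;
            const long new_left =
                std::max(left, static_cast<long>(k - i * s / n + sd));
            const long new_right =
                std::min(right, static_cast<long>(k + (n - i) * s / n + sd));
            floyd_rivest_select(a, new_left, new_right, k);
        }
        // Partition a[left..right] around the pivot value t = a[k].
        const int t = a[k];
        long i = left;
        long j = right;
        std::swap(a[left], a[k]);
        if (a[right] > t) std::swap(a[right], a[left]);
        while (i < j) {
            std::swap(a[i], a[j]);
            ++i;
            --j;
            while (a[i] < t) ++i;
            while (a[j] > t) --j;
        }
        if (a[left] == t) {
            std::swap(a[left], a[j]);
        } else {
            ++j;
            std::swap(a[j], a[right]);
        }
        // Keep only the side that still contains index k.
        if (j <= k) left = j + 1;
        if (k <= j) right = j - 1;
    }
}

// Partial sort as "select, then sort the prefix": afterwards a[0..k) holds
// the k smallest values in ascending order; the rest is in unspecified order.
void partial_sort_via_select(std::vector<int>& a, std::size_t k) {
    if (k == 0 || a.empty()) return;
    k = std::min(k, a.size());
    floyd_rivest_select(a, 0, static_cast<long>(a.size()) - 1,
                        static_cast<long>(k) - 1);
    std::sort(a.begin(), a.begin() + static_cast<std::ptrdiff_t>(k));
}
```

Note that the inner scans stop on elements equal to the pivot, so runs of equal elements are swapped back and forth rather than skipped. That is one plausible reason for the smaller gain on data with many duplicates mentioned above.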
@alexey-milovidov you can merge. The tests from 435f410 are good; after that I fixed the performance test, which I checked locally, so it should be good.
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Use the Floyd-Rivest algorithm; it should be the best fit for the ClickHouse use case of partial sorting. Benchmarks are in https://github.com/danlark1/miniselect and here