Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Floyd-Rivest selection algorithm instead of std::partial_sort #16825

Merged
merged 16 commits into from Nov 10, 2020
Merged

Conversation

danlark1
Copy link
Contributor

@danlark1 danlark1 commented Nov 9, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Use Floyd-Rivest algorithm, it should be the best for the ClickHouse use case of partial sorting. Bechmarks are in https://github.com/danlark1/miniselect and here

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Nov 9, 2020
@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Nov 9, 2020
@@ -3,7 +3,7 @@
5
1 1
2 1
3 4
3 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it because of unspecified "ties" handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, likely some other tests will also have this issue

Copy link
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alexey-milovidov alexey-milovidov self-assigned this Nov 9, 2020
@amosbird
Copy link
Collaborator

Can we construct any performance tests to validate this method?

@danlark1
Copy link
Contributor Author

danlark1 commented Nov 10, 2020

Can we construct any performance tests to validate this method?

I am happy to add more, however I am pretty sure there are enough queries with ORDER BY LIMIT N in the performance tests. I want to see them first

@danlark1
Copy link
Contributor Author

danlark1 commented Nov 10, 2020

@alexey-milovidov any chances ci can go through due to Yandex checks? UPD: Solved by force tests marker

@akuzm akuzm added the force tests Force test ignoring fast test output. label Nov 10, 2020
@danlark1
Copy link
Contributor Author

danlark1 commented Nov 10, 2020

Perf tests are here. 2.5% win overall for all tests, for desc data the speedup in 1400%, some more representative benchmarks as string_sort showed 15% boost

0.350 | 0.301 | -1.163x | -0.141 | 0.140 | string_sort | 7 | SELECT Referer FROM hits_100m_single ORDER BY Referer LIMIT 300 format Null
0.435 | 0.376 | -1.154x | -0.134 | 0.133 | string_sort | 8 | SELECT Title FROM hits_100m_single ORDER BY Title LIMIT 300 format Null
0.488 | 0.423 | -1.154x | -0.134 | 0.133 | string_sort | 50 | SELECT Title FROM hits_100m_single ORDER BY Title, CounterID LIMIT 300 format Null

@danlark1
Copy link
Contributor Author

danlark1 commented Nov 10, 2020

One query became 5-10% slower, 15-20 became significantly faster, up to 20x

SELECT MobilePhoneModel, MobilePhoneModel FROM hits_100m_single ORDER BY MobilePhoneModel, MobilePhoneModel LIMIT 10 FORMAT Null

It is expected as Floyd-Rivest is not performing way faster when there are many equal elements in the array, it might be several percent worse. Other than that everything is really good. I believe somewhen in the future I can fix this issue and we will have even better partial sorting.

@danlark1
Copy link
Contributor Author

@alexey-milovidov you can merge, the tests from 435f410 are good and after that I fixed the performance test which I checked locally, it should be good

@alexey-milovidov alexey-milovidov merged commit 41dc55c into ClickHouse:master Nov 10, 2020
3 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
force tests Force test ignoring fast test output. pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants