Feature: best effort sorting to optimize compression #4413
Comments
Maybe you've reinvented the primary key.
@den-crane Sorry, but I don't see how the primary key helps with the use case of dropping columns that are not the last one in the table. Your example just modifies the last column, which is fine with the ORDER BY restrictions. Try dropping the country column in your example.
When trying to remove it from the ORDER BY (assuming PRIMARY KEY didn't include country):
The problem is that both PRIMARY KEY and ORDER BY give hard guarantees on the order of data, which prevents one from changing columns unless it's the last column. This guarantee is especially important for table engines that actually merge rows, like the Summing or Collapsing ones. They wouldn't merge rows that they should if you dropped a column from the middle of an ORDER BY expression, because the columns after the dropped one no longer adhere to the new sorting order. Another merge/sort pass would be required for all data parts. In theory this shouldn't be a big issue, because splitting and merging parts is a background process anyway, and one can't expect all rows in the table to be properly sorted at SELECT time. Maybe another solution would be to allow dropping columns from the middle of the ORDER BY expression and mark parts as "dirty" (needing a merge), even if it's just one?
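The adjacency problem described above can be illustrated outside ClickHouse. This is a toy sketch with made-up data, not engine internals: rows sorted by an old key (country, city, day) are generally not sorted by (country, day) once the middle column is dropped, so a merge that only collapses adjacent equal keys would miss rows.

```python
# Rows sorted under the old ORDER BY (country, city, day).
rows = [
    ("DE", "Berlin", 3),
    ("DE", "Hamburg", 1),
    ("DE", "Munich", 3),
]
assert rows == sorted(rows)  # sorted under the old key

# Drop the middle column "city": the remaining (country, day) tuples
# are no longer sorted, so the two rows with key ("DE", 3) are not
# adjacent and an adjacency-based merge would not combine them.
reduced = [(country, day) for country, _, day in rows]
print(reduced == sorted(reduced))  # False: another sort/merge pass needed
```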
@alexey-milovidov Hi, could you please post a comment on why you decided to mark this as wontfix? Is there some other way to achieve the stated goal? Thanks.
If I understand correctly, the proposed idea is to reorder data (keeping the ORDER BY key and reordering within each group of records with the same key) in every inserted block so that it will be compressed better. This is a good idea. Let's think about how we can implement it. Also, there was another proposal: to allow dropping columns from ORDER BY if these columns are not in the PRIMARY KEY.
I'm trying to come up with some algorithm for best-effort reordering. The current idea is to extract all 4-byte sequences (four-grams) from every record in the inserted block and put them into a hash table:
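A rough sketch of that counting step, paired with a hypothetical greedy pass that chains records by shared four-grams so similar rows end up adjacent. None of this is actual ClickHouse code; the byte serialization and the greedy heuristic are assumptions for illustration.

```python
def four_grams(record: bytes) -> set:
    """All overlapping 4-byte sequences in a serialized record."""
    return {record[i:i + 4] for i in range(len(record) - 3)}

def greedy_reorder(records):
    """Greedily chain records: always append the remaining record that
    shares the most four-grams with the current tail (hypothetical
    heuristic; quadratic, so only a sketch)."""
    remaining = list(records)
    ordered = [remaining.pop(0)]
    while remaining:
        tail = four_grams(ordered[-1])
        best = max(remaining, key=lambda r: len(tail & four_grams(r)))
        remaining.remove(best)
        ordered.append(best)
    return ordered

# Rows sharing "region=eu" should end up next to each other.
block = [b"host=a,region=eu", b"host=b,region=us", b"host=c,region=eu"]
reordered = greedy_reorder(block)
print(reordered)
```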
This generally makes sense. Thoughts:
Related paper: https://arxiv.org/abs/1207.2189
This paper is a good reference: Lemire, D., Kaser, O., & Gutarra, E. (2012). Reordering rows for better compression: Beyond the lexicographic order. ACM Transactions on Database Systems (TODS), 37(3), 1-29. https://arxiv.org/abs/1207.2189 But this one might be easier: Lemire, D., & Kaser, O. (2011). Reordering columns for smaller indexes. Information Sciences, 181(12), 2550-2570. https://arxiv.org/abs/0909.1346 The result is super simple...
This was a known result and it is intuitive, but we formalized it. It suggests that, often, you can get good compression by sorting on the columns that have few distinct values. It is the opposite of sorting on the primary key...
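The low-cardinality-first effect is easy to reproduce with a generic compressor. This is a toy sketch with synthetic data; zlib stands in for a real columnar codec, and the column names are made up.

```python
import random
import zlib

random.seed(0)
# Synthetic table: "flag" has 2 distinct values, "id" is nearly unique.
rows = [(random.choice(["y", "n"]), f"id{random.randrange(10_000):05d}")
        for _ in range(5_000)]

def compressed_size(ordered_rows) -> int:
    data = "\n".join(",".join(r) for r in ordered_rows).encode()
    return len(zlib.compress(data))

# Sorting on the low-cardinality column first groups equal values into
# long runs, which generic compressors exploit; sorting on the
# near-unique column first leaves the flag column as random noise.
by_low_card = compressed_size(sorted(rows, key=lambda r: (r[0], r[1])))
by_high_card = compressed_size(sorted(rows, key=lambda r: (r[1], r[0])))
print(by_low_card, "<", by_high_card)
```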
A similar technique is mentioned here https://www.influxdata.com/blog/influxdb-3-0-system-architecture/
It is definitely an established technique.
Use case:
A MergeTree table with free modification of columns (add/drop) can't make good use of an ORDER BY sorting expression to sort rows so that compression works well. For example, dropping column2 from a table with an ORDER BY expression of (column1, column2, column3) does not seem possible currently. One can only drop the last column of the sorting expression. This makes sense, as dropping a column from the middle would leave the data in the columns after the dropped one improperly sorted.
If the ORDER BY expression does not contain the columns, then dropping them works.
This would be an optimization for tables which collect metrics by dimensions which might change every now and then.
Proposed solution:
A setting that allows the MergeTree table engine to sort rows by an implicit expression containing all columns that are not already in the sorting expression, but only on a best-effort basis. The sorting would be performed during insertion and merging. E.g., if a table has columns column1 ... column4 with the explicit strong sorting expression (column1, column2), this setting would add an implicit soft sorting expression (column1, column2, column3, column4). The ordering of column3 and column4 is not guaranteed.
For tables with many columns, sorting by all of them might not be efficient, so explicitly specifying the columns of the soft sort expression might be preferable. ALTER queries could then modify the columns in the best-effort sorting expression freely, while keeping the current restrictions on the ORDER BY expression.
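A minimal sketch of the proposed soft key, assuming hypothetical column names and a per-inserted-block sort; this does not mirror real MergeTree internals. The hard ORDER BY prefix keeps its guarantee, while the appended suffix is best-effort and could be altered freely.

```python
hard_key = ["column1", "column2"]                      # explicit ORDER BY
all_columns = ["column1", "column2", "column3", "column4"]

def soft_sort_key(columns, hard):
    """Hard ORDER BY prefix plus all remaining columns as a soft suffix."""
    return hard + [c for c in columns if c not in hard]

def sort_block(block, columns, hard):
    """Sort one inserted block of row tuples by the implicit soft key."""
    key = soft_sort_key(columns, hard)
    idx = [columns.index(c) for c in key]
    return sorted(block, key=lambda row: tuple(row[i] for i in idx))

block = [(1, "b", 9, 0), (1, "a", 3, 1), (1, "a", 1, 2)]
print(sort_block(block, all_columns, hard_key))
# Dropping column3 later would only change the soft suffix; the hard
# prefix (column1, column2) keeps its guarantee, so no ALTER
# restriction would need to apply to the suffix columns.
```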
Example:
Dropping the column domain would be allowed.
Workaround considered:
One can, of course, create a new table with only column1 and column2 and then insert the data from the old table into it, but that gets painful for large tables.
Comment:
I am not sure if my proposed solution is ideal. Maybe there are easier ways to improve the compression of data without requiring strict sorting with its limitations.