Feature: best effort sorting to optimize compression #4413

Open
arctica opened this issue Feb 15, 2019 · 10 comments
@arctica

arctica commented Feb 15, 2019

Use case:

A MergeTree table with free modification of columns (add/drop) can't make good use of an ORDER BY sorting expression to sort rows so that compression works well. For example, dropping column2 from a table which has an ORDER BY expression of (column1, column2, column3) does not seem possible currently. One can only drop the last column of the sorting expression. This makes sense, as dropping a column from the middle would result in the data of the columns after the dropped one no longer being properly sorted.
If the ORDER BY expression does not contain the columns, then dropping them works.

This would be an optimization for tables that collect metrics by dimensions which change every now and then.

Proposed solution:

A setting that would allow the MergeTree table engine to sort rows by an implicit expression containing all columns that are not already in the sorting expression, but only on a best-effort basis. The sorting would be performed during insertion of data and during merges. E.g. if a table has columns column1 ... column4 with the explicit strong sorting expression (column1, column2), then this setting would add an implicit soft sorting expression (column1, column2, column3, column4). The ordering of column3 and column4 is not guaranteed.

For tables with many columns, sorting by all of them might not be efficient, so explicitly specifying the columns of the soft sort expression might be preferable. ALTER queries would allow the columns in the best-effort sorting expression to be modified freely, while keeping the current restrictions for the ORDER BY expression.

Example:

CREATE TABLE metrics (date Date, country String, domain String, browser String, hits UInt64, bandwidth UInt64)
ORDER BY (date, country)
SOFTORDER BY (date, country, domain, browser)

Dropping the column domain would be allowed.

Workaround considered:

One can of course create a new table with only column1 and column2 and then insert the data from the old table into it, but that gets painful for large tables.

Comment:

I am not sure if my proposed solution is ideal. Maybe there are easier ways to improve the compression of data without requiring strict sorting with its limitations.

@den-crane
Contributor

den-crane commented Feb 15, 2019

Maybe you've reinvented the primary key:
https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key

CREATE TABLE metrics
(
    date Date,
    country String,
    domain String,
    browser String,
    hits UInt64,
    bandwidth UInt64
)
ENGINE = MergeTree
PRIMARY KEY (date, country)
ORDER BY (date, country, domain);

ALTER TABLE metrics
    MODIFY ORDER BY (date, country);

ALTER TABLE metrics
    ADD COLUMN
    test String,
    MODIFY ORDER BY (date, country, test)

@arctica
Author

arctica commented Feb 16, 2019

@den-crane Sorry, but I don't see how the primary key helps with the use case of dropping columns that are not the last one in the sorting key. Your example just modifies the last column, which is fine with the ORDER BY restrictions. Try dropping the country column in your example:

Code: 44. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: ALTER of key column country must be metadata-only.

When trying to remove it from the ORDER BY (assuming PRIMARY KEY didn't include country):

Code: 36. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: Existing column domain is used in the expression that was added to the sorting key. You can add expressions that use only the newly added columns.

The problem is that both PRIMARY KEY and ORDER BY give hard guarantees on the order of data, which prevents one from changing columns unless it's the last column. This guarantee is especially important for table engines that actually merge rows, like the Summing or Collapsing ones. They wouldn't merge rows that they should if you dropped a column from the middle of an ORDER BY expression, because columns after the dropped one don't adhere to the new sorting order. Another merge/sort pass would be required for all data parts.

In theory this shouldn't be a big issue, because splitting into and merging parts is a background process anyway, and one can't expect all rows in the table to be properly sorted at SELECT time. Maybe another solution would be to allow dropping columns from the middle of the ORDER BY expression and mark parts as "dirty" (needing a merge), even if it's just one?

@alexey-milovidov alexey-milovidov added the st-wontfix (Known issue, no plans to fix it currently) label Jun 24, 2023
@arctica
Author

arctica commented Jun 27, 2023

@alexey-milovidov Hi, could you please comment on why you decided to mark this as wontfix? Is there some other way to achieve the stated goal? Thanks.

@alexey-milovidov alexey-milovidov removed the st-wontfix (Known issue, no plans to fix it currently) label Jun 28, 2023
@alexey-milovidov
Member

If I understand correctly, the proposed idea is to reorder data (keeping the ORDER BY key and reordering within each group of records for the same key) in every inserted block so that it will be compressed better.

This is a good idea.

Let's think about how we can implement it.

Also, there was another proposal - to allow dropping columns from ORDER BY if these columns are not in the PRIMARY key.

@alexey-milovidov
Member

I'm trying to come up with some algorithm for best-effort reordering.

The current idea is to extract all 4-byte sequences (four-grams) from every record in the inserted block and put them into a hash table: fourgram -> [rows...]. Then take the first record and find a subsequent record with the maximum number of common four-grams. Then repeat, but keep a limited window of four-grams from the previous rows to search for the best subsequent record.
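
A minimal sketch of this greedy idea, assuming rows are already serialized to bytes; the function names, the fixed window parameter, and the naive linear scan over remaining rows (instead of the fourgram -> [rows...] hash table lookup) are all illustrative:

def fourgrams(row: bytes) -> set:
    # all 4-byte sequences (four-grams) contained in the serialized row
    return {row[i:i + 4] for i in range(len(row) - 3)}

def reorder(rows: list, window: int = 8) -> list:
    if not rows:
        return rows
    grams = [fourgrams(r) for r in rows]
    order = [0]  # take the first record as the starting point
    remaining = set(range(1, len(rows)))
    while remaining:
        # four-grams of the last `window` placed rows
        recent = set().union(*(grams[i] for i in order[-window:]))
        # greedily pick the record sharing the most four-grams with that window
        best = max(remaining, key=lambda i: len(grams[i] & recent))
        remaining.remove(best)
        order.append(best)
    return [rows[i] for i in order]

A real implementation would use the fourgram -> [rows...] hash table to enumerate candidate rows instead of the quadratic scan above.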

@rschu1ze
Member

This generally makes sense. Thoughts:

  • Reordering would proceed iteratively from column to column. Outside the primary key and ORDER BY, we can choose the next best column freely. We are also free to select the first column to start reordering. (I use the terms "column" and "block" interchangeably.)
  • The columns cannot be re-sorted independently; this creates a combinatorial explosion of the search space that can only be addressed with heuristics.
  • A "good" column to choose is one which has only a few distinct values, or even a single one, inside the section "fixed" by the previously chosen column. In contrast, a high cardinality reduces opportunities for subsequent columns.
  • By the same argument, sorting-for-compression will usually be at odds with the primary key or ORDER BY, as these are, by definition, normally created on highly distinct columns.
  • Similarly, an extreme use case would be no primary key or ORDER BY at all: the whole table could then be sorted-for-compression.
  • This raises the (orthogonal) question of how the reordering should look in order to maximize compression. Hard to say for generic compression methods, but long runs of the same value (with the special case of a long common prefix) should do the trick (RLE is a special case of LZ4). (For light-weight compression, the answer is potentially different.)
  • Once the column order is found, re-sorting can be done using Alexey's idea or something even more stupid, like moving the rows with the most frequent value to the beginning, splitting the current column into a "prefix" section and an "other" section, then going on to the next column and re-sorting the rows in the previous "prefix" section using the same heuristic, creating a new and smaller "prefix" section, and so on (see the sketch after this list).
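
A minimal sketch of that last "prefix section" heuristic, under the assumption that rows are tuples and free_cols lists the indices of the columns outside the sorting key; all names are hypothetical:

from collections import Counter

def prefix_sort(rows: list, free_cols: list,
                start: int = 0, end: int | None = None) -> None:
    # Reorder rows[start:end] in place: move the rows holding the most
    # frequent value of the first free column to the front ("prefix"),
    # then re-sort that smaller prefix section by the next column.
    if end is None:
        end = len(rows)
    if not free_cols or end - start <= 1:
        return
    col = free_cols[0]
    section = rows[start:end]
    top = Counter(r[col] for r in section).most_common(1)[0][0]
    prefix = [r for r in section if r[col] == top]
    other = [r for r in section if r[col] != top]
    rows[start:end] = prefix + other
    # recurse into the new, smaller "prefix" section with the next column
    prefix_sort(rows, free_cols[1:], start, start + len(prefix))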

@aadant

aadant commented Feb 13, 2024

Related paper: https://arxiv.org/abs/1207.2189

@Article{DBLP:journals/corr/abs-1207-2189,
  author     = {Daniel Lemire and Owen Kaser and Eduardo Gutarra},
  title      = {Reordering Rows for Better Compression: Beyond the Lexicographic Order},
  journal    = {CoRR},
  volume     = {abs/1207.2189},
  year       = {2012},
  url        = {http://arxiv.org/abs/1207.2189},
  eprinttype = {arXiv},
  eprint     = {1207.2189},
  timestamp  = {Mon, 13 Aug 2018 16:46:28 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1207-2189.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}

@lemire

lemire commented Feb 19, 2024

@alexey-milovidov @arctica

This paper is a good reference...

Lemire, D., Kaser, O., & Gutarra, E. (2012). Reordering rows for better compression: Beyond the lexicographic order. ACM Transactions on Database Systems (TODS), 37(3), 1-29. https://arxiv.org/abs/1207.2189

but this one might be easier:

Lemire, D., & Kaser, O. (2011). Reordering columns for smaller indexes. Information Sciences, 181(12), 2550-2570. https://arxiv.org/abs/0909.1346

The result is super simple...

For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing cardinality

This was a known result and it is intuitive, but we formalized it. It suggests that, often, you can get good compression by sorting on the columns that have few distinct values. It is the opposite of sorting on the primary key...
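
As an illustration, a minimal sketch (with hypothetical names) of applying that result to a block of rows: order the columns by increasing distinct-value count, then sort the rows lexicographically in that column order:

def sort_low_cardinality_first(rows: list) -> list:
    if not rows:
        return rows
    ncols = len(rows[0])
    # number of distinct values per column
    card = [len({r[c] for r in rows}) for c in range(ncols)]
    col_order = sorted(range(ncols), key=lambda c: card[c])
    # lexicographic sort, least distinct columns first, to maximize run lengths
    return sorted(rows, key=lambda r: tuple(r[c] for c in col_order))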

@aadant

aadant commented Feb 21, 2024

A similar technique is mentioned here

https://www.influxdata.com/blog/influxdb-3-0-system-architecture/

"the ingester finds and picks the least cardinality columns for the sort order of the sort mentioned above. As a result, the size of the file is often 10-100x smaller than its raw form."

@lemire

lemire commented Feb 21, 2024

It is definitely an established technique.
