Column-level compression block sizes #55201
Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
@alexey-milovidov @nikitamikhaylov can I get some reviews for this feature? :D We have tested it on our prod table.
-- Compression ratio
┌─table─────────────┬─count()─┬─compressed_sz─┬─uncompressed_sz─┬──────────────ratio─┐
│ xxxx_html_local   │      14 │ 228.20 GiB    │ 3.43 TiB        │ 15.385512604598656 │
│ xxxx_html_local2  │      12 │ 226.07 GiB    │ 3.42 TiB        │ 15.504667251480628 │
└───────────────────┴─────────┴───────────────┴─────────────────┴────────────────────┘
-- SELECT * on the original table with min_compress_block_size = 64MB and max_compress_block_size = 64MB
-- set at table level
SELECT * EXCEPT xxxx_html
FROM xxxx_html_local
WHERE _partition_id = '9-4-0'
SETTINGS max_threads = 16
FORMAT `Null`
Query id: 4304bfcd-a3e4-4d95-b5fa-96becee33ad0
Ok.
0 rows in set. Elapsed: 1.105 sec. Processed 5.53 million rows, 725.27 MB (5.00 million rows/s., 656.11 MB/s.)
Peak memory usage: 7.68 GiB.
-- SELECT * on the new table with min_compress_block_size = 64MB and max_compress_block_size = 64MB
-- set at column level, on `xxxx_html` only
SELECT * EXCEPT xxxx_html
FROM xxxx_html_local2
WHERE _partition_id = '9-4-0'
SETTINGS max_threads = 16
FORMAT `Null`
Query id: 55e7290d-a6ef-4a96-badd-7569f30fb409
Ok.
0 rows in set. Elapsed: 0.172 sec. Processed 5.53 million rows, 719.40 MB (32.19 million rows/s., 4.19 GB/s.)
Peak memory usage: 33.01 MiB.
If we don't tune the table compress block size, the compression ratio (with default settings) is only ~5.x.
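For readers who want to reproduce a table like the compression-ratio one above, here is a minimal sketch against `system.parts`. It is an assumption that the original figures were gathered this way (the thread does not show the query), and `count()` is assumed to be the number of active parts.

```sql
-- Sketch only: per-table compressed/uncompressed size and ratio,
-- aggregated over active data parts.
SELECT
    table,
    count() AS parts,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_sz,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_sz,
    sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS ratio
FROM system.parts
WHERE active
  AND table IN ('xxxx_html_local', 'xxxx_html_local2')
GROUP BY table
ORDER BY table;
```

As a sanity check on the reported numbers, 3.43 TiB / 228.20 GiB is roughly 3512 / 228.2, or about 15.4, which agrees with the first row.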
@canhld94 The default compress block size is from 64 KB to 1 MB, and it is strange that it could lead to a 7 GB difference in memory usage. Is it possible that you have also changed the defaults? I don't think it's ever necessary to increase the compress block size. It is one of the "factory" settings that are not expected to be changed.
@alexey-milovidov Maybe the example was not clear. With the default compress block size, memory consumption is normal, but the compression ratio is not good. Previously, we need to increase table level
In our use case, the table has a big string column (e.g. the whole HTML source of a website). If we use the default compression block size, the compression ratio is around 5-6, which is too low for our needs.
@alexey-milovidov I've revised the example in my previous comment as well. I hope it is clearer now.
@alexey-milovidov we have tables where one column is a big string (100 KB on average) and the other columns hold far less data. Setting min_compress_block_size to 64MB for the string column nearly doubles the compression ratio. But if min_compress_block_size is applied to the whole table, all other columns also use 64MB compress blocks, which slows down select queries and makes them require more memory. The solution is to apply min_compress_block_size to that one column only, and it works well in our fork (high compression ratio AND fast queries AND lower memory usage).
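To see whether a table fits this "one big string column" pattern, per-column sizes can be inspected in `system.columns`. This is a hedged sketch, not something taken from the thread; the table name is the anonymized one from the example above, and the current database is assumed.

```sql
-- Sketch only: list columns by on-disk size to find the one that
-- dominates storage and would benefit from larger compress blocks.
SELECT
    name,
    type,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'xxxx_html_local'
  AND data_compressed_bytes > 0
ORDER BY data_compressed_bytes DESC;
```

If a single `String` column accounts for most of the compressed bytes while the rest are small, that column is the natural candidate for its own block-size setting.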
Resolve conflicts Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
Upgrade check #57893
@alexey-milovidov we changed the syntax to declare compress block size as parameters of
CREATE TABLE t
(
    ...
    big_column String CODEC(16384, 16384)(ZSTD(9, 24)),
)
ENGINE = MergeTree ORDER BY tuple();
This change is backward incompatible, so I also added a query setting
Honestly, I preferred the old syntax: it's more self-describing, and it also allows supporting more per-column settings (low_cardinality or potential dictionary support for ZSTD, I'm looking at you).
BTW, there is another potential syntax option (inspired by YDB):
@UnamedRus yes, the old syntax is more declarative and more generic, but its scope is beyond the main purpose of this PR (to have an explicit compress block size for each column). For now we want to get this PR upstream first. Regarding column-level settings, it's definitely a needed feature, but different people will prefer different syntax and we may need a lot of discussion. I still advocate my previously proposed syntax and will try to push it upstream:
COLUMN TYPE ATTRIBUTES SETTINGS (<list of settings>),
But it'll be in another issue and PR.
@canhld94 @UnamedRus After reading this PR, #54821 and #36428, I think there is some value in per-column min/max block sizes when the columns have very different average byte sizes per value (the "big string column" use case), and I would like to help get this merged. Settings in ClickHouse come in global form (configured via cfg file), session/query form (
Re syntax: We should strive for maximum consistency.
I like that. Is there perhaps a PR already? I am afraid that if we implement
I like the syntax:
Let's finish this PR and merge...
Closes #54821
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Certain settings (currently min_compress_block_size and max_compress_block_size) can now be specified at column level, where they take precedence over the corresponding table-level setting. Example:
CREATE TABLE tab (col String SETTINGS (min_compress_block_size = 81920, max_compress_block_size = 163840)) ENGINE = MergeTree ORDER BY tuple();
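Applied to the use case discussed in the thread, the merged syntax might be used as in the sketch below. The table and column names are hypothetical, and the 64 MiB value (67108864 bytes) is taken from the block size mentioned in the conversation rather than from any recommendation; only the `SETTINGS (...)` clause form comes from the changelog entry.

```sql
-- Sketch only: large compress blocks for the big HTML column,
-- defaults for every other column.
CREATE TABLE pages_sketch
(
    url String,
    fetched_at DateTime,
    html String SETTINGS (min_compress_block_size = 67108864, max_compress_block_size = 67108864)
)
ENGINE = MergeTree
ORDER BY url;
```

With this layout, a query such as `SELECT * EXCEPT html FROM pages_sketch` should only touch columns stored with the default block sizes, which is the effect the benchmark earlier in the thread was meant to demonstrate.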
Documentation entry for user-facing changes