You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, there is optimize_aggregation_in_order setttings. This settings is great, but it requires the following :
it requires reading the primary keys fields, which can be big and slow to read and compare, and aggregate. I understand it is required to merge sort input block, but as far as i understand, it is not strictly required to the optimization, for instance you could only sort block using primary key index, and finalize aggregations at block boundaries to save on memory.
Locally, we have forked the hashing function to have some injective version to be able to
GROUP BY hash(primary_key1), hash(primary_key2)
And use optimisation, which saves on memory and slow aggregation with large string, but it does not save on the reading/merge sort part.
it does not support holes and hidden relation in schema fields. On the following table :
CREATE TABLE test
(
prefix_hash UInt64,
key String,
prefix_hash2 UInt64,
key2 String,
metric UInt64
)
ORDER BY prefix_hash, key, prefix_hash2, key2
where prefix_hash is the hash of the first logical part of key if you try
SELECT sum(metric) FROM test GROUP BY prefix_hash, prefix_hash2
in theory optimize_aggregation_in_order can work, because if prefix differ then it means aggregation also must differ, but the optimizer cannot know that. The primary key could be reordered but then ordering for others queries is lost.
Ideally i want to use
CREATE TABLE test
(
prefix_hash UInt64,
key String,
prefix_hash2 UInt64,
key2 String,
full_hash UInt64,
metric UInt64
)
ORDER BY prefix_hash, key, prefix_hash2, key2
with full_hash = hash(key, key2) and GROUP BY full_hash, but this requires full aggregation and is very memory hungry (and slow due to finalizing aggregation on tens of gigabytes of keys)
As far as i understand, it is not possible to read in "semi_order" currently using only index to order blocks instead of full mergeSort ? and there is no way to explicitly ask to flush aggregations when some key change by some clever query rewriting ?
The text was updated successfully, but these errors were encountered:
I'm thinking about some misc function which can tell ClickHouse about possible replacements:
ie:
assumeGroupBy(prefix_hash, (prefix_hash, key))prefix_hash is equivalent to prefix_hash, key in GROUP BY assumeOrderBy(prefix_hash, key)prefix_hash is equivalent to prefix_hash, key in ORDER BY
Currently, there is optimize_aggregation_in_order setttings. This settings is great, but it requires the following :
Locally, we have forked the hashing function to have some injective version to be able to
And use optimisation, which saves on memory and slow aggregation with large string, but it does not save on the reading/merge sort part.
where prefix_hash is the hash of the first logical part of key if you try
in theory optimize_aggregation_in_order can work, because if prefix differ then it means aggregation also must differ, but the optimizer cannot know that. The primary key could be reordered but then ordering for others queries is lost.
with
full_hash = hash(key, key2)
andGROUP BY full_hash
, but this requires full aggregation and is very memory hungry (and slow due to finalizing aggregation on tens of gigabytes of keys)As far as i understand, it is not possible to read in "semi_order" currently using only index to order blocks instead of full mergeSort ? and there is no way to explicitly ask to flush aggregations when some key change by some clever query rewriting ?
The text was updated successfully, but these errors were encountered: