
Finalize aggregations at field boundaries by taking advantages of ordering #22072

Open
jorisgio opened this issue Mar 24, 2021 · 1 comment
Comments

@jorisgio
Contributor

Currently, there is the optimize_aggregation_in_order setting. This setting is great, but it has the following limitations:

  • it requires reading the primary key fields, which can be big and slow to read, compare, and aggregate. I understand this is required to merge-sort the input blocks, but as far as I understand it is not strictly required for the optimization; for instance, you could sort blocks using only the primary key index and finalize aggregations at block boundaries to save memory.
    Locally, we have forked the hashing function to get an injective version, to be able to write
GROUP BY hash(primary_key1), hash(primary_key2)

and use the optimization, which saves memory and avoids slow aggregation over large strings, but it does not save on the reading/merge-sort part.
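A sketch of that workaround (injectiveHash is a hypothetical stand-in for our forked hash function; any hash known to be injective on the key columns would do):

```sql
-- injectiveHash is hypothetical: a forked hash guaranteed to be injective,
-- so grouping by it partitions rows exactly as grouping by the originals.
-- Aggregation state then holds fixed-width UInt64 keys instead of large
-- strings, but the primary key columns are still read and merge-sorted.
SELECT
    injectiveHash(primary_key1) AS k1,
    injectiveHash(primary_key2) AS k2,
    sum(metric)
FROM test
GROUP BY k1, k2
SETTINGS optimize_aggregation_in_order = 1
```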

  • it does not support holes and hidden relations between schema fields. Consider the following table:
  CREATE TABLE test
   (
      prefix_hash UInt64,
      key String,
      prefix_hash2 UInt64,
      key2 String,
      metric UInt64
   )
   ORDER BY prefix_hash, key, prefix_hash2, key2

where prefix_hash is the hash of the first logical part of key. If you try

SELECT sum(metric) FROM test GROUP BY prefix_hash, prefix_hash2

in theory optimize_aggregation_in_order could work, because if the prefixes differ then the aggregation keys must also differ, but the optimizer cannot know that. The primary key could be reordered, but then the ordering for other queries is lost.
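To make the limitation concrete, a hedged sketch against the table above:

```sql
-- Benefits from in-order aggregation: the GROUP BY keys form a prefix
-- of ORDER BY (prefix_hash, key, prefix_hash2, key2).
SELECT sum(metric)
FROM test
GROUP BY prefix_hash, key
SETTINGS optimize_aggregation_in_order = 1;

-- Falls back to full aggregation: prefix_hash2 skips over key, so the
-- keys are not a prefix of ORDER BY, even though a change in prefix_hash
-- implies a change in key (the hidden relation the optimizer cannot see).
SELECT sum(metric)
FROM test
GROUP BY prefix_hash, prefix_hash2
SETTINGS optimize_aggregation_in_order = 1;
```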

  • Ideally I want to use
 CREATE TABLE test
  (
     prefix_hash UInt64,
     key String,
     prefix_hash2 UInt64,
     key2 String,
     full_hash UInt64,
     metric UInt64
  )
  ORDER BY prefix_hash, key, prefix_hash2, key2

with full_hash = hash(key, key2) and GROUP BY full_hash, but this requires full aggregation and is very memory-hungry (and slow, due to finalizing aggregation over tens of gigabytes of keys).
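For reference, the full-aggregation query this implies:

```sql
-- full_hash is not a prefix of the table's ORDER BY, so in-order
-- aggregation cannot apply: every distinct full_hash stays in the
-- hash table until the end, which is what makes this memory-hungry.
SELECT full_hash, sum(metric)
FROM test
GROUP BY full_hash
```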

As far as I understand, it is currently not possible to read in "semi-order", using only the index to order blocks instead of a full merge sort? And there is no way to explicitly ask to flush aggregations when some key changes, via some clever query rewriting?

@jorisgio jorisgio added the question Question? label Mar 24, 2021
@melvynator melvynator self-assigned this Mar 7, 2022
@UnamedRus
Contributor

I'm thinking about some misc functions which could tell ClickHouse about possible replacements:

e.g.:

assumeGroupBy(prefix_hash, (prefix_hash, key)): prefix_hash is equivalent to (prefix_hash, key) in GROUP BY
assumeOrderBy(prefix_hash, key): prefix_hash is equivalent to (prefix_hash, key) in ORDER BY
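A sketch of how such a hint might look in a query. assumeGroupBy is purely hypothetical and does not exist in ClickHouse; the assumed semantics are that the optimizer may treat GROUP BY prefix_hash as equivalent to GROUP BY prefix_hash, key, and can therefore finalize groups whenever prefix_hash changes:

```sql
-- assumeGroupBy is a hypothetical hint function, not a real ClickHouse
-- function: it would declare that grouping by prefix_hash alone yields
-- the same groups as grouping by (prefix_hash, key), unlocking in-order
-- aggregation without reading and comparing the large key column.
SELECT sum(metric)
FROM test
GROUP BY assumeGroupBy(prefix_hash, (prefix_hash, key))
SETTINGS optimize_aggregation_in_order = 1
```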
