
Finalize aggregations at field boundaries by taking advantages of ordering #22072

Open
jorisgio opened this issue Mar 24, 2021 · 1 comment
Comments

@jorisgio
Contributor

Currently, there is the optimize_aggregation_in_order setting. This setting is great, but it has the following limitations:

  • it requires reading the primary key fields, which can be big and slow to read, compare, and aggregate. I understand this is required to merge-sort the input blocks, but as far as I understand it is not strictly required for the optimization; for instance, you could sort blocks using only the primary key index and finalize aggregations at block boundaries to save memory.
    Locally, we have forked the hashing function to get an injective version, to be able to write
GROUP BY hash(primary_key1), hash(primary_key2)

and use the optimization, which saves memory and avoids slow aggregation over large strings, but it does not save on the reading/merge-sort part.
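A sketch of that workaround (injectiveHash is a hypothetical stand-in for our forked hash function; any hash known to be injective on the key columns would do):

```sql
-- injectiveHash is hypothetical: a forked hash guaranteed to be injective,
-- so grouping by it partitions rows exactly as grouping by the originals.
-- Aggregation state then holds fixed-width UInt64 keys instead of large
-- strings, but the primary key columns are still read and merge-sorted.
SELECT
    injectiveHash(primary_key1) AS k1,
    injectiveHash(primary_key2) AS k2,
    sum(metric)
FROM test
GROUP BY k1, k2
SETTINGS optimize_aggregation_in_order = 1
```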

  • it does not support holes and hidden relations between schema fields. Consider the following table:
  CREATE TABLE test
   (
      prefix_hash UInt64,
      key String,
      prefix_hash2 UInt64,
      key2 String,
      metric UInt64
   )
   ORDER BY prefix_hash, key, prefix_hash2, key2

where prefix_hash is the hash of the first logical part of key. If you try

SELECT sum(metric) FROM test GROUP BY prefix_hash, prefix_hash2

in theory optimize_aggregation_in_order could work, because if the prefixes differ then the aggregation keys must also differ, but the optimizer cannot know that. The primary key could be reordered, but then the ordering for other queries is lost.
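To make the limitation concrete, a hedged sketch against the table above:

```sql
-- Benefits from in-order aggregation: the GROUP BY keys form a prefix
-- of ORDER BY (prefix_hash, key, prefix_hash2, key2).
SELECT sum(metric)
FROM test
GROUP BY prefix_hash, key
SETTINGS optimize_aggregation_in_order = 1;

-- Falls back to full aggregation: prefix_hash2 skips over key, so the
-- keys are not a prefix of ORDER BY, even though a change in prefix_hash
-- implies a change in key (the hidden relation the optimizer cannot see).
SELECT sum(metric)
FROM test
GROUP BY prefix_hash, prefix_hash2
SETTINGS optimize_aggregation_in_order = 1;
```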

  • Ideally I want to use
 CREATE TABLE test
  (
     prefix_hash UInt64,
     key String,
     prefix_hash2 UInt64,
     key2 String,
     full_hash UInt64,
     metric UInt64
  )
  ORDER BY prefix_hash, key, prefix_hash2, key2

with full_hash = hash(key, key2) and GROUP BY full_hash, but this requires full aggregation and is very memory-hungry (and slow, due to finalizing aggregation over tens of gigabytes of keys).
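For reference, the full-aggregation query this implies:

```sql
-- full_hash is not a prefix of the table's ORDER BY, so in-order
-- aggregation cannot apply: every distinct full_hash stays in the
-- hash table until the end, which is what makes this memory-hungry.
SELECT full_hash, sum(metric)
FROM test
GROUP BY full_hash
```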

As far as I understand, it is currently not possible to read in "semi-order", using only the index to order blocks instead of a full merge sort? And there is no way to explicitly ask to flush aggregations when some key changes, via some clever query rewriting?

@jorisgio jorisgio added the question Question? label Mar 24, 2021
@melvynator melvynator self-assigned this Mar 7, 2022
@UnamedRus
Contributor

I'm thinking about some misc functions which could tell ClickHouse about possible replacements:

e.g.:

assumeGroupBy(prefix_hash, (prefix_hash, key)): prefix_hash is equivalent to (prefix_hash, key) in GROUP BY
assumeOrderBy(prefix_hash, key): prefix_hash is equivalent to (prefix_hash, key) in ORDER BY
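A sketch of how such a hint might look in a query. assumeGroupBy is purely hypothetical and does not exist in ClickHouse; the assumed semantics are that the optimizer may treat GROUP BY prefix_hash as equivalent to GROUP BY prefix_hash, key, and can therefore finalize groups whenever prefix_hash changes:

```sql
-- assumeGroupBy is a hypothetical hint function, not a real ClickHouse
-- function: it would declare that grouping by prefix_hash alone yields
-- the same groups as grouping by (prefix_hash, key), unlocking in-order
-- aggregation without reading and comparing the large key column.
SELECT sum(metric)
FROM test
GROUP BY assumeGroupBy(prefix_hash, (prefix_hash, key))
SETTINGS optimize_aggregation_in_order = 1
```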
