Ideas for performance optimizations of aggregation in order #42696

@novikd

After #35111, the aggregation-in-order optimization can be applied to queries whose aggregation keys are a superset of the storage ORDER BY key.

If data is sorted by (a, b) and we want to perform GROUP BY a, b, c, the current implementation does the following:

  1. Split each block into segments with the same (a, b) values.
  2. Perform aggregation over each segment using the key set (a, b, c).
  3. Sort each block by (a, b, c).
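
To make the three steps concrete, here is a minimal Python sketch of the flow on a single block. It is only an illustration, not ClickHouse code: the function name `aggregate_block_current`, the row layout `(a, b, c, value)`, and the use of `sum` as the aggregate function are all assumptions.

```python
from itertools import groupby

def aggregate_block_current(block):
    """Illustrative model of the current flow (hypothetical, not ClickHouse code).

    `block` is a list of (a, b, c, value) rows already sorted by (a, b).
    Returns a list of ((a, b, c), sum_of_values) sorted by (a, b, c).
    """
    out = []
    # Step 1: split the block into segments with equal (a, b).
    for _, segment in groupby(block, key=lambda row: (row[0], row[1])):
        # Step 2: aggregate the segment using the full key (a, b, c).
        table = {}
        for a, b, c, value in segment:
            key = (a, b, c)
            table[key] = table.get(key, 0) + value
        out.extend(table.items())
    # Step 3: sort the whole result block by the full key (a, b, c).
    out.sort(key=lambda item: item[0])
    return out
```

Note that every hash table key and every sort comparison includes `a` and `b`, even though they are constant within a segment.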

I think there are opportunities for optimization:

  • In step 2 we know that the values of (a, b) are equal within a segment, so we only need to aggregate over (c). This lets us reduce the hash table key to (c), which may reduce hash table lookup latency.
  • In step 3 the blocks are already sorted by (a, b), so we don't need to sort by these columns again. We could pass information about the equality ranges to the sort and sort only by (c) within each range.
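
A minimal Python sketch of both ideas combined, on a single block sorted by (a, b). This is an illustration under assumptions, not a proposed patch: the function name `aggregate_block_optimized`, the row layout `(a, b, c, value)`, and `sum` as the aggregate function are hypothetical.

```python
from itertools import groupby

def aggregate_block_optimized(block):
    """Illustrative model of the proposed flow (hypothetical, not ClickHouse code).

    `block` is a list of (a, b, c, value) rows already sorted by (a, b).
    Returns a list of ((a, b, c), sum_of_values) sorted by (a, b, c).
    """
    out = []
    # Each group is one equality range of (a, b); ranges arrive in sorted order.
    for (a, b), segment in groupby(block, key=lambda row: (row[0], row[1])):
        # Idea 1: (a, b) is constant here, so the hash table is keyed by c only.
        table = {}
        for _, _, c, value in segment:
            table[c] = table.get(c, 0) + value
        # Idea 2: sort only by c inside the equality range; the ranges
        # themselves are already ordered by (a, b), so no global sort is needed.
        for c in sorted(table):
            out.append(((a, b, c), table[c]))
    return out
```

The output is ordered by (a, b, c) without ever comparing or hashing `a` and `b` inside a segment.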

cc @azat
