Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group-by aggregation based optimization for UNBOUNDED collect_set window function #10248

Merged
merged 59 commits into from
Feb 12, 2024

Conversation

mythrocks
Copy link
Collaborator

@mythrocks mythrocks commented Jan 23, 2024

Fixes #10114.

This change introduces an optimization for the collect_set() window function, when run as an fully UNBOUNDED window function, as follows:

SELECT COLLECT_SET(foo) OVER (PARTITION BY grp 
                              ORDER BY ord 
                              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ... 

It might be possible to solve most UNBOUNDED window aggregations via group-wise aggregations. This change addresses collect_set first.

This change currently builds on the changes @revans2 made over at #10158, and includes those changes here.

A new window-exec is introduced specifically to address fully UNBOUNDED window functions. Each input batch is processed via group-wise aggregations. For groups that are known to be complete, the output is returned. The results for incomplete groups are cached, and finalized only when the end of the group is encountered.

revans2 and others added 28 commits January 5, 2024 10:08
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: MithunR <mythrocks@gmail.com>
No need to partition any more than the last group.
All the other groups shake out during the merge.
Signed-off-by: MithunR <mythrocks@gmail.com>
Also, clarified group-by column ordering.
@mythrocks mythrocks added performance A performance related task/issue reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Jan 23, 2024
@mythrocks
Copy link
Collaborator Author

Sorry, I thought I'd upmerged.

Yes, I should resolve these before we merge.

@revans2
Copy link
Collaborator

revans2 commented Feb 2, 2024

Still getting errors on compile. Looks like some style issues with the imports.

@revans2
Copy link
Collaborator

revans2 commented Feb 2, 2024

I did some quick scale tests and it looks really good. I'll try to look through all of the code on Monday.

@mythrocks
Copy link
Collaborator Author

Build

@mythrocks
Copy link
Collaborator Author

Build

@mythrocks
Copy link
Collaborator Author

Build

@mythrocks
Copy link
Collaborator Author

CI still seems to be failing because of #10382. I'll retry the build after #10382 is resolved.

@mythrocks
Copy link
Collaborator Author

Build

revans2
revans2 previously approved these changes Feb 7, 2024
@mythrocks
Copy link
Collaborator Author

Build

@mythrocks
Copy link
Collaborator Author

@revans2: I've rebased this change since #10158 was merged. The tests are passing, thankfully.

@revans2 revans2 merged commit a76f5b6 into NVIDIA:branch-24.04 Feb 12, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Memory optimizations for collect_set with unbounded proceeding to unbounded following
2 participants