[FEAT] [Scan Operator] Refactor planning and execution code to use shared Pushdowns struct. #1595

Merged
clarkzinzow merged 3 commits into main from clark/pushdown-refactor on Nov 13, 2023

Conversation

clarkzinzow
Contributor

This PR refactors the parallel pushdown optimization + execution paths to use a shared Pushdowns struct, simplifying a good bit of code.

@github-actions github-actions bot added the enhancement New feature or request label Nov 10, 2023

codecov bot commented Nov 10, 2023

Codecov Report

Merging #1595 (cc044c5) into main (ef4d2fd) will decrease coverage by 0.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main    #1595      +/-   ##
==========================================
- Coverage   85.21%   85.19%   -0.02%     
==========================================
  Files          54       54              
  Lines        5180     5180              
==========================================
- Hits         4414     4413       -1     
- Misses        766      767       +1     

see 1 file with indirect coverage changes

@@ -84,28 +72,33 @@ impl Source {
res.push(format!("Scan op = {}", scan_op));
res.push(format!("File schema = {}", source_schema.short_string()));
res.push(format!("Partitioning keys = {:?}", partitioning_keys));
res.push(format!("Scan pushdowns = {:?}", pushdowns));
if let Some(columns) = &pushdowns.columns {
Contributor

Should we implement Display on Pushdowns instead?

Contributor Author

There are a few reasons I didn't do this as a Display implementation:

  1. A Display implementation returns a single string rather than a vec of strings. For the tree display mode of the logical plan, I think we'd want each field displayed on its own line rather than all on a single line (many columns in the projection, a long filter expression, etc.). The nice thing about the multiline_display() API is that consumers can join the lines on any separator they like, be that "\n" or ", ".
  2. I was thinking that we'll only want to include the non-null pushdowns, otherwise in the common case we would have a Pushdowns (columns = None, filters = None, limit = None) which would clutter the Source op section in the repr. We could implement Display such that it only includes a field if it's non-null, and returns an empty string if all fields are null, but that seems a bit messy.

We could implement fn multiline_display() on Pushdowns that returns a Vec<String> that both of these match arms could use. How does that sound?
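
A minimal sketch of what such a helper might look like. The struct below is a simplified stand-in, not the actual Pushdowns definition from this PR: the field types (plain strings for the columns and the filter) and the label text are assumptions made to keep the example self-contained.

```rust
// Hypothetical, simplified stand-in for the Pushdowns struct discussed above.
// Field types are assumptions; the real struct likely holds richer types.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Pushdowns {
    pub columns: Option<Vec<String>>,
    pub filters: Option<String>,
    pub limit: Option<usize>,
}

impl Pushdowns {
    /// Returns one line per pushdown that is set. Unset pushdowns are skipped,
    /// so a struct with all fields `None` yields an empty vec.
    pub fn multiline_display(&self) -> Vec<String> {
        let mut res = Vec::new();
        if let Some(columns) = &self.columns {
            res.push(format!("Projection pushdown = [{}]", columns.join(", ")));
        }
        if let Some(filters) = &self.filters {
            res.push(format!("Filter pushdown = {}", filters));
        }
        if let Some(limit) = self.limit {
            res.push(format!("Limit pushdown = {}", limit));
        }
        res
    }
}
```

A tree-style repr could extend its own `res` vec with these lines, while a compact repr could call `multiline_display().join(", ")`; in both cases the common all-`None` case adds nothing to the output.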

Contributor

That's fine; we can leave it as-is.

I wish there were simple utilities for displaying hierarchical structs though. Looks like some folks recommend something like treeline: https://www.reddit.com/r/rust/comments/od0esb/representing_and_printing_hierarchies/

.into()
.into();
let new_plan = plan.with_new_children(&[new_source.into()]);
// Retry optimization now that the upstream node is different.
Contributor

Why is this necessary/what are the failure modes here?

Contributor Author

@clarkzinzow clarkzinzow Nov 11, 2023

This recursive call is used to prune now-redundant projections. This (status quo) behavior currently assumes that if a projection is pushed into a scan, then the downstream projection op can be removed from the logical plan.

I thought about this a bit earlier today, and I do think that we'll want to eventually integrate this with a query to the ScanOperator to see if fully absorbing the projection is supported, but I don't think that this is as needed as it is for limit and filter pushdowns, right? I think that all of our scan operators will trivially support fully absorbing projections as part of their scan implementations.

Contributor

> support fully absorbing projections

I guess ScanOperators will only be able to fully absorb projections that don't have any transformations (pure column pruning). If the projection contains any transformation at all, we can't absorb it.

Contributor Author

Yep, our projection pushdown is already very conservative, where we only drop a projection if it's a raw column selection projection! Expanding these semantics when we add more nontrivial projection pushdowns into scans seems doable.
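
To make the "raw column selection" condition concrete, here is a rough sketch of the check. The `Expr` enum and the `absorbable_columns` function are hypothetical names invented for illustration and are not the expression type or optimizer rule used in this PR.

```rust
// Hypothetical, minimal expression type; the project's real one is richer.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// A bare reference to a column by name.
    Column(String),
    /// Any transformation; addition stands in for all of them here.
    Add(Box<Expr>, Box<Expr>),
}

/// Returns the selected column names if every projection expression is a bare
/// column reference, i.e. the projection is pure column pruning that a scan
/// could fully absorb. Returns `None` as soon as any expression is a
/// transformation, in which case the projection op must stay in the plan.
pub fn absorbable_columns(projection: &[Expr]) -> Option<Vec<String>> {
    projection
        .iter()
        .map(|expr| match expr {
            Expr::Column(name) => Some(name.clone()),
            _ => None,
        })
        .collect()
}
```

When this returns `Some(columns)`, the columns could go into the scan's pushdowns and the downstream projection could be dropped; on `None` the plan would be left unchanged, which mirrors the conservative behavior described above.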

scan
};
Ok(plan)
Ok(scan)
Contributor

Is this safe? Don't we still need some mechanism to "prune" the columns in the output table?

Our InMemoryInfo doesn't contain any Pushdowns since it doesn't have a ScanTask, so how is it able to provide that pruning capability without an explicit Projection?

Contributor Author

@clarkzinzow clarkzinzow Nov 11, 2023

We're no longer pruning projections that are downstream from in-memory scans, so we no longer need to manually insert back a column-selecting projection here! 😄

#[cfg(feature = "python")]
SourceInfo::InMemoryInfo(_) => Ok(Transformed::No(plan)),

Contributor

Ah very meta... makes sense.

@clarkzinzow clarkzinzow merged commit 655bcb0 into main Nov 13, 2023
37 checks passed
@clarkzinzow clarkzinzow deleted the clark/pushdown-refactor branch November 13, 2023 17:09
2 participants