-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] [Scan Operator] Refactor planning and execution code to use shared Pushdowns
struct.
#1595
Conversation
45d98ed
to
0c450a5
Compare
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1595 +/- ##
==========================================
- Coverage 85.21% 85.19% -0.02%
==========================================
Files 54 54
Lines 5180 5180
==========================================
- Hits 4414 4413 -1
- Misses 766 767 +1 |
@@ -84,28 +72,33 @@ impl Source { | |||
res.push(format!("Scan op = {}", scan_op)); | |||
res.push(format!("File schema = {}", source_schema.short_string())); | |||
res.push(format!("Partitioning keys = {:?}", partitioning_keys)); | |||
res.push(format!("Scan pushdowns = {:?}", pushdowns)); | |||
if let Some(columns) = &pushdowns.columns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we implement Display
on Pushdowns
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few reasons I didn't do this as a Display
implementation:
- A
Display
implementation returns a single string, rather than a vec of strings. For the tree display mode of the logical plan, I think that we'd want each field to be displayed on its own line rather than be on a single line (many columns in the projection, a long filter expression, etc.) The nice thing about themultiline_display()
API is that consumers can join them on any separtor they like, be that"\n"
or", "
. - I was thinking that we'll only want to include the non-null pushdowns, otherwise in the common case we would have a
Pushdowns (columns = None, filters = None, limit = None)
which would clutter theSource
op section in the repr. We could implementDisplay
such that it only includes a field if it's non-null, and returns an empty string if all fields are null, but that seems a bit messy.
We could implement fn multiline_display()
on Pushdowns
that returns a Vec<String>
that both of these match arms could use. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine -- should be fine to leave it as-is,
I wish there were simple utilities for displaying hierarchical structs though. Looks like some folks recommend something like treeline
: https://www.reddit.com/r/rust/comments/od0esb/representing_and_printing_hierarchies/
.into() | ||
.into(); | ||
let new_plan = plan.with_new_children(&[new_source.into()]); | ||
// Retry optimization now that the upstream node is different. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary/what are the failure modes here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This recursive call is used to prune now-redundant projections. This (status quo) behavior currently assumes that if a projection is pushed into a scan, then the downstream projection op can be removed from the logical plan.
I thought about this a bit earlier today, and I do think that we'll want to eventually integrate this with a query to the ScanOperator
to see if fully absorbing the projection is supported, but I don't think that this is as needed as it is for limit and filter pushdowns, right? I think that all of our scan operators will trivially support fully absorbing projections as part of their scan implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support fully absorbing projections
ScanOperators will only be able to fully support projects that don't have any transformations I guess (pure column pruning). If the projection contains any transformation at all, we can't absorb...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, our projection pushdown is already very conservative, where we only drop a projection if it's a raw column selection projection! Expanding these semantics when we add more nontrivial projection pushdowns into scans seems doable.
scan | ||
}; | ||
Ok(plan) | ||
Ok(scan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe? Don't we still need some mechanism to "prune" the columns in the output table?
Our InMemoryInfo doesn't contain any Pushdowns
since it doesn't have a ScanTask
, so how is it able to provide that pruning capabilities without an explicit Projection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're no longer pruning projections that are downstream from in-memory scans, so we no longer need to manually insert back a column-selecting projection here! 😄
Daft/src/daft-plan/src/optimization/rules/push_down_projection.rs
Lines 178 to 179 in 0c450a5
#[cfg(feature = "python")] | |
SourceInfo::InMemoryInfo(_) => Ok(Transformed::No(plan)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah very meta... makes sense.
0c450a5
to
cc044c5
Compare
This PR refactors the parallel pushdown optimization + execution paths to use a shared
Pushdowns
struct, simplifying a good bit of code.