[SPARK-51831][SQL] Column pruning with existsJoin for Datasource V2 #51046
base: master
280a301 to 3cbac75
private def createScanBuilder(plan: LogicalPlan) = plan.transform {
  case r @ DataSourceV2ScanRelation(relation, _, _, _, _)
    if relation.getTagValue(V2_SCAN_BUILDER_HOLDER).nonEmpty =>
When will the content of tags be removed?
yeap
@@ -658,16 +658,55 @@ abstract class SchemaPruningSuite
|where not exists (select null from employees e where e.name.first = c.name.first
| and e.employer.name = c.employer.company.name)
|""".stripMargin)
// TODO: enable this check once we fix the schema pruning for V1 nested columns
/**
If V1 hasn't figured out how to fix it yet, perhaps we could temporarily check for different outcomes based on whether dataSourceName is included in the result of SQLConf.USE_V1_SOURCE_LIST?
This is still better than commenting out the assertion.
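A minimal sketch of the check suggested above, assuming the value behind SQLConf.USE_V1_SOURCE_LIST is a comma-separated string of source names. The object and method names here are hypothetical and not part of this PR; in the suite, the first argument would presumably be read from the active SQL conf and the second would be dataSourceName.

```scala
import java.util.Locale

// Hypothetical helper, not part of the PR: decides whether a data source name
// appears in a comma-separated "use V1 source" list.
object V1SourceCheck {
  def usesV1Source(useV1SourceList: String, dataSourceName: String): Boolean = {
    // Normalize case and whitespace, and drop empty entries so that an
    // empty list means "no source falls back to V1".
    val v1Sources = useV1SourceList
      .toLowerCase(Locale.ROOT)
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
    v1Sources.contains(dataSourceName.toLowerCase(Locale.ROOT))
  }
}
```

A test could then branch on this result: assert the pruned schema when the source runs as V2, and assert the currently observed, unpruned schema when it runs as V1, rather than commenting the assertion out.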
@@ -658,16 +658,55 @@ abstract class SchemaPruningSuite
|where not exists (select null from employees e where e.name.first = c.name.first
| and e.employer.name = c.employer.company.name)
|""".stripMargin)
// TODO: enable this check once we fix the schema pruning for V1 nested columns
TODO should be used with a JIRA ticket.
done
a2c0714 to 0121f77
friendly ping @cloud-fan
Why are the changes needed?
Recently, I have been testing TPC-DS queries based on DataSource V2, and noticed that column pruning does not occur in scenarios involving EXISTS (SELECT * FROM ... WHERE ...). As a result, the scan ends up reading all columns instead of just the required ones. This issue is reproducible in queries like Q10, Q16, Q35, Q69, and Q94.
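To make the expected behavior concrete, here is a toy model of the pruning in question. It is only a sketch: the types and the pruneRight function are hypothetical, are not Spark's Catalyst classes, and do not reflect how this PR implements the fix. The idea it illustrates is that an EXISTS subquery only tests whether a matching row exists, so the scan on the subquery side needs just the columns referenced by the join condition.

```scala
// Toy model only: these types are hypothetical and are not Spark's Catalyst classes.
object ExistsPruningSketch {
  sealed trait Plan

  // A scan that reads the listed columns of a table.
  final case class Scan(table: String, columns: Seq[String]) extends Plan

  // Keeps left rows for which a matching right row exists. `rightKeys` holds the
  // right-side columns referenced by the join condition.
  final case class ExistenceJoin(left: Plan, right: Plan, rightKeys: Set[String]) extends Plan

  // Narrow the right-side scan of an existence join to the columns its condition uses.
  // The left side is only traversed, not pruned, because this toy model does not
  // track which left columns the rest of the query needs.
  def pruneRight(plan: Plan): Plan = plan match {
    case ExistenceJoin(left, Scan(table, columns), keys) =>
      ExistenceJoin(pruneRight(left), Scan(table, columns.filter(keys.contains)), keys)
    case ExistenceJoin(left, right, keys) =>
      ExistenceJoin(pruneRight(left), pruneRight(right), keys)
    case other => other
  }
}
```

For example, with a right-side scan over columns x, y, and z and a condition that references only x:

```scala
import ExistsPruningSketch._

val before = ExistenceJoin(Scan("t1", Seq("a", "b")), Scan("t2", Seq("x", "y", "z")), Set("x"))
val after = pruneRight(before) // the scan of t2 now lists only column x
```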
This PR introduces PostV2ScanRelationPushdown to address the column pruning issues that may arise after optimizer rules are applied. Below are the plan changes for the newly added test case.
Before this PR
After this PR
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Newly added UT.
Was this patch authored or co-authored using generative AI tooling?
No.