[SPARK-51831][SQL] Column pruning with existsJoin for Datasource V2 #51046

Open
wants to merge 5 commits into
base: master

jackylee-ch
Contributor

Why are the changes needed?

Recently, I have been testing TPC-DS queries based on DataSource V2, and noticed that column pruning does not occur in scenarios involving EXISTS (SELECT * FROM ... WHERE ...). As a result, the scan ends up reading all columns instead of just the required ones. This issue is reproducible in queries like Q10, Q16, Q35, Q69, and Q94.

This PR introduces PostV2ScanRelationPushdown to address the column pruning issues that may arise after optimizer rules are applied.

Below are the plan changes for the newly added test case.
Before this PR

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []

After this PR

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []
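For readers who want to inspect a comparable plan locally, here is a minimal sketch of a query with the same shape. It is not the test case added in this PR: the table `t2`, the join condition, the reduced column count, and the object name `ExistsPruningSketch` are assumptions made for illustration. It relies on `spark.sql.sources.useV1SourceList` being set to an empty string so that parquet reads go through DataSource V2.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object ExistsPruningSketch {
  // Hypothetical query shape; `t2` and the join condition are assumptions,
  // not taken from the PR's test case.
  val query: String =
    """SELECT t2.id FROM t2
      |WHERE EXISTS (SELECT * FROM t1 WHERE t1.id = t2.id AND t1.col1 > 5)
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("exists-pruning-sketch")
      // An empty V1 source list routes parquet reads through DataSource V2.
      .config("spark.sql.sources.useV1SourceList", "")
      .getOrCreate()

    val dir = Files.createTempDirectory("exists-pruning-sketch").toString

    // t1 is the wide table (three extra columns here, for brevity); t2 is narrow.
    spark.range(10)
      .selectExpr("id", "id AS col1", "id AS col2", "id AS col3")
      .write.mode("overwrite").parquet(s"$dir/t1")
    spark.range(10).toDF("id").write.mode("overwrite").parquet(s"$dir/t2")

    spark.read.parquet(s"$dir/t1").createOrReplaceTempView("t1")
    spark.read.parquet(s"$dir/t2").createOrReplaceTempView("t2")

    // Look at the ReadSchema of the BatchScan over t1 in the printed plan.
    spark.sql(query).explain()

    spark.stop()
  }
}
```

Only `id` and `col1` of `t1` are referenced by the subquery, so the `ReadSchema` of the `t1` scan is the thing to compare between builds with and without this change.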

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Newly added unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 29, 2025
@jackylee-ch jackylee-ch marked this pull request as draft June 3, 2025 01:48
@jackylee-ch jackylee-ch force-pushed the SPARK-51831 branch 6 times, most recently from 280a301 to 3cbac75 Compare June 3, 2025 09:40

private def createScanBuilder(plan: LogicalPlan) = plan.transform {
  case r @ DataSourceV2ScanRelation(relation, _, _, _, _)
      if relation.getTagValue(V2_SCAN_BUILDER_HOLDER).nonEmpty =>
Contributor

When will the content of tags be removed?

Contributor Author

yeap

@@ -658,16 +658,55 @@ abstract class SchemaPruningSuite
|where not exists (select null from employees e where e.name.first = c.name.first
| and e.employer.name = c.employer.company.name)
|""".stripMargin)
// TODO: enable this check once we fix the schema pruning for V1 nested columns
/**
Contributor

If V1 hasn't figured out how to fix it yet, perhaps we could temporarily check for different outcomes based on whether dataSourceName is included in the result of SQLConf.USE_V1_SOURCE_LIST?

This is still better than commenting out the assertion
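A minimal sketch of the check suggested above, assuming the value of `SQLConf.USE_V1_SOURCE_LIST` is a comma-separated list of source names. The object `V1SourceCheck` and the helper `usesV1Source` are hypothetical names, not existing Spark APIs; how the list value is read inside the suite is left to the caller.

```scala
import java.util.Locale

object V1SourceCheck {
  // Hypothetical helper (not a Spark API): returns true when `dataSourceName`
  // appears in the comma-separated `useV1SourceList` value, ignoring case and
  // surrounding whitespace.
  def usesV1Source(useV1SourceList: String, dataSourceName: String): Boolean = {
    val v1Sources = useV1SourceList
      .toLowerCase(Locale.ROOT)
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .toSet
    v1Sources.contains(dataSourceName.toLowerCase(Locale.ROOT))
  }
}
```

In the suite, the result could select between two expected schemas, so that the V2 path asserts the pruned schema while the V1 path asserts the current, unpruned one until the V1 fix lands.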

@@ -658,16 +658,55 @@ abstract class SchemaPruningSuite
|where not exists (select null from employees e where e.name.first = c.name.first
| and e.employer.name = c.employer.company.name)
|""".stripMargin)
// TODO: enable this check once we fix the schema pruning for V1 nested columns
Contributor

A TODO should reference a JIRA ticket.

Contributor Author

done

@jackylee-ch jackylee-ch force-pushed the SPARK-51831 branch 2 times, most recently from a2c0714 to 0121f77 Compare June 4, 2025 03:26
@LuciferYang
Contributor

friendly ping @cloud-fan

@jackylee-ch jackylee-ch marked this pull request as ready for review June 10, 2025 07:10