[SPARK-51831][SQL] Column pruning with existsJoin for Datasource V2 #51046

jackylee-ch · 2025-05-29T08:27:09Z

Why are the changes needed?

While testing TPC-DS queries on DataSource V2, I noticed that column pruning does not happen for queries containing EXISTS (SELECT * FROM ... WHERE ...). As a result, the scan reads all columns instead of only the required ones. The issue is reproducible in Q10, Q16, Q35, Q69, and Q94.

This PR introduces PostV2ScanRelationPushdown to address the column pruning issues that may arise after optimizer rules are applied.

Below are the plan changes for the newly added test case.
Before this PR

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []

After this PR

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []
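
To illustrate why the scan should only need `id` and `col1`, here is a minimal, self-contained sketch of column pruning across a left-semi join (the shape an EXISTS subquery takes after it is rewritten to a join). It is a toy model, not Spark's implementation: the `Plan` classes, the `prune` function, the outer table `t0`, and the modeled query are all assumptions made for illustration.

```scala
// Toy model only: these classes are hypothetical and are NOT Spark's plan nodes.
object ExistsPruningSketch {
  sealed trait Plan { def output: Seq[String] }
  final case class Scan(table: String, output: Seq[String]) extends Plan
  final case class Filter(references: Set[String], child: Plan) extends Plan {
    def output: Seq[String] = child.output
  }
  final case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }
  // A left-semi join produces only the left side's columns.
  final case class SemiJoin(left: Plan, right: Plan, leftKey: String, rightKey: String)
      extends Plan {
    def output: Seq[String] = left.output
  }

  // Push the set of required columns down to the scans.
  def prune(plan: Plan, required: Set[String]): Plan = plan match {
    case Scan(table, cols) =>
      Scan(table, cols.filter(required.contains))
    case Filter(refs, child) =>
      // A filter needs its own references in addition to what the parent needs.
      Filter(refs, prune(child, required ++ refs))
    case Project(cols, child) =>
      Project(cols, prune(child, cols.toSet))
    case SemiJoin(left, right, leftKey, rightKey) =>
      // The right side contributes no output, so it only needs its join key.
      SemiJoin(prune(left, required + leftKey), prune(right, Set(rightKey)), leftKey, rightKey)
  }

  // Assumed query shape (t0 is a hypothetical outer table):
  //   SELECT name FROM t0
  //   WHERE EXISTS (SELECT * FROM t1 WHERE t1.id = t0.id AND t1.col1 > 5)
  val t0: Plan = Scan("t0", Seq("id", "name"))
  val t1: Plan = Scan("t1", Seq("id") ++ (1 to 9).map(i => s"col$i"))
  val plan: Plan = Project(Seq("name"), SemiJoin(t0, Filter(Set("col1"), t1), "id", "id"))

  // After pruning, the t1 scan keeps only the join key and the filter column.
  val pruned: Plan = prune(plan, plan.output.toSet)

  def main(args: Array[String]): Unit = println(pruned)
}
```

In this model the `SELECT *` inside EXISTS does not matter once the subquery becomes a semi join: nothing from `t1` is returned, so only the join key and the filter column have to be read.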

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A newly added unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

LuciferYang · 2025-06-03T15:47:31Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+
+  private def createScanBuilder(plan: LogicalPlan) = plan.transform {
+    case r @ DataSourceV2ScanRelation(relation, _, _, _, _)
+        if relation.getTagValue(V2_SCAN_BUILDER_HOLDER).nonEmpty =>


When will the content of tags be removed?

LuciferYang · 2025-06-03T15:53:32Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -658,16 +658,55 @@ abstract class SchemaPruningSuite
            |where not exists (select null from employees e where e.name.first = c.name.first
            |  and e.employer.name = c.employer.company.name)
            |""".stripMargin)
+        // TODO: enable this check once we fix the schema pruning for V1 nested columns
+        /**


If V1 hasn't figured out how to fix it yet, perhaps we could temporarily check for different outcomes based on whether dataSourceName is included in the result of SQLConf.USE_V1_SOURCE_LIST?

This is still better than commenting out the assertion.
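
A rough sketch of what such a check could look like, written as plain Scala so it stands on its own. The helper names (`usesV1`, `expectedSchema`) and the placeholder schema strings are hypothetical. In the suite itself the list would presumably come from something like `SQLConf.get.getConf(SQLConf.USE_V1_SOURCE_LIST)`, which is an assumption here rather than code from this PR.

```scala
import java.util.Locale

// Hypothetical helpers; not part of Spark or of this PR.
object V1SourceCheck {
  /** True if `dataSourceName` appears in a comma-separated V1 source list. */
  def usesV1(useV1SourceList: String, dataSourceName: String): Boolean =
    useV1SourceList
      .toLowerCase(Locale.ROOT)
      .split(",")
      .map(_.trim)
      .contains(dataSourceName.toLowerCase(Locale.ROOT))

  /** Pick the schema to assert on, depending on which code path the source takes. */
  def expectedSchema(
      useV1SourceList: String,
      dataSourceName: String,
      v1Expected: String,
      v2Expected: String): String =
    if (usesV1(useV1SourceList, dataSourceName)) v1Expected else v2Expected
}
```

With this shape the assertion stays active for both paths: the V1 branch asserts whatever V1 produces today, and only that branch needs updating once the V1 nested-column pruning is fixed.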

LuciferYang · 2025-06-03T15:54:55Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -658,16 +658,55 @@ abstract class SchemaPruningSuite
            |where not exists (select null from employees e where e.name.first = c.name.first
            |  and e.employer.name = c.employer.company.name)
            |""".stripMargin)
+        // TODO: enable this check once we fix the schema pruning for V1 nested columns


A TODO should reference a JIRA ticket.

LuciferYang · 2025-06-05T04:47:12Z

Friendly ping @cloud-fan

[SPARK-51831][SQL] Column pruning with existsJoin for Datasource V2

38c6fc9

github-actions bot added the SQL label May 29, 2025

jackylee-ch force-pushed the SPARK-51831 branch from 6db5e10 to af18d46 Compare May 29, 2025 15:53

jackylee-ch marked this pull request as draft June 3, 2025 01:48

jackylee-ch force-pushed the SPARK-51831 branch 6 times, most recently from 280a301 to 3cbac75 Compare June 3, 2025 09:40

use tags to pass scanBuildHolder

0d1f77a

jackylee-ch force-pushed the SPARK-51831 branch from 3cbac75 to 0d1f77a Compare June 3, 2025 15:29

LuciferYang reviewed Jun 3, 2025

jackylee-ch force-pushed the SPARK-51831 branch 2 times, most recently from a2c0714 to 0121f77 Compare June 4, 2025 03:26

fix

5004720

jackylee-ch force-pushed the SPARK-51831 branch from 0121f77 to 5004720 Compare June 4, 2025 09:14

Merge remote-tracking branch 'upstream/master' into SPARK-51831

2e11ea0

skip relation change if no need

8c8c121

jackylee-ch force-pushed the SPARK-51831 branch from 1506dd4 to 8c8c121 Compare June 10, 2025 01:52

jackylee-ch marked this pull request as ready for review June 10, 2025 07:10
