[SPARK-38768][SQL] Remove Limit from plan if complete push down limit to data source #474

chenzhx · 2022-06-01T13:04:47Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

…it to data source ### What changes were proposed in this pull request? Currently, Spark supports push down limit to data source. If limit could pushed down and Data source only have one partition, DS V2 still do limit again. This PR want remove `Limit` from plan if complete push down limit to data source. ### Why are the changes needed? Improve performance. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? Tests updated. Closes apache#36043 from beliefer/SPARK-38768. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…` and `pushTopN` of `PushDownUtils` ### What changes were proposed in this pull request? `pushLimit` and `pushTopN` of `PushDownUtils` returns tuple of boolean. It will be good to explain what the boolean value represents. ### Why are the changes needed? Make DS V2 API more friendly to developers. ### Does this PR introduce _any_ user-facing change? 'No'. Just update comments. ### How was this patch tested? N/A Closes apache#36092 from beliefer/SPARK-38391_SPARK-38768_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…asonable ### What changes were proposed in this pull request? Some testing CASE WHEN queries are not carefully written and do not make sense. In the future, the optimizer may get smarter and get rid of the CASE WHEN completely, and then we loose test coverage. This PR updates some CASE WHEN queries to make them more reasonable. ### Why are the changes needed? future-proof test coverage. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? N/A Closes apache#36125 from beliefer/SPARK-37960_followup3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

… to data source (#474) * [SPARK-38768][SQL] Remove `Limit` from plan if complete push down limit to data source Currently, Spark supports push down limit to data source. If limit could pushed down and Data source only have one partition, DS V2 still do limit again. This PR want remove `Limit` from plan if complete push down limit to data source. Improve performance. 'No'. New feature. Tests updated. Closes apache#36043 from beliefer/SPARK-38768. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-38391][SPARK-38768][SQL][FOLLOWUP] Add comments for `pushLimit` and `pushTopN` of `PushDownUtils` `pushLimit` and `pushTopN` of `PushDownUtils` returns tuple of boolean. It will be good to explain what the boolean value represents. Make DS V2 API more friendly to developers. 'No'. Just update comments. N/A Closes apache#36092 from beliefer/SPARK-38391_SPARK-38768_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> * [SPARK-37960][SQL][FOLLOWUP] Make the testing CASE WHEN query more reasonable Some testing CASE WHEN queries are not carefully written and do not make sense. In the future, the optimizer may get smarter and get rid of the CASE WHEN completely, and then we loose test coverage. This PR updates some CASE WHEN queries to make them more reasonable. future-proof test coverage. 'No'. N/A Closes apache#36125 from beliefer/SPARK-37960_followup3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * update spark version Co-authored-by: Jiaan Geng <beliefer@163.com>

### What changes were proposed in this pull request? apache#41458 updated `numeric.sql.out` but not update `numeric.sql.out.java21`, this pr updated `numeric.sql.out.java21` for Java 21. ### Why are the changes needed? Fix golden file for Java 21. https://github.com/apache/spark/actions/runs/5362442727/jobs/9729315685 ``` [info] - postgreSQL/numeric.sql *** FAILED *** (1 minute, 4 seconds) [info] postgreSQL/numeric.sql [info] Expected "...OLUMN_ARITY_MISMATCH[", [info] "sqlState" : "21S01", [info] "messageParameters" : { [info] "dataColumns" : "'id', 'id', 'val', 'val', '(val * val)'", [info] "reason" : "too many data columns", [info] "tableColumns" : "'id1', 'id2', 'result']", [info] "tableName" :...", but got "...OLUMN_ARITY_MISMATCH[.TOO_MANY_DATA_COLUMNS", [info] "sqlState" : "21S01", [info] "messageParameters" : { [info] "dataColumns" : "`id`, `id`, `val`, `val`, `(val * val)`", [info] "tableColumns" : "`id1`, `id2`, `result`]", [info] "tableName" :..." Result did not match for query #474 [info] INSERT INTO num_result SELECT t1.id, t2.id, t1.val, t2.val, t1.val * t2.val [info] FROM num_data t1, num_data t2 (SQLQueryTestSuite.scala:848) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564) [info] at org.scalatest.Assertions.assertResult(Assertions.scala:847) [info] at org.scalatest.Assertions.assertResult$(Assertions.scala:842) [info] at org.scalatest.funsuite.AnyFunSuite.assertResult(AnyFunSuite.scala:1564) [info] at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$readGoldenFileAndCompareResults$3(SQLQueryTestSuite.scala:848) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual checked using Java 21 Closes apache#41720 from LuciferYang/SPARK-43969-FOLLOWUP-2. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

beliefer and others added 4 commits June 1, 2022 21:00

update spark version

e78b63b

github-actions bot added the SQL label Jun 1, 2022

chenzhx force-pushed the kyspark-3.2.x-4.x-agg-pushdown branch from 4944e42 to e78b63b Compare June 1, 2022 13:26

github-actions bot added AVRO BUILD CORE DSTREAM EXAMPLES GRAPHX INFRA KUBERNETES MESOS MLLIB SPARK SHELL STRUCTURED STREAMING YARN labels Jun 1, 2022

zheniantoushipashi approved these changes Jun 2, 2022

View reviewed changes

chenzhx merged commit 0553594 into Kyligence:kyspark-3.2.x-4.x Jun 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-38768][SQL] Remove Limit from plan if complete push down limit to data source #474

[SPARK-38768][SQL] Remove Limit from plan if complete push down limit to data source #474

chenzhx commented Jun 1, 2022

[SPARK-38768][SQL] Remove Limit from plan if complete push down limit to data source #474

[SPARK-38768][SQL] Remove Limit from plan if complete push down limit to data source #474

Conversation

chenzhx commented Jun 1, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?