
[BUG] DPP is not working in Databricks env #3143

Closed
viadea opened this issue Aug 4, 2021 · 7 comments · Fixed by #6919
Labels
bug Something isn't working

Comments

viadea commented Aug 4, 2021

Describe the bug

DPP (dynamic partition pruning) is not working in the Databricks env.
Found this issue when analyzing NDS query performance on Databricks.

Steps/Code to reproduce bug

Below is a minimal reproduction:

import org.apache.spark.sql.functions.col
spark.range(1000).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").save("/tmp/hao/myfact")
spark.range(100).select(col("id"), col("id").as("k")).write.format("parquet").mode("overwrite").save("/tmp/hao/mydim")
spark.read.parquet("/tmp/hao/myfact").createOrReplaceTempView("fact")
spark.read.parquet("/tmp/hao/mydim").createOrReplaceTempView("dim")
spark.sql("SELECT fact.id, fact.k FROM fact JOIN dim ON fact.k = dim.k AND dim.id < 2").collect

GPU's physical plan:

Location: InMemoryFileIndex [dbfs:/tmp/hao/myfact]
PartitionFilters: [isnotnull(k#1265), dynamicpruningexpression(true)]

CPU's physical plan:

Location: InMemoryFileIndex [dbfs:/tmp/hao/myfact]
PartitionFilters: [isnotnull(k#2201), dynamicpruningexpression(cast(k#2201 as bigint) IN dynamicpruning#2211)]

As you can see, even though the dynamicpruningexpression keyword is present, the filter is always literally true, so no partitions are actually pruned.
In a Spark standalone cluster there is no such issue for the GPU run.
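The difference between the two plans above can be checked mechanically. Below is a hypothetical helper (not part of the plugin) that inspects a physical-plan string: a partition filter of `dynamicpruningexpression(true)` means DPP was planned but replaced with a no-op, which is exactly the symptom on Databricks.

```scala
// Hypothetical diagnostic helper, not part of the RAPIDS plugin.
// Given the text of a physical plan, report whether dynamic partition
// pruning is actually effective. A no-op filter renders as
// dynamicpruningexpression(true) in the plan text.
object DppCheck {
  def isDppEffective(planText: String): Boolean = {
    val hasDpp = planText.contains("dynamicpruningexpression(")
    val isNoOp = planText.contains("dynamicpruningexpression(true)")
    hasDpp && !isNoOp
  }
}
```

Running it over the two `PartitionFilters` lines above would flag the GPU plan as ineffective and the CPU plan as effective.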

Expected behavior

DPP should happen in Databricks env.

Environment details (please complete the following information)

  • Environment location: Cloud (Databricks)
  • Databricks runtime: 8.2 ML GPU
  • RAPIDS: 21.10 snapshot / 21.06 GA

Additional context

Workaround:
Disable DPP by setting:
spark.sql.optimizer.dynamicPartitionPruning.enabled=false
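The same workaround can also be applied to a live session without restarting it, via Spark's runtime-config API (a sketch assuming an active `spark` session):

```scala
// Disable dynamic partition pruning for the current session only.
// spark.sql.optimizer.dynamicPartitionPruning.enabled is a standard
// Spark SQL conf and can be changed at runtime.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
```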

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 4, 2021
@tgravescs

is the CPU run here with AQE off or on?


viadea commented Aug 5, 2021

The CPU run had AQE on, while the GPU run can only use AQE off. This is because if I enable AQE for the GPU run, another bug is triggered that crashes the cluster, so I have to restart it.


viadea commented Aug 5, 2021

@tgravescs I just quickly tested a CPU run on Databricks. Basically my test shows that DPP happens even when AQE is off.
DPP is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled, and only setting that parameter to false makes DPP go away.
So I do not think AQE is the trigger for this issue.

@tgravescs

for some reason the GPU plan on Databricks is missing the SubqueryBroadcast which is used with DPP:


:- GpuFileGpuScan parquet [id#60L,k#61] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/tgraves/myfact], PartitionFilters: [isnotnull(k#61), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<id:bigint>
+- GpuBroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#191]
   +- GpuProject [k#65L]
      +- GpuCoalesceBatches TargetSize(2147483647)
         +- GpuFilter ((gpuisnotnull(id#64L) AND (id#64L < 2)) AND gpuisnotnull(k#65L))
            +- GpuFileGpuScan parquet [id#64L,k#65L] Batched: true, DataFilters: [isnotnull(id#64L), (id#64L < 2), isnotnull(k#65L)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/tgraves/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint>

normally this would look like:

 +- GpuBroadcastHashJoin [cast(k#39 as bigint)], [k#43L], Inner, GpuBuildRight
      :- GpuFileGpuScan parquet [id#38L,k#39] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/tgraves/myfact], PartitionFilters: [isnotnull(k#39), dynamicpruningexpression(cast(k#39 as bigint) IN dynamicpruning#48)], PushedFilters: [], ReadSchema: struct<id:bigint>
      :     +- SubqueryBroadcast dynamicpruning#48, 0, [k#43L], [id=#235]
      :        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#234]
      :           +- GpuColumnarToRow false
      :              +- GpuProject [k#43L]
      :                 +- GpuCoalesceBatches targetsize(2147483647)
      :                    +- GpuFilter ((gpuisnotnull(id#42L) AND (id#42L < 2)) AND gpuisnotnull(k#43L))
      :                       +- GpuFileGpuScan parquet [id#42L,k#43L] Batched: true, DataFilters: [isnotnull(id#42L), (id#42L < 2), isnotnull(k#43L)], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/tgraves/mydim], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint>

Somehow the SubqueryBroadcast isn't there.

tgravescs commented Aug 5, 2021

Yeah, so the plan that we get from Databricks doesn't even have the SubqueryBroadcast in it like Apache Spark's does. They must be inserting it at some later point: the CPU side eventually gets it inserted, but only sometime after we see the plan.

@tgravescs

note, turning off our GpuBroadcastExchange replacement makes DPP work on Databricks:

spark.conf.set("spark.rapids.sql.exec.BroadcastExchangeExec", "false")

== Physical Plan ==
GpuColumnarToRow (7)
+- GpuProject (6)
   +- GpuRowToColumnar (5)
      +- * BroadcastHashJoin Inner BuildRight (4)
         :- GpuColumnarToRow (2)
         :  +- GpuScan parquet  (1)
         +- ReusedExchange (3)

===== Subqueries =====

Subquery:1 Hosting operator id = 1 Hosting Expression = cast(k#18 as bigint) IN dynamicpruning#28
BroadcastExchange (13)
+- GpuColumnarToRow (12)
   +- GpuProject (11)
      +- GpuCoalesceBatches (10)
         +- GpuFilter (9)
            +- GpuScan parquet  (8)

@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Aug 17, 2021
@tgravescs

Please note you can work around this issue by disabling DPP with spark.sql.optimizer.dynamicPartitionPruning.enabled=false
