
Allow Shims to replace Hive execs #7486

Merged: 7 commits merged into NVIDIA:branch-23.02 on Jan 10, 2023

Conversation

@mythrocks (Collaborator) commented Jan 10, 2023

Distro-specific shims are currently allowed to customize operator/exec/expr replacements in most cases. However, replacements made specifically for exprs/execs from org.apache.spark.sql.hive (i.e. those handled by HiveProvider) are not honored.

For instance, the replacement rule for HiveTableScanExec is currently provided by HiveProviderImpl. When a Spark distribution's shim attempts to block or override this exec's replacement, the override does not take effect, because HiveProviderImpl's rules are applied last in GpuOverrides.

This change moves the Spark shim's replacement rules to the end, so that the shim has the final say.
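A minimal sketch of why the ordering matters (illustrative only, not the actual GpuOverrides code; the rule shape and strings below are placeholders): when replacement rules are gathered into a map keyed by exec class, the last entry for a given key wins, so merging the shim's rules after the HiveProvider's gives the shim the final say.

```scala
object RuleOrderingSketch {
  // Hypothetical rule shape: exec class name -> what to do with that exec.
  type ExecRules = Map[String, String]

  val hiveProviderRules: ExecRules =
    Map("HiveTableScanExec" -> "replace with GpuHiveTableScanExec")

  val shimRules: ExecRules =
    Map("HiveTableScanExec" -> "do not replace on this distro")

  // `++` keeps the right-hand entry on key collisions, i.e. the rules merged
  // last (the shim's) take precedence.
  val effectiveRules: ExecRules = hiveProviderRules ++ shimRules

  def main(args: Array[String]): Unit = {
    println(effectiveRules("HiveTableScanExec")) // prints: do not replace on this distro
  }
}
```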

Distro-specific shims are currently allowed to customize operator/exec/expr
replacements in most cases. However, there is currently no good way to override
any operator/exec/expr replacements done in `HiveProviderImpl`.

For instance, the `HiveTableScanExec` is currently provided unilaterally by
`HiveProviderImpl`. There is no way for a Spark distribution's shim to
block/override this exec's replacement; the exec will be replaced in all
distros.

This change provides shims a way to specify a custom `HiveProvider`, if required,
through which operator replacements may be customized.

Signed-off-by: MithunR <mythrocks@gmail.com>
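
A hedged sketch of what that originally-proposed wiring might have looked like (the trait and method names here are illustrative, not the plugin's real shim API): the shim decides which HiveProvider to use, and a distro shim substitutes one with customized rules.

```scala
trait HiveProviderSketch {
  def execRules: Map[String, String]
}

class CommonHiveProviderSketch extends HiveProviderSketch {
  override def execRules: Map[String, String] =
    Map("HiveTableScanExec" -> "replace with GpuHiveTableScanExec")
}

trait SparkShimSketch {
  // Default: every shim uses the common provider...
  def hiveProvider: HiveProviderSketch = new CommonHiveProviderSketch
}

class SomeDistroShimSketch extends SparkShimSketch {
  // ...but a distro shim may drop (or rewrite) the HiveTableScanExec rule.
  override def hiveProvider: HiveProviderSketch = new CommonHiveProviderSketch {
    override def execRules: Map[String, String] =
      super.execRules - "HiveTableScanExec"
  }
}
```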
@mythrocks mythrocks self-assigned this Jan 10, 2023
@mythrocks (Collaborator, Author) commented:

Build

@mythrocks mythrocks marked this pull request as draft January 10, 2023 00:22
@mythrocks (Collaborator, Author) commented Jan 10, 2023

The build failure on 3.4 seems to be unrelated to this current change:

[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOptimizedCreateHiveTableAsSelectCommand.scala:31: object OptimizedCreateHiveTableAsSelectCommand is not a member of package org.apache.spark.sql.hive.execution
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOptimizedCreateHiveTableAsSelectCommand.scala:118: not found: type OptimizedCreateHiveTableAsSelectCommand
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOptimizedCreateHiveTableAsSelectCommand.scala:152: not found: type OptimizedCreateHiveTableAsSelectCommand
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOptimizedCreateHiveTableAsSelectCommand.scala:148: not found: type OptimizedCreateHiveTableAsSelectCommand
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOptimizedCreateHiveTableAsSelectCommand.scala:19: Unused import
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOptimizedCreateHiveTableAsSelectCommand.scala:31: Unused import
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala:3625: type arguments [org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand] do not conform to method dataWriteCmd's type parameter bounds [INPUT <: org.apache.spark.sql.execution.command.DataWritingCommand]
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala:32: object OptimizedCreateHiveTableAsSelectCommand is not a member of package org.apache.spark.sql.hive.execution
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala:43: not found: type OptimizedCreateHiveTableAsSelectCommand
[ERROR] [Error] /home/mithunr/workspace/dev/spark-plugin/2/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala:32: Unused import

Apache Spark has removed OptimizedCreateHiveTableAsSelectCommand on master:
apache/spark#39263

I should check if this is being addressed elsewhere.

@mythrocks (Collaborator, Author) commented:

@jlowe: Thank you for the earlier suggestion to reorder the expressions/execs in GpuOverrides.scala. That would have worked, were it not for the package structure.

Unfortunately, HiveTableScanExec cannot be replaced directly from the shims (or GpuOverrides), because it is package-private to org.apache.spark.sql.hive. It needs to be replaced from the HiveProvider. It is easier to pipe the choice of HiveProvider through the shim, and replace it there.

@mythrocks (Collaborator, Author) commented:

Build

@mythrocks mythrocks marked this pull request as ready for review January 10, 2023 08:29
revans2 previously approved these changes Jan 10, 2023
@jlowe (Member) commented Jan 10, 2023

> Unfortunately, HiveTableScanExec cannot be replaced directly from the shims (or GpuOverrides), because it is package-private to org.apache.spark.sql.hive

That does not prevent shims from overriding HiveTableScanExec. Shims can, and do, have sources that live in different packages. See the sources under org.apache.spark (rather than com.nvidia.spark) in the various shims, as one example. Since this override meta would be essentially private to the shim, it can live in whatever package it needs to.
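
A sketch of that layout (the object, method, and file path below are hypothetical; HiveTableScanExec and SparkPlan are the real Spark classes): a source file compiled under the org.apache.spark.sql.hive package tree can see the package-private exec, so the override meta can live inside the shim itself.

```scala
// Hypothetical file placed in a shim's source tree, e.g.
// .../scala/org/apache/spark/sql/hive/rapids/shims/HiveScanAccessSketch.scala
package org.apache.spark.sql.hive.rapids.shims

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.hive.execution.HiveTableScanExec

object HiveScanAccessSketch {
  // Visible here because this package sits under org.apache.spark.sql.hive,
  // which is what the private[hive] access modifier requires.
  def isHiveTableScan(plan: SparkPlan): Boolean = plan.isInstanceOf[HiveTableScanExec]
}
```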

If a distro shim attempts to provide a replacement for Hive execs, e.g.
`HiveTableScanExec`, it is likely to fail, because `GpuOverrides` currently
applies the replacement rules derived from the HiveProvider last.

This commit changes that order, so that the active shim has the final say
in exec replacements. This should allow shims to selectively replace, or
disable the replacement of, Hive operators.
@mythrocks (Collaborator, Author) commented:

Build

@mythrocks mythrocks changed the title from "Allow Shims to specify HiveProvider implementation." to "Allow Shims to replace Hive execs" Jan 10, 2023
@mythrocks (Collaborator, Author) commented:

I have modified this patch based on @jlowe's suggestion. I agree that this works a little better:

  1. The change is now minuscule.
  2. The shim developer is not exposed to HiveProvider, and may replace Hive execs as they see fit. Less coupling all around.

I have updated the title and description to match the change.

An example of how a shim (say for CDH) may modify the replacement rules for a Hive exec (say HiveTableScanExec) is provided here.
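
Since that linked example is not reproduced in this thread, here is only a rough sketch of the general shape (hypothetical types and reason string; the real plugin has its own rule machinery): the distro shim contributes its own decision for HiveTableScanExec, and because shim rules are merged last, that decision is the one that sticks.

```scala
sealed trait ScanDecision
case object ReplaceWithGpuScan extends ScanDecision
final case class KeepOnCpu(reason: String) extends ScanDecision

object DistroShimHiveRulesSketch {
  val commonRules: Map[String, ScanDecision] =
    Map("HiveTableScanExec" -> ReplaceWithGpuScan)

  // A distro shim (CDH, in the example above) might keep the scan on the CPU,
  // recording a reason that could be surfaced when explaining the plan.
  val shimRules: Map[String, ScanDecision] =
    Map("HiveTableScanExec" -> KeepOnCpu("custom Hive input format on this distro"))

  // Shim rules are merged after the common rules, so the shim's decision wins.
  val effective: Map[String, ScanDecision] = commonRules ++ shimRules
}
```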

@mythrocks mythrocks merged commit eaab737 into NVIDIA:branch-23.02 Jan 10, 2023
@mythrocks (Collaborator, Author) commented:

Thank you for the review, @revans2, @jlowe. I have merged this change.

@sameerz sameerz added the task label (Work required that improves the product but is not user facing) Jan 19, 2023