Handle minimum GPU architecture supported [databricks] #10540

Merged: 14 commits, Mar 15, 2024

@parthosa (Collaborator) commented Mar 4, 2024

Fixes #10430. This PR ensures that Spark RAPIDS jobs are executed on supported GPU architectures without relying on manual configuration.

Changes:

  1. Reads the gpu_architectures property from the *version-info.properties file generated by the native builds.
  2. Verifies that the job is running on a GPU architecture supported by the cuDF and JNI libraries, and throws an exception if the architecture is unsupported.
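
The two changes above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the names GpuArchCheckSketch, supportedArchs, deviceArch and checkArch are hypothetical, and both the ';'-separated format of gpu_architectures and the use of the intersection of the cuDF and JNI sets are assumptions. Only the property name and the wording of the error message are taken from this PR.

```scala
import java.io.StringReader
import java.util.Properties

object GpuArchCheckSketch {
  // Assumption: gpu_architectures is a ';'-separated list of plain integers, e.g. "75;80;86".
  def supportedArchs(props: Properties): Set[Int] =
    Option(props.getProperty("gpu_architectures"))
      .getOrElse(throw new RuntimeException("gpu_architectures property not found"))
      .split(";")
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.toInt)
      .toSet

  // Encodes compute capability major.minor as an integer, e.g. 6.1 becomes 61.
  def deviceArch(major: Int, minor: Int): Int = major * 10 + minor

  // Assumption: a device is acceptable when it is at least as new as the oldest
  // architecture supported by both libraries.
  def checkArch(major: Int, minor: Int, cudfProps: Properties, jniProps: Properties): Unit = {
    val supported = supportedArchs(cudfProps).intersect(supportedArchs(jniProps))
    if (supported.isEmpty) {
      throw new RuntimeException("cuDF and JNI share no supported GPU architecture")
    }
    val arch = deviceArch(major, minor)
    val minSupported = supported.min
    if (arch < minSupported) {
      throw new RuntimeException(
        s"Device architecture $arch is unsupported. Minimum supported architecture: $minSupported.")
    }
  }

  def main(args: Array[String]): Unit = {
    // In-memory stand-in for a *version-info.properties file.
    val props = new Properties()
    props.load(new StringReader("gpu_architectures=75;80;86"))

    checkArch(7, 5, props, props) // supported: returns without throwing
    try checkArch(6, 1, props, props) // unsupported: older than the minimum
    catch { case e: RuntimeException => println(e.getMessage) }
  }
}
```

Throwing from the check matches the behavior described in this PR, where the executor plugin shuts down on an unsupported device instead of continuing with a warning.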

Testing

Tested on a Dataproc VM with an NVIDIA P4 GPU (GPU architecture 6.1):

24/03/06 17:44:58 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
24/03/06 17:45:10 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.RuntimeException: Device architecture 61 is unsupported. Minimum supported architecture: 75.
        at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitectureInternal(Plugin.scala:366)
        at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitecture(Plugin.scala:375)
        at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:461)

Related PR

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added the feature request New feature or request label Mar 4, 2024
@parthosa parthosa self-assigned this Mar 4, 2024
kuhushukla previously approved these changes Mar 4, 2024
jlowe previously approved these changes Mar 4, 2024
…hitectures

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@jlowe jlowe marked this pull request as draft March 7, 2024 14:46
@jlowe jlowe marked this pull request as ready for review March 7, 2024 14:48
jlowe previously approved these changes Mar 7, 2024
@jlowe (Member) left a comment

Converted to draft since this

gerashegalov previously approved these changes Mar 7, 2024
@parthosa (Collaborator, Author) commented Mar 7, 2024

Need to wait for a new artefact for spark-rapids-jni

kuhushukla previously approved these changes Mar 7, 2024
@gerashegalov gerashegalov changed the title from "Handle minimum CUDA architecture supported" to "Handle minimum CUDA architecture supported [databricks]" Mar 8, 2024
@gerashegalov gerashegalov self-requested a review March 12, 2024 12:29
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@parthosa (Collaborator, Author) commented

build

@parthosa parthosa requested a review from jlowe March 12, 2024 15:43
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
gerashegalov previously approved these changes Mar 12, 2024
@gerashegalov (Collaborator) left a comment

LGTM

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@jlowe (Member) left a comment

This is looking good, but I agree with @gerashegalov that this should have at least some tests. Refactoring checkGpuArchitecture to take the property set and the GPU major/minor architecture versions would make it easier to mock and test various scenarios of the core logic.
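
To illustrate the suggestion above: a check that receives the properties and the major/minor versions as plain arguments can be exercised without a GPU or a properties file on disk. The sketch below is hypothetical and not taken from this PR. The names GpuArchCheckScenarios, check and propsWith are invented for the example, and the ';'-separated format of gpu_architectures is an assumption.

```scala
import java.util.Properties
import scala.util.Try

object GpuArchCheckScenarios {
  // Hypothetical pure check: every input is an argument, so nothing needs to be mocked.
  // Assumption: gpu_architectures is a ';'-separated list of integers.
  def check(props: Properties, major: Int, minor: Int): Unit = {
    val raw = props.getProperty("gpu_architectures")
    require(raw != null, "gpu_architectures property is missing")
    val minSupported = raw.split(";").map(_.trim.toInt).min
    val arch = major * 10 + minor
    if (arch < minSupported) {
      throw new RuntimeException(
        s"Device architecture $arch is unsupported. Minimum supported architecture: $minSupported.")
    }
  }

  // Builds an in-memory property set for a scenario.
  def propsWith(archs: String): Properties = {
    val p = new Properties()
    p.setProperty("gpu_architectures", archs)
    p
  }

  def main(args: Array[String]): Unit = {
    val props = propsWith("75;80;86")
    assert(Try(check(props, 7, 5)).isSuccess)            // exactly the minimum
    assert(Try(check(props, 9, 0)).isSuccess)            // newer than every listed architecture
    assert(Try(check(props, 6, 1)).isFailure)            // older than the minimum
    assert(Try(check(new Properties(), 8, 0)).isFailure) // property missing
    println("all scenarios behaved as expected")
  }
}
```

Keeping the device query outside the check means each scenario is a single function call with literal inputs.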

gerashegalov previously approved these changes Mar 12, 2024
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
jlowe previously approved these changes Mar 14, 2024
gerashegalov previously approved these changes Mar 14, 2024
@gerashegalov (Collaborator) left a comment

LGTM, thank you @parthosa for working through all the issues.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa dismissed stale reviews from gerashegalov and jlowe via 8673f2a March 14, 2024 22:31
@gerashegalov (Collaborator) commented

build

@gerashegalov (Collaborator) left a comment

LGTM

@parthosa (Collaborator, Author) commented

Thank you @gerashegalov and @jlowe

@gerashegalov gerashegalov merged commit 79c2a3b into NVIDIA:branch-24.04 Mar 15, 2024
43 checks passed
Labels: feature request
Linked issue: [FEA] Error out when running on an unsupported GPU architecture