
[BUG] the driver is trying to load CUDA with latest 22.02 #4421

Closed
abellina opened this issue Dec 22, 2021 · 3 comments
Assignees: jlowe
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf), P0 (Must have for release)

Comments

@abellina (Collaborator) commented Dec 22, 2021

I tried running NDS with the latest HEAD (812e463) and I am seeing a dynamic-pruning-related bug where the driver is trying to load the CUDA libraries.

NDS Q9 triggers this:

dynamicpruning-0 21/12/22 15:17:46:392 ERROR NativeDepsLoader: Could not load cudf jni library...
java.io.IOException: Error loading dependencies
        at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:166)
        at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:74)
        at ai.rapids.cudf.ColumnView.<clinit>(ColumnView.java:34)
        at ai.rapids.cudf.HostColumnVectorCore$OffHeapState.cleanImpl(HostColumnVectorCore.java:597)
        at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:109)
        at ai.rapids.cudf.HostColumnVector.close(HostColumnVector.java:114)
        at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.close(RapidsHostColumnVectorCore.java:74)
        at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:48)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableSeq.$anonfun$safeClose$1(implicits.scala:74)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableSeq.$anonfun$safeClose$1$adapted(implicits.scala:71)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableSeq.safeClose(implicits.scala:71)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableArray.safeClose(implicits.scala:92)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:57)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:53)
        at org.apache.spark.sql.rapids.execution.GpuSubqueryBroadcastExec.withResource(GpuSubqueryBroadcastExec.scala:35)
        at org.apache.spark.sql.rapids.execution.GpuSubqueryBroadcastExec.$anonfun$relationFuture$2(GpuSubqueryBroadcastExec.scala:89)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:139)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:137)
        at org.apache.spark.sql.rapids.execution.GpuSubqueryBroadcastExec.$anonfun$relationFuture$1(GpuSubqueryBroadcastExec.scala:75)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: /tmp/cudf2883818619457660902.so: libcuda.so.1: cannot open shared object file: No such file or directory
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:164)
        ... 32 more
Caused by: java.lang.UnsatisfiedLinkError: /tmp/cudf2883818619457660902.so: libcuda.so.1: cannot open shared object file: No such file or directory
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
        at java.lang.Runtime.load0(Runtime.java:810)
        at java.lang.System.load(System.java:1088)
        at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:181)
        at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:195)
        at ai.rapids.cudf.NativeDepsLoader.lambda$loadNativeDeps$1(NativeDepsLoader.java:158)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
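
The stack trace shows a JVM class-initialization chain: closing a purely host-side column vector references ai.rapids.cudf.ColumnView, whose static initializer calls NativeDepsLoader.loadNativeDeps, and the bundled native library has a dynamic dependency on libcuda.so.1, which is not present on a machine without the CUDA driver. The snippet below is a minimal, self-contained illustration of that mechanism only; the class names are hypothetical and it is not cudf source.

// Hypothetical illustration (not cudf code): the first use of a class runs
// its static initializer. If that initializer loads a native library linked
// against libcuda.so.1, then a host-only cleanup path that merely references
// the class ends up trying to pull in the CUDA driver.
public class StaticInitDemo {
    static class GpuBackedOps {
        static {
            // In the real failure this is where System.load(...) of the
            // cuda-linked .so happens; here we only print to show when it runs.
            System.out.println("GpuBackedOps.<clinit> ran "
                + "(this is where libcuda.so.1 would be loaded)");
        }
        static void release(long handle) {
            // pretend to free GPU-side state for the given handle
        }
    }

    public static void main(String[] args) {
        System.out.println("Closing a host-only column...");
        // Merely invoking a static method forces class initialization,
        // mirroring HostColumnVectorCore touching ColumnView.<clinit>.
        GpuBackedOps.release(42L);
    }
}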
abellina added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Dec 22, 2021
jlowe self-assigned this on Dec 22, 2021
jlowe added the P0 (Must have for release) label on Dec 22, 2021
@jlowe (Member) commented Dec 22, 2021

Looks like rapidsai/cudf#9332 regressed in rapidsai/cudf#9485. I'll post a cudf PR to fix.

@jlowe (Member) commented Dec 22, 2021

rapidsai/cudf#9948 restores the cudf fix to prevent HostColumnVectorCore from trying to load ColumnVector when closing host columns.
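
As a hedged sketch of the fix's shape (not the actual cudf change or API; the class and method names below are hypothetical): the host-side cleanup path frees its buffers without referencing any class whose static initializer loads the GPU-linked native bundle, so closing host columns no longer requires libcuda.so.1.

// Hypothetical sketch, not real cudf code: keep the host cleanup path free
// of any reference to classes whose static initializers load the
// cuda-linked native library.
class HostMemoryOps {
    // Frees host (CPU) memory only; loading this class does not touch the
    // GPU-backed native bundle, so no libcuda.so.1 is required.
    static void freeHostBuffer(long address) {
        // ... release pinned or pageable host memory here ...
    }
}

class HostColumnCleaner implements AutoCloseable {
    private long dataAddress;

    HostColumnCleaner(long dataAddress) {
        this.dataAddress = dataAddress;
    }

    @Override
    public void close() {
        if (dataAddress != 0) {
            // No reference to a GPU-initializing class (e.g. ColumnView),
            // so closing a host column cannot trigger CUDA driver loading.
            HostMemoryOps.freeHostBuffer(dataAddress);
            dataAddress = 0;
        }
    }
}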

jlowe added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label on Dec 22, 2021
sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jan 4, 2022
@jlowe (Member) commented Jan 5, 2022

The cudf fix has been merged.

jlowe closed this as completed on Jan 5, 2022