[BUG] Unit tests core dump as below #10499

Closed
NvTimLiu opened this issue Feb 25, 2024 · 5 comments · Fixed by #10501

NvTimLiu commented Feb 25, 2024

Describe the bug

Unit tests core dump on branch-24.04 as shown below (build JDK11-nightly/545):
hs_err_pid2664.log

 - row based group by window handles GpuRetryOOM
 - row-based group by running window handles GpuSplitAndRetryOOM
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x0000000000000000, pid=2664, tid=2755
 #
 # JRE version: OpenJDK Runtime Environment (11.0.21+9) (build 11.0.21+9-post-Ubuntu-0ubuntu120.04)
 # Java VM: OpenJDK 64-Bit Server VM (11.0.21+9-post-Ubuntu-0ubuntu120.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
 # Problematic frame:
 # C  0x0000000000000000
 #
 # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/core.2664)
 #
 # An error report file with more information is saved as:
 # /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/hs_err_pid2664.log
 Compiled method (c1) 1191062 28589  s!   3       ai.rapids.cudf.Scalar$OffHeapState::cleanImpl (133 bytes)
  total in heap  [0x00007f78c1f44910,0x00007f78c1f46678] = 7528
  relocation     [0x00007f78c1f44a88,0x00007f78c1f44c60] = 472
  main code      [0x00007f78c1f44c60,0x00007f78c1f45d60] = 4352
  stub code      [0x00007f78c1f45d60,0x00007f78c1f45e80] = 288
  oops           [0x00007f78c1f45e80,0x00007f78c1f45ea8] = 40
  metadata       [0x00007f78c1f45ea8,0x00007f78c1f45f28] = 128
  scopes data    [0x00007f78c1f45f28,0x00007f78c1f461c0] = 664
  scopes pcs     [0x00007f78c1f461c0,0x00007f78c1f46560] = 928
  dependencies   [0x00007f78c1f46560,0x00007f78c1f46580] = 32
  handler table  [0x00007f78c1f46580,0x00007f78c1f46610] = 144
  nul chk table  [0x00007f78c1f46610,0x00007f78c1f46678] = 104
 #
 # If you would like to submit a bug report, please visit:
 #   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
 # The crash happened outside the Java Virtual Machine in native code.
 # See problematic frame for where to report the bug.
 #
 [INFO] ------------------------------------------------------------------------
 [INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 24.04.0-SNAPSHOT:
 [INFO] 
 [INFO] RAPIDS Accelerator for Apache Spark Root Project ... SUCCESS [ 15.341 s]
 [INFO] rapids-4-spark-jdk-profiles_2.12 ................... SUCCESS [  0.311 s]
 [INFO] rapids-4-spark-shim-deps-parent_2.12 ............... SUCCESS [ 11.861 s]
 [INFO] rapids-4-spark-sql-plugin-api_2.12 ................. SUCCESS [ 24.701 s]
 [INFO] RAPIDS Accelerator for Apache Spark SQL Plugin ..... SUCCESS [01:23 min]
 [INFO] RAPIDS Accelerator for Apache Spark Shuffle Plugin . SUCCESS [  7.602 s]
 [INFO] RAPIDS Accelerator for Apache Spark Scala UDF Plugin SUCCESS [ 39.216 s]
 [INFO] RAPIDS Accelerator for Apache Spark Delta Lake 2.0.x Support SUCCESS [ 13.406 s]
 [INFO] RAPIDS Accelerator for Apache Spark Aggregator ..... SUCCESS [  9.863 s]
 [INFO] Data Generator ..................................... SUCCESS [ 10.006 s]
 [INFO] RAPIDS Accelerator for Apache Spark Distribution ... SUCCESS [01:24 min]
 [INFO] rapids-4-spark-integration-tests_2.12 .............. SUCCESS [ 36.575 s]
 [INFO] RAPIDS Accelerator for Apache Spark Tests .......... FAILURE [20:26 min]
 [INFO] RAPIDS Accelerator for Apache Spark Tools Support .. SKIPPED
 [INFO] ------------------------------------------------------------------------
 [INFO] BUILD FAILURE
 [INFO] ------------------------------------------------------------------------
 [INFO] Total time:  26:03 min
 [INFO] Finished at: 2024-02-25T02:27:11Z
 [INFO] ------------------------------------------------------------------------
 [ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project rapids-4-spark-tests_2.12: There are test failures -> 



-------------------------------------------------------------------------------------



--------------  S U M M A R Y ------------

Command Line: -Dai.rapids.refcount.debug=true -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/tmp -Drapids.shuffle.manager.override=true -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dspark.unsafe.exceptionOnMemoryLeak=true -Dbasedir=/home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests -javaagent:/root/.m2/repository/org/jacoco/org.jacoco.agent/0.8.8/org.jacoco.agent-0.8.8-runtime.jar=destfile=/home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/jacoco.exec,append=true,includes=ai.rapids.cudf.*:com.nvidia.spark.*:org.apache.spark.sql.rapids.*,excludes=spark320.com.nvidia.shaded.spark.* -ea -Xmx4g -Xss4m -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false org.scalatest.tools.Runner -R /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/classes /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/test-classes -o -f /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/surefire-reports/scala-test-output.txt -u /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/surefire-reports/.

Host: AMD EPYC 7313P 16-Core Processor, 32 cores, 60G, Ubuntu 20.04.6 LTS
Time: Sun Feb 25 02:27:10 2024 UTC elapsed time: 1191.061316 seconds (0d 0h 19m 51s)

---------------  T H R E A D  ---------------

Current thread (0x00007f77e8352000):  JavaThread "Cleaner Thread" daemon [_thread_in_native, id=2755, stack(0x00007f780c3f7000,0x00007f780c7f8000)]

Stack: [0x00007f780c3f7000,0x00007f780c7f8000],  sp=0x00007f780c7f6718,  free space=4093k
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 28591  ai.rapids.cudf.Scalar.closeScalar(J)V (0 bytes) @ 0x00007f78c8c06a11 [0x00007f78c8c069c0+0x0000000000000051]
J 28589 c1 ai.rapids.cudf.Scalar$OffHeapState.cleanImpl(Z)Z (133 bytes) @ 0x00007f78c1f457d4 [0x00007f78c1f44c60+0x0000000000000b74]
J 61839 c2 ai.rapids.cudf.MemoryCleaner$CleanerWeakReference.clean()V (42 bytes) @ 0x00007f78cb1d7aa8 [0x00007f78cb1d79e0+0x00000000000000c8]
j  ai.rapids.cudf.MemoryCleaner.lambda$static$0()V+126
j  ai.rapids.cudf.MemoryCleaner$$Lambda$2057.run()V+0
j  java.lang.Thread.run()V+11 java.base@11.0.21
v  ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

Register to memory mapping:

RAX=0x0 is NULL
RBX={method} {0x00007f7898b3b238} 'closeScalar' '(J)V' in 'ai/rapids/cudf/Scalar'
RCX=0x0000000000000002 is an unknown value
RDX=0x0000000000000010 is an unknown value
RSP=0x00007f780c7f6718 is pointing into the stack for thread: 0x00007f77e8352000
RBP=0x00007f778e8c3b10 points into unknown readable memory: 0x00007f77e4daeae8 | e8 ea da e4 77 7f 00 00
RSI=0x00007f740e201400 is an unknown value
RDI=0x00007f73fa3c6d60 points into unknown readable memory: 0x00007f73fad5d170 | 70 d1 d5 fa 73 7f 00 00
R8 =0x0000000000000002 is an unknown value
R9 =0x00007f77e83529f0 points into unknown readable memory: 0x0000000000000006 | 06 00 00 00 00 00 00 00
R10=0x00007f77c04768e0: Java_ai_rapids_cudf_Scalar_closeScalar+0x0000000000000000 in /home/jenkins/agent/workspace/jenkins-JDK11-nightly-545/tests/target/spark320/tmp/cudf5094709651714106938.so at 0x00007f77bf6bc000
R11=0x00007f78db090a70 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
R12=0x0 is NULL
R13=0x00007f780c7f672c is pointing into the stack for thread: 0x00007f77e8352000
R14=0x0 is NULL
R15=0x00007f77e8352000 is a thread

NvTimLiu added the "bug (Something isn't working)" and "? - Needs Triage (Need team to review and classify)" labels on Feb 25, 2024
NvTimLiu commented:

Another similar, but not identical, core dump log: rapids_nightly-dev-github/1064
hs_err_pid281499.log

 - row-based group by running window handles GpuSplitAndRetryOOM
 24/02/25 15:34:40.251 Cleaner Thread ERROR Scalar: A SCALAR WAS LEAKED(ID: 1516040 7f02e48fb5b0)
 24/02/25 15:34:40.253 Cleaner Thread ERROR MemoryCleaner: Leaked scalar (ID: 1516040): 2024-02-25 15:34:38.0666 UTC: INC
 java.lang.Thread.getStackTrace(Thread.java:1564)
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x00007f0700000003, pid=185769, tid=0x00007f07b0d7e700
 #
 # JRE version: OpenJDK Runtime Environment (8.0_392-b08) (build 1.8.0_392-8u392-ga-1~20.04-b08)
 # Java VM: OpenJDK 64-Bit Server VM (25.392-b08 mixed mode linux-amd64 compressed oops)
 # Problematic frame:
 ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:341)
 ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
 ai.rapids.cudf.Scalar.incRefCount(Scalar.java:540)
 ai.rapids.cudf.Scalar.<init>(Scalar.java:528)
 ai.rapids.cudf.ColumnView.getScalarElement(ColumnView.java:4002)
 com.nvidia.spark.rapids.window.BatchedRunningWindowBinaryFixer.updateState(GpuWindowExpression.scala:1120)
 com.nvidia.spark.rapids.window.BatchedRunningWindowBinaryFixer.fixUp(GpuWindowExpression.scala:1142)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$fixUpAll$2(GpuRunningWindowExec.scala:113)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$fixUpAll$2$adapted(GpuRunningWindowExec.scala:109)
 scala.collection.immutable.Range.foreach(Range.scala:158)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$fixUpAll$1(GpuRunningWindowExec.scala:109)
 com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:126)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.fixUpAll(GpuRunningWindowExec.scala:108)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$6(GpuRunningWindowExec.scala:154)
 com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:84)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$5(GpuRunningWindowExec.scala:137)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$4(GpuRunningWindowExec.scala:135)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:272)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$3(GpuRunningWindowExec.scala:135)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:57)
 # C  [cudf2235933746603852361.so+0x2182d003]
 #
 # Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-1064-w2-1064/tests/core or core.185769
 #
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$2(GpuRunningWindowExec.scala:129)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$1(GpuRunningWindowExec.scala:128)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
 scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$next$1(GpuRunningWindowExec.scala:214)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.next(GpuRunningWindowExec.scala:213)
 com.nvidia.spark.rapids.WindowRetrySuite.$anonfun$new$19(WindowRetrySuite.scala:208)
 # An error report file with more information is saved as:
 # /home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-1064-w2-1064/tests/hs_err_pid185769.log
 org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
 org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
 org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 org.scalatest.Transformer.apply(Transformer.scala:22)
 org.scalatest.Transformer.apply(Transformer.scala:20)
 org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
 org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
 org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
 org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
 org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
 org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
 org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
 org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
 com.nvidia.spark.rapids.WindowRetrySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(WindowRetrySuite.scala:30)
 org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
 org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
 com.nvidia.spark.rapids.WindowRetrySuite.runTest(WindowRetrySuite.scala:30)
 org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
 org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
 scala.collection.immutable.List.foreach(List.scala:431)
 org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
 org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
 org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
 org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
 org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
 org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
 org.scalatest.Suite.run(Suite.scala:1114)
 org.scalatest.Suite.run$(Suite.scala:1096)
 org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
 org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
 org.scalatest.SuperEngine.runImpl(Engine.scala:535)
 org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
 org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
 org.scalatest.funsuite.AnyFunSuite.run(AnyFunSuite.scala:1564)
 org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
 org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
 scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
 scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
 org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
 org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
 org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:30)
 org.scalatest.Suite.run(Suite.scala:1111)
 org.scalatest.Suite.run$(Suite.scala:1096)
 org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:30)
 org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
 org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
 org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
 scala.collection.immutable.List.foreach(List.scala:431)
 org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
 org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
 org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
 org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
 org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
 org.scalatest.tools.Runner$.main(Runner.scala:775)
 org.scalatest.tools.Runner.main(Runner.scala)
 


NvTimLiu commented Feb 26, 2024

Another similar, but not identical, core dump log: rapids_scala213_nightly-dev-github/100

hs_err_pid2232.log

 - row-based group by running window handles GpuSplitAndRetryOOM
 24/02/25 14:01:45.284 Cleaner Thread ERROR Scalar: A SCALAR WAS LEAKED(ID: 1539416 7f09f162eee0)
 24/02/25 14:01:45.291 Cleaner Thread ERROR MemoryCleaner: Leaked scalar (ID: 1539416): 2024-02-25 14:01:43.0531 UTC: INC
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x00007f089748e323, pid=51289, tid=0x00007f08dc526700
 #
 # JRE version: OpenJDK Runtime Environment (8.0_392-b08) (build 1.8.0_392-8u392-ga-1~20.04-b08)
 # Java VM: OpenJDK 64-Bit Server VM (25.392-b08 mixed mode linux-amd64 compressed oops)
 # Problematic frame:
 java.lang.Thread.getStackTrace(Thread.java:1564)
 ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:341)
 ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
 ai.rapids.cudf.Scalar.incRefCount(Scalar.java:540)
 ai.rapids.cudf.Scalar.<init>(Scalar.java:528)
 ai.rapids.cudf.ColumnView.getScalarElement(ColumnView.java:4002)
 com.nvidia.spark.rapids.window.BatchedRunningWindowBinaryFixer.updateState(GpuWindowExpression.scala:1120)
 com.nvidia.spark.rapids.window.BatchedRunningWindowBinaryFixer.fixUp(GpuWindowExpression.scala:1142)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$fixUpAll$2(GpuRunningWindowExec.scala:113)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$fixUpAll$2$adapted(GpuRunningWindowExec.scala:109)
 scala.collection.immutable.Range.foreach(Range.scala:190)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$fixUpAll$1(GpuRunningWindowExec.scala:109)
 com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:126)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.fixUpAll(GpuRunningWindowExec.scala:108)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$6(GpuRunningWindowExec.scala:154)
 com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:84)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$5(GpuRunningWindowExec.scala:137)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$4(GpuRunningWindowExec.scala:135)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:272)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$3(GpuRunningWindowExec.scala:135)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:57)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$2(GpuRunningWindowExec.scala:129)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$computeRunningAndClose$1(GpuRunningWindowExec.scala:128)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
 com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
 scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.$anonfun$next$1(GpuRunningWindowExec.scala:214)
 com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
 com.nvidia.spark.rapids.window.GpuRunningWindowIterator.next(GpuRunningWindowExec.scala:213)
 com.nvidia.spark.rapids.WindowRetrySuite.$anonfun$new$19(WindowRetrySuite.scala:208)
 org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
 org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
 org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 org.scalatest.Transformer.apply(Transformer.scala:22)
 org.scalatest.Transformer.apply(Transformer.scala:20)
 org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
 org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
 org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
 org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
 org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
 org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
 org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
 org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
 com.nvidia.spark.rapids.WindowRetrySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(WindowRetrySuite.scala:30)
 org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
 org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
 com.nvidia.spark.rapids.WindowRetrySuite.runTest(WindowRetrySuite.scala:30)
 org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
 org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
 scala.collection.immutable.List.foreach(List.scala:333)
 org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
 org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
 org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
 org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
 org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
 org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
 org.scalatest.Suite.run(Suite.scala:1114)
 org.scalatest.Suite.run$(Suite.scala:1096)
 org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
 org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
 org.scalatest.SuperEngine.runImpl(Engine.scala:535)
 org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
 org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
 org.scalatest.funsuite.AnyFunSuite.run(AnyFunSuite.scala:1564)
 org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
 org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
 scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
 org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
 org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
 org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:30)
 org.scalatest.Suite.run(Suite.scala:1111)
 org.scalatest.Suite.run$(Suite.scala:1096)
 org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:30)
 org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
 org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
 # C  [cudf7762830788803101876.so+0xcbb323]  void ce_vtable_builder::_Dealloc_async<rmm::mr::device_memory_resource>(void*, void*, unsigned long, unsigned long, +0x3
 #
 # Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_scala213_nightly-dev-github-100-w1-100/r core.51289
 #
 org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
 scala.collection.immutable.List.foreach(List.scala:333)
 org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
 org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
 org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
 org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
 org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
 org.scalatest.tools.Runner$.main(Runner.scala:775)
 org.scalatest.tools.Runner.main(Runner.scala)
 
 # An error report file with more information is saved as:
 # /home/jenkins/agent/workspace/jenkins-rapids_scala213_nightly-dev-github-100-w1-100/scala2.13/tests/hs_err_pid51289.log

jlowe self-assigned this on Feb 26, 2024

jlowe commented Feb 26, 2024

I can reproduce this by just running the unit tests on Spark 3.3.3.

mvn package -Dbuildver=333

gerashegalov commented:

It is affecting pre-merge #10497.


jlowe commented Feb 26, 2024

This crash occurs because WindowRetrySuite leaks a Scalar instance, and by the time the leak is detected, the RMM instance has already been shut down. That causes us to try to deallocate memory on a memory manager that no longer exists.
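
For context, here is a minimal sketch of that failure mode, assuming a hypothetical test body (this is not the actual WindowRetrySuite code nor the fix in #10501). A Scalar obtained from ColumnView.getScalarElement that is never closed is only freed later by the MemoryCleaner daemon thread, which can run after Rmm.shutdown() during suite teardown and then crash in native code; closing the Scalar deterministically, for example with the Arm.withResource helper seen in the stack traces above, keeps it out of the cleaner's hands. The method names leaky and safe are illustrative only.

import ai.rapids.cudf.ColumnVector
import com.nvidia.spark.rapids.Arm.withResource

// Leak-prone: the Scalar returned by getScalarElement is never closed, so only the
// MemoryCleaner thread frees it later -- possibly after RMM has already been shut down.
def leaky(cv: ColumnVector): Long = {
  val s = cv.getScalarElement(0)
  s.getLong
}

// Deterministic: withResource closes the Scalar as soon as the block finishes,
// well before the suite teardown shuts RMM down.
def safe(cv: ColumnVector): Long =
  withResource(cv.getScalarElement(0)) { s =>
    s.getLong
  }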

sameerz removed the "? - Needs Triage (Need team to review and classify)" label on Feb 26, 2024