[QST] Dealing with executors that are stuck in the barrier #453

Open
an-ys opened this issue Sep 27, 2023 · 11 comments

an-ys commented Sep 27, 2023

I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors always get stuck in the barrier when the cuML function is called (i.e., I get (6 + 2)/8, (4 + 4)/8, and (5 + 3)/8, where 2, 4, and 3 executors are stuck in LinReg, KMeans, and PCA respectively). I also tried running a KMeans application that deals with a large amount of data, so I do not think the problem is related to the small dataset.
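
For reference, the hang happens inside the estimator's fit() call. A minimal sketch of what I am running (assuming the spark_rapids_ml estimator API, which mirrors Spark ML; the path and column names here are illustrative, the real script is the example in this repo's Python directory):

from pyspark.sql import SparkSession
from spark_rapids_ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/dataset")  # illustrative path

# num_workers matches the total GPU count (2 nodes x 4 GPUs = 8). fit()
# launches a barrier stage with one task per GPU; that barrier is where
# tasks get stuck at e.g. (6 + 2)/8.
lr = LinearRegression(num_workers=8).setFeaturesCol("features").setLabelCol("label")
model = lr.fit(df)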

I checked the logs for an executor that successfully ran the task and an executor that got stuck. The executor that got stuck did initialize cuML. These logs are from running the LinReg example in the Python directory of this repo. The executors that are stuck show RUNNING | NODE_LOCAL as their status, while the successful executors show SUCCESS | PROCESS_LOCAL.

I am using Spark RAPIDS ML branch-23.10 (daedfe56edae33c565af5e06179e992cf8fec93e and f651978), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan V GPUs.

Successful Executor
23/09/27 19:42:59 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:42:59 INFO TorrentBroadcast: Reading broadcast variable 3 took 13 ms
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 192, boot = -749, init = 941, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 203, boot = -723, init = 926, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 4.0 in stage 4.0 (TID 389). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 36.0 in stage 4.0 (TID 421). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 440
23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 220, boot = -692, init = 912, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 443
23/09/27 19:43:00 INFO Executor: Running task 58.0 in stage 4.0 (TID 443)
23/09/27 19:43:00 INFO Executor: Finished task 44.0 in stage 4.0 (TID 429). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 446
23/09/27 19:43:00 INFO Executor: Running task 61.0 in stage 4.0 (TID 446)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 238, boot = -679, init = 917, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = -767, init = 1006, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 12.0 in stage 4.0 (TID 397). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 20.0 in stage 4.0 (TID 405). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 453
23/09/27 19:43:00 INFO Executor: Running task 68.0 in stage 4.0 (TID 453)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 454
23/09/27 19:43:00 INFO Executor: Running task 69.0 in stage 4.0 (TID 454)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 280, boot = -698, init = 978, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 28.0 in stage 4.0 (TID 413). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 466
23/09/27 19:43:00 INFO Executor: Running task 81.0 in stage 4.0 (TID 466)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 159, boot = -7, init = 166, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 164, boot = -14, init = 178, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 58.0 in stage 4.0 (TID 443). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 473
23/09/27 19:43:00 INFO Executor: Running task 88.0 in stage 4.0 (TID 473)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 474
23/09/27 19:43:00 INFO Executor: Running task 89.0 in stage 4.0 (TID 474)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 173, boot = -3, init = 176, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 68.0 in stage 4.0 (TID 453). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 479
23/09/27 19:43:00 INFO Executor: Running task 94.0 in stage 4.0 (TID 479)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 244, boot = -4, init = 248, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 61.0 in stage 4.0 (TID 446). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 194, boot = 8, init = 186, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 489
23/09/27 19:43:00 INFO Executor: Finished task 81.0 in stage 4.0 (TID 466). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Running task 104.0 in stage 4.0 (TID 489)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 249, boot = -5, init = 254, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 494
23/09/27 19:43:00 INFO Executor: Running task 109.0 in stage 4.0 (TID 494)
23/09/27 19:43:00 INFO Executor: Finished task 69.0 in stage 4.0 (TID 454). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 499
23/09/27 19:43:00 INFO Executor: Running task 114.0 in stage 4.0 (TID 499)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 215, boot = 1, init = 214, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 89.0 in stage 4.0 (TID 474). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 507
23/09/27 19:43:00 INFO Executor: Running task 122.0 in stage 4.0 (TID 507)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 272, boot = 15, init = 256, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 88.0 in stage 4.0 (TID 473). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 6, init = 233, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 515
23/09/27 19:43:00 INFO Executor: Running task 130.0 in stage 4.0 (TID 515)
23/09/27 19:43:00 INFO Executor: Finished task 94.0 in stage 4.0 (TID 479). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 519
23/09/27 19:43:00 INFO Executor: Running task 134.0 in stage 4.0 (TID 519)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 240, boot = -7, init = 247, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 114.0 in stage 4.0 (TID 499). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 274, boot = 0, init = 274, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 259, boot = -7, init = 266, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 536
23/09/27 19:43:00 INFO Executor: Running task 151.0 in stage 4.0 (TID 536)
23/09/27 19:43:00 INFO Executor: Finished task 104.0 in stage 4.0 (TID 489). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 109.0 in stage 4.0 (TID 494). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 537
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 538
23/09/27 19:43:00 INFO Executor: Running task 152.0 in stage 4.0 (TID 537)
23/09/27 19:43:00 INFO Executor: Running task 153.0 in stage 4.0 (TID 538)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 269, boot = 9, init = 260, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 122.0 in stage 4.0 (TID 507). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 547
23/09/27 19:43:00 INFO Executor: Running task 162.0 in stage 4.0 (TID 547)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 246, boot = -10, init = 256, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 134.0 in stage 4.0 (TID 519). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 560
23/09/27 19:43:00 INFO Executor: Running task 175.0 in stage 4.0 (TID 560)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 297, boot = 6, init = 290, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 130.0 in stage 4.0 (TID 515). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 568
23/09/27 19:43:00 INFO Executor: Running task 183.0 in stage 4.0 (TID 568)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 241, boot = 3, init = 238, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 151.0 in stage 4.0 (TID 536). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 570
23/09/27 19:43:00 INFO Executor: Running task 185.0 in stage 4.0 (TID 570)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 7, init = 232, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 152.0 in stage 4.0 (TID 537). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 571
23/09/27 19:43:00 INFO Executor: Running task 186.0 in stage 4.0 (TID 571)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 258, boot = 14, init = 244, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 153.0 in stage 4.0 (TID 538). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 574
23/09/27 19:43:01 INFO Executor: Running task 189.0 in stage 4.0 (TID 574)
23/09/27 19:43:01 INFO PythonRunner: Times: total = 215, boot = 15, init = 200, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 162.0 in stage 4.0 (TID 547). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 162, boot = -6, init = 168, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 185.0 in stage 4.0 (TID 570). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 230, boot = -5, init = 235, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 175.0 in stage 4.0 (TID 560). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 154, boot = 0, init = 154, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 189.0 in stage 4.0 (TID 574). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 244, boot = 15, init = 229, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 183.0 in stage 4.0 (TID 568). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 219, boot = 7, init = 212, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 186.0 in stage 4.0 (TID 571). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:57306
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=0) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124) for /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 2: UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 0: UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124)
23/09/27 19:43:02 INFO CoarseGrainedExecutorBackend: Got assigned task 577
23/09/27 19:43:02 INFO Executor: Running task 0.0 in stage 6.0 (TID 577)
23/09/27 19:43:02 INFO MapOutputTrackerWorker: Updating epoch to 4 and clearing cache
23/09/27 19:43:02 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 13.2 KiB, free 47.8 GiB)
23/09/27 19:43:02 INFO TorrentBroadcast: Reading broadcast variable 4 took 7 ms
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 29.0 KiB, free 47.8 GiB)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 2, fetching them
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 0 (0.0 B) local and 1 (97.0 B) host-local and 0 (0.0 B) push-merged-local and 1 (97.0 B) remote blocks
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<worker_ip_from_non_master_node>:38739 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 21 ms
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<master_ip>:36401 after 1 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 48.927407 ms
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 19.687474 ms
23/09/27 19:43:04 INFO Executor: Finished task 0.0 in stage 6.0 (TID 577). 4021 bytes result sent to driver
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 584
23/09/27 19:43:05 INFO Executor: Running task 6.0 in stage 9.0 (TID 584)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 17 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
23/09/27 19:43:05 INFO CodeGenerator: Code generated in 14.846185 ms
23/09/27 19:43:05 INFO PythonRunner: Times: total = 139, boot = -4325, init = 4464, finish = 0
23/09/27 19:43:05 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6740 bytes result sent to driver
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=3) started
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 2
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=2) started
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 2: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for 2
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 2
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
	at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 3: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for 3
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 3
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
	at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
	at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Killed Executor
23/09/27 19:42:59 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache
23/09/27 19:43:00 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:00 INFO TransportClientFactory: Successfully created connection to /<master_ip>:32805 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO TorrentBroadcast: Reading broadcast variable 3 took 163 ms
23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:50893
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 1
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=1) started
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=0) started
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=3) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:57878
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:57878) for /<master_ip>:57878
23/09/27 19:43:02 INFO CodeGenerator: Code generated in 370.253869 ms
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 834, init = 307, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1057, boot = 845, init = 212, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1121, boot = 840, init = 281, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1152, boot = 852, init = 300, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1129, boot = 824, init = 305, finish = 0
23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 829, init = 312, finish = 0
23/09/27 19:43:02 INFO Executor: Finished task 29.0 in stage 4.0 (TID 414). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 13.0 in stage 4.0 (TID 398). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 5.0 in stage 4.0 (TID 390). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 37.0 in stage 4.0 (TID 422). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 45.0 in stage 4.0 (TID 430). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO Executor: Finished task 21.0 in stage 4.0 (TID 406). 2090 bytes result sent to driver
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:57878)
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 578
23/09/27 19:43:05 INFO Executor: Running task 2.0 in stage 9.0 (TID 578)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO TransportClientFactory: Successfully created connection to master/<master_ip>:33961 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 20 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:07 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Getting 1 (72.0 B) non-empty blocks including 0 (0.0 B) local and 1 (72.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
23/09/27 19:43:07 INFO CodeGenerator: Code generated in 19.422264 ms
2023-09-27 19:43:08,459 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context
23/09/27 19:43:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0.
23/09/27 19:44:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 60 seconds, current barrier epoch is 0.
23/09/27 19:45:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 120 seconds, current barrier epoch is 0.
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 180 seconds, current barrier epoch is 0.
23/09/27 19:46:31 INFO Executor: Executor is trying to kill task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs
23/09/27 19:46:32 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[accept-connections,5,main]
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:267)
	at org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$3(BarrierTaskContext.scala:94)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.BarrierTaskContext.runBarrier(BarrierTaskContext.scala:94)
	at org.apache.spark.BarrierTaskContext.allGather(BarrierTaskContext.scala:179)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.barrierAndServe(PythonRunner.scala:490)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anon$1.run(PythonRunner.scala:321)
23/09/27 19:46:32 INFO Executor: Executor killed task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs
23/09/27 19:46:32 INFO RapidsBufferCatalog: Closing storage
23/09/27 19:46:32 INFO UCXShuffleTransport: UCX transport closing
23/09/27 19:46:32 WARN UCX: UCX is shutting down
23/09/27 19:46:32 INFO UCX: De-registering UCX 3 memory buffers.
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Here is the spark.conf containing the related options. I tried disabling the UDF-related options (Scala UDF, UDF compiler, etc.), but it did not help.

`spark.conf`
spark.master	spark://master:7077

# Resource-related configs
spark.executor.instances	8
spark.executor.cores	6
spark.executor.memory	80G
spark.driver.memory	80G
spark.executor.memoryOverhead	1G

# Task-related
spark.default.parallelism	192
spark.sql.shuffle.partitions	192
spark.driver.maxResultSize	30G
spark.sql.files.maxPartitionBytes	4096m
# spark.sql.files.maxPartitionBytes	8192m
spark.sql.execution.sortBeforeRepartition false
spark.sql.adaptive.enabled	true

# GPU-related Configs
spark.executor.resource.gpu.amount	1
spark.executor.resource.gpu.discoveryScript	/usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.executor.resources.discoveryPlugin	com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin

spark.plugins	com.nvidia.spark.SQLPlugin
spark.rapids.memory.gpu.debug	STDOUT
spark.rapids.memory.gpu.pool	NONE
spark.rapids.memory.pinnedPool.size	20G
spark.rapids.shuffle.multiThreaded.reader.threads	24
spark.rapids.shuffle.multiThreaded.writer.threads	24
spark.rapids.sql.concurrentGpuTasks	2
spark.rapids.sql.enabled	true
spark.rapids.sql.exec.CollectLimitExec	true
spark.rapids.sql.explain	all
spark.rapids.sql.expression.ScalaUDF	true
spark.rapids.sql.metrics.level	DEBUG
spark.rapids.sql.rowBasedUDF.enabled true
spark.rapids.sql.udfCompiler.enabled	true
spark.shuffle.manager	com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.task.resource.gpu.amount	0.166
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,rc,tcp
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA n
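
For reference, the task/GPU packing implied by these settings (simple arithmetic on the values above):

# spark.executor.cores = 6 and spark.task.resource.gpu.amount = 0.166 ≈ 1/6,
# so up to floor(1 / 0.166) = 6 tasks can share each executor's single GPU;
# 8 executors x 6 task slots = 48 concurrent tasks across the cluster.
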
@abellina

@an-ys

I am sorry your application isn't making progress. I'll chime in from the spark-rapids side, specifically around the shuffle (for now).

I see that you have configured UCX shuffle. Do you have RDMA-capable networking? Or, do you have several GPUs in one box? If you don't have these things, then we recommend the MULTITHREADED shuffle mode (default). Note also that these configs are only for the MULTITHREADED shuffle:

spark.rapids.shuffle.multiThreaded.reader.threads	24
spark.rapids.shuffle.multiThreaded.writer.threads	24

I would start by removing the RapidsShuffleManager to see if that unblocks your job. If it works there, then it's likely either something to do with the UCX shuffle trying to work with your hardware/OS, or a bug in it.
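
Concretely, that would be a change like the following in your spark.conf (a sketch based on the config you posted; option 1 removes the RapidsShuffleManager entirely, option 2 keeps it with the default mode):

# Option 1: fall back to Spark's built-in shuffle
# spark.shuffle.manager	com.nvidia.spark.rapids.spark350.RapidsShuffleManager
# spark.rapids.shuffle.mode	UCX

# Option 2: keep the RapidsShuffleManager but use the default MULTITHREADED mode
spark.shuffle.manager	com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.rapids.shuffle.mode	MULTITHREADED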

Also, if you do get another hang, getting a jstack of the executors is useful. You can access a stack trace under the Spark UI in the Executors tab (click on "Thread Dump").
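
If the UI is not reachable during the hang, the same information can be captured on the stuck node with the standard JDK tools (a sketch; the grep pattern assumes a standalone-mode executor JVM):

jps -lm | grep CoarseGrainedExecutorBackend    # find the executor's pid
jstack <pid> > executor_threads.txt            # dump all thread stacks to a file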

@eordentlich
Collaborator

Also, looks like you may have pasted the successful executor log twice as the logs look identical. Would be great to see the bad one when you get a chance.

an-ys commented Oct 2, 2023

Thanks for your reply. I have fixed the executor logs.

I don't remember setting up RDMA, but the RDMA packages are installed and I can run ucx_perftest with the example from the docs: CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda. (I don't think it's related, but when I tried disabling UCX_IB_GPU_DIRECT_RDMA for ucx_perftest, I got "unused environment variable: UCX_IB_GPU_DIRECT_RDMA (maybe: UCX_IB_GPU_DIRECT_RDMA?)".) The nodes are connected via a switch over regular Ethernet (no InfiniBand).
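
For completeness, ucx_perftest runs as a server/client pair across the two nodes (a sketch, with <node1_ip> standing in for the first node's address):

# on node 1 (server: no host argument)
CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
# on node 2 (client: point it at node 1)
CUDA_VISIBLE_DEVICES=3 ucx_perftest <node1_ip> -t tag_bw -s 10000000 -n 1000 -m cuda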

Removing the multithreaded configs gave me "java.lang.IllegalStateException: The ShuffleBufferCatalog is not initialized but the RapidsShuffleManager is configured". When I switched from UCX mode to MULTITHREADED, I stopped getting that error, but the job still hangs like before. When I changed back to UCX again later on, UCX worked without the multithreaded configs.

I tried running the LinReg and KMeans examples again with an updated configuration in which I commented out most of the shuffle-related settings. Here is the new configuration, omitting the configs for the history server, Python/Java, and timeout values.

New Config
spark.master  spark://master:7077

# Resource-related configs
spark.executor.instances  8
spark.executor.cores  6
spark.executor.memory 80G
spark.driver.memory 80G
spark.executor.memoryOverhead 1G

# Task-related
spark.default.parallelism 192
spark.sql.shuffle.partitions  192
spark.driver.maxResultSize  30G
spark.sql.files.maxPartitionBytes 4096m
spark.sql.execution.sortBeforeRepartition false
spark.sql.adaptive.enabled  true

# GPU-related Configs
spark.executor.resource.gpu.amount  1
spark.executor.resource.gpu.discoveryScript /usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.executor.resources.discoveryPlugin  com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin

spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.memory.gpu.debug STDOUT
spark.rapids.memory.gpu.pool  NONE
spark.rapids.memory.pinnedPool.size 20G
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.enabled  true
spark.rapids.sql.explain  all
spark.rapids.sql.expression.ScalaUDF  true
spark.rapids.sql.metrics.level  DEBUG
spark.rapids.sql.rowBasedUDF.enabled  true
spark.rapids.sql.udfCompiler.enabled  true
spark.task.resource.gpu.amount  0.166

The KMeans example has no killed executors, but 4 executors, all on the same node, hang. Also, there are two pending stages: stage 7 with 192 tasks and stage 8 with 1 task, both with the same description (javaToPython at NativeMethodAccessorImpl.java:0).

org.apache.spark.sql.Dataset.javaToPython(Dataset.scala:4151)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.base/java.lang.reflect.Method.invoke(Method.java:568)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
py4j.ClientServerConnection.run(ClientServerConnection.java:106)
java.base/java.lang.Thread.run(Thread.java:833)

The status for stage 9, the barrier stage, is (4 + 4) / 8 on the CLI, but "3/8 (5 running)" on the Spark UI. It is the same for linear regression, with stages 7 and 8 pending and stage 9 active at (6 + 2) / 8. The executors were not killed, so there is no thread dump. Interestingly, the executors that hang always have the same indices on the Spark UI (2 and 3).
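
For context on what the barrier is waiting for: a barrier stage only proceeds once every one of its tasks has entered the sync, so a few unscheduled tasks block all the others. Roughly, in terms of the plain Spark barrier API (a sketch, not the actual spark-rapids-ml code):

from pyspark import BarrierTaskContext

def barrier_fn(iterator):
    ctx = BarrierTaskContext.get()
    # allGather() blocks until every task of the barrier stage reaches this
    # point; with only 5 of 8 tasks running, those 5 wait indefinitely,
    # matching the "waiting under the global sync" log lines.
    ctx.allGather("ready")
    yield from iterator

# usage: rdd.barrier().mapPartitions(barrier_fn).collect()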

Executor with SUCCESS state
23/10/03 04:52:20 INFO Executor: Running task 6.0 in stage 9.0 (TID 584)
23/10/03 04:52:20 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/10/03 04:52:21 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.0 KiB, free 47.8 GiB)
23/10/03 04:52:21 INFO TorrentBroadcast: Reading broadcast variable 5 took 11 ms
23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 57.3 KiB, free 47.8 GiB)
23/10/03 04:52:23 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/10/03 04:52:23 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:42159)
23/10/03 04:52:23 INFO MapOutputTrackerWorker: Got the map output locations
23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
23/10/03 04:52:23 INFO CodeGenerator: Code generated in 20.623421 ms
23/10/03 04:52:23 INFO PythonRunner: Times: total = 163, boot = -4851, init = 5014, finish = 0
23/10/03 04:52:23 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6826 bytes result sent to driver
Executor with RUNNING state
23/10/03 04:52:20 INFO Executor: Running task 2.0 in stage 9.0 (TID 578)
23/10/03 04:52:21 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/10/03 04:52:21 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/10/03 04:52:21 INFO TransportClientFactory: Successfully created connection to /<master_ip>:35029 after 3 ms (0 ms spent in bootstraps)
23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.0 KiB, free 47.8 GiB)
23/10/03 04:52:21 INFO TorrentBroadcast: Reading broadcast variable 5 took 39 ms
23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 57.3 KiB, free 47.8 GiB)
23/10/03 04:52:23 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/10/03 04:52:23 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:42159)
23/10/03 04:52:23 INFO MapOutputTrackerWorker: Got the map output locations
23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Getting 1 (80.0 B) non-empty blocks including 0 (0.0 B) local and 1 (80.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms
23/10/03 04:52:23 INFO TransportClientFactory: Successfully created connection to /<master_ip>:34293 after 2 ms (0 ms spent in bootstraps)
23/10/03 04:52:23 INFO CodeGenerator: Code generated in 19.602622 ms
2023-10-03 04:52:24,307 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context
23/10/03 04:52:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0.
23/10/03 04:53:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 60 seconds, current barrier epoch is 0.
23/10/03 04:54:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 120 seconds, current barrier epoch is 0.
23/10/03 04:55:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 180 seconds, current barrier epoch is 0.
23/10/03 04:56:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 240 seconds, current barrier epoch is 0.
23/10/03 04:57:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 300 seconds, current barrier epoch is 0.
23/10/03 04:58:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 360 seconds, current barrier epoch is 0.

I thought the executors were not killed because the job was not using UCX, but when using UCX again, the 2 executors were still not killed. I am not sure if that is because I updated the RAPIDS packages on both servers before attempting to run the examples again. This time, the two executors got stuck on the other node for linear regression, which did not happen before, and that node has its log4j level set to TRACE, so there is more information.

Executor with SUCCESS state
23/10/03 05:30:04 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 182B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 b6 03 56 7f 26 69 be cb 8a |.........V.&i...|
|00000010| a1 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB|
|00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro|
|00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov|
|000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...|
|000000b0| 00 00 00 00 02 01                               |......          |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=6232742644918487713,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 4096)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 2
23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 2
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2_piece0
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2_piece0 of size 9556 dropped from memory (free 51350724480)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2_piece0
23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 5985864061522782634 to master/<master_ip>:33881 took 2 ms
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 68B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 44 04 53 12 0f a6 44 d7 a9 |.......D.S...D..|
|00000010| aa 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |..../....sr..jav|
|00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. |
|00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu|
|00000040| 65 78 70 01                                     |exp.            |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=5985864061522782634,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 2048)]]
23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_2_piece0
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_2_piece0
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2 of size 19984 dropped from memory (free 51350744464)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 2, response is 0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=6232742644918487713,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 154B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 9a 03 74 22 73 59 a3 e9 19 |.........t"sY...|
|00000010| b8 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T|
|00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff|
|00000090| 6c 65 49 64 78 70 00 00 00 01                   |leIdxp....      |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=8368377886447049144,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 2048)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 182B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 b6 03 40 01 9c 7e f4 09 bc |.........@..~...|
|00000010| b3 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB|
|00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro|
|00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov|
|000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...|
|000000b0| 00 00 00 00 04 01                               |......          |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=4612139562478189747,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 1024)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 1
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 4
23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 4
23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 4, response is 0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=4612139562478189747,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/<master_ip>:33881
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 1, response is true
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=8368377886447049144,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 154B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 9a 03 68 c8 77 89 5a fa 62 |.........h.w.Z.b|
|00000010| 24 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |$.......master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T|
|00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff|
|00000090| 6c 65 49 64 78 70 00 00 00 00                   |leIdxp....      |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=7550416207107220004,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 1024)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 182B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 b6 03 4f 7d 27 9d e0 33 d3 |.........O}'..3.|
|00000010| 28 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |(.......master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB|
|00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro|
|00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov|
|000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...|
|000000b0| 00 00 00 00 03 01                               |......          |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 0
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=5727777860109980456,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 512)]]
23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 0 from shuffle buffer catalog
23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 0
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 3
23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 3
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3 of size 19984 dropped from memory (free 51350764448)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3_piece0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 0, response is true
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3_piece0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3_piece0 of size 9556 dropped from memory (free 51350774004)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3_piece0
23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=7550416207107220004,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 8609448405931561762 to master/<master_ip>:33881 took 1 ms
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ: 68B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 44 04 77 7a e7 b5 db 9a 93 |.......D.wz.....|
|00000010| 22 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |".../....sr..jav|
|00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. |
|00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu|
|00000040| 65 78 70 01                                     |exp.            |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=8609448405931561762,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 512)]]
23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0
23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_3_piece0
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 3, response is 0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/<worker_ip_from_other_node>:35312 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=5727777860109980456,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/<master_ip>:33881
 
Executor with RUNNING state
2023-10-03 05:30:05,390 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context
23/10/03 05:30:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0.
23/10/03 05:30:08 TRACE BarrierTaskContext: Current callSite: CallSite($anonfun$runBarrier$2 at Logging.scala:68,org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$2(BarrierTaskContext.scala:61)
org.apache.spark.internal.Logging.logTrace(Logging.scala:68)
org.apache.spark.internal.Logging.logTrace$(Logging.scala:67)
org.apache.spark.BarrierTaskContext.logTrace(BarrierTaskContext.scala:43)
org.apache.spark.BarrierTaskContext.runBarrier(BarrierTaskContext.scala:61)
org.apache.spark.BarrierTaskContext.allGather(BarrierTaskContext.scala:179)
org.apache.spark.api.python.BasePythonRunner$WriterThread.barrierAndServe(PythonRunner.scala:490)
org.apache.spark.api.python.BasePythonRunner$WriterThread$$anon$1.run(PythonRunner.scala:321))
23/10/03 05:30:08 TRACE TransportClient: Sending RPC to master/<master_ip>:33881
23/10/03 05:30:08 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 1062]
23/10/03 05:30:08 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:30:08 TRACE TransportClient: Sending request RPC 8875612564211632373 to master/<master_ip>:33881 took 2 ms
23/10/03 05:31:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 60 seconds, current barrier epoch is 0.
23/10/03 05:32:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 120 seconds, current barrier epoch is 0.
23/10/03 05:33:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 180 seconds, current barrier epoch is 0.
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 182B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 b6 03 7f 04 b1 1a 18 87 58 |...............X|
|00000010| 0b 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB|
|00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro|
|00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov|
|000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...|
|000000b0| 00 00 00 00 02 01                               |......          |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9152635068362479627,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 16384)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 2
23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 2
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2_piece0
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2_piece0 of size 9556 dropped from memory (free 51350724480)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2_piece0
23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 7156121186595844841 to master/<master_ip>:33881 took 1 ms
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 68B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 44 04 63 4f a6 4d f0 b1 32 |.......D.cO.M..2|
|00000010| e9 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |..../....sr..jav|
|00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. |
|00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu|
|00000040| 65 78 70 01                                     |exp.            |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=7156121186595844841,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 8192)]]
23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_2_piece0
23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_2_piece0
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2 of size 19984 dropped from memory (free 51350744464)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 2, response is 0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9152635068362479627,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 154B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 9a 03 47 c0 63 47 8c a2 90 |.........G.cG...|
|00000010| ee 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T|
|00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff|
|00000090| 6c 65 49 64 78 70 00 00 00 01                   |leIdxp....      |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=5170241531174621422,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 8192)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 182B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 b6 03 7e 26 0d 61 e3 f2 d3 |.........~&.a...|
|00000010| 0b 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB|
|00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro|
|00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov|
|000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...|
|000000b0| 00 00 00 00 04 01                               |......          |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9089967611981255435,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 4096)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 1
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 4
23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 4
23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 4, response is 0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9089967611981255435,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/<master_ip>:33881
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 1, response is true
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=5170241531174621422,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 154B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 9a 03 7e 30 ec 87 fe e7 e9 |.........~0.....|
|00000010| b1 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T|
|00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff|
|00000090| 6c 65 49 64 78 70 00 00 00 00                   |leIdxp....      |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9093027716502383025,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 4096)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 0
23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 0 from shuffle buffer catalog
23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 0
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 182B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 b6 03 7e bc 9a a7 9c 90 c2 |.........~......|
|00000010| ea 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..|
|00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage|
|00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr|
|00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa|
|00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block|
|00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$|
|00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB|
|00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro|
|00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov|
|000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...|
|000000b0| 00 00 00 00 03 01                               |......          |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9132344189077471978,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 2048)]]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 3
23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 3
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3 of size 19984 dropped from memory (free 51350764448)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3
23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3_piece0
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3_piece0
23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3_piece0 of size 9556 dropped from memory (free 51350774004)
23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3_piece0
23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 9140343881297110183 to master/<master_ip>:33881 took 1 ms
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ: 68B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 44 04 7e d9 06 55 19 a1 24 |.......D.~..U..$|
|00000010| a7 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |..../....sr..jav|
|00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. |
|00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu|
|00000040| 65 78 70 01                                     |exp.            |
+--------+-------------------------------------------------+----------------+
23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=9140343881297110183,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 2048)]]
23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0
23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_3_piece0
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 0, response is true
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] READ COMPLETE
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 3, response is 0
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47]
23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9093027716502383025,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/<master_ip>:33881
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81]
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/<worker_ip_from_other_node>:35324 - R:master/<master_ip>:33881] FLUSH
23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9132344189077471978,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/<master_ip>:33881
23/10/03 05:34:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 240 seconds, current barrier epoch is 0.
23/10/03 05:35:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 300 seconds, current barrier epoch is 0.
23/10/03 05:36:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 360 seconds, current barrier epoch is 0.
23/10/03 05:37:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 420 seconds, current barrier epoch is 0.
23/10/03 05:38:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 480 seconds, current barrier epoch is 0.
23/10/03 05:39:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 540 seconds, current barrier epoch is 0. 

For some reason, the cuML context is only initialized on the executors that end up stuck in the RUNNING state (or get killed); the executors that complete never log it.

@an-ys
Author

an-ys commented Oct 2, 2023

Hmm, so the examples worked after increasing the size of the dataset, even for UCX. I'm not sure why my application, which uses a large dataset, did not work then; it might be a different issue.
The problem with my application is that the status is "0/8" during the barrier stage, and I keep getting messages that there are zero slots available. I tried repartitioning to a smaller number of partitions, but it didn't help. I will post another comment once I get the application fixed, because it has stopped working altogether.
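
For reference, the repartitioning I tried looked roughly like this (a sketch; the partition count and num_workers values are illustrative, not my exact settings):

from spark_rapids_ml.regression import LinearRegression

# Fewer partitions than the 8 executors, so each barrier task should still get data.
df_small = df.repartition(4)
model = LinearRegression(num_workers=4).fit(df_small)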

@eordentlich
Collaborator

Thanks for the additional updates, and glad there are some signs of it working. Looking back at your previous executor logs, it is actually the executors without the "Initializing cuml context" log statement that are problematic: somehow they completely bypass execution of the core spark-rapids-ml code yet still complete their respective barrier tasks. I don't see how that could happen, and it would be very interesting for us to reproduce if possible.

When you get back to this, please also share the worker and master startup configs. It looks like you are running Spark standalone mode.

@an-ys
Author

an-ys commented Oct 13, 2023

Sorry for the late reply. Yes, I am using Spark Standalone mode.

I have the following "spark-env.sh" on each node:

spark-env.sh
JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64/ # Just updated this today
HADOOP_CONF_DIR=/home/<username>/fr/hadoop-3.3/etc/hadoop
SPARK_LOCAL_IP=master
unset SPARK_REMOTE
SPARK_MASTER_HOST=master
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=4 -Dspark.worker.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh"
export PYSPARK_PYTHON=/home/<username>/mambaforge/envs/$CONDA_DEFAULT_ENV/bin/python3
export PYSPARK_DRIVER_PYTHON=/home/<username>/mambaforge/envs/$CONDA_DEFAULT_ENV/bin/python3

The initial configuration for each node is almost identical to my app's spark.conf:

spark-defaults.conf
spark.master spark://master:7077
spark.sql.adaptive.enabled true 
spark.executor.instances      8
spark.executor.cores          6 
spark.default.parallelism     24
spark.sql.shuffle.partitions  24 

spark.executor.memory 80G
spark.driver.memory   80G 
spark.executor.memoryOverhead 1G
spark.sql.files.maxPartitionBytes 512m
spark.driver.maxResultSize 30G 

spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.logConf true 

spark.executor.heartbeatInterval 10000s
spark.network.timeout 10001s
spark.sql.broadcastTimeout 10000

I am not sure if it is intended behaviour to hang when the dataset is too small to reach every executor. I tried updating Spark RAPIDS ML to the newest version containing the #464 commit, but it still does not work. As mentioned before, if I use the LinReg example as-is from the Python README.md with df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))], ["label", "weight", "features"]), it hangs, but with df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))]*4, ["label", "weight", "features"]), the application runs.
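
Spelled out for readability, the two calls are:

from pyspark.ml.linalg import Vectors

# Hangs in the barrier stage with 8 executors:
df = spark.createDataFrame(
    [(1.0, 2.0, Vectors.dense(1.0, 0.0)),
     (0.0, 2.0, Vectors.dense(0.0, 1.0))],
    ["label", "weight", "features"])

# Runs to completion once the rows are replicated:
df = spark.createDataFrame(
    [(1.0, 2.0, Vectors.dense(1.0, 0.0)),
     (0.0, 2.0, Vectors.dense(0.0, 1.0))] * 4,
    ["label", "weight", "features"])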

For my applications, I noticed that I can get them to work now when using "multi-threaded mode" for the shuffle manager instead of UCX. Also, I am not sure if this is related, but I am getting java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877 whenever I call df.count(), even before upgrading from JDK 17 to JDK 21 and with both nodes updated to the same Spark RAPIDS commit. The application runs if I remove the df.count() line, so I don't think the barrier issue is related to this, but it does suggest that something is wrong with my RAPIDS environment.

Anyway, I noticed that when I run my PCA application with a small dataset (8.1 KB per hdfs dfs -du -s -h, though the input size shown in the Spark UI is much smaller), the application hangs at the barrier stage. With two larger datasets (97.9 KB and 7.7 MB) it gets past the barrier. However, I noticed that the barrier stage sits at "(0 + 8) / 8" for about 25 s with the 97.9 KB dataset and about 15 s with the 7.7 MB dataset, and then the stage suddenly ends without any errors.
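
As a sanity check on partitioning, something like the following (a rough sketch; df is the input DataFrame) shows how rows land across partitions before calling fit:

sizes = df.rdd.glom().map(len).collect()
print(sizes)  # rows per partition, e.g. [1, 1, 0, 0, ...]
print(sum(1 for s in sizes if s == 0), "empty partitions")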

@eordentlich
Collaborator

I am not sure if it is intended behaviour to hang when the dataset is too small to reach every executor. I tried updating Spark RAPIDS ML to the newest version containing the #464 commit, but it still does not work. As mentioned before, if I use the LinReg example as-is from the Python README.md with df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))], ["label", "weight", "features"]), it hangs, but with df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))]*4, ["label", "weight", "features"]), the application runs.

Currently the expected/intended behavior with empty partitions during training is for the tasks receiving no data to raise exceptions, which should fail the whole barrier stage doing the training. It is strange, and would be a bug, that this doesn't seem to be happening in your case; I'm not able to reproduce it for some reason. I'm also not able to reproduce barrier tasks not logging anything spark-rapids-ml related before exiting. That would mean the spark-rapids-ml udf is not being invoked at all for the partition for that task. I've attempted to test whether Spark has an optimization that avoids invoking mapInPandas on partitions with no data, but so far I'm not able to trigger this on some toy examples, if it is even the case.
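
The kind of toy test mentioned above can be sketched as follows (illustrative, not the exact code used): each invocation of the udf prints its partition id to executor stdout, so partitions that never invoke the udf show up by omission.

from pyspark import TaskContext

def tag(batches):
    # Executor-side: this prints only if the udf is actually invoked.
    print("udf invoked for partition", TaskContext.get().partitionId())
    for batch in batches:
        yield batch

df = spark.range(2).repartition(8)        # 2 rows spread over 8 partitions
df.mapInPandas(tag, df.schema).collect()  # check which partitions logged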

Also, I am not sure if this is related, but I am getting java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877 whenever I call df.count(), even before upgrading from JDK 17 to JDK 21 and with both nodes updated to the same Spark RAPIDS commit. The application runs if I remove the df.count() line, so I don't think the barrier issue is related to this, but it does suggest that something is wrong with my RAPIDS environment.

The spark-rapids plugin recommends JDK8. See https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#apache-spark-setup-for-gpu. @abellina is that still the case? In any case, looks like you might be running into different JDK versions being invoked in different places within the application run.

However, I noticed that the barrier stage sits at "(0 + 8) / 8" for about 25 s with the 97.9 KB dataset and about 15 s with the 7.7 MB dataset, and then the stage suddenly ends without any errors.

This is actually the normal behavior. The notation means that all 8 barrier tasks are running, and it is during this time that the GPUs carry out the distributed computation and communicate directly with each other. If it is stuck at, say, "(6 + 2) / 8", then 6 barrier tasks exited for some reason without syncing and two are hanging. That is problematic and would be a bug, even with empty data partitions, as mentioned above.
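
To make the barrier semantics concrete, here is a toy barrier job (a sketch; sc is the SparkContext). allGather() blocks until every task in the stage reaches it, so if some tasks exit early the rest wait, exactly like the "waiting under the global sync" messages in the logs above.

from pyspark import BarrierTaskContext

def sync(rows):
    ctx = BarrierTaskContext.get()
    ctx.allGather(str(ctx.partitionId()))  # blocks until all tasks arrive
    yield sum(1 for _ in rows)

sc.parallelize(range(8), 8).barrier().mapPartitions(sync).collect()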

@abellina

Also, I am not sure if this is related, but I am getting java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877 whenever I call df.count(), even before upgrading from JDK 17 to JDK 21 and with both nodes updated to the same Spark RAPIDS commit. The application runs if I remove the df.count() line, so I don't think the barrier issue is related to this, but it does suggest that something is wrong with my RAPIDS environment.

The spark-rapids plugin recommends JDK8. See https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#apache-spark-setup-for-gpu. @abellina is that still the case? In any case, looks like you might be running into different JDK versions being invoked in different places within the application run.

The spark-rapids plugin is being tested with JDK 8, 11, and 17.

I think an issue here is likely that the Java the executor is seeing is different from the one the driver is seeing. Make sure the spark-rapids jar is exactly the same in all places.

@an-ys
Author

an-ys commented Oct 20, 2023

Thanks for your replies. I downgraded my JDK version to 17, since I compiled Spark RAPIDS with JDK 17 (it uses the Security Manager), and I was also facing a "Found an unexpected context classloader" issue when using Scala Spark.

After downgrading to JDK 17 and updating RAPIDS to 23.12.00a, I am no longer seeing the java.io.InvalidClassException on PySpark.

For the original issue, my guess is that the problem comes from building Spark RAPIDS incorrectly. I tried using the JARs from both mvn verify and mvn install, but I am using mvn install for now since I cannot get mvn verify to work (Maven tries to look up the snapshot version of spark-rapids-jni in a repository, and I have not built it yet).
Anyway, I noticed that I can run the PySpark applications, but I cannot run spark-shell because it cannot find the OptimizerPlugin. I am not sure why it works for PySpark but no longer works on Scala Spark, or whether it is related to this issue on PySpark, since I do not face any other issues on PySpark aside from the problem with the empty partitions.

Full Error Log from launching spark-shell
Spark Executor Command: "/usr/lib/jvm/java-17-openjdk-amd64//bin/java" "-cp" "/home/<username>/.m2/repository/com/nvidia/rapids-4-spark_2.12/23.12.0-SNAPSHOT/rapids-4-spark_2.12-23.12.0-SNAPSHOT-cuda12.jar:/home/<username>/.m2/repository/org/apache/mahout/mahout-core/0.9/mahout-core-0.9.jar:/home/<username>/.m2/repository/org/apache/mahout/mahout-math/0.9/mahout-math-0.9.jar:<CUSTOM_JAR>:/home/<username>/fr/spark-3.5.0/conf/:/home/<username>/fr/spark-3.5/assembly/target/scala-2.12/jars/*:/home/<username>/fr/hadoop-3.3/etc/hadoop/" "-Xmx81920M" "-Dspark.network.timeout=10001s" "-Dspark.history.ui.port=18080" "-Dspark.driver.port=40191" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "-Duser.timezone=UTC -Dai.rapids.cudf.nvtx.enabled=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-UseGCOverheadLimit" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@master:40191" "--executor-id" "0" "--hostname" "<master_ip>" "--cores" "6" "--app-id" "app-20231020180216-0010" "--worker-url" "spark://Worker@<master_ip>:39051" "--resourceProfileId" "0" "--resourcesFile" "/home/<username>/fr/spark-3.5/work/app-20231020180216-0010/0/resource-executor-5513666133497122425.json"
========================================

23/10/20 18:02:18 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 61111@titanv3
23/10/20 18:02:18 INFO SignalUtils: Registering signal handler for TERM
23/10/20 18:02:18 INFO SignalUtils: Registering signal handler for HUP
23/10/20 18:02:18 INFO SignalUtils: Registering signal handler for INT
23/10/20 18:02:18 INFO SecurityManager: Changing view acls to: <username>
23/10/20 18:02:18 INFO SecurityManager: Changing modify acls to: <username>
23/10/20 18:02:18 INFO SecurityManager: Changing view acls groups to: 
23/10/20 18:02:18 INFO SecurityManager: Changing modify acls groups to: 
23/10/20 18:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: <username>; groups with view permissions: EMPTY; users with modify permissions: <username>; groups with modify permissions: EMPTY
23/10/20 18:02:19 INFO TransportClientFactory: Successfully created connection to master/<master_ip>:40191 after 93 ms (0 ms spent in bootstraps)
23/10/20 18:02:19 INFO SecurityManager: Changing view acls to: <username>
23/10/20 18:02:19 INFO SecurityManager: Changing modify acls to: <username>
23/10/20 18:02:19 INFO SecurityManager: Changing view acls groups to: 
23/10/20 18:02:19 INFO SecurityManager: Changing modify acls groups to: 
23/10/20 18:02:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: <username>; groups with view permissions: EMPTY; users with modify permissions: <username>; groups with modify permissions: EMPTY
23/10/20 18:02:19 INFO TransportClientFactory: Successfully created connection to master/<master_ip>:40191 after 2 ms (0 ms spent in bootstraps)
23/10/20 18:02:19 INFO DiskBlockManager: Created local directory at /tmp/spark-15c1c3d7-e779-414a-a3db-f397a5550c73/executor-ce33b3e1-b1ee-4ec2-b7e5-d2069a9e9731/blockmgr-4502e0c2-bf08-48d8-994d-e639c75137f8
23/10/20 18:02:19 INFO MemoryStore: MemoryStore started with capacity 47.8 GiB
23/10/20 18:02:20 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@master:40191
23/10/20 18:02:20 INFO WorkerWatcher: Connecting to worker spark://Worker@<master_ip>:39051
23/10/20 18:02:20 INFO TransportClientFactory: Successfully created connection to /<master_ip>:39051 after 3 ms (0 ms spent in bootstraps)
23/10/20 18:02:20 INFO WorkerWatcher: Successfully connected to spark://Worker@<master_ip>:39051
23/10/20 18:02:20 INFO ResourceUtils: ==============================================================
23/10/20 18:02:20 INFO ResourceUtils: Custom resources for spark.executor:
gpu -> [name: gpu, addresses: 0]
23/10/20 18:02:20 INFO ResourceUtils: ==============================================================
23/10/20 18:02:20 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
23/10/20 18:02:20 INFO Executor: Starting executor ID 0 on host <master_ip>
23/10/20 18:02:20 INFO Executor: OS info Linux, 6.5.0-9-generic, amd64
23/10/20 18:02:20 INFO Executor: Java version 17.0.9-ea
23/10/20 18:02:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42907.
23/10/20 18:02:20 INFO NettyBlockTransferService: Server created on <master_ip>:42907
23/10/20 18:02:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/10/20 18:02:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, <master_ip>, 42907, None)
23/10/20 18:02:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, <master_ip>, 42907, None)
23/10/20 18:02:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, <master_ip>, 42907, None)
23/10/20 18:02:20 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
23/10/20 18:02:20 INFO Executor: Using REPL class URI: spark://master:40191/classes
23/10/20 18:02:20 INFO Executor: Created or updated repl class loader org.apache.spark.executor.ExecutorClassLoader@61296497 for default.
23/10/20 18:02:20 INFO ShimLoader: Loading shim for Spark version: 3.5.0
23/10/20 18:02:20 INFO ShimLoader: Complete Spark build info: 3.5.0, , , , 2023-10-19T10:35:55Z
23/10/20 18:02:20 INFO ShimLoader: findURLClassLoader hit the Boostrap classloader org.apache.spark.executor.ExecutorClassLoader@61296497, failed to find a mutable classloader!
23/10/20 18:02:20 WARN ShimLoader: Found an unexpected context classloader org.apache.spark.executor.ExecutorClassLoader@61296497. We will try to recover from this, but it may cause class loading problems.
23/10/20 18:02:37 INFO RapidsPluginUtils: RAPIDS Accelerator build: {date=2023-10-20T08:25:10Z, cudf_version=23.12.0-SNAPSHOT, version=23.12.0-SNAPSHOT, user=<username>, branch=branch-23.12, url=https://github.com/NVIDIA/spark-rapids.git, revision=fba1a4b199230fd56a62cc436e08cf3af8415c20}
23/10/20 18:02:37 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: {date=2023-10-19T06:58:21Z, version=23.12.0-SNAPSHOT, user=, branch=HEAD, url=https://github.com/NVIDIA/spark-rapids-jni.git, revision=b511a5900604ec000b5cdfb00a0a7516847f79a7}
23/10/20 18:02:37 INFO RapidsPluginUtils: cudf build: {date=2023-10-19T06:58:21Z, version=23.12.0-SNAPSHOT, user=, branch=HEAD, url=https://github.com/rapidsai/cudf.git, revision=7aa757959b5b597a0258c955fbeadb92cfb2e762}
23/10/20 18:02:37 WARN RapidsPluginUtils: RAPIDS Accelerator 23.12.0-SNAPSHOT using cudf 23.12.0-SNAPSHOT.
23/10/20 18:02:37 INFO TransportClientFactory: Successfully created connection to master/<master_ip>:40191 after 2 ms (0 ms spent in bootstraps)
23/10/20 18:02:37 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.ExceptionInInitializerError
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:363)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:113)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:337)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:178)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:337)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:174)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.OptimizerPlugin
	at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:124)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:467)
	at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
	at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:94)
	at org.apache.spark.sql.rapids.execution.TrampolineUtil$.classForName(TrampolineUtil.scala:171)
	at com.nvidia.spark.rapids.RapidsPluginUtils$.$anonfun$loadExtensions$1(Plugin.scala:235)
	at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
	at com.nvidia.spark.rapids.RapidsPluginUtils$.loadExtensions(Plugin.scala:233)
	at com.nvidia.spark.rapids.RapidsPluginUtils$.getExtraPlugins(Plugin.scala:272)
	at com.nvidia.spark.rapids.RapidsPluginUtils$.<init>(Plugin.scala:85)
	at com.nvidia.spark.rapids.RapidsPluginUtils$.<clinit>(Plugin.scala)
	... 24 more
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.OptimizerPlugin
	at java.base/java.lang.ClassLoader.findClass(ClassLoader.java:723)
	at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:109)
	... 38 more
23/10/20 18:02:37 INFO MemoryStore: MemoryStore cleared
23/10/20 18:02:37 INFO BlockManager: BlockManager stopped
23/10/20 18:03:07 WARN ShutdownHookManager: ShutdownHook '' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
	at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
23/10/20 18:03:07 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
	at java.base/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1464)
	at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)
	at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:199)
	at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:196)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:196)
	at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331)
	at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309)
	at org.apache.spark.SparkEnv.stop(SparkEnv.scala:99)
	at org.apache.spark.executor.Executor.stop(Executor.scala:443)
	at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:90)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

@abellina

@an-ys thanks for the report, we are looking into it (NVIDIA/spark-rapids#9498). It seems to be an issue with our Spark 3.5.0 shim, specific to spark-shell (the pyspark shell and spark-submit don't exhibit this behavior).

Please note we don't recommend using 23.12 unless you are testing some cutting edge feature, as it's not released yet. 23.10 isn't entirely released yet either.

An option if you want to try to build on your own is to set -DallowConventionalDistJar=true. This will sidestep the issue while we get it fixed. I confirmed that adding this to my mvn command built a jar that was loaded successfully by spark 3.5.0 (mvn package -Dbuildver=350 -DskipTests -DallowConventionalDistJar=true).

@eordentlich
Collaborator

@an-ys The original hang issue is due to a combination of empty partitions and an optimization in how spark-rapids etl plugin handles mapInPandas vs baseline Spark (which is what we had used to test empty partitions). See NVIDIA/spark-rapids#9480 . The spark-rapids version of mapInPandas does not execute the udf on empty partitions and hence the spark-rapids-ml barrier is never entered for those tasks, leaving the other tasks (with data) hanging. Note that even after that issue is resolved to have spark-rapids mapInPandas match baseline spark mapInPandas behavior on empty partitions, and thereby avoid hanging, an exception would still be raised currently by spark-rapids-ml in the case of empty partitions.
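
In the meantime, a possible workaround is to make sure no empty partitions reach fit() (a sketch under that assumption; estimator stands for whichever spark-rapids-ml estimator is in use, and 8 for the number of barrier tasks):

n = df.count()
k = min(n, 8)  # never ask for more partitions than rows
# Round-robin repartitioning typically spreads rows over all k partitions,
# though it is not a strict guarantee for very small n.
model = estimator.fit(df.repartition(k))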
