This repository has been archived by the owner on Dec 20, 2022. It is now read-only.

Need memory overhead to run Spark RDMA shuffler #30

Closed
tobegit3hub opened this issue Apr 24, 2019 · 4 comments

Comments

@tobegit3hub

We use SparkRDMA by adding libdisni.so and spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar to the spark-submit command, but some tasks always fail with java.io.IOException: getCmEvent() failed. Here is the complete log.

java.io.IOException: getCmEvent() failed
	at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:348)
	at org.apache.spark.shuffle.rdma.RdmaChannel.connect(RdmaChannel.java:309)
	at org.apache.spark.shuffle.rdma.RdmaNode.getRdmaChannel(RdmaNode.java:308)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager.org$apache$spark$shuffle$rdma$RdmaShuffleManager$$getRdmaChannel(RdmaShuffleManager.scala:314)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getRdmaChannelToDriver(RdmaShuffleManager.scala:322)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager$$anon$7.apply(RdmaShuffleManager.scala:349)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager$$anon$7.apply(RdmaShuffleManager.scala:343)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getMapTaskOutputTable(RdmaShuffleManager.scala:342)
	at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.startAsyncRemoteFetches(RdmaShuffleFetcherIterator.scala:185)
	at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.initialize(RdmaShuffleFetcherIterator.scala:325)
	at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.<init>(RdmaShuffleFetcherIterator.scala:85)
	at org.apache.spark.shuffle.rdma.RdmaShuffleReader.read(RdmaShuffleReader.scala:44)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:98)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

It seems that SparkRDMA requires extra memory, and the tasks succeed when we add --conf spark.executor.memoryOverhead=10g to spark-submit. We are confused about why this is needed and how much memory we should add for this kind of job.
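
For reference, our submit command looks roughly like the sketch below; the master, library path, and application name are placeholders rather than our exact values.

```sh
# Rough sketch of the spark-submit invocation (paths and app name are placeholders):
spark-submit \
  --master yarn \
  --driver-library-path /path/to/libdisni \
  --conf spark.executor.extraLibraryPath=/path/to/libdisni \
  --conf spark.driver.extraClassPath=spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
  --conf spark.executor.memoryOverhead=10g \
  our_job.py  # placeholder PySpark application
```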

@petro-rudenko
Member

SparkRDMA uses native off-heap memory, so yes, you need to add the memoryOverhead parameter. The right size for memoryOverhead depends on the total dataset size, the number of executors, and the number of cores per executor.

@tobegit3hub
Author

Thanks @petro-rudenko. Do you have any guidance on setting memoryOverhead? For example, if the shuffle read or shuffle write is 10 GB per executor, do we need about 10 GB of memoryOverhead for SparkRDMA?

@petro-rudenko
Member

@tobegit3hub you would need roughly the shuffle size per executor plus (cores per executor) × spark.shuffle.rdma.MaxBytesInFlight.
E.g. 10 GB + 20 cores × 500 MB ≈ 20 GB.
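
Concretely, for that example the submission would look something like the sketch below (assuming 20 cores per executor and 500 MB in flight, as above; the exact value syntax accepted by the RDMA setting may differ).

```sh
# Overhead estimate from the formula above:
#   10 GB shuffle per executor + 20 cores * 500 MB in flight
#   = 10 GB + 10 GB = ~20 GB
spark-submit \
  --conf spark.executor.cores=20 \
  --conf spark.shuffle.rdma.MaxBytesInFlight=500m \
  --conf spark.executor.memoryOverhead=20g \
  our_job.py  # plus the classpath/library settings from the command above
```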

@tobegit3hub
Copy link
Author

Thanks for the detailed explanation. That helps a lot, @petro-rudenko!
