This repository has been archived by the owner on Dec 20, 2022. It is now read-only.

Need memory overhead to run Spark RDMA shuffler #30

Closed
tobegit3hub opened this issue Apr 24, 2019 · 4 comments

Comments

@tobegit3hub

We use SparkRDMA by adding libdisni.so and spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar to the spark-submit command, but some tasks always fail with java.io.IOException: getCmEvent() failed. Here is the complete log.

java.io.IOException: getCmEvent() failed
	at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:348)
	at org.apache.spark.shuffle.rdma.RdmaChannel.connect(RdmaChannel.java:309)
	at org.apache.spark.shuffle.rdma.RdmaNode.getRdmaChannel(RdmaNode.java:308)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager.org$apache$spark$shuffle$rdma$RdmaShuffleManager$$getRdmaChannel(RdmaShuffleManager.scala:314)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getRdmaChannelToDriver(RdmaShuffleManager.scala:322)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager$$anon$7.apply(RdmaShuffleManager.scala:349)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager$$anon$7.apply(RdmaShuffleManager.scala:343)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getMapTaskOutputTable(RdmaShuffleManager.scala:342)
	at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.startAsyncRemoteFetches(RdmaShuffleFetcherIterator.scala:185)
	at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.initialize(RdmaShuffleFetcherIterator.scala:325)
	at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.<init>(RdmaShuffleFetcherIterator.scala:85)
	at org.apache.spark.shuffle.rdma.RdmaShuffleReader.read(RdmaShuffleReader.scala:44)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:98)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

It seems that SparkRDMA requires extra memory, and the tasks succeed when we add --conf spark.executor.memoryOverhead=10g to spark-submit. We are confused about why this is needed and how much memory we should add for this kind of job.
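
For reference, our submit command looks roughly like the sketch below; the master, library path, and application name are placeholders rather than our exact values.

```sh
# Rough sketch of the spark-submit invocation (paths and app name are placeholders):
spark-submit \
  --master yarn \
  --driver-library-path /path/to/libdisni \
  --conf spark.executor.extraLibraryPath=/path/to/libdisni \
  --conf spark.driver.extraClassPath=spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
  --conf spark.executor.memoryOverhead=10g \
  our_job.py  # placeholder PySpark application
```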

@petro-rudenko
Member

SparkRDMA uses native off-heap memory, so yes, you need to add the memoryOverhead parameter. The right size for memoryOverhead depends on the total dataset size, the number of executors, and the number of cores per executor.

@tobegit3hub
Author

Thanks @petro-rudenko. Do you have any guidance on setting memoryOverhead? For example, if the shuffle read or shuffle write is 10 GB per executor, do we need about 10 GB of memoryOverhead for SparkRDMA?

@petro-rudenko
Member

@tobegit3hub you would need roughly the shuffle size per executor plus (cores per executor) × spark.shuffle.rdma.MaxBytesInFlight.
E.g. 10 GB + 20 cores × 500 MB ≈ 20 GB.
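
Concretely, for that example the submission would look something like the sketch below (assuming 20 cores per executor and 500 MB in flight, as above; the exact value syntax accepted by the RDMA setting may differ).

```sh
# Overhead estimate from the formula above:
#   10 GB shuffle per executor + 20 cores * 500 MB in flight
#   = 10 GB + 10 GB = ~20 GB
spark-submit \
  --conf spark.executor.cores=20 \
  --conf spark.shuffle.rdma.MaxBytesInFlight=500m \
  --conf spark.executor.memoryOverhead=20g \
  our_job.py  # plus the classpath/library settings from the command above
```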

@tobegit3hub
Copy link
Author

Thanks for the detailed explanation. That helps a lot, @petro-rudenko!
