LCS (short for Least Cost Strategy) is an efficient data eviction strategy for Spark.
As an in-memory distributed computing system, Spark is often used to speed up iterative applications. It caches intermediate data generated by previous iterations into memory, so there is no need to repeat the generation when reusing these data later. This sharing mechanism of caching data in memory makes Spark much faster than other systems. When memory used for caching data reaches the capacity limits, data eviction will be performed to supply space for new data, and the evicted data need to be recovered when they are used again. However, classical strategies do not aware of recovery cost, which could cause system performance degradation. This paper shows that the recovery costs have significant difference in Spark, thus a cost aware eviction strategy can obviously reduces the total recovery cost. To this end, a strategy named LCS is proposed, which gets dependencies information between cache data via analyzing application, and calculates the recovery cost during running. By predicting how many times cache data will be reused and using it to weight the recovery cost, LCS always evicts the data which lead to minimum recovery cost in future. Experimental results show that this approach can achieve better performance when memory space is not sufficient, and reduce 30% to 50% of the total execution time.
Same as Spark, LCS is also built using Apache Maven. To build LCS, run:
build/mvn -DskipTests clean package
More detailed documentation about building Spark is available from the project site, at "Building Spark".
Refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
Running sample programs in the examples
directory for experiment. For example:
./bin/run-example SparkPageRank
will run the PageRank example locally.
You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn-cluster" or "yarn-client" to run on YARN, and "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:
MASTER=spark://host:7077 ./bin/run-example SparkPageRank
Many of the example programs print usage help if no params are given.
[1] Yuanzhen Geng, Xuanhua Shi, Cheng Pei, Hai Jin, and Wenbin Jiang, "LCS: an efficient data eviction strategy for spark", International Journal of Parallel Programming, 45(6), 1285-1297, 2017.