
LCS is an efficient data eviction strategy for Spark, a popular distributed in-memory data processing system. LCS obtains dependency information between cached data by analyzing the application, and calculates the recovery cost at runtime.


CGCL-codes/LCS


LCS

LCS (short for Least Cost Strategy) is an efficient data eviction strategy for Spark.

Introduction

As an in-memory distributed computing system, Spark is often used to speed up iterative applications. It caches intermediate data generated by earlier iterations in memory, so the data do not have to be regenerated when they are reused later. This mechanism of sharing cached data in memory makes Spark much faster than other systems. When the memory used for caching reaches its capacity limit, data are evicted to make room for new data, and the evicted data must be recovered when they are used again. However, classical eviction strategies are not aware of recovery cost, which can degrade system performance. The LCS paper shows that recovery costs differ significantly across cached data in Spark, so a cost-aware eviction strategy can substantially reduce the total recovery cost. To this end, LCS obtains dependency information between cached data by analyzing the application and calculates recovery costs at runtime. By predicting how many times cached data will be reused and using that prediction to weight the recovery cost, LCS always evicts the data that lead to the minimum future recovery cost. Experimental results show that this approach achieves better performance when memory is insufficient, reducing total execution time by 30% to 50%.
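The idea of weighting recovery cost by predicted reuse can be illustrated with a minimal Python sketch. It is not the LCS implementation: the `CachedBlock` record, the `weighted_cost` and `choose_victims` helpers, the greedy selection, and all numbers are hypothetical and chosen only for illustration. The cost model described in the paper may differ, for example in how block size is taken into account.

```python
from dataclasses import dataclass


@dataclass
class CachedBlock:
    """Hypothetical record for one cached block; not a Spark or LCS class."""
    name: str
    size_mb: float           # memory the block occupies
    recovery_cost_s: float   # assumed time to recompute it from its dependencies
    predicted_reuses: int    # assumed number of future accesses


def weighted_cost(block: CachedBlock) -> float:
    """Recovery cost weighted by how often the block is expected to be reused."""
    return block.recovery_cost_s * block.predicted_reuses


def choose_victims(blocks, needed_mb):
    """Greedily pick the blocks with the lowest weighted cost until enough space is freed."""
    victims, freed = [], 0.0
    for block in sorted(blocks, key=weighted_cost):
        if freed >= needed_mb:
            break
        victims.append(block)
        freed += block.size_mb
    return victims


# Illustrative values only.
cache = [
    CachedBlock("rdd_1", size_mb=100, recovery_cost_s=2.0, predicted_reuses=5),
    CachedBlock("rdd_2", size_mb=100, recovery_cost_s=30.0, predicted_reuses=1),
    CachedBlock("rdd_3", size_mb=100, recovery_cost_s=8.0, predicted_reuses=0),
]

victims = choose_victims(cache, needed_mb=150)
print([b.name for b in victims])
```

In this sketch a block that will never be reused has a weighted cost of zero and is evicted first, while a block that is expensive to recompute stays cached even if it is reused only once. A recency-based policy such as LRU would not make that distinction.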

Building LCS

Like Spark, LCS is built using Apache Maven. To build LCS, run:

build/mvn -DskipTests clean package

More detailed documentation about building Spark is available from the project site, at "Building Spark".

Refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Usage

To experiment with LCS, run the sample programs in the examples directory. For example:

./bin/run-example SparkPageRank

will run the PageRank example locally.

You can set the MASTER environment variable when running examples to submit them to a cluster. This can be a mesos:// or spark:// URL, "yarn-cluster" or "yarn-client" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPageRank

Many of the example programs print usage help if no params are given.

Issues about LCS in the Spark forum

"SPARK-14289".

Publications about LCS

[1] Yuanzhen Geng, Xuanhua Shi, Cheng Pei, Hai Jin, and Wenbin Jiang, "LCS: An Efficient Data Eviction Strategy for Spark", International Journal of Parallel Programming, 45(6), 1285-1297, 2017.
