CacheCheck effectively detects cache-related bugs in Spark applications. See the CacheCheck paper for more details.
Enter the main directory and build it by:
mvn package -DskipTests
A runnable jar file "core-1.0-SNAPSHOT.jar" is generated under core/target/.
The trace collection code (i.e., the modification to Spark) is located in instrument/. Specifically, the trace collection code starts with the comment "// Start trace collection in CacheCheck" and ends with the comment "// End trace collection in CacheCheck".
First, you need to replace $SPARK_HOME/core/src/main/scala/org/apache/spark/rdd/RDD.scala with cachecheck/core/instrument/RDD.scala, and replace $SPARK_HOME/core/src/main/scala/org/apache/spark/SparkContext.scala with cachecheck/core/instrument/SparkContext.scala.
We use Spark-2.4.3 in our experiment. If you use another Spark version, you may need to manually add our trace collection code in the proper places, since directly replacing the files may be incompatible.
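When porting the instrumentation by hand, the marker comments make the relevant regions easy to locate mechanically. A minimal sketch (a hypothetical helper, not part of CacheCheck) that extracts the marked blocks from an instrumented source file:

```python
# Hypothetical helper (not shipped with CacheCheck): collect every code block
# that the instrumented files delimit with CacheCheck's begin/end comments.

START = "// Start trace collection in CacheCheck"
END = "// End trace collection in CacheCheck"

def extract_trace_blocks(source: str) -> list[str]:
    """Return each code block found between a START and its matching END marker."""
    blocks, current, inside = [], [], False
    for line in source.splitlines():
        if START in line:
            inside, current = True, []
        elif END in line:
            if inside:
                blocks.append("\n".join(current))
            inside = False
        elif inside:
            current.append(line)
    return blocks

# Made-up instrumented snippet for illustration only.
example = """
val rdd = ...
// Start trace collection in CacheCheck
traceLogger.log("persist", id)
// End trace collection in CacheCheck
"""
print(extract_trace_blocks(example))  # -> ['traceLogger.log("persist", id)']
```

The extracted blocks can then be pasted into the corresponding methods of the target Spark version's RDD.scala and SparkContext.scala.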
Then, you can build the instrumented Spark by running the following command in $SPARK_HOME/:
mvn package -DskipTests
While the application runs on Spark, the instrumented code collects traces and stores them in $SPARK_HOME/trace/.
In our experiment, we use Spark's built-in examples and six word count examples as test cases.
Taking SparkPi as an example, we can run it by the command:
$SPARK_HOME/bin/run-example SparkPi
We provide six word count examples in the wordcount directory. You can copy this directory into $SPARK_HOME/examples/src/main/scala/org/apache/spark/examples, compile the examples module, and run these examples by similar commands, such as:
$SPARK_HOME/bin/run-example wordcount.MissingPersist
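The bug pattern these examples exercise is easy to see outside Spark: an unpersisted RDD that feeds several actions is recomputed from its lineage on every action. A minimal Python simulation (not Spark's API; `LazyData` is a made-up stand-in for an RDD) that counts recomputations:

```python
# Made-up stand-in for an RDD (this is NOT Spark's API): lazily recomputes
# its contents from the "lineage" function unless it has been persisted.
class LazyData:
    def __init__(self, compute):
        self._compute = compute   # the lineage
        self._cache = None
        self.computations = 0     # how many times the lineage ran

    def persist(self):
        self.computations += 1
        self._cache = self._compute()
        return self

    def collect(self):            # an "action"
        if self._cache is not None:
            return self._cache
        self.computations += 1
        return self._compute()

# Missing persist: two actions on the same data recompute the lineage twice.
words = LazyData(lambda: ["spark", "cache", "spark"])
words.collect()
words.collect()
print(words.computations)   # -> 2

# With persist, the lineage runs once and both actions hit the cache.
cached = LazyData(lambda: ["spark", "cache", "spark"]).persist()
cached.collect()
cached.collect()
print(cached.computations)  # -> 1
```

The other bug types in the table below are variations on this theme: unnecessary persist (cached but used by only one action), premature unpersist (evicted before the last use), and lagging unpersist (kept long after the last use).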
In our paper, we mainly ran examples in GraphX, MLLib, and Spark SQL. They can also be run by similar commands, such as
$SPARK_HOME/bin/run-example graphx.ConnectedComponentsExample
Since there are many examples to run, we provide one-click tools for easy configuration and execution. See details in Code Structure.
The detection is performed by:
java -jar cachecheck/core/target/core-1.0-SNAPSHOT.jar $TraceDir $AppName [-d]
$TraceDir is the directory that stores the traces, i.e., $SPARK_HOME/trace/. $AppName is the name of the application, which is usually set in the application code; for SparkPi, the application name is "Spark Pi". -d is an option that enables debug mode: by default, CacheCheck deletes all trace files after detection, so add -d if you want to keep them.
After the detection, a bug report named $AppName.report is generated in $SPARK_HOME/trace/.
CacheCheck mainly has two modules: core and tools. The core module implements the approach introduced in our paper. The tools module provides three tools, ExampleRunner, CachecheckRunner, and Deduplicator, for easy and automatic bug detection. After building CacheCheck, three runnable jars are generated under cachecheck/tools/target/: tools-examplerunner.jar, tools-cachecheckrunner.jar, and tools-deduplicator.jar.
ExampleRunner can automatically run Spark's built-in examples. It requires a configuration file, an xml file like example-list-all.xml in cachecheck/tools/resource, in which you can specify which examples to run.
The execution command is:
java -jar cachecheck/tools/target/tools-examplerunner.jar $ExampleList $SparkDir
$ExampleList is the path of the configuration file. $SparkDir is the base directory of Spark, e.g., $SPARK_HOME.
CacheCheckRunner can automatically analyze all the traces under the same directory and generate the bug reports.
The execution command is:
java -jar cachecheck/tools/target/tools-cachecheckrunner.jar $TraceDir
$TraceDir is the directory where the trace files are located, e.g., $SPARK_HOME/trace.
Deduplicator collects all the bug reports under the same directory, removes duplicated bug reports, and generates a summary bug report. The command is:
java -jar cachecheck/tools/target/tools-deduplicator.jar $ReportDir
$ReportDir is the directory that contains the bug reports. After execution, Deduplicator generates a summary.report file in the same directory.
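Conceptually, deduplication boils down to keying each reported bug by its identifying fields and keeping one report per key. A rough Python sketch of that idea (the real Deduplicator's logic and report format may differ; these dictionary fields are illustrative):

```python
# Hypothetical report entries; the real CacheCheck report format may differ.
reports = [
    {"type": "Missing persist", "location": "KMeans.run()", "rdd": "zippedData"},
    {"type": "Missing persist", "location": "KMeans.run()", "rdd": "zippedData"},
    {"type": "Unnecessary persist", "location": "KMeans.run()", "rdd": "norms"},
]

def deduplicate(reports):
    """Keep one report per (bug type, location, RDD variable) key,
    preserving first-seen order."""
    seen, summary = set(), []
    for r in reports:
        key = (r["type"], r["location"], r["rdd"])
        if key not in seen:
            seen.add(key)
            summary.append(r)
    return summary

print(len(deduplicate(reports)))  # -> 2
```

Duplicates arise naturally here because several test applications exercise the same library code, so the same bug shows up in multiple per-application reports.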
The following table lists the cache-related bugs detected by CacheCheck in our experiments:

ID | Project | Issue ID | Bug type | Location | Related RDD variable | Status | Fixed |
---|---|---|---|---|---|---|---|
1 | MLLib | SPARK-29809 | Missing persist | Word2Vec.fit() | dataset | Confirmed | Yes |
2 | MLLib | SPARK-29810 | Missing persist | RandomForest.run() | retaggedInput | Confirmed | Yes |
3 | MLLib | SPARK-29811 | Missing persist | RandomForestRegressor.train() | oldDataset | Confirmed | Yes |
4 | MLLib | SPARK-29812 | Missing persist | MulticlassificationEvaluator.evaluate() | predictionAndLabels | Confirmed | Yes |
5 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.findFrequentItems() | data | Confirmed | Yes |
6 | MLLib | SPARK-29814 | Missing persist | PCA.fit() | sources | Confirmed | No. Limited effect |
7 | MLLib | SPARK-29815 | Missing persist | CrossValidator.fit() | dataset.toDF.rdd | Confirmed | No. Limited effect |
8 | MLLib | SPARK-29817 | Missing persist | LDAOptimizer.initialize() | docs | Confirmed | Yes |
9 | MLLib | SPARK-29824 | Missing persist | GBTClassifier.train() | trainDataset | Confirmed | Yes |
10 | MLLib | SPARK-29826 | Missing persist | ChiSqSelector.fit() | data | Confirmed | Yes |
11 | MLLib | SPARK-29828 | Missing persist | ALS.train() | ratings | Confirmed | Yes |
12 | MLLib | SPARK-29816 | Missing persist | BinaryClassificationMetrics.recallByThreshold() | scoreAndLabels.combineByKey | Confirmed | No. Limited effect |
13 | MLLib | SPARK-29827 | Missing persist | BisectingKMeans.run() | input | Confirmed | Yes |
14 | MLLib | SPARK-29827 | Unnecessary persist | BisectingKMeans.run() | norms | Confirmed | Yes |
15 | MLLib | SPARK-29827 | Missing persist | BisectingKMeans.run() | assignments | Confirmed | Yes |
16 | MLLib | SPARK-29856 | Unnecessary persist | RandomForest.run() | baggedInput | Confirmed | No. Limited effect |
17 | MLLib | SPARK-29856 | Unnecessary persist | KMeans.fit() | instances | Confirmed | No. Limited effect |
18 | MLLib | SPARK-29856 | Unnecessary persist | BisectingKMeans.run() | indices | Confirmed | No. Limited effect |
19 | MLLib | SPARK-29844 | Missing persist | ALS.train() | userIdAndFactors | Confirmed | No. Difficulty |
20 | MLLib | SPARK-29844 | Missing unpersist | ALS.train() | itemIdAndFactors | Confirmed | No. Difficulty |
21 | MLLib | SPARK-29844 | Premature unpersist | ALS.train() | itemFactors | Confirmed | Yes |
22 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | userInBlocks | Confirmed | Yes |
23 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | userOutBlocks | Confirmed | Yes |
24 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | itemOutBlocks | Confirmed | Yes |
25 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | BlockRatings | Confirmed | Yes |
26 | MLLib | SPARK-29781 | Unnecessary persist | PeriodicCheckpointer.update() | newData | Confirmed | No. Limited effect |
27 | MLLib | SPARK-29781 | Lagging unpersist | PeriodicCheckpointer.update() | newData | Confirmed | No. Limited effect |
28 | MLLib | SPARK-29823 | Unnecessary persist | KMeans.run() | norms | Confirmed | Yes |
29 | MLLib | SPARK-29823 | Missing persist | KMeans.run() | zippedData | Confirmed | Yes |
30 | MLLib | SPARK-29832 | Unnecessary persist | IsotonicRegression.fit() | instances | Confirmed | No. Limited effect |
31 | MLLib | SPARK-29872 | Missing persist | SparkTC.main() | edges | Confirmed | No. Limited effect |
32 | MLLib | SPARK-29873 | Unnecessary persist | SparkTC.main() | tc | Confirmed | No. Limited effect |
33 | MLLib | SPARK-29874 | Missing persist | LogisticRegressionSummary | fMeasure | Confirmed | No. Limited effect |
34 | SQL | SPARK-29875 | Missing persist | SparkSQLExample.main() | peopleDF | Confirmed | No. Difficulty |
35 | SQL | SPARK-29876 | Missing persist | RDDRelation.main() | createDataFrame | Confirmed | No. Difficulty |
36 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.mapVertices() | vertices | Pending | |
37 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.mapVertices() | newEdges | Pending | |
38 | GraphX | SPARK-29878 | Missing persist | ReplicatedVertexView | zipPartitions | Pending | |
39 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.partitionBy() | newEdges | Pending | |
40 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.subgraph() | vertices | Pending | |
41 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.aggregateMessagesWithActiveSet() | vertices | Pending | |
42 | MLLib | SPARK-29878 | Missing unpersist | GraphLoader.edgeListFile() | edges | Pending | |
43 | MLLib | SPARK-29878 | Premature unpersist | FPGrowth.genericFit() | items | Confirmed | No. Limited effect |
44 | SQL | A custom case | Missing persist | CacheTableTest.main() | data | Confirmed | Yes |
45 | SQL | SPARK-30444 | Missing persist | SparkSQLExample | df | Pending | |
46 | MLLib | SPARK-31216 | Lagging unpersist | GradientBoostedTrees.boost() | predErrorCheckpointer | Pending | |
47 | MLLib | SPARK-31216 | Lagging unpersist | GradientBoostedTrees.boost() | input | Pending | |
48 | MLLib | SPARK-29872 | Lagging unpersist | LogisticRegression.train() | instances | Confirmed | No. Limited effect |
49 | MLLib | SPARK-29872 | Lagging unpersist | CrossValidator.fit() | validationDataset | Confirmed | No. Limited effect |
50 | MLLib | SPARK-29872 | Lagging unpersist | TrainValidationSplit.fit() | trainingDataset | Confirmed | No. Limited effect |
51 | MLLib | SPARK-29872 | Lagging unpersist | OneVsRest.fit() | trainingDataset | Confirmed | No. Limited effect |
52 | MLLib | SPARK-31217 | Unnecessary persist | BinaryClassificationMetrics | cumulativeCounts | Pending | |
53 | MLLib | SPARK-29813 | Unnecessary persist | PrefixSpan.run() | dataInternalRepr | Confirmed | No. Limited effect |
54 | MLLib | Fixed in 2.4.4 before submission | Lagging persist | GradientBoostedTrees.boost() | input | Not submitted | Yes |
55 | MLLib | Fixed in 2.4.4 before submission | Missing persist | LDA.fit() | oldData | Not submitted | Yes |
56 | MLLib | Fixed in 2.4.4 before submission | Missing persist | LinearSVC.train() | instances | Not submitted | Yes |
57 | MLLib | SPARK-31218 | Missing persist | BinaryClassificationMetrics.recallByThreshold() | counts | Pending | |
58 | MLLib | SPARK-31217 | Missing persist | RegressionMetrics | summary | Pending | |
59 | MLLib | SPARK-31217 | Missing persist | MulticlassMetrics | predictionAndLabels | Pending | |
60 | MLLib | SPARK-31217 | Missing persist | MultilabelMetrics | predictionAndLabels | Pending | |
61 | MLLib | SPARK-31217 | Missing persist | RankingMetrics | predictionAndLabels | Pending | |
62 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.run() | data | Confirmed | No. Limited effect |
63 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.genFreqPatterns() | data | Confirmed | No. Limited effect |
64 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | graph.vertices.sortBy | Pending | |
65 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | M1 | Pending | |
66 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | lookupTable | Pending | |
67 | kBetweenness | KBETWEENNESS_ISSUE-6 | Missing persist | kBetweenness.aggregateGraphletsBetweennessScores() | vertexKBcgraph | Pending | |
68 | t-SNE | TSNE_ISSUE-14 | Missing persist | MNIST.main() | data | Pending | |
69 | t-SNE | TSNE_ISSUE-14 | Missing persist | X2P.apply() | p_betas | Pending | |
70 | t-SNE | TSNE_ISSUE-14 | Unnecessary persist | X2P.apply() | norm | Pending | |
71 | t-SNE | TSNE_ISSUE-14 | Missing persist | SimpleTSNE.tsne() | P | Pending | |
72 | t-SNE | TSNE_ISSUE-14 | Missing persist | SimpleTSNE.tsne() | dataset | Pending | |