CacheCheck effectively detects cache-related bugs in Spark applications. See the CacheCheck paper for more details.
Enter the main directory and build it by:
mvn package -DskipTests
A runnable jar file "core-1.0-SNAPSHOT.jar" is generated under core/target/.
The trace collection code (i.e., the modification to Spark) is located in instrument/. Specifically, the trace collection code starts with the comment "// Start trace collection in CacheCheck" and ends with the comment "// End trace collection in CacheCheck".
First, you need to replace $SPARK_HOME/core/src/main/scala/org/apache/spark/rdd/RDD.scala with cachecheck/core/instrument/RDD.scala, and replace $SPARK_HOME/core/src/main/scala/org/apache/spark/SparkContext.scala with cachecheck/core/instrument/SparkContext.scala.
We use Spark-2.4.3 in our experiment. If you use another Spark version, you may need to manually add our trace collection code in the proper places, since directly replacing the files may be incompatible.
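When porting the instrumentation by hand, the marker comments make the relevant regions easy to locate mechanically. A minimal sketch (a hypothetical helper, not part of CacheCheck) that extracts the marked blocks from an instrumented source file:

```python
# Hypothetical helper (not shipped with CacheCheck): collect every code block
# that the instrumented files delimit with CacheCheck's begin/end comments.

START = "// Start trace collection in CacheCheck"
END = "// End trace collection in CacheCheck"

def extract_trace_blocks(source: str) -> list[str]:
    """Return each code block found between a START and its matching END marker."""
    blocks, current, inside = [], [], False
    for line in source.splitlines():
        if START in line:
            inside, current = True, []
        elif END in line:
            if inside:
                blocks.append("\n".join(current))
            inside = False
        elif inside:
            current.append(line)
    return blocks

# Made-up instrumented snippet for illustration only.
example = """
val rdd = ...
// Start trace collection in CacheCheck
traceLogger.log("persist", id)
// End trace collection in CacheCheck
"""
print(extract_trace_blocks(example))  # -> ['traceLogger.log("persist", id)']
```

The extracted blocks can then be pasted into the corresponding methods of the target Spark version's RDD.scala and SparkContext.scala.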
Then, you can build the instrumented Spark by running the following command in $SPARK_HOME/:
mvn package -DskipTests
While the application runs on Spark, the instrumented code collects traces and stores them in $SPARK_HOME/trace/.
In our experiment, we use Spark's built-in examples and six word count examples as test cases.
Taking SparkPi as an example, we can run it by the command:
$SPARK_HOME/bin/run-example SparkPi
We provide six word count examples in the wordcount directory. You can copy this directory into $SPARK_HOME/examples/src/main/scala/org/apache/spark/examples, compile the examples module, and run these examples by similar commands, such as:
$SPARK_HOME/bin/run-example wordcount.MissingPersist
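The bug pattern these examples exercise is easy to see outside Spark: an unpersisted RDD that feeds several actions is recomputed from its lineage on every action. A minimal Python simulation (not Spark's API; `LazyData` is a made-up stand-in for an RDD) that counts recomputations:

```python
# Made-up stand-in for an RDD (this is NOT Spark's API): lazily recomputes
# its contents from the "lineage" function unless it has been persisted.
class LazyData:
    def __init__(self, compute):
        self._compute = compute   # the lineage
        self._cache = None
        self.computations = 0     # how many times the lineage ran

    def persist(self):
        self.computations += 1
        self._cache = self._compute()
        return self

    def collect(self):            # an "action"
        if self._cache is not None:
            return self._cache
        self.computations += 1
        return self._compute()

# Missing persist: two actions on the same data recompute the lineage twice.
words = LazyData(lambda: ["spark", "cache", "spark"])
words.collect()
words.collect()
print(words.computations)   # -> 2

# With persist, the lineage runs once and both actions hit the cache.
cached = LazyData(lambda: ["spark", "cache", "spark"]).persist()
cached.collect()
cached.collect()
print(cached.computations)  # -> 1
```

The other bug types in the table below are variations on this theme: unnecessary persist (cached but used by only one action), premature unpersist (evicted before the last use), and lagging unpersist (kept long after the last use).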
In our paper, we mainly ran examples in GraphX, MLLib, and Spark SQL. They can also be run by similar commands, such as
$SPARK_HOME/bin/run-example graphx.ConnectedComponentsExample
Since there are many examples to run, we provide one-click tools for easy configuration and execution. See details in Code Structure.
The detection is performed by:
java -jar cachecheck/core/target/core-1.0-SNAPSHOT.jar $TraceDir $AppName [-d]
$TraceDir is the directory that stores the traces, i.e., $SPARK_HOME/trace/. $AppName is the name of the application, which is usually set in the application code; for SparkPi, the application name is "Spark Pi". -d is an option that enables debug mode: by default, CacheCheck deletes all trace files after detection, so add -d if you want to keep them.
After the detection, a bug report named $AppName.report is generated in $SPARK_HOME/trace/.
CacheCheck mainly has two modules: core and tools. The core module implements the approach introduced in our paper. The tools module provides three tools, ExampleRunner, CachecheckRunner, and Deduplicator, for easy and automatic bug detection. After building CacheCheck, three runnable jars are generated under cachecheck/tools/target/: tools-examplerunner.jar, tools-cachecheckrunner.jar, and tools-deduplicator.jar.
ExampleRunner can automatically run Spark's built-in examples. It requires a configuration file, an xml file like example-list-all.xml in cachecheck/tools/resource, in which you can specify which examples to run.
The execution command is:
java -jar cachecheck/tools/target/tools-examplerunner.jar $ExampleList $SparkDir
$ExampleList is the path of the configuration file. $SparkDir is the base directory of Spark, e.g., $SPARK_HOME.
CacheCheckRunner can automatically analyze all the traces under the same directory and generate the bug reports.
The execution command is:
java -jar cachecheck/tools/target/tools-cachecheckrunner.jar $TraceDir
$TraceDir is the directory where the trace files are located, e.g., $SPARK_HOME/trace.
Deduplicator collects all the bug reports under the same directory, removes duplicated bug reports, and generates a summary bug report. The command is:
java -jar cachecheck/tools/target/tools-deduplicator.jar $ReportDir
$ReportDir is the directory that contains the bug reports. After execution, Deduplicator generates a summary.report file in the same directory.
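Conceptually, deduplication boils down to keying each reported bug by its identifying fields and keeping one report per key. A rough Python sketch of that idea (the real Deduplicator's logic and report format may differ; these dictionary fields are illustrative):

```python
# Hypothetical report entries; the real CacheCheck report format may differ.
reports = [
    {"type": "Missing persist", "location": "KMeans.run()", "rdd": "zippedData"},
    {"type": "Missing persist", "location": "KMeans.run()", "rdd": "zippedData"},
    {"type": "Unnecessary persist", "location": "KMeans.run()", "rdd": "norms"},
]

def deduplicate(reports):
    """Keep one report per (bug type, location, RDD variable) key,
    preserving first-seen order."""
    seen, summary = set(), []
    for r in reports:
        key = (r["type"], r["location"], r["rdd"])
        if key not in seen:
            seen.add(key)
            summary.append(r)
    return summary

print(len(deduplicate(reports)))  # -> 2
```

Duplicates arise naturally here because several test applications exercise the same library code, so the same bug shows up in multiple per-application reports.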
The following table lists the cache-related bugs detected by CacheCheck in our experiments:

ID | Project | Issue ID | Bug type | Location | Related RDD variable | Status | Fixed |
---|---|---|---|---|---|---|---|
1 | MLLib | SPARK-29809 | Missing persist | Word2Vec.fit() | dataset | Confirmed | Yes |
2 | MLLib | SPARK-29810 | Missing persist | RandomForest.run() | retaggedInput | Confirmed | Yes |
3 | MLLib | SPARK-29811 | Missing persist | RandomForestRegressor.train() | oldDataset | Confirmed | Yes |
4 | MLLib | SPARK-29812 | Missing persist | MulticlassificationEvaluator.evaluate() | predictionAndLabels | Confirmed | Yes |
5 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.findFrequentItems() | data | Confirmed | Yes |
6 | MLLib | SPARK-29814 | Missing persist | PCA.fit() | sources | Confirmed | No. Limited effect |
7 | MLLib | SPARK-29815 | Missing persist | CrossValidator.fit() | dataset.toDF.rdd | Confirmed | No. Limited effect |
8 | MLLib | SPARK-29817 | Missing persist | LDAOptimizer.initialize() | docs | Confirmed | Yes |
9 | MLLib | SPARK-29824 | Missing persist | GBTClassifier.train() | trainDataset | Confirmed | Yes |
10 | MLLib | SPARK-29826 | Missing persist | ChiSqSelector.fit() | data | Confirmed | Yes |
11 | MLLib | SPARK-29828 | Missing persist | ALS.train() | ratings | Confirmed | Yes |
12 | MLLib | SPARK-29816 | Missing persist | BinaryClassificationMetrics.recallByThreshold() | scoreAndLabels.combineByKey | Confirmed | No. Limited effect |
13 | MLLib | SPARK-29827 | Missing persist | BisectingKMeans.run() | input | Confirmed | Yes |
14 | MLLib | SPARK-29827 | Unnecessary persist | BisectingKMeans.run() | norms | Confirmed | Yes |
15 | MLLib | SPARK-29827 | Missing persist | BisectingKMeans.run() | assignments | Confirmed | Yes |
16 | MLLib | SPARK-29856 | Unnecessary persist | RandomForest.run() | baggedInput | Confirmed | No. Limited effect |
17 | MLLib | SPARK-29856 | Unnecessary persist | KMeans.fit() | instances | Confirmed | No. Limited effect |
18 | MLLib | SPARK-29856 | Unnecessary persist | BisectingKMeans.run() | indices | Confirmed | No. Limited effect |
19 | MLLib | SPARK-29844 | Missing persist | ALS.train() | userIdAndFactors | Confirmed | No. Difficulty |
20 | MLLib | SPARK-29844 | Missing unpersist | ALS.train() | itemIdAndFactors | Confirmed | No. Difficulty |
21 | MLLib | SPARK-29844 | Premature unpersist | ALS.train() | itemFactors | Confirmed | Yes |
22 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | userInBlocks | Confirmed | Yes |
23 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | userOutBlocks | Confirmed | Yes |
24 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | itemOutBlocks | Confirmed | Yes |
25 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | BlockRatings | Confirmed | Yes |
26 | MLLib | SPARK-29781 | Unnecessary persist | PeriodicCheckpointer.update() | newData | Confirmed | No. Limited effect |
27 | MLLib | SPARK-29781 | Lagging unpersist | PeriodicCheckpointer.update() | newData | Confirmed | No. Limited effect |
28 | MLLib | SPARK-29823 | Unnecessary persist | KMeans.run() | norms | Confirmed | Yes |
29 | MLLib | SPARK-29823 | Missing persist | KMeans.run() | zippedData | Confirmed | Yes |
30 | MLLib | SPARK-29832 | Unnecessary persist | IsotonicRegression.fit() | instances | Confirmed | No. Limited effect |
31 | MLLib | SPARK-29872 | Missing persist | SparkTC.main() | edges | Confirmed | No. Limited effect |
32 | MLLib | SPARK-29873 | Unnecessary persist | SparkTC.main() | tc | Confirmed | No. Limited effect |
33 | MLLib | SPARK-29874 | Missing persist | LogisticRegressionSummary | fMeasure | Confirmed | No. Limited effect |
34 | SQL | SPARK-29875 | Missing persist | SparkSQLExample.main() | peopleDF | Confirmed | No. Difficulty |
35 | SQL | SPARK-29876 | Missing persist | RDDRelation.main() | createDataFrame | Confirmed | No. Difficulty |
36 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.mapVertices() | vertices | Pending | |
37 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.mapVertices() | newEdges | Pending | |
38 | GraphX | SPARK-29878 | Missing persist | ReplicatedVertexView | zipPartitions | Pending | |
39 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.partitionBy() | newEdges | Pending | |
40 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.subgraph() | vertices | Pending | |
41 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.aggregateMessagesWithActiveSet() | vertices | Pending | |
42 | MLLib | SPARK-29878 | Missing unpersist | GraphLoader.edgeListFile() | edges | Pending | |
43 | MLLib | SPARK-29878 | Premature unpersist | FPGrowth.genericFit() | items | Confirmed | No. Limited effect |
44 | SQL | A custom case | Missing persist | CacheTableTest.main() | data | Confirmed | Yes |
45 | SQL | SPARK-30444 | Missing persist | SparkSQLExample | df | Pending | |
46 | MLLib | SPARK-31216 | Lagging unpersist | GradientBoostedTrees.boost() | predErrorCheckpointer | Pending | |
47 | MLLib | SPARK-31216 | Lagging unpersist | GradientBoostedTrees.boost() | input | Pending | |
48 | MLLib | SPARK-29872 | Lagging unpersist | LogisticRegression.train() | instances | Confirmed | No. Limited effect |
49 | MLLib | SPARK-29872 | Lagging unpersist | CrossValidator.fit() | validationDataset | Confirmed | No. Limited effect |
50 | MLLib | SPARK-29872 | Lagging unpersist | TrainValidationSplit.fit() | trainingDataset | Confirmed | No. Limited effect |
51 | MLLib | SPARK-29872 | Lagging unpersist | OneVsRest.fit() | trainingDataset | Confirmed | No. Limited effect |
52 | MLLib | SPARK-31217 | Unnecessary persist | BinaryClassificationMetrics | cumulativeCounts | Pending | |
53 | MLLib | SPARK-29813 | Unnecessary persist | PrefixSpan.run() | dataInternalRepr | Confirmed | No. Limited effect |
54 | MLLib | Fixed in 2.4.4 before submission | Lagging persist | GradientBoostedTrees.boost() | input | Not submitted | Yes |
55 | MLLib | Fixed in 2.4.4 before submission | Missing persist | LDA.fit() | oldData | Not submitted | Yes |
56 | MLLib | Fixed in 2.4.4 before submission | Missing persist | LinearSVC.train() | instances | Not submitted | Yes |
57 | MLLib | SPARK-31218 | Missing persist | BinaryClassificationMetrics.recallByThreshold() | counts | Pending | |
58 | MLLib | SPARK-31217 | Missing persist | RegressionMetrics | summary | Pending | |
59 | MLLib | SPARK-31217 | Missing persist | MulticlassMetrics | predictionAndLabels | Pending | |
60 | MLLib | SPARK-31217 | Missing persist | MultilabelMetrics | predictionAndLabels | Pending | |
61 | MLLib | SPARK-31217 | Missing persist | RankingMetrics | predictionAndLabels | Pending | |
62 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.run() | data | Confirmed | No. Limited effect |
63 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.genFreqPatterns() | data | Confirmed | No. Limited effect |
64 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | graph.vertices.sortBy | Pending | |
65 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | M1 | Pending | |
66 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | lookupTable | Pending | |
67 | kBetweenness | KBETWEENNESS_ISSUE-6 | Missing persist | kBetweenness.aggregateGraphletsBetweennessScores() | vertexKBcgraph | Pending | |
68 | t-SNE | TSNE_ISSUE-14 | Missing persist | MNIST.main() | data | Pending | |
69 | t-SNE | TSNE_ISSUE-14 | Missing persist | X2P.apply() | p_betas | Pending | |
70 | t-SNE | TSNE_ISSUE-14 | Unnecessary persist | X2P.apply() | norm | Pending | |
71 | t-SNE | TSNE_ISSUE-14 | Missing persist | SimpleTSNE.tsne() | P | Pending | |
72 | t-SNE | TSNE_ISSUE-14 | Missing persist | SimpleTSNE.tsne() | dataset | Pending | |