This repository has been archived by the owner on Aug 9, 2021. It is now read-only.

Icysandwich/cachecheck


CacheCheck

CacheCheck detects cache-related bugs in Spark applications, such as a missing persist, an unnecessary persist, or a premature or lagging unpersist. See the CacheCheck paper for details.

1. Run CacheCheck

1.1 Build CacheCheck

From the main directory of the repository, build the project with

mvn package -DskipTests

A runnable jar file "core-1.0-SNAPSHOT.jar" is generated under core/target/.

1.2 Instrument Spark

The trace collection code (i.e., the modification to Spark) is located in instrument/. Specifically, each piece of trace collection code starts with the comment "// Start trace collection in CacheCheck" and ends with the comment "// End trace collection in CacheCheck".
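When porting the instrumentation to another Spark version, it can help to list where these marker comments occur. The following is a minimal sketch; the function name list_trace_blocks is hypothetical and not part of CacheCheck.

```shell
# Hypothetical helper: print "file:line:comment" for every marker comment.
# $1: directory holding the instrumented sources (e.g. the instrument/ directory).
list_trace_blocks() {
  grep -rn -F -e "// Start trace collection in CacheCheck" \
              -e "// End trace collection in CacheCheck" "$1"
}
```

Calling list_trace_blocks on the instrument directory prints one line per marker, so each start line should be followed by a matching end line in the same file.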

First, replace $SPARK_HOME/core/src/main/scala/org/apache/spark/rdd/RDD.scala with cachecheck/core/instrument/RDD.scala, and replace $SPARK_HOME/core/src/main/scala/org/apache/spark/SparkContext.scala with cachecheck/core/instrument/SparkContext.scala. We used Spark 2.4.3 in our experiments. If you use another Spark version, directly replacing the files may be incompatible, so you may need to add our trace collection code manually at the corresponding places.
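The two file replacements can be scripted. This is a sketch only: the function name install_instrumented_sources and the ".orig" backup suffix are assumptions made for illustration, and it presumes a Spark source tree matching the version the instrumented files were written for.

```shell
# Sketch: copy the two instrumented files over their Spark counterparts.
# Arguments: $1 = CacheCheck checkout, $2 = Spark source root ($SPARK_HOME).
# The ".orig" backup suffix is an assumption, not a CacheCheck convention.
install_instrumented_sources() {
  src="$1/core/instrument"
  dst="$2/core/src/main/scala/org/apache/spark"
  for pair in "RDD.scala:rdd/RDD.scala" "SparkContext.scala:SparkContext.scala"; do
    from="$src/${pair%%:*}"   # file inside the CacheCheck checkout
    to="$dst/${pair#*:}"      # file inside the Spark source tree
    # Back up the original Spark file the first time only.
    [ -e "$to.orig" ] || cp "$to" "$to.orig" || return 1
    cp "$from" "$to" || return 1
  done
}
```

A call such as install_instrumented_sources cachecheck "$SPARK_HOME" leaves the original files next to the replaced ones, which makes it easy to diff them when a different Spark version is used.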

Then, build the instrumented Spark by running the following command in $SPARK_HOME/:

mvn package -DskipTests

1.3 Collect Traces

While an application runs on the instrumented Spark, the instrumented code collects traces and stores them in $SPARK_HOME/trace/. In our experiments, we used Spark's built-in examples and six word count examples as test cases. For example, SparkPi can be run with the command

$SPARK_HOME/bin/run-example SparkPi

We provide six word count examples in the directory wordcount. Add this directory to $SPARK_HOME/examples/src/main/scala/org/apache/spark/examples, compile the examples module, and run the examples with similar commands, such as

$SPARK_HOME/bin/run-example wordcount.MissingPersist

In our paper, we mainly ran examples in GraphX, MLLib, and Spark SQL. They can also be run by similar commands, such as

$SPARK_HOME/bin/run-example graphx.ConnectedComponentsExample
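Several examples can be run in sequence with a small loop. The sketch below assumes only the run-example launcher shown above; the function name run_examples is hypothetical.

```shell
# Sketch: run a list of Spark examples one after another.
# Arguments: $1 = Spark base directory, remaining args = example names.
run_examples() {
  spark_home="$1"
  shift
  failed=0
  for example in "$@"; do
    # Report a failing example on stderr but keep going with the rest.
    "$spark_home/bin/run-example" "$example" || {
      echo "failed: $example" >&2
      failed=1
    }
  done
  return "$failed"
}
```

For instance, run_examples "$SPARK_HOME" SparkPi wordcount.MissingPersist graphx.ConnectedComponentsExample runs the three examples named in this section.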

Because there are many examples to run, we provide one-click tools for configuration and execution. See Code Structure for details.

1.4 Perform Detection

The detection is performed by

java -jar cachecheck/core/target/core-1.0-SNAPSHOT.jar $TraceDir $AppName [-d]

$TraceDir is the directory that stores the traces, i.e., $SPARK_HOME/trace/. $AppName is the name of the application, which is usually set in the application code. For SparkPi, the application name is Spark Pi. -d is an option that enables debug mode. By default, CacheCheck deletes all trace files after detection. To keep them, add -d.
After the detection, a bug report named $AppName.report is generated in $SPARK_HOME/trace/.
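Because an application name such as Spark Pi contains a space, it has to be quoted in the shell to reach the tool as one argument. The wrapper below is a sketch under that assumption (that the tool expects the name as a single argument); the function name run_detection is hypothetical.

```shell
# Sketch: run CacheCheck on one application and check that a report appeared.
# Arguments: $1 = path to core-1.0-SNAPSHOT.jar, $2 = trace directory,
#            $3 = application name; further args (e.g. -d) are passed through.
run_detection() {
  jar="$1"; trace_dir="$2"; app_name="$3"
  shift 3
  # Quoting keeps a name such as "Spark Pi" as a single argument
  # (assumed to be what the tool expects).
  java -jar "$jar" "$trace_dir" "$app_name" "$@" || return 1
  report="$trace_dir/$app_name.report"
  if [ -f "$report" ]; then
    echo "report written: $report"
  else
    echo "no report found at $report" >&2
    return 1
  fi
}
```

A possible invocation is run_detection cachecheck/core/target/core-1.0-SNAPSHOT.jar "$SPARK_HOME/trace" "Spark Pi" -d, which also keeps the trace files.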

2. Code Structure

CacheCheck has two main modules, core and tools. The core module implements the approach introduced in our paper. The tools module provides three tools, ExampleRunner, CacheCheckRunner, and Deduplicator, for easy and automatic bug detection. After building CacheCheck (see Build CacheCheck), three runnable jars are generated under cachecheck/tools/target/: tools-examplerunner.jar, tools-cachecheckrunner.jar, and tools-deduplicator.jar.

2.1 ExampleRunner

ExampleRunner automatically runs Spark's built-in examples. It requires an XML configuration file, such as example-list-all.xml in cachecheck/tools/resource, in which you specify the examples to run.
The execution command is

java -jar cachecheck/tools/target/tools-examplerunner.jar $ExampleList $SparkDir

$ExampleList is the path of the configuration file. $SparkDir is the base directory of Spark, e.g., $SPARK_HOME.

2.2 CacheCheckRunner

CacheCheckRunner automatically analyzes all the traces in a directory and generates the bug reports.
The execution command is

java -jar cachecheck/tools/target/tools-cachecheckrunner.jar  $TraceDir

$TraceDir is the directory where the trace files are located, e.g., $SPARK_HOME/trace.

2.3 Deduplicator

Deduplicator collects all the bug reports in a directory, removes duplicate bug reports, and generates a summary bug report. The command is

java -jar cachecheck/tools/target/tools-deduplicator.jar  $ReportDir

$ReportDir is the directory that contains the bug reports. Deduplicator writes a summary.report file to the same directory.
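The three tools can be chained into a single run. The sketch below is an illustration, not part of the tools module: the function name run_pipeline is hypothetical, and it assumes that CacheCheckRunner writes its reports into the trace directory, which Section 1.4 states only for single-application detection.

```shell
# Sketch: run ExampleRunner, CacheCheckRunner and Deduplicator in order.
# Arguments: $1 = CacheCheck checkout, $2 = Spark base directory,
#            $3 = example list XML file.
# Assumption: traces and reports both live in "$2/trace".
run_pipeline() {
  tools="$1/tools/target"
  spark_dir="$2"
  example_list="$3"
  trace_dir="$spark_dir/trace"
  # Each step runs only if the previous one succeeded.
  java -jar "$tools/tools-examplerunner.jar" "$example_list" "$spark_dir" &&
  java -jar "$tools/tools-cachecheckrunner.jar" "$trace_dir" &&
  java -jar "$tools/tools-deduplicator.jar" "$trace_dir"
}
```

For example, run_pipeline cachecheck "$SPARK_HOME" cachecheck/tools/resource/example-list-all.xml would, under the assumption above, end with summary.report in $SPARK_HOME/trace.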


3. Detected Unknown Bugs

| ID | Project | Issue ID | Bug type | Location | Related RDD variable | Status | Fixed |
|---|---|---|---|---|---|---|---|
| 1 | MLLib | SPARK-29809 | Missing persist | Word2Vec.fit() | dataset | Confirmed | Yes |
| 2 | MLLib | SPARK-29810 | Missing persist | RandomForest.run() | retaggedInput | Confirmed | Yes |
| 3 | MLLib | SPARK-29811 | Missing persist | RandomForestRegressor.train() | oldDataset | Confirmed | Yes |
| 4 | MLLib | SPARK-29812 | Missing persist | MulticlassificationEvaluator.evaluate() | predictionAndLabels | Confirmed | Yes |
| 5 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.findFrequentItems() | data | Confirmed | Yes |
| 6 | MLLib | SPARK-29814 | Missing persist | PCA.fit() | sources | Confirmed | No. Limited effect |
| 7 | MLLib | SPARK-29815 | Missing persist | CrossValidator.fit() | dataset.toDF.rdd | Confirmed | No. Limited effect |
| 8 | MLLib | SPARK-29817 | Missing persist | LDAOptimizer.initialize() | docs | Confirmed | Yes |
| 9 | MLLib | SPARK-29824 | Missing persist | GBTClassifier.train() | trainDataset | Confirmed | Yes |
| 10 | MLLib | SPARK-29826 | Missing persist | ChiSqSelector.fit() | data | Confirmed | Yes |
| 11 | MLLib | SPARK-29828 | Missing persist | ALS.train() | ratings | Confirmed | Yes |
| 12 | MLLib | SPARK-29816 | Missing persist | BinaryClassificationMetrics.recallByThreshold() | scoreAndLabels.combineByKey | Confirmed | No. Limited effect |
| 13 | MLLib | SPARK-29827 | Missing persist | BisectingKMeans.run() | input | Confirmed | Yes |
| 14 | MLLib | SPARK-29827 | Unnecessary persist | BisectingKMeans.run() | norms | Confirmed | Yes |
| 15 | MLLib | SPARK-29827 | Missing persist | BisectingKMeans.run() | assignments | Confirmed | Yes |
| 16 | MLLib | SPARK-29856 | Unnecessary persist | RandomForest.run() | baggedInput | Confirmed | No. Limited effect |
| 17 | MLLib | SPARK-29856 | Unnecessary persist | KMeans.fit() | instances | Confirmed | No. Limited effect |
| 18 | MLLib | SPARK-29856 | Unnecessary persist | BisectingKMeans.run() | indices | Confirmed | No. Limited effect |
| 19 | MLLib | SPARK-29844 | Missing persist | ALS.train() | userIdAndFactors | Confirmed | No. Difficulty |
| 20 | MLLib | SPARK-29844 | Missing unpersist | ALS.train() | itemIdAndFactors | Confirmed | No. Difficulty |
| 21 | MLLib | SPARK-29844 | Premature unpersist | ALS.train() | itemFactors | Confirmed | Yes |
| 22 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | userInBlocks | Confirmed | Yes |
| 23 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | userOutBlocks | Confirmed | Yes |
| 24 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | itemOutBlocks | Confirmed | Yes |
| 25 | MLLib | SPARK-29844 | Lagging unpersist | ALS.train() | BlockRatings | Confirmed | Yes |
| 26 | MLLib | SPARK-29781 | Unnecessary persist | PeriodicCheckpointer.update() | newData | Confirmed | No. Limited effect |
| 27 | MLLib | SPARK-29781 | Lagging unpersist | PeriodicCheckpointer.update() | newData | Confirmed | No. Limited effect |
| 28 | MLLib | SPARK-29823 | Unnecessary persist | KMeans.run() | norms | Confirmed | Yes |
| 29 | MLLib | SPARK-29823 | Missing persist | KMeans.run() | zippedData | Confirmed | Yes |
| 30 | MLLib | SPARK-29832 | Unnecessary persist | IsotonicRegression.fit() | instances | Confirmed | No. Limited effect |
| 31 | MLLib | SPARK-29872 | Missing persist | SparkTC.main() | edges | Confirmed | No. Limited effect |
| 32 | MLLib | SPARK-29873 | Unnecessary persist | SparkTC.main() | tc | Confirmed | No. Limited effect |
| 33 | MLLib | SPARK-29874 | Missing persist | LogisticRegressionSummary | fMeasure | Confirmed | No. Limited effect |
| 34 | SQL | SPARK-29875 | Missing persist | SparkSQLExample.main() | peopleDF | Confirmed | No. Difficulty |
| 35 | SQL | SPARK-29876 | Missing persist | RDDRelation.main() | createDataFrame | Confirmed | No. Difficulty |
| 36 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.mapVertices() | vertices | Pending | |
| 37 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.mapVertices() | newEdges | Pending | |
| 38 | GraphX | SPARK-29878 | Missing persist | ReplicatedVertexView | zipPartitions | Pending | |
| 39 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.partitionBy() | newEdges | Pending | |
| 40 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.subgraph() | vertices | Pending | |
| 41 | GraphX | SPARK-29878 | Unnecessary persist | GraphImpl.aggregateMessagesWithActiveSet() | vertices | Pending | |
| 42 | MLLib | SPARK-29878 | Missing unpersist | GraphLoader.edgeListFile() | edges | Pending | |
| 43 | MLLib | SPARK-29878 | Premature unpersist | FPGrowth.genericFit() | items | Confirmed | No. Limited effect |
| 44 | SQL | A custom case | Missing persist | CacheTableTest.main() | data | Confirmed | Yes |
| 45 | SQL | SPARK-30444 | Missing persist | SparkSQLExample | df | Pending | |
| 46 | MLLib | SPARK-31216 | Lagging unpersist | GradientBoostedTrees.boost() | predErrorCheckpointer | Pending | |
| 47 | MLLib | SPARK-31216 | Lagging unpersist | GradientBoostedTrees.boost() | input | Pending | |
| 48 | MLLib | SPARK-29872 | Lagging unpersist | LogisticRegression.train() | instances | Confirmed | No. Limited effect |
| 49 | MLLib | SPARK-29872 | Lagging unpersist | CrossValidator.fit() | validationDataset | Confirmed | No. Limited effect |
| 50 | MLLib | SPARK-29872 | Lagging unpersist | TrainValidationSplit.fit() | trainingDataset | Confirmed | No. Limited effect |
| 51 | MLLib | SPARK-29872 | Lagging unpersist | OneVsRest.fit() | trainingDataset | Confirmed | No. Limited effect |
| 52 | MLLib | SPARK-31217 | Unnecessary persist | BinaryClassificationMetrics | cumulativeCounts | Pending | |
| 53 | MLLib | SPARK-29813 | Unnecessary persist | PrefixSpan.run() | dataInternalRepr | Confirmed | No. Limited effect |
| 54 | MLLib | Fixed in 2.4.4 before submitted | Lagging persist | GradientBoostedTrees.boost() | input | Not submitted | Yes |
| 55 | MLLib | Fixed in 2.4.4 before submitted | Missing persist | LDA.fit() | oldData | Not submitted | Yes |
| 56 | MLLib | Fixed in 2.4.4 before submitted | Missing persist | LinearSVC.train() | instances | Not submitted | Yes |
| 57 | MLLib | SPARK-31218 | Missing persist | BinaryClassificationMetrics.recallByThreshold() | counts | Pending | |
| 58 | MLLib | SPARK-31217 | Missing persist | RegressionMetrics | summary | Pending | |
| 59 | MLLib | SPARK-31217 | Missing persist | MulticlassMetrics | predictionAndLabels | Pending | |
| 60 | MLLib | SPARK-31217 | Missing persist | MultilabelMetrics | predictionAndLabels | Pending | |
| 61 | MLLib | SPARK-31217 | Missing persist | RankingMetrics | predictionAndLabels | Pending | |
| 62 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.run() | data | Confirmed | No. Limited effect |
| 63 | MLLib | SPARK-29813 | Missing persist | PrefixSpan.genFreqPatterns() | data | Confirmed | No. Limited effect |
| 64 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | graph.vertices.sortBy | Pending | |
| 65 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | M1 | Pending | |
| 66 | MCL | MCL_ISSUE-20 | Missing persist | MCL.run() | lookupTable | Pending | |
| 67 | kBetweenness | KBETWEENNESS_ISSUE-6 | Missing persist | kBetweenness.aggregateGraphletsBetweennessScores() | vertexKBcgraph | Pending | |
| 68 | t-SNE | TSNE_ISSUE-14 | Missing persist | MNIST.main() | data | Pending | |
| 69 | t-SNE | TSNE_ISSUE-14 | Missing persist | X2P.apply() | p_betas | Pending | |
| 70 | t-SNE | TSNE_ISSUE-14 | Unnecessary persist | X2P.apply() | norm | Pending | |
| 71 | t-SNE | TSNE_ISSUE-14 | Missing persist | SimpleTSNE.tsne() | P | Pending | |
| 72 | t-SNE | TSNE_ISSUE-14 | Missing persist | SimpleTSNE.tsne() | dataset | Pending | |
