Deckard is a static code clone detection system (2008 paper here) that finds semantically similar code segments, "clones", within a codebase. This kit provides a convenient set of scripts for finding code clones across two different codebases by essentially combining them into one, and then filtering out same-origin clones. It is also equipped for automation, allowing for multiple different codebases to be compared in all pair-wise combinations, and allowing analysis to be run at different settings (see 'Kit Configuration' below).
Setup is fairly simple, since most of the scripts are ready-to-use immediately after being downloaded.
This kit is intended to be used from a *nix terminal with at least 256-color support. It uses two bash scripts and one PHP script, so bash > 4.0 and PHP > 4.3 are required.
As of the time of this README's writing, the version of Deckard mentioned in the 2008 ICSE paper is not available for download or use. However, an older version is available here, complete with installation instructions.
Once you install Deckard, you simply need to set an environment variable telling this kit where Deckard was installed. You can do that with the following command:
export DECKARD_PATH="/path/to/Deckard"
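Before running anything, it can help to confirm the variable points somewhere sensible. The function below is a minimal sketch of such a check. The name check_deckard_path is hypothetical and not part of this kit. It only tests that the variable is set and names an existing directory, and it assumes nothing about the layout inside the Deckard directory.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; this function is NOT shipped with the kit.
check_deckard_path() {
    # Fail if the variable is unset or empty.
    if [ -z "${DECKARD_PATH:-}" ]; then
        echo "DECKARD_PATH is not set" >&2
        return 1
    fi
    # Fail if it does not name an existing directory.
    if [ ! -d "$DECKARD_PATH" ]; then
        echo "DECKARD_PATH is not a directory: $DECKARD_PATH" >&2
        return 1
    fi
    echo "Using Deckard at $DECKARD_PATH"
}
```

Calling check_deckard_path in the same shell session where the export was issued returns 0 on success and 1 otherwise.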
To run this kit, there are three simple steps.
First, extract each codebase you wish to compare into its own individually named directory within libraries. For example, if you wish to find clones across projects foo, bar, and baz, start by extracting them to libraries/foo/, libraries/bar/, and libraries/baz/, respectively.
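If the codebases arrive as tar archives, the extraction step might look like the sketch below. The helper extract_codebase and the archive paths in the usage comment are hypothetical placeholders. The only thing taken from this README is the libraries/<name>/ layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack one tar archive into libraries/<name>/.
# Run it from the kit's root directory so that libraries/ lands in the right place.
extract_codebase() {
    local archive="$1" name="$2"
    mkdir -p "libraries/$name" || return 1
    # -C makes tar change into the target directory before extracting.
    tar -xf "$archive" -C "libraries/$name"
}

# Example usage (archive paths are placeholders):
#   extract_codebase downloads/foo.tar.gz foo
#   extract_codebase downloads/bar.tar.gz bar
#   extract_codebase downloads/baz.tar.gz baz
```

Passing the directory name explicitly avoids guessing it from the archive file name, which tends to go wrong for names such as foo-1.2.tar.gz.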
Second, configure the settings with which Deckard should be run. This is done by writing them into the arrays on lines 8 and 9 of runner.sh. Here are the default values:
declare -a toks=("50" "100" "500")
declare -a sims=("0.85" "0.95" "0.98")
The first array, toks, controls the approximate number of tokens in each clone. Keep in mind there are about 7 tokens per line of Java code. The second array, sims, controls how similar code segments need to be in order to be considered "clones". The number is fairly mysterious, but for reference, Deckard's default is 0.95. With the arrays set as they are above, each pair of codebases will be compared nine times, once for each combination of token and similarity settings.
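To see where the nine runs come from, the sketch below enumerates every combination of the two default arrays and converts each token count to an approximate line count using the 7-tokens-per-line figure. It is an illustration only. The function list_combinations is hypothetical, and the actual loop order inside runner.sh may differ.

```shell
#!/usr/bin/env bash
# Default values, copied from the arrays shown above.
declare -a toks=("50" "100" "500")
declare -a sims=("0.85" "0.95" "0.98")

# Hypothetical helper: print one line per (token, similarity) combination.
list_combinations() {
    local tok sim
    for tok in "${toks[@]}"; do
        for sim in "${sims[@]}"; do
            # Integer division; assumes roughly 7 tokens per line of Java.
            echo "tokens=$tok (about $((tok / 7)) lines) similarity=$sim"
        done
    done
}

list_combinations
```

With the defaults this prints nine lines, three token settings times three similarity settings. Adding one value to either array adds three more runs per pair of codebases.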
Finally, invoke the runner script with:
./runner.sh
The runner script may take some time to complete, depending on your settings and the number and sizes of your codebases. However, it prints diagnostic information the whole while and issues a terminal bell when processing has completed, which makes it convenient for batch processing jobs.
After running the kit, results are saved to the results directory. Pair-wise clone lists are stored in subdirectories named according to the settings with which they were run. Aggregate data is written to results/all_counts.csv.
Please report any issues you may encounter. Thanks!