Conveniently run Deckard pair-wise on multiple codebases with multiple settings, and aggregate the results.

Deckard Analysis Kit

Premise

Deckard is a static code clone detection system (2008 paper here) that finds semantically similar code segments, or "clones", within a codebase. This kit provides a set of scripts for finding code clones across two different codebases by combining them into one and then filtering out same-origin clones. It also supports automation: multiple codebases can be compared in all pair-wise combinations, and the analysis can be run at several different settings (see 'Usage' below).

Setup

Setup is simple: most of the scripts are ready to use immediately after download.

Requirements

This kit is intended to be used from a *nix terminal with at least 256-color support. It uses two bash scripts and one PHP script, so bash > 4.0 and PHP > 4.3 are required.
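The version requirements can be checked before running anything. The following is a minimal sketch, not part of the kit: the helper name version_gt is hypothetical, and it compares only the major and minor version numbers.

```shell
#!/usr/bin/env bash
# Sketch: check the version requirements stated above (bash > 4.0, PHP > 4.3).

# version_gt A B: succeeds when "major.minor" version A is strictly greater than B.
version_gt() {
    local IFS=.
    local -a a=($1) b=($2)
    local i x y
    for i in 0 1; do
        x=${a[i]:-0}
        y=${b[i]:-0}
        if (( 10#$x > 10#$y )); then return 0; fi
        if (( 10#$x < 10#$y )); then return 1; fi
    done
    return 1  # equal versions are not strictly greater
}

bash_version="${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}"
if version_gt "$bash_version" "4.0"; then
    echo "bash $bash_version: ok"
else
    echo "bash $bash_version: too old, need > 4.0"
fi

if command -v php >/dev/null 2>&1; then
    # PHP_VERSION may carry a suffix such as "8.1.2-1ubuntu2"; keep major.minor only.
    IFS=. read -r php_major php_minor _ <<< "$(php -r 'echo PHP_VERSION;')"
    php_version="$php_major.$php_minor"
    if version_gt "$php_version" "4.3"; then
        echo "php $php_version: ok"
    else
        echo "php $php_version: too old, need > 4.3"
    fi
else
    echo "php: not found on PATH"
fi
```

The script only reports; it does not stop on failure, so it can be run safely on any machine.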

Installing Deckard

At the time of writing, the version of Deckard described in the 2008 ICSE paper is not available for download or use. However, an older version is available here, complete with installation instructions.

Kit Configuration

Once you install Deckard, you simply need to set an environment variable telling this kit where Deckard was installed. You can do that with the following command:

export DECKARD_PATH="/path/to/Deckard"
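Before starting a long run, it may help to confirm that the variable is set and points to a real directory. This is a small sketch under that assumption only: check_deckard_path is a hypothetical helper, and it does not verify that the directory actually contains a working Deckard build.

```shell
#!/usr/bin/env bash
# check_deckard_path DIR: succeeds when DIR is non-empty and names an existing directory.
check_deckard_path() {
    local dir="$1"
    if [ -z "$dir" ]; then
        echo "DECKARD_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "DECKARD_PATH does not point to a directory: $dir" >&2
        return 1
    fi
    echo "Using Deckard at: $dir"
}

# Print a hint instead of aborting when the check fails.
check_deckard_path "${DECKARD_PATH:-}" \
    || echo 'Set it with: export DECKARD_PATH="/path/to/Deckard"' >&2
```

Note that export only affects the current shell session. Adding the export line to a startup file such as ~/.bashrc makes it persist across sessions.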

Usage

To run this kit, there are three simple steps.

First, extract all the codebases you wish to compare into individually named directories within the libraries directory. For example, if you wish to find clones across projects foo, bar, and baz, start by extracting them to libraries/foo/, libraries/bar/, and libraries/baz/, respectively.
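To preview which pair-wise comparisons a given layout implies, the subdirectories can be enumerated as unordered pairs. This is an illustrative sketch, not the logic of runner.sh: list_pairs is a hypothetical helper, and it assumes every subdirectory of libraries is a codebase to compare, which the actual runner may not do.

```shell
#!/usr/bin/env bash
# list_pairs DIR: print each unordered pair of DIR's subdirectories, one "a b" per line.
list_pairs() {
    local root="$1"
    local -a names=()
    local d i j
    for d in "$root"/*/; do
        [ -d "$d" ] || continue   # skip the unexpanded pattern when nothing matches
        d="${d%/}"                # drop the trailing slash
        names+=("${d##*/}")       # keep only the directory name
    done
    for (( i = 0; i < ${#names[@]}; i++ )); do
        for (( j = i + 1; j < ${#names[@]}; j++ )); do
            echo "${names[i]} ${names[j]}"
        done
    done
}

list_pairs libraries
```

With n codebases this yields n(n-1)/2 pairs, so the foo, bar, and baz example produces three comparisons.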

Second, configure the settings with which Deckard should be run. This is done by writing them into the arrays on lines 8 and 9 of runner.sh. Here are the default values:

declare -a toks=("50" "100" "500")
declare -a sims=("0.85" "0.95" "0.98")

The first array, toks, controls the approximate number of tokens in each clone; keep in mind that there are about 7 tokens per line of Java code. The second array, sims, controls how similar code segments need to be in order to be considered "clones". The scale of this value is not obvious, but for reference, Deckard's default is 0.95. With the arrays set as above, each pair of codebases is compared nine times, once for each combination of token and similarity settings.
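The nine combinations can be listed with a nested loop over the two arrays. The sketch below copies the default values shown above; list_settings is a hypothetical helper, the line estimate uses the rough figure of 7 tokens per line of Java, and the loop is not necessarily how runner.sh iterates internally.

```shell
#!/usr/bin/env bash
declare -a toks=("50" "100" "500")
declare -a sims=("0.85" "0.95" "0.98")

# list_settings: print one line per combination of token and similarity settings.
list_settings() {
    local tok sim
    for tok in "${toks[@]}"; do
        for sim in "${sims[@]}"; do
            # Integer division gives a rough lower bound on clone length in lines.
            echo "tokens=$tok (~$(( tok / 7 )) lines of Java) similarity=$sim"
        done
    done
}

list_settings
echo "runs per codebase pair: $(( ${#toks[@]} * ${#sims[@]} ))"
```

Since every pair of codebases is run at every setting, adding one value to either array adds a full pass over all pairs, so the arrays are worth keeping short for large codebases.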

Finally, invoke the runner script with:

./runner.sh

The runner script may take some time to complete, depending on your settings and the number and sizes of your codebases. However, it prints diagnostic information throughout and issues a terminal bell when processing has completed, which makes it convenient for batch processing jobs.

Results

After running the kit, results are saved to the results directory. Pair-wise clone lists are stored in subdirectories named according to the settings with which they were run. Aggregate data is written to results/all_counts.csv.
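A quick way to inspect the output after a run is to list the per-setting subdirectories and preview the aggregate file. This is a minimal sketch: show_results is a hypothetical helper, and it makes no assumption about the subdirectory names or the columns of all_counts.csv, since neither is specified here.

```shell
#!/usr/bin/env bash
# show_results DIR: list DIR's subdirectories and preview DIR/all_counts.csv.
show_results() {
    local dir="${1:-results}"
    local csv="$dir/all_counts.csv"
    local d
    echo "Per-setting subdirectories:"
    for d in "$dir"/*/; do
        [ -d "$d" ] || continue   # skip the unexpanded pattern when nothing matches
        echo "  ${d%/}"
    done
    if [ -f "$csv" ]; then
        echo "First rows of $csv:"
        head -n 5 "$csv"
    else
        echo "No aggregate file found at $csv" >&2
        return 1
    fi
}

# Do not fail the whole script if the kit has not been run yet.
show_results results || true
```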

Notes

Please report any issues you may encounter. Thanks!