SourcererCC is Sourcerer's token-based code clone detector for very large code bases and Internet-scale project repositories. SourcererCC works at many levels of granularity, detecting clones between files, methods, statements, or blocks, in any language. This tutorial covers file-level clone detection on Java.
- For more information about SourcererCC please see the ICSE'16 paper.
- SourcererCC supports DéjàVu, a large-scale study of cloning on GitHub. It has a homepage, and was published at OOPSLA'17.
- A supporting web tool for DéjàVu, which allows quick and simple clone analysis, can be found here.
Before going through this tutorial:
We have created an artifact in the form of a virtual machine (VM) that contains the pre-programmed set of instructions that take the user from raw source code to a database with a clone mapping, including all the intermediate steps and explanation of intermediate data types. It can be downloaded from the 'Source Materials' section of the paper ACM website or from the DéjàVu homepage (only the latter is kept updated). This VM is the easiest way to get started with SourcererCC to perform your own clone analysis. It has most of the information here. Please try this VM before contacting us.
Let's get started.
Table of Contents
- Tokenize source code
- Run SourcererCC
- I want to know more!
Tokenize source code:
SourcererCC is a token-based clone detector. This means that source code must go through an initial processing step. Luckily, we have a tool to do so, which we will explain in this section.
The program needed to tokenize source code can be found here. Start by looking at `config.ini`, which sets the configuration for the tokenizer. You need to edit a few parameters (the parameters not covered here can be dismissed for now):
```
N_PROCESSES = 1
; How many projects does each process process at a time?
PROJECTS_BATCH = 2
```

`N_PROCESSES` processes are active at any given time, and each one processes a batch of `PROJECTS_BATCH` projects.
To set the input you can do:

```
FILE_projects_list = this/is/a/path/paths.txt
```
`paths.txt` is a list of project paths, the projects we want to find clones on. A sample `paths.txt` file would look like this:

```
path/for/projects/aesthetic-master.zip
path/for/projects/OffsetAnimator-master.zip
path/for/projects/ResourceInspector-master.zip
path/for/projects/zachtaylor-JPokemon.zip
```
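If your projects sit in a single directory of zip files, a list in this shape can be generated with a few lines of Python. The helper below is hypothetical (it is not part of the tokenizer); it simply globs for `.zip` files and writes one path per line:

```python
from pathlib import Path

def write_projects_list(projects_dir, out_file):
    """Hypothetical helper: collect every .zip project directly under
    projects_dir into a paths.txt-style list, one path per line."""
    paths = sorted(str(p) for p in Path(projects_dir).glob("*.zip"))
    Path(out_file).write_text("\n".join(paths) + "\n")
    return len(paths)
```

Any other way of producing one project path per line works just as well.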
Language configurations: since comments are removed, you need to set the language primitives for comments (`comment_inline`, `comment_open_tag` and `comment_close_tag`). Finally, describe the `File_extensions` being analyzed (a list of extensions is supported):

```
[Language]
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java
```
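To illustrate what these primitives are used for, here is a minimal regex-based sketch of comment removal driven by the same three configuration values. This is only illustrative; the actual tokenizer has its own implementation:

```python
import re

def strip_comments(source,
                   comment_inline="//",
                   comment_open_tag="/*",
                   comment_close_tag="*/"):
    """Illustrative sketch (not the tokenizer's code): remove block
    comments first, then inline comments up to the end of the line."""
    block = re.escape(comment_open_tag) + r".*?" + re.escape(comment_close_tag)
    source = re.sub(block, "", source, flags=re.DOTALL)
    inline = re.escape(comment_inline) + r"[^\n]*"
    return re.sub(inline, "", source)
```

For a language with different primitives (e.g. Python's `#`), only the configuration values change.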
And then run with:
```
python tokenizer.py zip
```

`zip` is the extension of the individual projects listed in `FILE_projects_list`.
The resulting output is composed of three folders, in the same location:
- `bookkeeping_projs/` contains a list of processed projects. Has the following format:

```
project id, project path, project url
```

- `files_stats/` contains lists of files together with various statistics. Has the following format:

```
file id,project id,project path,project url,file hash,size bytes,lines,LOC,SLOC
```

- `files_tokens/` contains lists of files together with various statistics and the tokenized forms. Has the following format:

```
file id,project id,total tokens,unique tokens,token hash@#@token1@@::@@frequency,token2@@::@@frequency,...
```
`file id` and `project id` always point to the same source code file or project, respectively (they work as a primary key). So a line in `files_stats/*` that starts with `1,1` represents the same file as the line in `files_tokens/*` that starts with `1,1`, and both came from the project in `bookkeeping_projs/*` whose line starts with `1`.
The number of lines in `bookkeeping_projs/*` corresponds to the total number of projects analyzed; the number of lines in `files_stats/*` is the same as in `files_tokens/*`, and both equal the total number of files obtained from the projects.
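A `files_tokens/*` line can be pulled apart with a few lines of Python, which is handy for sanity-checking the output. The `parse_tokens_line` helper below is hypothetical, written only from the format shown above:

```python
def parse_tokens_line(line):
    """Hypothetical parser for one files_tokens/* line, based on the
    documented format: stats fields, then '@#@', then a token bag
    where each entry is token@@::@@frequency."""
    stats_part, _, tokens_part = line.strip().partition("@#@")
    file_id, project_id, total, unique, token_hash = stats_part.split(",")
    bag = {}
    for entry in tokens_part.split(","):
        token, _, freq = entry.partition("@@::@@")
        bag[token] = int(freq)
    return {"file_id": int(file_id), "project_id": int(project_id),
            "total_tokens": int(total), "unique_tokens": int(unique),
            "token_hash": token_hash, "tokens": bag}
```

A quick consistency check is that the frequencies in the bag sum to `total tokens` and the bag has `unique tokens` entries.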
Run SourcererCC:
For this step we will run SourcererCC, which can be found here. Start by aggregating the tokenized files in `files_tokens/` from the previous step and copying the result to SourcererCC's input folder:

```
cat files_tokens/* > blocks.file
cp blocks.file SourcererCC/clone-detector/input/dataset/
```
In the clone detector's configuration you will find:

```
# Ignore all files outside these bounds
MIN_TOKENS=65
MAX_TOKENS=500000
```

where you can set an upper and a lower bound (in tokens) for file clone detection. You can dismiss the other parameters for now.
To change the percentage of clone similarity, look at `runnodes.sh`, line 9: `8` means clones will be flagged at 80% similarity (the current setup), `7` at 70%, and so on.
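This threshold corresponds to the overlap similarity described in the ICSE'16 paper: a pair of files is reported when the multiset overlap of their token bags covers at least that fraction of the larger bag. A minimal sketch of the measure (not SourcererCC's actual code, which uses filtering and indexing to avoid comparing all pairs) is:

```python
import math

def is_clone(bag1, bag2, threshold=0.8):
    """Sketch of overlap similarity between two token bags
    (token -> frequency), following the ICSE'16 paper's measure."""
    # Multiset intersection: count each shared token up to its
    # smaller frequency.
    overlap = sum(min(freq, bag2.get(tok, 0)) for tok, freq in bag1.items())
    larger = max(sum(bag1.values()), sum(bag2.values()))
    return overlap >= math.ceil(threshold * larger)
```

With `threshold=0.8` (the `8` in `runnodes.sh`), two bags of 10 tokens each must share at least 8 token occurrences to be flagged.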
The JVM parameters can be configured in the same file, at line 20.
This tool splits the task across multiple nodes, whose outputs must be aggregated at the end:

```
cat clone-detector/NODE_*/output8.0/query_* > results.pairs
```
The resulting information is a list of pairs of file ids which are clones. These ids correspond to the ids generated in the tokenization phase. An example output is:

```
1,2
2,3
```

In this case we have the clone pairs (1,2) and (2,3). To know which file corresponds to 1, we can look at the folder `files_stats/*` and look for the line with the unique id 1.
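That lookup can also be scripted. The two helpers below are hypothetical and assume only the formats shown earlier: the first indexes `files_stats/*` rows by file id, and the second resolves each `results.pairs` line to the corresponding rows:

```python
def load_file_stats(lines):
    """Hypothetical helper: index files_stats/* rows by file id
    (the first comma-separated field). Assumes no field contains
    a comma."""
    index = {}
    for line in lines:
        fields = line.rstrip("\n").split(",")
        index[int(fields[0])] = fields
    return index

def resolve_pairs(pair_lines, stats_index):
    """Hypothetical helper: turn '1,2'-style id pairs from
    results.pairs into pairs of files_stats rows."""
    pairs = []
    for line in pair_lines:
        left, right = (int(x) for x in line.strip().split(","))
        pairs.append((stats_index[left], stats_index[right]))
    return pairs
```

From each resolved row you can then read off the project path, hashes, and size statistics for both sides of the clone pair.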
I want to know more!
That is great!