A program with a web UI for mass-comparing files in specified folders, whose main purpose is to compare exams for similarity. It calculates the edit distance between every pair of files in each folder, ignoring duplicate whitespace, comments and preprocessing directives. The similarity is estimated from the relative edit distance, that is, the edit distance divided by the length of the longer of the two files.
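The measure described above can be sketched in a few lines of Python. This is only an illustration, assuming the plain Levenshtein distance and a rough regex-based cleanup. The function names `normalize`, `edit_distance` and `relative_distance` are hypothetical, and the actual edit_distance.cpp may work differently.

```python
import re


def normalize(source: str) -> str:
    """Roughly strip comments, preprocessing directives and duplicate whitespace."""
    # Remove /* ... */ block comments and // line comments.
    # Simplification: comment markers inside string literals are not handled.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    # Drop lines that start with '#', i.e. preprocessing directives.
    lines = [ln for ln in source.splitlines() if not ln.lstrip().startswith("#")]
    # Collapse every run of whitespace into a single space.
    return " ".join(" ".join(lines).split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer input."""
    longer = max(len(a), len(b))
    return edit_distance(a, b) / longer if longer else 0.0
```

A relative distance of 0 means the two normalized files are identical, while values near 1 mean they share almost nothing.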
If you have g++ and python3 installed, there is nothing for you to do here. Otherwise you need to compile edit_distance.cpp into an executable however you like and you're all set!
Just run the Python script and specify a list of folders. Some test folders are provided. All files directly within each of those folders will be compared. The output will be printed to the standard output.
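As a rough sketch of which files end up being compared, the snippet below collects the files of one folder and pairs them up. The helper names `collect_files` and `file_pairs` are hypothetical, and the extension filter and recursion flag mirror the -e and -r options only by assumption; the script's real traversal may differ.

```python
import os
from itertools import combinations


def collect_files(folder, extensions=("c", "cpp"), recursive=False):
    """Return sorted paths of files in `folder` whose extension is allowed."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".")
            if ext in extensions:
                found.append(os.path.join(root, name))
        if not recursive:
            break  # os.walk yields the top-level folder first; stop there
    return sorted(found)


def file_pairs(folder, **kwargs):
    """Every unordered pair of files within one folder; folders are never mixed."""
    return list(combinations(collect_files(folder, **kwargs), 2))
```

With n matching files in a folder this yields n(n-1)/2 pairs, which is why large folders take noticeably longer to process.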
Really basic usage:
$ python goljuf.py test
This compares all files directly in the test directory and prints HTML to the standard output.
A more useful example:
$ python goljuf.py -r -f out.html test test2
This compares all files (including those in subfolders) within test to each other, and all files within test2 to each other, saving the output to out.html. Naturally, one could also redirect the output to a file using > instead of the -f flag.
If the C++ executable is not found, the script will try to compile the C++ source file. Once that is done, it will use the executable to compare the files and print the HTML page to the standard output. See below for more details.
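The compile-on-demand behaviour could look roughly like the following. This is a sketch under assumptions: the helper names `compile_command` and `ensure_executable` are hypothetical, and the exact g++ flags the script uses are not stated here, so `-O2` is a guess.

```python
import os
import shutil
import subprocess


def compile_command(source="edit_distance.cpp", output="edit_distance"):
    """Build a g++ command line; the -O2 flag is an assumption."""
    return ["g++", "-O2", "-o", output, source]


def ensure_executable(executable="edit_distance", source="edit_distance.cpp"):
    """Return the executable path, compiling the source first if it is missing."""
    if os.path.isfile(executable):
        return executable
    if shutil.which("g++") is None:
        raise RuntimeError("g++ not found; compile %s manually" % source)
    # check=True raises CalledProcessError if the compiler reports a failure.
    subprocess.run(compile_command(source, executable), check=True)
    return executable
```

Checking for the executable first means the compiler is only needed on the first run.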
When the output file is produced you can open it in your favourite browser.
General form:
python goljuf.py [-h] [-e EXT [EXT ...]] [-r] [-t TRESHOLD] [-f OUTPUT_FILE]
[-x EXECUTABLE]
DIR [DIR ...]
-e, --extensions
Specify a list of allowed extensions. Only files with one of these extensions will be compared. Default: ['c', 'cpp'].
-r, --recursive
If this option is present, each directory is searched exhaustively, including its subfolders. Default: off.
-t, --treshold
Specify a threshold. All file pairs with a relative difference less than or equal to this value are treated as suspicious. Default: 0.1.
-f, --output_file
The output is printed to this file instead. Default: stdout.
-x, --executable
Specify a path to your compiled executable. Default: edit_distance.
For a complete option set, run
$ python goljuf.py -h
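For orientation, the general form above maps naturally onto Python's argparse. The following is a minimal sketch, not the script's actual parser: `build_parser` and `is_suspicious` are hypothetical names, and representing the stdout default as `None` is an assumption.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options (spelling of --treshold kept)."""
    parser = argparse.ArgumentParser(prog="goljuf.py")
    parser.add_argument("-e", "--extensions", nargs="+", metavar="EXT",
                        default=["c", "cpp"])
    parser.add_argument("-r", "--recursive", action="store_true")
    parser.add_argument("-t", "--treshold", type=float, default=0.1)
    # None stands for "print to stdout" in this sketch.
    parser.add_argument("-f", "--output_file", default=None)
    parser.add_argument("-x", "--executable", default="edit_distance")
    parser.add_argument("dirs", nargs="+", metavar="DIR")
    return parser


def is_suspicious(relative_difference, treshold):
    """A pair is flagged when its relative difference is at most the threshold."""
    return relative_difference <= treshold
```

Because -e accepts several values, it is safest to follow it with another option before the directory list, otherwise the directories could be read as extensions.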
The web UI has some fancy features:
- hide all-green rows (non suspicious people)
- hide the tables (they can be quite big)
- sortable tables (click on headers)
- direct diff view (click on a cell, or the diff link), close with ESC
- useful tooltips and highlighting
There is room for improvement: the edit distance computation could be made faster using Levenshtein automata. The UI could see some design improvements as well.
Jure Slak