text-matcher

A simple command-line tool for detecting text reuse. Given a pair of texts or directories of texts, it finds the passages they have in common. This makes it useful for detecting forms of text reuse such as citation, quotation, intertextuality, and plagiarism.

The pilot experiment that used this tool is allusion-detection. A newer project that uses it is middlemarch-critical-histories.

Demo

Does Milton quote from the Bible in his Areopagitica? Let’s find out.

$ text-matcher kjv.txt areopagitica.txt 

1 total matches found.

match 1:
kjv.txt: (4135539, 4135561) Spirit. 5:20 Despise not prophesyings Prove all things; hold fast that which is good. 5:22 Abstain
areopagitica.txt: (25861, 25883) answerable to that of the Apostle to the Thessalonians PROVE ALL THINGS, HOLD FAST THAT WHICH IS GOOD. And he might

Usage

Run text-matcher with the names of the two text files you want to compare. You can also pass a directory of files in place of a single file: to compare textA.txt with every text file in textdir/, run text-matcher textA.txt textdir/.
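If you would rather drive the tool from a script, the same invocation can be wrapped with Python's standard subprocess module. This is only a sketch: the helper names build_command and run_text_matcher are hypothetical, and it assumes that text-matcher is on your PATH and prints its matches to standard output.

```python
import shutil
import subprocess


def build_command(text1, text2, extra_args=()):
    """Return the argument list for one text-matcher run (hypothetical helper)."""
    return ["text-matcher", str(text1), str(text2), *extra_args]


def run_text_matcher(text1, text2, extra_args=()):
    """Run text-matcher and return whatever it prints to standard output."""
    # Assumption: the CLI is installed and reports matches on stdout.
    if shutil.which("text-matcher") is None:
        raise RuntimeError("text-matcher is not installed or not on PATH")
    result = subprocess.run(
        build_command(text1, text2, extra_args),
        capture_output=True,  # requires Python 3.7 or later
        text=True,
        check=True,
    )
    return result.stdout


# Example: compare one text against every text file in a directory.
# print(run_text_matcher("textA.txt", "textdir/"))
```

Keeping the command construction separate from the call makes it easy to inspect exactly what will be run before running it.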

You can also tune the matching by setting the n-gram size to match against (--ngrams) and the shortest match length to report (--threshold). From the help:

$ text-matcher --help
Usage: text-matcher [OPTIONS] TEXT1 TEXT2

  This program finds similar text in two text files.

Options:
  -t, --threshold INTEGER  The shortest length of match to include.
  -n, --ngrams INTEGER     The ngram n-value to match against.
  -l, --logfile TEXT       The name of the log file to write to.
  --verbose                Whether to enable verbose mode, giving more
                           information.
  --help                   Show this message and exit.
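To give a feel for what an n-gram size and a threshold control, here is a minimal sketch of n-gram based matching. It is an illustration only, not text-matcher's actual implementation: the function names are hypothetical, texts are split into lowercased words, and the threshold is assumed to be counted in words.

```python
import re


def word_ngrams(text, n):
    """Return the lowercased words and a map from each n-gram to its start positions."""
    words = re.findall(r"\w+", text.lower())
    index = {}
    for i in range(len(words) - n + 1):
        index.setdefault(tuple(words[i:i + n]), []).append(i)
    return words, index


def shared_passages(text_a, text_b, n=3, threshold=3):
    """Return shared word sequences that are at least `threshold` words long."""
    words_a, index_a = word_ngrams(text_a, n)
    words_b, index_b = word_ngrams(text_b, n)
    matches = set()
    for gram, starts_a in index_a.items():
        for start_a in starts_a:
            for start_b in index_b.get(gram, []):
                # Only grow a match from its leftmost shared n-gram.
                if (start_a > 0 and start_b > 0
                        and words_a[start_a - 1] == words_b[start_b - 1]):
                    continue
                # Extend the match to the right while the words keep agreeing.
                length = n
                while (start_a + length < len(words_a)
                       and start_b + length < len(words_b)
                       and words_a[start_a + length] == words_b[start_b + length]):
                    length += 1
                if length >= threshold:
                    matches.add(" ".join(words_a[start_a:start_a + length]))
    return sorted(matches)


bible = "Despise not prophesyings. Prove all things; hold fast that which is good. Abstain"
milton = "the Apostle to the Thessalonians PROVE ALL THINGS, HOLD FAST THAT WHICH IS GOOD. And he might"
print(shared_passages(bible, milton, n=3, threshold=5))
# prints ['prove all things hold fast that which is good']
```

In this sketch, a smaller n finds more candidate matches at the cost of more coincidental ones, and a higher threshold discards short matches; raising the threshold to 10 in the example above returns no matches, because the shared passage is nine words long.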

Installation

You can install text-matcher using pip:

On Arch or another modern Linux distribution where Python 3 is the default, run: pip install text-matcher.

On Ubuntu or a similar distribution where the default Python is an older version, run: sudo pip3 install text-matcher.

Alternatively, clone this repo and install using pip:

git clone https://github.com/JonathanReeve/text-matcher
cd text-matcher
pip install .