Tool to identify similar documents within a corpus and decide whether to keep or remove them

Australian-Text-Analytics-Platform/document-similarity

Document Similarity

Abstract: in this notebook, you can use the DocumentSimilarity tool to identify similar documents within your corpus. Once you have identified them, you can decide whether to 'keep' or 'remove' each one. This is useful for cleaning up your corpus by removing duplicates and near-duplicates before you continue with your research.

Setup

This tool has been designed to need minimal setup. It runs in the cloud, and any dependencies on other packages are installed for you automatically. To launch the tool, click the icon below.

Binder

Note: CILogon authentication is required. You can log in with your institutional, Google or Microsoft account. If you have trouble authenticating, please refer to the CILogon troubleshooting guide.

If you do not have access to any of the above accounts, you can use the link below to access the tool (this is a free Binder version, limited to 2 GB of memory).

Binder

It may take a few minutes for Binder to launch the notebook and install the dependencies for the tool. Please be patient.

User Guide

For instructions on how to use the Document Similarity tool, please refer to the Document Similarity User Guide.

Load the data

This tool allows you to upload text data as a text file (or a number of text files). Alternatively, you can upload an Excel spreadsheet that holds the text in one of its columns.

Note: If you have a large number of text files (more than 10 MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need help compressing your files, please check the user guide.
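
If you prefer to build the zip file with a script, the following is a minimal sketch using only the Python standard library. The function name `zip_text_files` and the flat archive layout (no nested folders) are assumptions made for illustration; the user guide remains the reference for what the tool accepts.

```python
import zipfile
from pathlib import Path


def zip_text_files(folder, archive_path):
    """Compress every .txt file directly inside `folder` into one zip archive.

    Hypothetical helper for illustration; not part of the tool itself.
    Returns the list of file names that were added.
    """
    folder = Path(folder)
    txt_files = sorted(folder.glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in txt_files:
            # Store only the file name so the archive has no nested folders
            # (an assumption about the layout, not a documented requirement).
            archive.write(path, arcname=path.name)
    return [path.name for path in txt_files]
```

For example, `zip_text_files("my_corpus", "my_corpus.zip")` would write `my_corpus.zip` in the current directory, ready to upload.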

Calculate Document Similarity

Once your texts have been uploaded, you can calculate the similarity between documents in the corpus. You can then visualise the count of similar documents found by the tool on a histogram (as shown below).

Alternatively, you can visualise the pairs of similar documents and their Jaccard similarity on a heatmap (as shown below).
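
To make the Jaccard similarity concrete, here is a minimal sketch that computes it exactly for small inputs: the size of the intersection of two token sets divided by the size of their union. Splitting on whitespace, lowercasing, and the 0.5 threshold are assumptions chosen for illustration and may differ from what the tool does; `jaccard` and `similar_pairs` are hypothetical names.

```python
from itertools import combinations


def jaccard(text_a, text_b):
    """Exact Jaccard similarity between the word sets of two texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # Convention chosen here: two empty texts count as identical.
    return len(set_a & set_b) / len(set_a | set_b)


def similar_pairs(docs, threshold=0.5):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold.

    `docs` maps a document name to its text. The threshold value is illustrative.
    """
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(docs.items(), 2):
        score = jaccard(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs
```

Comparing every pair exactly like this grows quadratically with the number of documents, which is why an estimate is preferable for larger corpora.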

You can also display each pair of identified similar documents side by side, decide whether to 'keep' or 'remove' each document and, finally, download the non-duplicated documents to your local computer.

Reference

This tool uses MinHash to estimate the Jaccard similarity between documents. MinHash was introduced by Andrei Z. Broder in this paper.
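
The idea behind MinHash can be sketched in a few lines. This is an illustration of the general technique under stated assumptions, not the tool's own implementation: the function names, the use of salted `hashlib.blake2b` as the family of hash functions, and the default of 128 hashes are all choices made for this example. For each hash function, the signature keeps the minimum hash value over a document's tokens; the fraction of positions where two signatures agree estimates the Jaccard similarity of the two token sets.

```python
import hashlib


def _salted_hash(token, seed):
    """64-bit hash of `token`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=128):
    """Return the MinHash signature (one minimum per hash function) of a token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature for an empty token set")
    return [
        min(_salted_hash(token, seed) for token in token_set)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the share of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Using more hash functions gives a tighter estimate at the cost of longer signatures; the signatures can be computed once per document and then compared cheaply.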

Citation

If you find the Document Similarity tool useful in your research, please cite the following:

Jufri, Sony & Sun, Chao (2022). Document Similarity. v1.0. Australian Text Analytics Platform. Software. https://github.com/Australian-Text-Analytics-Platform/document-similarity
