Document Similarity

Abstract: In this notebook, you can use the DocumentSimilarity tool to identify similar documents within your corpus. Once you have identified those similar documents, you can decide whether to 'keep' or 'remove' them from the corpus. This tool is useful for cleaning up your corpus by removing duplicate or near-duplicate documents before you continue with your research.

Setup

This tool has been designed for use with minimal setup. You can run it in the cloud, and any dependencies on other packages will be installed for you automatically. To launch and use the tool, you just need to click the icon below.

Note: CILogon authentication is required. You can use your institutional, Google or Microsoft account to log in. If you have trouble authenticating, please refer to the CILogon troubleshooting guide.

If you do not have access to any of the above accounts, you can use the link below to access the tool (this is a free Binder version, limited to 2GB of memory).

It may take a few minutes for Binder to launch the notebook and install the dependencies for the tool. Please be patient.

User Guide

For instructions on how to use the Document Similarity tool, please refer to the Document Similarity User Guide.

Load the data

This tool allows you to upload text data in a text file (or a number of text files). Alternatively, you can upload text stored in a text column of an Excel spreadsheet.

Note: If you have a large number of text files (more than 10MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need assistance on how to compress your file, please check the user guide.
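If you prefer to compress the files programmatically rather than through your operating system, the following is a minimal sketch using Python's standard zipfile module. The function name zip_text_files and the folder and archive names in the usage note are hypothetical; they are not part of the tool.

```python
import zipfile
from pathlib import Path


def zip_text_files(folder, archive_path):
    """Compress every .txt file directly inside `folder` into one zip archive.

    Returns the list of file names that were added.
    (Hypothetical helper for illustration; not part of the tool.)
    """
    folder = Path(folder)
    txt_files = sorted(folder.glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in txt_files:
            # Store only the file name so the archive contains no nested folders.
            archive.write(path, arcname=path.name)
    return [path.name for path in txt_files]
```

For example, zip_text_files("my_texts", "my_texts.zip") would create my_texts.zip from the .txt files in a folder called my_texts, assuming that folder exists.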

Calculate Document Similarity

Once your texts have been uploaded, you can begin to calculate the similarity between documents in the corpus. You can then visualise the count of similar documents found by the tool on a histogram (as shown below).

Alternatively, you can visualise the pairs of similar documents and their Jaccard similarity on a heatmap (as shown below).
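To make the Jaccard similarity concrete, here is a minimal sketch that computes it exactly for two short texts. It assumes each document is represented as a set of word 3-grams (shingles); that representation, and the function names shingles and jaccard, are illustrative assumptions and not necessarily how the tool tokenises documents.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`.

    Assumption for this sketch: lower-cased, whitespace-separated words.
    """
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


doc_a = "the cat sat on the mat"
doc_b = "the cat sat on the rug"
# Each text has 4 shingles; 3 are shared, and the union has 5.
print(jaccard(shingles(doc_a), shingles(doc_b)))  # → 0.6
```

A value of 1.0 means the two shingle sets are identical, and 0.0 means they share nothing.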

You can also display each pair of identified similar documents side by side, decide whether to 'keep' or 'remove' each document, and finally download the non-duplicated documents to your local computer.
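The 'keep' or 'remove' step can be pictured as a simple filter over the corpus. The sketch below is only an illustration: it assumes the corpus is a dictionary of document names to texts and that decisions are recorded per document name, neither of which is confirmed to match the tool's internal data structures. The function name drop_removed is hypothetical.

```python
def drop_removed(documents, decisions):
    """Return the documents that were not marked 'remove'.

    documents: dict mapping document name -> text (assumed structure)
    decisions: dict mapping document name -> 'keep' or 'remove' (assumed structure)
    Documents without a recorded decision are kept.
    """
    return {
        name: text
        for name, text in documents.items()
        if decisions.get(name, "keep") != "remove"
    }


corpus = {"a.txt": "same text", "b.txt": "same text", "c.txt": "other text"}
choices = {"a.txt": "keep", "b.txt": "remove"}
print(sorted(drop_removed(corpus, choices)))  # → ['a.txt', 'c.txt']
```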

Reference

This tool uses MinHash to estimate the Jaccard similarity between documents. MinHash was introduced by Andrei Z. Broder in this paper.
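As a rough illustration of the idea behind MinHash, the sketch below builds a signature for a set of strings by taking the minimum hash value under many differently salted hash functions, then estimates the Jaccard similarity as the fraction of signature positions on which two signatures agree. It is a from-scratch sketch using hashlib, not the tool's implementation; the function names and the choice of 128 hash functions are assumptions made for this example.

```python
import hashlib


def minhash_signature(items, num_perm=128):
    """Return a MinHash signature for a non-empty collection of strings.

    For each of `num_perm` salted hash functions, keep the smallest
    64-bit hash value seen over all items.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the share of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


set_a = {f"w{i}" for i in range(100)}      # w0 .. w99
set_b = {f"w{i}" for i in range(50, 150)}  # w50 .. w149
# True Jaccard similarity is 50 / 150 = 1/3; the estimate should be close to it.
print(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)))
```

The estimate becomes more precise as the number of hash functions grows, at the cost of more computation per document.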

Citation

If you find the Document Similarity tool useful in your research, please cite the following:

Jufri, Sony & Sun, Chao (2022). Document Similarity. v1.0. Australian Text Analytics Platform. Software. https://github.com/Australian-Text-Analytics-Platform/document-similarity

