Abstract: In this notebook, you can use the DocumentSimilarity tool to identify similar documents within your corpus. Once you have identified those documents, you can decide whether to 'keep' or 'remove' them from the corpus. This is useful for cleaning up your corpus by removing duplicates or near-duplicates before you continue with your research.
This tool has been designed for use with minimal setup. You can run it in the cloud, and any packages it depends on will be installed for you automatically. To launch and use the tool, you just need to click the icon below.
Note: CILogon authentication is required. You can use your institutional, Google or Microsoft account to log in. If you have trouble authenticating, please refer to the CILogon troubleshooting guide.
If you do not have access to any of the above accounts, you can use the link below to access the tool (this is a free Binder version, limited to 2 GB of memory).
It may take a few minutes for Binder to launch the notebook and install the dependencies for the tool. Please be patient.
For instructions on how to use the Document Similarity tool, please refer to the Document Similarity User Guide.
This tool allows you to upload text data as a text file (or a number of text files). Alternatively, you can upload text stored in a text column of an Excel spreadsheet.
Note: If you have a large number of text files (more than 10MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need assistance on how to compress your file, please check the user guide.
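If you prefer to script the compression step, the following is a minimal sketch using only the Python standard library. The function name `zip_text_files` and the folder and archive names in the usage comment are hypothetical and are not part of the tool; the user guide remains the reference for the supported workflow.

```python
import zipfile
from pathlib import Path


def zip_text_files(folder, archive_path):
    """Compress every .txt file directly inside `folder` into one zip archive.

    Returns the number of files written. Names here are illustrative only.
    """
    folder = Path(folder)
    txt_files = sorted(folder.glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in txt_files:
            # Store only the file name so the archive has no nested folders.
            zf.write(path, arcname=path.name)
    return len(txt_files)


# Example usage (hypothetical paths):
# zip_text_files("my_corpus", "my_corpus.zip")
```

The resulting zip file can then be uploaded in place of the individual text files.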
Once your texts have been uploaded, you can calculate the similarity between documents in the corpus. You can then visualise the count of similar documents found by the tool on a histogram (as shown below).
Alternatively, you can visualise the pairs of similar documents and their Jaccard similarity on a heatmap (as shown below).
You can also display each pair of identified similar documents side by side, decide whether to 'keep' or 'remove' each document and, finally, download the non-duplicated documents to your local computer.
This tool uses MinHash to estimate the Jaccard similarity between pairs of documents. MinHash was introduced by Andrei Z. Broder in this paper.
If you find the Document Similarity tool useful in your research, please cite the following:
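To illustrate the idea, the following is a minimal, self-contained sketch of MinHash in plain Python. It is not the tool's implementation: the 3-word shingling, the use of `hashlib.blake2b` with a seed prefix in place of random permutations, the value of `num_perm`, and all function names are assumptions made for this example. Each document is turned into a set of word shingles, and the fraction of signature positions on which two documents agree is used as an estimate of their Jaccard similarity, |A ∩ B| / |A ∪ B|.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in `text` (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    """64-bit hash of `item`; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """Signature: the minimum hash value of the set under each seeded hash."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill"
doc_c = "completely unrelated sentence about minhash and corpus cleaning tools"

set_a, set_b, set_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (set_a, set_b, set_c))

print("exact a-b:    ", jaccard(set_a, set_b))
print("estimated a-b:", estimate_jaccard(sig_a, sig_b))
print("estimated a-c:", estimate_jaccard(sig_a, sig_c))
```

The estimate becomes more accurate as `num_perm` grows, at the cost of longer signatures. The benefit is that signatures have a fixed size, so comparing two documents no longer requires comparing their full shingle sets.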
Jufri, Sony & Sun, Chao (2022). Document Similarity. v1.0. Australian Text Analytics Platform. Software. https://github.com/Australian-Text-Analytics-Platform/document-similarity