University project, program that shows you the percentage similarity between the documents you import

NikosBakalis/Document_Similarity

Document Similarity Analyzer

Overview

This Python program computes the similarity between every pair of documents you enter. It cleans each text, counts term frequencies, and reports the cosine similarity of each pair as a percentage.

Features

  • Input multiple documents interactively through the console.
  • Normalize and clean the text data to remove non-alphanumeric characters and handle different case inputs.
  • Calculate term frequencies for each document.
  • Determine cosine similarity scores between each pair of documents.
  • List the top similarities based on user input.
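
The normalization and term-frequency features can be illustrated with a minimal sketch. It is not the project's actual code: the function names `tokenize` and `term_frequencies` are hypothetical, and the regular expression assumes that only ASCII letters, digits, and whitespace are kept.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop non-alphanumeric characters, split into words.

    Assumption: whitespace is kept so that words stay separated, and only
    ASCII letters and digits count as alphanumeric.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()


def term_frequencies(text):
    """Count how often each term occurs in one document."""
    return Counter(tokenize(text))


print(tokenize("Hello, world! Hello?"))
# ['hello', 'world', 'hello']
print(term_frequencies("Hello, world! Hello?"))
# Counter({'hello': 2, 'world': 1})
```

Because punctuation is deleted rather than replaced by a space in this sketch, a hyphenated word such as "well-known" would become the single term "wellknown".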

Requirements

  • Python 3.x
  • No external libraries are required; the program uses only the standard library modules collections, itertools, math, and re.

Installation

No installation is required; just ensure that Python 3.x is installed on your system.

Usage

  1. Run the script in a Python environment.
  2. Enter the number of documents you want to analyze when prompted.
  3. Input each document one by one when prompted.
  4. After all documents are entered, the program will automatically calculate and display the similarity scores.
  5. Enter the number of top similarities you want to view to get the most similar document pairs.

How It Works

  1. Input Handling: The program starts by asking for the number of documents. Each document is input one by one. Text cleaning is performed to standardize the input.
  2. Text Preprocessing: Each document's text is converted to lowercase, and all non-alphanumeric characters are removed. Text is then split into words to create a list of terms.
  3. Similarity Calculation: The program calculates term frequencies and uses these to compute cosine similarities between each pair of documents.
  4. Output: Similarity scores are presented as percentages, and the program can also list the top document pairs with the highest similarities based on user input.
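
The similarity and ranking steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the repository's implementation: the function names are hypothetical, tokenization is simplified to lowercasing and splitting on whitespace, and scores are rounded to two decimals.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarities(documents):
    """Return (percentage, doc_no_a, doc_no_b) for every pair of documents."""
    # Assumption: simplified tokenization (lowercase, split on whitespace).
    vectors = [Counter(doc.lower().split()) for doc in documents]
    scores = []
    for i, j in combinations(range(len(vectors)), 2):
        pct = round(100 * cosine_similarity(vectors[i], vectors[j]), 2)
        scores.append((pct, i + 1, j + 1))  # document numbers start at 1
    return scores


def top_similarities(scores, n):
    """Return the n highest-scoring pairs, best first."""
    return sorted(scores, key=lambda s: s[0], reverse=True)[:n]


docs = ["Hello world", "Hello there", "Another document"]
scores = pairwise_similarities(docs)
print(scores)
# [(50.0, 1, 2), (0.0, 1, 3), (0.0, 2, 3)]
print(top_similarities(scores, 1))
# [(50.0, 1, 2)]
```

`itertools.combinations` yields each unordered pair exactly once, so n documents produce n(n-1)/2 scores.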

Example

$ python document_similarity.py
Amount of documents: 3
Document No. 1
Enter your document here:
Hello world
Document No. 1 added successfully

Document No. 2
Enter your document here:
Hello there
Document No. 2 added successfully

Document No. 3
Enter your document here:
Another document
Document No. 3 added successfully

The similarity between Document No: 1 and Document No: 2 is: 50.0 %
The similarity between Document No: 1 and Document No: 3 is: 0.0 %
The similarity between Document No: 2 and Document No: 3 is: 0.0 %

Enter a Number between 1 and 3
Find the top similar documents: 2

1 The 50.0% similarity, come from document No: 1 and Document No: 2
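
The 50.0 % reported for documents 1 and 2 can be checked by hand. The derivation below assumes plain term counts over the vocabulary (hello, world, there), which is consistent with the output shown but is not taken from the program's source.

```latex
\[
\mathbf{a} = (1, 1, 0), \qquad \mathbf{b} = (1, 0, 1)
\]
\[
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{1\cdot 1 + 1\cdot 0 + 0\cdot 1}{\sqrt{2}\,\sqrt{2}}
  = \frac{1}{2}
  = 50\,\%
\]
```

Document 3 shares no terms with the other two, so its dot products are zero and both of its scores are 0.0 %.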

Contributions

Contributions to this project are welcome. If you have ideas for improvements or notice any issues, please feel free to fork the repository and submit a pull request with your changes.

License

This project is licensed under the MIT License. You may use, modify, and distribute the software, provided that the copyright notice and the license text are included in all copies or substantial portions of the software.
