Bhargavasomu/Document_Clustering: Given a dataset of research papers, this project automatically clusters them by field or domain of research.

This is a Document Clustering project.
In the abstract, the input to this project is a set of research papers and the number of clusters.
The output is the set of clusters, each listing the names of the research papers it contains.

Explanation of the different scripts used:
    1) pdf2text.py converts a research paper from PDF format to txt format.
    2) convertAllPdf2Text.sh takes as input a directory containing all the research papers in PDF format,
       converts them to txt format, and stores the resulting txt files in a folder called 'TextFiles'
       in the same directory.
    3) tidf.py takes as input the name of the directory containing the text files and the number of clusters.
       The output, which is the names of the research papers in each cluster, is printed to standard output.
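
This README does not describe how tidf.py works internally. The script name suggests TF-IDF weighting, so the sketch below illustrates one common way such a step could be done: TF-IDF vectors clustered with a simple k-means on cosine similarity. This is an assumption, not the repository's actual code. The function names (tfidf_vectors, cosine, cluster) and the deterministic centroid initialization are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a unit-length TF-IDF vector (term -> weight)."""
    tokens = {name: re.findall(r"[a-z]+", text.lower()) for name, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter(term for words in tokens.values() for term in set(words))
    n = len(docs)
    vectors = {}
    for name, words in tokens.items():
        tf = Counter(words)
        vec = {t: (c / len(words)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[name] = {t: w / norm for t, w in vec.items()}
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (equals cosine when both are unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def normalized_mean(members, vectors):
    """Average the members' vectors and rescale the result to unit length."""
    total = Counter()
    for name in members:
        for t, w in vectors[name].items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def cluster(docs, k, iterations=20):
    """Group document names into k clusters (assumed approach, not the repo's)."""
    vectors = tfidf_vectors(docs)
    names = sorted(vectors)
    # Deterministic start: first document, then repeatedly the document
    # least similar to every centroid chosen so far.
    centroids = [vectors[names[0]]]
    while len(centroids) < k:
        farthest = min(names, key=lambda n: max(cosine(vectors[n], c) for c in centroids))
        centroids.append(vectors[farthest])
    groups = None
    for _ in range(iterations):
        new_groups = [[] for _ in range(k)]
        for name in names:
            best = max(range(k), key=lambda i: cosine(vectors[name], centroids[i]))
            new_groups[best].append(name)
        if new_groups == groups:  # assignments stopped changing
            break
        groups = new_groups
        centroids = [
            normalized_mean(members, vectors) if members else centroids[i]
            for i, members in enumerate(groups)
        ]
    return groups
```

Here docs is a dictionary mapping a paper name to its extracted text, for example the contents of the files in 'TextFiles'. Calling cluster(docs, 2) returns a list of two lists of paper names.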

Sequence for running the code:
    1) Run convertAllPdf2Text.sh
    2) Run tidf.py with the text files directory and the number of clusters as arguments
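
The two steps above could be chained from a small driver script, sketched below. The README does not give the exact command lines, so several details here are assumptions: that both scripts take positional arguments in the order shown, that the shell script is run with bash and tidf.py with python, and that both scripts are in the current working directory. The helper names build_commands and run_pipeline are hypothetical.

```python
import os
import subprocess


def build_commands(pdf_dir, num_clusters):
    """Return the two commands in run order (argument order is an assumption)."""
    # convertAllPdf2Text.sh writes its output to 'TextFiles' inside pdf_dir.
    text_dir = os.path.join(pdf_dir, "TextFiles")
    return [
        ["bash", "convertAllPdf2Text.sh", pdf_dir],
        ["python", "tidf.py", text_dir, str(num_clusters)],
    ]


def run_pipeline(pdf_dir, num_clusters):
    """Run the conversion step, then the clustering step, stopping on failure."""
    for command in build_commands(pdf_dir, num_clusters):
        subprocess.run(command, check=True)
```

With check=True, a failed PDF conversion raises an error before the clustering step starts, so tidf.py never runs on a missing or partial 'TextFiles' folder.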

The results for different numbers of clusters can be found in the Results directory.