
Bhargavasomu/Document_Clustering


This is a document clustering project.
The input is a set of research papers and the desired number of clusters.
The output is the set of clusters, each listing the names of the research papers assigned to it.

Explanation of the scripts:
    1) pdf2text.py converts a research paper from PDF format to txt format.
    2) convertAllPdf2Text.sh takes as input a directory containing all the research papers in PDF format,
       converts them to txt format, and stores the resulting txt files in a folder called 'TextFiles'
       in the same directory.
    3) tidf.py takes as input the name of the directory containing the text files and the number of clusters.
       It prints the names of the research papers in each cluster to standard output.
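
The clustering step can be illustrated with a small, dependency-free sketch. This is not the code in tidf.py: the script name only suggests TF-IDF weighting, and the README does not say which clustering algorithm is used. The sketch assumes TF-IDF vectors compared by cosine similarity and grouped with spherical k-means; the function names (tokenize, tfidf_vectors, kmeans) and the parameters iterations and seed are hypothetical.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Smoothed IDF so that no weight is zero or undefined.
        vec = {
            t: (n / total) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, n in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=20, seed=0):
    """Assumed algorithm: spherical k-means. Returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each document.
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the normalised sum of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = Counter()
            for v in members:
                for t, w in v.items():
                    total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
            centroids[j] = {t: w / norm for t, w in total.items()}
    return labels
```

Given the text of each paper in a list called docs, kmeans(tfidf_vectors(docs), k) returns one label per paper; grouping the file names by label gives the kind of output described above.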

Sequence of running the code:
    1) Run convertAllPdf2Text.sh on the directory containing the PDFs.
    2) Run tidf.py with the text-file directory and the number of clusters as arguments.
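
The two steps above could be chained from a small driver, sketched below. Several details are assumptions, since the README does not give exact command lines: that the shell script is run with bash, that tidf.py is run with python, that the arguments are positional in the order shown, and that 'TextFiles' is created inside the PDF directory. The helper names build_commands and run_pipeline are hypothetical.

```python
import os
import subprocess


def build_commands(pdf_dir, n_clusters):
    """Return the two command lines in the order they should be run.

    Assumption: both scripts take positional arguments, and the text files
    end up in a 'TextFiles' folder inside pdf_dir.
    """
    text_dir = os.path.join(pdf_dir, "TextFiles")
    return [
        ["bash", "convertAllPdf2Text.sh", pdf_dir],
        ["python", "tidf.py", text_dir, str(n_clusters)],
    ]


def run_pipeline(pdf_dir, n_clusters):
    """Run the conversion and then the clustering, stopping on the first failure."""
    for cmd in build_commands(pdf_dir, n_clusters):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("papers", 5) from the repository root would then convert the PDFs in a directory named papers and cluster them into 5 groups, provided the assumptions above match the actual scripts.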

The results for different numbers of clusters can be found in the Results directory.

About

Given a Dataset of Research Papers, this automatically clusters them based on their Field or Domain of Research
