Text analysis on NeurIPS papers

Aim: Perform text modelling and clustering on text from the Proceedings of the Neural Information Processing Systems (NeurIPS).

☑️ What did I do?

  • First, I processed the full text using regular expressions to (a sketch of this step follows the list):
    • Extract the abstract from the full text by matching the various patterns authors use when numbering and/or naming the section that follows the abstract.
    • Remove everything after the acknowledgments and/or references sections, for two reasons:
      • they are not actually part of the paper's main text; and
      • they should not have an undue influence on the text analysis.
    • Perform 3 rounds of text cleaning, the last one being lemmatization. For this, I compared the results from NLTK and spaCy, and found that NLTK gave better results on this data set.
  • Second, I performed exploratory data analysis on the number of words and characters in the titles and abstracts of the papers, and looked at the words most commonly used in them (a short EDA sketch follows the list).
  • Lastly, I trained a Gensim LDA model for topic modelling and used K-Means to cluster the documents (a sketch of this step also follows the list).
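
A minimal sketch of the regex processing and the NLTK lemmatization, assuming the full text of a paper is available as a single string. The patterns, function names, and cleaning choices below are illustrative assumptions, not the exact ones used in this repository:

```python
import re

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# NLTK resources must be downloaded once, e.g.:
# import nltk; nltk.download("punkt"); nltk.download("wordnet")

# Illustrative pattern: the abstract usually ends where the next section starts,
# e.g. "1 Introduction", "1. Introduction", or an unnumbered "Introduction".
NEXT_SECTION = r"\n\s*(?:1\.?\s+)?Introduction\b"

# Illustrative pattern: the acknowledgments/references heading and everything after it.
TAIL_SECTIONS = re.compile(
    r"\n\s*(?:Acknowledg(?:e)?ments?|References)\b.*",
    re.IGNORECASE | re.DOTALL,
)

def extract_abstract(full_text: str) -> str:
    """Return the text between 'Abstract' and the first following section heading."""
    match = re.search(
        r"Abstract\s*(.*?)(?=" + NEXT_SECTION + ")",
        full_text,
        re.IGNORECASE | re.DOTALL,
    )
    return match.group(1).strip() if match else ""

def strip_tail(full_text: str) -> str:
    """Remove the acknowledgments/references sections and anything after them."""
    return TAIL_SECTIONS.sub("", full_text)

def lemmatize(text: str) -> str:
    """Final cleaning round: lowercase, keep alphabetic tokens, lemmatize with NLTK."""
    lemmatizer = WordNetLemmatizer()
    tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)
```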
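
A rough sketch of the EDA step, assuming the data sit in a pandas DataFrame with `title` and `abstract` columns (the file name and column names are assumptions):

```python
from collections import Counter

import pandas as pd

# Assumed layout: one row per paper, with 'title' and 'abstract' columns.
papers = pd.read_csv("papers.csv")  # hypothetical file name

# Word and character counts for titles and abstracts.
for col in ("title", "abstract"):
    papers[f"{col}_n_chars"] = papers[col].str.len()
    papers[f"{col}_n_words"] = papers[col].str.split().str.len()

print(papers[["title_n_words", "title_n_chars",
              "abstract_n_words", "abstract_n_chars"]].describe())

# Most common words across the (already cleaned) abstracts.
word_counts = Counter(" ".join(papers["abstract"].dropna()).split())
print(word_counts.most_common(20))
```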
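
A minimal sketch of the modelling step with Gensim LDA and scikit-learn K-Means; the toy documents, topic/cluster counts, and the TF-IDF features used for clustering are illustrative assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the cleaned, lemmatized abstracts produced earlier.
abstracts = [
    "deep neural network training gradient descent",
    "bayesian inference latent variable model",
    "reinforcement learning policy gradient reward",
]

# Topic modelling with Gensim LDA on a bag-of-words corpus.
tokenized = [doc.split() for doc in abstracts]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=3, passes=10, random_state=42)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Document clustering with K-Means on TF-IDF features.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(abstracts)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
print(kmeans.fit_predict(X))
```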

🔖 What I hope to do next

  • I want to use pywsd instead of NLTK for lemmatization and compare the results.
  • I want to try other topic modelling techniques such as LSA and NMF (a rough sketch follows this list). See here.
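
For the LSA/NMF idea, a hedged sketch of what a scikit-learn comparison could look like; this is prospective, not code that exists in the repository:

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the cleaned, lemmatized abstracts.
abstracts = [
    "deep neural network training gradient descent",
    "bayesian inference latent variable model",
    "reinforcement learning policy gradient reward",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(abstracts)
terms = tfidf.get_feature_names_out()

# NMF: non-negative factorization of the TF-IDF matrix into topics.
nmf = NMF(n_components=2, random_state=42)
nmf.fit(X)

# LSA: truncated SVD of the same TF-IDF matrix.
lsa = TruncatedSVD(n_components=2, random_state=42)
lsa.fit(X)

# Top terms per NMF topic, for a quick qualitative comparison with LDA.
for i, component in enumerate(nmf.components_):
    top = component.argsort()[::-1][:5]
    print(f"Topic {i}:", [terms[j] for j in top])
```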

S.D.G.
