
Data-Sciecne-Applications/clustering-assignment


clustering-assignment

Contributors: Zhengliang Wang, Chenyang Zhao, Jianyuan Chen, Weichen Liu, Yunzhou Wang

Project Background

We used Bag-of-Words, TF-IDF, LDA and Doc2Vec for feature engineering, and the K-Means, Gaussian Mixture and Agglomerative clustering algorithms to cluster passages from the following books: 'THE COMMON LAW.txt', 'THE CONSTITUTION OF THE UNITED STATES OF AMERICA.txt', 'THE-ENGLISH-CONSTITUTION.txt', 'THE-LIFE-OF-THE-BEE.txt', 'THE STANDARD ELECTRICAL DICTIONARY.txt', 'THE-PHILOSOPHY-OF-MATHEMATICS.txt', 'WHITE-HOUSE-COOK-BOOK.txt'. All of them can be found at https://www.gutenberg.org/
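As a rough illustration of the pipeline described above, the following minimal sketch pairs one of the feature representations (TF-IDF) with the three clustering algorithms, using scikit-learn. The toy passages, the function name `cluster_passages` and all parameter values are assumptions made for this example; they are not taken from main.ipynb.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

# Toy passages standing in for book excerpts (hypothetical, not the real data).
passages = [
    "the court held that the contract was void under common law",
    "liability in tort depends on negligence and the duty of care",
    "the queen bee lays eggs in the cells of the hive",
    "worker bees gather nectar and build the honeycomb in the hive",
    "beat the eggs and butter then bake the cake in a hot oven",
    "simmer the soup with salt pepper and butter until tender",
]


def cluster_passages(texts, n_clusters, seed=0):
    """Vectorise texts with TF-IDF and cluster them with three algorithms."""
    # Dense array, because GaussianMixture and Ward linkage need dense input.
    features = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    # Diagonal covariances keep the mixture stable on few, high-dimensional samples.
    gmm = GaussianMixture(
        n_components=n_clusters, covariance_type="diag", random_state=seed
    )
    agglomerative = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")

    return {
        "kmeans": kmeans.fit_predict(features),
        "gmm": gmm.fit_predict(features),
        "agglomerative": agglomerative.fit_predict(features),
    }


results = cluster_passages(passages, n_clusters=3)
for name, labels in results.items():
    print(name, labels.tolist())
```

Cluster ids are arbitrary, so two algorithms may number the same grouping differently.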

How to run the code:

The code can be run in Colab or local Jupyter Notebook.

The data modelling and result outputs are in main.ipynb, which consists of feature engineering, data modelling, and score displays for Kappa, Silhouette and Rand Index. Since LDA is the only model here that produces topics, we report the best coherence score for the LDA model only. Further details of the coherence pattern can be found in LDA/LDA_ipynb.ipynb.
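The three scores named above can be computed with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the notebook's code: the data is a toy example, and because Kappa needs cluster ids that correspond to the true book labels, the hypothetical helper `align_labels` relabels clusters with Hungarian matching. How main.ipynb aligns the labels is not stated here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (
    cohen_kappa_score,
    confusion_matrix,
    rand_score,
    silhouette_score,
)


def align_labels(y_true, y_pred):
    """Relabel cluster ids so they best match the true labels (Hungarian matching)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    # cm[i, j] counts samples with true label i that were put in cluster j.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    rows, cols = linear_sum_assignment(-cm)  # negate to maximise agreement
    mapping = {labels[c]: labels[r] for r, c in zip(rows, cols)}
    return np.array([mapping[p] for p in y_pred])


# Toy data: three well-separated groups; cluster ids are a permutation of the truth.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [0, 10], [0, 11]], dtype=float)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])

aligned = align_labels(y_true, y_pred)
kappa = cohen_kappa_score(y_true, aligned)  # needs aligned labels
rand = rand_score(y_true, y_pred)           # invariant to label permutation
silhouette = silhouette_score(X, y_pred)    # needs no ground truth

print(f"kappa={kappa:.3f} rand={rand:.3f} silhouette={silhouette:.3f}")
```

Silhouette is the only one of the three that does not need ground-truth labels, so it can also be used to compare different numbers of clusters.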

Error analysis can be found at the end of main.ipynb and in LDA/lda_erroranalysis.ipynb.

To keep the project tidy, we did not include clustering graphs in the output, but the function compare_predict(...) provides options to display clustering results as a 2D t-SNE plot, an SVD t-SNE plot, or a dendrogram.
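The signature of compare_predict(...) is not documented here, so the sketch below does not call it. It only shows how the three kinds of view could be computed with scikit-learn and SciPy. It assumes that "SVD t-SNE" means a truncated SVD reduction followed by t-SNE, and it uses random features in place of the real document vectors. Plotting is left out; the outputs are the arrays a plot would be drawn from.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Random non-negative features standing in for TF-IDF vectors (hypothetical data).
rng = np.random.default_rng(0)
features = rng.random((30, 50))

# 2D t-SNE directly on the features; perplexity must be below the sample count.
tsne_2d = TSNE(
    n_components=2, perplexity=5, init="random", random_state=0
).fit_transform(features)

# SVD followed by t-SNE: reduce to 10 dimensions first, then embed in 2D.
reduced = TruncatedSVD(n_components=10, random_state=0).fit_transform(features)
svd_tsne_2d = TSNE(
    n_components=2, perplexity=5, init="random", random_state=0
).fit_transform(reduced)

# Dendrogram data from Ward linkage; no_plot=True returns the layout only.
tree = linkage(features, method="ward")
leaf_order = dendrogram(tree, no_plot=True)["leaves"]

print(tsne_2d.shape, svd_tsne_2d.shape, tree.shape)
```

To draw the views, the two embeddings can be passed to a scatter plot coloured by cluster label, and `dendrogram(tree)` can be called without `no_plot` when matplotlib is available.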

Libraries:

  • scikit-learn
  • matplotlib
  • plotly
  • wordcloud
  • nltk
  • pandas
  • numpy
  • seaborn
  • gensim
  • kneed
  • pyLDAvis
  • torchvision
  • spacy
