clustering-assignment

Contributors: Zhengliang Wang, Chenyang Zhao, Jianyuan Chen, Weichen Liu, Yunzhou Wang

Project Background

We used Bag-of-Words, TF-IDF, LDA, and Doc2Vec for feature engineering, and the K-Means, Gaussian Mixture, and Agglomerative clustering algorithms to cluster the following books: 'THE COMMON LAW.txt', 'THE CONSTITUTION OF THE UNITED STATES OF AMERICA.txt', 'THE-ENGLISH-CONSTITUTION.txt', 'THE-LIFE-OF-THE-BEE.txt', 'THE STANDARD ELECTRICAL DICTIONARY.txt', 'THE-PHILOSOPHY-OF-MATHEMATICS.txt', 'WHITE-HOUSE-COOK-BOOK.txt'. All of them can be found at https://www.gutenberg.org/
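As a rough illustration of one feature/algorithm pairing named above (TF-IDF features with K-Means), here is a minimal sketch using scikit-learn. The four short texts are hypothetical stand-ins for book excerpts, and the parameter choices (`n_clusters=2`, `n_init=10`, the English stop-word list) are assumptions for the example, not the settings used in main.ipynb.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for book excerpts (hypothetical, not the project's data).
docs = [
    "the court applied the common law of contract to the case",
    "under common law the court held the contract void",
    "beat the eggs and stir in flour sugar and butter",
    "bake the cake with eggs flour sugar and butter",
]

# TF-IDF features: one row per document, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# K-Means with a fixed seed so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Documents with the same label were placed in the same cluster.
print(labels)
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give Bag-of-Words features under the same pattern; the other feature sets and clustering algorithms would need their own setup.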

How to run the code:

The code can be run in Colab or local Jupyter Notebook.

The data modelling and result outputs are in main.ipynb, which covers feature engineering, data modelling, and score displays for Kappa, Silhouette, and Rand Index. Since LDA is the only model here that produces topics, we report a coherence score only for LDA, and only the best one. Further details of the coherence pattern can be found in LDA/LDA_ipynb.ipynb.
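For readers unfamiliar with the three scores, the sketch below shows one way they could be computed with scikit-learn and SciPy. It is not the code in measures.py or main.ipynb: the toy data is hypothetical, the adjusted variant of the Rand Index is used as an assumption, and the Hungarian matching step before Kappa is one common way to align arbitrary cluster ids with true labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (
    adjusted_rand_score,
    cohen_kappa_score,
    confusion_matrix,
    silhouette_score,
)

# Hypothetical toy data: two well-separated groups in 2-D.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
true_labels = np.array([0, 0, 0, 1, 1, 1])
# Cluster ids are arbitrary, so a perfect clustering may come out "flipped".
pred_labels = np.array([1, 1, 1, 0, 0, 0])

# Silhouette needs only the features and the predicted clusters.
sil = silhouette_score(X, pred_labels)

# The adjusted Rand index does not depend on how cluster ids are numbered.
ari = adjusted_rand_score(true_labels, pred_labels)

# Kappa compares label values directly, so first map each cluster id to the
# true label it overlaps most (Hungarian matching on the confusion matrix).
# Labels here are 0..k-1, so matrix indices equal label values.
cm = confusion_matrix(true_labels, pred_labels)
rows, cols = linear_sum_assignment(-cm)
mapping = {int(pred): int(true) for true, pred in zip(rows, cols)}
aligned = np.array([mapping[int(p)] for p in pred_labels])
kappa = cohen_kappa_score(true_labels, aligned)

print(f"silhouette={sil:.3f}  ARI={ari:.3f}  kappa={kappa:.3f}")
```

Silhouette is the only one of the three that works without ground-truth labels; Kappa and the Rand Index both need the true book for each sample.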

Error analysis can be found at the end of main.ipynb and in LDA/lda_erroranalysis.ipynb.

To keep the project tidy, we did not include clustering graphs. However, the function compare_predict(...) provides options to display clustering results in TSNE-2D, SVD TSNE, and dendrogram form.
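The sketch below shows roughly what lies behind such views, without reproducing compare_predict(...) itself, whose signature is not documented here. It assumes a random matrix as a hypothetical stand-in for document features, reduces it with truncated SVD, embeds it in 2-D with t-SNE, and builds a Ward-linkage dendrogram layout. All parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-in for a document-feature matrix: 30 docs, 50 features,
# drawn from two groups with different means.
X = np.vstack([rng.normal(0, 1, (15, 50)), rng.normal(4, 1, (15, 50))])

# Reduce with SVD first, then embed in 2-D with t-SNE.
X_svd = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
X_2d = tsne.fit_transform(X_svd)

# Ward linkage for a dendrogram; no_plot=True returns the layout only,
# so no plotting backend is needed to run this sketch.
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True)

print(X_2d.shape, Z.shape, len(tree["leaves"]))
```

To draw the figures, `X_2d` can be passed to a scatter plot coloured by cluster label, and `dendrogram(Z)` can be called without `no_plot=True` inside a matplotlib figure.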

Libraries:

scikit-learn
matplotlib
plotly
wordcloud
nltk
pandas
numpy
seaborn
gensim
kneed
pyLDAvis
torchvision
spacy

