This repository contains a project on clustering news articles and headlines that are shared on Facebook.
Dataset- https://drive.google.com/file/d/1NbB053Q4MulTlzxINrlLe9VyhD9GzF0J/view?usp=sharing
YouTube walkthrough- link to be updated
Dataset snapshot
Data distribution among topics-
Text cleaning procedure
- Dropped rows with missing values (dropna)
- Removed stopwords and words with length less than 2
- Removed numerical text
- Lemmatized the remaining words
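The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stopword set is a tiny hypothetical sample, the function names `clean_text` and `clean_corpus` are made up for this example, and lemmatization is left as a pluggable callable (identity by default) because a real lemmatizer such as NLTK's WordNet one needs extra data to be downloaded.

```python
import re

# Hypothetical, deliberately tiny stopword set; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to"}

def clean_text(text, lemmatize=lambda word: word):
    """Lowercase, keep alphabetic tokens only, drop stopwords and short words."""
    # Keeping only runs of letters removes digits and punctuation in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS and len(t) >= 2]
    # `lemmatize` is an assumption: plug in a real lemmatizer here if available.
    return " ".join(lemmatize(t) for t in kept)

def clean_corpus(texts):
    """Drop missing entries (the dropna step), then clean what is left."""
    return [clean_text(t) for t in texts if t is not None]

print(clean_corpus(["The 3 markets are rising in 2020!", None, "Stocks fell 5%"]))
```

Passing a different `lemmatize` callable changes only the last step, so the filtering logic can be checked independently of the lemmatizer.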
Dataset snapshot after cleaning
News articles were clustered using three vectorisation techniques, each combined with two clustering algorithms.
To find the optimum number of clusters, the elbow curve method was used.
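As a rough illustration of the elbow method, the sketch below fits K-means for several values of k and records the inertia (within-cluster sum of squares). It assumes scikit-learn is installed and uses synthetic blobs in place of the real document vectors. The range of k and the blob centres are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated groups in 2-D (an assumption,
# not the news dataset).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=0,
)

ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `ks` gives the elbow curve. The k after which the drop in inertia flattens out is taken as the number of clusters, which for this synthetic data should be around 3.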
For dimensionality reduction we used t-SNE.
To convert Word2Vec word vectors into a document vector, a new method, MIN-MAX word vector, was used.
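The README does not spell out how the MIN-MAX document vector is built. One plausible reading, sketched below, is to take the element-wise minimum and maximum over a document's word vectors and concatenate them, so a document vector has twice the word-vector dimension. The function name, the toy embeddings, and the zero-vector fallback for documents with no known words are all assumptions made for this example.

```python
import numpy as np

def min_max_doc_vector(tokens, word_vectors, dim):
    """Concatenate the element-wise min and max of the document's word vectors."""
    # Skip out-of-vocabulary tokens; `word_vectors` only needs `in` and `[]`.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        # Assumption: represent a document with no known words as all zeros.
        return np.zeros(2 * dim)
    stacked = np.vstack(vecs)
    return np.concatenate([stacked.min(axis=0), stacked.max(axis=0)])

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "stock": np.array([0.1, 0.9, -0.3]),
    "market": np.array([0.4, 0.2, 0.5]),
}

doc_vec = min_max_doc_vector(["stock", "market", "unknown"], toy_vectors, dim=3)
print(doc_vec)
```

A plain dict is used here, but any mapping that supports `in` and `[]` lookups by word should work the same way.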
Vectorisation
- TF-IDF - parameters: n-grams from 1 to 3, min_df = 0.15, max_features = 10000
- Word2Vec - Google word2vec vectors via Gensim
- Doc2Vec
Clustering algorithms
- K-means
- Agglomerative
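To make the TF-IDF combination concrete, here is a minimal sketch that vectorises a small corpus with the TF-IDF parameters listed above and runs both clustering algorithms on it. It assumes scikit-learn. The six headlines are made up, and the cluster count of 2 is chosen for the toy data only, not taken from the project.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned headlines.
docs = [
    "stock market rally lifts bank shares",
    "bank shares fall as stock market slides",
    "investors watch stock market and bank earnings",
    "team wins football match in final minute",
    "football team loses match after late goal",
    "coach praises football team after match win",
]

# TF-IDF with 1- to 3-grams, min_df=0.15 and at most 10000 features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=0.15, max_features=10000)
X = vectorizer.fit_transform(docs)

n_clusters = 2  # assumption for the toy corpus only

kmeans_labels = KMeans(
    n_clusters=n_clusters, n_init=10, random_state=0
).fit_predict(X)

# AgglomerativeClustering needs a dense array, so convert the sparse matrix.
agglo_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X.toarray())

print("K-means:      ", kmeans_labels)
print("Agglomerative:", agglo_labels)
```

The same two clustering calls can be applied to Word2Vec or Doc2Vec document vectors by swapping `X` for the corresponding dense matrix.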
Below are the results of clusters
Extra: K-means clustering on TF-IDF vectors, with MDS for dimensionality reduction.
Below are the stats of clusters
Observations:
Across the 6 combinations of vectorisation technique (TF-IDF, Word2Vec and Doc2Vec) and clustering algorithm, the results are quite different.
Doc2Vec did not perform well with either K-means or agglomerative clustering. Both failed to cluster the topics.
TF-IDF performed comparatively better than Doc2Vec but failed to cluster topics in 1-2 categories with both K-means and agglomerative clustering.
Word2Vec performed the best compared to both TF-IDF and Doc2Vec. Compared to K-means, agglomerative clustering performed well and clustered the topics appropriately on the sample.
Word2Vec with K-means performed very well.