Hamoon1987/TwitterTopicModeling

TwitterTopicModeling

Topic modeling on tweets, using doc2vec document embedding and k-means clustering to categorize tweets.

The article is available here. The goal of this code is to categorize tweets into main themes. The figure below shows the main steps of the process:


1- The dataset is a collection of tweets related to a specific subject; in my case, tweets related to the COVID-19 pandemic. The database should be extracted to the MySQL folder of XAMPP. load_data.py loads the tweets.
2- Preprocessing includes converting letters to lowercase; removing URLs, mentions, stopwords, and emojis; correcting repeated characters; tokenizing; and replacing negations with NOT. preprocessing.py preprocesses the tweets.
3- Document embedding is done using the doc2vec algorithm in doc2vec.py.
4- Clustering is performed using the k-means algorithm in clustering.py.
5- Theme extraction is done manually, based on the most frequent words in each cluster, which are generated in evaluate.py.
All of the steps above are performed by running main.py.
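
As a rough illustration of steps 2 and 5, here is a minimal standard-library sketch. It is not the code from preprocessing.py or evaluate.py. The function names `preprocess` and `top_words_per_cluster`, the tiny stopword and negation lists, and the rule of collapsing repeated characters down to two are all assumptions made for this example. The cluster labels are assumed to come from the k-means step.

```python
import re
from collections import Counter

# Tiny illustrative lists (assumptions); a real run would use fuller ones.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "in", "and", "it", "this", "i"}
NEGATIONS = {"not", "no", "never", "cannot"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character
TOKEN_RE = re.compile(r"[a-z']+")     # ASCII letters and apostrophes only


def preprocess(tweet):
    """Turn one raw tweet into a list of cleaned tokens."""
    text = tweet.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = URL_RE.sub(" ", text)                 # remove URLs
    text = MENTION_RE.sub(" ", text)             # remove @mentions
    text = REPEAT_RE.sub(r"\1\1", text)          # "loooong" -> "loong" (assumed rule)
    tokens = []
    # Keeping only [a-z'] runs also drops emojis, digits, and punctuation.
    for tok in TOKEN_RE.findall(text):
        tok = tok.strip("'")
        if tok in NEGATIONS or tok.endswith("n't"):
            tokens.append("NOT")                 # replace negations with NOT
        elif tok and tok not in STOPWORDS:
            tokens.append(tok)
    return tokens


def top_words_per_cluster(token_lists, labels, n=10):
    """Return the n most frequent tokens for each cluster label."""
    counts = {}
    for tokens, label in zip(token_lists, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {label: [word for word, _ in counter.most_common(n)]
            for label, counter in counts.items()}


if __name__ == "__main__":
    tweets = [
        "@WHO I don't like this loooong lockdown https://t.co/abc",
        "The vaccine trial is going well, vaccine doses are ready",
    ]
    docs = [preprocess(t) for t in tweets]
    labels = [0, 1]  # hypothetical cluster labels, e.g. from k-means
    print(docs)
    print(top_words_per_cluster(docs, labels, n=3))
```

The word lists returned by `top_words_per_cluster` are only an aid: as in step 5, naming the theme of each cluster is still done by hand.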
