Unsupervised-Topic-Modeling-Spark

Unsupervised Topic Modeling of tweets using Spark

LDA and LSA Comparative analysis

Step 1: Data collection using Apache flume with the following keyword: #suicidal Step 2: Data Cleaning on spark dataframes: For the data cleaning, the steps we carried out are removind the following tokens: stop words, numerical values, Symbols, unique words (df >=2), and words with length < 4.

Step 3: we performed LDA (Latent Dirichlet allocation) by using ml library in spark. and also LSA (Latent Semantic Analysis) by using Gensim library and found 5 topics with each algorithm over all cleaned and tokenized tweets.

Step 4: We devised a way to compare their result and find the similarity between these topics of each algorithm. We created similarity score between each two topic based on simple weight multiplication of identical words.

Step 5: We created a 5*5 matrix that showed the similarity score between each LDA and LSA topic, and then decide which two topics are equal and which are different.

Step 6: We created a table that shows all found topics and their top 10 significant words. the topics that are similar are shown in a single row.

Results found: LDA and LSA topics may overlap, but each of them can find new unique topics, by applying both and finding similarity between their topics, we can extract more useful data from texts.

Twitter Keywors used: #suicidal

How To Run Reza_Program_1.py: export PYSPARK_PYTHON=python3 spark-submit Reza_Program_1.py

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md
Reza_Program_1.py		Reza_Program_1.py
Similarity.JPG		Similarity.JPG
Topics-words.JPG		Topics-words.JPG
configuration.conf		configuration.conf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Unsupervised-Topic-Modeling-Spark

About

Releases

Packages

Languages

Reza-Marzban/Unsupervised-Topic-Modeling-Spark

Folders and files

Latest commit

History

Repository files navigation

Unsupervised-Topic-Modeling-Spark

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages