In this project, I use the Quora questions dataset to fit a Latent Dirichlet Allocation (LDA) model and assign each question to one of 10 topics learned without supervision.
The data consists of 404,289 Quora questions. The following is a snapshot of the data:
- First, we create a TF-IDF matrix from the given questions
- Second, we use `LatentDirichletAllocation` from `sklearn.decomposition` to fit an LDA model that describes each topic by its highest-probability words from our vocabulary (the unique words taken from all the questions in our dataset)
- Third, for each question we select the topic with the highest probability for that question
- We follow a similar approach with Non-Negative Matrix Factorization (NMF): the TF-IDF matrix (questions vs. words) is factorized into two non-negative matrices, (1) questions vs. topics and (2) topics vs. words
#0 --> Technical/Books/Movies related questions
#1 --> Looks related questions
#2 --> QnA related questions
#3 --> Social Media related questions
#4 --> Life related questions
#5 --> People/Nationality related questions
#6 --> Language/Programming related questions
#7 --> Politics related questions
#8 --> Finance related questions
#9 --> Daily time related questions
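
Once each question has a row of topic weights, the labels above can be attached to it by index. The sketch below is only an illustration: the names `TOPIC_LABELS` and `label_questions` are hypothetical, and the example weights are invented rather than taken from the fitted model.

```python
# Labels copied from the topic list above, keyed by topic index.
TOPIC_LABELS = {
    0: "Technical/Books/Movies",
    1: "Looks",
    2: "QnA",
    3: "Social Media",
    4: "Life",
    5: "People/Nationality",
    6: "Language/Programming",
    7: "Politics",
    8: "Finance",
    9: "Daily time",
}


def label_questions(doc_topic):
    """Map each row of topic weights (one row per question) to a topic label.

    doc_topic is any sequence of rows with 10 weights each, for example the
    output of a fitted model's transform step converted to a list of lists.
    """
    labels = []
    for row in doc_topic:
        best = max(range(len(row)), key=lambda i: row[i])  # index of largest weight
        labels.append(TOPIC_LABELS[best])
    return labels


# Invented topic weights for two questions (not real model output).
example = [
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.91, 0.01],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.82, 0.02, 0.02, 0.02],
]
print(label_questions(example))  # → ['Finance', 'Language/Programming']
```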