# Topic Modeling 400,000 Quora Questions

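Non-negative Matrix Factorization (NMF) is an unsupervised algorithm that approximates a non-negative matrix as the product of two smaller non-negative matrices. I use it in conjunction with TF-IDF to model topics across roughly 400,000 Quora questions.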

Once the model is fitted, I try to guess each topic from its 20 highest-weighted words. I invite the reader to try and guess the topics too!
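To make the factorization idea concrete, here is a minimal sketch on a made-up 6x4 matrix rather than the Quora data. The toy counts, the variable names, and the settings `init="nndsvda"` and `max_iter=500` are my own assumptions for illustration, not the settings used in the rest of this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy "document-term" matrix: 6 documents x 4 terms.
# Documents 0-2 mostly use terms 0-1, documents 3-5 mostly use terms 2-3.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 1.0],
    [4.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 2.0],
    [1.0, 0.0, 2.0, 4.0],
    [0.0, 0.0, 1.0, 3.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (6, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 4)

print(W.shape, H.shape)
print(np.round(W @ H, 1))      # approximate reconstruction of V
```

Both factors stay non-negative, which is what lets each row of `H` be read as a weighted list of words for one topic.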

In [34]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [24]:
quora = pd.read_csv("quora_questions.csv")

In [27]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [29]:
# Sample data:
quora["Question"][0]

'What is the step by step guide to invest in share market in india?'

In [31]:
# max_df=0.95 drops any term that appears in more than 95% of the questions, since such terms
# rarely point to a particular topic. min_df=2 keeps only terms that appear in at least two questions.
# English stop words are removed as well. You are welcome to choose your own values here.

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
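As a rough illustration of how the two document-frequency thresholds behave, here is a small sketch on a made-up corpus. The four sentences and the `toy_` names are hypothetical, and I leave out `stop_words` so that only `max_df` and `min_df` decide what survives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the Quora data.
toy_docs = [
    "cats chase mice",
    "cats eat fish",
    "cats sleep",
    "dogs chase cats",
]

toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
toy_tfidf.fit(toy_docs)

# "cats" is in 4 of 4 documents (100% > 95%), so max_df drops it.
# "chase" is in 2 documents, so it survives min_df=2.
# Every other word occurs in a single document and is dropped by min_df.
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)
```

Only `chase` is left in the vocabulary, which shows that `max_df` is a threshold on document frequency rather than a cut of the top 5% of terms.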

In [32]:
dtm = tfidf.fit_transform(quora["Question"])

In [33]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
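The summary above already tells us how sparse the matrix is. A quick back-of-the-envelope check, using only the three numbers printed in that summary (the variable names are mine):

```python
# Numbers copied from the sparse-matrix summary above.
n_questions, n_terms, n_stored = 404289, 38669, 2002912

avg_terms_per_question = n_stored / n_questions
density = n_stored / (n_questions * n_terms)

print(f"{avg_terms_per_question:.2f} stored terms per question on average")
print(f"{density:.4%} of the matrix entries are non-zero")
```

Each question keeps about five vocabulary terms on average, and only about 0.013% of the entries are non-zero, which is why a sparse format is used here.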

In [37]:
# We are looking for 20 topics. random_state=42 fixes the seed so the run is reproducible
# (and 42 is the answer to life, the universe and everything).

nmf_model = NMF(n_components=20, random_state=42)

In [38]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

Let's see the 20 highest-weighted terms in each of our 20 topics. Each list is printed in ascending order of weight, so the strongest term comes last:

In [39]:
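# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 keep the old name.
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    print(f"The top 20 words for topic # {index}")
    # argsort() is ascending, so the last 20 indices carry the 20 largest weights.
    print([feature_names[i] for i in topic.argsort()[-20:]])
    print("\n")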

The top 20 words for topic # 0
['app', 'engineering', 'friend', 'website', 'site', 'thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


The top 20 words for topic # 1
['come', 'relationship', 'says', 'universities', 'grads', 'majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


The top 20 words for topic # 2
['users', 'writer', 'marked', 'search', 'use', 'add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


The top 20 words for topic # 3
['com', 'facebook', 'job', 'easiest', 'making', 'using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


The top 20 words for topic # 4
['embarrassing', 'decision', 'biggest', 'work', 'did', 'balance', 'ea
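
The order of the lists above follows from how `argsort()` works. A minimal sketch with made-up weights (the four words and their weights are hypothetical, not taken from the fitted model) shows the same slicing and how to reverse it if you prefer the strongest word first:

```python
import numpy as np

# Hypothetical weights for a four-word vocabulary (not from the fitted model).
toy_words = np.array(["app", "best", "book", "movie"])
toy_topic = np.array([0.1, 0.9, 0.3, 0.7])

ascending = toy_topic.argsort()[-2:]   # same slicing as the loop, top 2 instead of top 20
descending = ascending[::-1]           # reverse to put the strongest word first

print(toy_words[ascending].tolist())
print(toy_words[descending].tolist())
```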

In [40]:
topic_results = nmf_model.transform(dtm)

In [41]:
quora["Topic"] = topic_results.argmax(axis=1)
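
Here `argmax(axis=1)` picks, for every question, the topic with the largest weight. The sketch below uses a made-up weight matrix (all values and `toy_` names are hypothetical) to show the mechanics, including one caveat: a row with no weight on any topic still gets index 0.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents x 3 topics.
toy_weights = np.array([
    [0.0, 0.2, 0.7],   # strongest on topic 2
    [0.5, 0.1, 0.0],   # strongest on topic 0
    [0.0, 0.0, 0.0],   # no signal at all
])

toy_topics = toy_weights.argmax(axis=1)
has_signal = toy_weights.max(axis=1) > 0

print(toy_topics.tolist())
print(has_signal.tolist())
```

If questions whose words were all filtered out by the vectorizer matter for your analysis, a mask like `has_signal` is one possible way to set them aside before labelling.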

In [42]:
quora.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14


In [48]:
# It is up to you to interpret these terms and label the topics. The better the interpreter knows
# the domain, the more likely the labels are to reflect what the model actually found.
# I invite you to try your own labels based on the words you see (it's tricky!).

my_topic_dict = {0:"Entertainment", 1:"Student Life", 2:"Research", 3:"Remote Work", 4:"Philosophy", 5:"Economics", 6:"Programming", 7:"Elections", 8:"Diplomacy", 9:"Relationships", 10:"Renewable Energy", 11:"Money", 12:"Social Media", 13:"Language and Communication", 14:"Health", 15:"Time", 16:"Love", 17:"Life Hacks", 18:"Software Development", 19:"People"}

quora["Topic Label"] = quora["Topic"].map(my_topic_dict)
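
`Series.map` with a dictionary looks up each topic number and returns the matching label. A small sketch with a deliberately incomplete dictionary (the `toy_` names and values are hypothetical) shows what happens when a topic number has no entry:

```python
import pandas as pd

# Hypothetical, deliberately incomplete label dictionary.
toy_labels = {0: "Entertainment", 5: "Economics"}
toy_topics = pd.Series([0, 5, 7])

mapped = toy_topics.map(toy_labels)
print(mapped.isna().tolist())   # True marks a topic number with no label
```

A missing key becomes `NaN` instead of raising an error, so it may be worth checking `isna()` after editing the label dictionary.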

In [60]:
quora["Question"][40]

'Why do Slavs squat?'

According to the model and my labels, question 40 ("Why do Slavs squat?") is a question concerning Health.