### Topic Modeling using LDA and NMF in Python


**Dataset description**: 404,289 Quora questions without any labelled category.
    
**Objective**: Find 20 categories to assign to the unlabelled questions using unsupervised learning techniques such as Latent   Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

In [15]:
##basic imports
import numpy as np
import pandas as pd
import sklearn

In [16]:
##load dataset
df = pd.read_csv('quora_questions.csv')
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [17]:
##null value check
total_null = df['Question'].isnull().sum()
print(total_null)

0


In [18]:
##empty string check
total_blank = df['Question'].apply(lambda string: string.isspace()).sum()
print(total_blank)

0


In [19]:
##checking shape
df.shape

(404289, 1)

### Part 1: Latent Dirichlet Allocation (LDA) based topic modeling

In [21]:
##vectorization

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.7, min_df=2, stop_words='english')

cv_term_matrix = cv.fit_transform(df['Question'])

**Comments**:

1) max_df=0.7 => only consider words that appear in less than 70% of the questions
2) min_df=2 => only consider words that appear in atleast 2 questions
3) stop_words='english' => don't consider way too common english words for vectorization

**Important**: Count vectorization is preferred over TF-IDF here because LDA is based on the Dirichlet Probability Distribution that puts importance on the occurence count of words.

In [22]:
##Latent Dirichlet Allocation

from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=20, random_state=27) ##n_components => no. of topics to identify

LDA.fit(cv_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=20, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=27, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [23]:
##now, the interesting part

In [24]:
##check length of count vectorizer vocabulary
len(cv.get_feature_names())

38669

In [25]:
LDA.components_.shape

(20, 38669)

**Important**: LDA.components_ essentially outputs the normalized distribution for the 20 desired topics over the 38669 words in the vocabulary of the count vectorizer.

In [35]:
##finding the 10 highest probabilty words for each identified topic

lda_dct = dict()

for index, topic in enumerate(LDA.components_):
    indices = topic.argsort()[-10:] ##grab index positions for 10 highest normalized probability values
    words = [cv.get_feature_names()[i] for i in indices] ##use indices to get words from count vectorizer vocabulary
    lda_dct[index] = words
    
##print topics against keywords
for topic, keywords in lda_dct.items():
    print(f"Topic {topic}: {', '.join(keywords)}.\n")

Topic 0: service, police, india, bad, place, safe, laptop, good, buy, best.

Topic 1: air, house, beautiful, body, cause, hair, hard, water, long, does.

Topic 2: way, earn, free, lose, ways, weight, online, best, money, make.

Topic 3: stock, english, make, market, friends, college, stop, girl, best, way.

Topic 4: grow, mean, guy, sleep, car, look, feel, like, work, does.

Topic 5: visit, friend, places, making, process, age, tell, years, year, old.

Topic 6: battle, small, usa, sex, did, history, social, think, start, business.

Topic 7: pakistan, black, did, war, india, difference, love, mean, world, does.

Topic 8: relationship, important, school, new, going, don, day, things, know, like.

Topic 9: current, sentence, machine, human, india, word, real, read, used, life.

Topic 10: child, pregnant, asked, sex, period, makes, average, men, women, good.

Topic 11: website, movies, increase, hillary, clinton, study, donald, improve, english, trump.

Topic 12: india, book, programming, 

In [37]:
##assigning topics to each question alongside representative keywords

lda_topic_distribution = LDA.transform(cv_term_matrix)

df['lda_topic'] = lda_topic_distribution.argmax(axis=1) ##grab the topic with highest normalized probability
df['lda_keywords'] = df['lda_topic'].apply(lambda topic: ', '.join(lda_dct[topic])) ##grab keywords from the lda_dct

df.head()

Unnamed: 0,Question,lda_topic,lda_keywords
0,What is the step by step guide to invest in sh...,3,"stock, english, make, market, friends, college..."
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,14,"difference, data, science, engineering, app, c..."
2,How can I increase the speed of my internet co...,14,"difference, data, science, engineering, app, c..."
3,Why am I mentally very lonely? How can I solve...,7,"pakistan, black, did, war, india, difference, ..."
4,"Which one dissolve in water quikly sugar, salt...",1,"air, house, beautiful, body, cause, hair, hard..."


**Important**: LDA.transform(cv_term_matrix) outputs normalized Dirichlet Probability Distribution for all documents in the corpus gainst the 20 desired topics.

In [48]:
##random lda check function
def lda_check(index=np.random.randint(0,df.shape[0])):
    print(f"Topic {df['lda_topic'][index]}: {df['lda_keywords'][index]}.")
    print("\n")
    print(df['Question'][index])

In [49]:
lda_check(1234)

Topic 14: difference, data, science, engineering, app, computer, iphone, android, examples, use.


Why do some old computer games run very fast on new, powerful computers?


In [50]:
lda_check()

Topic 3: stock, english, make, market, friends, college, stop, girl, best, way.


How can one impress girls on Quora?


### Part 2: Non-negative Matrix Factorization (NMF) based topic modeling

In [44]:
##vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df=0.7, min_df=2, stop_words='english')

tfidf_term_matrix = tfidf.fit_transform(df['Question'])

**Important**: TF-IDF vectorization can be used with NMF because it works with coefficients for comparison, unlike LDA that uses normalized Dirichlet probability distribution.

In [45]:
##Non-negative Matrix Factorization (NMF)

from sklearn.decomposition import NMF

NMF = NMF(n_components = 20, random_state=27)

NMF.fit(tfidf_term_matrix)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=27, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

In [46]:
##finding the 10 highest coefficient words for each identified topic

nmf_dct = dict()

for index, topic in enumerate(NMF.components_):
    indices = topic.argsort()[-10:] ##grab indices for the 10 highest coefficients
    words = [tfidf.get_feature_names()[i] for i in indices]
    nmf_dct[index] = words
    
##printing topics alongside keywords
for topic, keywords in nmf_dct.items():
    print(f"Topic {topic}: {', '.join(keywords)}.\n")

Topic 0: phone, buy, laptop, movie, ways, 2016, books, book, movies, best.

Topic 1: use, exist, really, compare, cost, long, feel, work, mean, does.

Topic 2: improvement, delete, asked, google, answers, answer, ask, question, questions, quora.

Topic 3: internet, free, home, easy, youtube, ways, earn, online, make, money.

Topic 4: live, want, change, moment, real, important, thing, meaning, purpose, life.

Topic 5: china, business, country, olympics, available, job, spotify, war, pakistan, india.

Topic 6: hacking, want, python, languages, java, learning, start, language, programming, learn.

Topic 7: vote, better, election, did, win, hillary, president, clinton, donald, trump.

Topic 8: place, pakistan, happen, end, country, iii, start, did, war, world.

Topic 9: culture, women, work, girls, live, girl, look, sex, feel, like.

Topic 10: business, read, start, job, work, engineering, ways, bad, books, good.

Topic 11: government, ban, banning, black, indian, rupee, rs, 1000, notes, 

In [47]:
##assigning topics to each question alongside keywords

nmf_topic_distribution = NMF.transform(tfidf_term_matrix)

df['nmf_topic'] = nmf_topic_distribution.argmax(axis=1) ##grab the topic with highest coefficient value
df['nmf_keywords'] = df['nmf_topic'].apply(lambda topic: ', '.join(nmf_dct[topic]))

df.head()

Unnamed: 0,Question,lda_topic,lda_keywords,nmf_topic,nmf_keywords
0,What is the step by step guide to invest in sh...,3,"stock, english, make, market, friends, college...",5,"china, business, country, olympics, available,..."
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,14,"difference, data, science, engineering, app, c...",16,"tell, forget, really, friend, true, know, pers..."
2,How can I increase the speed of my internet co...,14,"difference, data, science, engineering, app, c...",17,"increase, painless, instagram, account, best, ..."
3,Why am I mentally very lonely? How can I solve...,7,"pakistan, black, did, war, india, difference, ...",11,"government, ban, banning, black, indian, rupee..."
4,"Which one dissolve in water quikly sugar, salt...",1,"air, house, beautiful, body, cause, hair, hard...",14,"pounds, reduce, quickly, loss, fast, fat, ways..."


In [51]:
##random check function for NMF
def nmf_check(index=np.random.randint(0,df.shape[0])):
    print(f"Topic {df['nmf_topic'][index]}: {df['nmf_keywords'][index]}.")
    print("\n")
    print(df['Question'][index])

In [57]:
nmf_check(123879)

Topic 12: girl, 2017, year, don, employees, going, day, things, new, know.


I sent a follow request to someone on Instagram. Why hasn't he approved the request for so long? I know that person is posting pictures on instagram based on his increasing number of posts. Is he not getting my request or is he ignoring my request?


In [58]:
nmf_check()

Topic 10: business, read, start, job, work, engineering, ways, bad, books, good.


If I want to apply cdse. what course should I join at college?


In [59]:
##creating a random check function incorporating both LDA and NMF
def random_check(index=np.random.randint(0,df.shape[0])):
    print(f"LDA keywords: {df['lda_keywords'][index]}")
    print(f"NMF keywords: {df['nmf_keywords'][index]}")
    print('\n')
    print(df['Question'][index])

In [60]:
random_check()

LDA keywords: believe, answer, ask, instagram, question, facebook, questions, google, quora, people
NMF keywords: improvement, delete, asked, google, answers, answer, ask, question, questions, quora


How do I post something on Quora?


In [62]:
random_check()

LDA keywords: believe, answer, ask, instagram, question, facebook, questions, google, quora, people
NMF keywords: improvement, delete, asked, google, answers, answer, ask, question, questions, quora


How do I post something on Quora?
