
# Topic Modeling with Non-negative Matrix Factorization

We have a dataset of over 400,000 Quora questions with no labeled category, and we will attempt to find 19 categories to assign these questions to.

#### Import pandas and read in the quora_questions.csv file.

In [1]:
import pandas as pd
quora = pd.read_csv("files/quora_questions.csv")

In [2]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [3]:
quora.shape

(404289, 1)

# Preprocessing

#### Use TF-IDF vectorization to create a document-term matrix. You may want to explore the max_df and min_df parameters.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
tf_idf = TfidfVectorizer(max_df=0.9, min_df=2, stop_words='english')

In [6]:
df = tf_idf.fit_transform(quora['Question'])

In [7]:
df

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

# Non-negative Matrix Factorization

#### Use Scikit-Learn to create an instance of NMF with 19 expected components. (Use random_state=42.)

In [8]:
from sklearn.decomposition import NMF

In [9]:
nmf = NMF(n_components=19, random_state=42)

In [10]:
nmf.fit(df)
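
For intuition about what the fit above computes: NMF approximates a non-negative matrix X by the product of two smaller non-negative matrices, a document-topic matrix W and a topic-term matrix H. The sketch below uses a tiny made-up matrix (the values, `toy_nmf`, and `max_iter=500` are assumptions for illustration) to show the shapes involved. In scikit-learn, `fit_transform` returns W and `components_` holds H.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix standing in for a tiny document-term
# matrix (4 documents x 5 terms); the values are made up for illustration.
X = np.array([
    [1.0, 0.8, 0.0, 0.0, 0.1],
    [0.9, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.9, 0.0],
    [0.0, 0.1, 0.8, 1.0, 0.0],
])

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(X)  # document-topic weights, shape (4, 2)
H = toy_nmf.components_       # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print(np.round(W @ H, 2))     # approximate reconstruction of X
```

On the Quora data the same roles apply: `nmf.components_` has one row per topic and one column per vocabulary term, and `nmf.transform(df)` produces one row of topic weights per question.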



#### Print out the top 15 words (those with the highest NMF weights) for each of the 19 topics.

In [11]:
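# Look up the vocabulary once instead of on every loop iteration.
feature_names = tf_idf.get_feature_names_out()
for index, topic in enumerate(nmf.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort is ascending, so the highest-weighted word is printed last.
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')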

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'looking', 'differ', 'sex', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['things', 'earth', 'death', 'changed', 'want', 'live', 'day', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['prime', 'company', 'engineering', 'reservation', 'president', 'minister', 'china', 'country', 'oly
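
The word lists above are ordered from weakest to strongest because `argsort()` sorts in ascending order and `[-15:]` keeps the tail. A small sketch with a made-up vocabulary and a made-up topic row (both hypothetical) shows the slicing, and how reversing it puts the strongest word first:

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
vocab = np.array(["apple", "bank", "cat", "dog", "egg"])
weights = np.array([0.1, 0.9, 0.0, 0.5, 0.3])

# Indices of the 3 largest weights, in ascending order of weight.
top3 = weights.argsort()[-3:]
ascending = list(vocab[top3])
descending = list(vocab[top3[::-1]])

print(ascending)   # weakest of the top 3 first
print(descending)  # strongest first
```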

#### Add a new column to the original quora dataframe that assigns each question to one of the 19 topic categories.

In [12]:
quora

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."
...,...
404284,How many keywords are there in the Racket prog...
404285,Do you believe there is life after death?
404286,What is one coin?
404287,What is the approx annual cost of living while...


In [13]:
topic_result = nmf.transform(df)

In [14]:
quora['Topic'] = topic_result.argmax(axis=1)
quora

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
...,...,...
404284,How many keywords are there in the Racket prog...,6
404285,Do you believe there is life after death?,4
404286,What is one coin?,11
404287,What is the approx annual cost of living while...,11
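
The assignment above picks, for each question, the topic with the largest weight in its row of `topic_result`. The following sketch uses a made-up weight matrix (all values hypothetical) to show that step, and also keeps the winning weight as a rough indication of how strongly a row belongs to its topic. Note that a row containing only zeros still gets topic 0, because `argmax` returns the first position of the maximum.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents x 4 topics.
toy_topic_result = np.array([
    [0.02, 0.40, 0.01, 0.10],
    [0.30, 0.05, 0.00, 0.25],
    [0.00, 0.00, 0.07, 0.01],
    [0.00, 0.00, 0.00, 0.00],  # no topic has any weight
])

toy_topics = toy_topic_result.argmax(axis=1)   # index of the largest weight per row
toy_strength = toy_topic_result.max(axis=1)    # the largest weight itself

print(toy_topics)
print(toy_strength)
```

Checking the winning weight alongside the topic index is one way to spot rows whose assignment is weak or arbitrary.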


In [15]:
topic_dict = {
    0: "General Recommendations",
    1: "Academic and Career Choices",
    2: "Online Q&A and Search",
    3: "Online Earning and Investment",
    4: "Life and Philosophy",
    5: "Global Affairs and Jobs",
    6: "Programming and Learning",
    7: "Political Elections",
    8: "IAS and Education",
    9: "Culture and Gender",
    10: "Jobs and Entertainment",
    11: "Currency and Economy",
    12: "Beliefs and Mindset",
    13: "Language Skills and Learning",
    14: "Health and Fitness",
    15: "Personal Preferences and Time",
    16: "Personal Interests and Activities",
    17: "Social Media and Hacks",
    18: "Technology and Engineering"
}

In [16]:
quora["Topic Label"] = quora["Topic"].map(topic_dict)
quora

Unnamed: 0,Question,Topic,Topic Label
0,What is the step by step guide to invest in sh...,5,Global Affairs and Jobs
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16,Personal Interests and Activities
2,How can I increase the speed of my internet co...,17,Social Media and Hacks
3,Why am I mentally very lonely? How can I solve...,11,Currency and Economy
4,"Which one dissolve in water quikly sugar, salt...",14,Health and Fitness
...,...,...,...
404284,How many keywords are there in the Racket prog...,6,Programming and Learning
404285,Do you believe there is life after death?,4,Life and Philosophy
404286,What is one coin?,11,Currency and Economy
404287,What is the approx annual cost of living while...,11,Currency and Economy
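
As a possible follow-up to the labeling step above, the sketch below shows how `map` and `value_counts` behave on a small made-up frame. The `toy` frame, `toy_labels`, and the topic id 99 are hypothetical. A topic id that is missing from the dictionary maps to NaN, and `value_counts` leaves NaN out by default.

```python
import pandas as pd

# Hypothetical topic assignments; 99 is deliberately absent from the mapping.
toy = pd.DataFrame({"Topic": [4, 11, 11, 6, 99]})
toy_labels = {
    4: "Life and Philosophy",
    6: "Programming and Learning",
    11: "Currency and Economy",
}

toy["Topic Label"] = toy["Topic"].map(toy_labels)
counts = toy["Topic Label"].value_counts()
unmapped = int(toy["Topic Label"].isna().sum())

print(counts)
print("Unmapped rows:", unmapped)
```

Running the same two checks on `quora["Topic Label"]` would show how the questions are distributed across the 19 labels and confirm that no topic id was left without a name.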
