# Topic Modeling Assessment Project

Welcome to your Topic Modeling Assessment! For this project you will work with a dataset of over 400,000 Quora questions that have no labeled category, and attempt to find 20 categories to assign these questions to. The .csv file containing these questions can be found in the Topic-Modeling folder.

Remember you can always check the solutions notebook and video lecture for any questions.

#### Task: Import pandas and read in the quora_questions.csv file.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./quora_questions.csv')

In [4]:
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


# Preprocessing

#### Task: Use TF-IDF vectorization to create a document-term matrix. You may want to explore the max_df and min_df parameters.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [8]:
dtm = tfidf.fit_transform(df['Question'])

In [9]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
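
To make the effect of `max_df` and `min_df` concrete, here is a minimal pure-Python sketch of the document-frequency filter they describe. It is not scikit-learn's implementation: the helper `filter_vocabulary`, the whitespace tokenizer, and the toy questions are all assumptions made for illustration. A float `max_df` is treated as a fraction of the documents and an integer `min_df` as an absolute document count, which matches how the values 0.95 and 2 are used above.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency is in [min_df, max_df * n_docs].

    Hypothetical helper: a simplified stand-in for the vocabulary pruning
    that min_df / max_df request, using naive whitespace tokenization.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency,
        # not raw term frequency).
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)


# Toy corpus (made up for this example).
toy_docs = [
    "how do i learn python",
    "how do i learn guitar",
    "how do i cook rice",
    "how do i fix my python install",
]

kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.75)
print(kept)
```

In this toy corpus, "how", "do" and "i" appear in every question and are dropped by `max_df`, while words that occur in only one question are dropped by `min_df`. Only the mid-frequency terms survive, which is usually what you want before topic modeling.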

# Non-negative Matrix Factorization

#### TASK: Using scikit-learn, create an instance of NMF with 20 components (use random_state=42).

In [10]:
from sklearn.decomposition import NMF

In [13]:
nmf = NMF(n_components=20, random_state=42, verbose=2)

In [14]:
nmf.fit(dtm)

violation: 1.0
violation: 0.12496204212069398
violation: 0.053788459037527575
violation: 0.033902998286721824
violation: 0.01745966027279114
violation: 0.010106599199413712
violation: 0.006615690893273301
violation: 0.005037051317142189
violation: 0.004339758276120241
violation: 0.004127172807259321
violation: 0.004166391364315649
violation: 0.0043304879176114764
violation: 0.004564667799953912
violation: 0.004741146368135883
violation: 0.004782374019333202
violation: 0.004646903860605924
violation: 0.004409900954619259
violation: 0.0041049254459422846
violation: 0.003773611020795247
violation: 0.003430973599343361
violation: 0.003198109346484904
violation: 0.0033663724427022267
violation: 0.0036932292497283015
violation: 0.004320319932782158
violation: 0.0052324213816678225
violation: 0.0059307587998492445
violation: 0.005876666798024568
violation: 0.005054227643429197
violation: 0.003753240332348861
violation: 0.002888635886953083
violation: 0.0022604146035839013
violation: 0.0017522
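
As a rough illustration of what the fit is doing, NMF approximates the non-negative document-term matrix `V` by a product `W @ H` of two smaller non-negative matrices. The sketch below is not what scikit-learn runs (its default solver is coordinate descent); it swaps in the classic multiplicative-update rule using only NumPy. The function name `nmf_multiplicative`, the toy matrix and the iteration count are assumptions chosen for the example.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=42, eps=1e-10):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms).

    Hypothetical helper using multiplicative updates for the Frobenius
    objective; a teaching sketch, not scikit-learn's solver.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy document-term matrix with two obvious "topics":
# documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
V = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])

W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)
print("reconstruction error:", error)
```

In the notebook's terms, the rows of `H` play the role of `nmf.components_` (one weight per vocabulary term for each topic), and `W` plays the role of the matrix returned by `nmf.transform(dtm)` (one weight per topic for each question).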



#### TASK: Print out the 15 highest-weighted words for each of the 20 topics.

In [15]:
feature_names = tfidf.get_feature_names_out()  # look up the vocabulary once, not once per word
for i, topic in enumerate(nmf.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{i}")
    # argsort is ascending, so the last 15 indices are the highest-weighted terms
    print([feature_names[index] for index in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth',
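
The word lists above are ordered from lowest to highest weight because `argsort` sorts ascending. The following small sketch, using a made-up five-word vocabulary and made-up weights, shows the same indexing trick and how to reverse it so the strongest term comes first. `top_terms` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights (one weight per term).
vocab = np.array(["best", "book", "money", "online", "quora"])
topic_weights = np.array([0.9, 0.4, 0.0, 0.1, 0.7])


def top_terms(weights, feature_names, k):
    """Return the k highest-weighted terms, strongest first."""
    # argsort returns indices in ascending order of weight,
    # so the last k indices belong to the k largest weights.
    top_idx = weights.argsort()[-k:]
    # Reverse the slice so the highest-weighted term is listed first.
    return [str(feature_names[i]) for i in top_idx[::-1]]


print(top_terms(topic_weights, vocab, k=3))
```

Whether you keep the ascending order or reverse it is a matter of taste. The set of words is the same either way.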

#### TASK: Add a new column to the original Quora dataframe that assigns each question to one of the 20 topic categories.

In [17]:
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [21]:
topic_res = nmf.transform(dtm)

violation: 1.0
violation: 0.05887240054619892
violation: 0.0007845889142260214
violation: 1.242689080219805e-05
Converged at iteration 5


In [24]:
df['Topic'] = topic_res.argmax(axis=1)
df.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
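
To see what `argmax(axis=1)` does with the transformed matrix, consider this small sketch on a made-up document-topic matrix (five documents, three topics). The values are invented for illustration. The `np.bincount` call is an optional extra for checking how many documents land in each topic.

```python
import numpy as np

# Hypothetical output of a transform step: one row per document,
# one column per topic, larger values mean a stronger association.
doc_topic = np.array([
    [0.02, 0.10, 0.00],
    [0.30, 0.05, 0.01],
    [0.00, 0.00, 0.25],
    [0.04, 0.07, 0.06],
    [0.00, 0.00, 0.00],  # e.g. a document whose words were all filtered out
])

# For each row, pick the column index with the largest weight.
labels = doc_topic.argmax(axis=1)

# Count how many documents were assigned to each topic.
counts = np.bincount(labels, minlength=doc_topic.shape[1])

print(labels)
print(counts)
```

Note the last row: when every weight is zero, `argmax` returns the first index, so such a document is labeled topic 0 by default. If that matters for your analysis, it may be worth checking for all-zero rows before trusting the label.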


# Great job!