# Topic Modeling Assessment Project

For this project you will work with a dataset of over 400,000 Quora questions that have no labeled category, and attempt to find 20 categories to assign these questions to. The .csv file of these text questions can be found underneath the Topic-Modeling folder.



#### Task: Import pandas and read in the quora_questions.csv file.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('quora_questions.csv')

In [3]:
data.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [8]:
dtm = tfidf.fit_transform(data['Question'])

In [9]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

# Non-negative Matrix Factorization

#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)

In [10]:
from sklearn.decomposition import NMF

In [11]:
nmf_model = NMF(n_components=20,random_state=42)

In [12]:
# This can take a while; we're dealing with a large number of documents!
nmf_model.fit(dtm)
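
NMF approximates the document-term matrix as the product of two non-negative matrices: one holding a weight per document and topic, the other a weight per topic and term. The sketch below shows those shapes on a tiny corpus that fits in well under a second. The corpus and every `toy_*` name are assumptions made for illustration, and `max_iter=500` is only there to give the solver room to converge on such a small input.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two loose themes
toy_docs = [
    "best movie to watch",
    "best book to read",
    "earn money online",
    "make money online fast",
]

toy_tfidf = TfidfVectorizer()
toy_dtm = toy_tfidf.fit_transform(toy_docs)       # documents x terms

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
doc_topic = toy_nmf.fit_transform(toy_dtm)        # documents x topics
topic_term = toy_nmf.components_                  # topics x terms

print(toy_dtm.shape)
print(doc_topic.shape)
print(topic_term.shape)
```

In the notebook, `nmf_model.components_` plays the role of `topic_term`, and the document-topic matrix is obtained later with `nmf_model.transform(dtm)`.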



In [13]:
len(tfidf.get_feature_names_out())

38669

In [14]:
import random

In [17]:
for i in range(10):
    # randint includes both endpoints, so the last valid index is 38668
    random_word_id = random.randint(0,38668)
    print(tfidf.get_feature_names_out()[random_word_id])

zootopia
paragliding
hepatic
dmw
dicuss
version
gauss
favoritism
clerk
blogging


In [18]:
for i in range(10):
    # randint includes both endpoints, so the last valid index is 38668
    random_word_id = random.randint(0,38668)
    print(tfidf.get_feature_names_out()[random_word_id])

syndrome
khaled
katchatheevu
referral
recommend
midway
avoiding
ghanaian
conciousness
supporting
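
Because `random.randint(a, b)` includes both endpoints, sampling indices with it requires an upper bound of `len(vocabulary) - 1`. A minimal sketch of two alternatives that derive the bound from the data instead of hard-coding it follows. The five-word `vocabulary` list is a hypothetical stand-in for `tfidf.get_feature_names_out()`, and the seed is only there to make the sketch repeatable.

```python
import random

# Hypothetical stand-in for the real vocabulary array
vocabulary = ["best", "book", "movie", "python", "quora"]

random.seed(0)  # only so repeated runs print the same words

# randrange excludes the upper bound, so it can never index past the end
for _ in range(3):
    word_id = random.randrange(len(vocabulary))
    print(vocabulary[word_id])

# random.sample draws several distinct words in one call
print(random.sample(vocabulary, k=3))
```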


In [19]:
len(nmf_model.components_)

20

In [20]:
nmf_model.components_

array([[0.00000000e+00, 5.65092701e-02, 5.42127292e-05, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.24553666e-03, 0.00000000e+00, 3.47094202e-05, ...,
        0.00000000e+00, 3.66959931e-03, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [4.07751038e-04, 4.92508600e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [7.84859441e-05, 4.54677374e-04, 6.05888945e-05, ...,
        1.70508633e-03, 0.00000000e+00, 1.70508633e-03],
       [3.44985878e-04, 0.00000000e+00, 4.81894019e-06, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [21]:
len(nmf_model.components_[0])

38669

In [22]:
single_topic = nmf_model.components_[0]

In [23]:
# Returns the indices that would sort this array.
single_topic.argsort()

array([    0, 22613, 22611, ...,  5268, 22925,  4632], dtype=int64)

In [24]:
# Word least representative of this topic (index 0 comes first in the argsort above)
single_topic[0]

0.0

In [26]:
# Word most representative of this topic
single_topic[4632]

8.27883883193824

In [27]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

array([26057,  5976, 19847, 22924, 37520,   482,  5283,  5268, 22925,
        4632], dtype=int64)

In [29]:
top_word_indices = single_topic.argsort()[-10:]

In [30]:
for index in top_word_indices:
    print(tfidf.get_feature_names_out()[index])

phone
buy
laptop
movie
ways
2016
books
book
movies
best
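
Note that `argsort()` orders indices from the smallest weight to the largest, so a slice such as `[-10:]` lists the top words with the strongest one last. The sketch below shows the same steps on a five-word example and reverses the slice to put the strongest word first. The vocabulary and weights are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and topic weights
feature_names = np.array(["apple", "best", "car", "movie", "zoo"])
topic_weights = np.array([0.1, 2.5, 0.0, 1.7, 0.3])

ascending = topic_weights.argsort()   # indices from smallest to largest weight
top_3 = ascending[-3:][::-1]          # last three, reversed: largest first

print(ascending.tolist())             # [2, 0, 4, 3, 1]
print(feature_names[top_3].tolist())  # ['best', 'movie', 'zoo']
```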


#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [31]:
# Fetch the vocabulary once instead of on every lookup
feature_names = tfidf.get_feature_names_out()
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 

#### TASK: Add a new column to the original Quora DataFrame that labels each question with one of the 20 topic categories.

In [32]:
data.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [33]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

In [34]:
dtm.shape

(404289, 38669)

In [35]:
len(data)

404289

In [36]:
topic_results = nmf_model.transform(dtm)

In [37]:
topic_results[0].argmax()

5

In [39]:
data['Topic'] = topic_results.argmax(axis=1)

In [40]:
data.head(10)

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
5,Astrology: I am a Capricorn Sun Cap moon and c...,1
6,Should I buy tiago?,0
7,How can I be a good geologist?,10
8,When do you use シ instead of し?,19
9,Motorola (company): Can I hack my Charter Moto...,17
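
The assignment step takes, for each row of the document-topic matrix, the column with the largest weight. A minimal sketch on made-up weights (the array `toy_topic_results` is hypothetical) shows this, along with one caveat: a row with no weight on any topic still receives topic 0 from `argmax`, so it may be worth keeping the winning weight alongside the label.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents, 4 topics
toy_topic_results = np.array([
    [0.02, 0.40, 0.01, 0.00],
    [0.30, 0.05, 0.10, 0.00],
    [0.00, 0.00, 0.00, 0.00],   # no weight on any topic
])

assigned = toy_topic_results.argmax(axis=1)  # index of the largest weight per row
strength = toy_topic_results.max(axis=1)     # the largest weight itself

print(assigned.tolist())   # [1, 0, 0]
print(strength.tolist())   # [0.4, 0.3, 0.0]
```

The third document is labeled topic 0 only because `argmax` returns the first index on ties; its strength of 0.0 shows the label carries no information.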
