# Topic Modeling: Assessment of 400,000+ Quora Questions
The goal of this project is to discover 20 topics in a dataset extracted from the Quora website and to assign one of them to each question. The dataset contains 404,289 questions and is stored in the file quora_questions.csv.

### Loading pandas and the dataset, then inspecting the data:

In [1]:
import pandas as pd
quora = pd.read_csv('quora_questions.csv')
quora.head()

| | Question |
|---|---|
| 0 | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... |
| 2 | How can I increase the speed of my internet co... |
| 3 | Why am I mentally very lonely? How can I solve... |
| 4 | Which one dissolve in water quikly sugar, salt... |


In [6]:
quora['Question'][0]

'What is the step by step guide to invest in share market in india?'

In [7]:
quora.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404289 entries, 0 to 404288
Data columns (total 1 columns):
Question    404289 non-null object
dtypes: object(1)
memory usage: 3.1+ MB


In [8]:
quora.describe()

| | Question |
|---|---|
| count | 404289 |
| unique | 290456 |
| top | How do I improve my English speaking? |
| freq | 50 |


### Preprocessing: TF-IDF ...

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

Build the vocabulary, ignoring:
* Terms that appear in more than 95% of the questions
* Terms that appear in fewer than 2 questions
* Terms in scikit-learn's built-in English stop-word list
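
To see how these three filters interact, here is a minimal sketch on a hypothetical four-question corpus. The questions and the `toy_` names are invented for illustration and are not part of the Quora dataset; only the `TfidfVectorizer` settings match the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, made up for this illustration only.
toy_questions = [
    "What is the best laptop to buy?",
    "What is the best phone to buy?",
    "How can I buy the best books online?",
    "Which laptop is the best for movies?",
]

# Same settings as the notebook's vectorizer.
toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
toy_dtm = toy_tfidf.fit_transform(toy_questions)

# 'best' appears in 4/4 questions (> 95%), so max_df drops it.
# 'phone', 'books', 'online', 'movies' appear once, so min_df drops them.
# 'what', 'is', 'the', ... are removed as English stop words.
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)      # → ['buy', 'laptop']
print(toy_dtm.shape)  # → (4, 2)
```

On a corpus this small the filters are very aggressive; on the full dataset the same settings leave a vocabulary of several tens of thousands of terms.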

In [10]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [11]:
dtm = tfidf.fit_transform(quora['Question'])

In [12]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

### NMF
LDA could also be used, but because the dataset is large and NMF is typically faster to fit than LDA, I use NMF for this project.

In [16]:
from sklearn.decomposition import NMF

In [17]:
nmf_model = NMF(n_components=20)

In [18]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)
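
As a rough sketch of what the fit produces: NMF approximates a non-negative matrix `X` (documents by terms) as the product of two smaller non-negative matrices, `W` (documents by topics) and `H` (topics by terms). The toy matrix below is invented for illustration. Setting `random_state` is an assumption added here so that repeated runs give the same factors; the model in this notebook was fitted with `random_state=None`, so its topic numbering may differ between runs.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term" matrix: 6 documents x 4 terms.
# Documents 0-2 mostly use terms 0-1; documents 3-5 mostly use terms 2-3.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 0.0, 4.0, 4.0],
])

# random_state is fixed here only to make the sketch repeatable.
toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(X)  # document-topic weights, shape (6, 2)
H = toy_nmf.components_       # topic-term weights, shape (2, 4)

print(W.shape, H.shape)
print(np.round(W @ H, 1))     # approximate reconstruction of X
```

In the notebook, `nmf_model.components_` plays the role of `H` and the result of `nmf_model.transform(dtm)` plays the role of `W`.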

### Displaying Topics

In [19]:
len(tfidf.get_feature_names())

38669

In [20]:
len(nmf_model.components_)

20

In [21]:
nmf_model.components_

array([[0.00000000e+00, 5.65702825e-02, 5.42732724e-05, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.23992480e-03, 0.00000000e+00, 3.45537311e-05, ...,
        0.00000000e+00, 3.65329345e-03, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [4.25346962e-04, 5.13682493e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.67754754e-04, 0.00000000e+00, 6.53185281e-06, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [7.50732027e-05, 4.33661607e-04, 5.80340624e-05, ...,
        1.63283966e-03, 0.00000000e+00, 1.63283966e-03]])

In [24]:
single_topic = nmf_model.components_[0]
single_topic.argsort()[-10:]

array([26057,  5976, 19847, 22924, 37520,   482,  5283,  5268, 22925,
        4632], dtype=int64)

In [25]:
for index in single_topic.argsort()[-10:]:
    print(tfidf.get_feature_names()[index])

phone
buy
laptop
movie
ways
2016
books
book
movies
best
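
The indexing trick above relies on `argsort()` returning indices in ascending order of value, so the last `k` indices point at the `k` largest weights. A small sketch with made-up weights and terms (both arrays are hypothetical):

```python
import numpy as np

# Hypothetical topic weights and the matching vocabulary terms.
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
terms = np.array(['phone', 'book', 'ways', 'movie', 'best'])

order = weights.argsort()   # indices from smallest to largest weight
top3 = order[-3:]           # the three largest, still in ascending order
print(order)                # → [2 0 3 1 4]
print(terms[top3])          # → ['movie' 'book' 'best']

# Reverse the slice to list the strongest term first.
top3_desc = top3[::-1]
print(terms[top3_desc])     # → ['best' 'book' 'movie']
```

This is why the most heavily weighted word ("best" for topic 0) is printed last in the output above.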


Let's view the top 15 words for each of the 20 topics:

In [26]:
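# Look up the vocabulary once instead of on every loop iteration.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
feature_names = tfidf.get_feature_names()
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')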

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 

### Attaching Discovered Topic Labels to Original Questions

In [27]:
topic_results = nmf_model.transform(dtm)

In [28]:
topic_results.shape

(404289, 20)

Finding the topic label for the first question:

In [30]:
topic_results[0].round(3)

array([0.   , 0.   , 0.   , 0.   , 0.   , 0.026, 0.   , 0.   , 0.   ,
       0.   , 0.   , 0.001, 0.   , 0.   , 0.   , 0.   , 0.   , 0.001,
       0.   , 0.   ])

In [31]:
topic_results[0].argmax()

5

The model assigned topic 5 to the first question. The top words for topic 5 suggest that this topic relates to the job market or business in India and China.
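
The assignment rule is simply "pick the topic with the largest weight". The sketch below reuses the rounded weights printed above for the first question and then applies the same rule row by row to a small invented matrix (`toy_results` is hypothetical):

```python
import numpy as np

# Rounded topic weights for the first question, copied from the output above.
first_row = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.026, 0.0, 0.0, 0.0,
                      0.0, 0.0, 0.001, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001,
                      0.0, 0.0])
first_topic = first_row.argmax()
print(first_topic)  # → 5

# Hypothetical document-topic matrix: 3 questions x 3 topics.
toy_results = np.array([
    [0.0, 0.3, 0.1],
    [0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0],  # a question with no weight on any topic
])
toy_labels = toy_results.argmax(axis=1)  # one label per row
print(toy_labels)   # → [1 0 0]
```

Note that `argmax` returns 0 for an all-zero row, so a question whose terms were all filtered out of the vocabulary would be labelled topic 0 by default rather than left unassigned.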

In [33]:
quora['Question'][0]

'What is the step by step guide to invest in share market in india?'

### Combining with Original Data

In [34]:
quora['Topic'] = topic_results.argmax(axis=1)

In [35]:
quora.head()

| | Question | Topic |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 5 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 16 |
| 2 | How can I increase the speed of my internet co... | 17 |
| 3 | Why am I mentally very lonely? How can I solve... | 11 |
| 4 | Which one dissolve in water quikly sugar, salt... | 14 |
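
Once the labels are attached, a natural sanity check is to count how many questions land in each topic and to keep the winning weight next to the label. The sketch below shows one possible way to do this on an invented four-row frame; the `toy_` names and the `TopicWeight` column are assumptions for illustration and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the questions and their topic weights.
toy_quora = pd.DataFrame({'Question': ['q0', 'q1', 'q2', 'q3']})
toy_topic_results = np.array([
    [0.1, 0.0, 0.4],
    [0.3, 0.2, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.6, 0.1],
])

# Same assignment as in the notebook: the topic with the largest weight.
toy_quora['Topic'] = toy_topic_results.argmax(axis=1)
# Assumed extra column: the weight of the winning topic.
toy_quora['TopicWeight'] = toy_topic_results.max(axis=1)

# Number of questions per topic, ordered by topic id.
topic_counts = toy_quora['Topic'].value_counts().sort_index()
print(toy_quora)
print(topic_counts.to_dict())  # → {0: 1, 1: 1, 2: 2}
```

A very low winning weight, such as the 0.026 seen for the first question, may indicate that the question does not fit any of the 20 topics well.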
