# Analyzing Quora Questions 

Here, I work with a dataset of over 400,000 Quora questions that have no labeled category.<br>I attempt to find 15 categories to assign these questions to.

#### Importing pandas and reading in the quora_questions.csv file.

In [2]:
import pandas as pd

In [3]:
qq = pd.read_csv('quora_questions.csv')

In [4]:
qq.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


# Preprocessing

#### Use TF-IDF Vectorization to create a vectorized document term matrix. 
<br>Here, we keep only words that appear in at most 95 percent of the questions (`max_df=0.95`) and in at least 2 questions (`min_df=2`), and we drop common English stop words.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [9]:
dtm = tfidf.fit_transform(qq['Question'])

In [10]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
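
To see what the `max_df` and `min_df` settings do, here is a minimal sketch on a hypothetical four-question corpus (`toy_questions` is made up for illustration and is not part of the Quora data). Words that occur in only one question are dropped by `min_df=2`, and stop words such as "how" and "what" are removed before counting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only to illustrate the filtering
toy_questions = [
    "How do I learn Python quickly?",
    "How do I learn guitar quickly?",
    "What is the best Python book?",
    "What is the best guitar brand?",
]

# Same settings as the vectorizer used on the Quora questions
toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
toy_dtm = toy_tfidf.fit_transform(toy_questions)

# vocabulary_ maps each kept word to its column index in the matrix
vocab = sorted(toy_tfidf.vocabulary_)
print(vocab)          # 'book' and 'brand' are missing: each appears in only one question
print(toy_dtm.shape)  # (number of questions, number of kept words)
```

The same logic explains the shape of `dtm` above: one row per question and one column per word that survived the filters.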

# Non-negative Matrix Factorization

Using Scikit-Learn, create an instance of NMF with 15 expected components.<br>Due to the size of the dataset, I chose Non-negative Matrix Factorization over LDA because it runs faster.

In [11]:
from sklearn.decomposition import NMF

In [35]:
nmf_model = NMF(n_components=15, random_state=42)

In [36]:
nmf_model.fit(dtm)



NMF(n_components=15, random_state=42)
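
As a rough illustration of what the fitted model contains, the sketch below runs NMF on a hypothetical 4x4 matrix (`toy_X` stands in for a tiny document term matrix and is not derived from the Quora data). NMF approximates the input as the product of two non-negative matrices: `W` (documents by topics) and `H` (topics by words, exposed as `components_`).

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document term matrix: rows 0-1 use only the first two words,
# rows 2-3 use only the last two, so two topics are enough to describe it
toy_X = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 6.0, 2.0],
])

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(toy_X)   # document-by-topic weights
H = toy_nmf.components_            # topic-by-word weights

print(W.shape, H.shape)
print(np.round(W @ H, 2))          # approximate reconstruction of toy_X
print(W.argmax(axis=1))            # strongest topic per document
```

In the notebook, `nmf_model.components_` plays the role of `H`, with 15 rows (topics) and one column per word in the TF-IDF vocabulary.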

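#### Printing the 15 highest-weighted words for each of the 15 topics.

Each list is sorted in ascending order of weight, so the last word in a list is the strongest one for that topic.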

In [37]:
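# Look up the vocabulary once; get_feature_names() was removed in scikit-learn 1.2,
# get_feature_names_out() is its replacement (requires scikit-learn >= 1.0)
feature_names = tfidf.get_feature_names_out().tolist()
for i, topic in enumerate(nmf_model.components_):
    print(f"The TOP 15 words for Topic #{i}")
    # argsort is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[index] for index in topic.argsort()[-15:]])
    print('\n')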

The TOP 15 words for Topic #0
['place', 'visit', 'places', 'phone', 'time', 'ways', 'buy', 'laptop', 'movie', '2016', 'books', 'book', 'movies', 'way', 'best']


The TOP 15 words for Topic #1
['recruit', 'differ', 'looking', 'use', 'sex', 'exist', 'time', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


The TOP 15 words for Topic #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


The TOP 15 words for Topic #3
['facebook', 'friends', 'black', 'internet', 'free', 'easiest', 'home', 'easy', 'youtube', 'ways', 'way', 'earn', 'online', 'make', 'money']


The TOP 15 words for Topic #4
['earth', 'did', 'death', 'changed', 'day', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


The TOP 15 words for Topic #5
['minister', 'company', 'engineering', 'china', 'olympics', 'available', 'business', 'job', 'country', 'spotify

#### Adding a new column to the original Quora dataframe that labels each question with its highest-weighted topic.

In [18]:
topic_results = nmf_model.transform(dtm)

In [19]:
qq['Topic'] = topic_results.argmax(axis=1)

In [20]:
qq.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
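
The labeling step can be sketched on a small scale as follows. The weights in `toy_weights` are invented for illustration (they stand in for the output of `nmf_model.transform`), and `toy_df` is a hypothetical dataframe. Taking `argmax` along `axis=1` picks, for each question, the index of the topic with the largest weight.

```python
import numpy as np
import pandas as pd

# Hypothetical topic weights: one row per question, one column per topic
toy_weights = np.array([
    [0.02, 0.40, 0.01],
    [0.30, 0.05, 0.00],
    [0.00, 0.10, 0.25],
    [0.15, 0.01, 0.03],
])

toy_df = pd.DataFrame({'Question': ['q0', 'q1', 'q2', 'q3']})

# Index of the largest weight in each row becomes the topic label
toy_df['Topic'] = toy_weights.argmax(axis=1)

# How many questions landed in each topic
topic_counts = toy_df['Topic'].value_counts().sort_index()
print(toy_df)
print(topic_counts)
```

Running `qq['Topic'].value_counts()` on the real dataframe would give the same kind of summary and is a quick way to check how evenly the questions are spread across topics.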
