# NLP - Topic Modeling Assignment



## Dataset 
- For this assignment you will be working with a dataset of over 400,000 Quora questions that have no labeled categories.

## Main Objective 
- Your goal is to find 20 categories and assign each question in the CSV file to one of them.


#### Task: Import pandas and read in the quora_questions.csv file.

In [45]:
import pandas as pd
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [46]:
# Loading dataset 
npr = pd.read_csv('Quora Questions.csv')

In [47]:
npr.head()

| | Question |
|---|---|
| 0 | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... |
| 2 | How can I increase the speed of my internet co... |
| 3 | Why am I mentally very lonely? How can I solve... |
| 4 | Which one dissolve in water quikly sugar, salt... |


In [48]:

npr.shape

(404289, 1)

In [49]:
# Creating copies of the dataset to use it for different models

lda_npr = npr.copy()
nnmf_npr = npr.copy()

In [50]:
# To print the Question at index 1

npr['Question'][1]

'What is the story of Kohinoor (Koh-i-Noor) Diamond?'

# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. 
Note: You may want to explore the max_df and min_df parameters.
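To see what `max_df` and `min_df` do, here is a minimal sketch on a made-up four-document corpus (the documents and variable names are assumptions for illustration, not part of the Quora data). An integer `min_df` is an absolute document count, while a float `max_df` is a fraction of the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the Quora dataset.
docs = [
    "python data science",
    "python machine learning",
    "python data analysis",
    "python web scraping",
]

# min_df=2    -> keep only terms that appear in at least 2 documents
# max_df=0.95 -> drop terms that appear in more than 95% of the documents
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
vectorizer.fit(docs)

# "python" is in every document (100% > 95%), so it is dropped.
# Every other term except "data" appears in only one document, so it is dropped too.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

On the full dataset the same two thresholds remove one-off terms (often typos) and terms so common that they carry little topical signal.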

In [51]:
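# LDA models raw term counts, so this section uses CountVectorizer;
# TF-IDF weights are used for the NMF model further below.
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')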

In [52]:
dtm = cv.fit_transform(npr['Question'])

In [53]:
# Note:
    # number of questions (documents): 404289
    # number of terms: 38669 --> terms that appear in at least 2 questions and in at most 95% of them
dtm

<404289x38669 sparse matrix of type '<class 'numpy.int64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

# LDA - Latent Dirichlet Allocation

#### TASK: Using Scikit-Learn create an instance of LDA with 20 expected components. 
Note: Use random_state = 42

In [54]:
# Creating an LDA model with 20 topics
 
LDA = LatentDirichletAllocation(n_components=20,random_state=42)

In [55]:
# Note: This can take a while, we're dealing with a large amount of documents!

LDA.fit(dtm)

LatentDirichletAllocation(n_components=20, random_state=42)
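As a rough illustration of what the fitted model provides, the sketch below fits LDA on a tiny made-up corpus (the documents and the `toy_` names are assumptions for illustration). `components_` holds one row of word weights per topic, and `transform` returns one row per document with that document's topic proportions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "learn python programming online",
    "python programming course",
    "invest money in stocks",
    "money and stocks market",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)  # raw term counts, as LDA expects

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_counts)

# Topic-word weights: one row per topic, one column per vocabulary term.
print(toy_lda.components_.shape)

# Document-topic proportions: one row per document, each row sums to 1.
toy_doc_topic = toy_lda.transform(toy_counts)
print(toy_doc_topic.shape)
print(np.allclose(toy_doc_topic.sum(axis=1), 1.0))
```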

In [56]:
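# Printing the top 15 words for each of the 20 LDA topics

# The LDA model was fit on the CountVectorizer matrix, so the vocabulary comes from cv.
# On scikit-learn < 1.0, use cv.get_feature_names() instead.
feature_names = cv.get_feature_names_out()

for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')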

THE TOP 15 WORDS FOR TOPIC #0
['sydney', 'development', 'code', 'open', 'services', 'media', 'google', 'good', 'company', 'india', 'social', 'career', 'history', 'service', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['economy', 'process', 'making', 'government', 'rupee', 'india', 'word', 'money', 'rs', 'english', 'black', 'indian', '1000', 'notes', '500']


THE TOP 15 WORDS FOR TOPIC #2
['current', 'ones', 'alcohol', 'center', 'legal', 'home', 'state', 'compare', 'man', 'purpose', 'good', 'india', 'cost', 'average', 'does']


THE TOP 15 WORDS FOR TOPIC #3
['answers', 'year', 'facts', 'apple', 'mind', 'series', 'looking', 'interesting', 'worth', 'big', 'exist', 'tv', 'does', 'iphone', 'new']


THE TOP 15 WORDS FOR TOPIC #4
['australia', 'overcome', 'usa', 'students', 'student', 'visa', 'canada', 'mba', 'apply', 'jobs', 'college', 'differences', 'india', 'car', 'job']


THE TOP 15 WORDS FOR TOPIC #5
['different', 'russia', 'win', 'relationship', 'culture', 'countries', 'pakistan', 'china', 'like', 'math', 'india', 'war'
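The top-word lookup used for the lists above can be wrapped in a small helper so the same logic serves both LDA and NMF. This is a sketch only: `top_words` and the toy arrays are hypothetical names introduced here, standing in for a fitted model's `components_` and the vectorizer's vocabulary.

```python
import numpy as np

def top_words(components, feature_names, n_top=15):
    """Return, per topic row, the n_top highest-weighted words (lowest weight first)."""
    feature_names = np.asarray(feature_names)
    return [
        feature_names[row.argsort()[-n_top:]].tolist()
        for row in np.asarray(components)
    ]

# Hypothetical 2-topic, 4-word weight matrix standing in for model.components_.
toy_components = np.array([
    [0.1, 3.0, 0.5, 2.0],
    [4.0, 0.2, 1.0, 0.3],
])
toy_vocab = ["apple", "bank", "car", "loan"]

toy_top = top_words(toy_components, toy_vocab, n_top=2)
print(toy_top)  # [['loan', 'bank'], ['car', 'apple']]
```

As in the notebook's loop, `argsort()` sorts ascending, so the most heavily weighted word is the last item in each list.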

# Non-negative Matrix Factorization
#### TASK: Using Scikit-Learn create an instance of NMF with 20 expected components.
Note: Use random_state = 42

In [57]:
# Importing packages 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [58]:
# working with a copy from original dataset 

npr = nnmf_npr

In [59]:
npr.head()

| | Question |
|---|---|
| 0 | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... |
| 2 | How can I increase the speed of my internet co... |
| 3 | Why am I mentally very lonely? How can I solve... |
| 4 | Which one dissolve in water quikly sugar, salt... |


In [60]:


tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [61]:

dtm = tfidf.fit_transform(npr['Question'])

In [62]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

In [63]:
nmf_model = NMF(n_components=20,random_state=42)

In [64]:
nmf_model.fit(dtm)



NMF(n_components=20, random_state=42)
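NMF approximates the TF-IDF matrix as the product of two non-negative matrices: a document-topic matrix `W` and a topic-word matrix `H` (exposed as `components_`). The sketch below shows the shapes on a made-up corpus; the documents and the `toy` names are assumptions for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "how do I learn python programming",
    "best way to learn programming",
    "how to invest money in stocks",
    "best stocks to invest money",
]

tfidf_toy = TfidfVectorizer()
X = tfidf_toy.fit_transform(toy_docs)    # documents x terms, TF-IDF weights

nmf_toy = NMF(n_components=2, random_state=42)
W = nmf_toy.fit_transform(X)             # documents x topics
H = nmf_toy.components_                  # topics x terms

print(X.shape, W.shape, H.shape)
print(W.min() >= 0, H.min() >= 0)        # both factors are non-negative
```

Unlike the LDA output, the rows of `W` are not normalized to sum to 1; they are coefficients, not probabilities.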

#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [65]:
# Printing the top 15 words for each of the 20 NMF topics

# On scikit-learn < 1.0, use tfidf.get_feature_names() instead.
feature_names = tfidf.get_feature_names_out()

for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')



THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 

#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories.

I did this part twice: once with LDA and once with Non-negative Matrix Factorization (NMF).
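Both runs use the same labeling step: take the document-topic matrix returned by `transform`, pick the column with the largest value in each row, and map that index to a label. A minimal sketch with a made-up 3-document, 3-topic matrix (all `toy_` names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix standing in for model.transform(dtm).
toy_results = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
])
toy_df = pd.DataFrame({"Question": ["q0", "q1", "q2"]})

# argmax(axis=1) returns, for each row, the index of the highest-scoring topic.
toy_df["Topic"] = toy_results.argmax(axis=1)

# Map the 0-based topic index to a 1-based placeholder label.
toy_df["Topic Label"] = toy_df["Topic"].map(lambda i: f"topic_{i + 1}")
print(toy_df)
```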

# **NOW WITH LDA**

In [73]:
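# Caution: at this point dtm holds the TF-IDF matrix built for NMF, while the LDA model
# was fit on CountVectorizer counts; cv.transform(npr['Question']) is the matching input.
topic_results = LDA.transform(dtm)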

In [74]:
# Creating a new column holding, for each question, the index of its most likely LDA topic
npr['Topic'] = topic_results.argmax(axis=1)

In [77]:
# Placeholder labels: the appropriate names for the LDA topics still need to be chosen

topics_dict = {i: f'topic_{i + 1}' for i in range(20)}
npr["Topic Label"] = npr["Topic"].map(topics_dict)

In [78]:
npr.head(10)

| | Question | Topic | Topic Label |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 16 | topic_17 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 17 | topic_18 |
| 2 | How can I increase the speed of my internet co... | 8 | topic_9 |
| 3 | Why am I mentally very lonely? How can I solve... | 19 | topic_20 |
| 4 | Which one dissolve in water quikly sugar, salt... | 17 | topic_18 |
| 5 | Astrology: I am a Capricorn Sun Cap moon and c... | 2 | topic_3 |
| 6 | Should I buy tiago? | 17 | topic_18 |
| 7 | How can I be a good geologist? | 1 | topic_2 |
| 8 | When do you use シ instead of し? | 11 | topic_12 |
| 9 | Motorola (company): Can I hack my Charter Moto... | 7 | topic_8 |


# **NOW WITH NON-NEGATIVE MATRIX FACTORIZATION**

In [79]:
topic_results = nmf_model.transform(dtm)

In [80]:
npr['Topic'] = topic_results.argmax(axis=1)

In [81]:


topics_dict = {i: f'topic_{i + 1}' for i in range(20)}
npr["Topic Label"] = npr["Topic"].map(topics_dict)

In [82]:
npr.head(10)

| | Question | Topic | Topic Label |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 5 | topic_6 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 16 | topic_17 |
| 2 | How can I increase the speed of my internet co... | 17 | topic_18 |
| 3 | Why am I mentally very lonely? How can I solve... | 11 | topic_12 |
| 4 | Which one dissolve in water quikly sugar, salt... | 14 | topic_15 |
| 5 | Astrology: I am a Capricorn Sun Cap moon and c... | 1 | topic_2 |
| 6 | Should I buy tiago? | 0 | topic_1 |
| 7 | How can I be a good geologist? | 10 | topic_11 |
| 8 | When do you use シ instead of し? | 19 | topic_20 |
| 9 | Motorola (company): Can I hack my Charter Moto... | 17 | topic_18 |
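Because both runs write to the same `Topic` and `Topic Label` columns, the NMF assignment replaces the LDA one. One possible way to keep both and compare them is to store each in its own column and cross-tabulate. This is a sketch with made-up assignments; the column names `LDA Topic` and `NMF Topic` are assumptions, not names used in the notebook.

```python
import pandas as pd

# Hypothetical topic assignments for five questions from each model.
compare_df = pd.DataFrame({
    "LDA Topic": [0, 0, 1, 1, 2],
    "NMF Topic": [1, 1, 0, 2, 2],
})

# Rows: LDA topic index, columns: NMF topic index, cells: number of questions.
overlap = pd.crosstab(compare_df["LDA Topic"], compare_df["NMF Topic"])
print(overlap)
```

Topic indices from the two models are not aligned with each other, so a large cell in the table only indicates that an LDA topic and an NMF topic cover many of the same questions.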
