### Delivery 4: Natural Language Processing - Topic Modelling
### Marcell Veiner, Balin Lin
Quora Questions Answers - Topic Modelling <br>
Perform topic modelling on Quora Questions Answers using NLTK and other required Python packages, and provide the following information for Quora:

In [1]:
# Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Read the quora questions file
quora = pd.read_csv('../input/quora-question/quora_questions.csv')

- How many questions are asked?

In [2]:
len(quora)

404289

- What is the dimension of the document-term matrix (DTM)?

In [3]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(quora['Question'])
print(dtm.shape) # Document term matrix

(404289, 38669)


- How many topics are there?

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

# Options to try with our LDA
# Beware it will try *all* of the combinations, so it'll take ages
search_params = {
  'n_components': [5, 10, 20, 30, 50, 75, 100]
}

# Set up LDA with the options we'll keep static
model = LatentDirichletAllocation(learning_method='online')

# Try all of the options
gridsearch = GridSearchCV(model, param_grid=search_params, n_jobs=-1, verbose=1)
gridsearch.fit(dtm)

# What did we find?
print("Best Model's Params: ", gridsearch.best_params_)
print("Best Log Likelihood Score: ", gridsearch.best_score_)

Fitting 5 folds for each of 7 candidates, totalling 35 fits
Best Model's Params:  {'n_components': 5}
Best Log Likelihood Score:  -3704503.1198356836
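When no `scoring` argument is given, `GridSearchCV` falls back on the estimator's own `score` method, which for `LatentDirichletAllocation` is an approximate log-likelihood evaluated on each held-out fold. The sketch below, run on a made-up toy corpus (`toy_docs` and the small parameter grid are assumptions for illustration), shows how to inspect the mean score of every candidate rather than only the winner:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "best way to learn python", "learn english online",
    "make money online", "best book about life",
    "good books to read", "lose weight fast",
    "increase weight in college", "donald trump election 2016",
    "engineering work in india",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

# Small grid and 3 folds so the sketch runs in seconds
search = GridSearchCV(
    LatentDirichletAllocation(learning_method="online", random_state=0),
    param_grid={"n_components": [2, 3]},
    cv=3,
)
search.fit(toy_dtm)

# One mean held-out score (approximate log-likelihood) per candidate
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, mean_score)
```

Looking at the full `cv_results_` table makes it easier to judge how much the candidates actually differ before committing to the best one.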


- What are the 10 most common words for each topic?

In [5]:
number_components = 5
LDA = LatentDirichletAllocation(n_components = number_components,random_state = 42)
LDA.fit(dtm)

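# Look up the vocabulary once; get_feature_names() was removed in scikit-learn 1.2
feature_names = cv.get_feature_names_out()

for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 10 WORDS FOR TOPIC # {index}')
    # argsort is ascending, so the last word listed has the highest weight
    print([feature_names[i] for i in topic.argsort()[-10:]])
    print('\n')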

THE TOP 10 WORDS FOR TOPIC # 0
['2016', 'work', 'donald', 'phone', 'engineering', 'good', 'trump', 'does', 'india', 'best']


THE TOP 10 WORDS FOR TOPIC # 1
['1000', 'notes', '500', 'online', 'english', 'make', 'learn', 'money', 'way', 'best']


THE TOP 10 WORDS FOR TOPIC # 2
['book', 'like', 'books', 'sex', 'good', 'did', 'best', 'time', 'life', 'does']


THE TOP 10 WORDS FOR TOPIC # 3
['things', 'best', 'questions', 'world', 'new', 'like', 'know', 'does', 'quora', 'people']


THE TOP 10 WORDS FOR TOPIC # 4
['car', 'india', 'college', 'increase', 'love', 'lose', 'account', 'difference', 'weight', 'does']
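Each list above is in ascending order of weight, so the last word is the strongest for its topic. A small helper can return the words strongest-first instead. This is a sketch under assumptions: `top_words`, `demo_components` and `demo_vocab` are hypothetical names, and the tiny matrix stands in for `LDA.components_`:

```python
import numpy as np

def top_words(components, feature_names, n_top=10):
    """Return the n_top highest-weighted words per topic, strongest first."""
    feature_names = np.asarray(feature_names)
    # Sort each row ascending, reverse it, then keep the first n_top columns
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [feature_names[row].tolist() for row in top_idx]

# Made-up topic-word weights: 2 topics over a 4-word vocabulary
demo_components = np.array([
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.1, 3.0],
])
demo_vocab = ["book", "india", "trump", "weight"]

print(top_words(demo_components, demo_vocab, n_top=2))
# [['india', 'trump'], ['book', 'weight']]
```

With the fitted model, the call would presumably look like `top_words(LDA.components_, cv.get_feature_names_out())`.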




- Map each question to the right topic.

In [6]:
# Linking the topics and documents.
topic_results = LDA.transform(dtm)
topic_rate = [0] * number_components

for i in range(len(topic_results)):
    topic_rate[topic_results[i].argmax()] += 1
    # print("topic_results[{0}].argmax():".format(i), topic_results[i].argmax()) # The first document belongs to topic 1.
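The loop above assigns each question to its highest-probability topic and tallies the assignments. The same mapping can be done without a Python loop. The sketch below uses a small made-up matrix (`demo_topic_results` is an assumption standing in for the output of `LDA.transform(dtm)`):

```python
import numpy as np

# Made-up document-topic probabilities: 4 documents, 3 topics
demo_topic_results = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Dominant topic per document (row-wise argmax)
dominant_topic = demo_topic_results.argmax(axis=1)

# Number of documents per topic; minlength keeps topics with zero documents
topic_counts = np.bincount(dominant_topic, minlength=demo_topic_results.shape[1])

# Share of documents per topic
topic_share = topic_counts / len(demo_topic_results)

print(dominant_topic)  # [0 1 0 2]
print(topic_counts)    # [2 1 1]
print(topic_share)     # [0.5  0.25 0.25]
```

Keeping the per-document assignment (for example as a new column on the `quora` DataFrame) also makes it possible to look up which questions fall under each topic, not only how many.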

- Which topic are people most interested in?

In [7]:
print("Topic {0} with value {1}.".format(topic_rate.index(max(topic_rate)), max(topic_rate) / len(topic_results)))

Topic 0 with value 0.22688967545493446.


- Which is the least interesting topic for people?

In [8]:
print("Topic {0} with value {1}.".format(topic_rate.index(min(topic_rate)), min(topic_rate) / len(topic_results)))

Topic 1 with value 0.16820640680305424.
