# Introduction

This project examines the sentiment of Quora questions, looks at a few notable words and the sentiment associated with them,
and then applies topic modeling to identify topics both in the full set of Quora questions
and in the subset of questions with strongly negative sentiment. The goal is to see what kinds of things people tend to ask about on Quora. Afterwards, some of the "most important" keywords in the Quora questions will be identified.

# Sentiment Analysis

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

%matplotlib inline

In [4]:
# Here we are importing the questions dataset and deleting irrelevant columns

questions = pd.read_csv('questions.csv',encoding = "ISO-8859-1")
del questions['id']
del questions['qid1']
del questions['qid2']
del questions['is_duplicate']
del questions['question2']
questions.head(5)

Unnamed: 0,question1
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [5]:
# Sentiment Scores computation 

from textblob import TextBlob

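def sentiment_calc(text):
    # Return TextBlob's sentiment (polarity, subjectivity), or None for values
    # that cannot be scored, such as missing (NaN) questions
    try:
        return TextBlob(text).sentiment
    except Exception:
        return None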

questions['sentiment_score'] = questions['question1'].apply(sentiment_calc)
questions.head(5)

Unnamed: 0,question1,sentiment_score
0,What is the step by step guide to invest in sh...,"(0.0, 0.0)"
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,"(0.0, 0.0)"
2,How can I increase the speed of my internet co...,"(0.0, 0.0)"
3,Why am I mentally very lonely? How can I solve...,"(-0.12999999999999998, 0.9099999999999999)"
4,"Which one dissolve in water quikly sugar, salt...","(0.0, 0.0)"


In [6]:
# convert to string
questions = questions.astype(str)

In [7]:
# extract only the sentiment score from the column

polarity = questions.sentiment_score.str.split('=').str[1].str.split(',').str[0].str.strip().to_frame()
polarity.head(5)

Unnamed: 0,sentiment_score
0,0.0
1,0.0
2,0.0
3,-0.1299999999999999
4,0.0
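The split chain in In [7] depends on the string form of the sentiment value. As a more explicit alternative, here is a minimal sketch that pulls the polarity out with a regular expression. It assumes the string looks like `Sentiment(polarity=..., subjectivity=...)`, and `parse_polarity` is a hypothetical helper name, not part of the original notebook.

```python
import re

# Matches the number after "polarity=" in text such as
# "Sentiment(polarity=-0.13, subjectivity=0.91)" (assumed string form)
POLARITY_RE = re.compile(r"polarity=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")

def parse_polarity(text):
    """Return the polarity as a float, or None when no polarity is present."""
    match = POLARITY_RE.search(text)
    return float(match.group(1)) if match else None

print(parse_polarity("Sentiment(polarity=-0.12999999999999998, subjectivity=0.9099999999999999)"))
print(parse_polarity("None"))  # rows where sentiment_calc returned None
```

Returning `None` for unscored rows keeps them easy to spot before the numeric conversion.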


In [8]:
# rename column
polarity = polarity.rename(columns= {'sentiment_score': 'polarity'})
polarity.head(1)

Unnamed: 0,polarity
0,0.0


In [9]:
# concatenate the dataframes, delete extra column
Question_polarity = pd.concat([questions,polarity], axis=1)
del Question_polarity['sentiment_score']
Question_polarity.head(5)

Unnamed: 0,question1,polarity
0,What is the step by step guide to invest in sh...,0.0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,0.0
2,How can I increase the speed of my internet co...,0.0
3,Why am I mentally very lonely? How can I solve...,-0.1299999999999999
4,"Which one dissolve in water quikly sugar, salt...",0.0


In [10]:
# Convert to numeric data types
Question_polarity['polarity'] = Question_polarity['polarity'].apply(pd.to_numeric)
Question_polarity.dtypes

question1     object
polarity     float64
dtype: object
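The conversion in In [10] applies `pd.to_numeric` to one value at a time. A possible alternative, sketched here on a made-up toy column, is to convert the whole column in one call and let `errors="coerce"` turn unparseable entries into NaN.

```python
import pandas as pd

# Toy column standing in for the extracted polarity strings (made-up values)
raw = pd.Series(["0.0", "-0.13", None, "not a number"])

# Convert the whole column at once; entries that cannot be parsed become NaN
polarity = pd.to_numeric(raw, errors="coerce")

print(polarity.dtype)
print(polarity.isna().sum())
```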

In [11]:
# Look at questions that only have a strong negative sentiment
Negative_questions = Question_polarity.loc[Question_polarity['polarity']<-0.5]
Negative_questions.head(5)

Unnamed: 0,question1,polarity
48,What are some examples of products that can be...,-0.7
117,"Did you find the ending of the novel ""1984"" de...",-0.6
136,Does it matter whether humans are selfish or e...,-0.75
151,How can I stop being afraid of working?,-0.6
203,Why do people hate Hillary Clinton?,-0.8


In [12]:
# Keep every question that received a polarity score (rows with a missing score are dropped)
All_questions = Question_polarity.loc[Question_polarity['polarity']>=-1.0]
All_questions.head(5)

Unnamed: 0,question1,polarity
0,What is the step by step guide to invest in sh...,0.0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,0.0
2,How can I increase the speed of my internet co...,0.0
3,Why am I mentally very lonely? How can I solve...,-0.13
4,"Which one dissolve in water quikly sugar, salt...",0.0
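Since TextBlob polarity ranges from -1.0 to 1.0, the `>= -1.0` filter in In [12] keeps every scored question. The only rows it can drop are those whose polarity is missing, because comparisons against NaN evaluate to False. A small sketch on a made-up toy frame illustrates this, together with a more explicit equivalent.

```python
import numpy as np
import pandas as pd

# Toy data: one neutral, one strongly negative, one unscored question
demo = pd.DataFrame({
    "question1": ["q1", "q2", "q3"],
    "polarity": [0.0, -0.8, np.nan],
})

# NaN >= -1.0 is False, so the unscored row is filtered out
kept = demo.loc[demo["polarity"] >= -1.0]

# More explicit way to express the same intent
kept_explicit = demo.dropna(subset=["polarity"])

print(kept)
print(kept.equals(kept_explicit))
```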


In [13]:
# drop the polarity column; only the question text is needed for topic modeling
del Negative_questions['polarity']
Negative_questions.head(1)

Unnamed: 0,question1
48,What are some examples of products that can be...


In [14]:
# drop the polarity column; only the question text is needed for topic modeling
del All_questions['polarity']
All_questions.head(1)

Unnamed: 0,question1
0,What is the step by step guide to invest in sh...


In [15]:
# Sample of questions mentioning Trump or Clinton, which are top terms in topics found later in the notebook

pd.options.display.max_colwidth = 1000
All_questions[All_questions['question1'].str.contains("Trump") | All_questions['question1'].str.contains("Clinton")].head(10)

Unnamed: 0,question1
15,What would a Trump presidency mean for current international master's students on an F1 visa?
203,Why do people hate Hillary Clinton?
210,If Hillary Clinton could not continue her Presidential campaign how would the Democratic Party choose a new candidate?
234,"Trump supporters, how do you feel about the way he's walking back his promises?"
366,Who is more electable in a general Presidential election: Donald Trump or Ted Cruz?
412,"Is Donald Trump taking cocaine, as Howard Dean suggests?"
430,What would Hillary Clinton do now that the election is over?
551,How will Trump's presidency affect Indian students who are planning to do a PhD in the US?
583,Do Trump voters care that he has reneged all his campaign promises?
638,"If the United States has a female president, will her husband be called the first gentleman? What will Bill Clinton be called if Hillary is elected?"
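The two `str.contains` calls joined with `|` can also be written as a single call, since `str.contains` treats its pattern as a regular expression by default. The sketch below uses a made-up toy frame; `na=False` is added here as an assumption about how missing questions should be handled (treated as non-matches).

```python
import pandas as pd

# Toy questions, including a missing one
demo = pd.DataFrame({"question1": [
    "Why do people hate Hillary Clinton?",
    "How can I stop being afraid of working?",
    "Do Trump voters care about campaign promises?",
    None,
]})

# "Trump|Clinton" is a regular expression meaning "Trump or Clinton"
mask = demo["question1"].str.contains("Trump|Clinton", na=False)

print(demo[mask])
```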


In [16]:
# Sample of strongly negative questions mentioning India or Modi, which are top terms in topics found later in the notebook

pd.options.display.max_colwidth = 1000
Negative_questions[Negative_questions['question1'].str.contains("India") | Negative_questions['question1'].str.contains("Modi")].head(10)

Unnamed: 0,question1
917,Does all Muslims hate Narendra Modi?
2510,Why is India afraid of having referendum in Kashmir?
3233,Was 2G spectrum scam bad for India?
3948,Why do people say Dhanush (South Indian actor) is ugly? I don't think so.?
6950,Why is India failing so miserably at the Rio Olympics?
13526,Do all muslims hate Prime Minister Narendra Modi and RSS?
14128,Why is India so bad at Olympics?
14647,"Why don't Indians eat pheasant, grouse, goose, turkey, duck, partridge, quail, guinea pig, rabbit& frog? Why are we obsessed with chicken mutton?"
18530,Is Arvind mad about Modi?
18723,Why is India failing so miserably at the Rio Olympics?


# Topic Modeling 

# Topics for All Questions

In [17]:
pd.options.display.max_colwidth = 1000
import warnings
warnings.filterwarnings('ignore')

In [18]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

stopwords = set(stopwords.words('english'))
punctuation = set(string.punctuation) 
lemmatize = WordNetLemmatizer()

def cleaning(article):
    one = " ".join([i for i in article.split() if i not in stopwords])
    two = "".join(i for i in one if i not in punctuation)
    three = " ".join(lemmatize.lemmatize(i) for i in two.split())
    return three

[nltk_data] Downloading package stopwords to /Users/Misha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Misha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
text = All_questions.applymap(cleaning)['question1']
text_list = [i.split() for i in text]
len(text_list)

from time import time
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO,
                   filename='running.log',filemode='w')

In [20]:
#ldamodel.save('topic.model')

# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(text_list)
dictionary.save('dictionary.dict')
print (dictionary)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in text_list]
corpora.MmCorpus.serialize('corpus.mm', doc_term_matrix)

print (len(doc_term_matrix))
print (doc_term_matrix[100])

start = time()
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50, alpha = 0.5)
print ('used: {:.2f}s'.format(time()-start))

Dictionary(94281 unique tokens: ['What', 'guide', 'india', 'invest', 'market']...)
404351
[(0, 1), (265, 1), (404, 1), (426, 1), (481, 1), (482, 1), (483, 1), (484, 1), (485, 1)]
used: 11841.60s
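The printed list of `(id, count)` pairs is the bag-of-words form of a single question: each term id is paired with the number of times that term occurs. The following standard-library sketch mimics the idea on toy tokens. It is not gensim's implementation, the helper names are hypothetical, and ids here are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every unique token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["step", "step", "guide"], ["guide", "market"]]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```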


In [21]:
# Results of Topic Modeling with LDA for all questions

for i in ldamodel.print_topics(): 
    for j in i: print (j)

0
0.125*"Why" + 0.042*"people" + 0.027*"Do" + 0.022*"Whats" + 0.019*"think" + 0.014*"Indian" + 0.011*"need" + 0.009*"say" + 0.008*"important" + 0.008*"human"
1
0.035*"would" + 0.029*"If" + 0.028*"time" + 0.027*"much" + 0.023*"someone" + 0.017*"take" + 0.015*"it" + 0.014*"possible" + 0.013*"person" + 0.013*"website"
2
0.225*"How" + 0.224*"I" + 0.044*"get" + 0.022*"make" + 0.015*"money" + 0.012*"find" + 0.010*"become" + 0.010*"want" + 0.010*"girl" + 0.009*"note"
3
0.113*"best" + 0.062*"Which" + 0.026*"Who" + 0.019*"work" + 0.019*"book" + 0.017*"Are" + 0.017*"movie" + 0.016*"better" + 0.014*"world" + 0.013*"phone"
4
0.026*"year" + 0.019*"start" + 0.014*"Should" + 0.014*"woman" + 0.013*"weight" + 0.011*"free" + 0.011*"lose" + 0.011*"In" + 0.011*"old" + 0.010*"2"
5
0.418*"What" + 0.033*"way" + 0.025*"difference" + 0.015*"mean" + 0.008*"example" + 0.007*"major" + 0.007*"computer" + 0.006*"car" + 0.006*"real" + 0.006*"song"
6
0.042*"one" + 0.034*"Quora" + 0.028*"know" + 0.027*"Does" + 0.026*"
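Capitalized question words such as "What", "How", and "Why" carry the largest weights in several of the topics printed above. One likely reason is that NLTK's English stopword list is lowercase, while `cleaning` compares tokens without lowercasing them first. The sketch below shows a lowercasing variant. The tiny stopword set is a stand-in for the NLTK list, lemmatization is left out to keep the example dependency-free, and `cleaning_lower` is a hypothetical name.

```python
import string

# Small stand-in for NLTK's English stopword list (an assumption for this sketch)
STOPWORDS = {"what", "is", "the", "how", "can", "i", "my", "of", "to", "in", "by"}

# Translation table that deletes every punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def cleaning_lower(article):
    """Lowercase, strip punctuation, then drop stopwords."""
    tokens = article.lower().translate(PUNCT_TABLE).split()
    return " ".join(token for token in tokens if token not in STOPWORDS)

print(cleaning_lower("What is the step by step guide to invest in share market?"))
```

Lowercasing would also merge variants such as "India" and "india", which may or may not be desirable for proper nouns.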

In [22]:
from IPython.display import IFrame

The visualization below for all the topics in Quora questions shows some interesting topics. It is not always easy to discern one single theme in each topic, but it gives a rough idea of the main areas that people tend to ask about. Some of these relationships are stronger than others, which can be seen from the amount of red in the visualization. A few of the topics also do not separate cleanly into one simple category.

The top topics can be roughly split into the following categories:

-Education (terms such as engineering, business, college, and books)

-Countries/Culture (India, USA, women, war, traveling, etc.)

-Politics/US Presidential Election (mentions of Donald Trump and Hillary Clinton)

-Firsts and new beginnings (a first child, a new home, a new startup, a new year, etc.)

-Relationships (terms such as guy, girl, weight, and sex)

-Making money online (Facebook, app, Google, money, etc.)

-Friendships (where to find friends or relationships, and related items)




In [23]:
# Interactive pyLDAvis visualization of the topics for all questions

import warnings
warnings.filterwarnings('ignore')

ldamodel.save('topic.model')
import os 
%matplotlib inline
import pyLDAvis.gensim
import gensim
pyLDAvis.enable_notebook()

d = gensim.corpora.Dictionary.load('dictionary.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')

pd.options.display.max_colwidth = 5000
data = pyLDAvis.gensim.prepare(lda, c, d, mds='tsne')
pyLDAvis.display(data)



# Topics for Negative Questions

In [24]:
pd.options.display.max_colwidth = 1000
import warnings
warnings.filterwarnings('ignore')

In [25]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

stopwords = set(stopwords.words('english'))
punctuation = set(string.punctuation) 
lemmatize = WordNetLemmatizer()

def cleaning(article):
    one = " ".join([i for i in article.split() if i not in stopwords])
    two = "".join(i for i in one if i not in punctuation)
    three = " ".join(lemmatize.lemmatize(i) for i in two.split())
    return three

[nltk_data] Downloading package stopwords to /Users/Misha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Misha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [26]:
text = Negative_questions.applymap(cleaning)['question1']
text_list = [i.split() for i in text]
len(text_list)

from time import time
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO,
                   filename='running.log',filemode='w')

In [27]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(text_list)
dictionary.save('dictionary.dict')
print (dictionary)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in text_list]
corpora.MmCorpus.serialize('corpus.mm', doc_term_matrix)

print (len(doc_term_matrix))
print (doc_term_matrix[100])

start = time()
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50, alpha = 0.5)
print ('used: {:.2f}s'.format(time()-start))

Dictionary(5391 unique tokens: ['What', 'crude', 'example', 'make', 'oil']...)
5011
[(28, 1), (202, 1), (366, 1), (367, 1), (368, 1), (369, 1), (370, 1), (371, 1)]
used: 182.01s


In [28]:
# Results of Topic Modeling with LDA for negative sentiment

for i in ldamodel.print_topics(): 
    for j in i: print (j)

0
0.243*"bad" + 0.127*"Is" + 0.019*"evil" + 0.019*"Does" + 0.018*"you" + 0.013*"The" + 0.013*"health" + 0.012*"Olympics" + 0.012*"day" + 0.010*"annoying"
1
0.070*"get" + 0.059*"rank" + 0.049*"I" + 0.018*"2016" + 0.017*"If" + 0.017*"sick" + 0.016*"really" + 0.014*"rid" + 0.012*"college" + 0.011*"Where"
2
0.134*"I" + 0.028*"make" + 0.026*"artificial" + 0.017*"do" + 0.015*"Im" + 0.014*"intelligence" + 0.014*"ugly" + 0.013*"go" + 0.012*"find" + 0.011*"know"
3
0.039*"stupid" + 0.037*"Quora" + 0.035*"chicken" + 0.026*"Can" + 0.022*"world" + 0.020*"question" + 0.017*"why" + 0.015*"Indian" + 0.015*"When" + 0.015*"job"
4
0.062*"afraid" + 0.035*"Are" + 0.032*"would" + 0.027*"crazy" + 0.027*"painful" + 0.026*"way" + 0.020*"impossible" + 0.018*"cat" + 0.015*"dog" + 0.011*"death"
5
0.186*"How" + 0.122*"I" + 0.046*"anger" + 0.039*"cold" + 0.030*"control" + 0.023*"feel" + 0.020*"girl" + 0.019*"base" + 0.012*"me" + 0.011*"feeling"
6
0.078*"hate" + 0.041*"Do" + 0.031*"Trump" + 0.027*"What" + 0.025*"bor

In [29]:
from IPython.display import IFrame

Some of the topics discovered for negative Quora questions are listed below. As with the full dataset, some of these topics are stronger than others, and some would be hard to classify into just one specific bucket:

-People being afraid, hate, evil, danger

-Terrible or worse things happening

-Anger control/sickness management

-Dangerous US President Donald Trump

-India (mentions of India, Prime Minister Narendra Modi, and Muslims)

-People feeling bad, ugly, or annoyed about something

-Relationships and friendships

In [30]:
import warnings
warnings.filterwarnings('ignore')

ldamodel.save('topic.model')
import os 
%matplotlib inline
import pyLDAvis.gensim
import gensim
pyLDAvis.enable_notebook()

d = gensim.corpora.Dictionary.load('dictionary.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')

pd.options.display.max_colwidth = 5000
data = pyLDAvis.gensim.prepare(lda, c, d, mds='tsne')
pyLDAvis.display(data)
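Both runs save to the same file names (`dictionary.dict`, `corpus.mm`, `topic.model`), so the negative-question run overwrites the files from the all-question run. One possible way to keep both, sketched with a hypothetical helper and a hypothetical naming scheme, is to prefix each file with a run name.

```python
from pathlib import Path

def artifact_paths(run_name, out_dir="."):
    """Build per-run file names so separate runs do not overwrite each other."""
    base = Path(out_dir)
    return {
        "dictionary": base / f"{run_name}_dictionary.dict",
        "corpus": base / f"{run_name}_corpus.mm",
        "model": base / f"{run_name}_topic.model",
    }

for run in ("all", "negative"):
    print(run, {key: str(path) for key, path in artifact_paths(run).items()})
```

The resulting paths could then be converted with `str(...)` and passed wherever the notebook currently uses the fixed file names.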



# Conclusion 

It was interesting to create LDA-based topic models for two sets of Quora questions: the entire dataset, and only the questions with strongly negative sentiment (polarity below -0.5). Some types of questions were similar between the overall dataset and the negative sentiment dataset, which is expected because one is a subset of the other, while other topics were different.

The similar topics between the two were:

-Presidential elections in the United States

-Country/culture, especially related to India

-Relationships (of all kinds, romantic or otherwise)
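The similar topics above were identified by reading the topic lists. As a rough complement, the overlap between two topics' top terms could be quantified, for example with Jaccard similarity. The sketch below parses strings in the format printed by `print_topics` and compares shortened excerpts of one topic from each model. The helper names are hypothetical, and lowercasing the terms before comparing is a choice made for this sketch.

```python
import re

# Matches weight*"term" pairs such as 0.243*"bad"
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (weight, term) pairs."""
    return [(float(weight), term) for weight, term in TERM_RE.findall(topic_string)]

def jaccard(terms_a, terms_b):
    """Share of terms two collections have in common (0.0 when both are empty)."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# First four terms of topic 0 (all questions) and topic 6 (negative questions)
all_topic = '0.125*"Why" + 0.042*"people" + 0.027*"Do" + 0.022*"Whats"'
neg_topic = '0.078*"hate" + 0.041*"Do" + 0.031*"Trump" + 0.027*"What"'

all_terms = [term.lower() for _, term in parse_topic(all_topic)]
neg_terms = [term.lower() for _, term in parse_topic(neg_topic)]
score = jaccard(all_terms, neg_terms)
print(score)
```

With only the top few terms per topic, such scores are coarse; they are best read as a sanity check on the manual comparison rather than a replacement for it.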

The differences between the entire dataset and the negative sentiment dataset seemed to be:

In the entire dataset (that is, topics missing from the negative dataset):

-College/Education

-Making Money

-New starts such as new home, first child, etc

Whereas in the negative dataset there was focus on:

-Evil, horrible things happening 

-Anger Management

-Annoying/Ugly things


I think the analysis overall did a reasonable job of surfacing sentiment and topics in the dataset, although several topics remain hard to label. Further analysis could try another model, in particular Non-Negative Matrix Factorization, and fine-tune the LDA parameters (for example num_topics, passes, and alpha).
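To give a feel for the Non-Negative Matrix Factorization idea mentioned above, here is a from-scratch sketch using the standard multiplicative update rules. It is not a library implementation and was not part of this analysis; the term-count matrix and term list are made-up toy data, and `nmf` is a hypothetical helper name.

```python
import numpy as np

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))
    H = rng.random((n_topics, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy term counts: two documents about US politics, two about India
terms = ["trump", "clinton", "india", "modi"]
V = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

W, H = nmf(V, n_topics=2)

# Show the two highest-weighted terms of each topic
for k, row in enumerate(H):
    top = [terms[i] for i in np.argsort(row)[::-1][:2]]
    print(k, top)
```

Each row of `H` plays the role of a topic (weights over terms) and each row of `W` gives a document's weights over topics, which makes the output comparable in spirit to the LDA topic lists.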
