<h1>Topic Modelling</h1>

Here we use Latent Dirichlet Allocation (LDA) to group comments into topics.
LDA is an unsupervised machine learning model that discovers hidden semantic structure in a collection of comments, which lets us learn a topic representation for each comment.

<h1>The Process</h1>

<ul>
    <li>We pick the number of topics ahead of time even if we’re not sure what the topics are.</li>
    <li>Each comment is represented as a distribution over topics.</li>
    <li>Each topic is represented as a distribution over words.</li>
</ul>
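The two distributions above can be illustrated with a minimal sketch of the generative story that LDA assumes. The topic names, vocabulary, and probabilities below are hypothetical values chosen for illustration; they are not taken from the review data.

```python
import random

# Hypothetical toy model: 2 topics over a 4-word vocabulary.
# Each topic is a probability distribution over words.
topic_word = {
    "battery": {"battery": 0.5, "charge": 0.3, "sound": 0.1, "headset": 0.1},
    "audio": {"battery": 0.1, "charge": 0.1, "sound": 0.4, "headset": 0.4},
}

# Each comment is a probability distribution over topics (hypothetical values).
comment_topics = {"battery": 0.7, "audio": 0.3}


def sample(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_comment(comment_topics, topic_word, length, rng):
    """Generate words the way LDA assumes: pick a topic, then a word from it."""
    words = []
    for _ in range(length):
        topic = sample(comment_topics, rng)
        words.append(sample(topic_word[topic], rng))
    return words


rng = random.Random(0)
print(generate_comment(comment_topics, topic_word, length=6, rng=rng))
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word and comment-topic distributions that most plausibly produced them.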

<h1>Data Cleaning</h1>

In [5]:
from spacy.lang.en import English

# A blank English pipeline is enough here: only the tokenizer is needed,
# so no trained model has to be loaded.
parser = English()

def tokenize(text):
    """Lower-case tokens, replacing URLs and @-handles with placeholders."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

NLTK's WordNet interface gives access to word meanings, synonyms, antonyms, and more. We use it to reduce each word to its root form (lemma), either through wn.morphy or through WordNetLemmatizer.

In [6]:
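import nltk
import ssl

# Workaround for SSL certificate errors when downloading NLTK data:
# fall back to an unverified HTTPS context if the runtime supports it.
# This disables certificate checks, so use it only on a trusted network.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# With no arguments this opens NLTK's interactive downloader;
# the WordNet corpus is the package needed below.
nltk.download()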

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
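from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    """Return WordNet's base form of the word, or the word itself if none is found."""
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

def get_lemma2(word):
    """Alternative based on WordNetLemmatizer (treats words as nouns by default)."""
    return WordNetLemmatizer().lemmatize(word)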

In [8]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sakthy1497/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now we can define a function to prepare the text for topic modelling:

In [9]:
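def prepare_text_for_lda(text):
    tokens = tokenize(text)
    # Keep only tokens longer than four characters.
    tokens = [token for token in tokens if len(token) > 4]
    # Drop English stop words.
    tokens = [token for token in tokens if token not in en_stop]
    # Reduce each remaining token to its lemma.
    tokens = [get_lemma(token) for token in tokens]
    return tokens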

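We load the review data and prepare each comment for LDA. Only a random sample of roughly 1% of the comments is printed and added to the list, because the append sits inside the random.random() > .99 check.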

Now we can see how our text data are converted:

In [11]:
import pandas as pd
import random
text_data = []
df = pd.read_csv("/home/sakthy1497/Downloads/Amazon_Review.csv")
for line in df["Comments"]:
    tokens = prepare_text_for_lda(line)
    # Keep (and print) only a random sample of roughly 1% of the comments.
    if random.random() > .99:
        print(tokens)
        text_data.append(tokens)

['motorola', 'website', 'follow', 'direction', 'could']
['great', 'works']
['found', 'product', 'waaay']
['colors']
['additional', 'provide', 'instructions', 'whatsoever']
['get', 'compliments']
[]
['technology', 'suck']
['disappoint', 'accessoryone']
['try', 'exercise', 'frustration']
['wirefly', 'contact', 'cingular']
['better']
['post', 'detail', 'comment', 'black', 'phone', 'great', 'color']
['always', 'cord', 'headset', 'freedom', 'wireless', 'helpful']
['disappoint']
['sound', 'quality', 'excellent']
[]
['case!.']
[]
['sister', 'love']
['open', 'battery', 'connection', 'break', 'device', 'turn']
['reception']
['cake', 'everyone', 'raving', 'taste', 'sugary', 'disaster', 'tailor', 'palate']
[]
['price']
['veggitarian', 'platter', 'world']
['absolutley', 'fantastic']
['awesome']
['dining', 'college', 'cooking', 'course', 'class', 'dining', 'service']


<h1>LDA with Gensim</h1>

First, we create a dictionary from the data, convert it to a bag-of-words corpus, and save both the dictionary and the corpus for future use.

In [12]:
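import pickle
from gensim import corpora

# Map each unique token to an integer id.
dictionary = corpora.Dictionary(text_data)
# Represent each comment as a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in text_data]
# Save both so that later cells can reload them.
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')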
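
To make the bag-of-words step concrete, here is a minimal standard-library sketch of the idea. The helpers build_vocabulary and to_bow are hypothetical names written for this illustration and are not part of Gensim; the ids Gensim assigns to the same tokens may differ.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bow(doc, token_to_id):
    """Return sorted (token_id, count) pairs for the known tokens in one document."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


docs = [["sound", "quality", "excellent"], ["dining", "college", "dining"]]
vocab = build_vocabulary(docs)
bows = [to_bow(d, vocab) for d in docs]
print(bows)  # [[(0, 1), (1, 1), (2, 1)], [(3, 2), (4, 1)]]
```

Word order is discarded in this representation: only which tokens occur in a comment, and how often, is passed on to the model.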



We ask LDA to find 5 topics in the data:

In [13]:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.038*"connection" + 0.038*"battery" + 0.038*"break" + 0.038*"turn"')
(1, '0.061*"dining" + 0.034*"raving" + 0.034*"everyone" + 0.034*"palate"')
(2, '0.043*"freedom" + 0.043*"headset" + 0.043*"always" + 0.043*"cord"')
(3, '0.053*"veggitarian" + 0.053*"platter" + 0.053*"world" + 0.053*"works"')
(4, '0.067*"disappoint" + 0.037*"great" + 0.037*"phone" + 0.037*"color"')
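
Each printed topic is a string of weight*"word" terms, where the weight is the probability of that word within the topic. As a reading aid, the sketch below splits such a string into pairs using only the standard library; parse_topic is a hypothetical helper written for this illustration, not a Gensim function.

```python
import re

# Matches one term such as 0.067*"disappoint".
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(description):
    """Split a printed topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(description)]


topic = '0.067*"disappoint" + 0.037*"great" + 0.037*"phone" + 0.037*"color"'
print(parse_topic(topic))
# [('disappoint', 0.067), ('great', 0.037), ('phone', 0.037), ('color', 0.037)]
```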


Finding 3 topics:

In [15]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
ldamodel.save('model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.037*"dining" + 0.021*"class" + 0.021*"college" + 0.021*"course"')
(1, '0.063*"disappoint" + 0.036*"platter" + 0.036*"world" + 0.036*"veggitarian"')
(2, '0.027*"great" + 0.026*"palate" + 0.026*"cake" + 0.026*"raving"')



Finding 10 topics:

In [16]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=15)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.067*"great" + 0.067*"comment" + 0.067*"detail" + 0.067*"post"')
(1, '0.089*"frustration" + 0.089*"suck" + 0.089*"try" + 0.089*"technology"')
(2, '0.045*"cake" + 0.045*"disaster" + 0.045*"tailor" + 0.045*"sugary"')
(3, '0.089*"additional" + 0.089*"whatsoever" + 0.089*"provide" + 0.089*"instructions"')
(4, '0.082*"wirefly" + 0.082*"excellent" + 0.082*"contact" + 0.082*"cingular"')
(5, '0.089*"website" + 0.089*"follow" + 0.089*"direction" + 0.089*"motorola"')
(6, '0.128*"dining" + 0.067*"disappoint" + 0.067*"college" + 0.067*"accessoryone"')
(7, '0.096*"sister" + 0.096*"love" + 0.096*"price" + 0.096*"disappoint"')
(8, '0.106*"absolutley" + 0.106*"fantastic" + 0.106*"colors" + 0.010*"awesome"')
(9, '0.051*"break" + 0.051*"headset" + 0.051*"helpful" + 0.051*"battery"')



<h1>pyLDAvis</h1>

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Visualizing 5 topics:

In [14]:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
# Note: in pyLDAvis 3.x this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)




Saliency: a measure of how much a term tells you about the topics.

Relevance: a weighted average of the log probability of the word given the topic and the log of that probability divided by the word's overall probability in the corpus (its lift).

The size of each bubble shows how prevalent the topic is in the data.

The chart first shows the most salient terms, that is, the terms that tell us the most about the topics overall. We can also select an individual topic to inspect its terms.
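
The relevance score used to rank terms within a topic can be sketched as follows, assuming the usual pyLDAvis definition: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)). The function name and the example probabilities are hypothetical and chosen only to show how the weight lam changes the ranking.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Weighted average of log p(w|t) and log lift, with weight lam in [0, 1]."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities: a common word that is frequent everywhere,
# and a rarer word that is concentrated in this topic.
common = {"p_word_given_topic": 0.05, "p_word": 0.05}
specific = {"p_word_given_topic": 0.02, "p_word": 0.002}

for lam in (1.0, 0.5, 0.0):
    print(lam, relevance(lam=lam, **common), relevance(lam=lam, **specific))
```

With lam = 1 terms are ranked purely by their probability within the topic, which favours the common word; with lam = 0 they are ranked purely by lift, which favours the topic-specific word.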

Visualizing 3 topics:

In [17]:
lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)

Visualizing 10 topics:

In [18]:
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)