# Topic Modeling Using LDA with Bag of Words
We use the following function to clean our texts and return a list of tokens:

In [5]:
import spacy
# spacy.load('en')

In [4]:
from spacy.lang.en import English
parser = English()
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
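
To make the three normalization rules easier to try without installing spaCy, here is a minimal sketch that applies the same rules to whitespace-separated words. It is a simplified stand-in, not the spaCy tokenizer: `tokenize_sketch` and `URL_PATTERN` are hypothetical names, the URL check is a rough regular expression rather than spaCy's `like_url`, and punctuation stays attached to words.

```python
import re

# Rough URL check, assumed for illustration (spaCy's like_url is more thorough)
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

def tokenize_sketch(text):
    """Whitespace-based stand-in for the spaCy-based tokenize() above."""
    lda_tokens = []
    for raw in text.split():              # split() already discards whitespace
        if URL_PATTERN.match(raw):
            lda_tokens.append("URL")          # collapse every URL into one symbol
        elif raw.startswith("@"):
            lda_tokens.append("SCREEN_NAME")  # collapse every @handle into one symbol
        else:
            lda_tokens.append(raw.lower())    # everything else is lowercased
    return lda_tokens

print(tokenize_sketch("Thanks @alice see https://example.com for the Report"))
# ['thanks', 'SCREEN_NAME', 'see', 'URL', 'for', 'the', 'report']
```

Replacing URLs and screen names with placeholder tokens keeps the vocabulary from filling up with strings that occur only once.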

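We use NLTK’s WordNet interface to reduce each word to its root form (lemma). WordNet also exposes word meanings, synonyms, and antonyms, but only the lemma lookup is needed here. `get_lemma` relies on `wn.morphy` and falls back to the original word when no lemma is found; `get_lemma2` is an alternative based on `WordNetLemmatizer`.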

In [6]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\salbo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Filter out stop words:

In [40]:
nltk.download('stopwords')
add_stop_word = ['article', 'state', 'member']
# A set makes the "token not in en_stop" check fast
en_stop = set(nltk.corpus.stopwords.words('english'))
# Add the extra words to the NLTK list
en_stop.update(add_stop_word)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\salbo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
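
The two filtering steps in `prepare_text_for_lda` can be tried in isolation with the sketch below. It assumes a small hypothetical stop list (`STOP_WORDS`) in place of the NLTK list and leaves out lemmatization; `filter_tokens` is an illustrative name, not part of the notebook.

```python
# Hypothetical stop list for illustration; the notebook uses NLTK's English
# list plus 'article', 'state' and 'member'.
STOP_WORDS = {"article", "state", "member", "their", "which"}

def filter_tokens(tokens, stop_words=STOP_WORDS, min_len=5):
    # Keep tokens with at least min_len characters (same test as len(token) > 4)
    long_enough = [t for t in tokens if len(t) >= min_len]
    # Then drop anything that appears in the stop list
    return [t for t in long_enough if t not in stop_words]

tokens = ["the", "member", "state", "shall", "inspect", "fishing", "vessels", "data"]
print(filter_tokens(tokens))
# ['shall', 'inspect', 'fishing', 'vessels']
```

Note that the length filter alone removes every word of four characters or fewer, so short but meaningful words such as "data" never reach the model.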



We load the data from a CSV file, join each row's title and article text, run the result through `prepare_text_for_lda`, and append the resulting token list to `text_data`.
Uncomment the last line of the cell to inspect the converted text:

In [41]:
import pandas as pd

text_data = []
data = pd.read_csv("data/data1.csv")
# One string per row: title followed by the article body
features = data['title'] + " " + data['article']
for feature in features:
    tokens = prepare_text_for_lda(feature)
    text_data.append(tokens)

# text_data



## LDA with Gensim

First, we create a dictionary from the data, convert each document to a bag-of-words vector to form the corpus, and save both the dictionary and the corpus for later use.

In [38]:
import pickle
from gensim import corpora

dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
# Close the file handle explicitly by using a context manager
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')
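
To illustrate what the bag-of-words conversion produces, the sketch below rebuilds the idea in plain Python. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the ids are assigned in first-seen order purely for illustration.

```python
from collections import Counter

def build_vocab(docs):
    # Assign each distinct token an integer id in first-seen order
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count known tokens; tokens missing from the vocabulary are dropped
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["fishing", "vessel", "fishing"], ["agency", "control", "vessel"]]
vocab = build_vocab(docs)
print(vocab)                                   # {'fishing': 0, 'vessel': 1, 'agency': 2, 'control': 3}
print(to_bow(docs[0], vocab))                  # [(0, 2), (1, 1)]
print(to_bow(["vessel", "bayesian"], vocab))   # [(1, 1)]
```

Each document becomes a list of `(token_id, count)` pairs, so word order is discarded. The last call shows that a word absent from the vocabulary contributes nothing to the vector.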



We ask LDA to find 5 topics in the data and print the four highest-weighted words of each topic:


In [39]:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)



(0, '0.026*"commission" + 0.025*"programme" + 0.018*"financial" + 0.015*"accordance"')
(1, '0.051*"vessel" + 0.036*"fishing" + 0.020*"annex" + 0.014*"inspection"')
(2, '0.036*"agency" + 0.021*"state" + 0.020*"control" + 0.019*"commission"')
(3, '0.000*"agency" + 0.000*"commission" + 0.000*"state" + 0.000*"accordance"')
(4, '0.022*"commission" + 0.020*"state" + 0.019*"scientific" + 0.018*"union"')


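With LDA, each topic is described by its own set of high-weight words: topic 1, for example, is dominated by "vessel" and "fishing", while topic 0 centers on "commission" and "programme". The separation is not complete, though. "commission" and "state" appear in several topics, and every top word in topic 3 has a weight of 0.000, which suggests that this topic captured almost nothing.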
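
One rough way to check how distinct the topics are is to count the top words that each pair of topics shares. The sketch below parses the topic strings printed above with a regular expression; `top_terms` and `shared` are hypothetical names, and looking at only four words per topic is a coarse measure rather than a formal evaluation.

```python
import re
from itertools import combinations

# Topic strings copied from the print_topics output above
topics = [
    (0, '0.026*"commission" + 0.025*"programme" + 0.018*"financial" + 0.015*"accordance"'),
    (1, '0.051*"vessel" + 0.036*"fishing" + 0.020*"annex" + 0.014*"inspection"'),
    (2, '0.036*"agency" + 0.021*"state" + 0.020*"control" + 0.019*"commission"'),
    (3, '0.000*"agency" + 0.000*"commission" + 0.000*"state" + 0.000*"accordance"'),
    (4, '0.022*"commission" + 0.020*"state" + 0.019*"scientific" + 0.018*"union"'),
]

def top_terms(topic_string):
    # Extract the quoted terms from a '0.026*"term" + ...' string
    return set(re.findall(r'"([^"]+)"', topic_string))

terms = {topic_id: top_terms(s) for topic_id, s in topics}
# For every pair of topics, list the top words they have in common
shared = {(a, b): sorted(terms[a] & terms[b]) for a, b in combinations(terms, 2)}
for pair, words in shared.items():
    print(pair, words)
```

Topic 1 shares no top words with any other topic, while topics 2 and 3 share three of their four.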

Let’s try a new document:

In [21]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(3307, 1)]
[(0, 0.599885), (1, 0.10004221), (2, 0.10004213), (3, 0.10001468), (4, 0.10001605)]


['practical', 'bayesian', 'optimization', 'machine', 'learning', 'algorithm']
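
The bag-of-words vector above contains a single entry, `(3307, 1)`, so only one of the six tokens was found in the dictionary and the topic assignment rests on that one word. As a small sketch, the most likely topic can be read off the `get_document_topics` output like this (`dominant_topic` is a hypothetical helper; the probabilities are copied from the run above):

```python
# Output of get_document_topics copied from the run above
doc_topics = [(0, 0.599885), (1, 0.10004221), (2, 0.10004213),
              (3, 0.10001468), (4, 0.10001605)]

def dominant_topic(topic_probs):
    # Pick the (topic_id, probability) pair with the highest probability
    return max(topic_probs, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(doc_topics)
print(topic_id, round(prob, 2))   # 0 0.6
```

The remaining four topics sit at roughly 0.10 each, so this assignment should be read with caution for a document this far from the training vocabulary.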

## pyLDAvis
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Visualizing 5 topics:

In [24]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Reload the dictionary, corpus and model saved in the earlier cells
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

lda_display = gensimvis.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)



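- Saliency: a measure of how much a term tells you about the topics overall. Frequent terms that are concentrated in a few topics score highest.
- Relevance: a weighted average of the log probability of the word given the topic and the log of that probability divided by the word's overall probability in the corpus (its lift). The weight λ is set with the slider in the visualization.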
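
The relevance definition can be written out as a short sketch. The function below follows the weighted-average form described in the list; the probabilities are made-up numbers for illustration, and `lam=0.5` is an arbitrary default rather than a pyLDAvis setting.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.5):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1.0 - lam) * log_lift

# Hypothetical terms: one common everywhere, one rarer but topic-specific
common = {"p_word_given_topic": 0.05, "p_word": 0.05}
specific = {"p_word_given_topic": 0.02, "p_word": 0.002}

for lam in (1.0, 0.5, 0.0):
    print(lam, relevance(lam=lam, **common), relevance(lam=lam, **specific))
```

With λ = 1 the ranking follows the within-topic probability alone, so the common word wins. As λ moves toward 0 the lift term dominates and the topic-specific word ranks higher.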

The area of each bubble shows how prevalent the topic is in the corpus.

With no topic selected, the bar chart lists the most salient terms, that is, the terms that tell us the most about the topics overall. Selecting a bubble shows the most relevant terms for that individual topic.