# Topic Modeling Using LDA with Bag of Words
We use the following function to clean our texts and return a list of tokens:

In [5]:
from pprint import pprint
import spacy
# spacy.load('en')


In [72]:
from gensim.models import CoherenceModel
from spacy.lang.en import English
parser = English()
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens



We use NLTK's WordNet interface to reduce each word to its root form (lemma). `get_lemma` uses `wn.morphy` and falls back to the original word when WordNet has no entry; `get_lemma2` is an alternative based on `WordNetLemmatizer`.

In [28]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\salbo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Filter out stop words:

In [67]:
nltk.download('stopwords')
# Use a set so the membership tests in prepare_text_for_lda are fast
en_stop = set(nltk.corpus.stopwords.words('english'))
# Add domain-specific words that appear in most of the texts
en_stop.update(['programme', 'accordance', 'article', 'state', 'member', 'this', 'annex', 'paragraph'])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\salbo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [52]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
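
To make the order of the filters concrete, here is a dependency-free sketch of the same pipeline. It assumes a plain whitespace split in place of the spaCy tokenizer, a tiny hypothetical stop list (`DEMO_STOP`), and an identity stand-in for `get_lemma` (`demo_lemma`), so its output will differ from the real function on real text.

```python
# Hypothetical stand-ins so the sketch runs without spaCy or NLTK.
DEMO_STOP = {"programme", "accordance", "article"}

def demo_lemma(word):
    # Placeholder for get_lemma(); returns the word unchanged.
    return word

def demo_prepare(text):
    tokens = text.lower().split()                        # crude tokenizer
    tokens = [t for t in tokens if len(t) > 4]           # drop tokens of 4 characters or fewer
    tokens = [t for t in tokens if t not in DEMO_STOP]   # drop stop words
    return [demo_lemma(t) for t in tokens]               # lemmatize last

result = demo_prepare("The programme sets rules for fishing vessels in Union waters")
print(result)
```

Because the stop-word check runs before lemmatization, an inflected form such as `articles` is not caught by a stop list that contains only `article`.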



We load the data from a CSV file, join each row's title and article into one string, prepare that string for LDA, and append the resulting token list to `text_data`.
The DataFrame below shows the source data:

In [77]:
import pandas as pd

text_data = []
data = pd.read_csv("data/data1.csv")
# Combine title and article into one text per document
features = data['title'] + " " + data['article']
for feature in features:
    tokens = prepare_text_for_lda(feature)
    text_data.append(tokens)

# text_data
data



| Unnamed: 0 | title | article | label |
|---|---|---|---|
| 0 | (EU 17 2017 establishment Union framework coll... | 1.With view (EU Regulation management biologic... | fishing industry |
| 1 | Regulation (EU) 2019/833 European Parliament C... | 1.This Regulation Union fishing vessels use pu... | conservation of fish stocks |
| 2 | Regulation (EU 1303/2013 European Parliament C... | Regulation common rules European Regional Deve... | European Regional Development Fund |
| 3 | Regulation (EU) 2019/473 European Parliament C... | Regulation provision European Fisheries Contro... | fishery management |


## LDA with Gensim

First, we create a dictionary from the data, convert the data to a bag-of-words corpus, and save both the dictionary and the corpus for later use.

In [78]:
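import pickle
from gensim import corpora

# Map each token to an integer id, then encode each document as (token_id, count) pairs
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]

# Save both so they can be reloaded later
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')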
# corpus
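
As a rough picture of what the dictionary and `doc2bow` produce, the sketch below builds a token-to-id mapping and per-document `(token_id, count)` pairs in plain Python. The names `build_vocab` and `to_bow` are hypothetical, and the ids Gensim assigns may differ; only the shape of the result is meant to match.

```python
from collections import Counter

def build_vocab(docs):
    # Assign each distinct token an integer id in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count known tokens and return sorted (token_id, count) pairs;
    # tokens missing from the vocabulary are silently dropped.
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["fishing", "vessel", "fishing"], ["programme", "vessel"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation; only the counts per token survive, which is what "bag of words" refers to.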



We ask LDA to find 5 topics in the data:

In [55]:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)



(0, '0.001*"commission" + 0.001*"accordance" + 0.001*"programme" + 0.001*"financial"')
(1, '0.000*"commission" + 0.000*"programme" + 0.000*"accordance" + 0.000*"financial"')
(2, '0.041*"vessel" + 0.030*"fishing" + 0.017*"inspection" + 0.016*"commission"')
(3, '0.001*"commission" + 0.001*"programme" + 0.000*"accordance" + 0.000*"regulation"')
(4, '0.027*"commission" + 0.025*"programme" + 0.018*"financial" + 0.015*"accordance"')


In [76]:
# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
# Gensim derives perplexity as 2 ** (-bound), so a bound closer to zero is better.
print('\nPerplexity: ', ldamodel.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=text_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)




Perplexity:  -6.5603534881831465

Coherence Score:  0.3738435677442225
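
Gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; the library documents the perplexity as `2^(-bound)`. A minimal conversion, using the bound printed above:

```python
# Per-word likelihood bound as printed by ldamodel.log_perplexity(corpus) above.
bound = -6.5603534881831465

# Perplexity as Gensim defines it: 2 ** (-bound). Lower perplexity is better.
perplexity = 2 ** (-bound)
print(f"Perplexity: {perplexity:.1f}")
```

Note that this value is computed on the training corpus, so it says little about how the model would score unseen documents.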


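LDA represents each document as a mixture of topics. In the output above, only topics 2 (fishing vessels and inspections) and 4 (programmes and financing) have clearly dominant terms; topics 0, 1, and 3 have near-zero weights on every top word, which suggests that five topics is more than this corpus supports.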
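
One rough way to check how distinct the topics are is to add up the top-word weights printed above for each topic. The sketch below hard-codes those printed weights; the 0.01 cutoff is an arbitrary assumption chosen for illustration, not a Gensim default.

```python
# Top-4 word weights copied from the print_topics output above.
top_weights = {
    0: [0.001, 0.001, 0.001, 0.001],
    1: [0.000, 0.000, 0.000, 0.000],
    2: [0.041, 0.030, 0.017, 0.016],
    3: [0.001, 0.001, 0.000, 0.000],
    4: [0.027, 0.025, 0.018, 0.015],
}

CUTOFF = 0.01  # arbitrary threshold, for illustration only

# Total probability mass carried by each topic's four strongest words.
mass = {topic: sum(weights) for topic, weights in top_weights.items()}
informative = [topic for topic, m in mass.items() if m > CUTOFF]
print(informative)
```

Topics whose strongest words all have weights near zero spread their mass almost evenly over the vocabulary, which can indicate that the model was given more topics than the data needs.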

Let's infer the topic distribution for a single document, here the first article in the dataset:

In [59]:
new_doc = data['article'][0]
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(new_doc)
print(ldamodel.get_document_topics(new_doc_bow))

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 6), (27, 2), (28, 18), (29, 4), (30, 2), (31, 1), (32, 2), (33, 1), (34, 1), (35, 6), (36, 2), (37, 3), (38, 3), (39, 3), (40, 3), (41, 1), (42, 2), (43, 8), (44, 5), (45, 2), (46, 2), (47, 2), (48, 1), (49, 3), (50, 1), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 6), (58, 2), (59, 8), (60, 1), (61, 3), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 4), (71, 1), (72, 3), (73, 5), (74, 5), (75, 1), (76, 1), (77, 1), (78, 2), (79, 2), (80, 2), (81, 1), (82, 5), (83, 1), (84, 1), (85, 2), (86, 8), (87, 1), (88, 2), (89, 1), (90, 1), (91, 4), (92, 1), (93, 12), (94, 1), (95, 2), (96, 4), (97, 47), (98, 1), (99, 1), (100, 1), (101, 4), (102, 3), (103, 1), (104, 2), (105, 1), (106, 1), (107, 1), (108, 1), (109, 5), (110, 



In [48]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(new_doc)
print(ldamodel.get_document_topics(new_doc_bow))

[(3307, 1)]
['practical', 'bayesian', 'optimization', 'machine', 'learning', 'algorithm']
[(0, 0.100062005), (1, 0.10002429), (2, 0.10002141), (3, 0.5998307), (4, 0.1000616)]
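
The bag-of-words output above contains a single entry even though the document yields six tokens, because `doc2bow` ignores tokens that are not in the dictionary. A small helper can make this visible. The helper name `split_known` is hypothetical, and the mapping below is a one-entry stand-in; with the real model, `dictionary.token2id` would be passed instead. The output does not show which token has id 3307, so pairing it with `machine` here is purely illustrative.

```python
def split_known(tokens, token2id):
    # Separate tokens the dictionary knows from those doc2bow would drop.
    known = [t for t in tokens if t in token2id]
    unknown = [t for t in tokens if t not in token2id]
    return known, unknown

# Stand-in mapping for illustration; use dictionary.token2id with the real model.
demo_token2id = {"machine": 3307}

tokens = ["practical", "bayesian", "optimization", "machine", "learning", "algorithm"]
known, unknown = split_known(tokens, demo_token2id)
print(f"{len(known)} known, {len(unknown)} unknown: {unknown}")
```

With so little evidence, the topic probabilities for this document rest on one word, so the 0.60 assigned to topic 3 should be read with caution.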




## pyLDAvis
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Visualizing 5 topics:

In [24]:
import pickle
import gensim
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Reload the artifacts saved earlier
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

lda_display = gensimvis.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)



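- Saliency: a measure of how much a term tells you about the topics overall. It combines the term's overall frequency with how unevenly the term is distributed across topics.
- Relevance: a weighted average of the log probability of the word given the topic and the log of that probability divided by the word's overall probability in the corpus (its lift). The weight λ is set with the slider in the visualization.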
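
For reference, pyLDAvis follows Sievert and Shirley's definition of relevance: `λ * log p(w | t) + (1 - λ) * log(p(w | t) / p(w))`, with λ between 0 and 1. A minimal sketch with made-up probabilities follows; the function name `relevance`, the default `lam=0.6`, and all numbers are assumptions for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    # lam weights the raw topic-word probability against the lift,
    # i.e. the topic-word probability divided by the word's overall probability.
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Made-up numbers: a word that is common everywhere vs. one concentrated in the topic.
common = relevance(p_word_given_topic=0.03, p_word=0.03)
specific = relevance(p_word_given_topic=0.02, p_word=0.002)
print(common < specific)
```

Lowering λ favors terms that are distinctive for the topic, while λ = 1 ranks terms purely by their probability within the topic.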

The size of each bubble reflects how prevalent that topic is in the corpus.

With no topic selected, the bar chart shows the most salient terms, that is, the terms that tell us the most about the topics overall. Selecting a bubble shows the most relevant terms for that individual topic.






