We will learn how to identify which topics are discussed in a document, a task called topic modelling. In particular, we will cover Latent Dirichlet Allocation (LDA), a widely used topic modelling technique, and apply it to convert a set of research papers into a set of topics.

In [8]:
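# Note: this /blob/ URL returns GitHub's HTML page for the file (see "[text/html]" in the log),
# which is why HTML markup shows up in the tokens and topics further down.
# The raw CSV should be available at:
# https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/dataset.csv
!wget https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/dataset.csv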

--2020-07-01 15:22:15--  https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/dataset.csv
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dataset.csv’

dataset.csv             [    <=>             ] 672.03K   764KB/s    in 0.9s    

2020-07-01 15:22:17 (764 KB/s) - ‘dataset.csv’ saved [688156]



The Process

*   We pick the number of topics ahead of time, even if we are not sure what the topics are.
*   Each document is represented as a distribution over topics.
*   Each topic is represented as a distribution over words.
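
The three points above can be illustrated with a toy version of the generative story that LDA assumes. This is only a sketch using the standard library: the vocabulary, the Dirichlet parameter of 0.5, and the helper names `sample_dirichlet` and `generate_document` are assumptions made for illustration and are not part of the notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # Each document gets its own distribution over topics.
    doc_topics = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(topic_word)), weights=doc_topics)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return doc_topics, words

rng = random.Random(0)
# Hypothetical vocabulary, chosen only for this illustration.
vocab = ["filter", "signal", "network", "protocol", "database", "query"]
num_topics = 3  # the number of topics is fixed ahead of time
# Each topic is a distribution over the vocabulary.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
doc_topics, words = generate_document(topic_word, vocab, [0.5] * num_topics, 8, rng)
print([round(p, 3) for p in doc_topics])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions.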



Text Cleaning

We use the following function to clean our texts and return a list of tokens:

In [11]:
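from spacy.lang.en import English

# A blank English pipeline is enough here: we only need spaCy's tokenizer,
# so no pretrained model has to be loaded.
parser = English()

def tokenize(text):
    """Split text into lowercase tokens, masking URLs and @-handles."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            # Skip whitespace-only tokens.
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens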

In [12]:
#We use NLTK’s Wordnet to find the meanings of words, synonyms, antonyms, and more. 
#In addition, we use WordNetLemmatizer to get the root word.

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
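from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    """Return WordNet's base form of the word, or the word itself if none is found."""
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

# Create the lemmatizer once instead of on every call.
lemmatizer = WordNetLemmatizer()

def get_lemma2(word):
    """Lemmatize with WordNetLemmatizer, which treats the word as a noun by default."""
    return lemmatizer.lemmatize(word)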

In [7]:
for w in ['sad','leaves','child', 'ran', 'discouraged']:
    print(w, get_lemma(w), get_lemma2(w))

sad sad sad
leaves leaf leaf
child child child
ran run ran
discouraged discourage discouraged


In [14]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
#Now we can define a function to prepare the text for topic modelling:

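def prepare_text_for_lda(text):
    """Tokenize, drop short tokens and stop words, then lemmatize."""
    tokens = tokenize(text)
    # Keep only tokens longer than four characters.
    tokens = [token for token in tokens if len(token) > 4]
    # Remove English stop words.
    tokens = [token for token in tokens if token not in en_stop]
    # Reduce each remaining token to its base form.
    tokens = [get_lemma(token) for token in tokens]
    return tokens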

In [16]:
import random
text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        # Keep (and print) only a random sample of roughly 1% of the lines.
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)

[]
['class="list', 'style']
['class="border', 'bottom', 'border', 'bottom-0']
[]
['summary', 'class="btn', 'truncate']
['id="lc21', 'class="js']
['id="l33', 'class="blob', 'number', 'number="33"></td']
[]
['minimax', 'design', 'digital', 'filter', 'using', 'relaxation', 'technique.</td']
['id="lc75', 'class="js']
['id="l76', 'class="blob', 'number', 'number="76"></td']
[]
[]
['inductively', 'tune', 'astable', 'multivibrator.</td']
['id="lc208', 'class="js']
['codepipe', 'opportunistic', 'feeding', 'route', 'protocol', 'reliable', 'multicast', 'pipelined', 'network', 'coding.</td']
['id="l222', 'class="blob', 'number', 'number="222"></td']
[]
['computation', 'shadow', 'boundary', 'using', 'spatial', 'coherence', 'backprojections.</td']
[]
['id="l354', 'class="blob', 'number', 'number="354"></td']
['id="lc398', 'class="js']
['id="l429', 'class="blob', 'number', 'number="429"></td']
['id="lc431', 'class="js']
[]
['id="l482', 'class="blob', 'number', 'number="482"></td']
['synchronization'

LDA with Gensim
First, we create a dictionary from the data, then convert the data to a bag-of-words corpus, and save both the dictionary and the corpus for future use.

In [18]:
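import pickle
from gensim import corpora

# Map each unique token to an integer id.
dictionary = corpora.Dictionary(text_data)
# Represent each document as a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in text_data]

# Save both so that later cells can reload them.
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')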
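
To make the dictionary and bag-of-words step concrete, here is a pure-Python sketch of the same idea. It does not use gensim, and gensim's actual id assignment may differ; `build_vocabulary`, `to_bag_of_words`, and the two toy documents are hypothetical names and data introduced only for illustration.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

# Two toy documents, already tokenized.
docs = [["digital", "filter", "design", "filter"], ["network", "protocol", "design"]]
token_to_id = build_vocabulary(docs)
bows = [to_bag_of_words(d, token_to_id) for d in docs]
print(token_to_id)
print(bows)
```

Word order is discarded in this representation; only which tokens occur, and how often, is kept.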



In [19]:
import gensim
NUM_TOPICS = 5
# Train LDA with 5 topics, making 15 passes over the corpus.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
# Show the four highest-weighted words of each topic.
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.045*"class="blob" + 0.045*"number" + 0.023*"using" + 0.016*"base"')
(1, '0.095*"number" + 0.095*"class="blob" + 0.011*"information" + 0.011*"relational"')
(2, '0.034*"number" + 0.034*"class="blob" + 0.013*"filter" + 0.013*"scale"')
(3, '0.049*"class="blob" + 0.049*"number" + 0.010*"databases.</td" + 0.010*"power"')
(4, '0.217*"class="js" + 0.026*"using" + 0.010*"gpus.</td" + 0.010*"method"')




In [20]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[]
[(0, 0.2), (1, 0.2), (2, 0.2), (3, 0.2), (4, 0.2)]
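
The empty list and the uniform 0.2 values in this output are connected. None of the tokens from the new title are in the dictionary, so the bag-of-words is empty, and with no word evidence the topic estimate is consistent with simply normalising the prior. The sketch below illustrates this under the assumption of a symmetric prior over 5 topics; `known_tokens` is a hypothetical stand-in for the dictionary, `new_tokens` is the list the cleaning function is expected to produce for that title, and `prior_only_topic_estimate` is not a gensim function.

```python
def prior_only_topic_estimate(num_topics, alpha=None):
    """Topic proportions when a document contributes no word counts at all."""
    if alpha is None:
        # Assumed symmetric prior: the same weight for every topic.
        alpha = [1.0 / num_topics] * num_topics
    total = sum(alpha)
    return [(k, a / total) for k, a in enumerate(alpha)]

# Hypothetical dictionary contents, for illustration only.
known_tokens = {"filter": 0, "design": 1, "network": 2}
new_tokens = ["practical", "bayesian", "optimization", "machine", "learning", "algorithm"]

# Tokens that are not in the dictionary are dropped from the bag-of-words.
in_vocab = [t for t in new_tokens if t in known_tokens]
print(in_vocab)
print(prior_only_topic_estimate(5))
```

A larger or cleaner training corpus would make it more likely that a new document shares vocabulary with the dictionary.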


In [21]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)
ldamodel.save('model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.009*"base" + 0.009*"filter" + 0.009*"power" + 0.009*"interferometry"')
(1, '0.150*"class="js" + 0.018*"using" + 0.007*"model" + 0.007*"scale"')
(2, '0.095*"class="blob" + 0.095*"number" + 0.020*"using" + 0.009*"digital"')




In [22]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.025*"base" + 0.025*"move" + 0.025*"metadata" + 0.025*"invite"')
(1, '0.081*"class="blob" + 0.081*"number" + 0.017*"digital" + 0.017*"correction"')
(2, '0.173*"class="blob" + 0.173*"number" + 0.009*"id="l1664" + 0.009*"number="1662"></td"')
(3, '0.037*"using" + 0.019*"modelling" + 0.019*"power" + 0.019*"converter"')
(4, '0.038*"using" + 0.038*"filter" + 0.020*"pattern" + 0.020*"design"')
(5, '0.376*"class="js" + 0.015*"networks.</td" + 0.015*"multihomed" + 0.015*"egress"')
(6, '0.043*"using" + 0.022*"bsns.</td" + 0.022*"bodyt2" + 0.022*"assurance"')
(7, '0.018*"lattice" + 0.018*"geometry" + 0.018*"numerical" + 0.018*"method"')
(8, '0.026*"model" + 0.026*"scale" + 0.026*"opportunistic" + 0.026*"multi"')
(9, '0.035*"recommendation" + 0.035*"semantics" + 0.035*"group" + 0.035*"efficiency.</td"')




pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.


In [25]:
#Visualizing 5 topics:
!pip install pyLDAvis
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
# In pyLDAvis 3.x this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)



Saliency: a measure of how much a term tells you about the topics. Frequent terms that are concentrated in a few topics score highest.
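
As a rough illustration, saliency can be computed as the overall probability of a term multiplied by its distinctiveness, that is, how far the topic distribution given the term departs from the overall topic distribution. The sketch below follows that definition on a toy model; the two topics, their word probabilities, and the function name `saliency` are assumptions for illustration, not output from the models above.

```python
import math

def saliency(word_given_topic, topic_prob):
    """Compute term saliency from P(word | topic) rows and P(topic)."""
    num_topics = len(topic_prob)
    scores = {}
    for w in word_given_topic[0]:
        # Marginal probability of the word across all topics.
        p_w = sum(topic_prob[t] * word_given_topic[t][w] for t in range(num_topics))
        distinctiveness = 0.0
        for t in range(num_topics):
            p_t_given_w = topic_prob[t] * word_given_topic[t][w] / p_w
            if p_t_given_w > 0:
                distinctiveness += p_t_given_w * math.log(p_t_given_w / topic_prob[t])
        scores[w] = p_w * distinctiveness
    return scores

# Toy model: "using" is equally likely in both topics, the other words are topic-specific.
word_given_topic = [
    {"using": 0.5, "filter": 0.5, "network": 0.0},
    {"using": 0.5, "filter": 0.0, "network": 0.5},
]
topic_prob = [0.5, 0.5]
scores = saliency(word_given_topic, topic_prob)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 4))
```

In this toy model "using" is frequent but gets a saliency of zero, because seeing it says nothing about which topic is being discussed.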


Relevance: a weighted average of the log probability of the word given the topic and the log of that same probability normalized by the overall probability of the word.
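
A small sketch of the relevance score may help. It assumes the usual formulation with a weight `lam` between 0 and 1; the probabilities below are made-up values for two terms within one topic, and the function name `relevance` is chosen here for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Hypothetical terms as (p(w | topic), p(w)) pairs:
# "using" is common everywhere, "filter" is rarer but specific to this topic.
terms = {"using": (0.05, 0.05), "filter": (0.02, 0.004)}

def rank(lam):
    """Order the terms from most to least relevant for a given weight."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print(rank(1.0))  # ranks by probability within the topic only
print(rank(0.0))  # ranks by how specific the term is to the topic
```

With the weight at 1 the common word comes first, and with the weight at 0 the topic-specific word comes first. Intermediate values trade off the two.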


The size of each bubble reflects how prevalent that topic is in the data.


First, we get the most salient terms, that is, the terms that tell us the most about what is going on across the topics. We can also look at an individual topic.

In [26]:
#Visualizing 3 topics:

lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)



In [27]:
#Visualizing 10 topics:
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)

