**Topic Modelling**

Latent Dirichlet Allocation (LDA) is a widely used topic modelling technique.
Here, LDA is applied to convert a set of research papers into a set of topics.

Research paper topic modelling is an unsupervised machine learning method that helps us discover hidden semantic structures in papers and learn topic representations for the papers in a corpus. The model can also be applied to any kind of labels on documents, such as tags on posts on a website.
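
To make the idea of hidden topics concrete, here is a minimal sketch of the generative story that LDA assumes: each document draws its own topic mixture, and each word is then drawn from one of the topics. The two topics, their word lists, and the `alpha` value are invented for illustration. Real LDA works in the opposite direction, inferring such a structure from the observed words.

```python
import random

# Hypothetical topics, each with a made-up word distribution.
TOPICS = {
    "networks": {"wireless": 0.4, "network": 0.4, "system": 0.2},
    "vision": {"image": 0.5, "recognition": 0.3, "method": 0.2},
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, alpha, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(num_words):
        topic = rng.choices(names, weights=mixture)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

rng = random.Random(0)
mixture, words = generate_document(num_words=8, alpha=0.5, rng=rng)
print([round(p, 2) for p in mixture], words)
```

A small `alpha` tends to produce documents dominated by a single topic, while a large one spreads each document across topics more evenly.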

In [1]:
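#Research paper text data is just a collection of unlabeled texts.
#Caution: this /blob/ URL serves GitHub's HTML page, not the CSV itself (note "[text/html]" in the log below).
#To fetch the file itself, use the raw URL instead:
#!wget https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/dataset.csv
!wget https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/dataset.csv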

--2020-07-01 15:23:19--  https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/dataset.csv
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dataset.csv’

2020-07-01 15:23:19 (9.68 MB/s) - ‘dataset.csv’ saved [688156]
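
The `[text/html]` entry in the log above suggests that the saved file is a web page rather than CSV data. A quick check along the following lines can catch this before any modelling starts. `looks_like_html` is a hypothetical helper, and the markers it searches for are a heuristic, not an exhaustive test.

```python
import os
import tempfile

def looks_like_html(path, probe_bytes=2048):
    """Heuristically report whether the start of a file looks like an HTML page."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lower()
    return b"<!doctype html" in head or b"<html" in head

# Demonstration on two throwaway files (contents invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    html_path = os.path.join(tmp, "page.csv")
    csv_path = os.path.join(tmp, "data.csv")
    with open(html_path, "w") as f:
        f.write("<!DOCTYPE html>\n<html><body>not a dataset</body></html>\n")
    with open(csv_path, "w") as f:
        f.write("title\nPractical Bayesian Optimization of Machine Learning Algorithms\n")
    html_result = looks_like_html(html_path)
    csv_result = looks_like_html(csv_path)

print(html_result, csv_result)  # True False
```

In the notebook, the same function could be called as `looks_like_html('dataset.csv')` right after the download.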



**Text Cleaning**

In [2]:
#Function to clean our texts and return a list of tokens:

from spacy.lang.en import English

#English() builds a blank, tokenizer-only pipeline, so no trained model has to be loaded.
parser = English()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
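
The normalization rules in `tokenize` can be illustrated without spaCy. The sketch below splits on whitespace only, so it is cruder than spaCy's tokenizer: punctuation stays attached to words, and the URL test is a simple prefix check rather than `like_url`. `simple_tokenize` is a hypothetical stand-in, not a replacement for the function above.

```python
def simple_tokenize(text):
    """Whitespace-based approximation of the rules in tokenize()."""
    tokens = []
    for raw in text.split():  # split() already discards whitespace-only pieces
        if raw.lower().startswith(("http://", "https://", "www.")):
            tokens.append("URL")
        elif raw.startswith("@"):
            tokens.append("SCREEN_NAME")
        else:
            tokens.append(raw.lower())
    return tokens

print(simple_tokenize("Read https://example.com by @someone about Topic Models"))
# ['read', 'URL', 'by', 'SCREEN_NAME', 'about', 'topic', 'models']
```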

In [3]:
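#Use NLTK's WordNet interface, which gives access to word meanings, synonyms, antonyms, and more.
#In addition, we use WordNetLemmatizer to get the root word. It treats every word as a noun
#unless a part of speech is passed (for example pos='v'), which is why it leaves 'ran' unchanged below.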

import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [18]:
for w in ['dogs', 'ran', 'discouraged']:
    print(w, get_lemma(w), get_lemma2(w))

dogs dog dog
ran run ran
discouraged discourage discouraged


In [4]:
#Filter out stop words:

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
#Now we can define a function to prepare the text for topic modelling:

def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

In [6]:
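#Open the data and read it line by line, preparing each line for LDA.
#Only a random ~1% of the lines (random.random() > .99) are printed and added to the list;
#move the append out of the if block to keep every line.
#Now we can see how our text data are converted: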

import random
text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)

[]
['input', 'type="hidden', 'csrf="true', 'class="js', 'suggestion', 'value="+hhzknwwv446jnwrdl/7ealjktmvs7', 'gprlio2lw7i2uj+3eqo4oaxyah', 'zsrksqens08kpdvgvjj8fdksq==']
['class="octicon', 'octicon', 'fork', 'viewbox="0', 'version="1.1', 'width="16', 'height="16', 'hidden="true"><path', 'rule="evenodd', 'd="m5', '3.25a.75.75', '.75.75', '011.5', '2.122a2.25', '0v.878a2.25', '005.75', '8.5h1.5v2.128a2.251', '2.251', '101.5', '0v8.5h1.5a2.25', '002.25', '2.25v-.878a2.25', '0v.878a.75.75', '01-.75.75h-4.5a.75.75', '6.25v-.878zm3.75', '7.378a.75.75', '.75.75', '011.5', '8.75a.75.75', '1.5.75.75', '1.5z"></path></svg']
[]
['height="16', 'class="octicon', 'octicon', 'people', 'text="gray', 'viewbox="0', 'version="1.1', 'width="16', 'hidden="true"><path', 'rule="evenodd', 'd="m5.5', '3.5a2', '5.5a3.5', '115.898', '2.549', '5.507', '5.507', '013.034', '4.084.75.75', '1.482.235', '4.001', '4.001', '.75.75', '1.482-.236a5.507', '5.507', '013.102', '5.5zm11', '4a.75.75', '01.666', '2.844.75.75'
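
Because `text_data.append(tokens)` sits inside the `if random.random() > .99` branch, the corpus that reaches LDA is a small random sample of the file. The sketch below simulates how many lines survive such a filter. The line count and the seed are arbitrary, and `sample_lines` is a hypothetical helper.

```python
import random

def sample_lines(num_lines, threshold=0.99, seed=0):
    """Return indices of lines kept when each is retained with probability 1 - threshold."""
    rng = random.Random(seed)
    return [i for i in range(num_lines) if rng.random() > threshold]

kept = sample_lines(10_000)
print(f"kept {len(kept)} of 10000 lines ({len(kept) / 10_000:.2%})")
```

With so few documents, the topics found later rest on very little data and will change from run to run unless the random seed is fixed.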

**LDA with Gensim**

In [8]:
#First, we create a dictionary from the data, then convert the data to a
#bag-of-words corpus and save the dictionary and corpus for future use.

from gensim import corpora
dictionary = corpora.Dictionary(text_data)

corpus = [dictionary.doc2bow(text) for text in text_data]

import pickle
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')



In [9]:
#We are asking LDA to find 5 topics in the data:

import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')

topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.120*"number" + 0.120*"class="blob" + 0.007*"integrating" + 0.007*"sharing"')
(1, '0.016*"architecture" + 0.016*"networks.</td" + 0.016*"design" + 0.016*"encoder"')
(2, '0.015*"design" + 0.015*"base" + 0.015*".75.75" + 0.015*"011.5"')
(3, '0.026*"5.507" + 0.018*"simulation" + 0.018*"4.001" + 0.010*"modeling"')
(4, '0.265*"class="js" + 0.009*"image" + 0.009*"wireless" + 0.009*"system"')
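
Several of the top terms above (`class="blob`, `class="js`, `.75.75`, `011.5`) are markup or number fragments rather than words. A small parser for the `print_topics` strings makes such terms easy to flag. `parse_topic` and `looks_like_markup` are hypothetical helpers, and the markup test is a rough heuristic.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"term" + weight*"term"' string into (term, weight) pairs."""
    pairs = []
    for part in topic_string.split(" + "):
        weight, term = part.split("*", 1)
        # Strip only the outer quotes; terms such as class="blob contain quotes themselves.
        pairs.append((term[1:-1], float(weight)))
    return pairs

def looks_like_markup(term):
    """Flag terms that contain markup characters or no lowercase letters at all."""
    return bool(re.search(r'[<>="]', term)) or not re.search(r"[a-z]", term)

topic0 = '0.120*"number" + 0.120*"class="blob" + 0.007*"integrating" + 0.007*"sharing"'
for term, weight in parse_topic(topic0):
    print(f"{weight:.3f}  {term!r}  markup={looks_like_markup(term)}")
```

A high share of flagged terms is a hint that the input text needs cleaning before the topics can be interpreted.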




In [10]:
#Let’s try a new document:

new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[]
[(0, 0.2), (1, 0.2), (2, 0.2), (3, 0.2), (4, 0.2)]
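
The empty list printed above indicates that none of the new document's tokens occur in the dictionary, so the model has no word evidence to work with. The uniform 0.2 per topic is then consistent with falling back to a symmetric prior. The sketch below imitates that behaviour in plain Python. `to_bow` is a simplified stand-in for `doc2bow`, the vocabulary and token lists are invented, and the symmetric prior is an assumption rather than something the notebook states.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, silently dropping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def prior_only_topics(num_topics, alpha=None):
    """Topic distribution for a document that contributes no counts: the normalized prior."""
    alpha = alpha if alpha is not None else [1.0 / num_topics] * num_topics
    total = sum(alpha)
    return [(k, a / total) for k, a in enumerate(alpha)]

# Hypothetical vocabulary, loosely echoing terms from the topics above.
token2id = {"wireless": 0, "network": 1, "simulation": 2}

# Illustrative tokens for the new document; none of them is in the vocabulary.
new_doc = ["practical", "bayesian", "optimization", "machine", "learning", "algorithm"]
print(to_bow(new_doc, token2id))                              # []
print(to_bow(["wireless", "network", "wireless"], token2id))  # [(0, 2), (1, 1)]
print(prior_only_topics(5))
```

A result like this says nothing about the new document itself; it only shows that the dictionary does not cover its vocabulary.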


In [11]:
#Now with 3 topics:

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
ldamodel.save('model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.033*"number" + 0.033*"class="blob" + 0.031*"class="js" + 0.016*"architecture"')
(1, '0.046*"class="js" + 0.014*".75.75" + 0.014*"5.507" + 0.010*"efficient"')
(2, '0.087*"class="js" + 0.073*"number" + 0.073*"class="blob" + 0.006*"networks.</td"')




In [12]:
#We can also find 10 topics:

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=15)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.063*"number" + 0.063*"class="blob" + 0.017*"wireless" + 0.017*"system"')
(1, '0.062*"class="js" + 0.022*"modeling" + 0.022*"network" + 0.022*"social"')
(2, '0.037*"number" + 0.037*"class="blob" + 0.019*"apply" + 0.019*"modeling"')
(3, '0.033*"class="blob" + 0.033*"number" + 0.023*"base" + 0.023*"011.5"')
(4, '0.020*"image" + 0.020*"mobile" + 0.020*"method" + 0.020*"devices.</td"')
(5, '0.081*"class="js" + 0.035*"5.507" + 0.024*"4.001" + 0.013*"id="lc985"')
(6, '0.038*"class="blob" + 0.038*"number" + 0.020*"wireless" + 0.020*"networks.</td"')
(7, '0.201*"class="js" + 0.089*"number" + 0.089*"class="blob" + 0.009*"recognition"')
(8, '0.029*"efficient" + 0.029*"access" + 0.029*"information" + 0.029*"index"')
(9, '0.032*"simulation" + 0.032*"pearl" + 0.032*"optical" + 0.032*"phenomenon"')




**pyLDAvis**

In [14]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)

In [15]:
#pyLDAvis helps interpret the topics in a topic model that has been fit to a corpus of text data.

#Visualizing 5 topics:

dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
#pyLDAvis 2.x provides this module as pyLDAvis.gensim; from version 3.3 it is named pyLDAvis.gensim_models.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)



In [16]:
#Visualizing 3 topics:

lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)



In [17]:
#Visualizing 10 topics:

lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)

