In this Notebook, we will learn how to identity which topic is discussed in a document, called topic modelling. In particular, we will cover Latent Dirichlet Allocation (LDA): a widely used topic modelling technique. And we will apply LDA to convert set of research papers to a set of topics.

Research paper topic modelling is an unsupervised machine learning method that helps us discover hidden semantic structures in a paper, that allows us to learn topic representations of papers in a corpus. The model can be applied to any kinds of labels on documents, such as tags on posts on the website.

**The Process**



*  We pick the number of topics ahead of time even if we’re not sure what the topics are.
*   Each document is represented as a distribution over topics.
*   Each topic is represented as a distribution over words.




















In [0]:
# Text Cleaning

import spacy
spacy.load("en")
from spacy.lang.en import English
parser=English()

def tokenize(text):
  lda_tokens=[]
  tokens=parser(text)
  for token in tokens:
    if token.orth_.isspace():
      continue
    elif token.like_url:
      lda_tokens.append("URL")
    elif token.orth_.startswith("@"):
      lda_tokens.append("SCREEN_NAME")
    else:
      lda_tokens.append(token.lower_)
  return lda_tokens

We use NLTK’s Wordnet to find the meanings of words, synonyms, antonyms, and more. In addition, we use WordNetLemmatizer to get the root word.

In [0]:
import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def get_lemma(word):
  lemma=wn.morphy(word)
  if lemma is None:
    return word
  else:
    return lemma
  
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):
  return WordNetLemmatizer.lemmatize(word)

In [0]:
#filtering the stop words

nltk.download("stopwords")
en_stop=set(nltk.corpus.stopwords.words("english"))


In [0]:
#function to prepare the text for topic modelling:

def preprocess_text(text):
  tokens=tokenize(text)
  tokens=[token for token in tokens if len(token)>4]
  token=[token for token in tokens if token not in en_stop]
  token=[get_lemma(token) for token in tokens]
  return tokens

In [8]:
import random 
text_data=[]
with open("topic_modeling.csv") as f:
  for line in f:
    tokens=preprocess_text(line)
    if random.random()> .99:
      print(tokens)
      text_data.append(tokens)
  

['memory', 'efficient', 'continuous', 'processor', 'wimax', 'application']
['gesture', 'world', 'technology', 'estimation', 'system', 'unspecified', 'users', 'using', 'compact', 'speed', 'camera']
['improved', 'decoder', 'extra', 'error', 'compensation']
['utility', 'tweeted', 'search']
['tuning', 'elliptic', 'filters', 'tuning', 'biquad']
['logical', 'design', 'deductive', 'natural', 'language', 'consultable', 'bases']
['augmented', 'reality', 'jockey']
['sphere', 'unfolding', 'method', 'single', 'directional', 'shadow', 'mapping']
['quiet', 'continuous', 'query', 'driven', 'index', 'tuning']
['learning', 'classify', 'human', 'object', 'sketches']
['7-decades', 'tunable', 'translinear', 'bicmos', '3-phase', 'sinusoidal', 'oscillator']
['cache', 'cache']
['external', 'schema', 'codasyl']
['approximate', 'queries', 'streams', 'guaranteed', 'error', 'performance', 'bounds']
['gimme', 'context', 'context', 'driven', 'automatic', 'semantic', 'annotation', 'pankow']
['animated', 'series']
[

LDA with Gensim

creating a dictionary from the data, then convert to bag-of-words corpus and save the dictionary and corpus for future use.

In [0]:
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]

import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')

In [0]:
import gensim
n_topics=5
lda=gensim.models.ldamodel.LdaModel(corpus,num_topics=n_topics,id2word=dictionary,passes=15)
lda.save("model1.gensim")
topics=lda.print_topics(num_words=4)


In [22]:
for topic in topics:
  print(topic)

(0, '0.047*"tuning" + 0.026*"based" + 0.026*"coding" + 0.026*"optimization"')
(1, '0.057*"context" + 0.057*"cache" + 0.031*"driven" + 0.031*"annotation"')
(2, '0.025*"error" + 0.025*"unfolding" + 0.025*"3-phase" + 0.025*"directional"')
(3, '0.044*"design" + 0.024*"based" + 0.024*"112-gbit" + 0.024*"optical"')
(4, '0.033*"continuous" + 0.033*"system" + 0.018*"speed" + 0.018*"using"')


Topic 0 includes words like “tunning”, “based”, “coding” and “optimization”, it is definite a machine learning related topic

Topic 3 includes words like “computer”, “design”, “graphics” and “gallery”, it is definite a graphic design related topic

Let’s try a new document:

In [26]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = preprocess_text(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(lda.get_document_topics(new_doc_bow))

[(52, 1), (109, 1)]
[(0, 0.7332829), (1, 0.0666826), (2, 0.06667956), (3, 0.06667904), (4, 0.066675864)]


Now we are asking LDA to find 3 topics in the data:

In [31]:
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
lda.save('model3.gensim')
topics = lda.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.031*"tuning" + 0.031*"cache" + 0.018*"continuous" + 0.018*"error"')
(1, '0.028*"design" + 0.016*"system" + 0.016*"unspecified" + 0.016*"camera"')
(2, '0.026*"based" + 0.026*"context" + 0.015*"driven" + 0.015*"error"')


For top 10 topics

In [32]:
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=15)
lda.save('model10.gensim')
topics = lda.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.030*"7-decades" + 0.030*"receivers" + 0.030*"study" + 0.030*"search"')
(1, '0.106*"cache" + 0.055*"human" + 0.055*"object" + 0.055*"learning"')
(2, '0.054*"based" + 0.028*"users" + 0.028*"world" + 0.028*"compact"')
(3, '0.064*"design" + 0.033*"deductive" + 0.033*"logical" + 0.033*"reality"')
(4, '0.096*"tuning" + 0.050*"elliptic" + 0.050*"filters" + 0.050*"biquad"')
(5, '0.061*"continuous" + 0.061*"stability" + 0.061*"modulators" + 0.061*"modeling"')
(6, '0.061*"continuous" + 0.061*"driven" + 0.061*"query" + 0.061*"quiet"')
(7, '0.084*"context" + 0.044*"system" + 0.044*"annotation" + 0.044*"displays"')
(8, '0.065*"error" + 0.065*"decoder" + 0.065*"extra" + 0.065*"improved"')
(9, '0.050*"shadow" + 0.050*"directional" + 0.050*"mapping" + 0.050*"external"')


pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [0]:
!pip install pyLDAvis
# Visualizing 5 topics:

dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model1.gensim')

import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

Saliency: a measure of how much the term tells you about the topic.

Relevance: a weighted average of the probability of the word given the topic and the word given the topic normalized by the probability of the topic.

The size of the bubble measures the importance of the topics, relative to the data.

In [36]:
#Visualizing 3 topics:
lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


In [37]:
#Visualizing 10 topics:
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))
