# Topic Modelling with Latent Dirichlet Allocation (LDA)

Topic modeling is a type of statistical modeling for discovering the topics that occur in a collection of documents. By doing topic modeling, we build clusters of words rather than clusters of texts.

LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topics.

The generative process of LDA can be described as follows: given M documents, N words, and a number of topics K chosen in advance, the model is trained to output:

- psi, the distribution of words for each topic k

- phi, the distribution of topics for each document i
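
To make the generative story above concrete, here is a minimal sketch that samples toy documents from it using only the standard library. The sizes, the prior values `alpha` and `beta`, the vocabulary size `V`, and the helper `sample_dirichlet` are assumptions for illustration; here `N` is taken to be the number of words per document. This is not how gensim trains a model, it only shows the direction in which LDA assumes the data was generated.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)

# Toy sizes (assumed): K topics, V vocabulary entries, M documents, N words per document.
K, V, M, N = 3, 8, 4, 10
alpha, beta = 0.5, 0.5  # symmetric Dirichlet priors (assumed values)

# psi[k] is the word distribution of topic k; phi[i] is the topic distribution of document i.
psi = [sample_dirichlet(rng, beta, V) for _ in range(K)]
phi = [sample_dirichlet(rng, alpha, K) for _ in range(M)]

docs = []
for i in range(M):
    words = []
    for _ in range(N):
        k = rng.choices(range(K), weights=phi[i])[0]  # pick a topic for this word position
        w = rng.choices(range(V), weights=psi[k])[0]  # pick a word id from that topic
        words.append(w)
    docs.append(words)

print(docs[0])
```

Training LDA runs this story in reverse: it starts from the observed documents and estimates the psi and phi that are most likely to have produced them.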

## Tokenize text for LDA
To use ```spacy.load("en_core_web_sm")```, first download the model:

```python3 -m spacy download en_core_web_sm```

In [1]:
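import spacy
from spacy.lang.en import English

# Loading the full model confirms it is installed; its result is not used below.
spacy.load("en_core_web_sm")
# A blank English pipeline is enough here because only tokenization is needed.
parser = English()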

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        # elif token.like_url:
        #     lda_tokens.append('URL')
        # elif token.orth_.startswith('@'):
        #     lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

## Lemmatization using NLTK WordNet

We use ```WordNetLemmatizer``` to reduce each word to its root form (lemma).

In [2]:
import nltk
import ssl

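# Workaround for SSL certificate errors during nltk.download:
# fall back to an HTTPS context that does not verify certificates.
try: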
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ejayb\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Getting a list of stop words

In [3]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ejayb\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Text pre-processing

The text is pre-processed in the following steps before it is passed to LDA:
1. Find the topic sentences in the text using a custom regular expression
1. Tokenize the text using our tokenizer above
1. Remove tokens of fewer than 5 characters
1. Remove tokens that are stop words
1. Remove tokens that are just numbers
1. Reduce tokens to their base form using our lemmatizer above (this step is not applied in ```prepare_text_for_lda``` below)

In [None]:
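import regex
from nltk.tokenize import sent_tokenize  # needs NLTK's Punkt sentence tokenizer data

def titleEquivelent(titleAbstract):
    # The body of this helper is missing from the notebook; it is left as a stub.
    raise NotImplementedError

#By HJB
def extract_topic_sentences(title, abstract):
    title = title.lower()
    abstract = abstract.lower()
    sentences = sent_tokenize(abstract)
    topic_indicators = ['we show that','this study shows that','the goal of this study is',
                        'we find','this study finds','results:','we report here']
    # Assumed completion (the original cell stops here): keep the sentences
    # that contain at least one indicator phrase.
    return [s for s in sentences if any(ind in s for ind in topic_indicators)]
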

In [8]:
#By HJB
def isNum(s):
    # True for strings made of digits with an optional leading minus and dots, e.g. '42' or '-3.14'
    if s.startswith("-"):
        s = s[1:]
    return s.replace('.','').isnumeric()

def prepare_text_for_lda(text):
    tokens = tokenize(text)
    # Keep only tokens of at least 5 characters
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    #Added by HJB, filter out tokens which are just numbers
    tokens = [token for token in tokens if not(isNum(token))]
    # Lemmatization is not applied here; map get_lemma over the tokens to enable it
    return tokens
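
As a quick sanity check of these filters, the sketch below re-implements them in a self-contained form. The names `prepare_text_sketch` and `toy_stop_words` are hypothetical, `str.split` stands in for the spaCy tokenizer, and the stop-word set is a tiny placeholder rather than the NLTK list, so the results will differ from the real pipeline on punctuation-heavy text.

```python
def is_num(s):
    """True for strings made of digits with an optional leading minus and dots."""
    if s.startswith("-"):
        s = s[1:]
    return s.replace(".", "").isnumeric()

# Placeholder stop-word set (assumption); the notebook uses NLTK's English list.
toy_stop_words = {"about", "there", "which"}

def prepare_text_sketch(text):
    tokens = text.lower().split()                 # stand-in for the spaCy tokenizer
    tokens = [t for t in tokens if len(t) > 4]    # drops tokens of 4 characters or fewer
    tokens = [t for t in tokens if t not in toy_stop_words]
    tokens = [t for t in tokens if not is_num(t)]
    return tokens

result = prepare_text_sketch("The 12345 small IL-18 -3.1415 about pathways")
print(result)
```

Note that `il-18` survives because it contains letters, while `12345` and `-3.1415` are dropped as purely numeric tokens.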


## Processing Pubmed Articles

In [9]:
import random
import pandas as pd

text_data = []
pubmed_dataset = pd.read_csv('Pubmed_Articles.csv', encoding = "ISO-8859-1")
print("Total Articles: ", len(pubmed_dataset))


Total Articles:  300


In [10]:
for index, row in pubmed_dataset.iterrows():
    tokens = prepare_text_for_lda(str(row['Text']))
    if random.random() > .995:
        print(tokens)
    text_data.append(tokens)
print(len(text_data))
print(text_data[1:10])

['biochimie', 'anticarcinogenesis', 'pathways', 'activated', 'bovine', 'lactoferrin', 'murine', 'small', 'intestine', 'administration', 'bovine', 'lactoferrin', 'inhibits', 'carcinogenesis', 'colon', 'organs', 'metastasis', 'likely', 'mechanism', 'mediates', 'anticarcinogenesis', 'effects', 'enhanced', 'expression', 'cytokines', 'subsequent', 'activation', 'immune', 'cells', 'administration', 'enhances', 'expression', 'interleukin-18', 'il-18', 'mucosa', 'small', 'intestine', 'importantly', 'pepsin', 'hydrolysate', 'induced', 'expression', 'il-18', 'mouse', 'small', 'intestine', 'peptide', 'produced', 'pepsin', 'digestion', 'bovine', 'lactoferricin', 'blfcin', 'induced', 'expression', 'mature', 'il-18', 'organ', 'culture', 'addition', 'il-18', 'blfcin', 'induced', 'significant', 'increases', 'caspase-1', 'activity', 'peritoneal', 'macrophages', 'organ', 'cultures', 'increase', 'mature', 'il-18', 'macrophages', 'inhibited', 'caspase-1', 'inhibitor', 'caspase-1', 'known', 'cleave', 'prof

## Creating BOW with Dictionary

```doc2bow``` converts a document (a list of words) into the bag-of-words format: a list of ```(token_id, token_count)``` 2-tuples.
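
To illustrate what the bag-of-words format looks like, here is a minimal pure-Python sketch of the idea. The helpers `build_token2id` and `doc2bow_sketch` are hypothetical names and not part of gensim; the ids that gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, token_count) pairs, ignoring tokens without an id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["bovine", "lactoferrin", "bovine"], ["lactoferrin", "colon"]]
token2id = build_token2id(toy_docs)
bow = doc2bow_sketch(toy_docs[0], token2id)
print(bow)
```

Word order is discarded in this representation; only how often each known token occurs in the document is kept.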

In [7]:
import pickle
from gensim import corpora

dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
print(corpus[0])
print("-----")
print(corpus[1])

# Save the corpus and dictionary so they can be reloaded for the visualization later
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

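If this cell fails with ```ModuleNotFoundError: No module named 'gensim'```, install gensim first (for example with ```pip install gensim```) and re-run it.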

## Training LDA model

In [None]:
import gensim
NUM_TOPICS = 4
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15, random_state=42)
ldamodel.save('model5.gensim')


## Compute Model Perplexity and Coherence Score

In [None]:
from gensim.models.coherencemodel import CoherenceModel

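# Compute the per-word likelihood bound. Gensim's perplexity estimate is 2 ** (-bound),
# so a higher (less negative) bound means a lower perplexity, which is better.
print('\nLog perplexity (per-word bound): ', ldamodel.log_perplexity(corpus))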

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=text_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
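
The value returned by `log_perplexity` is a per-word likelihood bound, not the perplexity itself. A minimal sketch of the conversion, assuming the base-2 relationship that gensim uses for its logged perplexity estimate (the helper name and the bound values below are made up for illustration):

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Made-up bounds for two hypothetical models.
bound_a, bound_b = -8.0, -7.0
perplexity_a = perplexity_from_bound(bound_a)
perplexity_b = perplexity_from_bound(bound_b)
print(perplexity_a, perplexity_b)
```

The model with the higher bound (`bound_b`) ends up with the lower perplexity.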

## Finding Model with Optimal K - Number of Topics

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
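    Compute c_v coherence and log perplexity for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    perplexity_values : Per-word likelihood bounds from log_perplexity for the same models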
    """
    coherence_values = []
    perplexity_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15, random_state=42)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        perplexity = model.log_perplexity(corpus)

        perplexity_values.append(perplexity)
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values, perplexity_values

In [None]:
# Can take a long time to run.
model_list, coherence_values, perplexity_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=text_data, start=2, limit=40, step=2)

## Coherence

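Coherence measures how semantically related the top words of a topic are, based on how often they occur together in the corpus. Higher scores generally indicate topics that are easier to interpret.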
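
As an illustration of the idea, the sketch below computes a simple UMass-style coherence from document co-occurrence counts. This is a different and much simpler measure than the `c_v` score used in this notebook, and the function name and toy documents are assumptions for illustration only.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average of log((D(w_l, w_m) + 1) / D(w_l)) over ordered pairs of top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for w_l, w_m in combinations(top_words, 2):  # w_l is ranked before w_m
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_l, w_m) + 1) / d_l))
    return sum(scores) / len(scores) if scores else 0.0

toy_documents = [
    ["cancer", "tumor", "cell"],
    ["cancer", "tumor"],
    ["protein", "gene"],
    ["gene", "cell", "protein"],
]
coherent_topic = umass_coherence(["cancer", "tumor"], toy_documents)
mixed_topic = umass_coherence(["cancer", "protein"], toy_documents)
print(coherent_topic, mixed_topic)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind coherence scores in general.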

In [None]:
from matplotlib import pyplot as plt

# Show graph
limit, start, step = 40, 2, 2
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")

plt.show()
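
Besides reading the best value off the plot, the number of topics with the highest coherence can be picked programmatically. A minimal sketch, where `best_num_topics` is a hypothetical helper and the scores are made-up placeholders rather than results from this notebook:

```python
def best_num_topics(topic_counts, scores):
    """Return (num_topics, index) for the highest score."""
    if len(topic_counts) != len(scores):
        raise ValueError("topic_counts and scores must have the same length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return topic_counts[best_index], best_index

# Made-up coherence scores for K = 2, 4, 6 (illustration only).
example_counts = [2, 4, 6]
example_scores = [0.31, 0.42, 0.39]
best_k, best_index = best_num_topics(example_counts, example_scores)
print(best_k, best_index)
```

In this notebook the call would be `best_num_topics(list(range(start, limit, step)), coherence_values)`, and `model_list[best_index]` would be the corresponding model. The highest score is only a starting point; the topics should still be inspected by hand.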

## Perplexity

Perplexity is not often used to select an LDA model. As noted in https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0:

"However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated."

In [None]:
# Show graph
limit, start, step = 40, 2, 2
x = range(start, limit, step)
plt.plot(x, perplexity_values)
plt.xlabel("Num Topics")
plt.ylabel("Log perplexity (per-word bound)")

plt.show()

## Top words in each Topic

In [None]:
topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)
    print('----------------------')

In [None]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]  # more colors: 'mcolors.XKCD_COLORS'

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

topics = ldamodel.show_topics(formatted=False)

# 2 x 2 grid: one panel per topic, which matches NUM_TOPICS = 4
fig, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

## Testing on new document

In [None]:
new_doc = 'Increase in efficacy of cancer radiotherapy by combination with whole-body low dose irradiation'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))
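
`get_document_topics` returns a list of `(topic_id, probability)` pairs, and topics whose probability falls below the model's `minimum_probability` are left out. A minimal sketch of picking the most likely topic from such a list; the helper name and the example pairs are made up for illustration and are not output from this model:

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not doc_topics:
        raise ValueError("doc_topics is empty")
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up topic distribution for one document (illustration only).
example_doc_topics = [(0, 0.08), (1, 0.71), (2, 0.13), (3, 0.08)]
topic_id, probability = dominant_topic(example_doc_topics)
print(topic_id, probability)
```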

## Creating Visualization for Topics

In [None]:
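import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # pyLDAvis 3.x
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases

dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

lda_display = gensimvis.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_display, 'lda.html')
# Keep display() as the last expression so the notebook renders it
pyLDAvis.display(lda_display)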