<a href="http://colab.research.google.com/github/dipanjanS/nlp_workshop_odsc19/blob/master/Module05%20-%20NLP%20Applications/Project04%20-%20Topic%20Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling on Research Papers

We will do an interesting exercise here: build topic models on past research papers
from the very popular NIPS conference (now known as the NeurIPS conference). The
late professor Sam Roweis compiled an excellent collection of NIPS Conference Papers
from Volumes 1–12, which you can find at https://cs.nyu.edu/~roweis/data.html.
An interesting fact is that he obtained this by massaging the OCR'd data from NIPS
1–12, which was the pre-electronic submission era. Yann LeCun made the data
available. A more recent dataset covering papers up to NIPS 17 is available at
http://ai.stanford.edu/~gal/data.html. However, that dataset is in the form of a MAT
file, so you might need to do some additional preprocessing before working on it in Python.


# The Main Objective

Considering our discussion so far, our main objective is pretty simple. Given a whole
bunch of conference research papers, can we identify some key themes or topics from
these papers by leveraging unsupervised learning? We do not have the liberty of labeled
categories telling us what the major themes of every research paper are. Besides that, we
are dealing with text data extracted using OCR (optical character recognition). Hence,
you can expect misspelled words, words with characters missing, and so on, which
makes our problem even more challenging.

# Download Data and Dependencies

In [1]:
# !pip install -qU pip wheel
# !pip install -qU gensim tqdm pandas nltk numpy

In [2]:
import gensim
gensim.__version__

'4.3.2'

In [3]:
import nltk
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import gensim

In [4]:
# nltk.download('stopwords')

In [5]:
# corpus
# Mount google drive and set path to corpus
path_corpus=os.path.expanduser('~/ppa_data/corpus_solr')
path_metadata = os.path.join(path_corpus, 'metadata.csv')
path_texts = os.path.join(path_corpus, 'texts')

In [6]:
# Read metadata
df_metadata = pd.read_csv(path_metadata).fillna('')
# df_metadata

In [7]:
# Read jsons
import gzip,json

def get_work_json(work_id):
    fn=os.path.join(path_texts, work_id+'.json.gz')
    try:
        with gzip.open(fn,mode='rt') as f:
            return json.load(f)
    except FileNotFoundError:
        pass
    return []

In [8]:
# os.listdir(path_texts)

In [9]:
# get_work_json('CW0115427928')

In [10]:
def get_work_tokens(work_id):
    work_tokens = []
    for paged in get_work_json(work_id):
        page_id = paged['page_id']

        #@TODO: FIX
        if not page_id:
            page_id = f'{paged["work_source"]}_{paged["page_orig"]}'
        tokens = [tok.lower() for tok in paged['page_tokens']]
        work_tokens.append((page_id, tokens))
    return work_tokens

In [11]:
# get_work_tokens('CW0115427928')[0]

In [12]:
import nltk
from nltk.corpus import stopwords 
stopwords = set(stopwords.words('english')) | {'one','may'}
# stopwords

In [13]:
def iter_id_tokens(min_toks=50):
    for i,work_id in enumerate(tqdm(df_metadata.work_id, position=0)):
        # if i>10: break
        for id,toks in get_work_tokens(work_id):
            toks = [''.join(x for x in tok if x.isalpha()) for tok in toks]
            toks = [tok for tok in toks if len(tok)>2 and tok not in stopwords]
            if len(toks)>=min_toks:
                yield (id,toks)

def iter_tokens():
    for id,toks in iter_id_tokens():
        yield toks

In [14]:
# iter = iter_tokens()
# next(iter) 

# Transforming corpus into bag of words vectors

We can now perform feature engineering by leveraging a simple Bag of Words
model.
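
As a rough illustration of what a Bag of Words vector is, the sketch below builds `(token_id, count)` pairs using only the standard library. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical and are not part of this notebook; gensim's `Dictionary.doc2bow` produces the same kind of pairs, though it may assign ids differently.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Toy tokenized documents (hypothetical data, for illustration only)
toy_docs = [
    ["language", "words", "english", "words"],
    ["latin", "language", "french"],
]
toy_vocab = build_vocab(toy_docs)
toy_bows = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bows)
```

Each document is reduced to term counts, so word order is discarded; that is the trade-off a Bag of Words model makes.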

In [38]:
fn='data.gensim.dictionary.pkl'
if not os.path.exists(fn):
    dictionary = gensim.corpora.Dictionary(iter_tokens())
    dictionary.save(fn)
else:
    dictionary = gensim.corpora.Dictionary.load(fn)
len(dictionary)

2019209

In [47]:
dictionary.filter_extremes(keep_n=50000)
len(dictionary)

50000

In [48]:
# Transforming corpus into bag of words vectors
def iter_corpus():
    for i,page_toks in enumerate(iter_tokens()):
        yield dictionary.doc2bow(page_toks)

# next(iter_corpus())

In [49]:
# Transforming corpus into bag of words vectors
fn='data.gensim.corpus.mm'
gensim.corpora.MmCorpus.serialize(
    fn,
    iter_corpus(),
    id2word=dictionary,
)

In [None]:
mm = gensim.corpora.MmCorpus(fn)

[(0, 1.0), (1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0), (9, 1.0), (10, 1.0), (11, 1.0), (12, 1.0), (13, 1.0), (14, 1.0), (15, 1.0), (16, 1.0), (17, 1.0), (18, 1.0), (19, 1.0), (20, 1.0), (21, 1.0), (22, 1.0), (23, 1.0), (24, 1.0), (25, 1.0), (26, 1.0), (27, 1.0), (28, 2.0), (29, 1.0), (30, 1.0), (31, 1.0), (32, 1.0), (33, 1.0), (34, 1.0), (35, 1.0), (36, 1.0), (37, 1.0), (38, 1.0), (39, 1.0), (40, 1.0), (41, 1.0), (42, 1.0), (43, 1.0), (44, 1.0), (45, 1.0), (46, 1.0), (47, 1.0), (48, 1.0), (49, 1.0), (50, 1.0), (51, 1.0), (52, 5.0), (53, 1.0), (54, 1.0), (55, 1.0), (56, 1.0), (57, 1.0), (58, 1.0), (59, 1.0), (60, 1.0), (61, 1.0), (62, 1.0), (63, 1.0), (64, 1.0), (65, 1.0), (66, 1.0), (67, 3.0), (68, 1.0), (69, 1.0), (70, 1.0), (71, 1.0), (72, 1.0), (73, 1.0), (74, 1.0), (75, 1.0), (76, 1.0), (77, 1.0), (78, 1.0), (79, 1.0)]


In [45]:
TOTAL_TOPICS = 25

In [46]:
%%time

lda_model = gensim.models.ldamulticore.LdaMulticore(
    corpus=mm, 
    id2word=dictionary,
    num_topics=TOTAL_TOPICS,
    # chunksize=1740,
    # alpha='auto', 
    # eta='auto', 
    # random_state=42,
    # iterations=500, 
    # passes=20, 
    # eval_every=None
)

CPU times: user 1.76 s, sys: 861 ms, total: 2.63 s
Wall time: 20.4 s


In [47]:
# `mm` is the serialized bag-of-words corpus the model was trained on
topics_coherences = lda_model.top_topics(mm, topn=20)
avg_coherence_score = np.mean([item[1] for item in topics_coherences])
print('Avg. Coherence Score:', avg_coherence_score)

Avg. Coherence Score: -1.5134945105345707


Topic coherence is a complex subject in its own right, and it can be used to measure
the quality of topic models to some extent. Typically, a set of statements is said to be
coherent if they support each other. Topic models are unsupervised models trained on
unstructured text data, which makes it difficult to measure the quality of their
outputs.

Refer to Text Analytics with Python 2nd Edition for more detail on this.
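
To make the idea of coherence more concrete, here is a simplified sketch of a UMass-style score computed by hand: for each pair of top terms, it takes the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked term, then averages over pairs. The function name `umass_coherence`, the smoothing constant, and the toy corpus are assumptions made for illustration; gensim's coherence pipeline differs in its details.

```python
import math
from itertools import combinations

def umass_coherence(top_terms, docs, eps=1.0):
    """Average log((D(hi, lo) + eps) / D(hi)) over ranked term pairs.

    top_terms must be ordered from most to least probable, and every
    term is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*terms):
        # Number of documents containing all of the given terms
        return sum(all(t in ds for t in terms) for ds in doc_sets)

    scores = [
        math.log((doc_freq(hi, lo) + eps) / doc_freq(hi))
        for hi, lo in combinations(top_terms, 2)  # hi is ranked above lo
    ]
    return sum(scores) / len(scores)

# Toy corpus (hypothetical data, for illustration only)
toy_docs = [
    ["language", "words", "english"],
    ["language", "words"],
    ["love", "thou"],
    ["love", "thou", "words"],
]
coherent = umass_coherence(["language", "words"], toy_docs)
mixed = umass_coherence(["language", "love"], toy_docs)
print(coherent, mixed)
```

Terms that tend to appear in the same documents score higher than terms that never co-occur, which is the intuition behind using coherence as a quality signal.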

In [48]:
topics_with_wts = [item[0] for item in topics_coherences]

In [49]:
print('LDA Topics without Weights')
print('='*50)
for idx, topic in enumerate(topics_with_wts):
    print('Topic #'+str(idx+1)+': '+', '.join([term for wt, term in topic]))
    print()

LDA Topics without Weights
Topic #1: english, french, first, old, time, great, much, words, latin, man, men, language, like, would, also, two, every, yet, many, must

Topic #2: language, words, thought, word, english, would, first, time, great, must, much, upon, many, two, people, see, among, like, man, even

Topic #3: would, words, many, language, time, two, latin, great, like, much, sound, man, also, every, people, first, french, even, thus, made

Topic #4: english, language, would, words, also, first, time, name, many, french, upon, word, two, well, even, found, see, lines, though, made

Topic #5: english, language, man, first, french, every, saxon, name, even, great, words, much, time, like, many, england, word, upon, latin, following

Topic #6: thou, loved, love, men, would, first, language, man, upon, shall, words, like, great, thus, two, thought, see, many, time, might

Topic #7: words, english, language, upon, every, even, great, two, many, word, latin, well, french, must, woul

## Evaluating topic model quality

We can use perplexity and coherence scores to evaluate the topic model. Typically,
the lower the perplexity, the better the model. For coherence, higher is better for
both measures: a higher (less negative) UMass score and a higher Cv score indicate a
more coherent model.
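
One detail worth keeping in mind when reading the perplexity figure: to the best of our understanding, gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and the perplexity estimate is obtained as 2 raised to the negative bound. The helper below is a hypothetical convenience function sketching that conversion under this assumption.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity estimate.

    Assumes perplexity = 2 ** (-bound), so a higher (less negative)
    bound corresponds to a lower perplexity.
    """
    return 2.0 ** (-per_word_bound)

# Illustrative bounds only, not results from this notebook
print(perplexity_from_bound(-8.0))
print(perplexity_from_bound(-7.0))
```

Under this reading, a model with a bound of -7 would be preferred over one with a bound of -8, because its implied perplexity is lower.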

In [None]:
# Cv coherence uses a sliding window over the tokenized texts
cv_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=mm,
                                                      texts=iter_tokens(),
                                                      dictionary=dictionary,
                                                      coherence='c_v')
avg_coherence_cv = cv_coherence_model_lda.get_coherence()

# UMass coherence is based on document co-occurrence counts in the bag-of-words corpus
umass_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=mm,
                                                         texts=iter_tokens(),
                                                         dictionary=dictionary,
                                                         coherence='u_mass')
avg_coherence_umass = umass_coherence_model_lda.get_coherence()

# `mm` is the serialized bag-of-words corpus the model was trained on
perplexity = lda_model.log_perplexity(mm)

print('Avg. Coherence Score (Cv):', avg_coherence_cv)
print('Avg. Coherence Score (UMass):', avg_coherence_umass)
print('Model Perplexity:', perplexity)

# LDA Tuning: Finding the optimal number of topics

Finding the optimal number of topics in a topic model is tough, given that it is like a
model hyperparameter that you always have to set before training the model. We can
use an iterative approach and build several models with differing numbers of topics and
select the one that has the highest coherence score.
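
The selection step itself is simple once the scores are available. The sketch below picks the topic count with the highest coherence from two parallel lists; the function name `best_topic_count` and the scores are made up for illustration and are not results from this notebook.

```python
def best_topic_count(topic_counts, scores):
    """Return (topic_count, score) for the highest coherence score."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return topic_counts[best_idx], scores[best_idx]

# Hypothetical coherence scores for models with 2 to 6 topics
toy_counts = list(range(2, 7))
toy_scores = [0.41, 0.47, 0.52, 0.50, 0.49]
best = best_topic_count(toy_counts, toy_scores)
print(best)
```

In practice the curve is often flat near its maximum, so it is reasonable to weigh interpretability alongside the raw score when choosing among close candidates.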

In [None]:
def topic_model_coherence_generator(corpus, texts, dictionary,
                                    start_topic_count=2, end_topic_count=10, step=1,
                                    cpus=1):
    # `texts` must be re-iterable (e.g., a list), since it is scanned once per model

    models = []
    coherence_scores = []
    # tqdm was imported with `from tqdm import tqdm`, so it is called directly
    for topic_nums in tqdm(range(start_topic_count, end_topic_count+1, step)):
        # gensim.models.wrappers (including LdaMallet) was removed in gensim 4.x,
        # so gensim's own LdaMulticore is used here instead of MALLET
        lda_model = gensim.models.LdaMulticore(corpus=corpus, num_topics=topic_nums,
                                               id2word=dictionary, iterations=500,
                                               workers=cpus)
        cv_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=corpus,
                                                              texts=texts, dictionary=dictionary,
                                                              coherence='c_v')
        coherence_score = cv_coherence_model_lda.get_coherence()
        coherence_scores.append(coherence_score)
        models.append(lda_model)

    return models, coherence_scores

In [None]:
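# Materialize the tokenized pages so they can be re-read for every candidate model
# (this can use a lot of memory on a large corpus)
lda_models, coherence_scores = topic_model_coherence_generator(corpus=mm, texts=list(iter_tokens()),
                                                               dictionary=dictionary, start_topic_count=2,
                                                               end_topic_count=30, step=1, cpus=4)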

In [None]:
coherence_df = pd.DataFrame({'Number of Topics': range(2, 31, 1),
                             'Coherence Score': np.round(coherence_scores, 4)})
coherence_df.sort_values(by=['Coherence Score'], ascending=False).head(10)

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

x_ax = range(2, 31, 1)
y_ax = coherence_scores
plt.figure(figsize=(12, 6))
plt.plot(x_ax, y_ax, c='r')
plt.axhline(y=0.535, c='k', linestyle='--', linewidth=2)
plt.rcParams['figure.facecolor'] = 'white'
xl = plt.xlabel('Number of Topics')
yl = plt.ylabel('Coherence Score')

We choose the optimal number of topics as 20, based on our intuition. We can now retrieve the best model.

In [None]:
best_model_idx = coherence_df[coherence_df['Number of Topics'] == 20].index[0]
best_lda_model = lda_models[best_model_idx]
best_lda_model.num_topics

In [None]:
topics = [[(term, round(wt, 3))
               for term, wt in best_lda_model.show_topic(n, topn=20)]
                   for n in range(0, best_lda_model.num_topics)]

for idx, topic in enumerate(topics):
    print('Topic #'+str(idx+1)+':')
    print([term for term, wt in topic])
    print()

# Viewing LDA Model topics

In [None]:
topics_df = pd.DataFrame([[term for term, wt in topic]
                              for topic in topics],
                         columns = ['Term'+str(i) for i in range(1, 21)],
                         index=['Topic '+str(t) for t in range(1, best_lda_model.num_topics+1)]).T
topics_df

In [None]:
# None disables truncation; passing -1 is no longer accepted by recent pandas versions
pd.set_option('display.max_colwidth', None)
topics_df = pd.DataFrame([', '.join([term for term, wt in topic])
                              for topic in topics],
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, best_lda_model.num_topics+1)]
                         )
topics_df

# Interpreting Topic Model Results

An interesting point to remember is, given a corpus of documents (in the form of
features, e.g., Bag of Words) and a trained topic model, you can predict the distribution of
topics in each document (research paper in this case).

We can now get the most dominant topic per research paper with some intelligent
sorting and indexing.
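
As a small illustration of that step, the sketch below takes per-document topic distributions in the form of `(topic_id, probability)` pairs, which is the shape a gensim LDA model returns for a bag-of-words document, and keeps the pair with the largest probability. The function name `dominant_topic` and the toy distributions are hypothetical.

```python
def dominant_topic(topic_dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_dist, key=lambda pair: pair[1])

# Hypothetical topic distributions for two documents
toy_results = [
    [(0, 0.10), (3, 0.70), (5, 0.20)],
    [(1, 0.55), (2, 0.45)],
]
dominant = [dominant_topic(dist) for dist in toy_results]
print(dominant)
```

Using `max` avoids sorting the full distribution when only the top entry is needed.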

In [None]:
# `mm` is the serialized bag-of-words corpus built earlier
tm_results = best_lda_model[mm]

In [None]:
corpus_topics = [sorted(topics, key=lambda record: -record[1])[0]
                     for topics in tm_results]
corpus_topics[:5]

In [None]:
corpus_topic_df = pd.DataFrame()
corpus_topic_df['Document'] = range(0, len(papers))
corpus_topic_df['Dominant Topic'] = [item[0]+1 for item in corpus_topics]
corpus_topic_df['Contribution %'] = [round(item[1]*100, 2) for item in corpus_topics]
corpus_topic_df['Topic Desc'] = [topics_df.iloc[t[0]]['Terms per Topic'] for t in corpus_topics]
corpus_topic_df['Paper'] = papers

# Dominant Topics Distribution Across Corpus

The first thing we can do is look at the overall distribution of each topic across the corpus
of research papers. Mainly we want to determine the total number of papers and the
total percentage of papers where each of the 20 topics was the most dominant.
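
The counting involved can be sketched without pandas: given the dominant topic of every document, tally how often each topic wins and convert the tallies to percentages. The function name `dominant_topic_stats` and the toy labels are assumptions for illustration only.

```python
from collections import Counter

def dominant_topic_stats(dominant_topics):
    """Map each topic to (document count, percentage of all documents)."""
    total = len(dominant_topics)
    counts = Counter(dominant_topics)
    return {topic: (n, round(100 * n / total, 2))
            for topic, n in sorted(counts.items())}

# Hypothetical dominant-topic labels for eight documents
toy_dominant = [1, 3, 3, 2, 3, 1, 2, 3]
stats = dominant_topic_stats(toy_dominant)
print(stats)
```

Topics that are never dominant simply do not appear in the result, which is worth remembering when aligning these statistics with a table of topic descriptions.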

In [None]:
pd.set_option('display.max_colwidth', 200)
# Nested-dict aggregation is no longer supported by pandas, so count rows per topic directly
topic_stats_df = (corpus_topic_df.groupby('Dominant Topic')
                                 .size()
                                 .reset_index(name='Doc Count'))
topic_stats_df['% Total Docs'] = (topic_stats_df['Doc Count'] * 100 / len(corpus_topic_df)).round(2)
# Look up descriptions by topic number, so topics that are never dominant do not shift the rows
topic_stats_df['Topic Desc'] = [topics_df.iloc[t - 1]['Terms per Topic']
                                    for t in topic_stats_df['Dominant Topic']]
topic_stats_df

# Dominant Topics in Specific Research Papers

Another interesting perspective is to select specific papers, view the most dominant topic
in each of those papers, and see if that makes sense.

In [None]:
corpus_topic_df.groupby('Dominant Topic').apply(lambda topic_set: (topic_set.sort_values(by=['Contribution %'],
                                                                                         ascending=False)
                                                                             .iloc[0]))

In [None]:
sample_paper_patterns = ['Feudal Reinforcement Learning \nPeter', 'Illumination-Invariant Face Recognition with a', 'Improved Hidden Markov Model Speech Recognition']
sample_paper_idxs = [idx for pattern in sample_paper_patterns
                            for idx, content in enumerate(papers)
                                if pattern in content]
sample_paper_idxs

In [None]:
pd.set_option('display.max_colwidth', 200)
(corpus_topic_df[corpus_topic_df['Document']
                 .isin(sample_paper_idxs)])