## Building Topic Models with the Gensim Library

In this notebook, we'll see how to fit different types of topic models using the gensim library. We'll be visualizing the results of our Latent Dirichlet Allocation (LDA) model, so we'll need the pyLDAvis library, which can be installed from conda-forge.

In [1]:
#%conda install -c conda-forge pyldavis

In [2]:
import pandas as pd
from tqdm.notebook import tqdm

import gensim

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

For this notebook, we'll be using abstracts from all machine learning papers posted on arxiv.org since the beginning of the year.

In [2]:
papers = pd.read_csv('ml_papers.csv')

In [3]:
papers.head(2)

| Unnamed: 0 | id | title | categories | abstract | doi | created | updated | authors |
|---|---|---|---|---|---|---|---|---|
| 0 | 1107.3689 | edit wars in wikipedia | stat.ml cs.dl physics.data-an physics.soc-ph | we present a new, efficient method for automat... | 10.1109/passat/socialcom.2011.47 | 2011-07-19 | 2012-02-09 | ['róbert sumi', 'taha yasseri', 'andrás rung',... |
| 1 | 1212.1108 | on the convergence properties of optimal adaboost | cs.lg cs.ai stat.ml | adaboost is one of the most popular ml algorit... | | 2012-12-05 | 2023-01-04 | ['joshua belanich', 'luis e. ortiz'] |


You can change the index number to preview some of the paper abstracts.

In [4]:
i = 10

print(f'Title: {papers.loc[i, "title"]}')
print('----------')
print(f'Abstract: {papers.loc[i, "abstract"]}')

Title: collaborative nested sampling: big data vs. complex physical models
----------
Abstract: the data torrent unleashed by current and upcoming astronomical surveys demands scalable analysis methods. many machine learning approaches scale well, but separating the instrument measurement from the physical effects of interest, dealing with variable errors, and deriving parameter uncertainties is often an after-thought. classic forward-folding analyses with markov chain monte carlo or nested sampling enable parameter estimation and model comparison, even for complex and slow-to-evaluate physical models. however, these approaches require independent runs for each data set, implying an unfeasible number of model evaluations in the big data regime. here i present a new algorithm, collaborative nested sampling, for deriving parameter probability distributions for each observation. importantly, the number of physical model evaluations scales sub-linearly with the number of data sets, and no 

Before applying any topic models to these documents, we'll need to prepare them by preprocessing and tokenizing. For this notebook, we'll use the [simple_preprocess](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html) function from the gensim library.

In [5]:
from gensim.utils import simple_preprocess

Use the simple_preprocess function to convert the paper abstracts into a list of lists of tokens named `docs`.

In [None]:
docs = # fill this in

The single tokens that the simple_preprocess function produces can miss important multi-word phrases such as "machine learning" or "convolutional neural network". We can use another tool from gensim, the [Phrases](https://radimrehurek.com/gensim/models/phrases.html) class, to try to uncover such phrases from the text automatically.

In [7]:
from gensim.models import Phrases

To fit this model, we need to pass in our tokenized documents as the `sentences` argument. We can also specify other hyperparameters. Here, we'll set the minimum count to be 25, meaning these phrases must appear at least 25 times.

In [9]:
bigram_finder = Phrases(
    sentences = # Fill This in
    min_count = 25
)

Once the model has been fit, we can apply it to a document by passing the document (as a list of tokens) inside a set of square brackets. Notice that tokens that are not part of a detected phrase are left as they are, while each detected two-word phrase now appears as a single token with the two words joined by an underscore.

In [10]:
i = 10
bigram_finder[docs[i]]

['the',
 'data',
 'torrent',
 'unleashed',
 'by',
 'current',
 'and',
 'upcoming',
 'astronomical',
 'surveys',
 'demands',
 'scalable',
 'analysis',
 'methods',
 'many',
 'machine_learning',
 'approaches',
 'scale',
 'well',
 'but',
 'separating',
 'the',
 'instrument',
 'measurement',
 'from',
 'the',
 'physical',
 'effects',
 'of',
 'interest',
 'dealing',
 'with',
 'variable',
 'errors',
 'and',
 'deriving',
 'parameter',
 'uncertainties',
 'is',
 'often',
 'an',
 'after',
 'thought',
 'classic',
 'forward',
 'folding',
 'analyses',
 'with',
 'markov_chain',
 'monte_carlo',
 'or',
 'nested',
 'sampling',
 'enable',
 'parameter',
 'estimation',
 'and',
 'model',
 'comparison',
 'even',
 'for',
 'complex',
 'and',
 'slow',
 'to',
 'evaluate',
 'physical',
 'models',
 'however',
 'these',
 'approaches',
 'require',
 'independent',
 'runs',
 'for',
 'each',
 'data',
 'set',
 'implying',
 'an',
 'unfeasible',
 'number_of',
 'model',
 'evaluations',
 'in',
 'the',
 'big',
 'data',
 'regi

You can also apply the model across the entire corpus.

In [11]:
bigram_finder[docs]

<gensim.interfaces.TransformedCorpus at 0x7f40704a6550>

The Phrases class will only look for two-word phrases, but what about three-word phrases? To look for these, we can fit another model, this time passing in the result of applying our first model to the documents.

In [None]:
trigram_finder = Phrases(
    sentences = # Fill this in
    min_count = 25
)

Notice how this picks up three-word phrases and even some four-word phrases (such as "markov_chain_monte_carlo"), since two previously detected bigrams can themselves be joined.

In [13]:
i = 10
trigram_finder[bigram_finder[docs[i]]]

['the',
 'data',
 'torrent',
 'unleashed',
 'by',
 'current',
 'and',
 'upcoming',
 'astronomical',
 'surveys',
 'demands',
 'scalable',
 'analysis',
 'methods',
 'many',
 'machine_learning',
 'approaches',
 'scale',
 'well',
 'but',
 'separating',
 'the',
 'instrument',
 'measurement',
 'from',
 'the',
 'physical',
 'effects',
 'of',
 'interest',
 'dealing',
 'with',
 'variable',
 'errors',
 'and',
 'deriving',
 'parameter',
 'uncertainties',
 'is',
 'often',
 'an',
 'after',
 'thought',
 'classic',
 'forward',
 'folding',
 'analyses',
 'with',
 'markov_chain_monte_carlo',
 'or',
 'nested',
 'sampling',
 'enable',
 'parameter',
 'estimation',
 'and',
 'model',
 'comparison',
 'even',
 'for',
 'complex',
 'and',
 'slow',
 'to',
 'evaluate',
 'physical',
 'models',
 'however',
 'these',
 'approaches',
 'require',
 'independent',
 'runs',
 'for',
 'each',
 'data',
 'set',
 'implying',
 'an',
 'unfeasible',
 'number_of',
 'model',
 'evaluations',
 'in',
 'the',
 'big',
 'data',
 'regime',

We'll now replace `docs` with the result of applying both phrase finders to every document.

In [14]:
docs = list(trigram_finder[bigram_finder[docs]])

**Bonus:** Modify your code so that for each document, you are keeping both the original tokens and the multi-word phrases.
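One possible approach to the bonus, sketched on made-up tokens: append every phrase token to the original token list. The helper name `keep_tokens_and_phrases` is hypothetical, and the sketch assumes that none of the original tokens already contains the delimiter. On the real data, this would need to run before `docs` is overwritten.

```python
def keep_tokens_and_phrases(tokens, phrased_tokens, delimiter="_"):
    """Return the original tokens followed by any multi-word phrases.

    Assumes original tokens never contain the delimiter themselves.
    """
    phrases = [tok for tok in phrased_tokens if delimiter in tok]
    return list(tokens) + phrases


# Made-up example: tokens before and after phrase detection.
original = ["markov", "chain", "monte", "carlo", "sampling"]
phrased = ["markov_chain_monte_carlo", "sampling"]

combined = keep_tokens_and_phrases(original, phrased)
print(combined)
# ['markov', 'chain', 'monte', 'carlo', 'sampling', 'markov_chain_monte_carlo']
```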

Now, we need to build a [gensim Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) from our documents. This class builds a token-to-id map.

In [15]:
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

This object can convert from tokens to ids:

In [16]:
dictionary.token2id

{'and': 0,
 'argue_that': 1,
 'automatically': 2,
 'burstiness': 3,
 'conflicts': 4,
 'contentiousness': 5,
 'detecting': 6,
 'deviate': 7,
 'different': 8,
 'discussions': 9,
 'earlier': 10,
 'edit': 11,
 'editing': 12,
 'edits': 13,
 'efficient': 14,
 'estimated': 15,
 'evaluate': 16,
 'following': 17,
 'for': 18,
 'from': 19,
 'general': 20,
 'has': 21,
 'how': 22,
 'in': 23,
 'language': 24,
 'length': 25,
 'method': 26,
 'new': 27,
 'number_of': 28,
 'of': 29,
 'on': 30,
 'over': 31,
 'pages': 32,
 'process': 33,
 'reverts': 34,
 'severe': 35,
 'significantly': 36,
 'six': 37,
 'such': 38,
 'the': 39,
 'this': 40,
 'those': 41,
 'wars': 42,
 'we_discuss': 43,
 'we_present': 44,
 'wikipedia': 45,
 'work': 46,
 'workflow': 47,
 'wps': 48,
 'ability': 49,
 'accurate': 50,
 'actual': 51,
 'adaboost': 52,
 'address': 53,
 'affirmative': 54,
 'algorithm': 55,
 'algorithms': 56,
 'alleviate': 57,
 'almost': 58,
 'always': 59,
 'among': 60,
 'an': 61,
 'analysis': 62,
 'answer': 63,
 'app

To convert from id to token, index the Dictionary object with the id, just as you would look up a key in a Python dictionary.

In [17]:
dictionary[3]

'burstiness'

The Dictionary class has some useful methods. For example, use the [filter_extremes method](https://radimrehurek.com/gensim/corpora/dictionary.html) to remove any tokens that appear in fewer than 20 documents or in more than 50% of documents.

In [None]:
# Your code here

We can convert a document into a bag-of-words representation using the [doc2bow method](https://radimrehurek.com/gensim/corpora/dictionary.html).

In [19]:
dictionary.doc2bow(docs[0])

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 2),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1)]

**Question:** This returns a list of two-element tuples. What is the meaning of the first part of each tuple? What is the meaning of the second part?

Next, convert your documents into a bag-of-words representation and save as an object named `corpus`.

In [21]:
corpus = # fill this in

## Latent Dirichlet Allocation

In [22]:
from gensim.models import LdaModel

You can read more about the Gensim implementation of the LDA model here: https://radimrehurek.com/gensim/models/ldamodel.html

You can leave the parameters as they are set (or experiment and see how the results change).

In [40]:
num_topics = 8            # The number of topics to be extracted
passes = 20               # The number of times to pass through the entire corpus
chunksize = 2000          # The number of documents to be used in a training chunk 
iterations = 400          # The maximum number of iterations through the corpus when inferring the topic distribution

temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token     # We need to give the model the id2token dictionary

model = LdaModel(
    corpus = corpus,
    id2word = id2word,
    num_topics = num_topics,
    passes = passes,
    chunksize = chunksize,
    iterations = iterations,
    alpha='auto',         # Learn an asymmetric prior for document-topic distribution from the corpus
    eta='auto',           # Learn an asymmetric prior for topic-word distribution from the corpus
    eval_every = None,    # Skip periodic perplexity evaluation, which speeds up training
    random_state = 321
)

Once the model has been fit, we can create a visualization of it using the pyLDAvis library.

In [41]:
vis = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(vis, 'lda.html')

Open the `lda.html` file that was created in your web browser and explore the topics that were found.

**Question:** How does the relevance metric change as the parameter lambda goes from 0 to 1?
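As a reference point for this question, the relevance score that pyLDAvis displays follows the definition of Sievert and Shirley (2014). Writing $p(w \mid t)$ for the probability of term $w$ within topic $t$ and $p(w)$ for the overall probability of the term in the corpus, it can be written as:

$$
r(w, t \mid \lambda) = \lambda \, \log p(w \mid t) + (1 - \lambda) \, \log \frac{p(w \mid t)}{p(w)}
$$

Consider which of the two terms dominates at each end of the slider.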

**Question:** Look at the topic labeled as topic 6 in the visualization. What do papers related to this topic seem to be about?

Once our model is fit, we can get the topic distribution for each document. Take a look at the topic distribution for the document with id 100. Does this topic distribution look reasonable, given the visualization?

**Warning:** The pyLDAvis library starts counting topics at 1, whereas the gensim library starts counting at 0, so topic 1 in the HTML document corresponds to topic 0 in gensim.

In [44]:
i = 100

print(f'Abstract: {papers.loc[i, "abstract"]}')
model.get_document_topics(corpus[i])

Abstract: using deep neural networks for identifying physics objects at the large hadron collider (lhc) has become a powerful alternative approach in recent years. after successful training of deep neural networks, examining the trained networks not only helps us understand the behaviour of neural networks, but also helps improve the performance of deep learning models through proper interpretation. we take jet tagging problem at the lhc as an example, using recursive neural networks as a starting point, aim at a thorough understanding of the behaviour of the physics-oriented dnns and the information encoded in the embedding space. we make a comparative study on a series of different jet tagging tasks dominated by different underlying physics. interesting observations on the latent space are obtained.


[(3, 0.1322086), (5, 0.46441787), (7, 0.3985871)]

Now, build a DataFrame which has, for each document, the topic distribution.

In [49]:
# Your code here
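One way to turn sparse `(topic_id, probability)` lists into a table is sketched below on made-up values in the same format as the output above. The `toy_distributions` contents are invented, and topics missing from a list are assumed to have probability 0.

```python
import pandas as pd

num_topics = 8

# Made-up sparse outputs in the (topic_id, probability) format.
toy_distributions = [
    [(3, 0.13), (5, 0.46), (7, 0.40)],
    [(0, 0.90), (2, 0.09)],
]

# Each list becomes a dict, so topic ids turn into columns;
# reindex adds the topics that never appeared, and fillna zeroes them.
topic_df = (
    pd.DataFrame([dict(dist) for dist in toy_distributions])
    .reindex(columns=range(num_topics))
    .fillna(0.0)
)

print(topic_df.shape)  # (2, 8)
```

With a fitted model, calling `get_document_topics` with `minimum_probability=0` should return an entry for every topic, which would make the fill step unnecessary.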

Find the paper with the highest proportion of topic 5. Then look at the abstract of this paper.

In [52]:
# Your code here

**Challenge Question:** Pick two topics and find a paper which is made up of about 50% of each of those topics. Hint: You could use the cosine similarity to find such a paper.

In [53]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [60]:
# Your code here
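The hint can be sketched as follows on a made-up document-topic matrix: build an "ideal" vector with 0.5 in the two chosen topics, then look for the row with the highest cosine similarity to it. The matrix values are invented for illustration, and the similarity is computed with plain NumPy here rather than with scikit-learn.

```python
import numpy as np

# Made-up document-topic matrix: one row per document, one column per topic.
doc_topics = np.array([
    [0.90, 0.05, 0.05, 0.00],
    [0.45, 0.05, 0.50, 0.00],
    [0.10, 0.10, 0.10, 0.70],
])

# Ideal document: half topic 0, half topic 2.
target = np.zeros(doc_topics.shape[1])
target[[0, 2]] = 0.5

# Cosine similarity between every row and the target vector.
similarities = (doc_topics @ target) / (
    np.linalg.norm(doc_topics, axis=1) * np.linalg.norm(target)
)

best = int(np.argmax(similarities))
print(best)  # 1
```

Passing the matrix and `target.reshape(1, -1)` to `cosine_similarity` from scikit-learn should give the same similarities as a column vector.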