## Building Topic Models with the Gensim Library

In this notebook, we'll see how to fit different types of topic models using the gensim library. We'll be visualizing the results of our Latent Dirichlet Allocation (LDA) model, so we'll need to install the pyLDAvis library, which we can do from conda-forge.

In [None]:
#%conda install -c conda-forge pyldavis

In [None]:
import pandas as pd
from tqdm.notebook import tqdm

import gensim

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

For this notebook, we'll be using abstracts from all machine learning papers posted on arxiv.org since the beginning of the year.

In [None]:
papers = pd.read_csv('ml_papers.csv')

In [None]:
papers.head(2)

You can change the index number to preview some of the paper abstracts.

In [None]:
i = 10

print(f'Title: {papers.loc[i, "title"]}')
print('----------')
print(f'Abstract: {papers.loc[i, "abstract"]}')

Before applying any models to these documents, we'll need to prepare them by preprocessing and tokenizing. For this notebook, we'll use the [simple_preprocess](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html) function from the gensim library.

In [None]:
from gensim.utils import simple_preprocess
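
For intuition about what this preprocessing does, here is a rough plain-Python approximation. It assumes the function's default settings (lowercasing, and keeping alphabetic tokens between 2 and 15 characters long). The name `rough_preprocess` is made up for illustration, and this is not gensim's actual implementation.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, extract alphabetic tokens, filter by length."""
    # [^\W\d]+ matches runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("We propose a GNN-based model in 2023."))
# ['we', 'propose', 'gnn', 'based', 'model', 'in']
```

Notice that punctuation, digits, and one-letter tokens such as "a" are dropped.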

Use the simple_preprocess function to convert the paper abstracts into a list of lists of tokens named `docs`.

In [None]:
######### REMOVE
docs = list(map(simple_preprocess, papers['abstract']))

In [None]:
docs = # fill this in

The single tokens that the simple_preprocess function produces can miss important multi-word phrases such as "machine learning" or "convolutional neural network". We can use another tool from gensim, the [Phrases](https://radimrehurek.com/gensim/models/phrases.html) class, to try to uncover such phrases automatically from the text.

In [None]:
from gensim.models import Phrases

To fit this model, we need to pass in our tokenized documents as the `sentences` argument. We can also specify other hyperparameters. Here, we'll set the minimum count to 25, meaning that words and candidate phrases that appear fewer than 25 times in the corpus are ignored.

In [None]:
######### REMOVE
bigram_finder = Phrases(sentences = docs, min_count = 25)

In [None]:
bigram_finder = Phrases(
    sentences = # Fill This in
    min_count = 25
)
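
To see how `min_count` enters the picture, here is a sketch of the default scoring formula described in the Phrases documentation. A pair of adjacent words is joined when its score exceeds a threshold (10.0 by default). The function name and all of the counts below are hypothetical and chosen only for illustration.

```python
def default_phrase_score(count_a, count_b, bigram_count, vocab_size, min_count):
    """Score a candidate phrase "a b": high when the pair co-occurs far more than chance suggests."""
    return (bigram_count - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: two moderately rare words that usually appear together
strong = default_phrase_score(count_a=200, count_b=150, bigram_count=100,
                              vocab_size=10_000, min_count=25)

# Hypothetical counts: two very common words that are only occasionally adjacent
weak = default_phrase_score(count_a=5000, count_b=4000, bigram_count=30,
                            vocab_size=10_000, min_count=25)

print(strong > 10.0, weak > 10.0)  # True False
```

Subtracting `min_count` means that a pair seen no more than `min_count` times can never score above zero.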

Once the model has been fit, we can apply it to a document by indexing the model with the document (as a list of tokens) in square brackets. Notice that tokens that are not part of a detected phrase are left unchanged, while each detected two-word phrase now appears as a single token with the two words joined by an underscore.

In [None]:
i = 10
bigram_finder[docs[i]]

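You can also apply the model across the entire corpus. The result is a lazy iterable rather than a list, so wrap it in `list()` if you want to materialize it.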

In [None]:
bigram_finder[docs]

The Phrases class will only look for two-word phrases, but what about three-word phrases? To look for these, we can fit another model but this time pass in the result of our first model.

In [None]:
trigram_finder = Phrases(
    sentences = bigram_finder[docs],
    min_count = 25
)

In [None]:
trigram_finder = Phrases(
    sentences = # Fill this in
    min_count = 25
)

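Notice how this picks up three-word phrases and even some four-word phrases ("markov_chain_monte_carlo"). Four-word phrases are possible because the second model can join two tokens that were each already merged by the first model.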

In [None]:
i = 10
trigram_finder[bigram_finder[docs[i]]]
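
The way two passes build up longer phrases can be sketched in plain Python. Here, hand-picked sets of token pairs stand in for the fitted models. The helper `merge_phrases` and both sets are hypothetical, and the real Phrases class decides which pairs to join from corpus statistics.

```python
def merge_phrases(tokens, phrases):
    """Greedily join adjacent token pairs found in `phrases` with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

first_pass = {("markov", "chain"), ("monte", "carlo")}   # stand-in for bigram_finder
second_pass = {("markov_chain", "monte_carlo")}          # stand-in for trigram_finder

tokens = ["we", "use", "markov", "chain", "monte", "carlo", "sampling"]
after_first = merge_phrases(tokens, first_pass)
after_second = merge_phrases(after_first, second_pass)

print(after_first)   # ['we', 'use', 'markov_chain', 'monte_carlo', 'sampling']
print(after_second)  # ['we', 'use', 'markov_chain_monte_carlo', 'sampling']
```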

We'll now replace `docs` with the result of applying both phrase finders to the entire corpus.

In [None]:
docs = list(trigram_finder[bigram_finder[docs]])

**Bonus:** Modify your code so that for each document, you are keeping both the original tokens and the multi-word phrases.

In [None]:
#### REMOVE
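def add_bigrams_and_trigrams(document):
    doc_bigrams = bigram_finder[document]
    doc_trigrams = trigram_finder[bigram_finder[document]]
    # Keep the original tokens and append any two-word phrases (tokens containing an underscore)
    new = document + list(filter(lambda x: '_' in x, doc_bigrams))
    # Append three-word phrases only (tokens with exactly two underscores)
    new = new + list(filter(lambda x: x.count('_') == 2, doc_trigrams))
    return new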

docs = list(map(add_bigrams_and_trigrams, docs))

Now, we need to build a [gensim Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) from our documents. This class builds a mapping from each token to an integer id.

In [None]:
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

This object can convert from tokens to ids:

In [None]:
dictionary.token2id

To convert from id to token, index the Dictionary object with the id, just as you would look up a key in a Python dictionary.

In [None]:
dictionary[3]
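
Conceptually, the two lookups above are a pair of inverse mappings. A minimal plain-Python sketch follows. The toy documents are invented, and ids are assigned in order of first appearance, which is an assumption made for simplicity and not necessarily how gensim assigns them.

```python
toy_docs = [["deep", "learning", "model"], ["learning", "rate"]]

# Give each new token the next unused integer id
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Invert the mapping to go from id back to token
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id)     # {'deep': 0, 'learning': 1, 'model': 2, 'rate': 3}
print(id2token[3])  # rate
```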

The Dictionary class has some useful methods. For example, use the [filter_extremes method](https://radimrehurek.com/gensim/corpora/dictionary.html) to remove any tokens that appear in fewer than 20 documents or in more than 50% of documents.

In [None]:
###REMOVE
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [None]:
# Your code here
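
The filtering rule can be sketched in plain Python on a toy corpus. The helper name, the toy documents, and the thresholds below are all invented for illustration. The real method works on the Dictionary in place and, by default, also caps the vocabulary size through its `keep_n` parameter.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return tokens found in at least `no_below` docs and at most a `no_above` fraction of docs."""
    # Count each token once per document, however often it occurs within that document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["model", "data", "gan"],
    ["model", "data"],
    ["model", "graph"],
    ["model", "graph", "data"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['data', 'graph']
```

Here "gan" is dropped for being too rare (one document) and "model" for being too common (every document).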

We can convert a document into a bag-of-words representation using the [doc2bow method](https://radimrehurek.com/gensim/corpora/dictionary.html).

In [None]:
dictionary.doc2bow(docs[0])

**Question:** This returns a list of two-element tuples. What is the meaning of the first part of each tuple? What is the meaning of the second part?

Next, convert your documents into a bag-of-words representation and save as an object named `corpus`.

In [None]:
### REMOVE
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [None]:
corpus = # fill this in

## Latent Dirichlet Allocation

In [None]:
from gensim.models import LdaModel

You can read more about the Gensim implementation of the LDA model here: https://radimrehurek.com/gensim/models/ldamodel.html

You can leave the parameters as they are set (or experiment and see how the results change).

In [None]:
num_topics = 8            # The number of topics to be extracted
passes = 20               # The number of times to pass through the entire corpus
chunksize = 2000          # The number of documents to be used in each training chunk
iterations = 400          # The maximum number of iterations through the corpus when inferring the topic distribution

temp = dictionary[0]              # id2token is built lazily; indexing the dictionary once populates it
id2word = dictionary.id2token     # The model needs the id-to-token mapping

model = LdaModel(
    corpus = corpus,
    id2word = id2word,
    num_topics = num_topics,
    passes = passes,
    chunksize = chunksize,
    iterations = iterations,
    alpha='auto',         # Learn an asymmetric prior for document-topic distribution from the corpus
    eta='auto',           # Learn an asymmetric prior for topic-word distribution from the corpus
    eval_every = None,    # Speeds up training
    random_state = 321
)
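
As a back-of-the-envelope sketch of how `passes` and `chunksize` interact: assuming gensim's default `update_every=1`, the model is updated once per chunk, so the total number of updates is roughly the number of chunks per pass times the number of passes. The corpus size of 5,000 documents below is a hypothetical figure, not the size of our dataset.

```python
import math

def count_chunk_updates(num_docs, chunksize, passes):
    """Approximate number of model updates when the model is updated once per chunk."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass * passes

# Hypothetical corpus of 5,000 documents with the settings used above
print(count_chunk_updates(num_docs=5000, chunksize=2000, passes=20))  # 60
```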

Once the model has been fit, we can create a visualization of it using the pyLDAvis library.

In [None]:
vis = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(vis, 'lda.html')

Open the HTML file that was created (`lda.html`) in your web browser and explore the topics that were found.

**Question:** How does the relevance metric change as the parameter lambda goes from 0 to 1?

**Question:** Look at the topic labeled as topic 6 in the visualization. What do papers related to this topic seem to be about?

Once our model is fit, we can get the topic distribution for each document. Take a look at the topic distribution for the document with id 100. Does this topic distribution look reasonable, given the visualization?

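**Warning:** The pyLDAvis library numbers topics starting at 1, whereas the gensim library numbers them starting at 0, so topic 1 in the HTML file corresponds to gensim's topic 0. This direct correspondence relies on having passed `sort_topics=False` when preparing the visualization; otherwise pyLDAvis reorders the topics.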

In [None]:
i = 100

print(f'Abstract: {papers.loc[i, "abstract"]}')
model.get_document_topics(corpus[i])

Now, build a DataFrame with one row per document and one column per topic, holding each document's topic distribution.

In [None]:
### REMOVE
# model[paper] omits very low-probability topics, so fill the missing entries with 0
topic_dist = pd.DataFrame([dict(model[paper]) for paper in corpus],
                          columns = list(range(num_topics))).fillna(0)

In [None]:
# Your code here
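
One detail to keep in mind is that gensim returns each document's topics as a sparse list of `(topic_id, probability)` pairs and leaves out topics with very small probability. A plain-Python sketch of turning one such list into a dense row follows. The helper name and the example values are hypothetical.

```python
def to_dense(sparse_topics, num_topics):
    """Expand a sparse list of (topic_id, probability) pairs into a fixed-length row."""
    row = [0.0] * num_topics  # topics that were left out get probability 0
    for topic_id, prob in sparse_topics:
        row[topic_id] = prob
    return row

# Hypothetical output for one document: only two of the eight topics are reported
sparse = [(0, 0.62), (5, 0.37)]
dense = to_dense(sparse, num_topics=8)
print(dense)  # [0.62, 0.0, 0.0, 0.0, 0.0, 0.37, 0.0, 0.0]
```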

Find the paper with the largest share of topic 5. Then look at the abstract of this paper.

In [None]:
# Your code here

In [None]:
topic_dist.nlargest(5, 5)

In [None]:
papers.loc[1173, 'abstract']

**Challenge Question:** Pick two topics and find a paper which is made up of about 50% of each of those topics. Hint: You could use the cosine similarity to find such a paper.

In [None]:
# Your code here

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
# argsort sorts in ascending order, so the most similar papers are at the end of this array
np.argsort(cosine_similarity(np.array([1, 0, 0, 0, 0, 1, 0, 0]).reshape(1, -1), topic_dist[list(range(8))]))[0]
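
For intuition about why cosine similarity helps here, this plain-Python sketch compares a target vector that marks topics 0 and 5 against two made-up topic distributions. The helper `cosine` and both example distributions are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

target = [1, 0, 0, 0, 0, 1, 0, 0]                  # equal weight on topics 0 and 5
even_split = [0.5, 0, 0, 0, 0, 0.5, 0, 0]          # hypothetical 50/50 paper
mostly_one = [0.98, 0, 0, 0, 0, 0.02, 0, 0]        # hypothetical paper dominated by topic 0

print(round(cosine(target, even_split), 3))
print(round(cosine(target, mostly_one), 3))
```

Cosine similarity ignores vector length, so the even split scores (essentially) 1 even though its entries are 0.5 rather than 1, while the lopsided paper scores noticeably lower.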

In [None]:
topic_dist.loc[1396]

In [None]:
papers.loc[1234, 'abstract']