<img src="../../Img/backdrop-wh.png" alt="Drawing" style="width: 300px;"/>

# Topic Modeling

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Understand topic modeling and how it can be used to find themes and topics across posts.
* Visualize topic models to facilitate exploration.
* Evaluate and improve topic models using several methods.
* Give names to topics, and use them to classify text.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

### Sections
1. [Topic Modeling](#topic)
2. [Visualizing a Topic Model](#vis)
3. [Tweaking the Data](#data)
4. [Tweaking Hyperparameters](#hyper)
5. [Calculating Topic Coherence](#coh)
6. [Changing Number of Topics](#topics)
7. [Using Topic Models](#use)

<a id='topic'></a>

# Topic modeling

This notebook introduces topic modeling. Topic modeling is a type of statistical modeling for the discovery of abstract "topics" that occur in a collection of documents. It is frequently used in NLP to aid the discovery of hidden semantic structures in a collection of texts.

Before you start, please read [this post](https://tomvannuenen.medium.com/analyzing-reddit-communities-with-python-part-5-topic-modeling-a5b0d119add) for an explainer of how topic modeling (and LDA, which is just one form of topic modeling) works.

We'll use the `Gensim` package to create our topic models. It also lets us run tests to optimize the number of topics.

## Loading the data

In [None]:
import pandas as pd

df = pd.read_csv('../../data/aita_pp.csv')

Let's split up the lemmas--we need them split up to use in our topic model. All we need to do is run `.split()` on our "pp_text" column to tokenize the data again.

In [None]:
lemmas_split = [lemma.split() for lemma in df['pp_text']]

In [None]:
lemmas_split[0][:10]

## Creating a `Dictionary` with Gensim

Now, let's create our Gensim dictionary: a mapping of each word to a unique id. We'll use Gensim's `Dictionary` class for this. From the dictionary we then build the corpus, a Document-Term matrix much like the output of the `CountVectorizer` we saw last week.

In [None]:
from gensim import corpora, models, similarities
from gensim.models.coherencemodel import CoherenceModel

# Create Dictionary 
dictionary = corpora.Dictionary(lemmas_split)

# filter extremes and assign new ids
dictionary.filter_extremes(no_below=10, no_above=0.4)
dictionary.compactify() 

# SAVE DICT
dictionary.save('../../data/aita_lda.dict')

# Create Document-Term Matrix of our whole corpus 
corpus = [dictionary.doc2bow(text) for text in lemmas_split]

Topic modeling uses a **bag-of-words** model to represent documents in a corpus. In the bag-of-words model, a document is represented by word counts. Additional information, such as word order, is discarded.

Let's view some of the corpus we have now:

In [None]:
corpus[0][:10]

Observe the first 10 tuples above. Each tuple is a mapping of (word_id, word_frequency). For example, (0, 1) above means that word id 0 occurs once in the first document. Word id 6 occurs 6 times, and so on. This is used as the input by the LDA model.
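
To make the (word_id, word_frequency) format concrete, here is a minimal plain-Python sketch of the same idea. The toy documents and the helper `to_bow` are made up for illustration, and Gensim's own id assignment may order words differently.

```python
from collections import Counter

# Hypothetical toy corpus (not the AITA data): each document is a list of tokens
toy_docs = [
    ["father", "car", "father", "money"],
    ["car", "school", "car"],
]

# Step 1: map each distinct word to an integer id, in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Step 2: represent each document as sorted (word_id, word_frequency) tuples
def to_bow(doc):
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is gone in `toy_corpus`: only the counts per word id remain, which is exactly what "bag of words" means.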

If you want to see what word a given id corresponds to, pass the id as a key to the dictionary.

In [None]:
dictionary[6]

And if you want to see the associated id for some word:

In [None]:
dictionary.token2id['father']

## Running a model

Let's run our first Gensim LDA topic model! Check out the comments to learn about the function arguments we're using.

Note that topic modeling essentially adds a third, latent layer on top of the documents and tokens (the representation we saw last week when running scikit-learn's `CountVectorizer()`). That third layer consists of topics. Topic modeling assumes that documents are made up of topics, and topics are made up of tokens. It also assumes a Dirichlet probability distribution, which encourages documents to consist of only a handful of topics, and topics of only a handful of words.
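
To get a feel for why a Dirichlet prior encourages "only a handful of topics", the sketch below draws two topic mixtures from a symmetric Dirichlet using only the standard library. The helper `sample_dirichlet` and the concentration values 0.1 and 10.0 are illustrative assumptions, not Gensim settings.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(100)

# Small alpha: most of the probability mass tends to land on a few topics
sparse_mix = sample_dirichlet(alpha=0.1, size=10, rng=rng)

# Large alpha: the mass tends to be spread evenly over all topics
flat_mix = sample_dirichlet(alpha=10.0, size=10, rng=rng)

print([round(p, 2) for p in sparse_mix])
print([round(p, 2) for p in flat_mix])
```

Each list sums to 1 and can be read as one document's topic proportions. With a small concentration value, a typical draw is dominated by one or two topics.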

In [None]:
%%time
from gensim.models.ldamodel import LdaModel

lda_model = LdaModel(corpus=corpus,   # stream of document vectors or sparse matrix of shape (num_documents, num_terms)
            id2word=dictionary,       # mapping from word IDs to words (for determining vocab size)
            num_topics=10,            # number of topics
            random_state=100,         # seed to generate random state; useful for reproducibility
            passes=2,                 # number of passes through the corpus, aka epochs
            per_word_topics=False)    # whether to also compute the most likely topics for each word

The most challenging part about topic modeling is creating a *good*, i.e. interpretable, topic model. This is a heavily iterative process. The first thing we should do is visualize the model.

<a id='vis'></a>

# Visualizing a Topic Model

One of the best ways to visualize a topic model is through the pyLDAvis package. `pyLDAvis` was designed to help users interpret the topics in a topic model. Let's start by downloading the package.

In [None]:
try:
    import pyLDAvis
except ImportError:
    !pip install pyLDAvis

pyLDAvis allows us to visualize our topics. A "good" topic model produces non-overlapping, fairly large bubbles, which should be scattered throughout the chart instead of being clustered in one quadrant. A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(lda_model, corpus, dictionary)
lda_viz

On the left, there is a 2D plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map). This plot uses a multidimensional scaling (MDS) algorithm. 
- Similar topics should appear close together on the plot; dissimilar topics should appear far apart. 
- The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
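
The "distance" between two topics can be thought of as a distance between their word distributions. As far as I know, pyLDAvis uses the Jensen-Shannon divergence for this by default before projecting to 2D; the sketch below computes that divergence for three made-up topic-word distributions and is an illustration, not pyLDAvis's own code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and zero for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topics over a four-word vocabulary (each row sums to 1)
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]   # close to topic_a
topic_c = [0.0, 0.1, 0.2, 0.7]   # very different from topic_a

print(js_divergence(topic_a, topic_b))
print(js_divergence(topic_a, topic_c))
```

Topics `a` and `b` would be drawn close together on the map, while `a` and `c` would end up far apart.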

### Exploring topics and words
- You can scrutinize a topic more closely by clicking on its circle, or entering its number in the "selected topic" box in the upper-left. (Note that, though the data used by Gensim and pyLDAvis are the same, they don't use the same ID numbers for topics.)
- If you roll your mouse over a term in the bar chart on the right, the topic circles will resize in the plot on the left. This shows the strength of the relationship between the topics and the selected term.

### Salience
On the right, there is a bar chart with the top terms. When no topic is selected in the plot on the left, the bar chart shows the top-30 most **salient** terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.

### Probability Vs Exclusivity 
When you select a particular topic, this bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart:

* Setting λ close to 1.0 (the default) will rank the terms according to their probability within the topic.
* Setting λ close to 0.0 will rank the terms according to their "distinctiveness" or "exclusivity" within the topic. This favors terms that occur mostly in this topic and rarely in other topics.

You can move the slider between 0.0 and 1.0 to weigh term probability and exclusivity.
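
The relevance score behind the slider is usually written as λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), where p(w|t) is a term's probability within the topic and p(w) its overall probability in the corpus. Here is a small sketch with made-up probabilities; the terms and the helper `rank_terms` are hypothetical.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Weighted mix of within-topic probability and lift (exclusivity)."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities for two terms in one topic
terms = {
    "thing":  {"p_w_t": 0.05, "p_w": 0.05},    # frequent everywhere
    "diaper": {"p_w_t": 0.02, "p_w": 0.002},   # rarer, but concentrated in this topic
}

def rank_terms(lam):
    return sorted(
        terms,
        key=lambda w: relevance(terms[w]["p_w_t"], terms[w]["p_w"], lam),
        reverse=True,
    )

print(rank_terms(1.0))  # ranking by probability within the topic
print(rank_terms(0.0))  # ranking by exclusivity
```

At λ = 1 the generic but frequent term comes first; at λ = 0 the term that is concentrated in this topic takes over.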

### Exploring the graph
The interactive visualization pyLDAvis produces is helpful for **individual** topics: you can manually select each topic to view its top most frequent and/or "relevant" terms, using different values of the λ parameter. This can help when you're trying to assign a name or "meaning" to each topic. 

It also helps you to see the **relationships** between topics: exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

### Getting insights about the model
As you can see, this model probably has too many topics: they are overlapping, and most of them appear in one corner of the graph. So we have our first hint that we might want to alter our model. Let's start by tweaking our data.

<a id='data'></a>

# Tweaking the data

Remember we used the lemmas of our dataset? What if we tweaked it some more -- for instance, by POS tagging?

POS (Part of Speech) tagging is the process of marking up each word in a corpus with its corresponding part-of-speech tag (noun, adjective, verb, etc.). Often it is a process of converting a sentence to a list of tuples where each tuple takes the form of (word, tag).
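
As a toy illustration of the (word, tag) format and of filtering on it, here is a sketch with a hand-tagged sentence. The tags are assigned by hand using the coarse labels that spaCy exposes as `pos_`; no tagger is run here, and `keep_pos` is a hypothetical helper.

```python
# A hand-tagged example sentence as (word, tag) tuples
tagged_sentence = [
    ("my", "PRON"),
    ("older", "ADJ"),
    ("sister", "NOUN"),
    ("borrowed", "VERB"),
    ("the", "DET"),
    ("car", "NOUN"),
]

def keep_pos(tagged, allowed=("NOUN", "ADJ")):
    """Keep only the words whose tag is in the allowed set."""
    return [word for word, tag in tagged if tag in allowed]

print(keep_pos(tagged_sentence))
```

Keeping only nouns and adjectives drops function words and verbs, which often makes topics easier to label.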

Let's do this with spaCy.

In [None]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

import spacy
#!spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def POS(text, allowed_postags = ['NOUN', 'ADJ']):
    parsed = nlp(text)
    return [token.lemma_ for token in parsed if token.pos_ in allowed_postags]

In [None]:
from tqdm import tqdm  # progress bar for long-running loops

# This will take a long time
pos_lemmas_split = [POS(text) for text in tqdm(df['pp_text'])]

In [None]:
# turn our POS tagged lemmas into a string so we can save them in our DF
str_pos_lemmas = [' '.join(t) for t in pos_lemmas_split]

In [None]:
str_pos_lemmas[0]

In [None]:
df['pos_text'] = str_pos_lemmas

Next, we need to create new dictionary and corpus objects for Gensim.

In [None]:
# Create Dictionary 
pos_dictionary = corpora.Dictionary(tqdm(pos_lemmas_split))

# filter extremes and assign new ids
pos_dictionary.filter_extremes(no_below=10, no_above=0.4)
pos_dictionary.compactify() 

# SAVE DICT
pos_dictionary.save('../../data/aita_pos_lda.dict')

# Create Document-Term Matrix of our whole corpus 
pos_corpus = [pos_dictionary.doc2bow(text) for text in tqdm(pos_lemmas_split)]


<a id='hyper'></a>

# Tweaking hyperparameters

Next, let's change some hyperparameters. This can also determine how interpretable our topic models become.

`passes` controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. It defaults to `1` but we might want to set it to a higher number.
 
Gensim's designer suggests the following way to choose the number of passes. First, enable `logging`, and set `eval_every=1` in `LdaModel`. This will yield a **perplexity** score for every update. Perplexity is a measure of how well a probability model predicts a sample. It captures how surprised a model is by data it has not seen before, and is derived from the normalized log-likelihood of a held-out test set (lower perplexity is better).
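
The two numbers Gensim logs are directly related: to my understanding, the per-word bound is a base-2 log-likelihood per word, and perplexity is 2 raised to its negative. A minimal sketch of that conversion follows; the helper name `perplexity_from_bound` and the input values are made up.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into perplexity."""
    return 2 ** (-per_word_bound)

# Hypothetical bounds: the less negative value corresponds to a better fit
print(perplexity_from_bound(-8.0))
print(perplexity_from_bound(-7.0))
```

A bound of -8.0 gives a perplexity of 256.0, and -7.0 gives 128.0, so a higher (less negative) log-likelihood always means a lower perplexity.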


In [None]:
import logging
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.basicConfig(filename='../../data/gensim.log', filemode='w', format="%(asctime)s:%(levelname)s:%(message)s", level=logging.INFO)

lda_model_tweak = LdaModel(corpus=pos_corpus,
                           id2word=pos_dictionary,
                           num_topics=20, 
                           random_state=100,
                           eval_every=1,           # show perplexity after every update for visualization
                           passes=5,              # number of model training cycles, aka epochs
                           per_word_topics=False)

Using regular expressions, we can now search through our newly created "gensim.log" file and find and plot the relevant information.
The plot shows how topic/word assignments reach a steady state and no longer change much (i.e., converge). Adding passes when running the topic model can increase the log-likelihood. The higher the log-likelihood, the better our model fits the dataset.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import re

# Each matching log line holds the per-word bound (log likelihood) and the perplexity
p = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
with open('../../data/gensim.log') as log_file:
    matches = [p.findall(line) for line in log_file]
matches = [m for m in matches if len(m) > 0]
tuples = [t[0] for t in matches]
likelihood = [float(t[0]) for t in tuples]
perplexity = [float(t[1]) for t in tuples]
iterations = list(range(0, len(tuples) * 10, 10))
plt.plot(iterations, likelihood, c="black")
plt.ylabel("log likelihood")
plt.xlabel("iteration")
plt.title("Topic Model Convergence")
plt.grid()

Note that the graph contains about 5 "peaks", which corresponds to the number of `passes` we set above. As you can see, the model converges quite rapidly, so we do not need to set `passes` very high.

<a id='coh'></a>

# Calculating Topic Coherence

We can also apply statistical measures to help us determine the optimal number of topics in our topic model.

**Topic Coherence** measures the score of a single topic by measuring the degree of semantic similarity between high scoring words in the topic. This helps to distinguish between topics that are semantically interpretable topics, and topics that are artifacts of statistical inference. 

A set of statements or facts is said to be coherent if the statements support each other. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.

A good model will generate topics with *high* topic coherence scores. Good topics are topics that can be described by a short label, therefore this is what the topic coherence measure should capture.

💡 **Tip**: There are different ways to measure coherence. For instance, the c_v measure used here is calculated based on a combination of confirmation measures (such as how often word pairs occur together).
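
To illustrate the "how often word pairs occur together" idea, here is a deliberately simplified coherence score: the average normalized pointwise mutual information (NPMI) of word pairs, counted over whole documents. This is not the c_v measure itself, which is more elaborate; the toy documents and the helper `npmi_coherence` are made up.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # the pair never occurs together
        elif p12 == 1:
            scores.append(1.0)    # the pair occurs together in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical tokenized documents
toy_docs = [
    ["wedding", "bride", "dress"],
    ["wedding", "bride", "cake"],
    ["car", "loan", "money"],
    ["money", "loan", "rent"],
]

print(npmi_coherence(["wedding", "bride"], toy_docs))  # words that always co-occur
print(npmi_coherence(["wedding", "loan"], toy_docs))   # words that never co-occur
```

Scores range from -1 (never together) to 1 (always together), so a topic whose top words tend to share documents scores higher.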

In [None]:
#import logging
#logging.getLogger().setLevel(logging.CRITICAL)
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
from tqdm import tqdm

# Compute Coherence Score
coherence_model = CoherenceModel(model=lda_model_tweak, corpus=pos_corpus, texts=tqdm(pos_lemmas_split), dictionary=pos_dictionary, coherence='c_v') 
coherence = coherence_model.get_coherence()
print('\nCoherence Score: ', coherence)

There's no hard-and-fast rule on what makes a good coherence score.
In general, a coherence score of around .4 means you probably are not using the right number of topics. .6 to .7 is good. Anything more is suspiciously great. As you can see, our coherence score here is very low, so we should definitely try to improve upon our model.

<a id='topics'></a>
# Changing number of topics

The final and most important thing we can do to find optimal scores is to play around with the number of topics our model creates. One way to do this is to build many LDA models with different numbers of topics (k), and then pick the one that gives the highest coherence value. Choosing a ‘k’ at the end of a rapid growth of topic coherence usually yields meaningful and interpretable topics. If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.

This `compute_coherence_values()` function trains multiple LDA models, provides the models, and tells you their corresponding coherence scores.

Also note the docstring I create here: docstrings are documentation for the functions we create. A docstring describes what a function does, and can be displayed using `help(function_X)`.
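
As a minimal sketch of how a docstring works, here is a made-up function, `count_tokens`, with a one-line docstring, which is stored on the function as `__doc__`.

```python
def count_tokens(text):
    """Return the number of whitespace-separated tokens in `text`."""
    return len(text.split())

# The docstring is stored on the function itself;
# help(count_tokens) shows the same text in an interactive session
print(count_tokens.__doc__)
print(count_tokens("am I the one at fault"))
```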

In [None]:
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def compute_coherence_values(dictionary, corpus, texts, start, limit, step):
    """
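    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : Tokenized texts (list of list of str)
    start : Min num of topics
    limit : Upper bound on num of topics (exclusive, as in range())
    step : Step size between consecutive nums of topics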

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    total_amount = (limit - start - 1) // step + 1
    current_amount = 0
    passes = 5
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=100, 
                         update_every=1, passes=passes, alpha='auto', per_word_topics=False)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        # When using 'c_v' texts should be provided, corpus isn’t needed. 
        # When using ‘u_mass’ corpus should be provided, if texts is provided, it will be converted to corpus using the dictionary 
        coherence_values.append(coherencemodel.get_coherence())
        current_amount += 1
        print("Built " + str(current_amount) + " of " + str(total_amount) + " models")

    return model_list, coherence_values


Using our new function, let's run a bunch of topic models with different amounts of topics.

In [None]:
# Can take a long time to run
model_list, coherence_values = compute_coherence_values(dictionary=pos_dictionary, 
                                                        corpus=pos_corpus, texts=pos_lemmas_split, 
                                                        start=6, limit=30, step=5)

Now, from all those models, let's visualize the output of the coherence scores.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Show graph
start=6; limit=30; step=5  # same values we passed to compute_coherence_values()
x = range(start, limit, step) # range uses start, stop, and incrementation
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # a list, so the whole string is one label
plt.show()

In [None]:
# Print these coherence scores
for c, (m, cv) in enumerate(zip(x, coherence_values)):
    print(f"model_list[{c}]: Num Topics = {m}, Coherence Value = {round(cv, 4)}")

If the coherence score keeps increasing, it generally makes sense to pick the model that gave the highest coherence value before the curve flattens out or drops again. Following this "elbow method", we have a few options.
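
Here is a small sketch of how one might read such a curve programmatically. The scores in `toy_scores` are invented for illustration and are **not** results from this notebook; only the candidate numbers of topics match the `range()` used above.

```python
candidate_k = list(range(6, 30, 5))           # [6, 11, 16, 21, 26]
toy_scores = [0.41, 0.47, 0.48, 0.46, 0.44]   # made-up coherence values

# Model with the single highest score
best_index = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print("Highest coherence at k =", candidate_k[best_index])

# Gain in coherence from each model to the next; a sharp drop in gain marks the "elbow"
gains = [round(b - a, 2) for a, b in zip(toy_scores, toy_scores[1:])]
for k, gain in zip(candidate_k[1:], gains):
    print(f"k = {k}: gain over previous model = {gain}")
```

In this invented example the highest score is at k = 16, but almost all of the gain is already reached at k = 11, so both would be reasonable candidates to inspect further.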

However, these methods are only heuristics, and the scores we have here are not that far apart. At this point you should *go back* to pyLDAvis using the models from our `model_list`, and compare them to see which topic model looks better. Based on these combined insights, I will pick the model with 11 topics.

Note that this is **not** the model with the highest coherence value! While these metrics can be useful, they should never be followed blindly. What matters more is the **interpretability** of topic models.

In [None]:
from gensim import corpora, models, similarities

# SAVE MODEL
optimal_lda_model = model_list[1]
optimal_lda_model.save('../../data/aita_pos_lda_optimal.model')


In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(optimal_lda_model, pos_corpus, pos_dictionary)
lda_viz

In case you want to load these models from disk again:

In [None]:
# LOAD MODEL
optimal_lda_model = LdaModel.load('../../data/aita_pos_lda_optimal.model')

# LOAD DICT
pos_dictionary = corpora.Dictionary.load('../../data/aita_pos_lda.dict')

# LOAD CORPUS
pos_corpus = [pos_dictionary.doc2bow(text) for text in tqdm(pos_lemmas_split)]

## Naming our topics

The next thing we should do is name our topics. This is the most important interpretative step in the process: after all, our model has no semantic knowledge of the data. We will print out the top words of each topic, then go over all of them and give them names.

In [None]:
from pprint import pprint

# Select the ideal model and print the topics
model_topics = optimal_lda_model.show_topics(formatted=False)
pprint(optimal_lda_model.print_topics(num_words=20))

**Important**: If you are using your own data, make sure to replace the following names with those of your own!

In [None]:
# giving names to our topics
topic_names = {0: 'family, kids', 
               1: 'married life, neighborhoods, pets',
               2: 'school, work, cars',
               3: 'family, siblings',
               4: 'dating, friends',
               5: 'lifestyle, looks',
               6: 'work',
               7: 'money, work',
               8: 'family, babies',
               9: 'food',
               10: 'sleep'}

Naming topics is a heavily iterative process, based on looking closer at particular posts (see below).

For instance, I had initially named topic 1 "weddings", but after reading some posts with this dominant topic, I renamed it to "married life, neighborhoods, pets".

<a id='use'></a>

# Using Topic Models

Topic modeling has several practical applications. One of them is to determine what topic a Reddit post is about. To figure this out, we find the topic number that has the highest percentage contribution to that thread. We'll write a `dominant_topic()` function that aggregates this information in a DataFrame.
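
The core step, picking the topic with the highest share for one document, can be sketched in plain Python. The topic shares, the names in `toy_topic_names`, and the helper `pick_dominant` are hypothetical; in the notebook these values come from the trained model.

```python
# Hypothetical topic names and one document's topic distribution,
# given as (topic_id, share) pairs like those an LDA model returns per document
toy_topic_names = {0: "family, kids", 1: "work", 2: "food"}
doc_topics = [(0, 0.15), (1, 0.70), (2, 0.15)]

def pick_dominant(doc_topics, names):
    """Return (topic_id, topic_name, share) for the topic with the largest share."""
    topic_num, share = max(doc_topics, key=lambda pair: pair[1])
    return topic_num, names[topic_num], share

print(pick_dominant(doc_topics, toy_topic_names))
```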

In [None]:
def dominant_topic(ldamodel=optimal_lda_model, corpus=corpus, texts=df['selftext']):
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        if not doc_topics:
            continue
        # Dominant topic: the (topic_num, proportion) pair with the highest proportion
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        # Keywords: the top words of the dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        # get value of topic_names dict based on key
        topic_name = topic_names[topic_num]
        rows.append([int(topic_num), topic_name, round(prop_topic, 4), topic_keywords])

    # Build the DataFrame once instead of concatenating row by row
    topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Dominant_Topic_Name',
                                            'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    topics_df = pd.concat([topics_df, contents], axis=1)
    return topics_df

# Run function
df_topic_keywords = dominant_topic(ldamodel=optimal_lda_model, corpus=pos_corpus, texts=df['selftext'])

# Format
df_dominant_topic = df_topic_keywords.reset_index(drop=True)
df_dominant_topic.columns = ['Dominant_Topic', 'Dominant_Topic_Name', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic

We can now find the posts with a given dominant topic using `.loc`:

In [None]:
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic_Name'] == 'family, kids']['Text']

Look at the first post:

In [None]:
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic_Name'] == 'family, kids', 'Text'].iloc[0]

That does look to be about family life, and particularly about having kids.

## Adding topics to original DF

Once we are happy with our topic names, we can add the dominant topic names to our original DataFrame and save it.

In [None]:
df['dom_topic'] = df_dominant_topic['Dominant_Topic_Name']
df['dom_topic_num'] = df_dominant_topic['Dominant_Topic']

df.to_csv('../../data/aita_lda.csv', index=False)

In [None]:
df.head(3)

# 💭 Reflection: The hermeneutics of topic modeling 

One thought to end with: for most topic models you will create, it will be hard to apply a meaningful interpretation to each topic. Not every topic will have some meaningful insight "fall out of it" upon first inspection. This is a typical issue in machine learning, which can pick up on patterns that might not make sense to humans.

It is an open question to what extent you should let yourself be surprised by particular combinations of words in a topic, or whether topic models should primarily follow the intuitions you already have as a researcher. What makes for a "good" topic model probably straddles the boundaries of surprise and expectation.

<div class="alert alert-success">

## ❗ Key Points

* Topic modeling can help us find themes and topics in textual data.
* Topic models can be evaluated and improved based on coherence metrics; however, using your eyes to see whether topics make intuitive sense is just as important.
* Topic models yield information that can be used to do different things, such as finding submissions with particular topics, or classifying texts.
    
</div>