# Topic Modeling – Project Notebook

Use this notebook for carrying out the analyses from the workshop notebook on your own subreddit data.

## Loading the data

Make sure to use the preprocessed file from Week 1, with the `lemmas` column in it!

In all of the cells below, replace YOUR_FILE with the name of the file you are working with.

In [None]:
import pandas as pd

df = pd.read_csv('../../data/YOUR_FILE_PP.csv')

In [None]:
df.head(3)

In [None]:
from tqdm import tqdm 

# Split the space-separated lemmas of each post into a list of tokens
lemmas_split = [lemma.split() for lemma in tqdm(df['lemmas'])]

## Creating a `Dictionary` with Gensim

In [None]:
from gensim import corpora, models, similarities
from gensim.models.coherencemodel import CoherenceModel

# Create Dictionary 
dictionary = corpora.Dictionary(tqdm(lemmas_split))

# filter extremes and assign new ids
dictionary.filter_extremes(no_below=10, no_above=0.4)
dictionary.compactify() 

# SAVE DICT
dictionary.save('../../data/YOUR_FILE.dict')

# Create Document-Term Matrix of our whole corpus 
corpus = [dictionary.doc2bow(text) for text in tqdm(lemmas_split)]
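
To make the steps above more concrete, here is a minimal pure-Python sketch of what document-frequency filtering and the bag-of-words conversion do conceptually. It is a simplified illustration on a toy corpus, not Gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy documents are made up, and details such as Gensim's `keep_n` limit are ignored.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.8):
    """Keep tokens that occur in at least `no_below` documents and in at
    most a fraction `no_above` of all documents; assign consecutive ids."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(t for t, n in doc_freq.items()
                  if n >= no_below and n / len(docs) <= no_above)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Convert a tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical toy corpus: "dog" is in every document, "fish" in only one
toy_docs = [
    ["cat", "dog", "cat"],
    ["dog", "bird"],
    ["cat", "dog", "fish"],
    ["dog", "bird", "bird"],
]
toy_vocab = build_vocab(toy_docs, no_below=2, no_above=0.8)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_bow)
```

Very common tokens ("dog") and very rare tokens ("fish") are dropped, which is the same idea as `no_above=0.4` and `no_below=10` in the cell above, just at toy scale.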

# Running a model

In [None]:
%%time
# %%time has to be the first line of the cell to time the whole cell
from gensim.models.ldamodel import LdaModel

# The corpus is passed directly: tqdm has no `iter_count` argument
lda_model = LdaModel(corpus=corpus,   # bag-of-words corpus (stream of document vectors)
            id2word=dictionary,       # mapping from word IDs to words (for determining vocab size)
            num_topics=10,            # number of topics
            random_state=100,         # seed to generate random state; useful for reproducibility
            passes=2,                 # number of passes (epochs) over the whole corpus
            per_word_topics=False)    # if True, also compute the most likely topics for each word

<a id='vis'></a>

# Visualizing a Topic Model

In [None]:
try:
    import pyLDAvis
except ImportError:
    !pip install pyLDAvis

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(lda_model, corpus, dictionary)
lda_viz

On the left, there is a 2D plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map). This plot uses a multidimensional scaling (MDS) algorithm. 
- Similar topics should appear close together on the plot; dissimilar topics should appear far apart. 
- The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
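
To get a feeling for what "distance" between topics means here: to our knowledge, pyLDAvis by default bases the map on the Jensen-Shannon divergence between the topics' word distributions before projecting it to 2D. The sketch below only illustrates that divergence on made-up distributions; the function names and numbers are hypothetical, and the projection step is left out.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), using natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-word distributions over a 4-word vocabulary (made-up numbers)
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # similar to topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # very different from topic_a

print("a vs b:", js_divergence(topic_a, topic_b))
print("a vs c:", js_divergence(topic_a, topic_c))
```

Topics that put their probability mass on the same words get a small divergence and end up close together on the map.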

### Exploring topics and words
- You can scrutinize a topic more closely by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left. (Note that, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics.)
- If you roll your mouse over a term in the bar chart on the right, the topic circles will resize in the plot on the left. This shows the strength of the relationship between the topics and the selected term.

### Salience
On the right, there is a bar chart with the top terms. When no topic is selected in the plot on the left, the bar chart shows the top-30 most **salient** terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
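
As a rough illustration of how frequency and distinctiveness combine, the sketch below follows the saliency definition of Chuang et al. (2012), which is what pyLDAvis reports as far as we know: saliency(w) = P(w) × Σ_t P(t|w) · log(P(t|w) / P(t)). The two toy topics and their word probabilities are made up.

```python
import math

def saliency(topic_probs, word_given_topic):
    """Return {word: saliency}, where saliency = P(w) * distinctiveness(w)."""
    scores = {}
    for w in word_given_topic[0]:
        # Overall probability of the word across all topics
        p_w = sum(p_t * dist[w] for p_t, dist in zip(topic_probs, word_given_topic))
        distinctiveness = 0.0
        for p_t, dist in zip(topic_probs, word_given_topic):
            p_t_given_w = p_t * dist[w] / p_w  # Bayes' rule
            if p_t_given_w > 0:
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_t)
        scores[w] = p_w * distinctiveness
    return scores

# Two equally frequent toy topics (made-up probabilities)
topic_probs = [0.5, 0.5]
word_given_topic = [
    {"game": 0.6, "time": 0.3, "vote": 0.1},
    {"game": 0.1, "time": 0.3, "vote": 0.6},
]
toy_saliency = saliency(topic_probs, word_given_topic)
print(toy_saliency)
```

Here "time" is frequent but equally likely in both topics, so it tells us nothing about the topic and its saliency is zero, while "game" and "vote" are both frequent and distinctive.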

### Probability vs. Exclusivity
When you select a particular topic, this bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart:

* Setting λ close to 1.0 (the default) will rank the terms according to their probability within the topic.
* Setting λ close to 0.0 will rank the terms according to their "distinctiveness" or "exclusivity" within the topic. This favors terms that occur almost only in this topic and rarely in other topics.

You can move the slider between 0.0 and 1.0 to weigh term probability and exclusivity.
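
The effect of the slider can be sketched with the relevance formula from the pyLDAvis paper (Sievert & Shirley, 2014): relevance(w, t | λ) = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The two terms and their probabilities below are made up purely for illustration.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Weighted mix of in-topic probability and lift (p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical terms: word -> (p(w | topic), p(w) in the whole corpus)
toy_terms = {
    "people": (0.10, 0.09),    # frequent in the topic, but frequent everywhere
    "wedding": (0.04, 0.005),  # less frequent, but almost exclusive to the topic
}

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print("lambda = 1.0:", rank_terms(toy_terms, lam=1.0))
print("lambda = 0.0:", rank_terms(toy_terms, lam=0.0))
```

With λ = 1.0 the generic but frequent term comes first; with λ = 0.0 the ranking flips in favor of the term that is nearly exclusive to the topic.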

### Exploring the graph
The interactive visualization pyLDAvis produces is helpful for **individual** topics: you can manually select each topic to view its most frequent and/or most "relevant" terms, using different values of the λ parameter. This can help when you're trying to assign a name or "meaning" to each topic. 

It also helps you to see the **relationships** between topics: exploring the Intertopic Distance Map can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

### Getting insights about the model
If many of the topic circles overlap, or most of them cluster in one corner of the map, the model probably has too many topics. That is a first hint that we might want to alter our model.

<a id='coh'></a>

# Calculating Topic Coherence

In [None]:
#import logging
#logging.getLogger().setLevel(logging.CRITICAL)
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

# Compute Coherence Score
coherence_model = CoherenceModel(model=lda_model, corpus=corpus, texts=tqdm(lemmas_split), dictionary=dictionary, coherence='c_v') 
coherence = coherence_model.get_coherence()
print('\nCoherence Score: ', coherence)

## Tweaking the data - POS tagging

In [None]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

import spacy
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def POS(text, allowed_postags = ['NOUN', 'ADJ']):
    parsed = nlp(text)
    return [token.lemma_ for token in parsed if token.pos_ in allowed_postags]

In [None]:
# This will take a long time
pos_lemmas_split = [POS(text) for text in tqdm(df['lemmas'])]

In [None]:
import json
with open('../../data/YOUR_FILE_pos_lemmas.json', 'w' ) as write:
    json.dump(pos_lemmas_split, write)

# Uncomment the following two lines if you want to import this data again
#with open('../../data/YOUR_FILE_pos_lemmas.json') as f:
#    pos_lemmas_split = json.load(f)


In [None]:
# turn them into a string so we can save them in our DF
str_pos_lemmas = [' '.join(t) for t in pos_lemmas_split]

In [None]:
str_pos_lemmas[0]

In [None]:
# Add as a new column; unlike df.insert, this does not fail when the cell is re-run
df['pos_lemmas'] = str_pos_lemmas

In [None]:
df.to_csv('../../data/YOUR_FILE_pos_lemmas.csv', index=False)

Create new dictionary and corpus objects for Gensim.

In [None]:
# Create Dictionary 
pos_dictionary = corpora.Dictionary(tqdm(pos_lemmas_split))

# filter extremes and assign new ids
pos_dictionary.filter_extremes(no_below=10, no_above=0.4)
pos_dictionary.compactify() 

# SAVE DICT
pos_dictionary.save('../../data/YOUR_FILE_pos_lda.dict')

# Create Document-Term Matrix of our whole corpus 
pos_corpus = [pos_dictionary.doc2bow(text) for text in tqdm(pos_lemmas_split)]


## Tweaking hyperparameters

`passes` controls how many times we train the model on the entire corpus. Another word for passes might be “epochs”.
 
`iterations` is somewhat technical, but essentially it controls how many times we repeat a particular loop over each document. It is important to set both `passes` and `iterations` high enough for the model to converge.

Gensim's designer suggests the following way to choose iterations and passes. First, enable `logging`, and set `eval_every = 1` in `LdaModel`. This will yield a **perplexity** score for every update. Perplexity is a measure of how well a probability model predicts a sample. It captures how surprised a model is by new data it has not seen before, and is derived from the normalized (per-word) log-likelihood of a held-out test set. Lower perplexity is better.
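
Gensim's log reports a "per-word bound" next to each perplexity estimate, and the perplexity is computed from it as 2 to the power of minus the bound. The small sketch below shows that relationship; the bound values are made up for illustration and are not real training output.

```python
def perplexity_from_bound(per_word_bound):
    """Perplexity as 2 ** (-per-word bound)."""
    return 2 ** (-per_word_bound)

# Made-up per-word bounds, as they might develop over successive updates
example_bounds = [-9.0, -8.5, -8.0]
example_perplexities = [perplexity_from_bound(b) for b in example_bounds]

for bound, perplexity_value in zip(example_bounds, example_perplexities):
    print(f"bound = {bound:5.1f}  ->  perplexity = {perplexity_value:.1f}")
```

A bound that moves closer to zero means a lower perplexity, so a curve of the bound that rises and then levels off suggests the model has converged.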


In [None]:
import logging
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.basicConfig(filename='../../data/gensim_my_project.log', filemode='w', format="%(asctime)s:%(levelname)s:%(message)s", level=logging.INFO)

lda_model_tweak = LdaModel(corpus=pos_corpus,
                           id2word=pos_dictionary,
                           num_topics=20, 
                           random_state=100,
                           eval_every=1,           # show perplexity after every update for visualization
                           iterations=50,          # number of model iterations over each doc
                           passes=20,              # number of model training cycles, aka epochs
                           per_word_topics=False)

Search through our newly created `gensim_my_project.log` file, then extract and plot the relevant information.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import re

# Matches log lines of the form "<bound> per-word bound, <value> perplexity estimate ..."
p = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
with open('../../data/gensim_my_project.log') as log_file:
    matches = [p.findall(line) for line in log_file]
matches = [m for m in matches if len(m) > 0]
tuples = [t[0] for t in matches]
likelihood = [float(t[0]) for t in tuples]
perplexity = [float(t[1]) for t in tuples]
# With eval_every=1, every logged value corresponds to one model update
updates = list(range(len(tuples)))
plt.plot(updates, likelihood, c="black")
plt.ylabel("log likelihood")
plt.xlabel("model update")
plt.title("Topic Model Convergence")
plt.grid()
plt.savefig("convergence_likelihood.pdf")

## Changing number of topics

This `compute_coherence_values()` function trains multiple LDA models, provides the models, and tells you their corresponding coherence scores.

In [None]:
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def compute_coherence_values(dictionary, corpus, texts, start, limit, step):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : Tokenized text (list of list of str)
    start : Min num of topics
    limit : Max num of topics (exclusive)
    step : Increment between numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    topic_counts = range(start, limit, step)
    passes = 10
    # The corpus is passed directly: tqdm has no `iter_count` argument
    for current_amount, num_topics in enumerate(topic_counts, start=1):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=100, 
                         update_every=1, iterations=50, passes=passes, alpha='auto', per_word_topics=False)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        # When using 'c_v', texts should be provided; corpus isn't needed.
        # When using 'u_mass', corpus should be provided; if texts is provided, it will be converted to a corpus using the dictionary.
        coherence_values.append(coherencemodel.get_coherence())
        print(f"Built {current_amount} of {len(topic_counts)} models")

    return model_list, coherence_values


In [None]:
help(compute_coherence_values)

In [None]:
# Can take a long time to run
model_list, coherence_values = compute_coherence_values(dictionary=pos_dictionary, 
                                                        corpus=pos_corpus, texts=pos_lemmas_split, 
                                                        start=6, limit=30, step=2)

Visualize the output of the coherence scores.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Show graph
start=6; limit=30; step=2
x = range(start, limit, step) # range uses start, stop, and incrementation
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # labels must be a list, not a bare string
plt.show()

In [None]:
# Print these coherence scores
for c, (m, cv) in enumerate(zip(x, coherence_values)):
    print(f"model_list[{c}]: Num Topics = {m}, Coherence Value = {round(cv, 4)}")

If the coherence score keeps increasing as topics are added, it generally makes sense to pick the model with the highest score reached before the curve flattens out or drops again.
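
This rule of thumb can be written down as a small helper. The sketch below is only one possible reading of it: `first_peak_index` is a hypothetical function name, and the coherence scores are made up. It returns the position of the first score that is followed by a lower one.

```python
def first_peak_index(values):
    """Index of the first value that is followed by a lower one;
    falls back to the last index if the scores never drop."""
    for i in range(len(values) - 1):
        if values[i + 1] < values[i]:
            return i
    return len(values) - 1

# Made-up coherence scores for models with 6, 8, 10, 12 and 14 topics
toy_coherence = [0.41, 0.44, 0.47, 0.46, 0.48]
toy_topic_counts = list(range(6, 16, 2))

best = first_peak_index(toy_coherence)
print(f"model_list[{best}] with {toy_topic_counts[best]} topics")
```

Note that the first peak is not necessarily the overall maximum (here the last model scores slightly higher), so it is worth comparing both candidates in pyLDAvis before deciding.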

Replace MY_MODEL below with the index in `model_list` (as printed above) of the model that achieves the best results for you.

In [None]:
from gensim import corpora, models, similarities

# SAVE MODEL
optimal_lda_model = model_list[MY_MODEL]
optimal_lda_model.save('../../data/YOUR_FILE_pos_lda_optimal.model')


In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(optimal_lda_model, pos_corpus, pos_dictionary)
lda_viz

In case you want to load these models from disk again:

In [None]:
# LOAD MODEL
optimal_lda_model = LdaModel.load('../../data/YOUR_FILE_pos_lda_optimal.model')

# LOAD DICT
pos_dictionary = corpora.Dictionary.load('../../data/YOUR_FILE_pos_lda.dict')

# LOAD CORPUS
pos_corpus = [pos_dictionary.doc2bow(text) for text in tqdm(pos_lemmas_split)]

## Naming our topics

The next thing we should do is name our topics. This is the most important interpretative step in the process: after all, our model has no semantic knowledge of the data. We will print out the top words of each topic, then go over all of them and give them names.

In [None]:
from pprint import pprint

# Select the ideal model and print the topics
model_topics = optimal_lda_model.show_topics(formatted=False)
# num_topics=-1 prints all topics, not just the default maximum of 20
pprint(optimal_lda_model.print_topics(num_topics=-1, num_words=20))

Name the topics you have. Make sure to extend or shorten this dictionary so that it has one entry for every topic in your final topic model!

In [None]:
# giving names to our topics
topic_names = {0: 'NAME ME', 
               1: 'NAME ME', 
               2: 'NAME ME', 
               3: 'NAME ME', 
               4: 'NAME ME', 
               5: 'NAME ME', 
               6: 'NAME ME',
               7: 'NAME ME', 
               8: 'NAME ME', 
               9: 'NAME ME', 
               10: 'NAME ME', 
               11: 'NAME ME', 
               12: 'NAME ME', 
               13: 'NAME ME', 
               14: 'NAME ME', 
               15: 'NAME ME'
              } 
               

Naming topics is a heavily iterative process, based on looking closer at particular posts (see below).

<a id='use'></a>

# Using Topic Models: What is a Reddit Post About?

In [None]:
def dominant_topic(ldamodel=optimal_lda_model, corpus=pos_corpus, texts=df['selftext']):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the (topic_num, prop_topic) pair with the highest proportion
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        # get value of topic_names dict based on key
        topic_name = topic_names[topic_num]
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), topic_name, round(float(prop_topic), 4), topic_keywords])

    topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Dominant_Topic_Name', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output (reset the index so rows line up)
    contents = pd.Series(texts).reset_index(drop=True)
    topics_df = pd.concat([topics_df, contents], axis=1)
    return topics_df 

# Run function
df_topic_keywords = dominant_topic(ldamodel=optimal_lda_model, corpus=pos_corpus, texts=df['selftext'])

# Format
df_dominant_topic = df_topic_keywords.reset_index(drop=True)
df_dominant_topic.columns = ['Dominant_Topic', 'Dominant_Topic_Name', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic
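
The core step in `dominant_topic` is picking, for each document, the topic with the highest proportion from the list of `(topic_id, proportion)` pairs the model returns. The sketch below isolates that step on made-up values; `pick_dominant`, the toy pairs and the toy names are hypothetical and only serve as an illustration.

```python
def pick_dominant(doc_topics, names):
    """Return (topic_id, topic_name, proportion) for the most probable topic."""
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    return topic_id, names.get(topic_id, "UNNAMED"), round(prob, 4)

# Made-up topic distribution for one post, in the (topic_id, proportion) format
toy_doc_topics = [(0, 0.12), (3, 0.71), (7, 0.17)]
toy_names = {0: "family", 3: "work", 7: "money"}

toy_dominant = pick_dominant(toy_doc_topics, toy_names)
print(toy_dominant)
```

Keep in mind that a post with a dominant topic of, say, 0.35 is much less clearly "about" that topic than one with 0.9, so the proportion column is worth checking when you read individual posts.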

We can now find the posts that share a given dominant topic using `.loc`.

Change 'TOPIC NAME' below to a topic you have named above!

In [None]:
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic_Name'] == 'TOPIC NAME']

## Adding topics to original DF

Once we are happy with our topic names, we can add the dominant topic names to our original DataFrame and save it.

In [None]:
df['dom_topic'] = df_dominant_topic['Dominant_Topic_Name']
df['dom_topic_num'] = df_dominant_topic['Dominant_Topic']

df.to_csv('../../data/YOUR_FILE_pos_lemmas_topics.csv', index=False)

In [None]:
df.head(3)