In [None]:
!wget https://tufts.box.com/shared/static/325sgkodnq30ez61ugazvctif6r24hsu.csv -O daf.csv

# Traditional Topic Modeling using SKLearn

In this workshop, we'll be learning how to conduct topic modelling on text using `scikit-learn`.

This workshop builds on what we learned about TF-IDF in `Textual Feature Extraction using Traditional Machine Learning`. In that notebook, we saw how we could take sections of text from a larger work and turn them into a numerical representation of that text. We also saw how we might begin to manipulate this numerical data, for example using linear regression. In this workshop, we'll see a more complex transformation of the numerical data. That said, the underlying concept remains the same: we can split up our corpus into several texts and then use TF-IDF to transform this text into a matrix of numbers. Instead of using the dot product to determine similarity between chapters, though, we'll see how we can find similar word usage across different chapters.

**Topic modelling seeks to group together words which have a similar usage. These groups constitute a topic**. As a result, topic modelling can be particularly useful if you don't know what is in a text, but you know that it has distinct parts. We will look at two different approaches to topic modelling, Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). Both produce the same kind of output: a set of topics, each made up of weighted words.

## Data

For this example, as in `Textual Feature Extraction using Traditional Machine Learning`, we'll be using Edward Gibbon's *Decline and Fall of the Roman Empire* as our example text. I like using this source for topic modelling because, providing a history of Western Europe from the 200s to the 1450s AD, it's incredibly long and multifaceted. These are the sorts of texts for which topic modelling can be most useful, though feel free to use any other text instead.

In [None]:
import pandas as pd
import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import matplotlib.pyplot as plt

random_state = 1337  # fixed seed so the models below are reproducible

In [None]:
daf = pd.read_csv("daf.csv")[["title", "text"]]
daf

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(daf.iloc[0]["text"][:300])

In [None]:
# applying TF-IDF to the text in Gibbon
# feel free to play around with the default parameters
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
dtm = tfidf.fit_transform(daf["text"])  # dtm = document-term matrix
dtm

## Non-negative Matrix Factorization (NMF)

The NMF algorithm asks what document matrix $W$ and what word matrix $H$ would need to be multiplied together so that we arrive back at (an approximation of) our original document-term matrix (DTM), the TF-IDF transformed text. In other words, it attempts to group those words which would have to occur together in such a pattern so as to produce the original DTM. Importantly, this process requires the user to input the number of topics they think is appropriate for the text. Without that constraint, NMF could simply return its approximation of the exact same DTM, with each “topic” comprising just a single word. Instead, topic modeling seeks to project this approximation into the predefined number of topics, so it must group together words which tend to occur together.

It is for this reason that we import NMF from the `decomposition` module. Outside of NLP, NMF is used to reduce the dimensionality of high dimensional inputs, like images or complex tabular datasets. This is exactly what it's doing in this context as well: taking the features from our TF-IDF vectorization and reducing them to the given number of topics.

In this way, you can think of the topics of both NMF and LDA (which we will turn to after NMF) as weighted groupings of words, where the weights behave much like probabilities. All words are in all topics, but we care the most about those which are most likely to be part of a given topic. Thus, below, we'll see "THE TOP 15 WORDS" for each topic. These are the fifteen words most strongly associated with that topic.
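
The factorization described above can be sketched on a small matrix. The values in `toy_dtm` are invented and merely stand in for a TF-IDF matrix; `W` and `H` follow the naming in the text, and the choice of two topics is an assumption made for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term matrix": 4 documents x 5 terms
toy_dtm = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.0],
    [0.0, 0.0, 0.8, 1.0, 0.2],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=1337, max_iter=500)
W = toy_nmf.fit_transform(toy_dtm)  # document-topic weights, shape (4, 2)
H = toy_nmf.components_             # topic-term weights, shape (2, 5)

# Multiplying the two factors gives an approximation of the original matrix
approx = W @ H
error = float(np.linalg.norm(toy_dtm - approx))

print(W.shape, H.shape)
print(np.round(approx, 2))
print("reconstruction error:", round(error, 3))
```

Because only two topics are allowed, the model cannot memorize each term separately. It has to place terms that rise and fall together into the same row of `H`.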

In [None]:
nmf_model = NMF(n_components=12, random_state=random_state)
nmf_model.fit(dtm)

In [None]:
for index, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{index+1}")
    # note: we use `tfidf.get_feature_names_out` to access the actual words, not just the index position of them
    pp.pprint([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print("\n")

## Latent Dirichlet Allocation (LDA)

**Nota bene**: *Latent Dirichlet Allocation is usually abbreviated as 'LDA'. However, Linear Discriminant Analysis, another dimensionality reduction algorithm, is also frequently abbreviated as 'LDA'. This unfortunate coincidence may be confusing if you do your own research, so keep it in mind.*

Similar to NMF, LDA supposes that we can reverse engineer the process that generated the documents. The documents (here, the rows of the document-term matrix from our TF-IDF transformation) are understood as random mixtures over latent topics, where each topic is characterized by a probability distribution over all of the words in our document-term matrix. The *Dirichlet* in the name refers to the prior distribution from which these mixtures are assumed to be drawn.

In this way, NMF and LDA are quite similar: both reduce the dimensionality of text features (in our case, from TF-IDF). Where the two algorithms differ is in what each assumes about the data. NMF makes no explicit probabilistic assumption: it simply looks for non-negative matrices $W$ and $H$ whose product is as close as possible to the DTM. LDA is an explicitly probabilistic model: each document's mixture of topics is drawn from a Dirichlet distribution, and each word is then drawn from the word distribution of one of those topics. In theory, this makes LDA a more natural fit for text and NMF a more general-purpose tool.

In practice, however, determining which is better comes down to empirical testing and evaluation against your research question.
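
As a rough illustration of the "mixture over topics" idea, the sketch below fits LDA to a tiny matrix of word counts. The counts in `toy_counts` are invented, and raw counts are used here (rather than TF-IDF) only because they match LDA's generative story most directly.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical word-count matrix: 4 documents x 4 terms (invented values)
toy_counts = np.array([
    [5, 4, 0, 0],
    [4, 6, 1, 0],
    [0, 1, 5, 4],
    [0, 0, 6, 5],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=1337)
toy_lda.fit(toy_counts)

# transform() returns each document's mixture over topics; every row sums to 1
toy_doc_topics = toy_lda.transform(toy_counts)
print(np.round(toy_doc_topics, 2))
print(toy_doc_topics.sum(axis=1))

# Normalizing each row of components_ gives a per-topic distribution over terms
toy_topic_terms = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(np.round(toy_topic_terms, 2))
```

Unlike the NMF weights, both outputs here can be read directly as probabilities, since each row sums to one.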

In [None]:
lda_model = LatentDirichletAllocation(n_components=12, random_state=random_state)
lda_model.fit(dtm)

In [None]:
for index, topic in enumerate(lda_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{index+1}")
    pp.pprint([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print("\n")

## Visualizing Topic Models

### Visualizing topic probability

As we saw above, a topic is just a clustering together of words, based on how strongly they associate with each other. We can visualize these weights to get a better sense of what's happening.
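
Picking out a topic's top words relies on an `argsort` slice that can be hard to read at first. Here is a small sketch with an invented vocabulary and invented weights for a single topic:

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights (invented values)
toy_vocab = np.array(["emperor", "church", "army", "senate", "bishop"])
toy_topic = np.array([0.10, 0.70, 0.05, 0.02, 0.60])

top_n = 3
# argsort() orders indices from smallest to largest weight; this slice walks
# backwards from the end, so we get the top_n indices with the largest first
top_idx = toy_topic.argsort()[: -top_n - 1 : -1]

print(top_idx)             # [1 4 0]
print(toy_vocab[top_idx])  # ['church' 'bishop' 'emperor']
print(toy_topic[top_idx])  # [0.7 0.6 0.1]
```

The earlier cells use `argsort()[-15:]` instead, which selects the same kind of top words but lists them from smallest to largest weight.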

In [None]:
# parameters from above
n_components = 12
top_words = 15

# plotting
fig, axes = plt.subplots(2, (n_components + 1) // 2, figsize=(40, 15), sharex=True)
axes = axes.flatten()

# looping through each topic
for topic_idx, topic in enumerate(nmf_model.components_):
    # topic manipulation
    top_features_ind = topic.argsort()[
        : -top_words - 1 : -1
    ]  # getting top word indices for each topic
    top_features = [
        tfidf.get_feature_names_out()[i] for i in top_features_ind
    ]  # getting the actual text
    weights = topic[top_features_ind]  # getting the weight of each word in the topic

    # plotting
    ax = axes[topic_idx]
    ax.barh(top_features, weights, height=0.7)
    ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 20})
    ax.invert_yaxis()
    ax.tick_params(axis="both", which="major", labelsize=20)
    for i in "top right left".split():
        ax.spines[i].set_visible(False)
# figure-level title, set once after all subplots are drawn
fig.suptitle("NMF Model", fontsize=30)

plt.show()

### Visualizing topics over time

In addition to looking at the composition of each topic, we can also visualize how topics interact over the course of a single corpus.
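
One simple way to read such a chapter-by-topic table is to label each chapter with its highest-weighted topic. The sketch below assumes a small invented weight matrix (`toy_weights`); the `dominant_topic` column is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 5 chapters x 3 topics (invented values)
toy_weights = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.6],
    [0.1, 0.1, 0.9],
])

topic_cols = [f"Topic {i+1}" for i in range(toy_weights.shape[1])]
toy_df = pd.DataFrame(toy_weights, columns=topic_cols)
toy_df.insert(0, "chapter_number", np.arange(1, len(toy_df) + 1))

# For each chapter (row), find the topic column with the largest weight
toy_df["dominant_topic"] = toy_df[topic_cols].idxmax(axis=1)
print(toy_df)
```

A run of consecutive chapters sharing a dominant topic suggests that the author stays with one subject for a stretch before moving on.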

In [None]:
daf_time = daf.copy().reset_index().rename(columns={"index": "chapter_number"})
daf_time["chapter_number"] = daf_time["chapter_number"] + 1
daf_time

In [None]:
# adding a column for each topic to each chapter
# these represent the topic membership of each chapter
doc_topic_weights = nmf_model.transform(dtm)  # computed once, shape (n_chapters, n_components)
for i in range(n_components):
    daf_time[f"Topic {i+1}"] = doc_topic_weights[:, i]
daf_time

In [None]:
# .copy() gives us an independent DataFrame, so adding columns later is safe
nmf_weights = daf_time[["chapter_number"] + list(daf_time.columns[3:])].copy()
nmf_weights

In [None]:
# topic importance vs. chapter number
nmf_weights.plot("chapter_number", list(nmf_weights.columns[1:]), figsize=(25, 15))

In [None]:
# the plot above is a bit hard to read, so we can apply a Gaussian filter
# this smooths out chapter-to-chapter noise so the trends show more clearly
# (the "Normalized" columns below hold these smoothed weights)
from scipy.ndimage import gaussian_filter1d

for i in range(n_components):
    nmf_weights[f"Topic {i+1} Normalized"] = gaussian_filter1d(
        nmf_weights[f"Topic {i+1}"].to_numpy(), sigma=1
    )
nmf_weights

In [None]:
# topic importance vs. chapter number
nmf_weights.plot("chapter_number", list(nmf_weights.columns[13:]), figsize=(25, 15))

In [None]:
# clean up the plot a bit
fig, ax = plt.subplots(figsize=(20, 10))
linewidth = 3
alpha = 0.8

feature_names = tfidf.get_feature_names_out()  # look up the vocabulary once

for i in range(n_components):
    top_features_ind = nmf_model.components_[i].argsort()[: -top_words - 1 : -1]
    top_features = [feature_names[j] for j in top_features_ind]
    label = f'Topic {i+1}: {", ".join(top_features).replace("¾","æ")}...'
    ax.plot(
        nmf_weights["chapter_number"],
        nmf_weights[f"Topic {i+1} Normalized"],
        label=label,
        linewidth=linewidth,  # apply the styling defined above
        alpha=alpha,
    )

ax.title.set_text("Topic Importance Over Time")
ax.set_xlabel("Chapter Number")
ax.set_ylabel("Topic Weight")
ax.legend(bbox_to_anchor=(0, -0.1), loc="upper left")

### Interpretation

Interpreting topic models can be difficult. They fall under *unsupervised* learning, which means we don't have labeled data against which to evaluate our results. Thus, the interpretation of a topic model often depends on why someone is modeling the topics in a text in the first place. For example, one might want to create a topic model for Gibbon's *Decline and Fall* because they are interested in the importance of certain ideas about history as they develop over the course of the book. In this case, they might look at the figure above and see that most topics occur as spikes, meaning that Gibbon will treat a subject closely and then move on to a different one.

I recommend that you try these visualization examples with the LDA model, as I only used the NMF model. You'll get very different results. Why do you think that is? What does it tell you about the topics from the LDA model?