In [1]:
import bertopic
import pandas as pd
import sentence_transformers
from hdbscan import HDBSCAN
import numpy as np
from umap import UMAP

# Topic modelling

Topic modelling is the process of grouping large volumes of text into *topics*, that is, collections of texts that use similar words. Conceptually, topic modelling clusters representations of the texts according to some similarity metric. Because both the representation and the metric can be chosen in many ways, there are many different approaches and techniques for topic modelling.
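
To make the idea of a "similarity metric" concrete, here is a minimal sketch that compares texts by the cosine similarity of their word counts. This is only an illustration of the concept, not how BERTopic represents texts, and the headlines are made up rather than taken from the workshop dataset.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up headlines, used only for illustration
headlines = [
    "world leaders meet at climate summit",
    "climate summit opens as world leaders arrive",
    "local bakery wins national bread award",
]
sim_related = cosine_similarity(headlines[0], headlines[1])
sim_unrelated = cosine_similarity(headlines[0], headlines[2])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

A clustering method would place the first two headlines in the same group because they score as more similar to each other than either does to the third.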

For the purposes of this workshop we will use BERTopic, a method that leverages the embeddings that are discussed in one of the other tutorials. This example will use the same news content about `climate change` and `global warming` as the other examples.

In [None]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

At the simplest level, the BERTopic library handles both the embedding and clustering steps using its default parameters. We collect topic labels and probability assignments for each text in turn.

In [None]:
topic_model = bertopic.BERTopic()
topics, probs = topic_model.fit_transform(df_news.title[:1000])

How do we interpret this information? The topic label denotes which group a given text belongs to. Note that the topic `-1` is included as a catch-all for any texts that do not fit within a clear topic, and as such should not be considered a cohesive topic. The probabilities indicate the model's confidence that this is the correct topic assignment for a given text, and can vary significantly between texts.
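
As a rough sketch of how these two outputs can be used together, the snippet below separates outliers from confidently assigned texts. The `topics` and `probs` lists are made-up stand-ins for the values returned by `fit_transform`, and the `min_prob` cut-off is an arbitrary assumption, not a recommended value.

```python
# Hypothetical fit_transform output for eight texts (values are made up)
topics = [0, 1, -1, 0, 2, 1, 0, -1]
probs = [0.93, 0.71, 0.0, 0.88, 1.0, 0.64, 0.52, 0.0]

min_prob = 0.6  # arbitrary cut-off chosen for illustration

# Keep (row, topic, probability) for texts that are in a real topic
# and whose assignment probability reaches the cut-off
confident = [
    (row, topic, prob)
    for row, (topic, prob) in enumerate(zip(topics, probs))
    if topic != -1 and prob >= min_prob
]

# Rows that landed in the catch-all topic -1
outlier_rows = [row for row, topic in enumerate(topics) if topic == -1]

print("confident rows:", [row for row, _, _ in confident])
print("outlier rows:", outlier_rows)
```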

These labels and probabilities are useful, but they don't tell us much about the texts. Luckily, BERTopic automatically processes summary information for each topic.

In [None]:
df_topics = topic_model.get_topic_info()
df_topics

This topic information summarises the clusters that have formed around the texts. `Count` indicates the number of texts assigned to a topic, `Representation` uses a class-based variant of TF-IDF to rank the key terms that characterise the topic, and `Representative_Docs` returns the texts closest to the topic's TF-IDF representation. These representative words and docs are typically enough to characterise the largest topics.
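
To give a feel for how terms can be ranked per topic, here is a simplified sketch of a class-based TF-IDF. It assumes the weighting `tf * log(1 + A / f)`, where `tf` is a term's relative frequency within a topic, `f` is its frequency across all topics and `A` is the average number of words per topic. Treat the exact formula as an assumption and check the BERTopic documentation for the library's own implementation. The documents are made up.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """Score each term per topic with a simplified class-based TF-IDF."""
    # Join each topic's documents into one "class document" and count its words
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of each term across all topics
    total_per_term = Counter()
    for topic_counts in counts.values():
        total_per_term.update(topic_counts)
    # Average number of words per topic
    avg_words = sum(total_per_term.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in topic_counts.items()
        }
    return scores


def top_terms(term_scores: dict[str, float], n: int = 3) -> list[str]:
    """Return the n highest-scoring terms for one topic."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:n]


# Made-up documents grouped by a hypothetical topic label
docs_by_topic = {
    0: ["pope urges climate action", "pope francis speaks on climate"],
    1: ["modi arrives at climate summit", "modi addresses climate summit"],
}
scores = c_tf_idf(docs_by_topic)
for topic, term_scores in scores.items():
    print(topic, top_terms(term_scores))
```

Terms that are frequent in one topic but rare elsewhere score highest, which is why a word shared by every topic (here "climate") ranks below the topic-specific names.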

In the example above, we can see topics forming around Pope Francis, Canadian politics and Indian PM Modi arriving at COP26. Many of the topics reference COP26, a key event at the time this data was produced.

Topic size is an important factor when considering the quality of a topic. If a topic is small relative to the size of your dataset, it is more likely to have formed around highly specific cases or anomalies in the data. Here the smallest topic is approximately 1% of the dataset at 11 documents. Normally, we can avoid these issues with topic size by considering only the largest topics for subsequent analysis, such as the top 10.
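
The selection of the largest topics can be sketched as follows. The dictionary is a made-up stand-in for the `Topic` and `Count` columns of `df_topics`, and `top_n` is an arbitrary choice; with the real DataFrame the same filtering could be done directly in pandas.

```python
# Hypothetical topic sizes, keyed by topic label (values are made up)
topic_counts = {-1: 310, 0: 120, 1: 95, 2: 40, 3: 11}
n_docs = sum(topic_counts.values())

top_n = 3  # number of topics to keep, chosen for illustration

# Drop the outlier topic -1, then keep the top_n topics by size
largest = sorted(
    ((topic, count) for topic, count in topic_counts.items() if topic != -1),
    key=lambda pair: pair[1],
    reverse=True,
)[:top_n]

# Share of the whole dataset covered by each retained topic
shares = {topic: count / n_docs for topic, count in largest}

for topic, share in shares.items():
    print(f"topic {topic}: {share:.1%} of documents")
```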

A major strength of the BERTopic method is how customisable it is with different embedding components. We can pass any embeddings we want to the model to check their effect on the topics we find. Here we'll try the `all-distilroberta-v1` model, but you can find a list of pre-trained models in the [documentation](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).

In [None]:
sentence_model = sentence_transformers.SentenceTransformer('all-distilroberta-v1')

roberta_topic_model = bertopic.BERTopic(embedding_model=sentence_model)
rob_topics, rob_probs = roberta_topic_model.fit_transform(df_news.title[:1000])
df_roberta_topics = roberta_topic_model.get_topic_info()
df_roberta_topics

We can see that some of the detected topics are consistent between different embedding models, whereas others have changed. Canadian politics, Boris Johnson and Pope Francis are common catalysts, whereas others such as COP26 are captured differently. Comparing topics across different model parameters can be a good way to assess their quality. Those that are consistent are likely to be more robust and clearly defined in your data.
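
One possible way to make this comparison less subjective is to measure the overlap between the key terms of topics from the two models. The sketch below uses Jaccard similarity on made-up term sets. In practice the terms could be taken from each model's topic representation, for example via `get_topic`, and the choice of Jaccard similarity is our assumption rather than something BERTopic prescribes.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical top terms per topic from two fitted models (made up)
model_a = {
    0: {"pope", "francis", "vatican", "climate"},
    1: {"trudeau", "canada", "minister", "climate"},
}
model_b = {
    0: {"canada", "trudeau", "cabinet", "climate"},
    1: {"cop26", "glasgow", "summit", "leaders"},
    2: {"pope", "francis", "vatican", "bbc"},
}

# For each topic in model A, find the most similar topic in model B
best_match = {}
for topic_a, terms_a in model_a.items():
    topic_b, score = max(
        ((tb, jaccard(terms_a, terms_b)) for tb, terms_b in model_b.items()),
        key=lambda pair: pair[1],
    )
    best_match[topic_a] = (topic_b, score)

for topic_a, (topic_b, score) in best_match.items():
    print(f"A{topic_a} best matches B{topic_b} (overlap {score:.2f})")
```

Topics in one model with no high-overlap counterpart in the other are the ones to treat with more caution.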

At present, BERTopic recomputes the embeddings each time the model is fit. This can be resource intensive (in both time and CPU cycles) when applied to large datasets. You can instead pass precomputed embeddings to the model, in which case BERTopic only looks for the best way to group texts together.

In [None]:
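# Encode the titles once so the embeddings can be reused across fits
docs = df_news.title[:1000].tolist()
embs = sentence_model.encode(docs)

roberta_topic_model = bertopic.BERTopic(embedding_model=sentence_model)
# Pass the precomputed embeddings so BERTopic skips its own embedding step
rob_topics, rob_probs = roberta_topic_model.fit_transform(docs, embeddings=embs)
df_roberta_topics = roberta_topic_model.get_topic_info()
df_roberta_topics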
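
Precomputed embeddings can also be stored on disk so they survive a kernel restart. The following is a minimal sketch of a compute-once cache. `load_or_compute` and `fake_encode` are hypothetical helpers written for this illustration, with `fake_encode` standing in for a call such as `sentence_model.encode(...)`, and `pickle` is just one possible storage format.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path: Path, compute):
    """Return cached embeddings if present, otherwise compute and cache them."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    embeddings = compute()
    with cache_path.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


calls = []


def fake_encode():
    # Stand-in for the real embedding call; records each time it runs
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "title_embeddings.pkl"
    first = load_or_compute(cache, fake_encode)   # computes and writes the cache
    second = load_or_compute(cache, fake_encode)  # reads from the cache

print("encode calls:", len(calls))
```

Only load pickle files that you created yourself, since unpickling untrusted data can execute arbitrary code.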

Once again, we can see some variation in the topics that have been returned, even though we are passing the same data and embedding model. This is because BERTopic has stochastic elements, most notably the UMAP dimensionality reduction step. To fix the random state for reproducibility purposes, we need to customise the clustering and dimensionality reduction models.

By default, BERTopic uses the `HDBSCAN` model for clustering. We can define the model ourselves to alter the clustering parameters - review the [HDBSCAN documentation](https://hdbscan.readthedocs.io/en/latest/) for a full list of options. We'll also set a global seed in `numpy` to fix the random state.

In [None]:
np.random.seed(42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = bertopic.BERTopic(hdbscan_model=hdbscan_model)

It is possible to substitute one of a number of clustering algorithms into BERTopic - the [documentation](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) gives a list of examples.

The final component of BERTopic we can vary is the dimensionality reduction step. This process reduces some of the complexity in the model and makes it more memory efficient (as noted when we discussed the value of sentence embeddings over bag of words and TF-IDF methods). As with the embedding and clustering methods, there are many options that can be easily substituted, or updated for finer control of the method, with examples listed in the [documentation](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html).

In [None]:
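# random_state fixes UMAP's own seed so repeated fits give the same reduced embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = bertopic.BERTopic(umap_model=umap_model)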

# Exercises

Fit a topic model over the first 1000 article bodies using the default BERTopic settings. Find the list of row numbers in the dataset that are about the president of Zimbabwe according to the topic model.

The topics and probabilities assigned by BERTopic correspond to the best fit. Using the probabilities returned for each text assigned to the topics in your previous model, find the topics with the highest and lowest mean probability. Do these values tell you anything about the topics highlighted?