In [18]:
import bertopic
import pandas as pd
import sentence_transformers
from hdbscan import HDBSCAN
import numpy as np
from umap import UMAP

# Topic modelling

Topic modelling is the process of grouping large volumes of text into *topics*, that is, collections of texts that use similar words. Conceptually, topic modelling clusters representations of the texts according to some similarity metric. Because both the representation and the metric can be chosen in many ways, there are many different approaches and techniques for topic modelling.

For the purposes of this workshop we will use BERTopic, a method that leverages the embeddings that are discussed in one of the other tutorials. This example will use the same news content about `climate change` and `global warming` as the other examples.

In [3]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

At the simplest level, the BERTopic library handles both the embedding and clustering steps using its default parameters. We collect topic labels and probability assignments for each text in turn.

In [4]:
topic_model = bertopic.BERTopic()
topics, probs = topic_model.fit_transform(df_news.title[:1000])

How do we interpret this information? The topic label denotes which group a given text belongs to. Note that the topic `-1` is a catch-all for any texts that do not fit within a clear topic, and as such should not be considered a cohesive topic. The probabilities measure how confident the topic model is that a given text has been assigned to the correct topic. This can vary significantly between texts.
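
As a rough illustration of how these two outputs can be read together, the sketch below counts the texts per topic, measures how many fell into topic `-1`, and keeps only confidently assigned texts. The `topics` and `probs` values are hypothetical stand-ins for the output of `fit_transform`, and the 0.8 threshold is an arbitrary choice for illustration.

```python
from collections import Counter

# Hypothetical stand-ins for the `topics` and `probs` returned by fit_transform.
topics = [0, 1, -1, 0, 2, -1, 1, 0]
probs = [0.91, 0.78, 0.0, 0.66, 1.0, 0.0, 0.83, 0.95]

# Number of texts assigned to each topic label.
topic_sizes = Counter(topics)

# Share of texts that fell into the catch-all topic -1.
outlier_share = topic_sizes[-1] / len(topics)

# Row positions of texts in a real topic with probability of at least 0.8
# (the threshold is arbitrary and only for illustration).
confident_rows = [
    i for i, (t, p) in enumerate(zip(topics, probs)) if t != -1 and p >= 0.8
]

print(topic_sizes.most_common())
print(f"Outlier share: {outlier_share:.0%}")
print(f"Confident rows: {confident_rows}")
```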

These labels and probabilities are useful, but they don't tell us much about the texts. Luckily, BERTopic automatically produces summary information for each topic.

In [5]:
df_topics = topic_model.get_topic_info()
df_topics

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,143,-1_and_us_to_crises,"[and, us, to, crises, erdogan, turkey, the, av...",[Biden tells Erdogan that US and Turkey must a...
1,0,74,0_trudeau_wanted_canada_ambitious,"[trudeau, wanted, canada, ambitious, though, m...","[Trudeau says climate progress made at G20, th..."
2,1,48,1_modi_glasgow_in_arrives,"[modi, glasgow, in, arrives, uk, pm, minister,...",[modi: PM Modi arrives in UK for COP26 summi...
3,2,43,2_paristype_exun_doesnt_see,"[paristype, exun, doesnt, see, moment, chief, ...",[Ex-UN climate chief doesn't see Paris-type mo...
4,3,42,3_la_of_de_2021,"[la, of, de, 2021, climate, the, change, scary...",[Los efectos de la crisis climática en los océ...
5,4,37,4_too_worries_production_cutting,"[too, worries, production, cutting, fast, hurt...",[Biden says he worries that cutting oil produc...
6,5,36,5_neutrality_coal_mild_financing,"[neutrality, coal, mild, financing, pledges, m...","[G-20 make mild pledges on climate neutrality,..."
7,6,36,6_clock_down_humanity_run,"[clock, down, humanity, run, has, pm, act, now...",[Humanity has run down the clock on climate ch...
8,7,33,7_pope_francis_everything_impression,"[pope, francis, everything, impression, strong...",[Meeting with Pope Francis leaves a strong imp...
9,8,31,8_cop26_change_climate_opens,"[cop26, change, climate, opens, future, action...",[COP26 begins as countries plan future actions...


This topic information summarises the clusters that have formed around the texts. `Count` indicates the number of texts assigned to a topic, `Representation` uses a class-based variant of TF-IDF (c-TF-IDF) to rank the key terms that characterise a topic, and `Representative_Docs` returns the texts closest to the topic's TF-IDF representation. These representative words and docs are typically enough to characterise the largest topics.
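
To give a feel for how the key terms are ranked, here is a simplified sketch of the class-based TF-IDF idea: all texts in a topic are treated as one long document, and each word is scored by its frequency in that topic multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the word's frequency across all topics. The toy titles and the `top_terms` helper are hypothetical, and BERTopic's own implementation differs in details such as tokenisation and normalisation.

```python
import math
from collections import Counter

# Hypothetical toy topics: each topic is the list of titles assigned to it.
topic_docs = {
    0: ["trudeau wanted ambitious climate deal", "canada wanted ambitious climate target"],
    1: ["pope francis urges climate action", "pope francis meets leaders on climate"],
}

# Treat each topic as one long document and count its words.
term_counts = {t: Counter(" ".join(docs).split()) for t, docs in topic_docs.items()}

# Total frequency of each word across all topics, and average words per topic.
total_counts = Counter()
for counts in term_counts.values():
    total_counts.update(counts)
avg_words = sum(total_counts.values()) / len(term_counts)


def top_terms(topic, n=2):
    """Rank a topic's words by term frequency x log(1 + avg_words / total frequency)."""
    counts = term_counts[topic]
    n_words = sum(counts.values())
    scores = {
        word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
        for word, count in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


for topic in term_counts:
    print(topic, top_terms(topic))
```

In this toy example the word "climate" is frequent in both topics, so it is down-weighted relative to words that are frequent in only one of them.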

In the example above, we can see topics forming around Pope Francis, Canadian politics and Indian PM Modi arriving at COP26. Many of the topics reference COP26, a key event at the time this data was produced.

Topic size is an important factor when considering the quality of a topic. If a topic is small relative to the size of your dataset, it is more likely to have formed around highly specific cases or anomalies in the data. Here the smallest topic is approximately 1% of the dataset at 11 documents. Normally, we can avoid these issues by considering only the largest topics for subsequent analysis, such as the top 10.
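
The size check described here can be sketched in a few lines. The counts below are copied from the first ten rows of the topic table above, and the cut-off of five topics is an arbitrary choice for illustration.

```python
# Topic sizes copied from the first ten rows of the table above (topic label -> Count).
topic_counts = {-1: 143, 0: 74, 1: 48, 2: 43, 3: 42, 4: 37, 5: 36, 6: 36, 7: 33, 8: 31}
n_docs = 1000  # number of titles passed to fit_transform

# Drop the catch-all topic, then express each size as a share of the dataset.
shares = {t: c / n_docs for t, c in topic_counts.items() if t != -1}

# Keep the largest topics only; top_n = 5 is an arbitrary illustrative cut-off.
top_n = 5
largest = sorted(shares, key=shares.get, reverse=True)[:top_n]

for t in largest:
    print(f"Topic {t}: {shares[t]:.1%} of the dataset")
```

With the real `df_topics` DataFrame, a similar selection could be written as `df_topics[df_topics.Topic != -1].nlargest(10, 'Count')`.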

A major strength of the BERTopic method is how customisable it is with different embedding components. We can pass any embeddings we want to the model to check their effect on the topics we find. Here we'll try the `all-distilroberta-v1` model, but you can find a list of pre-trained models in the [documentation](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).

In [8]:
sentence_model = sentence_transformers.SentenceTransformer('all-distilroberta-v1')

roberta_topic_model = bertopic.BERTopic(embedding_model=sentence_model)
rob_topics, rob_probs = roberta_topic_model.fit_transform(df_news.title[:1000])
df_roberta_topics = roberta_topic_model.get_topic_info()
df_roberta_topics

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,109,-1_us_turkey_erdogan_crises,"[us, turkey, erdogan, crises, cooperation, avo...",[Biden tells Erdogan that US and Turkey must a...
1,0,74,0_trudeau_wanted_canada_ambitious,"[trudeau, wanted, canada, ambitious, though, m...","[Trudeau says climate progress made at G20, th..."
2,1,52,1_cop26_to_success_world,"[cop26, to, success, world, what, change, clim...","['Time to do the right thing,' negotiators tol..."
3,2,46,2_can_clean_of_how,"[can, clean, of, how, energy, trash, burn, pla...",[Trash and Burn: Big Brands&#039; New Plastic ...
4,3,44,3_ex_type_doesn_chief,"[ex, type, doesn, chief, see, paris, moment, u...",[Ex-UN climate chief doesn't see Paris-type mo...
5,4,40,4_clock_humanity_down_run,"[clock, humanity, down, run, has, pm, act, now...",[Humanity has run down the clock on climate ch...
6,5,36,5_years_la_de_of,"[years, la, de, of, hottest, 2021, climate, in...",[Los efectos de la crisis climática en los océ...
7,6,36,6_too_production_hurt_fast,"[too, production, hurt, fast, cutting, worries...",[Biden says he worries that cutting oil produc...
8,7,35,7_g20_eu_global_offers,"[g20, eu, global, offers, deal, steel, disappo...",[G20 disappoints on key climate target as eyes...
9,8,34,8_neutrality_coal_financing_mild,"[neutrality, coal, financing, mild, pledges, m...","[G-20 make mild pledges on climate neutrality,..."


We can see that some of the detected topics are consistent between different embedding models, whereas others have changed. Canadian politics, Boris Johnson and Pope Francis are common catalysts, whereas others such as COP26 are captured differently. Comparing topics across different model parameters can be a good way to assess their quality. Those that are consistent are likely to be more robust and clearly defined in your data.
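
One simple way to make this comparison less subjective is to measure the overlap between the key terms of topics from the two models. The sketch below uses Jaccard similarity on a few term lists copied from the tables above (which are truncated, so the overlaps are only indicative); the `jaccard` helper and the choice of topics are ours, not part of BERTopic.

```python
def jaccard(a, b):
    """Share of words two topics have in common: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Leading key terms copied from the two tables above (the lists there are truncated).
default_topics = {
    0: ["trudeau", "wanted", "canada", "ambitious", "though"],
    6: ["clock", "down", "humanity", "run", "has", "pm", "act", "now"],
    7: ["pope", "francis", "everything", "impression"],
}
roberta_topics = {
    0: ["trudeau", "wanted", "canada", "ambitious", "though"],
    4: ["clock", "humanity", "down", "run", "has", "pm", "act", "now"],
    7: ["g20", "eu", "global", "offers", "deal", "steel"],
}

# For each topic from the first model, find its closest match in the second.
best_match = {}
for t_a, words_a in default_topics.items():
    t_b = max(roberta_topics, key=lambda t: jaccard(words_a, roberta_topics[t]))
    best_match[t_a] = (t_b, jaccard(words_a, roberta_topics[t_b]))

for t_a, (t_b, score) in best_match.items():
    print(f"default topic {t_a} -> roberta topic {t_b} (overlap {score:.2f})")
```

An overlap of zero means no counterpart was found among the topics compared. With fitted models, the full term lists can be read from `topic_model.get_topic(t)`, which returns `(word, score)` pairs.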

At present, BERTopic recomputes the embeddings each time the model is fit. This can be resource intensive (in both time and CPU cycles) when applied to large datasets. You can instead pass precomputed embeddings to the model, in which case BERTopic only looks for the best way to group texts together.
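
Since the embeddings depend only on the texts and the embedding model, they can also be computed once and cached on disk between sessions. The sketch below shows one possible pattern using `numpy`. The random array is a stand-in for the output of `sentence_model.encode(...)`, and the file name and temporary directory are hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for `sentence_model.encode(texts)`: 1000 texts x 768 dimensions.
# The shape and values are hypothetical; only the caching pattern matters here.
rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 768)).astype(np.float32)

# Hypothetical cache location; a project would normally use a fixed path.
cache_path = os.path.join(tempfile.mkdtemp(), "title_embeddings.npy")

# Save once after encoding...
np.save(cache_path, embs)

# ...and reload in later sessions instead of encoding again.
cached_embs = np.load(cache_path)
print(cached_embs.shape, np.array_equal(embs, cached_embs))
```

The reloaded array can be passed to `fit_transform` alongside the texts, provided its rows are in the same order as the texts.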

In [10]:
embs = sentence_model.encode(df_news.title[:1000])

roberta_topic_model = bertopic.BERTopic(embedding_model=sentence_model)
rob_topics, rob_probs = roberta_topic_model.fit_transform(df_news.title[:1000], embs)
df_roberta_topics = roberta_topic_model.get_topic_info()
df_roberta_topics

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,112,-1_tells_crises_erdogan_turkey,"[tells, crises, erdogan, turkey, us, avoid, mi...",[Biden tells Erdogan that US and Turkey must a...
1,0,74,0_trudeau_wanted_canada_ambitious,"[trudeau, wanted, canada, ambitious, though, m...","[Trudeau says climate progress made at G20, th..."
2,1,53,1_cop26_to_success_world,"[cop26, to, success, world, what, change, clim...","['Time to do the right thing,' negotiators tol..."
3,2,44,2_ex_type_doesn_chief,"[ex, type, doesn, chief, see, paris, moment, u...",[Ex-UN climate chief doesn't see Paris-type mo...
4,3,41,3_humanity_down_clock_run,"[humanity, down, clock, run, has, pm, act, cha...",[Humanity has run down the clock on climate ch...
5,4,40,4_clean_how_can_energy,"[clean, how, can, energy, plastic, burn, susta...",[Trash and Burn: Big Brands&#039; New Plastic ...
6,5,36,5_too_production_hurt_fast,"[too, production, hurt, fast, cutting, worries...",[Biden says he worries that cutting oil produc...
7,6,35,6_neutrality_coal_financing_mild,"[neutrality, coal, financing, mild, pledges, m...","[G-20 Make Mild Pledges On Climate Neutrality,..."
8,7,34,7_years_la_de_of,"[years, la, de, of, hottest, 2021, climate, in...",[Los efectos de la crisis climática en los océ...
9,8,32,8_pope_francis_impression_strong,"[pope, francis, impression, strong, everything...",[Meeting with Pope Francis leaves a strong imp...


Once again, we can see some variation in the topics that have been returned, even though we are passing the same data and embedding model. This is because BERTopic has stochastic elements, most notably the UMAP dimensionality reduction step. To fix the random state for reproducibility purposes, we need to customise the clustering and dimensionality reduction models.

By default, BERTopic uses the `HDBSCAN` model for clustering. We can define the model ourselves to alter the clustering parameters - review the [HDBSCAN documentation](https://hdbscan.readthedocs.io/en/latest/) for a full list of options. We'll also set a global seed in `numpy` to fix the random state.

In [16]:
np.random.seed(42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = bertopic.BERTopic(hdbscan_model=hdbscan_model)

It is possible to substitute one of a number of clustering algorithms into BERTopic - the [documentation](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) gives a list of examples.

The final component of BERTopic we can vary is the dimensionality reduction step. This process reduces some of the complexity in the model and makes it more memory efficient (as noted when we discussed the value of sentence embeddings over bag of words and TF-IDF methods). As with the embedding and clustering methods, there are many options that can be easily substituted, or updated for finer control of the method, with examples listed in the [documentation](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html).

In [20]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = bertopic.BERTopic(umap_model=umap_model)
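
Note that each `BERTopic(...)` call creates a fresh model, so the cell above keeps the custom UMAP model but not the earlier HDBSCAN settings. A minimal sketch of combining both components follows, assuming we also want to seed UMAP directly through its `random_state` parameter (the value 42 is arbitrary).

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# random_state fixes UMAP's random initialisation; 42 is an arbitrary choice.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Pass both customised components to the same model.
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

Fixing `random_state` should make repeated runs on the same data return the same topics, although it may also make UMAP slower because some of its parallelism is switched off.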

# Exercises

Fit a topic model over the first 1000 article bodies using the default BERTopic settings. Find the list of row numbers in the dataset that are about the president of Zimbabwe according to the topic model.

In [23]:
## Note that there is some variation in the solutions under the default settings - stochasticity means the topic assignments will vary.
topic_model = bertopic.BERTopic()
topics, probs = topic_model.fit_transform(df_news.body[:1000])
df_topics = topic_model.get_topic_info()
df_topics

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,202,-1_the_of_and_to,"[the, of, and, to, in, that, is, on, for, with]","[THE PRESIDENT: Please, sit down. I apologize ..."
1,0,92,0_the_of_in_to,"[the, of, in, to, and, climate, is, that, for,...",[More than three decades have passed since for...
2,1,74,1_g20_trudeau_to_for,"[g20, trudeau, to, for, said, leaders, was, by...",[Canada wanted a stronger and more ambitious a...
3,2,47,2_he_biden_pope_his,"[he, biden, pope, his, francis, knew, the, my,...",[A visibly moved President Joe Biden on Sunday...
4,3,43,3_the_to_g20_countries,"[the, to, g20, countries, of, that, and, leade...",[ROME (AP) - Leaders of the world's biggest ec...
5,4,39,4_to_the_will_climate,"[to, the, will, climate, and, of, johnson, in,...",[Prime Minister will warn it is 'one minute to...
6,5,37,5_biden_he_said_his,"[biden, he, said, his, about, pump, to, produc...",[President Biden said on Sunday that the world...
7,6,35,6_and_we_the_of,"[and, we, the, of, to, our, vaccines, in, glob...","[ROME, Oct. 31 (Xinhua) -- The 16th Group of 2..."
8,7,32,7_to_and_the_of,"[to, and, the, of, new, stuff, in, for, is, are]",[Batten down the hatches. Hurricane Blah Blah ...
9,8,30,8_to_countries_and_climate,"[to, countries, and, climate, action, change, ...","[Humanity has ""run down the clock"" on climate ..."


In [25]:
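## Topic 26 is the label the Zimbabwe topic received in this run; check df_topics for the label in your own run.
print(f'Row ids corresponding to the president of Zimbabwe are: {[i for i,t in enumerate(topics) if t == 26]}')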

Row ids corresponding to the president of Zimbabwe are: [53, 197, 208, 289, 490, 512, 542, 595, 662, 760, 938, 972]


The topics and probabilities assigned by BERTopic correspond to the best fit. Using the probabilities returned for each text assigned to the topics in your previous model, find the topics with the highest and lowest mean probability. Do these values tell you anything about the topics highlighted?

In [27]:
## First we convert topics and probs to a more convenient format.
## This gives a list of probabilities for each topic label.
topic_probs = {}
for t, p in zip(topics, probs):
    if t not in topic_probs:
        topic_probs[t] = []
    topic_probs[t].append(p)

## Next we average them and record the maximum and minimum we've seen.
max_avg_topic = -2
min_avg_topic = -2
max_avg = 0
min_avg = 1
for t in topic_probs:
    avg = np.mean(topic_probs[t])
    if avg > max_avg:
        max_avg_topic = t
        max_avg = avg
    if avg < min_avg:
        min_avg_topic = t
        min_avg = avg

## This is almost certainly going to be topic -1 with probability 0 - BERTopic doesn't record probabilities for the junk topic.
## This is because they don't really make sense.
print(f'The topic with lowest average probability is {min_avg_topic} with probability {min_avg}.')
## This is probably one of the smaller topics with a very high probability. Be careful of overfitting to duplicate or
## very similar texts in your dataset causing the model to converge anomalously.
print(f'The topic with highest average probability is {max_avg_topic} with probability {max_avg}.')

The topic with lowest average probability is -1 with probability 0.0.
The topic with highest average probability is 24 with probability 0.992740887621854.
