In [8]:
import bertopic
import pandas as pd

# Topic modelling

Topic modelling is the process of grouping large volumes of text into *topics*, that is, collections of texts that use similar words. Conceptually, a topic model clusters representations of the texts according to some similarity metric. Because both the representation and the clustering method can vary, there are many different approaches and techniques for topic modelling.
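
To make the idea of clustering texts by a similarity metric concrete, here is a minimal toy sketch. It is not how BERTopic works: it represents each text as a set of words, uses Jaccard overlap as the similarity metric, and groups texts greedily. The titles, the `greedy_cluster` helper and the `threshold` value are all made up for illustration.

```python
def jaccard(a, b):
    """Share of words two texts have in common (0 = none, 1 = identical sets)."""
    return len(a & b) / len(a | b)


def greedy_cluster(texts, threshold=0.3):
    """Assign each text to the first cluster whose seed text is similar enough."""
    token_sets = [set(text.lower().split()) for text in texts]
    clusters = []  # each cluster is a list of indices into `texts`
    for i, tokens in enumerate(token_sets):
        for cluster in clusters:
            seed = token_sets[cluster[0]]
            if jaccard(tokens, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical titles, loosely in the style of the news data.
titles = [
    "pope francis urges climate action",
    "modi arrives in glasgow for summit",
    "pope francis meets leaders on climate",
    "modi speaks at glasgow summit",
]

clusters = greedy_cluster(titles)
print(clusters)
```

Swapping the word sets for embeddings and the greedy grouping for a proper clustering algorithm gives the general shape of an embedding-based topic model.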

For the purposes of this workshop we will use BERTopic, a method that leverages the embeddings that are discussed in one of the other tutorials. This example will use the same news content about `climate change` and `global warming` as the other examples.

In [9]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

At the simplest level, the BERTopic library handles both the embedding and clustering steps using its default parameters. We collect topic labels and probability assignments for each text in turn.

In [10]:
topic_model = bertopic.BERTopic()
topics, probs = topic_model.fit_transform(df_news.title[:1000])

How do we interpret this information? The topic label denotes which group a given text belongs to. Note that the topic `-1` is included as a catch-all for any texts that do not fit within a clear topic, and as such should not be considered a cohesive topic. Each probability measures how confident the topic model is that the assigned topic is correct for a given text. This can vary significantly between texts.

These labels and probabilities are useful, but they don't tell us much about the texts. Luckily, BERTopic automatically produces summary information for each topic.
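
As a rough sketch of how these two outputs might be inspected, the snippet below uses small stand-in lists rather than the real `topics` and `probs` from the model. The values and the `min_prob` cut-off are assumptions chosen for illustration only.

```python
from collections import Counter

# Stand-in values: one topic label and one probability per text.
topics = [0, -1, 1, 0, -1, 1]
probs = [0.93, 0.0, 0.58, 1.0, 0.0, 0.41]

# How many texts fall into each topic, including the -1 catch-all.
topic_sizes = Counter(topics)
outlier_share = topic_sizes[-1] / len(topics)

# Keep only texts with a real topic and a reasonably confident assignment.
min_prob = 0.5  # hypothetical cut-off, not a BERTopic default
confident = [
    (i, topic, prob)
    for i, (topic, prob) in enumerate(zip(topics, probs))
    if topic != -1 and prob >= min_prob
]

print(topic_sizes)
print(f"Share of texts in the -1 catch-all: {outlier_share:.0%}")
print(confident)
```

The same pattern applies to the model output, since `zip` pairs each label with its probability in document order.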

In [13]:
df_topics = topic_model.get_topic_info()
df_topics

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,164,-1_us_and_to_the,"[us, and, to, the, turkey, crises, erdogan, av...",[Biden tells Erdogan that US and Turkey must a...
1,0,74,0_trudeau_wanted_canada_ambitious,"[trudeau, wanted, canada, ambitious, though, m...","[Trudeau says climate progress made at G20, th..."
2,1,48,1_modi_glasgow_arrives_in,"[modi, glasgow, arrives, in, uk, summit, cop26...",[World News | PM Modi Arrives in UK for COP26 ...
3,2,43,2_paristype_exun_doesnt_see,"[paristype, exun, doesnt, see, moment, chief, ...",[Ex-UN climate chief doesn't see Paris-type mo...
4,3,43,3_how_de_clean_the,"[how, de, clean, the, of, climate, to, india, ...",[A climate power: India must be atmanirbhar in...
5,4,37,4_too_worries_production_cutting,"[too, worries, production, cutting, fast, hurt...",[Biden says he worries that cutting oil produc...
6,5,36,5_clock_down_humanity_run,"[clock, down, humanity, run, has, pm, act, now...",[Humanity has run down the clock on climate ch...
7,6,36,6_neutrality_coal_mild_financing,"[neutrality, coal, mild, financing, pledges, m...","[G-20 make mild pledges on climate neutrality,..."
8,7,33,7_pope_francis_everything_impression,"[pope, francis, everything, impression, strong...",[Meeting with Pope Francis leaves a strong imp...
9,8,29,8_charles_warlike_footing_tackle,"[charles, warlike, footing, tackle, sends, ale...",['We need to act now' - UK's Johnson sends cli...


This set of topic information summarises the clusters that form around the texts. `Count` indicates the number of texts assigned to a topic, `Representation` uses a class-based variant of TF-IDF (c-TF-IDF) to rank the key terms that characterise a topic, and `Representative_Docs` returns the texts closest to the topic's TF-IDF representation. These representative words and docs are typically enough to characterise the largest topics.
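
To show how a TF-IDF style score can rank the key terms of a topic, here is a simplified sketch of the class-based idea: all texts in a topic are merged into one bag of words, and a term scores highly when it is frequent in that topic but rare across topics. This is an illustration under stated assumptions, not BERTopic's implementation; the `class_tfidf` and `top_terms` helpers and the example texts are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(docs_by_topic):
    """Score each term per topic: term frequency within the topic, weighted by rarity."""
    # Merge all documents of a topic into a single bag of words.
    counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }
    # Average number of words per topic, and each term's frequency across all topics.
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)
    overall = Counter()
    for c in counts.values():
        overall.update(c)

    scores = {}
    for topic, c in counts.items():
        n_words = sum(c.values())
        scores[topic] = {
            word: (k / n_words) * math.log(1 + avg_words / overall[word])
            for word, k in c.items()
        }
    return scores


def top_terms(scores, n=3):
    """Return the n highest-scoring terms per topic (ties broken alphabetically)."""
    return {
        topic: [w for w, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for topic, s in scores.items()
    }


# Hypothetical texts already grouped by topic label.
docs_by_topic = {
    0: ["trudeau says canada climate plan ambitious", "canada climate target trudeau"],
    1: ["modi arrives in glasgow for climate summit", "modi glasgow climate speech"],
}

scores = class_tfidf(docs_by_topic)
print(top_terms(scores, n=2))
```

Note that "climate" appears in both topics, so it is down-weighted relative to terms that are specific to one topic.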

In the example above, we can see topics forming around Pope Francis (topic 7), Canadian politics (topic 0) and Indian PM Modi arriving at COP26 (topic 1). Many of the topics reference COP26, a key event at the time this data was produced.

Topic size is an important factor when considering the quality of a topic. If a topic is small relative to the size of your dataset, it is more likely to have formed around highly specific cases or anomalies in the data. Here the smallest topic is approximately 1% of the dataset at 11 documents. Normally, we can avoid these issues with topic size by considering only the largest topics for subsequent analysis, such as the top 10.
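
One possible way to restrict the analysis to the largest topics is sketched below. It rebuilds the first few `Topic` and `Count` values from the table above by hand so that it runs on its own; the `min_share` and `top_n` values are hypothetical and would need tuning for a real dataset.

```python
import pandas as pd

# First rows of the topic table shown above; 1000 titles were fitted.
df_topics = pd.DataFrame(
    {"Topic": [-1, 0, 1, 2, 3, 4], "Count": [164, 74, 48, 43, 43, 37]}
)
n_docs = 1000

# Drop the -1 catch-all, then express each topic's size as a share of the dataset.
df_sizes = df_topics[df_topics["Topic"] != -1].copy()
df_sizes["Share"] = df_sizes["Count"] / n_docs

min_share = 0.04  # hypothetical minimum share of the dataset
top_n = 3         # hypothetical number of topics to keep

df_largest = df_sizes[df_sizes["Share"] >= min_share].nlargest(top_n, "Count")
print(df_largest)
```

With the full output of `get_topic_info()`, the same filter could be applied directly with `top_n = 10`.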

A major strength of the BERTopic method is how customisable it is with different embedding and clustering components. We can pass any embeddings we want to the model to check their effect on the topics we find. Here we'll try the `test` model.

In [None]:
## Changing the embedding model

We can see...

Note that comparing across different model parameters can be a good way to assess the quality of any topics found. Those that are consistent across embedding or clustering tools are likely to be more robust and clearly defined in your data.
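
A rough sketch of such a comparison follows. Topic numbers are not comparable between runs, so the sketch matches topics by the documents they contain, using Jaccard overlap. The two label lists and the `best_match` helper are made up for illustration; in practice the labels would come from two fitted models over the same texts.

```python
from collections import defaultdict


def group_by_topic(labels):
    """Map each topic label to the set of document indices assigned to it."""
    groups = defaultdict(set)
    for doc_id, topic in enumerate(labels):
        if topic != -1:  # ignore the catch-all topic
            groups[topic].add(doc_id)
    return dict(groups)


def best_match(labels_a, labels_b):
    """For each topic in run A, find the run B topic sharing the most documents."""
    groups_a = group_by_topic(labels_a)
    groups_b = group_by_topic(labels_b)
    matches = {}
    for topic_a, docs_a in groups_a.items():
        best_topic, best_score = None, 0.0
        for topic_b, docs_b in groups_b.items():
            score = len(docs_a & docs_b) / len(docs_a | docs_b)
            if score > best_score:
                best_topic, best_score = topic_b, score
        matches[topic_a] = (best_topic, best_score)
    return matches


# Hypothetical topic labels for the same eight texts from two model configurations.
run_a = [0, 0, 0, 1, 1, -1, 2, 2]
run_b = [1, 1, 1, 0, 2, 2, -1, -1]

matches = best_match(run_a, run_b)
print(matches)
```

A score near 1 suggests the topic was recovered under both configurations, while a low score or no match suggests it depends on the particular model choices.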

In [None]:
## Changing the clustering model

In [5]:
## Some exercises