Topic Modelling with BERTopic

###### Documentation: https://maartengr.github.io/BERTopic/

<div style="color: lightgrey;">

- **Description**: BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
- **Purpose**: It is used in this project to understand the diversity of topics and their distribution across articles. It can be run on either the title or the whole article. Sentence-transformer embeddings handle both short titles and full-length articles, although inputs longer than the model's maximum sequence length are truncated.
- **Deployment**: The resulting topic labels feed the downstream Sentiment Analysis and News Summarization tasks.
- **Input**: The raw BBC news dataset for the category 'Politics'.

</div>

In [17]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Import libraries for Topic Modelling with BERTopic
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore")
%load_ext autoreload
%autoreload 2
# local imports
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from src.query import bbc_news_politics
# Load secret credentials from a local .env file
from dotenv import load_dotenv  # pip install python-dotenv
load_dotenv()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True

In [18]:
# Fetch the data from BigQuery
df = bbc_news_politics()
df.head()

This query will process 5114621 bytes.


Unnamed: 0,body,title,filename,category
0,"The ""best person for the job"" should be appoin...",'Best person' for top legal job,bbc/politics/273.txt,politics
1,A cap on donations to political parties should...,'Debate needed' on donations cap,bbc/politics/059.txt,politics
2,A cap on donations to political parties should...,'Debate needed' on donations cap,bbc/politics/298.txt,politics
3,It could cost £80m to run a UK referendum on t...,'EU referendum could cost £80m',bbc/politics/391.txt,politics
4,The initial attempt to sell the Millennium Dom...,'Errors' doomed first Dome sale,bbc/politics/006.txt,politics


In [19]:
# Drop unnecessary columns
df.drop(columns=['filename'], inplace=True)

In [None]:
# Load the sentence-embedding model
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")

In [None]:
# Pre-compute embeddings (pass a plain list of strings to encode)
embeddings = embedding_model.encode(df['body'].tolist(), show_progress_bar=False)

In [None]:
# Reduce dimensionality with UMAP
umap_model = UMAP(
    n_neighbors=10,  # below the default of 15, so local structure is emphasised in a small dataset
    n_components=5,  # low-dimensional space for HDBSCAN to cluster (not for plotting)
    min_dist=0.0,  # keep clusters tight
    metric='cosine',  # common choice for text embeddings
    random_state=42
)

In [23]:
# Cluster the reduced embeddings with HDBSCAN
hdbscan_model = HDBSCAN(
    min_cluster_size=15,  # smallest group of documents that can form a topic
    min_samples=5,  # lower than min_cluster_size, so fewer documents end up as outliers
    metric='euclidean',  # distance metric
    cluster_selection_method='eom',  # extracts dense clusters
    prediction_data=True  # allows predicting cluster membership for new points
)

In [24]:
# Tokenize
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

In [25]:
# Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
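The weighting behind this step can be illustrated with a minimal, standard-library sketch of c-TF-IDF as described in the BERTopic documentation: the term frequency within a topic is normalised and multiplied by `log(1 + A / f_t)`, where `A` is the average number of tokens per topic and `f_t` is the frequency of the term across all topics. The function name `c_tf_idf`, the toy token lists and the exact handling of `reduce_frequent_words` (a square root on the normalised frequency) are assumptions made for illustration; this is not the library's implementation.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens, reduce_frequent_words=False):
    """Score every term per class as normalised tf * log(1 + A / f_t)."""
    counts = {label: Counter(tokens) for label, tokens in class_tokens.items()}

    # f_t: frequency of each term across all classes
    total = Counter()
    for counter in counts.values():
        total.update(counter)

    # A: average number of tokens per class
    avg_tokens = sum(total.values()) / len(counts)

    scores = {}
    for label, counter in counts.items():
        n_tokens = sum(counter.values())
        scores[label] = {}
        for term, freq in counter.items():
            tf = freq / n_tokens  # L1-normalised term frequency
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampen very frequent terms
            scores[label][term] = tf * math.log(1 + avg_tokens / total[term])
    return scores


# Toy "classes": the tokens of all documents in a topic, concatenated (hypothetical data)
toy = {
    "immigration": ["asylum", "borders", "asylum", "said"],
    "courts": ["courts", "judges", "said", "said"],
}

plain = c_tf_idf(toy)
reduced = c_tf_idf(toy, reduce_frequent_words=True)

print(max(plain["courts"], key=plain["courts"].get))      # the generic word "said" ranks first
print(max(reduced["courts"], key=reduced["courts"].get))  # a topic-specific word ranks first
```

In this toy data the generic word "said" tops the "courts" topic under the plain weighting and drops below the topic-specific words once frequent words are dampened, which is the motivation for `reduce_frequent_words=True`.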

In [32]:
# diversity=0.5 balances relevance to the topic against diversity among the 15 keywords kept per topic
representation_model = MaximalMarginalRelevance(diversity=0.5, top_n_words=15)
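The effect of the `diversity` setting can be sketched with a small, standard-library version of greedy Maximal Marginal Relevance. The helper names `cosine` and `mmr`, the two-dimensional toy vectors and the candidate words are hypothetical; the real representation model works on embeddings of the candidate keywords and the topic, so treat this as an illustration of the idea rather than the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, words, diversity=0.5, top_n=3):
    """Greedily pick words that are relevant to the topic but not redundant."""
    if not words:
        return []
    relevance = [cosine(topic_vec, vec) for vec in word_vecs]

    # Start with the single most relevant word
    selected = [max(range(len(words)), key=relevance.__getitem__)]
    candidates = [i for i in range(len(words)) if i not in selected]

    while candidates and len(selected) < top_n:
        def score(i):
            # Redundancy: similarity to the closest word that is already selected
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]


# Hypothetical 2-D "embeddings": two near-duplicate words and one distinct word
topic_vec = [1.0, 0.0]
words = ["asylum", "asylum seekers", "borders"]
word_vecs = [[1.0, 0.3], [1.0, 0.35], [1.0, -0.4]]

greedy = mmr(topic_vec, word_vecs, words, diversity=0.0, top_n=2)
diverse = mmr(topic_vec, word_vecs, words, diversity=0.5, top_n=2)
print(greedy)   # ['asylum', 'asylum seekers']
print(diverse)  # ['asylum', 'borders']
```

With `diversity=0.0` the selection is purely by relevance and keeps the near-duplicate; with `diversity=0.5` the redundant candidate is penalised and the distinct word is chosen instead.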

In [33]:
# Set the parameters for the model
topic_model = BERTopic(
  embedding_model=embedding_model,            # extract embeddings
  umap_model=umap_model,                      # reduce dimensionality
  hdbscan_model=hdbscan_model,                # cluster reduced embeddings
  vectorizer_model=vectorizer_model,          # tokenize the documents of each topic
  ctfidf_model=ctfidf_model,                  # weight topic words with c-TF-IDF
  representation_model=representation_model,  # diversify topic words
  nr_topics=None,                             # no topic reduction; nr_topics='auto' would merge similar topics
  min_topic_size=20,                          # not used when a custom hdbscan_model is passed; min_cluster_size applies instead
  verbose=True,
  top_n_words=15                              # number of words per topic
  )

In [34]:
# Fit the model on the pre-computed embeddings and assign a topic to each document
topics, probabilities = topic_model.fit_transform(df['body'].tolist(), embeddings)

2025-03-25 11:52:17,000 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-03-25 11:52:17,473 - BERTopic - Dimensionality - Completed ✓
2025-03-25 11:52:17,474 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-03-25 11:52:17,481 - BERTopic - Cluster - Completed ✓
2025-03-25 11:52:17,482 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-03-25 11:52:18,344 - BERTopic - Representation - Completed ✓


In [None]:
# Print the topic overview (the row count includes the outlier topic -1)
freq = topic_model.get_topic_info()
print("Number of topics: {}".format(len(freq)))
freq.head(15)

Number of topics: 11


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,97,-1_party_silk_ukip_blunkett,"[party, silk, ukip, blunkett, labour, mr kilro...",[Michael Howard has denied his shadow cabinet ...
1,0,105,0_mr brown_election_labour_said,"[mr brown, election, labour, said, prime, prim...",[Gordon Brown has made an appeal for unity aft...
2,1,42,1_kennedy_students_party_lib dems,"[kennedy, students, party, lib dems, people, f...",[The Liberal Democrats are attempting to woo f...
3,2,42,2_police_human rights_trial_drinking,"[police, human rights, trial, drinking, forsyt...",[Detaining foreign terrorist suspects without ...
4,3,26,3_mr howard_tory_tories_labour,"[mr howard, tory, tories, labour, election, ta...","[""He's not finished yet,"" whispered the Conser..."
5,4,22,4_minimum wage_workers_unions_jobs,"[minimum wage, workers, unions, jobs, pensions...",[Talks aimed at averting a series of national ...
6,5,20,5_regiments_mayor_mr livingstone_galloway,"[regiments, mayor, mr livingstone, galloway, a...",[First Minister Jack McConnell has ordered a r...
7,6,17,6_lord chancellor_courts_house lords_access,"[lord chancellor, courts, house lords, access,...",[In a locked room at the heart of Parliament t...
8,7,16,7_hunting_ban_dogs_casinos,"[hunting, ban, dogs, casinos, betting, foxes, ...",[Hunts in England and Wales have begun on the ...
9,8,15,8_tb_asylum seekers_borders_roma,"[tb, asylum seekers, borders, roma, children, ...",[The UK' opposition Conservatives have unveile...


**Note:**  
All documents labeled as `-1` are **outliers** (97 documents in this run): HDBSCAN did not place them in any dense cluster. This typically means they either:  
1. Contain **complex or ambiguous content** that mixes several themes, making it difficult to assign a single topic.  
2. Lack distinct features, preventing clear topic classification.  

##### 🔹 How to Handle These Outliers?  
- **Option 1 (Recommended for Now):** Exclude `-1` documents from summarization and analysis to ensure topic clarity.  
- **Option 2 (Alternative Approach):** Assign each outlier to its closest topic using a similarity-based method (e.g., cosine similarity between the document embedding and the topic centroids).  

For now, we will proceed with **Option 1**, focusing only on clearly defined topics.
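Should Option 2 be needed later, the idea can be sketched as a nearest-centroid assignment. The function names, the `threshold` parameter and the two-dimensional toy embeddings below are assumptions for illustration only. BERTopic also provides a `reduce_outliers` method that covers this use case; its strategies and arguments should be checked in the library documentation before relying on it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def reassign_outliers(embeddings, topics, threshold=0.0):
    """Move each -1 document to the topic with the most similar centroid."""
    groups = {}
    for vec, topic in zip(embeddings, topics):
        if topic != -1:
            groups.setdefault(topic, []).append(vec)
    centroids = {topic: centroid(vecs) for topic, vecs in groups.items()}

    new_topics = []
    for vec, topic in zip(embeddings, topics):
        if topic == -1 and centroids:
            best = max(centroids, key=lambda t: cosine(vec, centroids[t]))
            # Keep the document as an outlier if even the best match is too weak
            if cosine(vec, centroids[best]) >= threshold:
                topic = best
        new_topics.append(topic)
    return new_topics


# Hypothetical 2-D embeddings: two topics and one outlier that leans towards topic 0
embeddings_toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
topics_toy = [0, 0, 1, 1, -1]

reassigned = reassign_outliers(embeddings_toy, topics_toy)
print(reassigned)  # [0, 0, 1, 1, 0]
```

A non-zero `threshold` keeps weakly matching documents as outliers, which limits the risk of diluting well-defined topics with ambiguous articles.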

In [None]:
# Print the keywords of a single topic
a_topic = freq.iloc[9]["Topic"]  # row 9 of the overview table, i.e. topic 8
topic_model.get_topic(a_topic)  # show the words and their c-TF-IDF scores

[('tb', 0.3347968421320353),
 ('asylum seekers', 0.31872054613944484),
 ('borders', 0.31317376935797975),
 ('roma', 0.29243565635914254),
 ('children', 0.2833421844341833),
 ('quotas', 0.27759887658695287),
 ('sites', 0.2755410536482455),
 ('migration', 0.2725189459998106),
 ('economic migrants', 0.2561676187985217),
 ('health', 0.2532659144207702),
 ('tories', 0.250106552229561),
 ('tests', 0.24374835536486833),
 ('mr howard', 0.23393535981858354),
 ('travellers', 0.2337092510746216),
 ('genuine refugees', 0.2337092510746216)]

In [37]:
# Visualise the topics and their keywords
topic_model.visualize_barchart(n_words=10)

In [38]:
# Visualise clusters of topics
topic_model.visualize_topics()

In [40]:
# Visualise the topic hierarchy
topic_model.visualize_hierarchy(top_n_topics=11)

In [None]:
# Visualise a similarity matrix
topic_model.visualize_heatmap(top_n_topics=30)

In [42]:
# Add the topic assigned to each document as a new column
df['topic'] = topics
df.head(5)

Unnamed: 0,body,title,category,topic
0,"The ""best person for the job"" should be appoin...",'Best person' for top legal job,politics,6
1,A cap on donations to political parties should...,'Debate needed' on donations cap,politics,-1
2,A cap on donations to political parties should...,'Debate needed' on donations cap,politics,-1
3,It could cost £80m to run a UK referendum on t...,'EU referendum could cost £80m',politics,0
4,The initial attempt to sell the Millennium Dom...,'Errors' doomed first Dome sale,politics,5


In [46]:
# Select topics that are semantically similar to an input query
similar_topics, similarity = topic_model.find_topics('migration', top_n=2)
similar_topics

[8, 1]