<a href="https://colab.research.google.com/github/AbhishekVel/applied_ai_mini/blob/main/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic: Clustering and Labeling of BBC news articles based on content


BERTopic high-level overview:
1. Create embeddings
2. Dimensionality reduction
3. Cluster
4. Representation (topic modeling)


Some key learnings:
- The `min_cluster_size` of the clustering algorithm had a SIGNIFICANT impact on the output. I initially set it to 250, on the assumption that there should be ~10 large topics. This was wrong: there are a few large topics, but most are small (~30 articles each).
- Keywords are not great yet; I need to figure out how to get better ones. Some things to look into: removing "bbc" from the keywords, and debugging what the combination of KeyBERT + MMR is actually doing.
- The representation model that generates the label over-indexes on the documents it is given rather than on the keywords, so the labels end up too specific. TODO: I believe this needs to be solved to get better descriptions.
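
On the last point, one thing worth trying is limiting how much document text reaches the LLM relative to the keywords. The sketch below is not BERTopic's own code. It is a simplified illustration of filling a template that uses `[DOCUMENTS]` and `[KEYWORDS]` placeholders, like the prompt used later in this notebook. The helper name `build_label_prompt` and the `max_docs` and `max_chars` limits are hypothetical, chosen only for illustration.

```python
def build_label_prompt(template, docs, keywords, max_docs=2, max_chars=200):
    """Fill a [DOCUMENTS]/[KEYWORDS] template, truncating the documents.

    Hypothetical helper for illustration; not part of BERTopic.
    """
    # Keep only the first few documents and cut each one short so that
    # long articles do not drown out the keywords.
    short_docs = [doc[:max_chars] for doc in docs[:max_docs]]
    doc_block = "\n".join(f"- {doc}" for doc in short_docs)
    prompt = template.replace("[DOCUMENTS]", doc_block)
    return prompt.replace("[KEYWORDS]", ", ".join(keywords))


template = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
)
# Made-up documents, long enough to be truncated.
docs = ["first article " * 50, "second article " * 50, "third article " * 50]
keywords = ["tariffs", "canada", "trade"]

prompt = build_label_prompt(template, docs, keywords)
print(prompt)
```

Recent BERTopic versions appear to expose similar controls on the OpenAI representation (`nr_docs` and `doc_length`); it is worth checking the documentation for the installed version before relying on them.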


In [1]:
!pip install bertopic
!pip install datasets
!pip install openai



In [90]:
from datasets import load_dataset

dataset = load_dataset("RealTimeData/bbc_news_alltime", "2025-03")

In [3]:
# Combine each article's title and content into a single string,
# separated by a newline.

titles = dataset["train"]["title"]
contents = dataset["train"]["content"]
title_and_contents = [title + "\n" + content for title, content in zip(titles, contents)]


## Generating Clusters

In [4]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(title_and_contents, show_progress_bar=True)

embeddings.shape

(2697, 384)

In [5]:
from umap import UMAP

# UMAP is a dimensionality reduction algorithm.
# n_components=5 reduces each 384-dimensional embedding to a 5-dimensional vector.
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)

In [23]:
from hdbscan import HDBSCAN

# HDBSCAN is a clustering algorithm.
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
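
Because `min_cluster_size` mattered so much here, a quick check of the cluster size distribution can help when trying a new value. The following is a minimal sketch that assumes cluster labels are available as a plain list of integers with `-1` for outliers (the convention HDBSCAN uses). The function name `summarize_cluster_sizes` and the `large_threshold` cut-off are hypothetical.

```python
from collections import Counter


def summarize_cluster_sizes(labels, large_threshold=50):
    """Summarize cluster sizes from a list of cluster labels.

    Label -1 is treated as the outlier bucket. `large_threshold` is an
    arbitrary cut-off chosen for illustration.
    """
    sizes = Counter(label for label in labels if label != -1)
    ordered = sorted(sizes.values(), reverse=True)
    return {
        "n_clusters": len(ordered),
        "n_outliers": sum(1 for label in labels if label == -1),
        "n_large": sum(1 for size in ordered if size >= large_threshold),
        # Middle element of the descending list (for an even number of
        # clusters this picks the smaller of the two middle values).
        "median_size": ordered[len(ordered) // 2] if ordered else 0,
        "largest": ordered[:3],
    }


# Toy labels: one big cluster, three small ones, and some outliers.
labels = [0] * 100 + [1] * 12 + [2] * 11 + [3] * 10 + [-1] * 20
summary = summarize_cluster_sizes(labels)
print(summary)
```

After fitting, the list of topic ids returned by `fit_transform` could be passed in place of the toy `labels`.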

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
# Note: the CountVectorizer does not affect which topic an article is assigned to.
# It is only applied after clustering, when extracting each topic's keywords.
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
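
To make these settings concrete, here is a simplified pure-Python illustration (not scikit-learn's implementation) of what stop-word removal, `ngram_range=(1, 2)`, and `min_df=2` do to a vocabulary. The tiny stop-word set, which includes "bbc" as a corpus-specific addition, and the toy documents are assumptions for this sketch. The tokenizer is also cruder than scikit-learn's default.

```python
import re
from collections import Counter

# Tiny stand-in for an English stop-word list, plus "bbc" as a
# corpus-specific addition (an assumption for this sketch).
STOP_WORDS = {"the", "a", "on", "to", "as", "bbc", "says"}


def extract_terms(doc, ngram_range=(1, 2)):
    """Return the set of n-grams in one document after stop-word removal."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9]+", doc.lower())
        if token not in STOP_WORDS
    ]
    low, high = ngram_range
    terms = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            terms.add(" ".join(tokens[i:i + n]))
    return terms


def build_vocabulary(docs, min_df=2):
    """Keep only terms that appear in at least `min_df` documents."""
    doc_freq = Counter(term for doc in docs for term in extract_terms(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = [
    "BBC says trade war escalates as tariffs rise",
    "Tariffs rise on Canada, the BBC reports",
    "Trade war fears hit markets",
]
vocab = build_vocabulary(docs)
print(vocab)
```

With scikit-learn itself, one option to explore would be passing a custom list to `stop_words`, for example the built-in `ENGLISH_STOP_WORDS` set from `sklearn.feature_extraction.text` extended with "bbc".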

In [28]:
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech
from google.colab import userdata

# KeyBERT
keybert_model = KeyBERTInspired()

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# OpenAI model (gpt-4o-mini)
client = openai.OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(client, model="gpt-4o-mini", exponential_backoff=True, prompt=prompt)

# Chain the models: each one refines the output of the one before it
combined_aspect = [
    keybert_model,
    mmr_model,
    openai_model,
]

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "GPT-4-O": openai_model,
    "Combined": combined_aspect,
}
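
Since the KeyBERT + MMR combination is one of the things to debug, it may help to see what the `diversity` parameter does in isolation. The block below is a minimal sketch of greedy Maximal Marginal Relevance in plain Python, under the assumption that each candidate word and the topic are represented by embedding vectors. The 2-D vectors are made up, and the function names are hypothetical rather than BERTopic's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, words, diversity=0.3, top_n=2):
    """Greedy MMR: trade off relevance to the topic against redundancy."""
    relevance = [cosine(topic_vec, vec) for vec in word_vecs]
    # Start with the single most relevant word.
    selected = [max(range(len(words)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(words)) if i not in selected]
    while candidates and len(selected) < top_n:
        scores = {}
        for i in candidates:
            # Redundancy: similarity to the closest already-selected word.
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            scores[i] = (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]


# Toy 2-D "embeddings" (made up): two near-duplicates and one distinct word.
words = ["tariffs", "tariff", "canada"]
word_vecs = [[1.0, 0.12], [1.0, 0.1], [0.6, 0.8]]
topic_vec = [1.0, 0.3]

low_diversity = mmr(topic_vec, word_vecs, words, diversity=0.3)
high_diversity = mmr(topic_vec, word_vecs, words, diversity=0.7)
print(low_diversity)
print(high_diversity)
```

In this toy setup a diversity of 0.3 still keeps the near-duplicate pair, while 0.7 swaps the duplicate for the more distinct word. Whether the same holds for the real keyword embeddings would need to be checked on the actual topics.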


In [51]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=25,
  verbose=True
)

# Train model
topic_for_doc, probs = topic_model.fit_transform(title_and_contents, embeddings)

# Show topics
topic_model.get_topic_info()


2025-03-30 23:47:00,703 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-03-30 23:47:13,073 - BERTopic - Dimensionality - Completed ✓
2025-03-30 23:47:13,074 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-03-30 23:47:13,172 - BERTopic - Cluster - Completed ✓
2025-03-30 23:47:13,176 - BERTopic - Representation - Fine-tuning topics using representation models.
100%|██████████| 90/90 [01:40<00:00,  1.12s/it]
100%|██████████| 90/90 [01:22<00:00,  1.09it/s]
2025-03-30 23:50:37,102 - BERTopic - Representation - Completed ✓


| | Topic | Count | Name | Representation | KeyBERT | MMR | GPT-4-O | Combined | Representative_Docs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 505 | -1_said_bbc_says_government | [said, bbc, says, government, people, year, ne... | [ukraine, uk, government, russia, country, new... | [bbc, police, news, uk, bbc news, told, ukrain... | [Ukraine-Russia Conflict Updates] | [Ukraine Conflict and Government Efficiency] | [Ukraine in maps: Tracking the war with Russia... |
| 1 | 0 | 106 | 0_ukraine_zelensky_trump_peace | [ukraine, zelensky, trump, peace, president, k... | [ukraine peace, support ukraine, peace ukraine... | [ukraine, zelensky, trump, sir keir, deal, pri... | [Ukraine Peace Coalition Summit] | [Ukraine Peace Coalition Initiative] | [Starmer: Coalition of willing to guarantee Uk... |
| 2 | 1 | 100 | 1_tariffs_canada_trump_trade | [tariffs, canada, trump, trade, china, canadia... | [trump tariffs, tariffs, tariff, canada mexico... | [tariffs, trump, trade war, imports, trudeau, ... | [US Tariffs and Trade Tensions] | [US Tariffs on Canada and Mexico] | [Stock markets sink as Trump confirms tariffs ... |
| 3 | 2 | 96 | 2_gaza_israel_hamas_israeli | [gaza, israel, hamas, israeli, hostages, cease... | [gaza, gaza strip, hamas, israeli military, is... | [gaza, israel, hamas, hostages, ceasefire, pal... | [Renewed Conflict in Gaza] | [Gaza Conflict Escalation and Ceasefire] | [Israel launches waves of strikes on Gaza with... |
| 4 | 3 | 57 | 3_ship_sea_immaculate_crew | [ship, sea, immaculate, crew, north sea, cargo... | [maritime, ships, north sea, crew member, ship... | [ship, north sea, collision, cargo ship, coast... | [North Sea Cargo Ship Collision] | [North Sea Ship Collision Incident] | [North Sea collision ship captain appears in c... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85 | 84 | 11 | 84_talbot_green_police_identified | [talbot, green, police, identified, woman, lan... | [woman shot, murdered, killed, murder, shot de... | [talbot, green, police, joanne, murder, rhondd... | [Indigenous Murder Victims Identified] | [Indigenous Women Murder Cases] | [Talbot Green shooting: Murder arrest after wo... |
| 86 | 85 | 11 | 85_traffic_road_highways_national highways | [traffic, road, highways, national highways, b... | [motorway, roads, thames crossing, highways, c... | [national highways, roads, newport, 30mph, ext... | [Road Safety and Traffic Management] | [Roadworks and Speed Limit Changes] | [M25: Drivers warned as final motorway closure... |
| 87 | 86 | 11 | 86_hurdle_mullins_cheltenham_trainer | [hurdle, mullins, cheltenham, trainer, winner,... | [hurdle, cheltenham, wins, racing, finish line... | [hurdle, mullins, cheltenham, trainer, winner,... | [Cheltenham Festival Highlights 2023] | [Cheltenham Racing Highlights 2023] | [Champion Hurdle: Golden Ace wins at Cheltenha... |
| 88 | 87 | 11 | 87_sturgeon_snp_nicola sturgeon_nicola | [sturgeon, snp, nicola sturgeon, nicola, inves... | [nicola sturgeon, snp leader, snp, holyrood el... | [nicola sturgeon, election, minister, john swi... | [SNP Leadership and Legal Challenges] | [SNP Leadership and Investigation Updates] | [What next for the SNP? Moving out from under ... |
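
Topic `-1` in the first row is the bucket for articles that were not assigned to any cluster. A small helper can put its size in proportion. This sketch assumes the topic assignments are a plain list of integers; the function name `outlier_share` is hypothetical, and the list is a stand-in built from two numbers reported in this notebook rather than the real assignments.

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return sum(1 for topic in topics if topic == -1) / len(topics)


# Stand-in assignments built from two numbers reported in this notebook:
# 2697 embedded articles, 505 of them in topic -1. Everything else is lumped
# into a single placeholder topic 0, which is enough to compute the share.
topics = [-1] * 505 + [0] * (2697 - 505)
share = outlier_share(topics)
print(f"{share:.1%} of articles are outliers")
```

On the fitted model the same helper could be called with the topic list returned by `fit_transform`. BERTopic also has a `reduce_outliers` method that may be worth looking into if this share seems too high.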


In [27]:
topic_model.visualize_topics()

## Debugging

In [91]:
import pandas as pd
topic_id_and_name_df = topic_model.get_topic_info()[['Topic', 'Combined', 'KeyBERT']]
topic_and_title_df = pd.DataFrame({'Topic': topic_for_doc, 'Title': titles})
topic_and_title_df.merge(topic_id_and_name_df, on='Topic')

| | Topic | Title | Combined | KeyBERT |
|---|---|---|---|---|
| 0 | 12 | Steve Rosenberg: Vladimir Putin can afford to ... | [JD Vance's Conservative Ideology] | [trump vance, said vance, jd vance, trump zele... |
| 1 | 0 | Keir Starmer faces decision about who he can t... | [Ukraine Peace Coalition Initiative] | [ukraine peace, support ukraine, peace ukraine... |
| 2 | 12 | Most Republicans laud Trump after Zelensky sho... | [JD Vance's Conservative Ideology] | [trump vance, said vance, jd vance, trump zele... |
| 3 | 35 | Demi Hannaway: New investigation ordered into ... | [Child Disappearance and Domestic Violence] | [murder, police said, disappearance, told cour... |
| 4 | 60 | Bristol dog attack: 19-year-old victim named b... | [Tragic Impact of Serial Killer] | [serial killer, killed, murder, disappearance,... |
| ... | ... | ... | ... | ... |
| 2692 | 74 | Margaret Miles-Bramwell: Legacy of weight loss... | [Body Image and Social Media] | [overweight, underweight, obesity, obese, slim... |
| 2693 | 7 | Cole Palmer: Ill forward 'wanted to be on pitc... | [Hojlund's Struggles at Manchester United] | [manchester united, man utd, goalscorer, roone... |
| 2694 | 75 | Crufts 2025: Whippet from Italy called Miuccia... | [Miuccia Wins Crufts Best in Show] | [crufts dog, dogs, crufts, dog, 000 dogs, whip... |
| 2695 | 51 | Sweden is 'no longer a country that cannot be ... | [Europe's Military Response to Ukraine] | [ukraine military, fighting ukraine, ukraine, ... |
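
The merged frame has one row per article, which makes it hard to judge whether a label fits every article in its topic (for example, topic 12 is labeled "JD Vance's Conservative Ideology" but includes a Putin article). Grouping a few titles under each topic id is one way to spot-check this. The sketch below assumes two parallel lists of topic ids and titles; `titles_by_topic` is a hypothetical helper and the sample data is a placeholder.

```python
from collections import defaultdict


def titles_by_topic(topics, titles, max_titles=3):
    """Group up to `max_titles` article titles under each topic id."""
    grouped = defaultdict(list)
    for topic, title in zip(topics, titles):
        if len(grouped[topic]) < max_titles:
            grouped[topic].append(title)
    return dict(grouped)


# Placeholder data; in the notebook these would be the topic assignments
# from fit_transform and the list of article titles.
sample_topics = [12, 0, 12, 35]
sample_titles = ["title A", "title B", "title C", "title D"]

grouped = titles_by_topic(sample_topics, sample_titles)
for topic, topic_titles in sorted(grouped.items()):
    print(topic, topic_titles)
```

Reading a handful of titles next to the generated label should make it easier to tell whether a label is too specific or simply wrong for its cluster.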
