# Topic Modeling with BERTopic

Guide: [Best Practices](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html#data)

In [1]:
%%capture
%pip install bertopic scikit-learn spacy nltk

In [2]:
import os
import pandas as pd

# BERTopic Imports
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

## Loading Data

In [3]:
ROOT_PATH = r"/content/drive/MyDrive/News Data"

CANADA_FILE_PATH = os.path.join(ROOT_PATH, "canada.csv")
CHINA_FILE_PATH = os.path.join(ROOT_PATH, "china.csv")
RUSSIA_FILE_PATH = os.path.join(ROOT_PATH, "russia.csv")

INF_ROOT_PATH = r"/content/drive/MyDrive/News Data/llama2_inference_data"
INFERENCE_CANADA_FILE_PATH = os.path.join(INF_ROOT_PATH, "answer_canada.csv")
INFERENCE_CHINA_FILE_PATH = os.path.join(INF_ROOT_PATH, "answer_china.csv")
INFERENCE_RUSSIA_FILE_PATH = os.path.join(INF_ROOT_PATH, "answer_russia.csv")

In [4]:
df_canada = pd.read_csv(CANADA_FILE_PATH, usecols=["maintext"])
df_china = pd.read_csv(CHINA_FILE_PATH, usecols=["maintext"])
df_russia = pd.read_csv(RUSSIA_FILE_PATH, usecols=["maintext"])

## Common Elements

### Preventing Stochastic Behavior

BERTopic typically reduces the dimensionality of the embeddings before clustering them, which mitigates the curse of dimensionality to some degree.

By default this is done with UMAP. UMAP is stochastic, so repeated runs on the same data can produce different topics. Setting `random_state` fixes the seed and makes the results reproducible.

In [5]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)
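
The effect of a fixed seed can be illustrated without UMAP itself. The sketch below is a toy stand-in, not UMAP: `random_projection` is a hypothetical helper that projects vectors onto randomly drawn directions, and it only shows that seeding a stochastic reducer makes its output repeatable.

```python
import random


def random_projection(vectors, n_components, seed=None):
    """Project vectors onto random Gaussian directions (toy stochastic reducer)."""
    rng = random.Random(seed)  # seed=None draws a fresh seed from the OS
    dim = len(vectors[0])
    directions = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_components)
    ]
    return [
        [sum(v * d for v, d in zip(vec, direction)) for direction in directions]
        for vec in vectors
    ]


vectors = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [2.0, 2.0, 0.0, 1.0]]

# Same seed: the two runs produce identical low-dimensional points
seeded_a = random_projection(vectors, 2, seed=42)
seeded_b = random_projection(vectors, 2, seed=42)
print(seeded_a == seeded_b)

# No seed: each run draws new directions, so the outputs generally differ
unseeded_a = random_projection(vectors, 2)
unseeded_b = random_projection(vectors, 2)
print(unseeded_a == unseeded_b)
```

Passing `random_state=42` to `UMAP` plays the same role as `seed=42` here.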

### Controlling Number of Topics

The `nr_topics` parameter controls the number of topics, but it works by merging topics after they have been created. It is meant for reducing a model to a fixed number of topics.

It is generally better to control the number of topics through the cluster model, which is HDBSCAN by default. Its `min_cluster_size` parameter indirectly controls how many topics are created.

A higher `min_cluster_size` generates fewer topics, and a lower `min_cluster_size` generates more.

In [6]:
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)

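### Improving Default Representation

The topic representation is built from a `CountVectorizer`. The one below removes English stop words, ignores terms that appear in fewer than 1% or more than 90% of the documents (`min_df`, `max_df`), and considers n-grams of one to three words (`ngram_range`).
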
In [7]:
vectorizer_model = CountVectorizer(stop_words="english",
                                   min_df=0.01, max_df=0.9,
                                   ngram_range=(1, 3))

## Canada

### Pre-calculate embeddings

BERTopic works by converting documents into numerical values, called embeddings. This process can be very costly, especially if we want to iterate over parameters. Instead, we can calculate those embeddings once and feed them to BERTopic to skip calculating embeddings each time.

In [8]:
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")
embeddings = embedding_model.encode(df_canada.maintext, show_progress_bar=True)
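
To reuse embeddings across notebook sessions as well, they can be cached on disk. The following is a sketch under stated assumptions: `load_or_compute_embeddings` is a hypothetical helper, and `fake_encode` is a dummy that stands in for `embedding_model.encode` so the example runs without downloading a model.

```python
import os
import tempfile

import numpy as np


def load_or_compute_embeddings(cache_path, documents, encode_fn):
    """Return cached embeddings if the file exists, else compute and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(encode_fn(documents))
    np.save(cache_path, embeddings)
    return embeddings


# Dummy encoder standing in for embedding_model.encode; it counts its calls
calls = {"n": 0}


def fake_encode(documents):
    calls["n"] += 1
    return [[float(len(doc)), float(doc.count(" "))] for doc in documents]


docs = ["first article text", "second one"]
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "embeddings.npy")
    first = load_or_compute_embeddings(cache_path, docs, fake_encode)
    second = load_or_compute_embeddings(cache_path, docs, fake_encode)

# The encoder ran only once; the second call read the cached array
print(calls["n"], np.array_equal(first, second))
```

In the notebook, `encode_fn` would be `embedding_model.encode` and `cache_path` a file next to the CSVs. The cache must be deleted whenever the documents or the embedding model change.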

### Training

In [9]:
topic_model = BERTopic(
  nr_topics="auto",

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,

  # Settings
  language="english",
  calculate_probabilities=True,

  # Hyperparameters
  top_n_words=25,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(df_canada.maintext, embeddings)

2023-12-04 01:12:55,654 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-04 01:13:24,893 - BERTopic - Dimensionality - Completed ✓
2023-12-04 01:13:24,895 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-04 01:13:26,552 - BERTopic - Cluster - Completed ✓
2023-12-04 01:13:26,553 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-04 01:14:19,995 - BERTopic - Representation - Completed ✓
2023-12-04 01:14:20,137 - BERTopic - Topic reduction - Reducing number of topics
2023-12-04 01:15:15,214 - BERTopic - Topic reduction - Reduced number of topics from 20 to 12


### Setting Custom Labels

In [15]:
for topic in range(-1, 11):
  print(f"Topic {topic+2}: " + ", ".join(topic_model.get_topic_info(topic)["Representation"][0]))

Topic 1: india, israel, gaza, hamas, smoke, sikh, indian, violence, air quality, asylum, justice, mercedes, israeli, cancer, ai, hurricane, stephenson, mercedes stephenson, bank canada, crime, refugees, hate, nijjar, allegations, young
Topic 2: fires, officers, firefighters, residential, crime, fireworks, lake, evacuation, sexual, murder, police said, forest, justice, music, victims, violence, smoke, gender, reconciliation, river, young, hospital, residential school, crews, shooting
Topic 3: ukraine, chinese, interference, bank canada, carbon, emissions, foreign interference, basis points, russian, central bank, canadian dollar, pipeline, trading, inquiry, national security, macklem, cents, allegations, greenback, exports, bond, defence, yields, hike, csis
Topic 4: cup, hockey, win, world cup, league, coach, womens, tournament, soccer, scored, knights, sport, sports, championship, olympic, fans, medal, mens, athletes, football, jets, nhl, hockey canada, ice, olympics
Topic 5: drug, pat

In [16]:
topic_model.set_topic_labels({-1: "conflict",
                              0: "crime",
                              1: "national-security",
                              2: "sports",
                              3: "health",
                              4: "finance",
                              5: "weather",
                              6: "housing",
                              7: "food",
                              8: "transportation",
                              9: "telecommunications",
                              10: "public-service"})

In [17]:
topic_model.visualize_topics(custom_labels=True)

In [18]:
topic_model.visualize_hierarchy(custom_labels=True, width=600)

In [12]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3680,-1_india_israel_gaza_hamas,"[india, israel, gaza, hamas, smoke, sikh, indi...",[Allegations that there is evidence that agent...
1,0,2659,0_fires_officers_firefighters_residential,"[fires, officers, firefighters, residential, c...","[After a cool-weather reprieve, wildfire condi..."
2,1,2010,1_ukraine_chinese_interference_bank canada,"[ukraine, chinese, interference, bank canada, ...",[The Canadian dollar weakened to a near two-we...
3,2,1024,2_cup_hockey_win_world cup,"[cup, hockey, win, world cup, league, coach, w...",[The FIFA Womens World Cup is already setting ...
4,3,677,3_drug_patients_health canada_vaccine,"[drug, patients, health canada, vaccine, canna...",[With the arrival of the latest COVID-19 varia...
5,4,639,4_tsx_composite_cents_tsx composite,"[tsx, composite, cents, tsx composite, composi...",[Canadas main stock index stepped lower Wednes...
6,5,500,5_snow_environment canada_temperatures_flooding,"[snow, environment canada, temperatures, flood...",[Environment Canada has issued a snowfall warn...
7,6,422,6_units_rent_mortgage_buyers,"[units, rent, mortgage, buyers, affordable hou...",[Canadian renters have a tough few years in st...
8,7,298,7_grocery_grocers_food bank_food banks,"[grocery, grocers, food bank, food banks, rest...","[Whether youre catching a flight, opening a ne..."
9,8,298,8_air canada_westjet_airport_airline,"[air canada, westjet, airport, airline, pilots...",[WestJet is scrambling to restart more than 20...


In [19]:
topic_model.visualize_heatmap(custom_labels=True)

### Inference

In [20]:
df = pd.read_csv(INFERENCE_CANADA_FILE_PATH)

In [21]:
x = topic_model.transform(df.maintext)

2023-12-04 01:28:09,553 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2023-12-04 01:28:27,242 - BERTopic - Dimensionality - Completed ✓
2023-12-04 01:28:27,243 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2023-12-04 01:28:27,610 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2023-12-04 01:28:28,416 - BERTopic - Probabilities - Completed ✓
2023-12-04 01:28:28,417 - BERTopic - Cluster - Completed ✓


In [22]:
df["topic"] = [topic_model.get_topic_info(i).CustomName[0] for i in x[0]]
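
The list comprehension above calls `get_topic_info` once per document. A possible alternative, assuming the topic table exposes the `Topic` and `CustomName` columns that `get_topic_info()` returns once custom labels are set, is to build the id-to-label lookup once and apply it with `Series.map`. In this sketch `topic_info` and `predicted` are hypothetical stand-ins for `topic_model.get_topic_info()` and `x[0]`.

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_topic_info() after set_topic_labels
topic_info = pd.DataFrame({
    "Topic": [-1, 0, 1, 2],
    "CustomName": ["conflict", "crime", "national-security", "sports"],
})

# Hypothetical stand-in for x[0], the topic ids returned by transform
predicted = [1, -1, 2, 1, 0]

# Build the id -> label lookup once, then map every prediction through it
label_by_topic = dict(zip(topic_info["Topic"], topic_info["CustomName"]))
labels = pd.Series(predicted).map(label_by_topic)

print(labels.tolist())
```

This prints `['national-security', 'conflict', 'sports', 'national-security', 'crime']`. A topic id missing from the lookup would map to `NaN`, which makes unexpected ids easy to spot.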

In [23]:
df.topic.value_counts()

national-security     1079
conflict              1042
crime                  505
finance                423
sports                 378
health                 125
housing                 92
weather                 81
telecommunications      79
transportation          76
food                    73
public-service          47
Name: topic, dtype: int64

In [24]:
df.to_csv(INFERENCE_CANADA_FILE_PATH, index=False)

In [25]:
del df_canada, df, topic_model, embeddings, x

## China

In [26]:
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")
embeddings = embedding_model.encode(df_china.maintext, show_progress_bar=True)

### Training

In [27]:
topic_model = BERTopic(
  nr_topics="auto",

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,

  # Settings
  language="english",
  calculate_probabilities=True,

  # Hyperparameters
  top_n_words=25,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(df_china.maintext, embeddings)

2023-12-04 01:30:30,359 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-04 01:30:50,400 - BERTopic - Dimensionality - Completed ✓
2023-12-04 01:30:50,402 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-04 01:30:51,568 - BERTopic - Cluster - Completed ✓
2023-12-04 01:30:51,569 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-04 01:31:21,611 - BERTopic - Representation - Completed ✓
2023-12-04 01:31:21,694 - BERTopic - Topic reduction - Reducing number of topics
2023-12-04 01:31:51,672 - BERTopic - Topic reduction - Reduced number of topics from 11 to 11


### Setting Custom Labels

In [29]:
for topic in range(-1, 10):
  print(f"Topic {topic+2}: " + ", ".join(topic_model.get_topic_info(topic)["Representation"][0]))

Topic 1: cpc, india, central committee, dollar, brics, cultural, developing, japanese, civil, xiongan, midpoint, standing committee, new area, han, climate change, xi said, peace, space, carbon, offshore, australia, fugitives, increase, cpc central, european
Topic 2: cpc, central committee, cpc central, cpc central committee, standing committee, governance, npc, national congress, cppcc, 19th, national committee, socialism, political bureau, china cpc, era, general secretary, party china cpc, xi said, socialism chinese, socialism chinese characteristics, new era, democracy, discipline, people congress, socialist
Topic 3: russia, peace, china sea, south china sea, exchanges, xi said, asean, ukraine, mutual, russian, dialogue, partnership, prime, prime minister, putin, foreign minister, philippines, nuclear, friendship, vietnam, blinken, islands, africa, philippine, chinese president
Topic 4: li said, keqiang, li keqiang, premier li keqiang, increase, central bank, bonds, logistics, guid

In [30]:
topic_model.set_topic_labels({-1: "politics",
                              0: "politics",
                              1: "international-relations",
                              2: "finance",
                              3: "agriculture",
                              4: "healthcare",
                              5: "politics",
                              6: "finance",
                              7: "environment",
                              8: "finance",
                              9: "politics"})

In [31]:
topic_model.visualize_topics(custom_labels=True)

In [32]:
topic_model.visualize_hierarchy(custom_labels=True, width=600)

In [33]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,2641,-1_cpc_india_central committee_dollar,politics,"[cpc, india, central committee, dollar, brics,...",[China's yuan firmed to a three-week high agai...
1,0,2698,0_cpc_central committee_cpc central_cpc centra...,politics,"[cpc, central committee, cpc central, cpc cent...","[Huang Kunming, a member of the Political Bure..."
2,1,1625,1_russia_peace_china sea_south china sea,international-relations,"[russia, peace, china sea, south china sea, ex...",[The course of events in the year since the la...
3,2,1210,2_li said_keqiang_li keqiang_premier li keqiang,finance,"[li said, keqiang, li keqiang, premier li keqi...",[China will consolidate and expand its economi...
4,3,427,3_poverty_grain_alleviation_poverty alleviation,agriculture,"[poverty, grain, alleviation, poverty alleviat...","[With Xi Jinping in charge, China's poverty-re..."
5,4,385,4_care_patients_drugs_healthcare,healthcare,"[care, patients, drugs, healthcare, epidemic, ...",[Premier says changes will be good for public ...
6,5,248,5_discipline_bribes_cpc_discipline inspection,politics,"[discipline, bribes, cpc, discipline inspectio...","[Yang Keqin. Yang Keqin, former chief procurat..."
7,6,223,6_imports_tons_crude_metric,finance,"[imports, tons, crude, metric, bpd, metric ton...",[China's imports of major commodities lost mom...
8,7,207,7_pollution_river_ecological_environmental pro...,environment,"[pollution, river, ecological, environmental p...","[A total of 5,500 Chinese sturgeons are releas..."
9,8,167,8_index_hang seng_hang_seng,finance,"[index, hang seng, hang, seng, blue chip, comp...",[China stocks closed lower on Tuesday after a ...


In [34]:
topic_model.visualize_heatmap(custom_labels=True)

### Inference

In [35]:
df = pd.read_csv(INFERENCE_CHINA_FILE_PATH)

In [36]:
x = topic_model.transform(df.maintext)

2023-12-04 01:38:50,926 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2023-12-04 01:38:55,051 - BERTopic - Dimensionality - Completed ✓
2023-12-04 01:38:55,052 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2023-12-04 01:38:55,422 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2023-12-04 01:38:56,165 - BERTopic - Probabilities - Completed ✓
2023-12-04 01:38:56,166 - BERTopic - Cluster - Completed ✓


In [37]:
df["topic"] = [topic_model.get_topic_info(i).CustomName[0] for i in x[0]]

In [39]:
df.topic.value_counts()

politics                   2034
finance                     860
international-relations     841
agriculture                 113
healthcare                  111
environment                  41
Name: topic, dtype: int64

In [40]:
df.to_csv(INFERENCE_CHINA_FILE_PATH, index=False)

In [41]:
del df_china, df, topic_model, embeddings, x

## Russia

In [42]:
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")
embeddings = embedding_model.encode(df_russia.maintext, show_progress_bar=True)

### Training

In [51]:
topic_model = BERTopic(
  nr_topics=11,

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,

  # Settings
  language="english",
  calculate_probabilities=True,

  # Hyperparameters
  top_n_words=10,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(df_russia.maintext, embeddings)

2023-12-04 01:49:49,118 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-04 01:50:02,543 - BERTopic - Dimensionality - Completed ✓
2023-12-04 01:50:02,545 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-04 01:50:04,608 - BERTopic - Cluster - Completed ✓
2023-12-04 01:50:04,609 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-04 01:50:47,468 - BERTopic - Representation - Completed ✓
2023-12-04 01:50:47,590 - BERTopic - Topic reduction - Reducing number of topics
2023-12-04 01:50:47,592 - BERTopic - Topic reduction - Reduced number of topics from 7 to 7


### Setting Custom Labels

In [54]:
for topic in range(-1, 6):
  print(f"Topic {topic+2}: " + ", ".join(topic_model.get_topic_info(topic)["Representation"][0]))

Topic 1: yandex, court, griner, witnesses, jehovahs, iss, roscosmos, moon, jehovahs witnesses, calvey
Topic 2: athletes, doping, olympic, ioc, games, olympics, compete, sport, wada, anti doping
Topic 3: azerbaijan, armenia, karabakh, armenian, nagorno, nagorno karabakh, pashinyan, azerbaijani, ceasefire, yerevan
Topic 4: korea, north korea, kim, korean, pyongyang, north korean, weapons, jong, kim jong, arms
Topic 5: vaccine, covid, covid 19, sputnik, infections, virus, vaccination, patients, vaccines, vaccinated
Topic 6: wagner, prigozhin, bakhmut, mercenary, fighters, yevgeny prigozhin, mutiny, mercenaries, prigozhin said, libya
Topic 7: nuclear, navalny, court, billion, opposition, bank, nato, eu, weapons, attacks


In [55]:
topic_model.set_topic_labels({-1: "politics",
                              0: "sports",
                              1: "conflict",
                              2: "conflict",
                              3: "health",
                              4: "conflict",
                              5: "politics"})

In [56]:
topic_model.visualize_topics(custom_labels=True)

In [57]:
topic_model.visualize_hierarchy(custom_labels=True, width=600)

In [58]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,516,-1_yandex_court_griner_witnesses,politics,"[yandex, court, griner, witnesses, jehovahs, i...",[Danish national Dennis Christensen was arrest...
1,0,214,0_athletes_doping_olympic_ioc,sports,"[athletes, doping, olympic, ioc, games, olympi...",[Russian athletes have won a total of 25 medal...
2,1,160,1_azerbaijan_armenia_karabakh_armenian,conflict,"[azerbaijan, armenia, karabakh, armenian, nago...",[Six Armenian soldiers were captured by Azerba...
3,2,163,2_korea_north korea_kim_korean,conflict,"[korea, north korea, kim, korean, pyongyang, n...",[North Korean leader Kim Jong Un plans to trav...
4,3,919,3_vaccine_covid_covid 19_sputnik,health,"[vaccine, covid, covid 19, sputnik, infections...","[In early October, as the second wave of Russi..."
5,4,363,4_wagner_prigozhin_bakhmut_mercenary,conflict,"[wagner, prigozhin, bakhmut, mercenary, fighte...",[Belarus said on Friday that fighters from the...
6,5,13737,5_nuclear_navalny_court_billion,politics,"[nuclear, navalny, court, billion, opposition,...",[A Ukrainian attack on a strategic shipyard ea...


In [59]:
topic_model.visualize_heatmap(custom_labels=True)

### Inference

In [60]:
df = pd.read_csv(INFERENCE_RUSSIA_FILE_PATH)

In [61]:
x = topic_model.transform(df.maintext)

2023-12-04 01:53:36,864 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2023-12-04 01:53:41,393 - BERTopic - Dimensionality - Completed ✓
2023-12-04 01:53:41,394 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2023-12-04 01:53:41,826 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2023-12-04 01:53:42,714 - BERTopic - Probabilities - Completed ✓
2023-12-04 01:53:42,715 - BERTopic - Cluster - Completed ✓


In [62]:
df["topic"] = [topic_model.get_topic_info(i).CustomName[0] for i in x[0]]

In [63]:
df.topic.value_counts()

politics    3575
conflict     226
health       142
sports        57
Name: topic, dtype: int64

In [64]:
df.to_csv(INFERENCE_RUSSIA_FILE_PATH, index=False)

In [65]:
del df_russia, df, topic_model, embeddings, x