# BERTopic Tutorial
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.

https://arxiv.org/pdf/2203.05794.pdf

BERTopic pipeline: Document Embedding (BERT) > Dimensionality Reduction (UMAP) > Clustering (HDBSCAN) > Topic Representation (c-TF-IDF)
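
To make the last step more concrete, here is a minimal sketch of the class-based TF-IDF idea from the paper cited above, where all documents in a cluster are treated as one class and a term is scored as W(t, c) = tf(t, c) * log(1 + A / tf(t)), with A the average number of words per class. The toy clusters, the whitespace tokenization, and the function name `class_tf_idf` are assumptions made for illustration; the library's own implementation differs in details such as tokenization and normalization.

```python
import math
from collections import Counter

# Hypothetical toy clusters: each cluster's documents act as one "class document".
clusters = {
    0: ["easy pasta recipe", "best pasta sauce recipe"],
    1: ["wedding dress ideas", "best wedding venues"],
}

def class_tf_idf(clusters):
    """Score terms per cluster with W(t, c) = tf(t, c) * log(1 + A / tf(t))."""
    # tf(t, c): term counts per class, using whitespace tokenization for simplicity.
    tf = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
    # tf(t): frequency of each term across all classes.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(tf)
    return {
        c: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }

scores = class_tf_idf(clusters)
for c, term_scores in scores.items():
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(c, [(t, round(s, 3)) for t, s in top])
```

In this toy example, a word that appears in both clusters ("best") scores lower than a word that is frequent in only one cluster ("pasta" or "wedding"), which is the behavior that makes the scores useful as topic descriptions.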

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic

# Data
For this example, we use a large Kaggle dataset of Huffington Post news headlines labeled with 20 topics. We drop rows without a headline and keep a random 10% of the remaining rows.

https://www.kaggle.com/datasets/rmisra/news-category-dataset

In [None]:
import pandas as pd

url = 'https://github.com/EunCheolChoi0123/COMM557Tutorial/raw/main/Tutorial%205%20NLP%20(2)%20Topic%20Modeling/news_headline_sample.csv'
df = pd.read_csv(url)
df = df[~df.headline.isna()]  # drop rows without a headline
df = df.sample(frac=0.1)      # keep a random 10% of the rows
docs = df.headline.to_list()  # BERTopic expects a list of strings

In [None]:
len(docs)

20952

In [None]:
df.category.value_counts()

category,count
POLITICS,3602
WELLNESS,1798
ENTERTAINMENT,1747
TRAVEL,986
STYLE & BEAUTY,951
PARENTING,907
HEALTHY LIVING,684
QUEER VOICES,656
FOOD & DRINK,596
BUSINESS,568


# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large collections (more than 100,000 documents), so consider setting `calculate_probabilities=False` if you want to speed up the model.


In [None]:
%%time
# This cell took about 3 minutes on a T4 GPU
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2024-10-01 00:44:16,222 - BERTopic - Embedding - Transforming documents to embeddings.



2024-10-01 00:44:31,989 - BERTopic - Embedding - Completed ✓
2024-10-01 00:44:31,990 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-01 00:45:12,530 - BERTopic - Dimensionality - Completed ✓
2024-10-01 00:45:12,532 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-01 00:46:45,417 - BERTopic - Cluster - Completed ✓
2024-10-01 00:46:45,430 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-10-01 00:46:45,975 - BERTopic - Representation - Completed ✓


CPU times: user 2min 50s, sys: 3.41 s, total: 2min 53s
Wall time: 3min 5s


## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first, as they best represent the collection of documents. Topic -1 contains all outliers, that is, documents that were not assigned to any topic. We will reduce the number of outliers by assigning these documents to the topics with the most similar embeddings.

In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,9461,-1_the_of_to_in,"[the, of, to, in, is, and, on, for, trump, video]",[What My Son Has Taught Me in the First 100 Da...
1,0,873,0_recipes_food_eat_recipe,"[recipes, food, eat, recipe, cooking, thanksgi...","[How Food Photos Can Make Us Healthier, Angel ..."
2,1,216,1_wedding_weddings_proposal_engagement,"[wedding, weddings, proposal, engagement, coup...",[Real Weddings: Couples Who Got Married This W...
3,2,211,2_divorce_marriage_married_cheating,"[divorce, marriage, married, cheating, divorce...",[Divorce Confidential: A Cheating Heart and It...
4,3,190,3_women_womens_qa_business,"[women, womens, qa, business, gender, feminism...",[International Women's Day 2012: Advancing Wom...


In [None]:
# Reduce outliers using the `embeddings` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")

In [None]:
topic_model.update_topics(docs, topics=new_topics)
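
To build some intuition for the `embeddings` strategy, the following sketch moves each outlier to the topic whose embedding is most similar to the document's embedding, measured by cosine similarity. It is a simplified illustration under stated assumptions rather than the library's code: the 2-D vectors, the helper names `cosine` and `reassign_outliers`, and the `threshold` parameter are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reassign_outliers(doc_topics, doc_embeddings, topic_embeddings, threshold=0.0):
    """Move each outlier (-1) to its most similar topic if the similarity exceeds threshold."""
    updated = []
    for topic, emb in zip(doc_topics, doc_embeddings):
        if topic != -1:
            updated.append(topic)  # non-outliers keep their topic
            continue
        best_topic, best_sim = max(
            ((t, cosine(emb, t_emb)) for t, t_emb in topic_embeddings.items()),
            key=lambda pair: pair[1],
        )
        updated.append(best_topic if best_sim > threshold else -1)
    return updated

# Hypothetical 2-D embeddings; real sentence embeddings have hundreds of dimensions.
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_topics = [0, -1, -1, 1]
doc_embeddings = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 1.0]]

new_doc_topics = reassign_outliers(doc_topics, doc_embeddings, topic_embeddings)
print(new_doc_topics)
```

Raising the threshold leaves weakly matching documents as outliers, which trades coverage for cleaner topics.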



In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,940,0_recipes_food_recipe_eat,"[recipes, food, recipe, eat, cooking, thanksgi...","[How Food Photos Can Make Us Healthier, Angel ..."
1,1,231,1_wedding_weddings_proposal_couple,"[wedding, weddings, proposal, couple, engageme...",[Real Weddings: Couples Who Got Married This W...
2,2,237,2_divorce_marriage_married_cheating,"[divorce, marriage, married, cheating, divorce...",[Divorce Confidential: A Cheating Heart and It...
3,3,274,3_women_womens_gender_qa,"[women, womens, gender, qa, business, female, ...",[International Women's Day 2012: Advancing Wom...
4,4,268,4_gay_lgbt_lgbtq_queer,"[gay, lgbt, lgbtq, queer, pride, rights, commu...","[Why I Don't Celebrate LGBT Pride Month, Gay P..."


Next, let's take a look at one of the frequent topics that was generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('recipes', 0.025492883277435693),
 ('food', 0.024281004804677698),
 ('recipe', 0.016049274898110404),
 ('eat', 0.015595100153641257),
 ('cooking', 0.013295814590861882),
 ('thanksgiving', 0.012882214132455272),
 ('foods', 0.01221813199740809),
 ('healthy', 0.010656163203564835),
 ('best', 0.009299870526515121),
 ('cake', 0.009144710599689465)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
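
If you need runs that are easier to compare, one common approach is to pass BERTopic a UMAP model with a fixed `random_state`. The sketch below wraps this in a hypothetical helper, `build_seeded_topic_model`; the UMAP settings are assumed to mirror BERTopic's defaults, and the seed value 42 is arbitrary.

```python
def build_seeded_topic_model(seed=42):
    """Return a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported lazily so that defining this helper does not itself require the packages.
    from bertopic import BERTopic
    from umap import UMAP

    # Assumed to match BERTopic's default UMAP settings; only random_state is added.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )
    return BERTopic(
        language="english",
        calculate_probabilities=True,
        verbose=True,
        umap_model=umap_model,
    )
```

Calling `build_seeded_topic_model().fit_transform(docs)` would then replace the training cell above. Note that the data cell also samples rows without a seed, so `df.sample(frac=0.1, random_state=...)` would be needed as well for the documents themselves to stay the same, and fixing the UMAP seed may make that step slower.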

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]

[79, 51, 14, 258, 4, 54, 23, 11, 76, 39]

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created.

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The probabilities returned from `transform()` or `fit_transform()` (stored in `probs` above) can
be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the distribution for a single document (here, the document at index 200), we call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.001)
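
The same information can be read directly from a probability vector. As a rough sketch, the hypothetical helper below lists the topics whose probability reaches a cutoff, highest first. It assumes that position `i` in the vector corresponds to topic `i`, and the example vector is made up for illustration rather than taken from the model above.

```python
def confident_topics(probabilities, min_probability=0.001):
    """Return (topic_id, probability) pairs at or above min_probability, highest first."""
    pairs = [(topic, p) for topic, p in enumerate(probabilities) if p >= min_probability]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical probability vector for a single document over five topics.
doc_probs = [0.0004, 0.62, 0.0, 0.21, 0.003]
print(confident_topics(doc_probs))
```

With the trained model, `confident_topics(probs[200])` would give a textual counterpart to the chart.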

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created.

This allows for fine-tuning the model to your specifications and wishes.

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:


In [None]:
topic_model.update_topics(docs, n_gram_range=(2, 3))

In [None]:
topic_model.get_topic(0)   # We select the topic that we viewed before

[('the best', 0.005726331944734901),
 ('to make', 0.00539328563229145),
 ('to eat', 0.005381521797903613),
 ('recipe of the', 0.004990172553730075),
 ('recipe of', 0.004990172553730075),
 ('how to', 0.004886302758439902),
 ('how to make', 0.004189469312446796),
 ('recipes that', 0.004099578447577993),
 ('ice cream', 0.003724536182291051),
 ('of the day', 0.0036192806607268212)]
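
The output above is dominated by phrases such as "how to" and "of the day" because every run of two or three consecutive words becomes a candidate term, stopwords included. The sketch below shows what `n_gram_range=(2, 3)` means for a single headline. The tokenizer is a simplified assumption (lowercased words of at least two characters), and the headline and the helper name `word_ngrams` are hypothetical.

```python
import re

def word_ngrams(text, n_gram_range=(2, 3)):
    """Return all word n-grams of the requested lengths, lowercased."""
    # Simplified tokenizer: lowercase, keep words with at least two characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical headline
ngrams = word_ngrams("How to Make the Best Ice Cream")
print(ngrams)
```

If such phrases are not useful, one option is to pass a vectorizer that removes stopwords to `update_topics` through its `vectorizer_model` argument.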

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

2024-10-01 00:48:42,206 - BERTopic - Topic reduction - Reducing number of topics
2024-10-01 00:48:43,885 - BERTopic - Topic reduction - Reduced number of topics from 289 to 60


<bertopic._bertopic.BERTopic at 0x791d0f39fe50>
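
To give a feel for what merging topics involves, here is a deliberately simplified sketch that repeatedly folds the smallest topic into its most similar neighbor until the requested number of topics remains. This greedy scheme is an illustration only and should not be read as the algorithm `reduce_topics` uses, which may differ between library versions. The topic vectors, sizes, and the helper name `reduce_to` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reduce_to(topic_vectors, topic_sizes, nr_topics):
    """Greedily merge the smallest topic into its most similar neighbor."""
    vectors = {t: list(v) for t, v in topic_vectors.items()}
    sizes = dict(topic_sizes)
    mapping = {t: t for t in vectors}  # original topic -> surviving topic
    while len(vectors) > nr_topics:
        smallest = min(sizes, key=sizes.get)
        target = max(
            (t for t in vectors if t != smallest),
            key=lambda t: cosine(vectors[smallest], vectors[t]),
        )
        # A size-weighted average keeps the merged vector representative of both topics.
        total = sizes[smallest] + sizes[target]
        vectors[target] = [
            (a * sizes[target] + b * sizes[smallest]) / total
            for a, b in zip(vectors[target], vectors[smallest])
        ]
        sizes[target] = total
        del vectors[smallest], sizes[smallest]
        mapping = {orig: (target if cur == smallest else cur) for orig, cur in mapping.items()}
    return mapping, sizes

# Hypothetical topic vectors and document counts
topic_vectors = {0: [1.0, 0.0], 1: [0.9, 0.2], 2: [0.0, 1.0], 3: [0.1, 0.9]}
topic_sizes = {0: 50, 1: 10, 2: 40, 3: 5}

mapping, merged_sizes = reduce_to(topic_vectors, topic_sizes, nr_topics=2)
print(mapping, merged_sizes)
```

The returned mapping plays a role similar to the relabeling visible in `topic_model.topics_` after reduction: every document keeps its text, but its topic id changes to that of the surviving topic.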

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[1, 2, 12, 0, 6, 37, 1, 1, 19, 1, 4, 14, 1, 10, 3, 2, 8, 11, 0, 0, 1, 4, 1, 2, 12, 0, 3, 35, 10, 4, 14, 29, 55, 1, 0, 0, 5, 1, 5, 18, 1, 2, 5, 0, 0, 3, 14, 4, 34, 0, 1, 2, 3, 8, 13, 23, 9, 1, 17, 2, 1, 0, 0, 1, 0, 1, 35, 21, 39, 25, 39, 3, 1, 2, 1, 0, 0, 8, 1, 0, 12, 5, 9, 9, 36, 0, 4, 0, 17, 14, 0, 45, 3, 35, 2, 26, 2, 0, 39, 53, 1, 2, 1, 0, 17, 16, 37, 0, 23, 14, 1, 22, 2, 21, 5, 0, 3, 5, 2, 7, 2, 3, 14, 33, 0, 14, 2, 1, 12, 39, 1, 0, 1, 10, 34, 4, 7, 37, 3, 3, 0, 38, 0, 0, 2, 13, 55, 34, 0, 5, 2, 2, 0, 0, 37, 0, 7, 1, 7, 0, 1, 1, 5, 4, 36, 2, 46, 6, 3, 38, 21, 4, 0, 10, 14, 4, 8, 6, 0, 16, 40, 0, 3, 5, 8, 33, 2, 9, 21, 1, 15, 0, 0, 13, 4, 16, 7, 1, 2, 5, 20, 12, 6, 9, 26, 37, 32, 32, 13, 2, 19, 2, 3, 2, 37, 0, 0, 0, 3, 1, 10, 1, 11, 18, 10, 17, 2, 1, 26, 12, 0, 2, 1, 30, 25, 0, 2, 6, 12, 2, 0, 6, 25, 4, 36, 0, 1, 0, 0, 2, 32, 14, 0, 4, 19, 16, 17, 2, 8, 18, 10, 0, 1, 1, 21, 23, 16, 6, 8, 6, 0, 0, 1, 7, 46, 33, 0, 2, 0, 13, 5, 6, 6, 17, 11, 4, 10, 3, 3, 14, 1, 0, 41, 2, 10, 0, 10, 6,

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input search term. Here, we search for topics that closely relate to the
search term "vehicle". Then, we extract the most similar topic and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[41, 2, 0, 11, 5]

In [None]:
topic_model.get_topic(similar_topics[0])

[('drowsy driving', 0.018781615485527647),
 ('uber drivers', 0.014633884988598602),
 ('is driving', 0.014633884988598602),
 ('driving your', 0.010270604638319894),
 ('selfdriving cars', 0.010270604638319894),
 ('selfdriving car', 0.010270604638319894),
 ('uber what', 0.010270604638319894),
 ('who drove', 0.010270604638319894),
 ('truck driver', 0.010270604638319894),
 ('halts selfdriving', 0.010270604638319894)]

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.

In [None]:
# Save model
topic_model.save("my_model")

In [None]:
# Load model
my_model = BERTopic.load("my_model")

# On Your Own
1. Use your collected 1,000 Reddit submissions.
2. Run BERTopic:
   - Reduce outliers.
   - Update the topics so that they make the most sense.