# **Tutorial** - Topic Modeling with BERTopic
(last updated 08-06-2021)

In this tutorial we will be exploring how to use BERTopic to create topics from the well-known 20Newsgroups dataset. The most frequent use-cases and methods are discussed together with important parameters to keep a look out for. 


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPi:

In [1]:
%%capture
!pip install bertopic

## Restart the Notebook
After installing BERTopic, some packages that were already loaded were updated and in order to correctly use them, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
For this example, we use the popular 20 Newsgroups dataset which contains roughly 18000 newsgroups posts

In [2]:
# from sklearn.datasets import fetch_20newsgroups
# docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

docs = []
with open('../resources/corpus/inuse/cluster/tmp6', mode='r', encoding='utf-8') as f:
    lines = f.readlines()
    for item in lines:
        item = item.strip()
        # print(item)
        # if len(item.split(' ')) != 2:
        #     continue
        docs.append(item)

In [3]:
print(docs[1])

自然吸气发动机和涡轮增压发动机的区别是什么?


# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead. 

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly at large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model. 


In [4]:
from bertopic import BERTopic

topic_model = BERTopic(language="chinese (simplified)", calculate_probabilities=True, verbose=True, top_n_words=10, min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)

Batches: 100%|██████████| 19/19 [00:12<00:00,  1.54it/s]
2022-07-14 16:35:46,558 - BERTopic - Transformed documents to Embeddings
2022-07-14 16:35:51,011 - BERTopic - Reduced dimensionality with UMAP
2022-07-14 16:35:51,097 - BERTopic - Clustered UMAP embeddings with HDBSCAN


**NOTE**: Use `language="multilingual"` to select a model that support 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [5]:
freq = topic_model.get_topic_info(); freq.head(5)

Unnamed: 0,Topic,Count,Name
0,-1,112,-1_0导航你们要哪个_汽车因为发动机坏了_摩托车发动机声音大的原因是什么_我想买个电动车
1,0,26,0_0最低价多少哦_需要多少钱_二手车过户要多少钱啊请问_哈佛h6分期付款需要多少现金
2,1,24,1_11款北京现代瑞纳正时皮带多久更换一次_都在哪_上海有哪些灯光高尔夫球场_上海大众
3,2,24,2_大众汽车车上rear什么意思_挡位m加和m减是什么意思_逆变器是什么_车里的eco是什么意思
4,3,23,3_18款速腾1_4t熠动版隐藏功能有哪些_车内灯在行驶过程中无法正常关闭_请问这是什么原因


In [6]:
len(freq)

50

In [8]:
freq.head(50)

Unnamed: 0,Topic,Count,Name
0,-1,112,-1_0导航你们要哪个_汽车因为发动机坏了_摩托车发动机声音大的原因是什么_我想买个电动车
1,0,26,0_0最低价多少哦_需要多少钱_二手车过户要多少钱啊请问_哈佛h6分期付款需要多少现金
2,1,24,1_11款北京现代瑞纳正时皮带多久更换一次_都在哪_上海有哪些灯光高尔夫球场_上海大众
3,2,24,2_大众汽车车上rear什么意思_挡位m加和m减是什么意思_逆变器是什么_车里的eco是什么意思
4,3,23,3_18款速腾1_4t熠动版隐藏功能有哪些_车内灯在行驶过程中无法正常关闭_请问这是什么原因
5,4,21,4_什么是交通标志智能识别系统_双r是什么汽车_跑车有哪些牌子_羊头标志的皮卡是什么车
6,5,19,5_10万以内的suv车排行榜前十名_10万左右口碑最好的车是什么_还多少期最划算_运输
7,6,19,6_手机电池在没电的时候重量会不会比较轻_手机电池如何放电_那如果是锂电池的笔记本好吗_车辆...
8,7,18,7_如何查召回汽车_驾驶证到期在哪里换证_行驶证被扣_自己原车牌可以挂二手车上吗
9,8,18,8_车辆的后门从内部打不开_怎样清除汽车座套的异味_汽车漆面裂纹怎么修复_汽车前风挡贴膜的问题


-1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that were generated:

In [7]:
# topic_model.get_topics()
for index in range(-1, 200):
    res = []
    for item in topic_model.get_topic(index):
        res.append(item[0])
    print(str(index) + "\t" + "____".join(res))

-1	0导航你们要哪个____汽车因为发动机坏了____摩托车发动机声音大的原因是什么____我想买个电动车____新车被刮蹭了是去4s店补漆好还是去修理厂好____想改装下____怎么接的啊____怎么切换音频设备____异地审车都要什么手续____小鸟
0	0最低价多少哦____需要多少钱____二手车过户要多少钱啊请问____哈佛h6分期付款需要多少现金____或者换一个____修下普桑的倒车镜要多少钱啊____改装多少钱____中介卖一台二手车能赚多少钱____一辆电动车的承重是多少斤____一般情况下清洗节气门的费用是多少
1	11款北京现代瑞纳正时皮带多久更换一次____都在哪____上海有哪些灯光高尔夫球场____上海大众____上海大众朗逸最小离地间隙是多少____成都哪点有打汽车座套的啊____像这样的工业垃圾怎么处理啊____买本地二手车过户要什么手续____二手车怎么过户____急啊每天都有一车
2	大众汽车车上rear什么意思____挡位m加和m减是什么意思____逆变器是什么____车里的eco是什么意思____车上的圈a是什么意思____自动档车加号减号是什么意思____汽车里的setup什么意思____汽车空档是什么意思____汽车的排量t和l有什么区别啊____汽车排量里的t什么意思
3	18款速腾1____4t熠动版隐藏功能有哪些____车内灯在行驶过程中无法正常关闭____请问这是什么原因____荣威350保养灯归零步骤是什么____老款别克君越车____电车充电器一直红灯充不满怎么回事____电动车遥控器按下没反应____电动车的显示屏不亮____电动车按喇叭
4	什么是交通标志智能识别系统____双r是什么汽车____跑车有哪些牌子____羊头标志的皮卡是什么车____汽车标志一个圈圈____汽车是谁发明的____标志汽车是哪个国家的____标志t的汽车是什么牌子____有条蛇的标志是什么车____有两个旗子的标志的汽车是什么牌子
5	10万以内的suv车排行榜前十名____10万左右口碑最好的车是什么____还多少期最划算____运输____过户详细步骤____贷款买车是不是必须上盗抢险呢____购买异地二手车验车____能买到什么样的____神州租车____求教
6	手机电池在没电的时候重量会不会比较轻____手机电池如何

TypeError: 'bool' object is not iterable

In [None]:
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=20)

In [None]:
# # topic_model.get_topics()
# for index in range(-1, 20):
#     res = []
#     for item in topic_model.get_topic(index):
#         res.append(item[0])
#     print(str(index) + "\t" + "____".join(res))

In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

In [None]:
print(len(topic_model.get_topic(-1)))
print(topic_model.get_topic(-1))

**NOTE**: BERTopic is stocastich which mmeans that the topics might differ across runs. This is mostly due to the stocastisch nature of UMAP.

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topic to get a good 
understanding of the topics that were extract. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
# topic_model.visualize_topics()

## Visualize Topic Probabilities

The variable `probabilities` that is returned from `transform()` or `fit_transform()` can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [None]:
# topic_model.visualize_distribution(probs[200], min_probability=0.015)

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. This might help selecting an appropriate nr_topics when reducing the number of topics that you have created.

In [None]:
# topic_model.visualize_hierarchy(top_n_topics=250)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
# topic_model.visualize_barchart(top_n_topics=20)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply applying cosine similarities through those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.

In [None]:
# topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

## Visualize Term Score Decline
Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the words with the highest c-TF-IDF score will have a rank of 1, will be put on the x-axis. Whereas the y-axis will be populated by the c-TF-IDF scores. The result is a visualization that shows you the decline of c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, the select the best number of words in a topic.


In [None]:
# topic_model.visualize_term_rank()

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [None]:
# topic_model.update_topics(docs, topics, n_gram_range=(1, 2))

In [None]:
# topic_model.get_topic(0)   # We select topic that we viewed before

In [None]:
# topic_model.get_topic(1)

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so, 
is that you can decide the number of topics after knowing how many are actually created. It is difficult to 
predict before training your model how many topics that are in your documents and how many will be extracted. 
Instead, we can decide afterwards how many topics seems realistic:





In [None]:
# new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=20)

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search_term. Here, we are going to be searching for topics that closely relate the 
search term "vehicle". Then, we extract the most similar topic and check the results: 

In [None]:
# similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
# topic_model.save("my_model")	

In [None]:
# Load model
# my_model = BERTopic.load("my_model")	

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers here and pass it through BERTopic with embedding_model:



In [None]:
# topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
# from sentence_transformers import SentenceTransformer

# sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
# topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
