<a href="https://colab.research.google.com/github/LimingXu-MAG/2020_Autumn_FDS_example_solutions/blob/main/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BERTopic - Tutorial**
We start by installing BERTopic from PyPI before preparing the data.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!

In [2]:
!pip install bertopic[visualization]

Collecting bertopic[visualization]
  Downloading https://files.pythonhosted.org/packages/7d/a5/700851ac2bc1068462b8ee18b52b54a6716b4f90d758b232105b50c9f227/bertopic-0.4.3-py2.py3-none-any.whl
Collecting joblib==0.17.0
  Downloading https://files.pythonhosted.org/packages/fc/c9/f58220ac44a1592f79a343caba12f6837f9e0c04c196176a3d66338e1ea8/joblib-0.17.0-py3-none-any.whl (301kB)
Collecting hdbscan>=0.8.26
  Downloading https://files.pythonhosted.org/packages/22/2f/2423d844072f007a74214c1adc46260e45f034bb1679ccadfbb8a601f647/hdbscan-0.8.26.tar.gz (4.7MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sentence-transformers>=0.3.9
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb

# **Prepare data**
For this example, we use a set of customer survey comments loaded from `customer_survey_comments.csv`, which contains 3,488 free-text responses.

In [4]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pandas as pd
 
docs = pd.read_csv('customer_survey_comments.csv')['comments']

docs

0       I know that there is a shortage of skilled eng...
1       We had a bit bad luck on that day, one of thei...
2       It is quite easy to order parts..I just ring a...
3                                          No not really.
4       Stop phoning me for a stupid survey everytime ...
                              ...                        
3483                    A delivery service would be nice.
3484                                I have had no issues.
3485                                    Cut their prices.
3486                                  No problems at all.
3487    Talk to me on the phone not email, get the job...
Name: comments, Length: 3488, dtype: object

Remove stopwords

In [5]:
from gensim.parsing.preprocessing import remove_stopwords

# Remove stopwords from every comment
docs = docs.apply(remove_stopwords)

In [6]:
docs

0       I know shortage skilled engineers. If improve ...
1       We bit bad luck day, guys sick. But best keepi...
2       It easy order parts..I ring number speak them....
3                                              No really.
4       Stop phoning stupid survey everytime I order p...
                              ...                        
3483                             A delivery service nice.
3484                                            I issues.
3485                                          Cut prices.
3486                                     No problems all.
3487                         Talk phone email, job right.
Name: comments, Length: 3488, dtype: object
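The output above keeps capitalised words such as "I", "We" and "No", which suggests that the stop-word matching is case-sensitive. The following is a minimal pure-Python sketch of that behaviour, not gensim's implementation: the `STOPWORDS` set and the `strip_stopwords` helper are hypothetical and only serve to show how lowercasing first changes the result.

```python
# Tiny illustrative stop list (hypothetical; a real stop list is much longer).
STOPWORDS = frozenset({"i", "we", "it", "is", "to", "a", "the",
                       "no", "not", "have", "had", "at", "all"})

def strip_stopwords(text, lowercase=True):
    """Drop stop words from a whitespace-tokenized string."""
    tokens = text.split()
    if lowercase:
        # Lowercase before matching so that "I" and "No" are also removed.
        tokens = [t.lower() for t in tokens]
    return " ".join(t for t in tokens if t not in STOPWORDS)

# Case-sensitive matching leaves the capitalised "I" in place.
print(strip_stopwords("I have had no issues.", lowercase=False))
# Lowercasing first removes it as well.
print(strip_stopwords("I have had no issues.", lowercase=True))
```

Whether to lowercase before removing stop words is a judgment call; for very short comments it can leave almost nothing to embed.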

# **Create Topics**
We select "english" as the main language for our documents. If you want a multilingual model that supports 50+ languages, select "multilingual" instead.

In [None]:
# sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens")
# embeddings = sentence_model.encode(docs, show_progress_bar=False)

# # Create topic model
# model = BERTopic()
# topics, probs = model.fit_transform(docs, embeddings)

model = BERTopic(language="english")
topics, probs = model.fit_transform(docs)

top_k = 5

# Further reduce topics 
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=top_k)

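# Save the top words of each topic together with their scores.
# Separate names are used here so that `topics` and `probs` returned by
# fit_transform() are not overwritten; later cells still need them.
topic_words = model.get_topics()
topic_df = pd.DataFrame()
for i, word_scores in enumerate(topic_words.values()):
  topic_df['topic[' + str(i) + ']_words'] = [w for w, _ in word_scores]
  topic_df['topic[' + str(i) + ']_probs'] = [s for _, s in word_scores]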

topic_df.to_csv(f'top{top_k}_topics.csv')

In [None]:
topic_df

We can then extract the most frequent topics.

Topic -1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated:

In [None]:
model.visualize_topics()

In [None]:
model.get_topic(5)
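As a rough illustration of the frequency ranking mentioned above, the sketch below counts how often each topic id occurs in a list of per-document assignments and drops the outlier topic -1. The `example_topics` list and the `topic_frequencies` helper are hypothetical; the list is only assumed to have the same shape as the `topics` value returned by `fit_transform()`.

```python
from collections import Counter

# Hypothetical per-document topic assignments (one topic id per document).
example_topics = [-1, 0, 2, 0, 1, -1, 0, 2, -1, -1]

def topic_frequencies(assignments, skip_outliers=True):
    """Return (topic, count) pairs sorted from most to least frequent."""
    counts = Counter(assignments)
    if skip_outliers:
        # Topic -1 collects the outliers and is usually ignored.
        counts.pop(-1, None)
    return counts.most_common()

print(topic_frequencies(example_topics))
```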

Note that the model is stochastic, which means that the topics might differ across runs.

For a full list of supported languages, see the values below:

# **Embedding model**
You can select any model from `sentence-transformers` and use it instead of the preselected models by passing its name to 
BERTopic through the `embedding_model` parameter:

In [None]:
# st_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  


# **Visualize Topics**
After having trained our `BERTopic` model, we could go through perhaps a hundred topics one by one to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
model.visualize_topics()

# **Visualize Topic Probabilities**

The variable `probs` that is returned from `transform()` or `fit_transform()` can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [None]:
model.visualize_distribution(probs[0])
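The same information can also be read off without a plot. The sketch below ranks the topics of a single document by probability, assuming that `probs[0]` is a sequence with one probability per topic. The `example_probs` values and the `top_topics` helper are hypothetical and only illustrate the idea.

```python
def top_topics(doc_probs, n=3, min_prob=0.0):
    """Return the n most probable (topic_index, probability) pairs."""
    ranked = sorted(enumerate(doc_probs), key=lambda pair: pair[1], reverse=True)
    # Keep only the leading entries that reach the minimum probability.
    return [(topic, p) for topic, p in ranked[:n] if p >= min_prob]

# Hypothetical probabilities of one document over four topics.
example_probs = [0.05, 0.60, 0.10, 0.25]
print(top_topics(example_probs, n=2))
```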

# **Topic Reduction**
Finally, we can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so 
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to 
predict before training your model how many topics are in your documents and how many will be extracted. 
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=10)


The reasoning for putting `docs`, `topics`, and `probs` as parameters is that these values are not saved within 
BERTopic on purpose. If you were to have a million documents, it seems very inefficient to save those in BERTopic 
instead of a dedicated database.  

# **Topic Representation**
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stop_words or you want to try out a different n_gram_range. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [None]:
model.update_topics(docs, topics, n_gram_range=(1, 3), stop_words="english")
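To give a feeling for what `stop_words` and `n_gram_range` change, here is a simplified pure-Python sketch of a class-based TF-IDF score. It is not BERTopic's implementation: the weighting formula is one common formulation chosen for illustration, and the `class_tfidf` helper, its parameters and the example documents are all hypothetical.

```python
import math
from collections import Counter

def class_tfidf(docs_per_topic, stop_words=frozenset(), ngram_range=(1, 1)):
    """Score terms per topic: frequency inside the topic, damped by how
    common the term is across all topics."""
    def ngrams(text):
        tokens = [t for t in text.lower().split() if t not in stop_words]
        low, high = ngram_range
        for n in range(low, high + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n])

    # Term counts per topic, treating all documents of a topic as one text.
    counts = {topic: Counter(g for d in docs for g in ngrams(d))
              for topic, docs in docs_per_topic.items()}
    if not counts:
        return {}
    total = Counter()
    for c in counts.values():
        total.update(c)
    avg_terms = sum(total.values()) / len(counts)

    scores = {}
    for topic, c in counts.items():
        n_terms = sum(c.values())
        scores[topic] = {term: (freq / n_terms) * math.log(1 + avg_terms / total[term])
                         for term, freq in c.items()}
    return scores

example = {0: ["cut the prices", "prices too high"],
           1: ["delivery service would be nice", "fast delivery"]}
scores = class_tfidf(example, stop_words={"the", "would", "be", "too"})
best = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(best)
```

Passing `ngram_range=(1, 2)` to the same helper adds two-word terms such as "delivery service" to the candidates.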

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search term. Here, we search for topics that closely relate to the 
search term "parts". Then, we extract the most similar topic and check the results: 

In [None]:
model.get_topics()

In [None]:
similar_topics, similarity = model.find_topics("parts", top_n=5); similar_topics

In [None]:
# Inspect the topic that is most similar to the search term
model.get_topic(similar_topics[0])
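A search of this kind can be pictured as ranking topics by the similarity between an embedding of the search term and an embedding of each topic. That this is how `find_topics` works internally is an assumption here, and the two-dimensional vectors and the `cosine` and `rank_topics` helpers below are hypothetical stand-ins for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return the top_n (topic, similarity) pairs, most similar first."""
    scored = [(topic, cosine(query_vec, vec)) for topic, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy embeddings for three topics and one search term.
topic_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
ranked = rank_topics([1.0, 0.1], topic_vecs, top_n=2)
print([topic for topic, _ in ranked])
```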

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
model.save("my_model")

In [None]:
# Load model
my_model = BERTopic.load("my_model")
