## BERTopic for Interpreting Contextual Embeddings (a demo based on Rotten Tomatoes)


Install `bertopic` and `datasets` first if they are not already available:

In [None]:
pip install bertopic

In [None]:
pip install datasets

### **Load dataset**
For this example, we use the Rotten Tomatoes dataset, a movie review dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. The data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales," Proceedings of the ACL, 2005.

In [3]:
from bertopic import BERTopic
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
# Keep only the first 500 training samples for this demo
docs = [dataset["train"][i]['text'] for i in range(500)]


### **BERTopic pipeline**

Select a model from `sentence-transformers` to turn each document into an embedding.

In [4]:
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")


Select a dimensionality reduction method (PCA, UMAP, ...) to project the embeddings into a lower-dimensional space that is easier to cluster.

In [5]:
from umap import UMAP
umap_model = UMAP(n_neighbors=10, n_components=10, min_dist=0.0, metric='cosine')
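
To make the idea of reducing embeddings concrete, here is a minimal sketch of PCA, one of the alternatives named above, written with NumPy only. It is not what the UMAP cell does: UMAP is non-linear, while PCA is a linear projection. The helper name `pca_reduce` and the random toy data are assumptions made for illustration.

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Project rows of `embeddings` onto their top principal components."""
    X = np.asarray(embeddings, dtype=float)
    # Center each feature so the SVD captures variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# Toy stand-in for sentence embeddings: 50 "documents", 384 features each.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 384))

reduced = pca_reduce(toy_embeddings, n_components=10)
print(reduced.shape)  # (50, 10)
```

The reduced matrix keeps one row per document, which is what the clustering step consumes.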

Select a clustering method (HDBSCAN, KMeans, ...) to group the reduced embeddings; each cluster becomes a topic.


In [6]:
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

Select a method for building the topic representations used to interpret each topic.

Here we use `CountVectorizer` and `c-TF-IDF` to represent each topic by its word counts.
Other options exist, such as using another language model to extract the topic representation from the documents.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(min_df=2, max_df=10, stop_words="english")

from bertopic.vectorizers import ClassTfidfTransformer
ctfidf_model = ClassTfidfTransformer()
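
As a rough illustration of what c-TF-IDF computes, the sketch below treats all documents of a topic as one long document and scores each term as its within-topic frequency times `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's total frequency across topics. This is a simplified pure-Python version under those assumptions, not the `ClassTfidfTransformer` implementation; the helper name `c_tf_idf` and the toy texts are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Score terms per class with a simplified c-TF-IDF.

    `class_docs` maps a class (topic) id to a single string holding all
    documents of that class joined together. Tokenization is a plain
    lowercase whitespace split.
    """
    counts = {c: Counter(text.lower().split()) for c, text in class_docs.items()}
    # Total frequency of each term across all classes.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # Average number of words per class.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores

# Two toy "topics"; "movie" appears in both and should score low.
toy = {
    0: "funny comedy funny jokes movie",
    1: "scary horror scary ghost movie",
}
scores = c_tf_idf(toy)
top_term = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_term)
```

Terms that are frequent inside one topic but rare elsewhere get the highest scores, which is why they work as topic keywords.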

Run the BERTopic pipeline

In [8]:
topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=hdbscan_model,
             vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2024-10-28 13:30:05,312 - BERTopic - Embedding - Transforming documents to embeddings.


2024-10-28 13:30:06,943 - BERTopic - Embedding - Completed ✓
2024-10-28 13:30:06,945 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-28 13:30:18,535 - BERTopic - Dimensionality - Completed ✓
2024-10-28 13:30:18,537 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-28 13:30:18,562 - BERTopic - Cluster - Completed ✓
2024-10-28 13:30:18,573 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-10-28 13:30:18,610 - BERTopic - Representation - Completed ✓


## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute              | Description                                                                                 |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_                | The topics that are generated for each document after training or updating the topic model. |
| probabilities_         | The probabilities that are generated for each document if HDBSCAN is used.                  |
| topic_sizes_           | The size of each topic.                                                                     |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                           |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                      |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                             |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]

[0, 0, -1, -1, -1, -1, 2, -1, 1, -1]
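
In this output, topic `-1` is the label BERTopic gives to documents that HDBSCAN left unclustered (outliers). A quick way to see how many documents fall into each topic is to tally the assignments. The sketch below uses only the ten values printed above, so it does not need the trained model; the variable names are illustrative.

```python
from collections import Counter

# Assignments copied from the output above; -1 marks outliers.
assignments = [0, 0, -1, -1, -1, -1, 2, -1, 1, -1]

counts = Counter(assignments)
outlier_share = counts[-1] / len(assignments)

print(counts.most_common())
print(f"outliers: {outlier_share:.0%}")
```

With a trained model, the same tally can be applied to the full `topic_model.topics_` list.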

### **Visualize Clusters and Topics**
After training our `BERTopic` model, we could go through the topics one by one, perhaps a hundred of them, to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. `visualize_topics()` instead plots all topics together in a single 2D map:

In [None]:
topic_model.visualize_topics()

**Visualize Topic Similarity**

With topic embeddings available (BERTopic derives them through both c-TF-IDF and the embedding model), we can build a similarity matrix by computing the cosine similarity between every pair of topic embeddings. The resulting matrix indicates how similar the topics are to each other. To visualize it as a heatmap, run the following:

In [9]:
topic_model.visualize_heatmap()
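
The similarity matrix behind such a heatmap can be sketched in a few lines of NumPy: normalize each topic vector to unit length, then take all pairwise dot products. The toy vectors and the helper name `cosine_similarity_matrix` are assumptions for illustration; with a trained model one would start from `topic_model.topic_embeddings_` instead.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so dot products equal cosines.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Three toy "topic embeddings" in 2D.
toy_topic_embeddings = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])

similarity = cosine_similarity_matrix(toy_topic_embeddings)
print(np.round(similarity, 3))
```

The diagonal is always 1 (each topic compared with itself), and the matrix is symmetric, so only one triangle carries information.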

**Visualize Documents in 2D**

To plot individual documents, recompute the document embeddings and reduce them to two dimensions with a separate UMAP model:

*Plotting can use a lot of memory; reduce the sample size if your memory is limited.*

In [11]:
# Embed the documents once so the embeddings can be reused for plotting
embeddings = sentence_model.encode(docs, show_progress_bar=True)
# Reduce to 2 components for plotting only (the topic model itself used 10)
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
fig = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)

In [21]:
# Enlarge the markers and give the figure a title and fixed size
fig.update_traces(marker=dict(size=10))
fig.update_layout(title_text="Rotten Tomatoes Documents and Topics",
                  title_font=dict(size=24, weight="bold"),
                  width=1000, height=800)
fig.show()

**Visualize Keywords of Topics**

We can create bar charts from the c-TF-IDF scores of each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics, and the charts make it easy to compare topic representations with each other. To visualize these bar charts, run the following:

In [22]:
topic_model.visualize_barchart()