# BERTopic + KeyphraseVectorizers for topic modelling

Topic modeling is a technique in natural language processing (NLP) that is used to discover latent topics in a collection of texts. The main objective of topic modeling is to extract meaningful information from large volumes of unstructured text data, which can be useful for various applications such as information retrieval, recommendation systems, and content analysis. It
involves determining the categories, or topics, within a set of documents and which topics each document is likely to belong to. This is done through unsupervised learning, meaning that no pre-existing labels or topics are needed, only the text from the documents.

## BERTopic

Most available data is not meant to be processed by machines but is designed for human consumption. Human processing of large amounts of data is very expensive and slow.

Computers are becoming better at understanding unstructured text data by using transformers, machine learning models that infer meaning, sentiment, and entities from text, among other things.

BERTopic is a leading Python package that utilizes state-of-the-art sentence transformer models and a custom class-based TF-IDF (Term Frequency - Inverse Document Frequency) to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">


Transformer models, a neural network architecture, offer a solution to the information bottleneck problem faced by traditional encoder-decoder models, leading to superior performance in natural language processing. Traditional encoder-decoder models consist of two parts: an encoder and a decoder. The encoder takes in an input sequence and encodes it into a fixed-size representation, also known as a context vector, that captures the relevant information from the input sequence. The decoder then generates an output sequence based on the context vector and the previously generated output tokens. One of the main challenges of traditional encoder-decoder models is the information bottleneck problem. Since the input sequence is compressed into a fixed-size context vector, the model may lose some important information from the original sequence. This can lead to poor performance, especially when dealing with long input sequences.

The transformer model replaces the single context vector with attention mechanisms. Its key components are positional encoding, self-attention, and multi-head attention. Positional encoding provides the model with information about the position of each token in the input sequence. This is important because self-attention on its own is insensitive to token order: without positional information, the model would have no notion of word order. Self-attention allows the model to attend to different parts of the input sequence based on their relevance. Each token's output is a weighted sum over the representations of all tokens in the sequence, where the weights are computed from the tokens themselves using projections learned during training. Multi-head attention extends self-attention by running several attention operations in parallel, so the model can attend to multiple parts of the input sequence simultaneously. By combining these mechanisms, transformer models capture long-range dependencies in the input sequence and achieve state-of-the-art performance on a wide range of natural language processing tasks.
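
To make positional encoding and self-attention concrete, here is a minimal NumPy sketch of a single attention head. It is an illustration under simplifying assumptions (random toy embeddings, random projection matrices instead of trained ones, an even `d_model`), and the helper names `sinusoidal_positions` and `self_attention` are our own, not part of any library.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # indices of the even dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over all tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
tokens = rng.normal(size=(seq_len, d_model))             # toy token embeddings
x = tokens + sinusoidal_positions(seq_len, d_model)      # inject order information
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # (4, 8): one updated vector per token
print(weights.sum(axis=1))   # each row of attention weights sums to 1
```

Multi-head attention would repeat `self_attention` with several independent sets of projection matrices and concatenate the results.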

Transformer models generalize better and enable the development of pre-trained models that can be easily adapted to different use cases. However, these models did not provide an accurate method for building sentence-level embeddings until the development of sentence transformers. Unlike traditional transformer models that operate on individual words or tokens, sentence transformer models are designed to encode entire sentences or documents into fixed-length vector representations that can be used for downstream tasks. This is achieved by using a pre-trained transformer encoder that learns to capture contextual relationships between words within a sentence and produce a sentence-level representation that captures the meaning and semantic relationships of the entire sentence.
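
One common way to turn per-token vectors into a single fixed-length sentence vector is mean pooling. The sketch below uses hand-made token vectors instead of a real encoder, and `mean_pool` is an illustrative helper of our own; whether a particular sentence transformer model uses exactly this pooling strategy is an assumption to check against its model card.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, ignoring padding tokens."""
    mask = attention_mask[:, :, None].astype(float)      # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, up to 3 tokens each, 4-dimensional token vectors.
token_embeddings = np.array([
    [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    [[0.0, 1.0, 0.0, 1.0], [0.0, 3.0, 0.0, 1.0], [0.0, 2.0, 0.0, 1.0]],
])
attention_mask = np.array([[1, 1, 0],    # third token of sentence 1 is padding
                           [1, 1, 1]])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 4): one fixed-length vector per sentence
```

Regardless of how many tokens a sentence has, the result always has the same length, which is what makes the vectors comparable and clusterable.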

These sentence transformer models capture the meaning of text well enough to help organize unstructured text data into topics. This process is called topic modeling.




## KeyphraseVectorizers

To understand the main idea of a text quickly, we can use keyphrases. Keyphrases are brief and reflect the meaning of the text. Unlike single words, keyphrases consist of multiple words that describe the most critical aspect of the text, such as "youth football training" instead of just "football". Keyphrases are better than single keywords because they give a more precise description of the text. The word order in a sentence is crucial for both its grammar and meaning. A collocation is a unique phrase that carries a meaning beyond the literal interpretation of its individual words. For example, "white house" holds a special connotation, whereas "yellow house" does not.

N-gram phrases (where `n` stands for the number of words) play a fundamental role in natural language processing and text mining tasks such as parsing, machine translation, and information retrieval. Generally, phrases convey more information than their individual words, making them critical in determining the topics of collections. Fortunately, there are open source tools available that can automatically extract keyphrases from text, and these tools don't need labeled data.
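
As a quick illustration of what n-grams are, the following sketch lists all contiguous n-word phrases of a sentence. It uses naive whitespace tokenization, and the helper `ngrams` is written here for illustration only.

```python
def ngrams(text, n):
    """Return all contiguous n-word phrases of a whitespace-tokenized text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "youth football training builds confidence"
print(ngrams(sentence, 2))
# ['youth football', 'football training', 'training builds', 'builds confidence']
print(ngrams(sentence, 3))
# ['youth football training', 'football training builds', 'training builds confidence']
```

Note that many of these n-grams, such as "training builds", are not meaningful phrases, which is the weakness that keyphrase extraction addresses.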

In the context of topic modelling, keyphrase vectorizers generate grammatically correct keyphrases to describe topics, instead of just using simple n-grams. This way, important topic description keyphrases will not be missed by setting the n-gram range too short. Additionally, there is no need to remove stopwords in advance, and the resulting topic models are more precise, avoiding keyphrases that are slightly off-topic.
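
The idea behind part-of-speech-based keyphrase extraction can be sketched without any NLP library. The example below assumes the tokens are already tagged (the tags are written by hand here) and keeps every run of zero or more adjectives followed by one or more nouns. We assume this mirrors the kind of pattern that KeyphraseVectorizers applies by default; the function `extract_keyphrases` is a hypothetical helper, not the library's implementation.

```python
def extract_keyphrases(tagged_tokens):
    """Collect phrases made of optional adjectives (J*) followed by nouns (N*)."""
    phrases = []
    current = []          # words of the candidate phrase being built
    seen_noun = False     # True once the candidate contains at least one noun
    # A sentinel token flushes the last candidate at the end of the input.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("N"):
            current.append(word)
            seen_noun = True
        elif tag.startswith("J") and not seen_noun:
            current.append(word)
        else:
            if seen_noun:
                phrases.append(" ".join(current))
            # An adjective that follows a noun starts a new candidate.
            current = [word] if tag.startswith("J") else []
            seen_noun = False
    return phrases

# Hand-tagged example sentence (Penn Treebank style tags, assigned manually).
tagged = [("The", "DT"), ("youth", "NN"), ("football", "NN"), ("training", "NN"),
          ("improves", "VBZ"), ("physical", "JJ"), ("fitness", "NN")]
print(extract_keyphrases(tagged))  # ['youth football training', 'physical fitness']
```

Because the phrase length follows from the grammar of the sentence, no n-gram range has to be chosen, and stopwords such as "The" never enter a keyphrase.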


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic and KeyphraseVectorizers**

We start by installing BERTopic and KeyphraseVectorizers from PyPI:

In [None]:
!pip install bertopic
!pip install keyphrase-vectorizers
!pip install nbformat

# Data: UN Tweets

For this example, we use the UN tweets dataset which contains roughly 8500 tweets.

In [None]:
import locale
import pandas as pd
from bertopic import BERTopic
from keyphrase_vectorizers import KeyphraseCountVectorizer
locale.getpreferredencoding = lambda: "UTF-8"


In [None]:

# !wget https://raw.githubusercontent.com/world-politics-datalab/un_hum_rights_office_tweets/main/un_office_humrights_tweets_sept4_2017_sept3_2022.csv
!curl -O https://raw.githubusercontent.com/world-politics-datalab/un_hum_rights_office_tweets/main/un_office_humrights_tweets_sept4_2017_sept3_2022.csv 



In [None]:


# the original file has a problem around row 4037 so we need to import it in two steps to fix it

data = pd.read_csv("un_office_humrights_tweets_sept4_2017_sept3_2022.csv", header=0, nrows=4037, encoding='utf-8',  quotechar='"')
data = data.iloc[:,:88]

!tail -n 17130 un_office_humrights_tweets_sept4_2017_sept3_2022.csv > temp.csv

data2 = pd.read_csv("temp.csv", encoding='utf-8',  quotechar='"', header=None)
data2.drop(data2.columns[[14, 15]], axis=1, inplace=True)

data2 = pd.DataFrame(data=data2.values, columns=data.columns)

# data_all = data.append(data2,ignore_index=True) # version for use with older versions of pandas
data_all = pd.concat((data, data2), ignore_index=True)
data_all

In [5]:
data_all.to_csv("un_tweets_corrected.csv", encoding='utf-8')

Finally, we select only the English tweets.

In [None]:
en_data = data_all.loc[data_all['lang'] == "en"]
docs = list(en_data["text"])
len(docs)

# Topic Modeling

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.


## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language.

We will use the  `all-mpnet-base-v2` sentence transformer model.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly with large amounts of data (more than 100,000 documents). It is advised to turn this off if you want to speed up the model.


In [None]:

# NOTE: earlier versions of KeyphraseCountVectorizer do not use the `decay` and `delete_min_df` params: delete these if you get an error
try:
    topic_model = BERTopic(embedding_model="all-mpnet-base-v2",
                        language="english",
                        calculate_probabilities=True,
                        verbose=True,
                        vectorizer_model=KeyphraseCountVectorizer(max_df = int(0.5*len(docs)), min_df = 5, decay = 0.1, delete_min_df=5.0)
                        )
except TypeError:
    topic_model = BERTopic(embedding_model="all-mpnet-base-v2",
                        language="english",
                        calculate_probabilities=True,
                        verbose=True,
                        vectorizer_model=KeyphraseCountVectorizer(max_df = int(0.5*len(docs)), min_df = 5)
                        )
topics, probs = topic_model.fit_transform(docs)

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.get_topic_info().iloc[:50]

Topic -1 contains all outliers and should typically be ignored. Next, let's take a look at a frequent topic that was generated:

In [None]:
topic_model.get_topic(topic=5)

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.

To see a small set of documents representative of a given topic, use:

In [None]:
topic_model.get_representative_docs(5)

In [None]:
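# Pair each document with its assigned topic
df = pd.DataFrame({'topic': topics, 'document': docs})

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (en_data was created as a filtered slice of data_all)
en_data = en_data.assign(topic=topics)
en_data.loc[en_data["topic"] == 5]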

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]
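
Since `topics_` is a plain list with one topic id per document, it can be summarized with the standard library. The sketch below uses a short invented list in place of the real assignments and computes the share of outliers (topic -1) and the size of each remaining topic.

```python
from collections import Counter

# Hypothetical topic assignments, in the same format as `topic_model.topics_`.
example_topics = [0, -1, 2, 0, 1, -1, 0, 2, -1, 0]

counts = Counter(example_topics)
outlier_share = counts[-1] / len(example_topics)
# Topic sizes from largest to smallest, leaving out the outlier topic -1.
topic_sizes = {topic: n for topic, n in counts.most_common() if topic != -1}

print(f"outlier share: {outlier_share:.0%}")  # outlier share: 30%
print(topic_sizes)                            # {0: 4, 2: 2, 1: 1}
```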

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created.

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The probabilities returned by `transform()` or `fit_transform()` (stored in `probs` above) can
be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the topic probability distributions for document `d`, we simply call:

In [None]:
d = 301
print("Document text:", docs[d])
topic_model.visualize_distribution(probs[d], min_probability=0.005)

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This might help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy()

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(range(1,40))
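
To see where the c-TF-IDF scores in these bar charts come from, here is a small NumPy sketch. It follows the class-based formulation as we understand it from the BERTopic documentation: the frequency of a term within a topic is multiplied by log(1 + A / f), where A is the average number of words per topic and f is the frequency of the term across all topics. The term counts are invented, and the real implementation may differ in details such as normalization.

```python
import numpy as np

# Rows are topics (all documents of a topic joined into one "class"),
# columns are terms; the counts are invented for illustration.
terms = ["human rights", "climate change", "children", "today"]
counts = np.array([
    [12, 0, 1, 5],   # topic 0
    [1, 9, 0, 4],    # topic 1
    [2, 0, 10, 6],   # topic 2
], dtype=float)

tf = counts / counts.sum(axis=1, keepdims=True)      # term frequency within each topic
avg_words_per_topic = counts.sum() / counts.shape[0]
term_totals = counts.sum(axis=0)                     # frequency of each term over all topics
idf = np.log(1 + avg_words_per_topic / term_totals)
c_tf_idf = tf * idf

for topic, row in enumerate(c_tf_idf):
    print(topic, terms[int(row.argmax())])
```

The term "today" is frequent in every topic, so its weight stays low everywhere, while each topic's top term is the one that is frequent in that topic and rare in the others.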

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between those topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [None]:
topic_model.visualize_heatmap()
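
The similarity matrix behind such a heatmap can be sketched in a few lines of NumPy. The topic vectors below are invented three-dimensional stand-ins for real topic embeddings, and `cosine_similarity_matrix` is a helper written for this illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)     # guard against zero vectors
    return unit @ unit.T

# Three invented 3-dimensional topic vectors; real ones are much longer.
topic_vectors = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as topic 0
    [0.0, 3.0, 0.0],   # orthogonal to topics 0 and 1
])
similarity = cosine_similarity_matrix(topic_vectors)
print(similarity.round(2))
```

Topics 0 and 1 point in the same direction and get a similarity of 1, while topic 2 is orthogonal to both and gets 0. The length of a vector does not affect the result.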

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input `search_term`. Here, we are going to be searching for topics that closely relate to the
search term "children". Then, we extract the most similar topics and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("children", top_n=5)
similar_topics, similarity

In [None]:
for topic in similar_topics:
    print(topic_model.get_topic(topic))

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created.

This allows for fine-tuning the model to your specifications and wishes.

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:


In [None]:
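# Further reduce topics
# NOTE: this assignment creates an alias, not a copy, and reduce_topics() updates
# the model in place, so topic_model is reduced as well
r_topic_model = topic_model
r_topic_model.reduce_topics(docs, nr_topics=50)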

In [None]:
r_topic_model.get_topic_info()