# Topic Modeling with BERTopic

This notebook is based on the BERTopic tutorials, which you can find [here](https://github.com/MaartenGr/BERTopic). They go into more detail than this notebook does, but I tried to cover the most important steps here.


## BERTopic
In a nutshell, BERTopic is a topic modeling technique that leverages [transformers](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) and a custom class-based TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. If you want to read the whole story, [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) is the link with more details.

<br>

<div style="text-align:center"><img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%"></div>

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

# Installing BERTopic

We start by installing BERTopic from PyPI (uncomment the `pip` line below if it is not installed yet) and importing everything we need:

In [None]:
## !pip install bertopic
import json
import matplotlib.pyplot as plt
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import re
import pandas as pd

# Data

Let's load our data from the JSON Lines file `comments_opinions_veganisms.jl`. In theory, the method we are using does not require any preprocessing. However, I remove email addresses and URLs from the texts, because otherwise they would show up in the topics. The loop below reads one JSON record per line, strips emails and URLs from its `body` with two regular expressions, stores the cleaned text under the key `text`, and appends the record to a list called `texts`.

In [None]:
texts = []
with open("comments_opinions_veganisms.jl", "r") as file:
    for submission in file:
        temp = json.loads(submission)
        text = temp["body"]
        ## Remove email addresses
        text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "", text)
        ## Remove URLs
        text = re.sub(r"http\S+", "", text)
        temp["text"] = text
        texts.append(temp)
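
To check what the two regular expressions actually strip, they can be tried on a made-up sentence first. This is a minimal sketch: the helper name `clean_text` and the sample string are assumptions for illustration, and the final whitespace clean-up is an extra step that the loop above does not perform.

```python
import re

# The same two patterns as in the loading loop.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"http\S+")


def clean_text(text):
    """Strip e-mail addresses and URLs, then collapse leftover whitespace."""
    text = EMAIL_RE.sub("", text)
    text = URL_RE.sub("", text)
    # Extra step (not in the original loop): tidy up the gaps left behind.
    return " ".join(text.split())


sample = "Write to jane.doe@example.com or see https://example.org/faq for details."
print(clean_text(sample))  # Write to or see for details.
```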

You can experiment with dropping some texts. I removed the ones with 10 or fewer words, but you can try other thresholds. You can also check whether very long texts should be removed as well (for various reasons).

In [None]:
## Keep only texts that are longer than 10 words
docs = [line for line in texts if len(line["text"].split()) > 10]

## Create a data frame
df = pd.DataFrame(docs)

## Extract the texts from the data frame as a plain list of strings
texts = df["text"].tolist()
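
If very long texts turn out to be a problem as well, the same idea extends to an upper bound. The sketch below is only an illustration: `filter_by_length`, its parameters, and the toy records are hypothetical, and the default `min_words=11` mirrors the "more than 10 words" rule used above.

```python
def filter_by_length(records, min_words=11, max_words=None):
    """Keep records whose "text" field has between min_words and max_words words."""
    kept = []
    for record in records:
        n_words = len(record["text"].split())
        if n_words < min_words:
            continue
        if max_words is not None and n_words > max_words:
            continue
        kept.append(record)
    return kept


# Toy records with 2, 50 and 5000 words.
toy = [
    {"text": "too short"},
    {"text": " ".join(["word"] * 50)},
    {"text": " ".join(["word"] * 5000)},
]
print(len(filter_by_length(toy)))                  # lower bound only: 2
print(len(filter_by_length(toy, max_words=1000)))  # lower and upper bound: 1
```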

In [None]:
len(texts)

The histogram below shows the distribution of text lengths. The texts are generally short: most have fewer than 200 words.

In [None]:
## Create a list with lengths of texts
length = [len(item.split()) for item in texts]

## Draw a histogram
plt.hist(length, density=False, bins=30)
plt.ylabel("Count")
plt.xlabel("Length of the submission")
plt.show()
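
A histogram is easy to misread, so a few summary numbers can back it up. The helper below is a hypothetical sketch (`length_summary` and the toy lengths are made up); in the notebook you would pass it the `length` list instead.

```python
import statistics


def length_summary(lengths, threshold=200):
    """Summarize word counts: size, median, maximum, and share below a threshold."""
    return {
        "n": len(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "share_below_threshold": sum(n < threshold for n in lengths) / len(lengths),
    }


toy_lengths = [12, 25, 40, 80, 150, 190, 220, 600]
print(length_summary(toy_lengths))
```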

# Training

This is where all the magic happens. The cell below takes the longest to run, so ideally you execute it only once. It converts each text into an embedding: a dense vector of numbers (768 of them for this model) in which texts with similar meaning end up close to each other. The vectors come from a pre-trained sentence-transformer model, `all-mpnet-base-v2`, which was trained on around a billion sentence pairs. We can swap in a different pre-trained model, but unless you have a very good reason to do so I would stick to this one. [Here](https://www.sbert.net/docs/pretrained_models.html) are some other pre-trained models.

In [None]:
sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(texts, show_progress_bar=True)
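
What "similar texts get similar embeddings" means in practice is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for real embeddings, so the numbers are purely illustrative; the variable names are assumptions as well.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three comments.
tofu_recipe = [0.9, 0.1, 0.2]
seitan_recipe = [0.8, 0.2, 0.3]
animal_rights = [0.1, 0.9, 0.3]

# The two recipe comments point in nearly the same direction,
# the animal-rights comment does not.
print(cosine_similarity(tofu_recipe, seitan_recipe))
print(cosine_similarity(tofu_recipe, animal_rights))
```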

There are a few parameters you can play with here. The first is `diversity`, which ranges from 0 to 1 and controls how different the words that describe a single topic are from one another. With 0, the words are picked purely by how well they fit the topic, so near-duplicates such as "vegan" and "vegans" can appear side by side. With values closer to 1, words that resemble already selected ones are penalized, so the description of each topic becomes more varied. It only changes how topics are described, not how many topics are found.

The other parameter is stop-word removal. In theory this approach works on whole sentences rather than on single words, so stop words do not hurt the embeddings. However, because the texts are quite short, stop words would otherwise dominate the most representative words of each topic. I added stop-word removal, but you can turn it off to see how the results change.

In [None]:
## Diversify the words that describe each topic (0 = no diversification)
representation_model = MaximalMarginalRelevance(diversity=0.0)
## Remove English stop words from the topic descriptions
vectorizer_model = CountVectorizer(stop_words="english")
## Uncomment the line below to keep stop words
## vectorizer_model = None
## Reduce the impact of words that appear very frequently
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(
    ## Reuse the sentence model; MaximalMarginalRelevance needs it to embed words
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(texts, embeddings)
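
To get a feeling for what `diversity` does, here is a simplified pure-Python sketch of greedy Maximal Marginal Relevance. It is not BERTopic's own implementation: the function names, the 2-dimensional word vectors, and the topic vector are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(topic_vec, word_vecs, diversity, top_n):
    """Greedily pick words that fit the topic but differ from words already picked."""
    relevance = {word: cosine(vec, topic_vec) for word, vec in word_vecs.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # Similarity to the closest word that was already selected.
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Made-up word vectors: three near-duplicates and one different word.
toy_words = {
    "vegan": (1.0, 0.1),
    "vegans": (1.0, 0.15),
    "veganism": (1.0, 0.05),
    "protein": (0.6, 0.8),
}
toy_topic = (1.0, 0.2)

print(mmr(toy_topic, toy_words, diversity=0.0, top_n=2))  # ['vegans', 'vegan']
print(mmr(toy_topic, toy_words, diversity=0.7, top_n=2))  # ['vegans', 'protein']
```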

## Extracting Topics

After fitting our model, we can start by looking at the results. Topic `-1` collects the outliers, that is, documents that were not assigned to any topic, and can usually be ignored. It is often the biggest one.

In [None]:
topic_model.get_topic_info()

The topic names are built from the highest-scoring words of each topic. You can easily change them (I will show this later). Before you do that, however, you should make sure you understand what the topics are about. There is no way around reading at least some of the texts and looking at the graphs below. The top words can also help.

In [None]:
## Print the top words of a topic together with their scores
topic_n = 0
topic_model.get_topic(topic_n)

## Visualize Topics

After training our `BERTopic` model, we could go through the topics one by one, perhaps a hundred of them, to understand what was extracted. However, that takes quite some time and gives no global overview. The graph below shows how similar the topics are to one another.

In [None]:
## You can move the slider below that highlights the topics.
topic_model.visualize_topics()

## Visualize Topic Probabilities

We can quite easily see how probable each topic is for any given document. By setting `document_n`, we choose the document whose topic probabilities are shown.

In [None]:
document_n = 0
topic_model.visualize_distribution(probs[document_n], min_probability=0.01)
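
The same information can be read off without a plot. The sketch below assumes that one row of `probs` is a sequence of probabilities in which position `i` belongs to topic `i` (with the outlier topic `-1` not included); the helper `top_topics` and the toy numbers are hypothetical.

```python
def top_topics(doc_probs, min_probability=0.01, top_n=3):
    """Return (topic_id, probability) pairs above the threshold, highest first."""
    ranked = sorted(enumerate(doc_probs), key=lambda pair: pair[1], reverse=True)
    return [(topic, p) for topic, p in ranked if p >= min_probability][:top_n]


# Toy probabilities of one document for four topics.
toy_probs = [0.02, 0.61, 0.005, 0.30]
print(top_topics(toy_probs))  # [(1, 0.61), (3, 0.3), (0, 0.02)]
```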

Moreover, we can see which words in the document contribute to each topic.

In [None]:
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(
    texts, calculate_tokens=True
)

# Visualize the token-level distributions
topic_model.visualize_approximate_distribution(
    texts[document_n], topic_token_distr[document_n]
)

## Visualize Topic Hierarchy

The graph below shows which topics are similar enough to be merged as we move up the hierarchy. With a larger collection of documents there are usually at least tens of topics, and merging some of them can make sense. In our case some topics are similar, but I would not reduce their number.

In [None]:
topic_model.visualize_hierarchy()

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart()
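
To see where such scores come from, here is a simplified pure-Python sketch of class-based TF-IDF: all texts of a topic are treated as one long document, and a word's share within the topic is weighted by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the word's frequency across all topics. The function name and the toy data are assumptions, and options such as `reduce_frequent_words` are left out.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """class_docs maps a topic name to all of its texts joined into one string."""
    counts = {topic: Counter(text.split()) for topic, text in class_docs.items()}
    # Frequency of each word across all topics.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # Average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


toy_classes = {
    "diet": "protein protein tofu food",
    "ethics": "animals animals cruelty food",
}
scores = c_tf_idf(toy_classes)
for topic, words in scores.items():
    print(topic, sorted(words, key=words.get, reverse=True))
```

In this toy example "food" occurs in both topics, so it ends up at the bottom of both rankings, while words specific to one topic rise to the top.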

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=1, width=1000, height=1000)

# Topic Representation

After creating the topic model and looking at the graphs, it is useful to read some texts that are representative of each topic. Based on them we can try to name the topics (in our case, however, I would still recommend reading all the texts).

In [None]:
## A dictionary whose keys are topic numbers
## and whose values are lists of the 3 most representative texts
topic_model.representative_docs_

In [None]:
topic_n = 0
topic_model.representative_docs_[topic_n]

Based on the reading we can add our custom labels to the topics. It is quite straightforward. 

In [None]:
topic_model.set_topic_labels({-1: "Trash"})

In [None]:
topic_model.get_topic_info()
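
When there are many topics, a first draft of the labels can be generated from the top words and then edited by hand. The sketch below is hypothetical: `make_labels` and the toy data are made up, and the input is assumed to have the shape of a dictionary mapping each topic number to a list of `(word, score)` pairs.

```python
def make_labels(topic_words, n_words=3):
    """Build a {topic_id: label} dictionary from the top words of each topic."""
    labels = {}
    for topic_id, words in topic_words.items():
        if topic_id == -1:
            # The outlier topic gets a fixed name.
            labels[topic_id] = "Outliers"
        else:
            labels[topic_id] = ", ".join(word for word, _ in words[:n_words])
    return labels


toy_topic_words = {
    -1: [("like", 0.01), ("just", 0.01)],
    0: [("protein", 0.05), ("tofu", 0.04), ("soy", 0.03), ("beans", 0.02)],
    1: [("animals", 0.06), ("cruelty", 0.04), ("farming", 0.03)],
}
print(make_labels(toy_topic_words))
```

A dictionary of this form is what `set_topic_labels` receives in the cell above.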

We can also see which topic was assigned to each document. To write the result to an Excel file, use `topic_model.get_document_info(texts).to_excel('topics.xlsx')`.

In [None]:
topic_model.get_document_info(texts)

Or we can add the topics to our original data frame.

In [None]:
## Add the column with topics
df["topic"] = topic_model.topics_

In [None]:
## Just out of curiosity, we can see the upvote ratio, score, and sentiment for each topic
df.groupby("topic").mean(numeric_only=True)
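
To look at a single measure and leave the outliers out, the aggregation can be narrowed down. This is a sketch on a toy data frame: the column name `sentiment` and all values are assumptions, so adjust them to the columns your data actually has.

```python
import pandas as pd

# Toy data frame standing in for df; "sentiment" is a hypothetical column.
toy = pd.DataFrame(
    {
        "topic": [0, 0, 1, 1, -1],
        "sentiment": [0.5, 0.1, -0.2, -0.4, 0.0],
        "text": ["a", "b", "c", "d", "e"],
    }
)

# Drop the outlier topic, then compute mean sentiment and comment count per topic.
summary = (
    toy[toy["topic"] != -1]
    .groupby("topic")["sentiment"]
    .agg(["mean", "count"])
)
print(summary)
```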

Later we can compute the average sentiment of the comments in each topic. For now, though, here is the Excel file with a topic assigned to each document.

In [None]:
df.to_excel("submissions.xlsx")