<a href="https://colab.research.google.com/github/TyrealQ/Twitter-Perceptions-Esports-2023-Asian-Games_HICSS-58/blob/main/BERTopic_HICSS58_Q.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Credits**

This project was inspired by and based on the code available at https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing

```bibtex
@article{grootendorst2022bertopic,
  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
  author={Grootendorst, Maarten},
  journal={arXiv preprint arXiv:2203.05794},
  year={2022}
}
```

# **Dependencies**

In [None]:
!pip install bertopic sentence_transformers adjustText openai tiktoken

# DataMapPlot
!git clone https://github.com/TutteInstitute/datamapplot.git
!pip install datamapplot/.

## Dependencies for GPU-accelerated HDBSCAN + UMAP

In [None]:
!pip install cudf-cu12 dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cupy-cuda12x -f https://pip.cupy.dev/aarch64

# **Data**

In [None]:
import pandas as pd
df = pd.read_excel('YOUR FILE PATH')

## Tweet preprocessing 1: Lowercase, remove URLs/symbols/numbers/stopwords, tokenize

In [None]:
import nltk
import re
from nltk.corpus import stopwords
import pandas as pd

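nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases look for 'punkt_tab' when tokenizing; downloading both is harmless
nltk.download('punkt_tab')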

def pre_process(sentence):
    # Convert to lowercase and strip leading/trailing whitespace
    sentence = str(sentence).lower().strip()

    # Remove URLs (this pattern covers http://, https://, and www. links)
    sentence = re.sub(r'https?://\S+|www\.\S+', '', sentence)
    # Remove any remaining token that contains a dot (bare domains, decimals, etc.)
    sentence = re.sub(r'\S+\.\S+', '', sentence)

    # Remove special symbols
    sentence = re.sub(r'[^\w\s]', '', sentence)

    # Remove numbers
    sentence = re.sub(r'\d+', '', sentence)

    # Load the NLTK English stopword list
    my_stopwords = set(stopwords.words('english'))

    # Tokenize the sentence
    words = nltk.word_tokenize(sentence)

    # Remove stopwords
    valid_words = [word for word in words if word not in my_stopwords and len(word) > 1]

    # Join the words back into a single string
    return ' '.join(valid_words)

# Apply preprocessing to the text column
df['text1'] = df['text'].apply(pre_process)

print('Jobs Done')

## Tweet preprocessing 2: Remove URLs/symbols/numbers, condense spaces

In [None]:
import nltk
import re
from nltk.corpus import stopwords
import pandas as pd

def pre_process(sentence):
    # Cast to string so missing values (NaN) do not break the regex calls
    sentence = str(sentence)

    # Remove URLs, then any remaining token that contains a dot
    sentence = re.sub(r'https?://\S+|www\.\S+', '', sentence)
    sentence = re.sub(r'\S+\.\S+', '', sentence)

    # Remove special symbols
    sentence = re.sub(r'[^\w\s]', '', sentence)

    # Remove numbers
    sentence = re.sub(r'\d+', '', sentence)

    # Condense all multiple spaces to a single space
    sentence = re.sub(r'\s+', ' ', sentence).strip()

    # Return the cleaned sentence
    return sentence

# Apply preprocessing to the text column
df['text2'] = df['text'].apply(pre_process)

print('Jobs Done')
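
To make the effect of the regex steps concrete, here is a minimal standalone sketch that applies the same patterns to a made-up tweet. The function name `strip_noise` and the sample text are illustrative assumptions, not part of the notebook; stopword removal and tokenization are left out so that only the standard library is needed.

```python
import re

def strip_noise(sentence):
    """Hypothetical helper mirroring the regex steps of the second preprocessing pass."""
    sentence = str(sentence)
    # Remove URLs
    sentence = re.sub(r'https?://\S+|www\.\S+', '', sentence)
    # Remove any remaining token that contains a dot
    sentence = re.sub(r'\S+\.\S+', '', sentence)
    # Remove special symbols (underscores and letters of any script count as \w and are kept)
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Remove numbers
    sentence = re.sub(r'\d+', '', sentence)
    # Condense whitespace
    return re.sub(r'\s+', ' ', sentence).strip()

sample = "Gold for #Team_Korea at asian_games 2023! https://t.co/abc123 v2.0"
cleaned = strip_noise(sample)
print(cleaned)
```

Note that the dot pattern also drops non-URL tokens such as `v2.0`, and that hashtags written with underscores stay joined because `_` is a word character.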

In [None]:
print(df.at[3082, 'text'])
print(df.at[3082, 'text1'])
print(df.at[3082, 'text2'])

team korea picks another gold taekwondo judo team turns away disappointment asian_games hangzhou_asian_games team_korea taekwondo judo fencing esports 항저우_아시안_게임 팀코리아 태권도 유도 펜싱 arirang_news 아리랑뉴스


In [None]:
# Save the DataFrame with both original and cleaned text into a new Excel file
df.to_excel('YOUR FILE PATH', index=False)

# **LLM Prompt Template**

Although we can directly prompt the model, there is actually a template that we need to follow. The template looks as follows:

```python
"""
<s>[INST] <<SYS>>

{{ System Prompt }}

<</SYS>>

{{ User Prompt }} [/INST]

{{ Model Answer }}
"""
```

This template consists of two main components, namely the `{{ System Prompt }}` and the `{{ User Prompt }}`:
* The `{{ System Prompt }}` helps us guide the model during a conversation. For example, we can say that it is a helpful assistant that is specialized in labeling topics.
* The `{{ User Prompt }}` is where we ask it a question.

You might have noticed the `[INST]` tags. These mark the beginning and end of a prompt, and we can use them to model the conversation history, as we will see in more depth later on.
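
As a minimal sketch of how the two components slot into the template, the snippet below assembles a single-turn prompt as a plain string. The helper name `build_llama2_prompt` and the example texts are hypothetical and only illustrate the layout shown above.

```python
def build_llama2_prompt(system_prompt, user_prompt):
    """Hypothetical helper: wrap a system and a user prompt in the template's tags."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_prompt} [/INST]"
    )

single_turn = build_llama2_prompt(
    "You are a helpful assistant for labeling topics.",
    "Create a short label for a topic with the keywords: meat, beef, emissions.",
)
print(single_turn)
```

The model's answer would follow the closing `[/INST]` tag, which is why the example prompt later in this section ends with `[/INST]` followed by the expected label.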

Next, let's see how we can adapt this Llama 2-style template for topic modeling.

## Prompt Template

We are going to keep our `system prompt` simple and to the point:

In [None]:
# System prompt describes information given to all conversations
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics.
<</SYS>>
"""

We will tell the model that it is simply a helpful assistant for labeling topics since that is our main goal.

In contrast, our `user prompt` is going to be a bit more involved. It will consist of two components, an **example** and the **main prompt**.

Let's start with the **example**. Most LLMs do a much better job of generating accurate responses if you give them an example to work with. We will show it an accurate example of the kind of output we are expecting.

In [None]:
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure you only return the label and nothing more.

[/INST] Environmental impacts of eating meat
"""

This example, based on a number of keywords and documents primarily about the impact of meat, helps the model understand the kind of output it should give. We show the model that we expect only the label, which is easier for us to extract.

Next, we will create a template that we can use within BERTopic:

In [None]:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, create a short label of this topic, ensuring comprehension across languages. Make sure you only return the label and nothing more.
[/INST]
"""

There are two BERTopic-specific tags that are of interest, namely `[DOCUMENTS]` and `[KEYWORDS]`:

* `[DOCUMENTS]` is replaced with the most representative documents of the topic (how many is controlled by `nr_docs`, which this notebook sets to 10)
* `[KEYWORDS]` is replaced with the top 10 most relevant keywords of the topic as generated through c-TF-IDF

This template is filled in separately for each topic. Finally, we can combine the three parts into our final prompt:

In [None]:
prompt = system_prompt + example_prompt + main_prompt
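
To illustrate what filling the template means, here is a rough sketch that substitutes the two tags with plain string replacement. The function `fill_prompt`, the shortened template, and the sample documents are assumptions made for illustration; BERTopic performs its own substitution internally and may format things differently.

```python
def fill_prompt(template, documents, keywords):
    """Hypothetical helper: substitute the [DOCUMENTS] and [KEYWORDS] tags."""
    # One bullet per representative document
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )

toy_template = """I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'."""

filled = fill_prompt(
    toy_template,
    ["korea wins esports gold", "faker takes gold in hangzhou"],
    ["esports", "gold", "korea"],
)
print(filled)
```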

# **BERTopic**

Before we can start with topic modeling, we will first need to perform two steps:
* Pre-calculating Embeddings
* Defining Sub-models

## Preparing embeddings

By pre-calculating the embeddings for each document, we can speed up additional exploration steps and reuse the embeddings to quickly iterate over BERTopic's hyperparameters if needed.

**TIP**: You can find a great overview of good embeddings for clustering on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [None]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings BAAI/bge-small-en-v1.5, BAAI/bge-small-en OR sentence-transformers/all-MiniLM-L6-v2 OR sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = embedding_model.encode(df['text1'].tolist(), show_progress_bar=True)

## Sub-models

Next, we will define all sub-models in BERTopic and do some small tweaks to the number of clusters to be created, setting random states, etc.

In [None]:
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', gen_min_span_tree=True, prediction_data=True)

#from umap import UMAP
#from hdbscan import HDBSCAN

#umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
#hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

As a small bonus, we are going to reduce the embeddings we created before to two dimensions so that we can use them for visualization once we have created our topics.

In [None]:
# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

### Representation models

One of the ways we are going to represent the topics is with LLMs which should give us a nice label. However, we might want to have additional representations to view a topic from multiple angles.

Here, we will be using c-TF-IDF as our main representation and [KeyBERT](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired), [MMR](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#maximalmarginalrelevance), and [GPT-4](https://openai.com/gpt-4) as our additional representations.
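
For intuition about the c-TF-IDF representation, the sketch below scores words per topic with the weighting described in the BERTopic paper cited in the credits: the term frequency within a topic multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the word's frequency across all topics. This is a simplified pure-Python illustration on made-up tokens; the function name `c_tf_idf` is hypothetical and the real implementation works on a `CountVectorizer` matrix.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Toy class-based TF-IDF: class_docs maps a topic id to its tokenized documents."""
    # Word counts per topic, treating all documents of a topic as one big document
    counts = {t: Counter(tok for doc in docs for tok in doc) for t, docs in class_docs.items()}
    # Word counts across all topics
    total = Counter()
    for c in counts.values():
        total.update(c)
    avg_words = sum(total.values()) / len(counts)  # A: average number of words per topic
    scores = {}
    for t, c in counts.items():
        n_words = sum(c.values())
        scores[t] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in c.items()
        }
    return scores

toy_topics = {
    0: [["esports", "gold", "korea"], ["esports", "faker", "gold"]],
    1: [["esports", "schedule", "tickets"], ["tickets", "venue", "esports"]],
}
scores = c_tf_idf(toy_topics)
top_words = {t: max(s, key=s.get) for t, s in scores.items()}
print(top_words)
```

In this toy corpus "esports" is frequent in both topics, so it is down-weighted, and each topic's top word becomes the one that is frequent there but rare elsewhere.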

In [None]:
# GPT4 text generator
import openai
import tiktoken
from openai import Client
from bertopic.representation import OpenAI
from google.colab import userdata

api_key = userdata.get('YOUR OPENAI KEY')
client = Client(api_key=api_key)

# Tokenizer
tokenizer = tiktoken.encoding_for_model("gpt-4o")

# Create your representation model
GPT4 = OpenAI(
    client,
    prompt=prompt,
    model="gpt-4o",
    delay_in_seconds=2,
    chat=True,
    nr_docs=10,
    diversity=0.1,
    doc_length=100,
    tokenizer=tokenizer
)

In [None]:
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# KeyBERT
keybert = KeyBERTInspired()

# MMR
mmr = MaximalMarginalRelevance(diversity=0.5)

# All representation models
representation_model = {
    "KeyBERT": keybert,
    "MMR": mmr,
    "GPT4": GPT4
}
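
To show what the `diversity` setting of MMR trades off, here is a small pure-Python sketch of maximal marginal relevance on made-up 2-D vectors. The function `mmr_select`, the vectors, and the word list are illustrative assumptions; BERTopic's `MaximalMarginalRelevance` operates on real word and topic embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(topic_vec, word_vecs, top_n, diversity):
    """Toy MMR: balance relevance to the topic against redundancy with already chosen words."""
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    # Start with the single most relevant word
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest word that was already selected
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

topic_vec = [1.0, 0.4]
word_vecs = {"esports": [1.0, 0.2], "gaming": [1.0, 0.1], "medal": [0.2, 1.0]}

no_diversity = mmr_select(topic_vec, word_vecs, top_n=2, diversity=0.0)
with_diversity = mmr_select(topic_vec, word_vecs, top_n=2, diversity=0.5)
print(no_diversity, with_diversity)
```

With `diversity=0.0` the near-synonym "gaming" is picked second, whereas `diversity=0.5` prefers the less similar "medal".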

# **Training**

Now that we have our models prepared, we can start training our topic model! We supply BERTopic with the sub-models of interest, run `.fit_transform`, and see what kind of topics we get.

In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Train BERTopic with a custom CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=10)

topic_model = BERTopic(

  # Sub-models
  embedding_model=embedding_model,
  vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  representation_model=representation_model,

  # Hyperparameters
  calculate_probabilities=True,
  verbose=True,
)

# Train model, reusing the pre-calculated embeddings instead of re-encoding the documents
topics, probs = topic_model.fit_transform(df['text1'], embeddings)

## Now that we are done training our model, let's see what topics were generated:

In [None]:
topic_keywords = topic_model.get_topic_info()
print(topic_keywords)

In [None]:
topic_model.visualize_documents(df['text1'], reduced_embeddings=reduced_embeddings,
                                hide_document_hover=True, hide_annotations=True)

## Outlier reduction

In [None]:
# Use the "c-TF-IDF" strategy with a threshold
new_topics = topic_model.reduce_outliers(df['text1'], topics, strategy="c-tf-idf", threshold=0.1)

# Reduce all outliers that are left with the "distributions" strategy
# (pass new_topics, not topics, so the result of the first step is kept)
new_topics = topic_model.reduce_outliers(df['text1'], new_topics, strategy="distributions")

In [None]:
topic_model.update_topics(df['text1'], topics=new_topics)

In [None]:
topic_model.visualize_documents(df['text1'], reduced_embeddings=reduced_embeddings,
                                hide_document_hover=True, hide_annotations=True)

In [None]:
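# Re-fetch the topic info so the export reflects the updated topic assignments
topic_keywords = topic_model.get_topic_info()
topic_keywords.to_excel("YOUR FILE PATH", index=False)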

## Show topics for documents

In [None]:
topic_model.get_document_info(df['text1'])

In [None]:
document_info_output = topic_model.get_document_info(df['text1'])

document_info_df = pd.DataFrame(document_info_output)

# Save the DataFrame to an Excel file
document_info_df.to_excel("YOUR FILE PATH", index=False)

## Top keyword scores for the top N topics

In [None]:
topic_model.visualize_barchart(top_n_topics=10)

## Topic probability distribution visualization for each document

In [None]:
topic_model.visualize_distribution(topic_model.probabilities_[3082], min_probability=0.015)

## Intertopic distance map

In [None]:
topic_model.visualize_topics()

## Heatmap

In [None]:
topic_model.visualize_heatmap()

## Hierarchical topic modeling

In [None]:
from scipy.cluster import hierarchy as sch
from bertopic import BERTopic

# Hierarchical topics
linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = topic_model.hierarchical_topics(df['text1'], linkage_function=linkage_function)

In [None]:
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

In [None]:
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)

## BERTopic has a reduce_topics method that uses the existing model information to do a topic reduction.

In [None]:
# Further reduce topics
topic_model.reduce_topics(df['text1'], nr_topics=9)

# Get the list of topics
topic_model.get_topic_info()

## If we would like to manually pick which topics to merge together based on domain knowledge, we can list the topic numbers and pass them into the merge_topics function.

In [None]:
topics_to_merge = [[0, 3],
                   [2, 6]]
topic_model.merge_topics(df['text1'], topics_to_merge)

# Get the list of topics
topic_model.get_topic_info()

## Datamapplot visualization

In [None]:
import datamapplot
import re

# Create a label for each document
llm_labels = [re.sub(r'\W+', ' ', label[0][0].split("\n")[0].replace('"', '')) for label in topic_model.get_topics(full=True)["GPT4"].values()]
llm_labels = [label if label else "Unlabelled" for label in llm_labels]
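# Use the model's current assignments so labels stay in sync after outlier reduction, reduction, and merging
all_labels = [llm_labels[topic+topic_model._outliers] if topic != -1 else "Unlabelled" for topic in topic_model.topics_]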

# Run the visualization
datamapplot.create_plot(
    reduced_embeddings,
    all_labels,
    label_font_size=11,
    title="2023 Asian Games Esports Discourse on X",
    sub_title="Topics labeled with `GPT-4`",
    label_wrap_width=20,
    use_medoids=True
)
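
The label-cleaning one-liners above are dense, so here is a standalone sketch of the same idea on made-up labels. The helper names and sample strings are hypothetical, and the offset logic assumes the label list starts with the outlier topic (-1) whenever one exists.

```python
import re

def clean_label(raw):
    """Keep the first line, drop double quotes, and turn runs of non-word characters into spaces."""
    first_line = raw.split("\n")[0].replace('"', '')
    return re.sub(r'\W+', ' ', first_line)

def labels_per_document(topic_labels, doc_topics, has_outlier_topic):
    """Map each document's topic id to its label; topic ids are shifted by one if -1 is present."""
    offset = 1 if has_outlier_topic else 0
    return [topic_labels[t + offset] if t != -1 else "Unlabelled" for t in doc_topics]

raw_labels = ['', '"Esports medals: Korea vs. China"\nextra text', 'Ticket sales']
topic_labels = [clean_label(raw) or "Unlabelled" for raw in raw_labels]
doc_labels = labels_per_document(topic_labels, [-1, 0, 1, 0], has_outlier_topic=True)
print(doc_labels)
```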

## Topics over time

In [None]:
df['tt'] = df['tt'].astype(str)
df['text1'] = df['text1'].astype(str)
timestamps = df.tt.to_list()
topics_over_time = topic_model.topics_over_time(df['text1'], timestamps, nr_bins=30)

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, topics=[0, 5, 7, 8])

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=15)