# <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2; text-align: center;">Topic Modeling</div>

An important task that follows data preprocessing and analysis, and the second step of virtually any NLP project, is topic modeling: a process that separates the existing data into multiple clusters, each representing a different topic. This is a crucial step toward understanding the data and extracting valuable insights from it. For this task, the team decided to use the BERTopic library, a topic modeling technique that uses transformer models to create dense representations of the documents and then clusters them with HDBSCAN.

#### Used Embeddings

The embeddings used for topic modeling come from the digitalepidemiologylab project on GitHub, which trained a model on COVID-19 tweets. These embeddings are based on the BERT model, and a description of them can be found in the [official repository of the digitalepidemiologylab project](https://github.com/digitalepidemiologylab/covid-twitter-bert).

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Dependencies Imports</div>

In [87]:
# !pip install umap-learn
# !pip install hdbscan
# !pip install sentence-transformers
# !pip install bertopic
# !pip install nltk
# !pip install nbformat
# !pip install datamapplot
# !pip install dask[dataframe]


In [88]:
import pandas as pd
import html
from umap import UMAP
from sklearn.decomposition import PCA
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
import datamapplot
import re

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Preprocessing</div>

Before transforming the data, some preprocessing steps are needed to clean it and to keep the results consistent with the previous notebooks. The following steps were taken:

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. Data Loading</div>

In [89]:
DATA = 'data/'
test = pd.read_csv(DATA + 'Constraint_English_Test.csv', delimiter=';', encoding='utf-8')
train = pd.read_csv(DATA + 'Constraint_English_Train.csv', delimiter=';', encoding='utf-8')
val = pd.read_csv(DATA + 'Constraint_English_Val.csv', delimiter=';', encoding='utf-8')

tweets = pd.concat([train, val, test], ignore_index=True)
tweets.drop(columns=['id'], inplace=True)
tweets['tweet'] = tweets['tweet'].apply(html.unescape)
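
The `html.unescape` call above matters because tweet text often carries HTML entities such as `&amp;`. A minimal sketch of its effect follows; the sample string is made up for illustration and does not come from the dataset.

```python
import html

# Hypothetical tweet text containing HTML entities (not taken from the dataset).
raw = "Wash hands &amp; wear a mask &lt;3"

# html.unescape converts named and numeric character references back to characters.
clean = html.unescape(raw)
print(clean)
```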


### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">2. Data Cleaning</div>

In [90]:
processed_tweets = tweets.copy()
processed_tweets = processed_tweets.drop_duplicates(subset='tweet', keep='first')

# remove links and numbers
# (the function is not called `filter`, which would shadow the Python built-in)
def remove_links_and_numbers(tweet: str) -> str:
    # https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url#3809435
    tweet = re.sub(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)', '', tweet) # remove URLs
    tweet = re.sub(r'\d+', '', tweet) # remove numbers, not useful for topic modeling
    return tweet

processed_tweets['tweet_nolinks'] = processed_tweets['tweet'].apply(remove_links_and_numbers)

# Recalculate indexes (reset the DataFrame itself; resetting a single column has no effect on the DataFrame index)
processed_tweets = processed_tweets.reset_index(drop=True)
# real_tweets = processed_tweets[processed_tweets['label'] == 'real']['tweet_nolinks'].reset_index(drop=True).tolist()
# fake_tweets = processed_tweets[processed_tweets['label'] == 'fake']['tweet_nolinks'].reset_index(drop=True).tolist()

processed_tweets

| | tweet | label | tweet_nolinks |
|---|---|---|---|
| 0 | The CDC currently reports 99031 deaths. In gen... | real | The CDC currently reports deaths. In general ... |
| 1 | States reported 1121 deaths a small rise from ... | real | States reported deaths a small rise from last... |
| 2 | Politically Correct Woman (Almost) Uses Pandem... | fake | Politically Correct Woman (Almost) Uses Pandem... |
| 3 | #IndiaFightsCorona: We have 1524 #COVID testin... | real | #IndiaFightsCorona: We have #COVID testing la... |
| 4 | Populous states can generate large case counts... | real | Populous states can generate large case counts... |
| ... | ... | ... | ... |
| 10695 | #CoronaVirusUpdates: State-wise details of Tot... | real | #CoronaVirusUpdates: State-wise details of Tot... |
| 10696 | Tonight 12(midnight) onwards Disaster Manageme... | fake | Tonight (midnight) onwards Disaster Management... |
| 10697 | 296 new cases of #COVID19Nigeria; Plateau-85 E... | real | new cases of #COVIDNigeria; Plateau- Enugu- O... |
| 10698 | RT @CDCemergency: #DYK? @CDCgov’s One-Stop Sho... | real | RT @CDCemergency: #DYK? @CDCgov’s One-Stop Sho... |
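
To see what the cleaning step does to a single string, the same two substitutions can be applied to a made-up tweet. This is only a sketch: the sample text and the helper name `strip_links_and_numbers` are illustrative and not part of the notebook.

```python
import re

# Same URL pattern as in the cleaning cell above.
URL_PATTERN = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
)

def strip_links_and_numbers(tweet: str) -> str:
    """Remove URLs first, then every run of digits."""
    tweet = URL_PATTERN.sub('', tweet)
    return re.sub(r'\d+', '', tweet)

# Hypothetical tweet, not taken from the dataset.
sample = "We have 1524 labs, see https://example.com/labs?id=7 now"
cleaned = strip_links_and_numbers(sample)
print(repr(cleaned))
```

The substitutions leave the surrounding whitespace in place, so doubled spaces can remain where a link or number was removed. If that matters downstream, `" ".join(cleaned.split())` is one optional way to collapse them.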


## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Data Transformation (Embeddings)</div>

To extract the different topics, a pre-trained model is used to turn each tweet into an embedding, a dense vector that the topic modeling algorithm can work with. How well the embeddings fit the domain of the data is paramount for the performance of the topic modeling algorithm. In this case, the embeddings used will be those of `covid-twitter-bert`, a model provided by digitalepidemiologylab.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. Load Embeddings</div>

In [91]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from bertopic.representation import TextGeneration

# Load a pre-trained causal language model and its tokenizer (used to generate topic labels)
model_name = "facebook/opt-350m"  # You can replace this with other text-generation models

# Tokenizer of the selected model
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
)

# Create Text Generator

generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)

prompt = """
You are a helpful, respectful and honest assistant for labeling topics.
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic.
Make sure to only return the label and nothing more.
"""
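# Despite the "Llama2" name, this representation uses the facebook/opt-350m generator defined above
llama2 = TextGeneration(generator, prompt=prompt)
representation_model = {
    "Llama2": llama2,
}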

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Device set to use cpu


In [92]:
from sentence_transformers import SentenceTransformer
# Load the pre-trained model
sentence_model = SentenceTransformer('digitalepidemiologylab/covid-twitter-bert-v2')

# Split the cleaned tweets by label
real_docs = processed_tweets.loc[processed_tweets["label"] == "real", "tweet_nolinks"].tolist()
fake_docs = processed_tweets.loc[processed_tweets["label"] == "fake", "tweet_nolinks"].tolist()

# Encode the tweets
embeddings_real = sentence_model.encode(real_docs, show_progress_bar=True, batch_size=16)
embeddings_fake = sentence_model.encode(fake_docs, show_progress_bar=True, batch_size=16)

No sentence-transformers model found with name digitalepidemiologylab/covid-twitter-bert-v2. Creating a new one with mean pooling.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Batches: 100%|██████████| 350/350 [12:23<00:00,  2.12s/it]
Batches: 100%|██████████| 319/319 [09:08<00:00,  1.72s/it]
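
The log line above ("Creating a new one with mean pooling") indicates that the token vectors of each tweet are averaged into a single vector. The following is a minimal pure-Python sketch of masked mean pooling on toy numbers. The function name `mean_pool` and the values are illustrative assumptions; the library performs the equivalent operation on tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1 (padding tokens are skipped)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy example: three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```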


In [93]:
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', random_state=42, low_memory=False)
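
UMAP is configured with `metric='cosine'`, so neighbours are chosen by the angle between embedding vectors rather than by their length. As a reminder of what that metric measures, here is a small sketch; the helper `cosine_distance` is illustrative and not a UMAP function.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # lengths differ, direction is equal
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```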

In [94]:
# Step 3 - Cluster reduced embeddings
cluster_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

In [95]:
# Step 4 - bag-of-words representation
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1,2))
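
With `ngram_range=(1,2)` the vectorizer counts single words and pairs of consecutive words, after stop words have been dropped. The sketch below approximates that behaviour in plain Python. The three-word stop list and the helper name are assumptions made for illustration; scikit-learn uses its own, much longer English list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word subset, not scikit-learn's English list.
STOP_WORDS = {"the", "of", "in"}

def unigrams_and_bigrams(text):
    """Lowercase, keep tokens of two or more characters, drop stop words, count 1- and 2-grams."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

counts = unigrams_and_bigrams("The number of new cases in the UK")
print(sorted(counts))
```

Because stop words are removed before the pairs are formed, a bigram such as `number new` can join two words that were not adjacent in the original text.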

In [97]:
# Step 5 - Create topic representation (extract topic words)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
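
The class-based TF-IDF step scores a term `t` in topic `c` as its frequency within the topic multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the frequency of `t` across all topics. The sketch below implements that base formula on toy data. The function name and the token lists are illustrative, and the extra square-root damping applied by `reduce_frequent_words=True` is omitted here.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a topic id to the tokens of all its documents joined together."""
    counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    total_words = sum(sum(cnt.values()) for cnt in counts.values())
    avg_words_per_class = total_words / len(counts)  # A
    term_freq_all = Counter()                        # f_t
    for cnt in counts.values():
        term_freq_all.update(cnt)
    scores = {}
    for c, cnt in counts.items():
        n_words = sum(cnt.values())
        scores[c] = {
            term: (k / n_words) * math.log(1 + avg_words_per_class / term_freq_all[term])
            for term, k in cnt.items()
        }
    return scores

# Toy topics (hypothetical tokens, not taken from the dataset).
toy = {
    0: ["vaccine", "vaccine", "trial", "covid"],
    1: ["lockdown", "lockdown", "uk", "covid"],
}
scores = c_tf_idf(toy)
print(max(scores[0], key=scores[0].get))
```

In this toy case "covid" appears in both topics, so it scores lower than "trial" even though both occur once in topic 0.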

In [98]:
# All steps together
topic_model_real = BERTopic(
    embedding_model=sentence_model,             # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=cluster_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model,  # Step 6 - (Optional) Fine-tune topic representations
    calculate_probabilities=True,
    min_topic_size=50,     # not used when hdbscan_model is passed
    n_gram_range=(1, 2),   # not used when vectorizer_model is passed
    verbose=True,
    language='english',    # not used when embedding_model is passed
)

# Cap n_neighbors in case the fake subset has fewer than 16 tweets.
# Unlike the real model, no random seed is fixed here, so runs are not reproducible.
n_fake = int((processed_tweets["label"] == "fake").sum())
umap_model_fake = UMAP(n_neighbors=min(n_fake - 1, 15), n_components=10, metric='cosine', random_state=None, low_memory=False)
topic_model_fake = BERTopic(
    embedding_model=sentence_model,             # Step 1 - Extract embeddings
    umap_model=umap_model_fake,                 # Step 2 - Reduce dimensionality
    hdbscan_model=cluster_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model,  # Step 6 - (Optional) Fine-tune topic representations
    calculate_probabilities=True,
    min_topic_size=50,     # not used when hdbscan_model is passed
    n_gram_range=(1, 2),   # not used when vectorizer_model is passed
    verbose=True,
    language='english',    # not used when embedding_model is passed
)

In [99]:
# Training process
topics_real, probs_real = topic_model_real.fit_transform(processed_tweets[processed_tweets['label'] == 'real']['tweet_nolinks'].tolist(), embeddings_real)

2025-01-11 16:24:22,041 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-11 16:24:29,003 - BERTopic - Dimensionality - Completed ✓
2025-01-11 16:24:29,004 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-11 16:24:29,666 - BERTopic - Cluster - Completed ✓
2025-01-11 16:24:29,684 - BERTopic - Representation - Extracting topics from clusters using representation models.
  0%|          | 0/51 [00:00<?, ?it/s]


IndexError: index out of range in self

In [14]:
topics_fake, probs_fake = topic_model_fake.fit_transform([tweet for tweet in processed_tweets[processed_tweets['label'] == 'fake']['tweet_nolinks']], embeddings_fake)

2025-01-11 11:16:46,063 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-11 11:16:49,281 - BERTopic - Dimensionality - Completed ✓
2025-01-11 11:16:49,283 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-11 11:16:49,843 - BERTopic - Cluster - Completed ✓
2025-01-11 11:16:49,846 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-11 11:20:44,814 - BERTopic - Representation - Completed ✓


#### Results REAL

In [15]:
topic_model_real.get_topic(1, full=True) # all representations of topic 1

{'Main': [('says', 0.2981623464106683),
  ('restrictions', 0.27833690964482816),
  ('uk', 0.27398588369669125),
  ('coronavirus', 0.27054146006609336),
  ('england', 0.2578791763449015),
  ('government', 0.2444620750329189),
  ('boris johnson', 0.24235159093256456),
  ('boris', 0.24235159093256456),
  ('kayburley', 0.24235159093256456),
  ('johnson', 0.24149878765608798)],
 'KeyBERT': [('keir starmer', 0.3386705),
  ('sir keir', 0.30269253),
  ('starmer', 0.27992517),
  ('kayburley', 0.25482714),
  ('coronavirus uk', 0.24568787),
  ('keir', 0.23196158),
  ('sky news', 0.2312667),
  ('boris johnson', 0.21694529),
  ('local lockdown', 0.21152171),
  ('nicola sturgeon', 0.2105009)],
 'MMR': [('government', 0.2444620750329189),
  ('boris johnson', 0.24235159093256456),
  ('boris', 0.24235159093256456),
  ('kayburley', 0.24235159093256456),
  ('lockdown', 0.23322658311167946),
  ('follow live', 0.20070319118276872),
  ('latest coronavirus', 0.19908979067247917),
  ('health secretary', 0.199

In [16]:
# Custom labels (optional). Representations to choose from:
# main: keywords, general terms
# keybert: concrete terms, possibly too specific and descriptive
# mmr: balances relevance and diversity, can be harder to interpret
# The aspect key must match a key of representation_model ("Llama2" in this notebook)
topic_labels = {topic: " | ".join(word for word, _ in values[:3] if word) for topic, values in topic_model_real.topic_aspects_["Llama2"].items()}
topic_model_real.set_topic_labels(topic_labels)
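
The label-building step can be tried without a fitted model. The sketch below uses a toy dictionary shaped like an entry of `topic_aspects_`: topic 1 reuses the first KeyBERT terms from the output above with rounded scores, while topic 2 is a hypothetical entry that assumes a text-generation representation returns one label followed by empty padding pairs. The helper name `build_labels` is also illustrative.

```python
# Toy data: topic id -> list of (term, score) pairs.
aspect = {
    1: [("keir starmer", 0.34), ("sir keir", 0.30), ("starmer", 0.28), ("kayburley", 0.25)],
    2: [("Vaccine trial updates", 1), ("", 0), ("", 0)],  # hypothetical generated label plus padding
}

def build_labels(aspect, top_n=3):
    """Join the first top_n non-empty terms of every topic with ' | '."""
    return {
        topic: " | ".join(term for term, _ in pairs[:top_n] if term)
        for topic, pairs in aspect.items()
    }

labels = build_labels(aspect)
print(labels)
```

Skipping empty terms keeps a generated label from ending in stray separators.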

In [17]:
# Topics visualization
topic_model_real.visualize_topics(custom_labels=True)

In [18]:
# hierarchical topic visualization
topic_model_real.visualize_hierarchy(custom_labels=True) #false for default labels

In [19]:
# visualize documents
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_real)

# Visualize the tweets in 2-dimensional space
# NOTE: the hover text is disabled with `hide_document_hover=True`, which is especially helpful with a large dataset
topic_model_real.visualize_documents([tweet for tweet in processed_tweets[processed_tweets['label'] == 'real']['tweet_nolinks']], reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True, hide_document_hover=True)

In [20]:
topic_model_real.visualize_heatmap(n_clusters=20)

In [21]:
topic_model_real.visualize_barchart(top_n_topics=8)

#### Results FAKE

In [22]:
topic_model_fake.visualize_topics()

In [23]:
topic_model_fake.visualize_hierarchy()

In [24]:
# visualize documents
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_fake)

# Visualize the tweets in 2-dimensional space
# NOTE: the hover text is disabled with `hide_document_hover=True`, which is especially helpful with a large dataset
topic_model_fake.visualize_documents([tweet for tweet in processed_tweets[processed_tweets['label'] == 'fake']['tweet_nolinks']], reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True, hide_document_hover=True)

In [25]:
topic_model_fake.visualize_heatmap(n_clusters=20)

In [26]:
topic_model_fake.visualize_barchart(top_n_topics=8)