# <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2; text-align: center;">Topic Modeling</div>

After data preprocessing and analysis, the next step in this project (as in virtually any NLP project) is topic modeling: a process that groups the existing data into clusters, each representing a different topic. This is a crucial step for understanding the data and extracting valuable insights from it. For this task, the team chose the BERTopic library, a topic modeling technique that uses transformer models to create dense representations of the documents, reduces their dimensionality, and then clusters them with HDBSCAN.

#### Used Embeddings

The embeddings used for topic modeling come from the digitalepidemiologylab project on GitHub, which trained a BERT-based model on COVID-19 tweets. A description of the model can be found in the [official repository of the digitalepidemiologylab project](https://github.com/digitalepidemiologylab/covid-twitter-bert).

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Dependency Imports</div>

In [29]:
# !pip install umap-learn
# !pip install hdbscan
# !pip install sentence-transformers
# !pip install bertopic
# !pip install nltk
# !pip install nbformat
# !pip install datamapplot
# !pip install dask[dataframe]


In [None]:
# When using CUDA, reduces VRAM usage by letting the allocator expand existing memory segments
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

In [30]:
import pandas as pd
import html
from umap import UMAP
from sklearn.decomposition import PCA
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
import datamapplot
import re

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Preprocessing</div>

Before transforming the data, some preprocessing steps are needed to clean it and to keep the results coherent with the previous notebooks. The following steps were taken:

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. Data Loading</div>

First, the datasets inside the `data` folder are loaded into the notebook. The train, test and validation datasets are merged into a single dataframe, which is stored in the `tweets` variable.

In [31]:
DATA = 'data/'
test = pd.read_csv(DATA + 'Constraint_English_Test.csv', delimiter=';', encoding='utf-8')
train = pd.read_csv(DATA + 'Constraint_English_Train.csv', delimiter=';', encoding='utf-8')
val = pd.read_csv(DATA + 'Constraint_English_Val.csv', delimiter=';', encoding='utf-8')

tweets = pd.concat([train, val, test], ignore_index=True)
tweets.drop(columns=['id'], inplace=True)
tweets['tweet'] = tweets['tweet'].apply(html.unescape)


### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">2. Data Cleaning</div>

The imported data is not perfect and must undergo some cleaning. The following steps were taken:

1. Duplicate instances were removed with the `drop_duplicates` method, keeping only the first occurrence of each repeated tweet.
2. URLs and numbers were removed from the `tweet` column (they are not useful for topic modeling), and the result was stored in a new `tweet_nolinks` column.
3. Two variables were created to store the cleaned real and fake tweets separately.

In [32]:
processed_tweets = tweets.copy()
processed_tweets = processed_tweets.drop_duplicates(subset='tweet', keep='first')

# Remove links and numbers (named so that it does not shadow the built-in `filter`)
def remove_links_and_numbers(tweet: str) -> str:
    # URL pattern taken from:
    # https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url#3809435
    tweet = re.sub(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)', '', tweet)
    tweet = re.sub(r'\d+', '', tweet) # remove numbers, not useful for topic modeling
    return tweet

processed_tweets['tweet_nolinks'] = processed_tweets['tweet'].apply(remove_links_and_numbers)

# Recalculate indexes (on the DataFrame itself; resetting a single column does not reindex the DataFrame)
processed_tweets.reset_index(drop=True, inplace=True)
real_tweets = processed_tweets[processed_tweets['label'] == 'real']['tweet_nolinks']
fake_tweets = processed_tweets[processed_tweets['label'] == 'fake']['tweet_nolinks']

processed_tweets

|  | tweet | label | tweet_nolinks |
|---|---|---|---|
| 0 | The CDC currently reports 99031 deaths. In gen... | real | The CDC currently reports deaths. In general ... |
| 1 | States reported 1121 deaths a small rise from ... | real | States reported deaths a small rise from last... |
| 2 | Politically Correct Woman (Almost) Uses Pandem... | fake | Politically Correct Woman (Almost) Uses Pandem... |
| 3 | #IndiaFightsCorona: We have 1524 #COVID testin... | real | #IndiaFightsCorona: We have #COVID testing la... |
| 4 | Populous states can generate large case counts... | real | Populous states can generate large case counts... |
| ... | ... | ... | ... |
| 10695 | #CoronaVirusUpdates: State-wise details of Tot... | real | #CoronaVirusUpdates: State-wise details of Tot... |
| 10696 | Tonight 12(midnight) onwards Disaster Manageme... | fake | Tonight (midnight) onwards Disaster Management... |
| 10697 | 296 new cases of #COVID19Nigeria; Plateau-85 E... | real | new cases of #COVIDNigeria; Plateau- Enugu- O... |
| 10698 | RT @CDCemergency: #DYK? @CDCgov’s One-Stop Sho... | real | RT @CDCemergency: #DYK? @CDCgov’s One-Stop Sho... |


For reference, the resulting dataset has a total of 3 features and 10699 instances.
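
One side effect of removing every digit is visible in the output above: hashtags such as `#COVID19Nigeria` end up as `#COVIDNigeria`. If that is undesirable, one possible alternative (a sketch only, not what this notebook uses) is to strip digits only when they form a standalone token. The helper name `strip_standalone_numbers` is hypothetical.

```python
import re

# Hypothetical helper: removes digit runs only when they form a whole token,
# so digits embedded in hashtags or words (e.g. "#COVID19") are left intact.
STANDALONE_NUMBER = re.compile(r"\b\d+\b")


def strip_standalone_numbers(text: str) -> str:
    """Delete numbers that are delimited by word boundaries on both sides."""
    return STANDALONE_NUMBER.sub("", text)


sample = "296 new cases of #COVID19Nigeria; Plateau-85"
cleaned = strip_standalone_numbers(sample)
print(cleaned)
```

Whether hashtags should keep their digits depends on the analysis; keeping them preserves tokens such as `COVID19` as a single keyword.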

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Data Transformation (Embeddings)</div>

To extract the topics, the tweets must first be turned into embeddings: dense numeric vectors, produced by a pre-trained model, that the topic modeling algorithm can work with. How well the embedding model fits the domain of the data is paramount for the performance of the topic modeling algorithm. In this case, the embedding model is `covid-twitter-bert`, provided by digitalepidemiologylab.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. Load Embeddings</div>

To load the model, the `sentence_transformers` library (built on top of the `transformers` library) was used. `covid-twitter-bert` is a model trained on COVID-19 tweets. The resulting embeddings are stored in the `embeddings_real` and `embeddings_fake` variables, for the real and fake tweets respectively.

In [33]:
from sentence_transformers import SentenceTransformer
# Load the pre-trained model
sentence_model = SentenceTransformer('digitalepidemiologylab/covid-twitter-bert-v2')

# Encode the tweets
embeddings_real = sentence_model.encode(real_tweets.reset_index(drop=True).tolist(), show_progress_bar=True, batch_size=16)
embeddings_fake = sentence_model.encode(fake_tweets.reset_index(drop=True).tolist(), show_progress_bar=True, batch_size=16)

No sentence-transformers model found with name digitalepidemiologylab/covid-twitter-bert-v2. Creating a new one with mean pooling.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Batches: 100%|██████████| 350/350 [10:37<00:00,  1.82s/it]
Batches: 100%|██████████| 319/319 [07:36<00:00,  1.43s/it]
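
The log above notes that the checkpoint is not a native sentence-transformers model, so a mean pooling layer is added: the per-token vectors of each tweet are averaged into a single vector, ignoring padding. The sketch below illustrates that idea on toy data in plain Python. The function name and the toy values are assumptions for illustration; the library itself works on tensors.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```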


### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">2. Dimensionality Reduction</div>

The embeddings have a high dimensionality, which can be a problem for the clustering step. To reduce it, the `UMAP` algorithm was used: it projects the embeddings down to 10 dimensions while trying to preserve each point's nearest neighbours, measured here with the cosine metric. Since some errors appeared in the UMAP algorithm with the fake tweets, a separate UMAP model was created for them.

In [34]:
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', random_state=None, low_memory=True)
# For the fake tweets, n_neighbors is capped at (number of fake tweets - 1)
umap_model_fake = UMAP(n_neighbors=min(len(fake_tweets) - 1, 15), n_components=10, metric='cosine', random_state=42, low_memory=True)

With this step, the dimensionality reducers have been created and are ready to be used in the creation of the BERTopic models.
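
The `min(..., 15)` expression used for `umap_model_fake` keeps the neighbourhood size below the number of documents. A minimal sketch of that guard as a reusable function follows. The name `safe_n_neighbors` and the lower bound of 2 are assumptions made for illustration.

```python
def safe_n_neighbors(n_samples: int, desired: int = 15) -> int:
    """Return a neighbourhood size that stays below the number of samples."""
    if n_samples < 3:
        raise ValueError("need at least 3 samples to pick a neighbourhood size >= 2")
    # Never exceed n_samples - 1, and never go below 2.
    return max(2, min(desired, n_samples - 1))


print(safe_n_neighbors(5000))  # large dataset: the desired value is kept
print(safe_n_neighbors(10))    # small dataset: the value is capped
```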

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">3. Embedding Clustering</div>

Another important step of the BERTopic pipeline is the clustering of the embeddings. The embeddings are clustered using the `HDBSCAN` algorithm, which in this case uses the Euclidean distance metric to form the clusters. Again, once loaded, the clustering algorithm is ready to be used in the creation of the BERTopic model.

In [35]:
# Step 3 - Cluster reduced embeddings
cluster_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
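
HDBSCAN does not force every point into a cluster: points it considers noise receive the label `-1`, which BERTopic reports as topic `-1`. The sketch below shows one way to measure how large that outlier group is, using a toy list of assignments. The function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable


def outlier_share(topic_ids: Iterable[int]) -> float:
    """Fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no topic assignments given")
    return counts[-1] / total


toy_assignments = [0, 0, 1, -1, 2, -1, 0, 1]
share = outlier_share(toy_assignments)
print(f"{share:.2%} of the toy documents are outliers")
```

Once the models are trained later in the notebook, the same function could be applied to the topic assignments returned by `fit_transform`.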

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">4. Topic Tokenization</div>

The next step of the BERTopic pipeline is the tokenization of the topics. The documents of each topic are tokenized with `CountVectorizer`, which counts how many times each token (here, single words and two-word phrases, excluding English stop words) appears. As with the other components, once created the tokenization model is ready to be used in the creation of the BERTopic model.

In [36]:
# Step 4 - bag-of-words representation
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1,2))
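
To make the effect of `stop_words` and `ngram_range=(1, 2)` concrete, here is a simplified bag-of-words sketch in plain Python. It is not scikit-learn's implementation: the whitespace tokenization and the tiny stop-word set are assumptions chosen to keep the example short.

```python
from collections import Counter
from typing import Tuple

# Tiny illustrative subset, NOT scikit-learn's English stop-word list.
STOP_WORDS = {"the", "of", "a", "in", "are"}


def count_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 2)) -> Counter:
    """Count word n-grams after lowercasing and removing stop words."""
    tokens = [tok for tok in text.lower().split() if tok not in STOP_WORDS]
    counts: Counter = Counter()
    smallest, largest = ngram_range
    for n in range(smallest, largest + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams("the number of new cases in the new cases report")
print(counts.most_common(3))
```

Two-word entries such as `new cases` are the reason why bigrams like `day average` or `close contacts` can appear among the topic keywords.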

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">5. Topic Representation</div>

The last step of the BERTopic pipeline is the representation of the topics. The topic words are extracted with `ClassTfidfTransformer`, BERTopic's class-based TF-IDF (c-TF-IDF), which weighs the token counts so that words characteristic of a topic stand out. As with the other components, once created the representation model is ready to be used in the creation of the BERTopic model.

In [37]:
# Step 5 - Create topic representation (extract topic words)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
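
The weighting behind `ClassTfidfTransformer` can be sketched in a few lines. The version below follows the c-TF-IDF formula as described in BERTopic's documentation (term frequency within a class multiplied by `log(1 + A / f)`, where `A` is the average number of words per class and `f` the frequency of the word across all classes). It is a simplified illustration under that reading, not the library's code, and the square-root treatment of `reduce_frequent_words` is likewise an assumption based on the documentation.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(class_tokens: List[List[str]], reduce_frequent_words: bool = False) -> List[Dict[str, float]]:
    """Compute a simplified class-based TF-IDF weight per word and class."""
    if not class_tokens:
        raise ValueError("at least one class is required")
    class_counts = [Counter(tokens) for tokens in class_tokens]
    total_counts: Counter = Counter()
    for counts in class_counts:
        total_counts.update(counts)
    avg_words_per_class = sum(total_counts.values()) / len(class_tokens)

    weights = []
    for counts in class_counts:
        n_words = sum(counts.values())
        scores = {}
        for word, count in counts.items():
            tf = count / n_words  # term frequency normalized within the class
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampens very frequent words
            idf = math.log(1 + avg_words_per_class / total_counts[word])
            scores[word] = tf * idf
        weights.append(scores)
    return weights


# Toy topics: each inner list holds all tokens of the documents in one topic.
toy_topics = [["vaccine", "vaccine", "covid"], ["lockdown", "covid", "covid"]]
weights = c_tf_idf(toy_topics)
print(weights[0])
```

In the toy example, `vaccine` outweighs `covid` in the first topic because `covid` is spread across both topics.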

Additionally, since the names of the topics can be confusing at times, the team decided to use a pretrained LLM to generate a short label for each topic from its documents and keywords. The chosen model was `flan-t5-base`, from Google's FLAN-T5 family.

Using the `transformers` library (the base library of `sentence_transformers`), the model was loaded into a pipeline for text-to-text generation. Next, a prompt was written asking the model to label each topic. Finally, the pipeline and the prompt were wrapped in a representation model that can be integrated into the BERTopic model.

In [74]:
from transformers import pipeline
from bertopic.representation import TextGeneration

# Load a pre-trained model and tokenizer
model_name = 'google/flan-t5-base'  # You can replace this with other summarization models

# Create Text Generator

generator = pipeline(
    model=model_name,
    task='text2text-generation',
)

prompt = """
You are a helpful, respectful and honest assistant for labeling topics.
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords, separated by "_" characters: '[KEYWORDS]'.

Based on the information about the topic above, please create a short (three-word maximum) label of this topic.
Make sure you to only return the label and nothing more.

Everyone is counting on you to provide a good label for this topic.
"""
# NOTE: despite the "llama2" naming, this wraps the flan-t5-base generator created above
llama2 = TextGeneration(generator, prompt=prompt)
representation_model = {
    "Llama2": llama2,
}

Device set to use cpu
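
Before the prompt reaches the generator, the `[DOCUMENTS]` and `[KEYWORDS]` placeholders are replaced with the representative documents and keywords of each topic. The sketch below illustrates that kind of substitution in plain Python. The function name, the bullet formatting, the comma separator and the `max_docs` cap are assumptions for illustration, not BERTopic's exact behaviour.

```python
from typing import List


def fill_prompt(template: str, documents: List[str], keywords: List[str], max_docs: int = 4) -> str:
    """Substitute the placeholders of a topic-labeling prompt template."""
    # Cap the number of documents so the prompt stays short.
    doc_block = "\n".join(f"- {doc}" for doc in documents[:max_docs])
    filled = template.replace("[DOCUMENTS]", doc_block)
    return filled.replace("[KEYWORDS]", ", ".join(keywords))


template = "Documents:\n[DOCUMENTS]\nKeywords: '[KEYWORDS]'\nLabel:"
filled = fill_prompt(
    template,
    ["Wear a face covering.", "Wash your hands often."],
    ["face covering", "hands", "wear"],
)
print(filled)
```

Keeping the filled prompt short matters for a small model: the training log further down reports an input of 2563 tokens against a 512-token limit.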


## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Creation and Training of BERTopic Models</div>

Once all the necessary components have been created, the BERTopic models can be built and trained. They are created with the `BERTopic` class from the `bertopic` library, which ties together the embedding, dimensionality reduction, clustering and representation steps defined above. Two BERTopic models are created, one for the real tweets and one for the fake tweets.

In [75]:
# All steps together
topic_model_real = BERTopic(
  embedding_model=sentence_model.to('cpu'),          # Step 1 - Extract embeddings
  umap_model=umap_model,                     # Step 2 - Reduce dimensionality
  hdbscan_model=cluster_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model, # Step 6 - (Optional) Fine-tune topic representations
  calculate_probabilities = True,
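  min_topic_size = 50,                       # Not used here: a custom hdbscan_model is passed (its min_cluster_size=15 applies)
  n_gram_range=(1, 2),                       # Not used here: a custom vectorizer_model is passed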
  verbose = True,
  language='english',
  low_memory=True,
)

topic_model_fake = BERTopic(
  embedding_model=sentence_model.to('cpu'),          # Step 1 - Extract embeddings
  umap_model=umap_model_fake,                     # Step 2 - Reduce dimensionality
  hdbscan_model=cluster_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model, # Step 6 - (Optional) Fine-tune topic representations
  calculate_probabilities = True,
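  min_topic_size = 50,                       # Not used here: a custom hdbscan_model is passed (its min_cluster_size=15 applies)
  n_gram_range=(1, 2),                       # Not used here: a custom vectorizer_model is passed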
  verbose = True,
  language='english',
  low_memory=True,
)

In [None]:
# Clear the cache to avoid running out of VRAM (Tested on 4GB)
import torch
torch.cuda.empty_cache()
# cpu set as the default device because of the high VRAM use
torch.set_default_device('cpu')

Now that the models are created, it is time to train them on the real and fake tweets, respectively, before being able to extract relevant results from them.

In [76]:
# Training process
topics_real, probs_real = topic_model_real.fit_transform(real_tweets.tolist(), embeddings_real)

2025-01-14 20:47:56,399 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-14 20:47:58,362 - BERTopic - Dimensionality - Completed ✓
2025-01-14 20:47:58,363 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-14 20:47:59,046 - BERTopic - Cluster - Completed ✓
2025-01-14 20:47:59,049 - BERTopic - Representation - Extracting topics from clusters using representation models.
  0%|          | 0/53 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (2563 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 53/53 [01:13<00:00,  1.39s/it]
2025-01-14 20:49:13,374 - BERTopic - Representation - Completed ✓


In [77]:
topics_fake, probs_fake = topic_model_fake.fit_transform(fake_tweets.tolist(), embeddings_fake)

2025-01-14 20:49:13,572 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-14 20:49:19,651 - BERTopic - Dimensionality - Completed ✓
2025-01-14 20:49:19,652 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-14 20:49:20,230 - BERTopic - Cluster - Completed ✓
2025-01-14 20:49:20,233 - BERTopic - Representation - Extracting topics from clusters using representation models.
100%|██████████| 56/56 [01:41<00:00,  1.81s/it]
2025-01-14 20:51:01,939 - BERTopic - Representation - Completed ✓


Once the models are trained, the default topic labels are replaced by those generated by the LLM, which are (a little) more descriptive and easier to understand.

In [None]:

llama2_labels_real = [label[0][0].split("\n")[0] for label in topic_model_real.get_topics(full=True)["Llama2"].values()]
topic_model_real.set_topic_labels(llama2_labels_real)

llama2_labels_fake = [label[0][0].split("\n")[0] for label in topic_model_fake.get_topics(full=True)["Llama2"].values()]
topic_model_fake.set_topic_labels(llama2_labels_fake)
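
The expression `label[0][0].split("\n")[0]` may look cryptic: each topic representation is a list of `(text, score)` pairs, and the generated label sits in the first pair, so the code keeps the first line of that text. The sketch below reproduces the extraction on mock data shaped like the `Llama2` column shown in the results. The mock values and scores are placeholders, not real model output.

```python
from typing import Dict, List, Tuple

Representation = List[Tuple[str, float]]


def first_line_labels(topic_representations: Dict[int, Representation]) -> List[str]:
    """Take the first line of the first entry of each topic representation."""
    return [entries[0][0].split("\n")[0] for entries in topic_representations.values()]


# Mock data: one generated label followed by empty padding entries.
mock_representations: Dict[int, Representation] = {
    -1: [("#COVID", 1.0)] + [("", 0.0)] * 9,
    0: [("covid restrictions\nextra generated text", 1.0)] + [("", 0.0)] * 9,
}
labels = first_line_labels(mock_representations)
print(labels)
```

Because the labels are passed to `set_topic_labels` as a plain list, their order has to match the order of the topics, starting with the outlier topic `-1` when it exists.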

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Results for Real Tweets</div>

It is time to start extracting the results from the BERTopic models. That is, extracting the formed topics and key information about them.



### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. DataFrame Visualization</div>

The first thing to do is to visualize the created topics inside a dataframe. This will be done for both the real and fake tweets.

In [78]:
topic_model_real.get_topic_info()

|  | Topic | Count | Name | Representation | Llama2 | Representative_Docs |
|---|---|---|---|---|---|---|
| 0 | -1 | 841 | -1_cdc_learn_help_forecasts | [cdc, learn, help, forecasts, contact, pets, c... | [#COVID, , , , , , , , , ] | [The reported death toll was bringing our dat... |
| 1 | 0 | 893 | 0_indiafightscorona_recoveries_lakh_india | [indiafightscorona, recoveries, lakh, india, d... | [#coronavirusindia #coronavirusupdates #corona... | [#CoronaVirusUpdates #IndiaFightsCorona India ... |
| 2 | 1 | 724 | 1_data_numbers_reporting_day average | [data, numbers, reporting, day average, states... | [COVID- cases reported today, , , , , , , , , ] | [Our daily update is published. States reporte... |
| 3 | 2 | 512 | 2_says_restrictions_uk_coronavirus | [says, restrictions, uk, coronavirus, england,... | [covid restrictions, , , , , , , , , ] | [Labour leader Sir Keir Starmer says he "disag... |
| 4 | 3 | 252 | 3_nigeria_covid reported_reported nigeria_ncdc | [nigeria, covid reported, reported nigeria, nc... | [nigeria covid, , , , , , , , , ] | [Eight new cases of #COVID have been reported ... |
| 5 | 4 | 249 | 4_drtedros_countries_drtedros covid_vaccines | [drtedros, countries, drtedros covid, vaccines... | [covid tools drtedros, , , , , , , , , ] | [The Accelerator’s start-up phase has shown im... |
| 6 | 5 | 206 | 5_patients_covid patients_cancer_hydroxychloro... | [patients, covid patients, cancer, hydroxychlo... | [COVID, , , , , , , , , ] | [Non-#COVID patients continue to require contr... |
| 7 | 6 | 148 | 6_face covering_covering_hands_wear | [face covering, covering, hands, wear, face, c... | [#COVID, , , , , , , , , ] | [You can help slow the spread of #COVID. Pract... |
| 8 | 7 | 114 | 7_alexismadrigal_yeah_yayitsrob_extremely | [alexismadrigal, yeah, yayitsrob, extremely, l... | [_, , , , , , , , , ] | [@ImTheRealDMac @lowerthetemp In some places i... |
| 9 | 8 | 102 | 8_contacted_close contacts_referred_negative | [contacted, close contacts, referred, negative... | [contacted, tested, negative, left, managed is... | [Of the people who left managed isolation fac... |


Overall, around 50 topics were created for the real tweets. The LLM did a decent job of providing summary labels for some of them, but many labels remain unclear because they are repetitive: most topics are labeled "COVID" alone or with a few other words. The tokenizer warning in the training log (a 2563-token input against a 512-token limit) suggests that at least one prompt was too long for the model, which may have contributed to this. Other visualizations are therefore needed to better understand the topics. For the labels that are clear, many focus on informative content, such as restriction updates and reports of new cases.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">2. Distance Map</div>

The next chosen representation is a distance map, which is a representation of the topics in a 2D space. This representation is useful for understanding the relationships between the topics and how they are distributed in the semantic space.

In [82]:
# Topics visualization
topic_model_real.visualize_topics(custom_labels=True)

The vast majority of the topics are concentrated in a few widely separated clusters, each containing many topics, with a kind of 'nexus' in the middle of the map. This suggests that the topics are well separated overall, while a few of the most common ones are strongly related to each other.

While ideal for seeing the distribution and semantic variety of the topics, this map is not very useful for understanding the topics themselves. For that, a different representation must be used.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">3. Hierarchical Visualization</div>

To better visualize how topics relate to each other, the team chose a hierarchical visualization, which shows the relationships between the topics in a tree-like structure.

In [83]:
# hierarchical topic visualization
topic_model_real.visualize_hierarchy(custom_labels=True) #false for default labels

The result is a complex hierarchy of topics. One of the most interesting insights from this visualization is that there are specific, closely related topics for tests, cases, hashtags, and worldwide restriction updates.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">4. Document Cluster Visualization</div>

This kind of visualization helps to understand how documents relate to each other and to find possible patterns in their themes. It requires a version of the embeddings reduced to two dimensions, which is produced with the UMAP algorithm.

In [84]:
# Visualize documents: reduce the embeddings to 2 dimensions for plotting
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_real)

# Plot the documents in 2-dimensional space, colored by topic
# NOTE: `hide_document_hover=True` disables the hover text, which is especially helpful for a large dataset
topic_model_real.visualize_documents(real_tweets.tolist(), reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True, hide_document_hover=True)

Just as with the Distance Map, the Document Cluster Visualization shows a nexus in the center of the representation, where most of the topics meet, while the rest of the topics are scattered far from the center. This suggests that a few main topics are the most discussed, and that the rest of the topics are not as relevant.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">5. Heat Map Visualization</div>

The Heat Map Representation builds a similarity matrix between the topics. This representation makes the similarities between topics easy to see.

In [85]:
topic_model_real.visualize_heatmap(n_clusters=20, custom_labels=True, width=1000, height=1000)

The heat map shows topics with varying levels of similarity to one another. Interestingly, the topics related to COVID cases and COVID in India (in the lower rows of the heat map) have little semantic relation to any of the other topics: their cells are the closest to white in the whole heat map.
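
A similarity matrix of this kind can be sketched with cosine similarity between topic vectors. The example below uses toy two-dimensional vectors in plain Python. The vectors and function names are hypothetical, and the real heat map is computed from the fitted model's own topic representations.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities; entry [i][j] compares topic i with topic j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy topic vectors: the first two are unrelated, the third overlaps with both.
topic_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(topic_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

Rows with uniformly low values, like the first two toy topics compared with each other, correspond to the pale rows described above.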

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">6. Topic Word Score Bar Chart</div>

This representation shows the most important and relevant words for each topic, which helps to understand the main themes of each topic.

In [86]:
topic_model_real.visualize_barchart(top_n_topics=8, custom_labels=True)

For the real tweets, the most relevant topics show words related to restrictions, information, and case reports. Most of these words belong to neutral topics that do not try to convey any kind of sentiment and instead focus on information about the main point of interest: the virus.

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Results for Fake Tweets</div>

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. DataFrame Visualization</div>

In [92]:
topic_model_fake.get_topic_info()

|  | Topic | Count | Name | CustomName | Representation | Llama2 | Representative_Docs |
|---|---|---|---|---|---|---|---|
| 0 | -1 | 1762 | -1_florida_new_deaths_post | COVID | [florida, new, deaths, post, died, virus, worl... | [COVID, , , , , , , , , ] | [Korona virus, very new deadly form of virus, ... |
| 1 | 0 | 323 | 0_chloroquine_hydroxychloroquine_virus_zinc | 'Coronavirus' | [chloroquine, hydroxychloroquine, virus, zinc,... | ['Coronavirus', , , , , , , , , ] | [Man visited Albany N.Y. days before dying fro... |
| 2 | 1 | 255 | 1_water_hot_cow_steam | ayush | [water, hot, cow, steam, lemon, drinking, hot ... | [ayush, , , , , , , , , ] | [Some @WHO “myth busters” about COVID-: sprayi... |
| 3 | 2 | 188 | 2_michigan_pelosi_whitmer_gretchen whitmer | nancy pelosi | [michigan, pelosi, whitmer, gretchen whitmer, ... | [nancy pelosi, , , , , , , , , ] | [Says Joe Biden and Gretchen Whitmer were mask... |
| 4 | 3 | 185 | 3_hyderabad_pradesh_muslim_uttar pradesh | beaten, gandhi, tablighi, beaten, gand | [hyderabad, pradesh, muslim, uttar pradesh, ut... | [beaten, gandhi, tablighi, beaten, gand, , , ,... | [A doctor who went to Uttar Pradesh (a state i... |
| 5 | 4 | 185 | 4_news_news government_boris_lockdown | priti | [news, news government, boris, lockdown, news ... | [priti, , , , , , , , , ] | [NEWS! ‘Loss of taste’ added to COVID- symptom... |
| 6 | 5 | 113 | 5_nashville_nashville man_coronavirus nashvill... | nashville, nashville man, health coronavirus, ... | [nashville, nashville man, coronavirus nashvil... | [nashville, nashville man, health coronavirus,... | [Nashville Man Secretly Suspects Friend of Hav... |
| 7 | 6 | 111 | 6_donaldtrump_donaldtrump coronavirus_coronavi... | _ coronavirus | [donaldtrump, donaldtrump coronavirus, coronav... | [_ coronavirus, , , , , , , , , ] | [President Trump Says He Will Never, Ever Get ... |
| 8 | 7 | 96 | 7_modi_indian_punishable_crore | narendra modi | [modi, indian, punishable, crore, management a... | [narendra modi, , , , , , , , , ] | [The Disaster Management Act has been implemen... |
| 9 | 8 | 88 | 8_lakh_coronavirusfacts_new infections_covid c... | coronavirus: new infections, new cases, new ca... | [lakh, coronavirusfacts, new infections, covid... | [coronavirus: new infections, new cases, new c... | [With nearly new infections reported in India... |


Just like the real tweets, the fake tweets can be divided into around 50 topics. Surprisingly, the word "COVID" appears in the topic labels much less often than for the real tweets. This may reflect the way disinformers craft their messages: they try to avoid technical terms and focus on the emotional impact of the message.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">2. Distance Map</div>

In [87]:
topic_model_fake.visualize_topics(custom_labels=True)

One particularity of the distance map for fake tweets is that the central nexus contains far fewer topics than in the real tweets. This may be a sign that disinformers write about a wider range of topics than genuine informers, perhaps in an attempt to reach a wider audience. The scattered clusters are reasonably large and dense, whereas in the real tweets the clusters away from the center of the map are much smaller and less concentrated.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">3. Hierarchical Clustering</div>

In [88]:
topic_model_fake.visualize_hierarchy(custom_labels=True)

The hierarchical view of topics shows very distinct semantic clusters, whose topic labels make the relation between them clear. Among them are topics that focus on the vaccine, animals, countries and even some particular individuals. These topics are very closely related, which suggests it is no coincidence that fake tweets point to all these other factors. They may not provide useful information about the virus itself; rather, they seem to offer suggestions about what to blame for the pandemic.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">4. Document Cluster Visualization</div>

In [89]:
# Visualize documents: reduce the embeddings to 2 dimensions for plotting
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_fake)

# Plot the documents in 2-dimensional space, colored by topic
# NOTE: `hide_document_hover=True` disables the hover text, which is especially helpful for a large dataset
topic_model_fake.visualize_documents(fake_tweets.tolist(), reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True, hide_document_hover=True)

A remarkable finding in this view is that one group of documents appears scattered across the whole cluster map, even overlapping with clusters of other colours. This might mean that these documents mention something common to all topics. Another plausible explanation is that they are the outlier documents (topic -1, with 1762 fake tweets in the table above), which HDBSCAN could not assign to any cluster.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">5. Heat Map Visualization</div>

In [90]:
topic_model_fake.visualize_heatmap(n_clusters=20,custom_labels=True, width=1000, height=1000)

Curiously, the heat map indicates that the overall level of interrelation between topics is much lower than in the real tweets, whose heat map shows many more blue tones. This could be a sign that disinformers tend to write about a wider range of topics that are less connected to the main topic.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">6. Topic Word Scores Bar Chart</div>

In [91]:
topic_model_fake.visualize_barchart(top_n_topics=8,custom_labels=True)

Compared with the real tweets, very few words related to the virus itself appear. Once again, the most relevant topics focus on particular individuals, countries and other matters unrelated to the virus (there is an entire topic focused on Donald Trump). This difference suggests that disinformers write about the core topic differently and have interests other than conveying the core information.

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Conclusions</div>

Topic modeling points to significant differences between the way real and fake tweets are written. The real tweets focus on core information about the virus, such as restrictions, cases, and facts about the virus itself. Fake tweets, on the other hand, cover a wider range of topics, such as the vaccine, animals, countries and, in several cases, particular individuals and high-ranking world authorities. These differences could help determine in the future whether a message is real or fake, and could be used to develop a model that classifies tweets as real or fake based on the topics they talk about.

However, this is only a semantic analysis of the tweets and, on its own, may not be sufficient to draw a firm conclusion. Further analysis could provide additional evidence to support it.