# Machine Learning Task 2

In this task we classify texts (tweets) as fake or real. We first apply several techniques to clean the texts of irrelevant information, and then use BERTopic for topic modelling, with a pipeline built from several algorithms.

## Load the data

Here we load the xlsx files that we will use and take a look at their content.

In [63]:
# Load the data from the Excel files and print the first 5 rows of each
import pandas as pd
train = pd.read_excel('data/Constraint_English_Train.xlsx')
test = pd.read_excel('data/Constraint_English_Test.xlsx')
val = pd.read_excel('data/Constraint_English_Val.xlsx')
test_labeled = pd.read_excel('data/english_test_with_labels.xlsx')

print("Train Data:")
print(train.head())
print("\nValidation Data:")
print(val.head())
print("\nTest Data:")
print(test.head())
print("\nLabeled Test Data:")
print(test_labeled.head())


Train Data:
   id                                              tweet label
0   1  The CDC currently reports 99031 deaths. In gen...  real
1   2  States reported 1121 deaths a small rise from ...  real
2   3  Politically Correct Woman (Almost) Uses Pandem...  fake
3   4  #IndiaFightsCorona: We have 1524 #COVID testin...  real
4   5  Populous states can generate large case counts...  real

Validation Data:
   id                                              tweet label
0   1  Chinese converting to Islam after realising th...  fake
1   2  11 out of 13 people (from the Diamond Princess...  fake
2   3  COVID-19 Is Caused By A Bacterium, Not Virus A...  fake
3   4  Mike Pence in RNC speech praises Donald Trump’...  fake
4   5  6/10 Sky's @EdConwaySky explains the latest #C...  real

Test Data:
   id                                              tweet
0   1  Our daily update is published. States reported...
1   2             Alfalfa is the only cure for COVID-19.
2   3  President Trump Asked Wh

We import the modules that we will use to filter and tokenize our data. We use spaCy because it filters the words better than WordNet from NLTK, thanks to its context-aware linguistic model.

In [64]:
import numpy as np
import re

import spacy
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    import spacy.cli
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")


## Cleaning functions

We implement two functions. The first one removes URLs, numbers, punctuation and extra spaces, and returns the original text reduced to plain words.

The second function uses the spaCy `nlp` pipeline to tokenize the lowercased text, remove stopwords and return the lemma of each remaining token.

In [65]:
def clean_text(text: str) -> str:
    """Basic text cleaning: remove URLs, non-letters and extra spaces (case is preserved)."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)         # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                # keep only letters and spaces
    text = re.sub(r"\s+", " ", text).strip()            # normalise spaces
    return text

# Tokenisation and stopword removal with spaCy
def spacy_preprocess_with_lemmas(text):
    """Preprocessing with lemmatization"""
    if pd.isna(text) or text.strip() == "":
        return []
    
    doc = nlp(text.lower())
    # Keep non-stopword lemmas longer than 2 characters (drops short lemmas such as "be" or "in")
    lemmas = [token.lemma_ for token in doc
              if not token.is_stop and len(token.lemma_) > 2]
    return lemmas
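
As a quick check of the first function, the sketch below repeats its three regex steps and applies them to a made-up tweet (the sample text is hypothetical, not taken from the dataset). It shows that URLs, digits, punctuation and the `#` sign are dropped, while the casing is kept.

```python
import re

def clean_text(text: str) -> str:
    """Copy of the notebook's regex steps, repeated so this example runs on its own."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)       # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace
    return text

# Hypothetical sample tweet, used only for illustration
sample = "Visit https://t.co/abc123 now!! 1524 #COVID labs"
cleaned = clean_text(sample)
print(cleaned)  # Visit now COVID labs
```

Lowercasing only happens later, inside `spacy_preprocess_with_lemmas`.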


This third filter removes tokens that are not real words, such as IndiaFightsCorona (a hashtag fused into a single word).

In [66]:
%pip install wordfreq
from wordfreq import zipf_frequency

def filter_real_words(tokens):
    """Keep only tokens that wordfreq knows as English words (Zipf frequency above 0)."""
    return [t for t in tokens if zipf_frequency(t, "en") > 0]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


Here we define a function that counts the parts of speech (POS) that appear in each tweet; later we will split each POS tag into its own column.

In [67]:
from collections import Counter

def get_spacy_pos_counts(text):
    doc = nlp(text)
    counts = Counter([token.pos_ for token in doc])
    return counts


With all these filters and preprocessing steps, we save the results in separate DataFrame columns so we can compare them.

In [68]:
train['cleaned_tweet'] = train['tweet'].apply(clean_text)
train['tweet_tokens'] = train['cleaned_tweet'].apply(spacy_preprocess_with_lemmas)
train['tweet_tokens'] = train['tweet_tokens'].apply(filter_real_words)
train["spacy_pos_counts"] = train["cleaned_tweet"].apply(get_spacy_pos_counts)

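# Note: spaCy tags conjunctions as CCONJ/SCONJ, so the "CONJ" column stays at 0 in the output below.
pos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ"]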
for tag in pos_tags:
    train[tag + "_spacy"] = train["spacy_pos_counts"].apply(lambda x: x.get(tag, 0))

train["total_spacy"] = train[[t + "_spacy" for t in pos_tags]].sum(axis=1)
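
The column expansion above can be sketched without spaCy. The tag sequence below is hand-written and hypothetical; it stands in for `[token.pos_ for token in doc]`. The point is that `counts.get(tag, 0)` yields 0 for any tag that never occurs, and that the total only sums the tracked tags, so tags such as PROPN or AUX are left out of it.

```python
from collections import Counter

# Hypothetical tag sequence standing in for [token.pos_ for token in doc]
tags = ["PROPN", "PRON", "AUX", "NOUN", "NOUN", "VERB", "ADP", "PROPN"]
counts = Counter(tags)

# Same tag list as in the notebook cell
pos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ"]
row = {tag + "_spacy": counts.get(tag, 0) for tag in pos_tags}

tracked_total = sum(row.values())  # only the eight tracked tags
all_tokens = len(tags)             # every token, including PROPN and AUX

print(row)
print(tracked_total, all_tokens)   # 5 8
```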

In [69]:
train.head()

| | id | tweet | label | cleaned_tweet | tweet_tokens | spacy_pos_counts | NOUN_spacy | VERB_spacy | ADJ_spacy | ADV_spacy | PRON_spacy | DET_spacy | ADP_spacy | CONJ_spacy | total_spacy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The CDC currently reports 99031 deaths. In gen... | real | The CDC currently reports deaths In general th... | [cdc, currently, report, death, general, discr... | {'DET': 3, 'PROPN': 1, 'ADV': 2, 'VERB': 2, 'N... | 9 | 2 | 4 | 2 | 0 | 3 | 4 | 0 | 24 |
| 1 | 2 | States reported 1121 deaths a small rise from ... | real | States reported deaths a small rise from last ... | [state, report, death, small, rise, tuesday, s... | {'NOUN': 5, 'VERB': 2, 'DET': 2, 'ADJ': 3, 'AD... | 5 | 2 | 3 | 0 | 0 | 2 | 2 | 0 | 14 |
| 2 | 3 | Politically Correct Woman (Almost) Uses Pandem... | fake | Politically Correct Woman Almost Uses Pandemic... | [politically, correct, woman, use, pandemic, e... | {'ADV': 2, 'ADJ': 1, 'PROPN': 7, 'ADP': 1, 'PA... | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 0 | 6 |
| 3 | 4 | #IndiaFightsCorona: We have 1524 #COVID testin... | real | IndiaFightsCorona We have COVID testing labora... | [covid, testing, laboratory, india, august, test] | {'PROPN': 8, 'PRON': 1, 'AUX': 3, 'VERB': 2, '... | 3 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | 9 |
| 4 | 5 | Populous states can generate large case counts... | real | Populous states can generate large case counts... | [populous, state, generate, large, case, count... | {'ADJ': 5, 'NOUN': 7, 'AUX': 2, 'VERB': 3, 'CC... | 7 | 3 | 5 | 0 | 1 | 1 | 4 | 0 | 21 |


From the values in the POS columns we could infer the nature of a tweet (opinion or informative). For example, tweets with more NOUNs and VERBs are probably informative, while tweets with more ADJs and ADVs are probably opinions.
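
One way to make this comparison concrete is to turn the raw counts into shares of the tracked total. The sketch below is only an illustration: the helper name `pos_shares` is hypothetical, and the input numbers are the NOUN, VERB, ADJ, ADV and total values of the first row shown above. It does not claim that any particular share separates opinion from informative tweets.

```python
def pos_shares(noun: int, verb: int, adj: int, adv: int, total: int) -> tuple[float, float]:
    """Return (noun+verb share, adj+adv share) of the tracked POS total."""
    if total == 0:
        # Empty tweets have no POS tags; avoid dividing by zero
        return 0.0, 0.0
    return (noun + verb) / total, (adj + adv) / total

# Values from row 0 of the table: NOUN=9, VERB=2, ADJ=4, ADV=2, total_spacy=24
content_share, descriptive_share = pos_shares(9, 2, 4, 2, 24)
print(round(content_share, 3), round(descriptive_share, 3))  # 0.458 0.25
```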

## BERTopic Pipeline

Installation of the tools that we will use.

In [70]:
%pip install umap-learn
%pip install hdbscan
%pip install sentence-transformers
%pip install bertopic

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [71]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic import BERTopic

### Embeddings

In [72]:
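embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Note: these embeddings are not passed to fit_transform later on, so BERTopic
# encodes the documents again unless they are supplied via its `embeddings` argument.
embeddings = embedding_model.encode(train['cleaned_tweet'], show_progress_bar=True)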

Batches: 100%|██████████| 201/201 [00:02<00:00, 97.89it/s] 


### Dimensionality reduction

In [73]:
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', random_state=42, low_memory=False)

### Clustering

In [74]:
cluster_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

### Vectorization

In [75]:
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1,2))
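
To illustrate what `ngram_range=(1,2)` asks for, the sketch below lists the unigrams and bigrams of a short token sequence. It is a plain-Python approximation with a hypothetical helper name (`word_ngrams`), not `CountVectorizer`'s own implementation, and it skips tokenization and stop-word removal.

```python
def word_ngrams(tokens: list[str], min_n: int = 1, max_n: int = 2) -> list[str]:
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# Hypothetical, already-tokenized example
tokens = "covid vaccine trial".split()
print(word_ngrams(tokens))
# ['covid', 'vaccine', 'trial', 'covid vaccine', 'vaccine trial']
```

This is why two-word terms can appear alongside single words in the topic representations.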

### Topic Representation

In [76]:
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

### Fine-tuning

In [77]:
representation_model = KeyBERTInspired()

### Model setup and training

In [78]:
topic_model = BERTopic(
    embedding_model=embedding_model,            # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=cluster_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model,  # Step 6 - (Optional) Fine-tune topic representations
    calculate_probabilities=True,
    # min_topic_size=50,
    n_gram_range=(1, 2),  # not used when a custom vectorizer_model is passed; kept for reference
    verbose=True,
    language='english',
)
topics, probs = topic_model.fit_transform(train['cleaned_tweet'])

2025-12-06 14:11:09,365 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 201/201 [00:01<00:00, 104.05it/s]
2025-12-06 14:11:11,344 - BERTopic - Embedding - Completed ✓
2025-12-06 14:11:11,345 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-06 14:11:17,555 - BERTopic - Dimensionality - Completed ✓
2025-12-06 14:11:17,556 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-06 14:11:18,179 - BERTopic - Cluster - Completed ✓
2025-12-06 14:11:18,181 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-06 14:11:19,490 - BERTopic - Representation - Completed ✓


In [79]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 2601 | -1_covid cases_deaths_cdc_india covid | [covid cases, deaths, cdc, india covid, total ... | [CoronaVirusUpdates India s COVID recovery rat... |
| 1 | 0 | 318 | 0_coronavirus updates_coronavirus restrictions... | [coronavirus updates, coronavirus restrictions... | [The UK faces a tipping point where more restr... |
| 2 | 1 | 226 | 1_trump coronavirus_coronavirus trump_donaldtr... | [trump coronavirus, coronavirus trump, donaldt... | [We fact checked Kamala Harris Hillary Clinton... |
| 3 | 2 | 204 | 2_covid vaccine_vaccine covid_covid vaccines_v... | [covid vaccine, vaccine covid, covid vaccines,... | [Co led by WHO gavi amp CEPIvaccines the COVAX... |
| 4 | 3 | 175 | 3_covid reported_covid confirmed_test covid_co... | [covid reported, covid confirmed, test covid, ... | [Twenty new cases of COVID have been reported ... |
| 5 | 4 | 162 | 4_coronavirusupdate_coronavirusupdates_drharsh... | [coronavirusupdate, coronavirusupdates, drhars... | [CoronaVirusUpdates IndiaFightsCorona India s ... |
| 6 | 5 | 143 | 5_hospitalized covid_covid deaths_holiday week... | [hospitalized covid, covid deaths, holiday wee... | [Our daily update is published States reported... |
| 7 | 6 | 141 | 6_masks coronavirus_spread covid_covid particl... | [masks coronavirus, spread covid, covid partic... | [Mask wearers beware A caller to a radio talk ... |
| 8 | 7 | 141 | 7_spreading coronavirus_china covid_covid chin... | [spreading coronavirus, china covid, covid chi... | [BidenSoetoro funneled illegal millions throug... |
| 9 | 8 | 134 | 8_covid infecting_covid increased_covid activi... | [covid infecting, covid increased, covid activ... | [The latest CDC COVIDView report shows that af... |


In [80]:
topic_model.visualize_topics()

In [81]:
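from sklearn.model_selection import train_test_split

# Encode the labels as integers: fake -> 0, real -> 1.
# Re-running this cell would map the already-encoded values to NaN.
train['label'] = train['label'].map({'fake': 0, 'real': 1})

# Features are the filtered lemma lists; the target is the encoded label
X = train['tweet_tokens']
y = train['label']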

In [82]:
# Hold out 20% of the training data; stratify=y keeps the fake/real ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
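
The effect of `stratify=y` can be illustrated with a small standard-library sketch. This is not scikit-learn's implementation: the helper `stratified_split` and the toy labels are hypothetical, and the code only shows the idea of sampling the same fraction from each class.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes the same fraction to the test part."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # shuffle within the class
        n_test = round(len(indices) * test_size)    # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 40 fake (0) and 60 real (1)
labels = [0] * 40 + [1] * 60
train_idx, test_idx = stratified_split(labels)

real_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), real_in_test)  # 80 20 12
```

With 40% fake and 60% real labels overall, the 20 held-out items contain 8 fake and 12 real ones, so the class ratio is the same in both parts.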