# Machine Learning Task 2

In this task we classify texts (tweets) as fake or real. We first apply several cleaning techniques to strip irrelevant information from the texts. We then use BERTopic for topic modelling, which chains several algorithms into a pipeline.

## Load the data

Here we load the xlsx files that we will use and take a look at their contents.

In [36]:
# Load the data from the Excel files and print the first 5 rows of each
import pandas as pd
train = pd.read_excel('data/Constraint_English_Train.xlsx')
test = pd.read_excel('data/Constraint_English_Test.xlsx')
val = pd.read_excel('data/Constraint_English_Val.xlsx')
test_labeled = pd.read_excel('data/english_test_with_labels.xlsx')

print("Train Data:")
print(train.head())
print("\nValidation Data:")
print(val.head())
print("\nTest Data:")
print(test.head())
print("\nLabeled Test Data:")
print(test_labeled.head())

Train Data:
   id                                              tweet label
0   1  The CDC currently reports 99031 deaths. In gen...  real
1   2  States reported 1121 deaths a small rise from ...  real
2   3  Politically Correct Woman (Almost) Uses Pandem...  fake
3   4  #IndiaFightsCorona: We have 1524 #COVID testin...  real
4   5  Populous states can generate large case counts...  real

Validation Data:
   id                                              tweet label
0   1  Chinese converting to Islam after realising th...  fake
1   2  11 out of 13 people (from the Diamond Princess...  fake
2   3  COVID-19 Is Caused By A Bacterium, Not Virus A...  fake
3   4  Mike Pence in RNC speech praises Donald Trump’...  fake
4   5  6/10 Sky's @EdConwaySky explains the latest #C...  real

Test Data:
   id                                              tweet
0   1  Our daily update is published. States reported...
1   2             Alfalfa is the only cure for COVID-19.
2   3  President Trump Asked 

We import the modules used to filter and tokenize the data. We use spaCy because, thanks to its context-aware linguistic model, it filters words better than the WordNet lemmatizer from NLTK.

In [37]:
import numpy as np
import re

import spacy
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    import spacy.cli
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")


## Cleaning functions

We implement two functions. The first removes URLs, numbers, punctuation and extra spaces, and returns the original text reduced to plain words.

The second uses the `nlp` pipeline from spaCy to tokenize the text, remove stopwords, and return the lemma of each remaining token in lowercase.

In [38]:
def clean_text(text: str) -> str:
    """Basic text cleaning: remove URLs, non-letter characters and extra spaces (case is preserved)."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()        # normalise spaces
    return text

# Tokenisation and stopword removal with spaCy
def spacy_preprocess_with_lemmas(text):
    """Preprocessing with lemmatization"""
    if pd.isna(text) or text.strip() == "":
        return []
    
    doc = nlp(text.lower())
    # Keep non-stopword lemmas longer than 2 characters (drops short lemmas like "be", "in", ...)
    lemmas = [token.lemma_ for token in doc
              if not token.is_stop and len(token.lemma_) > 2]
    return lemmas
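
As a quick sanity check, the sketch below reapplies the same three regular expressions as `clean_text` to a made-up tweet (the sample string and its URL are assumptions for illustration, not rows from the dataset). It uses only the standard library, so it does not exercise the spaCy step.

```python
import re

def clean_text(text: str) -> str:
    """Same three substitutions as the notebook's clean_text."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace

# Hypothetical tweet, written for this example
sample_tweet = "#IndiaFightsCorona: We have 1524 #COVID labs https://example.com/x1"
cleaned = clean_text(sample_tweet)
print(cleaned)  # IndiaFightsCorona We have COVID labs
```

Note that the hashtag text survives as a single fused token and the original casing is kept; lowercasing only happens later, inside `spacy_preprocess_with_lemmas`.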


This third filter removes tokens that are not real words, such as IndiaFightsCorona (a hashtag fused into a single token).

In [39]:
%pip install wordfreq
from wordfreq import zipf_frequency

def filter_real_words(tokens):
    """Keep only tokens that wordfreq knows as English words (Zipf frequency > 0)."""
    return [t for t in tokens if zipf_frequency(t, "en") > 0]

Note: you may need to restart the kernel to use updated packages.


Next we use a function that counts the parts of speech (noun, verb, ...) that appear in each tweet. Later we split each count into its own column.

In [40]:
from collections import Counter

def get_spacy_pos_counts(text):
    doc = nlp(text)
    counts = Counter([token.pos_ for token in doc])
    return counts
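
To see how such a `Counter` can be spread into one column per tag, here is a minimal sketch that skips spaCy and starts from a hand-written tag list (the `tags` and `wanted` lists are assumptions for illustration). Tags that never occur fall back to 0 through `.get`.

```python
from collections import Counter

# Hand-written tags standing in for [token.pos_ for token in doc]
tags = ["DET", "NOUN", "VERB", "ADJ", "NOUN", "CCONJ", "NOUN"]
counts = Counter(tags)

# One value per tag of interest; tags that never occur fall back to 0
wanted = ["NOUN", "VERB", "ADJ", "ADV", "CONJ", "CCONJ"]
row = {tag + "_spacy": counts.get(tag, 0) for tag in wanted}
print(row)
```

A tag name that does not match exactly (for instance `CONJ` when the tagger reports `CCONJ`) silently yields 0 rather than raising an error.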


With all these filters and preprocessing steps in place, we store the results in separate DataFrame columns so they can be compared.

In [41]:
train['cleaned_tweet'] = train['tweet'].apply(clean_text)
train['tweet_tokens'] = train['cleaned_tweet'].apply(spacy_preprocess_with_lemmas)
train['tweet_tokens'] = train['tweet_tokens'].apply(filter_real_words)
train["spacy_pos_counts"] = train["cleaned_tweet"].apply(get_spacy_pos_counts)

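# Note: spaCy's English models appear to tag conjunctions as "CCONJ"/"SCONJ" rather than "CONJ",
# which would explain why CONJ_spacy is 0 in every row of the output
pos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ"]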
for tag in pos_tags:
    train[tag + "_spacy"] = train["spacy_pos_counts"].apply(lambda x: x.get(tag, 0))

train["total_spacy"] = train[[t + "_spacy" for t in pos_tags]].sum(axis=1)

In [42]:
train.head()

|   | id | tweet | label | cleaned_tweet | tweet_tokens | spacy_pos_counts | NOUN_spacy | VERB_spacy | ADJ_spacy | ADV_spacy | PRON_spacy | DET_spacy | ADP_spacy | CONJ_spacy | total_spacy |
|---|----|-------|-------|---------------|--------------|------------------|------------|------------|-----------|-----------|------------|-----------|-----------|------------|-------------|
| 0 | 1 | The CDC currently reports 99031 deaths. In gen... | real | The CDC currently reports deaths In general th... | [cdc, currently, report, death, general, discr... | {'DET': 3, 'PROPN': 1, 'ADV': 2, 'VERB': 2, 'N... | 9 | 2 | 4 | 2 | 0 | 3 | 4 | 0 | 24 |
| 1 | 2 | States reported 1121 deaths a small rise from ... | real | States reported deaths a small rise from last ... | [state, report, death, small, rise, tuesday, s... | {'NOUN': 5, 'VERB': 2, 'DET': 2, 'ADJ': 3, 'AD... | 5 | 2 | 3 | 0 | 0 | 2 | 2 | 0 | 14 |
| 2 | 3 | Politically Correct Woman (Almost) Uses Pandem... | fake | Politically Correct Woman Almost Uses Pandemic... | [politically, correct, woman, use, pandemic, e... | {'ADV': 2, 'ADJ': 1, 'PROPN': 7, 'ADP': 1, 'PA... | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 0 | 6 |
| 3 | 4 | #IndiaFightsCorona: We have 1524 #COVID testin... | real | IndiaFightsCorona We have COVID testing labora... | [covid, testing, laboratory, india, august, test] | {'PROPN': 8, 'PRON': 1, 'AUX': 3, 'VERB': 2, '... | 3 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | 9 |
| 4 | 5 | Populous states can generate large case counts... | real | Populous states can generate large case counts... | [populous, state, generate, large, case, count... | {'ADJ': 5, 'NOUN': 7, 'AUX': 2, 'VERB': 3, 'CC... | 7 | 3 | 5 | 0 | 1 | 1 | 4 | 0 | 21 |


From the values in the POS columns we could infer the nature of a tweet (opinion or informative). For example, tweets with more NOUNs and VERBs are probably informative, while tweets with more ADJs and ADVs are probably opinions.
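
One way to turn this idea into a number is the share of descriptive words (ADJ + ADV) among the four content-word classes. The sketch below is only a heuristic under that assumption; the function name `descriptive_ratio` is hypothetical and not part of the notebook. The counts are copied from rows 0 and 2 of the `train.head()` output above.

```python
def descriptive_ratio(noun: int, verb: int, adj: int, adv: int) -> float:
    """Share of ADJ + ADV among NOUN + VERB + ADJ + ADV (0.0 if all are zero)."""
    total = noun + verb + adj + adv
    return (adj + adv) / total if total else 0.0

# POS counts copied from train.head(): row 0 (label real) and row 2 (label fake)
row0 = descriptive_ratio(noun=9, verb=2, adj=4, adv=2)
row2 = descriptive_ratio(noun=1, verb=1, adj=1, adv=2)
print(round(row0, 2), round(row2, 2))  # 0.35 0.6
```

Two rows are far too few to draw a conclusion; the ratio would have to be checked over the whole training set before using it as a feature.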

## BERTopic Pipeline

### Embeddings

In [43]:
train['label'] = train['label'].map({'fake': 0, 'real': 1})

In [44]:
from sklearn.model_selection import train_test_split

X = train['tweet_tokens']
y = train['label']

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
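
The `stratify=y` argument keeps the fake/real proportion the same in both splits. The pure-Python sketch below illustrates that idea on made-up labels; it is not scikit-learn's implementation, and `stratified_indices` is a hypothetical helper written for this example.

```python
import random
from collections import defaultdict

def stratified_indices(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes about test_size of its items to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 10 fake (0) and 20 real (1)
labels = [0] * 10 + [1] * 20
train_idx, test_idx = stratified_indices(labels)
test_labels = [labels[i] for i in test_idx]
print(test_labels.count(0), test_labels.count(1))  # 2 4
```

Both splits keep the 1:2 ratio of the full label list, which is the property the notebook relies on when it passes `stratify=y`.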