# Machine Learning Task 2

In this task we need to classify texts (tweets) as fake or real. We will use several techniques to clean irrelevant information out of the texts. Then we will use BERTopic for topic modelling, which works as a pipeline of different algorithms.

## Load the data

Here we load the xlsx files that we will use and take a look at their contents.

In [34]:
# Load the data from the Excel files and print the first 5 rows of each
import pandas as pd
train = pd.read_excel('data/Constraint_English_Train.xlsx')
test = pd.read_excel('data/Constraint_English_Test.xlsx')
val = pd.read_excel('data/Constraint_English_Val.xlsx')
test_labeled = pd.read_excel('data/english_test_with_labels.xlsx')

print("Train Data:")
print(train.head())
print("\nValidation Data:")
print(val.head())
print("\nTest Data:")
print(test.head())
print("\nLabeled Test Data:")
print(test_labeled.head())

Train Data:
   id                                              tweet label
0   1  The CDC currently reports 99031 deaths. In gen...  real
1   2  States reported 1121 deaths a small rise from ...  real
2   3  Politically Correct Woman (Almost) Uses Pandem...  fake
3   4  #IndiaFightsCorona: We have 1524 #COVID testin...  real
4   5  Populous states can generate large case counts...  real

Validation Data:
   id                                              tweet label
0   1  Chinese converting to Islam after realising th...  fake
1   2  11 out of 13 people (from the Diamond Princess...  fake
2   3  COVID-19 Is Caused By A Bacterium, Not Virus A...  fake
3   4  Mike Pence in RNC speech praises Donald Trump’...  fake
4   5  6/10 Sky's @EdConwaySky explains the latest #C...  real

Test Data:
   id                                              tweet
0   1  Our daily update is published. States reported...
1   2             Alfalfa is the only cure for COVID-19.
2   3  President Trump Asked Wh

We import the modules that we will use to filter and tokenize our data. We will use spaCy because it lemmatizes and filters words better than the WordNet lemmatizer from NLTK, thanks to its context-aware linguistic model.

In [35]:
import numpy as np
import re

import spacy
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    import spacy.cli
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")


## Cleaning functions

We implement two functions. The first one removes URLs, numbers, punctuation (any non-letter character) and extra spaces, and returns the original text reduced to plain words.

The second function uses the nlp pipeline from spaCy to tokenize the text, remove stopwords and return the lemma of each remaining token in lowercase.

In [36]:
def clean_text(text: str) -> str:
    """Basic text cleaning: remove URLs, non-letter characters and extra spaces (case is preserved)."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)         # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                # keep only letters and spaces
    text = re.sub(r"\s+", " ", text).strip()            # normalise spaces
    return text

# Tokenisation and stopword removal with spaCy
def spacy_preprocess_with_lemmas(text):
    """Lowercase the text and return its non-stopword lemmas longer than 2 characters."""
    if pd.isna(text) or text.strip() == "":
        return []

    doc = nlp(text.lower())
    # Keep lemmas of non-stopword tokens; the length check drops very short lemmas (e.g. be, in)
    lemmas = [token.lemma_ for token in doc
              if not token.is_stop and len(token.lemma_) > 2]
    return lemmas
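
To make the effect of the first function concrete, the sketch below reruns the same three substitutions on a made-up tweet (the sample string and its URL are illustrative assumptions, not taken from the dataset). Note that case is preserved at this stage; lowercasing only happens later, inside spacy_preprocess_with_lemmas.

```python
import re

def clean_text(text: str) -> str:
    """Same three substitutions as the notebook's clean_text."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()        # collapse runs of whitespace
    return text

# Made-up sample tweet (assumption for illustration only)
sample = "#IndiaFightsCorona: We have 1524 #COVID testing labs. https://t.co/abc123"
cleaned = clean_text(sample)
print(cleaned)  # IndiaFightsCorona We have COVID testing labs
```

The hashtag survives as the single word IndiaFightsCorona because only the # character is stripped, which is the kind of token the later real-word filter is meant to catch.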


This third filter removes tokens that are not real words, such as IndiaFightsCorona (a hashtag that reaches this step as a single word).

In [42]:
%pip install wordfreq
from wordfreq import zipf_frequency

def filter_real_words(tokens):
    """Keep only tokens that wordfreq knows in English (Zipf frequency above 0)."""
    return [t for t in tokens if zipf_frequency(t, "en") > 0]

Collecting wordfreq
  Downloading wordfreq-3.1.1-py3-none-any.whl.metadata (27 kB)
Collecting ftfy>=6.1 (from wordfreq)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting locate<2.0.0,>=1.1.1 (from wordfreq)
  Downloading locate-1.1.1-py3-none-any.whl.metadata (3.9 kB)
Downloading wordfreq-3.1.1-py3-none-any.whl (56.8 MB)
Downloading locate-1.1.1-py3-none-any.whl (5.4 kB)
Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
Installing collected packages: locate, ftfy, wordfreq
Successfully installed ftfy-6.3.1 locate-1.1.1 wordfreq-3.1.1
Note: you may need to restart the kernel to use updated packages.
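
The filter works because zipf_frequency returns 0 (its default minimum) for words that are missing from the wordfreq word list, so the `> 0` test acts as a vocabulary check. The sketch below illustrates that logic without the library: TOY_ZIPF, toy_zipf, filter_by_frequency and min_zipf are hypothetical names, and the frequency values are made up for illustration, not real wordfreq output.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for the wordfreq lookup; the values are invented.
TOY_ZIPF = {"covid": 4.5, "testing": 4.9, "laboratory": 4.2, "india": 5.1}

def toy_zipf(word: str, lang: str = "en") -> float:
    """Mimic the 'unknown word -> 0.0' behaviour the filter relies on."""
    return TOY_ZIPF.get(word, 0.0)

def filter_by_frequency(tokens: Iterable[str],
                        freq: Callable[[str, str], float] = toy_zipf,
                        min_zipf: float = 0.0) -> List[str]:
    """Keep tokens whose frequency score is strictly above min_zipf."""
    return [t for t in tokens if freq(t, "en") > min_zipf]

tokens = ["indiafightscorona", "covid", "testing", "laboratory", "india"]
kept = filter_by_frequency(tokens)
print(kept)  # ['covid', 'testing', 'laboratory', 'india']
```

With a threshold of 0 only unknown tokens are dropped. A higher min_zipf would also discard rare but valid words, so raising it is a trade-off rather than a free improvement.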


With all these filters and preprocessing steps in place, we save the results in separate columns so that we can compare them in a DataFrame.

In [43]:
train['cleaned_tweet'] = train['tweet'].apply(clean_text)
train['tweet_tokens'] = train['cleaned_tweet'].apply(spacy_preprocess_with_lemmas)
train['tweet_tokens'] = train['tweet_tokens'].apply(filter_real_words)
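
One caveat about the order of these steps: spacy_preprocess_with_lemmas checks for missing values, but clean_text runs first and passes its input straight to re.sub, which raises a TypeError for non-string values such as the NaN pandas produces for an empty Excel cell. Whether the training file contains such cells is not shown here, so the guarded variant below is only a sketch; clean_text_safe is a hypothetical name, and mapping missing values to an empty string is an assumption.

```python
import re

def clean_text_safe(text) -> str:
    """Guarded variant of clean_text (hypothetical helper): non-strings become ""."""
    if not isinstance(text, str):   # covers None and float NaN from empty cells
        return ""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()        # collapse runs of whitespace

# Made-up rows, including the two usual representations of a missing value
rows = ["Masks work! 100%", None, float("nan")]
cleaned_rows = [clean_text_safe(r) for r in rows]
print(cleaned_rows)  # ['Masks work', '', '']
```

An empty string flows cleanly through the rest of the pipeline, since spacy_preprocess_with_lemmas already returns an empty list for blank text.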

In [44]:
comparative_df = pd.DataFrame({
    'Original_Tweet': train['tweet'],
    'Cleaned_Tweet': train['cleaned_tweet'],
    'Tokens lemmas': train['tweet_tokens']
})
comparative_df.head()

Unnamed: 0,Original_Tweet,Cleaned_Tweet,Tokens lemmas
0,The CDC currently reports 99031 deaths. In gen...,The CDC currently reports deaths In general th...,"[cdc, currently, report, death, general, discr..."
1,States reported 1121 deaths a small rise from ...,States reported deaths a small rise from last ...,"[state, report, death, small, rise, tuesday, s..."
2,Politically Correct Woman (Almost) Uses Pandem...,Politically Correct Woman Almost Uses Pandemic...,"[politically, correct, woman, use, pandemic, e..."
3,#IndiaFightsCorona: We have 1524 #COVID testin...,IndiaFightsCorona We have COVID testing labora...,"[covid, testing, laboratory, india, august, test]"
4,Populous states can generate large case counts...,Populous states can generate large case counts...,"[populous, state, generate, large, case, count..."
