# Importing the Necessary Libraries for File Creation
Pandas will be used to read the existing CSV files and reshape them into a more workable format, and `os` will be used to delete the intermediate files created along the way. Langdetect detects rows that are not in English so they can be removed, and `re` is needed to remove URLs from the text.

The one drawback of Langdetect is that it is non-deterministic: on short or ambiguous text, its output may vary from run to run. As a result, the number of news items in the output may differ between runs.

In [1]:
import pandas as pd
import os
from langdetect import detect
import re

# Importing the CSV Files

The files below are all fake and real news datasets acquired from different sources.

Kaggle - https://www.kaggle.com/c/fake-news/data?select=train.csv \
Reuters - https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/ \
McIntire - https://github.com/lutzhamel/fake-news/tree/master/data \
Buzzfeed and Politifact - https://www.kaggle.com/datasets/mdepak/fakenewsnet?select=BuzzFeed_real_news_content.csv \
WELFake - https://zenodo.org/records/4561253

In [2]:
kaggle = pd.read_csv('Kaggle/train.csv')
mcintire = pd.read_csv('McIntire/fake_or_real_news.csv')
reutersReal = pd.read_csv('Reuters/True.csv')
reutersFake = pd.read_csv('Reuters/Fake.csv')
buzzfeedReal = pd.read_csv('Buzzfeed_Politifact/BuzzFeed_real_news_content.csv')
buzzfeedFake = pd.read_csv('Buzzfeed_Politifact/BuzzFeed_fake_news_content.csv')
politiReal = pd.read_csv('Buzzfeed_Politifact/PolitiFact_real_news_content.csv')

# Processing the Kaggle Dataset

Since the files come from different sources, their formats have little in common, so the process cannot be automated entirely.

Two files are created, one for fake news and one for real news. Subsequent sections append the fake and real news from the other datasets to these files.

Kaggle has 5 columns in total, only 3 of which (`title`, `text` and `label`) are kept.

In [3]:
kaggle

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1


In [4]:
kaggle.drop(['author', 'id'], axis=1, inplace=True)

In [5]:
kaggle.reset_index(drop=True, inplace=True)

In [6]:
fake_news_kaggle = kaggle[kaggle['label'] == 1]
real_news_kaggle = kaggle[kaggle['label'] == 0]

fake_news_kaggle.to_csv('fake_news.csv', index=False)
real_news_kaggle.to_csv('real_news.csv', index=False)
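The Kaggle steps above (keep the shared columns, reset the index, split on the label) can be sketched as one helper. The name `split_by_label` and the toy frame are hypothetical and only illustrate the pattern. The notebook does each step in its own cell.

```python
import pandas as pd

def split_by_label(df, keep=("title", "text", "label")):
    """Keep only the shared columns and split into (fake, real) frames."""
    trimmed = df.loc[:, list(keep)].reset_index(drop=True)
    fake = trimmed[trimmed["label"] == 1]  # label 1 marks fake news
    real = trimmed[trimmed["label"] == 0]  # label 0 marks real news
    return fake, real

# Toy stand-in for the Kaggle frame (values are made up).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "title": ["a", "b", "c"],
    "author": ["x", "y", "z"],
    "text": ["t1", "t2", "t3"],
    "label": [1, 0, 1],
})

fake, real = split_by_label(toy)
print(len(fake), len(real))
```

Selecting the columns to keep, rather than naming the ones to drop, means the helper does not need to know which extra columns a given source has.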

# Processing the McIntire Dataset

Like Kaggle, the McIntire dataset keeps the fake and real news in the same file, so the processing steps are similar. McIntire stores its labels as text rather than integers, so they are replaced to match Kaggle's format (1 for fake, 0 for real).

In [7]:
mcintire

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [8]:
mcintire.drop(['id'], axis=1, inplace=True)

In [9]:
mcintire.reset_index(drop=True, inplace=True)

In [10]:
# Map 'FAKE' to 1 and 'REAL' to 0 in the 'label' column of the 'McIntire' DataFrame
# (any other value would become NaN, which makes unexpected labels easy to spot)
mcintire['label'] = mcintire['label'].map({'FAKE': 1, 'REAL': 0})

In [11]:
# Append 'McIntire' DataFrame to the existing CSV files without overwriting
fake_news_mcintire = mcintire[mcintire['label'] == 1]
real_news_mcintire = mcintire[mcintire['label'] == 0]

# mode='a' appends to the existing file; header=False avoids repeating the column names
fake_news_mcintire.to_csv('fake_news.csv', mode='a', header=False, index=False)
real_news_mcintire.to_csv('real_news.csv', mode='a', header=False, index=False)
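The append step can also be wrapped in a small helper that writes the header only when the file does not exist yet, so the first write and later appends share one code path. This is a sketch with a hypothetical helper name (`append_rows`) and made-up toy frames, run in a temporary directory.

```python
import os
import tempfile

import pandas as pd

def append_rows(df, path):
    """Append df to a CSV, writing the header only if the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False, encoding="utf-8")

# Toy frames standing in for two source datasets (values are made up).
first_batch = pd.DataFrame({"title": ["a"], "text": ["t1"], "label": [1]})
second_batch = pd.DataFrame({"title": ["b"], "text": ["t2"], "label": [1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake_news.csv")
    append_rows(first_batch, path)   # creates the file with a header
    append_rows(second_batch, path)  # appends rows only
    combined = pd.read_csv(path)

print(combined.shape)
```

Passing a path to `to_csv` also lets pandas handle the file encoding, rather than relying on the platform default used by a plain `open()`.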

# Processing Reuters

The Reuters dataset stores the fake and real news in two separate files, so neither file has a label column. A label column therefore has to be added to each DataFrame before its rows are appended to the combined files.

In [12]:
reutersReal.drop(['subject', 'date'], axis=1, inplace=True)
reutersFake.drop(['subject', 'date'], axis=1, inplace=True)

In [13]:
reutersReal.reset_index(drop=True, inplace=True)
reutersFake.reset_index(drop=True, inplace=True)

In [14]:
reutersReal['label'] = 0
reutersFake['label'] = 1

In [15]:
# Append 'ReutersFake' and 'ReutersReal' DataFrame to the existing CSV files without overwriting
fake_news_reuters = reutersFake[reutersFake['label'] == 1]
real_news_reuters = reutersReal[reutersReal['label'] == 0]

# mode='a' appends to the existing file; header=False avoids repeating the column names
fake_news_reuters.to_csv('fake_news.csv', mode='a', header=False, index=False)
real_news_reuters.to_csv('real_news.csv', mode='a', header=False, index=False)

# Processing Buzzfeed and PolitiFact

The Buzzfeed and PolitiFact datasets have many more columns than the previous datasets, so more columns need to be removed. They also lack labels, because Buzzfeed and PolitiFact are split into separate files for real and fake news. For this reason, a label column is added to each DataFrame. Only the real news from PolitiFact is used, because the PolitiFact fake news file from Kaggle is incorrect: it also contains the real news content.

In [16]:
# Columns that are not needed in the combined dataset
drop_cols = ['id', 'url', 'top_img', 'authors', 'source', 'publish_date', 'movies', 'images', 'canonical_link', 'meta_data']

buzzfeedReal.drop(drop_cols, axis=1, inplace=True)
buzzfeedFake.drop(drop_cols, axis=1, inplace=True)
politiReal.drop(drop_cols, axis=1, inplace=True)

In [17]:
buzzfeedReal.reset_index(drop=True, inplace=True)
buzzfeedFake.reset_index(drop=True, inplace=True)
politiReal.reset_index(drop=True, inplace=True)

In [18]:
buzzfeedFake['label'] = 1
buzzfeedReal['label'] = 0
politiReal['label'] = 0

In [19]:
# Append 'BuzzfeedReal', 'BuzzfeedFake' and 'PolitiReal' DataFrame to the existing CSV files without overwriting
fake_news_buzzfeed = buzzfeedFake[buzzfeedFake['label'] == 1]
real_news_buzzfeed = buzzfeedReal[buzzfeedReal['label'] == 0]
real_news_politifact = politiReal[politiReal['label'] == 0]

# mode='a' appends to the existing file; header=False avoids repeating the column names
fake_news_buzzfeed.to_csv('fake_news.csv', mode='a', header=False, index=False)
real_news_buzzfeed.to_csv('real_news.csv', mode='a', header=False, index=False)
real_news_politifact.to_csv('real_news.csv', mode='a', header=False, index=False)

# Final Processing

In this step, we remove all rows where either the title or the text is null, keep only the rows written in English, and clean the text by removing URLs, lowercasing, and collapsing extra spaces.

We also remove duplicate news items. `drop_duplicates` only removes exact matches, so it misses rows that differ very slightly, for example in the number of spaces or in punctuation. We try to remove as many duplicates as possible, but some may remain because of hard-to-find differences between the texts.

Once the modified file has been created, the intermediate files will be deleted so there isn't any repetition or confusion in the future.
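One possible way to catch near-duplicates that differ only in spacing or punctuation is to compare rows on a normalised key instead of on the raw text. The sketch below is not part of the notebook's pipeline. The helper name `dedup_key`, the set of stripped characters, and the toy rows are all assumptions made for illustration.

```python
import re
import string

import pandas as pd

# Translation table that deletes ASCII punctuation plus curly quotes.
_PUNCT = str.maketrans("", "", string.punctuation + "’‘”“")

def dedup_key(text):
    """Collapse case, punctuation and whitespace so near-identical rows compare equal."""
    stripped = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", stripped).strip()

# Toy rows (made up): the first two differ only in case, spacing and punctuation.
toy = pd.DataFrame({
    "title": ["Breaking: markets fall", "breaking  markets fall!", "Other story"],
    "text": ["Stocks dropped today.", "Stocks dropped  today", "Something else."],
})

# Build the key from both columns, then keep the first row for each key.
keys = toy["title"].map(dedup_key) + "\n" + toy["text"].map(dedup_key)
deduped = toy[~keys.duplicated()]
print(len(deduped))
```

The key is used only for comparison, so the stored title and text keep their punctuation.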

In [20]:
def preprocess_text(text):
    """Remove URLs, lowercase the text and collapse repeated whitespace."""

    # Remove URLs using regex ('http\S+' also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)

    # Convert text to lowercase
    text = text.lower()

    # Punctuation removal is left disabled (re-enabling it requires `import string`)
    #chars_to_remove = string.punctuation + '’‘”“—–«»'  # Combine punctuation and additional characters
    #text = ''.join(ch for ch in text if ch not in chars_to_remove)

    # Remove extra spaces
    text = ' '.join(text.split())

    return text
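To see what the function does to a single string, here is a condensed restatement applied to a made-up input. The `https\S+` alternative is omitted here because `http\S+` already matches those links.

```python
import re

def preprocess_text(text):
    text = re.sub(r'http\S+|www\S+', '', text)  # drop URLs
    text = text.lower()                          # lowercase
    return ' '.join(text.split())                # collapse whitespace and newlines

# Made-up input with mixed case, extra spaces, a newline and two URLs.
raw = "Read   MORE at https://example.com/story \n or www.example.org today"
cleaned = preprocess_text(raw)
print(cleaned)  # read more at or today
```

Note that the words around a removed URL are kept as they are, so a sentence can read slightly oddly after cleaning.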

In [21]:
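# Function to filter non-English text in both title and text columns
def filter_english_row(row):
    try:
        title_lang = detect(row['title'])
        text_lang = detect(row['text'])
        return title_lang == 'en' and text_lang == 'en'
    except Exception:
        # detect() raises when it finds no usable features (e.g. empty or
        # symbol-only strings); such rows are treated as non-English
        return False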

In [22]:
fake_news = pd.read_csv('fake_news.csv')
real_news = pd.read_csv('real_news.csv')

In [23]:
fake_news.dropna(subset=['title', 'text'], how='any', inplace=True)
real_news.dropna(subset=['title', 'text'], how='any', inplace=True)

In [24]:
# Apply filtering to keep rows where both title and text are in English
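# .copy() makes the filtered frames independent, so later column assignments are unambiguous
fake_news = fake_news[fake_news.apply(filter_english_row, axis=1)].copy()
real_news = real_news[real_news.apply(filter_english_row, axis=1)].copy()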

In [25]:
# Preprocess 'text' and 'title' columns
fake_news['text'] = fake_news['text'].apply(preprocess_text)
fake_news['title'] = fake_news['title'].apply(preprocess_text)

real_news['text'] = real_news['text'].apply(preprocess_text)
real_news['title'] = real_news['title'].apply(preprocess_text)

In [26]:
real_news

Unnamed: 0,title,text,label
0,"flynn: hillary clinton, big woman on campus - ...",ever get the feeling your life circles the rou...,0
1,jackie mason: hollywood would love trump if he...,"in these trying times, jackie mason is the voi...",0
2,benoît hamon wins french socialist party’s pre...,"paris — france chose an idealistic, traditiona...",0
3,excerpts from a draft script for donald trump’...,donald j. trump is scheduled to make a highly ...,0
4,"a back-channel plan for ukraine and russia, co...",a week before michael t. flynn resigned as nat...,0
...,...,...,...
35180,hillary clinton prepares for unpredictable tru...,new york (cnn) hillary clinton is visiting onl...,0
35181,"donald trump, germany’s disfavored son – politico","kallstadt, germany — few places in germany are...",0
35182,breaking: hollywood legend just died of terrib...,hollywood loses yet another one of their deare...,0
35184,don king drops n-word while introducing donald...,story highlights trump was sitting in a chair ...,0


In [27]:
fake_news

Unnamed: 0,title,text,label
0,house dem aide: we didn’t even see comey’s let...,house dem aide: we didn’t even see comey’s let...,1
1,why the truth might get you fired,"why the truth might get you fired october 29, ...",1
2,15 civilians killed in single us airstrike hav...,videos 15 civilians killed in single us airstr...,1
3,iranian woman jailed for fictional unpublished...,print an iranian woman has been sentenced to s...,1
4,life: life of luxury: elton john’s 6 favorite ...,ever wonder how britain’s most iconic pop pian...,1
...,...,...,...
37144,hillary’s top donor country just auctioned off...,hillary’s top donor country just auctioned off...,1
37145,cavuto just exposed lester holt's lies during ...,advertisement - story continues below the firs...,1
37146,"the ap, in 2004, said your boy obama was born ...",well that’s weird. if the birther movement is ...,1
37147,people noticed something odd about hillary's o...,there’s a lot to be discussed about last night...,1


In [28]:
fake_news.drop_duplicates(subset=['title', 'text'], inplace=True)
real_news.drop_duplicates(subset=['title', 'text'], inplace=True)

In [29]:
real_news

Unnamed: 0,title,text,label
0,"flynn: hillary clinton, big woman on campus - ...",ever get the feeling your life circles the rou...,0
1,jackie mason: hollywood would love trump if he...,"in these trying times, jackie mason is the voi...",0
2,benoît hamon wins french socialist party’s pre...,"paris — france chose an idealistic, traditiona...",0
3,excerpts from a draft script for donald trump’...,donald j. trump is scheduled to make a highly ...,0
4,"a back-channel plan for ukraine and russia, co...",a week before michael t. flynn resigned as nat...,0
...,...,...,...
35179,donald trump is right on profiling [video] – e...,hbo’s real time host bill maher – someone not ...,0
35180,hillary clinton prepares for unpredictable tru...,new york (cnn) hillary clinton is visiting onl...,0
35182,breaking: hollywood legend just died of terrib...,hollywood loses yet another one of their deare...,0
35184,don king drops n-word while introducing donald...,story highlights trump was sitting in a chair ...,0


In [30]:
fake_news

Unnamed: 0,title,text,label
0,house dem aide: we didn’t even see comey’s let...,house dem aide: we didn’t even see comey’s let...,1
1,why the truth might get you fired,"why the truth might get you fired october 29, ...",1
2,15 civilians killed in single us airstrike hav...,videos 15 civilians killed in single us airstr...,1
3,iranian woman jailed for fictional unpublished...,print an iranian woman has been sentenced to s...,1
4,life: life of luxury: elton john’s 6 favorite ...,ever wonder how britain’s most iconic pop pian...,1
...,...,...,...
37144,hillary’s top donor country just auctioned off...,hillary’s top donor country just auctioned off...,1
37145,cavuto just exposed lester holt's lies during ...,advertisement - story continues below the firs...,1
37146,"the ap, in 2004, said your boy obama was born ...",well that’s weird. if the birther movement is ...,1
37147,people noticed something odd about hillary's o...,there’s a lot to be discussed about last night...,1


In [31]:
fake_news.reset_index(drop=True, inplace=True)
real_news.reset_index(drop=True, inplace=True)

In [32]:
fake_news.to_csv('fake_news_final.csv', index=False)
real_news.to_csv('real_news_final.csv', index=False)

In [33]:
fake_news_final = pd.read_csv('fake_news_final.csv')
real_news_final = pd.read_csv('real_news_final.csv')

In [34]:
fake_news_final

Unnamed: 0,title,text,label
0,house dem aide: we didn’t even see comey’s let...,house dem aide: we didn’t even see comey’s let...,1
1,why the truth might get you fired,"why the truth might get you fired october 29, ...",1
2,15 civilians killed in single us airstrike hav...,videos 15 civilians killed in single us airstr...,1
3,iranian woman jailed for fictional unpublished...,print an iranian woman has been sentenced to s...,1
4,life: life of luxury: elton john’s 6 favorite ...,ever wonder how britain’s most iconic pop pian...,1
...,...,...,...
26460,hillary’s top donor country just auctioned off...,hillary’s top donor country just auctioned off...,1
26461,cavuto just exposed lester holt's lies during ...,advertisement - story continues below the firs...,1
26462,"the ap, in 2004, said your boy obama was born ...",well that’s weird. if the birther movement is ...,1
26463,people noticed something odd about hillary's o...,there’s a lot to be discussed about last night...,1


In [35]:
real_news_final

Unnamed: 0,title,text,label
0,"flynn: hillary clinton, big woman on campus - ...",ever get the feeling your life circles the rou...,0
1,jackie mason: hollywood would love trump if he...,"in these trying times, jackie mason is the voi...",0
2,benoît hamon wins french socialist party’s pre...,"paris — france chose an idealistic, traditiona...",0
3,excerpts from a draft script for donald trump’...,donald j. trump is scheduled to make a highly ...,0
4,"a back-channel plan for ukraine and russia, co...",a week before michael t. flynn resigned as nat...,0
...,...,...,...
33952,donald trump is right on profiling [video] – e...,hbo’s real time host bill maher – someone not ...,0
33953,hillary clinton prepares for unpredictable tru...,new york (cnn) hillary clinton is visiting onl...,0
33954,breaking: hollywood legend just died of terrib...,hollywood loses yet another one of their deare...,0
33955,don king drops n-word while introducing donald...,story highlights trump was sitting in a chair ...,0


# The WELFake Dataset

The WELFake Dataset comes from Zenodo and is the basis of my project. Like my dataset, it already combines the Kaggle, McIntire, Reuters and BuzzFeed Political datasets. This section processes the WELFake Dataset in the same way as the other datasets, combines it with the fake and real news files, and removes any duplicates, in order to maximise the amount of training data available.

In [36]:
welfake = pd.read_csv('WELFake/WELFake_Dataset.csv')

In [37]:
welfake.drop(['Unnamed: 0'], axis=1, inplace=True)

In [38]:
welfake.dropna(subset=['title', 'text'], how='any', inplace=True)

In [39]:
welfake = welfake[welfake.apply(filter_english_row, axis=1)].copy()

In [40]:
welfake['text'] = welfake['text'].apply(preprocess_text)
welfake['title'] = welfake['title'].apply(preprocess_text)

In [41]:
welfake

Unnamed: 0,title,text,label
0,law enforcement on high alert following threat...,no comment is expected from barack obama membe...,1
2,unbelievable! obama’s attorney general says mo...,"now, most of the demonstrators gathered last n...",1
3,"bobby jindal, raised hindu, uses story of chri...",a dozen politically active pastors came here f...,0
4,satan 2: russia unvelis an image of its terrif...,"the rs-28 sarmat missile, dubbed satan 2, will...",1
5,about time! christian group sues amazon and sp...,all we can say on this one is it s about time ...,1
...,...,...,...
72129,russians steal research on trump in hack of u....,washington (reuters) - hackers believed to be ...,0
72130,watch: giuliani demands that democrats apologi...,"you know, because in fantasyland republicans n...",1
72131,migrants refuse to leave train at refugee camp...,migrants refuse to leave train at refugee camp...,0
72132,trump tussle gives unpopular mexican leader mu...,mexico city (reuters) - donald trump’s combati...,0


In [42]:
welfake.drop_duplicates(subset=['title', 'text'], inplace=True)

In [43]:
welfake

Unnamed: 0,title,text,label
0,law enforcement on high alert following threat...,no comment is expected from barack obama membe...,1
2,unbelievable! obama’s attorney general says mo...,"now, most of the demonstrators gathered last n...",1
3,"bobby jindal, raised hindu, uses story of chri...",a dozen politically active pastors came here f...,0
4,satan 2: russia unvelis an image of its terrif...,"the rs-28 sarmat missile, dubbed satan 2, will...",1
5,about time! christian group sues amazon and sp...,all we can say on this one is it s about time ...,1
...,...,...,...
72127,wikileaks email shows clinton foundation funds...,an email released by wikileaks on sunday appea...,1
72129,russians steal research on trump in hack of u....,washington (reuters) - hackers believed to be ...,0
72130,watch: giuliani demands that democrats apologi...,"you know, because in fantasyland republicans n...",1
72131,migrants refuse to leave train at refugee camp...,migrants refuse to leave train at refugee camp...,0


In [44]:
welfake.reset_index(drop=True, inplace=True)

In [45]:
welfake.to_csv('welfake_final.csv', index=False)

# Combining the Datasets

Now that the WELFake Dataset has been processed, we can combine it with my fake and real news datasets to maximise the amount of training data. We combine them first and then remove any rows that appear in both.

In [46]:
# Load the datasets
welfake_final = pd.read_csv('welfake_final.csv')

In [47]:
# Filter welfake dataset based on label
fake_news_welfake = welfake_final[welfake_final['label'] == 1]  # label 1 represents fake news
real_news_welfake = welfake_final[welfake_final['label'] == 0]  # label 0 represents real news

In [48]:
# Append welfake rows to fake_news or real_news
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
fake_news_combined = pd.concat([fake_news_final, fake_news_welfake], ignore_index=True)
real_news_combined = pd.concat([real_news_final, real_news_welfake], ignore_index=True)

In [49]:
fake_news_combined.drop_duplicates(subset=['title', 'text'], inplace=True)
real_news_combined.drop_duplicates(subset=['title', 'text'], inplace=True)
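The combine-then-deduplicate pattern can be checked on a small example. The sketch below uses made-up toy frames in place of the real ones and `pd.concat`, which is the supported way to stack frames since `DataFrame.append` was removed in pandas 2.0. The final `reset_index` is optional and only makes the index contiguous again.

```python
import pandas as pd

# Toy stand-ins for fake_news_final and fake_news_welfake (values are made up).
mine = pd.DataFrame({"title": ["a", "b"], "text": ["t1", "t2"], "label": [1, 1]})
welfake_part = pd.DataFrame({"title": ["b", "c"], "text": ["t2", "t3"], "label": [1, 1]})

# Stack the frames, then drop rows whose title and text both repeat.
combined = pd.concat([mine, welfake_part], ignore_index=True)
combined = combined.drop_duplicates(subset=["title", "text"]).reset_index(drop=True)
print(combined["title"].tolist())  # ['a', 'b', 'c']
```

Because `drop_duplicates` keeps the first occurrence by default, rows from the first frame win over matching rows from the second.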

In [50]:
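# exist_ok=True lets the cell be re-run without failing if the folder already exists
os.makedirs("BaseDataset", exist_ok=True)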
# Save the updated fake_news and real_news datasets
fake_news_combined.to_csv('BaseDataset/fake_news.csv', index=False)
real_news_combined.to_csv('BaseDataset/real_news.csv', index=False)

In [51]:
os.remove("fake_news.csv")
os.remove("real_news.csv")
os.remove("fake_news_final.csv")
os.remove("real_news_final.csv")
os.remove("welfake_final.csv")
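If the notebook is re-run after a partial failure, some of these intermediate files may already be gone, and `os.remove` raises `FileNotFoundError` for a missing path. A tolerant variant can be sketched as follows. The helper name `remove_if_exists` is hypothetical, and the example runs in a temporary directory so it does not touch real files.

```python
import os
import tempfile

def remove_if_exists(paths):
    """Delete each path that exists and return the ones actually removed."""
    removed = []
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "fake_news.csv")
    missing = os.path.join(tmp, "real_news.csv")  # never created
    with open(present, "w", encoding="utf-8") as handle:
        handle.write("title,text,label\n")
    removed = remove_if_exists([present, missing])
    still_there = os.path.exists(present)

print(len(removed), still_there)  # 1 False
```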

In [52]:
fake_news_combined

Unnamed: 0,title,text,label
0,house dem aide: we didn’t even see comey’s let...,house dem aide: we didn’t even see comey’s let...,1
1,why the truth might get you fired,"why the truth might get you fired october 29, ...",1
2,15 civilians killed in single us airstrike hav...,videos 15 civilians killed in single us airstr...,1
3,iranian woman jailed for fictional unpublished...,print an iranian woman has been sentenced to s...,1
4,life: life of luxury: elton john’s 6 favorite ...,ever wonder how britain’s most iconic pop pian...,1
...,...,...,...
51399,feel the bern: supporters line up at 4:30 a.m....,it would appear that socialism is not the only...,1
51728,"""hillary clinton in 2013 """"i would like to see...",before running against billionaire real estate...,1
52063,keith olbermann begs foreign intel agencies to...,keith olbermann has been speaking out against ...,1
52317,watch: wolf blitzer nails giuliani to the wall...,sometimes it takes a wolf to catch a snake.rud...,1


In [53]:
real_news_combined

Unnamed: 0,title,text,label
0,"flynn: hillary clinton, big woman on campus - ...",ever get the feeling your life circles the rou...,0
1,jackie mason: hollywood would love trump if he...,"in these trying times, jackie mason is the voi...",0
2,benoît hamon wins french socialist party’s pre...,"paris — france chose an idealistic, traditiona...",0
3,excerpts from a draft script for donald trump’...,donald j. trump is scheduled to make a highly ...,0
4,"a back-channel plan for ukraine and russia, co...",a week before michael t. flynn resigned as nat...,0
...,...,...,...
66776,zimbabwe's mugabe digs in heels as ruling part...,harare (reuters) - zimbabwe s ruling zanu-pf p...,0
67179,argentina's macri deploys popular governor aga...,buenos aires (reuters) - argentine president m...,0
67596,lebanon's president rejects terrorism suggestion,beirut (reuters) - the lebanese president appe...,0
67683,melania trump's girl-on-girl photos from racy ...,heres the nations would-be first lady and righ...,0


# Final Thoughts

From the initial 72,000 or so text samples, we are down to approximately 60,000 after removing the large number of duplicates. We also removed all rows where either the title or the text is not in English. I realised that the quality of training data matters more than its quantity: keeping approximately 12,000 duplicate instances would only introduce bias into my dataset, which would then affect how the model performs.
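The number of duplicates in a frame can be measured directly before dropping them, which is one way to check figures like the ones above. This is a sketch on a made-up toy frame, not on the real dataset.

```python
import pandas as pd

# Toy frame standing in for a combined dataset (values are made up).
toy = pd.DataFrame({
    "title": ["a", "a", "b", "c", "c"],
    "text": ["t1", "t1", "t2", "t3", "t3"],
})

# duplicated() marks every repeat after the first occurrence of a (title, text) pair.
dup_mask = toy.duplicated(subset=["title", "text"])
n_duplicates = int(dup_mask.sum())
share = n_duplicates / len(toy)
print(n_duplicates, round(share, 2))  # 2 0.4
```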