# Misinformation Detection
### Dataset Loading and Combination Notebook

*Session 20 Group 4*
*Erica, Sahan, Dinuka*


Imports and setup

In [27]:
import os
import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import unidecode
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
from IPython.display import display

# BeautifulSoup warns when its input looks like a URL or filename rather than markup.
# Many short texts here trigger that warning, so silence it.
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

### Load datasets

In [28]:
provided_dataset_paths = [
    "data/provided/Constraint_English_Test.xlsx",
    "data/provided/Constraint_English_Train.xlsx",
    "data/provided/Constraint_English_Val.xlsx",
]
dataset_1_path = "data/fake-news-detection_10k/fake_and_real_news.csv"
dataset_2_path = "data/fake-real-news_10k/dataset.csv"
dataset_3_path_real = "data/misinformation-fake-news-text-dataset_79k/DataSet_Misinfo_TRUE.csv"
dataset_3_path_fake = "data/misinformation-fake-news-text-dataset_79k/DataSet_Misinfo_FAKE.csv"

In [37]:
dataset_0 = pd.concat((pd.read_excel(f, index_col=0) for f in provided_dataset_paths), ignore_index=True)

print("Provided dataset (0):\n"+str(dataset_0.count()))
display(dataset_0.head())

dataset_1 = pd.read_csv(dataset_1_path)

print("Dataset 1:\n"+str(dataset_1.count()))
display(dataset_1.head())

dataset_2 = pd.read_csv(dataset_2_path, encoding='latin-1')

print("Dataset 2:\n"+str(dataset_2.count()))
display(dataset_2.head())

dataset_3_real = pd.read_csv(dataset_3_path_real, index_col=0)

print("Dataset 3 (Real):\n"+str(dataset_3_real.count()))
display(dataset_3_real.head())

dataset_3_fake = pd.read_csv(dataset_3_path_fake, index_col=0)

print("Dataset 3 (Fake):\n"+str(dataset_3_fake.count()))
display(dataset_3_fake.head())

Provided dataset (0):
tweet    10700
label    10700
dtype: int64


Unnamed: 0,tweet,label
0,Our daily update is published. States reported...,real
1,Alfalfa is the only cure for COVID-19.,fake
2,President Trump Asked What He Would Do If He W...,fake
3,States reported 630 deaths. We are still seein...,real
4,This is the sixth time a global health emergen...,real


Dataset 1:
Text     9900
label    9900
dtype: int64


Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


Dataset 2:
News_Headline    9960
Link_Of_News     9960
Source           9960
Stated_On        9960
Date             9960
Label            9960
dtype: int64


Unnamed: 0,News_Headline,Link_Of_News,Source,Stated_On,Date,Label
0,Says Osama bin Laden endorsed Joe Biden,https://www.politifact.com/factchecks/2020/jun...,Donald Trump Jr.,"June 18, 2020","June 19, 2020",FALSE
1,CNN aired a video of a toddler running away fr...,https://www.politifact.com/factchecks/2020/jun...,Donald Trump,"June 18, 2020","June 19, 2020",pants-fire
2,Says Tim Tebow kneeled in protest of abortion...,https://www.politifact.com/factchecks/2020/jun...,Facebook posts,"June 12, 2020","June 19, 2020",FALSE
3,Even so-called moderate Democrats like Joe Bi...,https://www.politifact.com/factchecks/2020/jun...,Paul Junge,"June 10, 2020","June 19, 2020",barely-true
4,"""Our health department, our city and our count...",https://www.politifact.com/factchecks/2020/jun...,Jeanette Kowalik,"June 14, 2020","June 18, 2020",TRUE


Dataset 3 (Real):
text    34946
dtype: int64


Unnamed: 0,text
0,The head of a conservative Republican faction ...
1,Transgender people will be allowed for the fir...
2,The special counsel investigation of links bet...
3,Trump campaign adviser George Papadopoulos tol...
4,President Donald Trump called on the U.S. Post...


Dataset 3 (Fake):
text    43642
dtype: int64


Unnamed: 0,text
0,Donald Trump just couldn t wish all Americans ...
1,House Intelligence Committee Chairman Devin Nu...
2,"On Friday, it was revealed that former Milwauk..."
3,"On Christmas day, Donald Trump announced that ..."
4,Pope Francis used his annual Christmas Day mes...


### Things to fix so that datasets can be combined:

Dataset 0:
- Rename tweet column to 'text'

Dataset 1:
- Rename 'Text' column to 'text'
- Set labels to lowercase

Dataset 2:
- Drop extra columns
- Consider keeping as a separate dataset with the extra columns for an alternative model
- Make the labels binary. The current values are TRUE, mostly-true, half-true, barely-true, FALSE and pants-fire. Treat half-true and above as real, and everything else as fake.

Dataset 3:
- Combine into one dataframe with a label column

Once all of the above is done, the datasets can be combined and further preprocessing can be done to ready the data for visualisation or training.
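
As a rough illustration of the target shape, the sketch below brings three toy frames into a shared `text`/`label` schema before concatenating them. The helper `to_common_schema` and the toy data are hypothetical and only stand in for the real datasets; the cells that follow do the actual work dataset by dataset.

```python
import pandas as pd

def to_common_schema(df, text_col, label_col=None, fixed_label=None):
    """Hypothetical helper: return a frame with 'text' and lowercase 'label' columns."""
    out = pd.DataFrame({"text": df[text_col]})
    if label_col is not None:
        out["label"] = df[label_col].str.lower()
    else:
        # No label column in the source, so every row gets the same label
        out["label"] = fixed_label
    return out

# Toy frames standing in for the real datasets (illustrative data only)
toy_a = pd.DataFrame({"tweet": ["a claim"], "label": ["real"]})
toy_b = pd.DataFrame({"Text": ["another claim"], "label": ["Fake"]})
toy_c = pd.DataFrame({"text": ["unlabelled row"]})

combined = pd.concat(
    [
        to_common_schema(toy_a, "tweet", "label"),
        to_common_schema(toy_b, "Text", "label"),
        to_common_schema(toy_c, "text", fixed_label="fake"),
    ],
    ignore_index=True,
)
print(combined)
```

Once every frame has the same two columns, `pd.concat` stacks them without introducing extra, mostly empty columns.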

Dataset 0:

In [38]:
dataset_0 = dataset_0.rename(columns={"tweet": "text"})
dataset_0.head()

Unnamed: 0,text,label
0,Our daily update is published. States reported...,real
1,Alfalfa is the only cure for COVID-19.,fake
2,President Trump Asked What He Would Do If He W...,fake
3,States reported 630 deaths. We are still seein...,real
4,This is the sixth time a global health emergen...,real


Dataset 1:

In [39]:
dataset_1.columns = dataset_1.columns.str.lower()
dataset_1["label"] = dataset_1["label"].str.lower()

dataset_1.head()

Unnamed: 0,text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,fake
1,U.S. conservative leader optimistic of common ...,real
2,"Trump proposes U.S. tax overhaul, stirs concer...",real
3,Court Forces Ohio To Allow Millions Of Illega...,fake
4,Democrats say Trump agrees to work on immigrat...,real


Dataset 2:

In [40]:
dataset_2 = dataset_2.drop(labels=["Link_Of_News", "Source", "Stated_On", "Date"], axis=1)
dataset_2 = dataset_2.rename(columns={"News_Headline": "text", "Label": "label"})

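def binarise_label(label_text):
    # Ratings of half-true and above count as real; the rest count as fake
    if label_text in ["TRUE", "mostly-true", "half-true"]:
        return "real"
    elif label_text in ["barely-true", "FALSE", "pants-fire"]:
        return "fake"
    else:
        # Unrecognised labels become empty strings, which are culled during cleaning
        return ""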
    
dataset_2['label'] = dataset_2['label'].map(binarise_label)

dataset_2.head()

Unnamed: 0,text,label
0,Says Osama bin Laden endorsed Joe Biden,fake
1,CNN aired a video of a toddler running away fr...,fake
2,Says Tim Tebow kneeled in protest of abortion...,fake
3,Even so-called moderate Democrats like Joe Bi...,fake
4,"""Our health department, our city and our count...",real
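
It can be worth checking whether any label falls outside the six expected ratings, since those rows end up with an empty label. A minimal sketch of one way to do this, using a dictionary lookup on a toy frame; `LABEL_MAP` and the `unknown-rating` value are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Assumed mapping, following the "half-true and above is real" rule
LABEL_MAP = {
    "TRUE": "real", "mostly-true": "real", "half-true": "real",
    "barely-true": "fake", "FALSE": "fake", "pants-fire": "fake",
}

# Toy labels; "unknown-rating" is a hypothetical value outside the mapping
toy = pd.DataFrame({"label": ["TRUE", "pants-fire", "half-true", "unknown-rating"]})

mapped = toy["label"].map(LABEL_MAP)  # labels missing from the dict become NaN
unmapped = toy.loc[mapped.isna(), "label"].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would mean every row received a binary label.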


Dataset 3:

In [41]:
dataset_3_fake['label'] = "fake"
dataset_3_real['label'] = "real"

dataset_3 = pd.concat([dataset_3_fake, dataset_3_real])
dataset_3 = dataset_3.sample(frac=1) # Randomise the row order
dataset_3 = dataset_3.reset_index(drop=True)
dataset_3.head()

Unnamed: 0,text,label
0,North Korea is preparing to test a long-range ...,real
1,"WEED, Calif. — The water that gurgles from ...",real
2,A potential deal with the United States to upg...,real
3,It s no secret that Rosie O Donnell and Donald...,fake
4,Ukrainian intelligence services and members of...,fake


Combine all datasets into one:

In [42]:
main_dataset = pd.concat([dataset_0, dataset_1, dataset_2, dataset_3])
main_dataset = main_dataset.sample(frac=1)
main_dataset = main_dataset.reset_index(drop=True)

display(main_dataset.count())
main_dataset.head()

text     109148
label    109177
dtype: int64

Unnamed: 0,text,label
0,Says that President Barack Obama said an attac...,fake
1,"Liberals love us some Ruth Bader Ginsberg, who...",fake
2,"Representative Jeb Hensarling, the Republican ...",real
3,When Devin Nunes went running to Trump with in...,fake
4,We Are about to Witness the Biggest Supermoon ...,fake
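
The `text` and `label` counts differ because `count()` only counts non-null cells, so a small number of rows (109177 - 109148 = 29) have a label but no text. A minimal sketch of how to inspect this, on toy data rather than the real frame:

```python
import pandas as pd

# Toy frame with one missing text (illustrative data only)
toy = pd.DataFrame({
    "text": ["a", None, "c", ""],
    "label": ["real", "fake", "fake", "real"],
})

print(toy.count())       # non-null cells per column
print(toy.isna().sum())  # missing cells per column

n_missing_text = int(toy["text"].isna().sum())
```

Note that the empty string in the last row is not counted as missing here, which is why the cleaning step converts empty strings to NaN before dropping rows.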


Save into a file:

In [43]:
main_dataset.to_csv("data/total_dataset.csv")
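
By default `to_csv` also writes the index as an unnamed first column. A minimal sketch of the round trip, using an in-memory buffer and toy data in place of the real file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"text": ["a", "b"], "label": ["real", "fake"]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores that column as the index instead of a data column
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```

Reading the saved file back with `index_col=0` keeps the frame at two columns; passing `index=False` to `to_csv` would be an alternative if the index is not needed.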

### Cleaning dataset

In [44]:
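# Treat empty strings as missing, then drop rows with a missing text or label
dataset = main_dataset
print(f"Before culling:\n {dataset.count()}")
dataset = main_dataset.replace('', np.nan)
dataset = dataset.dropna().copy()  # explicit copy, so later column assignments work on an independent frame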
print(f"After culling:\n {dataset.count()}")

Before culling:
 text     109148
label    109177
dtype: int64
After culling:
 text     109043
label    109043
dtype: int64


In [45]:
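# Replace newlines, tabs and backslashes with spaces, and rejoin '. com' into '.com'
def remove_newlines_tabs(text):
    formatted_text = (
        text.replace('\\n', ' ')  # literal backslash-n sequences
        .replace('\n', ' ')       # real newlines
        .replace('\t', ' ')       # tabs
        .replace('\\', ' ')       # any remaining backslashes
        .replace('. com', '.com')
    )
    return formatted_text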

dataset['text'] = dataset['text'].map(remove_newlines_tabs)

# Strip any html tags
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text(separator=" ")
    return stripped_text

dataset['text'] = dataset['text'].map(strip_html_tags)

# Condense whitespace
WHITESPACE_PATTERN = re.compile(r'\s+')

def remove_whitespace(text):
    # Pad ? and ) with spaces first, since these end words, then collapse runs of
    # whitespace so that the padding does not leave double spaces behind.
    padded_text = text.replace('?', ' ? ').replace(')', ') ')
    output_text = WHITESPACE_PATTERN.sub(' ', padded_text)
    return output_text

dataset['text'] = dataset['text'].map(remove_whitespace)

# Transliterate accented and other non-ASCII characters to their closest ASCII equivalents
def accented_characters_removal(text):
    decoded_text = unidecode.unidecode(text)
    return decoded_text

dataset['text'] = dataset['text'].map(accented_characters_removal)
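
Each step above makes its own pass over the `text` column. As a sketch of an alternative layout, the steps could be chained into one function and applied in a single `map`. The two step functions below are simplified stand-ins that use only the standard library, not the notebook's actual cleaning functions:

```python
import re
from functools import reduce

import pandas as pd

def flatten_newlines(text):
    # Stand-in step: replace newlines and tabs with spaces
    return text.replace("\n", " ").replace("\t", " ")

def condense_whitespace(text):
    # Stand-in step: collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical ordered list of cleaning steps
STEPS = [flatten_newlines, condense_whitespace]

def clean(text):
    # Feed the output of each step into the next one
    return reduce(lambda acc, step: step(acc), STEPS, text)

toy = pd.Series(["Line one\nline\ttwo ", "  spaced   out  "])
cleaned = toy.map(clean)
print(cleaned.tolist())
```

Keeping the steps in a list makes their order explicit, which matters because some steps (such as condensing whitespace) are only effective after others have run.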
