In [1]:
import pandas as pd
from datasets import load_dataset

## The data

The data used for this project is a fake news dataset that can be found on Hugging Face under the path GonzaloA/fake_news. The data is described as a "mix of other datasets which are the same scope, the Fake News". Unsatisfied with this rather vague description, we looked for additional information about the data and found this Kaggle dataset:

https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets

We tested for similarity and found that 97.42% of the titles in our training data match titles in the Kaggle dataset (see code below). A notable property of the Kaggle data is that all fake news articles come from **websites** flagged by Politifact rather than from individually flagged articles, while all true articles come from Reuters. As a consequence, our model has likely learned to tell whether or not an article was published by Reuters rather than to perform the intended fake news detection.
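One way to probe the source-leakage concern is to measure how often a publisher marker appears in each class. The snippet below is a minimal sketch on toy rows, not the check we ran: the column names `text` and `label`, the label encoding, and the marker string `(Reuters)` are all assumptions made for illustration.

```python
import pandas as pd

# Toy rows standing in for the real data; `text` and `label` are assumed column names.
toy = pd.DataFrame({
    "text": [
        "WASHINGTON (Reuters) - The Senate voted on Tuesday ...",
        "LONDON (Reuters) - Ministers met on Monday ...",
        "You won't believe what this politician said ...",
        "BREAKING: shocking video surfaces ...",
    ],
    "label": [1, 1, 0, 0],  # assumed encoding: 1 = true, 0 = fake
})


def marker_rate_by_label(df, marker="(Reuters)"):
    """Share of articles per label whose text contains `marker`."""
    has_marker = df["text"].str.contains(marker, regex=False)
    return has_marker.groupby(df["label"]).mean()


rates = marker_rate_by_label(toy)
print(rates)
```

If the rates on the real data were close to 1 for one label and close to 0 for the other, a classifier could separate the classes from that marker alone, which would support the concern above.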

In [10]:
data = pd.read_csv("Fake.csv")
true_data = pd.read_csv("True.csv")
# GonzaloA/fake_news is the dataset we used for our analysis
our_data = load_dataset('GonzaloA/fake_news')

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
data.title

0         Donald Trump Sends Out Embarrassing New Year’...
1         Drunk Bragging Trump Staffer Started Russian ...
2         Sheriff David Clarke Becomes An Internet Joke...
3         Trump Is So Obsessed He Even Has Obama’s Name...
4         Pope Francis Just Called Out Donald Trump Dur...
                               ...                        
23476    McPain: John McCain Furious That Iran Treated ...
23477    JUSTICE? Yahoo Settles E-mail Privacy Class-ac...
23478    Sunnistan: US and Allied ‘Safe Zone’ Plan to T...
23479    How to Blow $700 Million: Al Jazeera America F...
23480    10 U.S. Navy Sailors Held by Iranian Military ...
Name: title, Length: 23481, dtype: object

In [4]:
sampled = data.title

In [5]:
from collections import defaultdict
d = defaultdict(dict)
c = 0
# A simple nested loop over the titles in both datasets; `in` makes this a substring match.
for counter, text in enumerate(our_data["train"]["title"]):
    found = False
    for text_2 in sampled:
        if text in text_2:
            d[counter]["our_data"] = text
            d[counter]["sampled"] = text_2
            d[counter]["missing"] = False
            found = True
            # When we encounter a match, we break out of the inner loop for efficiency
            break
    if not found:
        # No match: keep our title and leave the Kaggle side empty
        d[counter]["our_data"] = text
        d[counter]["sampled"] = None
        d[counter]["missing"] = True

In [6]:
df = pd.DataFrame(d).T

In [7]:
# The matching takes some time to run, so we split the processing into two parts, true and fake.
# The matching indexes for the label true were found earlier and saved to true.txt.
with open("true.txt", "r") as f:
    true = {int(line) for line in f.read().splitlines() if line.strip()}


In [8]:
#We merge the data into one set for quick lookup
merged = true.union(set(df[df["missing"] == False].index))

In [11]:
# Collect the indexes in our data that are NOT present in the Kaggle dataset
li = []
for i in range(max(merged)):
    if i not in merged:
        li.append(i)
# Note: the denominator here is the combined size of the two Kaggle files
print(100-len(li)/(true_data.shape[0] + data.shape[0])*100)


97.42750233863424
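The nested loop above compares every title pair, which is slow on datasets of this size. As a hedged alternative, the sketch below uses a set lookup on normalized titles. It is not the procedure that produced the figure above: it tests exact equality after normalization rather than substring containment, and it divides by the number of our own titles, so it may give a different percentage. The helper names `normalize` and `overlap_percentage` are hypothetical.

```python
def normalize(title):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(title.lower().split())


def overlap_percentage(our_titles, reference_titles):
    """Percentage of `our_titles` with an exact (normalized) match in `reference_titles`."""
    reference = {normalize(t) for t in reference_titles}
    matches = sum(normalize(t) in reference for t in our_titles)
    return 100 * matches / len(our_titles)


# Toy example; with the notebook's data the call might look like
# overlap_percentage(our_data["train"]["title"], pd.concat([data.title, true_data.title]))
ours = ["Trump  signs bill", "Pope visits Rome", "Unseen headline"]
kaggle = [" trump signs bill", "Pope Visits Rome", "Other story"]
pct = overlap_percentage(ours, kaggle)
print(round(pct, 2))
```

Because membership tests on a set take roughly constant time, this runs in time proportional to the total number of titles, so both labels can be processed in one pass.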


## Reflection on fake news

Fake news has many definitions, ranging from factually incorrect to misleading. This makes it hard to quantify results and to compare results across studies, since what a model learns depends on how the task is defined. The problem with using objective truth as the definition is that truth can vary with culture and context: an actor in a conflict can be seen as a freedom fighter against an oppressive regime by one side and as a terrorist by the other. Truth can also change over time, which means that models must be retrained constantly with up-to-date information to combat fake news in the factual correctness category. Using the definition of misleading would be an inherently easier task, since the model would not need a world view in order to classify. It should be solvable using only the information present in the text and comparing it to the title.
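To make the title-versus-text idea concrete, the snippet below is a deliberately crude sketch: it measures the fraction of title words that also appear in the article body. Lexical overlap is only an assumed stand-in for "misleading" and is not something we implemented or evaluated; the helper names `tokens` and `title_coverage` are hypothetical.

```python
import re


def tokens(text):
    """Set of lowercased word tokens (letters and apostrophes only)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def title_coverage(title, body):
    """Fraction of title tokens that also occur in the body (0.0 for an empty title)."""
    title_tokens = tokens(title)
    if not title_tokens:
        return 0.0
    return len(title_tokens & tokens(body)) / len(title_tokens)


body = "The Senate passes the budget bill after a long debate."
supported = title_coverage("Senate passes budget bill", body)
unsupported = title_coverage("Aliens endorse budget bill", body)
print(supported, unsupported)
```

A low coverage score would only hint that the title makes claims the text does not support. A real approach would need semantic comparison, since paraphrased titles share few exact words with the body.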

## Related work

<!-- Recent studies have shown that fake news propagates through social media at unprecedented speeds. This was observed to happen during the emergence of COVID-19, thus the need to quickly detect and mitigate the spreading of fake news is more important than ever s[1]. 

Many definitions are presented, ranging from being factually incorrect to misleading, and unfortunately, our data source has not specified which definition they use. This makes it harder to interpret why a model predicted as it did, since we do not know if the data contains mostly stories conflicting with reality, or simply written by an overselling journalist. -->

This paper, written by Shaina Raza & Chen Ding, uses Meta's BART language model trained on two datasets: NELA-GT-19, which consists of news articles sourced from multiple sites, and Fakeddit, a multimodal dataset from Reddit consisting of both images and text. The datasets have more than binary labels: they include labels such as mixed, used when there is disagreement about whether something is true or false, and categories such as satire, which are merged into a single category, Fake. The authors discuss their approach of continuously updating the training data and retraining the model to stay on top of relevant news. They also assert that freezing a model's weights can quickly make the model outdated, since it does not generalize well to future events. Finally, they report an accuracy of 74.89%.



s[1] = https://link.springer.com/article/10.1007/s41060-021-00302-z

s[2] = https://arxiv.org/pdf/2101.00180.pdf