# Introduction 
Fake news is defined as "*false stories that appear to be news, spread on the internet or using other media, usually created to influence political views or as a joke*" ([Cambridge University Press, 2022](https://dictionary.cambridge.org/dictionary/english/fake-news)). Originating as far back as the Roman era, it has existed for over two millennia ([BBC, 2022](https://www.bbc.co.uk/bitesize/articles/zwcgn9q)). However, powered by the growth of the internet, it has gained significant prominence.

Its influence on the public is a cause for concern as it "*promotes toxic narratives, spreads doubt and confusion, and increases social polarisation, affecting democratic decision-making*" ([Civica, 2022](https://www.civica.eu/fake-news-and-democracy/)). It can also be difficult to identify because of cognitive biases, the mental shortcuts people use when processing information. According to the Centre for Information Technology & Society ([2022](https://www.cits.ucsb.edu/fake-news/why-we-fall)), four aspects of cognitive bias affect how people use information:
> First, we tend to focus on headlines and tags without reading the article they’re associated with. Second, social media’s popularity signals affect our attention to and acceptance of information. Third, fake news takes advantage of partisanship, a very strong reflex. And fourth, persistence--there’s a weird tendency for false information to stick around, even after it’s corrected.

The issue of fake news made headlines during the 2016 US election ([Blake, 2018](https://www.washingtonpost.com/news/the-fix/wp/2018/04/03/a-new-study-suggests-fake-news-might-have-won-donald-trump-the-2016-election/)). However, due to its power to manipulate audiences and its subsequent profitability, as well as the ethics surrounding free speech, fake news continues to be prominent. For instance, Donald Trump's use of fake news to benefit himself and strengthen his position is widely recorded ([Rattner, 2021](https://www.cnbc.com/2021/01/13/trump-tweets-legacy-of-lies-misinformation-distrust.html)). In such cases, the prospects of securing government-led condemnation of the issue may be limited.

This is not to say that efforts have not been made. For instance, organisations such as [FactChecker](https://www.factcheck.org/about/our-mission/) and Birdwatch have worked to debunk such news ([Lorenz et al., 2022](https://www.washingtonpost.com/technology/2022/11/09/twitter-birdwatch-factcheck-musk-misinfo/)). They work by identifying and highlighting misinformation.

This approach is one of four highlighted in a report by Lazer et al. ([2017](https://www.sipotra.it/wp-content/uploads/2017/06/Combating-Fake-News.pdf)) as ways to combat fake news:
> (1) offering feedback to users that particular news may be fake (which seems to depress overall sharing from those individuals); 
> (2) providing ideologically compatible sources that confirm that particular news is fake; 
> (3) detecting information that is being promoted by bots and “cyborg” accounts and tuning algorithms to not respond to those manipulations; and 
> (4) because a few sources may be the origin of most fake news, identifying those sources and reducing promotion (by the platforms) of information from those sources.


# Combatting Fake News
For the methods highlighted above, and for the work of groups such as FactChecker, fake news or suspected fake news must first be identified. A key tool for such identification can be machine learning / artificial intelligence.

There are two main approaches. Unsupervised learning may be able to identify discussion groups where an echo chamber is forming. Likewise, a supervised learning model akin to those used for spam filtering could be used to flag suspected fake news based on a pre-existing dataset. The latter can be explored using a [fake news dataset on Kaggle](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset).



In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os


fake = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/Fake.csv")
true = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/True.csv")

In [2]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [3]:
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [4]:
fake["type"] ="Fake"
true["type"] ="True"

In [5]:
# Combine both sets and shuffle the rows so the two classes are interleaved
result = pd.concat([fake, true]).sample(frac=1).reset_index(drop=True)

In [6]:
result.head()

Unnamed: 0,title,text,subject,date,type
0,"Taliban increases influence, territory in Afgh...",WASHINGTON (Reuters) - The Taliban has increas...,worldnews,"October 31, 2017",True
1,Those who boycott Syrian congress may be sidel...,ASTANA (Reuters) - Syrian groups who choose to...,worldnews,"October 31, 2017",True
2,Director Rob Reiner: ‘Moron’ Trump Is Last Ga...,Racists and Confederate flag worshipers may be...,News,"November 15, 2016",Fake
3,U.S. court rejects Trump bid to stop transgend...,NEW YORK (Reuters) - A federal appeals court i...,politicsNews,"December 21, 2017",True
4,WATCH: STONE-FACED ANDERSON COOPER Gets School...,President Trump s deputy assistant Sebastian G...,politics,"Jul 13, 2017",Fake


In [7]:
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   type     44898 non-null  object
dtypes: object(5)
memory usage: 1.7+ MB


In [8]:
result["type"].value_counts()

Fake    23481
True    21417
Name: type, dtype: int64

The combined dataset is built from two files, one containing true news and the other fake news. The classes are fairly balanced, with slightly more fake articles (23,481) than true ones (21,417).

In [9]:
(round(result[result["type"]=="Fake"].shape[0]/result.shape[0],2))*100

52.0

Using a portion of the dataset for training, a model can be built from a matrix of TF-IDF features and linear support vector classification. On the held-out third of the data, this model performs very strongly, with an F1-score of 0.99 and an accuracy of 0.99.
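To make the TF-IDF step less of a black box, the following is a minimal pure-Python sketch of how such weights can be computed for a tiny, made-up corpus. It assumes plain whitespace tokenisation (simpler than the tokeniser inside `TfidfVectorizer`) and uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is understood to match the scikit-learn defaults. The three example sentences and the helper name `tfidf_rows` are illustrative only.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenised = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for doc in tokenised for tok in set(doc))

    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical mini-corpus, not taken from the Kaggle data
toy_docs = [
    "trump wins the election",
    "senate passes the budget",
    "trump attacks the senate",
]

vocab, rows = tfidf_rows(toy_docs)
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok:10s} {weight:.3f}")
```

In the first toy document, a word that occurs in every document ("the") receives a lower weight than a word unique to that document ("wins"), which is the property the classifier relies on.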

In [10]:
from sklearn.model_selection import train_test_split

x = result["text"]
y = result["type"]

X_train, X_test, y_train, y_test = train_test_split(x,y,test_size = 0.33, random_state = 100)

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([("tfidf", TfidfVectorizer()), 
                    ("clf", LinearSVC())
                    ])

text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [12]:
predictions = text_clf.predict(X_test)

In [13]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))
print(metrics.classification_report(y_test,predictions))

[[7674   50]
 [  42 7051]]
              precision    recall  f1-score   support

        Fake       0.99      0.99      0.99      7724
        True       0.99      0.99      0.99      7093

    accuracy                           0.99     14817
   macro avg       0.99      0.99      0.99     14817
weighted avg       0.99      0.99      0.99     14817



In [14]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9937909158399136
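The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes the usual scikit-learn layout (rows are the actual classes, columns are the predicted classes, in the order Fake, True); the helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class, order = [Fake, True]
confusion = [[7674, 50],
             [42, 7051]]
labels = ["Fake", "True"]


def per_class_metrics(confusion, labels):
    """Compute precision, recall and F1 for each class from a square confusion matrix."""
    metrics_by_label = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column total
        actually_k = sum(confusion[k])                     # row total (support)
        precision = true_positive / predicted_as_k
        recall = true_positive / actually_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics_by_label[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actually_k,
        }
    return metrics_by_label


report = per_class_metrics(confusion, labels)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(labels))) / total

for label, values in report.items():
    print(label, {name: round(value, 4) for name, value in values.items()})
print("accuracy", accuracy)
```

Of the 14,817 test articles, 92 are misclassified (50 fake articles predicted as true and 42 true articles predicted as fake), which is where the accuracy of roughly 0.9938 comes from.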


In other words, this model can be used to assess whether new articles are likely to be fake. However, there is one major limitation: the model relies on the data it was trained on. If fake news focused on a new topic emerges, its performance is likely to decrease.
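One way to probe this sensitivity to unseen topics is to hold out every article from one `subject` at a time, train on the rest, and score on the held-out subject. The sketch below only builds the index splits in plain Python; the helper name `leave_one_subject_out` and the toy subject list are assumptions for illustration. scikit-learn's `LeaveOneGroupOut` splitter offers the same idea. Note that in the rows displayed earlier the subject labels differ between the fake and true files, so a held-out fold may contain a single class, and per-class recall may then be more informative than accuracy.

```python
from collections import defaultdict


def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_indices, test_indices) for each distinct subject."""
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subjects):
        by_subject[subject].append(idx)

    for held_out, test_idx in by_subject.items():
        train_idx = [
            i
            for subject, indices in by_subject.items()
            if subject != held_out
            for i in indices
        ]
        yield held_out, train_idx, test_idx


# Toy stand-in for result["subject"]; the values mirror labels seen in the data preview
toy_subjects = ["worldnews", "worldnews", "News", "politicsNews", "politics", "News"]

splits = list(leave_one_subject_out(toy_subjects))
for held_out, train_idx, test_idx in splits:
    print(f"hold out {held_out!r}: train on {train_idx}, test on {test_idx}")
```

With the real data, each pair of index lists could be used to select rows of `result` before refitting the pipeline and scoring it on the held-out subject.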

It should be noted that this project utilises spaCy and scikit-learn. However, alternative models such as BERT, which uses deep learning, are also available ([Paialunga, 2021](https://towardsdatascience.com/fake-news-detection-with-machine-learning-using-python-3347d9899ad1)).

# Using AI to identify fake news
Fake news covers a broad array of topics, from vaccinations to politics and gender equality. These topics can be identified through unsupervised learning methods. In this case, assuming that six different topics exist, key phrases can be singled out from similar articles.
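The topic model used below is non-negative matrix factorisation (NMF), which approximates the document-term matrix `V` as the product of a document-topic matrix `W` and a topic-term matrix `H`, all with non-negative entries. As a rough illustration of the idea, here is a small pure-Python sketch using the classic multiplicative update rules on a made-up 4x4 count matrix. This is a swapped-in teaching version: scikit-learn's `NMF` uses a different solver and initialisation by default, so the numbers will not match it. The matrix, the term names and the function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, i.e. how far the factorisation is from the data."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise V ~ W @ H with multiplicative updates; returns W, H and the error history."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


# Hypothetical term counts: columns = ["trump", "campaign", "fbi", "russia"]
toy_v = [[2, 1, 0, 0],
         [1, 2, 0, 0],
         [0, 0, 2, 1],
         [0, 0, 1, 2]]

w, h, errors = nmf(toy_v, n_topics=2)
dominant_topic = [max(range(2), key=lambda k: row[k]) for row in w]
print("reconstruction error:", round(errors[0], 3), "->", round(errors[-1], 3))
print("dominant topic per document:", dominant_topic)
```

Each row of `H` plays the role of `nmf_model.components_` in the cells that follow, and taking the largest entry in each row of `W` corresponds to assigning an article to its dominant topic.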

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
fake.head()

Unnamed: 0,title,text,subject,date,type
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake


In [17]:
tfidf = TfidfVectorizer(max_df = 0.95, min_df=2, stop_words ="english")
dtm = tfidf.fit_transform(fake["text"])
dtm

<23481x60653 sparse matrix of type '<class 'numpy.float64'>'
	with 3610906 stored elements in Compressed Sparse Row format>

In [18]:
from sklearn.decomposition import NMF

nmf_model = NMF(n_components = 6, init='nndsvda',random_state= 100)
nmf_model.fit(dtm)

NMF(init='nndsvda', n_components=6, random_state=100)

In [20]:
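# Look up the vocabulary once rather than on every iteration
feature_names = tfidf.get_feature_names_out()

for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')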

THE TOP 15 WORDS FOR TOPIC #0
['supporters', 'said', 'white', 'featured', 'people', 'like', 'republican', 'image', 'just', 'realdonaldtrump', 'twitter', 'campaign', 'president', 'donald', 'trump']


THE TOP 15 WORDS FOR TOPIC #1
['house', 'senate', 'america', 'republican', 'american', 'government', 'united', 'said', 'state', 'people', 'court', 'states', 'republicans', 'president', 'obama']


THE TOP 15 WORDS FOR TOPIC #2
['spore', 'pst', 'hesher', 'alternate', 'episode', 'tune', 'animals', 'broadcast', '00', 'join', 'radio', 'room', 'pm', 'acr', 'boiler']


THE TOP 15 WORDS FOR TOPIC #3
['party', 'candidate', 'presidential', 'secretary', 'election', 'state', 'bernie', 'democratic', 'email', 'emails', 'foundation', 'campaign', 'sanders', 'hillary', 'clinton']


THE TOP 15 WORDS FOR TOPIC #4
['officials', 'house', 'james', 'security', 'flynn', 'putin', 'information', 'director', 'news', 'intelligence', 'investigation', 'russian', 'comey', 'russia', 'fbi']


THE TOP 15 WORDS FOR TOPIC #5


In [21]:
# Document-topic weights: one row per article, one column per topic
topic_results = nmf_model.transform(dtm)
# Assign each article to the topic with the largest weight
fake["TOPICS"] = topic_results.argmax(axis=1)


In [22]:
fake.head(10)

Unnamed: 0,title,text,subject,date,type,TOPICS
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake,4
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake,5
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake,0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake,1
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",Fake,5
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",Fake,4
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",Fake,0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",Fake,0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",Fake,0


This process indicates that the major topics include:
1. Trump supporters
2. US politics
3. Media
4. US election
5. International politics
6. Education

This process can indicate where active investigations should take place to curb the spread of fake news. However, it will not be able to assist in identifying new areas of focus unless an up-to-date catalogue is routinely provided.
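To turn the topic assignments into a rough priority list, the number of articles per topic can be counted. The sketch below uses only the ten `TOPICS` values visible in the `fake.head(10)` output above, so the shares are illustrative rather than representative of the full dataset. The mapping from topic index to label follows the order of the list above, and the helper name `topic_shares` is an assumption.

```python
from collections import Counter

# Topic index -> label, following the order of the numbered list above
topic_labels = {
    0: "Trump supporters",
    1: "US politics",
    2: "Media",
    3: "US election",
    4: "International politics",
    5: "Education",
}

# TOPICS column for the first ten fake articles, copied from the output above
assigned_topics = [0, 4, 5, 0, 1, 5, 4, 0, 0, 0]


def topic_shares(assigned, labels):
    """Return (label, count, share) tuples, most frequent topic first."""
    counts = Counter(assigned)
    total = len(assigned)
    return [(labels[topic], n, n / total) for topic, n in counts.most_common()]


shares = topic_shares(assigned_topics, topic_labels)
for label, count, share in shares:
    print(f"{label:25s} {count:3d}  {share:.0%}")
```

On the full DataFrame, the same counts could be obtained with `fake["TOPICS"].value_counts()`.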

# Further Concerns
One major solution that individuals revert to is the notion that, as the issue of fake news grows, people will become more adept at identifying it and acting accordingly. However, this entails training individuals to go against their default cognitive biases. In addition, whilst individuals may be aware that Photoshop is frequently used on images, its impact on mental health remains profound ([Harvard, 2020](https://www.hsph.harvard.edu/news/hsph-in-the-news/photoshops-damaging-effects/)). Moreover, educating a large population who are no longer in education can also be challenging.

Likewise, the issues discussed here only concern information in the public domain. They do not cover the growing concerns surrounding echo chambers and private groups where fake news is able to circulate freely ([DW Documentary, 2022](https://www.youtube.com/watch?v=HDtFpGfORpE&list=WL&index=5)).

That said, it is likely that the key tools to combat this issue lie in the development of AI and its use on key social networking sites. Any short-term gains made by fake news are unlikely to have long-term value, and a robust campaign to tackle it is required.


# Conclusion 
Fake news is a major issue affecting society. This is unlikely to change in the near future, especially as technologies such as deepfakes continue to develop ([Schwartz, 2018](https://www.theguardian.com/technology/2018/nov/12/deep-fakes-fake-news-truth)). Consequently, there is a strong need to find robust methods to combat such information. To do so, possible fake news must first be identified so that it may be investigated and appropriate action taken. For this, tools such as the machine learning methods highlighted here may serve a vital role.