# Final Project "Fake News Detection"

* Submitted by: Marc Friz (matriculation no.), Botan Babath, Nadja Herrmann
* Submitted to: Prof. Dr. Johannes Maucher
* Submitted on: 05.01.2021

### Table of Contents

1. Introduction / Use Case Scope
2. Data Acquisition
3. Analysis of the Required Packages
4. Importing the Packages
5. Data Preparation
6. Data Analysis - Description of the Provided Datasets
7. Detailed Data Analysis
8. Summary and Outlook
9. References

### 1. Introduction
Today, news reaches a mass audience through many different channels, social media among them. On the one hand, easy access and rapid dissemination mean that many people consume news this way. On the other hand, the same properties favour the rapid spread of "fake news": news of low quality that contains intentionally false information. The wide distribution of fake news can have extremely negative effects on individuals and on society (Shu et al., 2017). Detecting such news is therefore highly relevant.

#### 1.1 Problem Statement and Goal of the Project

In this project we analyse and discuss the main problems in detecting misinformation. We use statistical methods to examine a range of news articles that have already been categorised as "true" or "false". "True" means that the article corresponds to the facts; "false" means the opposite.

Which goal do we set ourselves? What is the scope of our project work?


Sources:

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang and Huan Liu (2017).
Fake News Detection on Social Media: A Data Mining Perspective.
https://dl.acm.org/doi/10.1145/3137597.3137600

#### 1.2 Structure of the Report

The first part of this report deals with....

### 2. Data Acquisition

Our project draws on datasets from Kaggle and Statista. These datasets contain news from American news outlets as well as fake news. What do the data contain, and what does their structure look like? The sources are as follows.

### 3. Analysis of the Required Packages

Which packages are needed to achieve the goal defined in Section X.X?

- Pandas provides high-performance, easy-to-use data structures and data analysis tools for Python. Its main data structure is the NumPy-array-based DataFrame, which is comparable to data frames in R. With Pandas, Python offers functionality similar to R. The Pandas website puts it as follows: "Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R."

- NumPy is the fundamental package for scientific computing in Python. It provides a multi-dimensional data structure, the NumPy array, and many efficiently implemented functions for numerical calculations. Many other important libraries for scientific computing and data analysis are based on NumPy.

- SciPy builds on NumPy and extends it with packages for linear algebra, integration, optimisation, signal processing, statistics and much more. Python with NumPy, SciPy and Matplotlib constitutes a comprehensive toolset for scientific calculations of all kinds. Together they provide functionality comparable to the commercial tool MATLAB.

- Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and graphical user interface toolkits.

Visualisation:

- Bokeh is an interactive visualization library for Python that enables beautiful and meaningful visual presentation of data in modern web browsers. With Bokeh, you can quickly and easily create interactive plots, dashboards, and data applications.
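The relationship between NumPy arrays and Pandas DataFrames described above can be illustrated with a minimal sketch. The values and column names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, only to show how the two libraries relate.
values = np.array([[1.0, 2.0], [3.0, 4.0]])
frame = pd.DataFrame(values, columns=["a", "b"])

# Column-wise statistics come from pandas; the raw array stays accessible.
col_means = frame.mean()
as_array = frame.to_numpy()

print(col_means)
print(as_array.shape)
```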

### 4. Importing the Packages

In [1]:
import numpy as np
import pandas as pd
import re

from xml.dom import minidom
from os import listdir
from os.path import isfile, join

### 5. Data Preparation
#### 5.1 Nadja

The extracted data are loaded into the Jupyter notebook in their original format.

In [2]:
news = pd.read_csv("../DataSet/news_dataset.csv", encoding="latin-1")
news

Unnamed: 0.1,Unnamed: 0,title,content,publication,label
0,0,Muslims BUSTED: They Stole Millions In Govât...,Print They should pay all the back all the mon...,100percentfedup,fake
1,1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,100percentfedup,fake
2,2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,100percentfedup,fake
3,3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,100percentfedup,fake
4,4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,100percentfedup,fake
...,...,...,...,...,...
28706,15707,An eavesdropping Uber driver saved his 16-year...,Uber driver Keith Avila picked up a p...,Washington Post,real
28707,15708,Plane carrying six people returning from a Cav...,Crews on Friday continued to search L...,Washington Post,real
28708,15709,After helping a fraction of homeowners expecte...,When the Obama administration announced a...,Washington Post,real
28709,15710,"Yes, this is real: Michigan just banned bannin...",This story has been updated. A new law in...,Washington Post,real
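The title in row 0 contains the sequence `Govât`. Sequences like this typically appear when UTF-8 bytes are decoded as Latin-1. The following is a minimal sketch of the effect, assuming the original title contained a typographic apostrophe (the raw file content is not shown here, so this is an assumption).

```python
# Assumption: the original title contained a typographic apostrophe (U+2019).
original = "Gov\u2019t"

# Decoding the UTF-8 bytes as Latin-1 turns the single apostrophe into three
# characters: "â" followed by two invisible control characters.
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Reversing the two steps restores the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If the file is in fact UTF-8, reading it with `encoding="utf-8"` would avoid the problem at the source; this should be checked against the file before changing the call.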


In [3]:
news.shape

(28711, 5)

In [4]:
news.columns

Index(['Unnamed: 0', 'title', 'content', 'publication', 'label'], dtype='object')

#### Renaming the columns to standardise the base data

Target format: CSV with the columns title, text, source, veracity
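One way to check a frame against this target format is sketched here. The helper `missing_target_columns` and the toy frame are hypothetical and only serve as illustration.

```python
import pandas as pd

TARGET_COLUMNS = ["title", "text", "source", "veracity"]

def missing_target_columns(frame):
    """Return the target columns that `frame` does not yet provide."""
    return [col for col in TARGET_COLUMNS if col not in frame.columns]

# Toy frame using the column names of the raw news dataset.
raw = pd.DataFrame(
    {"title": ["t"], "content": ["c"], "publication": ["p"], "label": ["fake"]}
)
before = missing_target_columns(raw)

renamed = raw.rename(
    columns={"content": "text", "publication": "source", "label": "veracity"}
)
after = missing_target_columns(renamed)

print(before)
print(after)
```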

In [5]:
dfnews=news.rename(columns={'content':'text','publication':'source','label':'veracity'})
dfnews.head()

Unnamed: 0.1,Unnamed: 0,title,text,source,veracity
0,0,Muslims BUSTED: They Stole Millions In Govât...,Print They should pay all the back all the mon...,100percentfedup,fake
1,1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,100percentfedup,fake
2,2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,100percentfedup,fake
3,3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,100percentfedup,fake
4,4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,100percentfedup,fake


Renaming the values fake and real to false and true.
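The same relabelling can also be written as a dictionary lookup, which makes unexpected labels visible. This is a sketch on invented data: the label "satire" is made up to show what happens to a value that is not in the mapping.

```python
import pandas as pd

label_map = {"fake": "false", "real": "true"}

# "satire" is a made-up label that the mapping does not cover.
labels = pd.Series(["fake", "real", "satire"])

# Series.map returns NaN for every value that is missing from the dictionary.
mapped = labels.map(label_map)
unmapped = labels[mapped.isna()].unique()

print(mapped.tolist())
print(list(unmapped))
```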

In [6]:
dfnews.loc[dfnews['veracity']== 'fake','veracity']='false'
dfnews.loc[dfnews['veracity']== 'real','veracity']='true'
dfnews.head()

Unnamed: 0.1,Unnamed: 0,title,text,source,veracity
0,0,Muslims BUSTED: They Stole Millions In Govât...,Print They should pay all the back all the mon...,100percentfedup,False
1,1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,100percentfedup,False
2,2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,100percentfedup,False
3,3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,100percentfedup,False
4,4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,100percentfedup,False


Creating a DataFrame with the columns title, text, source and veracity.

In [7]:
dfnewsfinal=dfnews[['title','text','source','veracity']]
dfnewsfinal.head()

Unnamed: 0,title,text,source,veracity
0,Muslims BUSTED: They Stole Millions In Govât...,Print They should pay all the back all the mon...,100percentfedup,False
1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,100percentfedup,False
2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,100percentfedup,False
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,100percentfedup,False
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,100percentfedup,False


These data are then made available and merged with the data from Botan and Marc.

#### 5.2 Marc

In [8]:
data_path = '../DataSet/BuzzFeed_Corpus_2016/articles/'

all_files = [f for f in listdir(data_path) if isfile(join(data_path, f))]

In [9]:
def get_xml_tags(path):
    """Return (title, text, source, veracity) from one article XML file."""
    doc = minidom.parse(path)

    def first_text(tag):
        # Text of the first <tag> element, or '' if the tag is missing or empty
        try:
            return doc.getElementsByTagName(tag)[0].firstChild.data
        except (IndexError, AttributeError):
            return ''

    title = first_text('title')
    text = first_text('mainText')
    source = first_text('uri')
    veracity = first_text('veracity')

    return title, text, source, veracity
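The extraction can be tried without touching the corpus by parsing an XML string. In this sketch the tag names follow the function above, while the root element name and all content are invented; the `veracity` tag is left out on purpose to show the fallback to an empty string.

```python
from xml.dom import minidom

# Hypothetical article: tag names as in get_xml_tags, content invented.
sample = """<article>
  <title>Example title</title>
  <mainText>Example body text.</mainText>
  <uri>http://example.com/article</uri>
</article>"""

doc = minidom.parseString(sample)

def first_text(tag):
    # Text of the first <tag> element, or '' when the tag is absent or empty
    nodes = doc.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return ''

record = {tag: first_text(tag) for tag in ("title", "mainText", "uri", "veracity")}
print(record)
```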

In [10]:
l_title = []
l_text = []
l_quelle = []
l_veracity = []

for i in all_files:
    title, text, source, veracity = get_xml_tags(data_path + i)
    l_title.append(title)
    l_text.append(text)
    l_quelle.append(source)
    l_veracity.append(veracity)

In [11]:
df = pd.DataFrame(list(zip(l_title, l_text, l_quelle, l_veracity)), columns =['title', 'text', 'source', 'veracity']) 
df.head()

Unnamed: 0,title,text,source,veracity
0,The Impact of Debates? It's Debatable,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,mostly true
1,Details Emerge About NYC Bomb Suspect Ahmad Kh...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,mostly true
2,Donald Trump Repeats Calls for Police Profilin...,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,mostly true
3,"NY, NJ Bombings Suspect Charged With Attempted...","Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,mostly true
4,Trump Surrogates Push Narrative That Clinton S...,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,mostly true


In [12]:
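def set_veracity(veracity):
    # Collapse the corpus labels to 'true'/'false'. 'false' is checked first,
    # so a label containing both words is treated as 'false'.
    if 'false' in veracity:
        return 'false'
    elif 'true' in veracity:
        return 'true'
    else:
        return ''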
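How this rule treats different labels can be checked on a few strings. Only "mostly true" appears in the output above; the other three labels are assumptions about what the corpus may contain.

```python
def set_veracity(veracity):
    # Same rule as above: 'false' wins over 'true' when both words occur.
    if 'false' in veracity:
        return 'false'
    elif 'true' in veracity:
        return 'true'
    else:
        return ''

# "mostly true" is taken from the data; the other labels are assumed.
examples = ["mostly true", "mostly false", "mixture of true and false", "no factual content"]
collapsed = {label: set_veracity(label) for label in examples}

for label, value in collapsed.items():
    print(repr(label), "->", repr(value))
```

A mixed label ends up as 'false' under this rule, and a label with neither word becomes an empty string, so such rows may need to be filtered out before analysis.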

In [13]:
df['veracity'] = df['veracity'].apply(set_veracity)
df.head()

Unnamed: 0,title,text,source,veracity
0,The Impact of Debates? It's Debatable,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,True
1,Details Emerge About NYC Bomb Suspect Ahmad Kh...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,True
2,Donald Trump Repeats Calls for Police Profilin...,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,True
3,"NY, NJ Bombings Suspect Charged With Attempted...","Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,True
4,Trump Surrogates Push Narrative That Clinton S...,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,True


#### 5.3 Botan

#### Description

This dataset comes from Kaggle; its owner is Clément Bisaillon.

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset/metadata

The dataset contains two types of articles: fake and real news. It was collected from real-world sources; the truthful articles were obtained by crawling Reuters.com (a news website). The fake news articles were collected from various unreliable websites that were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset covers different types of articles on different topics; however, the majority focus on political and world news.

Read more: 

https://www.uvic.ca/engineering/ece/isot/assets/docs/ISOT_Fake_News_Dataset_ReadMe.pdf

In [14]:
rawData_true= pd.read_csv ("../DataSet/Kaggle/True.csv")
rawData_true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [15]:
rawData_true.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


#### Adding the "veracity" column to the DataFrame

In [16]:
prepData_true = rawData_true.copy()  # copy, so the raw data stay unchanged
prepData_true["veracity"] = "true"
prepData_true.head()

Unnamed: 0,title,text,subject,date,veracity
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True
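Plain assignment such as `b = a` binds a second name to the same DataFrame rather than creating a new one, which matters whenever the raw data should remain untouched. A minimal sketch with invented frames:

```python
import pandas as pd

raw = pd.DataFrame({"title": ["a"]})

# A second name for the same object: the new column shows up in `raw` too.
alias = raw
alias["veracity"] = "true"
alias_changed_raw = "veracity" in raw.columns

# An explicit copy leaves the original frame unchanged.
raw2 = pd.DataFrame({"title": ["a"]})
independent = raw2.copy()
independent["veracity"] = "true"
copy_changed_raw = "veracity" in raw2.columns

print(alias_changed_raw, copy_changed_raw)
```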


#### Adding the "source" column to the DataFrame

In [17]:
prepData_true["source"] = ""
prepData_true = prepData_true[['title','text','source','veracity','subject','date']].copy()

#### Extracting the source from the text column using a regex

In [18]:
match = 0
nomatch = 0
findrows_nomatch = []
for index, row in prepData_true.iterrows():
    if "(Reuters)" in row.text:  # article carries a Reuters dateline
        if " - " in row.text:
            # Split at the first " - " only: dateline (source) before, article body after
            source, text = re.split(r" - ", row.text, maxsplit=1)
            # Write via .at: assigning to `row` is not guaranteed to change the DataFrame
            prepData_true.at[index, 'source'] = source
            prepData_true.at[index, 'text'] = text
            match += 1
    else:
        nomatch += 1
        findrows_nomatch.append(index)  # index label, as expected by drop()
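A vectorised alternative to a row-wise loop is sketched here on a toy frame. The pattern, the group name `body` and the sample texts are assumptions for illustration, not the method used in this notebook.

```python
import re
import pandas as pd

toy = pd.DataFrame({"text": [
    "WASHINGTON (Reuters) - The head of a faction - in Congress spoke.",
    "An article without a dateline.",
]})

# Everything up to "(Reuters)" is the source, everything after the first
# " - " that follows it is the article body.
pattern = r"^(?P<source>.*?\(Reuters\)) - (?P<body>.*)$"
parts = toy["text"].str.extract(pattern, flags=re.DOTALL)

# Rows without a Reuters dateline get NaN in both columns.
has_dateline = parts["source"].notna()

print(parts)
print(has_dateline.tolist())
```

The boolean mask `has_dateline` could then replace the list of row numbers when filtering out articles without a dateline.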

#### Extract news only from the Reuters source

In [19]:
prepData_true_1 = prepData_true.drop(findrows_nomatch)
prepData_true_1 = prepData_true_1.loc[prepData_true_1['subject'] == "politicsNews"]

In [20]:
prepData_true_1.head()

Unnamed: 0,title,text,source,veracity,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",The head of a conservative Republican faction ...,WASHINGTON (Reuters),True,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,Transgender people will be allowed for the fir...,WASHINGTON (Reuters),True,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,The special counsel investigation of links bet...,WASHINGTON (Reuters),True,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,Trump campaign adviser George Papadopoulos tol...,WASHINGTON (Reuters),True,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,President Donald Trump called on the U.S. Post...,SEATTLE/WASHINGTON (Reuters),True,politicsNews,"December 29, 2017"


In [21]:
rawData_false= pd.read_csv ("../DataSet/Kaggle/Fake.csv")
rawData_false.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [22]:
prepData_false = rawData_false.copy()  # copy, so the raw data stay unchanged
prepData_false["veracity"] = "false"
prepData_false.head()

Unnamed: 0,title,text,subject,date,veracity
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",False
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",False
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",False
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",False
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",False


In [23]:
prepData_false["source"]= ""
prepData_false= prepData_false[['title','text','source','veracity','subject','date']]
prepData_false.head()

Unnamed: 0,title,text,source,veracity,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,,False,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,,False,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",,False,News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",,False,News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,,False,News,"December 25, 2017"


#### 5.4 Merge

In [24]:
# Tag each row with its sub-dataset; .copy() avoids SettingWithCopyWarning on sliced frames
dfnewsfinal = dfnewsfinal.copy()
prepData_true_1 = prepData_true_1.copy()
prepData_false = prepData_false.copy()
dfnewsfinal['set'] = 'nadja'
df['set'] = 'marc'
prepData_true_1['set'] = 'botan'
prepData_false['set'] = 'botan'


In [25]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
end_df = pd.concat([df, dfnewsfinal, prepData_true_1, prepData_false])
end_df

Unnamed: 0,title,text,source,veracity,set,subject,date
0,The Impact of Debates? It's Debatable,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,true,marc,,
1,Details Emerge About NYC Bomb Suspect Ahmad Kh...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,true,marc,,
2,Donald Trump Repeats Calls for Police Profilin...,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,true,marc,,
3,"NY, NJ Bombings Suspect Charged With Attempted...","Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,true,marc,,
4,Trump Surrogates Push Narrative That Clinton S...,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,true,marc,,
...,...,...,...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,,false,botan,Middle-east,"January 16, 2016"
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,,false,botan,Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,,false,botan,Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,,false,botan,Middle-east,"January 14, 2016"
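Stacking frames with different columns fills the missing columns with NaN and keeps each frame's original row labels, so the resulting index is not unique. A minimal sketch with invented frames; whether to renumber the rows with `ignore_index=True` is a design choice left open here.

```python
import pandas as pd

a = pd.DataFrame({"title": ["a1", "a2"], "set": "marc"})
b = pd.DataFrame({"title": ["b1"], "subject": ["News"], "set": "botan"})

# Row labels are kept as they are: 0, 1, 0.
stacked = pd.concat([a, b])

# Row labels are renumbered: 0, 1, 2.
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist(), stacked.index.is_unique)
print(renumbered.index.tolist())
print(stacked["subject"].isna().tolist())
```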


#### 5.5 Save

In [26]:
end_df.to_csv('../DataSet/complete_dataset.csv')
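Writing the index to the file is the usual origin of columns named `Unnamed: 0` when the file is read back. The sketch below uses an in-memory buffer and an invented frame to show the round trip with and without the index. Passing `dtype=str` on reading is an optional choice that keeps the labels as the strings 'true' and 'false' instead of letting the parser infer a type.

```python
import io
import pandas as pd

frame = pd.DataFrame({"title": ["a", "b"], "veracity": ["true", "false"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, dtype=str)

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```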

### 6. Data Analysis

Description of the provided datasets.
Which features need to be examined? Literature review.
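A possible first step for describing the merged data is a count of true and false articles per sub-dataset. The sketch uses an invented toy frame with the `set` and `veracity` columns from the merge step; the counts are not results from the real data.

```python
import pandas as pd

# Invented rows; only the column names follow the merged dataset.
toy = pd.DataFrame({
    "set": ["marc", "marc", "nadja", "botan", "botan"],
    "veracity": ["true", "false", "false", "true", "false"],
})

# One row per sub-dataset, one column per label, 0 where a combination is absent.
counts = toy.groupby(["set", "veracity"]).size().unstack(fill_value=0)
print(counts)
```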