# Final Project "Fake News Detection"

* Submitted by: Marc Friz (matriculation no.), Botan Babath, Nadja Herrmann
* Submitted to: Prof. Dr. Johannes Maucher
* Submitted on: 05.01.2021

### Table of Contents

1. Introduction / use case scope
2. Data acquisition
3. Analysis of the required packages
4. Importing the packages
5. Data provisioning
6. Data analysis - description of the provided datasets
7. Detailed data analysis
8. Summary and outlook
9. References

### 1. Introduction
Today, news is distributed to the public through a variety of media, social media being one example. On the one hand, easy access and rapid dissemination mean that many people consume this news. On the other hand, the same properties favour the rapid spread of "fake news". Fake news is news of low quality that contains intentionally false information. Its wide dissemination can have extremely negative effects on individuals and society (Shu et al., 2017). Detecting such news is therefore highly relevant.

#### 1.1 Problem statement and aim of the work

In this project we analyse and discuss the main problems in detecting misinformation. We examine a range of news articles using statistical methods. These articles are already labelled as "true" or "false": "true" means that the article corresponds to the facts, "false" means the opposite.

Which goal do we set ourselves? What is the scope of our project work?


Sources:

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang and Huan Liu (2017).
Fake News Detection on Social Media: A Data Mining Perspective.
https://dl.acm.org/doi/10.1145/3137597.3137600

#### 1.2 Structure of the work

The first part of this work deals with....

### 2. Data acquisition

In our project we draw on datasets from Kaggle and Statista. These datasets contain news from American news outlets as well as fake news. What do the data contain and what does their structure look like? The sources are as follows.

### 3. Analysis of the required packages

Which packages are needed to achieve the goal defined in Section X.X?

- pandas provides high-performance, easy-to-use data structures and data analysis tools for Python. Its main data structure is the NumPy-array-based DataFrame, which is comparable to data frames in R. With pandas, Python offers functionality similar to R. The pandas website puts it as follows: "Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R."

- NumPy is the fundamental package for scientific computing in Python. It provides a multi-dimensional data structure, the NumPy array, and many efficiently implemented functions for numerical calculations. Many other important libraries for scientific computing and data analysis are based on NumPy.

- SciPy builds on NumPy and extends it with packages for linear algebra, integration, optimisation, signal processing, statistics and much more. Python with NumPy, SciPy and Matplotlib constitutes a comprehensive tool for scientific calculations of all types. Together they provide functionality comparable to the commercial tool MATLAB.

- Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and graphical user interface toolkits.

Visualisation:

- Bokeh is an interactive visualisation library for Python that enables beautiful and meaningful visual presentation of data in modern web browsers. With Bokeh, you can quickly and easily create interactive plots, dashboards, and data applications.

### 4. Importing the packages

In [6]:
import numpy as np
import pandas as pd

### 5. Data provisioning

The extracted data are loaded into the Jupyter notebook in their original format.

In [54]:
news = pd.read_csv("news_dataset.csv", encoding="latin-1")
news

| | Unnamed: 0 | title | content | publication | label |
|---|---|---|---|---|---|
| 0 | 0 | Muslims BUSTED: They Stole Millions In Govât... | Print They should pay all the back all the mon... | 100percentfedup | fake |
| 1 | 1 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | 100percentfedup | fake |
| 2 | 2 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | 100percentfedup | fake |
| 3 | 3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | 100percentfedup | fake |
| 4 | 4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | 100percentfedup | fake |
| ... | ... | ... | ... | ... | ... |
| 28706 | 15707 | An eavesdropping Uber driver saved his 16-year... | Uber driver Keith Avila picked up a p... | Washington Post | real |
| 28707 | 15708 | Plane carrying six people returning from a Cav... | Crews on Friday continued to search L... | Washington Post | real |
| 28708 | 15709 | After helping a fraction of homeowners expecte... | When the Obama administration announced a... | Washington Post | real |
| 28709 | 15710 | Yes, this is real: Michigan just banned bannin... | This story has been updated. A new law in... | Washington Post | real |
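The `â` in the first title (`Govât...`) is consistent with UTF-8 text being decoded as Latin-1. Whether `news_dataset.csv` is actually UTF-8 encoded is an assumption that has not been verified here. A minimal sketch of the effect, using a made-up string rather than the dataset:

```python
# Assumption: the original headline contained a typographic apostrophe (U+2019).
original = "Gov\u2019t"

# Decoding the UTF-8 bytes as Latin-1 turns the apostrophe into "â" followed by
# two invisible control characters, similar to the output above.
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Reversing the wrong decoding recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If this applies to the dataset, passing `encoding="utf-8"` to `pd.read_csv` might avoid the garbled characters. This would have to be checked against the file, since `latin-1` may have been chosen because UTF-8 decoding failed.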


In [41]:
news.shape

(28711, 5)

In [33]:
news.columns

Index(['Unnamed: 0', 'title', 'content', 'publication', 'label'], dtype='object')
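An `Unnamed: 0` column typically appears when a DataFrame was written to CSV together with its index. In the output above its values differ from the row index in the last rows (for example 15707 at row 28706), so here it is not simply a copy of the current index. A minimal sketch of two ways to handle such a column, using a small in-memory CSV as a hypothetical stand-in for `news_dataset.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: the first, unnamed column is a saved index.
csv_text = (
    ",title,content,publication,label\n"
    "0,Title A,Text A,Source A,fake\n"
    "1,Title B,Text B,Source B,real\n"
)

# Option 1: read the file as-is and drop the leftover column afterwards.
raw = pd.read_csv(io.StringIO(csv_text))
dropped = raw.drop(columns=["Unnamed: 0"])

# Option 2: tell pandas to use the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(dropped.columns))
print(list(indexed.columns))
```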

In [34]:
news.head()

| | Unnamed: 0 | title | content | publication | label |
|---|---|---|---|---|---|
| 0 | 0 | Muslims BUSTED: They Stole Millions In Govât... | Print They should pay all the back all the mon... | 100percentfedup | fake |
| 1 | 1 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | 100percentfedup | fake |
| 2 | 2 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | 100percentfedup | fake |
| 3 | 3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | 100percentfedup | fake |
| 4 | 4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | 100percentfedup | fake |


#### Renaming the columns to unify the base data

Target format: CSV with the column names title, text, source, veracity.

In [44]:
# Rename the columns to the unified schema.
dfnews = news.rename(columns={'content': 'text', 'publication': 'source', 'label': 'veracity'})
dfnews

| | Unnamed: 0 | title | text | source | veracity |
|---|---|---|---|---|---|
| 0 | 0 | Muslims BUSTED: They Stole Millions In Govât... | Print They should pay all the back all the mon... | 100percentfedup | fake |
| 1 | 1 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | 100percentfedup | fake |
| 2 | 2 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | 100percentfedup | fake |
| 3 | 3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | 100percentfedup | fake |
| 4 | 4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | 100percentfedup | fake |
| ... | ... | ... | ... | ... | ... |
| 28706 | 15707 | An eavesdropping Uber driver saved his 16-year... | Uber driver Keith Avila picked up a p... | Washington Post | real |
| 28707 | 15708 | Plane carrying six people returning from a Cav... | Crews on Friday continued to search L... | Washington Post | real |
| 28708 | 15709 | After helping a fraction of homeowners expecte... | When the Obama administration announced a... | Washington Post | real |
| 28709 | 15710 | Yes, this is real: Michigan just banned bannin... | This story has been updated. A new law in... | Washington Post | real |


Renaming the values fake and real to false and true.

In [49]:
# Relabel the classes: 'fake' becomes 'false', 'real' becomes 'true'.
dfnews.loc[dfnews['veracity'] == 'fake', 'veracity'] = 'false'
dfnews.loc[dfnews['veracity'] == 'real', 'veracity'] = 'true'
dfnews

| | Unnamed: 0 | title | text | source | veracity |
|---|---|---|---|---|---|
| 0 | 0 | Muslims BUSTED: They Stole Millions In Govât... | Print They should pay all the back all the mon... | 100percentfedup | false |
| 1 | 1 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | 100percentfedup | false |
| 2 | 2 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | 100percentfedup | false |
| 3 | 3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | 100percentfedup | false |
| 4 | 4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | 100percentfedup | false |
| ... | ... | ... | ... | ... | ... |
| 28706 | 15707 | An eavesdropping Uber driver saved his 16-year... | Uber driver Keith Avila picked up a p... | Washington Post | true |
| 28707 | 15708 | Plane carrying six people returning from a Cav... | Crews on Friday continued to search L... | Washington Post | true |
| 28708 | 15709 | After helping a fraction of homeowners expecte... | When the Obama administration announced a... | Washington Post | true |
| 28709 | 15710 | Yes, this is real: Michigan just banned bannin... | This story has been updated. A new law in... | Washington Post | true |
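As an alternative to two separate `.loc` assignments, the relabelling could be sketched with `Series.map`. The small DataFrame here is made up for illustration. Unlike the `.loc` variant, labels that are missing from the mapping become `NaN`, which makes unexpected values easy to spot.

```python
import pandas as pd

# Made-up sample with the same label values as the dataset.
sample = pd.DataFrame({"veracity": ["fake", "real", "fake"]})

# Map each original label to the unified label.
label_map = {"fake": "false", "real": "true"}
sample["veracity"] = sample["veracity"].map(label_map)

print(sample["veracity"].tolist())
```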


Creating a DataFrame with the columns title, text, source and veracity.

In [50]:
# Keep only the four columns of the target schema.
dfnewsfinal = dfnews[['title', 'text', 'source', 'veracity']]
dfnewsfinal

| | title | text | source | veracity |
|---|---|---|---|---|
| 0 | Muslims BUSTED: They Stole Millions In Govât... | Print They should pay all the back all the mon... | 100percentfedup | false |
| 1 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | 100percentfedup | false |
| 2 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | 100percentfedup | false |
| 3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | 100percentfedup | false |
| 4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | 100percentfedup | false |
| ... | ... | ... | ... | ... |
| 28706 | An eavesdropping Uber driver saved his 16-year... | Uber driver Keith Avila picked up a p... | Washington Post | true |
| 28707 | Plane carrying six people returning from a Cav... | Crews on Friday continued to search L... | Washington Post | true |
| 28708 | After helping a fraction of homeowners expecte... | When the Obama administration announced a... | Washington Post | true |
| 28709 | Yes, this is real: Michigan just banned bannin... | This story has been updated. A new law in... | Washington Post | true |
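The target format is a CSV file with these four columns. A minimal sketch of the export, assuming that the row index should not be written. The sample rows and the in-memory buffer are placeholders; a file name could be passed to `to_csv` instead.

```python
import io
import pandas as pd

# Placeholder rows in the unified schema.
sample = pd.DataFrame({
    "title": ["Title A", "Title B"],
    "text": ["Text A", "Text B"],
    "source": ["Source A", "Source B"],
    "veracity": ["false", "true"],
})

# index=False avoids writing the row index, which would otherwise come back
# as an "Unnamed: 0" column when the file is read again.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the CSV back yields exactly the four target columns.
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.columns))
```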


Provision of the data and merging with the data from Botan and Marc.
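Combining several datasets can be sketched with `pd.concat`, assuming the other datasets have already been brought into the same four-column schema. `df_a` and `df_b` are hypothetical placeholders, not the actual datasets, and the duplicate removal is an optional assumption.

```python
import pandas as pd

columns = ["title", "text", "source", "veracity"]

# Hypothetical placeholders for two datasets in the unified schema.
df_a = pd.DataFrame([["Title A", "Text A", "Source A", "false"]], columns=columns)
df_b = pd.DataFrame([["Title B", "Text B", "Source B", "true"]], columns=columns)

# Stack the rows and renumber the index from 0.
combined = pd.concat([df_a, df_b], ignore_index=True)

# Optionally remove articles that appear in more than one dataset.
combined = combined.drop_duplicates(subset=["title", "text"]).reset_index(drop=True)

print(combined.shape)
```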

### 6. Data analysis

Description of the provided datasets.
Which features need to be examined? Literature research.
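Until the literature research settles which features matter, a first descriptive pass might look at class balance and text length per class. This is only a sketch on made-up rows; the choice of features is an assumption, not a result.

```python
import pandas as pd

# Made-up rows in the unified schema (only the columns needed here).
sample = pd.DataFrame({
    "text": ["short fake text", "a somewhat longer real news text", "another fake"],
    "veracity": ["false", "true", "false"],
})

# Class balance: number of articles per label.
class_counts = sample["veracity"].value_counts()

# Text length in words, then the mean length per label.
sample["n_words"] = sample["text"].str.split().str.len()
mean_words = sample.groupby("veracity")["n_words"].mean()

print(class_counts)
print(mean_words)
```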