## Introduction

For my contribution to the seminar, in addition to my presentation of the paper "Supervised Learning for Fake News Detection" by Reis et al., a written essay about the paper would mostly have restated the extensive procedures for extracting 141 textual features, 5 news source features and 21 environment features. Hence I decided to attempt to reproduce the work programmatically instead.

The main focus of this work is the extraction of the textual features. The reason for this restriction is that, ultimately, the work is meant to be used to classify German fake news, for which only two datasets currently exist, and neither of them provides the information required for extracting news source features or environment features.

So the goal of this work is to see how far one can get using only the textual features. The results on the two German datasets mentioned above, as well as on the dataset originally used by Reis et al., are compared with the results reported in their paper and with the performance of a more conventional classifier that uses simple count vectors for feature extraction.

For the conventional classifier, a German fake news classifier by Dominik Leuziger is used (https://dagshub.com/leudom/german-fake-news-classifier).
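To make "simple count vectors" concrete, here is a minimal standard-library sketch of the idea. It is not the implementation of the referenced classifier; the notebook itself imports scikit-learn's `CountVectorizer` for this purpose. The function names and the two example sentences are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Map every token in the corpus to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Return how often each vocabulary token occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for token, n in counts.items():
        idx = vocabulary.get(token)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vector[idx] = n
    return vector


# Hypothetical toy corpus
docs = ["Merkel sagt nein", "nein nein sagt die Presse"]
vocab = build_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]
```

Each document becomes a fixed-length vector of token counts, which a classifier can consume directly. Stop-word removal and stemming would simply be extra steps inside `tokenize`.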

## Overview

### Step 0 - Imports

### Step 1 - Gathering data

#### Step 1.1 - German data

As mentioned, there are only two German datasets:

- GermanFakeNC: German Fake News Corpus (https://zenodo.org/record/3375714#.Ya8IHtDMKUk)
- Kaggle Starter: Fake News Dataset German 9cc110a2-9 (https://www.kaggle.com/kerneler/starter-fake-news-dataset-german-9cc110a2-9/data)

#### Step 1.1.1 - Scraping of GermanFakeNC articles

- Scraping and generation of the GermanFakeNC dataset

#### Step 1.1.2 - Preparation and merge of datasets

- Preparing and aligning the two German datasets
- Merging the datasets

#### Step 1.2 - English data

Ideally, we would use the original dataset from the work of Reis et al., "Supervised Learning for Fake News Detection". That is:

- BuzzFace: A News Veracity Dataset with Facebook User Commentary and Ego (https://metatext.io/datasets/buzzface)

As we will see in the course of this work, acquiring that dataset in the form in which it was originally used is not possible. Hence another dataset is used, which is less extensive (not only in size) but very close in nature to the BuzzFace dataset:

- BuzzFeed-Webis Fake News Corpus 16 (https://webis.de/data/buzzfeed-webis-fake-news-16.html)

#### Step 1.2.1 - Attempts of generating the BuzzFace dataset

- Showcase of the failed attempt

#### Step 1.2.2 - Generating the BuzzFeed-Webis dataset

- Construction of the BuzzFeed-Webis dataset from XML files

### Step 2 - Performing the extensive (textual only) feature extraction as proposed by Reis et al.

- Step-by-Step reconstruction of 141 textual features

### Step 3 - Running the data (English and German) on the two best classifiers (XGBoost and RFs) used in Reis et al.'s work

- Using XGBoost and RFs to compare the performance of the merged German dataset to the English (BuzzFeed-Webis) dataset

#### Step 3.1 - Using stop words, stemming and count vectorization

#### Step 3.1.1 - XGBoost

#### Step 3.1.1.1 - German dataset

#### Step 3.1.1.2 - English dataset

#### Step 3.1.2 - Random Forest

#### Step 3.1.2.1 - German dataset

#### Step 3.1.2.2 - English dataset

#### Step 3.2 - Using the extensive feature extraction as proposed by Reis et al.

#### Step 3.2.1 - XGBoost

#### Step 3.2.1.1 - German dataset

#### Step 3.2.1.2 - English dataset

#### Step 3.2.2 - Random Forest

#### Step 3.2.2.1 - German dataset

#### Step 3.2.2.2 - English dataset

### Step 4 - Overview of the results



## Step 0 - Imports

In [3]:
# Standard library
import copy
import logging
import os
import random
import xml.etree.ElementTree as ET

# Third-party packages
import dagshub
import nltk
import nltk.corpus.reader.tagged as tagged
import pandas as pd
import py7zr
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
from joblib import dump
from newspaper import Article
from nltk.stem.snowball import GermanStemmer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix, f1_score, recall_score, precision_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Log at INFO level via the root logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

## Step 1 - Gathering data

### Step 1.1 - German data

### Step 1.1.1 - Scraping of GermanFakeNC articles

This step takes the GermanFakeNC dataset and crawls through its entries to retrieve the title and body text from each sample's URL.

Credit goes to https://dagshub.com/leudom/german-fake-news-classifier/src/master/src/data/scrape_news.py

In [2]:
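# Download and parse the article behind `url` and return its title.
# Falls back to the placeholder 'No title' if downloading or parsing fails.
def extract_title(url):
    article = Article(url)
    try:
        article.download()
        logger.info('Article title downloaded from %s' % url)
        article.parse()
    except Exception:
        article.title = 'No title'

    return article.title

# Download and parse the article behind `url` and return its body text.
# Falls back to the placeholder 'No text' if downloading or parsing fails.
def extract_text(url):
    article = Article(url)
    try:
        article.download()
        logger.info('Article text downloaded from %s' % url)
        article.parse()
    except Exception:
        article.text = 'No text'

    return article.text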

log_fmt = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
logging.basicConfig(level=logging.INFO, format=log_fmt)
logger = logging.getLogger()

with py7zr.SevenZipFile('german_datasets.7z', 'r') as z:
    z.extractall()

df = pd.read_json("GermanFakeNC.json")

logger.info('Head of dataframe: \n%s' % df.head())

# Get title and text for each sample's URL
df['titel'] = df['URL'].apply(extract_title)
df['text'] = df['URL'].apply(extract_text)

logger.info('Head of dataframe after parsing: \n%s' % df.head())

# Keep only rows for which both title and text could be retrieved
has_info_mask = (df['titel'] != 'No title') & (df['text'] != 'No text')
df_final = df[has_info_mask]

logger.info('Shape of final dataframe: %s' % str(df_final.shape))
logger.info('dtypes: \n%s' % str(df_final.dtypes))
logger.info('Rows with null values: \n%s' % df_final.isnull().sum())

# Save as CSV
try:
    df_final.to_csv('df_GermanFakeNC.csv', index=False)
    logger.info("CSV was saved to disk")
except Exception:
    # logger.exception already attaches the traceback
    logger.exception("Couldn't save CSV to disk")

INFO:root:Head of dataframe: 
        Date                                                URL  \
0 2017-08-30  https://schluesselkindblog.com/2017/08/30/proz...   
1 2017-12-18  http://blauerbote.com/2017/12/18/bild-journali...   
2 2017-06-02  http://blauerbote.com/2017/06/02/angela-merkel...   
3 2017-09-25  http://smopo.ch/deutschlands-neonazis-waehlen-...   
4 2018-02-17  http://www.truth24.net/gruppenvergewaltigung-s...   

  False_Statement_1_Location False_Statement_1_Index  \
0                       Text                 213-237   
1                       Text                   13-36   
2                      Title                     1-7   
3                      Title                     1-5   
4                      Title                     1-1   

  False_Statement_2_Location False_Statement_2_Index  \
0                                                      
1                       Text                   52-81   
2                       Text                   70-94   
3     

INFO:root:Article title downloaded from https://netzfrauen.org/2017/11/05/education/#more-53491
INFO:root:Article title downloaded from https://philosophia-perennis.com/2018/01/30/iris-masson-1/
INFO:root:Article title downloaded from https://www.berlinjournal.biz/arabische-grossfamilien-berlins-polizei/
INFO:root:Article title downloaded from https://dieunbestechlichen.com/2018/02/soros-warnt-vor-zusammenbruch-der-eu-und-schwaermt-von-der-nazi-herrschaft/
INFO:root:Article title downloaded from https://dieunbestechlichen.com/2017/12/berlin-72-jaehriger-im-u-bahnhof-krankenhausreif-getreten/
INFO:root:Article title downloaded from https://perspektive-online.net/2018/02/erneut-angriffe-auf-demonstrationen-in-solidaritaet-mit-kurden/
INFO:root:Article title downloaded from http://smopo.ch/oesterreich-skandal-praesident-poebelt-gegen-mehrheit-der-briten/
INFO:root:Article title downloaded from http://new.euro-med.dk/20180219-merkel-preist-italien-weil-es-immigranten-holt-wir-haben-enge-zu

INFO:root:Article title downloaded from https://www.journalistenwatch.com/2018/02/13/abgelehnte-asylbewerber-sorgen-fuer-bahnhofssperrung/
INFO:root:Article title downloaded from http://zuerst.de/2017/11/03/integration-steht-dem-ziel-und-we%C2%ADsen-des-asyls-entgegen/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/arme-fluechtlinge-muessen-zur-tafel-weil-sie-im-jahr-milliarden-eurer-steuern-in-die-heimat-schicken/
INFO:root:Article title downloaded from http://smopo.ch/deutschland-verbrechen-mit-messern-um-1200-prozent-angestiegen/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/massiver-islamistenangriff-am-24-12-2017-auf-polizei-in-reutlingen/
INFO:root:Article title downloaded from http://www.guidograndt.de/2018/02/19/weisse-sklavinnen-sollen-in-fluechtlingslagern-zur-unterhaltung-dienen-so-uebel-wird-gegen-europaeische-deutsche-frauen-gehetzt-4/
INFO:root:Article title downloaded from http://smopo.ch/schweden-werden-auf-eigenenvoelkermord-v

INFO:root:Article title downloaded from https://blog.halle-leaks.de/neue-spielplatzverordnung-ab-20-muslime-nur-noch-spielplaetze-in-islamoptik/
INFO:root:Article title downloaded from http://www.mmnews.de/wirtschaft/36386-siemens-proteste-merkels-arbeitslose
INFO:root:Article title downloaded from https://www.compact-online.de/verdaechtiges-paket-am-potsdamer-weihnachtsmarkt-evakuierung/
INFO:root:Article title downloaded from https://dieunbestechlichen.com/2017/12/maulkorb-eu-projekt-erklaert-journalisten-wie-sie-ueber-die-massenmigration-zu-berichten-haben/
INFO:root:Article title downloaded from http://www.epochtimes.de/politik/deutschland/mobbing-in-der-schule-voranschreitende-radikalisierung-des-islam-ist-ein-problem-a2362262.html?meistgelesen=1
INFO:root:Article title downloaded from http://blauerbote.com/2017/09/21/spd-politiker-bringt-blogger-in-den-knast/
INFO:root:Article title downloaded from http://noch.info/2016/05/der-cia-agent-bin-laden-ist-lebendig-und-laesst-es-sich-a

INFO:root:Article title downloaded from https://blog.halle-leaks.de/abgelehnter-fluechtling-schlitzt-waerter-im-knast-mit-cuttermesser-auf/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/zivilcourage-einbrecher-fluechtling-mit-baseballschlaeger-vermoebelt/
INFO:root:Article title downloaded from https://philosophia-perennis.com/2017/10/25/deutschland-hoelle/
INFO:root:Article title downloaded from https://schluesselkindblog.com/2018/01/01/duesseldorf-mit-auto-gezielt-in-personengruppe-gefahren/#comments
INFO:root:Article title downloaded from https://deutsch.rt.com/newsticker/65577-zu-viele-weisse-gluckliche-christen-ungarn-kulturhauptstadt/
INFO:root:Article title downloaded from https://opposition24.com/von-wegen-die-republik-rueckt-nach-rechts/356935
INFO:root:Article title downloaded from https://de.sott.net/article/32235-Vertuscht-von-ganz-oben-Tausende-Padophile-arbeiten-fur-die-UN
INFO:root:Article title downloaded from https://blog.halle-leaks.de/der-weihna

INFO:root:Article title downloaded from http://www.epochtimes.de/politik/deutschland/freiberg-beantragt-zuzugsstopp-von-migranten-aussergewoehnlich-hohe-zuwanderung-schafft-integrationsprobleme-a2337416.html
INFO:root:Article title downloaded from https://dieunbestechlichen.com/2017/12/die-ideologie-industrie-kampf-gegen-rechts-bringt-linken-organisationen-100-mio-aus-steuergeldern/
INFO:root:Article title downloaded from http://smopo.ch/frankreich-keine-post-fuer-migranten/
INFO:root:Article title downloaded from http://info-direkt.eu/2018/01/20/fp-politiker-kontern-asyl-vorwuerfe/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/jeden-tag-werden-in-deutschland-10-menschen-von-fluechtlingen-abgestochen/
INFO:root:Article title downloaded from https://opposition24.com/der-marsch-wenn-staat/410354
INFO:root:Article title downloaded from http://www.truth24.net/suedlaender-entbloesst-sich-vor-2-maedchen-13-und-onaniert-schon-wieder-in-essen/
INFO:root:Article title down

INFO:root:Article title downloaded from http://www.truth24.net/cottbus-araber-aufnahmestopp-wirkungslos-buerger-wehren-sich-nun-selbst/
INFO:root:Article title downloaded from http://www.guidograndt.de/2018/02/02/luegen-luegen-luegen-aus-6-mach-12-die-verschwiegene-wahrheit-ueber-den-familiennachzug/
INFO:root:Article title downloaded from http://news-for-friends.de/der-naechste-weihnachtsmarkt-abgeschafft-heisst-in-elmshorn-jetzt-wintermarkt/
INFO:root:Article title downloaded from https://dieunbestechlichen.com/2017/12/trauern-verboten-kranz-und-gedenk-gesteck-der-afd-am-breitscheidplatz-wurden-entfernt/
INFO:root:Article title downloaded from http://blauerbote.com/2017/03/11/propaganda-in-der-wikipedia/
INFO:root:Article title downloaded from https://alpenschau.com/2017/10/19/las-vegas-attentat-loeschte-das-fbi-augenzeugen-handys-warum/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/merkels-leichenberg-wieder-hoeher-fluechtling-bringt-mann-mit-messer-um/
INFO:ro

INFO:root:Article title downloaded from http://www.gegenfrage.com/258-millionen-vereinte-nationen/
INFO:root:Article title downloaded from http://www.guidograndt.de/2017/10/26/suendenpfuhl-eu-parlament-vergewaltigung-prostitution-sexuelle-belaestigung-so-schuetzen-sich-die-politik-eliten/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/islam-ist-frieden-frau-wegen-kuss-ueberfahren-400-meter-mitgeschleift/
INFO:root:Article title downloaded from https://opposition24.com/manipulative-meinungsmacher-journalisten-spaltpilz/345987
INFO:root:Article title downloaded from https://opposition24.com/klimaluege-so-un-anstieg/390321
INFO:root:Article title downloaded from http://zuerst.de/2017/11/23/renitente-asylanten-bewohner-geschaeftsleute-und-polizisten-protestieren-auf-lesbos/
INFO:root:Article title downloaded from http://smopo.ch/lagerfeld-kritisiert-merkels-importierten-judenhass-und-er-hat-recht/
INFO:root:Article title downloaded from http://info-direkt.eu/2018/02/12

INFO:root:Article title downloaded from https://www.journalistenwatch.com/2017/09/21/wen-wuerde-allah-waehlen/
INFO:root:Article title downloaded from http://smopo.ch/zehntausende-terroristen-wollen-nach-europa/
INFO:root:Article title downloaded from https://www.journalistenwatch.com/2018/02/18/berliner-polizei-und-antifa-gemeinsam-gegen-frauenrechte/
INFO:root:Article title downloaded from https://blog.halle-leaks.de/merkels-terror-fluechtlinge-busfahrer-brutal-zusammengeschlagen-auf-einem-auge-nun-blind-arbeitslos/
INFO:root:Article title downloaded from https://www.zeitenschrift.com/artikel/masseneinwanderung-und-die-luegenpresse
INFO:root:Article title downloaded from https://blog.halle-leaks.de/putsch-islamistische-polizisten-darsteller-wollen-berliner-polizei-uebernehmen-massenschlaegerei/
INFO:root:Article title downloaded from http://blauerbote.com/2017/12/09/landgericht-hamburg-wer-fake-news-aufdeckt-macht-sich-strafbar/
INFO:root:Article title downloaded from http://www.alle

INFO:root:Article text downloaded from https://blog.halle-leaks.de/allahu-helau-massive-sexuelle-uebergriffe-durch-fluechtlinge-zu-karnevalsbeginn-in-koeln/
INFO:root:Article text downloaded from https://opposition24.com/unfassbar-euro-sachschaden-asylbewerber/369326
INFO:root:Article text downloaded from http://www.pi-news.net/hamburg-pakistaner-schaechtet-zweijaehrige-tochter/
INFO:root:Article text downloaded from http://www.guidograndt.de/2018/01/30/verschwiegen-vom-mainstream-superman-putin-diese-unglaublichen-erfolge-hat-der-russische-praesident-vorzuweisen/
INFO:root:Article text downloaded from https://de.sott.net/article/32015-Die-meisten-Forschungsergebnisse-sind-falsch-Interessenskonflikte-in-wissenschaftlicher-Forschung
INFO:root:Article text downloaded from http://www.journalistenwatch.com/2017/12/14/mehrere-festnahmen-bei-islamisten-razzia-in-berlin/
INFO:root:Article text downloaded from https://schluesselkindblog.com/2018/02/21/gauland-zusammenarbeit-mit-pegida-ist-moeg

INFO:root:Article text downloaded from https://www.unzensuriert.at/content/0025635-Fregatte-Mecklenburg-Vorpommern-kooperiert-mit-NGO-Schiffen-als-Migranten-Taxi
INFO:root:Article text downloaded from http://www.anonymousnews.ru/2017/10/13/gefaengnisse-abschaffen-gruene-wollen-90-prozent-der-haeftlinge-in-offenen-vollzug-entlassen/
INFO:root:Article text downloaded from http://www.epochtimes.de/politik/europa/polizeichef-von-schweden-warnt-die-regierung-hat-die-kontrolle-ueber-das-land-verloren-integration-gescheitert-a2156608.html
INFO:root:Article text downloaded from https://blog.halle-leaks.de/bestialischer-ueberfall-fluechtling-enthauptet-oma-98-beim-klau-der-halskette-fast/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/rezession-beendet-keine-arbeitslosigkeit-mehr-betteln-ist-arbeit/
INFO:root:Article text downloaded from http://zuerst.de/2017/11/30/linksextremisten-drohen-afd-wir-werden-da-reingehen-und-den-parteitag-abbrechen/
INFO:root:Article text downloa

INFO:root:Article text downloaded from http://news-for-friends.de/merkel-eheman-sauer-wird-vorstand-in-der-springer-stiftung/
INFO:root:Article text downloaded from http://smopo.ch/berlin-merkel-laesst-dissidenten-verhaften/
INFO:root:Article text downloaded from http://www.truth24.net/migrant-versucht-frau-in-wolfenbuettel-zu-vergewaltigen/
INFO:root:Article text downloaded from https://de.sott.net/article/31968-So-schaut-Propaganda-aus-Hungernder-Eisbar-in-der-Arktis-gefilmt-Und-der-Mensch-tragt-keine-Schuld
INFO:root:Article text downloaded from https://blog.halle-leaks.de/zu-heiligabend-islamist-zuendet-hund-bei-lebendigem-leibe-an/
INFO:root:Article text downloaded from http://www.anonymousnews.ru/2017/10/20/staatsterrorismus-landeskriminalamt-stiftete-amri-zu-anschlag-auf-berliner-weihnachtmarkt-an/
INFO:root:Article text downloaded from https://www.compact-online.de/forscher-rechnen-volksaustausch-vor-75-millionen-muslime-in-eu-bis-2050-grenzschliessung-zwecklos/
INFO:root:Artic

INFO:root:Article text downloaded from https://schluesselkindblog.com/2018/02/22/keine-migranten-mehr-essener-tafel-zieht-notbremse/
INFO:root:Article text downloaded from https://dieunbestechlichen.com/2017/12/fluechtlings-aktion-der-bahn-junge-weisse-wehrlose-maedchen-als-lockvogel/
INFO:root:Article text downloaded from http://www.anonymousnews.ru/2017/10/13/verwaltungsgerichtshof-legalisiert-polygamie-syrer-wird-erster-deutscher-mit-zwei-ehefrauen/
INFO:root:Article text downloaded from http://www.noislam.de/afrikanischer-islamist-zuendet-an-heilig-abend-hund-an-und-verbrennt-ihn-zu-tode/
INFO:root:Article text downloaded from http://www.pi-news.net/2018/01/ingolstadt-mordangriff-auf-pizzaboten-mit-kettensaege-und-beil/
INFO:root:Article text downloaded from http://www.noislam.de/wegen-abfuhr-wuetender-schwarzafrikaner-sticht-frau-21-in-den-hals-todeskampf-in-celle/
INFO:root:Article text downloaded from http://www.guidograndt.de/2018/02/20/videodanke-mama-merkel-syrer-mit-2-frauen

INFO:root:Article text downloaded from http://www.journalistenwatch.com/2017/12/20/bka-statistik-700-000-straftaten-durch-fluechtlinge/
INFO:root:Article text downloaded from http://www.truth24.net/armutsmigrant-attackiert-studentin-auf-universitaets-toilette-campus-freiburg/
INFO:root:Article text downloaded from http://blauerbote.com/2017/05/07/angela-merkel-setzt-bundeswehr-gegen-demonstranten-ein/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/luegenpresse-verharmlost-regelmaessig-brutale-vergewaltigungen-durch-fluechtlinge/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/wieder-joggerin-von-vergwaltiger-fluechtling-angefallen/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/achtung-fluechtlinge-lauern-friedhofsbesucher-auf-um-zu-vergewaltigen/
INFO:root:Article text downloaded from http://smopo.ch/deutschland-spd-aussenminister-gabriel-sorgt-fuer-verwirrung/
INFO:root:Article text downloaded from http://www.epochtimes.de/po

INFO:root:Article text downloaded from http://smopo.ch/vergewaltigungen-von-frauen-vor-der-legalisierung/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/wieder-grausame-kinder-vergewaltigung-durch-fluechtlinge-im-kalifat-nrw/
INFO:root:Article text downloaded from http://www.anonymousnews.ru/2017/11/22/muenchen-asylbewerber-aus-nigeria-pruegelt-67-jaehrigen-deutschen-tot-kein-haftgrund/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/nach-schon-einem-jahr-fotofahndung-nach-muslim-schlaegern/
INFO:root:Article text downloaded from http://www.truth24.net/gruppenvergewaltigung-afrikaner-fallen-mit-messer-ueber-junge-frau-her-hannover/
INFO:root:Article text downloaded from http://smopo.ch/nur-noch-mitleid-fuer-die-deutschen/
INFO:root:Article text downloaded from http://www.anonymousnews.ru/2017/10/23/linke-no-go-area-leipzig-connewitz-wegen-frei-wild-jacke-halb-tot-geschlagen/
INFO:root:Article text downloaded from https://opposition24.com/poggenbur

INFO:root:Article text downloaded from https://schluesselkindblog.com/2018/02/17/zwei-deutsche-wegen-behandlungen-durch-unqualifizierte-migranten-aerzte-gestorben/
INFO:root:Article text downloaded from http://zuerst.de/2018/01/29/nach-putin-jetzt-auch-netanjahu-deutschland-bestenfalls-eingeschraenkt-souveraen/
INFO:root:Article text downloaded from http://smopo.ch/kanzlerin-merkel-fuer-unbegrenzt-mehr-moslem-migranten/
INFO:root:Article text downloaded from http://smopo.ch/deutschland-300-bei-anti-terror-demo-koeln/
INFO:root:Article text downloaded from http://www.noislam.de/arabischer-mob-misshandelt-seniorin-59-ermordet-und-verbrennt-ihren-mann-giessen/
INFO:root:Article text downloaded from http://www.wsws.org/de/articles/2018/03/01/mili-m01.html
INFO:root:Article text downloaded from http://blaue-flora.de/415-sturmgewehre-polizei-in-berlin-ruestet-sich-fuer-den-buergerkrieg/
INFO:root:Article text downloaded from https://www.journalistenwatch.com/2018/02/01/weil-der-staat-versagt

INFO:root:Article text downloaded from https://de.sott.net/article/32257-Doppelmoral-Neid-und-Hame-Russischer-Sieg-Spiegel-fordert-Verbot-von-Eiskunstlauf
INFO:root:Article text downloaded from http://www.journalistenwatch.com/2017/12/20/weihnachten-2017-mer-bewache-dae-dom-en-koelle/
INFO:root:Article text downloaded from http://www.rapefugees.net/sex-jihadist-zieht-maedchen-10-vom-fahrrad-und-vergewaltigt-es-im-gebuesch-brutal-leipzig/
INFO:root:Article text downloaded from http://www.guidograndt.de/2018/02/08/kollegenbeitrag-die-grosse-familiennachzugs-verarsche/
INFO:root:Article text downloaded from https://blog.halle-leaks.de/mainz-kind-in-lebensgefahr-nach-gruppenvergewaltigung-durch-drei-islamisten/
INFO:root:Article text downloaded from http://www.epochtimes.de/politik/deutschland/lueneburg-gymnasium-verschiebt-weihnachtsfeier-muslimin-stoerten-die-lieder-a2299536.html
INFO:root:Article text downloaded from http://smopo.ch/deutschland-macht-wieder-kinder-zu-soldaten/
INFO:root

INFO:root:Article text downloaded from https://blog.halle-leaks.de/pervers-staatlich-organisierte-verpaarung-von-fluechtlingen-und-einheimischen/
INFO:root:Article text downloaded from http://www.truth24.net/achtung-polizei-bad-homburg-fahndet-nach-brutalem-moerder-mudasar-ali-rana/
INFO:root:Article text downloaded from http://www.epochtimes.de/politik/deutschland/duisburg-26-schulen-haben-ueber-70-prozent-kinder-mit-migrationshintergrund-a2350647.html
INFO:root:Article text downloaded from https://www.unzensuriert.at/content/0025772-Frauenbadetag-Picknick-am-Beckenrand-Urin-im-Mistkuebel-Babywindeln-im-Planschbecken
INFO:root:Article text downloaded from https://blog.halle-leaks.de/video-koelner-linksmaden-irrsinn-armbaender-sollen-vor-vergewaltigungen-schuetzen/
INFO:root:Article text downloaded from http://news-for-friends.de/islamischer-killerfluechtling-will-neuen-prozess-aber-vor-scharia-gericht/
INFO:root:Article text downloaded from http://www.allesroger.at/artikel/linker-mein
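The scraping cell above calls `extract_title` and `extract_text` separately, so every URL is downloaded and parsed twice. The following sketch shows one way to fetch both fields in a single pass. It is an illustrative variant and not part of the credited script: the function name `extract_title_and_text` and the `article_factory` parameter are assumptions introduced here so the logic can be tried without network access.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def extract_title_and_text(
    url: str, article_factory: Optional[Callable] = None
) -> Tuple[str, str]:
    """Download an article once and return (title, text).

    Returns the placeholders ('No title', 'No text') if downloading
    or parsing fails, mirroring the placeholders used in the notebook.
    """
    if article_factory is None:
        # Imported lazily so the sketch can be loaded without newspaper installed
        from newspaper import Article
        article_factory = Article

    article = article_factory(url)
    try:
        article.download()
        article.parse()
    except Exception:
        logger.warning('Could not retrieve article from %s', url)
        return 'No title', 'No text'

    logger.info('Article downloaded from %s', url)
    return article.title, article.text
```

With a dataframe as in the notebook, the two columns could then be filled from one pass over the URLs, for example via `titles, texts = zip(*df['URL'].map(extract_title_and_text))`. This halves the number of requests and makes the two columns come from the same download.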

### Step 1.1.2 - Preparation and merge of datasets

Credit goes to https://dagshub.com/leudom/german-fake-news-classifier/src/master/src/data/make_dataset.py

In [4]:

def prepare_kegglenews_csv(filepath):
    """ 
    1.) Drop columns -> Kategorie, Quelle, Art
    2.) Check for duplicates in Titel and Body and drop them, keeping the first entry
    3.) Rename columns to match the other dataset (GermanFakeNC)
    4.) Add column src_name with value news_csv to identify the source of a row after merging
    """

    # Read news.csv from disk
    _df = pd.read_csv(filepath)
    logger.debug(_df.info())
    # Drop cols
    logger.info('Null values in news.csv: \n%s' % _df.isnull().sum())
    cols_to_drop = ['Kategorie', 'Quelle', 'Art']
    _df.drop(cols_to_drop, axis=1, inplace=True)
    logger.info('Cols %s dropped' % cols_to_drop)


    # Drop duplicates
    logger.info('Percent duplicated Titel and Body: \n%s' % str(
        _df.duplicated(subset=['Titel', 'Body']).value_counts(normalize=True)))
    _df.drop_duplicates(subset=['Titel', 'Body'], inplace=True)
    logger.info('Duplicates in Titel and Body dropped')

    # Rename Cols
    new_cols = {'id': 'src_id',
                'Titel': 'title',
                'Body': 'text',
                'Datum': 'date',
                'Fake': 'fake'}
    _df.rename(columns=new_cols, inplace=True)
    logger.info('Cols renamed')

    # Add col source_name
    _df['src_name'] = 'news_csv'

    return _df


def prepare_germanfakenc(filepath):
    """ 
    1.) Drop columns -> [False_Statement_1_Location,
                         False_Statement_1_Index,
                         False_Statement_2_Location,
                         False_Statement_2_Index,
                         False_Statement_3_Location,
                         False_Statement_3_Index,
                         Ratio_of_Fake_Statements,
                         Overall_Rating]
        We treat all entries as fake news, even though some instances
        have a very low overall fake rating!
    2.) Reset the index and use it as src_id
    3.) Check for duplicates in titel and text and drop them, keeping the first entry
    4.) Drop rows where titel or text is null 
    5.) Fill missing dates -> from the URLs we can see that the date could
        be 2017/12 
    6.) Rename columns to match the other dataset (news.csv)
    7.) Add label col 'fake' = 1 -> all 1; col 'src_name' = 'GermanFakeNC'
    """

    # Read the scraped GermanFakeNC CSV from disk
    _df = pd.read_csv(filepath)
    
    logger.debug(_df.info())
    # Drop cols
    logger.info('Null values in GermanFakeNC_interim.csv: \n%s' % _df.isnull().sum())
    cols_to_drop = ['False_Statement_1_Location',
                    'False_Statement_1_Index',
                    'False_Statement_2_Location',
                    'False_Statement_2_Index',
                    'False_Statement_3_Location',
                    'False_Statement_3_Index',
                    'Ratio_of_Fake_Statements',
                    'Overall_Rating']
    _df.drop(cols_to_drop, axis=1, inplace=True)
    logger.info('Cols %s dropped' % cols_to_drop)

    # Set source_id
    _df.reset_index(inplace=True)
    logger.info('Index reset')
    
    # Drop duplicates
    logger.info('Percent duplicated titel and text: \n%s' % str(
        _df.duplicated(subset=['titel', 'text']).value_counts(normalize=True)))
    _df.drop_duplicates(subset=['titel', 'text'], inplace=True)
    logger.info('Duplicates in titel and text dropped')

    # Drop rows where titel or text is null
    _df.dropna(subset=['titel', 'text'], inplace=True)
    logger.info('Null rows for titel and text dropped')

    # Fill the missing dates with 1 December 2017 (explicit Timestamp, no ambiguous day/month string)
    _df['Date'] = _df['Date'].fillna(pd.Timestamp(2017, 12, 1))

    # Rename Cols
    new_cols = {'index': 'src_id',
                'titel': 'title',
                'Date': 'date',
                'URL': 'url'}
    _df.rename(columns=new_cols, inplace=True)
    logger.info('Cols renamed')

    # Add label col 'fake' and col 'src_name'
    _df['fake'] = 1
    _df['src_name'] = 'GermanFakeNC'

    return _df

def merge_datasets(df_1, df_2):
    logger.info('Shape: %s\n Columns: %s' % (df_1.shape, df_1.columns))
    logger.info('Shape: %s\n Columns: %s' % (df_2.shape, df_2.columns))
    # Check col names
    sym_diff = set(df_1).symmetric_difference(set(df_2))
    assert len(sym_diff) == 0 , 'Differences in colnames of the two datasets'
    return pd.concat([df_1, df_2], axis=0, ignore_index=True)


log_fmt = '%(asctime)s - %(name)s - %(levelname)s : %(message)s'
logging.basicConfig(level=logging.INFO, format=log_fmt)
logger = logging.getLogger()

KEGGLENEWS_CSV = 'KeggleNews.csv'
GERMAN_FAKE_NC = 'df_GermanFakeNC.csv'
OUTPUT = 'german_datasets_merged.csv'

df_news = prepare_kegglenews_csv(KEGGLENEWS_CSV)
df_gfn = prepare_germanfakenc(GERMAN_FAKE_NC)
df_merged = merge_datasets(df_news, df_gfn)

try:
    df_merged.to_csv(OUTPUT, sep=',', index=False)
    logger.info('Final dataset prepared and saved to %s' % OUTPUT)
except Exception:
    # logger.exception already attaches the traceback
    logger.exception('File could not be saved to disk')

INFO:root:Null values in news.csv: 
id               0
url              0
Titel            0
Body             0
Kategorie     1322
Datum            0
Quelle           0
Fake             0
Art          40972
dtype: int64
INFO:root:Cols ['Kategorie', 'Quelle', 'Art'] dropped


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63868 entries, 0 to 63867
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         63868 non-null  int64 
 1   url        63868 non-null  object
 2   Titel      63868 non-null  object
 3   Body       63868 non-null  object
 4   Kategorie  62546 non-null  object
 5   Datum      63868 non-null  object
 6   Quelle     63868 non-null  object
 7   Fake       63868 non-null  int64 
 8   Art        22896 non-null  object
dtypes: int64(2), object(7)
memory usage: 4.4+ MB


INFO:root:Percent duplicated Titel and Body: 
False    0.980444
True     0.019556
dtype: float64
INFO:root:Duplicates in Titel and Body dropped
INFO:root:Cols renamed
INFO:root:Null values in GermanFakeNC_interim.csv: 
Date                           19
URL                             0
False_Statement_1_Location      0
False_Statement_1_Index        14
False_Statement_2_Location     84
False_Statement_2_Index        89
False_Statement_3_Location    155
False_Statement_3_Index       158
Ratio_of_Fake_Statements        0
Overall_Rating                  0
titel                           0
text                            5
dtype: int64
INFO:root:Cols ['False_Statement_1_Location', 'False_Statement_1_Index', 'False_Statement_2_Location', 'False_Statement_2_Index', 'False_Statement_3_Location', 'False_Statement_3_Index', 'Ratio_of_Fake_Statements', 'Overall_Rating'] dropped
INFO:root:Index reset
INFO:root:Percent duplicated titel and text: 
False    0.890244
True     0.109756
dtype: float64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246 entries, 0 to 245
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Date                        227 non-null    object 
 1   URL                         246 non-null    object 
 2   False_Statement_1_Location  246 non-null    object 
 3   False_Statement_1_Index     232 non-null    object 
 4   False_Statement_2_Location  162 non-null    object 
 5   False_Statement_2_Index     157 non-null    object 
 6   False_Statement_3_Location  91 non-null     object 
 7   False_Statement_3_Index     88 non-null     object 
 8   Ratio_of_Fake_Statements    246 non-null    int64  
 9   Overall_Rating              246 non-null    float64
 10  titel                       246 non-null    object 
 11  text                        241 non-null    object 
dtypes: float64(1), int64(1), object(10)
memory usage: 23.2+ KB


INFO:root:Final dataset prepared and saved to german_datasets_merged.csv


### Step 1.2 - English data

Ideally, the same dataset as in the original paper would be used, so that any difference in performance can be attributed solely to the implementation of the feature extraction.

### Step 1.2.1 - Attempts at generating the BuzzFace dataset

Unfortunately, the BuzzFace dataset is only available in the form of a scraping framework. This means that if you attempted to scrape the same news posts today, the dataset would turn out differently, as a lot of posts have vanished from the internet since 2016. Manually looking up some of the post links provided in the framework confirms this.

Apart from that, the framework was written in Python 2 and includes Linux bash calls. Those two circumstances can be worked around. On top of that, the framework uses Facebook's Graph API and a Disqus API. The former requires App ID and Secret ID keys, which, among other things, require a verified Facebook account, for which personal identification documents have to be submitted. Depending on one's view on handing personal information to social media platforms like Facebook, this can be a rather big issue.

But as mentioned, even if all of that is worked around, the problem of scraping the web 6 years later remains.

### Step 1.2.2 - Generating the BuzzFeed-Webis dataset

There happens to be a very similar dataset, called the BuzzFeed-Webis dataset. It closely resembles the BuzzFace dataset, except that it is a lot smaller. Fortunately, it comes as a fixed download and does not have to be scraped first.

Here is an excerpt copied from the dataset's README:

<hr style="border:2px solid gray"> </hr>

BuzzFeed-Webis Fake News Corpus 2016
====================================
The corpus comprises the output of 9 publishers in a week close to the US elections. Among the selected publishers are 6 prolific hyperpartisan ones
(three left-wing and three right-wing), and three mainstream publishers (see Table 1). All publishers earned Facebook’s blue checkmark, indicating authenticity and an elevated status within the network. For seven weekdays (September 19 to 23 and September 26 and 27), every post and linked news article of the 9 publishers was fact-checked by professional journalists at BuzzFeed. In total, 1,627 articles were checked, 826 mainstream, 256 left-wing and 545 right-wing. The imbalance between categories results from differing publication frequencies.


The corpus comes with the following files:

##### README.txt

This file.

##### web-archives/*.warc

The web archive files that contain the HTTP messages that where sent and received during the crawl

##### articles/*.xml 

The articles extracted from the web archive files in XML format with annotations.

##### schema.xsd

Schema of the article files with explanations of the used XML tags. Can be used with object binding libraries (like JAXB) to parse the XML.

##### overview.csv

Giving the portal, orientation, veracity, and URL for each article. The same data is also contained in the XML files.

<hr style="border:2px solid gray"> </hr>

So, in short, the only part of this dataset we need is the articles folder, which contains one XML file per sample. These files provide everything used here: mainText, title and veracity.

Let's have a quick look at the basic functionality of ElementTree using one of the XML files.

Analogous to: https://towardsdatascience.com/extracting-information-from-xml-files-into-a-pandas-dataframe-11f32883ce45
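Before turning to the real files, the lookup pattern can be tried on a small inline document. This is only a sketch: `sample_xml` is a made-up miniature article (not taken from the corpus) and `get_text` is a hypothetical helper. It shows that `find` returns `None` for a missing element, and that an element without content has `text` set to `None`, so both cases need a fallback.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature article; the real files contain more elements
# (author, hyperlink, paragraph, ...). The title is left out on purpose.
sample_xml = """<article>
  <mainText>Some article text.</mainText>
  <veracity>mostly true</veracity>
</article>"""

root = ET.fromstring(sample_xml)


def get_text(root, tag, default=''):
    """Return the text of the first child named `tag`, or `default` if absent or empty."""
    element = root.find(tag)
    if element is None or element.text is None:
        return default
    return element.text


# Collect the three fields of interest into one record
record = {tag: get_text(root, tag) for tag in ('mainText', 'title', 'veracity')}
print(record)
```

The missing title ends up as an empty string, which mirrors how missing elements are treated when the dataframe is built.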

In [5]:
path = 'articles/'
files = os.listdir(path)
print(len(files))

1627


In [6]:
file_path_file1 = os.path.join(path, files[0])
tree = ET.parse(file_path_file1)
root = tree.getroot()
print(root.tag, root.attrib)
for child in root:     
    print(child.tag, child.attrib)

article {}
author {}
hyperlink {'href': 'http://abcnews.go.com/topics/news/us/ronald-reagan.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/history/richard-nixon.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/elections/gallup-poll.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/history/gerald-ford.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/us/president-jimmy-carter.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/us/walter-mondale.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/us/al-gore.htm'}
hyperlink {'href': 'http://abcnews.go.com/topics/news/us/bill-clinton.htm'}
hyperlink {'href': 'http://www.langerresearch.com/wp-content/uploads/PollingMemoDoTheDebatesMatter.pdf'}
mainText {}
orientation {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragraph {}
paragra

#### Let's generate a dataframe from the XML files

In [7]:
# keep track of missing elements in the xml tree
# (initialised once, outside the loop, so the counts accumulate over all files)
missing = {'mainText': 0, 'title': 0, 'veracity': 0}
rows = []

for file in files:
    file_path = os.path.join(path, file)
    #print('Processing....'+file_path)
    root = ET.parse(file_path).getroot()

    data_dict = {}
    for tag in missing:
        element = root.find(tag)
        if element is not None:
            data_dict[tag] = element.text
        else:
            # element not present in this file: store an empty string and count it
            data_dict[tag] = ''
            missing[tag] += 1
    rows.append(data_dict)

# build the dataframe in one go instead of concatenating inside the loop
df_news = pd.DataFrame(rows, columns=['mainText', 'title', 'veracity'])

print("missing elements: mainText/title/veracity", missing['mainText'], missing['title'], missing['veracity'])

# peek at the head of the dataframe and its shape
print("dataframe shape: ", df_news.shape)
df_news.head()

missing elements: mainText/title/veracity 0 0 0
dataframe shape:  (1627, 3)


Unnamed: 0,mainText,title,veracity
0,With the Hillary Clinton-Donald Trump debates ...,The Impact of Debates? It's Debatable,mostly true
1,As police today captured the man wanted for qu...,Details Emerge About NYC Bomb Suspect Ahmad Kh...,mostly true
2,One day after explosive devices were discovere...,Donald Trump Repeats Calls for Police Profilin...,mostly true
3,"Ahmad Khan Rahami, earlier named a person of i...","NY, NJ Bombings Suspect Charged With Attempted...",mostly true
4,Donald Trump's surrogates and leading supporte...,Trump Surrogates Push Narrative That Clinton S...,mostly true


#### Let's clean up the dataset

A quick look at the produced CSV (e.g. with "CSViewer") reveals the following:

- Some entries do not have a text and/or title. We want to drop those that do not have a text.
- Titles are very short in comparison to the texts, and both will be concatenated at a later point, so a missing title can be overlooked, while a missing text cannot.
- Very few texts are very short, and even those are still longer than a title.
- Some entries have "The document has moved here." as text and "Moved Permanently" as title, with a random veracity assigned. We want to drop those.

In [8]:
import numpy as np

print("dataframe shape before cleaning: ", df_news.shape)

# remove entries with an empty mainText
df_news['mainText'] = df_news['mainText'].replace('', np.nan)
df_news = df_news.dropna(subset=['mainText'])
print("dataframe shape after removing no-text entries: ", df_news.shape)


# remove entries with "The document has moved here." as text
df_news = df_news[df_news['mainText'] != 'The document has moved here.']
print("dataframe shape after 'The document has moved here.' entries", df_news.shape)


# remove entries with "Moved Permanently" as title
# (filter on the title column; rows with a missing title are kept)
df_news = df_news[df_news['title'] != 'Moved Permanently'].copy()
print("dataframe shape after 'Moved Permanently' entries", df_news.shape)

# Convert NaN titles to an empty string for later concatenation of title and text
# print(df_news['title'].isnull().values.any())
df_news['title'] = df_news['title'].fillna('')
# print(df_news['title'].isnull().values.any())

dataframe shape before cleaning:  (1627, 3)
dataframe shape after removing no-text entries:  (1604, 3)
dataframe shape after 'The document has moved here.' entries (1590, 3)
dataframe shape after 'Moved Permanently' entries (1590, 3)


#### Now, analogous to Reis et al.'s process in "Supervised Learning for Fake News Detection", do the following:

Quote: "we discarded stories labeled as 'non factual content' and merged those labeled as 'mostly false' and 'mixture of true and false' into a single class, henceforth referred as 'fake news'. The remaining stories correspond to the 'true' portion"


In [9]:
# Check the unique values before conversion
uniques = df_news['veracity'].unique()
print("unique values before", uniques)


# convert
df_news['veracity'] = df_news['veracity'].map({'mixture of true and false': 1, 'mostly false': 1, 'mostly true': 0})

# non factual content is now "nan", which can be used to discard these entries
df_news.dropna(subset=['veracity'], inplace=True)

# the veracity column is turned from float to int
df_news['veracity'] = df_news['veracity'].astype(int)



# Check the unique values after conversion
uniques = df_news['veracity'].unique()
print("unique values after", uniques)


unique values before ['mostly true' 'no factual content' 'mixture of true and false'
 'mostly false']
unique values after [0 1]


#### Lastly, save the dataframe as .csv for future use

In [10]:
# save for future use
df_news.to_csv('BuzzFeed-Webis.csv')

Just to keep in mind:

We now have a total of two datasets - one German and one English. Those are:

- german_datasets_merged.csv
- BuzzFeed-Webis.csv

## Step 2 - Performing the extensive (textual only) feature extraction of Reis et al.

#### 2.1 -  Language Features (Syntax)

The work of Reis et al. suggests a total of 31 features - these include:

Basic/traditional features:
1. mean words per sentence
2. mean syllables per sentence
3. mean characters per word
4. mean syllables per word

Basic/traditional formulas:
5. Flesch reading ease: 206.835 - 1.015(total_words/total_sentences) - 84.6(total_syllables/total_words)
6. Flesch-Kincaid grade level: 0.39(total_words/total_sentences) + 11.8(total_syllables/total_words) - 15.59


POS tagging:
7. POS tags: TreeTagger has high accuracy according to https://www.ltl.uni-due.de/wp-content/uploads/posTaggerEvaluation.pdf
    - 3 tags of word categories, e.g. noun, verb, adjective: relatively basic types (definitely not including all 36 tags of the Penn Treebank)
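Features 1 to 6 can be sketched in a few lines. This is a rough illustration, not the extraction used by Reis et al.: the sentence and word segmentation is naive, and `count_syllables` is a crude heuristic of my own (one syllable per run of vowels) standing in for a proper syllable counter. The two formulas use the constants exactly as listed above.

```python
import re


def count_syllables(word):
    # Crude heuristic (assumption): one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r'[aeiouyäöü]+', word.lower())))


def readability_features(text):
    # Naive segmentation (assumption): sentences end at '.', '!' or '?',
    # words are runs of letters.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'[^\W\d_]+', text)
    if not sentences or not words:
        raise ValueError('text must contain at least one sentence and one word')
    syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        'mean_words_per_sentence': words_per_sentence,
        'mean_syllables_per_sentence': syllables / len(sentences),
        'mean_chars_per_word': sum(len(w) for w in words) / len(words),
        'mean_syllables_per_word': syllables_per_word,
        'flesch_reading_ease': 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        'flesch_kincaid_grade': 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


features = readability_features('Politik ist ein schweres Feld.')
for name, value in features.items():
    print(f'{name}: {value:.2f}')
```

For real articles, the tokenizer and the syllable counter would have to be replaced by something more robust; the formulas themselves stay the same.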

#### POS-Tagging analogous to:
https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/

In [5]:
# Set up the corpus - We use the german TIGER Corpus
nltk.download('punkt')
corp = nltk.corpus.ConllCorpusReader('.', 'german_tiger_train.conll',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')

# prepare the training of a Tagger
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)

# set a split size: use 60% for training, 40% for testing
split_perc = 0.4
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

tagger = ClassifierBasedGermanTagger(train=train_sents)

accuracy = tagger.evaluate(test_sents)
print(accuracy)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fjun\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0.9374382861803976


In [7]:
print(len(train_sents))
print(len(test_sents))

23530
15686


In [26]:
# tiger_model.tag(nltk.word_tokenize("Politik ist ein schweres Feld."))
# tagger.tag(nltk.word_tokenize("Die Sondierungsgespräche zwischen Union und SPD endeten mit einem Kompromiss."))
tagger.tag(nltk.word_tokenize("Unser letzter Chat stammt aus der finalen Marathon-Verhandlungsrunde im Willy-Brandt-Haus, die angeblich 26 Stunden dauerte."))

[('Unser', 'PPOSAT'),
 ('letzter', 'ADJA'),
 ('Chat', 'FM'),
 ('stammt', 'VVFIN'),
 ('aus', 'APPR'),
 ('der', 'ART'),
 ('finalen', 'ADJA'),
 ('Marathon-Verhandlungsrunde', 'NN'),
 ('im', 'APPRART'),
 ('Willy-Brandt-Haus', 'NE'),
 (',', '$,'),
 ('die', 'PRELS'),
 ('angeblich', 'ADJD'),
 ('26', 'CARD'),
 ('Stunden', 'NN'),
 ('dauerte', 'VVFIN'),
 ('.', '$.')]
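The tagger returns fine-grained STTS tags, while the feature list only asks for a few basic word categories. One possible way to bridge the two is sketched below. The mapping in `coarse_category` is my own assumption (NN/NE as nouns, all tags starting with V as verbs, ADJA/ADJD as adjectives), and so is the choice to count punctuation tokens in the denominator.

```python
from collections import Counter

# Tagger output copied from the cell above (STTS tags)
tagged = [('Unser', 'PPOSAT'), ('letzter', 'ADJA'), ('Chat', 'FM'),
          ('stammt', 'VVFIN'), ('aus', 'APPR'), ('der', 'ART'),
          ('finalen', 'ADJA'), ('Marathon-Verhandlungsrunde', 'NN'),
          ('im', 'APPRART'), ('Willy-Brandt-Haus', 'NE'), (',', '$,'),
          ('die', 'PRELS'), ('angeblich', 'ADJD'), ('26', 'CARD'),
          ('Stunden', 'NN'), ('dauerte', 'VVFIN'), ('.', '$.')]


def coarse_category(tag):
    """Map an STTS tag to a basic word category (assumed mapping)."""
    if tag in ('NN', 'NE'):
        return 'noun'
    if tag.startswith('V'):
        return 'verb'
    if tag.startswith('ADJ'):
        return 'adjective'
    return 'other'


counts = Counter(coarse_category(tag) for _, tag in tagged)
n_tokens = len(tagged)

# Share of each basic category among all tokens of the sentence
ratios = {cat: counts[cat] / n_tokens for cat in ('noun', 'verb', 'adjective')}
print(counts)
print(ratios)
```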

In [None]:
# sample article text (German) for trying out the tagger
sample_text = """Die Sondierungsgespräche zwischen Union und SPD endeten mit einem Kompromiss, der nun insbesondere bei den Sozialdemokraten heftig diskutiert wird. Wie schon bei den Jamaika-Sondierungen (wir berichteten) wurden dem Postillon von einer anonymen Quelle die geheimen Chat-Protokolle aus der GroKo-WhatsApp-Gruppe der Parteispitzen zugespielt. Wir dokumentieren und kommentieren im Folgenden Leaks (#GroKoLeaks) aus den Sondierungsgesprächen, die tief blicken lassen: Die Gruppenerstellung: Schnell beginnt das harte Ringen um politische Positionen: Nach der zwischenzeitlichen Entfernung von Andrea Nahles kommt es jedoch nur selten zu Verhandlungserfolgen für die SPD: Doch auch kleine technikbedingte Pannen bleiben nicht aus (Andrea Nahles wurde inzwischen wieder hinzugefügt, musste jedoch versprechen, dass sie Beleidigungen und Drohungen auf ein Minimum reduziert):
Am vierten Tag der Sondierungen kam es zu diesem Überraschungsauftritt:

Es folgen weitere harte  Diskussionen:

Auch die WhatsApp-Leaks bei den Jamaika-Sondierungen werden angesprochen:

Unser letzter Chat stammt aus der finalen Marathon-Verhandlungsrunde im Willy-Brandt-Haus, die angeblich 26 Stunden dauerte: 
 Foto oben: dpa

Lesen Sie auch: 
Exklusiv! Das geheime WhatsApp-Chat-Protokoll der Jamaika-Sondierungsgespräche"""

In [None]:
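def lang_feats(_df):
    """Extract the language (syntax) features from _df. Not implemented yet."""
    raise NotImplementedError
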

#### 2.2 - Lexical Features

#### 2.3 -  Psycholinguistic Features

#### 2.4 - Semantic Features

#### 2.5 - Subjectivity

## Step 3 - Running the data on the two best classifiers (XGBoost and RFs) used in Reis et al.'s work

### Step 3.1 - Using stop words, stemming and  count vectorization

In [11]:
# Override CountVectorizer to integrate stemming
class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer, **kwargs):
        super(StemmedCountVectorizer, self).__init__(**kwargs)
        self.stemmer = stemmer

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))

    def get_params(self, deep=True):
        params = super().get_params(deep=deep)
        cp = copy.copy(self)
        cp.__class__ = CountVectorizer
        params.update(CountVectorizer.get_params(cp, deep))
        return params

### Step 3.1.1 - XGBoost

#### Step 3.1.1.1 - german dataset

In [15]:
# -*- coding: utf-8 -*-

"""FakeNewsClassifier Trainer

This script conducts the following steps (overview):

1. Load the processed dataset from {PROJECT_DIR}/data/processed as dataframe
2. Concatenate title and text in one column
3. Split the dataset into train and test

4. Set up a sklearn pipeline with preprocessor and classifier for the combined column (title+text)
   and perform hyperparam search 
5. Extract best score and best_params and log them
6. Predict on testset and log the score

7. Pickle best pipe to disk (final model)

"""

#%% Load and split dataset
INPUTFILE = 'german_datasets_merged.csv'
df = pd.read_csv(INPUTFILE, sep=',')
logger.info('Distribution of fake news in entire dataset: \n%s' % df.fake.value_counts(normalize=True))

X = df['title']+' '+df['text']
y = df['fake']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                    test_size=0.2,
                    stratify=y,
                    random_state=42)

# Download and init stopwordlist
nltk.download('stopwords')
STOPWORD_LIST = nltk.corpus.stopwords.words('german')

#%% Define param grid for tuning
param_grid = {'vectorizer__max_features':[500],
        'clf__n_estimators': [400],
        'clf__learning_rate': [0.9],
        'clf__max_depth': [5]}

# Constants for training
MODEL_PATH = 'model.pkl'
CV_SCORING = 'f1'
CV_FOLDS = 3

# %%
with dagshub.dagshub_logger() as dagslog:
        # Pipeline
        pipe = Pipeline(steps=[
        ('vectorizer', StemmedCountVectorizer(stop_words=STOPWORD_LIST,
                                   token_pattern=r'\b[a-zA-Z]{2,}\b',
                                   stemmer=GermanStemmer(ignore_stopwords=True),
                                   lowercase=True)),
#         ('clf', CatBoostClassifier(allow_writing_files=False)),
        ('clf', XGBClassifier())
            
        ])
        
        # pipe_title_tune.get_params().keys()
        logger.info('Start training estimator pipeline')
        cv_grid = GridSearchCV(pipe, param_grid, scoring=CV_SCORING, cv=CV_FOLDS, n_jobs=2)
        cv_grid.fit(X_train, y_train)
        #cv_grid.best_estimator_.named_steps['clf'].get_all_params()

        # Log and assign best score and params to var
        logger.info('Best params from GridSearchCV: %s' % str(cv_grid.best_params_))
        dagslog.log_hyperparams({'model_path': MODEL_PATH})
        dagslog.log_hyperparams({'cv_score': CV_SCORING})
        dagslog.log_hyperparams({'best_cv_params': cv_grid.best_params_})

        logger.info('Best %s from GridSearchCV: %s' % (CV_SCORING, str(cv_grid.best_score_)))
        dagslog.log_metrics({'best_cv_score': cv_grid.best_score_})
        
        # Predict on testdata
        y_pred = cv_grid.best_estimator_.predict(X_test)
        logger.info('f1_score for testdata: %s' % str(f1_score(y_test, y_pred)))
        dagslog.log_metrics({'f1_score_on_testdata': f1_score(y_test, y_pred)})
        logger.info('precision_score for testdata: %s' % str(precision_score(y_test, y_pred)))
        dagslog.log_metrics({'precision_score_on_testdata': precision_score(y_test, y_pred)})
        logger.info('recall_score for testdata: %s' % str(recall_score(y_test, y_pred)))
        dagslog.log_metrics({'recall_score_on_testdata': recall_score(y_test, y_pred)})
        logger.info('Classification report for testdata: \n%s' % classification_report(y_test, y_pred))
        logger.info('Confusion matrix for testdata: \n%s' % confusion_matrix(y_test, y_pred))

        # Dump model to disk
#         dump(cv_grid.best_estimator_, MODEL_PATH)

INFO:root:Distribution of fake news in entire dataset: 
0    0.922957
1    0.077043
Name: fake, dtype: float64
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fjun\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
INFO:root:Start training estimator pipeline




INFO:root:Best params from GridSearchCV: {'clf__learning_rate': 0.9, 'clf__max_depth': 5, 'clf__n_estimators': 400, 'vectorizer__max_features': 500}
INFO:root:Best f1 from GridSearchCV: 0.8651978211176682
INFO:root:f1_score for testdata: 0.8695652173913043
INFO:root:precision_score for testdata: 0.8820403825717322
INFO:root:recall_score for testdata: 0.8574380165289256
INFO:root:Classification report for testdata: 
              precision    recall  f1-score   support

           0       0.99      0.99      0.99     11599
           1       0.88      0.86      0.87       968

    accuracy                           0.98     12567
   macro avg       0.94      0.92      0.93     12567
weighted avg       0.98      0.98      0.98     12567

INFO:root:Confusion matrix for testdata: 
[[11488   111]
 [  138   830]]
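The logged precision, recall and f1 values for the fake class can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout, in which rows are the true classes and columns the predicted classes, both ordered as [0, 1].

```python
# Confusion matrix from the log above: [[tn, fp], [fn, tp]]
tn, fp = 11488, 111
fn, tp = 138, 830

precision = tp / (tp + fp)                           # 830 / 941
recall = tp / (tp + fn)                              # 830 / 968
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of both
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 12318 / 12567

print(f'precision: {precision:.4f}')
print(f'recall:    {recall:.4f}')
print(f'f1:        {f1:.4f}')
print(f'accuracy:  {accuracy:.4f}')
```

These values agree with the logged scores, which confirms that the reported f1 refers to the fake class (label 1) only and not to an average over both classes.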


#### Step 3.1.1.2 - english dataset

In [20]:
#%% Load and split dataset
INPUTFILE = 'BuzzFeed-Webis.csv'
df = pd.read_csv(INPUTFILE, sep=',', keep_default_na=False)
logger.info('Distribution of fake news in entire dataset: \n%s' % df.veracity.value_counts(normalize=True))

X = df['title']+' '+df['mainText']
y = df['veracity']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                    test_size=0.2,
                    stratify=y,
                    random_state=42)

# Download and init stopwordlist
nltk.download('stopwords')
STOPWORD_LIST = nltk.corpus.stopwords.words('english')

#%% Define param grid for tuning
param_grid = {'vectorizer__max_features':[500],
        'clf__n_estimators': [400],
        'clf__learning_rate': [0.9],
        'clf__max_depth': [5]}

# Constants for training
MODEL_PATH = 'model.pkl'
CV_SCORING = 'f1'
CV_FOLDS = 3

# %%
with dagshub.dagshub_logger() as dagslog:
        # Pipeline
        pipe = Pipeline(steps=[
        ('vectorizer', StemmedCountVectorizer(stop_words=STOPWORD_LIST,
                                   token_pattern=r'\b[a-zA-Z]{2,}\b',
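                                   # NOTE: GermanStemmer is applied to English text here;
                                   # nltk's EnglishStemmer would match the English stop word list
                                   stemmer=GermanStemmer(ignore_stopwords=True),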
                                   lowercase=True)),
#         ('clf', CatBoostClassifier(allow_writing_files=False)),
        ('clf', XGBClassifier())
            
        ])
        
        # pipe_title_tune.get_params().keys()
        logger.info('Start training estimator pipeline')
        cv_grid = GridSearchCV(pipe, param_grid, scoring=CV_SCORING, cv=CV_FOLDS, n_jobs=2)
        cv_grid.fit(X_train, y_train)
        #cv_grid.best_estimator_.named_steps['clf'].get_all_params()

        # Log and assign best score and params to var
        logger.info('Best params from GridSearchCV: %s' % str(cv_grid.best_params_))
        dagslog.log_hyperparams({'model_path': MODEL_PATH})
        dagslog.log_hyperparams({'cv_score': CV_SCORING})
        dagslog.log_hyperparams({'best_cv_params': cv_grid.best_params_})

        logger.info('Best %s from GridSearchCV: %s' % (CV_SCORING, str(cv_grid.best_score_)))
        dagslog.log_metrics({'best_cv_score': cv_grid.best_score_})
        
        # Predict on testdata
        y_pred = cv_grid.best_estimator_.predict(X_test)
        logger.info('f1_score for testdata: %s' % str(f1_score(y_test, y_pred)))
        dagslog.log_metrics({'f1_score_on_testdata': f1_score(y_test, y_pred)})
        logger.info('precision_score for testdata: %s' % str(precision_score(y_test, y_pred)))
        dagslog.log_metrics({'precision_score_on_testdata': precision_score(y_test, y_pred)})
        logger.info('recall_score for testdata: %s' % str(recall_score(y_test, y_pred)))
        dagslog.log_metrics({'recall_score_on_testdata': recall_score(y_test, y_pred)})
        logger.info('Classification report for testdata: \n%s' % classification_report(y_test, y_pred))
        logger.info('Confusion matrix for testdata: \n%s' % confusion_matrix(y_test, y_pred))

        # Dump model to disk
#         dump(cv_grid.best_estimator_, MODEL_PATH)

INFO:root:Distribution of fake news in entire dataset: 
0    0.812296
1    0.187704
Name: veracity, dtype: float64
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fjun\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
INFO:root:Start training estimator pipeline
INFO:root:Best params from GridSearchCV: {'clf__learning_rate': 0.9, 'clf__max_depth': 5, 'clf__n_estimators': 400, 'vectorizer__max_features': 500}




INFO:root:Best f1 from GridSearchCV: 0.4576704517233954
INFO:root:f1_score for testdata: 0.4077669902912621
INFO:root:precision_score for testdata: 0.45652173913043476
INFO:root:recall_score for testdata: 0.3684210526315789
INFO:root:Classification report for testdata: 
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       249
           1       0.46      0.37      0.41        57

    accuracy                           0.80       306
   macro avg       0.66      0.63      0.64       306
weighted avg       0.79      0.80      0.79       306

INFO:root:Confusion matrix for testdata: 
[[224  25]
 [ 36  21]]
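With only about 19% fake articles in this dataset, accuracy on its own says little. As a rough sanity check, the sketch below compares the accuracy from the confusion matrix above with a trivial baseline that always predicts the majority class. The baseline is my own addition for illustration and is not part of the original pipeline.

```python
# Confusion matrix from the log above: [[tn, fp], [fn, tp]]
tn, fp = 224, 25
fn, tp = 36, 21
total = tn + fp + fn + tp

# Accuracy of the trained classifier on the test split
accuracy = (tp + tn) / total            # 245 / 306

# Accuracy of always predicting class 0 (the majority class)
majority_baseline = (tn + fp) / total   # 249 / 306

print(f'classifier accuracy: {accuracy:.4f}')
print(f'majority baseline:   {majority_baseline:.4f}')
```

On this split the classifier's accuracy is slightly below the baseline, which is why the f1 score of the fake class is the more informative number here.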


### Step 3.1.2 - Random Forest

#### Step 3.1.2.1 - german dataset

In [21]:
#%% Load and split dataset
INPUTFILE = 'german_datasets_merged.csv'
df = pd.read_csv(INPUTFILE, sep=',')
logger.info('Distribution of fake news in entire dataset: \n%s' % df.fake.value_counts(normalize=True))

X = df['title']+' '+df['text']
y = df['fake']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                    test_size=0.2,
                    stratify=y,
                    random_state=42)

# Download and init stopwordlist
nltk.download('stopwords')
STOPWORD_LIST = nltk.corpus.stopwords.words('german')

#%% Define param grid for tuning
param_grid = {'vectorizer__max_features':[500],
        'clf__n_estimators': [400],
#         'clf__learning_rate': [0.9],
        'clf__max_depth': [25]}

# Constants for training
# MODEL_PATH = os.getenv('MODEL_PATH')
MODEL_PATH = 'model.pkl'
CV_SCORING = 'f1'
CV_FOLDS = 3

# %%
with dagshub.dagshub_logger() as dagslog:
        # Pipeline
        pipe = Pipeline(steps=[
        ('vectorizer', StemmedCountVectorizer(stop_words=STOPWORD_LIST,
                                   token_pattern=r'\b[a-zA-Z]{2,}\b',
                                   stemmer=GermanStemmer(ignore_stopwords=True),
                                   lowercase=True)),
        ('clf', RandomForestClassifier())
        ])
        
        # pipe_title_tune.get_params().keys()
        logger.info('Start training estimator pipeline')
        cv_grid = GridSearchCV(pipe, param_grid, scoring=CV_SCORING, cv=CV_FOLDS, n_jobs=2)
        cv_grid.fit(X_train, y_train)
        #cv_grid.best_estimator_.named_steps['clf'].get_all_params()

        # Log and assign best score and params to var
        logger.info('Best params from GridSearchCV: %s' % str(cv_grid.best_params_))
        dagslog.log_hyperparams({'model_path': MODEL_PATH})
        dagslog.log_hyperparams({'cv_score': CV_SCORING})
        dagslog.log_hyperparams({'best_cv_params': cv_grid.best_params_})

        logger.info('Best %s from GridSearchCV: %s' % (CV_SCORING, str(cv_grid.best_score_)))
        dagslog.log_metrics({'best_cv_score': cv_grid.best_score_})
        
        # Predict on testdata
        y_pred = cv_grid.best_estimator_.predict(X_test)
        logger.info('f1_score for testdata: %s' % str(f1_score(y_test, y_pred)))
        dagslog.log_metrics({'f1_score_on_testdata': f1_score(y_test, y_pred)})
        logger.info('precision_score for testdata: %s' % str(precision_score(y_test, y_pred)))
        dagslog.log_metrics({'precision_score_on_testdata': precision_score(y_test, y_pred)})
        logger.info('recall_score for testdata: %s' % str(recall_score(y_test, y_pred)))
        dagslog.log_metrics({'recall_score_on_testdata': recall_score(y_test, y_pred)})
        logger.info('Classification report for testdata: \n%s' % classification_report(y_test, y_pred))
        logger.info('Confusion matrix for testdata: \n%s' % confusion_matrix(y_test, y_pred))

        # Dump model to disk
#         dump(cv_grid.best_estimator_, MODEL_PATH)

INFO:root:Distribution of fake news in entire dataset: 
0    0.922957
1    0.077043
Name: fake, dtype: float64
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fjun\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
INFO:root:Start training estimator pipeline
INFO:root:Best params from GridSearchCV: {'clf__max_depth': 25, 'clf__n_estimators': 400, 'vectorizer__max_features': 500}
INFO:root:Best f1 from GridSearchCV: 0.7175119834421091
INFO:root:f1_score for testdata: 0.74
INFO:root:precision_score for testdata: 0.9367088607594937
INFO:root:recall_score for testdata: 0.6115702479338843
INFO:root:Classification report for testdata: 
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     11599
           1       0.94      0.61      0.74       968

    accuracy                           0.97     12567
   macro avg       0.95      0.80      0.86     12567
weighted avg       0.97      0.97  

#### Step 3.1.2.2 - english dataset

In [42]:
#%% Load and split dataset
INPUTFILE = 'BuzzFeed-Webis.csv'
df = pd.read_csv(INPUTFILE, sep=',', keep_default_na=False)
logger.info('Distribution of fake news in entire dataset: \n%s' % df.veracity.value_counts(normalize=True))

X = df['title']+' '+df['mainText']
y = df['veracity']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                    test_size=0.2,
                    stratify=y,
                    random_state=42)

# Download and init stopwordlist
nltk.download('stopwords')
STOPWORD_LIST = nltk.corpus.stopwords.words('english')

#%% Define param grid for tuning
param_grid = {'vectorizer__max_features':[200,500,700],
        'clf__n_estimators': [300, 400, 500],
#         'clf__learning_rate': [0.9],
        'clf__max_depth': [5, 10, 15]}

# Constants for training
MODEL_PATH = 'model.pkl'
CV_SCORING = 'f1'
CV_FOLDS = 3

# %%
with dagshub.dagshub_logger() as dagslog:
        # Pipeline
        pipe = Pipeline(steps=[
        ('vectorizer', StemmedCountVectorizer(stop_words=STOPWORD_LIST,
                                   token_pattern=r'\b[a-zA-Z]{2,}\b',
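                                   # NOTE: GermanStemmer is applied to English text here;
                                   # nltk's EnglishStemmer would match the English stop word list
                                   stemmer=GermanStemmer(ignore_stopwords=True),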
                                   lowercase=True)),
        ('clf', RandomForestClassifier())
        ])
        
        # pipe_title_tune.get_params().keys()
        logger.info('Start training estimator pipeline')
        cv_grid = GridSearchCV(pipe, param_grid, scoring=CV_SCORING, cv=CV_FOLDS, n_jobs=2)
        cv_grid.fit(X_train, y_train)
        #cv_grid.best_estimator_.named_steps['clf'].get_all_params()

        # Log and assign best score and params to var
        logger.info('Best params from GridSearchCV: %s' % str(cv_grid.best_params_))
        dagslog.log_hyperparams({'model_path': MODEL_PATH})
        dagslog.log_hyperparams({'cv_score': CV_SCORING})
        dagslog.log_hyperparams({'best_cv_params': cv_grid.best_params_})

        logger.info('Best %s from GridSearchCV: %s' % (CV_SCORING, str(cv_grid.best_score_)))
        dagslog.log_metrics({'best_cv_score': cv_grid.best_score_})
        
        # Predict on testdata
        y_pred = cv_grid.best_estimator_.predict(X_test)
        logger.info('f1_score for testdata: %s' % str(f1_score(y_test, y_pred)))
        dagslog.log_metrics({'f1_score_on_testdata': f1_score(y_test, y_pred)})
        logger.info('precision_score for testdata: %s' % str(precision_score(y_test, y_pred)))
        dagslog.log_metrics({'precision_score_on_testdata': precision_score(y_test, y_pred)})
        logger.info('recall_score for testdata: %s' % str(recall_score(y_test, y_pred)))
        dagslog.log_metrics({'recall_score_on_testdata': recall_score(y_test, y_pred)})
        logger.info('Classification report for testdata: \n%s' % classification_report(y_test, y_pred))
        logger.info('Confusion matrix for testdata: \n%s' % confusion_matrix(y_test, y_pred))

        # Dump model to disk
#         dump(cv_grid.best_estimator_, MODEL_PATH)

INFO:root:Distribution of fake news in entire dataset: 
0    0.812296
1    0.187704
Name: veracity, dtype: float64
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fjun\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
INFO:root:Start training estimator pipeline
INFO:root:Best params from GridSearchCV: {'clf__max_depth': 15, 'clf__n_estimators': 300, 'vectorizer__max_features': 200}
INFO:root:Best f1 from GridSearchCV: 0.19001471725410965
INFO:root:f1_score for testdata: 0.15384615384615383
INFO:root:precision_score for testdata: 0.625
INFO:root:recall_score for testdata: 0.08771929824561403
INFO:root:Classification report for testdata: 
              precision    recall  f1-score   support

           0       0.83      0.99      0.90       249
           1       0.62      0.09      0.15        57

    accuracy                           0.82       306
   macro avg       0.73      0.54      0.53       306
weighted avg       0.79    

### Step 3.2 - Using the extensive feature extraction as proposed by Reis et al.

### Step 3.2.1 - XGBoost

#### Step 3.2.1.1 - german dataset

#### Step 3.2.1.2 - english dataset

### Step 3.2.2 - Random Forest

#### Step 3.2.2.1 - german dataset

#### Step 3.2.2.2 - english dataset

## Step 4 - Overview of the results