## Introduction

For my contribution to the seminar, I presented the paper "Supervised Learning for Fake News Detection" by Reis et al. A written essay on the paper would mostly have restated its extensive procedures for extracting 141 textual features, 5 news source features and 21 environment features, so I decided to attempt to reproduce the work programmatically instead.

The main focus of this work is the extraction of the textual features. The reason is that the work should ultimately be usable for classifying German fake news, for which only two datasets currently exist, and neither provides the information required to extract news source features or environment features.

The goal of this work is therefore to see how far one can get using only the textual features. The results on the two German datasets mentioned above and on the dataset originally used by Reis et al. are compared with the results reported in their paper and with the performance of a more conventional classifier that uses simple count vectors for feature extraction.

For the conventional classifier, a German fake news classifier by Dominik Leuziger is used (https://dagshub.com/leudom/german-fake-news-classifier).

## Overview

#### Step 1 - Gathering data

#### Step 1.1 - German data

As mentioned, there are only two German datasets:

- GermanFakeNC: German Fake News Corpus (https://zenodo.org/record/3375714#.Ya8IHtDMKUk)
- Kaggle Starter: Fake News Dataset German 9cc110a2-9 (https://www.kaggle.com/kerneler/starter-fake-news-dataset-german-9cc110a2-9/data)

#### Step 1.1.1 - Scraping of GermanFakeNC articles

- Scraping and generation of the GermanFakeNC dataset

#### Step 1.1.2 - Preparation and merge of datasets

- Preparing and aligning the two German datasets
- Merging the datasets

#### Step 1.2 - English data

Ideally, we would use the original dataset from the work of Reis et al., "Supervised Learning for Fake News Detection", namely:

- BuzzFace: A News Veracity Dataset with Facebook User Commentary and Ego (https://metatext.io/datasets/buzzface)

As will become clear in the course of this work, acquiring that dataset in the form in which it was originally used is not possible. Hence another dataset is used, which is less extensive (and not only in size) but very similar in nature to the BuzzFace dataset:

- BuzzFeed-Webis Fake News Corpus 16 (https://webis.de/data/buzzfeed-webis-fake-news-16.html)

#### Step 1.2.1 - Attempts at generating the BuzzFace dataset

- Showcase of the failed attempt

#### Step 1.2.2 - Generating the BuzzFeed-Webis dataset

- Construction of the BuzzFeed-Webis dataset from XML files

#### Step 2 - Performing the extensive (textual only) feature extraction as proposed by Reis et al.

- Step-by-step reconstruction of the 141 textual features

#### Step 3 - Running the data (English and German) on the two best classifiers (XGBoost and RFs) used in Reis et al.'s work

- Using XGBoost and RFs to compare the performance on the merged German dataset with that on the English (BuzzFeed-Webis) dataset

#### Step 3.1 - Using stop words, stemming and count vectorization


#### Step 3.2 - Using the extensive feature extraction as proposed by Reis et al.


#### Step 4 - Comparison of the feature extraction methods



## Step 1 - Gathering data

### Step 1.1 - German data

### Step 1.1.1 - Scraping of GermanFakeNC articles

This step takes the GermanFakeNC dataset and crawls through its entries to retrieve the title and body from each sample's URL.

Credit goes to https://dagshub.com/leudom/german-fake-news-classifier/src/master/src/data/scrape_news.py

In [1]:
import logging

import pandas as pd
import py7zr
# Assumption: Article is the class from the newspaper3k package, which
# matches the download()/parse() calls used below.
from newspaper import Article

# Set up logging before anything is logged
log_fmt = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
logging.basicConfig(level=logging.INFO, format=log_fmt)
logger = logging.getLogger()


def extract_title(url):
    """Download the article at `url` and return its title ('No title' on failure)."""
    article = Article(url)
    try:
        article.download()
        logger.info('Article title downloaded from %s' % url)
        article.parse()
    except Exception:
        article.title = 'No title'

    return article.title


def extract_text(url):
    """Download the article at `url` and return its body ('No text' on failure)."""
    article = Article(url)
    try:
        article.download()
        logger.info('Article text downloaded from %s' % url)
        article.parse()
    except Exception:
        article.text = 'No text'

    return article.text


# Load .env file
# load_dotenv(find_dotenv())

# INPUTFILE = os.path.join(os.getenv('PROJECT_DIR'),
#                          'data',
#                          'raw',
#                          'GermanFakeNC.json')

df_scraped_file = 'df_GermanFakeNC.csv'

# Extract GermanFakeNC.json from the archive before reading it.
# The archive is opened with py7zr as before; this assumes that it is a
# 7z archive despite its .zip extension.
with py7zr.SevenZipFile('german_datasets.zip', 'r') as z:
    if 'GermanFakeNC.json' in z.getnames():
        z.extract(targets=['GermanFakeNC.json'])

df = pd.read_json('GermanFakeNC.json')
logger.info('Head of dataframe: \n%s' % df.head())

# %% We only take News with an overall rating of at least 0.5
# overall_rating_mask = df['Overall_Rating'] >= 0.5
# ratio_mask = df['Ratio_of_Fake_Statements'].isin([3, 4])
# df_fake = df[overall_rating_mask & ratio_mask].reset_index()

# Note: each URL is downloaded twice, once for the title and once for the text
df['titel'] = df['URL'].apply(extract_title)
df['text'] = df['URL'].apply(extract_text)

logger.info('Head of dataframe after parsing: \n%s' % df.head())

# Keep only rows for which both titel and text could be retrieved
has_info_mask = (df['titel'] != 'No title') & (df['text'] != 'No text')
df_final = df[has_info_mask]

logger.info('Shape of final dataframe: %s' % str(df_final.shape))
logger.info('dtypes: \n%s' % str(df_final.dtypes))
logger.info('Rows with null values: \n%s' % df_final.isnull().sum())

# Save as csv
try:
    df_final.to_csv(df_scraped_file, index=False)
    logger.info('CSV was saved to disk')
except Exception:
    # logger.exception already appends the traceback
    logger.exception("Couldn't save CSV to disk")

### Step 1.1.2 - Preparation and merge of datasets

In [None]:
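import logging
import os

import pandas as pd
from dotenv import find_dotenv, load_dotenv  # from the python-dotenv package

# Functions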


def prepare_news_csv(filepath):
    """ 
    1.) Drop columns -> Kategorie, Quelle, Art
    2.) Check for duplicate Titel and Body and drop the duplicates, keeping the first entry
    3.) Rename columns in order to match the other dataset (GermanFakeNC)
    4.) Add column src_name with the value news_csv to identify the source of a row after merging
    """

    # Read news.csv from disk
    _df = pd.read_csv(filepath)
    logger.debug(_df.info())
    # Drop cols
    logger.info('Null values in news.csv: \n%s' % _df.isnull().sum())
    cols_to_drop = ['Kategorie', 'Quelle', 'Art']
    _df.drop(cols_to_drop, axis=1, inplace=True)
    logger.info('Cols %s dropped' % cols_to_drop)


    # Drop duplicates
    logger.info('Percent duplicated Titel and Body: \n%s' % str(
        _df.duplicated(subset=['Titel', 'Body']).value_counts(normalize=True)))
    _df.drop_duplicates(subset=['Titel', 'Body'], inplace=True)
    logger.info('Duplicates in Titel and Body dropped')

    # Rename Cols
    new_cols = {'id': 'src_id',
                'Titel': 'title',
                'Body': 'text',
                'Datum': 'date',
                'Fake': 'fake'}
    _df.rename(columns=new_cols, inplace=True)
    logger.info('Cols renamed')

    # Add col source_name
    _df['src_name'] = 'news_csv'

    return _df


def prepare_germanfake(filepath):
    """ 
    1.) Drop columns -> [False_Statement_1_Location,
                         False_Statement_1_Index,
                         False_Statement_2_Location,
                         False_Statement_2_Index,
                         False_Statement_3_Location,
                         False_Statement_3_Index,
                         Ratio_of_Fake_Statements,
                         Overall_Rating]
        We treat all entries as fake news, even though there are some instances
        that have a very low overall fake rating!
    2.) Reset the index and use it as src_id
    3.) Check for duplicate titel and text and drop the duplicates, keeping the first entry
    4.) Drop rows where titel or text is null 
    5.) Fill Dates for missing values -> From the URL we can see that the Date could
        be 2017/12 
    6.) Rename Columns in order to match it with the other dataset (news.csv)
    7.) Add label col 'fake' = 1 -> all 1; col 'src_name' = 'GermanFakeNC'
    """

    # Read the scraped GermanFakeNC CSV from disk
    _df = pd.read_csv(filepath)

    logger.debug(_df.info())
    # Drop cols
    logger.info('Null values in GermanFakeNC_interim.csv: \n%s' % _df.isnull().sum())
    cols_to_drop = ['False_Statement_1_Location',
                    'False_Statement_1_Index',
                    'False_Statement_2_Location',
                    'False_Statement_2_Index',
                    'False_Statement_3_Location',
                    'False_Statement_3_Index',
                    'Ratio_of_Fake_Statements',
                    'Overall_Rating']
    _df.drop(cols_to_drop, axis=1, inplace=True)
    logger.info('Cols %s dropped' % cols_to_drop)

    # Set source_id
    _df.reset_index(inplace=True)
    logger.info('Index reset')
    
    # Drop duplicates
    logger.info('Percent duplicated titel and text: \n%s' % str(
        _df.duplicated(subset=['titel', 'text']).value_counts(normalize=True)))
    _df.drop_duplicates(subset=['titel', 'text'], inplace=True)
    logger.info('Duplicates in titel and text dropped')

    # Drop rows where titel or text is null
    _df.dropna(subset=['titel', 'text'], inplace=True)
    logger.info('Null rows for titel and text dropped')

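    # Fill the missing dates with December 2017, as suggested by the URLs.
    # Year, month and day are passed explicitly because the string
    # '01/12/2017' would be parsed month-first, i.e. as 12 January 2017.
    _df['Date'] = _df['Date'].fillna(pd.Timestamp(2017, 12, 1))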

    # Rename Cols
    new_cols = {'index': 'src_id',
                'titel': 'title',
                'Date': 'date',
                'URL': 'url'}
    _df.rename(columns=new_cols, inplace=True)
    logger.info('Cols renamed')

    # Add col source_name
    _df['fake'] = 1
    _df['src_name'] = 'GermanFakeNC'

    return _df


def merge_datasets(df_1, df_2):
    logger.info('Shape: %s\n Columns: %s' % (df_1.shape, df_1.columns))
    logger.info('Shape: %s\n Columns: %s' % (df_2.shape, df_2.columns))
    # Check col names
    sym_diff = set(df_1).symmetric_difference(set(df_2))
    assert len(sym_diff) == 0 , 'Differences in colnames of the two datasets'
    return pd.concat([df_1, df_2], axis=0, ignore_index=True)



log_fmt = '%(asctime)s - %(name)s - %(levelname)s : %(message)s'
logging.basicConfig(level=logging.INFO, format=log_fmt)
logger = logging.getLogger()

# find .env automagically by walking up directories until it's found, then
# load up the .env entries as environment variables
load_dotenv(find_dotenv())

NEWS_CSV = os.path.join('news.csv')
GERMAN_FAKE_NC = os.path.join('df_GermanFakeNC.csv')
OUTPUT = os.path.join('datasets_merged.csv')

df_news = prepare_news_csv(NEWS_CSV)
df_gfn = prepare_germanfake(GERMAN_FAKE_NC)
df_merged = merge_datasets(df_news, df_gfn)

try:
    df_merged.to_csv(OUTPUT, sep=';', index=False)
    logger.info('Final dataset prepared and saved to %s' % OUTPUT)
except Exception:
    logger.exception('File could not be saved to disk')

### Step 1.2 - English data

### Step 1.2.1 - Attempts at generating the BuzzFace dataset

### Step 1.2.2 - Generating the BuzzFeed-Webis dataset
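
As a rough illustration of how a dataset could be built from a folder of per-article XML files, the following sketch collects a few text fields from each file into a list of records. The element names (`title`, `mainText`, `veracity`), the helper names and the example file are assumptions made for this sketch and need to be checked against the actual corpus files.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed element names; adjust them to the actual corpus schema.
FIELDS = ("title", "mainText", "veracity")


def parse_article(xml_path):
    """Read one article XML file into a flat dict of text fields."""
    root = ET.parse(xml_path).getroot()
    record = {"file": Path(xml_path).name}
    for field in FIELDS:
        # findtext falls back to the default if the element is missing
        record[field] = root.findtext(field, default="").strip()
    return record


def load_corpus(directory):
    """Parse every *.xml file in `directory`, sorted by file name."""
    return [parse_article(p) for p in sorted(Path(directory).glob("*.xml"))]


# Tiny self-made example file, not taken from the corpus.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0001.xml").write_text(
        "<article><title>Example</title>"
        "<mainText>Some body text.</mainText>"
        "<veracity>mostly true</veracity></article>",
        encoding="utf-8",
    )
    records = load_corpus(tmp)

print(records)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(records)` to obtain a table with one row per article.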

## Step 2 - Performing the extensive (textual only) feature extraction of Reis et al.
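
To give an idea of what a hand-crafted textual feature looks like, here is a minimal sketch that computes a handful of simple counts from a raw text. It is not the set of 141 features from the paper; the function name, the selection of features and the English first-person word list are assumptions chosen only for illustration.

```python
import re

# Assumed, deliberately short English word list.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def lexical_features(text):
    """Return a few simple surface-level counts for one text."""
    # Runs of letters (Unicode-aware); apostrophes and digits split words.
    words = re.findall(r"[^\W\d_]+", text)
    # Very rough sentence split on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "n_chars": len(text),
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_word_len": sum(map(len, words)) / n_words if n_words else 0.0,
        "n_first_person": sum(w.lower() in FIRST_PERSON for w in words),
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
    }


features = lexical_features("We saw it. Is it true? I doubt it!")
print(features)
```

Each text yields one dict, so applying the function to every article produces one feature row per sample that can be fed to a classifier.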

## Step 3 - Running the data on the two best classifiers (XGBoost and RFs) used in Reis et al.'s work

### Step 3.1 - Using stop words, stemming and count vectorization
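
The idea behind this baseline can be sketched in plain Python: remove stop words, reduce the remaining tokens to a stem, and count how often each stem occurs per document. The stop word list and the suffix-stripping "stemmer" below are toy assumptions for illustration only; in practice, library implementations such as NLTK's `SnowballStemmer` and scikit-learn's `CountVectorizer` would typically be used instead.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "are", "and", "of"}  # tiny assumed list
SUFFIXES = ("ing", "ed", "s")  # naive stand-in for a real stemmer


def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words and stem the rest."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def count_vectors(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["The president is lying", "Presidents lied and the media reported"]
vocabulary, vectors = count_vectors(docs)
print(vocabulary)  # ['lied', 'lying', 'media', 'president', 'report']
print(vectors)     # [[0, 1, 0, 1, 0], [1, 0, 1, 1, 1]]
```

Note that the naive stemmer maps "Presidents" and "president" to the same term but leaves "lied" and "lying" apart, which is one reason to prefer a proper stemmer.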

### Step 3.2 - Using the extensive feature extraction as proposed by Reis et al.

## Step 4 - Comparison of the feature extraction methods

## Step 2
#### Performing the extensive (textual only) feature extraction as proposed by Reis et al.

Language features