The following is copied from the dataset's README:

BuzzFeed-Webis Fake News Corpus 2016
====================================
The corpus comprises the output of nine publishers in a week close to the US elections. Among the selected publishers are six prolific hyperpartisan ones (three left-wing and three right-wing) and three mainstream publishers (see Table 1). All publishers earned Facebook’s blue checkmark, indicating authenticity and an elevated status within the network. For seven weekdays (September 19 to 23 and September 26 and 27), every post and linked news article of the nine publishers was fact-checked by professional journalists at BuzzFeed. In total, 1,627 articles were checked: 826 mainstream, 256 left-wing, and 545 right-wing. The imbalance between categories results from differing publication frequencies.


The corpus comes with the following files:

##### README.txt

This file.

##### web-archives/*.warc

The web archive files that contain the HTTP messages that were sent and received during the crawl.

##### articles/*.xml 

The articles extracted from the web archive files in XML format with annotations.

##### schema.xsd

Schema of the article files with explanations of the used XML tags. Can be used with object binding libraries (like JAXB) to parse the XML.

##### overview.csv

Gives the portal, orientation, veracity, and URL for each article. The same data is also contained in the XML files.



### Here we will only use the article XML files, as they contain mainText, title, and veracity.

#### Let's have a quick look at the basic functionality of ElementTree using one of the XML files

Analogous to: https://towardsdatascience.com/extracting-information-from-xml-files-into-a-pandas-dataframe-11f32883ce45


In [1]:
import os
import xml.etree.ElementTree as ET
path = 'articles/'
files = os.listdir(path)
print(len(files))

In [2]:
file_path_file1 = os.path.join(path, files[0])
tree = ET.parse(file_path_file1)
root = tree.getroot()
print(root.tag, root.attrib)
for child in root:     
    print(child.tag, child.attrib)
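
The lookup pattern used in the rest of this notebook can be tried on a small inline document first. The snippet below is only a sketch: the root tag `article`, the sample content, and the helper `text_of` are assumptions made for illustration, and the real files contain more tags. It shows that `find` returns `None` for a missing tag, while a tag that is present but empty has `.text` equal to `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened article (not taken from the corpus).
sample_xml = """<article>
  <title>Example title</title>
  <mainText></mainText>
  <veracity>mostly true</veracity>
</article>"""

root = ET.fromstring(sample_xml)


def text_of(parent, tag):
    """Return the text of the first child named `tag`, or '' if it is absent or empty."""
    element = parent.find(tag)
    if element is None or element.text is None:
        return ''
    return element.text


title = text_of(root, 'title')
main_text = text_of(root, 'mainText')  # tag present but empty
author = text_of(root, 'author')       # tag absent in this sample
print(repr(title), repr(main_text), repr(author))
```

Treating both cases the same way means an empty element does not end up as `None` in the resulting table.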

#### Let's generate a dataframe from the XML files

In [3]:
import pandas as pd

rows = []

# keep track of missing elements in the xml trees, counted across all files
mainText_missing = 0
title_missing = 0
veracity_missing = 0

for file in files:
    file_path = os.path.join(path, file)
    #print('Processing....'+file_path)
    tree = ET.parse(file_path)
    root = tree.getroot()

    data_dict = {}

    if root.find('mainText') is not None:
        data_dict['mainText'] = root.find('mainText').text
    else:
        data_dict['mainText'] = ''
        mainText_missing += 1

    if root.find('title') is not None:
        data_dict['title'] = root.find('title').text
    else:
        data_dict['title'] = ''
        title_missing += 1

    if root.find('veracity') is not None:
        data_dict['veracity'] = root.find('veracity').text
    else:
        data_dict['veracity'] = ''
        veracity_missing += 1

    rows.append(data_dict)

# build the dataframe once instead of concatenating inside the loop
df_news = pd.DataFrame(rows)

print("missing elements: mainText/title/veracity", mainText_missing, title_missing, veracity_missing)

# peek at the head of the dataframe and its shape
print("dataframe shape: ", df_news.shape)
df_news.head()

missing elements: mainText/title/veracity 0 0 0
dataframe shape:  (1627, 3)


|   | mainText | title | veracity |
|---|----------|-------|----------|
| 0 | With the Hillary Clinton-Donald Trump debates ... | The Impact of Debates? It's Debatable | mostly true |
| 1 | As police today captured the man wanted for qu... | Details Emerge About NYC Bomb Suspect Ahmad Kh... | mostly true |
| 2 | One day after explosive devices were discovere... | Donald Trump Repeats Calls for Police Profilin... | mostly true |
| 3 | Ahmad Khan Rahami, earlier named a person of i... | NY, NJ Bombings Suspect Charged With Attempted... | mostly true |
| 4 | Donald Trump's surrogates and leading supporte... | Trump Surrogates Push Narrative That Clinton S... | mostly true |


#### Let's clean up the dataset

A quick look at the produced CSV (e.g. with "CSViewer") shows the following:

- Some entries do not have a text and/or title. We want to drop those that do not have a text.
- Titles are very short in comparison to the texts, and both will be concatenated at a later point, so a missing title can be overlooked, while a missing text cannot.
- Only very few texts are very short, and even those are still longer than a title.
- Some entries have "The document has moved here." as text and "Moved Permanently" as title, with a random veracity assigned. We want to drop those.
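
The rules above can be checked on a toy table before touching the real data. This is a minimal sketch with made-up rows; the names `toy`, `is_empty`, `is_redirect`, and `cleaned` are hypothetical and do not appear elsewhere in the notebook. It uses boolean masks rather than the replace-and-drop approach of the cell below, which should give the same rows under these assumptions.

```python
import pandas as pd

# Made-up rows covering each case from the list above.
toy = pd.DataFrame({
    'mainText': ['A real article body.', '', 'The document has moved here.', 'Another body.'],
    'title': ['Some title', 'Orphan title', 'Moved Permanently', None],
    'veracity': ['mostly true', 'mostly false', 'mostly true', 'mostly true'],
})

# Rows without any text.
is_empty = toy['mainText'].isna() | (toy['mainText'] == '')

# Redirect pages, recognised by either their text or their title.
is_redirect = (
    (toy['mainText'] == 'The document has moved here.')
    | (toy['title'] == 'Moved Permanently')
)

cleaned = toy[~(is_empty | is_redirect)].copy()

# A missing title is tolerated and replaced by an empty string.
cleaned['title'] = cleaned['title'].fillna('')
print(cleaned)
```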

In [4]:
import numpy as np

print("dataframe shape before cleaning: ", df_news.shape)

# remove entries with an empty mainText
df_news['mainText'] = df_news['mainText'].replace('', np.nan)
df_news.dropna(subset=['mainText'], inplace=True)
print("dataframe shape after removing no-text entries: ", df_news.shape)


# remove entries with "The document has moved here." as text
df_news['mainText'] = df_news['mainText'].replace('The document has moved here.', np.nan)
df_news.dropna(subset=['mainText'], inplace=True)
print("dataframe shape after 'The document has moved here.' entries", df_news.shape)


# remove entries with "Moved Permanently" as title (filter on the title column, not on mainText)
df_news.drop(df_news[df_news['title'] == 'Moved Permanently'].index, inplace=True)
print("dataframe shape after 'Moved Permanently' entries", df_news.shape)

# Convert NaN titles to an empty string for later concatenation of title and text
print(df_news['title'].isnull().values.any())
df_news[['title']] = df_news[['title']].fillna('')
print(df_news['title'].isnull().values.any())

dataframe shape before cleaning:  (1627, 3)
dataframe shape after removing no-text entries:  (1604, 3)
dataframe shape after 'The document has moved here.' entries (1590, 3)
dataframe shape after 'Moved Permanently' entries (1590, 3)
True
False
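
Why the empty-string fill matters for the later concatenation can be seen on two made-up rows. The sketch below assumes that title and text will be joined with a single space, which the notebook does not specify; `toy`, `with_nan`, and `filled` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A title', None],
    'mainText': ['Body one.', 'Body two.'],
})

# Without filling, a missing title makes the whole concatenated value missing.
with_nan = toy['title'].str.cat(toy['mainText'], sep=' ')

# After filling, the text survives even when the title is absent.
filled = toy['title'].fillna('').str.cat(toy['mainText'], sep=' ')

print(with_nan.tolist())
print(filled.tolist())
```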


#### Now, analogous to Reis et al.'s process in "Supervised Learning for Fake News Detection", do the following:

Quote: "we discarded stories labeled as 'non factual content' and merged those labeled as 'mostly false' and 'mixture of true and false' into a single class, henceforth referred as 'fake news'. The remaining stories correspond to the 'true' portion"
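
The merging step can be illustrated on the four label strings alone. This is a small sketch with hypothetical variable names (`labels`, `label_map`, `mapped`, `binary`); it relies on the fact that `Series.map` with a dictionary turns every value missing from the dictionary into NaN, which is what allows the unmapped class to be dropped afterwards.

```python
import pandas as pd

labels = pd.Series([
    'mostly true',
    'no factual content',
    'mixture of true and false',
    'mostly false',
])

# 1 = "fake news", 0 = "true"; 'no factual content' is deliberately left out.
label_map = {'mixture of true and false': 1, 'mostly false': 1, 'mostly true': 0}

mapped = labels.map(label_map)          # unmapped labels become NaN (float column)
binary = mapped.dropna().astype(int)    # discard them, then convert back to int

print(mapped.tolist())
print(binary.tolist())
```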


In [5]:
# Check the unique values before conversion
uniques = df_news['veracity'].unique()
print("unique values before", uniques)


# convert
df_news['veracity'] = df_news['veracity'].map({'mixture of true and false': 1, 'mostly false': 1, 'mostly true': 0})

# 'no factual content' is not in the mapping and is now NaN, which can be used to discard these entries
df_news.dropna(subset=['veracity'], inplace=True)

# the veracity column is turned from float to int
df_news['veracity'] = df_news['veracity'].astype(int)



# Check the unique values after conversion
uniques = df_news['veracity'].unique()
print("unique values after", uniques)


unique values before ['mostly true' 'no factual content' 'mixture of true and false'
 'mostly false']
unique values after [0 1]


#### Lastly, save the dataframe as .csv for future use

In [6]:
# save for future use
df_news.to_csv('BuzzFeed-Webis.csv')
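
Two details are worth keeping in mind when this file is read back later. The sketch below uses an in-memory string instead of a file and made-up rows; all variable names are hypothetical. It shows that the saved index comes back as a column named `Unnamed: 0` unless `index_col=0` is passed, and that empty-string titles are read back as NaN by default.

```python
import io
import pandas as pd

# Made-up rows with a non-contiguous index, as left behind by dropping rows.
toy = pd.DataFrame({'title': ['A', ''], 'veracity': [0, 1]}, index=[3, 7])

csv_text = toy.to_csv()  # same default as in the cell above: the index is written

# Default read: the index becomes an extra column, and '' becomes NaN.
reloaded = pd.read_csv(io.StringIO(csv_text))

# Restoring the index and keeping empty strings as strings.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)

print(reloaded.columns.tolist())
print(restored)
```

Whether to restore the index or write the file with `index=False` in the first place depends on how the later notebooks consume it.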

In [7]:

print(df_news['title'].isnull().values.any())
print(df_news['mainText'].isnull().values.any())
print(df_news['veracity'].isnull().values.any())

False
False
False
