# Data Wrangling Demo

For this project, we will generate a fully labeled dataset, with the data coming from two different sources:

- Data labeled **TRUE**: A selection of articles from the **Aylien** Dataset, retrieved from https://aylien.com/blog/free-coronavirus-news-dataset, including the following:
    - news articles from sources with the highest possible credibility rating according to *Media Bias/Fact Check* (https://mediabiasfactcheck.com/);
    - news articles from government sources (e.g., those whose source URL ends with ".gov");
    - news articles published by world-renowned universities or organizations (e.g., Harvard, WHO).

- Data labeled **FALSE**: Taken directly from the COVID-19 Fake News Infodemic Research Dataset (**CoVID19-FNIR Dataset**). The raw **FNIR** dataset was retrieved from https://ieee-dataport.org/open-access/covid-19-fake-news-infodemic-research-dataset-covid19-fnir-dataset, and we cleaned it manually before use. The manual steps were:
    1. filling in blank or incomplete entries according to the source URL;
    2. fully romanizing foreign names that contained undisplayable special characters;
    3. converting date strings to the ISO format;
    4. removing nondescriptive or vague news entries.

To maintain the same features across the two raw datasets, we have decided to keep the following features only:

- News Content
- Date of Publication

In [1]:
%load_ext autoreload
%autoreload 2

import json
import pickle
import random
import numpy as np
import pandas as pd

from datetime import date, datetime

# Custom imports
import sys
sys.path.append("../")

from config import config
# Helpers used below (explicit names instead of a star import)
from utils import get_date, get_source, label

## Preparing the **Aylien** Dataset

First, we will preprocess the **Aylien** dataset: we read the original JSONL file and draw a random sample of 50,000 articles from it, seeded for reproducibility.

In [None]:
# Dealing with JSONL dataset file
# BEGIN: CREDIT GOES TO https://galea.medium.com/how-to-love-jsonl-using-json-line-format-in-your-workflow-b6884f65175b

sample_size = 50000
aylien_list = []
random.seed(config["seed"])

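with open("../data/aylien_covid_news_data.jsonl", 'r', encoding='utf-8') as f:
    # First pass: count the lines so that we can sample line indices
    total_num_lines = sum(1 for line in f)
    # Store the sampled indices in a set so the membership test below is O(1)
    indices = set(random.sample(range(total_num_lines), sample_size))

    # Second pass: rewind and parse only the sampled lines
    f.seek(0)

    for i, line in enumerate(f):
        if i in indices:
            # json.loads ignores the trailing newline, so no stripping is needed
            aylien_list.append(json.loads(line))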

# END: CREDIT GOES TO https://galea.medium.com/how-to-love-jsonl-using-json-line-format-in-your-workflow-b6884f65175b
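
The cell above reads the file twice: once to count lines and once to collect the sampled ones. As an alternative, a uniform sample can be drawn in a single pass with reservoir sampling (Algorithm R). The sketch below is not what this notebook does; the function name `reservoir_sample_jsonl` and the in-memory demo data are assumptions made for illustration.

```python
import io
import json
import random


def reservoir_sample_jsonl(lines, k, seed=0):
    """Pick k JSONL records uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of non-blank lines read so far
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k records
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / seen
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = line
    # Parse only the records that survived
    return [json.loads(line) for line in reservoir]


# Demo on an in-memory stand-in for the JSONL file (hypothetical data)
demo_lines = io.StringIO("\n".join(json.dumps({"id": n}) for n in range(100)))
sample = reservoir_sample_jsonl(demo_lines, k=5, seed=42)
print(len(sample))
```

Unlike the two-pass version, the records come back in reservoir order rather than file order, and a different random stream is consumed, so the two approaches will not select the same articles for the same seed.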

In [None]:
with open("../data/raw/aylien_preprocessed", "wb") as aylien_preprocessed_file:
    pickle.dump(aylien_list, aylien_preprocessed_file)

Now, we will load the `pickle` object created in `aylien_preprocessing.ipynb`, which contains the preprocessed **Aylien** dataset we will be using.

In [21]:
# Open file
with open("../data/raw/aylien_preprocessed", "rb") as aylien_preprocessed_file:
    aylien = pd.DataFrame(pickle.load(aylien_preprocessed_file))

In [22]:
# Drop entries that have null values
aylien = aylien.dropna()

# Drop unnecessary columns (keyword form; the positional axis argument is deprecated)
aylien = aylien.drop(columns=['author', 'body', 'categories', 'characters_count', 'entities', 'hashtags', 'id', 'keywords', 'language',
                              'links', 'media', 'paragraphs_count', 'sentences_count', 'sentiment', 'social_shares_count', 'summary', 'words_count'])

In [23]:
# Rename Columns for Consistency
aylien = aylien.rename(columns = {"published_at": "date", "title": "content"})

In [24]:
# Process DATE
aylien["date"] = aylien["date"].apply(get_date)
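
`get_date` is imported from `utils` and its implementation is not shown here. Judging by the inspection output, it yields ISO `YYYY-MM-DD` strings. A minimal sketch of such a helper, assuming the input is a timestamp string that pandas can parse and that unparseable values should become missing (so the later `dropna()` removes them), might look like this. The name `get_date_sketch` is hypothetical.

```python
import pandas as pd


def get_date_sketch(value):
    """Return the calendar date of `value` as an ISO string, or None if unparseable."""
    # errors="coerce" turns unparseable input into NaT instead of raising
    ts = pd.to_datetime(value, errors="coerce")
    if pd.isna(ts):
        return None
    return ts.date().isoformat()


print(get_date_sketch("2020-04-05 00:00:26+00:00"))
print(get_date_sketch("not a date"))
```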

In [25]:
# Process SOURCE
aylien["source"] = aylien["source"].apply(get_source)
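
`get_source` also comes from `utils` and is not shown. One plausible reading is that it reduces each raw `source` entry to a single comparable value such as the publisher's domain. The sketch below assumes each entry is a dictionary with a `"domain"` key; that layout, and the name `get_source_sketch`, are assumptions rather than the project's actual code.

```python
def get_source_sketch(source):
    """Reduce a raw source record to a lowercase domain, or None if unavailable."""
    # Assumption: the raw entry is a dict with a "domain" key
    if not isinstance(source, dict):
        return None
    domain = source.get("domain")
    if not domain:
        return None
    domain = domain.strip().lower()
    # Drop a leading "www." so the same publisher always maps to one value
    return domain[4:] if domain.startswith("www.") else domain


print(get_source_sketch({"domain": "WWW.Example.gov", "name": "Example Agency"}))
```

Returning `None` for malformed entries lets the subsequent `dropna()` discard them.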

In [26]:
# Previous operations might have added some more NA values
aylien = aylien.dropna()

Now, we would like to partially label our **Aylien** dataset, using the criteria described at the beginning of the document.

In [27]:
# Label the Aylien dataset
aylien = label(aylien)
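
The `label` helper lives in `utils` and is not reproduced in this notebook. The sketch below illustrates the general idea under stated assumptions: the `source` column holds a domain string, a source counts as reliable if its domain ends with ".gov" or appears in an allow-list, and the `source` column is dropped afterwards. The allow-list contents and the name `label_sketch` are hypothetical; the real list would be built from the *Media Bias/Fact Check* ratings and the chosen universities and organizations.

```python
import pandas as pd

# Hypothetical allow-list standing in for the vetted sources
TRUSTED_DOMAINS = {"who.int", "harvard.edu"}


def label_sketch(df, trusted=TRUSTED_DOMAINS):
    """Add a `reliability` column (1 or missing) and drop `source`."""

    def is_reliable(domain):
        return domain.endswith(".gov") or domain in trusted

    out = df.copy()
    # dtype=object keeps 1 as an integer and leaves unlabeled rows as None
    out["reliability"] = pd.Series(
        [1 if is_reliable(d) else None for d in out["source"]],
        index=out.index,
        dtype=object,
    )
    return out.drop(columns="source")


demo = pd.DataFrame({
    "date": ["2020-04-05"] * 3,
    "content": ["a", "b", "c"],
    "source": ["cdc.gov", "example-blog.com", "who.int"],
})
labeled = label_sketch(demo)
print(labeled)
```

Only positive labels are assigned; rows that match none of the criteria stay unlabeled rather than being marked as false.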

Finally, we would like to split the **Aylien** dataset into labeled and unlabeled parts. The labeled part will later be combined with the **FNIR** dataset, while the unlabeled part will be saved for future purposes.

In [28]:
# Split labeled / unlabeled parts
aylien_true = aylien[aylien["reliability"] == 1]
aylien_unlabeled = aylien[aylien["reliability"] != 1]

# Save the labeled / unlabeled parts separately
with open("../data/processed/aylien_true", "wb") as aylien_true_file:
    pickle.dump(aylien_true, aylien_true_file)

aylien_true.to_csv("../data/processed/aylien_true.csv", index = False)

with open("../data/processed/aylien_unlabeled", "wb") as aylien_unlabeled_file:
    pickle.dump(aylien_unlabeled, aylien_unlabeled_file)

aylien_unlabeled.to_csv("../data/processed/aylien_unlabeled.csv", index = False)

In [29]:
# Inspect the aylien_true dataset
aylien_true.info()
aylien_true.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3043 entries, 55 to 49967
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         3043 non-null   object
 1   content      3043 non-null   object
 2   reliability  3043 non-null   object
dtypes: object(3)
memory usage: 95.1+ KB


| | date | content | reliability |
|---|---|---|---|
| 55 | 2020-04-05 | British postman delivers fancy dress joy to is... | 1 |
| 60 | 2020-04-05 | India asks state-run power producers to ensure... | 1 |
| 92 | 2020-04-05 | Africa could lose 20 mln jobs due to pandemic ... | 1 |
| 111 | 2020-04-05 | New York state reports 594 coronavirus deaths ... | 1 |
| 118 | 2020-04-05 | Elton John launches fund for HIV/AIDS work ami... | 1 |


In [12]:
# Inspect the aylien_unlabeled dataset
aylien_unlabeled.info()
aylien_unlabeled.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46941 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         46941 non-null  object
 1   content      46941 non-null  object
 2   reliability  0 non-null      object
dtypes: object(3)
memory usage: 1.4+ MB


| | date | content | reliability |
|---|---|---|---|
| 0 | 2020-04-05 | Year 12 could be extended into next year in th... | |
| 1 | 2020-04-05 | Coronavirus: Trump upbeat as New York reports ... | |
| 2 | 2020-04-05 | Mets might be MLB’s biggest loser in the coron... | |
| 3 | 2020-04-05 | Key Words: Bill Gates shares his optimistic ta... | |
| 4 | 2020-04-05 | Is NIO Inc. (NIO) A Good Stock To Buy? | |


## Preparing the **FNIR** Dataset

We will now perform similar operations on the **FNIR** dataset.

In [13]:
# Load data
fnir_fake = pd.read_csv("../data/raw/fakeNews.csv", encoding = "ISO-8859-1")

In [14]:
# Drop entries that have null values
fnir_fake = fnir_fake.dropna()

# Drop unnecessary columns
fnir_fake = fnir_fake.drop(columns=["Link", "Region", "Country", "Explanation", "Origin", "Origin_URL", "Fact_checked_by", "Poynter_Label"])

In [15]:
# Rename Columns for Consistency
fnir_fake = fnir_fake.rename(columns = {"Date Posted": "date", "Text": "content", "Binary Label": "reliability"})

In [16]:
# Process DATE
fnir_fake["date"] = fnir_fake["date"].apply(get_date)

In [None]:
# Save Datasets
with open("../data/processed/fnir_fake", "wb") as fnir_fake_file:
    pickle.dump(fnir_fake, fnir_fake_file)

fnir_fake.to_csv("../data/processed/fnir_fake.csv", index = False)

In [17]:
# Inspect the FNIR dataset
fnir_fake.info()
fnir_fake.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3610 entries, 0 to 3609
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         3610 non-null   object
 1   content      3610 non-null   object
 2   reliability  3610 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 84.7+ KB


| | date | content | reliability |
|---|---|---|---|
| 0 | 2020-02-07 | Tencent revealed the real number of deaths.\t\t | 0 |
| 1 | 2020-02-07 | Taking chlorine dioxide helps fight coronavir... | 0 |
| 2 | 2020-02-07 | This video shows workmen uncovering a bat-inf... | 0 |
| 3 | 2020-02-07 | The Asterix comic books and The Simpsons pred... | 0 |
| 4 | 2020-02-07 | Chinese President Xi Jinping visited a mosque... | 0 |


## Merging **Aylien** and **FNIR**

In [18]:
# Combine Datasets
final = pd.concat([aylien_true, fnir_fake], ignore_index = True)
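
The inspection outputs show `reliability` stored as `object` in `aylien_true` but as `int64` in `fnir_fake`, so the concatenated column ends up as `object`. A quick check after merging can normalize the dtype and confirm the class balance. The sketch below uses small stand-in frames (hypothetical data, with `_demo` names) in place of the real ones.

```python
import pandas as pd

# Stand-ins mirroring the dtypes seen above: object labels vs. int64 labels
aylien_true_demo = pd.DataFrame({
    "date": ["2020-04-05", "2020-04-05"],
    "content": ["true headline A", "true headline B"],
    "reliability": pd.Series([1, 1], dtype=object),
})
fnir_fake_demo = pd.DataFrame({
    "date": ["2020-02-07"] * 3,
    "content": ["false claim A", "false claim B", "false claim C"],
    "reliability": [0, 0, 0],
})

final_demo = pd.concat([aylien_true_demo, fnir_fake_demo], ignore_index=True)

# Cast the mixed-dtype label column to a single integer dtype
final_demo["reliability"] = final_demo["reliability"].astype(int)

# Count the rows per class (1 = TRUE, 0 = FALSE)
counts = final_demo["reliability"].value_counts().to_dict()
print(counts)
```

Applied to the real frames, the same counts should match the row totals reported by the two `info()` calls above.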

In [19]:
# Save Dataset
with open("../data/processed/final", "wb") as final_file:
    pickle.dump(final, final_file)

final.to_csv("../data/processed/final.csv", index = False)